0 and j = λ(q) − 1 and t = right then also add (q, 0, , right, ε) to Sm+1. [1.2] If i = m then [1.2.1] if j > 0 then for each (q', j', k', right, β) ∈ Sk such that rightj'+1(q') = left(q) and j' = λ(q') − 1, for each rule ξ such that left(q') = left(ξ) and right(q') = ρX and right(ξ) = ρY, for each rule r such that left(r) = Y, add
56
M. CORI, M. DE FORNEL & J.-M. MARANDIN
[2] Scan: if t ≠ cut then: [2.1] If i ≠ m and i < m + p + 1 and j < λ(q) and rightj+1(q) ∈ cat(ui+1) then [2.1.1] add (q, j + 1, k, t, α) to Si+1. [2.1.2] If i = m − 1 and t = right and j = λ(q) − 1 and j ≠ 0 then for each rule ξ such that left(q) = left(ξ) and right(q) = ρX and right(ξ) = ρY, if Y ∈ cat(um+2) then add (ξ, j + 1, k, t, α) to Sm+2. [2.2] If i = m and t = right and j = λ(q) − 1 and j ≠ 0 and rightj+1(q) ∈ cat(um+2) then add (q, j + 1, k, t, α) to Sm+2. [2.3] If i = m and j ≠ 0 then for each (q', j', k', right, β) ∈ Sk such that rightj'+1(q') = left(q) and j' = λ(q') − 1, for each rule ξ such that left(q') = left(ξ) and right(q') = ρX and right(ξ) = ρY, if Y ∈ cat(um+2) then add (ξ, j' + 1, k', right, β) to Sm+2.
[3] Complete: [3.1] If j = λ(q) and t ≠ cut then for each (q', j', k', t', β) ∈ Sk such that rightj'+1(q') = left(q), add (q', j' + 1, k', t', βqα) to Si. [3.2] If i = m and λ(q) > j > 0 then for each (q', j', k', t', β) ∈ Sk such that rightj'+1(q') = left(q), [3.2.1] if t ≠ cut and rightj(q) ∈ VT then add (q', j' + 1, k', cut, βq[j]α) to Sm. [3.2.2] If t = cut then add (q', j' + 1, k', cut, βqα) to Sm.
If (0, 1, 0, right, α) belongs to Sm+1+p then a new tree is given as α. If (0, j, 0, cut, β) belongs to Sm then an interrupted tree is given as β. The tree β represents the substring u1u2...um. Note that if the grammar is left-recursive, it may happen that there are an infinite number of interrupted trees.
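The repair extension above is built over Earley's (1970) algorithm. For reference, a minimal sketch of the standard recognizer (without the repair states, the cut value, or tree construction) might look like the following; the toy grammar, category table and sentence are illustrative and not taken from the paper.

```python
# Minimal Earley recognizer: items are (lhs, rhs, dot, origin), held in sets S[0..n].
GRAMMAR = {                       # hypothetical toy grammar
    "S":  [["NP", "VP"]],
    "NP": [["det", "n"]],
    "VP": [["v", "NP"]],
}

def earley_recognize(words, cats, start="S"):
    n = len(words)
    S = [set() for _ in range(n + 1)]
    for rhs in GRAMMAR[start]:
        S[0].add((start, tuple(rhs), 0, 0))
    for i in range(n + 1):
        agenda = list(S[i])
        while agenda:
            lhs, rhs, dot, k = agenda.pop()
            if dot < len(rhs):
                nxt = rhs[dot]
                if nxt in GRAMMAR:                      # Predict
                    for alt in GRAMMAR[nxt]:
                        item = (nxt, tuple(alt), 0, i)
                        if item not in S[i]:
                            S[i].add(item); agenda.append(item)
                elif i < n and nxt in cats[words[i]]:   # Scan
                    S[i + 1].add((lhs, rhs, dot + 1, k))
            else:                                       # Complete
                for lhs2, rhs2, dot2, k2 in list(S[k]):
                    if dot2 < len(rhs2) and rhs2[dot2] == lhs:
                        item = (lhs2, rhs2, dot2 + 1, k2)
                        if item not in S[i]:
                            S[i].add(item); agenda.append(item)
    return any(item == (start, tuple(rhs), len(rhs), 0)
               for rhs in GRAMMAR[start]
               for item in S[n])

cats = {"the": {"det"}, "dog": {"n"}, "saw": {"v"}, "cat": {"n"}}
print(earley_recognize(["the", "dog", "saw", "the", "cat"], cats))  # True
```

The repair algorithm of the paper adds to this skeleton the fourth and fifth item components (t and the tree α), the special position m of the interruption, and steps [1.2], [2.2], [2.3] and [3.2] above.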
PARSING REPAIRS

5.4 Remarks
The algorithm can be easily extended to handle cascaded repairs. A simpler version13 should be used to provide the inputs for the understanding component: it only yields the repaired trees. The repaired trees are the relevant inputs for building discourse units, i.e., sentential turn-constructional units in conversation (see Schegloff 1979). For example, they allow the possible completion of the current turn and make transition to the next turn possible. 6
Conclusion
The main result of the study is the following: even though the configuration "interrupted utterance + repair(s)" does not belong to the syntactic repertoire of French, it is submitted to a syntactic well-formedness condition. The REP is a simple and unified account of the regularity of self-repair. It comes into line with Schegloff's observation (1979:277): "the effect [of successful repair] is the resumption of the turn-unit before the repair initiation or, if the repair operation involves reconstruction of the whole turn-unit, production of the turn-unit to completion". It gives a more adequate content to Levelt's claim: "speakers repair in a linguistically principled way". Thanks to the augmentation of the Earley algorithm, we claim that parsers can parse repairs in a syntactically principled way.

REFERENCES

Blanche-Benveniste, Claire. 1987. "Syntaxe, choix de lexique, et lieux de bafouillage". DRLAV 36-37.123-157. Paris: Université de Paris-VIII.

Cori, Marcel, Michel de Fornel & J.-M. Marandin. 1995. "Analyse syntaxique de l'auto-réparation". Colloque 'Le traitement automatique du langage naturel', 209-219. Marseille.

De Smedt, Koenraad & Gerard Kempen. 1987. "Incremental sentence production, self-correction and coordination". Natural Language Generation ed. by G. Kempen, 365-376. Dordrecht: Kluwer Academic.

Dowty, David. 1988. "Type raising, functional composition and non-constituent conjunction". Categorial Grammars and Natural Language Structures ed. by Richard Oehrle et al., 153-197. Dordrecht: D. Reidel.

13
Without cut as a value of type in the definition of a state and without [3.2] in the definition of Complete.
Earley, Jay. 1970. "An Efficient Context-Free Parsing Algorithm". Communications of the Association for Computing Machinery 13:2.94-102.

Fornel, Michel de. 1992a. "The Return Gesture: Some Remarks on Context, Inference, and Iconic Gesture". The Contextualization of Language ed. by P. Auer & A. Di Luzio, 159-176. Amsterdam: John Benjamins.

Fornel, Michel de. 1992b. "De la pertinence du geste dans les séquences de réparation". Les formes de la conversation ed. by Conein et al., 119-154. Paris: CNET/CNRS.

Fornel, Michel de & J.-M. Marandin. Forthcoming. "L'analyse grammaticale des auto-réparations".

Frederking, Robert. 1988. Integrated Natural Language Dialog. Dordrecht: Kluwer.

Fromkin, Victoria. 1973. Speech Errors as Linguistic Evidence. The Hague: Mouton.

Gardent, Claire. 1991. Gapping and VP Ellipsis in a Unification-Based Grammar. Ph.D. dissertation, University of Edinburgh, Edinburgh, Scotland.

Hankamer, Jorge. 1973. "Unacceptable Ambiguity". Linguistic Inquiry 4.17-68.

Hindle, Donald. 1983. "Deterministic Parsing of Syntactic Non-fluencies". Proceedings of the 21st Meeting of the Association for Computational Linguistics (ACL'83), 123-128.

Kay, Martin, M. Gawron & P. Norvig. 1993. Verbmobil: A Translation System for Face-to-Face Dialog (= CSLI Lecture Notes, 33). Chicago: Chicago University Press.

Kroch, Anthony & D. Hindle. 1982. "On the Linguistic Character of Non-Standard Input". Proceedings of the 20th Meeting of the Association for Computational Linguistics (ACL'82), 161-163.

Levelt, William. 1983. "Monitoring and Self-Repair in Speech". Cognition 14.41-104.

Levelt, William. 1989. Speaking: From Intention to Articulation. Cambridge, Mass.: MIT Press.

Prüst, Hub. 1993. On Discourse Structuring, VP Anaphora and Gapping. Ph.D. dissertation, University of Amsterdam.
Sag, Ivan, Th. Wasow, G. Gazdar & S. Weisler. 1985. "Coordination and how to distinguish categories". Natural Language and Linguistic Theory 3:2.117-171.

Schegloff, Emanuel. 1979. "The relevance of repair to syntax-for-conversation". Syntax and Semantics 12 ed. by T. Givón, 261-286. New York: Academic Press.

Schegloff, Emanuel, G. Jefferson & H. Sacks. 1977. "The preference for self-correction in the organisation of repair in conversation". Language 53:2.361-382.
Parsing for Targeted Errors in Controlled Languages

MATTHEW F. HURST
University of Edinburgh

Abstract

The use of Controlled Languages in technical documentation is becoming a large concern for many organisations. Authoring texts which conform to these specifications is a problematic process. Technological support for the writing process may offer a number of aids, including style, or grammar, checkers. The ability to recognise variations to the prescribed grammar is at the heart of such systems. This paper presents a variation on the chart parsing method which encodes the grammar as finite state automata productions instead of a linear description of constituents. The system allows the grammar writer to define a number of variations to a grammar rule which are represented as transformations to the automata. 1
Introduction
The SEATS (Specialised English Author Training System) project aims to create technology capable of supporting the process of writing technical documentation according to the stylistic requirements of a Controlled Language. Central to this support is a style checker based on a flexible parsing mechanism. This paper introduces the notion of Controlled Language, overviews some relevant previous work in the area of robust parsing and describes a novel parser which uses finite state automata as a rule system in the chart parsing paradigm. 2
Controlled language and grammar checking
A controlled language (CL) is a restricted variation on some natural language. The purpose of defining a CL for some domain is to control aspects of the language used to describe a task in that domain. The control is designed to reduce the ambiguity inherent in natural languages, making the text easier to understand, and less prone to incorrect interpretation. A typical application area is one in which the correct execution of a procedure manipulating objects in the domain is safety (or legally) critical. Any aspect of a natural language may be controlled by more or less formal rules. These rules may be specific (e.g., preventing the use of a
particular word) or general (using a model of that linguistic component, be it lexical, grammatical or discourse level). Lexical control says something about the use of lexical items, typically providing a dictionary of approved words. Grammatical control endorses the use of a set of constructions. Discourse level control stipulates the introduction of topics, the structure of information introduction and so on. Controlled Languages offer an ideal application field for language checking technology. Whereas the task of free text checking can never offer complete coverage of the language, controlled languages can generate grammatical models very close, if not identical, to the intended coverage. Additionally, free text may contain unseen lexical items, whereas controlled languages have a finite lexicon which forms part of the language definition.
3
Robust parsing
The field of robust parsing, or robust analysis, provides a useful set of techniques which can be applied to the task of detecting and reporting errors in text. The goals of robust parsing differ slightly from those of error detection. • Robust Parsing aims to provide an analysis of ill-formed text. A grammar and lexicon are used together with some set of techniques to align the text with the grammar. • Error Detection aims to detect the cause of failed analysis of ill-formed text, and report the error. In general, any technique for robust analysis can be applied to the task of error detection by augmenting certain data structures with the appropriate record of the ill-formedness consumed.
3.1
Positive and negative detection
The first consideration in classifying techniques for error checking is the distinction between positive and negative detection. 'Positive detection' is concerned with writing rules which describe the errors, i.e., ungrammatical rules. These rules are then used in the general analysis strategy, e.g., parsing. If they complete, then an error may have been found, at which point further analysis may be done. 'Negative detection' classifies methods which provide a model of the correct language and employ techniques to compare this model with the input.
3.2 Targeted and untargeted detection
Another dimension of technique classification is that of the mode of detection. 'Targeted detection' employs some declaration of the flexibility required in order to detect errors. This declaration is expressed as some form of annotation to the language model. 'Untargeted detection' techniques are those which use some general principle to align the model with the input (or the input with the model). The difference between targeted detection and positive detection is that in targeted detection, the core model is the correct grammar rule; the annotations to this rule describe the required flexibility. Positive detection, on the other hand, uses grammar rules which centre on the error as the key concept. Untargeted detection usually appears as an algorithmic component which provides some form of relaxation to the grammatical model. 3.3
Single phase and multiple phase
This classification of techniques refers to the time at which the error rules are considered. A 'Single Phase' approach would incorporate the rule system at the same time as parsing. This approach would be appropriate to positive detection strategies, as they are identical in implementation to a normal parsing of text. Extending this approach to negative strategies introduces interesting computational problems due to the multiplicity of possibilities. A 'Multiple Phase' approach would incorporate the detection of errors by first analysing the text as if it were well formed, and then reworking this analysis, incorporating the error mechanisms allowed by the definition of the error technique. 3.4
Current methods
Methods which can be classified to some degree in the above manner can be found in the literature on robust analysis. Mellish (1989) describes an example of a negative, untargeted, multiple phase approach to robust analysis. The method uses a grammar of English (hence negative) which it uses to construct a well-formed substring table (chart) employing a bottom-up parsing algorithm. Following this, it uses a modified top-down parser (hence, multiple phase) to attempt to complete the parse with the minimum errors. The use of this general, grammar-independent technique is an example of an untargeted approach.
Compare this with the negative, untargeted, single phase approach of Goeser (1992). Ballim & Russell (1994) describe a single phase approach which offers the grammar developer a weakly targeted, negative grammar environment in which to construct and experiment with rules. Here, grammar rules are annotated with bounds on the relaxations that may provide flexibility for certain constituents. Another single phase approach is described by Wang (1992). This method differs from the others mentioned here in that it employs a novel view of the parsing process, not relying on conventional grammar rules. Its flexibility is derived from a mechanism capable of only three simple actions. Consequently, as it has no traditional grammar model used in its analysis, it cannot strictly be classified with the other systems; an approximation, however, is as an untargeted approach (it presents a general mechanism capable of producing parses of ill-formed input). The positive/negative distinction doesn't apply as there is no grammar. Strzalkowski (1992) describes a parsing system built for speed. Its robust capabilities are untargeted and work through a mechanism which skips ungrammatical input. The paper mentions that, through the use of a time-out facility, no distinction is made between ungrammatical and simply expensive input. Skipped input can later be attached to the analysis, so the method is multiphase. Statistical approaches to undergeneration exist (e.g., Briscoe & Waegner 1994). This technique approaches the problem by assigning probabilities to all possible rules (modulo certain constraints described) over a terminal/nonterminal set in CNF. This approach is designed to be a single phase approach. However, its robust capabilities are captured during a stochastic training phase; consequently the normal model of a 'correct' grammar and an 'incorrect' input is less appropriate to this type of analysis. Work specifically in the area of error detection is less plentiful.
Douglas & Dale (1992) describe a system capable of relaxing constraints at a different level to those with which we are concerned here. The robust PATR model can be used to relax constraints represented as the feature structures of PATR rules in order to accept ill-formed input. The sort of ill-formedness which this method handles is the normal constraints of PATR notation. The implementation of the parser described below allows for variation between single phase and multiple phase parsing, though it is currently implemented as a single phase process. It uses targeted negative detection (note that it is always possible to add positive detection to any parsing mechanism simply by adding grammatically ill-formed rules). It was decided
to use targeted detection for purposes of speed. Mechanisms for arbitrary insertion and deletion, for example, are typically complex; Mellish (1989) reports a (worst case) 10 times increase in time taken when one error is introduced into a sentence. Additionally, the target controlled language (AECMA 1989) has many descriptions of variations to the correct grammar which are not permitted. Writing targeted negative grammars fits this type of language definition. Finally, using a similar model for encoding the language and possible errors as that of the manual will provide a consistent view of the language, a factor which we think will aid the learning of the language as well as the construction of a complete grammar. 4
Chart parsing with finite state automata
The operations required to perform parsing using a well-formed substring table are typically described as follows. 1. Rule invocation. An inactive edge is entered and rules are found for which this edge represents the initial constituent. 2. Combining with active edges. An inactive edge is entered and active edges are looked for with which it may combine. 3. Extension of active edges (usually termed the fundamental rule: Gazdar & Mellish 1989:193). An active edge is entered and inactive edges are looked for to complete or extend the span of the edge. A number of primitive operations are required to support these general operations. • matching: matching must be carried out between the constituents of rules. • addition of information: the creation of a new edge through step 2 or 3 can be viewed as the addition of information. This addition may be a simple update of a dotted rule, e.g., when using atomic categories, or may require more sophisticated operations like the unification of graphs in the case of a feature structure representation. The use of dotted rules (Earley 1986; Kay 1986) is the key behind the efficiency of the paradigm. Traditionally, the grammars used in such parsing schemes have been straight-forward context free grammars. These grammars may be implemented as simple atomic category rules (Andrews & Brown 1993) or more complex information representations such as unification formalisms. In both cases, it is important to ensure that the primitive operations of matching and addition of information can be carried out in efficient ways. The efficiency of the matching process can be increased by the use of indexing systems, both for the rule look-up, in which case rules are stored according to the index value of their initial daughter, and for the storage of edges in the chart, active edges being indexed on their next required daughter, and inactive edges being indexed on their mother. For the edges in the chart, an index is a vector over a vertex. Entering edges into the chart means entering them under the appropriate index at the delimiting vertices. An example of indexing is described by Andrews & Brown (1993). The use of finite state automata (FSA) as a grammar description has a similar form to the standard production described by a context free rule. Instead of a simple series of daughters, the right hand side consists of an FSA (Figure 1). The use of the language of regular expressions to describe finite state automata is well documented (Aho, Sethi & Ullman 1986:83; Gazdar & Mellish 1989:134), as are algorithms for constructing the machines from these descriptions.
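As a concrete illustration of such a construction, the following is a sketch in the style of the Thompson construction described by Aho, Sethi & Ullman: each constituent and each operator yields a small automaton fragment with one entry and one exit state, glued together by epsilon arcs. The class name, state numbering and representation are our own, not the paper's.

```python
# Thompson-style construction of an epsilon-NFA from concatenation and
# disjunction. Each fragment is a (start, accept) pair of states.
class NFA:
    def __init__(self):
        self.trans = []          # list of (src, label, dst); label None = epsilon
        self.n = 0               # next fresh state number
    def state(self):
        self.n += 1
        return self.n - 1
    def sym(self, a):            # fragment consuming a single constituent
        s, t = self.state(), self.state()
        self.trans.append((s, a, t))
        return s, t
    def concat(self, f1, f2):    # f1 then f2, joined by an epsilon arc
        self.trans.append((f1[1], None, f2[0]))
        return f1[0], f2[1]
    def union(self, f1, f2):     # f1 | f2, via new start/accept states
        s, t = self.state(), self.state()
        for f in (f1, f2):
            self.trans.append((s, None, f[0]))
            self.trans.append((f[1], None, t))
        return s, t

# Build B (C | D): the epsilon arcs introduced here are exactly the kind of
# 'extra' expansion states the text discusses for Figure 4.
nfa = NFA()
frag = nfa.concat(nfa.sym("B"), nfa.union(nfa.sym("C"), nfa.sym("D")))
print(sum(1 for (_, lab, _) in nfa.trans if lab is None))   # 5 epsilon arcs
```

A real implementation would follow this with epsilon-closure computation (done off-line, as the text notes below) so that the parser never has to traverse the empty arcs at run time.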
Fig. 1: A finite state production

The processes required to form well-formed substring tables must be modified to accommodate the more complex rule description system of the FSA. In fact this alteration is not at all complex and is really a transfer of the notion of the 'dot' in the dotted rule from marking the next constituent to be consumed to marking the state which the FSA is in. Again, efficiency is maintained by the use of indexing mechanisms, and it is these mechanisms which allow the operations 1, 2 and 3 to remain unaltered. Active edges, and the set of rules, are indexed by the set of possible matches available at any given state, i.e., the arcs representing transitions between states via the consumption of those constituents. In this way, no extra complexity in computation occurs as the indexing mechanism acts as an abstract interface between the representation and the algorithm. The indexing of edges as they enter the chart can be carried out in constant time as it can be accomplished by the simple addition of a precomputed index of arcs of a state and the index for the vertex. So in Figure 1, if the automaton is in state 2, the edge is indexed by C and D. Inactive edges are indexed by their mother categories as before.

5
Encoding grammatical variation with finite state automata
A set of transformations of the FSA allows for the encoding of targeted errors, grammatical variation, to be held in the rule as extensions to the core (correct) production. 5.1
Deletion
A deleted constituent is simply encoded by the use of an epsilon arc. Deleting a constituent from Figure 1 results in the FSA in Figure 2.
Fig. 2: A finite state production with deletion
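A sketch of this transformation, assuming an FSA production encoded as a transition table (the representation, state numbers and labels are illustrative, not the paper's): an epsilon arc is added in parallel with the arc for the deletable constituent, and marked as a deletion arc so that the error can later be reported.

```python
# Transitions: (state, label) -> next state; "eps" stands for an epsilon arc.
# Core (correct) production A -> B C D as a transition table.
trans = {(0, "B"): 1, (1, "C"): 2, (2, "D"): 3}

def allow_deletion(trans, label):
    """Add an epsilon arc parallel to every arc carrying `label`,
    recording each added arc as a targeted 'deletion' error arc."""
    new = dict(trans)
    marks = {}
    for (state, lab), dest in trans.items():
        if lab == label:
            new[(state, "eps")] = dest
            marks[(state, "eps")] = ("deleted", label)
    return new, marks

t2, marks = allow_deletion(trans, "C")
print(t2[(1, "eps")], marks[(1, "eps")])   # 2 ('deleted', 'C')
```

When the parser traverses a marked arc, the edge it builds carries a record of the ill-formedness consumed, as section 3 requires of an error-detection technique.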
5.2
Insertion
Inserting a constituent is not as straight-forward. There are three possible ways to describe an insertion. 1. By describing the material which precedes it, i.e., the preceding constituent. 2. By describing the material which follows it, i.e., the following constituent. 3. By describing the complete context, i.e., the preceding and following constituents. The first two cases are trivial, and are achieved by placing an extra state in the FSA either before or after the appropriate constituent and adding arcs for the new constituent and an epsilon arc for optionality. Figure 3 shows an insertion before C. The third case, however, requires a little more work. For example, inserting a constituent between two others cannot be achieved by placing an extra state between
Fig. 3: A finite state production with insertion

states 1 and 2, as this would also represent an insertion between other adjacent constituents, for example before D. In fact, to describe the implementation of the third type of description, it is necessary to complete the brief description of the construction of finite state automata from regular expressions presented above. As described by Aho, Sethi & Ullman (1986:122-123), the generation of finite state automata from regular expressions is a three-case algorithm. The third case, describing the construction of automata from the disjunction (|) and zero-or-more repetition (*) operators, requires the insertion of epsilon productions. The full description of states 1 and 2 in Figure 1 would be that in Figure 4.

Fig. 4: A finite state production in full

The algorithm for constructing the FSA guarantees that a state will have at most one exiting arc that is not empty (i.e., not an epsilon arc). From any state, it is straightforward to compute the set of states which are reachable through epsilon arcs (this can be done off-line to avoid any addition of complexity to the process). Inserting an extra constituent with a full description of context (preceding and following constituents) can then be achieved by using the set of reachable states from the entry state of the preceding context arc (in this example, B), and checking for a match with the following context, in this case, C. The entry state for the relevant arc is 2, and both 2a and 2b are reachable from 2. The transformation then produces the FSA appearing in Figure 5.
Fig. 5: A finite state production with insertion

A check has to be made to ensure that the preceding context and the following context cannot be ignored through the traversal of epsilon arcs. This can occur if optionality has been defined for those constituents, either through the rule definition, or through the inclusion of some deletion arcs. Insertion and deletion arcs are marked as such to distinguish them from the arcs present in the rule prior to transformation. 6
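The full-context insertion can be sketched as follows, assuming the same transition-table encoding used above (representation, state numbers and the inserted label "X" are our illustrative choices): compute the epsilon-closure of the state entered after the preceding constituent, and splice in an optional detour consuming the inserted material wherever the following constituent can be matched.

```python
def eps_closure(trans, state):
    """States reachable from `state` through epsilon arcs only."""
    seen, stack = {state}, [state]
    while stack:
        s = stack.pop()
        for (src, lab), dst in trans.items():
            if src == s and lab == "eps" and dst not in seen:
                seen.add(dst); stack.append(dst)
    return seen

def allow_insertion(trans, inserted, preceding, following):
    """Permit `inserted` between `preceding` and `following`:
    from the state after `preceding`, find (via epsilon closure)
    states with an arc for `following`, and add an optional detour
    state consuming `inserted` before that arc."""
    new = dict(trans)
    fresh = max(max(src, dst) for (src, _), dst in trans.items()) + 1
    for (src, lab), dst in trans.items():
        if lab != preceding:
            continue
        for s in eps_closure(trans, dst):
            if (s, following) in trans:
                # detour: s --inserted--> fresh --following--> original dest
                new[(s, inserted)] = fresh
                new[(fresh, following)] = trans[(s, following)]
                fresh += 1
    return new

# A -> B C D, with an epsilon expansion state (a '2a'-style state, here 10):
trans = {(0, "B"): 1, (1, "eps"): 10, (10, "C"): 2, (2, "D"): 3}
t2 = allow_insertion(trans, inserted="X", preceding="B", following="C")
print((10, "X") in t2)   # True: X may now appear between B and C
```

The check described in the text (that the context arcs themselves cannot be skipped through epsilon or deletion arcs) would be an additional guard before the detour is added; it is omitted here for brevity.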
Complexity
The complexity of the parsing mechanism should be considered as an extension to the expression for the normal representation of CFG rules. In fact the complexity of the algorithm is not the issue; rather it is the complexity of the grammar. This is because the processes of the algorithm are of the same complexity, locally, as those of the normal version. Rule access is the same (rules are indexed on the possible first daughters as before); the fundamental rule is the same, modulo the number of reachable arcs matching inactive edges in the chart that an edge has emitting from its current FSA state. The factor of the number of reachable arcs is an attribute of the grammar, and not of the algorithm itself. The entry of an inactive edge, resulting in looking backwards in the chart for possible active edges with which to combine, is unchanged, again due to the indexing of edges by the set of reachable constituent arcs. Consequently, use of FSAs should be thought of as an encoding technique which, in effect, reduces the number of rules required. For example, the two rules: 1. A → B C D 2. A → B C E may be represented as one rule: 1. A → B C (D|E)
In terms of the edges generated in the construction of the chart, up to the point of recognising B and C, there is only one set of analyses. In the case with the full rule representation, the two rules represent parallel analyses of the shared component of the production. 7
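The compaction can be sketched directly: the two rules become one automaton sharing the B C prefix, so an edge walking the prefix is a single analysis. The dictionary encoding and the helper `accepts` are our illustrative choices, not the paper's implementation.

```python
# Two CFG rules sharing a prefix:
#   A -> B C D      A -> B C E
# merged into one finite-state production sharing states 0 -> 1 -> 2.
merged = {
    "mother": "A", "start": 0, "final": {3},
    "trans": {(0, "B"): 1, (1, "C"): 2, (2, "D"): 3, (2, "E"): 3},
}

def accepts(rule, seq):
    """Walk the dot-state through the automaton, consuming categories."""
    state = rule["start"]
    for cat in seq:
        state = rule["trans"].get((state, cat))
        if state is None:
            return False
    return state in rule["final"]

for seq in (["B", "C", "D"], ["B", "C", "E"]):
    assert accepts(merged, seq)
assert not accepts(merged, ["B", "D"])
print("one rule, one shared analysis of the B C prefix")
```

With separate CFG rules, a chart parser would carry two active edges through B and C; here a single edge in state 2 is indexed by both D and E, exactly the indexing scheme described in section 4.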
Further work
The framework for parsing errorful text requires the addition of feature structure relaxation techniques to make a complete system. There are many techniques, both targeted and untargeted, for dealing with inconsistencies in unification formalisms (Douglas & Dale 1992; Vogel & Cooper 1995 and others in Schöter & Vogel 1995). It is intended to use a simple model of unification relaxation, using feature structure paths as a description of the targeted point in the structure at which conflict is expected (as with parsing techniques, untargeted relaxation methods for unification are computationally more expensive than targeted ones). Some grammar fragments have already been written for the target controlled language (AECMA 1989); however, the proof of the value of the project will require a full grammar for this domain. 1 8
Conclusions
The field of Controlled Languages is particularly attractive to language technology developers as it provides a domain specific area in which a specific grammar is used as well as a finite set of lexical items. Consequently, checking systems can be implemented which are more reliable than those used for unrestricted text. This paper has presented a method for parsing with a view to detecting errors. The parsing technology uses standard chart parsing augmented with rules expressed as finite state automata. As implemented, the targeted errors are represented declaratively as variations to the underlying finite state automaton of a rule. 1
The parsing system is also being put to use in a related project in the Language Technology Group in Edinburgh: the Construction Industry Specification, Analysis and Understanding project, which deals with sublanguage documents of the construction industry. In this project it is flexibility, and not error detection, that is required. Consequently, a grammar has been developed for the domain which uses the regular expressions with which rules are defined to incorporate variation, much as the insertion and deletion declarations do for the checking task.
REFERENCES

Aho, Alfred V., Ravi Sethi & Jeffrey D. Ullman. 1986. Compilers: Principles, Techniques and Tools. Reading, Mass.: Addison-Wesley.
Andrews, N. A. & J. Brown. 1993. "A High-Speed Natural-Language Parser". AISB Quarterly, Winter. UK: AISB.

Association Européenne des Constructeurs de Matériel Aérospatial. 1989. A Guide for the Preparation of Aircraft Maintenance Documentation in the International Aerospace Maintenance Language, 5th edition. Paris: AECMA.

Ballim, Afzal & Graham Russell. 1994. "LHIP: Extended DCGs for Configurable Robust Parsing". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 501-507. Kyoto, Japan.

Briscoe, Ted & Nick Waegner. 1994. "Robust Stochastic Parsing Using the Inside-Outside Algorithm". Reader for European Summer School on Logic, Language & Information '94, Advanced Course CA4: Robust Parsing.

Douglas, Shona & Robert Dale. 1992. "Towards Robust PATR". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), 468-474. Nantes, France.

Earley, Jay. 1986. "An Efficient Context-Free Parsing Algorithm". Readings in Natural Language Processing ed. by Barbara J. Grosz, Karen Sparck Jones & Bonnie Lynn Webber, 25-33. Calif.: Morgan Kaufmann.

Gazdar, Gerald & Chris Mellish. 1989. Natural Language Processing in PROLOG. Wokingham, U.K.: Addison-Wesley.

Goeser, Sebastian. 1992. "Chart Parsing of Robust Grammars". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92) ed. by Christian Boitet, 120-126. Nantes, France.

Kay, Martin. 1986. "Algorithm Schemata and Data Structures in Syntactic Processing". Readings in Natural Language Processing ed. by Barbara J. Grosz, Karen Sparck Jones & Bonnie Lynn Webber, 35-70. Calif.: Morgan Kaufmann.

Mellish, Chris S. 1989. "Some Chart-Based Techniques for Parsing Ill-Formed Input". Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, 102-109.

Schöter, Andreas & Carl Vogel, eds. 1995. Edinburgh Working Papers in Cognitive Science, vol. 10: Nonclassical Feature Systems. Edinburgh: University of Edinburgh.
Strzalkowski, Tomek. 1992. "TTP: A Fast and Robust Parser for Natural Language". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), 198-204. Nantes, France.
Vogel, Carl & Robin Cooper. 1995. "Robust Chart Parsing with Mildly Inconsistent Feature Structures". Edinburgh Working Papers in Cognitive Science, vol. 10: Nonclassical Feature Systems ed. by Andreas Schöter & Carl Vogel, 197-216. Edinburgh: University of Edinburgh.

Wang, Jin. 1992. "Syntactic Preferences for Robust Parsing with Semantic Preferences". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), 239-245. Nantes, France.
Applicative and Combinatory Categorial Grammar (from syntax to functional semantics)

ISMAIL BISKRI & JEAN-PIERRE DESCLÈS
ISHA - LALIC, France

Abstract

Applicative and Combinatory Categorial Grammar is an extension of Steedman's Combinatory Categorial Grammar by a canonical association between rules and Curry's combinators on the one hand and meta-rules which control type-raising operations on the other hand. This model is included in the general framework of Applicative and Cognitive Grammar (Desclès) with three levels of representation: (i) phenotype (concatenated expressions); (ii) genotype (applicative expressions); (iii) the cognitive representations (meaning of linguistic predicates). The aim of the paper is: (i) an automatic parsing of the phenotype expressions that underlie sentences; (ii) the construction of applicative expressions. The theoretical analysis is applied to spurious ambiguity and coordination. 1
Model of Applicative and Cognitive Grammar
Applicative and Cognitive Grammar (Desclès 1990) is an extension of the Universal Applicative Grammar (Shaumyan 1987). It postulates three levels of representations of languages: (i) The phenotype level (or phenotype) where the particular characteristics of natural languages are described (for example word order, morphological cases, etc.). The linguistic expressions of this level are concatenated linguistic units, the concatenation being noted by: u1 — u2 — ... — un; (ii) The genotype level (or genotype) where the grammatical invariants and structures underlying the sentences of the phenotype level are expressed. The genotype level is structured like a formal language called the genotype language; it is described by a grammar called applicative grammar; (iii) The cognitive level where the meanings of lexical predicates are represented by semantic cognitive schemes. Representations of levels two and three are expressions of typed combinatory logic (Curry & Feys 1958; Shaumyan 1987). We abstract operators associated with elimination and introduction inference rules as in a Gentzen calculus. For instance, we present the combinators B, C*, S, Φ, with the following rules (U1, U2, U3 are typed applicative expressions):
ISMAIL BISKRI & JEAN-PIERRE DESCLÈS
introduction rules / elimination rules

These rules lead to β-reduction or β-expansion:

((B U1 U2) U3) ≥ (U1 (U2 U3))
((C* U1) U2) ≥ (U2 U1)
((S U1 U2) U3) ≥ ((U1 U3) (U2 U3))
((Φ U1 U2 U3) U4) ≥ (U1 (U2 U4) (U3 U4))
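To make the reduction behaviour concrete, here is a minimal sketch (our own encoding, not the authors' implementation): an applicative expression (U1 U2) is represented as the Python tuple (U1, U2), and the combinators as the strings "B", "C*", "S" and "Phi" (for Φ).

```python
# Minimal sketch of the four reduction rules above. Application (U1 U2)
# is the tuple (U1, U2); combinators are the strings "B", "C*", "S", "Phi".

RULES = {
    # combinator: (number of arguments consumed, contractum builder)
    "B":   (3, lambda u1, u2, u3: (u1, (u2, u3))),
    "C*":  (2, lambda u1, u2: (u2, u1)),
    "S":   (3, lambda u1, u2, u3: ((u1, u3), (u2, u3))),
    "Phi": (4, lambda u1, u2, u3, u4: ((u1, (u2, u4)), (u3, u4))),
}

def spine(expr):
    """Unwind the application spine: ((f, a), b) -> ('f', [a, b])."""
    args = []
    while isinstance(expr, tuple):
        expr, arg = expr
        args.append(arg)
    return expr, args[::-1]

def reduce_head(expr):
    """Contract the head redex once; return expr unchanged if there is none."""
    head, args = spine(expr)
    if head in RULES:
        arity, contract = RULES[head]
        if len(args) >= arity:
            result = contract(*args[:arity])
            for extra in args[arity:]:      # reapply any leftover operands
                result = (result, extra)
            return result
    return expr

def normal_form(expr, limit=100):
    """Iterate head reduction until no head redex remains."""
    for _ in range(limit):
        nxt = reduce_head(expr)
        if nxt == expr:
            return expr
        expr = nxt
    return expr
```

For the expression ((B (C* John) loves) Mary) discussed below, this procedure yields ((loves Mary) John), the normal form computed in the paper's derivations.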
In what follows, we are interested in the relations between the first two levels (phenotype and genotype), implementing a system of formal analysis called Applicative and Combinatory Categorial Grammar (ACCG) which explicitly connects phenotype expressions to their underlying representations in the genotype.1 This system consists of: 1. the syntactic analysis of the concatenated expressions of the phenotype by means of a Combinatory Categorial Grammar; 2. the construction, from the result of the syntactic analysis, of the functional semantic interpretation of the phenotype expressions.

1.1  Categorial grammars
Categorial Grammars assign syntactic categories to each linguistic unit. Syntactic categories are orientated types built from basic types with two constructive operators / and \:
(i) N (nominal syntagm) and S (sentence) are basic types.
(ii) If X and Y are orientated types, then X/Y and X\Y are orientated types.2

1 In the phenotype, linguistic expressions are concatenated according to the syntagmatic rules of French. In the genotype, expressions are arranged according to the applicative order.
2 Here we adopt Steedman's notation (1989): X/Y and X\Y are functional orientated types. A linguistic unit u with the type X/Y (respectively X\Y) is considered as an operator (or function) whose typed operand Y is positioned on the right (respectively on the left) of the operator.
APPLICATIVE AND COMBINATORY CATEGORIAL GRAMMAR

A linguistic unit u with orientated type X will be designated by [X : u]. Both rules of application (forward and backward) are noted:
The premises of each rule are concatenations of linguistic units with orientated types, considered as operators or operands; the consequence of each rule is an applicative expression with an orientated type. Combinatory Categorial Grammar (Steedman 1989) generalises classical Categorial Grammars by introducing the operations of type-raising and of composition on functional types. The new rules aim at a quasi-incremental analysis (from left to right) in order to eliminate the problem of spurious ambiguity (Haddock 1987; Pareschi & Steedman 1987).
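The forward and backward application rules can be sketched as follows (a toy encoding of our own, not part of the paper): orientated types are built with two constructors, and each rule checks that the operand type appears on the expected side.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Slash:        # X/Y: an operator looking for its operand Y on the right
    result: object
    arg: object

@dataclass(frozen=True)
class Backslash:    # X\Y: an operator looking for its operand Y on the left
    result: object
    arg: object

def forward(left, right):
    """(>): X/Y . Y  =>  X"""
    if isinstance(left, Slash) and left.arg == right:
        return left.result
    return None      # rule not applicable

def backward(left, right):
    """(<): Y . X\\Y  =>  X"""
    if isinstance(right, Backslash) and right.arg == left:
        return right.result
    return None
```

With loves typed (S\N)/N, forward application to the object N yields S\N, and backward application of the subject N to S\N yields S.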
1.2  Applicative and Combinatory Categorial Grammar
In ACCG, we consider that the rules of Steedman's Combinatory Categorial Grammar introduce the combinators B, C*, S into the syntagmatic sequence. This introduction makes it possible to turn a concatenated structure into an applicative structure. The rules of ACCG are:

Type-raising rules:

The premises of the rules are typed concatenated expressions; the results are typed applicative expressions, with the eventual introduction of a combinator. The type-raising of a unit u introduces the combinator C*; the composition
of two concatenated units introduces the combinators B and S. With such rules we can analyse a sentence by means of a quasi-incremental strategy from left to right. The choice of such a strategy is motivated by: 1. our comprehension of sentences, which we believe to be incremental; that is to say, each term contributes to the gradual construction of meaning (Haddock 1987; Steedman 1989); 2. the control of the spurious ambiguity problem (Pareschi & Steedman 1987; Steedman 1989).

Example: John loves
The first rule (>T), applied to the typed unit [N : John], turns an operand into an operator. It constructs an applicative structure (C* John) whose type is S/(S\N). The introduction of the combinator C* reflects the type-raising in the applicative representation: (C* John) works as an operator with its functional type. The rule (>B) combines the typed linguistic units [S/(S\N) : (C* John)] and [(S\N)/N : loves] with the combinator B in order to compose the two functional units (C* John) and loves. A full processing based upon Applicative and Combinatory Categorial Grammar is carried out in two main steps: 1. The first step consists in checking the proper syntactic connection and in constructing predicative structures, with some combinators introduced at certain positions of the syntagmatic structure. 2. The second step consists in using the β-reduction rules of combinators in order to create the predicative structure underlying the phenotype expression. The expression obtained is an applicative one and belongs to the genotype language. ACCG thus generates processes that associate an applicative structure with a concatenated expression of the phenotype. What remains is to eliminate the combinators of the obtained expression in order to construct the normal form (in the technical sense of β-reduction) that expresses the functional semantic interpretation. This calculus is done entirely in the genotype.
Therefore, the process that we propose takes the shape of a compilation whose steps are summed up in Figure 1:
Let us deal with a simple example: John loves Mary.
1 [N:John]-[(S\N)/N:loves]-[N:Mary]               Typed concatenated structure of phenotype
2 [S/(S\N):(C* John)]-[(S\N)/N:loves]-[N:Mary]    (>T)
3 [S/N:(B (C* John) loves)]-[N:Mary]              (>B)
4 [S:((B (C* John) loves) Mary)]                  (>)
5 [S:((B (C* John) loves) Mary)]                  Typed applicative structure of genotype
6 [S:((C* John)(loves Mary))]                     (B)
7 [S:((loves Mary) John)]                         (C*)   Normal form of genotype
The type-raising (>T), applied to the operand John, makes it possible to generate the operator (C* John), which the functional rule (>B) composes with the operator loves. The complex operator (B (C* John) loves) is applied to the operand Mary in order to form the applicative expression of the genotype ((B (C* John) loves) Mary). The reduction of the combinators in the genotype constructs the functional semantic interpretation underlying the phenotype expression (the input).
2  Structural reorganisation
The syntactic analysis from left to right raises the problem of non-determinism introduced by the presence in the language of backward modifiers, which stand as operators applied to the whole or to a part of a previously constructed structure. If, in the first case, the use of a rule of application allows the analysis to be carried on,3 it is quite different in the second case, where the analysis blocks. For a sentence like John loves Mary madly, the parser at first creates the constituent [S : ((B (C* John) loves) Mary)]. This constituent is not combinable with madly, of type (S\N)\(S\N). As a matter of fact, madly is an operator whose operand (loves Mary) stands on its left. A quasi-incremental analysis from left to right favours the application of a combinatory rule as soon as possible. A direct consequence is that loves and Mary are absorbed4 into ((B (C* John) loves) Mary), which obviously does not allow us to construct (loves Mary) directly. The problem raised comes down to the possibility of backtracking. But such backtracking tends to increase the computational cost (memory and execution time) of a syntactic analysis. However, an intelligent backtracking (which we propose below) can reduce this cost considerably, while at the same time constructing proper semantic analyses and eliminating spurious ambiguities. Such a backtracking decomposes the constituent already constructed into two components, one of which may be combined with the backward modifier. Formally, this operation of structural reorganisation is realised by the two following successive steps: (a) the reorganisation of the constituent already constructed isolates two sub-categories at each step, and tests whether the backward modifier may be combined on the left5 with one of these two sub-categories. We proceed with the reduction of combinators until the test gives a positive value.
At the end of the process we recover a new typed applicative structure equivalent to the first one.

3 Take the sentence John hit Mary yesterday, where the backward modifier yesterday operates on the whole sentence John hit Mary; since the syntactic type of yesterday is S\S, in order to continue the analysis it is enough to apply yesterday to John hit Mary by the rule (<).
4 That is to say, Mary does not appear clearly as the operand of the operator to love.
5 In our terminology, u1 may be combined on the left with u2 if one of these two cases is possible: — one of the following rules <,
Example: In the case of the statement John loves Mary madly, the steps of the reorganisation are:
Constituent constructed: [S : ((B (C* John) loves) Mary)]
The two sub-categories are: [S/N : (B (C* John) loves)] and [N : Mary]
Test: [S/N : (B (C* John) loves)] may not be combined on the left with [(S\N)\(S\N) : madly]
[N : Mary] may not be combined on the left with [(S\N)\(S\N) : madly]
Reduction of the combinator B: [S : ((C* John)(loves Mary))]
The two sub-categories are: [S/(S\N) : (C* John)] and [S\N : (loves Mary)]
Test: [S/(S\N) : (C* John)] may not be combined on the left with [(S\N)\(S\N) : madly]
[S\N : (loves Mary)] may be combined on the left with [(S\N)\(S\N) : madly]
The combinator reduction process stops. We recover the category in output: [S : ((C* John)(loves Mary))].
(b) decomposition, realised by means of the two rules:
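The test "may be combined on the left" used in the example can be sketched for the simple case of backward application (the function names and the string encoding of types are ours, not the authors'):

```python
def strip_outer(t):
    """Remove one pair of redundant outer parentheses, if present."""
    if t.startswith("(") and t.endswith(")"):
        depth = 0
        for i, ch in enumerate(t):
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth == 0 and i < len(t) - 1:
                    return t        # the ')' closes early: parens are not outer
        return t[1:-1]
    return t

def combinable_on_left(left_type, mod_type):
    """Toy version of the paper's test (footnote 5), restricted to backward
    application: true when  Y . X\\Y => X  is licensed, with types written
    as strings such as 'S\\N' or '(S\\N)\\(S\\N)'."""
    depth = 0
    for i, ch in enumerate(mod_type):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "\\" and depth == 0:     # top-level backslash: X\Y found
            return strip_outer(mod_type[i + 1:]) == strip_outer(left_type)
    return False
```

On the example above, (loves Mary) of type S\N passes the test against madly of type (S\N)\(S\N), while Mary of type N and (C* John) of type S/(S\N) fail it, as in the reorganisation trace.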
We read these rules as follows:
• For (>dec): if we have an applicative structure (u1 u2) with type X, u1 of type X/Y and u2 of type Y, then we can construct a new concatenated expression formed by the two categories [X/Y:u1] and [Y:u2].
• For (<dec): if we have an applicative structure (u1 u2) with type X, u1 of type X\Y and u2 of type Y, then we can construct a new concatenated expression formed by the two categories [Y:u2] and [X\Y:u1].
Let us notice that the two rules (>dec) and (<dec) are respectively the inverses of the rules of functional application (>) and (<). Both rules allow us to construct a new concatenated ordering of the operator/operand structure coming from the reorganisation. For the sentence John loves Mary madly, the decomposition is applied to the structure that arises from the reorganisation: [S : ((C* John)(loves Mary))]. With the rule (>dec), we produce the concatenated ordering: [S/(S\N) : (C* John)] - [S\N : (loves Mary)]. These two steps enter the complete analysis of the sentence John loves Mary madly as follows (step 5 for the reorganisation and step 6 for the decomposition):

Phenotype (1-8)
1 [N:John]-[(S\N)/N:loves]-[N:Mary]-[(S\N)\(S\N):madly]       Typed concatenated structure of phenotype
4 [S:((B (C* John) loves) Mary)]-[(S\N)\(S\N):madly]
5 [S:((C* John)(loves Mary))]-[(S\N)\(S\N):madly]             (B)
6 [S/(S\N):(C* John)]-[S\N:(loves Mary)]-[(S\N)\(S\N):madly]  (>dec)
7 [S/(S\N):(C* John)]-[S\N:(madly (loves Mary))]              (<)
8 [S:((C* John)(madly (loves Mary)))]                         (>)

Genotype (9-10)
9 [S:((C* John)(madly (loves Mary)))]                         Typed applicative structure of genotype
10 [S:((madly (loves Mary)) John)]                            (C*)   Normal form of genotype
3  Coordination
Coordination is the action of joining two words or two expressions of the same kind or having the same function. Within the framework of Categorial Grammars, Steedman (1989) and Barry & Pickering (1990) consider that two linguistic units may be coordinated to give one linguistic unit of type X if and only if each unit has type X. Even if this definition remains incomplete, given that coordination presents itself under different shapes, it points out the way to follow towards a reliable solution. We present four types of examples of coordination with AND. We may coordinate:6
1. Two segments of the same kind, with the same structure and contiguous to AND: [John loves]S/N and [William hates]S/N these pictures
2. Two segments in an elliptic construction: John loves [Mary madly] and [Jenny wildly]; [John] loves [Mary] and [William Jenny]
3. Two segments of different structures: Mary walks [slowly] and [with happiness]; John [sings] and [plays the violin].
4. Two segments without distributivity: The flag is [white] and [red] (≠ The flag is white and the flag is red).
6 The categories to be coordinated are between square brackets.
To the conjunction AND we associate the morphological type (X\X)/X. However, the context gives more specifications for assigning a type to AND. Hypotheses 1 and 2 make it possible to assign a type to AND by taking the context into account.
Hypothesis 1: The constructed category that immediately follows the conjunction AND determines the type of the coordination.
This hypothesis leads us to indirectly introduce an interruption into the quasi-incremental analysis: as soon as we encounter the conjunction AND, we temporarily interrupt the quasi-incremental analysis in order to construct the second member of the coordination. We propose a second hypothesis:
Hypothesis 2: When we have a coordination typed X as defined by Hypothesis 1, the first member of the coordination is the typed category X which immediately precedes the conjunction.
The rules that we have to bring out through these two hypotheses consequently emanate from the idea that both members of a coordination have the same syntactic type X, corresponding to different functional semantic interpretations. The result of the application of the rules keeps the same syntactic type X. We set up two abstract types for the conjunction. The first one concerns the distributive conjunction; we note it CONJD. The second concerns the non-distributive conjunction; we note it CONJN.
We apply the rule to the cases of distributive coordination. In order to take the distributivity into account at the level of the applicative structure, we use the combinator Φ. We apply the rule to the cases of non-distributive coordination (see example E3). With the quasi-incremental analysis, during the application of Hypothesis 2, two typical cases occur:
1. the constituent produced before encountering the conjunction is of the same type as the constituent determined by the coordination. This constituent is then the first member of the coordination. For instance, the analysis of the sentence [John loves]S/N and [William hates]S/N these pictures constructs [S/N:(B (C* John) loves)] before encountering the conjunction. This constituent has the same type as the second member [S/N:(B (C* William) hates)], the constituent determined by the first hypothesis. The constituent [S/N:(B (C* John) loves)] is then the first member of the coordination.
2. the constituent determined before encountering the conjunction does not have the same type as the constituent determined by the coordination. It is then necessary to modify the structure of this constituent. For instance, the analysis of the sentence John loves [Mary madly] and [Jenny wildly] constructs [S : ((C* John) (madly (loves Mary)))] before the analysis of the conjunction. The second member of the coordination is [(S\N)\((S\N)/N) : (B wildly (C* Jenny))].7 In this second case, the process of structural reorganisation allows us:
• either to directly isolate the first member of the coordination (see steps 6 and 7 of example E1),
• or to isolate the binary operator/operand structure which contains the first member of the coordination. In this case, it is necessary to associate the structural reorganisation with the use of the logical equivalences (a, b, c, d) of combinatory logic, which are direct consequences of the introduction and elimination of the combinators B and C* (see step 8 of example E2):
(a) (u1 (u2 u3)) ⇔ ((B u1 u2) u3)
(b) ((u1 u2) u3) ⇔ ((B (C* u3) u1) u2)
(c) (u1 (u2 u3)) ⇔ ((B u1 (C* u3)) u2)
(d) ((u1 u2) u3) ⇔ ((B (C* u3) (C* u2)) u1)
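These equivalences can be checked mechanically: reducing each right-hand side with the B and C* rules of Section 1 gives back the left-hand side. A small self-contained sketch (our encoding: application (U1 U2) is the tuple (U1, U2), combinators are the strings "B" and "C*"):

```python
def step(e):
    """One leftmost-outermost B/C* contraction; None if e is in normal form."""
    if not isinstance(e, tuple):
        return None
    head, args = e, []
    while isinstance(head, tuple):          # unwind the application spine
        head, a = head
        args.insert(0, a)
    if head == "B" and len(args) >= 3:      # ((B u1 u2) u3) -> (u1 (u2 u3))
        new, rest = (args[0], (args[1], args[2])), args[3:]
    elif head == "C*" and len(args) >= 2:   # ((C* u1) u2) -> (u2 u1)
        new, rest = (args[1], args[0]), args[2:]
    else:
        for i, a in enumerate(args):        # no head redex: try the subterms
            s = step(a)
            if s is not None:
                args[i] = s
                new, rest = head, args
                break
        else:
            return None
    for r in rest:                          # reapply the remaining operands
        new = (new, r)
    return new

def nf(e):
    """Normal form under repeated contraction."""
    while (s := step(e)) is not None:
        e = s
    return e
```

For instance, the right-hand side of (d), ((B (C* u3) (C* u2)) u1), reduces in three steps to ((u1 u2) u3), its left-hand side.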
4  Meta-rules
We add to our formalism different meta-rules that control type-raising. These meta-rules, on the one hand, indicate that a type-raising rule has to be applied and, on the other hand, choose the particular type-raising to be realised. We do not consider these meta-rules as an absolute computational tool; we attribute a linguistic and logical pertinence to them. They may receive an interpretation if we take some prosodic factors into account. In what follows, we present three meta-rules among the ten that we have conceived (Biskri 1995). Let us take u1 and u2 in the concatenated expression u1-u2:
7 The sentence John loves Mary madly and Jenny wildly is ambiguous. In our example, we consider Jenny as an object.
Meta-rule 1: If u1 has type N and u2 has type (Y\N)/Z, then we apply the type-raising (>T) to u1: [N : u1] ⇒ [Y/(Y\N) : (C* u1)]
Example: John eats the apple
[N : John] - [(S\N)/N : eats] - [N/N : the] - [N : apple]
[S/(S\N) : (C* John)] - [(S\N)/N : eats] - ...
In this case Y = S; Z = N.
Meta-rule 2: If u1 has type N (u1 preceded by and) and u2 has type N, then we apply the type-raising (>T) to u1: [N : u1] ⇒ [S/(S\N) : (C* u1)]
Example: John loves Mary and William Jenny
... - [CONJD : and] - [N : William] - [N : Jenny]
... - [S/(S\N) : (C* William)] - [N : Jenny]
Meta-rule 3: If u2 has type N and u1 has type Y/X (u1 preceded by and), then we apply the backward type-raising (<T) to u2.
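As an illustration, Meta-rule 1 can be sketched as a pattern match over type structures (our own encoding, not the authors' implementation: X/Y is the tuple ('/', X, Y) and X\Y is ('\\', X, Y)):

```python
def meta_rule_1(t1, t2):
    """M1: if u1 : N and u2 : (Y\\N)/Z, raise u1 to Y/(Y\\N).
    Types: X/Y is ('/', X, Y), X\\Y is ('\\', X, Y). Returns the raised
    type of u1, or None when the meta-rule does not apply."""
    if t1 != "N":
        return None
    if isinstance(t2, tuple) and t2[0] == "/":
        functor = t2[1]                      # the Y\N part of (Y\N)/Z
        if isinstance(functor, tuple) and functor[0] == "\\" and functor[2] == "N":
            y = functor[1]
            return ("/", y, ("\\", y, "N"))  # Y/(Y\N)
    return None
```

For John eats (eats : (S\N)/N, i.e. Y = S and Z = N) this yields S/(S\N), the type assigned to (C* John) in the example above.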
5  Examples
E1: John loves Mary and hates Jenny

Phenotype (1-9)
1 [N:John]-[(S\N)/N:loves]-[N:Mary]-[CONJD:and]-[(S\N)/N:hates]-[N:Jenny]
4 [S:((B (C* John) loves) Mary)]-[CONJD:and]-[(S\N)/N:hates]-[N:Jenny]
5 [S:((B (C* John) loves) Mary)]-[CONJD:and]-[S\N:(hates Jenny)]            (>)
6 [S:((C* John) (loves Mary))]-[CONJD:and]-[S\N:(hates Jenny)]              (B)
7 [S/(S\N):(C* John)]-[S\N:(loves Mary)]-[CONJD:and]-[S\N:(hates Jenny)]    (>dec)
8 [S/(S\N):(C* John)]-[S\N:(Φ and (loves Mary) (hates Jenny))]
9 [S:((C* John)(Φ and (loves Mary) (hates Jenny)))]                         (>)

Genotype (10-12)
10 [S:((C* John)(Φ and (loves Mary) (hates Jenny)))]
11 [S:((Φ and (loves Mary) (hates Jenny)) John)]                            (C*)
12 [S:(and ((loves Mary) John)((hates Jenny) John))]                        (Φ)
E2: John loves Mary and William Jenny
Phenotype (1-11)
1 [N:John]-[(S\N)/N:loves]-[N:Mary]-[CONJD:and]-[N:William]-[N:Jenny]
4 [S:((B (C* John) loves) Mary)]-[CONJD:and]-[N:William]-[N:Jenny]
5 ... -[CONJD:and]-[S/(S\N):(C* William)]-[N:Jenny]                          (>T), M2
6 ... -[CONJD:and]-[S/(S\N):(C* William)]-[(S\N)\(S/(S\N)):(C* Jenny)]       (<T), M3
7 ... -[CONJD:and]-[S\(S/(S\N)):(B (C* William)(C* Jenny))]                  (>Bx)
8 [S:((B (C* John)(C* Mary)) loves)]-[CONJD:and]- ...                        (d)
9 [(S\N)/N:loves]-[S\(S/(S\N)):(B (C* John)(C* Mary))]-[CONJD:and]- ...      (<dec)
10 [(S\N)/N:loves]-[S\(S/(S\N)):(Φ and (B (C* John)(C* Mary))(B (C* William)(C* Jenny)))]
11 [S:((Φ and (B (C* John)(C* Mary))(B (C* William)(C* Jenny))) loves)]      (<)

Genotype (12-19)
12 [S:((Φ and (B (C* John)(C* Mary))(B (C* William)(C* Jenny))) loves)]
13 [S:(and ((B (C* John)(C* Mary)) loves)((B (C* William)(C* Jenny)) loves))]   (Φ)
14 [S:(and ((C* John)((C* Mary) loves))((B (C* William)(C* Jenny)) loves))]     (B)
15 [S:(and (((C* Mary) loves) John)((B (C* William)(C* Jenny)) loves))]         (C*)
16 [S:(and ((loves Mary) John)((B (C* William)(C* Jenny)) loves))]              (C*)
17 [S:(and ((loves Mary) John)((C* William)((C* Jenny) loves)))]                (B)
18 [S:(and ((loves Mary) John)(((C* Jenny) loves) William))]                    (C*)
19 [S:(and ((loves Mary) John)((loves Jenny) William))]                         (C*)
E3: the flag is white and red (≠ the flag is white and the flag is red)
Phenotype (1-8)
1 [N/N:the]-[N:flag]-[(S\N)/(N\N):is]-[N\N:white]-[CONJN:and]-[N\N:red]
2 [N:(the flag)]-[(S\N)/(N\N):is]-[N\N:white]-[CONJN:and]-[N\N:red]              (>)
3 [S/(S\N):(C* (the flag))]-[(S\N)/(N\N):is]-[N\N:white]-[CONJN:and]-[N\N:red]   (>T), M1
4 [S/(N\N):(B (C* (the flag)) is)]-[N\N:white]-[CONJN:and]-[N\N:red]             (>B)
5 [S:((B (C* (the flag)) is) white)]-[CONJN:and]-[N\N:red]                       (>)
6 [S/(N\N):(B (C* (the flag)) is)]-[N\N:white]-[CONJN:and]-[N\N:red]             (>dec)
7 [S/(N\N):(B (C* (the flag)) is)]-[N\N:(and white red)]
8 [S:((B (C* (the flag)) is)(and white red))]                                    (>)

Genotype (9-11)
9 [S:((B (C* (the flag)) is)(and white red))]
10 [S:((C* (the flag))(is (and white red)))]                                     (B)
11 [S:((is (and white red))(the flag))]                                          (C*)
Other examples and more details are provided in Biskri (1995). The analyses have been implemented; we do not give the details of the algorithm here.
6  Conclusion
We have presented a model of analysis within the framework of Applicative and Cognitive Grammar that realises the interface between syntax and semantics. For many French examples this model is able to fulfil the following aims:
• to produce an analysis which verifies the syntactic correctness of statements;
• to develop automatically the predicative structures that yield the functional semantic interpretation of statements.
Moreover, this model has the following characteristics:
1. We do not make any calculus parallel to the syntactic calculus, as in Montague's approach (1974). A first calculus verifies the syntactic correctness; this calculus is followed by the construction of the functional semantic interpretation. This has been made possible by the introduction of combinators at specific positions of the syntagmatic order.
2. We introduce some components of the functional semantics by applicative syntactic tools (combinators).
3. We calculate the functional semantic interpretation by applicative syntactic methods (combinator reduction).
To sum up, we interpret by purely syntactic techniques. The syntax/semantics distinction should then be rethought from another perspective.

REFERENCES

Ades, Anthony & Mark Steedman. 1982. "On the Order of Words". Linguistics and Philosophy 4.517-558.
Barry, Guy & Martin Pickering. 1992. "Dependency and Constituency in Categorial Grammar". Word Order in Categorial Grammar / L'ordre des mots dans les grammaires catégorielles ed. by Alain Lecomte, 38-57. Clermont-Ferrand: Adosa.
Biskri, Ismail. 1995. La Grammaire catégorielle combinatoire applicative dans le cadre de la grammaire applicative et cognitive. Ph.D. dissertation, EHESS, Paris.
Buszkowski, Wojciech, Witold Marciszewski & Johan van Benthem. 1988. Categorial Grammar. Amsterdam & Philadelphia: John Benjamins.
Curry, Haskell B. & Robert Feys. 1958. Combinatory Logic, vol. I. Amsterdam: North-Holland.
Desclès, Jean-Pierre. 1990.
Langages applicatifs, langues naturelles et cognition. Paris: Hermes.
Desclès, Jean-Pierre & Frédérique Segond. 1992. "Topicalisation: Categorial Analysis and Applicative Grammar". Word Order in Categorial Grammar ed. by Alain Lecomte, 13-37. Clermont-Ferrand: Adosa.
Haddock, Nicholas. 1987. "Incremental Interpretation and Combinatory Categorial Grammar". Working Papers in Cognitive Science, I: Categorial Grammar, Unification Grammar and Parsing ed. by Nicholas Haddock et al., 71-84. University of Edinburgh.
Lecomte, Alain. 1994. Modèles logiques en théorie linguistique: Éléments pour une théorie informationnelle du langage. Work synthesis. Grenoble: Université de Grenoble.
Moortgat, Michael. 1989. Categorial Investigations: Logical and Linguistic Aspects of the Lambek Calculus. Dordrecht: Foris.
Oehrle, Richard T., Emmon Bach & Deirdre Wheeler. 1988. Categorial Grammars and Natural Language Structures. Dordrecht: Reidel.
Pareschi, Remo & Mark Steedman. 1987. "A Lazy Way to Chart Parse with Categorial Grammars". Proceedings of the 25th Annual Meeting of the Association for Computational Linguistics (ACL'87). Stanford, Calif.
Shaumyan, Sebastian K. 1987. A Semiotic Theory of Natural Language. Bloomington: Indiana University Press.
Steedman, Mark. 1989. Work in Progress: Combinators and Grammars in Natural Language Understanding. Summer Institute of Linguistics, Tucson, University of Arizona.
Szabolcsi, Anna. 1987. "On Combinatory Categorial Grammar". Proceedings of the Symposium on Logic and Language, 151-162. Budapest: Akadémiai Kiadó.
PARSETALK about Textual Ellipsis

UDO HAHN & MICHAEL STRUBE
Freiburg University

Abstract

We present a hybrid methodology for the resolution of textual ellipsis. It incorporates conceptual proximity criteria applied to ontologically well-engineered domain knowledge bases and an approach to centering based on functional topic/comment patterns. We state grammatical predicates for textual ellipsis and then turn to the procedural aspects of their evaluation within the framework of an actor-based implementation of a lexically distributed parser.

1  Introduction
Text phenomena, e.g., textual forms of anaphora or ellipsis, are a particularly challenging issue for the design of natural language parsers, since lacking recognition facilities result in referentially incohesive or invalid text knowledge representations. At the conceptual level, textual ellipsis (also called functional anaphora) relates an elliptical expression to its antecedent through conceptual attributes (or roles) associated with that antecedent (see, e.g., the relation between "Zugriffszeit" (access time) and "Laufwerk" (hard disk drive) in (3) and (2) below). It thus complements the phenomenon of nominal anaphora (cf. Strube & Hahn 1995), where an anaphoric expression is related to its antecedent in terms of conceptual generalisation (as, e.g., "Rechner" (computer) refers to "LTE-Lite/25" (a particular notebook) in (2) and (1) below). The resolution of text-level anaphora contributes to the construction of referentially valid text knowledge representations, while the resolution of textual ellipsis yields referentially cohesive text knowledge bases.
(1) Der LTE-Lite/25 wird mit der ST-3141 von Seagate ausgestattet. (The LTE-Lite/25 is - with the ST-3141 from Seagate - equipped.)
(2) Der Rechner hat durch dieses neue Laufwerk ausreichend Platz für Windows-Programme. (The computer provides - because of this new hard disk drive - sufficient storage for Windows programs.)
(3) Darüber hinaus ist die Zugriffszeit mit 25 ms sehr kurz. (Also - is - the access time of 25 ms - quite short.)
Fig. 1: Fragment of the information technology domain knowledge base

In the case of textual ellipsis, the conceptual entity that relates the topic of the current utterance to the discourse elements mentioned in the preceding one is not explicitly mentioned in the surface expression. Hence, the missing conceptual link must be inferred in order to establish the local coherence of the whole discourse (for an early statement of that idea, cf. Clark (1975)). For instance, in (3) the proper conceptual relation between "Zugriffszeit" (access time) and "Laufwerk" (hard disk drive) must be determined. This relation can only be made explicit if conceptual knowledge about the domain is supplied. It is obvious (see Figure 1)1 that the concept ACCESS-TIME is bound in a direct associative or aggregational relation, viz. access-time, to the concept HARD-DISK-DRIVE, while its relation to the instance LTE-LITE-25 is not so tight (assuming property inheritance). A relationship between ACCESS-TIME and STORAGE-SPACE or SOFTWARE is excluded at the conceptual level, since they are not linked via any conceptual role.

1 The following notational conventions apply to the knowledge base for the information technology domain to which we refer throughout the paper (see Figure 1): Angular boxes from which double arrows emanate contain instances (e.g., LTE-LITE-25), while rounded boxes contain generic concept classes (e.g., NOTEBOOK). Directed unlabelled links relate concepts via the isa relation (e.g., NOTEBOOK and COMPUTER-SYSTEM), while links labelled with an encircled square represent conceptual roles (definitional roles are marked by "d"). Their names and value constraints are attached to each circle (e.g., COMPUTER-SYSTEM - has-central-unit - CENTRAL-UNIT, with small italics emphasising the role name). Note that any subconcept or instance inherits the conceptual attributes from its superconcept or concept class (this is not explicitly shown in Figure 1).
Nevertheless, the association of concepts through conceptual roles is far too unconstrained to properly discriminate among several possible antecedents in the preceding discourse context. We therefore propose a basic heuristic of conceptual proximity which takes the path length between concept pairs into account. It is based on the common distinction between concepts and roles in classification-based terminological reasoning systems (cf. MacGregor (1991) for a survey). Conceptual proximity takes only conceptual roles into consideration, while it does not consider the generalisation hierarchy between concepts. The heuristic can be phrased as follows: if fully connected role chains between the concepts denoted by a possible antecedent and an elliptical expression exist via one or more conceptual roles, that particular role composition is preferred for the resolution of textual ellipsis whose path contains the least number of roles. Whenever several connected role chains of equal length exist, functional constraints based on topic/comment patterns apply for the selection of the proper antecedent. Hence, only under equal-length conditions is grammatical information from the preceding sentence brought into play (for a precise statement in terms of the underlying text grammar, cf. Table 5 in Section 4). To illustrate these principles, consider the sentences (1)-(3) and Figure 1. According to the convention above, HARD-DISK-DRIVE is conceptually most proximate to the elliptical occurrence of ACCESS-TIME (due to the direct conceptual role linking HARD-DISK-DRIVE - access-time - ACCESS-TIME, with unit length 1), while the relationship between LTE-LITE-25 and ACCESS-TIME exhibits a greater conceptual distance (counting with unit length 2, due to the composition of roles between LTE-LITE-25 - has-hd-drive - HARD-DISK-DRIVE - access-time - ACCESS-TIME).

2  Ontological engineering for ellipsis resolution
Metrical criteria incorporating path connectivity patterns in network-based knowledge bases have often been criticised for lacking generality and for introducing ad hoc criteria likely to be invalidated when applied to different domain knowledge bases (DKB). The crucial point about the presumed unreliability of path-length criteria is the problem of how the topology of such a network can be 'normalised' so that formal distance measures uniformly relate to intuitively plausible conceptual proximity judgements. Though we have no formal solution to this correspondence problem, we try to eliminate structural idiosyncrasies by postulating two ontology engineering (OE) principles (cf. also Simmons (1992) and Mars (1994)):
1. Clustering into Basic Categories. The specification of the upper level of the ontology of some domain (e.g., information technology (IT)) should be based on a stable set of abstract, yet domain-oriented ontologicai categories inducing an almost complete partition on the en tities of the domain at a comparable level of generality (e.g., hardware, software, companies in the IT world). Each specification of such a ba sic category and its taxonomic descendents constitutes the common ground for what Hayes (1985) calls clusters and Guha & Lenat (1990) refer to as micro theories, i.e., self-contained descriptions of concep tually related proposition sets about a reasonable portion of the commonsense world within a single knowledge base partition (subtheory). 2. Balanced Deepening. Specifications at lower levels of that onto logy, which deal with concrete objects of the domain (e.g., notebooks, laser printers, hard disk drives in the IT world), must be carefully balanced, i.e., the extraction of attributes for any particular category should proceed at a uniform degree of detail at each decomposition level. The ultimate goal is that any subtheory have the same level of representational granularity, although these granularities might differ among various subtheories (associated with different basic categories). Given an ontologically well-engineered DKB, the ellipsis resolution problem, finally, has to be projected from the knowledge to the symbol layer of repres entations. By this, we mean the abstract implementation of knowledge rep resentation structures in terms of concept graphs and their emerging path connectivity patterns. At this level, we draw on early experiments from cognitive psychologists such as Rips et al. (1973) and more recent research on similarity metrics (Rada et al. 1989) and spreading-activation-based inferencing, e.g., by Charniak (1986). 
They indicate that the definition of proximity in semantic networks in terms of the traversal of typed edges (e.g., only via generalisation or via attribute links) and the corresponding counting of nodes that are passed on that traversal is methodologically valid for computing semantically plausible connections between concepts.2 The OE principles mentioned above are supplemented by the following linguistic regularities which hold for textual ellipsis:

1. Adherence to a Focused Context. Valid antecedents of elliptical expressions mostly occur within subworld boundaries (i.e., they remain within a single knowledge base cluster, microtheory, etc.). Given the

2 An alternative to simple node counting for the computation of semantic similarity, which is based on a probabilistic measure of information content, has recently been proposed by Resnik (1995).
PARSETALK ABOUT TEXTUAL ELLIPSIS
OE constraints (in particular, the one requiring each subworld to be characterised by the same degree of conceptual density), path-length criteria make sense for estimating conceptual proximity.

2. Limited Path Length Inference. Valid pairs of possible antecedents and elliptical expressions denote concepts in the DKB whose conceptual relations (role chains) are constructed on the basis of rather restricted path-length conditions (in our experiments, no valid chain ever exceeded unit length 5). This corresponds to the implicit requirement that these role chains must be efficiently computable.
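The limited path-length criterion can be sketched as a bounded breadth-first search over the role links of the DKB. In the sketch below, the miniature ROLES graph and its concept names are our own illustrative stand-ins, not the authors' actual knowledge base:

```python
from collections import deque

# Hypothetical miniature domain knowledge base: each concept maps to the
# concepts reachable via one conceptual role (role labels omitted for brevity).
ROLES = {
    "NOTEBOOK": ["HARD-DISK-DRIVE", "CPU"],
    "HARD-DISK-DRIVE": ["ACCESS-TIME"],
    "CPU": ["CLOCK-RATE"],
}

MAX_PATH_LENGTH = 5  # in the authors' experiments, no valid chain exceeded unit length 5

def proximity_score(from_concept, to_concept):
    """Length of the shortest role chain linking the two concepts,
    or None if no chain of length <= MAX_PATH_LENGTH exists."""
    if from_concept == to_concept:
        return 0
    seen = {from_concept}
    queue = deque([(from_concept, 0)])
    while queue:
        concept, dist = queue.popleft()
        if dist == MAX_PATH_LENGTH:
            continue  # enforce the restricted path-length condition
        for nxt in ROLES.get(concept, []):
            if nxt == to_concept:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

Bounding the search depth is what makes the role chains efficiently computable, as the requirement above demands.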
3 Functional centering principles
Conceptual criteria are of tremendous importance, but they are not sufficient for the proper resolution of textual ellipsis. Additional criteria have to be supplied in the case of equal role chain length for alternative antecedents. We therefore incorporate into our model various functional criteria in terms of topic/comment patterns which originate from (dependency) structure analyses of the underlying utterances. The framework for this type of information is provided by the well-known centering model (Grosz et al. 1995). Accordingly, we distinguish each utterance's backward-looking center (Cb(Un)) and its forward-looking centers (Cf(Un)). The ranking imposed on the elements of the Cf reflects the assumption that the most highly ranked element of Cf(Un) is the most preferred antecedent of an anaphoric or elliptical expression in the utterance Un+1, while the remaining elements are (partially) ordered according to decreasing preference for establishing referential links. The main difference between the original centering approach and our proposal concerns the criteria for ranking the forward-looking centers. While Grosz et al. assume (for the English language) that grammatical roles are the major determinant for the ranking on the Cf, we claim that for German - a language with relatively free word order - it is the functional information structure of the sentence in terms of topic/comment patterns. In this framework, the topic (theme) denotes the given information, while the comment (rheme) denotes the new information (for surveys, cf. Danes (1974) and Dahl (1974)). This distinction can easily be rephrased in terms of the centering model. The theme then corresponds to the Cb(Un), the most highly ranked element of Cf(Un-1) which occurs in Un.
The theme/rheme hierarchy of Un is determined by the Cf(Un-1): elements of Un which are contained in Cf(Un-1) (context-bound discourse elements) are less rhematic than elements of Un which are not contained in Cf(Un-1) (unbound elements). The distinction between context-bound and unbound elements is important for the ranking on the Cf, since bound elements are generally ranked higher than any other non-anaphoric elements. The rules for the ranking on the Cf are summarised in Table 1. They are organised at three layers. At the top level, >TCbase denotes the basic relation for the overall ranking of topic/comment (TC) patterns. The second relation in Table 1, >TCboundtype, denotes preference relations exclusively dealing with multiple occurrences of bound elements in the preceding utterance. The bottom level of Table 1 is constituted by >prec, which covers the preference order for multiple occurrences of the same type of any topic/comment pattern, e.g., the occurrence of two anaphora or two unbound elements (all heads in a sentence are ordered by linear precedence relative to their text position). The proposed ranking, though developed and tested for German, prima facie not only seems to account for other free word order languages as well but also extends to fixed word order languages like English, where grammatical roles and information structure, unless marked, coincide.

Table 1: Functional ranking on Cf based on topic/comment patterns

context-bound element(s) >TCbase unbound element(s)
anaphora >TCboundtype elliptical antecedent >TCboundtype elliptical expression
nominal head1 >prec nominal head2 >prec ... >prec nominal headn

Given these basic relations, we may define the composite relation >TC (cf. Table 2). It summarises the criteria for ordering the items on the forward-looking centers Cf (x and y denote lexical heads).

Table 2: Global topic/comment relation

>TC := { (x, y) |
  if x and y both represent the same type of TC patterns
    then the relation >prec applies to x and y
  else if x and y both represent different forms of bound elements
    then the relation >TCboundtype applies to x and y
  else the relation >TCbase applies to x and y }
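The three-layer ranking can be rendered as a lexicographic sort key. This is a rough sketch under our own encoding assumptions: each discourse element is a small record whose type labels and field names are ours, not the authors':

```python
# Hypothetical encoding of the topic/comment pattern types; the ordering of
# bound types mirrors >TCboundtype in Table 1.
BOUND_ORDER = {"anaphora": 0, "elliptical_antecedent": 1, "elliptical_expression": 2}

def tc_sort_key(element):
    """Key implementing >TC: bound before unbound (>TCbase), bound elements
    ordered by type (>TCboundtype), ties broken by text position (>prec)."""
    kind, position = element["type"], element["position"]
    bound = kind in BOUND_ORDER
    return (0 if bound else 1,          # >TCbase
            BOUND_ORDER.get(kind, 0),   # >TCboundtype
            position)                   # >prec (linear precedence)

def rank_cf(elements):
    """Return the forward-looking centers ordered by decreasing preference."""
    return sorted(elements, key=tc_sort_key)
```

Because Python compares tuples lexicographically, the first differing layer decides, exactly as the nested if/else of Table 2 prescribes.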
4 Grammatical predicates for textual ellipsis
We here build on the ParseTalk model, a fully lexicalised grammar theory which employs default inheritance for lexical hierarchies (Hahn et al. 1994). The grammar formalism is based on dependency relations between lexical heads and modifiers at the sentence level. The dependency specifications3 allow a tight integration of linguistic knowledge (grammar) and conceptual knowledge (domain model), thus making powerful terminological reasoning facilities directly available for the parsing process. Accordingly, syntactic analysis and semantic interpretation are closely coupled. The resolution of textual ellipsis is based on two criteria, a structural and a conceptual one. The structural condition is embodied in the predicate isPotentialEllipticAntecedent (cf. Table 3). An elliptical relation between two lexical items is restricted to pairs of nouns. The elliptical phrase which occurs in the n-th utterance is restricted to a definite NP, while the antecedent must be one of the forward-looking centers of the preceding utterance.

Table 3: Grammar predicate for a potential elliptical antecedent
isPotentialEllipticAntecedent(x, y, n) :⇔
  x isac* Noun ∧ y isac* Noun
  ∧ ∃z: (y head z ∧ z isac* DetDefinite)
  ∧ y ∈ Un ∧ x ∈ Cf(Un-1)
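The structural predicate of Table 3 can be sketched as follows; the toy word-class hierarchy and the dict-based encoding of words and utterances are assumptions of this sketch, not the ParseTalk implementation:

```python
# Illustrative word-class hierarchy; isac_star walks the assumed subclass chain.
ISA = {"Noun": "Nominal", "DetDefinite": "Nominal", "Nominal": "Word"}

def isac_star(cls, ancestor):
    """Reflexive-transitive closure isac* of the subclass relation."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = ISA.get(cls)
    return False

def is_potential_elliptic_antecedent(x, y, utterance_n, cf_prev):
    """Sketch of Table 3: x and y are dicts with a 'class' and (for y) a
    'dependents' list; utterance_n is the word list of Un, cf_prev the Cf
    of the previous utterance."""
    return (isac_star(x["class"], "Noun")
            and isac_star(y["class"], "Noun")
            # y must govern a definite determiner (y is a definite NP)
            and any(isac_star(z["class"], "DetDefinite") for z in y["dependents"])
            and y in utterance_n
            and x in cf_prev)
```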
The function ProximityScore (cf. Table 4) captures the basic conceptual condition in terms of the role-related distance between two concepts. More specifically, there must be a connected path linking the two concepts under consideration via a chain of conceptual roles. Finally, the predicate PreferredConceptualBridge (cf. Table 5) combines both criteria. A lexical item x is determined as the proper antecedent of the elliptical expression y if it is a potential antecedent and if there exists no alternative antecedent z whose ProximityScore is below that of x or, if their ProximityScores are equal, whose strength of preference under the TC relation is higher than that of x.3
3 We assume the following conventions to hold: C = {Word, Nominal, Noun, DetDefinite, ...} denotes the set of word classes, and isac = {(Nominal, Word), (Noun, Nominal), (DetDefinite, Nominal), ...} ⊆ C × C denotes the subclass relation which yields a hierarchical ordering among these classes. The concept hierarchy consists of a set of concept names F = {COMPUTER-SYSTEM, NOTEBOOK, ACCESS-TIME, TIME-MS-PAIR, ...} (cf. Figure 1) and a subclass relation isaF = {(NOTEBOOK, COMPUTER-SYSTEM), (ACCESS-TIME, TIME-MS-PAIR), ...} ⊆ F × F. The set of role names R = {has-part, has-hd-drive, has-property, access-time, ...} contains the labels of admitted conceptual roles. These role names are also ordered in terms of a conceptual hierarchy, viz. isaR = {(has-hd-drive, has-part), (access-time, has-property), ...} ⊆ R × R. The relation permit ⊆ F × R × F characterises the range of possible conceptual roles among concepts, e.g., (HARD-DISK-DRIVE, access-time, ACCESS-TIME) ∈ permit. Furthermore, object.c refers to the concept denoted by object, while head denotes a structural
Table 4: Conceptual distance function

ProximityScore(from-concept, to-concept)
PreferredConceptualBridge(x, y, n) :⇔
  isPotentialEllipticAntecedent(x, y, n)
  ∧ ¬∃z: isPotentialEllipticAntecedent(z, y, n)
    ∧ ( ProximityScore(z.c, y.c) < ProximityScore(x.c, y.c)
      ∨ ( ProximityScore(z.c, y.c) = ProximityScore(x.c, y.c) ∧ z >TC x ) )

Table 5: Preferred conceptual bridge for textual ellipsis
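The selection expressed in Table 5 can be sketched as taking the minimum over (ProximityScore, TC-rank) pairs. The score and rank tables below are hypothetical inputs; a lower tc_rank is assumed to mean more preferred on the Cf:

```python
def preferred_bridge(candidates, scores, tc_rank):
    """Pick the antecedent with the shortest role chain; break ties by the
    functional ranking. `candidates` is a list of concept names, `scores`
    maps candidate -> ProximityScore (None = no role chain exists), and
    `tc_rank` maps candidate -> its position in the Cf ordering."""
    viable = [c for c in candidates if scores.get(c) is not None]
    if not viable:
        return None
    # lexicographic minimum: conceptual proximity first, TC preference second
    return min(viable, key=lambda c: (scores[c], tc_rank[c]))
```

Taking the minimum of the pair is equivalent to the negative existential of Table 5: no viable z may beat the chosen x on proximity, nor tie on proximity while outranking it functionally.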
5 Text cohesion parsing: Ellipsis resolution
The actor computation model (Agha & Hewitt 1987) provides the background for the procedural interpretation of lexicalised grammar specifications in terms of so-called word actors (Hahn et al. 1994). Word actors communicate via asynchronous message passing; an actor can only send messages to other actors it knows about, its so-called acquaintances. The arrival of a message at an actor triggers the execution of a method that is composed of grammatical predicates, such as those given in the previous section. The resolution of textual ellipsis depends on the results of the resolution of nominal anaphora and on the termination of the semantic interpretation of the current sentence. A SearchTextEllipsisAntecedent message will only be triggered at the occurrence of a definite noun phrase NP when NP is not a nominal anaphor and NP is not already connected via a Pof-type relation (e.g., property-of, physical-part-of).4
relation within dependency trees, viz. x being the head of y.

4 Associated with the set R is the set of inverse roles R-1. This distinction becomes important for already established relations like has-property (subsuming access-time, etc.) or has-physical-part (subsuming has-hd-drive, etc.) insofar as they do not block the initialisation of the ellipsis resolution procedure, whereas the existence of their inverses, which we here refer to as Pof-type relations, viz. property-of (subsuming access-time-of, etc.) and physical-part-of (subsuming hd-drive-of, etc.), does. This is simply due to the fact that the semantic interpretation of a phrase like "the access time of the new hard disk drive", as opposed to that of its elliptical counterpart "the access time" in sentence (3), where the genitive object is elliptified (zeroed), already leads to the creation of the Pof-type relation the ellipsis resolution mechanism is supposed to determine. This blocking condition has been proposed and experimentally validated by Katja Markert.
(2) Der Rechner hat durch dieses neue Laufwerk ausreichend Platz für Windows-Programme. (3) Darüber hinaus ist die Zugriffszeit mit 25 ms sehr kurz.
'The computer provides - because of this new HD-drive - sufficient storage for Windows programs. Also - is - the access time of 25 ms - quite short.'
Fig. 2: Sample parse for text ellipsis resolution

The message passing protocol for establishing cohesive links based on the recognition of textual ellipsis consists of two phases:

1. In phase 1, the message is forwarded from its initiator to the sentence delimiter of the preceding sentence, where its state is set to phase 2.

2. In phase 2, the sentence delimiter's acquaintance Cf is tested for the predicate PreferredConceptualBridge. Note that only nouns and pronouns are capable of responding to the SearchTextEllipsisAntecedent message and of being tested as to whether they fulfil the required criteria for an elliptical relation. If the text ellipsis predicate PreferredConceptualBridge succeeds, the determined antecedent sends a TextEllipsisAntecedentFound message to the initiator of the SearchTextEllipsisAntecedent message.

Upon receipt of the TextEllipsisAntecedentFound message, the discourse referent of the elliptical expression is conceptually related to the antecedent's referent via the most specific (common) Pof-type relation, thus preserving local coherence at the conceptual level of text propositions. In Figure 2 we illustrate the protocol for establishing elliptical relations by referring to the already introduced text fragment (2)-(3) which is repeated at the bottom line of Figure 2. Sentence (3) contains the definite NP die Zugriffszeit (the access time). Since, at the conceptual level, ACCESS-TIME does not subsume any lexical item in the preceding text (cf. Figure 1), the anaphora test fails. The conceptual correlate of die Zugriffszeit has also not been integrated in terms of a Pof-type relation into the conceptual representation of the sentence as a result of the semantic interpretation. Consequently, a SearchTextEllipsisAntecedent message is created by the word actor for Zugriffszeit.
That message is sent directly to the sentence delimiter of the previous sentence (phase 1), where the predicate PreferredConceptualBridge is evaluated for the acquaintance Cf (phase 2).
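The two-phase protocol can be sketched with plain synchronous method calls standing in for asynchronous actor messages; the class and function names mirror the message names in the text, but the encoding is ours:

```python
# Toy rendition of the two-phase protocol: the word actor for the elliptical
# NP sends a search message to the previous sentence delimiter, which
# evaluates the bridge predicate over its Cf acquaintance and replies.

class SentenceDelimiter:
    def __init__(self, cf, bridge_test):
        self.cf = cf                    # acquaintance: forward-looking centers
        self.bridge_test = bridge_test  # stands in for PreferredConceptualBridge

    def receive_search(self, initiator):
        # phase 2: test the Cf elements in preference order
        for candidate in self.cf:
            if self.bridge_test(candidate, initiator):
                return ("TextEllipsisAntecedentFound", candidate)
        return None

def search_text_ellipsis_antecedent(initiator, prev_delimiter):
    # phase 1: forward the message to the previous sentence delimiter
    return prev_delimiter.receive_search(initiator)
```

In the real system the reply would itself be an asynchronous message back to the initiator rather than a return value.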
The concepts are examined in the order given by the Cf, first LTE-LITE-25 (unit length 2), then SEAGATE-ST-3141 (unit length 1). Since no paths shorter than those with unit length 1 can exist, the test terminates. Even if another item in the centering list following SEAGATE-ST-3141 had this shortest possible length, it would not be considered, due to the functional preference given to SEAGATE-ST-3141 in the Cf. Since SEAGATE-ST-3141 has been tested successfully, a TextEllipsisAntecedentFound message is sent to the initiator of the SearchTextEllipsisAntecedent message. An appropriate update links the corresponding instances via the role access-time-of and, thus, local coherence is established at the conceptual level of the text knowledge base.
6 Comparison with related approaches
As far as proposals for the analysis of textual ellipsis are concerned, none of the standard grammar theories (e.g., HPSG, LFG, GB, CG, TAG) covers this issue. This is not surprising at all, as their advocates pay almost no attention to the text level of linguistic description (with the exception of several forms of anaphora) and also do not take conceptual criteria seriously into account as part of grammatical descriptions. More specifically, they lack any systematic connection to well-developed reasoning systems accounting for conceptual knowledge of the underlying domain. This latter argument also holds for the framework of DRT, although Wada (1994) deals with restricted forms of textual ellipsis in the DRT context. Only a few systems exist which resolve textual ellipses. As an example, consider the PUNDIT system (Palmer et al. 1986), which provides an informal solution for a particular domain. We consider our proposal superior, since it provides a more general, domain-independent treatment at the level of a formalised text grammar. The approach reported in this paper also extends our own previous work on textual ellipsis (Hahn 1989) by the incorporation of a more general proximity metric and an elaborated model of functional preferences on Cf elements which constrains the set of possible antecedents according to topic/comment patterns.
7 Conclusion
In this paper, we have outlined a model of textual ellipsis parsing. It considers conceptual criteria to be of primary importance and provides a proximity measure in order to assess various possible antecedents as candidates for proper bridges (Clark 1975) to elliptical expressions. In addition,
functional constraints based on topic/comment patterns contribute further restrictions on elliptical antecedents. The anaphora resolution module (Strube & Hahn 1995) and the textual ellipsis handler have both been implemented in Smalltalk as part of a comprehensive text parser for German. Besides the information technology domain, experiments with this parser have also been successfully run on medical domain texts, thus indicating that the grammar predicates we developed are not bound to a particular domain (knowledge base). The current lexicon contains a hierarchy of approximately 100 word class specifications with nearly 3,000 lexical entries and corresponding concept descriptions from the LOOM knowledge representation system (MacGregor & Bates 1987) — 900 and 500 concept/role specifications for the information technology and medicine domain, respectively.

Acknowledgements. We would like to thank our colleagues in the CLIF Lab who read earlier versions of this paper. In particular, improvements were due to discussions we had with N. Bröker, K. Markert, S. Schacht, K. Schnattinger, and S. Staab. This work has been funded by LGFG Baden-Württemberg (1.1.4-7631.0; M. Strube) and a grant from DFG (Ha 2907/1-3; U. Hahn).

REFERENCES

Agha, Gul & Carl Hewitt. 1987. "Actors: A Conceptual Foundation for Concurrent Object-oriented Programming". Research Directions in Object-Oriented Programming ed. by B. Shriver et al., 49-74. Cambridge, Mass.: MIT Press.

Charniak, Eugene. 1986. "A Neat Theory of Marker Passing". Proceedings of the 5th National Conference on Artificial Intelligence (AAAI '86), vol.1, 584-588.

Clark, Herbert H. 1975. "Bridging". Proceedings of the Conference on Theoretical Issues in Natural Language Processing (TINLAP-1), Cambridge, Mass., ed. by Roger Schank & B. Nash-Webber, 169-174.

Dahl, Östen, ed. 1974. Topic and Comment, Contextual Boundness and Focus. Hamburg: Buske.

Danes, František, ed. 1974. Papers on Functional Sentence Perspective. Prague: Academia.

Grosz, Barbara J., Aravind K. Joshi & Scott Weinstein. 1995. "Centering: A Framework for Modeling the Local Coherence of Discourse". Computational Linguistics 21:2.203-225.

Guha, R. V. & Douglas B. Lenat. 1990. "CYC: A Midterm Report". AI Magazine 11:3.32-59.
Hahn, Udo. 1989. "Making Understanders out of Parsers: Semantically Driven Parsing as a Key Concept for Realistic Text Understanding Applications". International Journal of Intelligent Systems 4:3.345-393.

Hahn, Udo, Susanne Schacht & Norbert Bröker. 1994. "Concurrent, Object-oriented Natural Language Parsing: The ParseTalk Model". International Journal of Human-Computer Studies 41:1/2.179-222.

Hayes, Patrick J. 1985. "The Second Naive Physics Manifesto". Formal Theories of the Commonsense World ed. by J. Hobbs & R. Moore, 1-36. Norwood, N.J.: Ablex.

MacGregor, Robert. 1991. "The Evolving Technology of Classification-based Knowledge Representation Systems". Principles of Semantic Networks ed. by J. Sowa, 385-400. San Mateo, Calif.: Morgan Kaufmann.

MacGregor, Robert & Raymond Bates. 1987. The LOOM Knowledge Representation Language. Information Sciences Institute, University of Southern California (ISI/RS-87-188).

Mars, Nicolaas J. I. 1994. "The Role of Ontologies in Structuring Large Knowledge Bases". Knowledge Building and Knowledge Sharing ed. by K. Fuchi & T. Yokoi, 240-248. Tokyo: Ohmsha and Amsterdam: IOS Press.

Palmer, Martha S. et al. 1986. "Recovering Implicit Information". Proceedings of the 24th Annual Meeting of the Association for Computational Linguistics (ACL-86), 10-19. New York, N.Y.

Rada, Roy, Hafedh Mili, Ellen Bicknell & Maria Blettner. 1989. "Development and Application of a Metric on Semantic Nets". IEEE Transactions on Systems, Man, and Cybernetics 19:1.17-30.

Resnik, Philip. 1995. "Using Information Content to Evaluate Semantic Similarity in a Taxonomy". Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), vol.1, 448-453. Montreal, Canada.

Rips, L. J., E. J. Shoben & E. E. Smith. 1973. "Semantic Distance and the Verification of Semantic Relations". Journal of Verbal Learning and Verbal Behavior 12:1.1-20.

Simmons, Geoff. 1992. "Empirical Methods for 'Ontological Engineering'. Case Study: Objects". Ontologie und Axiomatik der Wissensbasis von LILOG ed. by G. Klose, E. Lang & Th. Pirlein, 125-154. Berlin: Springer.

Strube, Michael & Udo Hahn. 1995. "ParseTalk about Sentence- and Text-level Anaphora". Proceedings of the 7th Conference of the European Chapter of the Association for Computational Linguistics (EACL'95), 237-244.

Wada, Hajime. 1994. "A Treatment of Functional Definite Descriptions". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), vol.II, 789-795. Kyoto, Japan.
Improving a Robust Morphological Analyser Using Lexical Transducers

IÑAKI ALEGRÍA, XABIER ARTOLA & KEPA SARASOLA
University of the Basque Country

Abstract

This paper describes the components of a robust and wide-coverage morphological analyser for Basque and their transformation into lexical transducers. The analyser is based on the two-level formalism and has been designed in an incremental way with three main modules: the standard analyser, the analyser of linguistic variants, and the analyser without lexicon which can recognise word-forms without having their lemmas in the lexicon. This analyser is a basic tool for current and future work on automatic processing of Basque, and its first applications are a commercial spelling corrector and a general purpose lemmatiser/tagger. The lexical transducers are generated as a result of compiling the lexicon and a cascade of two-level rules (Karttunen et al. 1994). Their main advantages are speed and expressive power. Using lexical transducers for our analyser we have improved both the speed and the description of the different components of the morphological system. Some slight limitations have been found too.

1 Introduction
The two-level model of morphology (Koskenniemi 1983) has become the most popular formalism for highly inflected and agglutinative languages. The two-level system is based on two main components: (i) a lexicon where the morphemes (lemmas and affixes) and the possible links among them (morphotactics) are defined; (ii) a set of rules which controls the mapping between the lexical level and the surface level due to the morphophonological transformations. The rules are compiled into transducers, so it is possible to apply the system for both analysis and generation. There is a freely available software package, PC-Kimmo (Antworth 1990), which is a useful tool to experiment with this formalism. Different flavours of two-level morphology have been developed, most of them replacing the continuation-class based morphotactics with unification-based mechanisms (Ritchie et al. 1992; Sproat 1992).
We did our own implementation of the two-level model with slight variations, and applied it to Basque (Agirre et al. 1992), a highly inflected and agglutinative language. In order to deal with a wide variety of linguistic data we built a Lexical Database (LDBB). This database is both source and support for the lexicons needed in several applications, and was designed with the objectives of being neutral in relation to linguistic formalisms, flexible, open and easy to use (Agirre et al. 1995). At present it contains over 60,000 entries, each with its associated linguistic features (category, sub-category, case, number, etc.). In order to increase the coverage and the robustness, the analyser has been designed in an incremental way. It is composed of three main modules (see Figure 1): the standard analyser, the analyser of linguistic variants produced due to dialectal uses and competence errors, and the analyser without lexicon which can recognise word-forms without having their lemmas in the lexicon. An important feature of the analyser is its homogeneity, as the three different steps are based on two-level morphology, far from ad-hoc solutions.
Fig. 1: Modules of the analyser

This analyser is a basic tool for current and future work on automatic processing of Basque, and its first two applications are a commercial spelling corrector (Aduriz et al. 1994) and a general purpose lemmatiser/tagger (Aduriz et al. 1995). Following an overview of lexical transducers, the application of the two-level model and of lexical transducers to the different steps of the morphological analysis of Basque is described.
IMPROVING MORPHOLOGY USING TRANSDUCERS 2
99
Lexical transducers
A lexical transducer (Karttunen et al. 1992; Karttunen 1994) is a finite-state automaton that maps inflected surface forms into lexical forms, and can be seen as an evolution of two-level morphology where:

• Morphological categories are represented as part of the lexical form. Thus it is possible to avoid the use of diacritics.

• Inflected forms of the same word are mapped to the same canonical dictionary form. This increases the distance between the lexical and surface forms. For instance, better is expressed through its canonical form good (good+COMP:better).

• Intersection and composition of transducers is possible (see Kaplan & Kay 1994). In this way the integration of the lexicon (itself another transducer) into the automaton can be resolved, and the changes between the lexical and surface levels can be expressed as a cascade of two-level rule systems (Figure 2).

Fig. 2: Lexical transducers (from Karttunen et al. 1992)

In addition, the morphological process using lexical transducers is very fast (thousands of words per second) and the transducer for a whole morphological description can be compacted in less than 1 MB.
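The cascade-and-compose idea can be illustrated with finite relations standing in for transducers. The good+COMP:better mapping follows the example above, while the intermediate form good+er is our own invented stand-in for whatever the rule system actually produces:

```python
# Minimal sketch: each level-to-level mapping is modelled as a finite relation
# (a dict of string pairs); composing two mappings yields the direct
# lexical-to-surface transducer. All forms below are illustrative.

lexicon = {"good+COMP": "good+er"}   # lexical form -> intermediate form
rules = {"good+er": "better"}        # intermediate form -> surface form

def compose(upper, lower):
    """Relation composition, the operation that merges a cascade of rule
    systems (and the lexicon) into a single transducer."""
    return {lex: lower[mid] for lex, mid in upper.items() if mid in lower}

lexical_transducer = compose(lexicon, rules)

def analyse(surface):
    """Run the composed transducer 'backwards': surface form -> lexical forms."""
    return [lex for lex, surf in lexical_transducer.items() if surf == surface]
```

Real lexical transducers compose regular relations (infinite in general), but the key property shown here carries over: after composition, the intermediate levels disappear and analysis is a single lookup.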
Different tools to build lexical transducers (Karttunen & Beesley 1992; Karttunen 1993) have been developed at Xerox and we are using them. Uses of lexical transducers are documented by Chanod (1994) and Kwon & Karttunen (1994).
3 The standard analyser
Basque is an agglutinative language; that is, for the formation of words the dictionary entry independently takes each of the elements necessary for the different functions (syntactic case included). More specifically, the affixes corresponding to the determinant, number and declension case are taken in this order and independently of each other (deep morphological structure). One of the principal characteristics of Basque is its declension system with numerous cases, which differentiates it from the languages spoken in the surrounding countries. We have applied the two-level model defining the following elements (Agirre et al. 1992; Alegría 1995):

• Lexicon: over 60,000 entries have been defined corresponding to lemmas and affixes, grouped into 154 sublexicons. The representation of the entries is not canonical because 18 diacritics are used to control the application of morphophonological rules.

• Continuation classes: they are groups of sublexicons to control the morphotactics. Each entry of the lexicon has its continuation class and all together define the morphotactics graph. The long distance dependencies among morphemes cannot be properly expressed by continuation classes, therefore in our implementation we extended their semantics, defining the so-called extended continuation classes.

• Morphophonological rules: 24 two-level rules have been defined to express the morphological, phonological and orthographic changes between the lexical and the surface levels that appear when the morphemes are combined.

The morphological analyser attaches to each input word-form all possible interpretations and its associated information, which is given as pairs of morphosyntactic features. The conversion of our description to a lexical transducer was done in the following steps:

1. Canonical forms and morphological categories were integrated in the lexicon from the lexical database.

2. Due to long distance dependencies among morphemes, which could not be resolved in the lexicon, two additional rules were written to ban some combinations of morphemes. These rules can be put in a different rule system near to the lexicon without mixing morphotactics and morphophonology (see Figure 3).

3. The standard rules could have been left without changes (mapping canonical forms in the lexicon and arbitrary forms) but were changed in order to replace diacritics with morphological features, yielding a clearer description of the morphology of the language.
Fig. 3: Lexical transducer for the standard analysis of Basque The resultant lexical transducer is about 500 times faster than the original system.
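The continuation-class mechanism described above can be sketched as a recursive lookup through sublexicons. The tiny lexicon below (two noun stems, a determiner, two case endings) is an invented illustration, not a fragment of the authors' 154-sublexicon description:

```python
# Hypothetical continuation-class lexicon: each sublexicon maps morphemes to
# the sublexicons that may follow them; "End" marks a licensed word end.
SUBLEXICONS = {
    "NounStems": {"etxe": ["Determiners", "Cases"], "kale": ["Determiners", "Cases"]},
    "Determiners": {"a": ["Cases", "End"]},
    "Cases": {"tik": ["End"], "ra": ["End"]},
}

def segmentations(word, sublexicon="NounStems"):
    """Enumerate the morpheme splits of `word` licensed by the continuation
    classes, starting from the given sublexicon."""
    if sublexicon == "End":
        return [[]] if word == "" else []
    results = []
    for morpheme, continuations in SUBLEXICONS[sublexicon].items():
        if word.startswith(morpheme):
            rest = word[len(morpheme):]
            for cont in continuations:
                for tail in segmentations(rest, cont):
                    results.append([morpheme] + tail)
    return results
```

Long-distance dependencies (one morpheme constraining a non-adjacent one) cannot be stated in this purely local scheme, which is why the authors extended the semantics of continuation classes and later moved such constraints into a separate rule system.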
4 The analysis and correction of linguistic variants
Because of the recent standardisation and the widespread dialectal use of Basque, the standard morphology is not enough to offer good results when analysing corpora. To increase the coverage of the morphological processor an additional two-level subsystem was added (Aduriz et al. 1993). This subsystem is also used in the spelling corrector to manage competence errors and has two main components:

1. New morphemes linked to the corresponding correct ones. They are added to the lexical system and they describe particular variations, mainly dialectal forms. Thus, the new entry tikan, dialectal form of the ablative singular morpheme, linked to its corresponding right entry tik, makes it possible to analyse and correct word-forms such as etxetikan, kaletikan, ... (variants of etxetik from the house, kaletik from the street, ...). By changing the continuation class of morphemes, morphotactic errors can be analysed.

2. New two-level rules describing the most likely regular changes that are produced in the variants. These rules have the same structure and treatment as the standard ones. Twenty-five new rules have been defined to cover the most common competence errors. For instance, the rule h:0 => V:V _ V:V describes that between vowels the h of the lexical level may disappear at the surface level. In this way the word-form bear, misspelling of behar, to need, can be analysed.

All these rules are optional and have to be compiled with the standard rules, but some inconsistencies have to be solved because some new changes were forbidden in the original rules. To correct a word-form, the result of the analysis has to be fed into morphological generation using the correct morphemes linked to the variants and the original rules. To correct beartzetikan, variant of behartzetik, two steps, analysis and generation, are followed, as shown in Figure 4.
When we decided to use lexical transducers for the treatment of linguistic variants, the following procedure was applied:

1. The additional morphemes linked to the standard ones are handled using the possibility of expressing two levels in the lexicon. In one level the non-standard morpheme is specified and in the other (the one corresponding to the result of the analysis) the standard morpheme.

2. The additional rules do not need to be integrated with the standard ones (Figure 5), and so it is not necessary to solve the inconsistencies.
Fig. 4: Steps for correction

As Figure 5 (B) shows, it is possible and clearer to put these rules in another plane near to the surface, because most of the additional rules are due to phonetic changes and do not require morphological information. Only the surface characters, the morpheme boundary and additional information about one change (the final a of lemmas) complete the intermediate level between the two rule systems.

3. In our original implementation it was possible to distinguish between standard and non-standard analyses (the additional rules are marked and this information can be obtained as a result of the analysis), and so the non-standard information can be additional; but with lexical transducers it is necessary to store two transducers, one for standard analysis and another for standard and non-standard analysis.

Although in the original system the speed of analysis using additional information was two or three times slower than the standard analysis, using lexical transducers the difference between both analyses is very slight.
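The effect of one optional variant rule, h:0 => V:V _ V:V, can be imitated with a regular expression. Treating the rule as all-or-nothing per word (rather than independently per occurrence) is a simplification of this sketch, and the standard lexicon used below is an invented toy:

```python
import re

VOWEL = "[aeiou]"
# optional rule h:0 => V:V _ V:V - a lexical h between vowels
# may be absent at the surface level
_intervocalic_h = re.compile(f"(?<={VOWEL})h(?={VOWEL})")

def surface_variants(standard):
    """Surface forms licensed by optionally applying h-deletion
    (applied either nowhere or everywhere, a simplification)."""
    return {standard, _intervocalic_h.sub("", standard)}

def analyse_variant(surface, standard_lexicon):
    """Return the standard forms of which `surface` is a licensed variant,
    i.e., the analysis step that precedes regeneration of the correct form."""
    return sorted(std for std in standard_lexicon
                  if surface in surface_variants(std))
```

Generating the variant set from the standard form and then inverting it mirrors the analysis/generation pair of Figure 4, where the misspelled bear is first analysed and the correct behar regenerated.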
The analysis of unknown words
Based on the idea used in speech synthesis (Black et al. 1991), a two-level mechanism for analysis without a lexicon was added to increase the robustness of the analyser.
INAKI ALEGRIA, XABIER ARTOLA & KEPA SARASOLA
Fig. 5: Lexical transducer for the analysis of linguistic variants (panels A and B)
This mechanism has the following two main components in order to be capable of treating unknown words:
1. generic lemmas represented by "??" (one for each possible open category or subcategory), which are organised with the affixes in a small two-level lexicon;
2. two additional rules expressing the relationship between the generic lemmas at the lexical level and any acceptable lemma of Basque, which are combined with the standard ones.
Some standard rules have to be modified because both surface and lexical levels are specified in them, and in this kind of analysis the lexical level of the lemmas changes. The two-level mechanism is also used to analyse the unknown forms, and obtaining at least one analysis is guaranteed. In order to eliminate the great number of ambiguities in the analysis, a local disambiguation process is carried out.
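The guessing mechanism can be sketched in Python; the suffix table, the lemma-shape pattern, and all names below are illustrative assumptions, not the actual two-level lexicon:

```python
import re

# Illustrative sketch of the guessing mechanism: generic lemmas ("??")
# for open categories combined with a small affix lexicon, plus a rule
# relating "??" to any acceptable lemma shape.

SUFFIXES = {"tik": "ablative", "ko": "locative genitive", "": "absolutive"}
LEMMA_SHAPE = re.compile(r"^[a-zñ]{2,}$")   # assumed stand-in for "any lemma"

def guess_analyses(word):
    """Return (hypothetical lemma, category, suffix gloss) triples;
    the whole word as a bare lemma guarantees at least one analysis."""
    analyses = []
    for suffix, gloss in SUFFIXES.items():
        stem = word[: len(word) - len(suffix)] if suffix else word
        if word.endswith(suffix) and LEMMA_SHAPE.match(stem):
            analyses.append((stem, "??/open-category", gloss))
    return analyses

print(guess_analyses("gasteiztik"))   # unknown place name + ablative -tik
```

As in the text, every word receives at least one analysis (the whole word as a bare generic lemma), and the resulting ambiguity would then be reduced by local disambiguation.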
By using lexical transducers the two additional rules can be placed independently (see Figure 6), and so the original rules can remain unchanged. In this case the additional subsystem is arranged close to the lexicon, because it maps the transformation between generic and hypothetical lemmas at the lexical level. The resulting lexical transducer is very compact and fast.
Fig. 6: Lexical transducer for the analysis of unknown words

Our system also has a user lexicon and an interface for the update process. Some information about the new entries (mainly part of speech) is necessary to add them to the user lexicon. The user lexicon is combined with the general one, increasing the coverage of the morphological analyser. This mechanism is very useful in the process of spelling correction, but on-line updating of the user lexicon is then necessary. This treatment is carried out in our original implementation; when we use lexical transducers, however, the updating operation is slow (it is necessary to compile everything together) and therefore on-line updating becomes problematic. Carter (1995) proposes compiling affixes and rules, but not lemmas, in order to have flexibility when dealing with open lexicons; this, however, presents problems managing compounds at run-time.
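The trade-off between the compiled general lexicon and an on-line user lexicon can be sketched with a minimal Python toy, with plain dicts standing in for the compiled transducers; all names here are illustrative assumptions:

```python
# Minimal sketch of the trade-off: the general lexicon is "compiled"
# (costly to rebuild), while the user lexicon is a plain dict that can
# be updated on-line without any recompilation.

class Analyser:
    def __init__(self, general):
        self.general = dict(general)   # "compiled": costly to rebuild
        self.user = {}                 # user lexicon: cheap on-line updates

    def add_user_entry(self, lemma, pos):
        self.user[lemma] = pos         # available immediately, no recompile

    def lookup(self, lemma):
        # user lexicon combined with the general one at lookup time
        return self.general.get(lemma) or self.user.get(lemma)

a = Analyser({"etxe": "noun"})
print(a.lookup("ordenagailu"))          # unknown word -> None
a.add_user_entry("ordenagailu", "noun")
print(a.lookup("ordenagailu"))          # after on-line update -> noun
```

With a single compiled lexical transducer, the second step (immediate availability of a new entry) is exactly what is lost, which is the problem discussed above.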
6 Conclusions
A morphological processor based on the two-level formalism has been designed in an incremental way in three main modules: the standard analyser; the analyser of linguistic variants produced by dialectal uses and competence errors; and the analyser without lexicon, which can recognise word-forms without having their lemmas in the lexicon. This analyser is a basic tool for current and future work on the automatic processing of Basque.

Concept                 A           B           A+B
Number of words         4.846       2.343       7.207
Different words         2.607       1.429       4.036
Unknown words           307         85          392
Linguistic variants     101         28          129
Analysed                85 (84%)    22 (79%)    107 (83%)
Full wrong analysis     21          4           25
Precision               99,2%       99,7%       99,4%

Table 1: Figures about the different kinds of analysis

Figures about the precision of the analyser are given in Table 1. Two different corpora were used: (A) a text from a magazine in which foreign names appear, and (B) a text about philosophy. The percentages of unknown words and precision are calculated over different words, so the results over the whole corpus would be better. Using lexical transducers for our analyser we have improved both the speed and the description of the different components of the tool. Some slight limitations have been found too.

Acknowledgements. This work had partial support from the local Government of Gipuzkoa and from the Government of the Basque Country. We would like to thank Xerox for letting us use their tools, and also Ken Beesley and Lauri Karttunen for their help in using these tools and designing the lexical transducers. We also want to thank Eneko Agirre for his help with the English version of this manuscript.
REFERENCES

Aduriz, Itziar, E. Agirre, I. Alegria, X. Arregi, J.M. Arriola, X. Artola, A. Diaz de Illarraza, N. Ezeiza, M. Maritxalar, K. Sarasola & M. Urkia. 1993. "A Morphological Analysis Based Method for Spelling Correction". Proceedings of the 6th Conference of the European Chapter of the Association for Computational Linguistics (EACL'93), 463-463. Utrecht, The Netherlands.
Aduriz, Itziar, E. Agirre, I. Alegria, X. Arregi, J.M. Arriola, X. Artola, A. Da Costa, A. Diaz de Illarraza, N. Ezeiza, M. Maritxalar, K. Sarasola & M. Urkia. 1994. "Xuxen-Mac: un corrector ortografico para textos en euskara". Proceedings of the 1st Conference Universidad y Macintosh (UNIMAC), vol. II, 305-310. Madrid, Spain.
Aduriz, Itziar, I. Alegria, J.M. Arriola, X. Artola, A. Diaz de Illarraza, N. Ezeiza, K. Gojenola & M. Maritxalar. 1995. "Different issues in the design of a lemmatiser/tagger for Basque". From Text to Tag Workshop, SIGDAT (EACL'95), 18-23. Dublin, Ireland.
Agirre, Eneko, I. Alegria, X. Arregi, X. Artola, A. Diaz de Illarraza, M. Maritxalar, K. Sarasola & M. Urkia. 1992. "XUXEN: A spelling checker/corrector for Basque based on Two-Level morphology". Proceedings of the 3rd Conference on Applied Natural Language Processing (ANLP'92), 119-125. Trento, Italy.
Agirre, Eneko, X. Arregi, J.M. Arriola, X. Artola, A. Diaz de Illarraza, J.M. Insausti & K. Sarasola. 1995. "Different issues in the design of a general-purpose Lexical Database for Basque". Proceedings of the 1st Workshop on Applications of Natural Language to Data Bases (NLDB'95), 299-313. Versailles, France.
Alegria, Iñaki. 1995. Euskal morfologiaren tratamendu automatikorako tresnak. Ph.D. dissertation, University of the Basque Country. Donostia, Basque Country.
Antworth, Evan L. 1990. PC-KIMMO: A two-level processor for morphological analysis. Dallas, Texas: Summer Institute of Linguistics.
Black, Alan W., Joke van de Plassche & Briony Williams. 1991. "Analysis of Unknown Words through Morphological Decomposition". Proceedings of the 5th Conference of the European Chapter of the Association for Computational Linguistics (EACL'91), vol. I, 101-106.
Carter, David. 1995. "Rapid development of morphological descriptions for full language processing systems". Proceedings of the 7th Conference of the European Chapter of the Association for Computational Linguistics (EACL'95), 202-209. Dublin, Ireland.
Chanod, Jean-Pierre. 1994. "Finite-state Composition of French Verb Morphology". Technical Report MLTT-005. Meylan, France: Rank Xerox Research Centre, Grenoble Laboratory.
Kaplan, Ronald M. & Martin Kay. 1994. "Regular models of phonological rule systems". Computational Linguistics 20:3.331-380.
Karttunen, Lauri & Kenneth R. Beesley. 1992. "Two-Level Rule Compiler". Technical Report ISTL-NLTT-1992-2. Palo Alto, Calif.: Xerox Palo Alto Research Center.
Karttunen, Lauri, Ronald M. Kaplan & Annie Zaenen. 1992. "Two-level morphology with composition". Proceedings of the 14th Conference on Computational Linguistics (COLING'92), vol. I, 141-148. Nantes, France.
Karttunen, Lauri. 1993. "Finite-State Lexicon Compiler". Technical Report ISTL-NLTT-1993-04-02. Palo Alto, Calif.: Xerox Palo Alto Research Center.
Karttunen, Lauri. 1994. "Constructing Lexical Transducers". Proceedings of the 15th Conference on Computational Linguistics (COLING'94), vol. I, 406-411. Kyoto, Japan.
Koskenniemi, Kimmo. 1983. Two-level Morphology: A General Computational Model for Word-Form Recognition and Production. Publications 11. Helsinki: University of Helsinki.
Kwon, Hyuk-Chul & Lauri Karttunen. 1994. "Incremental construction of a lexical transducer for Korean". Proceedings of the 15th Conference on Computational Linguistics (COLING'94), vol. II, 1262-1266. Kyoto, Japan.
Ritchie, Graeme D., Alan W. Black, Graham J. Russell & Stephen G. Pulman. 1992. Computational Morphology. Cambridge, Mass.: MIT Press.
Sproat, Richard. 1992. Morphology and Computation. Cambridge, Mass.: MIT Press.
II SEMANTICS AND DISAMBIGUATION
Context-Sensitive Word Distance by Adaptive Scaling of a Semantic Space

HIDEKI KOZIMA & AKIRA ITO
Communications Research Laboratory

Abstract

This paper proposes a computationally feasible method for measuring the context-sensitive semantic distance between words. The distance is computed by adaptive scaling of a semantic space. In the semantic space, each word in the vocabulary V is represented by a multi-dimensional vector which is extracted from an English dictionary through principal component analysis. Given a word set C which specifies a context, each dimension of the semantic space is scaled up or down according to the distribution of C in the semantic space. In the space thus transformed, the distance between words in V becomes dependent on the context C. An evaluation through a word prediction task shows that the proposed measurement successfully extracts the context of a text.

1 Introduction
Semantic distance (or similarity) between words is one of the basic measurements used in many fields of natural language processing, information retrieval, etc. Word distance provides bottom-up information for text understanding and generation, since it indicates semantic relationships between words that form a coherent text structure (Grosz & Sidner 1986); word distance also provides a basis for text retrieval (Schank 1990), since it works as associative links between texts. A number of methods for measuring semantic word distance have been proposed in the studies of psycholinguistics, computational linguistics, etc. One of the pioneering works in psycholinguistics is the 'semantic differential' (Osgood 1952), which analyses the meaning of words by means of psychological experiments on human subjects. Recent studies in computational linguistics have proposed computationally feasible methods for measuring semantic word distance. For example, Morris & Hirst (1991) used Roget's thesaurus as a knowledge base for determining whether or not two words are semantically related; Brown et al. (1992) classified a vocabulary into semantic classes according to the co-occurrence of words in large corpora;
Kozima & Furugori (1993) computed the similarity between words by means of spreading activation on a semantic network of an English dictionary. The measurements in these former studies are so-called context-free or static ones, since they measure word distance irrespective of contexts. However, word distance changes in different contexts. For example, from the word car, we can associate related words in the following two directions:

• car → bus, taxi, railway, ...
• car → engine, tire, seat, ...

The former is in the context of 'vehicle', and the latter is in the context of 'components of a car'. Even in free-association tasks, we often imagine a certain context for retrieving related words. In this paper, we will incorporate context-sensitivity into semantic distance between words. A context can be specified by a set C of keywords of the context (for example, {car, bus} for the context 'vehicle'). Now we can exemplify context-sensitive word association as follows:

• C = {car, bus} → taxi, railway, airplane, ...
• C = {car, engine} → tire, seat, headlight, ...

Generally, we observe a different distance for a different context. So, in this paper we will deal with the following problem: under the context specified by a given word set C, compute the semantic distance d(w, w' | C) between any two words w, w' in our vocabulary V. Our strategy for this context-sensitivity is 'adaptive scaling of a semantic space'. Section 2 introduces the semantic space, where each word in the vocabulary V is represented by a multi-dimensional semantic vector. Section 3 describes the adaptive scaling: for a given word set C that specifies a context, each dimension of the semantic space is scaled up or down according to the distribution of C in the semantic space. After this transformation, distance between Q-vectors becomes dependent on the given context. Section 4 shows some examples of the context-sensitive word distance thus computed.
Section 5 evaluates the proposed measurement through a word prediction task. Section 6 discusses some theoretical aspects of the proposed method, and Section 7 gives our conclusion and perspective.

2 Vector-representation of word meaning
Each word in the vocabulary V is represented by a multi-dimensional Q-vector. In order to obtain Q-vectors, we first generate 2851-dimensional
P-vectors by spreading activation on a semantic network of an English dictionary (Kozima & Furugori 1993). Next, through principal component analysis on P-vectors, we map each P-vector onto a Q-vector with a reduced number of dimensions (see Figure 1).

Fig. 1: Mapping words onto Q-vectors

2.1 From an English dictionary to P-vectors
Every word w in the vocabulary V is mapped onto a P-vector P(w) by spreading activation on the semantic network. The network is systematically constructed from a subset of the English dictionary, LDOCE (Longman Dictionary of Contemporary English). The network has 2851 nodes corresponding to the words in LDV (the Longman Defining Vocabulary, 2851 words). The network also has 295914 links between these nodes: each node has a set of links corresponding to the words in its definition in LDOCE. Since every headword in LDOCE is defined by using LDV only, the network becomes a closed cross-reference network of English words. Each node of the network can hold activity, and this activity flows through the links. Hence, activating a node in the network for a certain period of time causes the activity to spread over the network and forms a pattern of activity distribution on it. Figure 2 shows the pattern generated by activating the node red; the graph plots the activity values of 10 dominant nodes at each step in time. The P-vector P(w) of a word w is the pattern of activity distribution generated by activating the node corresponding to w. P(w) is a 2851-dimensional vector consisting of the activity values of the nodes at T = 10, as an approximation of the equilibrium. P(w) indicates how strongly each node of the network is semantically related to w. In this paper, we define the vocabulary V as LDV (2851 words) in order to make our argument and experiments simple. Although V is not a large vocabulary, it covers 83.07% of the 1006815 words in the Lancaster-Oslo/Bergen (LOB) corpus. In addition, V can be extended to the set of
Fig. 2: Spreading activation
Fig. 3: Clustering of P-vectors
all headwords in LDOCE (more than 56000 words), since a P-vector of a non-LDV word can be produced by activating the set of LDV-words in its dictionary definition. (Remember that every headword in LDOCE is defined using only LDV.) The P-vector P(w) represents the meaning of the word w in its relationship to other words in the vocabulary V. Geometric distance between two P-vectors P(w) and P(w') indicates the semantic distance between the words w and w'. Figure 3 shows a part of the result of hierarchical clustering on P-vectors, using Euclidean distance between centers of clusters. The dendrogram reflects intuitive semantic similarity between words: for instance, rat/mouse, tiger/lion/cat, etc. However, the similarity thus observed is context-free and static. The purpose of this paper is to make it context-sensitive and dynamic.
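The spreading-activation step can be sketched on a toy network in Python; the four-word network, decay constant and update rule below are assumptions for illustration (the real network has 2851 nodes and 295914 links):

```python
import numpy as np

# Toy sketch of spreading activation on a dictionary cross-reference
# network. Activating one node for T steps yields an activity pattern
# over all nodes, which serves as that word's P-vector.

words = ["red", "colour", "blood", "fire"]
# links[i][j] = 1 if word j occurs in the definition of word i (toy data)
links = np.array([[0, 1, 1, 1],
                  [1, 0, 0, 0],
                  [1, 1, 0, 0],
                  [1, 1, 0, 0]], dtype=float)
W = links / links.sum(axis=1, keepdims=True)   # normalise outgoing activity

def p_vector(word, T=10, decay=0.5):
    """Activity pattern after activating `word` for T steps (its P-vector)."""
    a = np.zeros(len(words))
    src = words.index(word)
    for _ in range(T):
        a = decay * (a @ W)   # activity flows along definition links
        a[src] += 1.0         # keep activating the source node
    return a

print(np.round(p_vector("red"), 3))
```

The activated node itself ends up with the largest activity, and words reachable through many definition links accumulate more activity than distant ones, mirroring the pattern in Figure 2.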
2.2 From P-vectors to Q-vectors
Through principal component analysis, we map every P-vector onto a Q-vector, for which we will define a context-sensitive distance later. The principal component analysis of P-vectors provides a series of 2851 principal components. The most significant m principal components work as new orthogonal axes that span an m-dimensional vector space. By these m principal components, every P-vector (with 2851 dimensions) can be mapped onto a Q-vector (with m dimensions). The value of m, which will be determined later, is much smaller than 2851. This brings about not only compression of the semantic information, but also elimination of the noise in P-vectors. First, we compute the principal components X1, X2, ..., X2851, each
of which is a 2851-dimensional vector, under the following conditions:

• For any Xi, its norm |Xi| is 1.
• For any Xi, Xj (i ≠ j), their inner product (Xi, Xj) is 0.
• The variance vi of the P-vectors projected onto Xi is not smaller than any vj (j > i).

In other words, X1 is the first principal component with the largest variance of P-vectors, X2 is the second principal component with the second-largest variance of P-vectors, and so on. Consequently, the set of principal components X1, X2, ..., X2851 provides a new orthonormal coordinate system for P-vectors. Next, we pick up the first m principal components X1, X2, ..., Xm. The principal components are in descending order of their significance, because the variance vi indicates the amount of information represented by Xi. We found that even the first 200 axes (7.02% of the 2851 axes) can represent 45.11% of the total information of P-vectors. The amount of information represented by Q-vectors increases with m: 66.21% for the first 500 axes, 82.80% for the first 1000 axes. However, for large m, each Q-vector would be isolated because of overfitting: a large number of parameters could not be estimated from a small number of data. We estimate the optimal number of dimensions of Q-vectors to be m = 281, which can represent 52.66% of the total information. This optimisation is done by minimising the proportion of noise remaining in Q-vectors. The amount of noise is estimated by Σw∈F |Q(w)|, where F (⊂ V) is a set of 210 function words: determiners, articles, prepositions, pronouns, and conjunctions. We estimated the proportion of noise for all m = 1, ..., 2851 and obtained the minimum for m = 281. Therefore, from now on we will use a 281-dimensional semantic space. Finally, we map each P-vector P(w) onto a 281-dimensional Q-vector Q(w). The i-th component of Q(w) is the projected value of P(w) on the principal component Xi; the origin of Xi is set to the average of the projected values on it.
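The P-vector to Q-vector mapping can be sketched with a toy principal component analysis (illustrative Python on random data; the actual setting uses 2851-dimensional P-vectors and keeps m = 281 components):

```python
import numpy as np

# Toy sketch of the P-vector -> Q-vector mapping via principal
# component analysis on random 10-dimensional data.

rng = np.random.default_rng(0)
P = rng.random((50, 10))            # 50 toy P-vectors
m = 3                               # number of principal components kept

P0 = P - P.mean(axis=0)             # set the origin to the average vector
cov = np.cov(P0, rowvar=False)      # covariance between dimensions
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]   # descending variance v1 >= v2 >= ...
X = eigvecs[:, order[:m]]           # first m orthonormal axes X1..Xm

Q = P0 @ X                          # Q-vectors: projections onto X1..Xm

# proportion of the total information (variance) represented by Q-vectors
print(Q.shape, eigvals[order[:m]].sum() / eigvals.sum())
```

The printed ratio corresponds to the "amount of information" figures quoted in the text (e.g. 52.66% for m = 281 in the real data).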
3 Adaptive scaling of the semantic space
Adaptive scaling of the semantic space of Q-vectors provides context-sensitive and dynamic distance between Q-vectors. Simple Euclidean distance between Q-vectors is not so different from that between P-vectors; both are context-free and static distances. The adaptive scaling process transforms the semantic space to adapt it to a given context C. In the semantic space thus
Fig. 4: Adaptive scaling
Fig. 5: Clusters in a subspace
transformed, simple Euclidean distance between Q-vectors becomes dependent on C (see Figure 4).

3.1 Semantic subspaces
A subspace of the semantic space of Q-vectors works as a simple device for semantic word clustering. In a semantic subspace with the dimensions appropriately selected, the Q-vectors of semantically related words are expected to form a cluster. The reasons for this are as follows:

• Semantically related words have similar P-vectors, as illustrated in Figure 3.
• The dimensions of Q-vectors are extracted from the correlations between P-vectors by means of principal component analysis.

As an example of word clustering in the semantic subspaces, let us consider the following 15 words: 1. after, 2. ago, 3. before, 4. bicycle, 5. bus, 6. car, 7. enjoy, 8. former, 9. glad, 10. good, 11. late, 12. pleasant, 13. railway, 14. satisfaction, 15. vehicle. We plotted these words on the subspace X2 × X3, namely the plane spanned by the second and third dimensions of Q-vectors. As shown in Figure 5, the words form three apparent clusters, namely 'goodness', 'vehicle', and 'past'. However, it is still difficult to select appropriate dimensions for making a semantic cluster for given words. In the example above, we used only two dimensions; most semantic clusters need more dimensions to be well-separated. Moreover, each of the 2851 dimensions is simply selected
CONTEXT-SENSITIVE WORD DISTANCE
117
or discarded; this ignores their possible contribution to the formation of clusters.

Fig. 6: Adaptive scaling of the semantic space

3.2 Adaptive scaling
Adaptive scaling of the semantic space provides a weight for each dimension in order to form a desired semantic cluster; these weights are given by scaling factors of the dimensions. This method makes the semantic space adapt to a given context C in the following way: each dimension of the semantic space is scaled up or down so as to make the words in C form a cluster in the semantic space. In the semantic space thus transformed, the distance between Q-vectors changes with C. For example, as illustrated in Figure 6, when C has an oval-shaped (generally, hyper-elliptic) distribution in the pre-scaling space, each dimension is scaled up or down so that C has a round-shaped (generally, hyper-spherical) distribution in the transformed space. This coordinate transformation changes the mutual distance among Q-vectors. In the raw semantic space (Figure 6, left), the Q-vector • is closer to C than the Q-vector o; in the transformed space (Figure 6, right), it is the other way round: o is closer to C, while • is further apart. The distance d(w, w' | C) between two words w, w' under the context C = {w1, ..., wn} is defined as follows:

d(w, w' | C) = sqrt( Σi=1..m (fi (qi − q'i))² )

where Q(w) and Q(w') are the m-dimensional Q-vectors of w and w', respectively: Q(w) = (q1, ..., qm), Q(w') = (q'1, ..., q'm).
The scaling factor fi ∈ [0,1] of the i-th dimension is a decreasing function of the ratio ri = SDi(C) / SDi(V), where SDi(C) is the standard deviation of the i-th component values of w1, ..., wn, and SDi(V) is that of the words in the whole vocabulary V. The operation of the adaptive scaling described above is summarised as follows:

• If C forms a compact cluster in the i-th dimension (ri → 0), the dimension is scaled up (fi → 1) to be sensitive to small differences in the dimension.
• If C does not form an apparent cluster in the i-th dimension (ri >> 0), the dimension is scaled down (fi → 0) to ignore small differences in the dimension.

Now we can tune the distance between Q-vectors to a given word set C which specifies the context for measuring the distance. In other words, we can tune the semantic space of Q-vectors to the context C. This tune-up procedure is not computationally expensive: once we have computed the set of Q-vectors and SD1(V), ..., SDm(V), all we have to do for a given word set C is to compute the scaling factors f1, ..., fm. Computing distance between Q-vectors in the transformed space is no more expensive than computing simple Euclidean distance between Q-vectors.

4 Examples of measuring the word distance
Let us see a few examples of the context-sensitive distance between words computed by adaptive scaling of the semantic space with 281 dimensions. Here we deal with the following problem: under the context specified by a given word set C, compute the distance d(w, C) between w and C, for every word w in our vocabulary V. The distance d(w, C) is defined as follows:

d(w, C) = sqrt( Σi=1..m (fi (qi − ci))² )

where (c1, ..., cm) is the center of the Q-vectors of the words in C. This means that the distance d(w, C) is equal to the distance between w and the center of C in the transformed semantic space. In other words, d(w, C) indicates the distance of w from the context C.
C = {bus, car, railway}            C = {bus, scenery, tour}
w ∈ C+(15)      d(w, C)            w ∈ C+(15)      d(w, C)
car_1           0.1039             bus_1           0.1008
railway_1       0.1131             scenery_1       0.1122
bus_1           0.1141             tour_2          0.1211
carriage_1      0.1439             tour_1          0.1288
motor_1         0.1649             abroad_1        0.1559
motor_2         0.1949             tourist_1       0.1593
track_2         0.1995             passenger_1     0.1622
track_1         0.2024             make_2          0.1691
road_1          0.2038             make_3          0.1706
passenger_1     0.2185             everywhere_1    0.1713
vehicle_1       0.2274             garage_1        0.1715
engine_1        0.2469             set_2           0.1723
garage_1        0.2770             machinery_1     0.1733
train_1         0.2792             something_1     0.1743
belt_1          0.2853             timetable_1     0.1744

Table 1: Association from a given word set C

Now we can extract a word set C+(k) which consists of the k closest words to the given context C. This extraction is done by the following procedure:
1. Sort all words in our vocabulary V in ascending order of d(w, C).
2. Let C+(k) be the word set which consists of the first k words in the sorted list.
Note that C+(k) may not include all words in C, even if k > |C|. Here we will see some examples of extracting C+(k) from a given context C. When the word set C = {bus, car, railway} is given, our context-sensitive word distance produces the cluster C+(15) shown in Table 1 (left). We can see from the list that our word distance successfully associates related words like motor and passenger in the context of 'vehicle'. On the other hand, from C = {bus, scenery, tour}, the cluster C+(15) shown in Table 1 (right) is obtained. We can see the context 'bus tour' from the list. Note that this list is quite different from that of the former example, though both contexts contain the word bus. When the word set C = {read, paper, magazine} is given, the following cluster C+(12) is obtained (the words are listed in ascending order of the distance): {paper_1, read_1, magazine_1, newspaper_1, print_2, book_1, print_1, wall_1, something_1, article_1, specialist_1, that_1}.
Note that words with different suffix numbers correspond to different headwords (i.e., homographs with different word classes) of the English dictionary LDOCE. For instance, motor_1 is a noun and motor_2 an adjective.
It is obvious that the extracted context is 'education' or 'study'. On the other hand, when C = {read, machine, memory}, the following word set C+(12) is obtained: {machine_1, memory_1, read_1, computer_1, remember_1, someone_1, have_2, that_1, instrument_1, feeling_2, that_2, what_2}. It seems that most of the words are related to 'computer' or 'mind'. These two clusters are quite different, in spite of the fact that both contexts contain the word read.

n:  1       2       3       4       5       6       7       8
ē:  0.3248  0.1838  0.1623  0.1602  0.1635  0.1696  0.1749  0.1801

Fig. 7: Word prediction task (left) and its result (right)

5 Evaluation through word prediction
We evaluate the context-sensitive word distance through predicting words in a text. When one is reading a text (for instance, a novel), he or she often predicts what is going to happen next by using what has happened already. Here we will deal with the following problem: for each sentence in a given text, predict the words in the sentence by using the preceding n sentences. This task is not so difficult for human adults, because a target sentence and the preceding sentences tend to share the same contexts. This means that predictability of the target sentence suggests how successfully we extract information about the context from the preceding sentences. Consider a text as a sequence S1, ..., SN, where Si is the i-th sentence of the text (see Figure 7, left). For a given target sentence Si, let Ci be the set of words in the concatenation of the preceding n sentences: Ci = {Si−n ... Si−1}. Then, the prediction error ei of Si is computed as follows:
1. Sort all the words in our vocabulary V′ in ascending order of d(w, Ci).
2. Compute the average rank ri of the words wij ∈ Si in the sorted list.
3. Let the prediction error ei be the relative average rank ri / |V′|.
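The three steps above can be sketched as follows (illustrative Python; the toy vocabulary and its ranking are assumed, not data from the experiment):

```python
# Sketch of the prediction-error computation for one target sentence,
# given the vocabulary already sorted by d(w, Ci).

def prediction_error(target_words, ranked_vocab):
    """Relative average rank of the target words in the sorted list."""
    rank = {w: i + 1 for i, w in enumerate(ranked_vocab)}
    avg = sum(rank[w] for w in target_words) / len(target_words)
    return avg / len(ranked_vocab)

# toy vocabulary of 10 words, assumed sorted by d(w, Ci)
vocab = ["spring", "menu", "table", "love", "letter",
         "type", "rain", "street", "snow", "iron"]
print(prediction_error(["spring", "menu"], vocab))   # -> 0.15
```

If the ranking were random, target words would sit in the middle of the list on average, giving an expected error of 0.5; values well below 0.5 therefore indicate successful context extraction.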
Note that here we use the vocabulary V′, which consists of 2641 words: we removed the 210 function words from the vocabulary V. Obviously, the prediction is successful when ei → 0. We used O. Henry's short story 'Springtime à la Carte' (Thornley 1960:56-62) for the evaluation. The text consists of 110 sentences (1620 words). We computed the average value ē of the prediction error ei over the target sentences Si (i = n+1, ..., 110). The average prediction error ē for different numbers of preceding sentences (n = 1, ..., 8) is shown in Figure 7 (right). If prediction is random, the expected value of the average prediction error ē is 0.5 (i.e., chance). Our method predicted the succeeding words better than randomly; the best result was observed for n = 4. Without adaptive scaling of the semantic space, simple Euclidean distance resulted in ē = 0.2905 for n = 4; our method is better than this, except for n = 1. When the succeeding words are predicted by using the prior probability of word occurrence, we obtained ē = 0.2291. The prior probability is estimated by word frequency in West's five-million-word corpus (West 1953). Again our result is better than this, except for n = 1.

6 Discussion

6.1 Semantic vectors
A monolingual dictionary describes the denotational meaning of words by using the words defined in it; a dictionary is a self-contained and self-sufficient system of words. Hence, a dictionary contains knowledge useful for natural language processing (Wilks et al. 1989). We represented the meaning of words by semantic vectors generated from the semantic network of the English dictionary LDOCE. While the semantic network ignores the syntactic structures in dictionary definitions, each semantic vector contains at least a part of the meaning of the headword (Kozima & Furugori 1993). Co-occurrence statistics on corpora also provide semantic information for natural language processing. For example, mutual information (Church & Hanks 1990) and n-grams (Brown et al. 1992) can extract semantic relationships between words. We can represent the meaning of words by co-occurrence vectors extracted from corpora. In spite of the sparseness of corpora, each co-occurrence vector contains at least a part of the meaning of the word. Semantic vectors from dictionaries and co-occurrence vectors from corpora would have different semantic information (Niwa & Nitta 1994). The former
displays paradigmatic relationships between words, and the latter syntagmatic relationships between words. We should incorporate both of these complementary knowledge sources into the vector-representation of word meaning.
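The corpus-based alternative can be sketched with pointwise mutual information over a toy corpus (illustrative Python; the corpus and window size are assumptions):

```python
import math
from collections import Counter

# Toy sketch of the corpus-based alternative: pointwise mutual
# information from co-occurrence counts within a small window.

corpus = ("the car has an engine the car has a tire "
          "the bus has an engine the dictionary has a headword").split()

def pmi(x, y, window=2):
    uni = Counter(corpus)
    pairs = Counter()
    for i, w in enumerate(corpus):
        for v in corpus[i + 1: i + 1 + window]:
            pairs[frozenset((w, v))] += 1
    n = len(corpus)
    p_xy = pairs[frozenset((x, y))] / n
    if p_xy == 0:
        return float("-inf")      # never co-occur within the window
    return math.log2(p_xy / ((uni[x] / n) * (uni[y] / n)))

print(pmi("car", "has") > pmi("car", "dictionary"))   # -> True
```

Such co-occurrence scores capture syntagmatic association (words that appear together), whereas the dictionary-derived vectors above capture paradigmatic relatedness (words that can substitute for each other).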
6.2 Word prediction and text structure
In the word prediction task described in Section 5, we observed the best average prediction error ē for n = 4, where n denotes the number of preceding sentences. It is likely that ē will decrease with increasing n, since the more we read of the preceding text, the better we can predict the succeeding text. However, we observed the best result for n = 4. Most studies on text structure assume that a text can be segmented into units that form a text structure (Grosz & Sidner 1986). Scenes in a text are contiguous and non-overlapping units, each of which describes certain objects (characters and properties) in a situation (time, place, and backgrounds). This means that different scenes have different contexts. The reason why n = 4 gives the best prediction lies in the alternation of the scenes in the text. When both a target sentence Si and the preceding sentences Ci are in one scene, prediction of Si from Ci would be successful. Otherwise, the prediction would fail. A psychological experiment (Kozima & Furugori 1994) supports this correlation with the text structure.

7 Conclusion
We proposed a context-sensitive and dynamic measurement of word distance computed by adaptive scaling of a semantic space. In the semantic space, each word in the vocabulary is represented by an m-dimensional Q-vector. Q-vectors are obtained through a principal component analysis on P-vectors. P-vectors are generated by spreading activation on a semantic network which is constructed systematically from the English dictionary (LDOCE). The number of dimensions, m = 281, is determined by minimising the noise remaining in Q-vectors. Given a word set C which specifies a context, each dimension of the Q-vector space is scaled up or down according to the distribution of C in the space. In the semantic space thus transformed, word distance becomes dependent on the context specified by C. An evaluation through predicting words in a text shows that the proposed measurement captures the context of the text well.
The context-sensitive and dynamic word distance proposed here can be applied in many fields of natural language processing, information retrieval, etc. For example, the proposed measurement can be used for word sense disambiguation, in that the extracted context provides a bias for resolving lexical ambiguity. Also, prediction of succeeding words can reduce the computational cost in speech recognition tasks. In future research, we will regard the adaptive scaling method as a model of human memory and attention that enables us to follow the current context, to restrict memory search, and to predict what is going to happen next.
REFERENCES
Brown, Peter F., Vincent J. Della Pietra, Peter V. deSouza, Jenifer C. Lai & Robert L. Mercer. 1992. "Class-Based n-gram Models of Natural Language". Computational Linguistics 18:4.467-479.
Church, Kenneth W. & Patrick Hanks. 1990. "Word Association Norms, Mutual Information, and Lexicography". Computational Linguistics 16:1.22-29.
Grosz, Barbara J. & Candace L. Sidner. 1986. "Attention, Intentions, and the Structure of Discourse". Computational Linguistics 12:3.175-204.
Kozima, Hideki & Teiji Furugori. 1993. "Similarity between Words Computed by Spreading Activation on an English Dictionary". Proceedings of the 6th Conference of the European Chapter of the Association for Computational Linguistics (EACL'93), 232-239. Utrecht, The Netherlands.
Kozima, Hideki & Teiji Furugori. 1994. "Segmenting Narrative Text into Coherent Scenes". Literary and Linguistic Computing 9:1.13-19.
Morris, Jane & Graeme Hirst. 1991. "Lexical Cohesion Computed by Thesaural Relations as an Indicator of the Structure of Text". Computational Linguistics 17:1.21-48.
Niwa, Yoshiki & Yoshihiko Nitta. 1994. "Co-occurrence Vectors from Corpora vs. Distance Vectors from Dictionaries". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 304-309. Kyoto, Japan.
Osgood, Charles E. 1952.
"The Nature and Measurement of Meaning". Psychological Bulletin 49:3.197-237.
Schank, Roger C. 1990. Tell Me a Story: A New Look at Real and Artificial Memory. New York: Scribner.
Thornley, G. C. 1960. British and American Short Stories. Harlow: Longman.
West, Michael. 1953. A General Service List of English Words. Harlow: Longman.
Wilks, Yorick, Dan Fass, Cheng-Ming Guo, James McDonald, Tony Plate & Brian Slator. 1989. "A Tractable Machine Dictionary as a Resource for Computational Semantics". Computational Lexicography for Natural Language Processing ed. by Bran Boguraev & Ted Briscoe, 193-228. Harlow: Longman.
Towards a Sublanguage-Based Semantic Clustering Algorithm
M. VICTORIA ARRANZ,1 IAN RADFORD, SOFIA ANANIADOU & JUN-ICHI TSUJII
Centre for Computational Linguistics, UMIST
Abstract
This paper presents the implementation of a tool kit for the extraction of ontological knowledge from relatively small sublanguage-specific corpora. The fundamental idea behind this system, that of knowledge acquisition (KA) as an evolutionary process, is discussed in detail. Special emphasis is given to the modular and interactive approach of the system, which is carried out iteratively.
1 Introduction
Not knowing which knowledge to encode is one of the main reasons for the difficulties faced by current NLP applications. As mentioned by Grishman & Kittredge (1986), many of these language processing problems can fortunately be restricted to the specificities of language usage in a certain knowledge domain. The diversity of language encountered there is considerably smaller, and more systematic in structure and meaning, than that of the whole language. Approaching the extraction of knowledge on a sublanguage basis reduces the amount of knowledge to discover, as well as easing the discovery task. One such case of this sublanguage-based research is, for instance, the work carried out by Grishman & Sterling (1992) on selectional pattern acquisition from sample texts.
However, we should also bear in mind the necessity for systematic methodologies of knowledge acquisition, duly supported by software, as already emphasised by several authors (Grishman et al. 1986; Tsujii et al. 1992). Preparation of domain-specific knowledge for an NLP application still relies heavily on human introspection, due mainly to the non-trivial relationship between the ontological knowledge and the actual language usage. This makes the process complex and very time-consuming.
In addition, while traditional statistical techniques have proven useful for knowledge acquisition from large corpora (Church & Hanks 1989; Brown
1 Sponsored by the Departamento de Educación, Universidades e Investigación of the Basque Government, Spain.
et al. 1991), they still present two main drawbacks: opacity of the process and insufficient data.
The black-box nature of purely statistical processes makes them completely opaque to the human specialist. This causes great difficulty when judging whether intuitively uninterpretable results reflect actual language usage, or are simply errors due to insufficient data. Results therefore have to be either revised to meet the expert's intuition or accepted without revision.
To this problem one should also add the fact that statistical methods usually require very large corpora to obtain reasonable results, which is highly impractical and often unfeasible. This is especially the case if work takes place at a sublanguage level, as large corpora become even more inaccessible.
Following the research initiated in Arranz (1992) and based on the Epsilon system described in Tsujii & Ananiadou (1993), our aim is to discover a systematic methodology for sublanguage-specific semantic KA, applicable to different subject domains and multilingual corpora. The tool kit [Є] being developed at CCL supports the principles of KA as an evolutionary process and from relatively small corpora, making it very practical for current NLP applications. This work represents an iterative and modular approach to statistical language analysis, where the acquired knowledge is stored in a Central Knowledge Base (CKB), which is shared and easy to access and update by all subprocesses in the system.
Bearing these considerations in mind, we selected a highly specific corpus, the Unix manual, of about 100,000 words.
2 Epsilon [Є]: Knowledge acquisition as an evolutionary process
Epsilon's idea of knowledge acquisition as an evolutionary process avoids the above-mentioned problems by achieving the following:
Stepwise acquisition of semantic clusters. Our system acquires knowledge as a result of stepwise refinement, therefore avoiding the opacity derived from the single-shot techniques used by purely statistical methods. After every cycle, the specialist inspects the hypotheses of new pieces of knowledge proposed by the utility programs in [Є].
Design of robust discovery methods. Early stages of the KA process are particularly problematic for statistical programs, because the corpus is still very complex. We aim to reduce this complexity by
initially using more robust techniques (to cope, e.g., with words with a low frequency of occurrence) before applying statistical methods.
Inherent links between acquired knowledge and language usage. Epsilon easily deals with the opacity caused by the non-trivial nature of the mapping between the domain ontology and the language usage. Cases of words which denote several different ontological entities, or conversely, of one entity denoted by different words, are often encountered in actual corpora. [Є] keeps a record of the pseudo-texts produced during the KA process (cf. below), as well as of their relationships with the acquired knowledge, so that the specialist can check and understand why and when certain clusterings take place.
Effective minimum human intervention. As emphasised by Arad (1991) in her quasi-statistical system, human intervention is inevitable. However, in [Є] this intervention remains systematised and is only applied locally, whenever required by the process.
The general idea of knowledge acquisition as an evolutionary process is illustrated in Figure 1 (Tsujii & Ananiadou 1993). Application of utility programs to Text-i and human inspection of the results yield the next version of knowledge (the i-th version), which in turn is the input to the next cycle of KA. This general framework is simplified if the results of text description are text-like objects (pseudo-texts), where the i-th version presents a lesser degree of complexity than the previous pseudo-text.
The pseudo-texts obtained are characterised by the following: they present the same type of data structure as ordinary texts, i.e., an ordered sequence of words. The words contained in these pseudo-texts include both pseudo-words and ordinary words. Such pseudo-words can denote semantic categories to which the actual words belong, words with POS information, single concept names corresponding to multi-word terms, and disambiguated lexical items (as in Zernik 1991).
Also, these pseudo-texts are fully compatible with the existing utility programs, and neither the input data nor the tools themselves require any alteration. Finally, the degree of complexity of the text is approximated in terms of the number of different words and word tokens resulting from the several passes of the programs. Working on lipoprotein literature, Sager (1986) also shows that it is possible to measure quantitative features such as the complexity of the information contained in a sublanguage.
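The replacement step and the complexity measure just described can be sketched as follows; the cluster label and the sample tokens are invented for illustration.

```python
def replace_with_pseudo(tokens, cluster, pseudo_word):
    """Replace every member of a cluster with a single pseudo-word,
    producing the next, simpler pseudo-text."""
    cluster = set(cluster)
    return [pseudo_word if t in cluster else t for t in tokens]

def complexity(tokens):
    """Approximate text complexity as (distinct words, total tokens),
    following the measure described above."""
    return len(set(tokens)), len(tokens)

text = ["copy", "the", "input", "file", "and", "the", "output", "file"]
pseudo = replace_with_pseudo(text, {"input", "output"}, "Semantic-class1/NN")

before, _ = complexity(text)    # 6 distinct words
after, _ = complexity(pseudo)   # 5 distinct words: the pseudo-text is simpler
```

Each pass of the system applies such replacements and re-measures the pseudo-text, so the distinct-word count decreases monotonically over the KA cycles.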
Fig. 1: General scheme of KA as an evolutionary process
3 Knowledge acquisition process
3.1 POS information
Once the Classify subprocess (cf. Section 5) was first put into practice, it was observed that, since no part-of-speech information was provided, great confusion was caused at the replacement stage. A series of illegitimate substitutions were carried out, which resulted in serious incoherence within the generated pseudo-texts.
The input text was then preprocessed with Eric Brill's rule-based POS tagger (Brill 1993). The accuracy of the tagger on the corpus in current use oscillates between 87.89% and 88.64% before any training takes place, and reaches 94.05% with a single pass of training. This is quite impressive, if we take into consideration the specificity and technicality of the text.
After providing the sample text with POS information, the set of candidates for semantically related clusters was much more accurate, and the wrong replacements across mixed syntactic categories ceased to take place. In addition, this corpus annotation allowed us to establish a tag compatibility set, which helped recover part of the incorrectly rejected hypotheses posed for replacement. This tag compatibility set consisted of a group
of lines, each of them containing interchangeable part-of-speech markers. An example of one of these lines looks as follows: JJ JJR JJS VBN.
3.2 Modular configuration
The current version of the system consists of:
1. A Central Knowledge Base (CKB), which stores all the relationships among words and pseudo-words obtained during the KA process.
2. A record of the pseudo-texts created, as well as of the relationships between them, in terms of the replacements or clusterings taking place.
3. A number of separate subprocesses (detailed below) which are involved in each pass of the system.
These subprocesses rely upon the iterative application of simple analysis tools, updating the CKB with the knowledge acquired at each stage. The resulting modular system is simple to maintain and enhance. At present [Є] contains three major processes involved in the KA task: (i) Compound, which generates hypotheses of multi-word expressions; (ii) Classify, which generates semantically-related term clusters; (iii) Replacement, which reduces the complexity of the text by replacing the newly-found pieces of information within the corpus.
4 The Compound subprocess
4.1 Framework
This tool performs the search for those multi-word structures within the text that can be ranked as single ontological entities. This module was built to interact with the other existing module, Classify, and with the CKB, so as to achieve any required exchange or storage of semantic information.
Step 1. The first stage relies on the analysis of the corpus using a simple grammar, based upon pairs of words where the second word is a noun and the first belongs to one of the classes Noun, Gerund or Adjective. Using this grammar we extract descriptions of the structures of potential compound terms. Any single pass can thus only determine two-word compounds, requiring multiple passes if longer compounds are to be found. These potential compounds are then filtered by simply ensuring that they occur in the corpus more than once.
Step 2. The remaining candidates from Step 1 are then prioritised by calculating the mutual information (Church & Hanks 1989) of each pair.
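Steps 1 and 2 can be sketched as below. The tag sets, the toy corpus, and the exact MI normalisation are illustrative assumptions; only the grammar (first word Noun/Gerund/Adjective, second word Noun), the frequency filter, and the use of mutual information come from the text.

```python
import math
from collections import Counter

FIRST = {"NN", "VBG", "JJ"}    # Noun, Gerund, Adjective
SECOND = {"NN"}                # second word must be a noun

def compound_candidates(tagged):
    """tagged: list of (word, tag) pairs.

    Step 1: collect adjacent pairs matching the simple grammar and
    keep those occurring more than once.  Step 2: rank the survivors
    by pointwise mutual information, MI(x, y) = log2(P(x,y) / (P(x)P(y))).
    """
    n = len(tagged)
    unigrams = Counter(tagged)
    pairs = Counter(
        (tagged[i], tagged[i + 1])
        for i in range(n - 1)
        if tagged[i][1] in FIRST and tagged[i + 1][1] in SECOND
    )
    scored = []
    for (x, y), c in pairs.items():
        if c < 2:                       # Step 1 filter: more than one occurrence
            continue
        mi = math.log2((c / n) / ((unigrams[x] / n) * (unigrams[y] / n)))
        scored.append(((x[0], y[0]), mi))
    return sorted(scored, key=lambda item: -item[1])

corpus = [("set", "VB"), ("the", "DT"), ("environment", "NN"), ("variable", "NN"),
          ("then", "RB"), ("print", "VB"), ("the", "DT"),
          ("environment", "NN"), ("variable", "NN")]
candidates = compound_candidates(corpus)
# candidates -> [(('environment', 'variable'), ...)]
```

A single pass over this toy corpus proposes environment variable as a compound candidate, mirroring the example replaced in Step 3 below.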
Step 3. Once the set of compound term candidates has been verified by the human expert, each selected compound is replaced with a single token. At present, this token is a composite which retains all of the original information within the corpus entry. For instance, the compound generated from the nouns environment/NN and variable/NN looks as follows: compound(environment/NN~variable/NN)/NN, where the whole structure maintains the grammatical category NN.
Step 4. Among the potential compounds discovered, only 40% turned out to be positive cases (cf. Section 4.2). This problem was particularly acute in Adjective Noun and Gerund Noun cases, mainly as a result of the difficulty of distinguishing between such general-language and domain-specific syntactic pairs. Due to the low frequency of some of the compounds in the corpus, the resulting MI scores were noisy and led to rather irregular results. The specificity of the compounding candidates was therefore measured against a large corpus of general language (the LOB corpus (Johansson & Hofland 1989)). Using the formula shown in equation (1), we established a specificity coefficient, which indicates how specific a particular word is to the sublanguage.
(1)
Step 5. This is another replacement stage, where the verified compound terms are substituted by compound identifiers, such as Compound67/NN. These identifiers are directly related to the CKB, where a record of the information relating to this token is stored.
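The scan does not preserve equation (1), so the paper's exact coefficient is unrecoverable here. The sketch below uses one formula that is merely consistent with the scale described in Section 4.2 (1.0 for words unique to the sublanguage, negative values for words commoner in general language); it is an assumption, not the paper's equation.

```python
def specificity(word, sub_freq, gen_freq, sub_total, gen_total):
    """Hypothetical specificity coefficient (NOT the paper's equation 1):

        spec(w) = 1 - relfreq_general(w) / relfreq_sublanguage(w)

    = 1.0  when the word never occurs in the general corpus (e.g. LOB),
    < 0.0  when the word is relatively commoner in general language.
    """
    f_sub = sub_freq.get(word, 0) / sub_total
    f_gen = gen_freq.get(word, 0) / gen_total
    if f_sub == 0:
        raise ValueError("word not attested in the sublanguage corpus")
    return 1.0 - f_gen / f_sub

# Toy counts: 'recursive' is Unix-manual-only; 'nice' is everyday English.
sub = {"recursive": 40, "nice": 5}
gen = {"recursive": 0, "nice": 3000}
s_rec = specificity("recursive", sub, gen, sub_total=100_000, gen_total=1_000_000)
s_nice = specificity("nice", sub, gen, sub_total=100_000, gen_total=1_000_000)
```

On these invented counts the coefficient behaves as the text describes: the domain-only word scores 1.0 and the everyday word scores negative.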
4.2 Performance
Regarding the module's performance, the simple grammar in Step 1 succeeds in filtering the roughly 500 hypotheses of multi-word expressions originally produced, reducing them to around 70 candidates. Of these 70, 45 are Noun Noun pairs and the remaining 25 are Adjective Noun or Gerund Noun pairs. As already discussed in Section 4.1, only 40% of the hypotheses belonging to the latter type of compounds were actually correct, whereas the Noun Noun pairs presented 85% positive cases.
By means of the filtering carried out with the LOB corpus, and using a threshold of 0.9 on adjectives, performance improves from a disappointing 40% to a promising 64% for those troublesome cases, and adds up to a global 77.5%, just after the first pass. A value of 1.0 on the specificity scale implies that the word is unique to the sublanguage, while negative values represent a word which is more common in general language than in our subject-domain sample text. It should be pointed out, though, that currently the statistics regarding word frequencies in the LOB corpus do not take POS information into account, making this filtering a rather limited resource. The future application of an annotated general-language text is already being considered, so as to attempt to detect the remaining errors.
The replacement in Step 5 facilitates the storage of the information in the CKB and makes it more accessible to the subprocesses. Once formed, compound identifiers are treated as ordinary words with a particular syntactic label. The results obtained by the compounding module are shown in Figure 2.
Fig. 2: Compounding results (x-axis: iteration number)
5 The Classify subprocess
5.1 Inverse KWIC
This context matching module represents the initial stage in [Є]'s subprocess Classify. Based on the principle that linguistic contexts can provide us with enough information to characterise the properties of words, and to obtain accurate word classifications (Sekine et al. 1992; Tsujii et al. 1992), semantic clusters are extracted by means of the concordance program CIWK (or Inverse KWIC) (Arad 1991). The following is a sample output from CIWK
for a [3 3] parameter (three words preceding and three succeeding):
input/NN ; output/NN ; # name/NN of/IN the/DT $ bar-file/NN using/VBG the/DT
This indicates that both nouns input/NN and output/NN share the same context at least once in the corpus. Once the list of semantic clusters has been finalised, the corpus is updated, with all occurrences of the words within each cluster being replaced by the first word of that cluster. For instance, in the example above, all occurrences of input/NN and output/NN would be replaced by input/NN. For our experiments, a relatively small contextual size parameter has been selected (a [2 2]), so as to obtain a larger set of hypotheses. A list of about 700 semantic classes has been produced with this parameter.
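The inverse-KWIC principle (grouping words that share identical left and right contexts) can be sketched as below. CIWK itself is a separate concordance program, so this is only an assumed reconstruction of its matching idea, with an invented mini-corpus.

```python
from collections import defaultdict

def inverse_kwic(tokens, pre=2, post=2):
    """Map each (left context, right context) pair to the set of words
    seen in it; contexts shared by more than one word yield cluster
    hypotheses, as in the input/output example above."""
    ctx2words = defaultdict(set)
    for i, word in enumerate(tokens):
        left = tuple(tokens[i - pre:i]) if i >= pre else None
        right = tuple(tokens[i + 1:i + 1 + post])
        if left is not None and len(right) == post:
            ctx2words[(left, right)].add(word)
    # keep only contexts shared by at least two distinct words
    return {ctx: ws for ctx, ws in ctx2words.items() if len(ws) > 1}

tokens = ("name of input using the pipe "
          "name of output using the screen").split()
clusters = inverse_kwic(tokens, pre=2, post=2)
```

Here input and output share the [2 2] context ("name of" ... "using the"), so they are hypothesised as one semantic cluster; per the text, a later replacement pass would then rewrite both as input.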
5.2 Evaluation
Among the 700 clusters generated, an interesting number of cases present crucial ontological and contextual features for our KA process. Unfortunately, there is also a significant number of ambiguous clusters which require filtering. Work is currently taking place on this filtering process, and some preliminary results can already be seen in Section 7.2. In spite of the interesting results initially obtained from CIWK, the exact-matching technique this tool is based on is rather inflexible for the semantic clustering task. The semantic classes formed and the actual instances of each class can be seen in Figure 3.
6 Central knowledge base
Although not yet fully implemented, our Central Knowledge Base plays a very important role within the system's framework. Due to Epsilon's modular approach and the open nature of the links between the stored acquired knowledge and the different subprocesses within the system, there is no need to retain newly extracted information in the corpus. Everything is maintained in the CKB by means of referentials, such as Semantic-class18/NN (referring to a cluster resulting from Classify) or Compound67/NN (representing one of the acquired compound expressions). This provides an easy method of updating and improving the knowledge base, as well as an opportunity to add new modules to the whole configuration of the system.
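A minimal sketch of how such referentials might be stored; the class shape and method names are invented, and only the referential format (e.g. Compound67/NN) follows the text.

```python
class CentralKnowledgeBase:
    """Toy CKB: hands out referential tokens (e.g. 'Compound1/NN') and
    stores the acquired knowledge behind them, so subprocesses can share
    results without rewriting the corpus itself."""

    def __init__(self):
        self._entries = {}
        self._counters = {}

    def register(self, kind, payload, tag="NN"):
        # allocate the next referential of this kind and store the payload
        n = self._counters.get(kind, 0) + 1
        self._counters[kind] = n
        ref = f"{kind}{n}/{tag}"
        self._entries[ref] = payload
        return ref

    def lookup(self, ref):
        return self._entries[ref]

ckb = CentralKnowledgeBase()
ref = ckb.register("Compound", ("environment/NN", "variable/NN"))
# ref == 'Compound1/NN'; the compound's parts stay retrievable:
parts = ckb.lookup(ref)
```

The corpus then only carries the referential token, while the CKB keeps the full record, which is what lets new modules be added without touching the pseudo-texts already produced.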
Fig. 3: Semantic clustering results (x-axis: iteration number)
7 Dynamic context matching techniques for semantic clustering disambiguation
7.1 Word sense disambiguation
As mentioned in Section 5.2, a substantial number of ambiguous clusters arise from the use of Classify and are in need of filtering. However, the CIWK algorithm is very inflexible and will only accept candidates sharing exactly matching contexts. In practice we often encounter instances of semantically related words whose contexts vary slightly, for various reasons. On other occasions one might find that differing contexts within the same term, or between different words, represent the different ontologies of such word(s), and therefore need disambiguating. Work on such a filtering module is currently being undertaken, by means of a technique called Dynamic Alignment (Somers 1994).
7.2 Dynamic context matching
This technique allows us to compare the degree of similarity between two words, and it represents a much more flexible approach than the exact-matching technique used in CIWK. Its aim is to discover all potential matches between a given set of individual words, attaching a value to each match according to its level of importance. Then, the set of matches producing the highest total match strength is calculated. The obtained highest
134
ARRANZ, RADFORD, ANANIADOU & TSUJII
score is attributed to the pair of contexts, thus establishing a value for their similarity relation. For each pair of contexts, the best match value is calculated, which results in a correlation matrix. Figure 4 presents an example of the way all possible word matches are discovered for a particular pair of contexts. Given the constraint that the individual matches are not allowed to cross, the maximal set is chosen and its value calculated. The following is the output for the correlation matrix formed by the pair of words discussed/VBN and listed/VBN:
% dynamic discussed/VBN listed/VBN +5 -5 < corpus
Post context length set to 5
Pre context length set to 5
CIWK data read. 9 records found.
[9 x 9 correlation matrix; the numeric layout is not recoverable from the scan]
Fig. 4: Example match between two contexts (partial vs. full matches)
The clustering algorithm used to determine the strongest semantic cluster in the matrix operates in a simple manner. Initially, the pair of contexts with the highest correlation is selected as the core of the cluster. Then, each remaining context is considered in turn, adding to the cluster those
contexts which present a correlation value above a certain threshold with respect to more than half the contexts already in the cluster. This is repeated until no more contexts can be added to the cluster.
Although this process is still being tested, and the required thresholds and parameters are still being set, it has proved to have important advantages over Classify: it is more flexible, and it implicitly solves the ambiguity problem detailed above. The contexts provided contain the necessary ontological knowledge to extract the different senses of the cluster components; e.g., the above matrix yielded two different contextual clusters, showing two different meanings.
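The non-crossing maximum-strength matching described for Dynamic Alignment is, in essence, a weighted longest-common-subsequence computation. The sketch below assumes two match strengths (a full match on word plus tag, a partial match on tag only, cf. the "Partial Match / Full Match" legend of Fig. 4); the actual weights used in the system are not given in the text.

```python
def alignment_score(ctx1, ctx2, full=3, partial=1):
    """Best total strength of a set of non-crossing matches between two
    contexts (lists of (word, tag) pairs), via LCS-style dynamic
    programming.  The weights `full` and `partial` are assumptions."""
    n, m = len(ctx1), len(ctx2)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w1, t1 = ctx1[i - 1]
            w2, t2 = ctx2[j - 1]
            w = full if (w1, t1) == (w2, t2) else (partial if t1 == t2 else 0)
            # either skip a token on one side, or match the current pair;
            # the left-to-right DP order is what forbids crossing matches
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + w)
    return dp[n][m]

a = [("the", "DT"), ("topics", "NNS"), ("discussed", "VBN"), ("here", "RB")]
b = [("the", "DT"), ("files", "NNS"), ("listed", "VBN"), ("here", "RB")]
score = alignment_score(a, b)   # two full matches plus two partial matches
```

Computing this score for every pair of contexts yields exactly the kind of correlation matrix shown above, to which the threshold-based clustering step is then applied.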
8 Concluding remarks
This system attempts to avoid the pitfalls faced by purely statistical techniques of knowledge acquisition. To this end, the idea of KA as an evolutionary process is described in detail and applied to the task of sublanguage-specific KA from small corpora. The iterative nature of our system enables statistical measures to be applied, in spite of the relatively small size of our sample text. The interactive framework of our implementation provides a simple way to access and store the acquired ontological knowledge, and it also allows our subprocesses to exchange information so as to obtain the desirable results.
REFERENCES
Arad, Iris. 1991. A Quasi-Statistical Approach to Automatic Generation of Linguistic Knowledge. Ph.D. dissertation, CCL, UMIST, Manchester, U.K.
Arranz, Victoria. 1992. Construction of a Knowledge Domain from a Corpus. M.Sc. dissertation, CCL, UMIST, Manchester, U.K.
Brill, Eric. 1993. A Corpus-Based Approach to Language Learning. Ph.D. dissertation, University of Pennsylvania, Philadelphia.
Brown, Peter F., Stephen A. Della Pietra, Vincent J. Della Pietra & Robert L. Mercer. 1991. "Word-Sense Disambiguation Using Statistical Methods". Proceedings of the 29th Annual Conference of the Association for Computational Linguistics (ACL'91), Berkeley, Calif., 264-270. San Mateo, Calif.: Morgan Kaufmann.
Church, Kenneth W. & Patrick Hanks. 1989. "Word Association Norms, Mutual Information, and Lexicography". Proceedings of the 27th Annual Conference of the Association for Computational Linguistics (ACL'89), Vancouver, Canada, 76-82. San Mateo, Calif.: Morgan Kaufmann.
Grishman, Ralph & Richard Kittredge. 1986. Analysing Language in Restricted Domains: Sublanguage Description and Processing. New Jersey: Lawrence Erlbaum Associates.
Grishman, Ralph & John Sterling. 1992. "Acquisition of Selectional Patterns". Proceedings of the 14th International Conference on Computational Linguistics (COLING'92), Nantes, France, 658-664.
Grishman, Ralph, Lynette Hirschman & Ngo Thanh Nhan. 1986. "Discovery Procedures for Sublanguage Selectional Patterns: Initial Experiments". Computational Linguistics 12:3.205-215.
Johansson, Stig & Knut Hofland. 1989. Frequency Analysis of English Vocabulary and Grammar: Based on the LOB Corpus, vol. 1: Tag Frequencies and Word Frequencies. Oxford: Clarendon Press.
Sager, Naomi. 1986. "Sublanguage: Linguistic Phenomenon, Computational Tool". Analysing Language in Restricted Domains: Sublanguage Description and Processing ed. by Ralph Grishman & Richard Kittredge, 1-17. New Jersey: Lawrence Erlbaum Associates.
Sekine, Satoshi, Jeremy J. Carroll, Sofia Ananiadou & Jun-ichi Tsujii. 1992. "Automatic Learning for Semantic Collocation". Proceedings of the 3rd Conference on Applied Natural Language Processing (ANLP'92), Trento, Italy, 104-110. New Jersey: ACL.
Somers, Harold, Ian McLean & Daniel Jones. 1994. "Experiments in Multilingual Example-Based Generation". Proceedings of the 3rd Conference on the Cognitive Science of Natural Language Processing (CSNLP'94). Dublin, Ireland: Dublin City University.
Tsujii, Jun-ichi & Sofia Ananiadou. 1993. "Epsilon [Є]: Tool Kit for Knowledge Acquisition Based on a Hierarchy of Pseudo-Texts". Proceedings of the Natural Language Processing Pacific Rim Symposium (NLPRS'93), 93-101. Fukuoka, Japan.
Tsujii, Jun-ichi, Sofia Ananiadou, Iris Arad & Satoshi Sekine. 1992. "Linguistic Knowledge Acquisition from Corpora". Proceedings of the International Workshop on Fundamental Research for the Future Generation of Natural Language Processing (FGNLP), 61-81. Manchester, U.K.
Zernik, Uri. 1991. "Train1 vs. Train2: Tagging Word Senses in Corpus". Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon ed. by Uri Zernik, 91-112. New Jersey: Lawrence Erlbaum Associates.
Customising a Verb Classification to a Sublanguage
ROBERTO BASILI*, MICHELANGELO DELLA ROCCA*, MARIA TERESA PAZIENZA* & PAOLA VELARDI**
* Universita' di Tor Vergata, Roma
** Universita' di Ancona
Abstract
In this paper we study the relationships between a general-purpose, human-coded verb classification, proposed in the WordNet lexical reference system, and a corpus-driven classification model based on context analysis. We describe a context-based classifier that tunes WordNet to specific sublanguages and reduces its over-ambiguity.1
1 Sense disambiguation and sense tuning
The purpose of this study is to define a context-based statistical method to constrain and customise the WordNet type hierarchy according to a specific sublanguage. Our context-based method is expected to tune the initial WordNet categorisation to a given corpus, in order to:
• Reduce the initial ambiguity
• Order each sense according to its relevance in the corpus
• Identify new senses typical of the domain.
These results could be useful for any NLP system lacking human support for word categorisation. The problem that we consider in this paper is strongly related to the problem of word-sense disambiguation. Given a verb and a representative set of its occurrences in a corpus, we wish to determine the subset of its initial senses that may be found in the sublanguage. In some cases, new senses may be found that were not included in the initial classification.
Word-sense disambiguation is a long-standing problem. Recently, several statistically based algorithms have been proposed to automatically disambiguate word senses in sentences, but many of these methods are hopelessly unusable, because they require manual training for each ambiguous word.
1 This paper summarises the results presented at the International Conference on Recent Advances in Natural Language Processing. The interested reader may refer to the RANLP proceedings for additional details on the experiments.
Exceptions are the simulated annealing method proposed in (Cowie et al. 1992) and the context-based method proposed in (Yarowsky 1992). Simulated annealing attempts to select the optimal combination of senses for all the ambiguous words in a sentence S. The source data for disambiguation are the LDOCE dictionary definitions and subject codes associated with each ambiguous word in the sentence S. The basic idea is that word senses that co-occur in a sentence will have more words and subject codes in common in their definitions.
However, in (Basili et al. 1996) we experimentally observed that dictionary sense definitions for verbs might not capture the domain-specific use of a verb. For example, for the verb to obtain in the RSD we found patterns of use like: the algorithm obtains good results for the calculation..., data obtained from the radar..., the procedure obtains useful information by fitting..., etc., while the (Webster's) dictionary definitions for this verb are: (i) to gain possession of: to acquire, (ii) to be widely accepted, neither of which seems to fit the detected patterns. We hence think that the corpus itself, rather than dictionary definitions, should be used to derive disambiguation hints. One such approach is undertaken in (Yarowsky 1992), which inspired our method (Della Rocca 1994).
In this paper our objectives and methods are slightly different from those in (Yarowsky 1992). First, the aim of our verb classifier is to tune an existing verb hierarchy to an application domain, rather than selecting the best category for a word occurring in a context. Second, since in our approach the training is performed on an unbalanced corpus (and on verbs, which notoriously exhibit fuzzier contexts), we introduced local techniques to reduce spurious contexts and improve the reliability of learning.
Third, since we also expect domain-specific senses for a verb, during the classification phase we do not make any initial hypothesis on the subset of categories of a verb. Finally, we consider globally all the contexts in which the verb is encountered in a corpus, and compute a (domain-specific) probability distribution over its expected senses. In the next section the method is described in detail.
2 A context-based classifier
In his experiment, Yarowsky uses the 726 Roget's categories as the initial classification. In our study, we use a more recently conceived, widely available classification system, WordNet.
CATEGORY             #VERBS   #SYNSETS
body (BD)                78         76
change (CH)             287        412
cognition (CO)          200        218
communication (CM)      240        299
competition (CP)         63         73
consumption (CS)         48         41
contact (CT)            209        279
creation (CR)           124        133
emotion (EM)             47         50
perception (PE)          76         80
possession (PS)         122        156
social (SO)             217        240
stative (ST)            162        183

Table 1: Distribution of RSD verbs among the WordNet verb categories
We decided to adopt as the initial classification the 15 semantically distinct categories into which verbs are grouped in WordNet. Table 1 shows the distribution of a sample of 826 RSD verbs among these categories, according to the initial WordNet classification. The average ambiguity of verbs among these categories is 3.5 for our RSD sample. In what follows we describe an algorithm to re-assign verbs to these 15 categories, depending upon their surrounding contexts in corpora. Our aim is to tune the WordNet classification to the specific domain, as well as to capture rather technical verb uses that suggest semantic categories different from those proposed by WordNet. The method works as follows:
1. Select the most typical verbs for each category;
2. Acquire the collective contexts of these verbs and use them as a (distributional) description of each category;
3. Use the distributional descriptions to evaluate the (corpus-dependent) membership of each verb in the different categories.
In step 1 of the algorithm we learn a probabilistic model of the categories from the application corpus. When training is performed on an unbalanced corpus (or on verbs, which are highly ambiguous and have variable contexts), local techniques are needed to reduce the noise of spurious contexts. Hence, rather than training the classifier on all the verbs in the learning corpus, we select only a subset of prototypical verbs for each category. We call these verbs the salient verbs of a category C. We call the typicality Tv(C)
BASILI, DELLA ROCCA, PAZIENZA & VELARDI
CATEGORY            KERNEL VERBS
body (BD)           produce, acquire, emit, generate, cover
change (CH)         calibrate, reduce, increase, measure, coordinate
cognition (CG)      estimate, study, select, compare, plot, identify
communication (CM)  record, count, indicate, investigate, determine
competition (CP)    base, point, level, protect, encounter, deploy
consumption (CS)    sample, provide, supply, base, host, utilise
contact (CT)        function, operate, filter, segment, line, describe
creation (CR)       design, plot, create, generate, program, simulate
emotion (EM)        like, desire, heat, burst, shock, control
motion (MO)         well, flow, track, pulse, assess, rotate
perception (PC)     sense, monitor, display, detect, observe, show
possession (PS)     provide, account, assess, obtain, contribute, derive
social (SO)         experiment, include, manage, implement, test
stative (ST)        consist, correlate, depend, include, involve, exist
weather (WE)        scintillate, radiate, flare
Table 2: Excerpt of kernel verbs in the RSD

of v in C the following ratio:

Tv(C) = Nv,C / Nv    (1)

where Nv is the total number of synsets of a verb v, i.e., all the WordNet synonymy sets including v, and Nv,C is the number of synsets of v that belong to the semantic category C, i.e., synsets indexed with C in WordNet. The synonymy Sv(C) of v in C, i.e., the degree of synonymy shown by verbs other than v in the synsets of the class C in which v appears, is modeled by the following ratio:

Sv(C) = Ov,C / Ov    (2)

where Ov is the number of verbs in the corpus that appear in at least one of the synsets of v, and Ov,C is the number of verbs in the corpus appearing in at least one of the synsets of v that belong to C. Given (1) and (2), the salient verbs v for a category C can be identified by maximising the following function, which we call Score:

Scorev(C) = OAv × Tv(C) × Sv(C)    (3)

where OAv are the absolute occurrences of v in the corpus. The value of Score depends both on the corpus and on WordNet. OAv depends obviously
on the corpus. The typicality, instead, depends only on WordNet. A typical verb for a category C is one that is either unambiguously assigned to C in WordNet, or that has most of its senses (synsets) in C. Finally, the synonymy depends both on WordNet and on the corpus. A verb with a high degree of synonymy in C is one with a high number of synonyms in the corpus, with reference to a specific sense (synset) belonging to C. Salient verbs for C are frequent, typical, and have a high synonymy in C. The kernel of a category, kernel(C), is the set of salient verbs v with a 'high' Scorev(C). To select a kernel, we can either establish a threshold for Scorev(C) or fix the cardinality of kernel(C). We adopted the second choice, because of the relatively small number of verbs found in the medium-sized corpora that we used. Table 2 lists some of the kernel verbs in the RSD.

In step 2 of the algorithm, the collective contexts for each category are acquired. The collective contexts of a category C are acquired around the salient words for each category (see (Yarowsky 1992)), though we collect salient words using a ±10 window around the kernel verbs. Figure 1 plots the ratio of new words per context vs. the number of contexts acquired for each category, in the RSD and the MD. It can be seen that, on average and for both domains, very few new words are detected beyond the threshold of 1000 contexts. This phenomenon is called saturation and is rather typical of sublanguages. However, some of the categories (like weather and emotion in the RSD) have very few kernel verbs.

In step 3, we need to define a function to determine, given the set of contexts K of a verb v, the probability distribution of its senses in the corpus. For a given verb v, and for each category C, we evaluate the following function, which we call Sense(v, C) (equation 4); it accumulates, over the contexts Ki of v, a context-level score (equation 5) computed from the words w within each Ki. In (5), Pr(C) is the (non-uniform) probability of a class C, given by the ratio between the number of collective contexts for C and the total number of collective contexts. A verb v has a high Sense value in a category if:
Fig. 1: New words per context vs. number of contexts in MD and RSD

• it co-occurs 'often' with salient words of a category C;
• it has few contexts related to C, but these are more meaningful than the others, i.e., they include highly salient words for C.

The corpus-dependent distribution of the senses of v among the categories can be analysed through the function Sense. Notice that, during the classification phase 3, the initial WordNet classification of ambiguous verbs is no longer considered (unlike in (Yarowsky 1992)). WordNet is used only during the learning phase, in which the collective contexts are built. Hence, new senses may be detected for some verbs. We need to establish a threshold for Sense(v, C) below which the sense C is considered not relevant in the corpus for the verb v, given all its observed occurrences. Since the values of the Sense function do not have a uniform distribution across categories, we introduce the standard variable:

Nsense(v, C) = (Sense(v, C) − μC) / σC    (6)

where μC and σC are, respectively, the average value and the standard deviation of the Sense function for all the verbs of C.
A verb v is said to belong to the class C if

Nsense(v, C) ≥ Nsense0    (7)

Under the hypothesis of a normal distribution for the values of (6), we experimentally determined that a reasonable choice is

Nsense0 = 1    (8)
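Taken together, the scoring formulas (1)-(3) and the standardised threshold test (6)-(8) amount to the following sketch; the function names and the toy numbers below are ours, not the paper's, and the Sense values are assumed to be already computed:

```python
from statistics import mean, pstdev

def typicality(n_v, n_vc):
    """T_v(C) = N_{v,C} / N_v  (1): fraction of v's synsets indexed with C."""
    return n_vc / n_v

def synonymy(o_v, o_vc):
    """S_v(C) = O_{v,C} / O_v  (2): fraction of v's corpus synonyms in C."""
    return o_vc / o_v

def score(oa_v, n_v, n_vc, o_v, o_vc):
    """Score_v(C) = OA_v * T_v(C) * S_v(C)  (3)."""
    return oa_v * typicality(n_v, n_vc) * synonymy(o_v, o_vc)

def nsense(sense_vc, senses_in_c):
    """Standard variable (6): (Sense(v, C) - mu_C) / sigma_C."""
    return (sense_vc - mean(senses_in_c)) / pstdev(senses_in_c)

def belongs(sense_vc, senses_in_c, nsense0=1.0):
    """Membership condition (7), with the experimental choice Nsense0 = 1 (8)."""
    return nsense(sense_vc, senses_in_c) >= nsense0
```

Kernel selection then simply keeps, for each category, the verbs with the highest Score values (the paper fixes the cardinality of kernel(C) rather than thresholding Score).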
With this threshold, we assign to a category C only those verbs whose Sense value is equal to or higher than μC + σC. In a normal distribution, this threshold eliminates 84% of the classifications. In the next section we discuss and evaluate the experimental results obtained for the two corpora.

3   Discussion of the results
Table 3 shows the sense values that satisfy threshold (7), for an excerpt of randomly selected RSD verbs. The sign "*" indicates the initial WordNet classification. The average ambiguity of our sample of 826 RSD verbs is 2.2, while the initial WordNet ambiguity was 3.5. For 1,235 verbs of the MD, the average ambiguity is 2.1, while the initial was 2.9. We hence obtained a 30-40% reduction of the initial ambiguity. As expected, classes are more appropriate for the domain. Less relevant senses are eliminated (all empty boxes with a "*" in Table 3). New proposed categories are indicated by scores without the "*". The function Sense, defined in the previous section, produces a new, context-dependent distribution of categories. In this section we evaluate and discuss our data numerically. First, we wish to study the commonalities and divergences between WordNet and our classification method. We introduce the following definitions:

A = {(v, C) | Nsense(v, C) ≥ Nsense0}
W = {(v, C) | Scorev(C) > 0}
I = A ∩ W

where A is the set of verbs classified in C according to their context, W is the set of verbs classified in C according to WordNet, and I is the intersection of the two sets. Two performance measures that assume WordNet as an oracle are the recall, defined as |I| / |W|, and the precision, i.e., |I| / |A|.
BD CH CG CM CP CS CT CR MO PC PS SO ST apply 3.9* * * * 1.3* * calculate 1.1* * * change * * cover * * * * * * * * 1.1* gain * 1.38 * * 4.9* occur 3.8* * * operate 1.1* 3.0* * * point 1.0* * * 1.7* * 2.37 record * * 2.8* scan 2.1* 1.1* * * * * * survey 3.4* test * VERBS
Table 3: Sense values for an excerpt of RSD verbs This definition of recall measures the number of initial WordNet senses in agreement with our classifier. Under the perspective of sense tuning, the recall may be seen as measuring the capability of our classifier to reduce the WordNet initial ambiguity, while the percentage of new senses is given by 100% — precision. Domain Recall
RSD (200 verbs) 41%
MD (341 verbs) 40%
Table 4: A comparison between the corpus-driven classification and WordNet

Table 4 summarises recall and precision values for the two domains and shows that the corpus-driven classifications fit the expectations of the WordNet authors, while more than half of the initial senses (59% in RSD, 60% in MD) are pruned out! Furthermore, there are 13% and 18% newly detected categories in the MD and in the RSD, respectively. Of course, it is impossible to evaluate, if not manually, the plausibility of these new classifications. We will return to this problem at the end of this section. A second possible evaluation of the method is a comparison between the classifications of unambiguous verbs. We found that in the large majority of cases there is a concordance between WordNet and our classifier.

Verbs    BD     CH     CG     CM     CP     CS     CT     CR     MO     PC     PS     SO     ST
convoy   -2.53  -3.07  -1.94  -2.98  -3.08   2.08  -2.37   0.41  51.9*  -1.19  -1.68  -2.19  -4.59
flex     -2.50  -4.76  -2.23  -4.42  -3.86  -4.20  -3.94  -3.18  9.14*  -2.60  -1.97  -3.94  -5.51
wake     34.9*   0.21   0.21  -0.98  -1.34   1.70  -0.25  -0.17  -1.03  -0.58  -0.83  -0.08  -1.16

Table 5: Nsense values for three verbs unambiguous in WordNet

Table 5 shows the standard variable (6) values for some unambiguous verbs.
DOMAIN   RSD (140 verbs)   MD (170 verbs)
Recall        91%                85%

Table 6: Recall of the classification of unambiguous verbs

Table 6 globally evaluates the performance of the classifier over unambiguous verbs, for the two domains. We also attempted a global linguistic analysis of our data. We observed that for some verbs the collective contexts acquired may not express their intended meaning (i.e., category) in WordNet. Moreover, technical uses of some verbs are idiosyncratic with respect to their WordNet category. Consider for example the verb to record in the medical domain. This verb is automatically classified in the categories communication and contact. The contact classification is new, that is, it was not included among the WordNet categories for to record. Initially, we examined all the occurrences of this verb (45 sentences) with the purpose of manually evaluating the classification choices of our system. Each of the authors of this paper independently attempted to categorise each occurrence of the verb in the MD corpus as either belonging to the categories proposed by WordNet for to record (communication) or to the new class contact. However, since the WordNet authors provided only very schematic descriptions for each category, each of us used his personal intuition of the definition of each category. The result was a set of almost totally divergent classification choices! During the analysis of the sentences, we observed that the verb to record occurs in the medical domain in rather repetitive contexts, though the similarity of these contexts can only be appreciated through a generalisation process. Specifically, we found two highly recurrent generalised patterns:

A   record(Z,X,Y): subject(Z), object(physiological_state(X)), locative(individual(Y) or body_part(Y)).
(e.g., myelitis spinal cord injury tumors were recorded at the three levels parietal spinal cervical . . . ).

B   record(Z,X,Y): subject(Z), object(abstraction(X)), locative(information(Y)) or time(time_period(Y)).

(e.g., mortality rates were recorded in the study during the first month of life)

Above, the unary functors (e.g., individual, information, . . . ) are WordNet labels. We then attempted to re-classify all the occurrences of the verb as either fitting scheme A or scheme B, regardless of WordNet categories. Table 7 shows a subset of contexts for the verb to record. The symbol "#" indicates an occurrence of the verb.

( In, normal, patients, potentials, of, a, uniform, shape, were, #, during, flaccidity )
( At, cutoff, frequencies )
( Cavernous, electrical, activity, was, #, in, patients, with, erectile, dysfunction )
( Abnormal, findings, of, cavernous, electrical, activity, were, #, in, _, of, the, consecutive, impotent, patients )
( Morbidity, and, mortality, rates, were, #, in, the, first, month, of, life, Juveniles, and, yearlings, rarely )
( seconds, of, EMG, interference, pattern, were, #, at, a, maximum, voluntary, contractions, from, the, biceps )
( interference, pattern, IP, in, studies, were, #, using, a, concentric, needle, electrode, MUAPs, were, recorded )
( During, Hz, stimulation, twitches, #, by, measurement, of, the, ankle, dorsiflexor, group, displayed, increasing )
( Macro-electromyographic, MUAPs, were, #, from, patients, in, studies, MUAP, analysis, revealed )
( myelitis, spinal, cord, injury, tumours, The, SEPs, were, #, at, three, levels, parietal, spinal, cervical )

Table 7: Examples of contexts for the verb to record in MD

Out of 45 sentences, only 5 did not clearly fit one of the two schemes. There was almost no disagreement among the four human classifiers, and, surprisingly enough (but not so much), we found a very strong correspondence between our partition of the set of sentences and that proposed by our context-based classifier. If we name class A contact and class B communication, we found 37 correspondences over 40 sentences. In the three non-correspondent cases the context included physiological states and/or body parts, though not as direct objects or modifiers of the verb. The system hence classified the verb as contact, though we selected scheme B. Somehow, it seems that the context-based classifier categorises a verb as contact not so much because it implies the physical contact of entities, but because the arguments of the verb are physical and are the same as those of truly contact verbs. For the same verb, a similar analysis has been performed on its 170 RSD contexts, and comparable results have been obtained. This experiment suggests that, even if viable (especially but not exclusively for verb investigation), a mere statistical analysis of the surrounding context of a single ambiguous word does not bring sufficient linguistic insight, though it provides a good global domain representation. Verb semantics (although domain specific) is useful to explain and validate most of the acquired evidence.
As an improvement, we plan in the future to integrate the method described in this paper with a more semantically oriented, corpus-based classification method, described in (Basili et al. 1995).
4   Final remarks
It is broadly agreed that most successful implementations of NLP applications are based on lexica. However, ontological and relational structures in general purpose on-line lexica are often inadequate (i.e., redundant and over-ambiguous) at representing the semantics of specific sublanguages. In this paper we presented a context-based method to tune a general purpose on-line lexical reference system, WordNet, to sublanguages. The method was applied to verbs, one of the major sources of sense ambiguity. In order to acquire more statistically stable contextual descriptors, we used as the initial classification the 15 highest-level semantic categories defined in WordNet for verbs. We then used local (corpus-dependent) and global (WordNet-dependent) evidence to learn the collective contexts of each category and to compute the probability distribution of verb senses among the categories. This tuning method proved to be reliable for a lexical category, verbs, for which other statistically-based classifiers proposed in the literature obtained weak results. For two domains, we could eliminate about 60% of the initial WordNet ambiguity and identify 10-20% new senses. Furthermore, we observed that, for some categories, the collective context acquired may be spurious for the intended meaning of the category. A manual analysis revealed that a more semantically-oriented representation of a category context would be greatly helpful in improving the performance of the system and in gaining more linguistically oriented information on category descriptions.

REFERENCES

Basili, Roberto, Maria Teresa Pazienza & Paola Velardi. 1996. "A Context Driven Conceptual Clustering Method for Verb Classification". Corpus Processing for Lexical Acquisition ed. by Branimir Boguraev & James Pustejovsky. Cambridge, Mass.: MIT Press.

Basili, Roberto, Maria Teresa Pazienza & Paola Velardi. Forthcoming. "An Empirical Symbolic Approach to Natural Language Processing". To appear in Artificial Intelligence, vol. 85, August 1996.

Cowie, Jim, J. Guthrie & L. Guthrie. 1992. "Lexical Disambiguation Using Simulated Annealing". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), 359-365. Nantes, France.
Della Rocca, Michelangelo. 1994. Classificazione automatica dei termini di una lingua basata sulla elaborazione dei contesti [Context-Driven Automatic Classification of Natural Language Terms]. Ph.D. dissertation, Dept. of Electrical Engineering, Tor Vergata University, Rome.

Fellbaum, Christiane, R. Beckwith, D. Gross & G. Miller. 1993. "WordNet: A Lexical Database Organised on Psycholinguistic Principles". Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon ed. by U. Zernik, 211-232. Hillsdale, New Jersey: Lawrence Erlbaum Associates.

Yarowsky, David. 1992. "Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), 359-365. Nantes, France.
Concept-Driven Search Algorithm Incorporating Semantic Interpretation and Speech Recognition

AKITO NAGAI, YASUSHI ISHIKAWA & KUNIO NAKAJIMA
MITSUBISHI Electric Corporation

Abstract

This paper discusses issues concerning incorporating speech recognition with semantic interpretation based on concepts. In our approach, a concept is a unit of semantic interpretation, and an utterance is regarded as a sequence of concepts with an intention, so as to attain both linguistic robustness and constraints for speech recognition. First, we propose a basic search method for detecting concepts from a phrase lattice by island-driven search, evaluating the linguistic likelihood of concept hypotheses. Second, an improved method to search efficiently for the N-best meaning hypotheses is proposed. Experimental results of speech understanding are also reported.

1   Introduction
A 'spoken language system' for a naive user must have linguistic robustness, because utterances show a large variety of expressions, which are often ill-formed (Ward 1993:49-50; Zue 1994:707-710). How does a language model cover such a variety of sentences? There is a crucial issue closely related to linguistic robustness: how do we exploit linguistic constraints to improve 'speech recognition'? Syntactic constraint contributes to improving speech recognition, but it is not robust, because it limits sentential expressions. Several recent works have tried to solve these linguistic problems by relaxing grammatical constraints or applying the 'partial parsing' technique (Stallard 1992:305-310; Seneff 1992:299-304; Baggia 1993:123-126). This technique is based on the principle that a whole utterance can be analysed with a syntactic grammar even if the utterance is partly ill-formed. It is, however, likely that the partial parser cannot create even a partial tree for an utterance in free phrase order in 'spontaneous speech', and this linguistic feature is normal in Japanese. Thus, one key issue in attaining linguistic robustness is exploiting semantic knowledge to represent relations between phrases by semantic-driven
processing. One of the methods for doing this is to use case frames based on predicative usage. In this approach, a hypothesis explosion, owing to both word-sense ambiguity and many recognised candidates, occurs if only semantic constraint is used without syntactic constraint. Therefore, a framework to evaluate growing meaning hypotheses, based on both syntactic and semantic viewpoints, is indispensable in the process of 'semantic interpretation' from a 'phrase lattice' to a meaning representation. In our previous work (Nagai et al. 1994a, 1994b), we proposed a semantic interpretation method for obtaining both linguistic robustness and constraints for speech recognition. This paper aims to focus on issues concerning the integration of this semantic interpretation and speech recognition, and to evaluate the performance of 'speech understanding'.

2   Semantic interpretation based on concepts
Our approach is based on the idea that a semantic item represented by a partial expression can be a unit of semantic interpretation. We call this unit a concept. We consider that: (1) a concept is represented by phrases which are continuously uttered in a part of a sentence; (2) a sentence is regarded as a sequence of concepts; and (3) a user talks about concepts with an intention. A concept is defined to represent a target task: for example, concepts for the Hotel Reservation task are Date, Stay, Hotel Name, Room Type, Distance, Cost, Meal, etc. The representation is based on a semantic frame. An intention is defined as an attributive type of the meaning frame of a whole utterance. A meaning frame registers an intention that constrains a set of concept frames. The intention types are defined as reservation, change, cancel, WH-inquiry, Y/N-inquiry, and consultation.

2.1   Basic process
Figure 1 illustrates the principle of the proposed method. The total process can be divided into concept detection and meaning hypothesis generation. In detecting concepts, slots are filled by phrase candidates which can be concatenated in the phrase lattice, based on examining the semantic value and a particle. A phrase candidate which has no particle is examined using only its semantic value. This phrase candidate has case-level ambiguity, and each case is hypothesised. In generating meaning hypotheses, the main process consists of two subprocesses. First, an intention type is hypothesised using: (1) key predicates
which relate semantically to each intention; (2) a particle standing for an inquiry; and (3) interrogative adverbs. If a key predicate is not detected, the intention type is guessed using the semantic relation between concepts. Second, concept hypotheses are combined using the meaning frames which are associated with each intention type. All meaning hypotheses for an entire sentence are generated as the meaning frames which have slots filled with concept hypotheses.
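As an illustration, the combination step can be sketched as follows; the frame inventory and slot names are hypothetical examples in the spirit of the Hotel Reservation task, not the paper's actual frame definitions:

```python
# Hypothetical meaning frames: each intention type constrains which
# concept frames may fill the slots of the meaning frame.
MEANING_FRAMES = {
    "reservation": {"Hotel Name", "Date", "Stay", "Room Type"},
    "WH-inquiry": {"Hotel Name", "Date", "Cost"},
}

def generate_meaning_hypothesis(intention, concept_hypotheses):
    """Fill the meaning frame of the hypothesised intention with those
    concept hypotheses its frame admits; the rest are discarded."""
    allowed = MEANING_FRAMES[intention]
    slots = {c: v for c, v in concept_hypotheses.items() if c in allowed}
    return {"intention": intention, "slots": slots}

hyp = generate_meaning_hypothesis(
    "WH-inquiry", {"Hotel Name": "Fuji", "Cost": "?", "Meal": "dinner"})
```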
Fig. 1: Semantic interpretation based on concepts
2.2   Reduction of ambiguity in concept hypotheses
Many senseless meaning hypotheses remain, owing to ambiguity of word sense, of the cases of a phrase, and of the boundaries of concepts. Two methods are used to reduce the ambiguity. First, two existence conditions for a concept are assumed. One is that a concept should have filled slots which are indispensable to the gist of the concept. The other condition is that a concept should occupy a continuous part of a sentence. This assumes that a user talks about a semantic item as a chunk of phrases. Second, the linguistic likelihood of a concept hypothesis is evaluated by a scoring method which considers linguistic dependency between phrases. This method is based on penalising linguistic features instead of using syntactic rules, in order to obtain less rigid syntactic constraints. If a new
concept hypothesis is produced, it is examined on the basis of all penalty rules. The total score of all concept hypotheses is evaluated as the linguistic likelihood of a meaning hypothesis. Some principles for defining penalty rules are shown in Table 1.
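A scorer following this penalty principle might look like the sketch below; the two concrete rules, their weights, and the phrase representation are illustrative, not taken from the paper:

```python
# Each rule inspects a list of phrase descriptions and returns a penalty.
def missing_key_particle(phrases):
    """Penalise phrases that fill a case slot but lack their key particle."""
    return sum(1.0 for p in phrases if p.get("case") and not p.get("particle"))

def semantic_mismatch(phrases):
    """Penalise a concept hypothesis whose phrases carry clashing semantics."""
    sems = {p["sem"] for p in phrases if "sem" in p}
    return 2.0 if len(sems) > 1 else 0.0

PENALTY_RULES = [missing_key_particle, semantic_mismatch]

def linguistic_likelihood(phrases):
    """Total penalty, negated: higher (closer to 0) is more well-formed."""
    return -sum(rule(phrases) for rule in PENALTY_RULES)
```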
Syntactic features:
• Deletion of key particle
• Inversion of attributive case and substantive case
• Adverbial case without predicative case
• Inadequate conjugation of verbs
• Inversion of predicative case and other cases
• Predicative case without other cases

Semantic features:
• Semantic mismatch between phrase candidates
• Abstract noun without modifiers

Table 1: Principles for defining penalty rules

The advantageous features of this semantic interpretation method are considered to be: (1) better coverage of sentential expressions than syntactic rules for a sentence; (2) suppression of a hypothesis explosion, by treating a concept as the target of semantic constraints; and (3) portability of commonly defined concepts to be shared across different tasks.

3   Integrating speech recognition
For integration with speech recognition, we use 'island-driven search' for detecting concept hypotheses (Figure 2).

3.1   Basic process
First, the speech recogniser, based on 'phrase spotting', sends a phrase lattice and pause hypotheses to the semantic interpreter. A concept lattice is then generated from the phrase lattice by the island-driven search. In this process, reliable phrase candidates are selected as seeds for growing concept hypotheses. Each concept hypothesis is extended both forward and backward, considering the existence of gaps, overlaps, and pauses. To select phrase candidates for the extension, several criteria concerning the concatenation of phrase candidates are used, as follows: (1) Gaps and overlaps between phrases are permitted if their length is within the permitted limit. (2) Pauses are permitted between phrases, considering gaps and overlaps, within the permitted limit. (3) Phrases which satisfy the two existence conditions of a concept are connected. (4) Both acoustic and linguistic likelihood are
given to a concept hypothesis whenever it is extended to integrate a phrase candidate. If the likelihoods are worse than their thresholds, the hypothesis is abandoned. Finally, meaning hypotheses for a whole sentence are generated by concatenating concept hypotheses in the concept lattice. This search is performed in a best-first manner. In connecting concept hypotheses, the linguistic likelihood of growing meaning hypotheses is also evaluated, and the existence of gaps, overlaps, and pauses between concept hypotheses is considered within the permitted limit. The linguistic scoring method evaluates growing concept hypotheses and abandons hopeless hypotheses. The total score of acoustic and linguistic likelihood is given as ST = aSL + (1 − a)SA, where ST is the total score, SL is the linguistic score, SA is the acoustic score, and a is the weighting factor.
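The combination is a simple convex mixture of the two likelihoods; a minimal sketch (the default value of the weighting factor is ours, not the paper's):

```python
def total_score(linguistic, acoustic, alpha=0.5):
    """S_T = alpha * S_L + (1 - alpha) * S_A, with weighting factor alpha."""
    return alpha * linguistic + (1.0 - alpha) * acoustic
```

A hypothesis that is strong acoustically but heavily penalised linguistically can thus be overtaken by one that is merely good on both measures.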
Fig. 2: Detecting concept hypotheses
3.2   Speech understanding experiments
Experiments were performed on 50 utterances of one male speaker on the Hotel Reservation task. The uttered sentences were made by 10 subjects instructed to produce conversational sentences with no limitation on sentential expressions. The average number of phrases was 5.8 per sentence. An intra-phrase grammar with a 356-word vocabulary is converted into phrase networks. For the spotting model, the phrase networks are joined to background models which allow all connections of words or phrases (Hanazawa 1995:2137-2140). Speaker-independent phonemic 'hidden Markov models' ('HMM's)
are used. Phrase lattices provided by speech recognition included 'false alarms' from 10 to 30 times the average number of input phrases. The standards for judging an answer correct are: (1) concepts and their boundaries are correctly detected; (2) cases are correctly assigned to phrase candidates; and (3) semantic values are correctly extracted. A best performance of 92% at the first rank was achieved, as shown in Table 2. This shows that the proposed semantic interpretation method is capable of robustly understanding various spoken sentences. Moreover, we see that using the total score improves the performance of speech understanding. This is because totalising both acoustic and linguistic likelihood improves the likelihood of a correct meaning hypothesis which is not always best in both acoustic and linguistic likelihood.

background model   rank 1  ≤ 2  ≤ 3  ≤ 4  ≤ 5
word     A T   82 80 84 82 86 88 86 88
phrase   A T   82 92 84 94 90 92 96

A: ordered with priority to acoustic score. T: total score.

Table 2: Understanding rate (%): 50 utterances of one male

These results, however, leave room for some discussion. First, performance was hardly improved in the case of the word background model, although the total score was used. The reason for this is that the constraints of the linguistic penalty rules were not powerful enough to exclude more false alarms than in the case of the phrase background model. The penalty rules have to be designed in more detail. Second, the errors were mainly caused in the following cases: (1) when the length of gaps exceeded the permitted limit, owing to deletion errors of particles and pauses, causing failure of phrase connection; and (2) when seeds for concept hypotheses were not detected in the seed selection stage. To cope with these errors, (1) speech recognition has to be improved using, for example, context-dependent precise HMMs, and (2) a search strategy considering the seed deletion error is required.
4   Improving search efficiency
In this section, we propose an improved search method which overcomes computational problems arising from seed deletion errors (Nagai 1994:558-563). In searching a phrase lattice, it is very important to perform an efficient search, selecting reliable phrase candidates at as high a rank as possible. But if only reliable candidates are selected in order to limit the search space, correct phrase candidates with lower likelihoods will be missed, just as with seed deletion errors. This compels us to lower the threshold to avoid the deletion error, and, as a result, the computational load suddenly increases. To solve this problem, the improved method quickly generates initial meaning hypotheses which allow the deletion of concepts. Then, these initial meaning hypotheses are repaired by re-searching for missing concepts, using prediction knowledge associated with the initial meaning hypotheses.
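At a high level, the generate-then-repair strategy can be sketched as below; the lattice items and the re-search callback are hypothetical data structures, not the paper's implementation:

```python
def improved_search(phrase_lattice, seed_threshold, research_section):
    """Build an initial hypothesis from reliable candidates only, then
    repair the deletion sections (uncovered spans) by re-searching them."""
    seeds = [p for p in phrase_lattice if p["score"] >= seed_threshold]
    initial = sorted(seeds, key=lambda p: p["start"])
    repaired, last_end = [], 0
    for p in initial:
        if p["start"] > last_end:  # a deletion section: re-search it
            repaired.extend(research_section(last_end, p["start"]))
        repaired.append(p)
        last_end = p["end"]
    return repaired
```

The point of the design is that the expensive full search is confined to the (usually short) deletion sections instead of the whole lattice.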
Fig. 3: Principle of improved search method
4.1   Basic process
The total process is composed of concept lattice generation, initial meaning hypothesis generation, acceptance decision, and the repairing process (Figure 3). To start with, the concept lattice is generated, using only a small number of reliable phrase candidates, by the concept lattice generation module. In this process, the number of concept hypotheses is also reduced to
improve the quality of the concept lattice. Next, the initial meaning hypothesis generation module generates meaning hypotheses which are incomplete as regards coverage of an utterance, but are reliable. Deletion sections are penalised in proportion to their length, because the initial meaning hypotheses should cover an utterance as widely as possible. Then, the acceptance decision module judges whether the initial meaning hypotheses are acceptable or not. Acceptable means that an initial meaning hypothesis satisfies two conditions: (1) it covers a whole utterance fully, and (2) it would not be possible to attain a better meaning hypothesis by re-searching the phrase lattice. This process is illustrated in Figure 4. The best likelihood attainable after repairing hypotheses (set A) can be estimated, since the maximum likelihood in re-searching deletion sections will be less than the seed threshold value.
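The acceptance test can be sketched as follows; the bound relies on the fact, stated above, that anything re-searched in a deletion section scores below the seed threshold, while the per-unit accounting of deletion length is our simplification:

```python
def repaired_upper_bound(score, deletion_length, seed_threshold):
    """Optimistic bound on the total likelihood an incomplete hypothesis
    could reach if every re-searched unit in its deletion sections scored
    just under the seed threshold."""
    return score + deletion_length * seed_threshold

def acceptable(full_coverage_score, incomplete, seed_threshold):
    """Accept a hypothesis covering the whole utterance if no incomplete
    rival could beat it even after repair; `incomplete` is a list of
    (score, deletion_length) pairs."""
    return all(
        full_coverage_score >= repaired_upper_bound(s, d, seed_threshold)
        for s, d in incomplete)
```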
Fig. 4: Acceptance decision

If the hypotheses are not acceptable, the repairing process module re-searches the phrase lattice for concepts in the limited search space of the deletion sections. There is, however, a risk of failing to detect concepts, because both concept hypotheses neighbouring a deletion section are considered not to be reliable. Therefore, additional meaning hypotheses are also generated to be repaired, assuming that such errors occur in either concept. We use a simple method to make these hypotheses: either concept hypothesis of the unreliable two is deleted and replaced with a new concept hypothesis which is re-searched and can fill the deletion. The search space of the re-searching process can be reduced by limiting concepts. Such concepts can be associated with both concept hypotheses and the intention of the initial meaning hypothesis which is already attained. In the case shown in Figure 5, for example, the concepts "Cancel"
CONCEPT-DRIVEN SEMANTIC INTERPRETATION
157
or "Distance" can be abandoned considering a situation where an intention "HOW MUCH" and concepts "Hotel Name", "Room Type", and "Cost" are obtained. As concept prediction knowledge, three kinds of coexistence relations are defined, which concern (1) an intention and a verb, (2) an intention and a concept, and (3) two concepts.
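The pruning illustrated in Figure 5 can be sketched with simple coexistence tables. The table contents and the function name below are invented for illustration and are not the system's actual prediction knowledge:

```python
# Hypothetical coexistence knowledge: pairs that may occur together.
INTENTION_CONCEPT = {
    ("HOW MUCH", "Cost"), ("HOW MUCH", "Hotel Name"), ("HOW MUCH", "Room Type"),
}
CONCEPT_CONCEPT = {
    ("Hotel Name", "Cost"), ("Room Type", "Cost"), ("Hotel Name", "Room Type"),
}

def predictable(candidate, intention, obtained_concepts):
    # Keep a candidate concept for re-searching only if it can coexist
    # with the attained intention and with every concept already obtained.
    if (intention, candidate) not in INTENTION_CONCEPT:
        return False
    return all((a, candidate) in CONCEPT_CONCEPT
               or (candidate, a) in CONCEPT_CONCEPT
               for a in obtained_concepts)
```

Under these toy tables, "Cost" survives the filter for the intention "HOW MUCH", while "Cancel" is abandoned, mirroring the situation described in the text.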
Fig. 5: Prediction of concepts
4.2 Speech understanding experiments
To evaluate search efficiency, an experimental comparison was performed on two search methods: the basic search method mentioned in section 3 and this improved search method. The former searches all phrase candidates after detecting seeds in the stage of generating the concept lattice, while the latter searches limited reliable phrase candidates and re-searches predicted concepts if deletion sections exist. Experimental conditions were similar to those in section 3, but the number of false alarms in the phrase lattice was increased for the purpose of clarifying differences in processing time. The spotting model was the phrase background model. Thirteen types of intention were used. Table 3 shows the results of the baseline method without the re-searching technique, and Table 4 shows the results for the improved search method. 'Seeds' in Table 3 means seeds for concept hypotheses in generating concept lattices, while 'seeds' in Table 4 means reliable phrase candidates for generating initial meaning hypotheses. CPU times were measured on a DEC ALPHA 3600 workstation.
# seeds   rate (%): 1st rank   < 5th   CPU time (s.)
100 88 98 15.6
30 88 96 14.2
20 88 96 16.9
15 90 96 12.3
10 84 90 11.2
5 66 72 6.0
Table 3: Understanding rate and processing time: baseline search method, 50 utterances of one male. Number of false alarms: max. 227, ave. 75

# seeds   rate (%): 1st rank   < 5th   CPU time (s.)   # utterances repaired
30 88 98 1.7 2
20 88 96 1.2 3
15 88 96 3.1 10
10 84 94 3.8 13
5 64 76 3.7 27
Table 4: Understanding rate and processing time: improved search method, 50 utterances of one male

These results show that the proposed search method using the repairing technique achieved a successful reduction in processing time. Moreover, the repairing process effectively kept the understanding rate almost equal to that of the baseline method in cases where deletion errors occurred owing to a small number of seeds. Processing time, however, tends to increase with the number of repetitions of the repairing process. One reason for this is that the constraints of concept prediction were not very powerful in the Hotel Reservation task: relations between concepts and intentions are only slightly exclusive, because most concepts can coexist as parameter values for retrieving the hotel database. If this method is applied to a task where the relations between concepts and intentions are more distinct, for example a task where interrogative adverbs appear frequently, the constraints of the concepts should become stronger. There is ample room for further improvement in the re-search method for repairing initial meaning hypotheses. The present method does not use information concerning the two concept hypotheses neighbouring a deletion section, but only replaces them with re-searched concept hypotheses. Using this information would help reduce the search space in the repairing process. One possible improvement would be to try to extend both concept hypotheses in order to judge whether a better likelihood can be obtained before replacing them.
5 Concluding remarks
We proposed a two-stage semantic interpretation method for robustly understanding spontaneous speech and described its integration with speech recognition. In this approach, the proposed concept has three roles: as a robust interpreter of various partial expressions, as a target of semantic constraints, and as a basic unit for understanding a whole meaning. This semantic interpretation was successfully integrated with speech recognition by island-driven lattice search for generating a concept lattice and exploiting linguistic scoring knowledge. This baseline system achieved good performance with a 92% understanding rate at the first rank. Moreover, we developed an efficient search method which quickly generates initial meaning hypotheses allowing deletion errors of correct concepts, and repairs them by re-searching for missing concepts using prediction knowledge associated with the initial meaning hypotheses. This technique reduced search processing time considerably, to approximately one-tenth, in an experimental comparison with the baseline method. Future enhancements will include: (1) detailed design of general linguistic knowledge for scoring the linguistic likelihood of concepts, (2) evaluation of this semantic interpretation as applied to other tasks using spontaneous speech data from naive speakers, (3) development of an interpretation method for 'complex sentences' (Nagai 1996: Forthcoming), and (4) dealing with 'unknown words'.

REFERENCES

Baggia, Paolo & Claudio Rullent. 1993. "Partial Parsing as Robust Parsing Strategy". Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP'93), Minneapolis, Minn., vol. II, 123-126. New York: The Institute of Electrical and Electronics Engineers (IEEE).

Goodine, David, Eric Brill, James Glass, Christine Pao, Michael Phillips, Joseph Polifroni, Stephanie Seneff & Victor Zue. 1994. "GALAXY: A Human-Language Interface to On-Line Travel Information".
Proceedings of the International Conference on Spoken Language Processing (ICSLP'94), Yokohama, Japan, vol. II, 707-710. Tokyo: The Acoustical Society of Japan.

Hanazawa, Toshiyuki, Yoshiharu Abe & Kunio Nakajima. 1995. "Phrase Spotting using Pitch Pattern Information". Proceedings of the 4th European Conference on Speech Communication and Technology (EUROSPEECH'95), Madrid, Spain, vol. III, 2137-2140. Madrid: Graficas Brens.
Nagai, Akito, Yasushi Ishikawa & Kunio Nakajima. 1994a. "A Semantic Interpretation Based on Detecting Concepts for Spontaneous Speech Understanding". Proceedings of the International Conference on Spoken Language Processing (ICSLP'94), Yokohama, Japan, vol. I, 95-98. Tokyo: The Acoustical Society of Japan.

Nagai, Akito, Yasushi Ishikawa & Kunio Nakajima. 1994b. "Concept-Driven Semantic Interpretation for Robust Spontaneous Speech Understanding". Proceedings of the Fifth Australian International Conference on Speech Science and Technology (SST'94), Perth, W.A., Australia, vol. I, 558-563. Perth: Univ. of Western Australia.

Nagai, Akito, Yasushi Ishikawa & Kunio Nakajima. Forthcoming. "Integration of Concept-Driven Semantic Interpretation with Speech Recognition". To appear in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP'96), Atlanta, Ga.

Seneff, Stephanie. 1992. "A Relaxation Method for Understanding Spontaneous Speech Utterances". Proceedings of the Defence Advanced Research Projects Agency (DARPA) Speech and Natural Language Workshop, Harriman, N.Y., 299-304. San Mateo, Calif.: Morgan Kaufmann.

Stallard, David & Robert Bobrow. 1992. "Fragment Processing in the DELPHI System". Proceedings of the Defence Advanced Research Projects Agency (DARPA) Speech and Natural Language Workshop, Harriman, N.Y., 305-310. San Mateo, Calif.: Morgan Kaufmann.

Ward, Wayne & Sheryl R. Young. 1993. "Flexible Use of Semantic Constraints in Speech Recognition". Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP'93), Minneapolis, Minn., vol. II, 49-50. New York: The Institute of Electrical and Electronics Engineers (IEEE).
A Proposal for Word Sense Disambiguation Using Conceptual Distance

ENEKO AGIRRE¹ & GERMAN RIGAU²
Euskal Herriko Unibertsitatea & Universitat Politecnica de Catalunya

Abstract

This paper presents a method for the resolution of lexical ambiguity and its automatic evaluation over the Brown Corpus. The method relies on the use of the wide-coverage noun taxonomy of WordNet and the notion of conceptual distance among concepts, captured by a Conceptual Density formula developed for this purpose. This fully automatic method requires no hand coding of lexical entries, hand tagging of text, nor any kind of training process. The results of the experiment have been automatically evaluated against SemCor, the sense-tagged version of the Brown Corpus.

1 Introduction
Word sense disambiguation is a long-standing problem in Computational Linguistics. Much recent work in lexical ambiguity resolution offers the prospect that a disambiguation system might be able to receive as input unrestricted text and tag each word with the most likely sense with fairly reasonable accuracy and efficiency. The most common approach is to attempt to use the context of the word to be disambiguated together with information about each of its word senses to solve this problem. Several interesting experiments in lexical ambiguity resolution have been performed in recent years using preexisting lexical knowledge resources. Cowie et al. (1992) and Guthrie et al. (1993) describe a method for lexical disambiguation of text using the definitions in the machine-readable version of the LDOCE dictionary, as in the method described in Lesk (1986), but using simulated annealing for efficiency reasons. Yarowsky (1992) combines the use of the Grolier encyclopaedia as a training corpus with the categories of Roget's International Thesaurus to create a statistical model for the word sense disambiguation problem, with excellent results. Wilks et al. (1993) perform several interesting statistical disambiguation experiments

1 Eneko Agirre was supported by a grant from the Basque Government.
2 German Rigau was supported by a grant from the Ministerio de Educación y Ciencia.
162
ENEKO AGIRRE & GERMAN RIGAU
using co-occurrence data collected from LDOCE. Sussna (1993), Voorhees (1993) and Richardson et al. (1994) define disambiguation programs based on WordNet with the goal of improving precision and coverage during document indexing. Although each of these techniques looks somewhat promising for disambiguation, they have either been applied only to a small number of words, to a few sentences, or not to a public domain corpus. For this reason we have tried to disambiguate all the nouns from real texts in the public domain sense-tagged version of the Brown Corpus (Francis & Kucera 1967; Miller et al. 1993), also called Semantic Concordance or SemCor for short. We also use a public domain lexical knowledge source, WordNet (Miller 1990). The advantage of this approach is clear, as SemCor provides an appropriate environment for testing our procedures in a fully automatic way. It also defines, for the purpose of this study, word sense as the sense present in WordNet. This paper presents a general automatic decision procedure for lexical ambiguity resolution based on a formula of the conceptual distance among concepts: Conceptual Density. The system needs to know how words are clustered in semantic classes, and how semantic classes are hierarchically organised. For this purpose, we have used a broad semantic taxonomy for English, WordNet. Given a piece of text from the Brown Corpus, our system tries to resolve the lexical ambiguity of nouns by finding the combination of senses from a set of contiguous nouns that maximises the total Conceptual Density among senses. Even if this technique is presented as stand-alone, it is our belief, following the ideas of McRoy (1992), that full-fledged lexical ambiguity resolution should combine several information sources; Conceptual Density might be only one piece of evidence for the plausibility of a certain word sense. Following this introduction, Section 2 presents the semantic knowledge sources used by the system.
Section 3 is devoted to the definition of Conceptual Density. Section 4 shows the disambiguation algorithm used in the experiment. In Section 5, we explain and evaluate the experiment performed. In the last section some conclusions are drawn.

2 WordNet and the semantic concordance
Sense is not a well-defined concept and often has subtle distinctions in topic, register, dialect, collocation, part of speech, etc. For the purpose of this study, we take as the senses of a word those present in WordNet
A PROPOSAL FOR WSD USING CD
163
version 1.4. WordNet is an on-line lexicon based on psycholinguistic theories (Miller 1990). It comprises nouns, verbs, adjectives and adverbs, organised in terms of their meanings around semantic relations, which include, among others, synonymy and antonymy, hypernymy and hyponymy, meronymy and holonymy. Lexicalised concepts, represented as sets of synonyms called synsets, are the basic elements of WordNet. The senses of a word are represented by synsets, one for each word sense. The version used in this work, WordNet 1.4, contains 83,800 words, 63,300 synsets (word senses) and 87,600 links between concepts. The nominal part of WordNet can be viewed as a tangled hierarchy of hypo-/hypernymy relations. Nominal relations also include three kinds of meronymic relations, which can be paraphrased as member-of, made-of and component-part-of. SemCor (Miller et al. 1993) is a corpus where a single part-of-speech tag and a single word sense tag (which corresponds to a WordNet synset) have been included for all open-class words. SemCor is a subset taken from the Brown Corpus (Francis & Kucera 1967) which comprises approximately 250,000 words out of a total of 1 million words. The coverage in WordNet of the senses for open-class words in SemCor reaches 96% according to the authors. The tagging was done manually, and the error rate measured by the authors is around 10% for polysemous words.

3 Conceptual density and word sense disambiguation
A measure of the relatedness among concepts can be a valuable prediction knowledge source for several decisions in Natural Language Processing. For example, the relatedness of a certain word sense to the context allows us to select that sense over the others, and actually disambiguate the word. Relatedness can be measured by a fine-grained conceptual distance (Miller & Teibel 1991) among concepts in a hierarchical semantic net such as WordNet. This measure would make it possible to reliably discover the lexical cohesion of a given set of words in English. Conceptual distance tries to provide a basis for determining closeness in meaning among words, taking as reference a structured hierarchical net. Conceptual distance between two concepts is defined in Rada et al. (1989) as the length of the shortest path that connects the concepts in a hierarchical semantic net. In a similar approach, Sussna (1993) employs the notion of conceptual distance between network nodes in order to improve precision during document indexing. Following these ideas, Agirre et al. (1994)
describe a new conceptual distance formula for the automatic spelling correction problem, and Rigau (1994), using this conceptual distance formula, presents a methodology to enrich dictionary senses with semantic tags extracted from WordNet. The measure of conceptual distance among concepts we are looking for should be sensitive to:
- the length of the shortest path that connects the concepts involved;
- the depth in the hierarchy: concepts in a deeper part of the hierarchy should be ranked closer;
- the density of concepts in the hierarchy: concepts in a dense part of the hierarchy are relatively closer than those in a more sparse region;
- and the measure should be independent of the number of concepts we are measuring.
We have experimented with several formulas that follow the four criteria presented above. Currently, we are working with the Conceptual Density formula, which compares areas of sub-hierarchies.
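The Rada et al. notion of conceptual distance mentioned above (shortest-path length in the semantic net) can be sketched as a breadth-first search over an undirected concept graph. The toy taxonomy below is invented for illustration:

```python
from collections import deque

def conceptual_distance(graph, a, b):
    # Length of the shortest path between concepts a and b in a
    # hierarchical semantic net, following Rada et al.'s definition;
    # edges are treated as undirected.
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == b:
            return dist
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, dist + 1))
    return None  # no path: the concepts are unrelated in this net

# Toy taxonomy (child -> parents), symmetrised for the search.
edges = {"oak": ["tree"], "pine": ["tree"], "tree": ["plant"], "rose": ["plant"]}
graph = {}
for child, parents in edges.items():
    for p in parents:
        graph.setdefault(child, []).append(p)
        graph.setdefault(p, []).append(child)
```

In this toy net, oak and pine (siblings under tree) are at distance 2, while oak and rose are at distance 3, illustrating the first criterion in the list above; the Conceptual Density formula of the next section refines this path-based measure with depth and density.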
Word to be disambiguated: W
Context words: w1 w2 w3 w4 ...

Fig. 1: Senses of a word in WordNet

As an example of how Conceptual Density can help to disambiguate a word, in Figure 1 the word W has four senses and several context words. Each sense of the words belongs to a sub-hierarchy of WordNet. The dots in the sub-hierarchies represent the senses of either the word to be disambiguated (W) or the words in the context. Conceptual Density will yield the highest density for the sub-hierarchy containing the most of those senses, relative to the total amount of senses in the sub-hierarchy. The sense of W contained in the sub-hierarchy with highest Conceptual Density will be chosen as the
sense disambiguating W in the given context. In Figure 1, sense2 would be chosen.
Given a concept c at the top of a sub-hierarchy, and given nhyp and h (the mean number of hyponyms per node and the height of the sub-hierarchy, respectively), the Conceptual Density for c when its sub-hierarchy contains a number m (marks) of senses of the words to disambiguate is given by the formula below:

    CD(c, m) = ( sum_{i=0}^{m-1} nhyp^i ) / descendants_c        (1)

The numerator expresses the expected area for a sub-hierarchy containing m marks (senses of the words to be disambiguated), while the divisor is the actual area; that is, the formula gives the ratio between weighted marks below c and the number of descendant senses of concept c. In this way, formula 1 captures the relation between the weighted marks in the sub-hierarchy and the total area of the sub-hierarchy below c. The weight given to the marks tries to express that the height and the number of marks should be proportional. nhyp is computed for each concept in WordNet in such a way as to satisfy equation 2, which expresses the relation among height, averaged number of hyponyms of each sense and total number of senses in a sub-hierarchy if it were homogeneous and regular:

    descendants_c = sum_{i=0}^{h-1} nhyp^i                       (2)

Thus, if we had a concept c with a sub-hierarchy of height 5 and 31 descendants, equation 2 will hold that nhyp is 2 for c. Conceptual Density weights the number of senses of the words to be disambiguated in order to make density equal to 1 when the number m of senses below c is equal to the height of the hierarchy h, to make density smaller than 1 if m is smaller than h, and to make density bigger than 1 whenever m is bigger than h. The density can be kept constant for different m's provided a certain proportion between the number of marks m and the height h of the sub-hierarchy is maintained. Both hierarchies A and B in Figure 2, for instance, have Conceptual Density 1.³ In order to tune the Conceptual Density formula, we have made several experiments adding two parameters, α and β.
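Formulas 1 and 2 can be checked numerically. The sketch below is an assumption about how one might implement them (not the authors' code): it solves equation 2 for nhyp by bisection and evaluates formula 1:

```python
def nhyp_for(descendants, h):
    # Solve sum_{i=0}^{h-1} x**i = descendants for x (equation 2)
    # by bisection; x >= 1 for any non-trivial hierarchy.
    lo, hi = 1.0, float(descendants)
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if sum(mid ** i for i in range(h)) < descendants:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def conceptual_density(nhyp, descendants, m):
    # Formula 1: expected area for m marks over the actual area.
    return sum(nhyp ** i for i in range(m)) / descendants
```

For the example in the text (height 5, 31 descendants), bisection recovers nhyp = 2, and with m = h = 5 marks the density is exactly 1, matching the behaviour the paragraph describes.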
The α parameter modifies the strength of the exponential i in the numerator, because h ranges between 1 and 16 (the maximum number of levels in WordNet) while m ranges between 1 and the total number of senses in WordNet. Adding a constant β to nhyp, we tried to discover the role of the averaged number of hyponyms per concept. Formula 3 shows the resulting formula:

    CD(c, m) = ( sum_{i=0}^{m-1} (nhyp + β)^(i^α) ) / descendants_c    (3)

After an extended number of runs which were automatically checked, the results showed that β does not affect the behaviour of the formula, a strong indication that the formula is not sensitive to constant variations in the number of hyponyms. On the contrary, different values of α affect the performance consistently, yielding the best results in those experiments with α near 0.20. The actual formula which was used in the experiments was thus the following:

    CD(c, m) = ( sum_{i=0}^{m-1} nhyp^(i^0.20) ) / descendants_c       (4)

Fig. 2: Two hierarchies with CD

3 From formulas 1 and 2 we have:

4 The disambiguation algorithm using conceptual density
Given a window size, the program moves the window one word at a time from the beginning of the document towards its end, disambiguating in each step the word in the middle of the window and considering the other words in the window as context. The algorithm to disambiguate a given word w in the middle of a window of words W roughly proceeds as follows. First, the algorithm represents in a lattice the nouns present in the window, their senses and hypernyms (step 1). Then, the program computes the Conceptual Density of each concept in WordNet according to the senses it contains in its sub-hierarchy (step 2). It selects the concept c with highest density (step 3) and selects the senses
below it as the correct senses for the respective words (step 4). If a word from W:
- has a single sense under c, it has already been disambiguated;
- has no such sense, it is still ambiguous;
- has more than one such sense, we can eliminate all the other senses of w, but have not yet completely disambiguated w.
The algorithm then proceeds to compute the density for the remaining senses in the lattice, and continues to disambiguate words in W (back to steps 2, 3 and 4). When no further disambiguation is possible, the senses left for w are processed and the result is presented (step 5). To illustrate the process, consider the text in Figure 3, extracted from SemCor.

The jury(2) praised the administration(3) and operation(8) of the Atlanta Police_Department(1), the Fulton_Tax_Commissioner's_Office, the Bellwood and Alpharetta prison_farms(1), Grady_Hospital and the Fulton_Health_Department.

Fig. 3: Sample sentence from SemCor

The underlined words are nouns represented in WordNet, with the number of senses between brackets. The noun to be disambiguated in our example is operation, and a window size of five will be used. Each step goes as follows:
Step 1: Figure 4 partially shows the lattice for the example sentence. As prison_farm appears in a different hierarchy, we do not show it in the figure. The concepts in WordNet are represented as lists of synonyms. Word senses to be disambiguated are shown in bold. Underlined concepts are those selected with highest Conceptual Density. Monosemous nouns have sense number 0.
Step 2: One concept, for instance, has underneath 3 senses to be disambiguated and a sub-hierarchy size of 96, and therefore gets a Conceptual Density of 0.256; meanwhile, another, with 2 senses and a sub-hierarchy size of 86, gets 0.062.
Step 3: The concept with the highest Conceptual Density is selected.
Step 4: In the example, operation_3, police_department_0 and jury_1 are the senses chosen for operation, Police_Department and jury. All the other concepts below the selected concept are marked so that they are no longer selected. Other senses of those words are deleted from the lattice, e.g., jury_2. In the next loop of the algorithm the selected concept will have only one word to disambiguate below it, and therefore its density will be much
lower. At this point the algorithm detects that further disambiguation is not possible, and quits the loop.
Step 5: The algorithm has disambiguated operation_3, police_department_0, jury_1 and prison_farm_0 (because this word is monosemous in WordNet), but the word administration is still ambiguous. The output of the algorithm, thus, will be that the sense for operation in this context, i.e., for this window, is operation_3.

Fig. 4: Partial lattice for the sample sentence (synset chains linking police_department_0, jury_1, operation_3, administration_1 and jury_2 up through concepts such as administrative unit, social group, people and group)

The disambiguation window will move rightwards, and the algorithm will try to disambiguate Police_Department, taking as context administration, operation, prison_farms and whichever noun is first in the next sentence. The disambiguation algorithm has an intermediate outcome between completely disambiguating a word and failing to do so: in some cases the algorithm returns several possible senses for a word. In this experiment we treat these cases as failure to disambiguate.

5 The experiment
We selected one text from SemCor at random: br-a01 from the genre "Press: Reportage". This text is 2079 words long and contains 564 nouns. Of these, 100 were not found in WordNet. Of the 464 nouns in WordNet, 149 are monosemous (32%).
<s>
<wd>jury<sn>[noun.group.0]NN
<wd>administration<sn>[noun.act.0]NN
<wd>operation<sn>[noun.state.0]NN
<wd>Police_Department<sn>[noun.group.0]NN
<wd>prison_farms<mwd>prison_farm<msn>[noun.artifact.0]NN
Fig. 5: SemCor format

jury administration operation Police_Department prison_farm
Fig. 6: Input words

The text plays both the role of input file (without semantic tags) and of (tagged) test file. When it is treated as input file, we throw away all non-noun words, leaving only the lemmas of the nouns present in WordNet. The program does not face syntactic ambiguity, as the disambiguated part-of-speech information is in the input file. Multiple-word entries are also available in the input file, as long as they are present in WordNet. Proper nouns have a similar treatment: we only consider those that can be found in WordNet. Figure 5 shows the way the algorithm would input the example sentence in Figure 3 after stripping non-noun words. After erasing the irrelevant information we get the words shown in Figure 6.⁴ The algorithm then produces a file with sense tags that can be compared automatically with the original file (cf. Figure 5).
Deciding the optimum context size for disambiguating using Conceptual Density is an important issue. One could assume that the more context there is, the better the disambiguation results would be. Our experiment shows that precision⁵ increases for bigger windows, until it reaches window size 15, where it stabilises, starting to decrease for sizes bigger than 25 (cf. Figure 7). Coverage over polysemous nouns behaves similarly, but with a more significant improvement. It tends to reach its maximum over 80%, decreasing for window sizes bigger than 20. Precision is given in terms of polysemous nouns only. The graphs are drawn against the size of the context⁶ that was taken into account when disambiguating.
Figure 7 also shows the guessing baseline, given when selecting senses at random. First, it was calculated analytically using the polysemy counts for

4 Note that we already have the knowledge that police and prison farm are compound nouns, and that the lemma of prison farms is prison farm.
5 Precision is defined as the ratio between correctly disambiguated senses and the total number of answered senses. Coverage is given by the ratio between the total number of answered senses and the total number of senses.
6 Context size is given in terms of nouns.
Fig. 7: Precision and coverage

% w=25    polysemic   overall
Cover.    83.2        88.6
Prec.     47.3        66.4
Recall    39.4        58.8

Table 1: Overall data for the best window size

the file, which gave 30% of precision. This result was checked experimentally by running the algorithm ten times over the file, which confirmed the previous result. We also compare the performance of our algorithm with that of the 'most frequent' heuristic. The frequency counts for each sense were collected using the rest of SemCor, and then applied to the text. While the precision is similar to that of our algorithm, the coverage is nearly 10% worse. All the data for the best window size can be seen in Table 1. The precision and coverage shown in the preceding graph were for polysemous nouns only. If we also include monosemous nouns, precision rises from 47.3% to 66.4%, and coverage increases from 83.2% to 88.6%.
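The overall figures in Table 1 follow arithmetically from the polysemous figures plus the 149 monosemous nouns, which are always answered and always correct. The short check below reproduces them under that assumption:

```python
nouns = 464            # nouns found in WordNet
mono = 149             # monosemous nouns: answered and correct by definition
poly = nouns - mono    # 315 polysemous nouns

answered_poly = 0.832 * poly           # polysemic coverage: 83.2%
correct_poly = 0.473 * answered_poly   # polysemic precision: 47.3%

coverage = 100 * (answered_poly + mono) / nouns
precision = 100 * (correct_poly + mono) / (answered_poly + mono)
recall = 100 * (correct_poly + mono) / nouns

print(round(coverage, 1), round(precision, 1), round(recall, 1))  # 88.6 66.4 58.8
```

The computed 88.6 / 66.4 / 58.8 match the "overall" column of Table 1, confirming that the overall scores are the polysemous scores diluted by the trivially correct monosemous nouns.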
6 Conclusions
The automatic method for the disambiguation of nouns presented in this paper is ready to use in any general domain and on free-running text, given part-of-speech tags. It does not need any training and uses word sense tags from WordNet, an extensively used lexical database. The algorithm is theoretically motivated and founded, and offers a general measure of the
semantic relatedness for any number of nouns in a text. In the experiment, the algorithm disambiguated one text (2079 words long) of SemCor, a subset of the Brown Corpus. The results were obtained by automatically comparing the tags in SemCor with those computed by the algorithm, which allows comparison with other disambiguation methods. The results are promising, considering the difficulty of the task (free-running text, large number of senses per word in WordNet) and the lack of any discourse structure in the texts. More extensive experiments on additional SemCor texts, including among others the use of meronymic links, testing of homograph-level disambiguation and direct comparison with other approaches, are reported in Agirre et al. (1996). This methodology has also been used for disambiguating nominal entries of bilingual MRDs against WordNet (Rigau & Agirre 1995).

Acknowledgements. We wish to thank all the staff of the CRL and especially Jim Cowie, Joe Guthrie, Louise Guthrie and David Farwell. We would also like to thank Ander Murua for mathematical assistance, Xabier Arregi, Jose Mari Arriola, Xabier Artola, Arantxa Diaz de Ilarraza, Kepa Sarasola, and Aitor Soroa from the Computer Science Department of EHU, and Francesc Ribas, Horacio Rodriguez and Alicia Ageno from the Computer Science Department of UPC.

REFERENCES

Agirre, Eneko, Xabier Arregi, Arantza Diaz de Ilarraza & Kepa Sarasola. 1994. "Conceptual Distance and Automatic Spelling Correction". Workshop on Speech Recognition and Handwriting, 1-8. Leeds, U.K.

Agirre, Eneko & German Rigau. 1996. An Experiment in Word Sense Disambiguation of the Brown Corpus Using WordNet. Technical Report (MCCS-96-291). Las Cruces, New Mexico: Computing Research Laboratory, New Mexico State University.

Cowie, Jim, Joe Guthrie & Louise Guthrie. 1992. "Lexical Disambiguation Using Simulated Annealing". Proceedings of the DARPA Workshop on Speech and Natural Language, 238-242.

Francis, Nelson & Henry Kucera.
1982. Frequency Analysis of English Usage: Lexicon and Grammar. Boston, Mass.: Houghton-Mifflin.

Guthrie, Louise, Joe Guthrie & Jim Cowie. 1993. Resolving Lexical Ambiguity. Technical Report (MCCS-93-260). Las Cruces, New Mexico: Computing Research Laboratory, New Mexico State University.
Lesk, Michael. 1986. "Automatic Sense Disambiguation: How to Tell a Pine Cone from an Ice Cream Cone". Proceedings of the 1986 SIGDOC Conference, Association of Computing Machinery, 24-26.

McRoy, Susan W. 1992. "Using Multiple Knowledge Sources for Word Sense Discrimination". Computational Linguistics 18:1.1-30.

Miller, George A. 1990. "Five Papers on WordNet". Special Issue of the International Journal of Lexicography 3:4.

Miller, George A. & Daniel A. Teibel. 1991. "A Proposal for Lexical Disambiguation". Proceedings of the DARPA Workshop on Speech and Natural Language, 395-399.

Miller, George A., Claudia Leacock, Randee Tengi & Ross T. Bunker. 1993. "A Semantic Concordance". Proceedings of the DARPA Workshop on Human Language Technology, 303-308.

Rada, Roy, Hafedh Mili, Ellen Bicknell & Maria Blettner. 1989. "Development and Application of a Metric on Semantic Nets". IEEE Transactions on Systems, Man and Cybernetics 19:1.17-30.

Richardson, Ray, Allan F. Smeaton & John Murphy. 1994. Using WordNet as a Knowledge Base for Measuring Semantic Similarity between Words. Technical Report (CA-1294). Dublin, Ireland: School of Computer Applications, Dublin City University.

Rigau, German. 1995. "An Experiment on Semantic Tagging of Dictionary Definitions". Workshop "The Future of the Dictionary". Uriage-les-Bains, France.

Rigau, German & Eneko Agirre. 1995. "Disambiguating Bilingual Nominal Entries against WordNet". Proceedings of the Computational Lexicon Workshop, 7th European Summer School in Logic, Language and Information, 71-82. Barcelona, Spain.

Sussna, Michael. 1993. "Word Sense Disambiguation for Free-text Indexing Using a Massive Semantic Network". Proceedings of the 2nd International Conference on Information and Knowledge Management, 67-74. Arlington, Virginia, U.S.A.

Voorhees, Ellen. 1993. "Using WordNet to Disambiguate Word Senses for Text Retrieval". Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 171-180.

Wilks, Yorick et al.
1993. "Providing Machine Tractable Dictionary Tools". Semantics and the Lexicon ed. by James Pustejovsky, 341-401. Yarowsky, David. 1992. "Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora", Proceedings of the ARPA Workshop on Human Language Technology, 266-271.
An Episodic Memory for Understanding and Learning

OLIVIER FERRET* & BRIGITTE GRAU* **
*LIMSI-CNRS
**IIE-CNAM
Abstract

In this article we examine the incorporation of pragmatic knowledge learning in natural language understanding systems. We argue that this kind of learning can and should be done incrementally. In order to do so we present a model that is able simultaneously to build a case library and to prepare the abstraction of schemata which represent general situations. Learning takes place on the basis of narratives whose representations are collected in an episodic memory.

1 Introduction
Text understanding requires pragmatic knowledge about stereotypical situations. One must go beyond the information given so that inferences can be performed to make explicit the links between utterances. By determining the relations between individual utterances, the global representation of the entire text can be computed. Unless one is dealing with specific domains, it is not reasonable to assume that a system has a priori all the information needed. In most cases texts are made of known and unknown bits and pieces of information. Text analysis is therefore best viewed as a complex process in which understanding and learning take place, and which must improve itself (Schank 1982). Methods of reasoning that are exclusively analytic are no longer sufficient to ensure the understanding of texts, as these typically include new situations. Hence alternatives such as synthetic and analogical reasoning, which use more contextualised knowledge, are also needed. Thus, a memory model dedicated to general knowledge must be extended with an episodic component that organises specific situations, and must be able to take into account the constraints coming from combining the understanding and learning processes. In the domain of learning pragmatic knowledge from texts, the shortcomings of one-dimensional approaches such as Similarity-Based Learning — IPP (Lebowitz 1983) — or Explanation-Based Learning — GENESIS (Mooney & DeJong 1985) — have become apparent and have given way to a multistrategy approach. OCCAM (Pazzani 1988) is an attempt in this
direction as it uses Similarity-Based Learning techniques in order to complete a domain theory for an Explanation-Based Learning process. Despite their differences, all these approaches share the same goal and means: each new causal representation constructed by the system is generalised as soon as possible in order to classify it on the basis of the system's background knowledge. However, learning is not an all-or-nothing process. We follow Vygotsky's (Vygotsky 1962) views on learning, namely, that learning is an incremental process whereby general knowledge is abstracted on the basis of cumulative, successive experiences (in our case, the representations of texts). In this perspective, generalisations should not occur every time a new situation is encountered. Rather, we suggest storing the representations in a buffer, the episodic memory, where abstraction takes place at a later stage. The result of this abstraction process is a graph of schemata, akin to the MOPs introduced by Schank (Schank 1982). Before us, other researchers have made related proposals. Case-Based Reasoning (CBR) systems such as SWALE (Schank & Leake 1989) and AQUA (Ram 1993) have been designed in order to exploit the kind of representations we are talking about. However, these systems start out with a lot of knowledge. They do not model the incremental aspect we are proposing, namely that an abstraction should be performed only when sufficiently reinforced information has been accumulated. Furthermore, the memory structure of these systems is fixed a priori. Thus, the criteria for determining whether a case can be considered as representative cannot be dynamically determined. Despite these shortcomings, CBR systems remain a very good model in the context of learning and must be taken into account when specifying a dynamic episodic memory.

2 Structure of the episodic memory

2.1 Text representation
Before examining the structure of the episodic memory, we will consider the form of its basic component: the text representations. In our case these representations come from short narratives such as the following.

A few years ago, [I was in a department store in Harlem](1) [with a few hundred people around me](2). [I was signing copies of my book "Stride toward Freedom"](3) [which relates the boycott of buses in Montgomery in 1955-56](4). Suddenly, while [I was appending my signature to a page](5), [I felt a pointed thing sinking brutally into my chest](6). [I had just been stabbed with a paper knife by
a woman](7) [who was acknowledged as mad afterwards](8). [I was taken immediately to the Harlem Hospital](9) [where I stayed on a bed during long hours](10) while [many preparations were made](11) [in order to remove the weapon from my body](12).

Revolution Non-Violente by Martin Luther King (based on a French version of the original text)
The texts' underlying meanings are expressed in terms of conceptual graphs (Sowa 1984). The clauses are organised according to the situations mentioned in the texts (see Figure 1). Hence, each of these situations (a dedication meeting in a department store, a murder attempt and a stay in hospital in our example) corresponds to a Thematic Unit (TU).
Fig. 1: The representation of the text about Martin Luther King (propositions 6 and 7, as well as 3 and 5, are each joined into one conceptual graph; this is made possible by the definition graphs associated with the concept types)

A text representation, which we call an episode, is a structured set of TUs which are thematically linked in either one of two ways:

• thematic deviation: this relation means that a part of a situation is elaborated. In our example, the hospital situation is considered to be a deviation from the murder attempt because these two situations are thematically related to Martin Luther King's wound. More precisely, a deviation is attached to one of the graphs of a TU. Here, the Hospital TU is connected to graph (9), expressing that Martin Luther King is taken to the hospital.

• thematic shift: this relation characterises the introduction of a new situation. In our example, there is a thematic shift between the dedication meeting situation and the murder attempt because they are not intrinsically tied together, fortunately for book writers.

Among all the TUs of an episode, at least one has the status of main topic (MT). In the Martin Luther King text, the Murder Attempt TU plays this role. More generally, a main topic is determined by applying heuristics based on the type of the links between the TUs (Grau 1984). TUs have a structure. Depending on the aspect of the situation they describe, graphs are distributed among three slots:
• circumstances (C): states under which the situation occurs;
• description (D): actions which characterise the situation;
• outcomes (O): states resulting from the situation.
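As an illustration only (the original system stores conceptual graphs, and names such as `ThematicUnit` are our own, not the authors'), the three-slot organisation of a TU can be sketched as:

```python
from dataclasses import dataclass, field

@dataclass
class ThematicUnit:
    """A Thematic Unit: the graphs of a situation, split over three slots."""
    circumstances: list = field(default_factory=list)  # states under which the situation occurs
    description: list = field(default_factory=list)    # actions which characterise the situation
    outcomes: list = field(default_factory=list)       # states resulting from the situation

    def is_valid(self) -> bool:
        # A TU is valid only if its description slot is not empty.
        return bool(self.description)

# The Hospital TU of Figure 1: the text gives no circumstances or outcomes,
# so those slots simply stay empty.
hospital = ThematicUnit(description=["taken-to-hospital (9)", "stay-on-bed (10)",
                                     "preparations (11)", "remove-weapon (12)"])
```

A TU with an empty description slot would be rejected as invalid by this check.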
A TU is valid only if its description slot is not empty. Nevertheless, as shown in the example, certain slots may remain empty if the corresponding information is not present in the text. Inside the description slot, graphs may be linked by temporal and causal relations. For example, in the Hospital TU graphically represented in Figure 1, graphs (10) and (11) are causally tied to graph (12). Text representations have so far been built manually. However, preliminary studies show that this analysis could be done automatically without using any particular abstract schemata. A CBR mechanism using both text representations and linguistic clues (such as connectives, temporal markers or other cohesive devices) is under study.

2.2 The episodic memory
The structure of the episodic memory is governed by one major principle: all similar elements are stored in the same structure. As a result, accumulation occurs and implicit generalisations are made by reinforcing the recurrent features of the episodes or the situations. This principle is applied to the episodes and the TUs, and the memory is organised by storing this information accordingly. That is, similar episodes and similar TUs are grouped so as to build aggregated episodes in one case and aggregated TUs in the other. We show an example of the memory in Figure 2. Episode 1 and episode 2, which talk about the same topic, a murder attempt with a knife, have been grouped together in one aggregated episode. In this episode, the TUs that describe more specifically the murder attempt have been gathered in the same aggregated TU. It should be noted that TUs coming from different episodes without being their main topic can still be grouped in the same aggregated TU (see the Scuffle TU or the Speech TU in Figure 2). The principle of aggregation is not applied at the memory scale for smaller elements such as concepts or graphs. Aggregated graphs exist in the memory, but their scope is limited to the slot of the aggregated TU containing them. An aggregated graph gathers only those similar graphs that belong to the same slot of similar TUs coming from different episodes. Similarly, an aggregated concept makes no sense in isolation from the aggregated graph of which it is part; hence, it cannot be found in another graph.
It is in fact the product of a generalisation applied to concepts which resemble each other in the context of graphs which are also considered to be similar. This explains why the accumulation process can be viewed as the first step of a generalisation process.
Fig. 2: The episodic memory

For instance, in the aggregated graph (a) of the description slot below (see Figure 3), Stab has Man for agent, because the type Man is the result of the aggregation of the more specific types Soldier and Young-man. On the other hand, we have no aggregated concept for recipient because the aggregation was unsuccessful for Arm and Stomach. The accumulation process has been designed in such a way as to make apparent the most relevant features of the situations by reinforcing them. This is done by storing similar elements in the same structure and by assigning them a weight. This weight quantifies the degree of recurrence of an element. Figure 3 shows these weights for aggregated graphs and aggregated concepts. These weights characterise the relative importance of aggregated graphs with regard to the aggregated TU, and the relative importance of aggregated concepts with regard to the aggregated graph. This principle of accumulation also holds for the relations between the entities. This is shown in Figure 3 for causal relations in the aggregated graphs. In a description slot, temporal and causal relations coming from different episodes are also aggregated, and similarly for the thematic relations between the TUs of an episode. This example illustrates not only the accumulative dimension of our memory model but also its potential for being a case library. Even though aggregated concepts are generalisations, they still maintain a link to the
[Figure 3 (caption below) displays the Murder Attempt aggregated TU: aggregated graphs such as [Quarrel], [Stab], [Arrest], [Attack], [Stumble], [Hit], [Located], [Wounded] and [Dead], distributed over the Circumstances, Description and Outcomes slots, together with their weighted aggregated concepts and relations.]

Legend: [Stab]: predicate of an aggregated graph. (1.0): weight value. [man]: aggregated concept. (agent): aggregated relation. soldier [1]: a concept, i.e., an instance, occurring in episode 1; it is linked to the aggregated concept above it. (recipient) [1,2]: a relation which occurs in episodes 1 and 2; it is linked to the aggregated relation above it.
Fig. 3: An aggregated TU (the Murder Attempt TU of Figure 2)

concepts from which they have been built². Thus, following the references to the episodes, we know that the agent of the Stab predicate in episode 1 is a Soldier. Hence, a Case-Based Reasoner will be able to use this fact in order to exploit the specific situations stored in the aggregates and improve an automatic comprehension process. Such a reasoner could use the aggregated information and the specific information simultaneously. The former would be used to evaluate the relative importance of a piece of data, and the latter to reason more precisely on the basis of similarities and differences. The multidimensional aspect of this model also has implications for the way of retrieving information from the memory when it is used as a case
² Unlike the aggregated concepts, concepts in texts, i.e., instances, may belong to several graphs and are therefore starting points for roles.
library. Unlike most CBR systems, the library here has a relatively flat structure: similar episodes and similar TUs are simply grouped together. Aggregated episodes can be considered as typical contexts for the aggregated TUs, which are the central elements, but there is no structural means (for instance, a hierarchical structure of relevant features) for searching a case. This operation is achieved in an associative way by a spreading activation mechanism which works on all the different knowledge levels. The interaction between the concepts and the structures of the memory (aggregated episodes, aggregated TUs or schemata) leads to a stabilised activation configuration from which the cases with the highest activation level are selected. This process is akin to what Lange and Dyer (Lange & Dyer 1989) call evidential activation. In our case, the weights upon which the propagation is based are those that characterise an element's relative importance in our memory model. This mechanism presents two major advantages from the search-phase point of view. First of all, no a priori indexing is necessary. This is useful in a learning situation where the setting is not stable. Secondly, a syntactic match is performed at the same time.
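The associative retrieval just described can be caricatured as follows. This is a deliberately minimal sketch of weighted spreading activation, not the authors' actual mechanism (which works over all knowledge levels and iterates to a stabilised configuration); all node names, weights and parameters here are invented for the example:

```python
def spread_activation(links, seeds, decay=0.5, iterations=10):
    """Propagate activation from seed concepts through weighted links.

    links: node -> list of (neighbour, weight); weights stand in for the
    relative-importance weights of the memory model.
    seeds: initially activated nodes (e.g., the concepts found in a new text).
    """
    activation = dict(seeds)
    for _ in range(iterations):
        incoming = {}
        for node, act in activation.items():
            for neighbour, weight in links.get(node, ()):
                incoming[neighbour] = incoming.get(neighbour, 0.0) + decay * weight * act
        # keep the stronger of current and freshly propagated activation
        for node, act in incoming.items():
            activation[node] = max(activation.get(node, 0.0), act)
    return activation

# Toy memory: two aggregated TUs reachable from the concepts of a new text.
links = {"stab": [("TU:murder-attempt", 1.0)],
         "knife": [("TU:murder-attempt", 1.0)],
         "book": [("TU:dedication", 1.0)]}
act = spread_activation(links, {"stab": 1.0, "knife": 1.0})
best = max((n for n in act if n.startswith("TU:")), key=act.get)
```

No indexing is needed: whichever aggregated structure accumulates the most activation from the text's concepts is selected.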
3 Episode matching and memorisation
When the building of the text's underlying meaning representation is completed, one, or possibly several, memorised episodes have been selected by the spreading activation mechanism. They are related either to the text's main situation, the main TU, or to a secondary one. Matching episodes thus amounts to comparing memorised TUs with TUs of the text. In this section we examine under what conditions TUs are similar.

3.1 Similarity of TUs
The relative similarity between two TUs depends on the degree of their slot matching. We proceed in two steps. First we compute two ratios obtained from the number of similar graphs, relative to the number of graphs present in the memorised slot and to the number of graphs in the text slot. Thus, we first evaluate each slot as a whole by comparing these ratios with an interval of thresholds [t1, t2] we have established. When the two ratios are under the lower limit, the similarity is rejected: neither the memorised slot nor the text slot contains a sufficient number of common points with regard to their differences. If one of these two ratios is above
the upper limit, the proportion of common points of one slot or the other is sufficient to consider the slots as highly similar. If both ratios happen to be within the interval, we conclude in favour of a moderate similarity that has to be evaluated by another, more precise method. In this case, we compute a score based on the importance of the graphs inside the slots. This computation is described in detail in the next section. When this score is above another given threshold t3, we conclude that there is a high similarity. Thus, two slots sharing an average number of graphs can be very similar if these graphs are important for this slot. The thresholds are parameters of the system. In the current version, t1 = 0.5, t2 = 0.8 and t3 = 0.7. Finally, two TUs are similar if they correspond to any of the following rules:

R1: highly similar circumstances and moderately similar description;
R2: similar circumstances and similar outcomes, with at least one of the two dimensions highly similar;
R3: moderately similar description and highly similar outcomes;
R4: highly similar description.
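Our reading of this two-step procedure and of rules R1-R4 can be summarised in the following sketch; the function names, and the encoding of "similar" in R2 as "not rejected", are our own interpretation of the text:

```python
T1, T2, T3 = 0.5, 0.8, 0.7  # thresholds of the current version

def slot_similarity(n_similar, n_mem, n_txt, score_fn=None):
    """Classify slot similarity as 'rejected', 'moderate' or 'high'."""
    r_mem = n_similar / n_mem if n_mem else 0.0
    r_txt = n_similar / n_txt if n_txt else 0.0
    if r_mem < T1 and r_txt < T1:
        return "rejected"
    if r_mem > T2 or r_txt > T2:
        return "high"
    # both ratios fall within [t1, t2]: use the finer weight-based score
    score = score_fn() if score_fn else 0.0
    return "high" if score > T3 else "moderate"

def similar_tus(circ, desc, out):
    """Rules R1-R4 over the three slot classifications."""
    if circ == "high" and desc == "moderate":
        return True                                          # R1
    if circ != "rejected" and out != "rejected" and "high" in (circ, out):
        return True                                          # R2
    if desc == "moderate" and out == "high":
        return True                                          # R3
    return desc == "high"                                    # R4
```

For instance, a slot sharing 3 of 5 graphs on each side (ratio 0.6) lands in the interval and is only moderate unless its weight-based score exceeds t3.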
3.2 Similarity of slots and similarity of graphs
The score of a slot is based on the scores of its similar graphs, weighted by their relative importance within the slot. We compute the score of two graphs only when they contain the same predicate and at least one similar concept related by an equivalent causal relation. Two concepts are similar if the most specific abstraction of their types is less than the concept type of the canonical graph. By definition, the graphs we compare are derived from the same canonical graph and, for each relation, their concept types are restrictions of the same type inside this canonical graph. In the comparison of two concepts, if the aggregated one does not exist, the resulting type is the one which abstracts the maximum number of concept occurrences. Thus, the evaluation function for the similarity of two graphs containing the same predicate is the following:

SimGraph(g, g') = Σi wci · SimConcept(ci, c'i) / Σi wci
with SimConcept(ci, c'i) = 1 when the concepts are similar and 0 otherwise, where wci is the weight of the concept ci inside the memorised graph and the ci are the concepts other than the predicate.
Two graphs, g and g', are similar if SimGraph(g, g') > 0. The weight wci is either the weight of the aggregated concept or the sum of the weights of the regrouped occurrences. The following illustrates the computation of the similarity between the graph (a) of the description slot in Figure 3 and the graph of the Martin Luther King text which has the same predicate (it corresponds to clauses 6 and 7):

[Stab] — (agent) → [woman]
       — (recipient) → [chest]
       — (part) → [man]
       — (instrument) → [paper-knife]
       — (manner) → [brutally]

SimGraph = (1.0 · SimConcept(man, woman) + 0.5 · SimConcept(chest, stomach or arm)
            + 1.0 · SimConcept(man, man) + 1.0 · SimConcept(knife, paper-knife)) / 3.5
         = (1.0 + 0.0 + 1.0 + 1.0) / 3.5 ≈ 0.86
We can now define the evaluation function for two identically named slots as follows:

SimSlot(txtslot, memslot) = Σi wpi · SimGraph(txtgi, memgi) / Σi wpi

where wpi is the weight of the aggregated predicate and only pairs of graphs with SimGraph(txtgi, memgi) > 0 are taken into account.
The possible presence of a chronological order between graphs in the description slots does not intervene in the similarity evaluation. We do not want to favour one unfolding of events over another, the various combinations having actually occurred in the original situations. More generally, the way in which the similarity between structures is computed resembles Kolodner and Simpson's (Kolodner & Simpson 1989) method, with the computation of an aggregate match score. There are, however, two big differences: first of all, the similarity is context dependent because the relative importance of any element is always evaluated within the context of the component to which it belongs. Second, this importance can change, since it is represented by the recurrence of the element and not by a hierarchy of features established on a priori grounds. Because situations are not related in the same way, nor with the same level of precision, the structure of episodes may be different even if they deal with the same topic. For instance, a TU may be detailed by another TU in one episode and not in another one. Hence, graphs that could be matched may be found in two different TUs, as we can see in Figure 4. This peculiarity must be taken into account when we compare two slots. We do so by first recognising similar graphs in identically named slots; then we try to find the remaining graphs in the appropriate slots of a possibly
Fig. 4: Matching two different structures (C: circumstances, D: description, O: outcomes; in the memorised TUs, TU2 gives details concerning the circumstances of TU1)

detailed TU. For example, when examining the similarity of the circumstance slots of the text TU and TU1 in Figure 4, the remaining states (g2) are searched for either in the outcomes slot of an associated TU (TU2), or in the resulting states of the actions in its description slot. This process will be applied to the remaining graphs of the text and to those of the memorised TU. The difference of structure is bypassed during the computation of the similarity measure, but it will not be neglected during the aggregation process. In such cases, the aggregation of the first similar graphs will take place while the other similar graphs will be represented in their respective TUs. No strengthening of the structure between the concerned TUs will occur.

3.3 Memorisation of an episode: The aggregation process
The spreading activation process leads to the selection of memorised episodes which are ordered according to their activation level. To decide whether one of these is a good candidate for aggregation with the incoming episode, even if this aggregation is only partial, we have to find similar TUs between them. Episodes can be aggregated only if their principal TUs are similar. If this similarity is rejected, we are brought back to the sole aggregation of TUs and the incoming episode leads to the creation of a new aggregated episode. Otherwise, the process continues in order to decide whether the topic structuring of the studied text is similar to the structuring of the held episode. If similar secondary TUs are found in the same relation network, their links will be reinforced accordingly. This last part of the process is applied even if no match is found at the episode level. The reinforcement of such links means that a more general context than a single TU is recurrent. Whatever level of matching is recognised, TUs are aggregated. In doing so, the graphs of the text TU are memorised according to the slot they belong to and to the result of the similarity process. If new predicates appear,
the corresponding graphs are added to the memorised slot with a weight equal to 1 divided by the number of times the TU has been aggregated. Graphs which contain an existing predicate but whose similarity has been rejected are joined with no strengthening of the predicate. New concepts related to existing causal relations are related to the corresponding aggregated concept. Existing aggregated concepts, which are the abstractions amalgamating the maximum number of occurrences, may be questioned when a new concept is added to a graph. If any of them no longer fulfils this definitional constraint, it is suppressed. Pre-generalisation and reinforcement occur when the graphs are similar. As a result, the weight of the predicate increases. According to the results of the similarity process, aggregated concepts may evolve and become more abstract. The weights of the modified concepts inside the graphs are computed so as to always equal the number of times the concept has been strengthened, divided by the number of the predicate's aggregations. The result of the aggregation of the Stab graph (see 3.2) coming from the Martin Luther King text (episode 5) with the Stab aggregated graph of the Murder Attempt aggregated TU (see Figure 3) is shown below:

[Stab] (1.0) —
  (agent) (1.0) → [human] (1.0); (agent) [1,2,5]: soldier [1], young-man [2], woman [5]
  (recipient) (1.0) → [ ]; (recipient) [1,2,5]: arm (0.33) [1], stomach (0.33) [2], chest (0.33) [5]
  (part) (1.0) → [man] (1.0); (part) [1,2,5]: head-of-state [1], young-man [2], man [5]
  (instrument) (1.0) → [knife] (1.0); (instrument) [1,2,5]: bayonet [1], flick knife [2], paper-knife [5]
4 Conclusion
Natural Language Understanding systems must be conceived in a learning perspective if they are not designed for a specific purpose. Within this approach, we argue that learning is an incremental process based on the memorisation of past experiences. That is why we have focused our work on the elaboration and implementation of an episodic memory that is able to account for progressive generalisations by aggregating similar situations and reinforcing recurrent structures. This memory model also constitutes a case library for analogical reasoning. It is characterised by the two levels of cases it provides. These cases give different sorts of information: on the one hand, specific cases can be used as sources, given the richness coming from the situations they represent. On the other hand, the aggregated cases,
being a more reliable source of knowledge, guide and validate the retrieval and the use of the specific cases. More generally, our approach prepares the induction of schemata and the selection of their general features, a step which is still necessary to stabilise and organise abstract knowledge. This approach also provides a robust model of learning insofar as it allows for weak text understanding. Even misunderstandings resulting from an incomplete domain theory will be compensated for on the basis of the treatment of many texts involving analogous subjects.

REFERENCES

Grau, Brigitte. 1984. "Stalking Coherence in the Topical Jungle". Proceedings of the 5th Generation Computer Systems Conference (FGCS'84). Tokyo, Japan.
Kolodner, Janet L. & R.L. Simpson. 1989. "The MEDIATOR: Analysis of an Early Case-Based Problem Solver". Cognitive Science 13:4.507-549.
Lange, Trent E. & Michael G. Dyer. 1989. "High-level Inferencing in a Connectionist Network". Connection Science 1:2.181-217.
Lebowitz, Michael. 1983. "Generalization from Natural Language Text". Cognitive Science 7.1-40.
Mooney, Raymond & Gerald DeJong. 1985. "Learning Schemata for Natural Language Processing". Proceedings of the 9th International Joint Conference on Artificial Intelligence (IJCAI'85), Los Angeles, 681-687.
Pazzani, Michael J. 1988. "Integrating Explanation-based and Empirical Learning Methods in OCCAM". Third European Working Session on Learning (EWSL'88) ed. by Derek Sleeman, 147-165.
Ram, Ashwin. 1993. "Indexing, Elaboration and Refinement: Incremental Learning of Explanatory Cases". Machine Learning (Special Issue on Case-Based Reasoning) ed. by Janet L. Kolodner, 10:3.201-248.
Schank, Roger C. 1982. Dynamic Memory: A Theory of Reminding and Learning in Computers and People. New York: Cambridge University Press.
Schank, Roger C. & David B. Leake. 1989. "Creativity and Learning in a Case-Based Explainer". Artificial Intelligence (Special Volume on Machine Learning) ed. by Jaime G. Carbonell, 40:1-3.353-385.
Sowa, John F. 1984. Conceptual Structures: Information Processing in Mind and Machine. Reading, Mass.: Addison-Wesley.
Vygotsky, Lev S. 1962. Thought and Language. Cambridge, Mass.: MIT Press.
Ambiguities & Ambiguity Labelling: Towards Ambiguity Data Bases

CHRISTIAN BOITET* & MUTSUKO TOMOKIYO**

*GETA, CLIPS, IMAG (UJF, CNRS & INPG)
**ATR Interpreting Telecommunications
Abstract

This paper has been prepared in the context of the MIDDIM project (ATR-CNRS). It introduces the concept of 'ambiguity labelling', and proposes a precise, text-processor-oriented format for labelling 'pieces' such as dialogues and texts. Several notions concerning ambiguities are made precise, and many examples are given. The ambiguities labelled are meant to be those which state-of-the-art speech analysers are believed not to be able to solve, and which would have to be solved interactively to produce the correct analysis. The proposed labelling has been specified with a view to storing the labelled pieces in a data base, in order to estimate the frequency of various types of ambiguities, the importance of solving them in the envisaged contexts, the scope of disambiguation decisions, and the knowledge needed for disambiguation. A complete example is given. Finally, an equivalent data-base-oriented format is sketched.

1 Introduction
As has been argued in detail in (Boitet 1993; Boitet & Loken-Kim 1993), interactive disambiguation technology must be developed in the context of research towards practical Interpreting Telecommunications systems as well as high-quality multi-target text translation systems. In the case of speech translation, this is because the state of the art in the foreseeable future is such that a black-box approach to spoken language analysis (speech recognition plus linguistic parsing) is likely to give a correct output for no more than 50 to 60% of the utterances ('Viterbi consistency' (Black, Garside & Leech 1993))¹, while users would presumably require an overall success rate of at least 90% to be able to use such systems at all. However, the same spoken language analysers may be able to produce
¹ According to a study by Cohen & Oviatt, the combined success rate is bigger than the product of the individual success rates by about 10% in the middle range. Using a formula such as S2 = S1*S1 + (1-S1)*A with A = 20%, we get:
sets of outputs containing the correct one in about 90% of the cases ('structural consistency' (Black, Garside & Leech 1993))². In the remaining cases, the system would be unable to analyse the input, or no output would be correct. Interactive disambiguation by the users of the interpretation or translation systems is then seen as a practical way to reach the necessary success rate. It must be stressed that interactive disambiguation is not to be used to solve all ambiguities. On the contrary, as many ambiguities as possible should be reduced automatically. The remaining ones should be solved by interaction as far as practically possible. What is left would have to be reduced automatically again, by using preferences and defaults. In other words, this research is complementary to the research in automatic disambiguation. Our stand is simply that, given the best automatic methods currently available, which use syntactic and semantic restrictions, limitations of lexicon and word senses by the generic task at hand, as well as prosodic and pragmatic cues, too many ambiguities will remain after automatic analysis, and the 'best' result will not be the correct one in too many cases. We suppose that the system will use a state-of-the-art language-based speech recogniser and multilevel analyser, producing syntactic, semantic and pragmatic information. We leave open two possibilities:

• an expert system specialised in the task at hand may be available;
• an expert human interpreter/translator may be called for help over the network.

The questions we want to address in this context are the following:

• what kinds of ambiguities (unsolvable by state-of-the-art speech analysers) are there in dialogues and texts to be handled by the envisaged systems?
• what are the possible methods of interactive disambiguation, for each ambiguity type?
• how can a system determine whether it is important or not for the overall communication goal to disambiguate a given ambiguity?
2
SR of 1 component (S1):  40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95% 100%
SR of combination (S2):  28% 31% 35% 39% 44% 49% 55% 61% 68% 75% 83% 91% 100%
50~60% overall Viterbi consistency corresponds then to 65~75% individual success rate, which is already optimistic. According to the preceding table, this corresponds to a structural consistency of 95% for each component, which seems impossible to attain by strictly automatic means in practical applications involving general users.
AMBIGUITIES & AMBIGUITY LABELLING
• what kind of knowledge is necessary to solve a given ambiguity, or, in other words, whom should the system ask: the user, the interpreter, or the expert system, if any?
• in a given dialogue or document, how far do solutions to ambiguities carry over: to the end of the piece, to a limited distance, or not at all?

In order to answer these questions, it seems necessary to build a data base of ambiguities occurring in the intended contexts. In this report, we are not interested in any specific data base management software, but in the collection of data, that is, in 'ambiguity labelling'. First, we make more precise several notions, such as ambiguous representation, ambiguity, ambiguity kernel, ambiguity type, etc. Second, we specify the attributes and values used for manual labelling, and give a text processor oriented format. Third, we give a complete example of ambiguity labelling of a short dialogue, with comments. Finally, we define a data-base oriented exchange format.

2 A formal view of ambiguities
2.1 Levels and contexts of ambiguities
2.1.1 Three levels of granularity for ambiguity labelling
First, we distinguish three levels of granularity for considering ambiguities. There is an ambiguity at the level of a dialogue (resp. a text) if it can be segmented in at least two different ways into turns (resp. paragraphs). We speak of ambiguity of segmentation into turns or into paragraphs. There is an ambiguity at the level of a turn (resp. a paragraph) if it can be segmented in at least two different ways into utterances (we use the term 'utterance' for dialogues and texts, to stress that the 'units of analysis' are not always sentences, but may be titles, interjections, etc.). We speak of ambiguity of segmentation into utterances. There is an ambiguity at the level of an utterance if it can be analysed in at least two different ways, whereby the analysis is performed in view of translation into one or several languages in the context of a certain generic task. There are various types of utterance-level ambiguities.

Ambiguities of segmentation into paragraphs may occur in written texts, if, for example, there is a separation by a (new-line) character only, without an explicit (paragraph) mark. They are much more frequent and problematic in dialogues.
For example, in ATR's transcriptions of Wizard of Oz interpretation dialogues (Park, Loken-KIM, Mizunashi & Fais 1995), there are an agent (A), a client (C), and an interpreter (I). In many cases, there are two successive turns of I, one in Japanese and one in English. Sometimes, there are even three in a row (ATR-ITL 1994: J-E-J-32, E-J-J-33). If I does not help the system by pressing a button, this ambiguity will force the system to do language identification every time there may be a change of language. There are also cases of two successive turns by C (ATR-ITL 1994: E-27), and even three by A (ATR-ITL 1994: J-52) and I (ATR-ITL 1994: J-E-J-55, E-E-J-80) or four (ATR-ITL 1994: I,E-J-E-J-99). Studying these ambiguities is important for discourse analysis, which assumes a correct analysis in terms of turns. Also, if successive turns in the same language are collapsed, this may add ambiguities of segmentation into utterances, leading in turn to more utterance-level ambiguities.

Ambiguities of segmentation into utterances are very frequent, and most annoying, as we assume that the analysers will work utterance by utterance, even if they have access to the result of processing of the preceding context. There are for instance several examples of "right |? now |? turn left...". Or (Park, Loken-KIM, Mizunashi & Fais 1995:50): "OK |? so go back and is this number three |? right there |? shall I wait here for the bus?".

An utterance may be spoken or written, may be a sentence, a phrase, a sequence of words, syllables, etc. In the usual sense, there is an ambiguity in an utterance if there are at least two ways of understanding it. This, however, does not give us a precise criterion for defining ambiguities, and even less so for labelling them and storing them as objects in a data base.
Because human understanding heavily depends on the context and the communicative situation, it is indeed a very common experience that something is ambiguous for one person and not for another. Hence, we say that an utterance is ambiguous if it has an ambiguous representation in some formal representation system. We return to that later.

2.1.2 Task-derived limitations on utterance-level ambiguities
As far as utterance-level ambiguities are concerned, we will consider only those which we feel should be produced by any state-of-the-art analyser constrained by the task. For instance, we should not consider that "good morning" is ambiguous with "good mourning", in a conference registration task. It could be different in the case of funeral arrangements.
Because the analyser is supposed to be state-of-the-art, "help" should not give rise to the possible meaning "help oneself" in "can I help you". Knowledge of the valencies and semantic restrictions on arguments of the verb "help" should eliminate this possibility. In the same way, "Please state your phone number" should not be deemed ambiguous, as no complete analysis should allow "state" to be a noun, or "phone" to be a verb. That could be different in a context where "state" could be construed as a proper noun, "State", for example in a dialogue where the State Department is involved. However, we should consider as ambiguous such cases as: "Please state (N/V) office phone number" (ATR-ITL 1994:33), where "phone" as a verb could be eliminated on grammatical grounds, but not "state office phone" as a noun, with "number" as a verb in the imperative form. The case would of course be different if the transcription contained prosodic marks, but the point would continue to hold in general.

2.1.3 Necessity to consider utterance-level ambiguities in the context of full utterances
Let us take another example. Consider the utterance:
(1) Do you know where the international telephone services are located?
The underlined fragment has an ambiguity of attachment, because it has two different 'skeleton' representations (Black, Garside & Leech 1993):
[international telephone] services / international [telephone services]
As a title, this sequence presents the same ambiguity. However, it is not enough to consider it in isolation. Take for example:
(2) The international telephone services many countries.
The ambiguity has disappeared! It is indeed frequent that an ambiguity relative to a fragment appears, disappears and reappears as one broadens its context in an utterance. For example, in
(3) The international telephone services many countries have established are very reliable.
the ambiguity has reappeared. From the examples above, we see that, in order to define properly what an ambiguity is, we must consider the fragment within an utterance, and clarify the idea that the fragment is the smallest (within the utterance) in which the ambiguity can be observed.
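The appearing-and-disappearing behaviour of such attachment ambiguities can be checked mechanically by counting parses. A minimal sketch in Python (the toy grammar and category names are our own assumptions, not those of any analyser discussed here):

```python
lexicon = {"international": {"Adj"}, "telephone": {"Nom"}, "services": {"Nom"}}
# Toy binary rules: nominal compounding, and adjectival modification of a nominal
rules = {("Adj", "Nom"): "Nom", ("Nom", "Nom"): "Nom"}

def count_parses(words, goal="Nom"):
    """CKY-style chart that counts distinct parse trees of `words` as `goal`."""
    n = len(words)
    chart = {(i, i + 1): {c: 1 for c in lexicon[w]} for i, w in enumerate(words)}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = {}
            for k in range(i + 1, j):         # split point
                for (l, r), parent in rules.items():
                    c = chart[(i, k)].get(l, 0) * chart[(k, j)].get(r, 0)
                    if c:
                        cell[parent] = cell.get(parent, 0) + c
            chart[(i, j)] = cell
    return chart[(0, n)].get(goal, 0)

print(count_parses("international telephone services".split()))  # 2
```

With two parse counts for the fragment and one for an unambiguous context, the chart makes the fragment-in-context behaviour observable rather than a matter of intuition.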
2.2 Representation systems
2.2.1 Types of formal representation systems
Classical representation systems are based on lists of binary features, flat or complex attribute structures (property lists), labeled or decorated trees, various types of feature-structures, graphs or networks, and logical formulae. What is an 'ambiguous representation'? This question is not as trivial as it seems, because it is often not clear what we exactly mean by 'the' representation of an utterance. In the case of a classical context-free grammar G, shall we say that a representation of U is any tree T associated to U via G, or that it is the set of all such trees? Usually, linguists say that U has several representations with reference to G. But if we use f-structures with disjunctions, U will always have one (or zero!) associated structure S. Then, we would like to say that S is ambiguous if it contains at least one disjunction. Returning to G, we might then say that 'the' representation of U is a disjunction of trees T. In practice, however, developers prefer to use hybrid data structures to represent utterances. Trees decorated with various types of structures are very popular. For speech and language processing, lattices bearing such trees are also used, which means at least 3 levels at which a representation may be ambiguous.

2.2.2 Computable representations and 'reasonable' analysers
Now, we are still left with two questions:
1. which representation system(s) do we choose?
2. how do we determine the representation or representations of a particular utterance in a specific representation system?
The answer to the first question is a practical one. The representation system(s) must be fine-grained enough to allow the intended operations. For instance, text-to-speech requires less detail than translation. On the other hand, it is counter-productive to make too many distinctions. For example, what is the use of defining a system of 1000 semantic features if no system and no lexicographers may assign them to terms in an efficient and reliable way? There is also a matter of taste and consensus. Although different representation systems may be formally equivalent, researchers and developers have their preferences. Finally, we should prefer representations amenable to efficient computer processing.

As far as the second question is concerned, two aspects should be distinguished. First, the consensus on a representation system goes with a consensus on its semantics. This means that people using a particular representation system should develop guidelines enabling them to decide which representations an utterance should have, at each level, and to create them by hand if challenged to do so. Second, these guidelines should be refined to the point where they may be used to specify and implement a parser producing all and only the intended representations for any utterance in the intended domain of discourse.

A 'computable' representation system is a representation system for which a 'reasonable' parser can be developed. A 'reasonable' parser is a parser such that:
• its size and time complexity are tractable over the class of intended utterances;
• if it is not yet completed, assumptions about its ultimate capabilities, especially about its disambiguation capabilities, are realistic given the state of the art.

Suppose, then, that we have defined a computable representation. We may not have the resources to build an adequate parser for it, or the one we have built may not yet be adequate. In that case, given the fact that we are specifying what the parser should and could produce, we may anticipate and say that an utterance presents an ambiguity of such and such types. This only means that we expect that an adequate parser will produce an ambiguous representation for the utterance at the considered level.

2.2.3 Expectations for a system of manual labelling
Our manual labelling should be such that:
• it is compatible with the representation systems used by the actual or intended analysers.
• it is clear and simple enough for linguists to do the labelling in a reliable way and in a reasonable amount of time.
Representation systems may concern one or several levels of linguistic analysis. We will hence say that an utterance is phonetically ambiguous if it has an ambiguous phonetic representation, or if the phonetic part of its description in a 'multilevel' representation system is ambiguous, and so forth for all the levels of linguistic analysis, from phonetic to orthographic, morphological, morphosyntactic, syntagmatic, functional, logical, semantic, and pragmatic.
In the labelling, we should only be concerned with the final result of analysis, not in any intermediate stage, because we want to retain only ambiguities which would remain unsolved after the complete automatic analysis process has been performed.

2.3 Ambiguous representations
A representation will be said to be ambiguous if it is multiple or underspecified.

2.3.1 Proper representations
In all known representation systems, it is possible to define 'proper representations', extracted from the usual representations, and ambiguity-free. For example, if we represent "We read books" by the unique decorated dependency tree:

[["We"    ((lex "I-Pro") (cat pronoun) (person 1) (number plur) ...)]
 "read"   ((lex "read-V") (cat verb) (person 1) (number plur) (tense ({pres past})) ...)
 ["books" ((lex "book-N") (cat noun) ...)]]
there would be 2 proper representations, one with (tense pres), and the other with (tense past). For defining the proper representations of a representation system, it is necessary to specify which disjunctions are exclusive, and which are inclusive.

Proper and multiple representations. A representation in a formal representation system is proper if it contains no exclusive disjunction. The set of proper representations associated to a representation R is obtained by expanding all exclusive disjunctions of R (and eliminating duplicates); it is denoted here by Proper(R). R is multiple if |Proper(R)| > 1, that is, if (and only if) it is not proper.
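The expansion of exclusive disjunctions into Proper(R) can be sketched as follows (a minimal sketch under our own assumption of a flat attribute-value encoding in which a Python set marks an exclusive disjunction):

```python
from itertools import product

def proper(rep):
    """Expand every exclusive disjunction of rep and return Proper(rep)."""
    keys = list(rep)
    choices = [sorted(rep[k]) if isinstance(rep[k], set) else [rep[k]]
               for k in keys]
    return [dict(zip(keys, combo)) for combo in product(*choices)]

# The decoration of "read" above, with the exclusive disjunction {pres past}
R = {"lex": "read-V", "cat": "verb", "person": 1, "number": "plur",
     "tense": {"pres", "past"}}
expansions = proper(R)
print(len(expansions))  # 2: one proper representation per tense value
```

Since |Proper(R)| = 2 > 1, this R is multiple, hence ambiguous in the sense just defined.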
2.3.2 Underspecified representations
A proper representation P is underspecified if it is undefined with respect to some necessary information.
There are two cases: the information may be specified, but its value is unknown, or it is missing altogether. The first case often happens in the case of anaphors: (ref ?), or in the case where some information has not been exactly computed, e.g. (task_domain ?), (decade.of.month ?), but is necessary for translating into at least one of the considered target languages. It is quite natural to consider this as ambiguous. For example, an anaphoric reference should be said to be ambiguous
• if several possible referents appear in the representation, which will give rise to several proper representations,
• and also if the referent is simply marked as unknown, which causes no disjunction.
The second case may never occur in representations such as Ariane-G5 decorated trees, where all attributes are always present in each decoration. But, in a standard f-structure, there is no way to force the presence of an attribute, so that a necessary attribute may be missing: then, (ref ?) is equivalent to the absence of the attribute ref. For any formal representation system, then, we must specify what the 'necessary information' is. Contrary to what is needed for defining Proper(R), this may vary with the intended application.

2.3.3 Ambiguous representations
Our final definition is now simple to state. A representation R is ambiguous if it is multiple or if Proper(R) contains an underspecified P.

2.4 Scope, occurrence, kernel and type of ambiguity
2.4.1 Informal presentation
Although we have said that ambiguities have to be considered in the context of the utterances, it is clear that a sequence like "international telephone services" is ambiguous in the same way in utterances (1) and (3) above. We will call this an 'ambiguity kernel', and reserve the term 'ambiguities' for what we will label, that is, occurrences of ambiguities. The distinction is the same as that between dictionary words and text words. It is also clear that another sequence, such as "important business addresses", would present the same sort of ambiguity in analogous contexts. This we want to define as 'ambiguity type'. In this case, linguists speak of
'ambiguity of attachment', or 'structural ambiguity'. Other types concern the acceptions (word senses), the functions (syntactic or semantic), etc. Our list will be given with the specification of the labelling conventions. Ambiguity patterns are more specific kinds of ambiguity types, usable to trigger disambiguation actions, such as the production of a certain kind of disambiguating dialogue. For example, there may be various patterns of structural ambiguities.

2.4.2 Scope of an ambiguity
We take it for granted that, for each considered representation system, we know how to define, for each fragment V of an utterance U having a proper representation P, the part of P which represents V. For example, given a context-free grammar and an associated tree structure P for U, the part of P representing a substring V of U is the smallest sub-tree Q containing all leaves corresponding to V. Q is not necessarily the whole subtree of P rooted at the root of Q. Conversely, for each part Q of P, we suppose that we know how to define the fragment V of U represented by Q.

a. Scope of an ambiguity of underspecification
Let P be a proper representation of U. Q is a minimal underspecified part of P if it does not contain any strictly smaller underspecified part Q'. Let P be a proper representation of U and Q be a minimal underspecified part of P. The scope of the ambiguity of underspecification exhibited by Q is the fragment V represented by Q. In the case of an anaphoric element, Q will presumably correspond to one word or term V. In the case of an indeterminacy of semantic relation (deep case), e.g. on some argument of a predicate, Q would correspond to a whole phrase V.

b. Scope of an ambiguity of multiplicity
A fragment V presents an ambiguity of multiplicity n (n ≥ 2) in an utterance U if it has n different proper representations which are part of n or more proper representations of U. V is an ambiguity scope if it is minimal relative to that ambiguity. This means that any strictly smaller fragment W of U will have strictly less than n associated subrepresentations (at least two of the representations of V are equal with respect to W).
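The 'smallest sub-tree Q containing all leaves corresponding to V' can be computed by a simple descent. A sketch under our own encoding (trees as (label, children) pairs, leaves carrying word positions); note that it returns the whole subtree rooted at Q, whereas, as said above, the part representing V need not be horizontally complete:

```python
def leaf_set(t):
    """Word positions covered by node t = (label, children) or (label, position)."""
    label, kids = t
    return {kids} if isinstance(kids, int) else set().union(*map(leaf_set, kids))

def smallest_subtree(t, target):
    """Deepest node whose leaves include every position in `target`."""
    label, kids = t
    if not isinstance(kids, int):
        for k in kids:
            if target <= leaf_set(k):      # descend while one child still covers V
                return smallest_subtree(k, target)
    return t

# Toy constituent tree for "the international telephone services are located"
tree = ("S",
        [("NP", [("Det", 0),
                 ("Nom", [("Adj", 1), ("Nom", [("N", 2), ("N", 3)])])]),
         ("VP", [("V", 4), ("V", 5)])])
q = smallest_subtree(tree, {1, 2, 3})  # scope "international telephone services"
print(q[0])  # Nom
```

A discontinuous fragment such as positions {0, 4} is forced all the way up to the root, which is why scopes are "usually, but not necessarily" connected.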
In example (1) above, then, the fragment "the international telephone services", together with the two skeleton representations
the [international telephone] services / the international [telephone services]
is not minimal, because it and its two representations can be reduced to the subfragment "international telephone services" and its two representations (which are minimal). This leads us to consider that, in syntactic trees, the representation of a fragment is not necessarily a 'horizontally complete' subtree (diagram on the right).

Fig. 1: [figure: a complete subtree (left) and a non-'horizontally complete' subtree (right) representing a fragment]

In the case above, for example, we might have the configurations given in the figure below. In the first pair (constituent structures), "international telephone services" is represented by a complete subtree. In the second pair (dependency structures), the representing subtrees are not complete subtrees of the whole tree.
2.4.3 Occurrence and kernel of an ambiguity

a. Ambiguity (occurrence)
An ambiguity occurrence, or simply ambiguity, A of multiplicity n (n ≥ 2) relative to a representation system R, may be formally defined as A = (U, V, (P1, P2 ... Pm), (p1, p2 ... pn)), where m ≥ n and:
• U is a complete utterance, called the context of the ambiguity.
• V is a fragment of U, usually, but not necessarily, connected: the scope of the ambiguity.
• P1, P2 ... Pm are all the proper representations of U in R, and p1, p2 ... pn are the parts of them which represent V.
• For any fragment W of U strictly contained in V, if q1, q2 ... qn are the parts of p1, p2 ... pn corresponding to W, there is at least one pair qi, qj (i ≠ j) such that qi = qj.
This may be illustrated by the following diagram, where we take the representations to be tree structures represented by triangles (see Figure 2). Here, P2 and P3 have the same part p2 representing V, so that m > n.
Fig. 2: [figure: the proper representations P1 ... Pm of U as triangles, with the parts p1 ... pn representing the scope V]

b. Ambiguity kernel
The kernel of an ambiguity A = (U, V, (P1, P2 ... Pm), (p1, p2 ... pn)) is the scope of A together with its (proper) representations: K(A) = (V, (p1, p2 ... pn)). In a data base, it will be enough to store only the kernels, and references to the kernels from the utterances.
2.4.4 Ambiguity type and ambiguity pattern

a. Ambiguity type
The type of A is the way in which the pi differ, and must be defined relative to each particular R. If the representations are complex, the difference between two representations is defined recursively. For example, two decorated trees may differ in their geometry or not. If not, at least two corresponding nodes must differ in their decorations. Further refinements can be made only with respect to the intended interpretation of the representations. For example, anaphoric references and syntactic functions may be coded by the same formal kind of attribute-value pairs, but linguists usually consider them as different ambiguity types. When we define ambiguity types, the linguistic intuition should be the main factor to consider, because it is the basis for any disambiguation method. For example, syntactic dependencies may be coded geometrically in one representation system, and with features in another, but disambiguating questions should be the same.

b. Ambiguity pattern
An ambiguity pattern is a schema with variables which can be instantiated to a (usually unbounded) set of ambiguity kernels. Here is an ambiguity pattern of multiplicity 2 corresponding to the example above:
NP[ x1 NP[ x2 x3 ] ] , NP[ NP[ x1 x2 ] x3 ]
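Instantiating such a pattern against a candidate representation can be sketched as follows (our own minimal encoding, not any analyser's: trees and patterns as nested tuples, strings beginning with 'x' as pattern variables):

```python
def match(pattern, tree, env=None):
    """Return a variable binding if `tree` instantiates `pattern`, else None."""
    env = dict(env or {})
    if isinstance(pattern, str) and pattern.startswith("x"):
        if pattern in env:                       # variable already bound
            return env if env[pattern] == tree else None
        env[pattern] = tree
        return env
    if isinstance(pattern, tuple) and isinstance(tree, tuple) \
            and pattern[0] == tree[0] and len(pattern) == len(tree):
        for p, t in zip(pattern[1:], tree[1:]):  # match children in order
            env = match(p, t, env)
            if env is None:
                return None
        return env
    return None

p1 = ("NP", "x1", ("NP", "x2", "x3"))   # NP[ x1 NP[ x2 x3 ] ]
p2 = ("NP", ("NP", "x1", "x2"), "x3")   # NP[ NP[ x1 x2 ] x3 ]
t = ("NP", "international", ("NP", "telephone", "services"))
print(match(p1, t))  # bindings for x1, x2, x3
print(match(p2, t))  # None: t has the other attachment
```

A kernel whose two representations instantiate p1 and p2 respectively would trigger whatever disambiguating dialogue is associated with this pattern.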
We don't elaborate, as ambiguity patterns are specific to a particular representation system and a particular analyser.

3 Attributes and values used in manual labelling
The proposed text processor oriented format for ambiguity labelling is a first version, resulting from several attempts by the second author to label transcriptions of spoken and multimodal dialogues. We describe this format with the help of a classical context-free grammar, written in the font used here for our examples, and insert comments and explanations in the usual font.
3.1 Top level (piece)
::= |
::=
::= 'LABELLED TEXT:'
::=
::= '"' '"'
::= <paragraph> [<parag_sep> <paragraph>]*
<paragraph> ::= [ ]*
::= '||?'
::=
::= 'LABELLED DIALOGUE:'
::=
::= [ ]*
::= [ ]*
::= <speaker_code> ':'
This means that the labelling begins by listing the text or the transcription of the dialogue, thereby indicating segmentation problems with the mark "||?".

3.2 Paragraph or turn level
3.2.1 Structure of the list and associated separators
The labelling continues with the next level of granularity, paragraphs or turns. The difference is that a turn begins with a speaker's code.

::= +
::= <parag_text> | 'PARAG' <parag_text> ['/PARAG']
<parag_text> ::= [ ]*
The mark PARAG must be used if there is more than one utterance. /PARAG is optional and should be inserted to close the list of utterances, that is if the next paragraph contains only one utterance and does not begin with PARAG. This kind of convention is inspired by SGML, and it might actually be a good idea in the future to write down this grammar in the SGML format.
::= [ ]*
::= '|?'
::= +
::= | 'TURN' ['/TURN']
We use the same convention for TURN and /TURN as for PARAG and /PARAG.
::= <speaker_code> ':' <parag_text>

3.2.2 Representation of ambiguities of segmentation
If there is an ambiguity of segmentation in paragraphs or turns, there may be more labelled paragraphs or turns than in the source. For example, A ||? B ||? C may give rise to A-B||C and A||B-C, and not to A-B-C and A||B||C. Which combinations are possible should be determined by the person doing the labelling. The same remark applies to utterances. Take one of the examples given at the beginning of this paper: OK |? so go back and is this number three |? right there |? shall I wait here for the bus?
This is an A |? B |? C |? D pattern, giving rise to 10 utterance possibilities. If the labeller considers only the 4 possibilities A|B|C-D, A|B|C|D, A|B-C|D, and A-B-C|D, the following 7 utterances will be labelled:

A      OK
A-B-C  OK so go back and is this number three right there
B      so go back and is this number three
B-C    so go back and is this number three right there
C      right there
C-D    right there shall I wait here for the bus?
D      shall I wait here for the bus?
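The counts above (10 candidate utterances, 7 retained) follow from simple span enumeration; a sketch in Python, with the retained segmentations written as lists of (start, end) spans over the units A..D:

```python
units = ["OK", "so go back and is this number three", "right there",
         "shall I wait here for the bus?"]

# All contiguous spans of n units: n(n+1)/2 candidate utterances
n = len(units)
candidates = [(i, j) for i in range(n) for j in range(i + 1, n + 1)]
print(len(candidates))  # 10

# The 4 segmentations retained by the labeller: A|B|C-D, A|B|C|D, A|B-C|D, A-B-C|D
segmentations = [[(0, 1), (1, 2), (2, 4)],
                 [(0, 1), (1, 2), (2, 3), (3, 4)],
                 [(0, 1), (1, 3), (3, 4)],
                 [(0, 3), (3, 4)]]
# Labelling needs one entry per distinct segment used by any retained segmentation
to_label = {span for seg in segmentations for span in seg}
print(len(to_label))  # 7 distinct utterances to label
```

The restriction from 10 candidates to 7 labelled utterances is exactly the labeller's judgement about which combinations are possible.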
3.3 Utterance level
3.3.1 Structure of the lists and associated separators
::= | ['UTTERANCES'] +
::=
(I-text) means 'indexed text': at the end of the scope of an ambiguity, we insert a reference to the corresponding ambiguity kernel, exactly as one inserts citation marks in a text.
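A hypothetical helper (the function name and the exact shape of the inserted reference are our assumptions, not part of the format specification) makes the convention concrete:

```python
def indexed_text(words, scopes):
    """Insert a kernel reference after the last word of each ambiguity scope,
    the way one inserts citation marks in a text.
    scopes: list of (start, end, kernel_id) over word positions."""
    out = []
    for i, w in enumerate(words):
        out.append(w)
        # append a reference for every scope ending at this word
        out += [f"({kernel})" for start, end, kernel in scopes if i == end - 1]
    return " ".join(out)

print(indexed_text("please state office phone number".split(),
                   [(1, 5, "EMMI10a-2'-5.1")]))
# please state office phone number (EMMI10a-2'-5.1)
```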
3.3.2 Headers of ambiguity kernels
::=*
There may be no ambiguity in the utterance, hence the use of "*" instead of "+" as above.
::= '(' ')'
::= 'ambiguity' ['-' ]
::= '-' [ ]*
For example, a kernel header may be: "ambiguity EMMI10a-2'-5.1". This is ambiguity kernel number 2' in dialogue EMMI 10a, noted here EMMI10a, and 5.1 is M. Tomokiyo's hierarchical code.
::=

3.3.3 Obligatory labels
::= <scope> {<status>}

By { A B C }, we mean any permutation of A B C: we don't insist that the labeller follows a specific order, only that the obligatory labels come first, with the scope as very first.

a. Scope
<scope> ::= '(scope' ')'

b. Status
<status> ::= '(status' <status_value> ')'
<status_value> ::= 'expert_system' | 'interpreter' | 'user'
The status expresses the kind of supplementary knowledge needed to reliably solve the considered ambiguity. If 'expert_system' is given, and if a disambiguation strategy decides to solve this ambiguity interactively, it may ask: the expert system, if any; the interpreter, if any; or the user (speaker). If 'interpreter' is given, it means that an expert system of the generic task at hand could not be expected to solve the ambiguity.

c. Importance
::= '(importance' ')'
::= 'crucial' | 'important' | 'not-important' | 'negligible'
This expresses the impact of solving the ambiguity in the context of the intended task. An ambiguity of negation scope is often crucial, because it may lead to two opposed understandings, as in "A did not push B to annoy C" (did A push B or not?). An ambiguity of attachment is often only important, as the corresponding meanings are not so different, and users may correct a wrong decision themselves. That is the case in the famous example "John saw Mary in the park with a telescope". From Japanese into English, although the number is very often ambiguous, we may also very often consider it as 'not-important'. 'Negligible'
ambiguities don't really put obstacles to the communication. For example, "bus" in English may be "autobus" (intra-town bus) or "autocar" (inter-town bus) in French, but either translation will almost always be perfectly understandable given the situation.

d. Type
::= '(type' ')'
::= ('structure' | 'attachment') '(' <structure>+ ')'
  | ('communication_act' | 'CA') '(' + ')'
  | ('class' | 'cat') '(' <morpho_syntactic_class>+ ')'
  | 'meaning' '(' <definition>+ ')'
  | '(' + ')'
  | 'reference'
  | 'address' '(' + ')'
  | 'situation' <situation>
  | 'mode' <mode>
  | ...

The linguists may define more types.

<structure> ::= '<' ( | <structure>+) '>'
::= 'yes' | 'acknowledge' | 'yn-question' | 'inform' | 'confirmation-question'
<morpho_syntactic_class> ::= 'N' | 'V' | 'Adj' | 'Adv' | ...
<definition> ::= '(' (<defined_ref_value> | )+ ')'
<defined_ref_value> ::= '*somebody' | '*something' | '*speaker' | '*hearer' | '*client' | '*agent' | '*interpreter'
<situation> ::= ...
<mode> ::= 'infinitive' | 'indicative' | 'conjunctive' | 'imperative' | 'gerund'

3.3.4 Other labels
Other labels are not obligatory. Their list is to be completed in the future as more ambiguity labelling is performed.
::= [ | <multimodality> ...]*
::= 'definitive' | 'long_term' | 'short_term' | 'local'
<multimodality> ::= 'multimodal' (<multimodal_help> | '(' <multimodal_help>+ ')')
<multimodal_help> ::= 'prosody' | 'pause' | 'pointing' | 'gesture' | 'facial_expression' | ...
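To illustrate how the obligatory labels combine, here is a hypothetical emitter for one kernel entry (the helper itself and the exact serialisation are our assumptions; only the attribute names come from section 3.3):

```python
def kernel_entry(ident, scope, status, importance, type_label, values):
    """Serialise one ambiguity kernel with its obligatory labels, scope first."""
    labels = [f"(scope {scope})", f"(status {status})",
              f"(importance {importance})",
              f"(type {type_label} ({' '.join(values)}))"]
    return f"(ambiguity {ident} " + " ".join(labels) + ")"

# The "state (N/V)" example of section 2.1.2, labelled as a category ambiguity
print(kernel_entry("EMMI10a-2'-5.1", "state office phone number",
                   "user", "important", "cat", ["N", "V"]))
```

Expressed this way, the obligatory part of an entry is a fixed record, and the optional labels of section 3.3.4 could simply be appended before the closing parenthesis.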
4 Conclusions
Although many studies on ambiguities have been published, the specific goal of studying ambiguities in the perspective of interactive disambiguation in automated text and speech translation systems has led us to explore some new ground and to propose the new concept of 'ambiguity labelling'. Several dialogues from EMMI-1 (ATR-ITL 1994) and EMMI-2 (Park & Loken-KIM 1994) have already been labelled (in Japanese and English). Attempts have also been made on French texts and dialogues. In the near future, we hope to refine our ambiguity labelling, and to label WOZ dialogues from EMMI-3 (Park, Loken-KIM, Mizunashi & Fais 1995). In parallel, the specification of MIDDIM-DB, a HyperCard based support for the ambiguity data base under construction, is being reshaped to implement the new notions introduced here: ambiguity kernels, occurrences, and types.

Acknowledgements. We are very grateful to Dr. Y. Yamazaki, president of ATR-ITL, Mr. T. Morimoto, head of Department 4, and Dr. Loken-Kim K-H., for their constant support to this project, which is one of the projects funded by CNRS and ATR in the context of a memorandum of understanding on scientific cooperation. Thanks should also go to M. Axtmeyer, L. Fais and H. Blanchon, who have contributed to the study of ambiguities in real texts and dialogues, and to M. Kurihara, for his programming skills.
REFERENCES

ATR-ITL. 1994. "Transcriptions of English Oral Dialogues Collected by ATR-ITL using EMMI (from TR-IT-0029, ATR-ITL)" ed. by GETA. EMMI report. Grenoble & Kyoto.

Axtmeyer, Monique. 1994. "Analysis of Ambiguities in a Written Abstract (MIDDIM project)". Internal Report. Grenoble, France: GETA, IMAG (UJF & CNRS).
Black, Ezra, R. Garside & G. Leech. 1993. Statistically-Driven Grammars of English: The IBM/Lancaster Approach ed. by J. Aarts & W. Mejs (= Language and Computers: Studies in Practical Linguistics, 8). Amsterdam: Rodopi.

Blanchon, Hervé. 1993. "Report on a Stay at ATR". Project Report (MIDDIM). Grenoble & Kyoto: GETA & ATR-ITL.

———. 1994. "Perspectives of DBMT for Monolingual Authors on the Basis of LIDIA-1, an Implemented Mockup". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), vol. I, 115-119. Kyoto, Japan.

———. 1994. "Pattern-Based Approach to Interactive Disambiguation: First Definition and Experimentation". Technical Report 0073. Kyoto, Japan: ATR-ITL.

Boitet, Christian. 1989. "Speech Synthesis and Dialogue Based Machine Translation". Proceedings of the ATR Symposium on Basic Research for Telephone Interpretation, 22-22. Kyoto, Japan.

——— & H. Blanchon. 1993. "Dialogue-based MT for Monolingual Authors and the LIDIA Project". Rapport de Recherche (RR-918-I). Grenoble: IMAG, GETA, UJF & CNRS.

———. 1993. "Practical Speech Translation Systems will Integrate Human Expertise, Multimodal Communication, and Interactive Disambiguation". Proceedings of the 4th Machine Translation Summit, 173-176. Kobe, Japan.

———. 1993. "Human-Oriented Design and Human-Machine-Human Interactions in Machine Interpretation". Technical Report 0013. Kyoto: ATR-ITL.

———. 1993. "Multimodal Interactive Disambiguation: First Report on the MIDDIM Project". Technical Report 0014. Kyoto: ATR-ITL.

——— & K-H. Loken-Kim. 1993. "Human-Machine-Human Interactions in Interpreting Telecommunications". Proceedings of the International Symposium on Spoken Dialogue. Tokyo, Japan.

——— & M. Axtmeyer. 1994. "Documents Prepared for Inclusion in MIDDIM-DB". Internal Report. Grenoble: GETA, IMAG (UJF & CNRS).

———. 1994. "On the Design of MIDDIM-DB, a Data Base of Ambiguities and Disambiguation Methods". Technical Report 0072. Kyoto & Grenoble: ATR-ITL & GETA-IMAG.

——— & H. Blanchon. 1995. "Multilingual Dialogue-Based MT for Monolingual Authors: the LIDIA Project and a First Mockup". Seminar Report on Machine Translation. Grenoble.
CHRISTIAN BOITET & MUTSUKO TOMOKIYO
Example of a short dialogue

I. Complete labelling in text-processor-oriented format

The numbers in square brackets are not part of the labelling format and are only given for convenience.
I.1 Text of the dialogue

LABELLED DIALOGUE: "EMMI 10a"

[1] A: Good morning conference office how can I help you
[2] AA: [ah] yes good morning could you tell me please how to get from Kyoto station to your conference center
[3] A: /ls/ [ah] yes (can you tell me) [ah] (you) you're going to the conference center today
[4] AA: yes I am to attend thi [uh] Second International Symposium {on} Interpreting Telecommunications
[5] A: {[o?]} OK n' where are you calling from right now
[6] AA: calling from Kyoto station
[7] A: /ls/ OK, you're at Kyoto station right now
[8] AA: {yes}
[9] A: {/breath/} and to get to the International Conference Center you can either travel by taxi bus or subway how would you like to go
[10] AA: I think subway sounds like the best way to me
[11] A: OK [ah] you wanna go by subway and you're at the station right now
[12] AA: yes
[13] A: OK so [ah] you'll want to get back on thi subway going north
[14] AA: [hmm]
[15] A: and you'll take the subway north to Sanjo station
[16] AA: OK
[17] A: /ls/ at Sanjo station you'll get off and change trains to thi Keihan Kyotsu line
[18] AA: [hmm]
[19] A: OK

I.2 Turns

LABELLED TURNS OF DIALOGUE "EMMI 10a"

TURN
[1] A: Good morning, conference office, |? How can I help you?

UTTERANCES
A: Good morning, conference office(1)
(ambiguity EMMI10a-1-2.2.8.3
  ((scope "conference office")
   (status expert_system)
   (address (*speaker *hearer))
   (importance not-important)
   (multimodal facial-expression)
   (disambiguation_scope definitive)))
A: How can I help you?
/TURN is not necessary here because another TURN appears.

TURN
[2] AA: [ah] yes, good morning. | Could you tell me please how to get from Kyoto station to your conference center?
The labeller distinguishes here a sure segmentation into 2 utterances.

UTTERANCES
AA: [ah] yes(2), good morning.
(ambiguity EMMI10a-2-5.1
  ((scope "yes")
   (status user)
   (type CA (yes acknowledge))
   (importance crucial)
   (multimodal prosody)))
AA: Could you tell me please how to get from Kyoto station to your conference center(3)?
(ambiguity EMMI10a-3-2.2.2
  ((scope "your conference center")
   (status user)
   (type structure («your conference><center» «your><conference center»))
   (importance negligible)
   (multimodal prosody)))
/TURN

[3] A: /ls/ [ah] yes (can you tell me) [ah] (you) you're going to the conference center today(4)
(ambiguity EMMI10a-4-5.2
  ((scope "today")
   (status expert_system)
   (situation "the day they are speaking")
   (importance negligible)
   (multimodal "built-in calendar on screen")))
[4] AA: yes I am to(5) attend thi [uh] Second International Symposium {on} Interpreting Telecommunications
(ambiguity EMMI10a-5-3.1.2
  ((scope "am to")
   (status user)
   (type Japanese)
   (importance important)))
[6] AA: calling from Kyoto station
[7] A: /ls/ OK, you're at Kyoto station(8) right now.
(ambiguity EMMI10a-8-5.1
  ((scope "you're at Kyoto station")
   (status expert_system)
   (type CA (yn-question inform))
   (importance crucial)
   (multimodal prosody)))
[8] AA: {yes}
/TURN is not necessary if there is only one utterance with no ambiguity of segmentation.

TURN
[9] A: {/breath/} and to get to the International Conference Center you can either travel by taxi bus or subway. | how would you like to go

UTTERANCE
A: {/breath/} and to get to the International Conference Center you can(9) either travel(9', 9") by taxi bus or subway(10).
(ambiguity EMMI10a-9-2.1
  ((scope "can")
   (status expert_system)
   (type class (verb modal_verb))
   (importance crucial)))
(ambiguity EMMI10a-9'-2.1
  ((scope "the International Conference Center you can either travel")
   (status expert_system)
   (type structure («the International Conference Center><you can» «either travel» «the International Conference Center>…»))
   (importance crucial)
   (multimodal prosody)))
(ambiguity EMMI10a-9"-2.1
  ((scope "travel")
   (status expert_system)
   (mode (infinitive imperative))
   (importance crucial)))
(ambiguity EMMI10a-10-2.2.2
  ((scope "taxi bus or subway")
   (status expert_system)
   (type structure ())
   (importance important)
   (multimodal prosody)))
A: How would you like to go
/TURN

[11] A: OK, [ah] you wanna go by subway and you're at the station right now(12).
(ambiguity EMMI10a-12-5.1
  ((scope "you wanna go by subway and you're at the station right now")
   (status expert_system)
   (type CA (yn-question inform))
   (importance crucial)
   (multimodal prosody)))
[12] AA: yes
[13] A: OK so [ah] you'll want to(13) get back on thi subway going north(14)
(ambiguity EMMI10a-13-3.1.2
  ((scope "want to")
   (status interpreter)
   (type Japanese (type French ("vouloir" "devoir")))
   (importance important)))
(ambiguity EMMI10a-14-2.2.2
  ((scope "get back on thi subway going north")
   (status user)
   (type structure (…north»»))
   (importance important)
   (multimodal prosody)))
This example is of the same kind as the very famous one, "Time flies like an arrow"! "Linguist's examples" are often derided, but they really appear in texts and dialogues. However, as soon as they are taken out of context, they …
[14] AA: [hmm]
[15] A: and you'll take the subway north to Sanjo station
(… ((status interpreter)
    (type cat (verb noun))
    (importance crucial)
    (multimodal (prosody pause))))
[16] AA: OK
[17] A: /ls/ at Sanjo station you'll get off(15) and change trains to thi Keihan Kyotsu line
(ambiguity EMMI10a-15-5.2
  ((scope "get off and change trains")
   (status user)
   (type structure («get off and change><trains» «get off><and change trains»))
   (importance negligible)
   (multimodal pause)))
[18] AA: [hmm]
[19] A: OK

II. Fragment in a database-oriented format

The idea is simply to use a line-oriented format, each line beginning with a keyword corresponding to the part being labelled. If the information does not fit on one line, the keyword is repeated at the beginning of the next line. The following fragment (turns 1-7) illustrates the idea. The main point is that such a format is easier to handle by traditional DBMS systems. The details of the formats may vary, but it is always required that translation from one format into the other is possible without loss of information.

II.1 Text of the dialogue

HEADING: LABELLED DIALOGUE: "EMMI 10a"
TEXT: A:Good morning conference
TEXT: office how can I help you
TEXT: AA:[ah] yes good morning
TEXT: could you tell me please how
TEXT: to get from Kyoto station to
TEXT: your conference center
TEXT: A:/ls/[ah] yes (can you
TEXT: tell me)[ah](you) you're
TEXT: going to the conference
TEXT: center today

II.2 Turns

Blank lines have been inserted only to make reading easier.

TURNS: LABELLED TURNS OF DIALOGUE
TURNS: "EMMI 10a"

TURN: [1]
TEXT: A:Good morning, conference
TEXT: office, |? How can I help you?

UTTERANCE: [1.1]
TEXT: Good morning, conference office(1)
AMBIGUITY: EMMI10a-1-2.2.8.3
SCOPE: "conference office"
STATUS: expert_system
ADDRESS: (*speaker *hearer)
IMPORTANCE: not-important
MULTIMODAL: facial-expression
DISAMBIGUATION_SCOPE: definitive

UTTERANCE: [1.2]
TEXT: A:How can I help you?

TURN: [2]
TEXT: AA:[ah] yes, good morning. |
TEXT: Could you tell me please
TEXT: how to get from Kyoto station
TEXT: to your conference center?
COMMENT: Sure segmentation into 2 utterances.

UTTERANCE: [2.1]
TEXT: AA:[ah] yes(2), good morning.
AMBIGUITY: EMMI10a-2-5.1
SCOPE: "yes"
STATUS: expert_system
TYPE: CA (yes acknowledge)
IMPORTANCE: crucial
MULTIMODAL: prosody
UTTERANCE: [2.2]
TEXT: AA:Could you tell me please
TEXT: how to get from Kyoto station
TEXT: to your conference center(3)?
AMBIGUITY: EMMI10a-3-2.2.2
SCOPE: "your conference center"
STATUS: user
TYPE: structure («your conference><center»
TYPE: «your><conference center»)
TURN: [5]
TEXT: A:{[o?]} OK n' where are
TEXT: you calling(6) from right
TEXT: now(7)
AMBIGUITY: EMMI10a-6-3.1.2
SCOPE: "calling"
STATUS: expert_system
TYPE: Japanese
IMPORTANCE: crucial
AMBIGUITY: EMMI10a-7-2.1
SCOPE: "calling from right now"
STATUS: user
TYPE: structure («calling from>
TYPE: <right now»)
IMPORTANCE: crucial
MULTIMODAL: prosody

TURN: [6]
TEXT: AA:calling from Kyoto station

TURN: [7]
TEXT: A:/ls/ OK, you're at Kyoto
TEXT: station(8) right now.
AMBIGUITY: EMMI10a-8-5.1
SCOPE: "you're at Kyoto station"
STATUS: expert_system
TYPE: CA (yn-question inform)
IMPORTANCE: crucial
MULTIMODAL: prosody
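Records in this line-oriented format are straightforward for a script or DBMS loader to consume. The following minimal parser is our illustrative sketch, not part of the original labelling scheme; it assumes only the keyword-colon layout shown above, with a repeated keyword continuing the previous field, as the format prescribes:

```python
def parse_labelled_turns(lines):
    """Parse the line-oriented labelling format into records.

    Each line is 'KEYWORD: value'; a repeated keyword on the next
    line continues the previous field. TURN / UTTERANCE / AMBIGUITY
    lines open a new record.
    """
    records, current = [], None
    for line in lines:
        if not line.strip():
            continue
        # split at the FIRST colon only: values may themselves contain ':'
        keyword, _, value = line.partition(":")
        keyword, value = keyword.strip(), value.strip()
        if keyword in ("TURN", "UTTERANCE", "AMBIGUITY"):
            current = {"kind": keyword, "id": value}
            records.append(current)
        elif current is not None:
            # repeated keyword -> continuation of the same field
            prev = current.get(keyword, "")
            current[keyword] = (prev + " " + value).strip()
    return records

sample = [
    "TURN: [6]",
    "TEXT: AA:calling from",
    "TEXT: Kyoto station",
]
print(parse_labelled_turns(sample))
# [{'kind': 'TURN', 'id': '[6]', 'TEXT': 'AA:calling from Kyoto station'}]
```

Because every field sits on its own keyword-prefixed line, a loader like this needs no grammar beyond "first colon separates keyword from value", which is what makes the format easy to handle with traditional database tools.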
III DISCOURSE
Incorporating Discourse Aspects in English - Polish MT
MALGORZATA E. STYŚ & STEFAN S. ZEMKE
University of Cambridge & Linköping University

Abstract
English orders constituents in utterances according to their grammatical function. Polish places them with regard to their informational salience and stylistic criteria. This raises two problems when translating: how to determine informational salience, and which potential order to prefer. The former is addressed by providing an extended version of the centering algorithm; the latter, by extracting order preferences from statistical data.

1 Introduction
Machine translation systems tend to concentrate on conveying the meaning and structure of individual sentences. However, since translation has to be accurate not only lexically and grammatically but also needs to carry across the contextual meaning of each utterance, incorporating discourse aspects is necessary. English and Polish exhibit certain idiosyncratic features which impose different ways of expressing the information status of constituents. Unlike English, in which constituent order is grammatically determined, Polish displays an ordering tendency according to constituents' degree of salience, so that the most informationally salient elements are placed towards the end of the clause. Such ordering requires solid knowledge about the constituents' degree of salience. This paper is organised as follows. The next section includes a description of the centering algorithm for English and our extensions of the notion in view of English - Polish machine translation. We then go on to describe the idiosyncratic properties of Polish and their implications for center transfer. Finally, the rules for ordering Polish constituents are outlined.

2 Centering model for English analysis
Centering as introduced by (Grosz et al. 1986) is a discourse model proposing rules for tracking down given information units on the local discourse
level. The center, expressed as a noun phrase, is a pragmatic construct, and it is intuitively defined as the discourse entity that the utterance is about. Each utterance Un is assigned a forward-looking center list Cf(Un) of all nominal expressions within the utterance, ordered by their grammatical function, which corresponds to the linear order of constituents in English. The backward-looking center Cb(Un), the center proper, is the highest-ranked element of Un which is also (if possible) realised in Cf(Un-1). Pronominalisation and subjecthood are the main criteria underlying this ranking. Generally, (resolvable) pronouns are the preferred center candidates. For possible relations between subsequent utterances, see (Brennan et al. 1987).

2.1 Extension to the centering algorithm
Various refinements have been added to the centering model since its introduction (Brennan et al. 1987; Kameyama 1986; Mitkov 1994; Walker et al. 1994). A description of our practically motivated extensions follows.
2.1.1 Definiteness. Definite articles often point to a center. However, the correlation between definiteness and an entity having been introduced in previous discourse is high but not total. (For example, proper names can be textually new yet definite.) We therefore include definiteness among the factors contributing to center evaluation. Indefinite noun phrases are treated as new discourse entities.
2.1.2 Lexical reiteration. Lexically reiterated items include repeated or synonymous noun phrases, possibly preceded by articles, possessives or demonstratives. We also propose to consider semantic equivalence, based on the synonyms coded in the lexicon, as valid instances of reiteration.
2.1.3 Referential distance. For pronouns and reiterated nouns, we propose that the allowed maximal referential distance, measured in the number of clauses scanned back, correlate with the word length of the constituent involved (Siewierska 1993a). This relates to the observation that short referring expressions have their resolvents closer than longer ones. Such a precaution limiting the referential distance minimises the danger of over-interpretation of common generic expressions such as it. We have not yet experimented with various functions relating the type of referent to its allowed referential distance; a simple linear dependence (with factor 1-2) seems to be reasonable.
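The simple linear dependence suggested in 2.1.3 can be sketched directly. The function name and the word-based length measure below are our illustrative assumptions, with the factor drawn from the 1-2 range mentioned above:

```python
def allowed_referential_distance(referring_expression, factor=2):
    """Maximal number of clauses to scan back for an antecedent.

    Heuristic from 2.1.3: short referring expressions (e.g. 'it')
    have their resolvents closer than longer ones, so the allowed
    distance grows linearly with the word length of the expression.
    """
    n_words = len(referring_expression.split())
    return factor * n_words

# A bare pronoun may only look back 2 clauses; a longer definite
# description is allowed a wider search window.
print(allowed_referential_distance("it"))
print(allowed_referential_distance("the second symposium"))
```

With factor 2, this caps the search for a pronoun like it at two clauses, which is what prevents a generic pronoun from being over-interpreted against distant material.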
   CONSTRUCT             MARKERS                       CENTER VALUE

Center-pointing constructions (Point. 1-4)
1  Cleft                 it+Be+Nc+that/who             center(Nc) := 3
2  Fronted               Nf+Sentence                   center(Nf) := 3
3  Prompted              Prompt+Np, Sentence           center(Np) := 3
4  There-insertion       there+Be+Nt                   center(Nt) := 2

Pronominal centers (Pron. 1-2)
1  Personal              I/you/it/he/she/we/they       center(Pron_pers) := 2
2  Demonstrative         this/that/these/those         center(Pron_demo) := 1

Other (Non. 1-3)
1  Indefinites           a/an/another/other            center(N_indef) := -1
2  Proper names          e.g., Mary/Chicago            center(Proper) := 1
3  Default for any NP    cases not listed elsewhere    center(NP) := 0

Composite Centers (C. 1-5)
1  Reiterated nominal    N_reit <-ref-dist-> N_reit    center(N_reit)+1
2  Definite expressions  the/such/this/that etc. + N   center(N)+1
3  Possessives           its/his/her etc. + N          center(N)+1
4  Genitives             N's+Np, Np+of+N               center(Np)+center(N)
5  Resolved pronoun      Pron <-ref-dist-> NP_match    center(Pron)+1
Table 1: Center values for different types of NP

2.1.4 Center-pointing constructions. Certain English constructions unambiguously point to the center, thus making more detailed analysis unnecessary. The cleft construction uses a dummy subject it to introduce the center, e.g., It was John who came. The center can also be fronted, e.g., Apples, Adam likes, or introduced by a prompt such as as for, concerning, with regard to etc., e.g., As for Adam, he doesn't like apples.
2.1.5 Composite center value. The rules for Composite Centers in Table 1 allow us to calculate the center value increase over the default 0. Thus, for example, the center value for the scientists' colleagues will be arrived at by adding the contribution for the (+1) to the contributions for scientists and colleagues (each 0 or 1 depending on whether the item is reiterated), giving a value between 1 and 3 depending on the context. A constituent is assumed to be assigned the highest possible center value allowed by our rules.
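The additive computation of 2.1.5 can be sketched as follows. The dictionary-based feature representation and rule encoding are our assumptions; the base values and the +1 increments follow Table 1:

```python
# Base center values from Table 1 (pronominal / other constructs)
BASE = {
    "personal_pronoun": 2,
    "demonstrative_pronoun": 1,
    "proper_name": 1,
    "indefinite": -1,
    "default": 0,
}

def center_value(np):
    """Additive center value of an NP, following Table 1.

    `np` is a dict with a `base` class plus boolean 'givenness'
    markers; each applicable Composite rule adds 1 (C.1-C.3, C.5).
    Genitives (C.4) would instead sum the values of both nominals.
    """
    value = BASE[np.get("base", "default")]
    if np.get("reiterated"):        # C.1: lexically reiterated nominal
        value += 1
    if np.get("definite"):          # C.2: the/such/this/that + N
        value += 1
    if np.get("possessive"):        # C.3: its/his/her + N
        value += 1
    if np.get("resolved_pronoun"):  # C.5: resolvable pronoun
        value += 1
    return value

# 'the tests' (definite + reiterated) -> 2, matching clause 2 of Table 2
print(center_value({"base": "default", "definite": True, "reiterated": True}))
```

Note how the resolved-pronoun case reproduces the 3 = 2+1 computation for they in the example table: a personal pronoun starts at 2 and gains 1 by C.5.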
2.1.6 Center gradation. Considering the priority scale of referential items, the mechanisms underlying centering in English could then be outlined as follows:
• Preference of pronouns over full nouns.
• Preference of definites over indefinites.
• Preference of reiterated items over non-reiterated ones.
• Preference of constituents involving more 'givenness' indicators.
These, considered along with special center-pointing constructions, lead to the numerical guidelines presented in Table 1. Some of them agree with the idea of the givenness hierarchy, cf. (Gundel 1993). In Table 2 we illustrate the application of the rules included in Table 1. We choose the constituent with the highest center value as the discrete center of an utterance. If more than one constituent has been assigned the same value, we take the entity that is highest-ranked according to the ranking introduced in the original algorithm (Grosz et al. 1986; Grosz et al. 1995; Brennan et al. 1987).

   UTTERANCE                                RULES            VALUES         CENTER
1  The scientists conducted many tests.     Comp.2; Non.3    1 = 1+0; 0     scientists
2  The tests were thorough.                 Comp.1,2         2 = 1+1+0      tests
3  The results were looked at by their      Comp.2;          1 = 1+0;       colleagues
   colleagues.                              Comp.3,5         2 = 1+1+0
4  They were acknowledged.                  Pron.1, Comp.5   3 = 2+1        they
5  The scientists' colleagues accepted      Comp.1,2,4;      3 = 1+1+1+0;   colleagues
   the tests.                               Comp.1,2         2 = 1+1+0

Table 2: Center values for example clauses
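The selection step just described — highest center value, ties broken by the grammatical ranking of the original algorithm — could be coded along these lines (the candidate representation is our assumption):

```python
def discrete_center(candidates):
    """Pick the discrete center of an utterance.

    `candidates` is a list of (np, center_value, rank) triples, where
    `rank` is the position in the grammatical-function ordering of the
    original centering algorithm (lower = higher ranked). The highest
    center value wins; ties fall back to that ranking, as in the text.
    """
    # tuple key: compare by value first, then by (negated) rank
    return max(candidates, key=lambda c: (c[1], -c[2]))[0]

# Clause 5 of Table 2: colleagues (value 3) beats tests (value 2).
cands = [("colleagues", 3, 1), ("tests", 2, 2)]
print(discrete_center(cands))
```

Encoding the tie-break as a second key component keeps the whole choice a single comparison, which mirrors the two-step rule (value, then ranking) exactly.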
3 Local discourse mechanisms in translation
In discourse analysis, we relate particular utterances to their linguistic and non-linguistic environment. Below, we shall describe the relationship
between the grammatical sentence pattern (Subject Verb Object) and the communicative pattern (Theme Transition Rheme). Functional sentence perspective (FSP) is an approach used by the Prague School of linguists to analyse utterances of Slavic languages in terms of their information content (Firbas 1992). In a coherent discourse, the given or known information, the theme, usually appears first, thus forming a co-referential link with the preceding text. The new unit of information, the rheme, provides some specification of the theme. It is the essential piece of information of the utterance. There are clear linear effects of FSP.¹ Utterance non-final positions usually have a given-information interpretation, whereas the final position represents the new. This could be motivated by word order arranged in such a way that first come words referring to details already familiar from the preceding utterances/external context, and only then words describing new detail. Similarly, in perception first comes identification and only then augmentation by details individually connected with the given idea (Szwedek 1976). Constituent order in Polish generally follows the communicative order from given to new. Since the grammatical function is determined by inflection, there is great scope for the order to express contextual distinctions, and the order often seems free due to the virtual absence of structural obstacles. However, there are also other, mostly stylistic, factors influencing the final order, which can co-specify or even override the 'given precedes new' tendency. This presents a delicate task of balancing a number of clues in selecting the most justified choice. The degree of emphasis is also a factor, and it is worth noting that the more frequently an order occurs the less emphatic it is (Siewierska 1993a).
4 Ordering of Polish constituents
Our choice of ordering criteria has been directly inspired by the findings of the Prague School discussed above, by our own linguistic experience (both of us are bilingual, native speakers of Polish), by statistical data provided in (Siewierska 1987; 1993a,b), and by the feasibility of implementation. The intended approach to ordering could be characterised as follows:

¹ The information structure also changes depending on the accentuation pattern, but we shall leave the intonation aspects aside in this presentation.
Permissive: Generate more (imperfect) versions rather than none at all. If need be, restrict by further filters.
Composite: Generate all plausible orders before some of them are discriminated. (This approach is side-tracked when a special construction is encountered.)
Discrete: No gradings/probability measures are assigned to competing orders so as to discriminate between them. This could be an extension.

4.1 Ordering criteria
Below we present some rules which are obeyed by Polish clauses under usual conditions:
• End weight principle: the last primary constituent is the anti-center;
• Given information fronting: constituents belonging to the given information sequence are fronted;
• Short precedes long principle: shorter constituents go first;
• Relative order principle: certain partial orders are only compatible with specific patterns of constituents.
Additionally, there is a strong tendency to omit subject pronouns. Such omission, however, exhibits different degrees of optionality. What follows is a list of constructs used in subsequent tables to generate plausible orders of (translated) Polish constituents.
Center information: has the highest rank in the ordering procedure and is used in three aspects:
• center(Constituent) returns the center value of the Constituent's NP, or 0 if undefined,
• center_shift(Utterance) holds if Utterance relates to the preceding one in the way allowed by the shift transition, cf. (Grosz et al. 1986),
• discrete_center(Constituent) holds if Constituent is the chosen center of the current utterance.
Length of constituents: length(Constituent) returns the number of words of the resulting Polish Constituent.² Although not as important as center information, this rough measure can discriminate certain orders on the basis of the 'short precedes long' principle.³

² This measure, to a great extent, depends on the translation of constituents. It could be approximated by the length of the original English, instead of Polish, units. We use that in the example.
³ However, for the otherwise rare order OSV, the opposite applies.
219
Positioning of certain constituents: (or indeed their lack) can in turn in duce other constituents to occupy certain positions. Some orders are only possible in certain configurations, e.g., with frontal Adjunct (X-), whereas others require just its presence (-X-), or absence (X=[ ]). Syntactic phenomena: • grammatical function of a constituent, e.g., being a subject (S) or object (0). • pron(s) & pron(O) if both subject and object are pronominal or Sub(Un) = Sub(Un-1) - if subject stays the same. • certain expressions, e.g., a focus binding expression such as 'only', can trigger specific translation patterns. Features of next utterance: e.g., center(S,Un+1) > 0, can be used together with the features of the current utterance in order to obtain more specific conditions. In the following tables S denotes (Polish) subject, V - verb, - object, X - adjunct, Prim - S or 0 , "-" - (sequence of) any, [ ] - omitted constituent. The difference for ">>" to hold must be at least 2. 4.2
Building on orders of constituents
The Preference Table ?? presents some of the main PREFERENCES for gener ating orders of Polish constituents depending on specific CONDITIONS. Each line of the table can be treated as an independent if-then rule co-specifying (certain aspects of) an order. Different rules can be applied independ ently thus possibly better determining a given order4. The JUSTIFICATION column provides some explanation of the validity of each rule. It might be the case that as a result of applying the Preference table, we obtain too many orders. The Discrimination Table 4 provides some ra tionale for excluding those matching ORDERS for which one of their DISCRI MINATION conditions fails. If the building stage left us with no possible or ders at all, we could allow any order and pick only those which successfully pass all their discrimination tests. It is purposeful that all orders apart from the canonical SVO have some discrimination conditions attached to them. The rarer the order tends to be the more strict the condition. Therefore, SVO is expected to prevail. Both the Preference table and the Discrim ination table are mostly based on statistical data described in (Siewierska 1987; 1993a,b). There remains a number of cases which escape simple characterisation in terms of 'preferred and not-discriminated'. The Preprocessing Table 5 4
Orders derived by co-operation of several rules could be preferred in some way.
220
MALGORZATA STYS & STEFAN ZEMKE
Pref.
CONDITIONS
i ii iii iiib
Orderings implied by center information center (Any) < 0 -Any Final position of new center(Cl) >> center(A2) -Anyl-Any2Given-new principle center(X) > 1 XAdjunct topic fronted discrete_center(Prim) (X-)(V-)Prim- Primary center fronted
iv ν vi vii viii ix X
xi xii xiii
PREFERENCE
JUSTIFICATION
Statistical positioning preferences Statistical XV-S-O-V-S-O- & -XStatistical XV-O-S-0-S- & XStatistical XV-O-S-V-O-S- & -XStatistical XS-V-O-S-V-O- & -XS-V-OX Statistical -S-V-O- & -XStatistical O-V-SX -O-V-S- & -XO-VXS Statistical -O-V-S- & -XPron(S) (& center_shift(Un)) Stylistic -vsGeneral Statistical -v-oStatistical preferences -s-o-
(66%) (53%) (32%) (30%) (29%) (26%)
(89%+) (81%)
Table 3: Center values for example clauses offers some solutions under such circumstances. It is to be checked for its conditions before any of the previous tables are involved. If a condition holds, its result (e.g., 0-anaphora) should be noted and only then the other tables applied to co-specify features of the translation as described above. The Preprocessing table can yield erroneous results when applied repeatedly for the same clause. Therefore, unlike the other tables, it should be used only once per utterance. In Table we continue the example from Table 3. The orderings built on by a cooperation of the Preprocessing/Preference and not refused by the Discrimination table appear in the last column. 5
Conclusion
One of the aims of this research was to exploit the notion of center in Polish and put it forward in context of machine translation. The fact that centers are conceptualised and coded differently in Polish and English has clear repercussions in the process of translation. Through exploring the pragmatic, semantic and syntactic conditions underlying the organisation
DISCOURSE ASPECTS IN ENGLISH - POLISH MT Discr,. i ii iii iv V
vi vii viii ix X
ORDER
DISCRIMINATION
221
JUSTIFICATION
-V-S-O- length(S) < length(O) -V-S-O- -V-S-0 -V-S-O- Pron(S) -V-O-S- length(O) < length(S) -V-O-S- -X- present -S-O-V- SOV -S-O-V- center(S,[Un+1) > 0 -O-S-V-O-S-V- length(O) > length(S) -O-V-S length(O) > length(S)
osvx
Statistical Statistical Stylistic Statistical Statistical Statistical Statistical Statistical Statistical Statistical
(99%) (87%) (96%) (89%) (50%+) (79%) (100%) (64%)
Table 4: Discrimination table RESULT
JUSTIFICATION
Pre.
CONDITIONS
i ii iii iv
0-anaphora S='we' S=[ ] pron(O) & pron(S) S=[ ] Sub(Un) = S u b ( U n ) (& pron(S)) S=[ ] center_continuing(Un) S=[ ]
Rhythmic Stylistic Stylistic Stylistic
v vi
Special constructions -'only' SV- & pron(S) -'tylko' SVX=[ ] & pron(O) SOV
Focus binding expr. Special: S,0,V only
Table 5: Preprocessing table of utterances in both languages, we have been able to devise a set of rules for communicatively motivated ordering of Polish constituents. Among the main factors determining this positioning are pronominalisation, lexical reiteration, definiteness, grammatical function and special centered constructions in the source language. Their degree of topicality is coded by the derived center values. Those along with additional factors, such as the length of the originating Polish constituents and the presence of adjuncts, are used to determine justifiable constituent order in the resulting Polish clauses.
222
MALGORZATA STYS & STEFAN ZEMKE
1
2
PREFERENCE
PARTIAL
DISCRIMINATION
CRITERIA
ORDERS
(FAILING)
Pref.xii Pref.xiii
VSO
SVO (Discr.iii)
No rules apply, order unchanged
3
Pref.iiib (Pref.xii)
OVS VOS OSV
4
Pre.iii Pref.xi
S=[] -VS-
5
Pref.iiib (Pref.xii)
SVO VSO
Discr.x (Discr.v) (Discr.viii)
RESULTING ORDER(S)
SVO
SVX OVS
V[S]X
SVO (Discr.i)
Table 6: Example continued: Deriving constituent orders

In further research, we wish to extend the scope of translated constructions to di-transitives and passives. We shall also give due attention to relative clauses. Centering in English can be further refined by allowing verbal and adjectival centers, as well as by determining anti-center constructs. We have thus tackled the question of information distribution in terms of communicative functions and examined its influence on the syntactic structure of the source and target utterances. How and why intersentential relations are to be transmitted across the two languages remains an intricate question, but we believe to have partially contributed to the solution of this problem.

REFERENCES

Brennan, Susan E., Marilyn W. Friedman & Carl J. Pollard. 1987. "A Centering Approach to Pronouns". Proceedings of the Annual Conference of the Association for Computational Linguistics (ACL'87), 155-162. Stanford, Calif.
Firbas, Jan. 1992. Functional Sentence Perspective in Written and Spoken Communication. Cambridge: Cambridge University Press.
Grosz, Barbara J. 1986. "The Representation and Use of Focus in a System for Understanding Dialogs". Readings in Natural Language Processing ed. by B. Grosz, K. Sparck Jones & B. Webber, 353-362. Los Altos, Calif.: Morgan Kaufmann.
——, Aravind K. Joshi & Scott Weinstein. 1995. "Centering: A Framework for Modelling the Local Coherence of Discourse". Computational Linguistics 21:2.203-225.
Gundel, Jeanette K. 1993. "Centering and the Givenness Hierarchy: A Proposed Synthesis". Workshop on Centering Theory in Naturally Occurring Discourses. Philadelphia: University of Pennsylvania.
Kameyama, Megumi. 1986. "A Property Sharing Constraint in Centering". Proceedings of the 24th Annual Conference of the Association for Computational Linguistics (ACL'86), 200-206. Columbia, N.Y.
Mitkov, Ruslan. 1994. "A New Approach for Tracking Center". Proceedings of the International Conference "New Methods in Language Processing", 150-154. Manchester: UMIST.
Siewierska, Anna. 1987. "Postverbal Subject Pronouns in Polish in the Light of Topic Continuity and the Topic/Focus Distinction". Getting One's Words into Line ed. by J. Nuyts & G. de Schutter, 147-161. Dordrecht: Foris.
——. 1993a. "Subject and Object Order in Written Polish: Some Statistical Data". Folia Linguistica 27:1/2.147-169.
——. 1993b. "Syntactic Weight vs. Information Structure and Word Order Variation in Polish". Journal of Linguistics 29:233-265.
Szwedek, Aleksander J. 1976. Word Order, Sentence Stress and Reference in English and Polish. Edmonton: Linguistic Research, Inc.
Walker, Marilyn A., Masayo Iida & S. Cote. 1994. "Japanese Discourse and the Process of Centering". Computational Linguistics 20:2.193-227.
Two Engines Are Better Than One: Generating More Power and Confidence in the Search for the Antecedent

RUSLAN MITKOV
University of Wolverhampton

Abstract

The paper presents a new combined strategy for anaphor resolution based on the interaction of two engines which, separately, have been successful in anaphor resolution. The first engine incorporates the constraints and preferences of the integrated approach for anaphor resolution reported in (Mitkov 1994), while the second engine follows the principles of the uncertainty reasoning approach described in (Mitkov 1995). The combination of a traditional and an alternative approach aims at providing maximal efficiency in tackling the tough problem of anaphor resolution. Preliminary results already show improved performance when both approaches are united into a more powerful and confident searcher for the antecedent.
1 Introduction
Approaches to anaphor resolution have so far been mostly linguistic (Carbonell & Brown 1988; Hayes 1981; Hobbs 1978; Ingria & Stallard 1989; Lappin & McCord 1990; Nasukawa 1994; Pérez 1994; Preuß et al. 1994; Rich & LuperFoy 1988; Rolbert 1989), with the exception of a few projects where statistical (Dagan & Itai 1990) or machine learning (Connoly, Burger & Day 1994) methods have been developed. Given the complexity of the problem and its central importance in Natural Language Processing, it would be wise to consider a combination of various approaches to complement the traditional methods and increase the chances of success by combining the advantages of each method used.

We have already reported on an integrated approach for anaphor resolution based on linguistic constraints and preferences and a statistical method for center tracking (Mitkov 1994). As an alternative, we have successfully developed an uncertainty reasoning approach (Mitkov 1995). To improve performance, we have recently developed a combined strategy based on two engines: the first engine searches for the antecedent using the integrated approach, whereas the second engine performs uncertainty reasoning to rate
the candidates for antecedents. The preliminary tests show encouraging results.
2 An integrated anaphor resolution approach
Our anaphor resolution model described in (Mitkov 1994) incorporates modules containing different types of knowledge: syntactic, semantic, domain, discourse and heuristic (Figure 1).
Fig. 1: An integrated anaphor resolution architecture

The syntactic module, for example, knows that the anaphor and antecedent must agree in number, gender and person. It checks if the c-command constraints hold and establishes disjoint reference. In cases of syntactic parallelism, it prefers the noun phrase with the same syntactic role as the anaphor as the most probable antecedent. It knows when cataphora is
TWO-ENGINE APPROACH TO ANAPHOR RESOLUTION
possible and can indicate syntactically topicalised noun phrases, which are more likely to be antecedents than non-topicalised ones.

The semantic module checks for semantic consistency between the anaphor and the possible antecedent. It filters out semantically incompatible candidates on the basis of verb semantics or the animacy of the candidate. In cases of semantic parallelism, it prefers the noun phrase which has the same semantic role as the anaphor as the most likely antecedent. Finally, it generates a set of possible antecedents whenever necessary.

The syntactic and semantic modules have been enhanced by a discourse module, which plays a very important role because it keeps track of the centers of each discourse segment (it is the center which is, in most cases, the most probable candidate for an antecedent). Based on empirical studies of the sublanguage of computer science, we have developed a statistical approach to determining the probability of a noun (verb) phrase being the center of a sentence. Unlike the approaches known so far, our method is able to propose the center with high probability in every discourse sentence, including the first one. The approach uses an inference engine based on Bayes' formula, which draws an inference in the light of some new piece of evidence: it calculates the new probability, given the old probability plus some new piece of evidence (Mitkov 1994).

The domain knowledge module is practically a knowledge base of the concepts of the domain considered, and the discourse knowledge module knows how to track the center of the current discourse segment. The heuristic knowledge module can sometimes be helpful in locating the antecedent. It has a set of useful rules (e.g., the antecedent is preferably located in the current sentence or in the previous one) and can forestall certain impractical search procedures.
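The Bayes-based recalculation used for centre tracking can be illustrated with a small numerical sketch. The prior and likelihood values below are invented for illustration only; the real model's parameters come from the corpus statistics reported in (Mitkov 1994):

```python
def bayes_update(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Posterior P(H | E): the new probability of the hypothesis H ('this
    phrase is the center') after observing a piece of evidence E, computed
    from the old probability (the prior) and the likelihoods of E."""
    numerator = p_e_given_h * prior
    return numerator / (numerator + p_e_given_not_h * (1.0 - prior))

# A noun phrase starts as a center candidate with a hypothetical prior of 0.4;
# two pieces of evidence (hypothetical likelihoods) raise it step by step.
p = bayes_update(0.4, 0.8, 0.3)   # -> 0.64
p = bayes_update(p, 0.7, 0.4)     # -> about 0.757
```

Each piece of evidence is folded in by reusing the previous posterior as the new prior, which matches the "old probability plus some new piece of evidence" description above.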
The referential expression filter plays an important role in filtering out impersonal 'it' expressions (e.g., "it is important", "it is necessary", "it should be pointed out", etc.), where 'it' is not anaphoric.

The syntactic and semantic modules usually filter the possible candidates and do not propose an antecedent (with the exception of syntactic and semantic parallelism). Generally, the proposal for an antecedent comes from the domain, heuristic and discourse modules.
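The division of labour described above (constraint modules filter candidates; they rarely propose one) can be sketched as follows. The feature representation, the toy candidates and the impersonal-'it' patterns are simplified assumptions for illustration, not the system's actual knowledge sources:

```python
# Patterns that flag a non-anaphoric, impersonal 'it' (cf. "it is important").
IMPERSONAL_IT = ("it is important", "it is necessary", "it should be pointed out")

def is_impersonal_it(clause: str) -> bool:
    """Referential expression filter: True if the clause-initial 'it' is
    one of the listed impersonal patterns and hence not anaphoric."""
    return any(clause.lower().startswith(p) for p in IMPERSONAL_IT)

def agreement_filter(anaphor: dict, candidates: list) -> list:
    """Syntactic filtering: keep only candidates that agree with the
    anaphor in number and person (gender is omitted in this toy version)."""
    return [c for c in candidates
            if c["number"] == anaphor["number"] and c["person"] == anaphor["person"]]

anaphor = {"form": "they", "number": "pl", "person": 3}
candidates = [
    {"form": "system programs", "number": "pl", "person": 3},
    {"form": "assembly version", "number": "sg", "person": 3},
]
survivors = agreement_filter(anaphor, candidates)   # only 'system programs' survives
```

Note that the filter only discards candidates; ranking the survivors is left to the discourse, domain and heuristic modules, as in the architecture above.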
3 An uncertainty reasoning approach
We have developed a new uncertainty reasoning approach for anaphor resolution (Mitkov 1995). The strategy for determining the antecedent of a
pronoun uses AI uncertainty reasoning techniques. Uncertainty reasoning was selected as an alternative because:

1. in Natural Language Understanding, the program is likely to estimate the antecedent of an anaphor on the basis of incomplete information: even if information about constraints and preferences is available, it is natural to assume that a Natural Language Understanding program is not able to understand the input completely;

2. the necessary initial constraint and preference scores are determined by human beings; the scores are therefore originally subjective and should be regarded as uncertain facts.

The uncertainty reasoning approach makes use of various 'anaphor resolution symptoms' which have already been studied in detail. Apart from the widely used syntactic and semantic constraints and preferences such as agreement, c-command constraints, parallelism, topicalisation and verb-case role, the approach makes use of further symptoms based on empirical evidence, such as subject preference, domain concept preference, object preference, section head preference, reiteration preference, definiteness preference, main clause preference, etc.

The availability or non-availability of a certain symptom corresponds to an appropriate score or certainty factor (CF) attached to it. For instance, the availability of a certain symptom s is assigned CF_s^av (0 < CF_s^av <= 1), whereas its non-availability corresponds to CF_s^non-av (-1 <= CF_s^non-av < 0). For easier reference and brevity, we associate with the symptom s only the certainty factor CF_s, which we regard as a two-valued function (CF_s ∈ {CF_s^av, CF_s^non-av}).

The antecedent searching procedure makes use of an uncertainty reasoning strategy: the search for an antecedent is regarded as an affirmation (or rejection) of the hypothesis that a certain noun phrase is the correct antecedent. The certainty factor (CF) serves as a quantitative approximation of the hypothesis.
The availability or non-availability of each symptom s causes a recalculation of the global hypothesis certainty factor CF_hyp (an increase or decrease) until CF_hyp >= CF_threshold, for affirmation of the hypothesis, or CF_hyp < CF_min, for its rejection. The evaluation process is clearly divided into two steps: 1. proposal of a hypothesis on the basis of preliminary (usually 3-5) tests on the most 'significant' symptoms; and 2. hypothesis verification.
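The affirm/reject control regime can be sketched abstractly. The update rule is left as a parameter (the paper's own recalculation formula is a modified version of Pavlov, Mitkov & Filev 1989), and the threshold values below are illustrative only:

```python
def evaluate_hypothesis(symptom_cfs, update, cf_threshold=0.9, cf_min=-0.25):
    """Consume symptom CFs one by one, recalculating CF_hyp until it crosses
    the affirmation threshold or drops below the rejection floor."""
    cf_hyp = 0.0
    for cf in symptom_cfs:
        cf_hyp = update(cf_hyp, cf)
        if cf_hyp >= cf_threshold:
            return "affirmed", cf_hyp
        if cf_hyp < cf_min:
            return "rejected", cf_hyp
    return "undecided", cf_hyp

# With a simple clamped-sum update (a stand-in for the real formula):
clamp = lambda a, b: max(-1.0, min(1.0, a + b))
print(evaluate_hypothesis([0.5, 0.5], clamp))         # ('affirmed', 1.0)
print(evaluate_hypothesis([-0.25, -0.25], clamp))     # ('rejected', -0.5)
```

The early exit on either threshold is what makes the step-1/step-2 split pay off: a few significant symptoms can settle a hypothesis without examining the rest.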
We use a hypothesis verification formula for recalculating the hypothesis on the basis of the availability (in our case also the non-availability) of certain symptoms. The present version of the formula is a modified version of the formula in (Pavlov, Mitkov & Filev 1989), which we have already used successfully in adaptive testing.
4 The two-engine strategy
Two engines are better than one: on the basis of the two approaches developed and tested above, we have studied and proposed a combined strategy which incorporates the advantages of each of these approaches, generating more power and confidence in the search for the antecedent.

The two-engine strategy evaluates each candidate for the antecedent from the point of view of both the integrated approach and the uncertainty reasoning approach. If opinions coincide, the evaluating process is stopped earlier than would be the case if only one engine were acting. This also makes the searching process shorter: our preliminary tests show that the integrated approach engine needs only about 90% of the search it would make when operating on its own; similarly, the uncertainty reasoning engine does only 67% of the search it would do when operating as a separate system. In addition, the results of using both approaches are more accurate (see Table 1 below).

This combined strategy enables the system to consider all the symptoms in a consistent way; it does not regard any symptom as absolute or unconditional. This 'behaviour' is very suitable for symptoms like 'gender' (which could be regarded as absolute in languages like English but 'conditional' in languages like German) or 'number'.[1]

The rationale for selecting a two-engine approach is the following:

1. two independent estimations, if confirmed, bring more confidence in proposing the antecedent;
2. the use of two approaches can be usefully complementary: e.g., the conditionality of gender is better captured by uncertainty reasoning;
3. in sentences with more than one pronoun, center tracking alone (and therefore the integrated approach) is not very helpful for determining the corresponding antecedents;
4. though the uncertainty reasoning approach may be considered more stable in such situations, it is comparatively slow but could adopt a
[1] In German, for instance, "Mädchen" (girl) is neuter, but one can refer to "Mädchen" with a feminine pronoun (sie). In other languages, some singular nouns (e.g., nouns denoting a collective notion) may be referred to by a plural pronoun.
lower CF if intermediate results obtained by both engines are reported to be close (a lower CF will make its process faster); and
5. the two-engine strategy does not depend exclusively on the notion of 'center', which is often considered 'intuitive' in nature.

We see that in certain situations one of the engines can be expected to operate more successfully than the other and complement it, but it is also the parallel confirmation of the results obtained that generates more confidence in the search for the antecedent.

We have implemented the integrated model as a program which runs on Macintosh computers, and the following table shows its success rate. Four text excerpts served as inputs, each taken from a computer science book. Excerpts ranging from 500 to 1000 words and estimated to contain a comparatively high number of pronouns were selected (it was not always easy to find paragraphs abundant in pronominal anaphors). These documents were different from the corpus initially used for the development of the various 'symptom rules' and were hand-annotated (syntactic and semantic roles). Other versions of these excerpts, which contained anaphoric references marked by a human expert, were used as an evaluation corpus.

We tested on these inputs the three programs: (i) the integrated approach, (ii) the uncertainty reasoning approach, and (iii) the two-engine approach. The results (Table 1) show an improvement in resolving anaphors when the integrated approach and the uncertainty reasoning approach are combined into a two-engine strategy.

Table 1: Anaphor resolution success rate using different approaches
          INTEGRATED   UNCERTAINTY   TWO-ENGINE
          APPROACH     REASONING     STRATEGY
Text 1    89.1 %       87.3 %        91.7 %
Text 2    92.6 %       93.6 %        95.1 %
Text 3    91.7 %       90.4 %        93.8 %
Text 4    88.6 %       89.2 %        93.7 %
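The early-stopping idea behind the two-engine control can be sketched in a few lines. The engines themselves are stubbed out; the candidate names and scores below are placeholders, not output of the real system:

```python
# Toy sketch of the two-engine strategy: each engine independently ranks the
# candidates; when their top choices coincide, evaluation stops early.
# Otherwise (one illustrative fallback, not the paper's exact procedure) the
# two engines' scores are summed and the best combined candidate is returned.

def two_engine_pick(candidates, score_a, score_b):
    best_a = max(candidates, key=score_a)
    best_b = max(candidates, key=score_b)
    if best_a == best_b:               # opinions coincide: stop early
        return best_a
    return max(candidates, key=lambda c: score_a(c) + score_b(c))

cands = ["system programs", "machine languages", "user's programs"]
score_a = {"system programs": 0.8, "machine languages": 0.3, "user's programs": 0.2}.get
score_b = {"system programs": 0.91, "machine languages": -0.18, "user's programs": -0.3}.get
print(two_engine_pick(cands, score_a, score_b))   # system programs
```

When the engines agree, no further symptoms need be consumed, which is the source of the reported 90%/67% reductions in search effort.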
5 Illustration
As an illustration of how the new approach works, consider the following sample text:
SYSTEM PROGRAMS
System programs, such as the supervisor and the language translator, should not have to be translated every time they_i are used; otherwise this would result in a serious increase in the time spent in processing user's programs. System programs_i are usually written in the assembly version of machine languages and are translated once into the machine code itself. From then on they_i can be loaded into memory in machine code without the need for any intermediate translation phase.

Step 1. Integrated approach engine:

ANAPHOR = they
CANDIDATES: {system programs, machine languages, assembly version, machine code, user's programs}
AGREEMENT CONSTRAINTS: {system programs, machine languages, user's programs}
SEMANTIC CONSTRAINTS: {system programs, machine languages, user's programs} (no discrimination)
CENTER TRACKING: {system programs} (proposed with higher probability)

Step 2. Uncertainty reasoning approach engine:

ANAPHOR = they
CANDIDATES: {system programs, machine languages, assembly version, machine code, user's programs}

Candidate 1: system programs
symptom 1: number, CF_number = 0.3
symptom 2: person, CF_person = 0.3; CF_hyp = 0.3 + 0.3 - 0.3 * 0.3 = 0.51
symptom 3: gender, CF_gender = 0.3; CF_hyp = 0.657
symptom 4: verb case role, CF_verb-case-role = 0.3; CF_hyp = 0.7029
symptom 5: syntactic parallelism, CF_syntactic-parallelism = -0.2; CF_hyp = (0.7029 - 0.2)/[1 - min(|0.7029|, |-0.2|)] = 0.6286
symptom 6: semantic parallelism, CF_semantic-parallelism = 0.5; CF_hyp = 0.8143
symptom 7: topicalisation, CF_topicalisation = 0.0; CF_hyp = 0.8143
symptom 8: subject, CF_subject = 0.25; CF_hyp = 0.8607
symptom 9: repetition, CF_repetition = 0.6; CF_hyp = 0.8906
symptom 10: head, CF_head = 0.35; CF_hyp = 0.8923
symptom 11: previous, CF_previous = 0.15; CF_hyp = 0.9085
Candidate 1 accepted

Candidate 2: machine languages
symptom 1: number, CF_number = 0.3
symptom 2: person, CF_person = 0.3; CF_hyp = 0.3 + 0.3 - 0.3 * 0.3 = 0.51
symptom 3: gender, CF_gender = -0.6; CF_hyp = (0.51 - 0.6)/[1 - min(|0.51|, |-0.6|)] = -0.1836
symptom 4: verb case role, CF_verb-case-role = 0.1; CF_hyp = (-0.1836 + 0.1)/[1 - min(|0.1|, |-0.1836|)] = -0.0929
symptom 5: syntactic parallelism, CF_syntactic-parallelism = -0.2; CF_hyp = -0.1672
symptom 6: semantic parallelism, CF_semantic-parallelism = -0.2; CF_hyp = -0.2938

Candidate 2 rejected

Due to results similar to those for candidate 2, candidates 3, 4 and 5 are also rejected.

Step 3. The results of Steps 1 and 2 confirm the selection of "system programs" as the antecedent.

Using the uncertainty reasoning approach within the two-engine approach would mean that the CF threshold could be set lower: a CF of 0.89 would in this case be satisfactory, shortening the procedure by two steps. On the other hand, the uncertainty approach confirms the proposal of the integrated approach.
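Several of the recalculations above follow MYCIN-style certainty-factor combination: two non-negative CFs combine as x + y - xy, two non-positive ones as x + y + xy, and CFs of opposite sign as (x + y)/(1 - min(|x|, |y|)). A sketch reproducing the first steps of each candidate (note that the paper uses a modified version of the Pavlov, Mitkov & Filev 1989 formula, so not every printed intermediate value is reproduced by this plain rule):

```python
def combine(x: float, y: float) -> float:
    """MYCIN-style combination of two certainty factors in [-1, 1]."""
    if x >= 0 and y >= 0:
        return x + y - x * y
    if x <= 0 and y <= 0:
        return x + y + x * y
    return (x + y) / (1 - min(abs(x), abs(y)))

# Candidate 1 ("system programs"): number + person, then gender
cf = combine(0.3, 0.3)       # 0.51, as in the text
cf = combine(cf, 0.3)        # 0.657

# Candidate 2 ("machine languages"): gender disagreement pulls the CF down
cf2 = combine(0.51, -0.6)    # (0.51 - 0.6) / (1 - 0.51) = -0.1836...
cf2 = combine(cf2, 0.1)      # about -0.093, matching the text's -0.0929
```

The mixed-sign rule is what lets a single strongly negative symptom (here, gender) overturn an otherwise well-supported hypothesis without sending the CF straight to -1.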
6 Conclusion
We have presented a two-engine strategy for pronoun resolution which combines the engines and advantages of an integrated architecture for anaphor resolution (Mitkov 1994) and of an uncertainty-based anaphor resolution model (Mitkov 1995). Preliminary evaluations show an improvement in performance, and though further investigations and comparisons have to be carried out, we believe that the first results can be regarded as promising.
REFERENCES

Barros, Flávia A. & Anne Deroeck. 1994. "Resolving Anaphora in a Portable Natural Language Front End to Databases". Proceedings of the 4th Conference on Applied Natural Language Processing (ANLP'94), 119-124. Stuttgart, Germany.

Carbonell, James G. & Ralf D. Brown. 1988. "Anaphora Resolution: A Multi-Strategy Approach". Proceedings of the 12th International Conference on Computational Linguistics (COLING-88), vol. I, 96-101. Budapest, Hungary.

Connoly, Dennis, John D. Burger & David S. Day. 1994. "A Machine Learning Approach to Anaphoric Reference". Proceedings of the International Conference "New Methods in Language Processing", 255-261. Manchester: UMIST.

Dagan, Ido & Alon Itai. 1990. "Automatic Processing of Large Corpora for the Resolution of Anaphora References". Proceedings of the 13th International Conference on Computational Linguistics (COLING-90), vol. III, 1-3. Helsinki, Finland.

Hayes, Philip J. 1981. "Anaphora for Limited Domain Systems". Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI'81), 416-422. Vancouver, Canada.

Hirst, Graeme. 1981. Anaphora in Natural Language Understanding. Berlin: Springer Verlag.

Hobbs, Jerry R. 1978. "Resolving Pronoun References". Lingua 44.339-352.

Ingria, Robert J.P. & David Stallard. 1989. "A Computational Mechanism for Pronominal Reference". Proceedings of the 27th Annual Meeting of the ACL, 262-271. Vancouver, British Columbia.

Lappin, Shalom & Michael McCord. 1990. "Anaphora Resolution in Slot Grammar". Computational Linguistics 16:4.197-212.

Mitkov, Ruslan. 1994a. "An Integrated Model for Anaphora Resolution". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 1170-1176. Kyoto, Japan.

Mitkov, Ruslan. 1994b. "A New Approach for Tracking Center". Proceedings of the International Conference "New Methods in Language Processing", 150-154. Manchester: UMIST.

Mitkov, Ruslan. 1995. "An Uncertainty Reasoning Approach for Anaphora Resolution". Proceedings of the Natural Language Processing Pacific Rim Symposium (NLPRS'95), 149-154. Seoul, Korea.

Nasukawa, Tetsuya. 1994. "Robust Method of Pronoun Resolution Using Full-Text Information". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 1157-1163. Kyoto, Japan.
Pavlov, Radoslav, Ruslan Mitkov & Philip Filev. 1989. "An Adaptive Uncertainty Reasoning-Based Model for Computerised Testing". Proceedings of the 3rd International Conference "Children in the Information Age", 92-98. Sofia, Bulgaria.

Rico Pérez, Celia. 1994. Statistical-Algebraic Approximation to Discourse Anaphora. Ph.D. dissertation, Department of English Philology, University of Alicante. Alicante, Spain. [In Spanish.]

Preuß, Susanne, Birte Schmitz, Christa Hauenschild & Carla Umbach. 1994. "Anaphora Resolution in Machine Translation". Studies in Machine Translation and Natural Language Processing, vol. VI (Text and Content in Machine Translation: Aspects of Discourse Representation and Discourse Processing) ed. by Wiebke Ramm, 29-52. Luxembourg: Office for Official Publications of the European Community.

Rich, Elaine & Susann LuperFoy. 1988. "An Architecture for Anaphora Resolution". Proceedings of the 2nd Conference on Applied Natural Language Processing, 18-24. Austin, Texas.

Rolbert, Monique. 1989. Résolution de formes pronominales dans l'interface d'interrogation d'une base de données [Resolution of Pronouns in Natural Language Front Ends]. Ph.D. dissertation, Faculty of Science, Luminy, France.

Sidner, Candace L. 1986. "Focusing in the Comprehension of Definite Anaphora". Readings in Natural Language Processing ed. by Barbara J. Grosz et al., 363-394. Los Altos, Calif.: Morgan Kaufmann.
Effects of Grammatical Annotation on a Topic Identification Task

TADASHI NOMOTO
Advanced Research Laboratory, Hitachi Ltd.

Abstract

The paper describes a new method for discovering topical words in discourse. It shows that text categorisation techniques can be turned into an effective tool for dealing with the topic discovery problem. Experiments were done on a large Japanese newspaper corpus. It was found that training the model on annotated corpora does lead to an improvement on the topic recognition task.
1 Introduction
The problem of identifying a topic or subject matter of discourse has long attracted attention from diverse research paradigms. In computational linguistics, the problem more or less takes the form of resolving anaphora (Hobbs 1978; Grosz & Sidner 1986; Lappin & Leass 1994) or locating the discourse center (Joshi & Weinstein 1981; Walker et al. 1994). In information retrieval (IR), the problem came to be known as text categorisation (TC), which concerns classifying documents under a set of pre-defined categories (Lewis 1992; Finch 1994).

While incorporating some of the insights from computational linguistics, the present work extends the text categorisation paradigm to solve the topic identification issue. The paper recasts topic identification as a task of finding in a text representative words under which that text is most likely to classify. Thus identifying a topic in a text requires working with an unbounded set of categories rather than with a bounded, possibly small number of pre-defined categories. Although the idea of using a complex representation of text has yet to prove its value in text classification (Lewis 1992), recent years have seen some progress in the area of corpus-based NLP toward exploiting linguistic representation more sophisticated than the simple word form (Hindle 1990). As our contribution to research in this direction, we are going to report some promising results from experiments with Japanese corpora. In particular, we will show that a representation based on postpositional
phrases (PP), i.e., phrases composed of a noun and a following case particle, is more effective for the topic identification task than a simple word-based representation.

(Futsu) (gin) -ga (Kiev) -ni (chuuzai) (in) (jimusho)
French bank SBJ Kiev at resident staff office

[The body of the annotated example, a romanised Japanese news story with each noun in parentheses, each case particle marked by a preposed dash and each particle-final group bracketed, together with its word-by-word English gloss, is not fully reproducible here; the headline gloss above and the translation below convey its content.]
MAJOR FRENCH BANK OPENS OFFICE IN KIEV
Société Général, a major French bank, disclosed on the 15th a plan to open a resident office in Kiev, capital of Ukraine. The bank has already obtained permission from the city authority, sources say.

Fig. 1: Annotating a news story
2 Topic recognition model
This section describes an approach to the topic identification problem. What we are going to do is formulate the problem as a text categorisation task, i.e., one of classifying documents with respect to a set of pre-defined categories. The formulation is fairly straightforward: we define the problem of finding a topic in text as that of categorising a text with respect to a set of nouns derived from that text. A most probable topic is, then, one with which the text in question is most likely to classify. The job of text categorisation is to estimate L(c | d), the likelihood that a document d is assigned to a category c. Let us call a word with which to classify the text a potential topic of the text. Given a set W(d) of words comprising a text d and a set S(d) of potential topics for d, the job of topic identification is to find an estimate L(c | d), for c ∈ S(d), where S(d) ⊆ W(d).
Now let us consider a likelihood function defined by:

L(c | d) = Σ_t P(c | t) P(t | d)

which is meant to be a relativisation of the relationship between c and d to some index t (Fuhr 1989); the index t could be a simple word or a linguistically more sophisticated representation. The set of such indices is said to represent a text. Assume that every index t will be assigned to some category. Then, by Bayes' theorem, we have:[1]

P(c | t) = P(t | c) P(c) / P(t)

Given a set R(d) of indices for a text d, we define the likelihood function, for c ∈ S(d), by:

L(c | d) = Σ_{w ∈ R(d)} [P(T = w | c) P(c) / P(T = w)] P(T = w | d)

We refer to the formula above as 'TRM' hereafter. 'T = w' denotes the event that a randomly selected word from a document coincides with w. P(c) represents the probability that a randomly selected document is assigned to category c; P(T = w | c) is the probability that a word randomly selected from a document coincides with w, given that category c is assigned to that document; P(T = w) denotes the probability that a word w is selected from a randomly chosen document; P(T = w | d) is the probability that word w is randomly selected from document d. We estimate the component probabilities by:

P(c) = Dc / D
P(T = w | c) = Fwc / F*c
P(T = w | d) = Fwd / F*d
P(T = w) = FwD / F*D

where D is the number of texts found in the training corpus, Dc is the number of texts whose title contains the term c, F*c is the number of token indices in Dc, Fwc is the frequency of w in Dc, F*d is the count of token indices in d, Fwd is the count of w in d, FwD is the frequency of w in the training corpus, and F*D is the total number of token indices in the training corpus.

TRM is based on the simple assumption that if an index is more typical or characteristic of a text, it is more likely to be associated with a topic of that text. For instance, turmeric is a popular coloring spice used in many of the recipes for Indian food. Thus the word 'turmeric' appears very often in an Indian cookbook and therefore does not serve to indicate a particular dish. (For that matter, peas or beans may better indicate what a particular recipe is for.) How typical an index is of a particular text is determined statistically by measuring the degree of its maldistribution or skewness (Umino 1988). TRM uses the following measure for evaluating the typicalness of an index (call it Icd(w)):

Icd(w) = P(T = w | c) P(T = w | d) / P(T = w)

[1] There are some choices we can make as to the nature of t. In binary term indexing (Lewis 1992), a document is represented as a binary vector {0, 1}^n, which records the presence or absence of terms for the document, and t ranges over a set of possible documents. On the other hand, in weighted term indexing (Iwayama & Tokunaga 1994), a document is represented as a set of term frequencies and t ranges over a set of possible terms {w1, ..., wn}. A most important difference between the two indexing policies is that the former is concerned with document frequency, i.e., the number of documents in which a term occurs, while the latter is concerned with term frequency, the frequency of a term within a document. We decided to go along with the weighted term indexing policy, because the binary policy, as it stands, is known to fail where training data is not sufficiently available (Iwayama & Tokunaga 1994).
Let x and y be indices for a text d. Suppose that Fxc = Fyc and FxD = FyD. If Fxd > Fyd, then x has a more skewed distribution and contributes more to the I value than y, i.e., Icd(x) > Icd(y). The same result follows if Fxc = Fyc, Fxd = Fyd, and FxD < FyD. In either case, x is said to be more typical of d than y.
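Under the estimates above, TRM reduces to simple corpus counts. A minimal sketch with an invented toy corpus (titles serve as categories and bodies as index-token lists; the counts have nothing to do with the newspaper corpus used in the experiments below):

```python
from collections import Counter

# Toy training corpus: each document is (title_terms, body_index_tokens).
train = [
    ({"bank"},   ["bank", "office", "Kiev", "bank"]),
    ({"bank"},   ["bank", "loan", "office"]),
    ({"office"}, ["office", "staff", "staff"]),
]

D = len(train)
Dc = Counter()                 # number of documents whose title contains c
Fwc = {}                       # per-category frequency of w (Fwc)
FwD = Counter()                # corpus-wide frequency of w (FwD)
for titles, body in train:
    FwD.update(body)
    for c in titles:
        Dc[c] += 1
        Fwc.setdefault(c, Counter()).update(body)
FD = sum(FwD.values())         # F*D: total token indices in the corpus

def trm(c, doc):
    """TRM score: P(c) * sum over doc tokens of P(T=w|c) P(T=w|d) / P(T=w)."""
    Fd = Counter(doc)
    Fc = sum(Fwc.get(c, Counter()).values())   # F*c
    score = 0.0
    for w, n in Fd.items():
        p_w = FwD[w] / FD
        if p_w == 0 or Fc == 0:
            continue
        score += (Fwc[c][w] / Fc) * (n / len(doc)) / p_w
    return (Dc[c] / D) * score

doc = ["bank", "office", "loan"]
best = max({"bank", "office"}, key=lambda c: trm(c, doc))
print(best)   # bank
```

The skewness intuition shows directly in the per-token term: a word frequent within the category and the document but rare corpus-wide contributes the most, exactly as Icd(w) prescribes.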
3 Text representation
A major concern of this paper is to find out whether annotating corpora with grammatical information affects the model's performance on the topic recognition task. In text categorisation, a text is represented in terms of an indexing language, a set of indices constructed from the vocabulary that makes up that text. We make use of two languages for indexing a text: one is formed from the nouns that occur in the text, and the other from nouns tagged with the postposition of the phrase in which they occur. For a text d, let R+(d) be an indexing language with taggings and R-(d) be one without. Annotating a text goes through two processes: tokenising the text into an array of words, and tagging the words in a postpositional phrase with its postposition, or case particle. We start by dividing a text, which is nothing but a stream of characters, into words. The procedure is carried out with
the use of a program called JUMAN, a popular public-domain tool for the morphological analysis of Japanese (Matsumoto et al. 1993). Since there was no Japanese parser robust enough to deal with free texts such as the one used here, postpositional phrases were identified using a very simple strategy of breaking an array of word tokens into groups at punctuation marks ('.' and ',') as well as at case particles. After examining the results on 10 to 20 texts, we decided that the strategy was good enough. Each token in a group was tagged with the case particle which is the postposition of the group.

Figure 1 lists a sample news article from the test data used in our experiments. The part above the horizontal line corresponds to the headline; the part below the line corresponds to the body of the article. We indicate nouns by parentheses '( )' and case particles by a preposed dash '-'. In addition, we use square brackets '[ ]' to indicate a phrase for which a case particle is a postposition. A tokenisation error is marked with a single star ('*'); a parsing error is doubly starred ('**'). 'ø' indicates that the noun it attaches to is part of the verbal morphology and thus does not take a regular case particle. For the sake of readability, we adopt the convention of representing Japanese index words by their English equivalents.

R-(d) = { French, bank, big-name, Societé, General, on 15th, U-, kra-, ine, capital, Kiev, resident, staff, office, open, disclose, city, authority, permission }

R+(d) = { Frenchα, bankα, big-nameα, Societéβ, Generalβ, on 15thα, U-α, kra-α, ineα, capitalγ, Kievγ, residentδ, staffδ, officeδ, openø, discloseø, Kievα, cityα, authorityα, permissionε }

Fig. 2: Indexing languages

A plain index language is made up of nouns found in the sample article; an annotated index language is like a plain one except that the nouns are tagged with case particles (denoted by superscripts). The list of the particles is given along with explanations in Table 1. Shown in Figure 2 are the two kinds of indexing vocabulary derived from the news article example in Figure 1. The superscripts α, β, γ, δ and ε correspond to the particles no, wa, ni, wo and mo, respectively; thus 'Societéβ', for instance, represents the Japanese wa-annotated term 'Societé wa', and similarly for the others. Notice that unlike the plain index language, the language with annotation contains two

ga    SUBJECT
no    OF, WHOSE
wo    OBJECT
wa    AS FOR, AS REGARDS TO
ni    FOR, TO
to    AND
de    AT, IN
e     TO, IN THE DIRECTION OF
mo    AS WELL
ka    OR
kara  FROM
yori  FROM
Table 1: Case particles, based on (Sahuma 1983)

instances of 'Kiev', i.e., 'Kievγ' and 'Kievα', reflecting the fact that there are two particles in the news piece (no, ni) which are found to occur with the word 'Kiev'.
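Once a text is segmented into (noun, particle) groups, both indexing languages fall out mechanically. A sketch (the pair list below is a hypothetical fragment, not actual JUMAN output, and the superscript annotation of the paper is approximated here with a hyphen):

```python
# Build plain (R-) and annotated (R+) index terms from pre-segmented
# (noun, particle) pairs; particle None marks a noun inside verbal
# morphology (the 'ø' case), which takes no regular case particle.

def index_terms(pairs, annotated=True):
    terms = []
    for noun, particle in pairs:
        if annotated:
            terms.append(f"{noun}-{particle or 'ø'}")
        else:
            terms.append(noun)
    return terms

pairs = [("Kiev", "ni"), ("jimusho", "wo"), ("kaisetsu", None), ("Kiev", "no")]
print(index_terms(pairs, annotated=False))  # ['Kiev', 'jimusho', 'kaisetsu', 'Kiev']
print(index_terms(pairs))                   # ['Kiev-ni', 'jimusho-wo', 'kaisetsu-ø', 'Kiev-no']
```

As in the R+(d) example above, the annotated language keeps the two 'Kiev' occurrences apart because they carry different particles, while the plain language conflates them.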
4 Experiments
In this section, we will report the performance of the topic recognition model on indexing languages with and without grammatical annotation. Recall that an indexing language is something that represents text corpora and usually consists of a set of terms derived in one way or another from the corpora. Our experiments used a total of 44,001 full-text news stories from Nihon Keizai Shimbun, a Japanese economics newspaper. All of the stories appeared in the first half of 1992. Of these, 40,401 stories, which appeared on May 31, 1992 or earlier, were used for training, and the remaining 3,600 articles, which appeared on June 1, 1992 or later, were used for testing.
4.1 Test setting
We divided the test set into nine subsets of stories according to the length of the story. Each subset contained 400 stories. Test set 1, for instance, contains stories less than 100 (Japanese) characters in length; test set 2 consists of stories between 100 and 200 characters in length; and test set 3 contains stories whose length ranges from 200 to 300 characters (Table 2).
test set   length (in char.)   num. of doc.
1          < 100               400
2          100-200             400
3          200-300             400
4          300-400             400
5          400-500             400
6          500-600             400
7          600-700             400
8          700-800             400
9          800-900             400

Table 2: Test sets
The topic identification is a two-step process: (1) it estimates, for each potential topic, the degree of its relationship with the text, i.e., L(c | d), and (2) it then identifies a potential topic which is likely to be an actual topic of the text². This involves using decision strategies like k-per-doc, proportional assignment and probabilistic thresholding. The estimating part uses TRM as a measure of the relationship between a potential topic c and a text d, for c ∈ S(d) and d ∈ D. TRM takes as inputs a text d from the test corpus and a potential topic c, and determines how actual c is with respect to d.

² A potential topic is said to be actual if it occurs in the text's headline.

Here are some details on how to estimate the probabilities. The training set of 40,401 stories was used to determine prior probabilities for P(c), P(T = w), and P(T = w | c). P(c) is the probability that a story chosen randomly from the training set is assigned a title term c. As mentioned in Section 2, we estimated this probability as Dc/D, where Dc is the number of texts whose title has an occurrence of c, and D is the total number of texts, i.e., D = 40,401. The estimation of P(T = w) and P(T = w | c) ignored the frequency of w in a title. P(T = w) was estimated as FwD/F*D, with F*D = 3,213,617, the number of noun tokens found in the training corpus. We estimated P(T = w | c) by Fwc/F*c, where Fwc = Σd∈Dc Fwd and F*c = Σd∈Dc F*d. Again, in estimating P(T = w | c), we counted out any of w's occurrences in a headline. P(T = w | d) was estimated as Fwd/F*d for an input text d. We would have F*d = 19 for the text in Figure 1, which contains 19 noun tokens.

Now for the deciding part. Based on the probability estimates of L(c | d), we need to figure out which topic(s) should be assigned to the text. The text categorisation literature makes available several strategies for doing this (Lewis 1992). In the probabilistic thresholding scheme, a category (= potential topic) is assigned to a document d just in case L(c | d) > s, for some threshold constant s.³ In a k-per-doc strategy, a category is assigned to the documents with the top scores on that category. Another commonly used strategy is called proportional assignment: a category is assigned to its top-scoring documents in proportion to the number of times the category is assigned in the training corpus. In the experiments, we adopted the probabilistic thresholding scheme.⁴ Although it is perfectly all right to use k-per-doc here, the empirical truth is that the text categorisation fares better with probabilistic thresholding than with k-per-doc.
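The frequency-based estimates described above can be sketched on a toy corpus. This is an illustration only: the data and variable names are invented, and the TRM measure itself (defined earlier in the paper) is not reproduced here.

```python
# Sketch of the frequency-based estimates on toy data (not the paper's corpus).
# Each training story is (title_terms, body_nouns); headline occurrences of a
# word are excluded from the body counts, as described in the text.
from collections import Counter

train = [({"bank"}, ["bank", "merger", "bank"]),
         ({"merger"}, ["merger", "deal"]),
         ({"bank"}, ["loan", "bank"])]

D = len(train)
D_c = Counter()                       # number of stories with c in the title
F_wD = Counter()                      # noun-token counts over the whole corpus
F_wc = Counter()                      # noun-token counts within stories titled c
for titles, nouns in train:
    for c in titles:
        D_c[c] += 1
        for w in nouns:
            F_wc[(w, c)] += 1
    F_wD.update(nouns)
F_star_D = sum(F_wD.values())         # total noun tokens in the corpus
F_star_c = Counter({c: sum(n for (w, c2), n in F_wc.items() if c2 == c)
                    for c in D_c})    # total noun tokens in stories titled c

def P_c(c):              return D_c[c] / D            # P(c) = Dc / D
def P_T_w(w):            return F_wD[w] / F_star_D    # P(T=w) = FwD / F*D
def P_T_w_given_c(w, c): return F_wc[(w, c)] / F_star_c[c]  # Fwc / F*c

print(P_c("bank"))                    # 2/3
print(P_T_w("merger"))                # 2/7
print(P_T_w_given_c("bank", "bank"))  # 3/5
```

A thresholding decision then simply assigns a category c to d whenever the estimated score exceeds a constant s.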
4.2 Result and analysis
In what follows, we discuss some of the results of the performance of the topic recognition model. The model was tested on the nine test sets in Table 2. For each of the test sets, we experimented with two indexing languages, one with annotation and one without, to observe any effects annotation might have on the recognition task. The goal was to determine the terms most likely to indicate a topic of the article on the basis of estimates of L(c | d) for each indexing term in the article. Following Gale et al. (1992), we compare our model against a baseline model, which establishes lower bounds on the recognition task. We estimate the lower bound as the probability that a title term is chosen randomly from the document, i.e., P(c | d). The baseline represents a simple, straw-man approach to the task, which should be outperformed by any reasonable model: it embodies the idea that a more frequent word is a more likely candidate for topichood.

³ One of the important assumptions it makes is that the probability estimates are comparable across categories as well as across documents: that is, it assumes that it is possible to have an ordering, L(c1 | d1) > L(c1 | d2) > ··· > L(cn | dm), among the possible category/document pairs in the test corpus.
⁴ There is an obvious reason for not using the proportional assignment policy in our experiments. Since the set of categories (title terms) in the training corpus is open-ended and thus not part of the fixed vocabulary, it is difficult to imagine how the assignment ratio of a category in the training corpus is reflected on the test set.

Figure 3 shows the performance of the recognition model on plain and annotated indexing languages for a test corpus with stories less than 100 characters long (test set 1). The baseline performance is also shown for comparison. As it turns out, at the break-even point⁵, the model's performance is higher by 5% on the annotated language (54%) than on the plain language (49%). Either score is much higher than the baseline (19%).

⁵ A break-even point is defined to be the highest point at which precision and recall are equal. It is intended to be a summary figure for a recall-precision curve.

Table 3 summarises the results for all of the test sets, with R-(d) giving performance on the plain language and R+(d) performance on the annotated language:

    test set    length (in char.)    R-(d)    R+(d)    baseline
    1           < 100                49%      54%      19%
    2           100-200              42%      44%      33%
    3           200-300              35%      37%      30%
    4           300-400              31%      32%      32%
    5           400-500              31%      33%      35%
    6           500-600              30%      31%      35%
    7           600-700              28%      29%      37%
    8           700-800              25%      26%      34%
    9           800-900              26%      26%      35%

Table 3: Summary statistics

We see from the table that grammatical annotation does enhance the model's performance⁶. Note, however, that as the length of a story increases, the model's performance rapidly degrades, falling below the baseline at test set 5. This happens regardless of whether the model is equipped with the extra information. The reason appears to be that the benefits from annotating the text are cancelled out by the greater amount of irrelevancy or noise contained in a larger text. The increase in text length affects factors like S(d) and R(d), which we assumed to be equal. Recall that the former denotes a set of potential topics and the latter a set of indices or nouns extracted from the text. Thus the increase in text length causes both R(d) and S(d) to grow accordingly. Since the title length stays rather constant over the test corpus, the possibility that an actual topic is identified by chance would be higher for short texts than for lengthy ones. Indeed, we found that 13% of the index terms were actual for test set 1, while the rate went down to 3% for test set 9.

⁶ Figures in the table are micro-averaged, i.e., expected probabilities of recall/precision per categorisation decision (Lewis 1992).

One way to increase the model's resistance to noise would be to turn to the idea of mutual information (Hindle 1990), or to use only those terms which strongly predict a title term (Finch 1994). Or one may try the less sophisticated approach of reducing the number of category assignments to, say, the average length of the title.
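A break-even point over micro-averaged precision and recall, i.e., one figure per categorisation decision rather than per category, can be sketched as follows. The scores are invented for illustration; they are not the paper's data.

```python
# Sketch: micro-averaged precision/recall over categorisation decisions,
# swept over a threshold to locate the break-even point (precision = recall).

def micro_pr(decisions, threshold):
    """decisions: list of (score, is_actual_topic) pairs, one per
    category/document pair; a category is assigned iff score > threshold."""
    assigned = [(s, a) for s, a in decisions if s > threshold]
    tp = sum(1 for _, a in assigned if a)                 # correct assignments
    fn = sum(1 for s, a in decisions if a and s <= threshold)  # missed topics
    precision = tp / len(assigned) if assigned else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall

decisions = [(0.9, True), (0.8, True), (0.7, False), (0.6, True),
             (0.5, False), (0.4, False), (0.3, True)]

# The break-even point is the highest value of min(precision, recall)
# reached as the threshold varies over the observed scores.
thresholds = [s for s, _ in decisions]
best = max(min(micro_pr(decisions, t)) for t in thresholds)
print(round(best, 2))   # 0.75, reached at threshold 0.5 where p = r = 0.75
```

At threshold 0.5 the sketch assigns four category/document pairs, three of them correct, giving precision = recall = 0.75.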
5 Conclusion
In this paper, we have proposed a method for identifying topical words in Japanese text, based on probabilistic models of text categorisation (Fuhr 1989; Iwayama & Tokunaga 1994). The novelty of the present approach lies in the idea that the problem of identifying a discourse topic can be recast as that of classifying a text with terms occurring in that text. The results of experiments with the Japanese corpus showed that the model's performance is well above the baseline for texts less than 100 characters in length, though it degrades as the text length increases. Also shown in the paper was that annotating the corpus with extra information is worth the trouble, at least for short texts. Furthermore, the model applies to other less inflectional languages, in so far as it works on a word-based representation. The next step to take would be to supply the ranking model with information on the structure of discourse and develop it into a model of anaphora resolution (Hearst 1994; Nomoto & Nitta 1994; Fox 1987).
Acknowledgements. The author is indebted to Makoto Iwayama and Yoshiki Niwa for discussions and suggestions about the work.

REFERENCES

Finch, Steven. 1994. "Exploiting Sophisticated Representations for Document Retrieval". Proceedings of the 4th Conference on Applied Natural Language Processing, 65-71. Stuttgart, Germany: Institute for Computational Linguistics, University of Stuttgart.

Fox, Barbara A. 1987. Discourse Structure and Anaphora. (= Cambridge Studies in Linguistics, 48). Cambridge: Cambridge University Press.

Fuhr, Norbert. 1989. "Models for Retrieval with Probabilistic Indexing". Information Processing & Management 25:1.55-72.

Gale, William, Kenneth W. Church & David Yarowsky. 1992. "Estimating Upper and Lower Bounds on the Performance of Word-Sense Disambiguation Programs". Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics (ACL'92), 249-256.

Grosz, Barbara & Candace Sidner. 1986. "Attention, Intentions and the Structure of Discourse". Computational Linguistics 12:3.175-204.

Hearst, Marti A. 1994. "Multi-Paragraph Segmentation of Expository Text". Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL'94), 9-16.

Hindle, Donald. 1990. "Noun Classification from Predicate-Argument Structures". Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics, 268-275.

Hobbs, Jerry. 1978. "Resolving Pronoun References". Lingua 44.311-338.

Iwayama, Makoto & Takenobu Tokunaga. 1994. "A Probabilistic Model for Text Categorisation: Based on a Single Random Variable with Multiple Values". Proceedings of the 4th Conference on Applied Natural Language Processing, 162-167.

Joshi, Aravind K. & Scott Weinstein. 1981. "Control of Inference: Role of Some Aspects of Discourse Structure - Centering". Proceedings of the International Joint Conference on Artificial Intelligence, 385-387.

Lappin, Shalom & Herbert J. Leass. 1994. "An Algorithm for Pronominal Anaphora Resolution". Computational Linguistics 20:4.535-561.

Lewis, David D. 1992. "An Evaluation of Phrasal and Clustered Representations on a Text Categorisation Task". Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 37-50.

Matsumoto, Yuji, Sadao Kurohashi, Takehito Utsuro, Yutaka Taeki & Makoto Nagao. 1993. Japanese Morphological Analysis System JUMAN Manual. Kyoto, Japan: Kyoto University. [In Japanese.]

Nomoto, Tadashi & Yoshihiko Nitta. 1994. "A Grammatico-Statistical Approach to Discourse Partitioning". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 1145-1149. Kyoto, Japan.

Sakuma, Kanae. 1983. Gendai Nihongohō-no Kenkyu [A Study on the Grammar of Modern Japanese]. Tokyo, Japan: Kuroshio-Shuppan.

Umino, Bin. 1988. "Shutsugen-Hindo-Jyouhou ni-motozuku Tango-Omomizuke-no-Genri [Some Principles of Weighting Methods Based on Word Frequencies for Automatic Indexing]". Library and Information Science 26.67-88.

Walker, Marilyn, Masayo Iida & Sharon Cote. 1994. "Japanese Discourse and the Process of Centering". Computational Linguistics 20:2.193-232.
Discourse Constraints on Theme Selection

WIEBKE RAMM
University of the Saarland

Abstract

In this paper we deal with the area of thematisation as a grammatical as well as a discourse phenomenon. We investigate how discourse parameters such as text type and subject matter can affect sentence-level theme selection as one of the grounding devices of language. Aspects of local and global thematisation are described in terms of a systemic-functionally oriented framework, and it is argued that correlations between text-level and sentence-level discourse features can be modelled as inter-stratal constraints in a stratificational text generation architecture.

1 Introduction
Our starting point is the observation that language is quite flexible regarding how a piece of information can be communicated; the same state of affairs can often be expressed by very different linguistic means, such as word order alternatives, different lexical material, or different grammatical constructions. In most cases these options are not arbitrarily interchangeable, however, since in addition to the transmission of propositional meaning, a linguistic utterance also aims to achieve certain pragmatic effects which can only be reached when the information is presented in an appropriate manner. To this end, language is provided with special grammatical and semantic devices guiding the foregrounding and backgrounding of particular parts of the information in a sentence (cf. Ramm et al. 1995:34f.):

- Focusing¹ is a textual means responsible for the information distribution in a clause. The focus, which is usually intonationally marked, is the locus of principal inferential effort within each message (cf. Lavid 1994a:24) and has a typical correlation with what is new (in contrast to what is given) in a sentence.
¹ The notions of focus as well as theme have found diverging interpretations in different linguistic and computational-linguistic schools (for a comparison cf. Lavid 1994a). The definitions we are working with here are mainly inspired by the theory of systemic-functional linguistics (SFL). We will outline some central concepts of this approach below.
- Thematisation (in its sentence-grammatical notion) guides the local contextualisation of a sentence by assigning particular thematic prominence to a part of the message, the theme. "The theme is the element which serves as the point of departure of the message; it is that with which the clause is concerned. The remainder of the message, the part in which the theme is developed, is called ... the rheme" (Halliday 1994:37).
- Ranking relates to how an element of a situation (e.g., an event or entity) is encoded grammatically, for instance, whether it is realised as a verbal construction, a nominalisation, a complement or a circumstance. The grammatical mechanisms of ranking closely interact with the textual means of focusing and thematisation.
- Taxis, with its basic options hypotaxis and parataxis, provides another type of grounding distinction rooted in grammar, this time in terms of a dependency structure holding between clauses.

How these linguistic devices are actually deployed in the realisation of a message in order to achieve a particular communicative goal depends on factors such as the (local) textual context in which it appears, but also on global parameters, such as the text type to which the whole discourse, of which the message forms a part, belongs, and the subject matter it is about. In this paper we will focus on the area of thematisation in German. In particular, we will investigate in which way aspects of global discourse organisation, namely text type and subject matter, may influence the selection of grammatical theme at sentence level. The types of correlations we are looking for can be relevant for different NLP applications where the local, sentence-level, as well as the global, text-level, organisation of discourse has to be accounted for.
Our application domain is text generation, where one of the notorious problems is the gap between global-level text planning (strategic generation) and lexico-grammatical expression (tactical generation), which has been termed the generation gap (cf. Meteer 1992). The output quality of many full-scale generators is compromised because the text planner cannot exercise sufficient control over the fine-grained distinctions available in the grammar. We argue that some of the problems can be accounted for by recognising the variety of linguistic resources involved as distinct modules or strata in a multi-stratal language architecture, and by representing characteristic correlations between selections on different strata as inter-stratal constraints.
2 Text type, subject matter and theme selection
Before having a look at the realisation of theme in some concrete text examples, we will start with a few more words on the conception of grammatical theme we are proceeding from, and on the options the German language provides according to our model. As mentioned at the beginning, our notion of theme is inspired by the theory of systemic-functional linguistics (SFL) (for an overview of the basic ideas of SFL cf. Halliday 1994; Matthiessen & Bateman 1991), according to which theme is a textual resource of the language system which, together with other cohesive and structural means such as reference, substitution, ellipsis, conjunction, lexical cohesion and focus, is responsible for the coherence of a text. Theme as "the resource for setting up the local context" (Matthiessen 1992:449) in which each clause is to be interpreted, the point of departure in Halliday's definition (see above), provides only one of the textually significant variation possibilities of word order; it closely interacts with other resources such as focus, transitivity, voice/diathesis, and mood. The theme is a function with a particular textual status (thematic prominence) in the clause and becomes the resource for manipulating the contextualisation of the clause. Theme in this systemic-functional meaning has originally been described with respect to English grammar; the account of theme in the German clause, some basic ideas of which we will briefly summarise now, is described in more detail in Steiner & Ramm (1995). For the realisation of theme in German, there is a clear rough correspondence with what is described as the 'Vorfeld' in other approaches (see e.g., Hoberg 1981), i.e., the theme is realised in the position before the finite verb. One of the typical features of a systemic-functional account of theme is the observation that the theme can be realised by metafunctionally different elements, i.e., it can be ideational, interpersonal or textual.
Metafunctional diversification is a central notion of systemic-functional theory that reflects the view of language as being functionally diversified into three generalised functions: the ideational, which is concerned with the propositional-content type of linguistic information; the interpersonal, which provides the speaker/writer with the resources for creating and maintaining social relations with the listener/reader; and the textual, which provides the resources for contextualising the other two types of information, i.e., presents ideational and interpersonal information as text in context (cf. Matthiessen & Bateman 1991:68).
250
WIEBKE RAMM
draws on circumstantial and participant roles of an event, e.g., Ich werde ge hen. (I will go.) In grammatical terms, this is a subject-theme. An example of contextualisation by interpersonal means is thematisation of an interac tion marker, such as a modal circumstantial role, e.g., Vielleicht werde ich gehen. (Possibly I will go.) On the grammatical level, the theme is filled by a modal adjunct. Contextualisation by textual means operates on the resource of logico-semantic relations, expressed grammatically by conjunc tions or conjunctive adjuncts, e.g., Daher werde ich gehen. (Therefore I will go.) Theme variation in German comprises two further dimensions, namely simple vs. multiple, and unmarked vs. marked theme. The former distinguishes themes realised by a single semantic function from those filled by more than one, the latter relates to whether a certain theme choice leads to marked intonation which closely relates to the area of focus. We will now investigate how these options surface in 'real-life' texts of different text types. The two texts we are going to have a look at are taken from a more representative corpus of short texts covering text types ranging from narrative, descriptive and expository to argumentative and instructive texts. The texts which have been selected in correspondence with a parallel corpus of English texts (cf. Lavid 1994b) have been analysed according to a number of parameters such as discourse purpose, subject matter, global chaining strategy, and focus category (cf. Villiger 1995). The first sample text — a section from a travel guide — is of the descriptive type. Text 1: "Sevilla" (from: T. Schröder: Andalusion. M. Müller Verlag, Erlangen, 1993, pp.332-333.) 2 (01) Sevillas Zentrum liegt östlich eines Seitenkanals des Rio Guadalquivir, der die Stadt etwa in Nord-Süd-Richtung durchzieht. (The Centre of Seville is situated east of a side canal of the Rio Guadalquivir which runs through the city roughly from north to south.) 
(02) Hauptstraße ist die Avenida de la Constitucion; (The main street is the Avenida de la Constitucion;)
(03) in ihrer unmittelbaren Umgebung liegen mit Kathedrale und Giralda
sowie der Alcazaba die bedeutendsten Sehenswürdigkeiten der Stadt. (in its surroundings,
immediate
the most important sights of the city, the cathedral, the Giralda, and the
Alcazaba, are situated.)
(04) Östlich schließt sich das Barrio de Santa Cruz an, Sevil
las lauschiges Vorzeigeviertel. (In the east, the Barrio de Santa Cruz, Seville's
secluded
showpiece quarter, borders on the city.) (05) Die Avenida de la Constitucion beginnt im Süden am Verkehrsknotenpunkt Puerta de Jerez und mündet im Norden in den Dop2
English glosses of the German text passages are given in italics; the sentence theme of each clause is underlined. If English theme is roughly equivalent in type and meaning, we have also underlined the themes in the English version.
DISCOURSE CONSTRAINTS ON THEME SELECTION
251
pelplatz Plaza San Francisco/Plaza Nueva; (The Avenida de la Constitucion begins in the south at the Puerta de Jerez junction and leads into the double square Plaza San Francis co/Plaza Nueva in the north.) (06) Hier liegt auch das Geschäftsviertel um die Haupteinkaufsstraße Calle Sierpes. (Here also the shopping centre around the main shop ping street, Calle Sierpes, is situated.) (07) Südlich des engeren Zentrums erstrecken sich der Park Parque de Maria Luisa und das Weltausstellungsgelände von 1929, die Plaza de Espana. (South of the immediate centre the park Parque de Maria Luisa and the site of the world fair 1929, the Plaza de Espana, are located.) (08) Jenseits des Gualdalquivir sind zwei ehemals selbständige Siedlungen zu abendlichen und nächtlichen Anlaufad ressen avanciert: das volkstümliche Barrio de Triana auf Höhe des Zentrums und, südlich anschließend, das neuzeitlichere Barrio de los Remedios auf Höhe des Parque de Maria Luisa. (Beyond the Guadalquivir two formerly independent settlements have developed into places to go to in the evenings and at night: the traditional Barrio de Triana, which is on a level with the centre and, bordering on this area in the south, the more modern Barrio de los Remedios, which is on a level with the Parque de Maña Luisa.) The sentence themes 3 in this text constantly are ideational elements real ised as subject theme ((01) and (05)), subject complement (02), or circum stantials ((03), (04), (06), (07), and (08)). In terms of semantic categories, these themes are participants ((01), (05) and (02)), or circumstances (time & place) ((03), (04), (06), (07), and (08)). Before analysing the text in more detail, consider the thematic choice in another example. The second text is argumentative, a satirical article published in the commentary part of a German newspaper: Text 2: "Nostalgiekarte Jahrgang 1992" (Nostalgia map of the year 1992) (From: Saarbrücker Zeitung, December 14./15. 
1991, p.5) (01) So war die politische Geographie einmal zu fernen Zeiten. (This is how the political geography used to be a long time ago.) (02) Deutschland noch nicht vereint, (Germany not yet united,) (03) der Saar-Lor-Lux-Raum ein weißer Fleck auf der Landkarte. (the Saar— Lor-Lux region a blank area on the map.) (04) Zu fernen Zeiten? (A long time ago?) (05) Mitnichten!!! (Far from it!!!) (06) Die oben abgebildete Deutschlandkarte fin det sich im neuen Taschen-Terminkalender 1992 der Sparkasse Saarbrücken. (The map of Germany shown above is published in the new pocket diary 1992 of the savings bank of Saarbrücken.) (07) Dort hat man noch nicht mitbekommen, (There no-one has yet 3
We have not applied our theme analysis to dependent clauses since in most cases, theme in dependent clauses is more or less grammaticalised (typically. realised by elements such as conjunctions or wh-elements), i.e., there is no real choice regarding what can appear in theme position. For a few further remarks on theme in dependent clauses see Steiner & Ramm (1995:75ff).
252
WIEBKE RAMM
noticed (08) daß Deutschland um die Kleinigkeit von fünf neuen Bundesländern größer geworden ist. (that Germany has grown by the trifling amount of five new
Bundesländer.)
(09) Zudem gibt es jenseits der alten DDR-Grenze noch andere Städte als Leipzig und Berlin, so zum Beispiel Rostock, Dresden, Magdeburg oder Saarbrückens Partnerstadt Cottbus. (Moreover, there are still other cities beyond the former frontier to the GDR apart from Leipzig and Berlin, for example Rostock, Dresden, Magdeburg, or Cottbus, the twin city of Saarbrücken.)
(10) Außerdem scheint den Herren von der Sparkasse entgan
gen zu sein, (Besides, it seems that the gentlemen of the savings bank didn't realise) (11) daß am Ende des Terminello-Jahres 1992 der europäische Binnenmarkt steht. (that at the end of the Terminello-Year
1992 the Single European Market will come into force.)
(12) Nicht zuletzt vermittelt das Kärtchen den Eindruck, (Last but not least the little map suggests,) (13) daß Saarbrücken der Nabel (Alt-)Deutschlands zu sein scheint. (that Saarbrücken was the navel of (the former) Germany.)
(14) Je nach anatomischer Sichtweis,
kann es aber auch ein anderes Körperteil sein, (depending on the anatomical point of view, however, it might also refer to another part of the body.)
Here we have a clear priority of ideational themes in the first part of the text (propositions (02), (03), (06) and (07)), whereas the rest of the text is dominated by textual themes, as in (09), (10) and (12). The question is now, how theme selection in text is motivated and whether the differences between the two texts are typical for the respective text types they belong to. As Fries (1983) shows for English texts, different kinds of theme selection patterns correlate both with different text types or genres and are closely related to the subject matter or domain of the text. In particular, there is a close relation between thematic content, i.e., the semantic content of the themes of a text segment, and the method of development of a text which comprises general organisations such as spatial, temporal, general to specific, object to attribute, object to parts or compare and contrast. As also Danes (1974:113) points out, theme plays a decisive constructional role for building up the structure of a text. Note that the method of development is not the same as Danes' thematic progression: the former relates to the semantic content of the grammatical themes and the relations holding between the themes, whereas the latter refers to possible types of patterns built between themes and rhemes of a text. Turning back to our sample texts, the most characteristic feature of Text 1 is its reflection of the spatial organisation of the underlying domain. This is a typical property of many descriptive texts and in this case leads to the incremental creation of a cognitive map of the domain, 'centre of Seville'. The centrality of the domain structure for the construction of the mean ing of the text is mirrored in its linguistic appearance, also with respect to
thematic choice: all sentence themes in this text refer to spatial conceptualisations which are inherently ideational, with a clear difference regarding linguistic realisation between object concepts (realised semantically as participant themes, as in (01), (02) and (05)) and (spatial) relational concepts (realised as circumstance themes, as in (03), (04), (06), (07) and (08)). As a result, the sequence of concepts verbalised as themes allows the reader of the text to navigate through a cognitive map of the domain by keeping to a strict spatial method of development. In each of the clauses, the rhematic part (which includes the focus) elaborates on the specific spatial concept introduced as theme, i.e., adds certain attributes in the form of other spatial concepts in order to build up a spatial representation of the domain. What can be observed here is a typical 'division of labour' between theme and rheme, namely that the themes play the decisive constructional role by introducing new domain concepts, whereas the foci, contained in the rhemes, add new pieces of information.

In terms of subject matter, the second text basically deals with spatial information too, but here the domain is not the main factor responsible for the structuring of the text. In this case, the underlying main discourse purpose is not to inform the reader about some state of affairs, as in the descriptive example, but rather to argue in favour of some opinion taken by the author. This is clearly reflected in the linguistic structure of the text: propositions (01)-(06) represent the contra-argumentation, in the sense of providing the facts/arguments against which the author is going to argue. The task of this discourse segment is to present the background information on which the subsequent pro-argumentation ((07)-(14)), in which the author develops her/his opinion, is based. The different communicative functions of these two stages of the macro structure of the text are also reflected in the means deployed for local contextualisation (i.e., thematisation): ideational elements referring to relevant concepts of the domain predominate in the contra-argumentation, whereas textual elements are chosen to guide the local contextualisation in the pro-argumentation. In this second segment of the text, a sequence of conjunctive themes functions as the indicator of an (additive) argumentative chain formed by the rhematic parts of the respective sentences: 'zudem' (09), 'außerdem' (10), 'nicht zuletzt' (12). To sum up our text analyses, in the descriptive text we have found a clear, text-type-specific correlation between the structure of the domain and the method of development of the text (realised by ideational themes). The argumentative text, in contrast, exhibited two characteristic thematisation
strategies, one constructing the state of affairs under discussion and one supporting the chain of argumentation. So, what these sample analyses show is not only that text type and subject matter constrain theme options, but that the theme pattern is also sensitive to the individual stages of the macro structure (or generic structure) of a text.
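The theme-pattern contrast summarised above can be recorded as a simple clause-level annotation and tallied. This is a sketch of such an annotation; only the clauses explicitly labelled in the analyses are included.

```python
# Sketch: tallying theme types per sample text, following the clause-by-clause
# analyses above (Text 1: all themes ideational; Text 2: ideational themes in
# the contra-argumentation, textual themes in the pro-argumentation).
from collections import Counter

themes = {
    "Text 1": {f"(0{i})": "ideational" for i in range(1, 9)},
    "Text 2": {"(02)": "ideational", "(03)": "ideational",
               "(06)": "ideational", "(07)": "ideational",
               "(09)": "textual", "(10)": "textual", "(12)": "textual"},
}

for text, clauses in themes.items():
    print(text, dict(Counter(clauses.values())))
# Text 1 {'ideational': 8}
# Text 2 {'ideational': 4, 'textual': 3}
```

Even this crude tally makes the genre difference visible: the descriptive text is uniformly ideational in its themes, while the argumentative one mixes ideational and textual themes by stage.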
3 Theme selection as inter-stratal constraints
How can such types of correlations between discourse features and sentence-level realisation be accounted for? Correlations between the discourse characteristics of a text and lexico-grammatical features such as the ones illustrated in the previous section can be straightforwardly employed for generation in an architecture that recognises the information types of text type and subject matter as necessary constraints on the well-formedness of a text. One such architecture is implemented in the systemic-functionally oriented KOMET-PENMAN text generation system (cf. Teich & Bateman 1994; Bateman & Teich 1995), a German spin-off of the English PENMAN system (cf. Mann 1983). The system architecture reflects the stratificational organisation of the language system presupposed by systemic-functional theory, according to which a linguistic utterance is the result of a complex choice process which recursively selects among options provided by interconnected networks of semantic, grammatical and lexical choice systems associated with different levels of abstraction, strata, such as lexico-grammar, (sentence-)semantics, register and genre (cf. again Matthiessen & Bateman 1991 for an overview). Features of the text type are represented at the most abstract strata of genre and register (encoding the contexts of culture and situation). The typical structural configuration of the texts of a genre, i.e., their typical (global) syntagmatic organisation, is accounted for by representing their so-called generic structure potential (GSP) (cf. Hasan 1984). A GSP consists of those stages that must occur in the development of a text in order to classify it as belonging to that specific genre. These stages roughly correlate with what is called 'macrostructures' in other approaches (cf. van Dijk 1980). Linguistic resources at all strata are represented as system networks which constitute multiple inheritance hierarchies consisting of various linguistic types.
Proceeding from such an architecture, the correlation between text type and theme selection can be conceived of as a set of inter-stratal constraints between the global-level textual resource and the lexico-grammatical resource which is mediated via a semantic stratum of a local-level textual resource that abstracts from the purely grammatical distinctions provided by the grammar. The representation of such inter-stratal constraints follows the lines presented in Teich & Bateman (1994): At the level of genre, a typology of texts is modelled as a system network (based on Martin 1992:560ff.) which covers various descriptive, expository, narrative and argumentative types of texts. Typical GSP structures are associated with individual genres providing the guideline for syntagmatic realisation in the form of global discourse structures. Moreover, depending on the specific communicative functions pursued, either whole texts or single GSP stages are characterised by three metafunctionally distinct register parameters, namely field (referring to ideational properties, for instance, of the subject matter), tenor (describing the interpersonal relations among the participants in the discourse) and mode (the textual dimension, characterising the medium or channel of the language activity). Choices at the stratum of register have characteristic consequences on the lexico-grammatical level, i.e., lead to selections on the lower-level resources of the language system realising the higher ones by appropriate lexical and grammatical means.
Fig. 1: Theme selection as interstratal constraint

This architecture also gives room for modelling aspects of discourse constraints on thematisation such as those addressed in this paper: properties of the domain or subject matter can be accounted for by the choice
of appropriate field options at register level which are reflected at the ideational-semantic stratum as specific conceptual configurations (the domain model) with clear mappings defined for lexical and grammatical instantiation (covered by the ideational-semantic resource, the 'upper model', cf. Bateman et al. 1990). Global thematisation strategies have to be addressed at register level as well and are paradigmatically reflected on the individual GSP stages for which a certain method of development holds. The choice of a certain method of development for (a stage of) a text constrains the options at the textual-semantic⁴ and textual-grammatical level. For a (simplified) illustration of how this might work, for instance, with respect to a descriptive text with spatial method of development — say, a travel guide (or a section from it) — see Figure 1. Two kinds of operations support the control of thematisation: The realisation operation of preselection takes as arguments a function inserted at a higher stratum (e.g., a stage inserted in the discourse structure) and a feature of a system at a lower stratum (e.g., a feature of the SEMANTIC-THEME system (cf. Ramm et al. 1995)). In the figure, inter-stratal preselection is marked by the arrow between (1) and (2). The chooser/inquiry interface (Mann 1983) is used to interface lexico-grammar and semantics (denoted in Figure 1 by the arrow between (2) and (3)). Each system at the lexico-grammatical stratum is equipped with a number of inquiries that are organised in a decision tree (a chooser). The inquiries are implemented to access information from the higher adjacent stratum (here: the local-level textual resource).
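The interplay of preselection and a chooser can be caricatured in a few lines of Python. This is an illustrative sketch only: the stage, system and feature labels below are invented for the purpose of the example and are not taken from the actual generation system discussed here.

```python
# Toy rendering of inter-stratal preselection and a chooser (a small
# decision tree of inquiries). All labels are hypothetical.

# (1) -> (2): a register-level stage with a spatial method of development
# preselects a feature of the semantic-theme system at the stratum below.
PRESELECTIONS = {
    ("describe-location", "spatial"): "spatial-setting-theme",
}

# (2) -> (3): a chooser for a theme-type system inquires into the semantic
# theme and decides between a circumstance and a participant theme.
def theme_type_chooser(semantic_theme):
    if semantic_theme == "spatial-setting-theme":
        return "circumstance-theme"   # typically realised as a PP
    return "participant-theme"        # typically realised as an NP

feature = PRESELECTIONS[("describe-location", "spatial")]
print(theme_type_chooser(feature))  # -> circumstance-theme
```

In a real system the chooser would pose several chained inquiries rather than a single test, but the control flow is the same: higher strata constrain, lower strata realise.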
The inquiries of the chooser of the lexico-grammatical system THEME-TYPE, e.g., must be provided with information about semantic theme selection in order to decide whether to generate a circumstance (for instance, as a prepositional phrase) or a participant theme (e.g., as a nominal phrase).

4 Conclusions
What we have tried to illustrate in this paper is how discourse parameters such as text-type and subject-matter can affect thematisation as one of the grounding devices of language. We have described aspects of local and global thematisations in terms of a systemic-functionally oriented framework that also underlies an implementation in a text generation system. We have suggested to model correlations between text-level and sentence-level discourse features as interstratal constraints holding between different levels of the language system. The approach as it is now is certainly still limited, since the mechanisms currently deployed are quite strict and inflexible; they should be enhanced, for instance, by better micro-planning. However, although we could only very roughly sketch our ideas here, we feel that they could provide a step towards closing the generation gap between global and local text planning.

⁴ Due to lack of space, we cannot go into details regarding this stratum here. For its motivation and description, see Erich Steiner's contribution in Ramm et al. 1995:36ff.

Acknowledgements. Most of the research described in this paper was done in the context of the Esprit Basic Research Project 6665 DANDELION. I am grateful to Elke Teich for her extensive feedback and support both with previous versions of this paper and with the implementation. I would also like to thank Claudia Villiger for providing the text corpus and the analyses on which this work is grounded. Last but not least, thanks are due to Erich Steiner for helping with the English — with full responsibility for still existing weaknesses remaining with the author, of course.

REFERENCES

Bateman, John A., R. T. Kasper, J. D. Moore & R. A. Whitney. 1990. "A General Organization of Knowledge for Natural Language Processing: the PENMAN Upper Model". Technical Report (ISI/RS-90-192). Marina del Rey, Calif.: Information Sciences Institute, Univ. of Southern California.

Bateman, John A. & E. Teich. 1995. "Selective Information Presentation in an Integrated Publication System: an Application of Genre-Driven Text Generation". Information Processing and Management 31:5. 753-767.

Daneš, František. 1974. "Functional Sentence Perspective and the Organization of the Text". Papers on Functional Sentence Perspective ed. by F. Daneš, 106-128. Prague: Academia.

Fries, Peter H. 1983. "On the Status of Theme in English: Arguments from Discourse". Micro and Macro Connexity of Discourse ed. by J. S. Petöfi & E. Sözer (Papiere zur Textlinguistik 45), 116-152. Hamburg: Buske.

Halliday, Michael A. K. 1994.
An Introduction to Functional Grammar. 2nd edition. London: Edward Arnold.

Hasan, Ruqaiya. 1984. "The Nursery Tale as a Genre". Nottingham Linguistic Circular 13. 71-192.

Hoberg, Ursula. 1981. Die Wortstellung in der geschriebenen deutschen Gegenwartssprache. München: Hueber.

Lavid, Julia. 1994a. "Thematic Development in Texts". Deliverable R1.2.1, ESPRIT Project 6665 DANDELION. Madrid: Universidad Complutense de Madrid.
Lavid, Julia. 1994b. "Theme, Discourse Topic, and Information Structuring". Deliverable R1.2.2b, ESPRIT Project 6665 DANDELION. Madrid: Universidad Complutense de Madrid.
Mann, William C. 1983. "An Overview of the PENMAN Text Generation System". Proceedings of the National Conference on Artificial Intelligence (83), 261-265.

Martin, James R. 1992. English Text: System and Structure. Amsterdam & Philadelphia: John Benjamins.

Matthiessen, Christian M. I. M. 1988. "Semantics for a Systemic Grammar: the Chooser and Inquiry Framework". Linguistics in a Systemic Perspective ed. by J. D. Benson, M. Cummings & W. S. Greaves. Amsterdam & Philadelphia: John Benjamins.

Matthiessen, Christian M. I. M. & J. A. Bateman. 1991. Text Generation and Systemic-Functional Linguistics: Experiences from English and Japanese. London: Frances Pinter.

Matthiessen, Christian M. I. M. Forthcoming. Lexicogrammatical Cartography: English Systems. Technical Report, Dept. of Linguistics. Sydney: University of Sydney.

Meteer, Marie W. 1992. Expressibility and the Problem of Efficient Text Planning. London: Pinter.

Ramm, Wiebke, A. Rothkegel, E. Steiner & C. Villiger. 1995. "Discourse Grammar for German". Deliverable R2.3.2, ESPRIT Project 6665 DANDELION. Saarbrücken: University of the Saarland.

Steiner, Erich & W. Ramm. 1995. "On Theme as a Grammatical Notion for German". Functions of Language 2:1. 57-93.

Teich, Elke & J. A. Bateman. 1994. "Towards the Application of Text Generation in an Integrated Publication System". Proceedings of the 7th International Workshop on Natural Language Generation, 153-162. Kennebunkport, Maine.

van Dijk, Teun A. 1980. Macrostructures: An Interdisciplinary Study of Global Structures in Discourse, Interaction and Cognition. Hillsdale, New Jersey: Erlbaum.

Villiger, Claudia. 1995. "Theme, Discourse Topic, and Information Structuring in German Texts". Deliverable R1.2.2c, ESPRIT Project 6665 DANDELION. Saarbrücken: University of the Saarland.
Discerning Relevant Information in Discourses Using TFA

GEERT-JAN M. KRUIJFF¹ & JAN SCHAAKE
University of Twente

Abstract
When taking the stance that discourses are intended to convey information, it becomes important to recognise the relevant information when processing a discourse. A way to analyse a discourse with regard to the information expressed in it is to observe the Topic-Focus Articulation. In order to distinguish relevant information in particularly a turn in a dialogue, we attempt to establish the way in which the topics and foci of that turn are structured into a story line. In this paper we shall come to specifying the way in which the information structure of a turn can be recognised, and what relevant information means in this context.

1 Introduction
Discourses, whether written or spoken, are intended to convey information. Obviously, it is important to the processing of discourses that one is able to recognise the information that is relevant. The need for a criterion for relevance of information arises out of the idea of developing a tool assisting in the extraction of definitions from philosophical discourses (PAPER/HCRAES projects). A way to analyse a discourse with regard to the information expressed in it is to observe the Topic-Focus Articulation. A topic of (part of) a discourse can be conceived of as already available information, to which more information is added by means of one or more foci. Several topics and foci of a discourse are organised in certain structures, characterised by a thematical progression ('story-line'). The theories about TFA and thematic progression have been developed by the Prague School of Linguistics. Particularised to our purposes, in order to discern the relevant information in a discourse, we try to establish the thematic progression(s) in a turn of a dialogue. It will turn out that it is important, not only how topics and foci relate to each other with regard to the thematic progression (sequentially, parallelly, etc.), but also how the topics and foci are related rhetorically (e.g. by negation). In this paper we shall come to specifying the way in which
¹ Currently at the Dept. of Mathematics and Physics, Charles University, Prague.
the information structure of a turn can be recognised, and what relevant information means in this context. In order to develop and to test these definitions we regarded it necessary to choose a domain of small texts where discerning relevant information is also needed. This domain we found in the SCHISMA project. The SCHISMA project is devoted to the development of a theatre information and booking system. One of the problems to be met in analysing dialogues is to discern what exactly is or are the point(s) made in a turn of the client. As we will see below, in one turn a client may make just one relevant remark, the rest being noise or background information that is not relevant to the system. It may also be the case that two or more relevant points are made in just one turn. These points have to be discerned as being both relevant. Throughout the paper examples of the occurrence of relevant information in a turn will be given. In sections 2 and 3, Thematic Progression and Rhetorical Structure Theory will be applied to dialogues taken from the SCHISMA corpus. In section 4, relevant information will be related to what will be called generic tasks: tasks that perform a small function centred around the goal of acquiring a specific piece of information (Chandrasekaran 1986). Conclusions will be drawn in the final section.

2 The communication of information
Surely, it might almost sound like a commonplace that a dialogue conveys, or communicates, information². But what can we say about the exact features of such communication? If we want a logical theory of information to be of any use, we should elucidate how we arrive at the information we express in information states (Van der Hoeven et al. 1994). Such elucidation is the issue of the current section. The assumption we make about the dialogues to be considered is that they are coherent. Rather than being a set of utterances bearing no relation to each other, a dialogue, by the assumption, should have a 'story line'. For example, the utterances can therein be related by referring to a common topic, or by elaborating a little further upon a topic that was previously introduced. More formally, we shall consider utterances to be constituted of a Topic and Focus pair. The Topic of an utterance stands for given information, while the Focus of an utterance stands for new information.
² Supposing that the dialogue is meant to be purposeful, of course. Otherwise, it is called 'parasitic' with respect to communicative dialogues (cf. Habermas).
The theory of the articulation of Topic and Focus (TFA) has been developed by members of the Modern Prague School, notably by Hajičová (Hajičová 1993; Hajičová 1994). Consequently, the 'story line' of a dialogue becomes describable in terms of relations between Topics and Foci. The communication of information thus is describable in terms of how given information is used and new information is provided. The relations between Topics and Foci may be conceived of in two ways, basically: thematically, and rhetorically. The thematical way concerns basically the coreferential aspect, while the rhetorical way concerns the functional relationship between portions of a discourse. Let us therefore have a closer look at each of these ways, and how they are related to each other. First, the relations between Topics and Foci can be examined at the level of individual utterances. In that case we shall speak of thematic relations, elucidating the thematic progression. Thematic progression is a term introduced in (Daneš 1979) as a means to analyse the thematic build-up of texts. We shall use it here in the analysis of the manner in which given and new information are bound to each other by utterances in a dialogue. According to Daneš, there are three possibilities in which Topics and Foci are bindable, which are described as follows:

1. Sequential progression: The Focus of utterance m, Fm, is constitutive for the Topic of a (the) next utterance n, Tn. Diagrammatically:
2. Parallel progression: The Topic of utterance m, Tm, bears much similarity to the Topic of a (the) next utterance n, Tn. Diagrammatically:
3. Hypertheme progression: The Topic of utterance m, Tm, as well as the Topic of utterance n, Tn, refer to an overall Topic called the Hypertheme, TH. Utterances m and n are said to be related hyperthematically. Diagrammatically:
The following sentences are examples of these different kinds of progression:

(1) The brand of GJ's car is Trabant. The Trabant has a two-stroke engine.
(2) Trabis are famous for their funny motor-sound. Trabis are also well-known for the blue clouds they puff.
(3) Being a car for the whole family, the Trabant has several interesting features. One feature is that about every person can repair it. Another feature is that a child's finger-paint can easily enhance the permanent outlook of the car.

It might be tempting to try to determine the kind of thematic progression between utterances by merely looking at the predicates and entities involved. In other words, directly in terms of information states. Especially sentences like (1) and (2) tend to underline such a standpoint. However, consider the following revision of (1), named (1'):

(1') GJ has a Trabant. The motor is a cute two-stroke engine.

Similar to (1) we would like to regard (1') as a sequential progression. Yet, if we would consider only predicates and entities, we would not be able to arrive at that preferred interpretation. It is for that reason that we propose to determine the kind of thematic progression obtaining between two utterances as follows. Instead of discerning whether the predicates and entities of a Topic Tm or a Focus Fm are the same as those of a Topic Tn, we want to establish whether Fm or Tm and Tn are coreferring. We take coreference to mean that two expressions, E1 and E2, a) are referring to the same concept, or b) are referring to a conceptual structure, where E1 is referring to a concept CE1 which is the parent of a concept CE2, to which E2 is referring. Hence, the following relations hold³:

1. Fm and Tn are coreferring → sequential progression
2. Tm and Tn are coreferring → parallel progression
3. TH, Tm and Tn are coreferring → hypertheme progression

By identifying a coreference obtaining between a focus or topic and a subsequent topic, we conclude that such a pair has the same intensional content — they are about the same concept. Under the assumption that a concept is only instantiated once in a turn, we could even conclude further here that
³ The presented ideas about thematic progression and coreference result from discussions between Geert-Jan Kruijff and Ivana Korbayová.
the focus or topic and subsequent topic have the same referential content — they refer to the same instantiation of the concept at hand. Clearly, if we would lift the assumption of single instantiation, it would be necessary to establish whether the instantiations of the concept employed in the expressions are identical.

3 Rhetorical structure of turns
For our purposes we establish the thematic progression between a number of utterances making up a single turn in a dialogue. As we already noted, utterances can also be related rhetorically besides thematically. Whereas the thematic progression shows us how information is being communicated by individual utterances, the rhetorical structure elucidates how parts of the communicated information function in relation to other parts of information communicated within the same turn. In other words, the rhetorical structure considers the function of the information communicated by clusters of one or more utterances of a single turn. Such clusters will be called segments hereafter. When performing an analysis in order to explicate the rhetorical structure, we make use of Mann and Thompson's Rhetorical Structure Theory (RST) as laid down in Mann & Thompson (1987). Basically, RST enables us to structure a turn into separate segments that are functionally related to each other by means of so-called rhetorical relations. What is important is that rhetorical relations hold between segments, and that each segment in a rhetorical relation has an import relative to the other segment(s). Basically, two kinds can therein be distinguished: a nucleus N, and a satellite S. The distinction between them can be pointed out as follows. A nucleus is defined as a segment that serves as the focus of attention. A satellite is a segment that gains its significance through a nucleus. The concept of nuclearity is important to us: We would still have a coherent dialogue if we would consider the nuclei only. In our understanding, nuclearity is thus an expressive source that directs the response to a turn of a dialogue. Examples of such rhetorical relations are:

(4) Segment S is evidence for segment N:
(N) The engine of my car works really well nowadays.
(S) It started yesterday within one minute.
(5) Segment S provides background for segment N:
(S) I spend a significant part of the year in Prague.
(N) Nowadays, I am the proud owner of a Trabant.
(6)
Segment S is a justification for segment N:
(S) When parking a little carelessly, I broke one of the rear lights.
(N) I should buy a new rear light.

A study of a corpus of dialogues we have gathered reveals that within our domain the following rhetorical relations are of importance:
1. Solutionhood: S provides the solution for N; "Yes, but grandma is a little cripple, so, well, then we'll go with the two of us."
2. Background: S provides background for N; "I would like to go to an opera. Is there one on Saturday?"
3. Conditional: S is a condition for N; "If the first row is right opposite to the stage, then the first row, please."
4. Elaboration: S elaborates on N; "I would like to go to Wittgenstein, because he was really entertaining last time."
5. Restatement: S restates or summarises N; "So I have made a reservation for ..."
6. Contrast: Several N's are contrasted; "I would like to, but my friend does not. So, then we'd better not go to an opera; can we go to another performance?"
7. Joint: Several N's are joined; "How expensive would that be, and are there still vacant seats?"

In case of rhetorical relations 1 through 3 the S is uttered after N, while in case of relations 4 through 5 S is uttered before N. Relations 6 and 7 are constituted by multiple nuclei. These orders are called canonical orders. Revisiting the thematic and rhetorical structure of a turn in a dialogue, we observe the following. The established thematic progression elucidates the actual flow of communicated information. Therein, we can observe which utterances convey what information. The rhetorical structure clarifies how information expressed by nuclei and satellites is functionally related. Clearly, the question that might be raised subsequently is: How does the segmentation of a turn into nuclei and satellites arise from the thematic progression?
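The relation inventory and its canonical orders can be encoded directly as a lookup table. The following Python sketch is ours, not part of the system described here; it simply transcribes the list above:

```python
# Rhetorical relations of the theatre-booking domain with their canonical
# orders: "S-after-N" = satellite follows nucleus, "S-before-N" = satellite
# precedes nucleus; multinuclear relations have nuclei only.
CANONICAL_ORDER = {
    "solutionhood": "S-after-N",
    "background":   "S-after-N",
    "conditional":  "S-after-N",
    "elaboration":  "S-before-N",
    "restatement":  "S-before-N",
    "contrast":     "multinuclear",
    "joint":        "multinuclear",
}

def satellite_position(relation):
    """Where the satellite sits relative to the nucleus, per canonical order."""
    return CANONICAL_ORDER[relation]

print(satellite_position("background"))  # -> S-after-N
```

Once a relation has been recognised, this table is what lets the analysis read off which segment is the nucleus and which the satellite.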
To answer the question, we should realise that we are actually dealing with three smaller problems: First, how does a thematic progression segment a turn? A thematic progression divides a turn into discernible segments according to the flow of information. Intuitively, one might say that every time a new flow of information is commenced, a new segment
is introduced. As we shall see in the example provided below, this means in general that when a parallel progression or hypertheme progression is invoked, a new segment starts. Second, how do we recognise the rhetorical relations involved? Mann and Thompson describe how rhetorical relations can be recognised by means of conditions (or constraints) that should hold for the textual structure. We conjecture that, in terms of our approach, rhetorical relations can be recognised by taking the thematic progression and the formed conceptual structure into account. Rephrased, rhetorical relations are conditioned by the thematic progression and the conceptual structure involved. Once the rhetorical relation has been recognised, the third problem of recognising nuclei and satellites is also solved (as Mann and Thompson state), for their characterisations follow inter alia from the canonical order of each rhetorical relation.

4 An example
Here, we provide an example analysis of a turn into thematic progression and ensuing rhetorical structure. As will become obvious from the example, recognising the thematic progression as well as the rhetorical structure enables us to observe which parts of a turn are to be considered as relevant. The issue of discerning relevance will be elaborated upon in the next section.

(7) For Wittgenstein tonight it is, yes. For four persons is fine. But the other one doesn't know. And because it is his birthday we would like to have our picture taken. Can you ask that too? Oh yes, and my husband would like to join us for dinner if that would be possible. No foreign stuff. So that is for three. Are you also in charge of the food?
Assuming that we have decent means to analyse the dialogue linguistically, let us commence with discerning the thematic progression. The schema displays sequential progressions (∥seq) and parallel progressions (∥par) — see Figure 1. T3 and T3 refer hyperthematically to F3, being "(members of) the group that is going to the performance", but we shall not consider such in the case at hand. More interesting to observe is that the thematic progression quite naturally segments the turn of the dialogue, as we conjectured. Let us call the three segments ST1, ST3 and ST6, the subscript denoting the Topic that initiates the segment. Subsequently, the segments can be said — quite uncontroversially, hopefully — to be rhetorically related as shown in Figure 2.
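The coreference-based classification given earlier can be sketched in a few lines of Python. This is our own toy rendition under stated assumptions: utterances are (topic, focus) pairs of concept identifiers, the concept names and the parent table are invented, and the equality-plus-parent test stands in for a real coreference check over a conceptual hierarchy.

```python
# Toy classifier for the thematic progression between two utterances,
# following the rules: coreferring Fm/Tn -> sequential, Tm/Tn -> parallel,
# TH/Tm/Tn -> hypertheme. All concept ids are illustrative.
CONCEPT_PARENT = {
    "feature-repairable": "trabant-features",
    "feature-paint":      "trabant-features",
}

def corefer(e1, e2):
    """Same concept, or e1 is the parent of the concept e2 refers to."""
    if e1 is None or e2 is None:
        return False
    return e1 == e2 or CONCEPT_PARENT.get(e2) == e1

def progression(prev, cur, hypertheme=None):
    prev_topic, prev_focus = prev
    cur_topic, _ = cur
    if corefer(prev_focus, cur_topic):
        return "sequential"
    if corefer(prev_topic, cur_topic):
        return "parallel"
    if corefer(hypertheme, prev_topic) and corefer(hypertheme, cur_topic):
        return "hypertheme"
    return None  # no progression found: a new segment starts here

# The first step of (7): F1 [Wittgenstein tonight] feeds the next topic.
print(progression(("it", "wittgenstein-tonight"),
                  ("wittgenstein-tonight", "four-persons")))  # -> sequential
```

Returning `None` when no rule fires corresponds to the segmentation intuition above: a break in the flow of information opens a new segment.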
[Fig. 1 diagram, garbled in the source; recoverable elements: T1 [it] → F1 [Wittgenstein tonight] ∥seq; T2 (ellipsis) → F3 [four persons]; F3 [the other one] ∥seq; T4 [his birthday] → F4 [picture] ∥seq; T5 Question; T6 [husband] → F6 [to join for dinner] ∥seq; T7 (ellipsis) → F7 [foreign stuff]; T8 (dinner) ∥par → F8 [three (persons)]; T9 Question]
Fig. 1: Thematic progression in (7)

ST1 ←[elaboration]→ ST3
ST6 ←[elaboration]→ ST3
Fig. 2: Segments rhetorically related (a)

Using the canonical order noted earlier, we can consequently determine the nuclei and satellites and construct the following hierarchical organisation (see Figure 3). Apparently, it suffices to maintain only the nucleus ST1 and still have a coherent and justly purposeful dialogue. As we stated already, the concept of nuclearity is important to us. It directs the response to the turn of the dialogue, which in this case could for example be that there is no performance by Wittgenstein tonight at all.

5 Discerning relevant information
The current section will explain the fashion in which we discern relevant information in a dialogue, thereby building on the previous section. First and foremost we should then clarify what we understand by relevance.

ST1 = Nucleus
  | [elaboration]
ST3 = Satellite wrt ST1 / Nucleus wrt ST6
  | [elaboration]
ST6 = Nucleus
Fig. 3: Segments rhetorically related (b)

When we state that a particular piece of information is relevant, we mean that it is relevant from a certain point of view. We do not want to take all the information that is provided into consideration. Rather, we are looking for information that fits our purposes. And what are these purposes? Recall the discussion above, where the concept of generic tasks was introduced. Generic tasks were presented as units to carry out simple tasks, units which could be combined into an overall structure that would remain flexible due to the functional individuality of the simple tasks. These generic tasks are our 'purposes'. More specifically, when carrying out a generic task, we look among the nuclei found in the rhetorical structure for one that presents us with the information that we need for performing the task at hand. In other words, such a nucleus presents us with relevant information. For example, when carrying out the task IDENTIFY_PERFORMANCE, the following information is of importance to uniquely identify a performance: a) the name of the entertainer, the performing group, or the performance itself; b) the day (and if more performances on one day, also the time). Obviously, the nucleus ST1 is highly relevant to this task, for it provides us with both ENTERTAINER_NAME as well as PERFORMANCE_DAY. Interesting to note is that once we have such information, a proper response can be generated by the dialogue manager. For example, the system could respond that there is no performance by the entertainer on the mentioned day, or ask (in case of several performances on the same day) whether one would like to go in the afternoon or in the evening. Furthermore, things also work the other way around. As we noted earlier, a nucleus directs response. Therefore, a nucleus should also be regarded as a possibility to initiate the execution of a particular generic task.
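Matching nuclei against generic tasks can be as simple as set intersection. The sketch below is a hedged illustration: the task inventory and concept type names are made up after the IDENTIFY_PERFORMANCE example above, and are not the actual system's task definitions.

```python
# Each generic task names the concept types it needs; a nucleus is relevant
# to a task when the concepts it supplies overlap with those needs.
GENERIC_TASKS = {
    "IDENTIFY_PERFORMANCE": {"ENTERTAINER_NAME", "PERFORMANCE_DAY"},
    "BOOK_SEATS":           {"NUMBER_OF_PERSONS"},
}

def relevant_tasks(nucleus_concepts):
    """Generic tasks that the concepts of a nucleus could feed or initiate."""
    return {task for task, needed in GENERIC_TASKS.items()
            if needed & nucleus_concepts}

# Nucleus ST1 of example (7) supplies both the entertainer and the day:
print(relevant_tasks({"ENTERTAINER_NAME", "PERFORMANCE_DAY"}))
```

The same function captures both directions discussed in the text: concepts found in a nucleus either feed a task already under way or pick out a new task to initiate.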
This requires the following assumptions, though. First of all, a linguistic analysis should provide us with the concepts that are related to words or word-groups. Observe that this assumption has been made already above. Second, from each generic task it should be known which concepts are involved in the performance of that task, thus, what kinds of information it gathers. It basically boils down to the following, then: if we know the concepts involved, we should be able to identify the generic task that should be initiated to respond properly to the user. It is realistic to assume that, based on all the information the user provides, several generic tasks might be invoked. Such tasks should then be placed in an order that would appear natural to the user. We must note, though, that it will not be the case that different generic tasks will be invoked based on identical information. Each generic task is functionally independent and has a simple goal, and as such works with information that is not relevant to other generic tasks. Recapitulating, we perceive of relevance in terms of information that is needed for the performance of tasks that are functionally independent and have simple goals: the so-called generic tasks. Based on the thematic progression and the rhetorical structure, we look for information in the nuclei that we have identified. If the information found is needed for a task that is currently being carried out, or if it can be used to initiate a new task, then we consider the information to be relevant information. Clearly, our system thereby no longer organises its responses strictly according to prefixed scripts nor strictly according to a recognition of the user's intentions. Due to our use of generic tasks, integrated with our understanding of relevant information, our system carries out its tasks corresponding to the way the user provides it with information. Thus, the system is able to respond more flexibly as well as more naturally to the user.

6 Conclusions
In this paper we stated that the information we are basically interested in is relevant information, and we provided the means by which one can arrive at relevant information. For that purpose, we discussed the Prague concepts of Topic and Focus Articulation (TFA) and thematic progression, the structure in which Topics and Foci get organised. Subsequently, we examined rhetorical structures in the light of Rhetorical Structure Theory, and showed how the rhetorical structure of a turn builds forth upon the turn's thematic progression. We identified genuine nuclei in a rhetorical structure to be
potential providers of relevant information, that is, information that a currently running generic task would need or that could initiate a generic task. We closed our discussion by noting how this leads to a system that is capable of responding to a user in a flexible and natural way. A couple of concluding remarks can be made. First of all, in the discussion we do not treat thematic progressions spanning over more than one turn. Currently, thematic progressions and thus rhetorical structures are bound to single turns of a dialogue. We intend to lift this restriction after examining how we can completely integrate our logical theory of information with the views presented here. Second, we would like to elaborate on how the mechanisms described here would fit into a dialogue manager that parses dialogues on the level of generic tasks. Regarding the segmentation of discourses and its relation to the dynamics of the communication of information, a topic for further research could be to compare our point of view to that of Firbas' Communicative Dynamism as described in (Firbas 1992).

REFERENCES

Chandrasekaran, B. 1986. "Generic Tasks in Knowledge-Based Reasoning: High-Level Building Blocks for Expert System Design". IEEE Expert.

Daneš, František. "Functional Sentence Perspective and the Organisation of Text". Papers on Functional Sentence Perspective ed. by F. Daneš. Prague: Academia.

Firbas, Jan. 1992. Functional Sentence Perspective in Written and Spoken Communication. Cambridge: Cambridge University Press.

Hajičová, Eva. 1993. From the Topic/Focus Articulation of the Sentence to Discourse Patterns. Vilém Mathesius Courses in Linguistics and Semiotics, Prague.

Hajičová, Eva. 1993. Issues of Sentence Structure and Discourse Patterns. Prague: Charles University.

Hajičová, Eva. 1994. Topic/Focus Articulation and Its Semantic Relevance. Vilém Mathesius Courses in Linguistics and Semiotics, Prague.

Hajičová, Eva. 1994. "Topic/Focus and Related Research".
Prague School of Structural and Functional Linguistics ed. by Philip A. Luelsdorff, 245-275. Amsterdam & Philadelphia: John Benjamins.
Van der Hoeven, G.F., T.A. Andernach, S.P. van de Burgt, G.J.M. Kruijff, A. Nijholt, J. Schaake & F.M.G. de Jong. 1994. "SCHISMA: A Natural Language Accessible Theatre Information and Booking System". TWLT 8:
Speech and Language Engineering ed. by L. Boves & A. Nijholt, 137-149. Enschede: Twente University.
Mann, William C. & Thompson, Sandra A. 1987. Rhetorical Structure Theory: A Theory of Text Organisation. Reprint, Marina del Rey, Calif.: Information Sciences Institute.
IV GENERATION
Approximate Chart Generation from Non-Hierarchical Representations NICOLAS NICOLOV, CHRIS MELLISH & GRAEME RITCHIE
Dept. of Artificial Intelligence, University of Edinburgh
Abstract
This paper presents a technique for sentence generation. We argue that the input to generators should have a non-hierarchical nature. This allows us to investigate a more general version of the sentence generation problem where one is not pre-committed to a choice of the syntactically prominent elements in the initial semantics. We also consider that a generator can happen to convey more (or less) information than is originally specified in its semantic input. In order to constrain this approximate matching of the input we impose additional restrictions on the semantics of the generated sentence. Our technique provides flexibility to address cases where the entire input cannot be precisely expressed in a single sentence. Thus the generator does not rely on the strategic component having linguistic knowledge. We show clearly how the semantic structure is declaratively related to linguistically motivated syntactic representation. We also discuss a semantic-indexed memoing technique for non-deterministic, backtracking generators.
1 Introduction
Natural language generation is the process of realising communicative intentions as text (or speech). The generation task is standardly broken down into the following processes: content determination (what is the meaning to be conveyed), sentence planning1 (chunking the meaning into sentence-sized units, choosing words), surface realisation (determining the syntactic structure), morphology (inflection of words), and synthesising speech or formatting the text output. In this paper we address aspects of sentence planning (how content words are chosen but not how the semantics is chunked in units realisable as sentences) and surface realisation (how syntactic structures are computed). We thus discuss what in the literature is sometimes referred to as tactical generation, that is "how to say it" — as opposed to strategic generation
1 Note that this does not involve planning mechanisms!
— "what to say". We look at ways of realising a non-hierarchical semantic representation as a sentence, and explore the interactions between syntax and semantics. Before giving a more detailed description of our proposals, we first motivate the non-hierarchical nature of the input for sentence generators and review some approaches to generation from non-hierarchical representations — semantic networks (Section 2). We proceed with some background about the grammatical framework we will employ — D-Tree Grammars (Section 3) — and after describing the knowledge sources available to the generator (Section 4) we present the generation algorithm (Section 5). This is followed by a step-by-step illustration of the generation of one sentence (Section 6). We then discuss further semantic aspects of the generation (Section 7), the memoing technique used by the generator (Section 8) and the implementation (Section 9). We conclude with a discussion of some issues related to the proposed technique (Section 10).
2 Generation from non-hierarchical representations
The input for generation systems varies radically from system to system. Many generators expect their input to be cast in a tree-like notation which enables the actual systems to assume that nodes higher in the semantic structure are more prominent than lower nodes. The semantic representations used are variations of a predicate with its arguments. The predicate is realised as the main verb of the sentence and the arguments are realised as complements of the main verb — thus the control information is to a large extent encoded in the tree-like semantic structure. Unfortunately, such dominance relationships between nodes in the semantics often stem from language considerations and are not always preserved across languages. Moreover, if the semantic input comes from other applications, it is hard for these applications to determine the most prominent concepts because linguistic knowledge is crucial for this task. The tree-like semantics assumption leads to simplifications which reduce the paraphrasing power of the generator (especially in the context of multilingual generation).2 In contrast, the use of a non-hierarchical representation for the underlying semantics allows the input to contain as few language commitments as possible and makes it possible to address the generation strategy from an unbiased position. We have chosen a particular type of non-hierarchical knowledge representation formalism, conceptual graphs (Sowa 1992), to represent the input to
2 The tree-like semantics imposes some restrictions which the language may not support.
our generator. This has the added advantage that the representation has well-defined deductive mechanisms. A graph is a set of concepts connected with relations. The types of the concepts and the relations form generalisation lattices which also help define a subsumption relation between graphs. Graphs can also be embedded within one another. The counterpart of the unification operation for conceptual graphs is maximal join (which is non-deterministic). Figure 1 shows a simple conceptual graph which does not have cycles. The arrows of the conceptual relations indicate the domain and range of the relation and do not impose a dominance relationship.
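The graph machinery just described can be sketched in a few lines. The following is a hypothetical Python rendering, not the authors' implementation; the type names, the parent-map stand-in for the generalisation lattice, and the AGNT/MANR relation labels are our assumptions.

```python
# Hypothetical sketch of a conceptual graph: typed concept nodes connected
# by relations whose arcs mark domain/range but impose no dominance.
# A simple parent map stands in for the generalisation lattice.

TYPE_PARENT = {"MAN": "PERSON", "PERSON": "ENTITY", "LIMP": "MOVE",
               "MOVE": "ACT", "QUICK": "MANNER"}

def subsumes(general, specific):
    """True if type `general` equals or is a generalisation of `specific`."""
    while specific is not None:
        if specific == general:
            return True
        specific = TYPE_PARENT.get(specific)
    return False

class ConceptualGraph:
    def __init__(self):
        self.concepts = {}      # concept id -> type label
        self.relations = []     # (relation label, domain id, range id)

    def add_concept(self, cid, ctype):
        self.concepts[cid] = ctype

    def relate(self, rel, dom, rng):
        self.relations.append((rel, dom, rng))

# Roughly the graph underlying "Fred limped quickly" (cf. Figure 1):
g = ConceptualGraph()
g.add_concept("c1", "PERSON")   # Fred
g.add_concept("c2", "LIMP")
g.add_concept("c3", "QUICK")
g.relate("AGNT", "c2", "c1")    # agent of the limping
g.relate("MANR", "c2", "c3")    # manner of the limping
```

The point of the sketch is only that the relations carry no dominance: nothing in `g.relations` makes `c2` "higher" than `c1` or `c3`.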
Fig. 1: A simple conceptual graph
The use of semantic networks in generation is not new (Simmons & Slocum 1972; Shapiro 1982). Two main approaches have been employed for generation from semantic networks: utterance path traversal and incremental consumption.3 An utterance path is the sequence of nodes and arcs that are traversed in the process of mapping a graph to a sentence. Generation is performed by finding a cyclic path in the graph which visits each node at least once. If a node is visited more than once, grammar rules determine when and how much of its content will be uttered (Sowa 1984). It is not surprising that the early approaches to generation from semantic networks employed the notion of an utterance path — the then popular grammatical framework (Augmented Transition Networks) also involved a notion of path traversal. The utterance path approach imposes unnecessary restrictions on the resources (i.e., that the generator can look at a limited portion of the input — usually the concepts of a single relation); this imposes a local view of the generation process. In addition, a directionality of processing is introduced which is difficult to motivate; sometimes linguistic knowledge is used to traverse the network (adverbs of manner are to be visited before adverbs of time); finally, stating the relation between syntax and semantics involves the notion of how many times a concept has been visited.
3 Here the incremental consumption approach does not refer to incremental generation!
Under the second approach, that of incremental consumption, generation is done by gradually relating (consuming) pieces of the input semantics to linguistic structure (Boyer & Lapalme 1985; Nogier 1991). Such covering of the semantic structure avoids some of the limitations of the utterance path approach and is also the general mechanism we have adopted (we do not rely on the directionality of the conceptual relations per se — the primitive operation that we use when consuming pieces of the input semantics is maximal join, which is akin to pattern matching). The borderline between the two paradigms is not clear-cut. Some researchers (Smith et al. 1994) are looking at finding an appropriate sequence of expansions of concepts and reductions of subparts of the semantic network until all concepts have realisations in the language. Others assume all concepts are expressible and try to substitute syntactic relations for conceptual relations (Antonacci 1992). Other work addressing surface realisation from semantic networks includes: generation using Meaning-Text Theory (Iordanskaja 1991), generation using the SNEPS representation formalism (Shapiro 1989), and generation from conceptual dependency graphs (van Rijn 1991). Among those that have looked at generation with conceptual graphs are: generation using Lexical Conceptual Grammar (Oh et al. 1992), and generating from CGs using categorial grammar in the domain of technical documentation (Svenberg 1994). This work improves on existing generation approaches in the following respects: (i) Unlike the majority of generators this one takes a non-hierarchical (logically well-defined) semantic representation as its input.
This allows us to look at a more general version of the realisation problem which in turn has direct ramifications for the increased paraphrasing power and usability of the generator; (ii) Following Nogier & Zock (1992), we take the view that lexical choice is essentially (pattern) matching, but unlike them we assume that the meaning representation may not be entirely consumed at the end of the generation process. Our generator uses a notion of approximate matching and can happen to convey more (or less) information than is originally specified in its semantic input. We have a principled way to constrain this. We build the corresponding semantics of the generated sentence and aim for it to be as close as possible to the input semantics. (i) and (ii) thus allow for the input to come from a module that need not have linguistic knowledge; (iii) We show how the semantics is systematically related to syntactic structures in a declarative framework. Alternative processing strategies using the same knowledge sources can therefore be envisaged.
3 D-Tree Grammars
Our generator uses a particular syntactic theory — D-Tree Grammar (DTG) — which we briefly introduce because the generation strategy is influenced by the linguistic structures and the operations on them. D-Tree Grammar (DTG) (Rambow, Vijay-Shanker & Weir 1995) is a new grammar formalism (also in the mathematical sense), which arises from work on Tree-Adjoining Grammars (TAG) (Joshi 1987).4 In the context of generation, TAGs have been used in a number of systems: MUMBLE (McDonald & Pustejovsky 1985), SPOKESMAN (Meteer 1990), WIP (Wahlster et al. 1991), the system reported by McCoy (1992), the first version of PROTECTOR5 (Nicolov, Mellish & Ritchie 1995), and recently SPUD (by Stone & Doran). TAGs have been given a prominent place in the VERBMOBIL project — they have been chosen to be the framework for the generation module (Caspari & Schmid 1994; Harbusch et al. 1994). In the area of grammar development TAG has been the basis of one of the largest grammars developed for English (Doran 1994). Unlike TAGs, DTGs provide a uniform treatment of complementation and modification at the syntactic level. DTGs are seen as attractive for generation because a close match between semantic and syntactic operations leads to simplifications in the overall generation architecture. DTGs try to overcome the problems associated with TAGs while remaining faithful to what is seen as the key advantages of TAGs (Joshi 1987): 1. the extended domain of locality over which syntactic dependencies are stated; and 2. function argument structure is captured within a single initial construction in the grammar. DTG assumes the existence of elementary structures and uses two operations to form larger structures from smaller ones. The elementary structures are tree descriptions6 which are trees in which nodes are linked with two types of links: domination links (d-links) and immediate domination links (i-links) expressing (reflexive) domination and immediate domination relations between nodes.
Graphically we will use a dashed line to indicate a d-link (see Figure 2). D-trees allow us to view the operations for composing trees as monotonic. The two combination operations that DTG uses are subsertion and sister-adjunction.
4 DTG and TAG are very similar, yet they are not equivalent (Weir p.c.).
5 PROTECTOR is the generation system described in this paper.
6 They are called d-trees, hence the name of the formalism.
Fig. 2: Subsertion
Subsertion. When a d-tree α is subserted into another d-tree β, a component7 of α is substituted at a frontier nonterminal node (a substitution node) of β and all components of α that are above the substituted component are inserted into d-links above the substituted node or placed above the root node of β (see Figure 2). It is possible for components above the substituted node to drift arbitrarily far up the d-tree and distribute themselves within domination links, or above the root, in any way that is compatible with the domination relationships present in the substituted d-tree. In order to constrain the way in which the non-substituted components can be interspersed, DTG uses subsertion-insertion constraints which explicitly specify what components from what trees can appear within certain d-links. Subsertion as defined is a non-deterministic operation. Subsertion can model both adjunction and substitution in TAG.
Fig. 3: Sister-adjunction
Sister-adjunction. When a d-tree α is sister-adjoined at a node η in a d-tree β, the composed d-tree γ results from the addition to β of α as a new leftmost or rightmost sub-d-tree below η. Sister-adjunction involves the addition of exactly one new immediate domination link. In addition, several sister-adjunctions can occur at the same node. Sister-adjoining constraints associated with nodes in the d-trees specify which other d-trees can be sister-adjoined at this node and whether they will be right- or left-sister-adjoined. For more details on DTGs see (Rambow, Vijay-Shanker & Weir 1995a) and (Rambow, Vijay-Shanker & Weir 1995b).
7 A tree component is a subtree which contains only immediate dominance links.
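As a rough illustration of the two link types and of sister-adjunction, here is a sketch under our own simplified data structures; it is not the DTG formalism itself (subsertion-insertion and sister-adjoining constraints are omitted).

```python
# Hypothetical d-tree sketch: each node carries a category; daughters are
# attached either by immediate-domination links ("i") or domination links
# ("d", which may later be stretched apart).  Sister-adjunction adds
# exactly one new i-link, as a new leftmost or rightmost daughter.

class DNode:
    def __init__(self, cat):
        self.cat = cat
        self.children = []          # list of (link_kind, DNode)

    def add(self, kind, child):
        assert kind in ("i", "d")   # i-link or d-link
        self.children.append((kind, child))

def sister_adjoin(host, sub_dtree, leftmost=False):
    """Attach `sub_dtree` below `host` with one new immediate-domination link."""
    if leftmost:
        host.children.insert(0, ("i", sub_dtree))
    else:
        host.children.append(("i", sub_dtree))

# A VP immediately dominating a verb; an adverb d-tree is
# right-sister-adjoined, so the VP gains exactly one new i-link.
vp = DNode("VP")
vp.add("i", DNode("V"))
adv = DNode("ADV")
sister_adjoin(vp, adv)
```

Several such calls at the same node model the paper's observation that multiple sister-adjunctions can occur at one node.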
4 Knowledge sources
The generator assumes it is given as input an input semantics (InputSem) and 'boundary' constraints for the semantics of the generated sentence (BuiltSem, which in general is different from InputSem8). The boundary constraints are two graphs (UpperSem and LowerSem) which convey the notion of the least and the most that should be expressed. So we want BuiltSem to satisfy: LowerSem ≤ BuiltSem ≤ UpperSem.9 If the generator happens to introduce more semantic information by choosing a particular expression, LowerSem is the place where such additions can be checked for consistency. Such constraints on BuiltSem are useful because in general InputSem and BuiltSem can happen to be incomparable (neither one subsumes the other). In a practical scenario LowerSem can be the knowledge base to which the generator has access minus any contentious bits. UpperSem can be the minimum information that necessarily has to be conveyed in order for the generator to achieve the initial communicative intentions. The goal of the generator is to produce a sentence whose corresponding semantics is as close as possible to the input semantics, i.e., the realisation adds as little extra material as possible and misses as little as possible of the original input. In generation, similar constraints have been used in the generation of referring expressions, where the expressions should not be too general so that discriminatory power is not lost and not too specific so that the referring expression is in a sense minimal. Our model is a generalisation of the paradigm presented in (Reiter 1991) where issues of mismatch in lexical choice are discussed. We return to how UpperSem and LowerSem are actually used in Section 7.
8 This can come about from a mismatch between the input and the semantic structures expressible by the generator.
9 The notation G1 ≤ G2 means that G1 is subsumed by G2. We consider UpperSem to be a generalisation of BuiltSem and LowerSem a specialisation of BuiltSem (in terms of the conceptual graphs that represent them).
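The boundary constraints can be illustrated with a toy model in which a graph is just the set of facts it asserts, so a more specific graph carries more facts and subsumption becomes set containment. The fact strings below are invented for illustration; real subsumption over conceptual graphs also involves the type lattice.

```python
# Toy rendering of LowerSem <= BuiltSem <= UpperSem: a graph is modelled
# as the set of facts it asserts, so "G1 is subsumed by G2" (G1 <= G2,
# G1 more specific) becomes: G2's facts are a subset of G1's facts.

def subsumed_by(g1, g2):
    """G1 <= G2 in the paper's notation: G2 generalises G1."""
    return set(g2) <= set(g1)

def within_bounds(built, lower, upper):
    """LowerSem <= BuiltSem <= UpperSem."""
    return subsumed_by(lower, built) and subsumed_by(built, upper)

upper = {"attack(alexander, town)"}                        # least to convey
lower = {"attack(alexander, town)", "full_scale(attack)",
         "located(town, asia)"}                            # most that is known
built = {"attack(alexander, town)", "full_scale(attack)"}

ok = within_bounds(built, lower, upper)
```

Here `built` says everything `upper` demands, and adds nothing that `lower` (the knowledge base) does not license, so the check succeeds.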
Fig. 4: A mapping rule for transitive constructions
4.1 Mapping rules
Mapping rules state how the semantics is related to the syntactic representation. We do not impose any intrinsic directionality on the mapping rules and view them as declarative statements. In our generator a mapping rule is represented as a d-tree in which certain nodes are annotated with semantic information. Mapping rules are a mixed syntactic-semantic representation. The nodes in the syntactic structure are feature structures and we use unification to combine two syntactic nodes (Kay 1983). The semantic annotations of the syntactic nodes are either conceptual graphs or instructions indicating how to compute the semantics of the syntactic node from the semantics of the daughter syntactic nodes. Graphically we use dotted lines to show the coreference between graphs (or concepts). Each graph appearing in the rule has a single node ('the semantic head') which acts as a root (indicated by an arrow in Figure 4). This hierarchical structure is imposed by the rule, and is not part of the semantic input. Every mapping rule has an associated applicability semantics which is used to license its application. The applicability semantics can be viewed as an evaluation of the semantic instruction associated with the top syntactic node in the tree description.
Figure 4 shows an example of a mapping rule. The applicability semantics of this mapping rule is the graph shown at the top of Figure 4. If this structure matches part of the input semantics (we explain more precisely what we mean by matching later on) then this rule can be triggered (if it is syntactically appropriate — see Section 5). The internal generation goals (shaded areas) express the following: (1) generate the action concept as a verb and subsert (substitute, attach) the verb's syntactic structure at the V0 node; (2) generate the agent concept as a noun phrase and subsert the newly built structure at NP0; and (3) generate the patient concept as another noun phrase and subsert the newly built structure at NP1. The newly built structures are also mixed syntactic-semantic representations (annotated d-trees) and they are incorporated in the mixed structure corresponding to the current status of the generated sentence.
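A hypothetical rendering of a mapping rule's ingredients (applicability semantics, optional semantic additions, internal generation goals pairing a semantic index with a syntactic slot) might look as follows; plain set inclusion stands in for maximal join, and all predicate and slot names are our own.

```python
# Sketch of a mapping rule as data: `applicability` is the semantics that
# licenses the rule, `additions` any extra semantics it introduces (to be
# checked against LowerSem), and `goals` its internal generation goals,
# pairing a semantic index with the syntactic slot it must fill.

from dataclasses import dataclass, field

@dataclass
class MappingRule:
    name: str
    applicability: frozenset             # facts licensing the rule
    additions: frozenset = frozenset()   # extra semantics the rule brings in
    goals: list = field(default_factory=list)  # (concept index, slot)

transitive = MappingRule(
    name="transitive-S",
    applicability=frozenset({"act(e)", "agnt(e, x)", "ptnt(e, y)"}),
    goals=[("e", "V0"), ("x", "NP0"), ("y", "NP1")])

def triggers(rule, input_sem):
    """The rule may fire if its applicability semantics matches the input
    (set inclusion here; maximal join in the real system)."""
    return rule.applicability <= input_sem

fires = triggers(transitive,
                 {"act(e)", "agnt(e, x)", "ptnt(e, y)", "manr(e, m)"})
```

Exploring the three `goals` recursively is what the text describes as executing internal generation goals.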
5 Sentence generation
In this section we informally describe the generation algorithm. In Figure 5 and later in Figure 8, which illustrate some semantic aspects of the processing, we use a diagrammatic notation to describe semantic structures which are actually encoded using conceptual graphs. The input to the generator is InputSem, LowerSem, UpperSem and a mixed structure, Partial, which contains a syntactic part (usually just one node but possibly something more complex) and a semantic part which takes the form of semantic annotations on the syntactic nodes in the syntactic part. Initially Partial represents the syntactic-semantic correspondences which are imposed on the generator.10 It has the format of a mixed structure like the representation used to express mapping rules (Figure 4). Later during the generation Partial is enriched, and at any stage of processing it represents the current syntactic-semantic correspondences. We have augmented the DTG formalism so that the semantic structures associated with syntactic nodes are updated appropriately during the subsertion and sister-adjunction operations. The stages of generation are: (1) building an initial skeletal structure; (2) attempting to consume as much as possible of the semantics not covered in the previous stage; and (3) converting the partial syntactic structure into a complete syntactic tree.
10 In dialogue and question answering, for example, the syntactic form of the generated sentence may be constrained.
5.1 Building a skeletal structure
Generation starts by first trying to find a mapping rule whose semantic structure matches11 part of the initial graph and whose syntactic structure is compatible with the goal syntax (the syntactic part of Partial). If the initial goal has a more elaborate syntactic structure and requires parts of the semantics to be expressed as certain syntactic structures, this has to be respected by the mapping rule. Such an initial mapping rule will have a syntactic structure that will provide the skeleton syntax for the sentence. If Lexicalised DTG is used as the base syntactic formalism, at this stage the mapping rule will introduce the head of the sentence structure — the main verb. If the rule has internal generation goals then these are explored recursively (possibly via an agenda — we will ignore here the issue of the order in which internal generation goals are executed12). Because of the minimality of the mapping rule, the syntactic structure that is produced by this initial stage is very basic — for example only obligatory complements are considered. Any mapping rule can introduce additional semantics and such additions are checked against the lower semantic bound. When applying a mapping rule the generator keeps track of how much of the initial semantic structure has been covered/consumed. Thus at the point when all internal generation goals of the first (skeletal) mapping rule have been exhausted, the generator knows how much of the initial graph remains to be expressed.
5.2 Covering the remaining semantics
In the second stage the generator aims to find mapping rules in order to cover most of the remaining semantics (see Figure 5). The choice of mapping rules is influenced by the following criteria:
Connectivity: The semantics of the mapping rule has to match (cover) part of the covered semantics and part of the remaining semantics.
Integration: It should be possible to incorporate the semantics of the mapping rule into the semantics of the current structure being built by the generator.
Realisability: It should be possible to incorporate the partial syntactic structure of the mapping rule into the current syntactic structure being built by the generator.
11 Via the maximal join operation. Also note that the arcs to/from the conceptual relations do not reflect any directionality of the processing — they can be 'traversed'/accessed from any of the nodes they connect.
12 Different ways of exploring the agenda will reflect different processing strategies.
Fig. 5: Covering the remaining semantics with mapping rules
Note that the connectivity condition restricts the choice of mapping rules so that a rule that matches part of the remaining semantics and the extra semantics added by previous mapping rules cannot be chosen (e.g., the 'bad mapping' in Figure 5). While in the stage of fleshing out the skeleton sentence structure (Section 5.1) the syntactic integration involves subsertion, in the stage of covering the remaining semantics it is sister-adjunction that is used. When incorporating semantic structures the semantic head has to be preserved — for example, when sister-adjoining the d-tree for an adverbial construction, the semantic head of the top syntactic node has to be the same as the semantic head of the node at which sister-adjunction is done. This explicit marking of the semantic head concepts differs from (Shieber et al. 1990) where the semantic head is a PROLOG term with exactly the same structure as the input semantics.
5.3 Completing a derivation
In the preceding stages of building the skeletal sentence structure and covering the remaining semantics, the generator is mainly concerned with consuming the initial semantic structure. In those processes, parts of the semantics are mapped onto partial syntactic structures which are integrated, and the result is still a partial syntactic structure. That is why a final step of 'closing off' the derivation is needed. The generator tries to convert the partial syntactic structure into a complete syntactic tree. A morphological post-processor reads the leaves of the final syntactic tree and inflects the words.
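The bookkeeping across the three stages might be sketched as follows, again over the set-of-facts toy model. The real system operates on conceptual graphs and d-trees, so this only mirrors the covered/remaining accounting of Section 5.1 and the connectivity test of Section 5.2; rule contents are invented.

```python
# Sketch of the three generation stages as bookkeeping over fact sets:
# stage 1 consumes what the skeletal rule covers, stage 2 repeatedly adds
# rules that pass the connectivity test, stage 3 (closing off the syntax)
# is outside this semantic sketch.

def connected(rule_sem, covered, remaining):
    """Stage-2 connectivity: the rule must touch both the covered and the
    remaining semantics (so it attaches to what is already built)."""
    return bool(rule_sem & covered) and bool(rule_sem & remaining)

def generate(input_sem, skeletal_rule, other_rules):
    covered = set(skeletal_rule) & input_sem      # stage 1: skeleton
    remaining = input_sem - covered
    for rule_sem in other_rules:                  # stage 2: cover the rest
        if connected(rule_sem, covered, remaining):
            covered |= rule_sem & input_sem
            remaining = input_sem - covered
    return covered, remaining

# "Fred limped quickly": the skeleton covers the limping and its agent,
# an adverbial rule then picks up the manner fact.
sem = {"limp(e)", "agnt(e, fred)", "manr(e, quick)"}
covered, remaining = generate(sem,
                              {"limp(e)", "agnt(e, fred)"},
                              [{"limp(e)", "manr(e, quick)"}])
```

A rule touching only already-covered material, or only remaining material, fails `connected`, which is exactly the restriction that rules out the 'bad mapping' of Figure 5.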
6 Example
In this section we illustrate how the algorithm works by means of a simple example.13 Suppose we start with an initial semantics as given in Figure 1. This semantics can be expressed in a number of ways: Fred limped quickly, Fred hurried with a limp, Fred's limping was quick, The quickness of Fred's limping..., etc. Here we show how the first paraphrase is generated.
In the stage of building the skeletal structure the mapping rule (i) in Figure 6 is used. Its internal generation goals are to realise the instantiation of the limping concept as a verb and similarly the concept for Fred as a noun phrase. The generation of the subject noun phrase is not discussed here. The main verb is generated using the terminal mapping rule14 (iii) in Figure 6.15 The skeletal structure thus generated is Fred limp(ed) (see (i) in Figure 7). An interesting point is that although the internal generation goal for the verb referred only to the limping concept in the initial semantics, all of the information suggested by the terminal mapping rule (iii) in Figure 6 is consumed. We will say more about how this is done in Section 7. At this stage the only concept that remains to be consumed is the quickness concept. This is done in the stage of covering the remaining semantics when the
13 For expository purposes some VP nodes normally connected by d-edges have been merged.
14 Terminal mapping rules are mapping rules which have no internal generation goals and in which all terminal nodes of the syntactic structure are labelled with terminal symbols (lexemes).
15 In Lexicalised DTGs the main verbs would be already present in the initial trees.
mapping rule (ii) is used. This rule has an internal generation goal to generate the instantiation of the quickness concept as an adverb, which yields quickly. The structure suggested by this rule has to be integrated in the skeletal structure. On the syntactic side this is done using sister-adjunction. The final mixed syntactic-semantic structure is shown on the right in Figure 7. In the syntactic part of this structure we have no domination links. Also, all of the input semantics has been consumed.
Fig. 7: Skeletal structure and final structure
The semantic annotations of the S and VP nodes are instructions about how the graphs/concepts of their daughters are to be combined. If we evaluate the semantics of the S node in a bottom-up fashion, we will get the same result as the input semantics in Figure 1. After morphological post-processing the result is Fred limped quickly. An alternative paraphrase like Fred hurried with a limp16 can be generated using a lexical mapping rule for the verb hurry which groups the limping and quickness concepts together and another mapping rule expressing the limp as a PP. To get both paraphrases would be hard for generators relying on hierarchical representations.
7 Matching the applicability semantics of mapping rules
Matching of the applicability semantics of mapping rules against other semantic structures occurs in the following cases: when looking for a skeletal structure; when exploring an internal generation goal; and when looking for mapping rules in the phase of covering the remaining semantics. During the exploration of internal generation goals the applicability semantics of
16 Our example is based on Iordanskaja et al.'s notion of maximal reductions of a semantic net (see Iordanskaja 1991:300). It is also similar to the example in (Nogier & Zock 1992).
a mapping rule is matched against the semantics of an internal generation goal. We assume that the following conditions hold: 1. The applicability semantics of the mapping rule can be maximally joined with the goal semantics. 2. Any information introduced by the mapping rule that is more specialised than the goal semantics (additional concepts/relations, further type instantiation, etc.) must be within the lower semantic bound (LowerSem). If this additional information is within the input semantics, then information can propagate from the input semantics to the mapping rule (the shaded area 2 in Figure 8). If the mapping rule's semantic additions are merely in LowerSem, then information cannot flow from LowerSem to the mapping rule (area 1 in Figure 8).
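The two conditions can be mimicked in the set-of-facts toy model used above: condition 1 as a non-empty overlap standing in for a successful maximal join, condition 2 as a check that the rule's extra material lies within LowerSem. The facts below follow the hurry/limp paraphrase and are our own invention.

```python
# Condition 1: the rule's applicability semantics must join with the goal
# semantics.  Condition 2: whatever the rule adds beyond the goal must be
# licensed by the lower semantic bound (LowerSem).

def join_possible(applicability, goal_sem):
    """Crude stand-in for maximal join: the two structures must overlap."""
    return bool(applicability & goal_sem)

def additions_licensed(applicability, goal_sem, lower_sem):
    """Everything the rule introduces beyond the goal must be in LowerSem."""
    extra = applicability - goal_sem
    return extra <= lower_sem

goal  = {"hurry(e)", "agnt(e, fred)"}
lower = {"hurry(e)", "agnt(e, fred)", "manr(e, limp)"}
rule  = {"hurry(e)", "manr(e, limp)"}   # "hurry" groups motion and manner

ok = join_possible(rule, goal) and additions_licensed(rule, goal, lower)
```

A rule whose extras fall outside `lower` is rejected, which is how the generator conveys only what the input (plus what the language forces) warrants.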
Fig. 8: Interactions involving the applicability semantics of a mapping rule
Similar conditions hold when, in the phase of covering the remaining semantics, the applicability semantics of a mapping rule is matched against the initial semantics. This way of matching allows the generator to convey only the information in the original semantics and what the language forces one to convey, even though more information might be known about the particular situation. In the same spirit, after the generator has consumed/expressed a concept in the input semantics the system checks that the lexical semantics of the generated word is more specific than the corresponding concept (if there is one) in the upper semantic bound.
8 Preference-based chart generation
During generation appropriate mapping rules have to be found. However, at each stage a number of rules might be applicable. Due to possible interactions between some rules the generator may have to explore different allowable sequences of choices before actually being able to produce a sentence. Thus, generation is in essence a search problem. Our generator uses a non-deterministic generation strategy to explore the search space.17 The generator explores each one of the applicable mapping rules in turn through backtracking. In practice this means that whenever the generator reaches a dead end (a point in the process where none of the available alternatives are consistent with the choices made so far) it has to undo some previous commitments and return to an earlier choice point where there are still unexplored options. It often happens that computations in one branch of the search space have to be re-done in another even if the first branch did not lead to a solution of the generation goal. Consider a situation where the semantics in Figure 9 is to be expressed.
Fig. 9: Alexander attacked the town. The attack was full-scale.
17 This is in contrast to systemic and classification approaches, which are deterministic.
18 The syntactic structure of the mapping rule is a simple declarative transitive tree.
19 It can be argued that the problem with reaching a dead end above is due to the fact that the two available mapping rules have been distinguished too early. Both
NICOLAS NICOLOV, CHRIS MELLISH & GRAEME RITCHIE
In order to address the problem of recomputing structures we have explored aspects of a new semantic-indexed memoing technique based on on-line caching of generated constituents. The general idea is very simple: every time a constituent is generated it is stored, and every time a generation goal is explored the system first checks whether the result is stored already. Following the corresponding term in parsing, this technique has come to be known as chart generation. Information about partial structures is kept in a chart which is indexed not on string positions (because a certain constituent might appear in different positions in different paraphrases) but on the heads of the headed conceptual graphs which represent the built semantics for the subphrases.20
We also introduce agenda-based control for chart generators, which allows for an easy way to define an array of alternative processing strategies simply as different ways of exploring the agenda. Having a system that allows for easy definition of different generation strategies provides for the eventual possibility of comparing different algorithms based on the uniform processing mechanism of the agenda-based control for chart generation.21 One particular aspect which we are currently investigating is the use of syntactic and semantic preferences for rating intermediate results. Syntactic/stylistic preferences are helpful in cases where the semantics of two paraphrases are the same. One such instance of use of syntactic preferences is avoiding (giving lower rating to) heavy constituents in split verb-particle constructions. With regard to semantic preferences we have defined a novel measure which compares two graphs (say the applicability semantics of two mapping rules) with respect to a third (in our case this is the input semantics). Given a conceptual graph, the measure defines what it means for one graph to be a better approximate match than another.22 Thus,
alternatives share a lot of structure and neither can be ruled out in favour of the other during the stage of generating their skeletal structures. Obviously, if we used a 'parallel' generation technique that explores shared forests of structure, there would be less need for backtracking. This aspect has remained underexplored in generation work.
20 The major assumption about memoing techniques like chart generation is that retrieving the result is cheaper than computing it from scratch. For a very long time this was the accepted wisdom in parsing, yet new results show that storing all constituents might not always lead to the best performance (van Noord forthcoming).
21 Chart generation has been investigated by Shieber (1988), Haruno et al. (1993), Pianesi (1993), Neumann (1994), Kay (1996), and Shemtov (forthcoming).
22 For a good discussion of preference-driven processing of natural language (mainly parsing) see Erbach (1995).
APPROXIMATE CHART GENERATION
the generator finds all possible solutions (i.e., it is complete), producing the 'best' one first.
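The caching and agenda ideas described above can be sketched as follows. This is an illustrative sketch only, not the PROTECTOR implementation; all class, function and parameter names (`GenerationChart`, `explore`, `expand`, the edge tuples) are our own assumptions. The chart is keyed on the semantic head of each built subgraph rather than on string positions, and the agenda's exploration order determines the processing strategy.

```python
# Sketch of semantic-head-indexed chart generation with agenda-based
# control (names and data shapes are assumptions, not from the paper).
from collections import defaultdict, deque

class GenerationChart:
    def __init__(self):
        # head concept -> list of (category, semantics, phrase) edges
        self._edges = defaultdict(list)

    def lookup(self, head, category):
        """Return cached constituents whose semantics is headed by `head`."""
        return [e for e in self._edges[head] if e[0] == category]

    def add(self, head, category, semantics, phrase):
        edge = (category, semantics, phrase)
        if edge not in self._edges[head]:
            self._edges[head].append(edge)
            return True          # a new edge may trigger further goals
        return False             # already memoed: no recomputation

def explore(goals, expand, strategy="depth-first"):
    """Process generation goals from an agenda; `expand` maps a goal
    (consulting the chart first) to result edges plus follow-up goals."""
    chart = GenerationChart()
    agenda = deque(goals)
    pop = agenda.pop if strategy == "depth-first" else agenda.popleft
    while agenda:
        goal = pop()
        for head, cat, sem, phrase, subgoals in expand(goal, chart):
            if chart.add(head, cat, sem, phrase):
                agenda.extend(subgoals)
    return chart
```

Switching `strategy` between depth-first and breadth-first changes only how the agenda is popped, which is the sense in which alternative processing strategies reduce to different ways of exploring the agenda.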
9  Implementation
We have developed a sentence generator called PROTECTOR (approximate PROduction of TExts from Conceptual graphs in a declaraTive framewORk). PROTECTOR is implemented in LIFE (Aït-Kaci & Podelski 1993). The syntactic coverage of the generator is influenced by the XTAG system (the first version of PROTECTOR in fact used TAGs23). By using DTGs we can use most of the analysis of XTAG while the generation algorithm is simpler because complementation and modification on the semantic side correspond to subsertion and sister-adjunction on the syntactic side. Thus in the stage of building a skeletal structure only subsertion is used. In covering the remaining semantics only sister-adjunction is used. We are in a position to express subparts of the input semantics as different syntactic categories as appropriate for the current generation goal (e.g., VPs and nominalisations). The syntactic coverage of PROTECTOR includes: intransitive, transitive, and ditransitive verbs, topicalisation, verb particles, passive, sentential complements, control constructions, relative clauses, nominalisations and a variety of idioms. On backtracking PROTECTOR returns all solutions. We are also looking at the advantages that our approach offers for multilingual generation.
10  Discussion
In the previous section we mentioned that generation is a search problem. In order to guide the search a number of heuristics can be used. In (Nogier & Zock 1992) the number of matching nodes has been used to rate different matches, which is similar to finding maximal reductions in (Iordanskaja 1991:300). Alternatively, a notion of semantic distance (cf. Foo 1992) might be employed. In PROTECTOR we will use a much more sophisticated notion of what it is for a conceptual graph to match the initial semantics better than another graph. This captures the intuition that the generator should try to express as much as possible from the input while adding as little as possible extra material. We use instructions showing how the semantics of a mother syntactic node is computed because we want to be able to correctly update the semantics of nodes higher than the place where substitution or adjunction has
23 PROTECTOR-95 was implemented in PROLOG.
taken place — i.e., we want to be able to propagate the substitution or adjunction semantics up the mixed structure whose backbone is the syntactic tree. We also use a notion of headed conceptual graphs, i.e., graphs that have a certain node chosen as the semantic head. The initial semantics need not be marked for its semantic head. This allows the generator to choose an appropriate (for the natural language) perspective. The notion of semantic heads and their connectivity is a way to introduce a hierarchical view on the semantic structure which is dependent on the language. When matching two conceptual graphs we require that their heads be the same. This reduces the search space and speeds up the generation process. Our generator is neither coherent nor complete (i.e., it can produce sentences with more general/specific semantics than the input semantics). We try to generate sentences whose semantics is as close as possible to the input in the sense that they introduce little extra material and leave uncovered a small part of the input semantics. We keep track of more structures as the generation proceeds and are in a position to make finer distinctions than was done in previous research. The generator never produces sentences with semantics which is more specific than the lower semantic bound, which gives some degree of coherence. Our generation technique provides flexibility to address cases where the entire input cannot be expressed in a single sentence by first generating a 'best match' sentence and allowing the remaining semantics to be generated in a follow-up sentence. Our approach can be seen as a generalisation of semantic head-driven generation (Shieber et al. 1990) — we deal with a non-hierarchical input and non-concatenative grammars. The use of Lexicalised DTG means that the algorithm in effect looks first for a syntactic head. This aspect is similar to syntax-driven generation (König 1994).
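The two requirements above (heads must be equal; prefer candidates that cover more of the input while adding less extra material) can be illustrated with a small sketch. The representation and function names are our own simplification, not PROTECTOR's measure: a graph is reduced to a head concept plus a set of relation edges.

```python
# Illustrative sketch of approximate semantic matching (our own
# simplification): a graph is a head concept plus (relation, concept)
# edges; heads are required to be equal before any comparison.
def match_score(candidate, goal):
    """Return None if the heads differ; otherwise (covered, extra):
    how many goal edges the candidate expresses, and how many edges
    it adds beyond the goal."""
    if candidate["head"] != goal["head"]:
        return None
    cand, want = set(candidate["edges"]), set(goal["edges"])
    return (len(cand & want), len(cand - want))

def better_match(a, b, goal):
    """True iff candidate `a` approximates `goal` better than `b`:
    more input covered, ties broken by less extra material."""
    sa, sb = match_score(a, goal), match_score(b, goal)
    if sa is None:
        return False
    if sb is None:
        return True
    return (sa[0], -sa[1]) > (sb[0], -sb[1])
```

For the earlier example, a candidate covering ALEXANDER, TOWN and FULL SCALE scores better than one covering only ALEXANDER and TOWN, which is the intuition behind preferring `launched a full scale attack on` over plain `attacked`.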
Unlike semantic head-driven generation we generate modifiers after the corresponding syntactic head has been generated, which allows for better treatment of collocations. We have specified a declarative definition of 'derivation' in our framework (including the semantic aspects of the approximate generation), yet due to space constraints we omit a full discussion of it here. The notion of derivation in generation is an important one. It allows one to abstract from the procedural details of a particular implementation and to consider the logical relationships between the structures that are manipulated. If alternative generation strategies are to be developed, clearly stating what a derivation is is an important prerequisite. If similar research had been done for other frameworks we could make comparisons with relevant generation
work; regrettably this is not the case.24
Potentially the information in the mapping rules can be used by a natural language understanding system too. However, parsing algorithms for the particular linguistic theory that we employ (DTG) have a complexity of O(n^(4k+3)), where n is the number of words in the input string and k is the number of d-edges in elementary d-trees. This is a serious overhead and we have not tried to use the mapping rules in reverse for the task of understanding.25 The algorithm has to be checked against more linguistic data and we intend to do more work on additional control mechanisms and also on using alternative generation strategies using knowledge sources free from control information.
11  Conclusion
We have presented a technique for sentence generation from conceptual graphs. The use of a non-hierarchical representation for the semantics and approximate semantic matching increases the paraphrasing power of the generator and enables the production of sentences with radically different syntactic structure due to alternative ways of grouping concepts into words. This is particularly useful for multilingual generation and in practical generators which are given input from non-linguistic applications. The use of a syntactic theory (D-Tree Grammars) allows for the production of linguistically motivated syntactic structures which will pay off in terms of better coverage of the language and overall maintainability of the generator. The syntactic theory also affects the processing — we have augmented the syntactic operations to account for the integration of the semantics. The generation architecture makes explicit the decisions that have to be taken and allows for experiments with different generation strategies using the same declarative knowledge sources.26
24 Yet there has been work on a unified approach to systemic, unification and classification approaches to generation. For more details see (Mellish 1991).
25 The first author is involved in a large project (with David Weir & John Carroll at the University of Sussex) for "Analysis of Naturally-Occurring English Text with Stochastic Lexicalised Grammars" which uses the same grammar formalism (D-Tree Grammars). The goal of the project is to develop a wide-coverage parsing system for English. From the point of view of generation it is interesting to investigate the bidirectionality of the grammar, i.e., whether the grammar used for parsing can be used for generation. More details about the above-mentioned project can be found at http://www.cogs.susx.ac.uk/lab/nlp/dtg/.
26 More details about the PROTECTOR generation system are available on the world-wide web: http://www.cogs.susx.ac.uk/lab/nlp/nicolas/.
REFERENCES
Aït-Kaci, Hassan & Andreas Podelski. 1993. "Towards a Meaning of LIFE". Journal of Logic Programming 16:3&4.195-234.
Antonacci, F. et al. 1992. "Analysis and Generation of Italian Sentences". Conceptual Structures: Current Research and Practice ed. by Timothy Nagle et al., 437-460. London: Ellis Horwood.
Boyer, Michel & Guy Lapalme. 1985. "Generating Paraphrases from Meaning-Text Semantic Networks". Computational Intelligence 1:1.103-117.
Caspari, Rudolf & Ludwig Schmid. 1994. "Parsing and Generation in TrUG [in German]". Verbmobil Report 40. Siemens AG.
Doran, Christine et al. 1994. "XTAG — A Wide Coverage Grammar for English". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 922-928. Kyoto, Japan.
Erbach, Gregor. 1995. Bottom-Up Earley Deduction for Preference-Driven Natural Language Processing. Ph.D. dissertation, University of the Saarland. Saarbrücken, Germany.
Foo, Norman et al. 1992. "Semantic Distance in Conceptual Graphs". Conceptual Structures: Current Research and Practice ed. by Timothy Nagle et al., 149-154. London: Ellis Horwood.
Harbusch, Karin, G. Kikui & A. Kilger. 1994. "Default Handling in Incremental Generation". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 356-362. Kyoto, Japan.
Iordanskaja, Lidija, Richard Kittredge & Alain Polguère. 1991. "Lexical Selection and Paraphrase in a Meaning-Text Generation Model". Natural Language Generation in Artificial Intelligence and Computational Linguistics ed. by C. Paris, W. Swartout & W. Mann, 293-312. Dordrecht, The Netherlands: Kluwer.
Joshi, Aravind. 1987. "The Relevance of Tree Adjoining Grammar to Generation". Natural Language Generation ed. by Gerard Kempen, 233-252. Dordrecht, The Netherlands: Kluwer.
Kay, Martin. 1983. "Unification Grammar". Technical Report. Palo Alto, Calif.: Xerox Palo Alto Research Center.
Kay, Martin. 1996. "Chart Generation". Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL'96), 200-204. Santa Cruz, Calif.: Association for Computational Linguistics.
König, Esther. 1994. "Syntactic Head-Driven Generation". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 475-481. Kyoto, Japan.
McCoy, Kathleen F., K. Vijay-Shanker & G. Yang. 1992. "A Functional Approach to Generation with TAG". Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics (ACL'92), 48-55. Delaware: Association for Computational Linguistics.
McDonald, David & James Pustejovsky. 1985. "TAGs as a Grammatical Formalism for Generation". Proceedings of the 23rd Annual Meeting of the Association for Computational Linguistics (ACL'85), 94-103. Chicago, Illinois: Association for Computational Linguistics.
Mellish, Chris. 1991. "Approaches to Realization in Natural Language Generation". Natural Language and Speech ed. by Ewan Klein & Frank Veltman, 95-116. Berlin: Springer-Verlag.
Meteer, Marie. 1990. The "Generation Gap": The Problem of Expressibility in Text Planning. Ph.D. dissertation, University of Massachusetts, Mass. (Also available as COINS TR 90-04.)
Neumann, Günter. 1994. A Uniform Computational Model for Natural Language Parsing and Generation. Ph.D. dissertation, University of the Saarland, Saarbrücken, Germany.
Nicolov, Nicolas, Chris Mellish & Graeme Ritchie. 1995. "Sentence Generation from Conceptual Graphs". Conceptual Structures: Applications, Implementation and Theory (LNAI 954) ed. by G. Ellis, R. Levinson, W. Rich & J. Sowa, 74-88. Berlin: Springer.
Nogier, Jean-François. 1991. Génération Automatique de Langage et Graphes Conceptuels. Paris: Hermès.
Nogier, Jean-François & Michael Zock. 1992. "Lexical Choice as Pattern Matching". Conceptual Structures: Current Research and Practice ed. by Timothy Nagle et al., 413-436. London: Ellis Horwood.
van Noord, Gertjan. Forthcoming. "An Efficient Implementation of the Head-Corner Parser". To appear in Computational Linguistics.
Oh, Jonathan et al. 1992. "NLP: Natural Language Parsers and Generators". Proceedings of the 1st International Workshop on PEIRCE: A Conceptual Graph Workbench, 48-55. Las Cruces: New Mexico State University.
Pianesi, Fabio. 1993. "Head-Driven Bottom-Up Generation and Government and Binding: A Unified Perspective". New Concepts in Natural Language Generation ed. by H. Horacek & M. Zock, 187-214. London: Pinter.
Rambow, Owen, K. Vijay-Shanker & David Weir. 1995a. "D-Tree Grammars". Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL'95), 151-158. Boston, Mass.: Association for Computational Linguistics.
Rambow, Owen, K. Vijay-Shanker & David Weir. 1995b. "Parsing D-Tree Grammars". Proceedings of the International Workshop on Parsing Technologies (IWPT'95), 252-259. Prague.
Reiter, Ehud. 1991. "A New Model of Lexical Choice for Nouns". Computational Intelligence (Special Issue on Natural Language Generation) 7:4.240-251.
Shapiro, Stuart. 1982. "Generalized Augmented Transition Network Grammars for Generation from Semantic Networks". Computational Linguistics 8:1.12-25.
Shapiro, Stuart. 1989. "The CASSIE Projects: An Approach to NL Competence". Proceedings of the 4th Portuguese Conference on AI: EPIA-89 (LNAI 390), 362-380. Berlin: Springer.
Shemtov, Hadar. Forthcoming. "Generation of Paraphrases from Ambiguous Logical Forms".
Shieber, Stuart, Gertjan van Noord, Robert Moore & Fernando Pereira. 1990. "A Semantic Head-Driven Generation Algorithm for Unification-Based Formalisms". Computational Linguistics 16:1.30-42.
Simmons, R. & J. Slocum. 1972. "Generating English Discourse from Semantic Networks". Communications of the Association for Computing Machinery (CACM) 15:10.891-905.
Smith, Mark, Roberto Garigliano & Richard Morgan. 1994. "Generation in the LOLITA System: An Engineering Approach". Proceedings of the 7th International Workshop on Natural Language Generation, 241-244. Kennebunkport, Maine, U.S.A.
Sowa, John. 1984. Conceptual Structures: Information Processing in Mind and Machine. Reading, Mass.: Addison-Wesley.
Sowa, John. 1992. "Conceptual Graphs Summary". Conceptual Structures: Current Research and Practice ed. by Timothy Nagle et al., 3-51. London: Ellis Horwood.
Svenberg, Stefan. 1994. "Representing Conceptual and Linguistic Knowledge for Multilingual Generation in a Technical Domain". Proceedings of the 7th International Workshop on Natural Language Generation (IWNLG'94), 245-248. Kennebunkport, Maine, U.S.A.
van Rijn, Afke. 1991. Natural Language Communication between Man and Machine. Ph.D. dissertation, Technical University Delft, The Netherlands.
Wahlster, Wolfgang et al. 1991. "WIP: The Coordinated Generation of Multimodal Presentations from a Common Representation". Technical Report RR 91-08. Saarbrücken, Germany: DFKI.
Example-Based Optimisation of Surface-Generation Tables

CHRISTER SAMUELSSON
Universität des Saarlandes
Abstract

A method is given that 'inverts' a logic grammar and displays it from the point of view of the logical form, rather than from that of the word string. LR-compiling techniques are used to allow a recursive-descent generation algorithm to perform 'functor merging' much in the same way as an LR parser performs prefix merging. This is an improvement on the semantic-head-driven generator that results in a much smaller search space. The amount of semantic lookahead can be varied, and appropriate tradeoff points between table size and resulting nondeterminism can be found automatically. This can be done by removing all spurious nondeterminism for input sufficiently close to the examples of a training corpus, and large portions of it for other input, while preserving completeness.1
1  Introduction
With the emergence of fast algorithms and optimisation techniques for syntactic analysis, such as the use of explanation-based learning in conjunction with LR parsing, see (Samuelsson & Rayner 1991) and subsequent work, surface generation has become a major bottleneck in NLP systems. Surface generation will here be viewed as the inverse problem of syntactic analysis and subsequent semantic interpretation. The latter consists in constructing some semantic representation of an input word-string based on the syntactic and semantic rules of a formal grammar. In this article, we will limit ourselves to logic grammars that attribute word strings with expressions in some logical formalism represented as terms with a functor-argument structure. The surface generation problem then consists in assigning an output
1 I wish to thank greatly Gregor Erbach, Jussi Karlgren, Manny Rayner, Hans Uszkoreit, Mats Wirén and the anonymous reviewers of ACL, EACL, IJCAI and RANLP for valuable feedback on previous versions of this article. Special credit is due to Kristina Striegnitz, who assisted with the implementation. Parts of this article have previously appeared as (Samuelsson 1995). The presented work was funded by the N3 "Bidirektionale Linguistische Deduktion (BiLD)" project in the Sonderforschungsbereich 314 Künstliche Intelligenz — Wissensbasierte Systeme.
word-string to such a term. This is a common scenario in conjunction with, for example, transfer-based machine-translation systems employing reversible grammars, and it is different from that when a deep generator or a text planner is available to guide the surface generator. In general, both these mappings are many-to-many: a word string that can be mapped to several distinct logical forms is said to be ambiguous. A logical form that can be assigned to several different word strings is said to have multiple paraphrases. We want to create a generation algorithm that generates a word string by recursively descending through a logical form, while delaying the choice of grammar rules to apply as long as possible. This means that we want to process different rules or rule combinations that introduce the same piece of semantics in parallel until they branch apart. This will reduce the amount of spurious search, since we will gain more information about the rest of the logical form before having to commit to a particular grammar rule. In practice, this means that we want to perform 'functor merging' much in the same way as an LR parser performs prefix merging by employing parsing tables compiled from the grammar. One obvious way of doing this is to use LR-compilation techniques to compile generation tables. This will however require that we reformulate the grammar from the point of view of the logical form, rather than from that of the word string from which it is normally displayed. The rest of the paper is structured as follows: We will first review basic LR compilation of parsing tables in Section 2. The grammar-inversion procedure turns out to be most easily explained in terms of the semantic-head-driven generation (SHDG) algorithm. We will therefore proceed to outline the SHDG algorithm in Section 3. The grammar inversion itself is described in Section 4, while LR compilation of generation tables is discussed in Section 5.
The generation algorithm is presented in Section 6. The example-based optimisation technique turns out to be most easily explained as a straightforward extension of a simpler optimisation technique predating it, which is why this simpler technique is given in Section 7. This extension is described in Section 8, and the relation between this example-based optimisation technique and explanation-based learning is discussed in Section 9.
2  LR compilation for parsing
LR compilation in general is well-described in, for example, (Aho et al. 1986:215-247). Here we will only sketch out the main ideas. An LR parser is basically a pushdown automaton, i.e., it has a pushdown stack in addition to a finite set of internal states and a reader head for scanning the input string from left to right one symbol at a time. The stack is used in a characteristic way. The items on the stack consist of alternating grammar symbols and states. The current state is simply the state on top of the stack. The most distinguishing feature of an LR parser is however the form of the transition relation — the action and goto tables. A nondeterministic LR parser can in each step perform one of four basic actions. In state S with lookahead symbol2 Sym it can:

1. accept: Halt and signal success.
2. error: Fail and backtrack.
3. shift S2: Consume the input symbol Sym, push it onto the stack, and transit to state S2 by pushing it onto the stack.
4. reduce R: Pop off two items from the stack for each grammar symbol in the RHS of grammar rule R, inspect the stack for the old state S1 now on top of the stack, push the LHS of rule R onto the stack, and transit to the state S2 determined by goto(S1,LHS,S2) by pushing S2 onto the stack.

Consider the small sample grammar given in Figure 1. To make this simple grammar slightly more interesting, the recursive Rule 1, S → S QM, allows the addition of a question mark (QM) to the end of a sentence (S), as in John sleeps?. The LHS S is then interpreted as a yes-no question version of the RHS S. Each internal state consists of a set of dotted items. Each item in turn corresponds to a grammar rule. The current string position is indicated by a dot. For example, Rule 2, S → NP VP, yields the item S → NP • VP, which corresponds to just having found an NP and now searching for a VP. In the compilation phase, new states are induced from old ones.
For the indicated string position, a possible grammar symbol is selected and the dot is advanced one step in all items where this particular grammar symbol immediately follows the dot, and the resulting new items will constitute the kernel of the new state. Non-kernel items are added to these by selecting 2
The lookahead symbol is the next symbol in the input string, i.e., the symbol under the reader head.
298
CHRISTER SAMUELSSON
1  S  → S QM
2  S  → NP VP
3  VP → VP PP
4  VP → VP AdvP
5  VP → Vi
6  VP → Vt NP
7  PP → P NP

   NP   → John
   NP   → Mary
   NP   → Paris
   Vi   → sleeps
   Vt   → sees
   P    → in
   AdvP → today
   QM   → ?
Fig. 1: Sample grammar

grammar rules whose LHS match grammar symbols at the new string position in the new items. In each non-kernel item, the dot is at the beginning of the rule. If a set of items is constructed that already exists, then this search branch is abandoned and the recursion terminates.
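The closure and goto steps just described can be sketched directly; this is an illustrative implementation under our own representation (dotted items as `(rule_index, dot_position)` pairs over the sample grammar of Fig. 1), not the compiler discussed later in the article.

```python
# Dotted-item closure and goto for the sample grammar of Fig. 1
# (rule 0 is the dummy top rule S' -> S discussed in the text).
GRAMMAR = [
    ("S'", ["S"]),               # rule 0: dummy top rule
    ("S",  ["S", "QM"]),         # rule 1
    ("S",  ["NP", "VP"]),        # rule 2
    ("VP", ["VP", "PP"]),        # rule 3
    ("VP", ["VP", "AdvP"]),      # rule 4
    ("VP", ["Vi"]),              # rule 5
    ("VP", ["Vt", "NP"]),        # rule 6
    ("PP", ["P", "NP"]),         # rule 7
]

def closure(kernel):
    """Add non-kernel items: for every symbol right after a dot, add
    all rules expanding it, with the dot at the beginning."""
    items = set(kernel)
    changed = True
    while changed:
        changed = False
        for rule, dot in list(items):
            _, rhs = GRAMMAR[rule]
            if dot < len(rhs):
                for r, (lhs, _) in enumerate(GRAMMAR):
                    if lhs == rhs[dot] and (r, 0) not in items:
                        items.add((r, 0))
                        changed = True
    return items

def goto(items, symbol):
    """Advance the dot over `symbol` and close the resulting kernel."""
    kernel = {(r, d + 1) for r, d in items
              if d < len(GRAMMAR[r][1]) and GRAMMAR[r][1][d] == symbol}
    return closure(kernel) if kernel else set()
```

Starting from the dummy kernel item `(0, 0)`, `closure` yields State 1 of Fig. 2, and `goto` over NP yields State 3.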
State 1:  S' → • S       S → • S QM      S → • NP VP
State 2:  S' → S •       S → S • QM
State 3:  S → NP • VP    VP → • VP PP    VP → • VP AdvP    VP → • Vi    VP → • Vt NP
State 4:  S → NP VP •    VP → VP • PP    VP → VP • AdvP    PP → • P NP
State 5:  VP → VP PP •
State 6:  VP → VP AdvP •
State 7:  PP → P • NP
State 8:  PP → P NP •
State 9:  VP → Vi •
State 10: VP → Vt • NP
State 11: VP → Vt NP •
State 12: S → S QM •
Fig. 2: LR-parsing states for the sample grammar

The state-construction phase starts off by creating an initial set consisting of a single dummy kernel item and its non-kernel closure. This is State 1 in Figure 2. The dummy item introduces a dummy top grammar symbol as its LHS, while the RHS consists of the old top symbol, and the dot is at the beginning of the rule. In the example, this is the item S' → • S. The rest
of the states are induced from the initial state. The states resulting from the sample grammar of Figure 1 are shown in Figure 2, and these in turn will yield the parsing tables of Figure 3. The entry "s3" in the action table, for example, should be interpreted as "shift the lookahead symbol onto the stack and transit to State 3". The entry "r7" should be interpreted as "reduce by Rule 7". The accept action is denoted "acc". The goto entries, like "g4", simply indicate what state to transit to once a nonterminal of that type has been constructed.

State | NP  | VP | PP | AdvP | Vi | Vt  | P  | S  | QM  | eos
   1  | s3  |    |    |      |    |     |    | g2 |     |
   2  |     |    |    |      |    |     |    |    | s12 | acc
   3  |     | g4 |    |      | s9 | s10 |    |    |     |
   4  |     |    | g5 | s6   |    |     | s7 |    | r2  | r2
   5  |     |    |    | r3   |    |     | r3 |    | r3  | r3
   6  |     |    |    | r4   |    |     | r4 |    | r4  | r4
   7  | s8  |    |    |      |    |     |    |    |     |
   8  |     |    |    | r7   |    |     | r7 |    | r7  | r7
   9  |     |    |    | r5   |    |     | r5 |    | r5  | r5
  10  | s11 |    |    |      |    |     |    |    |     |
  11  |     |    |    | r6   |    |     | r6 |    | r6  | r6
  12  |     |    |    |      |    |     |    |    | r1  | r1
Fig. 3: LR-parsing tables for the sample grammar

In conjunction with grammar formalisms employing complex feature structures, this procedure is associated with a number of interesting problems, many of which are discussed in (Nakazawa 1991) and (Samuelsson 1994c). For example, the termination criterion must be modified: if a new set of items is constructed that is more specific than an existing one, then this search branch is abandoned and the recursion terminates. If, on the other hand, it is more general, then it replaces the old one.
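The nondeterministic driver described in Section 2 can be sketched with an excerpt of the tables above. This is our own illustration: the tables below cover only the entries needed to recognise "John sleeps" and "John sleeps ?", and the stack is simplified to states only (one entry per grammar symbol, instead of the alternating symbols and states described in the text).

```python
# Backtracking LR recognition with an excerpt of the Fig. 3 tables.
ACTION = {  # (state, lookahead) -> possible actions
    (1, "NP"): [("shift", 3)],
    (2, "QM"): [("shift", 12)], (2, "eos"): [("accept", None)],
    (3, "Vi"): [("shift", 9)],
    (4, "QM"): [("reduce", 2)], (4, "eos"): [("reduce", 2)],
    (9, "QM"): [("reduce", 5)], (9, "eos"): [("reduce", 5)],
    (12, "QM"): [("reduce", 1)], (12, "eos"): [("reduce", 1)],
}
GOTO = {(1, "S"): 2, (3, "VP"): 4}
RULES = {1: ("S", 2), 2: ("S", 2), 5: ("VP", 1)}  # rule -> (LHS, |RHS|)

def parses(preterminals):
    """Try every applicable action in turn, backtracking on failure."""
    toks = preterminals + ["eos"]
    def step(stack, i):
        for op, arg in ACTION.get((stack[-1], toks[i]), []):
            if op == "accept":
                yield True
            elif op == "shift":
                yield from step(stack + [arg], i + 1)
            else:                        # reduce by rule `arg`
                lhs, n = RULES[arg]
                rest = stack[:len(stack) - n]
                yield from step(rest + [GOTO[(rest[-1], lhs)]], i)
    return any(step([1], 0))
```

For example, `parses(["NP", "Vi"])` recognises John sleeps via s3, s9, r5, r2 and accept, and `parses(["NP", "Vi", "QM"])` additionally exercises s12 and r1 for John sleeps?.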
3  The semantic-head-driven generation algorithm
Generators found in large-scale systems such as the DFKI DISCO system (Uszkoreit et al. 1994), or the SRI Core Language Engine (Alshawi (ed.) 1992:268-275), tend typically to be based on the semantic-head-driven generation (SHDG) algorithm. The SHDG algorithm is well-described in (Shieber et al. 1990); here we will only outline the main features.
The grammar rules of Figure 1 have been attributed with logical forms as shown in Figure 4. The notation has been changed so that each constituent consists of a quadruple ⟨Cat, Sem, W0, W1⟩, where W0 and W1 form a difference list representing the word string that Cat spans, and Sem is the logical form. For example, the logical form corresponding to the LHS S of the ⟨S, mod(X,Y), W0, W⟩ → ⟨S, X, W0, W1⟩ ⟨QM, Y, W1, W⟩ rule consists of a modifier Y added to the logical form X of the RHS S. As we can see from the last grammar rule, this modifier is in turn realised as ynq.

[Figure 4: rules 1-7 of the sample grammar attributed with logical forms; figure content lost in extraction.]
For the SHDG algorithm, the grammar is divided into chain rules and non-chain rules. Chain rules have a distinguished RHS constituent, the semantic head, that has the same logical form as the LHS constituent, modulo λ-abstractions; non-chain rules lack such a constituent. In particular, lexicon entries are non-chain rules, since they do not have any RHS constituents at all. This distinction is made since the generation algorithm treats the two rule types quite differently. In the example grammar, rules 2 and 5 through 7 are chain rules, while the remaining ones are non-chain rules. A simple semantic-head-driven generator might work as follows: Given a grammar symbol and a piece of logical form, the generator looks for a non-chain rule with the given semantics. The constituents of the RHS of that rule are then generated recursively, after which the LHS is connected
to the given grammar symbol using chain rules. At each application of a chain rule, the rest of the RHS constituents, i.e., the non-head constituents, are generated recursively. The particular combination of connecting chain rules used is often referred to as a chain. The generator starts off with the top symbol of the grammar and the logical form corresponding to the string that is to be generated. The inherent problem with the SHDG algorithm is that each rule combination is tried in turn, while the possibilities of prefiltering are rather limited, leading to a large amount of spurious search. The generation algorithm presented in the current article does not suffer from this problem; what the new algorithm in effect does is to process all chains from a particular set of grammar symbols down to some particular piece of logical form in parallel before any rule is applied, rather than to construct and try each one separately in turn.
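The pivot-and-climb behaviour just described can be sketched for a fragment of the sample grammar. This is a deliberately simplified toy, not the cited algorithm: the `LEXICON` and `CHAIN` tables and the argument-placement convention are our own assumptions; non-chain rules are all lexical pivots, and each chain link realises at most one NP argument.

```python
# Toy semantic-head-driven generation over the Fig. 1 fragment.
LEXICON = {           # functor -> (pivot category, word)
    "sleep": ("Vi", "sleeps"),
    "see":   ("Vt", "sees"),
    "john":  ("NP", "John"),
    "mary":  ("NP", "Mary"),
}
CHAIN = {             # category -> (parent category, argument side)
    "Vi": ("VP", None),        # VP -> Vi consumes no argument
    "Vt": ("VP", "right"),     # VP -> Vt NP: object to the right
    "VP": ("S", "left"),       # S -> NP VP: subject to the left
}

def generate(goal_cat, sem):
    """Pick the lexical pivot for the semantic functor, then climb a
    chain of chain rules up to `goal_cat`, realising arguments."""
    functor, args = sem if isinstance(sem, tuple) else (sem, [])
    cat, word = LEXICON[functor]
    words, args = [word], list(args)
    while cat != goal_cat:
        cat, side = CHAIN[cat]
        if side == "left":                       # subject argument
            words = generate("NP", args.pop(0)) + words
        elif side == "right":                    # object argument
            words = words + generate("NP", args.pop())
    return words
```

For instance, `generate("S", ("see", ["john", "mary"]))` pivots on sees, attaches the object via the VP → Vt NP link and the subject via the S → NP VP link, yielding John sees Mary.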
4  Grammar inversion
Before we can invert the grammar, we must put it in normal form. We will use a variant of chain and non-chain rules, namely functor-introducing rules corresponding to non-chain rules, and argument-filling rules corresponding to chain rules. The inversion step is based on the assumption that there are no other types of rules. Since the generator will work by recursive descent through the logical form, we wish to rearrange the grammar so that arguments are generated together with their functors. To this end we introduce another difference list A0 and A to pass down the arguments introduced by argument-filling rules to the corresponding functor-introducing rules. Here the latter rules are assumed to be lexical, following the tradition in GPSG where the presence of the SUBCAT feature implies a preterminal grammar symbol, see e.g., (Gazdar et al. 1985:33), but this is really immaterial for the algorithm. The grammar of Figure 4 is shown in normal form in Figure 5. The grammar is compiled into this form by inspecting the flow of arguments through the logical forms of the constituents of each rule. In the functor-introducing rules, the RHS is rearranged to mirror the argument order of the LHS logical form. The argument-filling rules have only one RHS constituent — the semantic head — and the rest of the original RHS constituents are added to the argument list of the head constituent. Note, for example, how the NP is added to the argument list of the VP in Rule 2, or to the argument list of the P in Rule 7. This is done automatically, although currently, the exact flow of arguments is specified manually.
302
CHRISTER SAMUELSSON
Functor-introducing rules
1 ⟨S, mod(X,Y), W0, W, ε, ε⟩ → ⟨S, X, W0, W1, ε, ε⟩ ⟨QM, Y, W1, W, ε, ε⟩
3 ⟨VP, X^mod(Y,Z), W0, W, A0, A⟩ → ⟨VP, X^Y, W0, W1, A0, A⟩ ⟨AdvP, Z, W1, W, ε, ε⟩
4 ⟨VP, X^mod(Y,Z), W0, W, A0, A⟩ → ⟨VP, X^Y, W0, W1, A0, A⟩ ⟨PP, Z, W1, W, ε, ε⟩
⟨NP, john, [John|W], W, A, ε⟩ → A
⟨NP, mary, [Mary|W], W, A, ε⟩ → A
⟨NP, paris, [Paris|W], W, A, ε⟩ → A
⟨Vi, X^sleep(X), [sleeps|W], W, A, ε⟩ → A
⟨Vt, X^Y^see(X,Y), [sees|W], W, A, ε⟩ → A
⟨P, X^in(X), [in|W], W, A, ε⟩ → A
⟨AdvP, today, [today|W], W, A, ε⟩ → A
⟨QM, ynq, [?|W], W, A, ε⟩ → A

Argument-filling rules
2 ⟨S, Y, W0, W, ε, ε⟩ → ⟨VP, X^Y, W1, W, [⟨NP, X, W0, W1⟩], ε⟩
5 ⟨VP, X, W0, W, A0, A⟩ → ⟨Vi, X, W0, W, A0, A⟩
6 ⟨VP, Y, W0, W, A0, A⟩ → ⟨Vt, X^Y, W0, W1, [⟨NP, X, W1, W⟩|A0], A⟩
7 ⟨PP, Y, W0, W, A0, A⟩ → ⟨P, X^Y, W0, W1, [⟨NP, X, W1, W⟩|A0], A⟩

Fig. 5: Sample grammar in normal form
We assume that there are no purely argument-filling cycles. For rules that actually fill in arguments, this is obviously impossible, since the number of arguments decreases strictly. For the slightly degenerate case of argument-filling rules which only pass along the logical form, such as the ⟨VP,X⟩ → ⟨Vi,X⟩ rule, this is equivalent to the off-line parsability requirement (Kaplan & Bresnan 1982:264-266).3 We require this in order to avoid an infinite number of chains, since each possible chain will be expanded out in the inversion step. Since subcategorisation lists of verbs are bounded in length, PATR-II-style VP rules do not pose a serious problem, which on the other hand the 'adjunct-as-argument' approach taken in (Bouma & van Noord 1994) may do. However, this problem is common to a number of other generation algorithms, including the SHDG algorithm. Let us return to the scenario for the SHDG algorithm given at the end of Section 3: we have a piece of logical form and a grammar symbol, and

3 If the RHS Vi were a VP, we would have a purely argument-filling cycle of length 1.
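The no-cycle requirement can be checked mechanically. A sketch under an invented rule encoding: treat each argument-filling rule that merely passes the logical form along as an edge from its LHS category to its sole RHS category, and search that graph for a cycle.

```python
def has_passing_cycle(edges):
    # `edges` lists (lhs_category, rhs_category) pairs, one per rule that
    # only passes the logical form along, e.g. ("VP", "Vi") for <VP,X> -> <Vi,X>.
    def visit(cat, seen):
        # Depth-first search; a revisited category means a cycle.
        if cat in seen:
            return True
        return any(visit(nxt, seen | {cat}) for lhs, nxt in edges if lhs == cat)
    return any(visit(cat, frozenset()) for cat, _ in edges)

print(has_passing_cycle([("VP", "Vi")]))   # -> False
print(has_passing_cycle([("VP", "VP")]))   # -> True: the footnote's length-1 cycle
```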
OPTIMISATION OF GENERATION TABLES
303
we wish to connect a non-chain rule with this particular logical form to the given grammar symbol through a chain. We will generalise this scenario just slightly to the case where a set of grammar symbols is given, rather than a single one. Each inverted rule will correspond to a particular chain of argument-filling (chain) rules connecting a functor-introducing (non-chain) rule introducing this logical form to a grammar symbol in the given set. The arguments introduced by this chain will be collected and passed down to the functors that consume them in order to ensure that each of the inverted rules has a RHS matching the structure of the LHS logical form. The normalised sample grammar of Figure 5 will result in the inverted grammar of Figure 6. Note how the right-hand sides reflect the argument structure of the left-hand-side logical forms. As mentioned previously, the collected arguments are currently assumed to correspond to functors introduced by lexical entries, but the procedure can readily be modified to accommodate grammar rules with a non-empty RHS, where some of the arguments are consumed by the LHS logical form. The grammar inversion step is combined with the LR-compilation step. This is convenient for several reasons: Firstly, the termination criteria and the database maintenance issues are the same in both steps. Secondly, since the LR-compilation step employs a top-down rule-invocation scheme, this will ensure that the arguments are passed down to the corresponding functors. In fact, invoking inverted grammar rules merely requires first invoking a chain of argument-filling rules and then terminating it with a functor-introducing rule.

5 LR compilation for generation
Just as when compiling LR-parsing tables, the compiler operates on sets of dotted items. Each item consists of a partially processed inverted grammar rule, with a dot marking the current position. Here the current position is an argument position of the LHS logical form, rather than some position in the input string. New states are induced from old ones: For the indicated argument position, a possible logical form is selected and the dot is advanced one step in all items where this particular logical form can occur in the current argument position, and the resulting new items constitute a new state. All possible grammar symbols that can occur in the old argument position and that can have this logical form are then collected. From these, all rules with
⟨S, mod(X,Y), W0, W, ε, ε⟩ → ⟨S, X, W0, W1, ε, ε⟩ ⟨QM, Y, W1, W, ε, ε⟩
⟨S, mod(Y,Z), W0, W, ε, ε⟩ → ⟨VP, X^Y, W1, W2, [⟨NP, X, W0, W1⟩], ε⟩ ⟨AdvP, Z, W2, W, ε, ε⟩
⟨S, mod(Y,Z), W0, W, ε, ε⟩ → ⟨VP, X^Y, W1, W2, [⟨NP, X, W0, W1⟩], ε⟩ ⟨PP, Z, W2, W, ε, ε⟩
⟨VP, X^mod(Y,Z), W1, W, [⟨NP, X, W0, W1⟩], ε⟩ → ⟨VP, X^Y, W1, W2, [⟨NP, X, W0, W1⟩], ε⟩ ⟨AdvP, Z, W2, W, ε, ε⟩
⟨VP, X^mod(Y,Z), W1, W, [⟨NP, X, W0, W1⟩], ε⟩ → ⟨VP, X^Y, W1, W2, [⟨NP, X, W0, W1⟩], ε⟩ ⟨PP, Z, W2, W, ε, ε⟩
⟨S, sleep(X), W0, W, ε, ε⟩ → ⟨NP, X, W0, [sleeps|W], ε, ε⟩
⟨VP, X^sleep(X), [sleeps|W], W, [⟨NP, X, W0, [sleeps|W]⟩], ε⟩ → ⟨NP, X, W0, [sleeps|W], ε, ε⟩
⟨S, see(X,Y), W0, W, ε, ε⟩ → ⟨NP, X, W1, W, ε, ε⟩ ⟨NP, Y, W0, [sees|W1], ε, ε⟩
⟨VP, Y^see(X,Y), [sees|W0], W, [⟨NP, Y, W1, [sees|W0]⟩], ε⟩ → ⟨NP, X, W0, W, ε, ε⟩ ⟨NP, Y, W1, [sees|W0], ε, ε⟩
⟨PP, in(X), [in|W0], W, ε, ε⟩ → ⟨NP, X, W0, W, ε, ε⟩
⟨NP, john, [John|W], W, ε, ε⟩ → ε
⟨NP, mary, [Mary|W], W, ε, ε⟩ → ε
⟨NP, paris, [Paris|W], W, ε, ε⟩ → ε
⟨AdvP, today, [today|W], W, ε, ε⟩ → ε
⟨QM, ynq, [?|W], W, ε, ε⟩ → ε

Fig. 6: Inverted sample grammar

a matching LHS are invoked from the inverted grammar. Each such rule will give rise to a new item where the dot marks the first argument position, and the set of these new items will constitute another new state. If a new set of items is constructed that is more specific than an existing one, then this search branch is abandoned and the recursion terminates. If it on the other hand is more general, then it replaces the old one. The state-construction phase starts off by creating an initial set consisting of a single dummy item with a dummy top grammar symbol and a dummy top logical form, corresponding to a dummy inverted grammar rule. In the sample grammar, this would be the rule ⟨S', f(X), W0, W, ε, ε⟩ → ⟨S, X, W0, W, ε, ε⟩. The dot is at the beginning of the rule, selecting the first and only argument. The rest of the states are induced from this one.
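In outline, the induction step can be sketched as follows (the data types are invented simplifications that key everything on LF functors and ignore the word strings): an item pairs an inverted rule with a dot over its argument positions, and a new state is obtained by advancing the dot in every item whose current argument position can hold the selected logical form.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    rule: tuple   # (lhs_functor, rhs_argument_functors)
    dot: int      # index of the current argument position

def induce(state, functor):
    # Advance the dot in every item whose current argument can be `functor`;
    # the resulting items constitute the new state.
    return frozenset(Item(it.rule, it.dot + 1) for it in state
                     if it.dot < len(it.rule[1]) and it.rule[1][it.dot] == functor)

s0 = frozenset({Item(("mod", ("see", "ynq")), 0),
                Item(("mod", ("sleep", "ynq")), 0)})
print(induce(s0, "see"))   # only the see-item survives, with the dot advanced
```

A real compiler would also carry the specificity check (discarding new item sets that are more specific than existing ones) that terminates the recursion.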
The first three states resulting from the inverted grammar of Figure 6 are shown in Figure 7, where the difference lists representing the word strings are omitted.

State 1
⟨S', f(X), ε, ε⟩ ⇒ . ⟨S, X, ε, ε⟩

State 2
⟨S, mod(X,Y), ε, ε⟩ ⇒ . ⟨S, X, ε, ε⟩ ⟨QM, Y, ε, ε⟩
⟨S, mod(Y,Z), ε, ε⟩ ⇒ . ⟨VP, X^Y, [⟨NP, X⟩], ε⟩ ⟨AdvP, Z, ε, ε⟩
⟨S, mod(Y,Z), ε, ε⟩ ⇒ . ⟨VP, X^Y, [⟨NP, X⟩], ε⟩ ⟨PP, Z, ε, ε⟩

State 3
⟨S, mod(X,Y), ε, ε⟩ ⇒ . ⟨S, X, ε, ε⟩ ⟨QM, Y, ε, ε⟩
⟨S, mod(Y,Z), ε, ε⟩ ⇒ . ⟨VP, X^Y, [⟨NP, X⟩], ε⟩ ⟨AdvP, Z, ε, ε⟩
⟨S, mod(Y,Z), ε, ε⟩ ⇒ . ⟨VP, X^Y, [⟨NP, X⟩], ε⟩ ⟨PP, Z, ε, ε⟩
⟨VP, X^mod(Y,Z), [⟨NP, X⟩], ε⟩ ⇒ . ⟨VP, X^Y, [⟨NP, X⟩], ε⟩ ⟨AdvP, Z, ε, ε⟩
⟨VP, X^mod(Y,Z), [⟨NP, X⟩], ε⟩ ⇒ . ⟨VP, X^Y, [⟨NP, X⟩], ε⟩ ⟨PP, Z, ε, ε⟩

Fig. 7: The first three generation states

The sets of items are used to compile the generation tables in the same way as is done for LR parsing. The goto entries correspond to transiting from one argument of a term to the next, and thus advancing the dot one step. The reductions correspond to applying the rules of items that have the dot at the end of the RHS, as is the case when LR parsing. There is no obvious analogy to the shift action — the closest thing would be the descend actions transiting from a functor to one of its arguments. Note that there is no need to include the logical form of each lexicon entry in the generation tables. Instead, a typing of the logical forms can be introduced, and a representative of each type used in the actual tables, rather than the individual logical forms. This decreases the size of the tables drastically. For example, there is no point in distinguishing the states reached by traversing john, mary and paris, apart from ensuring that the correct word is added to the output word-string. This is accomplished much in the same way as preterminals, rather than individual words, figure in LR-parsing tables.
6 The generation algorithm
The generator works by recursive descent through the logical form while transiting between the internal states. It is driven by the descend, goto and reduce tables. A pushdown stack is used to store intermediate constituents. When generating a word string, the current state and logical form determine a transition to a new state, corresponding to the first argument of the logical form, through the descend table. A substring is generated recursively from the argument logical form, and this constituent is pushed onto the stack. The argument logical form, together with the new current state, determines a transition to the next state through the goto table. The next state corresponds to the next argument of the original logical form, and another substring is generated from this argument logical form, etc. When no more arguments remain, an inverted grammar rule is selected nondeterministically by the reduce table and applied to the top portion of the stack, constructing a word string corresponding to the original logical form and completing this generation cycle.4 We now turn to optimising the generation tables.
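As a rough sketch of this control loop (hand-built toy tables with an invented encoding; the nondeterministic choice among reductions is elided to a single rule per state):

```python
def functor(lf):
    # Head symbol of a logical form: 'sleep' for ("sleep", "john").
    return lf[0] if isinstance(lf, tuple) else lf

def generate(lf, state, tables):
    descend, goto, reduce_ = tables
    state = descend[(state, functor(lf))]           # enter the first argument
    stack = []
    for arg in (lf[1:] if isinstance(lf, tuple) else ()):
        stack.append(generate(arg, state, tables))  # substring for this argument
        state = goto[(state, functor(arg))]         # move on to the next argument
    return reduce_[state](stack)                    # apply one inverted rule

# Toy tables for sleep(john); a real system compiles them as in Section 5.
descend = {(1, "sleep"): 2, (2, "john"): 3}
goto = {(2, "john"): 4}
reduce_ = {3: lambda args: "John", 4: lambda args: args[0] + " sleeps"}

print(generate(("sleep", "john"), 1, (descend, goto, reduce_)))  # -> John sleeps
```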
7 Optimising the generation tables
The basic idea underlying the optimisation technique presented in this article is to remove as much nondeterminism from the generation tables as possible. One problem is that it may be impossible to remove all nondeterminism for the simple reason that the current piece of logical form may in fact allow multiple paraphrases. In this case, we say that we have 'real' nondeterminism. On the other hand, it may be the case that although locally, several alternatives are possible, subsequent generation may rule out all but one of them. We will call this 'spurious' nondeterminism. Due to the grammar inversion, and the way the sets of items are constructed, all LHS logical forms of the items in some particular state will be the same, and will thus have equal arity. Thus, there will be nothing analogous to shift-reduce conflicts in the resulting generation tables, only reduce-reduce conflicts. This means that the latter is the sole source of nondeterminism, and that this will arise only in states with more than one possible reduction. By inspecting the number of items left in each 'reductive state', i.e., each state where the dot is at the end of the rules, we can determine whether or not the generation tables will be deterministic.

4 This is a bottom-up rule-invocation scheme. It could easily be modified so that a rule is instead applied before constructing the substrings recursively, resulting in a top-down rule-invocation scheme.

The logical form can be inspected down to an arbitrary depth of recursion when compiling the sets of items, and this parameter can be varied. This is closely related to the use of lookahead symbols in an LR parser; increasing the depth is analogous to increasing the number of lookahead symbols. The amount of semantic lookahead will be reflected in the goto and descend table entries. No semantic lookahead would mean only taking the functor of the logical form into consideration, and in the example above, a typical action table entry would be descend(1, mod(_,_), 2).5 This would mean that the generator would operate on State 2 of Figure 7 when generating from the first argument of the mod/2 term, and both the S alternative and the (merged) VP alternative(s) would be attempted nondeterministically. By taking the arguments of the logical form into account, the degree of nondeterminism can be reduced, and for the grammar given in Figure 1, it is eliminated completely. In the example, if the second argument of the mod/2 term is ynq, then only the S alternative will be considered when generating from the first argument, since the relevant descend entries and states will be those of Figure 8. The optimal depth may vary for each individual table entry, and even within it, and a scheme has been devised to automatically find such an optimum.

descend(1, mod(mod(_,_),ynq), 2A).
descend(1, mod(see(_,_),ynq), 2B).
descend(1, mod(sleep(_),ynq), 2C).

State 2A
⟨S, mod(mod(X,Y),ynq), ε, ε⟩ ⇒ . ⟨S, mod(X,Y), ε, ε⟩ ⟨QM, ynq, ε, ε⟩

State 2B
⟨S, mod(see(X,Y),ynq), ε, ε⟩ ⇒ . ⟨S, see(X,Y), ε, ε⟩ ⟨QM, ynq, ε, ε⟩

State 2C
⟨S, mod(sleep(X),ynq), ε, ε⟩ ⇒ . ⟨S, sleep(X), ε, ε⟩ ⟨QM, ynq, ε, ε⟩

Fig. 8: Alternative generation states
5 Here "_" denotes a don't-care variable.
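How such lookahead-keyed entries discriminate between states can be mimicked with a small matcher (the nested-tuple term encoding is invented for illustration): each descend entry is keyed by a pattern in which "_" is the don't-care, and deeper patterns select more specific target states.

```python
def matches(pat, term):
    if pat == "_":                                   # don't-care variable
        return True
    if isinstance(pat, str) or isinstance(term, str):
        return pat == term                           # atoms must coincide
    return len(pat) == len(term) and all(
        matches(p, t) for p, t in zip(pat, term))

DESCEND = [  # the descend entries of Figure 8, as (pattern, state) pairs
    (("mod", ("mod", "_", "_"), "ynq"), "2A"),
    (("mod", ("see", "_", "_"), "ynq"), "2B"),
    (("mod", ("sleep", "_"), "ynq"), "2C"),
]

def descend_to(lf):
    # First entry whose lookahead pattern matches the logical form.
    return next(state for pat, state in DESCEND if matches(pat, lf))

print(descend_to(("mod", ("sleep", "john"), "ynq")))   # -> 2C
```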
Assuming that it is actually possible to construct fully deterministic generation tables by filtering on a large enough amount of semantic lookahead, the problem reduces to finding, for each table entry, a lookahead depth that will result in only one single remaining item in each reductive state. This is in fact a stronger requirement than that all nondeterminism be spurious: It may be the case that for each possible logical form, it is possible to determine the appropriate reduction by a sufficient amount of semantic lookahead, but due to potentially infinite recursion, no preassigned limit on it will do. This is elaborated in the following section. The scheme employs iterative deepening. It tries to construct fully deterministic tables by first allowing a total amount of semantic lookahead of one, then of two, etc., up to some maximum limit. This is however not done globally, but at each recursive call to the sets-of-items construction step, when a piece of logical form and a set of grammar symbols are used to invoke new inverted grammar rules to construct new sets of items. At this point, the total amount of available lookahead is distributed through the arguments of the functor of the current piece of logical form, and then further down to the arguments of the arguments, etc., until all has been used up. The current sets of items are then tentatively constructed. Increased semantic-lookahead depth will split potential nondeterminism in the resulting reductive states into distinct sets of items, and thus into distinct reductive states with less nondeterminism, or preferably, with no nondeterminism at all. If the resulting reductive states are all deterministic, then this particular semantic-lookahead setting is used to compile the actual generation tables, and the scheme recurses. In more detail, a set of terms mirroring the various ways of assigning semantic lookahead are generated and ordered according to how much lookahead they use up.
The first one to yield fully deterministic reductive states is used when constructing the actual tables and is passed down in the recursion.
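A simplified sketch of such lookahead terms (uniform-depth pruning here, whereas the scheme distributes the budget non-uniformly over argument positions; the encoding is invented): a term is pruned to a pattern, and the amount of lookahead a pattern uses up is the number of concrete nodes it retains.

```python
def prune(term, depth):
    # Keep `term` down to `depth` levels; everything deeper becomes the
    # don't-care '_', i.e., is not inspected.
    if depth == 0:
        return "_"
    if isinstance(term, tuple):
        return (term[0],) + tuple(prune(a, depth - 1) for a in term[1:])
    return term

def lookahead_used(pat):
    # Semantic lookahead a pattern uses up: the number of concrete nodes.
    if pat == "_":
        return 0
    if isinstance(pat, tuple):
        return 1 + sum(lookahead_used(a) for a in pat[1:])
    return 1

lf = ("mod", ("sleep", "john"), "ynq")
print(prune(lf, 1))   # -> ('mod', '_', '_')
print(prune(lf, 2))   # -> ('mod', ('sleep', '_'), 'ynq')
```

Iterative deepening then amounts to trying the candidate patterns in order of increasing `lookahead_used` until the reductive states come out deterministic.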
descend(1, mod(_,ynq), 2).

State 2
⟨S, mod(X,ynq), ε, ε⟩ ⇒ . ⟨S, X, ε, ε⟩ ⟨QM, ynq, ε, ε⟩

Fig. 9: Alternative generation states
In the running example, the first argument of mod/2 contributes no important information when descending from State 1, while the second one does.
The scheme correctly finds the optimal depths when transiting from State 1, resulting in the State 2 and descend entry of Figure 9. Since the scheme employs iterative deepening, this will guarantee that locally, no alternative table entries can inspect a smaller portion of the logical forms and still be deterministic, given the previous choices of semantic lookahead. This is a greedy algorithm, and it could potentially be the case that another choice of semantic lookahead would lead to less required lookahead in total by reducing that of the table entries generated in later recursion steps.
8 An example-based optimisation technique
The optimisation scheme as described so far is limited to grammars without real nondeterminism that only have removable spurious nondeterminism. A simple way of extending this to more general grammars is to introduce a second, outer level of iterative deepening controlling the amount of nondeterminism tolerated in each recursive call to the sets-of-items construction step. First, we try to construct generation tables with only one reduction in each reductive state. If this proves impossible within the maximum amount of total semantic lookahead allowed, we try to construct tables with at most two reductions in each resulting reductive state, etc. Since there is a finite number of inverted grammar rules, and thus a finite number of possible items, this process will terminate. Again, this optimisation is done locally at each recursive call to the sets-of-items construction step. A problem with this approach is that the number of possible ways of assigning semantic lookahead increases drastically with the amount of lookahead allowed, and some heuristics are needed to direct the search. We will shortly describe a method that constructs more fine-tuned generation tables by using training examples to guide the search; to determine how much real nondeterminism there is at each point that cannot be removed; and to find appropriate lookahead depths that will remove all spurious nondeterminism on the training corpus. First, we will however examine spurious nondeterminism a bit closer. Assume that we add the following grammar rules for handling NPs with internal structure:6

6 Again, the difference lists representing the word strings have been omitted.
⟨NP, q(X,Y)⟩ → ⟨Det, X⟩ ⟨N̄, Y⟩
⟨N̄, X⟩ → ⟨N, X⟩
⟨N̄, mod(X,Y)⟩ → ⟨N̄, Y⟩ ⟨N̄, X⟩
⟨N̄, mod(X,Y)⟩ → ⟨AP, Y⟩ ⟨N̄, X⟩
⟨AP, mod(X,Y)⟩ → ⟨N, Y⟩ ⟨AP, X⟩
⟨AP, X⟩ → ⟨A, X⟩
This will allow derivations like that of Figure 10. Here APoNB reads "Adjective phrase or N-bar". This in turn will allow constructing logical forms like
mod(N0, mod(mod(... mod(AoN, Nn) ..., N2), N1)). To determine which of the rules ⟨N̄, mod(X,Y)⟩ → ⟨N̄, Y⟩ ⟨N̄, X⟩ and ⟨N̄, mod(X,Y)⟩ → ⟨AP, Y⟩ ⟨N̄, X⟩ to apply, we must inspect the first argument AoN — adjective or noun — of the innermost mod/2 term, which may be arbitrarily deeply nested. Although this will never introduce multiple paraphrases, it does allow spurious nondeterminism that cannot be handled by a bounded amount of semantic lookahead. A highly respectable objection to the presented example is that, apart from the proposed treatment of noun-noun and noun-adjective compounds being linguistically somewhat dubious, we will in practice never see cases where we need a very large amount of semantic lookahead. Precisely this is one of the two cornerstones on which the example-based optimisation technique presented in this section rests. The other one is the observation that a lower bound on the amount of real nondeterminism can easily be established for each (portion of a) training example, while it is in the general case difficult to do this directly from the grammar.
Thus, the training examples are used for three purposes: Firstly, to limit the search to search branches that are relevant for input data that actually occur in real life. Secondly, to establish the minimum amount of nondeterminism at each point, i.e., the amount of real nondeterminism at this point that cannot be removed by greater lookahead depth. Thirdly, to find appropriate lookahead depths that will remove all spurious nondeterminism at each point in the training example. The generation tables are constructed much in the same way as in the previous section. The main difference is that instead of aiming at full determinism, the target nondeterminism is the real nondeterminism at each point of each training example. In more detail, a set of terms mirroring the various ways of assigning semantic lookahead are generated from the set of training examples, and they are ordered according to how much lookahead they employ. Intuitively, a (sub)term is constructed from each training example by replacing parts of it with free variables, thus removing the information contained in these parts of the training example, and the subterms are merged to form one term. Thus, terms employing more lookahead will contain more detailed information from the set of training examples. The first term to yield as deterministic reductive states as the one corresponding to the set of whole training examples, where no information has been blocked out by variables, is used for constructing the actual tables and is passed down in the recursion. A technical complication is that the training examples interact with the termination criteria of the sets-of-items construction step: Although a new set of items may be more specific than an old one, it may stem from more demanding training examples. In the current version of the scheme, this would result in recompiling the sets of items from the earliest point where too simple examples were used, this time including the more demanding examples.
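The blocking-out-and-merging step resembles anti-unification; a minimal sketch (nested-tuple encoding and all example terms invented): keep whatever structure two training examples share, and replace the rest with a don't-care variable.

```python
def merge(t1, t2):
    # Anti-unification sketch: shared structure survives, differences are
    # blocked out with the don't-care '_'.
    if isinstance(t1, tuple) and isinstance(t2, tuple) \
            and t1[0] == t2[0] and len(t1) == len(t2):
        return (t1[0],) + tuple(merge(a, b) for a, b in zip(t1[1:], t2[1:]))
    return t1 if t1 == t2 else "_"

ex1 = ("mod", ("sleep", "john"), "ynq")    # e.g. "John sleeps?"
ex2 = ("mod", ("sleep", "mary"), "ynq")    # e.g. "Mary sleeps?"
print(merge(ex1, ex2))   # -> ('mod', ('sleep', '_'), 'ynq')
```

The merged term retains exactly the lookahead that the two examples jointly justify, which is what lets a handful of examples generalise to whole families of inputs.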
To handle input outside the training corpus, a default lookahead depth is assigned to the possible continuations that are not encountered among the training examples. This means that the resulting generation tables preserve completeness and are guaranteed to be optimal, modulo the limitations of greedy algorithms, for input sufficiently similar to combinations of examples in the training corpus, but not necessarily for other input. The degree of generalisation is considerable: To return to the running example of the nondeterminism in State 2 discussed above, a single training example like (a logical form corresponding to) John sleeps? or Mary sees a house in Paris will remove all nondeterminism in this state. In general, the table size seems to increase moderately with the number of training
examples due to the good degree of generalisation, although this needs to be more thoroughly investigated. The modified algorithm for including the training examples into the LR-compilation algorithm is guaranteed to terminate if the original LR-compilation algorithm terminates. The worst-case complexity is however not very good. However, for the grammars and training sets tested so far, processing efficiency is not a problem, though we can envision that for considerably larger grammars and training sets, there will be a need for optimising the optimisation procedure further.

9 Discussion
The new generation algorithm constitutes an improvement on the semantic-head-driven generation algorithm that allows 'functor merging', i.e., enables processing various grammar rules, or rule combinations, that introduce the same semantic structure simultaneously, thereby greatly reducing the search space. The algorithm proceeds by recursive descent through the logical form, and using the terminology of the SHDG algorithm, what the new algorithm in effect does is to process all chains from a particular set of grammar symbols down to some particular piece of logical form in parallel until a reduction is attempted, rather than to construct and try each one separately in turn. This requires a grammar-inversion technique that is fundamentally different from techniques such as the essential-argument algorithm, see the following, since it must display the grammar from the point of view of the logical form, rather than from that of the word string. LR-compilation techniques accomplish the functor merging by compiling the inverted grammar into a set of generation tables. The grammar inversion rearranges the grammar as a whole according to the functor-argument structure of the logical forms. Other inversion schemes, such as the essential-argument algorithm (Strzalkowski 1990) or the direct-inversion approach (Minnen et al. 1995), are mainly concerned with locally rearranging the order of the RHS constituents of individual grammar rules by examining the flow of information through these constituents, to ensure termination and increase efficiency. Although this can occasionally change the set of RHS symbols in a rule, it is done to these ends, rather than to reflect the functor-argument structure. Although the sample grammar used throughout the article is essentially context-free, there is nothing in principle that restricts the method to such grammars. In fact, the method could be extended to grammars employing
complex feature structures as easily as the LR-parsing scheme itself, see for example (Nakazawa 1991), and this is currently being done. Some hand editing is necessary when preparing the grammar for the inversion step, but it is limited to specifying the flow of arguments in the grammar rules. Furthermore, this could potentially be fully automated. The set of applicable reductions can be diminished by resorting to deeper semantic lookahead, at the price of a larger number of internal states, and there is in general a tradeoff between the size of the resulting generation tables and the amount of nondeterminism when reducing. The employed amount of semantic lookahead can be varied, and a scheme has been devised and tested that automatically determines appropriate tradeoff points, optionally based on a collection of training examples. The latter version of the scheme turns out to be related to explanation-based learning (EBL), which has proved quite successful for optimising LR-parsing tables for syntactic analysis. There, the basic idea is to learn special grammar rules from the original ones and a set of training examples by chunking together the former based on how they are used to parse the latter. The relevant references are (Samuelsson & Rayner 1991), (Samuelsson 1994a) and (Neumann 1994). Rayner and Samuelsson basically trade coverage for speed and accuracy by using the training examples to compile a new grammar that is used instead of the original one. Their problem is that the underlying NL systems that they work on employ find-all parsing strategies and subsequent selection of the preferred analysis. This makes it very difficult to integrate the learned grammar with the original one without losing all processing speed gained. Neumann strives for a very close integration between the learned and original grammars by falling back to the original grammar when processing with the learned grammar alone has proved insufficient.
He utilises the fact that his original system employs a best-first parsing strategy, which allows intelligent reuse of partial results from the attempt to parse with the learned grammar. Another problem that has not previously been satisfactorily resolved is how to determine the degree of generalisation of the examples, or viewed from another point of view, how to chunk together the original grammar rules. Rayner and Neumann hand-code special meta-rules, so-called operationality criteria, for this based on linguistic intuition. These criteria are then refined manually by experimentation. Samuelsson offers an automatic method for doing this that relates the desired coverage to the way the examples are generalised (Samuelsson 1994b). This quantity is however
only indirectly related to the actual performance of the system using the resulting learned grammar. In contrast to this, the method described in the current article automatically preserves completeness; achieves fully seamless integration, since there is only one processing mode; and automatically determines the degree of generalisation by minimising a quantity that has a profound direct influence on the resulting performance, namely the amount of nondeterminism in each reductive state. It would be very interesting to see if this idea could be carried over to syntactic parsing by manipulating the number of lookahead symbols to minimise the number of shift-reduce and reduce-reduce conflicts in the resulting LR parsing tables. The method has been implemented and applied to more complex grammars than the simple one used as an example in this article, and it works excellently. Although these grammars are still too naive to form the basis of a serious empirical evaluation lending substantial experimental support to the method as a whole, it should be obvious from the algorithm itself that the reduction in search space compared to the SHDG algorithm is most substantial. Nonetheless, such an evaluation is a top-priority item on the future-work agenda.

REFERENCES

Aho, Alfred V., Ravi Sethi & Jeffrey D. Ullman. 1986. Compilers: Principles, Techniques and Tools. Reading, Massachusetts: Addison-Wesley.
Alshawi, Hiyan, ed. 1992. The Core Language Engine. Cambridge, Massachusetts: MIT Press.
Bouma, Gosse & Gertjan van Noord. 1994. "Constraint-based Categorial Grammars". Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 147-154.
Gazdar, Gerald, Ewan Klein, Geoffrey Pullum & Ivan Sag. 1985. Generalized Phrase Structure Grammar. Cambridge, Massachusetts: Harvard University Press.
Kaplan, Ronald M. & Joan Bresnan. 1982. "Lexical-Functional Grammar: A Formal System for Grammar Representation". The Mental Representation of Grammatical Relations ed. by Joan Bresnan, 173-281. Cambridge, Massachusetts: MIT Press.
Minnen, Guido, Dale Gerdemann & Erhard Hinrichs. 1996. "Direct Automated Inversion of Logic Grammars". To appear in New Generation Computing 14:2.
Nakazawa, Tsuneko. 1991. "An Extended LR Parsing Algorithm for Grammars Using Feature-based Syntactic Categories". Proceedings of the 5th Conference of the European Chapter of the Association for Computational Linguistics, 69-74.
Neumann, Günter. 1994. "Application of Explanation-Based Learning for Efficient Processing of Constraint-based Grammars". Proceedings of the 10th IEEE Conference on Artificial Intelligence for Applications, 208-215. San Antonio, Texas.
Samuelsson, Christer. 1994a. Fast Natural-Language Parsing Using Explanation-Based Learning. Ph.D. dissertation, Royal Institute of Technology. Edsbruk, Sweden: Akademitryck.
Samuelsson, Christer. 1994b. "Grammar Specialisation through Entropy Thresholds". Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 188-195.
Samuelsson, Christer. 1994c. "Notes on LR Parser Design". Proceedings of the 15th International Conference on Computational Linguistics, 386-390. ICCL.
Samuelsson, Christer. 1995. "An Efficient Algorithm for Surface Generation". Proceedings of the 14th International Joint Conference on Artificial Intelligence, 1414-1419. Morgan Kaufmann.
Samuelsson, Christer & Manny Rayner. 1991. "Quantitative Evaluation of Explanation-Based Learning as an Optimisation Tool for a Large-Scale Natural Language System". Proceedings of the 12th International Joint Conference on Artificial Intelligence (IJCAI'91), 609-615. Morgan Kaufmann.
Shieber, Stuart M., Gertjan van Noord, Fernando C. N. Pereira & Robert C. Moore. 1990. "Semantic-Head-Driven Generation". Computational Linguistics 16:1. 30-42.
Strzalkowski, Tomek. 1990. "How to Invert a Natural Language Parser into an Efficient Generator: An Algorithm for Logic Grammars". Proceedings of the 13th International Conference on Computational Linguistics (COLING-90), 347-352. ICCL.
Uszkoreit, Hans, Rolf Backofen, Stephan Busemann, Abdel Kader Diagne, Elizabeth A. Hinkelman, Walter Kasper, Bernd Kiefer, Hans-Ulrich Krieger, Klaus Netter, Günter Neumann, Stephan Oepen & Stephen P. Spackman. 1994. "DISCO — an HPSG-based NLP System and its Application for Appointment Scheduling". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 436-440. Kyoto, Japan.
Sentence Generation by Pattern Matching: The Problem of Syntactic Choice

MICHAEL ZOCK
LIMSI, CNRS
Abstract

This paper tries to account for verbal fluency, that is, the speed with which people compute syntactic structures. As we all know, people produce speech fluently without making too many mistakes. Given the known time constraints, this is a remarkable performance. How is this possible? Verbal fluency, we believe, can be accounted for by the following two facts. First, people essentially use pattern matching and mapping rules as strategy and knowledge source. Rather than being confined to local strategies (strict incremental processing on a concept-to-concept basis) and formal grammars, they operate on larger chunks (global strategy) by using mapping rules. This is more economical, without necessarily being more error prone. Second, proficient speakers have learnt to recognise potential linguistic structures on the basis of the formal characteristics of the conceptual structures; that is, proficient speakers are able to make good guesses concerning the syntactic structures that best express the conceptual input.1
1 Introduction: The speaker's problem
Text or discourse production basically consists in determining, organising and translating content in order to achieve specific communicative goals. We shall be concerned here only with the last component, the translation of a conceptual structure (message) into its corresponding linguistic form. Looking at this problem from a psycholinguistic point of view we will try to
1 This paper is a slightly revised version of a paper presented in 1988 at the 1st International Workshop on "Cognitive Linguistics", held in Bulgaria. It was meant to appear three years later in a book entitled "Explorations in Cognitive Linguistics". Unfortunately, though announced, this book never saw the market. While our views have evolved in the meantime (this was then only preliminary work), we believe that our basic premises concerning the process and the speaker's knowledge still hold: (i) natural language generation is basically pattern matching; (ii) the speaker's expertise resides in knowing a set of patterns and a set of mapping rules for converting an input (message) into an output (linguistic form).
provide evidence for two claims, one dealing with knowledge, the other one dealing with the process. Claim 1: the speaker's expertise resides in knowing a set of patterns (conceptual, syntactic) and a set of rules (mapping rules) for converting a given structure (deep structure) into its corresponding form (surface structure).2 Claim 2: the process of conversion is basically pattern-driven (pattern matching). People typically work on chunks (e.g., noun groups, propositions) rather than on atomic units (single words, concepts). If one accepts the idea that messages are coded in terms of semantic networks or conceptual graphs (Sowa 1984), one will understand the text producer's problem. Having generated a message (what to say), s/he is left with a set of nodes and arcs (concepts and relationships) for which s/he must find words and adequate sentence patterns (how to say it).3 Surprisingly enough, this seems to be no real problem for most speakers, even in completely new settings (spontaneous discourse). Skilled speakers seem to have plenty of time, and they do not make many mistakes. One may well ask how they succeed in performing such a complex task given the known space and time constraints: human short-term memory is limited (Miller 1956), and speech is very fast (3-5 words per second). The secret of skilled speakers, we believe, is that, rather than operating on small isolated units such as words or concepts (local strategy), they operate on larger chunks, that is, conceptual configurations. Put another way, skilled speakers do not proceed strictly word-by-word (concept-by-concept), rather they operate on larger conceptual patterns.4 Having gained a large amount of experience in a given language, they recognise typical structures (patterns), i.e., they know straight away what conceptual structures match what linguistic forms.5

Fig. 1: Mapping rules, the missing link between conceptual structure and linguistic structure

Actually, the idea of patterns or schemata is not new. It has a long tradition in philosophy (Kant 1781), in sociology (Goffmann 1974), in structural and in text linguistics (Harris 1951; Fries 1952; Roberts 1962; van Dijk 1977), in psychology (Bartlett 1932; Koffka 1935; Piaget 1970; Bruner 1973; Rumelhart 1975; Mandler 1979; Ausubel 1980) and in artificial intelligence (Minsky 1975; Wilks 1975; Schank & Abelson 1977).7 Nevertheless, despite its long-standing tradition, schema approaches and formal grammars have two major shortcomings: with the exception of Mel'cuk's work, neither accounts for the correspondences (mappings) between the different structures or levels. Hence, they do not make explicit on what structural cues, i.e., formal characteristics of the message (conceptual structure), the speaker's decisions are based when s/he chooses a specific linguistic form (syntactic structure). Yet, structures are of little help if one does not know what to do with them, that is, what they stand for.8 Another problem with the schema approach lies in the fact that schemata are hard to constrain. Hence, lack of refinement, or lack of proper constraints, may result in ambiguity (analysis, i.e., parsing) or overgeneration (production). If the user is not told where the limits of the schemata lie (explicitation of the schema constraints), s/he will use these patterns even in cases where they do not apply. If one agrees with our point of view that natural language processing is basically schema-driven,9 then the question arises of how people manage to recognise linguistic structures on the basis of conceptual structures. This is what this paper is about.
2 The fact that people use patterns is not in contradiction with the notion of a formal grammar. The latter is actually a device to generate them.
3 While the arcs are conceptual or syntactic relations (agent, subject, etc.), the nodes of the graph can be words or concepts of various levels of abstraction (animal vs. dog vs. four-legged carnivorous wild or domesticated animal). The representation and function of abstract concepts and words is thus very much alike: both are very economical means, a kind of shorthand notation, for larger conceptual chunks. The fact that graphs allow for hybrid knowledge representation, and the fact that they can be manipulated easily (contraction/expansion), makes them excellent tools at the interface level (i.e., for a potential user in the case of applications) and for modelling the cognitive process: expansions of an abstract, underspecified message graph (conceptual level); contraction of this conceptual graph to a lexically specified, but syntactically unspecified graph (lexical level); visualisation of syntactic reflexes resulting from choices made at a higher level (pragmatic, conceptual, linguistic). The syntactic consequences may show up in changes of the names of the links (an agent becoming a subject, a beneficiary becoming an indirect object, etc.) and in the addition of morphosyntactically relevant information (type of auxiliary, type of preposition, etc.). For an example, see Zock (1994).
4 The fact that word-by-word processing may lead into dead ends has been shown by Zock et al. (1986). Actually, clitics in French nicely illustrate the need for lookahead or preplanning. Suppose you were to pronominalize y and z of the following proposition give(x,y,z). In this case it is not possible to determine their relative position unless one knows the roles (person) of both objects. If you compare (a-c) you'll notice that the position of the direct object ("it", that is "le" in French) depends on the person of the indirect object (3rd person or not). Put differently, the positions of the two objects are interdependent, that is, their respective positions cannot be determined unless the value of the attribute PERSON of both objects is known.
(a) il me LE donne (he gives it to ME)
(b) il LE lui donne (he gives it to HER)
(c) il te LE donne (he gives it to YOU)
5 Learning a language is thus learning a set of variably abstract patterns, a set of mapping rules and their respective conditions of use. Sentence generation can go either way, from abstract to specific patterns (refinement), or from specific to more general patterns (generalisation), lower level patterns becoming integrated into higher level patterns: (det + adj + noun) => NP; (verb + noun) => VP; (adv + verb) => AdvP; (NP + VP) => Sentence.
6 The conceptual grammar controls the assembly, i.e., legal combinations of possible contents, that is, it specifies what is meaningful in a given culture, whereas the linguistic grammar specifies the possible forms.
7 While Bartlett, Schank/Abelson, van Dijk and Rumelhart identified patterns on the text or discourse level (schemata, scripts, macro-structures, story grammars), Harris and Fries dealt with sentence patterns. The idea of linguistic patterns has also been extensively used in the classroom, where pattern drills have been a major teaching strategy, especially during the sixties, when behaviorism was at its peak (Lado 1964; Rivers 1972). Things changed radically after Chomsky's devastating critique of Skinner's book "Verbal Behavior" (Skinner 1957; Chomsky 1959).
8 One can't but agree with Bock et al. when they write: "In existing models of language production the first mapping from messages to linguistic relations involves linking nonlinguistic cognitive categories to linguistic categories. However the categories themselves are variably specified, because there is little consensus of what the appropriate ones might be." (Bock et al. 1992:151)
9 While early natural language systems like SIR (Raphael 1968), STUDENT (Bobrow 1968), ELIZA (Weizenbaum 1966) and SHRDLU (Winograd 1972) relied heavily on low-level schemata (syntactic patterns), more recent systems use high-level schemata, i.e., text patterns (McKeown 1985; Rösner 1987). For a criticism of the latter see Hovy (1990). See also Patten et al.'s use of the notion of knowledge compilation, which is somehow akin to our notion of pattern matching (Patten et al. 1992).
2 What kind of evidence can we provide in favour of pattern matching?
There are several good reasons for accepting such an approach, both structural and procedural.

Structural evidence: Human experience and social interactions are structured, regular, hence predictable to some extent. This regularity, of course, is reflected in language. There are definite limits with regard to linguistic creativity: new, original thoughts still have to be cast in old patterns. Languages are schematic to a great extent, that is, every language has a fairly large set of patterns in order to express concepts, relations, events, etc. For example,

  <X> is a <Y> that <Z>.            A computer is a machine that processes information.
  <X> is a sort of <Y>.             A bicycle is a sort of vehicle.

Table 1: Schemata for a definition

  <X> is somehow like <Y>.          A cat is somehow like a tiger.
  <W> is to <X> as <Y> is to <Z>.   Good is to light as evil is to darkness.

Table 2: Schemata for comparison

This is true not only on the higher levels (paragraph, text level), where stories, news, weather forecasts, sport reports, etc. are clearly schematic, but also on the lower levels (phoneme, word, sentence level).

  Actions, events, states, processes              verbs          build, happen, be, sleep
  Entities, names, places                         nouns          car, Paul, Tokyo
  Properties, attributes of entities              adjectives     young, bright
  Manner, attributes of actions                   adverbs        slowly
  Intensifier, location, time                     adverbs        very, here, tomorrow
  Means                                           prepositions   by, with
  Spatial relations: path, position, direction    prepositions   from, in, on, towards

Table 3: Mapping of ontological categories on syntactic categories

When translating a message into discourse, the speaker maps a conceptual structure (deep structure) onto a linguistic form (surface structure). Thus, concepts are mapped on words, each of which has a specific categorial
potential, i.e., part of speech (Table 3),10 deep-case relations are mapped on grammatical functions (Table 4), and conceptual configurations, i.e., larger conceptual structures, are mapped on syntactic structures (Table 5), etc.

  agent, cause               subject
  object, patient            direct object
  beneficiary, recipient     indirect object

Table 4: Mapping of case relations on grammatical functions

  1. [PERSON:#]<-agnt-[PERFORM]-obj->[MUSIC:*]11
     D + N + V + D + N                    The girl plays a song.
  2. [MUSIC:*]<-attr-[QLTY: GOOD]
     D + Adj + N                          a nice song
     D + N + RelPr + Copula + Adj         a song that is nice
  3. [PERSON:#]<-agnt-[PERFORM]-obj->[MUSIC:*]<-attr-[QLTY: GOOD]
     D + N + V + D + Adj + N              The girl plays a nice song.

Table 5: Mapping of conceptual structures onto syntactic structures

This mapping can be done in various ways, directly or indirectly, that is, via various intermediate structures.12 The units on which these processes operate may be single concepts (Table 3), relations (Tables 3 and 4), or larger chunks (Table 5). Subordinate patterns may be integrated into superordinate patterns, etc.13 The following example reveals several interesting formal characteristics concerning patterns:
10 Hence, we assume that there is no one-to-one correspondence between concepts and words, or between words and parts of speech. For example, a given concept (LOVE) may well map on several words (love, like, be fond of, ...), each of which may be realized by several syntactic categories (noun, verb, adjective).
11 We use the "#" and "*" signs to signal the communicative status of the referent (determinate vs. indeterminate). The "#" sign signals the fact that the entity referred to is known, hence requires a definite article. In a similar vein "*" signals indeterminacy.
12 Swartout's (1983) and Mel'cuk's work (Mel'cuk & Zholkovskij 1970) lie at the extremes. The former used no intermediate structure at all, whereas the latter used no less than seven levels to get from a meaning representation to its surface form. ATN approaches (Simmons & Slocum 1972) and semantic grammars (Burton 1976; Hendrix 1977) lie somewhere in between.
13 If you look at Table 5, you will notice that conceptualisation 2, [MUSIC:*]<-attr-[QLTY: GOOD], is integrated into message 1, yielding message 3.
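The default mappings of Tables 3-5 lend themselves to a direct computational reading. The sketch below is ours, not part of any system described in this paper, and all names are illustrative: a conceptual graph is represented as relation triples, Table 4's case-to-function mapping is applied link by link, and Table 5's configuration-to-template mapping is applied to the graph as a whole.

```python
# Conceptual graph in the notation of Table 5, pattern 1:
# [PERSON:#]<-agnt-[PERFORM]-obj->[MUSIC:*]
graph = {("PERFORM", "agnt", "PERSON:#"), ("PERFORM", "obj", "MUSIC:*")}

# Table 4: deep-case relation -> grammatical function
CASE_MAP = {"agnt": "subject", "obj": "direct object", "benf": "indirect object"}

def to_functions(graph):
    """Replace each deep-case link by its default grammatical function."""
    return {CASE_MAP[rel]: node for (_, rel, node) in graph if rel in CASE_MAP}

def match_pattern(graph):
    """If the graph shows the agent-action-object configuration of
    Table 5 (pattern 1), return its default template D + N + V + D + N."""
    relations = {rel for (_, rel, _) in graph}
    if {"agnt", "obj"} <= relations:
        return ["D", "N", "V", "D", "N"]
    return None  # no known configuration matches

print(to_functions(graph))
print(match_pattern(graph))  # ['D', 'N', 'V', 'D', 'N']
```

The point of the sketch is only that the mapping operates on a whole configuration, a chunk of triples, rather than on one concept at a time.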
<TITLE> <PERSON> has (kindly) brought to <DET> attention the (following) <FACT/PROBLEM>.
Prof. Joshi has kindly brought to my attention the following fact.

• Patterns have a fixed and a variable part. The fixed parts are highlighted and written in small letters, while variable parts are written in capital letters and in angle brackets or normal parentheses.
• The variables may be optional or obligatory parts of the pattern. (PERSON), (FACT), (DET) are all obligatory elements, while (TITLE) is, from a linguistic point of view, optional. If we generalised "kindly" into some variable (MANNER), which we would do if we found out that other attributes, or synonyms, can occur in this position, then we would have an optional variable.
• The variables may be of different kinds: semantic or syntactic ((PERSON), (TITLE) vs. (DET)).
• Patterns have optional parts. This is illustrated by the adverb "kindly" here above, which appears in parentheses.
• Certain elements of the patterns may need morphological adjustment: subject-verb agreement, determiner (my vs. her), etc.
• Patterns may contain information of different kinds (conceptual, lexical, syntactic). Put differently, patterns can be hybrid.
• Patterns can be semantically equivalent, that is, they can be synonyms ("he brought to my attention" vs. "he drew to my attention").
• Patterns can be embedded into each other (hierarchy of patterns). A grammar could be seen as a list of patterns composed only of variables, and a hybrid approach like ours is only a cut through this grammar at different levels of abstraction.

All the above-mentioned patterns are productive in the sense that they allow for the creation of a wide range of linguistic forms. There may even be direct connections between situations and conceptual structures on the one hand, and between these conceptual structures and their linguistic counterparts on the other. It should be noted, however, that the relationship between conceptual structures and linguistic structures is not one-to-one (see below). Hence, neither syntactic categories (e.g., part of speech), nor syntactic structures are determined by conceptual structures, that is, the former cannot be predicted solely on the basis of the latter. Nevertheless, there is a strong tendency to translate a given conceptual element or structure by a specific syntactic form, that is, there are default mappings (see Tables 3 and 5). For empirical evidence of how conceptual structures induce syntactic structures see Brown (1958).
As we have said, there are other reasons that plead in favour of schema-driven processing, namely, the speed and economy of learning and processing.

Procedural arguments: Speech is fast, yet it is slow compared to thought. As we all know, thoughts tend to get lost if not expressed in time. Word-by-word processing is thus not a good candidate for processing the data: it is not only slow, but also error prone. If words and syntactic structures were computed strictly incrementally, that is, on a word-to-word, or concept-to-concept basis (local planning), we would never get the job done in time, and we would easily get stuck in dead ends, that is, talk ourselves into a corner. The order in which the concepts come into our minds, and the order in which they have to be expressed in the surface string, are not necessarily the same. Global planning (look-ahead or global view) is thus necessary in order to speed up the process and to cut down on backtracking. It should be noted, though, that while pattern matching seems to be a good strategy at the initial stages of generation, it is not obvious at all that it can be used safely throughout the process. It generally provides only a first sketch (outline) which needs to be checked against the syntactic requirements, i.e., constraints of the lexical material actually used (subcategorisation features). It seems that an approach whereby pattern matching is only seen as a first step allowing for refinements or changes contains all the basic ingredients to go from canned text to full-blown sentence generation. Another argument for pattern matching stems from the observation that people manage to communicate even very complex thoughts by using very few patterns. This is even more so if they speak in a foreign language. Many students use pattern matching as a basic strategy when producing language: their messages are structured the same way, that is, they are cast in the same kind of sentence or discourse pattern.
Actually, verbal skill can be measured in terms of the number of patterns a speaker is able to use adequately in a given communicative situation, and by his aptitude to vary and to make local adjustments to them. From the above it should be clear that spontaneous discourse is only possible if the speaker is operating on larger units (chunks) than words or single concepts.14 If this point of view is accepted, together with the point
14 For an enlightening discussion concerning the use of larger units than words (most notably, idiomatic and fixed expressions) see Becker (1975). This view, just as ours, contrasts with the notion of strict incremental processing (see, for example, Kempen & Hoenkamp 1987; de Smedt 1990).
that proficient speakers are, above all, good pattern matchers, then the question arises how people manage to recognise these structures. In other words, is there a way to identify a good candidate for a relative clause, a that-clause, an infinitive, etc. on purely conceptual grounds, that is, on the basis of the formal characteristics of the underlying conceptual input? Before answering this question, we would like to show why this problem cannot be adequately addressed within the framework of structure-oriented linguistics.
3 Why cognitive linguistics, or, why study natural language in the realm of cognitive science?
Natural languages are both products and processes. Understanding the way they function thus requires the study of both. In other words, language, or language use, cannot be adequately accounted for solely by studying the outputs (sentence structure). In contrast to structure-oriented linguistics, which describes only the physical products (sentences), cognitive linguistics tries to account for the processes, i.e., the operations necessary to transform an input (for example, a visual scene) into an output (text: description of the scene). While the former are concerned with products, the latter are interested in the processes operating on data. Hence the following questions are relevant for cognitive linguistics:
• What are the different knowledge sources (pragmatic, conceptual, linguistic)?
• What are the input-output data?
• What kind of operations are performed on these data (transformations, mapping rules)?
• How do biological and cultural factors constrain the representations and processes?
• What are the functional relations between the components (hierarchical vs. heterarchical architecture)?
• How is the process decomposed (control of information flow)?
• How is the relevant information coded, stored, retrieved and processed?
The goal of cognitive linguistics is to describe and to explain linguistic competency and performance for natural systems (human beings). Obviously, structure and process vary with the restrictions of the information processor (man vs. machine).
Languages are systems for the coding, manipulation and communication of information. They are symbolic means for storing, processing (reasoning) and transmitting information. As with any tool, they are designed with respect to a goal (function) and with respect to user constraints. As these constraints are different for human beings and for machines (memory, attention span), we would expect natural languages to be different from artificial languages (algebra, logic, etc.).15 Natural languages, as opposed to artificial languages, are very flexible. The different components (conceptual, lexical and syntactic) are highly interdependent, each component possibly influencing the others. The advantage of such a heterarchical architecture is that it allows for various orders of data processing. For example, lexical choice may precede the choice of syntactic structure and vice versa. For more details see Zock (1990). One could view the functioning of the mind, hence the functioning of natural language, somehow like the functioning of a complex society (oligarchy). The two systems are organised in a similar way: (i) problem solving is decomposed: the result is produced not by a superexpert, but by a team of specialists; (ii) the different agents (components) contributing to the solution have a certain amount of autonomy; (iii) the agents negotiate, that is, they do not only communicate their results and draw on the results produced by their colleagues, but they can also adapt their behavior to allow for accommodation of the results produced by the other components.
The advantages of such a heterarchical kind of organisation are multiple: (i) freedom of processing: various orders are possible to reach the solution; (ii) time-sharing: each agent can work on its own without having to wait for an order coming from a higher component; (iii) flexibility: information flow is bidirectional; (iv) opportunistic planning: as information becomes available at different moments and in unpredictable ways, and since the different components can accommodate, it is possible to have the different agents compete and to use the first result produced by any of them. The major drawback of this kind of system, where everything is more or less interdependent, is that it becomes extremely difficult to see the dependency relationships, that is, it is hard to see what causes what, or what action has what outcome. This is quite obvious for covert activities like
15 "Natural languages are ambiguous, imprecise and sometimes awkwardly verbose. These are all virtues for general communication, but something of a drawback for communicating concisely as precise a concept as the power of recursion. The language of mathematics is the opposite of natural language: it can express powerful formal ideas with only a few symbols." (Friedman & Felleisen 1987: xi)
language, but it is also true for such complex activities as political decisions. Whichever the case, consequences of a choice may be far-reaching and hard to predict. In this sense, there are many points in common between speaking a language well, hence communicating efficiently, and being a good politician. In both cases one has to make the right choice at the right moment. Man and machine are not subject to the same constraints. Humans have a very limited working memory,16 they are poor serial processors, and they are not very logical. But they are intuitive, creative and, above all, good pattern matchers. That is, they can spontaneously discover the right solution on intuitive grounds, they can conceive of efficient strategies to solve a problem (for example, how to access information stored in long-term memory), and they can recognise complex patterns (configurations, Gestalt). Machines, on the other hand, are logical, they are good serial processors, they have a perfect memory, but they have little capacity for creativity, for intuition, or for perceiving global structures. Linguists who want to work out an ecologically valid theory, that is, provide a description of the data which is not only formally correct, but also computationally sound (processable), need to take these factors into account. Otherwise their theory will remain mere description with limited explanatory power or relevance for practical purposes. Now, if our point concerning pattern matching, that is, the use of mapping rules applied to larger chunks (i.e., conceptual configurations), is sound, one may ask several questions: Where do linguistic structures come from? (Section 4); What is the difference between conceptual and linguistic structures? (Section 5); What do syntactic structures depend upon? (Section 6); What do typical conceptual structures, patterns or configurations look like? (Section 7); How can the speaker recognise a specific syntactic structure? (Section 8).
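The opportunistic, heterarchical control discussed in this section, independent agents whose first available result is used, can be sketched as follows. This is a toy illustration of ours; the experts, their outputs and their timings are invented.

```python
import concurrent.futures
import time

def lexical_expert(message):
    time.sleep(0.01)  # pretend lexical lookup is quick here
    return ("lexical", ["man", "catch", "fish"])

def syntactic_expert(message):
    time.sleep(0.05)  # pretend syntactic planning is slower here
    return ("syntactic", ["NP", "VP"])

message = "[MAN]<-(AGT)-[CATCH]-(OBJ)->[FISH]"
with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = [pool.submit(expert, message)
               for expert in (lexical_expert, syntactic_expert)]
    # Opportunistic planning: take whichever expert delivers first,
    # instead of imposing a fixed order (lexical before syntactic or vice versa).
    first = next(concurrent.futures.as_completed(futures)).result()

print(first[0])  # here the lexical expert wins the race
```

With the invented timings above, lexical choice precedes syntactic choice, but reversing the two sleep values reverses the order of processing, which is exactly the flexibility claimed for the heterarchical architecture.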
4 Where do linguistic structures come from?
One can try to answer this question from a phylogenetic point of view, or from the point of view of the process that takes place when translating thoughts (messages) into language (text). From a phylogenetic point of view it seems that linguistic structures reflect perceptual structures. This is the
16 Due to memory constraints, sentences are built incrementally: planning and execution partially overlap. While uttering a partially planned conceptual structure, the next part of the message is planned: we think while we speak, and, while speaking, we think (Kempen & Hoenkamp 1986).
position held by many psychologists (Paivio 1971; Kempen 1977; Osgood 1980; Anderson 1983; Miller & Johnson-Laird 1985), linguists (Fillmore 1977; Langacker 1983; Habel 1988) and computer scientists (Hill 1984; Sowa 1984; Arbib 1986). As a matter of fact, there are many similarities between language and perception, both from a structural and a procedural point of view. Both are compositional, and both have well-formedness and completeness conditions (though in the sense of Gestalt psychology rather than in the mathematical sense). Furthermore, natural language and images are produced and perceived in a similar way, that is, globally. Both of them are to some extent holistic entities. We start to produce or to recognise a global structure (pattern) which we then fill in with details, that is, we tend to go from the general to the specific. How this might be done at the conceptual level is discussed in Zock (1996). What does this mean for sentence generation? It simply means that, rather than processing word-by-word, or concept-by-concept (local strategy), humans process larger chunks, trying to match entire conceptual structures on linguistic structures (global strategy). In other words, processing is done via pattern matching.17 This is probably true on all levels: conceptual, syntactic, lexical, and even phonological. Before turning to the problem of how people recognise linguistic structures on the basis of conceptual structures, we would like to comment on the relationship between conceptual structures and their linguistic counterparts, that is, words and syntactic structures.
5 Conceptual structures and syntactic structures are to a great extent parallel
Linguistic structures (order of words) and conceptual structures (order in which thoughts become available, i.e., spring into our mind), while not entirely parallel, correlate to a large extent,18 that is, items belonging together conceptually tend to appear side by side in the surface structure.19 This is
17 It is probably for this very same reason that people are able to recognise misspelled words, despite the speed of reading. We certainly don't look for every character, yet we are able to see the mistakes, especially if they occur at specific points. Strangely enough, it is for similar reasons that people overlook mistakes. We don't perceive what is, but we perceive what ought to be (Bruner 1973).
18 The regularities concerning their mappings are discussed in Section 7.
19 See Behagel's first law (cited in Vennemann 1975), Anderson's graph deformation principle (Anderson 1983), or Levelt's pioneering work on linearisation (Levelt 1981, 1982).
reasonable with regard to economy (memory). If conceptual and linguistic structures were not parallel to some extent, we would constantly be faced with a storage problem.20 For, whenever the word expressing a given conceptual fragment cannot be inserted into the surface string, it needs to be held in working memory until it can be attached to the string. If translating a conceptual structure into linguistic form consists above all in finding words for parts of the graph (semantic network)21 and in ordering them,22 it seems reasonable to try to maintain as much as possible the conceptual connectivity in the surface form (syntax). That is, the words should be uttered as much as possible in the same order in which the concepts for which they stand have been generated. The question that arises now is, how do we put it all together, or, how do we compute syntactic structures? In order to answer this question let's take an example. Suppose you wanted to express the following message, which could have been planned and expressed incrementally:

[MAN]<-(AGT)-[CATCH]-(OBJ)->[FISH]
[MAN]<-(AGT)-[MOVE]-(LOC)->[GROUND]
[MOVE]<-(MAN)-[FAST]

One way to start the process is by lexicalising, that is, by trying to find words that cover (express) the message planned.23 This kind of mapping is probably done stepwise, as it is hard to imagine that a speaker is able to find simultaneously all the words of a very big conceptual chunk. This means that, having found a word for the chunk [MAN]<-(AGT)-[CATCH]-(OBJ)->[FISH],24 the speaker tries to find a candidate for the remaining part of the message.25 The result of this stepwise consumption of the message graph is a lexicalised conceptual graph (LCG) whose links will then be replaced by functional information (subject, direct object, etc.) and lexical categories (part of speech, step 3). On the basis of this Preliminary Syntactic Structure (PSS) a tree is built. In order to perform the final operations (inflection, etc.) morphological information is added. For more details see (Nogier 1991; Nogier & Zock 1992; Zock 1996). This entire mapping process is depicted in Figure 2.
20 Of course, there are limits and exceptions (long distance phenomena, split infinitives, etc.) which make strict parallelism impossible. Language is linear, thought is relational. The representation of the latter being a graph, it is in principle possible to add information at any moment. For exceptions to this principle of adjacency, see Stockwell's discussion on the 'heavier element principle' and the 'topicalisation principle' (Stockwell 1977: 68-69 and 75-76).
21 Words hardly ever stand for single concepts, generally they stand for definitions. Yet definitions require a graph. This being so, it makes sense to express (or code) these definitions in terms of conceptual structures (graphs). In that respect there is no fundamental difference between the underlying meaning of words and sentences. According to our view, the process of lexicalisation consists in matching word definitions (conceptual graphs defining the meaning of the word) on an utterance graph, i.e., a structure containing the message to be conveyed. How such a message might be built has been discussed in Zock (1996).
22 Obviously, there is more to determining surface form: computation of part of speech, insertion of function words, morphological operations (inflections, agreement), etc.
23 According to our view, lexicalisation is performed in two steps. During the first step only those words are selected that pertain to a given semantic field (for example, movement verbs). At the next step the lexical expert selects from this pool the term that best expresses the intended meaning, i.e., the most specific term (maximal coverage). For more details see (Nogier & Zock 1992; Zock 1996).

5.1 Discussion
The careful reader has certainly noticed the following facts: (i) the same mechanism was used for lexicalisation and the precomputation of syntactic structure: pattern matching;26 (ii) words were chosen prior to syntactic structure;27 (iii) the initial message graph may be considerably simplified

24 For the sake of simplicity we will ignore here the fact that a "man catching fish" and a "fisherman" are not necessarily the same. While the former may be an amateur, the latter is a professional.
25 Of course, other strategies are possible. Rather than trying to find the next word (breadth first), the speaker could work in depth, trying to finalize the surface form of the first lexical element, that is, determine its syntactic category, i.e., part of speech. In that case he would pursue lexicalisation only once the final form of the preceding element has been computed. Another alternative would be to start the process by determining syntactic structure, then inserting lexical items into the computed syntactic slots. Which strategy is chosen under what condition remains an empirical question.
26 The assignment of part of speech to words (or, more precisely, to word stems) is performed at the PSS level. This could be done via the mapping rules described in Figure 4. However, these rules might be insufficient, in particular if there are several candidates. Ultimately we do need a grammar in order to check at the different points of the chain the possibility of a given category. While the mapping rules specify how a given concept or conceptual chunk may be mapped onto a syntactic category or syntactic structure, formal grammars specify what categories can, or should, occur at a specific point in time, that is, at a specific point in the chain.
27 Please note that at this stage words are not inflected. What we get are the base forms of words. Note also that, unlike Kempen & Hoenkamp (1987), or Nogier (1991), who conflate the last two steps into one, we do not assume that syntactic information (syntactic functions, syntactic categories) is computed simultaneously with the root forms. Part of speech and syntactic functions are determined later. There is one empirical finding, though, which is troublesome for our approach: when people fail to find the right word, they tend to come up with an alternative (synonym) which belongs to the same syntactic category as the one they were looking for.
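Point (i) above, lexicalisation as pattern matching over the message graph, can be sketched in a few lines. Everything below (the triple encoding, the lexicon entries, the greedy strategy) is a hypothetical illustration under the assumptions of footnote 23, not the author's implementation:

```python
# A minimal sketch of lexicalisation as pattern matching (all entries are
# hypothetical): the message is a set of relation triples, each lexicon
# entry "covers" a subgraph, and the message graph is consumed stepwise,
# preferring maximal coverage, until a lexicalised conceptual graph
# (here: just the chosen words) remains.

MESSAGE = {
    ("CATCH", "AGT", "MAN"), ("CATCH", "OBJ", "FISH"),
    ("MOVE", "AGT", "MAN"), ("MOVE", "LOC", "GROUND"),
    ("MOVE", "MANR", "FAST"),
}

# Each entry pairs a word with the conceptual subgraph it expresses.
LEXICON = [
    ("fisherman", {("CATCH", "AGT", "MAN"), ("CATCH", "OBJ", "FISH")}),
    ("run",       {("MOVE", "AGT", "MAN"), ("MOVE", "LOC", "GROUND")}),
    ("fast",      {("MOVE", "MANR", "FAST")}),
]

def lexicalise(message, lexicon):
    """Greedily consume the message graph, most specific entries first."""
    remaining, lcg = set(message), []
    for word, pattern in sorted(lexicon, key=lambda e: -len(e[1])):
        if pattern <= remaining:      # the word's definition matches here
            lcg.append(word)
            remaining -= pattern      # stepwise consumption of the graph
    return lcg, remaining

words, leftover = lexicalise(MESSAGE, LEXICON)
```

Sorting by coverage size implements the "maximal coverage" preference of footnote 23; an empty `leftover` means the whole message has been lexicalised.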
SENTENCE GENERATION BY PATTERN MATCHING
Fig. 2: The lexicon as a mediator between the conceptual and linguistic levels28
CL: conceptual level             LCS: lexicalised conceptual structure
LM1: lexical mapping for word1   PSS: preliminary syntactic structure
LM2: lexical mapping for word2   FSS: final syntactic structure
(contraction) if the syntactic structure is computed via the lexicon (contraction of the message graph to a lexicalised conceptual graph). Actually, this is one of the principal reasons for computing words before syntactic structure: large parts of the message graph become reduced to a relatively small lexical graph (compare the graphs at the conceptual level and the LCS level in Figure 2).29 If the syntactic structure is computed before words are chosen, syntactic constraints may be used to choose among lexical alternatives (synonyms). This may happen in the case of parallel structures, where one part constrains the other. Compare:
(i) We were expecting the worst, but hoping for the best.
(ii) * We were expecting the worst, but hoped for the best.
In such a case one uses the same tense in both clauses. Obviously, one has to justify the fact that lexicalisation precedes the determination of syntactic structure. Basically there are three possible strategies:
1. Syntactic structure is determined prior to word choice. This strategy is implied in traditional generative grammars, where the syntactic tree is built top down. The words are inserted fairly late during the derivational process into syntactically specified slots.
2. Word choice precedes syntax (syntactic trees are built bottom up).
3. Words and their syntactic structure are computed in parallel.30
Of course, either of these strategies could be used, though we don't believe very much in the first option, for the following reasons. First and above all, the speaker wants to convey meanings. Yet, syntax conveys little meaning compared to words. Second, if syntactic structures are to be computed on the basis of conceptual graphs whose nodes are more elementary than words, it is hard, if not impossible, to compute the syntactic structures first: the message graphs are simply too big to make such a strategy very feasible. In addition, it doesn't really make sense to compute the syntactic structure at this point of the process, since large parts of the graph will be reduced at the next step (lexicalisation) anyhow. Last, but not least, syntactic structure depends to a large extent on the subcategorisation features of the words used (see Table 6). Hence, it cannot be computed entirely without knowing the words that are being used.
Obviously, it would be nice to decide on this issue on the basis of empirical work. Unfortunately, for the time being we lack conclusive psycholinguistic evidence. For a good discussion, though, see Aitchison (1983, Chapter 11).
28 For illustrative purposes we assume here serial processing. Yet there seems to be evidence that people compute words (and perhaps even syntax) in parallel. For pointers to the relevant literature, see (Levelt 1989).
29 One may wonder, though, whether this is not an artificially induced problem, resulting from our way of modelling the process.
30 One could also think of a hybrid solution: depending on the situation, priority is given to syntactic or to lexical choice.
6 What do syntactic structures depend upon?
Syntactic structures depend basically on three sets of variables, or choices: conceptual, pragmatic and linguistic.
Conceptual choices. Different conceptual structures generally map onto different syntactic structures:

CONCEPTUAL STRUCTURE                                  LINGUISTIC STRUCTURE   STRING
[PERSON: #] <-attr- [SPEED: HIGH]                     Pron+Copula+Adj        She is fast
[PERSON: #] <-agnt- [MOVEMENT] -attr-> [SPEED: HIGH]  Pron+V+Adv             She runs fast
As we have shown already (Tables 3 and 5), there is no one-to-one mapping between conceptual structures and their linguistic correlates (parts of speech, syntactic structures). Different conceptual structures or relationships may be expressed by the same kind of linguistic structure.31 This is particularly obvious for genitives:
Peter's car ... [possession]
Peter's brother ... [family relationship]
Peter's leg ... [inalienable possession, part of]
Conversely, the same conceptual structure or relationship, for example the notion of possession, may map onto different linguistic structures or forms (paraphrase):
This car belongs to the president. [verb]
This is the car of the president. [preposition]
This is the president's car. [case: genitive]
This is his car. [possessive pronoun]
The conceptual structure is by no means sufficient in order to decide on the linguistic structure. Pragmatic and linguistic information is also necessary. What can be assumed to be known? Is there a word for a given concept? Can this word be used in this particular way? Suppose we were to express the following idea: (PERSON) be (PROFESSION) (PLACE). In that case one can use in English a verb if the person referred to is a teacher, but not if s/he is a professor or doctor:
31 This is probably for reasons of economy. The number of possible conceptual structures is enormous. If there were only one-to-one correspondences, we would have to learn a great many different syntactic structures. Furthermore, we would have to create a new syntactic structure every time we invented a new conceptual structure.
Kathy is a teacher at Columbia.
She teaches at Columbia.
?/*She professes at Columbia.
John is a doctor at the hospital.
*He doctors at the hospital.
Pragmatic choices: Syntactic structures act as cues: different syntactic structures express not only different content (conceptual structures), they also signal different discourse purposes (goals). Furthermore, they can reflect assumptions made by the speaker concerning the hearer's background, knowledge and beliefs. The way something is expressed depends on such factors as prominence (what should be highlighted or subordinated), background knowledge and perspective. The value of these variables is reflected in sentence type (simple vs. complex form, main clause vs. subordinate clause), determiner (definite vs. indefinite article), part of speech (noun vs. pronoun) and voice (active vs. passive).32 The following two sentences make different assumptions about the listeners' knowledge:
(1a) The man who had stolen the car escaped from prison.
     given: steal(man007, car205)
     new:   escape(man007, prison03)
(1b) The man who escaped from prison had stolen a car.
     given: escape(man007, prison03)
     new:   steal(man007, car)
In the first case they are supposed to know about the man's stealing of the car, whereas in the latter they are expected to be familiar with the man's escape from prison (see Clark & Haviland 1977). This means for the speaker that s/he should express the known part in the subordinate clause and the new part in the main clause.
Linguistic choices: Of course, language may impose further constraints on the syntactic structure. For example, the choice of a particular verb (e.g., to tell) may prohibit a specific syntactic structure. Take for example the structure in Figure 3. Depending on the verb chosen, different syntactic structures are necessary:
to say:  John said to Mary that he was tired.
         John said to Mary: "I am tired."
to tell: John told Mary that he was tired.
         * John told Mary: "I am tired."33
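The say/tell contrast can be captured by subcategorisation frames. A minimal sketch with a hypothetical two-verb lexicon (the frame names are illustrative, not a standard inventory):

```python
# A minimal sketch (hypothetical lexicon) of how a verb's subcategorisation
# frame licenses or blocks a syntactic structure, as with "say" vs. "tell".

FRAMES = {
    "say":  {"that-clause", "direct-speech"},
    "tell": {"that-clause"},
}

def licenses(verb, structure):
    """True if the verb's frame admits the given complement structure."""
    return structure in FRAMES.get(verb, set())

ok = licenses("say", "direct-speech")    # John said to Mary: "I am tired."
bad = licenses("tell", "direct-speech")  # * John told Mary: "I am tired."
```

A generator consulting such frames would reject the starred structure before surface realisation, which is exactly the kind of lexical constraint on syntax the text describes.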
The above example illustrates the interaction between lexical choices and syntactic structure. In other words, a syntactic structure cannot be chosen in isolation, or solely on conceptual grounds.34 Another example that nicely shows the far-reaching consequences of lexical choice is the following. Suppose we were to express in French the following idea: HELP past perfect(JOHN, MARY). Suppose furthermore that John and Mary were known. Depending on the chosen verb (aider, venir en aide, rendre service), various aspects of the surface form would change (see Table 6): the pronoun (la vs. lui), the auxiliary (être vs. avoir), the object's agreement marking on the verb (aidée vs. aidé). For more details see Zock (1994).

Table 6: "He has helped her" (consequences of the verb choice on clitics and auxiliaries)

In the next section we will show how one can recognise certain fundamental syntactic structures on the basis of the formal characteristics of the conceptual structure.

32 For empirical evidence see (Clark & Haviland 1977; Olson 1970; Olson & Filby 1972; Osgood 1971, 1980; Tannenbaum & Williams 1968).
33 By convention we shall use a question mark for odd-sounding forms, and an asterisk for ungrammatical sentences. The non-native speaker of English may be puzzled by this subtlety of English, yet it exists and is explainable. The reason why "I am tired", following "John told Mary", sounds odd is due to the fact that the verb tell means something like to report. A report being something that follows the event it describes or reports, it looks strange if the speaker switches from a reported event to a present event, i.e., direct speech.
34 For more details on how linguistic structures may vary as a function of pragmatic, conceptual and linguistic choices, see Zock (1988; 1994).
7 Prototypical patterns
If we look at syntactic structures, we discover after a while strong correlations with their conceptual counterparts, i.e., conceptual structures (see Table 3 and Figure 4). For example, nouns usually represent entities, verbs express actions, adverbs stand for manner, etc.
Once we have discovered that, we can use this knowledge the other way round, that is, for generation. Hence, entities may map onto nouns, actions/events/processes may be expressed by verbs, etc.36 Obviously, what holds for concepts holds also for larger conceptual chunks. Hence, looking at the lexicalised conceptual structure one can predict to some extent not only the part of speech, but also the potential syntactic structure (see Table 5). Figure 4 encodes some typical patterns.35 By convention, we will use circles for the predicates (verbs, adjectives, adverbs)37 and boxes for the arguments (nouns, propositions). It should be noted, however, that the mapping approach, in order to become feasible, imposes a special constraint on the knowledge representation, consistency: elements being morphologically different, yet playing semantically a similar role, should be coded the same way. This is the case with verbs, adjectives and adverbs (see Figure 4). Though syntactically different, they are conceptually alike, hence coded the same way (by a circle). All of them are predicates, the difference hinging on the nature and function of their arguments: entities for verbs or adjectives; actions, events, or processes for adverbs. In addition, information on the links is relevant. In order to decide whether a predicate pointing to an entity should map onto a verb or an adjective, one must check whether the link is a case role (verb) or an attribute (adjective). One might also appreciate the similarity between infinitives and that-clauses, the difference hinging on coreference and on the kind of verb: that-clauses require specific kinds of verbs. Figure 5 shows a simple conceptual structure (5A) and the way one may get progressively (5B, 5C) to its linguistic counterpart (5D). Obviously, there is more than one way to build the corresponding tree (top down, bottom up, bidirectionally).38 Anyhow, the language user, having performed this kind of process a number of times, will gradually become able to go directly from the conceptual structure to the (prefinal) linguistic form.
In the next section we will show how one can recognise one particular syntactic structure, relative clauses.
35 The shaded nodes signal particularly relevant information for a given syntactic structure.
36 As we have pointed out already, this need not always be the case: actions may well map onto nouns (nominalisation), directions may be expressed by prepositions or a verb, etc. Also, our examples hold for English. While the ontological categories (entities, actions, attributes, etc.) are to a large extent universal, the mappings are language dependent.
37 That's why we will not use them for indicating the links. Another deviation with regard to John Sowa's notation is the direction of the links for adverbs.
38 Please note that we have here a mixed strategy of pattern matching and incremental processing. It might also be worthwhile mentioning that 3B is a hybrid form: it contains conceptual and syntactic information.
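The decision criteria just described (kind of argument, case role versus attribute link) can be written down as a mapping rule. The encoding and link names below are hypothetical, a sketch rather than the rules of Figure 4:

```python
# A sketch of the prototypical mapping rules: a predicate's syntactic
# category is read off the kind of its argument and the nature of its
# links (link names such as "attr", "agt" are hypothetical).

def category(links, argument_kind):
    """links: set of link types leaving the predicate;
    argument_kind: 'entity' or 'event'."""
    if argument_kind == "event":
        return "adverb"                    # predicate over an action/event/process
    if "attr" in links:
        return "adjective"                 # attribute link to an entity
    if links & {"agt", "obj", "loc"}:
        return "verb"                      # case roles to entities
    return "unknown"
```

For example, a predicate linked to entities via case roles would come out as a verb, while one attached by an attribute link would come out as an adjective, mirroring the consistency constraint above.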
Fig. 5: From conceptual to linguistic structures

8 Where do relative clauses come from, how can they be recognised, and what do they depend upon?
These are actually three questions, for which, due to space constraints, we will provide only sketchy answers. Relative clauses add information. In that respect they are similar to adjectives. The information given may be crucial for the identification of the referent (restrictive relative clause) or not (non-restrictive relative clause). The latter case is generally marked by a comma or a pause. A typical situation for a relative clause arises when some entity participates in more than one event: little(boy: #5) & fast(run(boy: #5)). This is a conceptual condition which can be captured by a mapping rule: an entity being pointed at by two opposing arcs (see Figure 5). This formal characteristic might be used by a speaker, recognising this structure as a potential relative-clause candidate.39 We use the word "potential" because the condition mentioned above, though necessary, is by no means sufficient. Actually, the conceptual structure could be expressed in any of three ways: (1) by a simple sentence, (2) by two independent clauses, or (3) by a relative clause.40
(1) The little boy runs fast.
(2) The boy is little. He runs fast.
(3) The boy who is little runs fast.
Besides signalling the fact that an entity participates in more than one event, relative clauses signal relative prominence. Put differently, relative clauses factorise and highlight information. In addition, they are devices for increasing processing time and for allowing for spontaneity (the expression of afterthoughts, i.e., thoughts that were not planned at the onset of articulation). The extra time they allow for may be needed for encoding the rest of the main clause. Yet, as we shall see, there is more to the generation of relative clauses than just coreference. Take for example the following propositions (see Figure 6):
(1) rob(man007, bank)
(2) arrest(police, man007)
(3) escape(man007, prison)
(4) admire(woman, man007)
When communicating these events one has to consider several factors. Chunking: shall all these events be expressed in a single sentence, that is, a series of independent clauses, a coordinated sentence, or a relative clause? In the latter case one has to pay attention to the role played by the coreferential element (man) in each clause. What role does the to-be-embedded element play (agent, patient)? Does it play the same role in both events (that is, in the future matrix clause and subordinate clause)? The clauses (1,3) and (2,4) are symmetrical in the sense that in both cases the man plays the same role. In the first case (1,3) he is the agent, whereas in the second case (2,4) he is the patient. This symmetry can, of course, have an effect on realisation.
39 As one can see, mapping rules, or the cues a speaker may be sensitive to, are not only language specific, but also dependent on the knowledge representation formalism. If messages are coded in terms of semantic networks or conceptual graphs, then nodes and links become of focal interest; if one uses first-order logic (propositions), then identity of reference might be a crucial element to be looked for.
40 Please note, again, that there are subtle, though important, differences between these forms at the pragmatic level, differences which we will not deal with here.
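The conceptual condition (an entity pointed at by arcs from several events) and the symmetry check just described can be sketched as follows. The role labels are simplified assumptions; in particular the second argument of escape is treated uniformly as a patient slot for illustration only:

```python
# A sketch of relative-clause candidate detection: collect, for each
# entity, the (event, role) pairs it participates in; entities occurring
# in more than one event are candidates, and a pair of events is
# "symmetrical" when the shared entity plays the same role in both.

PROPS = [
    ("rob",    {"agent": "man007", "patient": "bank"}),
    ("arrest", {"agent": "police", "patient": "man007"}),
    ("escape", {"agent": "man007", "patient": "prison"}),
    ("admire", {"agent": "woman",  "patient": "man007"}),
]

def candidates(props):
    """Map each entity to the (event, role) pairs it participates in,
    keeping only entities that occur in more than one event."""
    occurrences = {}
    for pred, roles in props:
        for role, entity in roles.items():
            occurrences.setdefault(entity, []).append((pred, role))
    return {e: ps for e, ps in occurrences.items() if len(ps) > 1}

def symmetrical(pairs):
    """True if the entity plays one and the same role in all events."""
    return len({role for _, role in pairs}) == 1

cands = candidates(PROPS)
rob_escape = [p for p in cands["man007"] if p[0] in ("rob", "escape")]
```

On this toy input only man007 qualifies; the (rob, escape) pair is symmetrical (agent in both), while the full set of four events is not, matching the symmetry observation in the text.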
Fig. 6: Conceptual input

Yet, there are still quite a few other factors playing a role: order of mention (linear order of clauses, topicalisation), communicative status (definite vs. indefinite), presupposition (known, unknown), tense. Some of these decisions are prior to syntactic processing, and their consequences (communicative status) should be part of the input. Suppose we were to express the following events (see Figure 7):41
Event-1 (E1) rob(man, bank)
Event-2 (E2) escape(man, prison)
By varying systematically certain parameters such as chunk size (independent vs. complex clause, coordination vs. subordination), order of events, topicalisation, communicative status of the participants (definite vs. indefinite),42 tense, etc., we will notice that certain structures are not possible, while others, though grammatically correct, sound simply odd. It is by analyzing this data that we could get an answer to our question "what factors codetermine relative clauses".
41 For the sake of simplicity, these representations will not include information concerning time, tense, aspect, mood, etc. It should be noted, though, that this kind of information may play an important role, constraining the use of a particular structure.
42 For example, the communicative status of the common object has to be identical in the main clause and the subordinate clause, as it would be incorrect to produce: A man who had robbed the bank escaped from prison. Put differently, one can't relativise input structures like rob(man:*, bank) & escape(man:#, prison) or rob(man:#, bank) & escape(man:*, prison) because the man, though being coreferential, does not have the same communicative status.
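The constraint of footnote 42 amounts to a simple check on the shared referent. A sketch, using '#' for definite and '*' for indefinite status as in the footnote's notation:

```python
# A sketch of footnote 42's constraint: two propositions can be merged
# into a relative clause only if the shared referent is coreferential
# AND carries the same communicative status ('#' definite, '*' indefinite).

def relativisable(ref1, ref2):
    """Each ref is a pair (entity, status), e.g. ('man', '#')."""
    same_entity = ref1[0] == ref2[0]
    same_status = ref1[1] == ref2[1]
    return same_entity and same_status

ok = relativisable(("man", "#"), ("man", "#"))
status_clash = relativisable(("man", "*"), ("man", "#"))  # coreferential, wrong status
```

So rob(man:#, bank) & escape(man:#, prison) passes, while the mixed */# inputs of the footnote are blocked before any syntactic decision is taken.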
Fig. 7: A man robbing a bank and escaping from prison
Unfortunately, for reasons of space, we cannot perform such a systematic analysis here, though it would be worth the effort.

8.1 Discussion
Table 7 highlights several interesting problems. The conceptual input encoded in Figure 7 is clearly underspecified. Hence, the precise syntactic form is undecidable. This shows up in Table 7 in the large variety of possible forms, forms which nevertheless express subtle differences. It also shows up in the changes of tense, which are not trivial at all. Let's have a closer look at some of the sentences. Sentence 1a, though grammatically correct, sounds odd as it seems to lack information, namely: having robbed a bank, the man was put into jail. This effect would be less striking in a coordinated structure, where the listener would get the feeling that the speaker expressed a list of actions performed by some man: A man has robbed a bank and escaped from prison. By comparing the sentences (1b, 1d) and (2b, 2c), one can see to what extent 'communicative status' (definite vs. indefinite) and 'tense' (see 1a and 1b-1e) may affect the interpretation, hence the acceptability, of the sentence.
(1) ORDER: E1 > E2; TOPIC: man
    Independent clauses:
    (1a) A man has robbed a bank. He escaped from prison.
    Subordination:
    (1b) A man who had robbed a bank escaped from prison.
    (1c) ? A man who had robbed a bank had escaped from prison.
    (1d) The man who had robbed a bank escaped from prison.
    (1e) ? The man who had robbed a bank had escaped from prison.

(2) ORDER: E1 < E2; TOPIC: man
    Independent clauses:
    (2a) A man (had) escaped from prison. He had robbed the bank.
    Subordination:
    (2b) ? A man who (has) escaped from prison had robbed a bank.
    (2c) The man who (has) escaped from prison had robbed a bank.

(3) ORDER: E1 > E2; TOPIC: bank
    Independent clauses:
    (3a) A bank was robbed by a man. He had escaped from prison.
    Subordination:
    (3b) The bank was robbed by the man who had escaped from prison.

(4) ORDER: E1 < E2; TOPIC: prison
    Independent clauses:
    (4a) From prison escaped a man. He had robbed a bank.
    Subordination:
    (4b) * From prison escaped the man who had robbed the bank.
    (4c) ? The prison from which the man who robbed the bank escaped ...

E1 > E2 means: event-1 precedes event-2.
Table 7: Stylistic effects due to the variation of parameters like chunk size, order of mention, topicalisation, etc.

Topicalisation is another factor. While we can describe the scene from the man's point of view, we can't start linearisation from the prison: (4b) is an incorrect sentence, while (4c) is incomplete. Syntactic constraints like passivisability may play a role too. For example, the verb "to escape" can't be passivised. Presuppositions are another factor. The following sentences presuppose specific information concerning temporal order.
The man who had robbed the bank escaped from prison.
The man who escaped from prison had robbed the bank.
The bank was robbed by the man who had escaped from prison.
Put differently, order of events may impose special constraints on syntactic structure. The choice between coordination or subordination may depend on conceptual information given with the input. Also, implicatures may vary depending on linear order. Compare Levelt's well-known example (Levelt 1989):
She became pregnant. They got married.
They got married. She became pregnant.
Likewise, compare the following two sentences, which express basically the same events, yet with different emphasis.
1. Hitler has often been compared with Napoleon, although there are many differences between the two men.
2. Although there are many differences between Hitler and Napoleon, the two men have often been compared.
While the first version focuses on the differences, the second stresses the similarities between the two men. Figure 8 is similar to the preceding one; again the two clauses are symmetrical. However, this time the entity to be relativised plays the role of the patient.
Fig. 8: A man arrested by the police and admired by a woman

Depending on certain temporal or attentional givens (focus), the structure in Figure 8 can be expressed in the following ways:
(1a) The man (whom) the woman admired was arrested by the police.
(1b) ? The man (whom) the police arrested was admired by the woman.
(1c) ? The man (who was) arrested by the police was admired by the woman.
(2a) The woman admired the man whom the police arrested.
(2b) The woman admired the man (who was) arrested by the police.
(3a) The police arrested the man whom the woman admired.
It should be noted that this time linearisation can be started from any node; that is, from a linguistic point of view there are no focus constraints.
This is probably due to the fact that all the verbs used can be passivised. It may be worth mentioning, however, that the passive voice occurs only in the main clause. Sentences (1b) and (1c), while not incorrect, sound distinctly odd. The next structure (Figure 9) is different from the former in that the "man" plays a different role in each event: the structure is asymmetrical. In one case the "man" plays the role of the agent, whereas in the other he is the patient.
Of course, this fact may be reflected in the surface structure. According to the role played, the "man" will surface as subject or as object of the main verb. Again there are constraints on topicalisation, but for different reasons. The last sentence (3) sounds odd, as it gives the impression that the police had arrested the man before his robbing the bank. In general, it is not a good idea to subordinate a clause that expresses a consequence.
(1) The man who had robbed the bank was arrested by the police.
(2) The police arrested the man who had robbed the bank.
(3) ? The bank was robbed by the man whom the police arrested.

9 Discussion
What can be learnt from looking at these networks? In order to recognise a candidate for a given syntactic structure one must consider several factors: the type of concept (predicate/argument), its communicative status (definite/indefinite), its role with regard to the whole (predicate dominating an argument or dominating another predicate), and the nature (case role) and direction of the arcs (incoming vs. outgoing). Yet, several other points are worth mentioning:
1. It is not enough to look just at one predicate or argument (local strategy); one has to look at larger chunks. Typically, the formal characteristics of the surrounding predicates or arguments also play a role. The relevant information being spread all over, one has to look at entire conceptual configurations.43
2. The formal characteristics (conceptual conditions) of the underlying conceptual structures are by no means sufficient for determining the syntactic form. They only suggest potential candidates. Other factors need to be taken into account, most prominently: the size of the conceptual chunks to be verbalised,44 shared knowledge (definiteness), topicalisation (active vs. passive voice, type of embedding), the subcategorisation features or syntactic requirements of a particular word,45 and, last but not least, the relative prominence of each clause (saliency, focus), i.e., what shall be put into perspective, that is, be expressed by a main or a subordinate clause? Syntactic structures are generally the result of an interaction between conceptual, linguistic and discourse choices.
3. The conceptual structure taken as input needs to contain a lot more information than the graphs shown here. Otherwise it is not possible to decide on a communicatively adequate syntactic structure.46
In conclusion, despite obvious correlations, the parallelism between conceptual structure and linguistic form is relative. There is no one-to-one mapping. While an Agent-Action-Patient structure is likely to be expressed in English by an S-V-O pattern, we cannot tell solely on these grounds that the speaker will render this idea in an active form. Similarly, the choice between a that-clause and an infinitive cannot always be made on purely conceptual grounds.
The structure-building properties (syntactic characteristics) of the verb must also be taken into account, all the more as the choice of a particular verb may turn out to be incompatible with the chosen syntactic structure. In sum, one cannot strictly separate the syntax from the lexicon.
43 We believe that the idea of a central vs. peripheral view (i.e., the idea that there is more to come) has some psycholinguistic reality.
44 According to the amount of information the speaker tries to integrate into a sentence frame, he may end up with several independent clauses, or one complex, heavily embedded clause (subordinate clauses).
45 Not all verbs can be nominalised or passivised. Some words constrain other words (collocations), etc.
46 A question that arises in that context is how complex a pattern may be, that is, how much information it can contain, without losing one of its most fundamental characteristics: recognisability.

10 Conclusion
We have argued in this paper that human performance, that is, verbal fluency as observed in spontaneous discourse production, is only explicable if one hypothesises global strategies combined with pattern matching: the speaker operates on larger chunks. We have also pointed out that, for reasons of economy (storage), conceptual structures and linguistic structures ought to be parallel, at least to some extent (principle of structure preservation). Finally, we have outlined a strategy for fast computation of syntactic structures. We have shown that the same mechanism, pattern matching, could be used simultaneously for choosing words and (pre)computing syntactic structures. We have suggested that this be done in the following way. A given message is reduced to a lexicalised conceptual graph (lexicalisation). This graph serves as input for the (pre)computation of syntactic structure: ontological categories (entity, state, deep case roles, etc.) are replaced by syntactic categories (part of speech, syntactic functions). This preliminary syntactic structure is then checked against lexical subcategorisation features (for example, passivisability of verbs) and the result is handed to the morphological component for final operations (agreement, insertion of function words). We have claimed furthermore that, in order to precompute the syntactic structure, the user looks at the formal characteristics of the conceptual input.47 It is by looking at cues like the type of concept (predicate vs. argument), its relative position with respect to the whole, and the type and direction of the links (incoming vs. outgoing), etc., that s/he decides, i.e., precomputes, whether a given concept or conceptual chunk should map onto, let us say, a noun, a verb, an adjective, an infinitival, relative, or that-clause, etc. We would like to take this opportunity to clarify our position with regard to formal grammars and incremental processing.
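The pipeline summarised in the preceding paragraph can be sketched end to end. Every function body below is a placeholder standing in for the real component, not the author's implementation; the toy message and categories are assumptions:

```python
# Bird's-eye sketch of the proposed pipeline (placeholder components):
# message graph -> lexicalised conceptual graph (LCG) -> preliminary
# syntactic structure (PSS) -> subcategorisation check -> morphology.

def lexicalise(message):
    """Reduce the message graph to an LCG (stub)."""
    return [("man", "entity"), ("catch", "action"), ("fish", "entity")]

def to_pss(lcg):
    """Replace ontological categories by syntactic ones."""
    cat = {"entity": "noun", "action": "verb"}
    return [(stem, cat[kind]) for stem, kind in lcg]

def check_subcat(pss):
    """Check against lexical subcategorisation features (stub: accept)."""
    return pss

def morphology(pss):
    """Final operations: agreement, function words (stub: join the stems)."""
    return " ".join(stem for stem, _ in pss)

sentence = morphology(check_subcat(to_pss(lexicalise(None))))
```

The interesting claim is not in any single stub but in the staging: words are fixed before syntactic categories, and subcategorisation is checked before morphology, exactly the ordering argued for in Sections 5 and 6.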
By suggesting that lexicalised conceptual structures be mapped directly onto (final) linguistic forms, i.e., syntactic structures, we may have given the reader the wrong impression that one could do without a formal grammar. Obviously, this is not so. There is still assembly, and the legal combinations, i.e., the pieces that can go together, are still specified by a formal grammar. For example, if

47 This generally makes sense only once lexicalisation has taken place.
SENTENCE GENERATION BY PATTERN MATCHING
we used unification, then we would simply try to unify larger chunks than in most other approaches. As one can see, we do not mean to bypass the grammar; we simply mean that, depending on the situation and the speaker's proficiency, grammaticalisation is performed on larger chunks. This last statement should also clarify our position with regard to incremental processing. We do not mean to criticise the basic idea behind it, quite the contrary. We simply challenge the view of word-to-word processing. Indeed, there may be a whole spectrum of units, going from very small items (words or even below) to fairly large chunks (fixed expressions, patterns). The size of the units and the way to process them may vary depending on the user and the situation (cognitive states). We do have one major criticism, though, of formal grammars. To our knowledge, they do not make explicit the correspondences, or mappings, between conceptual structures and linguistic forms (see Figure 1). Hence, formal grammars miss a link. Yet this link is fundamental and lies at the heart of our approach. Obviously, many details are still lacking, especially concerning the size of the units, the recovery strategies (what to do in case of failure), and, last but not least, the mapping rules. Actually, so far we have shown only the tip of the iceberg. We need to make a list of the possible conceptual structures and their linguistic counterparts, and we have to specify the mapping rules and their constraints. Despite all these shortcomings, and despite the fact that our approach still lacks formal treatment and the test of implementation, though the PROTECTOR generation system (Nicolov, Mellish & Ritchie 1997) is a serious step in that direction,48 it embodies in principle at least two interesting facts: procedural knowledge can be stated explicitly via the mapping rules; processing can be speeded up by operating on larger chunks rather than atomic units.

Acknowledgements.
The author would like to express his gratitude to all those who were so kind as to comment on the initial draft: Dominique Estival, Aravind Joshi, Guy Lapalme, Yves Lepage, William Levelt, Terry Patten, Alain Polguère, Ehud Reiter and Dan Tufis. Special thanks go to Nicolas Nicolov, who devoted a considerable amount of time to the long discussions through which a lot of important theoretical points were clarified, and who helped with editing the manuscript.

48 The use of D-Tree Grammars in PROTECTOR allows the system to operate on larger conceptual units.
MICHAEL ZOCK

REFERENCES
Aitchison, Jean. 1983. The Articulate Mammal: An Introduction to Psycholinguistics. London: Hutchinson.
Anderson, John. 1983. The Architecture of Cognition. Cambridge, Mass.: Harvard University Press.
Arbib, Michael, E. Conklin & Jane Hill. 1986. From Schema Theory to Language. New York: Oxford University Press.
Ausubel, David. 1980. "Schemata, Cognitive Structure and Advance Organisers: A Reply to Anderson, Spiro and Anderson". American Educational Research Journal 17:3.400-404.
Bartlett, Frederik. 1932. Remembering. Cambridge: Cambridge University Press.
Becker, Joseph. 1975. "The Phrasal Lexicon". BBN Report No. 3081. Cambridge, Mass.: Bolt Beranek & Newman.
Bobrow, Daniel. 1968. "Natural Language Input for a Computer Problem Solving System". Semantic Information Processing ed. by Marvin Minsky, 33-145. Cambridge, Mass.: MIT Press.
& G. Norman. 1975. "Some Principles of Memory Schemata". Representation and Understanding: Studies in Cognitive Science ed. by Daniel G. Bobrow & A. M. Collins, 31-149. New York: Academic Press.
Bock, Kathryn, H. Loebell & R. Morey. 1992. "From Conceptual Roles to Structural Relations: Bridging the Syntactic Cleft". Psychological Review 99:1.150-171.
Brown, Roger. 1958. "Linguistic Determinism and the Part of Speech". Psychological Review 65:1.14-21.
Bruner, Jerome. 1973. Beyond the Information Given: Studies in the Psychology of Knowing ed. by J. Anglin. New York: W. W. Norton.
Burton, Richard. 1976. "Semantic Grammar: An Engineering Technique for Constructing Natural Language Understanding Systems". BBN Report No. 3453. Cambridge, Mass.: Bolt Beranek & Newman.
Chomsky, Noam. 1959. Review of Verbal Behavior by B. F. Skinner (New York, 1957). Language 35:26-58.
Clark, Herbert & E. Clark. 1977. Psychology and Language: An Introduction to Psycholinguistics. New York: Harcourt Brace Jovanovich.
& S. Haviland. 1977. "Comprehension and the Given-New Contract". Discourse Production and Comprehension ed. by R. O. Freedle, 1-40. Norwood, N.J.: Ablex.
de Smedt, Koenraad. 1990. Incremental Sentence Generation: A Computer Model of Grammatical Encoding. Ph.D. dissertation (also NICI TR 90-01), University of Nijmegen, The Netherlands: Nijmegen Institute for Cognition Research and Information Technology.
van Dijk, Teun. 1977. "Semantic Macro-Structures and Knowledge Frames in Discourse Comprehension". Cognitive Processes in Comprehension ed. by M. A. Just & P. A. Carpenter, 3-31. Hillsdale, N.J.: Erlbaum.
Fillmore, Charles. 1977. "Scenes-and-Frames Semantics". Linguistic Structures Processing ed. by Antonio Zampolli, 55-82. Amsterdam: North Holland.
Friedman, Daniel & M. Felleisen. 1987. The Little LISPer. Cambridge, Mass.: MIT Press.
Fries, Charles C. 1952. The Structure of English: An Introduction to the Construction of English Sentences. New York: Harcourt Brace & World.
Goffman, Erving. 1974. Frame Analysis: An Essay on the Organisation of Experience. Cambridge, Mass.: Harper & Row.
Habel, Christopher. 1988. "Cognitive Linguistics: The Processing of Spatial Concepts". LILOG Report 45. Stuttgart, Germany: IBM.
Harris, Zellig S. 1951. Methods in Structural Linguistics. Chicago: University of Chicago Press.
Hendrix, Gary. 1977. "The LIFER Manual: A Guide to Building Practical Natural Language Interfaces". Technical Note 138. Menlo Park: SRI.
Hill, Jane & M. Arbib. 1984. "Schemas, Computation and Language Acquisition". Human Development 27:282-296.
Hovy, Edward. 1990. "Unresolved Issues in Paragraph Planning". Current Research in Natural Language Generation ed. by Robert Dale, Chris Mellish & Michael Zock, 17-41. London: Academic Press.
Kant, Immanuel. 1781. Critique of Pure Reason. Translated by Max Müller. Garden City, N.Y.: Anchor Books.
Kempen, Gerard. 1977. "Conceptualising and Formulating in Sentence Production". Sentence Production: Developments in Research and Theory ed. by S. Rosenberg, 259-274. Hillsdale, N.J.: Erlbaum.
& E. Hoenkamp. 1987. "An Incremental Procedural Grammar for Sentence Formulation". Cognitive Science 11.201-258.
Koffka, Kurt. 1935. The Principles of Gestalt Psychology. New York: Harcourt Brace & World.
Lado, Robert. 1964. Language Teaching: A Scientific Approach. New York: McGraw-Hill.
Langacker, Ron. 1983. Foundations of Cognitive Grammar I & II. Bloomington: Indiana University Linguistics Club.
Levelt, Willem J. M. 1981. "The Speaker's Linearisation Problem". Philosophical Transactions of the Royal Society, London, B295, 305-315.
. 1982. "Linearization in Describing Spatial Networks". Processes, Beliefs and Questions: Essays on Formal Semantics of Natural Language and Natural Language Processing ed. by Stanley Peters & Esa Saarinen, 199-220. Dordrecht, Holland: Reidel.
. 1989. Speaking. Cambridge, Mass.: MIT Press.
Mandler, Jean. 1979. "Categorial and Schematic Organisation in Memory". Memory Organisation and Structure ed. by C. Puff, 259-299. New York: Academic Press.
McKeown, Kathleen. 1984. Text Generation. Cambridge: Cambridge University Press.
Mel'cuk, Igor & A. Zholkovskij. 1970. "Towards a Functioning 'Meaning-Text' Model of Language". Linguistics 57:10-47.
Miller, George. 1956. "The Magical Number Seven, Plus or Minus Two: Limits on our Capacity for Processing Information". Psychological Review 63:2.81-97.
& Philip Johnson-Laird. 1985. Language and Perception. Cambridge: Cambridge University Press.
Minsky, Marvin. 1975. "A Framework for Representing Knowledge". The Psychology of Computer Vision ed. by Patrick Winston, 211-277. New York: McGraw Hill.
. 1985. The Society of Mind. New York: Simon & Schuster.
Nicolov, Nicolas, Chris Mellish & Graeme Ritchie. 1997. "Approximate Chart Generation from Non-Hierarchical Representations". Recent Advances in Natural Language Processing ed. by Ruslan Mitkov & Nicolas Nicolov, 273-294. Amsterdam & Philadelphia: John Benjamins. (This volume.)
Nogier, Jean-François. 1991. Génération automatique de langage et graphes conceptuels. Paris: Hermès.
& Michael Zock. 1992. "Lexical Choice by Pattern Matching". Knowledge Based Systems 5:3.200-212. (Also in Current Directions in Conceptual Structures Research ed. by T. Nagle, J. Nagle, L. Gerholz & P. Eklund, 413-435. Berlin & New York: Springer Verlag, 1992.)
Olson, David. 1970. "Language and Thought: Aspects of a Cognitive Theory of Semantics". Psychological Review 77:257-273.
Osgood, Charles. 1971. "Where Do Sentences Come from?". Semantics: An Interdisciplinary Reader in Philosophy, Linguistics, and Psychology ed. by D. Steinberg & L. Jakobovits, 497-529. Cambridge: Cambridge University Press.
. 1980. Lectures on Language Performance. New York: Springer Verlag.
Paivio, Allan. 1971. Imagery and Verbal Processes. New York: Holt, Rinehart & Winston.
Patten, Terry, Michael Geis & Barbara Becker. 1992. "Toward a Theory of Compilation for Natural Language Generation". Computational Intelligence 8:1.77-101.
Piaget, Jean. 1970. "Piaget's Theory". Carmichael's Manual of Child Psychology ed. by P. Mussen, vol. 1, 318-323. New York: Wiley.
Raphael, Bertram. 1968. "SIR: A Computer Program for Semantic Information Retrieval". Semantic Information Processing ed. by Marvin Minsky, 146-226. Cambridge, Mass.: MIT Press.
Rivers, Wilga. 1972. Speaking in Many Tongues: Essays in Foreign Language Teaching. Rowley: Newbury House.
Roberts, Paul. 1962. English Sentences. New York: Harcourt Brace & World.
Rösner, Dietmar. 1987. "The Automated News Agency: SEMTEX - A Text Generator for German". Natural Language Generation: New Results in Artificial Intelligence, Psychology and Linguistics ed. by Gerard Kempen, 133-148. Dordrecht: Martinus Nijhoff.
Rumelhart, David. 1975. "Notes on a Schema for Stories". Representation and Understanding: Studies in Cognitive Science ed. by Daniel G. Bobrow & A. M. Collins, 211-236. New York: Academic Press.
& A. Ortony. 1976. "The Representation of Knowledge in Memory". Schooling and the Acquisition of Knowledge ed. by R. C. Anderson, R. J. Spiro & W. E. Montague, 99-133. Hillsdale, N.J.: Erlbaum.
Schank, Roger & R. Abelson. 1977. Scripts, Plans, Goals and Understanding: An Inquiry into Human Knowledge Structures. Hillsdale, N.J.: Erlbaum.
Simmons, Robert & J. Slocum. 1972. "Generating English Discourse from Semantic Networks". Communications of the Association for Computing Machinery (CACM) 15:10.891-905.
Skinner, Burrhus. 1957. Verbal Behavior. New York: Appleton-Century-Crofts.
Sowa, John. 1984. Conceptual Structures: Information Processing in Mind and Machine. Reading, Mass.: Addison Wesley.
Stockwell, Robert P. 1977. Foundations of Syntactic Theory. Englewood Cliffs, N.J.: Prentice Hall.
Swartout, William. 1983. "XPLAIN: A System for Creating and Explaining Expert Consulting Systems". Artificial Intelligence 21:3.285-325.
Tannenbaum, Percy & F. Williams. 1968. "Generation of Active and Passive Sentences as a Function of Subject or Object Focus". Journal of Verbal Learning and Verbal Behavior 7:246-250.
Vennemann, Theo. 1975. "An Explanation of Drift". Word Order and Word Order Change ed. by C. Li, 269-305. Austin, Texas: University of Texas Press.
Weizenbaum, Joseph. 1966. "ELIZA. A Computer Program for the Study of Natural Language Communication between Man and Machine". Communications of the Association for Computing Machinery (CACM) 9:36-45.
Wilks, Yorick. 1975. "A Preferential Pattern-Seeking Semantics for Natural Language Inference". Artificial Intelligence 6:1.53-74.
Winograd, Terry. 1972. Understanding Natural Language. New York: Academic Press.
. 1975. "Frame Representation and the Declarative-Procedural Controversy". Representation and Understanding: Studies in Cognitive Science ed. by Daniel G. Bobrow & A. M. Collins, 185-210. New York: Academic Press.
Zock, Michael. 1988. "Natural Languages are Flexible Tools, That's What Makes Them Hard to Explain, to Learn and to Use". Advances in Natural Language Generation: An Interdisciplinary Perspective ed. by Michael Zock & Gerard Sabah, 181-196. London: Pinter.
. 1990. "If You Can't Open the Black Box, Open a Window! A Psycholinguistically-Motivated Architecture of a Natural Language Generation Component". Proceedings of COGNITIVA-90, 143-152. Madrid, Spain.
. 1994. "Language in Action, or, Learning a Language by Watching It Work". Proceedings of the 7th Twente Workshop on Language Technology: Computer-Assisted Language Learning, 101-111. Twente, The Netherlands.
. 1996. "The Power of Words in Message Planning". Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), 990-995. Copenhagen, Denmark.
Zock, Michael, Gerard Sabah & C. Alviset. 1986. "From Structure to Process: Computer-assisted Teaching of Various Strategies for Generating Pronoun Constructions in French". Proceedings of the 11th International Conference on Computational Linguistics (COLING-86), 566-570. Bonn, Germany.
An Empirical Study on the Generation of Descriptions for Nominal Anaphors in Chinese

CHING-LONG YEH* & CHRIS MELLISH**
* Tatung Institute of Technology
** University of Edinburgh

Abstract

In this paper, we propose a preference rule for the generation of descriptions of nominal anaphors in Chinese. The rule emphasises using different forms of descriptions, full and reduced, to reflect the 'discourse structure' in the generated text. We performed experiments using first a simple rule and then the preference rule on a set of descriptive texts. A comparison of the results shows that the latter is more effective at accounting for shifts of intention. We finally show the implementation of the rule in our Chinese natural language generation system.
1 Introduction

In Chinese, anaphors can be classified as zero, pronominal and nominal forms, as exemplified in (1) by φ,1 tai (he) and nage reni (that person), respectively.

(1)
a. Zhangsani jinghuang de wang wai pao,
   Zhangsan frightened NOM to outside run
   Zhangsan was frightened and ran to outside.
b. φi zhuangdau yige renj,
   (he) bump-to a person
   (He) ran into a person.
c. tai kanqing le na renj de zhangxiang,
   he see-clear ASP that person GEN appearance
   He watched clearly that person's appearance.
d. φi renchu na renj shi shei.
   (he) recognise that person is who
   (He) recognised who that person is.

We have established a rule that includes a set of syntactic, semantic and discourse-oriented constraints to decide between the generation of zero,

1 We use φba to denote a 'zero anaphor', where the subscript a is the index of the zero anaphor itself and the superscript b is the index of the referent. A single φ without any script represents an intra-sentential zero anaphor. Also note that a superscript attached to a noun phrase is used to represent the index of the referent.
pronominal and nominal anaphors (Yeh & Mellish 1994; Yeh 1995:42-77). 'Nominal anaphors' do not have unique forms like their zero and pronominal counterparts. The description can be the same as the 'initial reference', parts of the information in the initial reference can be removed, new information can be added to the initial reference, or even a different lexical item can be used for a nominal anaphor. In this paper, we investigate the choice of appropriate descriptions for nominal anaphors in Chinese natural language generation. Previous related research in natural language generation (Dale & Haddock 1991; Dale 1992:187-193; Reiter & Dale 1992; Tutin & Kittredge 1992; Horacek 1995) focused on creating 'referring expressions' for entities to distinguish them from a set of objects that the reader is assumed to be attending to. These algorithms can efficiently create descriptions to identify the 'intended referent' unambiguously. The resulting descriptions, however, only reflect the attentional aspect of discourse (Grosz & Sidner 1986:177). In this paper, we attempt to investigate the role of descriptions for nominal anaphors in another aspect of discourse, namely, intention (Grosz & Sidner 1986:177). We propose a preference rule for choosing different descriptions for nominal anaphors to reflect shifts of intention in a discourse. To investigate the effectiveness of the rule, we performed two experiments on three sets of Chinese text as the test data. The experiments were carried out by comparing the nominal descriptions in the test data with the corresponding ones created by using a simple rule and then the preference rule, assuming the same semantic structures and context. The comparison of the results shows that the preference rule is effective.

2 Analysis of nominal anaphors in the test data
The surface structure of a Chinese nominal anaphor is a noun phrase which consists of a head noun optionally preceded by an associative phrase, articles, relative clauses and adjectives (Li & Thompson 1981:103-126). The nominal descriptions investigated in the remainder of this paper are thought of as noun phrases of the above scheme without articles. A nominal anaphor is referred to as a 'reduced form', or a 'reduction', of the initial reference if its head noun is the same as the initial reference, and its modification part is a strict subset of the optional part in the initial reference; otherwise, if it is identical to the initial reference, then it is a 'full description'. Observing the nominal anaphors occurring in the test data, we can classify nominal descriptions as below, with examples shown in Table 1.
     Initial reference           Nominal anaphor
A    zuchiu (football)           zuchiu (football)
B    tie-tong (iron barrel)      tie-tong (iron barrel)
C    tie-tong (iron barrel)      tong (barrel)
D    shuei (water)               yuan-wan-zhong de shuei (water in the round bowl)
E    qian (money)                neixie chaopiau (those notes)

Table 1: Examples of nominal anaphors

Data      A      B      C      D      E    Total
Set 1   147     38     33      6     10      234
        63%    16%    14%     3%     4%     100%
Set 2   248     35     39     25     13      360
        67%    10%    14%     6%     4%     100%
Set 3    46     12      0      0      0       58
        79%    21%     0%     0%     0%     100%

Table 2: Nominal anaphors in the test data
A. The initial reference is a bare noun, and the subsequent reference is the same as the initial reference.
B. The initial reference is reducible, and the subsequent reference is the same as the initial reference.
C. The initial reference is reducible, and the subsequent reference is a reduced form of the initial reference without new information.
D. The subsequent reference has new information in addition to the initial reference.
E. Otherwise.

The occurrence of the types of nominal anaphors in the test data, in terms of the above classification, is shown in Table 2.
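The full/reduced distinction defined above (same head noun, modifiers a strict subset of the initial reference's optional part) can be sketched directly. This is an illustrative sketch under an assumed representation of noun phrases as a head plus a set of modifiers; it is not the authors' implementation.

```python
# Sketch of the paper's definitions: a nominal anaphor is a 'full
# description' if identical to the initial reference, a 'reduced form'
# if it keeps the head noun and drops some (or all) of the modifiers.

def classify_description(initial, anaphor):
    """Return 'full', 'reduced' or 'other' for a nominal anaphor,
    given (head, modifiers) dictionaries for both noun phrases."""
    same_head = initial["head"] == anaphor["head"]
    init_mods = set(initial["modifiers"])
    ana_mods = set(anaphor["modifiers"])
    if same_head and ana_mods == init_mods:
        return "full"      # identical to the initial reference
    if same_head and ana_mods < init_mods:
        return "reduced"   # strict subset of the optional part
    return "other"         # new information or a different lexical item

# Types B and C from Table 1: tie-tong (iron barrel) vs. tong (barrel).
initial = {"head": "tong", "modifiers": ["tie"]}
full = classify_description(initial, {"head": "tong", "modifiers": ["tie"]})
reduced = classify_description(initial, {"head": "tong", "modifiers": []})
```

Here `full` evaluates to `"full"` and `reduced` to `"reduced"`; Types D and E from the classification above would both fall under `"other"`.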
3 A preference rule for nominal descriptions
The decision about what descriptions to use for initial references is a complicated process (Dale 1992:105-106). In this paper, we only consider 'subsequent reference'. Previous work on the generation of referring expressions focused on producing 'minimal distinguishing descriptions' (Dale & Haddock 1991; Dale 1992:186-195; Reiter & Dale 1992) or descriptions customised for different levels of hearers (Reiter 1990). Since we are not concerned with the generation of descriptions for different levels of users, we only look at the former group of work. The first group of work aims at generating descriptions for a subsequent reference to distinguish it from the set of entities with which it might be confused. The main data structure in these algorithms is a 'context set', which is the set of entities the hearer is currently assumed to be attending to, except the intended referent. Basically, their algorithms can be regarded as ruling out members of the context set. These algorithms pursue efficiency in producing an adequate description which can identify the intended referent unambiguously with a given context set. In his system (Dale 1992:173-175), Dale used the global focus space (Grosz & Sidner 1986:179-182) as the context set in his domain of small discourse. Following this idea, the context set grows as the discourse proceeds. Consider, for example, two nominal anaphors referring to the same entity occurring at different places in a discourse. According to the above algorithms, a single description would be produced for both anaphors if the context sets at both places have the same elements. On the other hand, in general, a description with more distinguishing information is used for the latter if more distractors have entered the context set. Grosz & Sidner (1986:177-178) claim that 'discourse segmentation' is an important factor, though obviously not the only one, governing the use of referring expressions. If the idea of the context set were restricted to the local focus space (Grosz & Sidner 1986:177-178), then the resulting descriptions would to some extent be sensitive to the local aspect of discourse structure. Although the algorithms would be refined by the introduction of discourse structure, they would essentially still serve the distinguishing purpose.
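The "ruling out members of the context set" idea can be rendered schematically. The sketch below is a generic rendering of this family of algorithms, not any single author's exact procedure; the data representation (entities as property sets) is assumed for the example.

```python
# Schematic distinguishing-description construction: add properties of
# the intended referent one at a time, keeping those that rule out at
# least one remaining member of the context set (the distractors).

def distinguishing_description(referent, context_set, properties):
    """Select properties of `referent`, in the given order, until no
    distractor in `context_set` matches all selected properties."""
    description = []
    distractors = list(context_set)
    for prop in properties:
        if not distractors:          # referent already unambiguous
            break
        survivors = [d for d in distractors if prop in d["props"]]
        if len(survivors) < len(distractors):  # prop rules someone out
            description.append(prop)
            distractors = survivors
    return description  # may remain ambiguous if properties run out

# Two bowls as in example (2): 'bowl' distinguishes nothing,
# 'round' rules out the square bowl.
round_bowl = {"name": "round-bowl", "props": {"bowl", "round"}}
square_bowl = {"name": "square-bowl", "props": {"bowl", "square"}}
desc = distinguishing_description(round_bowl, [square_bowl], ["bowl", "round"])
```

With the two-bowl context, `desc` comes out as `["round"]`: the shared property `bowl` is skipped because it rules out no distractor.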
The beginnings of 'discourse segments' in a sense indicate shifts of intentions in a discourse (Grosz & Sidner 1986:178-179). In this situation, subsequent references may be preferred to be full descriptions rather than reduced ones or pronouns, to emphasise the beginning of discourse segments, even if the referents have just been mentioned in the immediately previous clause. Some examples were used to illustrate this idea, for example, in (Grosz & Sidner 1986:180). A similar situation happens in Chinese discourse. First of all, let's have a look at a characteristic of discourse segment structure in Chinese written text. In Chinese written text, a sentential mark, '。', is normally inserted at the end of a 'sentence',2 which is

2 We use a quoted sentence to distinguish it from the usual sense of sentence in English. The sentential mark also has two auxiliaries, question and exclamation marks, which are used to express 'sentences' with certain tones.
a. fengzhengi φ fangdau gaukong shangqu yiho,
b. la fengzhengi de xianj zheme yeh la bu zhi,
c. φj zhongshi xiang xia wan,
d. zhe shi weisheme ne?
e. yuanlai, buguan fang fengzhengi de xianj you duome xi,
f. φj dou shi you zhongliang de,
g. xianj de zhongliang shi youyu diqiu dui xianj you xiyin de liliangl er chansheng de,
h. zheige liliangl hauxiang wuxing de shou,
i. φl ba xianj xiang xia zhuai,
j. xianj jiu la bu zhi le.
k. qishi, fengzhengi yeh you zhongliang,
l. yinwei fengm chui zhe fengzhengi,
m. φm shi fengzhengi xiang shang sheng,
n. shuoyi fengzhengi bingbu xiang xia chen.
o. zheyang, φ zai fang fengzhengi shi,
p. piau zai kongzhong de xianj xingcheng yige wanchu de huxing.
q. piau zai kongzhong de xianj yu chang,
r. xianj wanchu de yu lihai,
s. φj yu la bu zhi.

a. When flying a kitei in the sky,
b. the stringj pulling the kitei can't be pulled straight.
c. Itj is always bent downwards.
d. Why is that?
e. So, however thin the stringj pulling the kitei is,
f. (it)j all has weight.
g. The weight of the stringj is because of the attracting powerl of the earth on the stringj.
h. This powerl is like a transparent hand.
i. (It)l pulls the stringj down.
j. The stringj then can not be pulled straight.
k. However, the kitei also has weight.
l. Since the windm blows the kitei,
m. (it)m makes the kitei rise.
n. Therefore, the kitei does not fall down.
o. So when flying a kitei,
p. the stringj fluttering in the sky forms a curved arc.
q. The longer the stringj fluttering in the sky,
r. the more curved the stringj is,
s. and the more difficult (it)j is to pull straight.

Fig. 1: A sample Chinese text and its translation
Key: j.z: referent j in zero form. j.full: referent j in full noun phrase. j.other: referent j in another noun phrase. j.reduced: referent j in reduced noun phrase. — — : 'sentence' boundary.

Fig. 2: Occurrence of referent 'j' in the discourse in Fig. 1

a meaning-complete unit in a discourse, such as a to d, e to j, k to n, o to p and q to s in Fig. 1;3 on the other hand, commas are inserted between clauses within a 'sentence' as separators (Liu 1984:79-80). In our previous study (Yeh & Mellish 1994; Yeh 1995:50-54), we found that a 'sentence' to a large extent corresponds to a discourse segment. A Chinese discourse, say a paragraph of written text, therefore consists of a sequence of 'sentences', and the corresponding intentions altogether form the intention of the discourse.

3 This is obtained from a scientific question-answer book which is used as a set of test data in (Yeh & Mellish 1994; Yeh 1995:44-45).
Among the groups of initial and subsequent references in Fig. 1, we focus on the one indexed j, la fengzheng de xian (the string pulling the kite). After it is initially introduced in b, it appears in zero and nominal forms alternately in the rest of the discourse, as shown schematically in Fig. 2. At the beginning of the second 'sentence', it appears as a full description and then as four reduced descriptions in the rest of the 'sentence'. It is not mentioned in the third 'sentence'. When it is reintroduced in the fourth 'sentence', it is in another noun phrase, piau zai kongzhong de xian (the string fluttering in the sky), which is not reduced. Then, in the last 'sentence', it repeats the same patterns as in the second 'sentence'. Since there are no distracting elements for the string in the discourse, the use of full descriptions at the beginning of 'sentences', e and q, can be interpreted as emphasising that a new discourse segment, a 'sentence', has begun. The accompanying reduced descriptions can then be explained as being intended to contrast with the emphasis at the beginning of 'sentences'. Note that a full description is used for the subsequent reference in p, which is not at the beginning of a 'sentence', because it is the first mention in the 'sentence'. Thus, we would generalise the above interpretation as follows: a full description is preferred for a subsequent reference if it is at the beginning of a 'sentence' or the first mention in the 'sentence'; otherwise, a reduced description is preferred. In case distracting elements occur in a 'sentence', a sufficiently distinguishable description is required for a subsequent reference within the 'sentence' instead of a reduced one, even if it has been mentioned previously in the 'sentence'; for example, yuanwan (the round bowl) in (2d) and fangwan (the square bowl) in (2e).4

(2)
a. zhaolai tongyang daxiao de liangkuai tiepi,
   get same big-small NOM two iron-piece
   Get two pieces of iron of the same size.
b. zhuocheng yige yuanwani he yige fangwanj.
   make one round-bowl and one square-bowl
   Make a round and a square bowl.
c. ba yuanwani-li zhuangman le shuei,
   BA round-bowl-in fill-full ASP water
   Fill the round bowl full of water.
d. ranhou ba yuanwani-zhong de shuei manman daujin fangwanj-li,
   then BA round-bowl-in GEN water slowly fill-in square-bowl-in
   Then slowly pour the water in the round bowl into the square bowl.
4 This is also obtained from the same set of test data as (1).
e. ni huei faxian fangwanj zhuangbuxia zheixie shui,
   you will find square-bowl fill-not-in these water
   You will find that the square bowl cannot hold all the water.
f. youxie shui hui liu chulai.
   have-some water will flow out-come
   Some water will flow out.
According to the above observations, we propose the following preference rule for the generation of descriptions for nominal anaphors in Chinese:

If a nominal anaphor, n, is the first mention in a 'sentence', then a full description is preferred; otherwise, if n is within a 'sentence' and has been mentioned previously in the same 'sentence' without distracting elements, then a reduced description is preferred; otherwise a full description is preferred.
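The preference rule above can be transcribed almost directly. The sketch assumes a simple discourse representation (each mention records its referent, its 'sentence' number, and whether distracting elements are present); these field names are invented for illustration.

```python
# A transcription of the preference rule: full description at the first
# mention in a 'sentence'; reduced when already mentioned in the same
# 'sentence' with no distractors; full (distinguishing) otherwise.

def preferred_description(anaphor, previous_mentions):
    """Return 'full' or 'reduced' for a nominal anaphor."""
    earlier_in_sentence = [
        m for m in previous_mentions
        if m["referent"] == anaphor["referent"]
        and m["sentence"] == anaphor["sentence"]
    ]
    if not earlier_in_sentence:
        return "full"      # first mention in this 'sentence'
    if not anaphor.get("distractors"):
        return "reduced"   # already mentioned, no distracting elements
    return "full"          # distractors present: stay distinguishable

# The string 'j' of Fig. 1: full when a new 'sentence' begins,
# reduced later within the same 'sentence'.
history = [{"referent": "j", "sentence": 2}]
at_new_sentence = preferred_description({"referent": "j", "sentence": 3}, history)
within_sentence = preferred_description({"referent": "j", "sentence": 2}, history)
```

Here `at_new_sentence` is `"full"` and `within_sentence` is `"reduced"`, mirroring the full description at e and the reduced descriptions later in that 'sentence'.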
4 Experimental results
The experiment is described below.
• For each nominal anaphor we are concerned with in a set of test data written by humans, repeat the following steps.
  — A nominal description is generated by using a rule, assuming the same semantic structure and context.
  — The resulting description is then compared with the corresponding description in the text.
  — The comparison is a match if both sides are either full or reduced descriptions of the initial reference; otherwise it is a mismatch.
At the end of the experiment, the numbers of matches were collected to show the effect of the rule. Following our previous work (Yeh & Mellish 1994; Yeh 1995:44-45), we employed the same sets of text, Sets 1, 2 and 3, as the test data in this paper. Sets 1 and 2 consist of a number of scientific questions and answers for children, and the other is a brief introduction to modern Chinese grammar. Since the aim of this work is to refine its predecessors (Yeh & Mellish 1994; Yeh 1995:42-77), in the following we focus on nominal anaphors which were correctly matched by using the rule established in (Yeh 1995:66-67). The differences between the total nominal anaphors in Table 2 and those below are the unmatched cases under that rule. We started with a simple rule for the generation of nominal anaphor descriptions, as below.
Data    Matched     A     B     C     D     E   Total     %
Set 1   yes       137    35     0     0     0     172    79
        no          0     0    30     6     9      45    21
Set 2   yes       232    32     0     0     0     264    78
        no          0     0    37    25    11      73    22
Set 3   yes        46    12     0     0     0      58   100
        no          0     0     0     0     0       0     0

Table 3: Result of using the simple rule on the test data
Data    Matched     A     B     C     D     E   Total     %
Set 1   yes       137    28    26     0     0     191    88
        no          0     7     4     6     9      26    12
Set 2   yes       232    27    27     0     0     286    85
        no          0     6     9    25    11      51    16
Set 3   yes        46    12     0     0     0      58   100
        no          0     0     0     0     0       0     0
Table 4: Result of using the preference rule on the test data

Leave the description of the initial reference unchanged for nominal anaphors throughout the discourse.

In other words, according to this rule, only full descriptions of the initial references would be produced. The result of the experiment using this rule is shown in Table 3. The types A to E in this table, and in Table 4, are described in Sec. 2. The result summarises the fact that all of the nominal anaphors having full descriptions are correctly matched by using the simple rule, which amounts to 79, 78 and 100% of the nominal anaphors concerned. However, reduced descriptions and the other two types of descriptions, D and E, would not occur in the generated texts. We then repeated the previous experiment using the preference rule described previously and obtained the statistics shown in Table 4. As shown in the table, by using the new rule, in addition to the fact that the majority of the nominal anaphors using full descriptions are correctly matched, a considerable number of reduced descriptions are matched as well, giving overall matches of 88, 85 and 100%. If we only consider Types A, B and C, namely full and reduced descriptions in the test data, the match rates become 94% (190/202), 94% (284/301) and 100% (58/58). Both groups of figures show that the preference rule is promising for the choice of reduced descriptions for nominal anaphors.
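The percentages quoted for the two rules follow from the per-type cell counts; a minimal sketch recomputing them for Set 1 (counts transcribed from Tables 3 and 4):

```python
# Recompute overall match rates from per-type (A-E) counts of matched
# ('yes') and unmatched ('no') nominal anaphors.

def match_rate(yes_row, no_row):
    """Percentage of matched anaphors among all anaphors concerned."""
    matched = sum(yes_row)
    total = matched + sum(no_row)
    return round(100 * matched / total)

# Set 1, simple rule (Table 3): 172 matched of 217 concerned.
simple_set1 = match_rate([137, 35, 0, 0, 0], [0, 0, 30, 6, 9])
# Set 1, preference rule (Table 4): 191 matched of the same 217.
pref_set1 = match_rate([137, 28, 26, 0, 0], [0, 7, 4, 6, 9])
```

This reproduces the 79% and 88% figures for Set 1; the Set 2 and Set 3 columns can be checked the same way.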
CHING-LONG YEH & CHRIS MELLISH

5 Implementation
The preference rule is currently being implemented in the referring expression component of our Chinese natural language generation system (Yeh 1995:96-139), which generates paragraph-sized texts describing plants, animals, etc., growing in a national park. Basically, the main goal of our work is to generate coherent texts by taking advantage of the various forms of anaphors in Chinese. The system, like conventional ones (McKeown 1985:11-13; Maybury 1990:4-6; Dale 1992:12-13), is divided into strategic and tactical components. Since we do not aim at inventing new concepts in content planning, we borrow the idea of text planning in Maybury's TEXPLAN system (Maybury 1990:101-131) as the basis of the former component. As for the tactical component, we have constructed a simple Chinese grammar in the PATR formalism (Shieber 1986:24-35), which is sufficient for our purpose at the current stage. On accepting an input goal from the user, the system invokes the text planner, which uses the operators in the plan library to build up a plan, a hierarchical discourse structure satisfying the input goal. After text planning is finished, the decision on anaphoric forms and descriptions is carried out by traversing the plan tree. During the traversal, when a reference is met, if it is a subsequent one, then the program consults the rule developed in (Yeh & Mellish 1994; Yeh 1995:66-67) to obtain a form: zero, pronominal or nominal. If the nominal form is decided, then the preference rule in this paper is consulted to get a description. In the domain knowledge base, each entity, in addition to the information for the head noun in the surface form, is accompanied by a property list that will be realised in the modification part of the surface noun phrase for the initial reference.
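The two-stage decision made during the plan-tree traversal, first an anaphoric form and then, for nominal forms, a description, can be sketched as follows. This is a hedged Python illustration: the rule functions are stand-ins for the cited rules, whose actual conditions are not reproduced here, and all record fields are invented.

```python
# Sketch of the two-stage anaphor decision; the rule bodies are stand-ins,
# not the actual conditions of the cited rules.

def choose_form(reference, context):
    """Stand-in for the form rule (Yeh & Mellish 1994; Yeh 1995:66-67)."""
    return "nominal" if context["segment_beginning"] else "zero"

def choose_description(reference, context):
    """Stand-in for the preference rule of this paper."""
    return "full" if context["intention_shift"] else "reduced"

def realise(reference, context):
    if not reference["subsequent"]:
        return ("nominal", "full")      # initial references get full NPs
    form = choose_form(reference, context)
    if form == "nominal":
        return (form, choose_description(reference, context))
    return (form, None)                 # zero/pronominal: no description needed

ctx = {"segment_beginning": True, "intention_shift": False}
assert realise({"subsequent": True}, ctx) == ("nominal", "reduced")
assert realise({"subsequent": False}, ctx) == ("nominal", "full")
```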
For an initial reference, we build up its semantic structure by taking all the elements in the property list along with the substance of the entity, corresponding to the head noun in the surface noun phrase. To simplify the work, for the moment only one element is stored in the property list. When a full description is decided for a subsequent reference, its semantic structure contains the same property and substance information as the initial reference. On the other hand, if a reduced description is decided, only the substance is taken into the semantic structure. In the future, we will extend the property list by allowing multiple elements in the list. In the implementation, we found that the segmentation of a text plan tree into 'sentence' units is essential for a successful implementation of the
constraint of segment beginning in the rule for choosing an anaphoric form and the preference rule for nominal descriptions. Currently, we examine the decomposition field of a planning operator by hand to determine 'sentence' boundaries and fix this for all applications of the operator.

6 Conclusion
A rule for the generation of nominal descriptions based on an empirical study has been presented. The rule uses full and reduced descriptions to characterise shifts of intention in the generated discourse. The experimental results show that 88%, 85% and 100% of the nominal anaphors in the test data can be captured by using this rule. We have implemented the rule in our Chinese generation system and obtained promising results. In the future, we will have the rule more widely tested and will evaluate its performance.

REFERENCES

Chao, Yuan Ren. 1968. A Grammar of Spoken Chinese. Berkeley, Calif.: University of California Press.
Dale, Robert & Nicholas Haddock. 1991. "Content Determination in the Generation of Referring Expressions". Computational Intelligence 7:4.252-265.
Dale, Robert. 1992. Generating Referring Expressions: Constructing Descriptions in a Domain of Objects and Processes. Cambridge, Mass.: The MIT Press.
Grosz, Barbara J. & Candace L. Sidner. 1986. "Attention, Intentions, and the Structure of Discourse". Computational Linguistics 12:3.175-204.
Horacek, Helmut. 1995. "More on Generating Referring Expressions". Proceedings of the 5th European Workshop on Natural Language Generation, 43-58. Leiden, The Netherlands.
Li, Charles N. & Sandra A. Thompson. 1979. "Third-Person Pronouns and Zero-anaphora in Chinese Discourse". Syntax and Semantics: Discourse and Syntax ed. by T. Givon, vol. XII, 311-335. New York: Academic Press.
Li, Charles N. & Sandra A. Thompson. 1981. Mandarin Chinese: A Functional Reference Grammar. Berkeley, Calif.: University of California Press.
Liu, Yu-Cen. 1984. Zhuowen de Fang Fa (Approaches to Composition). Taipei, Taiwan: Xuesheng Chubanshe. [In Chinese.]
Maybury, Mark T. 1990. Planning Multisentential English Text Using Communicative Acts. Ph.D. dissertation, Cambridge University, Cambridge, U.K.
McKeown, Kathleen R. 1985. Text Generation. Cambridge: Cambridge University Press.
Reiter, Ehud. 1990. "Generating Descriptions that Exploit a User's Domain Knowledge". Current Research in Natural Language Generation ed. by Robert Dale, Chris Mellish & Michael Zock, 257-285. London: Academic Press.
Reiter, Ehud & Robert Dale. 1992. "A Fast Algorithm for the Generation of Referring Expressions". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), 232-238. Nantes, France.
Shieber, Stuart M. 1986. An Introduction to Unification-Based Approaches to Grammar. (= CSLI Lecture Notes, 4.) Stanford, Calif.: CSLI.
Tutin, Agnès & Richard Kittredge. 1992. "Lexical Choice in Context: Generating Procedural Texts". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), vol. II, 763-769. Nantes, France.
Yeh, Ching-Long. 1995. Generation of Anaphors in Chinese. Ph.D. dissertation, University of Edinburgh, Edinburgh, Scotland.
Yeh, Ching-Long & Chris Mellish. 1994. "An Empirical Study on the Generation of Zero Anaphors in Chinese". Proceedings of the 15th International Conference on Computational Linguistics, 732-736. Kyoto, Japan.
Generation of Multilingual Explanations from Conceptual Graphs

KALINA BONTCHEVA
Bulgarian Academy of Sciences

Abstract

This paper presents an approach for the generation of multilingual explanations from conceptual graphs in restricted domains with highly conventional language. The generator is developed within a Knowledge-Based Machine Aided Translation project, DB-MAT. The system's objective is to provide the translator with the necessary domain knowledge. The algorithms handle extended referents, complex type definitions and several explanation levels, and provide coherent multisentential text. Complex graphs are broken into simpler subgraphs, which are ordered according to a selected schema.

1 Introduction
The DB-MAT project (DB-MAT 1995) explores an approach in which the translator is supported by linguistic and domain knowledge, presented to him/her during the translation process (v. Hahn 1992; v. Hahn & Angelova forthcoming). Many translations are needed for technical documents in restricted domains, and the translator's familiarity with the terminology is of crucial importance for the quality of the resulting text. Therefore, an integrated MAT system was designed and implemented. The user accesses domain knowledge by highlighting a string in the source text and specifying the type of query. The system has two distinct layers — a language level and a conceptual level. The representation of domain knowledge is language-independent and is based on conceptual graphs (CGs) (Sowa 1984, 1992). A sample graph from the DB-MAT Knowledge Base (KB) in the domain of oil separation is given in Figure 1. The system identifies the relevant domain information and produces a NL explanation (clarification). In DB-MAT, explanation denotes the answer generated from the internal conceptual representation. The result may vary between a single sentence and a paragraph-long text. Thus the translator can find the appropriate verbal description if a term is missing in the target language, or even introduce a new term. This paper will address aspects of the generation process, including an outline of the strategic component (Query Mapper) and the main parts
of the tactical module (EGEN). Since the approach is mainly conceptually oriented, most of the generator layers are language-independent. All generation algorithms were designed after a careful analysis of technical texts in manuals, textbooks and encyclopaedias.

Fig. 1: Conceptual graph in graphical and linear notation

2 Conceptual graphs: A brief introduction
Conceptual graphs (CGs) are finite, bipartite graphs. The nodes are either concepts or conceptual relations. The two kinds of nodes are connected by directed arcs. Concepts can have arcs only to conceptual relations and vice versa. Each n-ary relation has n − 1 incoming arcs and one outgoing arc. Every concept consists of a type label and a referent field (see Figure 1). All concept types form a type hierarchy, which is a lattice. There are four canonical formation rules which are used for derivation: copy — make an exact copy of a graph; restrict — replace a concept type by a subtype or specialise the referent field; simplify — remove all duplicate relations; join — join two graphs on identical concepts. A new concept type is introduced by a type definition, which is a monadic lambda abstraction λa u, where u is a conceptual graph, called the differentia.

type POSITIVE(x) is [NUMBER: *x]->(>)->[NUMBER: 0].

If we have a graph u containing a concept a and a type definition for a, then we can define the operation minimal type expansion. It consists of joining the graph u with the differentia on the concept a. A new relation is introduced by a relation definition, which is an n-adic lambda abstraction on the relation's arguments.

relation AGNT(x,y) is [ACT: *x]->(LINK)->[AGENT]->(LINK)->[ANIMATE: *y].

The operation relation expansion replaces a conceptual relation and its attached concepts with the graph from the relation definition, by making the
necessary restrictions of the concepts. The projection operation operates on two graphs u and v and extracts a subgraph πv of u, called a projection of v in u. The properties of the projection operation are given in (Sowa 1984).

3 Our internal representation
Conceptual graphs are represented as tuples — a graph identifier and a relation list. The relation list contains triples — relation name, argument list and annotation field. The argument list contains concept/graph identifiers and is ordered according to the arc numbers, the last one being the outgoing arc. All concepts are organised in a concept table and have unique identifiers. The referent field is represented as a feature structure, and a unification algorithm is applied. DB-MAT supports extended referents (Sowa 1993): generic, individual marker, set, generic set, counted set, universal quantifier, singular and plural questions, measures and nested graphs. A full description of the PROLOG implementation of conceptual graphs and all related algorithms is given in (Petermann, Euler & Bontcheva 1995).
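As an illustration, the tuple encoding just described might look as follows. This is a Python sketch rather than the system's actual PROLOG implementation, and all identifiers are invented.

```python
# Hypothetical encoding of the graph tuples described above; the concept
# names and identifiers are invented, not taken from DB-MAT.

# Concept table: unique identifier -> (type label, referent feature structure).
concepts = {
    "c1": ("WASTE_WATER", {"ref": "generic"}),
    "c2": ("DISPERSION", {"ref": "generic"}),
}

# A graph is a tuple: (graph identifier, relation list).  Each relation is a
# triple: (relation name, argument list, annotation field).  The argument
# list is ordered by arc number, the last argument being the outgoing arc.
g1 = ("g1", [("CHAR", ["c1", "c2"], None)])

def outgoing(relation):
    """Concept reached via the relation's single outgoing arc."""
    name, args, annotation = relation
    return args[-1]

graph_id, relations = g1
assert outgoing(relations[0]) == "c2"
```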
4 Relevant system components
The domain knowledge (currently in oil separation) is encoded in a set of CGs called the Knowledge Base (KB). It consists of a type hierarchy (lattice), canonical graphs, and type and relation definitions. The algorithm for knowledge extraction applies (maximal) join, projection and other CG operations. The DB-MAT lexicon contains both general lexica (everyday words, function words, etc.) and terminology. All lexicon entries contain the usual information (part of speech, gender, inflection class, etc.). Every domain-specific entry has a link to the KB. The generator also obtains information about a verb's transitivity, a noun's (un)countability, etc. Additionally, EGEN extracts all terms corresponding to a given concept and gives them as synonyms. However, we keep track of the term which was originally highlighted by the user, and it is always this term which is used throughout the explanation. Separate German and Bulgarian morphological components have been implemented. They are used during the generation and query analysis phases. The overall DB-MAT architecture is discussed in more detail in (v. Hahn & Angelova 1994b).
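The lexicon-to-KB linkage and synonym extraction just described can be sketched as follows. The entries are invented for illustration (only the concept link and the highlighted-term bookkeeping come from the text), and Python stands in for the actual PROLOG lexicon.

```python
# Illustrative lexicon with KB links; entries and lemmas are invented.

lexicon = {
    "id5": {"lemma": "Wellplatte", "pos": "noun", "lang": "g", "kb": "COR_PLATE"},
    "id6": {"lemma": "Riffelplatte", "pos": "noun", "lang": "g", "kb": "COR_PLATE"},
}

def synonyms(concept, lang):
    """All terms of one language linked to a given KB concept type."""
    return [entry["lemma"] for entry in lexicon.values()
            if entry["kb"] == concept and entry["lang"] == lang]

# The originally highlighted term is remembered and used throughout
# the explanation, even though synonyms are available.
highlighted = "Wellplatte"
terms = synonyms("COR_PLATE", "g")
assert highlighted in terms
assert synonyms("COR_PLATE", "b") == []   # no Bulgarian term: lexical gap
```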
5 Multilingual generation of explanations

5.1 The main objectives — subject information, coherence and multilinguality
The main task is to provide the translator with the necessary domain knowledge, presented to him/her as a NL explanation in German/Bulgarian. At present several query (explanation) types are supported — definitions, related concepts, characteristics, examples, similarity and difference (in the last two cases the user is prompted for a second term). New query types can be added by the translator and will appear in the menu. The user can customise the Query Mapper strategy per query type, i.e., he/she can specify the relevant relations (v. Hahn & Angelova 1994b). After some experiments with a KB in oil separation, we found that we must often generate a multisentential text. As a result, algorithms providing a coherent output had to be designed and implemented. Due to the specific domain terminology and the established language conventions, the structure of domain-oriented texts can be captured by a set of predefined schemata. Therefore DB-MAT supports three schemata — one for identification (used for definitions in the corresponding LSP — Language for Special Purposes), one for similarity and one for difference. They are rather similar to those introduced in (McKeown 1985). The extracted graphs are ordered according to the selected schema, thus forming a well-structured explanation. For instance, the definition schema first introduces the supertype(s); then come all functions, attributes and parts. Analogies and examples are given last. Analogies appear only in the case of iterative explanations, when a discourse history is available and the new term can be compared to another one already introduced to the translator.
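The schema-driven ordering of extracted graphs can be sketched as follows. The slot names follow the definition schema described above, while the classification of graphs into slots is assumed to be available; this is an illustrative Python sketch, not the system's PROLOG code.

```python
# Sketch of ordering the knowledge pool by a schema; the slot labels follow
# the definition schema above, the slot assignments are invented examples.

DEFINITION_SCHEMA = ["supertype", "function", "attribute", "part",
                     "analogy", "example"]

def order_by_schema(graphs, schema=DEFINITION_SCHEMA):
    """Order extracted graphs according to the slots of the schema."""
    rank = {slot: i for i, slot in enumerate(schema)}
    return sorted(graphs, key=lambda g: rank[g["slot"]])

pool = [{"slot": "example", "graph": "g3"},
        {"slot": "supertype", "graph": "g1"},
        {"slot": "part", "graph": "g2"}]

ordered = [g["graph"] for g in order_by_schema(pool)]
assert ordered == ["g1", "g2", "g3"]   # supertype first, example last
```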
Most of our efforts were concentrated on the definition schema, for several (mainly user-oriented) reasons:
• if the translator is not familiar with a term, most frequently he/she asks about its definition;
• if there is a terminological gap in the target language, the translator will need a definition of the term in order to make a paraphrase or introduce a new term.
Another serious challenge faced by the generator was the difference between concepts in the KB and their NL utterance. Often concepts are expressed by compound terms, and one-word terms are represented by complex conceptual structures. Additionally, some concepts can be expressed in one language while there is no corresponding term in the other. The mapping between concepts and lexicon entries (i.e., existing language terms) is given by PROLOG terms specifying the German/Bulgarian "names" of the KB concept types. If there is no corresponding term (i.e., no available lexicon entry for that language), then the term is missing. For instance, for COR_PLATE we have lex_kb_g(id5, 'COR_PLATE'), where g stands for German (b for Bulgarian) and id5 is the unique identifier of a German lexicon entry, while the corresponding term with b is missing (since there is no such term in Bulgarian). If we want to find the Bulgarian utterance of a graph containing a concept of type COR_PLATE, we take the type definition of COR_PLATE and perform a minimal type expansion. This step is applied iteratively until all concepts can be mapped to legal lexicon entries. Since this operation may lead to over-generation in the case of very complex type definitions, the generator takes a simple subgraph containing only a few different relations and uses it instead of the complex one. A predefined precedence relation is used in such cases, the CHAR and ATTR relations being the most preferred. The latter is due to our studies, which showed that characteristics and attributes tend to be used when compound terms and phrases are formed.
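The iterative use of minimal type expansion to bridge lexical gaps can be sketched as follows. This is a simplified Python illustration: type definitions are reduced to lists of differentia concepts, and the lexicon contents are invented.

```python
# Simplified sketch of iterative minimal type expansion for lexical gaps;
# the type definition and lexicon below are invented for illustration.

lexicon_b = {"FRAGMENTATION", "ENVIRONMENT"}        # concepts with Bulgarian terms
type_defs = {"COR_PLATE": ["FRAGMENTATION", "ENVIRONMENT"]}  # differentia concepts

def expand_until_lexicalised(concept_list, lexicon, type_defs):
    """Replace concepts lacking a lexicon entry by their type definition,
    repeating until every concept can be mapped to a legal lexicon entry."""
    work = list(concept_list)
    result = []
    while work:
        concept = work.pop()
        if concept in lexicon:
            result.append(concept)
        elif concept in type_defs:
            work.extend(type_defs[concept])   # minimal type expansion
        else:
            result.append(concept)            # no definition: keep as is
    return result

out = expand_until_lexicalised(["COR_PLATE"], lexicon_b, type_defs)
assert set(out) == {"FRAGMENTATION", "ENVIRONMENT"}
```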
5.2 Query Mapper — the strategic component
Every query type (e.g., definition, characteristics) has a corresponding set of query graphs which define all conceptual relations relevant to the given query. The Query Mapper identifies the relevant knowledge pool by extracting the projections of the query graphs in all CGs from the KB. Depending on the detail level and query type, our algorithms also extract knowledge inherited from superconcepts. The first step of the Mapper's algorithm is the interpretation of the highlighted sequence. At that phase some typically multilingual problems should be resolved — the processing of terminological gaps and phrase explanations (Winschiers & Angelova 1993). Another possible case is partial term highlighting — the selected sequence has no independent domain meaning, but is a part of one or more complex terms. Then the Query Mapper displays a list of all relevant complex terms, thus enabling the translator to make a new choice. The two special queries, similarity and difference, also change the Mapper's overall strategy (Winschiers & Angelova 1993; v. Hahn & Angelova 1994a). The algorithms rely on the powerful CG inheritance by using the type hierarchy and the CG formation rules.
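The Query Mapper's identification of the relevant knowledge pool can be sketched as follows. The "projection" test here is a crude stand-in for the real CG projection operation, and the query graphs and KB contents are invented for illustration.

```python
# Crude sketch of knowledge-pool extraction; query graphs are reduced to
# sets of relevant relation names, and "projection" is approximated by a
# relation/concept membership test.  All data below is invented.

query_graphs = {"definition": {"SUPERTYPE", "CHAR", "ATTR"}}

kb = [
    {"id": "g1", "relations": [("CHAR", "WASTE_WATER", "DISPERSION")]},
    {"id": "g2", "relations": [("AGNT", "SEPARATE", "SEPARATOR")]},
]

def knowledge_pool(query_type, concept, kb):
    """Collect, per KB graph, the relations relevant to the query that
    mention the highlighted concept."""
    relevant = query_graphs[query_type]
    pool = []
    for graph in kb:
        hits = [rel for rel in graph["relations"]
                if rel[0] in relevant and concept in rel[1:]]
        if hits:
            pool.append({"id": graph["id"], "relations": hits})
    return pool

pool = knowledge_pool("definition", "DISPERSION", kb)
assert [g["id"] for g in pool] == ["g1"]
```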
5.3 EGEN — the tactical component
A very important asset of the CGs proved to be their non-hierarchical structure, since generation may start from any concept node. Therefore, the generator may select the subject and the main predicate from a linguistic perspective. Additionally, the encoded semantics is almost free from any language commitments. Hence the CGs are rather suitable for building language-independent Knowledge Bases (KBs), constructed for a particular level of domain familiarisation. However, different explanation levels can be maintained with the help of the well-defined operations.

5.3.1 Input
EGEN takes as input a list of CGs (the relevant knowledge pool, passed by the Query Mapper), a language, a focus list, a query type and an iterative-call flag. Language specifies the explanation's language. The focus list contains the highlighted concept(s), which should become the global focus of the generated explanation. Usually this list contains one concept, but in the case of a similarity/difference question there will be more items. Additionally, EGEN maintains some discourse information using the iterative-call flag. This flag is set when the user has highlighted a term in the explanation window, asking for further clarification. In this case EGEN should preserve all concepts introduced in the previous explanation on the discourse stack and use proper referring expressions.

5.3.2 Explanation levels
We have made certain steps towards modelling the user's domain knowledge. So far the translator is provided with the following explanation levels (i.e., levels of domain familiarisation) and is free to select one of them:
• Minimal — all complex terms are fully explained (for each complex term having a type definition in the KB, insert a generated explanation using this definition). For instance, if EGEN encounters the concept type SHELL SEPARATOR, then it also inserts an explanation about it at the place of its first occurrence in the context. The type definition is verbalised completely.
• Average — reduced complex-term explanations. Again the type definitions are used, but the definition graph is not processed completely. Only relations included in a user-defined set are verbalised.
• Expert — no additional explanations.
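The three levels can be sketched as a filter on the relations of a type definition. The relation sets below are invented examples of the user-defined set mentioned above, and Python stands in for the system's PROLOG.

```python
# Sketch of per-level relation filtering; the relation sets are invented
# examples, not DB-MAT's actual configuration.

ALL = None  # sentinel: verbalise every relation of the type definition

LEVELS = {
    "minimal": ALL,                # full explanation of complex terms
    "average": {"CHAR", "ATTR"},   # only a user-defined subset
    "expert": set(),               # no additional explanations
}

def relations_to_verbalise(level, definition_relations):
    allowed = LEVELS[level]
    if allowed is ALL:
        return list(definition_relations)
    return [r for r in definition_relations if r in allowed]

definition = ["SUPERTYPE", "CHAR", "ATTR", "PART"]
assert relations_to_verbalise("minimal", definition) == definition
assert relations_to_verbalise("average", definition) == ["CHAR", "ATTR"]
assert relations_to_verbalise("expert", definition) == []
```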
These ideas can be elaborated further to track all terms familiar to the user and use them in consecutive sessions. As a result, EGEN may introduce comparisons with familiar terms. Additionally, in the case of a terminological gap EGEN may apply the respective algorithm and provide the user with the opportunity to enter his paraphrase or newly created term into the lexicon. That information will be user-specific, and the next time EGEN has to express that term it will use the new entry (if the same user is working with the system). In this way DB-MAT will provide the translator with a convenient way to introduce his own terminology into the lexicon and use it consistently afterwards. This will prevent the user from introducing two or more different phrases denoting one and the same missing term (which would result in translation ambiguity).

5.3.3 Utterance forming
Since the KB was not designed under linguistic objectives, all CGs need some pre-processing before they can be verbalised. The system maintains a basic set of relations (like OBJ, AGNT, INST, etc.), which are used actively by the generation algorithm. All new relations are introduced by relation definitions, and the event concepts (all concepts having ACTION or EVENT as their superconcept) have corresponding case frames (canonical graphs that show the expected configuration of concepts and relations). EGEN distinguishes between type definitions and case frames, since the definitions carry the concept's domain semantics, while the case frame is used for purely NL purposes. The pre-processor checks each relation from the input CGs. If a relation is not a basic one, then relation expansion is performed and the resulting graph is checked against the corresponding case frame. The resulting CG is verbalised instead of the original one. In this way, all input CGs are transformed into CGs to which the generation algorithm can be applied. EGEN's design and implementation follow the guidelines given in section 5.4 of (Sowa 1984). We try mapping concepts onto nouns, verbs, adjectives or adverbs, and relations onto "function words" or syntactic elements. However, we have extended Sowa's algorithm to cover a wider range of conceptual graphs:
• Extended referents are handled consistently — e.g., measures, sets, disjunctive sets, etc.
• Relevant properties are grouped together — e.g., if there are several concepts linked by the ATTR relation to a common concept, then the attributes are ordered according to their "distance" in the type hierarchy. For instance, if there are several attributes specifying physical dimensions and other object characteristics, then all dimensional attributes will be grouped together. If, on the contrary, they remain mixed up, the resulting text will sound unnatural.
• Conjunction is introduced when there are two or more concepts linked by matching relations to a given concept — e.g., if there are two attributes of a concept, then they are verbalised in a conjunction.
• The grammar output is not a word sequence but a sentence tree, with the starting category S as root and the generated sentence as leaves. In this way some post-processing can be applied before the sentence is realised as a word sequence.
• EGEN has a rule for relative clauses — if a concept has more than one adjacent OBJ or AGNT relation, then a relative clause is generated (Zock 1997). The selection of the appropriate connecting word (which, who, where) depends on the concept's place in the type hierarchy. For instance, where is used for PLACEs and who for PERSONs.
The referents are processed by the grammar rules, and their value determines the number and the article, although these might be overridden by information from the lexicon. In principle, generic referents are verbalised as an indefinite article, unless the noun is uncountable; individuals as proper names; definite referents as a definite article; sets as an NP with all elements. Collective and disjunctive sets are distinguished, and stylistic rules decide whether the elements should become adjectives or be enumerated (see the example below); generic sets are verbalised as plural, indefinite; counted sets as a number and plural; and measures as a number and the respective unit (e.g., 5 mg/l).

[PHYSICAL STATE: disj {MEMBRANE, DROPS, COLLOID, EMULSION, SOLUTION}]
physikalische Eigenschaft: membranartig, tropfenförmig, kolloid, emulsionsartig oder gelöst (oder is generated to convey the disjunctive sense).

5.4 Sample output
The given example (see Figure 2) results from a user's request for a definition of Dispersion in German. The respective type definition is extracted together with another relevant graph. After the definition schema is applied, the graphs are ordered as shown. The definite article is used for Dispersion in the second sentence, since the concept is already present in the current context.
typedef DISPERSION(x) is
  [FRAGMENTATION]-
    (IN)->[ENVIRONMENT]
    (OF)->[DISPERSION PHASE]->(ATTR)->[HARD].
[DISPERSION]<-(CHAR)<-[WASTE WATER].

Output:
Eine Dispersion gehört zu dem Zerteilungsgrad (the supertype) einer festen dispersen Phase in einem Dispersionsmittel (type definition of DISPERSION). Die Dispersion ist ein Kennzeichen von Abwasser (characteristics).

Fig. 2: Sample knowledge pool and final German output with comments
6 Implementation
The current demo version of the generator is implemented in LPA PROLOG for Macintosh. There is a running prototype of the system in which explanations are generated for the basic terminology of oil separation.

7 Conclusion
This paper has presented our approach to the generation of multilingual explanations from CGs. The described algorithms can be applied only in restricted domains, where terms and expressions denote existing objects or phenomena, i.e., where all domain knowledge is language-independent. Another serious limitation is the predefined schemata, which are dictated by the highly conventional technical language but are not applicable to other domains. EGEN handles arbitrarily complex CGs, including propositions, statements and situations. In the future EGEN will be extended to cope with coreference links and negation. Our method can also be extended to account better for the user's domain knowledge, text coherence and discourse structures.
Acknowledgements. I am particularly obliged to Dr. Angelova, Prof. v. Hahn and all those people involved in the development of DB-MAT, without whom this work would not have been possible. Information about the DB-MAT system is available on the World Wide Web at http://www.informatik.uni-hamburg.de/Arbeitsbereiche/NATS/projects/db-mat.html.

REFERENCES

v. Hahn, Walther. 1992. "Innovative Concepts for Machine Aided Translation". Proceedings of VAKKI, 13-25. Vaasa, Finland.
v. Hahn, Walther & Galja Angelova. Forthcoming. "Knowledge Based MAT". To appear in Computers and AI, Bratislava.
v. Hahn, Walther & Galja Angelova. 1994a. "Providing Factual Information in MAT". Proceedings of the Int. Conf. Machine Translation: Ten Years On, 11-1 to 11-6. Cranfield, U.K.
v. Hahn, Walther & Galja Angelova. 1994b. "System Architecture and Some System-Specific Components in Knowledge Based MAT". Technical Report, Project DB-MAT, Rep. 1. Hamburg: Hamburg University.
McKeown, Kathleen. 1985. Text Generation: Using Discourse Strategies and Focus Constraints to Generate Natural Language Text. Cambridge: Cambridge University Press.
Petermann, Heike, Lutz Euler & Kalina Bontcheva. 1995. "CGPro — a Prolog Implementation of Conceptual Graphs". Memo, FBI-HH-M-251/95. Hamburg: Hamburg University.
Sowa, John. 1984. Conceptual Structures: Information Processing in Mind and Machine. Reading, Mass.: Addison-Wesley.
Sowa, John. 1992. "Conceptual Graphs Summary". Conceptual Structures: Current Research and Practice ed. by T. Nagle et al., 3-51. New York: Ellis Horwood.
Sowa, John. 1993. "Relating Diagrams to Logic". Conceptual Graphs for Knowledge Representation (Lecture Notes in AI 699) ed. by Guy Mineau, 1-35. Berlin: Springer Verlag.
Winschiers, Heike & Galja Angelova. 1993. "Solving Translation Problems of Terms and Collocations Using a Knowledge Base". Technical Report, Project DB-MAT, Rep. 3. Hamburg: Hamburg University.
Zock, Michael. 1997. "Sentence Generation by Pattern Matching: The Problem of Syntactic Choice". Recent Advances in Natural Language Processing ed. by Ruslan Mitkov & Nicolas Nicolov, 317-352. Amsterdam & Philadelphia: John Benjamins. (This volume.)
V CORPUS PROCESSING AND APPLICATIONS
Machine Translation: Productivity and Conventionality of Language

JUN'ICHI TSUJII
UMIST

Abstract

Linguistics-based machine translation (LBMT) has been the dominant framework in MT research since the beginning of the eighties. However, I argue that several assumptions on which research in LBMT has been based do not hold in the translation of actual texts. In particular, I discuss why the notions of possible translation and compositionality of translation, both of which have their roots in monolingual studies of syntax and semantics, have been wrongly promoted by theoretical linguists, and how these notions have (mis)led researchers in MT in wrong directions. I then discuss how we should proceed in the future and what types of research should be pursued. In conclusion, I illustrate what an ideal architecture for MT systems should look like.

1 Introduction
There had been strong interest and high expectations in Machine Translation throughout the 80s in the research community, in commercial industry and among potential users of MT systems. However, the interest and high expectations seem to be waning somewhat in the 90s. This is partly because quite a few commercial products have been brought onto the market and MT systems have become familiar to the general public as well as to these communities. It is also because people's expectations have become more modest and reasonable. People are beginning to realise now that MT systems are not very special but are simply ordinary information processing tools. It is generally a good thing that people have a clear picture and reasonable expectations of this new technology. However, it is also the case that the current MT technology does not meet the initial expectations that people had, and that the current MT systems do not cover the demands of a potentially very large translation market. There is also acute frustration about the fact that theoretical research in MT has not contributed to the development of MT systems at all. Though it generally takes some time for the results of research to be reflected in commercial products, it is, nonetheless, frustrating.
In this paper, I will discuss what went wrong with theoretical research in MT and what is lacking in the current MT technology to bridge the gap between research and development.

2 Disappointment
While there are many successful applications of MT systems, there is also disappointment about the current state of MT systems among people who have either invested in the field or been involved in research and development. Investors are disappointed because the market for MT systems as they are now is much smaller than they thought. Users are disappointed because the quality of translation produced by systems does not meet their standards. Researchers are disappointed because their methodologies failed to deliver what they thought they could. Part of the disappointment is due to the fact that their expectations at the beginning of the 80s were unrealistically high. However, even those who claim that current MT systems are quite successful (actually, I am one of them) may admit that the success of the current MT is a fairly restricted one. In order to widen the range in which MT systems can be used, and in order not to repeat the same mistakes in the future, we have to learn lessons from past experience, in particular the experiences of the last 10 years. The following are the lessons I think we can draw from past experience.
1. Linguistics in a narrow sense is not as useful as we expected.
2. MT systems as we conceived of them at the beginning of the 80s do not meet actual market demands.
3. There is no such thing as a universal MT system.
These lessons may sound all too familiar to those who were engaged in MT in the 50s and 60s (unfortunately, I was not). The first one, for example, simply says that linguistics alone is not able to solve most of the problems MT systems encounter. As Bar-Hillel claimed a long time ago, translation requires understanding, which in turn requires real-world knowledge. However, I do not claim that "therefore, in order to improve the quality of MT, we have to integrate understanding, or processing based on real-world knowledge, with MT".
This has been claimed since a long time ago and serious attempts have been made during the 80s, which were equally unsatisfactory. We cannot be so naive now.
MACHINE TRANSLATION
379
I start with two myths in theoretical research on MT, which have influenced the way of thinking of the whole research community and which I believe have led the research field in a wrong direction.

3 Myth-1: Compositionality of translation
Language is infinite. The infiniteness of language is the main cause of difficulties in NLP applications, including MT. The linguists who have been involved in MT since the beginning of the 80s have emphasised the importance of coping with the infinite nature of language. The solution they propose is compositionality of translation. Like compositionality of meaning in mono-lingual theories, it associates translation with linguistic structures of some sort. That is, translations (or meanings) of complex expressions are determined by their parts, and the relationships between the complex expressions and their parts are, for example, determined by their phrase structures. Though most MT systems use more abstract levels of representation than phrase structures, the basic scheme remains the same. The strict form of compositionality of translation seems to be based on the following two assumptions.

[ASP 1] Translation equivalence by identity of meaning: Assuming the existence of meanings which are independent of context, translation-equivalent expressions in different languages have the same meanings.

[ASP 2] Independent status of structure equivalence: Assuming that a complex expression in one language can be decomposed into its sub-expressions with a constructor (constructors can be syntactic, semantic, etc.: in syntax, individual phrase structure rules or grammatical functions such as SUBJ and OBJ; in semantics, deep cases or thematic relations such as Agent), the translation can be constructed from the translation equivalences of the sub-expressions, by using the constructor of the target language which is translationally equivalent to the constructor of the source. Like Montague's semantic theory, they assume that translation equivalence of constructors in two languages can be established regardless of the sub-expressions which are combined by them.

Most transfer-based MT systems assume, to varying degrees, that [ASP 2] is the case. The transfer phase descends down the structural
380
JUN'ICHI TSUJII
description of a source sentence from top to bottom; at each level, it decomposes a complex expression into its sub-expressions and constructor. Then it ascends from bottom to top to construct a target sentence; at each level, it composes a complex expression by using a translation-equivalent constructor. If [ASP 2] is really the case, the transfer phase is a simple recursive process as described above. However, developers of MT systems which are being used for actual translation know, through their experience, that the transfer phase cannot be as neat as described above. There are many cases where the independent status of constructor equivalence is challenged and the equivalences of constructors are affected by the sub-expressions to be combined. Such cases are abundantly observed in terminological expressions, lexical gaps, idioms, pseudo-idioms, speech patterns (Alshawi 1991), etc. While we know that the naming or labelling of real-world entities by words is arbitrary and simply the convention of individual languages, conventionality of language use is much more pervasive than we thought. In other words, conventionality permeates the other aspect of language use, i.e., productivity, which compositionality of translation emphasises. These conventionalised expressions cause difficulties for compositional theories in general, but they are more serious in translation, because two languages have their own different conventions. There are basically two alternative ways of treating the problems. One is to admit the empirical facts and demote the status of constructors. MT systems based on lexicon-oriented views, perhaps inadvertently, took this path. They use mono-lingual constructors only as descriptors to define a translation-equivalent pair which contains a specific word or words. In their frameworks, there is no such thing as constructor equivalence (or structure equivalence).
As a result, structure transfer is performed as part of lexical transfer. The other alternative is to ignore the empirical facts and push [ASP 2] to the extreme. If [ASP 2] is the case, one of its logical consequences is the possibility of discovering a set of universal constructors. While establishing a set of universal lexical items is hard simply due to the sheer size of the vocabulary, the number of constructors seems fairly small, whether they are syntactic, semantic or pragmatically motivated ones. EUROTRA seems to have taken this line of reasoning and reached the idea of simple transfer. As in lexicon-oriented MT systems, structure transfer is eradicated, but for a very different reason. In this framework, the status of constructor
equivalence obtains supreme independence and is actually represented as universal constructors. Though lexical transfer can change the structure, this is treated as an exception. ROSETTA also maintained [ASP 2] and tried to co-ordinate the constructors of two languages (Appelo 1987; Landsbergen 1989). The result seems to be a proliferation of constructors which cannot be justified mono-lingually. The results of the two attempts, EUROTRA and ROSETTA, show that [ASP 2] is empirically wrong and that frameworks based on this assumption do not work.

[ASP 1] is explicit in a naive interlingual approach, in the sense that the meaning which guarantees translation equivalence of expressions in various languages is explicitly represented at the level of the interlingua. However, all sentence-based MT systems implicitly share this assumption. As the ROSETTA group rightly claims, the meaning which guarantees the translation equivalence need not be explicitly represented, but when one defines two expressions (structures and/or words) as translation equivalents, one assumes that the meanings of the two are the same. More precisely, when compositionality of translation decomposes translation equivalents of complex expressions into translation equivalents of sub-expressions, it implicitly assumes that the meanings of these expressions (the sub-expressions as well as the complex expressions) are the same as those of their corresponding target expressions, and that the meanings which matter can be established independently of the contexts. Thus, translation equivalence of a larger expression can be reduced to a collection of translation equivalences of smaller expressions, the equivalences of which are established regardless of the larger units of expressions. However, [ASP 1] is very doubtful from an empirical point of view.
It is normal rather than exceptional in human translation that extra phrases or words are added, or that phrases or words in the source disappear in the target. Obviously, in human translation, two translation-equivalent sentences do not have the same meanings established independently of context. More seriously, as we will see in Section 5, the context-independent meaning postulated in [ASP 1] is not so crucial in translation; rather, context-dependent interpretation plays the decisive role. As we discuss later, this context-dependent interpretation, together with the conventionality of language use, affects translation in subtle ways and makes compositionality of translation an irrelevant straitjacket.
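The simple recursive transfer that [ASP 2] licenses, descending a source structure and recombining translated parts with an equivalent target constructor, can be sketched as below. This is not the paper's own implementation, only an illustration of the scheme it criticises; all names and the tiny equivalence tables are hypothetical.

```python
# Sketch of recursive transfer under [ASP 2]: each source constructor maps
# to exactly one target constructor, and each word is translated in
# isolation. The tables and rule names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                       # constructor (rule name) or word
    children: list = field(default_factory=list)

# Constructor equivalences assumed to hold regardless of sub-expressions.
# The target label "VP->NP V" records that generation would order the
# parts verb-finally; the sketch keeps parts in source order.
CONSTRUCTOR_EQUIV = {"S->NP VP": "S->NP VP", "VP->V NP": "VP->NP V"}
LEXICAL_EQUIV = {"I": "watashi", "see": "miru", "GP": "isha"}

def transfer(node: Node) -> Node:
    if not node.children:                         # leaf: lexical transfer
        return Node(LEXICAL_EQUIV[node.label])
    # descend: transfer each sub-expression independently ...
    subparts = [transfer(child) for child in node.children]
    # ... ascend: recombine with the equivalent target constructor
    return Node(CONSTRUCTOR_EQUIV[node.label], subparts)

src = Node("VP->V NP", [Node("see"), Node("GP")])
tgt = transfer(src)
print(tgt.label, [c.label for c in tgt.children])
# VP->NP V ['miru', 'isha']
```

The critique above is precisely that real transfer refuses to stay this clean: the constructor mapping is affected by the words being combined.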
4 Myth-2: Possible translation
Since the early 80s, research in MT has been getting more and more similar to research in theoretical linguistics of a certain type. Both rely heavily on human intuition. As linguists (of a certain type) do, researchers in MT tend to ignore phenomena occurring in real translation by human translators, pick artificial examples they produce themselves, and thus put disproportionate emphasis on certain specific problems. Though it is problematic in some cases, grammatical judgement by intuition works and has played a major role in mono-lingual research on syntax. However, the same methodology has not worked so well for MT research. To judge the correctness of a translation, and thus define translation equivalence, for a single sentence without considering context has turned out to be much more problematic. The same sentence can and should be translated differently, depending on the context in which it appears. (I may sound as if I am emphasising the context dependency of translation, but this is not my intention in this paper; see Section 5.) Because it is generally difficult to circumscribe the context in which a sentence appears, the judgement becomes subjective, i.e., the judgement depends on the context which a reader of that sentence happens to come up with. Researchers in the linguistics-based MT paradigm tried to dissociate translation from context by introducing the distinction between possible translation and good or correct translation, a distinction which reminds us of the distinction between competence and performance. They argue that, given a context, only a subset of the possible translations are correct ones, and that one has to concentrate on possible translation in theoretical research. However, this distinction seems more problematic and fragile than the distinction between competence and performance. Firstly, as S. Nirenburg rightly pointed out, it avoids the problems related to ambiguity, which pose real difficulties in actual MT systems. More seriously, unlike grammatical judgement (in which native speakers have to say only yes or no), one actually has to generate all instances of possible translations of a given expression in every conceivable context. Without serious empirical investigation, it is very difficult, if not impossible, to generate all possible translations for a given sentence in a contextual vacuum. As a result, while ideally a set of possible translations has to be determined independently of a theory or a particular system, those who are engaged in MT development or MT theory, not translators, determine such a set by themselves. The consequence is that the definition of possible
translation becomes a theory-internal concept. In short, a set of possible translations is defined as a set of translations which a given system (or theory) produces but from which the system (or theory) cannot choose the correct ones (due to the lack of context, etc.). A set of possible translations as such has nothing to do with the set of translations which actually appear as translations in real texts (see Section 5). In short, the concept of possible translation provides a convenient excuse for researchers to play with toys, and contributes to cutting theoretical MT research off from its empirical basis. It has led the whole research field in a wrong direction. It is also obvious that, due to this excuse, researchers have been able to ignore the obvious fault of [ASP 1]. On the other hand, [ASP 1] gives the illusion that translation problems can be discussed without referring to context (because translation-equivalent relationships can be defined in terms of context-independent meanings), and reinforces the myth of possible translation.

5 Examples: Metonymic nature of language and translation
Let us look at several examples to illustrate the points I have made.

[Fact 1] Kitamura & Matsumoto (1995) reported that only 236 Japanese-English word pairs are registered in one of the most comprehensive bilingual dictionaries for human use, out of 948 word pairs which their alignment program discovered from real texts (24%).

[Fact 2] We examined the manual of UNIX and found that, among the 15 Japanese equivalents for the English verb to match listed in an English-Japanese dictionary, only two Japanese equivalents (taiousuru and icchisuru) appear as translations for 125 occurrences of the word.

These facts show that even the possible translations given by lexicographers, who are more empirical than linguists or computer engineers, do not reflect the actual translations produced by translators. To see why discrepancies like [Fact 1] arise, let us consider the following simple example.

[Example 1] (by Jiping Sun, UMIST)
English: I will go to see my GP tomorrow.
Japanese: Watashi(-I)-wa Asu(-tomorrow) Isha(-GP)-ni Mite(-check) Morau(-beneficiary causative).
Literal translation of the Japanese: I will ask my GP to check me tomorrow (and I will benefit from the action).

While a compositional translation of the English into Japanese, such as Watashi(-I)-wa Asu(-tomorrow) Isha(-GP)-ni ai(-meet)-ni iku(-go), is possible, this translation implies that the speaker will meet his/her GP to discuss something unusual (like a mis-diagnosis, fees, etc.). What is happening in this example is that, although the two languages, English and Japanese, describe the same situation (an aggregation of actions), they verbalise different actions in the aggregation. Which aspect of a complex reality a language verbalises is somewhat fixed, and when one does not follow the convention, additional meanings are conveyed. The process of human translation of the above example is roughly described, using the KBMT framework and a Schankian type of representation, as follows.

[Step-1: Understanding] The first phase is to understand what situation is described. The result of understanding would be, though naively, represented by something like: AGGR-1 {[the speaker GOES to some place like a hospital], [s/he and her/his GP MEET], [the GP CHECKS him/her], etc.}. This step uses knowledge about conventions in English, namely that the expression "go to see one's GP" is used to describe a situation which can be described by AGGR-1.

[Step-2: Paraphrase] This phase is to choose which part of AGGR-1 is to be verbalised in Japanese, following the conventions of Japanese. That is, a human translator knows that in Japanese a situation like AGGR-1 is described by verbalising one part of it, i.e., [the GP CHECKS the speaker], and using a beneficiary causative to express the speaker's initiative (in English, this part is expressed implicitly by [I go] and [I see my GP]).

The actual process would be more complicated.
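A deliberately naive sketch of the two steps: a lookup from a conventional English expression to the situation it describes (AGGR-1), then a lookup from that situation to the part of it that Japanese conventionally verbalises. All table entries and names here are illustrative, not taken from any actual KBMT system.

```python
# [Step-1: Understanding] conventional expression -> described situation
UNDERSTAND = {
    "go to see one's GP": ("AGGR-1", ["speaker GOES to surgery",
                                      "speaker and GP MEET",
                                      "GP CHECKS speaker"]),
}

# [Step-2: Paraphrase] situation -> the action Japanese verbalises,
# plus the construction that expresses the speaker's initiative
PARAPHRASE = {
    "AGGR-1": ("GP CHECKS speaker",
               "beneficiary causative: isha-ni mite morau"),
}

def translate(expression: str) -> str:
    aggr, actions = UNDERSTAND[expression]      # Step 1: understanding
    verbalised, rendering = PARAPHRASE[aggr]    # Step 2: paraphrase
    assert verbalised in actions                # verbalise part of AGGR-1
    return rendering

print(translate("go to see one's GP"))
# -> beneficiary causative: isha-ni mite morau
```

The static tables are exactly what makes this sketch naive: as the discussion below notes, both steps are in reality dynamic and context-sensitive.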
The understanding step in human translation is a more dynamic and flexible process which associates the compositional meaning of to go to see one's GP with a typical situation (like AGGR-1) of someone visiting his/her GP. This interpretation process certainly uses general knowledge and context, either by inference or by association. Because of the dynamic nature of the interpretation phase,
if the context indicates that an unusual incident happened between the speaker and his/her GP, a human recognises it, and the same compositional meaning of the sentence would be linked with another aggregation. Because the GP may not check him/her medically in such circumstances, the Japanese translation would be different. The paraphrasing phase would be equally dynamic. That is, if the context suggests that the speaker's action of going is crucial, then the second phase has to choose a different construction, in which go is verbalised as the main verb and ask the GP to check him/her is realised as a subordinate clause. Though it may sound trivial, [Example 1] illustrates how human translation is performed through understanding of what is actually described. However, it is not my intention to emphasise the context-dependent (or dynamic) nature of human interpretation or paraphrasing. My point here is the interaction of the metonymic nature of language and the conventionality of language use, which makes compositionality of translation irrelevant to actual translation and which is revealed even in the translation example given in [Example 1]. That is, a sentence in one language describes metonymically part of the complex reality which it intends to describe, and which part of the same reality is explicitly expressed depends on the individual language. It is obvious that understanding results such as AGGR-1 are completely different from the meanings intended in [ASP 1], which are established independently of context and which can be computed from the context-independent meanings of individual words like go, see and GP. As a result, we have rather strange translation pairs, strange from the compositional view of translation, such as the pair [X sees Y] and [Y checks X], even in a normal situation. Such a correspondence can hardly be imagined when one tries to enumerate the possible translations of to see.
(Though this example is somewhat similar to the well-known example of miss and manquer in English and French, and I feel there is some continuum, this correspondence is very specific, unlike (miss, manquer): if doctor is replaced by my lawyer, then we have to use discuss or consult instead of check, and the causative construction is no longer appropriate.)

However, as [Fact 1] indicates, it seems that such ad hoc correspondences are the norm rather than the exception. In reality, it does not make any sense to discuss correspondences in terms of the linguistic structures of two sentences, because the two sentences describe different parts of the reality and
their compositional meanings are consequently very different. The basic assumption of compositional translation does not hold. The example given by M. Kay, validate a ticket and invalidate a ticket in French and German, illustrates a similar point: different conventions of verbalisation shared by two speech communities result in un-conceivable translation pairs like ([X validate Y], [X invalidate Y]). The same state of affairs is expressed metonymically in Japanese by focussing on a particular action, like punch a ticket, and thus we end up with the equally un-conceivable pair ([X validate Y], [X punch Y]), which no one expects to find in a bilingual dictionary.

So far, we have discussed rather general examples. [Fact 2], however, indicates a different aspect of the problems of possible translation. That is, given the fixed context of Unix (I use the term context in a broader sense, which includes the communicative environment where a text is prepared), the set of possible translations which lexicographers enumerate without context is simply too large, and thus makes the problem of disambiguation unnecessarily difficult. Furthermore, if a specific context such as Unix manuals is fixed, we have many more conventions to which texts have to conform. Here again, different languages follow different conventions.

[Example 2] Maruyama (1992) observed that drastic structural changes are often required in a production manual for mechanical devices, such as the following.
Japanese: Buhin(-parts)-no Iro(-colour)-ha Hyo(-Table)-2-ni yoru-mono-to suru.
English: See Table-2 for the colours of the parts.

While the Japanese sentence literally means As for the colours of the parts, one is supposed to follow Table 2, there are no expressions corresponding to follow and be supposed to in the English translation. On the other hand, see appears in English. This is because the manuals in the two languages follow different conventions to express the same information, i.e., that the colours of the parts are listed in Table 2. Maruyama said that there are many such significant structural changes in the manual he examined. Again, it seems ridiculous to claim that the two sentences have the same compositional meanings.
6 Conceptual design of a simple MT system
In the previous section, I claimed that human translation is based on understanding of what is described or what is intended. However, this does not imply, for example, that an MT system, as an engineering system, has to represent understanding results such as AGGR-1 explicitly and simulate the human process of interpretation and paraphrasing. First of all, though I used an informal Schankian script to represent understanding results, it is not at all clear how we can actually represent them in a computationally sound way. As Kay's example illustrates, the symbolism at this level is not trivial at all (how can we represent a ticket is valid or a ticket is not valid without referring to the whole social system associated with tickets?). Secondly, even if one could represent them, how can one relate them to the target-oriented paraphrasing? The target language may have some general principles that determine which part of a complex reality should be expressed explicitly. Unfortunately, we know almost nothing about this process. Unless the process which manipulates objects like AGGR-1, and which can dynamically change the interpretation and the paraphrase, is realised computationally, representing them explicitly does not contribute to MT. Thirdly, there are many cases, like [Example 2] and Kay's example, where the conventions about how to verbalise are really specific to individual events or to the information to be described. In other words, they have characteristics similar to terminological expressions, and we have to treat them in the same way as we treat terminological expressions. Neither the structure of expressions nor the internal structure of understanding results is crucial for translation.

Let us assume here the extreme: that conventionality, not productivity, of language plays the dominant role in translation, and that the role of translation is to transfer the conventions of one language to the corresponding conventions of another language.
And assume that almost all linguistic expressions have characteristics similar to terminological expressions. In other words, complex expressions in one language are related to expressions in another language regardless of their internal linguistic structures. As correspondences between terminological expressions (terms) are often expressed through language-independent concepts, let us use objects like AGGR-1 as links, without analysing their internal structures. Then the correspondences in the two examples would be represented as follows:
ENG: [e(X) go to see e(X's) GP] ↔ AGGR-1[X] ↔ JPN: [j(X)-wa j(X)-no isha-ni mite morau]

ENG: [See e(X) for e(Y)] ↔ AGGR-2[X, Y] ↔ JPN: [j(Y)-ha j(X)-ni yoru mono-to suru]

I use an informal notation in the above, such as e(X) and j(X), which mean the translation of X in English and Japanese, respectively. X is a variable in AGGR-1. Unlike pure terminological terms, the expressions to be related in the above examples contain variables like X and Y. In order to make the first correspondence more general, one can introduce another variable with its own restriction, such as:

ENG: [e(X) go to see e(X's) e(Y)] ↔ AGGR-1[X, Y] ↔ condition[Y is a medical-profession] JPN: [j(X)-wa j(X)-no j(Y)-ni mite morau]

This rule can be compared with another rule, which is concerned with a similar but different situation, like to go to see one's lawyer, and which leads to a different translation in Japanese:

ENG: [e(X) go to see e(X's) e(Y)] ↔ AGGR-3[X, Y] ↔ condition[Y is a legal-profession] JPN: [j(X)-wa j(X)-no j(Y)-ni soudan-ni iku]

Here we have the word soudan (consult in English) in Japanese, and the beneficiary causative is no longer used. As one can easily see, these correspondence rules look like the transfer rules used in actual commercial MT systems. They stipulate correspondences which can hardly be justified on a purely linguistic basis. They are also very specific, in the sense that they contain many individual words, like see, go, miru, soudan and morau. Furthermore, they introduce a rather ad hoc classification of nouns, like medical-profession and legal-profession. While researchers with theoretical orientations considered rules such as these awkward and ad hoc and started to tidy them up, it seems to me that they actually reflect certain essential aspects of translation. What was
essentially wrong with their attitude is that they took these as exceptions to general rules, or simply ignored the empirical facts of translation altogether. The framework is obviously naive in many respects. In particular, it may need structural annotations in ENG and JPN in order to use these correspondences for constructing larger expressions which contain them as parts. It may also be desirable for the structural annotations to be made at a certain abstract level, in order to allow some freedom in the generation phase, etc. However, structural annotations as such play only the role of descriptors for defining correspondences between complex expressions, and do not have the independent status on which translation equivalence is defined.
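The correspondence rules of this section, English and Japanese templates linked through an unanalysed AGGR label, with typed variables, can be sketched as follows. The ontology, lexicon and templates are illustrative assumptions, not the paper's actual rule base, and the English template is carried only as a descriptor (a real system would unify it against the source structure).

```python
# Correspondence rules linked through unanalysed AGGR labels, with a
# condition on the variable Y. All tables here are illustrative.
IS_A = {"GP": "medical-profession", "doctor": "medical-profession",
        "lawyer": "legal-profession"}

# j(X): Japanese translation of X (hypothetical mini-lexicon)
J = {"I": "watashi", "GP": "isha", "lawyer": "bengoshi"}

RULES = [
    # (AGGR label, English template, condition on Y, Japanese template)
    ("AGGR-1", "{X} go to see {X}'s {Y}", "medical-profession",
     "{X}-wa {X}-no {Y}-ni mite morau"),
    ("AGGR-3", "{X} go to see {X}'s {Y}", "legal-profession",
     "{X}-wa {X}-no {Y}-ni soudan-ni iku"),
]

def transfer(x: str, y: str) -> str:
    for aggr, eng, cond, jpn in RULES:
        if IS_A.get(y) == cond:       # disambiguate via the condition on Y
            return jpn.format(X=J[x], Y=J[y])
    raise LookupError(f"no correspondence rule for Y={y}")

print(transfer("I", "GP"))      # watashi-wa watashi-no isha-ni mite morau
print(transfer("I", "lawyer"))  # watashi-wa watashi-no bengoshi-ni soudan-ni iku
```

Note that the condition on Y carries the whole disambiguation burden here; point [3] in the next section questions exactly this assumption.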
7 Other frameworks and future directions
I have illustrated what the rules in the MT system I have in mind look like, and indicated that they have more similarities with the rules in traditional transfer-based systems than with those in linguistically motivated prototype MT systems. The reason for this is that linguistically motivated research has made several wrong assumptions about translation, notably those related to compositionality of translation and possible translation, which miss the true nature of translation. However, several other research frameworks, such as EBMT (Example-based MT) (Nagao 1984; Nagao 1992; Sumita 1991; Furuse 1992; Jones 1992), SBMT (Statistics-based MT) (Brown 1992) and KBMT (Knowledge-based MT) (Nirenburg 1989), have been proposed and have attracted more and more interest from the research community. These frameworks have taken orientations distinctly different from linguistically motivated MT (LBMT) and do not have the defects I discussed in this paper. EBMT and SBMT, for example, which use translations produced by human translators as a major source of knowledge, will never be separated from the empirical facts of translation. In the following, I will summarise my points by referring to these frameworks as well as the traditional transfer paradigm.

[1] If the nature of transfer rules is as I have described, then recursive transfer, which relies heavily on the structure of a source sentence, may not work well. The straitjacket imposed by recursive transfer has to be relaxed. The transfer process is more like the process of solving a jigsaw puzzle.
[2] The framework in Section 6 and EBMT have many things in common. The major difference is whether one introduces variables and their conditions explicitly or not. The framework in Section 6 assumes a phase of Knowledge Preparation, in which individual correspondence rules are identified (e.g., how many distinct AGGRs have to be established) and the parts to be represented by variables are identified. In this phase, one has to examine translations given by human translators. The obvious advantage of EBMT is to avoid this Knowledge Preparation phase and just use examples as transfer rules. However, as several groups in the EBMT camp admit, careful scrutiny of examples is vital for the success of EBMT, which implies that EBMT also has to have a knowledge preparation phase of some sort.

[3] In Section 6, I argued as if the disambiguation between AGGR-1 and AGGR-3 were to be made simply by referring to the properties of Y. However, there is no guarantee that disambiguation is possible only by examining the internal structures of the expressions to be transferred, although most traditional transfer-based systems assume that this is the case. As we saw in Section 5, the same expression go to see one's GP has to be related to different AGGRs, depending on the context in which the expression appears. Whether disambiguation has to be performed in a rationalistic way (through explicit understanding of what a text describes, as in KBMT) or otherwise remains to be seen. The key problem here is how to characterise the context which affects selection. EBMT, for example, provides an empiricist alternative to the problem. It is also plausible to apply methods, statistical or connectionist, which have proven effective for sense disambiguation of lexical items.

[4] I treat AGGRs as unanalysable wholes, which directly connect expressions of two languages.
It might be possible, as the KBMT camp and Dorr (1994) have done to a certain extent, to analyse the internal structures of AGGRs and to discuss (and implement) the processes of interpretation and paraphrase which dynamically associate linguistic forms with AGGRs. However, considering our current understanding of these processes, it would be too ambitious to map a source text to AGGRs and then generate a target text.
[5] Although I have not discussed it in this paper, I share with many people the belief that there is no such thing as a universal MT system. This belief has two consequences. One consequence is that every MT system has to be tuned towards specific subject domains and text types. In my model, for example, a set of correspondence rules has to be prepared for every different sublanguage (Ananiadou 1990). Consequently, the knowledge preparation phase plays a crucial role, and I believe that the technology for automating this phase will be vital for broadening the range of MT applications (Tsujii 1992; Tsujii 1993). The other consequence of this belief is that the architecture of MT systems has to be diversified. Which architecture, transfer-based MT, EBMT, KBMT, LBMT or the framework illustrated here, is most appropriate is highly dependent on the complexity of translation required by a given sublanguage and the functionality of a system required in a specific application.

REFERENCES

Alshawi, Hiyan, D.J. Arnold, R. Backofen, D.M. Carter, J. Lindop, K. Netter, S.G. Pulman, J. Tsujii & H. Uszkoreit. 1991. EUROTRA-6/1 Final Report. Technical Report produced for the Commission of the European Communities, Luxembourg. Cambridge: SRI International, Cambridge Computer Science Research Centre.

Ananiadou, Sofia. 1990. "Sublanguage Studies as the Basis for Computer Support for Multilingual Communication". Proceedings of Terminology and Planning (Termplan-90), 10-13. Kuala Lumpur.

Appelo, Lisette, C. Fellinger & J. Landsbergen. 1987. "Subgrammars, Rule Classes and Control in the ROSETTA Translation System". Proceedings of the 3rd Conference of the European Chapter of the Association for Computational Linguistics (EACL'87), 118-133. Copenhagen, Denmark.

Brown, Peter F., S.A. Della Pietra, V.J. Della Pietra, J.D. Lafferty & R.L. Mercer. 1992. "Analysis, Statistical Transfer, and Synthesis in Machine Translation".
Proceedings of the 4th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-92), 83-100. Montreal, Canada.

Furuse, Osamu & H. Iida. 1992. "Transfer-Driven Machine Translation". Proceedings of the 2nd International Workshop on Fundamental Research for the Future Generation of Natural Language Processing (FGNLP). Manchester: Centre for Computational Linguistics, UMIST.
Maruyama, Hiroshi & H. Watanabe. 1992. "Tree Cover Search Algorithm for Example-Based Translation". Proceedings of the 4th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-92), 173-184. Montreal, Canada.

Jones, Daniel. 1992. "Non-Hybrid Example-Based Machine Translation Architectures". Proceedings of the 4th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-92), 35-43. Montreal, Canada.

Landsbergen, J. 1989. The Power of Compositional MT. Eindhoven, The Netherlands: Philips Research Laboratories.

Nirenburg, Sergei. 1989. "Knowledge-based Machine Translation". Machine Translation 4:1.5-24.
Nirenburg, Sergei, J. Carbonell, M. Tomita & K. Goodman. 1992. Machine Translation: A Knowledge-based Approach. San Mateo, Calif.: Morgan Kaufmann.

Nagao, Makoto. 1984. "A Framework of a Mechanical Translation between Japanese and English by Analogy Principle". Artificial and Human Intelligence ed. by A. Elithorn & R. Banerji, 173-180. Amsterdam: North-Holland Elsevier.

Nagao, Makoto. 1992. "Some Rationales and Methodologies for Example-Based Approach". Proceedings of the 2nd International Workshop on Fundamental Research for the Future Generation of Natural Language Processing (FGNLP), 82-94. Manchester: Centre for Computational Linguistics, UMIST.

Sumita, Eiichiro & H. Iida. 1991. "Experiments and Prospects of Example-Based Machine Translation". Proceedings of the 29th Meeting of the Association for Computational Linguistics (ACL'91), 185-192. Berkeley, California.

Tsujii, Jun'ichi, S. Ananiadou, I. Arad & S. Sekine. 1992. "Linguistic Knowledge Acquisition from Corpora". Proceedings of the 2nd International Workshop on Fundamental Research for the Future Generation of NLP (FGNLP), 61-81. Manchester: Centre for Computational Linguistics, UMIST.

Tsujii, Jun'ichi & S. Ananiadou. 1993. "Knowledge-based Processing in MT". Knowledge Building and Knowledge Sharing ed. by K. Fuchi & T. Yokoi, 68-85. Amsterdam, The Netherlands: IOS Press.
Connectionist F-structure Transfer

YE-YI WANG & ALEX WAIBEL
Carnegie Mellon University

Abstract

A traditional transfer system in machine translation maps between language structures and an intermediate representation. Our connectionist transfer system maps f-structures of one language directly to f-structures of another language. It encodes the intermediate representation implicitly in the activation patterns of neural networks. Because the system is learnable, no effort is needed to hand-craft the representation or the mapping rules. Experiments show that the system has good scalability and generalisability.

1 Introduction
Most current machine translation systems adopt an indirect strategy that maps between languages and an intermediate representation. The interlingua model (Nirenburg et al. 1987) uses a language-independent intermediate representation. Design of the representation requires cross-linguistic expertise. The intermediate representation in a transfer model (White 1987) is language-dependent, and its design is relatively easier; however, multiple such representations are required for a multi-lingual translator. Both models rely upon hand-crafted mapping rules, which demand tremendous human effort.
These difficulties call for automatic learning mechanisms for intermediate representations and mapping rules. Chrisman (1991) proposed a connectionist confluent inference system that acquired a distributed inter-language representation of sentences during learning, in order to achieve a tight coupling between the representations of sentences in two different languages. The approach was hard to scale up to larger tasks or to generalise to unseen inputs, mostly because of its over-simplified representation of sentences.
We present here a connectionist mapper. It can learn the transfer from a source language (English) LFG f-structure (Bresnan 1982) into its corresponding target language (German) f-structure. It needs no explicit intermediate representation or mapping rules. Instead, the connection patterns of the neural networks implicitly encode the rules and the representation.
The domain of our task was the Conference Registration Telephony Conversations. It covered a wide range of topics related to conferences, such as
registration, cancellation, hotel reservation, conference information inquiry, etc. The lexicon for the task contained about 400 English and 400 German words in root form. About 300 pairs of f-structures of English and German sentences were available from symbolic parsers.
A machine translation system for the Conference Registration task consists of three parts: a parser deriving the f-structure from an input source language sentence, a mapper generating a target language f-structure from its source language counterpart, and a text generator producing a target language sentence from its f-structure. In our experience, mapping between f-structures was the most difficult part, requiring the hand-crafting of an intermediate representation and of the rules that map between f-structures and the intermediate representation. An automatic transfer system is thus desirable. Such a system should have the following properties:
Learnability: The system should be able to learn the structure transfer automatically from paired samples. It should not require hand-crafting of any explicit representations or mapping rules.
Scalability: With limited retraining, the system should be able to deal with larger tasks with an expanded lexicon.
Generalisability: The system should have satisfactory performance on unseen inputs.

2 F-structure representations
An f-structure is a structured functional representation of a sentence or a phrase. It is composed of a head, terminal features, and sub-structures. For the f-structure in Figure 1a, *SEND is the head. The contents of the inner brackets are the sub-structures, whose grammatical relations or roles¹ are labeled next to the brackets. The remaining parts of Figure 1a are the terminal features. A sub-structure can be referred to by its grammatical relation or by its phrasal category (NP, VP, ...). Thus the sub-structure [subj *YOU] can be called either a SUBJECT sub-structure or an NP sub-structure. The SUBJECT, RECIP and OBJECT sub-structures are the three immediate sub-structures of the top-level f-structure in Figure 1a, because there is no intervening structure between these sub-structures and the top-level f-structure. The DET sub-structure is an immediate sub-structure of the OBJECT sub-structure. If A is an immediate sub-structure of B, then B is the parent structure of A.
¹ We use "grammatical relation" interchangeably with the term "role".
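These definitions can be made concrete with a small nested-dictionary sketch. The field names ("head", "features", "subs") and the OBJECT head "*FORM" are our own illustrative choices, not the paper's notation:

```python
# An f-structure as a nested dict: a head, terminal feature-value
# pairs, and role-labelled sub-structures ("subs").

def make_fs(head, features=None, subs=None):
    return {"head": head, "features": features or {}, "subs": subs or {}}

# Approximation of the f-structure of Figure 1a (the OBJECT head
# "*FORM" is a guess for illustration):
fs = make_fs("*SEND", features={"TENSE": "*PRESENT"}, subs={
    "subj": make_fs("*YOU"),
    "recip": make_fs("*I"),
    "obj": make_fs("*FORM", subs={"det": make_fs("*A")}),
})

def immediate_subs(fs):
    """Roles of the immediate sub-structures: those with no
    intervening structure between them and fs."""
    return sorted(fs["subs"])

def parent_of(root, target):
    """The parent structure of target, i.e., the structure of which
    target is an immediate sub-structure."""
    for sub in root["subs"].values():
        if sub is target:
            return root
        found = parent_of(sub, target)
        if found is not None:
            return found
    return None

det = fs["subs"]["obj"]["subs"]["det"]
assert immediate_subs(fs) == ["obj", "recip", "subj"]
assert parent_of(fs, det) is fs["subs"]["obj"]   # DET's parent is OBJECT
```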
A symbolic f-structure cannot be presented to a neural network directly. Figure 1c-f illustrates how an f-structure can be coded as a network's input. Below are the terms used for the representation.
Fig. 1: F-structure representation: (a) an f-structure; (b) abbreviation; (c) lexical vector; (d) terminal feature vector; (e) HF-vector; (f) f-structure represented by HF-vectors.
A lexical vector is used to code a lexical item. Assuming that every lexical item is an entry in a two-dimensional space instead of a one-dimensional word list, we need two indices to specify the position of a lexical item in the space. A lexical vector is a 0-1 vector with exactly two elements being 1 (being activated). The positions of the two activated elements in the vector specify the two indices of an item in the 2D lexicon (Figure 1c).²
² Viewing the lexicon as 2D reduces the length of the vector used to represent a lexical item from n to about 2√n.
The terminal feature vector of an f-structure codes the terminal features of the f-structure. Each element of the vector corresponds to a
feature-value pair like (TENSE *PRESENT). The vector, again, is a 0-1 vector, with the activated elements indicating that their corresponding feature-value pairs are terminal features of the f-structure (Figure 1d). Since there are altogether around 60 different values for all the features used in the f-structures, the length of the terminal feature vector is around 60.
The HF-vector of an f-structure is the concatenation of the lexical vector of the head and the terminal feature vector of the f-structure (Figure 1e). Thus an f-structure can be represented by its HF-vector and its sub-structures' HF-vectors (Figure 1f).
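The vector coding just described can be sketched as follows. The toy lexicon and three-entry feature inventory are invented for illustration; only the coding scheme itself (two active bits giving the 2D lexical index, one bit per feature-value pair, concatenation into an HF-vector) follows the text.

```python
import math

LEXICON = ["*SEND", "*YOU", "*I", "*A", "*FORM", "*CONFERENCE"]  # toy lexicon
FEATURES = [("TENSE", "*PRESENT"), ("TENSE", "*PAST"), ("MOOD", "*DECLARATIVE")]

SIDE = math.ceil(math.sqrt(len(LEXICON)))  # side of the square 2D lexicon

def lexical_vector(word):
    """0-1 vector with exactly two activated elements: the row and the
    column index of the word in the 2D lexicon, so an n-word lexicon
    needs about 2*sqrt(n) units instead of n."""
    row, col = divmod(LEXICON.index(word), SIDE)
    vec = [0] * (2 * SIDE)
    vec[row] = 1
    vec[SIDE + col] = 1
    return vec

def terminal_feature_vector(feats):
    """One element per feature-value pair; activated iff the pair is a
    terminal feature of the f-structure."""
    return [1 if fv in feats else 0 for fv in FEATURES]

def hf_vector(head, feats):
    """HF-vector: lexical vector of the head ++ terminal feature vector."""
    return lexical_vector(head) + terminal_feature_vector(feats)

v = hf_vector("*SEND", {("TENSE", "*PRESENT")})
assert sum(lexical_vector("*CONFERENCE")) == 2   # always exactly two bits
assert len(v) == 2 * SIDE + len(FEATURES)
```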
3 The mapper
A mapper is a push-down transducer that consists of:
1. a symbolic controller that assigns an f-structure transfer task to a neural network and interprets the network's output. According to the interpretation, it recursively assigns the sub-structure transfer tasks to the related networks, and assembles these networks' results into the target f-structure;
2. seven neural networks that map phrasal f-structures between the two languages. Each network is constructed for a phrasal category of the target language: IP (sentences), VP, NP, AP, PP, DP (determiners), and MP (miscellaneous, for phrases like "hello", "oh", etc.).
3.1 Phrasal networks
A phrasal network has four layers: input, feature, hidden, and output layers (Figure 2a). The input layer consists of three parts:
Slots for the HF-vectors of an input f-structure and its context (parent) structure. Each slot corresponds to a fixed role. An input f-structure may have sub-structures of arbitrary depth, but the networks must have a fixed number of input slots. Therefore we cannot include all sub-structures' HF-vectors in the networks' input. Instead, we 'peel off the shell' of an f-structure: we include only the HF-vectors of the immediate sub-structures, and of their immediate sub-structures in turn, for the input f-structure, and the HF-vectors of the immediate sub-structures for the context f-structure. Pre-analysis of the samples reveals the possible roles of the sub-structures that can occur at these levels in f-structures of the seven phrasal categories, and slots are then added to the input and
Fig. 2: Phrasal Network Structure: (a) the architecture of a phrasal network. (b) details of the lowest two layers. The unshaded slots represent the input f-structure. The shaded ones represent the context f-structure.
feature layers of the corresponding phrasal networks to take as input the HF-vectors of the sub-structures with those possible roles.
The grammatical relation of the input source structure in its context.³ This input is a 0-1 vector with exactly one activated element, indicating the grammatical relation of the input structure.
The lexical vector of the head of the output f-structure's parent structure (p-head). Sometimes one input f-structure may be responsible for the generation of multiple target f-structures at different levels. For example, [sentence GOODBYE] corresponds to both [sentence AUF [obj WIEDERHÖREN]]
and its sub-structure [obj WIEDERHÖREN] in the training samples. This input serves as a stack pointer, indicating the level at which the output f-structure should be generated.
³ Slot position only indicates the role of sub-structures, not the role of the input structure, since the HF-vector of the input f-structure, whatever its role, always occupies the first slot.
The HF-vectors at the input layer are local representations of the words and features in an f-structure. The activation patterns of the slots at the feature layer can be viewed as an automatically learned distributed representation of the input HF-vectors (Miikkulainen 1989). The input slots have one-to-one connections to the feature slots (Figure 2b). The slot-slot connections share weights in such a way that the connection from the ith unit in slot A at the input layer to the jth unit in slot A at the feature layer has the same strength as the connection from the ith unit in slot B at the input layer to the jth unit in slot B at the feature layer. The weight sharing makes the same HF-vector at different input slots result in the same pattern in the corresponding feature slots.
The output layer of a phrasal network has three parts:
The HF-vector of the f-structure to be generated. From this vector the head and the terminal features of the target f-structure can be recovered.
The sub-structures' input specifiers. These consist of slots of 0-1 vectors. Each slot has at most one element activated, and each slot corresponds to a sub-structure of a specific role of the target f-structure. The role of the sub-structure is implied by the position of the slot in the output layer. Each vector of the sub-structures' input specifiers is of size (number of input layer slots + 1). For an output slot in the sub-structures' input specifiers, if it has one activated element, then the sub-structure with the corresponding role should be included as a part of the desired output f-structure. The position of the activated element in the slot indicates the input sub-structure (as specified by the slot number in the input layer) that is the counterpart of (and therefore is responsible for the generation of) the target sub-structure, or nil when no input sub-structure is a counterpart of the output sub-structure.
If a network does not activate any element in an output slot, then the slot's corresponding sub-structure is not expected as part of the desired target f-structure.
The sub-structures' categories. These consist of slots of 0-1 vectors. At most one element can be activated in each slot, specifying one of the seven phrasal categories for the corresponding target sub-structure.
According to a network's output, the controller builds sub-structures recursively by assigning the subsequent sub-structure mapping tasks to the networks of the categories specified in the sub-structures' categories at the output layer. The input f-structures of those mapping tasks are specified in the sub-structures' input specifiers. By combining the recursively built sub-structures with the head and the terminal features from
the output HF-vector, the desired target f-structure can be produced.
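The controller's recursive assembly loop can be sketched with the phrasal networks stubbed out as plain functions. The stub outputs below are invented for illustration; a real network would emit the HF-vector and the specifier slots described above.

```python
# Stub "networks": each maps (source fs, context, role, p-head) to a
# head, terminal features, and per-role (input_sub, category) specifiers.

NETWORKS = {
    "VP": lambda src, ctx, role, phead: {
        "head": "WERDE",
        "features": {"CAT": "V"},
        "subs": {"subj": (src["subs"]["subj"], "NP")},
    },
    "NP": lambda src, ctx, role, phead: {
        "head": "PRONOUN", "features": {}, "subs": {},
    },
}

def transfer(category, src, ctx=None, role=None, phead=None):
    """Symbolic controller: run the network for `category`, then
    recursively map each specified input sub-structure and assemble
    the results into the target f-structure."""
    out = NETWORKS[category](src, ctx, role, phead)
    target = {"head": out["head"], "features": out["features"], "subs": {}}
    for r, (input_sub, cat) in out["subs"].items():
        target["subs"][r] = transfer(cat, input_sub, ctx=src, role=r,
                                     phead=out["head"])
    return target

src = {"head": "WOULD", "subs": {"subj": {"head": "I", "subs": {}}}}
target = transfer("VP", src)
assert target["head"] == "WERDE"
assert target["subs"]["subj"]["head"] == "PRONOUN"
```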
4 An example
The following example illustrates how the system works.
Source Sentence: I would like to register for the conference
Source F-structure: [sent [subj I] WOULD [xcomp [subj I] LIKE [xcomp [subj I] REGISTER [pp_adj FOR [obj [det THE] CONFERENCE]]]]]
Target Sentence: Ich wuerd mich gerne zur Konferenz anmelde

(0) IP network
Input: source: [sent [subj I] WOULD [xcomp [subj I] LIKE [xcomp REGISTER]]]
Output: head: NIL
subs: sentence <WOULD VP>⁴ (1)⁵
features: (MOOD *DECLARATIVE)
F-structure assembled by the controller:
[sentence [subj PRONOUN] WERDE
[xcomp [subj PRONOUN] [adj GERNE] ANMELDEN [obj PRONOUN] [pp_adj FÜR [obj [det DER] KONFERENZ]]]]

(1) VP network
Input: source: [sent [subj I] WOULD [xcomp [subj I] LIKE [xcomp REGISTER]]]
context: NIL
role: sentence
p-head:⁶ NIL
Output: head: WERDE
subs: subj (2) xcomp (3)
features: ((CAT V) (PERSON 1) (MODAL +) (FORM FIN) ...)
F-structure assembled by the controller:
[ [subj PRONOUN] WERDE [ [subj PRONOUN] [adj GERNE] ANMELDEN [obj PRONOUN] [pp_adj FÜR [obj [det DER] KONFERENZ]]]]
In step (0), the controller first activates the IP network with the source input f-structure. There is no context input for the IP network, since the sentential f-structures are the top-level f-structures in our task. From the network's output, the controller knows that the head of the IP is NIL.⁷ The network also generates the sentential feature (MOOD *DECLARATIVE). The controller interprets the output as saying that the only sub-structure of the sentence is a German VP, whose English counterpart is the (non-proper) sub-structure with the head WOULD.⁸ It therefore builds the target f-structure framework [ NIL (MOOD *DECLARATIVE) [sentence *]], and activates the VP network in step (1). Upon receiving the VP sub-structure returned from step (1), it combines that sub-structure with the f-structure framework, and collapses the NIL-headed f-structure to form the assembled f-structure shown as the output of step (0).
In step (1), the input source was determined in step (0), since the sentence sub-structure's head was WOULD according to the IP network's sub-structure input specifier in step (0). The context input is NIL because the source f-structure does not have a parent f-structure. The input role has the value sentence because the slot position of the output sub-structure in step (0) implies that the grammatical relation of the sub-structure is sentence. The input p-head is NIL because the head of the target f-structure in step (0) was NIL, as specified by the output HF-vector there. The VP network maps the input f-structure to its German counterpart by specifying (a) the head WERDE and the terminal features of the German VP structure in the output HF-vector, and (b) the input specifiers and the categories of the sub-structures of the target German VP f-structure.
⁴ The sub-structure's input specifier and category are combined into a tuple here.
⁵ The number in parentheses indicates the subsequent step of network activation for this sub-structure.
⁶ P-head is the head of the target f-structure's parent structure.
⁷ A NIL-headed f-structure occurs only when there is a single sub-structure or when there is an xcomp sub-structure. It must collapse into the only sub-structure in the first case, or into the xcomp sub-structure in the second; all terminal features and other sub-structures are moved into the collapsed-into sub-structure during collapsing.
⁸ The network actually specifies the slot at the input layer rather than the lexical item WOULD.
To build the detailed sub-structures of this VP f-structure, the controller will activate the NP network with the English sub-structure with the head I, and the VP network with the English sub-structure with the head LIKE, in the subsequent steps, and
combine the sub-structures returned from these subsequent steps into the f-structure framework [[subj *] WERDE [xcomp *]]. The combined structure is then returned to step (0), to be integrated into the top-level f-structure framework there.
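The collapsing step used in step (0), as specified in footnote 7, can be sketched directly:

```python
# Collapse a NIL-headed f-structure into its xcomp sub-structure if
# one exists, otherwise into its only sub-structure; terminal features
# and the remaining sub-structures move into the collapsed-into one.

def collapse(fs):
    if fs["head"] is not None:
        return fs
    subs = fs["subs"]
    into = "xcomp" if "xcomp" in subs else next(iter(subs))
    target = dict(subs[into])
    target["features"] = {**fs["features"], **target["features"]}
    target["subs"] = {
        **{r: s for r, s in subs.items() if r != into},
        **target["subs"],
    }
    return target

frame = {
    "head": None,                      # NIL-headed framework from step (0)
    "features": {"MOOD": "*DECLARATIVE"},
    "subs": {"sentence": {"head": "WERDE", "features": {}, "subs": {}}},
}
assembled = collapse(frame)
assert assembled["head"] == "WERDE"
assert assembled["features"]["MOOD"] == "*DECLARATIVE"   # feature moved in
```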
5 Training, testing and performance
From the 300 sentential f-structure pairs, we extracted all the German NP sub-structures, their grammatical relations and their parent structures' heads, and labeled their English counterparts.⁹ This was all the information required for training the NP network. About 700 samples for the NP network were created this way. The training samples for the other networks were prepared in the same way. The NP network had the most samples, while the MP network had the fewest, 89 samples. Standard back-propagation was used to train the networks. We also tried information-theoretic networks (Gorin et al. 1991) to generate the head of a target structure in the HF-vector, which required less training time and achieved performance comparable to that of the network trained with pure back-propagation (Wang 1994). The training took 500 to 2000 epochs for the different networks, and the training time ranged from one hour to three days on a DEC Station 5000. The mapper achieved 92.4% accuracy on the training data.¹⁰
Learnability: The connectionist f-structure transfer described above did not require any hand-crafted rules or representations. The structure transfer was learned automatically. By clustering the distributed representations of words learned by the networks, i.e., the activation patterns of a feature slot when a lexical item was presented to its connected input slot, we made some interesting findings about what was learned by the networks. One was that the feature patterns for English nouns in the DP network were clustered into three classes, which reflected the three genders of German nouns: the German translations of the words in each class were roughly of the same gender. Another finding concerned the classification of verbs. When we clustered the feature patterns for verbs in the VP network, we found some intransitive verbs like register in the same class as most of the transitive verbs.
⁹ An NP's counterpart is not necessarily an NP.
¹⁰ A source language f-structure is said to be accurately mapped if the generated target language f-structure is exactly the same as desired in the sample.
This seemingly strange classification is not odd at all if we consider the fact that the German translation of register, "anmelden", is a
transitive verb. These two independent findings reveal the networks' ability to discover linguistic features of the target language and use them in the representation of an entity of the source language that does not possess those features. This is exactly what a symbolic transfer system is supposed to do: use an intermediate representation which reflects the linguistic features of the two languages in question (even if one of the languages may have a degenerate form of a specific feature), and thus be able to make a 'transfer' at both the lexical and the structural level into the corresponding structure of the target language. Our system learned the intermediate representation automatically, although the representation was not expressed explicitly in symbolic form but encoded in the networks' activation patterns. Because the development of this representation was integrated into the process of automatically learning the f-structure mapping, the intermediate representation tended to include the important language-specific linguistic features that were directly relevant to the ultimate purpose of structure transfer. In other words, the learning of the intermediate representation was focused on improving transfer performance. This is one of the biggest advantages of this approach over hand-crafted intermediate representations.
Scalability: We did a preliminary scalability experiment. We extended the source and target language lexicons by 2%, and made 30 new f-structures with these new lexical items. To scale up from what was already learned, we froze all but the input-feature connections, trained the networks for about 40 epochs with the new data, then fine-tuned all the connections with old and new data for a few epochs. In doing so, we let the networks first learn the new words to derive their distributed representations, and then learn the structure mapping for the new data.
This approach was based on the observation that a large portion of the new English words were translated to German words already in the lexicon, which in turn were translations of English words in the old training data. These old English words were mostly synonyms of the new English words. By freezing the other connections and training only the input-feature connections, we hoped the networks would develop distributed representations for the new words similar to the already-learned representations of their synonyms. This approach greatly reduced the learning time for new words, since the one-layer back-propagation was much faster than the full-blown learning. The mapper with the phrasal networks retrained this way achieved 83.3% accuracy on the new data, without affecting the performance on the old data.
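The freeze-then-fine-tune schedule can be illustrated with a toy two-layer linear network trained by plain gradient steps. The network, data and learning rates here are invented; only the schedule itself — train the input-feature layer while the rest is frozen, then briefly fine-tune everything — mirrors the text.

```python
import random

random.seed(0)

# Toy two-layer linear net: W1 plays the input->feature connections,
# W2 stands for all the other (initially frozen) connections.
W1 = [[random.uniform(-0.1, 0.1) for _ in range(3)] for _ in range(3)]
W2 = [random.uniform(-0.1, 0.1) for _ in range(3)]

def forward(x):
    h = [sum(W1[j][i] * x[i] for i in range(3)) for j in range(3)]
    return h, sum(W2[j] * h[j] for j in range(3))

def train_step(x, t, lr, freeze_upper):
    h, y = forward(x)
    err = y - t
    if not freeze_upper:
        for j in range(3):
            W2[j] -= lr * err * h[j]
    for j in range(3):              # the input->feature layer always trains
        for i in range(3):
            W1[j][i] -= lr * err * W2[j] * x[i]

x, t = [1.0, 0.0, 1.0], 0.5        # one "new word" training pair
for _ in range(400):               # phase 1: everything but W1 frozen
    train_step(x, t, lr=1.0, freeze_upper=True)
for _ in range(50):                # phase 2: brief fine-tuning of all weights
    train_step(x, t, lr=0.01, freeze_upper=False)
assert abs(forward(x)[1] - t) < 0.01
```

Because phase 1 touches only one layer, each step is far cheaper than full back-propagation, which is the point of the retraining scheme described above.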
Generalisability: A separate set of data was used to test the generalisation performance of the system. The testing data was collected from people not associated with our research. The data was compared with the training corpus, and the sentences that appeared in the training data were removed. An LR parser parsed the sentences into English f-structures. The English sentences were translated into German manually, and the translations were parsed by a German LR parser. We picked the most probable structure when a parsing result was ambiguous. There were 154 f-structure pairs after we eliminated the wrongly-parsed sentences. The mapper achieved 61.7% accuracy on the testing data. Considering the limited number of training samples, this performance is encouraging. Previous research such as (Chrisman 1991) did not generalise to deal with unseen data.
6 Discussion
The application of the connectionist transfer described in this paper has its restrictions. First, it requires well-formed f-structures for both the input and output sentences. This greatly limits the applicable domain of the approach to well-structured 'clean' languages. It is difficult to use this approach for spoken language, where performance data like ungrammatical utterances, noises, and false starts are pervasive.
Another restriction is that this approach can only achieve satisfactory performance when the input and output languages are similar, in the sense that the translation equivalents in the two languages mostly have similar recursive f-structures. Although the system can deal with structurally different input/output sentences, like the aforementioned example of [sentence GOODBYE] and [sentence AUF [obj WIEDERHOEREN]], we believe that the performance would drop significantly if drastic structural differences between translation equivalents were very common for the two languages in question. Fortunately, as shown by our data, the structural difference between English and German is not so drastic as to ruin our system's performance.
Although we have done some scalability experiments, it is unclear how the system would perform if we increased the lexicon significantly instead of by 2%. Because of the limited available data, we found it very difficult to conduct scalability experiments with a much larger lexicon. We hope that, with stable incremental performance, the system can be gradually and easily retrained to deal with more complicated problems.
7 Conclusion
Motivated by the difficulties of symbolic transfer, we have proposed a connectionist transfer system that maps between f-structures of two languages. It can discover meaningful linguistic features by learning. Its performance is promising with respect to learnability, scalability and generalisability.

REFERENCES

Bresnan, Joan. 1982. The Mental Representation of Grammatical Relations. Cambridge, Mass.: MIT Press.
Chrisman, Lonnie. 1991. "Learning Recursive Distributed Representations for Holistic Computation". Connection Science 3:4.345-366.
Gorin, Allen L., Steve E. Levinson, A. N. Gertner & E. Goldman. 1991. "Adaptive Acquisition of Language". Computer Speech and Language 5:101-132.
Miikkulainen, Risto & Michael G. Dyer. 1989. "A Modular Neural Network Architecture for Sequential Paraphrasing of Script-Based Stories". Proceedings of the International Joint Conference on Neural Networks. IEEE.
Nirenburg, Sergei, Victor Raskin & Allen B. Tucker. 1987. "The Structure of Interlingua in TRANSLATOR". Machine Translation: Theoretical and Methodological Issues ed. by Sergei Nirenburg, 90-113. Cambridge: Cambridge University Press.
Wang, Ye-Yi. 1994. "Dual-Coding Theory and Connectionist Lexical Selection". Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL'94), student session, 325-327.
White, John S. 1987. "The Research Environment in the METAL Project". Machine Translation: Theoretical and Methodological Issues ed. by Sergei Nirenburg, 225-246. Cambridge: Cambridge University Press.
Acquisition of Translation Rules from Parallel Corpora

Yuji Matsumoto and Mihoko Kitamura
Graduate School of Information Science
Nara Institute of Science and Technology
Abstract

This article presents a method for the automatic acquisition of translation rules from a bilingual corpus. Translation rules are extracted from the results of structural matching of parallel sentences. The structural matching process is controlled by a word similarity dictionary, which is also obtained from the parallel corpus. The system acquires translation equivalences at the word level as well as at the multiple-word or phrase level.
1 Introduction
The major issues in Machine Translation are how to acquire translation knowledge and how to apply that knowledge in real systems without causing unexpected side-effects. Hand-coding of transfer rules suffers from the problems of enormous manual labour and the difficulty of maintaining consistency. Example-based translation (Sumita 90; Sato 90) is supposed to be a method that copes with this problem. Unlike transfer-based approaches, the idea is to carry out translation by referring to the translation examples that are most similar to the given sentence. The key technique is to define the similarity between the given sentence and the examples, and to identify the examples with the best similarity. Robustness and scalability are the claimed strengths of this approach. However, there are at least two important problems that have not been answered. One is the "knowledge access bottleneck," which concerns the selection of the most similar example. Similarities are usually defined only for fixed and local structures, such as predicate-argument structures and compound nominals. The units of translation cannot always be such fixed structures and may vary according to the language pair. Similarity should be defined in a more flexible way. The other is the "knowledge acquisition bottleneck." In example-based translation, the parallel examples have to be aligned not only at sentence level but at word or
phrase level. Although sentence-level alignment can be done automatically using statistics, e.g., (Utsuro et al. 94), word-level alignment is not an easy task, especially when the system tries to cover wide syntactic phenomena.
This paper presents a method for the automatic acquisition of translation rules from a parallel corpus of English and Japanese. Translation rules here refers to word selection rules and translation templates that represent word-level and phrase-level translation rules. A translation template is regarded as a phrasal translation rule. Since translation rules may change according to the target domain, this method sheds light on an easy and effective way of developing domain-dependent translation rules by accumulating a parallel corpus.

2 Acquisition of Translation Rules

Figure 1 shows the flow of the acquisition of translation rules. The following three types of resources are assumed:
1. A parallel corpus of the source and target languages.
2. Grammars and dictionaries of the source and target languages.
3. A machine-readable bilingual dictionary.
The automatic acquisition of translation rules is composed of the following three processes:
Calculation of word similarities: Calculation of the similarities of word pairs of the source and target languages based on their co-occurrence frequencies in the parallel corpus.
Structural matching: Structural matching of the dependency structures obtained by parsing parallel sentences.
Acquisition of translation rules: Acquisition of translation rules based on the structural matching results.
We focus on a bilingual corpus of Japanese and English and assume that sentence-level alignment has been done on the corpus. If the sentences are not aligned, we can align them using an existing alignment algorithm such as (Kay & Röscheisen 93) or (Utsuro et al. 94).

2.1 Calculation of Word Similarities
We define the similarity of a pair of Japanese and English words as a numerical value between 0 and 1. We use the following two resources for
Figure 1: The flow of translation rules acquisition
obtaining the similarity:
• a machine-readable bilingual dictionary
• a bilingual corpus of Japanese and English
As for the former, we assign the value 1 to the translation pairs appearing in the bilingual dictionary. As for the latter, we use the basic calculation method of similarity proposed by (Kay & Röscheisen 93). Unlike their method, we preprocess the corpus by analyzing it morphologically to obtain the base forms of the words. The similarity of a pair of Japanese and English words is defined by the numbers of their total occurrences and co-occurrences in the corpus. The similarity of a Japanese and English
Figure 2: A result of structural matching (English: "Companies compensate agents."; the best score = 1.55)
word-pair is defined by sim(wJ, wE) = 2 fJE / (fJ + fE), where fJ and fE are the total numbers of occurrences of the Japanese word wJ and the English word wE, and fJE is the total number of co-occurrences of wJ and wE, that is, the number of times they appear in corresponding sentences.

ACQUISITION OF TRANSLATION RULES

2.2 Structural matching of parallel sentences

Corresponding Japanese and English sentences in the parallel corpus are parsed with LFG-like grammars, resulting in feature structures. We do not use any semantic information in the current implementation. When a sentence includes syntactic ambiguity, the result is represented as a disjunctive feature structure. A feature structure is regarded as a directed acyclic graph (DAG). In the subsequent process of structural matching, we use the part of the DAG that relates to content words (such as nouns, verbs, adjectives and adverbs). The resulting DAG represents a (disjunctive) dependency structure of the content words in the sentence.

We start with a pair of dependency graphs of Japanese and English sentences and find the most plausible graph matching between them. We use the word similarities described in the previous section in the matching process. The similarity of word pairs is extended to the similarity of subgraphs in the dependency structures. A sample result of structural matching is shown in Figure 2. The basic definition and algorithm follow (Matsumoto et al. 93), though the similarity measures of words and subgraphs are refined.

When the corresponding subgraphs (nodes in circles pointed to by a bidirectional arrow in Figure 2) consist of single words, the word similarity is used for their similarity. When any of the subgraphs contains more than one content word, we imposed the following criterion: the higher the similarity of a word pair, the finer their corresponding subgraphs should be. This means that mutually very similar words should have an exact match, whereas mutually dissimilar words, when they are matched against each other by the structural constraint, are better included in coarse subgraphs. To achieve this criterion, we defined the following formula for calculating mutual similarity between subgraphs. Let s and t be subgraphs matched against each other, and let Vs and Vt be the sets of content words in s and t. We can assume, without loss of generality, that |Vs| is not greater than |Vt| (Vs and Vt can be switched if this is not the case). Let Dp be the set of pairs of elements from Vs and Vt defined by an injection (one-to-one mapping) p : Vs → Vt:

Dp = {(a, p(a)) | a ∈ Vs}

Then the average similarity of words between Vs and Vt is defined as

AverageSim(s, t) = (1 / |Vs|) Σ (a,b) ∈ Dp sim(a, b)
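The word-pair similarity sim(wJ, wE) = 2 fJE / (fJ + fE) can be checked numerically against Table 1; a minimal sketch (the function name is ours):

```python
# Word-pair similarity from raw occurrence counts: twice the number of
# co-occurrences divided by the summed total occurrences (a Dice-style
# coefficient).  The counts in the check come from the "agent" row of
# Table 1.
def word_sim(f_j, f_e, f_je):
    # f_j / f_e: total occurrences of the Japanese / English word;
    # f_je: number of times they occur in corresponding sentences.
    return 2.0 * f_je / (f_j + f_e)

print(round(word_sim(952, 1004, 915), 6))  # → 0.935583, as in Table 1
```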
To achieve the above criterion, we put a threshold value Th (0 < Th < 1), where a similarity value higher than Th is supposed to indicate that two words are mutually similar. The following formula of similarity between two subgraphs realizes the criterion in that the total similarity is bent toward the threshold value according to the size of the subgraphs: dividing the difference between AverageSim and Th by the size of the subgraphs works as a penalty for graphs that are mutually similar and as a reward for graphs that are mutually dissimilar.
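The subgraph-similarity formula itself did not survive reproduction in this copy. A form consistent with the surrounding description (similarity bent toward Th, the difference divided by a measure of subgraph size; the choice of max(|Vs|, |Vt|) as the size term is our assumption) would be:

```latex
Sim(s,t) \;=\; Th \;+\; \frac{AverageSim(s,t) - Th}{\max(|V_s|,\,|V_t|)}
```

For single-word subgraphs the denominator is 1 and Sim reduces to the word similarity; as the subgraphs grow, Sim is pulled toward Th, which penalizes coarse groupings of mutually similar words and rewards them for mutually dissimilar words, exactly as the criterion requires.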
The branch-and-bound algorithm is employed to search for the graph matching that gives the highest similarity value. Figure 2 shows an example of dependency structures and the result of the structural matching, in which the corresponding pairs are linked by arrows. Here the best score is the total similarity of the most similar graph matching. The threshold is set at 0.15.
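The paper does not spell out the branch-and-bound details; the objective it optimizes can be illustrated with a plain exhaustive search over injections (the function name and the toy similarity table are our own sketch):

```python
from itertools import permutations

def best_matching(vs, vt, sim):
    """Find the injection from vs into vt (|vs| <= |vt|) that maximizes
    the average word similarity, i.e. the AverageSim quantity above.
    The real system prunes this search with branch-and-bound; plain
    enumeration is enough to show the objective."""
    best_pairs, best_score = None, float("-inf")
    for image in permutations(vt, len(vs)):
        pairs = list(zip(vs, image))
        score = sum(sim.get(p, 0.0) for p in pairs) / len(vs)
        if score > best_score:
            best_pairs, best_score = pairs, score
    return best_pairs, best_score

# Toy similarity table for two Japanese and two English content words.
sim = {("j1", "compensate"): 0.9, ("j1", "agent"): 0.1,
       ("j2", "compensate"): 0.2, ("j2", "agent"): 0.8}
pairs, score = best_matching(["j1", "j2"], ["compensate", "agent"], sim)
# pairs = [("j1", "compensate"), ("j2", "agent")], score = 0.85
```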
2.3 Acquisition of translation rules

After accumulating structurally matched translation examples, the acquisition of translation rules is performed in the following steps. We assume a thesaurus for describing the constraints on the applicability of the acquired rules. Suppose we concentrate on a particular word or a particular phrase in the source language graphs that appears as a subgraph in matching graphs. We refer to this subgraph as t.

1. Collect all the matched graphs that contain the same subgraph as t.
2. Extract the graph t and its children together with the corresponding part of the target language tree. Some heuristics are applied in this process: corresponding pairs of pronouns are deleted, and zero personal pronouns in Japanese sentences are recovered.
3. The child elements are generalized using the classes in the thesaurus, which are identified as the condition on the applicability of the rule.

The system acquires two types of translation rules, representing word-level and phrase-level translations. When the top subgraph consists of a single content word, we regard the corresponding subgraphs as giving a word selection rule. On the other hand, when the top subgraph consists of more than one content word, we regard it as a phrasal expression and call it a translation template. Figure 2 shows an example of a phrase-level correspondence, "compensate : ". Since we assume the translation is influenced by the adjacent elements, i.e., the words that directly modify the word in the subgraph, we generalize the information in the collected matches so as to identify the exact contexts in which the translation rule is applicable. From the set of partial graphs that share the same parent nodes, translation rules in the form of feature structures are obtained.

In the experiment described below, we focus on acquiring Japanese-English and English-Japanese translation rules related to verbs, nouns and adjectives.
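The collection-and-generalization steps can be sketched as follows; the toy thesaurus, the example data and the function names are our own illustrative assumptions, not the paper's resources:

```python
from collections import defaultdict

# Toy one-level thesaurus: word -> semantic class.
THESAURUS = {"teacher": "human", "student": "human",
             "book": "object", "letter": "object"}

def acquire_rules(examples):
    """examples: (source_word, target_word, {slot: child_word}) triples
    taken from structurally matched sentence pairs.  Matches sharing the
    same source word are pooled and grouped by target word, their child
    elements collected, and each argument slot generalized to the set of
    thesaurus classes observed there; the classes become the rule's
    applicability condition."""
    grouped = defaultdict(list)
    for source, target, children in examples:
        grouped[(source, target)].append(children)
    rules = {}
    for key, fillers in grouped.items():
        slots = {slot for f in fillers for slot in f}
        rules[key] = {slot: sorted({THESAURUS[f[slot]]
                                    for f in fillers if slot in f})
                      for slot in slots}
    return rules

rules = acquire_rules([
    ("okuru", "send",    {"obj": "letter"}),
    ("okuru", "send",    {"obj": "book"}),
    ("okuru", "see off", {"obj": "teacher"}),
])
```

Two target words for the same source word yield two word selection rules, each conditioned on the generalized class of its object slot.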
3 Experiments of translation rule acquisition

We used Torihiki Jouken Hyougenhou Jiten (a collection of Japanese-English expressions for business contracts, 9,804 sentences) (Ishigami 92)
and EDICT 1994¹ and the Kodansha Japanese-English Dictionary (Shimizu 79) (93,106 words) as the base resources. We also used an electronic version of a Japanese thesaurus (called Bunrui-Goi-Hyo, BGH) (NLRI 94) and Roget's Thesaurus (Roget 11) for specifying the semantic classes. The current system works only with simple declarative sentences.

wE              Similarity   fe     fj     fje
abnormal        1            2      2      2
accessory       0.923077     14     12     12
accountant      0.941176     9      8      8
accumulative    1            2      2      2
accurate        0.769231     5      8      5
address         0.764977     111    106    83
adjudge         1            2      2      2
administrative  0.8          3      2      2
adopt           1            2      2      2
advancement     1            4      4      4
advancement     0.8          4      6      4
afterward       0.8          2      3      2
agent           0.935583     1004   952    915

Table 1: Examples of word similarity

word   sentence   parsing       matching      word-level     phrase-level
…      184        183 (99.5%)   180 (97.8%)   115 (63.9%)    65 (36.1%)
…      254        245 (96.5%)   242 (95.3%)   144 (59.5%)    97 (40.1%)
…      114        103 (90.4%)   99 (86.8%)    68 (68.7%)     31 (31.3%)
…      309        309 (100%)    298 (96.4%)   184 (61.7%)    113 (37.9%)
…      191        191 (100%)    179 (93.7%)   92 (51.4%)     87 (48.6%)
…      127        127 (100%)    116 (91.3%)   27 (23.3%)     88 (75.9%)

Table 2: Statistics of parsing and matching results. (The word column is only partially legible in this copy; the legible entries are make, business and exclusive.)

3.1 Acquisition of translation rules

A total of 948 word pairs of Japanese and English were obtained by the method for calculating word-word similarity between the two languages described in Section 2.1. Some examples of the similarity obtained in the
¹ EDICT 1994 is obtainable through ftp via monu6.cc.monash.edu.au:pub/nihongo
experiment are shown in Table 1. We obtained a number of domain-specific terms about business contracts, such as "agent" and "accountant" with their Japanese counterparts, which are not found in ordinary bilingual dictionaries. Out of the 948 word pairs we obtained, only 236 appear in EDICT or the Kodansha Japanese-English Dictionary. Acquisition of word pairs from domain-specific parallel corpora is very important, since many domain-specific word pairs do not appear in ordinary bilingual dictionaries. However, it should also be noted that repetitive occurrences of the same expression cause a slight error in the similarity of the pairs.

We selected several Japanese and English words of frequent occurrence and collected the structurally matched results. Some of the results for those words are shown in Table 2. For example, out of 184 occurrences of one Japanese verb, 183 sentences were successfully parsed (meaning that the correct parse was included in the possible parses), and 180 sentences succeeded in structural matching; of these, 115 sentences had a top subgraph with a single content word, and 65 sentences had a top subgraph with more than one content word.

To acquire word selection rules, the results are classified into groups according to the translated target words. A word selection rule is acquired for each target word by generalizing the child nouns to the classes in the thesaurus. The word selection rules for this verb are summarized in the upper part of Table 3. For instance, the table specifies that the verb is translated into "give" when its subject is in one of the semantic classes substance, school, store and difference, and its object is in one of the classes difference, unit and so on. Phrasal translation rules are treated in the same way. Such examples are shown in the lower part of Table 3. For instance, the Japanese phrase is translated into "X compensate Y" if X and Y satisfy the semantic constraints described in the table.
3.2 The translation rules

The translation rules described above are converted into the following data structure in our machine translation system:

tr_dict(index, source feature structure, target feature structure, condition).
[Table 3 is only partially legible in this copy. Its upper part gives word selection rules for the English verbs give (58), affect (8), confer (6), furnish (3), render (1), afford (1) and provide (1), listing the admissible thesaurus classes for the nominative (ga), objective (wo) and dative (ni) slots, e.g., [substance], [school], [store], [difference], [unit], [chance], [feeling], [number], [range seat track], [cause], [change], [trade], [propriety], [care] and [harmony]. Its lower part pairs Japanese patterns with English translation templates such as "[1] affect [2]", "[1] compensate [2]", "[1] assent to [2]", "[1] authorize [2]" and "[1] furnish [2] with [3]", where the numbered slots carry semantic-class constraints such as [store], [school] and [substance]. The number of word occurrences is in parentheses; the names of semantic classes in the thesaurus are in square brackets.]
Table 3: Acquired translation rules

index: the index word of the translation rule.
source feature structure: a feature structure of the source language.
target feature structure: a feature structure of the target language.
condition: the semantic condition for the rule, described by a set of semantic classes for the variables appearing in the source feature structure.

In the condition, checksem/2 is a Prolog predicate for checking the semantic classes of the variables (semantic classes are expressed by their class numbers in the thesaurus). Identifying the most suitable semantic classes in the thesaurus is by no means an easy task. In the current implementation, we use the semantic classes at the lowest level of the Japanese thesaurus BGH, which has six layers. This makes the description of the semantic condition a list of lowest-level semantic classes. Therefore, in our current implementation the translation rules compiled from few translation examples are far from
complete. Some of the translation rules in their final form are represented as follows:
[ pred:assent(verb), subj:X, to:Z ],
true ).

[ pred:give(verb), subj:X, obj1:Y, obj2:Z ],
( checksem(X, [11000, 11040, 11600, ...]),
  checksem(Y, [11642, 11910, 13004, ...]),
  checksem(Z, [11000, 11040, 12630, ...]) ) ).

[ pred:reference(noun) ],
true ).
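A Python stand-in for how such a rule's condition could be checked at transfer time (the class numbers are the ones shown above; the harness itself is our assumption, not the system's Prolog code):

```python
def checksem(word_class, allowed):
    # checksem/2: succeed when the variable's semantic class is among
    # the class numbers listed in the rule's condition.
    return word_class in allowed

def rule_applicable(bindings, condition):
    """bindings: variable -> BGH class number of the instantiating word;
    condition: variable -> admissible class numbers.  One checksem call
    per variable, conjoined as in the Prolog condition."""
    return all(checksem(bindings[var], allowed)
               for var, allowed in condition.items())

# Condition of the "give" rule above (class lists truncated as printed).
give_condition = {"X": [11000, 11040, 11600],
                  "Y": [11642, 11910, 13004],
                  "Z": [11000, 11040, 12630]}

ok = rule_applicable({"X": 11040, "Y": 11910, "Z": 12630}, give_condition)
```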
4 Discussion and Related Works

Our machine translation system based on the acquired translation rules has the following characteristics. The system uniformly deals with word selection rules (such as those for "confer") and phrasal translation rules (such as "X compensate Y"). Even if there is no translation rule to apply, the system uses the bilingual dictionary as the default: translation pairs in the dictionary are regarded as word selection rules with no condition.

Since all the translation rules are acquired from translation examples, manual compilation of translation rules is kept minimal. Also, since the structural matching results used to obtain the translation rules are symmetric, both English-Japanese and Japanese-English translation rules are acquired, making two-way translation possible.

Another important characteristic is that ambiguities (ambiguous translations caused by multiple applicable translation rules, and ambiguous structural analyses) are resolved by giving priority to the translation rules with more specific information. The frequency information of translation pairs is also used for deciding the priority among the translation options.

The parsing and generation phases share the grammars and dictionaries that are used in the acquisition phase of the translation rules. This ensures that there is no contradiction among the parsing, generation and translation rules.
On the other hand, the following issues should be considered. The quality of the translation rules depends on the quality of the thesaurus. Some inadmissible word selection and phrasal rules were acquired in the experiment. For example, a word selection rule pairing a Japanese pattern "X[human] Y[problem] …" (where the verb means advocate) with "make Y[problem] to X[human]" was acquired, which is not a good translation rule. Rather, "make an objection to X[human]" should be considered an appropriate idiomatic expression. Idiomatic expressions like this should be distinguished from normal word selection rules.

The proposed method is suited to formal domains. An experiment with colloquial expressions revealed many more difficulties in acquiring "good" translation rules. Moreover, the current method cannot cope with expressions that necessitate contextual information.

The method should also be augmented to deal with complex sentences. We do not think that a direct extension of the structural matching algorithm is applicable to complex sentences. Some two-level technique should be developed: the first level finds an appropriate decomposition of the complex sentences, and the proposed structural matching is applied at the second level.

A similar work for acquiring translation rules from parallel corpora is discussed in (Kaji 92), in which a bottom-up method is used for finding corresponding phrases (i.e., partial parse trees). We use dependency structures, which, we think, is a critical point, since word order is not normally preserved between Japanese and English sentences, while dependency between content words is preserved in most cases. (Watanabe 93) proposed a method of using matched pairs of dependency structures of Japanese and English sentences for improving translation rules. The algorithm for finding the structural correspondence is different from ours. Our method uses a finer similarity measure that is learned from the parallel corpus.
As for translation rule acquisition, their objective is to improve existing transfer rules, whereas our objective is to compile the whole set of translation rules altogether.

5 Conclusions
The translation rules obtained by the proposed method can be integrated into an existing machine translation system. Generally, translation may differ depending on the domain. Our system is easily adapted to any domain provided that sizable parallel corpora of that domain are accumulated.
To improve the acquired translation rules both in quality and quantity, we need to enlarge the scale of the parallel corpora. Another possible way to improve the translation rules is to feed the post-edited translation results back to the acquisition phase. By doing this, missing translation rules are gradually acquired.

REFERENCES

Ishigami, Susumu. 1992. Torihiki Jouken Hyougenhou Jiten. Tokyo: International Enterprise Development Co.

Kaji, Hiroyuki, Y. Kida & Y. Morimoto. 1992. "Learning Translation Templates from Bilingual Text". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), vol. II, 672-678. Nantes, France.

Kay, Martin & M. Röscheisen. 1993. "Text-Translation Alignment". Computational Linguistics 19:1.121-142.

Matsumoto, Yuji, H. Ishimoto & T. Utsuro. 1993. "Structural Matching of Parallel Texts". Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL'93), 23-30. Columbus, Ohio.

National Language Research Institute. 1994. Bunrui-Goi-Hyo [Word List by Semantic Principles]. Tokyo: Syuei Syuppan.

Roget, Peter M. 1911. Roget's Thesaurus. New York: Crowell.

Sato, Satoshi & M. Nagao. 1990. "Toward Memory-Based Translation". Proceedings of the 13th International Conference on Computational Linguistics (COLING-90), vol. III, 247-252. Helsinki, Finland.

Shieber, Stuart M., G. van Noord, R.C. Moore & F.C.N. Pereira. 1990. "A Semantic Head-Driven Generation Algorithm for Unification-Based Formalisms". Computational Linguistics 16:1.30-42.

Shimizu, Mamoru & N. Narita. 1979. Japanese-English Dictionary. Tokyo: Kodansha Co.

Sumita, Eiichiro & H. Iida. 1991. "Experiments and Prospects of Example-Based Machine Translation". Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL'91), 185-192. Berkeley, California.

Utsuro, Takehito, H. Ikeda, M. Yamane, Y. Matsumoto & M. Nagao. 1994. "Bilingual Text Matching Using Bilingual Dictionary and Statistics". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), vol. II, 1076-1082. Kyoto, Japan.

Watanabe, Hideo. 1993. "A Method for Extracting Translation Patterns from Translation Examples". Proceedings of the 5th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-93), 292-301. Kyoto, Japan.
Clause Recognition in the Framework of Alignment

HARRIS V. PAPAGEORGIOU
Institute for Language and Speech Processing (ILSP)
National Technical University of Athens (NTUA)

Abstract

In this paper we explore the possibility of achieving reliable clause identification of unrestricted text by using POS information (the output of a unification rule-based part-of-speech tagger), a CMTAG module trying mainly to fix errors from earlier processing, and a linguistic rule-based parser. Identification of simple and complex clauses is considered here as a basic component in the framework of bilingual alignment of parallel texts. One of the important points of this work is the ability to process very long sentences. The parser is capable of analysing and labelling clause structure. The system is applied to an experimental corpus. The results we have obtained are very promising.

1 Introduction
Recent research in bilingual alignment has explored mainly statistical methods. While these aligners achieve surprisingly high accuracy when performing at the sentence level (Brown 1991; Kay 1988; Gale 1991; Chen 1993), it remains an open issue how to generalise these techniques for alignment of phrases at the subsentence level because of the inherent assumptions of these methods. A number of recent proposals for the identification of subsentential translations have been developed that tackle the problem at different levels (Utsuro 1994; Kupiec 1993; Kaji 1992; Grishman 1994; Dagan 1993; Church 1993; Matsumoto 1993; Smadja 1992). However, the detection of clausal, embedded translations in bilingual parallel corpora remains a difficult problem, due to the fact that there is considerable divergence from the desirable one-to-one clause correspondence (Santos 1994). Papageorgiou (1994) describes a generic alignment scheme which is based on the principle that semantic content and discourse function are preserved by translation. At the sentence level, the aligner obtained performance comparable to that of statistical aligners.

In this paper we are mainly concerned with the first basic step in clause alignment, that is, clause recognition. Even though identification of simple and complex clauses is considered here as a basic component in the framework
of bilingual alignment of parallel texts, some points of contact with partial parsing will also be made.

2 Previous work
Automatic detection of clause boundaries is a prerequisite for clause alignment. It is also a major issue in parsing. According to Koskenniemi (1990): "Clause boundaries are easier to determine if we have the correct readings of words available". And conversely, it is more convenient to write constraint rules for disambiguation and head-modifier relations if one can assume that the clause boundaries are already there (Koskenniemi 1992).

Two different approaches have been extensively recorded in the literature: regular expression methods and stochastic methods. The former use regular expression grammars and constraints expanded into a deterministic FSA for clause recognition. Ejerhed (1988) uses a regular expression method, and her system:

• looked for noun phrases in a preliminary stage;
• concentrated on certain characteristics present in the beginnings of clauses; and
• assumed that the recognition of any beginning of a clause automatically leads to the syntactic closure of the previous clause.

Another assumption often used by researchers involved the construction of the grammar: the text was expected to be fully and correctly disambiguated (Ejerhed 1988; Coniam 1991). Several errors confusing VBD (past tense) and VBN (past participle) as well as IN (preposition) and CS (subordinating conjunction), which were made during the tagging process, led to incorrect recognition of clauses in many cases.

A systematic failure of the system described by Ejerhed (1988) was its incapability to capture clauses beginning with a CC (coordinating conjunction) followed by a tensed verb, as in the example:

[The Purchasing Departments are well operated and follow generally accepted practices].

The same also applies to cases where a preposition is followed by a wh-word, i.e., before (WDT WPO WP$ WRB WQL), as in the example:

[The City Executive Committee deserves praise for the manner in] [which the election was conducted].
In (Koskenniemi 1992), constraint rules are hand-coded, specifying what kinds of clues must be present in order to insert a clause boundary. In the case of
non-finite constructions, a non-finite verbal skeleton is constructed, starting with certain kinds of non-finite verbs and ending with the first main verb to the right. A distinction is made between finite and non-finite clause constructions by using different tags. This approach increases the amount of ambiguity and burdens the disambiguation process. Another feature of the system is that a first level of centre embedding has been taken into consideration, as in the example:

@@The man ... @( who came first @ ) got the job @@.
Technical constraints on feasible clause bracketing have not allowed a second or third level of embedded clauses.

As for the stochastic approach, training material is needed in order to fine-tune the parameters of the model (Ejerhed 1988; Ramshaw 1995). In (Ejerhed 1988), the training material included markers for the beginning and end of clauses. The system was also trained to recognise tensed verbs. The results recorded were surprisingly good. However, a comparison of the nature of the errors in a sample of a regular expression approach and a sample of the stochastic recogniser revealed that while the finitary approach errors are systematically due to under-recognising clause boundaries, the stochastic program errors are due both to over-recognising and under-recognising clause boundaries. These qualitative results give preference to finitary methods, since under-recognising clauses is not actually a problem, given that they can be easily recovered using simple Dynamic Programming techniques. On the other hand, overgeneration coupled with errors by the stochastic module (due to wrong predictions of clause openings and/or closings) makes it more difficult for the alignment algorithm to reconstruct the clause structure of the sentence.

3 The model
The methodology adopted here is surface-oriented and stepwise. An inspiration has been the so-called CASS (Cascaded Analysis of Syntactic Structure) described in (Abney 1990). The goal is to recover syntactic information efficiently and reliably, by sacrificing completeness and depth of analysis. The full framework of automatic clause recognition is depicted in Figure 1.

The preprocessing module is a single deterministic FSA which partitions input streams into tokens. In the context of alignment, sentence and word boundaries as well as numbers, dates, abbreviations, paragraph boundaries and various sorts of punctuation are extracted. The rules of the text grammar were designed to capture the introduction of text sentences and also to
Fig. 1: The model architecture.
define text adjunct formulations (Nunberg 1990). The tagging analysis is done by the well-known transformation-based tagger (Brill 1993a; Brill 1993b; Brill 1994). The initial state was a trigram tagger tuned and trained on a small portion of a (different but of the same type) pre-tagged corpus. Training the contextual-rule tagger was done on a small amount of text, about 70,000 words, from the CELEX database (the computerised documentation system on Community Law).

CMTAG (Clause Marker TAGging) is an essential step supplementing the tagging by reducing ambiguities concerning possible clause markers and by enriching the annotation of the text with information about certain types of clauses. Its role is:

• to extend parts of speech over more than one orthographic word (quite similar to the IDIOMTAG module of the CLAWS tagger). This is done only for compound subordinators such as so as/that, in order to/for/that, as if, and for complex prepositions such as according to, by means of, due to.

• to discriminate a non-finite verbal skeleton from a finite construction: for this purpose we insert tags for cases like: to/TO the/DT Treaty/NNP establishing/VBG-F the/DT European/JJ ...
Examples of finite verbal constructions are:
— They [ are going to adopt ] ...
— They [ would not have been investigating ] this ...
— I [ would like to make ] some questions ...
In the last example, we include "to make" in the finite verb chain, an interpretation that will be validated by the clause alignment algorithm. Examples of non-finite verbal constructions are:
— the distillation [ indicated ] in Article 39 thereof is decided on ...
— [ Given ] the improvements in market conditions ...

• to reduce ambiguity by imposing constraints on the tagged corpus. Here we try to fix errors of mis-analysed complementisers by inserting clause markers before possible conjunctions: for example, if there is a verbal construction before the next candidate conjunction or punctuation and after a possible subordinator, as in the following case:
... voluntary distillation as/RB provided for in Articles 38, ...
"as", which was tagged RB (adverb), is converted to as/CS.

• to label certain types of clauses depending on clause openers. It is a two-level module distinguishing adverbial, relative, non-finite and coordinate clauses as in (Quirk 1985). At a second stage, we predict a subcategorisation of adverbial clauses into eight types, following (Collins 1992). This information will be exploited, if it is worthwhile, by the clause alignment algorithm; it does not affect the subsequent syntactic processing.

Finally, the proposed grammar constructs the clause analysis for input sentences. First, we introduce a few definitions: subord for the set of complementisers and punct for the set of punctuation marks:

• subord = (CS | WDT | WRB | WP | WP$)
• punct = (- | , | : | . | ; | ")

The syntactic analysis consists of a set of rules trying to match the input against the rule patterns. The first pattern defines complete subordinate clauses as consisting of an optional coordinating conjunction (CC), followed by an obligatory subordinating conjunction (subord), followed by optional nominal elements, followed by an obligatory verbal construction (as defined in the CMTAG description), followed by optional nominal elements excluding anything listed in the subord or punct sets above or a new verbal construction.
This rule pattern is expressed as:

(CC)? {subord} ({nomelements} | {punct})* (fvskeleton) {nomelements}*

The second pattern defines non-finite clauses, starting with a non-finite verbal skeleton followed by optional nominal elements, again excluding anything listed in the subord/punct sets above or a new verbal construction. The expression representing this pattern is:

(nfvskeleton) {nomelements}*

The third pattern defines coordinate clauses introduced by an obligatory coordinating conjunction (CC), followed by an optional adverb, followed by a verbal skeleton, followed by the same ending as in the previous cases.
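These clause patterns are ordinary regular expressions over token classes, so they can be sketched directly. The single-letter encoding below is our own illustration (C = CC, S = subordinator, F = finite and N = non-finite verbal skeleton, n = nominal element, P = punctuation mark, a = adverb), and the coordinate pattern is written from the description above:

```python
import re

SUBORDINATE = re.compile(r"C?S[nP]*Fn*")  # (CC)? subord (nom|punct)* fvskeleton nom*
NONFINITE = re.compile(r"Nn*")            # nfvskeleton nom*
COORDINATE = re.compile(r"Ca*[FN]n*")     # CC adv* (fv|nfv)skeleton nom*

def clause_type(tags):
    """Classify a clause candidate given as a string of class letters."""
    for name, pattern in (("subordinate", SUBORDINATE),
                          ("non-finite", NONFINITE),
                          ("coordinate", COORDINATE)):
        if pattern.fullmatch(tags):
            return name
    return None

# "for-which each producer may submit one or more delivery contract
# declarations" is roughly S n F n n in this encoding:
print(clause_type("SnFnn"))  # → subordinate
```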
(CC) (adv)* ((fvskeleton) | (nfvskeleton)) {nomelements}*

There are six other rule patterns capturing basic clause fragments (verb phrase fragments, noun phrase fragments and adjuncts) and trying to identify the role of noun phrase fragments (SUBJ, OBJ, ...). Finally, the third and last part consists of grammar rules and actions trying to construct clauses (main clauses and embedded clauses) from the non-terminals identified by the second part. For example, a simple rule is that a main clause can be a noun phrase fragment followed by a sequence of clause(s) (identified by the second part) and a verb phrase fragment, as in:

[ the total quantity of table wine ] [ for-which each producer may submit one or more delivery contract declarations for approval by the intervention agency ] [ should be limited ] [ to an appropriate percentage of the quantity of table wine ].

where the result is two clauses:

[ the total quantity of table wine should be limited to an appropriate percentage of the quantity of table wine ] [ for-which each producer may submit one or more delivery contract declarations for approval by the intervention agency ].

Regulation   Sentences   Clauses   Identified Clauses   Wrong-place
R0086        22          46        44                   1
R0104        11          23        23                   0
R0746        68          134       127                  8
R1111        30          40        39                   1
R1117        24          36        35                   0
R1120        33          54        54                   2
R1369        25          41        39                   1
R1425        39          64        63                   1
R1486        25          45        43                   2
R1516        38          81        79                   9
TOTAL        315         562       548                  25

Table 1: Test Samples after syntactic analysis

4 Results
The proposed model was applied to a test suite of ten regulations of the CELEX database. (Table 1 shows the results obtained only for the English
corpus, though the experiment was done for the English-Greek language pair set of sentences.) The total success rate of the current system is about 93%: out of 562 clauses the system identified 548, and in 25 of these cases a clause marker was placed in the wrong position, giving (548 - 25)/562 ≈ 93%.

Under-recognition errors made by the parser were due to inherited errors made by the tagger (confusing NN and VB) that propagated to the subsequent modules. The proper way to deal with these cases is probably to establish an NP filter to correct the most common errors (similar to that in (Abney 1996)). Wrong-place errors are mainly due to the incapability of the system to identify correctly the clause openings in coordinate constructions where a CC is followed by a noun phrase, followed by a subordinate clause, followed by a verb phrase fragment. Differences between success rates in different regulations are partly explained by the structural complexity of the very long sentences that are characteristic of the samples.

5 Conclusions
We have introduced a method for recognising clauses for subsentence alignment purposes. The method is robust with respect to wrong-place errors and over-recognising errors (about 4% in total) if we ignore under-recognition errors, which might be handled by the alignment algorithm. Some improvements might be achieved by inserting 'repair' filters before parsing. Work on the alignment algorithm is currently being carried out.

Acknowledgements. I want to greatly thank Stelios Piperidis for his extensive work on the Greek compound conjunctions. Special credit is due to Penny Lambropoulou for commenting on the work.

REFERENCES

Abney, Steven. 1990. "Rapid Incremental Parsing with Repair". Proceedings of the 6th New OED Conference, 1-9. University of Waterloo.

Abney, Steven. 1996. "Part-of-Speech Tagging and Partial Parsing". To appear in Corpus-Based Methods in Language and Speech ed. by Ken Church, S. Young & G. Bloothooft. Dordrecht: Kluwer.

Brown, Peter F., J.C. Lai & R. Mercer. 1991. "Aligning Sentences in Parallel Corpora". Proceedings of the 29th Annual Conference of the Association for Computational Linguistics, 169-176. Berkeley, Calif.
Brill, Eric. 1993a. "Automatic Grammar Induction and Parsing Free Text: A Transformation-based Approach". Proceedings of the DARPA Speech and Natural Language Workshop, 237-242.

Brill, Eric. 1993b. "A Corpus-Based Approach to Language Learning". PhD thesis. Philadelphia: University of Pennsylvania.

Brill, Eric. 1994. "Some Advances in Transformation-Based Part-of-Speech Tagging". Proceedings of the 12th National Conference on Artificial Intelligence (AAAI-94), 722-727.

Church, Ken Ward. 1993. "Char_align: A Program for Aligning Parallel Texts at the Character Level". Proceedings of the 31st Annual Conference of the Association for Computational Linguistics, 1-8. Columbus, Ohio.

Coniam, David. 1991. "Boundary Marker: System Description". NERC report.

Chen, Stanley. 1993. "Aligning Sentences in Bilingual Corpora Using Lexical Information". Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, 9-16. Columbus, Ohio.

Collins Cobuild. 1992. Collins Cobuild English Grammar. London: HarperCollins.

Dagan, Ido, K. Church & W. Gale. 1993. "Robust Bilingual Word Alignment for Machine-Aided Translation". Proceedings of the Workshop on Very Large Corpora, 1-8. Columbus, Ohio.

Ejerhed, Eva. 1988. "Finding Clauses in Unrestricted Text by Finitary and Stochastic Methods". Proceedings of the 2nd Conference on Applied Natural Language Processing, 219-227. Austin, Texas.

Gale, William A. & Ken Church. 1991. "A Program for Aligning Sentences in Bilingual Corpora". Proceedings of the 29th Annual Conference of the Association for Computational Linguistics, vol. 2, 177-184. Berkeley, Calif.

Grishman, Ralph. 1994. "Iterative Alignment of Syntactic Structures for a Bilingual Corpus". Proceedings of the Second Annual Workshop on Very Large Corpora, 57-68. Kyoto, Japan.

Kaji, H., Y. Kida & Yasutsugu Morimoto. 1992. "Learning Translation Templates from Bilingual Text". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), 672-678. Nantes, France.
Kay, Martin & M. Röscheisen. 1993. "Text Translation Alignment". Computational Linguistics 19:1.121-142.
Kupiec, Julian. 1993. "An Algorithm for Finding Noun Phrase Correspondences in Bilingual Corpora". Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, 17-22. Columbus, Ohio.
Koskenniemi, Kimmo. 1990. "Finite State Parsing and Disambiguation". Proceedings of the 13th International Conference on Computational Linguistics (COLING-90), vol. 2, 229-232. Helsinki, Finland.
CLAUSE RECOGNITION
Koskenniemi, Kimmo, P. Tapanainen & A. Voutilainen. 1992. "Compiling and Using Finite State Syntactic Rules". Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), 156-162. Nantes, France.
Matsumoto, Yuji, H. Ishimoto, T. Utsuro & Makoto Nagao. 1993. "Structural Matching of Parallel Texts". Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, 23-30. Columbus, Ohio.
Nunberg, Geoffrey. 1990. The Linguistics of Punctuation. CSLI Lecture Notes no. 18. Stanford: Center for the Study of Language and Information.
Papageorgiou, Harris, L. Cranias & S. Piperidis. 1994. "Automatic Alignment in Parallel Corpora". Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 334-336. Las Cruces, New Mexico.
Quirk, Randolph, S. Greenbaum, G. Leech & J. Svartvik. 1985. A Comprehensive Grammar of the English Language. London: Longman.
Ramshaw, Lance A. & M. P. Marcus. 1995. "Text Chunking using Transformation-Based Learning". Third Workshop on Very Large Corpora. Cambridge, Mass.
Smadja, Frank. 1992. "How to Compile a Bilingual Collocational Lexicon Automatically". AAAI-92 Workshop on Statistically-Based NLP Techniques, 65-71. San Jose, Calif.
Santos, Diana. 1994. "Bilingual Alignment and Tense". Proceedings of the Second Annual Workshop on Very Large Corpora, 129-143. Kyoto, Japan.
Utsuro, Takehito, H. Ikeda, M. Yamane, Y. Matsumoto & Makoto Nagao. 1994. "Bilingual Text Matching using Bilingual Dictionary and Statistics". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), 1076-1082. Kyoto, Japan.
Bilingual Vocabulary Estimation from Noisy Parallel Corpora Using Variable Bag Estimation

DANIEL B. JONES & HAROLD SOMERS
UMIST, Manchester

Abstract

This paper describes a fully automatic bilingual lexicon extraction process which can be used for estimating the bilingual vocabulary of two languages from the analysis of raw and noisy parallel corpora. No sentence alignment or other pre-processing is required. The result is a set of possible translations for each source language word. Each translation is assigned a probability based on the distributional nature of the source word in relation to the target word across the parallel corpus.
1 Introduction
Extracting useful information about language behaviour from corpora is of great interest in theoretical terms as it calls into question the exact role of linguistics in Language Engineering. If the process of information extraction is also a fully automatic one, i.e., it requires no human intervention either at an initial stage (for example, tagging words with their grammatical parts of speech), or during the process's execution time, then the information extraction mechanism is of additional practical importance as it can be applied to texts under a wide variety of circumstances and for a wide variety of needs. This paper describes a fully automatic information extraction process which can be used for estimating the bilingual vocabulary of two languages from the analysis of raw and noisy parallel corpora. The vocabulary is estimated in the sense that a word in the target language text is said to be a translation of a word in the source language text given an estimation of their distribution within the parallel corpus. The corpus itself has to be presented in parallel, i.e., the source language corpus has been translated into the target language and both versions are available to the process. The approach described here does not require the parallel corpus to be 'clean': no pre-editing of the text as a whole or pre-alignment of sentences is necessary.
2 Related work
Similar work in this area has been carried out by a variety of researchers (Catizone et al. 1989; Kay & Röscheisen 1993; Gale & Church 1991; Fung & Church 1994; Fung & McKeown 1994; Jones & Alexa 1994). Most (if not all) of these approaches have been developed in order to bootstrap NLP (and in particular Machine Translation) systems with lexical and phrasal alignment information from parallel corpus material. The main characteristic of these approaches is that, to some degree or other, they require no linguistic description or pre-processing, as they rely on distributional-statistical data of word occurrence. When pre-processing is used, it is itself an automatic process. Gale and Church, for example, require sentences in parallel corpora to be aligned before vocabulary estimation can be achieved to a reasonable degree of accuracy; the sentence alignment is done automatically. It is appealing, though, when pre-processing is not used at all. One advantage is that the overall run-time of an alignment process will be much faster. Given the computationally expensive nature of this type of statistical processing, any saving in processing time can be very advantageous. Fung & Church propose a method called K-vec which can be used for estimating bilingual vocabulary from 'noisy' corpora, i.e., parallel bilingual corpora which have been neither pre-edited nor sententially aligned. We find the approach appealing in particular for the reason stated above and also for its inherent simplicity. The following section briefly describes K-vec and an extension of it which we call Variable Bag Estimation (VBE).

3 Methodology
3.1 General estimation of distribution
The method of alignment used by VBE is quite simple in principle. Firstly, positional estimation of possible translation alignments is carried out using the same process proposed by Fung & Church (1994). Briefly, this involves dividing the source and target language corpora into portions.1 Once this is done, the presence or absence of a word in each portion is noted. For example, if we are considering the possible alignment of a source word SWi

1 Fung & Church suggest the square root of the length of the corpus (counted in words) as a suitable value for the portion size; the number of portions, K, must be the same for both corpora.
with a target word TWj, the distribution of SWi and TWj over the corresponding portions of the source and target corpora is calculated. The result is a binary vector of length K for each of SWi and TWj, e.g.,

Vi = [1, 0, 1, 1, ...]   (1)
Vj = [1, 1, 0, 1, ...]   (2)
The likelihood of SWi and TWj being in a translation relation is then based on a comparison of the two vectors Vi and Vj. Once the vectors have been computed, 2 × 2 contingency matrices are calculated for the pair of vectors showing the number of portions which contain (a) both SWi and TWj, (b) SWi but not TWj, (c) TWj but not SWi and (d) neither SWi nor TWj. The word pairing is then assigned a mutual information score and a significance score from the values in the contingency matrix. The mutual information score I is based on co-occurrence probabilities, and is given by:

I = log2 ( P(SWi, TWj) / (P(SWi) P(TWj)) )   (3)

where

P(SWi, TWj) = a / K   (4)

P(SWi) = (a + b) / K   (5)

and

P(TWj) = (a + c) / K   (6)

The significance score is given by:

t = (P(SWi, TWj) − P(SWi) P(TWj)) / √(P(SWi, TWj) / K)   (7)

We tested Fung & Church's algorithm on an English-French parallel corpus.2 By way of example, estimations for the translation of the English word years are given in Table 1. The table is numerically sorted on the I column

2 The corpora contained 23136 and 24377 words respectively, and were taken from the ACL-European Corpus Initiative CD-ROM. The material was that of the announcement text of the European Community's Esprit programme.
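As a rough illustration, the contingency-based scoring just described can be sketched as follows. This is our own reading of the Fung & Church formulation, not their implementation; the function name and conventions are ours.

```python
import math

def kvec_scores(src_vec, tgt_vec):
    """Score a candidate word pair from binary occurrence vectors.

    src_vec[k] / tgt_vec[k] are 1 if the source / target word occurs in
    portion k, else 0.  Returns (I, t) following equations (3)-(7).
    """
    K = len(src_vec)
    assert K == len(tgt_vec), "both corpora must use the same number of portions"
    # 2 x 2 contingency counts over the K portions
    a = sum(1 for s, t in zip(src_vec, tgt_vec) if s and t)       # both
    b = sum(1 for s, t in zip(src_vec, tgt_vec) if s and not t)   # SW only
    c = sum(1 for s, t in zip(src_vec, tgt_vec) if not s and t)   # TW only
    p_joint = a / K             # P(SWi, TWj)
    p_src = (a + b) / K         # P(SWi)
    p_tgt = (a + c) / K         # P(TWj)
    if p_joint == 0 or p_src == 0 or p_tgt == 0:
        return float("-inf"), float("-inf")
    I = math.log2(p_joint / (p_src * p_tgt))                # mutual information
    t = (p_joint - p_src * p_tgt) / math.sqrt(p_joint / K)  # significance
    return I, t
```

Applied to the example vectors (1) and (2) above, a = 2, b = c = 1, and both scores come out slightly negative, since the words co-occur a little less often than chance would predict.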
TW              I            t
système         0.596585     0.957939
d'un            0.585938     1.05552
secteurs        0.573865     0.868297
tout            0.573865     0.868297
terme           0.521397     0.90991
marché          0.477003     0.689608
etats           0.458388     0.720176
entre           0.425473     0.807663
mais            0.40394      0.646115
long            0.351472     0.529619
comme           0.251937     0.423933
données         0.235995     0.36963
nécessaire      0.213969     0.308215
on              0.204631     0.349873
activités       0.142019     0.265165
doit            0.0295442    0.0453256
années          -0.0270394   -0.0423042
travail         -0.040845    -0.0574324
gestion         -0.0814871   -0.129934
ressources      -0.107959    -0.155405
ainsi           -0.184581    -0.305193
communautaire   -0.192848    -0.350321
conseil         -0.23349     -0.304279
niveau          -0.23349     -0.392823
secteur         -0.23349     -0.351351
%               -0.280796    -0.480452
sera            -0.292384    -0.449324
cours           -0.311493    -0.417409
projet          -0.311493    -0.417409
esprit          -0.370994    -0.655712
techniques      -0.385493    -0.750294
tous            -0.414062    -0.743342
leurs           -0.455883    -0.643668
peut            -0.455883    -0.743243
produits        -0.455883    -0.743243
ont             -0.555418    -0.939189
2               -0.602724    -1.03716
base            -0.648528    -0.983056
elle            -0.921546    -1.5487
phase           -1.04085     -1.49544
seront          -1.23349     -1.9111
traitement      -1.29238     -2.04965
objectifs       -1.69292     -3.15809

Table 1: Estimates for possible translations of the English word years
with the largest values first, as large I scores are 'better' than low scores. It can be seen that the best estimate of a translation relation lies with système, whereas the correct translation (années) is ranked only 17th. This is evidently not a good result, and certainly not as positive as the results hinted at by Fung & Church. Yet results like these are typical of what we have obtained in several experiments with this algorithm, with a variety of corpora and language pairs including German and Japanese. Why could this be? One possible explanation is that our corpora are too small. Although the algorithm itself is independent of corpus length, it is possible that with too small a corpus the distributions of the words are too similar. It should be noted too that Fung & McKeown (1994:82) also report the poor performance of the K-vec algorithm with Japanese-English and Chinese-English parallel corpora:

K-vec segments two texts into equal parts and only compares the words which happen to fall in the same segments. This assumes a linearity in the two texts [...]. The occurrence of inserted or deleted paragraphs is another problem which leads to nonlinearity of parallel corpora.

In fact, it does not need a whole paragraph to skew the corpus: just an extra sentence near the beginning of the text can mean that many of the words you would expect to occur in portion i actually occur in portion i + 1, as we discovered with a manual inspection of the corpus. Fung & McKeown (1994) tried to overcome this weakness by proposing a new algorithm, DK-vec, which compares 'recency vectors' for word pairs, comparing the amount of text between each occurrence of the word, the idea being that each such vector will have a distinctive trace, rather like a speech signal, so that techniques developed for matching such signals can be used.

3.2 Variable bag estimation
Our own approach similarly attempts to capture the generalisation that words which are translations of each other will appear at roughly the same equivalent places in the text. Figure 1 depicts this in a graphic, though simplistic, way. In the case shown in Figure 1, there are three instances of bilingual and three instances of bilingue. As they also appear in the same portions of the text (and nowhere else), it is logical to regard them as probable translations of one another. In order to simplify matters we can imagine the system assuming that its highest-scoring estimation of a translation of SW is the TW which
This paper describes a fully automatic [bilingual] lexicon extraction process which can be used for estimating the [bilingual] vocabulary of two languages from the analysis of raw and noisy parallel corpora. No sentence alignment or other pre-processing is required. The result is a set of possible translations for each source language word. Each translation is assigned a probability based on the distributional nature of the source word in relation to the target word across the [bilingual] corpus.

Cet article décrit un processus d'extraction de lexique [bilingue] entièrement automatique qui peut servir à deviner le vocabulaire [bilingue] d'après l'analyse des corpus parallèles bruts et bruyants sans appariement des phrases ou autre pré-édition. Le processus donne un ensemble de traductions possibles pour chaque mot de la langue source. Avec chaque traduction est associée une probabilité laquelle est basée sur la nature distributionnelle du mot source par rapport au mot cible et ce, à travers le corpus [bilingue].

Fig. 1: Graphic depiction of source and target word alignment

has either the highest I and/or t score. This assumption is based on the fact that those portions which contain TW are physically in or near the same position as SW is found in the source language version of the corpus. However, even in 'well behaved' corpora this is very often not the case, and in noisy texts which have not been pre-processed to remove or canonicalise white space, punctuation, markup, etc., the position is made more difficult and less accurate than it might otherwise be. The results in Table 1 show that although the correct translation is often given a relatively high score, it may well not be sufficient to distinguish it as the alignment to select over any of the others. Given this state of affairs, is there anything that can be done to improve the performance of the process? The problem is the generality of the portions into which the corpus is divided. In particular, the size of the portion is arbitrary, and, in addition, as mentioned above, the quality of word alignment depends on the degree of 'skewing' exhibited by a parallel corpus, i.e., the degree to which the word positions of the texts are offset by intermediate material. However, it is logical that these factors can be alleviated if the portions are not fixed but variable in size and therefore content.
A direct method of incorporating this approach into Fung & Church's work would be simply to start with a portion size based on splitting the corpus into K chunks as before, iteratively increasing or decreasing the size of the portions, and examining the relative performance of each value of K. It would be interesting to see, for example, whether or not some values of K produced better results than others with respect to certain language pairs and/or sublanguages. However, it is not clear how one could use such a system in a fully automatic bootstrapping process, as the value of K would not be self-determining but a matter for ongoing empirical study. A more practical implementation of a variable portion size process involves creating a minimal-sized portion of one word and gradually increasing its size until the significant matching target word appears. Intuitively, what happens is this: assuming the 'bags' are centred at roughly the right places, but assuming also that the texts are not perfectly aligned word-by-word, as the bags grow in size, more and more words are 'sucked' into them. Remember that there are several bags spread throughout the text. At first, the bags will apparently contain random words, some of them occurring in several of the bags, but, significantly, occurring just as often outside the bags. At some point, amongst a lot of rubbish, crucially the word we are looking for — the translation of the source word — will be found in nearly all the bags, and hardly anywhere else. This is when the process stops. The advantage of this is that the termination of the process is not arbitrary but is based on determining which words in the local context of the initiating points are unique to that context and not the rest of the corpus. The crucial question is: Where are the initiating points? The simplest (and most naïve) method of finding these points would be to use the same locations in the target text as the source words.
In other words, if the source word under consideration occurred as the 11th, 50th, and 200th word, the initiating points in the target corpus would be the same (or rather, the equivalent taking into account the relative lengths of the two corpora). However, experiments have shown that this approach is too crude and does not provide very good results. A much better approach is to estimate the initiating points in the target text from the information used in K-vec's general estimate of distribution. If we take each entry in Table 1 in turn, the corresponding portions in which each word occurs can be used as anchor points from which the initiating points can be estimated more accurately. An initiating point IP can be determined from:

IP = ((kj − 1) × K') + (K' × IPe)   (8)
where kj represents the portion which contains the target word, K' is the number of words per portion (roughly equal to K, since K is the square root of the length of the corpus), and IPe is an offset factor for estimating the initiating point for that portion. A neutral value for this would be 0.5, which would place IP at the mid-point of the portion containing the target word.3 For example, if techniques in Table 1 occurred in (amongst others) portion number 23 and there were 150 words per portion, the IP (assuming IPe = 0.5) for the target text with respect to an alignment for the source word years would be:

3375 = ((23 − 1) × 150) + (150 × 0.5)   (9)
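Equation (8) is straightforward to compute; a minimal sketch follows (the function name is ours, not part of the original system):

```python
def initiating_point(k_j, k_prime, ip_e=0.5):
    """Equation (8): estimate an initiating point in the target corpus.

    k_j     -- index of the portion containing the target word (1-based)
    k_prime -- number of words per portion (K')
    ip_e    -- offset within the portion; 0.5 places IP at the mid-point
    """
    return (k_j - 1) * k_prime + k_prime * ip_e

# Worked example from equation (9): portion 23, 150 words per portion
# initiating_point(23, 150) == 3375.0
```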
Thus, the IP in this case is located at the 3375th word in the target language corpus.

4 Experiments
Because VBE makes use of Fung & Church's corpus portioning information, it acts as a filter on the I and t values produced by that process. VBE can therefore be thought of as a post-filtering process which seeks to support (or otherwise) the estimations made by the K-vec process. Experiments were carried out to compare results obtained from the K-vec process with those of VBE using the English-French parallel corpus mentioned above. The VBE process used the portioning information produced by K-vec in order to determine its IP values by applying equation (8). Table 2 shows the results of running VBE with IP values derived from the K-vec alignment portion information for the source language word years. The table lists the target words (TWs) aligned with years; associated with each target word is the bag size at which it emerged as a likely translation. As outlined in section 3.2, a bag4 is created which contains all the words at all the IPs. When the VBE process begins the bag is very small and will only contain the words which fall exactly at the IPs. However, at every iteration of the process, more words are added which are neighbours of the IP words. Neighbour words are regarded as

3 Any other value for IPe (between 0 and 1) would suggest a starting point nearer the start or end of the portion, and could be used in connection with a different rate of leftward or rightward expansion of the bag. For example, one might want to experiment by starting the search at the front of the portion, and expanding only rightwards.
4 Intuitively there are several bags, one for each occurrence of the candidate target word; but for computational simplicity, all the bags can be combined as a single 'super-bag'.
words immediately to the right and left of the IP. Thus, the bag increases in size as these neighbour words are added at each iteration until there emerges a word contained in the bag which does not occur outside it. This is the proposed translation.

BAG SIZE   LIST OF TWS
220        années, niveau, on, projet, secteur, système
230        ainsi, d'un, entre, mais, nécessaire, peut
240        activités, comme, communautaire, cours, doit, gestion, leurs, long, ressources, secteurs, sera, techniques, tous, tout, travail

Table 2: VBE estimates for possible translations of the English word years using IP values derived from a previous K-vec process using the same source language word

The VBE algorithm can be briefly described as follows:
1. Determine all IP values from the target language corpus partitioning information created by the K-vec process. There will be one IP for each instance of the target word under consideration.
2. At each IP create a bag from the word located at the IP plus n words to the left and n words to the right of the IP. The value of n is based on the degree of granularity required.
3. Check if the words in the bag(s) only occur in the bag(s) and nowhere else in the target language corpus. If this is true, stop. If it is false, increase the size of n by an increment factor, e.g., 10 or 20 words,5 and go to step 2.
This algorithm is applied to each target language word estimated to be a translation of the source by K-vec. Once all these words have been processed by VBE, the word or words with the smallest bag sizes are considered to be the most likely translations of the source word. Small bags score more highly than large bags as they indicate a greater degree of positional equivalence of the source and target word(s). If the source and target texts were identical, any given word in the source text at position x would find its translation at position x in the target. This is the essential principle used by VBE, except that the variable bag size accommodates the practical consideration of source and target corpora being far from identical when dealing with real languages.

5 In practice, the increment factor cannot be too small as this slows down the process. On the other hand, a large increment factor results in a coarser grain-size in the results.
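The three steps above can be sketched as follows. This is our own illustrative reading of the algorithm, not the authors' code: token lists, the combined 'super-bag', and the half-width n are our assumptions.

```python
from collections import Counter

def vbe_bag_size(target_corpus, ips, step=10):
    """Sketch of VBE steps 1-3: grow bags centred on the initiating
    points until some word occurs only inside the bags.

    target_corpus -- list of target-language tokens
    ips           -- initiating points (word indices into target_corpus)
    step          -- increment for the bag half-width n at each iteration
    Returns (n, words) where n is the final half-width and words are the
    tokens unique to the bags, or (None, set()) if none is ever found.
    """
    total = Counter(target_corpus)
    n = step
    while n <= len(target_corpus):
        # one combined 'super-bag' over all IPs (cf. footnote 4)
        in_bag = set()
        for ip in ips:
            in_bag.update(range(max(0, ip - n),
                                min(len(target_corpus), ip + n + 1)))
        bag_counts = Counter(target_corpus[i] for i in in_bag)
        # words whose every corpus occurrence falls inside the bag(s)
        unique = {w for w, c in bag_counts.items() if c == total[w]}
        if unique:
            return n, unique
        n += step
    return None, set()
```

Running this once per candidate TW and keeping the candidates with the smallest bag sizes reproduces the ranking idea behind Tables 2 and 3.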
The results shown in Table 2 were obtained by taking each TW from Table 1 and applying the VBE procedure as outlined in Section 3.2. The comparison between the two tables is quite striking, as VBE has scored the correct translation (années) as the most likely translation, along with five other candidates (niveau, on, projet, secteur, and système).
5 Results and observations
Table 3 shows another example of a VBE alignment. The English word under consideration is software. The correct translation, logiciel, is ranked joint 3rd. In this particular case, K-vec ranked logiciel joint 17th as a probable translation of software, so again VBE has preferred to give the correct translation a higher ranking.

BAG SIZE   LIST OF TWS
210        résultats
220        communauté
230        commission, développement, domaine, leur, logiciel, programme, projets, se, technologies
240        aux, ce, ces, cette, d'une, est, il, l'industrie, ou, pas, plus, recherche, sont, systèmes, technologie, travaux

Table 3: VBE estimates for possible translations of the English word software

At the time of writing an exhaustive comparison has not been made between K-vec and VBE. However, it can be said that VBE does not always confirm K-vec's results and will promote certain target words, making them more probable translations. As Table 2 demonstrates, even when VBE does correctly rank the proper alignment in first place, it is not the only candidate. In fact it is often the case that there are multiple alignments for any given score. However, this is largely due to the level of granularity used by the program when increasing bag size. As already mentioned, finer degrees of granularity are more computationally expensive: it is a lot quicker to increase bag sizes by 10 or 20 words at each iteration instead of, say, just two words at a time. However, this is a practical consideration and further experiments are required to establish to what extent candidate alignments are indeed distinguished from each other due to increased granularity.
6 Conclusions
Although the VBE approach is fully automatic, it is not necessarily foreseen as forming part of a completely automatic MT system. We believe that a bottleneck in the creation of MT capabilities for differing language pairs is the creation of information to allow such systems to perform at a useful level of translation quality. Approaches like that of VBE can facilitate the bootstrapping of MT products more quickly than training linguists to produce from scratch the information required by systems having to translate between perhaps very different language groups. Hybrid systems, for instance, would seem to be a sensible way to incorporate information which can be determined from both automatic and human sources.

REFERENCES

Catizone, Roberta, Graham Russell & Susan Warwick. 1989. "Deriving Translation Data from Bilingual Texts". Proceedings of the 1st International Acquisition Workshop. Detroit, Michigan.
Fung, Pascale & Kathleen McKeown. 1994. "Aligning Noisy Parallel Corpora Across Language Groups: Word Pair Feature Matching by Dynamic Time Warping". Technology Partnerships for Crossing the Language Barrier: Proceedings of the 1st Conference of the Association for Machine Translation in the Americas, 81-88. Columbia, Maryland.
Fung, Pascale & Kenneth Ward Church. 1994. "K-vec: A New Approach for Aligning Parallel Texts". Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), vol. II, 1096-1101. Kyoto, Japan.
Gale, William A. & Kenneth W. Church. 1991. "A Program for Aligning Sentences in Bilingual Corpora". Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL'91), 177-184. Berkeley, Calif.
Jones, Daniel & Melina Alexa. 1994. "Towards Automatically Aligning German Compounds with English Word Groups in an Example-based Translation System". International Conference on New Methods in Language Processing, 66-71. Manchester, U.K. (To appear in New Methods in Language Processing ed. by Daniel Jones & Harold Somers. London: University College Press.)
Kay, Martin & Martin Röscheisen. 1993. "Text Translation Alignment". Computational Linguistics 19:1.121-142.
An HMM Part-of-Speech Tagger for Korean with Wordphrasal Relations

JUNG H. SHIN,* YOUNG S. HAN** & KEY-SUN CHOI**
*Korea R&D Information Center, KIST
**Korea Advanced Institute of Science and Technology

Abstract

This paper describes a Korean tagger that takes into account the type of wordphrases for more accurate tagging. Because Korean sentences consist of wordphrases that contain one or more morphemes, Korean tagging must be posed differently from English tagging. We introduce a hidden Markov model that closely reflects the natural structure of Korean. Wordphrases contain more syntactic information, such as case role, than words in English; consequently the wordphrasal information makes better predictions, leading to higher tagging accuracy. The suggested tagging model was trained on 476,090 wordphrases and tested on 10,702 wordphrases. The experiments show that the new model can tag Korean text with 96.18% accuracy, which is 0.38% higher than the English tagging method.

1 Introduction
The problem of determining part-of-speech categories for words can be transformed into the problem of deciding which states a Markov process went through during its generation of a sentence. A category of a word usually corresponds to a state (Charniak et al. 1993). In the last few years tagging systems based on the hidden Markov model (HMM) have produced reasonably accurate results in English and other Indo-European languages (Charniak et al. 1993; Jelinek & Mercer 1980; Kupiec 1992; Merialdo 1994). A sentence in Indo-European languages is composed of words. A Korean sentence consists of wordphrases, and a wordphrase is a combination of morphemes. The patterns of morphemes that form a wordphrase are diverse, and the relationship between two morphemes can differ in usage patterns according to the wordphrase they belong to. Given these differences between Korean and English, it is not best to apply the HMM to Korean exactly as it is used to model English. In this paper, we propose an HMM-based tagging method that captures the wordphrasal
information of Korean sentences. The proposed model makes use of both wordphrasal relations and morpheme relations of each wordphrase. The proposed method requires that the types of wordphrases be known before constructing an HMM. To classify the patterns of wordphrases, we first devised an algorithm to extract the categories of the wordphrase from tagged text. Once a network skeleton for each wordphrase is drawn, a complete HMM is derived by combining the wordphrasal networks. Section 2.1 outlines the characteristics of Korean. In Section 2.2, we describe the proposed model and the method to construct it. In Section 3, experimental results on the improvement of accuracy by the proposed methods are described. In Section 4, we discuss the significance and the limitations of our work.

2 Wordphrase-based Hidden Markov Model
Numerous variations and extensions of hidden Markov models are reported in the literature, but few works are known on designing HMMs for agglutinative languages such as Korean. In the following, we first discuss the characteristics of Korean that motivate the proposed design method of the HMM. After the proposed method is introduced, it is shown by means of experiments that our method outperforms other approaches.

2.1 Characteristics of Korean
Unlike English and other Indo-European languages, a Korean sentence is not just a sequence of words, but a sequence of wordphrases. A wordphrase is composed of content morphemes and one or more function morphemes, though the function morphemes are often omitted. Function morphemes are usually placed after content morphemes. The function morphemes play a richer role in the sentence than the function words that indicate, for example, number or person in English sentences. The notable difference is that function morphemes make explicit the role of their content morphemes in the sentence (Nam 1985). The roles deliver information on deep cases as well as syntactic cases. Because there can be more than one segmentation of a wordphrase, the number of morphemes and their corresponding parts of speech can also differ across the ambiguous segmentations. Namely, the same wordphrase can be subdivided into different forms and categories with different numbers of morphemes. For example, 'kamkinun' is analysed into two patterns,
'kamki' + 'nun' and 'kam' + 'ki' + 'nun'. This makes morphological analysis and automatic tagging particularly difficult in Korean (Lee et al. 1994). The wordphrases are complex enough to require a fairly lengthy grammar to generate them, but it turned out that notable patterns of wordphrases could be identified. The dependency among wordphrases may be summarised into patterns. Contrary to English phrases, which are hard to separate from the sentence, Korean wordphrases are easily identified because blanks are the delimiters. The information of phrasal dependency, which is also more transparent in Korean than in English, should contribute to the accuracy of the hidden Markov model. In the following, a method to design a hidden Markov model that makes use of the phrasal dependency is introduced.

2.2 Construction of Hidden Markov Model
Automatic tagging for Korean using hidden Markov models has been pursued in two directions. In one approach, the Markov network is applied in the same manner as is done for English: the network states represent tags, but no phrasal dependency is taken into account. Typical works in this direction are found in Lim et al. (1994) and Lee et al. (1994). In the other approach, the network is a graph of phrasal dependencies, but morpheme level dependencies are not considered (Lee et al. 1993). These two methods each lack the information that the other has. Our proposal combines the two methods to achieve better precision using both morpheme and phrasal dependencies. Figure 1 shows the steps to design a hidden Markov model in the proposed method. The first step is to extract wordphrase patterns from a sample of texts. For each wordphrase pattern a morpheme level Markov network is constructed from observation of the sample texts, and the co-occurrence dependencies between wordphrase patterns can be obtained at the same time. The co-occurrence dependencies are made into a graph such as the one in Figure 4. A wordphrase pattern is denoted by two symbols, of which the first indicates the type of content morphemes and the second represents the type of function morphemes. Table 1 shows the symbols used in our testing. The tags of content and function morphemes are highly simplified, giving a minimal set of symbols so that the number of wordphrase patterns may be minimised. More wordphrase patterns mean a larger network, which requires a larger corpus to train and a longer time to run. From the tag sets in Table 1, 48 different wordphrase patterns can be
442
JUNG H. SHIN, YOUNG S. HAN & KEY-SUN CHOI
Fig. 1: Designing a hidden Markov model using morpheme and wordphrase relations composed. Based on the Korean standard grammar and examination of sample texts, we defined 32 wordphrase patterns. Figure 2 shows an analysis of an example sentence in morpheme and wordphrase tags. Figure 3 illustrates a typical Markov model based on morpheme level de pendencies. The composition of wordphrase networks and inter-wordphrase network such as in Figure 4 is shown in Figure 5. Comparing Figures 3, 4, and 5, we can easily conclude that the network in Figure 5 will have the largest discriminating power of the three methods illustrated in the figures. One shortcoming of the combined method is that the size of network is also the largest, and this implies that bigger training corpus is needed to achieve the same level of estimation accuracy. The deterioration of tagging speed and the increased network, however, should not be critical since Viterbi algorithm runs relatively fast to find optimal word sequence. In the final network as in the Figure 5, each state is assigned with a composite tag (wordphrase tag, morpheme tag). Let t denote a composite tag. If lexical table is defined at each state, the following defines an auto matic tagging algorithm where T(w) is an optimal tag sequence for given sentence w.
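The composition of candidate patterns can be sketched as follows (an illustrative snippet of ours, not the authors' code; the tag symbols are read from Table 1, where the reproduction is partly unclear, so the exact letters are assumptions, while the count 6 x 8 = 48 follows the text):

```python
from itertools import product

# Content-word and function-word tag symbols as read from Table 1
# (some function-tag letters are assumptions due to unclear reproduction).
CONTENT_TAGS = ["N", "P", "A", "M", "I", "S"]   # Nominals ... Symbols
FUNCTION_TAGS = ["S", "A", "O", "M", "Y", "X", "C", "F"]

# A wordphrase pattern pairs one content symbol with one function symbol,
# e.g. "NS" = nominal + subjective particle, "PF" = predicate + sentence ending.
patterns = [c + f for c, f in product(CONTENT_TAGS, FUNCTION_TAGS)]

print(len(patterns))  # 6 x 8 = 48 candidate patterns; the authors kept 32
```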
AN HMM POS TAGGER FOR KOREAN
Example sentence:
    pyenhwauy soktoka maywu ppalumul alkey twiesta.
    (One came to know the speed of change was very rapid.)

Morpheme tagging:
    pyenhwa/Noun  uy/Adnominal-Particle
    sokto/Noun  ka/Subjective-Particle
    maywu/Adverb
    ppalu/Verb  m/Nominalising-ending  ul/Objective-Particle
    al/Verb  key/Auxiliary-connective-ending
    twi/Auxiliary-verb  ess/Post-final-ending  ta/Sentence-final-ending

Wordphrase tagging:
    pyenhwauy/NM  soktoka/NS  maywu/A  ppalumul/PO  alkey/PC  twiesta/PF

Fig. 2: Sentence analysis in morpheme and wordphrase tags
CONTENT WORD TAG    DESCRIPTION
N                   Nominals
P                   Predicates
A                   Adverbials
M                   Adnominals
I                   Interjections
S                   Symbols

FUNCTION WORD TAG   DESCRIPTION
S                   Subjective Particle
A                   Adverbial Particle
O                   Objective Particle
M                   Adnominal Particle
Y                   Connective Particle
X                   Auxiliary Particle
C                   Connective Ending
F                   Sentence Ending

Table 1: Tag symbols of content and function words
Fig. 3: Morpheme level Markov model
By integrating wordphrase structure into the HMM network, we use not only the category of a word but also the case of its wordphrase to select the POS tag. The additional information results in increased accuracy. For example, 'casin' has two interpretations: 'casin' (common noun), which means 'self-confidence', and 'casin' (pronoun), meaning 'oneself'. Because these two categories have similar usage distributions conditioned by other categories, a conventional bigram model will not discriminate them. With wordphrase types in consideration, we find that 'casin' (common noun) is more often used in NO, while 'casin' (pronoun) is more likely used in NM and NS. Furthermore, some categories have different usage patterns according to the wordphrases they are attached to. Indeed, such discriminations are reflected in the trained probabilities, where we find that P(casin | NM, pronoun) is 0.096180, while P(casin | NM, common noun) is 0.009506.

Fig. 4: Wordphrase level Markov model

Fig. 5: Markov model with morpheme and wordphrasal relations

The particle 'wa', which has the duplicate role of conjunctive and auxiliary case, is another example which can be resolved by considering the wordphrase case. Let us consider the case 'kunye' (pronoun, "her") + 'wa' (auxiliary, "with") 'kyelhon' (action common noun, "marriage") + 'ha' (verb-derived suffix, "do") + 'ta' (final ending). Because 'wa' (conjunctive) has a higher likelihood with a noun compared to that of 'wa' (auxiliary particle), 'wa' (conjunctive) is selected incorrectly when only POS tag relations are used. However, when we consider the wordphrase case of 'kyelhon' (action common noun, "marriage") as PF, we find that 'wa' (auxiliary, "with") has a higher likelihood with PF compared to 'wa' (conjunctive).

As another measure for more accurate tagging, we extended the lexical probability. By defining lexical tables at each edge, we can extend the depth of dependency of the lexical probability such that the occurrence of a word is conditioned by the previous tag as well as the current one. The extension from Equation 1 gives the following algorithm:
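The equations themselves did not survive reproduction here. On a standard bigram HMM reading of the surrounding text (a reconstruction, not the original typesetting), Equation 1 and its extension would take roughly the form:

```latex
% Equation 1 (bigram tagging, reconstructed): t_i are the composite
% (wordphrase tag, morpheme tag) states, w_i the words of the sentence w.
T(w) = \operatorname*{argmax}_{t_1 \ldots t_n} \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i)

% Extended lexical probability (reconstructed): the word is conditioned
% on the previous tag as well as the current one.
T(w) = \operatorname*{argmax}_{t_1 \ldots t_n} \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \, P(w_i \mid t_{i-1}, t_i)
```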
Much larger training texts will be needed to estimate P(w | t_{i-1}). To deal with the data sparseness, we used the well-known interpolation method of Equation 3. The interpolation coefficient λ is computed using the deleted interpolation algorithm (Rabiner 1989; Jelinek & Mercer 1980).
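The deleted-interpolation idea can be sketched as follows (an illustrative reconstruction; the function names and toy counts are ours, not the authors' code): on held-out data, the mixing weight credits whichever distribution, the context-conditioned estimate or its back-off, better predicts each observed word.

```python
from collections import Counter

def estimate_lambda(train_pairs, heldout_pairs):
    """Deleted-interpolation-style estimate of the weight lambda that mixes
    a tag-conditioned lexical estimate P(w | t) with its back-off P(w).
    A sketch in the spirit of the Jelinek & Mercer (1980) scheme cited."""
    pair_counts = Counter(train_pairs)
    tag_counts = Counter(t for t, _ in train_pairs)
    word_counts = Counter(w for _, w in train_pairs)
    total = sum(word_counts.values())

    specific_votes = backoff_votes = 0
    for t, w in heldout_pairs:
        p_specific = pair_counts[(t, w)] / tag_counts[t] if tag_counts[t] else 0.0
        p_backoff = word_counts[w] / total
        # credit the better-predicting distribution with the observed count
        if p_specific >= p_backoff:
            specific_votes += pair_counts[(t, w)]
        else:
            backoff_votes += word_counts[w]
    return specific_votes / (specific_votes + backoff_votes)

def interpolated(t, w, lam, train_pairs):
    """Smoothed lexical probability in the spirit of Equation 3."""
    pair_counts = Counter(train_pairs)
    tag_counts = Counter(tt for tt, _ in train_pairs)
    word_counts = Counter(ww for _, ww in train_pairs)
    total = sum(word_counts.values())
    p_specific = pair_counts[(t, w)] / tag_counts[t] if tag_counts[t] else 0.0
    return lam * p_specific + (1 - lam) * word_counts[w] / total
```

On real data the coefficient settles between the extremes; Table 2 reports values from 0.29 to 0.87 depending on model and training size.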
Many ambiguous morphemes can be further handled by using the category of the preceding morphemes. In particular, when a morpheme is analysed into two forms that have the same category, they depend only on the lexical probability. 'cwun' is such a case: 'cwun' is analysed into 'cwuta' (verb, "to give") + 'n' (adnominal ending, TENSE) and 'cwulta' (verb, "to reduce") + 'n' (adnominal ending, TENSE). As both 'cwuta' and 'cwulta' are verbs, they have the same transition probability. However, when we consider the preceding particles, the discrimination power can be enhanced. For example, adverbial and objective particles are more likely to take place before 'cwuta', while 'cwulta' often follows a subjective particle. From the trained model, we find that P('cwuta' | jca, pv) is 0.003882 and P('cwulta' | jca, pv) is 0.000162. Consequently, considering the relation with the category of the preceding morpheme is useful to discriminate categories.

3 Experiments
The goal of the experiment is to find how much improvement in tagging accuracy is achieved by the proposed method compared to morpheme-level hidden Markov models. We discuss possible extensions of the HMM and compare their experimental results for different sizes of training data.

3.1 Test data
For training and testing the models, we have used the KIBS¹ tagged corpus (1995). We divided the tagged corpus into three parts.

¹ The KIBS (Korea Information Base System) project aims at constructing resources for Korean language processing, including a treebank, a tagged corpus and a raw corpus, and at developing analysis tools for Korean.
Training Data    Model 2    Model 3    Model 4
475,090          0.82       0.86       0.35
237,548          0.75       0.87       0.32
119,772          0.71       0.78       0.31
59,889           0.64       0.74       0.29

Model 1: Bigram model of morphemes
Model 2: Trigram model of morphemes
Model 3: Model of morphemes and wordphrase relations
Model 4: Model of morphemes and wordphrase relations with extended lexical probability

Table 2: Interpolation coefficients
• a set of 476,090 tagged wordphrases, the training data, which is used to build our models.
• a set of 10,698 tagged wordphrases, which is used to estimate the interpolation coefficient.
• a set of 10,702 tagged wordphrases, the test data, which is used to test the models.

Tagging Korean texts is necessarily preceded by some depth of morphological analysis for the simplification of the dictionaries of the hidden Markov model. Many words are used in deeply inflected forms, and non-trivial morphological rules are often needed to recover them. This makes it unreliable to evaluate tagging models on completely new texts, since the tagging critically depends on the quality of the morphological analysis. To avoid noise caused by faulty morphological analysis, we excluded from the test data the sentences that do not contain valid analysis candidates. In other words, the recall of morphological analysis in the test is set to 100%. The trained hidden Markov network reflecting both morpheme and wordphrase relations contains 712 nodes and 28,553 edges. The number of part-of-speech tags is 52, and the average ambiguity of each wordphrase is 5.06.

3.2 Results
The experiments consist of comparisons of five tagging models. We adopt a bigram model of morphemes as an initial model. It is extended to a trigram model, which is generally used in practical English taggers (Merialdo 1994). To minimise the effect of data sparseness, we interpolate trigram distributions with bigram distributions as shown in Equation 4.
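Equation 4 itself did not survive reproduction; a plausible reconstruction of the trigram-bigram interpolation described here is:

```latex
% Reconstructed sketch of Equation 4: \hat{P} are relative-frequency
% estimates and \lambda is the coefficient reported in Table 2.
P(t_i \mid t_{i-2}, t_{i-1}) \approx
    \lambda \, \hat{P}(t_i \mid t_{i-2}, t_{i-1})
    + (1 - \lambda) \, \hat{P}(t_i \mid t_{i-1})
```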
Fig. 6: Comparison of tagging accuracy with increase of training corpus
The interpolation coefficients are summarised in Table 2. The coefficient of Model 2 (trigram) is defined in Equation 4 and the other models use the coefficient defined in Equation 3. As the size of the training data increases, the coefficients tend to give stronger support to the original model parameters. If we assume that the degree of training of each model is proportional to the size of its interpolation coefficient, Model 4 has more room for improvement with the increase of the training corpus. In the case of a small training corpus, below 200,000 wordphrases, Model 3 (with morpheme and wordphrase relations) achieved the highest accuracy. As the training data increases, our proposed method, Model 4 in Table 2, outperformed the other methods. As shown in Figure 6, the proposed model excels the popular morpheme-based model by more than 0.53%. The wordphrase model, that is, our method short of the extension of the lexical probability, gave more accurate results than the bigram and trigram models over all data sizes. This implies that the model is insensitive to training data size despite the increased network size. Thus, the extension of the lexical tables must be the source of the data sparseness of Model 4 with smaller training data.
4 Conclusions
We proposed a Korean tagging model that takes wordphrasal relations as a backbone and extends the lexical probability. As a result, our model gives rise to an increase in network size, but with higher accuracy. Another merit of the proposed method is that the whole process, including extracting wordphrases and constructing a network, is executed automatically without human intervention. With a larger corpus our model is expected to perform even better as the network saturates. This paper introduced an important issue that may be fundamental to more elaborate taggers for Korean or other similar languages.

REFERENCES

Charniak, Eugene, Curtis Hendrickson, Neil Jacobson & Mike Perkowits. 1993. "Equations for Part-of-Speech Tagging". Proceedings of the National Conference on Artificial Intelligence, 784-789. Menlo Park, Calif.: MIT Press.

Jelinek, Frederick & Robert L. Mercer. 1980. "Interpolated Estimation of Markov Source Parameters from Sparse Data". Proceedings of the Workshop on Pattern Recognition in Practice, 381-397.

Kupiec, Julian. 1992. "Robust Part-of-Speech Tagging Using a Hidden Markov Model". Computer Speech and Language 6:1.225-242.

Lee, Sang H., Jae H. Kim, Jung M. Cho & Jung Y. Seo. 1995. "Korean Morphological Analysis Sharing Partial Analyses". Proceedings of the International Conference on Computer Processing of Oriental Language, 164-173. Hawaii, U.S.A.

Lee, Wun J., Key-Sun Choi & Gil C. Kim. 1993. "Design and Implementation of an Automatic Tagging System for Korean Texts". Proceedings of the 20th Spring Conference of the Korean Information Science Society, 805-808. Seoul, Korea. [In Korean.]

Lim, Chul S. 1994. A Korean Part-of-Speech Tagging System Using Hidden Markov Model. M.Sc. thesis. KAIST, Taejon, Korea. [In Korean.]

Merialdo, Bernard. 1994. "Tagging English Text with a Probabilistic Model". Computational Linguistics 20:2.155-168.

Nam, Key S. 1986. Grammar for Standard Korean. Seoul, Korea: Top Press. [In Korean.]

Rabiner, Lawrence R. 1990. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition". Readings in Speech Recognition ed. by Alex Waibel & K. Lee, 267-296. San Mateo, Calif.: Morgan Kaufmann.
A Multimodal Environment for Telecommunication Specifications

IVAN BRETAN*, MÅNS ENGSTEDT** & BJÖRN GAMBÄCK*

* Telia Research AB, **Ericsson Telecommunication Systems Lab., * Computerlinguistik, Universität des Saarlandes

Abstract

This is a description of the rationale and basic technical framework underlying VINST, a Visual and Natural language Specification Tool. The system is intended for interactive specification of the functional behaviour of telecommunication services by users not possessing in-depth technical knowledge of telecommunication systems. In order to obtain the desired level of abstraction needed to accomplish this functionality, a natural language component has been integrated with a visual language interface based on a finite-state automata metaphor. The multimodal specification produced by VINST is translated to an underlying formal specification language that is further refined or transformed in a number of steps in the design process. Furthermore, the specification is validated by means of simulation and paraphrasing. Both the specification phase and the validation phase are carried out in the same multimodal environment. This integration of modalities provides for synergy effects including complementary expressiveness and cross-modal paraphrasing.

1 Introduction
One of the initial phases in the specification of a new telecommunication system is that of requirements engineering. The required system is normally described informally by the customer in a requirement specification consisting of informal text and figures, which is handed over to a design department. The major and obvious drawback of this process is that it involves a step of manual 'compilation' to telecommunication-oriented implementation or specification languages. This compilation or interpretation will have to be undertaken by a technical specialist who will have to take great care to realise the intentions of the customer through several cycles of implementation, validation and verification while tackling ambiguity, inconsistency and incompleteness in the informal specification. The VINST research project, as carried out by Ericsson Telecom and partners, aimed at involving the customer in an actual formal specification
process, replacing the informal requirement specification with a computerised tool which provides support for describing telecommunication systems in a constrained and rigorous manner. The result of using this tool is in principle directly compilable into a formal specification for a telephone service, although the customer will generally only describe parts of the functionality of the entire system. However, specification of these parts may be critical and very time-consuming, and may vary extensively from customer to customer. After user studies of prototype specification systems, it was decided that a multimodal environment would be most appropriate for this task. As shown in Figure 1, the two main modalities of the user interface would be natural language (NL — initially only keyboard input) and a visual language (VL) using icons and finite-state automata metaphors to describe the dynamic behaviour of the system. Although, to our knowledge, no other system with a similar integrated multimodal architecture for specification of telecom services exists, other tools have been designed with similar goals. VISIONNAIRE (Henjum & Clarisse 1991) supports formalised requirements engineering for telecommunication applications using natural language, visual programming and animation. WATSON (Kelly & Nonnenman 1987) is also used for formal specification of telecom systems from natural language scenarios, while PLANDoc (McKeown et al. 1994) is used for generation of natural language paraphrasal text from telephone route planning descriptions.

2 System overview
VINST is a tool which operates in at least three distinct modes:
1. Specification of static properties in a conceptual schema
2. Specification of dynamic properties in rules
3. Validation of specifications
Here we will be mostly concerned with (2), which in some sense is the central mode, since (1) is a task which provides the general setting or constraints for dynamic specifications (where one specification of static properties can be common to many different dynamic specifications) and (3) is intended for validating (1) and (2). Validation of a specification is carried out by simulation of the rules using manually triggered events, the conceptual schema and a description of the initial state of the world. A complementary way of validating a specification is by cross-modal paraphrasing (see below). The tasks are normally performed in the above order.
Fig. 1: Components and information flow in VINST

The goal of interacting with VINST is to produce a specification in Delphi (Höök 1993), a language dedicated to the formal description of the functional behaviour of telecommunication systems. It is a declarative language based on first-order predicate logic and Entity-Relationship theory with a discrete model of time, where the dynamic specification is made up of a set of rules consisting of an event, pre- and post-conditions, and where the static part consists of a set of axioms and a conceptual schema. As can be seen in Figure 1, in order to produce Delphi rules, the VINST user can work with either the Natural Language (NL) modality or the Visual Language (VL) modality, which will both give rise to the same kind of internal representations in a language which we will simply refer to as IL (Internal Language), which is modelled relatively closely on Delphi. IL is basically an abstract syntax tree, either represented in Prolog (ILP) or in Smalltalk (ILS).
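As a sketch of this rule shape (hypothetical code of ours, not part of VINST; the DTL surface syntax is modelled on the translation example in Section 7), an IL-like abstract syntax tree for one dynamic rule and its rendering to concrete syntax might look like:

```python
from dataclasses import dataclass

@dataclass
class Rule:
    """An IL-style abstract syntax tree for one Delphi rule: an event,
    pre-conditions (valid at time T) and post-conditions (valid at T + 1).
    Field names are illustrative inventions."""
    event: str
    preconditions: list
    postconditions: list

    def to_dtl(self) -> str:
        # Render the abstract tree to a DTL-like concrete syntax
        # (the back-end mapping described in Section 6).
        return (
            f"WHEN {self.event} IS DETECTED\n"
            f"IF {' AND '.join(self.preconditions)}\n"
            f"CONCLUDE {' AND '.join(self.postconditions)};\n"
            "END;"
        )

# The 'idle subscriber goes offhook' rule from the Basic Call example:
rule = Rule(
    event="offhook(A)",
    preconditions=["subscriber(A)", "idle(A)", "onhook(A)"],
    postconditions=["dialtone(A)", "offhook(A)"],
)
print(rule.to_dtl())
```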
how to present the representation visually as icons and automata. VIL and NIL can thus be seen as supersets of IL. These translations meet in the common internal representation language IL, not containing any modality-specific information, which is finally translated to a Delphi representation called Delphi Textual Language (DTL). The information contained in NIL and VIL expressions that is not part of IL has to be added by the generation processes. In Figure 1, all arrows showing translations are bidirectional, indicating the possibility of translating in any direction between the representation languages VL, NL and Delphi. The user creates the specification in either modality, upon which translation via IL to Delphi takes place, as well as translation into the alternate modality, i.e., paraphrasing. The guiding principle for paraphrasing is to give control to the user. Thus, when a visual expression is constructed, it can be translated into natural language upon request. Likewise, when associating natural language fragments with different parts of the specification, they can optionally be paraphrased or even replaced by their visual counterparts. The visual representation is canonical in some sense, reflecting both the system's view of the world (in terms of automata) and relevant discourse objects. It might be argued that the natural language part of the system does not need to be very sophisticated, since the VL part will have relatively high expressivity with respect to the task. However, in order for NL paraphrasing of VL expressions to be generally applicable, and also for providing a complete and predictable (the term habitable is sometimes used) linguistic coverage, a large-scale NLP component is highly desirable also for a system such as VINST.
Although the existing initial VINST prototype makes use of a less extensive NLP component, an experiment to integrate VINST and a natural language processor for English based on the SRI Core Language Engine (CLE — Alshawi ed., 1992 — described further in Section 6 below) was carried out together with SRI.1 The envisaged complete VINST system (as outlined in this paper) would of course allow the user to make more extensive use of the NL modality.
1 Most of the adaptation of the CLE to the VINST domain, including conversion from its internal logical format to ALF, was carried out by Manny Rayner and Richard Crouch, SRI International, Cambridge, England.
Fig. 2: A VINST automaton and NL rule
3 Automata and rules
In the NL part of the system, rules can be formulated using conditionals which mirror the structure of the target Delphi rule, consisting of an event (for instance triggered by an action performed by the user of the service), a number of conditions (pre-conditions) and conclusions (post-conditions). In the visual language part, the same type of rule can be created using icons and visual counterparts of states and transitions in a finite-state automaton. In Figure 2, a two-state automaton representing two Delphi rules for a part of the service Basic Call is shown together with a rule formulated in NL which corresponds to the upper transition from the state idle to the state ready for digits. As can be seen from the figure, an automaton state can correspond both to pre- and post-conditional parts of Delphi rules, whereas the automaton transitions correspond to events of Delphi rules. Since each transition in the automaton corresponds to a Delphi rule, the automaton in Figure 2 can be translated into two corresponding Delphi rules. The main principle behind the automata metaphor is that it conveys the
temporal flow of the service, where states correspond to moments in time and transitions are triggered by events (such as picking up the receiver) which increment the temporal counter of the system. Also, the automata metaphor provides for cyclical specifications (a state can have transitions directed both to it and from it), which normally will be less explicit in a natural language-only specification.

4 The visual language
The basic building blocks of the visual language of VINST are icons that are used together with the graphical rendering of states and transitions. The icons, or the visual vocabulary, can be seen as part of the conceptual schema, and visualise different primitive or derived concepts of the domain. Some icons are parameterised, where the parameters normally correspond to the predicate-argument structure of a lexical entry. For instance, the icon corresponding to the entry for the verb "dial" will normally have two initially empty slots, which can be filled with other symbols representing a subscriber and a telephone number, for the subject and direct object respectively. Other icons can be seen as representing properties, and can consequently be added as visual annotations to a main icon representing a principal object. A one-to-one correspondence between visual symbols and lexical entries is not a necessary property of the system. In fact, due to the specificity of pictures, one icon (such as a telephone) can express arbitrarily complex states-of-affairs (for instance that there is an idle subscriber who is 'onhook'). Conversely, since one of the major points of a multimodal system is complementary expressiveness, certain VL expressions will be significantly less direct to formulate than their NL counterparts, or in some cases even impossible. The visual language of VINST does not have general mechanisms to support the formulation of quantification, disjunction and negation, for which the user is deferred to the natural language component. The visual language could of course be enhanced to handle a more complex logic, probably at the expense of intuitiveness.

5 Modality synergy
As observed by, for instance, Cohen et al. (1989), direct manipulation based languages provide a controlled and guided interaction style complementary to that of natural language interfaces, while support for the type of complex compositional semantics that NL interfaces normally exhibit generally stretches the limits of what a visual language can express without attaining the same degree of difficulty as a high-level programming language. In addition to the cases where natural language simply is more convenient, such as when expressing complex quantification, the need for modality integration showed up in user studies when the VL symbol library did not contain icons which matched exactly what the user wanted to express, which could be due to problems of specificity or just suboptimal icon design. We can envision users switching from VL to NL when the former way of expressing services is judged to be too blunt (due to the specificity problem) or to involve too many interactive steps.

Another significant observation from these studies is the increased understanding of the specification created when it was paraphrased in the alternate, non-input modality. When paraphrasing the VL specification in natural language, a 'search-light' effect is obtained, where the opaque linguistic coverage of the NL component is illuminated and both domain-specific vocabulary and grammatical preferences are revealed. There is ample evidence (as reported by Karlgren 1992, among others) that such system-generated language will be picked up by the user and recycled in the continued dialogue, increasing the efficiency of the interaction.

The notion of synergy, where access to several modalities gives a functionality which cannot be obtained through using only one of them, can be taken one step further if more integrated mixing of modalities is considered. This is realised in VINST with the support for placing NL fragments in different parts of a visually specified automaton in order to 'tag' these fragments with respect to a point in time in the execution of the service.

A: "A doesn't have a hotnumber but he has a redirection number."

Fig. 3: NL fragment within an automaton state
For instance, a user specifying something in NL which is cumbersome to visualise, and who places this text within a particular state, as in Figure 3, does not need to say anything about the point in time at which this happens, or about what event brought about this fact, since all this is given by the visual context.

Fig. 4: The NL architecture of VINST

6 Architecture of the NL component
The NL component of VINST is divided into two parts, a front-end and a back-end. The front-end translates NL to IL, the modality-independent Internal Language. IL can in turn be translated either to VL or to Delphi, or be used as the starting point for NL generation. IL to VL translation typically requires generation of layout information for the resulting visual description, a difficult problem whose discussion is out of the scope of the present paper. The main functionality of the NL front-end is to translate, in both directions, between natural language and IL via the intermediate representation ALF (Application-specific Logical Form). The back-end translates IL to Delphi Textual Language, DTL. This separation of the system into a front-end and a back-end minimises changes in the system caused by changes in VL or in Delphi. The front-end and the back-end consist of a number of sub-processing steps, as shown in Figure 4. The first step involves different types of word-level processes, such as tokenisation, lexical analysis, and inflectional morphology. The output from the morphological component is a lattice, containing all possible sequences of inflected words.

This is used by the syntactic parser, an LR-parser which works with unification-based grammar rules to produce an implicit parse tree (derivation tree with rule annotations). Semantic rules are applied to this tree, resulting in one or several pseudo-logical representations in a format known as Quasi Logical Forms (QLF — Alshawi & van Eijck 1989). A QLF carries different amounts of informational content in different stages of processing. In the interpretation step, the QLFs undergo scoping and reference resolution. Scoping is here taken to mean the mapping from determiners, modals and certain adverbs to quantifiers or operators and determining their scope using linguistically motivated scope preferences. The resolution stage deals with referring expressions, elliptic phrases, and semantically vague relations (such as the ones derived from "have" and "is"). The final step in the linguistic analysis chain involves mapping the scoped, resolved QLF into the application-specific logical format, ALF, a fairly standard extension of first-order predicate logic with some higher-order operators. This is done by means of a machinery operating with declarative rewrite rules in a process such as Abductive Equivalential Translation (AET — Rayner 1993). This process translates a resolved QLF into an ALF according to a domain theory which describes equivalences between logical formulae containing linguistic predicates and formulae containing Delphi-related predications.
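The staged front-end can be pictured as a composition of translation steps. The stubs below are a toy sketch of ours, not VINST code; the real stages are the CLE modules just described:

```python
from functools import reduce

# Stub stages standing in for the CLE front-end modules; each one consumes
# the previous stage's output. Real implementations are far richer.
def tokenise(text):     return text.lower().split()
def morphology(tokens): return [(tok, "infl?") for tok in tokens]  # word lattice
def parse(lattice):     return ("tree", lattice)                   # implicit parse tree
def semantics(tree):    return ("qlf", tree)                       # Quasi Logical Form
def interpret(qlf):     return ("resolved_qlf", qlf)               # scoping + resolution
def to_alf(resolved):   return ("alf", resolved)                   # rewrite to ALF

PIPELINE = [tokenise, morphology, parse, semantics, interpret, to_alf]

def front_end(text):
    """Run the NL -> ALF direction of the front-end as a left fold."""
    return reduce(lambda data, stage: stage(data), PIPELINE, text)

result = front_end("The subscriber makes offhook")
```

The reverse (generation) direction would, as the text notes, reuse most of these stages in reverse where the indeterminism allows.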
This can be seen as a mapping from abstract to concrete syntax; however, as noted above (see Section 2), the internal language is at present mainly a notational variant of Delphi, with rather uninteresting variations depending on the actual pro gramming language (Prolog or Smalltalk, for NL or VL, respectively) it is represented in. Going in the other direction of processing, most modules are reversible, although some cannot be used as such for practical purposes due to the high amount of indeterminism involved; for example, the first step could probably make use of rules derived from machine-learning techniques to map ALFs into 'standard' QLF fragments. Since the grammar formalism of the CLE is fully reversible, the final
460
IVAN BRETAN, MÅNS ENGSTEDT & BJÖRN GAMBÄCK
step would on the other hand just involve specifying which of the (analysis) grammar rules that should be allowed to be invoked by the generator. These would then by compiled in a way distinct from the format used by the parser. 7
A simple translation example
As mentioned, Delphi rules consist of an event, a number of conditions and conclusions. The informal semantics of a rule is that if an event occurs at a point of time T and the conditions are valid at the same point of time T, then the conclusions will be valid at the next point of time T + 1. A Delphi rule can be described as one conditional NL sentence with an if-part and a then-part as shown in Figure 2, although there are of course many possible alternative ways of conveying the same information, e.g., as a small discourse:

An idle subscriber is onhook. The subscriber makes offhook. He becomes offhook and gets dialtone.

The first sentence contains some pre-conditions and the second the event of the rule. The third sentence contains the post-conditions, or the conclusions, of the corresponding Delphi rule. This discourse refers to a number of concepts such as subscriber, on-hook and idle. These concepts, their lexical realisations, their relations to other concepts, to visual symbols and to Delphi expressions must already have been defined in the domain model (which includes both a conceptual schema and a domain theory). From the three sentences above, the following ALF formula is generated (with the uppercase letters being variables of the standard Prolog type, while the t annotations indicate the time points of the different events, states and conditions):

exists([A,B,C,D,E,F],
    cond(subscriber,A,[t=T]),
    state(be_idle,B,A,[t=T]),
    state(be_onhook,C,A,[t=T]),
    event(make_offhook,D,A,[t=T]),
    cond(get,E,A,dialtone,[t=T+1]),
    state(be_offhook,F,A,[t=T+1])
)
This formula is obtained by using (conditional) AET equivalences of the following type relating linguistic predicates and target Delphi predicates:
A MULTIMODAL TELECOM SPECIFICATION ENVIRONMENT
461
    make_offhook(Event, Person) <->
        event(make_offhook, Event, Person) <- subscriber(Person).
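Informally, such a conditional equivalence licenses a rewrite of a linguistic literal into its Delphi counterpart whenever the condition holds among the known facts. The rewriting step can be sketched roughly as follows — a toy illustration with invented names, not Rayner's abductive machinery nor the CLE implementation:

```python
# Toy sketch of conditional-equivalence rewriting. Literals are tuples;
# by Prolog convention, capitalised atoms ("Event", "Person") are variables.

def is_var(term):
    return isinstance(term, str) and term[:1].isupper()

def match(pattern, literal):
    """Match a pattern containing variables against a ground literal;
    return a bindings dict, or None on failure."""
    if len(pattern) != len(literal):
        return None
    bindings = {}
    for p, g in zip(pattern, literal):
        if is_var(p):
            if bindings.setdefault(p, g) != g:
                return None
        elif p != g:
            return None
    return bindings

def substitute(pattern, bindings):
    return tuple(bindings.get(t, t) for t in pattern)

# (Delphi side, linguistic side, condition) -- the equivalence above.
EQUIVALENCES = [
    (("make_offhook", "Event", "Person"),
     ("event", "make_offhook", "Event", "Person"),
     ("subscriber", "Person")),
]

def to_delphi(literal, facts):
    """Rewrite a linguistic literal when an equivalence applies and its
    instantiated condition is a known fact; otherwise leave it unchanged."""
    for delphi, linguistic, condition in EQUIVALENCES:
        bindings = match(linguistic, literal)
        if bindings is not None and substitute(condition, bindings) in facts:
            return substitute(delphi, bindings)
    return literal

# event(make_offhook, d, a) rewrites because subscriber(a) is known:
to_delphi(("event", "make_offhook", "d", "a"), {("subscriber", "a")})
```

Note that real AET also handles the other direction and reasons abductively about which conditions may be assumed; the sketch only checks conditions already known to hold.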
The ALF formula is then translated, via an IL expression not shown here, either to a visual counterpart similar to the upper transition of the automaton in Figure 2, or to the final DTL formula:

    WHEN offhook(A) IS DETECTED
    IF subscriber(A) AND idle(A) AND onhook(A)
    CONCLUDE dialtone(A) AND offhook(A);
    END;

As a test of the CLE-based VINST prototype, a small corpus for a representative telecommunication service, "Call Forward on Busy," has been collected. This service can be expressed as 15 Delphi rules, each of which was specified by four subjects by means of one or several sentences. The domain theory created contained 96 equivalences, of which 84 were lexical (translating specific word-senses). However, most of these equivalences were not specific to the service "Call Forward on Busy."

The results were quite encouraging, as the coverage obtained was sufficient to specify and validate the service in question; however, whether this indicates that the amount of work needed to actually turn VINST into a useful system is feasible is of course still an open question.
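The temporal reading of a Delphi rule used throughout this example — conditions and event valid at time T, conclusions valid at T + 1 — can be sketched in a few lines of Python. This is a toy illustration with invented names (Rule, apply_rule, the fact strings); it is not the VINST or Delphi implementation:

```python
# Toy sketch of the informal Delphi rule semantics: if the event occurs at
# time T and the conditions are valid at T, the conclusions hold at T + 1.
from dataclasses import dataclass

@dataclass
class Rule:
    event: str              # triggering event, e.g. "offhook(a)"
    conditions: frozenset   # pre-conditions that must be valid at time T
    conclusions: frozenset  # post-conditions valid at time T + 1

# "An idle subscriber is onhook. The subscriber makes offhook.
#  He becomes offhook and gets dialtone."
dialtone_rule = Rule(
    event="offhook(a)",
    conditions=frozenset({"subscriber(a)", "idle(a)", "onhook(a)"}),
    conclusions=frozenset({"dialtone(a)", "offhook(a)"}),
)

def apply_rule(rule, state_t, event):
    """Return the state at time T + 1: unchanged if the rule does not
    fire, otherwise extended with the rule's conclusions."""
    if event != rule.event or not rule.conditions <= state_t:
        return state_t
    return state_t | rule.conclusions

state = {"subscriber(a)", "idle(a)", "onhook(a)"}
state = apply_rule(dialtone_rule, state, "offhook(a)")  # gains dialtone(a), offhook(a)
```

A real simulator would also retract facts that cease to hold (e.g. onhook(a)); the sketch keeps the state monotonic for brevity.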
Conclusions
The paper has described a multimodal tool for interactive specification of telecommunication services. The tool allows the user to integrate natural and visual languages in order to produce descriptions in an underlying formal specification language. Furthermore, the specification is validated by means of simulation and paraphrasing. Both the specification phase and the validation phase are carried out in the same multimodal environment. This integration of modalities provides for synergy effects including complementary expressiveness and cross-modal paraphrasing.
Acknowledgements. We would like to thank the other members of the VINST project team, including Stefan Preifelt, Niklas Björnerstedt, Roy Clarke, Hercules Dalianis, Anders Holm, Jussi Karlgren, Erik Knudsen, Eva Lapins, Christer Samuelsson, Jonas Walles, and several others; Manny Rayner and Dick Crouch, SRI; and an anonymous referee for very substantial comments.

REFERENCES

Alshawi, Hiyan, ed. 1992. The Core Language Engine. Cambridge, Mass.: MIT Press.
Alshawi, Hiyan & Jan van Eijck. 1989. "Logical Forms in the Core Language Engine". Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics (ACL'89), 25-32. Vancouver, British Columbia, Canada.
Cohen, Philip R., M. Dalrymple, D.B. Moran, F.C.N. Pereira, J.W. Sullivan, R.A. Gargan Jr., J.L. Schlossberg & S.W. Tyler. 1989. "Synergistic Use of Direct Manipulation and Natural Language". Proceedings of Human Factors in Computing Systems (CHI'89), 227-233. Austin, Texas.
Henjum, Olaf I. & Oliver B.H. Clarisse. 1991. "Confirming Customer Expectations". Proceedings of the National Communications Forum, 657-664. Rosemont, Illinois.
Höök, Hans. 1993. Delphi — A General Description of the Language. Älvsjö, Sweden: Ellemtel Utvecklings AB.
Karlgren, Jussi. 1992. The Interaction of Discourse Modality and User Expectations in Human-Computer Dialog. Licentiate Thesis. Dept. of Computer and Systems Sciences, Univ. of Stockholm, Sweden.
Kelly, Van E. & Uwe Nonnenmann. 1987. "Inferring Formal Software Specifications from Episodic Descriptions". Proceedings of the 6th National Conference on Artificial Intelligence (AAAI'87), 127-132. Seattle, Washington.
McKeown, Kathleen, Karen Kukich & James Shaw. 1994. "Practical Issues in Automatic Documentation Generation". Proceedings of the 4th Conference on Applied Natural Language Processing (ANLP'94), 7-14. Stuttgart, Germany.
Rayner, Manny. 1993. Abductive Equivalential Translation and its Application to Database Interfacing. Ph.D. dissertation, Dept. of Computer and Systems Sciences, Univ. of Stockholm, Sweden.
List and Addresses of Contributors

Eneko Agirre, Euskal Herriko Unibertsitatea, P.K. 649, 20080 Donostia, Basque Country (Spain). [email protected]
Iñaki Alegria, Informatika Fakultatea, Euskal Herriko Unibertsitatea, P.K. 649, 20080 Donostia, Basque Country (Spain). [email protected]
Sofia Ananiadou, Centre for Computational Linguistics, UMIST, P.O. Box 88, Manchester M60 1QD, U.K. [email protected]
M. Victoria Arranz, Centre for Computational Linguistics, UMIST, P.O. Box 88, Manchester M60 1QD, U.K. [email protected]
Xabier Artola, Informatika Fakultatea, Euskal Herriko Unibertsitatea, P.K. 649, 20080 Donostia, Basque Country (Spain). [email protected]
Roberto Basili, Università di Tor Vergata, Via della Ricerca Scientifica, Roma, Italy. [email protected]
Ismail Biskri, ISHA - LALIC, 96 bld Raspail, F-75006 Paris, France. [email protected]
Christian Boitet, Université Joseph Fourier, GETA, IMAG-campus, BP 53, 150, rue de la Chimie, F-38041 Grenoble Cedex 9, France. [email protected]
Kalina Bontcheva, Linguistic Modelling Laboratory, Centre for Informatics & Computer Technology, Bulgarian Academy of Sciences, Acad. G. Bonchev Str. 25A, BG-1113 Sofia, Bulgaria. [email protected]
Ivan Bretan, Telia Research AB, S-136 80 Haninge, Sweden. [email protected]
Key-Sun Choi, Dept of Computer Science, Korea Advanced Institute of Science & Technology, Ku-Sung Dong 373-1, Yu-Sung Ku, Taejon, Korea. [email protected]
Marcel Cori, Université Paris 7, Case 7003, 2 place Jussieu, F-75251 Paris Cedex 05, France. [email protected]
Jean-Pierre Desclés, ISHA - LALIC, 96 bld Raspail, F-75006 Paris, France. [email protected]
Måns Engstedt, Ericsson Telecommunication Systems Lab., Box 1505, S-126 25 Stockholm, Sweden. [email protected]
Olivier Ferret, LIMSI-CNRS, B.P. 133, F-91403 Orsay Cedex, France. [email protected]
Michel de Fornel, EHESS, CELITH, 54 boulevard Raspail, F-75006 Paris, France.
Björn Gambäck, Computerlinguistik, Universität des Saarlandes, Postfach 151150, D-66041 Saarbrücken, Germany. [email protected]
Brigitte Grau, LIMSI-CNRS, B.P. 133, F-91403 Orsay Cedex, France. [email protected]
Udo Hahn, CLIF - Computational Linguistics Research Group, Freiburg University, Europaplatz 1, D-79085 Freiburg, Germany. [email protected]
Young S. Han, Dept of Computer Science, Korea Advanced Institute of Science and Technology, Ku-Sung Dong 373-1, Yu-Sung Ku, Taejon, Korea. [email protected]
Matthew F. Hurst, Human Communication Research Centre, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW, U.K. [email protected]
Yasushi Ishikawa, Human Media Technology Dept, Interface Technology Lab, Information Technology R&D Center, MITSUBISHI Electric Corporation, 5-1-1 Ofuna, Kamakura, Kanagawa 247, Japan. [email protected]
Akira Ito, Kansai Advanced Research Center, Communications Research Laboratory, 588-2 Iwaoka, Nishi-ku, Kobe 651-24, Japan. [email protected]
Daniel B. Jones, Centre for Computational Linguistics, UMIST, P.O. Box 88, Manchester M60 1QD, U.K. [email protected]
Aravind K. Joshi, Dept of Computer & Information Science, University of Pennsylvania, 200 S. 33rd Street, Philadelphia, PA 19104-6389, U.S.A. [email protected]
Mihoko Kitamura, Kansai Laboratory, Oki Electric Industry Co., Ltd., Cristal Tower, 1-2-27 Shiromi, Chuo, Osaka 540, Japan. [email protected]
Hideki Kozima, Kansai Advanced Research Center, Communications Research Laboratory, 588-2 Iwaoka, Nishi-ku, Kobe 651-24, Japan. [email protected]
Geert-Jan M. Kruijff, University of Twente, P.O. Box 217, NL-7500 AE Enschede, The Netherlands. [email protected]
Jean-Marie Marandin, CNRS URA 1028, Case 7003, 2 place Jussieu, F-75251 Paris Cedex 05, France. [email protected]
Yuji Matsumoto, Graduate School of Information Science, Nara Institute of Science & Technology, 8916-5 Takayama, Ikoma, Nara 630-01, Japan. [email protected]
Chris Mellish, Dept of Artificial Intelligence, University of Edinburgh, 80 South Bridge, Edinburgh EH1 1HN, U.K. [email protected]
Ruslan Mitkov, School of Languages & European Studies, University of Wolverhampton, Stafford Street, Wolverhampton WV1 1SB, U.K. [email protected]
Akito Nagai, Media Technology Dept, Interface Technology Lab, Information Technology R&D Center, MITSUBISHI Electric Corporation, 5-1-1 Ofuna, Kamakura, Kanagawa 247, Japan. [email protected]
Kunio Nakajima, Human Media Technology Dept, Interface Technology Lab, Information Technology R&D Center, MITSUBISHI Electric Corporation, 5-1-1 Ofuna, Kamakura, Kanagawa 247, Japan. [email protected]
Nicolas Nicolov, Dept of Artificial Intelligence, University of Edinburgh, 80 South Bridge, Edinburgh EH1 1HN, U.K. [email protected]
Tadashi Nomoto, Advanced Research Laboratory, Hitachi Ltd., 2520 Hatoyama, Saitama 350-03, Japan. [email protected]
Harris V. Papageorgiou, Institute for Language & Speech Processing (ILSP), Margari 22, 115 25 Athens, Greece. [email protected], [email protected]
Maria Teresa Pazienza, Università di Tor Vergata, Via della Ricerca Scientifica, I-00133 Roma, Italy. [email protected]
Ian Radford, Centre for Computational Linguistics, UMIST, P.O. Box 88, Manchester M60 1QD, U.K. [email protected]
Wiebke Ramm, FR 8.6 Angewandte Sprachwissenschaft sowie Übersetzen und Dolmetschen, Universität des Saarlandes, D-66041 Saarbrücken, Germany. [email protected]
Allan Ramsay, Centre for Computational Linguistics, UMIST, P.O. Box 88, Manchester M60 1QD, U.K. [email protected]
German Rigau, Universitat Politècnica de Catalunya, Pau Gargallo 5, E-08028 Barcelona, Spain. [email protected]
Graeme Ritchie, Dept of Artificial Intelligence, University of Edinburgh, 80 South Bridge, Edinburgh EH1 1HN, U.K. [email protected]
Michelangelo Della Rocca, Università di Tor Vergata, Via della Ricerca Scientifica, I-00133 Roma, Italy. [email protected]
Christer Samuelsson, FR 8.7 Computerlinguistik, Universität des Saarlandes, D-66041 Saarbrücken, Germany. [email protected]
Kepa Sarasola, Informatika Fakultatea, Euskal Herriko Unibertsitatea, P.K. 649, E-20080 Donostia, Basque Country (Spain). [email protected]
Jan Schaake, University of Twente, P.O. Box 217, NL-7500 AE Enschede, The Netherlands. [email protected]
Reinhard Schäler, Dept of Computer Science, University College Dublin, Belfield, Dublin 4, Ireland. [email protected]
Jung H. Shin, Korea R&D Information Center, KIST, Yu-Sung Post Office, P.O. Box 122, Taejon, South Korea. [email protected]
Khalil Sima'an, Utrecht University, Trans 10, NL-3512 Utrecht, The Netherlands. [email protected]
Harold Somers, Centre for Computational Linguistics, UMIST, P.O. Box 88, Manchester M60 1QD, U.K. [email protected]
Michael Strube, CLIF - Computational Linguistics Research Group, Freiburg University, Europaplatz 1, D-79085 Freiburg, Germany. [email protected]
Małgorzata E. Styś, Computer Laboratory, University of Cambridge, New Museums Site, Pembroke Street, Cambridge CB2 3QG, U.K. [email protected]
Mutsuko Tomokiyo, ATR Interpreting Telecommunications, 2-2 Hikari-dai, Seika-cho, Kyoto 619-02, Japan. [email protected]
Jun-ichi Tsujii, Centre for Computational Linguistics, UMIST, P.O. Box 88, Manchester M60 1QD, U.K. [email protected]
Paola Velardi, Dipartimento di Scienza dell'Informazione, Università "La Sapienza", via Salaria 113, I-00198 Roma, Italy. [email protected]
Alex Waibel, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, U.S.A. [email protected]
Ye-Yi Wang, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, U.S.A. [email protected]
Ching-Long Yeh, Dept of Computer Science & Engineering, Tatung Institute of Technology, 40 Chungshan North Road, Section 3, Taipei 104, Taiwan. [email protected]
Stefan S. Zemke, Dept of Computer & Information Science, Linköping University, S-58183 Linköping, Sweden. [email protected]
Michael Zock, Langage & Cognition, LIMSI-CNRS, B.P. 133, F-91403 Orsay, France. [email protected]
Index of Subjects and Terms

A.
abductive equivalential translation (AET) 459
ACL-European Corpus Initiative 429
activation propagation 179
actor computation model 92
adaptive scaling 117
adjunction 4, 278
agenda-based control 282, 288
aggregation 176, 182
alignment 417, 428, 432
ambiguity 185, 187
  kernel ~ 187, 193, 196, 197, 199, 200, 202
  labelling ~ 187, 197, 201, 202
  ~ occurrence 193, 196, 202
  ~ of segmentation 199
  ~ pattern 197
  ~ scope 194, 196, 199
  ~ type 186, 187, 193, 194, 197, 202
ambiguous representation 187, 190, 191
anaphor 85
  functional ~ 85
  nominal ~ 85, 353
  resolution ~ 225
anchor points 433
antecedent 225-228, 230
applicability semantics 280, 285
application-specific logical form (ALF) 458
approximate generation 273, 291
association 112
ATIS 8, 43
attention 123
augmented transition network (ATN) 275
automatic learning 393

B.
Basque morphology 98
bilingual vocabulary 427
Brown corpus 163

C.
c-command 226
canonical orders 264
case library 177
case marking 17
categorial grammar (CG) 11, 72, 94
center 227, 230
  forward-looking ~ 89, 90
  ~ tracking 225, 229, 231
centering extensions 214
centering model 89, 213
certainty factor 228
chart generation 273, 286, 288
chart parsing 59, 63, 288
Chinese language 353, 431
classification 291
clause recognition 420
clustering 134
clustering of words 114, 116
coherence, local 86, 93, 94
combinatory categorial grammar (CCG) 5, 71
complementary expressiveness 454
complex sentence 159
conceptual density 164
conceptual distance 163
conceptual graph 175
conceptual graphs (CGs) 274, 366
conceptual proximity 87
connectionist transfer 393
content determination 273
context sensitivity 112
context set 356
contingency matrix 429
controlled language 59
coordinate transformation 117
Core Language Engine (CLE) 299, 454
coreferring 262
corpora 125
correspondences 320
  many-to-many ~ 333
  one-to-one ~ 333
cross-modal paraphrasing 454
CYK 35

D.
d-tree 277, 280
d-tree grammar (DTG) 277, 347
data oriented parsing 35
data oriented parsing (DOP) 35
DB-MAT 365
definite articles 214
Delphi 453
dependency tree 4
derivation 283, 290
derivation tree 4
derived tree 4
disambiguation 35
discourse representation theory (DRT) 94
discourse segmentation 356
discourse segments 356
discourse structure 353
distributed representation 398
domain model 460
domain of locality 277
domain theory 459
dynamic alignment 133

E.
Earley algorithm 54
ellipsis 85
entity-relationship theory 453
episodic memory 176
error detection 60
EUROTRA 380, 381
example-based methods 295, 309
explanation-based learning 295
explanation-based learning (EBL) 9, 313
extraposition 21

F.
f-structure 393
false alarm 154
finite state automata (FSA) 63, 452
finite state transducer (FST) 11
fixed word order 15
focus 260
focus constraints 343
foot feature principle 22
free word order 15
French language 48, 57, 72, 83, 175, 201, 202, 319, 386, 429, 434
full description 354
functional sentence perspective (FSP) 217

G.
generalisability 394
generalised phrase structure grammar (GPSG) 18, 22, 301
generation strategies 288, 306, 330
generic structure potential (GSP) 254
generic tasks 267
genre 254
German language 15, 89, 229, 248, 367, 386, 393, 431
global strategy 328
grammar inversion 301
grammar transformation 301

H.
habitable 454
head grammars 5
head-corner parsing 16, 29
head-driven phrase structure grammar (HPSG) 12, 94
hidden Markov model (HMM) 153, 439
human memory 123
hybrid knowledge representation 318
hypertheme progression 261

I.
icons 456
incremental consumption 275
incremental learning 174
information retrieval 235
initial reference 354
initiating points 433-435
intended referent 354
inter-stratal constraints 254
interactive disambiguation 186, 202
interlingua 381
interrupted tree 53
island-driven search 152

J.
Japanese language 149, 188, 200, 202, 235, 239, 240, 244, 383-386, 388, 431

K.
K-vec 428, 429, 431, 433, 434, 436
knowledge acquisition (KA) 125
KOMET-PENMAN 254
Korean tagger 439

L.
language checking technology 60
LDOCE 113
LDV 113
learnability 394
lexical ambiguity resolution 161
lexical functional grammar (LFG) 94, 393
lexical gap 287
lexical transducers 99
lexical tuning 138
lexicalised tree-adjoining grammar (LTAG) 3, 4
LIFE 289
linear indexed grammar 5
linear precedence rules 19
linearisation 342
linguistic variants 102
linguistics
  cognitive ~ 325
  structural ~ 319, 325
long distance dependencies 100
LP-rules 19
LR parsing 297

M.
machine translation (MT) 15, 377
machine-aided translation (MAT) 365
mapping rules 280, 281, 317, 319, 323, 325, 338
maximal join 275, 276
meaning-text theory (MTT) 276
memoing 273, 288
memorisation 182
message passing 92
metafunction 249
method of development 252
minimal distinguishing descriptions 355
morphological generation 102
morphology 97
morphotactics 100
multimodal synergy 457
multimodality 452
MUMBLE 277
mutual information 129, 429

N.
natural language generation (NLG) 273, 295, 317, 353, 365
noisy 428, 432
nominal anaphors 354
non-hierarchical representations 274
nucleus 263

O.
ontology engineering 87
optimisation 306

P.
P-vector 113
parallel corpus 427-429, 434
parallel progression 261
parallelism-correlations 328, 329, 345
parse forest 41
PARSETALK 90
parsing 35, 92, 95, 191
  partial ~ 149
parsing self-repairs 52
part-of-speech (POS) 128
partial parsing 418
PATR 62, 302, 362
pattern matching 276, 317, 318, 324, 328
patterns
  prototypical ~ 319, 336
phrase lattice 150
phrase spotting 152
PLANDOC 452
Polish word order 217
possible translation 382
pragmatic knowledge 175
Prague school 217, 259
preference-based generation 286
presupposition 345
principal component analysis 114
probabilistic parsing 35
PROLOG 283, 289, 367, 453, 459, 460
PROTECTOR 277, 289, 347

Q.
Q-vector 114
Quasi Logical Form (QLF) 459

R.
reduced form 354
reduction 354
referential distance 214
referring expressions 354
register 254
relaxed unification 68
relevance 266
requirements engineering 451
reversibility 459
reversible grammars 295
rheme 217
rhetorical relations 263
rhetorical structure theory (RST) 263
right edge principle 50, 53
robust parsing 60, 291
ROSETTA 381

S.
satellite 263
scalability 394
scaling factor 118
SEATS 59
segments 263
self-repair 48, 53
semantic concordance 163
semantic distance 111, 289
semantic head-driven generation 290, 299
semantic interpretation 150
semantic network 113, 275, 276, 285
semantic space 112
semantic subspace 116
semantic vector 112
SemCor 163
sentence generation 281, 295, 299, 328
sentence planning 273
sequential progression 261
similarity measure 179
sister-adjunction 278, 281, 283
skewing 432
Slavic languages 217
SMALLTALK 95, 453, 459
SNEPS 276
specification validation 452
specificity 456
speech recognition 149
speech understanding 150
spelling correction 98, 102
spoken language system 149
SPOKESMAN 277
spontaneous speech 149
spreading activation 113
SPUD 277
statistical techniques 127
stochastic tree-substitution grammar 35
strata, language stratification 254
structures
  conceptual ~ 274, 280, 319, 320, 322, 328, 345
  linguistic ~ 273, 319, 320, 327, 328
  syntactic ~ 280, 322, 333, 345
sublanguage 125
subsequent reference 355
subsertion 278, 281, 283
substitution 4, 37, 278
supertags 7
surface realisation 273, 299, 306
syntactic patterns 277, 280
  ~ recognition of 317, 320, 325, 328, 335, 337
system network 254

T.
tactical generation 273
tagger 420, 439
targeted detection 61
targeted errors 65
taxonomy 138
technical documentation 59, 276
text categorisation 235
text representation 174
textual ellipsis 85-95
  resolution of ~ 87, 92-94
thematic progression 261
thematic roles 17, 227, 276, 322
theme 217, 248
theme/rheme 89
topic 260
topic-focus articulation 259
topic/comment 89, 90
topicalisation constraints 342, 345
tree descriptions 277
tree-adjoining grammar (TAG) 3, 35, 94, 277
tree-substitution grammar (TSG) 3, 35
turn of a dialogue 259
two-level morphology 97

U.
uncertainty reasoning 225
unknown words 159
user lexicon 105
utterance path traversal 275

V.
variable bag estimation (VBE) 428, 431, 433-436
verb second German verbs 25
VERBMOBIL 277
VINST 451
VISIONNAIRE 452
visual language 452, 456
vocabulary estimation 427, 428

W.
WATSON 452
WIP 277
word classification 138
word distance 111
word order 15
word similarity 111
WordNet 162
wordphrase 440

X.
XTAG 5, 289

Z.
zero anaphor 353