Lecture Notes in Artificial Intelligence Subseries of Lecture Notes in Computer Science Edited by J. G. Carbonell and J. Siekmann
Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis and J. van Leeuwen
1925
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Singapore Tokyo
James Cussens, Sašo Džeroski (Eds.)
Learning Language in Logic
Series Editors
Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh, PA, USA
Jörg Siekmann, University of Saarland, Saarbrücken, Germany

Volume Editors
James Cussens
University of York, Department of Computer Science
Heslington, York, YO10 5DD, UK
E-mail:
[email protected]
Sašo Džeroski
Jožef Stefan Institute
Jamova 39, 1000 Ljubljana, Slovenia
E-mail:
[email protected]

Cataloging-in-Publication Data applied for

Die Deutsche Bibliothek - CIP-Einheitsaufnahme
Learning language in logic / James Cussens ; Sašo Džeroski (ed.). Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Singapore ; Tokyo : Springer, 2000
(Lecture notes in computer science ; Vol. 1925 : Lecture notes in artificial intelligence)
ISBN 3-540-41145-3
CR Subject Classification (1998): I.2, F.4

ISBN 3-540-41145-3 Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

Springer-Verlag Berlin Heidelberg New York, a member of BertelsmannSpringer Science+Business Media GmbH

© Springer-Verlag Berlin Heidelberg 2000
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Steingräber Satztechnik GmbH, Heidelberg
Printed on acid-free paper    SPIN: 10722866    06/3142    5 4 3 2 1 0
Preface

This volume has its origins in the first Learning Language in Logic (LLL) workshop which took place on 30 June 1999 in Bled, Slovenia immediately after the Ninth International Workshop on Inductive Logic Programming (ILP’99) and the Sixteenth International Conference on Machine Learning (ICML’99). LLL is a research area lying at the intersection of computational linguistics, machine learning, and computational logic. As such it is of interest to all those working in these three fields. I am pleased to say that the workshop attracted submissions from both the natural language processing (NLP) community and the ILP community, reflecting the essentially multi-disciplinary nature of LLL. Eric Brill and Ray Mooney were invited speakers at the workshop and their contributions to this volume reflect the topics of their stimulating invited talks.

After the workshop authors were given the opportunity to improve their papers, the results of which are contained here. However, this volume also includes a substantial amount of two sorts of additional material. Firstly, since our central aim is to introduce LLL work to the widest possible audience, two introductory chapters have been written. Džeroski, Cussens and Manandhar provide an introduction to ILP and LLL and Thompson provides an introduction to NLP. In both cases no previous knowledge on the part of the reader is assumed. Secondly, to give a full account of research in LLL we invited a number of researchers from the NLP and machine learning communities to contribute. We are pleased to say that all those invited contributed – resulting in the contributions from Tjong Kim Sang and Nerbonne; Kazakov; Eineborg and Lindberg; Riezler; and Thompson and Califf.

An LLL home page has been created at http://www.cs.york.ac.uk/aig/lll which provides pointers to papers, research groups, datasets, and algorithms. A page for this volume is at http://www.cs.york.ac.uk/aig/lll/book.

We have many people to thank for bringing this volume about. Thanks firstly to all those authors who contributed to this volume. Thanks also to the LLL’99 programme committee who provided the initial reviews for many of the papers here. We greatly appreciate the generous financial support towards the LLL’99 workshop from the ILP2 project and the MLNet-II, ILPNet2, and CompulogNet Networks of Excellence. Sašo Džeroski is supported by the Slovenian Ministry of Science and Technology. Thanks also to Chris Kilgour and Heather Maclaren at York for help with Word. Alfred Hofmann and colleagues at Springer deserve thanks for efficiently producing this book. Finally, we would like to thank Stephen Muggleton. Stephen not only pioneered many of the ILP techniques used by contributors to this volume, he also initiated much of the interaction between the NLP and ILP community which is central to LLL – indeed the slogan “Learning Language in Logic” is due to Stephen. We are very glad that he agreed to provide the foreword to this volume.

July 2000
James Cussens
Sašo Džeroski
Foreword
The new research area of learning language in logic (LLL) lies at the intersection of three major areas of computational research: machine learning, natural language processing (NLP), and computational logic. While statistical and machine learning techniques have been used repeatedly with good results for automating language acquisition, the techniques employed have largely not involved representations based on mathematical logic. This is in stark contrast to the well-established uses of logical representations within natural language processing. In particular, since the 1980s there has been an increasing use of approaches implemented within a logic programming framework. The area of machine learning most closely allied with logic programming is known as inductive logic programming (ILP), in which examples, background knowledge, and machine-suggested hypotheses are each logic programs. LLL represents the marriage of ILP and logic programming approaches to NLP (for more detail see “Inductive logic programming: issues, results and the LLL challenge”, Artificial Intelligence, 114(1–2):283–296, December 1999). The present volume represents an international cross-section of recent research into LLL.

As motivation for LLL, the telecommunications and other industries are investing substantial effort in the development of natural language grammars and parsing systems. Applications include information extraction, database query (especially over the telephone), tools for the production of documentation, and translation of both speech and text. Many of these applications involve not just parsing, but the production of a semantic representation of a sentence. Hand development of such grammars is very difficult, requiring expensive human expertise. It is natural to turn to machine learning for help in automatic support for grammar development.

The paradigm currently dominant in grammar learning is statistically based. This work is, with a few recent small-scale exceptions, focussed completely on syntactic or lexical properties. No treatment of semantics or contextual interpretation is possible because there are no annotated corpora of sufficient size available. The aim of statistical language modelling is, by and large, to achieve wide coverage and robustness. The necessary trade-off is that a depth of analysis cannot also be achieved. Statistical parsing methods do not deliver semantic representations capable of supporting full interpretation. Traditional rule-based systems, on the other hand, achieve the necessary depth of analysis, but at the sacrifice of robustness: hand-crafted systems do not easily extend to new types of text or applications.

In this paradigm disambiguation is addressed by associating statistical preferences, derived from an annotated training corpus, with particular syntactic or semantic configurations and using those numbers to rank parses. While this can be effective, it demands large annotated corpora for each new application, which are costly to produce. There is presumably an upper limit on the accuracy of these techniques, since the variety in language means that it is always possible
to express sentences in a way that will not have been encountered in training material.

The alternative method for disambiguation and contextual resolution is to use an explicit domain theory which encodes the relevant properties of the domain in a set of logical axioms. While this has been done for small-scale domains, the currently fashionable view is that it is impractical for complex domains because of the unmanageably large amount of hand-coded knowledge that would be required. However, if a large part of this domain knowledge could be acquired (semi-)automatically, this kind of practical objection could be overcome. From the NLP point of view the promise of ILP is that it will be able to steer a mid-course between these two alternatives of large scale, but shallow analyses, and small scale, but deep and precise analyses. ILP should produce a better ratio between breadth of coverage and depth of analysis.

In conclusion, the area of LLL is providing a number of challenges to existing ILP theory and implementations. In particular, language applications of ILP require revision and extension of a hierarchically defined set of predicates in which the examples are typically only provided for predicates at the top of the hierarchy. New predicates often need to be invented, and complex recursion is usually involved. Advances in ILP theory and implementation related to the challenges of LLL are already producing beneficial advances in other sequence-oriented applications of ILP. This book shows that LLL is starting to develop its own character as an important new sub-discipline of artificial intelligence.

July 2000
Stephen Muggleton
An Introduction to Inductive Logic Programming and Learning Language in Logic

Sašo Džeroski¹, James Cussens², and Suresh Manandhar²

¹ Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia
  [email protected]
² Department of Computer Science, University of York, Heslington, York, YO10 5DD, UK
  [email protected], [email protected]
Abstract. This chapter introduces Inductive Logic Programming (ILP) and Learning Language in Logic (LLL). No previous knowledge of logic programming, ILP or LLL is assumed. Elementary topics are covered and more advanced topics are discussed. For example, in the ILP section we discuss subsumption, inverse resolution, least general generalisation, relative least general generalisation, inverse entailment, saturation, refinement and abduction. We conclude with an overview of this volume and pointers to future work.
1 Introduction
Learning Language in Logic (LLL) is an emerging research area lying at the intersection of computational logic, machine learning and natural language processing (Fig. 1). To see what is to be gained by focussing effort at this intersection, begin by considering the role of logic in natural language processing (NLP). The flexibility and expressivity of logic-based representations have led to much effort being invested in developing logic-based resources for NLP. However, manually developed logic language descriptions have proved to be brittle, expensive to build and maintain, and still do not achieve high coverage on unrestricted text. As a result, rule-based natural language processing is currently viewed as too costly and fragile for commercial applications. Natural language learning based on statistical approaches, such as the use of n-gram models for part-of-speech tagging or lexicalised stochastic grammars for robust parsing, partly alleviates these problems. However, such approaches tend to result in linguistically impoverished descriptions that are difficult to edit, extend and interpret.

Work in LLL attempts to overcome both the limitations of existing statistical approaches and the brittleness of hand-crafted rule-based methods. LLL is, roughly, the application of Inductive Logic Programming (see Section 3) to NLP. ILP works within a logical framework but uses data and logically encoded background knowledge to learn or revise logical representations. As Muggleton notes in the foreword to this volume: “From the NLP point of view the promise of ILP is that it will be able to steer a mid-course between the two alternatives of
Fig. 1. Situating LLL (CLo = computational logic, ML = machine learning, DDNLP = data-driven NLP, LLL = learning language in logic, LG = logic grammars, NLP = natural language processing, ILP = inductive logic programming.)
large scale, but shallow levels of analysis, and small scale, but deep and precise analyses, and produce a better balance between breadth of coverage and depth of analysis.” From the ILP point of view, NLP is an ideal application area. The existence within NLP problems of hierarchically defined, structured data with large amounts of relevant, often logically defined, background knowledge provides a perfect testbed for stretching ILP technology in a way that would also be beneficial in other application areas. The aim of this introductory chapter is to provide the background required to make ILP and LLL accessible to the widest possible audience. The sections on logic programming and ILP (Sections 2–7) give a concise account of the main ideas in the hope that, for example, computational linguists unfamiliar with either or both of these topics can appreciate what LLL has to offer. Thompson (this volume) has a complementary goal: to give an overview, accessible to nonlinguists, of the main problems of NLP and how they are being addressed. In both cases, no previous knowledge is assumed so that quite elementary topics are covered. Section 8 summarises the contributions contained in this volume. We conclude with Section 9 where we analyse which NLP problems are most suited to a LLL approach and suggest future directions for LLL.
2 Introduction to Logic Programming
In this section, the basic concepts of logic programming are introduced. These include the language (syntax) of logic programs, as well as basic notions from model and proof theory. The syntax defines what are legal sentences/statements in the language of logic programs. Model theory is concerned with assigning meaning (truth values) to such statements. Proof theory focuses on (deductive) reasoning with such statements. For a thorough treatment of logic programming we refer to the standard textbook of Lloyd (1987). The overview below is mostly based on the comprehensive and easily readable text by Hogger (1990).
2.1 The Language
A first-order alphabet consists of variables, predicate symbols and function symbols (which include constants). A variable is a term, and a function symbol immediately followed by a bracketed n-tuple of terms is a term. Thus f(g(X), h) is a term when f, g and h are function symbols and X is a variable—strings starting with lower-case letters denote predicate and function symbols, while strings starting with upper-case letters denote variables. A constant is a function symbol of arity 0 (i.e., followed by a bracketed 0-tuple of terms, which is usually left implicit). A predicate symbol immediately followed by a bracketed n-tuple of terms is called an atomic formula or atom. For example, if mother and father are predicate symbols then mother(maja, filip) and father(X, Y) are atoms.

A well-formed formula is either an atomic formula or takes one of the following forms: (F), ¬F, F ∨ G, F ∧ G, F ← G, F ↔ G, ∀X : F and ∃X : F, where F and G are well-formed formulae and X is a variable. ¬F denotes the negation of F, ∨ denotes logical disjunction (or), and ∧ logical conjunction (and). F ← G stands for implication (F if G, i.e., F ∨ ¬G) and F ↔ G stands for equivalence (F if and only if G). ∀ and ∃ are the universal (for all X, F holds) and existential quantifier (there exists an X such that F holds). In the formulae ∀X : F and ∃X : F, all occurrences of X are said to be bound. A sentence or a closed formula is a well-formed formula in which every occurrence of every variable symbol is bound. For example, ∀Y ∃X father(X, Y) is a sentence, while father(X, andy) is not.

Clausal form is a normal form for first-order sentences. A clause is a disjunction of literals—a positive literal is an atom, a negative literal the negation of an atom—preceded by a prefix of universal quantifiers, one for each variable appearing in the disjunction. In other words, a clause is a formula of the form ∀X1∀X2...∀Xs(L1 ∨ L2 ∨ ... ∨ Lm), where each Li is a literal and X1, X2, ..., Xs are all the variables occurring in L1 ∨ L2 ∨ ... ∨ Lm. Usually the prefix of variables is left implicit, so that ∀X1∀X2...∀Xs(L1 ∨ L2 ∨ ... ∨ Lm) is written as L1 ∨ L2 ∨ ... ∨ Lm. A clause can also be represented as a finite set (possibly empty) of literals. The set {A1, A2, ..., Ah, ¬B1, ¬B2, ..., ¬Bb}, where the Ai and Bi are atoms, stands for the clause (A1 ∨ ... ∨ Ah ∨ ¬B1 ∨ ... ∨ ¬Bb), which is equivalently represented as A1 ∨ ... ∨ Ah ← B1 ∧ ... ∧ Bb. Most commonly, this same clause is written as A1, ..., Ah ← B1, ..., Bb, where A1, ..., Ah is called the head and B1, ..., Bb the body of the clause. A finite set of clauses is called a clausal theory and represents the conjunction of its clauses.

A clause is a Horn clause if it contains at most one positive literal; it is a definite clause if it contains exactly one positive literal. A set of definite clauses is called a definite logic program. A fact is a definite clause with an empty body, e.g., parent(mother(X), X) ←, also written simply as parent(mother(X), X). A goal (also called a query) is a Horn clause with no positive literals, such as ← parent(mother(X), X). A program clause is a clause of the form A ← L1, ..., Lm where A is an atom, and each of L1, ..., Lm is a positive or negative literal. A negative literal in the
body of a program clause is written in the form not B, where B is an atom. A normal program (or logic program) is a set of program clauses. A predicate definition is a set of program clauses with the same predicate symbol (and arity) in their heads.

Let us now illustrate the above definitions with some examples. The clause

daughter(X, Y) ← female(X), mother(Y, X).

is a definite program clause, while the clause

daughter(X, Y) ← not male(X), father(Y, X).

is a normal program clause. Together, the two clauses constitute a predicate definition of the predicate daughter/2. This predicate definition is also a normal logic program. The first clause is an abbreviated representation of the formula

∀X∀Y : daughter(X, Y) ∨ ¬female(X) ∨ ¬mother(Y, X)

and can also be written in set notation as

{daughter(X, Y), ¬female(X), ¬mother(Y, X)}

The set of variables in a term, atom or clause F is denoted by vars(F). A substitution θ = {V1/t1, ..., Vn/tn} is an assignment of terms ti to variables Vi. Applying a substitution θ to a term, atom, or clause F yields the instantiated term, atom, or clause Fθ where all occurrences of the variables Vi are simultaneously replaced by the term ti. A term, atom or clause F is called ground when there is no variable occurring in F, i.e., vars(F) = ∅. The fact daughter(mary, ann) is thus ground.

A clause or clausal theory is called function free if it contains only variables as terms, i.e., contains no function symbols (this also means no constants). The clause daughter(X, Y) ← female(X), mother(Y, X) is function free and the clause even(s(s(X))) ← even(X) is not. A Datalog clause (program) is a definite clause (program) that contains no function symbols of non-zero arity. This means that only variables and constants can be used as predicate arguments. The size of a term, atom, clause, or a clausal theory T is the number of symbols that appear in T, i.e., the number of all occurrences in T of predicate symbols, function symbols and variables.

2.2 Model Theory
Model theory is concerned with attributing meaning (truth value) to sentences in a first-order language. Informally, the sentence is mapped to some statement about a chosen domain through a process known as interpretation. An interpretation is determined by the set of ground facts (ground atomic formulae) to which it assigns the value true. Sentences involving variables and quantifiers are interpreted by using the truth values of the ground atomic formulae and a fixed
set of rules for interpreting logical operations and quantifiers, such as “¬F is true if and only if F is false”. An interpretation which gives the value true to a sentence is said to satisfy the sentence; such an interpretation is called a model for the sentence. An interpretation which does not satisfy a sentence is called a counter-model for that sentence. By extension, we also have the notion of a model (counter-model) for a set of sentences (e.g., for a clausal theory): an interpretation is a model for the set if and only if it is a model for each of the set’s members. A sentence (set of sentences) is satisfiable if it has at least one model; otherwise it is unsatisfiable. A sentence F logically implies a sentence G if and only if every model for F is also a model for G. We denote this by F |= G. Alternatively, we say that G is a logical (or semantic) consequence of F. By extension, we have the notion of logical implication between sets of sentences.

A Herbrand interpretation over a first-order alphabet is a set of ground facts constructed with the predicate symbols in the alphabet and the ground terms from the corresponding Herbrand domain of function symbols; this is the set of ground atoms deemed to be true by the interpretation. A Herbrand interpretation I is a model for a clause c if and only if for all substitutions θ such that cθ is ground, body(c)θ ⊆ I implies head(c)θ ∩ I ≠ ∅. In that case, we say c is true in I. A Herbrand interpretation I is a model for a clausal theory T if and only if it is a model for all clauses in T. We say that I is a Herbrand model of c, respectively T. Roughly speaking, the truth of a clause c in a (finite) interpretation I can be determined by running the goal (query) body(c), not head(c) on a database containing I, using a theorem prover such as PROLOG. If the query succeeds, the clause is false in I; if it fails, the clause is true. Analogously, one can determine the truth of a clause c in the minimal (least) Herbrand model of a theory T by running the goal body(c), not head(c) on a database containing T.

To illustrate the above notions, consider the Herbrand interpretation

i = {parent(saso, filip), parent(maja, filip), son(filip, saso), son(filip, maja)}

The clause c = parent(X, Y) ← son(Y, X) is true in i, i.e., i is a model of c. On the other hand, i is not a model of the clause parent(X, X) ← (which means that everybody is their own parent).
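To make this query-based truth test concrete, the following minimal sketch (any standard Prolog system is assumed) stores the interpretation i as facts and tests the two clauses just discussed:

% The Herbrand interpretation i, stored as Prolog facts.
parent(saso, filip).    parent(maja, filip).
son(filip, saso).       son(filip, maja).

% c = parent(X,Y) <- son(Y,X): the goal body(c), not head(c) fails, so c is true in i.
%   ?- son(Y, X), \+ parent(X, Y).
%   false.
% parent(X,X) <- : the corresponding goal succeeds, so the clause is false in i.
%   ?- \+ parent(X, X).
%   true.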
2.3 Proof Theory
Proof theory focuses on (deductive) reasoning with logic programs. Whereas model theory considers the assignment of meaning to sentences, proof theory considers the generation of sentences (conclusions) from other sentences (premises). More specifically, proof theory considers the derivability of sentences in the context of some set of inference rules, i.e., rules for sentence derivation. An inference rule has the following schematic form: “from a set of sentences of that kind, derive a sentence of this kind”. Formally, an inference system consists of an initial set S of sentences (axioms) and a set R of inference rules.
Using the inference rules, we can derive new sentences from S and/or other derived sentences. The fact that sentence s can be derived from S is denoted S ⊢ s. A proof is a sequence s1, s2, ..., sn, such that each si is either in S or derivable using R from S and s1, ..., si−1. Such a proof is also called a derivation or deduction. Note that the above notions are of an entirely syntactic nature. They are directly relevant to the computational aspects of automated deductive inference.

The set of inference rules R defines the derivability relation ⊢. A set of inference rules is sound if the corresponding derivability relation is a subset of the logical implication relation, i.e., for all S and s, if S ⊢ s then S |= s. It is complete if the other direction of the implication holds, i.e., for all S and s, if S |= s then S ⊢ s. The properties of soundness and completeness establish a relation between the notions of syntactic (⊢) and semantic (|=) entailment in logic programming and first-order logic. When the set of inference rules is both sound and complete, the two notions coincide.

Resolution comprises a single inference rule applicable to clausal-form logic. From any two clauses having an appropriate form, resolution derives a new clause as their consequence. For example, the clauses

daughter(X, Y) ← female(X), parent(Y, X)
female(sonja) ←

resolve into the clause

daughter(sonja, Y) ← parent(Y, sonja)

Resolution is sound: every resolvent is implied by its parents. It is also refutation complete: the empty clause is derivable by resolution from any set S of Horn clauses if S is unsatisfiable. Resolution is discussed further in Section 3.

2.4 Logic Programming for NLP
Even when restricted to clausal form, the language of first-order logic is sufficiently flexible to cleanly represent a wide range of linguistic information. For example, grammar rules are easy to represent as clauses. Consider the grammar rule

VP → VP MOD

where VP stands for verb phrase and MOD stands for modifier. If a grammar can establish that the word ran constitutes a verb phrase and that the word quickly constitutes a modifier, then this rule allows us to infer that ran quickly is a verb phrase. We can represent this rule by the definite clause

vp(Start, End) ← vp(Start, Middle), mod(Middle, End)

where the variables stand for vertices in between words. For example if, from the sentence 0 She 1 ran 2 quickly 3, we derive vp(1, 2), mod(2, 3), then the given
clause allows us to infer vp(1, 3). vp(1, 2), mod(2, 3) and vp(1, 3) are examples of edges as used in chart parsing. See Thompson (this volume) for a description of chart parsing. A clause such as this, where a predicate appears in both the head and the body, is called a recursive clause. Grammars represented as a set of definite clauses are, unsurprisingly, known as definite clause grammars (DCGs). The logic programming language PROLOG has a special DCG syntax so that the PROLOG clause

vp(Start,End) :- vp(Start,Middle), mod(Middle,End).

(where the :- represents a ←) may be more compactly written as

vp --> vp, mod.

First-order terms are important in representing structured information. The most common sort of term used in logic programming is the list. Lists are recursive terms: a list is either the empty list or of the form .(Head, Tail), where Tail is a list. In PROLOG, .(Head, Tail) can be more intuitively represented as [Head|Tail]. Sentences are often represented as lists, so She ran quickly would be represented in PROLOG as [’She’|[ran|[quickly|[]]]]. In fact, as a convenience, PROLOG allows one to write this as [’She’,ran,quickly]. Note that the constant She has been quoted since otherwise its initial capital letter would mean that it would be interpreted as a variable.

Structured terms allow great representational flexibility. For example, in the grammar framework used by Cussens and Pulman (this volume), it is more convenient to represent the grammar rule VP → VP MOD as the following non-ground fact, using terms in place of literals:

cmp_synrule(vp(Start, End), [vp(Start, Middle), mod(Middle, End)]).

First-order terms are sufficiently flexible to represent semantic, as well as syntactic, information. Boström (this volume) uses ILP to translate between quasi-logical forms (QLFs)—terms intended to represent the meaning of sentences. QLFs are often very complex; here is one representing the meaning of the sentence List the prices:

[imp, form(_,verb(no,no,no,imp,y),A,
  B^[B,[list_Enumerate,A,
    term(_,ref(pro,you,_,l([])),_, C^[personal,C],_,_),
    term(_,q(_,bare,plur),_, D^[fare_Price,D],_,_)]],_)]

The underscores (_) are a PROLOG convention used to represent ‘anonymous variables’; each underscore is implicitly replaced by a distinct variable.
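To make the DCG notation above concrete, here is a minimal sketch of a grammar fragment covering She ran quickly (the lexical rules are illustrative assumptions, not part of any grammar used later in this volume). Note that the recursive rule is placed after the lexical rule for vp: plain top-down DCG parsing loops on left-recursive rules if they are tried first, which is one reason chart parsing is used in practice.

s   --> np, vp.
np  --> ['She'].
vp  --> [ran].
vp  --> vp, mod.      % the rule VP -> VP MOD discussed above
mod --> [quickly].

% ?- phrase(s, ['She', ran, quickly]).
% true.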
3 Introduction to ILP
While logic programming (and in particular its proof theory) is concerned with deductive inference, inductive logic programming is concerned with inductive inference. It generalizes from individual examples/observations in the presence of background knowledge, finding regularities/hypotheses. The most commonly addressed task in ILP is the task of learning logical definitions of relations (Quinlan, 1990), where tuples that belong or do not belong to the target relation are given as examples. ILP then induces a logic program (predicate definition) defining the target relation in terms of other relations that are given as background knowledge.

We assume a set of examples is given, i.e., tuples that belong to the target relation p (positive examples) and tuples that do not belong to p (negative examples). Also given are background relations (or background predicates) qi that constitute the background knowledge and can be used in the learned definition of p. Finally, a hypothesis language, specifying syntactic restrictions on the definition of p, is also given (either explicitly or implicitly). The task is to find a definition of the target relation p that is consistent and complete. Informally, it has to explain all the positive and none of the negative examples.

More formally, we have a set of examples E = P ∪ N, where P contains positive and N negative examples, and background knowledge B. The task is to find a hypothesis H such that ∀e ∈ P : B ∧ H |= e (H is complete) and ∀e ∈ N : B ∧ H ⊭ e (H is consistent). This setting was introduced by Muggleton (1991). In an alternative setting proposed by De Raedt and Džeroski (1994), the requirement that B ∧ H |= e is replaced by the requirement that H be true in the minimal Herbrand model of B ∧ e: this setting is called learning from interpretations. In the most general formulation, each e, as well as B and H, can be a clausal theory. In practice, each e is most often a ground example and H and B are definite logic programs. Recall that |= denotes logical implication (semantic entailment). Semantic entailment (|=) is in practice replaced with syntactic entailment (⊢) / provability, where the resolution inference rule (as implemented in PROLOG) is most often used to prove examples from a hypothesis and the background knowledge.

As an illustration, consider the task of defining the relation daughter(X, Y), which states that person X is a daughter of person Y, in terms of the background knowledge relations female and parent. These relations, as well as two more background relations mother and father, are given in Table 1. There are two positive and two negative examples of the target relation daughter. In the hypothesis language of definite program clauses it is possible to formulate the following definition of the target relation,

daughter(X, Y) ← female(X), parent(Y, X).

which is consistent and complete with respect to the background knowledge and the training examples.
Table 1. A simple ILP problem: learning the daughter relation.

Training examples            Background knowledge
daughter(mary, ann).  ⊕      mother(ann, mary).    female(ann).
daughter(eve, tom).   ⊕      mother(ann, tom).     female(mary).
daughter(tom, ann).   ⊖      father(tom, eve).     female(eve).
daughter(eve, ann).   ⊖      father(tom, ian).
                             parent(X, Y) ← mother(X, Y)
                             parent(X, Y) ← father(X, Y)
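Since provability in PROLOG stands in for entailment here, the completeness and consistency of this hypothesis can be checked directly by loading B and H and running the examples as queries; a minimal sketch:

% Background knowledge B (Table 1).
mother(ann, mary).    mother(ann, tom).
father(tom, eve).     father(tom, ian).
female(ann).   female(mary).   female(eve).
parent(X, Y) :- mother(X, Y).
parent(X, Y) :- father(X, Y).

% Hypothesis H.
daughter(X, Y) :- female(X), parent(Y, X).

% Completeness: both positive examples are provable.
%   ?- daughter(mary, ann), daughter(eve, tom).      % succeeds
% Consistency: neither negative example is provable.
%   ?- daughter(tom, ann) ; daughter(eve, ann).      % fails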
In general, depending on the background knowledge, the hypothesis language and the complexity of the target concept, the target predicate definition may consist of a set of clauses, such as

daughter(X, Y) ← female(X), mother(Y, X).
daughter(X, Y) ← female(X), father(Y, X).

The hypothesis language is typically a subset of the language of program clauses. Since the complexity of learning grows with the expressiveness of the hypothesis language, restrictions have to be imposed on hypothesised clauses. Typical restrictions include a bound on the number of literals in a clause and restrictions on variables that appear in the body of the clause but not in its head (so-called new variables).

ILP systems typically adopt the covering approach of rule induction systems. In a main loop, they construct a clause explaining some of the positive examples, add this clause to the hypothesis, remove the positive examples explained and repeat this until all positive examples are explained (the hypothesis is complete). In an inner loop, individual clauses are constructed by (heuristically) searching the space of possible clauses, structured by a specialization or generalization operator. Typically, search starts with a very general rule (a clause with no conditions in the body), then proceeds to add literals (conditions) to this clause until it only covers (explains) positive examples (the clause is consistent). This search can be bound from below by using so-called bottom clauses, constructed by least general generalization or inverse resolution/entailment. We discuss these issues in detail in Sections 5–7.

When dealing with incomplete or noisy data, which is most often the case, the criteria of consistency and completeness are relaxed. Statistical criteria are typically used instead. These are based on the number of positive and negative examples explained by the definition and the individual constituent clauses.
4 ILP for LLL
In many cases, ‘standard’ single predicate learning from positive and negative examples as outlined in Section 3 can be used fairly straightforwardly for natural language learning (NLL) tasks. For example, many of the papers reviewed by
Eineborg and Lindberg (this volume) show how the standard ILP approach can be applied to part-of-speech tagging. In this section, we outline NLL problems that require a more complex ILP approach. We will use grammar learning and morphology as examples, beginning with grammar learning.

Table 2. Learning a grammar rule for verb phrases

Training examples      Background knowledge
vp(1, 3)   ⊕           np(0, 1)     word('She', 0, 1)
                       vp(1, 2)     word(ran, 1, 2)
                       mod(2, 3)    word(quickly, 2, 3)

4.1 Learning from Positive Examples Only
Consider the very simple language learning task described in Table 2. The background knowledge is a set of facts about the sentence She ran quickly, which, in practice, would have been derived using a background grammar and lexicon, which we do not include here. An ILP algorithm would easily construct the following single clause theory H:

vp(S, F) ← vp(S, M), mod(M, F)

which represents the grammar rule VP → VP MOD and which is consistent and complete with respect to the background knowledge and the training example. However, many other clauses are also consistent and complete, including overgeneral clauses such as:

vp(S, F) ←
vp(S, F) ← word(ran, S, M)
vp(S, F) ← vp(S, M)

So an ILP algorithm needs something more than consistency and completeness if it is to select the correct hypothesis in this case. This problem stems from a common feature of NLP learning tasks: the absence of explicit negative examples. This absence allowed the wildly overgeneral rule vp(S, F) ← (VPs are everywhere!) to be consistent. A suitably defined hypothesis language which simply disallows many overgeneral clauses can only partially overcome this problem.

One option is to invoke a Closed World Assumption (CWA) to produce implicit negative examples. In our example LLL problem this would amount to the (correct) assumption that VPs only occur at the places stated in the examples and
background knowledge. We make these negative examples explicit in Table 3. Although correct in this case, such an assumption will not always be so. If the fact vp(1, 2) were missing from the background knowledge, the CWA would incorrectly assert that the clause vp(1, 2) were false.

Table 3. Learning a grammar rule for verb phrases, with explicit negatives.

Training examples      Background knowledge
vp(1, 3)   ⊕           np(0, 1)     word('She', 0, 1)
vp(0, 1)   ⊖           vp(1, 2)     word(ran, 1, 2)
vp(0, 2)   ⊖           mod(2, 3)    word(quickly, 2, 3)
vp(0, 3)   ⊖
vp(2, 3)   ⊖
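The implicit negatives in Table 3 can be generated mechanically. A minimal sketch of the CWA at work, assuming the vp/2 atoms of Table 2 are the only ones known and that 0–3 are the sentence's vertices:

% vp/2 atoms known to be true: the positive example and the background fact.
known_vp(1, 3).
known_vp(1, 2).

vertex(0).  vertex(1).  vertex(2).  vertex(3).

% Under the CWA, every other vp(I,J) over the vertices is taken to be false.
cwa_negative(vp(I, J)) :-
    vertex(I), vertex(J), I < J,
    \+ known_vp(I, J).

% ?- findall(N, cwa_negative(N), Ns).
% Ns = [vp(0,1), vp(0,2), vp(0,3), vp(2,3)].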
Since a CWA is generally too strong, an alternative is to assume that most other potential examples are negative. Muggleton (2000) gives a probabilistic method along these lines for learning from positive examples where the probability of a hypothesis decreases with its generality and increases with its compactness. Boström (1998) also uses a probabilistic approach to positive-only learning based on learning Hidden Markov Models (HMMs).

A different approach rests on the recognition that grammar learning, and many other NLL tasks, are not primarily classification tasks. In this view a grammar is not really concerned with differentiating sentences (positive examples) from non-sentences (negative examples). Instead, given a string of words assumed to be a sentence, the goal is to find the correct parse of that sentence. From this view, the problem with vp(S, F) ← is not so much that it may lead to non-sentences being incorrectly classified as sentences; but rather that it will lead to a given sentence having many different parses—making it harder to find the correct parse. A shift from classification to disambiguation raises many important issues for the application of ILP to NLP which we cannot discuss here. This issue is addressed in the papers by Mooney (this volume) and Thompson and Califf (this volume) which present the CHILL algorithm which ‘learns to parse’ rather than classify. Riezler’s paper (this volume) also focusses on disambiguation, employing a statistical approach.

Output Completeness is a type of CWA introduced by Mooney and Califf (1995) which allows a decision list learning system to learn from positive only data—an important consideration for language learning applications. Decision lists are discussed in Section 4.3. We illustrate the notion of output completeness using a simple example. In Table 4, there are two possible plural forms for fish whereas there is only one plural form for lip. Given these examples, under the output completeness assumption, every example of the form plural([l,i,p], Y) where Y ≠ [l,i,p,s] is a negative example. Similarly, every example of the form
Table 4. Positive examples for learning about plurals

plural([f,i,s,h], [f,i,s,h]).
plural([f,i,s,h], [f,i,s,h,e,s]).
plural([l,i,p], [l,i,p,s]).
plural([f,i,s,h], Y) where Y ≠ [f,i,s,h] and Y ≠ [f,i,s,h,e,s] is a negative example. Thus, the output completeness assumption allows us to assume that the training data set contains, for every example, the complete set of output values for every input value present in the data set.
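A minimal sketch of how this assumption yields implicit negatives from the positives of Table 4 (the predicate names positive/1 and implicit_negative/1 are illustrative):

% Positive examples from Table 4 (excerpt).
positive(plural([f,i,s,h], [f,i,s,h])).
positive(plural([f,i,s,h], [f,i,s,h,e,s])).
positive(plural([l,i,p],   [l,i,p,s])).

% Under output completeness, any other output for an input that occurs
% in the data is taken to be a negative example.
implicit_negative(plural(X, Y)) :-
    positive(plural(X, _)),        % the input X occurs in the data
    \+ positive(plural(X, Y)).     % but Y is not among its listed outputs

% ?- implicit_negative(plural([l,i,p], [l,i,p])).       % succeeds: a negative
% ?- implicit_negative(plural([f,i,s,h], [f,i,s,h])).   % fails: it is a positive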
4.2 Multiple Predicate Learning
So far we have been looking at cases where we learn from examples of a single target predicate. In grammar learning we may have to learn clauses for several predicates (multiple predicate learning) and also the examples may not be examples of any of the predicates we need to learn. Consider the learning problem described in Table 5. The target theory is the two clause theory:

v(X, Y) ← word(ran, X, Y)
vp(S, F) ← vp(S, M), mod(M, F)

Table 5. Grammar completion

Training examples      Background knowledge
s(0, 3)   ⊕            np(X, Y) ← word('She', X, Y)        word('She', 0, 1)
                       mod(X, Y) ← word(quickly, X, Y)     word(quickly, 2, 3)
                       s(X, Y) ← np(X, Z), vp(Z, Y)        word(ran, 1, 2)
                       vp(X, Y) ← v(X, Y)
The problem in Table 5 is to induce a theory H with certain properties. For example, it must be in the hypothesis language, not be over-general, not increase the ambiguity of the grammar too much, etc. However, here we will focus on the constraint that s(0, 3) must follow from B ∧ H. Neither of our two clauses directly defines the predicate s/2, the predicate symbol of our example, but we can use abduction to generate positive examples with the right predicate symbol. Abduction is a form of reasoning introduced by Peirce which can be identified with the following inference rule:

    C,  C ← A
    ─────────
        A
This formalises the following form of reasoning (Peirce, 1958), quoted in (Flach & Kakas, 2000):

  The surprising fact, C, is observed;
  But if A were true, C would be a matter of course,
  Hence, there is reason to suspect that A is true.

In our example we have the positive example s(0, 3) which requires explanation. We also have s(0, 3) ← vp(1, 3), which follows from B. Abduction allows us to derive vp(1, 3), which we can then treat as a normal positive example and attempt to learn vp(S, F) ← vp(S, M), mod(M, F). Abduction is, of course, unsound: we are just guessing that vp(1, 3) is true, and sometimes our guesses will be wrong. For example, from vp(1, 3) and vp(X, Y) ← v(X, Y), which are both true, we can abductively derive v(1, 3), which is not. Abductive approaches applied to LLL problems can be found in (Muggleton & Bryant, 2000; Cussens & Pulman, 2000) and Cussens and Pulman (this volume).
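The abductive step can be sketched as a small Prolog meta-interpreter that proves a goal from the rules and facts of Table 5 where it can, and otherwise returns goals of declared abducible predicates as assumptions. The choice of abducible predicates and the prove/3 interface are assumptions made for this sketch; this is not the procedure of the systems cited above.

% Grammar rules and facts from Table 5.
rule(s(X, Y),   [np(X, Z), vp(Z, Y)]).
rule(vp(X, Y),  [v(X, Y)]).
rule(np(X, Y),  [word('She', X, Y)]).
rule(mod(X, Y), [word(quickly, X, Y)]).
fact(word('She', 0, 1)).
fact(word(ran, 1, 2)).
fact(word(quickly, 2, 3)).

abducible(vp(_, _)).
abducible(v(_, _)).

% prove(+Goals, +As0, -As): prove each goal, accumulating abduced atoms in As.
prove([], As, As).
prove([G|Gs], As0, As) :-
    (   fact(G),       As1 = As0
    ;   abducible(G),  As1 = [G|As0]
    ;   rule(G, Body), prove(Body, As0, As1)
    ),
    prove(Gs, As1, As).

% ?- prove([s(0, 3)], [], As).
% As = [vp(1, 3)] ;        % the assumption used in the text
% As = [v(1, 3)]           % the unsound guess discussed above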
4.3 Decision List Learning for NLP
Many linguistic phenomena can be explained by rules together with exceptions to those rules, but the pure logic programs induced by standard ILP algorithms do not express exceptions easily. For example, this clause:

plural(X,Y) :- mate(X,Y,[],[],[],[s]).

says that for any word X, without exception, the plural Y is formed by adding an ‘s’. In a pure logic program this is the only possible interpretation of the rule, even if we have somehow stated the exceptions to the rule elsewhere in the logic program. A solution to this problem is to use first-order decision lists. These can be thought of as an ordered set of clauses with specific clauses at the top and general clauses at the bottom. For example, Table 6 shows a decision list learnt by Clog (Manandhar, Džeroski, & Erjavec, 1998) from the training examples given in Table 7.¹

Table 6. Decision list induced by Clog from the training data in Table 7

plural([s,p,y],[s,p,i,e,s]):- !.
plural(X,Y) :- mate(X,Y,[],[],[a,n],[e,n]),!.
plural(X,Y) :- mate(X,Y,[],[],[a,s,s],[a,s,s,e,s]),!.
plural(X,Y) :- mate(X,Y,[],[],[],[s]),!.
¹ Clog is available from www.cs.york.ac.uk/~suresh/CLOG

Table 7. Sample training data set for Clog.

plural([l,i,p], [l,i,p,s]).            plural([m,e,m,b,e,r], [m,e,m,b,e,r,s]).
plural([d,a,y], [d,a,y,s]).            plural([s,e,c,o,n,d], [s,e,c,o,n,d,s]).
plural([o,t,h,e,r], [o,t,h,e,r,s]).    plural([l,i,e], [l,i,e,s]).
plural([m,a,s,s], [m,a,s,s,e,s]).      plural([c,l,a,s,s], [c,l,a,s,s,e,s]).
plural([s,p,y], [s,p,i,e,s]).          plural([m,a,n], [m,e,n]).
plural([w,o,m,a,n], [w,o,m,e,n]).      plural([f,a,c,e], [f,a,c,e,s]).

The first clause in Table 6 is an exception. Although the general rule in English is to change the ending -y to -ies, Clog was not able to learn this rule
because there was only one example in the training set showing this pattern. The second rule states that if a word ends in an -an then the plural form replaces it with -en. The last rule is the default rule that will apply if the previous rules fail. This default rule will add an -s ending to any word. Decision lists are implemented in PROLOG by adding a cut (!) to the end of every clause. PROLOG will then only apply one clause—the first one that succeeds.

There are several reasons why decision lists are attractive for language learning applications. In contrast with pure logic programs, output-completeness provides a straightforward mechanism to learn decision lists from positive only data. Again, in contrast with pure logic programs which can return multiple answers nondeterministically, decision lists are learnt so that the most likely answer is returned when multiple solutions are possible. For example, in part-of-speech (PoS) tagging, every word in a given sentence needs to be assigned a unique PoS tag from a fixed set of tags. PoS tags are things such as VT (transitive verb), ADJ (adjective), NN (common noun), DET (determiner) etc. Words are often ambiguous. The task of a PoS tagger is to determine from context the correct PoS tag for every word. Thus, in the forest fires have died, a PoS tagger should assign NN to fires and not VT. Given a sentence, one is interested in the most probable tag sequence and not the set of plausible tag sequences.

Transformation lists have been popularised by Brill’s work (Brill, 1995) on learning of PoS tagging rules. As an example, consider the PoS tagging rules given in Table 8 (adapted from Brill, this volume). In transformation based tagging, the tag assigned to a word is changed by the rules depending upon the left and right context. As far as the syntax for representing both transformation lists and decision lists is concerned there is hardly a difference. For instance, we can easily represent the rules in Table 8 in PROLOG form as given in Table 9. The main difference is that in a decision list once a rule has been applied to a test example no further changes to the example take place. On the other hand, in a transformation list, once a rule has been applied to a test example, the example is still subject to further rule application and it is not removed from the test set. In fact, the
Table 8. Transformation based tagging rules

– Change tag from IN to RB if the current word is about and the next tag is CD.
– Change tag from IN to RB if the current word is about and the next tag is $.
– Change tag from NNS to NN if the current word is yen and the previous tag is CD.
– Change tag from NNPS to NNP if the previous tag is NNP.
rules in a transformation list apply to all the examples in the test set repeatedly until no further application is possible. Thus a transformation list successively transforms an input set of examples until no further application is possible. The resulting set is the solution produced by the transformation list.

Table 9. PROLOG representation of the tagging rules in Table 8

tag_rule( _, 'IN'-about, [ 'CD'-_, _ ], 'RB'-about) :- !.
tag_rule( _, 'IN'-about, [ '$'-_, _ ], 'RB'-about) :- !.
tag_rule( [_, 'CD'-_], 'NNS'-yen, _, 'NN'-yen) :- !.
tag_rule( [_, 'NNP'-_], 'NNPS'-X, _, 'NNP'-X) :- !.
Given this background, it is straightforward to generalise first-order decision lists to first-order transformation lists, thereby lifting Brill’s transformation based learning into an ILP setting (see, for instance, (Dehaspe & Forrier, 1999)). This further justifies the use of decision lists and their extension to first-order transformation lists for language learning applications. Finally, existing work (Quinlan, 1996; Manandhar et al., 1998) demonstrates that relatively efficient decision list learning algorithms exist.
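The operational difference just described can be sketched directly: a decision list commits to the first clause that applies, whereas a transformation list keeps rewriting until no rule applies any more. The rule/3 relation and the single toy rule below are illustrative assumptions, not Clog’s or Brill’s actual interfaces; termination of the transformation loop relies on the rules eventually ceasing to apply, as Brill-style retagging rules do.

% rule(Name, In, Out): Out results from applying one rewrite rule to In.
% A toy rule in the spirit of Table 8: retag yen from NNS to NN after a CD tag
% (for simplicity it only looks at the front of the list).
rule(yen_nn, ['CD'-W, 'NNS'-yen | Rest], ['CD'-W, 'NN'-yen | Rest]).

% Decision list behaviour: apply the first applicable rule, then stop.
decision_list_apply(In, Out) :- rule(_, In, Out), !.
decision_list_apply(In, In).          % default: leave the input unchanged

% Transformation list behaviour: keep applying rules until none applies.
transform_apply(In, Out) :- rule(_, In, Mid), !, transform_apply(Mid, Out).
transform_apply(In, In).

% ?- transform_apply(['CD'-five, 'NNS'-yen], Out).
% Out = ['CD'-five, 'NN'-yen].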
5 Structuring the Space of Clauses
In order to search the space of clauses (program clauses) systematically, it is useful to impose some structure upon it, e.g., an ordering. One such ordering is based on θ-subsumption, defined below. Recall first that a substitution θ = {V1/t1, ..., Vn/tn} is an assignment of terms ti to variables Vi. Applying a substitution θ to a term, atom, or clause F yields the instantiated term, atom, or clause Fθ where all occurrences of the variables Vi are simultaneously replaced by the term ti.

Let c and c′ be two program clauses. Clause c θ-subsumes c′ if there exists a substitution θ, such that cθ ⊆ c′ (Plotkin, 1969). To illustrate the above notions, consider the clause c

c = daughter(X, Y) ← parent(Y, X).
Applying the substitution θ = {X/mary, Y/ann} to clause c yields cθ = daughter(mary, ann) ← parent(ann, mary). Recall that the clausal notation daughter(X, Y) ← parent(Y, X) stands for {daughter(X, Y), ¬parent(Y, X)} where all variables are assumed to be universally quantified and the commas denote disjunction. According to the definition, clause c θ-subsumes c′ if there is a substitution θ that can be applied to c such that every literal in the resulting clause occurs in c′. Clause c θ-subsumes the clause c′ = daughter(X, Y) ← female(X), parent(Y, X) under the empty substitution θ = ∅, since {daughter(X, Y), ¬parent(Y, X)} is a proper subset of {daughter(X, Y), ¬female(X), ¬parent(Y, X)}. Furthermore, clause c θ-subsumes the clause c′ = daughter(mary, ann) ← female(mary), parent(ann, mary), parent(ann, tom) under the substitution θ = {X/mary, Y/ann}.

θ-subsumption introduces a syntactic notion of generality. Clause c is at least as general as clause c′ (c ≤ c′) if c θ-subsumes c′. Clause c is more general than c′ (c < c′) if c ≤ c′ holds and c′ ≤ c does not. In this case, we say that c′ is a specialization of c and c is a generalization of c′. If the clause c′ is a specialization of c then c′ is also called a refinement of c. The only clause refinements usually considered by ILP systems are the minimal (most general) specializations of the clause.

There are two important properties of θ-subsumption:

– If c θ-subsumes c′ then c logically entails c′, c |= c′. The reverse is not always true. As an example, Flach (1992) gives the following two clauses c = list([V|W]) ← list(W) and c′ = list([X, Y|Z]) ← list(Z). Given the empty list, c constructs lists of any given length, while c′ constructs lists of even length only, and thus c |= c′. However, no substitution exists that can be applied to c to yield c′, since it should map W both to [Y|Z] and to Z, which is impossible. Therefore, c does not θ-subsume c′.
– The relation ≤ introduces a lattice on the set of all clauses (Plotkin, 1969). This means that any two clauses have a least upper bound (lub) and a greatest lower bound (glb). Both the lub and the glb are unique up to equivalence (renaming of variables) under θ-subsumption. For example, the clauses daughter(X, Y) ← parent(Y, X), parent(W, V) and daughter(X, Y) ← parent(Y, X) θ-subsume one another.
The second property of θ-subsumption leads to the following definition: The least general generalization (lgg) of two clauses c and c′, denoted by lgg(c, c′), is the least upper bound of c and c′ in the θ-subsumption lattice (Plotkin, 1969). The rules for computing the lgg of two clauses are outlined later in this chapter.

Note that θ-subsumption and least general generalization are purely syntactic notions since they do not take into account any background knowledge. Their computation is therefore simple and easy to implement in an ILP system. The same holds for the notion of generality based on θ-subsumption. On the other hand, taking background knowledge into account would lead to the notion of semantic generality (Niblett, 1988; Buntine, 1988), defined as follows: Clause c is at least as general as clause c′ with respect to background theory B if B ∪ {c} |= c′. The syntactic, θ-subsumption based, generality is computationally more feasible. Namely, semantic generality is in general undecidable and does not introduce a lattice on a set of clauses. Because of these problems, syntactic generality is more frequently used in ILP systems.

θ-subsumption is important for inductive logic programming for the following reasons:

– As shown above, it provides a generality ordering for hypotheses, thus structuring the hypothesis space. It can be used to prune large parts of the search space.
– θ-subsumption provides the basis for two important ILP techniques:
  • top-down searching of refinement graphs, and
  • building of least general generalizations from training examples, relative to background knowledge, which can be used to bound the search of refinement graphs from below or as part of a bottom-up search.

These two techniques will be elaborated upon in the following sections.
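Before turning to these techniques, the θ-subsumption test itself can be sketched in a few lines of Prolog. Clauses are represented here as lists of literals (negative literals written with a not/1 wrapper, an illustrative convention): one clause is grounded so that only the other clause's variables can be bound, and then every literal of the more general clause must match a literal of the grounded one.

:- use_module(library(lists)).          % member/2

% theta_subsumes(+C, +D): clause C theta-subsumes clause D.
theta_subsumes(C, D) :-
    \+ \+ ( copy_term(D, D1),
            numbervars(D1, 0, _),       % freeze D's variables
            subsumed_literals(C, D1) ).

subsumed_literals([], _).
subsumed_literals([L|Ls], D) :-
    member(L, D),                       % binds variables of C only: the substitution
    subsumed_literals(Ls, D).

% ?- theta_subsumes([daughter(X,Y), not(parent(Y,X))],
%                   [daughter(mary,ann), not(female(mary)), not(parent(ann,mary))]).
% true.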
6 Searching the Space of Clauses
Most ILP approaches search the hypothesis space (of program clauses) in a top-down manner, from general to specific hypotheses, using a θ-subsumption-based specialization operator. A specialization operator is usually called a refinement operator (Shapiro, 1983). Given a hypothesis language L, a refinement operator ρ maps a clause c to a set of clauses ρ(c) which are specializations (refinements) of c: ρ(c) = {c′ | c′ ∈ L, c < c′}.

A refinement operator typically computes only the set of minimal (most general) specializations of a clause under θ-subsumption. It employs two basic syntactic operations on a clause:

– apply a substitution to the clause, and
– add a literal to the body of the clause.

The hypothesis space of program clauses is a lattice, structured by the θ-subsumption generality ordering. In this lattice, a refinement graph can be defined as a directed, acyclic graph in which nodes are program clauses and arcs
[Figure 2: a refinement graph with daughter(X, Y) ← at the top; among its refinements are daughter(X, Y) ← X = Y, daughter(X, Y) ← parent(X, Z), daughter(X, Y) ← parent(Y, X) and daughter(X, Y) ← female(X), the last of which is further refined to daughter(X, Y) ← female(X), female(Y) and daughter(X, Y) ← female(X), parent(Y, X).]

Fig. 2. Part of the refinement graph for the family relations problem.
correspond to the basic refinement operations: substituting a variable with a term, and adding a literal to the body of a clause. Figure 2 depicts a part of the refinement graph for the family relations problem defined in Table 1, where the task is to learn a definition of the daughter relation in terms of the relations female and parent. At the top of the refinement graph (lattice) is the clause

c = daughter(X, Y) ←

where an empty body is written instead of the body true. The refinement operator ρ generates the refinements of c, which are of the form ρ(c) = {daughter(X, Y) ← L}, where L is one of the following literals:

– literals having as arguments the variables from the head of the clause: X = Y (this corresponds to applying a substitution X/Y), female(X), female(Y), parent(X, X), parent(X, Y), parent(Y, X), and parent(Y, Y), and
– literals that introduce a new distinct variable Z (Z ≠ X and Z ≠ Y) into the clause body: parent(X, Z), parent(Z, X), parent(Y, Z), and parent(Z, Y).

The search for a clause starts at the top of the lattice, with the clause that covers all examples (positive and negative). Its refinements are then considered, then their refinements in turn, and this is repeated until a clause is found which covers only positive examples. In the example above, the clause
daughter(X, Y) ← female(X), parent(Y, X)

is such a clause. Note that this clause can be reached in several ways from the top of the lattice, e.g., by first adding female(X), then parent(Y, X) or vice versa.

The refinement graph is typically searched heuristically level-wise, using heuristics based on the number of positive and negative examples covered by a clause. As the branching factor is very large, greedy search methods are typically applied which only consider a limited number of alternatives at each level. Hill-climbing considers only one alternative at each level, while beam search considers n alternatives, where n is the beam width. Occasionally, complete search is used, e.g., A∗ best-first search or breadth-first search. Often the search can be pruned. For example, since we are only interested in clauses that (together with background knowledge) entail at least one example, we can prune the search if we ever construct a clause which entails no positive examples. This is because if a clause covers no positive examples then neither will any of its refinements.
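A minimal sketch of the refinement operator just described, specialised to the daughter example: a refinement adds one candidate literal to the body (the other basic operation, applying a substitution, is omitted here). The candidate literals are those listed for Fig. 2.

:- use_module(library(lists)).          % member/2

% A clause is represented as clause(Head, BodyLiterals).
refine(clause(daughter(X, Y), Body), clause(daughter(X, Y), [L|Body])) :-
    candidate_literal(X, Y, L),
    \+ (member(M, Body), M == L).       % do not add a literal already present

candidate_literal(X, _, female(X)).
candidate_literal(_, Y, female(Y)).
candidate_literal(X, _, parent(X, X)).
candidate_literal(X, Y, parent(X, Y)).
candidate_literal(X, Y, parent(Y, X)).
candidate_literal(_, Y, parent(Y, Y)).
candidate_literal(X, _, parent(X, _Z)).     % literals introducing a new variable Z
candidate_literal(X, _, parent(_Z, X)).
candidate_literal(_, Y, parent(Y, _Z)).
candidate_literal(_, Y, parent(_Z, Y)).

% ?- refine(clause(daughter(X, Y), []), R).
% R = clause(daughter(X, Y), [female(X)]) ;
% R = clause(daughter(X, Y), [female(Y)]) ; ...   (one refinement per candidate)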
7 Bounding the Search for Clauses
The branching factor of a refinement graph, i.e., the number of refinements a clause has, is very large. This is especially true for clauses deeper in the lattice that contain many variables. It is thus necessary to find ways to reduce the space of clauses actually searched. One approach is to make the refinement graph smaller by making the refinement operator take into account the types of predicate arguments, as well as input/output mode declarations. For example, we might restrict the parent(Y, X) predicate to only give us the parents of a given child and not give us persons that are the offspring of a given person. Also we could restrict parent(Y, X) so that X can only be instantiated with a term of type child. This can be done with a mode declaration parent(−person, +child).

Type and mode declarations can be combined with the construction of a bottom clause that bounds the search of the refinement lattice from below. This is the most specific clause covering a given example (or examples). Only clauses on the path between the top and the bottom clause are considered, significantly improving efficiency. This approach is implemented in the Progol algorithm (Muggleton, 1995). The bottom clause can be constructed as the relative least general generalization of two (or more) examples (Muggleton & Feng, 1990) or the most specific inverse resolvent of an example (Muggleton, 1991), both with respect to a given background knowledge B. These methods are discussed below.
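In Progol-style systems this language bias is written down as mode declarations over typed arguments. The following sketch uses Progol/Aleph-style syntax purely as an illustration of the restriction just described (the exact declarations and type names are assumptions, not taken from any particular system's documentation):

:- modeh(1, daughter(+person, +person)).   % head literal of hypothesised clauses
:- modeb(*, female(+person)).              % body literal over an already-bound person
:- modeb(*, parent(-person, +child)).      % given a child (+), returns a parent (-)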
7.1 Relative Least General Generalization
Plotkin's notion of least general generalization (lgg) (Plotkin, 1969) forms the basis of cautious generalization: the latter assumes that if two clauses c1 and c2 are true, it is very likely that lgg(c1, c2) will also be true. The least general generalization of two clauses c and c′, denoted lgg(c, c′), is the least upper bound of c and c′ in the θ-subsumption lattice. It is the most
specific clause that θ-subsumes c and c′. If a clause d θ-subsumes c and c′, it has to subsume lgg(c, c′) as well. To compute the lgg of two clauses, the lggs of terms and of literals need to be defined first (Plotkin, 1969). The lgg of two terms lgg(t1, t2) is computed as follows:
1. lgg(t, t) = t,
2. lgg(f(s1, .., sn), f(t1, .., tn)) = f(lgg(s1, t1), .., lgg(sn, tn)),
3. lgg(f(s1, .., sm), g(t1, .., tn)) = V, where f ≠ g, and V is a variable which represents lgg(f(s1, .., sm), g(t1, .., tn)),
4. lgg(s, t) = V, where s ≠ t and at least one of s and t is a variable; in this case, V is a variable which represents lgg(s, t).
For example,
lgg([a, b, c], [a, c, d]) = [a, X, Y]
and
lgg(f(a, a), f(b, b)) = f(lgg(a, b), lgg(a, b)) = f(V, V)
where V stands for lgg(a, b). When computing lggs one must be careful to use the same variable for multiple occurrences of the lggs of subterms, i.e., lgg(a, b) in this example. This holds for lggs of terms, atoms and clauses alike. The lgg of two atoms lgg(A1, A2) is computed as follows:
1. lgg(p(s1, .., sn), p(t1, .., tn)) = p(lgg(s1, t1), .., lgg(sn, tn)), if the atoms have the same predicate symbol p,
2. lgg(p(s1, .., sm), q(t1, .., tn)) is undefined if p ≠ q.
The lgg of two literals lgg(L1, L2) is defined as follows:
1. if L1 and L2 are atoms, then lgg(L1, L2) is computed as defined above,
2. if both L1 and L2 are negative literals, L1 = ¬A1 and L2 = ¬A2, then lgg(L1, L2) = lgg(¬A1, ¬A2) = ¬lgg(A1, A2),
3. if L1 is a positive and L2 is a negative literal, or vice versa, lgg(L1, L2) is undefined.
For example,
lgg(parent(ann, mary), parent(ann, tom)) = parent(ann, X)
lgg(parent(ann, mary), ¬parent(ann, tom)) is undefined
lgg(parent(ann, X), daughter(mary, ann)) is undefined
Taking into account that clauses are sets of literals, the lgg of two clauses is defined as follows. Let c1 = {L1, .., Ln} and c2 = {K1, .., Km}. Then lgg(c1, c2) = {Mij = lgg(Li, Kj) | Li ∈ c1, Kj ∈ c2, lgg(Li, Kj) is defined}. If c1 = daughter(mary, ann) ← female(mary), parent(ann, mary) and
c2 = daughter(eve, tom) ← female(eve), parent(tom, eve)
then
lgg(c1, c2) = daughter(X, Y) ← female(X), parent(Y, X)
where X stands for lgg(mary, eve) and Y stands for lgg(ann, tom). In (Cussens & Pulman, 2000), lgg is used as the basis for a bottom-up search for clauses representing grammar rules. The search is implemented in Prolog using the SICStus Prolog library built-in term_subsumer/3 to construct lggs. term_subsumer/3 finds lggs of terms, not clauses, but since the clauses in (Cussens & Pulman, 2000) are always unit (single-literal) clauses we can find lggs by presenting these unit clauses to term_subsumer/3 as if they were terms. Table 10 shows two grammar rules and their lgg. Table 11 gives human-readable decompiled versions of the three rules in Table 10. These unification-based grammar rules are more sophisticated versions of the context-free rule VP → VP MOD where the linguistic categories (i.e. VP and MOD) have features which must match up correctly. See Thompson (this volume) for a comparison of context-free and unification-based grammars. Rule r67 involves, amongst other things, a generalisation of the 'gap-threading' patterns seen in Rules r3 and r24. Gap-threading is a technique for dealing with movement phenomena in syntax (Pereira, 1981). Adding rule r3 to the initial incomplete grammar in (Cussens & Pulman, 2000) allows the sentence All big companies wrote a report quickly. to be parsed, adding rule r24 allows What don't all big companies read with a machine? to be parsed; adding the lgg rule r67 allows both to be parsed.

Table 10. Two grammar rules and their lgg

cmp_synrule( vp([ng,ng],f(0,0,0,0,1,1,1,1,1),n),
             [vp([ng,ng],f(0,0,0,0,1,1,1,1,1),n),
              mod([ng,ng],f(0,0,0,1),f(0,1,1,1))] ).

cmp_synrule( vp([np([ng,ng],f(0,0,0,0,X257,X257,X257,1,1),
                     f(0,1,1,1),nonsubj),ng],f(0,0,1,1,1,1,1,1,1),n),
             [vp([np([ng,ng],f(0,0,0,0,X257,X257,X257,1,1),
                      f(0,1,1,1),nonsubj),ng],f(0,0,1,1,1,1,1,1,1),n),
              mod([ng,ng],f(0,X187,X187,1),f(0,1,1,1))] ).

cmp_synrule( vp([X231,ng],f(0,0,X223,X223,1,1,1,1,1),n),
             [vp([X231,ng],f(0,0,X223,X223,1,1,1,1,1),n),
              mod([ng,ng],f(0,X187,X187,1),f(0,1,1,1))] ).
Table 11. Two grammar rules and their lgg (uncompiled version)

r3 vp ==> [vp,mod]
vp:[gaps=[ng:[],ng:[]],mor=pl,aux=n] ==>
  [vp:[gaps=[ng:[],ng:[]],mor=pl,aux=n],
   mod:[gaps=[ng:[],ng:[]],of=vp,type=n]]

r24 vp ==> [vp,mod]
vp:[gaps=[np:[gaps=[ng:[],ng:[]],mor=or(pl,s3),type=n,case=nonsubj],ng:[]],mor=inf,aux=n] ==>
  [vp:[gaps=[np:[gaps=[ng:[],ng:[]],mor=or(pl,s3),type=n,case=nonsubj],ng:[]],mor=inf,aux=n],
   mod:[gaps=[ng:[],ng:[]],of=or(nom,vp),type=n]]

r67 vp ==> [vp,mod]
vp:[gaps=[X283,ng:[]],mor=or(inf,pl),aux=n] ==>
  [vp:[gaps=[X283,ng:[]],mor=or(inf,pl),aux=n],
   mod:[gaps=[ng:[],ng:[]],of=or(nom,vp),type=n]]
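The lgg of two terms can be computed with a short anti-unification procedure. The following Prolog sketch is our own illustration (it is not the SICStus term_subsumer/3 implementation); the accumulator pair Sub0/Sub records which pairs of subterms have already been mapped to which variable, so that repeated pairs reuse the same variable:

% lgg_term(+S, +T, -G, +Sub0, -Sub)
lgg_term(S, T, S, Sub, Sub) :-
    S == T, !.
lgg_term(S, T, G, Sub0, Sub) :-
    compound(S), compound(T),
    S =.. [F|Ss], T =.. [F|Ts],
    length(Ss, N), length(Ts, N), !,
    lgg_args(Ss, Ts, Gs, Sub0, Sub),
    G =.. [F|Gs].
lgg_term(S, T, V, Sub, Sub) :-                     % this pair was seen before:
    member((S0-T0)-V, Sub), S0 == S, T0 == T, !.   % reuse its variable
lgg_term(S, T, V, Sub, [(S-T)-V|Sub]).             % otherwise introduce a fresh variable

lgg_args([], [], [], Sub, Sub).
lgg_args([S|Ss], [T|Ts], [G|Gs], Sub0, Sub) :-
    lgg_term(S, T, G, Sub0, Sub1),
    lgg_args(Ss, Ts, Gs, Sub1, Sub).

% ?- lgg_term(f(a, a), f(b, b), G, [], _).      gives G = f(V, V), one shared variable
% ?- lgg_term([a, b, c], [a, c, d], G, [], _).  gives G = [a, X, Y]

These queries reproduce the term lggs shown earlier in this section.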
The definition of relative least general generalization (rlgg) is based on the semantic notion of generality. The rlgg of two clauses c1 and c2 is the least general clause which is more general than both c1 and c2 with respect (relative) to background knowledge B. The notion of rlgg was used in the ILP system GOLEM (Muggleton & Feng, 1990). To avoid the problems with the semantic notion of generality, the background knowledge B in GOLEM is restricted to ground facts. If K denotes the conjunction of all these facts, the rlgg of two ground atoms A1 and A2 (positive examples), relative to B can be computed as: rlgg(A1 , A2 ) = lgg((A1 ← K), (A2 ← K)). Given the positive examples e1 = daughter(mary, ann) and e2 = daughter(eve, tom) and the background knowledge B for the family example, the least general generalization of e1 and e2 relative to B is computed as: rlgg(e1 , e2 ) = lgg((e1 ← K), (e2 ← K)) where K denotes the conjunction of the literals parent(ann, mary), parent(ann, tom), parent(tom, eve), parent(tom, ian), f emale(ann), f emale(mary), and f emale(eve). For notational convenience, the following abbreviations are used: d-daughter, p-parent, f-female, a-ann, e-eve, m-mary, t-tom, i-ian. The conjunction of facts from the background knowledge (comma stands for conjunction) is K = p(a, m), p(a, t), p(t, e), p(t, i), f (a), f (m), f (e). The computation of rlgg(e1 , e2 ) = lgg((e1 ← K), (e2 ← K)), produces the following clause d(Vm,e , Va,t ) ←p(a, m), p(a, t), p(t, e), p(t, i), f (a), f (m), f (e), p(a, Vm,t ), p(Va,t , Vm,e ),p(Va,t , Vm,i ), p(Va,t , Vt,e ), p(Va,t , Vt,i ), p(t, Ve,i ), f (Va,m ), f (Va,e ), f (Vm,e ).
In the above clause, Vx,y stands for lgg(x, y), for each x and y. Only three of these literals, namely the head d(Vm,e, Va,t) and the body literals p(Va,t, Vm,e) and f(Vm,e), will remain after literals failing to meet various criteria are removed. Now we will describe these criteria and how they can be used to reduce the size of a large rlgg. In general, an rlgg of training examples can contain infinitely many literals or at least grow exponentially with the number of examples. Since such a clause can be intractably large, constraints are used on introducing new variables into the body of the rlgg. For example, literals in the body that are not connected to the head by a chain of variables are removed. In the above example, this yields the clause d(Vm,e, Va,t) ← p(Va,t, Vm,e), p(Va,t, Vm,i), p(Va,t, Vt,e), p(Va,t, Vt,i), f(Vm,e). Also, nondeterminate literals (that can give more than one value of output arguments for a single value of the input arguments) may be eliminated. The literals p(Va,t, Vm,i), p(Va,t, Vt,e), p(Va,t, Vt,i) are nondeterminate since Va,t is the input argument and a parent can have more than one child. Eliminating these yields the bottom clause d(Vm,e, Va,t) ← p(Va,t, Vm,e), f(Vm,e), i.e., daughter(X, Y) ← female(X), parent(Y, X). In this simple example, the bottom clause is our target clause. In practice, the bottom clause is typically very large, containing hundreds of literals.
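Building on the lgg_term/5 sketch above, the lgg of two clauses (and hence the rlgg of two examples relative to ground background facts, as in the GOLEM setting just described) can be sketched as follows. Clauses are written clause(Head, BodyLiterals); the sketch is our own, assumes ground heads and bodies, and, unlike GOLEM, does not remove duplicate, disconnected or nondeterminate literals:

lgg_clause(clause(H1, B1), clause(H2, B2), clause(H, B)) :-
    lgg_term(H1, H2, H, [], Sub0),
    findall(L1-L2,
            ( member(L1, B1), member(L2, B2),
              functor(L1, F, N), functor(L2, F, N) ),   % compatible literal pairs only
            Pairs),
    lgg_pairs(Pairs, Sub0, B).

lgg_pairs([], _, []).
lgg_pairs([L1-L2|Ps], Sub0, [G|Gs]) :-
    lgg_term(L1, L2, G, Sub0, Sub1),
    lgg_pairs(Ps, Sub1, Gs).

% Calling it with e1 ← K and e2 ← K from the example above, e.g.
% ?- K = [parent(ann,mary), parent(ann,tom), parent(tom,eve), parent(tom,ian),
%         female(ann), female(mary), female(eve)],
%    lgg_clause(clause(daughter(mary,ann), K), clause(daughter(eve,tom), K), C).
% yields a clause whose head is daughter(Vme, Vat) and whose (large) body
% contains, among many other literals, parent(Vat, Vme) and female(Vme).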
7.2 Inverse Resolution
The basic idea of inverse resolution, introduced by Muggleton and Buntine (1988), is to invert the resolution rule of deductive inference (Robinson, 1965), i.e., to invert the SLD-resolution proof procedure for definite programs (Lloyd, 1987). The basic resolution step in propositional logic derives the proposition p ∨ r given the premises p ∨ q and ¬q ∨ r. In the first-order case, resolution is more complicated, involving substitutions. Let res(c, d) denote the resolvent of clauses c and d. To illustrate resolution in first-order logic, we use the grammar example from earlier. Suppose that background knowledge B consists of the clauses b1 = vp(1, 2) and b2 = mod(2, 3) and H = {c} = {vp(S, E) ← vp(S, M), mod(M, E)}. Let T = H ∪ B. Suppose we want to derive the fact vp(1, 3) from T. To this end, we proceed as follows:
– First, the resolvent c1 = res(c, b1) is computed under the substitution θ1 = {S/1, M/2}. This means that the substitution θ1 is first applied to clause c to obtain vp(1, E) ← vp(1, 2), mod(2, E), which is then resolved with b1 as in the propositional case. The resolvent of vp(S, E) ← vp(S, M), mod(M, E) and vp(1, 2) is thus c1 = res(c, b1) = vp(1, E) ← mod(2, E).
– The next resolvent c2 = res(c1, b2) is computed under the substitution θ2 = {E/3}. The clauses vp(1, E) ← mod(2, E) and mod(2, 3) resolve into c2 = res(c1, b2) = vp(1, 3).
The linear derivation tree for this resolution process is given in Figure 3. Inverse resolution, used in the ILP system CIGOL (Muggleton & Buntine, 1988), inverts the resolution process using generalization operators based on inverting substitution (Buntine, 1988). Given a well-formed formula W, an in-
Fig. 3. A linear derivation tree: the clause c = vp(S, E) ← vp(S, M), mod(M, E) is resolved with b1 = vp(1, 2) under θ1 = {S/1, M/2}, giving c1 = vp(1, E) ← mod(2, E); c1 is then resolved with b2 = mod(2, 3) under θ2 = {E/3}, giving c2 = vp(1, 3).
verse substitution θ−1 of a substitution θ is a function that maps terms in Wθ to variables, such that Wθθ−1 = W. Let c = vp(S, E) ← vp(S, M), mod(M, E) and θ = {S/1, M/2, E/3}: then c′ = cθ = vp(1, 3) ← vp(1, 2), mod(2, 3). By applying the inverse substitution θ−1 = {1/S, 2/M, 3/E} to c′, the original clause c is restored: c = c′θ−1 = vp(S, E) ← vp(S, M), mod(M, E). In the general case, inverse substitution is substantially more complex. It involves the places of terms in order to ensure that the variables in the initial W are appropriately restored in Wθθ−1. In fact, each occurrence of a term can be replaced by a different variable in an inverse substitution. We will not treat inverse resolution in detail, but will rather illustrate it by an example. Let ires(c, d) denote the inverse resolvent of clauses c and d. As in the example above, let background knowledge B consist of the two clauses b1 = vp(1, 2) and b2 = mod(2, 3), and let e1 = vp(1, 3) be the example. The inverse resolution process might then proceed as follows:
– In the first step, inverse resolution attempts to find a clause c1 which will, together with b2, entail e1. Using the inverse substitution θ2−1 = {3/E}, an inverse resolution step generates the clause c1 = ires(b2, e1) = vp(1, E) ← mod(2, E).
– Inverse resolution then takes b1 = vp(1, 2) and c1. It computes c = ires(b1, c1), using the inverse substitution θ1−1 = {1/S, 2/M}, yielding c = vp(S, E) ← vp(S, M), mod(M, E).
Fig. 4. An inverse linear derivation tree: from e1 = vp(1, 3) and b2 = mod(2, 3), the inverse substitution θ2−1 = {3/E} gives c1 = vp(1, E) ← mod(2, E); from c1 and b1 = vp(1, 2), the inverse substitution θ1−1 = {1/S, 2/M} gives c = vp(S, E) ← vp(S, M), mod(M, E).
The corresponding inverse linear derivation tree is illustrated in Figure 4. Most specific inverse resolution uses only empty inverse substitutions. In the above example, this would yield the clause vp(1, 3) ← vp(1, 2), mod(2, 3) as the final inverse resolvent. The process of repeatedly applying inverse resolution to an example to get such a most specific clause is known as saturation and was introduced by Rouveirol (1990). In practice, inverse substitutions replacing the occurrences of each constant with the same variable are considered to yield the most specific inverse resolvent, which can then be used as a bottom clause. In our case, this is again the target clause, while in practice the bottom clause will be much more specific/larger, containing a large number of literals.
7.3 Inverse Entailment
Inverse entailment (IE) was introduced in (Muggleton, 1995) and defines the problem of finding clauses from a model-theoretic viewpoint, as opposed to the proof-theoretic definition of inverse resolution. If H is such that B ∧ H |= E then B ∧ ¬E |= ¬H
Let ¬⊥ be the potentially infinite collection of ground literals which are true in every model of B ∧ ¬E. Note that ⊥ is not being used to denote falsity as in standard logical notation. It follows that B ∧ ¬E |= ¬⊥ |= ¬H and
H |= ⊥
It follows that a subset of solutions for H can be found by considering clauses that θ-subsume the bottom clause ⊥. If a clause H does subsume ⊥ we say that H follows from B and E by inverse entailment. IE is implemented (in the Progol algorithm) by saturating E to produce ⊥ and then searching the set of clauses that θ-subsume ⊥ in a top-down manner. Table 12 gives examples of two bottom clauses. Note that in the first example, the bottom clause is not a definite clause; this example shows how the construction of the bottom clause can be used to do abduction.

Table 12. Most specific 'bottom' clauses ⊥ for various B and E

B: vp(X, Y) ← v(X, Y)
   v(X, Y) ← word(ran, X, Y)
E: vp(1, 3)
⊥: vp(1, 3) ∨ v(1, 3) ∨ word(ran, 1, 3)

B: vp(X, Y) ← v(X, Y)
   v(X, Y) ← word(ran, X, Y)
E: is_english(X, Y) ← word(ran, X, Y)
⊥: is_english(X, Y) ← word(ran, X, Y), vp(X, Y), v(X, Y)
Neither inverse resolution nor inverse entailment as described above can construct a target clause from an example if the clause is recursive and is used more than once to deduce the example. This is particularly problematic in LLL applications such as grammar learning. Suppose we had an incomplete grammar lacking the target rule vp(S, E) ← vp(S, M ), mod(M, E) and so were unable to parse the sentence She ran quickly from the telephone. Suppose now that we had used abduction to (correctly) guess that there should be a VP from vertex 1 to 6, i.e. spanning the words ran quickly from the telephone. Suppose also that the rest of the grammar is sufficiently complete to infer that ran is a VP, quickly is a modifier and from the telephone is also a modifier. This gives us the following background B, positive example E and target H: E = vp(1, 6) B = vp(1, 2), mod(2, 3), mod(3, 6) H = vp(S, E) ← vp(S, M ), mod(M, E) Consider now the derivation of E from B and H. This is illustrated in Fig 5 where the last three resolution steps have been squashed into one to conserve
Fig. 5. An abbreviated linear derivation tree: a copy of the target clause, H = vp(S′, E′) ← vp(S′, M′), mod(M′, E′), is resolved with vp(S, E) ← vp(S, M), mod(M, E) under θ1 = {S′/S, E′/M}, giving c1 = vp(S, E) ← vp(S, M′), mod(M′, M), mod(M, E); resolving c1 against b2 = vp(1, 2), mod(2, 3), mod(3, 6) under θ2 = {S/1, M′/2, M/3, E/6} then yields the example vp(1, 6).
space. If we attempted to go up the tree by inverse resolution, then we would fail since we need H to derive H in the final step. Approaching the problem from the IE perspective, we find that ⊥ = vp(1, 6) ← vp(1, 2), mod(2, 3), mod(3, 6) = {vp(1, 6), ¬vp(1, 2), ¬mod(2, 3), ¬mod(3, 6)} and although there is an H = vp(S, E) ← vp(S, M), mod(M, E) such that vp(S, E) ← vp(S, M), mod(M, E) |= vp(1, 6) ← vp(1, 2), mod(2, 3), mod(3, 6), H does not subsume ⊥. This is because there is no substitution θ such that {vp(S, E), ¬vp(S, M), ¬mod(M, E)}θ ⊆ {vp(1, 6), ¬vp(1, 2), ¬mod(2, 3), ¬mod(3, 6)}. Muggleton (1998) works around this problem by defining enlarged bottom clauses, which contain all possible positive literals except those that follow from B ∧ E. However, the enlarged bottom clauses are so large as to effectively not restrict the search for clauses.
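The θ-subsumption test used here (is there a θ with Hθ ⊆ ⊥?) has a standard short Prolog rendering. The following is our own sketch, with clauses written as lists of literals and negative body literals wrapped in not/1:

subsumes_clause(C, D) :-
    copy_term(C, C1),
    copy_term(D, D1),
    numbervars(D1, 0, _),     % ground D so that unification acts as one-way matching
    subset_unify(C1, D1).

subset_unify([], _).
subset_unify([L|Ls], D) :-
    member(L, D),
    subset_unify(Ls, D).

% The failure discussed above can be reproduced:
% ?- subsumes_clause([vp(S,E), not(vp(S,M)), not(mod(M,E))],
%                    [vp(1,6), not(vp(1,2)), not(mod(2,3)), not(mod(3,6))]).
% false.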
8 Volume Overview
Grammars. Cussens & Pulman and also Osborne assume the existence of a given initial grammar that must be completed to parse sentences in a training
set of unparsable sentences. In both cases, candidate rules are generated from an unparsable sentence via the chart created by the failed parse. See Thompson (this volume) for a description of chart parsing. From an ILP point of view this chart is that part of the background knowledge relevant to the example sentence that produced it. In Osborne’s paper an stochastic context-free grammar is defined on a feature grammar via a context-free (backbone) grammar formed by mapping each category, distinct in terms of features, to an atomic symbol. Riezler takes a more direct approach, defining a log-linear model on the parses of constraintbased grammars—the goal here is disambiguation rather than finding ‘missing’ grammar rules. Riezler describes how to learn both features (structure) and parameters of such models from unannotated data. Note that the features of the model are not necessarily grammar rules. Watkinson and Manandhar’s paper shares a theme with that of Adriaans earlier work (Adriaans, 1999) in that both use the functional nature of Categorial Grammar to enable them to build up the structure of the sentences. However, both the learning settings and the results of the learning processes are different. While Watkinson and Manandhar use a stochastic compression technique to learn in a completely unsupervised setting, Adriaans’s learner, working within the PAC learning model, uses an oracle providing a strongly supervised setting. Secondly, Adriaans uses the partial categories that he builds to develop a context-free grammar, whereas Watkinson and Manandhar build both a parsed corpus and a stochastic Categorial Grammar lexicon (which can be used as a probabilistic Categorial Grammar). Adriaans and de Haas’s approach (this volume) is the application of the same techniques as in (Adriaans, 1999) to learning of simple Lambek grammars. Semantics. Going against the grain of work in statistical computational linguistics, a number of contributions focus on semantics. Indeed, Mooney’s paper argues that logical approaches to language learning have “most to offer at the level of producing semantic interpretations of complete sentences” since logical representations of meaning are “important and useful”. Mooney backs this up with an overview of work on the Chill system for learning semantic parsers. Chill is also examined in the paper from Thompson & Califf, who demonstrate (empirically) the effectiveness of active learning where a learning algorithm selects the most informative inputs for a human to annotate. This theme of “how best to marry machine learning with human labor” is central to Brill’s paper arguing that we should be “capitalizing on the relative strengths of people and machines”. Bostr¨om reports on learning transfer rules which translate between semantic representations (quasi-logical forms) for different languages; in this case French and English. Nedellec continues the semantic theme with the ILP system Asium, which learns ontologies and verb subcategorization frames from a parsed corpora.
Information extraction. As well as Chill, Thompson & Califf also describe and give experimental results for active learning with the Rapier system. Rapier is a system that learns information extraction (IE) rules using limited syntactic and semantic information. Junker et al also learn IE rules, examining how standard ILP approaches can be used. They also learn rules for text categorisation. PoS tagging, morphological analysis and phonotactics. Part-of-speech tagging is a popular test-bed for the application of ILP to NLP. An overview of this area is provided by the paper from Eineborg and Lindberg. ILP approaches to tagging often use the tags of neighbouring words to disambiguate the focus word—however, typically, these neighbouring tags are (initially) unknown. Jorge & de Andrade Lopes tackle this problem by learning a succession of theories, starting with a base theory which only has lexical rules and where a theory makes use of the disambiguation achieved by previous theories. Kazakov gives an overview of ILP approaches to morphology and argues that future work should make more extensive use of linguistic knowledge and be more closely integrated with other learning approaches. Dˇzeroski & Erjavec use Clog to learn rules which assign the correct lemma (canonical form) to a word given the word and its tag. A statistical tagger is used to find the tags. Tjong Kim Sang & Nerbonne examine learning phonotactics to distinguish Dutch words from non-words, finding that the inclusion of background knowledge improves performance.
9 Some Challenges in Language Learning for ILP
The main advantages of ILP from an NLP perspective are that 1) the rules learnt by an ILP system are easily understandable by a linguist, 2) ILP allows easy integration of linguistic background knowledge such as language universal/specific syntactic/morphological/phonological principles, and 3) ILP offers a first-order representation both for the hypothesis language and the background language. Each of these advantages is either unavailable or hard to integrate within a purely statistical learning system. On the other hand, ILP systems suffer from certain disadvantages that make them less competitive compared to well engineered statistical systems. Although first-order representations are learnt, a disadvantage of ILP systems is efficiency and its effect on rule accuracy. One reason it is difficult to solve the efficiency issue is that most ILP implementations are general-purpose learning systems that will accept any background knowledge and any training data set. In contrast, most statistical systems are engineered for a specific task. It follows that ILP systems designed for a specific NLP task are likely to be more successful than general-purpose implementations. ILP systems are ideally suited to language learning tasks which involve learning complex representations. By "complex representation" we mean representations which are not purely attribute-value. For instance, grammars such as
Head-Driven Phrase Structure Grammar (HPSG) (Pollard & Sag, 1987, 1994) employ a representation language called typed feature structures (TFS) (Carpenter, 1992), which is similar to a description logic (by "description logics" we mean knowledge representation languages of the KL-ONE family such as KL-ONE (Brachman & Schmolze, 1985), CLASSIC (Bordiga, Brachman, McGuiness, & Resnick, 1989), and LOOM (McGregor, 1988)). HPSG comes with a background linguistic theory expressed in the TFS language as schemas that apply to (a subset of) all languages. TFSs can be viewed as terms in a constraint language (Smolka, 1992). To enable ILP systems to learn HPSG grammars it would be necessary to extend refinement operators to allow learning of TFS hierarchies and learning of TFS logic programs. Similarly, learning of Lambek grammars involves extending refinement operators to the Lambek calculi. Adriaans and de Haas (this volume) can be viewed as an initial effort in this direction. An additional challenge for ILP is unsupervised learning or, in the NLP context, learning from unannotated data. Currently, most ILP systems require annotated training data. However, for tasks such as learning of lexicalist grammars such as HPSG or categorial grammar (CG), such annotated data (in the form of parse trees annotated with HPSG/CG structures) is generally unavailable. Thus, the question of whether it is possible to learn an HPSG/CG lexicon from the HPSG/CG background linguistic theory and a set of grammatical sentences needs investigation. Initial work on unsupervised learning of a CG lexicon is reported in Watkinson and Manandhar (this volume). For NLP learning tasks where totally unsupervised learning is difficult or infeasible, the best strategy might be to aid the learning process either by 1) providing a small amount of annotated data, or 2) starting with a partial but correct theory, or 3) using a limited amount of active learning. Combining human input with machine learning may be unavoidable both to shorten the learning time and to make it feasible; see, for instance, Brill (this volume). Use of active learning for information extraction is explored in Thompson and Califf (this volume). Active learning is also employed for learning verb subcategorisation frames in Nedellec (this volume). However, each of these needs scaling up in order to be usable in large-scale NLP applications. NLP querying systems are likely to become very important in the near future for language-based querying of web-based information resources. Thus, the application of machine learning techniques for semantic parsing is essential. (Zelle & Mooney, 1996) and Thompson and Califf (this volume) demonstrate the use of ILP techniques for learning semantic parsers. For these tasks, the training examples consist of sentence-meaning pairs. The task of the learner is to learn parse control rules (for a shift-reduce parser) and appropriate semantic composition rules. However, further work is essential to scale up these existing approaches. Concept hierarchies such as WordNet (Fellbaum, 1998) are useful in information extraction tasks. As an information extraction system is deployed it will need to be updated as new domain-specific terminology comes into use. Automatic methods for the acquisition of domain-specific terminology and its automatic integration into a concept hierarchy are necessary, since it will not be possible to
collect training data and retrain an IE system once it is deployed. Nedellec (this volume) shows that ILP techniques can be usefully employed for such tasks.

Acknowledgements

Thanks to Stephen Pulman for input to the introductory section and to Stephen Muggleton for help on inverse entailment. Saˇso Dˇzeroski is supported by the Slovenian Ministry of Science and Technology.
References 1. Adriaans, P. (1999). Learning Shallow Context-Free languages under simple distributions. CSLI-publications, University of Stanford. 2. Bordiga, A., Brachman, R., McGuiness, D., & Resnick, L. (1989). Classic: A structural data model for objects. In 1989 ACM SIGMOD International Conference on Management of Data, pp. 59–67. 3. Bostr¨om, H. (1998). Predicate invention and learning from positive examples only. In Proc. of the Tenth European Conference on Machine Learning, pp. 226–237. Springer Verlag. 4. Brachman, R. J., & Schmolze, J. G. (1985). An overview of the kl-one knowledge representation system. Cognitive Science, 9 (2), 171–216. 5. Brill, E. (1995). Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, 246–253. 6. Buntine, W. (1988). Generalized subsumption and its applications to induction and redundancy. Artificial Intelligence, 36 (2), 149–176. 7. Carpenter, B. (1992). The Logic of Typed Feature Structures. Cambridge University Press. 8. Cussens, J., & Pulman, S. (2000). Incorporating linguistics constraints into inductive logic programming. In Proc. of CoNLL-2000 and LLL-2000 Lisbon. Omni Press. To appear. 9. De Raedt, L., & Dˇzeroski, S. (1994). First order jk-clausal theories are PAClearnable. Artificial Intelligence, 70, 375–392. 10. Dehaspe, L., & Forrier, M. (1999). Transformation-based learning meets frequent pattern discovery. In Language Logic and Learning Workshop Bled, Slovenia. 11. Fellbaum, C. (1998). WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA. 12. Flach, P. (1992). Logical approaches to machine learning - an overview. THINK, 1 (2), 25–36. 13. Flach, P. A., & Kakas, A. C. (Eds.). (2000). Abduction and Induction: Essays on their Relation and Integration, Vol. 18 of Applied Logic Series. Kluwer, Dordrecht. 14. Hogger, C. (1990). Essentials of Logic Pogramming. Clarendon Press, Oxford. 15. Lloyd, J. (1987). Foundations of Logic Programming (2nd edition). Springer, Berlin.
16. Manandhar, S., Dˇzeroski, S., & Erjavec, T. (1998). Learning multilingual morphology with CLOG. In Page, D. (Ed.), Inductive Logic Programming; 8th International Workshop ILP-98, Proceedings, No. 1446 in Lecture Notes in Artificial Intelligence, pp. 135–144. Springer. 17. McGregor, R. (1988). A deductive pattern matcher. In Proceedings of the Seventh National Conference on Artificial Intelligence (AAAI88), pp. 403– 408 Menlo Park, CA. 18. Mooney, R. J., & Califf, M. E. (1995). Induction of first–order decision lists: Results on learning the past tense of English verbs. Journal of Artificial Intelligence Research, 3, 1–24. 19. Muggleton, S. (1995). Inverse entailment and Progol. New Generation Computing Journal, 13, 245–286. 20. Muggleton, S. (2000). Learning from positive data. Machine Learning. Accepted subject to revision. 21. Muggleton, S. (1991). Inductive logic programming. New Generation Computing, 8 (4), 295–318. 22. Muggleton, S., & Bryant, C. (2000). Theory completion using inverse entailment. In Proc. of the 10th International Workshop on Inductive Logic Programming (ILP-00) Berlin. Springer-Verlag. In press. 23. Muggleton, S., & Buntine, W. (1988). Machine invention of first-order predicates by inverting resolution. In Proc. Fifth International Conference on Machine Learning, pp. 339–352 San Mateo, CA. Morgan Kaufmann. 24. Muggleton, S., & Feng, C. (1990). Efficient induction of logic programs. In Proc. First Conference on Algorithmic Learning Theory, pp. 368–381 Tokyo. Ohmsha. 25. Niblett, T. (1988). A study of generalisation in logic programs. In Proc. Third European Working Session on Learning, pp. 131–138 London. Pitman. 26. Pereira, F. (1981). Extraposition grammars. Computational Linguistics, 7, 243–256. 27. Pierce, C. (1958). Collected Papers of Charles Sanders Pierce. Harvard University Press. Edited by C. Hartsthorne, P. Weiss and A. Burks. 28. Plotkin, G. (1969). A note on inductive generalization. In Meltzer, B., & Michie, D. (Eds.), Machine Intelligence 5, pp. 153–163 Edinburgh. Edinburgh University Press. 29. Pollard, C., & Sag, I. A. (1987). Information-Based Syntax and Semantics: Volume 1 Fundamentals, Vol. 13 of Lecture Notes. Center for the Study of Language and Information, Stanford, CA. 30. Pollard, C., & Sag, I. A. (1994). Head-driven Phrase Structure Grammar. Chicago: University of Chicago Press and Stanford: CSLI Publications. 31. Quinlan, J. R. (1996). Learning first-order definitions of functions. Journal of Artificial Intelligence Research, 5, 139–161. 32. Quinlan, J. (1990). Learning logical definitions from relations. Machine Learning, 5 (3), 239–266. 33. Robinson, J. (1965). A machine-oriented logic based on the resolution principle. Journal of the ACM, 12 (1), 23–41.
34. Rouveirol, C. (1990). Saturation: Postponing choices when inverting resolution. In Proceedings of the Ninth European Conference on Artificial Intelligence. Pitman. 35. Shapiro, E. (1983). Algorithmic Program Debugging. MIT Press, Cambridge, MA. 36. Smolka, G. (1992). Feature constraint logics for unification grammars. Journal of Logic Programming, 12, 51–87. 37. Zelle, J. M., & Mooney, R. J. (1996). Learning to parse database queries using inductive logic programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence Portland, OR.
A Brief Introduction to Natural Language Processing for Non-linguists

Cynthia A. Thompson

CSLI, Ventura Hall, Stanford University, Stanford, CA 94305, USA
[email protected]
Abstract. This chapter introduces the field of natural language processing to the computer scientist or logician outside of the field. We first discuss some concepts from the field of linguistics, which focuses on the language half of the NLP equation, then move on to some of the common computational methods used to process and understand language. No previous knowledge of NLP is assumed.
1 Introduction
Natural Language Processing (NLP) is the attempt to use computers to analyze and understand human (as opposed to computer) languages. Applications of this technology include translation, text retrieval, categorization and summarization, information extraction, and dialogue systems. The field is challenging for many reasons. Computers do not have the tremendous amount of world and contextual knowledge that humans bring to bear when they read a text or participate in a conversation. Also, while we usually do not notice this, language is fraught with ambiguity. Finally, words can be combined into sentences in an infinite variety of ways, making it impossible to simply list the possible uses, contexts, and meanings of a word or phrase (for which new meanings are also constantly being invented). This chapter introduces the field of natural language processing to the computer scientist or logician outside of the field. NLP is itself a sub-field of the broader field of computational linguistics, which also includes work in language theory. We first discuss some concepts from the field of linguistics, which focuses on the language half of the NLP equation, then move on to some of the common computational methods used to process and understand language. We will only briefly touch on spoken language recognition and understanding, but focus instead on written language. Finally, the process of generating language with a computer, or natural language generation, is a topic that is beyond the scope of this article.
2
Linguistic Concepts
The analysis of language is typically divided into four parts: morphology, syntax, semantics, and pragmatics. This section discusses each in turn. A more complete overview of the field can be found in Crystal (1987), and there are many introductory linguistic texts and books of survey papers, for example that by Newmeyer (1988). Morphology, Phonology, and Words. Morphology is the study of how words can be broken into meaningful pieces, while phonology studies the units of sounds possible in a language, and how they combine with one another. For example, morphological knowledge includes knowledge of suffixes and prefixes. This gives us information such as that the meaning of the word unbelievable can be derived from the prefix un, the verb believe, and the suffix able. We often refer to words as lexical items, and a lexicon is a data structure used to keep track of information needed to process the words in a sentence. Most theories of language assign features to words, such as whether they are singular or plural, transitive or intransitive, or first or third person. These features help detect inconsistencies such as the combination of a singular noun with a plural verb, as in “The dog eat.” Syntax. Syntax is the study of the legal structures of a language. Linguists group words that behave in similar syntactic ways into categories. These categories are called syntactic categories, or also parts of speech (POS). Categories include noun, verb, and adjective. Knowledge of syntax includes the rules for combining these categories into phrases and the structural roles that these phrases can play in a sentence. Syntax tells us that (1) is an allowable sentence while (2) is not (Linguistics typically use an “*” to indicate a linguistically questionable sentence). The old man sat on the bench. (1) ∗Man the bench the on sat old.
(2)
There are several major phrase types that combine together in various ways to form the phrase structure for the entire sentence. The phrase structure for the sentence "The woman took the book to the park." is displayed in Figure 1. Here we have used the standard POS labels (also called tags) from the Brown corpus, a free corpus used in much computational linguistics work. The tags used here are sentence (S), noun phrase (NP), verb phrase (VP), past tense verb (VBD), prepositional phrase (PP), and preposition (IN). The most common type of sentence is assigned the category "S," and is divided into a noun phrase and verb phrase. The rules for combining phrases into larger structures are captured by a grammar. The most common type of grammar is a context-free, or phrase structure, grammar. These grammars consist of:
– a set of terminal symbols (the lexical items and punctuation),
Fig. 1. A Sentence's Phrase Structure: [S [NP The woman] [VP [VBD took] [NP the book] [PP [IN to] [NP the park]]]]
– a set of non-terminal symbols, such as the parts of speech,
– a start symbol, and
– a set of rewrite rules, each with a non-terminal on the left hand side and zero or more terminal or non-terminals on the right.
A simple example is shown in Table 1.

Table 1. Sample context-free grammar and lexicon

Grammar:
S → NP VP
NP → Det NP2
NP → NP2
NP2 → Noun
NP2 → NP2 PP
PP → Prep NP2
VP → Verb
VP → Verb NP
VP → VP PP

Lexicon:
Det → a
Det → the
Noun → dog
Verb → bit
Noun → saw
Verb → saw
Prep → with
Noun → woman
Verb → eats
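Grammars in this style map directly onto Prolog definite clause grammars (DCGs). The following sketch, our own illustration rather than part of the original chapter, encodes Table 1; note that the two left-recursive rules are fine as a declarative description but will loop if Prolog is asked to enumerate further parses top-down:

s   --> np, vp.
np  --> det, np2.
np  --> np2.
np2 --> noun.
np2 --> np2, pp.        % left-recursive
pp  --> prep, np2.
vp  --> verb.
vp  --> verb, np.
vp  --> vp, pp.         % left-recursive

det  --> [a].      det  --> [the].
noun --> [dog].    noun --> [saw].    noun --> [woman].
verb --> [bit].    verb --> [saw].    verb --> [eats].
prep --> [with].

% ?- phrase(s, [the, dog, eats]).     % succeeds (first solution)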
Context-free grammars (CFGs) and their variations are widely used, but are not without their problems. First, it is not clear that the syntax of English or any other natural language can be fit into the context-free formalism. Also, handling so called long-distance dependencies can complicate the grammar tremendously. An example of such a dependency is in “Whom did Jane take the book from?” where “Whom” serves as the noun phrase in the (implied) prepositional phrase “from whom.” This is a dependency because the prepositional phrase that is missing its noun phrase depends on another noun phrase in the sentence, and it is long distance because the intervening structure can be arbitrarily large.
Common alternatives to CFGs include Head-Driven Phrase Structure Grammar, Categorial Grammar, (lexicalized) Tree Adjoining Grammar, and Lexical Functional Grammar. Many of these incorporate features and unification. Unification-based grammars take into account the properties (such as number and subcategorization, discussed below) associated with grammatical categories, and attempt to ensure that these properties are consistent where needed across the sentence. Many of these alternative formalisms also incorporate semantics, discussed in the next section. Another powerful enhancement to context-free grammars is the addition of probabilities to each grammar rule, forming probabilistic context-free grammars (PCFGs). PCFGs are useful for several reasons. First, they can derive more than one analysis for an ambiguous sentence, such as “Time flies like an arrow.” This sentence, while it has its conventional idiomatic meaning, could be a request to “time flies” as an arrow would do so. While a person might never derive the latter meaning, a straight CFG could. Second, PCFGs allow one to learn grammars from an annotated corpus, a task that would also require examples of non-sentences if one were to try to learn CFGs alone. Such negative examples are not very common aside from the starred sentences in linguistic text books! Finally, PCFGs can allow us to parse sentences that ordinarily would not be considered as grammatical, by assigning small probabilities to unlikely but still possible rules. Besides grammars, an important syntactic notion is that of dependency. For example, in the sentence “Joe saw the book in the store.” Joe and the book are dependents of a seeing event. They are the arguments of the verb see. The prepositional phrase “in the store” is a dependent of book, and modifies book. When we talk about arguments, we are usually referring to noun phrases as the arguments of verbs. These arguments can be classified by the semantic roles they play. For example, the agent of an action is the person or thing that is doing something, and the patient is the person or thing that is having something done to it. Another way to classify the arguments is via their syntactic relations such as subject and object. Different verbs can take different numbers of arguments, which just means that they may differ in the number of entities that they may describe relationships about. For example, we cannot say “She brought.” without giving an object of the bringing. Verb arguments fall into two categories, the subject (the noun phrase that appears before the verb), and all non-subject arguments, referred to as complements. Verbs can be divided into classes depending on what types of complements they allow. This classification is called subcategorization. We say that a verb subcategorizes for a particular complement. For example, bring subcategorizes for an object. Semantics. Semantics is the study of the meaning of a unit of language, whether it be a word, sentence, or entire discourse. We can divide the study into that of studying the meanings of individual words, and that of how word meanings
combine into sentence meanings (or even larger units). Sentence meanings are studied independently of the broader context in which that sentence is used. Semantics tells us that (3) is an allowable sentence while (4) is not, even though the latter is syntactically correct. Who took the blue book?
(3)
∗The blue book spoke about its small feet.
(4)
This kind of distinction is captured by selectional restrictions, which describe the semantic regularities associated with the possible complements of a verb. In this example, the verb speak prefers people, not animals or books (much less books with feet), as the subject. Most work in semantics involves a logical representation language. The knowledge of the meaning of a sentence can be equated in this way with knowledge of its truth conditions, or what the world would have to be like if the sentence were true. Pragmatics. Pragmatics is the study of appropriateness, necessity, and sufficiency in language use. It also explores how sentences relate to one another. Pragmatics knowledge includes information about sentences that might have meaning in the present context, but are inappropriate or unnecessary because of the inferences that a rational person could apply.
3 Language Processing
Now we move on to the computational processes that are used at various stages of language processing and understanding. An important distinction in the field is that between those who write programs that are engineered entirely by hand, and which use primarily symbolic techniques, and those who use statistical and machine learning techniques to help engineer their NLP programs. The latter is typically dubbed statistical NLP, or statistical language learning. There is also a growing movement to combine the two fields, as evidenced by many of the chapters in this volume. For an introduction to natural language understanding in general, Allen (1995) is an excellent reference. Statistical NLP is covered in Charniak (1993) and Manning and Sch¨ utze (1999), with a set of recent articles available in Cardie and Mooney (1999). Before we discuss methods specific to a given level of language processing, several ubiquitous techniques bear mention. One of these is the use of finite state models, including finite state machines, Markov chains, and Hidden Markov Models. These can all be thought of as acceptors or generators for strings of words or other symbols. They each contain states (including a start state), sets of possible events (such as the output of a symbol), and rules for going from one state to the next. A Markov chain models sequences of events where the probability of an event occurring depends upon the fact that a preceding event
occurred. Thus, each transition is associated with the probability of taking that arc, given the current state. Hidden Markov models (HMMs) have the additional complication that one may not be able to observe the states that the model passes through, but only has indirect evidence about these states. More formally, an HMM is a five-tuple (Π, S, W, T, O), where Π is the start state probability distribution, S is the set of states, W is the set of possible observations, T are transition probabilities between the states, and O are the observation probabilities. HMMs are useful in situations where underlying event probabilities generate surface events. For example, in tagging, discussed next, one can think of underlying chains of possible parts of speech from which actual legal sentences are generated with some probability. There are three questions that are asked about an HMM. First, given a model, how do we (efficiently) compute how likely a certain observation sequence is? Second, given the observation sequence and and a model, how do we choose the state sequence that best explains the observations? Third, given an observation sequence and a space of possible models, how do we find the model that best explains the observed data? Methods for answering these questions, especially the last of learning the probabilities associated with a HMM, have become widespread in the statistical NLP literature. See Rabiner (1989) for a good introduction to HMMs. Part of Speech Tagging. Part-of-speech (POS) tagging is the process of labeling each word with its part of speech. Tagging is often the first step in the analysis of a sentence, and can assist with later stages of understanding. Three main techniques are used for POS tagging: rule-based tagging, stochastic tagging, and transformation-based tagging. Rule-based techniques use a dictionary or other source to find possible parts of speech, and disambiguation rules to eliminate illegal tag combinations. Stochastic techniques use variants on Hidden Markov Models, based on training from a tagged corpus, to pick the most likely tag for each word. Transformation-based tagging (Brill, 1993) is a technique from machine learning that learns rules for transforming a sentence into its appropriate tags. Bracketing. Bracketing is the process of dividing up a sentence into its highlevel syntactic constituents. For example, for the sentence in Figure 1, we have the partial labeled bracketing: [s[np The woman] [vp[vbd took][np the book][pp[in to][np the library]]]] A preliminary bracketing such as this can be useful in guiding a more detailed parse. It can also be used to quickly extract basic information about the structure of a sentence in scenarios where speed of processing is an issue, such as in dialogue systems. Since most systems perform detailed bracketing that is more analogous to a full parse, the process of high-level bracketing will not be further discussed here.
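As a concrete illustration of this notation (our own, with invented predicate names), a parse tree can be stored as a Prolog term and flattened into a labeled bracketing:

% The Figure 1 parse as a term; leaves are word lists.
tree(s(np([the, woman]),
       vp(vbd([took]),
          np([the, book]),
          pp(in([to]), np([the, park]))))).

bracket(Leaf) :-                       % print the words of a leaf
    is_list(Leaf), !,
    atomic_list_concat(Leaf, ' ', Words),
    write(Words).
bracket(Node) :-                       % print [label child1 child2 ...]
    Node =.. [Label|Children],
    format('[~w ', [Label]),
    maplist(bracket, Children),
    write(']').

% ?- tree(T), bracket(T), nl.
% [s [np the woman][vp [vbd took][np the book][pp [in to][np the park]]]]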
Parsing. Parsing is a more detailed version of bracketing, basically combining the tagging and bracketing steps by labeling the brackets to the lowest level, but the resulting parse is commonly displayed as a tree as in Figure 1. There are many parsing methods in use; we cover only the most common here. A parsing algorithm (parser) is a procedure for searching through the possible ways of combining grammatical rules to find one (or more) structure(s) that matches a given sentence’s structure. The search for the best parse is constrained both by the grammar and the sentence at hand. For context free grammars, there are three dimensions along which parsers vary. First, alternative parses can be examined in parallel or sequential fashion. Second, the parser can work in a top-down or bottom-up fashion. Top-down procedures start at the highest structural level (usually sentences), look for rules that make up that structure, and work their way down to the level of the individual words. Bottom-up procedures start with the words and look for rules matching these words, combining them into higher and higher level constituents. Along the third dimension, parsers take different strategies for deciding which pieces of the parse to analyze first. Typical strategies include moving through the input in a set direction, analyzing chunks of increasing size, or combining these methods. One of the most common algorithms for analyzing sentences using contextfree grammars is the chart parser. A chart parser contains three structures: a key list (also called the agenda), a chart, and a set of edges. The chart stores the pieces of the parse as it is constructed. Each chart entry contains the name of a terminal or non-terminal symbol in the grammar, the location in the sentence at which that entry begins, and the length of the entry. The key list is a FIFO stack of chart entries that are candidates for entry into the chart. Figure 2 shows a schematic of an empty chart (on the left) and the key list (on the right) for the sentence “The dog eats.” The horizontal labels indicate the starting position of a chart entry, and the vertical labels indicate the constituent length. Thus, when “eats, 3, 1” is processed, it is placed in the third column of the first row.
Fig. 2. The chart and key list before parsing begins: the chart is empty (rows index constituent length 1–3, columns index start position 1–3); the key list contains the, 1, 1; dog, 2, 1; eats, 3, 1
The edges keep track of the grammar rules that can be applied to current chart entries to combine them into larger entries. An edge e contains:
– the rule that may be applied (rule(e)),
– the sentence position where the first constituent of the right-hand side of the rule was located (start(e)),
– the position where the first uncompleted right-hand constituent must start (end(e)), and
– an indicator (denoted by ◦) in the right-hand side of which constituents have been completed.
An algorithm for chart parsing is given in Table 2.

Table 2. A Chart Parsing Algorithm

Add each word in the sentence to the key list, working from back to front.
While entries in key list do
  Pop a key list entry, c.
  If c is not in the chart, then
    Add c to the chart.
    For all rules that have c's type as the first constituent of their right hand side, add an edge, e, for that rule to the chart, where start(e) = start(c), end(e) = start(c) + len(c), and ◦ is placed after the first constituent.
    For all edges e that have c's type as the constituent following the ◦ indicator, call create extended(e, c).
      If the extended edge is completed, add an entry to the key list with the appropriate information.

Procedure create extended(edge e, key list entry c)
  Create a new edge e′.
  Set start(e′) = start(e).
  Set end(e′) = start(c) + len(c).
  Set rule(e′) = rule(e) with ◦ moved beyond c.
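As a complement to Table 2, the following Prolog sketch recognises sentences over the Table 1 grammar with a chart of entry(Category, From, Length) facts. It is our own simplification: instead of a key list and edges it simply closes the chart under the rules until nothing new can be added, so it illustrates the chart idea rather than the key-list/edge algorithm above.

rule(s,   [np, vp]).
rule(np,  [det, np2]).
rule(np,  [np2]).
rule(np2, [noun]).
rule(np2, [np2, pp]).
rule(pp,  [prep, np2]).
rule(vp,  [verb]).
rule(vp,  [verb, np]).
rule(vp,  [vp, pp]).

word(the, det).  word(a, det).
word(dog, noun). word(woman, noun). word(saw, noun).
word(bit, verb). word(saw, verb).   word(eats, verb).
word(with, prep).

% parse(+Words): succeed if Words forms a sentence (category s).
parse(Words) :-
    length(Words, N),
    findall(entry(C, I, 1), (nth1(I, Words, W), word(W, C)), Lexical),
    close_chart(Lexical, Chart),
    member(entry(s, 1, N), Chart).

% close_chart(+Chart0, -Chart): add new entries until a fixpoint is reached.
close_chart(Chart0, Chart) :-
    rule(C, Rhs),
    matches(Rhs, I, Len, Chart0),
    \+ member(entry(C, I, Len), Chart0), !,
    close_chart([entry(C, I, Len)|Chart0], Chart).
close_chart(Chart, Chart).

% matches(+Rhs, -From, -Len, +Chart): the categories of Rhs occur
% contiguously in the chart starting at From and covering Len positions.
matches([], _, 0, _).
matches([C|Cs], I, Len, Chart) :-
    member(entry(C, I, L1), Chart),
    J is I + L1,
    matches(Cs, J, L2, Chart),
    Len is L1 + L2.

% ?- parse([the, dog, eats]).   % true
% ?- parse([dog, the, eats]).   % false

Like the simplified algorithm in the text, this sketch only reports success or failure; producing the syntactic structure itself would need the bookkeeping extensions mentioned later in this section.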
As an example, let us give an overview of the parse of “The dog eats.” using the grammar from Table 1. Figure 3 shows the chart (with some empty portions omitted) after processing “the.” It was first removed from the key list and added to the chart. Next, an edge was added for all rules that could start with “the.” Here we just have the rule “det → the,” which we have added to the bottom of the chart. We thus pictorially indicate start(e) by the tail of the arc and end(e) by the head of the arc. Since “the” is the last constituent of the det, the ◦ goes at the end of the right-hand side, indicating that it is completed, so we push it onto the key list. We now start processing the det entry. We omit the figure in this case. It would include a new edge in the chart: “np → det ◦ noun,” starting at 1 and
Fig. 3. The chart and key list after processing "the": the chart contains the, 1, 1 and the edge det → the ◦; the key list contains det, 1, 1; dog, 2, 1; eats, 3, 1
ending at 2. It would also include the chart entry “det, 1, 1.” Next, “dog” is taken from the key list and inserted into the chart, creating a noun entry for the key list. In the next step, when we process the noun, we have an example of creating an extended edge. The resulting edge is “np → det noun ◦,” starting at 1 and ending at 3. This finishes the edge, and a new constituent, np, is added to the key list. We show the chart after processing np in Figure 4. The parse is successful if an S constituent is completed that covers the entire sentence.
Fig. 4. The chart and key list right before processing "eats": the chart contains the, 1, 1; dog, 2, 1; det, 1, 1; noun, 2, 1; np, 1, 2, together with the edges det → the ◦, noun → dog ◦, np → det ◦ noun and np → det noun ◦; the key list contains eats, 3, 1
This is a simplified version of the chart parsers typically used, and does not include a mechanism for actually producing a final syntactic structure for a given sentence. It just indicates success if there exists such a structure. We will not go into the details of maintaining such a structure during the parse, but the process involves some straightforward bookkeeping extensions to the basic chart parser. A second technique for parsing is the augmented transition network. A transition network consists of nodes (one of which is the start state) and labeled arcs between the states. An example network for a noun phrase consisting of a determiner, followed by zero or more adjectives, followed by a noun, is shown in Figure 5. Starting in the first node, one can traverse an arc if the current word in the sentence is in the arc’s category. Traversing an arc allows us to update
Fig. 5. A Transition Network: from the start node, a DET arc leads to a node with an Adj self-loop, followed by a Noun arc and then a Pop arc
the current word, and a phrase is legal if there is a path from the start node to a an arc labeled “Pop”. To capture the power of a CFG, one also needs recursion in the grammar, where arcs can refer to other networks, as well as to categories. A transition network on its own does not handle language phenomena such as agreement and subcategorization. For this we introduce a set of features into our grammars, and add conditions and actions to the arcs of a network, creating an augmented transition network. Conditions restrict the circumstances under which an arc can be taken, while actions update the features and structures associated with the parse. Because parsing with augmented transition networks is not as commonly used, we will not discuss the algorithm’s details. Finally, recent work in statistical NLP has tackled the problem of learning PCFGs from a parsed corpus. The first step is to simply enumerate all possible CFG rules by reading them directly from the parsed sentences in the corpus. Next, one attempts to assign some reasonable probabilities to these rules, test them on a new corpus, and remove those with sufficiently small probability. In the latest iteration of such techniques, Charniak (1999) presents a method extracting such probabilities, and the use of the resulting grammar to efficiently parse novel sentences. Word Sense Disambiguation. The problem addressed by word sense disambiguation algorithms is that many words have several meanings, or senses. These words, if taken out of context, thus have several possible interpretations, and are said to be ambiguous. A favorite example is “bank,” which can refer to the monetary meaning or the river side meaning. The task of word sense disambiguation is to determine which meaning is appropriate in the current context. This task is important for constraining the parsing processes previously discussed, or for assisting with simple word-for-word translations, where the different senses of a word can be translated in different ways. Disambiguation can also be the first stage in semantic processing, discussed below. For an overview of disambiguation methods in general, see Ide and V´eronis (1998). Some of the most successful word sense disambiguation methods to date are from the area of statistical NLP. These can be divided up into supervised learning techniques, dictionary-based methods, and unsupervised techniques. In the supervised learning scenario, we are given a corpus that has the correct sense label for each word, and attempt to learn a model that can correctly label new
sentences (Gale, Church, & Yarowsky, 1992; Brown, Pietra, Pietra, & Mercer, 1991). In dictionary-based techniques, word senses are hypothesized based on an analysis of a knowledge base containing information about words and their relationships, whether it be a dictionary, a thesaurus, or parallel corpora in two or more languages (Lesk, 1986; Yarowsky, 1992; Dagan & Itai, 1994). Finally, unsupervised techniques attempt to cluster words into related senses based on the contexts in which they occur or other surface information (e.g., Schütze, 1995).

Semantic Processing. While all of the above steps are useful, there is still information missing from the syntactic representation and even the word senses of a sentence. To allow a system to reason about the implications and meaning of a sentence, it must be transformed into a representation that allows such reasoning. While some systems can use just syntax as a basis for making decisions, deeper interactions require deeper understanding. Most semantic understanding systems attempt to transform the sentence into an underlying representation language, typically one based on predicate or first-order logic. For example, “The woman took the book” can be represented as (the x: (book x)(the y: (woman y)(took1 y x))), where the acts as a quantifier over the variables x and y. The exact predicates or other representation actually used are highly task-dependent. For example, if a question answering system is the application, the representation might be a database query language. If the task is translation, some sort of interlingua might be the goal. Often an intermediate logical form is used as the result of the semantic interpretation phase, with the final representation being derived from it based on the context.

One of the difficulties of assembling the word meanings into a meaning for the whole sentence is that language does not always obey the principle of compositionality. This principle says that the meaning of the whole can be strictly predicted from the meanings of the parts. This is clearly not always true, as shown by “broad daylight” (the light is not wide) and “strong tea” (the tea cannot exert a force). These two groupings are examples of collocations, or words that tend to appear together or express a conventional way of saying things. Knowing about collocations can help bias the parsing process so that the words in a collocation will not be placed into two different constituents. Collocations are found by counting words in a corpus and noticing which appear together frequently and which could also conceivably be part of the same phrase based on their tags. The details of counting occurrences appropriately are important for finding true collocations, but we will not go into them here.

Some systems simply add semantic features to an existing context-free grammar, and add parsing mechanisms for combining them. This type of technique is common for those taking a theoretical view, but those working on the comprehension of real texts face a problem: real language use includes errors, metaphor, and many other noisy phenomena. As a result, a syntax-driven approach (unless it is probabilistic in nature) may not produce any analysis at all when it cannot find a parse for the entire sentence.
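As a brief aside on the dictionary-based methods mentioned above, the following minimal Prolog sketch scores each candidate sense of a word by the overlap between its dictionary gloss and the surrounding context words (a simplified, Lesk-style idea). The senses, glosses, predicate names, and example context are invented for illustration and are not taken from any of the systems cited in this chapter.

    % Toy sense inventory: gloss(Sense, GlossWords).
    gloss(bank_money, [financial, institution, that, accepts, deposits]).
    gloss(bank_river, [sloping, land, beside, a, body, of, water]).

    % overlap(+Words1, +Words2, -N): N is the number of words of Words1
    % that also occur in Words2.
    overlap(Ws1, Ws2, N) :-
        findall(W, (member(W, Ws1), member(W, Ws2)), Shared),
        length(Shared, N).

    % disambiguate(+ContextWords, -Sense): choose the sense whose gloss
    % overlaps most with the context.
    disambiguate(Context, Sense) :-
        findall(N-S, (gloss(S, G), overlap(Context, G, N)), Scored),
        msort(Scored, Sorted),
        last(Sorted, _-Sense).

    % ?- disambiguate([she, deposits, her, pay, at, the, financial, institution], Sense).
    % Sense = bank_money.

Real dictionary-based methods draw on much richer lexical resources, but the overlap scoring above captures the basic idea.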
One can overcome these difficulties in many ways. One technique is to first derive a syntactic parse (or partial parse), and then use rules to translate this into a semantic representation. Another is to use semantic grammars, which translate a sentence directly into its “meaning”; these are most useful in limited domains where one can take advantage of the predetermined context to constrain the grammar (a small illustrative sketch follows below). Third, in information extraction and other constrained tasks, one can specify meaningful patterns that might occur in the text. Information extraction systems attempt to find specific pieces of information in a document, such as the principal actors and objects in a situation. Deep semantic processing requires the ability to perform inferences based on the representation of the input and the knowledge base used by the system. Existing systems typically work within a narrow domain of understanding to constrain such inferences, for instance translating parliamentary procedures or helping a user schedule a trip.
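To make the semantic-grammar idea concrete, here is a minimal Prolog DCG for a toy travel domain that maps a word list directly to a meaning term. The grammar, the tiny lexicon, and the flight/2 meaning structure are invented for this sketch; they are not from any system discussed in this chapter.

    % A semantic grammar: the nonterminals are domain categories, and each
    % rule builds a piece of the meaning directly.
    query(flight(From, To)) -->
        [i, want, a, flight], from(From), to(To).
    from(City) --> [from], city(City).
    to(City)   --> [to], city(City).
    city(boston) --> [boston].
    city(dallas) --> [dallas].

    % ?- phrase(query(Q), [i, want, a, flight, from, boston, to, dallas]).
    % Q = flight(boston, dallas).

Because the nonterminals mix syntactic and domain knowledge, such a grammar is easy to write for a narrow application but does not transfer to other domains, which is exactly the trade-off noted above.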
4 Summary
We have discussed linguistic concepts and some associated natural language processing techniques. This survey has by no means covered all techniques and applications. In particular, it has almost completely ignored the lowest-level processing required for speech recognition, and the highest-level processing required for understanding larger units of language such as texts or conversations. In conclusion, it should be noted that many of the above techniques can be, and have been, used in combination with one another. For example, some methods incorporate pragmatic knowledge into the semantic parsing process to help constrain the meaning of a sentence based on the context in which it appears. Finally, many tasks are still difficult for current systems, but the state of the art is constantly changing and the future is bright.
References

1. Allen, J. F. (1995). Natural Language Understanding (2nd Ed.). Benjamin/Cummings, Menlo Park, CA.
2. Brill, E. (1993). Automatic grammar induction and parsing free text: A transformation-based approach. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pp. 259–265, Columbus, Ohio.
3. Brown, P., Pietra, S. D., Pietra, V. D., & Mercer, R. (1991). Word-sense disambiguation using statistical methods. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pp. 264–270.
4. Cardie, C., & Mooney, R. J. (1999). Machine learning and natural language (introduction to special issue on natural language learning). Machine Learning, 34, 5–9.
5. Charniak, E. (1993). Statistical Language Learning. MIT Press.
6. Charniak, E. (1999). A maximum-entropy-inspired parser. Tech. rep. CS9912, Department of Computer Science, Brown University.
7. Crystal, D. (1987). The Cambridge Encyclopedia of Language. Cambridge University Press, Cambridge, England.
8. Dagan, I., & Itai, A. (1994). Word sense disambiguation using a second language monolingual corpus. Computational Linguistics, 20, 563–596.
9. Gale, W., Church, K. W., & Yarowsky, D. (1992). A method for disambiguating word senses in a large corpus. Computers and the Humanities.
10. Ide, N., & Véronis, J. (1998). Introduction to the special issue on word sense disambiguation: The state of the art. Computational Linguistics, 24(1).
11. Lesk, M. (1986). Automatic sense disambiguation: how to tell a pine cone from an ice cream cone. In Proceedings of the 1986 SIGDOC Conference, pp. 24–26.
12. Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.
13. Newmeyer, F. (1988). Linguistics: The Cambridge Survey. Cambridge University Press, Cambridge, England.
14. Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–285.
15. Schütze, H. (1995). Distributional part-of-speech tagging. In 7th Conference of the European Chapter of the Association for Computational Linguistics, pp. 141–148.
16. Yarowsky, D. (1992). Word-sense disambiguation using statistical methods of Roget's categories trained on large corpora. In Proceedings of the Fourteenth International Conference on Computational Linguistics, pp. 454–460.
A Closer Look at the Automatic Induction of Linguistic Knowledge

Eric Brill

Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA
[email protected]
Abstract. Much recent research activity has focused on automatically extracting linguistic information from on-line corpora. There is no question that great progress has been made in applying machine learning to computational linguistics. We believe that, now that the field has matured, it is time to look inwards and carefully examine the basic tenets of the corpus-based learning paradigm. The goal of this paper is to raise a number of issues that challenge the paradigm, in the hope of stimulating introspection and discussion that will make the field even stronger.
1 Learning Linguistic Information Automatically
One of the biggest challenges in developing software with robust linguistic capabilities is how one provides the machine with the knowledge of language necessary to achieve an adequate level of linguistic sophistication. Although children appear to learn natural language effortlessly, teaching language to a machine has proven to be an extraordinarily difficult task. While it is possible to anticipate and encode a fair percentage of the necessary linguistic facts to process a relatively constrained domain like weather reports, we are still far from being able to provide machines with the knowledge necessary to effectively process unconstrained language. Until very recently, people attempted to break the language acquisition bottleneck by having humans manually input linguistic rules and lexical relations for the machine. As an alternative to laboriously hand-crafting rules, people are exploring methods for automatically extracting linguistic knowledge from online resources such as raw text, linguistically annotated text, dictionaries and thesauri. The field of linguistic machine learning has matured and expanded greatly over the last few decades, beginning with a small number of research labs speculating as to whether automatic techniques could succeed and growing into a huge research community that has produced a wide array of useful and widely used tools, such as alignment programs, part of speech taggers, named entity identifiers and syntactic parsers. A large percentage of papers at the major computational linguistics conferences contain some machine learning component, and there are many new workshops and conferences devoted solely to the area of linguistic machine learning.
In linguistic machine learning, most people adopt the traditional machine learning paradigm of using a gold standard sample set for training a supervised learning algorithm and then using a held out section of the set for testing the classification accuracy of the resulting learner. There are a number of difficulties we face in applying this to linguistic learning. This paper highlights two such difficulties: understanding what a gold standard corpus really is, and assessing what we gain from applying machine learning techniques compared to manually deriving linguistic knowledge.
2 Part of Speech Tagging: Who Did What?
In machine learning, we typically believe the more accurately a classifier performs on a fair test set, the more useful that classifier will be. Implicitly, this is the motivation behind the large number of natural language papers demonstrating how a new machine learning technique gives incremental performance improvements over previous techniques. This section describes an experiment we ran that highlights the fact that for linguistic gold standards that are human-annotated, this tight coupling between performance and usefulness does not necessarily hold. One of the most studied problems in natural language processing is part of speech tagging, the problem of assigning a word the appropriate part of speech given the context it appears in. There have been a wide range of machine learning techniques applied to this problem. Typically, we claim tagger A is better than tagger B if the difference in performance is statistically significant. Published improvements in tagging accuracy are usually very small, with less than a half percent absolute accuracy separating the performance of the very best taggers and the worst performing viable taggers when trained and tested on identical data. The Penn Treebank (Marcus, Santorini, & Marcinkiewicz, 1994) was tagged by first running an n-gram tagger and then having individual people manually correct the tagger output, according to a tagging style manual. Ratnaparkhi (1996), observed that the tag distribution of words in the Penn Treebank tagged corpus changes dramatically right at the positions in the text where the human annotator changes. He found that slightly higher accuracy could be obtained by training and testing a tagger on material hand corrected by a single human annotator than when using the corpus as a whole. As another way of measuring the difference in human annotators, we ran an experiment to see how much automatic tagging performance would improve if we included a feature indicating which annotator annotated a particular word, both in training and in testing. Each file of the Penn Treebank is annotated by a single human, and included with the Penn Treebank distribution is a list of which files each annotator annotated. We began with a simple transformation-based tagger (Brill, 1995) containing the following templates:
Change a tag from X to Y if:
– Previous tag is Z
– Next tag is Z
– Previous tag is Z and current word is W
– Next tag is Z and current word is W

To avoid the unknown word problem in these experiments, we extracted a lexicon from the entire Wall Street Journal Penn Treebank (including the test set) indicating all possible tags for each word, along with the most likely tag. We trained the tagger on 100K words and tested on a separate 500K words. Doing so, we obtained a test set tagging accuracy of 96.4%. Next, we augmented the training algorithm by adding the following four templates:

Change a tag from X to Y if:
– Previous tag is Z and human annotator is H
– Next tag is Z and human annotator is H
– Previous tag is Z and current word is W and human annotator is H
– Next tag is Z and current word is W and human annotator is H

We then trained and tested this new system on the same data as above, and achieved an accuracy of 96.6%, a 0.2% absolute and a 6% relative error reduction. The tagger without access to annotator information learned a total of 198 rules and the tagger with annotator information learned 261 rules. In Figure 1 we show test set accuracy as a function of learned rule number for both taggers. Below we present a few representative learned rules containing the “human annotator” feature.

– Change tag from IN to RB if the current word is about, the next tag is CD and the annotator is maryann.
– Change tag from IN to RB if the current word is about, the next tag is $ and the annotator is maryann.
– Change tag from NNS to NN if the current word is yen and the previous tag is CD and the annotator is parisi.
– Change tag from NNPS to NNP if the previous tag is NNP and the annotator is hudson.

The first two rules indicate that the annotator maryann differs from the other annotators in preferring to label the word about as an adverb instead of a preposition in constructs such as “about twenty dollars” and “about $ 200”. The third rule shows that parisi seems to treat the word yen as a singular noun regardless of how many yen are being referred to. The fourth rule indicates that hudson is biased toward, for instance, labeling the word Systems in “Electronic Data Systems” as a singular noun and not a plural noun.
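As a rough illustration of how such transformation rules are applied, the following Prolog sketch makes one left-to-right pass over a tagged word list, applying a single rule of the first template type (“change tag X to Y if the previous tag is Z”). The rule/3 fact and the example sentence are invented for illustration; they are not rules learned in the experiment described here.

    % A tagged text is a list of Word/Tag pairs.
    % rule(FromTag, ToTag, PrevTag): an invented example rule.
    rule(nn, vb, to).   % change NN to VB when the previous tag is TO

    % apply_rule(+TaggedIn, -TaggedOut): one left-to-right pass over the text.
    apply_rule([], []).
    apply_rule([WT], [WT]).
    apply_rule([W1/T1, W2/T2 | Rest], [W1/T1 | Out]) :-
        ( rule(T2, New, T1) -> T2Fixed = New ; T2Fixed = T2 ),
        apply_rule([W2/T2Fixed | Rest], Out).

    % ?- apply_rule([i/prp, want/vb, to/to, book/nn, a/dt, flight/nn], Out).
    % Out = [i/prp, want/vb, to/to, book/vb, a/dt, flight/nn].

A learned transformation list is simply a sequence of such rules applied in order, each pass rewriting the output of the previous one.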
0.97 "Knowing_Annotator" "Not_Knowing_Annotator"
0.965
0.96
0.955
0.95
0.945
0.94 0
50
100
150
200
250
300
Fig. 1. Rule Number vs Test Set Accuracy
One could view this unusual tagger constructively as a tool for finding anomalies of individual human annotators and helping to make a more consistent annotated corpus. But these results also point out an important fact: there is not necessarily the tight coupling between improving accuracy on a test set and creating a program with increased utility. Note that one could somewhat lessen the problem of corpus inconsistency by having two or more human annotators annotate every sentence and then have a process for adjudicating any disagreements. However, there are problems with this approach as well. While some annotator disagreements will be due to one annotator making an error or misinterpreting the style manual, many inconsistencies arise by being forced to pick a single tag in cases of true ambiguity. The more complex the style manual gets in an attempt to remove ambiguities, the
more difficult it will be to train annotators, and the more biased the underlying annotations will be toward a specific linguistic theory, and therefore of less general use. We would like to reiterate here that the point of this section is not to discount the excellent work people have done in developing techniques to learn from annotated data, but rather to point out that now that our field has matured we need to devote some energy to better understanding the strengths and weaknesses of our paradigm, and to think carefully about what can be done to improve upon the weaknesses.
3 Machine Learning and Portability
One of the biggest arguments made for pushing methods for automatically learning linguistic knowledge is the fact that such systems could in theory be much more portable than methods that involve manual labor. For instance, at least in cases where we are fortunate enough to have adequate training resources for a new language or domain, we can simply run our training algorithm on the training data and, by doing so, quickly port our system. There are many highly accurate manually derived programs for tagging, parsing, and so forth, but these programs took a great deal of highly skilled human labor to create. For instance, Samuelsson and Voutilainen (1997) describe a highly accurate tagger based on Constraint Grammar and built and refined over the years by linguists trained in this formalism. While this system achieves high accuracy, it is not very portable, requiring specially trained experts and a fair amount of time to create. A question to ask is: if we want to develop portable systems, does this mean automatic techniques are our only alternative? In this section we present results from an experiment that challenges this conclusion. Chanod and Tapanainen (1994) describe experiments in porting two part of speech taggers to French: an HMM tagger and a constraint-based tagger. The HMM tagger did not work well out of the box, and required approximately one man-month of fine tuning to attain reasonable performance. We should note that this is an uncharacteristically long time for tuning, and was probably due to the fact that they were tuning a notoriously finicky unsupervised learning algorithm rather than using one of the standard supervised learning algorithms for tagging. Nonetheless, they next devoted a comparable amount of time to manually developing a constraint-based part of speech tagger, and found that the constraint-based tagger outperformed the HMM. In Brill and Ngai (1999), we describe an experiment comparing human rule writing with machine learning for automatic base noun phrase bracketing. In the experiments of Chanod and Tapanainen (1994), they asked how well people (and machines) could port to a new domain in roughly one person-month. We wanted to impose a much more stringent definition of portability. Roughly speaking, we asked the question: what quality system can we derive manually given one day and no more than $40 in human labor? Nobody can argue that a
system that can be retrained in less than a day and at such a low expense is not portable. We refer the reader to the original paper (Brill & Ngai, 1999) for details and here only give a brief summary of the experiment. First, we developed software to facilitate manual rule writing. This consisted of two components: the rule engine and the analysis engine. The goal was for people to manually write empirically derived transformation lists. A transformation-based system works by first assigning a default structure (in our case the default structure was to assume nothing as a noun phrase) and then applying each rule in the transformation list, in order. At any stage of rule writing, the person will have written a set of N rules. To write the N+1st rule, the person uses the analysis engine, which displays precision and recall errors of the current transformation list. An important point about this method is that in writing the N+1st rule, a person can ignore the previous N rules. The state of the system is completely captured in the annotations of the training corpus. This makes rule writing relatively easy, since, unlike in a rule system such as a context-free grammar, a person need not be concerned with how various rules will interact. After examining errors in the training set, the person then attempts to write a rule to reduce the number of errors. They then apply the rule to the training set and, based upon the net results and studying where the rule applied, decide to accept the rule, discard the rule, or refine the rule in hopes of improving it. We allowed four basic rule types:

– Add a base NP.
– Delete a base NP.
– Transform a base NP.
– Merge multiple base NPs into one.
Each rule type is built from regular expressions denoting the environment to the left of where the rule applies, the environment to the right, and the environment to which the rule itself applies. For our initial experiments, we asked students in an introductory Natural Language Processing course at Johns Hopkins University to derive a transformation list for base NP annotation, using a 25,000 word training corpus. We also trained the Ramshaw and Marcus machine learning algorithm (Ramshaw & Marcus, 1995) on this same data. There have been many machine learning attempts at base noun phrase learning. It is difficult to compare different methods, as they are often run on different corpora and with somewhat different definitions of what constitutes a base noun phrase. However, of the systems run on the Penn Treebank, Ramshaw and Marcus achieve among the best performance. Of the 11 students who participated in this experiment, the top three achieved test set performance very close to that attained by the automatically trained system (see Table 1). On average, the students spent less than five hours each deriving their transformation lists. We were rather surprised and encouraged that students
Table 1. Comparing Human to Machine Performance for BaseNP Annotation (Taken from (Brill & Ngai, 1999))
Ramshaw & Marcus Student 1 Student 2 Student 3
Precision 88.7 88.0 88.2 88.3
Recall 89.3 88.8 87.9 87.8
F-Measure 89.0 88.4 88.1 88.1
could achieve performance close to the best machine learning algorithm in so little time, and believe that with better analysis tools these results would improve. In more recent experiments, we have found that on smaller training sets people in fact are able to significantly outperform the very best machine learning algorithms. This is somewhat intuitive, as generalization is key when we have little training data, and this is something people are good at and machines are bad at. One explanation for these results is that because of the skewed distribution of linguistic entities in naturally occurring data, all of the “shallow” methods, such as rapid human rule writing and linguistically naive machine learning, are able to capture the frequently occurring phenomena, and then the law of diminishing returns kicks in, meaning that obtaining improved results beyond these systems requires a great deal of effort. This skewed distribution is commonly referred to as Zipf’s Law. Zipf (1932) observed that this skewed distribution of types is prevalent, surfacing in many different linguistic phenomena and languages. In particular, he observed that the rank of a type multiplied by the frequency of that type is roughly constant for distributions such as word frequency. In other words, if the most frequent word in a book appears N times, we would expect the second most frequent word to appear N/2 times, the third most frequent to appear N/3 times, and so forth. If this diminishing-returns explanation is indeed what is happening, it suggests that perhaps we should divert our efforts from deriving machine learning methods that attempt to challenge human performance to thinking hard about how machine learning can be used to make better systems. In particular, one fascinating area of research that has gotten very little attention to date is the question of how manual labor and machine learning can best be combined to create systems that are better than those that could be derived solely by hand. To do this, we must begin to ask what a machine can do significantly better than people and focus on ways of capitalizing on the relative strengths of people and machines, rather than simply viewing machine learning as another way to do the same thing.
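To state the Zipfian relationship mentioned above a little more compactly (a standard formulation, not specific to the discussion here): writing f(r) for the frequency of the type of rank r, and C for a corpus-dependent constant roughly equal to the frequency of the most frequent type,

    f(r) · r ≈ C,   i.e.   f(r) ≈ C / r,

which is one way of seeing why shallow methods cover the frequent cases quickly and why returns then diminish so sharply.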
4 Conclusions
We believe the field of corpus-based natural language processing has matured to the point where it is constructive to discuss and challenge the basic tenets of the
field. We have raised issues pertaining to two such tenets: (1) that incremental improvements in test set accuracy necessarily imply incremental improvements in the utility of the resulting system and (2) that machine learning offers the advantage of portability. By carefully studying the many issues in preparing and using annotated corpora and by learning how best to marry machine learning with human labor, we hope the field can advance to the point of some day truly overcoming the language acquisition bottleneck.
References

1. Brill, E. (1995). Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, 246–253.
2. Brill, E., & Ngai, G. (1999). Man vs. machine: A case study in base noun phrase learning. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pp. 65–72.
3. Chanod, J., & Tapanainen, P. (1994). Statistical and constraint-based taggers for French. Technical report MLTT-016, Rank Xerox Research Centre, Grenoble.
4. Marcus, M., Santorini, B., & Marcinkiewicz, M. (1994). Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19, 313–330.
5. Ramshaw, L., & Marcus, M. (1995). Text chunking using transformation-based learning. In Proceedings of the Third ACL Workshop on Very Large Corpora, pp. 82–94.
6. Ratnaparkhi, A. (1996). A maximum entropy part of speech tagger. In Proceedings of the First Empirical Methods in Natural Language Processing Conference, pp. 133–142.
7. Samuelsson, C., & Voutilainen, A. (1997). Comparing a linguistic and stochastic tagger. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, pp. 246–253.
8. Zipf, G. (1932). Selected Studies of the Principle of Relative Frequency in Language. Harvard University Press.
Learning for Semantic Interpretation: Scaling Up without Dumbing Down

Raymond J. Mooney

Department of Computer Sciences, University of Texas, Austin, TX 78712-1188, USA
[email protected]
Abstract. Most recent research in learning approaches to natural language has studied fairly “low-level” tasks such as morphology, part-of-speech tagging, and syntactic parsing. However, I believe that logical approaches may have the most relevance and impact at the level of semantic interpretation, where a logical representation of sentence meaning is important and useful. We have explored the use of inductive logic programming for learning parsers that map natural-language database queries into executable logical form. This work goes against the growing trend in computational linguistics of focusing on shallow but broad-coverage natural language tasks (“scaling up by dumbing down”) and instead concerns using logic-based learning to develop narrower, domain-specific systems that perform relatively deep processing. I first present a historical view of the shifting emphasis of research on various tasks in natural language processing and then briefly review our own work on learning for semantic interpretation. I will then attempt to encourage others to study such problems and explain why I believe logical approaches have the most to offer at the level of producing semantic interpretations of complete sentences.
1 Introduction
The application of machine learning techniques to natural language processing (NLP) has increased dramatically in recent years under the name of “corpus-based,” “statistical,” or “empirical” methods. There has been a dramatic shift in computational linguistics from manually constructing grammars and knowledge bases to partially or totally automating this process by using statistical learning methods trained on large annotated or unannotated natural language corpora. The success of statistical methods in speech recognition (Stolcke, 1997; Jelinek, 1998) has been particularly influential in motivating the application of similar methods to other aspects of natural language processing. There is now a variety of work on applying learning methods to almost all other aspects of language processing as well (Charniak, 1993; Brill & Mooney, 1997; Manning & Schütze, 1999), including syntactic analysis (Charniak, 1997), semantic disambiguation and interpretation (Ng & Zelle, 1997), discourse processing and
information extraction (Cardie, 1997), and machine translation (Knight, 1997). Some concrete publication statistics clearly illustrate the extent of the revolution in natural language research. According to data recently collected by Hirschberg (1998), a full 63.5% of the papers in the Proceedings of the Annual Meeting of the Association for Computational Linguistics and 47.4% of the papers in the journal Computational Linguistics concerned corpus-based research in 1997. For comparison, 1983 was the last year in which there were no such papers and the percentages in 1990 were still only 12.8% and 15.4%. Nevertheless, traditional machine learning research in artificial intelligence, particularly logic-based learning, has had limited influence on recent research in computational linguistics. Most current learning research in NLP employs statistical techniques inspired by research in speech recognition, such as hidden Markov models (HMMs) and probabilistic context-free grammars (PCFGs). There has been some recent research on logic-based language learning (Mooney & Califf, 1995; Cohen, 1996; Freitag, 1998), in particular, a recent body of European inductive logic programming (ILP) research on language (Cussens, 1997; Manandhar, Dˇzeroski, & Erjavec, 1998; Kazakov & Manandhar, 1998; Eineborg & Lindberg, 1998; Lindberg & Eineborg, 1998; Cussens, Dˇzeroski, & Erjavec, 1999; Lindberg & Eineborg, 1999). However, most of this research has focused on relatively “low level” tasks such as morphological analysis and part-of-speech tagging and has not conclusively demonstrated superior performance when compared to competing statistical methods for these tasks. In contrast, most of our own recent research on applying ILP to NLP has focused on learning to parse natural-language database queries into a semantic logical form that produces an answer when executed in Prolog (Zelle & Mooney, 1993, 1994, 1996; Zelle, 1995; Mooney, 1997; Thompson & Mooney, 1999; Thompson, 1998; Thompson, Califf, & Mooney, 1999). There is a long tradition of representing the meaning of natural language statements and queries in first-order logic (Allen, 1995; Dowty, Wall, & Peters, 1981; Woods, 1978). However, we know of no other recent research specifically on learning to map language into logical form. Nevertheless, we believe this is the most suitable NLP task for ILP, since the desired output is a logical representation that is best processed using logic-based methods. This chapter first presents a brief historical view of the shifting emphasis of research on various tasks in natural language processing. Next, it briefly reviews our own work on learning for semantic interpretation. Finally, it summarizes the arguments in favor of semantic interpretation as the most promising naturallanguage application of logic-based learning.
2 A Brief Historical Review of NLP Research
From the very early days of NLP research, answering natural-language questions in a particular domain was a key task (Green, Wolf, Chomsky, & Laughery, 1963; Simmons, 1965, 1970). Although syntactic analysis was a major component of this task, the production of a semantic interpretation that could be used to
retrieve answers was also very important. The semantic analysis of language was a particular focus of NLP research in the 1970’s, with researchers exploring tasks ranging from responding to commands and answering questions in a micro-world (Winograd, 1972) to answering database queries (Woods, 1977; Waltz, 1978; Hendrix, Sacerdoti, Sagalowicz, & Slocum, 1978) and understanding short stories (Charniak, 1972; Schank, 1975; Charniak & Wilks, 1976; Schank & Abelson, 1977; Schank & Riesbeck, 1981). Research in this era attempted to address complex issues in semantic interpretation, knowledge representation, and inference. The systems that were developed could perform interesting semantic interpretation and inference when understanding particular sentences or stories; however, they tended to require tedious amounts of application-specific knowledge-engineering and were therefore quite brittle and not easily extended to new texts or new applications and domains. The result was systems that could perform fairly in-depth understanding of narrative text; but were restricted to comprehending three or four specific stories (Dyer, 1983). Disenchantment with the knowledge-engineering requirements and brittleness of such systems grew, and research on in-depth semantic interpretation began to wane in the early to mid 1980’s. The author’s own thesis research in the mid 1980’s focused on attempting to relieve the knowledge-engineering bottleneck by using explanation-based learning (EBL) to automatically acquire the larger knowledge structures (scripts or schemas) needed for narrative understanding (DeJong, 1981; Mooney & DeJong, 1985; DeJong & Mooney, 1986). However, this approach still required a large amount of existing knowledge that could be used to construct detailed explanations for simpler stories. In order to avoid the difficult problems of detailed semantic analysis, NLP research began to focus on building robust systems for simpler tasks. With the advent of statistical learning methods that could successfully acquire knowledge from large corpora for more tractable problems such as speech recognition, partof-speech tagging, and syntactic parsing, significant progress has been made on these tasks over the past decade (Jelinek, 1998; Manning & Sch¨ utze, 1999). Also, much current NLP research is driven by applications to arbitrary documents on the Internet and World Wide Web (Mahesh, 1997), and therefore cannot exploit domain-specific knowledge. Consequently, much current NLP research has more the flavor of traditional information retrieval (Sparck Jones & Willett, 1997), rather than AI research on language understanding. This overall trend is succinctly captured by the recently coined clever phrase “scaling up by dumbing down.” Unfortunately, there is relatively little research on using learning methods to acquire knowledge for detailed semantic interpretation. Research on corpusbased word-sense disambiguation addresses semantic issues (Ng & Zelle, 1997; Ide & V´eronis, 1998); however, only at the level of interpreting individual words rather than constructing representations for complete sentences. Research on learning for information extraction also touches on semantic interpretation; however, existing methods learn fairly low-level syntactic patterns for extracting spe-
cific target phrases (Cardie, 1997; Freitag, 1998; Bikel, Schwartz, & Weischedel, 1999; Soderland, 1999; Califf & Mooney, 1999). Nevertheless, there has been a limited amount of research on learning to interpret complete sentences for answering database queries (Zelle & Mooney, 1996; Miller, Stallard, Bobrow, & Schwartz, 1996; Kuhn & De Mori, 1995).
3 CHILL: ILP for Semantic Interpretation
Our own research on learning for semantic interpretation has involved the development of a system called Chill (Zelle, 1995)1, which uses ILP to learn a deterministic shift-reduce parser written in Prolog. The input to Chill is a corpus of sentences paired with semantic representations. The parser learned from this data is able to transform these training sentences into their correct representations, as well as generalizing to correctly interpret many novel sentences. Chill is currently able to handle two kinds of semantic representations: a case-role form based on conceptual dependency (Schank, 1975) and a Prolog-based logical query language. As examples of the latter, consider two sample queries for a database on U.S. geography, paired with their corresponding logical form:

What is the capital of the state with the highest population?
answer(C, (capital(S,C), largest(P, (state(S), population(S,P))))).

What state is Texarkana located in?
answer(S, (state(S), eq(C,cityid(texarkana,_)), loc(C,S))).

Chill treats parser induction as a problem of learning rules to control the actions of a shift-reduce parser. During parsing, the current context is maintained in a stack of previously interpreted constituents and a buffer containing the remaining input. When parsing is complete, the buffer is empty and the stack contains the final representation of the input. There are three types of operators used to construct logical queries. First is the introduction onto the stack of a predicate needed in the sentence representation, due to the appearance of a word or phrase at the front of the input buffer. A second type of operator unifies two variables appearing in the current items on the stack. Finally, a stack item may be embedded as an argument of another stack item. A generic parsing shell is provided to the system, and the initial parsing operators are produced through an automated analysis of the training data using general templates for each of the operator types described above. During learning, these initial overly general operators are specialized so that the resulting parser deterministically produces only the correct semantic interpretation of each of the training examples. The introduction operators require a semantic lexicon as background knowledge that provides the possible logical representations of specific words and phrases. Chill initially required the user to provide
1 A more detailed description of Chill can be found in the chapter by Thompson and Califf, this volume.
this lexicon; however, we have recently developed a system called Wolfie that learns this lexicon automatically from the same training corpus (Thompson & Mooney, 1999; Thompson, 1998).

Chill has been used successfully to learn natural-language interfaces for three separate databases: 1) a small database on U.S. geography, 2) a database of thousands of restaurants in northern California, and 3) a database of computer jobs automatically extracted from the Usenet newsgroup austin.jobs (Califf & Mooney, 1999). After training on corpora of a few hundred queries, the system learns parsers that are reasonably accurate at interpreting novel queries for each of these applications. For the geography domain, the system has learned semantic parsers for Spanish, Japanese, and Turkish, as well as English. Below are some of the interesting novel English queries that the geography system can answer, although it was never explicitly trained on queries of this complexity:

– What states border states through which the Mississippi runs?
– What states border states that border Texas?
– What states border states that border states that border states that border Texas?
– What states border the state that borders the most states?
– What rivers flow through states that border the state with the largest population?
– What is the largest state through which the Mississippi runs?
– What is the longest river that flows through a state that borders Indiana?
– What is the length of the river that flows through the most states?

Chill is described in a bit more detail in the article in this volume by Thompson and Califf (2000), which focuses on the automatic selection of good training sentences.
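To give a feel for how such logical queries behave when executed in Prolog, here is a toy sketch for the second example query given earlier. The handful of facts and the naive definitions of answer/2 and eq/2 are invented for illustration only; the actual Geoquery database and the query interpreter used with Chill are considerably richer.

    % Toy geography facts (invented for illustration).
    state(texas).
    state(arkansas).
    loc(cityid(texarkana, tx), texas).

    % Naive meta-predicates: answer/2 simply proves the goal, eq/2 unifies.
    eq(X, X).
    answer(_Var, Goal) :- call(Goal).

    % "What state is Texarkana located in?"
    % ?- answer(S, (state(S), eq(C, cityid(texarkana, _)), loc(C, S))).
    % S = texas.

The point of the learned parser is to produce such a goal automatically from the English question, after which answering it is ordinary logical inference over the database.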
4 Semantic Interpretation and Learning in Logic
Our research on Chill demonstrates that ILP can help automate the construction of useful systems for semantic interpretation of natural language. Since the desired output of semantic interpretation is a logical form, ILP methods are particularly well suited for manipulating and constructing such representations. In addition, ILP methods allow for the easy specification and exploitation of background knowledge that is useful in parsing and disambiguation. The current version of Chill makes significant use of semantic typing knowledge and background predicates for finding items in the stack and buffer that satisfy particular constraints. There has been significant work on using statistical methods for performing tasks such as part-of-speech tagging and syntactic parsing, and recent results demonstrate fairly impressive performance on these tasks. Logic based methods have currently been unable to demonstrate superior performance on these tasks, and due to the limited context that apparently usually suffices for these problems, are unlikely to easily overtake statistical methods. However, there has
been very little research demonstrating successful application of statistical methods to semantic interpretation. Consequently, this problem presents a promising opportunity for demonstrating the advantages of logic-based methods. Although corpora consisting of tens of thousands of annotated sentences exist for tasks such as part-of-speech tagging and syntactic parsing (Marcus, Santorini, & Marcinkiewicz, 1993), very little data exists for semantic analysis.2 Consequently, the identification of important and representative tasks and the construction of significant corpora of semantically interpreted sentences are leading requirements for furthering research in this area. Although developing a good semantic representation and annotating sentences with logical form can be a fairly time-consuming and difficult job, a dedicated effort similar to that already undertaken to produce large treebanks for syntactic analysis could produce very sizable and useful semantic corpora.

Part of the resistance to exploring semantic analysis is that, given the current state of the art, it almost inevitably leads to domain dependence. However, many useful and important applications require NLP systems that can exploit specific knowledge of the domain to interpret and disambiguate queries, commands, or statements. The goal of developing general learning methods for this task is exactly to reduce the burden of developing such systems, in the same way that machine learning is used to overcome the knowledge-acquisition bottleneck in developing expert systems. It is largely the difficulty of engineering specific applications that has prevented natural-language interface technology from becoming a widespread method for improving the user-friendliness of computing systems. Learning technology for automating the development of such domain-specific systems could help overcome this barrier, eventually resulting in the widespread use of NL interfaces. Consequently, I strongly encourage others to consider investigating learning for semantic interpretation using either statistical, logic-based, or other methods. A larger community of researchers investigating this problem is critical for making important and significant progress. In particular, logic-based approaches are the only ones to have demonstrated significant success on this problem to date, and it is the most promising and natural application of ILP to NLP.

Acknowledgements. This research was supported by the National Science Foundation under grant IRI-9704943.
References 1. Allen, J. F. (1995). Natural Language Understanding (2nd Ed.). Benjamin/Cummings, Menlo Park, CA. 2. Bikel, D. M., Schwartz, R., & Weischedel, R. M. (1999). An algorithm that learns what’s in a name. Machine Learning, 34, 211–232. 2
Some of the corpora we have developed are available from http://www.cs.utexas.edu/users/ml.
3. Brill, E., & Mooney, R. J. (1997). An overview of empirical natural language processing. AI Magazine, 18, 13–24. 4. Califf, M. E., & Mooney, R. J. (1999). Relational learning of pattern-match rules for information extraction. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, pp. 328–334 Menlo Park, CA. AAAI Press. 5. Cardie, C. (1997). Empirical methods in information extraction. AI Magazine, 18, 65–79. 6. Charniak, E. (1972). Toward a model of children’s story comprehension. Tech. rep. TR266, Artificial Intelligence Laboratory, Massachusetts Institute of Technology. 7. Charniak, E. (1997). Statistical techniques for natural language parsing. AI Magazine, 18, 33–43. 8. Charniak, E., & Wilks, Y. (Eds.). (1976). Computational Semantics. NorthHolland, Amsterdam. 9. Charniak, E. (1993). Statistical Language Learning. MIT Press. 10. Cohen, W. W. (1996). Learning to classify English text with ILP methods. In De Raedt, L. (Ed.), Advances in Inductive Logic Programming, pp. 124– 143. IOS Press, Amsterdam. 11. Cussens, J. (1997). Part-of-speech tagging using Progol. In Proceedings of the Seventh International Workshop on Inductive Logic Programming, pp. 93–108 Berlin. Springer. 12. Cussens, J., Dˇzeroski, S., & Erjavec, T. (1999). Morphosyntactic tagging of Slovene using Progol. In Dˇzeroski, S., & Flach, P. (Eds.), Proceedings of the Ninth International Workshop on Inductive Logic Programming Berlin. Springer-Verlag. 13. DeJong, G. (1981). Generalizations based on explanations. In Proceedings of the Seventh International Joint Conference on Artificial Intelligence, pp. 67–70 San Francisco. Morgan Kaufman. 14. DeJong, G. F., & Mooney, R. J. (1986). Explanation-based learning: An alternative view. Machine Learning, 1, 145–176. Reprinted in Readings in Machine Learning, J. W. Shavlik and T. G. Dietterich (eds.), Morgan Kaufman, San Mateo, CA, 1990. 15. Dowty, D. R., Wall, R. E., & Peters, S. (1981). Introduction to Montague Semantics. D. Reidel, Dordrecht, Holland. 16. Dyer, M. (1983). In Depth Understanding. MIT Press, Cambridge, MA. 17. Eineborg, M., & Lindberg, N. (1998). Induction of constraint grammarrules using Progol. In Proceedings of the Eighth International Workshop on Inductive Logic Programming, pp. 116–124 Berlin. Springer. 18. Freitag, D. (1998). Toward general-purpose learning for information extraction. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and COLING-98, pp. 404–408 New Bunswick, New Jersey. Association for Computational Linguistics. 19. Green, B. F., Wolf, A. K., Chomsky, C., & Laughery, K. (1963). Baseball: An automatic question answerer. In Feigenbaum, E. A., & Feldman, J. (Eds.), Computers and Thought, pp. 207–216. McGraw Hill, New York.
Reprinted in Readings in Natural Language Processing, B. Grosz, K. Spark Jones, and B. Lynn Webber (eds.), Morgan Kaufman, Los Altos, CA, 1986. 20. Hendrix, G. G., Sacerdoti, E., Sagalowicz, D., & Slocum, J. (1978). Developing a natural language interface to complex data. ACM Transactions on Database Systems, 3, 105–147. 21. Hirschberg, J. (1998). Every time I fire a linguist, my performance goes up, and other myths of the statistical natural language processing revolution. Invited talk, Fifteenth National Conference on Artificial Intelligence (AAAI-98). 22. Ide, N., & V´eronis, J. (1998). Introduction to the special issue on word sense disambiguation: The state of the art. Computational Linguistics, 24, 1–40. 23. Jelinek, F. (1998). Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA. 24. Kazakov, D., & Manandhar, S. (1998). A hybrid approach to word segmentation. In Proceedings of the Eighth International Workshop on Inductive Logic Programming, pp. 125–134 Berlin. Springer. 25. Knight, K. (1997). Automating knowledge acquisition for machine translation. AI Magazine, 18, 81–96. 26. Kuhn, R., & De Mori, R. (1995). The application of semantic classification trees to natural language understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17, 449–460. 27. Lindberg, N., & Eineborg, M. (1998). Learning constraint grammar-style disambiguation rules using inductive logic programming. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and COLING-98, pp. 775–779 New Bunswick, New Jersey. Association for Computational Linguistics. 28. Lindberg, N., & Eineborg, M. (1999). Improving part of speech disambiguation rules by adding linguistic knowledge. In Dˇzeroski, S., & Flach, P. (Eds.), Proceedings of the Ninth International Workshop on Inductive Logic Programming Berlin. Springer-Verlag. 29. Mahesh, K. (Ed.). (1997). Papers from the AAAI Spring Symposium on Natural Language Processing for the World Wide Web, Menlo Park, CA. AAAI Press. 30. Manandhar, S., Dˇzeroski, S., & Erjavec, T. (1998). Learning multilingual morphology with CLOG. In Proceedings of the Eighth International Workshop on Inductive Logic Programming, pp. 135–144 Berlin. Springer. 31. Manning, C. D., & Sch¨ utze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA. 32. Marcus, M., Santorini, B., & Marcinkiewicz, M. (1993). Building a large annotated corpus of English: The Penn treebank. Computational Linguistics, 19, 313–330. 33. Miller, S., Stallard, D., Bobrow, R., & Schwartz, R. (1996). A fully statistical approach to natural language interfaces. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pp. 55– 61 New Bunswick, New Jersey. Association for Computational Linguistics.
34. Mooney, R. J. (1997). Inductive logic programming for natural language processing. In Muggleton, S. (Ed.), Inductive Logic Programming: Selected papers from the 6th International Workshop, pp. 3–22. Springer-Verlag, Berlin. 35. Mooney, R. J., & Califf, M. E. (1995). Induction of first-order decision lists: Results on learning the past tense of English verbs. Journal of Artificial Intelligence Research, 3, 1–24. 36. Mooney, R. J., & DeJong, G. F. (1985). Learning schemata for natural language processing. In Proceedings of the Ninth International Joint Conference on Artificial Intelligence, pp. 681–687 San Francisco. Morgan Kaufman. 37. Ng, H. T., & Zelle, J. (1997). Corpus-based approaches to semantic interpretation in natural language processing. AI Magazine, 18, 45–64. 38. Schank, R. C. (1975). Conceptual Information Processing. North-Holland, Oxford. 39. Schank, R. C., & Abelson, R. P. (1977). Scripts, Plans, Goals and Understanding: An Inquiry into Human Knowledge Structures. Lawrence Erlbaum and Associates, Hillsdale, NJ. 40. Schank, R. C., & Riesbeck, C. K. (1981). Inside Computer Understanding: Five Programs plus Miniatures. Lawrence Erlbaum and Associates, Hillsdale, NJ. 41. Simmons, R. F. (1965). Answering English questions by computer: A survey. Communications of the Association for Computing Machinery, 8, 53–70. 42. Simmons, R. F. (1970). Natural language question-answering systems: 1969. Communications of the Association for Computing Machinery, 13, 15–30. 43. Soderland, S. (1999). Learning information extraction rules for semistructured and free text. Machine Learning, 34, 233–272. 44. Sparck Jones, K., & Willett, P. (Eds.). (1997). Readings in Information Retrieval. Morgan Kaufmann, San Francisco, CA. 45. Stolcke, A. (1997). Linguistic knowledge and empirical methods in speech recognition. AI Magazine, 18, 25–31. 46. Thompson, C. A., & Califf, M. E. (2000). Improving learning in two natural language tasks: Choosing examples intelligently. In This volume. 47. Thompson, C. A., Califf, M. E., & Mooney, R. J. (1999). Active learning for natural language parsing and information extraction. In Proceedings of the Sixteenth International Conference on Machine Learning, pp. 406–414 San Francisco. Morgan Kaufman. 48. Thompson, C. A., & Mooney, R. J. (1999). Automatic construction of semantic lexicons for learning natural language interfaces. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, pp. 487– 493. 49. Thompson, C. A. (1998). Semantic Lexicon Acquisition for Learning Natural Language Interfaces. Ph.D. thesis, Department of Computer Sciences, University of Texas, Austin, TX. Also appears as
Artificial Intelligence Laboratory Technical Report AI 99-278 (see http://www.cs.utexas.edu/users/ai-lab).
50. Waltz, D. L. (1978). An English language question answering system for a large relational database. Communications of the Association for Computing Machinery, 21, 526–539.
51. Winograd, T. (1972). Understanding Natural Language. Academic Press, Orlando, FL.
52. Woods, W. A. (1977). Lunar rocks in natural English: Explorations in natural language question answering. In Zampoli, A. (Ed.), Linguistic Structures Processing. Elsevier North-Holland, New York.
53. Woods, W. A. (1978). Semantics and quantification in natural language question answering. In Yovits, M. (Ed.), Advances in Computers, vol. 17, pp. 2–64. Academic Press, New York. Reprinted in Readings in Natural Language Processing, B. Grosz, K. Spark Jones, and B. Lynn Webber (eds.), Morgan Kaufman, Los Altos, CA, 1986.
54. Zelle, J. M. (1995). Using Inductive Logic Programming to Automate the Construction of Natural Language Parsers. Ph.D. thesis, Department of Computer Sciences, University of Texas, Austin, TX. Also appears as Artificial Intelligence Laboratory Technical Report AI 96-249.
55. Zelle, J. M., & Mooney, R. J. (1993). Learning semantic grammars with constructive inductive logic programming. In Proceedings of the Eleventh National Conference on Artificial Intelligence, pp. 817–822, Menlo Park, CA. AAAI Press.
56. Zelle, J. M., & Mooney, R. J. (1994). Inducing deterministic Prolog parsers from treebanks: A machine learning approach. In Proceedings of the Twelfth National Conference on Artificial Intelligence, pp. 748–753, Menlo Park, CA. AAAI Press.
57. Zelle, J. M., & Mooney, R. J. (1996). Learning to parse database queries using inductive logic programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pp. 1050–1055, Menlo Park, CA. AAAI Press.
Learning to Lemmatise Slovene Words

Sašo Džeroski and Tomaž Erjavec

Department for Intelligent Systems, Jožef Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia
[email protected],
[email protected]
Abstract. Automatic lemmatisation is a core application for many language processing tasks. In inflectionally rich languages, such as Slovene, assigning the correct lemma to each word in a running text is not trivial: nouns and adjectives, for instance, inflect for number and case, with a complex configuration of endings and stem modifications. The problem is especially difficult for unknown words, as word forms cannot be matched against a lexicon giving the correct lemma, its part-of-speech and paradigm class. The paper discusses a machine learning approach to the automatic lemmatisation of unknown words, in particular nouns and adjectives, in Slovene texts. We decompose the problem of learning to perform lemmatisation into two subproblems: the first is to learn to perform morphosyntactic tagging, and the second is to learn to perform morphological analysis, which produces the lemma from the word form given the correct morphosyntactic tag. A statistics-based trigram tagger is used to learn to perform morphosyntactic tagging and a first-order decision list learning system is used to learn rules for morphological analysis. The dataset used is the 90.000 word Slovene translation of Orwell’s ‘1984’, split into a training and validation set. The validation set is the Appendix of the novel, on which extensive testing of the two components, singly and in combination, is performed. The trained model is then used on an open-domain testing set, which has 25.000 words, pre-annotated with their word lemmas. Here 13.000 nouns or adjective tokens are previously unseen cases. Tested on these unknown words, our method achieves an accuracy of 81% on the lemmatisation task.
1
Introduction
Lemmatisation is a core functionality for various language processing tasks. It represents a normalisation step on the textual data, where all inflected forms of a lexical word are reduced to its common lemma. This normalisation step is needed in analysing the lexical content of texts, e.g. in information retrieval, term extraction, machine translation etc. In English, lemmatisation is relatively easy, especially if we are not interested in the part-of-speech of a word. So called stemming can be performed with a
lexicon which lists the irregular forms of inflecting words, e.g. ‘oxen’ or ‘took’, while the productive ones, e.g. ‘wolves’ or ‘walks’, can be covered by a small set of suffix stripping rules. The problem is more complex for inflectionally rich languages, such as Slovene. Lemmatisation in inflectionally rich languages must pressupose correctly determining the part-of-speech together with various morphosyntactic features of the word form. Adjectives in Slovene, for example, inflect for gender (3), number (3) and case (6), and in some instances, also for definiteness and animacy. This, coupled with various morpho-phonologically induced stem and ending alternations gives rise to a multitude of possible relations between a word form and its lemma. A typical Slovene adjective has, for example, 14 different orthographic inflected forms, and a noun 8. It should be noted that we take the term ‘lemma’ to mean a word form in its canonical form, e.g., infinitive for verbs, nominative singular for regular nouns, nominative plural for pluralia tantum nouns, etc. The orthography of what we call a ’lemma’ and of the ‘stem’ of a word form are, in general, different, but much less in English than in Slovene. For example, the feminine noun ‘postelja’/’bed’, has ‘postelja’ as its lemma and this will be also its headword in a dictionary. However, the stem is ‘postelj-’, as the ‘-a’ is already the inflectional morpheme for (some) feminine nouns. Performing lemmatisation thus in effect involves performing morphological analysis, to identify the ending and isolate the stem, and synthesis, to join the canonical ending to it. Using a large lexicon with coded paradigmatic information, it is possible to reliably, but ambiguously lemmatise known words. Unambiguous lemmatisation of words in running text is only possible if the text has been tagged with morphosyntactic information, a task typically performed by a part-of-speech tagger. Much more challenging is the lemmatisation of unknown words. In this task, known as ‘unknown word guessing’ a morphological analyser can either try to determine the ambiguity class of the word, i.e. all its possible tags (and stems), which are then passed on to a POS tagger, or it can work in tandem with a tagger to directly determine the context dependent unambiguous lemma. While results on open texts are quite good with hand-crafted rules (Chanod & Tapanainen, 1995), there has been less work done with automatic induction of unknown word guessers. Probably the best known system of this kind is described in (Mikheev, 1997). It learns ambiguity classes from a lexicon and a raw (untagged) corpus. It induces rules for prefixes, suffixes and endings: the paper gives detailed analysis of accuracies achieved by combining these rules with various taggers. The best results obtained for tagging unknown words are in the range of 88%. However, the tests are performed on English language corpora and it is unclear what the performance as applied to lemmatisation would be with inflectionally richer languages. In this article, we discuss a machine learning approach to the automatic lemmatisation of unknown words in Slovene texts. We decompose the problem of learning to perform lemmatisation into two subproblems. The first is to learn rules for morphological analysis, which produce the lemma from the word form
given the correct tag in the form of a morphosyntactic description (MSD). The second is to learn to perform tagging, where tags are MSDs. We use an existing annotated/disambiguated corpus to learn and validate rules for morphological analysis and tagging. A first-order decision list learning system, Clog (Manandhar, Dˇzeroski, & Erjavec, 1998) is used to learn rules for morphological analysis. These rules are limited to nouns and adjectives, as these are, of the inflectional words, by far the most common new (unknown) words of a language. A statistics-based trigram tagger, TnT (Brants, 2000) is used to learn to perform MSD tagging. Once we have trained the morphological analyser and the tagger, unknown word forms in a new text can be lemmatised by first tagging the text, then giving the word forms and corresponding MSDs to the morphological analyser. The remainder of the article is organised as follows. Section 2 describes the corpus used to inductively develop the morphological analyser and the tagger. This was the 90.000 word Slovene Multext-East annotated corpus, which was divided into a larger training set and a smaller validation set. Section 3 describes the process of learning rules for morphological analysis, including an evaluation of the learned rules on the validation set. Similarly, Section 4 describes the process of training the tagger, including an evaluation of the learned tagger on the validation set. Section 5 describes the evaluation of the lemmatisation performed by the combination of the learned tagger and morphological analyser on the validation set and the testing set. The validation set on which we perform a detailed analysis is the one from the Multext-East corpus; we also evaluate the results on a text of 25.000 words from a completely different domain, pre-annotated with the word lemmas. Finally, Section 6 concludes and discusses directions for further work.
2
The Training and Validation Data Sets
The EU Multext-East project (Dimitrova et al., 1998; Erjavec, Lawson, & Romary, 1998) developed corpora, lexica and tools for six Central and East-European languages; the project reports and samples of results are available at http://nl.ijs.si/ME/. The centrepiece of the corpus is the novel “1984” by George Orwell, in the English original and translations. For the experiment reported here, we used the annotated Slovene translation of “1984”. This corpus has been further cleaned up and re-encoded within the scope of the EU project ELAN (Erjavec, 1999). The novel is sentence segmented (6,689 sentences) and tokenised (112,790) into words (90,792) and punctuation symbols (21,998). Each word in the corpus is annotated with context-disambiguated linguistic information: the lemma and the morphosyntactic description (MSD) of the word in question. The corpus is encoded according to the recommendation of the Text Encoding Initiative, TEI (Sperberg-McQueen & Burnard, 1994). To illustrate the information contained in the corpus, we give the encoding of an example sentence in Table 1.
Table 1. The TEI encoding of the sentence ‘Winston se je napotil proti stopnicam.’ (‘Winston made for the stairs.’)
<s id="Osl.1.2.3.4">
<w lemma="Winston" msd="Npmsn">Winston
<w lemma="se" msd="Px------y">se
<w lemma="biti" msd="Vcip3s--n">je
<w lemma="napotiti" msd="Vmps-sma">napotil
<w lemma="proti" msd="Spsd">proti
<w lemma="stopnica" msd="Ncfpd">stopnicam
.
The MSDs are structured and more detailed than is commonly assumed for part-of-speech tags; they are compact string representations of a simplified kind of feature structures — the formalism and MSD grammar for the Multext-East languages is defined in (Erjavec & (eds.), 1997). The first letter of a MSD encodes the part of speech (Noun, Adjective); Slovene distinguishes 11 different parts-of-speech. The letters following the PoS give the values of the position determined attributes. Each part of speech defines its own appropriate attributes and their values, acting as a kind of feature-structure type or sort. So, for example, the MSD Ncmpi expands to PoS:Noun, Type:common, Gender:masculine, Number:plural, Case:instrumental. It should be noted that in case a certain attribute is not appropriate (1) for a language, (2) for the particular combination of features, or (3) for the word in question, this is marked by a hyphen in the attribute’s position. Slovene verbs in the indicative, for example, are not marked for gender or voice, hence the two hyphens in Vcip3s--n. For the experiment reported here, we first converted the TEI encoded novel into a simpler, tabular encoding. Here each sentence ends with an empty line, and all the words and lemmas are in lower-case. This simplifies the training and testing regime, and, arguably, also leads to better results as otherwise capitalised words are treated as distinct lexical entries. The example sentence from Table 1 converts to the representation in Table 2.

Table 2. A tabular encoding of the sentence ‘Winston se je napotil proti stopnicam.’ (‘Winston made for the stairs.’)
winston      winston      Npmsn
se           se           Px------y
je           biti         Vcip3s--n
napotil      napotiti     Vmps-sma
proti        proti        Spsd
stopnicam    stopnica     Ncfpd
.                         .
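The positional encoding can be made concrete with a small Prolog sketch. This is our illustration, not code from the paper, and the attribute and value tables below are only the fragments needed for the Ncmpi example; the full Multext-East specifications define many more attributes and values.

  % Sketch of MSD decoding for nouns (illustrative fragment only).
  % msd_features(+Codes, -Features) maps an MSD such as "Ncmpi" to
  % attribute:value pairs, using the positional scheme described above.
  msd_features([PoS|Vals], [pos:P|Feats]) :-
      pos_attrs(PoS, P, Attrs),
      decode(Attrs, Vals, Feats).

  pos_attrs(0'N, noun, [type, gender, number, case]).

  decode([], _, []).
  decode([_|As], [0'-|Vs], Feats) :-       % '-' marks a non-appropriate attribute
      !, decode(As, Vs, Feats).
  decode([A|As], [V|Vs], [A:Val|Feats]) :-
      value(A, V, Val),
      decode(As, Vs, Feats).

  value(type,   0'c, common).      value(type,   0'p, proper).
  value(gender, 0'm, masculine).   value(gender, 0'f, feminine).
  value(number, 0's, singular).    value(number, 0'p, plural).
  value(case,   0'n, nominative).  value(case,   0'i, instrumental).

  % ?- atom_codes('Ncmpi', Cs), msd_features(Cs, F).
  % F = [pos:noun, type:common, gender:masculine, number:plural, case:instrumental].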
To give an impression of the size and complexity of the dataset we give in Table 3 the distribution over part-of-speech for the disambiguated Slovene words in “1984”. The first column in the Table gives the number of word tokens, the second of word types, i.e. of different word forms appearing in the corpus. The third column gives the number of different lemmas in the corpus and the fourth the number of different MSDs. The last column is especially interesting for lemmatisation, as it gives the number of tokens that are identical to their lemmas; these represent the trivial cases for lemmatisation. As can be seen, approx. 38% of noun tokens and 16% of adjective tokens are already in their lemma form. This serves as a useful baseline against which to compare analysis results.

Table 3. Part-of-speech distribution of the words in the ‘1984’ corpus.
Category              Token   Type  Lemma   MSD      =
Verb (V)              25163   4883   2003    93   1405
Noun (N)              19398   6282   3199    74   7408
Pronoun (P)           10861    373     64   581   4111
Conjunction (C)        8554     32     32     2   8554
Preposition (S)        7991     86     82     6   7987
Adjective (A)          7717   4063   1943   167   1207
Adverb (R)             6681    790    786     3   4479
Particle (Q)           3237     41     41     1   3237
Numeral (M)            1082    193    112    80    511
Abbreviation (Y)         60     14     14     1     60
Interjection (I)         47      7      7     1     47
Residual (X)              1      1      1     1      1
Total (*)             90792  16401   7902  1010  39007
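As a quick check, the baseline percentages quoted just above follow from dividing the ‘=’ column by the token counts:

\[ 7408 / 19398 \approx 0.38, \qquad 1207 / 7717 \approx 0.16 . \]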
The Slovene Orwell also exists in a format that contains all the possible interpretations (MSDs, lemmas) for each word form in the corpus. This version was also used in the experiment, to train the morphological analyser and to determine the unknown words in the validation set; we return to this issue below. We took Parts I – III of “1984” as the training set, and the Appendix of the novel, comprising approx. 15% of the text, as the validation set. It should be noted that the Appendix, entitled “The Principles of Newspeak” has quite a different structure and vocabulary than the body of the book; it therefore represents a rather difficult validation set, even though it comes from the same text as the training part. The main emphasis of the experiments we performed is on the Slovene nouns and adjectives in the positive degree. The reason for this is that nouns and adjectives represent the majority of unknown words; the other parts of speech are either closed, i.e. can be exhaustively listed in the lexicon, or, bar verbs, do not inflect. The reason for limiting the adjectives degree to positive only is
similar: adjectives that form the other two degrees (comparative and superlative) also represent a closed class of words. To set the context, we give in Table 4 the distribution of nouns and positive adjectives in the dataset and in its training and validation parts, with the meaning of the columns being the same as in Table 3.

Table 4. The distribution of nouns and adjectives in the entire dataset, the training and the validation set.
Source            Category         Token   Type  Lemma  MSD     =
Entire dataset    Noun (N)         19398   6282   3199   74  7408
                  Adjective (A)     7462   3932   1936  121  1207
                  Both (*)         26860  10214   5135  195  8615
Training set      Noun (N)         18438   6043   3079   74  7049
                  Adjective (A)     7019   3731   1858  120  1124
                  Both (*)         25457   9774   4937  194  8173
Validation set    Noun (N)           960    533    379   51   359
                  Adjective (A)      443    347    245   62    83
                  Both (*)          1403    880    624  113   442

2.1
The Lexical Training Set
As was mentioned, the training set for morphological analysis was not the disambiguated body of the book, but rather its undisambiguated, lexical version, in which each word form is annotated with all its possible MSDs and lemmas. This represents a setting in which lexical look-up has been performed, but the text has not yet been tagged, i.e. disambiguated. The lexical training set thus contains more MSDs and lemmas per word form than does the disambiguated corpus. For a comparison with the disambiguated corpus data, we give in Table 5 the quantities for nouns and adjectives in the lexical training set.

Table 5. The distribution of nouns and adjectives in the lexical training set.
Category         Entry  WordF  Lemma  MSD
Noun (N)         15917   6596   3382   85
Adjective (A)    24346   4796   2356  157
Both (*)         40263  11392   5738  242
The first column in the Table gives the number of different triplets of word form, lemma and MSD; the second column represents the number of different word forms in the lexical training set, the third the number of different lemmas and the fourth the number of MSDs. We can see that, on the average, a lemma
has two different word forms, that a noun word form is 2.4 times ambiguous, while adjectives are 5 times ambiguous. 2.2
Unknown Words
As our experiments centre around unknown words, this notion also has to be defined: we take as unknown those nouns and adjectives that appear in the validation corpus, but whose lemma does not appear in the lexical training set. It should be noted that this excludes ‘half-unknown’ words, which do share a lemma, but not a word form token. With this strict criterion, Table 6 gives the numbers for the unknown nouns and positive adjectives in the Appendix.

Table 6. The distribution of unknown nouns and adjectives in the validation set.
Category         Token  Type  Lemma  MSD    =
Noun (N)           187   144    127   37   85
Adjective (A)       92    82     72   31   26
Both (*)           279   226    199   68  111

3
Morphological Analysis
This section describes how the lexical training set was used to learn rules for morphological analysis of Slovene nouns and adjectives. For this purpose, we used an inductive logic programming (ILP) system that learns first-order decision lists, i.e. ordered sets of rules. We first explain the notion of first-order decision lists on the problem of synthesis of the past tense of English verbs, one of the first examples of learning morphology with ILP (Mooney & Califf, 1995). We then lay out the ILP formulation of the problem of learning rules for morphological analysis of Slovene nouns and adjectives and describe how it was addressed with the ILP system Clog. The induction results are illustrated for an example MSD. We finally discuss the evaluation of the learned rules on the evaluation set. 3.1
Learning Decision Lists
The ILP formulation of the problem of learning rules for the synthesis of past tense of English verbs considered in (Mooney & Califf, 1995) is as follows. A logic program has to be learned defining the relation past(PresentVerb,PastVerb), where PresentVerb is an orthographic representation of the present tense form of a verb and PastVerb is an orthographic representation of its past tense form. PresentVerb is the input and PastVerb the output argument. Given are examples of input/output pairs, such as past([b,a,r,k],[b,a,r,k,e,d]) and past([g,o],[w,e,n,t]). The program for the relation past uses the predicate
split(A,B,C) as background knowledge: this predicate splits a list (of letters) A into two lists B and C. Given examples and background knowledge, Foidl (Mooney & Califf, 1995) learns a decision list defining the predicate past. A decision list is an ordered set of rules: rules at the beginning of the list take precedence over rules below them and can be thought of as exceptions to the latter. An example decision list defining the predicate past is given in Table 7.

Table 7. A first-order decision list for the synthesis of past tense of English verbs.
past([g,o],[w,e,n,t]) :- !.
past(A,B) :- split(A,C,[e,p]), split(B,C,[p,t]), !.
past(A,B) :- split(B,A,[d]), split(A,C,[e]), !.
past(A,B) :- split(B,A,[e,d]).
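The text uses split/3 as background knowledge but does not spell out its definition; a plausible definition (our assumption) is simply list concatenation via append/3, which makes the decision list in Table 7 directly executable:

  % Assumed definition of the background predicate: split(Whole, Front, Back)
  % holds when Whole is the concatenation of Front and Back.
  split(Whole, Front, Back) :- append(Front, Back, Whole).

  % With this definition the decision list behaves as described, e.g.:
  % ?- past([s,k,a,t,e], P).   gives P = [s,k,a,t,e,d]   (third rule: add -d after -e)
  % ?- past([s,l,e,e,p], P).   gives P = [s,l,e,p,t]     (second rule: -ep becomes -pt)
  % ?- past([g,o], P).         gives P = [w,e,n,t]       (first, most specific rule)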
The general rule for forming past tense is to add the suffix ‘-ed’ to the present tense form, as specified by the default rule (last rule in the list). Exceptions to these are verbs ending on ‘-e’, such as ‘skate’, where ‘-d’ is appended, and verbs ending in ‘-ep’, such as ‘sleep’, where the ending ‘-ep’ is replaced with ‘-pt’. These rules for past tense formation are specified as exceptions to the general rule, appearing before it in the decision list. The first rule in the decision list specifies the most specific exception: the past tense form of the irregular verb ‘go’ is ‘went’. Our approach is to induce rules for morphological analysis in the form of decision lists. To this end, we use the ILP system Clog (Manandhar et al., 1998). Clog shares a fair amount of similarity with Foidl (Mooney & Califf, 1995): both can learn first-order decision lists from positive examples only — an important consideration in NLP applications. Clog inherits the notion of output completeness from Foidl to generate implicit negative examples (see (Mooney & Califf, 1995)). Output completeness is a form of closed world assumption which assumes that all correct outputs are given for each given combination of input arguments’ values present in the training set. Experiments show that Clog is significantly more efficient than Foidl in the induction process. This enables Clog to be trained on more realistic datasets, and therefore to attain higher accuracy. 3.2
Learning Rules for Morphological Analysis
We formulate the problem of learning rules for morphological analysis of Slovene nouns and adjectives in a similar fashion to the problem of learning the synthesis of past tense of English verbs. We have used Clog earlier to generate rules for synthesis and analysis of nouns and adjectives for English, Romanian, Czech, Slovene, and Estonian (Man-
andhar et al., 1998). In the current experiment, we re-use the rules learned for the analysis of Slovene nouns and adjectives. Triplets are extracted from the training corpus, consisting of the word form itself, and the lexical, undisambiguated lemmas with their accompanying MSDs, thus using a setting similar to the one prior to tagging. The lexical training set is used to obtain the word forms and their undisambiguated lemmas and MSDs. Each triplet is an example of analysis of the form msd(orth,lemma). Within the learning setting of inductive logic programming, msd(Orth,Lemma) is a relation or predicate, that consist of all pairs (word form, lemma) that have the same morphosyntactic description. Orth is the input and Lemma the output argument. A set of rules has to be learned for each of the msd predicates. Encoding-wise, the MSD’s part-of-speech is decapitalised and hyphens are converted to underscores. The word forms and lemmas are encoded as lists of characters, with non-ASCII characters encoded as SGML entities. In this way, the generated examples comply with Prolog syntax. For illustration, the triplet ˇclanki /ˇclanek/Ncmpn gives rise to the following example: n0mpn([ccaron,l,a,n,k,i],[ccaron,l,a,n,e,k]). Certain attributes have (almost) no effect on the inflectional behaviour of the word. We generalise over their values in the predicates, and indicate this by a 0 for the value of the vague attribute, as seen above for the collapsing of proper and common nouns (Nc, Np) to n0. This gives rise to generalised MSDs, such as n0mpn above. For the complete noun and adjective paradigms, where we have all together 242 MSDs, we find that Slovene needs 108 generalised MSDs, 54 for nouns (85 MSDs) and 54 for adjectives (157). Each generalised MSD is a target predicate to be learned. Examples for these 108 predicates are generated from the training lexicon as described above. Instead of Foidl’s predicate split/3, the predicate mate/6 is used as background knowledge in Clog. mate generalises split to deal also with prefixes, and allows the simultaneous specification of the affixes for both input arguments. As Slovene inflection only concern the endings of words, the prefix arguments will be empty lists, and the form mate that will be used corresponds to the following definition: mate(W1,W2,[],[],Y1,Y2) :- split(W1,X,Y1), split(W2,X,Y2). As an example, consider the set of rules induced by Clog for the particular task of analysing the genitive singular of Slovene feminine nouns. The training set for this concept contained 608 examples, from which Clog learned 13 rules of analysis. Nine of these were lexical exceptions, and are not interesting in the context of unknown word lemmatisation. We list the four generalisations in Table 8. From the bottom up, the first rule describes the formation of genitive for feminine nouns of the canonical first declension, where the lemma ending -a is replaced by -e to obtain the genitive. The second rule deals with the canonical second declension where i is added to the nominative singular (lemma) to obtain
Table 8. A first-order decision list for the analysis of Slovene feminine nouns in the singular genitive declination.
n0fsg(A,B):-mate(A,B,[],[],[t,v,e],[t,e,v]),!.
n0fsg(A,B):-mate(A,B,[],[],[e,z,n,i],[e,z,e,n]),!.
n0fsg(A,B):-mate(A,B,[],[],[i],[]),!.
n0fsg(A,B):-mate(A,B,[],[],[e],[a]),!.
the genitive. The third rule attempts to cover nouns of the second declension that exhibit a common morpho-phonological alteration in Slovene, the schwa elision. Namely, if a schwa (weak -e-) appears in the last syllable of the word when it has the null ending, this schwa is dropped with non-null endings: bolezen-0, but bolezn-i. Finally, the topmost rule models a similar case with schwa elision, coupled with an ending alternation, which affects only nouns ending in -ev.
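With the suffix-only clause for mate/6 quoted above (and split/3 defined as in the earlier sketch), the induced decision list of Table 8 can be queried directly. The queries below are our illustration; ‘bolezen’ is the example from the text, while ‘lipa’ is an added canonical first-declension noun.

  % Assumed background knowledge (split/3 as before; mate/6 as quoted in the text):
  split(Whole, Front, Back) :- append(Front, Back, Whole).
  mate(W1, W2, [], [], Suffix1, Suffix2) :-
      split(W1, Stem, Suffix1),
      split(W2, Stem, Suffix2).

  % ?- n0fsg([b,o,l,e,z,n,i], L).   gives L = [b,o,l,e,z,e,n]   (schwa elision rule)
  % ?- n0fsg([l,i,p,e], L).         gives L = [l,i,p,a]          (default -e to -a rule)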
3.3
Evaluating the Morphological Rules
The rules for morphological analysis learned by Clog were first tested independently of the tagger on the Appendix of the novel ‘1984’. For each token in the Appendix, the correct (disambiguated) MSD tag is used and the appropriate msd predicate is called with the token as an input argument. An error is reported unless the returned output argument is equal to the correct lemma as specified by the ‘1984’ lexicon (of which the training lexicon is a subset). Table 9 summarises the results.

Table 9. Validation results for the morphological analyser on all words, known and unknown words.
                    All                  Known                Unknown
             Acc.   Correct/Err   Acc.   Correct/Err   Acc.   Correct/Err
Nouns       97.5%      936/24    99.1%      766/7     90.9%      170/17
Adjectives  97.3%      431/12    96.6%     339/12      100%        92/0
Both        97.4%     1367/36    98.3%    1105/19     93.9%      262/17
It might come as a surprise that the accuracy on known words is not 100%. However, the errors on known words are on word forms that do not appear in the training corpus. Only word forms that appear in the training corpus are used to learn the rules for morphological analysis together with the corresponding undisambiguated sets of MSDs. The training lexicon is used to provide the latter, and not all word forms of the lemmas that appear in the training corpus.
4
Tagging for Morphosyntax
Syntactic wordclass tagging (van Halteren, 1999), often referred to as part-ofspeech tagging has been an extremely active research topic in the last decade. Most taggers take a training set, where previously each token (word) had been correctly annotated with its part-of-speech, and learn a model of the language This model enables them to predict the parts-of-speech for words in new texts to a greater or lesser degree. Some taggers learn the complete necessary model from the training set, while others must make use of background knowledge, in particular a morphological lexicon. The lexicon contains all the possible morphological interpretations of the word forms, i.e. their ambiguity classes. The task of the tagger is to assign the correct interpretation to the word form, taking context into account. For our experiments, we needed an accurate, fast, flexible and robust tagger that would accommodate the large Slovene morphosyntactic tagset. Importantly, it also had to be able to deal with unknown words, i.e. word forms not encountered in the training set or background lexicon. In an evaluation exercise (Dˇzeroski, Erjavec, & Zavrel, 1999) we tested several different taggers on the Slovene Orwell corpus. They were: the Hidden Markov Model (HMM) tagger (Cutting, Kupiec, Pedersen, & Sibun, 1992; Steetskamp, 1995), the Rule Based Tagger (RBT) (Brill, 1995), the Maximum Entropy Tagger (MET) (Ratnaparkhi, 1996), and the Memory-Based Tagger (MBT) (Daelemans, Zavrel, Berck, & Gillis, 1996). After this experiment was performed, a new tagger became available, called TnT (Brants, 2000). It works similarly to our original HMM tagger (Steetskamp, 1995) but is a more mature implementation. We therefore substituted TnT for HMM in the evaluation. We also trained a tagger using the ILP system Progol. On English, this approach attains accuracies comparable to other state-of-the-art taggers (Cussens, 1997). Unambiguous tagging of Slovene data was less satisfactory (Cussens, Dˇzeroski, & Erjavec, 1999) (although the tagger turned out to be a very good validation aid, as it can identify errors of manual tagging). We have thus omitted this tagger from the experimental comparison. The comparative evaluation of RBT, MET, MBT and TnT was performed by taking the body of ‘1984’ and using 90% of randomly chosen sentences as the training set, and 10% as the validation set. The evaluation took into account all tokens, words as well as punctuation. While (Dˇzeroski et al., 1999) considered several different tagsets, here we use the ‘maximal’ tagset, where tags are full MSDs. The results indicate that accuracy is relatively even over all four taggers, at least for known words: the best result was obtained by MBT (93.6%), followed by RBT (92.9%), TnT (92.2%) and MET (91.6%). The differences in tagging accuracies over unknown words are more marked: here TnT leads (67.55%), followed by MET (55.92%), RBT (45.37%), and MBT (44.46%). Apart from accuracy, the question of training and testing speed is also paramount; here RBT was by far the slowest (3 days for training), followed by MET, with MBT and TnT being very fast (both less than 1 minute).
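TnT is a second-order (trigram) hidden Markov model tagger. In outline (this is the standard formulation from Brants (2000), summarised here; it is not spelled out in this paper), it selects the tag sequence

\[ \hat{t}_1 \ldots \hat{t}_n \;=\; \arg\max_{t_1 \ldots t_n} \; \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2}) \, P(w_i \mid t_i), \]

with the contextual probabilities estimated from unigram, bigram and trigram counts (such as those in Table 10) and smoothed by linear interpolation.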
Table 10. Excerpts from the a) n-gram and b) lexicon files generated by the TnT tagger.

a) An excerpt from the n-gram file generated by TnT.
Vcps-sma  544
    Vcip3s--n  82
        Afpmsnn  17
        Aopmsn    2
        Ncmsn    12
        Npmsn     1
        Css       2
        Afpnpa    1
        Q         3
...

b) An excerpt from the lexicon file generated by TnT.
...
juhe      2   Ncfsg 2
julij     1   Npmsn 1
julija   59   Npfsn 58   Npmsa--y 1
julije    4   Npfsg 4
juliji   10   Npfsd 10
julijin   4   Aspmsa--n 2   Aspmsn 2
...
Given the above assessment, we chose for our experiment the TnT tagger: it exhibits good accuracy on known words, excellent accuracy on unknown words, is robust and efficient. In addition, it is easy to install and run, and incorporates several methods of smoothing and of handling unknown words. 4.1
Learning the Tagging Model
The disambiguated body of the novel was first converted to TnT training format, identical to our tabular file, but without the lemma; each line contains just the token and the correct tag. For word tags we used their MSDs, while punctuation marks were tagged as themselves. This gives us a tagset of 1024, comprising the sentence boundary, 13 punctuation tags, and the 1010 MSDs. Training the TnT tagger produces a table of MSD n-grams (n=1,2,3) and a lexicon of word forms together with their frequency annotated ambiguity classes. The n-gram file for our training set contains 1024 uni-, 12293 bi-, and 40802 trigrams, while the lexicon contains 15786 entries. Example stretches from the n-gram and lexicon file are given in Table 10. The excerpt from the n-gram file can be interpreted as follows. The tag Vcps-sma appeared 544 times in the training corpus. It was followed by the tag
Vcip3s--n 82 times. The triplet Vcps-sma, Vcip3s--n, Afpmsnn appeared 17 times. The excerpt from the lexicon file can be interpreted as follows. The word form juhe appeared in the corpus twice and was tagged Ncfsg in both cases. The word form julijin appeared 4 times and was tagged twice as Aspmsa--n and twice as Aspmsn. The ambiguity class of the word form julijin is thus the tagset {Aspmsa--n,Aspmsn}. We did not make use of any background lexicon. We left the smoothing parameters of TnT at their default values. Experiments along these lines could well improve the tagging model. 4.2
Evaluating the Tagger
We then tested the performance of the TnT tagger on the Appendix validation set. The results are summarised in Table 11.

Table 11. Validation results for the TnT tagger.
                  Accuracy  Correct/Err
All tokens           83.7%     4065/789
All words            82.5%     3260/692
Known words          84.3%     3032/565
Unknown words        64.2%      228/127
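The accuracy column is simply Correct / (Correct + Err); for instance, for all tokens

\[ 4065 / (4065 + 789) \approx 0.837 . \]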
We can see that the overall tagging accuracy is 83.7%, which is less than in the randomly partitioned training/testing sets and underlines the intuition that the Appendix is quite different from the rest of the book. This is somewhat reflected also in the accuracies on unknown words, which are here 64.2%, but were 67.55% on the random fold. In Table 12 we concentrate only on nouns and adjectives. Here the accuracy is even somewhat lower, bottoming out at 58.3% for unknown nouns.

Table 12. Validation results for the TnT tagger on nouns and adjectives.
                    All                    Known                  Unknown
            Accuracy  Correct/Err  Accuracy  Correct/Err  Accuracy  Correct/Err
Nouns          73.8%     708/252      77.5%     599/174      58.3%      109/78
Adjectives     62.3%     276/167      60.7%     213/138      68.4%       63/29
Both           70.1%     984/419      72.2%     812/312      61.6%     172/107
The above results raise fears that cascading the tagger and the analyser might not give much better results than simply assigning each word form as the lemma, but as the following section will show, this is not quite the case.
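Schematically, the cascade evaluated in the next section amounts to the following Prolog sketch. The helper predicate name generalised_msd/2 (the MSD-to-predicate mapping of Section 3.2) is ours; the actual experiment pipes TnT output into the Clog-induced analysis rules.

  % Sketch of the tagger-then-analyser cascade (hypothetical helper
  % generalised_msd/2; the MSD is the tag assigned by TnT in context).
  lemmatise_unknown(WordForm, MSD, Lemma) :-
      generalised_msd(MSD, Pred),          % e.g. 'Ncfsg' maps to n0fsg (Section 3.2)
      Goal =.. [Pred, WordForm, Lemma],
      call(Goal).

  % Example (with the analysis rules of Table 8 loaded):
  % ?- lemmatise_unknown([b,o,l,e,z,n,i], 'Ncfsg', L).
  % L = [b,o,l,e,z,e,n].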
5
Experiment and Results
The previous sections explained the ‘1984’ dataset, and the training and separate testing of the learned analyser and tagger on the validation set. This section gives the results where the two are combined to predict the correct lemma of (unknown) words in the validation and testing sets. We describe two experiments; one is on the Appendix of the ‘1984’ novel, the other on a Slovenian/EU legal document. 5.1
Lemmatisation of the Validation Set
The first experiment concerns the validation set, i.e. the Appendix of the novel. The Appendix was first tagged with TnT, following which the predicted tags were used for morphological analysis. For convenience, we first summarise the relevant data in the validation set in Table 13.

Table 13. Distribution of words in the validation set (Appendix of the novel).
                      All                  Known                Unknown
Category        Token  Type Lemma    Token  Type Lemma    Token  Type Lemma  MSD    =
All words        3952  1557  1073     3597  1276   828      355   281   245    -    -
Nouns             960   533   379      773   389   252      187   144   127   37   85
Adjectives        443   347   245      351   265   173       92    82    72   31   26
Both             1403   880   624     1124   654   425      279   226   199   68  111
The Table gives for each of all, known and unknown words, nouns and adjectives, the number of all tokens in the Appendix, the number of different word forms and the number of lemmas. ‘Unknown’-ness was computed against the lexically tagged body of the novel; the words whose lemma is not in the training corpus are unknown. The Table shows that 58% of all lemmas, and 81% of unknown lemmas are nouns or adjectives. Word Forms of an unknown noun/adjective lemma, on average, appear 1.2 times in the text, and the word form and lemma are different in 60% of the cases. We then tested the combination tagger/analyser on the unknown nouns and adjectives. Because we take the part of speech of the unknown words as given, our assesment does not take into account errors where the tagger classifies an unknown word as a noun or adjective, even though the word in fact belongs to a different part of speech. If the analyser then attempts to lemmatise these words, the results are wrong, except for isolated lucky guesses. In the validation set, there were 59 words misstaged as a noun or adjective, which is 1.5% of all the words or 4% of the total number of true nouns and adjectives in the Appendix. As was explained in the preceding sections, tagging is 87.5% correct on known and 61.2% on unknown noun and adjective tokens, while lemmatisation is cor-
rect 98.3% and 93.9% respectively. When the two methods are combined, the accuracy is as given in Table 14.

Table 14. Lemmatisation results on the validation set.
                    All                    Known                  Unknown
            Accuracy  Correct/Err  Accuracy  Correct/Err  Accuracy  Correct/Err
Nouns          91.7%      880/80      95.4%      738/35      75.9%      142/45
Adjectives     87.6%      388/55      88.0%      309/42      85.9%       79/13
Both           90.4%    1268/135      93.1%     1047/77      79.2%      221/58
The accuracy of lemmatisation is thus 79.2%. A closer look at the errors reveals that the majority is due to the fact that the TnT tagger tags a noun or an adjective with the wrong part of speech. This happens in 78 cases (58% of the errors); in 60 of them, the assigned PoS in not a noun or adjective, and in 18 a noun is misstagged as an adjective or vice-versa. Obviously, tagger performance is the limiting factor in the achieved accuracy although the lemmatisation often manages to recover from errors of tagging. That is, in a large number of cases (245 known / 53 unknown), the predicted lemma of the word was correct, even though the assigned MSD was wrong. In fact, this is not surprising, as the unknown word guesser in TnT builds a suffix tree that helps it in determining the ambiguity classes of unknown words. Thus, TnT will often make an error when tagging a form that is syncretic to other forms, i.e. is identical in orthography, but has different inflectional features in its MSDs. For lemmatisation, it does not matter which of the syncretic MSDs is given, as they resolve to the same lemma. While the errors are usually caused by the tagger/analyser tandem returning the wrong lemma, there are some cases (11, 8 known / 3 unknown) where the analyser simply fails, i.e. does not return a result. Even though in two cases TnT correctly tagged the word in question, the others, all of them unknown words, are examples of misstagged words. This means that the analyser can also function as a validation component, rejecting misstagged words. 5.2
Lemmatisation of a Slovenian/EU Legal Document
While the Appendix of the ‘1984’ novel, used for validation, is quite different from the body of the book, which was used for training, we nevertheless wanted to assess the results on a truly different text type, and thus gauge the robustness and practical applicability of the method. For this, we took the Slovene version of the text fully titled the “Europe Agreement Establishing an Association Between the European Communities and their Member States, Acting within the Framework of the European Union, of the One Part, and the Republic of Slovenia, of the Other Part June 10. 1996 Luxembourg”. The text was collected and encoded as one of the 15 components of the one million word ELAN Slovene-English parallel corpus (Erjavec, 1999). This text
is encoded in a similar manner as the ’1984’, and consists of 1,191 translation segments, which roughly correspond to sentences. It is tokenised into 12,049 words and 2,470 punctuation marks. However, the text had, in the ELAN release, not yet been tagged or lemmatised. In order to be used as a testing set, the corpus had to be at least reliably lemmatised. This was achieved in two steps: first, the company Amebis d.o.o. kindly lemmatised the text with words known to their morphological analyser BesAna, which includes a comprehensive lexicon of the Slovene language. Here each known word was ambiguously lemmatised; we then semi-interactively, via a series of filters and manual edits, disambiguated the lemmas. This produced the text in which words that are known to BesAna are unambiguously and, for the most part, correctly lemmatised, while those unknown do not have a lemma. The latter do contain interesting terms, but they are mostly abbreviations, foreign words, dates, typos, and similar. The identification of such entities is interesting in its own right, and is usually referred to as ‘named entity extraction’. However, this task is not directly connected to lemmatisation. We therefore chose to test the system on those words (nouns and adjectives) which were lemmatised but, again, did not appear in the training set. This gives us a fair approximation of the distribution of new inflected words in texts. With these remarks, Table 15 gives the main characteristics on this testing set. The Table shows that the number of unknown noun and adjective lemmas is about three times greater than in the Appendix.
Table 15. Distribution of words in the Slovenian/EU legal document.
                                  Token   Type  Lemma
Known                             12049   3407   1672
Unknown                            1458    863    644
Unknown nouns and adjectives       1322    796    595
For testing on this corpus, we used the same tagging and analysis models as before; we first tagged the complete text, then lemmatised the unknown nouns and adjectives. Here we, of course, do not have an evaluation on the correctness of the tagging procedure or of the morphological analysis in isolation, as we are lacking the correct MSDs. Table 16 summarizes the results of testing the tagger/analyser tandem. It contains the accuracy results for unknown noun and adjective tokens, as well as for their word types and lemmas. For each of the data classes the Table contains the number of all items, and the number of correctly lemmatised ones, also as a percentage of the total. The mislematised cases are further subdivided into those that were simply wrong, those that, for types and lemmas sometimes returned the correct lemmatisation, and sometimes the incorrect one, and those where the analyser failed to analyse the word.
Table 16. Lemmatisation results on the Slovenian/EU legal document.
             Token   Type  Lemma
Accuracy     81.3%  79.8%  75.6%
All           1322    796    595
Correct       1075    635    450
Error          247    161    145
  Wrong        195    105     73
  Mixed          -     38     62
  Fail          52     18     10
The Table shows that the per-token accuracy of 81.3% is in fact slightly higher than on the Appendix (79.2%), and shows the method to be robust. The analysis of the errors per word form type and lemma shows lower accuracy, but also points the way to improving the results. Instead of lemmatising tokens, i.e. each word in the text separately, the text can be preprocessed first, to extract the lexicon of unknown words. This would give us the equivalent of the Types column, where we can see that a significant portion of the errors are either ‘mixed’ cases or failures of the analyser. A voting regime on the correct lemmatisation can be applied to the mixed cases, while the failures, as was discussed above, usually point to errors in tagging.
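The type-level voting idea mentioned above can be sketched as follows; this is purely illustrative and not part of the reported experiment.

  % Pick the lemma proposed most often for the tokens of one unknown word form.
  vote_lemma(Proposals, Best) :-
      findall(N-L, (member(L, Proposals), occurrences(L, Proposals, N)), Pairs),
      msort(Pairs, Sorted),
      reverse(Sorted, [_-Best|_]).

  occurrences(_, [], 0).
  occurrences(X, [X|T], N) :- !, occurrences(X, T, M), N is M + 1.
  occurrences(X, [_|T], N) :- occurrences(X, T, N).

  % ?- vote_lemma([postelja, postelj, postelja], L).
  % L = postelja.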
6
Summary and Discussion
We have addressed the problem of lemmatisation of unknown words, in particular nouns and adjectives, in Slovene texts. This is a core normalisation step for many language processing tasks expected to deal with unrestricted texts. We approached this problem by combining a morphological analyser and a morphosyntactic tagger. The language models for both components were inductively learned from a previously tagged corpus, in particular, from the Slovene translation of the ‘1984’ novel. We tested the combination of the learned analyser and tagger on the Appendix of the ‘1984’ novel, as well as on a completely different text type, namely a Slovenian/EU legal document. In both cases, the overall accuracy of lemmatisation of unknown nouns and adjectives is about 80%. Even at this level of accuracy, the lemmatisation approach proposed can be useful as an aid to the creation and updating of language resources (lexica) from language corpora. To our knowledge, there are no published results for lemmatisation of unknown words in Slovene or even other Slavic languages, so it is difficult to give a comparable evaluation of the results. The combination of the morphological analyser and the tagger is performed in a novel way. Typically, the results of morphological analysis would be given as input to a tagger. Here, we give the results of tagging to the morphological
analyser: an unknown word form appearing in a text is passed on to the analyser together with its morphosyntactic tag produced by the tagger. Our method relies heavily on the unknown word guessing module of the tagger. While the TnT tagger has superior performance on unknown words as compared to other taggers, it, in the Appendix, still reaches only 64%, while the accuracy of the analyser is 94%. Given the combined accuracy of 79%, it is obvious that some of the errors committed by the tagger are not fatal: if the morphosyntactic tag produced by the tagger is within the inflectional ambiguity class of the word form, then the analyser should get the lemma right. There are some obvious directions by which to improve the currently achieved accuracy. In our experiments here we used only the ‘1984’ corpus for learning the language model. While enlarging the rather small training corpus is the obvious route, annotating corpora is a very time consuming task. However, plugging a larger lexicon into the system, which would at least cover all the closed word classes would be feasible and should improve the accuracy of tagging. Another extension to be considered is the addition of verbs, as the next largest open class of words. Another avenue of research would be to combine the morphological analyser and the tagger in a more standard fashion, which to an extent is already done in the TnT tagger. Here we use morphological analyser first to help the tagger postulate the ambiguity classes for unknown words. While this proposal might sound circular, one can also say that the lemmatiser and the tagger each impose constraints on the context dependent triplet of word-form, lemma and MSD. It is up to further research to discover in which way such constraints are best combined. It would also be interesting to compare our approach to morphological analysis, where synthesis rules are learned separately for each morphosyntactic description (MSD) to an approach where rules are learned for all MSDs of a word class together. Acknowledgements Thanks are due to Amebis, d.o.o. for ambigously lemmatising the ELAN corpus. Thanks also to Suresh Manadhar for earlier cooperation on learning morphology and for providing us with Clog and to Thorsten Brants for providing TnT.
References
1. Brants, T. (2000). TnT – a statistical part-of-speech tagger. In Proceedings of the Sixth Applied Natural Language Processing Conference ANLP-2000, Seattle, WA. http://www.coli.uni-sb.de/~thorsten/tnt/.
2. Brill, E. (1995). Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4), 543–565.
3. Chanod, J., & Tapanainen, P. (1995). Creating a tagset, lexicon and guesser for a French tagger. In Proceedings of the ACL SIGDAT workshop From Text to Tags: Issues in Multilingual Language Analysis Dublin. 4. Cussens, J. (1997). Part-of-speech tagging using Progol. In Proceedings of the 6th International Workshop on Inductive Logic Programming, pp. 93–108 Berlin. Springer. 5. Cussens, J., Dˇzeroski, S., & Erjavec, T. (1999). Morphosyntactic tagging of Slovene using Progol. In Dˇzeroski, S., & Flach, P. (Eds.), Inductive Logic Programming; 9th International Workshop ILP-99, Proceedings, No. 1634 in Lecture Notes in Artificial Intelligence, pp. 68–79 Berlin. Springer. 6. Cutting, D., Kupiec, J., Pedersen, J., & Sibun, P. (1992). A practical part-of-speech tagger. In Proceedings of the Third Conference on Applied Natural Language Processing, pp. 133–140 Trento, Italy. 7. Daelemans, W., Zavrel, J., Berck, P., & Gillis, S. (1996). MBT: A memorybased part of speech tagger-generator. In Ejerhed, E., & Dagan, I. (Eds.), Proceedings of the Fourth Workshop on Very Large Corpora, pp. 14–27 Copenhagen. 8. Dimitrova, L., Erjavec, T., Ide, N., Kaalep, H.-J., Petkeviˇc, V., & Tufi¸s, D. (1998). Multext-East: Parallel and Comparable Corpora and Lexicons for Six Central and Eastern European Languages. In COLING-ACL ’98, pp. 315–319 Montr´eal, Qu´ebec, Canada. 9. Dˇzeroski, S., Erjavec, T., & Zavrel, J. (1999). Morphosyntactic Tagging of Slovene: Evaluating PoS Taggers and Tagsets. Research report IJSDP 8018, Joˇzef Stefan Institute, Ljubljana. http://nl.ijs.si/lll/bib/dzerzareport/. 10. Erjavec, T. (1999). The ELAN Slovene-English Aligned Corpus. In Proceedings of the Machine Translation Summit VII, pp. 349–357 Singapore. http://nl.ijs.si/elan/. 11. Erjavec, T., & (eds.), M. M. (1997). Specifications and notation for lexicon encoding. MULTEXT-East final report D1.1F, Joˇzef Stefan Institute, Ljubljana. http://nl.ijs.si/ME/CD/docs/mte-d11f/. 12. Erjavec, T., Lawson, A., & Romary, L. (1998). East meets West: A Compendium of Multilingual Resources. CD-ROM. ISBN: 3-922641-46-6. 13. Manandhar, S., Dˇzeroski, S., & Erjavec, T. (1998). Learning multilingual morphology with CLOG. In Page, D. (Ed.), Inductive Logic Programming; 8th International Workshop ILP-98, Proceedings, No. 1446 in Lecture Notes in Artificial Intelligence, pp. 135–144. Springer. 14. Mikheev, A. (1997). Automatic rule induction for unknown-word guessing. Computational Linguistics, 23 (3), 405–424. 15. Mooney, R. J., & Califf, M. E. (1995). Induction of first-order decision lists: Results on learning the past tense of English verbs. Journal of Artificial Intelligence Research, pp. 1–24. 16. Ratnaparkhi, A. (1996). A maximum entropy part of speech tagger. In Proc. ACL-SIGDAT Conference on Empirical Methods in Natural Language Processing, pp. 491–497 Philadelphia.
17. Sperberg-McQueen, C. M., & Burnard, L. (Eds.). (1994). Guidelines for Electronic Text Encoding and Interchange. Chicago and Oxford.
18. Steetskamp, R. (1995). An implementation of a probabilistic tagger. Master's thesis, TOSCA Research Group, University of Nijmegen, Nijmegen. 48 p.
19. van Halteren, H. (Ed.). (1999). Syntactic Wordclass Tagging. Kluwer.
Achievements and Prospects of Learning Word Morphology with Inductive Logic Programming
Dimitar Kazakov
Department of Computer Science, University of York, Heslington, York YO10 5DD, UK
[email protected]
Abstract. This article presents an overview of existing ILP and non-ILP approaches to word morphology learning, and sets targets for future research. The article claims that new challenges to the ILP community with more appeal to computational linguists should be sought in a whole new range of unexplored learning tasks in which ILP would have to make a more extensive use of relevant linguistic knowledge, and be more closely integrated with other learning techniques for data preprocessing.
1
Introduction
Matthews (1974) defines morphology as “that branch of linguistics which is concerned with the ‘forms of words’ in different uses and constructions”. According to another definition by the same author, morphology is “the study of the grammatical structure of words and the categories realized by them” (Matthews, 1997). In order to compare words and study their differences, words are often segmented into a number of constituents. Word segmentation — here the expression is used as related to word morphology, and not to tokenisation, where lexical constituents are identified in the text — is an important subtask of Natural Language Processing (NLP) with a range of applications from hyphenation to more detailed morphological analysis and text-to-speech conversion. Several approaches aiming to learn word morphology have been published recently. In many of them, annotated data is required. For instance, Brill (1994) uses a text corpus tagged with the corresponding part of speech (PoS) to learn morphological rules for the prediction of the PoS of unknown words; van den Bosch, Daelemans, and Weijters (1996) make use of morphologically analysed words to derive a text-to-speech conversion theory in an inductive fashion. There are also algorithms for unsupervised learning of word morphology that use plain text (Deligne, 1996) or lists of words (Harris, 1955; Pirelli, 1993; Yvon, 1996b) as training data. Inductive Logic Programming (ILP) aims at the inductive learning of concepts from examples and background knowledge in a first-order logic framework (Lavraˇc & Dˇzeroski, 1994). ILP has proved to be a feasible way to learn linguistic knowledge in different domains, such as morphological analysis (Blockeel, 1994; Mooney & Califf, 1995), part-of-speech tagging (Cussens, 1997) and parsing (Zelle & Mooney, 1993; Kazakov, 1999). Unlike statistical or connectionist J. Cussens and S. Dˇ zeroski (Eds.): LLL’99, LNAI 1925, pp. 89–109, 2000. c Springer-Verlag Berlin Heidelberg 2000
approaches, these results can be easily understood and modified by a human expert. This article presents an overview of existing applications of ILP to word morphology learning, and sets targets for future research. The article claims that new challenges for the ILP community, and a higher interest from computational linguists should be sought in a whole new range of unexplored learning tasks in which ILP will be more closely integrated with other learning techniques and relevant linguistic knowledge. The exposition starts with an overview of basic concepts and approaches in word morphology and ILP. It then proceeds to describe two of the relevant aspects in machine learning of language: – training data annotation and the possibilities of its augmentation through learning – background concepts used by the learner. The above can be seen as two orthogonal dimensions defining a space of possible learning settings according to the object language in which the training data is described, and the concept language used to express the theories learned. Each learning task can then be represented as a move along either or both axes, for instance, adding word segmentation to the existing annotation of training data or learning morphology rule templates with the help of some basic linguistic concepts, such as vowels and consonants. This article describes a number of biases for pre-processing of lexical data, and suggests ways of combining the so obtained additional information with linguistic concepts of various complexity for the ILP learning of word morphology tools and theories.
2
Word Morphology
It is usually assumed that a word consists of a number of constituents which cover the entire word, do not overlap and follow each other. The last condition is generally valid for the Indo-European languages. In Arabic, however, a more complex operator than concatenation is used to combine morphemes (Matthews, 1974, p. 131). 2.1
Morphemes
The rˆ ole of the word constituents has varied between the different linguistic schools and with the development of linguistics. The word constituents can be perceived merely as differences allowing us to distinguish between two words with no particular meaning assigned to them (de Saussure, 1916). However, in what is now considered to be the classical approach in morphology, word constituents are paid somewhat more attention. Three basic assumptions are often made, which concern word constituents, the words and their phonological/orthographic representation1 (Fradin, 1994): 1
The way in which the words are actually pronounced or written.
1. There are minimal word constituents, morphemes, with which morphology operates. 2. The word is made up of morphemes which follow each other. 3. There is a set of rules that produce the actual pronunciation/spelling of each morpheme according to its context. For instance, it can be said that the word truthful is made of two morphemes, truth and ful. Also, one might want to say that the word studied consists of the morphemes study and ed where a special rule handles the change of the final ‘y’ in study into ‘i’ in the context of -ed. So, there are two different meanings of ‘morpheme’: 1. Any sequence of characters that has either grammatical or lexical meaning 2. An invariant lexical or grammatical unit realised by one or more sequences of characters. In the above definitions, based on Matthews (1997), ‘sequence of characters’ was substituted for ‘configuration of phonological units’, since a lot of NLP research aims at segmentation and morphological analysis of text, and in many languages the pronunciation is not directly reflected in the spelling. That change in terminology may raise the objection that it would make possible to split a single phonological unit represented by two or more characters in the text. However, the requirement that the sequence of characters have grammatical or lexical meaning eliminates those cases. If ‘morpheme’ is used in its second sense, it represents a certain kind of abstraction over a number of variants, called morphs. One can imagine a word represented at two levels. If the lexical level contains the morphemes included in the word, the surface level shows the actual word as formed by the concatenation of the corresponding morphs. For instance, one can speak of the pairs of morphemes dog + ‘plural morpheme’, boss + ‘plural morpheme’, leaf + ‘plural morpheme’, ‘negative morpheme’ + do which at the surface level produce dogs, bosses, leaves, undo, respectively. A distinction is usually made between derivational and inflectional morphology. The latter studies how all the forms of each word in the lexicon are produced, e.g. walk, walks, walked given the lexical entry to walk, Verb. All these word-forms share the same part of speech and meaning, and it is only the morphosyntactic features, such as Tense or Person, that change. On the other hand, derivational morphology describes transformations changing the word meaning. In this case, the PoS of the initial and resulting words may differ. For instance, adding the suffix -able to a verb can produce an adjective confirming the applicability of the action denoted by the verb. 2.2
Morphology Representation
From a practical point of view, it is important to represent a large number of word-forms and the corresponding set of categories in a more concise way than by exhaustive enumeration. Such a compact theory of word-formation has
also a generative power, enabling prediction of new word-forms from the known ones and analysis of unseen words. A certain level of complexity is required if the theory is to reflect the subtleties of the described language; however, a detailed description of language morphology can be costly and, in practice, the complexity of the model used is often chosen to effect a trade-off between its cost and performance.

Since concatenation is the basic operator to arrange morphemes into words in the majority of European languages, simple and efficient representations of word paradigms can be obtained as a combination of an inflectionally invariable stem and a set of endings. The use of several stems corresponding to different parts of the paradigm can simplify the representation, decreasing the number of endings used. The paradigms of several words (lexemes) can be further merged into larger clusters called inflectional classes.

The choice of representation also varies according to the purpose. For instance, one may be interested in modelling the idealised native speaker-hearer. Also, one may wish "to assist the second language learner by constructing an optimal pedagogical grammar, to plumb the depths of what there is to be learned about the history of the language through the method of internal reconstruction, or to predict future directions apt to occur in a language" (Bender, 1997). For instance, when a second language is studied, practical reasons are put forward, or, in Matthews's words (Matthews, 1974):

    Any language is learned by a mixture of rote-learning, rules and practice and it is the job of the language teacher to work out what combination is the most efficient.

Indeed, it is often simpler for the human learner and the NLP developer alike to cover some of the cases as exceptions by simply memorising them, rather than to use a complex set of rules needed otherwise.

2.3 Generative Phonology and Two-Level Morphology
Several approaches to morphology use the notion of abstract, lexical-level morphemes. In these formalisms, words are formed by first combining morphemes at the lexical level, and then applying rules to derive the word's surface-level representation. In the framework of generative phonology (Chomsky & Halle, 1968), the initial, lexical-level sequence of morphemes is modified by sequential application of rewriting [generative] rules, and a word can go through several intermediate stages before it takes its final, surface-level form. Some of the criticism of this approach is related to the procedural character of the rules, as they have to be applied in a certain order, and operate only in an underlying-to-surface direction. The following example, borrowed from Antworth (1991), illustrates the principles of generative phonology. Consider the rules:

Rule 1 (Vowel Raising): e becomes (is rewritten as) i if it is followed by zero or more consonants, followed by i.
Rule 2 (Palatalisation): t becomes c preceding i.
Fig. 1. Lexical vs surface level in two-level morphology: the lexical string dívEk+a is aligned, symbol by symbol, with the surface string dív∅k+a
A sample derivation of forms to which these rules apply follows.

Lexical level:   temi
Rule 1:          timi
Rule 2:          cimi
Surface level:   cimi
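To make the ordered character of such rules concrete, the two rules can be written as a small logic program that rewrites a word represented as a list of characters. This is only an illustrative sketch; the predicate names and the vowel inventory are assumptions of this example, not part of the cited formalism.

vowel(a). vowel(e). vowel(i). vowel(o). vowel(u).
consonant(C) :- \+ vowel(C).

% Rule 1 (vowel raising): e is rewritten as i when followed by zero or more
% consonants and then i.
raise([], []).
raise([e|T], [i|T2]) :- raising_context(T), !, raise(T, T2).
raise([C|T], [C|T2]) :- raise(T, T2).

raising_context([i|_]).
raising_context([C|T]) :- consonant(C), raising_context(T).

% Rule 2 (palatalisation): t is rewritten as c before i.
palatalise([], []).
palatalise([t,i|T], [c,i|T2]) :- !, palatalise(T, T2).
palatalise([C|T], [C|T2]) :- palatalise(T, T2).

% The rules must be applied in this order to reproduce the derivation above:
% ?- raise([t,e,m,i], X), palatalise(X, Y).
% X = [t,i,m,i],
% Y = [c,i,m,i]

Applying the rules in the opposite order would leave temi unaffected by palatalisation and yield timi rather than cimi, which is exactly the order-dependent, procedural behaviour discussed above.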
Koskenniemi (1983) introduces the Two-Level Morphology approach. The approach is based on two lexicons containing morphemes (roots and affixes), and a set of morphological rules. The rules establish whether a given sequence of characters at the surface level (as it appears in the text) can correspond to another given sequence of symbols used to represent the morphemes in the lexicon. The morphological rules are implemented as finite state transducers (FSTs), i.e. automata with two tapes, whose alphabets are composed of pairs of symbols. For instance, the Nominative Singular of the Czech word dívka (girl) can be represented at the lexical level as a combination of the morphemes dívEk+a. The corresponding string at the surface level is dív∅k+a (Figure 1). Pairs of corresponding symbols at the two levels have to be aligned, possibly making use of nil ('∅') symbols. One possible rule may postulate that the symbol E corresponds to the nil symbol ∅ if the right-hand side context of E is ka. There could be another rule matching E to e if the right-hand side context of E is k∅.

From a theoretical point of view, there are several aspects which make the two-level approach interesting. It makes use of morphemes close to the surface level, and it does not need intermediary descriptions which cannot be directly related to empirical observations. Furthermore, this approach is defended by its supporters as declarative, allowing the conditions governing a linguistic phenomenon to be separated from their application (Fradin, 1994). Also, two-level rules are bi-directional, i.e. they can operate in an underlying-to-surface direction (generation mode) or in a surface-to-underlying direction (recognition mode) (Antworth, 1991). The declarative and bi-directional character of the two-level approach makes a logic programming implementation both natural and easy. Another related approach worth mentioning is Kaplan and Kay's multilevel morphology (Kaplan & Kay, 1994).
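The declarative reading of two-level rules can be illustrated with a few Prolog clauses over pairs of character lists. This is a deliberately simplified sketch (diacritics and the morpheme boundary are dropped, and only the E:∅ rule is encoded); it is not Koskenniemi's rule compiler, and the predicate names are assumptions of this illustration.

% correspond(LexicalString, SurfaceString): the two strings can be paired
% symbol by symbol.  The atom '0' stands for the nil symbol.
correspond([], []).
% Rule: lexical E corresponds to surface 0 when followed by k a.
correspond(['E',k,a|LT], ['0',k,a|ST]) :- correspond(LT, ST).
% Every other symbol corresponds to itself.
correspond([C|LT], [C|ST]) :- C \== 'E', C \== '0', correspond(LT, ST).

% Because the clauses only state which symbol pairs are admissible, the same
% program runs in generation mode ...
% ?- correspond([d,i,v,'E',k,a], Surface).
%    Surface = [d,i,v,'0',k,a]
% ... and in recognition mode:
% ?- correspond(Lexical, [d,i,v,'0',k,a]).
%    Lexical = [d,i,v,'E',k,a]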
3 ILP Tools
This section describes the ILP tools which have been applied to morphology learning so far.

3.1 Decision List Learners
A first-order decision list is a logic program in which the last literal in all clauses but the last one is a cut (!). For example, in Figure 2, the predicate number/2 predicts the morpho-syntactic category Number of English nouns according to their ending. The order of clauses does matter, and when the predicate is called, only the first of all applicable clauses is fired. The figure shows a decision list of four clauses and how it is applied to four English nouns (lemmata, birds, mass and song). The noun lemmata fires the first clause in the list, and is classified as being a plural word-form. The noun mass is covered by the second, third, and fourth clauses; however, the clause order guarantees that only the first of the three is used. The scope of each clause is schematically represented in the figure with an ellipse; these are projected on a plane to allow for their comparison. The last clause represents the default case, i.e. it covers all examples which are not subsumed by any of the preceding clauses.

number(Word,plural) :- append(_,[a,t,a],Word), !.
number(Word,singular) :- append(_,[s,s],Word), !.
number(Word,plural) :- append(_,[s],Word), !.
number(Word,singular).

Fig. 2. Scope and priority of decision list clauses. The clauses are tested from top to bottom; only the first applicable clause is used.
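Loaded into a Prolog interpreter, the decision list of Figure 2 behaves exactly as described; here words are assumed to be given as lists of characters:

?- number([l,e,m,m,a,t,a], Cat).     Cat = plural      (first clause: ending -ata)
?- number([b,i,r,d,s], Cat).         Cat = plural      (third clause: ending -s)
?- number([m,a,s,s], Cat).           Cat = singular    (second clause fires before the third)
?- number([s,o,n,g], Cat).           Cat = singular    (default, last clause)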
Foidl. Foidl (Mooney & Califf, 1995) is a greedy top-down ILP system for learning first-order decision lists from examples and is closely related to FOIL (Quinlan, 1990). One of the main reasons that a system such as Foidl is a good candidate for NLP applications is that it can be used to learn from positive examples only. Foidl uses an assumption known as output completeness to generate implicit negative examples from positive data.

The notion of output completeness is best explained by an example. Let the positive examples be: past(sleep,slept), past(like,liked), past(walk,walked). Let the target predicate past/2 be queried using the mode past(+,-). This mode states that the first argument is an input variable, i.e. it must be instantiated in any query, and the second one is an output variable, i.e. one that is to be instantiated by the clause being learned. The only type of query allowed by the mode declaration is therefore of the form past(sleep,X). Any instantiation of the output variable that does not exactly match a positive example is perceived as a negative one. For the query past(sleep,X), for instance, such implicit negative examples include past(sleep,liked) and past(sleep,walked).
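One way to operationalise the output completeness assumption is sketched below. The predicate implicit_negative/1 and the restriction to outputs that actually occur in the training data are simplifications made for this illustration; they are not Foidl's implementation.

% Positive examples, as in the text.
past(sleep, slept).
past(like,  liked).
past(walk,  walked).

% Under output completeness with mode past(+,-), any output other than the
% one(s) recorded for a given input counts as an implicit negative example.
implicit_negative(past(In, Out)) :-
    past(In, _),          % In occurs as an input in the data
    past(_, Out),         % Out is some output occurring in the data ...
    \+ past(In, Out).     % ... but not the output recorded for In

% ?- implicit_negative(N).
% N = past(sleep, liked) ;
% N = past(sleep, walked) ;
% ...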
Clog. Clog (Manandhar, Džeroski, & Erjavec, 1998) is another system for learning first-order decision lists. It shares a fair amount of similarity with Foidl (Califf & Mooney, 1996). In both Foidl and Clog, the decision list is learned from the bottom upwards, i.e. the most general clauses are learned first. Like Foidl, Clog can learn first-order decision lists from positive examples only, using the output completeness assumption. In the current implementation, the generalisations relevant to an example are supplied by a user-defined predicate which takes as input an example and generates a list of all generalisations that cover that example. Clog treats the set of generalisations of an example as a generalisation set. It then cycles every input example through the generalisation set in a single iteration, checking whether a candidate generalisation covers the example positively or negatively. Once this process is complete, the best candidate generalisation w.r.t. a user-defined gain function is chosen.

A comparison between Foidl and Clog is clearly in favour of the latter. Unlike Foidl, Clog does not require theory constants to be specified prior to learning. Clog is much faster than Foidl (Manandhar et al., 1998; Kazakov & Manandhar, 1998), and it can process much larger data sets as it does not require all data to be loaded in memory. A relative shortcoming of Clog in comparison with Foidl is that the space of target concepts is limited to a number of hard-coded clauses. A possible way around this is to apply Foidl to a feasible subset of the data, and then pass the type of clauses learned to Clog and process the whole data set with it.
3.2 Analogical Prediction
Analogical Prediction (AP) is a learning technique that combines some of the main advantages of Instance-Based Learning (IBL) and standard ILP. It takes background knowledge and training examples, and then, for each test example e, it searches for the most compressive single-clause theory that covers e and classifies e according to the largest class among the training examples covered by that clause. AP and IBL share the robustness to changes in the training data that is typical of lazy learning approaches. In terms of preference bias, AP replaces the similarity metric needed in IBL with a compression-based metric. Also, AP inherits from ILP its greatest advantage over IBL, namely the explicit hypothesis serving as an explanation of the decisions made. AP has been shown to be particularly well suited for domains in which a large proportion of examples must be treated as exceptions (Muggleton & Bain, 1999). AP has been implemented in Progol4.4 by Muggleton.
4 Aspects of Machine Learning of Natural Language

4.1 Learning from Annotated Data
A look at existing work shows two principal settings in which a computational linguist would learn word morphology. The data in the first case is a corpus (list, lexicon) of annotated words, where each one is tagged with some relevant information. Whether it is a part of speech (Brill, 1994; Mikheev, 1997), a more complex list of morphosyntactic categories (Kazakov, Manandhar, & Erjavec, 1999) or a string of phonemes describing the word pronunciation (Yvon, 1996b; van den Bosch, 1997), the information contained in the tag is related to the word as a whole. For instance, if the corpus pairs words with their pronunciation, the two strings of letters and phonemes may be aligned, so that each phoneme corresponds to one or more letters. However, it is the context of the whole word that defines the phonetic value of each letter.

Given a tagged list of words, one can try to identify constituents within the word and find a mapping between these and the information in the tag. One can thus attempt to discover whether a certain word ending can be associated with a specific part of speech, or morphological category, such as Number, Gender or Tense. Also, knowing the segments that preserve their pronunciation throughout the corpus can be very valuable, as it can help predict the pronunciation of unknown words composed out of such segments.

One can also imagine the case in which the data set contains words with marked morpheme boundaries. The aim of learning in this case depends on the additional annotation provided. Segmented data can supply the learning algorithm with morphemes or concatenations of morphemes that can be used as theory constants in the rules mapping words to categories. Learning segmentation rules can provide the computational linguist with a tool that can be used as a pre-processor for a morphological analyser or on its own. Segmentation producing
the word stem can be seen as assigning a word its meaning, since most general-purpose systems for information retrieval assume that the word meaning is fully specified by its stem, ignoring aspects of lexical ambiguity and contextual interpretation.

In the setting described so far, word morphology aims to identify word constituents (morphemes) and assign them certain properties, such as pronunciation, possible parts of speech, etc. The matching between morphemes and word properties can be used both for word generation (Mooney & Califf, 1995; Muggleton & Bain, 1999) and analysis (Brill, 1994; Mikheev, 1997).

4.2 Unsupervised Learning of Language
In the second setting in which machine learning is applied to word morphology, the corpus contains no annotation. Here the interest is focussed on the search for the best model of the potentially infinite language or of its finite sample available. The quality of the model can be evaluated by its generative power, i.e. the ability to generate all words of the language, by the likelihood with which the corpus is produced by the model (Deligne, 1996) or by a combination of these two criteria, which are clearly contradictory. The two extremes of the scale could be marked by the most general model, the one that describes words as strings of arbitrary length and composition, and the model represented by the training data itself.

Unsupervised learning combines a new data representation framework (e.g. replace words with concatenations of a prefix and suffix) with a preference bias introducing an order among all possible representations (use the shortest possible lexicons of prefixes and suffixes), and a search technique (e.g. a genetic algorithm) to find the best representation of data w.r.t. the bias. In this context, two principles are often used. Occam's razor recommends the use of the simplest (shortest) model (theory, hypothesis) that can explain how the data was produced. The Minimal Description Length (MDL) principle combines the model length with the length of the data description based on the model; the model which minimises the number of bits encoding both model and data is given preference (Mitchell, 1997).

Unsupervised learning has been used to annotate texts or lists of words with their segmentations (Deligne, 1996; Harris, 1955; Kazakov, 1997; Brent, Lundberg, & Murthy, 1995), and to assign the most probable pronunciation to each segment (Yvon, 1996b). The biases used in the cited work, and the way they contribute to the data annotation, are shown in the upper part of Figure 3. From a practical point of view, unsupervised learning is attractive because of its low cost, as it does not require the presence of a linguist. However, there is also a more fundamental reason to study unsupervised learning. In an area as abundant in models and theories as linguistics is, support from an information-theoretic perspective can help consolidate opinions.
4.3 ILP and Background Knowledge for Language Learning
According to Shieber (1986), there are two classes of linguistic formalisms: linguistic tools and linguistic theories. The former are used to describe natural languages, the latter serve to define the class of possible natural languages. One can extend this taxonomy by taking into account the fact that natural languages are yet another class of communication codes, and as such should follow the principles of Information Theory (Shannon & Weaver, 1963), and could be studied and modelled from that general perspective. Indeed, a considerable part of this article is dedicated to the description of existing research applying general Information Theory principles to natural language.

ILP theory claims that learning is more accurate when the concept language is extended to include appropriate background concepts. Limiting the concept language to some of these concepts can improve learning efficiency. Even at present, with all ongoing work on predicate invention (Khan, Muggleton, & Parson, 1998) and multi-predicate learning (Muggleton, 1998), the use of well-chosen domain knowledge is crucial. Having existed in its modern form for more than 300 years (Lancelot & Arnauld, 1660), linguistics seems likely to provide such relevant concepts. These could be used in three ways in conjunction with ILP learning.

1. Learning could help specialise an existing theory, e.g. producing rules from templates to obtain a theory or tool for a particular language in a way reminiscent of Explanation-Based Learning (Mitchell, 1997).
2. On the other hand, it might be possible to use learning for the generalisation of existing theories or tools. For instance, existing morpho-lexical analysers for a number of languages Li could be compiled into a single tool for use in a multilingual context. In this case, learning could also result in a metatheory which would subsume each of the languages Li, and yet be more specific than the general framework initially used.
3. Finally, an existing theory (tool) which is incomplete or inconsistent w.r.t. the corpus could be extended or fine-tuned by learning.

We want to suggest a classification of all possible applications of ILP learning to word morphology based on two factors: the extent to which linguistic concepts are used as background concepts, and the amount of annotation in the data. The lower part of Figure 3 gives a graphical representation of that classification. The horizontal axis represents a partial ordering between data sets—if two data sets containing the same words are displayed in different columns, the one on the left-hand side contains all the annotation of the other plus some additional one. The rows in the table show whether the background concepts used for morphology learning consist only of string relations or make use of increasingly complex linguistic concepts and theories. A learning task can now be described as (Oin, Cin) → (Oout, Cout), i.e. as a vector, the initial co-ordinates of which specify the type of data Oin and background knowledge Cin available. The vector end point is given by the type of target theory learned Cout, and also reflects changes in the type of annotation available in the data.
The application of any preference bias for unsupervised learning from the upper part of Figure 3 would result in a right-to-left move in parallel to the horizontal axis. Indeed, each of these biases would add extra annotation to the data, yet their contribution to the concept language would usually be limited to the creation of new theory constants, typically segments of words. The vector describing unsupervised learning would usually not have any vertical component, as the (very simple) relations it introduces, such as prefix and suffix (Kazakov, 1997), would already be present in the background knowledge for any learning setting. One could only imagine the hypothetical case in which the use of a preference bias based on a general principle of Information Theory would actually result in a less trivial concept, e.g. a lexical-level morpheme (see the last row of the PREFERENCE BIAS part of the table). Unlike unsupervised learning, one would expect ILP learning to change the conceptual level of representation by introducing new, more complex concepts and their gradual specialisation, or by generalising existing case-specific theories. In either case, the learning task vector will be vertical.

Can ILP be applied to the task usually assigned to unsupervised learners? Strictly speaking, the answer is yes. However, to do so, the ILP learner should incorporate the three already mentioned components of unsupervised learners. As it uses first-order logic as a representation formalism, ILP would hardly have any difficulties incorporating the representation formalism and the preference bias of almost any unsupervised learner. It is the search technique that would make ILP a poor substitute for a given unsupervised learner. To make the search maximally efficient, each of these programs uses a specialised algorithm which would be difficult to replicate in the context of declarative programming or for which a number of standard libraries exist in other, imperative languages. The list of such algorithms includes search for a minimal-cost path in a graph (Deligne, 1996), and genetic algorithms (Kazakov, 1997).

The implication is that non-ILP algorithms for unsupervised learning and ILP are not mutually exclusive. Quite the opposite: they could, and should, be combined in fruitful marriages, in which the unsupervised learner is used as a pre-processor for ILP. An example of such a hybrid approach using the Naïve Theory of Morphology bias already exists (Kazakov & Manandhar, 1998). This, and several other relevant biases, are described in the next section.
5 Preference Biases for Word Segmentation

5.1 Deligne's Text Segmentation
Deligne (1996) describes a technique for the segmentation of a long single string, possibly containing several sentences with no delimiters between words. The method looks for a model from which the corpus could be generated with maximal likelihood. All possible segmentations are represented as a directed graph with one start and one end node. Each edge corresponds to a sequence of letters forming a segment. Each path from the start to the end node represents a particular segmentation of the whole text. Each edge of the graph is assigned a weight representing the probability of this segment. This probability is estimated
simply as the relative frequency of the string in the text, and the search for the best segmentation—the one which maximises the product of all segments' probabilities—is reformulated as a search for a minimal-cost path in the graph. Once an initial segmentation is obtained, it is used to re-estimate the probability of each segment, and to segment the text again. The process stops when the segmentation becomes stable. This method can be adapted to the task of word segmentation by introducing firm boundaries in the text corresponding to the word borders.

The algorithmic complexity of the method requires that the maximal length of segments should be limited (Deligne (1996) sets it to 5 characters). Even in the context of segmenting lists of words, rather than continuous text, removing this restriction may cause problems, as the number of possible constituents becomes exponential w.r.t. the word length. This follows directly from the fact that a segmentation of a string into substrings can be encoded as a string of 1s and 0s where a 1 marks a split point. Then for a string of length n, there are 2^(n-1) possible segmentations. This, additionally multiplied by the complexity O(n^2) of the search for a minimal-cost path (n is the number of nodes in the graph), can easily become an issue, especially for agglutinative languages where words can be of considerable length. For instance, a forty-one-letter word, which is not unheard of in Turkish, has 2^40 ≈ 10^12 possible segmentations.
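The correspondence between segmentations and binary split vectors can be made concrete with a couple of Prolog clauses (an illustrative sketch, not Deligne's algorithm):

% segmentation(Word, Segments): split a word, given as a list of characters,
% into a list of non-empty segments; each solution corresponds to one choice
% of split points.
segmentation([], []).
segmentation(Word, [Seg|Segs]) :-
    append(Seg, Rest, Word),
    Seg \= [],
    segmentation(Rest, Segs).

% ?- findall(S, segmentation([s,p,o,t], S), All), length(All, N).
% N = 8        % i.e. 2^(4-1) segmentations of a four-letter word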
5.2 Brent et al.
Another word segmentation method based on information theory is employed by Brent et al. (1995). The article describes a binary encoding of a list of words based on lexicons of word constituents, and a table describing how these constituents are combined to form words. Then the Minimal Description Length (MDL) principle is applied, i.e. the table corresponding to the encoding requiring the minimal number of bits is assumed to describe the optimal segmentation of the list of words. As the search space for possible encodings is very large, the approach limits the number of constituents to two per word. Also, suffixes (right-hand-side constituents) which share a suffix between themselves are not allowed. The suggested search technique is iterative, trying to reduce the binary encoding size by alternately adding new suffixes to their lexicon, or by removing suffixes from that lexicon if the addition of new suffixes does not result in shorter representations. The efficiency of this search technique is unsatisfactory — "the bad news is search" (Brent et al., 1995), which leaves the bias with no practical use so far.

5.3 Naïve Theory of Morphology
Kazakov (1997) introduces a bias for the segmentation of all words in a list. It is assumed that each word can be represented as a concatenation of two (possibly empty) strings Pref + Suf. For each two segmentations of all words in the list,
Learning Word Morphology with Inductive Logic Programming
101
the bias gives a preference to the one for which the two lexicons of prefixes and suffixes contain fewer characters in total. The segmentation of each word can be represented with an integer denoting the prefix length, and the segmentation of a list of words can be encoded as a vector of integers called a Naïve Theory of Morphology (NTM). Then a genetic algorithm can be easily applied to the search for a suitable segmentation of the input, where each chromosome is an NTM and the fitness function to be minimised is the number of characters in the corresponding prefix and suffix lexicons. Initially, the algorithm generates a number of naïve theories of morphology at random, and then repeatedly applies the genetic operators crossover, mutation and selection.
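The bias itself is easy to state as a cost function. The sketch below, with illustrative predicate names that are not from the cited work, splits each word at the prefix length given by the NTM and counts the characters in the two resulting lexicons:

% ntm_cost(Words, PrefixLengths, Cost): Words is a list of words (lists of
% characters), PrefixLengths is the NTM, one split point per word; Cost is
% the total number of characters in the lexicons of distinct prefixes and
% distinct suffixes.
ntm_cost(Words, PrefixLengths, Cost) :-
    split_words(Words, PrefixLengths, Prefixes, Suffixes),
    sort(Prefixes, Ps), sort(Suffixes, Ss),        % sort/2 removes duplicates
    total_chars(Ps, NP), total_chars(Ss, NS),
    Cost is NP + NS.

split_words([], [], [], []).
split_words([W|Ws], [N|Ns], [P|Ps], [S|Ss]) :-
    length(P, N), append(P, S, W),
    split_words(Ws, Ns, Ps, Ss).

total_chars([], 0).
total_chars([X|Xs], N) :- length(X, N1), total_chars(Xs, N2), N is N1 + N2.

% ?- ntm_cost([[w,a,l,k,s],[w,a,l,k,e,d],[t,a,l,k,s]], [4,4,4], C).
% C = 11     % distinct prefixes {walk, talk} and suffixes {s, ed}: 4+4+1+2 characters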
Harris’s Segmentation
Harris (1955) describes an unsupervised approach to the segmentation of utterances spelt phonetically. The approach counts the number of different phonemes succ(Prefix) which can appear in an utterance of the language after a given initial sequence Prefix of phonemes (the notation succ(n), where n is the prefix length, will also be used). For each utterance available, its left substrings Prefix_i are produced and the counts succ(Prefix_i) computed. Then the utterance is segmented where the function succ(Prefix) reaches its local maxima. It is possible to adapt the method for the segmentation of words instead of whole utterances, also replacing phonemes with letters. Table 1 gives the value of succ(Prefix) for each word in the data set but, cut, cuts, bread, spot, spots, spotted — note the use of ˆ and @ as unique symbols for a start and end of word.

Table 1. Segmentation points and succ(Prefix) (shown after the last letter of each prefix)

but      :  ˆ-3, b-2, u-1, t-1, @-0
cut      :  ˆ-3, c-1, u-1, t-2, @-0
cut-s    :  ˆ-3, c-1, u-1, t-2, s-1, @-0
bread    :  ˆ-3, b-2, r-1, e-1, a-1, d-1, @-0
spot     :  ˆ-3, s-1, p-1, o-1, t-3, @-0
spot-s   :  ˆ-3, s-1, p-1, o-1, t-3, s-1, @-0
spot-ted :  ˆ-3, s-1, p-1, o-1, t-3, t-1, e-1, d-1, @-0

As already stated, a word w is segmented after the initial substring of length n if succ(n) reaches a local maximum at n, i.e. it is greater than both succ(n−1) and succ(n+1). In case a plateau is found, all nodes on the plateau are considered as segmentation points provided a downhill slope follows. Using this criterion one obtains cut-, cut-s, spot-, spot-s, spot-ted, while but and bread are not segmented. Although not mentioned in the original paper (Harris, 1955), Harris's method can be based on the representation of all utterances (or words) as a prefix, resp. suffix tree.
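The successor count is straightforward to compute from a word list. The sketch below (the predicate name is an assumption of this illustration) takes words already wrapped in the start and end symbols of Table 1:

% succ_count(Prefix, Words, Count): number of distinct symbols that can
% follow Prefix in the given word list; words and Prefix are lists of
% characters, each word wrapped as ['^', ..., '@'].
succ_count(Prefix, Words, Count) :-
    findall(C, (member(W, Words), append(Prefix, [C|_], W)), Cs),
    sort(Cs, Distinct),
    length(Distinct, Count).

% ?- Words = [['^',b,u,t,'@'], ['^',c,u,t,'@'], ['^',c,u,t,s,'@'],
%             ['^',b,r,e,a,d,'@'], ['^',s,p,o,t,'@'], ['^',s,p,o,t,s,'@'],
%             ['^',s,p,o,t,t,e,d,'@']],
%    succ_count(['^',s,p,o,t], Words, N).
% N = 3        % '@', s and t can follow spot, so succ has a local maximum there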
5.5 de Saussure's Analogy
In his Course in General Linguistics, de Saussure (1916) describes a principle of analogy according to which, in the long term, the word-forms in a given language change and tend to form the following patterns: Pref1 + Suf1 : Pref1 + Suf2 = Pref2 + Suf1 : Pref2 + Suf2
(1)
The principle is often used for word segmentation (Pirelli, 1993; Yvon, 1996b). A word is split into a prefix and suffix only if one can find in the corpus another three words forming the proportion in Equation 1, for example, sleep-s, sleep-ing, read-s, read-ing. A simpler approach using just one row or column of this pattern can also be adopted (Brill, 1994; Mikheev, 1997). However, using the complete analogy principle helps filter out the numerous spurious segmentations generated when prefixes or suffixes are factored out of pairs of words. For instance, although the prefix on- appears in the words onyx and ontology, one would never segment on-yx if the corpus does not contain the words Pref+'yx', Pref+'tology' for any value of Pref.

There are situations when the principle of analogy will fail even if there is a very obvious segmentation at hand. A look at Example 2 below shows that, for instance, W1 cannot be segmented, as the word Pref2 + Suf1 is missing in the data.

W1 = Pref1 + Suf1    W4 = Pref1 + Suf2
W2 = Pref2 + Suf2    W5 = Pref2 + Suf3
W3 = Pref3 + Suf3    W6 = Pref3 + Suf1        (2)
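The proportion test itself translates directly into a single Prolog clause over a list of words. This sketch (with an assumed predicate name) is meant to show the idea, not to reproduce the cited systems:

% analogy_split(Word, Words, Prefix, Suffix): Word may be split into
% Prefix+Suffix if the corpus Words also contains Prefix+Suf2, Pref2+Suffix
% and Pref2+Suf2, completing de Saussure's proportion.
analogy_split(Word, Words, Pref1, Suf1) :-
    append(Pref1, Suf1, Word), Pref1 \= [], Suf1 \= [],
    member(W2, Words), append(Pref1, Suf2, W2), Suf2 \= [], Suf2 \= Suf1,
    member(W3, Words), append(Pref2, Suf1, W3), Pref2 \= [], Pref2 \= Pref1,
    member(W4, Words), append(Pref2, Suf2, W4).

% ?- Ws = [[s,l,e,e,p,s], [s,l,e,e,p,i,n,g], [r,e,a,d,s], [r,e,a,d,i,n,g]],
%    analogy_split([s,l,e,e,p,s], Ws, P, S).
% P = [s,l,e,e,p], S = [s]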
6 Existing Applications of ILP to Morphology Learning
ILP has been applied to two types of morphology-related learning tasks so far: learning of inflectional morphology and word segmentation. In our new taxonomy, these tasks are encoded as (O2, C1) → (O2, C4) and (O1, C1) → (O2, C4), respectively.

6.1 Learning of Inflectional Morphology
Most ILP applications to word morphology have aimed to learn a theory matching pairs of word-forms of the same lexical entry (lemma), each of which corresponds to a certain set of morphosyntactic categories. The setting can be schematically represented as W1 /Cat1 : W2 /Cat2 , for instance sleep/VerbPresent: slept/VerbPast. Typically, one of the forms provided (e.g. W1 ) is the standard lexical entry, which has for each language and part of speech a fixed set of morphosyntactic categories, e.g. for the Czech Noun it is Case=Nominative, Gender=Masculine,
Learning Word Morphology with Inductive Logic Programming
103
Number=Singular. That allows for the data to be represented as a set of facts of a binary predicate which has the form cat2(LexEntry,W2). There are two variations of the task: (1) given the lexical entry, find the inflected form, or (2) given the inflected form, find the lexical entry. In ILP terms, it corresponds to learning the target predicate with modes cat2(+,-) or cat2(-,+), respectively.

Existing work includes learning of Dutch diminutives (Blockeel, 1994), English Past Tense (Mooney & Califf, 1995; Muggleton & Bain, 1999), Slovene nominal paradigms (Džeroski & Erjavec, 1997), and nominal and adjectival paradigms in English, Romanian, Czech, Slovene and Estonian (Manandhar et al., 1998). In all but one case, a first-order decision list learner has been used, the exception being the application of Progol and Analogical Prediction to English Past Tense data by Muggleton and Bain (1999). The rules learned are based on the presence of one or two affixes in the word. For instance, one could have rules of the following template:

(Stem + Suf1)/Cat1 : (Stem + Suf2)/Cat2        (3)
If the concatenation of a given stem and suffix Suf1 is a word-form corresponding to category Cat1, then by replacing Suf1 with Suf2 one can obtain the word-form corresponding to category Cat2, e.g. (appl+y)/Present : (appl+ied)/Past.

The English Past Tense data set has become a standard test bed for both ILP (Mooney & Califf, 1995; Muggleton & Bain, 1999) and non-ILP learning approaches related to morphology (Rumelhart & McClelland, 1986; Ling & Marinov, 1993). The work of Rumelhart and McClelland (1986) is the first in a series of articles to apply learning to the task of producing the past tense of English verbs from their present form. The simple perceptron-based learner used by them was followed by a more sophisticated use of Artificial Neural Networks (ANN) by MacWhinney and Leinbach (1991). Ling and Marinov (1993) have subsequently applied the Symbolic Pattern Associator (SPA), a C4.5-based learner, to the same task. The principle of analogy has also been used by Yvon (1996a). The data consists of pairs of present and past tense forms of the same verb, for instance, (sleep, slept). It comes in two flavours, alphabetic, using standard spelling, and phonetic, reflecting the pronunciation.

Direct comparison of learning methods is made difficult by the fact that different approaches do not use training and test data of the same size and distribution of regular and irregular verb forms. For instance, for a sample of 1200 training and 102 test verb forms spelt phonetically, SPA shows 89% predictive accuracy as compared to the 80% of MacWhinney's ANN (Ling, 1994). The ANN here is an improved version of the one previously published by MacWhinney and Leinbach (1991). The original ANN achieves 55.6% accuracy on a sample of 500 training and 500 test entries in phonetic representation as compared to Ling's 82.8%. The ILP learner FOIL outperforms SPA on the same data with 83.3% (Quinlan, 1994). The alphabetical data set with the same 500/500 training-to-test sample ratio allows for the best comparison of ILP methods applied to the task. Table 2 shows the results for Foidl, IFOIL and FOIL reported by Mooney and Califf (1995)
(Quinlan’s result with FOIL for the same data is 81.2%). Results with standard Progol learning pure logic programs and Progol using Analogical Prediction have been reported by Muggleton and Bain (1999). All results should be compared to the majority class baseline of 79.3% (Past = Present+‘ed’). Comparison shows that ILP outperforms the other approaches. Within ILP, Analogical Prediction achieves the best results (Muggleton & Bain, 1999). Table 2. Learning of English past tense
                non-ILP learning          ILP learning
Data set     Neural network   SPA     FOIDL   IFOIL   FOIL    AP      PROGOL
Phonetic     55.6             82.8    N/A     N/A     83.3    N/A     N/A
Alphabetic   N/A              N/A     86      60      82      95.9    87.0
6.2 Hybrid Learning of Word Segmentation Rules
Kazakov and Manandhar (1998) have proposed a hybrid approach to the segmentation of an unannotated list of words in which the words are initially segmented by applying the Naïve Theory of Morphology bias, and then a first-order decision list of segmentation rules is learned with the help of either Foidl or Clog. These rules show a good degree of generalisation, and can be used for the segmentation of unknown words. A characteristic feature of the method is that the search technique (GA) used in the unsupervised learner may not be able to find the best segmentations w.r.t. the bias. The words that are not segmented in an optimal way are often covered by an exception in the decision list, that is, a rule applicable to just one word with no potential for generalisation. Removing these exceptions from the decision list does not influence its performance on other, unseen, words, and the application of the remaining rules in the decision list to the training list of words usually results in segmentations that are better both w.r.t. the bias, and from a linguistic point of view. Experiments with the learning of segmentation rules for Slovene have shown that the segments produced for each word are strongly correlated with its morphosyntactic categories, thus supporting the hypothesis that these segments can be interpreted as morphemes (Kazakov et al., 1999).
7 Future Work
The taxonomy introduced in this article automatically defines a whole space of learning tasks, most of which remain unexplored. The following list contains the defining vectors of some of them and their possible interpretations.
(O1, C1) → (O2, C4): Replace NTM with other biases, such as Deligne's, Harris's or analogy, and learn word segmentation rules.

(O2, C1) → (O2, C3): Invent new predicates to be used as rule templates in task (O2, C1) → (O2, C4). The template in Equation 3 gives a good idea of the type of templates one could expect to learn.

(O1, C2) → (O1, C4): Use some basic linguistic concepts, such as consonants, vowels, semivowels, diphthongs (unary predicates) or the possibility of mapping several morphs onto a single lexical-level morpheme (a binary predicate with append/3 literals in the body) to learn a model of a given language that is a more-than-one-level morphology.

(O1, C3) → (O1, C4): Same as above, but use the two-level morphology and, possibly, relevant rule templates as background knowledge.

(O2, C3) → (O3, C4): Learn the two-level morphology model of language L from words tagged with their morphosyntactic categories. Use a combination of unsupervised learning to segment the words (Harris, Naïve Theory of Morphology) and ILP.

(O3, C2) → (O3, C4): Learn the two-level morphology model of language L from words tagged with their lexical-level morphemes and morphosyntactic categories. This kind of annotation is the typical output of a two-level morpho-lexical analyser. Here is a transliterated example of a Turkish word analysed in this way:
['CIkarIlmakla', [[cat:noun, stem:[cat:verb, root:'CIK', voice:caus, voice:pass, sense:pos], suffix:mak, type:infinitive, agr:'3SG', poss:'NONE', case:ins]]].

(O3, C4) → (O3, C4+): Start with an incomplete and/or inaccurate two-level description of morphology and learn the complete and correct set of rules. This is a particularly interesting application, as it can help to adapt an existing two-level model of the language norm to account for the peculiarities of a vernacular.

(O2, C4) → (O2, C3): Learn metarules (rule templates) from inflection or segmentation rules. The latter can be provided by a linguist, or obtained in previous learning steps.

(O3, C4) → (O3, C3): Use the existing two-level models of the morphology of several languages, e.g. French, Spanish and Italian, to learn a common set of rules and/or templates potentially relevant to a family of languages (e.g. Romance languages).
8 Conclusions
This paper describes some existing work in ILP learning of word morphology, and identifies the issues addressed so far. In the context of the current achievements in computational morphology, it seems that combining ILP with existing linguistic theories, rather than substituting it for them, would lead to applications which would challenge the ILP community and attract more attention from computational linguists. Research in this direction could be helped by the
development of standard ILP libraries of background predicates representing linguistic concepts or theories. Also, the integration of ILP with methods for unsupervised learning in a hybrid framework would increase the ability of ILP to learn from impoverished data containing little or no annotation. The newly introduced taxonomy of learning approaches to morphology describes a variety of tasks that have not been addressed yet, and could serve as an inspiration for future research in the field.

[Figure 3 appears here. Its upper part lists the preference biases used to augment the training data annotation (Deligne's text segmentation, Brent et al., the Naïve Theory of Morphology, Harris's segmentation and de Saussure's analogy), each of which adds a segmentation to otherwise unannotated data. Its lower part lays out the space of ML applications to word morphology along two axes: the concept language (C1: relations between strings such as append, prefix and suffix; C2: basic linguistic concepts such as consonants, vowels, semivowels, diphthongs and lexical vs surface level morphemes; C3: abstract machines and rules, e.g. two-level morphology with finite state transducers and rule templates; C4: the implementation of a tool for a particular language) and the amount of annotation in the training data (O1: no annotation; O2: some annotation, e.g. segmentation or morphosyntactic categories; O3: more annotation, e.g. lexical-level morphemes in addition).]

Fig. 3. Bottom: space of ML applications to word morphology. Top: preference biases used to augment training data annotation.
References

1. Antworth, E. (1991). Introduction to two-level phonology. Notes on Linguistics, 53, pp. 4–18.
2. Bender, B. W. (1997). Latin Verb Inflection. University of Hawai‘i, http://www2.hawaii.edu/~bender/Latin Verb.html.
3. Blockeel, H. (1994). Application of inductive logic programming to natural language processing. Master's thesis, Katholieke Universiteit Leuven, Belgium.
4. Brent, M., Lundberg, A., & Murthy, S. (1995). Discovering morphemic suffixes: A case study in minimum description length induction. In Proc. of the Fifth International Workshop on Artificial Intelligence and Statistics.
5. Brill, E. (1994). Some advances in transformation-based part of speech tagging. In Proc. of the Twelfth National Conference on Artificial Intelligence, pp. 748–753. AAAI Press/MIT Press.
6. Califf, M. E., & Mooney, R. J. (1996). Advantages of decision lists and implicit negatives in inductive logic programming. Tech. rep., University of Texas at Austin.
7. Chomsky, N., & Halle, M. (1968). The Sound Pattern of English. Harper and Row, New York.
8. Cussens, J. (1997). Part-of-speech tagging using Progol. In Proc. of the Seventh International Workshop on Inductive Logic Programming, pp. 93–108.
9. de Saussure, F. (1916). Course in General Linguistics (1959 edition). Philosophical Library, New York.
10. Deligne, S. (1996). Modèles de séquences de longueurs variables: Application au traitement du langage écrit et de la parole. Ph.D. thesis, ENST Paris, France.
11. Džeroski, S., & Erjavec, T. (1997). Induction of Slovene nominal paradigms. In Proc. of the Seventh International Workshop on Inductive Logic Programming, pp. 141–148, Prague, Czech Republic. Springer.
12. Fradin, B. (1994). L'approche à deux niveaux en morphologie computationnelle et les développements récents de la morphologie. Traitement automatique des langues, 35 (2), 9–48.
13. Harris, Z. S. (1955). From phoneme to morpheme. Language, 31 (2).
14. Kaplan, R. M., & Kay, M. (1994). Regular models of phonological rule systems. Computational Linguistics, 20 (3), 331–379.
15. Kazakov, D. (1997). Unsupervised learning of naïve morphology with genetic algorithms. In Daelemans, W., Bosch, A., & Weijters, A. (Eds.), Workshop Notes of the ECML/MLnet Workshop on Empirical Learning of Natural Language Processing Tasks, pp. 105–112, Prague, Czech Republic.
16. Kazakov, D. (1999). Combining lapis and WordNet for the learning of LR parsers with optimal semantic constraints. In Džeroski, S., & Flach, P. (Eds.), Proc. of the Ninth International Workshop on Inductive Logic Programming, pp. 140–151, Bled, Slovenia. Springer-Verlag.
17. Kazakov, D., & Manandhar, S. (1998). A hybrid approach to word segmentation. In Page, D. (Ed.), Proc. of the Eighth International Conference on Inductive Logic Programming, pp. 125–134, Madison, Wisconsin. Springer-Verlag.
18. Kazakov, D., Manandhar, S., & Erjavec, T. (1999). Learning word segmentation rules for tag prediction. In Džeroski, S., & Flach, P. (Eds.), Proc. of the Ninth International Workshop on Inductive Logic Programming, pp. 152–161, Bled, Slovenia. Springer-Verlag.
19. Khan, K., Muggleton, S., & Parson, R. (1998). Repeat learning using predicate invention. In Page, D. (Ed.), Proc. of the Eighth International Workshop on Inductive Logic Programming, pp. 165–174, Madison, Wisconsin. Springer-Verlag.
20. Koskenniemi, K. (1983). Two-level Morphology: A General Computational Model for Word-Form Recognition and Production. University of Helsinki, Dept. of General Linguistics, Finland.
21. Lancelot, C., & Arnauld, A. (1660). Grammaire générale et raisonnée (de Port-Royal) (1967 facsimile edition). The Scolar Press, Menston, England.
22. Lavrač, N., & Džeroski, S. (1994). Inductive Logic Programming: Techniques and Applications. Ellis Horwood, Chichester.
23. Ling, C. X. (1994). Learning the past tense of English verbs: The symbolic pattern associator vs. connectionist models. Journal of Artificial Intelligence Research, 1, 209–229.
24. Ling, C. X., & Marinov, M. (1993). Answering the connectionist challenge: A symbolic model of learning the past tense of English verbs. Cognition, 49 (3), 235–290.
25. MacWhinney, B., & Leinbach, J. (1991). Implementations are not conceptualizations: Revising the verb model. Cognition, 40, 291–296.
26. Manandhar, S., Džeroski, S., & Erjavec, T. (1998). Learning Multilingual Morphology with CLOG. In Page, D. (Ed.), Proc. of the Eighth International Conference on Inductive Logic Programming, pp. 135–144, Madison, Wisconsin.
27. Matthews, P. H. (1974). Morphology: an Introduction to the Theory of Word-Structure (First edition). Cambridge University Press.
28. Matthews, P. H. (1997). The Concise Oxford Dictionary of Linguistics. Oxford University Press.
29. Mikheev, A. (1997). Automatic rule induction for unknown word guessing. Computational Linguistics, 23 (3), 405–423.
30. Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York.
31. Mooney, R. J., & Califf, M. E. (1995). Induction of first-order decision lists: Results on learning the past tense of English verbs. Journal of Artificial Intelligence Research, 3, 1–24.
32. Muggleton, S. (1998). Advances in ILP theory and implementations. In Page, C. (Ed.), Proc. of the Eighth International Workshop on Inductive Logic Programming, p. 9, Madison, Wisconsin. Springer-Verlag. Abstract of keynote presentation.
33. Muggleton, S., & Bain, M. (1999). Analogical prediction. In Proc. of the Ninth International Workshop on Inductive Logic Programming, Bled, Slovenia. Springer-Verlag.
34. Pirelli, V. (1993). Morphology, Analogy and Machine Translation. Ph.D. thesis, Salford University, UK.
35. Quinlan, J. R. (1994). Past tenses of verbs and first-order learning. In Debenham, J., & Lukose, D. (Eds.), Proceedings of the Seventh Australian Joint Conference on Artificial Intelligence, pp. 13–20, Singapore. World Scientific.
36. Quinlan, J. (1990). Learning logical definitions from relations. Machine Learning, 5, 239–266.
37. Rumelhart, D. E., & McClelland, J. (1986). Parallel Distributed Processing, Vol. II, chap. On Learning the Past Tense of English Verbs, pp. 216–271. MIT Press, Cambridge, MA.
38. Shannon, C. E., & Weaver, W. (1963). The Mathematical Theory of Communication. University of Illinois Press, Urbana.
39. Shieber, S. (1986). An introduction to unification-based approaches to grammar. No. 4 in CSLI Lecture Notes. Center for the Study of Language and Information, Stanford, CA.
40. van den Bosch, A. (1997). Learning to Pronounce Written Words: A Study in Inductive Language Learning. Ph.D. thesis, University of Maastricht, The Netherlands.
41. van den Bosch, A., Daelemans, W., & Weijters, T. (1996). Morphological analysis as classification: an inductive learning approach. In Oflazer, K., & Somers, H. (Eds.), Proc. of NeMLaP-2, pp. 79–89, Ankara, Turkey.
42. Yvon, F. (1996a). Personal communication.
43. Yvon, F. (1996b). Prononcer par analogie: motivations, formalisations et évaluations. Ph.D. thesis, ENST Paris, France.
44. Zelle, J. M., & Mooney, R. J. (1993). Learning semantic grammars with constructive inductive logic programming. In Proceedings of AAAI-93, pp. 817–822. AAAI Press/MIT Press.
Learning the Logic of Simple Phonotactics

Erik F. Tjong Kim Sang (1) and John Nerbonne (2)

(1) CNTS - Language Technology Group, University of Antwerp, Belgium
    [email protected]
(2) Alfa-informatica, BCN, University of Groningen, The Netherlands
    [email protected]
Abstract. We report on experiments which demonstrate that by abductive inference it is possible to learn enough simple phonotactics to distinguish words from non-words for a simplified set of Dutch, the monosyllables. The monosyllables are distinguished in input so that segmentation is not problematic. Frequency information is withheld as is negative data. The methods are all tested using ten-fold cross-validation as well as a fixed number of randomly generated strings. Orthographic and phonetic representations are compared. The work presented in this chapter is part of a larger project comparing different machine learning techniques on linguistic data.
1 Introduction
This chapter describes an application of abduction to recognising the structure of monosyllabic words. It is part of a larger project comparing various learning methods on language-learning tasks. The learning methods compared in the larger project are taken from three paradigms: stochastic learning (Hidden Markov Models), connectionist learning (simple recurrent nets using back propagation), and symbolic learning (abductive inference). In each case we use the methods to build an acceptor for the monosyllables from positive data only, and we test it on held-back test data as well as randomly generated strings (negative data). We systematically compare results based on orthographic and phonetically encoded data. The data comes from Dutch but the results are similar for other related languages (in current experiments).

The study focuses on three questions. First, we ask whether by abductive inference it is possible to learn the structure of monosyllabic words. Second, we seek to learn the influence of data representation on the performance of the learning algorithm and the models it produces. Third, we would like to see whether the learning process is able to create better models when it is equipped with basic initial knowledge, so-called innate knowledge.

The phonotactic models that will be produced can be used for suggesting non-dictionary correction alternatives for words generated by Optical Character Recognition (OCR) software. This study on phonotactics is also important for the Groningen research group because it is our first application of machine learning techniques to natural language processing. The problem chosen is deliberately simple in order to make possible a good understanding of the machine learning techniques. The results of this study will be the basis of future research in even more challenging applications of machine learning to natural language processing.

This chapter is a revised compilation of part of the work described in (Tjong Kim Sang, 1998) and (Tjong Kim Sang & Nerbonne, 1999).
2 Theoretical Background

2.1 Problem Description
Why is 'pand' a possible English word and why not 'padn'? Why is 'mloda' a possible Polish word but not a possible Dutch word? For answers to these questions one has to know the syllable structures which are allowed in English, Polish and Dutch. Native speakers of English can tell you that 'pand' is a possible English word and that 'padn' is not. For this judgement they do not need to have seen the two words before. They use their knowledge of the structure of English words to make this decision. How did they get this knowledge?

Which words are used depends on the phonotactic structure of the language, that is, on which sequences of basic sound elements may occur. Certain languages, such as Polish, allow ml onsets of words but others, such as English and Dutch, do not. Phonotactics are directly reflected in the phonetic transcriptions of words, and indirectly in the orthography, i.e., the writing system. Different languages usually have different orthographies. Some aspects of phonotactics, such as preference for consonant vowel sequences, are shared by almost all languages, but phonotactic structure varies from one language to another. No universal phonotactics exists.

Two possibilities exist for entering language-dependent phonotactics into a computer program. The first is to examine (by eye or ear) the language and create a list of rules reflecting phonotactic structure. This is labour-intensive and repetitive when many languages are involved. The second possibility is to have the machine learn the phonotactics by providing it with language data. People manage to learn phonotactic rules which restrict phoneme sequences, so it might be possible to construct an algorithm that can do the same. If we are able to develop a model capable of learning phonotactics, we can use it to acquire the phonotactic structure of many languages.

Both artificial intelligence and psychology offer a wide variety of learning methods: rote learning, induction, learning by making analogies, explanation based learning, statistical learning, genetic learning and connectionist learning. We are not committed to one of these learning methods but we are interested in finding the one that performs best on the problem we are trying to tackle: acquiring phonotactic structure. For the experiments in the project reported on here we have chosen methods from three machine learning paradigms, Hidden Markov Models (from stochastic learning), abductive inference (from symbolic learning), and simple recurrent networks (from connectionist learning). This chapter focuses on abductive inference.
The problem of distinguishing words from nonwords is not particularly difficult, as linguistic problems go. One perspective on the complexity of the task is given by comparison to an appropriate baseline. The simplest learning algorithm we could think of accepted all and only strings that consist of character pairs (bigrams) which appear in the training data. This algorithm accepted 99.3±0.3% of the positive orthographic data and rejected 55.7±0.9% of the negative orthographic data. For the phonetic data (see below) these scores were 99.0±0.5% and 76.8±0.5% respectively—indicating an easier task. This baseline algorithm was good in accepting positive data but performed less well in rejecting negative data.
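Such a baseline acceptor takes only a few lines of Prolog. The sketch below is an illustration, not the code used in the experiments; it assumes words are given as lists of characters and uses the common built-ins forall/2 and memberchk/2.

% bigrams(Word, Bigrams): the list of adjacent character pairs in a word.
bigrams([C1,C2|Rest], [C1-C2|Bs]) :- bigrams([C2|Rest], Bs).
bigrams([_], []).
bigrams([], []).

% training_bigrams(TrainingWords, Bigrams): every bigram occurring in the
% training words, without duplicates.
training_bigrams(TrainingWords, Bigrams) :-
    findall(B, (member(W, TrainingWords), bigrams(W, Bs), member(B, Bs)), All),
    sort(All, Bigrams).

% accept(Word, TrainingBigrams): a string is accepted iff all of its bigrams
% occur in the training data.
accept(Word, TrainingBigrams) :-
    bigrams(Word, Bs),
    forall(member(B, Bs), memberchk(B, TrainingBigrams)).

% ?- training_bigrams([[p,a,n,d],[s,a,n,d]], TBs), accept([p,a,n,d], TBs).
% true.
% ?- training_bigrams([[p,a,n,d],[s,a,n,d]], TBs), accept([p,a,d,n], TBs).
% false.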
2.2 Data Representation
The importance of knowledge representation is widely acknowledged. The representation of input to a problem solving process can make the difference between the process finding a good result or finding no result. The input for our learning methods can be represented in two ways. The first one is orthographic: words are represented by the way they are written down, for example: "the sun is shining". The second way of representing the words is phonetic. If we use the phonetic representation then the sentence "the sun is shining" will be represented as [D@ s2n Iz SaynIN]. The second representation uses a variant of the International Phonetic Alphabet (IPA), which enjoys general acceptance among scholars of phonotactics (Ladefoged, 1993).

We do not know which (if either) of the two representations will enable the learning process to generate the best word models. Acceptance decisions of words by humans may be based on the way the words are written, but they may also be based on the pronounceability of the words. We are interested in finding out which representation is most suitable for the learning methods. Therefore we perform two sets of experiments: one with data in the orthographic representation and one with the same data in the phonetic representation. We work with Dutch and since Dutch orthography is fairly transparent, it turned out that there is less need to distinguish the two problems in Dutch. The similarity of the orthographic and phonetic problems is also reflected in the similar baseline performances. A more detailed phonetic representation, in which common features are directly reflected, might still yield significantly different results.

Our learning algorithms process character bigrams rather than character unigrams. This means that they interpret words as sequences of two characters such as splash=sp-pl-la-as-sh. By working this way the algorithms are forced to take the context of a character into account when they consider extensions. Without this context they would make embarrassing errors.
2.3 Innate Knowledge
A recurrent issue in modeling language acquisition is the amount of innate knowledge available or required. Linguists have emphasised that important aspects of
language learning require some innate knowledge (Pinker, 1994). Debates in the language acquisition literature have led to a general acceptance of the assumption that children use innate linguistic constraints when they acquire their native language. Finch (1993) describes approaches to language acquisition which assume no innate knowledge. We are interested in what artificial language learning systems can gain from equipping them with initial linguistic constraints. Therefore we will perform two versions of our experiments: one version without specific linguistic constraints and another in which the learning algorithm starts from general phonotactic knowledge. In the case of other methods we have studied in the larger project (notably connectionist methods), this required some creativity, but we believe that reasonable operationalizations were found.
2.4 Positive and Negative Data
A further perennial question is whether negative information needs to be used— e.g., the information that ‘mlod’ is not a Dutch monosyllable. Early research in computational learning theory showed the need for negative learning if grammars are to characterise perfectly (Gold, 1967), but we will be satisfied with good performance on the task of distinguishing words and nonwords. Research in child language acquisition has had difficulties with finding negative language input from parents in conversations with young children, and has noted that children attend to it poorly. This presents a problem to us: according to computational learning theory children need negative information for learning (perfectly), while children do not seem to receive this information even though they manage to learn natural languages. We will approach the acquisition of models for monosyllabic words from the research perspective of child language acquisition. We will supply our learning methods with positive information only. In some learning experiments it might be theoretically necessary that negative examples are supplied in order to obtain a good result. We will assume that in those learning experiments the negative information is supplied implicitly. One source of implicit information is innate knowledge. Another is the application of the closed world assumption which states that non-present information is false.
3 Experiment Setup
3.1 Evaluating Syllable Models
We vary experiments by using two initialisation types and two data representation types. In order to be able to compare the results of the experiments (especially within the larger project), we perform all experiments with the same training and test data and use only one linguistic model for all linguistic initialisation. We use ten-fold cross-validation, which means that we randomly divide our positive data into ten parts and use nine parts for training and one part for testing. Each part is used as the test set once. The results presented here are the average performances over the ten test sets.
A further question is how to evaluate the monosyllabic word models. After a learning phase the word models are tested with two sets of data. One set contains unseen positive language data, that is, words which occur in the language but were not present in the training data. The other data set will contain negative data: strings which do not occur as words in the language. The algorithm can make two kinds of errors. Firstly, it can classify positive test data as incorrect (false negatives). Secondly, it can classify negative test data as correct (false positives). Our goal will be to find a model which makes as few errors as possible.
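A minimal sketch of this evaluation scheme, assuming some acceptance test model_accepts/1 supplied by the learned word model (the predicate names here are ours, not the authors'):

    % evaluate(+PositiveData, +NegativeData, -AcceptRate, -RejectRate):
    % fraction of unseen positive words accepted and fraction of negative
    % strings rejected by the learned model.
    evaluate(PositiveData, NegativeData, AcceptRate, RejectRate) :-
        include(model_accepts, PositiveData, AcceptedPositives),
        exclude(model_accepts, NegativeData, RejectedNegatives),
        length(PositiveData, NPos), length(AcceptedPositives, NAcc),
        length(NegativeData, NNeg), length(RejectedNegatives, NRej),
        AcceptRate is NAcc / NPos,
        RejectRate is NRej / NNeg.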
3.2 The Training and Test Data
The learning algorithms receive training data as input and use this set for building models of the structure of Dutch monosyllabic words. The models are able to evaluate arbitrary strings. They can either accept a string as a possible monosyllabic Dutch word or reject it. A good phonotactic model will accept almost all strings of the positive test data and reject almost all strings of the negative test data. The data sets were derived from the CELEX CD-ROM (Baayen, Piepenbrock, & van Rijn, 1993). From the Dutch Phonology Wordforms directory (DPW) we have extracted 6218 monosyllabic word representation pairs. The first element of each pair was the orthographic representation of the word (field Head) and the second the phonetic representation of the word (field PhonolCPA). All characters in the orthographic representation of words were changed to lower case. The list contained 6190 unique orthographic strings and 5695 unique phonetic strings. We used the character frequencies of the two data sets for generating two sets of 1000 unique random strings which do not appear in the related data set. We use these lists of random strings as negative data in the main experiments.
3.3 The Linguistic Initialisation Model
We perform two versions of the learning experiments: one without initial knowledge and one which is equipped with some linguistic constraints. As an initialisation model we have chosen the syllable model which is presented in Gilbers (1992) (see Figure 1). This model is a mixture of the syllable models by Cairns and Feinstein (1982) and van Zonneveld (1988). Hence it will be called the Cairns and Feinstein model. The Cairns and Feinstein model is a hierarchical syllable model consisting of a tree which contains seven leaves. Each leaf can either be empty or contain one phoneme. The leaves are restricted to a class of phonemes: the peak can only contain vowels and the other leaves may only contain consonants. The exact phonemes that are allowed in a leaf are language-dependent. In the syllable model, vertical lines connect nodes to daughter nodes that are main constituents. A slanted line between two nodes indicates that the daughter node is dependent on the sister node that is a main constituent. A dependent constituent can only be filled if its main constituent is filled.
[Figure 1 is a tree diagram; its node labels include syllable, onset, rhyme, margin, nucleus, pre-margin, margin core, satellite, peak, coda and appendix.]
Fig. 1. The syllable model of Cairns and Feinstein
For example, the margin satellite can only contain a phoneme if the margin core contains a phoneme. This syllable model can be used to explain consonant deletion in child language. For example, the Dutch word stop fits in the model as s:pre-margin, t:margin-core, o:peak and p:coda (the t cannot occur in the margin satellite in Dutch). The model predicts that a child that has difficulty producing consonant clusters will delete the dependent part in the onset cluster and produce top. Another example is the word gram which fits in the model as g:margin-core, r:margin satellite, a:peak and m:coda (the g cannot occur in the pre-margin in Dutch). In this case the model will predict that the child will produce gam. Both predictions are correct. In our experiments with initial knowledge we will supply the learning algorithms with the syllable structure presented in Figure 1. Two extra constraints will be provided to the algorithms: the peak can only contain vowels while the other leaves are restricted to consonants. Finally the division of the phonemes into vowels and consonants will be made explicit for the learning algorithms. Their task will be to restrict the phonemes in each leaf to those phonemes that are possible in the language described by the learning examples. By doing this they will convert the general Cairns and Feinstein model to a language-specific syllable model.
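A possible encoding of this initialisation is sketched below, purely as an illustration and under the assumption that the seven leaves carry the names used in the text and in Figure 1; vowel/1 and consonant/1 are assumed to enumerate the two phoneme classes, and none of these predicates come from the chapter itself:

    % The peak may only hold a vowel; every other leaf is restricted to
    % consonants. The learner's task is to narrow these sets further to the
    % phonemes actually allowed in the language of the training data.
    allowed(peak, P)             :- vowel(P).
    allowed(pre_margin, P)       :- consonant(P).
    allowed(margin_core, P)      :- consonant(P).
    allowed(margin_satellite, P) :- consonant(P).
    allowed(peak_satellite, P)   :- consonant(P).
    allowed(coda, P)             :- consonant(P).
    allowed(appendix, P)         :- consonant(P).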
4 Abductive Reasoning
4.1 Introduction
We use a version of abductive inference which is related to Inductive Logic Programming (ILP). This is a logic programming approach to machine learning (Muggleton, 1992). The term induction denotes a reasoning technique which can be seen as the inverse of deduction. In ILP, one makes a distinction between three types of knowledge, namely background knowledge, observations and the hypothesis (Muggleton, 1992). Background knowledge is the knowledge that a learner already has about the domain of the learning problem. Observations are the input patterns for the learning method with their classifications. The hypothesis contains the model of the domain that ILP should build. The relation between these three knowledge types can be described with two rules:
DR: B ∧ H ⊢ O
IR: B ∧ O −→ H
These rules contain three symbols: ∧ stands for “and”, ⊢ stands for “leads deductively to” and −→ stands for “leads inductively to”. DR represents the deductive rule which states that the observations (O) are derivable from the background knowledge (B) and the hypothesis (H) (Muggleton, 1992). The inductive rule IR represents the inductive step that we want to make: derive a hypothesis from the background knowledge and the observations. In ILP, the hypothesis that is derived consists of rules which contain variables. We will specify the rule formats in advance and restrict the derivation to variable-free instances of these rules. Hence we perform abduction rather than induction. We will regard word production as a process of adding prefix and suffix characters to words.1 The possibility of adding a prefix (suffix) character to a word will depend on the first (last) character of the word. Our models will consist of prefix clauses PC(A,B) and suffix clauses SC(A,B) which state in which context certain characters can be added, and of basic word clauses BWC(A) which state which basic words are correct.2 These clauses are made more concrete in the following three rules:
Suffix Rule. Suppose there exists a suffix clause SC(P,F) and a word W such that F is the final character of W and P is the penultimate character of W and word W less F (W - F) is W without its final character F. In that case the fact that W is a valid word will imply that W - F is a valid word and vice versa.
Prefix Rule. Suppose there exists a prefix clause PC(I,S) and a word W such that I is the initial character of W and S is the second character of W and word W - I is W less its initial character I (W - I). In that case the fact that W is a valid word implies that W - I is a valid word and vice versa.
Basic Word Rule. The existence of a basic word clause BWC(W) implies that word W is a valid word.
The suffix and the prefix rules can be written in Prolog as
    bwc(WminF):-bwc(W),sc(P,F),append(R,[P,F],W),append(R,[P],WminF).
    bwc(W):-bwc(WminF),sc(P,F),append(R,[P,F],W),append(R,[P],WminF).
    bwc([S|R]):-bwc([I,S|R]),pc(I,S).
    bwc([I,S|R]):-bwc([S|R]),pc(I,S).
where bwc(W), pc(I,S) and sc(P,F) are the three clauses that can be derived by the abduction process. The words are represented as lists of characters.
1 See (Kazakov & Manandhar, 1998) for a related approach to word segmentation.
2 The three clauses can be proven to be equivalent to regular grammars.
Two standard Prolog functions are used to put characters in front of a list and behind a list. Another important issue is how we are going to generate monosyllabic word models in practice. We will use a derivation scheme which is based on the three clauses used in our word models. The derivation scheme consists of three rules: one for prefix clauses, one for suffix clauses and one for basic word clauses:
Basic word clause inference rule. If word W is a valid word then we will derive the basic word clause BWC(W). All observed words are valid words.
Prefix clause inference rule. If W with initial character I and second character S is a valid word and W - I, which is W less its initial character I, is a valid word as well, then we will derive the prefix clause PC(I,S).
Suffix clause inference rule. If W with final character F and penultimate character P is a valid word and W - F, which is W less its final character F, is a valid word as well, then we will derive the suffix clause SC(P,F).
An example: suppose we are looking for a model describing the three words clan, clans and lans. We can use the basic word clause inference rule for deriving three basic word clauses for our model: BWC(clan), BWC(clans) and BWC(lans). The first two in combination with the suffix clause inference rule enable us to derive suffix clause SC(n,s). The prefix clause PC(c,l) can be derived by using BWC(clans) and BWC(lans) in combination with the prefix clause rule. This new prefix clause can be used in combination with BWC(clan) and the prefix clause rule to derive BWC(lan). This new basic word clause makes the other basic word clauses superfluous. Our final model consists of the clauses BWC(lan), PC(c,l) and SC(n,s).
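The two clause inference rules can be sketched directly in Prolog. In this sketch (our own illustration, with hypothetical predicate names derive_pc/1 and derive_sc/1) bwc/1 is assumed to hold just the current set of basic word clauses as facts, so for the three observed words above the sketch abduces exactly sc(n,s) and pc(c,l):

    bwc([c,l,a,n]).
    bwc([c,l,a,n,s]).
    bwc([l,a,n,s]).

    % Prefix clause inference rule: both the word and the word without its
    % initial character are currently valid basic words.
    derive_pc(pc(I,S)) :-
        bwc([I,S|Rest]),
        bwc([S|Rest]).

    % Suffix clause inference rule: both the word and the word without its
    % final character are currently valid basic words.
    derive_sc(sc(P,F)) :-
        bwc(Word),
        append(Front, [P,F], Word),
        append(Front, [P], Shorter),
        bwc(Shorter).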
4.2 Algorithm
The abductive clause inference algorithm will process the data in the following way:
1. Convert all observations to basic word clauses.
2. Make a pass through all basic words and process one at a time. We will use the symbol W for the word being processed and assume that W has initial character I, second character S, penultimate character P and final character F. We will perform the following actions:
(a) If W without I (W - I) is a valid word as well then derive the prefix clause PC(I,S) and remove the basic word clause for W.
(b) If W without F (W - F) is a valid word as well then derive the suffix clause SC(P,F) and remove the basic word clause for W.
(c) If the prefix clause PC(I,S) exists then derive the basic word clause BWC(W - I) and remove the basic word clause for W.
(d) If the suffix clause SC(P,F) exists then derive the basic word clause BWC(W - F) and remove the basic word clause for W.
3. Repeat step 2 until no new clauses can be derived.
Steps 1, 2(a) and 2(b) are straightforward applications of the inference rules for basic words, prefix clauses and suffix clauses which were defined in Section 4.1. The steps 2(c) and 2(d) are less intuitive applications of the basic word clause inference rule in combination with the rules noted above. In the suffix rule we have defined that W - F will be a valid word whenever W is a valid word and a suffix clause SC(P,F) exists. This is exactly the case handled by step 2(d), and because W - F is a valid word we may derive BWC(W - F) by using the basic word clause inference rule. Step 2(c) can be explained in a similar way. The steps 2(c) and 2(d) will be used to make the basic words as short as possible. This is necessary to enable the algorithm to derive all possible prefix and suffix clauses. Consider for example the following intermediate set of clauses:
BWC(yz)   BWC(yx)   SC(y,z)
By applying step 2(d) we can use SC(y,z) and BWC(yz) to add the basic word clause BWC(y) and remove BWC(yz). In its turn this new basic word clause in combination with BWC(yx) can be used for deriving the suffix clause SC(y,x). In this example shortening a basic word has helped to derive an extra suffix clause. The abductive clause inference algorithm will repeat step 2 until no more new clauses can be derived.
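Steps 2(c) and 2(d) can be sketched as a word-shortening predicate (again a hypothetical helper of ours, not the authors' implementation); with the clause SC(y,z) of the example above, the query shorten([y,z], Shorter) yields Shorter = [y]:

    :- dynamic pc/2, sc/2.

    % Step 2(d): strip the final character F when a matching suffix clause exists.
    shorten(Word, Shorter) :-
        append(Front, [P,F], Word),
        sc(P, F),
        append(Front, [P], Shorter).
    % Step 2(c): strip the initial character I when a matching prefix clause exists.
    shorten([I,S|Rest], [S|Rest]) :-
        pc(I, S).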
4.3 Experiments
In this section we will describe our abduction experiments. The experiments were performed on the data described above. We have constructed an algorithm that performed several passes over the data, deriving prefix clauses, suffix clauses and basic word clauses and removing redundant basic word clauses whenever possible.
Experiments without Linguistic Constraints. We used abductive inference to derive a rule-based model for the orthographic training data. The three rules were used for generating clauses from the training words, the observations, without using any background knowledge. We used a ten-fold cross-validation set-up, which means that we randomly divided the training data into ten parts and used each of them as positive test data while using the other nine as training data. The same negative data set was used in all ten experiments.
Table 1. Average performance and size of the models generated by abductive inference without using background knowledge. The algorithm converts the training strings into models that on average contain 800 orthographic rules and 1200 phonetic rules. The models perform well on the positive test data (more than 99% were accepted) but poorly on negative data (rejection rates of 56% and 75%)—roughly the baseline recognition rate for bigrams.

data type      method      number of clauses             % accepted       % rejected
                           basic word   prefix   suffix  positive data    negative data
orthographic   abduction   27±0         377±2    377±2   99.3±0.3         55.7±0.9
orthographic   baseline    -            -        -       99.3±0.3         55.7±0.9
phonetic       abduction   41±0         577±2    577±2   99.1±0.4         74.8±0.2
phonetic       baseline    -            -        -       99.0±0.5         76.8±0.5
The performance of the model was similar to the baseline performance (see Table 1). For orthographic data it performed well in accepting positive data (99.3%) but poorly in rejecting negative data (55.7%). The false negatives included loan words like bye, fjord, hadzj and kreml and short words containing an apostrophe like ’m and q’s. The false positives contained reasonable strings like eep but also peculiar ones like bsklt and zwsjn. We discovered that the learning algorithm had difficulty dealing with single characters that appeared in the training data. These characters were identified as basic words. Because of this most strings could be analysed with a basic word clause combined with either only prefix clauses or only suffix clauses. For example, man can be accepted with the clauses PC(m,a), PC(a,n) and BWC(n). Because of the absence of single characters in the phonetic data, it is more difficult to find out what the origin of the problems with the phonetic models was. However, the result was the same as for the orthographic models: after training almost all symbols were identified as basic words. Again many strings could be accepted by the models in an incorrect way. The acceptance rate for the positive data was high (99.1%) but the reject rate for negative strings was not so good (74.8%). Neither for orthographic data nor for phonetic data did we manage to improve the baseline performance (see Table 1).
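The over-acceptance can be reproduced with a small, self-contained sketch. Here accepts/1 is our own terminating reformulation of the clauses of Section 4.1 (reduce a string with licensed prefix and suffix clauses until a stored basic word remains); the three clauses for man are the ones mentioned above:

    :- dynamic bwc/1, pc/2, sc/2.

    pc(m,a).  pc(a,n).  bwc([n]).

    accepts(Word) :- bwc(Word).
    accepts([I,S|Rest]) :- pc(I,S), accepts([S|Rest]).
    accepts(Word) :-
        append(Front, [P,F], Word), sc(P,F),
        append(Front, [P], Shorter), accepts(Shorter).

    % ?- accepts([m,a,n]).  succeeds: m and a are stripped by the prefix
    % clauses, leaving the single-character basic word n.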
Adding Extra Linguistic Constraints. As noted above, we look to the phonetic model by Cairns and Feinstein (1982) for suggestions about possibly innate linguistic constraints, and we will use this model in our experiments as well. One problem we have to solve is the conversion of the Cairns and Feinstein model to the clause structure. For this purpose we will make use of a model developed in our stochastic experiments (Tjong Kim Sang, 1998). This model is derived from Figure 1 of Section 3.3. It consists of a set of linked states and the production capabilities of each state have been limited to a subset of all characters. We will use this model for deriving some extra learning constraints.
[Figure 2 is a state diagram with states s1–s9, divided into the parts P, B and S described in the caption.]
Fig. 2. An initial model for orthographic data, derived from the Cairns and Feinstein model. States produce characters and arcs denote possible transitions. The model has been divided into three parts: a part P in which the prefix rules operate, a part B generated by the basic word clauses and a part S in which the suffix clauses work. Constraints within each part of the model will be converted to constraints on the format of the clauses.
The automaton in Figure 2 contains two separate parts with nine states in total. Each of the states s1 to s7 corresponds to one of the seven leaves of the tree in Figure 1 (the left-to-right order is the same). States s8 and s9 are duplicates of s2 which will be used for generating non-vowel strings. States produce characters and arcs denote the possibility of moving from one state to another. Arcs without a start state mark initial states and arcs without an end state mark final states. We want to divide the model into three parts: one part in which only prefix clauses operate, one part in which only suffix clauses work and one part that is generated by basic word clauses. This division is shown in Figure 2: part P is the part for the prefix clauses, part B is for the basic word clauses and part S is for the suffix clauses. Each character production by states in the parts P and S corresponds to a set of prefix or suffix clauses. The states s5 and s6 have been put in the basic word clause part because s6 is able to produce vowels and we want to produce all vowels in the basic word clauses. This extension is necessary to allow the model to cover loan words like creme and shares. The model of Figure 2 is equivalent to the initial Hidden Markov Model used in our stochastic experiments with linguistic constraints (Tjong Kim Sang, 1998). However, for learning purposes there are differences between the two. Each character produced by a state in the P and the S parts can be modeled with a prefix or a suffix clause. But we cannot produce every single character in the B part with a separate basic word clause because basic word clauses contain character sequences and, in contrast to simple abductive models (without linguistic knowledge), we will not combine basic word clauses in the model. Instead of making each basic word state produce only one character, we will make them behave as a group of states and allow them to produce character
sequences. This may cause problems when there are internal parts which repeat themselves an arbitrary number of times. The behaviour invoked by the self-links from the states s4, s6 and s9 cannot be modeled with the basic word clauses. The learning process will not generate models with extendible basic word clauses and this means that the generalisation capabilities of the resulting rule models will be weaker than those of the HMMs developed in Tjong Kim Sang’s thesis (Tjong Kim Sang, 1998). Now that we have interpreted the Cairns and Feinstein initialisation model in the terms that we are using in this section, we can attempt to derive usable constraints from this model. The Cairns and Feinstein model imposes constraints on the characters that can be generated by a particular state. We have defined the following constraints for our orthographic data: the vowels a, e, i, o, u and the quote character ’ can only be generated by the states s4 and s6, the ambiguous vowel/consonant y can be generated by any state and all other characters are consonants which can be generated by any state except s4. The new states s8 and s9 are consonant states: they can generate any character except the six characters a, e, i, o, u and ’ (we regard the quote character here as a vowel). When we inspect the model with these character production constraints in mind we can make two interesting observations. First of all, the prefix clause states cannot produce the characters a, e, i, o, u and ’. We will call these characters pure vowels. Since the characters produced by these states are put before a word by prefix clauses, this means that prefix clauses cannot add a pure vowel prefix to a word. Second, the suffix clause state cannot produce a pure vowel. A character produced by this state is a character appended to a word by a suffix clause. This means that suffix clauses cannot append a pure vowel to a word. We can summarise these two observations in the following two rules:
Prefix Clause Constraint. In a prefix clause PC(I,S) the character I that is prepended to a word cannot be a pure vowel.
Suffix Clause Constraint. In a suffix clause SC(P,F) the character F that is appended to a word cannot be a pure vowel.
It is not possible to derive a similar constraint for the basic word clauses because these can contain both vowels and consonants. The derivation presented here applies only to orthographic data. In a similar fashion one can encode alternative initial phonetic models, including similar constraints for prefix and suffix clauses.3 Our phonetic data contains 18 vowels. We repeated the experiments for deriving rule-based models for our orthographic and our phonetic data by using the prefix and the suffix clause constraints presented in this section. Apart from these extra constraints the experiment setup was the same as described in the previous section.
3 An additional constraint can be derived for phonetic data: the basic word clauses cannot consist of a mixture of vowels and consonants. We did not use this constraint because we expected it would cause practical problems in the learning phase.
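The two constraints are easy to state as a filter on candidate clauses. The following sketch is our own hypothetical encoding for the orthographic data (pure_vowel/1 and admissible/1 are our names; the quote character is treated as a vowel, as described above):

    pure_vowel(a).  pure_vowel(e).  pure_vowel(i).
    pure_vowel(o).  pure_vowel(u).  pure_vowel('''').

    % A candidate prefix clause may not prepend a pure vowel; a candidate
    % suffix clause may not append one.
    admissible(pc(I,_S)) :- \+ pure_vowel(I).
    admissible(sc(_P,F)) :- \+ pure_vowel(F).

For example, admissible(pc(s,t)) holds, while admissible(pc(a,n)) fails, so a clause such as PC(a,n) from the man example given earlier would no longer be derived.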
Table 2. The models generated by abductive inference and their performance after training with the extra prefix and suffix clause constraints. They perform well in accepting valid test data (maximally 1.4% error) and reasonable in rejecting negative data (error rates of 8% and 15%). These models eliminate about two-thirds of the baseline error rates for negative data.

data type      method      number of clauses             % accepted       % rejected
                           basic word   prefix   suffix  positive data    negative data
orthographic   abduction   197±4        376±3    194±1   98.6±0.3         84.9±0.3
phonetic       abduction   38±1         674±4    456±2   99.0±0.5         91.9±0.3
The resulting models were evaluated in the same way as in the previous experiments. The results of these tests can be found in Table 2. The constraints added during learning make the abduction process generate better models. The orthographic model performs worse in accepting positive data (98.6% versus 99.3%) but remarkably better in rejecting negative data (84.9% versus 55.7%). The phonetic model performs about equally well in accepting positive data (99.0% versus 99.1%) and much better in rejecting negative data (91.9% versus 74.8%).
4.4 Abductive Reasoning in Language Learning
In this chapter we have described experiments on building models for monosyllabic words with abductive inference. We have designed an abduction process which is capable of handling orthographic and phonetic data. We performed two variants of the learning experiments: one that started without background knowledge and one that was supplied with constraints which were extracted from the syllable model by Cairns and Feinstein (1982). The process without background knowledge produced models which performed well in recognising positive data but poorly in rejecting negative data (see Table 1). The linguistically initialised process generated models that performed well in both cases (see Table 2). From the results of the experiments described in this section we may conclude that abductive inference is a good learning method for building monosyllabic orthographic and phonetic models. Abduction in combination with linguistic constraints generates models that perform much better than the models that were generated without background knowledge. It is natural to note that abductive inference not only performs reasonably well on this learning task, but that it also produces models which are subject to direct examination and analysis by area specialists, in this case phonologists.
5 Conclusions and Future Directions
The problem of phonotactics as it has been tackled here is basically the problem of sequencing. The results show that abductive inference performs credibly, if not
perfectly on this task. Tjong Kim Sang (1998) and Tjong Kim Sang and Nerbonne (1999) compare this work to learning by stochastic automata (Hidden Markov Models) and biologically and cognitively inspired Neural Networks (so-called “Simple Recurrent Networks”). These results indicate that rule abduction performs about as well on the sequence learning task as the popular HMMs and also that further advancements in the application of neural networks to sequencing are needed.4 The results further indicate that using linguistic constraints helps the learning algorithms since this results in improvements in speed and accuracy. Abductive inference is particularly well-suited to accommodating background knowledge which linguists have developed, typically in the form of rule systems. Finally, results show that learning from written symbols is only slightly more difficult than learning from phonetic representation, but this may have to do with the fairly phonic Dutch orthography. The models derived in this chapter satisfy four of the five properties Mark Ellison outlined in his thesis (Ellison, 1992). They are cipher-independent (independent of the symbols chosen for the phonemes); language-independent (they make no assumptions specific to a certain language); accessible (in symbolic form); and linguistically meaningful. They fail to satisfy Ellison’s first property (operation in isolation) because they receive specific language input: monosyllabic words. This was deliberate, naturally, since we wished to focus on the single problem of sequencing. Moreover, the techniques can help a linguist to get a rough impression of the syllable structure of a language. There are numerous natural extensions and refinements of the work presented here, not only seeking improved performance with these techniques and extending the study to other learning techniques, but also refining the task so that it more closely resembles the human task of language learning. This would involve incorporating frequency information, noisy input, and (following Ellison’s criterion) coding input for phonetic properties, and naturally extending the task to multisyllable words and to related tasks in phonological learning.
Acknowledgements. The authors want to thank two reviewers for valuable comments on earlier versions of this chapter. This work was sponsored by the former Dutch Graduate Network for Language, Logic and Information (currently Dutch Graduate School in Logic, OZSL).
4 Without background knowledge, the HMMs accepted 99.1% of the positive orthographic data and rejected 82.2% of the negative orthographic data. Using linguistic constraints changed these figures to 99.2% and 77.4% respectively. For phonetic data these numbers were 98.7%, 91.6%, 98.9% and 92.9%. The neural networks performed poorly. They were not able to reject more than 10% of the negative data in any experiment (Tjong Kim Sang & Nerbonne, 1999).
References
1. Baayen, R., Piepenbrock, R., & van Rijn, H. (1993). The Celex Lexical Database (CD-ROM). Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA.
2. Cairns, C. E., & Feinstein, M. H. (1982). Markedness and the theory of syllable structure. Linguistic Inquiry, 13(2).
3. Ellison, T. M. (1992). The Machine Learning of Phonological Structure. PhD thesis, University of Western Australia.
4. Finch, S. P. (1993). Finding Structure in Language. PhD thesis, University of Edinburgh.
5. Gilbers, D. (1992). Phonological Networks. PhD thesis, University of Groningen. ISSN 0928-0030.
6. Gold, E. (1967). Language identification in the limit. Information and Control, 10, 447–474.
7. Kazakov, D., & Manandhar, S. (1998). A hybrid approach to word segmentation. In Page, D. (Ed.), Proceedings of ILP-98. Springer. Lecture Notes in Computer Science, vol. 1446.
8. Ladefoged, P. (1993). A Course in Linguistic Phonetics (3rd edition). Philadelphia.
9. Muggleton, S. (1992). Inductive logic programming. In Muggleton, S. (Ed.), Inductive Logic Programming, pp. 3–27.
10. Pinker, S. (1994). The Language Instinct. W. Morrow and Co., New York.
11. Tjong Kim Sang, E. F. (1998). Machine Learning of Phonotactic Structure. PhD thesis, University of Groningen.
12. Tjong Kim Sang, E. F., & Nerbonne, J. (1999). Learning simple phonotactics. In Neural, Symbolic, and Reinforcement Methods for Sequence Learning, pp. 41–46. Proc. IJCAI workshop.
13. van Zonneveld, R. (1988). Two level phonology: Structural stability and segmental variation in Dutch child language. In van Besien, F. (Ed.), First Language Acquisition. ABLA papers no. 12, University of Antwerpen.
Grammar Induction as Substructural Inductive Logic Programming
Pieter Adriaans (1,2) and Erik de Haas (1)
1 Syllogic, P.O. Box 2729, 3800GG Amersfoort, The Netherlands
{p.adriaans,e.de.haas}@syllogic.com
2 University of Amsterdam, Faculty of Mathematics, Computer Science, Physics and Astronomy, Plantage Muidergracht 24, 1018TV Amsterdam, The Netherlands
Abstract. In this chapter we describe an approach to grammar induction based on categorial grammars: the EMILE algorithm. Categorial grammars are equivalent to context-free grammars. They were introduced by Ajdukiewicz and formalised by Lambek. Technically they can be seen as a variant of the propositional calculus without structural rules. Various learnability results for categorial grammars are known. There exists a whole landscape of these so-called substructural logics. This suggests an extension of the ILP research program in the direction of what one might call substructural ILP. We discuss the application of substructural logic to database design and present some complexity results from the literature that suggest the feasibility of this approach.
1 Introduction
In this chapter we introduce the notion of substructural ILP and apply it to grammar induction. If one removes the structural rules from traditional propositional calculus one obtains the Lambek calculus, which can be interpreted as a categorial grammar. This grammatical formalism is known to be equivalent to context-free grammars (Pentus, 1993). Various authors have proved learnability results for categorial grammars (Adriaans, 1990, 1999; Kanazawa, 1994). The Lambek calculus can be seen as a variant of the so-called substructural logics. Research in logic in the past decades (Troelstra, 1992) has shown that the landscape between the traditional propositional calculus and the predicate calculus is inhabited by a rich variety of systems: modal logics, resource conscious logics (e.g. linear logic) and substructural logic. For a number of these logics the metamathematical qualities are well known. This poses an interesting dilemma for the student of applied logic. If one wants to use logic to describe certain phenomena in reality there are in principle two options: 1) one takes some variant of predicate calculus, e.g. Horn clauses, and tries to model the phenomena in this medium, or 2) one tries to find a certain variant of logic in the substructural landscape that has characteristics that intrinsically model the target concepts. The latter route is, to the knowledge of the authors, hardly taken by researchers
in ILP. In this chapter we show that in some areas, especially grammar induction, the substructural approach has specific advantages. These advantages are: 1) a knowledge representation that models the target concepts intrinsically, 2) of which the complexity issues are well known, 3) with an expressive power that is in general weaker than the Horn-clause or related representations that are used in more traditional ILP research, 4) for which explicit learnability results are available. These observations suggest a wider approach that we call substructural ILP. De Haas, for example, describes a semantic model for UML (Unified Modelling Language) based on a substructural propositional modal logic (de Haas, 2000). The fact that object oriented (OO) database concepts can be expressed in a fragment of logic that is much weaker than the classical predicate calculus is interesting. It suggests that we do not need all this power to learn relevant design rules for OO and ER data bases. We can develop a system that learns database constraints and design rules based on a representational formalism that is fundamentally weaker than Horn clauses.
2 Grammar Induction as Substructural Logic
2.1 The Lambek Paradigm
If we want to develop a mathematical theory about language learning the first and most important question we can ask is: what mathematical structure do we expect natural language to have? There are several relatively unchallenged presuppositions we can make here. Utterances in a natural language seem to be organized in discrete entities called sentences. Sentences consist of discrete entities called words. We can construct sentences out of words by an operation called concatenation. If we have the words Tweety and flies we can form the sentence “Tweety • flies”. The dot is a concatenation operator in this formula. Concatenation is not commutative. According to Lambek there are three structural levels on which we can study language in this respect:
A multiplicative system. There is a concatenation operation. Some concatenations are grammatical, others not, but there are no additional structural rules. This view corresponds to the idea that sentences are bracketed strings (trees).
A semigroup. This is the same as the multiplicative system except for the fact that concatenation is defined to be associative. In this view sentences are unbracketed strings.
A monoid. Concatenation is associative, there is a unity element 1. Sentences are unbracketed strings and there is an empty word.
The subsets of a multiplicative system M are subject to three operations:
A • B = {x.y ∈ M | x ∈ A & y ∈ B}    (1)
C/B = {x ∈ M | ∀y∈B (x.y ∈ C)}    (2)
A\C = {y ∈ M | ∀x∈A (x.y ∈ C)}    (3)
A, B and C are subsets of M. We read: A times B, C over B and A under C. One can easily verify that the following statements hold for all A, B and C in M:
A • B ⊆ C if and only if A ⊆ C/B    (4)
A • B ⊆ C if and only if B ⊆ A\C    (5)
If M is a semigroup we have
(A • B) • C = A • (B • C)    (6)
If M is a monoid with a unity element 1 we have:
I • A = A = A • I    (7)
where I = {1}. Lambek observed that one half of the equivalences (4) and (5) seems to correspond to the deduction theorem
(A & B) ⊢ C implies A ⊢ (B → C)    (8)
in the propositional calculus. From this it is but a small step to formulate rules like:
A • A\C ⊢ C    (9)
C/B • B ⊢ C    (10)
that correspond to the modus ponens in the propositional calculus. An interesting consequence is that (9) and (10) can be interpreted as a form of type-oriented functional application, much along the lines of the ideas on compositional semantics in Montague Grammar. In fact these two application rules lie at the basis of Ajdukiewicz’s original proposal for a syntactical calculus (Ajdukiewicz, 1935). Building on these ideas Lambek developed a syntactical calculus. We might as well cite Lambek’s own description of the research program behind the idea of categorial grammars:
A categorial grammar of a language may be viewed as consisting of the syntactic calculus freely generated from a finite set of basic types together with a dictionary which assigns to each word of the language a finite set of types composed from the basic types and I by the three binary operations. We say that such a categorial grammar assigns type S to a string A1 A2 . . . An of words if and only if the dictionary assigns type Bi to Ai and B1 B2 . . . Bn → S is a theorem in the freely generated syntactic calculus. One may consider the categorial grammar to be adequate provided it assigns type S to A1 A2 . . . An if and only if the latter is a well formed declarative sentence to some other standard. (Oehrle, 1988, p. 304)
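A small worked instance, with an assumed dictionary assignment that is not given in the chapter: if the dictionary assigns Tweety the basic type N and flies the type N\S, then by (3) every element of N\S yields an element of S when preceded by an element of N, so rule (9) gives N • N\S ⊢ S and the concatenation Tweety • flies receives the sentence type S. Rule (10) plays the symmetric role for function types of the form S/N.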
Note that by introducing the notion of a dictionary it becomes clear that Lambek is tacitly assuming that types are intensional objects of which we have only partial knowledge. In fact a complete extensional description of the types implicitly defines a dictionary. Lambek obtained a decidability result by transforming the original calculus to an equivalent Gentzen-like sequent calculus. We may replace arrows like f : A → B by multi-arrows f : A1, A2, . . ., An → B. If we drop the letter f we get what Gentzen calls a sequent. The concept of a multi-linear function f : A1 × . . . × An → B is very important in learning theory. An expression of the form f (a1, . . ., an) = b can be interpreted as:
– A production rule
– A database tuple with a functional dependency
– A sequent in a Gentzen-like calculus
The concept of a sequent plays a unifying rôle in the intersection between linguistics, logic and knowledge representation. As Lambek puts it:
Depending on what application one has in mind, one may think of such a sequent as:
1. a contextfree derivation in a categorial grammar, where the arrows have the opposite direction compared to those in generative grammar;
2. as a deduction in a variant of intuitionistic logic without structural rules, namely the intuitionistic fragment of Girard’s linear logic, but without the interchange rule;
3. as a multilinear operation in algebra, as introduced by Bourbaki to explicate the tensor product;
4. as an arrow (or multiarrow) in a multicategory, abstracting from (3) above.
The Ai have different names in these four applications: they are called types in (1), formulas in (2), sorts (or bimodules) in (3) and just objects in (4) (Lambek, 1990).
The axioms and rules of the Lambek calculus L may be replaced by:
A → A
from x, A, B, y → C infer x, A • B, y → C
from x → A and y → B infer x, y → A • B
from z → B and x, A, y → C infer x, A/B, z, y → C
from x, B → A infer x → A/B
from z → B and x, A, y → C infer x, z, B\A, y → C
from B, x → A infer x → B\A
from x, y → A infer x, I, y → A
→ I
Here x, y and z stand for finite sequences A1, A2, . . . An. Note that each of these rules, with the exception of A → A, introduces one of the symbols “•”, “I”, “\” or “/”. Consequently this calculus is decidable. For any sequent x → S there are only a finite number of possible proof trees, since there are only a finite number of rules and each rule introduces a symbol. Lambek proved that this calculus is equivalent to L. To do this it is convenient to have an additional rule that enables us to use existing results in new proofs, the cut:
from z → A and x, A, y → B infer x, z, y → B    (11)
It is clear that this cut is a potential threat to the decidability of the calculus, because it does not introduce new symbols. Fortunately one can prove that cuts can be eliminated. Lambek gave cut-elimination proofs for various syntactical calculi (without unity element I in 1958, with I in 1969, non-associative 1961). Lambek’s proof comes very close to Gentzen’s original decidability proof for the intuitionistic propositional calculus. The main difference is the absence of the so-called structural rules:
– Interchange: from x, A, y, B, z → C infer x, B, y, A, z → C
– Contraction: from x, A, A, y → C infer x, A, y → C
– Thinning: from x, y → C infer x, A, y → C
In fact the Lambek calculus becomes of interest in a linguistic context because of the absence of these rules:
– The fact that we have no interchange rule reflects the fact that a sentence in general loses its grammaticality if words are interchanged.
– The fact that we have no contraction reflects the fact that in general we may not remove identical words that occur directly after each other without destroying grammaticality.
– The fact that we have no thinning reflects the fact that in general we cannot add words to a sentence without destroying grammaticality.
The decidability proofs of Lambek and Gentzen share the same basic structure. If we add structural rules to the Lambek calculus, it collapses into the intuitionistic propositional calculus. The operators “\” and “/” collapse into “⇒” and “•” into “&”.
2.2 Grammar Induction as Substructural Logic: The EMILE Algorithm
The insights described in the last paragraph form the motivation for the EMILE algorithm (Adriaans, 1992, 1999). Currently researchers from different backgrounds frequently rediscover the fact that by applying forms of cluster analysis
to text one can identify semantic and syntactic clusters. New disciplines like PAC learning combined with a proper analysis of the complexity issues involved may lead to feasible language learning algorithms that can be applied to practical problems. What is lacking, however, is a unifying framework to evaluate these ad hoc results. We claim that Adriaans (1992) does provide such a framework and that the EMILE approach could serve as a valuable metamodel to evaluate clustering approaches to language learning.1 The learning context presupposed by EMILE is one in which we have a dialogue between a teacher and a pupil. The teacher generates grammatically correct sentences (positive examples). The pupil can ask yes/no questions concerning the grammatical validity of new sentences (positive as well as negative examples). We say that a language can be learned effectively when the number of examples, yes/no questions and computing and storage power the pupil needs to learn a grammar are polynomially related to the length of a description of the grammar. Informally, the circumstances under which EMILE learns a context-free or categorial language effectively are:
– Each grammatical construction in the language can be illustrated with a relatively small example sentence. (The length of description of the sentence must be logarithmic in the length of the description of the grammar.)
– The teacher gives examples randomly, but in such a way that simpler sentences have a much higher probability of occurring than complex sentences. (In the proof we use the Solomonoff-Levin distribution in which the probability of a sentence is related to its complexity.)
– The learning algorithm must be able to check the grammatical validity of example sentences by means of an oracle.
We argue that these constraints are quite reasonable for natural dialogues. The first constraint tells us that at first we only have to analyze the simple sentences that the teacher gives us. We can postpone the analysis of longer sentences, because we know for certain that they are built using grammatical rules that already occur in simple sentences. The second constraint tells us that we will get all the necessary examples in time. We only need a polynomial number of example sentences in order to have the right building blocks to start the learning process. The third constraint states that we need both positive and negative examples. In practice there is a large number of potential mechanisms to speed up the learning process considerably. Note that EMILE employs a rather strong definition of grammar induction. The original grammar and the learned grammar must be weakly equivalent, i.e., recognize exactly the same set of sentences as grammatical. For natural languages this is probably too rigid. It means that somebody only knows English when he knows the whole dictionary by heart. If
1 Adriaans (1992) officially deals with the learnability of categorial grammars. The results presented there have gained a lot in value since Pentus proved the equivalence of categorial and context-free languages (Pentus, 1993).
we loosen our definition to learning subgrammars the effectiveness of the learning process increases considerably. Note that the EMILE algorithm is based on insights from categorial grammar. The theory of categorial grammar teaches us that expressions that belong to the same type can be substituted for each other in any context in a sentence. Consequently we will define a type as a set of expressions that can be substituted in any context without destroying the grammaticality of that context. In categorial grammar the complete structure of the language is determined by an assignment of types to words in a lexicon. These lexicons may be ambiguous, that is, a word may belong to several categories. EMILE in its basic form is able to deal with any conceivable form of lexical ambiguity. We describe the EMILE algorithm in its basic form. Since our aim is to present this algorithm as a tool to the ML-community, we omit mathematical details. In the example we assume that the grammar is not lexically ambiguous. We introduce the slash notation used by categorial grammar. The operator \ enables us to transform the rule S → ab into a\S → b and vice versa. The operator / enables us to transform the rule S → ab into S/b → a and vice versa. These operators help us to isolate and manipulate parts of rules that would otherwise be inaccessible. Below we list the steps of the EMILE algorithm. The algorithm consists of the following stages:
1. Selection of a sample of sentences.
2. Generation of the first order explosion.
3. Cross-check of expressions in different contexts.
4. Clustering of expressions and contexts into types.
5. Induction of general types from specific cases.
6. Rewriting of rules.
The detailed description of the algorithm will be illustrated by the following example-input:
John loves Mary
Mary walks
We will suppose this sample to be characteristic, i.e., it contains an example of every possible construction of the grammar in its simplest form. In subsequent sections we will study the exact conditions under which we can draw a characteristic sample of a language in polynomial time. We start by forming the first order explosion of the sentences in the example. This means that for each sentence we examine the number of ways in which this sentence can be split up into subexpressions and contexts. Another way of looking at this is to consider all ways in which we can place two brackets ‘(’ and ‘)’, such that the sentence is split up into three parts α, β and γ. β is the subexpression and α and γ form the context. α and γ may be empty. For each division S → αβγ we move the context to the other side of the functor: α\S/γ → β. This results in the complete first order explosion for the example as illustrated in Table 1.
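A compact sketch of this step in Prolog (our own illustration; explosion/2 and its rule/3 representation are hypothetical names, not part of EMILE): every solution rule(Alpha, Beta, Gamma) corresponds to an entry α\S/γ → β of the substitution matrix in Table 1:

    % explosion(+Sentence, -rule(Alpha, Beta, Gamma)): enumerate all ways of
    % splitting a sentence (a list of words) into a context (Alpha, Gamma)
    % and a non-empty subexpression Beta.
    explosion(Sentence, rule(Alpha, Beta, Gamma)) :-
        append(Alpha, Rest, Sentence),
        append(Beta, Gamma, Rest),
        Beta \= [].

For instance, explosion([john,loves,mary], R) yields, among others, R = rule([], [john], [loves,mary]) (the entry S/loves Mary → John) and R = rule([john], [loves], [mary]) (the entry John\S/Mary → loves).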
Table 1. Substitution-matrix for the enriched first order explosion.
Columns: (1) John, (2) John loves, (3) John loves Mary, (4) loves, (5) loves Mary, (6) Mary, (7) Mary walks, (8) walks.

                (1) (2) (3) (4) (5) (6) (7) (8)
S/loves Mary     s   ·   ·   ·   ·   s   ·   ·
S/Mary           ·   s   ·   ·   ·   ·   ·   ·
S                ·   ·   s   ·   ·   ·   s   ·
John\S/Mary      ·   ·   ·   s   ·   ·   ·   ·
John\S           ·   ·   ·   ·   s   ·   ·   s
John loves\S     s   ·   ·   ·   ·   s   ·   ·
S/walks          s   ·   ·   ·   ·   s   ·   ·
Mary\S           ·   ·   ·   ·   s   ·   ·   s
The next step is to cluster context rules into types. Expressions that can be substituted into the same contexts belong to the same type. Thus, to generate a set of types from the enriched first order explosion, we can cluster expressions on the basis of equivalent contexts. For the matrix in Table 1 this corresponds to re-ordering the rows and columns such that rectangles of s’s can be found along the diagonal in the matrix.2
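The clustering criterion itself can be stated in one clause (a hypothetical sketch of ours; fills(Context, Expression) is assumed to enumerate the ‘s’ entries of the substitution matrix):

    % Two expressions belong to the same type when they fill exactly the same
    % set of contexts.
    same_type(Expr1, Expr2) :-
        setof(C, fills(C, Expr1), Contexts),
        setof(C, fills(C, Expr2), Contexts).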
Table 2. Re-ordered substitution-matrix for the enriched first order explosion.
Columns: (1) John, (2) Mary, (3) John loves, (4) John loves Mary, (5) Mary walks, (6) loves, (7) loves Mary, (8) walks.

                (1) (2) (3) (4) (5) (6) (7) (8)
S/loves Mary     s   s   ·   ·   ·   ·   ·   ·
John loves\S     s   s   ·   ·   ·   ·   ·   ·
S/walks          s   s   ·   ·   ·   ·   ·   ·
S/Mary           ·   ·   s   ·   ·   ·   ·   ·
S                ·   ·   ·   s   s   ·   ·   ·
John\S/Mary      ·   ·   ·   ·   ·   s   ·   ·
John\S           ·   ·   ·   ·   ·   ·   s   s
Mary\S           ·   ·   ·   ·   ·   ·   s   s
Each rectangle corresponds to one type. The resulting matrix for our example is shown in Table 2. By choosing variable names for the types and making the appropriate substitutions, we can create the following grammar:
2 Ambiguous grammars may cause s’s to appear outside these rectangles. We need more sophisticated clustering techniques for ambiguous languages, but that does not affect the complexity in a fundamental way.
S → BA
S → AE
A → John|Mary
B → AD
D → loves
E → DA|walks
For a detailed description of the algorithm we refer to (Adriaans, 1992). The EMILE algorithm can learn the class of shallow context-free grammars efficiently. The scalability of EMILE is illustrated by Figure 1, which shows the convergence on the text of the Bible in the King James translation.
Fig. 1. Graph displaying number of different grammatical types learned after reading n bible books.
The total process takes about one hour on a Sun workstation. It does not yet converge to a concise grammar but it finds scores of interesting rule fragments and substitution classes: about 5200 different grammatical types in 2.3 million expressions and about 3 million contexts. An example of a dictionary type found by EMILE is:
[573] --> Simeon
[573] --> Naphtali
[573] --> Gad
[573] --> Asher
[573] --> Issachar
[573] --> Zebulun
[573] --> Onan
In a Dutch course text of about 1000 sentences EMILE found amongst others the following set of rules:
[0] --> [12] ?
[12] --> Waar [23]
[12] --> Wie [46]
[12] --> Hoe [13]
[12] --> Wat [31]
[31] --> is [33]
[31] --> heb je gisteren gedaan
[31] --> vind jij daarvan
[33] --> jouw naam
[33] --> uw naam
[33] --> je leeftijd
[33] --> jouw mening hierover
[33] --> jouw opvatting over dit onderwerp
This set of rules parses a sentence like ‘Wat is uw naam?’ ([What [is [your name]]?]) correctly. This illustrates the linguistic viability of EMILE. The EMILE paradigm suggests a more general approach which can be described as substructural ILP. In the following section we discuss a number of technical issues that illustrate the feasibility of these ideas.
3 A Research Program: ILP Based on Substructural Logic
In the field of Inductive Logic Programming (ILP) a lot of work is done in developing strategies to find simple (‘simple’ in the sense that the hypothesis has a short description) and consistent hypotheses from a base of logical rules and axioms (facts). The main examples in this field deal with finding logic programs from a base of facts. The main problem of the ILP techniques is their computational tractability. Because first order predicate logic (FOL) is enormously expressive, calculating with it is in general computationally undecidable. This means that the search for an interesting hypothesis or relation could take forever, even when such a hypothesis exists. In this context the art of ILP could be seen as the ability to define strategies that come up with interesting hypotheses within reasonable computational resources. The taming of the computational complexity can, for example, be achieved by syntactically restricting the FOL language in order to get a more tractable subset of full FO predicate logic, or by restricting the use of the formulas in the search algorithm by decorating them with input-output schemas. These techniques combined with search heuristics have delivered some good results. In this chapter we propose an alternative route for applying ILP techniques. We propose to use in ILP less expressive, substructural logics with nicer computational behaviour. This way we reduce the computational burden of ILP and
use well studied logics with nice computational properties. We claim that for many purposes there exist substructural logics that are expressive enough to express the hypothesis and base of facts for ILP. For example in (de Haas, 1995) and (de Haas, 2000) we have designed a substructural logic, called the logic of categories, which has the same expressiveness as a modern information system modeling language like UML (OMG, 1997; Fowler, 1997). This logic is in general as expressive as most modern modeling and database languages. This logic is a substructural logic based on the traditions of modal and linear logic (Troelstra, 1992; Chagrov & Zakharyaschev, 1996) and is interpreted in a semantic domain of objects (as opposed to a domain of relations as for FO predicate logic). A logic like this would be well suited for applying ILP techniques in data mining, because there one searches for interesting descriptions of relations in an information system. An important substructural logic in grammar formalisms is the Lambek calculus (Lambek, 1958). The fact that this logic is decidable means that we can learn linguistic structures using ILP techniques in a computationally more attractive environment than that of FOL. Another recent example of a decidable substructural logic that is capable of expressing interesting linguistic structures is the substructural logic for encoding and transforming Discourse Representation Systems of Tsutomu Fujinami (Fujinami, 1997). This illustrates that the field of substructural logic is quite capable of handling interesting issues in natural language. More generally, by using substructural logics with nicer (decidable and better) computational behaviour, we will get the ILP search problem into a league of well studied optimization problems. This will open doors to very fruitful algorithmic research for ILP applications. We claim that for many purposes there exist substructural modal logics that are powerful enough to express the hypothesis and facts such that they can be input to learning algorithms based on ILP techniques. The applications to (categorial) grammar induction and data mining on information systems described in UML, mentioned in this chapter, are just two nice examples in a broader research domain. We illustrate these applications in this research context in Figure 2. Another very interesting observation from the substructural logic approach to inductive logic programming is the relation between the formal learning framework and learning strategies. For example, for Boolean concept learning a strategy called PAC learning (Valiant, 1984) has been developed, and learnability results have been proved for this kind of learning. Similarly, for Categorial Grammar Induction the EMILE algorithm implements a PAC-like strategy for learning categorial grammars, for which learnability results have also been proven, as indicated in this chapter. Imagine a transformation between the boolean calculus and the string calculus. There exists a transformation that simply eliminates the structural rules from the representational theory of the boolean calculus to form a string oriented calculus. This indicates that for the landscape of logics between the boolean calculus (containing all structural rules) and the Lambek calculus (containing no
structural rules) similar learning strategies can be developed and similar learnability results could be obtained.
Fig. 2. Research context where SL = substructural logic, ML = modal logic, SML = substructural modal logic, ISLP = inductive substructural logic programming, IMLP = inductive modal logic programming, ISMLP = inductive substructural modal logic programming.
The exact nature of these transformations is not clear, as is illustrated by the ‘XX’ classes in Figure 3. The power of this approach is illustrated by the fact that one can immediately deduce a proof of the PAC learnability of the class of k-structural CNF languages from the PAC learnability proof for k-CNF boolean formulas due to Valiant (Valiant, 1984). Table 3 gives the shifts in complexity of the sample and hypothesis space if we remove the structural rules from the boolean calculus. Basically this is a shift from a set oriented to a string oriented representation. One sees that removing the structural rules leads to a slight increase of complexity of the hypothesis space, but also to a fundamental reduction in complexity of the sample space. Surprisingly these transformations do not affect Valiant’s learnability result (Adriaans & de Haas, 2000). The use of logics that are tailored more closely to the application domain than a general logic like FOL is a trend in a broader research program. We cite from the manifesto “Logic and the challenge of computer science” of the famous logician Yuri Gurevich (Gurevich, 1988):
[...] But the new applications call, we believe, for new developments in logic proper. First order predicate calculus and its usual generalizations are not sufficient to support the new applications.
Fig. 3. Transformations in the research context.

Table 3. Comparison of complexity indicators for boolean concepts and finite pattern languages.

                         Boolean Concept Learning   Finite Languages
                         (k-CNF)                    (k-SUBSTR CNF)
Lexicon                  U                          U
Size sample space        2^|U|                      |U|^k
Size hypothesis space    2|U|! / (2|U| - k)!        (2|U|)^k
[...] It seems that we (the logicians) were somewhat hypnotized by the success of classical systems. We used first-order logic where it fits and where it fits not so well. We went on working on computability without paying adequate attention to feasibility. One seemingly obvious but nevertheless important lesson is that different applications may require formalizations of different kinds. It is necessary to ”listen” to the subject in order to come up with the right formalization. [...]
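As a concrete illustration of the string-oriented, structural-rule-free setting discussed above, recognition with a classical AB categorial grammar (the system that the Lambek calculus refines) can be written in a few lines of Prolog. This sketch is ours, not taken from the chapter, and the tiny lexicon is invented for the example.

% Sketch only: bottom-up recognition with an AB (classical) categorial grammar.
% Forward application: X/Y, Y => X.   Backward application: Y, Y\X => X.
:- op(400, yfx, \).                    % declare \ as an infix category constructor

lex(every,   np/n).      lex(all,     np/n).
lex(company, n).         lex(manuals, n).
lex(wrote,   (np\s)/np).

reduce([C], C).
reduce(Cs, C) :-
    append(Pre, [X/Y, Y | Post], Cs),  % forward application
    append(Pre, [X | Post], Cs1),
    reduce(Cs1, C).
reduce(Cs, C) :-
    append(Pre, [Y, Y\X | Post], Cs),  % backward application
    append(Pre, [X | Post], Cs1),
    reduce(Cs1, C).

sentence(Words) :-
    maplist(lex, Words, Cats),
    reduce(Cats, s).

% ?- sentence([every, company, wrote, all, manuals]).   % succeeds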
4
Computational Advantages of Substructural Logic
The common syntactical restrictions of FOL used in logic programming do not tame the computational complexity of the system.
Theorem 1. Given a program P and a query Q, the question whether there is a substitution σ such that σ(Q) follows from P is undecidable (Shapiro, 1984).

This means that provability in the Horn fragment of FOL is undecidable. This causes problems for doing ILP in this fragment, because the search space for good solutions will be unlimited. For logic programs with restrictions on the input-output behaviour of the predicates ('well and nicely moded' logic programs (Apt & Pellegrini, 1994)) there is, to our knowledge, no decidability result proven. In (Aarts, 1995) a complexity formula for the upper bound for provability for these logic programs is given. This formula, however, is in some cases unbounded.

The common denominator of several important nonclassical logics is that in their sequent formulation they reject or restrict some structural rules. Logics that are obtained by restricting the structural part of classical logic are called substructural logics. The most important substructural logics investigated until now are intuitionistic logic, relevant logic, BCK logic, linear logic and the Lambek calculus of syntactic categories (Schroeder-Heister, 1993).

The fruits of doing ILP in a decidable fragment are obvious: the search space for the optimal solution is limited. Of course, working in a decidable fragment does not entail computational feasibility. In ILP, however, one is not necessarily interested in an optimal solution; often a good solution suffices. In the context of a decidable logic (or, even better, computational upper bounds like P or NP), this means that the ILP search problem falls into a class of optimization problems in a decidable search space. These kinds of problems are very common in AI and algorithmic research. Moreover, in some domains we are able to improve the computational behavior by using techniques from modal logics to describe the needed fragments of FOL (Andreka, van Benthem, & Nemeti, 1996). An example of a substructural modal logic is the logic of categories (de Haas, 2000) we mentioned above. To support our proposal we quote some complexity results for substructural logics.

– Horn programming in linear logic is NP-complete (Kanovich, 1992).
– Provability in the non-modal fragment of Propositional Linear Logic (MALL) is PSPACE-complete (Lincoln, Mitchell, Scedrov, & Shankar, 1992).
– Provability in the non-modal fragment of (predicative) Linear Logic (MALL1) is NEXPTIME-complete (Lincoln, 1994; Scedrov, 1995).

We note that one still needs to be careful. If one re-introduces the structural rules in a controlled manner in a substructural logic, as is done in linear logic with the 'bang' ('of course') modality, one again loses decidability:

– Propositional Linear Logic (with 'full' modalities) is undecidable (Lincoln et al., 1992).
5
Conclusion and Further Research
In this chapter we have presented the well known EMILE algorithm as a special case of a new paradigm that we call substructural ILP. This approach has
specific advantages over a more traditional ILP approach to language learning. These advantages are: 1) a knowledge representation that models the target concepts intrinsically, 2) of which the complexity issues are well known, 3) with an expressive power that is in general weaker than the Horn-clause or related representations that are used in more traditional ILP research, 4) for which explicit learnability results are available. Of course there is also a price to pay. Using different representational languages means that we need special induction machines for different variants of substructural logic. In the near future we hope to present a further analysis of the substructural landscape in the context of learning systems. The fact that design constraints for databases can be expressed in substructural logic stresses the power of these ideas. The best illustration of the feasibility of this approach based on substructural logics is the fact that we are currently preparing a test to do grammar induction on a corpus of 10 million Dutch sentences on a large parallel system. Data sets of this size are to our knowledge beyond the capability of current ILP systems.
References

1. Aarts, E. (1995). Investigations in Logic, Language and Computation. Ph.D. thesis, University of Utrecht.
2. Adriaans, P. (1990). Categoriale modellen voor kennissystemen. Informatie, 118–126.
3. Adriaans, P. (1992). Language learning from a categorial perspective. Academisch proefschrift, Universiteit van Amsterdam.
4. Adriaans, P. (1999). Learning Shallow Context-Free Languages under Simple Distributions. CSLI Publications, Stanford University.
5. Adriaans, P., & de Haas, E. (2000). Substructural PAC learning. Research report, Syllogic.
6. Ajdukiewicz, K. (1935). Die syntaktische Konnexität. Studia Philosophica, 1, 1–27.
7. Andreka, H., van Benthem, J., & Nemeti, I. (1996). Modal languages and bounded fragments of predicate logic. Pre-print ML-96-03, ILLC, Amsterdam.
8. Apt, K., & Pellegrini, A. (1994). On the occur-check free Prolog programs. ACM TOPLAS, 16 (3), 687–726.
9. Chagrov, A., & Zakharyaschev, M. (1996). Modal Logic. Oxford University Press.
10. de Haas, E. (1995). Categorial graphs. In Reichel, H. (Ed.), Fundamentals of Computation Theory, FCT'95, Vol. 965 of LNCS, pp. 263–272. Springer.
11. de Haas, E. (2000). Categories for Profit. Ph.D. thesis, Universiteit van Amsterdam. In preparation.
12. Shapiro, E. Y. (1984). Alternation and the computational complexity of logic programs. Journal of Logic Programming, 1, 19–33.
13. Fowler, M. (1997). UML Distilled: Applying the Standard Object Modeling Language. Addison Wesley Longman.
14. Fujinami, T. (1997). A decidable linear logic for transforming DRSs in context. In Dekker, P., Stokhof, M., & Venema, Y. (Eds.), Proceedings of the 11th Amsterdam Colloquium, pp. 127–132.
15. Gurevich, Y. (1988). Logic and the challenge of computer science. In Boerger, E. (Ed.), Trends in Theoretical Computer Science, pp. 1–57. Computer Science Press.
16. Kanazawa, M. (1994). Learnable Classes of Categorial Grammars. Ph.D. thesis, Stanford University.
17. Kanovich, M. (1992). Horn programming in linear logic is NP-complete. In Proc. 7th Annual IEEE Symposium on Logic in Computer Science, pp. 200–210, Santa Cruz, CA. Full paper appears in Annals of Pure and Applied Logic.
18. Lambek, J. (1958). The mathematics of sentence structure. American Mathematical Monthly, 65, 154–169.
19. Lambek, J. (1990). Logic without structural rules. Another look at cut elimination. Ms., McGill University, Montreal.
20. Lincoln, P., Mitchell, J., Scedrov, A., & Shankar, N. (1992). Decision problems for propositional linear logic. Annals of Pure and Applied Logic, 56, 239–311.
21. Lincoln, P., & Shankar, N. (1994). Proof search in first order linear logic and other cut-free sequent calculi. In Proc. of the Ninth (IEEE) Symposium on Logic in Computer Science, pp. 282–291.
22. Oehrle, R. T., Bach, E., & Wheeler, D. (Eds.). (1988). Categorial Grammars and Natural Language Structures. D. Reidel Publishing Company, Dordrecht.
23. OMG, www.omg.org (1997). UML 1.1 Specification. Documents ad970802ad0809.
24. Pentus, M. (1993). Lambek grammars are context free. In IEEE Symposium on Logic in Computer Science.
25. Scedrov, A. (1995). Linear logic and computation: A survey. In Schwichtenberg, H. (Ed.), Proc. of the 1993 Summer School at Marktoberdorf, Germany, No. 139 in Ser. F. Comput. System Sci., pp. 379–395. Springer Verlag. Also as Report, Dept. of Mathematics, University of Pennsylvania, 1993.
26. Schroeder-Heister, P., & Došen, K. (Eds.). (1993). Substructural Logics. Oxford University Press.
27. Troelstra, A. (1992). Lectures on Linear Logic. No. 29 in Lecture Notes. CSLI, Stanford.
28. Valiant, L. (1984). A theory of the learnable. Communications of the ACM, 27 (11), 1134–1142.
Experiments in Inductive Chart Parsing James Cussens1 and Stephen Pulman2 1
Department of Computer Science, University of York Heslington, York, Y010 5DD, UK
[email protected] 2 University of Cambridge Computer Laboratory New Museums Site, Pembroke Street, Cambridge CB2 3QG, UK
[email protected]
Abstract. We use Inductive Logic Programming (ILP) within a chart-parsing framework for grammar learning. Given an existing grammar G, together with some sentences which G cannot parse, we use ILP to find the "missing" grammar rules or lexical items. Our aim is to exploit the inductive capabilities of chart parsing, i.e. the ability to efficiently determine what is needed for a parse. For each unparsable sentence, we find actual edges and needed edges: those which are needed to allow a parse. The former are used as background knowledge for the ILP algorithm (P-Progol) and the latter are used as examples for the ILP algorithm. We demonstrate our approach with a number of experiments using context-free grammars and a feature grammar.
1
Introduction
The classic ILP formulation of a learning problem is couched in terms of Background information, some Evidence, and a Hypothesis. We are given the Background and the Evidence, and have to produce a Hypothesis that satisfies the following schema:

Background ∧ Hypothesis |= Evidence

Our learning problem is this: given a partial grammar and lexicon of a language, and a set of sentences, some of which can be correctly parsed, but some of which cannot, how can we extend the grammar and lexicon so as to produce plausible parses for all (or more) of the sentences? (This may also result in extra parses for previously analysed sentences: we want to try to ensure that these reflect genuine ambiguities.) Progress in solving this learning task would have real practical benefits in extending the coverage of existing grammars. We use a chart parsing framework for our experiments, partly because it is a practically efficient method, but also because chart parsing has a clean deductive formulation and can thus be regarded in principle as part of the Background. Using chart parsing has a further advantage, because when a parse fails, the information in the chart is sufficient for us to be able to determine what constituents would have allowed the parse to go through if they had been
found, a fact used by Mellish (1989) to repair failed parses in a relatively efficient way. We adapt Mellish’s technique to locate ‘needed’ constituents: constituents which, if there was a rule to produce them given the current input, would allow a complete parse to go through. A further advantage of using a chart-based framework is that the process of testing out hypothesised rules can be handled very efficiently, since we do not need to reparse those components of a sentence that were already handled successfully. When the ILP mechanism has suggested some plausible rule hypotheses we can restart the parser to focus in on the application of the hypothesised rule just to the needed constituents.
2
Overview of Chart Parsing
Details of chart parsing can be found in (Shieber, Schabes, & Pereira, 1995; Pereira & Warren, 1983); here we give only a very brief introduction, focussing on the parser used in these experiments. A sentence can be viewed as a linear sequence of vertices with words spanning neighbouring vertices, for example 0 Every 1 company 2 wrote 3 all 4 manuals 5. A chart parser will find edges between vertices and store these edges in a chart. A complete edge asserts the existence of a particular linguistic category between two vertices. For example, in Table 1, the first edge states that there is a determiner (det) from vertex 0 to vertex 1, and the seventh edge that there is a noun phrase (np) between vertex 0 and vertex 2. Grammar rules produce new edges from existing ones; for example, the rule NP → DET NOM produced edge 7 from edges 1 and 2 (this information is recorded in the edge/7 fact). Incomplete edges assert that a particular category exists if some other category starting at the appropriate vertex can be found. Edge 8 asserts that there is a sentence (s) starting at vertex 0 if there is a verb phrase (vp) starting at vertex 2. Edge 12 shows that there is indeed such a verb phrase (wrote all manuals). Edge 15 shows that we have a successful parse since we have a sigma edge (representing the start symbol of the grammar) spanning the entire sentence. Since the example chart in Table 1 was produced by a context-free grammar, the linguistic categories are atomic. Feature grammars produce more complex edges as illustrated later in Table 13.
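For concreteness, the following is a minimal sketch of the step that combines an incomplete edge with a complete edge ("the fundamental rule" of chart parsing), phrased over the edge/7 representation of Table 1. This is our own illustration, not the parser used in the experiments; new_edge_id/1 is an assumed helper.

% Sketch only: an incomplete edge waiting for category Cat is extended by a
% complete edge of that category starting at the vertex where the incomplete
% edge currently ends.
combine(IncId, CompId, NewId) :-
    edge(IncId, Origin, From, Mid, Mother, [Cat|StillNeeded], Kids),
    edge(CompId, _, Mid, To, Cat, [], _),
    new_edge_id(NewId),                                   % assumed helper
    assertz(edge(NewId, Origin, From, To, Mother, StillNeeded, [CompId|Kids])).

% e.g. combining edges 8 and 12 of Table 1 would yield edge 14:
% edge(14, s_np_vp, 0, 5, s, [], [12,7]).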
3
Method Overview
Sections 4 and 5 detail our experiments with a context-free and a feature grammar respectively. Here, by way of introduction, we summarise the stages in our approach. For convenience, assume here that lexical entries are introduced by grammar rules, so that references to 'grammar rules' in what follows are intended to include lexical entries and words.

Deduction For each unparsable sentence, we run a standard left-to-right, bottom-up chart parser to produce all edges possible for that sentence. Let us refer to such edges (complete and incomplete) as actual edges.
Table 1. Chart entries produced by parsing Every company wrote all manuals with a simple context-free grammar

%edge(EdgeID,Origin,LeftVertex,RightVertex,Cat,Needed,Contents)
edge(1, every, 0, 1, det, [], []).
edge(2, company, 1, 2, nom, [], []).
edge(3, wrote, 2, 3, vt, [], []).
edge(4, all, 3, 4, det, [], []).
edge(5, manuals, 4, 5, nom, [], []).
edge(6, np_det_nom, 0, 1, np, [nom], [1]).
edge(7, np_det_nom, 0, 2, np, [], [2,1]).
edge(8, s_np_vp, 0, 2, s, [vp], [7]).
edge(9, vp_v_np, 2, 3, vp, [np], [3]).
edge(10, np_det_nom, 3, 4, np, [nom], [4]).
edge(11, np_det_nom, 3, 5, np, [], [5,4]).
edge(12, vp_v_np, 2, 5, vp, [], [11,3]).
edge(13, s_np_vp, 3, 5, s, [vp], [11]).
edge(14, s_np_vp, 0, 5, s, [], [12,7]).
edge(15, sigma:dcl, 0, 5, sigma, [], [14]).
Abduction We then produce needed edges in a top-down manner. This produces edges which, if they existed, would allow a complete parse of the sentence. Each needed edge is presented to P-Progol1 as a positive example which requires "explanation". Treating needed edges as if they were actual edges is a form of abduction, since we are hypothesising that the needed edge should actually be there.

Induction Using the actual edges as background knowledge, P-Progol takes a particular needed edge and generates clauses which represent grammar rules which, if present, would allow the needed edge to be deduced from the actual edges in a single proof step.

Evaluation Each generated grammar rule is evaluated in the search for the 'best' grammar rule that entails the needed edge. We can either look for the rule that allows the entailment of the maximum number of needed edges of all types, or the rule that entails the maximum number of needed sigma edges. The former is likely to lead us to prefer more conservative additions to the grammar, at the phrasal level, whereas the latter could lead to a preference for more radical additions to the grammar of rules for new types of sentence structure. Rule evaluation is efficient since we already have a chart of actual edges stored in the background knowledge; we need only re-start the chart parser to look for those new edges which follow because of the rule under evaluation.

1 Recent versions of P-Progol go by the name of Aleph: see http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/aleph.html
4
Experiments with a Context-Free Grammar
4.1
Creating Background Knowledge
We took the grammar G1 given in Table 2, and used it, together with a lexicon of 101 items, to generate 500 sentences. We then removed the vp_v_np rule:

cmp_synrule(vp_v_np, vp, [vt,np]).

and attempted to parse the previously generated 500 sentences. 231 of the original 500 sentences were unparsable by the reduced grammar. All actual edges which could be produced were stored in a file of indexed edges. Table 3 lists the actual edges found for the first unparsable sentence Every companies wrote all manual. Figure 1 is a graphical representation of the important edges in Table 3. Notice that since the context-free grammar was derived from a feature grammar it overgenerates by ignoring agreement information.

Table 2. Grammar G1

cmp_synrule(sigma:dcl, sigma, [s]).        % SIGMA --> S
cmp_synrule(s_np_vp, s, [np,vp]).          % S --> NP VP
cmp_synrule(np_det_nom, np, [det,nom]).    % NP --> DET NOM
cmp_synrule(vp_v, vp, [vi]).               % VP --> VI
cmp_synrule(vp_v_np, vp, [vt,np]).         % VP --> VT NP (Removed)
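As an aside, the cmp_synrule/3 facts of Table 2 can be read as a simple top-down generator or recogniser. The sketch below is ours, not the authors' generator; lex(Cat, Word), e.g. lex(det, every), stands in for the 101-item lexicon.

% Sketch only: enumerating (or recognising) the sentences of G1.
derive(Cat, [Word]) :-
    lex(Cat, Word).                        % assumed lexicon predicate
derive(Cat, Words) :-
    cmp_synrule(_Id, Cat, Daughters),
    derive_daughters(Daughters, Words).

derive_daughters([], []).
derive_daughters([D|Ds], Words) :-
    derive(D, W1),
    derive_daughters(Ds, W2),
    append(W1, W2, Words).

% ?- derive(sigma, Sentence).               % enumerates sentences of G1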
Table 3. Edges for the first unparsable sentence

%edge(SentenceID,EdgeID,Origin,LeftVertex,
%     RightVertex,Cat,Needed,Contents)
edge(1, 1, every, 0, 1, det, [], []).
edge(1, 6, np_det_nom, 0, 1, np, [nom], [1]).
edge(1, 2, companies, 1, 2, nom, [], []).
edge(1, 7, np_det_nom, 0, 2, np, [], [2,1]).
edge(1, 8, s_np_vp, 0, 2, s, [vp], [7]).
edge(1, 3, wrote, 2, 3, vt, [], []).
edge(1, 4, all, 3, 4, det, [], []).
edge(1, 9, np_det_nom, 3, 4, np, [nom], [4]).
edge(1, 5, manual, 4, 5, nom, [], []).
edge(1, 10, np_det_nom, 3, 5, np, [], [5,4]).
edge(1, 11, s_np_vp, 3, 5, s, [vp], [10]).
4.2
Creating Examples
The next step was to find needed edges. We use the algorithm TD_GEN_NEEDS given in Table 4 to generate needed edges, starting from the top-level need for a sigma edge which spans the entire input. Note that TD_GEN_NEEDS is quite restricted. For example, it can only find single edges which allow a parse, rather than sets of edges which together allow a parse. However, the restriction to rules with either one or two daughters is not fundamental and can be dropped to allow rules with longer RHSs. The needed edges for each sentence input were then stored (Table 5). The edges have been simplified to be 4-tuples of sentence id, category, from-vertex and to-vertex. Note also that the predicate symbol edge/4 is used although these edges do not (yet) exist. This reflects our abductive approach.

Fig. 1. Graphical representation of important edges produced by trying to parse Every companies wrote all manual. Solid lines represent complete edges; the dashed line represents an incomplete edge looking for a vp starting at vertex 2.

Table 4. Generating needed edges

needs(S) :-
    make_lexical_edges(0,S,Finish,InitialAgenda),
    chart_parse(InitialAgenda),
    top_down_needs(sigma,0,Finish),
    fail.

% single daughter case: Cat -> Daughter
top_down_needs(Cat,From,To) :-
    cmp_synrule(_Id,Cat,[Daughter]),
    top_down_needs(Daughter,From,To).

% two daughters, first one already found: Cat -> Daughter1, Daughter2
top_down_needs(Cat,From,To) :-
    cmp_synrule(_Id,Cat,[Daughter1,Daughter2]),
    edge(_,_,From,Next,Daughter1,[],_),
    assert_if_new(need(Daughter2,Next,To)),
    top_down_needs(Daughter2,Next,To).

% two daughters, second one already found: Cat -> Daughter1, Daughter2
top_down_needs(Cat,From,To) :-
    cmp_synrule(_Id,Cat,[Daughter1,Daughter2]),
    edge(_,_,Next,To,Daughter2,[],_),
    assert_if_new(need(Daughter1,From,Next)),
    top_down_needs(Daughter1,From,Next).
Table 5. Needed edges

%edge(SentenceID,Cat,From,To)
edge(1, vp, 2, 5).
edge(1, vi, 2, 5).    %WRONG!
edge(1, s, 0, 5).
edge(1, sigma, 0, 5).
edge(4, vp, 2, 5).
...
Recall that the first unparsable sentence is Every companies wrote all manual. The allegedly 'needed' edge edge(1, vi, 2, 5) in Table 5 for this sentence would allow a parse of the sentence, but clearly this would be an incorrect parse in the light of the intended meaning of lexical symbols like vi. There are a number of ways of detecting the incorrectness of edges like edge(1, vi, 2, 5). We can use linguistic knowledge to constrain our hypothesis language to avoid obviously incorrect rules, such as those which allow the deduction of a lexical edge of category vi with length greater than one. Alternatively, if our training data is annotated with the correct parses, then we can simply check that the parse sanctioned by any hypothesised rule is the correct one. In the absence of annotated training data or reasonable constraints we do not have sufficient information to avoid the induction of logically correct but linguistically implausible rules.

4.3
Running P-Progol
We now have all the necessary input to use P-Progol to search for suitable grammar rules. We include

1. The chart parser
2. The actual edges for each sentence
3. The incomplete grammar
4. The (complete) lexicon
as background knowledge. The needed edges are then read in as positive examples of edges. One convenience of P-Progol is that positive examples are not asserted: edge(1, vp, 2, 5). is represented internally as example(1,pos,edge(1, vp, 2, 5)). This means that positing these edges as positive examples does not automatically allow the unparsable sentences to become parsable. P-Progol output is shown in Table 6, beginning with the top-level induce goal entered by the user, down to the final theory. We begin with the first needed edge found, which is for a vp between vertices 2 and 5 in sentence 1. P-Progol
Table 6. P-Progol run on baseline experiment.

?- induce.
[edge(1,vp,2,5)]
[bottom clause]
edge(A,vp,B,C) :-
    edge(A,vt,B,D), lex_edge(A,wrote,B,D), edge(A,np,D,C).
...
edge(A,vp,B,C) :-
    edge(A,vt,B,D).
[constraint violated]
...
[best clause]
edge(A,vp,B,C) :-
    edge(A,vt,B,D), edge(A,np,D,C).
[pos-neg] [693]
[atoms left] [231]
[sat] [2] [edge(1,vi,2,5)]
[bottom clause]
edge(A,vi,B,C) :-
    edge(A,vt,B,D), lex_edge(A,wrote,B,D), edge(A,np,D,C).
[best clause]
edge(A,vi,B,C) :-
    edge(A,vt,B,D), edge(A,np,D,C).
[pos-neg] [231]
[atoms left] [0]
...
[time taken] [1.89]
then generates the most specific ('bottom') clause in the hypothesis language which allows the needed edge to be directly inferred. The literals in the bottom clause are there simply because the following three complete actual edges

edge(1, vt, 2, 3).
lex_edge(1, wrote, 2, 3).
edge(1, np, 3, 5).

are all in our background knowledge. lex_edge/4 is used to indicate that an edge is lexical: a word spanning two neighbouring vertices. P-Progol just looks for edges which will span the required vertices and turns constants into variables. Only one generalisation of this clause represents a valid grammar rule, and this is, of course, the missing grammar rule. However, the assertion of this rule only allows 693 of the needed edges to be found. There are another 231 incorrect vi edges left. These are explained by the erroneous VI → VT NP rule. In this trivial example, the incorrect vi edges could have been filtered out before running P-Progol, or P-Progol could have disallowed obviously incorrect rules such as VI → VT NP, by incorporating
the relevant background knowledge about possible and impossible forms of rule. In non-artificial grammar learning, we will discard all detectably incorrect needed edges and rules, but it is unlikely that all actually incorrect edges and rules will be detected.

4.4
Implementation Issues
A feature of our approach is that we can use a standard ILP algorithm to do induction in a chart parsing framework. To do this we essentially "include the chart parser in the background". When checking to see whether an edge has been entailed, P-Progol uses the clauses given in Table 7. If this is the first edge under consideration for a particular sentence, we collect the needed edges for that sentence, grab the current hypothesis, translate it into a grammar rule, and then re-start the parser. This will generate as many edges as possible for that sentence, and is done only once for each sentence: if the sentence has already been re-parsed we just check for the edge. Another point is that all appropriate grammar rules must be available to the chart parser when evaluating hypothesised rules. There are three kinds: grammar rules from the original incomplete grammar, grammar rules that have been induced earlier in the induction process, and the grammar rule currently under consideration. To do this, we rename original grammar rules from cmp_synrule to cmp_synrule_orig and include the three clauses in Table 8. We also include directly analogous clauses for lexical items.

Table 7. Checking for edges by re-starting the chart parser

edge(SId,Cat,From,To) :-
    \+ setting(stage,saturation),         %we are searching
    !,                                    %ignore P-Progol's edge clause
    check_or_parse(SId,Cat,From,To),
    !,
    edge(_,_,From,To,Cat,[],_).           %is it there?

check_or_parse(SId,Cat,From,To) :-        %just check don't parse
    are_parsing(SId), !.

check_or_parse(SId,Cat,From,To) :-
    cleanup3,                             %delete edges from previous sentences
    assert(are_parsing(SId)),             %prevent reparsing in future
    collect_needs(SId,Needs),
    recorded(pclause,pclause(Head,Body),_),   %get current rule
    write_synruleword((Head:-Body),cmp_synrule(new,Mother,Daughters)),
    assert(ctr(1000)),
    !,
    continue_with_induced_rule(induced_rule(new,Mother,Daughters),Needs),
    !.
Table 8. Redefinition of grammar rule predicate

%Original rules
cmp_synrule(Name,Mother,Daughters) :-
    cmp_synrule_orig(Name,Mother,Daughters).

%Previously learnt rules
cmp_synrule(new,Mother,Daughters) :-
    clause(edge(Id,Mother,From,To),Body),
    write_synruleword((edge(Id,Mother,From,To):-Body),
                      cmp_synrule(new,Mother,Daughters)).

%Brand new rule
cmp_synrule(new,Mother,Daughters) :-
    recorded(pclause,pclause(Head,Body),_),   %current hypothesis
    write_synruleword((Head:-Body),
                      cmp_synrule(new,Mother,Daughters)).
4.5
Finding Missing Lexical Entries
Finding missing lexical entries is no different in principle from finding missing grammar rules.

edge(A,nom,B,C) :- lex_edge(A,report,B,C).

is a lexical entry recovered by P-Progol when there were no missing grammar rules. We just put the word itself in the background as an existing lex_edge and the need for a nom spanning the same vertices is sufficient to allow the construction of the clause representing the lexical entry.

4.6
Experiments with a Bigger CFG
We tested our approach on the bigger CFG in Table 9 with two grammar rules deleted. Using 501 sentences parsable by the full CFG, we produced 4302 actual edges for the background knowledge and 1180 needed edges as positive examples. We then ran P-Progol as described above. P-Progol found the rules listed in Table 10, performing 940 bottom clause constructions in the process and taking 30 seconds to do it. The two missing rules have been successfully recovered, as well as two spurious rules. The VP → PREP SIGMA rule arises from the eleventh sentence The manual on kim own every machines. Referring to the chart for this sentence in Table 11, edge 11 shows that we are looking for a VP after the manual and, on finding a PREP (edge 3) followed by a SIGMA (edge 19), we postulate the rule. We should stress that no linguistic constraints have been added to cut down on the number of induced rules. Such constraints may be implemented by including in the background knowledge the clause

false :- hypothesis(Head,Body,_), bad((Head:-Body))

and a suitable definition for bad/1 which identifies obviously incorrect rules.
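As an illustration only (this definition is ours, not the authors'), bad/1 could, for instance, reject any hypothesised rule that derives a lexical category such as vi from a chain of more than one edge, which is the kind of constraint discussed in Sect. 4.2; lexical_cat/1 and body_edge_count/2 are invented helper names.

% Hypothetical sketch of a bad/1 definition: a rule is 'obviously incorrect'
% here if its head is a lexical category but its body chains together more
% than one edge literal.
lexical_cat(det).  lexical_cat(nom).  lexical_cat(vi).  lexical_cat(vt).

bad((edge(_Sent,Cat,_From,_To) :- Body)) :-
    lexical_cat(Cat),
    body_edge_count(Body, N),
    N > 1.

body_edge_count((A,B), N) :- !,
    body_edge_count(A, NA),
    body_edge_count(B, NB),
    N is NA + NB.
body_edge_count(edge(_,_,_,_), 1) :- !.
body_edge_count(lex_edge(_,_,_,_), 1) :- !.
body_edge_count(_, 0).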
Table 9. Bigger incomplete grammar

cmp_synrule(sigma:dcl, sigma, [s]).            % SIGMA --> S
cmp_synrule(s_np_vp, s, [np,vp]).              % S --> NP VP
cmp_synrule(np_det_nom, np, [det,nom]).        % NP --> DET NOM
cmp_synrule(nom_nom_mod, nom, [nom,mod]).      % NOM --> NOM MOD
%cmp_synrule(nom_adjp_nom, nom, [adjp,nom]).   % NOM --> ADJP NOM
cmp_synrule(vp_vp_mod, vp, [vp,mod]).          % VP --> VP MOD
%cmp_synrule(mod_p_np, mod, [prep,np]).        % MOD --> PREP NP
cmp_synrule(vp_v, vp, [vi]).                   % VP --> VI
cmp_synrule(vp_v_np, vp, [vt,np]).             % VP --> VT NP
Table 10. Induced grammar rules

edge(A,nom,B,C) :- edge(A,adjp,B,D), edge(A,nom,D,C).
edge(A,vp,B,C) :- edge(A,prep,B,D), edge(A,sigma,D,C).   % spurious
edge(A,mod,B,C) :- edge(A,prep,B,D), edge(A,np,D,C).
edge(A,vt,B,C) :- edge(A,vp,B,D), edge(A,prep,D,C).      % spurious

Table 11. Chart produced by (failed) parse of the manual on kim own every machines

edge(11, 1, the, 0, 1, det, [], []).
edge(11, 2, manual, 1, 2, nom, [], []).
edge(11, 3, on, 2, 3, prep, [], []).
edge(11, 4, kim, 3, 4, np, [], []).
edge(11, 5, own, 4, 5, vt, [], []).
edge(11, 6, every, 5, 6, det, [], []).
edge(11, 7, machines, 6, 7, nom, [], []).
edge(11, 8, np_det_nom, 0, 1, np, [nom], [1]).
edge(11, 9, np_det_nom, 0, 2, np, [], [2,1]).
edge(11, 10, nom_nom_mod, 1, 2, nom, [mod], [2]).
edge(11, 11, s_np_vp, 0, 2, s, [vp], [9]).
edge(11, 12, s_np_vp, 3, 4, s, [vp], [4]).
edge(11, 13, vp_v_np, 4, 5, vp, [np], [5]).
edge(11, 14, np_det_nom, 5, 6, np, [nom], [6]).
edge(11, 15, np_det_nom, 5, 7, np, [], [7,6]).
edge(11, 16, nom_nom_mod, 6, 7, nom, [mod], [7]).
edge(11, 17, vp_v_np, 4, 7, vp, [], [15,5]).
edge(11, 18, s_np_vp, 5, 7, s, [vp], [15]).
edge(11, 19, s_np_vp, 3, 7, s, [], [17,4]).
edge(11, 20, vp_vp_mod, 4, 7, vp, [mod], [17]).
edge(11, 21, sigma:dcl, 3, 7, sigma, [], [19]).

5

Incomplete Experiments with a Feature Grammar
We report on some experiments where our approach is applied to a unification grammar: the LLL challenge dataset (Kazakov, Pulman, & Muggleton, 1998). This dataset includes an incomplete unification grammar and an accompanying
Table 12. Second unparsable annotated training sentence from the LLL dataset

parse([which,big,heavy,secretaries,have,arrived],
      sigma(whq([pos,_25677,pres(perf(arrive(_25677,
        qterm(wh,_25663^big(_25663,_25657^heavy(_25657,
          _25651^secretary(_25651)))))))]))).
lexicon, which is also incomplete. (The missing items are automatically randomly selected and are thus unknown to the authors. However, the second author wrote the code and the grammar for the LLL challenge, so this ongoing work cannot be viewed as an attempt to solve the LLL challenge. Nevertheless it is a useful dataset with which to demonstrate our method.) In the LLL challenge, we have 555 annotated sentences such as the one in Table 12. Exactly as for the context-free case, we ran the chart parser and found those sentences which could not be parsed with the given incomplete grammar and lexicon. We then ran TD_GEN_NEEDS as before, but this time, instead of always having a top-level need for an atomic sigma edge which spans the entire input, we have a need for a particular sort of sigma edge, represented by a complex term, found in the annotated training data. Running TD_GEN_NEEDS with the sigma edge from Table 12 produced the needed edges shown in Table 13. These are the positive examples that P-Progol will see.

Table 13. Needed edges for the second unparsable annotated training sentence from the LLL dataset

edge(9, np([ng,ng],f(0,0,0,0,_,_,1,1,1),
           sem(qterm(wh,A^big(A,B^heavy(B,C^secretary(C)))),_,_,_,_,_),
           f(0,0,1,1),subj), 0, 4).
edge(9, nom(f(0,0,0,0,_,_,1,1,1),
            sem(big(A,B^heavy(B,C^secretary(C))),A,_,_,_,_)), 1, 4).
edge(9, det(f(0,0,1,1),f(0,0,0,0,1,1,1,1,1),
            sem(qterm(wh,A^big(A,B^heavy(B,C^secretary(C)))),
                D^secretary(D),_,_,_,_)), 0, 3).
edge(9, s([ng,ng],f(0,0,0,0,_,_,_,1,1),
          sem([pos,A,pres(perf(arrive(A,qterm(wh,
              B^big(B,C^heavy(C,D^secretary(D)))))))],_,_,_,_,_),
          f(0,0,1,1),_,_), 0, 6).
edge(9, sigma(whq([pos,A,pres(perf(arrive(A,qterm(wh,
            B^big(B,C^heavy(C,D^secretary(D)))))))])), 0, 6).
Our strategy in searching for suitable unification grammar rules is to find terms in the needed edge and the neighbouring actual edges which unify. To this end, we use two background predicates univ(Term,Functor,ArgList,NumArgs)
and nth(N,List,NthElt) which recursively pull terms apart. P-Progol will check for terms which unify automatically. With such background knowledge, P-Progol constructs bottom clauses with a few hundred literals. Although this construction is quick (generally under 1 second), it is clearly impossible to search through every generalisation of this bottom clause for grammar rules. Fortunately the vast majority of such generalisations have a syntactic form which cannot represent possible grammar rules; in one experiment P-Progol generated 20,000 clauses, none of which represented grammar rules. Although it is possible to quickly reject invalid clauses by specifying constraints of the form false :- hypothesis(Head,Body,_),... in P-Progol, it is far more efficient to only generate suitable clauses in the first place. In P-Progol this is achieved by defining a refine(+MotherClause,-DaughterClause) predicate in the background knowledge which explicitly directs P-Progol's top-down search for good rules. Table 14 gives part of the definition of refine/2 used. The search starts with the clause false, so the first clause in Table 14 generates the first 'layer' of clauses.

Table 14. User-defined refinement operator

refine(false,(edge(A,B,C,D) :- univ(B,_,_,_),Rest)) :-
    completes_rule(A,C,D,Rest,3).
refine((Head:-Body),(Head:-NewBody)) :-
    addnewlit(Body,NewBody).
By defining a specialised refinement operator, we can 'jump' straight to clauses that represent reasonable grammar rules. Table 15 shows the first two clauses in a P-Progol search, and a later more specific rule. Note that we jump straight to an 8-literal clause. The second rule represents the grammar rule

cmp_synrule(new,nom(f(_,_,_,_,_,_,_,_,_),_),
            [adjp(_),adjp(_),nom(_,_)]).

In Table 15, the [18/0] following each of the first two clauses indicates that 18 needed edges can unify with edges produced using the grammar rule represented by both clauses. This is useful information since any refinement of these clauses will only check these 18 edges (as opposed to the original 1060 edges). However, the edges produced by the rules in Table 15 are insufficiently instantiated; we really want rules which produce edges which are identical to (alphabetic variants of) needed edges.
Table 15. P-Progol search with user-defined refinement operator (and user-defined cost)

[new refinement]
edge(A,B,C,D) :-
    univ(B,nom,E,2),
    edgeb(A,F,C,G), univ(F,adjp,H,1),
    edgeb(A,I,G,J), univ(I,adjp,K,1),
    edgeb(A,L,J,D), univ(L,nom,M,2).
[18/0]

[new refinement]
edge(A,B,C,D) :-
    univ(B,nom,E,2), nth(1,E,F), univ(F,f,G,9),
    edgeb(A,H,C,I), univ(H,adjp,J,1),
    edgeb(A,K,I,L), univ(K,adjp,M,1),
    edgeb(A,N,L,D), univ(N,nom,O,2).
[18/0]

...

edge(A,B,C,D) :-
    univ(B,nom,E,2),
    edgeb(A,F,C,G), univ(F,adjp,H,1), nth(1,H,I), univ(I,sem,J,6),
    edgeb(A,K,G,L), univ(K,adjp,M,1), nth(1,M,N), univ(N,sem,O,6),
    nth(1,O,P), univ(P,heavy,Q,2),
    edgeb(A,R,L,D), univ(R,nom,S,2).
[4/0]
We tell P-Progol what we are looking for explicitly with a user-defined utility function (defined in the background knowledge), where the utility of a rule is simply the number of needed edges for which the rule produces alphabetic variants. If this number is zero, the rule is deemed unacceptable; this is why the over-general rules in Table 15 are rejected. In our experiments, P-Progol has failed to find any acceptable unification grammar rules.
6
Conclusions and Future Work
In this paper we have taken a general-purpose ILP algorithm and used it within a chart parsing framework to induce missing grammar rules. Although the experiments are preliminary, we can say that the results on learning CFG rules are basically positive and those on learning feature grammar rules are negative. Our failure with the feature grammar is because we have not fully exploited ILP's declarative framework in order to tightly constrain the ILP search with linguistic knowledge. In very recent work (Cussens & Pulman, 2000), this has been done with constraints on head features and gap threading. We have also found a bottom-up approach more effective than P-Progol's top-down approach, addressing the search problems flagged in Section 5. Our work is related to that of Zelle and Mooney (1996) and also Muggleton's logical backpropagation (Parson, Khan, & Muggleton, 1999). In (Zelle & Mooney, 1996), rather than learn a parse(Sentence,Representation) predicate directly, Zelle and Mooney learn control rules for a shift-reduce parser. The connection with the approach presented here is that in both cases an indirect
approach is taken: intermediate stages of a proof/parse are represented and then examined to find appropriate rules. Our approach also shares features with that of (Osborne & Bridge, 1994), which uses over-general rules to extend incomplete derivations. Finally, our TD_GEN_NEEDS algorithm is similar to Progol bottom clause construction using logical backpropagation (Parson et al., 1999). In both cases a 'need' (= positive example) is 'pushed' in a top-down manner through an existing theory to allow the construction of clauses which allow the entailment of the need indirectly. However, logical backpropagation applies to general logic programs, rather than being tied to a chart parsing approach.

Acknowledgements

Thanks to Ashwin Srinivasan for help on P-Progol. The authors would like to acknowledge the support of the ILP2 project (Esprit 20237).
References

1. Cussens, J., & Pulman, S. (2000). Incorporating linguistics constraints into inductive logic programming. In Proc. LLL-2000. To appear.
2. Kazakov, D., Pulman, S., & Muggleton, S. (1998). The FraCaS dataset and the LLL challenge. Unpublished.
3. Mellish, C. (1989). Some chart based techniques for parsing ill-formed input. In Proc. 27th ACL, pp. 102–109, Vancouver, BC. ACL.
4. Osborne, M., & Bridge, D. (1994). Learning unification-based grammars using the Spoken English Corpus. In Grammatical Inference and Applications, pp. 260–270. Springer Verlag.
5. Parson, R., Khan, K., & Muggleton, S. (1999). Theory recovery. In Proc. of the 9th International Workshop on Inductive Logic Programming (ILP-99), Berlin. Springer-Verlag.
6. Pereira, F., & Warren, D. (1983). Parsing as deduction. In Proc. 21st ACL, pp. 137–144, Cambridge, Mass. ACL.
7. Shieber, S. M., Schabes, Y., & Pereira, F. C. N. (1995). Principles and implementation of deductive parsing. Journal of Logic Programming, 24 (1–2), 3–26.
8. Zelle, J. M., & Mooney, R. J. (1996). Learning to parse database queries using inductive logic programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, Portland, OR.
ILP in Part-of-Speech Tagging — An Overview Martin Eineborg1 and Nikolaj Lindberg2 1
2
Machine Learning Group, Department of Computer and Systems Sciences, Stockholm University and Royal Institute of Technology, Electrum 230, 164 40 Kista, Sweden
[email protected] Centre for Speech Technology, Department of Speech, Music and Hearing, Royal Institute of Technology, Drottning Kristinas v. 31, 100 44 Stockholm, Sweden
[email protected]
Abstract. This paper presents an overview of work on inducing part-of-speech taggers using Inductive Logic Programming. Constraint Grammar inspired rules have been induced for several languages (English, Hungarian, Slovene, Swedish) using Progol. This overview focuses on a Swedish tagger, but other work is discussed as well.
1
Introduction
The purpose of this paper is to give an overview of work done to induce Constraint Grammar (CG) rules using Inductive Logic Programming (ILP). ILP is a fairly novel machine learning technique for learning rules from examples. Focus will be on the work of inducing Swedish part-of-speech disambiguation rules (Eineborg & Lindberg, 1998; Lindberg & Eineborg, 1998, 1999), but other work (Cussens, 1997; Cussens et al., 1999; Horváth et al., 1999) will be discussed as well. For more details, the interested reader is referred to the references given. The task of a part-of-speech (POS) tagger is to assign to each word in a text the correct morphological analysis. POS tagging of unrestricted text is an interesting machine learning task for several reasons:

– It is a difficult task
– It is a well-known problem to which several different machine learning techniques have been applied (Brill, 1994; Cutting et al., 1992; Ratnaparkhi, 1996; Samuelsson et al., 1996; Zavrel & Daelemans, 1999)
– It has real world applications (in text-to-speech conversion, information extraction/retrieval, corpus linguistics, etc)
– Large data sets are available for several languages

Since a state-of-the-art tagger only makes a mistake in a few percent of the words, the margin for errors is very low. Furthermore, the learning algorithm must be able to process large sets of data (which unfortunately also contain errors) in reasonable time.
Hand-coded grammars for part-of-speech tagging have, at least for English, been proven successful (Karlsson et al., 1995). However, it is a complex and time-consuming task to develop one manually. To automatically induce a grammar for a new language would be a convincing way to illustrate the utility of a rule learning approach such as ILP. There are a number of reasons why ILP is an interesting technique in this context:

– ILP can make use of domain specific knowledge
– ILP systems produce interpretable rules (in contrast to e.g. purely statistical methods) expressed in logic
– There is a strong tradition of expressing language knowledge in logic

This paper is organised as follows: Section 2 touches on ILP and a specific instance of an ILP system, Progol, followed by a short account of the original Constraint Grammar tagger. Section 3 describes the common setting for all of the different studies summarised in this paper. Section 4 presents three studies on inducing a Swedish tagger, and some related work is presented in Sect. 5. Finally, Sect. 6 gives a short summary and discusses a few topics neglected in the work described in the previous sections.
2
Background
This section briefly presents the ILP machine learning paradigm, the Progol system (Muggleton, 1995) used in all subsequently described experiments, and the Constraint Grammar framework for part-of-speech tagging, which has inspired most ILP work on tagging.

2.1
Inductive Logic Programming
Inductive Logic Programming (ILP) (Muggleton, 1991) is a combination of machine learning and logic programming, where the goal is to find a hypothesis given examples and background knowledge, such that the hypothesis along with the background knowledge logically implies the examples:

Hypothesis ∧ Background knowledge |= Examples

In the work referred to in this paper, the Hypothesis that the learning system is trying to produce is a set of Constraint Grammar rules that would make a morphologically ambiguous text as unambiguous as possible with as few errors (rules covering negative examples) as possible. The Background knowledge could be anything from access predicates used to inspect a specific feature that a word has to, e.g., a grammar for noun phrases. The Examples are usually divided into a set of positive examples (that the rules should cover) and a set of negative examples (that the rules should not cover). All authors referred to in Sect. 4 and 5 below have used the Progol ILP algorithm (Muggleton, 1995) to induce disambiguation rules. The input to Progol
is a set of positive and negative examples, given as ground Prolog facts, and a set of background knowledge Prolog predicates. The user sets a language bias, which constrains the set of possible rules. The user also sets other restrictions, e.g. only allowing rules of a minimum accuracy, or rules covering a minimum number of positive examples.

2.2
Constraint Grammar POS Tagging
Constraint Grammar (CG) is a framework for automatic POS tagging and shallow syntactic dependency analysis of unrestricted text (Karlsson et al., 1995). CG is 'reductionistic', since the rules discard ambiguous readings rather than identify the correct ones. A nice feature of CG is that the rules can be thought of as independent of each other; new rules can be added, or old ones changed, without worrying about unexpected behaviour because of complex dependencies between existing rules or because of rule ordering. The words in the input text are looked up in a lexicon, and assigned one tag for every possible analysis (in other words, each word is assigned an ambiguity class). The morphologically ambiguous text is passed on to the tagger, which discards incorrect readings among the ambiguities, according to contextual constraints. For example, a local-context rule discarding a verb (V) reading of a word following a word unambiguously (-1C) tagged as determiner (DET) can be expressed as in Tapanainen (1996):

REMOVE (V) (-1C (DET));

In addition to the above remove rule, there are select rules, used when the correct reading has been identified and all other readings should be removed. Lexical rules discard readings given that the conditions on the target word (the word to disambiguate) are satisfied, and finally barrier rules allow arbitrarily long context between a context word and the word to disambiguate, given that the words in between do not have certain features. Examples of remove, select and barrier rules are given in Sect. 4.3. CG rules refer to part-of-speech tags and to word tokens as well as to sets of tags. Since some rules might not be applicable until some disambiguation has been taken care of by other rules, the CG rules are applied in several rounds. The rules are divided into rule sets of varying reliability, and the more reliable ones are applied first. A full-scale CG consists of more than a thousand rules. The rules are hand-coded by experts, and the CG developers report high figures of accuracy for unrestricted English text: 99.7% of the words retain the correct reading, and 93–97% of the words are unambiguous after tagging (Karlsson et al., 1995, page 186).
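To make the reductionistic style concrete, the following small sketch (ours, not part of any CG implementation) shows one way the remove rule quoted above could be applied to a sentence represented as a list of Word-Readings pairs; select/3 is the standard list-library predicate, and the tag names are invented for the example.

% Sketch only: applying REMOVE (V) (-1C (DET)).  A v reading is discarded
% from a word whose left neighbour is unambiguously a determiner, and only
% while at least one other reading remains (mirroring the CG convention of
% never removing the last reading).
apply_remove_v([], []).
apply_remove_v([W-[det] | Rest0], [W-[det] | Rest]) :-
    !,
    drop_v(Rest0, Rest1),
    apply_remove_v(Rest1, Rest).
apply_remove_v([WR | Rest0], [WR | Rest]) :-
    apply_remove_v(Rest0, Rest).

drop_v([W-Readings0 | Rest], [W-Readings | Rest]) :-
    select(v, Readings0, Readings),
    Readings \= [],
    !.
drop_v(Sentence, Sentence).

% ?- apply_remove_v([the-[det], plan-[n,v], works-[n,v]], Out).
% Out = [the-[det], plan-[n], works-[n,v]]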
3
Inducing Taggers Using ILP
In the last few years, a body of ILP-based work on inducing CGs has emerged (Cussens, 1997; Cussens et al., 1999; Eineborg & Lindberg, 1998; Horváth et al.,
1999; Lindberg & Eineborg, 1998, 1999). In all these studies, training data was generated from tagged corpora. Ambiguously tagged texts were produced by looking up the words in a lexicon which itself was created from a tagged corpus. Negative and positive examples consisted of a left context, a right context, and a target reading taken from the ambiguity class of the target word. This is illustrated in Table 1, where a positive and a negative example for learning remove rules discarding noun tags are found. The positive example is an instance of when a remove rule could be correctly applied to a target noun tag. The negative example, on the other hand, is an example of when it would be incorrect to apply a rule discarding a noun tag. Typically, for each part-of-speech category, thousands of such examples have been automatically generated from the corpus and lexicon.

Table 1. A 'remove noun' training data example

              Left context           Target      Right context
Input words   The    man             saw         her     leave
Ambig. class  Det.   Noun/Verb       Noun/Verb   Pron.   Noun/Verb
Positive ex.  Det.   Noun            Noun        Pron.   Verb

Input words   A      big             can         of      worms
Ambig. class  Det.   Adj.            Noun/Verb   Prep.   Noun/Verb
Negative ex.  Det.   Adj.            Noun        Prep.   Noun
In the next section, focus will be on the work by the present authors. An overview of related work is given in Sect. 5.
4
A Tagger for Swedish — Three Experiments
This section summarises three papers addressing the problem of inducing a Swedish tagger. This work was carried out to investigate the feasibility of inducing high-quality disambiguation rules from Swedish data using Progol. Since initial experiments presented in Sect. 4.1 and 4.2 indicated that this was indeed possible, more advanced background knowledge was added in a subsequent study, described in Sect. 4.3. Swedish is a Germanic language, closely related to Norwegian and Danish, and its morphology is somewhat richer than that of e.g. English. Swedish nouns are of two different genders. In a noun phrase, determiners and adjectives agree in number, gender and definiteness with the head noun. Examples of noun phrase agreement are found in Table 2. Verbs are inflected for tense, but not number. The training material was sampled from a pre-release of the Stockholm-Umeå Corpus (SUC) (Ejerhed et al., 1992). SUC covers just over one million words of manually corrected POS tagged Swedish text, sampled from different text genres. SUC has 146 different tags, and they consist of part-of-speech information and
Table 2. Swedish noun phrase agreement

Det.  Adj.  Noun       Translation
en    fin   katt       a nice cat
den   fina  katten     the nice cat
      fina  katter     nice cats
de    fina  katterna   the nice cats
ett   fint  hus        a nice house
det   fina  huset      the nice house
      fina  hus        nice houses
de    fina  husen      the nice houses
morphological features. There are 25 different POS categories. Thus, many of the 146 tags represent different inflected forms. For example, a word tagged as a verb in the present tense, active voice, has the tag VB PRS AKT, and a word tagged as a verb in the past tense, active voice, is tagged VB PRT AKT. See Table 3 for some corpus data.

Table 3. Some statistics of the Stockholm-Umeå Corpus

                                        Number of occurrences
Running words (tokens) and delimiters   1,166,589
Delimiters                              134,089
Different word forms (types)            97,170
Word forms occurring only once          55,032
The corpus data was split into a training, a tuning and a test set. Due to differences in complexity of the induced theories, hence also in search space, a varying number of examples was used, ranging between 2,000 and 10,000 positive examples and a matching number of negative ones. The examples were of the general format

RuleType(POS, LeftContext, Target, RightContext)

The window size (i.e., the number of words in LeftContext, Target and RightContext) of the examples ranged from three words to an arbitrary number, depending on the background knowledge. When the background knowledge only allowed rules to be constructed which referred to context words in fixed positions around the target word, the window was limited to only a few words. This was motivated by the fact that sensible constraints referring to a position relative to the target word utilise close context, typically 1-3 words (Karlsson et al., 1995, page 59). When adding more complex background knowledge, such as rules for forming noun phrases, which can be of varying length, longer context is motivated.
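Purely for illustration, a Table 1-style positive example for a 'remove noun' rule could be written as a ground fact in this general format roughly as follows; the predicate and feature names are ours, and the concrete encoding used in the cited papers may differ.

% Hypothetical encoding of the Table 1 positive example.
remove(noun,
       [w(the,[det]), w(man,[noun])],       % left context (disambiguated)
       w(saw, noun),                        % target reading to be removed
       [w(her,[pron]), w(leave,[verb])]).   % right context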
The rules presented in the following subsections have been chosen more to illustrate the expressiveness of the rules than to be examples of linguistically interesting generalisations. The format of the rules is not the actual output from Progol but a common intermediate format.

4.1
No Higher Level Background Knowledge
In the first experiment of three (Eineborg & Lindberg, 1998), no higher-level grammatical background knowledge was used: the rules could refer to a full reading (part-of-speech plus all morphological features) or part-of-speech only. The words of the original text were in the training data as well. A motivation for allowing the rules to refer to word forms is that some very frequent ambiguous words can be handled by word-specific rules. Another reason is that rules referring to specific word forms around the target word are applicable regardless of whether these words are unambiguously tagged or not. Below is an example of a rule which removes the verb reading (vb) from a target word if the first word to the right is a pronoun (pn) and the second word to the left is the verb vet ('know') in present tense and active voice (prs_akt).1

vb remove
   left 2: {vet, vb, prs_akt},
   right 1: pn.

The aim of this study was to investigate the feasibility of the method rather than to create a complete tagger. This meant that rules were learnt for only 45 of the 146 possible readings of the SUC tag set. Rules were induced for all of the part-of-speech categories, but not for all of the possible inflected forms. In other words, there might have been ambiguities that the induced rules could not disambiguate. For example: if there existed rules discarding any verbal reading (VB), but there were no rules discarding e.g. verbs in the infinitive form, active voice (VB INF AKT), a word assigned the ambiguity class {VB INF AKT, VB IMP AKT} would be impossible to correctly disambiguate whenever VB INF AKT is the wrong reading.

The above paragraph illustrates one of the shortcomings of this work, i.e. the fact that training data had to be generated for each of the possible tags of the corpus that one wanted to handle (some of the 146 different tags of the corpus were infrequent, but there still remained quite a number of possible readings). In other words, one had to run a large number of Progol processes in order to produce a grammar covering all of the ambiguities of the corpus. The test data consisted of 48,408 words. After lexicon look-up the words were assigned 105,828 tags, i.e., on average 2.19 tags per word. 46,942 words retained the correct tag after disambiguation, which means that the correct tag survived for 97.0% of the words. After disambiguation, 55,464 tags remained (1.15 tags per word). Given the simple approach and the fact that some shortcomings

1 This is not the actual rule format of the rules induced in Eineborg and Lindberg (1998). The rule has been translated into the same rule format that was used in Lindberg and Eineborg (1999) to make it easier to compare the rules.
were evident, the result was considered interesting enough to be followed up in subsequent studies.

4.2
Using Individual Morphological Features
In the second experiment (Lindberg & Eineborg, 1998) the background knowledge given to Progol was extended so that the access predicates could also pull out individual morphological features. However, from a linguistic perspective the background knowledge was still restricted. The rule below2 discards the verb reading (vb) in the imperative (imp) and active voice (akt) if the word to the left is att (which is the infinitive marker or a subordinating conjunction).

vb remove
   left 1: att,
   target: {imp, akt}.

The test data consisted of 42,925 words. After lexicon look-up the words were assigned 93,810 tags, i.e., on average 2.19 tags per word. 41,926 words retained the correct tag after disambiguation, which means that the correct tag survived for 97.7% of the words. After disambiguation, 48,691 tags remained (1.13 tags per word).

4.3
Using Feature Unification and a Richer Set of Background Knowledge
In the two experiments described above, rules discarding contextually incorrect readings, remove rules, were induced one part-of-speech category at a time, for all of the part-of-speech categories of the corpus. In Lindberg and Eineborg (1999), rules were only learnt for the two most frequent part-of-speech categories, the noun and the verb. However, this experiment presented a richer set of background predicates, as well as an extended set of rule types and a richer rule formalism. In other words, in the three experiments, the rules of the induced theories were gradually made more expressive, by making it possible for the learner to pull out more features, but also by adding a richer set of background knowledge predicates. For example, the rules could make use of feature unification (forcing the values of two features to unify). When creating the background knowledge, the ambition was to provide a rich enough background knowledge, which could be used to construct an interesting theory, while at the same time avoiding linguistic detail. For example, instead of using a noun phrase grammar, examples of noun phrases were manually collected. The words were deleted from the noun phrases, and the part-of-speech and morphological features were used as a noun phrase example database. An example of such a tag sequence can be found in Table 4. The reason for collecting

2 The rule has been translated into the same rule format that was used in Lindberg and Eineborg (1999).
examples instead of manually writing a grammar was that the latter is a very complex and time-consuming task, even for an expert. In addition to the noun phrase tag sequences, the linguistic data added to the background knowledge consisted of examples of auxiliary verbs, auxiliary verb sequences, verb chains, and sets of part-of-speech categories with similar distribution. Examples of these can also be seen in Table 4. An example of a rule that makes use of sets of part-of-speech categories can be seen below. The rule says that the word i (‘in’) cannot be a noun (nn) if the previous word is a preposition (pp) or a verb (vb).

nn remove left 1: {pos=pp or pos=vb}, target: token=i.

The set of rules was extended with two more rule types, select and barrier. In contrast to the remove rule, which specifies under which conditions a certain reading should be removed, the select rule specifies under which conditions a certain reading should be kept (and every other reading removed). Below is an example of a select rule which says that if the word to the left is alla (‘all’, ‘everyone’), the plural noun reading(s) should be selected and every other reading discarded.

nn select left 1: token=alla, target: num=plu.

A barrier rule is a remove rule with an extra condition that specifies a number of features that are not allowed to be present in the context. This makes it possible to write rules that cover a common pattern even though it has some irregularities (which would make an ordinary remove rule impossible). An example of a barrier rule can be seen below. The rule removes the noun (nn) reading of the word man (‘man’, ‘one’ (pronoun), ‘mane’) if there is a verb (vb) somewhere (*) to the left and no determiner (dt) or adjective (jj) occurs between the verb and the target word.

nn barrier left *: pos=vb, barrier: {dt,jj}, target: token=man.

In Swedish, adjectives agree in number (and gender) with the noun on which they are dependent. Agreement can easily be expressed using unification. The rule below forces the number (num) of the target word stora (‘big’, adjective, plural or singular) and the first word to the right to unify, and selects the agreeing adjective (jj) analysis of the target word.

jj select target: {token=stora, num=V}, right 1: num=V.
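In Prolog terms, the feature unification in the last rule amounts to sharing a variable between two feature descriptions. The sketch below is only illustrative: the predicates token/3 and reading/4 and the simplified feature notation are assumptions for this example, not the representation used in the actual system.

% Hypothetical rendering of the agreement rule: the shared variable Num
% forces the number feature of the adjective "stora" and of the first word
% to its right to unify, selecting the agreeing adjective (jj) reading.
select_reading(S, P, jj) :-
    token(S, P, stora),
    reading(S, P, jj, num=Num),
    R1 is P + 1,
    reading(S, R1, _, num=Num).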
Table 4. Examples of linguistic background knowledge

Ling. Knowledge     Prolog Code
NP tag sequence     np_tag_seq([t(pos=dt,gen=utr/neu,num=plu,def=def),
                      t(pos=pc,vform=prf,gen=utr/neu,num=plu,def=ind/def,case=gen),
                      t(pos=nn,gen=utr,num=sin,def=ind,case=nom)]).   % dt pc nn
auxiliary verb      aux(har).       %'have'
                    aux(kan).       %'can'
                    aux(skulle).    %'would'
                    aux(vill).      %'want to'
auxiliary verb seq  aux(skulle,kunna).   %'would be able to'
verb chain          vb_chain([token(skulle),token(kunna),
                      t(pos=vb,vform=inf,voice=akt)]).   %'would be able to...'
pos sets            pos_set([pp,vb]). pos_set([ab,jj,ps]).
                    pos_set([rg,nn,ro]). pos_set([dt,ab,ha]).
The above rule would for instance select the singular reading of stora in the phrase den stora fågeln (‘the big bird’), but select the plural reading of stora in de stora fåglarna (‘the big birds’). This study concentrated on the two most frequent POS categories of the training corpus, the verb and the noun, and the result was compared to the noun and verb rules in Lindberg and Eineborg (1998), presented in Sect. 4.2. The experiment showed that the new rules reached a recall of 99.4% (as compared to 97.6% for the noun and verb rules of the second study) and a slightly better precision too.
4.4 Comparing the Results
It is not a trivial task to compare different taggers—even when dealing with one and the same language. However, a preliminary test of two different taggers (Ridings, 1998) reports that the Brill tagger (Brill, 1994), also trained on SUC, tagged 96.9% of the words correctly, and Oliver Mason’s HMM-based QTag (Mason, 1997) got 96.3% on the same data. Yet another Markov model tagger, trained on the same corpus, albeit with a slightly modified tag set, tagged 96.4% of the words correctly (Carlberger & Kann, 1999). None of these taggers left ambiguities pending, and all handled unknown words. In the three studies by the current authors only known words were considered (i.e., words that were in the lexicon). An overview of the result can be found in Table 5.
Table 5. Summary of the work on inducing a Swedish tagger

Study                         Rules                    Window  POS         Background                           Recall  Ambig.
Eineborg and Lindberg (1998)  remove                   fixed   all         pos reading, word token              97.0%   1.15
Lindberg and Eineborg (1998)  remove                   fixed   all         any feature, word token              97.7%   1.13
Lindberg and Eineborg (1999)  remove, select, barrier  varied  noun, verb  linguistic, any feature, word token  99.4%

5 Related Work
The following subsections summarise related work on inducing CGs using ILP. In the first two studies, dealing with English and Slovene data, Progol was used, while in the third study, dealing with Hungarian, Progol as well as other machine learning techniques were used.

5.1 Using ILP to Learn Tag-Elimination Rules for English
Cussens (1997) was the first to induce CG rules using ILP, and describes a project in which CG-inspired rules for tagging English text were induced using Progol. The training examples (both positive and negative) were sampled from the POS-tagged 3 million word Wall Street Journal corpus, which was split into a 2/3 training and a 1/3 test set. A lexicon of 66,024 words was created from the corpus. Tags from a word's ambiguity class were deleted from the lexicon if a tag appeared in less than 5% of the cases. Constraints were learnt separately for each of the tags in the corpus. As part of its background knowledge, Progol had a small hand-crafted grammar. The background knowledge predicates looked at (sequences of) unique tags (and not e.g. word forms of the original text). In the final tagger, the rules were applied to the text after lexicon look-up. A hybrid approach, combining rules and statistics, was chosen; when no rules were applicable to an ambiguity class, the tag with the lowest lexical frequency was deleted. Given no unknown words (words not in the lexicon) and a tag set of 43 different tags, the system tagged 96.4% of the words correctly.
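The hybrid rule-plus-statistics application can be pictured as below. This is a reconstruction for illustration only: apply_rules/3 and least_frequent_tag/3 are hypothetical helpers standing in for the induced constraints and the lexical statistics, not predicates from Cussens (1997); select/3 is the standard list predicate.

% If some induced constraint fires, use its result; otherwise fall back on
% the lexicon and delete the tag with the lowest lexical frequency.
disambiguate(Word, Tags0, Tags) :-
    (   apply_rules(Word, Tags0, Tags1)
    ->  Tags = Tags1
    ;   least_frequent_tag(Word, Tags0, Worst),
        select(Worst, Tags0, Tags)
    ).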
5.2 Using ILP to Learn Tag-Elimination Rules for Slovene
In Cussens et al. (1999), tag elimination rules for Slovene were induced using Progol. Slovene is a highly inflectional Slavic language, with thousands of possible morpho-syntactical readings, making tagging a more difficult task than
for Germanic languages, such as English or Swedish. The induced rules identify contextually impossible readings of a target word, given unambiguously tagged context words. The examples were generated from a manually tagged corpus, and 99,261 positive and 81,805 negative examples were produced. Eight data sets were used, one for each of the part-of-speech categories of the corpus. The background knowledge consisted of access predicates that returned features of the morpho-syntactic descriptions (tags), e.g. gender, case, number, part-of-speech, or a complete tag. A simple noun phrase definition was given. Disagreement in gender, case and number could also be used. The accuracy of the resulting theory was 87.5%.

5.3 Learning to Choose Tags for Hungarian
In the study presented in Horváth et al. (1999), a number of machine learning systems (Progol, C4.5, AGLEARN, PHM, and RIBL) were trained on a Hungarian corpus. Hungarian has a very rich morphology and free word order. The original tag set of the Hungarian corpus had several thousand possible tags, and a reduced tag set of only 125 different tags was used instead. A lexicon was created from the corpus. Training data was created by looking up the words of the corpus text in the lexicon, and assigning to each word an ambiguity class. The training data consisted of tag sequences representing the original sentences (thus the actual word forms of the training text were not retained). The background knowledge differed between the systems: AGLEARN was given simple linguistic background knowledge used to recognise token groups and phrase structures, Progol was given access predicates to pull out tags in the context, and RIBL was given tags and sequences of two tags. Progol was used to induce “Choose” rules, which select the correct reading in an ambiguity class. Rules for each ambiguity class were learnt separately. In the training examples, the unique correct tags of the context words were used. The rules referred to specific ambiguity classes, and identified the tag to choose from this ambiguity class (note that this is different from the rules induced by the other authors referred to above). The induced rules could not refer to ambiguous context words. The best result was obtained by a combination of taggers, which reached an accuracy of 86.5%.
6 Discussion
This paper has given a short overview of work done to induce Constraint Grammar inspired part-of-speech taggers using Inductive Logic Programming. Common to all studies presented is that the rules were induced one part-of-speech category at a time using examples which were generated from a tagged corpus. The importance of using linguistic background knowledge has been recognised by several authors (Cussens, 1997, page 104; Horváth et al., 1999, page 138; Lindberg & Eineborg, 1999). ILP-based taggers have been developed for several different languages (English (Cussens, 1997), Hungarian (Horváth et al., 1999), Slovene (Cussens et al.
1999), Swedish (Eineborg & Lindberg, 1998; Lindberg & Eineborg, 1998, 1999), and Czech (Popelínský et al., 1999)). It is interesting that similar methods have been tried on different languages, since it is not obvious that a method successfully applied to one language is directly applicable to another one (e.g., see Megyesi (1999), where Brill's original rule templates were extended to better suit Hungarian). There are still problems which have been neglected. For instance, unknown words have not been dealt with properly, but they are an important issue in any working tagger. A topic which perhaps should have been dealt with more in depth is that of how the rules are applied by the tagger. By evaluating the rules against a tuning set, the accuracy of individual rules could be used by the tagging algorithm, so as always to choose the rule of the highest accuracy. Most of the rules described in this paper are so-called careful rules, i.e. they can only be applied if the context they are referring to is unambiguous. All ILP systems were trained on examples where the context words had been completely disambiguated. In other words, the training data does not reflect the actual input to a tagger. An illustration of how training data for remove rules might look is found in Table 1, where it can be seen that the context words of the positive and the negative examples are unambiguous. Ambiguous context is one of the reasons why rules may not be applicable. An attempt to lessen this problem was made in Lindberg and Eineborg (1999), where rules could refer to word forms (regardless of ambiguity class). Furthermore, a rule referring to a feature that is shared by every reading in an ambiguity class can also be applied even though the context is not disambiguated (see the sketch below). Due to large data sets the rules were induced one POS category at a time. Perhaps this means that some generalisations over sets of POS categories (which have features in common, or have a similar distribution) are lost.
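The observation about shared features can be made concrete with a small sketch. The predicate reading/3 and the list-of-features representation of a reading are assumptions for illustration, as is the SWI-Prolog-style forall/2.

% Succeeds if every remaining reading of the word at position P carries
% feature F, so a careful rule testing F may apply even though the word is
% still ambiguous.
shared_feature(S, P, F) :-
    findall(R, reading(S, P, R), Readings),
    Readings \== [],
    forall(member(R, Readings), memberchk(F, R)).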
References

1. Brill, E. (1994). Some advances in transformation-based part of speech tagging. In Proceedings of the Twelfth National Conference on Artificial Intelligence.
2. Carlberger, J. and Kann, V. (1999). Implementing an efficient part-of-speech tagger. In press. Available at http://www.nada.kth.se/theory/projects/granska/.
3. Cussens, J. (1997). Part of speech tagging using Progol. In Proceedings of the Seventh International Workshop on Inductive Logic Programming, 93-108, Prague, Czech Republic.
4. Cussens, J., Džeroski, S., and Erjavec, T. (1999). Morphosyntactic tagging of Slovene using Progol. In Proceedings of the Ninth International Workshop on Inductive Logic Programming, 68-79.
5. Cutting, D., Kupiec, J., Pedersen, J., and Sibun, P. (1992). A practical part-of-speech tagger. In Proceedings of the Third Conference on Applied Natural Language Processing, 133-140.
6. Eineborg, M. and Lindberg, N. (1998). Induction of Constraint Grammar-rules using Progol. In Proceedings of The Eighth International Conference on Inductive Logic Programming.
7. Ejerhed, E., Källgren, G., Wennstedt, O., and Åström, M. (1992). The Linguistic Annotation System of the Stockholm-Umeå Project. Department of General Linguistics, University of Umeå.
8. Horváth, T., Alexin, Z., Gyimóthy, T., and Wrobel, S. (1999). Application of different learning methods to Hungarian part-of-speech tagging. In Džeroski, S. and Flach, P., editors, Proceedings of the Ninth International Workshop on Inductive Logic Programming, 128-139.
9. Karlsson, F., Voutilainen, A., Heikkilä, J., and Anttila, A., editors (1995). Constraint Grammar: A language-independent system for parsing unrestricted text. Mouton de Gruyter, Berlin and New York.
10. Lindberg, N. and Eineborg, M. (1998). Learning Constraint Grammar-style disambiguation rules using Inductive Logic Programming. In Proceedings of the Seventeenth International Conference on Computational Linguistics and the Thirty-Sixth Annual Meeting of the Association for Computational Linguistics, volume II, 775-779.
11. Lindberg, N. and Eineborg, M. (1999). Improving part of speech disambiguation rules by adding linguistic knowledge. In Džeroski, S. and Flach, P., editors, Proceedings of the Ninth International Workshop on Inductive Logic Programming, 186-197.
12. Mason, O. (1997). QTAG - A portable probabilistic tagger. Corpus Research, The University of Birmingham, U.K.
13. Megyesi, B. (1999). Improving Brill's PoS tagger for an agglutinative language. In Joint Sigdat Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 21-22.
14. Muggleton, S. (1991). Inductive logic programming. New Generation Computing, 8(4):295-318.
15. Muggleton, S. (1995). Inverse entailment and Progol. New Generation Computing Journal, 13:245-286.
16. Popelínský, L., Pavelek, T., and Ptáčník, T. (1999). Towards disambiguation in Czech corpora. In Cussens, J., editor, Proceedings of the First Learning Language in Logic Workshop, 106-116.
17. Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, University of Pennsylvania.
18. Ridings, D. (1998). SUC and the Brill tagger. GU-ISS-98-1 (Research Reports from the Department of Swedish, Göteborg University).
19. Samuelsson, C., Tapanainen, P., and Voutilainen, A. (1996). Inducing Constraint Grammars. In Laurent, M. and de la Higuera, C., editors, Grammatical Inference: Learning Syntax from Sentences, 146-155. Springer-Verlag.
20. Tapanainen, P. (1996). The Constraint Grammar Parser CG-2. Department of General Linguistics, University of Helsinki.
21. Zavrel, J. and Daelemans, W. (1999). Recent advances in memory-based part-of-speech tagging. In VI Simposio Internacional de Comunicacion Social, 590-597.
Iterative Part-of-Speech Tagging

Alípio Jorge and Alneu de Andrade Lopes

LIACC - Laboratório de Inteligência Artificial e Ciências de Computadores, Universidade do Porto, Rua Campo Alegre 823, 4150 Porto, Portugal.
{amjorge,alneu}@ncc.up.pt
Abstract. Assigning a category to a given word (tagging) depends on the particular word and on the categories (tags) of neighboring words. A theory that is able to assign tags to a given text can naturally be viewed as a recursive logic program. This article describes how iterative induction, a technique that has been proven powerful in the synthesis of recursive logic programs, has been applied to the task of part-of-speech tagging. The main strategy consists of inducing a succession T1, T2, ..., Tn of theories, using in the induction of theory Ti all the previously induced theories. Each theory in the sequence may have lexical rules, context rules and hybrid ones. This iterative strategy is, to a large extent, independent of the inductive algorithm underneath. Here we consider one particular relational learning algorithm, CSC(RC), and we induce first order theories from positive examples and background knowledge that are able to successfully tag a relatively large corpus in Portuguese.
1 Introduction
The task of Part-of-Speech Tagging is to assign to each word in a given body of text an appropriate grammatical category like noun, article, ordinal number, etc., according to the role of the word in that particular context. These categories are called part-of-speech tags and may total a few tens, depending on the variants one considers for each particular category. The difficulty of this task lies in the fact that a given word may play different roles in different contexts. Although there is, for each word, a relatively small set of possible tags, for many words there is more than one tag. Words with a single possible assignable tag are dealt with by employing a simple lookup-table (dictionary). The way to solve the ambiguity for words with more than one possible tag is by considering the context of the word and possibly employing background knowledge. In our approach, we represent the text to be tagged as a set of facts. For example, the text “the car is red. I like the car.” is represented as

word(s1,1,the).
word(s1,2,car).
word(s1,3,is).
word(s1,4,red).
word(s1,5,'.').
word(s2,1,'I').
word(s2,2,like).
word(s2,3,the).
word(s2,4,car).
word(s2,5,'.').

where s1 and s2 are sentence labels and the second argument is the position of the word within the sentence. Punctuation marks such as ‘.’ are regarded as words. The task of tagging is to assert a set of facts that assign a tag to each word. For the given example one possible result is:

tag(s1,1,art).
tag(s1,2,noun).
tag(s1,3,v).
tag(s1,4,adj).
tag(s1,5,dot).
tag(s2,1,art).
tag(s2,2,v).
tag(s2,3,art).
tag(s2,4,noun).
tag(s2,5,dot).

In our approach we construct a theory that is interpreted as a decision list of first order rules.

tag(S,P,art)← word(S,P,the),!.
tag(S,P,art)←
    window(P,L1,L2,L3,L4,R1,R2,R3,R4),
    tag(S,R1,noun),tag(S,R2,verb),!.
tag(S,P,noun)←
    window(P,L1,L2,L3,L4,R1,R2,R3,R4),
    tag(S,L1,art),tag(S,R1,verb),!.
tag(S,P,art).

Here window/9 defines a window consisting of four positions left of P and four positions right of P.

window(P,L1,L2,L3,L4,R1,R2,R3,R4)←
    L1 is P-1, L2 is P-2, L3 is P-3, L4 is P-4,
    R1 is P+1, R2 is P+2, R3 is P+3, R4 is P+4.

Tagging one particular word (e.g. ‘like’ in position 2 of sentence s2) is done by answering the query

?- tag(s2,2,T).

This query may not terminate if we employ a straightforward top-down interpreter like the one used in Prolog. Some troublesome recursive calls may occur leading to
non-termination. We describe how we handle the interpretation of these recursive clauses in Section 7 (Algorithm 5 and Algorithm 6). In this article we use the technique of iterative induction to synthesize the recursive tagger. This technique has been proven powerful in the synthesis of recursive logic programs from sparse sets of examples (Jorge & Brazdil 96, Jorge 98). Here we show how it can be used in the task of part-of-speech tagging. In each iteration we use the novel first order algorithm CSC (Characterize, Select, Classify) in combination with the existing algorithm RC (Lopes & Brazdil 98).
2 The Inductive Task
The task of constructing such a theory is defined as follows.

Given:
Corpus C (predicate word/3),
A set of examples E that associate a tag to each word in the corpus (predicate tag/3),
A clausal language L,
Background knowledge BK,
Output: A theory T ∈ L such that T ∪ BK ∪ C ⊢ E

The above specification is over-constrained since it does not allow the theory T to fail on any of the examples. In practice we will be looking for a theory T that maximizes the expected success rate over an unseen set of words, where the success rate is the proportion of correct answers given by the theory.
3 The Language
To decide which tag should be assigned to a given word, we want to have rules that take into account either the word or the context or both. For that we will look for rules of the form

tag(S,P,T)←
    window(P,L1,L2,L3,L4,R1,R2,R3,R4)
    [,word(S,P,W)]
    [,tag(S,L1,TL1),tag(S,R1,TR1)
    [,tag(S,L2,TL2),tag(S,R2,TR2)
    [,tag(S,L3,TL3),tag(S,R3,TR3)
    [,tag(S,L4,TL4),tag(S,R4,TR4)]]]].
where [x] means that x is optional. The third arguments of the predicates word/3 and tag/3 are constants (ranging over words and tags respectively). We also considered other combinations of the same literals. To construct clauses in this recursive language we have to solve the following important problem. We need information about the context of each word. This information is available at induction time, but it is not available at classification time. To assign tags to an unseen text, we start only with the set of words to be tagged (predicate word/3). One common solution is to tag as many words as we can (for example the words that have no ambiguity) and then apply the recursive rules. After that we will have more tagged words and the recursive rules may be applied repeatedly until all words have been tagged. In other words, the theory is applied in layers. The key idea in this article is to employ the layer-by-layer strategy in induction. Although the use of a first order language is not strictly necessary to represent the learnt theories presented in this article, it has a few advantages. First of all, the approach proposed allows the use of background knowledge which can be provided as in other ILP systems. Second, the recursive nature of the rules is represented and exploited in a natural and elegant way. Finally, the use of a first order declarative bias language provides a powerful mechanism for focusing on particular regions of the search space.
4 The Iterative Induction Strategy
Learning recursive first order clauses is a difficult problem that has been tackled by several different approaches. The iterative approach we present here is to a large extent based on (Jorge 98) and (Jorge & Brazdil 96). Given a set of words, we start by inducing clauses that are able to determine the tag of some of those words without any context information. These will also be the first clauses to be applied in classification. They are the base clauses of the recursive definition we want to induce and are not recursive. These clauses are also used to enrich the background knowledge, thus enabling and/or facilitating the synthesis of recursive clauses in the following iterations. Having obtained this first layer of clauses, let us call it T1, we are able to classify (tag) some of the words in the text used for training. Using the answers given by this theory T1 we may induce some recursive context clauses thus obtaining theory T2. By iterating the process, we obtain a sequence of theories T1, T2, ..., Tn. The final theory is T = T1 ∪ T2 ∪ ... ∪ Tn. To induce each theory in the sequence we may apply a sort of covering strategy, by considering as training examples in iteration i only the ones that have not been covered by theories T1, ..., Ti-1. We stop when all the examples have been covered, or when we cannot find any clauses.
The construction of each theory T1, T2, ... is done by a learning algorithm. In this article we consider the algorithm CSC(RC). Example: Assume that the background knowledge includes the definition of the predicate word/3 (describing the text) and window/9. We now describe in more detail how a theory is produced by iterative induction.
Algorithm 1: Iterative Induction
Given
  Language L, examples E and background knowledge BK,
  Learning algorithm ALG(Set of Examples, Background knowledge theory)
Find
  A theory T in L
Algorithm:
  Uncovered ← E
  T ← ∅
  i ← 1
  Do
    Ti ← ALG(Uncovered, BK)
    T ← T ∪ Ti
    BK ← BK ∪ Ti
    Uncovered ← Uncovered − covered_examples(Ti)
    i ← i + 1
  Until covered_examples(Ti) = ∅

In iteration 1 non-recursive rules like the following are induced:

tag(A,B,adj)←
    window(A,B,L1,L2,L3,L4,R1,R2,R3,R4),
    word(A,B,portuguesa),!.
tag(A,B,n)←
    window(A,B,L1,L2,L3,L4,R1,R2,R3,R4),
    word(A,B,documento),!.

These rules are defined solely in terms of the background predicates word/3 and window/9. They do not depend on the context of the word to be tagged. Before proceeding to iteration 2 we add these rules to the background knowledge. In iteration 2, some words can be tagged using the rules induced in iteration 1. Therefore recursive rules like the following appear:
tag(A,B,art)←
    window(A,B,L1,L2,L3,L4,R1,R2,R3,R4),
    tag(A,L1,prep), tag(A,R1,n), tag(A,L2,n),
    tag(A,R2,virg), tag(A,L3,prep),!.
...
tag(A,B,art)←
    window(A,B,L1,L2,L3,L4,R1,R2,R3,R4),
    word(A,B,a), tag(A,R2,prep),
    tag(A,R3,n), tag(A,R4,prep),!.

The second rule shown above is defined in terms of the word to tag and the context. In this second iteration we also find many non-recursive rules. In subsequent iterations more clauses will appear until the stopping criterion is satisfied. In general, the total number of iterations depends on the data, the language, and the underlying learning algorithm employed.
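A minimal Prolog sketch of this layered loop (Algorithm 1 above) is given below. It is not the actual implementation: alg/3 stands for the underlying learner (here CSC(RC)) and covered_examples/4 for the coverage test; both, as well as the list-based representation of theories, are assumptions for illustration.

% Induce theory layers one at a time, adding each layer to the background
% knowledge and removing the examples it covers, until a layer covers
% nothing new (cf. Algorithm 1).
iterative_induction(Examples, BK, Theory) :-
    induce_layers(Examples, BK, [], Theory).

induce_layers(Uncovered, BK, Acc, Theory) :-
    alg(Uncovered, BK, Ti),                          % e.g. CSC(RC)
    append(Acc, Ti, Acc1),
    append(BK, Ti, BK1),
    covered_examples(Ti, BK1, Uncovered, Covered),
    (   Covered == []
    ->  Theory = Acc1
    ;   subtract(Uncovered, Covered, Rest),
        induce_layers(Rest, BK1, Acc1, Theory)
    ).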
5 The CSC Algorithm
CSC (Characterize, Select, Classify) is a new first order learning algorithm that learns from positive examples and background knowledge, and enables the use of declarative bias. This algorithm has two distinct learning phases (Algorithm 2). In the first one, all rules in the language that have confidence and support above given thresholds are produced. This is the characterization phase. It is akin to the discovery of association rules (Agrawal et al. 96) but using a first order language. In fact CSC uses the notions of support and confidence to eliminate potentially uninteresting rules. The second phase is selection. The set of all rules is sifted and sorted in order to obtain a decision list of first order rules that are able to classify. Here the selection is done by the algorithm RC described in the following section. We refer to the combination of CSC and RC as CSC(RC).

Algorithm 2: CSC
Given
  Language L, Examples E, Background knowledge BK,
  Minimal support MS, minimal confidence MC
Find
  A theory T in L
Algorithm
  ALLRULES ← Characterize(L,E,BK,MC,MS)
  T ← Select(ALLRULES)
The measures of support and confidence of a rule correspond to the notions used in the construction of association rules:

support( A→B ) = #{ true instances of A∧B }
confidence( A→B ) = #{ true instances of A∧B } / #{ true instances of A }

Algorithm 3 describes the construction of the set ALLRULES. The rules in this set are first order clauses belonging to the language L given as a parameter. This language can be defined by the user via one of two declarative bias formalisms. One is a DCG (definite clause grammar) and the other is clause templates. Both describe the set of acceptable clauses. The set ALLRULES can be very large. For the data set used here it reached more than thirteen thousand rules in the second iteration. Other iterations had smaller ALLRULES sets. The overall synthesis time, for all iterations and including the characterization and selection phases, took less than six hundred seconds. In general, the number of rules generated is linear in the number of examples, but it is very sensitive to the minimal support and the size of language L.

Algorithm 3: Characterize
Given
  Language L, Examples E, Background knowledge BK,
  Minimal support MS, minimal confidence MC
Find
  all rules A→B in L such that, relatively to E and BK,
  support( A→B ) ≥ MS, confidence( A→B ) ≥ MC
We call this set of rules ALLRULES. The selection is done by algorithm RC, which is described in the following section.
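As an illustration, support and confidence can be computed by counting provable instances of the rule's antecedent and consequent against the examples and background knowledge. The sketch below assumes an SWI-Prolog-style aggregate_all/3 and that the antecedent A and consequent B are callable goals; it is not the code used in CSC.

% support(A -> B): number of instances for which both A and B hold.
% confidence(A -> B): support(A -> B) divided by the number of instances
% for which A holds.
support(A, B, S) :-
    aggregate_all(count, (call(A), call(B)), S).

confidence(A, B, C) :-
    support(A, B, SAB),
    aggregate_all(count, call(A), SA),
    SA > 0,
    C is SAB / SA.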
6 The RC Algorithm
The RC (Rules and Cases) algorithm learns a logic program represented as an ordered list of clauses. It starts from a general rule (default) and selects, in successive iterations, more specific rules which minimize the error of the previously selected rule. General rules tend to incorrectly cover a large number of examples (negative examples). So, RC tries to minimize this incorrect coverage by selecting other more specific rules that correctly cover the examples. These new rules may themselves
incorrectly cover other examples. Therefore the process iterates until there are no more candidate rules. A detailed description of RC can be found in (Lopes & Brazdil 98). Here we give a short description of the algorithm. The RC algorithm receives as input the set ALLRULES. From these candidate rules we obtain the final theory in three main steps. Firstly, the candidate rules are organized, by the relation of generality, into a set of hierarchies (a forest). Secondly, the candidate rules are analyzed to establish their effective coverage, a measure explained below. Thirdly, the consistency of the candidate rules is analyzed. This analysis chooses, among the inconsistent rules, the best one and adds it to the theory. Algorithm RC is described as Algorithm 4. We now briefly describe each step of RC. After placing the candidate rules into the structure, the system assigns a certain level to each element in the forest. These levels reflect the generality relation between the candidate rules in each tree. The most general rules are assigned to level 1, their most general specializations to level 2, and so on. To determine the effective coverage of each rule, the levels are processed starting with the most specific level. For each rule R, the system looks for more general rules (in the next upper level) which can be applied to the same examples as R but give different outputs (they are considered inconsistent with R). The effective coverage of each of these more general rules is updated to the sum of its support and the support of R. This effective coverage gives us a criterion to choose, in the consistency analysis step, the best general rule among the candidates at each level. It estimates how many negative examples covered by the more general rule can possibly be corrected by more specific ones.

Algorithm 4: RC
Given
  A set of candidate rules: ALLRULES
Do
  Let T be the empty theory.
  Add each candidate rule to the existing structure (forest).
  Determine the effective coverage of each candidate rule.
  Perform the Consistency Analysis:
    Identify potentially redundant rules.
    While some non-redundant rule exists do:
      Select the rule with the largest effective coverage (among the most general candidate rules).
      Add the selected rule to the theory T (to the beginning of the ordered list of clauses).
      Eliminate inconsistent candidate rules.
      Mark descendants of eliminated rules as non-redundant.
  Output the final theory T.
The consistency analysis starts with the most general level. First, the system needs to identify all potentially redundant rules in the forest. These are all the candidate rules except the most general ones in each tree. They are marked as potentially redundant. To explain the selection of the best rule, suppose we are dealing with the inconsistent set of candidate rules in Table 1. Rule R1 states that the Portuguese word “a” is an article. Other rules define the same word as a preposition. RC chooses the rule with the largest effective coverage. According to this measure, the rule R1 becomes the default, and the rules R2, R3, R4, R5 its exceptions. These other rules are specializations that state the word “a” is a preposition if there is some particular tag in a given position, for instance a noun to its left (as in R2). This choice leads to an overall gain in terms of global positive coverage. The specialized rules can be seen as exceptions to the default rule that minimize its negative coverage. The effective coverage estimates the quality of the rule, considering the possibility of the existence of more specific rules in the final theory.

Table 1. A set of rules as sorted by RC.
R5: tag(S,P,prep)←
      window(P,L1,L2,L3,L4,R1,R2,R3,R4),
      word(S,P,a),tag(S,R1,vinf),!.
R4: tag(S,P,prep)←
      window(P,L1,L2,L3,L4,R1,R2,R3,R4),
      word(S,P,a),tag(S,R1,nc),!.
R3: tag(S,P,prep)←
      window(P,L1,L2,L3,L4,R1,R2,R3,R4),
      word(S,P,a),tag(S,L1,adj),!.
R2: tag(S,P,prep)←
      window(P,L1,L2,L3,L4,R1,R2,R3,R4),
      word(S,P,a),tag(S,L1,n),!.
R1: tag(S,P,art)← word(S,P,a).
7 Classification with Iteratively Induced Theories
In this section we describe how theories produced by iterative induction are interpreted. A theory produced by Algorithm RC is interpreted as a decision list by algorithm Classify (Algorithm 5). This interpretation is equivalent to using a Prolog interpreter on a set of clauses, each ending with a cut (!).

Algorithm 5: Classify
Given
  A set of examples E, with unknown classes
  Background knowledge BK,
  A theory T
Find
  A set of answers (facts representing the class) for the examples in E
For each example ε in E, look for the first clause A→B in T such that
  A is true in BK,
  There is a substitution θ such that εθ = B
and collect the answer εθ.

Theories induced iteratively are interpreted using an iterative classification algorithm (Algorithm 6). Suppose T = T1 ∪ ... ∪ Tn. We first classify as many examples as we can using algorithm Classify. Then we proceed to theory T2, classify, and so on until Tn.

Algorithm 6: Iteratively Classify
Given
  A set of examples E, with unknown classes
  Background knowledge BK,
  A theory T = T1 ∪ ... ∪ Tn
Find
  A set of answers (facts) for the examples in E
Answers ← ∅
Unanswered ← E
For i = 1 to n
  NewAnswers ← Classify(Unanswered, BK, T1 ∪ ... ∪ Ti)
  Unanswered ← Unanswered − {examples answered in this iteration}
  Answers ← Answers ∪ NewAnswers
  BK ← BK ∪ NewAnswers
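For concreteness, the layer-by-layer application of Algorithm 6 could look as follows in Prolog. This is a simplified sketch: classify_with/5, which applies the accumulated layers to the still-unanswered examples as in Algorithm 5 and returns the new answers together with the remaining examples, is assumed rather than taken from the paper.

% Apply theory layers in order; answers produced by earlier layers are added
% to the background knowledge so that later (recursive) layers can use them
% as context.
iteratively_classify(Layers, BK, Examples, Answers) :-
    it_classify(Layers, [], BK, Examples, [], Answers).

it_classify([], _SoFar, _BK, _Unanswered, Answers, Answers).
it_classify([Ti|Rest], SoFar, BK, Unanswered, Acc, Answers) :-
    append(SoFar, Ti, SoFar1),                           % T1 u ... u Ti
    classify_with(SoFar1, BK, Unanswered, New, Remaining),
    append(BK, New, BK1),
    append(Acc, New, Acc1),
    it_classify(Rest, SoFar1, BK1, Remaining, Acc1, Answers).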
8 Experiments with the Lusa Corpus
We used the described approach in a set of experiments with a given corpus. These preliminary results provide evidence that the use of the iterative strategy improves on the results obtained by the isolated algorithm. For these experiments we used a corpus of Portuguese text containing 150 sentences with more than 5000 words. This text was produced by the Portuguese news agency (Lusa) and includes short news about politics, sports, economy and other common matters. The corpus was manually tagged. The set of tags is {nco, vppser, conjcoord, trav, ppr, pr, vter, vgerser, pind, ppoa, conjsub, vinfser, vinfter, ch, virg, vger, dpto, pps, ord, nc, adv, v, vser, pd, np, vinf, vpp, prep, art, n, adj, par, pto}. The corpus was divided into a training set with the first 120 sentences, and a test set with the remaining 30 sentences. The theories were induced using the information in the training set only, and then we measured the success rate of the theories on both the training and the test sets. Note that the learning task we consider here starts with no dictionary. In fact, the dictionary is learned and is expressed as rules that will be part of the final theory produced. Table 2 shows the success rates obtained by three different algorithms. The Default algorithm assigns the most likely tag to each word, given the word. In case the word is unknown it assigns the most frequent tag overall. In general, this algorithm obtains quite good results in the part-of-speech tagging task when all words are known. Some existing taggers start from the results given by what we call here the default algorithm and improve on them by using rules that decide whether the default tag should be changed or not. The two-step algorithm corresponds to first inducing lexical rules that tag some of the words. These rules are obtained using CSC(RC) described earlier (in Sections 4 to 6). The rules induced in this first step correspond to words that have only one possible tag, or at least a tag that occurs 80% of the time. In other words, CSC(RC) generates rules like the following, with a confidence of at least 0.8.

tag(S,P,art)← word(S,P,a),!.

In the second step lexical and context rules, like the one below, are induced on the basis of known tags using CSC(RC) again. These rules are placed above the ones from the first iteration in the decision list.

tag(S,P,prep)←
    window(P,L1,L2,L3,L4,R1,R2,R3,R4),
    word(S,P,a),tag(S,L1,n),!.

In this experimental framework, tagging words of the test set is a hard task since approximately 30% of the words do not occur in the training set. This setting is referred to as “open dictionary”. The large number of unknown words explains the poor results of the Default algorithm, which assigns the same tag to all those words (the most frequent tag in the training set). In this setting, the inductive abilities of CSC(RC) become crucial. Results are discussed in the conclusions.
Table 2. Success rates on the Lusa corpus.

Algorithm           Test
Default             0.766
Two-step CSC(RC)    0.774
Iterative CSC(RC)   0.806
We now give some details about the synthesis of the theory associated with the result shown in the third line of Table 2. In iteration 1 a large number (more than 350) of lexical rules were induced. These rules are defined solely in terms of the word to be tagged. In our experiments the minimal confidence of the rules for each iteration was 0.8. The minimal support was 2. In iteration 2, some words had already been tagged. Therefore many recursive rules (about 200) appeared. In this second iteration we also found many strictly lexical rules. The iterative induction algorithm went through three more iterations (100 rules more). The number of induced clauses decreased in each iteration. The last clause in the theory is tag(A,B,n). As already stated, the total number of iterations depends on the data, the language, and the parameters (minimal confidence, support and selection algorithm). We should also mention that a possible-tag filter was used to obtain the results. This filter simply verifies whether the tag assigned to a word by a particular clause corresponds to an already observed word-tag combination. If that combination was not observed in the training data, the clause fails and resolution proceeds to the following clause. Otherwise, the answer is accepted. For unknown words, any tag is accepted. The filter is implemented by automatically adding a literal of the form possible_tag(S,P,T) to the end of each clause tag(S,P,T):-Body (see the sketch below). When this filter is not used, the success rates are slightly lower, but still higher than the default algorithm.
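The filter mechanism can be pictured as follows. This is only an illustrative sketch of what is described above: observed_tag/2 (the word-tag combinations seen in training) and the schematic body/3 literal are assumptions, not code from the paper.

% Each induced clause tag(S,P,T) :- Body gets the filter literal appended,
% so an answer is accepted only if the word/tag pair was observed in
% training; for unknown words any tag is accepted.
tag(S,P,T) :-
    body(S,P,T),                  % stands for the induced clause body
    possible_tag(S,P,T).

possible_tag(S,P,T) :-
    word(S,P,W),
    observed_tag(W,T), !.
possible_tag(S,P,_) :-
    word(S,P,W),
    \+ observed_tag(W,_).         % unknown word: any tag is possible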
9 Related Work
The system SKILit (Jorge 98, Jorge & Brazdil 96) used the technique of iterative induction to synthesize recursive logic programs from sparse sets of examples. The iterative induction algorithm we use here is largely based on the one used in SKILit. Many other ILP approaches to the task of part-of-speech tagging exist. The ones that are more directly related to our work are (Cussens 97) and (Dehaspe & De Raedt 97), where relational learning algorithms are employed in the induction of rule-based taggers. Cussens's approach consists in inducing rules that choose, for each word with more than one possible tag, the most appropriate tag to eliminate (disambiguation rules). More recently, Cussens et al. (1999) used the ILP system P-Progol to induce disambiguation rules to tag Slovene words. Lindberg and Eineborg (1999) used P-Progol to induce constraint grammars for tagging of Swedish words using linguistic
background knowledge. Horváth et al. (1999) tried different learning algorithms for tagging of Hungarian. One of the systems that obtained good results was RIBL, a relational instance-based learning system. Liu et al. (98) proposed a propositional learning algorithm that is similar in structure to CSC. The main differences are that CSC is relational and that it is used here in an iterative way, which was not the case with Liu and colleagues. The approach in (Dehaspe & De Raedt 97) also involved the use of association rules. Mooney (95) and Lopes & Brazdil (98) used first order decision lists in the problem of learning the past tense of English verbs.
10 Conclusion

In the experiments we carried out, the algorithm CSC(RC) obtains significantly better results when it is applied iteratively than when it is applied in two steps. The results obtained by the iterative approach are also much better than those of the default algorithm in an open dictionary setting (Table 2). The reason why the iterative approach worked well on this problem might be one (or both) of the following:
• The problem is recursive in nature. In that case it makes sense to first learn the base clauses (non-recursive) and add them to the background knowledge. After that we are able to induce successive layers of recursive clauses, enriching the background knowledge in each layer.
• The fact that, in each iteration, the inductive algorithm focuses on the examples that were not covered in previous iterations. This has the effect of first inducing a theory that captures frequent patterns. In the following iteration new frequent patterns arise, since the distribution of the examples has changed. In particular, CSC finds new rules and computes their confidence based on the remaining examples, which allows RC to select a more adequate set of rules. Other iterations continue this process until a homogeneous group of examples has been found or the language becomes inadequate to describe the patterns that may still exist. Notice that this argument applies also to non-recursive problems.
The iterative approach has another important advantage. The running time tends to get shorter from iteration to iteration. On the other hand, learning all the recursive rules at once takes a very long time. If in the first iteration a considerable number of examples is covered by non-recursive rules, the task is easier in the second iteration. The number of recursive rules that can be learned in the second iteration is also reduced due to the fact that they depend on the answers given by the first iteration. The same applies to other iterations. By observing the partial results of each theory T1, T2,... we notice that the last theory in the sequence is responsible for a large number of wrong answers. This is not surprising, since this theory is the one that tries to cover the remaining subset of uncovered examples. These examples escaped the learning efforts of previous iterations, which probably means that there are not many regularities that can be
captured from them. They include noise, rare exceptions or examples that cannot be expressed in the given language. One possible direction to improve the results is the application of Case-Based Reasoning on these residual examples. Preliminary experiments we have carried out indicate that the results may indeed improve with such an approach.
Acknowledgements

This work was partly sponsored by project ECO under Praxis XXI, FEDER, and Programa de Financiamento Plurianual de Unidades de I&D. The second author would also like to thank CNPq - Conselho Nacional de Desenvolvimento Científico e Tecnológico, Brazil. The Lusa corpus was kindly provided by Gabriel Pereira Lopes and his NLP group.
References

1. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Inkeri Verkamo, A. Fast Discovery of Association Rules. Advances in Knowledge Discovery and Data Mining: 307-328. 1996.
2. Cussens, J. Part of Speech Tagging Using Progol. In Inductive Logic Programming. Proceedings of the 7th International Workshop (ILP-97). LNAI 1297, p 93-108, Springer, 1997.
3. Cussens, J.; Dzeroski, S.; Erjavec, T.: Morphosyntactic Tagging of Slovene Using Progol. Proceedings of the 9th Int. Workshop on Inductive Logic Programming (ILP-99). Dzeroski, S. and Flach, P. (Eds). LNAI 1634, 1999.
4. Dehaspe, L., De Raedt, L. Mining Association Rules in Multiple Relations. Proceedings of the 7th International Workshop (ILP-97). LNAI 1297: 125-132. 1997.
5. Horváth, T.; Alexin, Z.; Gyimóthy, T.; Wrobel, S.: Application of Different Learning Methods to Hungarian Part-of-Speech Tagging. Proceedings of the 9th Int. Workshop on Inductive Logic Programming (ILP-99). Dzeroski, S. and Flach, P. (Eds). LNAI 1634, 1999.
6. Jorge, A. Iterative Induction of Logic Programs: an approach to logic program synthesis from incomplete specifications. Ph.D. thesis. University of Porto, 1998.
7. Jorge, A., Brazdil, P. Architecture for Iterative Learning of Recursive Definitions. Advances in Inductive Logic Programming, De Raedt, L. (Ed), IOS Press, 1996.
8. Lindberg, N.; Eineborg, M.: Improving Part-of-Speech Disambiguation Rules by Adding Linguistic Knowledge. Proceedings of the 9th Int. Workshop on Inductive Logic Programming (ILP-99). Dzeroski, S. and Flach, P. (Eds). LNAI 1634, 1999.
9. Liu, B., Hsu, W., Ma, Y. Integrating Classification and Association Rule Mining. In Proceedings of KDD 1998: 80-86. 1998.
10. Lopes, A., Brazdil, P. Redundant Covering with Global Evaluation in the RC1 Inductive Learner. Advances in Artificial Intelligence, 14th Brazilian Symposium on Artificial Intelligence, SBIA '98. LNAI 1515, Springer Verlag, 1998.
11. Lopes, A. Using Inductive Learning in Case-Based Reasoning (in Portuguese). Proceedings of the Brazilian Symposium on Artificial Intelligence - SBIA 96 (student session), 1996.
12. Mooney, R. J. Induction of First-order Decision Lists: Results on Learning the Past Tense of English Verbs. Journal of Artificial Intelligence Research 3, pp. 1-24, 1995.
DCG Induction Using MDL and Parsed Corpora

Miles Osborne

Alfa Informatica, Faculteit der Letteren, University of Groningen, Oude Kijk in ’t Jatstr. 26, Postbus 716, NL 9700 AS Groningen, The Netherlands
[email protected]
Abstract. We show how partial models of natural language syntax (manually written DCGs, with parameters estimated from a parsed corpus) can be automatically extended when trained upon raw text (using MDL). We also show how we can use a parsed corpus as an alternative constraint upon learning. Empirical evaluation suggests that a parsed corpus is more informative than an MDL-based prior. However, best results are achieved when the learner is supervised with a compression-based prior and a parsed corpus.
1 Introduction
The grammar learning problem can be specified as follows. Given a sequence of events s1 . . . sn produced by an unknown stochastic process Q, estimate another stochastic process, E, that models Q as closely as possible. Usually, the events will be sentences, and the process will generate these sentences through a series of choices. These choices are hidden from us, and result from a series of rule applications. (Stochastic) grammar learning consists of finding these rule applications and estimating their associated parameters. In a fully unsupervised setting, we would assume nothing about the class of processes that might generate our sentences, other than that it is computable. Clearly, this radical setting is not viable (for example, Kolmogorov (1965) has shown that learning of Turing Machines has no algorithmic solution). We therefore have to supervise our learner, and tell it which class of process to consider. However, this form of supervision is still not sufficiently constraining to guarantee accurate modelling. For example, in a non-statistical setting, when given a grammar class (but not the actual target grammar), Gold (1967) showed that not even regular languages could be identified in the limit on the basis of (positive) sentences alone; in a statistical setting, when the learner was given an actual model instance (but not told the parameters), Pereira and Schabes (1992) showed that raw text alone contained insufficient information to enable the learner to assign linguistically plausible derivations to sentences observed (our definition of accurate modelling). There are (at least) two reasons why learning of a model, given just raw text alone and minimal supervision, might produce poor results. The search space usually contains numerous (local) maxima, and the chance of finding the ‘best’ one, unaided, is extremely small (Carroll & Charniak, 1992). Secondly, there
is no guarantee that maximising the posterior probability of a model from an arbitrary model class will simultaneously maximise the utility of that model in some application. Indeed, as Abney argues, the reverse is more likely to occur: the process of maximising a model instance, from some class of models that poorly approximates statistical dependencies in natural languages, systematically makes that model less linguistically plausible (Abney, 1997). In the context of using the Expectation-Maximisation algorithm to classify documents, Nigam, McCallum, Thrun, and Mitchell (1998) reported similar findings. All is not lost however, and there are two complementary ways we can increase our chance of estimating a linguistically useful model: we could assume the unknown process contained rules drawn from some family of grammars that can, on the basis of positive-only examples, be identified in the limit. (Equivalently, we can select some model class whose maximal a posteriori model coincides with the best linguistic model.) Or, we could increase the degree of knowledge our learner has about the unknown process: we could bias the search. The first approach amounts to structuring the search space such that straightforward maximisation takes us to the desired model. Unfortunately, there is no evidence that natural languages fall into the class of known grammar families, such as grammars containing at most k rules, k-reversible grammars, etc. (Shinohara, 1990; Sakakibara, 1992; Angluin, 1982), that can be identified in the limit using minimal supervision. Furthermore, we do not know enough about the statistical properties of natural languages for us to design a model class whose global optimum coincided with the best performance in some domain. The second approach (biasing search) assumes less about the relationship between the search space and model utility, and in lieu of a good, constrained model class, is the method of choice. In this paper, we consider the links between the following sources of supervision:
– Parsed corpora.
– Minimum Description Length (MDL).
A parsed corpus, in principle, can fully specify the stochastic process (and so means we no longer have to deal with hidden variables), and learning constrained by it frees us from having to design accurate model classes. It can focus attention upon subspaces of possible model parameterisations whose maximum correlates with the best performance at a given task (Pereira & Schabes, 1992). Numerous studies have demonstrated the utility of parsed corpora as a learning constraint (for example, (Black, Jelinek, Lafferty, & Magerman, 1993b; Magerman, 1994; Collins, 1996)). However, in practice, there are problems with available parsed corpora: they are limited in quantity (thus do not cover every construct in any given natural language). In addition, they often only partially specify derivations (for example, Noun Phrases in the parsed Wall Street Journal (WSJ) are often internally unanalysed). Finally, they are relatively expensive to produce. Now, learning using a parsed corpus is frequently based upon some form of Maximum Likelihood Estimation (MLE). MLE is unsupervised in the sense that it assumes an uninformative prior. As is well known, when training material is limited, MLE tends to suffer from overfitting, and this will lead to suboptimal performance. It
is therefore frequently necessary to smooth ML-estimated models. However, the majority of smoothing techniques are arguably ad hoc and difficult to understand (MacKay, 1994). By contrast, learning using the Minimum Description Length principle (MDL) is less prone to overfitting, less reliant upon smoothing, and as such usually leads to better results than those yielded by MLE (for example, see (Stolcke, 1994; Chen, 1996; de Marcken, 1996; Osborne & Briscoe, 1997)). MDL is a model selection technique, and balances the complexity of the model against its degree of fit. MLE, on the other hand, simply fits the model to the data, irrespective of model complexity. Unlike parsed corpora, MDL as a learning bias is complete (in the sense that it acts over the entire space of models). Here, we show how partial models of natural language syntax (manually written Definite Clause Grammars (DCGs), with parameters estimated from parsed corpora) can be automatically extended when trained upon raw text (using MDL). This constitutes a weak form of supervision. We also give results using a parsed corpus as an additional constraint upon MDL-based learning. This more strongly supervises learning. The structure of the rest of this paper is as follows. In Section 2 we give a brief introduction to MDL. Following this, in Section 3, we outline our MDL-based DCG learner. We then (Section 4) give some empirical evaluation showing it in action. Finally, we conclude with a discussion.
2 The Minimum Description Length Principle
Learning can be viewed as significant compression of the training data in terms of a compact hypothesis. It can be shown that, under very general assumptions, the hypothesis with the minimal, or nearly minimal complexity, which is consistent with the training data, will with high probability predict future observations well (Blumer, Ehrenfeucht, Haussler, & Warmuth, 1987). One way of finding a good hypothesis is to use a prior that favours hypotheses that are consistent with the training data, but have minimal complexity. That is, the prior should be construed in terms of how well the hypothesis can be compressed (since significant compression is equivalent to a low stochastic complexity). We can compress the hypothesis by transforming it into a (prefix) code, such that when measured in bits of information, the total length of the encoding is less than, or equal to, the length of the hypothesis, also when measured in bits. Let l(H) be the total length of the code words for some set of objects H, as assigned by some optimal coding scheme. It turns out that

    ∑_{x ∈ H*} 2^{−l(x)} ≤ 1        (1)

the Kraft Inequality, where H* is the space of possible hypotheses, can be used as a prior probability for H. The smaller l(H), the greater the compression, and so the higher the prior probability. There is an equivalence between description lengths, as measured in bits, and probabilities: the Shannon Complexity of some object x, with probability
P(x), is −log(P(x)) (all logarithms are to the base 2). This gives the minimal number of bits required to encode some object. Hence, we can give a description length to both the prior and likelihood probabilities (Section 3.1 describes our prior and likelihood probabilities). Using these description lengths yields the MDL Principle (Rissanen, 1989): we should select an hypothesis H that:
– Minimises the length of the hypothesis (when measured in bits) and
– Minimises the length of the data encoded in the hypothesis (measured in bits).
The first part says prefer hypotheses that are compact; the second part says prefer hypotheses that fit the data well. Both aspects of a theory are taken into consideration to arrive at a proper balance between overly favouring a compact hypothesis (which might model the training data badly) and overly favouring the likelihood probability (which sometimes leads to overfitting: too much concentration of the likelihood probability mass upon an unrepresentative training set). MDL-based learning has the following properties:
– Asymptotically, it gives the same results as MLE.
– Otherwise, it usually overfits less than MLE. MDL can be viewed as biased learning (Schaffer, 1993), and so cannot be guaranteed to be appropriate for all situations. In practice however, MDL never leads to dramatically poor results.
– Convergence is (usually) faster than that of MLE (Rissanen, 1989).
– As an approximation to stochastic complexity, the size of the estimated model provides insight into how well the model class is able to capture facts about the target hypothesis. In particular, if we select another model class and find a more compact model, we can deduce that the second model class is more appropriate for the task in hand.
Together, the first three properties mean that MDL is ideally suited to those learning situations when training material is limited. In our context, this means we can be less profligate in our use of parsed corpora. The final property is of value when assessing the utility of model classes.
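Stated compactly, and using only the description-length notation already introduced (l(D | H) denotes the length of the data when encoded with the help of H; this two-part form is a standard paraphrase of the principle, not a formula taken from the paper):

    H_MDL = argmin_H [ l(H) + l(D | H) ]

Since a description length of l(·) bits corresponds to a probability of 2^{−l(·)}, minimising this sum is equivalent to maximising P(H) · P(D | H).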
3 Implementation
Here, we briefly describe our learner. Fuller details can be found in (Osborne, 1999). In outline, we:

1. Start with a DCG language model M0 and a sequence of sentences s1 . . . sk . . . sn.
2. Process each sentence sk.
3. If sk cannot be generated by M0:
   (a) Build a set of models Mi . . ., all of which can generate the sentences s1 . . . sk.
   (b) Eliminate any model in the set just built that produces a parse for sk incompatible with a manually created parse for sk.
   (c) Out of the remaining models, pick the one with the highest posterior probability, M1.
   (d) Replace M0 with M1.
4. If there are more sentences to process, go to step 2; else terminate with a model MN.

DCGs in our approach are modelled in terms of a compression-based, MDL-style prior probability and a Stochastic Context Free Grammar (SCFG)-based likelihood probability (we present modelling details later in this section). The prior assigns high probability to compact models, and low probabilities to verbose, idiomatic models. As such, it favours simple grammars over more complex possibilities. The likelihood probability describes how well we can encode the training set in terms of the model. We realise that SCFGs are a suboptimal model class; we comment upon this in the final section of this paper.

Candidate models are built by adding new rules to the current model and re-normalising probabilities accordingly. (The likelihood probabilities are estimated by counting the number of times a rule was seen in derivations of the previous q sentences; usually q is fixed to make computation tractable. The description lengths of the rule parameters are updated based upon these counts.) New rules are induced by inspecting the remains of a chart after failing to parse some sentence. To this chart, new inactive edges are added, and these new edges extend other inactive edges that could not be extended using rules in the grammar. Rules are then created out of these new edges if and only if the new edges added to the chart contribute to a full parse for the sentence. The parser packs the chart and experimentally is capable of recovering the n most likely (in terms of a modified SCFG) parses in quadratic time (Carroll, 1994). Our learner is integrated into the Alvey Natural Language Toolkit (ANLT) (Carroll, Grover, Briscoe, & Boguraev, 1991) and will be part of the next release.

When we use parsed corpora as a constraint, we do not create new edges that cross the bracketings in an associated manually produced parse. We do not add to the chart a new edge spanning the vertices i to j, i < j, if there exists a bracketing in the parse spanning from either i − k to i + l, or i + m to j + n, where i + l < j or i + m < j. See Pereira and Schabes (1992) for further details of how learning might be constrained by parsed corpora.
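The crossing-brackets constraint can be stated compactly. The sketch below is our own paraphrase of that check (function names are illustrative), with chart edges and treebank brackets represented as (start, end) vertex pairs:

```python
def crosses(edge, bracket):
    """True if a chart edge spanning vertices (i, j) partially overlaps a
    bracket (a, b) from the treebank parse: neither span contains the other
    and the two spans are not disjoint."""
    i, j = edge
    a, b = bracket
    return (i < a < j < b) or (a < i < b < j)

def edge_allowed(edge, brackets):
    """Only add an edge to the chart if it crosses no treebank bracket."""
    return not any(crosses(edge, br) for br in brackets)
```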
3.1 Modelling Details
Here we outline how we compute likelihood and prior probabilities for DCGs.

Likelihood Probability. To specify a likelihood probability for DCGs, we have opted to use a SCFG, which consists of a set of context free grammar rules along with an associated set of parameters (Booth, 1969). Each parameter models the
way we might expand non-terminals in a top-down derivation process, and within a SCFG, we associate one such parameter with each distinct context free rule. However, DCG rules are feature-based, and so not directly equivalent to simple context free rules. In order to define a SCFG over DCG rules, we need to interpret them in a context-free manner. One way to achieve this is as follows. For each category in the grammar that is distinct in terms of features, invent an atomic non-terminal symbol. With these atomic symbols, create a SCFG by mapping each category in a DCG rule to an atomic symbol, yielding a context free (backbone) grammar, and with this grammar, specify a SCFG, Mi. Naturally, this is not the most accurate probabilistic model for feature-based grammars, but for the interim, is sufficient (see Abney (1997) for a good discussion of how one might define a more accurate probabilistic model for feature-based grammars).

SCFGs are standardly defined as follows. Let P(A → α | A) be the probability of expanding (backbone) non-terminal symbol A with the (backbone) rule A → α when deriving some sentence si. The probability of the jth derivation of si is defined as the product of the probabilities of all backbone rules used in that derivation. That is, if derivation j followed from an application of the rules A_1^j → α_1^j, . . . , A_n^j → α_n^j, then

   P_j^deriv(si | Mi) = Π_{k=1}^{n} P(A_k^j → α_k^j).    (2)

The probability of a sentence is then defined as the sum of the probabilities of all m ways we can derive it:

   P_s(si | Mi) = Σ_{j=1}^{m} P_j^deriv(si | Mi).    (3)
Having modelled DCGs as SCFGs, we can immediately specify the likelihood probability of Mi generating a sample of sentences s0 . . . sq as:

   P(s0 . . . sq | Mi) = Π_{k=0}^{q} P_s(sk | Mi).    (4)
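As a concrete reading of equations (2)-(4), the following sketch (our own illustration, not the ANLT implementation) computes a derivation probability as a product of backbone-rule probabilities, a sentence probability as a sum over derivations, and the log-likelihood of a sample as a sum over sentences:

```python
import math

def derivation_prob(rules_used, rule_prob):
    """Equation (2): product of the probabilities of the backbone rules
    used in one derivation; rules_used is a list of (lhs, rhs) pairs."""
    return math.prod(rule_prob[r] for r in rules_used)

def sentence_prob(derivations, rule_prob):
    """Equation (3): sum over all derivations of the sentence."""
    return sum(derivation_prob(d, rule_prob) for d in derivations)

def sample_log_likelihood(sentences, rule_prob):
    """Equation (4), in log space: sentences is a list of derivation lists,
    one per sentence; sentences are treated as independently generated."""
    return sum(math.log(sentence_prob(d, rule_prob)) for d in sentences)
```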
This treats each sentence as being independently generated from each other sentence. Prior Probability. Specifying a prior for DCGs amounts to encoding the rules and the associated parameters. We encode DCG rules in terms of an integer giving the length, in categories, of the rule (requiring log∗ (n) bits, where log∗ is Rissanen’s encoding scheme for integers), and a list of that many encoded categories. Each category consists of a list of features, drawn from a finite set of features, and to each feature there is a value. In general, each feature will have a separate set of possible values. Within manually written DCGs, the way a feature is assigned a value is sensitive to the position, in a rule, of the category
containing the feature in question. Hence, if we number the categories of a rule, we can work out the probability that a particular feature, in a given category, will take a certain value. Let P(v | fi) be the probability that feature f takes the value v in category i of all rules in the grammar (this probability is estimated by counting the number of times a given feature takes some value, in a given category, within the manually written grammar mentioned in Section 4). Each value can now be encoded with a prefix code of −log(P(v | fi)) bits in length. Encoding a category simply amounts to a (fixed length) sequence of such encoded features, assuming some canonical ordering upon features. Note we do not learn lexical entries and so do not need to encode them.

To encode the model parameters, we simply use Rissanen's prefix coding scheme for integers to encode a rule's frequency. We do not directly encode probabilities, since these will be inaccurate when the frequency used to estimate that probability is low. Rissanen's scheme has the property that small integers are assigned shorter codes than larger integers. In our context, this will favour low frequencies over higher ones, which is undesirable, given the fact that we want, for learning accuracy, to favour higher frequencies. Hence, instead of encoding an integer i in log*(i) bits (as, for example, Keller and Lutz (1997) roughly do), we encode it in log*(Z − i) bits, where Z is a number larger than any frequency. This will mean that higher frequencies are assigned shorter code words, as intended.

The prior probability of a model Mi, containing a DCG G and an associated parameter set, is:

   P(Mi) = 2^{−(l_g(Mi) + l_p(Mi))} + C    (5)

where

   l_g(Mi) = Σ_{r∈G} [ log*(|r|) + Σ_{i=1}^{|r|} Σ_{f∈F} −log(P(v | fi)) ]    (6)

is the description length of the grammar and

   l_p(Mi) = Σ_{r∈G} log*(Z − f(r))    (7)

is the description length of the parameters. C is a constant ensuring that the prior sums to one; F is the set of features used to describe categories; |r| is the length of a DCG rule r seen f(r) times.

Apart from being a prior over DCG rules, our scheme has the pleasing property that it assigns longer code words to rules containing categories in unlikely positions than to rules containing categories in expected positions. For example, our scheme would assign a longer list of code words to the categories expressing a rule such as Det → Det NP than to the list of categories expressing a rule such as NP → Det NP. Also, our coding scheme favours shorter rules over longer rules, which is desirable, given the fact that, generally speaking, rules in natural language grammars tend to be short.
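To illustrate the coding scheme, the sketch below (our own illustration under the assumptions stated in the comments, not the actual implementation) computes the grammar and parameter description lengths of equations (6) and (7), using Rissanen's log* code for integers and the positional feature-value distributions P(v | fi):

```python
import math

def log_star(n):
    """Rissanen's universal code length (in bits) for a positive integer:
    roughly log2(n) + log2(log2(n)) + ... over the positive terms,
    plus a normalising constant."""
    bits, x = math.log2(2.865), float(n)
    while x > 1.0:
        x = math.log2(x)
        bits += max(x, 0.0)
    return bits

def rule_description_length(rule, feature_value_logprob):
    """Equation (6), per rule: log*(|r|) bits for the rule length plus, for
    each category position i and feature f taking value v, -log2 P(v | f, i)
    bits; feature_value_logprob[(i, f, v)] holds log2 P(v | f, i), estimated
    from the manually written grammar."""
    bits = log_star(len(rule))
    for i, category in enumerate(rule, start=1):   # category: {feature: value}
        for f, v in category.items():
            bits += -feature_value_logprob[(i, f, v)]
    return bits

def parameter_description_length(rule_freqs, Z):
    """Equation (7): encode each rule frequency as log*(Z - f(r)), so that
    higher frequencies get shorter codes; Z exceeds every frequency."""
    return sum(log_star(Z - f) for f in rule_freqs)
```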
3.2 Related Research
Our approach is closely related to Stolcke’s model merging work (Stolcke & Omohundro, 1994). Apart from differences in prior and likelihood computation, the main divergence is that our work is motivated by the need to deal with undergeneration in broad-coverage, manually written natural language grammars (for example (Grover, Briscoe, Carroll, & Boguraev, 1993)). Although we do not go into the issues here, learning of rules missing from such grammars is different from estimating grammars ab initio. This is because rules missing from any realistic grammar are all likely to have a low frequency in any given corpus, and so will be harder to differentiate from competing, incorrect rules purely on the basis of statistical properties alone. We know of no other work reporting automated extension of broad-coverage grammars using MDL and parsed corpora.
4 Empirical Evaluation
In order to determine whether learning using parsed corpora was improved by MDL, we ran a series of experiments. We started with a broad coverage grammar, called the Tag Sequence Grammar (TSG) (Briscoe & Carroll, 1995), which when compiled consisted of 455 (generalised with Kleene operators) DCG rules. An example TSG rule is given in Table 1.
Table 1. Example ANLT rule
[N -, V +, BAR 2, PLU X6, WH -, AUX X, VFORM X21, INV -, FIN +, CONJ -, SCOLON -, COLON X, DASH X, TA X, BAL X, BRACK X, COMMA X, TXTCAT UNIT, TXT Cl]
  →
[N +, V -, BAR 2, PLU X6, POSS -, NTYPE X, WH -, MOD X, CONJ -, SCOLON -, COLON X, DASH X, TA X, BAL X, BRACK X, COMMA X, TXTCAT UNIT, TXT PH]
[N -, V +, BAR 1, PLU X6, MOD X, AUX X, VFORM X21, FIN +, CONJ -, SCOLON -]
This rule is equivalent to the more familiar rule S → NP VP. The rule consists of three categories (one for each non-terminal category), and each category consists of a list of features and their associated values. For example, we see that in the first category, the feature N has a value −. Variables are of the form X1 . . . Xn and the variable X is anonymous. The reader unfamiliar with our
formalism should consult the ANLT documentation (which is available online at http://www.cl.cam.ac.uk/Research/NL/anlt.html) for further details. For the experiments reported here, we extended TSG with four extra rules. These extra rules dealt with obvious oversights when parsing the WSJ. TSG did not parse sequences of words directly, but instead assigned derivations to sequences of part-of-speech tags (using the CLAWS2 tagset (Black, Garside, & Leech, 1993a)). For training and testing material, we extracted from the parsed section of the WSJ 10,249 sentences (maximum length 15 tokens) and 739 sentences (maximum length 30 tokens) respectively. Sentences in the testing set were not present in the training set, and were randomly sampled from the WSJ. As is usual in machine learning settings, the training set was used to estimate models, and the testing set to evaluate them. M0, the initial model, consisted of TSG and parameters estimated by counting TSG rule applications in a parsed corpus annotated in the TSG format. We automatically created this corpus by using TSG to parse all sentences in the WSJ (maximum length 30 tokens) not present in the testing set. At this stage, we did not induce any new rules. Whenever we managed to parse a sentence, we ranked the parses using a tree similarity metric (Hektoen, 1997), and recorded the TSG parse that was closest to the WSJ parse.

Models were evaluated in terms of coverage, unlabelled crossing rates, recall and precision (Harrison, Abney, Black, Flickinger, Gdaniec, Hindle, Ingria, Marcus, Santorini, & Strzalkowski, 1991). These are standard metrics used when evaluating parsers and, broadly speaking, measure how close structurally a candidate parse tree is to a reference parse tree. Labelling refers to the identity of internal nodes. We did not attempt to measure labelled bracketing performance as TSG does not use the Penn Nonterminal set. Additionally, we also estimated how many parses each grammar assigned to a set of sentences. Considering ambiguity in this manner gives a handle on how hard it is to recover the correct parse out of all parses produced. All things being equal, we would expect more ambiguous grammars to reveal the deficiencies of the parse selection mechanism more clearly than less ambiguous grammars. As a metric, we used Briscoe and Carroll's average parse base (APB), which is defined as the geometric mean of p^{1/n}, where n is the length of a given sentence, in a corpus of sentences, that has p parses (Briscoe & Carroll, 1995). The APB to the power of n has an interpretation as the expected number of parses for a sentence of length n; the higher the APB value, the more ambiguous the grammar. When measuring APBs of grammars, for computational reasons, we could only parse short sentences (less than or equal to 6 tokens in length) and recover at most 250 parses per sentence. Our APB results therefore underestimate the true ambiguity of the various grammars. Nevertheless, relative differences will reveal information about differences between the various grammars. For data, we used all parsed sentences in the Wall Street Journal that were at most 6 tokens long.

Finally, because the accent of this research is upon grammar learning (and not parse selection), we did not attempt to equip our learner with the best parse
ranking method possible. In particular, we ranked parses in terms of an unlexicalised SCFG-variant, and recorded in the results table (Table 3) derivation accuracy in terms of the single highest ranked parse. (Our SCFG-variant was based upon the probability of a SCFG rule a expanding some non-terminal and the probability of some other SCFG rule b, given a, expanding one of the non-terminals of rule a. The parameters of these rules were estimated by counting rule applications in a small disjoint corpus.) As is well known, this is not a good way to rank parses. We therefore additionally report results (in parentheses) that we might expect if we had a perfect selection mechanism. This was simulated by re-ranking the top 10 parses produced by estimated models with respect to a WSJ treebank parse and then selecting the highest re-ranked parse. We used the same tree similarity metric previously mentioned when initialising the model for TSG.

We ran the experiments summarised in Table 2. By a uniform prior, we mean a prior that assigns equal probability to any model in the hypothesis space. As a comparison, we evaluated the initial model M0 before any subsequent extension. This is experiment 5 in the tables.

Table 2. Experiments run, varying the training material and prior probability

Experiment  Training annotation  Prior
1           Raw sentences        Uniform
2           Parsed corpora       Uniform
3           Raw sentences        MDL-based
4           Parsed corpora       MDL-based
5           None                 N/A
Table 3 shows the results we obtained using the WSJ testing sentences. The first column gives the experiment, the second column the number of rules in the final model, and the third column reports the percentage of testing sentences covered by the model.

Table 3. Coverage results when tested on WSJ material

Experiment  Size  % Generated
1           3548  77
2           3617  87
3           2687  78
4           2592  90
5            459  63
Table 4 shows crossing rates, recall and precision results for the various experiments.
Table 4. Parse selection results when tested on WSJ material

Experiment  Crossing rates  Recall        Precision
1           1.73 (1.10)     51.8 (58.7)   67.5 (75.6)
2           1.98 (1.30)     50.3 (57.8)   64.1 (72.8)
3           1.79 (1.16)     51.2 (58.4)   66.7 (75.1)
4           2.07 (1.37)     49.5 (57.2)   62.8 (72.0)
5           1.61 (1.00)     52.4 (59.3)   70.1 (78.1)
Table 5 presents the APB results. The first column gives the actual value, whilst the second one gives the number of sentences covered. Note we forced the parser to timeout (fail to return a parse should it spend too much time processing a sentence), so the number of sentences covered is not simply the sum of the relevant length training and testing set sentences. In general, more ambiguous grammars lead to more timeouts than less ambiguous grammars.

Table 5. APB results when tested on all WSJ material

Experiment  APB  Number covered
1           2.2  2317
2           3.0  1660
3           2.4  2287
4           3.2  1565
5           1.2  1439
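The APB figures above are geometric means of the per-sentence values p^{1/n}. A possible computation (illustrative only, not the code used for these experiments) is:

```python
import math

def average_parse_base(sentences):
    """sentences: list of (n_tokens, n_parses) pairs for covered sentences.
    APB is the geometric mean of p**(1/n); APB**n then estimates the
    expected number of parses for a sentence of length n."""
    logs = [math.log(p) / n for n, p in sentences if p > 0]
    return math.exp(sum(logs) / len(logs))
```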
As can be seen (Table 3), parsed corpora guide the system towards the acquisition of learnt rules that generalise better (have a higher coverage) than rules acquired using raw text alone. Although the differences are smaller, an informative, compression-based prior also leads to more general rules than are found using an uninformative prior. Furthermore, we see that parsed corpora can be used in conjunction with a compression-based prior: the benefits are additive. The APB results (Table 5) suggest that the cost of this generalisation is increased ambiguity, and so a reduced chance of selecting the correct parse from the set of competing ones (Table 4). Alternatively, we can view this ambiguity increase as a greater reliance upon the parse selection mechanism, which in our case is known to be deficient. Increased lexicalisation should allow us to estimate models that enjoy both high generalisation and competitive parse selection accuracy. The parse selection results (Table 4) are probably too poor for semantic interpretation, but the estimated grammar might be useful for situations when only a rough indication of syntactic structure is required.

Generally speaking, we see that as we increase the amount of supervision, the amount of time necessary to recover all parses increases. This is shown by the reduction in the number of sentences covered by the various grammars (the final
column in Table 5). We conjecture that this is because our learning framework takes no account of how hard it is for the decoder to recover the encoded material. Future work should take account of this tendency (for example, learning using some form of resource-bounded MDL; Li and Vitányi (1997) call this resource-bounded complexity), for, if pushed to the limit, it would preclude practical application of the learnt grammars.

One interesting observation is that (ignoring the TSG case) compact models do not always appear to give the best prediction. For example, the model produced by MDL-based learning with raw text is smaller than the one produced by learning with a uniform prior and parsed corpora. However, it gives lower (worse) coverage results. This appears to contradict the claims of MDL. A more careful consideration shows that this is not the case. Models produced using parsed corpora (in a sense) encode the corpora and the training set, whilst models estimated from raw text alone encode just the sentences. When viewed in this way, we see that, when the composition of the training material is properly considered, as predicted by MDL, more compact models do give better results.
5 Discussion
We argued that supervision is necessary for practical grammar learning, and that increasing the amount of it yields better results. In particular, we advocated using MDL (which is cheap, but weak) as a way of extracting more information from parsed corpora (which is expensive, but strong). Empirical results suggested that this is indeed the case. Future work will follow in three main directions:

– Abandonment of the SCFG as the basis of the language model. We have adopted Abney's random fields (Abney, 1997). Apart from performance improvements, altering the model class should allow empirical investigation of the MDL claim that model classes can be evaluated in terms of compression. So, if we discover even more compact models using random fields than we could using our SCFG, we might deduce that this is the case. Naturally, lexicalisation would enter into any scheme entertained. (We have now started experimenting with Random Field models (Osborne, 2000); preliminary results show improvements over parse selection using our SCFG.)
– Use of semantics in learning. We have at our disposal a large grammar augmented with a compositional semantics (Grover et al., 1993). Again, this should lead to better results.
– Prior weighting. As is well known, MDL-based learners sometimes improve from weighting the prior with respect to the likelihood. Schemes such as that of Quinlan and Rivest (1989) fall outside of the coding framework and (effectively) replicate the training set. We intend to pursue encoding-based schemes that achieve the same purpose.
Acknowledgements We would like to thank Ted Briscoe for comments on previous incarnations of this research and for the suggestion about relating ambiguity to parse selection, John Carroll for technical support with the Alvey Toolkit, Erik Hektoen for the tree comparison code, the anonymous reviewers of previous versions of this paper, and Donnla Nic Gearailt for comments on the written style. Naturally all mistakes are the author’s. This work was supported by the EU Project Sparkle LE-2111 and the TMR Project Learning Computational Grammars.
References

1. Abney, S. P. (1997). Stochastic Attribute-Value Grammars. Computational Linguistics, 23(4), 597–618.
2. Angluin, D. (1982). Inference of reversible languages. Journal of the Association for Computing Machinery, 29, 741–765.
3. Black, E., Garside, R., & Leech, G. (Eds.). (1993a). Statistically driven computer grammars of English: The IBM-Lancaster approach. Rodopi.
4. Black, E., Jelinek, F., Lafferty, J., & Magerman, D. M. (1993b). Towards History-based Grammars: Using Richer Models for Probabilistic Parsing. In 31st Annual Meeting of the Association for Computational Linguistics, pp. 31–37, Ohio State University, Columbus, Ohio, USA.
5. Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. (1987). Occam's Razor. Information Processing Letters, 24, 377–380.
6. Booth, T. (1969). Probabilistic representation of formal languages. In Tenth Annual IEEE Symposium on Switching and Automata Theory.
7. Briscoe, E. J., & Carroll, J. (1995). Developing and Evaluating a Probabilistic LR Parser of Part-of-Speech and Punctuation Labels. In ACL/SIGPARSE 4th International Workshop on Parsing Technologies, pp. 48–58, Prague, Czech Republic.
8. Carroll, G., & Charniak, E. (1992). Two Experiments on Learning Probabilistic Dependency Grammars from Corpora. In AAAI-92 Workshop Program: Statistically-Based NLP Techniques, San Jose, California.
9. Carroll, J. (1994). Relating complexity to practical performance in parsing with wide-coverage unification grammars. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pp. 287–294, NMSU, Las Cruces, NM.
10. Carroll, J., Grover, C., Briscoe, T., & Boguraev, B. (1991). A Development Environment for Large Natural Language Grammars. Technical Report 233, University of Cambridge Computer Laboratory.
11. Chen, S. F. (1996). Building Probabilistic Language Models for Natural Language. Ph.D. thesis, Harvard University.
12. Collins, M. J. (1996). A new statistical parser based on bigram lexical dependencies. In 34th Annual Meeting of the Association for Computational Linguistics, University of California, Santa Cruz, California, USA.
13. de Marcken, C. (1996). Unsupervised Language Acquisition. Ph.D. thesis, MIT.
14. Gold, E. M. (1967). Language Identification in the Limit. Information and Control, 10, 447–474.
15. Grover, C., Briscoe, T., Carroll, J., & Boguraev, B. (1993). The Alvey Natural Language Tools Grammar (4th Release). Technical report, University of Cambridge Computer Laboratory.
16. Harrison, P., Abney, S., Black, E., Flickinger, D., Gdaniec, C., Grishman, R., Hindle, D., Ingria, R., Marcus, M., Santorini, B., & Strzalkowski, T. (1991). Evaluating Syntax Performance of Parser/Grammars of English. In Neal, J. G., & Walter, S. M. (Eds.), Natural Language Processing Systems Evaluation Workshop, Technical Report RL-TR-91-362.
17. Hektoen, E. (1997). Probabilistic Parse Selection Based on Semantic Co-occurrences. In 5th International Workshop on Parsing Technologies, pp. 113–122, MIT, Cambridge, Massachusetts, USA.
18. Keller, B., & Lutz, R. (1997). Evolving Stochastic Context-Free Grammars from Examples Using a Minimum Description Length Principle. In Workshop on Automata Induction, Grammatical Inference and Language Acquisition, Nashville, Tennessee, USA. ICML'97.
19. Kolmogorov, A. (1965). Three approaches to the quantitative definition of information. Problems of Information Transmission, 1, 1–7.
20. Li, M., & Vitányi, P. M. B. (1997). An Introduction to Kolmogorov Complexity and its Applications (Second edition). Springer-Verlag.
21. MacKay, D. J. C. (1994). A Hierarchic Dirichlet Language Model. Natural Language Engineering, 1(1).
22. Magerman, D. M. (1994). Natural Language Parsing as Statistical Pattern Recognition. Ph.D. thesis, Stanford University.
23. Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (1998). Using EM to Classify Text from Labeled and Unlabeled Documents. Technical report, School of Computer Science, CMU, Pittsburgh, PA 15213.
24. Osborne, M. (1999). MDL-based DCG Induction for NP Identification. In Osborne, M., & Sang, E. T. K. (Eds.), CoNLL99, pp. 61–68, Bergen, Norway. EACL.
25. Osborne, M. (2000). Estimation of Stochastic Attribute-Value Grammars using an Informative Sample. In Coling 2000.
26. Osborne, M., & Briscoe, T. (1997). Learning Stochastic Categorial Grammars. In Ellison, T. M. (Ed.), CoNLL97, pp. 80–87. ACL.
27. Pereira, F., & Schabes, Y. (1992). Inside-outside reestimation from partially bracketed corpora. In Proceedings of the 30th ACL, pp. 128–135, University of Delaware, Newark, Delaware.
28. Quinlan, J. R., & Rivest, R. L. (1989). Inferring decision trees using the minimum description length principle. Information and Computation, 80, 227–248.
29. Rissanen, J. (1989). Stochastic Complexity in Statistical Inquiry, Vol. 15 of Series in Computer Science. World Scientific.
30. Sakakibara, Y. (1992). Efficient learning of context-free grammars from positive structural examples. Information and Computation, 97, 23–60.
31. Schaffer, C. (1993). Overfitting Avoidance as Bias. Machine Learning, 10, 153–178.
32. Shinohara, T. (1990). Inductive inference from positive data is powerful. In The 1990 Workshop on Computational Learning Theory, pp. 97–110, San Mateo, CA. Morgan Kaufmann.
33. Stolcke, A. (1994). Bayesian Learning of Probabilistic Language Models. Ph.D. thesis, University of California, Berkeley.
34. Stolcke, A., & Omohundro, S. (1994). Inducing Probabilistic Grammars by Bayesian Model Merging. In Grammatical Inference and Applications, pp. 106–118. Springer Verlag.
Learning Log-Linear Models on Constraint-Based Grammars for Disambiguation

Stefan Riezler

Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, Azenbergstr. 12, 70174 Stuttgart
[email protected]
Abstract. We discuss the probabilistic modeling of constraint-based grammars by log-linear distributions and present a novel technique for statistical inference of the parameters and properties of such models from unannotated training data. We report on an experiment with a log-linear grammar model which employs sophisticated linguistically motivated features of parses as properties of the probability model. We report the results of statistical parameter estimation and empirical evaluation of this model on a small scale. These show that log-linear models on the parses of constraint-based grammars are useful for accurate disambiguation.
1 Introduction
Structural ambiguity is a practical problem for every grammar describing a nontrivial fragment of natural language. That is, for such grammars every input of reasonable length may receive a large number of different analyses, many of which are not in accord with human perceptions. Probabilistic grammars attempt to solve the problem of structural ambiguity resolution by a probabilistic ranking of analyses. A prerequisite for such approaches is the choice of appropriate parametric probability models and statistical inference methods for estimating the model parameters from empirical data. Probability models and estimators for constraint-based grammars have been presented for stochastic unification-based grammars (Briscoe & Waegner, 1992; Briscoe & Carroll, 1993), stochastic constraint logic programming (Eisele, 1994), stochastic head-driven phrase structure grammar (Brew, 1995), stochastic logic programming (Miyata, 1996), stochastic categorial grammars (Osborne & Briscoe, 1997) or data-oriented approaches to lexical-functional grammar (Bod & Kaplan, 1998). All of these approaches apply a renormalized extension of the probabilistic and statistical methods underlying stochastic context-free grammars (Baum, Petrie, Soules, & Weiss, 1970; Baker, 1979) to constraint-based models. As shown by Abney (1997), a loss in probability mass due to failure derivations is caused in these approaches since the estimation technique for context-free models is based on the assumption of mutual independence of the model's derivation steps, but context-dependent constraints on derivations are inherent to constraint-based grammars. The necessary renormalization of the probability distribution on derivations with respect
to consistent derivations causes a general deviance of the resulting estimates from the desired maximum likelihood estimates. One solution to this problem is the definition of custom-built statistical inference procedures for specialized parsing models including a limited amount of context-dependency. For example, the probabilistic feature-grammar model of Goodman (1998) conditions on a finite set of categorial features beyond the nonterminal of each node, which makes it possible to explicitly unfold the dependencies in the model. Another solution to this problem is to go to a more expressive family of probability models and to new related statistical inference methods. Following Abney (1997), we choose the parametric family of log-linear probability distributions to model constraint-based grammars. The great advantage of log-linear models is their generality and flexibility. For example, log-linear models allow arbitrary context dependencies in the data to be described by choosing a few salient properties of the data as the defining properties of the model. Log-linear models (or variants thereof) for the probabilistic modeling of context-dependencies in rule-based grammars have been presented by Magerman (1994), Abney (1997), Riezler (1997), Ratnaparkhi (1998), and Cussens (1999). However, with log-linear models we are not restricted to building our models on production rules or other configurational properties of the data. Rather, we have the virtue of employing essentially arbitrary properties in our models. For example, heuristics on preferences of grammatical functions or on attachment preferences as used in Srinivas, Doran, and Kulick (1995), or the preferences in lexical relations as used in Alshawi and Carter (1994), can be integrated into a log-linear model very easily. A log-linear model employing more sophisticated properties encoding grammatical functions, attachment preferences, branching behaviour, coordination parallelism, phrase complexity, and other general properties of constraint-based parses has been presented by Johnson, Geman, Canon, Chi, and Riezler (1999). Clearly, the step from simple rule-based probability models to general log-linear models also requires a more general and more complex estimation algorithm. The estimation algorithm for log-linear models proposed by Abney (1997) is the iterative scaling method of Della Pietra, Della Pietra, and Lafferty (1997). This algorithm recasts the optimization of weights of preference functions as done by Srinivas et al. (1995) or Alshawi and Carter (1994) as estimation of parameters associated with the properties of a log-linear model. However, there is a drawback: in contrast to rule-based models, where efficient estimation algorithms from incomplete, i.e., unannotated, data exist, the iterative scaling estimation method of Della Pietra et al. (1997) applies only to complete, i.e., fully annotated, training data. Unfortunately, the need to rely on large samples of complete data is impractical. For parsing applications, complete data means several person-years of hand-annotating large corpora with specialized grammatical analyses. This task is always labor-intensive, error-prone, and restricted to a specific grammar framework, a specific language, and a specific language domain. Thus, the first open problem to solve is to find automatic and reusable techniques for parameter estimation and property selection of probabilistic constraint-based grammars from incomplete data. We will present a general
statistical inference algorithm for log-linear models from incomplete data which can be seen as an extension of the iterative scaling method of Della Pietra et al. (1997). A further open problem is the empirical evaluation of the performance of probabilistic constraint-based grammars in terms of finding human-determined correct parses. We present an experiment with a log-linear model employing a few hundred linguistically motivated properties. The experiment was conducted on a small scale but clearly shows the usefulness of general properties in order to get good results in a linguistic evaluation.

The rest of this paper is organized as follows. Section 2 introduces the basic formal concepts of CLP. Section 3 presents a log-linear model for probabilistic CLP. Parameter estimation and property selection of log-linear models from incomplete data is treated in Sect. 4. Section 5 presents an empirical evaluation of the applicability of general log-linear models to probabilistic constraint-based grammars in a small-scale experiment. Concluding remarks are made in Sect. 6. The work presented below on the theory of statistical inference of log-linear models is based upon work previously published in Riezler (1997, 1998, 1999). The presented empirical results on experimenting with log-linear grammar models were published first in Johnson et al. (1999). For more recent experimental work see Johnson and Riezler (2000) and Riezler, Prescher, Kuhn, and Johnson (2000).
2 Constraint Logic Programming for NLP
In the following we will sketch the basic concepts of the constraint logic programming (CLP) scheme of Höhfeld and Smolka (1988). A constraint-logic grammar (CLG) is encoded by a constraint logic program P with constraints from a grammar constraint language L embedded into a relational programming constraint language R(L).

Let us consider a simple non-linguistic example. The program of Table 1 consists of five definite clauses with embedded L-constraints from a language of hierarchical types. The ordering on the types is defined by the operation of set inclusion on the denotations of the types, with a ⊆ c ⊆ e, b ⊆ d ⊆ e, and c ∩ d = ∅.

Table 1. Simple constraint logic program

s(Z) ← p(Z) & q(Z).
p(Z) ← Z = a.
p(Z) ← Z = b.
q(Z) ← Z = a.
q(Z) ← Z = b.
Seen from a parsing perspective, an input string corresponds to an initial goal or query G which is a possibly empty conjunction of L-constraints and
R(L)-atoms. Parses of a string (encoded by G) as produced by a grammar (encoded by P) correspond to P-answers of G. A P-answer of a goal G is defined as a satisfiable L-constraint φ s.t. the implication φ → G is a logical consequence of P. The operational semantics of conventional logic programming, SLD-resolution (see Lloyd (1987)), is generalized by performing goal reduction only on the R(L)-atoms and solving conjunctions of collected L-constraints by a given L-constraint solver. Examples for queries and proof trees for the program of Table 1 are given in Fig. 1. In the following it will be convenient to view the search space determined by this derivation procedure as a search of a tree. Each derivation from a query G and a program P corresponds to a branch of a derivation tree, and each successful derivation to a subtree of a derivation tree, called a proof tree, with G as root node and a P-answer as terminal node. We assume each parse of a sentence to be associated with a single proof tree.
3 A Log-Linear Probability Model for CLP
Log-linear models can be seen as an exponential family of probability distributions where the probability of an event is simply defined as being proportional to the product of weights assigned to selected properties of the event. Another way to understand log-linear models is as maximum-entropy models. From this viewpoint we do statistical inference and, believing that entropy is the unique consistent measure of the amount of uncertainty represented by a probability distribution, we obey the following principle:

   In making inferences on the basis of partial information we must use that probability distribution which has maximum entropy subject to whatever is known. This is the only unbiased assignment we can make; to use any other would amount to arbitrary assumption of information which by hypothesis we do not have. (Jaynes, 1957)

The solution to this constrained maximum-entropy problem has the parametric form of log-linear probability models. For our application, the special instance of interest is a log-linear distribution over the countably infinite set of proof trees for a set of queries to a program.

Definition 1. A log-linear probability distribution pλ on a set X is defined s.t. for all x ∈ X:

   pλ(x) = Zλ^{−1} e^{λ·ν(x)} p0(x), where

   Zλ = Σ_{x∈X} e^{λ·ν(x)} p0(x) is a normalizing constant,
   λ = (λ1, . . . , λn) ∈ IR^n is a vector of log-parameters,
   νi : X → IR, i = 1, . . . , n, ν = (ν1, . . . , νn) is a vector of property-functions,
   λ·ν(x) is the vector dot product Σ_{i=1}^n λi νi(x),
   p0 is a fixed reference distribution.
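For a finite set of proof trees, Definition 1 can be transcribed almost literally; the sketch below is our own illustration (property functions and the reference distribution are supplied by the caller):

```python
import math

def log_linear_distribution(xs, nu, lam, p0):
    """Definition 1 over a finite set xs: p_lambda(x) is proportional to
    exp(lam . nu(x)) * p0(x); nu(x) returns the property-function vector."""
    weight = {x: math.exp(sum(l * v for l, v in zip(lam, nu(x)))) * p0(x)
              for x in xs}
    Z = sum(weight.values())            # normalizing constant Z_lambda
    return {x: w / Z for x, w in weight.items()}
```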
Fig. 1. Queries and proof trees for constraint logic program
Why are these models so interesting for probabilistic constraint-based grammars? Firstly, log-linear models provide the means for an appropriate probabilistic modeling of the context-dependencies inherent in constraint-based grammars. For example, dependencies between single clauses or rules can be modeled by taking subtrees of proof trees including these clauses as properties. Moreover, since linguistically there is no particular reason for assuming rules or clauses or even subtrees as the best properties to use in a probabilistic grammar, we can go a step further to even more sophisticated non-standard properties. As we will see in Sect. 5, properties referring to grammatical functions, attachment preferences, coordination parallelism, phrase complexity, right-branching behaviour of parses, or other general features of constraint-based parses can be employed successfully to probabilistic CLGs. Clearly, to answer the above question, the parameterization of log-linear models not only in numerical log-values but also in the properties themselves is what makes them so flexible and interesting. Let us illustrate this with a simple example. Suppose we have a training corpus of ten queries, consisting of three tokens of query y1 : s(Z) & Z = a, four tokens of y3 : s(Z) & Z = c, and one token each of query y2 : s(Z) & Z = b, y4 : s(Z) & Z = d, and y5 : s(Z) & Z = e. The corresponding proof trees generated by the program in Table 1 are given in Fig. 1. Note that queries y1 , y2 , y3 and y4 are unambiguous, being assigned a single proof tree, while y5 is ambiguous.
A useful first distinction between the proof trees of Fig. 1 can be obtained by selecting the two subtrees χ1 : Z = a and χ2 : Z = b as properties. These properties allow us to cluster the proof trees in two disjoint sets on the basis of similar statistical qualities of the proof trees in these sets. In the statistical estimation framework of maximum likelihood estimation (MLE) we would expect the parameter value corresponding to property χ1 to be higher than the parameter value of property χ2 since in our training corpus seven out of ten queries come unambiguously with a proof tree including property χ1 . However, we cannot simply recreate the proportions of the training data from the corresponding proof trees as we could do for an unambiguous example. Here we are confronted with an incomplete-data problem, which means that we do not know the frequency of the possible proof trees of query y5 . In the next section we will discuss this problem of parameter estimation for log-linear models in more detail.
4 Statistical Inference for Log-Linear Models
Della Pietra et al. (1997) have presented a statistical inference algorithm for combined property selection and parameter estimation for log-linear models. Abney (1997) has shown the applicability of this algorithm to stochastic attribute-value grammars, which can be seen as a special case of context-sensitive CLGs. This algorithm, however, applies only to complete data, i.e. fully annotated data are needed for training. Unfortunately, the need to rely on large training samples of complete data is a problem if such data are difficult to gather. For example, in natural language parsing applications, complete data means several person-years of hand-annotating large corpora with detailed analyses of specialized grammar frameworks. This is always a labor-intensive and error-prone task, which additionally is restricted to the specific grammar framework, the specific language, and the specific language domain in question. Clearly, for such applications automatic and reusable techniques for statistical inference from incomplete data, i.e. training data consisting simply of unannotated sentences, are desirable.

In the following, we present a version of the statistical inference algorithm of Della Pietra et al. (1997) especially designed for incomplete data problems. We present a parameter estimation technique for log-linear models from incomplete data (Sect. 4.2) and a property selection procedure from incomplete data (Sect. 4.3). These algorithms are combined into a statistical inference algorithm for log-linear models from incomplete data (Sect. 4.4). A self-contained proof of monotonicity and convergence of this algorithm, which does not rely on the convergence of alternating minimization procedures for maximum-entropy models as presented by Csiszár (1989) or on the regularity conditions for generalized EM algorithms as presented by Wu (1983), can be found in Riezler (1999).
4.1 Optimization-Theoretic Background
Why is incomplete-data estimation for log-linear models difficult? The answer is because complete-data estimation for such models is difficult, too. Let us
have a look at the first partial derivatives of some objective functions which are considered in MLE of log-linear models from complete and incomplete data (see Table 2). Let p[f] = Σ_{x∈X} p(x)f(x) denote the expectation of a function f : X → IR with respect to a probability distribution p on X, and let p̃(x) denote the empirical distribution of a parse x. The system of equations to be solved at the points where the first partial derivatives of the complete-data log-likelihood function Lc are zero, i.e., at the critical points of Lc, can be expressed as

   Σ_{x∈X} Zλ^{−1} e^{λ·ν(x)} νi(x) = Σ_{x∈X} p̃(x) νi(x)   for all i = 1, . . . , n.
That is, the parameters λi solving this MLE problem are found by setting the expected values of the functions νi under the model pλ equal to the expectations of the functions νi under the empirical distribution p̃. However, because of the dependence of both Zλ and e^{λ·ν(x)} on λ, this system of equations cannot be solved coordinate-wise in λi. This problem is even more severe for the case of incomplete-data estimation. Let Y be a sample space of incomplete data, i.e. unparsed sentences, X(y) the set of parses produced by a constraint-based grammar for sentence y, and kλ(x|y) the conditional probability of parse x given the corresponding sentence y and the current parameter value λ. The incomplete-data log-likelihood L has its critical points at the solution of the following system of equations in λi:

   Σ_{x∈X} Zλ^{−1} e^{λ·ν(x)} νi(x) = Σ_{y∈Y} p̃(y) Σ_{x∈X(y)} kλ(x|y) νi(x)   for all i = 1, . . . , n.
Here additionally a dependence of the conditional probability kλ(x|y) on λ has to be respected. However, an application of the standard solution to incomplete-data problems by the "expectation-maximization" (EM) algorithm (see Dempster, Laird, and Rubin (1977)) to log-linear models only partially solves the problem. The equations to be solved to find the critical points of the EM-auxiliary function Q(λ; λ′) for a log-linear model depending on λ are

   Σ_{x∈X} Zλ^{−1} e^{λ·ν(x)} νi(x) = Σ_{y∈Y} p̃(y) Σ_{x∈X(y)} kλ′(x|y) νi(x)   for all i = 1, . . . , n.

Here kλ′(x|y) depends on λ′ instead of λ. However, the dependency of Zλ and e^{λ·ν(x)} on λ still remains a problem. Solutions for the system of equations can be found, e.g., by applying general-purpose numerical optimization methods (see Fletcher (1987)) to the problem in question. For the smooth and strictly concave complete-data log-likelihood Lc, e.g., a conjugate gradient approach could be used. Fortunately, optimization methods specifically tailored to the problem of MLE from complete data for log-linear models have been presented by Darroch and Ratcliff (1972) and Della Pietra et al. (1997). Both of these "iterative scaling" algorithms iteratively maximize an auxiliary function Ac(γ; λ) which is defined as a lower bound on the difference Lc(γ + λ) − Lc(λ) in complete-data log-likelihood when going from a basic model pλ to an extended model pγ+λ.
The function Ac(γ; λ) is maximized as a function of γ for fixed λ, which makes it possible to solve the following equation coordinate-wise in γi, i = 1, . . . , n:

   Σ_{x∈X} pλ(x) νi(x) e^{γi ν#(x)} = Σ_{x∈X} p̃(x) νi(x)   for all i = 1, . . . , n.
A closed form solution for γi is given for constant ν# = Σ_{i=1}^n νi(x) for all x ∈ X; otherwise simple numerical methods such as Newton's method can be used to solve for the γi. It is shown in Della Pietra et al. (1997) and Darroch and Ratcliff (1972) that iteratively replacing λ^(t+1) by λ^(t) + γ^(t) conservatively increases Lc, and a sequence of likelihood values converges to the global maximum of the strictly concave function Lc.

Table 2. Partial derivatives of objective functions for MLE of log-linear models

                  log-likelihood                              auxiliary function
complete data     ∂Lc(λ)/∂λi = p̃[νi] − pλ[νi]                 ∂Ac(γ;λ)/∂γi = p̃[νi] − pλ[νi e^{γi ν#}]
incomplete data   ∂L(λ)/∂λi = p̃[kλ[νi]] − pλ[νi]              ∂Q(λ;λ′)/∂λi = p̃[kλ′[νi]] − pλ[νi]
For the case of incomplete-data estimation things are more complicated. Since the incomplete-data log-likelihood function L is not strictly concave, general-purpose numerical methods such as conjugate gradient cannot be applied. However, such methods can be applied to the auxiliary function Q as defined by a standard EM algorithm for log-linear models. Alternatively, iterative scaling methods can be used to perform maximization of the auxiliary function Q of the EM algorithm. Both approaches result in a doubly iterative algorithm where an iterative algorithm for the M-step is interwoven with the iterative EM algorithm. Clearly, this is computationally burdensome and should be avoided. The aim of our approach is exactly to avoid such doubly iterative algorithms. The idea here is to interleave the auxiliary functions Q of the EM algorithm and Ac of iterative scaling in order to define a singly-iterative incomplete-data estimation algorithm using a new combined auxiliary function.

4.2 Parameter Estimation
Let us start with a problem definition. Applying an incomplete-data framework to a log-linear probability model for CLP, we can assume the following to be given:

– observed, incomplete data y ∈ Y, corresponding to a finite sample of queries for a constraint logic program P,
– unobserved, complete data x ∈ X, corresponding to the countably infinite sample of proof trees for queries Y from P,
– a many-to-one function Y : X → Y s.t. Y(x) = y corresponds to the unique query labeling proof tree x, and its inverse X : Y → 2^X s.t. X(y) = {x | Y(x) = y} is the countably infinite set of proof trees for query y from P,
– a complete-data specification pλ(x), which is a log-linear distribution on X with given reference distribution p0, fixed property vector χ and property-functions vector ν, and depending on parameter vector λ,
– an incomplete-data specification gλ(y), which is related to the complete-data specification by gλ(y) = Σ_{x∈X(y)} pλ(x).

The problem of maximum-likelihood estimation for log-linear models from incomplete data can then be stated as follows. Given a fixed sample from Y and a set Λ = {λ ∈ IR^n | pλ(x) is a log-linear distribution on X with fixed p0 and fixed ν}, we want to find a maximum likelihood estimate λ* of λ s.t.

   λ* = arg max_{λ∈Λ} L(λ) = ln Π_{y∈Y} gλ(y)^{p̃(y)}.
Similar to the case of iterative scaling for complete-data estimation, we define an auxiliary function A(γ, λ) as a conservative estimate of the difference L(γ + λ) − L(λ) in log-likelihood. The lower bound for the incomplete-data case can be derived from the complete-data case, in essence, by replacing an expectation of complete, but unobserved data by a conditional expectation given the observed data and the current fit of the parameter values. Clearly, this is the same trick that is used in the EM algorithm, but applied in the context of a different auxiliary function. From the lower-bounding property of the auxiliary function it can immediately be seen that each maximization step of A(γ, λ) as a function of γ will increase or hold constant the improvement L(γ + λ) − L(λ). This is a first important property of a MLE algorithm. Furthermore, our approach to view the incomplete-data auxiliary function directly as a lower bound on the improvement in incomplete-data log-likelihood enables an intuitive and elegant proof of convergence.

Let the conditional probability of complete data x given incomplete data y and parameter values λ be defined as

   kλ(x|y) = pλ(x)/gλ(y) = e^{λ·ν(x)} p0(x) / Σ_{x∈X(y)} e^{λ·ν(x)} p0(x).

Then a two-place auxiliary function A can be defined as

   A(γ, λ) = p̃[1 + kλ[γ·ν] − pλ[Σ_{i=1}^n ν̄i e^{γi ν#}]],

where ν#(x) = Σ_{i=1}^n νi(x) and ν̄i(x) = νi(x)/ν#(x).
A(γ, λ) takes its maximum as a function of γ at the unique point γ̂ satisfying, for each γ̂i, i = 1, . . . , n:

   p̃[kλ[νi]] = pλ[νi e^{γ̂i ν#}].

From the auxiliary function A an iterative algorithm for maximizing L is constructed. For want of a name, we will call this algorithm the "Iterative Maximization (IM)" algorithm. At each step of the IM algorithm, a log-linear model based on parameter vector λ is extended to a model based on parameter vector λ + γ̂, where γ̂ is an estimation of the parameter vector that maximizes the improvement in L when moving away in the parameter space from λ. This increment γ̂ is estimated by maximizing the auxiliary function A(γ, λ) as a function of γ. This maximum is determined for each i = 1, . . . , n uniquely as the solution γ̂i to the equation p̃[kλ[νi]] = pλ[νi e^{γ̂i ν#}]. If ν# = Σ_{i=1}^n νi(x) = K sums to a constant independent of x ∈ X, there exists a closed form solution for the γ̂i:

   γ̂i = (1/K) ln( p̃[kλ[νi]] / pλ[νi] )   for all i = 1, . . . , n.

For ν# varying as a function of x, Newton's method can be applied to find an approximate solution. The IM algorithm in its general form is defined as follows:

Definition 2 (Iterative maximization). Let M : Λ → Λ be a mapping defined by M(λ) = γ̂ + λ with γ̂ = arg max_{γ∈IR^n} A(γ, λ). Then each step of the IM algorithm is defined by λ^(k+1) = M(λ^(k)).

As shown in Riezler (1999), the general properties of the IM algorithm are as follows: The IM algorithm conservatively increases the incomplete-data log-likelihood function L (monotonicity). Furthermore, it converges monotonically to a critical point of L, which in almost all cases is a local maximum (convergence).
4.3 Property Selection
For the task of parameter estimation, we assumed a vector of properties to be given. However, exhaustive sets of properties can get unmanageably large and have to be curtailed. For example, suppose properties of proof trees are defined as connected subgraphs of proof trees. Such properties could be constructed incrementally by selecting from an initial set of goals and from subtrees built by performing a resolution step at a terminal node of a subtree already in the model. Another possibility is to define properties directly as the set of all possible combinations of clauses of a given constraint language (see Dehaspe (1997) and Cussens (1999)). Clearly, such exponentially growing sets of possible properties
must be pruned by some quality measure. An appropriate measure can then be used to define an algorithm for automatic property selection. Such an algorithm enables the induction of the property structure of the log-linear model, and in the case of the clause-properties of Dehaspe (1997) and Cussens (1999), even the induction of the structure of the underlying constraint-based grammar.

Table 3. Algorithm (Combined Statistical Inference)

Input: Initial model p0; incomplete-data sample from Y.
Output: Log-linear model p* on complete-data sample X = ∪_{y∈Y: p̃(y)>0} X(y) with selected property function vector ν* and log-parameter vector λ* = arg max_{λ∈Λ} L(λ), where Λ = {λ ∈ IR^m | pλ is a log-linear model on X based on p0 and ν*}.
Procedure:
1. p^(0) := p0 with C^(0) := ∅.
2. Property selection: For each candidate property c ∈ C^(t), compute the gain Gc(λ^(t)) := max_{α∈IR} Gc(α, λ^(t)), and select the property ĉ := arg max_{c∈C^(t)} Gc(λ^(t)).
3. Parameter estimation: Compute a maximum likelihood parameter value λ̂ := arg max_{λ∈Λ} L(λ), where Λ = {λ ∈ IR^{n+1} | pλ(x) is a log-linear distribution on X with initial model p0 and property function vector ν̂ := (ν1^(t), ν2^(t), . . . , νn^(t), ĉ)}.
4. Until the model converges, set p^(t+1) := p_{λ̂·ν̂}, t := t + 1, go to 2.
A straightforward measure would be the improvement in log-likelihood when extending a model by a single candidate property c with corresponding parameter α. However, this would require iterative maximization for each candidate property and is thus infeasible. Following Della Pietra et al. (1997), we could instead approximate the improvement due to adding a single property by adjusting only the parameter of this candidate and holding all other parameters of the model fixed. Unfortunately, the incomplete-data log-likelihood L is not concave in the parameters and thus cannot be maximized directly. However, we can instantiate the auxiliary function A used in parameter estimation to the extension of a model pλ by a single property c with log-parameter α, i.e., we can express an approximate gain Gc(α, λ) of adding a candidate property c with log-parameter value α to a log-linear model pλ as a conservative estimate of the true gain in log-likelihood as follows:

   Gc(α, λ) = p̃[1 + kλ[αc] − pλ[e^{αc}]].

Gc(α, λ) is maximized in α at the unique point α̂ satisfying

   p̃[kλ[c]] = pλ[c e^{α̂c}].
The selection function Gc(λ) = max_α Gc(α, λ) then yields a greedy property selection algorithm where a candidate property ĉ is selected if ĉ = arg max_c Gc(λ). That is, at each step that property out of the set of candidates is selected that gives the greatest improvement to the model at the property's best adjusted parameter value. Since we are interested only in relative, not absolute gains, a single, non-iterative maximization of the approximate gain will be sufficient to choose from the candidates.
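For the common case of 0/1-valued candidate properties, the maximizing parameter and the approximate gain have closed forms. The sketch below spells this out (our own illustration, assuming binary properties and that the expectations p̃[kλ[c]] and pλ[c] have already been computed):

```python
import math

def best_alpha_and_gain(p_tilde_k_c, p_lambda_c):
    """For a binary candidate property c: alpha-hat solves
    p~[k_lambda[c]] = e^alpha * p_lambda[c], and the approximate gain is
    G_c(alpha, lambda) = 1 + alpha * p~[k_lambda[c]] - p_lambda[e^{alpha c}]."""
    alpha = math.log(p_tilde_k_c / p_lambda_c)
    exp_expectation = (1.0 - p_lambda_c) + math.exp(alpha) * p_lambda_c
    gain = 1.0 + alpha * p_tilde_k_c - exp_expectation
    return alpha, gain

def select_property(candidates):
    """candidates: dict mapping a property to (p~[k_lambda[c]], p_lambda[c]);
    greedily pick the candidate with the largest approximate gain."""
    return max(candidates, key=lambda c: best_alpha_and_gain(*candidates[c])[1])
```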
4.4 Combined Statistical Inference
The IM procedure for parameter estimation and the procedure for property selection can be combined into a statistical inference algorithm for log-linear models from incomplete data as shown in Table 3. Note that X is defined as the disjoint union of the complete data corresponding to the incomplete data in the random sample, i.e., X := y∈Y|p(y)>0 X(y). ˜ Table 4. Estimation using the IM algorithm (t)
Table 4. Estimation using the IM algorithm

Iteration t   λ1(t)      λ2(t)      p1(t)      p2(t)      L(λ(t))
0             0          0          1/6        1/6        −17.224448
1             ln 1.5     ln .5      .25        .083˙      −15.772486
2             ln 1.55    ln .45     .2583˙     .075       −15.753678
3             ln 1.555   ln .445    .25916˙    .07416˙    −15.753481
Let us apply the IM algorithm to the incomplete-data problem shown in Fig. 1. For the selected properties χ1 and χ2, we have ν#(x) = ν1(x) + ν2(x) = 1 for all possible proof trees x for the sample of Fig. 1. Thus the parameter updates γ̂i can be calculated from a particularly simple closed form, $\hat\gamma_i = \ln \frac{\tilde p[k_\lambda[\nu_i]]}{p_\lambda[\nu_i]}$. A sequence of IM iterates is given in Table 4. Probabilities of proof trees involving property χi are denoted by pi. Starting from an initial uniform probability of 1/6 for each proof tree, this sequence of likelihood values converges with an accuracy in the third place after the decimal point after three iterations and yields probabilities p1 ≈ .259 and p2 ≈ .074 for the respective proof trees.
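A generic sketch of this closed-form update is given below; it assumes the special case just described (the selected property functions sum to one on every proof tree) and uses hypothetical data structures (trees, yields, p_tilde) chosen for illustration rather than reproducing the computation behind Table 4.

import math

def im_update(lam, trees, yields, p_tilde, nu):
    # one IM iteration with the closed form gamma_i = ln( p~[k_lambda[nu_i]] / p_lambda[nu_i] )
    w = {x: math.exp(sum(l * f(x) for l, f in zip(lam, nu))) for x in trees}
    z = sum(w.values())
    p = {x: w[x] / z for x in trees}                      # current model p_lambda
    new_lam = []
    for i, f in enumerate(nu):
        cond = sum(p_tilde[y] *
                   sum(w[x] * f(x) for x in yields[y]) / sum(w[x] for x in yields[y])
                   for y in yields)                       # p~[k_lambda[nu_i]]
        model = sum(p[x] * f(x) for x in trees)           # p_lambda[nu_i]
        new_lam.append(lam[i] + math.log(cond / model))
    return new_lam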
5
An Experiment
In this section we present an empirical evaluation of the applicability of log-linear probability models and related parameter estimation techniques to constraintbased grammars. We present an experiment which starts from a small corpus of analyses in the lexical-functional grammar (LFG) format provided by Xerox PARC. We introduce a maximum pseudo-likelihood estimation procedure for log-linear models from complete data. This estimator uses ideas from incomplete data estimation to make the computation tractable. The log-linear model
employs a small set of about 200 properties to induce a probability distribution on 3000 parses where on average each sentence is ambiguous in 10 parses. The empirical evaluation shows that the correct parse from the set of all parses is found about 59 % of the time. For further details on this experiment see Johnson et al. (1999).
5.1 Incomplete-Data Estimation as Maximum Pseudo-Likelihood Estimation for Complete Data
As we saw in Sect. 4, the equations to be solved in statistical inference of log-linear models involve the computation of expectations of property-functions $\nu_i(x)$ with respect to $p_\lambda(x)$. Clearly it is possible to find constraint-based grammars where the sample space $\mathcal{X}$ of parses to be summed over in these expectations is unmanageably large or even infinite. One possibility to sensibly reduce the summation space is to employ the definition of the sample space $\mathcal{X} := \bigcup_{y \in \mathcal{Y},\,\tilde p(y)>0} X(y)$ used in incomplete-data estimation as a reduction factor in complete-data estimation. That is, we approximate expectations with respect to the distribution $p_\lambda(\cdot)$ on $\mathcal{X}$ by considering only such parses $x \in \mathcal{X}$ whose terminal yield $y = Y(x)$ is seen in the training corpus. Furthermore, the distribution $g_\lambda(y)$ on terminal yields is replaced by the empirical distribution $\tilde p(y)$:

$$p_\lambda[\nu_i] = \sum_{x \in \mathcal{X}} p_\lambda(x)\,\nu_i(x)
 = \sum_{y \in \mathcal{Y}} \sum_{x \in X(y)} p_\lambda(x)\,\nu_i(x)
 = \sum_{y \in \mathcal{Y}} g_\lambda(y) \sum_{x \in X(y)} k_\lambda(x|y)\,\nu_i(x)
 \approx \sum_{y \in \mathcal{Y}} \tilde p(y) \sum_{x \in X(y)} k_\lambda(x|y)\,\nu_i(x).$$

Clearly, for most cases the approximate expectation is easier to calculate since the space $\bigcup_{y \in \mathcal{Y},\,\tilde p(y)>0} X(y)$ is smaller than the original full space $\mathcal{X}$. The equations to be solved in complete-data estimation for log-linear models are then

$$\sum_{y \in \mathcal{Y}} \tilde p(y) \sum_{x \in X(y)} k_\lambda(x|y)\,\nu_i(x) = \sum_{x \in \mathcal{X}} \tilde p(x)\,\nu_i(x) \quad \text{for all } i = 1, \ldots, n.$$
These equations are solutions to the maximization problem of another criterion, namely a complete-data log-pseudo-likelihood function $PL_c$ which is defined with respect to the conditional probability of parses given the yields observed in the training corpus:

$$PL_c(\lambda) = \ln \prod_{x \in \mathcal{X},\, y \in \mathcal{Y}} k_\lambda(x|y)^{\tilde p(x,y)}.$$
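For illustration, both sides of these estimation equations can be computed as follows; the functions below are a sketch with assumed inputs (a property function nu_i, per-yield parse sets, empirical distributions and an unnormalized weight function for the current model) and do not reflect the actual data format used in the experiment.

def conditional_expectation(nu_i, yields, p_tilde, weight):
    # left-hand side: sum_y p~(y) sum_{x in X(y)} k_lambda(x|y) nu_i(x)
    total = 0.0
    for y, parses in yields.items():
        z = sum(weight(x) for x in parses)
        total += p_tilde[y] * sum(weight(x) / z * nu_i(x) for x in parses)
    return total

def empirical_expectation(nu_i, correct_parses, p_tilde_x):
    # right-hand side: sum_x p~(x) nu_i(x), over the annotated correct parses
    return sum(p_tilde_x[x] * nu_i(x) for x in correct_parses)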
In the actual implementation described in Johnson et al. (1999), a slightly different objective function was maximized, which adds a regularization term promoting small values of λ. The maximization equations were solved using a conjugate-gradient approach adapted from Press, Teukolsky, Vetterling, and Flannery (1992). A similar approach to maximum pseudo-likelihood estimation for log-linear models from complete data, but in the context of an iterative scaling approach, can be found in Berger, Della Pietra, and Della Pietra (1996).
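A rough sketch of that style of optimisation is shown below; the quadratic regulariser, the data layout and the use of SciPy's conjugate-gradient routine are assumptions standing in for the actual implementation, whose exact objective is not reproduced here.

import numpy as np
from scipy.optimize import minimize

def neg_regularized_pl(lam, features, yields, p_tilde, reg=1.0):
    # -PL_c(lambda) + reg * ||lambda||^2; features[x] is a vector of property values nu(x)
    value = 0.0
    for y, (parses, correct) in yields.items():
        scores = np.array([features[x] @ lam for x in parses])
        log_z = np.logaddexp.reduce(scores)
        value += p_tilde[y] * (features[correct] @ lam - log_z)   # log k_lambda(correct | y)
    return -value + reg * float(lam @ lam)

def fit(features, yields, p_tilde, n_props, reg=1.0):
    res = minimize(neg_regularized_pl, np.zeros(n_props),
                   args=(features, yields, p_tilde, reg), method="CG")
    return res.x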
5.2 Property Design for Feature-Based CLGs
One central aim of our experiment was to take advantage of the high flexibility of log-linear models and evaluate the usefulness of this in hard terms of empirical performance. The properties employed in our models clearly deviate from the rule or production properties employed in most other probabilistic grammars by encoding as property-functions general linguistic principles as proposed by Alshawi and Carter (1994), Srinivas et al. (1995) or Hobbs and Bear (1995). The definition of properties of LFG parses refers to both the c(onstituent)- and f(eature)-structures of the parses. Examples of the properties employed in our model are:
– properties for c-structure nodes, corresponding to standard production properties,
– properties for c-structure subtrees, indicating argument versus adjunct attachment,
– properties for f-structure attributes, corresponding to grammatical functions used in LFG, e.g., SUBJ, OBJ, OBJ2, COMP, XCOMP, ADJUNCT,
– properties for atomic attribute-value pairs in f-structures,
– properties measuring the complexity of the phrase being attached to, thus indicating both high and low attachment,
– properties indicating non-right-branching of nonterminal nodes,
– properties indicating non-parallel coordinate structures.
The number of properties defined for each of the two corpora we worked with was about 200, including about 50 rule-properties in each case.
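As a purely illustrative sketch of what such property functions might look like (the nested-dictionary parse representation is an assumption made here and is not the XLE/LFG data format), each property is simply a function from a parse to a count:

def count_cstructure_nodes(tree, label):
    # production-style property: number of c-structure nodes with a given category label
    n = 1 if tree.get("label") == label else 0
    return n + sum(count_cstructure_nodes(child, label) for child in tree.get("children", []))

def count_fstructure_attribute(fstructure, attr):
    # grammatical-function property: occurrences of an attribute such as "SUBJ" or "ADJUNCT"
    n = 0
    for key, value in fstructure.items():
        if key == attr:
            n += 1
        if isinstance(value, dict):
            n += count_fstructure_attribute(value, attr)
    return n

properties = [
    lambda parse: count_cstructure_nodes(parse["cstructure"], "NP"),
    lambda parse: count_fstructure_attribute(parse["fstructure"], "ADJUNCT"),
]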
5.3 Empirical Evaluation
The two corpora provided to us by Xerox PARC contain appointment planning dialogs (Verbmobil corpus, henceforth VM-corpus), and a documentation of Xerox printers (Homecentre corpus, henceforth HC-corpus). The basic properties of the corpora are summarized in Table 5. The corpora consist of a packed representation of the c- and f-structures of parses produced for the sentences by a LFG grammar. The LFG parses have been produced automatically by the XLE system (see Maxwell and Kaplan (1996)) but corrected manually in addition.
Furthermore, it is indicated for each sentence which of its parses is the linguistically correct one. The ambiguity of the sentences in the corpus is 10 parses on average.

Table 5. Properties of the corpora used for the estimation experiment

                                            VM-corpus   HC-corpus
number of sentences                            540         980
number of ambiguous sentences                  314         481
number of parses of ambiguous sentences       3245        3169
In order to cope with the small size of the corpora, a 10-way cross-validation framework has been used for estimation and evaluation. That is, the sentences of each corpus were assigned randomly into 10 approximately equal-sized subcorpora. In each run, 9 of the subcorpora served as training corpus, and one subcorpus as test corpus. The evaluation scores presented in Tables 6 and 7 are sums over the the evaluation scores gathered by using each subcorpus in turn as test corpus and training on the 9 remaining subcorpora. We used two evaluation measures on the test corpus. The first measure Ctest (λ) gives the precision of disambiguation based on most probable parses. That is, Ctest (λ) counts the percentage of sentences in the test corpus whose most probable parse according to a model pλ is the manually determined correct parse. If a sentence has k most probable parses and one of these parses is the correct one, this sentence gets score 1/k. The second evaluation measure is −P Ltest (λ), the negative log-pseudo-likelihood for the correct parses of the test corpus given their yields. This metric measures how much of the probability mass the model puts onto the correct analyses. In the empirical evaluation, the maximum pseudo-likelihood estimator is compared against a baseline estimator which treats all parses as equally likely. Furthermore, another objective function is considered: The function CX˜ (λ) is the number of times the highest weighted parse under λ is the manually determined correct parse in the training corpus X˜ . This function directly encodes the criterion which is used in the linguistic evaluation. However, CX˜ (λ) is a highly discontinuous function in λ and hard to maximize. Experiments using a simulated annealing optimization procedure (Press et al., 1992) for this objective function showed that the computational difficulty of this procedure grows and the quality of the solutions degrades rapidly with the number of properties employed in the model. The results of the empirical evaluation are shown in Tables 6 and 7. The maximum pseudo-likelihood estimator performed superior to both the simulated annealing estimator and the uniform baseline estimator on both corpora. The simulated annealing procedure typically scores better than the maximum pseudo-likelihood approach if the number of properties is very small. However, the pseudo-likelihood approach outperforms simulated annealing already for a
property-size of 200 as used in our experiment. Furthermore it should be noted that the absolute numbers of 59 % precision on the disambiguation task have to be assessed relative to a number of on average 10 parses per sentence.

Table 6. Empirical evaluation of estimators on Ctest (precision of disambiguation with most probable parse) and −PLtest (negative log-pseudo-likelihood of correct parses in test corpus) on VM-corpus

                                       Ctest for VM-corpus   −PLtest for VM-corpus
uniform baseline estimator                    9.7 %                  533
simulated annealing estimator                53.7 %                  469
maximum pseudo-likelihood estimator          58.7 %                  396
Table 7. Empirical evaluation of estimators on HC-corpus

                                       Ctest for HC-corpus   −PLtest for HC-corpus
uniform baseline estimator                   15.2 %                  655
simulated annealing estimator                53.2 %                  604
maximum pseudo-likelihood estimator          58.8 %                  583
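To spell out the disambiguation metric behind these tables, a minimal sketch of Ctest is given below (the corpus representation is assumed for illustration); a sentence whose k highest-probability parses include the correct one contributes 1/k, matching the tie-handling described above.

def c_test(test_corpus, prob):
    # test_corpus: list of (parses, correct_parse); prob maps a parse to its model probability
    score = 0.0
    for parses, correct in test_corpus:
        best = max(prob(x) for x in parses)
        top = [x for x in parses if prob(x) == best]
        if correct in top:
            score += 1.0 / len(top)
    return 100.0 * score / len(test_corpus)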
6
Conclusion
In this paper we have presented the theory and practice of probabilistic modeling of constraint-based grammars with log-linear distributions. We reported on an experiment with a log-linear grammar model which is based on weights assigned to linguistically motivated properties of the parses. This possibility to define arbitrary features of parses as properties of the probability model and to estimate appropriate weights for them permits the probabilistic modeling of arbitrary context-dependencies in constraint-based grammars. Moreover, our incomplete-data inference algorithm is applicable to log-linear probability distributions in general, and thus is useful in other incomplete-data settings as well. Furthermore, in contrast to related approaches to probabilistic constraint-based grammars which require fully annotated corpora for estimation, our statistical inference algorithm provides general means for automatic and reusable training of arbitrary probabilistic constraint-based grammars from unannotated corpora. Current and future research is dedicated to experiments with more expressive log-linear models on larger scales of unannotated data. Recent work addressed a "lexicalized" extension of log-linear grammar models by adding properties which correspond to lexical-semantic head-head relations (see Johnson and Riezler (2000)). Current work is done on training of such linguistically motivated log-linear models on large amounts of unannotated data (see Riezler et al. (2000)).
In future work we also want to empirically evaluate the approach to property selection described above and the dynamic-programming algorithms for efficient parsing and searching in probabilistic constraint-based grammars presented in Riezler (1999).

Acknowledgements. This work is based on parts of my PhD thesis which was conducted at the Graduiertenkolleg ILS of the Deutsche Forschungsgemeinschaft at the University of Tübingen. The presented experiments were carried out during a stay at Brown University in summer 1998 in cooperation with Mark Johnson, Stuart Geman, Steven Canon, and Zhiyi Chi. I would like to thank my PhD supervisors, Steven Abney, Erhard Hinrichs, Uwe Mönnich, and Mats Rooth, for their extended support; the people at Brown, especially Mark Johnson, for initiating and realizing various joint experiments; my colleagues at the IMS, Detlef Prescher and Helmut Schmid, for their support of my work on log-linear grammar models in Stuttgart.
References 1. Abney, S. (1997). Stochastic attribute-value grammars. Computational Linguistics, 23 (4), 597–618. 2. Alshawi, H., & Carter, D. (1994). Training and scaling preference functions for disambiguation. Computational Linguistics, 20 (4), 635–648. 3. Baker, J. (1979). Trainable grammars for speech recognition. In Klatt, D., & Wolf, J. (Eds.), Speech Communication Papers for the 97th Meeting of the Acoustical Society of America, pp. 547–550. 4. Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41 (1), 164–171. 5. Berger, A. L., Della Pietra, V. J., & Della Pietra, S. A. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22 (1), 39–71. 6. Bod, R., & Kaplan, R. (1998). A probabilistic corpus-driven model for lexical-functional analysis. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics (COLING/ACL 98) Montreal. 7. Brew, C. (1995). Stochastic HPSG. In Proceedings of the 7th Meeting of the European Chapter of the Association for Computational Linguistics (EACL’95) Dublin. 8. Briscoe, T., & Carroll, J. (1993). Generalized probabilistic LR parsing of natural language (corpora) with unification-based grammars. Computational Linguistics, 19 (1), 25–59. 9. Briscoe, T., & Waegner, N. (1992). Robust stochastic parsing using the inside-outside algorithm. In Proceedings of the Workshop on Probabilistically-Based Natural Language Processing Techniques (AAAI’92) San Jose, CA.
10. Csisz´ar, I. (1989). A geometric interpretation of Darroch and Ratcliff’s generalized iterative scaling. The Annals of Statistics, 17 (3), 1409–1413. 11. Cussens, J. (1999). Loglinear models for first-order probabilistic reasoning. In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI’99) Stockholm. 12. Darroch, J., & Ratcliff, D. (1972). Generalized iterative scaling for loglinear models. The Annals of Mathematical Statistics, 43 (5), 1470–1480. 13. Dehaspe, L. (1997). Maximum entropy modeling with clausal constraints. In Proceedings of the 7th International Workshop on Inductive Logic Programming. 14. Della Pietra, S., Della Pietra, V., & Lafferty, J. (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19 (4), 380–393. 15. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39 (B), 1–38. 16. Eisele, A. (1994). Towards probabilistic extensions of constraint-based grammars. In D¨ orre, J. (Ed.), Computational Aspects of Constraint-Based Linguistic Description II, pp. 3–21. DYANA-2 Deliverable R1.2.B. 17. Fletcher, R. (1987). Practical Methods of Optimization. Wiley, New York. 18. Goodman, J. (1998). Parsing Inside-Out. Ph.D. thesis, Computer Science Group, Harvard University, Cambridge, MA. 19. Hobbs, J. R., & Bear, J. (1995). Two principles of parse preference. In Zampolli, A., Calzolari, N., & Palmer, M. (Eds.), Linguistica Computazionale: Current Issues in Computational Linguistics. In Honour of Don Walker. Kluwer, Dortrecht. 20. H¨ ohfeld, M., & Smolka, G. (1988). Definite relations over constraint languages. LILOG report 53, IBM Deutschland, Stuttgart. 21. Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review, 106, 620–630. 22. Johnson, M., Geman, S., Canon, S., Chi, Z., & Riezler, S. (1999). Estimators for stochastic “unification-based” grammars. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL’99) College Park, MD. 23. Johnson, M., & Riezler, S. (2000). Exploiting auxiliary distributions in stochastic unification-based grammars. In Proceedings of the 1st Conference of the North American Chapter of the Association for Computational Linguistics (ANLP-NAACL 2000) Seattle, WA. 24. Lloyd, J. W. (1987). Foundations of Logic Programming. Springer, Berlin. 25. Magerman, D. M. (1994). Natural Language Parsing as Statistical Pattern Recognition. Ph.D. thesis, Department of Computer Science, Stanford University. 26. Maxwell, J., & Kaplan, R. (1996). Unification-based parsers that automatically take advantage of context freeness. Unpublished manuscript, Xerox Palo Alto Research Center.
27. Miyata, T. (1996). A Study on Inference Control in Natural Language Processing. Ph.D. thesis, Graduate School of the University of Tokyo, Tokyo, Japan. 28. Osborne, M., & Briscoe, T. (1997). Learning stochastic categorial grammars. In Proceedings of the Workshop on Computational Natural Language Learning (CoNLL’97) Madrid. 29. Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, New York. 30. Ratnaparkhi, A. (1998). Maximum Entropy Models for Natural Language Ambiguity Resolution. Ph.D. thesis, University of Pennsylvania, Philadelphia, PA. 31. Riezler, S. (1997). Probabilistic constraint logic programming. Arbeitsberichte des sonderforschungsbereich 340, Bericht Nr. 117, Seminar f¨ ur Sprachwissenschaft, Universit¨ at T¨ ubingen. 32. Riezler, S. (1998). Statistical inference and probabilistic modeling for constraint-based NLP. In Proceedings of the 4th Conference on Natural Language Processing (KONVENS’98) Bonn. 33. Riezler, S. (1999). Probabilistic Constraint Logic Programming. Ph.D. thesis, Seminar f¨ ur Sprachwissenschaft, Universit¨ at T¨ ubingen. AIMS Report, 5(1), IMS, Universit¨ at Stuttgart. 34. Riezler, S., Prescher, D., Kuhn, J., & Johnson, M. (2000). Lexicalized stochastic modeling of constraint-based grammars using log-linear measures and EM training. Unpublished Manuscript. Institut f¨ ur Maschinelle Sprachverarbeitung, Universit¨ at Stuttgart. 35. Srinivas, B., Doran, C., & Kulick, S. (1995). Heuristics and parse ranking. In Proceedings of the Forth International Workshop on Parsing Technologies (IWPT 95). 36. Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. The Annals of Statistics, 11 (1), 95–103.
Unsupervised Lexical Learning with Categorial Grammars Using the LLL Corpus Stephen Watkinson and Suresh Manandhar Department of Computer Science, University of York, Hesslington, York YO10 5DD, UK
[email protected],
[email protected]
Abstract. In this paper we report on an unsupervised approach to learning Categorial Grammar (CG) lexicons. The learner is provided with a set of possible lexical CG categories, the forward and backward application rules of CG and unmarked positive only corpora. Using the categories and rules, the sentences from the corpus are probabilistically parsed. The parses of this example and the set of parses of earlier examples in the corpus are used to build a lexicon and annotate the corpus. We report the results from experiments on two generated corpora and also on the more complicated LLL corpus, that contains examples from subsets of English syntax. These show that the system is able to generate reasonable lexicons and provide accurately parsed corpora in the process. We also discuss ways in which the approach can be scaled up to deal with larger and more diverse corpora.
1
Introduction
In this paper we discuss a potential solution to two problems in Natural Language Processing (NLP), using a combination of statistical and symbolic machine learning techniques. The first problem is learning the syntactic roles, or categories, of words of a language i.e. learning a lexicon. Secondly, we discuss a method of annotating a corpus of example sentences with the most likely parses. The aim is to learn Categorial Grammar (CG) lexicons, starting from a set of lexical categories, the functional application rules of CG and an unannotated corpus of positive examples. The CG formalism (discussed in Section 2) is chosen because it assigns distinct categories to words of different types, and the categories describe the exact syntactic role each word can play in a sentence, which is used by the learner to constrain assignment.
The problem under discussion is similar to the unsupervised part of speech tagging work of, for example, Brill (1997) and Kupiec (1992), as the learner is seeking to attach tags (albeit tags containing a great deal of syntactic detail) to words. In Brill's work a lexicon containing the parts of speech available to each word is provided and a simple tagger attaches a complex tag to each word in the corpus, which represents all the possible tags that word can have. Transformation rules are then learned which use the context of a word to determine which simple
tag it should be assigned. The results are good, generally achieving around 95% accuracy on large corpora such as the Penn Treebank. Kupiec (1992) uses an unsupervised version of the Baum-Welch algorithm, which is a way of using examples to iteratively estimate the probabilities of a Hidden Markov Model for part of speech tagging. Instead of supplying a lexicon, he places the words in equivalence classes. Words in the same equivalence class must take one of a specific set of parts-of-speech. This improves the accuracy of this algorithm to about the same level as Brill’s approach. In both cases, the learner is provided with a large amount of background knowledge – either a complete lexicon or a set of equivalence classes. Hence, they have been either completely or largely given the set of categories that a word can map to and they learn the probability of the occurrence of a particular category in a particular context. Our problem is more complex, because either no lexicon, or at most a small partial lexicon, is given to bootstrap the learning process. Other than this all words can take any category. Hence, while our learner must determine the probability of the occurrence of a particular category in a particular context, it must also determine what set of categories each word can map to. The second problem – finding the most likely parses – is solved because of the approach we use to learn the lexicon. The system parses examples and uses the lexical assignments determined by the parse to build the lexicon. The corpus is annotated with the parses (also providing less probable parses if desired). This means that the lexicon building and the parse annotation depend on each other, as the lexicon must be used to parse and the parses are used to build the lexicon. An example of another approach to annotating a corpus with the most likely parse is the Fidditch parser of Hindle (1983) (based on the deterministic parser of Marcus (1980)), which was used to annotate the Penn Treebank (Marcus, Santorini, & Marcinkiewicz, 1993). However, instead of learning the lexicon, a complete grammar and lexicon must be supplied to the Fidditch parser. Our work also relates to CG induction. Osborne (1997) has an algorithm that learns a grammar for sequences of part-of-speech tags from a tagged corpus, using the Minimum Description Length (MDL) principle – a well-defined form of compression. While this is a supervised setting of the problem, the use of the more formal approach to compression is of interest for future work. The learning problem is also simpler, as the CG developed is not to cover the set of example sentences, but the set of part-of-speech tags attached to those sentences. This removes some of the possible ambiguity. However, results of 97% coverage are impressive. Kanazawa (1994) and Buszkowski (1987) use a unification based approach with a corpus annotated with semantic structure, which in CG is a strong indicator of the syntactic structure. Unfortunately, they do not present results of experiments on natural language corpora and again the approach is essentially supervised. Two unsupervised approaches to learning CGs are presented by Adriaans (1992) and Solomon (1991). Adriaans, describes a purely symbolic method that
uses the context of words to define their category. An oracle is required for the learner to test its hypotheses, thus providing negative evidence. This would seem to be awkward from an engineering viewpoint, i.e. how one could provide an oracle to achieve this, and implausible from a psychological viewpoint, as humans do not seem to receive such evidence (Pinker, 1990). Unfortunately, again no results on natural language corpora seem to be available. Solomon’s approach (Solomon, 1991) uses unannotated corpora, to build lexicons for CG. He uses a simple corpora of sentences from children’s books, with a slightly ad hoc and non-incremental, heuristic approach to developing categories for words. The results show that a wide range of categories can be learned, but the current algorithm, as the author admits, is probably too naive to scale up to working on full corpora. No results on the coverage of the CGs learned are provided. The main difference of the approach presented here is that we seek to use a learning setting similar to that of children learning language (Pinker, 1990). This means that learning is unsupervised and that no negative evidence is allowed. It would also suggest that there is no need to structure examples, or provide simple examples. However, the above work on CG induction is particularly interesting with respect to category induction. Currently, our learner is provided with a full set of possible categories, but in the future, some of the above methods may be useful for automatically generating categories. The rest of the paper describes our system and gives details of the experiments already completed. Section 2 describes the Categorial Grammar formalism. In Section 3 we discuss our learner. In Section 4 we describe experiments on three corpora containing examples of a subset of English syntax and Section 5 contains the results, which are encouraging with respect to both problems. Finally, in Section 6, we discuss ways the system can be expanded and larger scale experiments may be carried out.
2
Categorial Grammar
Categorial Grammar (CG) (Wood, 1993; Steedman, 1993) provides a functional approach to lexicalised grammar, and so can be thought of as defining a syntactic calculus. Below we describe the basic (AB) CG, although in future it will be necessary to pursue a more flexible version of the formalism. There is a set of atomic categories in CG, which are usually nouns (n), noun phrases (np) and sentences (s). It is then possible to build up complex categories using the two slash operators "/" and "\". If A and B are categories then A/B is a category and A\B is a category, where A is the resulting category when B, the argument category, is found. The direction of the "slash" functors indicates the position of the argument in the sentence i.e. a "/" indicates that a word or phrase with the category of the argument should immediately follow in the sentence. The "\" is the same except that the word or phrase with the argument category should immediately precede the word or phrase with this category. This is most easily seen with examples.
Table 1. The categories available to the learner

Syntactic Role               CG Category                      Example
Sentence                     s                                the dog ran
Noun                         n                                dog
Noun Phrase                  np                               the dog
Intransitive Verb            s\np                             ran
Transitive Verb              (s\np)/np                        kicked
Ditransitive Verb            ((s\np)/np)/np                   gave
Sentential Complement Verb   (s\np)/s                         believe
Determiner                   np/n                             the
Adjective                    n/n                              hungry
Auxiliary Verb               (s\np)/(s\np)                    does
Preposition                  (n\n)/np, ((s\np)\(s\np))/np     in
Suppose we consider an intransitive verb like "run". The only category that is required to complete the sentence is a subject noun phrase. Hence, in Steedman's notation, the category of "run" can be considered to be a sentence that is missing a preceding noun phrase i.e. s\np. Similarly, with a transitive verb like "ate", the verb requires a subject noun phrase. However, it also requires an object noun phrase, which is attached first. The category for "ate" is therefore (s\np)/np. With basic CG there are just two rules for combining categories: the forward (FA) and backward (BA) functional application rules. Following Steedman's notation (Steedman, 1993) these are:

X/Y  Y  ⇒  X    (FA)
Y  X\Y  ⇒  X    (BA)
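As an informal illustration (the system described later in this paper is implemented in Prolog; the Python rendering and the tuple encoding of categories below are choices made here for readability), the two application rules and the derivation of Fig. 1 can be sketched as:

def forward(left, right):
    # FA: X/Y  Y  =>  X
    if isinstance(left, tuple) and left[1] == '/' and left[2] == right:
        return left[0]
    return None

def backward(left, right):
    # BA: Y  X\Y  =>  X
    if isinstance(right, tuple) and right[1] == '\\' and right[2] == left:
        return right[0]
    return None

NP, N, S = 'np', 'n', 's'
lexicon = {
    'John':  NP,
    'ate':   ((S, '\\', NP), '/', NP),   # (s\np)/np
    'the':   (NP, '/', N),               # np/n
    'apple': N,
}

the_apple = forward(lexicon['the'], lexicon['apple'])    # np
ate_the_apple = forward(lexicon['ate'], the_apple)       # s\np
sentence = backward(lexicon['John'], ate_the_apple)      # s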
Table 1 shows a wide range of examples of the kinds of categories that can be used in CG. In Figure 1 the parse derivation for "John ate the apple" is presented. Here we can see the similarity with Context Free Phrase Structure Grammars (CFPSG). The tree in Figure 2 shows the analysis which could be completed using the CFPSG in Table 2. Each time either the FA or the BA rule is used with a pair of categories it is equivalent to the application of one of the binary CFPSG rules.

Table 2. An Example CFPSG

S → NP VP      PN → John
NP → PN        VT → ate
NP → DT N      DT → the
VP → VT NP     N → apple
The CG described above has been shown to be weakly equivalent to context-free phrase structure grammars (Bar-Hillel, Gaifman, & Shamir, 1964). While
[Fig. 1 (derivation diagram omitted): John := np, ate := (s\np)/np, the := np/n, apple := n; "the apple" combines to np by FA, "ate the apple" to s\np by FA, and "John ate the apple" to s by BA.]
Fig. 1. An Example Parse in Basic CG

[Fig. 2 (tree diagram omitted): S dominates NP and VP; NP → PN → John; VP → VT NP with VT → ate; NP → DT N with DT → the and N → apple.]
Fig. 2. An Example Parse in CFPSG
such expressive power covers a large amount of natural language structure, it has been suggested that a more flexible and expressive formalism may capture natural language more accurately (Wood, 1993; Steedman, 1993). This has led to some distinct branches of research into usefully extending CG (see Wood (1993) for an overview). Examples of extended formalisms are Steedman’s Combinatory Categorial Grammar (CCG) (Steedman, 1993), which captures certain linguistic constructs, such as coordination, elegantly. An alternative method of extension, is to add features to the categories thus gaining the flexibility of unification grammars such as Definite Clause Grammars (DCGs). Uszkoreit (1986), for example, developed Categorial Unification Grammar (CUG) to marry the categorial and unification approaches. CG has at least the following advantages for our task. – Learning the lexicon and the grammar is one task. – The syntax directly corresponds to the semantics. The first of these is vital for the work presented here. Because the syntactic structure is defined by the complex categories assigned to the words, it is not necessary to have separate learning procedures for the lexicon and for the grammar rules. Instead, it is just one procedure for learning the lexical assignments to words. Secondly, the syntactic structure in CG parallels the semantic structure, which allows an elegant interaction between the two. While this feature of CG
is not used in the current system, it could be used in the future to add semantic background knowledge to aid the learner (e.g. Buszkowski’s discovery procedures (Buszkowski, 1987)).
3
The Learner
The system we have developed for learning lexicons and assigning parses to unannotated sentences is shown diagrammatically in Figure 3. In the following sections we explain the learning setting and the learning procedure respectively.
[Fig. 3 (diagram omitted); components shown: Example Corpus, Categories & Rules, Probabilistic Parser, N most probable parses, Parse Selector, Parsed Examples, Current Lexicon, Lexicon Modifier.]
Fig. 3. A Diagram of the Structure of the Learner
3.1
The Learning Setting
The input to the learner has five parts:
1. the corpus,
2. the lexicon,
3. the CG rules,
4. the set of legal categories,
5. a probabilistic parser.
These are discussed in turn below.
The Corpus. The corpus is a set of unannotated positive examples represented in Prolog as facts containing a list of words e.g. ex([mary,loved,a,computer]). The Lexicon. It is the lexicon that the learner induces and so initially it is either empty, or, in some experiments, there is a small lexicon supplied to bootstrap the learning process. Lexicons are stored by the learner as a set of Prolog facts of the form: lex(Word, Category, Frequency). where Word is a word, Category is a Prolog representation of the CG category assigned to that word and Frequency is the number of times this category has been assigned to this word up to the current point in the learning process. The Rules. The CG functional application rules (see Section 2) are supplied to the learner. Extra rules may be added in the future for fuller grammatical coverage. The Categories. The learner has a complete set of the categories that can be assigned to a word in the lexicon. The complete set is shown in Table 1, although not all of these were used in all experiments. The Parser. The system employs a probabilistic chart parser, which calculates the N most probable parses, where N is the beam set by the user. A chart parser uses a data structure called a chart on which to store each application of a rule in a particular context. When a parse fails and a new parse includes some of the same structure, it does not have to be recalculated, because it is already stored on the chart. Such parsers are renowned for being efficient when working with natural language. Gazdar and Mellish (1989) describe the approach in more detail. The probability of a word being assigned a category is based on the relative frequency, which is calculated from the current lexicon. This probability is smoothed simply (for words that have not been given fixed categories prior to execution) to allow the possibility that the word may appear as other categories. For all categories from the complete set provided for which the word has not appeared, the word is given a frequency of one. This is particularly useful for new words, as it allows the category of a word to be determined by its context. When a new word occurs all the categories in the set given the learner are assigned an equal probability of occurring with that word. This allows even sentences made up of completely new words (e.g. the first example of the corpus) to be parsed, although the parsed will be of a low probability. Each non-lexical edge (i.e. a derivation in the chart that is not simply the denoting of a word by a category) in the chart has a probability calculated by multiplying the probabilities of the two edges that are combined to form it. A candidate edge between two vertices is not added if there are N edges labelled with the same category and a higher probability between the same two vertices.
If there are N such edges already, but one or more has a lower probability than the candidate edge, then the edge with the lowest probability is removed and the candidate edge is added. This constraint leads to an algorithm similar to the Viterbi algorithm (Charniak, 1993), although the parser calculates the N most likely parses rather than simply the most likely parse. For efficiency, a candidate edge is not added between vertices if there is an edge already in place with a much higher probability. This does not retain the completeness of the algorithm in the sense that it is now possible (if unlikely) that the most probable parse will not be found, or that no parse will be found when one does exist. This is because it is possible that edges that are not added, because they have very low probability, may in fact be involved in the parses required. However, the improvement in efficiency is likely to be more important to performance than this removal of completeness. The chart in Figure 4 shows examples of edges that would not be added. The top half of the chart shows one parse and the bottom half another. If N (the beam) was set to 1 then the dashed edge spanning all the vertices would not be added, as it has a lower probability than the other s edge covering the same vertices. Similarly, the dashed edge between the first and third vertices would not be added, as the probability of the n is so much lower than the probability of the np.

[Fig. 4 (chart diagram omitted): over "the man ran", the top half shows edges np/n (0.8), n (0.8), s\np (0.8), np (0.64) and s (0.512); the bottom half shows n/n (0.001), np (0.1), (s\np)/np (0.09), n (0.0008), np (0.1), s\np (0.009) and s (0.0009).]
Fig. 4. Example chart showing edge pruning
It is important that the parser is efficient, as it is used on every example and each word in an example may be assigned any category. As will be seen it is also used extensively in selecting the best parses. We are currently investigating improvements to the parsing algorithm, and have recently developed an alternative parser with approximately the same features that processes examples much more quickly. It is a chart parser based upon the probabilistic CKY algorithm given by Collins (1999), which has been adapted to be used with Categorial
Grammars. This algorithm searches for the most probable parse, so we have extended it to search for the N most probable parses, where N is the beam set by the user, as before. Such an algorithm is ideal for Categorial Grammars, as it is specifically designed for grammars that generate exclusively binary branching trees. Although tests with this parser are at an early stage, in due course, it should allow us to present experiments on much larger and more syntactically complex corpora, e.g. the Penn Treebank.
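The sketch below illustrates the core of such a CKY-style computation for basic CG, finding the best-probability category for each span (extending each chart cell to keep the N best analyses follows the same pattern). It is a simplified illustration in Python, not the authors' Prolog or Scheme implementation, and the lexicon format is an assumption.

from collections import defaultdict

def combine(a, b):
    # FA: X/Y Y => X ; BA: Y X\Y => X  (categories are atoms or (result, slash, argument) tuples)
    if isinstance(a, tuple) and a[1] == '/' and a[2] == b:
        return a[0]
    if isinstance(b, tuple) and b[1] == '\\' and b[2] == a:
        return b[0]
    return None

def cky_best(words, lexicon):
    # chart[(i, j)] maps a category to the best probability of deriving it over words[i:j]
    n = len(words)
    chart = defaultdict(dict)
    for i, w in enumerate(words):
        for cat, p in lexicon[w].items():                  # smoothed lexical probabilities
            chart[(i, i + 1)][cat] = max(p, chart[(i, i + 1)].get(cat, 0.0))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for ca, pa in chart[(i, k)].items():
                    for cb, pb in chart[(k, j)].items():
                        c = combine(ca, cb)
                        if c is not None:
                            chart[(i, j)][c] = max(chart[(i, j)].get(c, 0.0), pa * pb)
    return chart[(0, n)].get('s', 0.0)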
3.2 The Learning Procedure
Having described the various components with which the learner is provided, we now describe how they are used in the learning procedure. Parsing the Examples. Examples are taken from the corpus one at a time and parsed. Each example is stored with the group of N (the beam) parses generated for it, so they can be efficiently accessed in future. The parse that is selected (see below) as the current correct parse is maintained at the head of this group. The head parse contributes information to the lexicon and annotates the corpus. The other parses are also used extensively for the efficiency of the parse selection module, as will be described below. When the parser fails to find an analysis of an example, perhaps because it is ungrammatical, or because of the incompleteness of the coverage of the grammar, the system moves to the next example. The Parse Selector. Once an example has been parsed, the N most probable parses are considered in turn to determine which can be used to compress the lexicon (by a given measure), following the compression as learning approach of, for example, Wolff (1987). In this context, compressing the lexicon means returning the smallest lexicon that covers all the examples seen thus far on the basis of the size calculations made by the system. In some cases this will necessarily be a larger lexicon than at a previous stage of the process. The aim is to keep this increase to a minimum and thus the lexicon is compressed. The current size measure for the lexicon is the sum of the sizes of the categories for each lexical entry for a word. The size of a category is the number of atomic categories within it. In other words, for each mapping between a word and a category in the lexicon, the size of that category is added to the total score for the lexicon. It should be noted that this means that there is a preference for small categories, which seems reasonable (although we intend to compare this with other measures in the future, e.g. simply counting the number of entries in the lexicon) and may be particularly useful if we begin to generate categories. It may also be of value to point out that compression here is of the lexicon, and not the corpus, as the frequencies of the words are not included in the calculation. Again this seems reasonable, as it is the lexicon that will be stored in the long term. It is not enough to look at what a parse would add to the lexicon. Changing the lexicon may change the results given by the parser for previous examples. Changes in the frequency of assignments can cause the probabilities of previous
parses to change. This can correct mistakes made earlier when the evidence from the lexicon was too weak to assign the correct parse. Such correction is achieved by reparsing previous examples that may be affected by the changed lexicon, i.e. those previous examples that contain words, which also appear in the current example and, which have changed probability. These examples may now have a different most probable parse. Not reparsing those examples that will not be affected saves a great deal of time. For each hypothesised parse of the current example, a temporary lexicon is built from the lexical assignments in the head parses for the selectively reparsed previous examples. The hypothesised parse leading to the most compressive of these temporary lexicons is chosen. The amount of reparsing is also reduced by using stored parse information. While this significantly improves efficiency, there is again a possibility that the most probable parse will be missed and so the most compressive lexicon. This seems to have little effect on the quality of the results and does significantly improve the efficiency of the learner. This may appear an expensive way of determining which parse to select, but it enables the system to compress the lexicon and keep an up-to-date annotation for the corpus. Also, the chart parser works in polynomial time and it is possible to do significant pruning, as outlined, so few sentences need to be reparsed each time. However, in the future we will look at ways of determining which parse to select that do not require complete reparsing. Lexicon Modification. The final stage takes the current lexicon and replaces it with the lexicon built with the selected parse. The whole process is repeated until all the examples have been parsed. The final lexicon is left after the final modification. The most probable annotation of the corpus is the set of top-most parses after the final parse selection.
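To illustrate the compression criterion used by the parse selector (the size of a lexicon is the sum, over its word-category entries, of the number of atomic categories in each category), here is a small sketch; the data structures are assumptions made for illustration, not the Prolog representation used by the system.

def category_size(cat):
    # number of atomic categories, e.g. (s\np)/np has size 3
    if isinstance(cat, tuple):                 # (result, slash, argument)
        return category_size(cat[0]) + category_size(cat[2])
    return 1

def lexicon_size(lexicon):
    # lexicon: mapping word -> set of CG categories assigned so far
    return sum(category_size(cat) for cats in lexicon.values() for cat in cats)

def most_compressive(candidate_lexicons):
    # choose the hypothesised parse whose resulting temporary lexicon is smallest
    return min(candidate_lexicons, key=lexicon_size)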
4
Experiments
Experiments were performed on three different corpora all containing only positive examples. We also tested performance with and without a partial lexicon of closed-class words. Some categories are considered to have a finite and reasonably small number of words as members e.g. determiners and prepositions. Words in these categories are called closed-class words. In some experiments, the learner was supplied with a lexicon of closed-class words that had fixed categories and probabilities. All experiments were carried out on a SGI Origin 2000. Experiments on Corpus 1. The first corpus was built from a context-free grammar (CFG), using a simple random generation algorithm. The CFG (shown in Table 3) covers a range of simple declarative sentences with intransitive, transitive and ditransitive verbs and with adjectives. The lexicon of the CFG contained 39 words with an example of noun-verb ambiguity. The corpus consisted of 500 such sentences (Table 4 shows examples). As the size of the lexicon was small and there was only a small amount of ambiguity, it was unnecessary to supply the partial lexicon, but the experiment was carried out for comparison. We
Table 3. The CFG used to generate Corpus 1 with example lexical entries

S → NP VP           VP → Vbar
Vbar → IV           Vbar → TV NP
Vbar → DV NP NP     NP → PN
NP → Nbar           Nbar → Det N
N → Adj N           Det → the
PN → john           Adj → small
N → boy             TV → timed
IV → ran
DV → gave
also performed an experiment on 100 unseen examples to see how accurately they were parsed with the learned lexicon. The results were manually verified to determine how many sentences were parsed correctly.

Table 4. Examples from Corpus 1

ex([mary, ran]).
ex([john, gave, john, a, boy]).
ex([a, dog, called, the, fish, a, small, ugly, desk]).
Experiments on Corpus 2. The second corpus was generated in the same way, but using extra rules (see Table 5) to include prepositions, thus making the fragment of English more complicated. The lexicon used for generating the corpus was larger – 44 words in total. Again 500 examples were generated (see Table 6 for examples) and experiments were carried out both with and without the partial lexicon. Again we performed an experiment on 100 unseen examples to see how accurately they are parsed.

Table 5. The extra rules required for generating Corpus 2 with example lexical entries

NP → Nbar PP    PP → P NP
VP → Vbar PP
P → on
Experiments on Corpus 3. We also performed experiments using the LLL corpus (Kazakov, Pulman, & Muggleton, 1998). This is a corpus of generated sentences for a substantial fragment of English syntax. It is annotated with a certain
Table 6. Examples from Corpus 2

ex([the, fish, with, a, elephant, gave, banks, a, dog, with, a, bigger, statue]).
ex([a, elephant, with, jim, walked, on, a, desk]).
ex([the, girl, kissed, the, computer, on, a, fish]).
amount of semantic information, which was ignored. The corpus contains 554 sentences, including a lot of movement (e.g. topicalized and question sentences). Examples are shown in Table 7. While our CG can handle a reasonable variety of declarative sentences it is by no means complete, not allowing any movement. This is a limitation in the current approach. Also, this corpus is small and sparse, making learning difficult. Initially experiments were performed using only the 157 declarative sentences (895 words, with 152 unique words) in the LLL corpus. These experiments and those on corpora 1 and 2 have also been presented elsewhere (Watkinson & Manandhar, 1999). However, we also decided to see how well the system performed with the full corpus. Experiments on the full corpus were performed with closed-class words included only, as the experiments on the declarative sentences suggested that they would be necessary, due to the sparseness of the corpus. We present results for parsing unseen examples for the full corpus, but not for the declarative sentence corpus, as it was too small to perform a useful experiment. Table 7. Examples from Corpus 3 ex([which, rough, reports, above, hilary, wrote, hilary, in, sandy, beside, which, telephone]). ex([inside, no, secretary, wont, all, small, machines, stop]). ex([which, old, report, disappears]).
All experiments were performed with the minimum number of categories needed to cover the corpus, so for example, in the experiments on Corpus 1 the categories for prepositions were not available to the parser. This will obviously affect the speed with which the learner performs. Also, the parser was restricted to two possible parses in each case, i.e. the beam width, N , was set to two.
5
Results
In Table 8, we report the results of these experiments. The CCW Preset column indicates whether the closed-class words were provided or not. The lexicon accuracy column is a measure, calculated by manual analysis, of the percentage of lexical entries i.e. entries that have word-category pairs that can plausibly be accepted as existing in English. This should be taken together with the parse
accuracy, which is the percentage of correctly parsed examples i.e. examples with a linguistically correct syntactic analysis. Corpus 3a is the corpus of declarative sentences from the LLL corpus, whereas corpus 3b refers to the whole LLL corpus. The results for the first two corpora are extremely encouraging with 100%

Table 8. Accuracies and timings for the different learning experiments

Corpus   CCW Preset   Lexicon Acc. (%)   Parse Acc. (%)   Exec. Time (s)
1        ×            100                100              5297
1        √            100                100              625
2        √            100                100              10524
2        ×            ×                  ×                ×
3a       ×            14.7               0.6              164151
3a       √            77.7               58.9             361
3b       √            73.2               28.5             182284
accuracy in both measures. While these experiments are on relatively simple corpora, these results strongly suggest the approach can be effective. Note that any experiment on corpus 2 without the closed-class words being set did not terminate in a reasonable time, as the sentences in that corpus are significantly longer and each word may be a large number of categories. It is therefore clear that setting the closed-class words greatly increases speed and that we need to consider methods of relieving the strain on the parser if the approach is to be useful on more complex corpora. The results with the LLL corpus are also encouraging in part. A lexical accuracy of 77.7% on the declarative sentences and 73.2% on the whole corpus is a good result, especially considering that the grammar is not designed to cover sentences containing movement and there are many of these in this corpus. This is due to the learner parsing the majority of the phrases within examples correctly, with incorrect analysis of the parts that the grammars does not cover, e.g. the question word at the start of a sentence. Clearly, a grammar with greater coverage could provide improved results. The results for parse accuracy, both in the training and test sets, do not suggest that the system is very successful in providing parses. However, they actually do not show the full picture. Practically all the sentences are mostly parsed correctly. Very few of the errors could actually be handled correctly by the grammar with which the system was supplied. In terms of less stringent measures of parse accuracy e.g. bracket crossing, the performance would be much higher. This is indicated by the much higher parse accuracy measure for the declarative sentences, 58.9%. On this set of sentences, which are much more likely to be covered by the CG categories, the learner performs nearly 30% better. Obviously, in the longer term, a grammar with wider coverage would possibly provide the desired results for this more stringent measure.
It should be noted that the poor results obtained for Corpus 3a without using the closed-class words are due to the severity of the sparse data problem within this corpus. Table 9 shows predictably good results for parsing the test sets with the learned lexicons. The timings presented give some idea of the scale of the experiments conducted. As the complexity of the problem increases it is clear that the time taken by the learner becomes significant. It may be interesting to note for future work that an efficient Scheme implementation of the probabilistic CKY algorithm discussed in Section 3.1 has replaced the Prolog chart parser. This is currently enabling us to perform experiments on the Penn Treebank (Marcus et al., 1993), a much larger corpus, within a reasonable time.

Table 9. Unseen example parsing accuracy

Corpus   Closed-Class   Parse Accuracy (%)
1        ×              100
1        √              100
2        ×              100
2        √              100
3        √              37.2

6
Conclusions
We have presented an unsupervised learner that is able to both learn CG lexicons and annotate natural language corpora, with less background knowledge than other systems in the literature. Results from preliminary experiments are encouraging with respect to both problems, particularly as the system appears to be reasonably effective on small, sparse corpora. It is encouraging that where errors arose this was often due only to incomplete background knowledge. The results presented are encouraging with respect to the work that has already been mentioned - 100% can clearly not be improved upon and compares very favourable with the systems mentioned in the introduction. However, it is also clear that this was achieved on unrealistically simple corpora and when the system was used on the more diverse LLL corpus it did not fair as well. However, given the fact that the problem setting discussed here is somewhat harder than that attempted by other systems and the lack of linguistic background knowledge supplied, it is hoped that it will be possible to use the approach with some simple changes to greatly improve the performance on the LLL corpus and other larger coverage corpora. In particular, it will obviously be necessary to include coverage of the movement phenomena in natural language in the CG supplied. The use of CGs to solve the problem provides an elegant way of using syntactic information to constrain the learning problem and provides the opportunity for expansion to a full grammar learning system in the future by the development of a category hypothesiser. This will generate all the complex categories
from the set of atomic categories (s, n, np) and so replace the need for the full set of categories to be supplied to the learner. It is hoped that this will be part of future work. We also hope to carry out experiments on larger and more diverse corpora, as the corpora used thus far are too small to be an exacting test for the approach. We need to expand the grammar to cover more linguistic phenomena to achieve this, as well as considering other measures for compressing the lexicon (e.g. using an MDL-based approach). Currently we are working on expanding the set of CG categories to be suitable for learning using the Penn Treebank (Marcus et al., 1993). Larger experiments will lead to a need for increased efficiency in the parsing and reparsing processes. This could be done by considering deterministic parsing approaches (Marcus, 1980), or perhaps shallower syntactic analysis. As already mentioned we have developed a more efficient parser based on a similar approach as that described in this paper and we hope in the future to be able to present results of the performance of the system on the Penn Treebank. While many extensions may be considered for this work, the evidence thus far suggests that the approach outlined in this paper is effective and efficient for these natural language learning tasks. Acknowledgements Stephen Watkinson is supported by the EPSRC (Engineering and Physical Sciences Research Council).
References 1. Adriaans, P. W. (1992). Language Learning from a Categorial Perspective. Ph.D. thesis, Universiteit van Amsterdam. 2. Bar-Hillel, Y., Gaifman, C., & Shamir, E. (1964). On categorial and phrase structure grammars. In Language and Information, pp. 99–115. AddisonWesley. First appeared in The Bulletin of the Research Council of Israel, vol. 9F, pp. 1-16, 1960. 3. Brill, E. (1997). Unsupervised learning of disambiguation rules for part of speech tagging. In Natural Language Processing Using Very Large Corpora. Kluwer Academic Press. 4. Buszkowski, W. (1987). Discovery procedures for categorial grammars. In Klein, E., & van Benthem, J. (Eds.), Categories, Polymorphism and Unification, pp. 35–64. Centre for Cognitive Science, University of Edinburgh & Institue for Language, Logic and Information, University of Amsterdam. 5. Charniak, E. (1993). Statistical Language Learning. The MIT Press, Cambridge, Massachusetts. 6. Collins, M. (1999). Head-driven statistical models for natural language parsing. Ph.D. thesis, Computer & Information Science, University of Pennsylvania. 7. Gazdar, G., & Mellish, C. (1989). Natural Language Processing in Prolog: An Introduction to Computational Linguistics. Adison-Wesley.
8. Hindle, D. (1983). Deterministic parsing of syntactic non-fluencies. In Marcus, M. (Ed.), Proceedings of the 21st Annual Meeting of the Association for Computational Linguistics, pp. 123–128. ACL. 9. Kanazawa, M. (1994). Learnable Classes of Categorial Grammars. Ph.D. thesis, Institute for Logic, Language and Computation, University of Amsterdam. 10. Kazakov, D., Pulman, S., & Muggleton, S. (1998). The FraCas dataset and the LLL challenge. Tech. rep., SRI International. 11. Kupiec, J. (1992). Robust part-of-speech tagging using a hidden markov model. Computer Speech and Language, 6, 225–242. 12. Marcus, M. P. (1980). A Theory of Syntactic Recognition. The MIT Press Series in Artificial Intelligence. The MIT Press. 13. Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of english: The penn treebank. Tech. rep. IRCS93-47, Institution for Research in Cognitive Science. 14. Osborne, M. (1997). Minimisation, indifference and statistical language learning. In Workshop on Empirical Learning of Natural Language Processing Tasks, ECML’97, pp. 113–124. 15. Pinker, S. (1990). Language acquisition. In Oshershon, D. N., & Lasnik, H. (Eds.), An Invitation to Cognitive Science: Language, Vol. 1, pp. 199–241. The MIT Press. 16. Solomon, W. D. (1991). Learning a grammar. Tech. rep. UMCS-AI-91-2-1, Department of Computer Science, Artificial Intelligence Group, University of Manchester. 17. Steedman, M. (1993). Categorial grammar. Lingua, 90, 221–258. 18. Uszkoreit, H. (1986). Categorial unification grammars. Technical report CSLI-86-66, Center for the Study of Language and Information, Stanford University, Stanford, CA. 19. Watkinson, S., & Manandhar, S. (1999). Unsupervised lexical learning with categorial grammars. In Stolcke, A., & Kelher, A. (Eds.), Proceeding of the Workshop in Unsupervised Learning in Natural Language Processing. 20. Wolff, J. (1987). Cognitive development as optimisation. In Bolc, L. (Ed.), Computational Models of Learning, Symbolic computation-artificial intelligence. Springer Verlag. 21. Wood, M. M. (1993). Categorial Grammars. Linguistic Theory Guides. Routledge. General Editor Richard Hudson.
Induction of Recursive Transfer Rules
Henrik Boström
Department of Computer and Systems Sciences, Stockholm University and Royal Institute of Technology, Electrum 230, 164 40 Kista, Sweden
[email protected]
Abstract. Transfer rules are used in bi-lingual translation systems for transferring a logical representation of a source language sentence into a logical representation of the corresponding target language sentence. This work studies induction of transfer rules from examples of corresponding pairs of source-target quasi logical formulae (QLFs). The main features of this problem are: i) more than one rule may need to be produced from a single example, ii) only positive examples are provided and iii) the produced hypothesis should be recursive. In an earlier study of this problem, a system was proposed in which hand-coded heuristics were employed for identifying non-recursive correspondences. In this work we study the case when non-recursive transfer rules have been given to the system instead of heuristics. Results from a preliminary experiment with English-French QLFs are presented, demonstrating that this information is sufficient for the generation of generally applicable rules that can be used for transfer between previously unseen source and target QLFs. However, the experiment also shows that the system suffers from producing overly specific rules, even when the problem of disallowing the derivation of other target QLFs than the correct one is not considered. Potential approaches to this problem are discussed.
1 Introduction
In transfer-based translation, source language input is first analysed, resulting in a logical representation of the (preferred) meaning of the input utterance. Next, the source logical formula is transferred into a logical representation in the target language. Finally, the logical formula thus obtained is used for generating the target text. Each language is processed with its specific morphology, grammar, lexicon, etc., with the transfer being the only bridge between them. The Spoken Language Translator (SLT) (Agnäs et al., 1994) is a transfer-based translation system based on the Core Language Engine (CLE) (Alshawi, 1992) for mapping between natural language phrases and (quasi) logical formulas (QLFs). It uses so-called transfer rules for the mapping between source and target QLFs. A transfer rule specifies a pair of logical form patterns, where the first pattern represents a form in one language and the second pattern represents a
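Schematically, this pipeline can be rendered as a three-stage logic program. The sketch below is purely illustrative: the predicate names and the toy facts are invented stand-ins and do not correspond to actual SLT components.

    % Illustrative stubs only; real analysis, transfer and generation are far more complex.
    analyse(source_sentence, source_qlf).     % source-language morphology, grammar, lexicon
    transfer(source_qlf, target_qlf).         % the transfer rules discussed in this paper
    generate(target_qlf, target_sentence).    % target-language morphology, grammar, lexicon

    translate(Sentence, Translation) :-
        analyse(Sentence, SourceQLF),
        transfer(SourceQLF, TargetQLF),
        generate(TargetQLF, Translation).

    % ?- translate(source_sentence, T).
    % T = target_sentence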
form in the other language. The patterns can include so-called transfer variables showing the recursive correspondence between parts of the matching logical forms. Many transfer rules are only responsible for transferring word senses, e.g., trule(leave_Depart,partir_Leave), while others, the so-called structural transfer rules, are more complex, such as the following rule, which is taken from Milward and Pulman (1997) (the reader is assumed to be familiar with logic programming terminology (Lloyd, 1987) and standard Edinburgh syntax for logic programs (Clocksin & Mellish, 1981)):

    trule([stop_ComeToRest,A,S1],
          [faire_Make,A,S2,term(q(_,bare,sing),_,X^[escale_Stop,X])]) :-
        trule(S1,S2).

The above rule corresponds to a mapping from the English word stop to the French phrase faire une escale. It allows a QLF that is an instance of the first argument to be transferred into a QLF that is an instance of the second argument, given that the sub-term of the first QLF that corresponds to the variable S1 can be recursively transferred into a term corresponding to S2. The last element in the second list corresponds to a singular noun phrase, where escale_Stop represents the meaning of the noun constituent. The reader is referred to Alshawi (1992) for further details about the QLF syntax and semantics, but it should be noted that no further knowledge about QLFs is necessary in order to follow our presentation, since it is sufficient to regard the QLFs merely as complex terms in the following. Transfer rules are applied recursively, and this process follows the recursive structure of the source QLF. Normally, the transfer rules are hand-crafted through inspection of a set of non-transferable QLF pairs. Their creation is a tedious and time-consuming task. The main problem addressed in this work is how to use inductive logic programming (ILP) techniques for automatic derivation of transfer rules from examples of corresponding QLF pairs, such as the following (corresponding to the sentences "List the prices" and "Indiquez les tarifs" respectively):

    qlf_pair([imp,
              form(_,verb(no,no,no,imp,y),A,
                   B^[B,[list_Enumerate,A,
                         term(_,ref(pro,you,_,l([])),_,C^[personal,C],_,_),
                         term(_,q(_,bare,plur),_,D^[fare_Price,D],_,_)]],_)],
             [imp,
              form(_,verb(impera,no,no,impera,y),E,
                   F^[F,[indiquer_Show,E,
                         term(_,ref(pro,vous,_,l([])),G,H^[personal,H],_,_),
                         term(_,ref(def,le,plur,l([G-_])),_,I^[tarif_Fares,I],_,_)]],_)]).
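To make the recursive reading of such a rule concrete, here is a small sketch of how it would fire; the word-sense fact and the constant e1 below are invented purely for illustration and are not part of the actual SLT rule set.

    % Hypothetical word-sense rule standing in for the transfer of the sub-term S1.
    trule(suddenly_Abruptly, soudainement_Abruptly).

    % With the structural rule above also loaded:
    % ?- trule([stop_ComeToRest, e1, suddenly_Abruptly], Target).
    % Target = [faire_Make, e1, soudainement_Abruptly,
    %           term(q(_,bare,sing), _, X^[escale_Stop,X])]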
The main features of this problem are: – more than one rule may need to be produced from a single example, – only positive examples are provided, and – the produced hypothesis should be recursive. The first problem is significant as most ILP systems (e.g., Progol (Muggleton, 1995) and FOIL (Quinlan, 1990)) produce at most one clause per example. We have developed one approach to this problem, a system called TRL (Transfer Rule Learner) (Bostr¨ om & Zemke, 1996), which works in three steps: 1. It identifies non-recursive correspondences between the source and target QLFs. 2. It generates a set of clauses which are as general as possible for each example. 3. It specialises the clauses so that for each source QLF exactly one target QLF can be generated (cf. output completeness (Mooney & Califf, 1996)). In previous work, the system relied on elaborated, hand-coded heuristics for the first step. In this work we assume no such heuristics, but instead assume that the non-recursive transfer rules are known, and hence the task is to induce the recursive rules (initial work on inducing non-recursive transfer rules is described in Milward and Pulman (1997)). In the next section, we present the basic algorithm used in TRL for solving this task. In Section 3, we present results from a preliminary experiment on learning transfer rules from English-French QLF pairs and point out some major difficulties that are revealed by this experiment. In Section 4 we give some concluding remarks and discuss some possible directions for future work.
2 The Transfer Rule Learner
2.1 The Algorithm
Given a set of pairs of input-output terms (QLF pairs), the rule generating component of TRL produces a set of clauses such that for each input term, (a variant of) the corresponding output term can be derived. The problem of excluding the derivation of other terms than the output term is assumed to be handled later by the rule specialisation component. The rule generating component is based on the following assumptions: – there is some way to find a bijection from sub-terms in each input term to sub-terms in the corresponding output term that corresponds to all subterms that should be non-recursively transferred, – only one (recursive) predicate is needed in the produced hypothesis, and – both arguments in the head of each clause produced must be compound (this ensures that the hypothesis terminates for all input).
Given an example pair, the objective of the rule generating component is to find rules that will transfer the source term into the target, and that these rules are as general as possible in order to cover as many similar cases as possible. Hence, the corresponding terms should be distributed over several rules rather than a single rule, since a single rule would only allow the given example pair to be covered, while each rule in a set of generated rules is applicable to transferring (sub-)terms of QLF pairs, thus allowing other pairs to be transferred as well. Since it is computationally infeasible to decide how to distribute a pair of terms over rules such that a most general hypothesis is obtained, a greedy strategy is adopted. Rules are generated in a top-down fashion (i.e., in the same order as they will be applied when deriving the target from the source). Each generated rule is a most general rule for the corresponding input-output pair, i.e., a rule with the most general head such that when instantiated with the input-output pair, the literals in the body correspond to input-output pairs for which transfer rules can be generated, i.e., there are no variable connections to other literals or to the head (except for literals containing just a pair of transfer variables, which should have connections with the head only). It should be noted that there is always a unique clause with this property. In Table 1 we show the rule generating component of TRL, which, given an input-output pair in which sub-terms that should be non-recursively transferred have been replaced by transfer variables, generates a set of transfer rules that is as general as possible while still being able to recreate (a variant of) the output term from the input term. The function, called Find-Transfer-Rules, takes as input the input-output pair, denoted f(s1, ..., sm) and g(t1, ..., tn), a set of transfer variable pairs V and an (initially empty) set of transfer rules H that has been produced prior to the call.
2.2 An Example
Assume the following is given as input to the algorithm Find-Transfer-Rules: the input term s(f(V1,V2),g(V3)), the output term t(h(W3),W2,W1), the set of transfer variable pairs V = {{V1,W1}, {V2,W2}, {V3,W3}} and the initial hypothesis H = ∅. Then the initialisation steps result in the following: R = trule(s(X1,X2),t(Y1,Y2,Y3)). θS = {X1/f(V1,V2), X2/g(V3)} θT = {Y1/h(W3), Y2/W2, Y3/W1} Since the transfer variables V1 and V2 that occur in the first term in θS correspond to transfer variables (W1 and W2) that occur in two different terms in θT , it follows from the first if-statement that X1 is substituted by a term f(X3,X4) in R, and that the terms X3/V1 and X4/V2 replace the first term in θS : R = trule(s(f(X3,X4),X2),t(Y1,Y2,Y3)). θS = {X3/V1, X4/V2, X2/g(V3)} θT = {Y1/h(W3), Y2/W2, Y3/W1}
Table 1. The function for finding recursive transfer rules.

function Find-Transfer-Rules(f(s1, ..., sm), g(t1, ..., tn), V, H)
  R := {trule(f(x1, ..., xm), g(y1, ..., yn))}
  θS := {x1/s1, ..., xm/sm} and θT := {y1/t1, ..., yn/tn}
  repeat
    if there is a term s = xi/fi(u1, ..., uk) ∈ θS (resp. θT), such that some variable
        in u1, ..., uk occurs in R or θS (resp. θT) \ {s}, or there are two distinct terms
        t1 and t2 in θT (resp. θS), where t1 contains w1 and t2 contains w2 such that
        {{v1, w1}, {v2, w2}} ⊆ V for some variables v1 and v2 in s
    then θS (resp. θT) := θS (resp. θT) \ {s} ∪ {z1/u1, ..., zk/uk}
         R := R{xi/fi(z1, ..., zk)}
    else if there is a pair (s, t), where s = xi/ui ∈ θS, t = yj/vj ∈ θT, and {ui, vj} ∈ V
    then R := R ∪ {← trule(xi, yj)}, θS := θS \ {s}, and θT := θT \ {t}
    else if there is a pair (s, t), where s = xi/si ∈ θS, t = yj/tj ∈ θT, si is compound
        and contains ui, tj is compound and contains vj, where {ui, vj} ∈ V
    then R := R ∪ {← trule(xi, yj)}, θS := θS \ {s}, and θT := θT \ {t}
         H := Find-Transfer-Rules(si, tj, V, H)
    else R := RθSθT and θS := θT := ∅
  until θS = θT = ∅
  return H ∪ {R}
Now, according to the second if-statement, two recursive calls are added to R in turn, subtracting X3/V1 from θS and Y3/W1 from θT after having added the first recursive call, and subtracting X4/V2 from θS and Y2/W2 from θT after having added the second recursive call: R = trule(s(f(X3,X4),X2),t(Y1,Y2,Y3)):- trule(X3,Y3), trule(X4,Y2). θS = {X2/g(V3)} θT = {Y1/h(W3)} Following the third if-statement, a third recursive call trule(X2,Y1) is added to R, resulting in the following rule that is included in H: trule(s(f(X3,X4),X2),t(Y1,Y2,Y3)):trule(X3,Y3), trule(X4,Y2), trule(X2,Y1). The algorithm Find-Transfer-Rules is then invoked recursively with the subterms g(V3) and h(W3) as input, resulting in the following rule that also is included in H: trule(g(X),h(Y)):- trule(X,Y).
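As a sanity check, the two generated rules can be run directly in Prolog. The ground facts below are purely illustrative stand-ins for the non-recursive transfers abstracted by the transfer variable pairs {V1,W1}, {V2,W2} and {V3,W3}; any such facts would do.

    % Hypothetical non-recursive transfers standing in for the transfer variables.
    trule(a1, b1).    % plays the role of the pair {V1,W1}
    trule(a2, b2).    % plays the role of the pair {V2,W2}
    trule(a3, b3).    % plays the role of the pair {V3,W3}

    % Together with the two rules generated above:
    % ?- trule(s(f(a1,a2),g(a3)), Target).
    % Target = t(h(b3), b2, b1)
    % i.e. the structure of the output term t(h(W3),W2,W1) is recreated, with each Wi
    % obtained by recursively transferring the corresponding Vi.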
3 An Experiment
3.1 Experimental Data
540 English-French QLF pairs have been obtained from SRI-Cambridge, where the corresponding sentences range from (a few) single word utterances to quite complex sentences, e.g., 'Find all flights leaving Boston that departs before ten o'clock in the morning'. The QLF pairs were formed by running the SLT system, which has an accuracy of over 95% on the ATIS 2 corpus. Just one target QLF was generated for each source QLF. The system uses statistical methods to choose the best QLF which is both a good French sentence and a good translation of the original (according to weighted transfer rules). In addition, the set of non-recursive transfer rules used by SLT was obtained from SRI as well, consisting of 1155 rules. These were used in the following way. For each QLF pair, a bijection from a set of sub-terms in the source to a set of sub-terms in the target was formed using the non-recursive rules, and the corresponding sub-terms were replaced by transfer variables. Note that not all pairs of sub-terms for which there is a matching non-recursive rule can be replaced by transfer variables, due to the fact that the same sub-term may go into several sub-terms on the opposite side and to the fact that variable connections might get lost (i.e., some variable occurs both inside and outside the sub-term); in these cases the sub-terms were left unchanged. In the current experiment, the induced rules were forced to be of the following form: either the functor and arity of the source and target QLF should be identical, or the source should be a list with two elements and the target should have the functor form and arity 5, or vice versa. This turned out to work better than allowing any form of the rules.
3.2 Experimental Results
Subsets of the set of pairs in which transfer variables have been introduced were given as input to TRL (the specialisation step described in Boström and Zemke (1996) was omitted); 40 pairs were left out for testing. The rules produced by the system were then tried on the test set, and it was checked whether the first target QLF produced for each (test) source QLF was a variant of the correct (test) target QLF (the rules were tested in the same order as they were generated). It was also checked whether the correct target QLF could be produced at all. The performance was compared to just storing the pairs that were given as input (which are still more general than the original set of QLF pairs since, e.g., lexical items have been replaced by transfer variables). Average results from running the experiment 10 times are summarised in Table 2. The number of examples given as input to TRL is shown in the first column. The coverage, measured as the fraction of the test set for which the target QLF can be generated at all, is shown in the second column. The third column shows the fraction of the test set for which the first target QLF generated
is correct. The fourth and fifth columns show the number of rules generated and the cpu time in seconds, respectively (TRL was implemented in SICStus Prolog v. 3 and was executed on a SUN Ultra 60). The last column shows the coverage obtained from just storing the pairs that were given as input.

Table 2. Average results from 10 iterations.

    No. ex.   Cov.   1st cov.   No. clauses   Time (s.)   No. rec.
         10   0.06       0.06          35.2         7.1       0.02
         50   0.12       0.10         106.0        41.8       0.08
        100   0.18       0.16         170.7        80.7       0.13
        200   0.27       0.23         277.1       163.9       0.18
        300   0.34       0.30         361.6       251.3       0.22
        400   0.40       0.35         437.2       337.2       0.25
        500   0.45       0.39         508.3       424.3       0.27
It can be seen that although TRL in fact tries to generate rules that are as general as possible, it does not suffer significantly from producing too non-determinate rules, as indicated by the relatively small difference between coverage and 1st coverage. The results rather indicate that TRL suffers from being overly specific. Some explanations for this problem are given in the next section.
3.3 Comments
The rules that are produced are of varying complexity. Below, the initial sequence of rules produced in one of the sessions is shown (they have been rewritten in a form which should be more readable; the variables in the actual rules produced are instantiated with terms of the form $var(N) and there are also recursive calls to trule/2 instead of transfer variables):

    trule([tr(1)|tr(2)], [tr(1)|tr(2)]).
    trule([tr(1)], [tr(1)]).
    trule(form(_,verb(pres,no,no,no,y),A,
               B^[B,form(_,tr(1),_,C^[C,v(A)|tr(2)],_),[tr(3),A|tr(4)]],_),
          form(_,verb(pres,no,no,no,y),D,
               E^[E,form(_,tr(1),_,F^[F,v(D)|tr(2)],_),[tr(3),D|tr(4)]],_)).
    ...
trule([term(_,tr(1),A,tr(2),_,_), term(_,ord(ref(def,the,sing,l([A-_])), Bˆ[cheap_NotExpensive,B],order,’N’(’1’),sing),C, Dˆ[and,[and,[tr(3),D],[’one way_TravellingThereOnly’,D]], form(_,tr(4),_, Eˆ[E,form(_,prep(to),_,Fˆ[F,C|tr(5)],_), form(_,prep(from),_,Gˆ[G,C|tr(6)],_)],_)],_,_)], [term(_,tr(1),_,tr(2),_,_), term(_,ord(ref(def,le,sing,l([])), Hˆ[cher_Expensive,H],reverse_order,’N’(’1’),sing),I, Jˆ[and,[and,[tr(3),J],[aller_simple_OneWay,J]], form(_,tr(4),_, Kˆ[K,form(_,prep(implicit_to),_,Lˆ[L,I|tr(5)],_), form(_,prep(implicit_from),_,Lˆ[L,I|tr(6)],_)],_)],_,_)]). Comment: The non-recursive transfer rule: trule([’one way_TravellingThereOnly’,A],[aller_simple_OneWay,A]). has not been applied to the QLF-pair from which the above recursive rule stems, as the variables D and J occur outside the sub-terms that unify with the arguments of the non-recursive rule. As a consequence of this, the terms ’one way_TravellingThereOnly’ and aller_simple_OneWay have not been replaced by transfer variables, and thus appear in the recursive rules. trule(Aˆ[name_of,A|tr(1)], Bˆ[name_of,B|tr(1)]). ... trule(term(_,q(_,all,plur),A,Bˆ[and,[and,[tr(1),B], form(_,verb(no,no,yes,no,y),C, Dˆ[D,form(_,prep(for),_,Eˆ[E,v(C)|tr(2)],_), [leave_GoAwayFrom,C,v(A)|tr(3)]],_)], [island,form(_,verb(pres,no,no,no,y),F, Gˆ[G,form(_,tr(4),_,Hˆ[H,v(F)|tr(5)],_), [depart_Leave,F,v(A)]],_)]],_,_), term(_,q(_,tous_les,plur),I,Jˆ[and,[and,[tr(1),J], [island,form(_,verb(pres,no,no,no,y),K, Lˆ[L,form(_,conj(pp,implicit_and),_, Mˆ[M,form(_,prep(pour),_,Nˆ[N,v(K)|tr(2)],_), form(_,prep(de_Directional),_,Oˆ[O,v(K)|tr(3)],_)],_), [partir_Leave,K,v(I)]],_)]], [island,form(_,verb(pres,no,no,no,y),P, Qˆ[Q,form(_,tr(4),_,Rˆ[R,v(P)|tr(5)],_), [partir_Leave,P,v(I)]],_)]],_,_)). Comment: leave_GoAwayFrom has not been replaced by a transfer variable as it, according to the non-recursive rules, only can go into quitter_Leave,
which is not present. This causes depart_Leave to be left too as there are two occurrences of partir_Leave. It should be noted that the technique is heavily dependent on the initially introduced transfer variables - if these are not placed properly the resulting recursive rules will most likely be inaccurate. In the current system, the strategy for introducing the initial transfer variables is quite simple-minded. In particular, a sub-term that occurs more than once in a QLF is never replaced by a transfer variable. Much could be won if this restriction is relaxed. For example, one could use some elaborate heuristic for coupling non-unique sub-terms. It should however be noted that in some cases all couplings should be rejected.
4 Concluding Remarks
We have presented the application of a prototype system, called TRL, to the problem of inducing recursive transfer rules, given examples of corresponding QLF pairs and a set of non-recursive transfer rules. An initial experiment demonstrates that this information is sufficient for the generation of generally applicable rules that can be used for transfer between previously unseen source and target QLFs. However, the experiment demonstrates that the system suffers from producing overly specific rules, even when the problem of disallowing the derivation of other target QLFs than the correct one is not considered. This is mainly due to the inability of appropriately using the non-recursive rules for introducing transfer variables prior to the generation of the recursive rules. One immediate approach to this problem is to relax the conservative condition that a sub-term that occurs more than once may never be replaced by a transfer variable, e.g., by using some heuristic for selecting which sub-term in the source that should go into a particular sub-term in the target (and vice versa). When the problem of producing overly specific rules in the first phase of TRL has been overcome, there are several possibilities for handling the problem with having more than one candidate target QLF that can be generated for a particular source QLF. One is to specialise the program by introducing new predicate symbols, e.g., as in Bostr¨om (1998). Another possibility is to look at probabilistic extensions, such as stochastic logic programs (cf. Muggleton (1995) and Cussens (1999)), and choose the target QLF that is given the highest probability. Another approach to the problem of learning transfer rules from QLF-pairs is to reduce the complexity of the learning task by utilising the grammar rules and lexica that are used when generating the source and target QLFs (these were not available in this study). Since the source and target QLFs can be reconstructed given parse trees that refer only to identifiers of the grammar rules and lexical items, the transfer rule learning problem can be reduced to the problem of learning a mapping from such parse trees for source sentences into parse trees for target sentences. While still a challenging problem, the nondeterminism inherent in the task is significantly reduced by using this indirect
approach compared to inducing rules directly from the QLF pairs. This is the case since the number of sub-terms in (and hence possible mappings between) the parse trees is significantly less than the number of sub-terms in the corresponding QLFs. Once the above problems have been successfully solved, it could be interesting to evaluate the rules w.r.t. linguistic plausibility and compare them to handwritten rules, neither of which have been done at this early stage. Acknowledgements This work was supported by ESPRIT LTR project no. 20237 Inductive Logic Programming II and the Swedish Research Council for Engineering Sciences (TFR). The author would like to thank David Milward and Stephen Pulman at SRI-Cambridge for providing the data and for helpful discussions.
References 1. Alshawi H. (ed.) (1992). The core language engine, Cambridge, MA: MIT Press 2. Agn¨ as, M-S., Alshawi, H., Bretan, I., Carter, D., Ceder, K., Collins, M., Crouch, R., Digalakis, V., Ekholm, B., Gamb¨ ack, B., Kaja, J., Karlgren, J., Lyberg, B., Price, P., Pulman, S., Rayner, M., Samuelsson, C. & Svensson, T., (1994). Spokenlanguage translator: first-year report, SICS research report, ISRN SICS-R–94/03– SE, Stockholm 3. Bostr¨ om, H., (1998). Predicate Invention and Learning from Positive Examples Only. In Proceedings of the Tenth European Conference on Machine Learning, Berlin Heidelberg: Springer-Verlag (pp. 226-237) 4. Bostr¨ om, H., & Zemke, S., (1996). Learning transfer rules by inductive logic programming (preliminary report), Dept. of Computer and Systems Sciences, Stockholm University and Royal Institute of Technology 5. Clocksin, W.F., & Mellish, C.S., (1981). Programming in prolog, Springer Verlag 6. Cussens, J., (1999). Loglinear Models for First-Order Probabilistic Reasoning. In Proceedings of Uncertainty in Artificial Intelligence, San Francisco: Morgan Kaufmann (pp. 126-133) 7. Lloyd, J. W., (1987). Foundations of logic programming, (2nd edition), Berlin Heidelberg: Springer-Verlag 8. Milward, D., & Pulman, S., (1997). Transfer learning using QLFs, Technical report, SRI International, Cambridge, G.B. 9. Mooney, R. J., & Califf, M. E., (1996). Learning the past tense of English verbs using inductive logic programming. In S. Wermter, E. Riloff & G. Scheler (eds.), Symbolic, Connectionist and Statistical Approaches to Learning for Natural Language Processing, Berlin Heidelberg: Springer-Verlag (pp. 370-384) 10. Muggleton, S., (1995). Inverse Entailment and Progol. In New Generation Computing 13 245–286 11. Muggleton, S., (1995). Stochastic logic programs. In De Raedt L. (ed.), Advances Inductive Logic Programming, Amsterdam: IOS Press (pp. 254-264) 12. Quinlan, J. R., (1990). Learning Logical Definitions from Relations. In Machine Learning 5 239–266
Learning for Text Categorization and Information Extraction with ILP Markus Junker, Michael Sintek, and Matthias Rinck German Research Center for Artificial Intelligence (DFKI GmbH), P.O. Box 2080, 67608 Kaiserslautern, Germany {markus.junker,michael.sintek,matthias.rinck}@dfki.de
Abstract. Text Categorization (TC) and Information Extraction (IE) are two important goals of Natural Language Processing. While handcrafting rules for both tasks has a long tradition, learning approaches used to gain much interest in the past. Since in both tasks text as a sequence of words is of crucial importance, propositional learners have strong limitations. Although viewing learning for TC and IE as ILP problems is obvious, most approaches rather use proprietary formalisms. In the present paper we try to provide a solid basis for the application of ILP methods to these learning problems. We introduce three basic types (namely a type for text, one for words and one for positions in texts) and three simple predicate definitions over these types which enable us to write text categorization and information extraction rules as logic programs. Based on the proposed representation, we present an approach to the problem of learning rules for TC and IE in terms of ILP. We conclude the paper by comparing our approach of representing texts and rules as logic programs to others.
1 Introduction
Classifying texts into content-defined categories and extracting pieces of information from a document are important goals when dealing with documents. As a guiding example we use our efforts in office automation (Dengel & Hinkelmann, 1996; Dengel, Hoch, Hönes, Malburg, & Weigel, 1997). When a new business document comes in, one of the first problems is document categorization, i.e., the filing of documents into categories such as invoices, confirmation of order, etc. Subsequently, we are interested in extracting the process-relevant information from the document. In the case of an invoice we want to know the sender, the item to be paid for, the amount and similar information. This information can then be used to automate the document handling by triggering the right processes, e.g., by electronically remitting money to a certain bank account. There are many more applications for document categorization and information extraction. While the more traditional approaches work with hand-crafted categorization and extraction rules, there is increasing interest in applying learning approaches. The idea is that a learning system only has to be provided with positive and negative examples of a certain document category (or important
text fragments, respectively) and then learns a classifier (an extraction rule, respectively ) by itself. Since TC as well as IE rely on text given as a sequence of words. propositional learning approaches have strong limitations. In TC it is very common to work with attributes that correspond to single words. An attribute value indicates whether or how often a word occurs in a document. More complex attributes such as word phrases have to be determined prior to learning in a separate phase (Junker & Hoch, 1998). The extraction of useful complex attributes has to be done with care since the maximum number of attributes is often limited by practical reasons. In IE it is crucial to learn the boundaries of information to be extracted. For this reason it is impossible to rely on propositional learning approaches and people switch over to rather proprietary learning formalisms (for a survey see (Muslea, 1998; Soderland, 1999)). Capturing more complex document features in TC by an extraction phase prior to learning as well as proprietary learning approaches in IE are certainly acceptable. On the other hand, as we will show, ILP also provides the expressiveness to tackle both learning problems: – documents can be represented as they are: as a sequence of words which again consist of sequences of characters – rules for TC as well as for IE can be formulated as logic programs – background knowledge can easily be integrated using newly defined, textspecific predicates – learning strategies can be realized using text-specific refinement operators Clearly, there is no immediate need to use ILP for TC and IE since there is no obvious reason why ILP should provide any benefit in terms of efficiency and effectiveness. Nevertheless it is of great interest to see how far the well established techniques of ILP can reach within this domain. From the application perspective using ILP provides a sound and well-known framework for the formulation of the problems and its solution approaches. For these reasons we rely on ILP for our learning approaches to TC and IE. In the remainder of the paper we first briefly introduce common constructs of pattern languages used for text categorization and information extraction. Then we propose the basic types (like Word or Position) and predicate definitions used in mapping the functionalities of such pattern languages into logic programs. Having a means to formulate TC and IE rules as logic programs, the next section demonstrates how we apply standard ILP learning techniques to the problem of learning such programs.
2 Mapping Text Pattern Languages to Logic Programs
The tasks of learning rules for TC and IE are very closely related. In both cases, characteristic patterns for text fragments have to be found. Compared to text categorization, when learning rules for information extraction an additional problem arises. While in text categorization the boundary of the fragment is always given (which is the document to be classified), information extraction
rules have to locate the fragment boundaries by themselves. Nevertheless both tasks are very similar and thus also rely on very similar pattern languages. For this reason, it seems reasonable to treat both learning problems in a single learning framework. Text Pattern Languages There are some obvious constructs a pattern language suited for TC and IE should have, which include (for easier reference, we have marked the different constructs by A-E): A Testing for the occurrence of specific words (does the word “invoice” occur in the document to be classified?) B Tests for words occurring in some order and/or some restricted distance range. A special case of this kind of test is the test for word sequences. C Boolean combinations of tests. D Tests for properties of words or word sequences (e.g.: Is a “word” a number? Is it upper case? Does it have a specific syntactic word category, such as noun? Is a word/word sequence a noun phrase? Does it denote a person or company?) E Tests whether some patterns occur within some specified environment (e.g.: Does the pattern occur in one sentence? In the title?). These tests are particularly useful when dealing with HTML or XML documents. All of the above tests should be combinable in any reasonable way. For information extraction there must be some additional means to specify the extraction part(s) of a pattern. Types for Text Patterns Our mapping of text pattern languages relies on the representation of a text “w1 w2 . . . wn ” as a Word List or Text [w1 , w2 , . . . , wn ]. Texts in this sense can be used to represent a whole document, but they can also be used to describe information in the form of word sequences extracted from a document. A Word is also a special type and implemented as a list of characters. For easier readability, we do not write a word in the form of a list of characters separated by commas. Another type we need is Position which is used to locate single words within texts. Predicates for Text Patterns Our transformation of text patterns to logic programs relies on the predicates wordpos(T,W,P), fragment(T1,P1,P2,T2) and next(P1,Min,Max,P2): – The predicate wordpos(T,W,P) is used to describe that within text T the word W occurs at position P. Note that these texts are not necessarily documents, they can also describe some arbitrary word sequences within a document.
– The predicate fragment(T1,P1,P2,T2) provides locations of word sequences within a larger text. A location of a word sequence T2 is given by its starting position P1 and its ending position P2 within the larger text T1. Here too, the larger text is not necessarily a whole document. – The predicate next(P1,Min,Max,P2) is true iff the two positions P1 and P2 have a distance of at least Min words and at most Max words (i.e., Min≤P2-P1≤Max). For easier readability we also use the abbreviations next(P1,X,P2) and next(P1,P2) with next(P1,X,P2)=next(P1,X,X,P2) and next(P1,P2)=next(P1,1,P2).
Table 1. Basic predicates for text patterns

    predicate   argument types                            description
    wordpos     Text, Word, Position                      position of word within a text
    fragment    Text, Position, Position, Text            starting and ending position of a text
                                                          within another text
    next        Position, Position, Position, Position    states that a position follows another
                                                          position within a specific distance range
The predicates are summarized in Table 1. In the appendix we give a definition of the predicates in PROLOG. Example Text Patterns Using only the predicates wordpos, fragment, and next, we are already able to write complex categorization and extraction rules which map the constructs A, B, C, and partly E. For easier readability, we write wordpos(Doc, Word, Pos) as W ord@Doc : P os. The following four rules describe document categories: invoice(Doc) :- invoice@Doc:P. invoice(Doc) :- payment@Doc:P1, next(P1,P2), within@Doc:P2. offer(Doc) :- thank@Doc:P1, you@Doc:P2, for@Doc:P3, next(P1,P2), next(P2,1,3,P3).
The first rule assigns category invoice to a document if the word “invoice” occurs somewhere in the document. The second rule indicates that category invoice is also assigned if the word sequence “payment within” occurs. The next rule for the category offer tests the word sequence “thank you” and the word “for” with up to two words in between. This also allows the formulation “thank you very much for” and similar ones. The following categorization rule for inquiry also returns some text fragment within the input document:
inquiry(Doc, Interest) :- interested@Doc:P1, in@Doc:P2, next(P1,P2), next(P2,P3), fragment(Doc,P3,P3,Interest).
The rule tests on “interested in” and extracts the following word, which indicates the subject of interest. This capability of extracting word sequences from documents makes the basis for information extraction, as some more examples illustrate: payment(Doc, Days) :- within@Doc:P1, days@Doc:P3, next(P1,P2), next(P2,P3), fragment(Doc,P2,P2,Days). cash_discount(Doc, Percent) :cash@Doc:P1, discount@Doc:P2, of@Doc:P3, ’%’@Doc:P6, next(P1,P2), next(P2,P3), next(P3,P4), next(P5,P6), next(P4,1,4,P5), fragment(Doc,P4,P5, Percent). cash_discount(Doc, Days, Percent) :days@Doc:P2, ’%’@Doc:P5, next(P1,P2), next(P2,P3), next(P4,P5), next(P3,1,4,P4), fragment(Doc,P1,P1,Days), fragment(Doc,P3,P4,Percent).
The rule for payment searches for “within” and “days” with exactly one word in between and returns exactly this word. The next rule for cash discount tests the word sequence “cash discount” and the word “%” with a maximum distance of three words in between and it extracts these words. The second rule for cash discount illustrates that even so-called multi-slot rules can be expressed. The rule extracts corresponding pairs of days and discounts. More Complex Patterns It is very easy to extend the basic language given by the primitive wordpos, fragment, and next. For instance, it is possible to introduce a unary predicate which tests whether a “word” is a number or a binary predicate which tests whether two words have the same stem. Another example for a built-in predicate is one that tests for a minimum syntactic similarity of two words given by the Levenshtein distance (Sankoff & Kruskal, 1983). Even more elaborated things such as testing whether some or all words within a word sequence have a certain property is pretty simple. By introducing these built-ins, we can map the constructs D. To model word sequence features given by the structure of a document, it is possible to introduce predicates which show, for instance, whether some word occurs within the title. This is the simple solution we are currently pursuing to map the constructs E. Parallel to this we are working on implementing a more sound mapping of document structures by extending the notion of word positions.
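Before turning to learning, a small worked query may help to show how such rules behave operationally. This is a sketch that assumes the Prolog definitions of wordpos, fragment, and next given in the appendix are loaded together with the rules above; the toy documents are invented for illustration.

    % ?- offer([thank, you, very, much, for, your, inquiry]).
    % yes    ("thank" and "you" are adjacent, "for" follows "you" within distance 3)
    %
    % ?- invoice([payment, within, 20, days, net]).
    % yes    (second invoice rule: "payment" is directly followed by "within")
    %
    % ?- payment([payment, within, 20, days, net], Days).
    % Days = [20]    (the single word between "within" and "days" is extracted)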
3 The Learning Algorithm
The learning algorithm we implemented can be seen as a special ILP learner focusing on textual data. Textual data is supported by providing the three fundamental predicates described in the section before as built-ins. In addition to the basic predicates, many more built-ins for text handling are provided such as predicates which test word or word sequence properties or relations between words/word sequences. To achieve a reasonable efficiency when evaluating the built-ins, indexing structures borrowed from work in information retrieval are used. For instance, we map each word to all texts in which the word occurs using a specific hash function for natural language words. In a similar way, we also map words within documents to all positions at which they occur. We also have a data structure to access efficiently all words known to the system (i.e. words that occur in at least one text). Learning a Rule Set For learning TC and IE rules in the form of logic programs we have designed a learning algorithm which implements the widely used separate-and-conquer strategy (F¨ urnkranz, 1999): First a rule is learned that explains some of the positive examples, then the covered examples are separated, and the algorithm recursively conquers the remaining positive examples. This is repeated until no good rule for the remaining examples can be found. The set of negative examples remains unchanged during learning and is used in the evaluation of rule hypotheses. Learning a Rule The learning of a single rule relies on text specific refinement operators which can specialize as well as generalize a current rule. We will only introduce some of these operators as examples. We denote refinement operators with name (NAME) by a schema (NAME)
    body1
    ----------
    body2
meaning that a rule Head :- body1 can be refined to Head :- body2. The simplest refinement operator (K) adds a literal to a rule which directly tests the occurrence of some word word (the li denote literals): (K)
    l1, ..., ln
    -------------------------------
    l1, ..., ln, word@Doc:Q
Two more refinement operators, namely (SIL) and (SIR) are used to test word occurrences immediately to the left or to the right of some already given word. (SIL)
    l1, ..., word@Doc:Q, ..., ln
    ----------------------------------------------------------
    l1, ..., word'@Doc:P, next(P,1,1,Q), word@Doc:Q, ..., ln
(SIR)
    l1, ..., word@Doc:Q, ..., ln
    ----------------------------------------------------------
    l1, ..., word@Doc:Q, next(Q,1,1,P), word'@Doc:P, ..., ln
Note that in particular there are no refinement operators which replace position variables by constants. This reflects the heuristics that—at least in our experience—the absolute position of a word within a text fragment is not important. While the above refinement operators defined specializations of the current rule, we also have generalization operators. An example is the following generalization operator (SE), which increases the maximum distance required between two word occurrences: (SE)
    l1, ..., next(Q,1,x,P), ..., ln
    -----------------------------------
    l1, ..., next(Q,1,y,P), ..., ln
with y = x + 1
Another set of generalization operators replaces a test for some word occurrence with a test requiring just a word with some specific property (e.g., the property of being a number, or being upper case). Note that word is a word constant while Word is a variable. (GW_some_property)
    l1, ..., word@Doc:P, ..., ln
    -----------------------------------------------------
    l1, ..., Word@Doc:P, some_property(Word), ..., ln
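To give a feel for how these operators interact, the following is one possible refinement sequence for a categorization rule, written in the paper's rule notation. It is shown for illustration only; which refinements are actually tried is decided by the search strategy described next.

    offer+(Doc) :-                                          % initial rule with empty body
    offer+(Doc) :- thank@Doc:Q.                             % (K): test for the word "thank"
    offer+(Doc) :- thank@Doc:Q, next(Q,1,1,P), you@Doc:P.   % (SIR): "you" immediately to the right
    offer+(Doc) :- thank@Doc:Q, next(Q,1,2,P), you@Doc:P.   % (SE): allow up to one intervening word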
In the current implementation, there is no declarative language to control the application of the refinement operators. Search strategies are hard-wired in the learning algorithm. Depending on the state of the learner, beams with various refinement operators and of various widths are used to investigate the space of rule hypotheses. Implementing and experimenting with hard-wired strategies was a conscious decision, since we first wanted to explore which kind of expressiveness of a declarative strategy definition is needed for our purpose. To prevent overfitting, we use standard pre-pruning techniques such as the Laplace estimate as optimization criterion and the likelihood ratio statistics to prevent refinements which result in overfitted rules (Clark & Boswell, 1991). After having described the general features of our rule learner, we now turn to the concrete learning of text categorization and information extraction rules as we implemented it. Setting for Text Categorization For text categorization, the input to the learner is a set of positive and negative example documents for each category as shown in the following example (+ indicates a positive example, - a negative example, the dots ".." indicate more text):

    offer+([.. thank, you, very, much, for, you, inquiry, ..]).
    offer+([.. thank, you, for, your, letter, of, 10, February, ..]).
    offer-([.. payment, within, 20, days, ..]).
    offer-([.. we, are, interested, in, ..]).
Learning rules for text categorization is straightforward: In each conquer step we successively refine the initial rule "offer+(Doc) :-" by applying our refinement operators. Setting for Information Extraction In the case of information extraction, the positive examples are a set of text fragments correctly extracted from a larger piece of text, typically a whole document:

    cash_discount+([.. cash, discount, of, 2, %, ..],232,232,[2]).
    cash_discount+([.. cash, discount, of, 2, ., 5, %, ..],143,145,[2, ., 5]).
    cash_discount+([.. 2, %, for, cash, ..],198,198,[2]).
    cash_discount+([.. 3, ., 5, %, for, cash, ..],101,103,[3, ., 5]).
For instance the first positive example states that in the document [.. cash, discount, of, 2, %, ..] the text fragment starting with position 232 and ending with position 232 and containing the text [2] is to be extracted (cash, discount, of, 2 is located from position 229 to 232 in the full document [.. cash, discount, of, 2, %, ..]). In contrast to text categorization the negative examples are not given as ground facts. Instead, we provide a set of documents which are guaranteed to not contain any information to be extracted besides the information given by the positive examples. t-([.. we, offer, a, discount, of, 30, %, to, all, these, goods, ..]). t-([.. is, 20, %, faster, .., we, offer, a, cash, discount, of, 2, %, ..]).
Using these texts, we define the negative examples for cash discount by:

    cash_discount-(Doc,X,Y,F) :- t-(Doc), ¬∃ cash_discount+(Doc,X,Y,F).
With this definition every text fragment f within the provided document d starting at some position p1 and ending at some position p2 is a negative example unless it is explicitly stated as positive by cash discount+(d,p1,p2,f). The way we introduce the negative examples reflects the typical situation when collecting training material for an information extraction task: For a set of documents all information of any kind is extracted by hand. While these extracts serve as positive examples, the remaining text implicitly defines the negative information. We require information extraction rules to return the exact location and length of the interesting text fragments. As a heuristic, we assume that for determining the beginning of this information, either the word position just before the information or the first word position within the information has to be located. Similarly, for determining the end of the information, either the
last word of the information or the first word after the information has to be found. This heuristic results in four initial rules. Each of these rules is refined separately and the best rule found is the result of the respective conquer step:

    x+(Doc,Pos1,Pos2,Text) :- Word1@Doc:Pos1, Word2@Doc:Pos2,
        fragment(Doc,Pos1,Pos2,Text).
    x+(Doc,Pos1,Pos2,Text) :- Word1@Doc:Pos1', Word2@Doc:Pos2,
        next(Pos1',Pos1), fragment(Doc,Pos1,Pos2,Text).
    x+(Doc,Pos1,Pos2,Text) :- Word1@Doc:Pos1, Word2@Doc:Pos2',
        next(Pos2,Pos2'), fragment(Doc,Pos1,Pos2,Text).
    x+(Doc,Pos1,Pos2,Text) :- Word1@Doc:Pos1', Word2@Doc:Pos2',
        next(Pos1',Pos1), next(Pos2,Pos2'), fragment(Doc,Pos1,Pos2,Text).
It is important to note that in information extraction, we have additional refinement operators. These allow us to refine the initial extraction rules based on the occurring word variables: (WS)
    l1, ..., Word@Doc:Pos, ..., ln
    -----------------------------------
    l1, ..., word@Doc:Pos, ..., ln
(WSP_some_property)
    l1, ..., Word@Doc:Pos, ..., ln
    -------------------------------------------------------
    l1, ..., Word@Doc:Pos, some_property(Word), ..., ln
The first refinement operator (WS) specializes a rule body by replacing a word variable with a word constant, while the operator (WSP) specializes by adding constraints to a word variable.
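As an illustration of how these operators can specialize the initial rules above, the following sketch shows one possible refinement sequence; it is illustrative only, since the learner's beam search decides which refinements are actually explored.

    % Fourth initial rule: the fragment is delimited by the word just before it
    % and the word just after it.
    x+(Doc,Pos1,Pos2,Text) :- Word1@Doc:Pos1', Word2@Doc:Pos2',
        next(Pos1',Pos1), next(Pos2,Pos2'), fragment(Doc,Pos1,Pos2,Text).

    % (WS) instantiates Word1 to the constant "within":
    x+(Doc,Pos1,Pos2,Text) :- within@Doc:Pos1', Word2@Doc:Pos2',
        next(Pos1',Pos1), next(Pos2,Pos2'), fragment(Doc,Pos1,Pos2,Text).

    % (WS) instantiates Word2 to the constant "days":
    x+(Doc,Pos1,Pos2,Text) :- within@Doc:Pos1', days@Doc:Pos2',
        next(Pos1',Pos1), next(Pos2,Pos2'), fragment(Doc,Pos1,Pos2,Text).

The last rule extracts whatever stands between "within" and "days"; it closely resembles the hand-written payment rule of Section 2, which in addition restricts the fragment to a single word.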
4 Related Work
As already mentioned in the introduction, ILP has rarely been used for text categorization and information extraction in its “pure” form. In (Cohen, 1996) an ILP approach to the problem of document categorization is described. Cohen represents a document by a set of facts of the form wordi (doc,pos). The predicate wordi indicates that a word wordi occurs in a document doc at position pos with pos being a natural number. The main difference of this document representation to our own representation is that in Cohen’s approach words are predicates while we have our own type for words. This allows us to test for word properties without leaving 1st order logic. Cohen’s approach also allows for phrases. This is done by the predicates near1(p1 ,p2 ), near2(p1 ,p2 ), near3(p1 ,p2 ) which denote the maximum difference between two word positions. For instance, near1(p1 ,p2 ) is true if |p1 -p2 | ≤ 1. An additional binary
predicate after(p1 ,p2 ) is used to test whether a position p1 is before a position p2 , i.e., it is true if p2 >p1 holds. The problem of information extraction was not addressed by Cohen. In recent work Freitag (1998) proposes an ILP-like formalism for information extraction, called SRV. Freitag informally describes the examples as a set of annotated documents. Without going into the details of his rule language constructs, we would just like to give a rough impression of Freitag’s information extraction rules. The example shows a rule for extracting course numbers from a university’s web page: coursenumber :- length(= 2), every(in_title false), some(?A [] all_upper_case true), some(?B [] tripleton true).
The rule extracts every text fragment which satisfies the following conditions: the fragment contains two words (length(= 2)), no word within the title is part of the fragment (every(in_title false)), one word of the fragment consists only of upper-case characters (some(?A [] all_upper_case true)), and the other word of the fragment consists of three characters (some(?B [] tripleton true)). The relative positions of words in Freitag's approach are captured by two constructs. Using so-called relational paths, a position relative to the current one can be addressed. For instance, some(?A [prev_token prev_token] capitalized true) requires a word within a fragment which is preceded by a capitalized word two tokens back. Similar to our predicate next, another predicate relpos(Var1 Var2 Relop N) allows us to specify distances and ordering of word occurrences. The variables Var1 and Var2 denote word occurrences, Relop is a comparison operator and N is a certain natural number. Freitag did not address the problem of text categorization explicitly. As potential disadvantages of Freitag's approach we find that:
– Freitag's rules are not represented in a standard logic language.
– There are no variables whose bindings provide the document the information was found in, the position it was found at and the information itself.
– Only single value extraction rules can be formulated. This is because all predicates implicitly relate to one text fragment.
There are a number of other systems that learn information extraction rules and do not use any logic programming formalism, such as AutoSlog (Riloff, 1996), LIEP (Huffman, 1996), WHISK (Soderland, 1999), and RAPIER (Califf & Mooney, 1990) and (Thompson & Califf, this volume). More complete and detailed overviews are given in (Muslea, 1998) and in (Soderland, 1999). It was not the goal of the paper to compare the effectiveness of our learning algorithm to the effectiveness of other algorithms. Nevertheless, first experiments with the initial version of our rule learner indicate that results are comparable to those reported in the literature for standard problems. More precisely, in TC as well as in IE we obtained results which differed by no more than 3% from the effectiveness reported for Ripper, RAPIER and WHISK.
5 Summary
We have proposed a mapping of typical text patterns to logic programs which is based on types for text, words, and text positions and three fundamental predicates. Based on this representation we presented the main concepts of a rule learner for text categorization and information extraction. In contrast to previous work we rely on standard ILP and deal with learning rules for TC and IE in one framework. Our experiments show that results obtained within the ILP framework for TC and IE are comparable to those reported in the literature. Thus we believe that state-of-the-art rule learners can, in terms of effectiveness, be very well implemented using standard ILP techniques. Continuing research on rule learning for TC and IE in the ILP framework offers some benefits for ILP as well as for the respective application:
– ILP provides a sound and communicable platform for describing existing and future approaches for TC and IE. In particular, external knowledge such as a natural language thesaurus or morphological knowledge can be integrated in a sound way as background knowledge.
– Problems such as describing application-oriented search strategies using specialization as well as generalization and overfitting can be considered in the broader context of ILP. Here, the application can benefit from experiences in other ILP applications while the general ILP framework can benefit from specific problems arising in TC and IE.
References 1. Califf, M., & Mooney, R. (1990). Relational learning of pattern-match rules for information extraction. In Working Papers of the ACL 97 Workshop on Natural Language Learning, pp. 9–15. 2. Clark, P., & Boswell, R. (1991). Rule induction with CN2: Some recent improvements. In Machine Learning: European Working Session on Learning – EWSL-91, pp. 151–163 Porto, Portugal. 3. Cohen, W. (1996). Learning to classify english text with ILP methods. In Advances in Inductive Logic Programming, pp. 124–143. IOS Press. 4. Dengel, A., & Hinkelmann, K. (1996). The specialist board—a technology workbench for document analysis and understanding. In Proceedings of the second World Conference on Integrated Design and Process Technology, pp. 36–47 Austin, TX, USA. 5. Dengel, A., Hoch, R., H¨ ones, F., Malburg, M., & Weigel, A. (1997). Techniques for improving OCR results. World Scientific Publishers Company. 6. Freitag, D. (1998). Toward general-purpose learning for information extraction. In Seventeenth International Conference on Computational Linguistics, pp. 404–408 Montreal, Canada. 7. F¨ urnkranz, J. (1999). Separate-and-conquer rule learning. Artificial Intelligence Review, 13 (1), 3–54.
8. Huffman, S. (1996). Learning information extraction patterns from examples. In Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, pp. 246–260 Berlin. Springer Verlag. 9. Junker, M., & Hoch, R. (1998). An experimental evaluation of OCR text representations for learning document classifiers. International Journal on Document Analysis and Recognition, 1 (2), 116–122. 10. Muslea, I. (1998). Extraction patterns: From information extraction to wrapper generation. http://www.isi.edu/∼muslea/PS/epiReview.ps. 11. Riloff, E. (1996). Automatically generating extraction patterns from untagged text. In Thirteenth National Conference on Artifical Intelligence, pp. 1044–1049 Portland, Oregon, USA. 12. Sankoff, D., & Kruskal, J. (1983). Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. AddisonWesley. 13. Soderland, S. (1999). Learning information extraction rules for semistructured and free text. Machine Learning, 34 (1–3).
Appendix - Predicate Definitions in PROLOG

    wordpos([Word | Rest], Word, 1).
    wordpos([_ | Rest], Word, P) :- wordpos(Rest, Word, Q), P is Q + 1.

    fragment([_ | Rest], P1, P2, F) :-
        P1 > 1, P11 is P1-1, P21 is P2-1, fragment(Rest, P11, P21, F).
    fragment(Text, 1, P2, F) :- fragment(Text, P2, F).

    fragment([W | R], P, [W | F]) :- P > 0, Q is P-1, fragment(R, Q, F).
    fragment(Text, 0, []).

    next(P, P1) :- next(P, 1, P1).

    next(P, X, P1) :- number(P), number(X), P1 is P + X.
    next(P, X, P1) :- var(P), number(X), number(P1), P is P1 - X.

    next(P1, Min, Max, P2) :- number(P1), number(Min), number(Max), number(P2),
        D is P2 - P1, D >= Min, D =< Max.
Note that not all of the above predicates are completely invertible.
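For instance, the following queries illustrate which instantiation patterns are supported; this is a small sketch that follows directly from the clauses above.

    % next/2 and next/3 can compute a position from either side:
    % ?- next(2, P).       P = 3
    % ?- next(P, 3).       P = 2
    % ?- next(2, 5, P).    P = 7
    %
    % next/4 only checks fully instantiated positions:
    % ?- next(2, 1, 3, 4). yes   (the distance 2 lies within [1,3])
    % ?- next(2, 1, 3, P). no    (P is not enumerated; all four arguments must be numbers)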
Corpus-Based Learning of Semantic Relations by the ILP System, Asium Claire Nedellec Inference and Learning group Laboratoire de Recherche en Informatique UMR 8623 CNRS, bât 490 Université Paris-Sud, F-91405 Orsay
[email protected]
Abstract. This chapter presents the ILP method, Asium, that learns
ontologies and verb subcategorization frames from parsed corpora in specific domains. The ontology is learned by bottom-up conceptual clustering from parsed corpora. The clustering method also generalizes the grammatical relations between verbs and complement heads as they are observed in the corpora. The set of grammatical relations learned for a given verb forms the verb subcategorization frame. This paper details Asium's learning algorithms in an ILP formalism and shows how the learned linguistic resources can be applied to semantic tagging, language control and syntactic disambiguation within the ILP framework.
1. Introduction

Many applications that handle electronic documents, especially industry-oriented applications, require specific knowledge bases and linguistic resources. Among them, predicate schemata are syntactico-semantic resources, which let the system interpret chunks of documents at the knowledge level. By predicate schemata we mean conceptual graphs which relate predicates to their semantic roles defined as semantic sets of terms such as

    Pouring
        object    liquid, powder
        place     recipient, preparation

(The examples illustrated in this paper have all been taken from the application of Asium to the cooking recipes domain.)
where liquid = {milk, wine, ...}. Such resources are useful in many applications such as Information Extraction (Riloff, 93), Information Retrieval, and Question Answering. So far, there is no fully automatic method for learning this kind of predicate schemata from parsed training corpora. However, verb subcategorization frames and semantic classes can already be learned and this is a step in the right
In addition, they are useful for the same applications as predicate schemata. For instance, the subcategorization frame of « to pour » could be defined as

    To pour    direct object:                       liquid, powder
               adjunct (preposition = in, into):    recipient, preparation
Compared to predicate schemata, the predicate here is a verb or a noun and the relations are syntactic. Semantic classes are defined in a similar way as in predicate schemata. Previous work attempted to learn verb syntactic relations and semantic classes by observing syntactic regularities in parsed corpora. The work reported in (Grishman and Sterling, 94), (Hindle, 90), (Grefenstette, 92), (Pereira et al., 93), and (Dagan et al., 96), among others, is based on Harris' hypothesis of the distributional foundation of meaning in sublanguages (Harris et al., 89): syntactic schemata that consist of semantic word classes reflecting the knowledge domain can be identified by analyzing syntactic regularities in domain-specific parsed corpora. Most of these methods learn flat classes (Grefenstette, 92), (Hindle, 90) or distances between terms which represent their semantic similarities (Grishman & Sterling, 94), (Sekine et al., 92), (Dagan et al., 96), (Resnik, 95). Hierarchies of concepts attached to verb frames (Pereira et al., 93), (Hogenhout & Matsumoto, 97) are more understandable than flat classes and they allow semantic interpretation of document chunks at the appropriate level of abstraction. This is particularly useful in query extension and question answering, as shown by the exploitation of WordNet (Resnik & Hearst, 94), (Yarowsky, 92), but unnecessary for syntactic disambiguation, which is the goal of most of the work cited above. Here, we present Asium, a method that learns verb subcategorization frames and concepts from parsed texts in a specific domain. It learns graphs of concepts instead of concept hierarchies, so that a given concept may be related to more than one "mother". This property is required for representing the different meanings and roles of terms. Asium is based on a bottom-up clustering method, as in previous work by Pereira et al. (93) and Hogenhout and Matsumoto (97). The clustering method forms semantic classes of head terms, which represent the concepts of the ontology. It applies a novel distance that proves to be robust when applied to noisy corpora. The grammatical relations observed in the corpora between verbs and terms are also generalized so that they link verbs to all the classes of acceptable terms, not only the ones actually occurring in the corpora. The set of learned grammatical relations for a given verb forms its subcategorization frame, where the selection restrictions are filled by the concepts of the ontology. Asium is formalized as an ILP method, as opposed to previous work described in the statistics framework. The ILP formalism increases the comprehensibility of the learned knowledge and makes comparison and integration with other ILP methods easier. The nature of Asium's input and output is intrinsically relational: it consists of relations between verbs and their complements and generality relations between concepts. Moreover, Asium has to be coupled with tools that can handle
relational data and knowledge. For instance, parsing disambiguation and semantic tagging will apply coverage testing operations such as filtering, saturation and resolution. In the current implementation of the platform, data and knowledge are all stored in a relational database. In addition, learning predicate schemata from verb subcategorization frames and ontologies requires a relational representation for expressing the dependencies among concepts of the semantic roles. The ILP framework is thus viewed as a unified framework for integrating these tools and methods. The remainder of this chapter is organized as follows. Section 2 introduces the settings Asium uses, while Section 3 details the learning algorithms. Section 4 then presents some potential applications of the learning results. Future work is discussed in Section 5 and finally, Section 6 compares the approach with other related work.
2. Settings

2.1 Subcategorization Frames

Subcategorization frames describe the grammatical relations between verbs and their arguments and adjuncts. The verb adjuncts are the verb complements, such as the place, time, and means adjuncts, that are optional as opposed to verb arguments (subject, direct object and indirect objects). A grammatical relation represents the type of complement and the preposition needed, if any. The verb subcategorization frames include selection restrictions, which define the concepts that are valid for a given grammatical relation. For example, Table 1 represents the subcategorization frame for the verb "to pour" as Asium learns it. Each clause represents a single grammatical relation between "to pour" and one argument or one adjunct. X represents the clause in which the grammatical relation holds and Y represents the complement.

Table 1. The subcategorization frame for the verb "to pour".

    c(pour,X) :- verb(pour,X), comp(dobj,Y,X), prep(none,Y), head(liquid,Y).
    c(pour,X) :- verb(pour,X), comp(dobj,Y,X), prep(none,Y), head(preparation,Y).
    c(pour,X) :- verb(pour,X), comp(padj,Y,X), prep(into,Y), head(liquid,Y).
    c(pour,X) :- verb(pour,X), comp(padj,Y,X), prep(into,Y), head(recipient,Y).
The first two Horn clauses say that "to pour" allows the two concepts "liquid" and "preparation" as selection restrictions for the direct object (denoted "dobj"). The last two clauses show that "to pour" also allows the two concepts "liquid" and "recipient" as selection restrictions for the place adjunct (denoted "padj"), both introduced by the preposition "into".
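As a brief illustration of how such a frame clause is checked, consider a parsed clause stored as facts; the identifiers c1 and p11 below are hypothetical, and the head is assumed to have already been mapped to the concept "liquid". With the first clause of Table 1 loaded, the coverage query succeeds:

    verb(pour, c1).
    comp(dobj, p11, c1).
    prep(none, p11).
    head(liquid, p11).

    % ?- c(pour, c1).   % succeeds: clause c1 satisfies the direct-object relation of "to pour"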
2.2 Ontology

The concepts filling the selection restrictions of the verb frames are defined in the ontology. The ontology concepts learned by the method are defined both extensionally and intensionally. The extensional definition defines a concept as a class of terms. Term should be understood in the linguistic sense; for instance, "baking powder" is a term. The membership relation of a term T to a concept C is represented by a Horn clause, C(X) :- T(X). For instance, the clauses

    liquid(X) :- milk(X).
    liquid(X) :- water(X).
    liquid(X) :- red_wine(X).

define the concept "liquid" as the set of terms {milk, water, red_wine}. The intensional concept definition defines concepts as being related to other concepts by a generality relation. The generality relation on the ontology concepts is defined by the inclusion between term classes. A concept C is more general than a concept C' if the class of terms of C' is included in the class of terms of C. The generality relation between a concept C and a concept C' is represented by the Horn clause C(X) :- C'(X), as exemplified by the clauses below.

    liquid(X) :- milky_liquid(X).
    milky_liquid(X) :- milk(X).
    milky_liquid(X) :- cream(X).
    liquid(X) :- alcoholic_beverage(X).
Concepts and terms are not explicitly distinguished in this representation; both are represented by unary predicates. The corpus terms are the predicates that do not occur as the head of any Horn clause in the ontology; they are the terminal nodes. The structure of the ontology is a directed acyclic graph. The only relation represented in the ontology is the generality relation among the concepts.
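With such clauses loaded, membership in a more general concept amounts to provability. A minimal sketch (the occurrence identifier h1 is invented for the example):

    milk(h1).                 % a head occurrence tagged with the term "milk"
    % ?- milky_liquid(h1).    % succeeds via milky_liquid(X) :- milk(X).
    % ?- liquid(h1).          % succeeds via liquid(X) :- milky_liquid(X).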
2.3 Applications

The applications of this kind of linguistic resource are numerous. Among others, they are useful for disambiguating syntactic parsing. For instance, the noun phrase "3 minutes", in the clause "cook until golden, 3 minutes per side", could be interpreted by a syntactic parser in two ways: as the direct object or as a time adjunct. No syntactic hint can help the parser here; additional semantic knowledge is needed. The appropriate ontology allows the parser to identify "3 minutes" as belonging to the "duration" concept. With the subcategorization frame of "to cook", the parser recognizes that the duration concept plays the role of time adjunct without a preposition, so the interpretation of "3 minutes" as time adjunct is selected and its interpretation as direct object is removed. Section 4 gives a more complete description of verb subcategorization frame use and also presents how the learned linguistic resources can be used for semantic tagging and language control.
2.4 Learning Input and Output

The concepts of the ontology and the generality relation among them are learned from a parsed training corpus. The argument and adjunct heads of the verbs in the training corpus form Asium's input. The head of a complement is its main term. For example, in "pour the hot milk into the prepared pan", "milk" is the head of the noun phrase "hot milk". In the following, head should be understood in the linguistic sense and not the logical one. The output of the algorithm is a set of subcategorization frames and ontologies as defined above.
3. Learning

Ontology and verb subcategorization frame learning consists of three phases: corpus pre-processing, an initial generalization step, and further generalization steps. Corpus pre-processing transforms the training corpus into a set of examples. The initial generalization step builds the leaves of the ontology while further generalization steps build the higher levels and learn the valid grammatical relations between verbs and concepts.
3.1 Corpus Pre-processing

First of all, a syntactic parser analyzes the training corpus. In the current implementation, SYLEX (Ingenia corp.) is used (Constant, 95). In each clause, the parser identifies the verb, its arguments and its adjuncts. It identifies the head and the preposition, if any, of each argument or adjunct (Section 3.1.1). Properly identifying a head term as a single word or a complex phrase depends on whether or not the terminology dictionary of the parser is suitable for the domain. Only grammatical relations between verbs and heads plus prepositions are relevant as input to Asium. Asium extracts all grammatical relations independently of each other (Section 3.1.2). Concept construction by clustering classes of terms is based on a predicate invention operator. Thus, we need a pre-processing flattening step that will turn terms into predicates (Section 3.1.3).

3.1.1 Parsing Output

Parsed sentences (clauses) are represented by Horn clauses of the following form:

    verb(V_label,Clause_id) :-
        [ comp(Gram_rel,Phrase_id,Clause_id),
          prep(Prep_label,Phrase_id),
          head(Head_term,Phrase_id) ]*.
A verb can have several complements, each of which has a preposition and a head. Each triple represents a complement of the verb.
- The verb literal represents the verb; the comp literal, a complement of the verb (argument or adjunct); the prep literal, the preposition of the given complement; and the head literal, the head of the given complement.
- V_label, Prep_label and Head_term respectively represent the labels of the verb, of the preposition and of the head term. In cases where the argument of the verb is not introduced by a preposition, Prep_label is equal to none.
- Clause_id and Phrase_id represent the unique identifiers of the clause and the phrase in the training corpus.
- Gram_rel represents the grammatical relation in the clause between the arguments and the verb, such as subject (subj), direct object (dobj) and indirect object (iobj), and between the adjuncts and the verb, such as time adjunct (tadj) and place adjunct (padj). When the parser is not able to specify the type of adjunct, the more general label "adj" is given.

Example. The Horn clause E represents the parsing of the clause "pour the hot milk into the prepared pan".

E:  verb(pour,c1) :-
        comp(dobj,p11,c1), prep(none,p11), head(milk,p11),
        comp(padj,p12,c1), prep(into,p12), head(pan,p12). !
Subjects are generally absent from cooking recipes. However, the method treats subjects like the other verb arguments. When the syntactic analysis is insufficient for disambiguating multiple interpretations, all interpretations are kept as input.

3.1.2 Extraction of Verb Grammatical Relations

The learning method generalizes the observed grammatical relations independently from each other. Thus single grammatical relations are extracted from parsed examples: extraction splits each input Horn clause into as many new Horn clauses as there are grammatical relations between the verb and the complements occurring in the input Horn clause. Input Horn clauses are connected with a maximal depth of 3. Connection paths form a tree whose root is Clause_id in the head literal. Extraction copies the head of the input clause as the head of each new clause and follows each possible connection path from the head clause variables through the variables of the body literals until it finds the deepest variables. This amounts to partitioning the variable set into subsets of k-local variables, where k = 4 (Cohen, 93).

Example. The clause E from the above example is split into the two Horn clauses E1 and E2, with one clause per grammatical relation in E.

E1: verb(pour,c1) :- comp(dobj,p11,c1), prep(none,p11), head(milk,p11).
E2: verb(pour,c1) :- comp(padj,p12,c1), prep(into,p12), head(pan,p12). !
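A minimal sketch of this splitting step (not the Asium implementation; the example/2 representation below, which groups the comp, prep and head literals of each complement into one triple, is assumed for illustration):

    example(verb(pour,c1),
            [ [comp(dobj,p11,c1), prep(none,p11), head(milk,p11)],
              [comp(padj,p12,c1), prep(into,p12), head(pan,p12)] ]).

    % One extracted clause (VerbLiteral :- Triple) per grammatical relation.
    extracted(VerbLit, Triple) :-
        example(VerbLit, Triples),
        member(Triple, Triples).

    % ?- extracted(V, T).
    % V = verb(pour,c1), T = [comp(dobj,p11,c1), prep(none,p11), head(milk,p11)] ;
    % V = verb(pour,c1), T = [comp(padj,p12,c1), prep(into,p12), head(pan,p12)].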
Candidate hypotheses are represented in the same way. Thus hypothesis generation and coverage tests are simplified during learning. In particular, this allows example storage in relational databases for faster coverage testing. However, this results in overgeneralization, since the grammatical relations and selection restrictions for a given verb are generalized independently of each other (Section 5).

3.1.3 Head Term Flattening

Flattening then turns the head term in each Horn clause into a predicate so that predicate invention by intraconstruction can then apply. The other terms remain unchanged.

    verb(V_label,Clause_id) :-
        comp(Gram_rel,Phrase_id,Clause_id), prep(Prep_label,Phrase_id),
        head(Head_term,Phrase_id).

becomes

    verb(V_label,Clause_id) :-
        comp(Gram_rel,Phrase_id,Clause_id), prep(Prep_label,Phrase_id),
        head(Head_id,Phrase_id), head_term(Head_id).

Example. F1 results from flattening E1.

F1: verb(pour,c1) :- comp(dobj,p11,c1), prep(none,p11), head(h11,p11), milk(h11). !
The flattened clauses form the input examples of the generalization algorithm. Flattening in Asium slightly differs from flattening as defined by (Rouveirol, 94) in that it applies to constant head terms instead of all constants. The other constants c1, p11, h11 are turned into variables by the initial generalization step.
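A minimal sketch of the flattening operation itself (not the Asium code; it uses SWI-Prolog's gensym/2 to create fresh head identifiers):

    flatten_head(head(Term, Phrase), head(HeadId, Phrase), TermLiteral) :-
        gensym(h, HeadId),               % fresh identifier h1, h2, ...
        TermLiteral =.. [Term, HeadId].  % e.g. milk(h1)

    % ?- flatten_head(head(milk, p11), H, L).
    % H = head(h1, p11), L = milk(h1).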
3.2 Initial Generalization Step

The initial generalization step consists of first variabilizing the input examples (Section 3.2.1). Next, it creates the basic concepts of the ontology by predicate invention and compresses the input examples. The new predicates represent classes of head terms. This way, a given grammatical relation between a given verb and all heads occurring in the corpus is represented by a single Horn clause where a new predicate represents all possible heads in the training corpus (Section 3.2.2).
3.2.1 Variabilization

The identifiers in the input examples are variabilized, while the grammatical role, the verb and the preposition labels remain unchanged. The resulting clauses are called variabilized examples and define the set V.

Example. V1 results from variabilizing F1.

V1: verb(pour,X) :- comp(dobj,Y,X), prep(none,Y), head(Z,Y), milk(Z). !
One may notice that the number of input examples covered by a given variabilized example is the number of occurrences of the head term occurring with the verb and in the grammatical role defined in the variabilized example. For instance, clause V1 covers as many examples as there are clauses in the corpus where "milk" should be "poured". The number of examples covered is attached to each variabilized example Vi and denoted Occ(Vi), its number of occurrences.

3.2.2 Predicate Invention

In Asium, predicate invention is done by applying Muggleton and Buntine's (88) intraconstruction operator to variabilized examples. It creates one new predicate for each set of head terms that occur with the same verb and the same grammatical role and preposition, if any. Let us first present the intraconstruction operator.

• Intraconstruction. As defined in (Muggleton & Buntine, 88), it applies to two clauses whose bodies contain the same subpart. Let R1 and R2 be these clauses, with common body subpart B2:

    R1: C(X) :- B11 ∧ B2.
    R2: C(X) :- B12 ∧ B2.

Intraconstruction creates two new clauses,

    C1: np(X) :- B11.      C2: np(X) :- B12.

which define the new predicate np as the disjunction of B11(X) and B12(X), the subparts of R1 and R2 that differ. Inverting the resolution with the parent clause C1 and the resolvent R1 yields the parent clause G,

    G: C(X) :- np(X) ∧ B2.

as does inverting resolution with the clauses C2 and R2 (Figure 1).
Fig. 1. Intraconstruction: the generalization G is obtained from R1 and C1 (or from R2 and C2) by inverting resolution.
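A toy instance of this schema (all predicates below are invented for illustration and do not come from the corpus):

    % R1: acceptable(X) :- sweet(X), fruit(X).
    % R2: acceptable(X) :- sour(X),  fruit(X).
    % Intraconstruction invents np/1,
    %     np(X) :- sweet(X).      np(X) :- sour(X).
    % and replaces R1 and R2 by the single parent clause
    %     G: acceptable(X) :- np(X), fruit(X).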
• Predicate Invention in Asium. In Asium, predicate invention applies to the set of variabilized examples with the same verb, the same grammatical role and the same preposition, if any, but where the head terms may differ (in italics below).

    verb(V_label,X) :- comp(dobj,Y,X), prep(none,Y), head(Z,Y), head_term(Z).

Example. The three examples below concern the same verb "to pour" and the same grammatical relation, direct object, but different heads "milk", "water" and "wine".

V1: verb(pour,X) :- comp(dobj,Y,X), prep(none,Y), head(Z,Y), milk(Z).
V2: verb(pour,X) :- comp(dobj,Y,X), prep(none,Y), head(Z,Y), water(Z).
V3: verb(pour,X) :- comp(dobj,Y,X), prep(none,Y), head(Z,Y), wine(Z).
Fig. 2. Predicate invention example: the GR clause "pour" - dobj - np and its basic cluster {milk, water, wine}. !
Let us call such sets of clauses (one set per verb plus grammatical relation) basic clusters. These clusters form a partition of the set of variabilized clauses. The predicate invention operator creates one new predicate per cluster. The new predicate is defined by the disjunction of all heads in the variabilized clauses. The clauses that define the new predicate form the so-called basic clauses of the Domain Theory (DT). The clause obtained by intraconstruction,

    verb(V_label,X) :- comp(Gram_rel,Y,X), prep(Prep_label,Y), head(Z,Y), np(Z).
forms a generalization of the clauses of the corresponding basic cluster with respect to DT, according to generalized subsumption (Buntine, 88). Such clauses are called basic Grammatical Relation clauses, or basic GR clauses.

Example. The basic clauses

    np(X) :- milk(X).      np(X) :- water(X).      np(X) :- wine(X).

are built from the clauses V1, V2 and V3. G is the basic GR clause obtained by intraconstruction.

G: verb(pour,X) :- comp(dobj,Y,X), prep(none,Y), head(Z,Y), np(Z). !
New predicates formed by Asium are named by the user as they are built (see (Faure & Nedellec, 99) for more details on user interaction). At this stage, the set of GR clauses together with DT, comprehensively represents all the grammatical relations between verbs and terms that occur in the input corpus. As such, they represent pieces of subcategorization frames. Thus, in the example, np represents the set "{milk, water, wine}" of head terms observed in the corpus for the relation "pour - Direct Object" (Figure 2). Further steps will generalize these relations, extend DT and the variabilized example set.
3.3 Further Generalization Steps

All further generalization steps iterate in the same way. The algorithm is given in Table 2; it outlines the generalization process detailed below. For input, each step uses the results of the previous steps (i.e., the Domain Theory DT, the GR clause set and the variabilized example set) and extends them.

Table 2. Asium generalization algorithm.

    Initialization
        GR    ← {Basic GR clauses}
        DT    ← {Basic clauses}
        V     ← {Variabilized examples}
        NewGR ← ∅
    Loop
        For all (Gi, Gj) ∈ GR × GR, Gi ≠ Gj
            Compute Dist(Gi, Gj)
            If Dist(Gi, Gj) < Threshold then
                • Generalization: build a new predicate NewP, form the NewP definition,
                  and Gi' and Gj' generalizing Gi and Gj by intraconstruction
                      NewGR ← {Gi'} ∪ {Gj'} ∪ NewGR
                      DT    ← NewP definition ∪ DT
                • Example generation: generate new example sets VGi' and VGj'
                  by partial evaluation of Gi' and Gj'
                      V ← VGi' ∪ VGj' ∪ V
            Endif
        Endfor
        Exit when NewGR = ∅
        GR    ← GR ∪ NewGR
        NewGR ← ∅
    End loop
A generalization step consists of selecting the most similar pairs among the current GR clauses according to a given distance and generalizing each pair into two new clauses to be added to the current set of GR clauses. The similarity between clauses is based on the number of similar variabilized examples covered. Two examples are similar if they have the same head term (Section 3.3.2). The effect of such a generalization step on a given clause Gi, similar to a clause Gj, is that the new clause Gi' covers the grammatical relation of Gi, associating the verb of Gi with the terms of Gi plus the terms of the other clause Gj. In other words, it extends the set of acceptable terms for a given verb - grammatical relation by adding to it the terms of another, similar set of terms for another verb - grammatical relation. The two extended sets of terms are equal and define the same new predicate (Section 3.3.1). New variabilized examples are then generated so that they can be associated with the number of occurrences required for computing distances in further generalization steps (Section 3.3.3).
270
Claire Nedellec GR' i
C
GR
i
GR' j
i
C
GR
i
j
Fig. 3. APT's predicate invention More formally, Gri: verb(V_labeli,X) :comp(Gram_reli,Y,X),prep (Prep_labeli,X),head(Z,Y),npi(Z). GRj: verb(V_labelj,X) :comp(Gram_relj,Y,X),prep(Prep_labelj,X),head(Z,Y),npj(Z).
a new predicate np is defined in DT as, Ci: np(X) :- npi(X).
Cj: np(X) :-
npj(X).
Two new clauses GRi' and GRj' are built by intraconstruction, respectively substituting np for npi and npj in GRi and GRj by inverting two resolution steps. GRi': verb(V_labeli,X) :comp(Gram_reli,Y,X),prep(Prep_labeli,X),head(Z,Y),np(Z) . GRj': verb(V_labelj,X) :comp(Gram_relj,Y,X),prep(Prep_labelj,X),head(Z,Y),np(Z) .
The APT predicate invention operator applied here differs from the one described in Section 3.2 as it performs generalization instead of compression as Muggleton’s and Buntine’s (88) operator does. The parts of the two initial clause bodies which are not replaced by the new predicate differ, and thus two parent clauses are built instead of one and all that they have in common is the invented head term. They are thus more general than the ones they have been built from. The new clauses GRi' and GRj' are added to the GR set of clauses, however GRi and GRj are not removed so that they can produce other generalizations than GRi' and GRj'. DT then forms an acyclic graph and not a hierarchy. Example. Suppose G1 and G2 , given below, considered as similar. G1: verb(pour,X):comp(dobj,Y,X),prep (none,Y),head(Z,Y),np1(Z). G2: verb(drop,X):-comp(padj,Y,X),prep(in,Y),head(Z,Y),np2(Z). np1 and np2 are defined in DT by np1(X) :- milk(X).
np1(X) :- water(X).
np1(X) :- wine(X).
Corpus-Based Learning of Semantic Relations by the ILP System, Asium np2(X) :- milk(X). cream(X).
np2(X) :- water(X).
np2(X)
271 :-
A new predicate np is invented and defined as np(X) :- np1(X).
np(X) :- np2(X).
and two new clauses G1' and G2' are built which generalize G1 and G2: G1': verb(pour,X):comp(dobj,Y,X),prep(none,Y),head(Z,Y),np(Z). G2': verb(drop,X):-comp(padj,Y,X),prep(in,Y),head(Z,Y),np(Z). !
3.3.2 Heuristics, Distance Computing and Threshold Predicate invention results in generalizing the grammatical relation between verbs and head terms: the set of valid head terms for one verb and grammatical function is enriched by the addition of the valid head terms for another verb and grammatical function. For instance, the set of liquid terms valid as direct object of "to pour" is enriched by the set of liquid terms valid as place adjunct introduced by the preposition "in", after the verb "to drop". The induction leap due to predicate invention is controlled by a distance-based heuristic. GR clause pairs are considered as similar if their distance is less than a given threshold set by the user. The distance Dist (Table 3), between two clauses Gi and Gj of the GR set depends on the proportion of the number of occurrences of similar variabilized examples covered by both clauses, and of the total occurrence number of examples in V. • Similarity between Examples Two examples of V are similar if their head terms are the same. For example, V1 and V2 are similar. V1: verb(pour,X) :comp(dobj,Y,X), prep(none,Y), head(Z,Y), milk(Z). V2: verb(drop,X) :comp(padj,Y,X), prep(in,Y), head(Z,Y), milk(Z).
The set of variabilized examples covered by a GR clause G is denoted Cov(G). The set SimGj(Gi) is the set of examples of Cov(Gi) such that there exists a similar example in Cov(Gj). Notice that SimGj(Gi) = SimGi(Gj) . occ(VE) denotes the number of occurrences for the example VE. • Definition of the Distance, Dist The distance between two GR clauses Gi and Gj is defined as follows (Table 3).
272
Claire Nedellec Table 3. Asium distance.
log ifSimG j (Gi ) ≠ ∅,Dist(Gi ,G j ) = 1 −
∑
Occ(v) + log
log
∑
Occ(v) + log
v ∈Cov (Gi )
∑ Occ(v)
v ∈Sim G j (Gi )
v∈Sim Gi (G j )
∑ Occ(v)
v ∈Cov ( G j )
Dist(G i ,Gj ) = 1,otherwise.
The only criterion used for choosing the clause pairs is distance. It is possible for members of a candidate pair to have been built at different generalization steps. 3.3.3 Example Generation The distance between two GR clauses is computed on the basis of the occurrence number of the examples covered. Thus we want to easily associate each new GR clause Gi' with the variabilized examples it potentially covers and not only with the training examples it actually covers (the ones covered by Gi). We also want to associate a number of occurrences to the new examples. Asium generates all these new examples by a partial evaluation of Gi', (Van Harmelen and Bundy, 88), with respect to DT. New examples are then added to V extending Cov(Gi'). This amounts to generating one new variabilized example Vinew per example Vjold of Cov(Gj) that is not in SimGi(Gj) so that Vinew is similar to Vjold and vice et
versa. Thus Cov(Gi') and SimGj’(Gi') become equal. The number Occ(Vinew) associated to the example Vinew is equal to Occ(Vjold). Example. Consider G1' and G2'. G1': verb(pour,X) :comp(dobj,Y,X), prep(none,Y),head(Z,Y),np(Z). G2': verb(drop,X) :comp(padj,Y,X), prep(in,Y), head(Z,Y), np(Z). np, np1 and np2 are defined in DT by, np(X) :- np1(X). np(X) :- np2(X). np1(X) :- milk(X). np2(X) :- milk(X).
np1(X) :- water(X). np2(X) :- water(X).
np1(X) :- wine(X). np2(X) :- cream(X).
For Cov(G1') = SimG2’ Cov(G1') the following new examples have to be generated: verb(drop,X) :comp(padj,Y,X), prep(in,Y), head(Z,Y), wine(Z). verb(pour,X) :comp(dobj,Y,X), prep(none,Y), head(Z,Y), cream(Z).
Corpus-Based Learning of Semantic Relations by the ILP System, Asium
273
since "wine" is the only predicate that is less general than np1 and not less general than np2 and "cream" is the only predicate that is less general than np2 and not less general than np1. ! 3.3.4 Learning Result Generalization ends when no GR clause can be generalized further: when there is no pair that is similar enough with respect to the threshold. The learned ontology is the set of DT clauses that are not basic clauses. The subcategorization frame SubCatVerb_Id of a given verb verb_Id, is the set of the most general GR clauses concluding with verb(Verb_Id,X). There is one such clause per valid concept for a given verb - grammatical relation. As DT is not hierarchical, learning can results in more than one most general clause for a given verb - grammatical relation.
4. Applications To illustrate the potential applications of the proposed approach we present how learning results contribute to solving the semantic tagging, parsing disambiguation and language control problems. These components have been implemented in Asium for measuring performance with respect to the three tasks. • Semantic tagging tags verb complement heads in the test corpus by the ontology concepts according to the verb subcategorization frames. Semantic tagging is a way to extend documents or user queries for Information Retrieval by enriching texts by synonyms or more abstract concepts than those actually occurring. • The ontology and verb subcategorization frames help disambiguate parsing in two ways: by determining if a given noun phrase should be attached to a verb or to a noun; and by determining the type of attachment to a verb (argument, adjunct and the type of the argument or adjunct). • Language control checks the semantic validity of the heads of verb complements in the corpus according to the ontology and verb subcategorization frame. These three tasks are based on the same logical operation: given a parsed clause, show that the verb subcategorization frame covers the parsed clause according to generalized subsumption with DT. Examples to be handled (disambiguated, controlled or tagged) should be pre-processed as described in Section 3.1, that is, parsed and split. Given an example E, E:
verbe(v_labele,clause_id) :comp(gram_rele,phrase_id,clause_id), prep(prep_label,phrase_id), head(head_id,phrase_id), head_terme(head_id).
The subcategorization frame Fr of the verb v_label, covers E iff there exists a clause G of Fr,
274
Claire Nedellec
G : VerbG(v_labelG,X) :comp(gram_relG,Y,X), prep(prep_labelG,Y), head(Z,Y), head_termG(Z).
that is more general than E according to SLD resolution with DT: G and E must have the same verb, they also must have the same grammatical relation (preposition plus type of argument), and the head term of G, head_termG, must be more general than head_terme with respect to DT.
4.1 Semantic Tagging Semantic tagging consists of listing all intermediate goals when proving head_termG(Z), that is to say, listing the concepts of the ontology, the definition of which is needed for proving head_termG(Z) given head_terme(head_id). Example. G covers the example E, E:
verb(drop,c1) :comp(padj,p11,c1),prep(in,p11),head(h11,p11),wine(h11).
G:
verb(drop,X) :comp(padj,Y,X), prep(in,Y),head(Z,Y), liquid(Z).
since DT says, liquid(X) :- alcoholic_beverages(X). alcoholic_beverages(X) :- wine(X).
and E is tagged as, E:
verb(drop,c1) :comp(padj,p11,c1), prep(in,p11), head(h11,p11), wine(h11), Alcoholic_beverages(h11), liquid(h11).
!
Notice that tagging an example differs from saturation: we do not want to add all concepts that are more general than wine, but only the ones that are relevant here. Relevancy depends on the syntactic and semantic context given by the subcategorization frame as highlighted in (Riloff, 93). Learning simple classes from co-occurrences in text-windows cannot provide a way to disambiguate the role of a term, but learning subcategorization frames can. 4.2 Parsing Disambiguation Parsing disambiguation simply selects the parsing interpretation that is covered by the subcategorization frames and removes the others. When no interpretation is left, a possible parsing can be suggested by abduction as in (Duval, 91).
Corpus-Based Learning of Semantic Relations by the ILP System, Asium
275
Example. Of the two possible interpretations E:
verb(cook,c1) :comp(dobj,p11,c1),prep(none,p11),head(h11,p11),3_minutes(h 11). E': verb(cook,c1) : comp(dadj,p11,c1),prep(none,p11),head((h11,p11),3_minutes(h 11).
the second one is correct, according to C and DT, saying that 3_minutes is a duration, where C is given below. C:
verb(cook,X) :comp(dadj,Y,X),
prep(none,Y),
head(Z,Y),
duration(Z).
!
If the parser would not have built the second and correct interpretation, but only the first one, the Asium disambiguating component would have suggested it by abducing comp(dadj,p11,c1). When only one literal lacks among the four needed, it is abduced in order to complete the proof. 4.3 Language Control Language control checks the syntactic validity of the verbal grammatical relations, and the semantic validity of the heads. If there is no clause in the subcategorization frame covering the example to be tested, the example is considered as invalid. In particular, it allows one to detect metonymies. A possible replacement of the invalid head can be suggested by abduction in a similar way as when disambiguating. For example, C does not cover the example E, as "glass" is not defined as a "liquid" in DT. "liquid" can be suggested for replacing "glass". E: ). C:
verb(drink,c1) :comp(dobj,p11,c1),prep(none,p11),head(h11,p11),glass(h11 verb(drink,X) :comp(dobj,Y,X), prep(none,Y), head(Z,Y),liquid(Z).
5. Future Work The training example set ranges from large to very large. Asium, like other similar methods, learns grammatical relations for a given verb independently of one another for reasons of efficiency. As a consequence, concepts filling the selection restrictions can be overgeneral for some tasks like query extension in information retrieval where computational efficiency is crucial.
276
Claire Nedellec
For instance, the learned subcategorization frame of "to cook" will be, C1: verb(cook,X) :comp(dobj,Y,X), prep(none,Y), head(Z,Y), cake(Z). C2: verb(cook,X) :comp(dobj,Y,X), prep(none,Y), head(Z,Y), eggs(Z). C3: verb(cook,X) :comp(tadj,Y,X), prep(for,Y), head(Z,Y), duration(Z).
It says that cakes and eggs can be cooked in any duration, although eggs should not be cooked more than 12 minutes. A user query "how long should eggs be cooked? " would trigger a search through the cooking recipe base for all combinations of "eggs" and "duration" defined in the ontology instead of only the relevant ones. Learning grammatical relations independently has another consequence: the properties of the grammatical relations of a given verb such as mutual exclusion, optionality or requirement are not learned. For instance in the cooking recipe corpus, the time adjuncts of "to cook", "for - duration" and "duration" are mutually exclusive and the preposition "for" is omitted when the direct object is present. We are developing a post-processing method based on the method HAIKU (Nedellec et al., 96) and the language CARIN (Levy & Rousset, 98) in order to learn such dependencies. It will both specialize the overgeneral selection restrictions and learn dependencies between verb complements. Clustering based on FOL distances (such as the ones of (Esposito et al., 91), (Bisson, 92), (Kirsten & Wrobel, 98)) instead of the Asium distance could help to control the generalization of dependent selection restrictions. They are not applicable here for reasons of complexity. For instance, the cooking recipe corpus contains 90,000 examples. Up to 800 concepts and 1000 verb subcategorization frames have to be learned in parallel. However such distances could be successfully applied to learning predicate schemata from verb subcategorization frames and noun frames.
6. Related Work In this paper, we have presented the ILP method Asium which learns ontologies and verb subcategorization frames from a parsed corpus in an unsupervised way. As proposed by the work reported in (Hindle, 90), (Pereira et al., 93), (Grishman & Sterling, 94) and (Grefenstette, 92), among others, Asium clusters terms on the basis of syntactic regularities observed in a parsed corpus. The clustered terms are heads of verb complements, arguments and adjuncts. Asium differs from both Hindle's (90) and Grefenstette's (92) methods where adjuncts are not considered for learning. Instead, Hindle's method only considers arguments while Grefenstette's method considers arguments and noun relations (adjectival and prepositional). Experiments
Corpus-Based Learning of Semantic Relations by the ILP System, Asium
277
with the cooking recipe corpus and the Pascal corpus of INIST2 have shown that considering not only arguments but also adjuncts yields better results in terms of precision and recall. Further experiments are performed with the Mo’K system (Bisson et al.) for comparing the results when learning from noun relations as proposed by Grefenstette (92), and Grishman and Sterling (94). The way Asium clusters terms for building hierarchies of concepts fundamentally differs from the clustering methods described in (Pereira et al., 93), (Hogenhout & Matsumoto, 97) and more generally, from those applied in conceptual clustering. As the goal is to build classes of terms, terms are viewed as the examples, i.e. the objects to cluster. The examples are described by their attributes; that is to say, their syntactic context (verb plus grammatical relation) in the learning corpus. Notice that verbs are viewed as the objects when learning verb classes as in (Basili & Pazienza, 97). Bottom-up clustering usually computes the distances between pairs of objects according to the attributes they have in common. The best pair is selected, the two objects clustered, and clustering goes on until a tree is built with a single class containing all objects at its top. This strategy builds deep trees with many intermediate useless concepts and the concepts at the lowest levels contain very few terms. The novel strategy proposed here is to compute distances between all pairs of attributes and to cluster the two sets of objects which are described by the closest pair of attributes. Thus the number of terms in the classes is much larger and the tree much shallower. This improves the readability of the tree and the efficiency of its use. One effect could be a lack of precision; however, preliminary experiments on the two corpora cited above did not show major differences in precision but a notable reduction of tree size. Further experiments would be needed in order to characterize the properties of the corpora for which this strategy would be preferable. The ILP approach proposed here remains applicable in all the four cases, clustering terms versus clustering verbs, and clustering objects as usual, versus clustering attributes as in Asium. It could thus be usefully used for modeling previous work on clustering terms in an ILP framework.
Acknowledgement This work has been partially supported by the CEC through the ESPRIT contract LTR 20237 (ILP 2).
References 1. Basili R. & Pazienza M. T., "Lexical acquisition for information extraction" in Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology, M. T. Pazienza (Ed.), (pp.14-18), Lecture Notes in Artificial Intelligence Tutorial, Springer Verlag (Pub.), Frascati, Italy, July 1997,
2Pascal
is a base of scientific paper abstracts on agriculture, maintained by INIST.
278
Claire Nedellec
2. Bisson G., "Learning in FOL with a similarity measure", in Proceedings of the Tenth National Conference en Artificial Intelligence, (pp. 82-87), San Jose, AAAI Press / The MIT Press (Pub.), July, 1992. 3. Bisson G., Nedellec C. & Canamero L., "Clustering methods for ontology learning: The Mo’K workbench", in Proceedings of the European Conference on Artificial Intelligence Workshop on Ontology Learning, Staab S. et al. (Eds), Berlin, 2000 (in press). 4. Buntine W., "Generalized subsumption and its application to induction and redundancy", in Artificial Intelligence 36, (pp. 375-399), 1988. 5. Cohen W. W., "Cryptographic limitations on learning one-clause logic program" in Proceedings of the Tenth National Conference on Artificial Intelligence, Washington D.C., 1993. 6. Constant P., "L'analyseur linguistique SYLEX", Fifth CNET summer school, 1995. 7. Dagan I., Lee L., & Pereira F., "Similarity-based methods for word-sense disambiguation", in Proceedings of the Annual Meeting of the Association for Computational Linguistics, 1996. 8. Duval B., "Abduction for explanation-based learning", in Proceedings of the European Working Session on Learning, (pp. 348-360), Lecture Notes in Artificial Intelligence, Y. Kodratoff (Ed.), Springer Verlag (Pub.), March 1991. 9. Esposito F., Malerba D. & Semeraro G., "Flexible matching for noisy structural descriptions.", in Proceedings of Twelfth International Joint Conference on Artificial Intelligence, (pp. 658-664), Sydney, August, 1991. 10. Faure D. & Nedellec C.,"Knowledge acquisition of predicate-argument structures from technichal texts using machine learning", in Proceedings of Current Developments in Knowledge Acquisition, D. Fensel & R. Studer (Ed.), Springer Verlag (Pub.), Karlsruhe, Germany, May 1999. 11. Hindle D., "Noun classification from predicate-argument structures", in Proceedings of the 28st annual meeting of the Association for Computational Linguistics, (pp. 1268-1275), Pittsburgh, PA, 1990. 12. Grefenstette G., "SEXTANT: exploring unexplored contexts for semantic extraction from syntactic analysis", in Proceedings of the Thirtieth Annual Meeting of the Association of Computational Linguistics, (pp. 14-18), 1992. 13. Grishman R. & Sterling J., "Generalizing automatically generated selectional patterns", in Proceedings of the Sixteenth International Conference on Computational Linguistics, 1994. 14. Harris Z., Gottfried M., Ryckman T., Mattick Jr P., Daladier A., Harris T. & Harris S., The form of information in science, analysis of immunology sublanguages, Kluwer Academic (Pub.), Dordrecht, 1989. 15. Hogenhout W. R. & Matsumoto Y., "A preliminary study of word clustering based on syntactic behavior", Proceedings of Thirty-fifth Annual Meeting of the Association of Computational Linguistics, 1997. 16. Kirsten M. & Wrobel S., "Relational distance-based clustering", in Proceedings of the Eighth workshop on Inductive Logic Programing, Page D. (ed.), (pp. 261-270(, Springer Verlag (Pub.), Madison, 1998. 17. Levy A. & Rousset M. C. "Combining Horn rules and description Logics in CARIN", in Artificial Intelligence Journal, vol 104, 165-210, September 1998. 18. Muggleton S. & Buntine W., "Machine invention of first order predicates by inverting resolution", in Proceedings of the Fifth International Machine Learning Worksho, Morgan Kaufman (Pub.), (pp. 339-352), 1988. 19. Nedellec C., "How to specialize by theory refinement", in Proceedings of the Tenth European Conference on Artificial Intelligence, (pp. 474-478), Neuman B. 
(Ed.), John Wiley & sons (Pub.), Vienna, August, 1992.
Corpus-Based Learning of Semantic Relations by the ILP System, Asium
279
20. Nedellec C., Rouveirol C., Ade H., Bergadano F. & Tausend B.,"Declarative bias in inductive logic programming" in Advances in Inductive Logic Programming, 82-103, de Raedt L. (Ed.), IOS Press (Pub.), 1996. 21. Pereira F., Tishby N. & Lee L., "Distributional clustering of English words" in Proceedings of the 31st annual meeting of the Association for Computational Linguistics, (pp. 183-190), 1993. 22. Resnik P. & Hearst M. A. "Structural ambiguity and conceptual relations", in Proceedings of Workshop on Very Large Corpora: Academic and Industrial Perspectives, (pp. 58-64), Ohio State University, 1993. 23. Resnik P., "Using information content to evaluate semantic similarity in a taxonomy.", in Proceedings of the International Joint Conference on Artificial Intelligence, Montreal, 1995. 24. Rouveirol C., "Flattening and saturation: two representation changes for generalization", in Machine Learning, 14, 219-232, Kluwer Academic (Pub.), Boston, 1994. 25. Riloff H., "Automatically constructing a dictionary for information extraction tasks", in Proceedings of the 11th National Conference on Artificial Intelligence, (pp. 811-816), AAAI Press / MIT Press (Pub.), 1993. 26. Sekine S., Caroll J. J., Ananiadou S. et Tsujii J., "Automatic learning for semantic collocation" in Proceedings of the Third Conference on Applied Natural Language Processing, (pp. 104-110), Trente, Italy, 1992. 27. Van Harmelen F. & Bundy A., “Explanation based generalization = partial evaluation”, in Artificial Intelligence 36, 401-412, 1988. 28. Yarowsky D., "Word-Sense disambiguation using statistical models of Roget's categories trained on large corpora", in Proceedings of the International Conference on Computational Linguistics, (pp. 454-460), Nantes, 1992.
Improving Learning by Choosing Examples Intelligently in Two Natural Language Tasks Cynthia A. Thompson1 and Mary Elaine Califf2 1
2
CSLI, Ventura Hall, Stanford University, Stanford, CA 94305, USA
[email protected] Department of Applied Computer Science, Illinois State University, Normal, IL 61790, USA
[email protected]
Abstract. In this chapter, we present relational learning algorithms for two natural language processing tasks, semantic parsing and information extraction. We describe the algorithms and present experimental results showing their effectiveness. We also describe our application of active learning techniques to these learning systems. We applied certainty-based selective sampling to each system, using fairly simple notions of certainty. We show that these selective sampling techniques greatly reduce the number of annotated examples required for the systems to achieve good generalization performance.
1
Introduction
Our research focuses broadly on applying machine learning techniques to build natural language systems. In this chapter, we discuss our learning approach for two such tasks, describing Chill, our system for learning semantic parsers, and Rapier, our system for learning information extraction rules. Because of the desire to learn relational (first-order) rules for carrying out these tasks, we draw from inductive logic programming (ILP) in both learning tasks. While others have applied learning to natural language tasks, few have concentrated on tasks requiring deep natural language understanding, focusing instead on tasks such as morphology and part-of-speech tagging. In particular, we know of no other recent learning research on mapping language into logical form, the task faced by Chill. Both Chill and Rapier combine and extend upon previous work in ILP, resulting in systems that perform well when tested on a variety of data sets. An important technique shared by the two systems is the combination of low level decisions to create the annotation of entire examples. This combination will be especially relevant when we discuss active learning. Besides applying machine learning to language tasks, we are also interested in reducing the effort required to annotate the needed training examples. Therefore, the second focus of this chapter is our demonstration of generalization from J. Cussens and S. Dˇ zeroski (Eds.): LLL’99, LNAI 1925, pp. 279–299, 2000. c Springer-Verlag Berlin Heidelberg 2000
280
Cynthia A. Thompson and Mary Elaine Califf
fewer examples by having the learner choose which examples to learn from, using active learning techniques. Active learning is an emerging area in machine learning that explores methods that, rather than relying on a benevolent teacher or random sampling, actively participate in the collection of training examples. The primary goal of active learning is to reduce the number of supervised training examples needed to achieve a given level of performance. Active learning systems may construct their own examples, request certain types of examples, or determine which of a set of unsupervised examples would be most useful if labeled. The last approach, selective sampling (Cohn, Atlas, & Ladner, 1994), is particularly attractive in natural language learning, since there is an abundance of text, and we would like to annotate only the most informative sentences. For many language learning tasks, annotation is particularly time-consuming since it requires specifying a complex output rather than just a category label, so reducing the number of required training examples can greatly increase the utility of learning. Only a few researchers applying machine learning to natural language processing have used active learning, and those have primarily addressed two particular tasks: part of speech tagging (Dagan & Engelson, 1995) and text categorization (Lewis & Catlett, 1994; Liere & Tadepalli, 1997). Both of these can be viewed primarily as classification tasks, while semantic parsing and information extraction, addressed here, are typically not. Some have viewed the sentence as a unit, so that part of speech tagging is also no longer a classification task, but the output is still less complex than parse trees and slot-filler pairs. Selective sampling has been applied to information extraction, but not in a form where the purpose was to select the most useful examples to annotate (Soderland, 1999) (see Section 7). Both semantic parsing and information extraction require annotating text with a complex output, but the application of active learning to tasks requiring such complex outputs has not been well studied. Our research shows how active learning methods can be applied to such problems and demonstrates that it can significantly decrease annotation costs for important and realistic natural language tasks. The remainder of the chapter is organized as follows. Section 2 introduces Chill and presents experimental results for learning parsers, while Section 3 does the same for Rapier and learning information extraction rules. Section 4 presents an overview of active learning. Sections 5 and 6 describe our active learning techniques for Chill and Rapier, respectively, and also demonstrate their effectiveness. Section 7 describes related work, and Section 8 concludes the chapter.
2
Learning to Parse
A straightforward application of ILP to parser acquisition would be to present to an ILP system a corpus of sentences paired with their representations (e.g., parse trees) as a set of positive examples. A learned Prolog definition for the predicate parse(Sentence, Representation) could then be used to prove goals
Choosing Examples Intelligently in Two Natural Language Tasks Prolog
<Sentence, Representation> Parsing Operator Generator
Training Examples
281
Overly−General Parser
Example Analysis
Control
Examples
Control Rule Induction
Control Rules
Program Specialization
Final Parser Prolog
Fig. 1. The Chill Architecture
with the second argument uninstantiated, thereby producing parses of sentences. However, there is no convenient set of negative examples or an obvious set of background relations to provide. Also, since parsers are very complex programs, it is unlikely that any existing ILP system could induce from scratch a complete parser that generalizes well to new inputs. The space of logic programs is simply too large, and the learning problem too unconstrained. 2.1
Chill
Chill (Zelle & Mooney, 1996) addresses this problem by considering parser acquisition as a control-rule learning problem; it learns rules to control the stepby-step actions of an initial, overly-general parsing shell. Figure 1 shows the basic components of Chill. First, during Parsing Operator Generation, the training examples are analyzed to formulate an overly-general shift-reduce parser that is capable of producing parses from sentences. Next, in Example Analysis, the overly-general parser is used to parse the training examples to extract sentence and parse stack contexts in which the generated parsing operators lead to a correct parse. The third step is Control-Rule Induction, which employs a general ILP algorithm to learn rules that characterize these contexts. Finally, Program Specialization “folds” the learned control-rules back into the overly-general parser to produce the final parser, which correctly parses all training sentences without being given access to the correct representation(s).
282
Cynthia A. Thompson and Mary Elaine Califf Table 1. Chillin Algorithm
DEF := {E :- true | E ∈ Pos} Repeat PAIRS := a sampling of pairs of clauses from DEF GENS := {G | G = build gen(Ci , Cj , DEF, Pos, Neg) for Ci , Cj ∈ PAIRS} G := Clause in GENS yielding most compaction DEF := (DEF - Clauses subsumed by G)) ∪ G Until no further compaction
One component of Chill that will be relevant to our discussion is the initial overly-general parser. This parser is overly-general in the sense that many parses, not all of them correct, can be derived from a sentence. While the basic shiftreduce paradigm remains fixed across representation languages, the specific type of shift and reduce operators needed for a given representation can be easily determined. These are then instantiated into several possibilities by the training examples presented. Suppose, for example, that we have the following training example, which uses a case-role analysis (Fillmore, 1968) as the representation language: parse([the, man, ate, the, pasta], [ate, agt:[man, det:the], pat:[pasta, det:the]]). The Prolog reduction operator needed to transform [man, the] on the parse stack to [man, det:the] would be op([Top,Second|Rest],Inpt,[NewTop|Rest],Inpt) :reduce(Top,det,Second,NewTop). However, this operator would also transform [ate,[man,det:the]] into [ate, det:[man,det:the]], an incorrect transformation given the training example. This would then serve as a negative example during Control Rule Induction. Also of interest is the induction algorithm used by Chill, called Chillin. The input is a set of positive and negative examples of a concept (in the case of Chill these are the parse control examples generated by the Example Analysis phase) expressed as facts. The output of Chillin is a definite-clause concept definition which covers the positive examples but not the negative. Chillin starts with a most specific definition (the positive examples), and introduces generalizations that make the definition more compact, as defined by a simple measure of the syntactic size of the program. At each step in the hill-climbing search for more general definitions, a number of possible generalizations are considered. The generalization producing the greatest compaction is implemented, and the process repeats. Table 1 shows the basic compaction loop. The notion of empirical subsumption influences the generalization and compaction process. Intuitively, the algorithm attempts to construct a clause that, when added to the current definition, renders other clauses superfluous. Formally, we define empirical subsumption as follows: Given a set C of Clauses
Choosing Examples Intelligently in Two Natural Language Tasks
283
Table 2. Build gen Algorithm Function build gen(Ci , Cj , DEF, Pos, Neg) GEN := LGG(Ci , Cj ) CNEGS := negatives covered by GEN if CNEGS = {} return GEN GEN := add antecedents(Pos, CNEGS, GEN) CNEGS := negatives covered by GEN if CNEGS = {} return GEN REDUCED := DEF - (Clauses subsumed by GEN) CPOS := {e | e ∈ Pos ∧ REDUCED e} LITERAL := invent predicate(CPOS, CNEGS, GEN) GEN := GEN ∪ LITERAL return GEN
{C1 , C2 , ..., CN } and a set of positive examples E provable from C, a clause G empirically subsumes Ci iff ∀e ∈ E : [(C − Ci ) ∪ G e]. That is, all examples in E are still provable if Ci is replaced by G. As in Golem (Muggleton & Feng, 1990), pairs of clauses are randomly sampled from the current definition to serve as “seeds” for the generalization process, outlined in Table 2. The best generalization produced from these pairs is used to reduce the current definition. There are three basic processes involved in constructing generalizations. First is the construction of a least-general generalization (LGG) (Plotkin, 1970) of the input clauses. If this generalization covers no negative examples, it is returned. Otherwise, an attempt is made to specialize it by adding antecedents. If the expanded clause is still too general, it is passed to a routine that invents a new predicate that further specializes the clause so that it covers no negative examples. For further details on these three processes, see Zelle and Mooney (1996). At the end of the Control Rule Induction phase, the parser generates only the correct parse(s) for each training sentence, without requiring any access to those parses. This chapter will focus on one application in which Chill has been tested, learning an interface to a geographical database. In this domain, Chill learns parsers that map natural language questions directly into Prolog queries that can be executed to produce an answer. Following are two sample queries for a database on U.S. geography paired with their corresponding Prolog query: What is the capital of the state with the biggest population? answer(C, (capital(S,C), largest(P, (state(S), population(S,P))))). What state is Texarkana located in? answer(S, (state(S), eq(C,cityid(texarkana, )), loc(C,S))).
Given a sufficient corpus of such sentence/representation pairs, Chill is able to learn a parser that correctly parses many novel sentences into logical queries.
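Returning briefly to Chillin's compaction step, the empirical-subsumption test defined above can be sketched as follows. This is a deliberately simplified but runnable abstraction: real Chillin decides provability with a Prolog interpreter, whereas here each clause is represented only by the set of positive examples it covers, and all names are invented for illustration.

    # Minimal abstraction of Chillin's empirical-subsumption test.
    # "Provable" here simply means "covered by some clause".

    def provable(clauses, example):
        return any(example in covered for covered in clauses.values())

    def empirically_subsumes(clauses, gen_covered, clause_id, positives):
        """Does a candidate generalisation (covering gen_covered) render
        clause clause_id superfluous, i.e. keep every positive provable?"""
        reduced = {cid: cov for cid, cov in clauses.items() if cid != clause_id}
        reduced["G"] = gen_covered
        return all(provable(reduced, e) for e in positives)

    positives = {"e1", "e2", "e3"}
    clauses = {"C1": {"e1"}, "C2": {"e2"}, "C3": {"e3"}}
    print(empirically_subsumes(clauses, {"e1", "e2"}, "C1", positives))  # True
    print(empirically_subsumes(clauses, {"e1"}, "C2", positives))        # False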
Fig. 2. Chill Performance on Geography Queries (accuracy vs. number of training examples, with curves for Chill and the hand-built Geobase baseline; figure omitted).
2.2 Chill Experimental Results
The corpus used here for evaluating parser acquisition contains 250 questions about U.S. geography paired with Prolog queries. This domain was chosen due to the availability of an existing hand-built natural language interface to a simple geography database containing about 800 facts. The original interface, Geobase, was supplied with Turbo Prolog 2.0 (Borland International, 1988). The questions were collected from uninformed undergraduates and mapped into logical form by an expert. Examples from the corpus were given in the previous section.

For the results below we use the following general methodology. We first choose a random set of 25 test examples, and then learn parsers using increasingly larger subsets of the remaining 225 examples. The parser that is learned from the training data is used to process the test examples, the resulting queries are submitted to the database, the answers are compared to those generated by the correct representation, and the percentage of correct answers is recorded. We repeat this process for ten different random training and test sets. Figure 2 shows the accuracy of Chill's parsers averaged over the 10 trials. The line labeled Geobase shows the average accuracy of the Geobase system on these 10 testing sets of 25 sentences. The curve shows that Chill outperforms the existing system when trained on 125 or more examples. In the best trial, Chill's induced parser, comprising 1100 lines of Prolog code, achieved 84% accuracy in answering novel queries.
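The bookkeeping behind these learning curves can be summarised in a short sketch. Here train_parser and execute are placeholders for Chill itself and for query execution against the geography database, and the particular training-set sizes are illustrative; only the split-train-evaluate-average protocol comes from the text.

    import random

    def learning_curve(examples, train_parser, execute, n_trials=10,
                       test_size=25, steps=(50, 100, 150, 200, 225)):
        """Average accuracy at each training-set size, following the protocol
        described above. 'examples' is a list of (sentence, gold_query) pairs."""
        curves = {s: [] for s in steps}
        for _ in range(n_trials):
            shuffled = random.sample(examples, len(examples))
            test, train = shuffled[:test_size], shuffled[test_size:]
            for s in steps:
                parser = train_parser(train[:s])        # returns a callable parser
                correct = sum(execute(parser(sent)) == execute(gold)
                              for sent, gold in test)
                curves[s].append(100.0 * correct / len(test))
        return {s: sum(accs) / len(accs) for s, accs in curves.items()}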
3 Learning to Extract Information
We have also explored a second complex language learning task: learning information extraction rules.
Table 3. Sample Message and Filled Template

Posting from Newsgroup

    Telecommunications. SOLARIS Systems Administrator. 38-44K. Immediate need

    Leading telecommunications firm in need of an energetic individual to fill
    the following position in the Atlanta office:

    SOLARIS SYSTEMS ADMINISTRATOR
    Salary: 38-44K with full benefits
    Location: Atlanta Georgia, no relocation assistance provided

Filled Template

    computer_science_job
      title: SOLARIS Systems Administrator
      salary: 38-44K
      state: Georgia
      city: Atlanta
      platform: SOLARIS
      area: telecommunications
The goal of an information extraction system is to find specific pieces of information in a natural language document. The specification of the information to be extracted generally takes the form of a template with a list of slots to be filled with substrings from the document (Lehnert & Sundheim, 1991). Information extraction is particularly useful for obtaining a structured database from unstructured documents and is being used for a growing number of Web and Internet applications. Table 3 shows part of the desired template for the task of extracting information from newsgroup postings that would be appropriate for the development of a jobs database.

3.1 Rapier
As with semantic parsing, there are a number of approaches that could be taken to learn to extract information. With Rapier (Califf & Mooney, 1999), we chose to acquire rules that look for patterns of words in documents to determine how to fill slots. Rapier, a bottom-up relational learner, acquires patterns that are similar to regular expressions. They include constraints on the words, part-of-speech tags, and semantic classes of the extracted phrase and its surrounding context.

Rule Representation. Rapier's patterns can make use of limited syntactic and semantic information. For each slot to be filled, one or more extraction rules are needed, describing the conditions under which the slot may be extracted.
Table 4. Sample Rule Learned by Rapier

    Pre-filler:           Filler:                 Post-filler:
    1) tag: {nn,nnp}      1) word: undisclosed    1) sem: price
    2) list: length 2        tag: jj
An extraction rule consists of three parts: 1) a pre-filler pattern that matches the text immediately preceding the filler, 2) a pattern that matches the slot filler, and 3) a post-filler pattern that matches the text immediately following the filler. A pattern is a sequence of elements each of which must match the corresponding text in order for the pattern to match. For example, the last element in the pre-filler pattern must match the word(s) immediately prior to the filler, and the first element in the post-filler pattern must match the word(s) immediately following the filler. There are two types of pattern elements: pattern items and pattern lists. A pattern item matches exactly one word that satisfies its constraints, if any. A pattern list has a length N and matches 0 to N words, each satisfying any constraints provided. Rapier uses three kinds of constraints; it can constrain the specific word, the word's part-of-speech (POS), or the word's semantic class according to WordNet (Fellbaum, 1998). The constraints are thus characterized by disjunctive lists of one or more words, tags, or semantic classes. Table 4 shows a rule constructed by Rapier for extracting the transaction amount from a newswire concerning a corporate acquisition. This rule extracts the word "undisclosed" from phrases such as "sold to the bank for an undisclosed amount" or "paid Honeywell an undisclosed price." The pre-filler pattern consists of two elements: 1) a word whose POS is noun (nn) or proper noun (nnp), followed by 2) a list of at most two unconstrained words. The filler pattern requires the word "undisclosed" tagged as an adjective (jj). The post-filler pattern requires a word in the WordNet semantic category "price."

The Learning Algorithm. Like Chillin, Rapier's learning algorithm is compaction-based and primarily consists of a specific-to-general search. Table 5 gives pseudocode for the basic algorithm. Rapier begins with a most specific definition and compacts it by replacing sets of rules with more general ones. To construct the initial definition, most-specific patterns for each slot are created from each example, specifying words and tags for the filler and its complete context. Thus, the pre-filler pattern contains an item for each word from the beginning of the document to the word immediately preceding the filler, with constraints listing each word and its POS tag. Likewise, the filler pattern has one item for each word in the filler, and the post-filler pattern has one item for each word from the end of the filler to the end of the document. Given this maximally specific rule-base, Rapier attempts to compact the rules for each slot. New rules are created by selecting pairs of existing rules and creating generalizations (like Golem and Chill).
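Before continuing with the learning algorithm, the rule representation just described can be made concrete with a small sketch. It models only pattern items with optional disjunctive constraints (pattern lists and the pre-/post-filler context are omitted for brevity), and the data layout and function names are illustrative assumptions rather than Rapier's actual code.

    # A token is a (word, tag, semantic_class) triple; None means "unconstrained".

    def item_matches(item, token):
        word, tag, sem = token
        return ((item.get("words") is None or word in item["words"]) and
                (item.get("tags") is None or tag in item["tags"]) and
                (item.get("sems") is None or sem in item["sems"]))

    # The filler pattern of the rule in Table 4: the word "undisclosed"
    # tagged as an adjective.
    filler_pattern = [{"words": {"undisclosed"}, "tags": {"jj"}, "sems": None}]

    tokens = [("an", "dt", None), ("undisclosed", "jj", None), ("price", "nn", "price")]
    # Does the filler pattern match starting at position 1?
    print(all(item_matches(it, tok)
              for it, tok in zip(filler_pattern, tokens[1:])))   # True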
Table 5. Rapier Algorithm

For each slot S in the template being learned
    SlotRules = most specific rules for S from examples
    while compaction has failed fewer than CLim times
        RuleList = an empty priority queue of maximum length k
        randomly select M pairs of rules from SlotRules
        find the set L of generalizations of the fillers of each rule pair
        for each pattern P in L
            create a rule NewRule with filler P and empty pre- and post-fillers
            evaluate NewRule and add NewRule to RuleList
        let n = 0
        loop
            increment n
            for each rule CurRule in RuleList
                NewRL = SpecializePreFiller(CurRule, n)
                evaluate rules in NewRL and add to RuleList
            for each rule CurRule in RuleList
                NewRL = SpecializePostFiller(CurRule, n)
                evaluate rules in NewRL and add to RuleList
        until best rule in RuleList produces only valid fillers, or the value of
              the best rule in RuleList has failed to improve over the last Lim
              iterations
        if best rule in RuleList covers no more than an allowable percentage of
           erroneous fillers
            then add it to SlotRules, removing empirically subsumed rules
However, since our pattern language allows for unlimited disjunction, LGGs may be overly specific. Therefore, in cases where the LGG of two constraints is a disjunction, we create two alternative generalizations: the disjunction, and the removal of both constraints. Since patterns consist of a sequence of items, this technique on its own would result in a combinatorial number of potential generalizations, making it intractable to compute generalizations of two initial rules. Thus, we must limit the considered generalizations. Although we do not want to arbitrarily limit the pre-filler or post-filler pattern length, it is likely that the most important parts of the pattern will be close to the slot filler. Therefore, Rapier starts with rules containing generalizations of the filler patterns only, and empty pre-fillers and post-fillers. These are then specialized by adding pattern elements, working outward from the filler, in a kind of top-down beam search. Rapier maintains a priority queue (RuleList) of the best k rules and repeatedly specializes them by adding pieces of the generalizations of the pre-filler and post-filler patterns of the initial rules. The priority queue is ordered using an information gain metric (Quinlan, 1990) weighted by the size of the rule (preferring smaller rules). Whenever the best rule in the queue produces no erroneous fillers when matched against the training texts, specialization ceases and the rule is added to the final rule base, replacing any more specific rules that it renders superfluous.
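The treatment of a single disjunctive constraint mentioned at the start of this paragraph can be pictured as follows. This is a sketch under the assumption that a constraint is a set of allowed values (empty meaning unconstrained), not Rapier's implementation.

    def generalize_constraint(c1, c2):
        """Return candidate generalisations of two disjunctive constraints,
        mirroring the two alternatives described in the text: the disjunction,
        or dropping the constraint entirely."""
        if c1 == c2:
            return [c1]
        if not c1 or not c2:          # either side already unconstrained
            return [set()]
        return [c1 | c2, set()]

    print(generalize_constraint({"nn"}, {"nnp"}))   # the disjunction and set()
    print(generalize_constraint({"jj"}, {"jj"}))    # identical: kept as is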
Specialization of the pre-fillers and post-fillers is abandoned if the value of the best rule does not improve across a predetermined number (Lim) of specialization iterations. SpecializePreFiller and SpecializePostFiller create specializations of CurRule using the n items from the context preceding or following the filler. Finally, compaction of the rule base for each slot is abandoned when the compaction algorithm fails to produce a compacting rule for more than a predefined number (CLim) of successive iterations.

3.2 Rapier Experimental Results
Rapier has been tested on three data sets, though we restrict our discussion here to extracting information about computer-related jobs from netnews (austin.jobs) postings, as was illustrated in Table 3. A full template contains 17 slots including information about employer, location, salary, and job requirements. The slots vary in their applicability to different postings. Relatively few postings provide salary information, while most provide information about the job's location. A number of the slots may have more than one filler; for example, there are slots for the platform(s) and language(s) that the prospective employee will use. The corpus consists of 300 annotated postings, and training and test sets were generated using 10-fold cross-validation. Learning curves were generated by training on subsets of the training data.

In information extraction, the standard measurements of performance are precision (the percentage of items that the system extracted which should have been extracted) and recall (the percentage of items that the system should have extracted which it did extract). In order to analyze the effect of different types of knowledge sources on the results, three different versions of Rapier were tested. The first version used words, POS tags as assigned by Brill's tagger (Brill, 1994), and WordNet semantic classes. The other two versions are ablations, one using words and tags, the other words only. The curves also include results from another information extraction learning system, a Naive Bayes system that uses words in a fixed-length window to locate slot fillers (Freitag, 1998).

Figures 3 and 4 show the learning curves for precision and recall, respectively. Clearly, the Naive Bayes system does not perform well on this task, although it has been shown to be fairly competitive in other domains. It performs well on some slots but quite poorly on many others, especially those which usually have multiple fillers. In order to compare at reasonably similar levels of recall (although Naive Bayes' recall is still considerably less than Rapier's), Naive Bayes' threshold was set low, accounting for the low precision. Of course, setting the threshold to obtain high precision leads to even lower recall. These results clearly indicate the advantage of relational learning, since a simpler fixed-context representation such as that used by Naive Bayes appears insufficient to produce a useful system. In contrast with Naive Bayes, Rapier's precision is quite high, over 89% for both words only and words with POS tags. This fact is not surprising, since the bias of the bottom-up algorithm is for specific rules.
Fig. 3. Precision Performance on Job Postings (precision vs. number of training examples for Rapier, Rapier with words and tags, Rapier with words only, and Naive Bayes; figure omitted).
High precision is important for such tasks, where having correct information in the database is generally more important than extracting a greater amount of less-reliable information (e.g., recall, though respectable at just over 60%, is quite a bit lower than precision). Also, the learning curve is quite steep. Rapier is apparently quite effective at making maximal use of a small number of examples. The precision curve flattens out quite a bit as the number of examples increases; however, recall is still rising, though slowly, at 270 examples. In looking at the performance of the three versions of Rapier, an obvious conclusion is that word constraints provide most of the power. Although POS and semantics can provide useful classes that capture important generalities, these classes can be implicitly learned from the words alone given enough examples. POS tags do improve performance at lower numbers of examples. Apparently, though, by 270 examples the word constraints are capable of representing the concepts provided by the POS tags, and any differences between words and words plus tags are not statistically significant. WordNet's semantic classes provided no significant performance increase over words and POS tags only. Soderland applied another learning system, Whisk (Soderland, 1999), to this data set. In a 10-fold cross-validation over 100 documents randomly selected from the data set, Whisk achieved 85% precision and 55% recall. This is slightly worse than Rapier's performance at 90 examples with POS tags, which is 86% precision and 60% recall.
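The metrics used throughout these experiments, precision, recall, and the F-measure used later in the chapter, can be computed from the extracted and gold-standard fillers. This is the standard definition rather than code from Rapier; the pair-of-(slot, filler) representation is an assumption made for the example.

    def precision_recall_f(extracted, gold):
        """extracted and gold are collections of (slot, filler) pairs."""
        extracted, gold = set(extracted), set(gold)
        correct = len(extracted & gold)
        precision = correct / len(extracted) if extracted else 0.0
        recall = correct / len(gold) if gold else 0.0
        f = (2 * precision * recall / (precision + recall)
             if precision + recall else 0.0)
        return precision, recall, f

    print(precision_recall_f(
        {("city", "Atlanta"), ("state", "Georgia"), ("salary", "40K")},
        {("city", "Atlanta"), ("state", "Georgia"), ("platform", "SOLARIS")}))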
4 Active Learning
The research and experimental results presented so far show promise in reducing the effort needed to build intelligent language processing systems. However, the annotation effort for both tasks is still quite high. We now turn to our attempts to reduce this annotation effort further by using active learning methods.
Fig. 4. Recall Performance on Job Postings (recall vs. number of training examples for the same four systems as in Fig. 3; figure omitted).
Because of the relative ease of obtaining on-line text, we focus on selective sampling methods of active learning. In this case, learning begins with a small pool of annotated examples and a large pool of unannotated examples, and the learner attempts to choose the most informative additional examples for annotation. Existing work in the area has emphasized two approaches: certainty-based methods (Lewis & Catlett, 1994) and committee-based methods (Freund, Seung, Shamir, & Tishby, 1997; Liere & Tadepalli, 1997; Dagan & Engelson, 1995; Cohn et al., 1994). In the certainty-based paradigm, a system is trained on a small number of annotated examples to learn an initial classifier. Next, the system examines unannotated examples and attaches certainties to the classifier's predicted annotation of those examples. The k examples with the lowest certainties are then presented to the user for annotation and retraining. Many methods for attaching certainties have been used, but they typically attempt to estimate the probability that a classifier consistent with the prior training data will classify a new example correctly. In the committee-based paradigm, a diverse committee of classifiers is created, again from a small number of annotated examples. Next, each committee member attempts to label additional examples. The examples whose annotation results in the most disagreement amongst the committee members are presented to the user for annotation and retraining. A diverse committee, consistent with the prior training data, will produce the highest disagreement on examples whose label is most uncertain with respect to the possible classifiers that could be obtained by training on that data. Table 6 presents pseudocode for both certainty-based and committee-based selective sampling.
Table 6. Selective Sampling Algorithm

Apply the learner to n bootstrap examples, creating one classifier or a committee of them.
Until there are no more examples or the annotator is unwilling to label more examples, do:
    Use the most recently learned classifier/committee to annotate each unlabeled instance.
    Find the k instances with the lowest annotation certainty / most disagreement amongst
        committee members.
    Annotate these instances.
    Train the learner on the bootstrap examples and all examples annotated to this point.
In an ideal situation, the batch size k would be set to one, so that every selection can take the most recent information into account, but for efficiency in retraining batch learning algorithms it is frequently set higher. Results on a number of classification tasks have demonstrated that this general approach is effective in reducing the need for labeled examples (see citations above). As we noted earlier in the chapter, both Chill and Rapier combine low-level decisions to annotate a complete example. Applying certainty-based sample selection to both of these systems requires determining the certainty of a complete annotation of an example, despite the fact that individual learned rules perform only part of the overall annotation task. Therefore, our certainty-based active learning methods take advantage of this by combining low-level annotation certainties to estimate certainty at the higher level of the examples to be annotated. Since neither system learns rules with explicit uncertainty parameters, simple metrics based on coverage of training examples are used to assign certainties to rule-based decisions. Our current work has primarily explored certainty-based approaches, although using committee-based approaches for both tasks is a topic for future research.
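The certainty-based case of the algorithm in Table 6 might look as follows; learn, certainty, and annotate are placeholders for the underlying learner, its certainty estimate, and the human annotator, so this sketch fixes only the selection loop, not any of the systems discussed here.

    def certainty_based_sampling(bootstrap, unlabeled, learn, certainty,
                                 annotate, batch_size=25, rounds=8):
        """Selective sampling sketch, following Table 6 (certainty-based case)."""
        labeled = list(bootstrap)
        unlabeled = list(unlabeled)
        model = learn(labeled)
        for _ in range(rounds):
            if not unlabeled:
                break
            # Pick the batch_size least certain unlabeled examples.
            unlabeled.sort(key=lambda ex: certainty(model, ex))
            batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
            labeled.extend(annotate(ex) for ex in batch)
            model = learn(labeled)      # retrain on everything labeled so far
        return model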
5 Active Learning for Parsing

5.1 Estimating Certainty: Chill
Chill combines the results of individual parse operators to annotate a complete sentence. Our general approach to certainty-based active learning is complicated slightly in Chill by the fact that the current learned parser may get stuck and not complete a parse for an unseen sentence. This can happen because a control rule learned for an operator may be overly specific, preventing its correct application, or because an operator required for parsing the sentence may not have been needed for any of the training examples, so the parser does not even include it. If a sentence cannot be parsed, its annotation is obviously very uncertain and it is therefore a good candidate for selection. However, there are often more unparsable sentences than the batch size (k), so we must distinguish between them to find the most useful ones for the next round of training.
This is done by counting the maximum number of sequential operators successfully applied while attempting to parse the sentence and dividing by the number of words in the sentence, giving an estimate of how close the parser came to completing a parse. Sentences with a lower value for this metric are preferred for annotation. If the number of unparsable examples is less than k, the remaining examples selected for annotation are chosen from the parsable ones. A certainty for each parse, and thus each sentence, is obtained by considering the sequence of operators applied to produce it. Recall that the control rules for each operator are induced from positive and negative examples of the contexts in which the operator should be applied. As a simple approximation, the number of examples used to induce the specific control rule used to select an operator is used as a measure of the certainty of that parsing decision. We believe this is a reasonable certainty measure for rule learning since, as shown by Holte, Acker, and Porter (1989), small disjuncts (rules that correctly classify few examples) are more error prone than large ones. We then average this certainty over all operators used in the parse to obtain the metric used to rank the example. To increase the diversity of examples included in a given batch, we do not include sentences that vary only in known names for database constants (e.g., city names) from already chosen examples, nor sentences that contain a subset of the words present in an already chosen sentence.
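A sketch of this selection criterion follows; the names and the assumed parse_result record are invented for illustration and are not Chill's code. Sorting candidate sentences by this key in ascending order puts unparsable sentences first, as described above.

    def chill_selection_key(sentence_words, parse_result):
        """Key for ranking candidate sentences: unparsable sentences come first,
        ranked by how far the parser got; parsable ones are ranked by the average
        number of training examples behind the control rules used."""
        if not parse_result.completed:
            progress = parse_result.operators_applied / max(len(sentence_words), 1)
            return (0, progress)                    # unparsable: less progress first
        counts = parse_result.rule_example_counts   # one count per operator used
        return (1, sum(counts) / len(counts))       # parsable: smaller average first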
5.2 Active Learning Experimental Results: Chill
Our active learning experimental methodology is as follows. As for regular learning, for each trial a random set of test examples is used and the system is trained on subsets of the remaining examples. First, n bootstrap examples are randomly selected from the training examples; then, in each step of active learning, the best k of the remaining examples are selected and added to the training set. The result of learning on this set is evaluated after each round. When comparing to random sampling, the k examples in each round are chosen randomly. We again used the geography query corpus for our experiments. Test examples were chosen independently for 10 trials with n = 25 bootstrap examples and a batch size of k = 25. The results are shown in Figure 5, where Chill refers to random sampling, Chill+Active refers to sample selection, and Geobase refers to the hand-built benchmark. Initially, the advantage of sample selection is small, since there is insufficient information to make an intelligent choice of examples; but after 100 examples, the advantage becomes clear. Eventually, the training set becomes exhausted, the active learner has no choice in picking the remaining examples, and both approaches use the full training set and converge to the same performance. However, the number of examples required to reach this level is significantly reduced when using active learning. To get within 5% of the maximum accuracy requires 125 selected examples but 175 random examples, a savings of 29%. Also, to surpass the performance of Geobase requires fewer than 100 selected examples versus 125 random examples, a savings of 20%.
Fig. 5. Active Learning Results for Geography Corpus (accuracy vs. number of training examples for Chill+Active, Chill with random sampling, and the Geobase baseline; figure omitted).
Finally, according to a t-test, the differences between active and random choice at 125 and 175 training examples are statistically significant at the .05 level or better. We also ran experiments on a larger, more diverse corpus of geography queries, where additional examples were collected from undergraduate students in an introductory AI course. The first set of questions was collected from students in introductory German, with no instructions on the complexity of queries desired. The AI students tended to ask more complex and diverse queries: their task was to give 5 interesting questions and the associated logical form for a homework assignment. There were 221 new sentences, for a total of 471. This data was split into 425 training sentences and 46 test sentences, for 10 random splits. For this corpus, we used n = 50 and k = 25. The results are shown in Figure 6. Here, the savings with active learning is about 150 examples to reach an accuracy close to the maximum, or about a 35% annotation savings. The curve for selective sampling does not reach 425 examples because of our elimination of sentences that vary only in database names and those that contain a subset of the words present in an already chosen sentence. Obviously this is a more difficult corpus (note that the figure does not go to 100%), but active learning is still able to choose examples that allow a significant annotation cost saving.
6 Active Learning for Information Extraction

6.1 Estimating Certainty: Rapier
A similar approach to certainty-based sample selection was used with Rapier. Like semantic parsing, information extraction is not a classification task, although, like parsing in Chill, it can be mapped to a series of classification subproblems (Freitag, 1998; Bennett, Aone, & Lovell, 1997).
Fig. 6. Parser Acquisition Results for a Larger Geography Corpus (accuracy vs. number of training examples for Chill+Active, Chill, and Geobase; figure omitted).
However, Rapier does not approach the problem in this manner, and in any case, the example annotations provided by the user are in the form of filled templates, not class labels. We therefore must estimate template certainties for entire documents by combining rule certainties. A simple notion of the certainty of an individual extraction rule is based on its coverage of the training data: pos − 5 · neg, where pos is the number of correct fillers generated by the rule and neg is the number of incorrect ones. Again, "small disjuncts" that account for few examples are deemed less certain. Also, since Rapier, unlike Chill, prunes rules to prevent overfitting, they may generate spurious fillers for the training data; therefore, a significant penalty is included for such errors. Given this notion of rule certainty, Rapier determines the certainty of a filled slot for an example being considered for annotation. In the case where a single rule finds a slot filler, the certainty for the slot is the certainty of the rule that filled it. However, when multiple slot-fillers are found, the certainty of the slot is the minimum of the certainties of the rules that produced these fillers. The minimum is chosen since we want to focus attention on the least certain rules and find examples that either confirm or deny them. A final consideration is determining the certainty of an empty slot. In some tasks, some slots are empty a large percentage of the time. For example, in the jobs domain, the salary is present less than half the time. On the other hand, some slots are always (or almost always) filled, and the absence of fillers for such slots should decrease confidence in an example's labeling. Consequently, we record the number of times a slot appears in the training data with no fillers and use that count as the confidence of the slot when no filler for it is found.
Fig. 7. Information Extraction Results for Job Postings (F-measure vs. number of training examples for Rapier with random sampling and Rapier+Active; figure omitted).
Once the confidence of each slot has been determined, the confidence of an example is found by summing the confidence of all slots. Finally, in order to allow for the more desirable option of actively selecting a single example at a time (k = 1), an incremental version of Rapier was created. This version still requires remembering all training examples but reuses and updates existing rules as new examples are added. The resulting system incrementally incorporates new training examples reasonably efficiently, allowing each chosen example to immediately affect the result and therefore the choice of the next example.
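Putting the pieces of this certainty estimate together, a sketch might look as follows. Only the pos − 5 · neg score, the minimum over rules for multiply-filled slots, the empty-slot counts, and the final sum come from the text; the data layout and names are assumptions made for illustration.

    def rule_certainty(pos, neg):
        """Coverage-based certainty of a single extraction rule."""
        return pos - 5 * neg

    def example_confidence(slot_fillers, empty_slot_counts):
        """slot_fillers maps each slot to the certainties of the rules that
        produced its fillers (an empty list if no filler was found);
        empty_slot_counts gives, per slot, how often it was empty in training."""
        total = 0.0
        for slot, certainties in slot_fillers.items():
            if certainties:
                total += min(certainties)            # least certain rule dominates
            else:
                total += empty_slot_counts.get(slot, 0)
        return total

    print(example_confidence(
        {"salary": [], "city": [rule_certainty(12, 0)], "platform": [rule_certainty(3, 1)]},
        {"salary": 160, "city": 2, "platform": 40}))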
6.2 Active Learning Experimental Results: Rapier
For the active learning results on the jobs corpus, we measured performance at 10-example intervals. The results for random sampling were measured less frequently. There were n = 10 bootstrap examples, and subsequent examples were selected one at a time from the remaining 260 examples. We used the words-only version of Rapier for our experiments, since POS tags and semantic classes had shown minimal advantages on this corpus. In order to combine precision and recall measurements and simplify comparisons, we present the usual F-measure: F = (2 · precision · recall)/(precision + recall). It is possible to weight the F-measure to prefer recall or precision, but we weight them equally here. Figure 7 shows the results, where Rapier uses random sampling and Rapier+Active uses selective sampling. From 30 examples on, Rapier+Active consistently outperforms Rapier. The difference between the curves is not large, but it does represent a large reduction in the number of examples required to achieve a given level of performance. At 150 examples, the average F-measure for active learning is 74.56, the same as the average F-measure with 270 random examples.
This represents a savings of 120 examples, or 44%. The differences in performance at 120 and 150 examples are significant at the 0.01 level according to a two-tailed paired t-test. The curve with selective sampling does not go all the way to 270 examples because, once the performance of 270 randomly chosen examples is reached, the information available in the data set has been exploited and the curve simply levels off as the less useful examples are added.
7 Related Work
Treating language acquisition as a control-rule learning problem is not in itself a new idea. Berwick (1985) used this approach to learn grammar rules for a Marcus-style deterministic parser. When the system came to a parsing impasse, a new rule was created by inferring the correct parsing action and using certain properties of the current parser state as a trigger condition for its application. In a similar vein, Simmons and Yu (1992) controlled a simple shift-reduce parser by storing example contexts consisting of the syntactic categories of a fixed number of stack and input buffer locations. More recently, some probabilistic parsing methods have also employed frameworks for learning to prefer parsing operators based on context (Briscoe & Carroll, 1993; Magerman, 1995). However, these systems all use feature-vector representations that have only a limited, fixed-size access to the parsing context.

Space does not permit a survey of all learning systems for information extraction, so we mention only the three most closely related systems. First, two systems have recently been developed with goals very similar to Rapier's. These are both relational learning systems that do not depend on syntactic analysis. Their representations and algorithms, however, differ significantly from each other and from Rapier. SRV (Freitag, 2000) employs a top-down, set-covering rule learner similar to Foil (Quinlan, 1990). The second system is Whisk (Soderland, 1999), which like Rapier uses pattern-matching, employing a restricted form of regular expressions. It can also make use of semantic classes and the results of syntactic analysis. The learning algorithm is a covering algorithm; rule induction begins with the selection of a single seed example and then creates rules top-down, restricting the choice of terms to be added to those appearing in the seed example. For its active learning component, Whisk uses an unusual form of selective sampling. Rather than using certainties or committees, Whisk divides the pool of unannotated instances into three classes: 1) those covered by an existing rule, 2) those that are near misses of a rule, and 3) those not covered by any rule. The system then randomly selects a set of new examples from each of the three classes and adds them to the training set. Soderland shows that this method significantly improves performance in a management succession domain. Also requiring mention is work on learning information extraction and text categorization rules using ILP (Junker, Sintek, & Rinck, 2000).
Unlike Rapier and the two systems just mentioned, which use text-specific representations and algorithms informed by ILP methods, that work uses a logic representation with an algorithm focused on text. Comparisons to other work are not yet possible, since they present no results. With respect to active learning in general, Cohn et al. (1994) were among the first to discuss certainty-based active learning methods in detail. They focus on a neural network approach to actively searching a version space of concepts. Liere and Tadepalli (1997) apply active learning with committees to the problem of text categorization. They show improvements with active learning similar to those that we obtain, but use a committee of Winnow-based learners on a traditional classification task. Dagan and Engelson (1995) also apply committee-based learning to part-of-speech tagging. In their work, a committee of hidden Markov models is used to select examples for annotation. Finally, Lewis and Catlett (1994) use heterogeneous certainty-based methods, in which a simple classifier is used to select examples that are then annotated and presented to a more powerful classifier. Again, their methods are applied to text classification.
8 Future Work and Conclusions
We are encouraged by our success to date in reducing annotation costs for systems that learn to perform natural language tasks. The basic systems perform quite well, though some improvements are still being explored. With respect to active learning, though, further experiments on additional corpora are needed to test the ability of these approaches to reduce annotation costs in a variety of domains. Our current results have involved a certainty-based approach; however, proponents of committee-based approaches have convincing arguments for their theoretical advantages. Our initial attempts at adapting committee-based approaches to our systems were not very successful, and additional research on this topic is needed. One critical problem is obtaining diverse committees that properly sample the version space (Cohn et al., 1994). Although they seem to work quite well, the certainty metrics used in both Chill and Rapier are quite simple and somewhat ad hoc. A more principled approach based on learning probabilistic models of parsing and information extraction could perhaps result in better estimates of certainty and therefore improved sample selection.

In conclusion, we have described two language learning systems that use methods from inductive logic programming, and improvements of these systems with active learning. First, Chill uses control-rule learning to construct semantic parsers from a corpus of sentence/parse pairs and an initial parsing framework. Second, Rapier uses relational learning to construct unbounded pattern-match rules for information extraction given a database of texts and filled templates. The learned patterns employ limited syntactic and semantic information to identify potential slot fillers and their surrounding context. Both systems make multiple choices in annotating their input, and these decisions are combined to form the annotation for a complete example.
This characteristic is exploited by the active learning components of both, which use certainties in lower-level decisions to obtain a certainty for an entire example. Our results on realistic corpora for semantic parsing and information extraction indicate that example savings as high as 44% can be achieved by employing sample selection using only simple certainty measures for predictions on unannotated data. Improved sample selection methods and applications to other important language problems hold the promise of continued progress in using machine learning to construct effective natural language processing systems.

Acknowledgements

Thanks to Ray Mooney for comments on earlier drafts of this chapter. Thanks to Dayne Freitag for supplying his seminar announcements data. This research was supported by the National Science Foundation under grants IRI-9310819 and IRI-9704943. The second author was further supported by a fellowship from AT&T.
References

1. Bennett, S., Aone, C., & Lovell, C. (1997). Learning to tag multilingual texts through observation. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, pp. 109–116.
2. Berwick, B. (1985). The Acquisition of Syntactic Knowledge. MIT Press, Cambridge, MA.
3. Borland International (1988). Turbo Prolog 2.0 Reference Guide. Borland International, Scotts Valley, CA.
4. Brill, E. (1994). Some advances in rule-based part of speech tagging. In Proceedings of the Eleventh National Conference on Artificial Intelligence, pp. 722–727, Washington, D.C.
5. Briscoe, T., & Carroll, J. (1993). Generalized probabilistic LR parsing of natural language (corpora) with unification-based grammars. Computational Linguistics, 19(1), 25–59.
6. Califf, M., & Mooney, R. (1999). Relational learning of pattern-match rules for information extraction. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, pp. 328–334, Orlando, FL.
7. Cohn, D., Atlas, L., & Ladner, R. (1994). Improving generalization with active learning. Machine Learning, 15(2), 201–221.
8. Dagan, I., & Engelson, S. P. (1995). Committee-based sampling for training probabilistic classifiers. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 150–157, San Francisco, CA. Morgan Kaufmann.
9. Fellbaum, C. (1998). WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.
10. Fillmore, C. J. (1968). The case for case. In Bach, E., & Harms, R. T. (Eds.), Universals in Linguistic Theory. Holt, Rinehart and Winston, New York.
11. Freitag, D. (2000). Machine learning for information extraction in informal domains. Machine Learning, 39(2/3), 169–202.
12. Freitag, D. (1998). Multi-strategy learning for information extraction. In Proceedings of the Fifteenth International Conference on Machine Learning, pp. 161–169.
13. Freund, Y., Seung, H. S., Shamir, E., & Tishby, N. (1997). Selective sampling using the query by committee algorithm. Machine Learning, 28, 133–168.
14. Holte, R. C., Acker, L., & Porter, B. (1989). Concept learning and the problem of small disjuncts. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, pp. 813–818, Detroit, MI.
15. Junker, M., Sintek, M., & Rinck, M. (2000). Learning for text categorization and information extraction with ILP. In this volume.
16. Lehnert, W., & Sundheim, B. (1991). A performance evaluation of text-analysis technologies. AI Magazine, 12(3), 81–94.
17. Lewis, D. D., & Catlett, J. (1994). Heterogeneous uncertainty sampling for supervised learning. In Proceedings of the Eleventh International Conference on Machine Learning, pp. 148–156, New Brunswick, NJ. Morgan Kaufmann.
18. Liere, R., & Tadepalli, P. (1997). Active learning with committees for text categorization. In Proceedings of the Fourteenth National Conference on Artificial Intelligence, pp. 591–596, Providence, RI.
19. Magerman, D. M. (1995). Statistical decision-tree models for parsing. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pp. 276–283, Cambridge, MA.
20. Muggleton, S., & Feng, C. (1990). Efficient induction of logic programs. In Proceedings of the First Conference on Algorithmic Learning Theory, Ohmsha, Tokyo, Japan.
21. Plotkin, G. D. (1970). A note on inductive generalization. In Meltzer, B., & Michie, D. (Eds.), Machine Intelligence (Vol. 5). Elsevier North-Holland, New York.
22. Quinlan, J. (1990). Learning logical definitions from relations. Machine Learning, 5(3), 239–266.
23. Simmons, R. F., & Yu, Y. (1992). The acquisition and use of context dependent grammars for English. Computational Linguistics, 18(4), 391–418.
24. Soderland, S. (1999). Learning information extraction rules for semi-structured and free text. Machine Learning, 34, 233–272.
25. Zelle, J. M., & Mooney, R. J. (1996). Learning to parse database queries using inductive logic programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, Portland, OR.
Author Index
Adriaans, Pieter 127
Andrade Lopes, Alneu de 170
Boström, Henrik 237
Brill, Eric 49
Califf, Mary Elaine 279
Cussens, James 3, 143
Džeroski, Sašo 3, 69
Eineborg, Martin 157
Erjavec, Tomaž 69
Haas, Erik de 127
Jorge, Alípio 170
Junker, Markus 247
Kazakov, Dimitar 89
Lindberg, Nikolaj 157
Manandhar, Suresh 3, 218
Mooney, Raymond J. 57
Nedellec, Claire 259
Nerbonne, John 110
Osborne, Miles 184
Pulman, Stephen 143
Riezler, Stefan 199
Rinck, Matthias 247
Sintek, Michael 247
Thompson, Cynthia A. 36, 279
Tjong Kim Sang, Erik F. 110
Watkinson, Stephen 218