Computational Approaches to Morphology and Syntax
OXFORD SURVEYS IN SYNTAX AND MORPHOLOGY

General editor: Robert D. Van Valin, Jr, Heinrich-Heine Universität Düsseldorf & University at Buffalo, The State University of New York

Advisory editors: Guglielmo Cinque, University of Venice; Daniel Everett, Illinois State University; Adele Goldberg, Princeton University; Kees Hengeveld, University of Amsterdam; Caroline Heycock, University of Edinburgh; David Pesetsky, MIT; Ian Roberts, University of Cambridge; Masayoshi Shibatani, Rice University; Andrew Spencer, University of Essex; Tom Wasow, Stanford University

Published
1. Grammatical Relations, Patrick Farrell
2. Morphosyntactic Change, Olga Fischer
3. Information Structure: the Syntax-Discourse Interface, Nomi Erteschik-Shir
4. Computational Approaches to Morphology and Syntax, Brian Roark and Richard Sproat
5. Constituent Structure, Andrew Carnie

In preparation
The Acquisition of Syntax and Morphology, Shanley Allen and Heike Behrens
The Processing of Syntax and Morphology, Ina Bornkessel-Schlesewsky and Matthias Schlesewesky
Morphology and the Lexicon, Daniel Everett
Grammatical Relations, revised second edition, Patrick Farrell
The Phonology–Morphology Interface, Sharon Inkelas
The Syntax–Semantics Interface, Jean-Pierre Koenig
Complex Sentences, Toshio Ohori
Syntactic Categories, Gisa Rauh
Language Universals and Universal Grammar, Anna Siewierska
Argument Structure: The Syntax–Lexicon Interface, Stephen Weschler
Computational Approaches to Morphology and Syntax

BRIAN ROARK AND RICHARD SPROAT
Great Clarendon Street, Oxford ox2 6dp Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide in Oxford New York Auckland Cape Town Dar es Salaam Hong Kong Karachi Kuala Lumpur Madrid Melbourne Mexico City Nairobi New Delhi Shanghai Taipei Toronto With offices in Argentina Austria Brazil Chile Czech Republic France Greece Guatemala Hungary Italy Japan Poland Portugal Singapore South Korea Switzerland Thailand Turkey Ukraine Vietnam Oxford is a registered trademark of Oxford University Press in the UK and in certain other countries Published in the United States by Oxford University Press Inc., New York © Brian Roark, Richard Sproat 2007 The moral rights of the authors have been asserted Database right Oxford University Press (maker) First published 2007 All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above You must not circulate this book in any other binding or cover and you must impose the same condition on any acquirer British Library Cataloguing in Publication Data Data available Library of Congress Cataloging in Publication Data Data available Typeset by SPI Publisher Services, Pondicherry, India Printed in Great Britain on acid-free paper by Biddles Ltd., King’s Lynn, Norfolk ISBN 978–0–19–927477–2 (Hbk.) 978–0–19–927478–9 (Pbk.) 1 3 5 7 9 10 8 6 4 2
Contents

General Preface
Preface
List of Figures
List of Tables
Abbreviations

1. Introduction and Preliminaries
   1.1. Introduction
   1.2. Finite-State Automata and Transducers
   1.3. Weights and Probabilities
   1.4. Weighted Finite-State Automata and Transducers
   1.5. A Synopsis of Algorithmic Issues
   1.6. Computational Approaches to Morphology and Syntax

PART I. COMPUTATIONAL APPROACHES TO MORPHOLOGY

2. The Formal Characterization of Morphological Operations
   2.1. Introduction
   2.2. Syntagmatic Variation
      2.2.1. Simple Concatenation
      2.2.2. Interlude: Prosodic Circumscription
      2.2.3. Prosodically Governed Concatenation
      2.2.4. Phonological Changes Induced by Affixation
      2.2.5. Subsegmental Morphology
      2.2.6. Subtractive Morphology
      2.2.7. Extrametrical Infixation
      2.2.8. Positively Circumscribed Infixation
      2.2.9. Root-and-Pattern Morphology
      2.2.10. Morphomic Components
   2.3. Paradigmatic Variation
   2.4. The Remaining Problem: Reduplication
   2.5. Summary

3. The Relevance of Computational Issues for Morphological Theory
   3.1. Introduction: Realizational versus Incremental Morphology
   3.2. Stump's Theory
   3.3. Computational Implementation of Fragments
      3.3.1. Stem Alternations in Sanskrit
      3.3.2. Position Classes in Swahili
      3.3.3. Double Plurals in Breton
   3.4. Equivalence of Inferential–Realizational and Lexical–Incremental Approaches: A Formal Analysis
   3.5. Conclusions
   Appendix 3A: Lextools
   Appendix 3B: XFST Implementation of Sanskrit

4. A Brief History of Computational Morphology
   4.1. Introduction
   4.2. The KIMMO Two-Level Morphological Analyzer
      4.2.1. KIMMO Basics
      4.2.2. FST Intersection
      4.2.3. Koskenniemi's Rule Types
      4.2.4. Koskenniemi's System as a Historical Accident
   4.3. Summary

5. Machine Learning of Morphology
   5.1. Introduction
   5.2. Goldsmith, 2001
      5.2.1. Candidate Generation
      5.2.2. Candidate Evaluation
   5.3. Schone and Jurafsky, 2001
   5.4. Yarowsky and Wicentowski, 2001
   5.5. Discussion

PART II. COMPUTATIONAL APPROACHES TO SYNTAX

6. Finite-state Approaches to Syntax
   6.1. N-gram Models
      6.1.1. Background
      6.1.2. Basic Approach
      6.1.3. Smoothing
      6.1.4. Encoding
      6.1.5. Factored Language Models
   6.2. Class-based Language Models
      6.2.1. Forward Algorithm
   6.3. Part-of-Speech Tagging
      6.3.1. Viterbi Algorithm
      6.3.2. Efficient N-best Viterbi Decoding
      6.3.3. Forward–backward Algorithm
      6.3.4. Forward–backward Decoding
      6.3.5. Log-linear Models
   6.4. NP Chunking and Shallow Parsing
   6.5. Summary

7. Basic Context-free Approaches to Syntax
   7.1. Grammars, Derivations and Trees
   7.2. Deterministic Parsing Algorithms
      7.2.1. Shift-reduce Parsing
      7.2.2. Pushdown Automata
      7.2.3. Top-down and Left-corner Parsing
   7.3. Non-deterministic Parsing Algorithms
      7.3.1. Re-analysis and Beam-search
      7.3.2. CYK Parsing
      7.3.3. Earley Parsing
      7.3.4. Inside-outside Algorithm
      7.3.5. Labeled Recall Parsing
   7.4. Summary

8. Enriched Context-free Approaches to Syntax
   8.1. Stochastic CFG-based Parsing
      8.1.1. Treebanks and PCFGs
      8.1.2. Lexicalized Context-free Grammars
      8.1.3. Collins Parser
      8.1.4. Charniak Parser
   8.2. Dependency Parsing
   8.3. PCFG-based Language Models
   8.4. Unsupervised Grammar Induction
   8.5. Finite-state Approximations
   8.6. Summary

9. Context-sensitive Approaches to Syntax
   9.1. Unification Grammars and Parsing
   9.2. Lexicalized Grammar Formalisms and Parsing
      9.2.1. Tree-adjoining Grammars
      9.2.2. Combinatory Categorial Grammars
      9.2.3. Other Mildly Context-sensitive Approaches
      9.2.4. Finite-state and Context-free Approximations
   9.3. Parse Selection
      9.3.1. Stochastic Unification Grammars
      9.3.2. Data-oriented Parsing
      9.3.3. Context-free Parser Re-ranking
   9.4. Transduction Grammars
   9.5. Summary

References
Name Index
Language Index
Index
General Preface

Oxford Surveys in Syntax and Morphology provides overviews of the major approaches to subjects and questions at the centre of linguistic research in morphology and syntax. The volumes are accessible, critical, and up-to-date. Individually and collectively they aim to reveal the field's intellectual history and theoretical diversity. Each book published in the series will characteristically contain: (1) a brief historical overview of relevant research in the subject; (2) a critical presentation of approaches from relevant (but usually seen as competing) theoretical perspectives to the phenomena and issues at hand, including an objective evaluation of the strengths and weaknesses of each approach to the central problems and issues; (3) a balanced account of the current issues, problems, and opportunities relating to the topic, showing the degree of consensus or otherwise in each case. The volumes will thus provide researchers and graduate students concerned with syntax, morphology, and related aspects of semantics with a vital source of information and reference.

The current volume, Computational Approaches to Morphology and Syntax by Brian Roark and Richard Sproat, is the first volume in the series to examine analyses of morphology and syntax from outside of linguistics proper, in this case from computer science. The discussion presupposes a basic background in computational linguistics and not only surveys various computational procedures but also draws out their implications for morphological and syntactic theories.

Robert D. Van Valin, Jr
General Editor
Heinrich Heine University, Düsseldorf
University at Buffalo, The State University of New York
Preface This book provides an introduction to the current state-of-the-art in computational modeling of syntax and morphology, with a particular focus on the computational models that are used for these problems. As such it is not intended as a “cookbook”. The reader should not in general expect to come away knowing how a particular system is implemented; rather he or she should expect to understand the general algorithms that are involved in implementing such a system. So, for example, in our treatment of computational models of morphology, we shall focus largely on the regular operations that are involved in describing morphological phenomena. In principle there are a number of different ways that one might implement a morphological analyzer for a particular language, so rather than focus on such particulars, we prefer to give the reader an understanding of the formal operations that any such system would either implement directly or be formally equivalent to. Where possible we shall attempt to show the relevance of computational models to theoretical questions in linguistics. There has traditionally been a bit of a disconnect between theoretical and computational linguistics. Computational linguists have often been uninterested in the subtler questions that trouble theoretical linguists, while theoretical linguists have often been unimpressed by computational arguments for one or another position. This state of affairs is poised to change as larger and larger sources of linguistic data become available for linguists to test theoretical claims, and more and more well-designed linguistic features are needed for ever more sophisticated machine learning algorithms. Needless to say we have not been able to cover all topics that are relevant to the theme of the book. In most cases, we hope, this is not because we are ignorant of said topics, but merely because we have chosen to emphasize certain aspects of the problem over others. Where possible and relevant, we have justified such omissions. Computational linguistics is an interdisciplinary field. As anyone who has taught an introductory course on this topic can attest, it is a real challenge to uniformly engage students from different disciplinary
backgrounds. Every attempt has been made to make the material in this book broadly accessible, but many readers will find unfamiliar notation of one form or another. For some, linguistic glosses for an obscure language may be intimidating; for others, pseudocode algorithms with operations on sets or vectors. In many cases, apparently complex notation is actually relatively simple, and is used to facilitate presentation rather than complicate it. We encourage readers to spend the time to understand unfamiliar notation and work through examples. There are a number of people who have not been involved in the production of this book, but who have had a profound influence on our understanding of the issues that we discuss, and we would like to acknowledge their influence here. First and foremost we would like to thank the former AT&T Labs researchers–Michael Riley, Mehryar Mohri, Fernando Pereira, and Cyril Allauzen–who were instrumental in the creation of the AT&T weighted finite-state tools upon which much of our own work has been based. We would also like to acknowledge Mark Johnson, Eugene Charniak, and Michael Collins for insightful discussions of issues related to syntactic processing, language modeling, and machine learning. Ken Church has had an important influence on the field of computational linguistics as a whole, and directly or indirectly on our own work. On the material presented here, we have received helpful feedback from Lauri Karttunen, Dale Gerdemann, audiences at the Workshop on Word-Formation Theories (25–26 June, 2005, Prešov, Slovakia) and the Université du Québec à Montréal, and classes at the University of Illinois at Urbana-Champaign and Oregon Health & Science University. Thanks to Emily Tucker, Meg Mitchell, and Kristy Hollingshead for detailed comments on early drafts. We have also benefited from comments from two anonymous reviewers for Oxford University Press. Finally, we would like to thank our editors, Robert Van Valin at the University at Buffalo and John Davey at Oxford University Press, for their support and patience with our continual delays in bringing this work to completion.
List of Figures

1.1. A simple finite-state automaton accepting the language ab∗cdd∗e
1.2. A simple finite-state transducer that computes the relation (a:a)(b:b)∗(c:g)(d:f)(d:f)∗(e:e)
1.3. Representations of several pronunciations of the word data, with transcriptions in ARPAbet
1.4. Two transducers with epsilons
1.5. Naive composition of the transducers from Figure 1.4
2.1. Yowlumne durative template CVCVV(C)
2.2. Koasati plural truncation transducer
2.3. Marker insertion transducer M for Bontoc infixation
2.4. Transducer È, which converts > to -um- and inserts [+be] at the end of the word
2.5. Representation of the Arabic duuris, following Beesley and Karttunen (2000)
2.6. Filler acceptors representing the root drs and the template acceptor CVVCVC, following Beesley and Karttunen (2000)
2.7. A transducer for Gothic Class VII preterite reduplication
2.8. Schematic form for a transducer that maps a Gothic stem into a form that is indexed for copy checking
2.9. The Gothic Class VII preterite form haíhald "held" under Morphological Doubling theory
3.1. Lextools symbol file for implementation of a mini-fragment of Sanskrit, following Stump (2001)
3.2. Transducers 1–3 for implementation of a mini-fragment of Sanskrit, following Stump (2001)
3.3. Transducers 4–5 for implementation of a mini-fragment of Sanskrit, following Stump (2001)
3.4. Transducers 6–8 for implementation of a mini-fragment of Sanskrit, following Stump (2001)
3.5. Transducer 9 and lexicon for an implementation of a mini-fragment of Sanskrit, following Stump (2001)
3.6. Lextools symbol file for an implementation of a mini-fragment of Swahili, following Stump (2001)
3.7. An implementation of a mini-fragment of Swahili, following Stump (2001)
3.8. An implementation of a mini-fragment of Breton
4.1. Schematic representation of Koskenniemi's two-level morphology system
5.1. A sample trie showing branches for potential suffixes NULL, -s and -ed: from Schone and Jurafsky (2001, Figure 2)
5.2. Semantic transitive closure of PPMVs, from Schone and Jurafsky (2001)
5.3. The log(VBD/VB) estimator from Yarowsky and Wicentowski (2001), smoothed and normalized to yield an approximation to the probability density function for the VBD/VB ratio
6.1. Deterministic finite-state representation of n-gram models with negative log probabilities (tropical semiring)
6.2. Hidden Markov Model view of class-based language modeling
6.3. Finite-state transducer view of class-based language modeling
6.4. Pseudocode of the Forward algorithm for HMM class-based language modeling
6.5. Example of the Forward algorithm for some made-up probabilities for a made-up sentence
6.6. Pseudocode of the Viterbi algorithm for HMM decoding
6.7. Example of the Viterbi algorithm using the probabilities from Figure 6.5
6.8. Pseudocode of a simple N-best Viterbi algorithm for HMM decoding, and an improved algorithm for the same
6.9. Pseudocode of an iteration of the Forward–backward algorithm for HMM parameter estimation
6.10. Backward probabilities and posterior probabilities using the example from Figure 6.5
6.11. Representing NP Chunks as labeled brackets and as B/I/O tags
7.1. A context-free grammar
7.2. Parse tree for example string
7.3. A context-free grammar in Chomsky Normal Form
7.4. One-to-one mapping of parse trees corresponding to original and Chomsky Normal Form CFGs
7.5. The initial moves in the shift–reduce derivation, encoded as a pushdown automaton
7.6. Tree resulting from a left-corner grammar transform
7.7. The order of node recognition in the (a) shift–reduce; (b) top-down; and (c) left-corner parsing algorithms
7.8. Three possible parse trees for the example sentence
7.9. Rule productions from a PCFG covering example trees
7.10. A chart representation of the first tree in Figure 7.8
7.11. Chomsky Normal Form PCFG weakly equivalent to the grammar in Figure 7.9
7.12. All constituent spans spanning the string, given the CNF PCFG in Figure 7.11
7.13. Pseudocode of basic chart-filling algorithm
7.14. Pseudocode of the CYK algorithm
7.15. Two different chart cell orderings for the CYK algorithm
7.16. Dotted rules at start index x = 0 from the PCFG in Figure 7.11
7.17. Left-child and right-child configurations for calculating the backward probability
8.1. Example tree from the Penn Wall St. Journal Treebank (Marcus et al. 1993)
8.2. Four right-factored binarizations of a flat NP: (a) standard right-factorization; (b) Markov order-2; (c) Markov order-1; (d) Markov order-0
8.3. (a) Parent label annotated onto non-POS-tag children labels; (b) First two children labels annotated onto parent label
8.4. Example tree from Figure 8.1 with head children in bold, and lexical heads in square brackets
8.5. (a) Basic 5-ary lexicalized CFG production; (b) Same production, factored into bilexical CFG productions; and (c) Factored bilexical productions with a Markov assumption
8.6. Complement annotated tree for use in Collins' Model 2
8.7. Two-stage mapping from parse tree to dependency tree
8.8. Three possible dependency trees for the trees in Figure 7.8
8.9. Unlabeled parse tree, and cells in chart
8.10. Illustration of the effect of the Mohri and Nederhof (2001) grammar transform on trees, resulting in strongly regular grammars
9.1. Use of co-indexation to indicate that the subject of the main clause is the same as the (empty) subject of the embedded clause; and a re-entrant graph representation of the same
9.2. Tree of f-structures for string "the player wants to play"
9.3. Tree of partially unexpanded f-structures for string "the player wants to play"
9.4. Elementary and derived trees
9.5. Unlabeled derivation trees, showing either elementary tree number or the anchor word of the elementary tree
9.6. Chart built through the CYK-like algorithm for TAGs, using elementary trees in Figure 9.4
9.7. Pseudocode of chart-filling algorithm for Tree-adjoining Grammars
9.8. Pseudocode of algorithms called by chart-filling algorithm for TAGs in Figure 9.7
9.9. CCG categories for example string from last section
9.10. CCG derivation using forward and backward application, represented as a tree
9.11. CCG derivation for non-constituent coordination using both type lifting and forward composition
9.12. Two valid parse trees
9.13. Six DOP fragments out of a possible sixteen for this parse of the string "the player plays"
9.14. Alignment English to Yodaish that cannot be captured using ITG
9.15. Dependency tree and head transducer for example in Figure 9.14
List of Tables

1.1. Phrasal reduplication in Bambara
1.2. Closure properties for regular languages and regular relations
2.1. Comparative affixation in English
2.2. Template- and non-template-providing affixes in Yowlumne
2.3. Diminutive suffixation in German
2.4. Irish first declension genitive formation
2.5. Initial consonant mutation in Welsh
2.6. Koasati plural stem truncation
2.7. Infixation of -um- in Bontoc
2.8. Infixation in Ulwa
2.9. Arabic active verb stems derived from the root ktb "notion of writing"
2.10. Latin third stem derivatives
2.11. The five major Latin noun declensions
2.12. Rewrite rules mapping concrete morphosyntactic features to an abstract morphomic representation
2.13. Fragment for Latin nominal endings
2.14. Gothic Class VII preterites after Wright (1910)
2.15. Unbounded reduplication in Bambara after Culy (1985)
2.16. Basic and modified stems in Sye (Inkelas and Zoll, 1999)
3.1. Partial paradigm of Finnish noun talo "house", after Spencer
3.2. The four logically possible morphological theory types, after Stump (2001)
3.3. Stem alternation in the masculine paradigm of Sanskrit bhagavant "fortunate" (=Stump's Table 6.1)
3.4. Stem alternation in the masculine paradigm of Sanskrit tasthivans "having stood" (=Stump's Table 6.3)
3.5. Positional classes in Swahili, for taka "want" after Stump (2001, Table 5.1)
5.1. Top ten signatures for English from Goldsmith (2001), with a sample of relevant stems
5.2. Comparison of the F-scores for suffixing for full Schone and Jurafsky method with Goldsmith's algorithm for English, Dutch, and German
5.3. The performance of the Yarowsky–Wicentowski algorithm on four classes of English verbs from Yarowsky and Wicentowski (2001, Table 9)
5.4. Comparison of three methods of morphological induction
7.1. Shift–reduce parsing derivation
7.2. Top-down parsing derivation
7.3. Left-corner parsing derivation
7.4. Top-down derivation with a left-corner grammar
8.1. Baseline results of CYK parsing using different probabilistic context-free grammars
8.2. Parent and initial children annotated results of CYK parser versus the baseline, for Markov order-2 factored grammars
8.3. Performance of Collins' Models 1 and 2 versus competitor systems
8.4. Performance of Charniak's two versions versus Collins' models
9.1. Example sentences illustrating grammaticality constraint
9.2. Transduction grammar notation from Lewis and Stearns (1968) and corresponding Inversion Transduction grammar notation from Wu (1997) for ternary production
Abbreviations

ABL  ablative
ACC  accusative
Adj  Adjective
Adv  Adverb
argmax  the value resulting in the highest score
argmin  the value resulting in the lowest score
ASR  Automatic Speech Recognition
BITG  Bracketing ITG
CCG  Combinatory Categorial Grammar
CFG  Context-free Grammar
CKY  variant of CYK
CLR  function tag in Penn Treebank (close-related)
CNF  Chomsky Normal Form
COMP  complement
CRF  Conditional Random Fields
CS  Context Similarity
CYK  Cocke-Younger-Kasami dynamic programming parsing algorithm
DAT  dative
DATR  a language for lexical knowledge representation
DECOMP  a module of the MITalk text-to-speech synthesis system
DEF  definite
Det  Determiner
DFA  Deterministic Finite Automaton
DOP  Data Oriented Parsing
DP  Determiner phrase
DT  Determiner POS-tag (Penn Treebank)
EM  Expectation Maximization
exp  exponential
FEM  feminine
FSA  Finite-state Automaton
FSM  Finite-state Machine
FSRA  Finite-state Registered Automaton
FST  Finite-state Transducer
fut  future
FV  Final vowel
GCFG  Generalized Context-free Grammar
GEN  Candidate generation mechanism in OT
GEN  genitive
HMM  Hidden Markov Model
HPSG  Head-driven Phrase Structure Grammar
iff  if and only if
INF  infinitive
IPA  International Phonetic Alphabet
ITG  Inversion Transduction Grammars
JJ  Adjective POS-tag (Penn Treebank)
KATR  Kentucky version of DATR
KIMMO  Koskenniemi's first name, used first by Lauri Karttunen to refer to the two-level morphology system that Koskenniemi invented
LC  left context
LCFRS  Linear context-free rewriting systems
LFG  Lexical Functional Grammar
LHS  Left-hand side of a context-free production
LL(k)  Left-to-right, Leftmost derivation with k lookahead items
log  logarithm
LP  Labeled precision
LR  Labeled recall
LR(k)  Left-to-right, Rightmost derivation with k lookahead items
LS  Levenshtein similarity
MAS  masculine
max  maximum
MaxEnt  Maximum Entropy
MCTAG  Multicomponent TAG
MD  Modal verb POS-tag (Penn Treebank)
MDL  Minimum description length
min  minimum
ML  Maximum Likelihood
MT  Machine Translation
N  Noun
NCS  Normalized cosine score
NER  Named entity recognition
NLP  Natural Language Processing
NN  common noun POS-tag (Penn Treebank)
NNP  proper noun POS-tag
NNS  plural common noun POS-tag (Penn Treebank)
NOM  nominative
NP  noun phrase
NUM  number
OT  Optimality Theory
PA  Pushdown Automaton
PARC  Palo Alto Research Center
PARSEVAL  Parse evaluation metrics
PC  Personal computer
PCFG  Probabilistic CFG
PF  paradigm function
PL  plural
POS  Part-of-speech
PP  Prepositional phrase
PPMV  pair of potential morphological variants
RB  Adverb POS-tag (Penn Treebank)
RC  right context
RHS  Right-hand side of a context-free production
RR  realization rule
<s>  beginning of string
</s>  end of string
S  sentence
SBAR  subordinate clause
SG  Singular
SLM  Structured language model
SPE  Sound Pattern of English
SUBJ  Subject
SuperARV  a finite-state class-based language modeling approach
TAG  Tree-adjoining Grammar
TDP  Top-down parsing
TMP  temporal
V  Verb
VB  Infinitival verb POS-tag (Penn Treebank)
VBD  Past tense verb POS-tag (Penn Treebank)
VBG  Gerund verb POS-tag (Penn Treebank)
VBN  Past participle verb POS-tag (Penn Treebank)
VBP  Non-third person singular present verb POS-tag (Penn Treebank)
VBZ  Third person singular present verb POS-tag (Penn Treebank)
VP  Verb phrase
WCFG  Weighted CFG
WFSA  Weighted FSA
WFST  Weighted FST
WSJ  Wall St. Journal
XFST  Xerox FST tools
XLE  Xerox Linguistic Environment
XTAG  TAG grammar from University of Pennsylvania
1 Introduction and Preliminaries

1.1 Introduction

Computational approaches to morphology and syntax are generally concerned with formal devices, such as grammars and stochastic models, and algorithms, such as tagging or parsing. They can range from primarily theoretical work, looking at, say, the computational complexity of algorithms for using a certain class of grammars, to mainly applied work, such as establishing best practices for statistical language modeling in the context of automatic speech recognition. Our intention in this volume is to provide a critical overview of the key computational issues in these domains along with some (though certainly not all) of the most effective approaches taken to address these issues. Some approaches have been known for many decades; others continue to be actively researched. In many cases, whole classes of problems can be addressed using general techniques and algorithms. For example, finite-state automata and transducers can be used as formal devices for encoding many models, from morphological grammars to statistical part-of-speech taggers. Algorithms that apply to finite-state automata in general apply to these models. As much as possible, we will present specific computational approaches to syntax and morphology within the general class to which they belong. Much work in these areas can be thought of as variations on certain themes, such as finite-state composition or dynamic programming.

The book is organized into two parts: approaches to morphology and approaches to syntax. Since finite-state automata and transducers will figure prominently in much of the discussion in this book, in this chapter we introduce the basic properties of these devices as well as some of the algorithms and applications. For reasons of space we will only provide a high-level overview, but we will give enough references to
recent work on finite automata and their applications so that interested readers can follow up on the details elsewhere. One thing we hope to convey here is how to think of what automata compute in algebraic and set-theoretic terms. It is easy to get lost in the details of the algorithms and the machine-level computations. But what is really critical in understanding how finite automata are used in speech and language processing is to understand that they compute relations on sets. One of the critical insights of the early work by Kaplan and Kay from the 1970s (reported finally in Kaplan and Kay, 1994) was that in order to deal with complex problems such as the compilation of context-sensitive rewrite rules into transducers, one has to abandon thinking of the problem at the machine level and move instead to thinking of it at the level of what relations are being computed. Just as nobody can understand the wiring diagram of an integrated circuit, neither can one really understand a finite automaton of any complexity by simply looking at the machine. However, one can understand them easily at the algebraic level, and let algorithms worry about the details of how to compile that algebraic description into a working machine. We assume that readers will be at least partly familiar with basic finite-state automata so we will only briefly review these. One can find reviews of the basics of automata in any introduction to the theory of computation such as Harrison (1978), Hopcroft and Ullman (1979), and Lewis and Papadimitriou (1981).
1.2 Finite-State Automata and Transducers

The study of finite-state automata (FSA) starts with the notion of a language. A language is simply a set of expressions, each of which is built from a set of symbols from an alphabet, where an alphabet is itself a set: typical alphabets in speech and language processing are sets of letters (or other symbols from a writing system), phones, or words. The languages of interest here are regular languages, which are languages that can be constructed out of a finite alphabet – conventionally denoted Σ – using one or more of the following operations:

• set union, denoted "∪": e.g., {a, b} ∪ {c, d} = {a, b, c, d}
• concatenation, denoted "·": e.g., abc · def = abcdef
• transitive closure (Kleene star), denoted "∗": e.g., a∗ denotes the set of sequences consisting of 0 or more a's
Table 1.1 Phrasal reduplication in Bambara

  wulu o wulu                                          "whichever dog"
  dog MARKER dog
  wulunyinina o wulunyinina                            "whichever dog searcher"
  dog searcher MARKER dog searcher
  malonyininafilèla o malonyininafilèla                "whichever rice searcher watcher"
  rice searcher watcher MARKER rice searcher watcher
Any finite set of strings from a finite alphabet is necessarily a regular language, and using the above operations one can construct another regular language by taking the union of two sets A and B; the concatenation of two or more such sets (i.e. the concatenation of each string in A with each string in B); or by taking the transitive closure (i.e. zero or more concatenations of strings from set A). Despite their simplicity, regular languages can be used to describe a large number of phenomena in natural language including, as we shall see, many morphological operations and a large set of syntactic structures. But there are still linguistic constructions that cannot be described using regular languages. One well-known case from morphology is phrasal reduplication in Bambara, a language of West Africa (Culy, 1985), some examples of which are given in Table 1.1. Bambara phrasal reduplication constructions are of the form X-o-X, where -o- is a marker of the construction and X is a nominal phrase. The problem is that the nominal phrase is in theory unbounded, and so the construction involves unbounded copying. Unbounded copying cannot be described in terms of regular languages; indeed it cannot even be described in terms of context-free languages (which we will return to later in the book).

A couple of important regular languages are the universal language (denoted Σ∗), which consists of all strings that can be constructed out of the alphabet Σ, including the empty string, which is denoted ε; and the empty language (denoted ∅), consisting of no strings. The definition given above covers some of the closure properties for regular languages, but regular languages are also closed under the following operations:

• intersection, denoted "∩": e.g., {a, b, c} ∩ {c, d} = {c}
• difference, denoted "−": e.g., {a, b, c} − {c} = {a, b}
• complementation, denoted Ā for a language A: e.g., Ā = Σ∗ − A
• string reversal, denoted X^R: e.g., (abc)^R = cba
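For finite sets of strings, these operations can be tried out directly in a few lines of Python. This is only an illustration of the set-theoretic definitions above, not part of the book's own tooling; Kleene closure yields an infinite set, so the sketch enumerates it only up to a bounded number of concatenations.

```python
A = {"a", "b"}
B = {"c", "d"}

union = A | B                                  # {"a", "b", "c", "d"}
concatenation = {x + y for x in A for y in B}  # {"ac", "ad", "bc", "bd"}
intersection = {"a", "b", "c"} & {"c", "d"}    # {"c"}
difference = {"a", "b", "c"} - {"c"}           # {"a", "b"}

def closure_up_to(L, k):
    """Strings of L* built from at most k concatenations (L* itself is infinite)."""
    result = {""}          # the empty string is always in L*
    frontier = {""}
    for _ in range(k):
        frontier = {x + y for x in frontier for y in L}
        result |= frontier
    return result

print(sorted(closure_up_to({"a"}, 3)))   # ['', 'a', 'aa', 'aaa']
```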
Regular languages are commonly denoted via regular expressions, which involve the use of a set of reserved symbols as notation. Some of these reserved symbols we have already seen, such as "∗", which denotes "zero or more" of the symbol that it follows: recall that a∗ denotes the (infinite) set of strings consisting of zero or more a's in sequence. We can denote the repetition of multi-symbol sequences by using a parenthesis delimiter: (abc)∗ denotes the set of strings with zero or more repetitions of abc, that is, {ε, abc, abcabc, abcabcabc, . . .}. The following summarizes several additional reserved symbols used in regular expressions:

• "zero or one", denoted "?": e.g., (abc)? denotes {ε, abc}
• disjunction, denoted "|" or "∪": e.g., (a | b)? denotes the set of strings with zero or one occurrence of either a or b, i.e., {ε, a, b}
• negation, denoted "¬": e.g., (¬a)∗ denotes the set of strings with zero or more occurrences of anything other than a
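Readers who know a mainstream regular-expression library can approximate this notation there. The following Python snippet is only a rough analogy, not the book's notation itself: Python's re module has no "¬" operator, so a negated character class stands in for it, and fullmatch is used so that whole strings rather than substrings are matched.

```python
import re

assert re.fullmatch(r'(abc)?', 'abc')        # (abc)? accepts abc
assert re.fullmatch(r'(abc)?', '')           # ... and the empty string
assert re.fullmatch(r'(a|b)?', 'b')          # (a|b)? accepts a, b, or nothing
assert re.fullmatch(r'[^a]*', 'bcd')         # stand-in for (¬a)*
assert not re.fullmatch(r'[^a]*', 'bad')     # rejected: contains an a
```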
This is a relatively abbreviated list, but sufficient to understand the regular expressions used in this book to denote regular languages.

Finite-state automata are computational devices that compute regular languages. Formally defined:

Definition 1 A finite-state automaton is a quintuple M = (Q, s, F, Σ, δ) where:
1. Q is a finite set of states
2. s is a designated initial state
3. F is a designated set of final states
4. Σ is an alphabet of symbols
5. δ is a transition relation from Q × (Σ ∪ ε) to Q, where A × B denotes the cross-product¹ of sets A and B

Kleene's theorem states that every regular language can be recognized by a finite-state automaton; similarly, every finite-state automaton recognizes a regular language.

¹ The cross-product of two sets creates a set of pairs, with each member of the first set paired with each member of the second set. For example, {a, b} × {c, d} = {⟨a, c⟩, ⟨a, d⟩, ⟨b, c⟩, ⟨b, d⟩}. Thus the transition relation δ is from state/symbol pairs to states.
Figure 1.1 A simple finite-state automaton accepting the language ab∗cdd∗e
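As a concrete counterpart to Figure 1.1, the following minimal Python sketch encodes an automaton for the caption's language ab∗cdd∗e as a dictionary-based transition table and implements the acceptance procedure described in the next paragraph. The specific state numbering and the dictionary representation are our own illustrative choices, not an implementation from the book.

```python
# Transition relation for an automaton accepting ab*cdd*e: (state, symbol) -> state.
DELTA = {
    (0, 'a'): 1,
    (1, 'b'): 1,   # self-loop reading b*
    (1, 'c'): 2,
    (2, 'd'): 3,
    (3, 'd'): 3,   # self-loop reading d*
    (3, 'e'): 4,
}
START, FINALS = 0, {4}

def accepts(s: str) -> bool:
    """Follow arcs symbol by symbol; accept iff we end in a final state."""
    state = START
    for sym in s:
        if (state, sym) not in DELTA:
            return False          # no matching arc: reject
        state = DELTA[(state, sym)]
    return state in FINALS        # all input read; must end in a final state

assert accepts("abbcddde")
assert not accepts("abce")        # needs at least one d
```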
A diagram of a simple finite-state automaton, which accepts the language ab∗cdd∗e, is given in Figure 1.1. A string, say abbcddde, that is in the language of the automaton is matched against the automaton as follows: starting in the initial state (here, state 0), the match proceeds by reading a symbol of the input string and matching it against a transition (or arc) that leaves the current state. If a match is found, one moves to the destination state of the arc, and tries to match the next symbol of the input string with an arc leaving that state. If one can follow a path through the automaton in such a manner and end in a final state (denoted here with a double circle) with all symbols of the input read, then the string is in the language of the automaton; otherwise it is not. Note that the operation of intersection of two automata (see Section 1.5) follows essentially the same algorithm as just sketched, except that one is matching paths in one automaton against another, instead of a string. Note also that one can represent a string as a single-path automaton, so that the string-matching method we just described can be implemented as automata intersection.

We turn from regular languages and finite-state automata to regular relations and finite-state transducers (FST). A regular relation can be thought of as a regular language over n-tuples of symbols, but it is more usefully thought of as expressing relations between sets of strings. The definition of a regular n-relation is as follows:

1. ∅ is a regular n-relation
2. For all symbols a ∈ [(Σ ∪ ε) × . . . × (Σ ∪ ε)], {a} is a regular n-relation
3. If R1, R2, and R are regular n-relations, then so are
   (a) R1 · R2, the (n-way) concatenation of R1 and R2: for every r1 ∈ R1 and r2 ∈ R2, r1r2 ∈ R1 · R2
   (b) R1 ∪ R2
   (c) R∗, the n-way transitive (Kleene) closure of R

For most applications in speech and language processing n = 2, so that we are interested in relations between pairs of strings.² In what follows we will be dealing only with 2-relations.

² An exception is Kiraz (2000), who uses n-relations, n > 2, for expressing nonconcatenative Semitic morphology; see Section 2.2.9.
Figure 1.2 A simple finite-state transducer that computes the relation (a:a)(b:b)∗(c:g)(d:f)(d:f)∗(e:e)
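In the same illustrative spirit as the acceptance sketch above, the relation in the caption of Figure 1.2 can be encoded as a table of arcs carrying an input and an output symbol; the transduce function below follows a path while collecting output symbols, producing abbgfffe for the input abbcddde, as discussed in the text that follows. The state numbering is again our own assumption, and the code is a sketch rather than any toolkit's API; it also assumes a deterministic machine (at most one arc per state and input symbol).

```python
# Arcs of a transducer for (a:a)(b:b)*(c:g)(d:f)(d:f)*(e:e):
# (state, input symbol) -> (output symbol, next state).
ARCS = {
    (0, 'a'): ('a', 1),
    (1, 'b'): ('b', 1),   # b:b self-loop
    (1, 'c'): ('g', 2),
    (2, 'd'): ('f', 3),
    (3, 'd'): ('f', 3),   # d:f self-loop
    (3, 'e'): ('e', 4),
}
START, FINALS = 0, {4}

def transduce(s: str):
    """Return the output string if s is in the domain of the relation, else None."""
    state, out = START, []
    for sym in s:
        if (state, sym) not in ARCS:
            return None
        o, state = ARCS[(state, sym)]
        out.append(o)
    return ''.join(out) if state in FINALS else None

print(transduce("abbcddde"))   # abbgfffe
```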
Analogous to finite-state automata are finite-state transducers, defined as follows:

Definition 2 A (2-way) finite-state transducer is a quintuple M = (Q, s, F, Σ × Σ, δ) where:
1. Q is a finite set of states
2. s is a designated initial state
3. F is a designated set of final states
4. Σ is an alphabet of symbols
5. δ is a transition relation from Q × (Σ ∪ ε × Σ ∪ ε) to Q

A simple finite-state transducer is shown in Figure 1.2. With a transducer, a string matches against the input symbols on the arcs, while at the same time the machine is outputting the corresponding output symbols. Thus, for the input string abbcddde, the transducer in Figure 1.2 would produce abbgfffe. A transducer determines if the input string is in the domain of the relation, and if it is, computes the corresponding string, or set of strings, that are in the range of the relation.

The closure properties of regular relations are somewhat different from those of regular languages, and the differences are outlined in Table 1.2. The major differences are that relations are not closed under intersection, a point that will be important in Chapter 4 when we discuss the KIMMO morphological analyzer (see Section 4.2.1); and that relations are closed under a new property, namely composition. Composition – denoted ◦ – is to be understood in the sense of composition of functions. If f and g are two regular relations and x a string, then [f ◦ g](x) = f(g(x)). In other words, the output of the composition of f and g on a string x is the output that would be obtained by first applying g to x and then applying f to the output of that first operation.

Composition is a very useful property of regular relations. There are many applications in speech and language processing where one wants to factor a system into a set of operations that are cascaded together using composition. A case in point is in the implementation
Table 1.2 Closure properties for regular languages and regular relations

  Property          Languages    Relations
  concatenation     yes          yes
  Kleene closure    yes          yes
  union             yes          yes
  intersection      yes          no
  difference        yes          no
  composition       —            yes
  inversion         —            yes
of phonological rule systems. Phonological rewrite rules of the kind used in early Generative Grammar can be implemented using regular relations and finite-state transducers. 3 Traditionally such rule systems have involved applying a set of rules in sequence, each rule taking as input the output of the previous rule. This operation is implemented computationally by composing the transducers corresponding to the individual rules. Table 1.2 also lists inversion as one of the operations under which regular relations are closed. Inversion consists of swapping the domain of the relation with the range; in terms of finite-state transducers, one simply swaps the input and output labels on the arcs. The closure of regular relations under composition and inversion leads to the following nice property: one can develop a rule system that compiles into a transducer that maps from one set of strings to another set of strings, and then invert the result so that the relation goes the other way. An example of this is, again, generative phonological rewrite rules. It is generally easier for a linguist to think of starting with a more abstract representation and using rules to derive a surface representation. Yet in a morphological analyzer, one generally wants the computation to be performed in the other direction. Thus one takes the linguist’s description, compiles it into finite-state transducers, composes these together and then inverts the result. Of course, regular relations resulting from such descriptions are likely to be many-to-one, as in many input strings mapping to one output string; for example, many underlying forms mapping to the 3
We will not discuss these compilation algorithms here as this would take us too far afield; the interested reader is referred to Kaplan and Kay (1994) and Mohri and Sproat (1996).
same surface form. In such a case, the inversion yields a one-to-many relation, resulting in the need for disambiguation between the many underlying forms that could be associated with a particular surface form.

1.3 Weights and Probabilities

Disambiguation in morphological and syntactic processing is often done by way of stochastic models, in which weights can encode preferences for one analysis versus another. In this section, we will briefly review notation for weights and probabilities that will be used throughout the book.

Calculating the sum or product over a large number of values is very common, and a common shorthand is to use the sum (∑) or product (∏) symbols ranging over variables. For example, to sum the numbers one through nine, we can write:

∑_{i=1}^{i<10} i = 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 = 45    (1.1)
Similarly, to multiply them:

∏_{i=1}^{i<10} i = 1 ∗ 2 ∗ 3 ∗ 4 ∗ 5 ∗ 6 ∗ 7 ∗ 8 ∗ 9 = 362880    (1.2)
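For readers less used to the ∑ and ∏ shorthand, the two running examples can be checked directly in Python; this is purely illustrative and assumes nothing beyond the standard library.

```python
import math

total = sum(range(1, 10))          # equation (1.1): 1 + 2 + ... + 9 = 45
product = math.prod(range(1, 10))  # equation (1.2): 1 * 2 * ... * 9 = 362880
print(total, product)              # 45 362880
```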
Logarithms (log) and exponentials (exp) are also very common, and we will by convention use the natural logarithm, base e.⁴ Recall the basic relationship between them: for any x, y

log(exp(x)) = log(e^x) = x    (1.3)
exp(log(y)) = e^{log(y)} = y    (1.4)

⁴ When multiple bases are used, natural log is sometimes denoted ln, but here we will just use log.

One of the nicest properties of logs is that the log of a product is a sum:

log ∏_{i=1}^{i<10} i = ∑_{i=1}^{i<10} log(i)    (1.5)

This is nice because the log of each i can be taken separately, prior to combination, rather than having to combine them before taking the log. This is critical when the product leads to extremely large or
extremely small floating point numbers. Another nice property is that log is order preserving. That is, if x > y, then log(x) > log(y).

Briefly, let us introduce simple empirically estimated probabilities of the sort we will mainly be considering in this book. All of the probabilistic models that we will be discussing are discrete distributions, where there are k discrete outcomes (such as different words from a vocabulary of size k), each with its own parameter. When k = 2, this is known as a binomial distribution; when k > 2, this is a multinomial distribution. For example, we can assign a probability to each word w in a vocabulary Σ; this is a multinomial distribution with |Σ| parameters (one parameter P(w) for each word w ∈ Σ) where ∑_{w∈Σ} P(w) = 1. If we have a corpus of N words taken from a vocabulary Σ, we can calculate the probability of any observed word w ∈ Σ in that corpus using relative frequency estimation:

P(w) = f(w) / N    (1.6)

where f(w) is the frequency of the word (its count). Note that using relative frequency estimation leads us to give zero probability to words that have not occurred in our corpus, a problem that is discussed later in the book. We might want to find the most probable word in the corpus, that is, the word with the maximum probability. The maximum probability, here denoted p̂, is

p̂ = max_w P(w)    (1.7)

If we want to know the word that provides us with this maximum probability, we use "argmax":

ŵ = argmax_w P(w)    (1.8)

Then, by definition, P(ŵ) = p̂. Of course, if we are using negative log probabilities, the order reverses, hence we will be more interested in the "min" and "argmin", which are defined similarly.

1.4 Weighted Finite-State Automata and Transducers

Finite-state automata and transducers can be extended to include weights or costs on the arcs. Such machines are termed weighted finite-state automata (WFSA) and weighted finite-state transducers (WFST).
Figure 1.3 Representations of several pronunciations of the word data, with transcriptions in ARPAbet (arc probabilities: d/1.0, then ey/0.4 or ae/0.6, then t/0.2 or dx/0.8, then ax/1.0)
The weights can serve a number of functions, but the most frequent use in speech and language processing is to represent probabilities, or more commonly, negative log probabilities, of different analyses. An example is shown in Figure 1.3. In this example, several plausible pronunciations are shown for the word data, with associated probabilities; note that the probabilities for all arcs leaving a given state must sum to one. (The four pronunciations correspond to the IPA transcriptions /deɪtə/, /dætə/, /deɪɾə/, and /dæɾə/.) The probability of a particular path is given by multiplying the individual arc probabilities along the path. In this example, for instance, the pronunciation /d ey t ax/ has the probability 1 ∗ 0.4 ∗ 0.2 ∗ 1 = 0.08.

In a toy example like the one we have just examined, it is reasonable to represent probabilities as themselves, but in any realistic scenario this presents a computational problem since the probabilities along any given path can be very small and will generally lead to difficulties in floating point representation of the values. Thus it is more common to represent probabilities in the log domain and, more specifically, to represent them in terms of negative logs. Note that in this representation, smaller numbers correspond to more probable events. Recall that, if we use negative log probabilities, we must sum the weights along the path rather than multiply them.

When automata and transducers are given weights, the interpretation of those weights must be provided. In addition to specifying how weights are combined along a path (which for probabilities is by multiplication), one must also specify how weights are combined between paths. For example, suppose there are two paths in an automaton with the same symbols on the arc labels of the paths, and we want to collapse those into a single path. How are the weights combined? When dealing with straight probabilities, in order to ensure proper normalization, the weights (probabilities) of two paths are added together. Different kinds of weights – e.g., logs or probabilities – will have different ways of combining the weights along the path (which we will generally term
the times operation) and between paths (which we will term the plus operation). The different interpretations of weights are usefully unified in terms of semirings. Before we can introduce the notion of a semiring, we first need the definition of a monoid:

Definition 3 A monoid is a pair (M, •), where M is a set and • is a binary operation on M, obeying the following rules:
1. closure: for all a, b in M, a • b is in M
2. identity: there exists an element e in M, such that for all a in M, a • e = e • a = a. This is termed the neutral element.
3. associativity: • is an associative operation; that is, for all a, b, c in M, (a • b) • c = a • (b • c)

A monoid (M, •) is commutative if a • b = b • a for all a, b in M. We can take this notion of monoid and define semirings in terms of notional summation and multiplication as follows:

Definition 4 A semiring is a triple (K, ⊕, ⊗), where K is a set and ⊕ and ⊗ are binary operations on K, obeying the following rules:
1. (K, ⊕) is a commutative monoid with neutral element denoted by 0
2. (K, ⊗) is a monoid with neutral element denoted by 1
3. The product (⊗) distributes with respect to the sum (⊕), i.e., a ⊗ (b ⊕ c) = (a ⊗ b) ⊕ (a ⊗ c)
4. For all a in K, a ⊗ 0 = 0 ⊗ a = 0

Since most applications use real numbers as the semiring set, one typically denotes a particular semiring with a pair (⊕, ⊗) that specifies the actual instantiation of the notional plus and times operations. Common semirings used in speech and language processing are the (+, ×) or "real" semiring, and the (min, +) or "tropical" semiring.⁵ The (+, ×) semiring is appropriate for use with probabilities: to get the probability of a path, one multiplies along the path; to get the probability of a set of paths, one sums the probabilities of those paths. The (min, +) semiring is appropriate for use with negative log probabilities: one sums the weights along the path, and one computes the minimum of a set of paths – which is useful if one is looking for the best scoring path, since lower scores are better with negative logs.
So called because the mathematician who pioneered this semiring, Imre Simon, was from Brazil.
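The path-weight arithmetic just described can be made concrete with a small sketch. The following Python is purely illustrative, not an implementation from the book or from any finite-state toolkit: it packages the real and tropical semirings as (plus, times, zero, one) tuples, combines the arc weights of the /d ey t ax/ path of Figure 1.3 with ⊗, and then shows the ⊕ step over two made-up alternative paths.

```python
import math
from functools import reduce

# A semiring packaged as (plus, times, zero, one).
REAL     = (lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)   # (+, x): probabilities
TROPICAL = (min, lambda a, b: a + b, math.inf, 0.0)             # (min, +): negative log probs

def path_weight(semiring, arc_weights):
    """Combine the weights along one path with the 'times' operation."""
    _, times, _, one = semiring
    return reduce(times, arc_weights, one)

def string_weight(semiring, paths):
    """Combine alternative paths for the same string with the 'plus' operation."""
    plus, _, zero, _ = semiring
    return reduce(plus, (path_weight(semiring, p) for p in paths), zero)

# The /d ey t ax/ path of Figure 1.3: arc probabilities 1.0, 0.4, 0.2, 1.0.
probs = [1.0, 0.4, 0.2, 1.0]
print(path_weight(REAL, probs))                              # 0.08
print(path_weight(TROPICAL, [-math.log(p) for p in probs]))  # -log(0.08), about 2.526

# Two made-up alternative paths for one string, to show the 'plus' step:
alts = [[0.5, 0.2], [0.5, 0.1]]
print(string_weight(REAL, alts))                             # 0.10 + 0.05 = 0.15
print(string_weight(TROPICAL,
                    [[-math.log(w) for w in p] for p in alts]))  # min of the path costs, about 2.303
```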
Weighted finite-state automata (and the corresponding weighted languages) are closed under intersection. When one combines two paths via intersection, the resulting cost is obtained by using the semiring ⊗ operation. Note that ⊗ is often referred to as the extend operation, since it is the operation that one uses when one extends a path by an additional arc. A formal definition of a weighted finite automaton is as follows:

Definition 5 A weighted finite-state automaton is an octuple A = (Q, s, F, Σ, δ, λ, σ, ρ), where:
1. (Q, s, F, Σ, δ) is a finite-state automaton
2. An initial output function λ: s → K assigns a weight to entering the automaton
3. An output function σ: δ → K assigns a weight to transitions in the automaton
4. A final output function ρ: F → K assigns a weight to leaving the automaton

For any transition d ∈ δ, let i[d] ∈ (Σ ∪ ε) be its label; p[d] ∈ Q its origin state; and n[d] ∈ Q its destination state. A path π = d1 . . . dk consists of k transitions d1, . . . , dk ∈ δ, where n[dj] = p[dj+1] for all j, i.e., the destination state of transition dj is the origin state of transition dj+1. We can extend the definitions of label, origin and destination to paths: let i[π] = i[d1] . . . i[dk]; p[π] = p[d1]; and n[π] = n[dk]. A cycle is a path π such that p[π] = n[π], i.e., a path that starts and ends at the same state. An acyclic automaton or transducer has no cycles. We can also extend the definition of the σ function of Definition 5 to paths: σ[π] = σ[d1] ⊗ · · · ⊗ σ[dk]. Let P(q, x, q′) be the set of paths π such that p[π] = q, i[π] = x, and n[π] = q′. Given a semiring (K, ⊕, ⊗), the weight associated by A to a string x ∈ Σ∗ can be defined as follows:

[[A]](x) = ⊕_{f∈F} ⊕_{π∈P(s,x,f)} λ(s) ⊗ σ[π] ⊗ ρ(f)
Weighted finite-state transducers are an obvious extension of finite-state transducers and weighted automata. A WFST computes a regular relation, but in addition it associates each mapping with a weight. For example, in a transducer encoding a rewrite rule system, the weights might represent the probabilities of a particular rule application.
1.5 A Synopsis of Algorithmic Issues

The basic texts on automata theory that we have already cited give algorithms for various finite-state operations including concatenation, Kleene closure, union, intersection, complementation, determinization, and minimization. While these algorithms obviously produce correct results and work fine for small automata and transducers, they are often not efficient enough to handle the very large machines that are typical of serious speech- and language-processing applications. Furthermore, the textbook algorithms do not deal with weighted automata, and the correct treatment of weights turns out to be critical for efficient processing. Some algorithmic issues that have been very important in the application to speech and language processing include efficient algorithms for composition, minimization, determinization, and epsilon removal. In this section we will provide a brief high-level overview of some of these algorithmic issues, with pointers to some papers that deal with them in depth.

One of the most fundamental algorithms that we will be assuming for much of our discussion of finite-state methods in this book is composition, so it is useful to have a basic understanding of how this works. At its core, composition is essentially the same as automata intersection as defined in standard texts. We start by reminding the reader how intersection works. The basic algorithm for intersection is as follows. Given two automata M = (Q, s, F, Σ, δ) and M′ = (Q′, s′, F′, Σ′, δ′), construct a new automaton M″ such that:
• Its set of states Q″ = Q × Q′ is the cross-product of the states of the individual machines.
• s″ = (s, s′)
• F″ = F × F′
• Σ″ = Σ ∩ Σ′
• δ″((p, p′), x) = (q, q′) just in case δ(p, x) = q is in M and δ′(p′, x) = q′ is in M′.

The basic algorithm for transducer composition is essentially the same, with the difference that with transducers one is matching the output label of one transducer with the input label of the other. The resulting arc has as its input label the input label of the arc from the first machine and as its output label the output label of the arc from the second machine. Automata can be seen as a special case of transducers, where the input and output symbols are always identical.
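A direct rendering of this cross-product construction, restricted for simplicity to deterministic machines and to states reachable from the paired start state, might look as follows in Python. It is an illustrative sketch, not the algorithm of any particular finite-state toolkit.

```python
from collections import deque

def intersect(delta1, start1, finals1, delta2, start2, finals2):
    """Product construction for two deterministic automata.

    Each automaton is given as a transition dict {(state, symbol): state},
    a start state, and a set of final states.  States of the result are
    pairs (p, p'); an arc on symbol x exists iff both machines have one.
    Only states reachable from the paired start state are built.
    """
    start = (start1, start2)
    delta, finals = {}, set()
    queue, seen = deque([start]), {start}
    # Symbols shared by both machines (the intersected alphabet).
    alphabet = {a for (_, a) in delta1} & {a for (_, a) in delta2}
    while queue:
        p, q = queue.popleft()
        if p in finals1 and q in finals2:
            finals.add((p, q))
        for a in alphabet:
            if (p, a) in delta1 and (q, a) in delta2:
                r = (delta1[(p, a)], delta2[(q, a)])
                delta[((p, q), a)] = r
                if r not in seen:
                    seen.add(r)
                    queue.append(r)
    return delta, start, finals
```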
Figure 1.4 Two transducers with epsilons: A = 0 –a:a→ 1 –b:ε→ 2 –c:ε→ 3 –d:d→ 4, and B = 0 –a:d→ 1 –ε:e→ 2 –d:a→ 3. Example taken from Pereira and Riley (1997)
When one is dealing with weighted intersection or composition, as we noted above, one computes the weights of the resulting path as the extend (⊗) of the weights of the two input paths. Even with the basic algorithms there are various efficiency issues. Computation of the transition function δ″ for a new state (p, p′) and label x requires efficient search. For example, suppose we are looking at transducer M, at state p, and at an arc with an output label x. We wish to find in M′, state p′, the set of arcs, if any, that have input label x. If the arcs of the second machine are arranged in no particular order, then one has no choice but to search linearly through the arcs in M′ to find any that have input label x, and this will be inefficient if there are a large number of arcs exiting p′. A solution is to index M′ on the input side, so that for any state the arcs are sorted by input label, allowing for a more efficient search method, such as a binary search. Even more efficient methods are possible.

The complication with transducers involves epsilons: arcs where the output side (if on the first transducer) or the input side (if on the second transducer) are labeled with the empty-string symbol ε. Epsilons allow one to implement different-length relations (as opposed to same-length relations) so that, for example, one can implement a rule that deletes symbols in certain contexts. In such a case the transducer would contain arcs labeled with the symbols in question on the input side and ε on the output side. The problem with epsilons is that they introduce non-determinism over and above the non-determinism that one would have due to one or both of the input machines being non-deterministic, and hence inefficiency.⁶ With weighted transducers the situation is worse: one will actually get the wrong result.
6 Note that the same issue arises with epsilons in the intersection of acceptors, and the same solution applies. However, note also that acceptors can always have their epsilons removed before intersection, whereas with transducers it is not generally possible to remove epsilons when the epsilon is only on one side of the arc label pair.
[Figure 1.5 appears here: the product states (0,0) through (3,3), linked by arcs a:d, ε:e, b:ε, c:ε, and d:a, showing the alternative epsilon paths.]
Figure 1.5 Naive composition of the transducers from Figure 1.4. Example taken from Pereira and Riley (1997)
To see this, consider the two transducers in Figure 1.4, from Pereira and Riley (1997). In composing A with B, what is at issue is how to move from state 1 through 2 and 3 in machine A and at the same time move from state 1 to 2 in machine B. Since these operations consume no output in A or input in B, there are in principle a number of ways one could do this. One could, for example traverse both arcs from states 1 to 3 in A before traversing any arc in B; or one could traverse the arc between 1 and 2 in B before traversing any arcs in A; or one could choose to move from 1 to 2 in A, then stay in 2 in A while moving from 1 to 2 in B, and then complete the transition to 3 in A. These various options are diagrammed in Figure 1.5. These multiple paths lead to inefficiency in unweighted transducers but are otherwise correct. In weighted transducers, however, they yield the wrong result for the simple reason that the weights from the two original paths will be combined in each of the alternatives (via ⊗); these multiple alternatives will then be combined (via ⊕), meaning that in many semirings the resulting cost of the intersection of the two original paths will be wrong. The solution to this problem is to insert an epsilon filter F between the transducers A and B so that in effect one is composing A ◦ F ◦ B; the transducer F forces the result to have just one of the paths, specifically the bold-marked path in Figure 1.5. 7 Beyond composition, other algorithmic issues that arise relate to determinization, minimization and epsilon removal. Weighted automata and transducers (whether weighted or not) cannot in general be determinized, but certain types of machines, including acyclic 7
The actual algorithm is somewhat more complicated than what we have sketched here, and the epsilon filter transducer is simulated rather than actually constructed; see Pereira and Riley (1997) for further details.
machines, can be. (See Mohri, 1997, for a rigorous characterization of the class of determinizable machines.) Since machine minimization requires a determinized machine (Harrison, 1978; Hopcroft and Ullman, 1979; Lewis and Papadimitriou, 1981), this also implies that not all weighted acceptors or transducers can be minimized, though, again, some classes of machine can be. Transducers and weighted acceptors that fall into the class of determinizable and minimizable machines include machines that are useful in speech and language processing. For example, a dictionary can be modeled as an acyclic transducer, mapping input words to some other property such as their part of speech or pronunciation; and a lattice of possible analyses output by a speech recognizer can be modeled as an acyclic weighted acceptor. Determinizing and minimizing such machines can provide large efficiency gains (Mohri and Riley, 1999). 8 Epsilon removal with weighted automata is an interesting algorithmic issue in particular because epsilon arcs may have weights, and one must therefore be careful to distribute the weights correctly once the epsilon arcs are removed (Mohri, 2002).

1.6 Computational Approaches to Morphology and Syntax

One might wonder why so much attention is paid to finite-state methods in this book, even in the sections that are devoted to syntax. Weren't finite-state techniques mostly relegated to the dustheap during the 1970s? That has certainly been one common view. In the mid 1990s one of the authors gave a talk at a major industrial research lab in the Seattle, Washington area. He presented some work on applications of finite-state methods to text analysis for text-to-speech synthesis. Several natural language researchers in the audience reacted negatively to his presentation, claiming that finite-state methods were outdated and belonged in the 1970s, not the 1990s. Developments over the past decade have proved this view to be ill-founded: there has been a veritable explosion of research in finite-state methods with applications in a number of areas of speech and language processing including morphology and phonology (in which there was already substantial work by the mid 1990s and as we shall discuss further below), the computational analysis of syntax (e.g., Voutilainen, 1994; Mohri, 1994, and see Chapter 6), language modeling for speech recognition (Pereira and Riley, 1997; Mohri et al., 2002), text
8 Even for machines that cannot be determinized, it is often possible to locally determinize them (Mohri, 1997).
normalization systems for speech synthesis (Sproat, 1997a), pronunciation modeling (Mohri et al., 2002), the analysis of document structure (Sproat et al., 1998), inter alia. Outside speech and language processing, finite-state methods have found applications in other fields, such as computational biology (Durbin et al., 1998). Thus, if we seem to dwell too much on finite-state methods in this book, it is for a reason: such methods have a broad range of applications and students of speech and language processing would do well to master them. Despite the broad applicability of finite-state methods, however, there is a fundamental difference between computational approaches to morphology and computational approaches to syntax, in that the former (we shall argue) can be accomplished entirely with finite-state methods, while the latter cannot. Finite-state approaches to syntax can be extremely efficient and useful for many applications requiring some amount of syntactic processing, but it has been widely known since Chomsky (1957) that many syntactic phenomena simply cannot be described without context-free or even context-sensitive grammars. Grammars built for computational syntactic processing must typically trade-off the richness of syntactic description provided by the grammar with the computational cost of using it. Often the utility of a syntactic annotation will not justify – within the context of a particular application – the cost of annotating it. This is much less of an issue for morphological processing, since finite-state models and algorithms are generally sufficient for morphological description. Computationally, a grammar may be used for syntactic processing in several ways. First, it may be used to generate word strings in the language described by the grammar. It may also be used to recognize (or accept) strings in the language and reject strings that are not in the language. The grammar may additionally be used to provide some useful annotation to the accepted strings, such as, labels, delimiters, or perhaps a numerical score. Very often it is this annotation, and not just acceptance or rejection, that makes grammars useful computationally. The quality of a grammar (or syntactic model) is usually inversely related to the efficiency with which the model can be built and used. Very rich syntactic formalisms that find favor with syntacticians because they do a good job of accepting just those sentences that are grammatical in a language are often not used because explicit, detailed grammars require significant expertise and a relatively long time to write, and because the most efficient recognition algorithms that make use of such grammars are simply not efficient enough for particular
applications. The most commonly implemented syntactic models fall far short of what linguists would expect from a grammar in terms of describing languages, yet they provide useful information and are both easy to build and efficient to use. These latter considerations will carry the day, unless a compelling difference in application performance can be demonstrated. A very important consideration is robustness in the face of noise. Unless the application is severely constrained, for example, machine translation of official weather reports, the language usage will be varied and noisy. In more natural, less constrained settings, any grammar will be presented with false starts, disfluencies, sentence fragments, out-of-vocabulary words, misspellings, run-on sentences, or just plain ungrammaticalities. A parser is usually expected to yield some useful information to an application beyond rejection, even in the face of these phenomena. Issues of efficiency and robustness have made simple, weighted finite-state methods very popular. However, much research is currently focused on enriching robust syntactic models, and making richer syntactic formalisms more efficient and robust. Part II of this book (chapters 6–9) will look at various computational approaches to syntax, starting with the simplest and most efficient techniques before moving on to richer ones. The inclusion of scores among syntactic annotations will be particularly emphasized, since this particular annotation, as shall be seen, can make the difference between a useless model and a very useful one. It is in computational linguistics, more than any other subdiscipline of linguistics, where statistical and probabilistic approaches to disambiguation have been investigated, and it is this most of all that distinguishes computational approaches to syntax from other perspectives. Many of the most successful computational models of syntax share much in common with constraint-based linguistic formalisms, such as Optimality Theory (OT), with some simple differences. What is shared is the notion of a feature or constraint that encodes some informative linguistic distinction. General, effective automatic feature induction methods are beyond the current state-of-the-art in Natural Language Processing, so that effective manual feature selection – often based on linguist-documented generalizations – is a key part of building effective syntactic analyzers. Computational approaches typically differ from OT approaches primarily in terms of how the evidence of various features/constraints is combined, but also in terms of the
explicitness of the candidate generation mechanism, known as GEN in OT. Chapters 6–9 will present just how serious a problem efficient candidate generation can be, and a range of computational solutions. Disambiguation through candidate ranking will be presented in Chapter 9. In Chapter 2, we will show that models of morphology can be implemented in a unified framework of finite-state transducers. That is not the case for syntax, which brings efficiency to center stage. As a result, Part II (Computational Approaches to Syntax) will have much more of a focus on the accuracy/efficiency trade-off than Part I (Computational Approaches to Morphology), with efficient approximations that provide useful annotations receiving much attention. Efficient algorithms will be explicitly presented, to clearly illustrate why computational linguists are forced to make the choices they do. Differences in focus between the two parts of the book reflect differences in the key issues driving the two topic areas.
Part I Computational Approaches to Morphology
2 The Formal Characterization of Morphological Operations In this chapter we focus on the ways in which different languages encode information morphologically, with particular attention paid to the formal devices that are used. The point of this discussion is not merely to present a taxonomy but in addition to argue that all morphology can be understood in terms of a single regular operation, namely composition. This is a somewhat different view from the one normally presented in work on computational morphology. For example, Beesley and Karttunen (2000) observe that concatenation is sufficient for straightforward concatenative morphology (see Section 2.2.1). For nonconcatenative morphology such as that found in Semitic languages (Section 2.2.9) or reduplication (Section 2.4), they propose the operation compile-replace (which we will describe in more detail in Section 2.2.9). Indeed when we look at straightforward cases of concatenative morphology, we are naturally seduced by the idea that regular concatenation is the operation of choice. While this is perfectly viable, as we shall argue below, not only can concatenative morphology be handled by composition, but this is in fact the preferred mechanism, since many cases of affixation also have either restrictions on what kinds of bases they can attach to or impose modifications on their bases. As we shall see, such additional stipulations fall out naturally from treating affixation in terms of composition. But we are getting ahead of ourselves. Let us start out with some background on morphological analysis and what it is about.
2.1 Introduction Suppose you are studying Latin. You come across an unfamiliar form: scr¯ıps¯erunt. Naturally enough, you want to know what it means. You already know the verb scr¯ıb¯o “I write”, and you guess that there is enough similarity between the forms that perhaps scr¯ıps¯erunt is actually a form of the verb “write”. Sure enough, with a little delving into your grammar, you find that this is correct and that the form scr¯ıps¯erunt is in fact the third person, plural, perfect, active, indicative form of the verb “write”: “they wrote”. You also learn that scr¯ıps¯erunt can be decomposed into the stem scr¯ıb “write” (which becomes scr¯ıp before /s/), the perfect stem-forming affix -s-, which characterizes third conjugation verbs, and the (perfect) third plural suffix -¯erunt. Relating word forms and detecting the structure of word forms is what morphological analysis is all about. The task of relating a given form to a canonical form is called lemmatization. Thus in the Latin example, we would lemmatize scr¯ıps¯erunt to scr¯ıb¯o: note that the lemma is typically the form of the word that we find in a dictionary, which in the case of Latin is the first person, singular, present, active, indicative form. Both lemmatization and the decomposition into parts have their uses. For tasks that involve meaning (such as information retrieval or machine translation), we would typically be interested in lemmatizing words to some canonical form. For applications such as text-to-speech synthesis, we may also be interested in the analysis of a word into its parts. Clearly in a speech synthesizer we do not want the system to speak the lemma – e.g. to say scr¯ıb¯o when the text has scr¯ıps¯erunt. However, we may want the system to know something about the decomposition of the word, since such knowledge can affect its pronunciation. In Standard German, for example, knowing that a word-internal s is a compound linking morpheme is important for determining whether it is pronounced /s/ or /S/. Thus the middle /s/ in Staatsprotokoll “state record” is a linking morpheme, and therefore does not belong to the onset of the following syllable; thus it is pronounced /s/, and not /S/, which is what we would expect if it were part of the same syllable as the following /p/. In this discussion we will be neutral as to whether we are considering decomposition or lemmatization. In fact, one is formally derivable from the other. To see this, consider that lemmatization can be viewed as a combination of decomposition followed by normalization to the
lemmatized form. Anticipating somewhat, if you have a function D that transmutes a word to its morphological decomposition, we can compose that function with another function L that determines, for each decomposed stem, what the appropriate lemma would be. Thus we can lemmatize a given form by applying the function D ◦ L to that form. In the case of our Latin example, D would map scr¯ıps¯erunt to scr¯ıb+s+¯erunt, tagging the form as third person, plural, perfect, active, with the stem scr¯ıb: scr¯ıb + s[perfect] + ¯erunt[third person, plural, active, indicative]. L would then map the stem scr¯ıb to the canonical first person present active form, and remove other non-featural material, yielding: scr¯ıb¯o[perfect, third person, plural, active, indicative]. The task of morphological analysis then is to take word forms and relate them to other word forms, while at the same time deriving featural information about the form. It is customary in discussions of morphology to talk about inflectional versus derivational morphology, in terms of the kinds of features each of these encodes. We will not have much to say about this distinction here. Rather we will concentrate purely on the computational mechanisms for performing morphological analysis and how these mechanisms can represent two kinds of linguistic information:
• The formal properties of morphological operations – i.e. the syntagmatic combinations of morphological elements.
• The paradigmatic relation between forms.

It will be argued that the most general operation that allows us to describe nearly all of morphology with purely finite-state devices is composition. This demonstration will occupy the bulk of the next two sections of this chapter. The third section deals with paradigms and how these can be represented computationally. We end in the fourth section with a discussion of reduplication, the one kind of morphology that seemingly requires special treatment in any finite-state approach. This computational approach to morphology has some important implications for debates in theoretical morphology; we will address those matters in the next chapter. Throughout this discussion we will assume that we can notate the features of complex morphological forms as a string of feature values, as in the above Latin examples. We will consider this as a shorthand for a more articulated feature-structure-style representation. Thus we
will assume that scr¯ıb¯o[perfect, third person, plural, active, indicative] is a shorthand for something like the following:

    lex      scr¯ıb
    aspect   perfect
    person   third
    number   plural
    voice    active
    mood     indicative

It is not obvious to us that there is any loss in descriptive power. 1 Of course, the sequence does have to encode the features so that, for instance, plural is a feature of NUMBER and not of, say, VOICE. But this is trivially arranged by making sure that each feature value is uniquely associated with a given feature. There is not a lot that different theories of morphology agree on, but one thing they do seem to agree on is that morphological pieces come in two flavors. One flavor is affixes or affixation processes. Generally these have some set of restrictions on their attachments and can thus be said to select for a particular kind of base. The other is the things they attach to, which are words, stems or roots, depending upon the affix (or process) involved, but in any event things that are more "word-like" in the sense that they lack selectional requirements for attachment. Simply put, in a word like dogs, it is the s that selects for dog, not the other way around. In the formal expressions in what follows, we will indicate affix-like entities with lower-case Greek letters and word-like entities with upper-case Greek letters. Second, we will per normal convention assume an alphabet Σ that comprises the alphabet of all symbols (e.g., phonemes, letters) that make up the morphological pieces we are examining. For clarity, we will never use Σ to represent a word-like entity. We will also use ε for the empty string. Again for clarity, we will never use ε to represent an affix-like entity. There are two issues that we wish to make clear at the outset. The first is that while the analyses presented in the ensuing discussion are toys, it should be understood that there have been many large-scale implementations of morphology using finite-state techniques. Much of that work has been done using Two Level Morphology (Koskenniemi,
1 On a related note, Kornai (1991) presents a linearization of multidimensional autosegmental representations in phonology.
1983) or the Xerox finite-state tools (Beesley and Karttunen, 2003), but there have also been systems developed using many other toolkits, such as the systems for German, French, Spanish, Russian, and Italian, which were part of the Bell Labs multilingual text-to-speech system (Sproat, 1997a, b), and were developed using the lextools toolkit. Second, there has been a significant amount of work on inflectional morphology using the nonmonotonic inheritance network language DATR (Corbett and Fraser, 1993; Andry et al., 1993; Barg, 1994, inter alia), and extensions such as KATR (Finkel and Stump, 2002). DATR is particularly well-suited to handling the kinds of partial inheritances common in inflectional morphology. For example, it is very easy in DATR to handle cases where a word inflects mostly according to a particular pattern, except in a few cases where there is a local override. It is easy to indicate, for example, that all English verbs form their present participles in -ing, that a subset of verbs form their passive participles in -en, and that the particular verb rise has its past tense form as rose. DATR is a perfectly reasonable high-level language for describing these kinds of phenomena. But as with, for example, context-sensitive rewrite rules (Kaplan and Kay, 1994; Mohri and Sproat, 1996), DATR descriptions of morphology can be compiled down into finite-state automata (see Evans and Gazdar, 1989b). The fact that we do not discuss DATR-based approaches in this book says nothing about the utility of such approaches. Rather, what it reflects is the fact that DATR does not add to the formal power of the finite-state approaches that we will be describing.

2.2 Syntagmatic Variation

2.2.1 Simple Concatenation

At first glance, the most natural implementation of concatenative morphology would make use of the regular operation of concatenation. For example, if we have a stem A and a suffix β, we can combine them into a form Φ using regular concatenation as in 2.1:

Φ = A · β
(2.1)
An obvious example of this operation is regular plural formation in English, where a singular noun is suffixed (orthographically) with -s or -es. Note, however, that it is possible to analyze it instead as follows. Suppose, instead of treating β as a string on a par with A, we instead
think of it as a function β̂ that takes as input a string, and produces as output that string concatenated with β. β̂ would be defined as in 2.2 and the whole construction as in 2.3:

β̂ = Σ∗ [ε : β]
(2.2)
Φ = A ◦ β̂
(2.3)
(Note that here and henceforth we will use Σ∗ – a specification of a regular language – to represent a regular relation that maps strings into themselves. In the notation of Kaplan and Kay (1994), this would be represented as Id(Σ∗), but we will dispense with this more correct notation as long as no ambiguity arises.) What are the advantages of doing this? From a computational point of view this is not immediately clear, since while concatenation as in 2.1 is a constant time operation, the composition in 2.3 is linear in the length of A, because minimally each symbol in A must be read to match with Σ∗. But it is frequently the case that when affixes combine they also either trigger some phonological, spelling, or morphological change that affects the stem, or the affix, or both; or else they select for some prosodic property of the base. And in these cases composition is necessarily involved at some level since phonological rules are implementable using composition (C. D. Johnson, 1972; Kaplan and Kay, 1994; Mohri and Sproat, 1996), and composition also affords a natural way of implementing prosodic selection, as we shall see. Further examples will be given below, but for now note the obvious case of English plural 's', which is /ɪz/ after apical fricatives and affricates, /z/ after voiced sounds, and /s/ elsewhere. This rule can be implemented with a transducer T. If we assume that the plural suffix σ is attached to the stem S with concatenation, for a plural form we have:

Φ = [S · σ] ◦ T
(2.4)
Now, we can refactor this as:

Φ = S ◦ [Σ∗ [ε : σ]] ◦ T
(2.5)
And if we then define σ̂ to be:

σ̂ = [Σ∗ [ε : σ]] ◦ T
(2.6)
this yields

Φ = S ◦ σ̂
(2.7)
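As a rough orthographic illustration of equations 2.4–2.7, the suffixing relation and the transducer T can be modelled as ordinary Python functions and precomposed into a single function, here called sigma_hat; the allomorphy rule is deliberately simplified and the function names are ours.

def add_plural(stem):
    """The relation Sigma* [eps : s]: concatenate the plural morpheme."""
    return stem + "+s"

def sandhi(form):
    """The transducer T: spell the morpheme out as -es after sibilants, -s elsewhere
    (an orthographic stand-in for the /iz/ ~ /z/ ~ /s/ rule)."""
    stem, suffix = form.split("+")
    if suffix == "s" and stem.endswith(("s", "z", "x", "sh", "ch")):
        return stem + "es"
    return stem + suffix

def sigma_hat(stem):
    """The precomposed affix: [Sigma* [eps : s]] composed with T."""
    return sandhi(add_plural(stem))

for w in ("dog", "fox", "church", "cat"):
    print(w, "->", sigma_hat(w))   # dogs, foxes, churches, cats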
Our new affix σ̂ adds the suffix and makes the appropriate modifications in one fell swoop. We return below to other cases where affixes have a more drastic effect on their bases, thus further motivating the use of composition. But before that we need to introduce the concept of Prosodic Circumscription, which will be useful in our later discussions.

2.2.2 Interlude: Prosodic Circumscription

In their work on Prosodic Morphology, McCarthy and Prince (1986; 1990) present the notion of prosodic circumscription:

Prosodic Circumscription of Domains. The domain to which morphological operations apply may be circumscribed by prosodic criteria as well as by the more familiar morphological ones. In particular, the minimal word within a domain may be selected as the locus of morphological transformation in lieu of the whole domain. (McCarthy and Prince, 1990)
Prosodic circumscription starts with factoring a base into a prosodically defined unit and its residue. For example, we might define a syllable at the right edge of a stem as a prosodically defined unit, in which case the whole stem to the left of the final syllable is the residue. Or we might define the initial onset of the stem as the prosodically defined unit, in which case the remainder of the stem after the onset would be the residue. In McCarthy and Prince’s notation, a base B would be factored into a prosodically defined unit “B:” concatenated (“∗ ”), in some order, with a residue “B/”: 2 B = B: ∗ B/
(2.8)
Prosodic morphological operations can then be defined to apply either to the prosodically defined unit B: or to the residue B/. Thus, given an operation O, we can define operations O: and O/ as follows:
O: = O(B:) ∗ B/
(2.9)
O/ = B: ∗ O(B/)
(2.10)
2 The ":" and "*" operators are McCarthy and Prince's notation and are not to be confused with the use of those symbols in regular expressions.
So, with O: we first factor the base into B: and B/, apply O to B:, and then reconstitute O(B:) with B/. Contrariwise, with O/ we factor the base, apply O to B/ and then reconstitute the result. O/ is the case of extrametricality: we define B: as a prosodic domain to ignore, apply an operation to the residue, and then reconstitute the whole. O: is the case of positive circumscription: we define B: as the prosodic domain to which operations apply. Both of these phenomena are common in morphology. Extrametricality is exemplified by infixation in many Philippine languages, where one ignores the first onset of a word, and attaches the infix as a prefix to the remainder. Thus in Tagalog we have tawag "call" but tumawag "call (perfective)"; see Section 2.2.7. The converse situation is illustrated by infixation in Ulwa where, as we shall see in Section 2.2.8, the infix is placed after an initial iambic foot. Specifics aside, we can completely characterize prosodic circumscription in terms of the finite-state operation of composition. The definition of the prosodic unit can be implemented as a transducer that inserts a marker after (or before) the prosodic unit of interest. For concreteness, consider the example of infixation in Tagalog, where the prosodic unit to be identified is the initial onset of the stem, if any. We wish to insert a marker, say >, after this constituent C and then leave the rest of the base alone. The marker transducer M that accomplishes this can be defined as follows:

M = C? [ε : >] V Σ∗
(2.11)
One can then use the marker > as an anchor to define the domain of other operations. For example, in Tagalog, one would use > to define where to place the infix -um-. One can then implement the actual infixation as a transducer that simply rewrites > as -um-: again, see Section 2.2.7. In one sense the description just given is simpler than that of McCarthy and Prince: note that once one has defined the placement of the marker, subsequent operations merely need to reference that marker and have no need of notions like “prosodically defined unit” or “residue”. Thus, if we handle Tagalog infixation as sketched above, we could characterize -um- as “prefixing” to the residue (e.g., -awag), but we could equally well characterize it as suffixing to the prosodically defined unit (e.g., t-). The former may make more sense as a linguistic
Table 2.1 Comparative affixation in English
fat        fatter         fattest
dumb       dumber         dumbest
sílly      sillier        silliest
yéllow     yellower       yellowest
timid      timider        timidest
curious    ∗curiouser     ∗curiousest
conception, but at the level of computational implementation, the distinction is immaterial. This computational understanding of prosodic circumscription is important for what follows since when we get beyond the most straightforward concatenation into increasingly complex modes of morphological expression, we find that the bases of the morphological processes involved are restricted prosodically in various ways. In such cases the Σ∗ of expression 2.2 is replaced by a more complex regular relation.

2.2.3 Prosodically Governed Concatenation

The English comparative affix -er and the superlative affix -est are good examples of affixes that have prosodic restrictions on their attachment. The affixes are restricted to bases that are monosyllabic or disyllabic adjectives; see Table 2.1. 3 Assuming this is the correct characterization of English comparative formation, we can characterize the base to which the comparative affix
3 One class of adjectives that do not admit of comparative affixation are participles. Thus more known, but ∗knowner. The other class of prima facie exceptions to the generalization are adjectives prefixed with un- such as unhappier. Note that the interpretation of unhappier is the same as that of more unhappy, suggesting that -er is attached outside unhappy. Some previous literature – e.g., Pesetsky (1985) and Sproat (1985) – has analyzed these cases as bracketing paradoxes, with mismatching syntactic and phonological structures. On this view, the syntactic and hence the semantic interpretation has -er outside unhappy, but the phonological structure has -er attaching only to happy, in line with expectations. More recently, Stump (2001) has characterized cases like unhappier as head operations, whereby the operation in question, in this case the operation of comparative formation, operates on the head of the word (happy) rather than the whole word. Thus, again, the prosodic conditions of the comparative and superlative affixes are met.
Table 2.2 Template- and non-template-providing affixes in Yowlumne
Root              -al "dubitative"   -t "passive aorist"   -inay "gerundial" CVC(C)   -Paa "durative" CVCVV(C)
caw "shout"       caw-al             caw-t                 caw-inay                   cawaa-Paa-n
cuum "destroy"    cuum-al            cuum-t                cum-inay                   cumuu-Paa-n
hoyoo "name"      hoyoo-al           hoyoo-t               hoy-inay                   hoyoo-Paa-n
diiyl "guard"     diiyl-al           diiyl-t               diyl-inay                  diyiil-Paa-n
Pilk "sing"       Pilk-al            Pilk-t                Pilk-inay                  Piliik-Paa-n
hiwiit "walk"     hiwiit-al          hiwiit-t              hiwt-inay                  hiwiit-Paa-n
The first two affixes are neutral; for the template-providing affixes (-inay and -Paa), the form of the provided template is given in the column header.
attaches as in Equation 2.12: 4 B = C ∗ V C ∗ (V C ∗ )?
(2.12)
The comparative affix κ would then be characterized as follows:

κ = B [ε : er[+comp]]
(2.13)
Composing a base adjective A with κ as in Equation 2.14 would yield a non-null output just in case the base A matches B.

Φ = A ◦ κ
(2.14)
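A minimal sketch of the prosodic restriction follows; it assumes syllable counts are supplied by a small hand-built dictionary rather than computed from the string, and spelling adjustments (fatter, sillier) are ignored.

# Toy syllable counts standing in for a real prosodic analysis of the base.
SYLLABLES = {"fat": 1, "dumb": 1, "silly": 2, "yellow": 2, "timid": 2, "curious": 3}

def comparative(adj):
    """Compose the base with the comparative affix: succeeds only if the
    base meets the prosodic condition (at most two syllables)."""
    if SYLLABLES.get(adj, 99) <= 2:
        return adj + "er[+comp]"
    return None            # the composition yields the empty relation

print(comparative("yellow"))    # yellower[+comp]
print(comparative("curious"))   # None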
More dramatic are cases where the affix provides the template for the stem, rather than merely selecting for stems that have certain prosodic forms. A well-known example is provided by Yowlumne (Yawelmani), as discussed in Newman (1944) and Archangeli (1984); see Table 2.2, which is copied from Bat-El (2001). The two template-providing affixes shown here are -inay, which requires that the stem be reconfigured to 4
4 As a reviewer has pointed out, Pullum and Zwicky (1984) argued against a phonological account, suggesting instead that comparative affixation was a case of "arbitrary lexical conditioning." This conclusion was motivated by a particular theoretical position on the phonology/syntax interface, and their evidence consisted in the observation that many examples of adjectives that should allow comparative affixation were nonetheless unacceptable. The problem is that many of the adjectives they list as unacceptable – among them wronger, iller, afraider, unkemptest, aloner – are readily found via a Google search. Of course we cannot tell how many of these uses are intentionally jocular (such as Lewis Carroll's use of curiouser) but it is a fair bet that not all are. In any case it is clear that English speakers readily accept comparatives of nonce adjectives that fit the prosodic criteria (John is a lot dribber than Mary), and just as readily reject those that do not (?John is a lot dramooliger than Mary). Given these considerations, it does not seem that we can so easily dismiss a prosodic account of the phenomenon.
match the template CVC(C); and -Paa, which requires the template CVCVV(C). Thus a stem such as diiyl "guard", must shorten its vowel to conform to CVC(C), when attaching to -inay (diylinay). Similarly, it must shorten the stem vowel, but then insert a long copy of the stem vowel between the second and third consonants in order to attach to -Paa (diyiilPaan). The template T_CVC(C) for CVC(C) is easily characterized as follows:

T_CVC(C) = C V [V : ε]∗ C [V : ε]∗ C?
(2.15)
This ensures that only the first vowel of the root is preserved, and in particular deletes any vowels after the second consonant. Thus the result of composing T_CVC(C) with particular stems is as follows:

caw ◦ T_CVC(C) = caw
(2.16)
diiyl ◦ T_CVC(C) = diyl
(2.17)
hiwiit ◦ T_CVC(C) = hiwt
(2.18)
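The effect of T_CVC(C) on these stems can be imitated with a few lines of Python; the function below is only an analogue of the transducer, assuming roots with a single vowel quality and at most three consonants, as in Table 2.2.

import re

def cvc_template(root, vowels="aeiou"):
    """An analogue of T_CVC(C): keep the root's first vowel and delete
    any vowels that follow the second consonant."""
    m = re.match(rf"([^{vowels}]*[{vowels}])[{vowels}]*(.*)$", root)
    head, rest = m.groups()
    return head + re.sub(rf"[{vowels}]", "", rest)

for r in ("caw", "cuum", "hoyoo", "diiyl", "Pilk", "hiwiit"):
    print(r, "->", cvc_template(r))   # caw, cum, hoy, diyl, Pilk, hiwt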
The expression for T_CVCVV(C) (CVCVV(C)) is somewhat more complicated since it involves copying vowel material in the root. In order to see how to do this, it is somewhat easier to start by considering deriving possible roots from the durative form. Roots must have at least one V, matching the first vowel of the durative stem, but may have as many as three Vs as in cases like hoyoo and hiwiit. There may or may not be a final (third) consonant. This range of possibilities can be derived from the template CVCVV(C), if we assume a transducer Δ specified as follows:

Δ = C V [ε : V]? C (V ∪ [V : ε]) (V ∪ [V : ε]) C?
(2.19)
This transducer will force the first V to match the vowel of the root, will allow an optional second vowel in the root’s first syllable, then will allow either zero, one or two vowels, followed optionally by a consonant following this first syllable. This is not quite enough, however, since it does not guarantee that the Vs are all identical. To get that, we can implement a simple filter F specified as follows: F = (C ∪ i )∗ ∪ (C ∪ a)∗ ∪ (C ∪ o)∗ ∪ (C ∪ u)∗
(2.20)
See Figure 2.1. The desired transducer T_CVCVV(C) is then simply the inverse of the composition of F and Δ; since we want the vowels to be identical
[Figure 2.1 appears here: the transducer Δ for the durative template, with C:C arcs and matching vowel arcs for each of i, a, o, u.]
Figure 2.1 Yowlumne durative template CVCVV(C)
on both sides of the transducer, we impose the filter on both sides:

T_CVCVV(C) = [F ◦ Δ ◦ F]⁻¹
(2.21)
Given T_CVC(C) and T_CVCVV(C), we can now represent the morphemes -inay and -Paa as follows:

κ1 = T_CVC(C) [ε : inay[+ger]]
(2.22)
κ2 = T_CVCVV(C) [ε : Paa[+dur]]
(2.23)
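The durative template can be imitated in the same spirit; the sketch below assumes, as in Table 2.2, that each root has a single vowel quality and two or three consonants.

def durative_template(root, vowels="aeiou"):
    """An analogue of T_CVCVV(C): the consonants fill the C slots of CVCVV(C),
    and the root's vowel quality fills every V slot, long (doubled) in the
    second syllable."""
    cons = [ch for ch in root if ch not in vowels]
    v = next(ch for ch in root if ch in vowels)
    stem = cons[0] + v + cons[1] + v + v
    return stem + cons[2] if len(cons) > 2 else stem

for r in ("caw", "cuum", "hoyoo", "diiyl", "Pilk", "hiwiit"):
    print(r, "->", durative_template(r) + "-Paa-n")
    # cawaa-Paa-n, cumuu-Paa-n, hoyoo-Paa-n, diyiil-Paa-n, Piliik-Paa-n, hiwiit-Paa-n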
Again, composition of these affixes with the roots yields the observed forms.

2.2.4 Phonological Changes Induced by Affixation

In the case of Yowlumne, affixes impose a particular prosodic shape on the base. A milder variety of this phenomenon involves phonological changes that are induced by affixes, which do not involve major prosodic rearrangements of the base material. An example is the diminutive suffixes -chen and -lein in Standard German. Generally, these affixes induce umlaut (fronting) on the vowel of the stem to which they attach. Some examples are given in Table 2.3. Assuming this rule is productive, then we can implement it in a way that is straightforwardly parallel to the Yowlumne case. That is, we assume a transducer T_uml that will change vowels into their appropriate umlauted forms. (Only /a/, /o/ and /u/, and the diphthong /au/, undergo umlaut.) Then the entry for -chen would be as follows:

χ = T_uml [ε : chen]
(2.24)
Composing a stem with χ produces the affixed form with umlaut applied to the stem.

Table 2.3 Diminutive suffixation in German
Hund     Hündchen     "dog"
Haus     Häuschen     "house"
Blatt    Blätchen     "leaf"
Maus     Mäuschen     "mouse"
Frau     Fräulein     "woman" (diminutive = "young woman")
Rose     Röslein      "rose"
Table 2.4 Irish first declension genitive formation
Nominative           Genitive             Gloss
bád       /d/        báid       /dʲ/      "boat"
cat       /t/        cait       /tʲ/      "cat"
eolas     /s/        eolais     /ʃ/       "knowledge"
leabhar   /r/        leabhair   /rʲ/      "book"
mac       /k/        mic        /kʲ/      "son"
naomh     /v/        naoimh     /vʲ/      "saint"
páipéar   /r/        páipéir    /rʲ/      "paper"
sagart    /rt/       sagairt    /rʲtʲ/    "priest"
Shown are the nominative (base) form, the final consonant cluster of that base form, the genitive and the final consonant of the genitive.
2.2.5 Subsegmental Morphology

Morphological alternants can also be indicated with subsegmental information, such as a change of a single feature. A straightforward example of this is provided by genitive formation in Irish, exemplified in Table 2.4. In these forms, the genitive is marked by palatalizing the final consonant (sequence) of the base (nominative) stem. (Note that the palatal variant of /s/ in Irish is regularly /ʃ/.) A standard linguistic analysis of this alternation would be to assume that the feature [+high] is linked to the final consonant (Lieber, 1992). The genitive can be represented by a transducer γ that simply changes the final consonant (sequence) in the described way; γ can be compiled from a context-dependent rewrite rule, using algorithms described in Kaplan and Kay (1994) and Mohri and Sproat (1996). The genitive form is then produced from the nominative form N by composition:

Φ = N ◦ γ
(2.25)
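A toy rendering of the palatalization analysis follows, marking each consonant in the final cluster of a roughly phonemic stem with ʲ; the orthographic realizations in Table 2.4 (báid, sagairt, etc.) are not modelled, and the function name is ours.

def palatalize_final(stem, vowels="aeiouáéíóú"):
    """An analogue of the genitive transducer: mark every consonant in the
    final cluster of a (roughly phonemic) stem as palatalized."""
    i = len(stem)
    while i > 0 and stem[i - 1] not in vowels:
        i -= 1
    return stem[:i] + "".join(c + "ʲ" for c in stem[i:])

print(palatalize_final("bad"))      # badʲ       (cf. bád ~ báid)
print(palatalize_final("sagart"))   # sagarʲtʲ   (cf. sagart ~ sagairt)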
A more complex instance of subsegmental morphology is illustrated by initial consonant mutation in the related Celtic language Welsh. 5 Initial mutations are of three basic types: "soft" mutation, or lenition, which involves voicing unvoiced consonants, spirantizing or deleting voiced consonants, and otherwise leniting others; "aspirate" mutation, which involves spirantizing voiceless stops; and nasal mutation, which causes nasalization of stops. The mutations are induced by various
5 All modern Celtic languages, including Irish, have consonant mutation, but the mutation system of the Brythonic languages – Welsh, Cornish, and Breton – is more complex than that of the Goidelic languages – Irish, Scots Gaelic, and Manx.
Table 2.5 Initial consonant mutation in Welsh
Base           Gloss         Soft – a 'his X'       Asp. – a 'her X'    Nas. – fy 'my'
pen            "head"        a ben                  a phen (/f/)        fy mhen (/mh/)
tad            "father"      a dad                  a thad (/θ/)        fy nhad (/nh/)
cath           "cat"         a gath                 a chath (/x/)       fy nghath (/ŋh/)
bachgen        "boy"         a fachgen (/v/)        —                   fy machgen
damwain        "accident"    a ddamwain (/ð/)       —                   fy namwain
gwraig         "wife"        a wraig                —                   fy ngwraig
mam            "mother"      a fam (/v/)            —                   —
rhosyn         "rose"        a rosyn                —                   —
llais (/ɬ/)    "voice"       a lais                 —                   —
Illustrated are (one variant of) "soft" mutation (lenition), nasal mutation, and aspirate mutation, along with example-triggering morphemes. Entries left blank have no change in the relevant cell. Phonetic values of the initial consonants are indicated where this may not be obvious.
morphosyntactic triggers including many function words (prepositions, possessive markers, articles with following feminine nouns), word-internally with various prefixes and in compounds, and (in the case of lenition) by various syntactic environments such as the initial consonant of an NP following another NP (Harlow, 1981). Many treatments and descriptions of the mutation system of Welsh have been published – Lieber (1987), Harlow (1989), and Thorne (1993) are just three examples, though the most complete analysis is probably still Morgan (1952). Some examples of the mutations are given in Table 2.5. If we assume three transducers – λ for lenition, ν for nasalization and α for aspirate mutation – then morphemes that select for the different mutations would be indicated as inducing the relevant mutation on the base. Thus to take a concrete example, the possessive marker dy "your (sg.)" would be specified as follows:

δ = [ε : dy] λ
(2.26)
The second person singular possessive marking of a noun B would thus be constructed as follows:

Φ = B ◦ δ
(2.27)
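A sketch of the lenition entry for dy follows, using one variant of the soft-mutation correspondences from Table 2.5; the dictionary and function names are ours, and the many lexical and syntactic subtleties of the mutation system are ignored.

# One variant of soft mutation (lenition), keyed on the initial consonant.
SOFT = {"ll": "l", "rh": "r", "p": "b", "t": "d", "c": "g",
        "b": "f", "d": "dd", "g": "", "m": "f"}

def lenite(word):
    """An analogue of the lenition transducer: rewrite the initial consonant."""
    for onset in ("ll", "rh"):                 # check digraphs first
        if word.startswith(onset):
            return SOFT[onset] + word[len(onset):]
    if word[0] in SOFT:
        return SOFT[word[0]] + word[1:]
    return word

def your_sg(noun):
    """The entry: prefix dy and lenite the base."""
    return "dy " + lenite(noun)

print(your_sg("cath"))   # dy gath   "your (sg.) cat"
print(your_sg("pen"))    # dy ben    "your (sg.) head"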
2.2.6 Subtractive Morphology

One case of subtractive morphology, also called truncation, involves plural stem formation in Koasati, exemplified in Table 2.6 (Lombardi and McCarthy, 1991). The generalization is that the final rime of the
Table 2.6 Koasati plural stem truncation
Singular                      Plural                        Gloss
pitáf-fi-n /pitáf-li-n/       pít-li-n                      "to slice up the middle"
latáf-ka-n                    lat-ka-n                      "to kick something"
tiwáp-li-n                    tiw-wi-n /tiw-li-n/           "to open something"
atakáa-li-n                   aták-li-n                     "to hang something"
icoktakáa-li-n                icokták-li-n                  "to open one's mouth"
albitíi-li-n                  albít-li-n                    "to place on top of"
ciłíp-ka-n                    cíł-ka-n                      "to spear something"
facóo-ka-n                    fas-ka-n /fac-ka-n/           "to flake off"
onasanáy-li-n                 onasan-níici-n                "to twist something on"
iyyakohóp-ka-n                iyyakóf-ka-n /iyyakóh-ka-n/   "to trip"
koyóf-fi-n /koyóf-li-n/       kóy-li-n                      "to cut something"
Data from Lombardi and McCarthy (1991)
singular stem, consisting of a final vowel and any following consonant, is deleted in the formation of the plural stem. The onset of the final syllable of the singular stem is retained. Lombardi and McCarthy, who offer an analysis in terms of Prosodic Circumscription theory, argue that the retention of the onset follows from phonotactic considerations due to a generalization of root-final heaviness. Be that as it may, the analysis in terms of finite-state operations is clear. We assume a truncation transducer τ that deletes the final rime of the base. The whole plural transducer is as in Equation 2.28, and is depicted in Figure 2.2.

τ̂ = τ [ε : [+pl]]    (2.28)

[Figure 2.2 appears here: states 0–3 with arcs V:V, C:C, V:ε, C:ε and ε:[+pl].]
Figure 2.2 Koasati plural truncation transducer
Table 2.7 Infixation of -um- in Bontoc
antj´ˇoak     "tall"    umantj´ˇoak     "I am getting taller"
k´ˇawˇısat    "good"    kum´ˇawˇısat    "I am getting better"
p´¯usiak      "poor"    pum´¯usiak      "I am getting poorer"
A singular stem A, composed with τ̂, will result in a truncated form tagged as morphologically plural:

Φ = A ◦ τ̂
(2.29)
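The truncation transducer τ can be imitated with a single substitution over a toy orthographic representation; accent placement and the independent consonant changes (fac- ~ fas-, -koh- ~ -kof-) are not modelled, and the function name is ours.

import re

def plural_stem(singular, vowels="aeiouáéíóú"):
    """An analogue of the truncation transducer: delete the final rime
    (the last vowel sequence together with any following consonants)."""
    return re.sub(rf"[{vowels}]+[^{vowels}]*$", "", singular)

for s in ("pitáf", "tiwáp", "atakáa", "ciłíp", "onasanáy"):
    print(s, "->", plural_stem(s))   # pit, tiw, atak, cił, onasan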
2.2.7 Extrametrical Infixation

An example of extrametrical infixation can be found in Bontoc (Seidenadel, 1907), as well as many other languages of the Philippines. In Bontoc, the infix -um- marks a variety of different kinds of semantic information. In the examples in Table 2.7 the alternation involves what Seidenadel terms progressive quality. The infix is prefixed to the word, ignoring the first consonant, if any. It is perhaps simplest to think of this infixation as involving two components computationally. The first component inserts a marker > (> ∉ Σ) in the appropriate location in the word, and the second converts the marker to the infix -um-. Note that the other infix in the language, -in- (which among other things marks parts of the body as wounded), also attaches in the same position as -um-, so it makes sense to factor out placement of the infix from the actual spellout of the infix. Marker insertion can be accomplished by the simple transducer M in Figure 2.3. The infixation transducer ι will map > to -um- and will simultaneously add a morphosyntactic feature marking the construction – call it [+be] – to the end of the word. This is shown in Figure 2.4. Infixation of -um- is then accomplished by composing the input word A with M and ι:

Φ = A ◦ M ◦ ι
(2.30)

[Figure 2.3 appears here: states 0–3 with arcs ε:>, C:C, V:V and Σ:Σ.]
Figure 2.3 Marker insertion transducer M for Bontoc infixation. “V” represents a vowel, and “C” a consonant
Figure 2.4 Transducer ι, which converts > to -um- and inserts [+be] at the end of the word
Again, we can precompose M and ι:

μ = M ◦ ι    (2.31)

so that now

Φ = A ◦ μ    (2.32)
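The two-step factoring of infixation – marker placement followed by spell-out – can be sketched as follows; the example uses the Tagalog pair tawag ~ tumawag mentioned in Section 2.2.2, and the function names are invented for the illustration.

import re

def insert_marker(word, vowels="aeiou"):
    """The marker machine M: put > after the initial onset consonant, if any."""
    m = re.match(rf"([^{vowels}]?)(.*)$", word)
    return m.group(1) + ">" + m.group(2)

def spell_out(marked, infix="um", feature="[+be]"):
    """The spell-out machine: rewrite > as the infix and append the feature."""
    return marked.replace(">", infix) + feature

def infix_um(word):
    """The precomposed machine: marker placement followed by spell-out."""
    return spell_out(insert_marker(word))

print(insert_marker("tawag"))   # t>awag
print(infix_um("tawag"))        # tumawag[+be]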
2.2.8 Positively Circumscribed Infixation

Ulwa (Nicaragua) (CODIUL, 1989) provides a good example of infixation that attaches to a prosodically defined portion of the base. Some examples are given in Table 2.8. In this case the prosodic unit is an iambic foot, and the generalization is that the infixes in question – in Ulwa, the set of personal possessive markers – suffix to the first foot of the word. Iambic feet include disyllables with a short vowel followed by either a long or short vowel (bilam, anaa), and monosyllables with either a long vowel (dii) or a closed syllable with VC (sik). If the word is a single foot, as in the case of bilam "fish" or dii "snake", the affix is merely a suffix; otherwise the affix attaches to the first foot, as in liima "lemon".

Table 2.8 Infixation in Ulwa
bilam       "fish"         bilamki         "my fish"
dii         "snake"        diimamuih       "your (sg.) snake"
liima       "lemon"        liikama         "his lemon"
sikbilh     "horsefly"     siknibilh       "our (incl.) horsefly"
onyan       "onion"        onkinayan       "our (excl.) onion"
baa         "excrement"    baamana         "your (pl.) excrement"
mistu       "cat"          miskanatu       "their cat"
anaalaka    "chin"         anaakanalaka    "their chin"

As with Bontoc, Ulwa infixation can be factored into two components, the first of which introduces a marker into the relevant post-foot position, the second of which transduces that marker to the relevant infixes. The marker machine M can be described using a set of
context-dependent rewrite rules:

ε → > / ^ C?(VC | VV | VCVV?) ___ CV    (2.33)
ε → > / ^ C?(VC | VV | VVC | VCVV?C) ___ $    (2.34)
Here, "^" and "$" indicate the beginning and end of string, respectively. The first rule introduces > after initial (C)VC, (C)VV and (C)VCVV? before a following CV. The second introduces > after initial (C)VC, (C)VVC, (C)VV, and (C)VCVV?C before a following end-of-string. Once again, a machine ι will convert the marker to one of the infixes so that the infixed form of base A is again defined as follows, where once again we can reassociate:

Φ = A ◦ M ◦ ι = A ◦ [M ◦ ι] = A ◦ μ
(2.35)
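The marker-placement rules 2.33–2.34 and the possessive spell-out can be approximated with two anchored regular expressions over a plain orthographic string; this is only a sketch, with the consonant and vowel classes and the function names assumed for the example, and vowel length represented simply by doubled letters.

import re

V, C = "aeiou", "bcdfghjklmnpqrstvwxyz"

# Rule 2.34: an initial iambic foot that exhausts the word.
FOOT_AT_END = re.compile(
    rf"^[{C}]?(?:[{V}][{C}]|[{V}]{{2}}[{C}]?|[{V}][{C}][{V}]{{1,2}}[{C}])$")
# Rule 2.33: an initial iambic foot followed by CV.
FOOT_THEN_CV = re.compile(
    rf"^[{C}]?(?:[{V}][{C}]|[{V}]{{2}}|[{V}][{C}][{V}]{{1,2}})(?=[{C}][{V}])")

def insert_marker(word):
    m = FOOT_AT_END.match(word) or FOOT_THEN_CV.match(word)
    return word[:m.end()] + ">" + word[m.end():] if m else word

def possess(word, affix):
    """Spell the marker out as a possessive affix."""
    return insert_marker(word).replace(">", affix)

print(possess("bilam", "ki"))     # bilamki     "my fish"
print(possess("liima", "ka"))     # liikama     "his lemon"
print(possess("sikbilh", "ni"))   # siknibilh   "our (incl.) horsefly"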
2.2.9 Root-and-Pattern Morphology The best-known example of root-and-pattern morphology is the derivational morphology of the verbal system of Arabic, which was given the first formal generative treatment by McCarthy (1979). As is well known, Semitic languages derive verb stems – actual verbs with specific meanings – from consonantal roots, where the overall prosodic “shape” of the derivative is given by a prosodic template (in McCarthy’s original analysis a CV template), and the particular vowels chosen depend upon the intended aspect (perfect or imperfect) and voice (active or passive). Some examples for the active forms with the ubiquitous root ktb “notion of writing” (Kiraz, 2000) are given in Table 2.9. 6 For the sake of the present discussion we will assume that we are combining two elements, the root and the vocalized stem; the analysis would only be marginally more complicated if we chose to separate out the vowel as a separate morpheme as in most autosegmental (and some computational – see Kiraz, 2000) treatments of the phenomenon. We will assume that the root morpheme, such as ktb, is represented as a sequence of consonants as it would be represented in traditional 6 While tables such as Table 2.9 are often referred to as “paradigms” it is important to understand that the relations expressed here are not inflectional as in a standard verb paradigm for a language like Spanish, but derivational: each line in Table 2.9 represents a different verb, much as stand, understand, and withstand are different verbs in English.
42
2. formal characterization of morphology Table 2.9 Arabic active verb stems derived from the root ktb “notion of writing”. Note that the use of the vowel /a/ indicates that the stem is active Pattern
Template
Verb stem
Gloss
I II III IV VI VII VIII X
C1 aC2 aC3 C1 aC2 C2 aC3 C1 aaC2 aC3 aC1 C2 aC3 taC1 aaC2 aC3 nC1 aC2 aC3 C1 taC2 aC3 staC1 C2 aC3
katab kattab kaatab aktab takaatab nkatab ktatab staktab
“wrote” “caused to write” “corresponded” “caused to write” “wrote to each other” “subscribed” “copied” “caused to write”
analyses. Thus we could define the root P as follows: P = ktb
(2.36)
Similarly, we assume that the templates are represented more or less as in the standard analyses, except that the additional affixes that one finds in some of the patterns – the n- and sta- prefixes in VII and X or the -tinfix in VIII – will be lexically specified as being inserted. This serves the dual purpose of making the linking transducer (below) simpler to formulate and underscoring the fact that these devices look like additional affixes to the core CV templates (and presumably historically were): ÙI = C aC aC
(2.37)
ÙII = C aC C aC
(2.38)
ÙIII = C aaC aC
(2.39)
ÙIV = [Â : a]C C aC
(2.40)
ÙVI = [Â : ta]C aaC aC
(2.41)
ÙVII = [Â : n]C aC aC
(2.42)
ÙVIII = C [Â : t]aC aC
(2.43)
ÙX = [Â : s ta]C aC aC
(2.44)
2.2 syntagmatic variation
43
To get a transducer corresponding to all of the above templates, one simply takes the union: Ù= Ùp (2.45) p∈ patter ns
Finally we need a transducer to link the root to the templates; this transducer will serve as the computational implementation of the association lines in the standard autosegmental analysis. The linking transducer must do two things. First it must allow for optional vowels between the three consonants of the root. Second, it must allow for doubling of the center consonant to match the doubled consonant slot in pattern II. The first part can be accomplished by the following transducer: Î1 = C [Â : V ]∗ C [Â : V ]∗ C
(2.46)
The second portion – the consonant doubling – requires rewrite rules (Kaplan and Kay, 1994; Mohri and Sproat, 1996) of the general form: Ci → Ci Ci
(2.47)
This optional rule (in a brute force fashion) allows a copy of the center consonant; call this transducer Î2 . Then the full linking transducer Î can be constructed as: Î = Î1 ◦ Î2
(2.48)
The whole set of templates for ktb can then be constructed as follows: = P ◦Î◦Ù
(2.49)
Given that composition is closed under association, we can factor this problem differently. For example, if we prefer to view Arabic verbal morphology as involving just two components – a root and a pattern component, sans the linking component – then we could analyze the pattern as, say, consisting of Î ◦ Ù, and then this new transducer (call it Ù ) would be composed with the root machine. This would bring Arabic root-and-pattern morphology more in line with a number of other examples we have considered in that it would involve the composition of two elements. But this would be a purely theoretical move, not computationally motivated. It should be noted that while what we have described works for Arabic, as a practical matter, most large-scale working systems for Arabic,
44
2. formal characterization of morphology
such as Buckwalter (2002), sidestep the issue of constructing verb stems, and effectively compile out the various forms that verbs take. This is not such an unreasonable move, given that the particular forms that are associated with a verbal root are lexically specified for that root, and the semantics of the derived forms are not entirely predictable. Another approach taken is that of Beesley and Karttunen (2000) who propose new mechanisms for handling non-concatenative morphology including an operation called compile-replace. The basic idea behind this operation is to represent a regular expression as part of the finite-state network, and then to compile this regular expression on demand. To see how this works, consider a case of total reduplication (see Section 2.4) such as that found in Malay: thus a form like bagi “bag” becomes bagibagi “bags”. In Beesley and Karttunen’s implementation, a lexical-level form bagi +Noun +Plural would map to an intermediate surface form bagi^2. This itself is a regular expression indicating the duplication of the string bagi, which when compiled out will yield the actual surface form bagi-bagi. Thus for any input string w, the reduplication operation transforms it into the intermediate surface form w^2, which compile-replace then compiles out and replaces with the actual surface form. For Semitic templatic morphology, Beesley and Karttunen propose an additional device that they call merge. The operation of merge is simple to understand by example. Consider a root drs “read” and a template CVVCVC. The root is viewed as a filler for the template. The merge operation walks down the filler and the template in parallel, attempting to find a match between the template slot and what is available in the current position of the filler; note that this presumes that we have a table that indicates that, say, C will match consonants such as d, r or s. As described by Beesley and Karttunen (2000: 196): 7 1.
2.
7
If a successful match is found, a new arc is added to the current result state. The arc is labeled with the filler arc symbol; its destination is the result state that corresponds to the two original destinations. If no successful match is found for a given template arc, the arc is copied into the current result state. Its destination is the result state that corresponds to the destination of the template arc and the current filler state.
The operation described here is in many ways similar to the mechanism described in Beesley (1989) and discussed in Sproat (1992).
2.2 syntagmatic variation
45
Lexical: d r s =Root C V V C V C =Template u * i =Voc Surface: ^[ d r s .m>. C V V C V C .<m. u * i ^]
Figure 2.5 Representation of the Arabic duuris, following Beesley and Karttunen (2000, Figure 16). The surface form is a regular expression involving two mergers between the root drs, the pattern u*i and the template CVVCVC. The surface form is derived via compile– replace
Beesley and Karttunen distinguish leftwards and rightwards merge depending upon whether the filler is on the right or left, respectively. These are expressed as .m>. and .<m.. in the Xerox regular expression formalism (Beesley and Karttunen, 2003). An example is shown in Figure 2.5, where the surface form duuris is constructed out of a root drs, a vocalism u∗ i (where “∗ ” has the normal Kleene closure interpretation) and a root CVVCVC. Once again, the intermediate surface form derives the surface form via compile-replace. While this is a clean model of Arabic morphology, it is important to remember that the system has no more computational power than a model based solely upon composition. In particular, the merge operation is really a specialized version of intersection or composition. The second option in the algorithm sketched above is straightforwardly simulated by adding to the filler a loop over a label that will match the current template position. Consider the case illustrated in Figure 2.6, consisting of the root drs and the template CVVCVC. Suppose we have read the first d of the root and have matched it with the first C of the template. Thus we are in state 1 in the root machine and 1 in the template machine, or state {1,1} in the merged machine. Now we go to read the V in the template machine, and find that there is nothing to match it in the root machine. In Beesley and Karttunen’s algorithm we advance the template machine into state 2, leaving the root machine in state 1, putting us in state {2,1} in the merged machine. But the equivalent resulting state is achieved through standard intersection if we replace the root machine at the top of Figure 2.6 with the augmented root machine at the bottom. In that case we will follow the loop on V in the root machine, matching it against the V linking states 1 and 2 in the template machine, putting us into state {2,1} in the resulting intersected machine. This is equivalent to the description of Semitic morphology that we presented earlier in this section. Therefore, not surprisingly, Beesley and Karttunen’s merge operation is equivalent to standard intersection with a filler machine augmented with loops that will match the template elements not otherwise matched by the elements in the
Figure 2.6 Filler acceptors representing the root drs (a) and the template acceptor CVVCVC (b), following Beesley and Karttunen (2000). The third acceptor (c) is the root augmented with loops on V
root. Indeed, there should be no difference in efficiency between the standard intersection algorithm and the merge operation since in both cases we have to check the current state of the filler machine for arcs that match arcs leaving the current state of the template machine and move each machine into the appropriate next state. Thus, while the merge operation allows us to represent the roots and vowel templates in a somewhat simpler way (without the explicit self loops), this amounts to syntactic sugar. The operation of Beesley and Karttunen's whole system using merge and compile-replace is entirely equivalent to the model based on composition that we had previously sketched.8

8 We realize that the description just presented violates the maxim we introduced in Chapter 1 to avoid thinking at the machine level. However, it is necessary to delve down into the machine-level computations when comparing the efficiency of two algorithms.

2.2.10 Morphomic Components

Aronoff (1994) discusses morphological functions that he terms morphomic, which are purely morphological constructs. One example is the English passive participle and past participle (eaten, fried, wrung, etc.), which are always identical in form, yet have clearly different morphosyntactic functions. The forms also differ morphophonologically: some verbs mark the form with -en (eaten), others with -ed (fried) or -t (dealt), still others with a vowel change (wrung). Aronoff therefore argues that the forms are identical at a purely morphological level and makes the general substantive claim that "the mapping from morphosyntax to phonological realization is not direct but rather
Table 2.10 Latin third stem derivatives

                      Verb                   3rd Stem    Derived Form
Perfect Participle    laudāre "praise"       laudāt-     laudātus
                      ducere "lead"          duct-       ductus
                      vehere "carry"         vect-       vectus
                      premere "press"        press-      pressus
                      ferre "bear"           lat-        latus
Future Participle     laudāre "praise"       laudāt-     laudātūrus
                      ducere "lead"          duct-       ductūrus
                      vehere "carry"         vect-       vectūrus
                      premere "press"        press-      pressūrus
                      ferre "bear"           lat-        latūrus
Supine                piscāre "fish"         piscāt-     piscātum
                      dicere "say"           dict-       dictum
Agentive Noun         vincere "defeat"       vict-       victor "winner"
                      tondēre "shear"        tons-       tonsor "barber"
-iō Noun              cogitāre "think"       cogitāt-    cogitātiō "thought"
                      convenīre "meet"       convent-    conventiō "meeting"
-ūr Noun              scrībere "write"       script-     scriptūra "writing"
                      pingere "paint"        pict-       pictūra "painting"
Desiderative Verb     edere "eat"            ēs-         ēsurīre "be hungry"
                      emere "buy"            empt-       empturīre "want to buy"
Intensive Verb        iacere "throw"         iact-       iactāre "fling"
                      trahere "drag"         tract-      tractāre "drag"
Iterative Verb        vidēre "see"           vīs-        vīsitāre "see often, visit"
                      scrībere "write"       script-     scriptitāre "write often"

Participles are given in their masculine, singular, nominative form; supines in their accusative form; agentive and other nouns in their nominative singular.
passes through an intermediate level” (page 25). So in English two morphosyntactic functions, namely past participle and passive participle, are mapped to a single morphome, which Aronoff labels F en , and thence to various surface morphophonological forms, depending upon the verb. Another more complex example discussed by Aronoff is the third stem of Latin verbs, which forms the basis for the further derivation of nine distinct deverbal forms. The third stem is the base of the perfect participle, but also forms the basis for the future participle, the supine, agentive nouns in -or/-rix, abstract denominals in -i¯o and -¯ur, desiderative verbs, intensive verbs and iterative verbs. Some examples are given in Table 2.10.
Aronoff argues that there is no basis for regarding any of the nine forms as semantically basic, hence there is no reason to believe that any of the forms are derived from one another. However, they all share a common morphological base, the third stem, which thus, in Aronoff's analysis, has a morphomic status. Clearly in any analysis of these data, we must assume that the nine morphological processes outlined above have the property that they select for the third stem: this is true whether we believe, contra Aronoff, that one of the forms is systematically basic, or with Aronoff that there is no basis for such a belief. Thus any computational model of the formation of each of these nine deverbal forms must start from the assumption that there are two processes: one that maps a verb to its third stem form, and another that selects that third stem form, and (possibly) affixes additional material. For concreteness, let us consider the formation of agentive nominals in -or. Assume for the sake of argument that we wish to derive the third stem of a verb from its first stem – i.e., the one that we find in the present tense, or generally with the infinitive.9 For the verb scrībō "write", the first stem is scrīb-. For many verbs, the derivation of the third stem from the first stem is predictable, though a quick examination of Table 2.10 will suggest that this is not always the case. For first conjugation verbs (infinitive in -āre), third stems are regularly formed by suffixing -t to the first stem. Thus laudā- "praise" forms its third stem as laudāt-. Third conjugation verbs such as scrīb- frequently also suffix t (with regular devoicing of the /b/ to /p/): script-. We assume in any case that the third stem can be formed by a transducer that makes appropriate changes to the first stem and tags them with a marker >3st, where >3st is a diacritic symbol not in the segmental alphabet Σ. Call this transducer T. Morphological processes that select for third stem forms will then be specified so as to combine only with stems that are marked with >3st. Agentive -or, for example, will look as follows:

β = Σ* [>3st : ε] Σ* [ε : or]    (2.50)

Given a first stem F, we will then derive agentive forms as follows:

F ◦ T ◦ β    (2.51)

9 As we argued in Section 2.1, it is immaterial whether we relate derived forms to underlying stems or citation forms of words.
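The two-step derivation in (2.50)–(2.51) can be pictured with ordinary functions standing in for the transducers T and β, composition then being just function application in sequence. This is a schematic Python rendering only; the stems, the regular -t- formation rule, and the function names are our own simplifications (macrons are omitted), and real Latin has many lexically specific third stems (cf. Table 2.10).

def T(first_stem: str) -> str:
    """Map a first stem to its third stem and tag it with >3st.
    Only the regular -t- formation is sketched here."""
    if first_stem.endswith("b"):          # e.g. scrib- -> script- (devoicing b > p)
        stem = first_stem[:-1] + "pt"
    else:                                 # e.g. lauda- -> laudat-
        stem = first_stem + "t"
    return stem + ">3st"

def beta(third_stem: str) -> str:
    """Agentive -or: select a >3st-marked stem, delete the marker,
    and suffix -or (the analogue of (2.50))."""
    assert third_stem.endswith(">3st"), "agentive -or selects third stems only"
    return third_stem[: -len(">3st")] + "or"

# The analogue of (2.51): F composed with T and then with beta.
for first_stem in ["scrib", "lauda"]:
    print(beta(T(first_stem)))            # scriptor, laudator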
Formally, this is just an instance of prosodic selection of the kind we saw for Yowlumne in Section 2.2.3. In this analysis, then, morphomic elements are purely morphological in the sense that they serve as an intermediate step in the composition of a full form. In this regard they are no different from several other kinds of morphology that we have seen. 2.3 Paradigmatic Variation We have talked in this chapter about syntagmatic aspects of morphology, or in other words, how the pieces of morphology are put together. We have said nothing about another important aspect, namely how morphologically complex forms are related to one another. One kind of relationship is paradigmatic, and we turn now to a brief discussion of this. Traditional grammars of languages like Latin present inflected forms of words in a type of table called a paradigm. A paradigm is a (usually) two dimensional, though in principle n-dimensional (where n is the number of features expressed by the morphology), array where each cell in the array corresponds to a particular combination of features: for example, in Latin nouns, a combination of one case setting from the set {nominative, genitive, dative, accusative, ablative} and one number setting from the set {singular, plural}. A word form that occupies a given cell in a paradigm is understood as bearing the features associated with that cell. In languages of the type exemplified by highly inflected Indo-European languages, words of a particular part-of-speech may be divided into a number of different classes. Each of these classes, by and large, shares the same paradigm structure as the other classes, but the forms are different. A very typical example, involving the five major Latin noun declensions, is given in Table 2.11. The derivation of the different forms exemplified in Table 2.11 is at one level trivial in that the forms consist simply of stems that bear the meaning of the word and endings that mark the inflectional features. Thus the genitive plural of “barn” can be derived by taking the stem horre- and concatenating it with the suffix -¯orum. If this were all there were to it, then paradigms would have little status beyond being convenient ways of presenting grammatical data. But various scholars, including Matthews (1972), Carstairs (1984), Stump (2001), and others have argued that paradigms have more
Table 2.11 The five major Latin noun declensions

                 Singular    Plural

Declension 1, F, "woman"
  nominative     fēmina      fēminae
  genitive       fēminae     fēminārum
  dative         fēminae     fēminīs
  accusative     fēminam     fēminās
  ablative       fēminā      fēminīs

Declension 2, M, "ass"
  nominative     asinus      asinī
  genitive       asinī       asinōrum
  dative         asinō       asinīs
  accusative     asinum      asinōs
  ablative       asinō       asinīs

Declension 2, N, "barn"
  nominative     horreum     horrea
  genitive       horreī      horreōrum
  dative         horreō      horreīs
  accusative     horreum     horrea
  ablative       horreō      horreīs

Declension 3, F, "torch"
  nominative     fax         facēs
  genitive       facis       facum
  dative         facī        facibus
  accusative     facem       facēs
  ablative       face        facibus

Declension 3, N, "work"
  nominative     opus        opera
  genitive       operis      operum
  dative         operī       operibus
  accusative     opus        opera
  ablative       opere       operibus

Declension 4, F, "hand"
  nominative     manus       manūs
  genitive       manūs       manuum
  dative         manuī       manibus
  accusative     manum       manūs
  ablative       manū        manibus

Declension 5, F, "thing"
  nominative     rēs         rēs
  genitive       reī         rērum
  dative         reī         rēbus
  accusative     rem         rēs
  ablative       rē          rēbus
“F” = feminine, “M” = masculine, “N” = neuter.
significance than that and actually have a first-class status in any theory of morphology. Part of the motivation for this belief is the existence of regularities that transcend particular forms and seem to characterize the paradigms themselves. For example, a cursory examination of the paradigms in Table 2.11 will reveal that although the dative and ablative are distinct cases in that they typically have different forms in the singular, they are uniformly identical in the plural. This is true for all nouns in all genders and all declensions: no matter what suffix is used in the dative plural, the ablative plural will always be the same. A more circumscribed but nonetheless equally universal generalization is that the nominative and accusative forms of neuter nouns are always identical (and indeed, always marked by -a in the plural). Once again, whatever the form of the nominative, the accusative (for the same number) is always the same. These generalizations are distinct from, say, the identity in the first and second declension between the genitive singular and the nominative plural; this generalization does not extend beyond these two declensions, and does not even hold of neuter nouns of the second declension. Rather, the identity of dative and ablative plurals, and nominative and accusative neuters, are a general fact of the language, transcending any particularities of form: knowing these identities is part of what it means to know Latin. What is an appropriate computational characterization of these regularities? We can divide the problem into two components, the first of which relates morphosyntactic features to abstract morphomic features, and the second of which relates those forms to particular surface forms for a particular word class. We will illustrate this with a small set drawn from the first three declensions, which are the most widespread. The abstract representation can be derived by a set of rewrite rules, compiled as a transducer; see Table 2.12. The first two rules handle the fact that for any neuter noun, the nominative and accusative are identical. The rules third and second from the end handle the fact that the dative and ablative plurals are always the same and are gender independent, since the forms are identical in the first (mostly feminine) and second (masculine and neuter) declensions. The final rule handles the fact that the forms in the third declension are gender independent. The remaining rules map case/number combinations to abstract features. 10 10
Note that we could abstract away from gender further than in just the third declension, in that many of the endings of the first and second declensions are identical: thus the genitive plural is identical – viz. equ¯orum, femin¯arum – once we discount the stem vowel.
Table 2.12 Rewrite rules mapping concrete morphosyntactic features to an abstract morphomic representation

  neut nom∪acc sg   →  NeutNASg
  neut nom∪acc pl   →  NeutNAPl
  nom sg            →  NomSg
  acc sg            →  AccSg
  gen sg            →  GenSg
  dat sg            →  DatSg
  abl sg            →  AblSg
  nom pl            →  NomPl
  acc pl            →  AccPl
  gen pl            →  GenPl
  Gender dat pl     →  DatAblPl
  Gender abl pl     →  DatAblPl
  Gender            →  ε / III ___

III denotes third declension.
A second transducer maps from the abstract endings to actual forms. We can express this in terms of affixes that select for stems with particular abstract morphomic features, as in Table 2.13. Note that this description is a fragment; in particular it does not account for first declension masculine nouns (nauta "sailor") or second declension nouns in -r (puer "boy"), and it assumes that all non-neuter third declension nouns end in -s in the nominative singular which, on the surface at least, is not true. Call the transducer defined in Table 2.12 α and the transducer defined in Table 2.13 σ. Then given a set of bases B annotated with morphosyntactic features, we can define the set of inflected forms of B as follows:

B ◦ α ◦ σ    (2.52)

The above analysis accords with a number of treatments of paradigmatic variation in that it assumes an abstract level at which morphomic features like DatAblPl have a first-class status. On the other hand, as with the treatment of Arabic root-and-pattern morphology in Section 2.2.9, we can also use the observation that composition is associative to precompile α and σ into a new transducer σ′ = α ◦ σ. This new σ′ will now produce endings based on the surface morphosyntactic features of B; the abstraction α is now hidden in the combined transducer. We will develop this point much further in Chapter 3.
Table 2.13 Fragment for Latin nominal endings

  Σ* [I-II DatAblPl : īs]
  Σ* [NeutNAPl : a]
  Σ* [I-II NeutNASg : um]
  Σ* [I-II fem AblSg : ā]
  Σ* [I-II fem AccPl : ās]
  Σ* [I-II fem AccSg : am]
  Σ* [I-II fem DatSg : ae]
  Σ* [I-II fem GenPl : ārum]
  Σ* [I-II fem GenSg : ae]
  Σ* [I-II fem NomPl : ae]
  Σ* [I-II fem NomSg : a]
  Σ* [I-II mas AblSg : ō]
  Σ* [I-II mas AccPl : ōs]
  Σ* [I-II mas AccSg : um]
  Σ* [I-II mas DatSg : ō]
  Σ* [I-II mas GenPl : ōrum]
  Σ* [I-II mas GenSg : ī]
  Σ* [I-II mas NomPl : ī]
  Σ* [I-II mas NomSg : us]
  Σ* [I-II neut AblSg : ō]
  Σ* [I-II neut DatSg : ō]
  Σ* [I-II neut GenPl : ōrum]
  Σ* [I-II neut GenSg : ī]
  Σ* [III AblSg : e]
  Σ* [III AccPl : ēs]
  Σ* [III AccSg : em]
  Σ* [III DatAblPl : ibus]
  Σ* [III DatSg : ī]
  Σ* [III GenPl : um]
  Σ* [III GenSg : is]
  Σ* [III NeutNASg : ε]
  Σ* [III NomPl : ēs]
  Σ* [III NomSg : s]

I-II denotes nouns of declensions I and II.
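A dictionary-based rendering of the two transducers of Tables 2.12 and 2.13 (α and σ above) makes the two-level organization, and its precompilation, concrete: the first map takes morphosyntactic features to morphomic labels, the second takes class-plus-morphomic labels to endings, and "precomposing" them is just chaining the two lookups into one table. This is a schematic Python sketch restricted to a few cells; the key encoding is our own, not the lextools notation, and macrons are omitted.

# alpha: morphosyntactic features -> abstract morphomic label (cf. Table 2.12)
ALPHA = {
    ("neut", "nom", "pl"): "NeutNAPl",
    ("neut", "acc", "pl"): "NeutNAPl",
    ("fem",  "dat", "pl"): "DatAblPl",
    ("fem",  "abl", "pl"): "DatAblPl",
}

# sigma: declension class + morphomic label -> ending (cf. Table 2.13)
SIGMA = {
    ("I-II", "NeutNAPl"): "a",
    ("I-II", "DatAblPl"): "is",    # properly -īs
    ("III",  "DatAblPl"): "ibus",
}

def inflect(stem, decl, gender, case, number):
    """The analogue of B composed with alpha and sigma, one cell at a time."""
    return stem + SIGMA[(decl, ALPHA[(gender, case, number)])]

# Precompiling alpha and sigma into a single map, the analogue of sigma-prime:
SIGMA_PRIME = {
    (decl, feats): ending
    for feats, label in ALPHA.items()
    for (decl, lab), ending in SIGMA.items()
    if lab == label
}

print(inflect("femin", "I-II", "fem", "abl", "pl"))    # feminis (fēminīs)
print(SIGMA_PRIME[("III", ("fem", "dat", "pl"))])      # ibus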
2.4 The Remaining Problem: Reduplication The one apparent significant exception to the generalization that all morphological operations can be cast in terms of composition is reduplication. The basic reason why reduplication is problematic is that it involves copying, and finite-state devices do not handle unbounded copying. A key point is contained in this last statement: finite-state devices are in principle capable, albeit inelegantly and non-compactly,
Table 2.14 Gothic Class VII preterites after Wright (1910)

  Infinitive     Gloss              Preterite
  falþan         "fold"             faífalþ
  haldan         "hold"             haíhald
  ga-staldan     "possess"          ga-staístald
  af-áikan       "deny"             af-aíáik
  máitan         "cut"              maímáit
  skáidan        "divide"           skaískáiþ
  slēpan         "sleep"            saíslēp
  grētan         "greet"            gaígrōt
  ga-rēdan       "reflect upon"     ga-raírōþ
  tēkan          "touch"            taítōk
  saian          "sow"              saísō
of dealing with any bounded copying. Basically we need as many paths as we have strings that are in the domain of the copying operation. For example, consider the reduplication found in the past tense of Class VII verbs in Gothic in Table 2.14 (Wright, 1910). In forming the preterite, some verbs undergo stem changes, as in the alternation between grēt and grōt, in addition to reduplication; prefixes such as ga- or af- are not part of the stem and are prefixed outside the reduplicated stem. Putting these issues to one side, the rule for reduplication itself is simple:

• Prefix a syllable of the form (A)Caí to the stem, where C is a consonant position and (A) an optional appendix position.
• Copy the onset of the stem, if there is one, to the C position. If there is a pre-onset appendix /s/ (i.e., /s/ before an obstruent such as /p,t,k/), copy this to the (A) position.

A fragment of a transducer that introduces the appropriate reduplicated prefix for Gothic Class VII preterite reduplication is shown in Figure 2.7. The model is naive in that it simply remembers what was inserted, and then imposes the requirement that the base match appropriately. Thus there are as many paths as there are distinct reduplicated onsets. Given that Gothic reduplication is limited, this is not an overly burdensome model.11 It is, however, clearly inelegant.

11 Indeed, Gothic reduplication is even more limited than we have implied since it only applies to Class VII verbs, and this is a closed class.
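The prefixation rule just stated can be expressed directly as a string function; the transducer of Figure 2.7 in effect precompiles the same behavior into a separate path for each possible onset. A minimal Python sketch follows, in plain ASCII with "ai" standing for the vowel aí; the segment classes are our own simplification.

VOWELS = set("aeiou")
OBSTRUENTS = set("ptk")

def class7_preterite_prefix(stem: str) -> str:
    """Prefix (A)C + ai to the stem: copy an s-appendix (if the stem begins
    with s + obstruent) and the onset consonant; vowel-initial stems get bare ai."""
    prefix, i = "", 0
    if stem.startswith("s") and len(stem) > 1 and stem[1] in OBSTRUENTS:
        prefix += "s"                 # optional appendix position (A)
        i = 1
    if i < len(stem) and stem[i] not in VOWELS:
        prefix += stem[i]             # onset consonant fills the C position
    return prefix + "ai" + stem

# ASCII approximations of hald, skáiþ, slēp, áik:
for stem in ["hald", "skaith", "slep", "aik"]:
    print(class7_preterite_prefix(stem))
# haihald, skaiskaith, saislep, aiaik (cf. haíhald, skaískáiþ, saíslēp, aíáik)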
Figure 2.7 A transducer for Gothic Class VII preterite reduplication. Here ai represents the vowel aí, and V represents all other vowels
Inelegance converts into impossibility when we turn to unbounded reduplication, a well-known example of which is Bambara noun reduplication, discussed in Culy (1985). The construction takes a noun X and forms a construction X-o-X, where -o- is a semantically empty marker, and the meaning of the whole construction is "whichever X"; see Table 2.15. The key issue, as shown in these examples, is that X is in principle unbounded. So, compounds such as wulu-nyinina "dog searcher" or malo-nyinina-filèla "rice searcher watcher" can readily serve as input to the process.

Table 2.15 Unbounded reduplication in Bambara after Culy (1985)

  wulu                    o        wulu                    "whichever dog"
  dog                     MARKER   dog

  wulu-nyinina            o        wulu-nyinina            "whichever dog searcher"
  dog searcher            MARKER   dog searcher

  malo-nyinina-filèla     o        malo-nyinina-filèla     "whichever rice searcher watcher"
  rice searcher watcher   MARKER   rice searcher watcher
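Producing the construction is trivial if copying is available as a primitive, and recognizing it is trivial if unbounded memory is available; it is exactly these that finite-state devices lack. A small Python sketch (accents omitted; the assumption that the first occurrence of -o- is the marker is a simplification of our own):

def whichever(noun: str) -> str:
    """Form the X-o-X construction meaning 'whichever X'."""
    return f"{noun}-o-{noun}"

def is_whichever(form: str) -> bool:
    """Check that the two halves match. Because X is unbounded, the set of
    well-formed X-o-X strings is not a regular language, so no finite-state
    acceptor can perform this check."""
    left, sep, right = form.partition("-o-")
    return bool(sep) and left == right

print(whichever("malo-nyinina-filela"))
assert is_whichever("wulu-o-wulu")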
Figure 2.8 Schematic form for a transducer that maps a Gothic stem into a form that is indexed for copy checking. Here V represents any vowel, ai is the vowel /aí/, s is /s/, O is an obstruent stop, C is any other consonant, and X stands for any consonant in the reduplicated portion, which has to be checked for identity per the introduced indices
If the input to the reduplication process is unbounded, then simply precompiling out all of the copies as we did in the case of Gothic is not feasible. To handle bounded reduplication elegantly, and unbounded reduplication at all, we have to add an additional memory device that can remember what has been seen already in the copy, and then match that to what comes afterwards in the base. (In the case of suffixing reduplication it is the copy that follows the base, but the mechanism is the same.) It is useful to think of reduplication in terms of two separate components. The first component models the prosodic constraints, either on the base, the reduplicated portion, or both. In the case of Gothic, for example, the constraint is that the reduplicated portion is of the form (A)Caí, where C is a consonant and A is an optional preonset appendix. The second component – the copying component – verifies that the prefix appropriately matches the base. It is useful to separate these two components since the prosodic check can be handled by purely finite-state operations, and it is only the copying component that requires special treatment. Breaking down the problem as just described, we can implement Gothic reduplication as follows. First, assume a transducer R, which when composed with a base β, produces a prefixed version of β, and in addition to the prefix, adds indices to elements in the prefix and base that must match each other:

α = β ◦ R = (A₁)C₂aí β′    (2.53)

Here, β′ is an appropriately indexed version of β. The transducer R is shown schematically in Figure 2.8. For example, if we have the stem skáiþ "divide" and compose it with the transducer R it will produce the output X₁X₂aí s₁k₂áiþ, where X here ranges over possible segments.
A number of approaches could be taken to check matches of the indexed arcs. One way is to impose a set of finite-state filters, one for each index. For example the following filter imposes the constraint that arcs indexed with "1" must be identical; it consists of a union of paths where for each indexed segment s₁, there must be a matching indexed segment. The overbar represents complementation and we use s̄ᵢ as shorthand for any single segment that is not sᵢ.

∪_{s ∈ segments} [ s̄₁* s₁ s̄₁* s₁ s̄₁* ]    (2.54)

For a finite number of indices – a viable assumption for bounded reduplication – we can build a filter that handles all matches as follows:

∩_{i ∈ indices} ∪_{s ∈ segments} [ s̄ᵢ* sᵢ s̄ᵢ* sᵢ s̄ᵢ* ]    (2.55)
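The effect of the filters in (2.54)–(2.55) can be simulated procedurally by scanning the indexed intermediate form produced by R and verifying that all symbols bearing the same index carry the same segment. The Python sketch below uses a tuple encoding of indexed symbols that is our own; a real finite-state implementation would instead intersect the filter automata with the output of R.

def check_indices(indexed):
    """indexed is a sequence of (segment, index) pairs, with index=None for
    unindexed material. All symbols sharing an index must be the same segment."""
    seen = {}
    for seg, idx in indexed:
        if idx is None:
            continue
        if idx in seen and seen[idx] != seg:
            return False
        seen[idx] = seg
    return True

# s1 k2 ai s1 k2 ai th  --  a licensed resolution of X1 X2 ai s1 k2 aith
good = [("s", 1), ("k", 2), ("ai", None), ("s", 1), ("k", 2), ("ai", None), ("th", None)]
# f1 ai s1 k2 ai th  --  prefix does not match the base, filtered out
bad = [("f", 1), ("ai", None), ("s", 1), ("k", 2), ("ai", None), ("th", None)]

assert check_indices(good) and not check_indices(bad)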
There have been various proposals in the literature for handling reduplication that are formally equivalent to the mechanisms we have just sketched. For example, Walther (2000a; 2000b) adds two new types of arcs to finite-state automata to handle reduplication:
• Repeat arcs, which allow moving backwards within a string, allowing for repetition of part of the string;
• Skip arcs, which allow us to skip forward over portions of the string.

A more general solution that handles reduplication and other types of non-concatenative morphology is finite-state registered automata (FSRAs), introduced in Cohen-Sygal and Wintner (2006). FSRAs extend finite-state automata with the addition of finite registers; note that since the registers are finite, the resulting machine is computationally equivalent to a potentially much larger FSA. The transition function of the automaton is extended to include operations that read or write designated elements of the register, and, depending upon the success or failure of those operations, allow the machine to move to a designated next state. Cohen-Sygal and Wintner demonstrate a substantial reduction in machine size for non-concatenative morphology such as Hebrew templatic morphology over a naive finite-state implementation, such as the one that we described in Section 2.2.9.12 For reduplication they introduce a particular subclass of FSRA – FSRA* – where
Gains in size are offset, in general, by losses in runtime performance, however.
register-write operations record what was seen in the copy, and registerread operations verify subsequently that the base matches. 13 There are two limitations on Cohen-Sygal and Wintner’s approach. First, FSRA∗ s are defined for finite-length copies, which covers nearly all cases of reduplication but fails to cover the Bambara case described above. The second limitation is that the register-read operations presume exact matches between the copy and the base. While this is frequently the case, there are also many cases where the copied material undergoes modification due to phonological rules or for other reasons. A number of such cases are discussed in Inkelas and Zoll (1999, 2005). A simple example is Dakota kiˇcaxˇca„a “he made it for them quickly”, where the second syllable of the base kiˇcax is reduplicated, but shows up as ˇca„ for phonetic reasons. 14 But more dramatic examples can be found, such as in the case of Sye. In Sye, verb stems have, in addition to their basic form, a modified form that shows up in a variety of environments. Some examples, from Inkelas and Zoll (1999) are given in Table 2.16. In verbal reduplication, which has an intensifying meaning in Sye, we find two copies of the entire stem, but if the environment requires a modified stem, only the first copy shows up in the modified form. Thus, from omol, we find cw-amol-omol (3pl.fut-fallmod -fallbas ) “they will fall all over”. Another example comes from Kinande where a variety of constraints conspire to yield situations in which the copy in a reduplicative construction does not match exactly what follows. One constraint is that the copy – a prefix – must be disyllabic. This is straightforwardly met in an example like huma+huma “cultivate here and there” from huma “cultivate”, where the latter is itself morphologically complex, consisting of a base hum and a final vowel (FV) marker -a. With a base like gend-esy-a (go-cause-fv) “make go”, we find gend-a+gend-esy-a, where the stem gend in the copy is augmented with the FV -a to make it disyllabic. However, we can also find an alternative form involving a shortened causative morpheme -y (cf. the full form -esy): gend-ya+gend-esy-a, where again the copy is augmented with -a to make it 13 Dale Gerdemann, p.c., argues that the XFST extension of flag diacritics (Beesley and Karttunen, 2003) is more effective than FSRAs. Flag diacritics amount to a form of “lazy composition”, where we can limit the size of the automaton, and check correspondences between portions of the input at runtime. 14 Note that this particular feature was not what interested Inkelas and Zoll in this particular example, but rather the palatalization of the onset of the second syllable of the base from /k/ (kikax) to /ˇc/ (kiˇcax) in both the stem and the copy.
Table 2.16 Basic and modified stems in Sye

  Basic     Modified    Gloss
  evcah     ampcah      "defecate"
  evinte    avinte      "look after"
  evsor     amsor       "wake up"
  evtit     avtit       "meet"
  ocep      agkep       "fly"
  ochi      aghi        "see it"
  omol      amol        "fall"
  oruc      anduc       "bathe"
  ovoli     ampoli      "turn it"
  ovyu-     avyu-       (causative prefix)
  owi       awi         "leave"
  pat       ampat       "blocked"
  vag       ampag       "eat"
Source: Inkelas and Zoll (1999)
disyllabic. Note that in neither alternative is the prefixed material identical to what follows. Another constraint in Kinande is a Morpheme Integrity constraint that states that no morpheme may be truncated in the copy; thus hum-ir-e (cultivate-applicative-subj) is reduplicated as hum-a+hum-ir-e. The form hum-i+hum-ir-e is ruled out since this form would result in a violation of the Morpheme Integrity constraint because the integrity of -ir would be violated. Following Cohen-Sygal and Wintner’s approach, we could of course design a machine that required identity on only some of the elements between the reduplicant and the stem. But such machines would have to be quite lexically specific in their application. In Sye, the modified stem is only partly and unpredictably related to the basic stem, so we would have to construct our copy machines on a stem by stem basis. Similarly, in Kinande, with cases like gendya+gendesya involving a truncated version of the causative morpheme in the copy, we would have to have the reduplication machine be sensitive to the particular stem involved: as it turns out, not all causative verbs allow the truncated alternate, and in those cases where this alternate is not available, a reduplicated form like gendya+gendesya is not possible. While we could certainly develop an analysis à la Cohen-Sygal and Wintner along these lines, it would somewhat defeat the purpose since if we require a set of lexical specifications on how the copy machines operate, then we may as well simply precompile out the reduplicated forms beforehand.
Interestingly, Inkelas and Zoll’s point in discussing such examples is to argue for an alternative theory of reduplication to the “Correspondence Theory” that has dominated nearly all work on reduplication in theoretical morphology – and all work in computational morphology. Under their analysis, reduplication does not involve phonological copying at all, but rather is the result of morphological doubling, effectively meaning that we have two copies of the same (possibly complex) lexical form. Phonological similarity is thus expected in many cases (we are dealing with two copies of the same basic form) but not required (since different environments can require different actual forms to surface). Put in another way, under Inkelas and Zoll’s Morphological Doubling theory, there are constraints on how each copy is spelled out phonologically, but no constraints relating the reduplicated phonological form to the base, as we find in Correspondence Theory. This analysis has an interesting computational implication: if it is correct, it effectively removes the only exception to the generalization that formal morphological operations can be handled by purely finite-state mechanisms. If there is no phonological copying, there is no need to resort to special mechanisms to extend the finite-state apparatus beyond what is needed for the remainder of morphological operations. To see this, consider again the case of Gothic preterite reduplication. Under Morphological Doubling theory, each component of a form like haíhald “held” is generated separately. In each case, we can model the mapping – either from hald to haí or from hald to itself – as a regular relation. It is up to the morphosyntax to validate the form as involving two identical morphosyntactic feature bundles. See Figure 2.9.
2.5 Summary

Composition of regular relations is the single most general computational operation that can handle the formal devices found in natural language morphology. The only exception to this is reduplication, which seems, at least for an elegant account, to require a nonregular operation, namely copying. However, even reduplication can be reduced to purely regular mechanisms if Inkelas and Zoll's Morphological Doubling theory is correct. The fact that composition is so general has interesting implications for theoretical views of morphology. In particular, morphologists have argued for decades about whether morphology should primarily be viewed as the construction of words out of small lexical pieces – morphemes – or whether instead it is better viewed as involving the modification of stems or roots via rules. The argument over "Item-and-Arrangement" versus "Item-and-Process" approaches has effectively divided the field. In the next chapter we shall argue that at least from a computational point of view, there is really very little difference between the two approaches.
3 The Relevance of Computational Issues for Morphological Theory

3.1 Introduction: Realizational versus Incremental Morphology

For many decades a central debate in the theory of morphology has been whether morphology is best thought of in terms of constructing a complex form out of small pieces, usually called morphemes, which are roughly on a par with each other; or, alternatively, as involving base forms (stems, or roots) modified by rules. Hockett (1954) termed these two approaches "item-and-arrangement" versus "item-and-process". Item-and-process theories are older; they are the theories that were implicitly assumed by, for example, traditional grammatical descriptions. In a traditional grammar of Latin, for example, one would be presented with paradigms listing the various inflected forms of words and their functions, as well as rules for deriving the inflected forms. But item-and-process approaches have also characterized much work in recent theories of morphology, including Matthews (1972), Aronoff (1976), Anderson (1992), Aronoff again (1994), and most recently Stump (2001). Item-and-arrangement theories are particularly associated with the American structuralists, but they have been adopted by at least some generative morphologists including Lieber (1980), Sproat (1985), and Lieber again (1992). To some extent these different approaches were motivated by the properties of different languages. For traditional grammarians the core languages of interest were highly inflected Indo-European languages such as Latin, Classical Greek, and Sanskrit. For these languages, it seems to make sense to think in terms of morphological
rules rather than morphemes. For one thing, it is typical in such languages that many morphosyntactic features are expressed in a single affix, so that in Latin, for instance, the verbal ending -¯o does not only represent 1st singular but in fact something more like 1st singular present active indicative; change any one of those features and the form changes. Second, one often finds the kind of morphomic stem alternations that we have already discussed in the previous chapter, so that it is typically not an issue of merely affixing a particular affix to a stem, but rather affixing to a particular variant of the stem. Finally, the affixes themselves may change, depending upon the particular paradigm the base word belongs to. So first and second declension Latin nominals have the suffix -¯ıs for the dative/ablative plural; but third declension nominals use -ibus to encode the same morphosyntactic features. So all around it seems more obvious for these kinds of languages to think in terms of rules that introduce affixes according to whatever features one wishes to encode, rather than assuming separate morphemes that encode the features. In addition to being sensitive to morphosyntactic features, the rules can also be sensitive to features such as the paradigm affiliation of the word, and furthermore they can effect the particular stem form required for the affix in question. In contrast, if one looks at agglutinative languages like Finnish, one finds that morphosyntactic features are encoded fairly systematically by individual morphemes that are arranged in particular linear orders. Consider, for example, the partial paradigm for Finnish listed in Table 3.1. 1 As will be readily seen, each word form consists of either one, two or three pieces: the base talo, a possible affix marking number, and a possible case affix. The case affixes are consistent across singular and plural, so that, for instance, nominative is always marked with -∅, elative with -sta and genitive with -(e)n. With a single exception, the alternations in form of the affixes are due to phonological rules, as in -en versus -n for the genitive, and the spelling of the plural affix -i as a glide -j in intervocalic position. 2 The single exception is the appearance of the plural affix as -t in the nominative plural. 1 Adapted from Andrew Spencer’s course notes for a morphology course at Essex, L372, available at http://privatewww.essex.ac.uk/∼spena/372/372_ch7.pdf. Curiously, Spencer uses the Finnish example to illustrate the difficulties of morpheme-based approaches to inflection. 2 Some of the suffixes do change form in another way, namely due to vowel harmony. This is, however, a completely regular phonological process in Finnish, and does not target particular morphological operations.
Table 3.1 Partial paradigm of Finnish noun talo "house", after Spencer

                 sg          pl
  Nominative     talo        talo-t
  Genitive       talo-n      talo-j-en
  Partitive      talo-a      talo-j-a
  Inessive       talo-ssa    talo-i-ssa
  Elative        talo-sta    talo-i-sta
  Adessive       talo-lla    talo-i-lla
  Ablative       talo-lta    talo-i-lta
  Allative       talo-lle    talo-i-lle
With this one exception, the system looks as if it is most straightforwardly accounted for by assuming that there are in each case three morphemes, namely the stem morpheme which introduces the basic semantics of the word (“house”, or whatever) and the number and case affixes. As is typical with such morphemic approaches, one would assume empty morphemes in the case of singular number and nominative case. The number affix introduces the morphosyntactic feature singular or plural; the case affix introduces the appropriate case feature. The representation of the number affixes would be encoded so as to select for the base stem (i.e., without a case ending). Similarly, the case affixes would be marked so as to select for forms that are already marked for number. One could in principle treat the alternation in the plural morpheme (-i/-j vs. -t) as a case of selection: one would have two allomorphs of the plural affix, with the -t variant being selected by the nominative affix, and the -i/-j variant by all others. The item-and-arrangement versus item-and-process debate has often been cast in terms of this simple binary choice, with possible sub-choices under each main choice. But more recently Stump (2001) has presented a two-dimensional classification. The first dimension is lexical versus inferential. Lexical theories are those theories in which all morphemes are given lexical entries, so that the third singular verbal affix -s in English has an entry associated with the features 3rd singular, present, and indicative. In contrast, inferential theories posit rules which are sensitive to such morphosyntactic features; a form like likes is inferred from like due to a rule that associates a suffix -s with a particular set of morphosyntactic features (3rd singular).
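The morpheme-based (item-and-arrangement) view of this fragment can be rendered almost directly as concatenation of three pieces, with zero morphemes for singular and nominative and a -t allomorph of the plural selected by the nominative. The Python sketch below covers only the cells of Table 3.1; vowel harmony and the rest of the phonology are ignored, and the case labels and dictionary layout are our own.

CASES = {"nom": "", "gen": "n", "part": "a", "iness": "ssa",
         "elat": "sta", "adess": "lla", "ablat": "lta", "allat": "lle"}

def number_affix(number: str, case: str) -> str:
    if number == "sg":
        return ""                 # zero morpheme for singular
    if case == "nom":
        return "t"                # -t allomorph selected by the nominative
    return "j" if case in {"gen", "part"} else "i"   # -j before a vowel

def inflect(stem: str, number: str, case: str) -> str:
    ending = CASES[case]
    if number == "pl" and case == "gen":
        ending = "en"             # talo-j-en
    return stem + number_affix(number, case) + ending

print(inflect("talo", "sg", "elat"))    # talosta
print(inflect("talo", "pl", "nom"))     # talot
print(inflect("talo", "pl", "gen"))     # talojen
print(inflect("talo", "pl", "iness"))   # taloissa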
Table 3.2 The four logically possible morphological theory types, after Stump (2001)

                  INCREMENTAL               REALIZATIONAL
  Lexical         Lieber                    Distributed Morphology
  Inferential     Articulated Morphology    Matthews, Anderson, Stump
The second dimension is incremental versus realizational. Incremental theories assume that rules or morphemes always add information when they are applied. Thus likes has its meaning by virtue of the addition of -s to like. Under realizational theories, in contrast, the introduction of form is licensed by particular morphosyntactic features. All four logically possible combinations of these contrasts are found; see Table 3.2. Thus a classic lexical–incremental theory is that of Lieber (1992); in that theory, affixes are lexical entries with features just like roots and stems. Combinations of affixes are controlled by subcategorization and features are combined by percolation. 3 A lexical–realizational theory is that of Halle and Marantz (1993). Here, morphosyntactic features are combined in the syntax, but the expression of these features in the morphology is controlled by insertion rules that pick affixes that are compatible (given their lexical entries) with the syntax-derived features. Inferential–incremental theories are exemplified by Articulated Morphology (Steele, 1995), wherein affixal material is introduced by rule, rather than affixes having separate lexical entries. Here, rules not only introduce the phonological changes to the word, but also effect changes in the morphosyntactic feature matrices. Finally, inferential– realizational theories include those of Matthews (1972), Anderson (1992), Aronoff (1994), and Stump himself. Needless to say, Stump’s major objective is to argue that the facts of morphology dictate an inferential–realizational theory rather than any of the other three logical possibilities. In what follows we will consider only the two extremes of Stump’s four possible theories, namely lexical–incremental versus inferential– realizational, and argue that at least at a computational level, there is no difference between the two approaches. To the extent that this argument
3 In more sophisticated lexical–incremental theories, feature combination would involve unification of attribute-value matrices.
is convincing to the reader, it will follow that the other two types of theories are also amenable to the same reduction. Before proceeding, we note that Karttunen (2003) reached much the same conclusion as we will reach here. In that paper, he argued that Stump’s theory was in fact reducible to finite-state operations. From this observation it follows straightforwardly that Stump’s theory is formally equivalent to any other morphological models that can be reduced to finite-state mechanisms. 3.2 Stump’s Theory Stump’s inferential–realizational theory, which he calls Paradigm Function Morphology, is arguably the most carefully articulated theory of morphology to have appeared in recent years, and so we shall use his theory as the example of recent theoretical work in morphology. Stump’s main goal, as we have noted already, is to argue that the data from morphology support an inferential–realizational theory of the kind that he presents. On the other hand, when one considers this issue from a computational point of view – i.e., from the point of view of someone who is trying to build a system that actually implements morphological alternations – the distinctions that Stump draws between his approach and a lexical–incremental theory are less clear. But we are getting ahead of ourselves. Stump presents two fundamental arguments in favor of realizational over incremental approaches to inflectional morphology. First, inflectional morphology often exhibits extended exponence, by which is meant that a given morphosyntactic feature bundle may be spelled out by material scattered across the word. An example is Breton double noun plurals such as bagoùigoù “little boats” (sg. bagig), where the boldface suffix -où occurs twice, once after the stem bag and once after the diminutive suffix -ig. This is, on the face of it, problematic for incremental theories since the expectation is that a single morphosyntactic feature bundle should be expressed by a single morphological object. In contrast, realizational theories make weaker claims in this regard, and thus are fully compatible with extended exponence. A second argument for realizational theories over incremental ones comes from cases where there is no overt marker for a particular morphosyntactic feature bundle, and the form of the word thus underdetermines the actual morphosyntactic features of that word. Stump gives the example of Bulgarian verb inflection where the first singular form
in both the imperfect and aorist consists of a verb stem, followed by a vowel indicating the aspect/tense, followed by a stem-forming suffix; thus krad-’á-x (“steal” imperfect 1sg) and krád-o-x (“steal” aorist 1sg) consist of the stem krad, the thematic vowel (’a, o) and the stemforming suffix -x. Crucially there is no overt mark for first person singular. The only way to handle this case under an incremental theory is to assume a zero morpheme. Again, such phenomena are fully compatible with the weaker assumptions of realizational theories. In a similar vein, cases such as the Breton example given above would seem to mitigate against lexical theories, whereas they are fully compatible with inferential theories. As Stump notes, in a lexical theory such as Lieber’s (1980), a morphosyntactic feature bundle should be traceable to one and only one lexical item, consisting of a single affix. In contrast, in an inferential theory, since the only requirement is that a rule be sensitive to a particular set of morphosyntactic features, one expects to find cases where a given set triggers more than one rule: since morphosyntactic features are a property of a given word and are always present, there is nothing to stop more than one rule firing – or, as in the case of Breton, the same rule firing more than once – at different stages of the construction of a word form. Thus, it seems that there are large differences between lexical– incremental theories on the one hand, and inferential-realizational theories on the other, and furthermore there is evidence favoring the latter types of theory over the former. It is therefore perhaps surprising that at least from a computational point of view, there turns out to be essentially no difference between the two types of theory. We turn directly to a discussion of this topic.
3.3 Computational Implementation of Fragments

In what follows, we present a computational analysis of data discussed in Stump, with a view to showing that, at least from a computational point of view, there is no fundamental difference between lexical–incremental theories and inferential–realizational theories. In each case the example is a fragment of the morphology in question: we do not attempt to offer as complete a treatment as Stump (although note that his descriptions are fragments also). However, we believe that we have provided proof-of-concept implementations of the selected phenomena.
3.3.1 Stem Alternations in Sanskrit In highly inflected languages, it is common to find alternations in the shape of stems depending upon the particular position in a paradigm. Often these alternations are quite systematic, in that words might have, say, two distinct stems, and each of these two stems will be associated with particular paradigm slots. For example, one might find that Stem I is used in the nominative singular, accusative singular and genitive plural; and that Stem II is used in all other cases. This pattern will likely be found for any lexeme that belongs to the inflectional class in question. While the stem alternations will be predictable at an abstract level, it will frequently not be the case that the form of the alternation is predictable in any general way. Thus, different lexemes may have different mechanisms for forming different stems so that if one knows how a lexeme Îi forms Stem I and Stem II, it will not necessarily tell us how Î j forms those two stems. The alternations are thus morphomic, in Aronoff ’s (1994) sense. To complicate matters further, a language may have a fixed set of mechanisms for forming stem alternants, but for a particular alternation for a particular paradigm, the mechanisms may distribute differently across different lexemes. Stump discusses a case of this kind from Sanskrit. Tables 3.3 and 3.4 give a flavor of the issues. Table 3.3 shows the masculine declension of the adjective bhagavant “fortunate”. This adjective has two stems, a strong stem, which is used in the nominative and accusative, singular and dual, as well as in the nominative plural; and a middle stem, which is used elsewhere. Some forms have three stems. So Table 3.4 shows the masculine declension of tasthivans. The strong stem for tasthivans is distributed Table 3.3 Stem alternation in the masculine paradigm of Sanskrit bhagavant “fortunate”, with strong stem bhágavant and middle stem bhágavat
           sg              du                pl
  nom      bhágavān        bhágavant-āu      bhágavant-as
  acc      bhágavant-am    bhágavant-āu      bhágavat-as
  instr    bhágavat-ā      bhágavad-bhyām    bhágavad-bhis
  dat      bhágavat-e      bhágavad-bhyām    bhágavad-bhyas
  abl      bhágavat-as     bhágavad-bhyām    bhágavad-bhyas
  gen      bhágavat-as     bhágavat-os       bhágavat-ām
  loc      bhágavat-i      bhágavat-os       bhágavat-su
(= Stump’s Table 6.1, page 170).
Table 3.4 Stem alternation in the masculine paradigm of Sanskrit tasthivans "having stood", with strong stem tasthivā́ṃs, middle stem tasthivát, and weakest stem tasthúṣ

           sg               du                 pl
  nom      tasthivā́n        tasthivā́ṃs-āu      tasthivā́ṃs-as
  acc      tasthivā́ṃs-am    tasthivā́ṃs-āu      tasthúṣ-as
  instr    tasthúṣ-ā        tasthivád-bhyām    tasthivád-bhis
  dat      tasthúṣ-e        tasthivád-bhyām    tasthivád-bhyas
  abl      tasthúṣ-as       tasthivád-bhyām    tasthivád-bhyas
  gen      tasthúṣ-as       tasthúṣ-os         tasthúṣ-ām
  loc      tasthúṣ-i        tasthúṣ-os         tasthivát-su
(= Stump’s Table 6.3, page 174).
the same way as the strong stem for bhagavant, but in addition this lexeme has both a middle and weakest stem. The weakest stem is used throughout the singular (except for the nominative and accusative), in the dual genitive and locative and in the plural accusative and genitive. Cross-cutting these stem alternations are another class of alternations, namely the grade alternations of vr.ddhi, gun.a and zero grade. For the stem p¯ad- “foot”, for example, the vr.ddhi form is p¯ad-, the gun.a is pad- and the zero is pd-. The stem alternations cannot be reduced to these grade alternations. In particular the strong stem may be either vr.ddhi or gun.a, or in some cases a stem form that is not one of the standard grades, depending upon the lexeme. Similarly, the middle stem may be either zero, or one of a number of lexeme or lexemeclass specific stems, again depending upon the lexeme. Finally, for those lexemes that have a weakest stem, this may occasionally be a zero stem, but it is more often one of a number of lexeme or lexeme-class specific stems. For bhagavant the strong stem is gun.a grade and the middle stem is zero grade. For tasthivans the strong stem is vr.ddhi grade, and the middle and weakest stems are derived by mechanisms that are particular to the class to which tasthivans belongs. Stump uses these facts to argue for the Indexing Autonomy Hypothesis, whereby a stem’s index, which indicates which particular stem (e.g., strong, middle, weakest) will be used, is independent of the form assigned to that stem. Stem alternants are thus highly abstract objects. They seem to serve no particular function other than a purely morphological one. Neither can they in general be reduced to other alternations such as the vr.ddhi/gun.a/zero grade alternation. They are, in Aronoff ’s terms, purely morphomic.
In Figures 3.1–3.5 we give a lextools implementation of a tiny fragment of Sanskrit, following Stump’s description. Included are the following basic components: 4
• Symbol file: Lextools file of symbols (Fig. 3.1)
• DummyLexicon: Template for lexical forms (Fig. 3.2)
• Features: Rules introducing predictable morphosyntactic feature sets given the specification of the lexical class of the lexeme (Fig. 3.2)
• Endings: Realizational rules that spell out endings appropriate to the features (Fig. 3.2)
• Referral plus Filter: Rules of referral to handle syncretic forms, plus a filter to filter out forms that did not undergo the rules of referral (Fig. 3.3)
• Stem: Stem selection rules (Fig. 3.4)
• Grademap: Lexeme-specific map between the morphomic stems and the grade (Fig. 3.4)
• Grade: Phonological forms of the grade (Fig. 3.4)
• Phonology: Phonological alternations (Fig. 3.5)
• Lexicon: Toy minilexicon containing entries for bhagavant and tasthivans (Fig. 3.5)
The fragment of Sanskrit above can be implemented by composing the transducers 1–9 together as follows:5

Suffixes = DummyLexicon ◦ Features ◦ Endings ◦ Referral ◦ Filter ◦ Stem ◦ Grademap ◦ Grade ◦ Phonology    (3.1)

The entire morphology for this fragment can then be implemented as follows:

Morphology = Lexicon ◦ Suffixes    (3.2)
But note now that we can factor this problem in another way. In particular, we can easily select out portions of the Suffixes FST that match 4
4 This and other fragments discussed in this chapter are downloadable from the web site for this book, . Information on lextools can be found in Appendix 3A. See Appendix 3B for an XFST implementation of the Sanskrit data due to Dale Gerdemann, p.c.
5 The function of the DummyLexicon is in part to keep this composition tractable, by limiting the left-hand side of the composition to a reasonable subset of Σ*.
VShort a e i o u au ai
VLong a= e= i= o= u= a=u a=i
Vowel VShort VLong
UCons k c t. t p
VCons g j d. d b
UCons kh ch t.h th ph
VCons gh jh d.h dh bh
SCons n~ n. n m
UCons sh s. s f
SCons v y
Cons UCons VCons SCons
Seg Cons Vowel
gend masc fem neut
case nom acc instr dat abl gen loc
num sg du pl
Category: noun case gend num
Grade Guna Vrddhi Zero ivat us
GradeClass GZN VIUs
Diacritics Ending MascNoun
Diacritics Strong Middle Weakest
Diacritics Grade GradeClass
Class MascNoun
Figure 3.1 Lextools symbol file for implementation of a mini-fragment of Sanskrit, following Stump (2001). Note that the labels ivat and us are used to mark stem alternations needed for bhagavant and tasthivans
certain feature specifications. Thus, for instance, we can select the set of instrumental dual suffixes as follows:

Suffix_instr,du = Suffixes ◦ (Σ* [noun gend=masc case=instr num=du] Σ*)    (3.3)

And then one can affix this particular suffix to the base as follows:

[bh]agavant[GZN][MascNoun] ◦ Suffix_instr,du    (3.4)
If this looks like an ordinary case of lexical, incremental affixation, this is no accident. Note in particular that the full form is being constructed using composition, which we argued in the previous chapter to be the most general single operation that can handle all forms of morphological expression. Furthermore, the feature specifications are introduced with the suffix, for example, Suffixi ns tr,du . In no sense is it the case, in the above factorization, that the particular form of the affix (in this case -bhy¯am), is introduced by a rule that is sensitive to features already present, and thus the operation would appear to be incremental. In a similar fashion it would appear to be lexical: the suffix is a separate entity that includes the case and number features and which combines with the base to form the fully inflected noun. Now there is no question that an approach such as Stump’s has some benefit at elucidating generalizations about Sanskrit paradigms.
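The point about refactorization can also be made without the finite-state machinery: whether the inflected form is built by running a base through a realizational spell-out, or by combining the base with a pre-extracted suffix entry that carries the features, the same mapping is computed. The Python sketch below is schematic only; dictionaries and a tiny list of entries stand in for the Suffixes transducer, the feature encoding is our own, and macrons and stem selection are ignored.

# Realizational view: the features licence the spelling of the ending.
ENDINGS = {("masc", "instr", "du"): "bhyam",     # -bhyām
           ("masc", "nom",   "pl"): "as"}

def realize(base: str, feats: tuple) -> str:
    return base + ENDINGS[feats]

# Lexical-incremental view: the same information packaged as suffix entries,
# each pairing a form with the features it contributes (cf. Suffix_instr,du).
SUFFIX_LEXICON = [("bhyam", ("masc", "instr", "du")),
                  ("as",    ("masc", "nom",   "pl"))]

def affix(base: str, feats: tuple) -> str:
    form = next(f for f, fs in SUFFIX_LEXICON if fs == feats)
    return base + form

feats = ("masc", "instr", "du")
assert realize("bhagavad", feats) == affix("bhagavad", feats) == "bhagavadbhyam"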
################################################################# ## Transducer 1. DummyLexicon ## ## ## ##
Template for lexical entries: they consist of a string of segments followed by a grade class such as GZN (Guna-Zero-Nothing), or VIUs (Vrddhi-ivat-us); followed by a class, such as MascNoun
[Seg]* [GradeClass][Class] ################################################################# ## Transducer 2. Features ## ## ## ## ## ## ##
This rule introduces the actual feature matrices, given the lexical class of the stem. Note that nouns (actually nominals, including adjectives) are specified for case and number in addition to gender. The following rule thus produces a lattice of possible feature specifications, with only the gender feature specified. The Ending diacritic is just a placeholder for where the endings will go.
[MascNoun] -> [noun gend=masc][Ending] ################################################################# ## Transducer 3. Endings ## ## ## ##
This spells out the endings given the feature specifications of the base. These are thus straight realizational rules. These are just the endings applicable to masculine adjectives such as bhagavant and tasthivans
[Ending] [Ending] [Ending] [Ending] [Ending] [Ending]
-> -> -> -> -> ->
[<epsilon>] / [noun case=nom gend=masc num=sg] __ am / [noun case=acc gend=masc num=sg] __ [a=] / [noun case=instr gend=masc num=sg] __ e / [noun case=dat gend=masc num=sg] __ as / [noun case=abl gend=masc num=sg] __ i / [noun case=loc gend=masc num=sg] __
[Ending] -> [a=]u / [noun case=nom gend=masc num=du] __ [Ending] -> [bh]y[a=]m / [noun case=instr gend=masc num=du] __ [Ending] -> os / [noun case=gen gend=masc num=du] __ [Ending] [Ending] [Ending] [Ending] [Ending]
-> -> -> -> ->
as / [noun case=nom gend=masc num=pl] __ [bh]is / [noun case=instr gend=masc num=pl] __ [bh]yas / [noun case=dat gend=masc num=pl] __ [a=]m / [noun case=gen gend=masc num=pl] __ su / [noun case=loc gend=masc num=pl] __
Figure 3.2 Transducers 1–3 for implementation of a mini-fragment of Sanskrit, following Stump (2001)
Take the rules of referral, for example. The rules, as we have presented them, crosscut many paradigms and are not just properties of particular lexical items. As such, it is useful to have a notion of “paradigm” to which such rules can refer (though in actual fact, they refer to morphosyntactic feature combinations, rather than paradigm cells per se). But no matter: what is clear is that whatever the descriptive merits of Stump’s approach, it is simply a mistake to assume that this forces
#################################################################
## Transducer 4. Referral
##
## These are the rules of referral that specify the syncretic
## relations in the paradigm

Optional
[noun case=abl gend=masc num=sg] -> [noun case=gen gend=masc num=sg]

[noun case=nom gend=masc num=du] -> [noun case=acc gend=masc num=du]
[noun case=instr gend=masc num=du] -> [noun case=dat gend=masc num=du]
[noun case=instr gend=masc num=du] -> [noun case=abl gend=masc num=du]
[noun case=gen gend=masc num=du] -> [noun case=loc gend=masc num=du]

[noun case=nom gend=masc num=pl] -> [noun case=acc gend=masc num=pl]
[noun case=dat gend=masc num=pl] -> [noun case=abl gend=masc num=pl]

#################################################################
## Transducer 5. Filter
##
## This filters out all strings ending in the diacritic Ending. This
## simply eliminates all forms that have not had an ending added.
## Note that the realizational ending rules above do not have
## specifications for feature combinations that are created by the
## rules of referral. This means that if we have a combination such as
## [noun case=abl gend=masc num=pl], and it is generated by the base
## rule in "Features" it will not get an ending; it will thus end in
## the Ending diacritic and be filtered out. The only case of
## [noun case=abl gend=masc num=pl] that will pass the filter is that
## created by the rules of referral.

[<sigma>]* ([<sigma>] - [Ending])
Figure 3.3 Transducers 4–5 for implementation of a mini-fragment of Sanskrit, following Stump (2001)
one to view morphology as realizational–inferential rather than, say, lexical–incremental, at a mechanistic level. The two are no more than refactorizations of each other.

3.3.2 Position Classes in Swahili

In many languages, complex words are built up by concatenating classes of morphemes together in a fixed order. In Finnish, for example, as we saw above, inflected nouns consist of a stem, a number affix and a case affix. In Swahili, inflected verbs have a fixed set of prefixal position classes so that, for example, many verbs have the structure: (Neg)-SubjectAgreement-Tense-Stem. On the face of it such facts would appear to call most naturally for an item-and-arrangement, or in Stump's terms, lexical–incremental view. After all, the complex words would seem to be composed out of distinct pieces, and each of these pieces would appear to be identifiable with a clear set of morphosyntactic features. But realizational theories
#################################################################
## Transducer 6. Stem
##
## These are the stem selection rules. By default introduce the
## Weakest stem, but modify it to Strong for a particular set of
## features, to Middle for everything in the GZN grade class, and to
## Middle everywhere else.

[<epsilon>] -> [Weakest] / __ [GradeClass]
[Weakest] -> [Strong] / __ [GradeClass][noun gend=masc case=nom]
[Weakest] -> [Strong] / __ [GradeClass][noun gend=masc case=acc num=sg]
[Weakest] -> [Strong] / __ [GradeClass][noun gend=masc case=acc num=du]
[Weakest] -> [Middle] / __ [GZN]
[Weakest] -> [Middle] / __ [GradeClass][noun gend=masc case=instr num=du]
[Weakest] -> [Middle] / __ [GradeClass][noun gend=masc case=dat num=du]
[Weakest] -> [Middle] / __ [GradeClass][noun gend=masc case=abl num=du]
[Weakest] -> [Middle] / __ [GradeClass][noun gend=masc case=instr num=pl]
[Weakest] -> [Middle] / __ [GradeClass][noun gend=masc case=dat num=pl]
[Weakest] -> [Middle] / __ [GradeClass][noun gend=masc case=abl num=pl]
[Weakest] -> [Middle] / __ [GradeClass][noun gend=masc case=loc num=pl]

#################################################################
## Transducer 7. Grademap
## This maps the selected stem to the particular grade given the grade
## class of the lexeme

[Middle][GZN] -> [Zero]
[Strong][GZN] -> [Guna]
[Strong][VIUs] -> [Vrddhi]
[Middle][VIUs] -> [ivat]
[Weakest][VIUs] -> [us]

#################################################################
## Transducer 8. Grade
## This implements the phonological expression of the grade. Thus
## "avant" becomes "avat" in Zero grade.

avant -> avat / __ [Zero]
iv[a=]ms -> u[s.] / __ [us]
iv[a=]ms -> ivat / __ [ivat]
[Grade] -> [<epsilon>]
Figure 3.4 Transducers 6–8 for implementation of a mini-fragment of Sanskrit, following Stump (2001)
are committed to the idea that affix material is introduced by rule, and inferential theories are committed to the idea that such rules are triggered by morphosyntactic features. Within a lexical–incremental theory, the structure of complex words such as those in Swahili would be described in terms of a “word syntax”, which would either specify slots into which morphemes of particular
#################################################################
## Transducer 9. Phonology
##
## Finally, this implements some phonological rules such as the
## expression of "-vant" or "-v[a=]ms" as "-v[a=]n" in the nominative
## singular and the voicing of "t" to "d" before a following voiced
## consonant. Note that [<bos>] and [<eos>] denote the beginning and
## end of string, respectively.

(vant|v[a=]ms) -> v[a=]n / __ [noun] [<eos>]
vat -> vad / __ [noun] [VCons]

#################################################################
## 10. Lexicon
## A minilexicon for the two adjectives "bhagavant" ‘fortunate’ and
## "tasthivans" ‘having stood’.

[bh]agavant[GZN][MascNoun]
tas[th]iv[a=]ms[VIUs][MascNoun]
Figure 3.5 Transducer 9 and lexicon for an implementation of a mini-fragment of Sanskrit, following Stump (2001)
classes could be placed, or else would specify subcategorization requirements that would guarantee that affixes would attach in a particular order; another alternative is to assume a principle such as Baker’s Mirror Principle (Baker, 1985), whereby the construction of morphologically complex words is controlled by aspects of phrasal syntax. But no matter: within lexical–incremental theories the construction of complex words is stated in terms of syntactic conditions on the arrangement of discrete morphemes, each of which bears morphosyntactic features. Within realizational–inferential theories, such phenomena are handled by positing blocks of rules, with ordering among the blocks. So for Swahili, in Stump’s analysis there is a block of rules that applies first to add tense affixes to the stem; then a block of rules adds prefixes that spell out subject agreement morphology; and finally, if the verb is marked as negative, a rule block introduces the negative morpheme. Stump uses Swahili as an example of portmanteau rule blocks, by which he means cases where a single morph apparently occupies the position of two or more contiguous rule blocks. Consider the partial verbal paradigms for taka “want” in Table 3.5. The forms in these paradigms exemplify up to three position classes. The innermost block, Block III, comprises tense morphemes, which may also include indications of polarity, namely separate negative forms. Block IV, the next block, comprises subject agreement affixes. Finally, the optional Block V comprises the negative affix ha-. Each block is exemplified by one affix in a given paradigm entry, and vice versa, so that there
Table 3.5 Positional classes in Swahili, for taka "want"

                       V      IV     III    Stem
Past
  1sg                         ni-    li-    taka
  2sg                         u-     li-    taka
  3sg (class 1)               a-     li-    taka
  1pl                         tu-    li-    taka
  2pl                         m-     li-    taka
  3pl (class 2)               wa-    li-    taka
Negative Past
  1sg                  si-           ku-    taka
  2sg                  ha-    u-     ku-    taka  (→ hukutaka)
  3sg (class 1)        ha-    a-     ku-    taka  (→ hakutaka)
  1pl                  ha-    tu-    ku-    taka
  2pl                  ha-    m-     ku-    taka
  3pl (class 2)        ha-    wa-    ku-    taka
Future
  1sg                         ni-    ta-    taka
  2sg                         u-     ta-    taka
  3sg (class 1)               a-     ta-    taka
  1pl                         tu-    ta-    taka
  2pl                         m-     ta-    taka
  3pl (class 2)               wa-    ta-    taka
Negative Future
  1sg                  si-           ta-    taka
  2sg                  ha-    u-     ta-    taka  (→ hutataka)
  3sg (class 1)        ha-    a-     ta-    taka  (→ hatataka)
  1pl                  ha-    tu-    ta-    taka
  2pl                  ha-    m-     ta-    taka
  3pl (class 2)        ha-    wa-    ta-    taka

Source: Stump (2001), Table 5.1, p. 140.
is a one-to-one relation between affixes and blocks, with one notable exception: in the case of the first singular, whereas we would expect to get the Block V + Block IV sequence ha-ni-, what we find instead is si-. This is an example of a portmanteau rule block: si- spans both Block IV and Block V. Stump accounts for the use of si- by a rather intricate mechanism involving a default multi-block rule of referral. This defaults to the modes of expression of the individual blocks, but is available for Paninian override by a more specific rule that refers to those particular blocks. This analysis thus reifies a superblock – in this case
Vowel   a e i o u
UCons   k t p
VCons   g d b
SCons   n m
UCons   sh s f h
SCons   w y l
Cons    UCons VCons SCons
Seg     Cons Vowel
num     sg pl
per     1 2 3
pol     pos neg
tns     pst fut
gen     1,2
Category: verb num per pol tns
Label   Verb
Figure 3.6 Lextools symbol file for an implementation of a mini-fragment of Swahili, following Stump (2001)
Block IV + V – which will be expressed as a single morph under the condition that a specific rule referring to that block exists (as in the case of si- expressing first person and negative polarity features), and will be expressed as the sequence of the individual blocks otherwise. Putting aside the merits of this particular approach over other conceivable approaches, let us consider a computational implementation along the same lines. Figures 3.6 and 3.7 show a lextools implementation of the Swahili fragment that Stump discusses. As with the Sanskrit example, we assume a lexicon that generates all possible feature expressions for a given stem. Then, affixation rules are introduced block by block to spell out morphosyntactic features on the stem. To construct the complex forms, one simply composes the stem with each of the blocks in turn:

Dummy ◦ BlockIII ◦ BlockIV ◦ BlockV
(3.5)
So far so good, but what about the case of si-? In the analysis in Figure 3.7, this is accomplished by doing a slight refactorization of the above composition and then combining the si- morpheme in a slightly different way. In particular, the lexicon is first composed with Block III. Blocks IV and V are composed together, but we need to take care of the case where the expected ha-ni sequence is replaced by si-. This can be handled straightforwardly by priority union; see Definition 7, Section 4.3. The desired operation is:

si ∪_P [BlockIV ◦ BlockV]
(3.6)
The range of the relation for si- is all entries with the specifications [verb per=1 num=sg pol=neg]. In such cases, the string prefix si- is inserted at the beginning of the string. The priority union ensures that
################################################################# ## Transducer definitions: ## ## ## ##
1. Lexicon Note that the feature [verb] expands into all possible ways of expressing number, person, polarity and tense, as defined in the symbol file.
taka[Verb] ## 1A. Dummy lexical entry that fills in the feature vector [verb] ## for anything tagged with the label [Verb] [Seg]+ ([Verb]:[verb]) ## 2. Rules for Block III [<epsilon>] -> ku / [] __ [<sigma>]* [verb pol=neg tns=pst] [<epsilon>] -> li / [] __ [<sigma>]* [verb pol=pos tns=pst] [<epsilon>] -> ta / [] __ [<sigma>]* [verb tns=fut] ## 3. Rules for Block IV [<epsilon>] [<epsilon>] [<epsilon>] [<epsilon>] [<epsilon>] [<epsilon>]
-> -> -> -> -> ->
ni / u / a / tu / m / wa /
[] [] [] [] [] []
__ __ __ __ __ __
[<sigma>]* [<sigma>]* [<sigma>]* [<sigma>]* [<sigma>]* [<sigma>]*
[verb [verb [verb [verb [verb [verb
per=1 per=2 per=3 per=1 per=2 per=3
num=sg] num=sg] num=sg gen=1,2] num=pl] num=pl] num=pl gen=1,2]
## 4. Rules for Block V [<epsilon>] -> ha / [] __ [<sigma>]* [verb pol=neg] ## 5. Regular expression for "si" prefix ([<epsilon>]:si)[<sigma>]* [verb per=1 num=sg pol=neg] ## 6. Phonological "cleanup" rules aa -> a au -> u
Figure 3.7 An implementation of a mini-fragment of Swahili, following Stump (2001)
verbs matching this feature specification can only pass through the si- transducer. Verbs with any other feature specification will pass through Blocks IV and V. This is a straightforward way to implement Stump's notion of a portmanteau rule block that competes with a set of cascaded blocks. Thus, the combination of all the blocks is as follows:
Prefixes = Dummy ◦ BlockIII ◦ [si ∪_P [BlockIV ◦ BlockV]]
(3.7)
For a particular verb such as taka[Verb] we have the following composition: taka[Verb] ◦ Prefixes
(3.8)
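The same factorization can be simulated directly without any finite-state machinery. The following sketch is ours, not the lextools code of Figure 3.7 (the function names and the feature encoding are invented for the illustration): each rule block is a function from a feature bundle and a partial form to a longer form, the portmanteau si- is a partial function, and priority union simply prefers si- wherever it is defined.

SUBJ = {("1", "sg"): "ni", ("2", "sg"): "u", ("3", "sg"): "a",
        ("1", "pl"): "tu", ("2", "pl"): "m", ("3", "pl"): "wa"}

def block_iii(f, form):                       # tense (and polarity) prefixes
    if f["tns"] == "pst":
        return ("ku" if f["pol"] == "neg" else "li") + form
    return "ta" + form                        # future, either polarity

def block_iv_v(f, form):                      # subject agreement, then ha-
    form = SUBJ[(f["per"], f["num"])] + form
    return ("ha" + form) if f["pol"] == "neg" else form

def si_block(f, form):                        # defined only for 1sg negative
    if (f["per"], f["num"], f["pol"]) == ("1", "sg", "neg"):
        return "si" + form
    return None

def priority_union(p, q):
    """Use p where it is defined; otherwise fall back to q."""
    def u(f, form):
        out = p(f, form)
        return out if out is not None else q(f, form)
    return u

def cleanup(form):                            # the "phonological cleanup" rules
    return form.replace("aa", "a").replace("au", "u")

prefixes = priority_union(si_block, block_iv_v)

def inflect(f, stem="taka"):
    return cleanup(prefixes(f, block_iii(f, stem)))

print(inflect({"per": "1", "num": "sg", "pol": "neg", "tns": "pst"}))  # sikutaka
print(inflect({"per": "2", "num": "sg", "pol": "neg", "tns": "pst"}))  # hukutaka
print(inflect({"per": "3", "num": "pl", "pol": "pos", "tns": "fut"}))  # watataka

The point is only that "portmanteau block beats the cascaded blocks" reduces to "prefer the more specific relation where it is defined", which is exactly what the priority union in (3.7) encodes.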
But note again that one can easily factor the problem differently. For example one can trivially redefine Block III as follows:

([<epsilon>]:ku) ([<sigma>]*) [verb pol=neg tns=pst]
([<epsilon>]:li) ([<sigma>]*) [verb pol=pos tns=pst]
([<epsilon>]:ta) ([<sigma>]*) [verb tns=fut]
(3.9)
This defines a transducer that on the one hand introduces a prefix – ku-, li-, or ta- – and on the other adds a specification for polarity and tense features to the verb. In this definition, the verbal feature matrix added by the affix serves as a filter to restrict the verbal affixes of the input, which are unconstrained with respect to the feature specifications for polarity and tense. The input to Block III for the verb taka, for example, would be: taka[verb num={sg,pl} per={1,2,3} pol={pos,neg} tns={pst,fut}]
(3.10)
In other words, in this specification taka is a verb with no restrictions on the specifications of the features num, per, pol, or tns. The new definition of Block III above will restrict these features and simultaneously introduce the related affixes. But this is completely equivalent to a unification mechanism that combines the affixal features with the (underspecified) features of the stem. The result is then:

kutaka[verb num={sg,pl} per={1,2,3} pol=neg tns=pst]
litaka[verb num={sg,pl} per={1,2,3} pol=pos tns=pst]
tataka[verb num={sg,pl} per={1,2,3} pol={pos,neg} tns=fut]
(3.11)
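The unification view can be sketched in the same toy style (again our own illustration, not code from the book): the underspecified feature bundle of (3.10) is a dictionary of value sets, and the redefined Block III amounts to intersecting those value sets while prefixing the corresponding affix.

def unify(stem_feats, affix_feats):
    """Intersect value sets feature by feature; fail on an empty intersection."""
    out = {}
    for feat, values in stem_feats.items():
        common = values & affix_feats.get(feat, values)
        if not common:
            return None
        out[feat] = common
    return out

stem_feats = {"num": {"sg", "pl"}, "per": {"1", "2", "3"},
              "pol": {"pos", "neg"}, "tns": {"pst", "fut"}}

block_iii = [("ku", {"pol": {"neg"}, "tns": {"pst"}}),
             ("li", {"pol": {"pos"}, "tns": {"pst"}}),
             ("ta", {"tns": {"fut"}})]

for prefix, affix_feats in block_iii:
    feats = unify(stem_feats, affix_feats)
    if feats is not None:
        print(prefix + "taka", feats)
# kutaka with pol restricted to {neg} and tns to {pst}
# litaka with pol restricted to {pos} and tns to {pst}
# tataka with tns restricted to {fut} and pol left unrestricted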
Once again, the realizational–inferential theory turns out to be formally equivalent under refactorization to a lexical–incremental theory.

3.3.3 Double Plurals in Breton

Another issue that Stump discusses is multiple exponence. We have already seen an example of that in Swahili, where negative polarity is spelled out in a couple of places, specifically in the Block III tense morpheme and as the Block V prefix ha-. Another example is Breton diminutive plurals such as bag+où+ig+où "little boats", where the base is bag "boat", the diminutive suffix is -ig and the two -où's
represent the plural; note that the singular diminutive is bagig "little boat", so that it seems that both the stem and the diminutive affix have a plural marker. Stump analyzes this as a case of word-to-stem derivation, whereby -ig suffixation derives a stem from the base bag. Stump's derivation is complex and depends upon a number of assumptions. The first is that Breton pluralization involves two rule blocks, although for normal (non-diminutive) nouns, there is typically just one plural suffix:

PF(<X, {Num:·}>) =def Nar1(Nar0(<X, {Num:·}>))
(3.12)
Here, “Nar” is the narrowest applicable rule where “narrowest” is defined in the Paninian sense: the rule whose featural domain specifications are the narrowest in a block of rules that still subsumes the form in question. For normal plurals, it seems, only the inner “zero” block has an associated plural rule. “PF” denotes Paradigm Function, that is, a function that mediates the spell out of a cell in a paradigm. More specifically a Paradigm Function is “a function which, when applied to the root of a lexeme L, paired with a set of morphosyntactic properties appropriate to L, determines the word form occupying the corresponding cell in L’s paradigm” (Stump, 2001, page 32). For bag “boat”, the narrowest applicable rule applying on the inner block simply attaches the default plural affix -où. In the case of bagig ‘little boat’, things are more interesting. Stump assumes a diminutive rule for Breton defined as a “word-to-stem” rule (page 204): DRdimin (X) =def Xig
(3.13)
He assumes a "universal metarule for word to stem derivatives" (= Stump's 28, page 204):

If X, Y are roots and M is a word-to-stem rule such that M(X)=Y, then for each set σ of morphosyntactic properties such that PF(<X, σ>) is defined, if PF(<X, σ>) = <Z, σ>, then RR0,σ,{L-index(Y)}(<Y, σ>) =def <M(Z), σ>.
Here, RR is Stump's designation for a realization rule – a rule that phonologically realizes a particular set of morphosyntactic features, and RR0,σ,{L-index(Y)} designates a realization rule that applies in a particular block (here, Block 0), for a particular set of morphosyntactic features (σ), to a particular lexical item (designated as "L-index(Y)"). In plain language, Stump's universal metarule states that the inflection
of a word-to-stem derivative is the same as applying that word-to-stem derivation to the inflected form of the base. Thus we have, for bagig, bagoùig. Here the assumption of two blocks for plural formation comes into play so that we have as an intermediate step in the derivation:

RR1,{num:pl},N(<bagoùig, σ>)
(3.14)
The default rule for pluralization in Block 1 is to suffix -où, so we end up with the observed form bagoùigoù. The universal word-to-stem metarule is applied in this case in Breton, and also in another similar example in Kikuyu. Some of Stump's other assumptions, however, seem rather more specific to Breton, and it is hard to tell how universal these could reasonably be expected to be. We turn now to a computational treatment of the same phenomenon. A mini-grammar in lextools format is given in Figure 3.8. This grammar produces the following forms for the singular and plural diminutives and non-diminutives, with the diminutive forms being derived via the optional rule that introduces diminutive features:

bag[ou][noun num=sg]    bag
bag[ou][noun num=sg]    bagig
bag[ou][noun num=pl]    bagou
bag[ou][noun num=pl]    bagouigou
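The behaviour of this grammar is easy to check by hand with a toy re-rendering (ours, and deliberately much cruder than the transducers of Figure 3.8): the diminutive is optional, -ig brings along its own [ou] diacritic, and the plural spell-out realizes every [ou] diacritic in a plural word, which is exactly what yields the double marking.

def breton_forms(stem="bag"):
    forms = []
    for num in ("sg", "pl"):
        for diminutive in (False, True):
            word = [stem, "[ou]"]            # stem plus its plural diacritic
            if diminutive:
                word += ["ig", "[ou]"]       # -ig takes -ou as well
            surface = "".join(
                ("ou" if num == "pl" else "") if seg == "[ou]" else seg
                for seg in word)             # spell out [ou] only in the plural
            forms.append((num, diminutive, surface))
    return forms

for num, dim, surface in breton_forms():
    print(num, "dim" if dim else "   ", surface)
# sg       bag
# sg  dim  bagig
# pl       bagou
# pl  dim  bagouigou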
While this account might be criticized as ad hoc or non-explanatory, it is far from obvious that it is less ad hoc than Stump's account, or any less explanatory. And it does have the advantage of being easier to understand. As before, it is possible to refactor the analysis so that the inferential–realizational rules that we have presented above are recast as lexical–incremental. For example, the plural rule can be rewritten as a pair of rules:

[ou] -> ou
[sg] -> [pl]

This pair of rules introduces the suffix -où, and changes the number feature of a singular base to plural. These two rules can be represented as a single transducer P, which can then be composed with a singular base to produce the plural form. Thus, again, the feature
## Lextools symbol file for Breton
Letter  a b c d e f g h i j k l m n o p q r s t u v w x y z
## Note: no nasalized or accented vowels in this short fragment
num     sg pl
Feat    dim num
Category: noun num
## ed marks nouns that take plurals in -ed
InflClass ou ed

#################################################################
## Transducer definitions:
##
## 1. Lexicon
##
## Here we define the entry for "bag" ‘boat’, which has a diacritic
## [ou] indicating that the plural is -ou. If one objects that -ou
## should be the default, then one can always replace this with a
## generic diacritic -- e.g. ‘[xx]’ -- which will get spelled out as
## [ou] if no other plural mark is there.
##
## [noun] has all possible feature specifications for nouns.

bag[ou][noun]

## 2. Diminutive rule
##
## This optionally introduces the feature [dim]. Then it introduces an
## affix "-ig", if the feature [dim] is present. The affix is marked to
## go after the inflection class marker (the indicator of which plural
## affix the noun takes). The diminutive suffix itself is marked to
## take -ou, but again if one prefers it could be marked with a
## generic diacritic.

optional
[<epsilon>] -> [dim] / __ [<eos>]
obligatory
[<epsilon>] -> ig[ou] / [InflClass] __ [<sigma>]* [dim]

## 3. Plural spell out.
##
## Note that this spells out every [ou] as "-ou" in the context of a
## plural marking. The result: double plural marking for "bagig"

[ou] -> ou / __ [<sigma>]*[noun num=pl]

## 4. Clean up rule. This just deletes feature specification and
## category information to derive the surface form.

[InflClass]|[Feat]|[] -> [<epsilon>]
Figure 3.8 An implementation of a mini-fragment of Breton
“plural” is now introduced by an affix rule under the generic affixation operation of composition. And once again, the difference between an inferential–realizational approach and a lexical–incremental approach turns out to be a matter of factorization.
3.4 Equivalence of Inferential–Realizational and Lexical–Incremental Approaches: A Formal Analysis

It is also possible to argue for the equivalence of inferential–realizational and lexical–incremental approaches on the basis of a formal analysis of the semantics of morphological operations. The argument is the same as what we have already presented in computational terms, but stated a bit differently. As an example consider Blevins' (2003) analysis of West Germanic. Blevins provides an analysis of the stem morphology of weak verbs in English, German, Frisian, and Dutch within the realizational framework of Stump (2001). He observes that in all West Germanic languages, the past tense, perfect participle, and passive participle all share the same stem, which is formed with a dental. In English, this is exemplified by examples such as the following:

1. PAST: John whacked the toadstool
2. PERF: John has whacked the toadstool
3. PASS: The toadstool was whacked

The past form in Sentence 1, the perfect participle in Sentence 2, and the passive participle in Sentence 3 all share the same phonological property. In other West Germanic languages, some of these forms may have additional material. Thus in German:

4. PAST: Er mähte das Heu "He mowed the hay"
5. PERF: Er hat das Heu gemäht "He has mowed the hay"
6. PASS: Das Heu wurde gemäht "The hay was mowed"

The common feature of the forms is the dental -t, but there is additional material in each case: in the past there is a stem vowel -e, and in the perfect and passive participles, there is the prefix ge-. Blevins argues that one cannot view the dental suffix as being a single morpheme with a common semantics, and indeed this is also the conclusion reached by analyses based on morphemes, such as that of Pinker (1999), who argues for a set of distinct homophonous dental morphemes. This duplication is an embarrassment for lexical accounts to be sure, but it is important to bear in mind that it is simply a fact of the data and has to be incorporated somewhere in the model.
In realizational accounts, such as the one Blevins provides, such duplications are handled by allowing many-to-one mappings between semantic features and the morphological exponents of those features. Thus a realization function R is defined as follows, for English, where F_d(X) = Xd is a function that suffixes -d to the stem:

R([past]) = F_d(X)
R([perf]) = F_d(X)
R([pass]) = F_d(X)
(3.15)
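In programming terms – and this is only a schematic rendering of (3.15), not Blevins's own formalism – the many-to-one mapping is a table whose three keys share a single value; the second function below anticipates the catenation-pair construction developed in the following paragraphs.

def F_d(stem):
    """Suffix -d to the stem, as in F_d(X) = Xd."""
    return stem + "d"

realize = {"past": F_d, "perf": F_d, "pass": F_d}   # three features, one exponent

def combine(element, increment):
    """Combine a (features, phonology) pair with a (feature, exponent) pair."""
    feats, phon = element
    feat, exponent = increment
    return (feats | {feat}, phon + exponent)

print(realize["perf"]("X"))                    # Xd
print(combine((set(), "X"), ("past", "d")))    # ({'past'}, 'Xd')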
Thus we have a many-to-one mapping between three semantic features and a single exponent. Let us try to define these notions more formally as follows. First of all, we will use the abstract catenation operator "·" to represent the catenation of -d with the stem, and so we can redefine F_d(X) as a single place function that concatenates -d to a stem, as follows:

F_d(X) = λ(X)[X · d]
(3.16)
Second, the realization expressions presumably do not just realize, say, [past], but realize it with respect to a certain base, the same base to which -d is ultimately attached. Let us assume an operator ⊕ to represent the addition of the relevant feature. Thus we would write:

R(λ(X)[X ⊕ past]) = λ(X)[X · d]
(3.17)
Now, one assumes that what it means to realize a particular feature or set of features on a stem by means of a particular morphological exponent is that one adds the feature and realizes the exponent of that feature.6 So we should be able to collapse the above into:

λ(X)[X ⊕ past ∧ X · d]
(3.18)
Here, ∧ simply denotes the fact that both the feature combination and the catenation operations take place. But, we can condense this expression further by collapsing the two combinatoric expressions into one:

λ(X)[X <⊕, ·> <past, d>]
(3.19)
6 Note that this statement is neutral as to incremental versus realizational approaches. In any case, one has as input a form that lacks a set of morphosyntactic features and a particular morphological exponent; and one ends up with a new form that has the morphosyntactic features and the particular exponent.
Here <past, d> is simply a pairing of the morphosyntactic/semantic feature with the phonological exponence. We use <⊕, ·> to represent a catenation pair which combines elements on the morphosyntactic/semantic side using ⊕ and elements on the phonological side using "·"; see Sproat (2000) for a similar binary catenation pair applied to the simultaneous combination of linguistic elements with linguistic elements and graphemic elements with graphemic elements. In this formulation, we also need to consider X to be a morphosyntactic–phonological pairing, but we will leave this implicit in our notation. The above expression thus just describes a function that takes an element X and combines it (via <⊕, ·>) with another expression <past, d>. This is clearly just a formulation of a standard lexical–incremental model.

3.5 Conclusions

For the past half century there has been a debate in morphological theory between, on the one hand, theories that maintain that morphology involves the assembly of small atomic meaning-bearing pieces (morphemes); and on the other, theories that maintain that the basic operations of morphology are rules that introduce or modify meanings and simultaneously effect phonological changes on the bases to which they apply. From a purely computational or formal point of view, as we have argued, the differences between such approaches are less significant than they at first appear to be. In many cases there may be bona fide reasons for preferring one approach over another. For example, to explain the complex paradigms of Sanskrit nominal morphology, one would like to have recourse to a model where the morphological function – the cells in the paradigm and their associated features – is dissociated from the actual spellout of those features. On the other hand, there seems to be little benefit in such a model if the task is to account for, say, noun plural formation in Kannada, where this simply involves attaching the plural morpheme -gaḷu to the end of the noun. As much as anything else, what is at issue here is not whether inferential–realizational or lexical–incremental theories are sometimes motivated, but whether they are always motivated, and whether there is any justification in the all-or-nothing view that has prevailed in morphological thinking. Such a monolithic view is often justified on the basis of Occam's razor: if a theory can account for certain phenomena elegantly, and is at least capable of accounting for other
phenomena, if not so elegantly, then that theory is to be preferred over another theory that fails to account for certain kinds of data. Unfortunately this is not the situation that obtains with morphological theories. Both lexical–incremental theories and inferential–realizational theories can account for the data; we have seen that they are formally and computationally equivalent. Of course, as a reader of an earlier version of this book pointed out to us, if someone is interested in how humans process morphology, rather than the formal computational description of such processing, they might be inclined to wonder: "so what?" But this clearly appeals to psycholinguistic evidence, and there the data does not seem to be that friendly to either side: there is increasing evidence that human morphological processing involves a powerful set of analogical reasoning tools that are sensitive to various effects including frequency, semantic and phonetic similarity, register, and so on (Hay and Baayen, 2005; Baayen and Moscoso del Prado Martín, 2005). It is not clear that the mechanisms at work particularly favor lexical–incremental over inferential–realizational models. What these issues ultimately come down to are matters of taste. There is of course nothing wrong with taste, but it is something that is less obviously in the purview of Occam's razor.

Appendix 3A: Lextools

The following unix manual page describes the file formats and regular expression syntax for the lextools grammar development toolkit. It is a copy of the manual page available at the AT&T website (http://www.research.att.com/~alb/lextools/lextools.5.htm), and also available at the web page for this book, http://compling.ai.uiuc.edu/catms.

NAMES
Lextools symbol files, regular expression syntax, general file formats, grammar formats - lextools file formats

DESCRIPTION
Symbol Files
The lextools symbol file specifies the label set of an application. Labels come in three basic flavors:

Basic labels
Superclass labels
Category labels
Basic labels are automatically assigned a unique integer representation (excluding 0, which is reserved for "<epsilon>"), and this information is compiled by lexmakelab into the basic label file, which is in the format specified in fsm(5). Superclass labels are collections of basic labels: a superclass inherits all of the integral values of all of the basic labels (or superclass labels) that it contains. Category labels are described below. The lines in a symbol file look as follows:

superclass1    basic1 basic2 basic3
You may repeat the same superclass label on multiple (possibly non-adjacent) lines: whatever is specified in the new line just gets added to that superclass. The "basic" labels can also be superclass labels: in that case, the superclass in the first column recursively inherits all of the labels from these superclass labels. The one exception is a category expression which is specified as follows:

Category:    catname feat1 feat2 feat3
The literal "Category:" must be the first entry: it is case insensitive. "catname" should be a new name. "feat1" labels are their values. The following sample symbol file serves to illustrate:

dletters  a b c d e f g h i j k l m n o p
dletters  q r s t u v w x y z
uletters  A B C D E F G H I J K L M N O P
uletters  Q R S T U V W X Y Z
letters   dletters uletters
gender    masc fem
case      nom acc gen dat
number    sg pl
person    1st 2nd 3rd
Category: noun gender case number
Category: verb number person

For this symbol set, the superclass "dletters" will contain the labels for all the lower case labels, "uletters" the upper case letters, and "letters" both. Defined categories are "noun" and "verb". "noun", for instance, has the features "gender", "case" and "number". The feature "gender" can have the values "masc" and "fem". (NB: this way of representing features is inherited in concept from Lextools 2.0.)
Some caveats: You should not use the reserved terms "<epsilon>" or "<sigma>" in a symbol file. If you use the -L flag to lexmakelab you should also not use any of the special symbols that it introduces with that flag (see lextools(1)). Symbol files cannot contain comments. Backslash continuation syntax does not work.

Regular Expressions
Regular expressions consist of strings, possibly with specified costs, conjoined with one or more operators. strings are constructed out of basic and superclass labels. Labels themselves may be constructed out of one or more characters. Characters are defined as follows: If 2-byte characters are specified (Chinese, Japanese, Korean...), a character can be a pair of bytes if the first byte of the pair has the high bit set. In all other conditions a character is a single byte. Multicharacter tokens MUST BE delimited in strings by a left bracket (default: "[") and right bracket (default: "]"). This includes special tokens "<epsilon>", "<sigma>", "<bos>" and "<eos>". This may seem inconvenient, but the regular expression compiler has to have some way to figure out what a token is. Whitespace is inconvenient since if you have a long string made up of single-character tokens, you don't want to be putting spaces between every character: trust me. You may also use brackets to delimit single-character tokens if you wish. Some well-formed strings are given below:

abc[foo]
[<epsilon>]ab[<sigma>]

Note that the latter uses the superclass "<sigma>" (constructed by lexmakelab to include all symbols of the alphabet except "<epsilon>"): in order to compile this expression, the superclass file must have been specified using the -S flag. If features are specified in the label set, then one can specify the features in strings in a linguistically appealing way as follows:

food[noun gender=fem number=sg case=nom]

Order of the feature specifications does not matter: the order is determined by the order of the symbols in the symbol file. Thus the following is equivalent to the above:
food[noun case=nom number=sg gender=fem]

The internal representation of such feature specifications looks as follows: "food[_noun][nom][sg][fem]". Unspecified features will have all legal values filled in. Thus food[noun case=nom number=sg] will produce a lattice with both [fem] and [masc] as alternatives. Inappropriate feature values will cause a warning during the compilation process. Since features use superclasses, again, in order to compile such expressions, the superclass file must have been specified using the -S flag. Costs can be specified anywhere in strings. They are specified by a positive or negative floating point number within angle brackets. The current version of lextools assumes the tropical semiring, so costs are accumulated across strings by summing. Thus the following two strings have the same cost:

abc[foo]<3.0>
a<-1.0>b<2.0>c<1.0>[foo]<0.5><0.5>

Note that a cost on its own -- i.e. without an accompanying string -- specifies a machine with a single state, no arcs, and an exit cost equal to the cost specified. Regular expressions can be constructed as follows. First of all a string is a regular expression. Next, a regular expression can be constructed out of one or two other regular expressions using the following operators:

regexp1*             Kleene star
regexp1+             Kleene plus
regexp1^n            power
regexp1?             optional
!regexp1             negation
regexp1 | regexp2    union
regexp1 & regexp2    intersection
regexp1 : regexp2    cross product
regexp1 @ regexp2    composition
regexp1 - regexp2    difference
In general the same restrictions on these operations apply as specified in fsm(1). For example, the second argument to "-" (difference) must be an unweighted acceptor. Note also that the two arguments to ":" (cross product) must be acceptors. The argument n to "^" must be a positive integer. The arguments to "@" (composition) are assumed to be transducers.
The algorithm for parsing regular expressions finds the longest stretch that is a string, and takes that to be the (first) argument of the unary or binary operator immediately to the left, or the second argument of the binary operator immediately to the right. Thus "abcd | efgh" represents the union of "abcd" and "efgh" (which is reminiscent of Unix regular expression syntax) and "abcd*" represents the transitive closure of "abcd" (i.e., not "abc" followed by the transitive closure of d, which is what one would expect on the basis of Unix regular expression syntax). The precedence of the operators is as follows (from lowest to highest), the last parenthesized group being of equal precedence: | & - : (* + ? ^) But this is hard to remember, and in the case of multiple operators, it may be complex to figure out which elements get grouped first. The use of parentheses is highly recommended: use parentheses to disambiguate "!(abc | def)" from "(!abc) | def". Spaces are never significant in regular expressions.
Escapes, Comment Syntax and Miscellaneous other
Comments can appear in input files, with the exception of symbol files. Comments are preceded by "#" and continue to the end of the line. You can split lines or regular expressions within lines onto multiple lines if you include "\" at the end of the line, right before the newline character. Special characters, including the comment character, can be escaped with "\". To get a "\", escape it with "\": "\\".
Lexicons
The input to lexcomplex is simply a list of regular expressions. The default interpretation is that these expressions are to be unioned together, but other interpretations are possible: see lextools(1) for details. If any of the regular expressions denotes a relation (i.e., a transducer) the resulting union also denotes a relation, otherwise it denotes a language (i.e., an acceptor).

Arclists
An arclist (borrowing a term from Tzoukermann and Liberman's 1990 work on Spanish morphology) is a simple way to specify a finite-state morphological grammar. Lines can be of one of
the following three formats:

instate outstate regexp
finalstate
finalstate cost

Note that cost here should be a floating point number not enclosed in angle brackets. State names need not be enclosed in square brackets: they are not regular expressions. The following example, for instance, specifies a toy grammar for English morphology that handles the words "grammatical", "able", "grammaticality", "ability" (mapping it to "able + ity"), and their derivatives in "un-":

START    ROOT     un [++] | [<epsilon>]
ROOT     FINAL    grammatical | able
ROOT     SUFFIX   grammatical
ROOT     SUFFIX   abil : able [++]
SUFFIX   FINAL    ity
FINAL    1.0
Paradigms
The paradigm file specifies a set of morphological paradigms, specified as follows. Each morphological paradigm is introduced by the word "Paradigm" (case insensitive), followed by a bracketed name for the paradigm:

Paradigm    [m1a]

Following this are specifications of one of the following forms:

Suffix       suffix       features
Prefix       prefix       features
Circumfix    circumfix    features

The literals "Suffix", "Prefix" and "Circumfix" are matched case-insensitively. The remaining two fields are regular expressions describing the phonological (or orthographic) material in the affix, and the features. The "Circumfix" specification has a special form, namely "regexp...regexp". The three adjacent dots, which must be present, indicate the location of the stem inside the circumfix. In all cases, features are placed at the end of the morphologically complex form. There is no provided mechanism for infixes, though that would not be difficult to add. One may specify in a third field in the "Paradigm" line another previously defined paradigm from which the current paradigm inherits forms:

Paradigm    [mo1a] [m1a]
In such a case, a new paradigm will be set up, and all the forms will be inherited from the prior paradigm except those forms whose features match entries specified for the new paradigm: in other words, you can override, say, the form for "[noun num=sg case=dat]" by specifying a new form with those features. (See the example below.) One may also add additional entries (with new features) in inherited paradigms. A sample set of paradigms (for Russian) is given below:

Paradigm [m1a]
Suffix  [++]        [noun num=sg case=nom]
Suffix  [++]a       [noun num=sg case=gen]
Suffix  [++]e       [noun num=sg case=prep]
Suffix  [++]u       [noun num=sg case=dat]
Suffix  [++]om      [noun num=sg case=instr]
Suffix  [++]y       [noun num=pl case=nom]
Suffix  [++]ov      [noun num=pl case=gen]
Suffix  [++]ax      [noun num=pl case=prep]
Suffix  [++]am      [noun num=pl case=dat]
Suffix  [++]ami     [noun num=pl case=instr]
Paradigm [mo1a] [m1a]
Paradigm [m1e] [m1a]
Suffix  [++]"ov     [noun num=pl case=gen]
Suffix  [++]"ax     [noun num=pl case=prep]
Suffix  [++]"am     [noun num=pl case=dat]
Suffix  [++]"ami    [noun num=pl case=instr]
Note that "[mo1a]" inherits all of "[m1a]", whereas "[m1e]" inherits all except the genitive, prepositional, dative and instrumental plurals. See lextools(1) for some advice on how to link the paradigm labels to individual lexical entries in the lexicon file argument to lexparadigm.
Context-Free Rewrite Rules
The input to lexcfcompile is a set of expressions of the following form:

NONTERMINAL -> regexp

The "->" must be literally present. "NONTERMINAL" can actually be a regular expression over nonterminal symbols, though the only useful regular expressions in this case are unions of single symbols. The "regexp" can in principle be any regular expression specifying a language (i.e., not a relation) containing a mixture of terminals and nonterminals. However, while lexcfcompile imposes no restrictions on what you put in the rule, the algorithm implemented in GRMCfCompile, which lexcfcompile uses, can only handle certain kinds of context-free grammars. The user is strongly advised to read and understand the description in grm(1) to understand the restrictions on the kinds of
context-free grammars that can be handled. By default the start symbol is assumed to be the first nonterminal mentioned in the grammar; see lextools(1) for further details. The following grammar implements the toy English morphology example we saw above under Arclists, this time putting brackets around the constituents (and this time without the mapping from "abil" to "able"):

[NOUN] -> \[ ( \[ [ADJ] \] | \[ [NEGADJ] \] ) ity \]
[NOUN] -> \[ [ADJ] \] | \[ [NEGADJ] \]
[NEGADJ] -> un \[ [ADJ] \]
[ADJ] -> grammatical | able
Context-Dependent Rewrite Rules
A context-dependent rewrite rule file consists of specifications of one of the following two forms:

phi -> psi / lambda __ rho
phi => psi / lambda __ rho

In each case "phi", "psi", "lambda" and "rho" are regular expressions specifying languages (acceptors). All but "psi" must be unweighted (a requirement of the underlying GRMCdCompile; see grm(1), grm(3)). The connectors "->", "=>", and "/" must literally occur as such. The underbar separating "lambda" and "rho" can be any number of consecutive underbars. The interpretation of all such rules is that "phi" is changed to "psi" in the context "lambda" on the left and "rho" on the right. The difference between the two productions, "->" and "=>" is the following. "->" denotes a mapping where any element of "phi" can be mapped to any element of "psi". With "=>", the inputs and outputs are matched according to their order in the symbol file: this is most useful with single (superclass) symbol to single (superclass) symbol replacements. For example, suppose you have the following entries in your symbol file:

V        a e i o u
+voiced  b d g
-voiced  p t k

The rule:

[-voiced] -> [+voiced] / V __ V

will replace any symbol in {p,t,k} with any symbol in {b,d,g} between two vowels. Probably what you want in this case is the following:

[-voiced] => [+voiced] / V __ V
This will replace "p" with "b", "t" with "d" and "k" with "g". The matching is done according to the order of the symbols in the symbol file. If you had specified instead:

+voiced  b g d
-voiced  p t k

then "t" would be replaced with "g" and "k" with "d". Similarly with

+voiced  b d g
-voiced  p t k x

"p", "t" and "k" would be replaced as in the first case, but "x" would be ignored since there is nothing to match it to: nothing will happen to "x" intervocalically. Use of the matching rule specification "=>" thus requires some care in labelset management. Beginning of string and end of string can be specified as "[<bos>]" and "[<eos>]", respectively: these are added to the label set by default if you use the -L flag to lexmakelab. A line may also consist of one of the following specifications (which are case insensitive):

left-to-right
right-to-left
simultaneous
optional
obligatory

The first three set the direction of the rule application; the last two set whether the rule application is obligatory or optional; see grm(1). All specifications are in effect until the next specification or until the end of the file. The default setting is obligatory, left-to-right. In practice the user will rarely need to fiddle with these default settings.

Replacements
A replacement specification (for lexreplace) is a file consisting of lines like the following:

foo.fst    a|b|c|d

The first column specifies a single fsm that must exist in the named file. The remainder of the line specifies a union of labels to be replaced in the topology fsm argument to lexreplace with said fsm. Specified substitutions for a given label will override any previous substitutions. In the following case:

foo.fst    a|b|c|d
bar.fst    a
you will use foo.fst for "b", "c" and "d", and bar.fst for "a". See also grmreplace in grm(1).
Currency Expressions
Currency specification files contain lines of the following form:

sym    major-expr    point    minor-expr    large-number

Each entry is a regular expression. See lextools(1) for a description of the function of the entries. Note that the whitespace separator MUST BE A TAB: the regular expressions themselves may contain non-tab whitespace. There must therefore be four tabs. You must still have tabs separating entries even if you split an entry across multiple lines (with "\").
Appendix 3B: XFST Implementation of Sanskrit

Dale Gerdemann has very kindly provided us with his XFST (Beesley and Karttunen, 2003) reimplementation of the lextools model of Sanskrit presented in Section 3.3.1. We reproduce his implementation verbatim below.

define VShort a | e | i | o | u | au | ai;
define VLong a= | e= | i= | o= | u= | a=u | a=i;
define Vowel VShort | VLong;
define UCons k | c | t%. | t | p | kh | ch | t%.h | th | ph | sh | s%. | s | f;
define VCons g | j | d%. | d | b | gh | jh | d%.h | dh | bh;
define SCons n%~ | n%. | n | m | v | y;
define Cons UCons | VCons | SCons;
define Seg Cons | Vowel;
define gend masc | fem | neut;
define case nom | acc | instr | dat | abl | gen | loc;
define num sg | du | pl;
define noun Noun case= case gend= gend num= num;
define Grade Guna | Vrddhi | Zero | ivat | us;
define GradeClass GZN | VIUs;
define Diacritics Ending | MascNoun | Strong | Middle | Weakest | Grade | GradeClass;
define Class MascNoun;
define DummyLexicon Seg* GradeClass Class;
define Features MascNoun -> [noun & $masc] Ending;
define Endings [
Ending -> 0 || case= nom gend= masc num= sg _
.o.
Ending -> {am} || case= acc gend= masc num= sg _
.o.
Ending -> a= || case= instr gend= masc num= sg _
.o.
Ending -> e || case= dat gend= masc num= sg _
.o.
Ending -> {as} || case= abl gend= masc num= sg _
.o.
Ending -> i || case= loc gend= masc num= sg _
.o.
Ending -> a={u} || case= nom gend= masc num= du _
.o.
Ending -> bh{y}a={m} || case= instr gend= masc num= du _
.o.
Ending -> {os} || case= gen gend= masc num= du _
.o.
Ending -> {as} || case= nom gend= masc num= pl _
.o.
Ending -> bh{is} || case= instr gend= masc num= pl _
.o.
Ending -> bh{yas} || case= dat gend= masc num= pl _
.o.
Ending -> a= m || case= gen gend= masc num= pl _
.o.
Ending -> {su} || case= loc gend= masc num= pl _
];
define Referral [
case= abl gend= masc num= sg (->) case= gen gend= masc num= sg
.o.
case= nom gend= masc num= du (->) case= acc gend= masc num= du
.o.
case= instr gend= masc num= du (->) case= dat gend= masc num= du
.o.
case= instr gend= masc num= du (->) case= abl gend= masc num= du
.o.
case= gen gend= masc num= du (->) case= loc gend= masc num= du
.o.
case= nom gend= masc num= pl (->) case= acc gend= masc num= pl
.o.
case= dat gend= masc num= pl (->) case= abl gend= masc num= pl
];

define Filter ~[?* Ending];
define Stem [
[..] -> Weakest // _ GradeClass
.o.
Weakest -> Strong // _ GradeClass [noun & $masc & $nom]
.o.
Weakest -> Strong // _ GradeClass [noun & $masc & $acc & $sg]
.o.
Weakest -> Strong // _ GradeClass [noun & $masc & $acc & $du]
.o.
Weakest -> Middle // _ GZN
.o.
Weakest -> Middle // _ GradeClass [noun & $masc & $instr & $du]
.o.
Weakest -> Middle // _ GradeClass [noun & $masc & $dat & $du]
.o.
Weakest -> Middle // _ GradeClass [noun & $masc & $abl & $du]
.o.
Weakest -> Middle // _ GradeClass [noun & $masc & $instr & $pl]
.o.
Weakest -> Middle // _ GradeClass [noun & $masc & $dat & $pl]
.o.
Weakest -> Middle // _ GradeClass [noun & $masc & $abl & $pl]
.o.
Weakest -> Middle // _ GradeClass [noun & $masc & $loc & $pl]
];

define Grademap [
[Middle GZN] -> Zero
.o.
[Strong GZN] -> Guna
.o.
[Strong VIUs] -> Vrddhi
.o.
[Middle VIUs] -> ivat
.o.
[Weakest VIUs] -> us
];

define Grade [
{avant} -> {avat} // _ Zero
.o.
{iv}a={ms} -> {u}s%. // _ us
.o.
{iv}a={ms} -> {ivat} // _ ivat
.o.
Grade -> 0
];

define Phonology [
[{vant}|{v}a={ms}] -> {v}a={n} // _ noun .#.
.o.
{vat} -> {vad} // _ noun VCons
];
define Suffixes [
DummyLexicon
.o. Features
.o. Endings
.o. Referral
.o. Filter
.o. Stem
.o. Grademap
.o. Grade
.o. Phonology
];

define Lexicon
[bh{agavant} GZN MascNoun] |
[{tas}th{iv}a={ms} VIUs MascNoun];

# Add spaces for nice printout
define Spacing [
[..] -> % || _ \[Seg | case | gend | num]
.o.
[..] -> % || num _
];

define Morphology [ Lexicon .o. Suffixes ];

## Morphology (as above) defines a transducer with input:
##
##   {tasthiv[a=]ms[VIUs][MascNoun],
##    bhagavant[GZN][MascNoun]}*
##
##     *Multicharacter symbols marked up as in FSM
##
## These inputs are mapped to
##
##   {tasthivat[Noun case=loc gend=masc num=pl]su
##    tasthivad[Noun case=instr gend=masc num=du]bhy[a=]m
##    ...
##    bhagavat[Noun case=acc gend=masc num=pl]as
##    bhagavat[Noun case=instr gend=masc num=sg][a=]
##    ...
##   }
define Lemmatizer [ [Morphology .o. noun -> 0].i .o. [Class | GradeClass] -> 0 ];
define MorphologicalTagger [
0 <- noun
.o.
[Lexicon .o. Suffixes].l
.o.
Seg -> 0
];

read regex Morphology .o. Spacing;
read regex Lemmatizer;
read regex MorphologicalTagger .o. Spacing;
4 A Brief History of Computational Morphology

4.1 Introduction

Automatic morphological analysis dates back to the earliest work in computational linguistics on Machine Translation during the 1950s (Andron, 1962; Woyna, 1962; Bernard-Georges et al., 1962; Boussard and Berthaud, 1965; Vauquois, 1965; Schveiger and Mathe, 1965; Matthews, 1966; Brand et al., 1969; Hutchins, 2001). There have been many applications over the years including the Porter stemmer (Porter, 1980) heavily used in information retrieval applications (Dolby et al., 1965; Attar et al., 1978; Choueka, 1983; Büttel et al., 1986; Meya-Lloport, 1987; Choueka, 1990; Koskenniemi, 1984), spelling correction (McIlroy, 1982; Hankamer, 1986), text input systems (Becker, 1984; Abe et al., 1986), and morphological analysis for text-to-speech synthesis (Allen et al., 1987; Church, 1986; Coker et al., 1990). Many of these earlier applications used quite ad hoc approaches including hard-coding much of the linguistic information into the system. For example, in the system reported in Coker et al. (1990), a lot of the morphological analysis is mediated by tables coded as C header files and spelling-change rules written as C functions. In many of the early applications, there was little interest in getting the morphological analysis correct as long as the resulting behavior served the purpose of the system. For example McIlroy's spell program (McIlroy, 1982) was concerned only with those derivations that would affect the performance of the program. Thus, as McIlroy notes (page 94):

. . . we are interested in soundness of results, not of method. Silly derivations like forest = fore+est are perfectly acceptable and even welcome. Anything goes, provided it makes the list [of elements needed in the dictionary] shorter without impairing its discriminatory power.
But the most interesting work in computational morphology, both from a theoretical and a computational point of view, has been more principled than this and has depended heavily upon finite-state methods. Indeed the history of computational morphology over the last thirty years has been completely dominated by finite-state approaches. As we saw in Chapter 2, with the possible exception of reduplication, which (at least for reasons of elegance) may require further mechanisms, there are no morphological phenomena that fall outside the regular domain. Even non-local dependencies (of the German ge-...-en variety) can be handled by purely finite-state devices, at the cost of having duplication of structure. So it is natural that computational implementations should seek this lower bound.1 Dominating finite-state morphology since the early 1980s has been the approach based on finite-state transducers, originally investigated in the 1970s by Ron Kaplan and Martin Kay at Xerox PARC, whose first practical implementation was due to Koskenniemi (Koskenniemi, 1983). Because of its centrality in the history of computational morphology, we will focus in this chapter on a review of Koskenniemi's approach and attempt to elucidate some of its theoretical underpinnings. Alternative approaches to computational morphology have largely been of two varieties. The first consists of explicitly finite-state approaches: systems based on a finite-state model of morphotactics but with more or less ad hoc computational devices (e.g., C functions) to implement spelling-change or phonological rules. Such systems include the DECOMP module of the MITalk text-to-speech synthesis system (Allen et al., 1987) and Hankamer's keçi Turkish morphological analyzer (Hankamer, 1986). More ad hoc approaches include the work of Byrd and colleagues (Byrd et al., 1986) and Church (Church, 1986). In these two pieces of work, for example, affixation is modeled as stripping rules that relate a derived word to another word. For example, a rule that strips off -able and replaces it with -ate would relate evaluable to evaluate. These kinds of approaches are reminiscent of the early work on "suffix stripping" found in McIlroy's SPELL program (McIlroy, 1982) and the Porter stemmer (Porter, 1980).

1 Dale Gerdemann (p.c.) points us to an interesting example of local dependency in Kanuri (), where person and number are marked with either prefixes or suffixes, but not both: as he points out, "one way or another, a finite-state network must record the fact that you get a prefix just in case you don't get a suffix, and this involves doubling the size of the network."
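As a minimal illustration of the stripping-rule idea (our sketch, not code from any of the systems just cited), a rule that strips -able and replaces it with -ate can be stated in a few lines; systems like the Porter stemmer apply cascades of such rules, with conditions on what remains of the stem.

STRIP_RULES = [("able", "ate")]     # evaluable -> evaluate

def strip(word):
    for suffix, replacement in STRIP_RULES:
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word

print(strip("evaluable"))           # evaluate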
Then there are extensions to basic finite-state models. For example, both Bear (1986) and Ritchie et al. (1992) present systems that integrate finite-state morphology and phonology, with unification of morphosyntactic features. Rather than relying entirely upon finite-state morphotactics to restrict combinations of affixes, these systems control the combinatorics by insisting that the morphosyntactic feature matrices associated with the morphemes correctly unify. So, for example, one need not have separate continuation lexica (see below) expressing the different possible suffixes that might follow a noun, a verb, or an adjective stem. Rather one could have a single suffix sublexicon, with the stem-suffix combinations filtered by feature unification. And as we have already discussed, there has been a lot of work in the DATR framework that has allowed for sophisticated models of morphological inheritance that are nonetheless finite-state equivalent. One thing to bear in mind in relation to the discussion in Chapter 3 is that virtually all practical approaches to computational morphology have assumed a lexical–incremental approach. The formal equivalence of this and inferential–realizational approaches, at least at the computational level, has already been discussed and will not be repeated here. But even those proponents of inferential–realizational approaches who were not convinced by the preceding arguments should still find the ensuing discussion to be of interest. Koskenniemi’s Two-Level approach, in particular, has been used to develop wide-coverage morphological analyzers for a wide range of languages. If the ability to describe and implement a large number of linguistic phenomena is a valid metric of success, then one would have to agree that these approaches to morphology have been quite successful, even if one does not agree with their fundamental theoretical assumptions.
4.2 The KIMMO Two-Level Morphological Analyzer

The KIMMO two-level morphological analyzer has been described in a number of places including Antworth (1990); Sproat (1992); Karttunen and Beesley (2005). In the early descriptions, as in Koskenniemi's original presentation, the system is presented at the level of the machine rather than in terms of the algebraic operations that are involved in the system. But it is the algebraic operations that are really more important in understanding how the system works, and we will thus focus mostly on this aspect here.
Let us start with an illustration of the difference between a "machine-level" view and the "algebraic" view, a distinction that we have alluded to previously. Suppose we want to describe how we might check if a string is in a regular language, and therefore if it is accepted by the deterministic finite automaton (DFA) corresponding to the language. In that case we might describe how we would start the machine in its initial state and start reading from the beginning of the string. Each time we read a character from the string, we check the state we are in, see which arc is labeled with the character we are reading, and move into the state pointed to by that arc. We keep doing this until we get to the end of the string, or we fail because we get to a state where we are unable to read the current input character. (Since the machine is deterministic, we are guaranteed that in that case we cannot backtrack and try a different path.) If, once we have read all the characters in the string, we find ourselves in a final state of the automaton, then the string is in the regular language of the automaton; otherwise the string is not in the regular language of the automaton. This is a machine-level description since it describes the process in terms of the low-level operations of the machine.

An alternative description eschews the machine-level details and concentrates instead on what regular algebraic operations are involved in the process. In this case, the operation is simple. Represent the string as a trivial finite-state automaton with a single path. Then intersect the string machine with the language DFA. If the intersection is non-null (and in fact in this case it will be identical to the string machine) then the string is in the language of the DFA, otherwise it is not. The machine-level operations described in the previous paragraph still go on here: they are part of the implementation of the intersection operation. But the algebraic view allows one to ignore the algorithmic details and focus instead on the algebraic operation that is being performed.
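To make the contrast concrete, here is a minimal Python sketch of the machine-level view, using an invented DFA that accepts strings over {a, b} containing an even number of b's. On the algebraic view one would instead build a one-path automaton for the string and intersect it with the DFA; the loop below is exactly the kind of low-level procedure that implements that intersection.

# A toy DFA accepting strings over {a, b} with an even number of b's.
dfa = {
    "start": "even",
    "finals": {"even"},
    "delta": {("even", "a"): "even", ("even", "b"): "odd",
              ("odd", "a"): "odd", ("odd", "b"): "even"},
}

def accepts(dfa, string):
    state = dfa["start"]
    for ch in string:
        if (state, ch) not in dfa["delta"]:
            return False                    # no arc labeled with this symbol
        state = dfa["delta"][(state, ch)]
    return state in dfa["finals"]           # accept iff we end in a final state

print(accepts(dfa, "abba"))   # True: two b's
print(accepts(dfa, "ab"))     # False: one b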
With these issues in mind, let us turn to the core ideas in the KIMMO system.

4.2.1 KIMMO Basics

In Koskenniemi's (1983) presentation, there are three aspects of the system that are central. The first is the representation of dictionaries as tries, following Knuth (1973). The second is the representation of morphological concatenation via continuation lexica. If one is reading a word, and reaches a leaf node of a trie before one has completed
reading the word, one can look at the leaf node of the trie and see if there are pointers to other tries where one can continue the search; these other tries are the continuation lexica. Note that this is completely equivalent to splicing the continuation lexicon tries onto the leaf nodes in the first trie. Finally, there are the finite-state transducers that implement the surface-lexical phonological correspondences, processes that in Koskenniemi's original Finnish version included vowel harmony and consonant lenition. Koskenniemi somewhat quaintly describes the rules as serving the function of "slightly distorting lenses" through which one views the lexicon. As a practical matter – and this is the distinctive property that gives Koskenniemi's system the name of "Two-Level Morphology" (though more properly it is "Two-Level Phonology") – each transducer reads the lexical and surface "tapes" simultaneously. This requires that the lexical and surface strings be passed by each of the transducers, or in other words that the strings must be members of the regular relations implemented by each of the individual machines. Or in other words again, the transducers are logically intersected. As Koskenniemi notes, it is possible to replace all the individual machines with one "big machine," which is simply the intersection of all the individual rule machines. A schematic figure of the Koskenniemi system is given in Figure 4.1.

Figure 4.1 Schematic representation of Koskenniemi's two-level morphology system. The input "takes" is matched via a set of parallel FSTs to the lexicon, which is represented as a set of tries. A fragment of these tries is shown here, with the main lexicon containing the words "take", "toad", and "tomato", and continuation lexica containing the suffixes for noun plural and verbal third singular

Tries and continuation lexica can straightforwardly be replaced by finite-state automata, with which they are completely formally equivalent. One can think of a trie as simply a finite-state automaton where each path leads to a distinct final state. Continuation lexica are easily modeled by grafting a continuation lexicon onto the terminal node(s) that allow the continuation in
question. One could even model this using concatenation, but this is not in general correct since concatenation would allow every "leaf" node of a stem transducer to proceed to the continuation lexicon. However, this problem can easily be solved by "tagging" both the end of the stem and the beginning of a matching continuation lexicon with a distinctive tag, different for each continuation lexicon. One could then concatenate the lexica and filter illicit combinations of tags using a regular filter. Let B and A be, respectively, the stem lexicon and the affix set augmented as just described. Let T be a set of tags and τ an individual tag. Then the appropriate construction is:

∀τ ∈ T, (B · A) ∩ ¬(Σ* τ [T − τ] Σ*)      (4.1)
This just states that one can obtain the desired result by concatenating the continuation lexicon A with B and then filtering illegal combinations, namely, each τ followed by something that is in T but is not τ.

The situation with the transducers is more tricky, since Koskenniemi's system depends upon intersection, for which regular relations are not in general closed. However, all cases involving a non-regular result of intersection involve unbounded deletions or insertions, which Koskenniemi (obviously) never needs. It can be shown, in fact, that as long as one bounds the number of deletions or insertions by some finite k, then regular relations are closed under intersection. We turn to this point in the next subsection.
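As a toy rendering of the construction in (4.1), the following Python sketch works over finite sets of strings rather than automata; the stems, affixes, and tags are invented, and the regular filter is simulated by a simple substring check.

TAGS = ["<V>", "<N>"]
stems = {"walk<V>", "dog<N>"}                 # B: stems, each ending in its tag
affixes = {"<V>ing", "<V>ed", "<N>s", "<N>"}  # A: continuation lexica, each beginning with its tag

def legal(word):
    # The filter ¬(Σ* τ [T − τ] Σ*): no tag may be immediately followed by a different tag.
    for t1 in TAGS:
        for t2 in TAGS:
            if t1 != t2 and t1 + t2 in word:
                return False
    return True

words = sorted(s + a for s in stems for a in affixes if legal(s + a))
print(words)
# ['dog<N><N>', 'dog<N><N>s', 'walk<V><V>ed', 'walk<V><V>ing']
# (a final step could erase the doubled tags to yield dog, dogs, walked, walking)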
4.2.2 FST Intersection

Of course, regular relations are not in general closed under intersection. Kaplan and Kay (1994) demonstrate this using a simple counterexample. Consider two regular relations:

• R1 = {⟨aⁿ, bⁿc*⟩ | n ≥ 0}, which can be expressed as the regular expression (a:b)*(ε:c)*
• R2 = {⟨aⁿ, b*cⁿ⟩ | n ≥ 0}, which can be expressed as the regular expression (ε:b)*(a:c)*

The intersection of these two relations is:

R1 ∩ R2 = {⟨aⁿ, bⁿcⁿ⟩ | n ≥ 0}

This is clearly not regular, since the right-hand language bⁿcⁿ is well known not to be regular.
The construction above depends upon there being an unbounded difference in length between the input and output strings. Indeed, one can show that as long as one allows only a finite bound k on the difference between any pair of input and output strings in a relation, then regular relations are closed under intersection. More formally we define k-length-difference-bounded regular relations as follows:

Definition 6  A regular relation R is k-length-difference-bounded iff ∃k such that ∀(x, y) ∈ R, −k ≤ |x| − |y| ≤ k.

We then state the following theorem:

Theorem 4.1  k-length-difference-bounded regular relations are closed under intersection, for all k.

The proof depends upon the following lemma:

Lemma 4.1  Same-length regular relations are closed under intersection.
Lemma 4.1 is discussed in Kaplan and Kay (1994). We prove Theorem 4.1 by reduction of k-length-difference-bounded relations to same-length relations.

Proof of Theorem 4.1. Consider first the case where R1 = ⟨X1, Y1⟩ and R2 = ⟨X2, Y2⟩, two regular relations, are exact k-length-difference relations such that for any string x1 in the domain of R1, and every corresponding y1 = R1(x1) (a string in the image of x1 under R1), abs(|x1| − |y1|) = k; and similarly for R2. Without loss of generality, we will assume that the output of R1 is the shorter so that |x1| − |y1| = k; and similarly for R2. Now, consider for a moment the case of two regular languages L1 and L2 and their intersection L3 = L1 ∩ L2. Let Σ be the alphabet of L1 and L2, let σ be a symbol not in Σ, and σᵏ a k-length string of σ. Then clearly the following equality holds:

L3 · σᵏ = L1 · σᵏ ∩ L2 · σᵏ
That is, intersecting the result of concatenating L1 with σᵏ and L2 with σᵏ yields L3 = L1 ∩ L2 concatenated with σᵏ. Returning to R1 and R2, observe that Y1 and Y2 – the right-hand sides of these two relations – are regular languages and that by assumption any string in Y1 or Y2 is exactly k symbols shorter than the corresponding string in X1 or X2. Consider then the relation Δk, which simply adds a string σᵏ to the end of any input string. It is easy to show that Δk is a regular relation: it can be implemented by an FST that has self loops at the start state over all elements in Σ and then terminates with a sequence of arcs and states k long, mapping from ε to σ. Thus, R1′ = R1 ◦ Δk and R2′ = R2 ◦ Δk are regular relations since regular relations are closed under composition. Indeed, by construction R1′ and R2′ are same-length relations, and thus by Lemma 4.1 their intersection is also a regular relation. Furthermore, the right-hand side of R1′ ∩ R2′ is just Y1 · σᵏ ∩ Y2 · σᵏ = (Y1 ∩ Y2) · σᵏ, so R1′ ∩ R2′ = ⟨X1 ∩ X2, (Y1 ∩ Y2) · σᵏ⟩. Finally, note that Δk⁻¹, which removes a final string of σᵏ, is a regular relation (since regular relations are closed under inversion), so we have:

⟨X1 ∩ X2, (Y1 ∩ Y2) · σᵏ⟩ ◦ Δk⁻¹ = ⟨X1 ∩ X2, Y1 ∩ Y2⟩ = R1 ∩ R2

Thus R1 ∩ R2 must be a regular relation, again because regular relations are closed under composition. Thus exact k-length-difference regular relations are closed under intersection.

The final step is to extend this result to k-length-difference-bounded regular relations. For any such relation R we can partition the relation into a union of relations ∪_{i=0..k} Ri, where each Ri is an exact i-length-difference relation (where R0 is a same-length relation). To see this, observe that since R is k-length-difference-bounded, it must be implemented by a transducer T in which any cycles must be length preserving. Now consider any path p from the initial state q0 to a qf ∈ F (F the set of final states), where no state on the path is visited more than once – the path does not pass through a cycle. By assumption this path must implement an i-length-difference relation for 0 ≤ i ≤ k. Since all cycles must be length preserving, one can add to p any cycles that start and end at any qi ∈ p, and still have an i-length-difference
relation; call this resulting sub-transducer p̂. Clearly, for any i such that 0 ≤ i ≤ k, one can find the (possibly empty) set P of all p̂i such that p̂i is an i-length-difference relation. P, a sub-transducer of T, is itself a finite-state transducer and thus implements a regular relation. Thus each Ri is a regular relation, and more particularly an i-length-difference regular relation.

Now consider two k-length-difference-bounded relations R1 and R2. We divide each into R1i's and R2i's, respectively, as above. We have already shown that the intersection of each pair R1i and R2i is also a regular relation. Furthermore R1 ∩ R2 is just ∪_{i=0..k}(R1i ∩ R2i): this is because for R1i ∩ R2j to be non-null, it must always be the case that i = j, since otherwise a string pair ⟨x1, y1⟩ in R1i could never match any ⟨x2, y2⟩ in R2j, since on at least one side of the pairs, the string lengths would not match. Since regular relations are closed under union, we have completed the proof.

Note that the above construction does not work for the non-k-length-difference-bounded relations R1 = {⟨aⁿ, bⁿc*⟩} and R2 = {⟨aⁿ, b*cⁿ⟩} discussed in Kaplan and Kay (1994). Consider R1. For us to be able to use the construction, one would have to know how many σs to add to the a side in order to match the extra cs. This could be done by factoring R1 into a series of zero-length-difference-bounded (same-length), 1-length-difference-bounded, 2-length-difference-bounded relations and so forth as above. The problem is that the factorization is infinite and there is therefore no way to enumerate the cases. The same holds for R2.

Koskenniemi did not explicitly describe his system in terms of intersection, much less in terms of k-length-difference-bounded regular relations. But his algorithms were in effect computing intersections of multiple transducers that allowed both insertions and deletions, and in fact he had to take special care to ensure that the transducers did not get arbitrarily out of alignment. Indeed, rewrite rules involving deletions or insertions are not in general k-length-difference-bounded relations. But if this is the case, how can Koskenniemi's system actually be modeled as the intersection of a set of rule transducers, since this would appear to involve the intersection of a set of non-k-length-difference-bounded relations? One way around this is to observe that the intersection of a non-k-length-difference-bounded relation with a k-length-difference-bounded relation is a k-length-difference-bounded relation. Thus, so long as one can impose a hard upper bound on the length difference between the input and output, there is no problem implementing the system in terms of relation intersection.
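The padding step used in the proof of Theorem 4.1 can be made concrete with a small Python sketch over finite relations, viewing a same-length relation as a set of strings over symbol pairs. The relations below are invented, and for finite sets the intersection is of course trivial; the point is only to illustrate the bookkeeping of padding with a symbol outside the alphabet and stripping it off afterwards.

PAD = "#"          # stands in for the symbol σ, assumed not to be in the alphabet

def as_pair_string(x, y, k):
    y = y + PAD * k                 # pad the (shorter) output side to the same length
    assert len(x) == len(y)
    return tuple(zip(x, y))         # a string over the pair alphabet

def intersect_k_diff(rel1, rel2, k):
    lang1 = {as_pair_string(x, y, k) for x, y in rel1}
    lang2 = {as_pair_string(x, y, k) for x, y in rel2}
    common = lang1 & lang2          # same-length intersection (Lemma 4.1)
    return {("".join(a for a, b in p),
             "".join(b for a, b in p).rstrip(PAD)) for p in common}

R1 = {("ab", "a"), ("ba", "b"), ("aa", "a")}   # exact 1-length-difference relations (invented)
R2 = {("ab", "a"), ("aa", "b")}
print(intersect_k_diff(R1, R2, k=1))           # {('ab', 'a')}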
Note, finally, that Koskenniemi's system is not the only system that depends upon intersections of transducers. The decision-tree compilation algorithm reported in Sproat and Riley (1996) models decision trees (and more generally decision forests) as the intersection of a set of weighted rule transducers, each one representing a leaf node of the tree.

4.2.3 Koskenniemi's Rule Types

The other innovation of Koskenniemi's approach was his formalization of two-level rewrite rules; while he did not provide a compiler for these rules, the rules served to specify the semantics underlying the transducers that he built by hand. All rules in his system followed a template in that they were all of the following form:

CorrespondencePair  operator  LeftContext ___ RightContext
That is, the rules specified conditions for the occurrence of a correspondence pair – a pairing of a lexical and a surface symbol (one of which might be empty, modeling deletion or insertion) – in a given left or right context. The contexts could be regular expressions, but the correspondence pair was a single pair of symbols, and thus was not as general as the φ → ψ formulation from Kaplan and Kay (1994); see Section 4.2.4. Koskenniemi's rules came in four flavors, determined by the particular operator used. These were:

Exclusion rule            a:b /⇐ LC ___ RC
Context restriction rule  a:b ⇒  LC ___ RC
Surface coercion rule     a:b ⇐  LC ___ RC
Composite rule            a:b ⇔  LC ___ RC
The interpretation of these was as follows:

• Exclusion rule: a cannot be realized as b in the stated context.
• Context restriction rule: a can only be realized as b in the stated context (i.e., nowhere else).
• Surface coercion rule: a must be realized as b in the stated context.
• Composite rule: a is realized as b obligatorily and only in the stated context.

Note that in many ways Koskenniemi's formalism for rules was better defined than the ones that had previously been used in generative
phonology. For one thing, each rule type specified a direct relation between the underlying and surface forms, something that was not possible within generative phonology due to the arbitrary number of ordered rewrite rules: in general, in generative phonology there was no way to know how a given lexical form would surface, short of applying all rules in the specified order and seeing what the outcome was. Koskenniemi's rules, in contrast, specified the relation directly. Ignoring for the moment that traditional generative phonological rules were not two-level, one can ask which of Koskenniemi's rules correspond to the rule types (basically just obligatory or optional rewrite rules) of generative phonology. In fact only the surface coercion rule has a direct counterpart: it corresponds pretty directly to an obligatory rewrite rule. All the other two-level rule types depend upon global knowledge of the system. Thus the context restriction rule is equivalent to a situation in a traditional generative account where there is but one optional rule that changes a into b; but note that this is a property of the system, not of a specific rule. The composite rule, which is just a combination of context restriction and surface coercion, is similar, but in this case the unique rule changing a into b is obligatory. Note that since one could write, say, a context restriction rule that relates a to b in one environment, and then also write another context restriction rule that allows a to become b in another environment, it is perfectly possible in Koskenniemi's system to write an inconsistent grammar. A lot of the work in designing later two-level systems involved writing debuggers that would catch these kinds of conflicts. Finally, the exclusion rule is again global in nature: it is equivalent to the situation in a traditional generative grammar where there is no rule that relates a to b in the specified environment.

But really, Koskenniemi's rules can best be thought of as involving constraints on correspondence pairs. Constraints were virtually non-existent as a device in early generative phonology, but have since become quite popular in various theories of phonology including Declarative Phonology (Coleman, 1992), One-Level Phonology (Bird and Ellison, 1994) and Optimality Theory (Prince and Smolensky, 1993).
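The semantics of the four rule types can be illustrated with a small Python sketch. An analysis is represented as a sequence of (lexical, surface) symbol pairs and, as a simplification of real two-level rules, contexts are literal sequences of such pairs rather than regular expressions; the rules and forms are invented.

def in_context(analysis, i, left, right):
    """True if position i is immediately preceded by `left` and followed by `right`."""
    if i < len(left) or list(analysis[i - len(left):i]) != list(left):
        return False
    return list(analysis[i + 1:i + 1 + len(right)]) == list(right)

def check(analysis, rule):
    """Check one analysis against one rule; op is one of
    'exclude' (/<=), 'restrict' (=>), 'coerce' (<=), 'composite' (<=>)."""
    op, (lex, surf), left, right = rule
    for i, (l, s) in enumerate(analysis):
        ctx = in_context(analysis, i, left, right)
        if op == 'exclude' and (l, s) == (lex, surf) and ctx:
            return False        # a:b may not occur in this context
        if op in ('restrict', 'composite') and (l, s) == (lex, surf) and not ctx:
            return False        # a:b may occur only in this context
        if op in ('coerce', 'composite') and l == lex and s != surf and ctx:
            return False        # in this context, lexical a must surface as b
    return True

# Invented example: lexical N surfaces as m only, and obligatorily, before p:p.
restrict = ('restrict', ('N', 'm'), [], [('p', 'p')])
coerce   = ('coerce',   ('N', 'm'), [], [('p', 'p')])
good = [('k', 'k'), ('a', 'a'), ('N', 'm'), ('p', 'p')]
bad  = [('k', 'k'), ('a', 'a'), ('N', 'n'), ('p', 'p')]
print(check(good, restrict), check(good, coerce))   # True True
print(check(bad, restrict), check(bad, coerce))     # True False: coercion is violated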
4.2.4 Koskenniemi's System as a Historical Accident

Koskenniemi's development of Two-Level Morphology can be thought of as a fortuitous accident of history. It had been known since C. Douglas Johnson's PhD thesis (C. D. Johnson, 1972) that "context-sensitive" rewrite rules of the kind that had become familiar in generative phonology described regular relations and could thus be implemented using finite-state transducers (FSTs). By the late 1970s Ron Kaplan and Martin Kay at Xerox PARC were developing algorithms for the automatic compilation of FSTs from rewrite rules in a format that would be familiar to linguists, namely:

φ → ψ / λ ___ ρ      (4.2)
Here, φ, ψ, λ, and ρ could be arbitrary regular expressions. Furthermore, since regular relations are closed under composition, this meant that one could write a series of ordered rules of the kind found in SPE (Chomsky and Halle, 1968), compile each of the rules into a transducer and then compose the entire series of rules together to form a single transducer representing the entire rule system. Kaplan and Kay finally published their algorithm many years later (Kaplan and Kay, 1994), and there has been subsequent work on a simpler and more efficient algorithm (Mohri and Sproat, 1996). But in the late 1970s and early 1980s there was just one problem: computers were simply not fast enough, nor did they have enough memory, to compile rule systems of any serious complexity. Indeed, complex rule systems of several tens of rules over a reasonable-sized alphabet (say, 100 symbols) can easily produce FSTs with several hundred thousand states and a similar number of arcs, with a total memory footprint of several megabytes. While any PC today could easily handle this, this was simply not viable around 1980.²

So Koskenniemi proposed a compromise: avoid the space complexities of rule compilation and composition by instead hand-coding transducers and intersecting them rather than composing them. By hand-coding the transducers, and using various space-saving techniques (e.g., the use of "wildcard" symbols to represent arcs that would be traversed if no other arc matched the current symbol), Koskenniemi was able to build a very compact system. By carefully specifying an algorithm for traversing the arcs in the multiple machines – thus implementing the intersection algorithm "on the fly" – he was able to avoid doing full transducer intersection.

Koskenniemi's two-level morphology was remarkable in another way: in the early 1980s most computational linguistic systems were toys. This included parsers, which were usually fairly restricted in the kinds of sentences they could handle; dialog systems, which only worked in very limited domains; and models of language acquisition, which were only designed to learn simple grammatical constraints. In contrast, Koskenniemi's implementation of Finnish morphology was quite real in that it handled a large portion of inflected words that one found in real Finnish text. To some extent this reflects the fact that it is easier to get quite complete coverage of morphology in any language than it is to have similar coverage of syntax, let alone dialog. But it also reflects Koskenniemi's own decision to develop a full-fledged system, rather than present a mere "proof of concept" of his ideas.

While two-level morphology was originally motivated by the difficulties, at the time, with Kaplan and Kay's approach to cascaded rewrite rules, the model quickly took on a life of its own. Koskenniemi took it to be a substantive theoretical claim that only two levels of analysis were necessary, a claim that was fairly radical in its day (at least in contrast to generative phonology), but which has since been superseded by claims that only one level is needed (Bird and Ellison, 1994) or even that synchronic accounts in phonology are not needed since phonological systems are merely a residue of the process of historical change (Blevins, 2003). Nevertheless, practical considerations of developing morphological analyzers have led people to not rely wholly on the two-level assumption. Since transducers can be combined both by composition (under which they are always closed) and by intersection (under which they are closed under certain conditions), combinations of these two operations may be used in any given system; see, for example, Karttunen et al. (1992). Indeed, one of the beauties of finite-state techniques is that the calculus of the combination of regular languages and relations is expressive enough that one can develop modules of systems without regard to following any particular overall design: thus, for handling certain phenomena it may be more convenient to think in terms of a two-level system; for others, it may be easier to write cascaded rules. No matter: the two components can be combined as if one had built them both in one way or the other.

While Koskenniemi certainly did not invent finite-state approaches to morphology and phonology, he was the first to develop a system that worked fully using finite-state techniques, and he is thus to be given much credit for bringing the field of finite-state morphology to maturity and paving the way for the renaissance of finite-state approaches to language and speech that has developed over the past two decades.

² Recall Bill Gates' 1981 statement that "640k ought to be enough for anybody."
4.3 Summary

This chapter has given a very brief selective overview of the history of computational morphology. We have focused here on the dominant paradigm, namely finite-state approaches; a more complete and "eclectic" survey can be found in Sproat (1992). The dominant paradigm within the dominant paradigm has been two-level morphology, pioneered by Koskenniemi. While this approach was clearly an accident of history it has had an enormous influence on the field. While two-level rules are no longer strictly speaking necessary – compilation of large sets of cascaded rewrite rules is well within the capability of current computers – it is sometimes convenient to be able to describe phenomena in terms of reference to both surface and underlying levels. And in any case, since one ends up with a finite-state transducer that computes the desired relation, it is of little consequence how that transducer was constructed.

The latter point has been brought out especially nicely in Karttunen (1998). In that paper, Karttunen argues that Optimality Theory (Prince and Smolensky, 1993) can be implemented via a cascade of transducers that are combined using lenient composition. Lenient composition is defined in terms of priority union, which itself is defined as follows:

Definition 7  The priority union ∪_P of two regular relations R1 and R2 is defined as follows, where π₁ denotes projection onto the first dimension (domain) of the relation, and ¬ represents the complement of the language:

R1 ∪_P R2 ≡ R1 ∪ [¬π₁(R1) ◦ R2]

In other words, the priority union of R1 and R2 maps any string in the domain of R1 to its image under R1 and any strings not in the domain of R1 to their image under R2. Lenient composition is then defined as follows:

Definition 8  The lenient composition ◦_L of two regular relations R1 and R2 is defined as follows:

R1 ◦_L R2 ≡ [R1 ◦ R2] ∪_P R1

or in other words

R1 ◦_L R2 ≡ [R1 ◦ R2] ∪ [¬π₁(R1 ◦ R2) ◦ R1]
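A minimal Python sketch of Definitions 7 and 8 over finite relations, represented as sets of (input, output) string pairs, may be helpful; the candidate generator and constraint in the demonstration are invented stand-ins for the gen and FillOns relations discussed below, and a real implementation would of course operate on transducers rather than enumerated sets.

def compose(r1, r2):
    return {(x, z) for (x, y1) in r1 for (y2, z) in r2 if y1 == y2}

def domain(r):
    return {x for (x, _) in r}

def priority_union(r1, r2):
    # R1 ∪_P R2: R2 applies only to inputs outside the domain of R1 (Definition 7).
    return r1 | {(x, y) for (x, y) in r2 if x not in domain(r1)}

def lenient_compose(r1, r2):
    # R1 ◦_L R2 = (R1 ◦ R2) ∪_P R1 (Definition 8).
    return priority_union(compose(r1, r2), r1)

GEN = {("ak", ".ak."), ("ka", ".ka.")}      # toy candidate generator (invented)
FILL_ONS = {(".ka.", ".ka.")}               # toy constraint: passes only .ka.
print(sorted(lenient_compose(GEN, FILL_ONS)))
# [('ak', '.ak.'), ('ka', '.ka.')] -- 'ak' is "rescued", while 'ka' passes the constraint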
Thus, under the lenient composition of R1 and R2, strings that are in the domain of R1 ◦ R2 will undergo the composition of the two relations, but strings not in that domain will undergo only the relation R1.

For Optimality Theory, lenient composition can represent the combination of the "generator function" gen and some violable constraint C. Recall that the job of gen is to generate a candidate list of analyses given some input string. For example, the input might be a segment sequence ak, and gen will generate for that sequence a possibly infinite set of candidate syllabifications. The role of a constraint C is to rule out a particular configuration; for example, the sequence ak with a generated syllabification []ons [a]nucl [k]coda might violate a constraint FillOns which requires onsets to be non-empty.³ Let us say that C is the constraint FillOns, and for the sake of simplicity assume that this constraint is checked immediately on the output of gen. In that case, and assuming that all the outputs of gen for ak have an empty onset, then gen ◦ FillOns will produce no output at all for ak. In that case, lenient composition "rescues" ak by allowing all the outputs of gen. In another case, such as ka, for which gen might produce [k]ons [a]nucl []coda, the output of gen would pass FillOns and so would pass through the normal composition gen ◦ FillOns, filtering any of gen's outputs that happen to violate FillOns; but other constraints might be violated later on.

As Karttunen shows, a rank-ordered list of constraints can be modeled using lenient composition. In particular, if one leniently composes the constraints together in a cascade in the order in which the constraints are ranked – the constraint at the left-hand side of an OT tableau first, and so on – then it is guaranteed that the selected output will be the one that violates the lowest-ranked constraint among the violated constraints (or no constraint at all). This is because lenient composition will rescue an input to gen from a constraint C if and only if all analyses produced by gen for that input, and available as the input to C, violate C. So, for a sequence of ranked constraints C1, C2, . . . Cn, lenient composition will rescue analyses only for the first Ci for which no analyses whatever pass. All constraints prior to Ci must involve regular composition (the first half of the priority union), and therefore some analyses will potentially be filtered for any given input.⁴

³ As Karttunen shows, constraints such as FillOns can readily be implemented in terms of regular languages or relations.

⁴ The model just sketched does not deal with cases where one must distinguish differing numbers of violations of the same constraint. As Karttunen shows, this can be accomplished within the framework by implementing multiple violations of a single constraint as a single violation of a version of the constraint that "counts" an exact number of violations; and then cascading the set of such count-specific constraint versions via lenient composition. Of course, this approach cannot handle more than a bounded number of violations of any given constraint, though as Karttunen notes, this is unlikely to be a practical problem. However, as Dale Gerdemann points out, even with reasonable bounds, Karttunen's automata can grow large. More recent work by Gerdemann and van Noord (2000) addresses this problem by replacing Karttunen's counting method with a method based on exact matching. Their method, like Karttunen's, uses lenient composition.
So the OT finger of salvation will point at the forms that are rescued from Ci and survive the lenient composition of all subsequent constraints.

But as Karttunen points out, this raises an interesting question: sets of traditional generative rewrite rules can be modeled by a single FST constructed by composition of the individual rules. Depending upon whether the rules involved are obligatory or optional, and depending upon how they are written, one could achieve either a unique output or a lattice of possible outputs. In a similar vein, an Optimality Theory system can be implemented by a single FST constructed by lenient composition of gen and the individual constraints. Again the resulting FST may have a single optimal output or a set of optimal outputs. At a computational level, then, there is no difference between these two approaches, despite the fact that in the linguistics literature the two approaches are considered to be as different as night and day.

This is not to say that it may not be easier or more natural to describe certain phenomena in Optimality Theoretic terms than in traditional terms. But this is not always the case. A good example of where it is not is San Duanmu's (1997) treatment of cyclic compounds in Shanghai. In that article Duanmu presents a traditional cyclic analysis of stress in Shanghai noun compounds, an analysis that takes up about a page of text. He then spends the remaining ten or so pages of the article laying out a fairly complex Optimality Theory analysis. While for some this may count as an advance in our understanding of the linguistic phenomenon in question, it is far from obvious to us that it does. Given the reductionist argument that Karttunen presents, it is not clear, given the ultimate computational equivalence of the two approaches, why one has to shoehorn data into one theory or the other. In the previous chapter, we already applied the same reductionist reasoning to argue that views that are considered radically different in the literature on theoretical morphology are really notational variants of one another when looked at from a computational point of view.
5 Machine Learning of Morphology

5.1 Introduction

There is a disconnect between computational work on syntax and computational work on morphology. Traditionally, all work on computational syntax involved work on parsing based on hand-constructed rule sets. Then, in the early 1990s, the "paradigm" shifted to statistical parsing methods. While the rule formalisms – context-free rules, Tree-Adjoining grammars, unification-based formalisms, and dependency grammars – remained much the same, statistical information was added in the form of probabilities associated with rules or weights associated with features. In much of the work on statistical parsing, the rules and their probabilities were learned from treebanked corpora; more recently there has been work on inducing probabilistic grammars from unannotated text. We discuss this work elsewhere.

In contrast, equivalent statistical work on morphological analysis was, through much of this time period, almost entirely lacking (one exception being Heemskerk, 1993). That is, nobody started with a corpus of morphologically annotated words and attempted to induce a morphological analyzer of the complexity of a system such as Koskenniemi's (1983); indeed such corpora of fully morphologically decomposed words did not exist, at least not on the same scale as the Penn Treebank. Work on morphological induction that did exist was mostly limited to uncovering simple relations between words, such as the singular versus plural forms of nouns, or present and past tense forms of verbs. Part of the reason for this neglect is that hand-constructed morphological analyzers actually work fairly well. Unlike the domain of syntax, where broad coverage with a hand-built set of rules is very hard, it is possible to cover a significant portion of the morphology of even a morphologically complex language such
as Finnish, within the scope of a doctoral-dissertation-sized research project. Furthermore, syntax abounds in structural ambiguity, which can often only be resolved by appealing to probabilistic information – for example, the likelihood that a particular prepositional phrase is associated with a head verb versus the head of the nearest NP. There is ambiguity in morphology too, as we have noted elsewhere; for example, it is common for complex inflectional systems to display massive syncretism so that a given form can have many functions. But often this ambiguity is only resolvable by looking at the wider context in which the word form finds itself, and in such cases importing probabilities into the morphology to resolve the ambiguity would be pointless.

Recently, however, there has been an increased interest in statistical modeling both of morphology and of morphological induction, and in particular the unsupervised or lightly supervised induction of morphology from raw text corpora. One recent piece of work on statistical modeling of morphology is Hakkani-Tür et al. (2002), which presents an n-gram statistical morphological disambiguator for Turkish. Hakkani-Tür and colleagues break up morphologically complex words and treat each component as a separate tagged item, on a par with a word in a language like English. The tag sequences are then modeled with standard statistical language-modeling techniques; see Section 6.1. A related approach to tagging Korean morpheme sequences is presented in Lee et al. (2002). This paper presents a statistical language-modeling approach using syllable trigrams to calculate the probable tags for unknown morphemes within a Korean eojeol, a space-delimited orthographic word. For eojeol-internal tag sequences involving known morphemes, the model again uses a standard statistical language-modeling approach. With unknown morphemes, the system backs off to a syllable-based model, where the objective is to pick the tag that maximizes the tag-specific syllable n-gram model. This model presumes that syllable sequences are indicative of part-of-speech tags, which is statistically true in Korean: for example, the syllable conventionally transcribed as park is highly associated with personal names, since Park is one of the most common Korean family names.

Agglutinative languages such as Korean and Turkish are natural candidates for the kind of approach just described. In these kinds of languages, words can consist of often quite long morpheme sequences, where the sequences obey "word-syntactic" constraints, and each morpheme corresponds fairly robustly to a particular morphosyntactic
feature bundle, or tag. Such approaches are harder to use in more “inflectional” languages where multiple features tend to be bundled into single morphs. As a result, statistical n-gram language-modeling approaches to morphology have been mostly restricted to agglutinative languages. We turn now from supervised statistical language modeling to the problem of morphological induction, an issue that has received a lot of attention over the past few years. We should say at the outset that this work, while impressive, is nonetheless not at the stage where one can induce a morphological analyzer such as Koskenniemi’s system for Finnish. For the most part, the approaches that have been taken address the issue of finding simple relations between morphologically related words, involving one or two affixes. This is not by any means to trivialize the contributions, merely to put them in perspective relative to what people have traditionally done with hand-constructed systems. Automatic methods for the discovery of morphological alternations have received a great deal of attention over the last couple of decades, with particular attention being paid most recently to unsupervised methods. It is best to start out with a definition of what we mean by morphological learning since there are a couple of senses in which one might understand that term. The first sense is the discovery, from a corpus of data, that the word eat has alternative forms eats, ate, eaten and eating. Thus the goal is to find a set of morphologically related forms as evidenced in a particular corpus. In the second sense, one wants to learn, say, that the past tense of regular verbs in English involves the suffixation of -ed, and from that infer that a new verb, such as google, would (with appropriate spelling changes) be googled in the past tense. In this second sense, then, the goal is to infer a set of rules from which one could derive new morphological forms for words for which we have not previously seen those forms. Clearly the second sense is the stronger sense and more closely relates to what human language learners do. That is, while it is surely true that part of the process of learning the morphology of a language involves cataloging the different related forms of words, ultimately the learner has to discover how to generalize. For the present discussion we may remain agnostic as to whether the generalization involves learning rules or is done via some kind of analogical reasoning (cf. the classic debates between Rumelhart and McClelland (1986) and Pinker and Prince (1988)). Suffice it to say that such generalization must take place.
This stronger sense was the problem to which earlier supervised approaches to morphology addressed themselves. Thus the well-known system by Rumelhart and McClelland (1986) proposed a connectionist framework which, when presented with a set of paired present- and past-tense English verb forms, would generalize from those verb forms to verb forms that it had not seen before. Note that "generalize" here does not mean the same as "generalize correctly," and indeed there was much criticism of the Rumelhart and McClelland work, most notably by Pinker and Prince (1988) (and see also Sproat, 1992: Chapter 4). Other approaches to supervised learning of morphological generalizations include van den Bosch and Daelemans (1999) and Gaussier (1999).

Supervised approaches of course have the property that they assume that the learner is presented with a set of alternations that are known to be related to one another by some predefined set of morphological alternations. This begs the question of how the teacher comes by that set in the first place. It is to the latter question that the work on unsupervised learning of morphology addresses itself in that, as noted above, it aims to find the set of alternate forms as evidenced in a particular corpus. Indeed, while it is obviously an end goal of unsupervised approaches to not only learn the set of alternations that are found in a corpus, but also to generalize from such forms, the former is an interesting and hard problem in and of itself, and so some of the work in this area has tended to focus on this first piece. Note that it is reasonable to assume that once one has a list of alternation exemplars, one could apply a supervised technique to learn the appropriate generalizations; this is the approach taken, for example, in Yarowsky and Wicentowski (2001). Since most of the work in the past ten years has been on unsupervised approaches, we will focus in this discussion on these, and in particular three recent influential examples, namely Goldsmith (2001), Yarowsky and Wicentowski (2001) and Schone and Jurafsky (2001). Other recent work in this area includes Sharma et al. (2002); Snover et al. (2002); Baroni et al. (2002); Creutz and Lagus (2002); Wicentowski (2002); H. Johnson and Martin (2003).

5.2 Goldsmith, 2001

We start with a discussion of Goldsmith's minimum description length (MDL) approach – called Linguistica – to the learning of affixation alternations. While this is not the first piece of work in unsupervised
morphological acquisition, it is certainly the most cited, and since Goldsmith has made his system available on the Internet, it has come to represent a kind of standard against which other systems are compared. Goldsmith's system starts with an unannotated corpus of text of a language – the original paper demonstrated the application to English, French, Spanish, Italian, and Latin – and derives a set of signatures along with words that belong to those signatures. Signatures are simply sets of affixes that are used with a given set of stems. Thus, one signature in English is (using Goldsmith's notation) NULL.er.ing.s, which includes the stems blow, bomb, broadcast, drink, dwell, farm, feel, all of which take the suffixes ∅, -er, -ing and -s in the corpus that Goldsmith examines.

One is tempted to think of signatures as being equivalent to paradigms, but this is not quite correct, for two reasons. First of all, notice that NULL.er.ing.s contains not only the clearly inflectional affixes -ing and -s, but the (apparently) derivational affix -er. Whether or not one believes in a strict separation of derivational from inflectional morphology – cf. Beard (1995) – is beside the point here: most morphologists would consider endings such as -s and -ing as constituting part of the paradigm of regular (and most irregular) verbs in English, whereas -er would typically not be so considered. Second, notice also that the set is not complete: missing is the past tense affix, which does show up in other signatures, such as NULL.ed.er.ing.s – including such verbs as attack, back, demand, and flow. An examination of the verbs belonging to the NULL.er.ing.s signature will reveal why: many of them – blow, broadcast, drink, dwell, feel – are irregular in their past tense form and thus cannot take -ed. However, neither bomb nor farm is irregular, and the reason they show up in this signature class is presumably because they simply were never found in the -ed form in the corpus. A system would have to do more work to figure out that bomb and farm should perhaps be put in the NULL.ed.er.ing.s class, but that the other verbs in NULL.er.ing.s should be put in various different paradigms. Goldsmith briefly discusses the more general problem of going from signatures to paradigms, but note in any case that his system is not capable of handling some of the required alternations, such as blow/blew, since his methods only handle affixation – and are tuned in particular to suffixation.

The system for deriving the signatures from the unannotated corpus involves two steps. The first step derives candidate signatures and signature-class membership, and the second evaluates the candidates. We turn next to a description of each of these.
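The notion of a signature is easy to render as a toy Python sketch: given a hypothesized segmentation of corpus words into stems and suffixes, group stems by the alphabetized set of suffixes they occur with. The splits below are invented; Goldsmith's further steps (such as discarding signatures with a single stem) are omitted.

from collections import defaultdict

splits = [("blow", ""), ("blow", "er"), ("blow", "ing"), ("blow", "s"),
          ("attack", ""), ("attack", "ed"), ("attack", "er"),
          ("attack", "ing"), ("attack", "s")]

suffixes_of = defaultdict(set)
for stem, suffix in splits:
    suffixes_of[stem].add(suffix if suffix else "NULL")

signatures = defaultdict(set)
for stem, sufs in suffixes_of.items():
    signatures[".".join(sorted(sufs))].add(stem)     # Goldsmith's alphabetized notation

for sig, stems in signatures.items():
    print(sig, sorted(stems))
# NULL.er.ing.s ['blow']
# NULL.ed.er.ing.s ['attack']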
5.2.1 Candidate Generation

The generation of candidates requires first and foremost a reasonable method for splitting words into potential morphemes. Goldsmith presents a couple of approaches to this, one which he terms the take-all-splits heuristic and the second which is based on what he terms weighted mutual information. Since the second converges more rapidly on a reasonable set, we will limit our discussion to this case. The method first starts by generating a list of potential affixes. Starting at the right edge of each word in the corpus, which has been padded with an end-of-word marker "#", collect the set of possible suffixes up to length six (the maximum length of any suffixes in the set of languages that Goldsmith was considering), and then for each of these suffixes, compute the following metric, where Nk here is the total number of k-grams:

(freq(n1, n2 . . . nk) / Nk) · log( freq(n1, n2 . . . nk) / Π_{i=1}^{k} freq(ni) )      (5.1)
The first 100 top-ranking candidates are then chosen, and words in the corpus are segmented according to these candidates, where that is possible, choosing the best parse for each word according to the following metric, also used in the take-all-splits approach, which assigns a probability to the analysis of a word w into a stem w1,i and a suffix wi+1,l, where l is the length of the word:

P(w = w1,i + wi+1,l) = e^(−H(w1,i, wi+1,l)) / Σ_{j=1}^{l−1} e^(−H(w1,j, wj+1,l))      (5.2)

where

H(w1,i, wi+1,l) = −( i · log(freq(stem = w1,i)) + (l − i) · log(freq(suffix = wi+1,l)) )      (5.3)
Finally, suffixes that are not optimal for at least one word are discarded. The result of this processing yields a set of stems and associated suffixes including the null suffix. The alphabetized list of suffixes associated with each stem constitutes the signature for that stem. Simple heuristic weeding is possible at this point: remove all signatures associated with but one stem and all signatures involving one suffix. Goldsmith terms the remaining signatures regular signatures, and these constitute the set of suffixes associated with at least two stems.
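The split-scoring step can be illustrated with a small Python sketch of (5.2) and (5.3); the stem and suffix frequency tables are invented, and unseen pieces are simply given a count of one.

import math

stem_freq = {"walk": 40, "walki": 2, "wal": 3}       # invented counts
suffix_freq = {"ing": 500, "ng": 5, "king": 2}       # invented counts

def H(stem, suffix):
    # Equation (5.3): length-weighted log frequencies of the two pieces.
    return -(len(stem) * math.log(stem_freq.get(stem, 1))
             + len(suffix) * math.log(suffix_freq.get(suffix, 1)))

def split_probs(word):
    # Equation (5.2): a distribution over all stem+suffix splits of the word.
    splits = [(word[:i], word[i:]) for i in range(1, len(word))]
    z = sum(math.exp(-H(s, f)) for s, f in splits)
    return {s + "+" + f: math.exp(-H(s, f)) / z for s, f in splits}

probs = split_probs("walking")
print(max(probs, key=probs.get))   # 'walk+ing': the split with the most frequent pieces wins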
5.2.2 Candidate Evaluation

The set of signatures and associated stems constitutes a proposal for the morphology of the language in that it provides suggestions on how to decompose words into a stem plus suffix(es). But the proposal needs to be evaluated: in particular, not all the suggested morphological decompositions are useful, and a metric is needed to evaluate the utility of each proposed analysis. Goldsmith proposes an evaluation metric based on minimum description length. The best proposal will be the one that allows for the most compact description of the corpus (in terms of the morphological decomposition) and the morphology itself. This is, of course, a standard measure in text compression: a good compression algorithm is one that minimizes the size of the compressed text plus the size of the model that is used to encode and decode that text. The compressed length of the model – the morphology – is given by:

λ_T + λ_F + λ_Σ      (5.4)

Here, λ_T represents the length (in bits) of a list of pointers to the T stems, where T is the set of stems, and the ⟨ ⟩ notation represents the cardinality of that set. λ_F and λ_Σ represent the equivalent pointer-list lengths for suffixes and signatures, respectively. The first of these, λ_T, is given by:

λ_T = Σ_{t∈T} ( log(26) · length(t) + log([W]/[t]) )      (5.5)

T is the set of stems, [W] is the number of word tokens in the corpus and [t] is the number of tokens of the particular stem t: here and elsewhere [X] denotes the number of tokens of X. (The log(26) term assumes an alphabet of twenty-six letters.) The term λ_F (where F is Goldsmith's notation for suffixes) is:

λ_F = Σ_{f∈suffixes} ( log(26) · length(f) + log([W_A]/[f]) )      (5.6)

Here, [W_A] is the number of tokens of morphologically analyzed words, and [f] is the number of tokens of the suffix f. The signature component's contribution can be defined, for the whole signature component, as:

λ_Σ = Σ_{σ∈Σ} log([W]/[σ])      (5.7)

where Σ is the set of signatures.
Table 5.1 Top ten signatures for English from Goldsmith (2001), with a sample of relevant stems

 1. NULL.ed.ing.s       accent, afford, attempt
 2. 's.NULL.s           adolescent, amendment, association
 3. NULL.ed.er.ing.s    attack, charm, flow
 4. NULL.s              aberration, abstractionist, accommodation
 5. e.ed.es.ing         achiev, compris, describ
 6. e.ed.er.es.ing      advertis, enforc, pac
 7. NULL.ed.ing         applaud, bloom, cater
 8. NULL.er.ing.s       blow, drink, feel
 9. NULL.d.s            abbreviate, balance, costume
10. NULL.ed.s           acclaim, bogey, burden
Finally the compressed length of the corpus in terms of the morphological model is given by:

Σ_{w∈W} [w] ( log([W]/[σ(w)]) + log([σ(w)]/[stem(w)]) + log([σ(w)]/[suffix(w) ∈ σ(w)]) )      (5.8)

or in other words, assuming the maximum likelihood estimate for probabilities:

Σ_{w∈W} [w] ( −log(P(σ(w))) − log(P(stem(w)|σ(w))) − log(P(suffix(w)|σ(w))) )      (5.9)
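A toy Python computation of the description length, in the spirit of (5.4)–(5.9), may make the evaluation metric more concrete. The corpus counts and the analysis are invented, and the counting conventions are simplified: every word is analyzed (so W_A = W), the NULL suffix is treated as having length zero, and conditional counts are taken within each signature.

import math

corpus = {"jump": 5, "jumps": 3, "jumped": 2, "walk": 4, "walks": 1}
analysis = {                                  # word -> (stem, suffix, signature)
    "jump": ("jump", "NULL", "NULL.ed.s"),
    "jumps": ("jump", "s", "NULL.ed.s"),
    "jumped": ("jump", "ed", "NULL.ed.s"),
    "walk": ("walk", "NULL", "NULL.s"),
    "walks": ("walk", "s", "NULL.s"),
}
W = sum(corpus.values())
IDX = {"stem": 0, "suffix": 1, "sig": 2}

def toks(field, value, sig=None):
    """Token count of words whose analysis has the given stem/suffix/signature."""
    return sum(n for w, n in corpus.items()
               if analysis[w][IDX[field]] == value
               and (sig is None or analysis[w][2] == sig))

stems = {a[0] for a in analysis.values()}
suffixes = {a[1] for a in analysis.values()}
sigs = {a[2] for a in analysis.values()}

model_len = (
    sum(math.log2(26) * len(t) + math.log2(W / toks("stem", t)) for t in stems)        # (5.5)
    + sum(math.log2(26) * (0 if f == "NULL" else len(f))
          + math.log2(W / toks("suffix", f)) for f in suffixes)                        # (5.6)
    + sum(math.log2(W / toks("sig", g)) for g in sigs))                                # (5.7)

corpus_len = sum(
    n * (-math.log2(toks("sig", g) / W)                                                # (5.9)
         - math.log2(toks("stem", st, g) / toks("sig", g))
         - math.log2(toks("suffix", sf, g) / toks("sig", g)))
    for w, n in corpus.items() for st, sf, g in [analysis[w]])

print(model_len + corpus_len, "bits")

Competing analyses of the same corpus can then be compared by their total description lengths, with the smaller total preferred.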
As noted above, Goldsmith tested his method on corpora from English, French, Italian, Spanish, and Latin. For each of these languages, he lists the top ten signatures. For English these are reproduced here in Table 5.1, along with associated stems. Goldsmith also evaluated the results for English and French. Having no gold standard against which to compare, he evaluated the results subjectively, classifying the analyses into the categories good, wrong analysis, failed to analyze, spurious analysis. Results for English, for 1,000 words, were 82.9% in the good category, with 5.2% wrong, 3.6% failure, and 8.3% spurious. Results for French were roughly comparable. As we noted in the introduction to this section, Goldsmith’s work stands out in that it has become the de facto gold standard for subsequent work on unsupervised acquisition of morphology, partly because he has made his system freely available, but partly also because he has provided a simple yet powerful metric for comparing morphological analyses. Goldsmith’s method, of course, makes no explicit use
of semantic or syntactic information. An obvious objection to his approach as a model of what human learners do is that children clearly have access to other information besides the set of words in a corpus of (spoken) language: they know something about the syntactic environment in which these words occur and they know something about the intended semantics of the words. But Goldsmith argues that this is a red herring. He states (page 190):

    Knowledge of semantics and even grammar is unlikely to make the problem of morphology discovery significantly easier. In surveying the various approaches to the problem that I have explored (only the best of which have been described here), I do not know of any problem (of those which the present algorithm deals with successfully) that would have been solved by having direct access to either syntax or semantics.
But while no one can yet claim to have a model that well models the kind of syntactic or semantic information that a child is likely to have access to, it has been demonstrated in subsequent work that having a system that models syntactic and semantic information does indeed help with the acquisition of morphology. We turn now to a discussion of some of this subsequent work.

5.3 Schone and Jurafsky, 2001

Schone and Jurafsky's approach, a development of their earlier work reported in Schone and Jurafsky (2000), uses semantic, orthographic, and syntactic information derived from unannotated corpora to arrive at an analysis of inflectional morphology. The system is evaluated for English, Dutch, and German using the CELEX corpus (Baayen et al., 1996). Schone and Jurafsky note problems with approaches such as Goldsmith's that rely solely on orthographic (or phonological) features. For example, without semantic information, it would be hard to tell that ally should not be analyzed as all+y, and since Goldsmith's approach does not attempt to induce spelling changes, it would be hard to tell that hated is not hat+ed. On the other hand, semantics by itself is not enough. Morphological derivatives may be semantically distant from their bases – consider reusability versus use – so that it can be hard to use contextual information as evidence for a morphological relationship. Furthermore, contextual information that would allow one to derive semantics can be weak for common function words, so there is effectively no information that would lead one to prevent as being derived from a+s.
Previous approaches also tend to limit the kinds of morphological alternations they handle. As we have seen, Goldsmith's method was developed with suffixational morphology in mind (though it could certainly be extended to cover other things). Schone and Jurafsky's system at least extends coverage to prefixes and circumfixes. Their system, then, according to their own summary (page 2):
• considers circumfixes
• automatically identifies capitalizations by treating them similarly to prefixes
• incorporates frequency information
• uses distributional information to help identify syntactic properties, and
• uses transitive closure to help find variants that may not have been found to be semantically related but which are related to mutual variants.

Schone and Jurafsky use the term circumfix somewhat loosely to denote apparently true circumfixes such as the German past participle circumfix ge—t, as well as combinations of prefixes and suffixes more generally. Their method for finding prefixes, suffixes and circumfixes is as follows:

1. Strip off prefixes that are more common than some predetermined threshold.¹
2. Take the original lexicon, plus the potential stems generated in the previous step, and build a trie out of them.
3. As in Schone and Jurafsky (2000), posit potential suffixes wherever there is a branch in the trie, where a branch is a subtrie of a node where splitting occurs. See Figure 5.1.
4. Armed with a set of potential suffixes, one can obtain potential prefixes by starting with the original lexicon, stripping the potential suffixes, reversing the words, building a trie out of the reversed words, and finding potential suffixes of these reversed strings, which will be a set of potential prefixes in reverse.
5. Identify candidate circumfixes, defined as prefix-suffix combinations that are attached to some minimum number of stems that are also shared by other potential circumfixes. The stems here are actually called pseudostems since, of course, they may not actually correspond to morphological stems. Note that since NULL will in general be among the potential prefixes and suffixes found in previous stages, pure suffixes or prefixes will simply be circumfixes where one of the components is NULL.
¹ As Schone and Jurafsky note, if you do not do this initial step some potential circumfixes may be missed.
Figure 5.1 A sample trie showing branches for potential suffixes NULL (empty circle), -s and -ed: from Schone and Jurafsky (2001, Figure 2). Used with permission of the authors and the Association for Computational Linguistics
Schone and Jurafsky also define a rule to be a pair of candidate circumfixes sharing some minimum number of pseudostems. The output of these steps for English, German, and Dutch, with particular settings for the minima mentioned above, produces a large set of rules (about 30,000 for English) of which some are reasonable (e.g., -s ⇒ ∅, -ed ⇒ -ing) but many of which are not (e.g., s- ⇒ ∅, as induced from such seeming alternations as stick/tick or spark/park).

To compute semantic information, Schone and Jurafsky use a version of Latent Semantic Analysis introduced by Schütze (1998), that uses an N × 2N term–term matrix. N represents the N − 1 most frequent words, plus a "glob" in the Nth position corresponding to all other words. For each row, the first N columns represent words that occur in a window to the left of the row's word, and the last N columns words that occur within a window to its right. Singular value decomposition is then performed on a normalized version of the N × 2N matrix M, to produce the product of two orthogonal matrices U and V^T and the diagonal matrix of squared eigenvalues (singular values) D. The diagonal squared eigenvalue entries in D are ordered so that the first k of them account for the first k most significant dimensions of the |M|-dimensional space.

M = U D V^T      (5.10)
Each word w is assigned the semantic vector w = U_w D_k, where U_w is the row of U corresponding to w and D_k are the first k singular values from Equation 5.10. It is then necessary to compute a similarity between pairs of words, and this is accomplished using normalized cosine scores (NCS). For each word w_k, k ∈ (1, 2), vectors for 200 other words are randomly selected, and the means (μ_k) and variances (σ_k) of the cosine values between w_k and each of the 200 other words are computed. The cosine of w_1 and w_2 is then normalized, and the NCS is computed as follows:

NCS(w_1, w_2) = \min_{k ∈ (1,2)} \frac{\cos(w_1, w_2) − μ_k}{σ_k}    (5.11)
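To make the normalization in Equation 5.11 concrete, here is a minimal Python sketch of the NCS computation. It assumes that the LSA vectors U_w D_k have already been computed and are stored in a dictionary called `vectors`; the function names, the fixed random seed, and the use of numpy are illustrative choices, not part of Schone and Jurafsky's implementation.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two semantic vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def ncs(w1, w2, vectors, n_random=200, rng=None):
    """Normalized cosine score (Equation 5.11) between words w1 and w2.

    `vectors` maps each word to its LSA vector U_w D_k.  For each of the two
    words, cosines against a random sample of other words give a mean and
    standard deviation used to normalize cos(w1, w2); the minimum of the two
    normalized values is returned."""
    rng = rng or np.random.default_rng(0)
    vocab = list(vectors)
    scores = []
    for wk in (w1, w2):
        others = rng.choice([v for v in vocab if v != wk], size=n_random)
        cosines = [cosine(vectors[wk], vectors[o]) for o in others]
        mu, sigma = np.mean(cosines), np.std(cosines)
        scores.append((cosine(vectors[w1], vectors[w2]) - mu) / sigma)
    return min(scores)
```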
Assuming random NCS are normally distributed as N(0, 1), and similarly the distribution of means and variances of true correlations as N(μ_T, σ_T²), we can define a function that gives the area under the curve of the distribution from NCS to infinity:

\overline{NCS}(μ, σ) = \int_{NCS}^{∞} e^{−(\frac{x−μ}{σ})^2} dx    (5.12)
The probability that an NCS is non-random is then given by:

P(NCS) = \frac{n_T \overline{NCS}(μ_T, σ_T)}{(n_R − n_T) \overline{NCS}(0, 1) + n_T \overline{NCS}(μ_T, σ_T)}    (5.13)
where n_T is the number of terms in the distribution of true correlations. Finally, the probability that w_1 and w_2 are related is given by:

P_sem(w_1 ⇒ w_2) = P(NCS(w_1, w_2))    (5.14)
Schone and Jurafsky assume a threshold for P_sem (0.85 in their implementation) above which a potential relationship is considered valid. The semantic probability is supplemented with an "orthographic" probability defined as follows:

P_orth = \frac{2α f(C_1 ⇒ C_2)}{\max_{∀Z} f(C_1 ⇒ Z) + \max_{∀W} f(W ⇒ C_2)}    (5.15)
Here f(A ⇒ B) is the frequency of the alternation involving circumfix A and circumfix B, and α is a weight between 0 and 1. Thus the probability of an alternation is the weighted ratio of the count of the alternation over the sum of the maximum count of an alternation between the left circumfix and any circumfix, and the maximum count of an alternation between any circumfix and the right circumfix. The orthographic probability is combined with the semantic probability as follows (P_s-o denotes the combined semantic–orthographic probability):
Figure 5.2 Semantic transitive closure of PPMVs, from Schone and Jurafsky (2001, Figure 4). Used with permission of the authors and the Association for Computational Linguistics
P_s-o(valid) = P_sem + P_orth − (P_sem P_orth)    (5.16)
The similarities of local syntactic context between pairs of an alternation are also computed. For words that are members of each side of an alternation – the pair of potential morphological variants or PPMV – one first computes contextual words that are either far more frequent or far less frequent than expected by chance; these sets are the signatures for the words. Signatures are also collected for randomly chosen words from the corpus, and for unvalidated PPMVs. Lastly, one computes the NCS and the probabilities between the signatures of the ruleset and the unvalidated PPMVs. (The probability of the syntactic correspondence is computed as in Equation 5.13.) The probability of a correspondence being valid is given as a combination of the orthographic/semantic probability P_s-o and the syntactic environment-based probability P_syntax:

P(valid) = P_s-o + P_syntax − (P_s-o P_syntax)    (5.17)
The above method, while it captures a lot of valid PPMVs, still fails to capture some that are valid because, for example, they are not sufficiently represented in the corpus. However, many of these can be reconstructed by following the transitive closure of PPMVs. This is exemplified in Figure 5.2. The probability of an alternation given a single path is modeled as the product of the probabilities of the individual links, along with a decay factor. For multiple paths between two nodes the probabilities of the paths are summed.
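The following sketch illustrates one way this transitive-closure computation could be carried out over a graph of PPMV link probabilities. The decay constant, the depth bound, and the cap at 1.0 are illustrative assumptions, not values specified by Schone and Jurafsky.

```python
def path_probability(graph, source, target, decay=0.5, max_depth=3):
    """Probability that `source` and `target` are morphological variants,
    estimated over the transitive closure of PPMV links.

    `graph[u][v]` holds the probability of the link u -> v.  Each path
    contributes the product of its link probabilities, with a decay factor
    for every hop after the first; contributions of distinct paths are
    summed and capped at 1.0 (the cap is an illustrative simplification)."""
    def walk(node, prob, depth, visited):
        if node == target:
            return prob
        if depth == max_depth:
            return 0.0
        total = 0.0
        for nxt, link_p in graph.get(node, {}).items():
            if nxt not in visited:
                factor = decay if depth > 0 else 1.0
                total += walk(nxt, prob * link_p * factor, depth + 1, visited | {nxt})
        return total
    return min(1.0, walk(source, 1.0, 0, {source}))

# Toy graph loosely modeled on Figure 5.2 (link probabilities are made up):
graph = {"abuse": {"abusive": 0.6, "abused": 1.0},
         "abusive": {"abusing": 0.98},
         "abused": {"abusing": 0.79}}
print(path_probability("abuse", "abusing"))
```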
Table 5.2 Comparison of the F-scores for suffixing for the full Schone and Jurafsky method with Goldsmith's algorithm for English, Dutch, and German

                        English   German   Dutch
Linguistica               81.8      84.0    75.8
Schone & Jurafsky         88.1      92.3    85.8
The output of Schone and Jurafsky's method is a set of what they term conflation sets – a directed graph linking words to their relatives. The conflation sets can then be compared to the evaluation corpus – CELEX – by computing the correct, deleted, and inserted members of the found set against the true set. These scores can then be converted to F-scores. F-scores for Goldsmith's method, in comparison with Schone and Jurafsky's full method, are shown in Table 5.2, evaluating for suffixes only since Goldsmith's system only handles suffixes. (Schone and Jurafsky also evaluate their own system for circumfixes.)

5.4 Yarowsky and Wicentowski, 2001

Yarowsky and Wicentowski (2001) present a lightly supervised method for inducing analyzers of inflectional morphology. The method consists of two basic steps. First, a table of alignments between root and inflected forms is estimated from data. For example, the table might contain the pair take/took, indicating that took is a particular inflected form of take. Second, a supervised morphological analysis learner is trained on a weighted subset of the table. In considering possible alignment pairs, Yarowsky and Wicentowski concentrate on morphology that is expressed by suffixes or by changes in the root (as in take/took). The method depends upon the following resources:
• A table of inflectional categories for the language; such a table for English would indicate that verbs have past tense forms. Along with these inflectional categories, a list of canonical suffixal exponents of these categories is needed. For past tense verbs in English, for example, canonical suffixes would be -ed, -t and -∅, the latter being found in cases like took.
• A large corpus of text.
• A list of candidate noun, verb, and adjective roots, which can be obtained from a dictionary, plus a rough method for guessing parts of speech of words in the corpus.
• A list of consonants and vowels for the language.
• A list of common function words (optional).

The suffixes are useful for providing plausible alignments for many pairs of root and inflected form. For example, the existence of announce and announced in a corpus, along with the knowledge that -ed is a possible past-tense suffix, is sufficient to propose that the two words in question should be paired with announce being the root and announced the inflected past form. But in many cases the suffixes are not sufficient. For example, we would like to pair the root sing with sang, not with singed. A key to deciding such cases is the observation that there is a substantial difference in frequency between, on the one hand, sang/sing (ratio: 1.19/1 in Yarowsky and Wicentowski's corpus), and on the other singed/sing (0.007/1). However, this is not quite enough, since a priori we have no way of deciding which of the two ratios is a more reasonable ratio for a past tense/present tense pair: some inflectional categories, for example, may be systematically far less frequent than their roots. Yarowsky and Wicentowski address this issue by computing the distribution of ratios over the corpus for word pairs from a given inflected/root class. Figure 5.3 shows the distribution of log(VBD/VB), where VBD and VB are the Penn Treebank tags for, respectively, past tense verbs and verb roots.

One problem is that the computation of the log(VBD/VB) distribution depends upon knowing the correct inflected/non-inflected pairs, which by assumption we do not know. Yarowsky and Wicentowski get around this problem by observing that the distribution is similar for regular and irregular verbs in English. They then estimate the distribution from regular verb pairs, which can be detected to a first approximation by simply looking for pairs that involve stripping the regular affix (in this case -ed). Initially the set of such pairs will be noisy; it will contain bogus pairings like singed/sing above. However, the estimate can be iteratively improved as improved alignments between inflected forms and their roots are computed. One is also not limited to considering the distribution of just one type of pair, such as VBD/VB. So, for example, VBG/VBD (where VBG is the Penn Treebank tag for the gerund/participle ending in -ing) is also useful evidence for a putative VBD form.

A second form of evidence for the relatedness of forms is contextual evidence of the kind also used by Schone and Jurafsky (2001). Cosine similarity measures between weighted contextual vectors give a good
Figure 5.3 The log(VBD/VB) estimator from Yarowsky and Wicentowski (2001, Figure 1), smoothed and normalized to yield an approximation to the probability density function for the VBD/VB ratio. Correct pairings took/take, sang/sing, and singed/singe are close to the mean of the distribution. Incorrect pairs singed/sing, sang/singe, and taked/take are far out in the tails of the distribution; note that taked occurred once in the corpus, presumably as a typo. Used with permission of the authors and the Association for Computational Linguistics
indication that two forms are related, since even though semantically related words such as sip and drink will also show similar contexts, such words are still more dissimilar than are morphologically related forms.

Also, as with Schone and Jurafsky (2001), Yarowsky and Wicentowski use orthographic similarity between the putative root and inflected form, using a weighted Levenshtein distance. Costs for character or character-sequence substitutions are initialized to a predefined set of values. Thus, for example, consonant–consonant substitutions are assigned an initial cost of 1.0, whereas single vowel–vowel substitutions are assigned a cost of 0.5. However, these costs are re-estimated as the system converges to a more accurate set of root–inflected form correspondences.

Yarowsky and Wicentowski are not only interested in producing a table of accurate correspondences for particular roots and their inflected forms but are also interested in producing a morphological analyzer that can extend to new cases. The task here is to find the probability of a particular stem change, given a root, a suffix, and a part-of-speech tag:

P(stemchange | root, suffix, POS)    (5.18)
This is estimated by an interpolated backoff model (see Section 6.1) that is the weighted sum of the probability of the stem change given the last three characters of the root, the suffix, and the POS; the last two characters of the root, the suffix, and the POS; and so on down to just the suffix and the POS; and finally the context-independent stem change α → β:

P(change | root, suf, POS) = P(α → β | root, suf, POS)
  ≈ λ_1 P(α → β | last3(root), suf, POS)
  + (1 − λ_1)(λ_2 P(α → β | last2(root), suf, POS)
  + (1 − λ_2)(λ_3 P(α → β | last1(root), suf, POS)
  + (1 − λ_3)(λ_4 P(α → β | suf, POS)
  + (1 − λ_4) P(α → β))))    (5.19)
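A minimal sketch of the interpolated backoff estimate of Equation 5.19, assuming that the five component conditional distributions have already been estimated and are supplied as dictionaries; the data structures and parameter names here are hypothetical, not Yarowsky and Wicentowski's.

```python
def p_stem_change(change, root, suffix, pos, cond_models, lambdas):
    """Interpolated backoff estimate of P(alpha -> beta | root, suffix, POS)
    in the spirit of Equation 5.19.

    `cond_models` is assumed to be a list of five conditional distributions,
    from most to least specific context:
      P(change | last3(root), suffix, POS), P(change | last2(root), suffix, POS),
      P(change | last1(root), suffix, POS), P(change | suffix, POS), P(change).
    Each maps a context tuple to {change: probability}; `lambdas` holds the
    four mixing weights lambda_1 .. lambda_4."""
    contexts = [(root[-3:], suffix, pos), (root[-2:], suffix, pos),
                (root[-1:], suffix, pos), (suffix, pos), ()]
    estimate = cond_models[-1].get((), {}).get(change, 0.0)   # P(change)
    # Fold in the more specific contexts from least to most specific,
    # mirroring the nested (1 - lambda) structure of Equation 5.19.
    for model, context, lam in reversed(list(zip(cond_models[:-1], contexts[:-1], lambdas))):
        p_context = model.get(context, {}).get(change, 0.0)
        estimate = lam * p_context + (1.0 - lam) * estimate
    return estimate
```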
The λs are re-estimated on each iteration of the algorithm. In addition to the statistical methods outlined above, Yarowsky and Wicentowski impose a constraint, which they call the pigeonhole principle, that requires that there be only one form for a given inflection for a particular word. Note that this is what would more normally be called morphological blocking (Aronoff, 1976). Exceptions to this principle such as dreamed/dreamt are considered to be sufficiently rare that their existence does not detract significantly from the performance of the model. The pigeonhole principle is used to greedily select the first candidate inflected form for a given position in the paradigm from a rank-ordered list of putative forms.

Table 5.3 shows the performance of the algorithm on four classes of verbs for the first iteration of each of the measures: frequency similarity (FS); Levenshtein (LS); context similarity (CS); various combinations of these metrics; and the fully converged system. As Yarowsky and Wicentowski observe, none of the metrics perform well by themselves, but by combining the metrics one can achieve very high performance.

5.5 Discussion

In this section we have reviewed three of the more prominent pieces of work on morphological induction that have appeared in the past few years. Other recent work in this area includes:
Table 5.3 The performance of the Yarowsky–Wicentowski algorithm on four classes of English verbs

Combination of Similarity Models   # of Iterations   All Words (3888)   Highly Irregular (128)   Simple Concat. (1877)   Non-Concat. (1883)
FS (Frequency Sim)                 (Iter 1)                 9.8                18.6                    8.8                    10.1
LS (Levenshtein Sim)               (Iter 1)                31.3                19.6                   20.0                    34.4
CS (Context Sim)                   (Iter 1)                28.0                32.8                   30.0                    25.8
CS+FS                              (Iter 1)                32.5                64.8                   32.0                    30.7
CS+FS+LS                           (Iter 1)                71.6                76.5                   71.1                    71.9
CS+FS+LS+MS                        (Iter 1)                96.5                74.0                   97.3                    97.4
CS+FS+LS+MS                        (Convg)                 99.2                80.4                   99.9                    99.7

From Yarowsky and Wicentowski (2001, Table 9)
• H. Johnson and Martin (2003) propose a method for detecting morpheme boundaries based on what they term hubs. Hubs are nodes in a minimized automaton representing the words of the language that have in-degree and out-degree greater than one. As Johnson and Martin note, their idea is an extension of previous work such as Z. Harris (1951) and Schone and Jurafsky (2000).
• Snover et al. (2002) propose a method that estimates probabilities for: the numbers of stems and suffixes; the lengths of the stems and suffixes; the joint probability of hypothesized stem and suffix sets; the number of paradigms; the number of suffixes in each paradigm, and which suffixes are associated with each paradigm; and the paradigm affiliation of stems. These estimates are then used to assign a probability to a particular analysis of the data into a set of hypothesized stems, suffixes, and paradigms. A novel search algorithm is proposed for searching the set of possible analyses of the data. The method is shown to outperform Linguistica in terms of F-score on corpora of English and Polish.
• Baroni et al. (2002) propose another method similar to Schone and Jurafsky (2000) that combines semantic and orthographic similarity for the discovery of morphologically related words. They use an edit-distance measure for orthographic similarity and mutual information for semantic similarity. The reported performance of this method is hard to compare to the results of other work but seems to be degraded from what one would expect from other methods.
Table 5.4 Comparison of methods of morphological induction

Method           Correct   Incomplete   Incorrect
Recursive MDL     49.6%      29.7%        20.6%
Sequential ML     47.3%      15.3%        37.4%
Linguistica       43.1%      24.1%        32.8%

From Creutz and Lagus (2002, Table 3).
• Creutz and Lagus (2002) propose a recursive MDL-based approach. Words are segmented all possible ways into two parts, including no split, and the segmentation that results in the minimum total cost is chosen. If a split is chosen, then the two chosen components are recursively submitted to the same procedure. As Creutz and Lagus note, this approach has the danger of falling into local optima. To avoid this, words that have already been segmented are resegmented when they occur again (since this may lead to a reanalysis of the word). A second method involves a maximum likelihood estimate of P(data | model), estimated using Expectation Maximization; in this case a linear rather than recursive procedure is used, starting with a random segmentation of the corpus. A comparison of the two methods proposed by Creutz and Lagus, along with the performance of Linguistica on morpheme boundary detection in a sample of 2,500 Finnish words, is shown in Table 5.4 (Creutz and Lagus' Table 3). Note that correct means that all critical morpheme boundaries were detected, incomplete means that only some of the boundaries were detected, and incorrect means that at least one boundary was incorrect.

One clear principle that can be gleaned from the more successful work in morphological induction is that multiple sources of evidence are crucial. Thus morphological induction is similar to other problems in natural language processing, such as Named Entity Recognition – e.g. Collins and Singer (1999) – in that one often does better by combining evidence from multiple sources than from relying on just one kind of evidence. Relevant evidence for morphological induction includes:
• Orthographic or phonological similarity between morphologically related forms. While it is not always the case that members
of a paradigm share phonological properties (go/went is an obvious counterexample), it is overwhelmingly the most common situation. Note that the notion of sharing phonological properties can be quite varied and needs to be parameterized to language-particular cases. Methods such as Yarowsky and Wicentowski (2001) and Schone and Jurafsky (2001) go far towards dealing with a variety of cases involving circumfixation or stem changes. But no published work to date handles infixation, the kind of interdigitation found in Semitic morphology, or productive reduplication. The key here is clearly to allow for a variety of possible ways in which two forms may be phonologically related, without opening the floodgates and allowing all kinds of silly correspondences. The current approaches, in addition to being too limiting in the correspondences they allow, are also too limiting in another sense: they target the kinds of alternations that are known to be active in the languages they are being applied to. Obviously this is not realistic as a model for morphological induction in the general case. A traditional linguistic approach to this issue would be to constrain the search for possible correspondences to those kinds of alternations that are known to occur across languages. So one might start with a taxonomy of morphological alternations (one such taxonomy is given in Sproat, 1992) and allow correspondences only within those classes of alternations. This would be a reasonable approach, but it is also worth considering whether one could induce the range of observed alternation types completely automatically.
• Syntactic context. Various methods for morphological induction depend upon related forms occurring in similar syntactic contexts. This is reasonable for some instances of inflectional morphology. For example in English, plural nouns have roughly the same syntactic distribution as singular nouns, modulo the fact that as subjects they co-occur with different verb forms. Similarly, there is no syntactic difference between present and past tense verb forms. But such similarity in syntactic context is not a feature of inflectional variants in general. For example, in many languages nouns are case marked precisely to mark differing syntactic contexts for the noun forms in question. So one would not in general expect, for example, a nominative and ablative form of a noun to occur in similar syntactic environments.
Even tense marking on verbs can result in different syntactic contexts in some languages. For example in Georgian (A. Harris, 1981; Aronson, 1990), different verb tenses imply different case markings on subjects and direct objects. Aorist verbs, for example, take ergative–absolutive marking, whereas present verbs take nominative–accusative marking. Assuming one tags different nominal case forms with different syntactic tags, these tense differences would therefore correspond to different syntactic environments. And once one moves away from inflectional morphology into more derivational kinds of alternations, especially those that result in a change of grammatical category, one expects quite different syntactic environments.
• Semantic context. Words that are morphologically related tend to occur in semantically similar environments. Thus eat will tend to occur in environments that are similar no matter what the tense is. Similarly, one would expect a nominative singular form such as equus "horse" in Latin to occur in similar semantic environments to other inflected forms such as equī (nominative plural) or equōrum (genitive plural). Semantic context is expected to be more stable than syntactic context across derivational relatives. Thus both donate and donation might be expected to co-occur with such context words as money, fund, dollars, and charity. Semantic drift will, of course, contribute to a weakening of this expectation: transmit and transmission (in the automotive sense) will clearly tend to occur in different semantic contexts. It is a fair question, however, whether native speakers even recognize such forms as related.

Thus despite the substantial progress in recent years on automatic induction of morphology, there are also still substantial limitations on what current systems are able to handle. Nonetheless, the problems described in the preceding bullet items are just that: problems, and problems which we have every reason to believe will be seriously addressed in subsequent work.
Part II
Computational Approaches to Syntax
6 Finite-state Approaches to Syntax

Finite-state models are the most widely used syntactic models today. They include such common techniques as n-gram models, part-of-speech (POS) taggers, noun-phrase chunkers and shallow parsers. The previous chapters on morphology described in detail finite-state approaches in that domain. Syntax is different, however, because it has been known for nearly fifty years that natural languages have very common syntactic structures that cannot be described using finite-state techniques. Nevertheless, these approaches to syntax can provide efficient and useful syntactic information to a range of applications, from automatic speech recognition (ASR) to machine translation (MT). Several dynamic programming algorithms for efficient inference with weighted finite-state models will be presented in this chapter, most notably the Viterbi algorithm (Section 6.3.1) and the Forward-backward algorithm (Section 6.3.3). In subsequent chapters, related dynamic programming algorithms for weighted context-free and context-sensitive models will be presented, and the explicit presentation of these algorithms will make clear the large efficiency cost incurred with richer formalisms. We begin this chapter by describing the simplest syntactic models in wide use.
6.1 N-gram Models

6.1.1 Background

Of the several things that a syntactic model may be asked to provide – acceptance/rejection, structure labeling, or other annotations – standard n-gram models provide just one: scores associated with strings of words. Removing the statistical dimension from n-gram modeling
leaves one with a model that accepts every string possible for the given vocabulary and provides no useful structural annotation; in other words, a model that does nothing. However, the scores that are provided by these very simple models have proven terrifically useful for a range of tasks, most notably large vocabulary ASR. This illustrates the tremendous value of statistical disambiguation in speech and language processing: even without hard grammatical constraints encoded in a model, the utility of the disambiguation provided is very high.

The statistical approach to ASR involves finding the best scoring (highest probability) word transcription hypothesis for the given utterance from among all possible transcriptions. For a given vocabulary Σ, the possible transcriptions are Σ*, and the best choice for a transcription w ∈ Σ* of the utterance A, denoted ŵ, is given by

ŵ = argmax_{w ∈ Σ*} P(w | A)    (6.1)

Through the use of Bayes' rule, this is the same as

ŵ = argmax_{w ∈ Σ*} P(A | w) P(w)    (6.2)
The first probability, P(A | w), is commonly known as the acoustic model, and the second, P(w), as the language model. The grammar in this case is required to provide, for every string w ∈ Σ*, a probability. The grammar's role in statistical ASR is hence to provide an a priori preference for some strings versus others – all else being equal, select the string with the highest probability according to the language model. The same formulation of the problem can be applied to machine translation (Brown et al., 1990), with the source being a string in the language to be translated, rather than an acoustic signal. In such a formulation, if we let A stand for the source string rather than the acoustic signal of an utterance, the first probability is known as the translation model. The role of the language model remains the same.

The score given to strings by a grammar has been discussed up to now as an annotation, but it can also be thought of as a natural generalization of acceptance. If a string w is such that P(w) = 0, then that string is not accepted. Beyond merely delimiting the set of strings in the language of the grammar (w such that P(w) > 0) from those that are not in the language, such a grammar also provides a probability distribution within the language, which can be used to distinguish likely versus unlikely strings. An unweighted grammar can be thought of as
assigning a uniform distribution over strings in the language, where all strings are equiprobable.

6.1.2 Basic Approach

N-gram models provide a probability distribution over strings in the language as follows. Let w = w_1 ... w_k be a string of k symbols (or words) from Σ. For simplicity, we will adopt the convention that w_0 is some special start-of-string symbol (commonly denoted <s>), and that w_k is a special end-of-string symbol (commonly denoted </s>) that always terminates strings. By definition, using the chain rule,

P(w_1 ... w_k) = \prod_{i=1}^{k} P(w_i | w_0 ... w_{i−1})    (6.3)

The n in n-gram indicates that the model makes a Markov assumption to limit the distance of dependencies, so that the conditional probability of each word depends only on the previous n−1 words. In other words, each word is assumed to be conditionally independent of the words in the sentence more than n−1 words distant. For a Markov model of order n−1, the estimate becomes

P(w_1 ... w_k) = \prod_{i=1}^{k} P(w_i | w_{i−n+1} ... w_{i−1})    (6.4)

Most n-gram models in practice make a relatively short Markov assumption, as in the common bigram model, which makes a Markov assumption of order 1. Such a model assumes that, given the adjacent words, a word is conditionally independent of the rest of the sentence. Thus, for a bigram model, the string probability estimate becomes

P(w_1 ... w_k) = \prod_{i=1}^{k} P(w_i | w_{i−1})    (6.5)
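A small sketch of Equation 6.5 in Python, assuming a dictionary of (already smoothed) bigram probabilities; log probabilities are accumulated to avoid underflow on long strings. The padding symbols follow the convention just given.

```python
import math

def bigram_logprob(sentence, bigram_probs):
    """Log probability of a sentence under a bigram model (Equation 6.5).

    `sentence` is a list of tokens; <s> and </s> padding is added here.
    `bigram_probs` is assumed to map (previous_word, word) to
    P(word | previous_word), already smoothed so that no required bigram
    has zero probability."""
    words = ["<s>"] + list(sentence) + ["</s>"]
    logp = 0.0
    for prev, word in zip(words, words[1:]):
        logp += math.log(bigram_probs[(prev, word)])
    return logp
```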
Let h be a sequence of k ≥ 0 words, which we will call an n-gram history, as in the history of the sequence leading up to the next word. To build an n-gram model of order k requires estimating P(w | h) for all w ∈ Σ, for any given history h. The first method that might be used for this is maximum likelihood estimation, that is, relative frequency
estimation for a given corpus C, which we denote P̂:

P̂(w | h) = \frac{c(hw)}{\sum_{w'} c(hw')}    (6.6)
where c(hw) is the count of hw in C, and the denominator ensures that the sum of all probabilities for a particular history h is one. For any given history h, this is a multinomial distribution (see Section 1.3), with a parameter for each word in the vocabulary defining the probability of that word given the history. For example, c(dog food can) is the number of times that particular trigram (h = dog food, w = can) occurs in the corpus.

The problem with the maximum likelihood approach is that, for any word w such that hw is unobserved in the corpus, that is, c(hw) = 0, the probability P̂(w | h) = 0. A trigram such as "dog food consultant" may not be observed in a corpus and may not be particularly likely relative to "dog food can", but should its probability be zero? With a reasonably large vocabulary, many perfectly possible bigrams, let alone trigrams, may not occur in even a very large corpus. Most frequently, n-gram models are estimated so that all words have a non-zero probability given any history h. Methods for allocating non-zero probability to unobserved sequences are known as smoothing. An unsmoothed model gives probability of zero to unobserved sequences.

If the size of the corpus were infinite – if the number of observed sequences were infinite – there would be no need for smoothing, since every possible n-gram would be observed (only impossible n-grams would be unobserved), and the maximum likelihood probability would be the correct probability to use. Even when the number of observations is finite (as is always the case), one might hope at some point to have enough observations to directly estimate the parameters of the model more or less correctly. Data is called sparse when the number of observations per model parameter is insufficient to adequately estimate the parameter values. For an n-gram model of order k, there are |Σ|^k parameters. Hence, for the same amount of training data, there will be less of a sparse data problem for an n-gram model of order k−1 than for a model of order k, because there are exponentially fewer parameters. In other words, lower order models, although they lose some (likely useful) context by making a more severe Markov assumption, are typically better estimated and more reliable. For that reason, smoothing, or estimating probabilities for unobserved n-grams, often involves making use of lower order distributions.
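The following sketch shows relative frequency estimation (Equation 6.6) from a corpus of tokenized sentences. It is a bare maximum likelihood estimator: any n-gram not observed in the corpus receives probability zero, which is exactly the problem that smoothing addresses. The representation of the model as nested dictionaries is an implementation choice for illustration only.

```python
from collections import Counter

def mle_ngram_model(corpus, n=3):
    """Relative-frequency (maximum likelihood) n-gram estimates, as in
    Equation 6.6.  `corpus` is a list of sentences (lists of tokens);
    sentences are padded with <s> and </s>.  Returns a dict mapping a
    history tuple h to a dict of P(w | h); unobserved n-grams simply get
    no entry, i.e. probability zero."""
    counts = Counter()
    for sentence in corpus:
        words = ["<s>"] * (n - 1) + list(sentence) + ["</s>"]
        for i in range(n - 1, len(words)):
            counts[tuple(words[i - n + 1:i + 1])] += 1
    model = {}
    for ngram, c in counts.items():
        history, word = ngram[:-1], ngram[-1]
        model.setdefault(history, {})[word] = c
    for history, dist in model.items():
        total = sum(dist.values())
        for word in dist:
            dist[word] /= total
    return model
```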
Data sparsity is almost always a problem, although this is a function of the number of observations and the number of parameters in the model. For the same corpus, in one case we will have sparse data, while in another we will not. A simple example is estimating bigram models from a corpus of 40 million words of newswire text. In one case, we would like to estimate bigrams of characters; in another bigrams of words. Suppose we have a fixed vocabulary of 10,000 word types in the corpus. The bigram model thus has 100 million parameters (every possible word pair), and we are left on average with less than one observation per parameter, surely not enough to adequately estimate the model. If our character set includes twenty-six letters, space, and maybe some number of punctuation characters – say forty characters to simplify matters – then the number of parameters in the bigram model will be 40² = 1,600, and we can expect on average 25,000 observations per parameter in our corpus, which is hopefully enough to begin converging upon the correct probabilities. Chances are, if you haven't seen a particular combination of characters after 40 million words, the probability of the bigram is approximately zero. Unfortunately, most interesting models have a vocabulary larger than forty, so sparse data is a problem that everyone must deal with when building models.

6.1.3 Smoothing

6.1.3.1 Backing off

For a given n-gram history h = w_{i+1} ... w_{i+k} of order k > 0, call h′ = w_{i+2} ... w_{i+k} the backoff history of h, that is, the final k−1 words of h. Many of the most popular smoothing techniques, including deleted interpolation (Jelinek and Mercer, 1980), Katz backoff (Katz, 1987), absolute discounting (Ney et al., 1994) and Witten-Bell smoothing (Witten and Bell, 1991), make use of the backoff history. The basic idea is to remove some probability previously allocated to observed n-grams, and allocate it to the unobserved n-grams. That probability is not allocated uniformly to all unobserved words following a particular history but rather according to the distribution provided by the backoff history. Let W_Ch = {w : c(hw) > 0 in C}, that is, the set of words w such that the n-gram hw has been observed in C; and let P̃(w | h) be some estimate of the probability for w ∈ W_Ch such that

\sum_{w ∈ W_Ch} P̃(w | h) < 1    (6.7)
Since the probabilities of observed n-grams sum to less than one, the unallocated probability can be assigned to unseen n-grams. A smoothed model can be defined as

P(w | h) = P̃(w | h)          if w ∈ W_Ch
         = α_h P(w | h′)      otherwise    (6.8)

This recursive definition generally terminates with the unigram probabilities, which either go unsmoothed or are smoothed with a uniform distribution. For the model to be normalized, the term α_h is calculated as follows:

α_h = \frac{1 − \sum_{w ∈ W_Ch} P̃(w | h)}{1 − \sum_{w ∈ W_Ch} P(w | h′)}    (6.9)

The various smoothing methods mentioned above differ in how they define the P̃ used in Equations 6.7–6.9. This is not a book on statistical modeling per se, so we will not discuss all of the approaches in the literature. Instead we will focus on two simple approaches: deleted interpolation and absolute discounting.

6.1.3.2 Deleted Interpolation

Deleted interpolation (Jelinek and Mercer, 1980) smooths by mixing maximum likelihood estimates (Equation 6.6) of higher order models with smoothed lower order models, using a mixing parameter λ_h between 0 and 1:

P̃(w | h) = λ_h P̂(w | h) + (1 − λ_h) P(w | h′)    (6.10)
The difficulty with this approach is choosing a good value for the mixing parameter λ_h. Close to 0, the model uses mostly the lower order model, which has less context; close to 1, the model uses mostly the higher order model, which suffers more from sparse data problems. Since each history h can have its own mixing parameter, this becomes quite a difficult parameter estimation problem. The most common solution is to hold aside some of the training data, referred to as held-out data, and choose the parameters that maximize the likelihood of the held-out data, via the Expectation Maximization (EM) algorithm, which is a general iterative technique for improving model parameters (see Section 6.3.3 for a detailed example of the EM algorithm). Even that is generally not feasible, unless there is some kind of parameter tying, that is, unless histories h are clustered into sets, and parameters are tied to be the same value for all histories in a given set. For further discussion of this approach, see Jelinek (1998).
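A minimal sketch of deleted interpolation in the spirit of Equation 6.12, recursing through backoff histories. It assumes that MLE distributions for all history lengths (down to a unigram model keyed by the empty tuple), and a mixing weight for each history, have been estimated elsewhere (e.g. on held-out data via EM); the default weight and the unigram floor are placeholders, not recommended values.

```python
def interpolated_prob(word, history, mle, lambdas, unigram_floor=1e-7):
    """Deleted-interpolation estimate:
    P(w | h) = lambda_h * P_mle(w | h) + (1 - lambda_h) * P(w | h'),
    recursing on the backoff history h' (the history minus its first word).

    `mle` maps history tuples (of every length, including ()) to MLE
    distributions; `lambdas` maps each history to its mixing weight.
    The unigram floor stands in for smoothing the lowest-order model."""
    if len(history) == 0:
        return mle.get((), {}).get(word, unigram_floor)
    lam = lambdas.get(history, 0.5)          # 0.5 is an arbitrary default
    p_here = mle.get(history, {}).get(word, 0.0)
    return lam * p_here + (1.0 - lam) * interpolated_prob(
        word, history[1:], mle, lambdas, unigram_floor)
```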
It is straightforward to show that, in the case of deleted interpolation, α_h from Equations 6.8 and 6.9 is 1 − λ_h, which is how this approach is usually presented in the literature.

α_h = \frac{1 − \sum_{w ∈ W_Ch} P̃(w | h)}{1 − \sum_{w ∈ W_Ch} P(w | h′)}
    = \frac{1 − \sum_{w ∈ W_Ch} (λ_h P̂(w | h) + (1 − λ_h) P(w | h′))}{1 − \sum_{w ∈ W_Ch} P(w | h′)}
    = \frac{1 − λ_h \sum_{w ∈ W_Ch} P̂(w | h) − (1 − λ_h) \sum_{w ∈ W_Ch} P(w | h′)}{1 − \sum_{w ∈ W_Ch} P(w | h′)}
    = \frac{1 − λ_h − (1 − λ_h) \sum_{w ∈ W_Ch} P(w | h′)}{1 − \sum_{w ∈ W_Ch} P(w | h′)}
    = (1 − λ_h) \frac{1 − \sum_{w ∈ W_Ch} P(w | h′)}{1 − \sum_{w ∈ W_Ch} P(w | h′)} = 1 − λ_h    (6.11)

Since P̂(w | h) = 0 if w ∉ W_Ch, Equation 6.8 simplifies, for deleted interpolation, to

P(w | h) = λ_h P̂(w | h) + (1 − λ_h) P(w | h′)    (6.12)

Note that more complicated mixtures can be used for n-gram language modeling. For example, let h = w_{i+1}h′ = h″w_{i+k} be an n-gram history of order k > 1. Then h′ is the commonly defined backoff history, which involves forgetting the first word in the sequence h. The history h″ could also be used as a backoff history of h, after forgetting the last word in the sequence h. This sort of history is not commonly made use of (although cf. Saul and Pereira, 1997; Bilmes and Kirchhoff, 2003), under the assumption that adjacent words are more dependent than non-adjacent words, an assumption that is clearly not always true. One could define a mixture model that makes use of both h′ and h″ as follows:

P(w | h) = λ_h P̂(w | h) + (1 − λ_h)(γ_h P(w | h′) + (1 − γ_h) P(w | h″))    (6.13)

where P(w | h″) is defined in such a way as to take into account the non-adjacency of h″ and w. Generally speaking, the performance gain from such mixing approaches has not been large enough for them to be widely used. See, however, the discussion of factored language models in Section 6.1.5, an approach that has been shown to have particular utility in languages with richer morphologies than English.
6.1.3.3 Discounting

Discounting involves removing some counts from the observations. In the simplest form of absolute discounting (Ney et al., 1994), some fixed θ < 1 is removed from the count of every w ∈ W_Ch. Thus

P̃(w | h) = \frac{c(hw) − θ}{\sum_{w'} c(hw')}    (6.14)
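A sketch of an absolutely discounted bigram model with backoff, combining Equations 6.8, 6.9, and 6.14 for the case of a bigram model backing off to an unsmoothed unigram distribution. The handling of unseen histories and the nested-dictionary count representation are illustrative simplifications.

```python
def absolute_discount_model(counts, theta=0.75):
    """Absolutely discounted bigram model with backoff.

    `counts[h][w]` is the bigram count c(hw); theta is the fixed discount.
    Returns a function prob(w, h) implementing Equation 6.8 with the
    discounted estimate of Equation 6.14 and the backoff weight of
    Equation 6.9, backing off to a unigram distribution."""
    unigram_totals = {}
    for h, dist in counts.items():
        for w, c in dist.items():
            unigram_totals[w] = unigram_totals.get(w, 0) + c
    total = sum(unigram_totals.values())
    p_unigram = {w: c / total for w, c in unigram_totals.items()}

    def prob(w, h):
        dist = counts.get(h, {})
        denom = sum(dist.values())
        if w in dist:                              # discounted estimate (6.14)
            return (dist[w] - theta) / denom
        # Probability mass freed by discounting: theta * |W_Ch| / sum of counts,
        # renormalized over words not observed after h (Equation 6.9).
        freed = theta * len(dist) / denom if denom else 1.0
        seen_lower = sum(p_unigram.get(v, 0.0) for v in dist)
        alpha = freed / (1.0 - seen_lower) if seen_lower < 1.0 else 0.0
        return alpha * p_unigram.get(w, 0.0)

    return prob
```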
Note that, under this approach, the amount of probability allocated to unobserved words is \frac{θ |W_Ch|}{\sum_{w'} c(hw')}, where |W_Ch| is the cardinality of the set W_Ch. Thus histories h where the average count c(hw) over w ∈ W_Ch is low reserve more probability for unobserved events than those with high average counts. This has some intuitive appeal, because the fewer observations per observed word following a history, the less robust the probability estimates for those words and the more need for smoothing.

For example, consider a simple experiment with the following two bigram histories: from and hundred. Each of these words occurs between 42 and 43 thousand times in a 10 million word sample of newspaper text that we examined. Yet, while from has a relatively large set of words following it (over 5,000 different words, for an average of less than 9 observations per bigram), hundred has far fewer (1,516 different words, for an average of over 28 observations per bigram). Only 3 bigrams from the former set occur more than 1,000 times (from the, from a, and from one), while over 2,600 occur only once. In contrast, there are 10 bigrams beginning with hundred that occur over 1,000 times, and less than 800 singletons. This should lead one to expect fewer unobserved words following hundred than following from, hence less probability mass allocated to smoothing for hundred than for from. Absolute discounting achieves this: for a given θ, the amount of probability mass reserved for unobserved words following hundred would be 0.036θ, while 0.117θ would be reserved for unobserved words following from. We also looked at a second 10 million word sample of newspaper text, and found that 8.8 percent of bigram instances beginning with from in the new sample were unobserved in the first sample; whereas 3.0 percent of bigram instances beginning with hundred in the new sample were previously unobserved. This would suggest a θ value of between 0.75 and 0.83 for discounting in these cases.

A very common discounting approach to n-gram language modeling is Katz backoff (Katz, 1987), which makes use of Good–Turing estimation (Good, 1953). One nice quality of Katz backoff is that only low frequency n-grams are discounted, while the counts of high frequency
n-grams are left undiscounted. As the number of observations grows, the empirical distribution will converge to the true distribution, so relative frequency estimation will be better for events with many observations. Rather than modifying the maximum likelihood estimate for these frequent events (as absolute discounting or deleted interpolation would do), Katz backoff leaves those estimates and modifies those which are more likely to be farther from the true distribution.

6.1.3.4 Other approaches

There are many other smoothing approaches that fall within this general class of approaches. See Chen and Goodman (1998) for a useful and thorough review of common n-gram smoothing techniques, as well as a comparative evaluation. A detailed presentation of the range of statistical estimation techniques employed for language modeling is beyond the scope of this book. However, there are two other common estimation techniques that deserve mention in this context. The first is commonly known as Maximum Entropy (MaxEnt) modeling, and it does not require a backoff structure for smoothing. The second is called Kneser–Ney modeling (Kneser and Ney, 1995), and this approach is intended for inclusion in a backoff structured model.

A MaxEnt model is a kind of log-linear model that avoids assigning zero probabilities to unobserved sequences by using an appropriately normalized exponential model. Indicator features, such as the presence or absence of a particular n-gram sequence, are assigned parameter values, and the probability of a string is calculated by first summing the parameter values of all features that are present, then taking the exponential of that sum, and finally normalizing. N-gram sequences that are not features in the model make no contribution to the probability. If the sum of all parameter values is zero for a particular string, the string still receives a probability, since the exponential of zero is one, hence unobserved events receive a non-zero probability. More details on this sort of modeling approach will be provided in Section 6.3.5.

Kneser–Ney modeling involves changing the estimation of lower order models that are being used for smoothing higher order models. The basic intuition is that these lower order models are mainly being used for modeling unseen n-grams (via smoothing), so they should be built to best model that situation. Let us take as an example a bigram model, which backs off to a unigram model. Suppose the corpus is some American newspaper archive, so that the bigram New York is quite frequent. As a result, the unigram York is also quite frequent, so that the
probability of that word will be relatively high. However, the true probability of York following any word other than New is very small, though not zero. Consider how the model will be used. If the previous word is New, the unigram estimate of the probability of York does not matter, because the bigram estimate is being used. If the previous word is not New, the unigram estimate will be relied upon, but in this case, it should be much lower than simple relative frequency estimation provides. The solution to this proposed in Kneser and Ney (1995) is to estimate the unigram model not with relative frequency estimation, but as follows:

P(w) = \frac{|\{w' : c(w'w) > 0\}|}{\sum_{w''} |\{w' : c(w'w'') > 0\}|}    (6.15)

That is, the number of different words that precede w divided by an appropriate normalizing factor, so that the sum of all probabilities is one. For our New York example, York may occur many times, but the number of words after which York occurs will be very few. In contrast, a word like of will occur after many different words, and hence would obtain a relatively high unigram probability under this approach, which is as it should be. We revisited the 10 million word sample from earlier in this section, and our two example words (from and hundred), each of which occurs between 42 and 43 thousand times. We counted the number of different words that precede each of them in the sample: 6,073 for from, and only 32 for hundred. Unsurprisingly, hundred occurs nearly always after the numbers one through nine. This makes it a particularly unlikely word to follow something it has not been observed to follow. Using Kneser–Ney estimation, the unigram probability of from would be nearly 200 times more than that of hundred, as opposed to roughly equiprobable.

6.1.4 Encoding

N-gram modeling is a finite-state approach to syntactic modeling, and Figure 6.1 shows an efficient encoding of a backoff structured n-gram in a deterministic WFSA (Allauzen et al., 2003). Each state encodes an n-gram history h; each arc is labeled with the word w and the probability (or, more commonly, the negative log probability) of the word given the history. The destination state of the arc is the history for the next word. Each history state h of length k > 0 has a backoff arc, labeled with φ and the backoff weight α_h (or the negative log of that value), which points to the state encoding the backoff history h′.
Figure 6.1 Deterministic finite-state representation of n-gram models with negative log probabilities (tropical semiring). The symbol φ labels backoff transitions
This arc has the special semantics of only being traversed if the next symbol being processed has no arc out of the state. In other words, it encodes the otherwise in Equation 6.8. This is an implicit or on-line representation of a much larger automaton, which encodes the model with a transition for every word out of every state. The upper automaton in Figure 6.1 shows a fragment of a trigram model in the tropical semiring.¹ Although it is not shown in that graph, for every bigram history (e.g., w_{i−1}), there will be many trigram histories that back off to the same bigram state (all those with the last word as w_{i−1}). So there is a lot of structure sharing within the graph. The lower automaton shows a complete bigram model, actually estimated from a small toy corpus. The symbols in the vocabulary include the start-of-sentence and end-of-sentence symbols – <s> and </s> respectively – and the letters a and b. All strings begin with the symbol <s>, which has probability 1.0 and is not emitted. It does, however, provide a bigram conditioning context. The probability of a word occurring as the first word in a sentence is not the same as the overall probability of the word. All sentences terminate with the </s> symbol, and hence it labels all transitions going to the final state.

Unlike nearly every other syntactic model that we will discuss in the remainder of this book, an n-gram model thus encoded has no hidden
¹ See Chapter 1 for details on semirings and weighted finite-state automata.
structure, that is, for a given string, it is known, after every word, what is the current state in the above automaton. This allows for extremely fast processing. This, coupled with ease of training, efficiency of storage, and performance that can be difficult to improve upon, has led to n-gram models playing a role in many state-of-the-art large systems, including ASR and MT systems. Of course, they are models with clear shortcomings, in terms of their ability to provide richer syntactic annotations as well as their ability to capture richer syntactic dependencies and generalizations. The next section will discuss the first small step to try to capture a bit more of that. First, though, we will focus briefly on another potential shortcoming of basic n-gram modeling.

6.1.5 Factored Language Models

For a language like English, which is morphologically relatively impoverished, treating word tokens as atomic within an n-gram model may not be optimal, but it is at least viable. That is not the case with other languages, such as agglutinative languages, where the number of distinct types in such an approach would be prohibitive. Morphologically rich languages would benefit from making use of various sorts of generalizations within n-gram models, for example, dependencies between roots, regardless of their surface realization. Factored language modeling (Bilmes and Kirchhoff, 2003) is one way to do this within a generative framework. The basic idea is that each word token is split into a number of factors, w_i = {f_1^i ... f_k^i}, which can jointly and separately allocate probability to the next word, which is also represented as a set of factors. In Bilmes and Kirchhoff (2003), this was motivated by Arabic, where the factors were various deterministically labeled annotations including morphological class, stems, roots, and patterns. In a bigram factored language model, the probability of each word token is

P(w_i | w_{i−1}) = P(f_1^i ... f_k^i | f_1^{i−1} ... f_k^{i−1}) = \prod_{j=1}^{k} P(f_j^i | f_1^{i−1} ... f_k^{i−1})    (6.16)

Thus the probability of each part of the word form can be modeled independently. Unlike the n-gram models that we have presented up to now, however, the factors are not necessarily ordered in a way that provides a natural, effective backoff order. Bilmes and Kirchhoff (2003)
propose a backoff graph, which smooths not with a single backoff state, but with multiple backoff states, that is,

P(f_j^i | f_1^{i−1} f_2^{i−1} f_3^{i−1}) = λ_1 P̂(f_j^i | f_1^{i−1} f_2^{i−1} f_3^{i−1})
  + λ_2 P̂(f_j^i | f_1^{i−1} f_2^{i−1})
  + λ_3 P̂(f_j^i | f_1^{i−1} f_3^{i−1})
  + λ_4 P̂(f_j^i | f_2^{i−1} f_3^{i−1})    (6.17)

where λ_1 + λ_2 + λ_3 + λ_4 = 1. This approach has been demonstrated to provide modeling improvements for Arabic. Maximum Entropy modeling is another approach to breaking words down into sets of features.
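A sketch of the backoff-graph mixture of Equation 6.17, treating each backoff context as a subset of the previous word's factor positions. The representation of the component models and the uniform example weights are assumptions made for illustration, not Bilmes and Kirchhoff's data structures.

```python
def factored_backoff_prob(factor, parents, models, lambdas):
    """Mixture of backoff contexts in the spirit of Equation 6.17.

    `parents` is the tuple of conditioning factors (f1, f2, f3) from the
    previous word; `models` is assumed to map a tuple of parent positions
    (e.g. (0, 1, 2), (0, 1), (0, 2), (1, 2)) to a conditional distribution
    {parent_values: {factor: prob}}; `lambdas` gives one mixing weight per
    context, and the weights are assumed to sum to one."""
    estimate = 0.0
    for positions, lam in lambdas.items():
        context = tuple(parents[i] for i in positions)
        estimate += lam * models[positions].get(context, {}).get(factor, 0.0)
    return estimate

# Example: equal weights over the four contexts of Equation 6.17.
lambdas = {(0, 1, 2): 0.25, (0, 1): 0.25, (0, 2): 0.25, (1, 2): 0.25}
```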
6.2 Class-based Language Models

One shortcoming of n-gram models is that there is no generalization beyond dependencies observed for specific words. For example, in a quick experiment on 10 million words of newspaper text, we found that the bigram every Monday did not occur. However, every did occur before every other day of the week at least once, with a total of 10 observations for the set (or class) of days-of-the-week following the word every, out of a total of 1,762 observations of the word every, or 0.6 percent of the time. Over a large vocabulary, that is a fairly high probability for seeing a day-of-the-week. It seems an artifact of the particular corpus that Monday was not observed in that context in our sample, rather than something different about Monday versus the other days. In fact, if the sample is extended to 40 million words, every Monday occurs 11 times, which is about average for the class.

Class-based language models (Brown et al., 1992) are an attempt to compensate for sparse data by making generalizations about the behavior of words based on the behavior of similar words. In its simplest form, each word w_i is assigned exactly one class label C_i. Again, by convention, let C_0 be <s> and C_k = w_k = </s>. Then the probability of the string is

P(w_1 ... w_k) = \prod_{i=1}^{k} P(C_i | C_0 ... C_{i−1}) P(w_i | C_i)    (6.18)
Figure 6.2 Hidden Markov Model view of class-based language modeling
Just as with n-gram models, a Markov assumption is typically made; for example, a bi-class model would be of the form

P(w_1 ... w_k) = \prod_{i=1}^{k} P(C_i | C_{i−1}) P(w_i | C_i)    (6.19)
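For the simple case in which every word belongs to exactly one class, the bi-class probability of Equation 6.19 can be computed directly, as in the following sketch; the treatment of unknown words as their own class, and the dictionary representations, are illustrative choices.

```python
import math

def class_bigram_logprob(sentence, word_class, class_trans, word_given_class):
    """Log probability of a sentence under a bi-class model (Equation 6.19),
    assuming each word has a single class label.

    `word_class` maps words to their class; `class_trans` maps
    (C_prev, C) to P(C | C_prev); `word_given_class` maps (w, C) to
    P(w | C).  All three models are assumed to be given."""
    logp = 0.0
    prev_class = "<s>"
    for w in list(sentence) + ["</s>"]:
        c = word_class.get(w, w)          # unknown words act as their own class
        logp += math.log(class_trans[(prev_class, c)])
        if w != "</s>":                   # P(</s> | class </s>) = 1 by convention
            logp += math.log(word_given_class[(w, c)])
        prev_class = c
    return logp
```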
Note that a standard n-gram model can be thought of as a class-based model, where each word is a class with exactly one member, in which case Equation 6.18 simplifies to Equation 6.3 and Equation 6.19 reduces to Equation 6.5. Hybrid approaches are consistent with the above formulation, in which some classes have just one member and some, such as days-of-the-week, have multiple class members. Note that we have abused notation a bit in this formulation, since in fact Equations 6.18 and 6.19 give joint word and class sequence probabilities, which just happen to be the same as the word sequence probabilities, in the case that there is just a single class sequence for any given word sequence.

In the case that words can belong to multiple classes, a class-based language model can be viewed as a Hidden Markov Model (HMM), as shown in Figure 6.2. We must be careful not to confuse this graph with the finite-state automata graphs that we have been presenting up to now. In this HMM, the states represent random variables, that is, they can in practice take many different values, hence this representation is a sort of shorthand for many possible states. If we were to represent this as an explicit finite-state machine, as the fragment of a transducer shown in Figure 6.3, the states would represent class values, and arcs between states would be labeled with the class labels and words. The value of the HMM shorthand representation is that it very simply represents the dependencies that are being encoded in the model. An example of when an HMM class-based language model might be appropriate is with part-of-speech (POS) tags as the word class, in which certain words, such as can, are ambiguous with respect to their part-of-speech (in this case, most commonly either modal verb or common noun). The classes in the sequence C_1 ... C_k are represented
Figure 6.3 Finite-state transducer view of class-based language modeling
by hidden states, that is, it is not certain which class sequence has generated the observation sequence w_1 ... w_k. In order to derive the word probabilities for such a model, it is necessary to sum over all possible hidden class sequences. Suppose that T is the set of possible tags. Then

P(w_1 ... w_k) = \sum_{C_1 ... C_k ∈ T^k} \prod_{i=1}^{k} P(C_i | C_{i−1}) P(w_i | C_i)    (6.20)
If the size of the tagset is, say, fifty POS-tags, then the number of possible sequences that must be summed is 50^k, which gets very large even for fairly short sequences. The size of a set of POS-tags will depend on the language, as well as how fine the distinctions are between tags. The Brown Corpus (Francis and Kucera, 1982) makes use of eighty-seven atomic POS-tags, which were subsequently adapted in the Penn Treebank (Marcus et al., 1993) to a smaller set of forty-eight POS-tags, including twelve for punctuation and other symbols. The Penn Treebank set includes four nominal classes – encoding plural/singular and proper/common distinctions – and seven verb classes: modal (MD); base form verb (VB); past tense (VBD); gerund/present participle (VBG); past participle (VBN); third person singular present (VBZ); and non-third-person singular present (VBP). One might choose to go with a coarser tagset, for example, just one noun and one verb class, but even with a minimal set in the dozens, the number of possible sequences will grow exponentially with the length of the string. Luckily there is an algorithm for calculating this which does not require enumeration of all possible sequences.

6.2.1 Forward Algorithm

The Forward algorithm takes advantage of the Markov assumption to share sub-calculations needed to find the sum over many distinct tag sequences. For a word sequence w_1 ... w_k, a set of m tags T = {τ_i : 1 ≤ i ≤ m}, and a vocabulary of n words Σ = {v_i : 1 ≤ i ≤ n}, we will establish some notation. Note that we are dealing with a sequence of possible classes, and we will denote by C_t the random variable of the class associated with word w_t for a time index t in the sequence w_1 ... w_k. The random variable C_t can take as its value any of the tags from T. Let a_ij be defined as follows:

a_ij = P(C_t = τ_j | C_{t−1} = τ_i)   for τ_i, τ_j ∈ T, 1 < t ≤ k    (6.21)
In other words, it is the probability that the class for word w_t is τ_j given that the class for word w_{t−1} is τ_i. For example, if τ_i = Det (determiner)
and τ_j = N (noun), then a_ij is the probability of the tag N labeling a word immediately following a word labeled with tag Det. We will denote by a_·· the set of all a_ij transition probabilities for a particular tagset T. For simplicity, let a_0j and a_i0 be defined as follows:²

a_0j = P(C_1 = τ_j | C_0 = <s>)        a_i0 = P(C_{k+1} = </s> | C_k = τ_i)    (6.22)
Note that all of the transition probabilities are time index invariant; in other words, whether they occur towards the beginning of the sequence or towards the end does not change the probability.³ Next, let b_j(w_t) be defined as

b_j(w_t) = P(w_t | C_t = τ_j)   for w_t ∈ Σ, τ_j ∈ T, 1 ≤ t ≤ k    (6.23)
For example, if w_t = the and τ_j = Det, then b_j(w_t) is the probability of the word the given that the word is in the POS-tag class Det. Note that there are two possible indices that we can use for words in the vocabulary: the time index t where the word appears in the string, and the index within the vocabulary v_i ∈ Σ. To avoid complicating notation, we will usually use the time index, but it is understood that b_j(w_t) = b_j(v_i) if w_t = v_i. We will denote by b_·(·) the set of all b_j(w_t) emission probabilities for a particular tagset T and vocabulary Σ. The Forward algorithm relies on the ability to factor Equation 6.20 into smaller parts. To see this, let us consider a four word string, w = w_1 w_2 w_3 w_4. Then
P(w) = \sum_{C_1 C_2 C_3 C_4 ∈ T^4} (\prod_{i=1}^{4} P(C_i | C_{i−1}) P(w_i | C_i)) P(</s> | C_4)

     = \sum_{i=1}^{|T|} \sum_{j=1}^{|T|} \sum_{p=1}^{|T|} \sum_{q=1}^{|T|} a_0i b_i(w_1) a_ij b_j(w_2) a_jp b_p(w_3) a_pq b_q(w_4) a_q0

     = \sum_{q=1}^{|T|} a_q0 b_q(w_4) (\sum_{p=1}^{|T|} a_pq b_p(w_3) (\sum_{j=1}^{|T|} a_jp b_j(w_2) (\sum_{i=1}^{|T|} a_ij a_0i b_i(w_1))))    (6.24)

² In the presentation of these algorithms, we do not assume that w_k is </s> in a sequence w_1 ... w_k.
³ Of course, <s> and </s> are special symbols that only occur at the beginning and end of the sequences, respectively, so that the transitions can only be taken at those locations in the string. Nevertheless, the transition probabilities themselves are time index invariant: the probability of ending the sequence after a particular tag does not depend on the time index of that tag.
Forward(w_1 ... w_k, T, a_··, b_·(·))
1   α_0(0) ← 1
2   for t = 1 to k do
3     for j = 1 to |T| do
4       α_j(t) ← b_j(w_t) \sum_{i=0}^{|T|} α_i(t−1) a_ij
5   return \sum_{i=1}^{|T|} α_i(k) a_i0
Equation 6.24 can give us some insight on how to approach this calculation efficiently. If we begin with the innermost sum, we only need |T | to calculate b j (w2 ) i =1 a i j a 0i b i (w1 ) once for each tag Ù j . Once we have that value for any particular Ù j , then we can move to the next sum out, which also need only be calculated for every pair of tags. In such a way we can calculate this value without having to explicitly enumerate every possible tag sequence. Let us define a recursive value · j (t), known as the forward probability, as follows. Let ·0 (0) = 1. For t ≥ 1, ·0 (t) = 0 and for j > 0 |T | (6.25) ·i (t − 1)ai j · j (t) = b j (wt ) i =0
The forward probability · j (t) is the probability of seeing the initial observed sequence w1 . . . wt with tag Ù j at time t. Using this definition of the forward probability, Figure 6.4 presents an efficient algorithm for calculating the string probability, given an HMM class-based language model, where again a·· and b· (· ) in the input denote the given probability models. The complexity of this algorithm is k|T |2 . To get an idea of how much more efficient this is than full enumeration of all possible candidates, suppose we have ten POS-tags and ten words in a string. The number of possible candidates is 1010 , whereas k|T |2 = 103 . Hence, we can find the sum in at worst 1000 steps using the Forward algorithm, versus considering at worst 10 billion candidate tag sequences in a brute force manner. Consider the following example sentence: the aged bottle flies fast. Each word in the string, with the exception of the, has multiple possible POS-tags: aged can be an adjective (Adj) or verb (V); bottle can be a noun (N) or a V; flies can also be an N or a V; and fast can be an Adj,
“the aged bottle flies fast”

Emission probabilities b_j(w_t) = P(w_t | C_t = τ_j):

    t  word      1 Adj   2 Adv   3 Det   4 N     5 V
    1  the       0       0       0.5     0       0
    2  aged      0.1     0       0       0       0.01
    3  bottle    0       0       0       0.1     0.1
    4  flies     0       0       0       0.1     0.2
    5  fast      0.1     0.3     0       0.01    0.1

Transition probabilities a_ij = P(C_{t+1} = τ_j | C_t = τ_i):

    i        j: 0 </s>   1 Adj   2 Adv   3 Det   4 N     5 V
    0 <s>       0        0.4     0.2     0.2     0.1     0.1
    1 Adj       0.2      0.2     0.1     0       0.3     0.2
    2 Adv       0.2      0.2     0.2     0.1     0.1     0.2
    3 Det       0        0.3     0.1     0       0.5     0.1
    4 N         0.2      0.1     0.1     0.1     0.2     0.3
    5 V         0.2      0.2     0.3     0.1     0.2     0

Forward probabilities α_j(t) = b_j(w_t) Σ_{i=0}^{|T|} α_i(t−1) a_ij:

    j        t: 0   1 the   2 aged   3 bottle   4 flies      5 fast            6
    0 <s>       1                                                              0.000000167424
    1 Adj               0.003                                 0.0000001408
    2 Adv                                                     0.000000588
    3 Det           0.1
    4 N                          0.000092    0.00000304       0.00000001712
    5 V                 0.0001   0.00006     0.00000552       0.0000000912

Figure 6.5 Example of the Forward algorithm for some made-up probabilities for a made-up sentence. The algorithm produces the result that P(the aged bottle flies fast) = 0.000000167424
an adverb (Adv), an N (as in “the protester’s fast continued”), or a V. This particular example has, with just the POS-tags mentioned for each, thirty-two possible tag sequences, at least three of which are distinct, interpretable readings of the sentence (fasting flies, a flying bottle, and the bottling elderly). Figure 6.5 shows the application of the Forward algorithm to calculate the probability of this sentence, given some made up ai j and bi (wt ) probabilities. Note that the sum of each row in the ai j table is 1. The sum of the rows in the b j (wt ) table would be 1 if all words in the
vocabulary were displayed. Using the probabilities defined in these two tables, we can fill in the values in the α_j(t) table, using the Forward algorithm. We initialize α_0(0) = 1. We begin at t = 1, and consider all possible tags. The output probability b_j(w_1) is non-zero only for j = 3, so by definition α_j(1) will only be non-zero for j = 3. We can now calculate

    α_3(1) = b_3(w_1) Σ_{i=0}^{5} α_i(0) a_i3 = b_3(w_1) (α_0(0) × a_03)
           = 0.5 (1 × 0.2) = 0.1

    α_1(2) = b_1(w_2) Σ_{i=0}^{5} α_i(1) a_i1 = b_1(w_2) (α_3(1) × a_31)
           = 0.1 (0.1 × 0.3) = 0.003

    α_5(2) = b_5(w_2) Σ_{i=0}^{5} α_i(1) a_i5 = b_5(w_2) (α_3(1) × a_35)
           = 0.01 (0.1 × 0.1) = 0.0001

    α_4(3) = b_4(w_3) Σ_{i=0}^{5} α_i(2) a_i4 = b_4(w_3) (α_1(2) × a_14 + α_5(2) × a_54)
           = 0.1 (0.003 × 0.3 + 0.0001 × 0.2) = 0.000092    (6.26)
and so forth. We leave it as an exercise to the reader to verify that the numbers in the table are correct. The Forward algorithm can be applied regardless of the order of the Markov assumption placed on the HMM. One way to see how this can be done is to create a new set of composite class labels, which encode not just the class of the current word, but the previous n−1 classes as well. Thus, instead of τ_j, we have τ_{j_1...j_n}, where τ_{j_n} is the tag of the current word in the original tagset. For example, to encode a Markov model of order two, if the current tag is N and the previous tag was Det, then the new composite tag would be something like Det-N. Let i = i_1 . . . i_m and j = j_1 . . . j_n. Then

    a_ij = P(C_t = τ_{j_n} | C_{t−n} . . . C_{t−1} = τ_{i_1} . . . τ_{i_m})   if i_2 . . . i_m = j_1 . . . j_{n−1}
         = P(C_t = τ_{j_n} | C_1 . . . C_{t−1} = τ_{i_2} . . . τ_{i_m})       if i_1 . . . i_m = j_1 . . . j_{n−1} and i_1 = 0
         = 0                                                                   otherwise    (6.27)

    b_j(w_t) = b_{j_n}(w_t)
With these modifications, the Forward algorithm as presented, as well as the Viterbi and Forward–backward algorithms that will be presented shortly, can be used for Markov models of arbitrary order.
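As an illustration of the composite-label encoding just described, the following Python sketch (not from the book; the hyphenated label format is an assumption) maps an ordinary tag sequence to composite labels of the form Det-N, so that the same Forward, Viterbi, or Forward–backward machinery can be run over the composite tagset.

```python
def to_composite(tags, n=2, bos="<s>"):
    """Map a tag sequence to composite labels that remember the previous
    n-1 tags, e.g. ['Det', 'N'] -> ['<s>-Det', 'Det-N'] for n = 2."""
    history = [bos] * (n - 1)
    out = []
    for tag in tags:
        out.append("-".join(history + [tag]))
        history = (history + [tag])[1:]
    return out

# to_composite(['Det', 'Adj', 'N', 'V', 'Adv']) ->
# ['<s>-Det', 'Det-Adj', 'Adj-N', 'N-V', 'V-Adv']
```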
6.3 Part-of-Speech Tagging
Up to now, the syntactic models we have been considering have been used to provide a single (admittedly important) annotation: a score. While part-of-speech (POS) tags may have utility in class-based language models, they are also widely used as syntactic annotations in their own right. There are many approaches to POS-tagging, and we will describe a few in this section. We will begin with HMM POS-tagging, which makes use of a model identical to the class-based language model that was presented in the last section, but used here to find the maximum likelihood tag sequence. This approach can achieve relatively high accuracy – over 95 percent – though other models that we will describe can achieve slightly higher accuracy than this. Given a word string w_1 . . . w_k and a tag set T, the task of POS-tagging is to find Ĉ_1 . . . Ĉ_k ∈ T^k such that

    Ĉ_1 . . . Ĉ_k = argmax_{C_1...C_k ∈ T^k} P(C_1 . . . C_k | w_1 . . . w_k)
                  = argmax_{C_1...C_k ∈ T^k} P(w_1 . . . w_k | C_1 . . . C_k) P(C_1 . . . C_k)
                  = argmax_{C_1...C_k ∈ T^k} Π_{i=1}^{k} P(C_i | C_0 . . . C_{i−1}) P(w_i | C_0 . . . C_i, w_1 . . . w_{i−1})
                  ≈ argmax_{C_1...C_k ∈ T^k} Π_{i=1}^{k} P(C_i | C_{i−n} . . . C_{i−1}) P(w_i | C_i)    (6.28)
This last approximation makes a Markov assumption of order n+1 on the sequence of tags, and the assumption that the probability of a word given its POS tag is conditionally independent of the rest of the words or tags in the sequence. If n = 1, then Equation 6.28 is the same as Equation 6.20, except finding the max instead of summing. In fact, the same trick of factoring the probability provides an efficient algorithm to find the best tag sequence, widely known as the Viterbi algorithm.
Viterbi(w_1 . . . w_k, T, a_··, b_·(·))
 1  α_0(0) ← 1
 2  for t = 1 to k do
 3      for j = 1 to |T| do
 4          ψ_j(t) ← argmax_i α_i(t−1) a_ij
 5          α_j(t) ← [max_i α_i(t−1) a_ij] b_j(w_t)
 6  ψ_0(k+1) ← argmax_i (α_i(k) a_i0)
 7  ρ(k+1) ← 0
 8  for t = k to 1 do                         (traceback with backpointers)
 9      ρ(t) ← ψ_{ρ(t+1)}(t+1)
10      Ĉ_t ← τ_{ρ(t)}
11  return Ĉ_1 . . . Ĉ_k

Figure 6.6 Pseudocode of the Viterbi algorithm for HMM decoding
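A minimal Python rendering of Figure 6.6 follows; it is a sketch rather than the book's implementation, reusing the dictionary conventions of the Forward sketch above (index 0 for the boundary symbol).

```python
def viterbi(words, tags, a, b):
    """Maximum likelihood tag sequence under an HMM, following Figure 6.6."""
    k = len(words)
    alpha = [{0: 1.0}]          # best score of any path ending in tag j at time t
    psi = [{}]                  # backpointers
    for t in range(1, k + 1):
        w = words[t - 1]
        alpha.append({})
        psi.append({})
        for j in tags:
            best_i, best = max(((i, p * a[i].get(j, 0.0))
                                for i, p in alpha[t - 1].items()),
                               key=lambda x: x[1])
            psi[t][j] = best_i
            alpha[t][j] = best * b[j].get(w, 0.0)
    # best final tag, taking the transition into </s> into account
    last = max(alpha[k], key=lambda i: alpha[k][i] * a[i].get(0, 0.0))
    seq = [last]
    for t in range(k, 1, -1):   # trace back through the stored pointers
        seq.append(psi[t][seq[-1]])
    return list(reversed(seq))
```

With the probabilities of Figure 6.5, this should return the sequence Det Adj N V Adv traced in Figure 6.7.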
6.3.1 Viterbi Algorithm
The Viterbi algorithm is very similar to the Forward algorithm. The definitions of a_ij and b_j(w_t) are identical. The first change to the Forward algorithm is to redefine α_j(t) as follows. Let α_0(0) = 1. For t ≥ 1, α_0(t) = 0 and for j > 0

    α_j(t) = b_j(w_t) max_{i=0}^{|T|} α_i(t−1) a_ij    (6.29)

The only change that has been made from Equation 6.25 is to replace the summation with a max, that is, instead of summing across all possible previous classes, we take just the highest score. The second difference from the Forward algorithm is that in addition to calculating this recursive score, we need to keep track of where the max came from, in order to reconstruct the maximum likelihood path at the end. This is called a backpointer, since it points back from each cell in the table to the tag that provides the maximum score. Let ψ_j(t) denote the backpointer for class j at time t. Then the Viterbi algorithm is shown in Figure 6.6. Figure 6.7 shows an example of the Viterbi algorithm, using the same probabilities defined in Figure 6.5. Each cell in the table stores the maximum forward probability, plus a backpointer index to the previous tag that provided the maximum forward probability. Once all of the cells have been filled in, we can re-trace the backpointers to reconstruct the maximum likelihood sequence. Note that this Viterbi definition of the forward probability gives the same probabilities as in Figure 6.5 for times 0–2, since there is only one possible tag for the first word “the”. In column 3, however, since only the max probability is stored, rather than the sum, the values differ in all cells from those in Figure 6.5. Both of the
Viterbi probabilities and backpointers α_j(t)/ψ_j(t), where α_j(t) = b_j(w_t) max_{i=0}^{|T|} α_i(t−1) a_ij and ψ_j(t) = argmax_{i=0}^{|T|} α_i(t−1) a_ij:

    j        t: 0   1 the    2 aged     3 bottle     4 flies        5 fast             6
    0 <s>       1                                                                      0.0000000972/2
    1 Adj                    0.003/3                                0.000000108/5
    2 Adv                                                           0.000000486/5
    3 Det           0.1/0
    4 N                                 0.00009/1    0.0000018/4    0.0000000108/5
    5 V                      0.0001/3   0.00006/1    0.0000054/4    0.000000054/4

Traceback:

    ρ(6) = 0
    ρ(5) = ψ_{ρ(6)}(6) = ψ_0(6) = 2        Ĉ_5 = τ_{ρ(5)} = τ_2 = Adv
    ρ(4) = ψ_{ρ(5)}(5) = ψ_2(5) = 5        Ĉ_4 = τ_{ρ(4)} = τ_5 = V
    ρ(3) = ψ_{ρ(4)}(4) = ψ_5(4) = 4        Ĉ_3 = τ_{ρ(3)} = τ_4 = N
    ρ(2) = ψ_{ρ(3)}(3) = ψ_4(3) = 1        Ĉ_2 = τ_{ρ(2)} = τ_1 = Adj
    ρ(1) = ψ_{ρ(2)}(2) = ψ_1(2) = 3        Ĉ_1 = τ_{ρ(1)} = τ_3 = Det

Figure 6.7 Example of the Viterbi algorithm using the probabilities from Figure 6.5. The algorithm produces the result that the maximum likelihood tag sequence for the aged bottle flies fast is Det Adj N V Adv
max values in that column come from Ù1 , indicated by the back pointer “1” in those cells. Again, we encourage the reader to work through and verify the remaining values in the table. Of course, which sequence turns out to be the maximum likelihood sequence depends entirely on the probability model. Different models can result in different best-scoring hypotheses for the same string. What we have not discussed is where these models come from. An n-gram model requires large amounts of text for effective relative frequency estimation, and fortunately billions of words of text is available from newspaper archives and other sources. However, newspaper text does not typically come with POS-tags, so how can we perform relative frequency estimation for our models? The first way in which the probabilities can be estimated is with some amount of manually labeled data. This is called a supervised training scenario, where the learning is provided the true labels by a human, in this case an annotator. For example, the Penn Treebank, in addition to syntactic bracketing, has the POS-tag annotation for all of the words. One can use this manually annotated data to do relative frequency estimation to induce the probabilities for the HMM tagger.
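For the supervised case just described, relative frequency estimation is straightforward. The following Python sketch (illustrative, not the book's code) counts tag transitions and word emissions from a POS-tagged corpus and normalizes them into the a and b tables used above.

```python
from collections import defaultdict

def estimate_hmm(tagged_sentences):
    """Relative frequency estimates for HMM transition (a) and emission (b)
    probabilities from sentences of (word, tag) pairs; tag 0 marks <s>/</s>."""
    trans = defaultdict(lambda: defaultdict(float))
    emit = defaultdict(lambda: defaultdict(float))
    for sentence in tagged_sentences:
        prev = 0                                  # start-of-sentence
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
        trans[prev][0] += 1                       # end-of-sentence
    a = {i: {j: c / sum(js.values()) for j, c in js.items()}
         for i, js in trans.items()}
    b = {t: {w: c / sum(ws.values()) for w, c in ws.items()}
         for t, ws in emit.items()}
    return a, b
```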
The second way is to start with an initial guess of what the probabilities are, and then to re-estimate the model iteratively using unsupervised techniques, that is, with training data that has not been manually annotated. The general technique for doing this is called Expectation Maximization, and the efficient algorithm for the HMM case is commonly known as the Baum–Welch or Forward–backward algorithm, which is presented below in Section 6.3.3.
6.3.2 Efficient N-best Viterbi Decoding It is frequently the case that returning more than just the single best scoring sequence is useful. For example, one very common strategy in speech and language processing is to take a simple model, find a relatively small number of the best scoring candidates from that model, and then use a richer model to re-score the output from the simple model. This can be a useful strategy in cases where the richer model would have been very costly to apply to all possible candidates, but which provides improvements even when applied to a small subset of those candidates. One method for returning more than one sequence is to specify a number n and use an extension to the Viterbi algorithm to return exactly n candidates. In this section, we will briefly present one simple extension to the Viterbi algorithm, followed by a more efficient approach to the problem. To describe these algorithms, we must first introduce some notation. Whereas max and argmax return the highest score and highest scoring argument of a function respectively, let maxr and argmaxr be defined as returning the r th highest score and r th highest scoring argument of a function respectively. Hence max2 returns the second highest score, and argmax3 returns the third highest scoring argument. To return the n-best scoring sequences, the simple extension to the Viterbi algorithm maintains n backpointers for every tag at every word position, in rank order. Just as in the Viterbi algorithm, these backpointers store the previous tag associated with the score; additionally, however, they must store the rank of the score, from among the n-best for that previous tag. In cases where the maxr and argmaxr search over tag and rank pairs, the argmaxr returns a paired backpointer (Êr , Ïr ), where Êr stores the tag and Ïr stores the rank. Note that, in this case, a state may have multiple backpointers to the same previous tag, but with different ranks within that previous tag.
Simple-N-best-Viterbi(w_1 . . . w_k, T, a_··, b_·(·), n)
 1  α^1_0(0) ← 1
 2  for c = 2 to n do  α^c_0(0) ← 0
 3  for t = 1 to k do
 4      for j = 1 to |T| do
 5          for r = 1 to n do                                  (n-best candidates)
 6              (ψ^r_j(t), µ^r_j(t)) ← argmax^r_{i,c} α^c_i(t−1) a_ij
 7              α^r_j(t) ← [max^r_{i,c} α^c_i(t−1) a_ij] b_j(w_t)
 8  for r = 1 to n do
 9      (ψ^r_0(k+1), µ^r_0(k+1)) ← argmax^r_{i,c} (α^c_i(k) a_i0)
10  for c = 1 to n do                                          (traceback with backpointers)
11      ρ(k+1) ← 0
12      r ← c
13      T̂_c ← ε
14      for t = k to 1 do
15          ρ(t) ← ψ^r_{ρ(t+1)}(t+1)
16          r ← µ^r_{ρ(t+1)}(t+1)
17          T̂_c ← τ_{ρ(t)} T̂_c
18  return T̂_1, . . . , T̂_n

Better-N-best-Viterbi(w_1 . . . w_k, T, a_··, b_·(·), n)
 1  α_0(0) ← 1
 2  for t = 1 to k do
 3      for j = 1 to |T| do
 4          ψ_j(t) ← argmax_i α_i(t−1) a_ij
 5          α_j(t) ← [max_i α_i(t−1) a_ij] b_j(w_t)
 6  S ← ∅
 7  for r = 1 to n do
 8      p ← max^r_i (α_i(k) a_i0)
 9      if p > 0 then
10          Enqueue(S, (p, argmax^r_i (α_i(k) a_i0), ε, k))
11  c ← 0
12  while c < n and S ≠ ∅ do
13      (p, j, T, t) ← head(S); Dequeue(S)                     (remove top ranked)
14      T ← τ_j T                                              (add tag to sequence)
15      if (t = 1) then                                        (beginning of sequence)
16          c ← c + 1
17          T̂_c ← T
18      else
19          p_0 ← α_j(t)
20          for r = 1 to n−c do
21              p′ ← (p/p_0) max^r_i (α_i(t−1) a_ij)
22              if p′ > 0 then
23                  Enqueue(S, (p′, argmax^r_i (α_i(t−1) a_ij), T, t−1))
24  return T̂_1, . . . , T̂_c

Figure 6.8 Pseudocode of a simple N-best Viterbi algorithm for HMM decoding, and an improved algorithm for the same.
Figure 6.8 gives the pseudocode for two N-best Viterbi algorithms. The relatively simple extension to the Viterbi algorithm (Simple-Nbest-Viterbi) involves finding the n-best solutions for each tag at each time step, instead of just the single best. A straightforward traceback routine at the end yields the n-best solutions. However, this approach is not as efficient a solution to this problem as can be achieved. To see why this is the case, consider that the algorithm computes and stores
the n-best solutions for every tag at every time step, even though the problem only requires n solutions in total. We can do far less work by aiming to find just n solutions. Our presentation of the improved algorithm (Better-N-bestViterbi) is based on the algorithm for finding the n-best strings from a word lattice, presented in Mohri and Riley (2002). The first five lines of this algorithm are identical to the Viterbi algorithm in Figure 6.6. Subsequently, a priority queue S is created, and the n-best candidates at the final timestep are pushed onto the queue, which is ranked by the probability of the candidate. Then, until either n candidates are found or the queue is emptied, the topmost ranked candidate is popped from the priority queue, and the n-best alternatives achieved by taking one step back are found and then pushed back on the queue. There are a couple of points to make about this improved algorithm. First, note that on line 20, only the top n − c candidates need to be found, where c is the number of complete candidates already found. This reduces the amount of work required after some candidates have already been completed. Second, finding the n-best backpointers is done on the fly during the backtrace stage of the algorithm. For many states, the n-best will never be calculated, since they will never be popped from the priority queue. As an implementation efficiency, once a state has had its n-best candidates calculated, they can be cached, in case another candidate goes through that state and requires the same n-best candidates. Several additional such efficiencies can be obtained with this algorithm; see Huang and Chiang (2005) for a presentation of such efficiency improvements when doing n-best extraction with a context-free parser, most of which are applicable in this case.
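The improved algorithm can be approximated in Python with a priority queue over partial (suffix) hypotheses. The sketch below is a simplified variant written for illustration – it pushes all predecessor extensions rather than only the top n−c, so it is less efficient than Better-N-best-Viterbi but easier to follow – and is not the book's implementation.

```python
import heapq

def nbest_viterbi(words, tags, a, b, n):
    """Return up to n best tag sequences, best first, by best-first backward
    search over the Viterbi trellis (in the spirit of Better-N-best-Viterbi)."""
    k = len(words)
    alpha = [{0: 1.0}]                               # Viterbi (max) forward scores
    for t in range(1, k + 1):
        alpha.append({j: b[j].get(words[t - 1], 0.0) *
                         max(p * a[i].get(j, 0.0) for i, p in alpha[t - 1].items())
                      for j in tags})
    heap, results = [], []
    for j in tags:                                   # candidates for the final tag
        score = alpha[k][j] * a[j].get(0, 0.0)
        if score > 0:
            heapq.heappush(heap, (-score, k, j, ()))
    while heap and len(results) < n:
        neg, t, j, suffix = heapq.heappop(heap)
        seq = (j,) + suffix
        if t == 1:                                   # prefix is now unique: emit
            results.append((-neg, list(seq)))
            continue
        rest = -neg / alpha[t][j]                    # score of the fixed suffix
        for i, p in alpha[t - 1].items():            # re-rank predecessor choices
            s = p * a[i].get(j, 0.0) * b[j].get(words[t - 1], 0.0) * rest
            if s > 0:
                heapq.heappush(heap, (-s, t - 1, i, seq))
    return results
```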
6.3.3 Forward–backward Algorithm
As might be inferred from the name, the Forward–backward algorithm is related to the Forward algorithm described earlier. The algorithm performs a forward pass which is identical to the Forward algorithm; it then performs a similar pass in the other direction, to calculate a second recursive value, known as the backward probability. The a_ij and b_j(w_t) definitions remain the same from the Forward algorithm, as does the definition of the forward probability α_j(t), which uses the sum, not the max of the Viterbi algorithm. We now define another recursive value β_j(t) as follows. Given a word string w_1 . . . w_k, let β_i(k) = a_i0. For 0 ≤ t < k,

    β_i(t) = Σ_{j=1}^{|T|} β_j(t+1) a_ij b_j(w_{t+1})    (6.30)
Whereas the forward probability α_i(t) is the probability of seeing the initial sequence w_1 . . . w_t with tag τ_i at time t, the backward probability β_i(t) is the probability of seeing the remaining sequence w_{t+1} . . . w_k given tag τ_i at time t. Note that the backward probability at time t depends on the backward probability at time t+1, so that calculation of the backward probability goes from right-to-left. The product of the forward and backward probabilities α_i(t)β_i(t) is the probability of seeing the entire sequence w_1 . . . w_k with tag τ_i at time t. From this product, we can now calculate γ_i(t), the probability of tag τ_i at time t given the entire string w_1 . . . w_k:

    γ_i(t) = α_i(t)β_i(t) / Σ_{j=1}^{|T|} α_j(t)β_j(t)    (6.31)

The value γ_i(t) is equivalent to the sum of the probabilities of all tag sequences with tag τ_i at time t. In addition, we can calculate ξ_ij(t), the probability of tag τ_i at time t and tag τ_j at time t+1 given the entire string w_1 . . . w_k:

    ξ_ij(t) = α_i(t) a_ij b_j(w_{t+1}) β_j(t+1) / Σ_{l=1}^{|T|} Σ_{m=1}^{|T|} α_l(t) a_lm b_m(w_{t+1}) β_m(t+1)    (6.32)
See Bilmes (1997) for a detailed derivation of these values, in notation consistent with what has been used here. The Forward–backward algorithm is an instance of Expectation Maximization (EM), where expected frequencies are calculated given the model, and model parameters are updated based on the expected frequencies. Figure 6.9 presents one iteration of the ExpectationMaximization algorithm to re-estimate HMM parameters using the Forward–backward algorithm for calculating expected frequencies. The function we have called Update-Hmm-Parameters takes as input N word strings, a set of tags, a vocabulary, and the current HMM parameters a·· and b· (· ). It accumulates expected frequencies of tags, transitions, and word/tag pairs via the Forward-Backward function applied to each string. This is typically called the E-step in Expectation Maximization algorithms. The M-step involves maximum likelihood estimation given the expected counts, from which new model parameters are returned.
Update-Hmm-Parameters(W_1 . . . W_N, T, Ω, a_··, b_·(·))
 1  for i = 0 to |T| do                                        (Initialize)
 2      ĉ_i ← 0
 3      for j = 1 to |Ω| do
 4          b̂_i(v_j) ← 0
 5      for j = 0 to |T| do
 6          â_ij ← 0
 7  for s = 1 to N do                                          (E-step)
 8      (γ_·(·), ξ_··(·)) ← Forward-Backward(W_s, T, a_··, b_·(·))
 9      for i = 1 to |T| do
10          â_0i ← â_0i + γ_i(1)
11          for t = 1 to |W_s| do
12              ĉ_i ← ĉ_i + γ_i(t)
13              b̂_i(w_t) ← b̂_i(w_t) + γ_i(t)
14              for j = 0 to |T| do
15                  â_ij ← â_ij + ξ_ij(t)
16  for i = 1 to |T| do                                        (M-step)
17      â_0i ← (1/N) â_0i
18      for j = 1 to |Ω| do
19          b̂_i(v_j) ← (1/ĉ_i) b̂_i(v_j)
20      for j = 0 to |T| do
21          â_ij ← (1/ĉ_i) â_ij
22  return (â_··, b̂_·(·))

Forward-Backward(w_1 . . . w_k, T, a_··, b_·(·))
 1  α_0(0) ← 1
 2  for t = 1 to k do                                          (Forward pass)
 3      for j = 1 to |T| do
 4          α_j(t) ← b_j(w_t) Σ_{i=0}^{|T|} α_i(t−1) a_ij
 5  for t = k to 1 do                                          (Backward pass)
 6      for i = 1 to |T| do
 7          if t < k
 8              then β_i(t) ← Σ_{j=0}^{|T|} β_j(t+1) a_ij b_j(w_{t+1})
 9              else β_i(t) ← a_i0
10      for i = 1 to |T| do
11          γ_i(t) ← α_i(t)β_i(t) / Σ_{j=0}^{|T|} α_j(t)β_j(t)
12          if t < k
13              then ξ_i0(t) ← 0
14                   for j = 1 to |T| do
15                       ξ_ij(t) ← α_i(t) a_ij b_j(w_{t+1}) β_j(t+1) / Σ_{l=1}^{|T|} Σ_{m=1}^{|T|} α_l(t) a_lm b_m(w_{t+1}) β_m(t+1)
16              else ξ_i0(t) ← γ_i(t)
17                   for j = 1 to |T| do
18                       ξ_ij(t) ← 0
19  return (γ_·(·), ξ_··(·))

Figure 6.9 Pseudocode of an iteration of the Forward–backward algorithm for HMM parameter estimation
The Forward-Backward function on a string takes its first part – the forward pass over the string – straight from the Forward algorithm presented in Figure 6.4. During the backward pass, once the backward probabilities have been calculated for all tags at time t, we can calculate the „· (t) and Ó·· (t) values which are returned from the function. Note that a special case must be made for time k when calculating Ó·· (k).
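The E-step quantities of Figure 6.9 can be sketched in Python as follows. This is an illustrative reimplementation, not the book's code; it reuses the dictionary conventions of the earlier sketches and returns the per-position posteriors γ.

```python
def forward_backward(words, tags, a, b):
    """Single-string E-step quantities: forward, backward, and posterior probabilities."""
    k = len(words)
    # Forward pass (as in the Forward algorithm)
    alpha = [{0: 1.0}] + [dict() for _ in range(k)]
    for t in range(1, k + 1):
        for j in tags:
            alpha[t][j] = b[j].get(words[t - 1], 0.0) * sum(
                p * a[i].get(j, 0.0) for i, p in alpha[t - 1].items())
    # Backward pass
    beta = [dict() for _ in range(k + 1)]
    for i in tags:
        beta[k][i] = a[i].get(0, 0.0)              # beta_i(k) = a_i0
    for t in range(k - 1, 0, -1):
        for i in tags:
            beta[t][i] = sum(beta[t + 1][j] * a[i].get(j, 0.0) *
                             b[j].get(words[t], 0.0) for j in tags)
    # Posteriors gamma_i(t) = alpha_i(t) beta_i(t) / sum_j alpha_j(t) beta_j(t)
    gamma = [dict() for _ in range(k + 1)]
    for t in range(1, k + 1):
        z = sum(alpha[t][j] * beta[t][j] for j in tags)
        for i in tags:
            gamma[t][i] = alpha[t][i] * beta[t][i] / z if z > 0 else 0.0
    return alpha, beta, gamma
```

With the Figure 6.5 probabilities, the γ values should match the lower table of Figure 6.10, e.g. γ_N(4) ≈ 0.2615 and γ_V(4) ≈ 0.7385.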
Backward probabilities β_i(t) = Σ_{j=1}^{|T|} β_j(t+1) a_ij b_j(w_{t+1}):

    i        t: 0                 1 the            2 aged        3 bottle    4 flies   5 fast
    0 <s>       0.000000167424
    1 Adj                                          0.00005472                          0.2
    2 Adv                                                                              0.2
    3 Det                         0.00000167424
    4 N                                                          0.001632    0.0144    0.2
    5 V                                            0.00003264    0.000288    0.0224    0.2

Posterior probabilities γ_i(t) = α_i(t)β_i(t) / Σ_{j=1}^{|T|} α_j(t)β_j(t):

    i        t: 0   1 the   2 aged   3 bottle   4 flies   5 fast
    0 <s>       1
    1 Adj                   0.9805                         0.1682
    2 Adv                                                  0.7024
    3 Det           1
    4 N                              0.8968     0.2615     0.0205
    5 V                     0.0195   0.1032     0.7385     0.1089

Figure 6.10 Backward probabilities and posterior probabilities using the example from Figure 6.5. The first table provides an example of the backward calculations in the Forward–backward or Baum–Welch algorithm using the probabilities from Figure 6.5. Note that β_0(0) = α_0(6). The second table uses the forward and backward probabilities to calculate the posterior probability of each tag at each position
In an unsupervised training scenario, we do not know for certain what the POS tags are, so that we can not simply count how many times they occur for learning a statistical model, as in the supervised training scenario, when the true tags are given. Instead, what we can do is have an initial model, then use the Forward–backward algorithm to estimate how many times we expect them to occur, based on our initial model. This is known as the expected frequency, and this can be used to estimate models from unlabeled corpora. Figure 6.10 shows tables containing the backward probabilities and posterior probabilities over our example string.
6.3.4 Forward–backward Decoding
We motivated the Viterbi algorithm at the beginning of this section via the commonsense idea that the best sequence according to the model is the one with the maximum probability. It turns out that this is true only for a particular definition of best. In this section, we present an alternative definition of best, which suggests an alternative approach to decoding, making use of the Forward–backward algorithm. To present these ideas, we will use a notation similar to that used for parsing in Goodman (1996b); similar ideas have been pursued in speech recognition (Stolcke et al., 1997; Mangu et al., 1999), machine translation (Kumar and Byrne, 2004), and finite-state chunking (Jansche, 2005). POS-tagging is a particularly nice example of this sort of approach, since the number of tags in all candidate tag sequences is exactly the same: one per word. For convenience, let us treat a tag sequence T as a set of time-indexed tags (τ_j, t), so that (τ_j, t) ∈ T if the tag for word w_t in T is τ_j. Then the size of the intersection of two sequences, |T ∩ T′|, is the number of words for which the two sequences have the same tag. Given the “true” tag sequence T̄, the exact-match accuracy of a given tag sequence T is:

    exact-accuracy(T | T̄) = 1 if T = T̄; 0 otherwise    (6.33)

In contrast, the per-tag accuracy of a given tag sequence T is:

    per-tag-accuracy(T | T̄) = |T ∩ T̄| / |w| = (1/|w|) Σ_{(τ_j,t)∈T} tag-in-seq(τ_j, t, T̄)    (6.34)

where

    tag-in-seq(τ_j, t, T̄) = 1 if (τ_j, t) ∈ T̄; 0 otherwise    (6.35)
The denominator in Equation 6.34 is the number of words in the string w, since the true tag sequence must have exactly that length. Given a conditional distribution over tag sequences for a particular word sequence w, we can optimize our expected accuracy as follows:

    T̂ = argmax_T E(accuracy(T | T̄))
      = argmax_T Σ_{T′} P(T′ | w) accuracy(T | T′)    (6.36)
That is, we find the tag sequence T that maximizes the sum of the posterior probability of each tag sequence T′, times the accuracy T would have if T′ were correct. If we are interested in the exact-match accuracy, this yields:

    T̂ = argmax_T Σ_{T′} P(T′ | w) exact-accuracy(T | T′)
      = argmax_T P(T | w)    (6.37)
which is exactly the maximum likelihood tag sequence that motivated the use of the Viterbi algorithm. However, if we use the per-tag accuracy, we get:

    T̂ = argmax_T Σ_{T′} P(T′ | w) per-tag-accuracy(T | T′)
      = argmax_T Σ_{T′} P(T′ | w) (1/|w|) Σ_{(τ_j,t)∈T} tag-in-seq(τ_j, t, T′)
      = argmax_T (1/|w|) Σ_{(τ_j,t)∈T} Σ_{T′} P(T′ | w) tag-in-seq(τ_j, t, T′)
      = argmax_T Σ_{(τ_j,t)∈T} γ_j(t)    (6.38)

Note that, in the last step of this derivation, the 1/|w| is removed, because it is a constant over all tag sequences, and hence does not impact the argmax. The inner sum is, by definition, equivalent to the γ defined in Equation 6.31 of the last section, which can be calculated using the Forward–backward algorithm. Thus, if we seek to optimize our expectation of exact-match accuracy, we should use the Viterbi algorithm. If, however, we seek to optimize the per-tag accuracy – which is most often how such taggers are evaluated – we should instead maximize the sum of γ_j(t) for all (τ_j, t) in the tag sequence. This suggests an algorithm as follows: first apply the Forward–backward algorithm as presented in Figure 6.9; then select the tag at each time that gives the maximum posterior probability (γ) at that time step. If we inspect the values in the lower table of Figure 6.10, we see, in this case, that the tag sequence that would be returned by such an algorithm is, in fact, the same as would be returned by the Viterbi algorithm. In general, however, this may not be the case, and relatively
large improvements in accuracy can be obtained using such techniques instead of Viterbi decoding.

6.3.5 Log-linear Models
Log-linear models associate features with parameter weights. A real-valued n-dimensional vector of feature values Φ(wT) is derived from a candidate, and a score is assigned to that candidate by taking the dot product of Φ(wT) with the real-valued n-dimensional weight vector ᾱ. The score can have a probabilistic interpretation, if normalized so that the sum of the exponential scores of all possibilities is 1. The benefit of these models over HMMs is that arbitrary, mutually dependent features can be used in the model, limited not by the strict conditional independence assumptions implicit in the HMM model structure, but rather only by the computational complexity of the model structure. For finite-state models, the features typically remain quite similar to the features used for an HMM model, such as bi-tag and tri-tag sequences, and tag-word pairs. However, more complicated feature types can be used, such as bi-tag plus the corresponding words. Models are typically described as including features of a particular type or template, such as bi-tags, but when we mention features in the following discussion, we are referring to very specific patterns that instantiate these templates, such as: “the word for tagged with tag IN followed by the word the tagged with DT”; or “the word many tagged with DT followed by an unknown word ending in -s tagged with NNS.” One question that must be answered up front is what, for normalization purposes, constitutes “all possibilities”? For example, in POS-tagging, one might want a joint model over the tag and word sequence, in which case “all possibilities” includes all possible tag and word sequence pairs. Alternatively, one might want a conditional model over tag sequences for a given word sequence, in which case “all possibilities” would be all possible tag sequences for the given word sequence, which is a much smaller set than in the joint case. An exponential model for a sequence of words and tags wT = w_1 τ_1 . . . w_k τ_k has the form:

    P(wT) = exp(Σ_{i=1}^{n} φ_i(wT) α_i) / Z(Ω*) = exp(Φ(wT) · ᾱ) / Z(Ω*)    (6.39)

where the n feature functions φ_i encode counts of features derived from the sequence wT, each of which has a corresponding parameter weight
α_i. The denominator Z simply ensures normalization:

    Z(Ω*) = Σ_{w∈Ω*} Σ_{T∈T^{|w|}} exp(Σ_{i=1}^{n} φ_i(wT) α_i)    (6.40)
where T^{|w|} is the set of tag sequences of length |w|. This may seem like a lot of summing, but if the features are chosen correctly (under a Markov assumption), dynamic programming approaches of the sort we have been discussing make this tractable. If instead of a joint model we use a conditional model, only the normalization changes. The equations become

    P(T | w) = exp(Σ_{i=1}^{n} φ_i(wT) α_i) / Z(w)    (6.41)

where

    Z(w) = Σ_{T∈T^{|w|}} exp(Σ_{i=1}^{n} φ_i(wT) α_i)    (6.42)
If we now take the log of these probabilities, we end up (in the conditional case) with

    log P(T | w) = Σ_{i=1}^{n} φ_i(wT) α_i − log Z(w)    (6.43)

which, modulo the log of the normalization constant, boils down to the linear combination of weighted features. Within this general framework, there are various parameter estimation methods, which provide weights α_i that are optimized in some way. Here we will mention just three: Maximum Entropy models (MaxEnt); Conditional Random Fields (CRF); and the Perceptron algorithm. In the simplest MaxEnt modeling approaches, each tagging decision is modeled as a separate classification problem, thus

    log P(T | w) = Σ_{j=1}^{k} log P(τ_j | w τ_1 . . . τ_{j−1})    (6.44)

where, under a Markov assumption of order one, we estimate

    log P(τ_j | w τ_1 . . . τ_{j−1}) = Σ_{i=1}^{n} φ_i(w τ_{j−1} τ_j) α_i − log Z(w, τ_{j−1})    (6.45)
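The core of Equation 6.43 – a sparse dot product of feature counts and parameter weights – can be sketched as follows. This is illustrative only: the feature names and the tiny weight table are invented for the example, not taken from any of the cited taggers.

```python
from collections import Counter

def loglinear_score(features, weights):
    """Unnormalized log-linear score: sum_i phi_i * alpha_i
    (Equation 6.43 without the log Z(w) term)."""
    return sum(count * weights.get(name, 0.0) for name, count in features.items())

# Hypothetical features extracted from a tagged sequence "the/DT dog/NN".
phi = Counter({"bitag:DT-NN": 1, "word-tag:the/DT": 1, "word-tag:dog/NN": 1})
alpha = {"bitag:DT-NN": 1.2, "word-tag:the/DT": 0.8}   # unseen features weight 0
print(loglinear_score(phi, alpha))                      # 2.0
```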
The normalization constant is here defined over tag sequences such that C_{j−1} has value τ_{j−1}. Ratnaparkhi (1996) presented a MaxEnt POS-tagger, which is still a widely used, competitive tagger. A very similar model, known as Maximum Entropy Markov Models, was presented in McCallum et al. (2000), for information extraction. CRF modeling is very similar to this, but it performs a global normalization, rather than the local normalization of simple MaxEnt models. In other words, instead of including a normalizing constant in Equation 6.45, the normalization Z(w) would occur in Equation 6.44. Lafferty et al. (2001) presents the advantages of this approach over MaxEnt modeling, and the approach has been used profitably for a range of sequence modeling problems, including shallow parsing (Sha and Pereira, 2003), a task which will be briefly presented in the next section. For more details on MaxEnt and CRF modeling, see the cited papers, and the citations they contain. Both MaxEnt and CRF models can be estimated using general numerical optimization techniques. What they try to do is match the expected frequency of features with the actual frequency of features, which involves estimating the expected frequencies of features using techniques similar to the Forward–backward algorithm presented earlier. The third log linear modeling technique we will briefly present is the Perceptron algorithm, which does not calculate expected frequencies but rather bases parameter estimation on the highest scoring analysis. In other words, it uses a Viterbi-like algorithm instead of Forward–backward. The Perceptron algorithm has been around a long time, but we follow Collins (2002) in our presentation. Since the only thing that matters for the Perceptron algorithm is the best scoring output given the input, we can rewrite Equation 6.43 without the normalization and call it a weight (W) instead of a log probability:

    W_t(T | w) = Σ_{i=1}^{n} φ_i(wT) α_i^t    (6.46)

where α_i^t denotes the parameter α_i after t updates. The Perceptron algorithm initializes all α_i^0 as 0. Let T̄ denote the true tag sequence for words w, and let T̂(t) be defined as follows:

    T̂(t) = argmax_{T∈T^{|w|}} W_t(T | w)    (6.47)
Then, after every training example, the parameters α_i^t are updated as follows:

    α_i^t = α_i^{t−1} + φ_i(w T̄) − φ_i(w T̂(t))    (6.48)
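A minimal sketch of the update in Equation 6.48, using Python dictionaries for the sparse feature vectors; the feature extraction itself is left abstract, and the names here are illustrative assumptions rather than the book's code.

```python
def perceptron_update(weights, feats_gold, feats_predicted):
    """alpha_i^t = alpha_i^{t-1} + phi_i(w, T_true) - phi_i(w, T_predicted)."""
    for name, count in feats_gold.items():
        weights[name] = weights.get(name, 0.0) + count
    for name, count in feats_predicted.items():
        weights[name] = weights.get(name, 0.0) - count
    return weights
```

In a full training loop one would decode each sentence with the current weights (a Viterbi-like search), apply this update when the prediction differs from the truth, and, as noted below, average or vote the weights to avoid over-fitting.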
The basic algorithm updates after each sentence for some number of iterations through the training corpus. See Collins (2002) for more details on this approach, as well as empirical results demonstrating improvements over MaxEnt models for POS-tagging. What has not been mentioned in the discussion of log linear models up to this point is smoothing. The reason for smoothing is to provide non-zero probability for unobserved events, and these approaches do that already. If none of the features occur within a word-tag sequence, in other words, there is nothing that has been seen in that sequence pair, then the linear combination of feature counts and parameter weights is zero in the log domain, which corresponds to a non-zero probability. Another (related) reason for smoothing, however, is to avoid over-fitting the training data. One hopes that the training data is a close match to whatever will be tagged, however the methods we have described, if left unchecked, will basically memorize the training data, and lose generalization to new data. Hence there are additional techniques to avoid this. For the MaxEnt and CRF modeling techniques, the most common technique is called regularization, whereby a penalty is associated with moving the absolute parameter weights away from zero. For the Perceptron algorithm, techniques known as voting or averaging are used to the same end. See the cited papers for more details. These log linear modeling approaches are extremely general, and have been applied to language modeling (Rosenfeld, 1996; Roark et al., 2004), POS-tagging (Ratnaparkhi, 1996; Collins and Duffy, 2002), named entity extraction (McCallum and Li, 2003), context-free parsing (Ratnaparkhi, 1997), and a range of re-ranking tasks (M. Johnson et al., 1999; Collins and Duffy, 2002, Charniak and Johnson, 2005). We will discuss them further, particularly in the chapter on syntactic approaches beyond context-free. 6.4 NP Chunking and Shallow Parsing Sequence tagging models as they have been presented in this chapter are directly applicable for non-hierarchical labeled bracketing, which is often called chunking or shallow parsing. The basic idea is that a tag can be associated with beginning positions in a chunk, inside positions,
    [NP the aged bottle ] flies fast          =   B/the I/aged I/bottle O/flies O/fast
    [NP the aged bottle flies ] fast          =   B/the I/aged I/bottle I/flies O/fast
    [NP the aged ] bottle [NP flies ] fast    =   B/the I/aged O/bottle B/flies O/fast

Figure 6.11 Representing NP Chunks as labeled brackets and as B/I/O tags
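A small Python sketch of the bracket-to-tag mapping in Figure 6.11 (an illustration, with a simple list-of-spans input format assumed):

```python
def chunks_to_bio(words, chunk_spans):
    """Convert non-overlapping chunk spans [(start, end), ...] (end exclusive)
    into B/I/O tags, as in Figure 6.11."""
    tags = ["O"] * len(words)
    for start, end in chunk_spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return [f"{t}/{w}" for t, w in zip(tags, words)]

# chunks_to_bio("the aged bottle flies fast".split(), [(0, 3)])
# -> ['B/the', 'I/aged', 'I/bottle', 'O/flies', 'O/fast']
```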
and outside positions. For example, Figure 6.11 shows three different bracketings of base (non-hierarchical) NPs on our example string, both with brackets and with tags on the words. In the first example, the word flies is outside a base NP constituent (because it is the verb); in the second, it is inside a previously started base NP constituent; and in the third it begins a new base NP constituent. By representing these flat brackets as tags, all of the algorithms and modeling approaches discussed for POS-tagging and class-based language modeling apply straightforwardly. One is not bound to just base NP categories either – one might choose to break the sentence into chunks of various categories, as was defined in Tjong Kim Sang and Buchholz (2000). Another variant, known as Named Entity Recognition (NER), typically labels named entities as either people, places, or organizations. Note that, while the algorithms and modeling frameworks remain the same for these various tasks, the features that are useful will differ. Various kinds of orthographic features, or features derived from lists of place or organization names, will have varying utility depending on whether one is performing NER, NP chunking, or some other shallow bracketing task. Fast finite-state partial parsing has been around for some time. The parser described in Abney (1996), does not annotate full hierarchical phrase structure trees, but is orders of magnitude faster than the approaches that do, by virtue of the lower complexity of finite-state algorithms versus context-free. To the extent that these tasks can be cast as a tagging problem, all of the algorithms that we presented in this chapter are applicable. We shall see in the next chapter the additional computation required to do similar inference using context-free grammars. When processing large amounts of speech or text, this difference in efficiency can make the difference between feasible and too costly. 6.5 Summary In this section we have looked at a number of finite-state syntactic models, including n-gram and class-based language models, and part-
of-speech and related tagging models. We have presented three key dynamic programming algorithms which are critical for all such tasks: the Forward, Viterbi, and Forward–backward algorithms. Each of these algorithms has its counterpart for use with context-free grammars, which will be presented in the next chapter. By directly comparing the context-free and the finite-state algorithms, we can make explicit the computational cost of applying the richer models. While falling far short of what syntacticians would consider adequate, finite-state models have the great advantage of being very efficient in practice, often yielding useful annotations (including scores) when more complex models are inapplicable. These models continue to be an area of active research, particularly within the machine learning community. Apart from the brief presentation of log linear models in this chapter, we have not strayed into the rich area of sequence modeling that has produced advances in natural language tagging tasks such as information extraction and shallow parsing. Some finite-state approaches have not been discussed yet, in particular those which are intended as approximations to context-free or “mildly context-sensitive” grammars. The final two chapters will have sections discussing finite-state approximations.
7 Basic Context-free Approaches to Syntax Above finite-state languages in the Chomsky hierarchy come the context-free languages, which cannot be described using finite-state models. In this chapter, we will explore approaches to syntax making use of context-free grammars, which are able to model syntactic dependencies more effectively, though at a very significant efficiency cost.
7.1 Grammars, Derivations and Trees A context-free grammar (CFG) G = (V, T, S † , P ) consists of a set of non-terminal symbols V , a set of terminal symbols T , a start symbol S † ∈ V , and a set of rule productions P of the form: A → ·, where A ∈ V and · ∈ (V ∪ T )∗ . Commonly used non-terminal symbols include NP (noun phrase), VP (verb phrase), and S (sentence); the latter is often specified as the start symbol S † for a CFG. Terminal symbols are typically words (the, run, president, dog, . . . ). An example rule production is S → NP VP, which encodes the rule that an S nonterminal can consist of an NP non-terminal followed by a VP nonterminal. A Weighted CFG (WCFG) G = (V, T, S † , P , Ò) is a CFG plus a mapping Ò : P → R from rule productions to real valued weights. A Probabilistic CFG (PCFG) is a WCFG, where the weights are defined such that for any non-terminal A ∈ V
    Σ_{A→α ∈ P} ρ(A → α) = 1    (7.1)
In other words, a PCFG is a CFG with a probability assigned to each rule, such that the probabilities of all rules with a given non-terminal
    V = {Adj, Adv, Det, N, NP, S, V, VP}
    T = {the, aged, bottle, flies, fast}
    S† = S
    P = { S → NP VP,         VP → V Adv,
          NP → Det Adj N,    Det → the,
          Adj → aged,        N → bottle,
          V → flies,         Adv → fast }

Figure 7.1 A context-free grammar
on the left-hand side sum to one; specifically, each right-hand side has a probability conditioned on the left-hand side of the rule. One way to interpret a context-free rule production A → α is that the non-terminal symbol A can be rewritten as a sequence α of terminals and/or non-terminals. Given a CFG, we can define a rewrites relation between sequences of terminals and non-terminals, which we will denote with the ⇒ symbol:

    βAγ ⇒ βαγ    for any β, γ ∈ (V ∪ T)* if A → α ∈ P    (7.2)

These rewrites can take place in sequence, with the right-hand side of one rewrite in turn rewriting to something else. For example, suppose the following rules are in the grammar: A → B C and B → A C. Then

    β A γ ⇒ β B C γ ⇒ β A C C γ ⇒ β B C C C γ    (7.3)

Let γ ⇒^k β denote a sequence of k rewrite relations, with γ on the left-hand side of the first, and β on the right-hand side of the last. In the above example, we would write β A γ ⇒³ β B C C C γ. We will use γ ⇒* β to signify that γ ⇒^k β for some k ≥ 0; and γ ⇒⁺ β to signify that γ ⇒^k β for some k ≥ 1. We will say that a sequence γ derives a sequence β if γ ⇒* β. A CFG G defines a language L_G = {α | α ∈ T* and S† ⇒* α}, that is, the set of terminal (word) strings that can be derived from the start symbol. To make this more concrete, let us define a small CFG. Figure 7.1 presents a simple CFG G such that the example string in the previous chapter, the aged bottle flies fast, is in the language L_G. By definition, that means that there is a derivation that begins with the start symbol S, and results in the word string. In fact, there are many such derivations, for example the following two:

(1) S ⇒ NP VP ⇒ NP V Adv ⇒ NP V fast ⇒ NP flies fast ⇒ Det Adj N flies fast ⇒ Det Adj bottle flies fast ⇒ Det aged bottle flies fast ⇒ the aged bottle flies fast
(2) S ⇒ NP VP ⇒ Det Adj N VP ⇒ the Adj N VP ⇒ the aged N VP ⇒ the aged bottle VP ⇒ the aged bottle V Adv ⇒ the aged bottle flies Adv ⇒ the aged bottle flies fast
    (S (NP (Det the) (Adj aged) (N bottle)) (VP (V flies) (Adv fast)))

Figure 7.2 Parse tree for example string
Each derivation step is associated with a rule production in the CFG, and these productions can be collectively represented as a parse tree. What differs between these two derivations is which non-terminal categories are chosen to be replaced with the right-hand side of a rule. In derivation (1), which is called the rightmost derivation, the nonterminal category farthest to the right is chosen to be replaced at every step. Thus, the second step in that derivation replaces the VP from the NP VP sequence with its children from the rule VP → V Adv. In derivation (2), which is called the leftmost derivation, the non-terminal category farthest to the left is chosen to be replaced. The second step in derivation (2) replaces the NP instead of the VP, since it is leftmost. Of course, both of these derivations are applying the same rules, just in a different order. In other words, both derivations correspond to the same structural analysis of the string, that is, they correspond to the same parse tree, which is shown in Figure 7.2. To avoid exploring multiple derivations with the same parse tree, a particular derivation strategy, or order, will be selected and adhered to. The form of a CFG can cause it to have particular computational properties. Some sub-classes of CFGs have beneficial properties, and it is often wise to transform a given CFG G to another CFG G that has those properties. Two grammars G and G are strongly equivalent if, modulo differences in non-terminal labels, they derive the same parse trees for any given string. For example, if we changed the nonterminal symbol NP to DP in the grammar from Figure 7.1, but left everything else in the grammar the same, the resulting grammar would be strongly equivalent to the original grammar. Two grammars G and G are weakly equivalent if LG = LG . Transforming a CFG to a weakly equivalent CFG in a particular form can be essential for certain uses of the grammar. One widely used CFG form is Chomsky Normal Form (CNF). A CFG is in CNF if all productions are of the form A → B C or A → w for
A, B, C ∈ V and w ∈ T. In other words, a CFG is in CNF if the only productions that are not binary have a single terminal item on the right-hand side, and terminals are only on the right-hand side of unary rules. A quick inspection will verify that the CFG defined in Figure 7.1 is not in Chomsky Normal Form, because it has a production with more than two children on the right-hand side: NP → Det Adj N. Certain algorithms that we will be discussing later in this chapter require grammars to be in Chomsky Normal Form. It turns out that every CFG G has a weakly equivalent grammar G′ in Chomsky Normal Form. Figure 7.3 shows a Chomsky Normal Form CFG G′ that is weakly equivalent to the CFG G in Figure 7.1. In place of the ternary production NP → Det Adj N from G, there are two productions in G′: NP → Det NP-Det and NP-Det → Adj N, which results in an extra step in any derivation and an extra non-terminal node in the resulting parse tree. To return a parse tree to the format of the original grammar, one simply removes all non-terminal nodes created when transforming G to G′, and attaches the children of those nodes to their parent node. Figure 7.4 shows the two trees that result from using the different grammars. Another common property of grammars that can be computationally undesirable is left-recursion. Productions with the same category on the left-hand side of the production and as the first child on the right-hand side (A → A α) are called left-recursive productions. A grammar is left-recursive if there exists a category A ∈ V such that A ⇒⁺ Aγ for some γ ∈ (V ∪ T)*, that is, a symbol A can rewrite to a sequence of symbols beginning with A in one or more steps of a derivation. For a left-to-right, top-down derivation strategy, which will be discussed in the next section, this can lead to non-termination. There are grammar transforms to remove left-recursion from a grammar, one

    V = {Adj, Adv, Det, N, NP, NP-Det, S, V, VP}
    T = {the, aged, bottle, flies, fast}
    S† = S
    P = { S → NP VP,           VP → V Adv,
          NP → Det NP-Det,     Det → the,
          NP-Det → Adj N,      N → bottle,
          Adj → aged,          Adv → fast,
          V → flies }

Figure 7.3 A context-free grammar in Chomsky Normal Form
    (S (NP (Det the) (Adj aged) (N bottle)) (VP (V flies) (Adv fast)))
        ⇔
    (S (NP (Det the) (NP-Det (Adj aged) (N bottle))) (VP (V flies) (Adv fast)))

Figure 7.4 One-to-one mapping of parse trees corresponding to original and Chomsky Normal Form CFGs
of which, known as the left-corner transform, will also be discussed in the next section. The remainder of this chapter will investigate algorithms for annotating strings using various kinds of CFGs. First, we will present several deterministic parsing approaches, which are well documented in compiler textbooks (such as Aho et al., 1986), since they are most useful for unambiguous languages such as programming languages. After that, we will consider parsing in the face of ambiguity, when WCFGs provide the means to disambiguation. In the next chapter we will present the use of richer statistical models to improve disambiguation.
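As a concrete illustration of a CFG and the derives relation, here is a small Python sketch (not from the book) that encodes the grammar of Figure 7.1 as a dictionary and exhaustively rewrites a symbol sequence; for this recursion-free grammar the process terminates and yields exactly the strings of L_G.

```python
GRAMMAR = {                      # the CFG of Figure 7.1
    "S": [["NP", "VP"]],
    "NP": [["Det", "Adj", "N"]],
    "VP": [["V", "Adv"]],
    "Det": [["the"]], "Adj": [["aged"]], "N": [["bottle"]],
    "V": [["flies"]], "Adv": [["fast"]],
}

def derive_all(symbols, grammar):
    """All terminal strings derivable from a sequence of symbols (S =>* alpha).
    Only safe for grammars without recursion, like Figure 7.1."""
    if not symbols:
        return [[]]
    first, rest = symbols[0], symbols[1:]
    if first not in grammar:                      # terminal symbol
        return [[first] + tail for tail in derive_all(rest, grammar)]
    results = []
    for rhs in grammar[first]:                    # apply A -> alpha, then continue
        results += derive_all(list(rhs) + rest, grammar)
    return results

# derive_all(["S"], GRAMMAR) -> [['the', 'aged', 'bottle', 'flies', 'fast']]
```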
7.2 Deterministic Parsing Algorithms Students and researchers in Computational Linguistics and Natural Language Processing are among the most avid consumers of classic textbooks on compilers, since these textbooks have the most explicit presentation of the fundamentals of context-free parsing. Of course, parsing and compiling an unambiguous artificial language is far from parsing of natural language. Nevertheless, it is very helpful to understand the standard algorithms, since in many cases these algorithms will form the foundation upon which disambiguation approaches are built. The presentation in this section covers material that is presented in far more detail in Aho et al. (1986). For the three main parsing algorithms that we discuss in this section, we will not present formal algorithms, but will rather present an informal explanation of each algorithm with an example derivation. After covering the foundations of deterministic parsing, we will present methods for handling non-determinism.
Table 7.1 Shift–reduce parsing derivation

    Step   Stack              Input buffer                   Move
    0                         the aged bottle flies fast     shift “the”
    1      the                aged bottle flies fast         reduce (Det → the)
    2      Det                aged bottle flies fast         shift “aged”
    3      Det aged           bottle flies fast              reduce (Adj → aged)
    4      Det Adj            bottle flies fast              shift “bottle”
    5      Det Adj bottle     flies fast                     reduce (N → bottle)
    6      Det Adj N          flies fast                     reduce (NP → Det Adj N)
    7      NP                 flies fast                     shift “flies”
    8      NP flies           fast                           reduce (V → flies)
    9      NP V               fast                           shift “fast”
    10     NP V fast                                         reduce (Adv → fast)
    11     NP V Adv                                          reduce (VP → V Adv)
    12     NP VP                                             reduce (S → NP VP)
    13     S
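For readers who want to experiment, the following Python sketch reproduces the greedy shift–reduce loop traced in Table 7.1 (reduce whenever the top of the stack matches a right-hand side, otherwise shift). It is an illustration written for this example, with the Figure 7.1 rules inverted into a right-hand-side lookup; it is not a general LR parser.

```python
RULES = [("Det", ["the"]), ("Adj", ["aged"]), ("N", ["bottle"]),
         ("V", ["flies"]), ("Adv", ["fast"]),
         ("NP", ["Det", "Adj", "N"]), ("VP", ["V", "Adv"]), ("S", ["NP", "VP"])]

def shift_reduce(words):
    """Greedy shift-reduce derivation for the grammar of Figure 7.1."""
    stack, buffer = [], list(words)
    moves = []
    while True:
        for lhs, rhs in RULES:                      # reduce if any RHS is on top
            if stack[-len(rhs):] == rhs:
                stack[-len(rhs):] = [lhs]
                moves.append(f"reduce ({lhs} -> {' '.join(rhs)})")
                break
        else:
            if not buffer:                          # nothing to reduce or shift
                break
            stack.append(buffer.pop(0))             # shift next word
            moves.append(f"shift {stack[-1]!r}")
    return stack, moves

# shift_reduce("the aged bottle flies fast".split()) ends with stack == ['S']
```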
7.2.1 Shift-reduce Parsing Shift–reduce parsing is a bottom up derivation strategy, that is, it starts from the words in the string, and tries to work upwards towards the root symbol in the grammar. There are two data structures in this algorithm: an input buffer and a stack. As you may have deduced from the name of the algorithm, there are two operations: shifting and reducing. Shifting moves the next symbol in the input buffer to the top of the stack. Reducing takes the top k symbols from the stack, which correspond to the right-hand side of some rule production in the grammar, and replaces them with the symbol on the left-hand side of that production. The algorithm starts with the buffer pointing to the first word in the string and an empty stack. It successfully terminates when the input buffer is empty and the stack contains only the start symbol. Table 7.1 shows the action of a shift–reduce parser using the grammar in Figure 7.1. In this algorithm, if anything can be reduced at the top of the stack, it is reduced, before the next symbol is shifted. The algorithm begins with the whole string on the input buffer, and the first word is shifted. It is reduced to its POS-tag. The next two words are shifted and reduced to their POS-tags. These three tags correspond to the right-hand side of the rule production with NP on the left-hand side, so they are collectively reduced to NP. And so on. The final step of the derivation, as with all successful shift–reduce derivations, is the
reduction to the start non-terminal. The sequence of rules used for reduction collectively corresponds to the parse tree of the utterance. With more realistic grammars, even for unambiguous languages, there can be points of ambiguity in the derivation, where it is unclear whether shifting or reducing is the right move. To resolve such ambiguities, some number of look-ahead words (yet to be shifted from the input buffer) can be used to decide what move should be made. For situations such as this, a lookup table can be constructed that specifies the move to make for a given stack and input buffer state. Some grammars may only need one word of look–ahead, others more in order to completely disambiguate. A grammar for which a deterministic lookup table can be built for shift–reduce parsing using k words of look-ahead is called an LR(k) grammar. The L stands for left-to-right processing of the input buffer, and the R stands for the rightmost derivation, which is what a shift–reduce parser follows. See Aho et al. (1986) for more details of the standard shift–reduce approach, including building lookup tables. 7.2.2 Pushdown Automata Now that we have seen a basic shift–reduce algorithm, we can make an interesting comparison with finite-state models. There are context-free grammars that cannot be encoded by finite-state automata; but that does not mean that they cannot be encoded by an automaton. Rather, it means that the automaton will not have a finite number of states. If we augment each state in an automaton with a stack variable, which can encode the state of the stack in the shift–reduce algorithm, this is sufficient to model context-free languages. For most interesting cases, such a pushdown automaton is not finite-state. Figure 7.5 shows the first eight moves in the shift–reduce derivation from Table 7.1, represented as a pushdown automaton. Shift moves have the input word being shifted as a label on the transition. Reduce moves have an  labeled transition. Each state corresponds to a pushdown stack, with the rightmost symbol on the stack being topmost. This automaton representation is useful for several reasons. First, it can improve intuitions about the differences between finite-state and context-free grammars. If a context-free grammar can be encoded into a pushdown automaton that has a finite number of states, then the language that it describes is a finite-state language. This is the case for certain subclasses of the set of context-free grammars, such as
    [] —the→ [the] —ε→ [Det] —aged→ [Det aged] —ε→ [Det Adj] —bottle→ [Det Adj bottle] —ε→ [Det Adj N] —ε→ [NP] —flies→ . . .

Figure 7.5 The initial moves in the shift–reduce derivation, encoded as a pushdown automaton
Table 7.2 Top-down parsing derivation

    Step   Stack           Input buffer                  Move
    0      S               the aged bottle flies fast    predict (S → NP VP)
    1      VP NP           the aged bottle flies fast    predict (NP → Det Adj N)
    2      VP N Adj Det    the aged bottle flies fast    predict (Det → the)
    3      VP N Adj the    the aged bottle flies fast    match “the”
    4      VP N Adj        aged bottle flies fast        predict (Adj → aged)
    5      VP N aged       aged bottle flies fast        match “aged”
    6      VP N            bottle flies fast             predict (N → bottle)
    7      VP bottle       bottle flies fast             match “bottle”
    8      VP              flies fast                    predict (VP → V Adv)
    9      Adv V           flies fast                    predict (V → flies)
    10     Adv flies       flies fast                    match “flies”
    11     Adv             fast                          predict (Adv → fast)
    12     fast            fast                          match “fast”
    13
left-linear or right-linear grammars. Those grammars for which there is no finite-state pushdown automaton cannot be exactly represented with a finite-state grammar. One might try to approximate such a grammar, however, by representing only some of the stack at a state (e.g., the top k symbols), which may result in a finite number of states in the approximate pushdown automaton. Finite-state approximations to context-free grammars is a topic that will be presented in Section 8.5. 7.2.3 Top-down and Left-corner Parsing The bottom-up shift–reduce algorithm just presented begins at the “bottom” of the tree, at the words on its periphery (or frontier), and succeeds when it reaches the start symbol on the stack and an empty input buffer. A top-down parser does the reverse: it begins with the start symbol on the stack and succeeds when it reaches the end of the input buffer and an empty stack. We can present this algorithm in a table much as we did for shift-reduce parsing. Table 7.2 shows the top-down derivation for our example string using the grammar from Figure 7.1. The derivation begins with the start symbol (S) on the top of the stack, and the look-ahead pointer at the first word (the) in the input buffer. There is only one rule with S on the left-hand side (S → NP VP), so the categories on the right-hand side are pushed onto the stack in reverse order. That puts the category NP at the top of the stack. It, too, has just one rule production, the
categories on the right-hand side of which are pushed onto the stack in reverse order. The right-hand side of the next predicted production matches the look-ahead word, hence the look-ahead pointer advances to the next word in the buffer. Moves continue until the stack is empty and the pointer has advanced past the last word, which is the condition of success. As with shift–reduce parsing, a grammar can be characterized by how many words of look-ahead are required to construct a deterministic lookup table for choosing the correct move. In this case, since the resulting derivation is leftmost, a grammar is called LL(k) if such a table can be built with k words of lookahead, signifying a Left-to-right, Leftmost derivation. Since the topic of this book is natural languages, algorithms for building deterministic parsing tables are interesting but not of great utility, since the ambiguity of the language is such that no such table can be built for most interesting grammars. For parsing natural languages, we will need to consider non-deterministic parsing algorithms, which is the topic of the next section. We thus again refer the reader to Aho et al. (1986) for more details on this topic. Before we move on to non-deterministic parsing algorithms, however, there is one remaining issue with top-down parsing that we must present: how to deal with left-recursion. As mentioned earlier, a gram+ mar is left-recursive if for some A ∈ V and „ ∈ (V ∪ T )∗ , A ⇒ A„, that is, a derivation of one or more steps rewrites some symbol A to a sequence of symbols beginning with A. The simplest example of this is if the grammar includes a left-recursive production, such as NP → NP PP. Consider what the derivation in Table 7.2 would look like, if this were the chosen rule at step 2. The resulting stack would be [VP PP NP] with the look-ahead unmoved. Since NP is still the topmost category on the stack, we can apply the same rule again, giving us the stack [VP PP PP NP], and so on ad infinitum. This is a big problem for grammars describing languages that in fact do have left-recursive structures. One solution to the problem of left-recursion is to transform a left-recursive grammar to produce a weakly equivalent grammar with no left-recursion. Several grammar transformations exist that will do this (see Moore, 2000, for a comparison of several). Another solution to left-recursion is to follow a derivation strategy which is not prey to non-termination in the face of left-recursion, such as the shift– reduce algorithm. Another derivation strategy, known as left-corner parsing (Rosenkrantz and Lewis, 1970), is closely related to top-down
Table 7.3 Left-corner parsing derivation (predicted symbols marked with *, announced symbols unmarked)

Step  Stack                     Input buffer                 Move
0     *S                        the aged bottle flies fast   shift "the"
1     *S the                    aged bottle flies fast       announce (Det → the)
2     *S Det                    aged bottle flies fast       announce (NP → Det Adj N)
3     *S NP *N *Adj             aged bottle flies fast       shift "aged"
4     *S NP *N *Adj aged        bottle flies fast            announce (Adj → aged)
5     *S NP *N *Adj Adj         bottle flies fast            match Adj
6     *S NP *N                  bottle flies fast            shift "bottle"
7     *S NP *N bottle           flies fast                   announce (N → bottle)
8     *S NP *N N                flies fast                   match N
9     *S NP                     flies fast                   announce (S → NP VP)
10    *S S *VP                  flies fast                   shift "flies"
11    *S S *VP flies            fast                         announce (V → flies)
12    *S S *VP V                fast                         announce (VP → V Adv)
13    *S S *VP VP *Adv          fast                         shift "fast"
14    *S S *VP VP *Adv fast                                  announce (Adv → fast)
15    *S S *VP VP *Adv Adv                                   match Adv
16    *S S *VP VP                                            match VP
17    *S S                                                   match S
18
parsing, yet skips over left-recursive structures, hence avoiding the non-termination problems associated with them. Table 7.3 shows the left-corner parse of our example string. For this algorithm, we have two kinds of symbols on the stack, those which are predicted to be found and those which have been found (or “announced”). Predicted symbols are marked with an asterisk and announced symbols are left unmarked. The basic idea behind left-corner parsing is that when the leftmost child of a production is completed, the parent of the category is announced, and the remaining children of the production are predicted. When a predicted symbol and an announced symbol of the same category are together at the top of the stack, we can match them and remove them from the stack. We begin the derivation in Table 7.3 with a predicted S category. We jump down to the left-corner and shift “the” onto the stack. This is the leftmost child of the production Det → the, so the parent (Det) is announced, and since there are no other children, nothing else is predicted. The Det symbol is the leftmost child of the production NP → Det Adj N, so the NP category is announced, and the remaining children are predicted and pushed (in reverse order) onto the stack.
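To make the three moves concrete, the following small Python sketch (our own illustration, not from the original presentation; the grammar encoding and function names are ours) implements shift, announce, and match as a backtracking recognizer over the toy grammar used in this chapter, collapsing the shift of a word and the announcement of its part of speech into a single step:

# A minimal backtracking left-corner recognizer for the toy grammar used in
# this chapter. Predicted categories are tagged "pred", announced ones "ann".
# Illustrative sketch only.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "Adj", "N"]],
    "VP":  [["V", "Adv"]],
    "Det": [["the"]],  "Adj": [["aged"]], "N": [["bottle"]],
    "V":   [["flies"]], "Adv": [["fast"]],
}

def lc_recognize(words, goal="S"):
    def step(stack, i):
        # success: empty stack and all words consumed
        if not stack:
            return i == len(words)
        top = stack[-1]
        # match: announced category on top of the matching predicted category
        if (len(stack) >= 2 and top[0] == "ann"
                and stack[-2] == ("pred", top[1])):
            if step(stack[:-2], i):
                return True
        # announce: the completed category is the leftmost child of some rule;
        # pop it, push the announced parent plus the remaining (predicted)
        # children in reverse order
        if top[0] == "ann":
            for parent, rhss in GRAMMAR.items():
                for rhs in rhss:
                    if rhs[0] == top[1]:
                        rest = [("pred", c) for c in reversed(rhs[1:])]
                        if step(stack[:-1] + [("ann", parent)] + rest, i):
                            return True
        # shift: consume the next word and announce its part of speech
        if i < len(words):
            for parent, rhss in GRAMMAR.items():
                if [words[i]] in rhss:
                    if step(stack + [("ann", parent)], i + 1):
                        return True
        return False

    return step([("pred", goal)], 0)

print(lc_recognize("the aged bottle flies fast".split()))  # True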
Figure 7.6 Tree resulting from a left-corner grammar transform
Once these predicted nodes are found, the announced NP category will have been completed. Step 4 of the derivation has the word “aged” being shifted onto the stack, and then the Adj category is announced. The stack at step 5 has an announced category (Adj) at the top and a predicted category (Adj) next to it, so these can be matched and removed from the stack. The Adj that was predicted is found, and we can move on. In a similar fashion, “bottle” is shifted and matched with the predicted N. At this point (step 9), the NP that was announced at step 3 is at the top of the stack, meaning that it has been completed. This category is the leftmost child in the production S → NP VP, so an S category is announced, and the VP is predicted. Because this algorithm waits until the leftmost child is completed before predicting the parent, it can handle left-recursive grammars with no problem. Interestingly, it has been recognized since the original publication of this algorithm (Rosenkrantz and Lewis, 1970) that there is a left-corner grammar transform that takes a CFG G and returns another weakly equivalent CFG G that, when used in a top-down parsing algorithm, results in the same derivation steps as a left-corner derivation with the original grammar. This is one of the grammar transforms that removes left-recursion from the grammar. Before showing the transform itself, Figure 7.6 presents our example syntactic structure as it would appear with the left-corner transformed grammar and Table 7.4 provides the corresponding top-down
Table 7.4 Top-down derivation with a left-corner grammar

Step  Stack                    Input buffer                 Rule
0     S                        the aged bottle flies fast   S → the S-Det
1     S-Det the                the aged bottle flies fast   match "the"
2     S-Det                    aged bottle flies fast       S-Det → Adj N S-NP
3     S-NP N Adj               aged bottle flies fast       Adj → aged Adj-Adj
4     S-NP N Adj-Adj aged      aged bottle flies fast       match "aged"
5     S-NP N Adj-Adj           bottle flies fast            Adj-Adj → ε
6     S-NP N                   bottle flies fast            N → bottle N-N
7     S-NP N-N bottle          bottle flies fast            match "bottle"
8     S-NP N-N                 flies fast                   N-N → ε
9     S-NP                     flies fast                   S-NP → VP S-S
10    S-S VP                   flies fast                   VP → flies VP-V
11    S-S VP-V flies           flies fast                   match "flies"
12    S-S VP-V                 fast                         VP-V → Adv VP-VP
13    S-S VP-VP Adv            fast                         Adv → fast Adv-Adv
14    S-S VP-VP Adv-Adv fast   fast                         match "fast"
15    S-S VP-VP Adv-Adv                                     Adv-Adv → ε
16    S-S VP-VP                                             VP-VP → ε
17    S-S                                                   S-S → ε
18
derivation for that grammar, which can be compared with the derivation in Table 7.3. One can see, by inspection, that the stack representation is nearly identical, modulo slight notational differences. Pairs of predicted and announced categories in Table 7.3 are now represented as new hyphenated non-terminals, and the left-corner “match” move is represented by an empty ε-production. The left-corner grammar transform takes a CFG G = (V, T, S†, P), and returns a new weakly equivalent grammar G′ = (V′, T, S†, P′) where V′ = V ∪ {A-X | A, X ∈ V}. These new non-terminals include the composite labels that can be seen in the tree in Figure 7.6, such as S-Det and S-NP. The new set of rule productions P′ is derived from the old set P by including all rules that meet the following conditions:

A → w A-B      for w ∈ T and A, B ∈ V and B → w ∈ P
A-B → γ A-C    for C → B γ ∈ P
A-A → ε        for A ∈ V
(7.4)
Note that some of the productions created by this transform will never be used in a derivation, for example, if a terminal item w can never be the leftmost terminal dominated by a non-terminal A. Techniques exist
Figure 7.7 The order of node recognition in the (a) shift–reduce; (b) top-down; and (c) left-corner parsing algorithms
to prune the resulting grammar to only include viable productions. Further, the left-recursion can be removed from a grammar by applying the transform to only a subset of the productions in the grammar. See M. Johnson and Roark (2000) and Moore (2000) for details addressing these issues. It is a worthwhile exercise to compare these three deterministic parsing algorithms in terms of when in the derivation particular nodes in a parse tree are recognized. Figure 7.7 shows the order in which nodes are recognized for the three algorithms that we have presented. In shift–reduce parsing, a parent node is recognized after all of its children have been recognized; in top-down parsing, a parent node is recognized before any of its children have been recognized; and in leftcorner parsing, a parent node is recognized after its first child has been recognized, but before any of the subsequent children are recognized. A natural question that arises from this comparison is whether there are other orders one might follow in parsing, such as recognizing the parent after all but the last child have been recognized. In fact, there is a continuum of parsing strategies, many of which can be implemented by simply transforming the grammar. We refer the reader to Nijholt (1980) for, among other things, an excellent presentation of these issues.
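As an illustration of the left-corner grammar transform in (7.4), the following Python sketch (our own; the rule encoding and names are illustrative, and no pruning of useless productions is attempted) generates the transformed rules for an unweighted version of the running grammar, producing rules such as S → the S-Det and S-Det → Adj N S-NP:

# Sketch of the left-corner grammar transform of (7.4) for an unweighted CFG.
# Rules are (lhs, rhs) pairs; terminals are lowercase words. Illustrative only.
def left_corner_transform(rules, nonterminals):
    terminals = {sym for _, rhs in rules for sym in rhs if sym not in nonterminals}
    out = []
    # A -> w A-B   for every terminal w with a lexical rule B -> w
    for b, rhs in rules:
        if len(rhs) == 1 and rhs[0] in terminals:
            w = rhs[0]
            for a in nonterminals:
                out.append((a, [w, f"{a}-{b}"]))
    # A-B -> gamma A-C   for every rule C -> B gamma
    for c, rhs in rules:
        if rhs and rhs[0] in nonterminals:
            b, gamma = rhs[0], rhs[1:]
            for a in nonterminals:
                out.append((f"{a}-{b}", gamma + [f"{a}-{c}"]))
    # A-A -> epsilon
    for a in nonterminals:
        out.append((f"{a}-{a}", []))
    return out

RULES = [("S", ["NP", "VP"]), ("NP", ["Det", "Adj", "N"]), ("VP", ["V", "Adv"]),
         ("Det", ["the"]), ("Adj", ["aged"]), ("N", ["bottle"]),
         ("V", ["flies"]), ("Adv", ["fast"])]
NONTERMS = {"S", "NP", "VP", "Det", "Adj", "N", "V", "Adv"}
for lhs, rhs in left_corner_transform(RULES, NONTERMS):
    print(lhs, "->", " ".join(rhs) or "ε")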
7.3 Non-deterministic Parsing Algorithms Having covered some parsing preliminaries, we can now begin to address some of the issues that make parsing of natural language so challenging and interesting, at least to some. First, let us remember from the previous chapter that the string we have been repeatedly parsing (“the aged bottle flies fast”) is in fact an ambiguous sequence with (at least) three possible syntactic structures, corresponding to three different readings. For ease of illustration, in the previous section we
The three trees, in bracketed form:
(S (NP (Det the) (Adj aged) (N bottle)) (VP (V flies) (Adv fast)))
(S (NP (Det the) (Adj aged) (N bottle) (N flies)) (VP (V fast)))
(S (NP (Det the) (Adj aged)) (VP (V bottle) (NP (N flies)) (Adv fast)))

Figure 7.8 Three possible parse trees for the example sentence
assumed that there was just one. Now we will expand our grammar to account for all three, though still just focusing on this particular string. To maintain simplicity, we will present three trees corresponding to the three readings of this sentence and then extract the rule productions from those trees. First, though, a disclaimer. These trees are not intended to make any strong statement about what is the correct structure of these analyses. That debate is best left to syntacticians. There are some characteristics of these trees that will likely leave many if not most (perhaps all) syntacticians dissatisfied. For example, flat NPs and flat VP modification are not what one will typically see in current textbooks on syntactic theory. Whether or not the grammar is “correct,” however, is irrelevant for what we are presenting here, which takes a given CFG as a starting point. We choose these flat structures for pedagogical reasons and because certain common corpora – such as the Penn Treebank – adopt relatively flat constituent structures. More on these flat treebanks later. Figure 7.8 shows the three syntactic structures that correspond to the three readings of the string (bottle flying, flies fasting, and oldage pensioners bottling). We can induce a CFG by reading off the rules needed to construct each tree in this collection of trees. To make things more interesting, though, we will add one rule that does not occur in the trees: VP → V S. To make this probabilistic, we will assign a uniform distribution over the possible right-hand sides given the lefthand side. This gives us the rule productions and probabilities shown in Figure 7.9. Since our non-terminal set V, terminal set T, and start symbol S † remain unchanged, this new set of productions and their associated probabilities are all that differentiate the PCFG in Figure 7.9 from the CFG that we were using before. This is a simple change, but the result is that we cannot use the simple parsing algorithms presented in the previous section, since the same string now has multiple valid analyses. Hence there is no deterministic parsing table that can tell the algorithm,
LHS → RHS (probability)
S → NP VP (1)
NP → Det Adj (1/4)        NP → Det Adj N (1/4)
NP → Det Adj N N (1/4)    NP → N (1/4)
VP → V (1/4)              VP → V Adv (1/4)
VP → V NP Adv (1/4)       VP → V S (1/4)
Det → the (1)             Adj → aged (1)          Adv → fast (1)
N → bottle (1/2)          N → flies (1/2)
V → bottle (1/3)          V → flies (1/3)         V → fast (1/3)

Figure 7.9 Rule productions from a PCFG covering example trees
for a particular parsing state, which action to follow, since there will be states from which multiple actions can successfully follow. For example, in Table 7.1, step 5 of the shift–reduce derivation can no longer deterministically be to shift the word “bottle,” since a reduce (NP → Det Adj) is also a move that will lead to a successful derivation, one with a different tree structure. The same problem arises for the other derivation strategies. One additional note on PCFGs: the probability of a tree is the product of the probabilities of all rule productions used in building the tree. For example, the first tree in Figure 7.8 uses eight rule productions.¹ We leave it as an exercise to the reader to verify that the product of these rule probabilities is 1/96, that is, the tree probability is 1/96 given this PCFG. The rest of this section will look at algorithms for exploring the space of possible parses for a particular string. First we will look at extensions to the algorithms we have just discussed that enable handling of nondeterminism. Next, we will look at dynamic programming approaches which efficiently search the space of possible analyses, in much the same way as was done for finite-state models from the previous chapter. 7.3.1 Re-analysis and Beam-search There are two broad strategies for dealing with multiple analyses within the parsing algorithms that were presented in the last section. First, when faced with multiple possible actions, one may choose one of them and follow it until reaching success or failure; in the case of failure, or
¹ S → NP VP, NP → Det Adj N, VP → V Adv, and the 5 POS-tag-to-word productions.
if more than one parse is desired, one can then go back to previous stages of the derivation and pursue other actions than the one that was chosen. This approach often goes under the name of re-analysis, since it involves re-analyzing the string. While this sort of strategy has been discussed at length in the psycholinguistic literature as a possible mechanism of human sentence processing (as in, for example, Frazier and Fodor, 1978; Lewis, 1998), it is less common in parser implementations. One reason for this is that in the absence of outright failure of a derivation, it can be difficult to define effective objective criteria for triggering re-analysis. In many cases, effective probabilistic grammars lead to overgeneration of a sort that can be challenging for re-analysis approaches, since outright failure is rare. The second approach maintains multiple derivations simultaneously and collectively moves them forward in parallel. Usually the number of derivations that are maintained is limited to include just a subset of the total possible, that is, a fixed beam is maintained, and when too many derivations are found, some are not pursued. The reason why the number of derivations is typically limited is that with ambiguous grammars the number of derivations grows exponentially with the length of the string. This approach has been pursued successfully even for very large grammars, using shift–reduce (Ratnaparkhi, 1997, 1999; Chelba and Jelinek, 1998, 2000), top-down (Roark and Johnson, 1999; Roark, 2001, 2004; Collins and Roark, 2004), and left-corner (Manning and Carpenter, 1997; Roark and Johnson, 1999; Henderson, 2003, 2004) parsing approaches. In all of these cases, heuristic scores derived from a weighted grammar were used to decide which derivations to keep within the beam and which to discard. Each derivation maintained within the beam continues its individual derivation using moves as presented in the previous section. In practice, the exponential growth of the number of derivations results in an astronomical number of distinct derivations for sentences of even moderate length. This effectively precludes using these parsing approaches as-is to explore every possible parse, for example, to find the parse with the highest score given a weighted grammar. Beamsearch provides no guarantee of finding the highest scoring analysis. One option for this problem is to use a more informative weight for a derivation, which includes an accurate bound on what the final weight for the derivation will be, known as A∗ search. A second solution is to use dynamic programming techniques, such as those used in the previous chapter with POS-tag models. These algorithms include the
Span 5:  S (start index 0)
Span 4:
Span 3:  NP (start index 0)
Span 2:  VP (start index 3)
Span 1:  Det (the) at 0   Adj (aged) at 1   N (bottle) at 2   V (flies) at 3   Adv (fast) at 4
Start index:  0  1  2  3  4

Figure 7.10 A chart representation of the first tree in Figure 7.8
well-known Cocke-Younger-Kasami (CYK) algorithm 2 and the Earley algorithm. 7.3.2 CYK Parsing The CYK algorithm was discovered independently by the three namesakes of the algorithm (Cocke and Schwartz, 1970; Younger, 1967; Kasami, 1965). It is a dynamic programming approach that was originally intended for use with an unweighted CFG, but which is most commonly used now with PCFGs to find the most likely parse. The basic concept in CYK parsing is the constituent span. For every non-terminal category in a parse, there is a label and a span of words that fall under it. In the first parse tree in Figure 7.8, there is a constituent with label NP that spans the words “the aged bottle”; and there is a constituent with label S that spans all of the words in the string. We can associate each word boundary with an index, for example, index 0 before “the”; index 1 between “the” and “aged”; and so on. We can restate our constituent spans in terms of these indices: there is an NP from 0 to 3, and an S from 0 to 5. Alternatively, we can represent a constituent as a label, a start point, and a length of span: an NP starting at 0 of span three and an S starting at 0 of span five. 2
This is also commonly denoted CKY.
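As a small illustration of these two equivalent encodings (our own sketch; the tuple layout is merely one possible convention):

# Two equivalent ways to record the NP constituent over "the aged bottle"
# (word boundaries indexed 0..5); an illustrative sketch, not from the text.
by_endpoints = ("NP", 0, 3)        # label, start index, end index
by_span      = ("NP", 0, 3 - 0)    # label, start index, span (length)

def to_span(label, start, end):
    """Convert (label, start, end) to (label, start, span)."""
    return (label, start, end - start)

assert to_span(*by_endpoints) == by_span
assert to_span("S", 0, 5) == ("S", 0, 5)   # S spans the whole string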
Figure 7.10 shows a simple two dimensional table representation, widely known as a chart (Kay, 1986), of the first tree in Figure 7.8. The vertical y-axis of the chart encodes the span of constituents in the row, and the horizontal x-axis encodes the start position of constituents in the column. Hence, we have a VP of span two at start position 3; an NP of span three at start position 0; and an S of span five at start position 0. The CYK parsing algorithm that we will be presenting performs dynamic programming over such a chart structure to find the maximum likelihood parse given a particular PCFG. As with other dynamic programming approaches, CYK parsing relies upon being able to partition the optimal solution into independent optimal subsolutions. In the case of a PCFG, the probability of the children is conditioned only on the category of the parent; hence, if the maximum likelihood parse for the string contains an NP node of span three starting at position 0, that NP must be the maximum likelihood constituent from among all ways of building an NP with that span and starting position. To simplify the algorithm, the basic CYK parser requires a grammar in Chomsky Normal Form (CNF). Recall that CNF grammars are those that meet three conditions: (1) all terminals (words) are only on the right-hand side of unary productions; (2) only words are on the righthand side of unary productions; and (3) all other productions are binary. The PCFG in Figure 7.9 has all words only on the right-hand side of unary productions, so condition (1) is met. However, there are nonterminals on the right-hand side of unary productions: VP → V and NP → N, so condition (2) is not met. Further, there are productions with more than two non-terminals on the right-hand side, so condition (3) is not met. In order to use the CYK algorithm, we will first need to transform this grammar into a weakly equivalent grammar in Chomsky Normal Form, which is possible for all CFGs. The tree returned from the parser will be de-transformed back into the format of the original grammar. Note that preserving the probability distribution over trees defined by the PCFG is an important consideration when performing the transform. We will perform the transform in two steps. First, we will transform the grammar to meet condition (2); then we will further transform the grammar to meet condition (3). For every unary rule A → B, with A, B ∈ V and probability p1 , we will perform the following five steps:
1. Add a non-terminal A↓B to V.
2. Remove A → B from P.
3. For every production C → α A γ in P with probability p2, where α, γ ∈ V∗:
   (a) add a production to P of the form C → α A↓B γ with probability p1 p2;
   (b) set P(C → α A γ) = (1 − p1) p2.
4. For every rule production in P of the form A → α with probability p3, set P(A → α) = p3/(1 − p1).
5. For every rule production in P of the form B → α, add productions to P of the form A↓B → α such that P(A↓B → α) = P(B → α).

Note that there is a potential problem with this approach if there are unary cycles in the original grammar, that is, if there are two symbols A, B ∈ V such that A ⇒+ B and B ⇒+ A. This is a bad idea in general, and luckily the grammar in Figure 7.9 has no such cycles. For that grammar, two new symbols are added to the non-terminal set (NP↓N and VP↓V) and a number of new productions. The second part of the transform is to factor the grammar to convert n-ary productions with n > 2 to n−1 binary productions. For every n-ary rule (n > 2) A → B1 . . . Bn−1 Bn with probability p, we will perform the following three steps:
1. Add a non-terminal B1-. . .-Bn−1 to V.
2. Replace the n-ary production A → B1 . . . Bn−1 Bn with the binary production A → B1-. . .-Bn−1 Bn in P, keeping the same probability p.
3. Add the production B1-. . .-Bn−1 → B1 . . . Bn−1 to P with probability 1.
For the grammar in Figure 7.9 this results in three new non-terminals: V-NP, Det-Adj, and Det-Adj-N. Figure 7.11 shows the new productions and probabilities for the resulting Chomsky Normal Form PCFG that is weakly equivalent to the PCFG in Figure 7.9. The transform resulted in nine more rules than the original PCFG. Now that we have transformed our grammar into Chomsky Normal Form, we can use the CYK algorithm to find the parses for the string. We will take a first step towards this algorithm by first showing an algorithm for adding all possible constituents (or edges) to a chart. Figure 7.12 shows the chart for this example string, with
S → NP VP (9/16)          S → NP VP↓V (3/16)
S → NP↓N VP (3/16)        S → NP↓N VP↓V (1/16)
VP → V Adv (1/3)          VP → V-NP Adv (1/4)
VP → V-NP↓N Adv (1/12)    VP → V S (1/3)
NP → Det Adj (1/3)        NP → Det-Adj N (1/3)      NP → Det-Adj-N N (1/3)
V-NP → V NP (1)           V-NP↓N → V NP↓N (1)
Det-Adj → Det Adj (1)     Det-Adj-N → Det-Adj N (1)
Det → the (1)             Adj → aged (1)            Adv → fast (1)
N → bottle (1/2)          N → flies (1/2)
NP↓N → bottle (1/2)       NP↓N → flies (1/2)
V → bottle (1/3)          V → flies (1/3)           V → fast (1/3)
VP↓V → bottle (1/3)       VP↓V → flies (1/3)        VP↓V → fast (1/3)

Figure 7.11 Chomsky Normal Form PCFG weakly equivalent to the grammar in Figure 7.9
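The factorization step that introduces composite non-terminals such as Det-Adj and Det-Adj-N can be sketched in a few lines of Python (our own illustration; only the binarization of n-ary rules is shown, not the removal of unary productions):

# Sketch of the n-ary binarization step described above: each rule
# A -> B1 ... Bn (n > 2) is replaced by A -> B1-...-B(n-1) Bn (same
# probability) plus B1-...-B(n-1) -> B1 ... B(n-1) with probability 1,
# applied repeatedly until all rules are binary. Illustrative only.
def binarize(rules):
    """rules: list of (lhs, rhs_tuple, prob); returns a binarized list."""
    out, seen = [], set()
    stack = list(rules)
    while stack:
        lhs, rhs, p = stack.pop()
        if len(rhs) <= 2:
            out.append((lhs, rhs, p))
            continue
        composite = "-".join(rhs[:-1])
        out.append((lhs, (composite, rhs[-1]), p))
        if composite not in seen:          # add the helper rule only once
            seen.add(composite)
            stack.append((composite, rhs[:-1], 1.0))
    return out

rules = [("NP", ("Det", "Adj", "N", "N"), 1.0 / 3)]
for lhs, rhs, p in binarize(rules):
    print(lhs, "->", " ".join(rhs), p)
# NP -> Det-Adj-N N ; Det-Adj-N -> Det-Adj N ; Det-Adj -> Det Adj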
rows associated with spans of a certain length, and columns associated with starting word indices. Each chart entry consists of a 4-tuple (A, α, ψ, k), consisting of: the non-terminal category A ∈ V; the probability α; the backpointer ψ = (m, i, j), which identifies the children constituents via a midpoint m, and the indices of the first (i) and second (j) children; and the index k. This is enough information to reconstruct the tree that corresponds to the maximum likelihood parse. For example, the first entry in the cell with start position 0 and span three is [Det-Adj-N, 1/2, (2,1,1), 1], which has a probability of 1/2. The backpointer for this item has a midpoint m = 2. Since its span is three and its start index is 0, that means that the first child must have start index 0 and span 2; and the second child start index 2 and span one. Hence the cells in the chart contributing the children are made explicit. The backpointer points to the first item in each child cell, indicating that the Det-Adj-N constituent consists of a Det-Adj constituent of span two and an N constituent of span one. This is the first item in its cell, hence the index for the item is 1.
[The full chart is not reproduced here; each cell lists items of the form [A, α, ψ, k] as described in the text.]

Figure 7.12 All constituent spans spanning the string, given the CNF PCFG in Figure 7.11
FillChart(w_1 … w_n, G = (V, T, S†, P, ρ))        ▷ PCFG G must be in CNF
 1  s ← 1                                          ▷ scan in words/POS-tags (span = 1)
 2  for t = 1 to n do
 3      x ← t − 1; C_{x,s} ← ∅
 4      for j = 1 to |V| do
 5          if b_j(w_t) > 0
 6              then r ← |C_{x,s}| + 1
 7                   C_{x,s} ← C_{x,s} ∪ {[A_j, b_j(w_t), −, r]}
 8  for s = 2 to n do                              ▷ all spans > 1
 9      for t = 1 to n − s + 1 do
10          x ← t − 1; C_{x,s} ← ∅
11          for m = t to t + s − 2 do
12              p ← m − x; q ← s − p
13              for all [A_j, α_j, ψ_j, y] ∈ C_{x,p} do
14                  for all [A_k, α_k, ψ_k, z] ∈ C_{m,q} do
15                      for i = 1 to |V| do
16                          if a_{ijk} > 0
17                              then r ← |C_{x,s}| + 1
18                                   C_{x,s} ← C_{x,s} ∪ {[A_i, a_{ijk} α_j α_k, (m, y, z), r]}

Figure 7.13 Pseudocode of basic chart-filling algorithm
remaining cells, from smallest to largest span. The basic chart-filling algorithm is shown in Figure 7.13. To make things as simple as possible, we will adopt notation similar to what was used for the Viterbi algorithm in the previous chapter. For non-terminals A_i, A_j, A_k ∈ V, let a_{ijk} = P(A_i → A_j A_k), which is analogous to the transition probabilities in the Viterbi algorithm. Let b_j(w_t) = P(A_j → w_t). After scanning the POS-tags into the chart with the first part of the algorithm, we are left with the entries in the bottom row of the chart. We then process each row, from bottom to top, in a left-to-right manner, that is, from starting index 0 to the last possible starting index for that particular span. Let us consider one particular cell as an example of the algorithm: span s = 3, start index x = 2. We will begin by trying the first possible midpoint m = 3, which implies that the first child comes from the cell x = 2, s = 1, and the second child comes from the cell x = 3, s = 2. As we scan through all possible combinations of the entries in the cell, we find that the categories of the second entry in the first cell ([NP↓N, 1/2, –, 2]) and the second entry in the second cell ([VP, 1/9, (4,3,1), 2]) occur together on the right-hand side of a production in our grammar S → NP↓N VP, which has probability 3/16.
Based on this, we can create the item [S, 1/96, (3,2,2), 1]. In such a way, we can fill, in order, all the cells in the chart. While it is an important intermediate step to see how to efficiently fill in all possible items in a chart, this is not the CYK algorithm, which performs search more efficiently. To see where additional efficiencies can be gained, consider our example cell from the previous paragraph: s = 3, x = 2. Note that there are two entries in this cell with the same non-terminal category (VP). These entries differ in their probabilities and backpointers; however, they combine in exactly the same way with the NP in cell s = 2, x = 0 to give the top two analyses at the top of the chart. It is far more efficient to consider that combination only once; that is, we want to maintain only one entry per non-terminal category in any given cell of the chart. Non-probabilistic chart parsing approaches (e.g., Kay, 1986; Billot and Lang, 1989) that wish to find all possible parses for a string can achieve a more compact representation than what we have presented in Figure 7.12 by keeping a single entry per non-terminal category in each cell. Each entry maintains a vector of backpointers instead of just one, which enumerate the different ways in which the category can be built in that cell. The resulting “parse forest” compactly enumerates all possible parses. Note that such a representation would result in a single S entry in the topmost cell of Figure 7.12, with three backpointers not four. Two of the four backpointers in that cell point to the same category in the same cell (VP entries in the cell s = 3, x = 2) which would be a single entry with multiple backpointers in the more compact representation, hence requiring only one backpointer from the topmost S entry. The probabilistic version of the CYK algorithm, like the Viterbi algorithm from the previous chapter, is a dynamic programming approach that searches for the highest probability analysis, and thus can store just the highest probability per category along with an associated backpointer. It is easy to see that the lower probability VP in the cell s = 3, x = 2 could never be part of the highest probability parse, because the other VP will combine in exactly the same ways and will always result in a higher probability. We will now assume that every cell has an indexed position reserved for every non-terminal symbol in V, whether or not that position is actually taken. As a result, we just need to store a probability and a backpointer for each non-terminal in each cell, but no separate index or category label. Let α_i(x, s) denote the probability for non-terminal
Cyk(w_1 … w_n, G = (V, T, S†, P, ρ))              ▷ PCFG G must be in CNF
 1  s ← 1                                          ▷ scan in words/POS-tags (span = 1)
 2  for t = 1 to n do
 3      x ← t − 1
 4      for j = 1 to |V| do
 5          α_j(x, s) ← b_j(w_t)
 6  for s = 2 to n do                              ▷ all spans > 1
 7      for x = 0 to n − s do
 8          for i = 1 to |V| do
 9              ψ_i(x, s) ← argmax_{m,j,k} a_{ijk} α_j(x, m−x) α_k(m, s−m+x)
10              α_i(x, s) ← max_{m,j,k} a_{ijk} α_j(x, m−x) α_k(m, s−m+x)

Figure 7.14 Pseudocode of the CYK algorithm
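The pseudocode translates fairly directly into Python. The following sketch is our own rendering (the dictionary-based grammar encoding and the names lexical and binary are illustrative, not part of the original presentation), run here on a subset of the rules from Figure 7.11 that suffices for the first reading of the example sentence:

# Python rendering of the CYK pseudocode in Figure 7.14 (a sketch). The
# grammar must be in CNF: `lexical` maps (category, word) to a probability,
# `binary` maps (parent, left, right) to a probability.
def cyk(words, lexical, binary, start="S"):
    n = len(words)
    alpha = {}   # alpha[(cat, x, span)] = best inside probability
    psi = {}     # psi[(cat, x, span)]   = backpointer (m, left_cat, right_cat)
    for t, w in enumerate(words, start=1):          # span-1 cells
        for (cat, word), p in lexical.items():
            if word == w:
                alpha[(cat, t - 1, 1)] = p
    for span in range(2, n + 1):                    # all spans > 1
        for x in range(0, n - span + 1):
            for (parent, left, right), p in binary.items():
                for m in range(x + 1, x + span):    # all midpoints
                    lp = alpha.get((left, x, m - x), 0.0)
                    rp = alpha.get((right, m, span - (m - x)), 0.0)
                    score = p * lp * rp
                    if score > alpha.get((parent, x, span), 0.0):
                        alpha[(parent, x, span)] = score
                        psi[(parent, x, span)] = (m, left, right)
    return alpha.get((start, 0, n), 0.0), psi

lexical = {("Det", "the"): 1.0, ("Adj", "aged"): 1.0, ("N", "bottle"): 0.5,
           ("V", "flies"): 1.0 / 3, ("Adv", "fast"): 1.0}
binary = {("S", "NP", "VP"): 9.0 / 16, ("NP", "Det-Adj", "N"): 1.0 / 3,
          ("Det-Adj", "Det", "Adj"): 1.0, ("VP", "V", "Adv"): 1.0 / 3}
prob, back = cyk("the aged bottle flies fast".split(), lexical, binary)
print(prob)   # 0.0104166..., i.e. 1/96 for this reading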
i with start index x and span s; and let ψ_i(x, s) denote the backpointer to the children of the non-terminal. The dynamic programming algorithm is shown in Figure 7.14. There are several things to note in this algorithm. First, this algorithm is very similar to the Viterbi algorithm in Figure 6.6, except that, instead of performing dynamic programming over word positions, it is over chart cells. Second, the argmax in line 9 of the algorithm involves a search over all possible mid-points m, which ranges from x+1 to x+s−1, for a total of s−1 points. Thus there are three nested loops (s, x, m) that depend on the length of the string n, hence this algorithm has a worst case complexity on the order of n³, though there are other factors, such as the size of the non-terminal set |V|. This n³ complexity makes context-free parsing far more expensive than, say, POS-tagging or NP-chunking, which are linear in the length of the string. For a string of length 50, for example (not unlikely in newswire text), context-free parsing will have complexity 125,000·f for some constant f, while NP-chunking will have complexity 50·g for some constant g. Needless to say, unless g is much, much larger than f (not likely), chunking will be many orders of magnitude faster in the worst case and usually in practice, that is, on the order of 2,500 strings would be chunked in the time it takes to parse just one. As with the Viterbi algorithm, one can generalize the CYK algorithm to return n-best parses, in much the same way as is presented in Section 6.3.2. Indeed, one citation referenced there (Huang and Chiang, 2005) is specifically focused on n-best parsing. See that paper, and
First ordering:                        Second ordering:
Span 5   15                            Span 5   15
Span 4   13  14                        Span 4   10  14
Span 3   10  11  12                    Span 3    6   9  13
Span 2    6   7   8   9                Span 2    3   5   8  12
Span 1    1   2   3   4   5            Span 1    1   2   4   7  11
Start:    0   1   2   3   4            Start:    0   1   2   3   4

Figure 7.15 Two different chart cell orderings for the CYK algorithm
the references therein, for more details on efficient n-best parsing. The utility of being able to do this will become clear in Chapter 9, when we present re-ranking approaches that make use of n-best lists. 7.3.3 Earley Parsing There are large efficiency gains in moving from the basic chart-filling algorithm in Figure 7.13 to the CYK algorithm in Figure 7.14, due to only having to store one entry per non-terminal per cell. One can legitimately ask: is that as efficient as we can get? In terms of worstcase complexity, the answer seems to be yes, that is as good as it gets. However, certain grammars may be such that there are additional efficiencies to be had. In this section, we will look at the idea of performing top-down filtering to try to reduce the number of categories in each cell, an idea first presented in the well-known Earley parsing algorithm (Earley, 1970). One way in which the purely bottom-up CYK algorithm can do extra work is in building constituents that cannot subsequently combine with other constituents in a valid parse. Consider the example cell with span s = 3 and start index x = 2 in Figure 7.12. In that cell, we have an entry for the non-terminal S, with a fairly high probability. Since this S does not cover the entire span of the string, in order to be part of a complete parse, it would have to combine with another constituent via a rule production in the grammar. If we inspect our grammar in Figure 7.11, however, we note that the only rule where an S appears on the righthand side is VP → V S, so in order for this particular S constituent to be in a completed parse, it would have to combine with a V to its left. But there is no V in any cell that it can combine with to its left, hence this particular S constituent is a dead end. If we can recognize that before placing it in the chart, we can avoid wasting any subsequent processing on that constituent.
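The dead-end check just described amounts to asking which categories may appear immediately to the left of a given category in some rule. A small sketch (ours; the rule list is transcribed from Figure 7.11) makes the point for S:

# For the dead-end argument above: which categories can appear immediately
# to the left of a given category in some binary rule? With the CNF grammar
# of Figure 7.11, an S on the right-hand side only ever follows a V, so an
# S constituent with no V to its left can never be used. (Illustrative sketch.)
binary_rules = [("S", "NP", "VP"), ("S", "NP", "VP↓V"), ("S", "NP↓N", "VP"),
                ("S", "NP↓N", "VP↓V"), ("NP", "Det", "Adj"),
                ("NP", "Det-Adj", "N"), ("NP", "Det-Adj-N", "N"),
                ("VP", "V", "Adv"), ("VP", "V-NP", "Adv"),
                ("VP", "V-NP↓N", "Adv"), ("VP", "V", "S"),
                ("V-NP", "V", "NP"), ("V-NP↓N", "V", "NP↓N"),
                ("Det-Adj-N", "Det-Adj", "N"), ("Det-Adj", "Det", "Adj")]

def left_neighbours(rules):
    """Map each category B to the set of categories A with some rule C -> A B."""
    left = {}
    for _parent, a, b in rules:
        left.setdefault(b, set()).add(a)
    return left

print(left_neighbours(binary_rules).get("S"))   # {'V'}: S must follow a V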
Figure 7.15 shows two valid orders for filling cells in the chart. The first is the one that we have been using for the CYK algorithm, which goes in span order – all of the cells of span one before any of the cells of span two. Of course, it is not necessary to finish the cell with start index 4 of span one before processing the cell with start index 0 of span two, since constituents in the latter cell can only be composed of span one constituents with start indices 0 and 1. The second chart ordering in Figure 7.15 is one that fills a cell in the chart as soon as all required possible children cells are completed. The advantage of the second order of filling the chart cells is that, by the time a cell is processed, all cells that it can possibly combine with to the left will already be filled. This allows for a simple filter: categories need only be created in a cell if they can effectively combine in that position to create a valid analysis. Consider again the example S category in the cell with span s = 3 and start index x = 2 in Figure 7.12, which is the thirteenth cell filled in the new chart ordering. As mentioned previously, this can only combine on the right-hand side of a rule (and hence in a valid full parse tree) if there is a V category adjacent to the left. Since this cell has start index 2, that category must be in one of two cells: start index x = 1 with span s = 1; or start index x = 0 with span s = 2. A quick inspection shows that no V occurs in either of these cells, hence the S does not need to be constructed. The filtering approach requires a list of categories that will not be dead-ends for any particular cell. In fact, the list can be shared for all cells with a particular start index. Let us begin with start index x = 0. Since S is our target root category, categories that are the leftmost child on the right-hand side of a production with S on the left-hand side can be in a valid parse, namely (looking at the grammar in Figure 7.11) either NP or NP↓N. In turn, we can have any category that is the leftmost child of a production with these two categories on the lefthand side, and so forth. In other words, we take the closure of the lefthand side/leftmost child relation to build the list of allowed categories. This filter is typically denoted with what are called dotted rules where the rules that may be used are stored with a dot before the allowed categories for the particular start index. For example, Figure 7.16 shows all the rules that were used to find the set of allowed categories for start index x = 0, with a dot before the first category on the righthand side. The set of allowable categories are all those which appear immediately to the right of a dot. Once the category to the right of the dot is recognized in a cell with start index x and span s , the dot is
S → • NP VP            S → • NP VP↓V
S → • NP↓N VP          S → • NP↓N VP↓V
NP → • Det Adj         NP → • Det-Adj N
NP → • Det-Adj-N N     NP↓N → • bottle
NP↓N → • flies         Det → • the
Det-Adj-N → • Det-Adj N
Det-Adj → • Det Adj

Figure 7.16 Dotted rules at start index x = 0 from the PCFG in Figure 7.11
advanced to the next word position. If the resulting dot is not at the end of the sequence of right-hand side categories, it is placed in the list of dotted rules for all cells with start index x+s . Note that we are presenting this filtering within the context of improving CYK parsing, which uses Chomsky Normal Form PCFGs. The filtering approach, and the Earley algorithm that uses it, do not require the grammar to be in this form. The Earley algorithm, as typically presented (Earley, 1970; Stolcke, 1995), does binarization on the fly through these dotted rules. See those papers for more explicit details on the algorithm. After we process the cell with start index x = 0 and span s = 1, we move on to start index x = 1. For any category with start index x = 1 to participate in a parse that spans the entire string, it must combine with a category in the cell with start index x = 0 and span s = 1. Hence, we have enough information to produce our list of dotted rules for all cells with start index x = 1. The only one of our dotted rules that applies is NP → • Det Adj, which had a dot before the Det on the right-hand side. The dot moves past the Det, yielding the following dotted rule: NP → Det • Adj. Hence the only non-dead-end category that can be built in a cell with start index x = 1 is Adj. The utility of top-down filtering of this sort very much depends on the grammar. There is an overhead to keeping track of the dotted rules, and if most categories are typically allowed anyhow, then that overhead may be more than the efficiency gain to be had from the filtering. The Earley algorithm provides no worst-case complexity improvement, but under the right circumstances, it can provide much faster performance. 7.3.4 Inside-outside Algorithm Just as the CYK algorithm for context-free grammars is analogous to the Viterbi algorithm for finite-state models, there is an algorithm that
allows for stochastic re-estimation analogous to the Forward–backward algorithm discussed in Section 6.3.3. The Inside–outside algorithm (Baker, 1979; Lari and Young, 1990) is used for calculating the expected frequencies of non-terminal categories and rule productions for use in an EM re-estimation. The inside probability for a category A_i ∈ V with start index x and span s is the probability of having that category with the particular words in its span:

α_i(x, s) = P(A_i ⇒* w_{x+1} … w_{x+s})    (7.5)
For non-binary productions with terminals on the right-hand side, the definition is the same as in the CYK algorithm in Figure 7.14:

α_j(x, 1) = b_j(w_{x+1})    (7.6)
For the remaining binary productions, the inside probability is obtained by replacing the max in the algorithm in Figure 7.14 with a sum:

α_i(x, s) = Σ_{m=x+1}^{x+s−1} Σ_{j=1}^{|V|} Σ_{k=1}^{|V|} a_{ijk} α_j(x, m−x) α_k(m, s−m+x)    (7.7)
Just like the relationship between the Viterbi and the Forward algorithms presented in the last chapter, the dynamic programming involved in the CYK and the inside part of the Inside–outside algorithm are identical. Note that, if A_r = S† ∈ V, then

P(w_1 … w_n) = α_r(0, n)    (7.8)
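A sketch of the inside computation (ours, reusing the grammar encoding of the earlier CYK sketch; illustrative only) simply replaces the max of the CYK code with a sum:

# Sketch of the inside pass (Equations 7.5-7.8): identical in shape to CYK,
# but summing over midpoints and rules instead of maximizing. It can be
# called with the same `lexical` and `binary` dictionaries as the CYK sketch.
def inside(words, lexical, binary, start="S"):
    n = len(words)
    alpha = {}
    for t, w in enumerate(words, start=1):
        for (cat, word), p in lexical.items():
            if word == w:
                alpha[(cat, t - 1, 1)] = alpha.get((cat, t - 1, 1), 0.0) + p
    for span in range(2, n + 1):
        for x in range(0, n - span + 1):
            for (parent, left, right), p in binary.items():
                total = 0.0
                for m in range(x + 1, x + span):
                    total += (p * alpha.get((left, x, m - x), 0.0)
                                * alpha.get((right, m, span - (m - x)), 0.0))
                if total > 0.0:
                    alpha[(parent, x, span)] = alpha.get((parent, x, span), 0.0) + total
    return alpha.get((start, 0, n), 0.0)   # = P(w_1 ... w_n), Equation 7.8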
The outside probability for a category A_j ∈ V with start index x and span s is the probability of having all the words in the string up to the start of the constituent followed by the constituent followed by the remaining words in the string, that is, for a string of length n:

β_j(x, s) = P(S† ⇒* w_1 … w_x A_j w_{x+s+1} … w_n)    (7.9)
This is a little more complicated to calculate than the inside probability. We can start with the start symbol A_r = S† at the root of the tree:

β_r(0, n) = 1    (7.10)
For any category, given a grammar in Chomsky Normal Form, it can be the left child or the right child in whatever production that it participates in, as shown in Figure 7.17. Hence, we will sum over both
Figure 7.17 Left-child and right-child configurations for calculating the backward probability
possibilities. For a string of length n:

β_j(x, s) = Σ_{s′=1}^{n−x−s} Σ_{k=1}^{|V|} Σ_{i=1}^{|V|} a_{ijk} α_k(x+s, s′) β_i(x, s+s′)
          + Σ_{x′=0}^{x−1} Σ_{k=1}^{|V|} Σ_{i=1}^{|V|} a_{ikj} α_k(x′, x−x′) β_i(x′, x+s−x′)    (7.11)
Now that we have the inside (α) and outside (β) probabilities for every category, start index, and span, we can now estimate the conditional probability, given the string, of seeing such a constituent in that position:

γ_i(x, s) = α_i(x, s) β_i(x, s) / P(w_1 … w_n)    (7.12)
We can also calculate the conditional probability of a particular rule production applying at a particular span:

ξ_{ijk}(x, s) = (a_{ijk} β_i(x, s) / P(w_1 … w_n)) Σ_{m=x+1}^{x+s−1} α_j(x, m−x) α_k(m, x+s−m)    (7.13)
These values are analogous to what is calculated for the Forward– backward algorithm, although instead of time t they are associated with start index x and span s . These values can be used to re-estimate the PCFG iteratively, via an algorithm very similar to the Update-HmmParameters function in Figure 6.9. We leave it as an exercise for the reader to work out the algorithm to update PCFG parameters with the values just defined. 7.3.5 Labeled Recall Parsing In Section 6.3.4, we presented an approach to finite-state tagging that made use of the Forward–backward algorithm to calculate posterior probabilities for each tag at each time step. The Inside–outside algorithm can similarly provide posterior probabilities for labeled spans (Equation 7.12), which can be used for parsing, not just model reestimation. Goodman (1996b, 1998) proposed an algorithm for optimizing the expected labeled recall, under some assumptions. Note that, unlike the POS-tagging task, for which there is always a single tag per word in every candidate tag sequence, it is not the case that every parse tree of a particular string has the same number of constituents, unless all of the parse trees are binary branching. If there are n words in a string, all binary branching trees of that string have n−1 nodes in the tree; hence precision and recall are identical, and a single accuracy metric suffices. The restriction to binary branching trees limits the applicability of this approach, since many common grammars are not binary. However, a weakly equivalent Chomsky Normal Form grammar is used for CYK parsing, and such a grammar will be binary branching. Of course, for evaluation, the CNF trees are typically returned to the non-binary branching format of the original grammar, so that optimizing the expected recall with the CNF grammar does not optimize the expected F-measure accuracy in the original form. In particular, composite nonterminals created by the transform (see Figure 7.4) are not ultimately evaluated for accuracy, but this algorithm will maximize the expected
recall with them included. Nevertheless, it may be a better optimization than exact match accuracy, which is what CYK is optimizing. We will use a similar notation from the last chapter. Let p̄ denote the “true” parse for a string w. Adapting Equation 6.36 to the current task, we want to find the “best” parse p̂ given the string w for some definition of accuracy:

p̂ = argmax_p E(accuracy(p | p̄))
  = argmax_p Σ_{p′} P(p′ | w) accuracy(p | p′)    (7.14)
Treating each parse p as a set of labeled spans (i, x, s) with label i, start position x and span s, labeled recall for binary branching trees is defined as follows:

labeled-recall(p | p̄) = |p ∩ p̄| / |p̄|
                      = (1/(|w|−1)) Σ_{(i,x,s)∈p} in-parse(i, x, s, p̄)    (7.15)

where

in-parse(i, x, s, p̄) = 1 if (i, x, s) ∈ p̄, and 0 otherwise    (7.16)
Taking labeled-recall(p | p̄) from Equation 7.15 as the accuracy in Equation 7.14 yields:

p̂ = argmax_p Σ_{p′} P(p′ | w) (1/(|w|−1)) Σ_{(i,x,s)∈p} in-parse(i, x, s, p′)
  = argmax_p (1/(|w|−1)) Σ_{(i,x,s)∈p} Σ_{p′} P(p′ | w) in-parse(i, x, s, p′)
  = argmax_p Σ_{(i,x,s)∈p} γ_i(x, s)    (7.17)
Just as in Equation 6.38 of Section 6.3.4, optimizing this particular accuracy thus involves summing the posterior probabilities. The suggested parsing algorithm is to calculate the posterior probabilities with the Inside–outside algorithm, followed by another CYK-like pass to find the tree that maximizes the sum of the posteriors. Goodman (1998, section 3.8) presents an extension of this approach for use with n-ary
branching trees, which allows for a combined optimization of precision and recall. 7.4 Summary In this chapter, we have examined a number of computational approaches making use of grammars and models that are context-free, from deterministic parsing algorithms for CFGs to dynamic programming approaches with PCFGs. When confronted with natural language ambiguity, the most efficient approaches to exact inference (finding the maximum likelihood parse) have complexity that is cubic in the length of the string, versus linear for finite-state models. The reasons for this added complexity can be seen by a comparison of the CYK algorithm in Figure 7.14 and the Viterbi algorithm in Figure 6.6. In the next chapter, many of the approaches discussed will make use of the algorithmic basics that we have presented in this chapter. The key issue in that chapter will be improved disambiguation between syntactic alternatives by way of “enriched” grammars, for example, inclusion of lexical head words while maintaining the essential contextfree nature of the models and grammars. These enriched grammars can become very large, further degrading the efficiency with which they can be used, as well as increasing the severity of sparse data problems when estimating probabilistic models.
8 Enriched Context-free Approaches to Syntax Syntactic disambiguation often depends on information that goes beyond what syntacticians typically consider, such as semantic and pragmatic considerations. For example, consider the following two prepositional phrase (PP) attachment ambiguities: (1) Sandy ate a salad with a fork (2) Sandy ate a salad with a vinaigrette In the former example, the PP “with a fork” likely attaches to the VP, while in the latter example, the PP “with a vinaigrette” likely attaches to the NP. In a widely known paper, Hindle and Rooth (1993) demonstrated that corpus-derived statistics, indicating the strength of association between prepositions and nominal or verbal head words, could be used to accurately disambiguate a large number of PP attachment ambiguities. The incorporation of such information within contextfree grammars is the topic of this chapter. 8.1 Stochastic CFG-based Parsing In the last chapter, we discussed the algorithms that form the basis of context-free syntactic processing, but we have yet to really touch on the defining characteristics of the various well-known approaches to stochastic CFG parsing that have become popular over the past decade. Most of these have been greatly influenced by the release of the Penn Wall Street Journal (WSJ) Treebank (Marcus et al., 1993). This collection of parse trees allowed for a range of different statistical models to be trained in a supervised manner, which led to the investigation of many syntactic modeling approaches not previously explored. In this section, we will discuss some techniques for estimating grammars
from treebanks, and then follow this with a presentation of several wellknown parsers. 8.1.1 Treebanks and PCFGs Figure 8.1 presents an example tree from the Penn WSJ Treebank, which shows some of the features of the treebank. Recall that nodes with children are labeled with non-terminals; and terminals (words) in the string label nodes with no children beneath them at the frontier (or leaves) of the tree. Punctuation is included as terminals in the tree, with special POS-tags. The POS-tag -NONE- denotes empty nodes, where the terminal is the empty string, which is here denoted with ∗ and an index. In addition to base non-terminal categories, such as S, NP, VP, PP, etc., the treebank is annotated with function tags, such as -TMP for temporal, and -SBJ for subject. Long-distance dependencies are indicated through empty nodes and co-indexation, for example, the NP-SBJ has an index 6, which is shared with the empty object of “outlawed.” Some choices by the annotators of the treebank may not please some linguists, such as the relatively flat treatment of NPs, with no N-bar level of analysis for attachment of modifiers like prepositional phrases. Whether or not the annotated analyses are those most linguistically favored, they are a collection of systematic analyses of constituency over tens of thousands of real-world sentences, which makes possible certain approaches to parser development that did not widely exist before. First, as mentioned already, it provides a means for estimating weights for the grammars. In addition, it allows for an automatic evaluation of the accuracy of parses, by comparing parser output with the manually annotated trees. Let us briefly discuss the most widely used parse evaluation metric: labeled bracketing precision and recall. Labeled bracketing accuracy is one of the PARSEVAL metrics, proposed in Black et al. (1991) along with crossing bracket scores, for comparing parser performance. As we know from the discussion of chart parsing in Section 7.3.2, each constituent can be represented as a non-terminal category and a span, for example, (NP,0,3). To evaluate accuracy, one takes all of the labeled spans in the parser output and tries to match them with labeled spans in the true parse. Every labeled span in either tree can match with at most one labeled span in the other tree. The labeled bracketing precision is the number of matching spans divided by the total number in the parser
Figure 8.1 Example tree from the Penn Wall St. Journal Treebank (Marcus et al., 1993)
output. The labeled bracketing recall is the number of matching spans divided by the total number in the true parse. Note that neither of these scores is, by itself, completely suitable for evaluating the quality of a parser: a parse tree with just an S category and all terminal words as children of the S would have a very high precision; a parse tree with chains of unary productions predicting every possible category over every possible span would have a very high recall. By combining these scores into a composite score, however, we are left with a metric that avoids crediting such simple anomalies. Let t denote the number of labeled spans in the parser output; g denote the number of labeled spans in the gold-standard (true) tree; and m the number that match between the two sets. Then the F-measure accuracy is defined as

F(m, t, g) = 2 Prec(m, t) Rec(m, g) / (Prec(m, t) + Rec(m, g)) = 2(m/t)(m/g) / (m/g + m/t) = 2m / (g + t)
(8.1)
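A minimal sketch of this evaluation (ours; the span encoding and names are illustrative, and duplicate spans, which can arise from unary chains, are ignored here for simplicity):

# Sketch of labeled bracketing evaluation (Equation 8.1); spans are
# (label, start, end) triples.
def prf(test_spans, gold_spans):
    t, g = len(test_spans), len(gold_spans)
    m = len(set(test_spans) & set(gold_spans))   # matched labeled spans
    precision = m / t if t else 0.0
    recall = m / g if g else 0.0
    f = 2 * m / (g + t) if g + t else 0.0
    return precision, recall, f

gold = [("S", 0, 5), ("NP", 0, 3), ("VP", 3, 5)]
test = [("S", 0, 5), ("NP", 0, 2), ("VP", 3, 5)]
print(prf(test, gold))   # (0.666..., 0.666..., 0.666...)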
This is a widely reported accuracy metric for comparing the accuracy of the output of two parsers. There are others, such as dependency extraction accuracy, which we will discuss later in the chapter, but this metric will suffice for now. The tree in Figure 8.1 is the raw treebank annotation, which is typically modified for use with context-free parsers. Empty nodes are removed for three main reasons. First, they correspond to nothing in the input string (they are empty), making them particularly problematic for efficient algorithms such as CYK to incorporate. Second, they correspond to long-distance dependencies that context-free approaches are hard-pressed to exploit adequately. Finally, including them has not been shown to improve parsing performance, as we will show when presenting the Collins parsing model. Since the empty nodes are removed, any co-indexation is also removed. The result of this for Figure 8.1 is that NP-SBJ-6 becomes just NP-SBJ, and the VP headed by “outlawed” becomes a unary production. In addition to removal of empty nodes, function tags are conventionally removed for context-free parsing, for example, PP-TMP becomes just PP and NP-SBJ becomes just NP. The reason for this is partly because the function tags are, in some cases, not annotated consistently across the corpus. Also, the inclusion of these tags splits the non-terminals into multiple non-terminals, even if there is some generalization to be had regarding the productions expanding those non-terminals: non-subject NP nodes may look somewhat like NP-SBJ
nodes. Finally, early attempts at parsing the Penn Treebank removed the function tags for training and evaluation, so for backwards comparability, in evaluation at least, the function tags are most often ignored. Note that, even if empty nodes and function tags are not produced by the parser, a post-process can be used to re-introduce them into the annotation after the fact. The two-stage system described in Blaheta and Charniak (2000) takes the output of the Charniak context-free parser and assigns function tags to nodes in the output tree with very high accuracy for most tags. Some of the function tags, most notably CLR, were inconsistently annotated in the Penn Treebank and hence are difficult to accurately predict. For most of the common tags, however, this approach is viable. Recovering empty nodes (M. Johnson, 2002b; Levy and Manning, 2004) or both empty nodes and function tags (Jijkoun and de Rijke, 2004; Gabbard et al., 2006) as a post-processing stage on the output of treebank parsers is an active area of research, which promises to provide at least some of the long-distance dependencies that context-free parsers typically do not explicitly capture. To use the CYK algorithm, the grammar must be binarized, but the true parse tree is not in a binary format. To evaluate, the resulting binary tree must be un-binarized back to the original annotation format. This is easily done by removing composite non-terminals introduced by binarization, and attaching their children as children of the parent node, as mentioned in the previous chapter. In fact, it may be the case that other changes to the treebank or resulting grammar would have a beneficial effect for parsing, which could then be undone before evaluation. The benefit could have to do with processing efficiencies, or with changes to non-terminal nodes that improve the probability model. Recall that one of the rationales for removing function tags was that they split the non-terminal states, to the detriment of the probabilistic model. Some state splits, however, might be probabilistically beneficial, that is, they change one nonterminal to multiple non-terminals that are sufficiently different to result in quite different probabilities for various right-hand sides. In a PCFG, the probability of the right-hand side is conditioned on the left-hand side of the production, so the non-terminal set can have a big impact on the probabilities that are estimated from a treebank. It turns out that much of the difference between the best known parsing approaches comes down to the non-terminal set, and how that impacts the probabilistic model.
To make this more concrete, we will discuss two simple changes to the treebank made prior to grammar induction, which result in large changes to parsing efficiency and parsing accuracy. The first change that can be made is to make a Markov assumption regarding the dependency of children of a production on other children of the production. This is the same sort of Markov assumption that is made for an n-gram language model, as discussed in Section 6.1. Consider this as though the child categories of a rule production are a string, like a sentence, and we are modeling the probability of the string through an n-gram model, rather than defining a parameter for every possible string. Suppose we have an NP production with four children: NP → DT JJ NN NNS. Using the chain rule, the probability of this production is defined as P(NP → DT JJ NN NNS) = P(DT | NP) P(JJ | NP:DT) P(NN | NP:DT,JJ) P(NNS | NP:DT,JJ,NN) P(| NP:DT,JJ,NN,NNS)
(8.2)
where, just as with n-gram sequences, we explicitly model the fact that there are no further symbols after NNS in the right-hand side sequence. We can make a Markov assumption on this, so that the number of previous children in the production conditioning the next child category is at most some number k. If k = 1, the probability of the production becomes P(NP → DT JJ NN NNS) = P(DT | NP) P(JJ | NP:DT) P(NN | NP:JJ) P(NNS | NP:NN)P(| NP:NNS)
(8.3)
There are many ways to transform a treebank so that the induced grammar will encode a Markov assumption of this sort. We will choose one that is suited to the CYK algorithm, since that is what we presented in the last chapter, though it does not correspond precisely to Equation 8.3. Figure 8.2 shows right-factorizations of various orders that are produced by simply leaving children off of node labels created in a standard factorization for putting the grammar in Chomsky Normal Form. Any productions with two children or less are left unchanged by this factorization, so that we may retain binary productions as required for the CYK algorithm.
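The following is a minimal sketch (ours, not the code used in the experiments below) of this kind of right-factorization with a Markov order-k horizontal history; trees are (label, children) tuples with POS-tags as leaves, and composite labels follow the NP:JJ-NN style of Figure 8.2.

# Right-binarize an n-ary production, keeping at most k upcoming child labels
# in each composite non-terminal (k=None keeps them all, giving the standard
# right-factorization of Figure 8.2(a)).
def right_factor(tree, k=2):
    label, children = tree
    children = [right_factor(c, k) for c in children]
    if len(children) <= 2:             # binary and unary rules are unchanged
        return (label, children)
    base = label.split(":")[0]         # original category, e.g. NP
    rest = children[1:]
    kept = [c[0] for c in rest] if k is None else [c[0] for c in rest[:k]]
    comp = base + ":" + "-".join(kept)
    return (label, [children[0], right_factor((comp, rest), k)])

# For example, a Markov order-1 factorization of a flat NP:
np = ("NP", [("DT", []), ("JJ", []), ("NN", []), ("NNS", [])])
print(right_factor(np, k=1))
# ('NP', [('DT', []), ('NP:JJ', [('JJ', []), ('NP:NN', [('NN', []), ('NNS', [])])])])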
Figure 8.2 Four right-factored binarizations of a flat NP: (a) standard right-factorization; (b) Markov order-2; (c) Markov order-1; (d) Markov order-0. [Tree diagrams not reproduced.]
Consider the tree in Figure 8.2(c), which we are claiming encodes a Markov order-1 grammar. The probability of the NP is decomposed into the product of three productions: P(NP → DT JJ NN NNS) = P(DT NP:JJ | NP) P(JJ NP:NN | NP:JJ) P(NN NNS | NP:NN)
(8.4)
Using the chain rule, we get P(NP → DT JJ NN NNS) = P(DT | NP) P(NP:JJ | NP, DT) P(JJ | NP:JJ) P(NP:NN | NP:JJ, JJ) P(NN | NP:NN) P(NNS | NP:NN, NN)
(8.5)
Note, however, that P(X | NP:X) = 1 for all categories X, and that P(Y | NP:X, X) = P(Y| NP:X) for all categories X and Y. This leaves us with P(NP → DT JJ NN NNS) = P(DT | NP) P(NP:JJ | NP, DT) P(NP:NN | NP:JJ) P(NNS | NP:NN)
(8.6)
Because these productions are all guaranteed to be binary, the probability of continuing the production beyond the second child is zero, hence the probability to stop the original rule is implicit when the second child is not a composite, factorization-introduced non-terminal. There are two ways in which what is called “Markov order-1” in Figure 8.2, represented in Equation 8.6, differs from the formulation in Equation 8.3. First, the probability of the second child category JJ is conditioned not only on the previous child category DT, but also on the fact that DT is the first child in the production, since the lefthand side category of the production is not a composite category, which only happens when producing the first child of the original production. We could adhere to a strict order-1 Markov model in this case if the
composite node label introduced by factorization were NP-DT instead of NP:JJ, as was done in the factorization in Figure 7.4. If that were the sort of composite non-terminal created, however, a CYK parser would have to consider all ways in which, for example, NN and NNS could combine: into NP-JJ or NP-DT or NP-NN or NP-NP or any one of a very large number of composite categories. That is because it would have to predict more than just the original left-hand side, which is not the case with our factorization. The second way in which our factorization differs from the formulation in Equation 8.3 is that the probability of not continuing the original production past the NNS is conditioned on both the NN and the NNS, rather than just the NNS. In order to avoid this, we would have to add another layer of factorization, resulting in a final unary production: NP:NNS → NNS. However, unary productions of this sort are problematic for the CYK algorithm, hence we choose to stop factoring at binary rules, with the (in our opinion, bearable) side effect that the PCFG is actually Markov order-2 at certain locations in the production. To demonstrate the impact of making such a Markov assumption on CYK parsing, we will present some experiments, originally documented as baselines in Mohri and Roark (2006). First, a few notes about the experimental setup and the particular implementation of the CYK algorithm. All grammars were induced from sections 2–21 of the Penn WSJ Treebank, which is approximately one million words. For these experiments, we restricted possible POS-tags for a word to those in the k-best output of a discriminatively trained POS-tagger (Hollingshead et al., 2005). As mentioned earlier, empty nodes, co-reference indices, and function tags were removed from the treebank prior to training. Probabilities were assigned using maximum likelihood estimation, which is simple relative frequency estimation:

P(A → α) = c(A → α) / c(A)
(8.7)
where c(x) denotes the count of x in the training corpus. Hence, unobserved rules get zero probability. Factorization is performed on the corpus prior to grammar induction. Finally, rather than transforming non-terminal to non-terminal unary rules, these rules are dealt with on-the-fly. Unary productions of this sort are relatively common in the treebank, especially after empty nodes are removed.
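A minimal sketch of this relative frequency estimation (ours, not the implementation used for these experiments), assuming the treebank has already been factored and is represented as (label, children) tuples:

# Relative frequency estimation of a PCFG, as in Equation 8.7. Rules that
# never occur in the training trees implicitly receive zero probability.
from collections import defaultdict

def estimate_pcfg(trees):
    rule_count = defaultdict(int)      # counts c(A -> alpha)
    lhs_count = defaultdict(int)       # counts c(A)

    def collect(node):
        label, children = node
        if not children:               # bare POS-tag leaf: no rule here
            return
        rhs = tuple(child[0] for child in children)
        rule_count[(label, rhs)] += 1
        lhs_count[label] += 1
        for child in children:
            collect(child)

    for tree in trees:
        collect(tree)
    return {(lhs, rhs): count / lhs_count[lhs]
            for (lhs, rhs), count in rule_count.items()}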
Table 8.1 Baseline results of CYK parsing using different probabilistic context-free grammars

PCFG                              Time (s)   Words per sec   |V|      |P|      LR     LP     F
Right-factored                    4,848      6.7             10,105   23,220   69.2   73.8   71.5
Right-factored, Markov order-2    1,302      24.9            2,492    11,659   68.8   73.8   71.3
Right-factored, Markov order-1    445        72.7            564      6,354    68.0   73.0   70.5
Right-factored, Markov order-0    206        157.1           99       3,803    61.2   65.5   63.3

Grammars are trained from sections 2–21 of the Penn WSJ Treebank and tested on all sentences of section 24 (no length limit), given weighted k-best POS-tagger output. The second and third columns report the total parsing time in seconds and the number of words parsed per second. The number of non-terminals (|V|) and productions (|P|) is indicated in the next two columns. The last three columns show the labeled recall (LR), labeled precision (LP), and F-measure (F).
After building all constituents in a cell using the standard CYK algorithm, an attempt is made to improve the probability of constituents in the cell through unary productions to other constituents in the cell. See Mohri and Roark (2006) for more details. Table 8.1 presents the results of the experiments with various Markov orders. The baseline accuracy of a vanilla PCFG of the sort that we are inducing in this experiment is around 71.5 F-measure, with a parsing time (on the particular machine on which it was run) of 6.7 words per second. A Markov order-2 factorization results in a factor-4 reduction in time and in the number of non-terminals in the grammar, with no significant loss in parsing accuracy. An order-1 Markov assumption improves efficiency still further, though at the cost of around 1 percent F-measure accuracy. Recall that one reason we used the particular factorization that we did was to avoid having to predict information other than the left-hand side category of the original production when combining two categories. One might, however, want to predict richer categories if the resulting statistical models would lead to greater accuracy of parsing. For example, it has been shown (M. Johnson, 1998b) that by annotating the original parent category on every non-terminal, the accuracy of a standardly induced PCFG from the Penn WSJ Treebank improves dramatically. Figure 8.3(a) presents an example tree that has undergone the simple parent annotation transform. Note that under this version of the transform, POS-tags are not annotated with the parent category. When both parent-annotation and factorization transforms are performed, parent annotation is done first, so that the parent label is contained
in the composite non-terminals created by factorization. What this annotation does is condition the rule probability on the label of the parent; for example, the probability of the right-hand side of the NP rule depends on whether the NP is a child of an S category or a VP category. For comparison with parent annotation, we present another tree transform that annotates child labels onto the parent category, rather than the other way around. The benefit of such a transform is that, for a CYK parser, all of the extra information in the new node label is available when the children are combined. Rather than guessing the possible parent category, in addition to the original left-hand side category, just the children labels must be added, which are already known. However, the annotation is more complex for a couple of reasons. First, there are many children labels to be annotated, instead of just one parent label. For this reason, we chose to annotate just the first two children of the rule for these experiments. Even this leads to a large amount of state splitting, so we simplified it further, by annotating only the first letter of POS-tag children, followed by a "+" symbol to indicate it is a POS-tag. Figure 8.3(b) shows the example tree under this transform. Note that more dependencies are created under this transform than under the parent annotation, since a child of one category becomes dependent also on the children of its sibling. In contrast with parent annotation, when both child annotation and factorization transforms are performed, factorization is done first. Table 8.2 presents results achieved by performing these tree transforms on the training corpus prior to grammar induction, which is, for parent annotation, precisely the experiment reported in M. Johnson (1998b). For evaluation purposes, composite labels produced by factorization are removed from the tree output by the parser, re-flattening the productions to their original form. Additionally, parent or child annotations are removed from node labels.
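A minimal sketch of the parent annotation transform of Figure 8.3(a) (our own illustration; the child annotation transform can be written analogously by appending the first children's labels to the parent instead). Preterminals are represented as a POS-tag over a word string and, as described above, are left unannotated.

# Annotate each non-POS-tag non-terminal with its parent's label, e.g. NP
# becomes NP^S when its parent is an S (the figures use an up-arrow; "^" is
# just an ASCII stand-in).
def parent_annotate(tree, parent="TOP"):
    label, rest = tree
    if isinstance(rest, str):          # preterminal (POS, word): not annotated
        return (label, rest)
    return (label + "^" + parent,
            [parent_annotate(child, label) for child in rest])

tree = ("S", [("NP", [("DT", "the"), ("NN", "report")]),
              ("VP", [("VBD", "arrived")])])
print(parent_annotate(tree))
# ('S^TOP', [('NP^S', [('DT', 'the'), ('NN', 'report')]),
#            ('VP^S', [('VBD', 'arrived')])])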
Figure 8.3 (a) Parent label annotated onto non-POS-tag children labels; (b) first two children labels annotated onto parent label, with just the first letter of POS-tags included, rather than the full label. [Tree diagrams not reproduced.]
Table 8.2 Parent and initial children annotated results of CYK parser versus the baseline, for Markov order-2 factored grammars

PCFG               Time (s)   Words/s   |V|     |P|      LR     LP     F
Baseline           1,302      24.9      2,492   11,659   68.8   73.8   71.3
Parent annotated   7,510      4.3       5,876   22,444   76.2   78.3   77.2
Child annotated    1,153      28.1      4,314   32,686   79.1   80.3   79.7

Grammars are trained from sections 2–21 of the Penn WSJ Treebank and tested on all sentences of section 24 (no length limit), given weighted k-best POS-tagger output. The second and third columns report the total parsing time in seconds and the number of words parsed per second. The number of non-terminals (|V|) and productions (|P|) is indicated in the next two columns. The last three columns show the labeled recall (LR), labeled precision (LP), and F-measure (F).

The experimental setup is identical to the trials reported in Table 8.1, except that only results for Markov order-2 grammars are presented. A brief note about the impact of the Markov factorization in these trials. As with the baseline trials reported in Table 8.1, the Markov order-2 factorization improves the efficiency of parsing over the grammar with no Markov assumption. However, in the case of the parent and child annotation, it also slightly improves the accuracy of the parses versus using the same annotation with no Markov assumption. The reason for this is that increasing the size of the non-terminal and rule production sets does not just impact the efficiency with which parsing can be accomplished, it also impacts the ability to effectively estimate parameters from a given training corpus. That is, the standard right-factored, parent and child annotated grammars both suffer greatly from sparse data in estimating the PCFGs. The Markov grammars provide non-zero probabilities to unobserved productions, that is, they smooth the PCFGs. For the simple PCFG in the initial trials, this did not improve the accuracy of the parses; however, as the size of the grammars grew with the additionally annotated information, smoothing resulted in better parameterizations of the problem, leading to better accuracy. As can be seen from Table 8.2, parent annotation improves the performance of the baseline model; yet the additional accuracy from this annotation comes at a great efficiency cost. With a Markov assumption of order 2, parsing takes more than five times as long with parent annotation as without. Note, however, that neither the number of non-terminals nor the number of productions is a perfect predictor of the parsing time required. They are related to the worst-case complexity of the algorithms, but in fact the population of the cells in the chart will
vary quite widely depending on the grammar. In the parent annotation case, right-hand sides will typically correspond to a larger number of possible left-hand sides, leading to far more entries in the chart cells. The initial children annotation that we described earlier is also presented in Table 8.2, also with a Markov order-2 factorization. This annotation provides even more accuracy improvement than the parent annotation (due to the extra dependencies), yet with a slight efficiency benefit over the baseline, rather than an efficiency cost. This is due to the fact that, unlike parent annotation, the extra annotation beyond the original node label is already known by the bottom-up CYK parser, since it comes from the child categories that have already been found and are combining to form the new category. The number of left-hand sides for an already found right-hand side is no larger than with the original grammar, since there is nothing other than the left-hand side of the original rule that is being predicted. In some cases, the extra context precludes combination, leading to sparser chart cells than with the baseline grammar and hence some speedup. Note, however, that the extra annotation leads to severe sparse-data problems, evidenced by the fact that, out of the test set of 1346 sentences, three had no valid parse under the grammar, unlike any of the other trials presented. The results presented in this section demonstrate several things about the use of PCFGs induced from treebanks. First, changing the treebank prior to grammar induction can result in dramatically improved performance, either with respect to the accuracy of the maximum likelihood parse, or with respect to the efficiency of parsing. Second, these changes are made in context-free grammars by changing the non-terminal set, either splitting non-terminals into multiple new non-terminals (as with parent-annotation), or merging multiple nonterminals into one (as with the Markov assumptions). As we can see from the improvement in accuracy by making a Markov assumption of order 2 when there is parent annotation, both efficiency and accuracy can be impacted from state mergers. Third, simple transforms such as parent or children annotation and Markov factorization can move the performance from around 72 percent to nearly 80 percent, which is roughly 40 percent of the way to the current state-of-the-art performance on this task of around 90 percent for context-free models. How far can just non-terminal state splitting and merging take performance? In a very interesting paper, Klein and Manning (2003b) push the paradigm of treebank transform and standard PCFG induction to the best accuracy that they could, which was around 85 percent,
halving again the distance between parent annotation models and the highest accuracy approaches. While the paper title touts the "unlexicalized" nature of their approach, in fact the term "unsmoothed" is more accurate. Some lexical information is used in their node labels, mainly from very frequent closed class words like prepositions, which can have strong attachment preferences. A large number of node-label annotations were explored, though they kept their non-terminal size down enough that they did not require smoothing techniques to deal with sparse data. To go beyond their level of performance, even more contextual information is required for the left-hand side non-terminal labels, including lexical dependencies.

8.1.2 Lexicalized Context-free Grammars

We have presented some results using tree transforms which provide robust parsing performance without smoothing. Other effective non-terminal state splits, however, result in very large non-terminal state spaces, which require additional smoothing to be made effective. The best known of these are called bilexical grammars, which encode lexico-syntactic dependencies of word pairs. Like the bigram language model discussed in Chapter 6, this sort of grammar requires smoothing to deal with the severe sparse data problems, compounded because the size of the treebank is much smaller than what is typically available for n-gram modeling. More than smoothing, however, bilexical grammars have a very large impact on parsing complexity. Unlike bigram models, which adhere to linear order in the dependencies, within any constituent of length n there are on the order of n² possible bilexical dependencies. Throwing this extra complexity into the already expensive chart-parsing algorithm leads to what can be insupportable parsing times. As a result, all of the widely used parsers that include bilexical dependencies of this sort use pruning heuristics to reduce the parsing time, sacrificing any guarantees that the resulting best scoring parse is indeed the best scoring parse out of all possible parses. These heuristic pruning approaches will be discussed when we present the various parsers. The choice to use lexical dependencies in parsing is hence made despite the impact on parsing efficiency because of the positive impact it has on parsing accuracy. Incorporating the lexicon into a PCFG can help to capture such things as subcategorization preferences of particular verbs or attachment preferences of particular prepositions.
Figure 8.4 Example tree from Figure 8.1 with head children in bold, and lexical heads in square brackets. Punctuation, function tags, empty nodes and terminal items removed for space. [Tree diagram not reproduced.]
In addition, it can potentially capture selectional preferences between, say, a verb and object NP. Underlying all of this is the notion of a head child of a particular category, which can be thought of as the most important child from among all the children on the right-hand side of the production. For example, VP constituents are typically considered to be headed by verbs. This is an idea borrowed from syntactic approaches that typically go beyond context-free, such as Head-driven Phrase Structure Grammar (Pollard and Sag, 1994) and lexicalized Tree-adjoining Grammars (Joshi et al., 1975), which will be discussed in the next chapter. Once there is a systematic way to identify the head child of a constituent, we can recursively define the lexical head of the constituent (or head word). Let the lexical head of a terminal item be itself. For nonterminals, let the lexical head be the lexical head of its head child. In such a way, the lexical heads can “percolate” up the tree to the root. For example, the example Penn Treebank tree from Figure 8.1 is presented in Figure 8.4 with head children in bold and lexical heads in square brackets. For example, the lexical head at the root of the tree is “will,” which it inherits from the head child VP. The VP in turn inherited its lexical head from its head child MD. Note that the lexical head of each bold non-terminal category is identical to the lexical head of its parent. Which child should be the head of a particular constituent is not an uncontroversial question. For example, the Penn Treebank puts each modal and auxiliary verb in its own VP, with a VP argument. Is the head
of the VP the modal/auxiliary verb, or the VP argument? If syntactic dependencies like agreement or auxiliary chain ordering constraints are more important, then percolating the auxiliaries is likely the best choice. However, if more semantic dependencies imposed by the main verb on, say, the subject of the sentence, are more important, then the VP is likely the best choice. In the tree of Figure 8.4, the modals and auxiliaries are chosen as the heads, rather than the VP. Most treebank parsers make use of head percolation rules that are closely related to those presented in Magerman (1995). Such rules typically identify as head, among other things: the final nominal in an NP; the preposition in a PP; the VP in an S; and the leftmost verb in a VP. A lexicalized PCFG G = (V, V_N, T, S†, P, ρ) has rule productions of the form

A[h] → B1[b1] … Bj[bj] H[h] C1[c1] … Ck[ck]     or     A[h] → h
for some j, k ≥ 0, where A, H, Bm, Cn ∈ V_N and h, bm, cn ∈ T for all 1 ≤ m ≤ j, 1 ≤ n ≤ k. The category H is the head child of the production, hence the left-hand side inherits its lexical head. The non-terminal set V of the PCFG is a subset of V_N × T, that is, the cross product of the base non-terminal set and the terminal set. Note that the above lexicalized PCFG has n lexical heads in each n-ary production. Such an approach suffers from the same sort of sparse-data problem as an n-gram model. To make it tractable, most approaches factor this probability into pairwise dependencies between the head word and each of the non-head words on the right-hand side, which is known as a bilexical context-free grammar (Eisner and Satta, 1999). This probability model factorization can be represented as a grammar factorization, as explicitly detailed in Hall (2004). Figure 8.5 presents the factorization for a 5-ary production, with the head child of the production in the third position. Rather than a strictly left or right branching factorization, this grammar first combines everything to the right of the head child with that category, then combines with everything to the left of the head. As a result, at every binary production in the factored grammar, the head child is available. This sort of factorization, however, does complicate making a simple Markov assumption based on the linear order of the children, because it does not proceed in the linear order of the string. One can have a second order Markov assumption everywhere except for immediately following the head, as shown in Figure 8.5(c).
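A rough sketch (ours, loosely following the factorization illustrated in Figure 8.5(b), not Hall's formulation) of how one n-ary lexicalized production can be factored into head-outward binary productions; labels like "B1[b1]" are strings, and the head child is identified by its index.

# Factor A[h] -> B1[b1] ... H[h] ... Ck[ck] into binary rules, peeling off the
# children to the left of the head first (outermost first), then the children
# to the right, as in Figure 8.5(b).
def head_factor(parent, children, head_idx):
    def bare(label):                   # strip the lexical head: 'B1[b1]' -> 'B1'
        return label.split("[")[0]

    def composite(i, j):               # e.g. 'A[h]:B2-H-C1-C2'
        return parent + ":" + "-".join(bare(c) for c in children[i:j + 1])

    rules, lhs, i, j = [], parent, 0, len(children) - 1
    while j - i + 1 > 2:
        if i < head_idx:               # peel the leftmost remaining child
            rhs = composite(i + 1, j)
            rules.append((lhs, [children[i], rhs]))
            lhs, i = rhs, i + 1
        else:                          # peel the rightmost remaining child
            rhs = composite(i, j - 1)
            rules.append((lhs, [rhs, children[j]]))
            lhs, j = rhs, j - 1
    rules.append((lhs, children[i:j + 1]))
    return rules

for rule in head_factor("A[h]", ["B1[b1]", "B2[b2]", "H[h]", "C1[c1]", "C2[c2]"], 2):
    print(rule)
# ('A[h]', ['B1[b1]', 'A[h]:B2-H-C1-C2'])
# ('A[h]:B2-H-C1-C2', ['B2[b2]', 'A[h]:H-C1-C2'])
# ('A[h]:H-C1-C2', ['A[h]:H-C1', 'C2[c2]'])
# ('A[h]:H-C1', ['H[h]', 'C1[c1]'])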
Figure 8.5 (a) Basic 5-ary lexicalized CFG production; (b) Same production, factored into bilexical CFG productions; and (c) Factored bilexical productions with a Markov assumption. [Tree diagrams not reproduced.]
Klein and Manning (2003b) discuss the use of head categories along with a linear-ordered Markov assumption in the estimation of the grammar. Inspecting this factorization, it is straightforward to see the worst-case complexity changes to the CYK algorithm. Examining the pseudocode in Figure 7.14, the algorithm loops through n possible spans (line 6), and for each span there are on the order of n possible start positions (line 7). For each of these order n² cells, the maximum probability analysis is found by looking at (in the worst case) n|V|³ possible combinations (order n midpoints, and |V|³ possible parent/children configurations). Hence there is a total worst-case complexity of n³|V|³. For each binary production, however, there is an associated head and non-head lexical item, which can be any word in the span. Hence, there are on the order of n²|V_N|³ possible combinations with a bilexical grammar, where V_N are the base non-terminals of the grammar, instead of just |V_N|³ for the corresponding non-bilexical PCFG. That leads to a worst-case complexity of n⁵|V_N|³, which is far worse than our standard CYK complexity. There are ways to improve the parsing complexity of bilexical grammars from this discouraging result. First, there are special cases, discussed in Eisner (1997), where the head position is guaranteed to be at the periphery of every constituent. For such cases, there are order n³ algorithms for parsing. Even for general bilexical CFGs, there are order n⁴ algorithms, presented in Eisner and Satta (1999). The basic idea in the algorithms presented in that paper is to build constituents in two steps, rather than just one. The first step abstracts over the span of the head child; the second step abstracts over the head of the non-head child. Hence each step is n⁴ complexity, leading to overall order n⁴ complexity. Readers are referred to that paper for details. More commonly, parsers making use of bilexical grammars simply use pruning heuristics to avoid exploring the whole search space. Such
approaches do not guarantee finding the maximum likelihood parse, but often work well in practice, that is, on average, they find quite accurate parses. Klein and Manning (2003a) showed that it is possible to make such heuristics admissible, that is, to perform an efficient best-first search in such a way that the maximum likelihood parse is guaranteed to be found. In the following sections, we will present two of the most widely used and cited lexicalized PCFG approaches to syntactic parsing: Michael Collins’ parser (Collins, 1997), and Eugene Charniak’s parser (Charniak, 1997, 2000). First, though, we should mention a few other treebank parsers that have been widely cited and used. 1 The decision tree parser in Magerman (1995) established the training and testing sections of the Penn WSJ Treebank that have become standard, along with some of the standard evaluation conventions, such as, reporting results on sentences of length ≤ 40. The MaxEnt parser documented in Ratnaparkhi (1997) was several percentage points more accurate than the Magerman parser, in terms of labeled precision and recall, and was very efficient, due to initial passes with a finite-state tagger and chunker, and an aggressive beam-search pruning strategy. The left-corner parser in Henderson (2004) achieves very high accuracy through the use of neural-network estimated derivation models and a beam search. Other context-free parsers are discussed in Section 8.3 on the use of parsing for language modeling. The Collins and Charniak models differ slightly in their parsing techniques, but both use a CYK parsing algorithm with some pruning of the space. The real differences between their approaches lie in the statistical models, namely, what sorts of non-terminal state splits they use, and how they estimate the probabilities of the productions. For that reason, we will focus on the probability models rather than the algorithms when presenting these approaches. This may lead to a bit of confusion, because the PCFG models are defined from the top down, that is, the probability of the right-hand side is conditioned on the lefthand side; whereas parsing proceeds bottom up, that is, the lexical head is inherited by the parent from the head child. How can the probability of the production be conditioned on the head of the left-hand side, when the left-hand side gets the head from that rule? In fact, as far as the probability distribution is concerned, the head child gets the 1 There is a large literature on stochastic context-free parsing, and we cannot hope to refer to all of even the best work in the field. We offer a very small number of references, and the reader should understand that this is not intended to be exhaustive.
lexical head from the parent; we just use our efficient bottom-up search technique to avoid predicting lots of possible lexical heads that do not occur in the string and hence cannot be in a valid analysis. Keep this in mind as we present the top-down parsing models in the next sections.

8.1.3 Collins Parser

In his very influential paper, Collins (1997), Michael Collins defined three different parsing models, all of which involved the use of statistical dependencies between lexical heads and their dependents. As stated earlier, the probabilistic model is defined top-down, so all three models begin at the top of the tree. The strings in the Penn WSJ Treebank can be sentences (S) or some other category, such as NP. For this reason, let us define a special non-terminal, TOP, which only occurs one level up from the topmost constituent of the tree. This TOP category rewrites with a unary production to the category at the top of the original tree. This is similar to the start-of-string symbol (<s>) that was part of n-gram modeling or POS-tagging. In a slight abuse of notation, let S† be the root category of a particular tree (just below TOP), and h† the head of S†. The head of a category consists of the lexical head and the POS-tag of the lexical head, that is, h† = (w†, τ†). Collins' most basic model, called Model 1 in the paper, begins with the probability

P(S†, h† | TOP) = P(S†, τ† | TOP) P(w† | S†, τ†, TOP)

(8.8)

Then, starting with S†[h†], and moving down the tree, all children categories and heads are predicted given the parent category and head. For every rule production of the form A[h] → B1[b1] … Bj[bj] H[h] C1[c1] … Ck[ck], the probability is decomposed into three parts. First the head category H is assigned a probability, given the left-hand side A[h]; then all categories to the right of the head are assigned probabilities, given the head category H and the left-hand side A[h]; finally all categories to the left of the head are assigned probabilities, given the head category H and the left-hand side A[h]. In addition, there is a distance variable used to condition the probabilities, which we will explain in a moment, but for now just denote as d_li for Bi to the left of the head, and d_ri for Ci to the right of the head. This corresponds to an order-0 Markov factorization,
with the head child H the only child in the production that conditions the probabilities of the other children:

P(B1[b1] … Bj[bj] H[h] C1[c1] … Ck[ck] | A[h]) = P(H | A[h]) ∏_{i=0}^{j} P(Bi[bi] | A[h], H, d_li) ∏_{i=1}^{k+1} P(Ci[ci] | A[h], H, d_ri)

(8.9)
where B0[b0] and Ck+1[ck+1] are special STOP symbols signaling the end of the constituent in either direction. Each of the three kinds of probabilities in Equation 8.9 (head, left, and right) is smoothed, to allow for robust estimation in the face of sparse data. The smoothing technique Collins used was a deleted interpolation approach (Jelinek and Mercer, 1980), where the maximum likelihood estimate with all of the conditioning information is mixed with a smoothed estimate with some of the conditioning information removed. See Equation 6.12 for this approach presented in the n-gram modeling case. For the current case, the probabilities are estimated as follows. Let h = (w_h, τ_h) be the head of the constituent, and let X be a non-head child, either to the left or the right of the head, with head x = (w_x, τ_x) and distance d_x. Then, via the chain rule

P(X[x] | A[h], H, d_x) = P(X | A[h], H, d_x) P(τ_x | A[h], H, X, d_x) P(w_x | A[h], H, X, τ_x, d_x)
(8.10)
Again, each of the component probabilities in Equation 8.10 is a smoothed probability estimate, based on mixing maximum likelihood estimates (denoted P̂). For example

P(X | A[h], H, d_x) = λ1 P̂(X | A, H, d_x, τ_h, w_h) + λ2 P̂(X | A, H, d_x, τ_h) + λ3 P̂(X | A, H, d_x)
(8.11)
where the mixing coefficients λi are defined for a particular A[h], H, d_x conditioning context and must sum to one. In addition to the smoothing, there are a number of practical issues that Collins had to deal with to reach the performance he did, such as how to deal with unknown words. In addition, a number of corpus pre-processing steps were taken, which were, in aggregate, quite important for system performance. Many of these are documented in
Table 8.3 Performance of Collins' Models 1 and 2 versus competitor systems

System                   LR     LP     F
Magerman (1995)          84.0   84.3   84.2
Ratnaparkhi (1997)       86.3   87.5   86.9
Collins (1997) Model 1   86.8   87.6   87.2
Collins (1997) Model 2   87.5   88.1   87.8

Trained on sections 2–21 and tested on section 23 of the Penn WSJ Treebank.
Collins (1997; 1999), and some others are covered in Bikel (2004). The interested reader is referred there. The distance variable that we have mentioned is a feature derived from the word string spanning from the head to the nearest span boundary of the constituent. The variable is a vector with three values defined as follows: (1) 1 if the word string is of length zero, i.e., the constituent is adjacent to the head, 0 otherwise; (2) 1 if the string contains a verb; and (3) 0, 1, 2, or >2 commas. The rationale for these features has to do with learning right-branching and various attachment preferences. Table 8.3 shows the performance of Collins’ Model 1 on section 23 of the Penn WSJ Treebank, compared to two previously mentioned systems from the same era. It reaches an F-measure accuracy of over 87 percent on average over all sentences in the test set, many of which are longer than fifty words in length. Collins’ Model 2, which can be seen from Table 8.3 to improve on the performance of Model 1 by about half a percent, obtains this improvement by explicitly incorporating subcategorization into the model. For this model, complements expected by the head to its left, as well as complements expected by the head to its right, are generated probabilistically. These directional subcategorization expectations are then used to condition the probabilities of the children that are generated in either direction. When a child matches the subcategorized category in the predicted set, it is removed from the set for conditioning children farther away from the head in that direction. Note that complements are required, hence the special STOP symbol at the constituent boundary will have zero probability if the set of undischarged complements is non-empty. Similarly, the probability of generating a complement is zero if the set of complements is empty.
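As a small illustration, the three-valued distance variable just described might be computed as follows (a hypothetical helper of our own; the verb tag set is an assumption, not Collins' exact definition).

# Distance feature over the surface string between the head and the edge of
# the constituent nearest the child being generated; the string is a list of
# (word, POS) pairs.
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "MD"}   # an assumption

def distance_feature(between):
    adjacent = 1 if len(between) == 0 else 0
    contains_verb = 1 if any(tag in VERB_TAGS for _, tag in between) else 0
    commas = sum(1 for word, _ in between if word == ",")
    return (adjacent, contains_verb, commas if commas <= 2 else ">2")

print(distance_feature([]))                             # (1, 0, 0): adjacent to the head
print(distance_feature([("said", "VBD"), (",", ",")]))  # (0, 1, 1)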
Figure 8.6 Complement annotated tree for use in Collins' Model 2. [Rendered here in bracketed form: (S[VBD, gave] NP-C[NNP, Popeye] (VP[VBD, gave] VBD[gave] NP-C[NNP, Sweetpea] NP-C[NN, spinach] NP[NN, yesterday]))]
To illustrate the differences between Model 1 and Model 2, let us consider the following string: "Popeye gave Sweetpea spinach yesterday." Figure 8.6 shows the tree, with two additional annotations: the lexical head of each constituent, along with its POS-tag, and "-C" for all required complements to the head. As a di-transitive verb, "gave" requires two arguments; and in order to form a sentence, the VP requires a subject NP. The temporal NP (denoted NP-TMP with the original function tags in the treebank) "yesterday" is not a required complement, hence it does not receive the "-C" annotation. Determining complements and adjuncts, which are not explicitly annotated in the Penn Treebank, was done via heuristic rules: certain function tags, such as -TMP or -ADV, ruled out categories as complements; of the remaining categories, NP, S, or SBAR in S or VP constituents, VP in VP constituents, or S in SBAR constituents are all annotated with -C. See the above citations for more details. First, let us consider the probability of the VP production. As in Model 1, first the head child is generated (VBD) given the parent and head information. In Model 2, the left and right subcategorization sets are generated next, followed by each of the non-head children being generated to the left and right of the head. For this particular production (for space purposes, denote the right-hand side as α), the probability is calculated as

P(VP[VBD, gave] → α) = P(VBD | VP, VBD, gave)
    P({} | VP, VBD, gave, VBD(left))
    P({NP-C, NP-C} | VP, VBD, gave, VBD(right))
    P(STOP | VP, VBD, gave, VBD, left{}, d_0l)
    P(NP-C[NNP, Sweetpea] | VP, VBD, gave, VBD, right{NP-C, NP-C}, d_1r)
    P(NP-C[NN, spinach] | VP, VBD, gave, VBD, right{NP-C}, d_2r)
    P(NP[NN, yesterday] | VP, VBD, gave, VBD, right{}, d_3r)
    P(STOP | VP, VBD, gave, VBD, right{}, d_4r)

(8.12)
while the probability of the S production (right-hand side denoted β) is calculated as

P(S[VBD, gave] → β) = P(VP | S, VBD, gave)
    P({NP-C} | S, VBD, gave, VP(left))
    P({} | S, VBD, gave, VP(right))
    P(NP-C[NNP, Popeye] | S, VBD, gave, VP, left{NP-C}, d_1l)
    P(STOP | S, VBD, gave, VP, left{}, d_0l)
    P(STOP | S, VBD, gave, VP, right{}, d_1r)
(8.13)
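A rough sketch of the complement/adjunct heuristics described above (our approximation, not Collins' exact pre-processing; the set of adjunct function tags shown is partly an assumption):

# Mark a child category with '-C' if it is a likely complement of its parent,
# unless a function tag such as -TMP or -ADV identifies it as an adjunct.
ADJUNCT_TAGS = {"TMP", "ADV"}          # assumed minimal set

def mark_complement(parent, child, function_tags=frozenset()):
    if function_tags & ADJUNCT_TAGS:
        return child
    if parent in {"S", "VP"} and child in {"NP", "S", "SBAR"}:
        return child + "-C"
    if parent == "VP" and child == "VP":
        return child + "-C"
    if parent == "SBAR" and child == "S":
        return child + "-C"
    return child

print(mark_complement("S", "NP"))               # NP-C (a subject)
print(mark_complement("VP", "NP", {"TMP"}))     # NP   ('yesterday')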
These productions are decomposed and smoothed similarly to what was presented for Model 1. The probability of the particular subcategorization set will be estimated from what is observed for the particular lexical head, smoothed by what is observed for the POS-tag for that head. Hence probability will be allocated to the range of subcategorization frames observed with the verb "gave", but also to some that have never been observed with that verb, though the smoothing techniques used will ensure that those frames frequently observed with a particular verb get relatively high probabilities. In this example, the two standard dative-shift alternatives will certainly receive the highest probabilities. A third model was investigated in Collins (1997), which tried to exploit traces in the treebank annotation, though this failed to improve performance beyond that achieved by Model 2. The models explored in that paper and in Michael Collins' thesis (Collins, 1999), and further explicated in Bikel (2004), remain quite influential, and the parser itself is widely used within the NLP research community.

8.1.4 Charniak Parser

The other most widely used and cited PCFG parser in recent years is Eugene Charniak's parser, of which there are two main versions. The first (Charniak, 1997) was developed at the same time as Collins' models and shares much in common with those models. The second version (Charniak, 2000) makes a number of improvements that, in aggregate, buy large accuracy improvements over the first version. We begin by presenting the parser from Charniak (1997). The Charniak probability model, like Collins' models, is centered around the notion of lexical head. In this model, first the lexical head of a category is assigned a probability, conditioned on the category label, the label of the parent of the category, and the parent's lexical head.
Table 8.4 Performance of Charniak's two versions versus Collins' models

System                   LR     LP     F
Collins (1997) Model 1   86.8   87.6   87.2
Collins (1997) Model 2   87.5   88.1   87.8
Charniak (1997)          86.7   86.6   86.7
Charniak (2000)          89.6   89.5   89.6

Trained on sections 2–21 and tested on section 23 of the Penn WSJ Treebank.
Then the rule expanding the category is assigned a probability, conditioned on the category label, the label of the parent of the category, and the lexical head of the category. To compare directly with Collins' model, consider again the tree in Figure 8.6. The Charniak model does not make use of either the "-C" annotations or the POS-tag of the head, but it does include the category of the parent to condition probabilities. The probability of the VP production for the model is defined as follows:

P(VP↑S[gave] → α) = P(VBD, NP, NP, NP | VP, S, gave) P(Sweetpea | NP, VP, gave) P(spinach | NP, VP, gave) P(yesterday | NP, VP, gave)
(8.14)
and the probability of the S production is calculated as follows:

P(S↑TOP[gave] → β) = P(NP, VP | S, TOP, gave) P(Popeye | NP, S, gave)
(8.15)

At the root of the tree, the probabilities are

P(TOP → S[gave]) = P(S | TOP) P(gave | S, TOP)
(8.16)
All of these probabilities are smoothed similarly to Collins' model, by deleted interpolation methods (Jelinek and Mercer, 1980). The one slight complication with this is that each lexical head h is associated with a cluster cl(h), which is used as a backoff when conditioning a probability on the value of h. For example,

P(Popeye | NP, S, gave) = λ1 P̂(Popeye | NP, S, gave) + λ2 P̂(Popeye | NP, S, cl(gave)) + λ3 P̂(Popeye | NP, S) + λ4 P̂(Popeye | NP)
(8.17)
where the λi are fixed for a given conditioning history and sum to one. Overall, this model is very similar to Collins' Model 1, with a few differences: (1) Collins used a Markov order-0 model, whereas Charniak used full rules; (2) Collins used the POS-tag of the head, both when conditioning the probability of the head and using the head to condition probabilities, while Charniak used a hard cluster as a backoff just when using the head to condition probabilities; (3) Collins used a distance feature; and (4) Collins used the head child category label, in addition to the left-hand side category label, to condition non-head children, while Charniak used the label of the parent category of the left-hand side. Indeed, the performance of these two models, shown in Table 8.4, demonstrates that they are fairly similar in performance, with Collins' Model 1 about half a percent better than Charniak. As can be seen from the last row in that table, the updated Charniak model from 2000 (Charniak, 2000) improved upon this result by 3 percent absolute, an over 20 percent relative error rate reduction. In the analysis presented in that paper, this improvement is attributed in aggregate to a relatively large number of changes to the model. The largest improvement to the model is achieved by, like Collins, explicitly predicting the POS-tag of the lexical head before predicting the lexical head itself, and further to use this as a backoff when conditioning on the head. This one change, in itself, is shown in the paper to contribute a very large benefit of nearly 2 percent improvement over the original Charniak parsing model. A gain is also achieved by using a smoothed order-2 Markov grammar, rather than assigning a probability to the rule as a whole. Some corpus pre-processing also contributed to the gain, most notably explicitly marking coordination structures. Finally, a novel factorization of the probabilities, to allow the inclusion of many more contextual features to the model (e.g., labels of parent and sibling nodes of the parent), significantly contributed to the overall gain. For details of this factorization, and further analysis of the improvements to the parsing model, we refer the reader to Charniak (2000). Of the four differences that were mentioned between Collins' Model 1 and the original Charniak Model, the use of a Markov assumption and the explicit inclusion of the POS-tag of the head were both adopted in the new Charniak model. There is nothing in Charniak's model equivalent to the string distance feature, nor does he explicitly model subcategorization as in Collins' Model 2.
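The deleted interpolation used in Equations 8.11 and 8.17 amounts to a weighted mixture of relative-frequency estimates with successively less conditioning context; a minimal sketch (ours, with invented numbers) follows.

# Mix maximum likelihood estimates, ordered from most to least specific
# conditioning context, with weights that sum to one.
def interpolate(ml_estimates, lambdas):
    assert abs(sum(lambdas) - 1.0) < 1e-9, "mixing weights must sum to one"
    return sum(lam * est for lam, est in zip(lambdas, ml_estimates))

# e.g. a smoothed P(Popeye | NP, S, gave); the component estimates and the
# weights here are purely illustrative.
p = interpolate([0.0, 0.002, 0.004, 0.01], [0.4, 0.3, 0.2, 0.1])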
Figure 8.7 Two-stage mapping from parse tree to dependency tree. First stage re-labels each node with the lexical head, inherited from its head child (indicated in bold). Second stage, all head-child nodes are removed from the tree, leaving just the highest occurrence in the tree of any lexical item. [Tree diagrams not reproduced.]
The updated Charniak model, however, goes beyond Collins' models in making use of contextual information outside of a constituent to condition the probabilities of the structure, such as the label of parent and grandparent nodes. How are the features and dependencies that are used in these models discovered? In almost all cases, they come first from intuitions about what dependencies and generalizations matter, and second from trial and error. It remains an ongoing enterprise to clearly understand the best practice when building these sorts of stochastic CFGs. The question of what conditioning information buys real parsing accuracy is one that has provided some challenges to conventional wisdom since these parsers have been released. Several papers, including Gildea (2001) and Bikel (2004), have provided evidence that bilexical dependencies are not as important as once thought in achieving the highest level of accuracy. Contextual features such as the parent, grandparent, and sibling of the left-hand side, all of which are absent from Collins' models, have been shown repeatedly to be of great use in treebank parsing (M. Johnson, 1998b; Charniak, 2000; Roark, 2001; Klein and Manning, 2003b). Many of the pre-processing steps used by individual parsers – e.g., the complement marking for Collins' Model 2 and the coordination marking in the updated Charniak model – seem to provide benefits for those systems, yet are not widely adopted in other systems. Progress continues to be made in understanding how best to exploit rich dependencies within a context-free parsing system, though some of the best recent gains in parsing have been achieved by exploiting models that go beyond context-free in complexity, which is the subject of the next chapter.
However, there are a few remaining context-free topics that we must address.
8.2 Dependency Parsing

Up to this point, we have been focusing on constituent structure. However, there is another way to think about the structure of the string, based on the lexical dependencies in the structure. We already have at our disposal one common, very simple method for doing this, using constituent trees and head percolation. Figure 8.7 shows a two-stage mapping from a constituent structure to a dependency tree. The first step relabels every node in the tree with its lexical head, as defined in Section 8.1.2. The resulting tree is shown with the head child of each local tree in bold, from which the parent inherits its label. Recall that this sort of head percolation is used in defining lexicalized CFGs. In the second step, head child nodes are removed from the tree, and any children are attached directly to the parent of the removed node. In Figure 8.7, all of the labels in bold are removed, to achieve the dependency tree. This leaves exactly one node per word in the tree, at the highest node in the tree where it occurs. Each branch of the resulting tree represents a dependency between a head (governor), which is the parent, and a dependent child. Figure 8.8 shows three unlabeled dependency trees, corresponding to the three trees in Figure 7.8, obtained by taking the final nominal as head of NP and the verb as head of VP and S. In a dependency tree, each word is a single node in the tree, whereas in a constituent tree, a word can be the head of multiple constituents. We resolve this by only keeping the highest node in the tree labeled with a particular head, namely, the maximal projection of that head, as described above. This results in the three simple tree structures in Figure 8.8. If these dependencies form a tree, it means that there is a single governing head for every word in the string, an assumption that makes it difficult to capture all of the dependencies when facing phenomena like coordination or certain embedded clauses. The same, of course, is true of the lexicalized context-free models. In addition, by using the context-free trees to create the dependency tree in this way, we ensure that there are no crossing dependencies when they are projected into a graph over the linear string, as is done below the trees in Figure 8.8. While using the context-free constituency structure is one way to obtain a dependency tree, there exist very efficient lexicalized algorithms to do this directly (Eisner, 1996b).
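A minimal sketch of this two-stage mapping (our own code, with deliberately crude head rules of the kind described in Section 8.1.2; real head percolation tables are considerably more detailed):

# Stage 1: percolate lexical heads up the tree using simple head rules.
# Stage 2: read off (governor, dependent) arcs between the head word and the
# heads of the non-head children. Trees are (label, children) tuples, with
# preterminals represented as (POS, word).
def head_child(label, child_labels):
    if label == "NP":                          # final nominal
        for i in range(len(child_labels) - 1, -1, -1):
            if child_labels[i].startswith("NN"):
                return i
    if label == "PP" and "IN" in child_labels:
        return child_labels.index("IN")
    if label == "S" and "VP" in child_labels:
        return child_labels.index("VP")
    if label == "VP":                          # leftmost verb
        for i, lab in enumerate(child_labels):
            if lab.startswith("VB") or lab == "MD":
                return i
    return 0                                   # default: leftmost child

def lexical_head_and_deps(tree):
    label, rest = tree
    if isinstance(rest, str):                  # preterminal: its head is the word
        return rest, []
    heads, arcs = [], []
    for child in rest:
        head, child_arcs = lexical_head_and_deps(child)
        heads.append(head)
        arcs.extend(child_arcs)
    h = head_child(label, [child[0] for child in rest])
    arcs.extend((heads[h], dep) for i, dep in enumerate(heads) if i != h)
    return heads[h], arcs

tree = ("S", [("NP", [("DT", "the"), ("JJ", "aged"), ("NN", "bottle")]),
              ("VP", [("VBZ", "flies"), ("ADVP", [("RB", "fast")])])])
print(lexical_head_and_deps(tree))
# ('flies', [('bottle', 'the'), ('bottle', 'aged'), ('flies', 'fast'), ('flies', 'bottle')])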
Figure 8.8 Three possible dependency trees for the trees in Figure 7.8, also projected into a graph over the linear string. [Tree diagrams not reproduced.]
All of the example dependency structures that we have presented so far are unlabeled: the arcs connecting heads and dependents in the graphs at the bottom of Figure 8.8 have no label. We might choose, however, to label the arc between bottle and flies in the leftmost graph in Figure 8.8 as "subject," since bottle is the subject of flies in that analysis. Clearly, the mapping in Figure 8.7 from constituent structure to dependency structure loses some grammatical information by replacing non-terminal node labels with head words; to include grammatical information in a dependency graph, arcs are labeled with the appropriate grammatical relation. Note, however, that labels of grammatical relations (e.g., "subject") are typically quite distinct from non-terminal labels of constituents. Information encoded in a dependency structure may not be explicit in the constituent structure (and vice versa), which can complicate the mapping from one to the other. Crossing dependency structures are known as non-projective dependencies. While non-projective dependencies are relatively rare in English, languages with freer word order, like Czech, commonly have a larger proportion of non-projective dependencies, making a straightforward mapping from context-free constituent structure inadequate. This is one reason why non-constituent-based dependency structures form the basis of the well-known Prague Dependency Treebank of Czech (Böhmová et al., 2002). See Mel'čuk (1988) for details on approaches to dependency syntax. Several recent papers have explored non-projective dependency parsing for Czech (Nivre and Nilsson, 2005; McDonald et al., 2005; Hall and Novak, 2005). In Nivre and Nilsson (2005), special non-terminals are created in a context-free grammar which signal the presence of non-projective dependencies. In Hall and Novak (2005), corrective post-processing on constituent trees was investigated, to find tree structures which correspond to likely non-projective dependencies. In contrast to these two papers, in McDonald et al. (2005), the graph of possible dependencies is built and processed without an intermediate constituent structure. They show that the problem is the same as certain well-known maximum spanning tree algorithms that run in time quadratic with the length of the string. Hence, by eliminating the constraint on crossing dependencies that are typically used (with good reason) in dependency parsing algorithms intended for languages like English (Eisner, 1996b), parsing becomes more efficient, not less
efficient. Dependency parsing is an active research area for a range of languages. See, for example, the Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL) from 2006, where the shared task was multilingual dependency parsing (Buchholz and Marsi, 2006). Lin (1998b,a) and others (e.g., Kaplan et al., 2004) have advocated the use of dependency graphs to evaluate the output of natural language parsers, as an alternative to the PARSEVAL constituent-based metrics discussed in Section 8.1. There are several arguments in support of the use of dependencies for evaluation, all based on the subsequent use of parser output for semantic processing. First, since dependency graphs explicitly encode lexical relationships in the string, the graph can be seen as a simple semantic shorthand which is a step closer than labeled brackets to the ultimately desired output. Parsers that produce more accurate dependencies will have more utility for subsequent semantic processing. Second, there are many potential syntactic annotation formats that differ in constituent labels and boundaries, but which ultimately represent the same semantic structure. By using dependencies as an intermediate representation, differences that are unrelated to semantic processing can be ignored. Finally, there are key long-distance dependencies that are particularly difficult for context-free parsers to capture, and which labeled bracketing measures do not credit. Parsing approaches that do capture such dependencies should get credit for them, since they are key for semantic processing. Given a dependency graph, dependency evaluation is very similar to bracketing evaluation. A constituent is represented as a label, a beginning position, and a span; a dependency is represented by a label, a head position, and a dependent position. By substituting the latter for the former, dependency accuracy can be evaluated just as labeled bracketing accuracy is. If every word is assumed to have a single governing head, there are exactly n−1 arcs in a dependency graph for a string of length n; hence precision and recall will be identical under such assumptions. While dependency evaluation has strong merits, true comparison between syntactic parsers with different annotation formats has proven very difficult. Post-hoc mapping from constituent structures to labeled dependency structures typically involves many assumptions, for example, which child is the head of a particular constituent or what dependency label is associated with a particular syntactic configuration. Ensuring that each mapping from parser output to dependencies is
truly extracting the parser's implicit dependencies is tricky, but essential. Otherwise, superior performance may simply be due to a superior mapping from the parse tree to dependencies, rather than any actual difference in the quality of the parses.

8.3 PCFG-based Language Models

We have seen how to calculate the probability of a string from a PCFG using the inside part of the Inside–outside algorithm in Section 7.3.4. Hence this sort of model can be used as a statistical language model for applications such as speech recognition and machine translation. In Chapter 6, we discussed the use of class-based language models as a way of capturing syntactic generalizations, and here there is potential to exploit even richer syntactic dependencies. One approach to PCFG-based language modeling is to calculate prefix probabilities, that is, the probability of the string w1 … wk being the first k words of a string of length >k. If one can efficiently calculate the prefix probability, then calculating conditional probabilities is straightforward. By definition

P(wk+1 | w1 … wk) = P(w1 … wk+1) / P(w1 … wk)
(8.18)
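A small sketch of how Equation 8.18 is used in practice: given some routine that returns PCFG prefix probabilities (a hypothetical prefix_prob function here, with the prefix probability of the empty prefix taken to be 1), conditional word probabilities and string scores follow directly.

import math

def conditional_logprob(prefix_prob, words, k):
    """log P(words[k] | words[:k]), per Equation 8.18."""
    return math.log(prefix_prob(words[:k + 1])) - math.log(prefix_prob(words[:k]))

def string_logprob(prefix_prob, words):
    return sum(conditional_logprob(prefix_prob, words, k) for k in range(len(words)))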
Efficient algorithms for calculating exact prefix probabilities given a PCFG, using incremental dynamic programming algorithms similar to the chart parsing algorithms presented in this chapter, were presented in Jelinek and Lafferty (1991) and Stolcke (1995). Stolcke’s algorithm, based on the Earley parsing algorithm, was used in Stolcke and Segal (1994) to derive n-gram probabilities for use in smoothing a relative frequency estimated bigram model. A shift–reduce parser provides the means to calculating conditional probabilities in what is known as the Structured Language Model (SLM) (Chelba and Jelinek, 1998, 2000). Consider the shift–reduce derivation in Table 7.1. The algorithm proceeds by shifting items from the input buffer, then performing a sequence of zero or more reductions, until there is nothing more to do except shift the next item from the buffer. When it is ready to shift the next word, the SLM uses the head-lexicalized categories at the top of the stack to define a probability distribution over words, that is, the probability, given the stack state, that the next word is w. Like a trigram model, the SLM makes a second order Markov assumption, so that the word is predicted by the top two
categories on the stack. For example, before shifting “flies” in step 8 of Table 7.1, the SLM would calculate P(flies | NP, $), since NP is the topmost category on the stack, and there is no second category on the stack, so this is represented by a special “bottom-of-the-stack” character $. Note that in their model these non-terminal categories would be accompanied by their lexical heads, for example, NP-bottle. Recall that the basic shift–reduce parsing algorithm is deterministic, and that to extend this approach to non-deterministic grammars, some mechanism for dealing with multiple analyses is required, as discussed in Section 7.3.1. In the case of the SLM, a beam-search is used, whereby multiple derivations are pursued simultaneously. As a result, each derivation in the beam will be predicting the word given its stack state. A single probability is obtained by taking the weighted average of the probability assigned over the beam of candidate derivations. There are multiple probabilistic models that go into the SLM: the model that we have just discussed for predicting the next word given the stack state; and multiple models for making derivation moves, to provide derivations with their respective probabilities. Improvements to these models have been made over the past years, by including richer features in the statistical models (Xu et al., 2002), as well as by improvements in parameter estimation techniques (Xu et al., 2003). Another approach that makes use of a beam-search to apply statistical parsing techniques for deriving language model probabilities is the top-down approach taken in Roark (2001; 2002), which we will call the TDP model. Unlike the SLM, the TDP model derives conditional probabilities directly from the grammar, as in Equation 8.18. This approach also uses a beam-search, however, which means that some of the probability is thrown away – specifically the probability associated with parses that fall below the threshold and are pruned. What makes the model effective is the large amount of top-down context that is used to condition the probabilities of rules, including non-terminal labels and lexical head information from constituents in the left context. Because of the top-down, left-to-right derivation strategy, all of that contextual information is available for use in the probabilistic model, leading to a very peaked distribution over parses. This allows for effective pruning so that relatively little probability is lost. Hence, even though the model itself is deficient (i.e., the probabilities sum to less than one), in practice this is not much of a problem. Another approach that derives the probabilities from the grammar is that of Charniak (2001), which makes use of the Charniak (2000)
parsing model discussed in Section 8.1.4. Unlike the TDP model, the Charniak model is not incremental (left-to-right), hence it calculates probabilities for strings as a whole, rather than conditional probabilities based on prefix probabilities. It exploits richer statistical models than the TDP model by virtue of having access to contextual information not in the left-context, leading to a more effective language model. One of the key issues in applying these parser-based language models to real data is efficiency. It is one thing to apply the parser to a single string, but for speech recognition or machine translation, these models will be required to assign scores to many strings for a single utterance to be recognized or string to be translated. Sometimes these take the form of an N-best list, output from a baseline system. However, a small N may prune the search space too heavily, so that there is little room for improvement over the baseline system. One alternative is to use a word lattice, which is an acyclic weighted finite-state automaton that compactly represents a much larger set of alternative word strings. All three of the above-mentioned parsing approaches – SLM, TDP, and Charniak – have been applied to word-lattices. In Chelba (2000), an A∗ algorithm was used to apply the SLM to a weighted word-lattice, to extract the highest scoring path from the lattice according to the model. In Roark (2002), parse candidates were considered competitors if the prefix strings that they parsed led to the same state in the word lattice, allowing the parser to take advantage of re-entrancies in the graph. In Hall (2004) and Hall and Johnson (2004), attention-shifting techniques were used to apply the Charniak parser to a word lattice in such a way that all transitions in the graph were guaranteed to participate in a parse, hence receiving a score in the syntactic language model. The reader is referred to those papers for more details. Syntactic language modeling continues to receive attention in the research community. Another notable approach is the SuperARV model presented in Wang and Harper (2002) and Wang et al. (2003), though this model is not based on a PCFG, but rather on a finite-state approximation to a context-sensitive grammar formalism, and will be discussed in the next chapter in Section 9.2.4.

8.4 Unsupervised Grammar Induction

The bulk of our discussion of computational approaches to syntax has focused on syntactic processing. Discussion of syntactic learning has focused on parameter estimation for stochastic models, not on learning
the syntactic structure from scratch given unannotated language samples. The main reason for this is that learning syntactic structure from scratch is not nearly as active an area of research: it is very difficult, and early attempts were discouraging. Even so, there has been some research on this problem, and even some recent progress, which we will briefly discuss here. Because any CFG can be converted to a weakly equivalent Chomsky Normal Form CFG, most work focuses on learning binary branching CFGs. Of course, the node labels are arbitrary inventions of the grammar writer, and cannot be inferred from text. From this perspective, there is a CFG learning approach that we have already presented: the Inside–outside algorithm. For a given non-terminal set, one can initialize all possible binary rules with some probability, then use the Inside–outside algorithm to iteratively re-estimate the grammar parameters until convergence (Lari and Young, 1990). After convergence, rules with probability zero are removed from the grammar, leaving an induced CFG that also happens to be a PCFG. In this approach, the initialization should not be uniform, or the algorithm will never move away from the initial distribution in re-estimation. The approach suffers from local maxima, where the learning can get stuck; that is, the globally optimal solution is not guaranteed to be found. As a result, the success of the approach can be very sensitive to the starting point. Evaluation of stochastic grammar learning can either be with respect to the quality of the probabilistic model over strings (as a language model) or with respect to the degree to which the grammar corresponds to human syntactic annotations. The motivation of Lari and Young (1990) was primarily the former, and the techniques described there have been used successfully to improve probabilistic language models based on stochastic grammars, for example, Chelba and Jelinek (1998); Chelba (2000). In contrast, Carroll and Charniak (1992) used the Inside–outside algorithm to try to induce syntactic structure that corresponded in some degree to human annotations. To do this, for every POS-tag A, they created a non-terminal tag Ā that stands for a constituent headed by A, as well as a start symbol (S). From these non-terminals, they created all possible rule productions with up to four non-terminals on the right-hand side. For this discussion, however, following the presentation in Klein (2005), we will discuss this approach just with respect to Chomsky Normal Form grammars. In the Carroll and Charniak (1992) approach, every POS-tag was treated as a terminal item, hence in a Chomsky Normal Form grammar there is a unary
Figure 8.9 Unlabeled parse tree, and cells in chart
production Ā → A. For every pair of POS-tags A, B, there are four possible rules by which they can combine: (1) Ā → Ā B̄; (2) Ā → B̄ Ā; (3) B̄ → B̄ Ā; and (4) B̄ → Ā B̄. The first experiment in Carroll and Charniak (1992) used a hand-built PCFG to randomly generate 10,000 words of training data, from which a grammar could be trained using the Inside–outside algorithm. Low probability rules were pruned, and the resulting grammar was compared with the original. The results were poor. For example, they point out that, although the only rule for creating a p̄ron constituent (for pronouns) in the original grammar was p̄ron → pron, the two highest probability productions in the learned grammar expanding this category were p̄ron → p̄ron v̄erb and p̄ron → v̄erb p̄ron. Further, they randomly initialized the rule weights for use in the Inside–outside algorithm 300 times, which resulted in 300 different solutions, indicating severe problems with local maxima. A second experiment in Carroll and Charniak (1992) investigated the behavior when constraining the problem by restricting which non-terminals could appear on the right-hand side for particular left-hand sides. This is very similar to the notion of distituents (Magerman and Marcus, 1990), which are sequences of words that do not form a constituent. With some constraints of this sort, their Inside–outside based algorithm was able to learn the correct grammar. Imposing constraints on learning was also shown to be useful in Pereira and Schabes (1992), where the grammar learning with the Inside–outside algorithm was constrained to respect some given bracketings. Recently, some novel techniques have provided substantial improvements over such baseline unsupervised approaches (Klein and Manning, 2002, 2004; Klein, 2005). To understand these techniques, let us first discuss the evaluation that was applied. Since node labels are of the labeler’s invention, and hence not learnable from an unlabeled corpus,
a single node label is used, call it “c” for constituent. For example, in Figure 8.9, there are three “c” nodes, corresponding to the NP, the VP, and the root S. Given a reference parse, manually annotated, one can evaluate the quality of an unlabeled bracketing provided by an induced grammar by using standard precision and recall measures (and the composite F-measure), though without considering the labels of the constituents. Of course, with such an evaluation, the root “c” is a given, so the score used in the above-cited papers omits the root category from consideration. Note that just three cells in the chart representation in Figure 8.9 are filled with the “c” label. The remaining cells represent distituents, that is, substrings that do not correspond to constituents. For example, the cell span=3, start=2 corresponds to the string “bottle flies fast,” which in this analysis is not a constituent. The model in Klein and Manning (2002) and Klein (2005) departs from the PCFG models discussed above by explicitly modeling not just the constituent spans in the string, but all spans, including distituent spans. They model the joint probability of the sentence S and the bracketing of the sentence B as the product of a prior distribution over bracketings and the conditional probability of the sentence given the bracketing. Unlike a PCFG probability, the conditional probability is defined as follows. Let τ_x^s be the sequence of POS-tags beginning at position x with span s, and let B_x^s be “c” if there is a constituent in the bracketing B from start position x with span s, and “d” otherwise. Then

P(S | B) = ∏_{x≥0, s>0} P1(τ_x^s | B_x^s) P2(τ_{x−1}^1, τ_{x+s}^1 | B_x^s)        (8.19)
The first probability P1 can be thought of as modeling the “inside” probability of either the constituent or distituent. The second probability P2 can be thought of as some kind of “outside” probability. Note that some probability is allocated in this model to bracketings that do not correspond to trees. Using an algorithm similar to the Inside–outside algorithm, Klein and Manning (2002) and Klein (2005) show strong improvements over several baselines in unlabeled F-measure accuracy compared to the Penn Treebank. Their replicated version of the Carroll and Charniak (1992) algorithm gave them 48 percent F-measure accuracy, improving on the 30 percent achieved with a random model. A simple right branching structure is a good fit to English, and this gives 60 percent accuracy, while their novel model takes accuracy to 71 percent.
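Returning to Equation 8.19, the model is simple enough to state directly in code. The following Python sketch is our own toy illustration (the function name ccm_score and the parameter tables p_yield and p_context are invented; in practice they would be estimated with the Inside–outside-style procedure described above), scoring a POS-tag sequence under a given bracketing:

def ccm_score(tags, brackets, p_yield, p_context):
    # P(S | B) as in Equation 8.19: a product over every span (x, s) of
    # P1(yield | label) * P2(context | label), where the label is "c" for
    # spans in the bracketing B and "d" (distituent) otherwise.
    n = len(tags)
    padded = ["#"] + tags + ["#"]              # boundary markers for contexts
    score = 1.0
    for x in range(n):                          # start position
        for s in range(1, n - x + 1):           # span length
            label = "c" if (x, s) in brackets else "d"
            span_yield = tuple(tags[x:x + s])
            context = (padded[x], padded[x + s + 1])   # tags just outside the span
            score *= p_yield.get((span_yield, label), 1e-6)
            score *= p_context.get((context, label), 1e-6)
    return score

# For the chart of Figure 8.9: tags = ["Det", "Adj", "N", "V", "Adv"] and
# brackets = {(0, 5), (0, 3), (3, 2)} (the root, the NP, and the VP); every
# other (x, s) pair is scored as a distituent.

The 1e-6 defaults merely stand in for smoothing; the point of the sketch is only the structure of the product in Equation 8.19.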
In subsequent work, Klein and Manning (2004) and Klein (2005) present results combining this model with a dependency model to get further improvements (reaching over 77 percent) for unsupervised learning of English. Given that the best possible performance is 87 percent,² this is an impressive result for unsupervised grammar learning.
8.5 Finite-state Approximations

One of the main points that we have tried to make in the last couple of chapters is that processing with context-free models comes at a large computational cost, relative to finite-state models. One way to try to get the best of both worlds is to start with a context-free model and modify it to make it finite-state equivalent. Recall from Section 7.2.2 that we can encode a CFG into a pushdown automaton (PA), that is, an automaton enhanced to allow for stacks at each state. If the grammar allows for self-embedding,³ there are an infinite number of possible stacks, hence an infinite number of states in the PA. One approach for making a finite-state approximation is to change the PA in some way that results in a finite number of states. We will discuss two such approximations, but note that there are many variants, and we refer the reader to Nederhof (2000) for an excellent presentation of not just PA-based approximations, but many others as well. In Pereira and Wright (1997), a finite-state approximation is made to a PA by creating a finite number of equivalence classes of stacks, then merging states with stacks that belong to the same equivalence class. Note that the resulting automaton will accept all strings from the original PA, as well as others that would not have been accepted, hence this is termed a superset approximation. The question becomes how to define the equivalence classes. In Pereira and Wright (1997), if a stack S has two occurrences of the same symbol A on the stack, then this stack is considered equivalent to a shorter stack S′ that has all of the symbols between the two occurrences, plus the second A, removed
² The original treebank is not binary branching, so the best one can do with a binary branching grammar is less than 100 percent accuracy.
³ A grammar is self-embedding if there exists a non-terminal A ∈ V such that A ⇒* αAβ for α and β that are not ε, i.e., α, β ∈ (V ∪ T)+.
from the stack. For example, if S = [B, C, A, D, E, A] (listing the stack from the top down), then S′ = [B, C, A].
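One way to picture the equivalence classes is as a canonicalization of stacks. The following Python sketch is ours (not code from Pereira and Wright; the name reduce_stack is invented for illustration): it collapses a stack to a representative with no repeated symbols by cutting back to the earlier occurrence whenever a symbol recurs.

def reduce_stack(stack):
    # stack is listed from the top down; whenever a symbol recurs, the
    # material between the two occurrences, plus the later occurrence,
    # is removed, as in the example above.
    reduced = []
    for symbol in stack:
        if symbol in reduced:
            reduced = reduced[:reduced.index(symbol) + 1]
        else:
            reduced.append(symbol)
    return reduced

# reduce_stack(["B", "C", "A", "D", "E", "A"]) == ["B", "C", "A"]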
Since there are only a finite number of such classes (each with at most |V| symbols on the stack), the resulting automaton is finite-state, though perhaps very large. A second method for making a finite-state approximation based on a PA is to remove states whose stack depth (the number of categories on the stack) is larger than some bound k. Unlike the previous approximation, where states are merged, and hence paths preserved, here states (and hence paths) are eliminated. Thus only a subset of the strings that would have been accepted under the original PA are accepted under the new PA. This is a technique that Nederhof (2000) notes has been discovered and re-discovered multiple times. One relatively recent approach to this, however, suggests performing a left-corner transform on the grammar in advance of performing the finite-state approximation (M. Johnson, 1998a). The rationale for this is that the stack size with a left-corner transformed grammar is bounded, except with certain kinds of center embeddings, so that it is just these sorts of constructions that would be lost with the approximation. Indeed, it is this very property that has made some speculate that left-corner parsing is a good model for human sentence processing (Resnik, 1992), since people have trouble with center embeddings as well. The final method we will mention here is presented in Mohri and Nederhof (2001). The method involves transforming any CFG into one that is strongly regular, that is, one which has no self-embedding. Such a grammar can always be compiled, without approximation, into a finite-state automaton; in other words, it is finite-state equivalent. The grammar transformation is relatively simple, involving at most one additional non-terminal for every existing non-terminal. The approach works as follows. First, partition the set of non-terminals into sets of mutually embedding non-terminals. Two non-terminals A, B ∈ V are mutually embedding if A ⇒* αBβ and B ⇒* γAδ for some α, β, γ, δ ∈ (V ∪ T)*. Note that this is a transitive relation, hence this partitions the original non-terminal set into k subsets,
and only those non-terminals that neither self-embed nor are mutually embedding with another non-terminal will be outside of these subsets. Once these subsets M1 . . . Mk ⊆ V have been established, for each subset Mi do the following:
• Add a new non-terminal A′ to V for all A ∈ Mi, and add an epsilon production A′ → ε to P.
• For every production A0 → α0 A1 α1 A2 . . . αn−1 An αn, where αj ∈ (T ∪ (V − Mi))∗ and Aj ∈ Mi for all 0 ≤ j ≤ n (i.e., all αj are sequences of zero or more symbols not in Mi and all Aj are in Mi), replace the production with the following n+1 productions:

A0 → α0 A1
A′1 → α1 A2
A′2 → α2 A3
...
A′n−1 → αn−1 An
A′n → αn A′0

For example, consider the tree on the left of Figure 8.10. Let all α and β sequences in the tree be outside of the set Mi, and let A0, A1, A2 ∈ Mi. Then the finite-state approximation above would transform the tree into the right-branching structure to its right. Note that, under the resulting transformation, non-terminals from the set Mi can now only appear on the right periphery of the tree, so that there is no self-embedding (see footnote 3). The resulting grammar is thus strongly regular and hence finite-state equivalent. Note that this is a superset approximation: all strings accepted by the original grammar are still accepted, but, if the original grammar is not strongly regular, some strings will be accepted that would not have been accepted before.

This is far from exhausting the topic of finite-state approximation to CFGs, and we recommend interested readers seek out the above references. In Section 9.2.4, we discuss finite-state and context-free approximations to certain context-sensitive formalisms.
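The rule-rewriting step above is easy to state programmatically. The following is a minimal Python sketch of our own (not code from Mohri and Nederhof; the function name transform_subset and the grammar representation are invented for illustration), assuming a grammar is given as a list of (left-hand side, right-hand-side list) pairs and Mi is one set of mutually embedding non-terminals:

def transform_subset(productions, M_i):
    # Rewrite productions with a left-hand side in M_i so that symbols of
    # M_i occur only right-peripherally; primed non-terminals are written A'.
    prime = {A: A + "'" for A in M_i}
    out = [(prime[A], []) for A in M_i]        # epsilon productions A' -> epsilon
    for lhs, rhs in productions:
        if lhs not in M_i:
            out.append((lhs, rhs))             # rules outside M_i are untouched
            continue
        # Split rhs as alpha_0 A_1 alpha_1 ... A_n alpha_n at symbols of M_i.
        alphas, As, cur = [], [], []
        for sym in rhs:
            if sym in M_i:
                alphas.append(cur); As.append(sym); cur = []
            else:
                cur.append(sym)
        alphas.append(cur)
        n = len(As)
        if n == 0:
            out.append((lhs, alphas[0] + [prime[lhs]]))        # A_0 -> alpha_0 A'_0
        else:
            out.append((lhs, alphas[0] + [As[0]]))             # A_0 -> alpha_0 A_1
            for j in range(1, n):
                out.append((prime[As[j - 1]], alphas[j] + [As[j]]))
            out.append((prime[As[n - 1]], alphas[n] + [prime[lhs]]))
    return out

Applied to the self-embedding grammar S → a S b, S → ε (with Mi = {S}), the sketch returns S → a S, S → S′, S′ → b S′, S′ → ε, which accepts the regular superset a*b* of the original language a^n b^n, illustrating the superset nature of the approximation. (A rule with no Mi symbol on its right-hand side, the n = 0 case, is handled here by appending A′0, following the same pattern.)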
Figure 8.10 Illustration of the effect of the Mohri and Nederhof (2001) grammar transform on trees, resulting in strongly regular grammars

8.6 Summary

In this chapter, we have examined a large number of computational approaches making use of grammars and models that are context-free, which attempt to disambiguate via enriched grammars. Central to
most of the approaches outlined here are the CYK and Inside–outside algorithms from the last chapter, which are the context-free versions of the Viterbi and Forward–backward algorithms from the chapter before that. The best models presented here achieve parsing accuracies well beyond what could have been achieved just fifteen years ago, and context-free parsing continues to be an active area of research. Note that many of the features that were used to enrich the node labels in PCFGs discussed in this chapter – such as lexical heads and subcategorization frames – were inspired by approaches to syntax that go beyond context-free. The importance of non-local information for syntactic disambiguation is unquestioned, and in the next chapter we will look at a number of approaches to capturing such information beyond what is possible within a context-free grammar.
9 Context-sensitive Approaches to Syntax

Just as context-free grammars describe languages that cannot be described with finite-state grammars, context-sensitive grammars go beyond what can be described with CFGs, though again at an efficiency cost. In this chapter, we will describe a number of computational approaches to syntax that go beyond the context-free models that we have described in the last chapter.
9.1 Unification Grammars and Parsing

Consider the example sentences in Table 9.1. Most speakers of English would judge sentences 1 and 4 grammatical and sentences 2 and 3 ungrammatical because the subject NP and VP disagree in whether they are singular or plural. One way to think about this is that there is a constraint on grammaticality that requires a number feature of the NP and the VP to agree in the value that they take. Unification grammars are one way to extend context-free grammars to incorporate constraints of this sort. A grammar is context-free if the parse inside of a category is independent of anything except that category. If the CFG rule production is S → NP VP, then the inside of the NP does not depend on what happens inside the VP. This is crucially what allows for the dynamic programming solutions in the CYK and Inside–outside algorithms. Hence, if the rule production is as shown in Table 9.1, both “the players” and “the player” are possible NP constituents, and both “want to play” and “wants to play” are possible VP constituents, hence the CFG will permit the ungrammatical sentences, that is, it will fail to enforce number agreement.
Table 9.1 Example sentences illustrating grammaticality constraint

                 VP singular                        VP plural
NP singular      1. The player wants to play        2. ∗The player want to play
NP plural        3. ∗The players wants to play      4. The players want to play
For simple binary features like singular versus plural there is a simple way to include them in CFGs: annotate them directly onto node-labels. This is the same approach that was taken for improving PCFG models in the previous chapter, for example, parent or lexical head annotation. In this case, we can go from the single production S → NP VP to two productions S → NPsg VPsg, and S → NPpl VPpl. Each of these new non-terminals, then, would produce only singular or plural NPs or VPs. We gain this agreement through an increase in the number of nonterminals. If we have a CFG and a finite number of such simple feature values, then it is clear that, through feature value annotation, we can derive another (larger) CFG. The problem arises when the feature values are infinite in number, in which case there is no way to compile this into a CFG. This is similar to the relationship between CFGs and finite-state automata: if a pushdown automaton equivalent to the CFG has a finite number of states, then the CFG is finite-state equivalent. A unification grammar is CFG equivalent if its feature values can be compiled into a finite number of non-terminals. Unification grammars, such as Head-Driven Phrase Structure Grammar (HPSG) (Pollard and Sag, 1994) and Lexical-Functional Grammar (LFG) (Kaplan and Bresnan, 1982), add feature or attribute structures to constituents, and enforce constraints on these structures. Each feature or attribute takes a value, which may be (1) underspecified, such as, the number feature value of the noun “moose,” which can be either singular or plural; (2) an atomic value, as in the number feature being either singular or plural; (3) a composite feature structure in its own right, such as the feature structure of an embedded clause; or (4) a pointer to another value within the feature structure, for example, if the subject of the embedded clause is the same as in the main clause. It is easy to see that arbitrary feature structures can be disastrous computationally. Since the kinds of features that are often used in unification grammars include head dependencies and subcategorization requirements, one might imagine the feature structure encoding the
entire dependency structure of the constituent. However, the number of such alternatives grows exponentially with the length of the string, which quickly becomes intractable to explore exhaustively. Such circumstances lead to multi-pass parsing strategies, with initial context-free parsing, or parsing with a minimal amount of feature structure instantiation, followed by a full feature structure analysis based on the resulting constituent structures. These strategies are similar to those employed by enriched context-free approaches discussed in the last chapter. Consider sentence (1) from Table 9.1. The sentence means that the player wants the player to play, not that the player wants somebody else to play. The player is hence the subject of both wants and to play. This dependency, in transformational grammar, is usually denoted with an empty subject in the embedded clause, with co-indexation to indicate that the empty subject is the same as the subject of the main clause. This is the way, for example, the Penn Treebank annotates such dependencies. Figure 9.1 shows the treebank tree that would label this string, along with a dependency graph that shows the subject relation between player and both of the verbs. Note the difference from the dependency trees that we discussed in the previous chapter – here words can have multiple governors, whereas in a tree structure, even a non-projective tree structure, a word can have only one governor. In a unification grammar approach like LFG, empty nodes are not part of the constituent structure (or c-structure) as in Figure 9.1. Rather, the dependencies between player and the verbs would be indicated within a feature/functional structure (or f-structure). The verb wants would contain within its lexical specification that, when it takes a subjectless infinitival verb phrase as an argument, the subject feature of the embedded clause would be the same as the subject feature of the main clause. In the c-structure, there would be no empty node and no indexing – all such information is kept in the f-structure. The f-structure in LFG has attributes (or features) and values of the attributes. As mentioned earlier, the value of a feature can be an f-structure in its own right. For sentence (1) from Table 9.1, a simple f-structure for the string as a whole would look something like the structure at the root of the tree in Figure 9.2. This f-structure can be interpreted as follows.¹ Outside of the f-structure is the category of the
¹ Some of our attribute labels come from LFG, some from HPSG, and others are of our own invention. For our purposes, the conventional notation of one approach versus
Figure 9.1 Use of co-indexation to indicate that the subject of the main clause is the same as the (empty) subject of the embedded clause; and a re-entrant graph representation of the same
Figure 9.2 Tree of f-structures for string “the player wants to play”
constituent, which is S at the root of the tree. The attribute HEAD denotes the head meaning-bearing item, which is often accompanied with some requirements on its arguments. In this case, the head of the sentence is the verb wants, which has the meaning “want” and requires a subject (SUBJ) and (in this case) an infinitival sentential complement (COMP) with an empty subject. The SUBJ attribute has a feature structure as its value, which indicates that the subject of the sentence is definite (DEF), singular (SG), and carries the meaning “player.” In order to make links with other attributes, we have given this the index (1). 2 The sentential complement has its own HEAD attribute value, but the value of its SUBJ attribute is a pointer (in our case an index) to the SUBJ value from the main clause. In addition to defining feature structures, LFG involves some general constraints on well-formedness that operate on these structures. Most notably, there are constraints on completeness and on coherence. The former says (roughly) if a HEAD requires an argument, it should get it; the latter says (roughly) a HEAD should only get those arguments it asks for. another is irrelevant. We are aiming to give a very basic flavor of the approaches, and will omit much of the fairly rich notation associated with LFG and HPSG. 2 LFG notation typically draws a line from a value to other linked attributes.
One key insight in LFG and HPSG is that the lexicon is critical for describing what happens syntactically. Looking at the leaves of the tree in Figure 9.2, we can see that, in addition to the atomic POS-tags, each word is associated with an f-structure that at least partially dictates how it can combine with other constituents. The main verb “wants,” for example, requires a SUBJ and COMP. In addition, the SUBJ should have a NUM value of SG, to agree in number; the COMP should have the INF value of +, and its SUBJ value will be a pointer to the SUBJ of the main clause. Note that this is not the only f-structure that the word “wants” can have, since there are multiple syntactic variants of the word. For example, one can have the sentence “the player wants John to play,” in which case the subject of the embedded clause is not empty. That use of “wants” must be treated as a separate lexical f-structure, just as the word can would have separate atomic POS-tags for its common noun and modal verb uses. We have been discussing unification grammars, but so far we have yet to talk about the formal operation of unification. Unification is an operation on two feature structures that succeeds when values for the same feature match, and fails when the values do not match. Crucially, an underspecified value matches a specified value by taking on that value. By convention, assume that any feature that is not explicitly provided a value in any particular f-structure is underspecified. Let us look at the feature structures of the key constituents in Figure 9.2, and see how they combine through unification to provide the global feature structure at the root. First consider the unification between the Det and N to form the NP, which is the simplest in the tree. Since the f-structure associated with the noun is underspecified for definiteness (DEF), and the determiner is underspecified for either HEAD or NUM, unification serves merely to instantiate all three of these values in the f-structure of the resulting NP. To use formal notation $
DEF +
%
⎤ DEF + HEAD “player” = ⎣ HEAD “player” ⎦ NUM SG NUM SG
⎡
(9.1)
Suppose, however, instead of the NUM underspecified determiner “the,” we had a determiner such as “several,” which does not agree with the singular “player”? In such a case, we would have a failure of
254
9. context-sensitive approaches to syntax
unification: $
NUM PL
%
HEAD “player” = failure NUM SG
(9.2)
The question now arises, how did we know to simply unify the feature structures of the Det and the N to create the f-structure of the NP? Straight unification will not always be the case, as we shall see, so there must be some kind of notation to indicate how the feature structures combine. The convention is to augment the phrase structure rules with functional information that dictates how the f-structures combine to form the f-structure of the new constituent. Examples of this are the following three rules: NP →
VP S
→
→
Det ↑=↓
V
↑=↓
N
↑=↓
VP (↑COMP) =↓
NP (↑SUBJ) =↓
VP
↑=↓
(9.3)
These are interpreted as follows. The ↑ refers to the f-structure of the new constituent; the ↓ refers to the f-structure of the child under which the symbol occurs; and (↑X) refers to the X feature of ↑. Hence, the NP phrase structure rule dictates that the resulting f-structure comes from both the Det and the N f-structures. In such a case, the two fstructures must go through unification, as we demonstrated earlier. For the VP phrase structure rule, the resulting f-structure comes from the V child, and the f-structure of the VP child becomes the value of the COMP feature in the resulting f-structure. Note that, in our example tree in Figure 9.2, the f-structure associated with the V already has a COMP feature value, which must then unify with that provided by the VP child: ⎤ ⎡ SUBJ (1) SUBJ (1) INF + ⎦ = ⎣ INF + INF + HEAD “play <SUBJ>” HEAD “play <SUBJ>” (9.4) The outcome of this unification is the value of the COMP feature of the resulting VP. Finally, to produce the f-structure at the root of the sentence, the VP child provides the f-structure for the resulting
9.1 unification grammars and parsing
255
constituent, and the NP provides the SUBJ value. There is already a SUBJ value, which must unify with the NP f-structure. We leave it as an exercise to the reader to verify that it does so. With this very brief introduction to unification grammars, we can begin discussing parsing. Viewing the f-structures within a tree structure of the utterance gives an indication of how one might go about parsing with such a grammar. A brute-force method would simply have non-terminals in the grammar that are composed of constituent labels and their associated f-structures. For simple unification grammars, making use of relatively simple features and feature values, this sort of approach is viable. While a linguist would write the grammar using unification grammar notation, enabling a relatively parsimonious description of the grammar, the grammar would be compiled for processing into an equivalent context-free grammar by instantiating all possible combinations. For example, if the unification grammar simply encodes the NUM feature, then the single rule for S→NP VP that enforces an agreement constraint on this feature would be compiled out into two rules: S[SG]→NP[SG] VP[SG] and S[PL]→NP[PL] VP[PL]. Of course, this approach breaks down when the grammar is not context-free equivalent, that is, when the number of non-terminals is infinite. In fact, infinity is not the practical breaking point: even for context-free equivalent grammars, the size of the non-terminal set and the number of productions in the compiled context-free grammar can quickly become too large for tractable search. One can adopt a twostage approach, where the constituent structure minus the f-structures is first used to build a compact representation of all possible parses, using context-free algorithms along the lines of the Inside–outside algorithm. The resulting “parse forest” is then used to derive f-structures. Some of the constraints, however, that are part of the f-structures can serve to prune out analyses from the parse forest, leading to more efficient parsing and smaller forests. Including functional constraints in the constituent structure parsing is called interleaving, and there are some efficient techniques for doing this that we will briefly discuss. The basic idea behind making interleaving of functional constraints in chart parsing more efficient is to be lazy in copying features, only doing so when those features are required for unification or for some kind of constraint satisfaction. Consider, for example, the tree in Figure 9.2. While there is a lot of copying of features upward into the parent f-structure, only a subset of these features actually involve constraints that must be satisfied: the COMP f-structure has an INF feature
256
9. context-sensitive approaches to syntax ⎡ ⎤ ⎤ DEF ⎦ ⎥ ⎢ SUBJ (1) ⎣ HEAD ⎢ ⎥ S⎢ NUM SG ⎥ ⎢ ⎥ ⎣ HEAD “<SUBJ, COMP>” ⎦ COMP ⎡
⎡ ⎤ DEF ⎦ NP⎣ HEAD NUM SG $ % Det DEF +
⎡
SUBJ ⎢ HEAD ⎢ VP⎢ ⎢ ⎣ COMP
HEAD “player” N NUM SG
$ % ⎤ (1) NUM SG ⎥ “<SUBJ, COMP>” ⎡ ⎤⎥ ⎥ SUBJ ⎥ ⎣ INF ⎦⎦ + HEAD “<SUBJ>”
the player
$ % ⎤ SUBJ (1) NUM SG ⎢ HEAD “want <SUBJ, COMP>” ⎥ ⎥ V⎢ ⎣ ⎦ SUBJ (1) COMP INF + ⎡
INF + VP HEAD “play <SUBJ>”
to
play
wants
Figure 9.3 Tree of partially unexpanded f-structures for string “the player wants to play”
in both children of the VP; the main SUBJ f-structure has a NUM feature coming from both the NP and the VP; and the predicates of the main and subordinate clauses have requirements for certain arguments. In all other cases, the features do not potentially rule out other analyses. Figure 9.3 shows the f-structure tree from Figure 9.2 with partially unexpanded feature values at the three non-leaf f-structures. By unexpanded feature values, we mean features that have values specified by one of their children, but where that value is not made explicit in the fstructure. For example, at the root f-structure, there is a COMP feature, but its value is left blank. It can later be recovered by looking for the COMP feature in the children of the item. The benefit of doing this is that multiple items in the chart which would otherwise have to be stored separately can be stored together, since with their feature values unexpanded they may be equivalent. Later, if the feature needs to be expanded because it is needed for a unification, the feature values can be expanded. But this will only be done when needed, which can lead to far less effort than fully splitting from the beginning. Consider the subject NP. Since there is no feature coming from both the left and the right children, and no subcategorization constraints imposed by the head, all three features can be left unexpanded initially. Its parent, however, will require that the NUM feature be expanded, to check whether the NP agrees with the NUM feature value imposed
9.2 lexicalized grammar formalisms and parsing
257
by the VP. In the VP node, we leave the lexical identities of the heads unexpanded, while passing along their subcategorization requirements. At the root S node, the COMP feature can go unexpanded, because it places no constraint on the NP subject. This sort of approach is used in the Xerox Linguistic Environment (XLE) project (Maxwell and Kaplan, 1998). This and related projects at PARC have pushed the LFG parsing approach to become both competitively efficient and robust (Riezler et al., 2002; Kaplan et al., 2004). To do so, they make use of a number of advanced techniques, in addition to a high quality manually built LFG grammar (Butt et al., 2002). First, they perform efficient interleaving of functional constraints while building packed parse forests. They also place limits on processing (called skimming) to stop processing when the time spent goes beyond what is deemed acceptable. They additionally make use of robustness techniques to provide partial analyses when full parses are unavailable due to skimming, ill-formed strings, or lack of coverage. Finally, they use advanced stochastic models for disambiguation. This last topic, stochastic disambiguation, has been a lively area of research over the past decade, and techniques that are used for this and other context-sensitive grammars will be presented in Section 9.3. In fact, the techniques that are used for parse selection in contextsensitive approaches has also been shown to be effective on the output of context-free parsers. These approaches, however, are of particular importance with context-sensitive grammars, since these grammars do not generalize as straightforwardly as CFGs to properly normalized generative models. Before discussing these topics, however, we will present another class of context-sensitive grammars.
9.2 Lexicalized Grammar Formalisms and Parsing

A second set of natural language grammar approaches that go beyond context-free are the so-called “lexicalized” grammar formalisms, including Tree-adjoining and Categorial grammars. In these approaches, phrase structure rules are replaced with a relatively small number of rule schemata, and most of the information about how to combine constituents or trees is embedded in lexical categories. These approaches belong to the set known as “mildly” context-sensitive grammar formalisms (Joshi et al., 1991), which, unlike general unification grammars, have polynomial complexity parsing algorithms.
9.2.1 Tree-adjoining Grammars In this section, we will present Tree-adjoining grammars (TAG) (Joshi et al., 1975; Joshi, 1985), specifically those that are lexicalized (Schabes et al., 1988). Recall that a CFG G = (V, T, S † , P ) includes non-terminal (V ) and terminal (T ) vocabularies, a special start symbol S † ∈ V , and a set of context-free rule productions P . For TAGs, the rule productions P are replaced with two sets of elementary trees. Two operations, substitution and adjunction, can then combine trees. A TAG G = (V, T, S † , I, A), where I is a set of initial trees and A is a set of auxiliary trees. Each tree consists of frontier nodes with no children (at the leaves or frontier of the tree) and interior nodes that are not at the frontier. Interior nodes of initial and auxiliary trees must come from the set V ; frontier nodes can come from either V or T . In lexicalized TAG, at least one frontier node, known as the anchor, must come from T . All non-terminals at the frontier of elementary trees, with one important exception, are marked for substitution. The exception is known as the foot node, indicated with an asterisk, which is exactly one non-terminal node on the frontier of auxiliary trees. The label on the foot node must match the label at the root of the auxiliary tree. Figure 9.4 shows five elementary trees, four initial (labeled with ·) and one auxiliary (labeled with ‚). Note that each tree is anchored with a single word, and that all other frontier nodes are non-terminals. The frontier non-terminal of the auxiliary tree ‚1 has an asterisk, and its label matches that of the root – it is the foot node of that tree. There are two ways to combine trees. The first is substitution. The initial tree ·2 in Figure 9.4 has a Det non-terminal in the frontier, which matches the root of the initial tree ·1 , hence ·1 can substitute for the frontier Det node in ·2 . A tree that results from the combination of trees is called a derived tree. By substituting ·1 into ·2 we get a new derived tree ·5 , shown in Figure 9.4. The second way that trees can combine is through adjunction. An auxiliary tree can be inserted (or adjoined) into another tree at nodes that match the label of the root and foot of the auxiliary tree. For example, the VP auxiliary tree ‚1 in Figure 9.4 can adjoin to the initial tree ·3 to produce the derived tree ·6 . One may think of this operation as follows: the VP node in ·3 is separated from the rest of the tree and attached to the foot node of ‚1 ; then the root of ‚1 attaches where the original VP node separated.
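As a rough illustration of these two operations (our own minimal Python sketch; the Node class and the functions substitute and adjoin are invented names, not the XTAG representation), elementary trees can be stored as nested nodes, with substitution replacing a frontier non-terminal and adjunction splicing an auxiliary tree in at an interior node:

class Node:
    def __init__(self, label, children=None, subst=False, foot=False):
        self.label = label
        self.children = children or []   # no children: a frontier node
        self.subst = subst               # frontier non-terminal marked for substitution
        self.foot = foot                 # foot node of an auxiliary tree

def substitute(tree, target, initial):
    # Replace the frontier node `target` with the initial tree `initial`,
    # whose root label must match the label of `target`.
    if tree is target:
        assert target.subst and initial.label == target.label
        return initial
    return Node(tree.label, [substitute(c, target, initial) for c in tree.children],
                subst=tree.subst, foot=tree.foot)

def adjoin(tree, target, auxiliary):
    # Adjoin `auxiliary` at the interior node `target`: the subtree rooted at
    # `target` is re-attached at the foot node, and the auxiliary root takes
    # the place where `target` was.
    if tree is target:
        assert auxiliary.label == target.label
        return _plug_foot(auxiliary, Node(target.label, target.children))
    return Node(tree.label, [adjoin(c, target, auxiliary) for c in tree.children],
                subst=tree.subst, foot=tree.foot)

def _plug_foot(aux, subtree):
    if aux.foot:
        return subtree
    return Node(aux.label, [_plug_foot(c, subtree) for c in aux.children], subst=aux.subst)

In terms of Figure 9.4, substituting the Det tree α1 at the Det frontier node of α2 yields the derived tree α5, and adjoining β1 at the VP node of α3 yields α6.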
9.2 lexicalized grammar formalisms and parsing Det ·1 =
NP ·2 =
the
Det
S
NP
·3 =
N
·4 = NP
VP
kids
259
V
pinecones
NP
throw VP ‚1 =
V
NP
VP∗
·5 =
will
S
Det
N
the
kids
·6 = NP
VP VP
V
S ·7 =
will NP
VP
Det
N
the
kids
NP
throw VP
V will
V
V
NP
throw
pinecones
Figure 9.4 Elementary and derived trees: 4 initial trees (·1 –·4 ), one auxiliary tree (‚1 ), and 3 derived trees (·5 –·7 )
If we substitute NP-rooted trees ·4 and ·5 for the NP frontier nodes in ·6 , we get the full derived tree for the string “the kids will throw pinecones.” In addition to these operations, TAGs generally have constraints on adjunction, such as limiting the number of adjunctions that can occur at a particular node. A derivation produces a derived tree, such as ·7 for the example we just reviewed, but it also produces a derivation tree, which describes the moves that were made to produce the derived tree. Each elementary tree used in the derivation is a node in the derivation tree, and it has a labeled link to the initial tree that it either substituted or adjoined into. For the current example, unlabeled derivation trees are shown in Figure 9.5. The first tree shows the initial tree identifiers at the nodes; the second replaces that identifier with the anchor word of the initial tree. This should look familiar – it is a dependency tree, which is a useful byproduct of the TAG derivation process. Labeling the dependency tree with substitution versus adjunction provides complement/adjunct distinctions within the dependency tree. Adjunction allows for TAGs to describe languages that are not describable with CFGs, such as copying languages like a n b n ec n d n . This
260
9. context-sensitive approaches to syntax throw
·3 ·2 ·1
‚1
·4
kids
will
pinecones
the
Figure 9.5 Unlabeled derivation trees, showing either elementary tree number or the anchor word of the elementary tree
allows the approach to deal with documented natural language syntactic phenomena like cross-serial dependencies. It does, however, come at a computational cost. While parsing algorithms do remain polynomial, the complexity using dynamic programming is n6 . To use the example that we presented in Chapter 7 to illustrate the expense of contextfree parsing: for a 50-word string, context-free parsing will have complexity 125,000· f for some constant f , while NP-chunking will have complexity 50·g for some constant g . TAG parsing will have complexity 15,625,000,000·h for some constant h. We shall see later in this section that this consideration motivates some well-known approximations to TAGs. To see where this n6 complexity is coming from, we will now present a CYK-like parsing algorithm, first presented in Vijay-Shanker and Joshi (1985), but modified here to be directly comparable to the CYK algorithm presented in Chapter 7. To do that, we need to be able to identify elementary trees and the particular node of interest within the elementary trees. There is a conventional approach to indexing the nodes within a particular tree: the root category is 0; the k children of the root are numbered 1 . . . k; and for any other node numbered j , its k children are numbered j.1 . . . j.k. For example, the V node in ·3 is node 2.1; and the VP foot node in ‚1 is node 2. For this current algorithm, we will assume binary branching trees, just as the CYK algorithm assumes Chomsky Normal Form grammars. For items in a context-free chart, we use two indices to identify the yield of the edge, the start index x and the span s . For adjunction, we need to keep track additionally of the possible span of foot nodes, which adds another two indices to items in the chart: the start index of the foot node, f , and its span, p. For trees with no foot node, p=0 and, by convention, f =x. While this makes for a four-dimensional chart, we will present this as a two-dimensional chart with each chart entry augmented by the two indices associated with possible foot nodes.
9.2 lexicalized grammar formalisms and parsing
261
Span 5 [(·3 , 0), (0, 0)] 4 3
[(‚1 , 0), (3, 2)] [(·3 , 2), (3, 0)] [(‚1 , 0), (3, 1)]
2
[(·2 , 0), (1, 0)]
1
[(·1 , 0), (0, 0)]
[(·2 , 2), (1, 0)]
[(‚1 , 1), (2, 0)]
Start index: 0
1
2
[(‚1 , 2), (3, 2)] [(·3 , 2), (3, 0)] [(·3 , 2.1), (3, 0)] [(‚1 , 2), (3, 1)] 3
[(·5 , 0), (4, 0)] 4
Figure 9.6 Chart built through the CYK-like algorithm for TAGs, using elementary trees in Figure 9.4
Figure 9.6 shows the chart that results from using the elementary trees in Figure 9.4 for parsing the string “the kids will throw pinecones.” We will first walk through the filling of this chart, then present pseudocode for the algorithm. For this particular example, each cell has at most two entries, so we will refer to entries by their start and span indices, and whether it is first or second if there is more than one. For space purposes, we do not include a backpointer in the entries, which is fine since each chart entry has only one possible way of being constructed. For every word in the input, we put a span one edge into the chart, for the POS node in the elementary tree that the word anchors. Hence, for start index x=0, the word “the” anchors tree ·1 , and the POS-tag Det is the root of that tree, i.e., index 0. The resulting edge is put into the cell x=0, s =1, with (·1 , 0) denoting the tree and node index, and (0, 0) denoting the foot start index and span. At start index x=1, the word “kids” is associated with node index 2 of ·2 , and, as there is no foot node, the foot span is again zero. The three other words also produce span one edges with span zero foot node indices, and these make up the first entries in the span one cells. In addition to the anchor words, we need to establish cell entries for any foot nodes, for any start index and span they may fill. In the current case, the only foot node is the VP∗ in ‚1 , which must start right after “will,” and can span from that point onwards to the end of the string. Hence the start index is x=3, and since the string ends at index 5, the span can be s =1 or s =2. For these foot node chart entries, we will use the indices stored with the edge to indicate the empty span. The second entry in cell x=3, s =1 is for node index 2 in ‚1 (the VP∗ foot node) and has internal indices f =3, p=1, for a span one foot beginning at start index 3. The span two foot node is the first entry in cell x=3, s =2.
262
9. context-sensitive approaches to syntax
With the chart thus initialized, we can begin combining things in the chart. Beginning with start index x=0 at span s =2, we will work our way through the chart, looking to combine things that can be combined. The edge in cell x=1, s =1 points to node 2 in ·2 , which needs a Det to its left to form an NP. Conveniently, the adjacent edge at x=0, s =1 points to a root node of a tree that has label Det. Hence this can be substituted, and we end up with an edge pointing to node 0 of ·2 in cell x=0, s =2. There is nothing to combine to make an entry in cell x=1, s =2, but at the following cell x=2, s =3, we can combine the anchor of the auxiliary tree ‚1 with its single span foot. In this algorithm, if two chart entries combine, at most one can have the foot span index p>0. In the case that one of the child edges has a non-zero foot span, the resulting edge gets the foot indices from that child. By combining chart items pointing to (‚1 , 1) and (‚1 , 2), we get a chart item pointing to their parent (‚1 , 0). In a similar way, we get a chart item pointing to (‚1 , 0) in the cell x=2, s =3. The NP “pinecones” substitutes for the object NP in ·3 , resulting in a chart entry pointing to the node 2 VP in ·3 in cell x=3, s =2. This is the VP node to which adjunction applies. If we have a chart entry pointing to the root of an auxiliary tree with label A and indices x=i, s = j, f =k, p=l , and a chart entry pointing to a node with label A in cell x=k, s =l , then these can be combined to create a chart entry in cell x=i, s = j with the same node pointer and foot span as the nonauxiliary child. In the current case, the first entry in cell x=2, s =3 adjoins with the second entry in cell x=3, s =2 to create the second entry in cell x=2, s =3. We leave it to the reader to follow the remainder of the derivation, which includes the substitution of the subject NP into ·3 . Having traced through this example in detail, we are now ready to work through the pseudocode for a chart-filling algorithm similar to the algorithm in Figure 7.13, which was given as a preliminary to the full-blown CYK algorithm. Figure 9.7 presents a chart-filling algorithm for binary branching TAGs, while Figure 9.8 provides some additional functions called in the TagFillChart function of Figure 9.7. Lines 1– 4 of that algorithm initialize all chart cells to the empty set; lines 5–23 initialize span one anchor word items and associated foot node items; lines 24–49 process cells of span greater than one. Instead of just iterating through possible mid-points m, as in the context-free algorithm, we must take into account the possibility of the empty spans, represented by the f and p indices. For substitution (lines 27–40 in Figure 9.7),
9.2 lexicalized grammar formalisms and parsing TagFillChart(w1 . . . wn , G = (V, T, S † , I, A)) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
263
Assumes G has binary branching trees
for s = 1 to n do for x = 0 to n−s do for f = x to x+s do initialize chart cells for p = 0 to x+s − f do C xs f p ← ∅ s ←1 scan in words/POS-tags (span=1) for t = 1 to n do x ← t−1 for j = 1 to |I | do if Tree-node(· j , wt ) ≥ 0 then r ← |C xs x0 |+1 C xs x0 ← C xs x0 {[(· j , Tree-node(· j , wt )), −, r ]} for j = 1 to |A| do if Tree-node(‚ j , wt ) ≥ 0 then r ← |C xs x0 |+1 C xs x0 ← C xs x0 {[(‚ j , Tree-node(‚ j , wt )), −, r ]} if Foot-node(‚ j ) < Tree-node(‚ j , wt ) foot node left of anchor then for f = 0 to x−1 do p ← x− f ; r ← |Cf p f p |+1 {[(‚ j , Foot-node(‚ j )), −, r ]} C f pf p ← C f pf p else f ← x+1 for p = 1 to n− f do r ← |C f p f p |+1 C f pf p ← C f pf p {[(‚ j , Foot-node(‚ j )), −, r ]} for s = 2 to n do for t = 1 to n−s +1 do x ← t−1; for m = t to t+s −2 do Substitution c ← m−x; q ← s −c for f = x to m−1 do No foot nodes or foot in left child for p = 0 to m− f do for all [(„ j , a), Ê j , y] ∈ C xc f p do for all [(„k , b), Êk , z] ∈ C mq m0 do C xs f p ← Update-subst-l(„ j , a, „k , b, m, y, z, C xs f p ) C xs f p ← Update-subst-r(„k , b, „ j , a, m, y, z, C xs f p ) for f = m to x+s −1 do Foot in right child for p = 1 to x+s − f do for all [(„ j , a), Ê j , y] ∈ C xc x0 do for all [(„k , b), Êk , z] ∈ C mq f p do C xs f p ← Update-subst-l(„ j , a, „k , b, m, y, z, C xs f p ) C xs f p ← Update-subst-r(„k , b, „ j , a, m, y, z, C xs f p ) for m = t to t+s −2 do Adjunction for q = 1 to s +x−m do for f = m to x+s −1 do for p = 0 to x+s − f do for all [(„ j , 0), Ê j , y] ∈ C xs mq do for all [(„k , b), Êk , z] ∈ C mq f p do if Label(„ j , 0) = Label(„k , b) then r ← |C xs f p |+1 C xs f p ← C xs f p {[(„k , b), (m, q , y, z), r ]}
Figure 9.7 Pseudocode of chart-filling algorithm for Tree-adjoining Grammars
there may be no foot node in the children, or the foot node can be in either the left child or the right child (but not both). In substitution, the tree being substituted into can either be on the left or the right, requiring two different functions to update the cell. Note that, if there
264
9. context-sensitive approaches to syntax
Tree-node(„, w)
Elementary Tree „, word w
1 if Anchor(„) = w 2 then return Anchor-node(„) 3 return -1
if w anchors „ return POS-tag node index of „
Update-subst-r(„k , b, „ j , a, m, y, z, C xs f p ) 1 d ← Result-node-r(„k , b, „ j , a) 2 if d ≥ 0 3 then r ← |C xs f p |+1 {[(„k , d), (m, −1, y, z), r ]} 4 C xs f p ← C xs f p 5 return C xs f p Result-node-r(„k , b, „ j , a) 1 2 3 4 5 6
l ← Left-sibling-node(„k , b) if a>0 and („ j = „k or l = a) then return -1 if Label(„ j , a) = Label(„k , l ) then return -1 return Parent-node(„k , b)
if neither root node (a=0) nor left-sibling
Update-subst-l(„ j , a, „k , b, m, y, z, C xs f p ) 1 d ← Result-node-l(„ j , a, „k , b) 2 if d ≥ 0 3 then r ← |C xs f p |+1 4 C xs f p ← C xs f p {[(„ j , d), (m, −1, y, z), r ]} 5 return C xs f p Result-node-l(„ j , a, „k , b) 1 2 3 4 5 6
r ← Right-sibling-node(„ j , a) if b > 0 and („ j = „k or l = b) then return -1 if Label(„k , b) = Label(„ j , r ) then return -1 return Parent-node(„ j , a)
if neither root node (b=0) nor left-sibling
Figure 9.8 Pseudocode of algorithms called by chart-filling algorithm for TAGs in Figure 9.7
were no foot nodes then f =x, p=0, so no looping through f and p values is required, that is, the algorithm would be n3 just like with CFGs. With foot nodes, the substitution part of the algorithm is n5 . Adjunction (lines 41–49 of Figure 9.7) is what makes the algorithm n6 . The tree that is adjoining („ j ) must have its start index x and span s identical to the resulting item, and its foot start index m and span q must be the same as the start index and span of the item being adjoined into („k ). This latter item might itself have a foot node, leading to six variables of order n that must be iterated over, or an n6 worst-case complexity.
9.2 lexicalized grammar formalisms and parsing
265
To take this chart-filling algorithm and make it a Viterbi/CYK-like algorithm, we would first need to include scores over derivations. Perhaps the most straightforward way to do this is to define probabilities over derivation trees, rather than derived trees. For example, in Figure 9.5, we could define the probability of the derived structure as: P(·3 | ROOT) P(·2 ‚1 ·4 | ·3 ) P(·1 | ·2 ). How to define probabilities or weights over derivations, and how to estimate the models once they are defined, is a topic that goes beyond the scope of this discussion. See Section 9.3 for more discussion of how to define models over contextsensitive formalisms. The CYK-like parsing algorithm for TAGs that we have been discussing has worst case complexity of n6 , but as we saw with context-free grammars, depending on the grammar, there can be best-case speedups over this algorithm, such as the Earley parsing algorithm (Earley, 1970). Such a top-down filtering approach is presented for TAGs in Joshi and Schabes (1997). Effective, large scale TAG parsing has been investigated extensively as part of the XTAG project (XTAG Research Group, 2001) at the University of Pennsylvania. One issue of particular importance for lexicalized grammars like TAG, as well as for unification grammars, is the efficient representation of the lexicon. With a context-free grammar, a single rule can be sufficient to allow for a new subcategorization frame; for example, if a grammar allows for transitive VPs, then the addition of a single rule production (VP → V NP NP) would extend the grammar to ditransitives. In TAG, a new ditransitive elementary tree would have to be introduced, which shares a lot of structure with transitive elementary trees and other verb-anchored elementary trees. Furthermore, there are many verbs that can anchor the same elementary tree or even the same set of elementary trees. Techniques for representing a lexicon efficiently in an inheritance hierarchy, such as DATR (Evans and Gazdar, 1989a, b), can be used to compactly encode the lexicon of a lexicalized TAG (Evans et al., 1995) such as that used for the XTAG project cited above. These approaches are equally important for other lexicalized formalisms. 9.2.2 Combinatory Categorial Grammars Another lexicalized grammar formalism closely related to TAG is Combinatory Categorial Grammar (CCG) (Steedman, 1985, 1986, 1996), which is weakly equivalent to TAG as a “mildly” context-sensitive grammar formalism. In other words, for any CCG, there is a TAG that
Figure 9.9 CCG categories for example string from last section (the: NP/N; kids: N; will: (S\NP)/(S\NP); throw: (S\NP)/NP; pinecones: NP)
In other words, for any CCG, there is a TAG that describes the same language and vice versa. As in TAG, much of the syntactic information is carried by rich lexical tags, which in this case are made up of a small number of atomic non-terminal categories and directional slashes. Then a small number of combinatory rules dictate how categories can combine. Figure 9.9 shows the string from the last section, each word with its corresponding CCG tag. Note that the only atomic tags in these composite tags are S, NP, and N. In every composite tag, the category to the left of the slash is the goal category, and the category to the right is what is needed to reach that goal category. A forward slash means look for the needed category to the right; a backslash means look for the needed category to the left. For example, the tag for "the" is NP/N, which means "give me an N to the right, and the result will be an NP." The category for "throw" means that, if an NP category is found to the right, the result will be S\NP, that is, a category that needs an NP to its left to produce an S. In other words, the main verb needs an object to its right and a subject to its left to make a full sentence. The auxiliary verb "will" has the most complicated category of all: it takes the composite tag S\NP to its right to produce exactly the same category, which is what adjunction does as well. Categories of the form X/X and X\X are how CCG handles modifiers like adjectives and adverbs. The first two (and simplest) combinatory rules are forward and backward application:
• Forward application: X/Y Y → X
• Backward application: Y X\Y → X
These rules basically state that, if a slash category finds what it needs to the left or right, then it results in the goal category to the left of the slash. Figure 9.10 shows the CCG derivation, using just forward and backward application, for our example string. Note that every non-POS non-terminal category is to the left of the slash of one of its children, and the other child is to the right of the slash.
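To make the mechanics concrete, the following minimal sketch (not from the book, and not a full CCG parser) shows one way the categories and the two application rules might be represented in Python; the tuple encoding of complex categories and the helper names are our own assumptions for illustration:

```python
# Atomic categories are strings; complex categories are (result, slash, argument)
# triples, e.g. ('S', '\\', 'NP') for S\NP.

def forward_apply(x, y):
    """Forward application: X/Y  Y  =>  X (returns None if the rule does not apply)."""
    if isinstance(x, tuple) and x[1] == '/' and x[2] == y:
        return x[0]
    return None

def backward_apply(y, x):
    """Backward application: Y  X\\Y  =>  X (returns None if the rule does not apply)."""
    if isinstance(x, tuple) and x[1] == '\\' and x[2] == y:
        return x[0]
    return None

# The lexical categories of Figure 9.9.
lexicon = {
    'the': ('NP', '/', 'N'),
    'kids': 'N',
    'will': (('S', '\\', 'NP'), '/', ('S', '\\', 'NP')),
    'throw': (('S', '\\', 'NP'), '/', 'NP'),
    'pinecones': 'NP',
}

vp = forward_apply(lexicon['throw'], lexicon['pinecones'])  # ('S', '\\', 'NP')
vp = forward_apply(lexicon['will'], vp)                     # ('S', '\\', 'NP')
np = forward_apply(lexicon['the'], lexicon['kids'])         # 'NP'
s = backward_apply(np, vp)                                  # 'S'
```

The sequence of calls mirrors the application-only derivation in Figure 9.10.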
Figure 9.10 CCG derivation using forward and backward application, represented as a tree: [S [NP [NP/N the] [N kids]] [S\NP [(S\NP)/(S\NP) will] [S\NP [(S\NP)/NP throw] [NP pinecones]]]]
Categorial grammars with forward and backward application have a long history (Ajdukiewicz, 1935; Bar-Hillel, 1953; Lambek, 1958), and are known to be weakly context-free equivalent. Combinatory Categorial Grammars get their mild context-sensitivity from other combinatory rules, such as forward and backward composition:
• Forward composition: X/Y Y/Z → X/Z
• Backward composition: Y\Z X\Y → X\Z
for some atomic or slash category Z. Forward composition states that, if the first category will result in an X if it finds a Y to its right, and the second category will result in a Y if it finds a Z to its right, then they can combine into something that will result in an X if it finds a Z to its right. Backward composition states the same thing to the left rather than the right. Note that each composition rule has all of the slashes in the same direction. There are other composition rules, known as crossed composition, which we will not be discussing here, but which play a role in the CCG analysis of certain constructions, such as heavy NP shift (Steedman, 1996). In addition, there are what are known as type raising (or type lifting) rules, which take a category and change it to another category. The basic idea is something like this: if there is a category Y and there is another category X\Y that produces an X if it finds Y to its left, in some sense the category Y is such that if it finds X\Y to its right, it produces an X:
• Forward type raising: Y → Z/(Z\Y)
• Backward type raising: Y → Z\(Z/Y)
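Continuing the sketch given above (again an illustrative encoding rather than any standard implementation), composition and type raising can be added as functions over the same tuple representation; applying them reproduces the (S\NP)/NP and S/NP categories that figure in the coordination example discussed below:

```python
# Same tuple encoding as in the application sketch: (result, slash, argument).

def forward_compose(x, y):
    """Forward composition: X/Y  Y/Z  =>  X/Z."""
    if (isinstance(x, tuple) and x[1] == '/' and
            isinstance(y, tuple) and y[1] == '/' and x[2] == y[0]):
        return (x[0], '/', y[2])
    return None

def backward_compose(y, x):
    """Backward composition: Y\\Z  X\\Y  =>  X\\Z."""
    if (isinstance(y, tuple) and y[1] == '\\' and
            isinstance(x, tuple) and x[1] == '\\' and x[2] == y[0]):
        return (x[0], '\\', y[2])
    return None

def forward_type_raise(y, z):
    """Forward type raising: Y  =>  Z/(Z\\Y)."""
    return (z, '/', (z, '\\', y))

def backward_type_raise(y, z):
    """Backward type raising: Y  =>  Z\\(Z/Y)."""
    return (z, '\\', (z, '/', y))

will = (('S', '\\', 'NP'), '/', ('S', '\\', 'NP'))
throw = (('S', '\\', 'NP'), '/', 'NP')
subj = forward_type_raise('NP', 'S')            # S/(S\NP)
will_throw = forward_compose(will, throw)       # (S\NP)/NP
s_over_np = forward_compose(subj, will_throw)   # ('S', '/', 'NP'), i.e. S/NP
```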
Function composition and type raising together make possible some very elegant analyses of very tricky natural language phenomena like non-constituent coordination, which cannot be captured easily (or at all) in other approaches.
Figure 9.11 CCG derivation for non-constituent coordination using both type lifting and forward composition
Consider an example like "the kids will throw and we will pick up pinecones." In this example "the kids will throw" is conjoining with "we will pick up" to create something that is still in need of an object NP to be complete. Figure 9.11 shows a tree of the derivation of this coordinated sentence using both type lifting and forward composition. First, notice that the NP "the kids" is type-lifted to S/(S\NP). Forward composition allows "will" and "throw" to combine to form the category (S\NP)/NP, which in turn combines with the type-lifted NP using forward composition to create the category S/NP, that is, something that needs an (object) NP to its right to complete an S. The word "and" takes categories of the sort (X/X)\X, which, using standard application rules, takes an X on the left and an X on the right and combines them into an X. In this case, X is S/NP. The string "we will pick up" has a similar internal structure (which we omit for space purposes) to "the kids will throw," leading to the category S/NP. These two categories conjoin, then find the sentence-final NP "pinecones" to form the S. This is a far more natural analysis for this sort of coordination than can be had in other grammar formalisms, and is a common example when demonstrating the utility of the approach.
A second strong selling point for CCG is the very straightforward syntax/semantics interface, with corresponding semantic categories for all syntactic categories. For example, using standard lambda calculus notation, the transitive verb "throw" has the following syntactic:semantic category: (S\NP)/NP : λy.λx.throw(x, y). The semantic categories have corresponding function application, composition, and type lifting rules, making this a very simple, straightforward framework within which to do compositional semantics, as in Montague (1974). While function composition and type lifting are the means to very elegant and compelling syntactic analyses, they are also the cause of a serious computational problem, namely spurious ambiguities. Note that, because of function composition, there are multiple derivations for our example string "the kids will throw pinecones" that do not represent real syntactic alternatives. For example, in addition to the derivation shown in Figure 9.10, there is also a derivation in which "will" is combined with "throw" via forward composition as in Figure 9.11, before combining with the object NP. With unconstrained type lifting, there are an infinite number of possibilities, as one category type-lifts over its neighbor, only to have its neighbor type-lift over it, and so on. The solution to this issue of spurious ambiguity is to impose constraints on combinatory rules and type lifting, so that they cannot apply if a simpler "normal form" derivation can be followed for the same analysis. In Eisner (1996a), constraints on composition are presented that ensure only one derivation per analysis, which can be used in chart parsing algorithms for CCG, such as those of Vijay-Shanker and Weir (1990; 1993). (Note that parsing CCG is polynomial, just like TAG, and has CYK-like algorithms very similar to what was presented in the last section; indeed, Vijay-Shanker and Weir (1993) present parsing algorithms for TAG, CCG, and other related formalisms, all within the same general parsing framework.) The past several years have seen a lot of research into efficient, robust, and accurate CCG parsing (Hockenmaier and Steedman, 2002; Clark et al., 2002; Clark and Curran, 2004), using pruning techniques and stochastic disambiguation. These CCG parsing approaches make use of CYK-like algorithms using normal form constraints, as discussed above. In addition, they at times make use of certain finite-state or context-free approximations, either to restrict the search space of full CCG models or as a stand-alone approximation to a full-blown CCG derivation. Similar techniques are used in the XTAG project cited in the
previous section. These approximations are presented in Section 9.2.4, and stochastic disambiguation in Section 9.3.

9.2.3 Other Mildly Context-sensitive Approaches

Joshi (1985) coined the term "mildly context-sensitive" to label a class of grammars that he claimed are sufficient to describe natural languages, while (among other things) allowing for polynomial parsing algorithms as Tree-adjoining Grammars do. Subsequent work has shown that a number of context-sensitive formalisms are, in fact, weakly equivalent to TAG (Joshi et al., 1991), including CCG, and are hence members of this class of grammars. (Two grammar formalisms are weakly equivalent if for every grammar in one there is a weakly equivalent grammar in the other, and vice versa.) Other grammar formalisms go beyond TAG and CCG in the types of languages that they can describe, but remain mildly context-sensitive. We will briefly describe two such formalisms: multicomponent TAG (MCTAG) (Joshi et al., 1975) and Minimalist Grammars (Stabler, 1997). First, however, we will present some details that better define this broad class of grammars.

Recall that derivations in TAG result in both a derived tree, which is the familiar syntactic structure, and a derivation tree (see Figure 9.5), which is a tree showing how elementary trees were substituted or adjoined into other trees. This derivation tree can be described using Linear Context-free Rewriting Systems (LCFRS) (Vijay-Shanker et al., 1987), which are related to Generalized Context-free Grammars (GCFG) (Pollard, 1984). A GCFG G = (V, S†, F, P) consists of a set of variables V; a special start variable S†; a finite set of function symbols F; and a set of rule productions P of the form A → f(α) for A ∈ V, f ∈ F, and α ∈ V*. The functions in F define how to combine the variables (strings or trees) on the right-hand side of each production. LCFRS follows this general approach, while restricting the kinds of functions that can be used – disallowing functions that involve unbounded copying or erasing of structure. For example, local trees in the TAG derivation tree in Figure 9.5 can be considered as rule productions in an LCFRS, where the functions in F allow for substitution and adjunction of the variables (trees). LCFRS can be parsed in polynomial time (Vijay-Shanker et al., 1987), hence formalisms which produce derivation trees that can be described with an LCFRS are within the set of mildly context-sensitive grammars. We now turn to two examples of this class of grammars.
MCTAG extends standard TAG by allowing sets of auxiliary trees to be defined as auxiliary sets, within which each tree can simultaneously adjoin to distinct nodes in an elementary tree. If auxiliary sets are allowed to adjoin to other auxiliary sets in such a way that each auxiliary tree in set 1 can adjoin to a different component of set 2, then the generative capacity goes beyond standard TAG, that is, languages can be described with MCTAG that cannot be described with TAG. Weir (1988) showed that MCTAG is weakly equivalent to LCFRS.

Minimalist Grammars are a framework intended to capture certain ideas from Minimalist transformational syntax (Chomsky, 1995); in particular, feature-driven structure building and movement. Following Stabler (1997), a Minimalist Grammar G = (V, T, L, F) consists of a set of syntactic features V; a set of non-syntactic (phonetic/semantic) features T; a finite set of trees L built from V and T; and a set of two functions on trees F = {merge, move}. The syntactic features consist of base categories along with selectional and licensing features. The merge function combines two trees, and is constrained by selectional features of heads. The move function operates on a single tree and is constrained by licensing features. Stabler (1997) showed that Minimalist Grammars can describe a language with five counting dependencies (aⁿbⁿcⁿdⁿeⁿ), which TAGs cannot describe. Michaelis (2001a) showed that Minimalist Grammars are a subclass of LCFRS, and Michaelis (2001b) showed that LCFRSs are a subclass of Minimalist Grammars; hence they are in fact weakly equivalent.

9.2.4 Finite-state and Context-free Approximations

One way to make a context-free approximation to a Combinatory Categorial Grammar (CCG) is to create a treebank of CCG derivation trees, such as those shown in Figures 9.10 and 9.11, and then induce a PCFG from this treebank using standard techniques. By limiting the possible combinatory rules to those actually observed in the treebank, this approach is context-free. One ends up with productions of the sort S\NP → (S\NP)/(S\NP) S\NP. The creation of CCGbank (Hockenmaier and Steedman, 2005), which involved the semiautomatic conversion of the Penn WSJ Treebank into CCG derivation trees, allowed for such an approach to be followed. In Hockenmaier and Steedman (2002), CCGbank was used to induce a lexicalized PCFG, similar (except for the node labels, head percolation, and shapes of
trees) to the lexicalized PCFGs that were presented in the last chapter, such as those of Collins and Charniak. Clark and Curran (2004), in addition to using normal form constraints of the sort discussed in Section 9.2.2, use CCGbank to restrict the combination of two children to combinations observed in the treebank, which makes the approach context-free. We should note that, in fact, the Penn WSJ Treebank was not annotated in a purely context-free fashion, since it included empty categories and co-indexation, which are the transformational grammar approach to capturing non-local dependencies. Recall from Section 8.1.1 that these annotations were stripped out prior to grammar induction by all of the lexicalized PCFG approaches discussed. Hence those approaches as well can be considered context-free approximations to the grammar framework within which the trees were annotated.

In addition to this context-free approximation, the CCG parser in Clark and Curran (2004) makes use of a finite-state approximation in an initial pass, commonly known as supertagging (Joshi and Srinivas, 1994; Srinivas and Joshi, 1999). Consider the sequence of TAG elementary trees (α1–α4, β1) in Figure 9.4, and the sequence of CCG categories in Figure 9.9. One perspective on this is to treat them as composite POS-tags (supertags), and apply a POS-tagger to try to pick the best tag sequences for a given word sequence. This sort of finite-state disambiguation is used for some context-free parsers mentioned in the previous chapter, in particular the POS-tagging and NP-chunking that is done in the Ratnaparkhi parser (Ratnaparkhi, 1997). In the case of lexicalized grammar formalisms like TAGs and CCGs, however, the potential gain can be much larger, because the size of the tagset is orders of magnitude larger than that of the simple POS-tags that one finds in, say, the Penn Treebank. Consider, for example, the number of distinct supertags that may label a verb with many subcategorization alternations. The supertagging approach was originally investigated within the context of Tree-adjoining Grammars (Joshi and Srinivas, 1994), but the general approach has been used for other formalisms like CCG (Clark, 2002; Clark and Curran, 2004) and Constraint Dependency Grammars (Wang and Harper, 2002; Wang et al., 2003). Like standard POS-tagging models, these can either be used to find the highest-scoring tag sequences, as is done in Clark and Curran (2004) and XTAG Research Group (2001), as a pre-processing step to reduce the ambiguity for parsing; or they can be used as class-based language models, as in Srinivas (1996) and Wang et al. (2003).
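As a rough illustration of how supertagging reduces lexical ambiguity before parsing, the following sketch keeps, for each word, only those candidate supertags whose score is within a beam of the best one. The scores, the beam width, and the category inventory here are purely hypothetical, and this is not the actual tagging model of any of the systems cited above:

```python
def prune_supertags(scored_tags, beam=0.1):
    """scored_tags: one dict per word mapping candidate supertags to scores.
    Keep, for each word, the tags scoring at least `beam` times the best tag."""
    pruned = []
    for tag_scores in scored_tags:
        best = max(tag_scores.values())
        pruned.append({t: s for t, s in tag_scores.items() if s >= beam * best})
    return pruned

# Hypothetical scores for the single word "throw": the transitive category wins,
# the ditransitive category survives the beam, and an implausible category is dropped.
scores = [{'(S\\NP)/NP': 0.70, '((S\\NP)/NP)/NP': 0.20, 'NP': 0.01}]
print(prune_supertags(scores, beam=0.1))
```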
9.3 Parse Selection

We have been discussing context-sensitive approaches to syntax, but we have yet to do more than mention the use of weights or scores for selecting the best parse from among a set of parses. The reason for this is that the context-sensitivity makes defining a probability distribution trickier. To make this clear, we will begin by discussing methods for defining a distribution for unification grammars, which will then be shown to be useful for re-ranking the output of context-free parsers.

9.3.1 Stochastic Unification Grammars

Early attempts to put a probability distribution on unification grammars, such as Brew (1995), defined these distributions based on probabilities assigned to rule productions, which were estimated using relative frequency estimation. The problem with this approach, pointed out in Abney (1997), is that, when constraints apply to capture non-local dependencies, the probability associated with any structures violating constraints is lost. In other words, suppose a tree has probability 0.1 according to the stochastic model, but is ruled out by a non-local constraint. Then the probability of the remainder of the trees can only add up to 0.9, not 1.0 as in a probability model. This is not simply a problem of normalization; that is, one cannot just divide by 0.9 when using the model to fix this problem. In fact, as a result of the lost probability, relative frequency estimation is no longer the maximum likelihood estimator for these models, so simply counting local events from a corpus and normalizing is not (even in the limit) the most effective way to estimate the parameters of a probabilistic model. The problem is that rule production probabilities are multiplied to give total parse probabilities only when those probabilities define a branching process in such a way that the probability of one branch is independent of what happens in the other branch. In a unification grammar, the result of one branch depends on the result of another – that is the definition of context sensitivity – and hence it is not a true independent branching process at all. To make this explicit, consider the two valid parse trees in Figure 9.12, and suppose they made up our training corpus. If we define rule production probabilities based on the relative frequency observed in these trees, half the probability would belong to trees with yields "the players wants to play" or "the player want to play."
Figure 9.12 Two valid parse trees: [S [NP the player] [VP wants to play]] and [S [NP the players] [VP want to play]]
If subject/verb number agreement is a constraint that is imposed upon the trees, then half the probability belongs to trees that are ruled out by constraints; that is, this does not define a probability distribution over the trees of the language, since the sum of the probabilities of all valid trees is 0.5, not 1. One might think to divide by 0.5 to renormalize, and that will give a probability distribution; however, this will not, generally, be the maximum likelihood estimate. See Abney (1997) for examples where it is not.

One solution to this is to include any non-local constraints directly on constituent node labels, that is, compiling the unification grammar into a context-free grammar. Then, of course, relative frequency estimation is maximum likelihood estimation, and everything is proper. We have discussed, however, the problems with such an approach: first, the unification grammar may not be context-free equivalent; and second, the size of the resulting non-terminal set, even if it is context-free equivalent, may render parsing infeasible.

A second option is to define the probability distribution using Random Fields, which allow, in principle, for arbitrary global features over every possible syntactic analysis. Rather than defining the model as parameterizing a branching process, instead the model is over the space of possible complete analyses. The approach is general enough that the analyses can be context-free tree structures, or the directed acyclic graphs used in unification grammars to encode re-entrant dependencies. The log-linear models discussed way back in Section 6.3.5 are along the same lines, but what is different in this case is that, for finite-state models like POS-taggers, there is a Markov assumption in the model, hence the name Markov Random Fields. In the current case, there is no Markov assumption, since the dependencies that are modeled in a unification grammar may be arbitrarily far apart in the string. Nevertheless, the formulation of a feature vector and a parameter vector, and all of the normalization issues (joint versus conditional) are very much the same as presented in Section 6.3.5.
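A worked version of the Figure 9.12 calculation makes the lost probability mass explicit; this is only a sketch of the arithmetic in the example above, with the relative-frequency rule probabilities written in directly:

```python
# Rule probabilities estimated by relative frequency from the two training trees:
# P(S -> NP VP) = 1.0, and each NP and VP expansion has probability 0.5.
p_np = {'the player': 0.5, 'the players': 0.5}
p_vp = {'wants to play': 0.5, 'want to play': 0.5}

valid_mass = 0.0
for np, p1 in p_np.items():
    for vp, p2 in p_vp.items():
        agrees = (np == 'the player') == (vp == 'wants to play')
        if agrees:                       # only agreement-respecting trees survive
            valid_mass += 1.0 * p1 * p2  # the leading 1.0 is P(S -> NP VP)

print(valid_mass)  # 0.5: half the mass sits on trees ruled out by the constraint
```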
As mentioned in Section 9.1, stochastic disambiguation has been extensively studied within the approach taken at PARC (M. Johnson et al., 1999; Riezler et al., 2000, 2002). In these papers, they use a conditional normalization (see Section 6.3.5 for a discussion of the differences between joint and conditional normalization) – which they note is a special case of "pseudolikelihood" estimation, following Besag (1974) – as a tractable alternative to the expensive estimation techniques proposed in Abney (1997). The very general nature of these techniques makes them applicable to all of the approaches we have discussed in this book. Of course, there is still the issue of efficiency in processing, so that either features must be defined so as to allow efficient dynamic programming (Geman and Johnson, 2002) or relatively few candidates must be evaluated, in other words, n-best re-ranking, as was done in M. Johnson et al. (1999), or heuristic pruning as in Collins and Roark (2004).

These general techniques provide the means for estimating parameter weights for arbitrary sets of features. Which features to use is up to the person building the model, and defining a feature set for use in such models is similar in many respects to defining a grammar. For stochastic unification grammars, obvious choices are the features that are already defined as part of the unification grammar. Other less linguistically interesting features may, however, buy disambiguation, hence building these models often involves an empirical investigation of the utility of features on some development corpus. In the next section, we will present an approach that suggests using all possible tree fragments as features, which introduces some serious efficiency issues. This will be followed by a discussion of work on re-ranking the output of context-free parsers.

9.3.2 Data-oriented Parsing

Data-oriented parsing (DOP) was popularized by Rens Bod in the 1990s (Bod, 1993a,b, 1998), and has been an interesting and controversial approach that pushes the limits of what can tractably be considered a feature in stochastic models of syntax. DOP is a type of tree-substitution grammar, sort of like Tree-adjoining Grammars without adjunction. However, unlike TAGs, which typically define a linguistically motivated set of elementary trees, in DOP every tree fragment in an observed corpus of trees is taken as a possible elementary tree.
Figure 9.13 Six DOP fragments out of a possible sixteen for this parse of the string "the player plays."
A tree fragment in DOP is a sub-tree that: (1) is fully connected; (2) includes more than one node; and (3) includes all siblings of every included node. Figure 9.13 shows six out of a possible sixteen fragments of a parse tree for the string "the player plays." We leave it as an exercise for the reader to find the other ten possible fragments. The number of such tree fragments grows exponentially with the size of the tree, so that over even a relatively small corpus, the possible number of distinct features is too large for direct use. As a result, for parsing experiments, Bod (1993b) used Monte Carlo sampling techniques to approximate the model. The probabilistic model used in the original versions of DOP was based on very simple relative frequency estimation, as follows. For a given fragment f, let R(f) be the root symbol of f, and let c(f) be the count of f in the corpus. Then

P(f) = c(f) / Σ_{f′ : R(f′) = R(f)} c(f′)        (9.5)
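A small sketch of equation (9.5): given counts for a set of fragments, each fragment's probability is its count divided by the total count of fragments sharing its root symbol. The tuple encoding of fragments and the toy counts below are our own assumptions for illustration, not drawn from any corpus or from the DOP literature:

```python
from collections import Counter, defaultdict

def fragment_probabilities(fragment_counts):
    """Relative frequency over fragments sharing a root symbol, as in (9.5).
    Fragments are nested tuples whose first element is the root label R(f)."""
    totals = defaultdict(int)
    for frag, count in fragment_counts.items():
        totals[frag[0]] += count
    return {frag: count / totals[frag[0]] for frag, count in fragment_counts.items()}

# Toy counts: two NP-rooted fragments and one S-rooted fragment.
counts = Counter({
    ('NP', ('Det', 'the'), 'N'): 3,
    ('NP', 'Det', 'N'): 1,
    ('S', 'NP', 'VP'): 2,
})
probs = fragment_probabilities(counts)  # NP fragments get 0.75 and 0.25; the S fragment gets 1.0
```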
A derivation involves substituting fragments for non-terminals when the root of the fragment matches the non-terminal, just as with TAG substitution. The probability of a derivation is taken as the product of the probabilities of all fragments used in the derivation. Note that for any given derived tree, there will be many derivations for that tree, similar to the spurious ambiguities in CCG. Worse, M. Johnson (2002a) showed that this estimation method is both biased and inconsistent, that is, as the sample size increases to the limit, the estimated model will not converge to the true distribution that generated the data. This demonstrates a real problem with the estimation method, motivating other ways of estimating the parameters – similar to the problems found with early attempts to make unification grammars stochastic. Goodman (1996a) showed that it is possible to relatively efficiently compile the DOP model into a PCFG, by assigning every node in every tree of the training corpus a unique index. Considering just trees in Chomsky Normal Form, this would be one index per POS-tag and one for each binary node, or roughly 2n indices, where n is the number of
words in the training corpus. For every local tree Ai → Bj Ck, where i, j, k are the unique indices, eight rules are introduced:

Ai → B C      A → B C
Ai → Bj C     A → Bj C
Ai → B Ck     A → B Ck
Ai → Bj Ck    A → Bj Ck        (9.6)
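The construction of these indexed rules is mechanical, as the following sketch suggests; it generates the eight rule skeletons of (9.6) for one indexed local tree, leaving out the rule weights, which are defined in Goodman's paper:

```python
def indexed_rules(a, b, c, i, j, k):
    """The eight rule skeletons of (9.6) for the indexed local tree A_i -> B_j C_k.
    A symbol is a (label, index) pair; index None denotes the unindexed symbol."""
    rules = []
    for lhs in ((a, i), (a, None)):
        for left in ((b, j), (b, None)):
            for right in ((c, k), (c, None)):
                rules.append((lhs, left, right))
    return rules  # 2 x 2 x 2 = 8 rules; weights are omitted here

print(len(indexed_rules('A', 'B', 'C', 1, 2, 3)))  # 8
```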
With these eight rules, and appropriately defined weights (see the paper for details), Goodman showed that such a PCFG would give the same distribution over trees as DOP. Since there are eight rules for n binary local trees, the size of this grammar is linear in the size of the training corpus, rather than exponential. Unfortunately, Goodman (1996a) also reported that the results from Bod (1993b) were not replicable, leaving in doubt the true utility of the approach. Given the enormous size of these grammars, the problems with the estimation procedures, and the difficulty replicating results, the approach has remained a controversial one, despite further parsing improvements (Bod, 2001). That said, there remains considerable interest in the idea of exploiting an arbitrarily large feature set for stochastic disambiguation, which may include features that have no strong linguistic motivation other than that they capture real dependencies in the distribution of the language. In the next section, we will examine some recent work in re-ranking the output of context-free parsers, some of which will make use of features that are motivated from some non-context-free approaches.

9.3.3 Context-free Parser Re-ranking

In Ratnaparkhi (1997), an efficient, competitively accurate MaxEnt parser was presented. Along with standard evaluation of the accuracy of the best-scoring parse, the evaluation included a report of the oracle accuracy of a twenty-best list of parses output from the parser, that is, the score that would be achieved if the most accurate of the twenty parses were selected. While the one-best accuracy of the parser was just less than 87 percent F-measure, the oracle accuracy of the twenty-best list was around 93 percent F-measure. It was suggested that a richer model could then be used to select a better parse from among this list. Collins (2000) took this approach with the output of his well-known parser (Collins, 1997), using a large number of features with two different parameter estimation approaches, one based on boosting (Freund et al., 1998), the other on random fields. Features included things like
counts of: context-free rules in the trees; sequences of siblings; parent-annotated rules; head-modifier pairs; and lexical-head trigrams that include heads of arguments of prepositional phrases, among many others. The results were very encouraging, with the boosting approach achieving a 1.5 percent improvement over the baseline model (Collins, 1997, 1999). Further experiments along these lines were presented in Collins and Koo (2005).

Collins and Duffy (2002) used the Perceptron algorithm (see Section 6.3.5 for a discussion of this algorithm) for re-ranking, with a novel implicit feature representation that efficiently encodes all subtrees as features, as in the DOP model. They define a tree kernel by which the similarity of two trees can be compared with respect to the features in the model. Consider again the DOP formulation, where every tree fragment in a corpus is a feature in a very large vector of possible features. For any given tree, the count of most of these will be zero, but some subset will have non-zero counts. Given two trees and their associated count vectors for the tree fragments, we can take the dot product of the two vectors as their similarity. That is, let t1 and t2 be two trees, and Φ(t) be the d-dimensional vector of tree fragment counts. Then

Φ(t1) · Φ(t2) = Σ_{i=1}^{d} Φi(t1) Φi(t2)        (9.7)
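Computed naively over explicit fragment-count vectors, the kernel of equation (9.7) is just a sparse dot product, as in the sketch below; the point of Collins and Duffy's dynamic programming algorithm, discussed next, is to compute the same quantity without ever enumerating the fragments. The toy fragment encoding here is our own:

```python
from collections import Counter

def tree_kernel(frag_counts1, frag_counts2):
    """Sparse dot product of two fragment-count vectors, as in (9.7).
    Only fragments occurring in both trees contribute to the sum."""
    if len(frag_counts1) > len(frag_counts2):
        frag_counts1, frag_counts2 = frag_counts2, frag_counts1
    return sum(c * frag_counts2[f] for f, c in frag_counts1.items() if f in frag_counts2)

t1 = Counter({('NP', 'Det', 'N'): 2, ('S', 'NP', 'VP'): 1})
t2 = Counter({('NP', 'Det', 'N'): 1, ('VP', 'V', 'NP'): 1})
print(tree_kernel(t1, t2))  # 2: only the shared NP fragment contributes (2 * 1)
```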
Note that, if the two trees share no subtrees, then this dot product will be zero. Of course, while we present this as a d dimensional sum, in fact one really only needs to consider those subtrees that are in both t1 and t2 – all others will contribute zero to the sum. In addition, Collins and Duffy (2002) present a dynamic programming algorithm that allows this sum to be calculated in time polynomial in the size of the trees, rather than exponential, making this a tractable way to take into account (implicitly) all subtrees in a tree. A dual form of the Perceptron algorithm is then presented, in which weights are assigned to candidate trees in training examples based on these inner products. Using this kernel, they achieve about a 0.5 percent F-measure accuracy improvement over the baseline (Collins, 1997, 1999). Charniak and Johnson (2005) present a MaxEnt re-ranking approach using a conditional likelihood objective with a large number of novel features, and achieve a 1.0 percent improvement over the baseline parser, in this case (Charniak, 2000). They used the efficient n-best
parsing algorithm from Huang and Chiang (2005) to produce the fifty-best lists for re-ranking. Among the features that they used were some focused on parallel structures in coordination, and explicit modeling of the length of constituents. This remains a fruitful and interesting direction of research that will likely produce further improvements in parsing accuracy. In particular, the space of possible features for such an approach is enormous, so that linguistically well-informed approaches to feature exploration are invaluable. What currently goes under the label of "feature engineering" is very like grammar engineering, in that it starts with linguistic insight and generalization. In many ways, going from grammar engineering to feature engineering in NLP is like the shift from rule-based to constraint-based generalization in linguistic theory.

9.4 Transduction Grammars

The final topic that we will address in this chapter is that of transduction grammars (Lewis and Stearns, 1968), which allow for the parsing of two (or more) related strings simultaneously. In the current literature, these are sometimes known as transduction grammars (Lewis and Stearns, 1968; Wu, 1997), multitext grammars (Melamed, 2003; Melamed et al., 2004), or synchronous grammars (Shieber and Schabes, 1990), and their primary utility is currently in the area of machine translation, as a way of modeling alignments between sentences in two (or more) languages. The difference between transduction grammars and finite-state transducers is that transduction grammars allow for re-ordering of output symbols. A "simple" transduction grammar (Lewis and Stearns, 1968) is one in which only terminals are re-ordered on the output, not non-terminals. To adopt the notation from Lewis and Stearns (1968), each production contains two right-hand sides (RHS), with the output RHS delimited with curly braces. For example, consider the following three transduction grammar productions for Spanish/English:
1. SBAR → that S {que S}  (No re-ordering)
2. N → blue N {N azul}  (Terminal re-ordering)
3. S → NP VP {VP NP}  (Non-terminal re-ordering)
The first production shows a transduction grammar production with no re-ordering of the symbols in the output RHS. Production 2 shows the kind of re-ordering that is allowed within “simple” transduction
grammars, that is, there is no re-ordering of non-terminals on the RHS. The final production does allow re-ordering of non-terminals on the RHS. Note that the NP and VP inside of the curly braces are the same as the ones outside of the curly braces, that is, their subsequent expansions are via a single transduction grammar rule expansion. If there is more than one category with the same label, indices are sometimes used to indicate which category on the output side goes with which one on the input side.

Even simple transduction grammars can be beyond context-free. For example, Lewis and Stearns (1968) showed that the following simple transduction grammar, to convert from infix to Polish arithmetic notation, cannot be processed with a pushdown automaton:

S → ( S1 + S2 )  {+ S1 S2}
S → ( S1 ∗ S2 )  {∗ S1 S2}
S → 1  {1}
S → 2  {2}        (9.8)
Note that the subscript indices of the categories on the right-hand side are simply used to indicate that the ordering of the S constituents does not change on the output.

Inversion transduction grammars (ITGs) (Wu, 1997) extend simple transduction grammars by allowing some non-terminal re-ordering. In particular, ITGs can have output non-terminals in either the same order as the input non-terminals, or in the reverse order, but no other permutations are allowed. This allows for a slight simplification of the notation, since all that must be indicated is whether the output has the same ordering or the reverse. For example, in Table 9.2, the six possible transduction grammar productions associated with a ternary context-free production are shown, with the corresponding ITG notation. For the two permutations that are allowed in ITG, the one where the output non-terminals are in the same order as the input non-terminals is denoted via square brackets around the right-hand side; the reverse order is denoted with angle brackets. It is shown in Wu (1997) that there exists an ITG normal form, similar to Chomsky Normal Form, where all productions are either binary to non-terminals or unary to terminals, and where either the input or output terminal items can be the empty string. With this normal form, there is a dynamic programming chart-parsing algorithm, similar to the CYK algorithm, for parsing two strings simultaneously with an ITG (Wu, 1997).
Table 9.2 Transduction grammar notation from Lewis and Stearns (1968) and corresponding Inversion Transduction Grammar notation from Wu (1997) for a ternary production

Original notation (Lewis and Stearns, 1968)    ITG notation (Wu, 1997)
VP → V NP PP {V NP PP}                         VP → [V NP PP]
VP → V NP PP {V PP NP}                         Permutation not allowed in ITG
VP → V NP PP {PP NP V}                         VP → ⟨V NP PP⟩
VP → V NP PP {PP V NP}                         Permutation not allowed in ITG
VP → V NP PP {NP PP V}                         Permutation not allowed in ITG
VP → V NP PP {NP V PP}                         Permutation not allowed in ITG
For a grammar with k non-terminals, an input string of length n, and an output string of length m, this algorithm has worst-case complexity O(k³n³m³), hence polynomial in the length of the strings. In machine translation, one common use of such models is for inducing word alignments over parallel corpora. For such a purpose, one can ignore the labels on brackets, and let the bracketing of the ITG define the alignment. This special case of ITG, known as Bracketing ITG (BITG), loses the grammar constant in the parsing complexity by virtue of having a single non-terminal.

The restriction on the output RHS ordering means that ITGs cannot capture certain kinds of crossing alignments, in particular those that can be termed inside-out alignments. For example, in aligning two four-word strings, if the inside two words of the input map to the outside two words of the output and vice versa, and, further, the alignments of the two inside words or the two outside words cross, then the alignment cannot be described by an ITG. Such crossing alignments can be difficult to find in natural language, and seem to be generally quite rare outside of free word-order languages (Wu, 1997). Nevertheless, they do occur, and this is a limitation of ITGs for finding word alignments in parallel strings. Consider the alignment between English and the artificial Yodaish in Figure 9.14. The outer words in the input ("Luke" and "complained") map to the inner words in the output and cross. To see why this causes a problem for ITGs, consider the chart next to the alignment. We can index each cell by its row and column number, which correspond to the word index of the output and input. For example, the shaded cell corresponding to "Luke" is at (3, 1). Shading of a region indicates an alignment between the input and the output. Multi-cell rectangular regions can be denoted by pairs of indices, representing the upper left and lower right corner cells.
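One way to see the inside-out restriction computationally is to check whether a one-to-one alignment, viewed as a permutation, can be built up by nested binary straight or inverted combinations. The recursive test below is a sketch of that idea (it assumes a permutation alignment and makes no claim to being Wu's algorithm); it rejects the Figure 9.14 alignment while accepting monotone ones:

```python
def itg_alignable(perm):
    """True if the permutation (input word i aligned to output position perm[i])
    can be built by nested binary straight or inverted combinations."""
    def contiguous(block):
        return max(block) - min(block) + 1 == len(block)

    def rank(block):
        order = sorted(block)
        return [order.index(x) for x in block]

    def check(p):
        if len(p) <= 1:
            return True
        for split in range(1, len(p)):
            left, right = p[:split], p[split:]
            if (contiguous(left) and contiguous(right)
                    and check(rank(left)) and check(rank(right))):
                return True
        return False

    return check(list(perm))

print(itg_alignable([0, 1, 2, 3]))  # True: a monotone alignment
print(itg_alignable([2, 0, 3, 1]))  # False: the inside-out alignment of Figure 9.14
```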
Figure 9.14 Alignment of English "Luke often has complained" to Yodaish "often complained Luke has" that cannot be captured using ITG
An ITG parsing algorithm will combine diagonally adjacent cells into a rectangular block, representing the alignment of constituents. Let [(i, j), (k, l)] represent a rectangular block in the chart, with cell (i, j) in the lower left corner, and cell (k, l) in the upper right corner. If we have an alignment with [(i, j), (k, l)] and [(k−1, l+1), (m, n)], then these can combine (if allowed by the grammar) into a constituent spanning [(i, j), (m, n)]. Note, however, in the chart in Figure 9.14 that no two shaded cells are diagonally adjacent, meaning that no binary combination into larger constituents can be accomplished.

Relaxing the constraints on the output RHS, including bilexical dependencies, and allowing for discontinuous constituents are the main differences between ITGs and the multitext grammars in Melamed (2003) and Melamed et al. (2004). Synchronous Tree-adjoining Grammars (Shieber and Schabes, 1990) allow for input/output correspondences between nodes in elementary and derived trees. For transduction grammars based on these sorts of mildly context-sensitive approaches, however, there is an additional complexity burden for parsing and alignment algorithms.

One approach that does not rely upon constituents is the dependency tree transduction grammar of Alshawi et al. (2000). In this approach, input and output local dependency trees are generated from the head out, via a head transducer. Each input/output pair of dependents is generated with a pair of directions, where a direction is a signed integer: −k means k words before the head in the linear string, and +k means k words after the head. Note that an individual transducer handles just local dependency trees, that is, a head and its children in the tree. Each of the children may in turn be the head of another head transducer. Let us work through the example string above using a head transducer. The dependency tree for both the English and Yodaish is shown in Figure 9.15, above the head transducer for this dependency tree.
Figure 9.15 Dependency tree and head transducer for example in Figure 9.14 (head "complained" with dependents "Luke," "often," and "has"; transitions complained:complained 0:0, has:has −1:+2, often:often −1:−1, Luke:Luke −1:+1)
We read the transducer as follows. The English word is on the input, and the Yodaish word is on the output (in this particular example, the words are the same; for machine translation in general, the words will differ). Below the transition, there is the word position information, relative to the head word; for example, "has" is −1 (to the left, one place) on the English (input) side, and +2 (to the right, two places) on the Yodaish side. Since this tree consists of just one head for both languages, a single head transducer suffices to encode the dependencies. There are two things to note about this head transducer. First, because it does not build constituents, but rather aligns heads and dependents on both sides, it can handle the crossing alignments that make this an example outside of what ITGs can handle. Second, all three of the dependents on the input (English) side of the transduction have −1 as the position. If a specified position is already filled, by convention the dependent is moved to the next available slot away from the head. Hence, "has" fills the −1 slot, so "often" cannot take that slot and moves to −2. "Luke" cannot take the −1 slot either, and tries to move to −2, but that is also taken, so it moves to −3. Hence, ordering can be specified in this approach either by explicitly indicating the position, or by ordering the dependents to resolve collisions in the right way. While every individual head transducer consists of a finite number of states, this is not a finite-state approach, but rather a context-sensitive one. This is demonstrated in Alshawi et al. (2000) by showing that the head-transducer approach can be used to convert from infix to Polish notation, which was mentioned earlier as something that cannot be done with a pushdown automaton.
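The collision-resolution convention just described is easy to state procedurally; the following sketch (our own illustration, not Alshawi et al.'s implementation) linearizes a head and its dependents from their signed offsets and reproduces both word orders of the example:

```python
def linearize(head, dependents):
    """Place each dependent at its requested signed offset from the head (slot 0),
    moving further away from the head whenever the requested slot is taken."""
    slots = {0: head}
    for word, offset in dependents:
        step = -1 if offset < 0 else 1
        position = offset
        while position in slots:
            position += step
        slots[position] = word
    return [slots[p] for p in sorted(slots)]

# English side of Figure 9.15: all three dependents request position -1.
print(linearize('complained', [('has', -1), ('often', -1), ('Luke', -1)]))
# ['Luke', 'often', 'has', 'complained']

# Yodaish side: has -> +2, often -> -1, Luke -> +1.
print(linearize('complained', [('has', 2), ('often', -1), ('Luke', 1)]))
# ['often', 'complained', 'Luke', 'has']
```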
9.5 Summary

In this chapter, we have shown a range of approaches that go beyond context-free to describe more complicated syntactic phenomena, including unification grammars, lexicalized grammar formalisms, and transduction grammars. We discussed stochastic models for these
approaches, and the use of re-ranking as a post-process for context-free (or finite-state) approaches. Going beyond context-free models comes at an additional efficiency cost, but many rich linguistic dependencies are difficult to model with context-free grammars. Some of the most interesting approaches use models of varying complexity at various stages, to try to get rich and accurate annotations at bargain basement efficiencies. It is clear that more research must be done to understand the most effective trade-offs between these levels of the Chomsky hierarchy. While computational approaches to syntax have turned increasingly to statistical methods based on large annotated corpora, this does not mean that there is less need for the kind of linguistic knowledge that has driven grammar engineering over the years. More than ever, the formal mechanisms exist to apply rich constraints for syntactic processing and disambiguation; the search for such constraints (or features) will continue to be largely driven by linguistic knowledge. There is much potential in approaches that combine the flexibility of some of the stochastic approaches we have discussed with linguistically informed feature exploration.
References Abe, M., Ooshima, Y., Yuura, K., and Takeichi, N., 1986. “A Kana-Kanji translation system for non-segmented input sentences based on syntactic and semantic analysis.” In: Proceedings of the 11th International Conference on Computational Linguistics (COLING), pp. 280–285. Abney, S., 1996. “Partial parsing via finite-state cascades.” Natural Language Engineering 2 (4), 337–344. 1997. “Stochastic attribute-value grammars.” Computational Linguistics 23 (4), 597–617. Aho, A. V., Sethi, R., and Ullman, J. D., 1986. Compilers, principles, techniques, and tools. Addison-Wesley, Reading, MA. Ajdukiewicz, K., 1935. “Die syntaktische konnexität.” In: McCall, S. (Ed.), Polish Logic 1920–1939. Oxford University Press, Oxford, pp. 207–231. Allauzen, C., Mohri, M., and Roark, B., 2003. “Generalized algorithms for constructing language models.” In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), pp. 40–47. Allen, J., Hunnicutt, M. S., and Klatt, D., 1987. From Text to Speech: the MITalk System. Cambridge University Press, Cambridge, UK. Alshawi, H., Srinivas, B., and Douglas, S., 2000. “Learning dependency translation models as collections of finite-state head transducers.” Computational Linguistics 26 (1), 45–60. Anderson, S., 1992. A-Morphous Morphology. Cambridge University Press, Cambridge, UK. Andron, D., 1962. “Analyse morphologique du substantif russe.” Tech. rep., Centre d’Etudes pour la Traduction Automatique, Université de Grenoble 1. Andry, F., Fraser, N., Thornton, S., and Youd, N., 1993. “Making DATR work for speech: lexicon compilation in SUNDIAL.” Computational Linguistics 18 (3), 245–267. Antworth, E., 1990. PC-KIMMO: A Two-Level Processor for Morphological Analysis. Occasional Publications in Academic Computing, 16. Summer Institute of Linguistics, Dallas, TX. Archangeli, D., 1984. “Underspecification in Yawelmani phonology and morphology.” Ph.D. thesis, MIT. Aronoff, M., 1976. Word Formation in Generative Grammar. MIT Press, Cambridge, MA. 1994. Morphology by Itself: Stems and Inflectional Classes. No. 22 in Linguistic Inquiry Monographs. MIT Press, Cambridge, MA.
Aronson, H., 1990. Georgian: A Reading Grammar. Slavica, Columbus, OH. Attar, R., Choueka, Y., Dershowitz, N., and Fraenkel, A., 1978. “KEDMA— linguistic tools for retrieval systems.” Journal of the ACM 25, 55–66. Baayen, R. H., and Moscoso del Prado Martín, F., 2005. “Semantic density and past-tense formation in three Germanic languages.” Language 81 (3), 666– 698. Piepenbrock, R., and van Rijn, H., 1996. The CELEX Lexical Database (CD-rom). Linguistic Data Consortium. Baker, J., 1979. “Trainable grammars for speech recognition.” In: Speech Communication papers for the 97th Meeting of the Acoustical Society of America, pp. 547–550. Baker, M., 1985. “The mirror principle and morphosyntactic explanation.” Linguistic Inquiry 16 (3), 373–416. Bar-Hillel, Y., 1953. “A quasi-arithmetical notation for syntactic description.” Language 29, 47–58. Barg, P., 1994. “Automatic acquisition of DATR theories from observations.” Tech. rep., Heinrich-Heine Universität, Düsseldorf, theories des Lexicons: Arbeiten des Sonderforschungsbereichs 282. Baroni, M., Matiasek, J., and Trost, H., 2002. “Unsupervised discovery of morphologically related words based on orthographic and semantic similarity.” In: Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, pp. 48–57. Bat-El, O., 2001. “In search for the roots of the C-root: The essence of Semitic morphology.” Paper presented at the Workshop on Roots and Template Morphology, Los Angeles, USC, handout available at: . Bear, J., 1986. “A morphological recognizer with syntactic and phonological rules.” In: Proceedings of the 11th International Conference on Computational Linguistics (COLING), pp. 272–276. Beard, R., 1995. Lexeme-Morpheme Base Morphology. SUNY, Albany, NY. Becker, J., 1984. Multilingual word processing. Scientific American (July 1984), 96–107. Beesley, K., 1989. “Computer analysis of Arabic morphology: a two-level approach with detours.” In: Proceedings of the 3rd Annual Symposium on Arabic Linguistics. University of Utah, pp. 155–172. and Karttunen, L., 2000. “Finite-state non-concatenative morphotactics.” In: Proceedings of the 5th Workshop of the ACL Special Interest Group on Computational Phonology (SIGPHON-2000), pp. 1–12. 2003. Finite State Morphology. CSLI Publications. University of Chicago Press. Bernard-Georges, A., Laurent, G., and Levenbach, D., 1962. “Analyse morphologique du verbe allemand.” Tech. rep., Centre d’Etudes pour la Traduction Automatique, Université de Grenoble 1.
Besag, J., 1974. “Spatial interaction and the statistical analysis of lattice systems (with discussion).” Journal of the Royal Statistical Society, Series D 36, 192– 236. Bikel, D. M., 2004. “Intricacies of Collins’ parsing model.” Computational Linguistics 30 (4), 479–511. Billot, S., and Lang, B., 1989. “The structure of shared parse forests in ambiguous parsing.” In: Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 143–151. Bilmes, J., 1997. “A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models.” Tech. rep., ICSI-TR-97-021, International Computer Science Institute, Berkeley, CA. and Kirchhoff, K., 2003. “Factored language models and generalized parallel backoff.” In: Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), Companion Volume, pp. 4–6. Bird, S., and Ellison, T., 1994. “One-level phonology: Autosegmental representations and rules as finite automata.” Computational Linguistics 20 (1), 55–90. Black, E., Abney, S., Flickenger, D., Gdaniec, C., Grishman, R., Harrison, P., Hindle, D., Ingria, R., Jelinek, F., Klavans, J., Liberman, M., Marcus, M. P., Roukos, S., Santorini, B., and Strzalkowski, T., 1991. “A procedure for quantitatively comparing the syntactic coverage of English grammars.” In: DARPA Speech and Natural Language Workshop, pp. 306–311. Blaheta, D., and Charniak, E., 2000. “Assigning function tags to parsed text.” In: Proceedings of the 1st Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 234–240. Blevins, J., 2003. “Stems and paradigms.” Language 79 (4), 737–767. Bod, R., 1993a. “Data-oriented parsing as a general framework for stochastic language processing.” In: Sikkel, K., and Nijholt, A. (Eds.), Proceedings of Twente Workshop on Language Technology (TWLT6). University of Twente, The Netherlands, pp. 37–46. 1993b. “Using an annotated corpus as a stochastic grammar.” In: Proceedings of the 6th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pp. 37–44. 1998. Beyond Grammar: An Experience-Based Theory of Language. CSLI Publications, Stanford, CA. 2001. “What is the minimal set of fragments that achieves maximal parse accuracy?” In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 66–73. Böhmová, A., Hajiˇc, J., Hajiˇcová, E., and Hladká, B. V., 2002. “The Prague dependency treebank: Three-level annotation scenario.” In: Abeille, A.
(Ed.), Treebanks: Building and Using Syntactically Annotated Corpora. Kluwer Academic Publishers, Dordrecht, pp. 103–127. Boussard, A., and Berthaud, M., 1965. “Présentation de la synthèse morphologique du français.” Tech. rep., Centre d’Etudes pour la Traduction Automatique, Université de Grenoble 1. Brand, I., Klimonow, G., and Nündel, S., 1969. “Lexiko-morphologische Analyse.” In: Nündel, S., Klimonow, G., Starke, I., and Brand, I. (Eds.), Automatische Sprachübersetzung: Russisch-deutsch. AkademieVerlag, Berlin, pp. 22–64. Brew, C., 1995. “Stochastic HPSG.” In: Proceedings of the 7th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pp. 83–89. Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek, F., Lafferty, J. D., Mercer, R. L., and Roossin, P. S., 1990. “A statistical approach to machine translation.” Computational Linguistics 16 (2), 79–85. Della Pietra, V. J., deSouza, P. V., Lai, J. C., Mercer, R. L., 1992. “Classbased n-gram models of natural language.” Computational Linguistics 18 (4), 467–479. Buchholz, S., and Marsi, E., 2006. “CoNLL-X shared task on multilingual dependency parsing.” In: Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL), pp. 149–164. Buckwalter, T., 2002. Buckwalter Arabic morphological analyzer, version 1.0. Linguistic Data Consortium, Catalog # LDC2002L49, ISBN 1-58563-257-0. Butt, M., Dyvik, H., King, T., Masuichi, H., and Rohrer, C., 2002. “The parallel grammar project.” In: Proceedings of the COLING-2002 Workshop on Grammar Engineering and Evaluation, pp. 1–7. Büttel, I., Niedermair, G., Thurmair, G., and Wessel, A., 1986. “MARS: Morphologische Analyse für Retrievalsysteme: Projektbericht.” In: Schwarz, C., and Thurmair, G. (Eds.), Informationslinguistische Texterschliessung. Georg Olms Verlag, Hildesheim, pp. 157–216. Byrd, R., Klavans, J., Aronoff, M., and Anshen, F., 1986. “Computer methods for morphological analysis.” In: Proceedings of the 24th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 120–127. Carroll, G., and Charniak, E., 1992. “Two experiments on learning probabilistic dependency grammars from corpora.” In: Weir, C., Abney, S., Grishman, R., and Weischedel, R. (Eds.), Working Notes of the Workshop on StatisticallyBased NLP Techniques. AAAI Press, Menlo Park, CA, pp. 1–13. Carstairs, A., 1984. “Constraints on allomorphy in inflexion.” Ph.D. thesis, University of London, Indiana University Linguistics Club. Charniak, E., 1997. “Statistical parsing with a context-free grammar and word statistics.” In: Proceedings of the 14th National Conference on Artificial Intelligence, pp. 598–603.
2000. “A maximum-entropy-inspired parser.” In: Proceedings of the 1st Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 132–139. 2001. “Immediate-head parsing for language models.” In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 116–123. and Johnson, M., 2005. “Coarse-to-fine n-best parsing and MaxEnt discriminative reranking.” In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 173–180. Chelba, C., 2000. “Exploiting syntactic structure for natural language modeling.” Ph.D. thesis, The Johns Hopkins University, Baltimore, MD. and Jelinek, F., 1998. “Exploiting syntactic structure for language modeling.” In: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics (ACL) and 17th International Conference on Computational Linguistics (COLING), pp. 225–231. 2000. “Structured language modeling.” Computer Speech and Language 14 (4), 283–332. Chen, S., and Goodman, J., 1998. “An empirical study of smoothing techniques for language modeling.” Tech. rep., TR-10-98, Harvard University. Chomsky, N., 1957. Syntactic Structures. Mouton, The Hague. 1995. The Minimalist Program. MIT Press, Cambridge, MA. and Halle, M., 1968. The Sound Pattern of English. Harper and Row, New York. Choueka, Y., 1983. “Linguistic and word-manipulation in textual information systems.” In: Keren, C., and Perlmutter, L. (Eds.), Information, Documentation and Libraries. Elsevier, New York, pp. 405–417. 1990. “Responsa: An operational full-text retrieval system with linguistic components for large corpora.” In: Zampolli, A. (Ed.), Computational Lexicology and Lexicography. Giardini, Pisa, p. 150. Church, K., 1986. “Morphological decomposition and stress assignment for speech synthesis.” In: Proceedings of the 24th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 156–164. Clark, S., 2002. “Supertagging for combinatory categorial grammar.” In: Proceedings of the 6th International Workshop on Tree Adjoining Grammars and Related Frameworks (TAG+6), pp. 19–24. and Curran, J. R., 2004. “Parsing the WSJ using CCG and log-linear models.” In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 104–111. Hockenmaier, J., and Steedman, M., 2002. “Building deep dependency structures with a wide-coverage CCG parser.” In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 327–334.
Cocke, J., and Schwartz, J. T., 1970. “Programming languages and their compilers: Preliminary notes.” Tech. rep., Courant Institute of Mathematical Sciences, New York University. CODIUL, 1989. “Diccionario elemental del Ulwa (sumu meridional).” Tech. rep., CODIUL/UYUTMUBAL, Karawala Región Autónoma Atlántico Sur, Nicaragua; Centro de Investigaciones y Documentación de la Costa Atlantica, Managua and Bluefields, Nicaragua; Center for Cognitive Science, MIT, Cambridge, MA. Cohen-Sygal, Y., and Wintner, S., 2006. “Finite-state registered automata for non-concatenative morphology.” Computational Linguistics 32 (1), 49–82. Coker, C., Church, K., and Liberman, M., 1990. “Morphology and rhyming: Two powerful alternatives to letter-to-sound rules for speech synthesis.” In: Bailly, G., and Benoit, C. (Eds.), Proceedings of the ESCA Workshop on Speech Synthesis, pp. 83–86. Coleman, J., 1992. “Phonological representations – their names, forms and powers.” Ph.D. thesis, University of York. Collins, M. J., 1997. “Three generative, lexicalised models for statistical parsing.” In: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 16–23. 1999. “Head-driven statistical models for natural language parsing.” Ph.D. thesis, University of Pennsylvania. 2000. “Discriminative reranking for natural language parsing.” In: Proceedings of the 17th International Conference on Machine Learning (ICML), pp. 175–182. 2002. “Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms.” In: Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1–8. and Duffy, N., 2002. “New ranking algorithms for parsing and tagging: Kernels over discrete structures and the voted perceptron.” In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 263–270. and Koo, T., 2005. “Discriminative reranking for natural language parsing.” Computational Linguistics 31 (1), 25–69. and Roark, B., 2004. “Incremental parsing with the perceptron algorithm.” In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 111–118. and Singer, Y., 1999. “Unsupervised models for named entity classification.” In: Proceedings of the 3rd Conference on Empirical Methods in Natural Language Processing (EMNLP) and Very Large Corpora, pp. 100–110. Corbett, G., and Fraser, N., 1993. “Network morphology: a DATR account of Russian nominal inflection.” Journal of Linguistics 29, 113–142.
references
291
Creutz, M., and Lagus, K., 2002. “Unsupervised discovery of morphemes.” In: Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, pp. 21–30. Culy, C., 1985. “The complexity of the vocabulary of Bambara.” Linguistics and Philosophy 8, 345–351. Dolby, J., Earl, L., and Resnikoff, H., 1965. “The application of English-word morphology to automatic indexing and extracting.” Tech. Rep. M-21-65-1, Lockheed Missiles and Space Company, Palo Alto, CA. Duanmu, S., 1997. “Recursive constraint evaluation in Optimality Theory: evidence from cyclic compounds in Shanghai.” Natural Language and Linguistic Theory 15, 465–507. Durbin, R., Eddy, S., Krogh, A., and Mitchison, G., 1998. Biological Sequence Analysis. Cambridge University Press, Cambridge, UK. Earley, J., 1970. “An efficient context-free parsing algorithm.” Communications of the ACM 6 (8), 451–455. Eisner, J., 1996a. “Efficient normal-form parsing for combinatory categorial grammar.” In: Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 79–86. 1996b. “Three new probabilistic models for dependency parsing: An exploration.” In: Proceedings of the 16th International Conference on Computational Linguistics (COLING), pp. 340–345. 1997. “Bilexical grammars and a cubic-time probabilistic parser.” In: Proceedings of the 5th International Workshop on Parsing Technologies (IWPT), pp. 54–65. and Satta, G., 1999. “Efficient parsing for bilexical context-free grammars and head automaton grammars.” In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 457–464. Evans, R., and Gazdar, G., 1989a. “Inference in DATR.” In: Proceedings of the 4th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pp. 66–71. 1989b. “The semantics of DATR.” In: Cohn, A. (Ed.), Proceedings of the Seventh Conference of the Society for the Study of Artificial Intelligence and Simulation of Behaviour (AISB). Pitman/Morgan Kaufmann, London, pp. 79–87. and Weir, D., 1995. “Encoding lexicalized tree adjoining grammars with a nonmonotonic inheritance hierarchy.” In: Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 77–84. Finkel, R., and Stump, G., 2002. “Generating Hebrew verb morphology by default inheritance hierarchies.” In: Proceedings of the ACL-02 Workshop on Computational Approaches to Semitic Languages, pp. 9–18. Francis, N. W., and Kucera, H., 1982. Frequency Analysis of English Usage: Lexicon and Grammar. Houghton Mifflin, Boston.
292
references
Frazier, L., and Fodor, J. D., 1978. “The sausage machine: a new two-stage parsing model.” Cognition 6, 291–325. Freund, Y., Iyer, R., Schapire, R., and Singer, Y., 1998. “An efficient boosting algorithm for combining preferences.” In: Proceedings of the 15th International Conference on Machine Learning (ICML), pp. 170– 178. Gabbard, R., Marcus, M. P., and Kulick, S., 2006. “Fully parsing the Penn treebank.” In: Proceedings of the 2006 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2006), pp. 184–191. Gaussier, E., 1999. “Unsupervised learning of derivational morphology from inflectional lexicons.” In: Proceedings of the Workshop on Unsupervised Learning in Natural Language Processing, pp. 24–30. Geman, S., and Johnson, M., 2002. “Dynamic programming for parsing and estimation of stochastic unification-based grammars.” In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 279–286. Gerdemann, D., and van Noord, G., 2000. “Approximation and exactness in finite state optimality theory.” In: Proceedings of the 5th Workshop of the ACL Special Interest Group on Computational Phonology (SIGPHON-2000). Gildea, D., 2001. “Corpus variation and parser performance.” In: Proceedings of the 6th Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 167–202. Goldsmith, J., 2001. “Unsupervised acquisition of the morphology of a natural language.” Computational Linguistics 27 (2), 153–198. Good, I. J., 1953. “The population frequencies of species and the estimation of population parameters.” Biometrica V 40 (3,4), 237–264. Goodman, J., 1996a. “Efficient algorithms for parsing the DOP model.” In: Proceedings of the 1st Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 143–152. 1996b. “Parsing algorithms and metrics.” In: Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 177–183. 1998. “Parsing inside-out.” Ph.D. thesis, Harvard University. Hakkani-Tür, D., Oflazer, K., and Tür, G., 2002. “Statistical morphological disambiguation for agglutinative languages.” Computers and the Humanities 36 (4), 381–410. Hall, K., 2004. “Best-first word-lattice parsing: Techniques for integrated syntactic language modeling.” Ph.D. thesis, Brown University, Providence, RI. and Johnson, M., 2004. “Attention shifting for parsing speech.” In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 40–46.
references
293
and Novak, V., 2005. “Corrective modeling for non-projective dependency parsing.” In: Proceedings of the 9th International Workshop on Parsing Technologies (IWPT), pp. 42–52. Halle, M., and Marantz, A., 1993. “Distributed morphology and the pieces of inflection.” In: Hale, K., and Keyser, S. J. (Eds.), The View from Building 20. MIT Press, Cambridge, MA, pp. 111–176. Hankamer, J., 1986. “Finite state morphology and left to right phonology.” In: Proceedings of the West Coast Conference on Formal Linguistics, Volume 5. Stanford Linguistic Association, Stanford, pp. 41–52. Harlow, S., 1981. “Government and relativization in Celtic.” In: Heny, F. (Ed.), Binding and Filtering. Croom Helm, London, pp. 213–254. 1989. “The syntax of Welsh soft mutation.” Natural Language and Linguistic Theory 7 (3), 289–316. Harris, A., 1981. Georgian Syntax: A Study in Relational Grammar. Cambridge University Press, Cambridge, UK. Harris, Z., 1951. Structural Linguistics. University of Chicago Press, Chicago. Harrison, M., 1978. Introduction to Formal Language Theory. Addison Wesley, Reading, MA. Hay, J., and Baayen, R., 2005. “Shifting paradigms: gradient structure in morphology,” Trends in Cognitive Sciences 9, 342–348. Heemskerk, J., 1993. “A probabilistic context-free grammar for disambiguation in morphological parsing.” In: Proceedings of the 6th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pp. 183–192. Henderson, J., 2003. “Inducing history representations for broad coverage statistical parsing.” In: Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2003), pp. 103–110. 2004. “Discriminative training of a neural network statistical parser.” In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 95–102. Hindle, D., and Rooth, M., 1993. “Structural ambiguity and lexical relations.” Computational Linguistics 19 (1), 103–120. Hockenmaier, J., and Steedman, M., 2002. “Generative models for statistical parsing with combinatory categorial grammar.” In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 335–342. 2005. “CCGbank manual.” Tech. rep., MS-CIS-05-09, Department of Computer and Information Science, University of Pennsylvania. Hockett, C., 1954. “Two models of grammatical description.” Word 10, 210–231. Hollingshead, K., Fisher, S., and Roark, B., 2005. “Comparing and combining finite-state and context-free parsers.” In: Proceedings of the 2005 Human Language Technology Conference and Conference on Empirical
294
references
Methods in Natural Language Processing (HLT-EMNLP) 2005, pp. 787– 794. Hopcroft, J., and Ullman, J. D., 1979. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Reading, MA. Huang, L., and Chiang, D., 2005. “Better k-best parsing.” In: Proceedings of the 9th International Workshop on Parsing Technologies (IWPT), pp. 53–64. Hutchins, W. J., 2001. “Machine translation over 50 years.” Histoire, Epistemologie, Langage 22 (1), 7–31, available at: . Inkelas, S., and Zoll, C., 1999. “Reduplication as morphological doubling.” Tech. Rep. 412-0800, Rutgers Optimality Archive. 2005. Reduplication: Doubling in Morphology. Cambridge University Press, Cambridge, UK. Jansche, M., 2005. “Algorithms for minimum risk chunking.” In: Proceedings of the 5th International Workshop on Finite-State Methods in Natural Language Processing (FSMNLP). Jelinek, F., 1998. Statistical Methods for Speech Recognition. The MIT Press, Cambridge, MA. and Lafferty, J. D., 1991. “Computation of the probability of initial substring generation by stochastic context-free grammars.” Computational Linguistics 17 (3), 315–323. and Mercer, R. L., 1980. “Interpolated estimation of Markov source parameters from sparse data.” In: Proceedings of the First International Workshop on Pattern Recognition in Practice, pp. 381–397. Jijkoun, V., and de Rijke, M., 2004. “Enriching the output of a parser using memory-based learning.” In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 311–318. Johnson, C. D., 1972. Formal Aspects of Phonological Description. Mouton, The Hague. Johnson, H., and Martin, J., 2003. “Unsupervised learning of morphology for English and Inuktitut.” In: Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2003), Companion Volume, pp. 43– 45. Johnson, M., 1998a. “Finite-state approximation of constraint-based grammars using left-corner grammar transforms.” In: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics (ACL) and 17th International Conference on Computational Linguistics (COLING), pp. 619–623. 1998b. “PCFG models of linguistic tree representations.” Computational Linguistics 24 (4), 617–636. 2002a. “The DOP estimation method is biased and inconsistent.” Computational Linguistics 28 (1), 71–76.
references
295
2002b. “A simple pattern-matching algorithm for recovering empty nodes and their antecedents.” In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 136–143. and Roark, B., 2000. “Compact non-left-recursive grammars using the selective left-corner transform and factoring.” In: Proceedings of the 18th International Conference on Computational Linguistics (COLING), pp. 355– 361. Geman, S., Canon, S., Chi, Z., and Riezler, S., 1999. “Estimators for stochastic ‘unification-based’ grammars.” In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 535–541. Joshi, A. K., 1985. “How much context-sensitivity is necessary for assigning structural descriptions.” In: Dowty, D., Karttunen, L., and Zwicky, A. (Eds.), Natural Language Parsing. Cambridge University Press, Cambridge, UK, pp. 206–250. and Schabes, Y., 1997. “Tree-adjoining grammars.” In: Rozenberg, G., and Salomaa, A. (Eds.), Handbook of Formal Languages. Vol 3: Beyond Words. Springer-Verlag, Berlin/Heidelberg/New York, pp. 69–123. and Srinivas, B., 1994. “Disambiguation of super parts of speech (or supertags): Almost parsing.” In: Proceedings of the 15th International Conference on Computational Linguistics (COLING), pp. 154–160. Levy, L. S., and Takahashi, M., 1975. “Tree adjunct grammars.” Journal of Computer and System Sciences 10 (1), 136–163. Vijay-Shanker, K., and Weir, D., 1991. “The convergence of mildly context-sensitive formalisms.” In: Sells, P., Shieber, S., and Wasow, T. (Eds.), Processing of Linguistic Structure. MIT Press, Cambridge, MA, pp. 31–81. Kaplan, R. M., and Bresnan, J., 1982. “Lexical-functional grammar: A formal system for grammatical representation.” In: Bresnan, J. (Ed.), The Mental Representation of Grammatical Relations. MIT Press, Cambridge, MA, pp. 173–281. and Kay, M., 1994. “Regular models of phonological rule systems.” Computational Linguistics 20, 331–378. Riezler, S., King, T., Maxwell III, J. T., Vasserman, A., and Crouch, R., 2004. “Speed and accuracy in shallow and deep stochastic parsing.” In: Proceedings of the 2004 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLTNAACL 2004), pp. 97–104. Karttunen, L., 1998. “The proper treatment of optimality in computational phonology.” In: Proceedings of the 2nd International Workshop on FiniteState Methods in Natural Language Processing (FSMNLP), pp. 1–12. 2003. “Computing with realizational morphology.” In: Gelbukh, A. (Ed.), Computational Linguistics and Intelligent Text Processing. Vol. 2588 of Lecture Notes in Computer Science. Springer Verlag, Heidelberg, pp. 205– 216.
296
references
Karttunen, L., and Beesley, K., 2005. “Twenty-five years of finite-state morphology.” In: Arppe, A., Carlson, L., Lindén, K., Piitulainen, J., Suominen, M., Vainio, M., Westerlund, H., and Yli-Jyrä, A. (Eds.), Inquiries into Words, Constraints and Contexts (Festschrift in the Honour of Kimmo Koskenniemi and his 60th Birthday). Gummerus Printing, Saarijärvi, Finland, pp. 71–83, available on-line at: . Kaplan, R. M., and Zaenen, A., 1992. “Two-level morphology with composition.” In: Proceedings of the 14th International Conference on Computational Linguistics (COLING), pp. 141–148. Kasami, T., 1965. “An efficient recognition and syntax analysis algorithm for context-free languages.” Tech. rep., AFCRL-65-758, Air Force Cambridge Research Lab., Bedford, MA. Katz, S. M., 1987. “Estimation of probabilities from sparse data for the language model component of a speech recogniser.” IEEE Transactions on Acoustics, Speech, and Signal Processing 35 (3), 400–401. Kay, M., 1986. “Algorithm schemata and data structures in syntactic processing.” In: Grosz, B. J., Sparck-Jones, K., and Webber, B. L. (Eds.), Readings in Natural Language Processing. Morgan Kaufmann, Los Altos, pp. 35–70. Kiraz, G., 2000. Computational Approach to Non-Linear Morphology. Cambridge University Press, Cambridge, UK. Klein, D., 2005. “The unsupervised learning of natural language structure.” Ph.D. thesis, Stanford, Palo Alto, CA. and Manning, C. D., 2002. “A generative constituent-context model for improved grammar induction.” In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 128–135. 2003a. “A∗ parsing: Fast exact Viterbi parse selection.” In: Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLTNAACL 2003), pp. 119–126. 2003b. “Accurate unlexicalized parsing.” In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), pp. 423–430. 2004. “Corpus-based induction of syntactic structure: Models of dependency and constituency.” In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 478–485. Kneser, R., and Ney, H., 1995. “Improved backing-off for m-gram language modeling.” In: Proceedings of the 1995 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1995), pp. 181–184. Knuth, D., 1973. The Art of Computer Programming. Vol. 3. Addison-Wesley, Reading, MA. Kornai, A., 1991. “Formal phonology.” Ph.D. thesis, Stanford University, distributed by Garland Publishers.
references
297
Koskenniemi, K., 1983. “Two-level morphology: a general computational model for word-form recognition and production.” Ph.D. thesis, University of Helsinki, Helsinki. 1984. “FINSTEMS: a module for information retrieval.” In: Computational Morphosyntax: Report on Research 1981–1984. University of Helsinki, Helsinki, pp. 81–92. Kumar, S., and Byrne, W., 2004. “Minimum Bayes-risk decoding for machine translation.” In: Proceedings of the 2004 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2004), pp. 169–176. Lafferty, J. D., McCallum, A., and Pereira, F. C. N., 2001. “Conditional random fields: Probabilistic models for segmenting and labeling sequence data.” In: Proceedings of the 18th International Conference on Machine Learning, pp. 282–289. Lambek, J., 1958. “The mathematics of sentence structure.” American Mathematical Monthly 65, 154–169. Lari, K., and Young, S., 1990. “The estimation of stochastic context-free grammars using the inside-outside algorithm.” Computer Speech and Language 4 (1), 35–56. Lee, G. G., Lee, J.-H., and Cha, J., 2002. “Syllable-pattern-based unknown morpheme segmentation and estimation for hybrid part-of-speech tagging of Korean.” Computational Linguistics 28 (1), 53–70. Levy, R., and Manning, C. D., 2004. “Deep dependencies from context-free statistical parsers: Correcting the surface dependency approximation.” In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 327–334. Lewis, H., and Papadimitriou, C., 1981. Elements of the Theory of Computation. Prentice-Hall, Englewood Cliffs, NJ. Lewis, R. L., 1998. “Reanalysis and limited repair parsing: Leaping off the garden path.” In: Fodor, J. D., and Ferreira, F. (Eds.), Reanalysis in Sentence Processing. Kluwer Academic Publishers, Dordrecht, pp. 247–284. Lewis II, P., and Stearns, R., 1968. “Syntax-directed transduction.” Journal of the Association for Computing Machinery 15 (3), 465–488. Lieber, R., 1980. “On the organization of the lexicon.” Ph.D. thesis, MIT, Cambridge, MA. 1987. An Integrated Theory of Autosegmental Processes. SUNY Series in Linguistics. SUNY Press, Albany, NY. 1992. Deconstructing Morphology: Word Formation in a GovernmentBinding Syntax. University of Chicago Press, Chicago. Lin, D., 1998a. “Dependency-based evaluation of minipar.” In: Workshop on the Evaluation of Parsing Systems, pp. 48–56. 1998b. “A dependency-based method for evaluating broad-coverage parsers.” Natural Language Engineering 4 (2), 97–114.
298
references
Lombardi, L., and McCarthy, J., 1991. “Prosodic circumscription in Choctaw morphology.” Phonology 8, 37–72. Magerman, D. M., 1995. “Statistical decision-tree models for parsing.” In: Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 276–283. and Marcus, M. P., 1990. “Parsing a natural language using mutual information statistics.” In: Proceedings of the 8th National Conference on Artificial Intelligence, pp. 984–989. Mangu, L., Brill, E., and Stolcke, A., 1999. “Finding consensus among words: Lattice-based word error minimization.” In: Proceedings of the 6th European Conference on Speech Communication and Technology (Eurospeech), pp. 495– 498. Manning, C. D., and Carpenter, B., 1997. “Probabilistic parsing using left corner language models.” In: Proceedings of the 5th International Workshop on Parsing Technologies (IWPT), pp. 147–158. Marcus, M. P., Santorini, B., and Marcinkiewicz, M. A., 1993. “Building a large annotated corpus of English: The Penn Treebank.” Computational Linguistics 19 (2), 313–330. Matthews, P., 1966. “A procedure for morphological encoding.” Mechanical Translation 9, 15–21. 1972. Inflectional Morphology: a Theoretical Study Based on Aspects of Latin Verb Conjugations. Cambridge University Press, Cambridge, UK. Maxwell III, J. T., and Kaplan, R. M., 1998. “Unification-based parsers that automatically take advantage of context freeness.” MS., Xerox PARC, available at: . McCallum, A., and Li, W., 2003. “Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons.” In: Proceedings of the 7th Conference on Computational Natural Language Learning (CoNLL), pp. 188–191. Freitag, D., and Pereira, F. C. N., 2000. “Maximum entropy Markov models for information extraction and segmentation.” In: Proceedings of the 17th International Conference on Machine Learning (ICML), pp. 591–598. McCarthy, J., 1979. “Formal problems in Semitic morphology and phonology.” Ph.D. thesis, MIT, Cambridge, MA, distributed by Indiana University Linguistics Club (1982). and Prince, A., 1986. “Prosodic morphology.” MS. University of Massachusetts, Amherst, and Brandeis University. 1990. “Foot and word in prosodic morphology: The Arabic broken plural.” Natural Language and Linguistic Theory 8, 209–284. McDonald, R., Pereira, F. C. N., Ribarov, K., and Hajiˇc, J., 2005. “Nonprojective dependency parsing using spanning tree algorithms.” In:
references
299
Proceedings of the Conference on Human Language Technology Conference and Empirical Methods in Natural Language Processing (HLT-EMNLP), pp. 523–530. McIlroy, M. D., 1982. “Development of a spelling list.” IEEE Transactions on Communications 30 (1), 91–99. Melamed, I. D., 2003. “Multitext grammars and synchronous parsers.” In: Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLTNAACL 2003), pp. 158–165. Satta, G., and Wellington, B., 2004. “Generalized multitext grammars.” In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 661–668. Mel’ˇcuk, I., 1988. Dependency Syntax: Theory and Practice. SUNY Press, Albany, NY. Meya-Lloport, M., 1987. “Morphological analysis of Spanish for retrieval.” Literary and Linguistic Computing 2, 166–170. Michaelis, J., 2001a. “Derivational minimalism is mildy context-sensitive.” In: Moortgat, M. (Ed.), Logical Aspects of Computational Linguistics, LNCS/LNAI Vol. 2014. Springer-Verlag, Berlin/Heidelberg/New York, pp. 179–198. 2001b. “Transforming linear context-free rewriting systems into minimalist grammars.” In: Retoré, G. M. C. (Ed.), Logical Aspects of Computational Linguistics, LNCS/LNAI Vol. 2099. Springer-Verlag, Berlin/Heidelberg/New York, pp. 228–244. Mohri, M., 1994. “Syntactic analysis by local grammars automata: an efficient algorithm.” In: Papers in Computational Lexicography: COMPLEX ’94. Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest, pp. 179–191. 1997. “Finite-state transducers in language and speech processing.” Computational Linguistics 23, 269–311. 2002. “Generic epsilon-removal and input epsilon-normalization algorithms for weighted transducers.” International Journal of Foundations of Computer Science 13 (1), 129–143. and Nederhof, M.-J., 2001. “Regular approximation of context-free grammars through transformation.” In: Junqua, J.-C., and van Noord, G. (Eds.), Robustness in Language and Speech Technology. Kluwer Academic Publishers, Dordrecht, pp. 153–163. and Riley, M., 1999. “Network optimizations for large vocabulary speech recognition.” Speech Communication 28 (1), 1–12. 2002. “An efficient algorithm for the n-best-strings problem.” In: Proceedings of the International Conference on Spoken Language Processing (ICSLP), pp. 1313–1316.
300
references
Mohri, M., and Roark, B., 2006. “Probabilistic context-free grammar induction based on structural zeros.” In: Proceedings of the 2006 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2006), pp. 312–319. and Sproat, R., 1996. “An efficient compiler for weighted rewrite rules.” In: Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 231–238. Pereira, F. C. N., and Riley, M., 2002. “Weighted finite-state transducers in speech recognition.” Computer Speech and Language 16 (1), 69–88. Montague, R., 1974. Formal Philosophy: Papers of Richard Montague. Yale University Press, New Haven, CT. Moore, R. C., 2000. “Removing left recursion from context-free grammars.” In: Proceedings of the 1st Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 249–255. Morgan, T., 1952. Y Treigladau a’u Cystrawen. University of Wales Press, Cardiff. Nederhof, M.-J., 2000. “Practical experiments with regular approximation of context-free languages.” Computational Linguistics 26 (1), 17–44. Newman, S., 1944. Yokuts Language of California. Viking Fund Publications in Anthropology, New York. Ney, H., Essen, U., and Kneser, R., 1994. “On structuring probabilistic dependencies in stochastic language modeling.” Computer Speech and Language 8, 1–38. Nijholt, A., 1980. Context-free Grammars: Covers, Normal Forms, and Parsing. Springer Verlag, Berlin/Heidelberg/New York. Nivre, J., and Nilsson, J., 2005. “Pseudo-projective dependency parsing.” In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 99–106. Pereira, F. C. N., and Riley, M., 1997. “Speech recognition by composition of weighted finite automata.” In: Roche, E., and Schabes, Y. (Eds.), Finite-State Language Processing, MIT Press, Cambridge, MA, pp. 431–453. and Schabes, Y., 1992. “Inside–outside reestimation from partially bracketed corpora.” In: Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 128–135. and Wright, R. N., 1997. “Finite-state approximation of phrase-structure grammars.” In: Roche, E., and Schabes, Y. (Eds.), Finite-State Language Processing. MIT Press, Cambridge, MA, pp. 149–173. Pesetsky, D., 1985. “Morphology and logical form.” Linguistic Inquiry 16, 193– 246. Pinker, S., 1999. Words and Rules. Weidenfeld and Nicholson, London. and Prince, A., 1988. “On language and connectionism: Analysis of a parallel distributed processing model of language acquisition.” In: Pinker,
references
301
S., and Mehler, J. (Eds.), Connections and Symbols. Cognition special issue, MIT Press, pp. 73–193. Pollard, C., 1984. “Generalized phrase structure grammars.” Ph.D. thesis, Stanford University. and Sag, I., 1994. Head-Driven Phrase Structure Grammar. University of Chicago Press, Chicago. Porter, M., 1980. “An algorithm for suffix stripping.” Program 14 (3), 130–137. Prince, A., and Smolensky, P., 1993. “Optimality theory.” Tech. Rep. 2, Rutgers University, Piscataway, NJ. Pullum, G., and Zwicky, A., 1984. “The syntax–phonology boundary and current syntactic theories.” In: Ohio State Working Papers in Linguistics. No. 29. Department of Linguistics, The Ohio State University, Columbus, OH, pp. 105–116. Ratnaparkhi, A., 1996. “A maximum entropy model for part-of-speech tagging.” In: Proceedings of the 1st Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 133–142. 1997. “A linear observed time statistical parser based on maximum entropy models.” In: Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1–10. 1999. “Learning to parse natural language with maximum entropy models.” Machine Learning 34, 151–175. Resnik, P., 1992. “Left-corner parsing and psychological plausibility.” In: Proceedings of the 14th International Conference on Computational Linguistics (COLING), pp. 191–197. Riezler, S., Prescher, D., Kuhn, J., and Johnson, M., 2000. “Lexicalized stochastic modeling of constraint-based grammars using log-linear measures and EM training.” In: Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 480–487. King, T., Kaplan, R. M., Crouch, R., Maxwell III, J. T., and Johnson, M., 2002. “Parsing the Wall Street Journal using a lexical-functional grammar and discriminative estimation techniques.” In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 271–278. Ritchie, G., Russell, G., Black, A., and Pulman, S., 1992. Computational Morphology: Practical Mechanisms for the English Lexicon. MIT Press, Cambridge, MA. Roark, B., 2001. “Probabilistic top-down parsing and language modeling.” Computational Linguistics 27 (2), 249–276. 2002. “Markov parsing: lattice rescoring with a statistical parser.” In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 287–294.
302
references
Roark, B., 2004. “Robust garden path parsing.” Natural Language Engineering 10 (1), 1–24. and Johnson, M., 1999. “Efficient probabilistic top-down and left-corner parsing.” In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 421–428. Saraclar, M., Collins, M. J., and Johnson, M., 2004. “Discriminative language modeling with conditional random fields and the perceptron algorithm.” In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 47–54. Rosenfeld, R., 1996. “A maximum entropy approach to adaptive statistical language modeling.” Computer Speech and Language 10, 187–228. Rosenkrantz, S. J., and Lewis II, P., 1970. “Deterministic left corner parsing.” In: IEEE Conference Record of the 11th Annual Symposium on Switching and Automata, pp. 139–152. Rumelhart, D., and McClelland, J., 1986. “On learning the past tense of English verbs.” In: McClelland, J., and Rumelhart, D. (Eds.), Parallel Distributed Processing, Volume 2. MIT Press, Cambridge, MA, pp. 216–271. Saul, L., and Pereira, F. C. N., 1997. “Aggregate and mixed-order Markov models for statistical language processing.” In: Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 81–89. Schabes, Y., Abeille, A., and Joshi, A. K., 1988. “Parsing strategies with ‘lexicalized’ grammars: application to tree adjoining grammars.” In: Proceedings of the 12th International Conference on Computational Linguistics (COLING), pp. 578–583. Schone, P., and Jurafsky, D., 2000. “Knowledge-free induction of morphology using latent semantic analysis.” In: Proceedings of the 4th Conference on Computational Natural Language Learning (CoNLL), pp. 67–72. 2001. “Knowledge-free induction of inflectional morphologies.” In: Proceedings of the 2nd Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 1–9. Schütze, H., 1998. “Automatic word sense discrimination.” Computational Linguistics 24 (1), 97–124. Schveiger, P., and Mathe, J., 1965. “Analyse d’information de la déclinaison du substantif en hongrois (du point de vue de la traduction automatique).” Cahiers de Linguistique Théorique et Appliquée 2, 263–265. Seidenadel, C. W., 1907. The Language Spoken by the Bontoc Igorot. Open Court Publishing Company, Chicago. Sha, F., and Pereira, F. C. N., 2003. “Shallow parsing with conditional random fields.” In: Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2003), pp. 213–220. Sharma, U., Jugal, K., and Das, R., 2002. “Unsupervised learning of morphology for building lexicon for a highly inflectional language.” In: Proceedings
references
303
of the ACL-02 Workshop on Morphological and Phonological Learning, pp. 1–10. Shieber, S. M., and Schabes, Y., 1990. “Synchronous tree-adjoining grammars.” In: Proceedings of the 13th International Conference on Computational Linguistics (COLING), pp. 253–258. Snover, M., Jarosz, G., and Brent, M., 2002. “Unsupervised learning of morphology using a novel directed search algorithm: Taking the first step.” In: Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, pp. 11–20. Sproat, R., 1985. “On deriving the lexicon.” Ph.D. thesis, MIT, Cambridge, MA, distributed by MIT Working Papers in Linguistics. 1992. Morphology and Computation. MIT Press, Cambridge, MA. 1997a. “Multilingual text analysis for text-to-speech synthesis.” Natural Language Engineering 2 (4), 369–380. (Ed.), 1997b. Multilingual Text-to-Speech Synthesis: The Bell Labs Approach. Kluwer Academic Publishers, Dordrecht. 2000. A Computational Theory of Writing Systems. Cambridge University Press, Cambridge, UK. and Riley, M., 1996. “Compilation of weighted finite-state transducers from decision trees.” In: Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 215–222. Hu, J., and Chen, H., 1998. “EMU: An e-mail preprocessor for text-tospeech.” In: Proceedings of the IEEE Signal Processing Society Workshop on Multimedia Signal Processing, pp. 239–244. Srinivas, B., 1996. “Almost parsing technique for language modeling.” In: Proceedings of the International Conference on Spoken Language Processing (ICSLP), pp. 1169–1172. and Joshi, A. K., 1999. “Supertagging: an approach to almost parsing.” Computational Linguistics 25 (2), 237–265. Stabler, E., 1997. “Derivational minimalism.” In: Retoré, C. (Ed.), Logical Aspects of Computational Linguistics, LNCS/LNAI Vol. 1328. Springer-Verlag, Berlin/Heidelberg/New York, pp. 68–95. Steedman, M., 1985. “Dependency and coordination in the grammar of Dutch and English.” Language 61, 523–568. 1986. “Combinators and grammars.” In: Oehrle, R., Bach, E., and Wheeler, D. (Eds.), Categorial Grammars and Natural Language Structures. Foris, Dordrecht, pp. 417–442. 1996. Surface Structure and Interpretation. MIT Press, Cambridge, MA. Steele, S., 1995. “Towards a theory of morphological information.” Language 71, 260–309. Stolcke, A., 1995. “An efficient probabilistic context-free parsing algorithm that computes prefix probabilities.” Computational Linguistics 21 (2), 165– 202.
304
references
Stolcke, A., and Segal, J., 1994. “Precise n-gram probabilities from stochastic context-free grammars.” In: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 74–79. König, Y., and Weintraub, M., 1997. “Explicit word error minimization in n-best list rescoring.” In: Proceedings of the European Conference on Speech Communication and Technology (Eurospeech), pp. 163–166. Stump, G., 2001. Inflectional Morphology: A Theory of Paradigm Structure. Cambridge University Press, Cambridge, UK. Thorne, D., 1993. A Comprehensive Welsh Grammar. Blackwell, Oxford. Tjong Kim Sang, E. F., and Buchholz, S., 2000. “Introduction to the CoNLL2000 shared task: Chunking.” In: Proceedings of the 4th Conference on Computational Natural Language Learning (CoNLL), pp. 127–132. Tzoukermann, E., and Liberman, M., 1990. “A finite-state morphological processor for Spanish.” In: Proceedings of the 13th International Conference on Computational Linguistics (COLING), pp. 277–286. van den Bosch, A., and Daelemans, W., 1999. “Memory-based morphological analysis.” In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 285–292. Vauquois, B., 1965. “Présentation d’un programme d’analyse morphologique russe.” Tech. rep., Centre d’Etudes pour la Traduction Automatique, Université de Grenoble 1. Vijay-Shanker, K., and Joshi, A. K., 1985. “Some computational properties of tree adjoining grammars.” In: Proceedings of the 23rd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 82–93. and Weir, D., 1990. “Polynomial time parsing of combinatory categorial grammars.” In: Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 82–93. 1993. “Parsing some constrained grammar formalisms.” Computational Linguistics 19 (4), 591–636. and Joshi, A. K., 1987. “Characterizing structural descriptions produced by various grammatical formalisms.” In: Proceedings of the 25th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 104–111. Voutilainen, A., 1994. “Designing a parsing grammar.” Tech. rep. 22, University of Helsinki. Walther, M., 2000a. “Finite-state reduplication in one-level prosodic morphology.” In: Proceedings of the 1st Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 296–302. Walther, M., 2000b. “Temiar reduplication in one-level prosodic morphology.” In: Proceedings of the 5th Workshop of the ACL Special Interest Group on Computational Phonology (SIGPHON-2000), pp. 13–21. Wang, W., and Harper, M. P., 2002. “The superARV language model: Investigating the effectiveness of tightly integrating multiple knowledge sources.”
references
305
In: Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 238–247. and Stolcke, A., 2003. “The robustness of an almost-parsing language model given errorful training data.” In: Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2003), pp. 240–243. Weir, D., 1988. “Characterizing mildly context-sensitive grammar formalisms.” Ph.D. thesis, University of Pennsylvania. Wicentowski, R., 2002. “Modeling and learning multilingual inflectional morphology in a minimally supervised framework.” Ph.D. thesis, Johns Hopkins University, Baltimore, MD. Witten, I. H., and Bell, T. C., 1991. “The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression.” IEEE Transactions on Information Theory 37 (4), 1085–1094. Woyna, A., 1962. “Morphological analysis of Polish verbs in terms of machine translation.” Tech. rep., Machine Translation Research Project, Georgetown University. Wright, J., 1910. Grammar of the Gothic Language. Oxford University Press, Oxford. Wu, D., 1997. “Stochastic inversion transduction grammars and bilingual parsing of parallel corpora.” Computational Linguistics 23 (3), 377–404. XTAG Research Group, 2001. “A lexicalized tree adjoining grammar for English.” Tech. rep. IRCS-01-03, IRCS, University of Pennsylvania. Xu, P., Chelba, C., and Jelinek, F., 2002. “A study on richer syntactic dependencies for structured language modeling.” In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 191–198. Emami, A., and Jelinek, F., 2003. “Training connectionist models for the structured language model.” In: Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 160–167. Yarowsky, D., and Wicentowski, R., 2001. “Minimally supervised morphological analysis by multimodal alignment.” In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 207–216. Younger, D. H., 1967. “Recognition and parsing of context-free languages in time n3 .” Information and Control 10 (2), 189–208.
Name Index Abe, M., 100 Abeille, A., 258 Abney, S., 174, 210, 273–275 Aho, A. V., 180, 182, 185 Ajdukiewicz, K., 266 Allauzen, C., 148 Allen, J., 100, 101 Alshawi, H., 282, 283 Andron, D., 100 Andry, F., 27 Anshen, F., 101 Antworth, E., 102 Archangeli, D., 32 Aronoff, M., 46–48, 62, 65, 68, 69, 101, 132 Aronson, H., 136 Attar, R., 100 Baayen, R. H., 86, 124 Baker, J., 204 Baker, M., 75 Bar-Hillel, Y., 266 Barg, P., 27 Baroni, M., 119, 133 Bat-El, O., 32 Bear, J., 102 Beard, R., 120 Becker, J., 100 Beesley, K., 23, 27, 44–46, 58, 95, 102 Bell, T. C., 143 Bernard-Georges, A., 100 Berthaud, M., 100 Besag, J., 275 Bikel, D. M., 228, 230, 233 Billot, S., 199 Bilmes, J., 145, 150, 151, 165 Bird, S., 110, 112 Black, A., 102 Black, E., 210 Blaheta, D., 213 Blevins, J., 83, 84, 112
Bod, R., 275–277 Böhmová, A., 236 Boussard, A., 100 Brand, I., 100 Brent, M., 119, 133 Bresnan, J., 249 Brew, C., 273 Brill, E., 168 Brown, P. F., 140, 151 Buchholz, S., 174, 237 Buckwalter, T., 44 Butt, M., 257 Byrd, R., 101 Byrne, W., 168 Büttel, I., 100 Canon, S., 173, 275 Carpenter, B., 192 Carroll, G., 241–243 Carstairs, A., 49 Cha, J., 117 Charniak, E., 173, 213, 225, 230–233, 239–243, 272, 278 Chelba, C., 192, 238–241 Chen, H., 17 Chen, S., 147 Chi, Z., 173, 275 Chiang, D., 164, 200, 279 Chomsky, N., 17, 111, 271 Choueka, Y., 100 Church, K., 100, 101 Clark, S., 269, 272 Cocke, J., 140, 193 Cohen-Sygal, Y., 57–60 Coker, C., 100 Coleman, J., 110 Collins, M. J., 134, 172, 173, 192, 212, 225–233, 272, 275, 277, 278 Corbett, G., 27 Creutz, M., 119, 134
Crouch, R., 237, 257, 275 Culy, C., 3, 55 Curran, J. R., 269, 272 Daelemans, W., 119 Das, R., 119 Della Pietra, S. A., 140 Della Pietra, V. J., 140, 151 Dershowitz, N., 100 de Rijke, M., 213 deSouza, P. V., 151 Dolby, J., 100 Douglas, S., 282, 283 Duanmu, S., 115 Duffy, N., 173, 278 Durbin, R., 17 Dyvik, H., 257 Earl, L., 100 Earley, J., 201–203, 238, 265 Eddy, S., 17 Eisner, J., 223, 224, 236, 269 Ellison, T., 110, 112 Emami, A., 239 Essen, U., 143, 146 Evans, R., 27, 265 Finkel, R., 27 Fisher, S., 216 Flickenger, D., 210 Fodor, J. D., 192 Fraenkel, A., 100 Francis, N. W., 154 Fraser, N., 27 Frazier, L., 192 Freitag, D., 171 Freund, Y., 277 Gabbard, R., 213 Gates, W., 111 Gaussier, E., 119 Gazdar, G., 27, 265 Gdaniec, C., 210 Geman, S., 173, 275 Gerdemann, D., 58, 70, 95, 101, 115 Gildea, D., 233
Goldsmith, J., 119–125, 129 Good, I. J., 146 Goodman, J., 147, 167, 206, 207, 276, 277 Grishman, R., 210 Hajiˇc, J., 236 Hajiˇcová, E., 236 Hakkani-Tür, D., 117 Hall, K., 223, 236, 240 Halle, M., 65, 111 Hankamer, J., 100, 101 Harlow, S., 37 Harper, M. P., 240, 272 Harris, A., 136 Harris, Z., 133 Harrison, M., 2, 16 Harrison, P., 210 Heemskerk, J., 116 Henderson, J., 192, 225 Hindle, D., 209, 210 Hladká, B. V., 236 Hockenmaier, J., 269, 271 Hollingshead, K., 216 Hopcroft, J., 2, 16 Hu, J., 17 Huang, L., 164, 200, 279 Hunnicutt, M. S., 100, 101 Hutchins, W. J., 100 Ingria, R., 210 Inkelas, S., 58–61 Iyer, R., 277 Jansche, M., 168 Jarosz, G., 119, 133 Jelinek, F., 140, 143, 144, 192, 210, 227, 231, 238, 239, 241 Jijkoun, V., 213 Johnson, C. D., 28, 111 Johnson, H., 119, 133 Johnson, M., 173, 189, 192, 213, 217, 218, 233, 240, 245, 257, 275, 276, 278 Joshi, A. K., 222, 257, 258, 260, 265, 270, 272 Jugal, K., 119 Jurafsky, D., 119, 124–131, 133, 135
Lieber, R., 36, 37 Lin, D., 237 Lombardi, L., 37, 38 Magerman, D. M., 223, 225, 228, 242 Mangu, L., 168 Manning, C. D., 192, 213, 220, 223, 225, 233, 242–244 Marcinkiewicz, M. A., 154, 209, 211 Marcus, M. P., 154, 209, 211, 213, 242 Marsi, E., 237 Martin, J., 119, 133 Masuichi, H., 257 Mathe, J., 100 Matiasek, J., 119, 133 Matthews, P., 49, 62, 65, 100 Maxwell III, J. T., 237, 257, 275 McCallum, A., 171–173 McCarthy, J., 29, 30, 37, 38, 41 McClelland, J., 118, 119 McDonald, R., 236 McIlroy, M. D., 100, 101 Mel’ˇcuk, I., 236 Melamed, I. D., 279, 282 Mercer, R. L., 140, 143, 144, 151, 227, 231 Meya-Lloport, M., 100 Michaelis, J., 271 Mitchison, G., 17 Mohri, M., 7, 16, 17, 27, 28, 36, 43, 111, 148, 164, 216, 217, 245, 247 Montague, R., 269 Moore, R. C., 185, 189 Morgan, T., 37 Nederhof, M.-J., 244, 245, 247 Newman, S., 32 Ney, H., 143, 146–148 Niedermair, G., 100 Nijholt, A., 189 Nilsson, J., 236 Nivre, J., 236 Novak, V., 236 Nündel, S., 100 Oflazer, K., 117 Ooshima, Y., 100
310
name index
Papadimitriou, C., 2, 16 Pereira, F. C. N., 14, 15, 16, 17, 145, 171, 172, 236, 242, 244 Pesetsky, D., 31 Piepenbrock, R., 124 Pinker, S., 83, 118, 119 Pollard, C., 222, 249, 270 Porter, M., 100, 101 Prescher, D., 275 Prince, A., 29–30, 110, 113, 118, 119 Pullum, G., 32 Pulman, S., 102 Ratnaparkhi, A., 171, 173, 192, 225, 228, 272, 277 Resnik, P., 245 Resnikoff, H., 100 Ribarov, K., 236 Riezler, S., 173, 237, 257, 275 Riley, M., 14–17, 109, 164 Ritchie, G., 102 Roark, B., 148, 173, 189, 192, 216, 217, 233, 239, 240, 275 Rohrer, C., 257 Roossin, P. S., 140 Rooth, M., 209 Rosenfeld, R., 173 Rosenkrantz, S. J., 185, 187 Roukos, S., 210 Rumelhart, D., 118, 119 Russell, G., 102 Sag, I., 222, 249 Santorini, B., 154, 209–211 Saraclar, M., 173 Satta, G., 223, 224, 279, 282 Saul, L., 145 Schabes, Y., 242, 258, 265, 279, 282 Schapire, R., 277 Schone, P., 119, 124–131, 133, 135 Schveiger, P., 100 Schwartz, J. T., 193 Schütze, H., 126 Segal, J., 238 Seidenadel, C. W., 39 Sethi, R., 180, 182, 185
Sha, F., 172 Sharma, U., 119 Shieber, S. M., 279, 282 Simon, I., 11 Singer, Y., 134, 277 Smolensky, P., 110, 113 Snover, M., 119, 133 Sproat, R., 7, 17, 27, 28, 31, 36, 43, 44, 102, 109, 111, 113, 119, 135 Srinivas, B., 272, 282, 283 Stabler, E., 270, 271 Stearns, R., 279–281 Steedman, M., 265, 267, 269, 271 Stolcke, A., 168, 203, 238, 240, 272 Strzalkowski, T., 210 Stump, G., 27, 31, 49, 62, 64–83 Tür, G., 117 Takahashi, M., 222, 258, 270 Takeichi, N., 100 Thorne, D., 37 Thornton, S., 27 Thurmair, G., 100 Tjong Kim Sang, E. F., 174 Trost, H., 119, 133 Ullman, J. D., 2, 16, 180, 182, 185 van Noord, G., 115 van Rijn, H., 124 van den Bosch, A., 119 Vasserman, A., 237, 257 Vauquois, B., 100 Vijay-Shanker, K., 257, 260, 269, 270 Voutilainen, A., 16 Walther, M., 57 Wang, W., 240, 272 Weintraub, M., 168 Weir, D., 257, 265, 269–271 Wellington, B., 279, 282 Wessel, A., 100 Wicentowski, R., 119, 129–135 Wintner, S., 57–59 Witten, I. H., 143 Woyna, A., 100
name index Wright, J., 54 Wright, R. N., 244 Wu, D., 279–281
Young, S., 204, 241 Younger, D. H., 193 Yuura, K., 100
Xu, P., 239 Yarowsky, D., 119, 129–135 Youd, N., 27
Zaenen, A., 112 Zoll, C., 58–60 Zwicky, A., 32
311
Language Index
Arabic, 42–46, 150
Bambara, 3, 55, 58
Bontoc, 39, 40
Breton, 66, 67, 79–82
Bulgarian, 66
Dakota, 58
Finnish, 63–64, 104, 112
German, 24, 35
Gothic, 54–56, 60
Irish, 36
Kannada, 85
Kinande, 58, 59
Koasati, 37–38
Latin, 24, 25, 47–53
Sanskrit, 68–73
Swahili, 73–79
Sye, 58
Tagalog, 30
Turkish, 101
Ulwa, 30, 40–41
Welsh, 36–37
Yowlumne (Yawelmani), 32–33, 35, 49
Index A∗ search, 192, 224 affixation, 27 agglutinative languages, 150 alphabet, 2 argmax, 9 articulated morphology, 65 automata algorithms, 13 automatic speech recognition, 140 backpointer, 159, 195–199 beam-search, 192 binomial distributions, 9 Brown Corpus, 152 Chomsky hierarchy, 176 circumfix, 125, 127 closure properties, regular languages, 3 regular relations, 6, 7 weighted regular languages, 12 Combinatory Categorial Grammar (CCG), 265–272 function application, 266 function composition, 267 non-constituent coordination, 267, 268 normal form derivation, 269 type raising, 267 compile-replace, 23, 44 composition, 6, 7, 23 algorithm, 13, 14 epsilons, 14, 15 computational morphology, 17 computational syntax, 17 conflation sets, 129 consonant mutation, 36, 37 constituent, 199 span, 193, 194, 196–198 Constraint Dependency Grammar, 272 constraint-based formalisms, 18, 19
context-free grammar, 176, 248, 249 bilexical, 221, 223, 224 derivation, 177, 178 derivation, leftmost, 177, 178 derivation, rightmost, 177, 178 finite-state approximation, 244–247 left-recursion, 179, 180, 185, 187 lexicalized, 221, 223, 224, 233 LL(k), 185 LR(k), 182 PCFG prefix probabilities, 238 probabilistic (PCFG), 176, 190, 191 rule production, 176 strongly regular, 245–247 unsupervised induction, 240–243 weighted (WCFG), 176 context-free language, 177 context-sensitive grammars, 257 context-free approximation, 271, 272 finite-state approximation, 272 mildly context-sensitive grammars, 257, 258, 270, 271 continuation lexicon, 103 cosine measure, 127, 130 Data-oriented parsing (DOP), 275–278 DATR, 27, 102, 265 declarative phonology, 110 DECOMP, 101 dependency graph, 234, 235 non-projective, 236 dependency parsing, 233–236 dependency tree, 233–235, 259, 260 derivational morphology, 25 determinization, 15, 16 discrete distributions, 9 distituent, 242, 243 distributed morphology, 65 dotted rules, 202, 203
efficiency, 18 empty language, 3 empty nodes, 210, 212, 213 epsilon filter, 15 expectation maximization (EM), 144, 161, 165, 203, 206, 241, 242 extrametricality, 30 feature representations, 26 finite-state automata, 1, 2, 4 finite-state registered automata, 57, 58 finite-state transducers, 1, 2, 5, 6 formal language, 2 Forward algorithm, 152, 155, 156 example, 156–158 higher-order HMMs, 158 pseudocode, 156 Forward–backward algorithm (Baum–Welch), 164, 165 decoding, 167–168 example, 167 pseudocode, 166 function tags, 210, 212, 213 Generalized Context-free Grammars (GCFG), 270 grammar equivalence, 178 grammar transform, 178 Chomsky Normal Form, 178–180, 194–196, 206 left-corner, 187, 189, 245 left-recursion, 185 self-embedding, 245–247 grammars, 17, 18 head child, 222 head operations, 31 head percolation, 222 syntactic versus semantic, 223 head transducer, 282, 283 head word, 222 Head-Driven Phrase Structure Grammar (HPSG), 249, 252 hidden Markov models, 152 incremental theories, 65–66, 86 inferential theories, 64–66, 86
infixation, 30 extrametrical, 39, 40 positively circumscribed, 40, 41 inflectional morphology, 25 inside-outside algorithm, 203, 206, 241, 242 intersection, 5 algorithm, 13 inversion, 7 item-and-arrangement, 61–64 item-and-process, 61–64 KATR, 27 keçi, 101 KIMMO, 6, 102, 112 rule types in, 109, 110 Kleene’s theorem, 4 language modeling, 139, 140, 238 bigrams and trigrams, 141 class-based, 151, 152 n-grams, 139–141, 148–150 n-grams, conditioning history, 141 n-grams, factored models, 150 n-grams, finite-state encoding, 148, 149 n-grams, morphology, 117, 118 n-grams, start and stop symbols, 141, 149 PCFGs, 238–240 latent semantic analysis, 126 lemmatization, 24 lenient composition, 113–115 Levenshtein distance, 131 lexical theories, 64–66, 86 Lexical-Functional Grammar (LFG), 249, 252, 257 c-structure, 250 f-structure, 250, 252 lextools, 70, 86 Linear Context-free Rewriting Systems (LCFRS), 270, 271 Linguistica, 119 log-linear models, 170 conditional random fields, 171 kernel, 278 maximum entropy, 147, 278, 279
top-down, 184, 185, 187, 189 Treebank parsers, 225 word lattices, 240 part-of-speech (POS) tags, 152 Penn Treebank, 152, 190, 209, 210, 212 pigeonhole principle, 132 Porter stemmer, 101 PP attachment, 209 priority union, 77, 113 prosodic affixation, 32, 33, 35 prosodic circumscription, 29, 30 pushdown automata, 182, 183, 244, 245 re-analysis, 191, 192 realizational theories, 65–66, 86 reduplication, 53, 54, 57, 58, 60 inexact copying, 58, 59 morphological doubling theory versus correspondence theory, 60 unbounded, 3, 55 regular expressions, 4 regular languages, 2, 3 regular relations, 5, 6 relative frequency, 9 rewrite rules, 7 robustness, 18 root-and-pattern morphology, 41, 43–46 morphological learning, 135 selectional preferences, 222 semirings, 11 real, 11 tropical, 11 , 26 signatures, 120, 122, 128 smoothing, 142, 143 absolute discounting, 145, 146 backing off, 143, 144 deleted interpolation, 144, 145 Good–Turing estimation, 146 Katz backoff, 146 Kneser–Ney estimation, 147 spelling correction, 100, 101 stochastic models, 8 subcategorization, 222 subsegmental morphology, 36, 37 subtractive morphology, 37, 39
316
index
Supertagging, 272 syntactic disambiguation, 18 tagging, chunking, 173, 174 named entities, 174 POS, 159, 161 terminal symbols, 176 text-to-speech synthesis, 101 top-down filtering, 201–203 transducer intersection, 105, 109 transduction grammars, 279, 280 ITG, 280–282 multitext, 282 synchronous TAG, 282 tree transform, child annotation, 218, 219 Markov factorization, 213–217, 219 parent annotation, 217–219 Tree-adjoining grammars (TAG), 258, 259, 272 adjunction, 258 derivation tree, 258–260 elementary trees, 258 foot node, 258, 260, 261 lexicon, 265
MCTAG, 270 parsing, 260–265 substitution, 258 XTAG project, 265 trie, 103 two-level morphology, 27, 104, 105 unification, 102, 253–255 unification grammar, 248–255 stochastic, 273, 274 unification parsing, 255 interleaving, 255–257 universal language, 3 Viterbi algorithm, 159, 169 example, 161 N-best, 161–164, 278, 279 pseudocode, 160 weighted finite-state automata, 9, 11, 12 weighted finite-state transducers, 9, 11, 12 weights, 8 Xerox Linguistic Environment (XLE), 257 XFST, 95