Empirical Linguistics
Empirical Linguistics
Open Linguistics Series
Series Editor Robin Fawcett, Cardiff University

This series is 'open' in two related ways. First, it is not confined to works associated with any one school of linguistics. For almost two decades the series has played a significant role in establishing and maintaining the present climate of 'openness' in linguistics, and we intend to maintain this tradition. However, we particularly welcome works which explore the nature and use of language through modelling its potential for use in social contexts, or through a cognitive model of language - or indeed a combination of the two. The series is also 'open' in the sense that it welcomes works that open out 'core' linguistics in various ways: to give a central place to the description of natural texts and the use of corpora; to encompass discourse 'above the sentence'; to relate language to other semiotic systems; to apply linguistics in fields such as education, language pathology, and law; and to explore the areas that lie between linguistics and its neighbouring disciplines such as semiotics, psychology, sociology, philosophy, and cultural and literary studies.

Continuum also publishes a series that offers a forum for primarily functional descriptions of languages or parts of languages - Functional Descriptions of Language. Relations between linguistics and computing are covered in the Communication in Artificial Intelligence series; two series, Advances in Applied Linguistics and Communication in Public Life, publish books in applied linguistics; and the series Modern Pragmatics in Theory and Practice publishes both social and cognitive perspectives on the making of meaning in language use. We also publish a range of introductory textbooks on topics in linguistics, semiotics and deaf studies.

Recent titles in this series
Classroom Discourse Analysis: A Functional Perspective, Frances Christie
Culturally Speaking: Managing Rapport through Talk across Cultures, Helen Spencer-Oatey (ed.)
Genre and Institutions: Social Processes in the Workplace and School, Frances Christie and J. R. Martin (eds)
Learning through Language in Early Childhood, Clare Painter
Pedagogy and the Shaping of Consciousness: Linguistic and Social Processes, Frances Christie (ed.)
Relations and Functions within and around Language, Peter H. Fries, Michael Cummings, David Lockwood and William Spruiell (eds)
Syntactic Analysis and Description: A Constructional Approach, David G. Lockwood
Words, Meaning and Vocabulary: An Introduction to Modern English Lexicology, Howard Jackson and Etienne Ze Amvela
Working with Discourse: Meaning beyond the Clause, J. R. Martin and David Rose
Empirical Linguistics
Geoffrey Sampson
continuum
LONDON • NEW YORK
Continuum
The Tower Building, 11 York Road, London, SE1 7NX
370 Lexington Avenue, New York, NY 10017-6503

First published 2001
Reprinted in paperback 2002

© Geoffrey Sampson 2001

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage or retrieval system, without permission in writing from the publishers.

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
ISBN 0-8264-4883-6 (hardback) 0-8264-5794-0 (paperback)

Library of Congress Cataloging-in-Publication Data
Sampson, Geoffrey.
Empirical linguistics / Geoffrey Sampson.
p. cm.—(Open linguistics series)
Includes bibliographical references and index.
ISBN 0-8264-4883-6 (hardback) 0-8264-5794-0 (paperback)
1. Linguistics—Methodology. I. Title. II. Series.
P126 .S242 2001
410'.1—dc21    00-031802

Typeset by Paston Prepress Ltd, Beccles, Suffolk
Printed and bound in Great Britain by Creative Print and Design (Wales)
Contents
Sources and acknowledgements vii
1 Introduction 1
2 From central embedding to empirical linguistics 13
3 Many Englishes or one English? 24
4 Depth in English grammar 37
5 Demographic correlates of complexity in British speech 57
6 The role of taxonomy 74
7 Good-Turing frequency estimation without tears 94
8 Objective evidence is all we need 122
9 What was Transformational Grammar? 141
10 Evidence against the grammatical/ungrammatical distinction 165
11 Meaning and the limits of science 180
References 209
URL list 219
Index 221
Sources and acknowledgements
Although two chapters of this book are entirely new, many chapters are based, in part or in whole, on material previously published elsewhere. My justification for collecting them into a single volume is partly that a number of the original publications are out of print or relatively inaccessible, and partly that I hope the sum is greater than the parts: the various chapters express different aspects of a single coherent and rather distinctive picture of human language and its study, and they lose some of their force while scattered in separate locations. For the present book I have edited and added new material to the reprinted items as needed to bring them up to date and make the links between chapters explicit.

Two chapters are based on papers which were co-authored with others: Chapter 3 on a paper I wrote with Robin Haigh (now of the Leeds University Computing Service), and Chapter 7 on a paper which William A. Gale (then of AT&T Bell Laboratories, New Jersey - since retired) wrote as first author with me as second author. The need for stylistic consistency throughout the present book has sometimes forced me to adjust the authorial 'we' to 'I' in these chapters, but that should not be seen as detracting in any way from the roles of my co-authors. I am very grateful to Robin Haigh and to Bill Gale for approving my proposal to include these items in the present volume.

The original publications on which various chapters are based were as follows:

Chapter 2, on chapter 2 of Jenny Thomas and M. H. Short (eds), Using Corpora for Language Research: Studies in the Honour of Geoffrey Leech (Longman, 1996); reprinted by permission of Addison Wesley Longman Ltd.
Chapter 3, on a paper in Merja Kytö, O. Ihalainen and M. Rissanen (eds), Corpus Linguistics, Hard and Soft (Amsterdam: Rodopi, 1988), pp. 207-19; reprinted by permission of Editions Rodopi B.V.
Chapter 4, on a paper in the Journal of Linguistics, vol. 33 (1997), pp. 131-51; reprinted by permission of Cambridge University Press.
Chapter 6, on a paper delivered to a joint Royal Society/British Academy Discussion Meeting on Computers, Language and Speech, in September 1999, and published in the Royal Society's Philosophical Transactions,
Series A, Vol. 358, 2000, pp. 1339-55, reprinted here by permission of the Royal Society; and on my keynote address to the Paris Treebanks Workshop, June 1999, to be published in the Proceedings of that meeting.
Chapter 7, on a paper first published as Cognitive Science Research Paper 407, University of Sussex, 1996, and reprinted in the Journal of Quantitative Linguistics, vol. 2 ([1995] 1996), pp. 217-37.
Chapter 8, on chapter 4 of my The Form of Language, Weidenfeld & Nicolson, 1975.
Chapter 9, on a review article in Lingua, vol. 48 (1979), pp. 355-78; reprinted by permission of Elsevier Science Ltd.
Chapter 10, on a paper in W. Meijs (ed.), Corpus Linguistics and Beyond (Amsterdam: Rodopi, 1987), pp. 219-26; reprinted by permission of Editions Rodopi B.V.
Chapter 11, on chapter 3 of my Making Sense (Oxford University Press, 1980).

I thank Gerald Gazdar, Stefan Gries and Max Wheeler, for supplying references to elusive publications (and I apologize to others who gave similar help but whose names now escape me). I thank the American Mathematical Society for permission to reproduce Figure 4.2, p. 40. I am grateful to Professor I. J. Good, of Virginia Polytechnic Institute and State University, for comments on a draft of Chapter 7 and to Professor J. R. Hurford, now of Edinburgh University, for comments on a draft of Chapter 9. Any shortcomings in the book are my responsibility alone.

Sussex, 29 February 2000
for Clara and Giles
1
Introduction
Language is people talking and writing. It is a concrete, tangible aspect of human behaviour. So, if we want to deepen our understanding of language, our best way forward is to apply the same empirical techniques which have deepened our understanding of other observable aspects of the universe during the four centuries since Galileo. Listen, look. Summarize what you hear and see in hypotheses which are general enough to lead to predictions about future observations. Keep on testing your hypotheses. When new observations disconfirm some of them, look for alternative explanations compatible with all the evidence so far available; and then test these explanations in their turn against further observational data. This is the empirical scientific method. Scientific knowledge can be used in many ways, some good, some bad, some perhaps neutral. But if you want knowledge rather than dogma, the empirical scientific method is an unbeatable way to achieve it, in domains where the method applies. It does not apply in every important domain. You cannot base a system of moral principles on the scientific method, because there is no way that observation can possibly 'refute' or 'confirm' a moral principle such as 'one should obey the law even if one disagrees with it'. As philosophers say, there is no way to derive an 'ought' from an 'is'. In such domains, one has to look for other ways to try to establish the validity of a set of beliefs. Typically, those other ways are less reliable than the scientific method; and so one finds that people often disagree about even basic moral questions. But hypotheses about patterns in the sounds which emerge from speakers' mouths, or in the marks they write on paper or at computer keyboards, are as testable as hypotheses about the speed of falling stones or the weight of substances before and after they are burned. There is no bar to linguistics (or much of it, at least) being an empirical science. Strange as it seems, in recent decades linguistics has not been an empirical science in practice. Linguists' 'grammars' (the term of art used for formalized, predictive descriptions of languages) have not been responsive to observations of concrete linguistic behaviour. Many members of the discipline have been persuaded by the views of the immensely influential linguist Noam Chomsky, who asserted in 1961 that 'it is absurd to attempt to
construct a grammar that describes observed linguistic behaviour directly' (Chomsky 1961: 130). 1 Chomsky's reason for making this startling statement was his awareness that linguistic behaviour is affected in practice by many considerations other than the intrinsic structure of the language being spoken. A speaker may utter incoherent wording because he gets confused about what he wants to say, for instance, or he may produce an incomplete utterance because he changes his mind in midstream. If our aim is to uncover the language structure itself, Chomsky felt, looking at concrete examples of language behaviour would make it impossible to disentangle these extraneous complicating factors: 'a direct record - an actual corpus - is almost useless as it stands, for linguistic analysis of any but the most superficial kind' (Chomsky 1961: 131). Someone familiar with analogous problems in other sciences might see this as a poor excuse for giving up on observational data. The acceleration of a material object towards the ground is controlled in part by the law of gravity, but extraneous factors (air resistance, the motion of air currents) interfere with the predictions that would follow from that law taken in isolation. If the falling object is, say, a leaf rather than a brick, the 'extraneous factors' may be almost as important as gravity. A physicist studying the law of gravity is not going to throw up his hands and ignore the data of observation, because of the interfering factors. He will put effort into disentangling the factors from one another. (If bricks reveal the workings of gravity more directly than leaves, perhaps he will focus on brick data more than on leaf data.) Unempirical linguists tend to answer this by saying that physicists are forced to use the data of observation, no matter how contaminated these are by irrelevant factors, because physicists have no alternative. Linguistic researchers have an alternative. Because languages are spoken by people, we, as speakers of languages, can consult our 'intuitions' - we can ask ourselves introspectively whether this or that sequence of words is something that we could say in our native language. The suggestion is that native-speaker intuition gives us direct access to the intrinsic properties of our language, uncontaminated by irrelevant factors. Intuitive evidence is felt to have another advantage, too. Any sample of observations of (spoken or written) language will be limited in size. It will contain evidence bearing on some of the questions a linguist wants to ask; but, if the hypothesis he aims to test makes predictions about subtle, complex constructions, the sample may easily lack any relevant evidence bearing on that hypothesis one way or the other. If our evidence is drawn from our intuitions, on the other hand, it is unlimited. Whatever assemblage of words turns out to be crucial for testing some linguistic hypothesis, we can always ask ourselves 'Could I say that, or could I not?' Hence, according to Chomsky (1962: 158), 'The empirical data that I want to explain are the native speaker's intuitions.' To practitioners of other sciences, the reply to this would seem obvious. The data of 'intuitions' may be abundant, but they are hopelessly unreliable.
In the Middle Ages, theories about the subject-matter that we now call physics were in many cases founded on intuition. For instance, the Sun, Moon and planets were held to move in circles, because the circle was obviously the only shape perfect enough to be associated with a celestial body. But, once the matter was treated as open to empirical testing, it turned out that circles were incompatible with the data of observation; the orbits of the Moon and planets are in fact ellipses (and the Sun does not move). Because linguistic scientists are the same creatures as the speakers of language, it might be that linguistics would escape the problem of intuition being misleading. Nobody nowadays would suppose that human beings possess reliable intuitive knowledge, independent of observation, about the motions of the planets; but one might imagine that we were endowed with a kind of mental hotline to the structure of our own native language, so that we could answer questions about what can and cannot be said in our language and unerringly get the answers right. This might have been so; but it soon turned out that it was not. When linguists began to found their hypotheses about language structure on intuitive rather than observational data, again and again it turned out that intuitions about crucial examples were hazy, or different linguists' intuitions were clear but mutually contradictory. Sometimes, native speakers expressed complete certainty that some particular form of words was impossible and unheard-of, yet they themselves used that very form of words, fluently and unawares. Most damaging of all, it often seemed likely that linguists' intuitive judgements about particular examples were coloured by the predictions which their theories made about those examples - in other words, the theories became self-fulfilling. One striking illustration of the contrasts that can occur between people's intuitions about their language, and what they actually say, was reported by the empirically minded sociolinguist William Labov in 1975. It related to a construction, well established in the English of white inhabitants of the 'Midland' region of the USA (though not found in other parts of the USA, in Britain, or among black speakers), involving the phrase any more (Labov writes it as one word, anymore). In most parts of the English-speaking world, any more can be used only with a negative word, e.g. Trains don't stop here any more. White American Midlanders also systematically and frequently use any more in grammatically positive contexts, to report facts that are negative in value rather than in grammar they say things like John is smoking a lot any more, meaning 'John has begun smoking a lot, regrettably'. Labov and his researchers found, repeatedly, that people whose speech includes this construction were as unaware of it, and as puzzled by it when it was drawn to their attention, as speakers of the majority dialect which lacks 'positive any more': Faced with a sentence like John is smoking a lot anymore they said they had never heard it before, did not recognize it as English, thought it might mean 'not smoking', and showed the same signs of bewilderment that we get from ... speakers
outside the dialect area. This describes the behavior of Jack Greenberg, a 58-year-old builder raised in West Philadelphia. His introspective reactions were so convincing that we felt at first that we had to accept them as valid descriptions of his grammar. Yet two weeks later, he was overheard to say to a plumber, 'Do you know what's a lousy show anymore? Johnny Carson.' (Labov 1975: 106-7, footnotes omitted)
In this case, speakers' intuitions are sharply at odds with the nature of their actual linguistic usage. (Presumably, the intuitions here are determined by conscious or unconscious awareness of other people's usage in the majority dialect.) No doubt there are many cases where linguistic intuition and linguistic reality correspond better. But the only way to check that is to describe languages on the basis of observable reality, and see how well speakers' intuitions match the description. (Only, if we have descriptions based on observation, why would we be interested in the intuitions?) In any case, by now it is clear that intuition and observation part company too frequently to place any reliance on intuition; debates between linguistic theorists have sometimes boiled down to unresolvable disagreements between individuals' different intuitions. Surveying the chaos to which Noam Chomsky's principles had reduced the discipline, Labov noted scathingly, quoting examples, that Chomsky's practice in dealing with conflicts among linguistic intuitions was to treat his own personal intuitions as facts to be accounted for by linguistic science, but to treat the intuitions of linguists who disagreed with him as mere fallible opinions (Labov 1975: 101). Intuition is no fit basis for a science of a subject concerned with tangible, observable phenomena. Science must be founded on things that are interpersonally observable, so that differences of opinion can be resolved by appeal to the neutral arbitration of objective experience. That does not imply a naive belief in a realm of pure observation statements, uncontaminated by theoretical assumptions. Every report of observation carries some theoretical baggage; but, if empirical scientists believe that observations are being distorted by incorrect assumptions, they can bring other kinds of observation to bear on the task of testing those assumptions. As the great philosopher of science Sir Karl Popper put it, science 'does not rest upon solid bedrock': scientific theories are like structures erected on piles driven into a swamp, but if any particular support seems unsatisfactory, it can always be driven deeper (Popper 1968: 111). There is no remedy for wobbly foundations, on the other hand, when theories are founded on personal intuitions. Popper was the thinker who formulated the insight (by now a generally recognized truism) that the essence of science is to be vulnerable to refutation. We cannot require scientists to put forward only claims whose truth is established beyond debate, because no weight of evidence is ever enough to rule out the possibility of new findings changing the picture. What we should require is that a scientific claim identify potential observations which, if they were ever made, would be reason to abandon the claim. Boyle's Law, which asserts that pressure times volume is constant for a given body of gas at a
given temperature, is a good piece of science, because it implies many 'potential falsifiers': expanding a container of gas and finding that the pressure on the walls of the container failed to drop proportionately would refute the law. On the other hand, Popper felt that the psychoanalyst Alfred Adler's theory of the 'inferiority complex' did not rank as science, because any human behaviour can be explained in its terms. A man who tries to drown a child has an inferiority complex leading him to prove to himself that he dares to commit a crime, but, conversely, a man who sacrifices his life trying to save the child has an inferiority complex making him want to prove to himself that he is brave enough to attempt the rescue (Popper 1963: 34-5). A theory which no observations can refute tells us nothing concrete about the world, because it rules nothing out. This does not mean that a scientific theory, to be respectable, must stand or fall by a single observation. Many areas of science make predictions which are statistical rather than absolute, so that observations are described as more probable or less probable, but no single observation refutes the theory - even very improbable events will occasionally be observed. Popper's analysis of the nature of science is elaborated to handle the case of statistical theories (Popper 1968: ch. 8). For that matter, in practice even non-statistical theories are not usually rejected totally at the first hint of counter-evidence. Popper's colleague Imre Lakatos (1970) pointed out that, in a complex world, it is normal at any given time for the most respectable theory to coexist with a range of anomalous observations: what we judge are not single self-contained theories, but developing programmes of research, in which anomalies are successively dealt with by modifying details of the theories. We expect a research programme to be 'progressive', in the sense that anomalies are addressed by theoretical developments which broaden the empirical base to which the programme is answerable. A 'degenerating' programme, on the other hand, is one which responds to counter-evidence by reinterpreting the theory so as to avoid making any prediction about the problematic data, saving the research programme from refutation at the cost of shrinking its empirical basis. We see, then, that the concept of science as a body of statements which are vulnerable to refutation by objective evidence is not a simple recipe for an unrealistically 'clean' structure of knowledge, divorced from the complexities of other areas of discourse. Empirical science has, and always will have, its share of complications and ambiguities. But, so long as science strives to found itself on interpersonally observable data, it can always move forward through critical dialogue among the community of researchers. On the other hand, conceding the authority of subjective, 'intuitive' evidence cuts off this possibility of progress. Intuition-based linguistic theorizing has lingered on, in some university linguistics departments, for a remarkable length of time after the publication of exposes such as Labov's. (Sitting in an armchair drawing data out of one's head is so comfortable an approach to academic research that perhaps we should not be too surprised at how long some practitioners stick with it.) But,
happily, an empirical approach to the investigation of language structure began to reassert itself in the 1980s, and since about 1990 has moved into the ascendant. The purpose of this book is to give the reader a taste of the aims and diverse achievements of the new empirical linguistics. Two key tools of empirical linguistics at the turn of the century are the corpus and the computer. A corpus (Latin, 'body' - plural corpora), as in the passage quoted from Chomsky earlier, simply refers to a sizeable sample of real-life usage in English or another language under study, compiled and used as a source of evidence for generating or testing hypotheses about the nature of the language. Commonly, corpora are sufficiently large, and the linguistic features relevant to a particular hypothesis are sufficiently specialized and complex, that searching manually through a corpus for relevant evidence would not be practical. Much or most work in empirical linguistics nowadays uses computers, and language corpora are machine-readable - they exist primarily as electronic files, which may or may not also be published as hard copy on paper but, in most cases, are not. The fact that modern empirical linguistics relies heavily on computers and corpora gives rise to one strategy of discourse whereby the armchair-dwellers who prefer to stick to intuitive data continue to represent their research style as the linguistic mainstream. Linguists who make crucial use of computers in their work are called 'computational linguists', and 'corpus linguistics' is regarded as one branch of computational linguistics. Quite a lot of corpus linguists, the present author included, nowadays work in departments of computer science (where empirical techniques tend to be taken for granted) rather than departments of linguistics. One hears people say things these days like 'I'm a theoretical linguist — I'm not involved in corpus linguistics, that's not my special field.' (Such a remark might sound less impressive if it were paraphrased, accurately enough, as 'I am the type of linguist who decides what languages are like on the basis of people's opinions about what they are like — I don't get involved in examining objective evidence about language, that's not my speciality.') In reality, corpus linguistics is not a special subject. To be a corpus linguist is simply to be an empirical linguist, making appropriate use of the available tools and resources which are enabling linguists at the turn of the century to discover more than their predecessors were able to discover, when empirical techniques were last in vogue. As Michael Hoey put it in a remark at a recent conference, 'corpus linguistics is not a branch of linguistics, but the route into linguistics'. 2 This book will introduce the reader to the advantages of the new, empirical style of language research, through a series of chapters which mingle case studies of discoveries about the English language with more general discussions of techniques and theoretical underpinnings. Even now, not all empirical linguistics depends on computers. For instance, Chapter 2 examines a fundamental principle about grammatical
organization, which for thirty years and more has been accepted by linguists of many different theoretical shades as constraining usage in every human language, but which falls to pieces as soon as one looks out for counterexamples. In this case, computerized searches were not needed. Just noticing examples in everyday reading was enough to correct the theory based on 'intuitions' about usage. But, even when generalizations about language are straightforward and easy to understand, the individual items of data which support or contradict them are often so numerous and complex that research progress would be impractical without computers to register, count and compare the facts. (This does not distinguish linguistics from other sciences at the present time, of course.) Chapter 3 looks at the question of where the difference lies between the simple, punchy English of fiction and the highly ramified structures of technical writing. Do the different genres have separate grammars? - are they, in effect, separate (if closely related) dialects of English? Computer analysis of the incidence of different types of construction shows that they are not. The evidence suggests that there is one English grammar. What feel like large overall differences in the 'shapes' of sentences from different genres arise as the cumulative effect of tiny statistical differences in patterns of choice among grammatical alternatives that are available in all genres. Again, Chapter 4 uses computational techniques to test a famous theory put forward forty years ago (long before computers were routinely available to linguists) about consequences of human memory limitations for language structure. The linguist Victor Yngve noticed that whenever in English a construction contains a 'heavy' constituent having a lot of internal structure of its own, this is usually the last element of the construction: the tree structures which are used to display sentence-grammar graphically branch freely to the right, but left-branching is restricted. Yngve explained this in terms of a limit to the number of items of information a speaker can hold in his memory while producing an utterance. I reexamine Yngve's ideas using modern methods and a modern data resource: Yngve is clearly correct about the existence of an asymmetry in the shape of English parse-trees, but it turns out that Yngve's description of the asymmetry is mistaken (in a way that, working with the methods available in 1960, he could not have known). What the true facts seem to be telling us about human language behaviour is different from what Yngve supposed. The empirical style of linguistics is sometimes criticized for paying disproportionate attention to written language as opposed to speech, which is unquestionably the more natural, basic mode of language behaviour. It certainly is true that empirical linguistic research using computers has to date focused chiefly on written language: the research depends on availability of sizeable language samples in machine-readable form, and it has been far easier to create electronic corpora of written than of spoken language. (Nowadays, many or most written texts are created electronically from the beginning; speech needs to be recorded and laboriously transcribed before it
can be subjected to computer analysis.) But this gap is now beginning to be filled. Chapter 5 examines some socially interesting findings derived from a new electronic corpus of spontaneous spoken English as used in Britain at the end of the millennium. (Among other things, it calls into question the well known claim by the sociologist Basil Bernstein that the middle and working classes use distinct versions of English, what he called 'elaborated' and 'restricted' codes.) Incidentally, it seems rather strange for linguists who found their grammatical analyses on their intuitions to object to empirical linguists as being unduly concerned with the written mode, because the example sentences around which the former linguists' analyses revolve are commonly sentences which (if they occurred in real life at all) would be far more likely to occur in writing than in speech. The theoretical debates fought out in the pages of journals such as Linguistic Inquiry do not usually turn on the grammatical status of short, simple utterances. Much more commonly, they depend on whether some subtle configuration of constructions, requiring many words to instantiate it, is or is not allowable in the language under analysis; and, although linguists may express the issue as 'Can one say X?', it might be more appropriate for them to ask 'Can one write X?', because writing is the only mode in which many of these complicated sentences would have any real chance of being used in practice. Furthermore, because in writing we have time to edit out the false starts and slips of the tongue which are often unavoidable in speech, studying language structure through written examples sometimes has advantages akin to studying gravitation by looking at falling bricks rather than falling leaves. Nevertheless, spoken language has characteristic structures of its own, and clearly both modes deserve attention. In Chapter 6 I turn from the findings of empirical linguistics to issues of future research strategy. In recent years, it has begun to be accepted that scientific linguistics needs to adopt empirical rather than intuitive methods, but the discipline has not yet grasped some of the implications of the empirical approach. Reliance on the data of intuition allowed researchers in the past to pick and choose the structural phenomena addressed by their linguistic descriptions, and to focus on a limited core range of constructions, often ones with special logical significance. That is not possible for versions of linguistics which are answerable to concrete, real-life data. Empirical linguistics has to deal with everything that is out there in everyday speech and writing: not just prepositional phrases, verb groups, and relative clauses, but street addresses, references to sums of money or weights and measures, swear-words (which have distinctive grammar of their own), and so forth. This means that a large taxonomic effort is needed to provide explicit, detailed ways to register and classify the bits and pieces of real-life usage, as a precondition for assembling consistent databases that will permit us to formulate and test general theories. Researchers who have been applying computer techniques to human language in the 1990s have by and large not appreciated this need for taxonomic
work. As I see it, computational linguistics has been repeating a mistake which was made by the pioneers of information technology, but which was recognized as a mistake within that wider domain some thirty years ago. I hope we can learn from the past and revise the priorities of our discipline more quickly than was possible for the early computer programmers. Chapter 7 is, among other things, a practical 'how-to' chapter. From other chapters, readers will have realized that valuable insights into real-life language use are often statistical or probabilistic in nature, rather than absolute rules. The new possibilities of quantitative analysis which computer technology has opened up are requiring linguists to develop new intellectual skills: statistical theory was not traditionally part of the education of students of linguistics. One of the commonest statistical operations needed in linguistic research is to estimate the frequency of linguistic forms of one sort or another - words, grammatical constructions, phoneme clusters, and so on. Typically, such a range of forms has a few very common cases and a lot of less common cases, with many cases being so rare that an observed sample will quite possibly include no examples. Frequency estimation in such a situation is a far subtler exercise than linguists usually realize. The 'obvious' calculations which are commonly performed give wildly misleading answers. A respectable technique in this area was invented by Alan Turing and his statistical assistant I. J. Good, in connexion with their wartime codebreaking work at Bletchley Park. In Chapter 7 I present this in a simple version, designed for use by linguists with little mathematical background. Inevitably, this material is a little more technical than most other parts of this book. Including it here is worthwhile, though, because it makes available a 'recipe-book' technique which (judging from my electronic postbag) many linguists want to use. (Estimating frequencies of infrequent items is of course only one statistical technique, though arguably a specially central one, which becomes relevant when one investigates real-life data quantitatively. For readers who would like to explore the subject further, Manning and Schütze (1999) offer an outstanding general survey of methods and applications.)
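To give readers a concrete feel for the technique before Chapter 7 sets it out properly, the short sketch below illustrates the raw, unsmoothed core of the Turing/Good adjustment. It is my own illustrative code, not the smoothed 'Simple Good-Turing' recipe presented in Chapter 7, and the function and variable names are invented for the example. For a form observed r times, the adjusted count is (r + 1) multiplied by the number of distinct forms observed r + 1 times and divided by the number observed r times; the probability mass set aside for forms never observed at all is the proportion of the sample consisting of forms seen exactly once.

    from collections import Counter

    def basic_good_turing(tokens):
        # Toy, unsmoothed Good-Turing estimate over a list of word tokens.
        counts = Counter(tokens)                  # r: how often each form was seen
        freq_of_freq = Counter(counts.values())   # N_r: number of forms seen exactly r times
        total = sum(counts.values())              # N: total tokens in the sample

        # Probability mass reserved for forms never seen in the sample: N_1 / N
        p_unseen = freq_of_freq.get(1, 0) / total

        estimates = {}
        for form, r in counts.items():
            # Adjusted count r* = (r + 1) * N_{r+1} / N_r
            r_star = (r + 1) * freq_of_freq.get(r + 1, 0) / freq_of_freq[r]
            estimates[form] = r_star / total      # estimated probability of the form
        return estimates, p_unseen

    probs, unseen_mass = basic_good_turing(
        'the cat sat on the mat and the dog sat on the log'.split())
    print(round(unseen_mass, 3))    # mass for unseen forms: 0.357
    print(round(probs['cat'], 3))   # a form seen once: 0.057
    print(round(probs['sat'], 3))   # a form seen twice, but no form seen three times: 0.0

Even this toy version exposes the practical snag: whenever no form happens to have been observed exactly r + 1 times, the raw adjustment collapses to zero, which is why the version presented in Chapter 7 smooths the frequencies of frequencies before applying the formula.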
Many readers may find it surprising that a call for use of empirical evidence in place of intuition should be at all controversial. In other areas of the map of learning, those battles were fought and won hundreds of years ago. But the fact that many linguists in the recent past have avoided empirical scientific method does not mean that they were simply perverse. They had reasons for believing that linguistics, unlike other sciences, could not be founded on observational evidence (and that intuition was a satisfactory substitute). Their reasons were misguided, but the arguments cannot just be dismissed out of hand. Chapter 8 looks at the considerations which led many linguists to suppose that observations of people's speech and writing were an unsatisfactory foundation for linguistic description and theorizing: I show how this stemmed from a fundamental misunderstanding of how science in general works. However, the wrong turning which linguistics took in the closing decades of the century was not a result exclusively of intellectual errors about the nature of science. Other important factors had to do with more down-to-earth aspects of the way the subject happened to develop historically. One significant point was the curious way in which a particular book, The Logical Structure of Linguistic Theory, was for twenty years treated as a landmark in the evolution of the discipline, while remaining unpublished and not available for critical assessment. Chapter 9 examines the contents of this book, and argues that modern linguistics might have developed along very different lines if the book had been in the public domain during the decades when it was influencing the subject. Chapter 10 considers one of the fundamental assumptions about human language which has pervaded the linguistics of recent decades: the idea that a language is defined by a specific set of rules, in terms of which any particular sequence of words either is or is not a grammatical sentence. Applying quantitative techniques to corpus data, I argue that the English language does not seem to be defined by a finite set of grammatical rules. It is more like the natural 'fractal' objects described by Benoit Mandelbrot (1982), such as coastlines, which go on revealing more and more detail as one examines them more closely. If this is right, I argue, then it may be impossible to draw a distinction between 'grammatical sentences' and 'ungrammatical' sequences of words: the quantity of evidence we would need, in order to plot the boundary between these two classes, would dwarf the capacities of any computing equipment available now or in the future. The idea that a language imposes a two-way classification on the set of possible strings of words from its vocabulary is a rather new one. Historically, grammarians have usually discussed the properties of various things that can be said or written in a language, without feeling a need to contrast them with 'starred sequences' that no speaker of the language would think of using. That attitude, I suggest, may be the most appropriate one. The issues examined in Chapter 10 imply limits to the enterprise of scientific linguistic description. But there are deeper limits: some aspects of human language cannot be the subject of scientific description because they are not the sorts of things that the empirical scientific method can deal with. Falsifiable scientific theories about observables do not form the totality of meaningful human discourse. Some domains which relate to human beings as imaginative and moral agents are outside the purview of science. Language is a phenomenon that straddles the worlds of humanities and science, linking meanings generated by minds to physical sounds and marks created by tongues and hands. Because intuition-based linguistics has had little interest in scientific method, it has tended to assume that all aspects of language are equally accessible to scientific investigation; but in my final chapter, Chapter 11, I argue that this is a mistake. Word meanings are a topic falling on the humanities side of the arts/science divide. Linguists' theories about
how to define the meanings of words have failed not because the theories are poorly formulated, but because the task is impossible. If we take empirical science seriously, we have to take seriously the boundaries to its sphere of application. A note about terminology. This book is intended to illustrate the nature and strengths of the empirical style of linguistics which has come to the fore over the past decade. I shall need a convenient term to refer to the very different linguistic tradition which predominated from the 1960s to the 1980s, and even now is very much alive. To refer to it negatively as 'unempirical linguistics' clearly would not do. I shall use the phrase generative linguistics. This term has been widely used, and I believe it is broadly acceptable to most or all members of the tradition in question. It identifies that tradition through one of its positive features: the goal of specifying the structures of languages via formal systems which 'generate' all and only the valid examples of a language, as an algebraic equation defines a circle by 'generating' all and only the points which comprise it. I shall argue that this goal does not survive scrutiny, but it is an admirable ideal. There are two aspects to the generative tradition in linguistics. One is concerned with the rules used to define the grammatical and other properties of human languages, and the picture of language structure which emerges from those systems of rules. The other aspect, powerfully advocated in recent years by Steven Pinker (e.g. Pinker 1994), relates to the psychological implications of language structure, and in particular to the idea that knowledge about language and its structure, and other kinds of knowledge, are innate in the human mind rather than learned by children through interaction with their environment or through instruction by their elders. This latter point of view, that linguistics gives grounds for belief in a nativist theory of human cognition, is one I have dealt with at length in a previous book, Educating Eve (Sampson 1999a). Generative linguists seem to me entirely mistaken in thinking that the findings of linguistics offer support to nativist psychology. I have little doubt that almost all features of one's mother tongue are learned from experience, not inborn. Educating Eve exhaustively scrutinizes the various arguments used by Pinker and other linguistic nativists, and shows that in each case the argumentation is founded on false premisses, or is logically fallacious (or both). The picture of human learning which I take to be correct is essentially the one described by Nelson Goodman when he portrayed the human mind (N. Goodman 1965: 87) as 'in motion from the start, striking out with spontaneous predictions in dozens of directions, and gradually rectifying and channeling its predictive processes' in response to the data of experience. Nothing said by Pinker or other generative linguists gives us any serious reason to doubt this view. Having said as much (I believe) as needs to be said on that topic in Educating Eve, I do not return to it here. The present book is about the first of the two aspects of the generative tradition identified above. This book is about the nature and structure of human language itself, as it appears when
investigated in an empirical spirit; it is not about the psychological mechanisms underlying that structure. The intellectual errors which led many linguists to forsake accountability to empirical evidence in the 1960s and 1970s remained influential for a surprisingly long period. Changing relationships between governments, society, and higher education in the closing decades of the century made it a period when many academics were inclined to hunker down for safety in their established positions, and intellectual debate and development became less lively than they have sometimes been. But in linguistics the tide began to turn about ten years ago. As a new millennium dawns, we can surely hope to see the discipline as a whole breaking free from the spell of intuition, and rejoining the mainstream of empirical scientific progress.

Notes

1 For references in the form 'Smith 1999', see the References beginning on p. 208. For those in the form 'URL n', see the URL list on p. 217.
2 Quoted in a posting by Tony Berber Sardinha on the electronic Corpora List, 10 December 1998.
3 I owe the analogy between natural languages and fractal objects to Richard Sharman, Director of the SRI Computer Science Research Centre, Cambridge.
4 Even a leading member of the generative school seems recently to have accepted that this aspect of their intellectual tradition was an error: see Culicover (1999: 137-8).
2
From central embedding to empirical linguistics
1 An untested dogma
Now that the empirical approach to linguistic analysis has reasserted itself, it is not easy to recall how idiosyncratic the idea seemed, twenty years ago, that a good way to discover how the English language works was to look at real-life examples. As a young academic in the 1970s I went along with the then-standard view that users of a language know what is grammatical and what is not, so that language description can and should be based on native-speaker intuition. It was the structural phenomenon of 'central embedding', as it happens, which eventually showed me how crucial it is to make linguistic theories answerable to objective evidence. Central embedding (which I shall define in a moment) was a topic that became significant in the context of linguists' discussions of universal constraints on grammar and innate processing mechanisms. In this chapter I describe how central embedding converted me into an empirical linguist. Central embedding refers to grammatical structures in which a constituent occurs medially within a larger instance of the same kind of tagma (phrase or clause unit); an invented example is [The book [the man left] is on the table], where a relative clause occurs medially within a main clause, as indicated by the square brackets. (By 'medially' I mean that the outer tagma includes material both before and after the inner tagma.) A single level of central embedding like this is normal enough, but linguists agreed that multiple central embedding — cases where X occurs medially within X which occurs medially within X, for two or more levels - is in some sense not a natural linguistic phenomenon. Theorists differed about the precise nature of the structural configuration they regarded as unnatural. De Roeck et al. (1982) distinguished four variant hypotheses about the unnaturalness of multiple central embedding. For Variant 1, the unnatural structures are any trees in which a node has a daughter node which is not the first or last daughter and which is nonterminal, and where that node in turn has a nonterminal medial daughter, irrespective of the labels of the nodes; that is, the unnaturalness depends purely on the shape of the tree rather than on the identity of the higher and
lower categories. De Roeck et al. showed that several writers advocated the very strong hypothesis that multiple central embedding in this general sense is unnatural. For other linguists, the tree structure had to meet additional conditions before it was seen as unnatural. Variant 2 is a weaker hypothesis which rules out only cases where the logical category is the same, for example clause within clause within clause, or noun phrase within noun phrase within noun phrase. Variant 3 is a weaker hypothesis still, which treats structures as unnatural when the concentric logical categories not only are the same but occur within one another by virtue of the same surface grammatical construction, for example relative clause within relative clause within (main) clause; and Variant 4 weakens Variant 3 further by specifying that the structure is unnatural only when the hierarchy of tagmas incorporated into one another by the same construction are not interrupted by an instance of the same category introduced by a different construction (e.g. relative clause within nominal clause within relative clause within (main) clause would violate Variant 3 but not Variant 4).
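Because Variant 1 is a claim about nothing but tree shape, it can be stated very compactly in procedural terms. The sketch below is purely my own illustration (it is not taken from De Roeck et al.): it represents a parse tree as a nested list of words, ignores node labels entirely, and asks whether any tagma has a nonterminal medial daughter which in turn has a nonterminal medial daughter of its own - the configuration Variant 1 treats as unnatural.

    def medial_nonterminal_daughters(node):
        # Daughters which are neither first nor last and are themselves tagmas
        # (represented here as nested lists) rather than single words.
        if not isinstance(node, list):
            return []
        return [d for d in node[1:-1] if isinstance(d, list)]

    def violates_variant_1(node):
        # True if some node has a medial nonterminal daughter which itself
        # has a medial nonterminal daughter, anywhere in the tree.
        if not isinstance(node, list):
            return False
        for daughter in medial_nonterminal_daughters(node):
            if medial_nonterminal_daughters(daughter):
                return True
        return any(violates_variant_1(d) for d in node if isinstance(d, list))

    # The invented example from above: a single level of central embedding.
    one_level = ['The', 'book', ['the', 'man', 'left'], 'is', 'on', 'the', 'table']
    # A doubly centre-embedded variant of the same example (my own invention).
    two_level = ['The', 'book',
                 ['the', 'man', ['my', 'sister', 'met'], 'left'],
                 'is', 'on', 'the', 'table']
    print(violates_variant_1(one_level))   # False
    print(violates_variant_1(two_level))   # True

Variants 2 to 4 cannot be checked from bare tree shape in this way: they additionally require the categories of the nested tagmas and the constructions which introduce them, which is precisely what makes them successively weaker claims.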
These variant concepts notwithstanding, there was general agreement that multiple central embedding in some sense of the concept does not happen in human languages. Theorists debated why that should be. For generative grammarians, who laid weight on the idea that grammatical rules are recursive, there was a difficulty in accounting for rules which could apparently apply once but could not reapply to their own outputs: they solved the problem by arguing that multiple central embeddings are perfectly grammatical in themselves, but are rendered 'unacceptable' by falling foul of psychological language-processing mechanisms which are independent of the rules of grammar but which, together with the latter, jointly determine what utterances people can produce and understand (Miller and Chomsky 1963: 471). 1 The relational-network theorist Peter Reich urged that this did not adequately explain the fact that the limitation to a single level of central embedding is as clearcut and rigid a rule as languages possess: 'The first thing to note about [multiple central embeddings] is that they don't exist. ... the number of attested examples of M[ultiple] C[entral] E[mbedding]s in English can be counted on the thumbs of one hand' (Reich and Dell 1977). (The thumb was needed because Reich and Dell were aware of a single reported instance of multiple central embedding during the many years that linguists had been interested in the topic — but then, what linguistic rules are so rigid as never to be broken even on just one occasion?) Reich argued that generative grammar ought to make way for a finite-state theory of language within which the permissibility of one and only one level of central embedding is inherent in the model (Reich 1969). Even William Labov, normally an energetic champion of empirical methods when that was a deeply unfashionable position to take, felt that empirical observation of naturally produced language was irrelevant to the multiple central embedding issue. Labov believed that multiple central embeddings are grammatical in every sense, but he saw them as a paradigm case of constructions that are so specific and so complex that instances would be vanishingly rare for purely statistical reasons: 'no such sentences had ever been observed in actual use; all we have are our intuitive reactions that they seem grammatical ... We cannot wait for such embedded sentences to be uttered' (Labov 1973b: 101). Thus, linguists of widely diverse theoretical persuasions all agreed: if you wanted to understand the status of multiple central embedding in human language, one thing that was not worth doing was looking out for examples. You would not find any.

2 The dogma exploded

Doubt about this was first sown in my mind during a sabbatical I spent in Switzerland in 1980-1. Giving a seminar to the research group I was working with, I included a discussion of multiple central embedding, during which I retailed what I took to be the standard, uncontroversial position that speakers and writers do not produce multiple central embeddings and, if they did, hearers or readers could not easily interpret them. In the question period Anne De Roeck asked 'But don't you find that sentences that people you know produce are easier to understand?' Well, perhaps, I responded, but this did not refute the theory because ... - and I got quite a long way through my answer before the expression on Anne's face alerted me to the fact that the point of her question had been its grammar rather than its semantics. (The structure of the question, with finite subordinate clauses delimited by square brackets, is But don't you find [that sentences [that people [you know] produce] are easier to understand]?) So evidently, if multiple central embeddings were indeed 'unacceptable', this did not mean that if produced they will necessarily draw attention to themselves by being impenetrable to the hearer. Perhaps, then, it was worth checking whether they were so completely lacking from natural language production as the doctrine alleged. I began to monitor my reading, and quite soon encountered a series of examples which our research group assembled into the De Roeck et al. (1982) paper already cited. Many of the examples violated even the weakest Variant 4 of the orthodoxy (all violated at least Variant 2); some of them involved more than two layers of central embedding. A sceptic might have felt that there was something not fully convincing about that initial collection. More than half of them came from a single book - it is dangerous to rest conclusions about language in general on the linguistic behaviour of one individual, perhaps idiosyncratic, writer - and most of the remainder were taken from a very serious German-language newspaper, the Neue Zürcher Zeitung: German is a language with rigid and unusually complex word-order rules, and when used in highly formal written registers is arguably a more artificial category of linguistic behaviour than most. But, on returning from Switzerland, I went on looking for multiple central embeddings, and I began to find examples in very diverse linguistic contexts.
For example, one could hardly find a newspaper more different in style from the Neue Zürcher Zeitung than the British News of the World: this is a mass-market Sunday paper beloved for its titillating exposures of the seamier side of life, and those responsible for its contents would, I believe, feel that they were failing if they allowed a highbrow or intellectual flavour to creep into its pages. But the LOB Corpus (see Chapter 3) contains an extract from a story 'Let's give the Welfare State a shot in the arm', by Kenneth Barrett, which appeared in the 5 February 1961 edition of that newspaper, and which includes the following sentence:

[And yet a widow, [whose pension, [for which her husband paid], is wiped out because she works for a living wage], will now have to pay 12s. 6d. for each lens in her spectacles, and 17s. 8d. for the frames].
This is a case of wh- relative clause within wh- relative clause within main clause, violating Variant 4. Even if popular writing for adults contains these constructions, it may be thought that writing for children will not. At this time my own children were aged 7 and 5, and their favourite books by a large margin for paternal bedtime reading were the series of boating adventure stories by Arthur Ransome. The following sentence occurs in Ransome's Swallowdale (Jonathan Cape, 1931, pp. 113-14):

[But Captain Flint laid to his oars and set so fast a stroke that John, [who, [whatever else he did], was not going to let himself get out of time], had enough to do without worrying about what was still to come].
Clause within clause within clause: violates Variant 2. (Indeed, the clause beginning whatever is quite similar in structure to a relative clause; if the two kinds of clause were regarded as varieties of a single construction, the sentence would violate Variant 4. But probably almost any grammarian would treat the two clause-types as separate.) For what it is worth, my daughters showed no observable sign of difficulty in understanding this sentence, though similar experiences in the past resigned them to their father's temporary unwillingness to proceed with the story while he scrutinized it. Still, while published writing addressed to unsophisticated readers apparently does contain multiple central embeddings, one could nevertheless argue that they are not likely to be found in writing produced by people unskilled with language. But the following sentence occurred in an essay-assignment written in February 1983 by S. S., a first-year Lancaster University undergraduate student of (at best) moderate ability:

All in all it would seem [that, [although it can not really be proved [that the language influences the script in the beginning at its invention], simply because we seldom have any information about this time in a scripts history], the spoken language does effect the ready formed script and adapts it to suit its needs].
Subordinate clause within subordinate clause within subordinate clause: violates Variant 2. (Here I assume that when a nominal clause is introduced by that, this word is part of the clause it introduces; this is the consensus view among linguists, but even if it were rejected, the example would still be a case of clause within clause within clause - the outermost clause would then be the main clause, that is the first opening square bracket would be repositioned at the beginning of the quotation.) The various solecisms in the passage (can not for cannot, scripts for script's, effect for affect, ready for already) were characteristic of the student's writing.

Conversely, a true believer in the unnaturalness of multiple central embeddings might suggest that while laymen may sometimes produce them, professional linguists, who might be expected to be unusually sensitive to grammatically objectionable structures, would avoid them. (For the idea that professional linguists may be in some sense more competent in their mother tongue than native speakers who are not linguists, see, e.g., Snow and Meijer 1977.) However, the eminent linguist E. G. Pulleyblank wrote in a book review in the Journal of Chinese Linguistics (10 [1982]: 410):

[The only thing [that the words [that can lose -d] have in common] is, apparently, that they are all quite common words].
That relative clause within that relative clause within main clause: violates Variant 4.

However, a defender of the orthodox line might suppose that, while the pressures of journalism, academic publication, and the like allow a certain number of 'unnatural' constructions to slip through, at least when a short text is composed and inscribed with special care there should be no multiple central embeddings. What would be the clearest possible test of this? A ceremonial inscription ornamentally incised on marble seems hard to beat. Visiting Pisa for the 1983 Inaugural Meeting of the European Chapter of the Association for Computational Linguistics, I noticed a tablet fixed to one wall of the remains of the Roman baths. Apart from the conventional heading 'D O M' (Deo Optimo Maximo), and a statement at the foot of the names of the six persons referred to, the inscription on the tablet consists wholly of the following single sentence, in a language where word order is much freer than in German or even English, so that undesired structural configurations could easily be avoided. I quote the inscription using upper and lower case to stand for the large and small capitals of the original, and replace its archaic Roman numerals with modern equivalents:

[{Sexuiri, [qui {Parthenonem, [ubi {parentibus orbae uirgines} aluntur, et educantur], [{qui} {uulgo} {charitatis Domus} appelatur]}, moderantur, eiusque rem administrant]}, quum ad suum ius ditionemque pertineat hic locus, in quo Sudatorium Thermarum Pisanarum tot Seculis, tot casibus mansit inuictum, et officii sui minime negligentes, et Magni Ducis iussis obtemperantes, et antiquitatis reuerentia moti reliquias tam uetusti, tam insignis aedificii omni ope, et cura tuendas, et conseruandas censuerunt An: Sal: MDCXCIII].
[Since this place, where the sudatorium of the Pisan Baths has remained unconquered by so many centuries and so many happenings, comes under their jurisdiction, {the six men [who govern and administer {the Parthenon, [where {orphaned girls} are brought up and educated], [{which} is known by {the common people} as {the House of Charity}]}]}, being diligent in the performance of their duty, obedient to the commands of the Grand Duke, and moved by reverence for antiquity, ordered every effort to be used carefully to protect and to conserve the remains of this building of such age and distinction, in the Year of Grace 1693].
Here, curly brackets delimit noun phrases, and square brackets delimit clauses. Thus we have four noun phrases within noun phrases within a noun phrase, a quadruple violation of either Variant 2 or Variant 4 (depending on the definition of identity of grammatical constructions); at the same time, with respect to clauses, we have two relative clauses within relative clause within main clause, a double violation of Variant 4. (The fact that one of the innermost relative clauses, together with the relative clause containing it, are both compound might be thought to make the constructions even more similar and accordingly more 'unnatural' than they would otherwise be.) The central embeddings occur at the beginning of the original text, making it wholly implausible that they were produced through careless oversight even if such carelessness were likely in inscriptions of this sort.

At this period I did not record any examples with more than two levels of central embedding, so that a defender of the orthodox view might conceivably try to rescue it by arguing that the boundary between permissible and impermissible degrees of central embedding lies not between one and two levels but between two and three levels. This would be a gross weakening of the standard claim, which asserts not only that there is a fixed boundary but that it occurs between one and two levels (cf. Reich (1969), and his quotation from Marks (1968)). If such a strategy were adopted, then at least the first and probably also the second of the following new examples would be relevant:

[Laughland's assertion that [the presence of [Delors - [14 years] old when [the war] began - ] in the Compagnons de France, the Vichy youth movement,] meant that he supported fascism] is ridiculous. (Charles Grant, letter to the Editor, The Spectator, 12 November 1994, p. 35)
The phrases 14 years and the war are both cases of noun phrase within noun phrase within noun phrase within noun phrase, a double violation of the two-level limit. (Incidentally, two paragraphs later the same letter contains a case of clause within clause within clause.)

[Your report today [that any Tory constituency party [failing [to deselect its MP], should he not vote in accordance with a prime ministerial diktat,] might itself be disbanded], shows with certainty that Lord Hailsham's prediction of an 'elective dictatorship' is now with us]. (Vice-Admiral Sir Louis Le Bailly, letter to the Editor, The Times, 25 November 1994, p. 21)
Infinitival clause within present-participle clause within that nominal clause within main clause: again a violation of the two-level limit, except that if the should clause were alternatively regarded as subordinate to deselect rather than to failing, then the structure would violate only the one-level limit.

All in all, it seemed clear that no matter what kind of language one looks at, multiple central embeddings do occur. The above examples include no case from speech; that is regrettable but not surprising, first because spoken language is structurally so much less ramified than writing that any kind of multiple embedding, central or not, is less frequent there, and equally importantly because it is difficult to monitor such cases in speech (when I happen on a written multiple central embedding, I always have to reread it slowly and carefully to check that it is indeed one, but with speech this is not possible). Nevertheless, De Roeck et al. (1982) did record one case from spoken English, which happened to have been transcribed into print because it was uttered by a prime minister in the House of Commons; and the example requiring the single thumb in Reich and Dell (1977) occurred in extempore spoken English. Here is a third case I recently encountered, from the late Richard Feynman's Nobel prize address given in Stockholm on 11 December 1965 (quoted from Gleick 1994: 382):

[The odds [that your theory will be in fact right, and that the general thing [that everybody's working on] will be wrong,] is low].
That relative clause within that nominal clause within main clause: violates Variant 2. Though this speech was obviously not a case of extempore chat, the quotation does contain several features characteristic of spoken rather than written language (everybody's for everybody is, colloquial use of thing, failure of agreement between odds and is). In any case, Reich and Dell's footnote 2 makes it clear that their belief in the unnaturalness of multiple central embedding applies to writing as well as to speech.

Incidentally, the difficulty of identifying multiple central embeddings on first reading offers a further argument against the claim that they are 'unnatural'. During fluent reading for normal purposes I register no reaction more specific than 'clumsy structure here', and passages which provoke this reaction often turn out to include only structures that linguists do not normally claim to be 'unnatural' or ungrammatical, for example non-central embeddings. If the orthodox view of multiple central embedding were correct, one would surely predict that these structures should 'feel' much more different from other structures than they do.

3 Systematic data-collection

The examples listed earlier were not the only cases of multiple central embedding I encountered in the months after I returned from Switzerland; they are ones I copied down because they seemed particularly noteworthy for one reason or another. More recently I tried to achieve a very rough
estimate of how frequent these structures are, by systematically taking a note of each case I encountered over a period. This project was triggered by the experience of spotting two cases in quick succession; but the following list includes only the second of these, which I read on 4 October 1993, because having decided to make a collection I did not manage to locate the earlier case in the pile of newsprint waiting to go to the dustbin. Thus the list represents the multiple central embeddings noticed during a random period starting with one such observation and continuing for a calendar month (I made the decision to stop collecting on 4 November 1993 - both start and stop decisions were made in the middle of the day rather than on rising or retiring, though there may have been a few hours' overlap). In view of my failure to register Anne De Roeck's trick question, discussed earlier, there could have been further cases in my reading during this month which escaped my attention.

A greater issue of principle there could not be than the transfer of self-government away from the British electorate to the European Community; [but, [though Tony Wedgwood Benn thought that '[if only Harold would look and sound a bit more convincing (on that subject)], we might have a good chance'], Wilson not only did not do so but his tactics on taking office steered his party, his government, Parliament and the electorate into a referendum of which the result is only now in course of being reversed]. (J. Enoch Powell, review of P. Ziegler's Wilson, p. 35 of The Times of 4 October 1993, read on that day)2
Adverbial clause within adverbial clause within co-ordinate main clause: violates Variant 4.

Harris and I would go down in the morning, and take the boat up to Chertsey, [and George, [who would not be able to get away from the City till the afternoon (George goes to sleep at a bank from ten to four each day, except Saturdays, when they wake him up [and put him outside] at two)], would meet us there]. (Jerome K. Jerome, Three Men in a Boat, 1889, p. 17 of Penguin edition, 1957, read 7 October 1993)
Reduced relative clause within relative clause within co-ordinate main clause (the relative clause beginning when they wake ... would make a fourth level, but this clause is right-embedded in the who clause): violates at least Variant 2 and perhaps Variant 3, depending on the definition of similarity between grammatical constructions. [When the pain, [which nobody [who has not experienced it] can imagine], finally arrives], they can be taken aback by its severity. (Leader, p. 17 of The Times of 16 October 1993, read on that day)
Wh- relative clause within wh- relative clause within adverbial clause: violates Variant 4.

[That the perimeters of [what men can wear and [what they cannot], what is acceptable and what is not,] have become so narrow] goes to show how intolerant our society has become.
(Iain R. Webb, 'Begging to differ', p. 70 of the Times Magazine of 9 October 1993, read 21 October 1993)
Reduced antecedentless relative clause within compound antecedentless relative clause within nominal clause: violates at least Variant 2 and perhaps Variant 4, depending on the definition of similarity between grammatical constructions.

[For the remainder of his long and industrious life (apart from during the second world war [when he worked in the Ministry of Information - [where he was banished to Belfast for being 'lazy and unenthusiastic'] - and the Auxiliary Fire Service]) Quennell made his living as an author, a biographer, an essayist, a book-reviewer, and as an editor of literary and historical journals]. (Obituary of Sir Peter Quennell, The Times of 29 October 1993, read on that day)

Adverbial relative clause within adverbial relative clause within main clause: violates Variant 4.

[In the 18th century, [when, [as Linda Colley shows in her book Britons], the British national identity was forged in war and conflict with France], our kings were Germans]. (Timothy Garton Ash, 'Time for fraternisation', p. 9 of The Spectator of 30 October 1993, read 29 October 1993)
As clause within adverbial relative clause within main clause: violates Variant 2.

[The cases of Dr Starkie, the pathologist whose procedures in the diagnosis of bone cancer is now being questioned, and Dr Ashok Kumar, [whose nurse, having been taught by him, used the wrong spatula, [which must have been provided by the practice], to obtain cells for cervical smears], are very different]. (Thomas Stuttaford, 'Patients before colleagues', The Times of 10 September 1993, read 31 October 1993; the agreement failure (procedures . . . is) occurs in the source)
Wh- relative clause within wh- relative clause within main clause: violates Variant 4 (and the having been clause constitutes a separate violation of Variant 2). To assess what rate of occurrence of multiple central embeddings these examples imply requires an estimate of my overall rate of reading, which is very difficult to achieve with any accuracy. On the basis of word-counts of typical publications read, I believe my average daily intake at this period was perhaps 50,000 and certainly not more than 100,000 written words, so that the seven multiple central embeddings quoted above would imply a frequency of perhaps one in a quarter-million words (more, if we suppose that I missed some), and at least one in a half-million words. Some time soon, it should be possible for language-analysis software automatically to locate each instance of a specified construction in a machine readable corpus, and we shall be able to give relatively exact figures on the frequency of the
construction. For a construction as complex as multiple central embedding we are not yet at that point; but on the basis of these figures there is no reason to suppose that the single example quoted earlier from the LOB Corpus is the only example contained in it.
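Although a fully automatic search of that kind is not yet practicable for constructions as complex as this, its logic is easy to sketch. The Python fragment below is an illustration of the idea only, not a tool that has been applied to any corpus: the nested (label, children) tree format, the set of clause labels and the example tree (a simplified rendering of Anne De Roeck's question from section 2) are all assumptions made purely for the purposes of the sketch.

```python
# Sketch: counting levels of central embedding in a parse-tree. A clause is
# 'centrally embedded' if its nearest enclosing clause has words both before
# and after it. Tree format, labels and the example are illustrative assumptions.

CLAUSE_LABELS = {'S', 'Fr', 'Fn', 'Fa'}    # main, relative, nominal, adverbial clause

def annotate_spans(node, start=0):
    """Convert (label, children) into (label, (lo, hi), children) with word offsets."""
    if isinstance(node, str):                       # a terminal node: one word
        return (node, (start, start + 1), []), start + 1
    label, children = node
    new_children, pos = [], start
    for child in children:
        sub, pos = annotate_spans(child, pos)
        new_children.append(sub)
    return (label, (start, pos), new_children), pos

def central_depth(node, enclosing=None, depth=0):
    """Greatest number of nested centrally-embedded clauses at or below `node`."""
    label, (lo, hi), children = node
    if children and label in CLAUSE_LABELS:
        if enclosing is not None and enclosing[0] < lo and hi < enclosing[1]:
            depth += 1                              # outer-clause words on both sides
        enclosing = (lo, hi)
    return max([depth] + [central_depth(c, enclosing, depth) for c in children])

# 'Sentences [that people [you know] produce] are easier to understand'
tree = ('S', [('N', ['sentences',
                     ('Fr', ['that',
                             ('N', ['people', ('Fr', ['you', 'know'])]),
                             'produce'])]),
              ('V', ['are']),
              ('J', ['easier', 'to', 'understand'])])

spans, _ = annotate_spans(tree)
print(central_depth(spans))    # 2: a double central embedding
```

Run over every parse-tree of a genuine treebank, a span-based test of this kind is all that would be needed to replace the hand-counted estimates above with exact frequency figures.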
4 Implications for research method
The conclusion is unavoidable. Multiple central embedding is a phenomenon which the discipline of linguistics was united in describing as absent from the real-life use of language; theorists differed only in the explanations they gave for this interesting absence. Yet, if one checks, it is not absent.

I do not go so far as to deny that there is any tendency to avoid multiple central embedding; I am not sure whether there is such a tendency or not. Independently of the issue of central embedding, we have known since Yngve (1960) that the English language has a strong propensity to exploit right-branching and to avoid left-branching grammatical structures - this propensity alone (which is examined in detail in Chapter 4 later) would to some extent reduce the incidence of central embedding. Whether, as for instance Sir John Lyons continues to believe (Lyons 1991: 116), multiple central embeddings are significantly less frequent than they would be as a by-product of the more general preference of the language for right branching is a question whose answer seems to me far from obvious. But it is clearly a question that must be answered empirically, not by consulting speakers' 'intuitions'.

Hence, as I picked up the threads of my working life at home after the Swiss sabbatical, I knew that for me it was time for a change of intellectual direction. If intuitions shared by the leaders of the discipline could get the facts of language as wrong as this, it was imperative to find some way of engaging with the concrete empirical realities of language, without getting so bogged down in innumerable details that no analytical conclusions could ever be drawn. Happily, easy access to computers and computerized language corpora had arrived just in time to solve this problem. I seized the new opportunities with enthusiasm.

Naturally, the discipline as a whole was not converted overnight. As late as 1988, reviewing a book edited by Roger Garside, Geoffrey Leech, and me about research based on corpus data, Michael Lesk (nowadays Director, Information and Intelligent Systems, at the US National Science Foundation) found himself asking (Lesk 1988):

Why is it so remarkable to have a book whose analysis of language is entirely based on actual writing? ... It is a great relief to read a book like this, which is based on real texts rather than upon the imaginary language, sharing a few word forms with English, that is studied at MIT and some other research institutes ... a testimony to the superiority of experience over fantasy.
However, one by one, other linguists came to see the virtues of the empirical approach. What Michael Lesk found 'remarkable' in 1988 has, a decade later, become the usual thing. And this is as it should be.

Notes

1 Similar remarks have been made more recently by Church 1982: 24 n. 32 and Stabler 1994: 315-16.

2 The brackets surrounding on that subject were square in the original, and are replaced by round brackets here to avoid confusion with the square brackets of my grammatical annotation.
3
Many Englishes or one English?
1 Why are long sentences longer than short ones?
In this chapter, we turn to a question about English structure that could scarcely be investigated adequately without using computers. The question in essence is this: should genre differences in English be understood by seeing 'English' as a family of similar but distinct languages or dialects, with separate grammars accounting for the characteristic structural differences between genres, or is there one English grammar underlying diverse genres?

There is no question that vocabulary shows many characteristic differences between different genres of prose. The words anode and spoilsport, to take two cases at random, are obviously at home in very different genres. That is not to say that anode is rigidly restricted to technical English, but surely a far higher proportion of all its occurrences must belong to that genre rather than to imaginative literature, say (and vice versa for spoilsport). That much is uncontroversial. But what about the underlying structural framework of language? Does syntax differ characteristically between linguistic genres?

To make the question more concrete, consider one specific and very obvious structural difference between genres: average sentence length. Some kinds of prose typically use sentences that are strikingly longer than those in other kinds of prose. In particular, technical writing tends to use longer sentences than informal writing, such as fiction. But what specific factors create this difference in sentence lengths?

One way of thinking about this is to imagine ourselves faced with the task of predicting what style of writing a sentence belongs to, in a situation where our only source of information about the sentence is a small window onto part of its parse-tree - so that we can inspect a few of the individual productions that have been applied in deriving the sentence from a root node, but we cannot see how long the sentence as a whole is, and of course we cannot look at the vocabulary of the sentence and infer from that what genre of prose it belongs to. The word 'production' here means a minimal subtree - a pairing of a grammatical category labelling a mother node with a sequence of grammatical categories labelling its daughter nodes. For instance, the topmost production involved in the structure of the sentence The mouse ran up
the clock, in the style of grammatical analysis assumed here, would be 'main clause realized as noun phrase followed by verb group followed by prepositional phrase' - in our notation, S → N V P. (Where possible, this book will spare the reader from engaging with technical grammatical symbols in favour of describing the facts in words; but sometimes it will be necessary to mention symbols. Readers versed in generative linguistics will notice some characteristic differences of symbol usage, relative to that tradition, in the empirical tradition of grammatical analysis from which this book emerges. Our tradition does not recognize a category of 'verb phrase' including both verbs and their objects or complements - a subject-verb-object clause, for instance, is treated as a clause with three daughter nodes; and we symbolize phrase and clause categories with single capital letters, e.g. 'N' rather than 'NP' for noun phrase. For generative linguists, the highest production in the parse-tree for The mouse ran up the clock would be 'S → NP VP'.)

The length of a sentence is determined by the various productions that occur in the parse-tree for the sentence. So, if technical sentences are characteristically longer than fiction sentences, it must follow that there is some distinctive difference or differences between the kinds of individual production that occur in technical sentences and those that occur in fiction sentences. Maybe the differences are statistical rather than absolute, but differences of some kind must exist: even if they are only statistical, they still should allow us, given a few productions from a parse-tree, to make a prediction that, with probability so-and-so, this sentence is taken from technical writing rather than fiction, or vice versa.1

In fact people often seem to make informal comments in linguistic discussion which imply that there are quite sharp differences between the kinds of grammatical construction found in different types of prose. Phrases like 'the grammar of scientific English', or 'the grammar of literary language', recur frequently in the literature of linguistics and do not seem to be regarded as controversial. If phrases like this mean anything, they must surely mean that there are particular productions - perhaps even particular grammatical categories, that is specific node-labels - which are characteristic of one kind of prose rather than another; and, if that is true, then it could easily also be true that the productions associated with technical writing tend to make sentences containing them long, while the productions typical of fiction tend to make their sentences short.

2 The data set

This chapter examines the issue by comparing analyses of technical and fictional prose in a subset of the million-word 'LOB Corpus' (URL 1), which was the first electronic corpus of British English to be compiled. (The LOB Corpus - in full, the Lancaster-Oslo/Bergen Corpus - was completed in 1978, and is still heavily used by researchers today. By now it is showing its age a little - the LOB Corpus samples published prose from the year 1961;
but a forty-year age difference, though certainly significant in some contexts, is probably not very important in connexion with the question under discussion in this chapter.)

In order to investigate quantitative properties of the grammar of corpus samples, the first requirement is to equip the samples with annotations making their grammar explicit. Sentences are assigned parse-trees, in which the nodes (representing tagmas, or grammatical 'constituents') are given labels representing their grammatical features in terms of an agreed set of codes. An electronic resource of this type, comprising natural-language samples equipped with coded parse-trees, has come to be called a 'treebank'. (The word 'treebank' is nowadays in standard use internationally, though we believe it was first coined by Geoffrey Leech of the University of Lancaster, in connexion with the work discussed here.) The task of compiling a treebank involves more intensive and skilled labour than that of compiling a 'raw' language corpus without structural annotations, so treebanks are often smaller than raw corpora.

The research discussed below draws on one of the first treebanks ever produced, the 'Lancaster-Leeds Treebank' created at Lancaster and Leeds Universities in the 1980s in order to serve as a data source for an automatic parsing system which used statistical techniques.2 The Lancaster-Leeds Treebank is described in Garside, Leech and Sampson (1987: ch. 7); I shall give only bare details. It consists of a total of 2,353 sentences drawn from all of the 15 genre categories into which the LOB Corpus is divided, and comprising in total about 4.6 per cent of the complete LOB Corpus. Each sentence is equipped with a parse-tree drawn in accordance with a detailed and consistent scheme of structural annotation, so that each of the words and punctuation marks of a sentence corresponds to a terminal node in a branching structure ultimately dominated by a single root node, and each nonterminal node carries a label drawn from a fixed class of grammatical category labels.

In order to give the reader a sense of the nature of our annotation scheme, Figure 3.1 displays the parse-tree which the scheme would assign to the first sentence in the present chapter. Most readers will probably not wish to examine every detail of this diagram, and those who do are referred to the detailed published definition of the scheme in Sampson (1995). But, for instance, Figure 3.1 marks the opening words In this chapter as a prepositional phrase (P) functioning within its clause as a Place adjunct (:p), and consisting of a preposition (II) followed by a singular noun phrase (Ns). The word we is labelled as a one-word noun phrase which is morphologically marked as subject and plural (Nap) and which is functioning as subject of its clause at both surface and logical levels (:s). The label Fr above the wording from that ... onwards identifies this sequence as a relative clause within the singular noun phrase beginning a question ... ; the index number 123 in the label of that tagma shows that the question phrase functions as surface but not logical subject (S123) within the passive relative clause. The empty node m125 shows that the adverb scarcely, which interrupts the verb group could be
investigated, functions logically as a Modal adjunct, sister rather than daughter of the verb group. These are some typical examples of the kinds of structural property indicated by the annotation scheme.

[Figure 3.1: the parse-tree assigned by the annotation scheme to the first sentence of this chapter]

The Lancaster-Leeds Treebank is very small, by comparison with treebanks that have been developed subsequently (including some to be discussed in later chapters) - but then, when the work of compiling the Lancaster-Leeds Treebank was put in hand in 1983, no comparable resource of any size was available. Small though it is, we shall find that the Lancaster-Leeds Treebank offers a fairly clear-cut answer to the question with which we are concerned.

Because this particular treebank was developed at an early date, its annotation scheme did not include all the detail of the scheme illustrated in Figure 3.1. For instance, the Lancaster-Leeds Treebank has no 'functional' information about the roles of clause constituents as subject, object, Place adjunct, etc. For the present enquiry, I used a classification of tagmas that is even coarser than that of the treebank itself; the data discussed shortly are based on a classification which lumps together all members of a basic grammatical category and ignores subcategories. This was the only way to get statistically significant findings out of the quantity of data available. In this chapter, then, I recognize just 28 classes of grammatical construction. Table 3.1 lists these, with the respective code symbols used in the up-to-date version of our annotation scheme, and the number of examples occurring in the Lancaster-Leeds Treebank.
Table 3.1

noun phrase, N                                             11997
verb group, V (other than Vo, Vr)                           5842
prepositional phrase, P                                     5737
main clause, S                                              2353
adverb phrase, R                                            1560
adjective phrase, J                                          855
infinitival clause, Ti (other than Ti?)                      685
conjoined main clause, S+, S-                                624
nominal clause, Fn                                           546
relative clause, Fr                                          516
adverbial clause, Fa                                         500
present-participle clause, Tg                                474
past-participle clause, Tn                                   237
genitive phrase, G                                           119
interpolation, I                                              93
verb group operator in subject-auxiliary inversion, Vo        81
verb group remainder, Vr                                      78
direct quotation, Q                                           75
comparative clause, Fc                                        57
antecedentless relative, Ff                                   57
verbless clause, L, Z                                         45
with clause, W                                                43
numeral phrase, M                                             37
determiner phrase, D                                          35
nonstandard as clause, A                                      35
complementizerless clause, Tb                                 22
for-to clause, Tf                                             14
infinitival relative or indirect question, Tq, Ti?             4

Every production in the treebank has a mother node labelled with a symbol standing for one of these 28 categories,
and its daughter nodes are labelled with symbols from the same set, or with word-class tags in the case of terminal nodes.3 The full set of subclassifications which our annotation scheme provides for these 28 basic categories includes many distinctions, ignored in the present analysis, which may well be relevant to inter-genre differences in sentence length. For instance, one subclassification of verb groups is 'passive v. active'; passive verb groups, such as could scarcely be investigated, are longer on average than their active counterparts, and may well be commoner in one genre than another. However, any such effects are subsumed under factors that will be examined below. Thus, for instance, a high incidence of passive verb groups would contribute to a high ratio of daughters per mother node in verb groups. Because I am interested in contrasting long and short sentences, I focused on the parse-trees in two parts of the treebank, representing fiction and relatively technical prose. The LOB Corpus is divided into 15 genre categories, identified by letters. In terms of these categories, it is easy to define the fiction
samples: these are the samples drawn from LOB categories K to R inclusive. (The LOB Corpus recognizes six genres of fiction: for instance, LOB category L is 'Mystery and detective fiction', category R is 'Humour'.) Intuitively, the two most technical or formal LOB genre categories are category H, described in the LOB literature as 'Miscellaneous (mostly government documents)', and category J, 'Learned (including science and technology)'. These seem to be the only two LOB genre categories consisting mainly of texts addressed to specialist readers. And there are various kinds of objective evidence, independent of sentence-length considerations, that LOB categories H and J are the categories most opposite to the fiction categories. For instance, the table of inter-category correlations in Hofland and Johansson (1982: 23), based on rank numbers of common words, shows that H and J consistently have lower correlations than any other non-fiction category with each of the six fiction categories. So it seems appropriate to contrast LOB categories H and J as a group with the fiction categories as a group, and I shall call H and J the 'technical' categories.

The Lancaster-Leeds Treebank has a total of 14,123 words from the technical categories, and 12,050 words of fiction. We expect technical writing to use longer sentences on average than fiction, and the sample bears this out: average sentence length in the technical prose is 29.3 words, in the fiction it is 15.6 words - almost a two to one ratio. In what I shall call the 'general' section of the Lancaster-Leeds Treebank (material drawn from the seven LOB genre categories other than H, J, and K to R, e.g. journalism, hobbies, biographies), mean sentence length is intermediate, at 23.4 words. (Note that my sentence-length counts treat punctuation marks as 'words' which contribute to total sentence length, because our parsing scheme treats punctuation marks as items with their own terminal nodes in parse-trees. This makes my figures for sentence length rather unorthodox, but it does not affect anything I say about contrasts between technical prose and fiction.) Examples quoted in the following discussion are followed by references to their location within the LOB Corpus; for instance, 'N17.153' with the first quoted example means that this is taken from line 153 of text 17 in genre category N, 'Adventure and western fiction'.

3 Frequencies of different grammatical categories
Perhaps the most obvious way in which the individual productions in technical prose might lead to longer sentences than those used in fiction is that productions in technical prose might more commonly introduce particular grammatical categories which are normally realized as long sequences of words. There are many grammatical environments in which either of two alternative grammatical categories is equally possible; for instance, the direct object of know can be a noun phrase, as in:

. . . she knew [the reason]. (N17.153)
or a nominal clause, as in: Already he knew [that he wouldnotfindthatproofamong
Leo's papers]. (LOS.072)
The average nominal clause is longer than the average noun phrase. In the Lancaster—Leeds Treebank as a whole, nominal clauses contain on average 4.7 immediate constituents (ICs) — that is, daughter nodes — while noun phrases contain 2.4 ICs. (The length difference is even larger if one counts the words ultimately dominated by nominal clause and noun phrase nodes, respectively, instead of counting ICs; but all my discussions of constituent size will be in terms of ICs, because, looking at a parse-tree through our fixed window, we cannot tell how many words are dominated by a given node.) So a style of writing in which nominal clauses have a relatively high frequency will have relatively long sentences, other things being equal. Table 3.2 represents an attempt to see whether this sort of factor is what lies behind the difference in average sentence length between our technical and fiction samples. I looked for significant differences in category frequencies by applying the chi-squared test (e.g. Mendenhall 1967: 251 ff.) to a 2 x 3 contingency table for each of the 27 categories excluding 'main clause', with the three columns standing for the three genre groups, technical, general, and fiction, and the two rows standing for number of constituents belonging to the given category and to all other non-terminal categories. Only 11 of the 27 categories gave differences of frequency between genres which were significant at the/? < 0.05 level. The table lists these in descending order of mean length (in ICs) in the general prose: for instance, the average with clause in general prose has 3.95 ICs. The columns headed 'Technical', 'General', and 'Fiction' show the frequencies of the categories per hundred constituents in the respective genre groups.
Table 3.2
                           Technical   General   Fiction   Mean ICs   T or F higher
with clause                  0.056      0.12      0.21       3.95          F
antecedentless relative      0.045      0.25      0.14       3.78          F
direct quotation             0.011      0.072     0.73       3.50          F
nonstandard as clause        0.15       0.13      0          2.64          T
noun phrase                 35.5       35.7      33.8        2.46          T
past-participle clause       0.93       0.64      0.56       2.33          T
prepositional phrase        21.0       17.3      11.5        2.03          T
verb group                  15.5       16.7      19.8        1.50          F
verb group remainder         0.078      0.22      0.40       1.19          F
adverb phrase                3.4        4.4       6.1        1.15          F
verb group operator          0.12       0.21      0.41       1.09          F
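Before turning to what the table shows, it may be worth illustrating the mechanics of the test just described. The fragment below is a sketch only: the raw counts are invented for the purpose of the example (Table 3.2 reports frequencies per hundred constituents, not the underlying counts), and it assumes that the scipy library is available.

```python
# Sketch of the 2 x 3 chi-squared test described in the text, applied to one
# category. The raw counts below are invented for illustration only.
from scipy.stats import chi2_contingency

#            technical  general   fiction
observed = [[   2100,     4200,     1150],    # constituents of the category under test
            [   7900,    20800,     8850]]    # all other nonterminal constituents

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-squared = {chi2:.1f}, dof = {dof}, p = {p:.3g}")
if p < 0.05:
    print("category frequency differs significantly across the three genre groups")
```

Running one such test for each of the 27 categories, and retaining those for which p falls below 0.05, is what yields the eleven rows of Table 3.2.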
The rightmost column in Table 3.2 shows whether the category is commoner in technical writing (T) or fiction (F). In many cases, the answer to this question could easily be predicted. It is mainly in fiction that dialogue is found, and mainly in dialogue that questions occur; so, naturally, direct quotations, and the partial verb groups produced by the subject-auxiliary inversion characteristic of questions (e.g. the bracketed items in [Has] it [been tested]?), are all commoner in fiction than in technical writing. Other cases, particularly that of antecedentless relatives, seem more surprising. (The category 'antecedentless relative' refers to tagmas such as the sequence bracketed in the example But [whoever did it] got clean away ... (L04.151). The case of this category is specially unusual in that the General figure is higher than either the Fiction or the Technical figure, rather than intermediate between them as with most categories.)

Notice that what we do not find is that the categories which are usually long are commoner in technical prose, and those which are usually short are commoner in fiction. The three longest categories are commoner in fiction; then there is a group of categories commoner in technical prose that are intermediate in length (the mean length in ICs of all nonterminal nodes is 2.47, almost exactly the same as the figure for noun phrases); and then the shortest categories are commoner in fiction.

A feature of this table is that most categories included in it are either quite rare in any kind of prose, or else show a frequency difference between genres which is small, even though statistically significant. (The 'longest' category, the with clause, represents tagmas such as the one bracketed in [With events in Brazil leading to fears of anarchy], Dr. Fidel Castro today urged ... (A29.227). This is a sufficiently distinctive English construction to have been allotted its own category in our annotation scheme, but, as Table 3.2 shows, it is quite infrequent.) The only categories with a frequency greater than one per hundred constituents and with inter-genre differences approaching two-to-one are the adverb phrase and the prepositional phrase. Of these two, prepositional phrases are far the commoner, and they are one of the categories for which the inter-genre difference lies in the 'wrong' direction - prepositional phrases are shorter than the average grammatical category, yet they are commoner in technical writing than in fiction.

The finding about adverb phrases and prepositional phrases is quite interesting in its own right, particularly when we consider that these two categories are in many cases logically somewhat equivalent. Both adverb phrases and prepositional phrases function as devices for expressing clause modification; and it does seem that fiction prefers to achieve this function using adverbs, or phrases with adverb heads, while technical writing prefers to achieve it with prepositional phrases. So far as I know, this has not been noticed before. But if prepositional phrases are themselves a shorter than average category, even if longer than adverb phrases, it is difficult to believe that this finding does much to explain the difference in sentence lengths between the two kinds of prose. On the other hand, the difference in frequency of direct quotations in
fiction and in technical prose is very large; yet, even in fiction, quotations occur less often than once per hundred constituents - so it seems unlikely that this big difference can contribute much to the large overall difference in average sentence length between fiction and technical writing.

Indeed, one of the striking findings that emerged when I looked at category frequencies across genres was how constant these tended to be, in the case of common categories. Table 3.3 gives frequencies per thousand words for five categories which are relatively large, in the sense that they have a high mean number of ICs.

Table 3.3

                            Technical   General   Fiction   Mean ICs
nominal clause                 9.6        11.1      10.4      4.85
adverbial clause               8.5        10.3       9.5      4.37
relative clause               10.1        10.0       9.6      3.86
comparative clause             1.06        1.01      1.33     3.46
present-participle clause      9.8         8.3      10.0      2.78

It seems that one can predict quite regularly of a piece of English prose, irrespective of its genre, that nominal and relative clauses will occur in it with roughly equal frequency, while there will be only one comparative clause for every nine or so nominal clauses. (It is interesting to compare these findings about grammatical constructions with Richard Hudson's recent finding (Hudson 1994) that a particular part of speech, i.e. class of individual word, occurs at a strikingly constant frequency in diverse genres of the written and spoken language.)

4 Many daughters versus few daughters

If the difference in sentence lengths between technical prose and fiction is not explained by technical prose more frequently using categories which, as mother nodes, regularly have relatively many ICs (daughter nodes), an alternative hypothesis is that technical prose tends to prefer productions involving more daughters rather than productions involving fewer daughters in the expansion of a given mother category. Again, it is obvious that a given category can often be expanded in alternative ways that differ greatly in numbers of ICs. The category 'noun phrase', for instance, can be realized by a simple Determiner + Noun sequence, say:

a man (NOT.166)

or it may contain a whole range of optional additions, for instance:

a much younger man whom I have already mentioned, Sidney Lewis (G16.140)

which, in terms of our parsing scheme, would be a noun phrase with six ICs.
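The quantity compared in this section, the mean number of ICs per mother node for each category in each genre, is mechanical to compute once parse-trees are available electronically. The sketch below assumes, purely for illustration, that each sentence is held as a nested (label, children) structure paired with a genre code; neither that format nor the load_treebank() helper corresponds to the actual Lancaster-Leeds files.

```python
# Sketch: mean number of immediate constituents (daughter nodes) per mother
# node, broken down by category and genre. The (label, children) format and
# the load_treebank() helper are assumed for illustration only.
from collections import defaultdict

def tally_ics(tree, totals, counts):
    """Walk one parse-tree, accumulating daughter counts for each category."""
    if isinstance(tree, str):                # terminal node: a word or punctuation mark
        return
    label, children = tree
    totals[label] += len(children)           # number of ICs under this mother node
    counts[label] += 1
    for child in children:
        tally_ics(child, totals, counts)

def mean_ics_by_genre(sentences):
    """sentences: iterable of (genre, parse_tree) pairs, e.g. ('fiction', tree)."""
    tallies = defaultdict(lambda: (defaultdict(int), defaultdict(int)))
    for genre, tree in sentences:
        totals, counts = tallies[genre]
        tally_ics(tree, totals, counts)
    return {genre: {cat: totals[cat] / counts[cat] for cat in counts}
            for genre, (totals, counts) in tallies.items()}

# e.g.  means = mean_ics_by_genre(load_treebank())   # load_treebank() is hypothetical
#       means['technical']['N'] and means['fiction']['N'] would correspond to the
#       noun-phrase figures reported in Table 3.4.
```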
Only six categories showed significant differences in the average number of ICs between technical writing and fiction. These are shown in Table 3.4.

Table 3.4

                          Technical   Fiction
main clause                  5.34       5.92
nominal clause               4.86       4.20
noun phrase                  2.70       1.93
prepositional phrase         2.06       2.03
conjoined main clause        4.79       4.43
verb group                   1.63       1.44

The contrast for main clauses is in the 'unexpected' direction, that is to say, root nodes in parse-trees for fiction sentences have more daughters than in those for technical sentences. Of the other categories listed, nominal clauses and conjoined main clauses occur much more rarely than the rest, and the length difference for prepositional phrases is negligible. So it appears that the only categories showing important length differences are the noun phrase (the commonest category of all) and to a lesser degree the verb group. One factor relevant to the length difference for noun phrases is that fiction uses many more personal pronouns than technical prose; but this explains only about half of the difference - even when the category 'noun phrase' is not realized as a personal pronoun, it still tends to have fewer ICs in fiction than in technical writing.

Overall, tagmas in technical sentences have on average 8 per cent more ICs than those in fiction sentences. This figure represents the combined overall effect of both factors so far considered, that is variations in relative frequencies of categories, and variations in length within a given category.

5 Terminal versus nonterminal daughters

A third possible way of accounting for the large sentence-length difference between the two kinds of writing would be to say that technical writing has a propensity to use productions which contain a relatively high proportion of nonterminal to terminal daughters. There are many points in English grammar where either an individual word or a multi-word phrase or clause can equally well fill a particular slot. In premodifying position before a noun, for instance, we can have either a simple adjective, as in a large house, or a complex adjectival phrase, as in a [very large and somewhat decrepit] house. The higher the proportion of nonterminal to terminal nodes in productions, the more ramified parse-trees will become and hence the longer will be the sequences of words they dominate. But this certainly does not explain the sentence-length difference in the
Lancaster-Leeds Treebank. Nonterminal to terminal ratio is 0.699 for fiction, 0.649 for general prose, and 0.633 for technical prose: the differences run in just the opposite direction to the one we would expect, if this factor were relevant to the inter-genre differences in sentence length.

6 The mystery resolved
At this point we seem to be in a paradoxical situation. We have seen that mean sentence length is substantially greater in technical prose than in fiction: almost twice as great. But we have looked at all the possible contributory factors, and wherever we have looked it seems fair to say that the differences between productions have been elusive, and such differences as we have found seem much smaller than expected. Furthermore, the differences have sometimes been 'the wrong way round'.

If we write r for the ratio of nonterminal to terminal nodes in a tree, and w for the average number of daughters per mother node, then the number of terminals in the tree, t, is determined by the formula:

t = 1 / (1 + r - rw)
Here w is itself a function of the relative frequencies of the various categories, and of the mean number of ICs of each category, so this relation confirms that any relevant factor is subsumed in the three we have investigated. This formula has the property that changing r or w even slightly yields a large change in t. For instance, if r is 0.6 and w is 2.5, which are fairly typical values, then t is 10; but if w is increased from 2.5 to 2.6, that is a 4 per cent increase, then t shoots up from 10 to 25, a 150 per cent increase.

This property is the key to the apparent paradox. In the Lancaster-Leeds Treebank, we have seen that there is an average 8 per cent difference in w (average ICs per mother node) between fiction and technical writing. That looks small, but it is more than enough in itself to explain the overall sentence-length differences (even without the further factor of differing ratios of adverbial and prepositional phrases). In fact, the sentence-length contrast would be very much greater than it is, were it not for the compensating converse variation in r, mentioned above. The difference in w is almost wholly attributable to the figures for noun phrases alone. So what I am really saying is that sentences in technical prose are longer than those in fiction because the average noun phrase node in technical prose has 2.70 ICs whereas in fiction it has 1.93 ICs. This finding is consistent with (but more specific than) that of Ellegard (1978: 76-7), who finds that phrase length, rather than number of clauses per sentence or phrases per clause, is the chief factor distinguishing long-sentence genres from short-sentence genres in the Brown Corpus (URL 1 - the American-English 'elder sister' to the LOB Corpus).
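The sensitivity just described is easy to confirm by direct calculation; the following lines simply evaluate the formula at the values quoted above, and are offered only as an arithmetic check.

```python
# Arithmetic check of t = 1 / (1 + r - r*w): r is the ratio of nonterminal to
# terminal nodes, w the mean number of daughters per mother node.
def terminals(r, w):
    return 1.0 / (1.0 + r - r * w)

print(round(terminals(0.6, 2.5), 2))   # 10.0  (the 'fairly typical' values above)
print(round(terminals(0.6, 2.6), 2))   # 25.0  (a 4 per cent rise in w, a 150 per cent rise in t)
```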
One might sum up the situation like this. People who talk about the 'grammar of technical English', or the 'grammar of fiction', seem to suppose that there are certain types of construction which are typical of one or the other genre. Indeed, there are; for instance, direct quotations are typical of fiction, and 'non-standard as clauses' are typical of technical prose. (A 'nonstandard as clause' is exemplified by the bracketed sequence in The pitching moments [as measured in the experiments] included ... (J73.185).) But these telltale constructions are rare in absolute terms in any kind of prose. It is like saying that Scotsmen are people who wear kilts and Englishmen are people who wear bowler hats. It is quite true that if you see a man in a kilt he is likely to be a Scot, and if you see one in a bowler he is likely to be English. But if you pick a Briton at random and want to know whether he is Scottish or English, it is very unlikely that he will have either a kilt or a bowler to help you decide.

Instead, the really significant differences between the prose genres lie in quite small differences in mean values of topological properties of grammatical structure, which have a large cumulative effect because of the recursive nature of grammatical hierarchy, but which are too small to allow anything to be predicted from individual productions. If you looked through a window at a small part of a parse-tree, it is quite unlikely that you would be able to tell whether you were seeing part of a short fiction sentence or part of a long technical sentence. The genres are strikingly different in grammatical macrostructure, but strikingly similar in grammatical microstructure.

So far as this research goes, it suggests that one should not talk about different grammars for fiction or technical writing. Instead we need to think in terms of a single grammar, which generates a range of tree structures, some large and some small. Technical writing and fiction both use structures drawn from this same pool, but the selections made by technical writers cluster round a higher mean length than those made by fiction writers. If you want to guess whether a sentence structure is drawn from fiction or from technical writing, the only question worth asking about it is: how big is it?

Notes

1 A different way of bringing corpus data to bear on genre differences is exemplified by the work of Douglas Biber (e.g. Biber 1995). Biber aims to quantify genre differences in English and other languages by studying statistical patterns in the incidence of particular structural features selected because they seem likely to be relevant to genre. Biber's work leads to great insight into the intellectual bases of differences among prose styles. However, his approach contrasts with the one adopted in this chapter, which applies statistical analysis to the entire structure of parse-trees. (If a small window onto part of a parse-tree is selected at random, there is no guarantee that it will contain any particular feature selected for study in a Biber-type analysis.)

2 The earliest English treebank of all, to my knowledge, was the one produced by Alvar Ellegard at Gothenburg University in the 1970s (Ellegard 1978). For a variety of reasons, this was little used before it was converted into a more user-friendly form as the 'SUSANNE Corpus', discussed in Chapter 4.
3 I ignore, here, two kinds of case where nonterminal nodes are labelled by wordtags rather than clause or phrase category labels: 'grammatical idioms', such as up to date used adjectivally, and co-ordinations of single words within a phrase, as in all the [bricks and mortar]. These nonterminals are treated as terminals (that is, as if they were single words) for present purposes, since otherwise they would introduce a very large number of extra categories each with too few examples for statistical analysis. Single-word co-ordinations are in fact more than twice as common in technical writing as in fiction, but not common enough to have any appreciable effect on average sentence lengths.

4 Prepositional phrases have another function, not shared by adverb phrases, as nominal post-modifiers. However, a separate investigation of the Lancaster-Leeds Treebank, not reported in detail here, suggests that an inter-genre frequency difference between prepositional and adverb phrases similar to that shown in Table 3.2 is found even when the comparison is limited to sentence ICs, so that noun-modifying prepositional phrases are excluded.
4
Depth in English grammar
1 A hypothesis and its implications
The availability of quantities of structurally analysed language material in machine-readable form is enabling us to reopen questions which were broached in the days before computer processing of natural language was a possibility, but which could not be accurately answered then.

For forty years, most linguists have known about an asymmetric property of English grammatical tree structures. If a sentence structure is diagrammed as a tree in the usual fashion, with a 'root' node at the top and successive downward branchings leading to 'leaf' or 'terminal' nodes at the bottom with the successive words of the sentence attached to them, then the branching structure will normally be seen to flourish much more vigorously in the 'south-easterly' than in the 'south-westerly' direction. If the sentence is reasonably long, then the tree structure drawn out on a sheet of paper will occupy a diagonal swathe from the top left towards the bottom right corner of the sheet. (Alternatively, if one begins by writing the words horizontally across the paper and adds the tree structure above them later, then the branches linking the early words to the root node will have to be drawn very long, in order to leave enough vertical space to fit in the much more numerous layers of branching between the root and the later nodes.)

The man who first drew attention to this asymmetry was Victor Yngve (1960, 1961). Yngve interpreted the phenomenon as a consequence of psychological mechanisms which favour right-branching over left-branching structures. English has various individual grammatical constructions which create left-branching, but Yngve believed that our mental language-processing machinery enforces restrictions on the use of those constructions so as to ensure that the grammatical 'depth' of any individual word is never more than some fixed maximum - perhaps seven (whereas the use of right-branching constructions is unconstrained).

Yngve argued that left-branching constructions impose burdens on the speaker's short-term memory. Figure 4.1, for instance, adapted from Yngve (1960: 462), is an (unlabelled) parse-tree for the sentence He is as good a young man for the job as you will ever find. Note that, when the speaker utters the first as, he commits himself to completing the construction with the later as (you will
ever find), and this commitment has to be retained in memory while the intervening wording, good a young man for the job, is developed. It is in order to reduce the number of such commitments to be held in memory, Yngve believed, that the word order of the sentence is organized so as to limit the numbers of 'NE-to-SW' branches in the structure. The largest number of such branches between any word in Figure 4.1 and the root node is three (for the words a and young) - in Yngve's terms, these words have 'depth' 3, and most words have lower depth; whereas many words have larger numbers of 'NW-to-SE' branches above them, for instance the closing word find has six such branches.

[Figure 4.1: unlabelled parse-tree, adapted from Yngve (1960: 462), for He is as good a young man for the job as you will ever find]

Note that the term 'depth', in Yngve's usage, refers purely to the quantity of left-branching contained in the path linking a terminal node to the root node of a grammatical tree structure (we shall become more precise shortly about how this is counted). It is necessary to stress this to avoid misunderstanding, because the term 'depth' is used in connexion with tree structures quite differently by computer scientists, for whom the depth of a terminal node is the total number of branches (of any kind) between itself and the root. Thus, for a computer scientist, the rightmost terminal node of a tree may have a large depth, but for Yngve the depth of the last word of a sentence is necessarily zero. Yngve's papers on this topic have attained such classic status in linguistics that I have chosen to follow his usage here. The computer scientists' 'depth' is a quantity which plays no part in the present discussion, so I have not needed to adopt any particular term for it.1
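One straightforward way of making the count precise is to treat every branch leading to a daughter other than its mother's last daughter as a 'NE-to-SW' branch, and to define a word's depth as the number of such branches on its path to the root. The sketch below implements that reading; it is an illustration only (the tree format, the category labels and the toy sentence are invented, and the chapter's own precise counting procedure is set out later), and variants of the count, such as adding one unit for each right sister passed over, are equally possible.

```python
# Sketch: Yngve-style depth of each word, counted here as the number of
# branches on its path from the root that lead to a non-final daughter.
# Tree format, labels and the toy sentence are invented for illustration.

def word_depths(node, depth=0):
    """Return (word, depth) pairs for the terminal nodes under `node`."""
    if isinstance(node, str):
        return [(node, depth)]
    _, children = node
    pairs = []
    for i, child in enumerate(children):
        last = (i == len(children) - 1)
        pairs.extend(word_depths(child, depth if last else depth + 1))
    return pairs

# A toy left-branching structure for 'the very old dog slept':
tree = ('S', [('N', [('D', ['the']),
                     ('J', [('R', ['very']), 'old']),
                     'dog']),
              ('V', ['slept'])])

for word, d in word_depths(tree):
    print(word, d)
# the 2 / very 3 / old 2 / dog 1 / slept 0 - depth falls to zero at the last word
```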
Lees (1961) and Fodor, Bever and Garrett (1974: 408 ff.) argued that the relevant psychological considerations are more complex than Yngve realized, and that the depth constraints in languages such as Japanese and Turkish are quite different from the constraint in English. Fodor et al. point out that while it is left-branching which imposes a burden on the speaker's memory for grammatical structure, for the hearer the situation is reversed, so that right-branching constructions are the ones which would be expected to create short-term memory burdens. By implication, one difference between English and Japanese linguistic structures is that English is a language whose speakers keep things relatively easy for themselves, in terms of these memory burdens, whereas Japanese is a language whose speakers make things relatively easy for their hearers.

The question of left-branching became linked for some linguists with that of multiple central embedding, discussed in Chapter 2. Occasionally it is suggested (for example Lyons 1991: 116) that Yngve's hypothesis might have resulted from taking what is in reality a constraint on central embedding to be a more general constraint on left-branching. But these issues should be kept distinct. We saw in Chapter 2 that the alleged constraint on central embedding is of debatable status; but, even if some limitation did apply in that area, it would not in itself create any left-right asymmetry in the shape of grammatical structure trees. On the other hand, it is unquestionably true that there is a strikingly low incidence in English of left-branching in general - that is, of multi-word constituents occurring anywhere other than as rightmost daughters of their containing constructions. One of the most immediately noticeable features of any grammatically analysed English corpus which uses brackets to delimit constituents is the frequent occurrence of long sequences of right brackets at the same point in a text, while sequences of adjacent left brackets are few and short. This chapter will study the general English tendency for multi-word constituents to occur at the end of their containing construction, ignoring the separate issue whether constituents which violate this tendency are significantly less frequent in the middle than at the beginning of the higher unit.

Writing before the availability of computers and grammatically analysed corpora, Yngve noted (1960: 461) that 'It is difficult to determine what the actual [depth] limit is'; his figure of seven seems to have been a surmise based on psychological findings about memory limitations in other domains, rather than on an empirical survey of linguistic usage (which would scarcely have been feasible at that period). Fodor et al. (1974: 414) echoed Yngve's point about the difficulty of checking empirically just what the depth patterns are in real-life usage. But it is fairly clear that Yngve's conception involves a sharp cut-off: up to the depth limit (whether this is seven or another number) many words are found, beyond the limit none. He illustrates his concept with a diagram (reproduced here as Figure 4.2, after Yngve 1961: 134, Fig. 5) of the kind of structure that would be expected with a depth limit of three; of the 15 terminal nodes in Figure 4.2, apart from the
Figure 4.2

last (which necessarily has depth 0) there are three at depth 1, six at depth 2, and five at depth 3. Yngve's caption to the diagram reads 'If the temporary memory can contain only three symbols, the structures it can produce are limited to a depth of three and can never penetrate the dotted line.' Yngve's depth hypothesis is significant for computational language-processing models, because — leaving aside the question whether sentences violating the depth limit should be regarded as 'ungrammatical' or as 'grammatical but unacceptable', a distinction that we shall not discuss - it seems to imply that English grammatical usage is determined in part by a nonlocal constraint. Since the phrase-structure rules of English grammar allow some left-branching and are recursive, it appears that the class of structures they generate should include structures with excessive left-branching, which would have to be filtered out by a mechanism that responds to the overall shape of a tree rather than to the relationship between a mother node and its immediate daughter nodes. Though there is undoubtedly something right about Yngve's depth hypothesis, to an empirically minded corpus linguist the postulation of a fixed limit to depth of left-branching has a suspicious air. Corpus linguists tend rather to think of high- and low-frequency grammatical configurations, with an 'impossible' structure being one that departs so far from the norm that its probability is in practice indistinguishable from zero, but without sharp cut-offs between the 'possible' and the 'impossible'. In this chapter, I
shall bring corpus evidence to bear on the task of discovering precisely what principle lies behind the tendency to asymmetry observed by Yngve in English. We shall find that the answer is clear-cut; that it does not imply a sharp cut-off between acceptable and unacceptable depths of left-branching; and that it has positive consequences for the processing issue canvassed above.

2 The SUSANNE Corpus

For this purpose, I used a treebank which is newer and larger than the Lancaster-Leeds Treebank discussed in Chapter 3: namely, the SUSANNE Corpus (described in URL 3). This is an approximately 130,000-word subset of the Brown Corpus of edited American English (URL 1), equipped with annotations which represent its surface and logical grammatical structure in terms of the full analytic scheme exemplified in Figure 3.1, p. 27, and defined in Sampson (1995). The SUSANNE analytic scheme is a set of annotation symbols and detailed guidelines for applying them to difficult cases, which is intended to come as close as possible to the ideal of defining grammatical analyses for written and spoken English that are predictable (in the sense that different analysts independently applying the scheme to the same sample of English must produce identical annotations), comprehensive (in the sense that everything found in real-life usage receives an analysis, and all aspects of English surface and logical grammar which are definite enough to be susceptible of explicit annotation are indicated), and consensual (in that the scheme avoids taking sides on analytic issues which are contested between rival linguistic theories, choosing instead a 'middle-of-the-road' analysis into which alternative theorists' analyses can be translated). This ideal can never be perfectly realized, of course, but critics' comments suggest that the SUSANNE scheme has made useful progress towards it; according to Terence Langendoen (1997: 600), for instance, 'the detail... is unrivalled'. At 130,000 words, the SUSANNE Corpus is by now far from the largest treebank available, but limited size is the penalty paid to achieve high reliability of the analysis of each individual sentence - for present purposes that is important. SUSANNE may still be the most comprehensively and consistently analysed English treebank in circulation.3 The research discussed here used Release 3 of the SUSANNE Corpus, completed in March 1994; the many proofreading techniques to which this version was subjected before release included scanning the entire text formatted by software which uses indentation to reflect the constituency structure implied by the SUSANNE annotations, so that most errors which would affect the conclusions of the present research should have been detected and eliminated. Although the SUSANNE analytic scheme aims to be 'consensual' as just defined, obviously many individual linguistic theorists would prefer different structural analyses for particular constructions. However, although this might lead to some changes in the individual figures reported below, the
overall conclusions are sufficiently clear-cut to make it reasonable to hope that they would be unaffected by such modifications, provided these were carried out consistently. Some readers may think it unfortunate that the present investigation is based on written rather than spoken English; if constraints on left-branching derive from psychological processing considerations (as Yngve believed), it is likely that these considerations impact more directly on spontaneous speech than on writing. Until very recently there existed no analysed corpus of spontaneous spoken English which would have been suitable for the purpose (though see Chapter 5 for the new CHRISTINE speech treebank). But in any case, transcriptions of spontaneous speech tend not to contain long chains even of right-branching structure, and they contain many editing phenomena which make it difficult to analyse an utterance in terms of a single coherent tree-structure; so that it is questionable whether an analysed corpus of spontaneous speech could have been used for this research, even if one had been available when it was carried out. The highly ramified grammatical structures discussed by Yngve (1960) are in fact much more characteristic of written than of spoken English, and I believe that a written-English treebank may offer the best opportunity to take his work further.

3 Preparation of the test data
The SUSANNE Corpus, like the other treebanks and raw corpora discussed in this book, was produced as a general-purpose research resource not geared to any specific investigation; since its first release in 1992, the SUSANNE Corpus has been used heavily by researchers in many parts of the world for very diverse studies. In consequence, for any individual study it is often necessary to adapt the material in various ways to suit the needs of the particular investigation. (The fact that such adaptations are needed should be reassuring, in a case like this where the investigation is carried out in the same laboratory which developed the treebank; it demonstrates lack of circularity - we did not design the SUSANNE Corpus with a view to getting the results reported below.) In order to study left-branching, it was necessary to modify the structures of the SUSANNE Corpus in a number of respects: (i) The SUSANNE analytic scheme treats punctuation marks as 'words' with their own place in parse trees; and it recognizes 'ghost' elements (or 'empty nodes') - terminal nodes marking the logical position of elements which appear elsewhere in surface structure, and which have no concrete realization of their own, such as the item S123, representing the underlying relative-clause subject, in Figure 3.1, p. 27. Punctuation marks are not likely to be relevant to our present concerns (with respect to human syntactic processing they are written markers of structure rather than elements forming part of a syntactic structure); and ghost elements are too theory-dependent to be appropriately included in an empirical investigation such as ours (Yngve discussed only the structuring of concrete words). Therefore all
terminal nodes of these two types, and any nonterminals dominating only such nodes, were pruned out of the SUSANNE structures.

(ii) Any tree whose root node is labelled as a 'heading', Oh, was eliminated: this covers items such as numbered chapter titles, and other forms whose internal structure often has little to do with the grammar of running English text.

(iii) Apart from 'headings', the SUSANNE texts are divided by the analysis into units whose root nodes are labelled 'paragraph', O. A paragraph normally consists of an unstructured chain of sentences (interspersed with sentence-final punctuation marks which were eliminated at step (i)). Yngve's thesis relates to structure within individual sentences; therefore O nodes were eliminated, and the units within which left-branching was examined were the subtrees whose roots are daughters of O nodes in the unmodified corpus. Not all of these units are grammatically 'complete sentences'; occasionally, for instance, a noun phrase functions as an immediate constituent of a SUSANNE paragraph. The present investigation paid no attention to whether root nodes of trees in the modified corpus had the label S or some other label.

(iv) Some SUSANNE tree structures contain nodes below the root, representing categories such as 'direct quotation', which with respect to their internal constituency are equivalent to root nodes. For the present investigation, the links between such 'rootrank nodes' (Sampson 1995: §4.40) and their daughters were severed: thus left-branching was measured within the sentence(s) of a direct quotation without reference to the sentence within which the quotation was embedded, and when left-branching was measured in that quoting sentence the quotation was treated as a single terminal node.

(v) The SUSANNE analytic scheme treats certain sequences of typographic words, for example up to date used as an adjective, as grammatically equivalent to single words. Any node labelled with an 'idiomtag' (Sampson 1995: §3.55) was treated as terminal, and the structure below it in the unmodified SUSANNE Corpus was ignored.

(vi) The SUSANNE analytic scheme makes limited use of singulary-branching structure. For instance, a present-participle clause consisting of a present participle and nothing more (e.g. the word licensing in their own annual licensing fee, Brown and SUSANNE Corpora location code A02:0880) will be assigned a node labelled with a clausetag, dominating only a node labelled with a verb-group tag, dominating only a node labelled with a present-participle wordtag. Numerical measures of left-branching might take singulary branching into account in different ways, depending on exactly how the measures were defined, but intuitively it seems unlikely that singulary branching is significant in this connexion; and again singulary-branching nodes seem to be entities that are too theory-laden to be considered in the present context. (What would it mean to assert that the grammatical configuration just cited is a case of three separate units that happen to be coterminous, rather than a case of one word unit that happens to play three roles? - many would see these as different ways of talking about the same
facts.) Therefore singulary branching was eliminated by collapsing pairs of mother and only-daughter nodes into single nodes.
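All of these adjustments amount to simple mechanical tree surgery. As a purely illustrative sketch (not the software actually used on the SUSANNE files), the following Python fragment shows how two of them, the pruning of unwanted terminals (adjustment (i)) and the collapsing of singulary branching (adjustment (vi)), might be carried out; the node labels and the set of ignorable terminals are invented for the example.

```python
# Toy tree representation used for the illustrative sketches in this chapter:
# a nonterminal is a (label, [children]) tuple, a word is a plain string.
# IGNORABLE stands in for the punctuation and ghost terminals of the real
# annotation scheme; it is an assumption made purely for this example.

IGNORABLE = {',', '.', ';', 'GHOST'}

def prune(node):
    """Drop ignorable terminals, and any nonterminal left with no children."""
    if isinstance(node, str):
        return None if node in IGNORABLE else node
    label, children = node
    kept = [c for c in (prune(child) for child in children) if c is not None]
    return (label, kept) if kept else None

def collapse_unary(node):
    """Merge mother/only-daughter chains of nonterminals into single nodes.
    Which label survives does not matter for the depth counts below."""
    if isinstance(node, str):
        return node
    label, children = node
    if len(children) == 1 and not isinstance(children[0], str):
        return collapse_unary(children[0])
    return (label, [collapse_unary(c) for c in children])

# A singulary-branching chain of the kind described in (vi), plus a comma:
example = ('S', [('Vg', [('VVGv', ['licensing'])]), ','])
print(collapse_unary(prune(example)))   # ('VVGv', ['licensing'])
```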
4 Counts of word depths
The first question put to the resulting set of sentence structures was whether Yngve's concept of a sharp limit to the permissible degree of 'depth' is borne out in the data. Let us say that the lineage of a word is the class of nodes including the leaf node (terminal node) associated with that word, the root node of its tree, and all the intermediate nodes on the unique path between leaf and root nodes; and let us say that a node e is a younger sister of a node d if d and e are immediately dominated by the same 'mother' node and e is further right than d. Then Yngve's concept of the 'depth' of a word corresponds to:

Definition 1: the total number of younger sisters of all the nodes in the word's lineage.

The number of words in the modified SUSANNE Corpus having various depths in this sense is shown in Table 4.1. Table 4.1 gives us not a picture of a phenomenon that occurs freely up to a cut-off point and thereafter not at all, but of a phenomenon which, above a low depth, becomes steadily less frequent with increasing depth until, within the finite quantity of available data, its probability becomes indistinguishable from zero.
Table 4.1

Depth    Words
0         7851
1        30798
2        34352
3        26459
4        16753
5         9463
6         4803
7         2125
8          863
9          313
10         119
11          32
12           4
13           1
14+          0
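To make the counting procedure concrete, here is a small sketch of depth in the Definition 1 sense, computed over the toy tree representation introduced above; the example sentence and its structure are invented, and the figures in Table 4.1 were of course obtained from the SUSANNE trees themselves, not from code of this kind.

```python
# Word depth in Yngve's sense (Definition 1): for each word, count the
# younger sisters of every node on the path from that word's leaf node up
# to the root.

def word_depths_def1(tree):
    """Return (word, depth) pairs in left-to-right order."""
    results = []

    def walk(node, younger_sisters_above):
        if isinstance(node, str):
            results.append((node, younger_sisters_above))
            return
        label, children = node
        for i, child in enumerate(children):
            # a child's own younger sisters are the daughters to its right
            walk(child, younger_sisters_above + (len(children) - i - 1))

    walk(tree, 0)
    return results

# An invented right-branching example:
toy = ('S', [('NP', ['the', 'dog']),
             ('VP', ['chased', ('NP', ['the', 'cat'])])])
print(word_depths_def1(toy))
# [('the', 2), ('dog', 1), ('chased', 1), ('the', 1), ('cat', 0)]
```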
However, although Definition 1 is the definition of 'depth' that corresponds most directly to Yngve's exposition, there are two aspects of it which might be called into question. In the first place, 'depth' in this sense can arise as much through a single node having many younger sisters as through a long lineage of nodes each having one younger sister. This is illustrated by the one word in SUSANNE having depth 13, which is the first word5 of the sentence

Constitutional government, popular vote, trial by jury, public education, labor unions, cooperatives, communes, socialised ownership, world courts, and the veto power in world councils are but a few examples (G11:0310)
The SUSANNE analysis of this sentence is shown in Figure 4.3; nodes contributing to the depth count of the first word are underlined. Although in principle the existence of individual nodes with large numbers of daughters and the existence of long lineages of nodes each having one younger sister are two quite different aspects of tree-shape, for Yngve the distinction was unimportant because he believed that branching in English grammatical structures is always or almost always binary (Yngve 1960: 455). But this seems to have been less an empirical observation about English grammar than an analytical principle Yngve chose to impose on English grammar. In the case of multi-item co-ordinations such as the one in Figure 4.3, for instance, where semantics implies no internal grouping of the conjuncts I know of no empirical reason to assume that the co-ordination should be analysed as a hierarchy of binary co-ordinations. In SUSANNE analyses, which avoid positing structure except where there are positive reasons to do so, many nodes have more than two daughters. Where SUSANNE has a single node with three or more daughters, it seems that Yngve regularly assumed a right-branching hierarchy of binary nodes. This implies that 'depth' measured on SUSANNE trees will best approximate to Yngve's concept if each node having younger sister(s) contributes exactly one to the depth of the words it dominates, rather than nodes having many younger sisters making a greater contribution. In that way, depth figures for words dominated by nodes with many daughters will be the same as they would be in the corresponding Yngvean trees containing only binary nodes. (To make the point quite explicit: although I do not myself believe that grammatical branching is always binary, I am proposing that we count word depth in a way that gives the same results whether that is so or not.) Secondly, even the most right-branching tree must have an irreducible minimum of left branches. A tree in which all nonterminal nodes other than the root are rightmost daughters ought surely to be described as containing no left-branching at all; yet by Yngve's definition each word other than the last will have a depth of one, rather than zero (and the average word depth will consequently depend on how many words there are). This inconsistency could be cured by ignoring the leaf node when counting left-branching in a lineage.
Figure 4.3
Figure 4.4

Accordingly, I suggest that a more appropriate definition than Definition 1 of the depth of a word would be:

Definition 2: the total number of those nonterminal nodes in the word's lineage which have at least one younger sister.
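The corresponding calculation for Definition 2 differs only in that the leaf node itself is ignored, and an ancestor contributes at most one to the count, namely when it is a nonterminal with at least one younger sister. Again this is an illustrative sketch over the toy representation used earlier, not the code used to produce the SUSANNE counts.

```python
def word_depths_def2(tree):
    """(word, depth) pairs; depth counts nonterminal ancestors (excluding
    the leaf itself) that have at least one younger sister."""
    results = []

    def walk(node, depth):
        if isinstance(node, str):
            results.append((node, depth))      # leaves never contribute
            return
        label, children = node
        for i, child in enumerate(children):
            has_younger_sister = i < len(children) - 1
            if isinstance(child, str):
                walk(child, depth)
            else:
                walk(child, depth + (1 if has_younger_sister else 0))

    walk(tree, 0)                              # the root has no sisters
    return results

toy = ('S', [('NP', ['the', 'dog']),
             ('VP', ['chased', ('NP', ['the', 'cat'])])])
print(word_depths_def2(toy))
# [('the', 1), ('dog', 1), ('chased', 0), ('the', 0), ('cat', 0)]
# Mean depth on this toy tree falls from 1.0 (Definition 1) to 0.4.
```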
Thus, consider terminal node e in Figure 4.4. Counted according to Definition 1, the depth of e is four, the relevant younger sister nodes being F, j, k, L. Counted according to Definition 2, the depth of e is two, the contributing nonterminals being B and C. If the distribution of depths among SUSANNE words is recomputed using Definition 2, the results are as shown in Table 4.2.

Table 4.2 (number of words at each depth in the Definition 2 sense, for depths 0 to 5)
The decline is now much steeper, but again we seem to be looking at a continuously decreasing probability which eventually becomes indistinguishable from zero in a finite data-set, rather than at a sharp cut-off. The four words at depth 5 are the words New York, United States occurring in the respective sentences:

Two errors by New York Yankee shortstop Tony Kubek in the eleventh inning donated four unearned runs and a 5-to-2 victory to the Chicago White Sox today (A11:1840)
Vital secrets of Britain's first atomic submarine, the Dreadnought, and, by implication, of the entire United States navy's still-building nuclear sub fleet, were stolen by a London-based Soviet spy ring, secret service agents testified today (A20:0010)
These examples seem intuitively to relate more closely than the Constitutional government example to the depth phenomenon with which Yngve was concerned; their SUSANNE analyses are Figures 4.5 and 4.6 respectively. It is true that, if depth is counted in terms of Definition 2 rather than Yngve's original Definition 1, then Table 4.2 shows that the SUSANNE data are logically compatible with a fixed maximum depth of 7. But to explain the figures of Table 4.2 in terms of a fixed depth limit is scientifically unsatisfactory, because it is too weak a hypothesis to account for the patterning in the data. To give an analogy: a table of the numbers of twentieth-century Europeans who attain various ages at death would, in the upper age ranges, show declining figures for increasing age until zero was reached at some age in the vicinity of 120. Logically this would be compatible with a theory that human life is controlled by a biological clock which brings about death at age 125 unless the person happens to die earlier; but such a theory would be unconvincing. In itself it fails to explain why we do not meet numerous 124-year-olds — to explain that we need some theory such as cumulative genetic transcription errors as cells repeatedly divide leading to increased probability of fatal maladies; and, if we adopt a theory of this latter kind, it is redundant also to posit a specific fixed maximum which is rarely or never attained. What we would like to do is to find some numerical property obeyed by the SUSANNE trees which is more specific than 'no depth greater than seven', which is invariant as between short and long sentences, and which predicts that the number of words at a given depth will decline as depth increases. In the following sections I address this issue in the abstract, prescinding from psychological questions about how human beings might produce or understand grammatical structures, and instead treating the set of observed SUSANNE parse-trees purely as a collection of shapes in which some invariant property is sought. The ratio of psychological theorizing to empirical description in this area has been rather high in the past, and the balance deserves to be redressed. Having found an empirical result, I shall not wholly refrain from speculation about possible processing implications, but these will be very tentative. The central aim of the work reported here is to establish the empirical facts, rather than to draw psychological conclusions.

5 Different ways of measuring left-branching
One possible invariant might be mean depth (in the Definition 2 sense) of the various words in a sentence. If there were no tendency to avoid left-branching, then mean word depth would be higher in long sentences than in
Figure 4.5
Figure 4.6
short sentences, because more words imply longer lineages between terminal nodes and root, and the lineages would contain left-branching as frequently as right-branching. Yngve's picture of a depth boundary that remains fixed however long a sentence grows suggests that mean word depth might be constant over different sentence lengths; this could be true despite the occasional incidence of words with unusually large depth figures. However, if we choose to compute the asymmetry of sentence structures by an averaging procedure over all parts of the tree, rather than by taking a single maximum figure, then averaging word depth is not the only way to do this. Two other possibilities present themselves. One could take the mean, over the nonterminal nodes, of the proportion of each node's daughters which are left-branching nodes - that is, which are themselves nonterminal and are not the rightmost daughter. Or one could take the mean, again over the nonterminal nodes, of the proportion of all words ultimately dominated by a node which are not dominated by the rightmost daughter of the node and are not immediately dominated by the node. Let us call these three statistical properties of a tree structure the depth-based measure, the production-based measure, and the realization-based measure respectively. A low figure for any of these three measures implies that a tree has relatively little left-branching. But the measures are not equivalent. Consider for instance the three six-leaf tree structures (A), (B), and (C) in Figure 4.7. By the depth-based measure, the most left-branching of the three structures is (A); by the production-based measure, the most left-branching is (B); by the realization-based measure, the most left-branching is (C). The respective scores7 are shown in Table 4.3. So far as I am aware, other methods of calculating degree of left-branching will assign a ranking to the various trees having a given number of leaf nodes that will be identical or near-identical to the ranking assigned by one of these three measures.
Figure 4.7

Table 4.3

                     (A)      (B)      (C)
Depth-based          1.50     1.00     0.67
Production-based     0.20     0.25     0.17
Realization-based    0.327    0.325    0.333
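Stated operationally, the three raw measures can be computed for a single tree as follows. This sketch uses the same toy representation as before and an invented example; the six-leaf trees (A), (B) and (C) of Figure 4.7 are not reproduced here.

```python
# Raw asymmetry measures for one tree: RD = mean Definition-2 word depth,
# RP = mean (over nonterminals) proportion of daughters that are
# left-branching, RR = mean (over nonterminals) proportion of dominated
# words that lie inside a left-branching daughter.

def raw_measures(tree):
    depths, branch_props, word_props = [], [], []

    def words_under(node):
        if isinstance(node, str):
            return 1
        return sum(words_under(c) for c in node[1])

    def walk(node, depth):
        if isinstance(node, str):
            depths.append(depth)
            return
        label, children = node
        left_branching = [c for c in children[:-1] if not isinstance(c, str)]
        branch_props.append(len(left_branching) / len(children))
        word_props.append(sum(words_under(c) for c in left_branching)
                          / words_under(node))
        for i, child in enumerate(children):
            bump = 1 if (not isinstance(child, str) and i < len(children) - 1) else 0
            walk(child, depth + bump)

    walk(tree, 0)
    return (sum(depths) / len(depths),
            sum(branch_props) / len(branch_props),
            sum(word_props) / len(word_props))

toy = ('S', [('NP', ['the', 'dog']),
             ('VP', ['chased', ('NP', ['the', 'cat'])])])
print(raw_measures(toy))    # (0.4, 0.125, 0.1) for this invented tree
```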
None of the three measures give figures for different trees which are directly comparable when the trees have different numbers of leaf nodes (i.e. dominate sentences of different lengths). An entirely right-branching tree, in which nonterminal nodes are always rightmost daughters of their mothers, will score zero by each of the three measures. But, for each of the measures, the score for an entirely left-branching tree will depend on sentence length. Writing w for the number of leaf nodes (words) dominated by a tree, there is a maximum score, expressible as a function of w, for the depth-based measure, for the production-based measure, and for the realization-based measure.
We might therefore normalize the measures to a common scale by dividing the raw figures by the appropriate one of these three quantities. The resulting normalized measures give us a meaningful way of comparing the positions occupied by sentences of any lengths on a scale from 1, for 'completely left-branching', to 0, for 'completely right-branching' (with respect to any one of the three definitions of asymmetry). I shall refer to the six resulting statistical measures of left-branching as RD, RP, RR, ND, NP, NR, for raw v. normalized depth-, production-, and realization-based measures. The question now is which, if any, of these six measures yields figures for structural asymmetry in English that show little variance with different lengths of sentence.

6 Incidence of left-branching by alternative measures

In order to answer this question I grouped the sentences of the modified SUSANNE Corpus into sets by length; for each set up to length w = 47 I computed the six asymmetry measures for the sentences in the set, and took their means. (The maximum length of sentences examined was fixed at 47 because, above this length, not all lengths are represented in the data by at least ten instances. Up to w = 47 the fewest instances of a sentence-length is 19 for w = 45.) For very short sentences the means display some patternless fluctuations, which is not too surprising: with few words and even fewer nonterminals to average over, one should perhaps not expect statistical measures of a tree's topological properties to be very informative.8 But the runs of figures from w = 7 up to w = 47 (covering a total of 5,963 sentences) display very clear trends, summarized in Table 4.4, which for each of the six measures gives the overall mean and standard deviation of the 41 individual
Table 4.4

         RD      ND       RP       NP       RR       NR
mean     0.73    0.067    0.094    0.20     0.10     0.12
s.d.     0.19    0.023    0.0038   0.0091   0.0075   0.020
r        0.96    -0.93    0.093    -0.61    -0.83    -0.88
means for different sentence lengths, together with the linear correlation coefficient r between sentence length and individual mean asymmetry figure. The measure closest to Yngve's concept, RD, shows a very strong positive correlation (r = 0.96) between length and depth: individual mean RD figures range from 0.38 for 8-word sentences up to 0.98 for 47-word sentences. Normalizing the depth measure merely reverses the sign of the correlation (r = -0.93): individual mean ND figures range between 0.136 for length 7 and 0.040 for length 41. By far the most consistent measure of left-branching is RP, which shows essentially no correlation with sentence length (r = 0.093). Mean RP figures for different sentence lengths cluster tightly (low standard deviation) round the overall mean of 0.094; the lowest individual mean is 0.084 for length 45, the highest is 0.102 for length 44. It is evidently RP which gives rise to the limited left-branching which Yngve took for an absolute bar on lineages containing more than a fixed maximum number of left branches. The normalized production-based measure of left-branching, and the realization-based measures, are not as precisely correlated with sentence length as the depth-based measures, but absolute correlation coefficients over 0.6 make it clear that these measures are not candidates for the invariant quantity adumbrated by Yngve. Individual means range from 0.22 (NP), 0.123 (RR), 0.189 (NR), for length 7, down to 0.17 (NP), 0.085 (RR), 0.094 (NR), for length 45. I do not suggest that the incidence of words at different Yngvean depths can be predicted purely from statistics on the average incidence of nonterminal and terminal daughters in individual productions. If that were possible, the figures of Table 4.2 would display a regularity that we do not find. Assuming that not only the proportion L of left-branching daughters but also the mean number b of daughter nodes per mother node, and the proportion R of rightmost daughters which are non-terminal, are constant for different sentence-lengths, then each figure in Table 4.2 ought to differ by a constant factor bL/(1 - R) from its predecessor. Even if the figures of Table 4.2 were not to hand, we would know that things are not that simple. The great majority of root nodes in the modified SUSANNE Corpus have the same label S, 'main clause', and the class of those productions which share some particular mother label will not in general contain the same proportion of left-branching daughters as found in all productions (the fact, recorded in
Table 4.2, that there are more depth 1 than depth 0 words in the corpus shows that productions having S to the left of the arrow have a relatively high proportion of left-branching daughters). Likewise the mean proportion of left-branching daughters for category labels which themselves occur on left-branching daughter nodes is very likely to deviate from the overall mean in one direction or the other. Considerations like these imply that we cannot predict an expected pattern of word depths against which Table 4.2 can be tested. But, once we know that the overall incidence of left-branching productions is a low constant frequency for sentences of different lengths, there is no need of further explanation for the fact that the figures in Table 4.2 dwindle to zero after the first few rows, and hence for Yngve's impression that depths above about 7 never occur in practice.9

7 Implications of the findings
From a language-processing perspective, the significance of the fact that RP is the invariant measure is that this is the one measure of asymmetry which depends purely on local grammatical facts. A context-free grammar with probabilities associated with alternative productions gives an invariant mean RP figure for sentences of different lengths; if any of the other five measures had proved to be invariant with sentence length, that would have implied some mechanism controlling global tree shape, separate from the class of allowable productions. Thus the finding may represent good news for computational tractability. Admittedly, even the invariance of RP might require an explanation in non-local terms, if the grammatical structures to be explained were to incorporate the singulary branching which was eliminated from the modified SUSANNE Corpus ((vi), pp. 43—4 above). For instance, if pronouns are introduced into clauses via rules which rewrite clause categories as sequences including the category 'noun phrase' at different points, and separate rules which rewrite 'noun phrase' alternatively as a pronoun or a multi-word sequence, then a probabilistic context-free grammar could not ensure that subjects are commonly pronouns and that multi-word noun phrases occur much more often at ends of clauses. But the grammar of English could be defined without singulary branching, by using rules in which, for instance, pronouns occur directly in the expansions of clause categories. It is interesting that the invariant measure is RP rather than NP. One interpretation of this finding might perhaps be that sentences are not in practice constructed by choosing the words they are to contain and then organizing those words into a suitable grammatical structure; rather, the grammatical structures are chosen independently of sentence-length considerations, and the expansion process terminates simply because productions having no nonterminals to the right of the arrow have a certain probability and hence will sooner or later be chosen. It is hard to accept that the consistent mean left-branching figure for English productions could be caused by a fixed limit to the number of items
held in the speaker's/writer's short-term memory, as Yngve argued: that mechanism would give invariant RD rather than invariant RP figures. If the language used low frequency of left-branching productions (that is, productions which add one to the Yngvean depth of the words ultimately dominated by their left-branching daughter node) as a strategy to avoid generating trees containing words deeper than some fixed limit such as 7, it would be a very inefficient strategy: most words would be at a depth much less than the limit, 'wasting' available memory, and even so there would occasionally be a violation of the limit. I suggest that fixed numerical limits may play little role in the psychological processing of language. It would be interesting to discover whether the different incidence of Yngvean depth found in languages such as Japanese and Turkish can equally be accounted for by left-branching production frequencies fixed at different language-specific values. We have seen that Yngve was right in saying that English grammatical usage embodies a systematic bias against left-branching constructions. But empirical evidence, of a kind that has become available only since Yngve published his hypothesis, suggests that the nature of that bias is rather different from what Yngve seems to have supposed. It is not that English enforces a left-branching depth maximum which is frequently reached but never exceeded. Rather, there is a specific probability of including a left-branching nonterminal category among the immediate constituents of a construction; this probability is independent of the wider sentence structure within which the construction is embedded, but because the probability is small the incidence of words at different depths becomes lower, and eventually vanishingly low, at greater depths.

Notes

1 Computer scientists' quantitative measures of tree structure (e.g. Knuth 1973: 451 ff.; Aho, Hopcroft and Ullman 1974: 167, Ex. 4.33) specify the extent to which a tree departs from perfect 'balance' where the paths between terminal nodes and root are all the same length: this affects the efficiency of algorithms which access data held in tree structures. These measures ignore the extent to which departures from balance occur in one direction rather than the other, which is the topic of this chapter but is not normally significant in a computing context.

2 The SUSANNE Corpus was produced by a project sponsored by the Economic and Social Research Council (UK), reference no. R000 23 1142, using the resource described in note 2 to Chapter 3, p. 35, developed earlier by Alvar Ellegard of the University of Gothenburg. The SUSANNE Corpus is distributed free of charge by anonymous ftp (URL 4). (Note that the URL given in Sampson 1995: 461 is out of date.)

3 Although SUSANNE contains only a fraction of the Brown Corpus material, if the latter is accepted as a 'fair cross-section' of the language, there is some reason to see SUSANNE as comparably representative: it contains equal quantities of prose from each of the four broad genre categories established by Hofland and Johansson (1982: 22-7) from objective evidence.
4 Likewise, provided one agrees that grammatical structure can be represented in terms of labelled trees, I believe it is not important for what follows whether one takes the trees to be defined by unitary phrase-structure rules, by separate immediate-dominance and linear-precedence constraints (as many contemporary theoretical linguists would prefer), or otherwise.

5 Misprinted as Cansitutional in the source text from which the Brown Corpus was compiled.

6 Note that the YC nodes dominating commas, being punctuation nodes, were eliminated from the modified corpus used in this study.

7 I illustrate the calculations for the case of tree (B). For the depth-based measure, the nonterminals having younger sisters are the two lowest, hence the depth (by Definition 2) of the leaf nodes in left-to-right sequence is 0, 2, 2, 1, 1, 0 — total 6, averaged over six leaves gives 1.00. For the production-based measure, the left-branching nodes are again the two lowest nonterminals, hence the proportion of left-branching daughters for the nonterminals in sequence from the root downwards is 0, 0.5, 0.5, 0: average 0.25. For the realization-based measure, the relevant proportions of words for the nonterminals in sequence from the root downwards are 0/6, 4/5, 2/4, 0/2: average 0.325.

8 Some of the short 'sentences' in the SUSANNE Corpus consist of material such as prices shown numerically which, like 'headings' (see p. 43), can scarcely be seen as representing natural language structure in the ordinary sense.

9 My discussion (like Yngve's) has assumed a phrase-structure representation of sentence grammar, in which all the words of a sentence are associated with terminal nodes of a tree structure, and nonterminal nodes are labelled with grammatical categories. It would be interesting to consider whether generalizations about depth in English would be affected if one chose a dependency representation of grammatical structure (Tesniere 1965), in which nonterminal as well as terminal nodes are associated with words, and the mother/daughter relationship between nodes represents the head/modifier rather than the whole/part relationship. A dependency tree is notationally equivalent to a phrase-structure tree in which one daughter of each non-terminal node is marked as head, so facts about depth in phrase-structure trees should be mechanically translatable into facts about dependency trees. But the respective statements would not necessarily be equally straightforward - it might be that the facts about depth in English are more naturally stated in terms of one notation rather than the other; and conceivably the availability of headship information in dependency trees could permit generalizations to be stated in a stronger form lacking a translation into phrase-structure notation. I have not pursued these issues.
5
Demographic correlates of complexity in British speech
1 Speech in the British National Corpus
Some utterances are structurally more complex than others. Undoubtedly all of us who speak English use a mixture of less complex and more complex utterances, depending on the communicative needs of the moment. But it may be that individuals differ in their average utterance complexity; and it may be that such linguistic differences between individuals correlate with demographic properties of the individuals, such as sex, age, or social class. In the 1960s and 1970s, Basil Bernstein (e.g. Bernstein 1971) claimed that there exist distinct English speech-codes, a restricted code and an elaborated code, characteristic of the working class and the middle class respectively. However, working at a time before good data resources were available, Bernstein necessarily based this argument on limited and rather artificial evidence; whether for that reason or because of changing political fashion, it seems fair to say that Bernstein's claim is not treated as gospel nowadays.1 The completion in 1995 of the British National Corpus has created new possibilities of studying such questions objectively. The British National Corpus (URL 5) is an electronic resource that provides a comprehensive sampling of the English language as used in the UK in recent years. Most of its contents (90 million words) represent written English, but it also includes 10 million words of transcribed speech. Within the speech section, furthermore, material totalling about 4 million words is 'demographically sampled': individuals selected by demographic techniques to constitute a fair cross-section of the population, in terms of region, class, age, and sex, recorded the speech events they happened to take part in during days that included working days and weekends. The speech section of the British National Corpus, though it has a number of flaws to be discussed below, is by far the most representative sampling of English speech that yet exists for any English-speaking country. Before the structural complexity of different speakers' utterances can be studied, the transcribed utterances must be equipped with annotations making their linguistic structure explicit. My CHRISTINE project (URL 6), begun in 1996, is creating a treebank of spoken English using samples
extracted from the British National Corpus and other resources, analysed according to the scheme of Sampson (1995), supplemented with additional conventions to represent the special structural features of speech, such as cases where speakers change their mind in mid-flow about what they want to say. The CHRISTINE Corpus, when complete, will contain about one hundred samples of spoken English, each about 2,000 words long. The CHRISTINE Corpus is intended to serve diverse purposes, and consequently some of its samples are drawn from sources other than the British National Corpus; but part of it consists of extracts drawn from random points in randomly chosen files in the demographically sampled British National Corpus speech section. The research findings discussed below are based on an incomplete version of this part of the Corpus, comprising annotations of 37 extracts.2 For a sentence to be 'simple' or 'complex' in traditional grammatical parlance refers to whether or not it contains subordinate clause(s); and the incidence of subordinate clauses intuitively seems an appropriate, quantifiable property of utterances for use as an index of speech complexity. English uses many types of subordinate clause to achieve greater logical precision than can easily be expressed without them. Relative clauses identify entities by reference to logically complex properties (compare e.g. 'the man who came to dinner on Tuesday' with 'that old man'); nominal clauses allow explicit propositions to play roles within higher-level propositions ('I know that she mistrusts Julian' v. 'I know it'); adverbial clauses express logically complex propositional modifiers ('I shall do it when my daughter gets here' v. 'I shall do it soon'); and so on. Of course, subordinate clauses can be used to hedge or add vagueness to what would otherwise be a blunt, clear proposition ('I shall come if nothing arises to prevent me'); but hedging is itself a sophisticated communicative strategy which skilled speakers sometimes need to deploy in order to achieve their purposes successfully. Generative linguists frequently point to grammatical recursion as a central feature distinguishing human languages from the finite signalling systems used by some other species, and clause subordination is the most salient source of recursion in English grammar. There may be other indices of structural complexity that one could choose to measure, but incidence of subordinate clauses is at least one very obvious choice. Grammatical complexity in this sense was in fact one of the linguistic features on which Basil Bernstein mainly based his theory of sociolinguistic codes (Bernstein 1971: ch. 5-6). But Bernstein's data on this feature, though quantitative rather than merely impressionistic, were drawn entirely from one small experiment in which ten middle-class and fourteen working-class schoolboys between 15 and 18 years of age were asked to produce wording to suit an artificial experimental task. Since better data were not available thirty years ago, it is no criticism of Bernstein to point out that his findings were far from representative of natural usage in the population as a whole, and that factors such as prior familiarity with the experimental task could have been as relevant as social class in creating the statistically significant
differences which Bernstein found in the language of his two groups. (Bernstein drew his subjects from just two schools, and found it necessary to train the working-class group to carry out the speaking task he set, because it was unfamiliar to them, whereas the middle-class group were used to doing similar tasks.) The British National Corpus gives us the possibility of looking at how a true cross-section of the national population use English spontaneously, in furthering the everyday purposes of their lives. Also, it allows us to look at demographic factors other than social class. We shall see that class is not the factor associated with the most interesting effects in the analysis discussed in the following pages.

2 Measuring speech complexity
In examining incidence of grammatical complexity in various speakers' usage, it is clearly important to measure complexity in a way that depends wholly or mainly on aspects of the analytic scheme which are reliable and uncontroversial. This means that the measurement should not refer to sentence units; when one transcribes recordings of spontaneous speech into ordinary orthographic form, there are frequent problems about placement of sentence boundaries. (For instance, it is often unclear whether successive main clauses should be seen as co-ordinated into a single compound sentence, or as separate sentences. The word and is not decisive; in speech it is sometimes omitted from clear cases of co-ordination, and often occurs as the first word of a speaker's turn.) The present research treats degree of embedding as a property of individual words. Each word is given a score representing the number of nodes in the CHRISTINE 'lineage' of that word (see p. 44 above) which are labelled with clause categories. Each speaker is then assigned an 'embedding index' representing the mean degree of embedding of the various words uttered by the speaker in the sample analysed. 'Words' for this purpose are those alphabetic sequences treated as word units by the rules of our annotation scheme. Enclitics such as the -'ll of he'll or the -n't of won't are treated as separate words, but the punctuation marks used by the British National Corpus transcribers are ignored, as are 'stage directions' such as indications of laughter or coughing, and 'ghost' (or 'trace') elements inserted by the annotator to mark the logical position of shifted or deleted elements. Grammatical idioms (Sampson 1995: §3.55) such as up to date, which are parsed as single words though written with spaces, are counted as one word each; and when a dysfluent speaker makes successive attempts to utter the same word, the sequence is counted as a single word.4 As an illustration, consider two utterances occurring in CHRISTINE text T11, which was recorded at Llanbradach, Glamorganshire, in January 1992. The utterances with their CHRISTINE parse-trees are shown in Figure 5.1. (The speakers are discussing a black tie which they lend between
Figure 5.1  T11.02616, speaker Jackie050; T11.02623-7, speaker Donald049
families to attend funerals. To preserve speakers' anonymity, the names included in CHRISTINE speaker identifiers, such as 'Donald049', are not the speakers' real names. Locations of examples from the CHRISTINE Corpus are given as text names followed after a full stop by five-digit source-unit codes.) In Figure 5.1, clausetag labels are shown in bold type. By the rules of the annotation scheme, the 'discourse item' well and the vocative Mrs in speaker Jackie050's utterance lie outside the main clause (S), hence these words are at embedding depth zero; each word within the main clause is at depth 1. The exclamation mark inserted by the British National Corpus transcribers is not counted as a word and hence not scored. The embedding index for this utterance would be 4 ÷ 6 = 0.67. (In the text as a whole, Jackie050 produces many other utterances; her overall index is 1.092.) In Donald049's utterance, the first main clause contains a relative clause (Fr) modifying time, and the second main clause has a co-ordinated main clause. The words of the
relative clause score 2 each (1 for the Fr node, 1 for the S node), but those of the second main clause score only 1 each (whether they are within the 'subordinate conjunct' or not - see note 4). The ghost element t147, showing that the relativized item is a Time adjunct, is ignored for depth-scoring; the repetition I # I (where the symbol # indicates the start of a repeated attempt to realize the same unit) is counted as a single scorable word. The mean index for this utterance is 21 ÷ 17 = 1.24. This method of scoring speech complexity gives relatively small numerical differences for intuitively large complexity differences. A clause at any particular depth of embedding requires clauses at all lesser depths of embedding to act as grammatical environments for it, so increasing the depth of the most deeply embedded clause in a sentence will not increase the mean index for the sentence proportionately. But the scoring method has the advantage of being robust with respect to those aspects of the annotation scheme which are most open to disagreement. If one scored speakers by reference, say, to mean number of subordinate clauses per sentence, large and unresolvable debates would arise about whether well in Jackie050's utterance ought or ought not to be counted as a 'separate sentence' from what follows. For the scoring system chosen, it makes no difference.
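In outline, then, the scoring procedure is: count, for each word, the clause-labelled nodes above it, and average over all of a speaker's scorable words. A minimal sketch, using the same kind of toy tree representation as in Chapter 4 and an invented utterance (the clause labels below are only a stand-in for the full CHRISTINE tag set), might look like this:

```python
# Per-word embedding depth = number of clause-labelled nodes in the word's
# lineage; a speaker's embedding index = mean depth over all scorable words.
# CLAUSE_TAGS is a hypothetical subset standing in for the CHRISTINE scheme.

CLAUSE_TAGS = {'S', 'Fr', 'Fn', 'Fa', 'Tg', 'Ti'}

def word_depths(tree, depth=0):
    """Yield (word, depth) pairs, where depth counts clause nodes above."""
    if isinstance(tree, str):
        yield (tree, depth)
        return
    label, children = tree
    bump = 1 if label in CLAUSE_TAGS else 0
    for child in children:
        yield from word_depths(child, depth + bump)

def embedding_index(utterances):
    """Mean clause depth over all words of a speaker's utterances."""
    depths = [d for utt in utterances for _, d in word_depths(utt)]
    return sum(depths) / len(depths)

# Invented example: 'well' lies outside the main clause and scores 0; words
# inside a relative clause (Fr) embedded in a main clause (S) score 2.
utt = ('Turn', ['well', ('S', ['that', "'s", ('Fr', ['what', 'I', 'said'])])])
print(round(embedding_index([utt]), 2))    # (0+1+1+2+2+2)/6 = 1.33
```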
3 Classifying the speakers

Leaving aside utterances which the British National Corpus transcribers were unable to assign to identified speakers, the material to hand represents 133 speakers who produced varying amounts of wordage, ranging from 1,981 words for speaker Diane127 down to 2 words for speaker Jill044. If very few words are recorded for a particular individual (perhaps an eloquent speaker who just happens to be represented in the sample only by a brief greeting, say), there is no possibility for that individual's embedding index to be high. It was desirable, therefore, to exclude low-wordage speakers; in order to decide a threshold, I divided the speakers into groups whose wordage was similar, and checked the group means of their members' embedding indices. Above 16 words, there was no tendency for lower-wordage speakers to have lower embedding indices. Accordingly I excluded only the 13 speakers who each produced 16 words or fewer from the analyses that follow. The remaining 120 speakers are represented by a total of 64,726 words, mean 539.4 words per speaker. The grand mean of the 120 speakers' embedding indices is 1.169. All but two of the individual means fall within the range 0.66 to 1.71; the outliers are 1.980 for speaker Jill136, a female lecturer (age unknown), and 0.146 for speaker Scott125, a 1-year-old boy. The British National Corpus includes demographic information for each of these speakers. In principle, for each speaker the corpus aims to record sex, age, regional accent/dialect, occupation, social class in terms of the Registrar-General's classification based on occupation (Office of Population Censuses and Surveys 1991),5 and relationship to the other participants in
the conversation. However (understandably, given the large size of the corpus), this information is often far from perfect. For some speakers some categories of information are missing; in other cases the information given is clearly erroneous. For instance, speaker Gillian091 is specified as a doctor by occupation and as belonging to social class DE (partly skilled or unskilled); these statements are incompatible. The British National Corpus demographic information might be described as detailed but unreliable, whereas for present purposes we want information that is coarse but reliable. Thus, the corpus classifies dialects within England in terms of a system derived from (though not quite identical to) that of Trudgill (1990: 65), who divides England into 16 linguistic regions. For present purposes, 16 regions for England (together with categories for the other UK nations) is too refined a classification. Many regions are unrepresented or represented by very few speakers in the sample; and it strains credulity to think that there might be a consistent difference in speech complexity between Humberside and Central Northern England, say, though it is not inconceivable that there could be differences between Northern and Southern England, or between Northern England and Scotland with its separate education system. Because the CHRISTINE Corpus contains only a small subset of the full British National Corpus, and research such as that presented here needs only a coarse demographic classification, we were able to some extent to correct the information in the corpus, in consultation with its compilers and from internal evidence. (Details on the correction process are given in the CHRISTINE documentation file, URL 7.) Sex and age information was assumed to be correct. Speakers were assigned to regions primarily on the basis of the place where the conversation was recorded, except for rare cases where this datum was incompatible with a speaker's British National Corpus dialect code and there was internal evidence that the latter was correct. British National Corpus social class codes were adjusted in terms of information about occupation or spouse's occupation, since this relatively specific information is more likely to be accurate (thus speaker Gillian091, as a doctor, had her code changed from DE to AB). Even with these adjustments, social class is unquestionably the variable for which the information is least satisfactory, and 33 speakers remained uncategorized for social class. The research reported here used a five-way regional classification:

Southern England
Northern England
Wales
Scotland
Northern Ireland6
ordinarily called the Midlands is included in 'Northern England'.7 The research used a four-way social classification derived from the Registrar-General's scheme:

AB   professional, managerial, and technical
C1   skilled non-manual
C2   skilled manual
DE   partly skilled and unskilled
There are further sources of inaccuracy in the British National Corpus. The tapes were transcribed by clerical workers under time pressure; sometimes they misheard or misunderstood a speaker's words. (This is indisputable in a case where a speaker reads from the Bible and the transcribed words can be compared with the original, but it is morally certain in some other cases where the transcribed words make no sense but phonetically very similar wording would make good sense, e.g. unless you've low and detest children at T29.09621 must surely stand for unless you loathe and detest children); furthermore some speaker turns are assigned to the wrong speaker, as revealed for instance when a speaker appears to address himself by name. In the CHRISTINE Corpus, such inaccuracies are corrected so far as is possible from internal evidence (with logs of the changes made to the original data); undoubtedly there remain cases where corrections should have been made but have not been. This may seem to add up to a rather unsatisfactory situation. But the incidence of error should not be exaggerated; in transcriptions of almost 65,000 words by 120 speakers I believe the erroneous data are few relative to what is correct, and in any case we have no alternative data source that is better or, indeed, nearly as good. Perhaps more important, although there are errors in the data, they should be errors of a 'fail-safe' type. The purpose of the present work is to look for significant correlations between embedding indices and demographic variables, and sporadic errors are likely to make such correlations harder to find: it would be a strange coincidence if errors conspired to create significant correlations where none exist in reality. So it seems worth proceeding despite the imperfections of the data.

4 Demographics and complexity indices compared
I searched for correlations with each of the four demographic variables (region, sex, class, age) by grouping the individual speaker indices into categories for the relevant variable (omitting speakers for whom the relevant data were missing), and applying a statistical test to check whether any differences found among the embedding-index distributions for the different categories were significant. The test statistic I used was the F statistic (e.g. Mendenhall 1967: ch. 12). This is suitable for comparing variance among more than two categories, and takes into account the different
numbers of individual data-points in different categories. (This last point is important. For some of the demographic variables examined, the data contain very different numbers of speakers in different categories. As one would expect from the respective populations, for instance, our data contain far fewer speakers from Northern Ireland than from Southern England.)

The 'region' variable makes a good initial illustration of the analytic method. For the five regions distinguished, Table 5.1 shows the means of the embedding indices for speakers from that region, their standard deviations, and the number of speakers from each region. The last column totals 119, because no regional information was available for one of the 120 speakers in the data. (As one would expect with a random sample, speaker numbers are not precisely proportional to regional populations; by chance Scotland is noticeably under-represented in our material.)

Table 5.1
Region              Mean    s.d.    N
Southern England    1.182   0.232   50
Northern England    1.173   0.252   49
Wales               1.120   0.207    8
Scotland            1.261   0.224    4
Northern Ireland    1.105   0.124    8

On the basis of the means alone, one would say that Scots' English is slightly more grammatically complex, and the English of Northern Ireland slightly less, than average. But the F statistic computed from these data is 0.415, which corresponds to a significance level of p = 0.798. In other words, supposing that there is in reality no difference between the regions of Britain with respect to the grammatical complexity of speech, then the probability of finding at least this degree of difference among the sample distributions, merely as a consequence of chance fluctuations, is almost four in five. It is much more likely than not that chance would throw up at least this much difference between the categories. So - what will probably surprise few readers - we conclude that there is no evidence for regional complexity differences. In this particular data-set, the Scots happened to score the highest average, but in a fresh sampling it is just as likely that any of the other regions would come out ahead.

Similarly, contrary to what Bernstein might lead us to expect, the CHRISTINE data fail to support a correlation between speech complexity and social class. The data are shown in Table 5.2. (In this case the speaker numbers total only 87, because - as we have seen - 33 speakers are unclassified for social class.) The F statistic is 1.065, giving p = 0.368. This probability is lower than in the case of the region variable, but is still greater than one in three. A difference in group sample distributions with a probability of more than one in three of occurring by chance would not be seen as evidence of a genuine difference among the group populations.
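Group comparisons of this kind can be reproduced with standard statistical software. The following is a minimal sketch, not the code used for the study: it assumes Python with scipy is available, and it uses invented per-speaker embedding indices rather than the CHRISTINE figures.

```python
# Illustrative only: hypothetical per-speaker embedding indices grouped by region,
# not the CHRISTINE data. One-way analysis of variance asks whether between-group
# differences in means are larger than chance variation within groups would predict.
from scipy import stats

groups = {
    "Southern England": [1.05, 1.22, 1.31, 1.10, 1.18],
    "Northern England": [1.12, 1.25, 1.09, 1.20],
    "Scotland":         [1.30, 1.21, 1.28],
}

f_stat, p_value = stats.f_oneway(*groups.values())
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
# A large p (say, above 0.05) means differences of this size could easily arise
# by chance in samples drawn from a single homogeneous population.
```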
Table 5.2
Social class    Mean    s.d.    N
AB              1.158   0.326   25
C1              1.168   0.178    9
C2              1.257   0.227   20
DE              1.142   0.136   33
That does not, of course, mean that our data prove that there is no link between social class and speech complexity. It merely means that the data do not show that there is such a link. It is perfectly possible that Bernstein's theory might be broadly correct, yet the effect fails to show up in the CHRISTINE data. There are many reasons why that might be so; perhaps the most obvious is that, as we have seen, social class is the least reliably recorded demographic variable. If these data were more complete and accurate, it might be that the figures would yield a correlation supporting the 'restricted v. elaborated code' idea. But the data we actually have tell us nothing of the sort.

For the sex variable, the results are more interesting; see Table 5.3 (three speakers' sex unknown). Here, F = 4.052, giving p = 0.0465. Conventionally, a probability of less than one in twenty is taken as good reason to think that a statistical phenomenon is real rather than a random effect. It seems that, systematically, British females may on average produce slightly more complex utterances than British males. (However, we shall see below that this inference is not straightforward.)

Overwhelmingly the most significant finding relates to the age variable, where I grouped the speakers into categories using the six age bands of the British National Corpus coding. (For most speakers, the British National Corpus gives exact age in years, but it also assigns broader age-band codes.) The figures are given in Table 5.4 (six speakers' ages are unknown). The F statistic from these figures is 5.493, giving p = 0.000152. In other words, there is a less than one in 5,000 probability that these different means have arisen by chance in samples drawn from a population that is homogeneous with respect to grammatical complexity.
Table 5.3
Sex       Mean    s.d.    N
male      1.126   0.263   55
female    1.213   0.202   62
Table 5.4
Age band          Mean    s.d.    N
up to 15 years    0.941   0.251   18
16-24             1.169   0.188   18
25-34             1.171   0.169   27
35-44             1.226   0.186   12
45-59             1.225   0.219   20
60 and above      1.257   0.194   19
Of course, there is nothing surprising in finding a correlation between age and speech complexity in a set of speakers which includes children as young as 1 and 2 years old. The largest gap between means of adjacent age bands is between the two youngest bands - although the means continue to rise, more slowly, in the higher age bands (except for the near-identical means for the 35-44 and 45-59 bands). It might be that these figures represent a growth of speech complexity from zero at birth to a value characteristic of mature speakers, followed by a series of figures representing random fluctuations round a constant mean in adults of all ages. But there is an alternative interpretation.

Before turning to that, since it is at least clear that children's speech is on average markedly less complex than adults', it seemed worth repeating the F statistic calculations for the other variables, omitting speakers in the youngest age band. This is a crude way of attempting to eliminate variation due to age from the figures for variation with other variables; but, since the precise nature of the variation due to age is unclear and, as I shall show below, contentious, it is probably the best that can be done in practice.

For the region variable, after eliminating under-16 speakers it remains true that there are no significant differences; the five regional means change marginally (I do not show the modified figures here), but the F statistic 0.643 corresponds to p = 0.633.

For the sex variable, the significant difference noted above melts away (Table 5.5). Here, F = 1.130, p = 0.29. A difference with an almost three in ten probability of arising by chance would not normally be seen as significant. It seems that the appearance of a significant difference between the sexes in connexion with Table 5.3 earlier may have stemmed from the accident that the sample includes more male than female children.
Table 5.5
Sex       Mean    s.d.    N
male      1.189   0.220   44
female    1.234   0.197   55
Table 5.6
Social class    Mean    s.d.    N
AB              1.247   0.267   18
C1              1.168   0.178    9
C2              1.300   0.196   18
DE              1.152   0.135   31
On the other hand, with under-16s excluded, the distribution of embedding indices across social classes almost attains significance at the p < 0.05 level (Table 5.6). For Table 5.6, F = 2.475, p = 0.068. The figure 0.068 is a little larger, but only a little larger, than the threshold 0.05 which is conventionally seen as the point where one ceases to dismiss observed differences as chance fluctuations and starts to believe that they represent real differences between the populations. However, the direction of the differences is very different from what either Bernstein or common sense would predict: anyone who believes in complexity differences between the speech of different social classes would surely expect the relationship to be AB > C1 > C2 > DE - not C2 > AB > C1 > DE. It is difficult to know what, if anything, to make of this finding. We may remind ourselves, again, that social class is the variable for which the data are least reliable.
5 'Critical period' or lifelong learning?

So far, in sum, we seem to have established that children's speech is on average less complex than adults', which is to be expected, and that no demographic variable other than age shows reliable correlations with speech complexity. But let us look at the age variable in more detail.

One leading idea in the linguistics of recent decades has been that there is a 'critical period' for first-language acquisition: human beings have an innately programmed 'language-acquisition device' governed by a biological clock which causes them to be receptive language-learners for a number of years during childhood, but which then switches off so that any language-learning that takes place in later years is a relatively halting, unnatural process, controlled by general psychological problem-solving mechanisms rather than by an efficient special-purpose language-acquisition device. This is supposed to explain why people who learn a second language after early childhood, for instance as a secondary-school subject, typically master it very indifferently (while a child exposed to two languages in early years, for instance as a member of an expatriate family, may grow up bilingual), and also why 'wild children' who for some reason are isolated from all language experience during the childhood years (such
as the well known, tragic case of 'Genie', Curtiss 1977) are allegedly never able to make up for lost time if society first discovers them and attempts to help them learn to speak after their 'critical period' has expired. This idea was introduced into the mainstream of thinking about language by Eric Lenneberg (1967); for a recent exposition, see for instance Pinker (1994: 37-8, 290 ff.). The facts are controversial; I have argued elsewhere (Educating Eve, Sampson 1999a - see index references for 'critical period') that Lenneberg's and others' support for the critical-period concept is often based on false premisses. But, at present, the majority of linguists probably adhere to the 'critical period' picture of the language-acquisition process.

According to this picture, human lives (unless cut short prematurely) are divided into two sharply different parts with respect to language: an early period when the human is a language-learner, and a later period when he or she has ceased to be a learner and has become a mature language user. As Noam Chomsky puts it (1976: 119), the child attains 'a "steady state" ... not changing in significant respects from that point on'. If one asks when the steady state is reached, Lenneberg (1967) gives the age of 12 (in diagrams on pp. 159ff.) or 'about thirteen' (p. 153); in several passages, rather than quoting an age in years, he links the switching-off of the language-acquisition device to puberty.8

This picture of an individual's linguistic ability as developing on a rapidly rising trend for ten to thirteen years and then levelling out for the remainder of life is sharply at variance with views held by eminent linguists of earlier decades. According to Leonard Bloomfield (1933: 46), 'there is no hour or day when we can say that a person has finished learning to speak, but, rather, to the end of his life, the speaker keeps on doing the very things which make up infantile language-learning'. Fifty years earlier, W. D. Whitney wrote (1885: 25), 'We realize better in the case of a second or "foreign", than in that of a first or "native" language, that the process of acquisition is a never-ending one; but it is not more true of the one than of the other.' These writers saw the trend of linguistic ability as continuing upwards throughout life, with no sudden flattening out.

Learning to use grammatical subordination devices is one important part of learning to use one's mother tongue, so the CHRISTINE data might help to choose between these alternative conceptions of language-learning. Do they show a change from growth to steady state at puberty? Figure 5.2 displays graphically the means of Table 5.4 above, but with the 'up to 15' age band divided into four narrower bands. For the age band 9-12 there is a complication: one 12-year-old speaker, Marco129, has an extremely low embedding index (0.663 - lower than the index of any other speaker in the data except for Scott125, a 1-year-old); because numbers of speakers in these narrow bands are small, a single outlier has a large impact on the overall average. The cross in the 9-12 column of Figure 5.2 gives the mean including Marco129; the circle gives the mean omitting Marco129. (This could be appropriate, if Marco129 were in some way abnormal. We have no real information about that, but this speaker's CHRISTINE identifier
uses the name 'Marco' to reflect the fact that his true name sounds foreign; put together with the fact that the relevant conversation was recorded in London, this suggests a fair possibility that Marco129 may be a non-native speaker.9)

[Figure 5.2]

To the eye, Figure 5.2 looks more like a graph of 'lifelong learning' than of childhood learning followed by a steady state. Note that - even with the version of the graph that excludes Marco129 - the largest single jump between adjacent age bands is from the 9-12 to the 13-15 band: that is, immediately after the steady state has allegedly been reached. (With Marco129 included, this jump would be far larger.)

One would like objective, statistical confirmation of this appearance of lifelong growth in speech complexity. Inferences drawn from Figure 5.2 might be challenged as misleading, for instance because the graph ignores the fact that the early age bands are narrower than the later ones. (It is widely regarded as a truism that learning, of all kinds, is a more concentrated, rapid activity in childhood than in maturity, so arguably it is quite appropriate for age bands of a few years in childhood to be represented on a par with bands of a decade or more of adult life; but the impression produced by a display like Figure 5.2 is so heavily dependent on a variety of graphic imponderables that inferences from it cannot be regarded as reliable.) Accordingly I examined the correlation between embedding index and age in years (ignoring age bands) among those individual speakers who, in terms of their age, should have passed the putative 'critical period' for language-acquisition. The question now is: how probable is it that the
appearance of upward trend in the distribution of mature individuals' embedding-index/age values would occur as a chance fluctuation in a sample, if mean embedding index in the population from which the sample was drawn is invariant with age?

The first decision here is what age to use as a fair cut-off to exclude speakers who might still be within their alleged 'critical period'. We have seen that, writing in the 1960s, Eric Lenneberg identified the end of the critical period variously as 12 or 13 years, or as puberty. Age of puberty has been dropping in subsequent decades; recent figures for age at onset of puberty in Britain are:10

males: average, 11.5 years; range including 2.5 s.d.s, 9-14 years
females: average, 10.5 years; range including 2.5 s.d.s, 8-13 years

If Lenneberg counted individuals of 13 years and above as beyond the critical period in 1967, we shall surely be safe in using that age as the cut-off in the 1990s; accordingly I examined the data for speakers aged 13 or over (there were 100 such speakers in the data). The line of best fit to this sample of 100 embedding-index/age points has intercept 1.125, slope 0.00191, that is a gentle upward trend in embedding index with increasing age. (There is of course much variance round the line of best fit; s = 0.193.) In order to test the null hypothesis that the line of best fit to the population from which the sample is drawn has zero slope, I computed the Student's t statistic (e.g. Mendenhall 1967: 232-3); t = 1.813. This comfortably exceeds the critical value for the p < 0.05 significance level (though it does not attain the p < 0.025 level). In other words, these figures give us grounds for saying that (while the evidence is not overwhelming) increase in average grammatical complexity of speech appears to be a phenomenon that does not terminate at puberty, but continues throughout life. As Figure 5.2 suggests, not only do people around twenty produce more grammatical complexity than people in their early teens, but over-sixties produce more complexity than people in their fifties and below.

Supporters of the 'critical period' picture of language-acquisition, if they did not simply dismiss these figures as a statistical fluke (the only answer to which would be to gather more material and hope for higher significance levels), might respond by saying that their concept of a language-acquisition device which switches off at puberty does not imply that grammatical complexity ceases to grow after the switch-off. It is true that average grammatical complexity of utterances is not a property which has featured very heavily in the 'critical period' literature, so far as I know. Proponents of that concept tend to focus more on individual linguistic constructions than on statistics of usage of an overall grammatical system. But learning the individual constructions of one's mother tongue, and learning to make fuller use of the system of constructions one has encountered to date, are both part of what most people would understand by 'language-acquisition'.
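The slope test described above can be sketched in the same illustrative style. Again this assumes Python with scipy, and the (age, embedding index) pairs below are invented rather than taken from the 100 CHRISTINE speakers.

```python
# Illustrative only: hypothetical (age, embedding index) pairs for speakers past
# the putative critical period. linregress fits a least-squares line and reports
# the two-sided p-value for the null hypothesis that the population slope is zero.
from scipy import stats

ages    = [13, 17, 22, 25, 31, 38, 44, 52, 58, 63, 71]
indices = [1.10, 1.15, 1.12, 1.18, 1.16, 1.22, 1.19, 1.24, 1.21, 1.27, 1.25]

result = stats.linregress(ages, indices)
t_stat = result.slope / result.stderr   # Student's t for the slope
print(f"slope = {result.slope:.5f}, t = {t_stat:.3f}, p = {result.pvalue:.4f}")
# A positive slope with a small p-value suggests that mean complexity goes on
# rising with age, rather than levelling off after childhood.
```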
If we doubt whether findings of lifelong growth in complexity of usage are relevant to the 'critical period' hypothesis, we should consider how believers in the critical period would have responded to the opposite finding. If Figure 5.2, rather than displaying a continuing upward trend, had shown an upward slope to the age of puberty, followed by a horizontal trend from puberty to the end of adult life, it will be clear to anyone familiar with the 'critical period' debate that this would have been seized on as confirmation of the theory. It is always open to proponents of a scientific theory to respond to adverse evidence by reinterpreting the theory so that it makes no prediction about just those points where negative evidence has emerged. The penalty is that this procedure converts true science into pseudoscience - the evolving theory becomes what Imre Lakatos (e.g. 1970: 118) calls a 'degenerating problemshift', which reacts to cumulations of new evidence not by increasing its empirical scope but by defensively shutting itself off from possibilities of refutation. If the concept of an age-bounded innate 'language-acquisition device' is to be taken seriously as a scientific hypothesis, the findings discussed here should be admitted as at least prima facie counter-evidence.

6 Individual advance or collective retreat?
But these findings are of interest from other perspectives too. There is social and educational significance in the discovery that people seemingly advance in terms of the structural, logical richness of their spontaneous speech habits as they progress from youth through middle age towards old age.

It is important, therefore, to note that there is an alternative possible interpretation of the upward slope of Figure 5.2, and our data do not at present allow us to determine which interpretation is correct. The data give us a snapshot of the usage of people of different ages at a single period, the early 1990s. I have been assuming so far that data gathered in the same way several decades earlier or later would show an essentially similar picture; but it might not. Possibly, what the figures are telling us is that people who were born about 1930, and hence were in their sixties when their speech was collected for the British National Corpus, have throughout their adult lives spoken in a grammatically more complex manner, on average, than (say) people who were born about 1970, who were in their twenties when the corpus was compiled. Changing patterns of schooling, and/or cultural shifts from written to visual media, might conceivably have led to this sort of change in speech styles.

To me it seems more plausible that the upward trend of Figure 5.2 represents a lifelong-learning effect which is repeated generation after generation, than that it represents a historical change in the nature of British speech. But many others to whom I have presented the data have found the second interpretation more plausible. At present, there is no way to know which is right. That would require a comparable body of data, gathered at a period at least
two or three decades earlier or later than the British National Corpus. It is too late now to regret that no 'fair cross-section' of British speech was sampled earlier than the 1990s. (Such earlier speech corpora as do exist are too unrepresentative, socially and otherwise, to offer a meaningful comparison.) In about the year 2020, we may find out whether Britons are individually growing subtler, or collectively growing cruder, in the structure of their speech. Until then, we can only wait and wonder.

Notes

1 Ammon 1994 offers a recent perspective on the theory of sociolinguistic codes originated by Bernstein, concluding that the theory 'deserves to be formulated and tested more rigorously than has been done so far'.

2 The files used were CHRISTINE texts T01 to T40, omitting T03, T06, and T12 (which were not available at the time the research reported here was carried out). The partial version of the CHRISTINE Corpus which was subsequently released in August 1999 incorporates a number of additional corrections (of the kinds discussed in the documentation files) to British National Corpus data on issues such as assignment of individual speech-turns to particular dialogue participants, or social-class categorization of individual speakers; this means that some of the figures reported below would be slightly different, if recalculated from the August 1999 release, but it is unlikely that the differences would be great enough to modify the conclusions drawn here.

3 In the technical terms of our annotation scheme (Sampson 1995: §4.41), a node-label is reckoned as a clause label if it begins with one of the capital letters S F T W A Z L (not immediately followed by another capital).

4 Co-ordinate structures are treated in a special way. The SUSANNE scheme annotates co-ordinations in a non-standard (though convenient) manner (Sampson 1995: 310ff.): second and subsequent conjuncts within a co-ordination are treated as constituents subordinate to the first conjunct, thus a unit X and Y is given the structure [X [and Y]]. Since embedding counts for present purposes ought not to depend on a contentious feature of the annotation scheme, the count is incremented by one, not by two, for cases where a clause node occurs as a 'subordinate conjunct' below a higher clause node. In terms of the SUSANNE scheme, when the label of a node on the path between word and root is a clause tag by the definition above, and includes one of the symbols + - @, the node immediately above it on the path is ignored for purposes of the depth count.

5 This has recently been superseded by a revised social classification scheme, but the 1991 scheme was the one current during the years when the British National Corpus material was compiled.

6 Strictly, this category should be labelled simply 'Ireland'; as well as material recorded within Northern Ireland, the data include utterances by an Irish speaker living in England, who may have come from the Republic.

7 This five-way regional classification is a simplification of the CHRISTINE Corpus regional classification: 'Southern England' in the research of this chapter subsumes the CHRISTINE 'South East' and 'South West' regions, and 'Northern England' here subsumes CHRISTINE 'Northern England' and 'Midlands' regions.

8 For completeness it should be mentioned that Lenneberg's diagrams suggest that
a linguistic biography has three parts, not just two, because they show language-acquisition as beginning at 2 years of age. More recent writers have paid less attention to the idea that language-acquisition has a well defined beginning than that it has a well defined endpoint.

9 The Times of 22 January 2000 reported research by Philip Baker and John Eversley showing that London is currently the world's most linguistically diverse city, with only two-thirds of schoolchildren speaking English at home.

10 I am indebted for these data to Dr G. H. Stafford of the Conquest Hospital, Hastings.

11 This alternative would be no more compatible than the original interpretation, so far as I can see, with the 'critical period' theory; genetically determined linguistic behaviour should be constant across the generations.
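Notes 3 and 4 above specify how clause nodes are identified, and how co-ordinate conjuncts are treated, when a word's embedding depth is counted. The following sketch is a paraphrase of those two notes in code form, not the project's own software; the label path at the end is invented rather than taken from CHRISTINE.

```python
# Illustrative paraphrase of notes 3 and 4, not the code used for the study.
# Given the node labels on the path from a word up to the root of its parse tree,
# count clause nodes, but ignore the parent of any clause node whose label marks
# it as a subordinate conjunct (contains +, - or @), so that a co-ordination
# adds only one level of embedding rather than two.
CLAUSE_INITIALS = set("SFTWAZL")

def is_clause_label(label: str) -> bool:
    # A clause label begins with one of the letters above,
    # not immediately followed by another capital.
    return (len(label) > 0 and label[0] in CLAUSE_INITIALS
            and not (len(label) > 1 and label[1].isupper()))

def embedding_depth(path_labels):
    """path_labels: node labels from the word's parent up to the root."""
    skip = set()
    for i, label in enumerate(path_labels):
        if is_clause_label(label) and any(c in label for c in "+-@"):
            skip.add(i + 1)          # ignore the node immediately above
    return sum(1 for i, label in enumerate(path_labels)
               if i not in skip and is_clause_label(label))

# Hypothetical label path for one word (invented, not a CHRISTINE example):
print(embedding_depth(["Fr", "S+", "S", "Fa", "S"]))
```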
6 The role of taxonomy
1 A neglected priority
The kind of discoveries about language which have been discussed in preceding chapters can be made only if we have a scheme for representing the grammatical structure of language samples in a detailed, consistent fashion. One might imagine that developing such a scheme would have been a high priority for scientific linguistics. No empirical science is likely to be able to advance very far without a set of recognized standards for classifying and recording its data. Surprisingly, this kind of taxonomizing has not been seen as a priority in our field. I was able to conduct these investigations only because our group has spent a great deal of our time since the early 1980s developing a comprehensive, explicit scheme for annotating the structures of written and spoken English. When we began, there was neither an existing published scheme available, nor even much in the way of informal, tacit agreement about how to draw parse-trees. Yet, without that, it was impossible to compile databases of structurally analysed language material and extract meaningful statistics from them. In the absence of detailed, explicit analytic guidelines, successive examples of one and the same English construction would be analysed now this way, now that, and quantitative data distilled from an analysed corpus would be quite meaningless.

For many linguists working in the period before computers became a standard research tool, 'taxonomy' was not merely a low priority but almost a dirty word - see for instance Chomsky's use of the term 'taxonomic model' (1964: 11), or Jerrold Katz's comments (1971: 31ff.) on linguistics as what he called 'library science'. But then, without computers, it would have been difficult to exploit significant quantities of data about structural usage, even if there were taxonomic schemes allowing such data to be compiled in a consistent way. What is less easy to understand is the fact that structural taxonomizing remained a low priority even after the growth of computational linguistics, when, objectively, the need for it became acute.

My aim in this chapter, rather than discussing particular findings that have emerged from empirical linguistic research, is to argue for a revision of priorities on the part of those engaged in the research. Empirical linguists
need to recognize taxonomy as a valuable activity in its own right, entitled to its share of research effort and resources, and not see it as merely an uninteresting and fairly trivial preliminary to the work of making discoveries about language structure or developing software to execute human language-processing tasks.

2 Software engineering versus programming
Scholars who are interested in linguistic description and analysis as a purely academic activity (what one might call 'pure' linguists) are only now beginning to exploit the possibilities opened up by the availability of electronic corpora. Many of them use these resources for studies in which grammatical structure is not a central consideration (for instance, research on different speakers' use of individual words). Other 'pure' linguists have theoretical reasons, related to those which led Chomsky and Katz to make the comments quoted above, for placing little importance on taxonomic research. But the majority of researchers who are harnessing computing technology to the task of analysing natural languages nowadays are not engaged in 'pure' linguistics, but in what has come to be called 'language engineering': they are developing software systems to execute industrially or socially useful language-processing functions, such as automatic translation, or extraction of information from large natural-language databases.1 Language engineers have no theoretical commitments that would discourage them from taking taxonomy seriously, and the nature of the processing functions normally means that they have no possibility of focusing on individual words to the exclusion of grammatical structure. So it is truly surprising that structural taxonomy is not more salient than it is at present on the agenda of language engineering.

The explanation for this failure of vision, I believe, is that natural-language computing has not yet learned, or has only partly learned, certain general lessons about how to harness the potential of computers, which those involved with more central applications of information technology (IT) learned (painfully and gradually) at an earlier stage in the short history of the computer. The aim of this chapter is to suggest that natural-language computing at present needs to take on board, more fully than it has done up to now, lessons which the wider IT profession learned some twenty to thirty years ago.

The lessons I have in mind were those that led to the creation of the discipline of software engineering, which is nowadays a fundamental component of the training of computing professionals. Let me quote historical remarks from two standard textbooks:

The term 'software engineering' was first introduced in the late 1960s at a conference held to discuss what was then called the 'software crisis'. ... Early experience in building large software systems showed that existing methods of software development were not good enough. Techniques applicable to small systems could not be scaled up. Major projects were sometimes years late, cost much more
than originally predicted, were unreliable, difficult to maintain and performed poorly. Software development was in crisis. (Sommerville 1992: 3)

In the middle to late 1960s, truly large software systems were attempted commercially. ... The large projects were the source of the realization that building large software systems was materially different from building small systems. ... It was discovered that the problems in building large software systems were not a matter of putting computer instructions together. Rather, the problems being solved were not well understood, at least not by everyone involved in the project or by any single individual. People on the project had to spend a lot of time communicating with each other rather than writing code. People sometimes even left the project, and this affected not only the work they had been doing but the work of the others who were depending on them. Replacing an individual required an extensive amount of training about the 'folklore' of the project requirements and the system design. ... These kinds of problems just did not exist in the early 'programming' days and seemed to call for a new approach. (Ghezzi, Jazayeri and Mandrioli 1991: 4)
There are different ways of glossing the term 'software engineering', but one way of explaining the concept in a nutshell might be to call it a systematic training of computing professionals in resisting their natural instincts. For most individuals who are attracted to working with computers, the enjoyable aspect of the work is programming, and running one's programs. Writing code, and seeing the code one has written make things happen, is fun. (It is fun for some people, at any rate; it leaves others cold, but those others will look elsewhere for a career.) Even inserting comments in one's code feels by comparison like a diversion from the real business; programmers do it because they know they should, not out of natural inclination. As for documenting a finished software system on paper, that is real punishment, to be done grudgingly and seeming to require only a fraction of the care and mental effort needed in coding, where every dot and comma counts. What is more, these instincts were reinforced in the early years by the instincts of information technology managers, who wanted objective ways of monitoring the productivity of the people under them, and quite inevitably saw lines of code per week as a natural measure. These instincts seem to be widely shared, and they were often harmless in the early years, when software development was a small-scale, craft-like rather than industrial process where all the considerations relevant to a particular system might reside in a single head. They led to crisis once the scale of software projects enlarged, and required teamwork and integrity of software operation under different conditions over long periods of time. Software engineering addresses that crisis by inverting computing professionals' instinctive scale of values and sequence of activities. Documentation, the dull part, becomes the central and primary activity. Developing a software system becomes a process of successively developing and refining statements on paper of the task and intended solution at increasing levels of detail - requirements definitions, requirements specifications, software
specifications; so that the programming itself becomes the routine bit done at the end, when code is written to implement specifications of such precision that, ideally, the translation should be more or less mechanical - conceptual unclarities that could lead to faulty program logic should be detected and eliminated long before a line of code is written. Gerald Weinberg (1971) argued for a culture of 'egoless programming', which systematically deprives computing professionals of the pleasures of individual creativity and control over the programs for which they are responsible, as a necessary price to be paid for getting large systems which work as wholes.

Nobody suggests that now that we have software engineering, all the problems described thirty years ago as 'software crisis' have melted away and everything in the software development garden is rosy. But I think few people in the IT industry would disagree that the counter-instinctive disciplines of software engineering are a necessary condition for successful software development, though those disciplines are often difficult to apply, and clearly they are not sufficient to ensure success.

3 How far we have come
Natural-language computing is not a new application of computer technology. When Alan Turing drew up a list of potential uses for the stored-program electronic computer, a few weeks after the world's first computer run at Manchester in June 1948, the second and third items on his five-item list were 'learning of languages' and 'translation of languages' (Hodges 1983: 382). Some of the early machine translation projects must have been among the larger software development projects in any domain in the 1950s and early 1960s. But, on the whole, natural-language computing has been late in making the transition from individualistic, craft activity to industrial process; and, where work was being done in a more realistic style, for instance on Petr Toma's 'Systran' machine-translation system (Hutchins and Somers 1992: ch. 10; URL 2), for many years it was given the cold shoulder by computational linguists within the academic world (Sampson 1991: 1278).

Since the 1980s, in some respects the subject has made great strides in the relevant direction. It is hard, nowadays, to remember the cloistered, unrealistic ethos of natural-language computing as it was less than twenty years ago. To give an impression of how things were then, let me quote (as I have done elsewhere) a typical handful of the language examples used by various speakers at the inaugural meeting of the European Chapter of the Association for Computational Linguistics, held at Pisa in 1983, in order to illustrate the workings of the various software systems which the speakers were describing:

Whatever is linguistic is interesting.
A ticket was bought by every man.
The man with the telescope and the umbrella kicked the ball.
Hans bekommt von dieser Frau ein Buch.
John and Bill went to Pisa. They delivered a paper.
Maria è andata a Roma con Anna.
Are you going to travel this summer? Yes, to Sicily.
Some critics of the field were unwilling to recognize such material as representing human language at all. As we saw in Chapter 2, Michael Lesk (1988) characterized it acidly as an 'imaginary language, sharing a few word forms with English'. To me, there was nothing wrong with these dapper little example sentences as far as they went; but they were manifestly invented rather than drawn from real life, and they were invented in such a way as to exclude all but a small fraction of the problematic issues which confront software that attempts to deal with real-life usage. Focusing on such artificial examples gave a severely distorted picture of the issues facing natural-language engineering. Contrast the above examples with, at the other extreme, a few typical utterances taken from the speech section of the British National Corpus:

well you want to nip over there and see what they come on on the roll

can we put erm New Kids # no not New Kids Wall OJ # you know

well it was Gillian and 4£ and # erm {pause} and Ronald's sister erm {pause} and then er {pause} a week ago last night erm {pause} Jean and I went to the Lyceum together to see Arsenic and Old Lace

lathered up, started to shave {unclear} {pause} when I come to clean it there weren't a bloody blade in, the bastards had pinched it

but er {pause} I don't know how we got onto it {pause} er sh- # and I think she said something about oh she knew her tables and erm {pause} you know she'd come from Hampshire apparently and she # {pause} an- # an- yo- # you know er we got talking about ma- and she's taken her child away from {pause} the local school {pause} and sen- # is now going to a little private school up {pause} the Teign valley near Teigngrace apparently fra-
Whatever IT application we have in mind, whether automatic information extraction, machine translation, generation of orthographically conventional typescript from spoken input, or something else, I think the degree of complexity and difficulty presented by the second set of examples, compared with the first set, is quite manifest. Of course, I have made the point vivid by using examples drawn from spontaneous, informal speech (but then, notice that the last, at least, of the
examples quoted from the Pisa meeting was clearly intended to represent speech rather than writing). Some natural-language computing applications are always going to relate to written language rather than speech, and writing does tend to be more neatly regimented than the spoken word. But even published writing, after authors and editors have finished redrafting and tidying it, contains a higher incidence of structural unpredictability and perhaps anarchy than the examples from the Pisa conference. Here are a few sentences drawn at random from the LOB Corpus:

Sing slightly flat.

Mr. Baring, who whispered and wore pince-nez, was seventy if he was a day.

Advice - Concentrate on the present.

Say the power-drill makers, 75 per cent of major breakdowns can be traced to neglect of the carbon-brushgear.

But he remained a stranger in a strange land.
In the first example we find a word in the form of an adjective, flat, functioning as an adverb. In the next example, the phrase Mr. Baring contains a word ending in a full stop followed by a word beginning with a capital which, exceptionally, do not mark a sentence boundary. The third 'sentence' links an isolated noun with an imperative construction in a logic that is difficult to pin down. In Say the power-drill makers . . ., verb precedes subject for no very clear reason. The last example is as straightforward as the examples from the Pisa meeting; but, even in traditional published English, straightforward examples are not the norm. (Currently, technologies such as e-mail are tending to make written language more like speech.) There were no technical obstacles to real-life material being used much earlier than it was. The electronic Brown Corpus of American English, which is proving to be a very valuable research resource even now at the turn of the century, was published as early as 1964; for decades it was all but ignored. For computational linguists to develop software systems based entirely on well-behaved invented data, which was the norm throughout the 1980s, is rather analogous to the home computer buff who writes a program to execute some intellectually interesting function, but has little enthusiasm for organizing a testing regime which would check the viability of the program by exposing it to a realistically varied range of input conditions. And this approach to natural-language computing militates against any application of statistical processing techniques. Speakers of a natural language may be able to make up example sentences of the language out of their heads, but they certainly cannot get detailed statistical data from their intuitions.
One must learn to walk before one runs, and the 1980s reliance on artificial linguistic data might be excused on the ground that it is sensible to begin with simple examples before moving on to harder material. In fact I think the preference of the discipline for artificial data went much deeper than that. In the first place, as we have seen, computational linguistics was not 'beginning' in the 1980s. More important, almost everyone involved with linguistics was to a greater or lesser extent under the spell of Noam Chomsky, who saw linguistics as more an aprioristic than an empirical discipline. One of Chomsky's fundamental doctrines was his distinction between linguistic 'performance' — people's observable, imperfect linguistic behaviour - and linguistic 'competence', the ideal, intuitively accessible mental mechanisms which were supposed to underlie that performance (Chomsky 1965: 4). Chomsky taught that the subject worthy of serious academic study was linguistic competence, not performance. The route to an understanding of linguistic performance could lie only through prior analysis of competence (ibid.: 9, 15), and the tone of Chomsky's discussion did not encourage his readers to want to move on from the latter to the former. For Chomsky, this concept of an ideal linguistic competence residing in each speaker's mind was linked to his (thoroughly misguided) idea that the detailed grammatical structure of natural languages is part of the genetic inheritance of our species, like the detailed structure of our anatomy. But Chomsky was successful in setting much of the agenda of linguistics even for researchers who had no particular interest in these psychological or philosophical questions. In consequence, if computational linguists of the 1980s noticed the disparity between the neatly regimented examples used to develop natural-language processing software and the messy anarchy of real-life usage, rather than seeing that as a criticism of the examples and the software, they tended obscurely to see it as a criticism of real-life usage. Aarts and van den Heuvel (1985) give a telling portrayal of the attitudes that were current in those years. Not merely did most natural-language computing not use real-life data, but for a while there seemed to be an air of mild hostility or scorn towards the minority of researchers who did. Happily, from about 1990 onwards the picture has completely changed. Over the past ten years it has become routine for natural-language computing research to draw on real-life corpus data; and the validity of statistics-based approaches to natural-language analysis and processing is now generally accepted. I am not sure that one can count this as a case of the profession being convinced by the weight of reasoned argument; my impression of what happened was that American research funding agencies decided that they had had enough of natural-language computing in the aprioristic style and used the power of the purse to impose a change of culture, which then spread across the Atlantic, as things do. But, however it came about, the profession has now accepted the crucial need to be responsive to empirical data.
4 The lesson not yet learned
In another respect, though, it seems to me that natural-language computing has yet to take on board the software-engineering lesson of the primacy of problem analysis and documentation over coding. I shall illustrate the point from the field of parsing — automatic grammatical analysis. I believe similar things could be said about other areas of natural-language processing; but automatic parsing is the languageengineering function of which I have experience myself, and it is a key technology in natural-language computing. Many would have agreed with K. K. Obermeier's assessment ten years ago that parsing was 'The central problem' in virtually all natural-language processing applications (Obermeier 1989: 69); more recently, I notice that 'parsing' takes up more space than any other technology name in the index of an NSF/European Commission-sponsored survey of natural-language and speech computing (Cole et al. 1997). As these pointers suggest, a large number of research groups worldwide have been putting a lot of effort into solving the parsing problem for years and indeed for decades. Many parsing systems have been developed, using different analytic techniques and achieving different degrees of success. Any automatic parser is a system which receives as input a representation of a spoken or written text, as a linear sequence of words (together possibly with subsidiary items, such as punctuation marks in the case of written language), and outputs a structural analysis, which is almost always in a form notationally equivalent to a tree structure, having the words of the input string attached to its successive leaf nodes, and with nonterminal nodes labelled with grammatical categories drawn from some agreed vocabulary of grammatical classification. (A minority of research groups working on the parsing problem use output formalisms which deviate to a certain extent from this description - see for instance notes 4 and 9 to Chapter 4 earlier; but I do not think these differences are significant enough to affect the substance of the point I am developing.) The structural analysis is something like a representation of the logic of a text, which is physically realized as a linear string of words because the nature of speech forces a one-dimensional linear structure onto spoken communication (and writing mimics the structure of spoken utterances). So it is easy to see why any automatic processing which relates to the content of spoken or written language, rather than exclusively to its outward form, is likely to need to recover the tree-shaped structures of grammar underlying the string-shaped physical signals. Obviously, to judge the success of any particular parser system, one must not only see what outputs it yields for a range of inputs, but must know what outputs it should produce for those inputs: one must have some explicit understanding of the target analyses, against which the actual analyses can be assessed. Yet it was a noticeable feature of the literature on automatic natural-language parsing for many years that - though the software systems were described in detail - there was hardly any public discussion of the
schemes of analysis which different research groups were treating as the targets for their parsing systems to aim at. Issues about what counted as the right analyses for particular input examples were part of what Ghezzi et al. (1991: 4) called 'the "folklore" of the project requirements' (see above). Members of particular parsing projects must have discussed such matters among themselves, but one almost never saw them spelled out in print.

Of course, unlike some of the topics which software is written to deal with, natural-language parsing is a subject with a long tradition behind it. A number of aspects of modern grammatical analysis go back two thousand years to the Greeks; and the idea of mapping out the logic of English sentences as tree structures was a staple of British schooling at least a hundred years ago. So computational linguists may have felt that it was unnecessary to be very explicit about the targets for automatic parsing systems, because our shared cultural inheritance settled that long since.

If people did think that, they were wrong. The wrongness of this idea was established experimentally, at a workshop held in conjunction with the Association for Computational Linguistics annual conference at Berkeley, California, in 1991. Natural-language processing researchers from nine institutions were each given the same set of English sentences and asked to indicate what their respective research groups would regard as the target analyses of the sentences, and the nine sets of analyses were compared. These were not particularly complicated or messy sentences - they were drawn from real-life corpus data, but as real-life sentences go, they were rather well-behaved examples. And the comparisons were not made in terms of the labels of the constituents: the only question that was asked was how far the researchers agreed on the shapes of the trees assigned to the sentences - that is, to what extent they identified the same sub-sequences of words as grammatical constituents, irrespective of how they categorized the constituents they identified. The level of agreement was strikingly low. For instance, only the two subsequences marked by square brackets were identified as constituents by all nine participants in the following example (and results for other cases were similar):

One of those capital-gains ventures, in fact, has saddled him [with [Gore Court]].
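The comparison just described reduces each proposed parse to the set of word spans it brackets as constituents, ignoring category labels. A minimal sketch of that idea in Python follows; the two analyses are invented for illustration, not drawn from the workshop materials.

```python
# Illustrative only: compare two (invented) constituent analyses of the same
# word string by the spans they bracket, ignoring labels, in the spirit of the
# 1991 workshop comparison described above.
def spans(tree, start=0):
    """tree: a word (str) or a list of sub-trees.
    Returns (end position, set of (start, end) spans bracketed within tree)."""
    if isinstance(tree, str):            # a single word occupies one position
        return start + 1, set()
    pos, result = start, set()
    for child in tree:
        pos, child_spans = spans(child, pos)
        result |= child_spans
    result.add((start, pos))             # the constituent covered by this node
    return pos, result

# Two hypothetical analyses of the same six-word string:
analysis_a = [["the", "old", "man"], ["saw", ["the", "dog"]]]
analysis_b = [["the", ["old", "man"]], [["saw", "the"], "dog"]]

_, spans_a = spans(analysis_a)
_, spans_b = spans(analysis_b)
print(sorted(spans_a & spans_b))   # spans on which the two analyses agree
```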
If specialists agree as little as this on the details of what parsing systems are aiming to do, that surely establishes the need for a significant fraction of all the effort and resources that are put into automatic parsing to be devoted to discussing and making more publicly explicit the targets which the software is aiming at, rather than putting them all into improving the software.

5 The scale of the task
I do not mean to imply that every natural-language computing group working on English ought to agree on a single common parsing scheme. In the
context of applications executing commercially or socially valuable natural-language processing functions of various kinds, automatic parsing is only a means to an end. It may well be that the kind of structural analysis which is most appropriate with respect to one function differs in some details from the analysis that is appropriate for an application executing a different function. But the lack of agreement revealed at the 1991 workshop did not arise because various research groups had made explicit decisions to modify the details of a recognized public scheme of English-language parsing to suit their particular purposes. No such public scheme existed. Separate groups were forced to use different parsing schemes, because each research group had to develop its own standards, as a matter of internal project 'folklore'. The analytic concepts which we inherit from traditional school grammar teaching may be fine as far as they go, but they are far too limited to yield unambiguous, predictable structural annotations for the myriad linguistic constructions that occur in real life. And, because research groups developed their parsing standards independently and in an informal fashion, not perceiving this as truly part of the work they were engaged on, they were in no position to develop schemes that were adequate to the massive structural complexity of any natural language.

The results of the 1991 ACL workshop experiment came as little surprise to me, in view of earlier experiences of my own. From 1983 onwards, as a member of the University of Lancaster natural-language computing group, I had taken responsibility for creating the Lancaster-Leeds Treebank, introduced in Chapter 3, which was needed for a statistics-based parsing project led by my senior colleague Geoffrey Leech. I remember that when I took the task on and we needed to agree an annotation scheme for the purpose, Leech (who knows more about English grammar than I ever shall) produced a 25-page typescript listing a set of symbols he proposed that we use, with guidelines for applying them in debatable cases; and I thought this represented such a thorough job of anticipating problematic issues that it left little more to be said. All I needed to do was to use my understanding of English in order to apply the scheme to a series of examples.

I soon learned. As I applied the scheme to a sample of corpus data, the second or third sentence I looked at turned out to involve some turn of phrase that the typescript did not provide for; as I proceeded, something on the order of every other sentence required a new annotation precedent to be set. Real-life usage contains a far greater variety of constructions than a contemporary training in linguistics leads one to expect. Often, alternative structural annotations of a given construction each seemed perfectly defensible in terms of the grammatical tradition - but if we were going to use our treebank to produce meaningful statistics, we had to pick one alternative and stick to it. Consider, to give just one example, the construction exemplified in the more, the merrier - the construction that translates into German with je and desto. Here are three ways of grouping a sentence using that construction into constituents:
[[the wider the wheelbase is], [the more satisfactory is the performance]]

[[the wider the wheelbase is], the more satisfactory is the performance]

[[[the wider the wheelbase is], the more satisfactory] is the performance]
The two clauses might be seen as co-ordinated, as in the first line, since both have the form of main clauses and neither of them contains an explicit subordinating element. Or the second clause might be seen as the main clause, with the first as an adverbial clause adjunct. Or the first clause might be seen as a modifier of the adjectival predicate within the second clause. There seemed to be no strong reason to choose one of these analyses rather than another. Linguists influenced by the concept of innate psychological 'competence' tend to react to alternatives like this by asking which analysis is 'true' or 'psychologically real' — which structure corresponds to the way the utterance is processed by a speaker's or hearer's mental machinery. But, even if questions like that could ultimately be answered, they are not very relevant to the tasks confronting natural-language computing here and now. We have to impose analytic decisions in order to be able to register our data in a consistent fashion; we cannot wait for the outcome of abstruse future psychological investigations. Indeed, I should have thought it was necessary to settle on an analytic framework in order to assemble adequately comprehensive data for the theoretical psycholinguists to use in their own investigations. Theoreticians cannot hope to make real progress in their own work without a solid foundation of grammatical taxonomy to catalogue and classify the data which their theories ought to explain. In the comparable domain of natural history, it was two centuries after the taxonomic work of John Ray, and a century and a half after that of Linnaeus, before theoretical biology was able to develop as a substantial discipline in its own right in the late nineteenth century (see e.g. Allen 1994: ch. 9). From a theoretical point of view the Linnaean system was somewhat 'unnatural' (and was known from the start to be so), but it provided a practical, usable conspectus of an immensely complex world of data; without it, theoretical biology could not have got off the ground. A science is not likely to be in a position to devise deep theories to explain its data before it has an agreed scheme for identifying and registering those data. To use the terms 'true' and 'false' in connexion with a scheme of grammatical annotation would be as inappropriate as asking whether the alphabetical order from A to Z which we use for arranging names in a telephone directory or books on shelves is the 'true' order. At any rate, within the Lancaster group it became clear that our approach to automatic parsing, in terms of seeking structures over input word-strings which conformed to the statistics of parse configurations in a sample of analysed material, required us to evolve far more detailed analytic guidelines than anything that then existed; without them, the statistics would be
meaningless, because separate instances of the same construction would be classified now one way, now another. This great growth in annotation guidelines was caused partly by the fact that real-life language contains many significant items that are scarcely noticed by traditional linguistics. Personal names are multi-word phrases with their own characteristic internal structure, and so are such things as addresses, or references to weights, measures, and money sums; we need consistent rules for annotating the structures of all these forms, but they are too culture-bound to be paid much attention by the inherited school grammar tradition (and recent theoretical linguistics has scarcely noticed them). In written language, punctuation marks are very significant structurally and must be fitted into parse trees in some predictable way, but syntactic analysis within theoretical linguistics ignored punctuation completely. (On general issues relating to the development of annotated corpora, see Garside, Leech and McEnery 1997.) The more important factor underlying the complexity of our annotation rules, though, was the need to provide an explicit, predictable annotation for every turn of phrase that occurs in the language. We evolved a routine in which each new batch of sentences manually parsed would lead to a set of tentative new analytic precedents which were logged on paper and circulated among the research team; regular meetings were held where the new precedents were discussed and either accepted or modified, for instance because a team member noticed a hidden inconsistency with an earlier decision. The work was rather analogous to the development of the Common Law. A set of principles attempts to cover all the issues on which the legal system needs to provide a decision, but human behaviour continually throws up unanticipated cases for which the existing legal framework fails to yield an unambiguous answer; so new precedents are set, which cumulatively make the framework increasingly precise and comprehensive. We want our nation's legal system to be consistent and fair, but perhaps above all we want it to be fully explicit; and if that is possibly not the dominant requirement for a legal system, it surely is for a scientific system of data classification. To quote Jane Edwards of the University of California at Berkeley: 'The single most important property of any data base for purposes of computer-assisted research is that similar instances be encoded in predictably similar ways' (Edwards 1992: 139). Ten years of our accumulated precedents on structural annotation of English turned a 25-page typescript into the scheme which was published as a book of 500 large-format pages (Sampson 1995). Beginning from a later start, the Pennsylvania treebank group published their own independent but very comparable system of structural annotation guidelines on the Web in the same year (URL 18). I am sure that the Pennsylvania group feel as we do, that neither of these annotation schemes can be taken as a final statement; the analogy with the growth of the law through cumulation of precedents suggests that there never could be a last word in this domain. My own group has been elaborating our scheme in the past few years by applying it to
spontaneous speech (see Rahman and Sampson 2000; URL 7); but although the main focus here is on aspects of the annotation scheme that were irrelevant to the structure of written prose, we also continue to find ourselves setting new precedents for constructions that are common to writing as well as speech. For the generative linguists who set much of the tone of computational linguistics up till the 1980s, this kind of comprehensive explicitness was not a priority. Theorists of grammar commonly debated alternative analyses for a limited number of 'core' constructions which were seen as having special theoretical importance, trying to establish which analysis of some construction is 'psychologically real' for native speakers of the language in question. They saw no reason to take a view on the analysis of the many other constructions which happen not to be topics of theoretical controversy (and, because they invented their examples, they could leave most of those other constructions out of view). Language engineering based on real-life usage, on the other hand, cannot pick and choose the aspects of language structure on which it focuses - it has to deal with everything that comes along. For us the aim is not to ascertain what structural analysis corresponds to the way language is organized in speakers' minds - we have no way of knowing that; we just need some reliable, practical way of registering the full range of data in a consistent manner.

6 Analysing spontaneous speech
For the structure of written English, a consensus on analysis does at least exist, even though (as we have seen) that consensus turns out to be far less comprehensive than linguists often suppose. When we turn to the analysis of spontaneous speech, we immediately confront analytic issues for which the linguistic tradition does not begin to offer a consensus solution. How, for instance, are we to indicate what is going on in a 'speech repair' — a passage in which a speaker corrects himself or changes tack in midutterance? Figure 6.1 is an outline version (using spelled-out rather than coded grammatical category labels) of the CHRISTINE analysis of part of an utterance (from the London—Lund Corpus, 2.2.669) in which a speaker embarks on a relative clause modifying any bonus and then decides instead to use anything as the head of the phrase and to make bonus the predicate. We use the # symbol to indicate the point where the speaker interrupts himself; but we need rules for deciding how to fit that symbol, and the words before and after it, into a coherent structure - do we, for instance, label what precedes the interruption point as a relative clause, even though only its first word, he, was actually uttered? Where in the tree do we attach the interruption symbol? The tree in Figure 6.1 is based on explicit decisions about these and related questions, and the variety of speech-management phenomena found in real-life spontaneous speech is such that these guidelines have had to grow quite complex; but only by virtue of them can thousands of individual speech repairs be annotated in a predictable, consistent fashion.
Figure 6.1
Structural annotation of spontaneous speech calls into question grammatical distinctions which, in writing, are fundamental. Written English takes pains to leave readers in no doubt about the distinction between direct and indirect speech, which is marked by punctuation even where the wording itself does not make the status of a particular quotation clear. Speech has no inverted commas, but commonly wording shows whether quotations are directly reported or paraphrased. For instance, in the following CHRISTINE example (I have added underlining to identify relevant wording):

he said _he hates_ drama because the teacher takes no notice, he said one week Stuart was _hitting me_ with a stick and the teacher just said _calm down you boys_ (T19.03060)
the words he hates drama (rather than I hate ...) show that the object of the first he said is indirect speech, whereas hitting me (rather than hitting him), and the imperative and second-person form in the quotation attributed to the teacher, show that the object of the second he said is a direct quotation which itself contains an internal direct quotation. But matters are not always so straightforward. Two other CHRISTINE examples run:

I said well that's his hard luck (T15.10673)

well Billy, Billy says well take that and then he'll come back (T13.01053)
The discourse item well at the beginning of well that's his hard luck usually marks the beginning of a direct quotation, and present-tense (i)s rather than past-tense was agrees with that, but in context direct quotation would call for your hard luck rather than his ... Again, following Billy the word well and the imperative take suggest direct speech, but he'll come in place of I'll come suggests indirect speech. In spoken English it seems that directness of
quotation is not an absolute property but a matter of gradience. Quotations may be reported more or less directly, which creates a new challenge for an annotation scheme that was developed originally in connexion with written prose.

Indeed, the structures found in spontaneous speech sometimes call into question not merely the inherited range of grammatical-category distinctions but the very concept of grouping utterances into tree-shaped structures. It is no surprise to find that spontaneous utterances are sometimes too chaotic for any structure to be clearly assignable. More troubling are cases where the structure does seem clear, but it conflicts with the hierarchical assumption which is implicit in the use of tree diagrams or labelled bracketing. This applies to what I call 'Markovian' utterances (which occur repeatedly in the CHRISTINE data), where a window of limited size moved through the wording would at each point contain a coherent, normal grammatical sequence, but the utterance as a whole cannot be assigned a single structure. Consider, for instance, the following, said by Anthony Wedgwood Benn, MP, on a BBC radio programme:

and what is happening {pause} in Britain today {pause} is a demand for an entirely new foreign policy quite different from the Cold War policy {pause} is emerging from the Left (XO1.00539-45)

The long noun phrase an entirely new foreign policy quite different from the Cold War policy seems to function, with respect to the words before it, as the complement of a prepositional phrase introduced by for which postmodifies a demand; yet, with respect to what follows it, the same phrase functions as subject of is emerging. Within the usual framework of grammatical analysis, one constituent cannot fill both of these roles simultaneously. Yet is it reasonable to abandon the concept of hierarchical grammatical structuring, which has been entrenched in Western linguistic thought for centuries, because of a limited number of 'Markovian' examples which seem to conflict with it?

I have argued elsewhere (Rahman and Sampson 2000) that some of the difficulties in evolving well defined guidelines for consistent structural annotation of spoken language may stem from the fact that our inherited grammatical tradition has evolved almost exclusively in connexion with written language. It may be that fully adequate schemes for annotating speech treebanks will eventually need to adopt notational devices that depart further from traditional grammatical ideas than anything yet adopted in our CHRISTINE Corpus. These are problems with which we are only beginning to grapple.

7 Differential reception of data and specifications

The only way that one can produce an adequate scheme of structural annotation is to apply an initial scheme to real data and refine the scheme in response to problem cases, as we have been doing; so in developing an
annotation scheme one inevitably generates a treebank, an annotated language sample, as a by-product. The Lancaster—Leeds Treebank which started me on this enterprise in the mid-1980s was for internal project use and was never published, but the larger SUSANNE Corpus, on which later stages of annotation-scheme development were based, was released in successively more accurate versions between 1992 and 2000. Part of the point I am seeking to make in the present chapter can be illustrated by the different receptions accorded by the research community to the SUSANNE Corpus, and to the published definition of the SUSANNE annotation scheme. Because it emerged from a manual annotation process which aimed to identify and carefully weigh up every debatable analytic issue arising in its texts, the SUSANNE Corpus is necessarily a small treebank; there is a limit to how reliable any statistics derived from it can hope to be. Yet it succeeded beyond expectations in establishing a role for itself internationally as a natural-language computing research resource. Accesses to the ftp site originally distributing it at the Oxford Text Archive quickly rose to a high level (and subsequently other 'mirror' sites began distributing it, so that I no longer have any way of monitoring overall accesses). The professional literature frequently includes research based on the SUSANNE Corpus, commonly carried out by researchers of whom I had no prior knowledge. Conversely, it seems fair to say that the book defining the annotation scheme has yet to find a role. Reviewers have made comments which were pleasing to read, but almost no-one has spontaneously found reasons to enter into correspondence about the contents of the annotation scheme, in the way that many researchers have about the SUSANNE treebank; published research based on the SUSANNE Corpus rarely discusses details of the scheme. Like every academic, I am naturally delighted to find that any research output for which I was responsible seems to be meeting a need among the international research community. The welcome that the corpus alone has received is certainly more than a sufficient professional reward for the effort which created the corpus and annotation scheme. Nevertheless, the imbalance in the reception of the two resources seems rather regrettable in what it appears to say about the values of the discipline. In my own mind, the treebank is an appendix to the annotation scheme, rather than the other way round; the treebank serves a function similar to what biologists call a type collection attached to a biological taxonomy - a set of specimens intended to clarify the definitions of the taxonomic classes. The SUSANNE treebank is really too small to count as a significant database of English grammatical usage; whereas the published annotation scheme, although it unquestionably has many serious limitations and imperfections, can (I believe) claim to be a more serious attempt to do its own job than anything that existed in print before. If the research community is not taking up the SUSANNE annotation scheme as a basis from which to push forward the enterprise of taxonomizing English structure, that could merely mean that they prefer the Pennsylvania scheme as a starting point for that work; but in fact I do not get the impression that this sort of activity has been getting under way
in connexion with the Pennsylvania scheme either. (The fact that the Pennsylvania group limited themselves to publishing their scheme via the Web rather than as a printed book perhaps suggests that they did not expect it to.) When Geoffrey Leech began to look for support to create the first corpus of British English, about thirty years ago, I understand that funding agencies were initially unreceptive, because at that time a simple collection of language samples did not strike reviewers as a valid research output. People expected concrete findings, not just a collection of data from which findings could subsequently be generated — although Leech's LOB Corpus, when it was eventually published in 1978, served as the raw material for a huge variety of research findings by many different researchers, which collectively must far exceed the new knowledge generated by almost any research project which seeks to answer a specific scientific question. We have won that battle now, and it is accepted that the compilation of natural-language corpora is a valuable use of research resources - though now that massive quantities of written language are freely available via the Internet, the need at the turn of the century is for other sorts of language sample, representing speech rather than writing and/or embodying various categories of annotation. But there is still a prejudice in favour of the concrete. When I put together a new research proposal, I emphasize the work of compiling a new annotated corpus, rather than that of extending and testing a scheme of structural annotation. If I wrote the proposals in the latter way, I suspect they would fail, whereas research agencies are happy to sponsor new corpora even though the ones I can offer to create are small (because our way of working requires each individual turn of phrase to be examined in case it reveals a gap needing to be filled in the scheme of analytic guidelines). Before software engineering brought about a change of vision, IT managers measured their colleagues' output in terms of lines of code, and overlooked the processes of planning, definition, and co-ordination which were needed before worthwhile code could be written. At present, most empirical linguists see the point of an annotated corpus, but few see the point of putting effort into refining schemes of annotation. Some encouragement to give more priority to the annotationscheme development task has come from the European Commission, whose Directorate-General XIII induced the predominantly US-sponsored Text Encoding Initiative (URL 8) to include a small amount of work on this area about ten years ago, and more recently established the EAGLES group (the Expert Advisory Group on Language Engineering Standards, URL 9) to stimulate the development of standards and guidelines for various aspects of natural-language computing resources, including structural annotation of corpora. The EAGLES initiative has produced valuable work, notably in the area of speech systems, where the relevant working group has assembled between hard covers what looks to me like a very complete survey of problems and
best practices in various aspects of speech research (Gibbon, Moore and Winski 1997). But in the area of grammatical annotation the EAGLES enterprise was hobbled by the political necessity for EU-funded work to deal jointly with a large number of European languages, each of which has its own structure, and which are very unequal in the extent to which they have been worked over by either traditional or computer-oriented scholarly techniques (many of them lagging far behind English in that respect). Consequently, in this domain the EAGLES initiative focused on identifying categories which are common to all or most EU national languages, and I think it is fair to say that its specific recommendations go into even less detail than the inherited school grammar tradition provides for English. The nature of the EAGLES enterprise meant that it could hardly have been otherwise.

What is needed is more effort devoted to identifying and systematically logging the fine details of spoken and written language structure, so that all aspects of our data can be described and defined in terms which are meaningful from one site to another, and this has to be done separately for any one language in its own terms (just as the taxonomy of one family of plants is a separate undertaking from the taxonomy of any other family). European languages obviously do share some common structural features because of their common historical origins and subsequent contacts; but a language adapts its inherited stock of materials to new grammatical purposes on a time-scale of decades - think for instance of the replacement of might by may in the most respectable written contexts within the past ten or twenty years, in constructions like if he had been in Cornwall he may have seen the eclipse - whereas the EU languages have developed as largely independent systems for millennia. We do not want our grammatical classification systems to be excessively dominated by ancient history.

In developing predictable guidelines for annotating the structure of spontaneous spoken utterances, my group faced large problems stemming from the fact that, within English, there are different groups of speakers who, for instance, use the verb system in different ways. If a speaker of a non-standard version of English says she done it, rather than she did it or she's done it (which speakers very often do), to a schoolteacher this may represent heresy to be eradicated, but for us it is data to be logged. We have to make a decision about whether such cases should be counted as:

simple past forms with non-standard use of done rather than did as past tense of do

perfective forms with non-standard omission of the auxiliary

or a third verbal category, alongside the perfective and simple past categories of the standard language

The idea of developing guidelines at this level of detail which simultaneously take into account what happens in German or Modern Greek is really a non-starter.
In any case, encouragement from national or supranational government level will not achieve very much, unless enthusiasm is waiting to be kindled at grass-roots level among working researchers. Natural-language computing researchers need to see it as being just as fascinating and worthwhile a task to contribute to the identification and systematic classification of distinctive turns of phrase as to contribute to the development of language-processing software systems - so that taxonomizing language structure becomes an enterprise for which the discipline as a whole takes responsibility, in the same way as biologists recognize systematics as an important subfield of their discipline. The fact that natural-language computing is increasingly drawing on statistical techniques, which by their nature require large quantities of material to be registered and counted in a thoroughly consistent fashion, makes the task of defining and classifying our data even more crucial than it was before. It is surely too important to leave in the hands of isolated groups in Sussex or Pennsylvania.

8 A call to arms
If people are attracted to the task, there is plenty of work for them to do, and plenty of rewards to be reaped. My experience has been that even a small-scale English treebank soon yields new scientific findings, sometimes findings that contradict conventional linguistic wisdom. For instance, introductory textbooks of linguistics very commonly suggest that the two most basic English sentence types are the types 'subject - transitive-verb - object', and 'subject - intransitive-verb'. Here are the examples quoted by Victoria Fromkin and Robert Rodman in the 1983 edition of An Introduction to Language to illustrate the two first and simplest structures diagrammed in their section on sentence structure (Fromkin and Rodman 1983: 207-9):

the child found the puppy

the lazy child slept
Looking at statistics on clause structure in our first, small, Lancaster—Leeds Treebank, though, I found that this is misleading (Sampson 1987: 90). 'Subject - transitive-verb — object' is a common sentence type, but sentences of the form 'subject - intransitive-verb' are strikingly infrequent in English. If the sentence has no noun-phrase object to follow the verb, it almost always includes some other constituent, for instance an adverbial element or a clause complement, in post-verb position. The lazy child slept may be acceptable in English, but it could be called a 'basic' type of English sentence only in some very un-obvious sense of the word 'basic'. (The latest, 1998 edition of Fromkin and Rodman's textbook blurs this aspect of their account of basic sentence structures, perhaps in response to findings such as the one quoted.) The more closely one looks at usage in a language, the more detail turns out to be awaiting description and classification. I referred in Chapter 1 to
Richard Sharman's analogy between human languages and natural fractal objects, such as coastlines, which continue to display new and unpredictable detail no matter at what scale they are examined. Treebank research which I shall describe in Chapter 10 makes this analogy seem rather exact, even in the relatively well-behaved case of written prose. And the range of basic, common structural phenomena needing to be registered and classified, before we shall be in any position to begin formulating explanatory theoretical principles, expands considerably when one looks at the spoken language.

At the end of the twentieth century, mankind's honeymoon with the computer has not yet quite faded, and software development still has a glamour which is lacking in research that revolves round ink and paper. But a computational linguist who helps to develop a natural-language software system is devoting himself to a task which, realistically, is unlikely to achieve more than a tiny advance on the current state of the art, and will quickly be forgotten when another, better system is produced. To improve our system for registering and classifying the constructions of English, on the other hand, is to make a potentially lasting contribution to our knowledge of the leading medium of information storage and exchange on the planet. I do not suggest that computational linguists should migrate en masse from the former activity to the latter. But it would be good to see more of a balance.

Notes

1 One might expect this type of work to be called 'applied linguistics'; but that phrase was pre-empted, before the Information Technology revolution, to refer to the application of linguistics in the language-teaching profession, so the phrase is never used in connexion with computer language processing. For the phrase 'language engineering', see Cunningham 1999.
2 Only general concepts such as corpora, dialogue, speech, word occupy larger sections of Cole et al.'s index.

3 The artificiality of Linnaeus's main classification system was explicit. Linnaeus spent part of his career trying to develop fragments of a natural system as an alternative to his artificial but practical system which became standard (though he believed that a complete natural system was beyond the reach of the science of his day). See e.g. Stafleu 1971: 28, 115 ff.; Eriksson 1983: 79-80.
7
Good-Turing frequency estimation without tears
1 The use of Good-Turing techniques

Suppose you want to estimate how common various species of birds are in your garden. You log the first 1000 birds you see; perhaps you see 212 sparrows, 109 robins, 58 blackbirds, and lesser numbers of other species, down to one each of a list of uncommon birds. Perhaps you see 30 species all told. How do you use these numbers to estimate the probability that the next bird you see will be, say, a blackbird? Many people would surely say that the best guess is 58 ÷ 1000, that is 0.058. Well, that's wrong. To see that it is wrong, consider a species which did not occur at all in the thousand-bird sample, but which does occasionally visit your garden: say, nightingales. If the probability of blackbirds is estimated as 58 ÷ 1000, then by the same reasoning the probability of nightingales would be estimated as 0 ÷ 1000, i.e. non-existent. Obviously this is an underestimate for nightingales; and correspondingly 58 ÷ 1000 is an overestimate for blackbirds.

This kind of statistical problem crops up in many kinds of research. In linguistics, the 'species' whose frequency has to be estimated might be words, syllables, grammatical constructions, or the like. (In a linguistic context the terms 'type' and 'token' might seem more appropriate than 'species' and 'individual', but 'type' is a relatively ambiguous word and I shall use 'species' for the sake of clarity.) People often try to get round the problem of zero observations by adding some small quantity (say, 1) to the tally for each species; then, for the bird example, p(nightingale), the probability of seeing a nightingale, would be 1 ÷ 1030 (and p(blackbird) would be (58 + 1) ÷ (1000 + 30) = 0.0573). But this is just papering over the cracks in the logic of the earlier estimates. It is still a rotten way of approximating the true probabilities; the estimates will often be not just slightly off, but wildly misleading.

A much better technique was worked out by Alan Turing and his statistical assistant I. J. Good, during their collaboration at Bletchley Park, Buckinghamshire, in the Second World War effort to crack German ciphers, which led to the development of machines that were the immediate ancestors of the modern computer, and made a major contribution to Allied victory. (Bletchley Park was the home of the organization that has recently become
familiar to a wide public through Robert Harris's best-selling thriller Enigma, and the BBC television series Station X.) The Bletchley Park codebreaking work depended heavily on calculating inferences about probabilities. Unfortunately, most versions of the Good-Turing technique required quite cumbersome calculations, so when peace came it was not used as widely as it might have been. In the 1990s William A. Gale of AT&T Bell Laboratories developed and tested a simple version of the Good-Turing approach, which for the first time made it easy for people with little interest in maths to understand and use. I worked with Gale to turn his 'Simple Good-Turing' technique into an account that spells out step by step how to apply the technique, as well as explaining its rationale and demonstrating that it gives good results. This chapter is a version of that account.

Let us define some symbols to make the discussion precise. Say that our sample contains N individuals (in the birdwatching scenario, N was 1000), and that for each species i the sample includes r_i examples of that species. (The number s of distinct species in the population may be finite or infinite, though N - and consequently the number of distinct species represented in the sample - must be finite. In the birdwatching case, the number of species represented was 30. I did not discuss the number s in that case; it must have been finite, because the whole world contains only a finite number of bird species, but in linguistics there are some applications - for instance, relating to grammatical structures - where s may be infinitely large.) I call r_i the sample frequency of i, and we want to use it in order to estimate the population frequency p_i of i, that is the probability that an individual drawn at random from the population will be a case of species i. Note that sample frequencies are integers from the range 0 to N, whereas population frequencies are probabilities, that is real numbers from the range 0 to 1.

The very obvious method (used in my opening paragraph) for estimating the population frequency is to divide sample frequency by size of sample - that is, to estimate p_i as r_i/N. This is known as the maximum likelihood estimator for population frequency. As already pointed out, the maximum likelihood estimator has a large drawback: it estimates the population frequency of any species that happens to be missing from the sample - any unseen species - as zero. If the population contains many different low-frequency species, it is likely that quite a number of them will be absent from a particular sample; since even a rare species has some positive population frequency, the maximum likelihood estimator clearly underestimates the frequencies of unseen species, and correspondingly it tends to overestimate the frequencies of species which are represented in the sample. Thus the maximum likelihood estimator is quantitatively inaccurate. Even more importantly, any estimator which gives zero estimates for some positive probabilities has specially unfortunate consequences for statistical calculations. These often involve multiplying estimated probabilities for many simple phenomena, to reach overall figures for the probability of interesting complex phenomena. Zeros propagate through such
calculations, so that phenomena of interest will often be assigned zero probability even when most of their elementary components are very common and the true probability of the complex phenomenon is reasonably high.

This latter problem is often addressed (as suggested earlier) by adding some small figure k to the sample frequencies for each species before calculating population frequencies: thus the estimated population frequency of species i would be (r_i + k)/(N + sk). This eliminates zero estimates: an unseen species is assigned the estimated frequency k/(N + sk). I shall call this the additive method. The additive method was advocated as an appropriate technique by Lidstone (1920: 185), Johnson (1932: 418-19), and Jeffreys (1948: §3.23), the first and third of these using the value k = 1. When the additive method is applied with the value k = 1, as is common, I shall call it the Add-One estimator. In language research Add-One was used for instance by the Lancaster corpus linguistics group (Marshall 1987: 54), and by Church (1988). But, although the additive approach solves the special problem about zeros, it is nevertheless very unsatisfactory. Gale and Church (1994) examine the Add-One case in detail and show that it can give approximately accurate estimates only for data-sets which obey certain quite implausible numerical constraints. Tested on a body of real linguistic data, Add-One gives estimated frequencies which for seen species are always much less accurate even than the maximum likelihood estimator, and are sometimes wrong by a factor of several thousand. The sole advantage of additive techniques is their simplicity. But there is little virtue in simple methods which give wrong results. With only a modest increase in complexity of calculation, one can achieve estimators whose performance is far superior.
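To make the contrast concrete, here is a minimal Python sketch (mine, not part of the original account) of the maximum likelihood and additive estimators applied to the birdwatching figures quoted above; the value s = 30 mirrors the arithmetic in the text.

    # Maximum likelihood versus additive ('Add-k') estimation for the bird example.
    N = 1000   # birds logged
    s = 30     # species, as in the worked arithmetic above

    def maximum_likelihood(r, N):
        # Sample frequency over sample size; zero for any unseen species.
        return r / N

    def additive(r, N, s, k=1.0):
        # Add k to every species tally before normalizing.
        return (r + k) / (N + s * k)

    print(maximum_likelihood(58, N))   # 0.058 for blackbirds
    print(maximum_likelihood(0, N))    # 0.0 for nightingales - the problem
    print(additive(58, N, s))          # roughly 0.0573
    print(additive(0, N, s))           # roughly 0.00097 - non-zero, but essentially arbitrary

As the rest of the chapter argues, the additive figures are no more trustworthy than the maximum likelihood ones; they merely avoid the zeros.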
2 A prospectus and an example

Good-Turing estimators, classically described in Good (1953), are a family of closely related population frequency estimators. Gale's Simple Good-Turing estimator is one member of this family, which is much easier to understand and to use than the various members previously described in print, and which gives good results. All Good-Turing techniques yield estimates for the population frequencies corresponding to the various observed sample frequencies for seen species, and an estimate for the total population frequency of all unseen species taken together. (I shall call this latter quantity P_0 - note that capital P is used for the sum of the separate probabilities of a number of species, whereas p_i is used for the individual probability of a species i.) The techniques do not in themselves tell one how to share P_0 between the separate unseen species, but this is an important consideration in applying the techniques and I discuss it in section 9 below. Also, Good-Turing techniques do not yield estimates for the number of unseen species, where this is not known independently. (Some references on this last issue are Fisher, Corbet and Williams 1943; L. A. Goodman 1949; Good and Toulmin 1956; McNeil 1973; Efron and Thisted 1976.)
Table 7.1

String        Sample frequency
VCV           7846
VCCV          6925
VCCRCRCV      224
VCCRRCCCV     23
VCCRCRV       7
VRCCRCRCV     6
VRCCCRCV      5
VRRCCV        4
VRRCCCV       3
VCCRCRCRV     2
VRCRCRV       1
In order to introduce Good-Turing concepts, let us take a concrete example, which is drawn from research on speech timing reported in Bachenko and Gale (1993); I shall refer to this as the 'prosody example'. Assuming a classification of speech segments into consonants, full vowels, and reduced vowels, we wish (for reasons that are not relevant here) to estimate the frequencies in English speech of the various possible sequences containing only the classes 'consonant' and 'reduced vowel' occurring between two full vowels. That is, the 'species' in the population are strings such as VCV, VCRCV, VCCRCRCV, and so on, using C, V, and R to represent the three classes of speech-segment. Using the TIMIT database (URL 11) as a sample, observed frequencies were extracted for various species; a few examples of the resulting figures are shown in Table 7.1.

The Appendix to this chapter shows the complete range of sample frequencies represented in these data, together with the frequencies of the respective sample frequencies; if r is a sample frequency, I write n_r for the number of different species each having that frequency, thus n_r is a 'frequency of a frequency'. For instance, the third row of the Appendix (r = 3, n_r = 24) means that there are 24 distinct strings which each occur three times in the data. The sample comprises a total of 30,902 individual strings (this is the sum of the products of the two numbers in each row of the Appendix); that is, N = 30,902. The string VCV, with frequency 7,846, is the single commonest species in the data. The commonest frequency is 1, which is shared by 120 species. As one moves to frequencies greater than 1, the frequencies of the frequencies decline, at first steadily but later more irregularly. These are typical patterns for many kinds of language and speech data.
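The r and n_r columns can be derived mechanically from a list of raw species counts. The following short Python sketch (mine, not part of the published account) shows one way of doing it; a handful of counts from Table 7.1 stand in for the full data set.

    from collections import Counter

    # Sample frequencies r for a few species (rows of Table 7.1).
    species_counts = {"VCV": 7846, "VCCV": 6925, "VCCRCRCV": 224,
                      "VCCRRCCCV": 23, "VCCRCRV": 7, "VRCRCRV": 1}

    # n_r: how many distinct species share each sample frequency r.
    freq_of_freq = Counter(species_counts.values())

    N = sum(species_counts.values())   # total individuals in the sample
    for r in sorted(freq_of_freq):
        print(r, freq_of_freq[r])
    print("N =", N)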
3 The theoretical rationale

I now outline the reasoning underlying the Good-Turing approach to estimating population frequencies from data-sets such as that in the Appendix. The theorems on which the techniques depend are stated without proof; readers wishing to pursue the subject may like to consult Church, Gale and Kruskal (1991). Some readers may prefer to bypass the present section altogether, in favour of consulting only section 6, which presents mechanical 'recipe book' instructions for applying the Simple Good-Turing technique without explaining why it works. However, applications of the technique are likely to be more judicious when based on an awareness of its rationale.

I first introduce an additional notation, r*. Given a particular sample, I write r* for the estimated number of cases of a species actually observed r times in that sample which would have been observed if the sample were perfectly representative of the population. (This condition would require the possibility of fractional observations.) The quantity r* will normally be less than r, since if the sample were perfectly representative, part of it would be taken up by unseen species, leaving fewer elements of the sample to accommodate the species that actually were observed. Good-Turing techniques consist mainly of a family of methods for estimating r* (for frequencies r ≥ 1); given r*, we estimate p_r (which is what we are trying to find) as r*/N.

Suppose we knew the true population frequencies p_1, p_2, ..., p_s of the various species. Then we could calculate the expected frequency E(n_r) of any sample frequency r; E(n_r) would be

    E(n_r) = Σ_{i=1..s} C(N, r) p_i^r (1 - p_i)^(N - r)

where C(N, r) = N!/(r!(N - r)!) represents the number of distinct ways one can draw r objects from a set of N objects. That is, the expected frequency of frequency r would be the sum of the probabilities, for each r-sized subset of the sample and each species, that all members of the subset belong to that species and no other sample element belongs to it. This expectation depends on an idealized assumption that there are no interactions between occurrences of particular species, so that each occurrence of species i is the outcome of something akin to an independent dice-throwing experiment in which one face of the dice represents i and the other faces represent not-i, and the probability p_i of getting i rather than not-i is fixed and unchanging: statisticians call this a binomial assumption. In reality the assumption is usually false, but often it is false only in ways that have minor, negligible consequences for the overall pattern of occurrences in a sample; in applying statistical methods that incorporate the binomial assumption (including Good-Turing methods) to a particular domain, one must be alive to the issue of whether the binomial assumption is likely to be seriously misleading in that domain. For our example, occurrences of particular strings of consonants and reduced vowels are not truly
independent of one another: for some pairs of strings there are several English words which contain both strings at successive points, for other pairs there are no such words. But, within a sizeable database containing many words, these interrelationships are likely to affect the overall pattern of string frequencies sufficiently little to make the binomial assumption harmless.

If we knew the expected frequencies of frequencies, it would be possible to calculate r*. The central theorem underlying Good-Turing methods states that, for any frequency r ≥ 1:

    r* = (r + 1) E(n_{r+1}) / E(n_r)     (Equation 1)

A corollary states that:

    P_0 = E(n_1) / N     (Equation 2)

In reality we cannot calculate exact figures for expected frequencies of frequencies, because they depend on the probabilities of the various species, which is what we are trying to find out. However, we have figures for the observed frequencies of frequencies, and from these we can infer approximations to the expected frequencies.

Take first Equation 2. This involves only the expected frequency of sample frequency 1. In the sort of data we are considering, where there are few common species but many rare species, frequency 1 will always be the commonest sample frequency, and the actual figure for n_1 is likely to be a close approximation to E(n_1) - compare the fact that the oftener one tosses a coin, the surer one can be that the cumulative proportion of heads will be close to one-half. Thus it is reasonable to estimate P_0 as equal to n_1/N. In the example, n_1 is 120, hence our estimate of the total probability of all unseen species of strings is 120/30902, or 0.0039. If another 10,000 strings were sampled from speech comparable to that sampled in the TIMIT database, we estimate that 39 of them would represent some string or strings not found in TIMIT.

As we move to higher sample frequencies, the data become increasingly 'noisy': already at r = 5 and r = 7 in the Appendix we see cases where n_r is greater than n_{r-1}, although the overall trend is for n_r to decrease as r increases. Furthermore there are many gaps in the list of observed sample frequencies; thus for our example one could not get a sensible r* figure in the case of r = 10 by substituting actual for expected frequencies of frequencies in Equation 1, because the frequency r + 1, i.e. 11, does not occur at all (n_11 is zero, so 10* calculated in this way would also be zero, which is absurd). As one moves towards higher values of r, the gaps where n_r = 0 become larger. What we need is a technique for smoothing the irregular and 'gappy' series of n_r figures into a regular and continuous series, which can be used as good proxies for the unknowable E(n_r) figures in Equation 1.
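In code, the raw use of Equations 1 and 2 amounts to a couple of lines. The fragment below is an illustrative Python sketch (mine); the figures for n_1 and N are the ones quoted above for the prosody example.

    # Equation 2 applied to the prosody data quoted above.
    N = 30902    # total individuals in the sample
    n1 = 120     # species seen exactly once
    P0 = n1 / N  # estimated total probability of all unseen species; about 0.0039

    def raw_r_star(r, n_r, n_r_plus_1):
        # Equation 1 with observed frequencies of frequencies standing in for the
        # expectations - sensible only where the data are not too noisy or gappy.
        return (r + 1) * n_r_plus_1 / n_r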
Much of Good's 1953 paper concerned alternative techniques for smoothing observed series of frequencies of frequencies. The reason for speaking of Good-Turing techniques, in the plural, is that any concrete application of the preceding concepts requires a choice of some particular method of smoothing the n_r figures; not all methods will give equally accurate population frequency estimates in a given domain. Some techniques (including the smoothing technique of Church and Gale 1991) are mathematically quite elaborate. The Simple Good-Turing method is relatively easy to use, yet we shall see that it gives good results in a variety of tests.

4 Linear smoothing

To gain an intuitive grasp of Simple Good-Turing (SGT) smoothing, it is helpful to visualize the data graphically. Figure 7.1 plots n_r against r for our example. Because the ranges of values for both r and n_r include values clustered close together in the lower reaches of the respective ranges and values separated widely in the upper reaches (as is typical of linguistic data), the plot uses a logarithmic scale for both axes.
Figure 7.1
For lower sample frequencies the data points group round a northwest-to-southeast trend, but at higher sample frequencies the trend becomes horizontal along the line n_r = 1. This angular discontinuity in Figure 7.1 does not correspond to any inherent property of the population. It is merely a consequence of the finite size of the sample: a sample frequency may occur once or not at all, but cannot occur a fractional number of times. When using observed frequencies of frequencies to estimate expected frequencies of frequencies for high sample frequencies, we ought to take account not only of the fact that certain high r values correspond to positive n_r values but also of the fact that neighbouring r values correspond to zero n_r values. Following Church and Gale (1991), we do this by averaging positive n_r values with the surrounding zero values. That is, we define a new variable Z_r as follows: for any sample frequency r, let r' be the nearest lower sample frequency and r'' the nearest higher sample frequency such that n_r' and n_r'' are both positive rather than zero. Then Z_r = 2n_r/(r'' - r'). For low r, r' and r'' will be immediately adjacent to r, so that r'' - r' will be 2 and Z_r will be the same as n_r; for high r, Z_r will be a fraction, sometimes a small fraction, of n_r. Most of our estimates of expected frequencies will be based on Z_r rather than directly on n_r.
Figure 7.2
Figure 7.2 plots Z_r against r for our sample on the same log-log scales as in Figure 7.1. The discontinuity in Figure 7.1 has disappeared in Figure 7.2: the data points all group fairly well along a single common trend. Furthermore, not only does Figure 7.2 display a homogeneous trend, but this trend is a straight line. That is not surprising: G. K. Zipf argued that distribution patterns for many linguistic and other behavioural elements are approximately log-linear. We have examined perhaps a dozen radically different language and speech data-sets, and in each case on a log-log plot the points group round a straight line (with a slope between -1 and -2). The Simple Good-Turing technique exploits the fact that such plots typically show linear trends.

Any method of smoothing data must, if it is to be usable for our present purpose, satisfy certain prior expectations about r*. First, we expect r* to be less than r, for all non-zero values of r; secondly, we expect r*/r to approach unity as r increases. The first expectation follows from the fact that observed sample frequencies must be reduced in order to release a proportion of sample elements to accommodate unseen species. The second expectation reflects the fact that the larger r is, the better it is measured, so we want to take away less and less probability as r increases. It is not at all easy to find a method for smoothing Z_r figures that will ensure the satisfaction of these prior expectations about r*. However, a downward-sloping log-log line is guaranteed to satisfy them. Since a straight line is also the simplest possible smooth, part of the SGT technique consists of using the line of best fit to the (log r, log Z_r) points to give our proxy for E(n_r) values when using Equation 1 to calculate r*. I shall write S(r) ('smoothed Z_r') for the value into which this line takes a sample frequency r.

But, for the lowest few values of r, observed n_r values may well be more accurate than any smoothed values as estimates of E(n_r). Therefore the other aspect of the SGT technique consists of a rule for switching between n_r and S(r) as proxies for E(n_r) when calculating r* - for switching between raw and smoothed proxies, I shall say. The rule is that r* is calculated using n_r rather than S(r) as proxy for E(n_r) for r from 1 upwards so long as these alternative methods of calculating r* give significantly different results. (The general pattern is that, as r increases from 1, there will be a short stretch of values for which the alternative r* estimates are significantly different, then a short stretch of values where the pairs of r* estimates oscillate between being and not being significantly different, and then above a certain value of r the pairs of estimates will never be significantly different.) Once the lowest value of r is reached for which n_r and S(r) give estimates of r* which are not significantly different, S(r) is used to calculate r* for that value and for all higher values of r. Pairs of r* estimates may be considered significantly different if their difference exceeds 1.96 times the standard deviation (square root of variance) of the estimate based on n_r (since, assuming a Gaussian distribution of that estimate, the probability of such a difference occurring by chance is less than the accepted 0.05 significance criterion). The variance in question is approximately equal to

    (r + 1)^2 (n_{r+1} / n_r^2) (1 + n_{r+1} / n_r)
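In code, the switching test is a one-line comparison. The fragment below is my own Python sketch (not the original implementation): n_r and n_{r+1} are raw frequencies of frequencies, and S is assumed to be a function implementing the fitted log-log line.

    import math

    def use_raw_proxy(r, n_r, n_r_plus_1, S):
        # True if the raw and smoothed estimates of r* differ significantly,
        # i.e. if the raw proxy should still be used at this value of r.
        x = (r + 1) * n_r_plus_1 / n_r       # r* from raw frequencies of frequencies
        y = (r + 1) * S(r + 1) / S(r)        # r* from the smoothed line
        variance = (r + 1) ** 2 * (n_r_plus_1 / n_r ** 2) * (1 + n_r_plus_1 / n_r)
        return abs(x - y) > 1.96 * math.sqrt(variance)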
Table 7.2

String        r       r*        p_r
VCV           7846    7839.     0.2537
VCCV          6925    6919.     0.2239
VCCRCRCV      224     223.4     0.007230
VCCRRCCCV     23      22.60     0.0007314
VCCRCRV       7       6.640     0.0002149
VRCCRCRCV     6       5.646     0.0001827
VRCCCRCV      5       4.653     0.0001506
VRRCCV        4       3.664     0.0001186
VRRCCCV       3       2.680     8.672e-05
VCCRCRCRV     2       1.706     5.522e-05
VRCRCRV       1       0.7628    2.468e-05
It is the adoption of a rule for switching between smoothed and raw frequencies of frequencies as proxies for expected frequencies of frequencies which allows the SGT method to use such a simple smoothing technique. Good-Turing methods described previously have relied on smoothed proxies for all values of r, and this has forced them to use smoothing calculations which are far more daunting than that of SGT.

One further step is needed before the SGT estimator is completely defined. Because it uses proxies for the true expected frequencies of frequencies E(n_r), we cannot expect the estimated probabilities yielded by the SGT technique to sum to 1, as they should. Therefore each estimated probability generated as discussed earlier has to be renormalized by dividing it by the total of the unnormalized estimates and multiplying by the estimated total probability of seen species, 1 - P_0.

Applying the technique defined earlier to the prosody data in the Appendix gives a line of best fit log S(r) = -1.389 log r + 1.941 (with S(r) interpreted as discussed above). For comparison with Table 7.1, I show in Table 7.2 the r* and p_r figures estimated by SGT for the same selection of species. In this particular example, as it happens, even for r = 1 the alternative calculations of r* give figures that are not significantly different, so values based on smoothed proxies are used throughout; but that is a chance feature of this particular data-set, and in other cases the switching rule calculates r* from raw proxies for several values of r - for instance, in the 'Chinese plurals' example of the next section, raw proxies are used for r = 1 and r = 2.

5 Open versus closed classes: Chinese plurals
A second example, illustrating an additional use of the concepts under discussion, is taken from the field of Chinese morphology. Chinese has various devices to mark the logical category of plurality, but (unlike in European
languages) this category is by no means always marked in Chinese. For instance, there is a plural suffix men which can be added to personal pronouns and to some nouns; but many nouns never take men irrespective of whether they are used with plural reference in a particular context, and nouns which can take men will not always do so even when used with plural reference. In connexion with work reported in Sproat et al. (1994) on the problem of automatically segmenting written Chinese into words, it was desirable to establish whether the class of Chinese nouns capable of taking men is open or closed. Dictionaries are silent on this point and grammatical descriptions of the language tend to be less than wholly explicit; but it is important for word segmentation - if the class of nouns that can take men is closed, an efficient algorithm could list them, but if the class is open some other technique must be deployed. The frequencies of various nouns in men found in a (manually segmented) corpus of Chinese were tabulated, the commonest case being renmen 'people' which occurred 1,918 times. Altogether there were 6,551 tokens exemplifying 683 types of men plural. Some sample r, n_r figures are shown in Table 7.3.

The question whether a linguistic class is open or closed is not the same as the question whether the number s of species in a population is finite or infinite. Asking whether a large class of linguistic items should be regarded as mathematically infinite tends to be a sterile, philosophical question. The number of words in the English vocabulary, for instance, must arguably be finite: for one thing because only a finite number of users of the language have lived, each producing a finite number of word-tokens in his lifetime, and word-types must be fewer than word-tokens; for another thing, because any English word is a string of tokens of a few dozen character-types, and it is probably safe to say that a word more than twice as long as the longest that has occurred would be unusable. Not all writers agree that these considerations imply finiteness; Andras Kornai (forthcoming) argues that they do
Table 7.3

r       n_r
1       268
2       112
3       70
4       41
5       24
6       14
7       15
400     1
1918    1
not, and that usage statistics demonstrate that the vocabulary really is infinitely large. But even if the English vocabulary is finite, it is certainly an open class: for practical purposes it 'might as well' be infinitely large. The question whether a class is closed or open in this sense might be glossed as whether a sample of a size that is practical to assemble will contain examples of a large fraction, or only a small fraction, of all the species constituting the class. A corpus of tens of millions of English word-tokens will exemplify only a tiny fraction of all the word-types used in the English language.

In terms of the statistical concepts under discussion, if a class is closed we expect to find 1* > 1. With a closed class one will soon see most of the species, so the number of species seen just once will tend to become small. For the Chinese plurals data, 1* = 2n_2/n_1 = 2 × 112/268 = 0.84, which is convincingly less than unity; so we conclude that the class of Chinese nouns forming a plural in men is open, at least in the sense that it must be very much larger than the 683 observed cases. This harmonizes with the statement in Y. R. Chao's authoritative grammar of Chinese (Chao 1968: 244-5), according to which men can be suffixed to 'words for persons' (and, in certain regional dialects, to some other nouns), which suggests that men plurals form an open class.

Rather than giving a series of figures analogous to Table 7.2 for the Chinese plurals example, a clearer way of showing the reader the nature of a set of SGT estimates is via a plot of r*/r against r - such a plot is enlightening whenever Good-Turing techniques are applied.
Figure 7.3
Figure 7.3 plots r*/r against r for both the prosody and the Chinese-plural examples, representing the two sets of data points by 'P' and 'C' respectively. The Chinese plurals example needs more probability set aside for unseen types than does the prosody example (0.04 versus 0.004); but it has twice as many types and five times as many tokens to take this probability from, so the ratios of r* to r are not so very different between the two cases. The fact that 1* and 2* in the Chinese plurals case are based on raw proxies which yield estimates that are significantly larger than the alternative estimates based on smoothed proxies - as is apparent from the distribution of 'C' symbols in Figure 7.3 - hints that the class may not be entirely open-ended, but if required to categorize it on one or the other side of what is in reality a continuum between open and closed classes, on the basis of the data used here we would probably do well to treat it as open.
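The open-versus-closed diagnostic needs nothing more than n_1 and n_2. A trivial Python illustration (mine), using the figures from Table 7.3:

    def one_star(n1, n2):
        # Raw Turing estimate 1* = 2 * n_2 / n_1; values below 1 point towards an open class.
        return 2 * n2 / n1

    print(one_star(268, 112))   # about 0.84 for the Chinese men plurals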
6 The procedure step by step

This section presents a complete but totally mechanical statement of the SGT algorithm. No rationale is offered in this section. Section 3 covered the reasons for the steps in the algorithm which I now present.

Our data are a sample of individuals belonging to various species. On the basis of the numerical properties of the sample we shall assign values to an integer variable N and real variables P_0, N', a, b, and to the cells of a table. The table is to have as many rows as there are distinct species frequencies in the data, and seven columns labelled r, n, Z, log r, log Z, r*, p. The values in the r and n columns will be integers, those in the other columns will be reals (in a concrete computer realization of the algorithm it may be convenient to use separate arrays).

First, tabulate the various species frequencies found in the sample, and the numbers of distinct species occurring with each species frequency, in the r and n columns respectively. For instance, a row with r = 3 and n = 24 will mean that there are 24 different species each represented in the sample by three individuals. Enter these pairs of numbers in the appropriate columns in such a way that r values always increase between successive rows: the first row will have r = 1, and the last row will have the frequency of the commonest species in the r column. It is convenient not to include rows in the table for frequencies that are not represented in the sample: thus the n column will contain no zeros, and many integers between 1 and the highest species frequency will appear nowhere in the r column. Thus, for the prosody example of section 2 these first two columns will look like the Appendix. I shall use the values in the r column to identify the rows, and they will appear as subscripts to the labels of the other columns to identify cell values. For instance Z_i will mean the contents of the cell in the Z column and the row which has i in the r column (not the i-th row).

Assign to N the sum of the products of the pairs of integers in the r and n columns. This will be the number of individuals in the sample. (In practice
the value of N will often have been ascertained at an earlier stage, but if not, it can be done in this way.)

Assign to P_0 the value n_1/N (where n_1 represents the value in the n column and the row for which r = 1). P_0 is our estimate of the total probability of all unseen species. If the identity of the various unseen species is known, P_0 should be divided between them by reference to whatever features of the species may suggest prior probabilities for them (cf. section 9 below).

Enter values in the Z column as follows. For each row j, let i and k be the values in the r column for the immediately previous and immediately following rows respectively (so that k > i). If j is the first row, let i be 0; if j is the last row, let k be 2j - i. Set Z_j to the value 2n_j/(k - i).

Enter the logarithms of the r and Z values in the corresponding rows of the log r and log Z columns.

Use regression analysis to find the line of best fit a + b log r to the pairs of values in the log r and log Z columns. (Regression analysis to find the 'line of best fit' or 'line of least squares' for a set of data points is a simple and standard manipulation described in most elementary statistics textbooks; see e.g. Press et al. 1988: 523-6, which includes computer coding.) I shall use 'S(r)' as an abbreviation for the function antilog(a + b log r). (If base-10 logarithms are used, antilog(x) means 10^x.)

Working through the rows of the array in order beginning with the row r = 1, begin by calculating for each value of r the two values x and y defined by Equations 3 and 4 below. If the inequality labelled 'Equation 5' holds, then insert x in the r* column. (The notation |x - y| represents the absolute difference between x and y.) If Equation 5 does not hold, insert y in the r* column, and cease to calculate x values: for all subsequent rows insert the respective y value in the r* column.

    x = (r + 1) n_{r+1} / n_r     (Equation 3)

    y = (r + 1) S(r + 1) / S(r)     (Equation 4)

    |x - y| > 1.96 sqrt( (r + 1)^2 (n_{r+1} / n_r^2) (1 + n_{r+1} / n_r) )     (Equation 5)
(Since the values in the r column are not continuous, in theory the instruction of the preceding paragraph might be impossible to execute because the calculation of x could call for an n_{r+1} value when the table contained no row with the corresponding value in the r column. In practice this is likely never to happen, because the switch to using y values will occur before gaps appear in the series of r values. If it did ever happen, the switch to using y values would have to occur at that point.)
Let N' be the total of the products n_r r* for the various rows of the table. For each row calculate the value

\[ p_r = (1 - P_0)\,\frac{r^*}{N'} \]
and insert it in the p column. Each value p_r in this column is now the SGT estimate for the population frequency of a species whose frequency in the sample is r.
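The whole of the procedure just stated can be expressed as a short program. The sketch below is my own illustrative Python rendering of section 6, not the distributed implementation mentioned in note 13; the function name and interface are invented for the example, and the coefficient 1.96 is the one given for Equation 5 (note 8 records that the implementations reported later actually used 1.65).

    import math

    def simple_good_turing(species_counts, criterion=1.96):
        """species_counts: one sample frequency per observed species.
        Returns (p0, p) where p0 is the estimated total probability of all
        unseen species and p maps each observed frequency r to the SGT
        estimate of the population frequency of a species seen r times."""
        # Tabulate the frequencies of frequencies: rows (r, n_r), r ascending.
        n = {}
        for r in species_counts:
            n[r] = n.get(r, 0) + 1
        rows = sorted(n)
        N = sum(r * n[r] for r in rows)          # number of individuals, N
        p0 = n.get(1, 0) / N                     # P0 = n_1 / N

        # Z_r = 2 n_r / (k - i), i and k being the neighbouring r values.
        Z = {}
        for j, r in enumerate(rows):
            i = 0 if j == 0 else rows[j - 1]
            k = 2 * r - i if j == len(rows) - 1 else rows[j + 1]
            Z[r] = 2 * n[r] / (k - i)

        # Line of best fit of log Z against log r gives the smooth S(r).
        xs = [math.log10(r) for r in rows]
        ys = [math.log10(Z[r]) for r in rows]
        xm, ym = sum(xs) / len(xs), sum(ys) / len(ys)
        b = (sum((x - xm) * (y - ym) for x, y in zip(xs, ys))
             / sum((x - xm) ** 2 for x in xs))
        a = ym - b * xm

        def S(r):
            return 10 ** (a + b * math.log10(r))

        # r*: raw (Turing) proxies while they differ significantly from the
        # smoothed proxies (Equation 5), smoothed proxies thereafter.
        r_star = {}
        smoothed = False
        for r in rows:
            y = (r + 1) * S(r + 1) / S(r)
            if not smoothed:
                if r + 1 not in n:
                    smoothed = True              # gap in the r series: switch
                else:
                    x = (r + 1) * n[r + 1] / n[r]
                    sd = math.sqrt((r + 1) ** 2 * (n[r + 1] / n[r] ** 2)
                                   * (1 + n[r + 1] / n[r]))
                    if abs(x - y) > criterion * sd:
                        r_star[r] = x
                        continue
                    smoothed = True
            r_star[r] = y

        # Renormalize: N' = sum of n_r r*, and p_r = (1 - P0) r* / N'.
        N_prime = sum(n[r] * r_star[r] for r in rows)
        p = {r: (1 - p0) * r_star[r] / N_prime for r in rows}
        return p0, p

For the prosody example, the argument would be a list of species frequencies containing 120 entries equal to 1, 40 equal to 2, 24 equal to 3, and so on, as tabulated in the Appendix.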
7 Tests of accuracy: a Monte Carlo study

SGT gives us estimates for species probabilities in the prosody example, but although we have theoretical reasons for believing the estimates to be good, we have no way of determining the true probabilities for this example, and hence no objective way of assessing the accuracy of the method. I now present two cases where we do know the answers.

The first is a Monte Carlo study, meaning that data-sets are created artificially, using a (pseudo-)random number generator, in order to constitute samples from populations with known statistical properties: statistical inference techniques can be applied to such samples and their findings compared with the properties which the population is known to possess. Such techniques are well established in statistical research.

For this study, Gale constructed a set of sample texts each containing 100,000 word-tokens. Each text was constructed by drawing tokens randomly from an ordered list w_1, w_2, ..., w_s of word-types, with the probability of drawing a token of the i-th type being made proportional to i^z for some z less than -1. Specifically, for a text with given s and z the probability of w_i (1 <= i <= s) was

\[ p(w_i) = \frac{i^{z}}{\sum_{j=1}^{s} j^{z}} \]
Such a distribution is called a Zipfian distribution with exponent z (the reference here being to 'Zipf's Law' - cf. note 6). The study used five values of s (vocabulary size), namely 5,000, 10,000, 25,000, 50,000, and 100,000, and four values of z, namely -1.1, -1.2, -1.3, and -1.4. One text was constructed for each combination of vocabulary size and exponent, giving 20 texts in all. At most 15,000 word-types were represented in any one text; thus the spectrum of vocabulary sizes extended from cases where the finite nature of the vocabulary was significant, to cases where it is impossible to tell from a 100,000-token text whether the vocabulary is finite or infinite. The range of exponents is representative of values seen in real linguistic data.
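For concreteness, a text of this kind can be generated along the following lines. This is an illustrative Python sketch under the stated distributional assumptions, not Gale's original program; the function name is invented, and the default parameter values are simply one of the combinations listed above.

    import random

    def zipfian_text(s=5000, z=-1.1, tokens=100000, seed=0):
        # p(w_i) proportional to i**z, normalized over the s word-types
        weights = [i ** z for i in range(1, s + 1)]
        total = sum(weights)
        probs = [w / total for w in weights]
        rng = random.Random(seed)
        # draw the tokens independently according to these probabilities
        return rng.choices(range(1, s + 1), weights=probs, k=tokens)

    # species frequencies of one synthetic text
    text = zipfian_text()
    freq = {}
    for w in text:
        freq[w] = freq.get(w, 0) + 1

The resulting species frequencies can then be handed to an SGT implementation and the estimates compared with the probabilities that the construction guarantees.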
The question to which Good-Turing methods estimate the answer, for any one of these texts, is what the average probability is of those types which are represented exactly r times in the text (for any integer r). Each individual type is assigned a specific probability by the model used to construct the text, therefore the true average probability of types which are represented by r tokens can easily be calculated. Since the most difficult cases are those where r is small, we assessed accuracy over the range 1 <= r <= 10. Average probabilities for r in this range have two to three significant figures.

We compared the performance of the Simple Good-Turing method on these data with the performance of three other frequency-estimation techniques which a researcher might be inclined to use: two variants of the additive method of section 1, and the Deleted Estimate of Jelinek and Mercer (1985).15

Fienberg and Holland (1972) survey six variants of the additive method, which all share the advantage of giving non-zero estimates for unseen species frequencies but differ with respect to choice of the figure k which is added to observed sample frequencies. They discuss three 'a priori' values: 1 (as in section 1), 1/2, and 1/s (where s is the number of species, so that one observation in total is added to the observations - this choice of k was advocated by Perks 1947: 308); and three 'empirical' values, meaning that k is determined in different ways by the properties of the particular set of observations under analysis. For the kind of data Fienberg and Holland discuss, they suggest that 1 is too large a value for k and 1/s too small, but that all four other choices are reasonable. We have chosen to assess the additive method here using k = 1/2 and k = 1/s as two representative choices (I refer to the additive estimator using these values for k as Add-Half and Add-Tiny respectively): Add-Half is very similar to Add-One but has somewhat greater theoretical justification,16 and Add-Tiny has the superficial attraction of minimally distorting the true observations. We have not separately assessed Add-One, because it is so similar to Add-Half, and we have not assessed Fienberg and Holland's 'empirical' estimators, because language and speech researchers attracted to the simplicity of the additive method would scarcely be tempted to choose these variants. Add-Half and Add-Tiny are extremely simple to apply, and they may be useful for 'quick and dirty' preliminary studies. But we shall see that they both perform too poorly on the Monte Carlo data-sets to seem worth considering for serious investigations.17

A theoretically respectable alternative to Good-Turing methods is the well-established statistical technique of cross-validation. It has been applied to linguistic data under the name 'Deleted Estimate' by Jelinek and Mercer (1985); see also Nadas (1985). Cross-validation requires calculations which are more demanding than those of SGT, but they are by no means beyond the resources of modern computing facilities. For present purposes I examine the simplest case, two-way cross-validation ('2CV').

The Good-Turing estimator is based on a theorem about the frequency one would expect in a hypothetical additional sample for species occurring with a given frequency r in an observed sample. Jelinek and Mercer begin by
defining a held-out estimator which turns this concept from hypothesis to reality, creating an actual additional sample by dividing an available text sample into two halves, called retained and held-out, corresponding respectively to the actual sample and the hypothetical additional sample of the Good-Turing approach. Let n_r be the number of species which are each represented r times in the retained subsample, and let C_r be the total number of occurrences of those particular species in the held-out subsample. Then C_r/n_r is used as the adjusted frequency r* from which the estimated population frequency is derived.

As it stands, this technique is inefficient in the way it extracts information from available data. Two-way cross-validation (such as Jelinek and Mercer's Deleted Estimate) uses the data less wastefully; it combines two held-out estimates made by swapping the roles of held-out and retained subsamples. If we denote the two halves of the data by 0 and 1, we write \(n^0_r\) for the number of species each occurring r times in subsample 0, and \(C^{01}_r\) for the total number of occurrences in subsample 1 of those particular species; \(n^1_r\) and \(C^{10}_r\) are defined correspondingly. The two held-out estimators would be \(C^{01}_r/n^0_r\) and \(C^{10}_r/n^1_r\); the Deleted Estimate combines the underlying measurements by using Equation 6 to estimate r*:

Equation 6
\[ r^* = \frac{C^{01}_r + C^{10}_r}{n^0_r + n^1_r} \]

Cross-validation does not make the binomial assumption made by Good-Turing methods; it makes only the much weaker assumptions that the two subsamples are generated by statistically identical processes, and that the probability of a species seen r times in a sample of size N is half that of a species seen r times in a sample of size N/2.

Cross-validation need not be 'two-way'; available data may be divided into three or more subsamples. However, even two-way cross-validation is a computationally intensive procedure, and the computational demands grow as the number of subsamples is increased.

One consideration distinguishing the additive techniques from both the Good-Turing and cross-validation approaches is that the former, but not the latter, require knowledge of the number of unseen species. In a real-life application where the 'species' are vocabulary items, this knowledge would not ordinarily be available. Nevertheless, Gale and I allowed ourselves to use it, in order to produce results from the additive techniques for comparison with the SGT and 2CV results. Since both the additive techniques used prove inferior to both SGT and 2CV, allowing the former to use the extra information has not distorted the validity of our overall conclusions.
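For readers who want to see the arithmetic of Equation 6 spelled out, the following Python sketch computes the Deleted Estimate from two subsamples supplied as lists of species labels; it is an illustration of the formula rather than Jelinek and Mercer's own procedure, and the function name is invented for the purpose.

    from collections import Counter

    def deleted_estimate(sample0, sample1):
        """Two-way cross-validation: r* = (C01_r + C10_r) / (n0_r + n1_r),
        where the two samples are given as lists of species labels."""
        c0, c1 = Counter(sample0), Counter(sample1)
        n0, n1, C01, C10 = Counter(), Counter(), Counter(), Counter()
        for species, r in c0.items():
            n0[r] += 1                 # species seen r times in subsample 0
            C01[r] += c1[species]      # their occurrences in subsample 1
        for species, r in c1.items():
            n1[r] += 1
            C10[r] += c0[species]
        return {r: (C01[r] + C10[r]) / (n0[r] + n1[r])
                for r in set(n0) | set(n1)}

The counting over both subsamples, repeated for every fold, is what makes cross-validation more demanding than SGT as the number of subsamples grows.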
Table 7.4

Method      RMS error
Add-Half    0.47
Add-Tiny    2.62
SGT         0.062
2CV         0.18
Because true and estimated probabilities can vary by several orders of magnitude, it is convenient to express the error in an estimated probability as the logarithm of its ratio to the true probability. For each of the four estimation methods, Table 7.4 gives the root mean square of the base-10 logarithms of these ratios for 11 values of r from 0 to 10 for each of the 20 data-sets (five values of s times four values of z). I shall refer to the root mean square of the logarithms of a set of estimated-probability/true-probability ratios as the average error of the set. Add-Half gets the order of magnitude correct on average, but Add-Tiny fails even to achieve that on the Monte Carlo data. Of the four methods, SGT gives the best overall results.

Breaking down the overall error rates by different values of r shows where the different techniques fail. In Figure 7.4, different plotting symbols represent the different methods as follows:
H   Add-Half
T   Add-Tiny
G   Simple Good-Turing
C   Two-Way Cross-Validation
Figure 7.4
Figure 7.5
In order to accommodate the full frequency range in a single figure, Figure 7.4 uses two scales for average error: the scale on the left applies for r <= 2, the scale on the right applies for r > 2. Each point represents the average error for the 20 combinations of vocabulary size and exponent. We see that the additive methods are grossly wrong for unseen species, and remain less accurate than SGT and 2CV over the range of positive frequencies shown.

By eliminating the data points for the additive methods, Figure 7.5 is able to use a less compressed vertical scale to display the error figures for the SGT and 2CV methods. We see that for r greater than about 2, the performance of SGT is comparable to that of 2CV, but that the latter is poor for r <= 2. (It is of course possible that multi-way cross-validation would give better performance for small r; we do not know whether that is so or not, but we have seen that multi-way cross-validation is far more demanding computationally than SGT.)

The performance of SGT in particular is displayed in more detail in Figure 7.6, which further expands the vertical scale. We see that SGT does best for small r and settles down to an error of about 5 per cent for large r. There is an intermediate zone of a few frequencies where SGT does less well. This is because the SGT method switches between estimates based on raw proxies for small r, and estimates based on smoothed proxies for higher r: in the switching region both of these estimation methods have problems.

Figures 7.7 and 7.8 show how average error in the SGT estimates varies with vocabulary size and with Zipfian exponent respectively. We see that
Figure 7.6
there is no correlation between error level and vocabulary size, and little evidence for a systematic correlation between error level and exponent (though the average error for the largest exponent is notably greater than for the other three values). Furthermore, the range over which average error varies is much smaller for varying exponent or (especially) varying vocabulary size than for varying r.

Error figures obtained using real linguistic data would probably be larger than the figures obtained in this Monte Carlo study, because word-tokens are not binomially distributed in real life.

8 Tests of accuracy: a bigram study
A second test of accuracy is based on the findings reported in Church and Gale (1991: tables 1 and 2), relating to the distribution of bigrams (pairs of adjacent words) in a large corpus. This study used a 44-million-word sample of English comprising most of the different articles distributed by the Associated Press newswire in 1988 (some portions of the year were missing, and the material had been processed in order to eliminate identical or near-identical articles). Each bigram in the sample was assigned randomly to one of two subsamples: thus, although we may not know how representative 1988 AP newswire stories are of wider linguistic populations, such as 'modern journalistic American English', what matters for present purposes is that we can be sure that the two subsamples come as close as possible to
Figure 7.7
Figure 7.8
Table 7.5

r    nr           r*SGT    r*HO
1    2,018,046    0.446    0.448
2    449,721      1.26     1.25
3    188,933      2.24     2.24
4    105,668      3.24     3.23
5    68,379       4.22     4.21
6    48,190       5.19     5.23
7    35,709       6.21     6.21
8    27,710       7.24     7.21
9    22,280       8.25     8.26
both representing exactly the same population (namely, 1988 AP newswire English).

Since Good-Turing techniques predict frequencies within a hypothetical additional sample from the same population as the data, whereas the 'held-out' estimator of section 7 reflects frequencies found in a real additional sample, we can treat the held-out estimator based on the two 22-million-word AP subsamples as a standard against which to measure the performance of the SGT estimator based on just the 'retained' subsample. Table 7.5 compares the r* estimates produced by held-out and SGT methods for frequencies from 1 to 9. In this example, the huge values for n_r meant that for all r values shown, the SGT method selected the estimate based on raw rather than smoothed proxies. In no case does the SGT estimate deviate by more than 1 per cent from the held-out estimate. The quantity of data used makes this an untypical example, but the satisfactory performance of the SGT technique is nevertheless somewhat reassuring. (The largest error in the estimates based on smoothed proxies is 6 per cent, for r = 0 - that is, for the P0 estimate.)
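The held-out yardstick used in Table 7.5 can be sketched in the same style; the code below assumes the retained and held-out subsamples are available as lists of bigrams (or other species labels), and simply computes the adjusted frequencies C_r/n_r defined in section 7. The function name and the cut-off parameter are invented for the illustration.

    from collections import Counter

    def held_out_r_star(retained, held_out, max_r=9):
        # C_r / n_r: species seen r times in the retained half, and the
        # total occurrences of those same species in the held-out half
        ret, held = Counter(retained), Counter(held_out)
        n, C = Counter(), Counter()
        for species, r in ret.items():
            if r <= max_r:
                n[r] += 1
                C[r] += held[species]
        return {r: C[r] / n[r] for r in sorted(n)}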
9 Estimating the probabilities of unseen species

Good-Turing techniques give an overall estimate P0 for the probability of all unseen species taken together, but in themselves they can give no guide to the individual probabilities of the separate unseen species. Provided the number of unseen species is known, the obvious approach is to divide P0 equally between the species. But this is a very unsatisfactory technique. Commonly, the 'species' in a linguistic application will have internal structure of some kind, enabling shares of P0 to be assigned to the various species by reference to the probabilities of their structural components: the resulting estimates may be rather inaccurate, if probabilities of different components are in reality not independent of one another, but they
are likely to be much better than merely assigning equal probabilities to all unseen species.

The bigram study discussed in section 8 offers one illustration. Many bigrams will fail to occur in even a large language sample, but the sample gives us estimated unigram probabilities for all word-types it contains (that is, probabilities that a word-token chosen at random from the population represents the respective word-type). Writing p(w) for the estimated unigram probability of a word w, the bigram probability of any unseen bigram w1 w2 can be estimated by taking the product p(w1)p(w2) of the unigram probabilities and multiplying it by P0/P'0 (where P0 is the Good-Turing estimate of total unseen-species probability as before, and P'0 is the sum of the products p(wi)p(wj) for all unseen bigrams wi wj: multiplying by P0/P'0 is a renormalization step necessary in order to ensure that estimated probabilities for all seen and unseen bigrams total unity). This technique is likely to overestimate probabilities for unseen bigrams consisting of two common words which would usually not occur together for grammatical reasons (say, the if), and to underestimate probabilities for two-word set phrases that happen not to occur in the data, but over the entire range of unseen bigrams it should perform far better on average than simply sharing P0 equally between the various cases.

A second example is drawn from the field of syntax. In the SUSANNE Corpus, syntactic constituents are classified in terms of a fine-grained system of grammatical features, which in some cases allows for many more distinct constituent-types than are actually represented in the Corpus. (The following figures relate to a version of the SUSANNE Corpus from which 2 per cent of paragraphs chosen at random had been excluded in order to serve as test data for a research project not relevant to our present concerns: thus the sample studied comprises about 127,000 words.) Taking the syntactic category 'noun phrase' for investigation, the sample contains 34,204 instances, classified by the SUSANNE annotation scheme into 74 species. For instance, there are 14,527 instances of Ns, 'noun phrase marked as singular', which is the commonest species of noun phrase; there are 41 instances of Np@, 'appositional noun phrase marked as plural'; one of the species represented by a single instance is Nj", 'noun phrase with adjective head used vocatively'. However, the number of possible species implied by the annotation scheme is much larger than 74. A noun phrase may be proper or common, it may be marked as singular, marked as plural, or unmarked for number, and so on for six parameters having between two and six values, so that the total number of species is 1,944. The number of species seen once, n1, is 12; therefore the Good-Turing estimate of P0, the total probability of unseen species, is 12/34204 = 0.00035. (The SGT estimate for p1 is 2.6e-05.)

Since each noun-phrase species is defined as a conjunction of values on six parameters, counts of the relative frequencies of the various values on each parameter can be used to estimate probabilities for unseen species. For instance, on the number-marking parameter, to two significant figures 0.64 of noun phrases in the sample are
marked as singular, 0.22 are marked as plural, and 0.13 are unmarked for number. If one estimates the probability for each unseen species by multiplying together these probabilities for the various parameter values which jointly constitute the species, then P'0, the sum of the products for the 1,870 unseen species, is 0.085; some samples of estimated probabilities for individual unseen species are shown in Table 7.6.

In Table 7.6, Nas+ represents a conjunct introduced by a co-ordinating conjunction and marked as subject and as singular, for instance the italicized phrase in a hypothetical sequence 'my son and he were room-mates'. In English it is more usual to place the pronoun first in such a co-ordination; but intuitively the quoted phrase seems unremarkable, and the calculation assigns an estimated probability of the same order as that estimated for species seen once. Nyn represents a proper name having a second-person pronoun as head. This is possible - until recently I drove to work past a business named Slender You - but it seems much more unusual; it is assigned an estimated probability an order of magnitude lower. Njp! represents a noun phrase headed by an adjective, marked as plural, and functioning as an exclamation. Conceivably, someone contemplating, say, the aftermath of a battle might utter the phrase These dead!, which would exemplify this species - but the example is patently contrived, and it is assigned an estimated probability much lower still.

Thus the probability differentials in these cases do seem to correlate at least very crudely with our intuitive judgements of relative likelihood - certainly better than would be achieved by sharing P0 equally among the unseen species, which would yield an estimated probability of 1.9e-07 in each case. (As in the bigram case, there are undoubtedly some individual unseen species in this example which will be much rarer or commoner than predicted because of interactions between values of separate parameters.)

This approach to estimating the probabilities of unseen species depends on the nature of particular applications. For most language and speech applications, though, it should be possible to find some way of dividing unseen 'species' into complexes of components or features whose probabilities can be estimated individually, in order to apply this method. For the prosody example, for instance, a particular sequence of sound-classes could be divided into successive transitions from the eight-member set VC, VR, CC, CR, RC, RR, CV, RV, each of which could be assigned a probability from the available data. Whenever such a technique is possible, it is recommended.
Table 7.6

Nas+    .0013 x P0/P'0 = 5.4e-06
Nyn     .00015 x P0/P'0 = 6.2e-07
Njp!    2.7e-07 x P0/P'0 = 1.1e-09
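A sketch of the renormalization just described, for the bigram case, is given below. The unigram probabilities, the set of seen bigrams and the value of P0 are assumed to be available already; the function name is invented for the illustration, and the same pattern of multiplying component proportions and rescaling by P0/P'0 is what yields the figures in Table 7.6 for the noun-phrase example.

    def unseen_bigram_probs(p_unigram, seen_bigrams, p0):
        # raw product p(w1)p(w2) for every bigram absent from the sample,
        # then rescaled by P0 / P'0 so that the shares sum to P0
        words = list(p_unigram)
        raw = {(w1, w2): p_unigram[w1] * p_unigram[w2]
               for w1 in words for w2 in words
               if (w1, w2) not in seen_bigrams}
        p0_prime = sum(raw.values())           # P'0
        return {bg: v * p0 / p0_prime for bg, v in raw.items()}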
10 Summary
I have presented a Good-Turing method for estimating the probabilities of seen and unseen types in linguistic applications. This Simple Good-Turing estimator uses the simplest possible smooth of the frequencies of frequencies, namely a straight line, together with a rule for switching between estimates based on this smooth and estimates based on raw frequencies of frequencies, which are more accurate at low frequencies. The SGT method is more complex than additive techniques, but simpler than two-way cross-validation. On a set of Monte Carlo examples SGT proved to be far more accurate than additive techniques; it was more accurate than 2CV for low frequencies, and about equally accurate for higher frequencies.

The main assumption made by Good-Turing methods is that items of interest have binomial distributions. The accuracy tests reported in section 7 relate to artificial data for which the items are binomially distributed; how far the usefulness of SGT may be vitiated by breakdowns of the binomial assumption in natural-language data is an unexplored issue.

The complexities of smoothing may have hindered the adoption of Good-Turing methods by computational linguists. I hope that SGT is sufficiently simple and accurate to remedy this.

Notes

1 Many writers reserve the term frequency for counts of specific items, and if we conformed to this usage we could not call p_r a 'population frequency': such a quantity would be called a 'population probability'. In order to make this account readable for non-mathematicians, I have preferred not to observe this rule, because it leads to conflicts with ordinary English usage: people speak of the frequency, not the probability, of (say) redheads in the British population, although that population is open-ended, with new members constantly being created. Furthermore I. J. Good, the writer whose ideas I shall expound, himself used the phrase 'population frequency'. But it is important that readers keep distinct in their mind the two senses in which 'frequency' is used in this chapter.

2 The likelihood of x given y is the probability of y given x, considered as a function of x (Fisher 1922: 324-7; Box and Tiao 1973: 10). The maximum likelihood estimator selects that population frequency which, if it obtained, would maximize the probability of the observed sample frequency. That is not the same as selecting the population frequency which is most probable, given the observed sample frequency (which is what we want to do).

3 When reading Church and Gale 1991, in which the item cited above is an appendix, one should be aware of two notational differences between that article and Good 1953, to which the notation of the present article conforms. Church and Gale use Nr rather than nr to represent the frequency of frequency r; and they use V (for 'vocabulary') rather than s for the number of species in a population.

4 Mosteller and Wallace 1964 concluded that for natural-language data the 'negative binomial' distribution tended to fit the facts better than the binomial distribution; however, the difference was not great enough to affect the conclusions of
their own research, and the difficulties of using the negative binomial distribution have in practice kept it from being further studied or used in the subsequent development of computational linguistics, at least in the English-speaking world.

5 When r is the highest sample frequency, Z_r is computed by setting r'' to a hypothetical higher frequency which exceeds r by the same amount as r exceeds r'.

6 To avoid confusion, I should point out that Zipf made two claims which are prima facie independent (Zipf 1935: 40-8). The generalization commonly known as Zipf's Law (though Zipf himself yielded priority to J. B. Estoup) is that, if vocabulary items or members of analogous sets are ranked by frequency, then numerical rank times frequency is roughly constant across the items. This law (later corrected by Benoit Mandelbrot, cf. Apostel, Mandelbrot and Morf 1957, and claimed by George Miller (1957) to be a statistical inevitability) does not relate directly to our discussion, which is not concerned with rank order of species. But Zipf also held that frequency and frequency-of-frequency are related in a log-linear manner. In fact Zipf claimed (see e.g. Zipf 1949: 32, 547 n. 10 to ch. 2) that the latter generalization follows from the former; however, Gale and I do not rely on (and do not accept) this argument, pointing out merely that our empirical finding of log-linear relationships in diverse language and speech data-sets agrees with a long-established general observation.

7 Statistically sophisticated readers might expect data points to be differentially weighted in computing a line of best fit. We believe that equal weighting is a good choice in this case; however, a discussion would take us too far from our present theme.

8 The implementations of the SGT technique reported later in this paper used the coefficient 1.65 rather than 1.96, corresponding to a 0.1 significance criterion. This means that there are likely to be a handful of cases over the range of examples discussed where a p_r estimate for some value of r was based on a raw proxy where, using the more usual 0.05 significance criterion, the technique would have selected an estimate based on a smoothed proxy for that particular value of r.

9 The approximations made to reach this are that n_r and n_{r+1} are independent, and that Var(n_r) = n_r. For once the independence assumption is reasonable, as may be gathered from how noisy n_r is. The variance approximation is good for binomial sampling of species with low probability, so it is consistent with Good-Turing methodology.

10 Standard smoothing methods applied over the entire range of r will typically oversmooth n_r for small r (where the unsmoothed data estimate the probabilities well) and undersmooth n_r for large r (where strong smoothing is needed); they may also leave local minima and maxima, or at least level stretches, in the series of smoothed n_r values. All of these features are unacceptable in the present context, and are avoided by the SGT technique. They are equally avoided by the smoothing method used in Church and Gale (1991), but this was so complex that neither author has wished to use it again.

11 Chinese script represents morphemes as units, and lacks devices comparable to word spacing and hyphenation that would show how morphemes group together into words.

12 Testing the statistical significance of such a difference is an issue beyond the scope of this book.

13 Source code implementing this algorithm is available by anonymous ftp from URL
14 Regression analysis yields some line for any set of data points, even points that do not group round a linear trend. The SGT technique would be inappropriate in a case where r and Z were not in a log-linear relationship. As suggested in section 4, I doubt that such cases will be encountered in the linguistic domain; if users of the technique wish to check the linearity of the pairs of values, this can be done by eye from a plot, or references such as Weisberg 1985 give tests for linearity and for other ways in which linear regression can fail.

15 Slava Katz (1987) used an estimator which approximates to the Good-Turing technique but incorporates no smoothing: r* is estimated as

\[ r^* = (r+1)\,\frac{n_{r+1}}{n_r} \]
for values of r below some number such as 6 chosen independently of the data, and simply as r for higher values of r, with renormalization to make the resulting probabilities sum to 1. Although having less principled justification than true Good-Turing methods, this estimator is very simple and may well be satisfactory for many applications; we have not assessed its performance on our data. (We have also not been able to examine further new techniques recently introduced by Chitashvili and Baayen 1993.)

16 Following Fisher (in the passage cited in note 2 above), Box and Tiao 1973: 34-6 give a non-informative prior for the probability, pi, of a binomially distributed variable. Their equation 1.3.26 gives the posterior distribution for the probability pi after observing y successes out of n trials. The expected value of this probability can be found by integrating pi times the equation given from zero to one, which yields

\[ \frac{y + \tfrac{1}{2}}{n + 1} \]
This is equivalent to adding one-half to each of the number of successes and failures. Add-Half is sometimes called the 'expected likelihood estimate', parallel to the 'maximum likelihood estimate' defined above.

17 I. J. Good (who defined one of the empirical additive estimators surveyed by Fienberg and Holland - cf. Good 1965: 23-9) has suggested to me that additive techniques may be appropriate for cases where the number of species in the population is small, say fewer than fifty (for instance when estimating frequencies of individual letters or phonemes), and yet some species are nevertheless unrepresented, or represented only once, in the sample. This would presumably be a quite unusual situation in practice.
Appendix

This appendix contains the full set of (r, nr) pairs for the prosody example of section 2.

r     nr        r      nr
1     120       45     3
2     40        46     1
3     24        47     1
4     13        50     1
5     15        71     1
6     5         84     1
7     11        101    1
8     2         105    1
9     2         121    1
10    1         124    1
12    3         146    1
14    2         162    1
15    1         193    1
16    1         199    1
17    3         224    1
19    1         226    1
20    3         254    1
21    2         257    1
23    3         339    1
24    3         421    1
25    3         456    1
26    2         481    1
27    2         483    1
28    1         1140   1
31    2         1256   1
32    2         1322   1
33    1         1530   1
34    2         2131   1
36    2         2395   1
41    3         6925   1
43    1         7846   1
8 Objective evidence is all we need
1 The generative view of linguistic evidence
The evidence on which a linguistic theory is based, whether this is a theory about an individual language (what linguists call a grammar) or a general theory about human language, consists of people's utterances. The data for a grammar of English are the utterances of English speakers; the data for a theory of language are the grammars of the various languages of the world, so that ultimately the general theory of language is again based on utterances.

This point might seem too obvious to be worth making. But in fact generative linguists have taken a radically different line. According to Chomsky, grammars are based on people's 'intuitions' about, or 'knowledge' of, their native language. 'The empirical data that I want to explain [Chomsky said - cf. p. 2] are the native speaker's intuitions.' Chomsky's statement about linguistics being based on speakers' intuitions was made in the context of a specialist conference, but it soon became part of the standard teaching of linguistics, retailed in introductory textbooks such as Terence Langendoen's The Study of Syntax (1969). According to Langendoen (p. 3),

collecting specimens of what people actually say or write ... will not lead us to form a very clear picture of what the sentences of English are, since first of all, not all these specimens will be sentences of English, and secondly, the number of sentences actually collected will not even begin to exhaust the totality of English sentences.
Instead,

We require techniques of elicitation. Such techniques involve finding out the judgments that speakers of English make ... The most common form which elicitation takes is introspection ... [The linguist] may inquire of himself as to what judgments he makes concerning a particular linguistic object.
Nor are these introspective judgements limited to decisions that particular word-sequences are (or are not) grammatical in the language:
Other judgments that one can elicit concern the internal structure of objects regarded as sentences ... such judgments may be called linguistic intuitions, and they constitute the raw linguistic data which the student of English syntax must use in his research. (Langendoen 1969: 3-4)
Langendoen's next chapter goes into some detail on the kinds of linguistic facts which can readily be elicited from speakers:

The simplest judgments that one can elicit concern classification of the words of sentences into 'parts of speech,' such as noun ..., verb ..., conjunction ..., and the like, and of the groupings of these elements into phrases and clauses ... [Quoting an example sentence, Langendoen states] No fluent speakers of English are likely to have difficulty in picking out the nouns ...; similarly the verbs ..., although many will rightly point out the auxiliary status of is ... [English speakers also have] intuitions of the organization of the sentence ... [The example] as a whole may be broken down into two major phrases: a noun phrase ... and a verb phrase ... The predicate V[erb] P[hrase] ... can be further analyzed as being made up of a verb ... and another V[erb] P[hrase] ... [and so on] (ibid.: 10)
I find claims like these about what 'any fluent speaker' can introspect quite surprising. I can make such judgements myself, but then I was explicitly taught to make them during the years when I received a traditional grammar school education in the 1950s. My students in recent years, who are native speakers as much as I am but who have been exposed to different styles of education, frequently show every outward sign of being completely at sea when invited to formulate similar analyses. But Langendoen recognizes that speakers often need help to make their intuitions explicit:

While it is correct to say that a person 'knows' the bases of these judgments, since they can be elicited from him by questioning, he knows them only subconsciously, and if they are pointed out to him, he may evince considerable surprise. (ibid.: 4)
A sceptic might see this picture of speakers who 'know' facts about language structure that come as surprises when they are pointed out by an expert questioner as illustrating people's willingness to go along with pronouncements by authority figures (such as university teachers are, for their students), and as having little to do with objective scientific data. But Langendoen's discussion was no isolated expression of initial overenthusiasm for a novel scientific approach. Here are remarks on the same topic from a more recent, British introduction to linguistics:

as a native speaker of the language, [the linguist] is entitled to invent sentences and non-sentences in helping him to formulate and test his hypotheses. These abilities are usually referred to as linguistic intuitions and are important in that they form an essential part of the data-base of a Chomskyan approach to linguistics ... (Atkinson, Kilby and Roca 1988: 38)
Martin Atkinson and his co-authors go on to note that 'there are those who believe that [the use of such data] undermines the scientific status of Chomskyan linguistics', but they make it clear that they do not sympathize with this fear. Like Langendoen, they point out that 'once we have allowed the use of linguistic intuitions into our methodology there is no need to stop at intuitions of well- and ill-formedness' (p. 39) - native speakers have intuitions about grammatical structure, and these too are part of the linguist's data.

Not all linguistics textbooks nowadays spell out the idea that language description is based on speakers' intuitions, but they hardly need to: the point is taught by example more effectively than by explicit statement. The literature of generative linguistics is full of arguments based on grammaticality judgements about example sentences, which are manifestly derived from the writers' introspections (even though the examples are often highly complex, so that one might think it difficult to formulate clear 'intuitions' about them). The pages of a journal such as Linguistic Inquiry scarcely ever give references to objective grammatical data, such as examples recorded from real-life usage at identified dates and places.

The general scientific objections to this way of proceeding are so strong and so obvious that one naturally asks why linguists in particular should have thought it appropriate to base their theorizing on 'invented', 'intuitive' data. Note, incidentally, that when Atkinson et al. advocate the use of intuition 'to formulate and test ... hypotheses', the empirical scientist's objection is to the latter only, not to the former. We do not care how a scientist dreams up the hypotheses he puts forward in the attempt to account for the facts - he will usually need to use imagination in formulating hypotheses, they will not emerge mechanically from scanning the data. What is crucial is that any hypothesis which is challenged should be tested against interpersonally observable, objective data. Once allow the scientist to invent the facts he uses for hypothesis testing, and the door is wide open to self-fulfilling predictions. It is asking the impossible to expect a linguist's opinions about what people do and do not say or write in particular cases to be unaffected by his ideas about the general structure of his language. If intuitive data about structure are admitted, then things become hopelessly circular: the linguist is using structural opinions to 'test' structural hypotheses.

One cannot imagine professional meteorologists, say, or medical researchers taking seriously the suggestion that theories in their domains should be based on people's intuitive beliefs about weather forecasting or about the causes and nature of maladies: people have plenty of such beliefs, but many of them are incorrect. Scientific meteorology and medicine are supposed to advance beyond the untested beliefs of Everyman to establish sounder, fuller theories in their respective domains. The scientific theories may turn out to confirm that some of the folk ideas were correct, but they will surely show that others were mistaken.

One might imagine that with language it would be different for some reason. But it is not. We have seen in earlier chapters that linguistic beliefs
which have been confidently retailed by writer after writer simply collapse when tested objectively. Not all native-speaker intuitions are wrong, by any means. I guess that most English speakers, if asked the grammatical status of the string John loves Mary, would say that it was a good English sentence - and I surmise that they would be correct. But, if the point were challenged, the respectable response would be to look for hard evidence, not to assert the clarity of one's intuitive judgement.1

One of the rare dissenters from the intuitive-data orthodoxy, William Labov, commented in 1975 that people commonly thought of the development of linguistics over the previous fifty years as divided into an early period when linguists described language on the basis of objective facts, and a more recent period in which they were concerned with 'explanation of the language faculty through the study of intuitions' (Labov 1975: 78). According to Labov, this was a mistake: in reality, linguistic description had been based on intuition throughout the period, but 'as the wealth and subtlety of linguistic description has increased, intuitive data has been found increasingly faulty' (ibid.). In the early days of the subject, when language descriptions were relatively unsophisticated, the intuitive judgements which were used may mainly have been of the kind 'John loves Mary is good English', which speakers do tend to get right. More recently, linguistics had advanced to the point where hypotheses needed to be tested against subtler facts, where speaker intuition is not a reliable guide.

In 1970, Labov urged that 'linguists cannot continue to produce theory and data at the same time' (Labov 1970: 199). But in the subsequent thirty years, many linguists have gone on doing exactly that. Since Labov wrote, computer technology has made it far easier than it was in 1970 to bring abundant objective evidence to bear on complex linguistic hypotheses. Yet still many linguists believe that their work is fundamentally answerable to intuition rather than objective data. Why?
2 Linguistics as mathematics
Some eminent linguists would answer this by saying that linguistic description is not an empirical science at all. For Jerrold Katz, in particular, human languages are not part of the furniture of the contingent world: they are objects comparable to the set of prime numbers, or to a proof of Pythagoras's theorem. Katz identifies 'the study of grammatical structure in natural language as an a priori discipline like mathematics' (J. J. Katz 1981: 3); a language 'is a timeless, unchangeable, objective structure' (ibid.: 9). The number 23 is prime, and it would always have been a prime number whether or not creatures capable of counting had ever come into being. The properties of mathematical objects are rightly studied through introspection controlled by logical rules; examining objective, interpersonally observable evidence is not very relevant in that domain. And Katz is not (quite) the only linguist who thinks that way: Terence Langendoen, already quoted,
and Paul Postal have expressed enthusiasm for this concept of linguistics (e.g. Langendoen and Postal 1984: vii).

It is easy to agree that empirical evidence is not relevant to aprioristic subjects such as mathematics. But comparing human languages to mathematical objects seems as inappropriate as treating the Marylebone Cricket Club, or my rickety and much-repaired garden shed, as a mathematical object. Even God could not have prevented the number 23 existing and being eternally prime. But the Marylebone Cricket Club, or my garden shed, might very easily never have existed - if the game of cricket had not been invented there would be no MCC; and even in a world containing these things, their detailed properties could very easily have been other than they are. No one could have foreseen at its foundation that the MCC would acquire a national governing role, or forfeit it later to the Test and County Cricket Board; no one could predict that this particular window of the shed would break and be replaced with plywood, or just that floorboard would rot.

A language is much more like these things than it is like a prime number. English, as a separate, identifiable language, was brought into being by a few thousand settlers in eastern Britain in the centuries after the Romans withdrew, and it was shaped by a myriad subsequent contingencies into the rich, sophisticated international communication medium it is today. If the Roman Empire had lasted longer, or the post-Roman Britons had more successfully resisted invasion, there would never have been an English language. If Celtic rather than Roman Christianity had prevailed at Whitby in 664, English might have been less influenced than it has been (in grammar as well as vocabulary) by Latin usage; if the Normans had been defeated at Hastings in 1066, English would have been far less influenced than it has been by French. Whether particular structural innovations at the present day - say, the replacement of differentiated tag questions such as will you, doesn't she, by innit as an all-purpose tag like German nicht wahr - catch on as part of the language, or fade away, depends on numerous contingencies, such as the current social standing of particular age groups or recent large-scale immigration by speakers of languages lacking differentiated tags. A language is not at all a 'timeless, unchangeable' structure.

3 No negative evidence

The idea of linguistics as a branch of mathematics, though, is a minority view even within the generative camp. Most generative linguists do see human languages as contingent entities - as things that might have been other than they are. Nevertheless, they believe that this particular class of entities has to be studied via intuitive rather than objective evidence. If one sets aside, as (let us hope) irrelevant, the fact that it takes much less effort to draw one's data from one's own mind than to examine objective evidence, then the most important reason why so many linguists have accepted the intuition-based approach to linguistics seems to be an issue about negative
evidence. The grammar of a language predicts that some arrangements of words are possible in the language, and it predicts that other word-sequences are not possible. Speakers have intuitive feelings of both kinds: 'I could say this', 'I couldn't say that'. But objective evidence about real-life usage is 'one-sided'. We can observe people in speech or writing using some of the many arrangements of words which are possible in their language. But there is no corresponding type of observation which explicitly shows that some particular arrangement of words is impossible in the language - if it actually occurs, then presumably it is possible. Most English speakers would probably agree that, say, the word-string I'm glad to meet you is a good example of English, whereas the string Of of the of is a bad example. But, although we can observe people uttering the former string, we cannot observe people 'not-uttering' the latter. Yet mere failure to observe an individual word-sequence certainly does not show that it is ungrammatical - nobody could ever hope to observe uses of all the grammatical sentences of English, one only ever encounters a sample. So, if grammars are based on observation and not introspection, what motive could a linguist have for designing a grammar to exclude Of of the of as ungrammatical? Many linguists take this asymmetry to prove that grammars simply cannot be based on objective data: we need negative as well as positive evidence, and only introspection can provide negative evidence.

To see how wrongheaded this conclusion is, think what the analogous argument would sound like in some more solidly scientific domain of enquiry - say, the theory of gravity in physics. The theory of gravity predicts (among other things) that some kinds of motion are possible and others are not: for instance, a scenario in which an apple is released near the surface of the Earth and accelerates towards it, hitting it two seconds later at a speed of 64 feet per second, is possible; a scenario in which the same apple, after release, drifts away from the Earth at constant speed is ruled out by the theory. A physicist might say, with truth, 'We can observe various cases where physical objects obey the law of gravity, but we can never make "negative observations" of objects disobeying the law.' The laws of physics are absolutely binding, they are not like the laws of the land which people may choose to disobey. Nobody would take a physicist very seriously, though, if he went on to say 'Lack of negative evidence makes it impossible to found physics on observational data; instead, the theory of gravity has to be based on our intuitions about which kinds of motion can happen and which cannot.'

The well established hard sciences, such as physics, are based exclusively on positive evidence, although they make negative as well as positive predictions. The logic of how this can be so is perfectly well understood by those who are interested in scientific method, and there is nothing special about language which makes that logic inapplicable to our subject. As an argument against founding linguistics on observational data, the negative evidence issue really does not get off the ground.

Before looking at why absence of negative evidence is not a problem for empirical linguistics, it will be worth showing that I am not attacking a
straw man, by quoting some other writers who have supposed that it is a problem. The classic statement of this view was by C. L. Baker in 1979, who noted that 'every human being who achieves fluency in his language succeeds in becoming able to distinguish between well-formed and ill-formed sentences without making use of a significant range of examples from the latter class', and went on to claim that 'the acquisition problem owes a great deal of its difficulty to this lack of negative information in the primary data' (Baker 1979: 537). Baker used the point to support the idea that we inherit innate knowledge of language structure: according to Baker, the postulate of a rich innate Universal Grammar is needed in order to explain how children master their parents' language without access to negative evidence.

Not all linguists who see absence of negative evidence as a problem draw the same conclusions as Baker, though many do. Gary Marcus (1993) listed 22 publications (including Baker's 1979 paper), beginning in 1971 and continuing down to 1992, which tried to deal with the problem of language-acquisition in the absence of negative evidence, but he pointed out that the earliest of these publications, Braine (1971), used it to argue against nativism.

The issue of whether children need innate knowledge to acquire their mother tongue without negative evidence is distinct from the question whether adult linguists need to base scientific language descriptions on introspective judgements for lack of objective negative evidence, and the latter is what this book is concerned with. Even linguists who are not wedded to a nativist conception of first-language acquisition commonly believe that adult scientific linguists are bound to use introspective data. Carson Schütze, in a study of this issue which is a good deal subtler than some linguists have produced, begins by describing 'grammaticality judgments and other sorts of linguistic intuition' as 'indispensable forms of data for linguistic theory'; he gives four reasons for saying this, two of which boil down to lack of negative evidence (Schütze 1996: 1-2). In a review of Schütze's book, Benji Wald (1998) agrees with Schütze about the indispensability of introspective data.

Of course, so long as we are thinking of the problem confronting a child acquiring its mother tongue, one might wonder whether negative evidence is completely absent. Many people have suggested that, even though parents will not normally produce 'starred sentences' for their children's consideration - they will not utter ungrammatical word-sequences together with an indication that they are ungrammatical - their spontaneous reactions to their children's linguistic errors might amount to 'telling' the child that the forms in question are erroneous. But whenever this suggestion has been carefully investigated, it has failed to stand up. A 1970 study by Roger Brown and Camille Hanlon was summarized by Gropen et al. (1989: 203) as showing that 'children are neither corrected nor miscomprehended more often when they speak ungrammatically'. Marcus (1993) examined a series of more recent studies, and concluded that it is not plausible that any limited feedback which parents may possibly supply in response to some linguistic
errors by their children could be crucial to the children's language-acquisition achievement.

In the case of the adult scientist consciously formulating a language description, it is even clearer that the only available observational evidence is positive evidence. The analogue, in this domain, of the hypothetical parent who boggles visibly when his child produces an impossible word-sequence would be the possibility that speakers of the language under study might tell a scientific linguist formulating a grammar that particular strings were ungrammatical, rather than leaving him to infer this from the fact of never encountering such strings in naturalistic usage. Linguists who believe in the value of introspective data do routinely elicit such judgements, particularly when studying languages other than their own; but this does not mean that they are tapping a source of objective negative evidence - they are merely drawing on other individuals' introspective judgements rather than their own. If intuitive data are not suitable as a basis for scientific theorizing, there is little reason to think that laymen's intuitions about their language are any more usable than professional linguists' intuitions. (There is evidence that linguists' intuitions tend to correspond more closely than laymen's intuitions to the objective facts about a language: Snow and Meijer 1977.)

4 Refutability
The key to understanding why absence of negative evidence is no problem for empirical linguistics is the principle classically expounded by Sir Karl Popper: the essence of science is refutability. A scientific theory must state that certain logically conceivable situations are scientifically impossible, so that the theory can be refuted if those situations are observed. We cannot refute a theory merely by failing to observe something which it treats as possible - perhaps we just have not looked far enough; but we refute it at once if we observe something which it treats as impossible. The best theory will be the strongest, that is the theory which rules out the largest number of possibilities and is thus most vulnerable to refutation. Refutability is desirable, because only by being refutable in principle (while not refuted in practice) does a theory tell us anything about the world. A theory which permitted all possibilities would be safe from refutation, but would tell us nothing.

The point can be illustrated by a simple example. The statement 'If one drops an apple it either moves or remains still' rules out no possibility and is thus worthless. 'If one drops an apple it moves downwards' is better, because it rules out the case of an apple moving in other directions or remaining motionless in mid-air. 'If one drops an apple it accelerates downwards at 32 feet per second per second' is better still, since it rules out everything forbidden by the previous statement together with cases of apples falling at other rates of acceleration.

Like any other scientific theory, a grammar of English is required to be as strong as possible. If we refute a grammar by hearing an utterance which the
grammar forbids, then a 'strong' grammar is one that permits few utterances. In other words, we want a grammar of English that defines the narrowest possible range of strings as grammatical, providing it permits everything we actually hear; and this gives us a motive for excluding Of of the of independently of our intuitions.

Melissa Bowerman (1988: 77) points out that a number of linguists who have discussed the negative-evidence problem as it affects the child acquiring its mother tongue have postulated an innate principle, the 'Subset Principle', that amounts to Popper's principle of maximizing theory-strength: 'children must first hypothesize the narrowest possible grammar compatible with the evidence observed so far'. But the linguists quoted by Melissa Bowerman take this Subset Principle to be part of a rich structure of genetically inherited mechanisms specific to the task of language-acquisition, as postulated by writers such as Steven Pinker. That suggestion is redundant: the most general, empiricist concept of knowledge-acquisition available must credit human beings with a tendency to apply abstract Popperian principles such as preferring strong to weak theories in their attempts to make sense of the world, and the Subset Principle for grammar-choice will be merely one particular application of that principle. Children look for strong theories in trying individually to understand the world they are born into, and adult scientists apply the same principle in their attempts to push collective human knowledge further.2

However, in solving one problem we may appear to have created a new one. If our aim is to construct the strongest possible grammar permitting the strings we have observed, will our best move not be simply to list the strings we have heard and to say that they and only they are grammatical in English? However many data we have collected, we will never have observed every possibility. To take a simple if banal example, perhaps we have heard utterances of Boys like girls, Boys like pets, Girls like boys, but we happen not to have heard, say, Girls like pets. Any English-speaker knows that the latter sentence is fully as grammatical as the other three, but what motive can we have for constructing our grammar accordingly? A grammar which permits the string Girls like pets is that much less strong than one which forbids it; apparently we are justified in treating Girls like pets as grammatical only if we require the grammar to permit all the sentences which the native speaker 'knows' to be grammatical.

Not so. There is a general principle of scientific methodology which bids us choose simple theories; one can almost define 'science' as the process of reducing apparently complex phenomena to simple patterns. Consider, for instance, a physicist investigating the relationship between two measurable quantities (elapsed time and temperature in some physical process, or the like). He takes several measurements and plots the results on a graph, and finds that the points lie almost but not exactly on a straight line. Now the physicist will draw a straight line passing as near as possible to the various points, and will adopt a theory which states that the quantities are related by the simple equation corresponding to the line
he has drawn. He will explain away the slight deviations of the points from the line as due to inaccuracies in the experimental situation — perhaps his clock or his thermometer are slightly imperfect. The physicist is not forced to do this. Mathematically, it is always possible to find a complex equation defining a curve which passes through the points exactly; so the physicist might choose a theory embodying that complex equation rather than the linear equation. Such a theory would actually be stronger than the 'linear' theory. The complex theory allows the physicist to predict that future points will lie exactly on the curve; the linear theory permits him only to predict that they will fall near the line (since previous observations were close to rather than exactly on it). But, provided the points fall fairly near the line, the physicist will invariably regard the simplicity of the linear theory as outweighing the slightly greater strength of the complex theory.

Now let us return to the sentences about boys, girls, and pets. We can describe the sentences we intuitively feel to be grammatical by saying something like 'A sequence of noun, -s, transitive verb, noun, -s is grammatical; boy, girl, and pet are nouns; like is a transitive verb.' To rule out Girls like pets, which we have not observed, we must insert a further clause after the word 'grammatical', say, 'except when the first noun, verb, and second noun are respectively girl, like, and pet'. To add this clause to the grammar is a considerable extra complication, which would be justified only if it made a grammar very much stronger; since it affects only one string, the loss of simplicity is too great to tolerate.

There is nothing mysterious about this notion of 'simplicity'. When we say that we want linguistic descriptions to be as simple as possible, we are using the word in a perfectly everyday sense; one does not need to be a philosopher to agree that one body of statements is simpler than another, if the second contains all the clauses of the first together with an extra one. To say precisely what makes a scientific theory simple, and just how we are to trade theoretical simplicity and theoretical strength off against one another, is a very difficult problem; but it is not one that the linguist need concern himself with, since it is a general problem of the philosophy of science. (For the notion of 'simplicity' in science, see N. Goodman (1961), Hempel (1966, §4.4), Rudner (1966, §2.9).) For the practising scientist, whether physicist or linguist, it is sufficient to trust his intuitive judgement to tell him how to balance strength and simplicity. This is where intuition is admissible in science: in deciding how best to account for the facts, not in deciding what the facts are. Chomsky (e.g. 1965: 38 ff.) made confusing and contradictory remarks to the effect that a priori notions of 'simplicity' or 'elegance' are not relevant to the choice between scientific theories, and that the only concept of 'simplicity' relevant in linguistics will be a concept to emerge from empirical linguistic research.3 But Chomsky was mistaken. The standard sciences depend crucially on an a priori notion of simplicity of theories, and the same is undoubtedly true for linguistics.
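To make the trade-off concrete, here is a minimal sketch in Python of the two competing grammars for the boys/girls/pets data discussed above. The vocabulary, the 'observed' corpus and the function names are my own invented illustration, not a serious fragment of English grammar.

# A toy comparison of the two grammars discussed above. The vocabulary and
# the 'observed' corpus are invented for the illustration.
NOUNS = {"boy", "girl", "pet"}
TRANSITIVE_VERBS = {"like"}
OBSERVED = {"boys like girls", "boys like pets", "girls like boys"}

def grammatical_by_pattern(string):
    """General grammar: noun+s, transitive verb, noun+s."""
    words = string.split()
    if len(words) != 3:
        return False
    subject, verb, obj = words
    return (subject.endswith("s") and subject[:-1] in NOUNS
            and verb in TRANSITIVE_VERBS
            and obj.endswith("s") and obj[:-1] in NOUNS)

def grammatical_by_listing(string):
    """Maximally 'strong' grammar: only the observed strings are permitted."""
    return string in OBSERVED

# The listing grammar is stronger, but pays for it with one clause per string;
# the pattern grammar is simpler and generalizes to the unobserved case.
print(grammatical_by_pattern("girls like pets"))   # True
print(grammatical_by_listing("girls like pets"))   # False

The listing grammar is the stronger of the two, but every unobserved string it excludes costs it an extra clause; the pattern grammar buys its simplicity by generalizing to Girls like pets.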
5 Interdependence of theories

So far, then, we have seen two cases where the data of speech seemed to be inadequate as evidence for the grammar of English, so that English-speakers' intuitions had to be used instead; and in each case it turns out that, when we take into account standard methodological considerations about what makes one theory better than another, the empirical evidence is quite adequate after all.

However, linguists who defend the use of intuitive data have a stronger argument. So far we have assumed that the grammar must permit everything which is actually observed, and we have seen that methodological considerations justify us in treating some of the strings that we have not observed as grammatical, others as ungrammatical. But my critics may point out that all linguists (myself included) also treat as ungrammatical some strings which they have observed; in other words, sometimes we ignore refutations. Here the criteria of strength and simplicity seem irrelevant, since we want theories to be strong and simple only so long as they are not refuted (otherwise no scientist would ever need to look at any evidence at all).

A particularly straightforward example would be, say, someone who spots a saucepan boiling over and breaks off in mid-sentence to attend to it. She was going to say If the phone rings, could you answer it?, but she only gets as far as If the phone. One does not usually think of If the phone as an English sentence, and a linguist's grammar of English will not treat it as grammatical; but why not, since it was after all uttered? We cannot assume that every utterance which is immediately followed by a flustered grab at an overflowing saucepan is ungrammatical; sometimes, a speaker will notice an emergency just as he reaches what would in any case have been the end of his sentence. Again, the orthodox linguist answers that, if the evidence for a grammar is behaviour, then we have no grounds for excluding from the set of grammatical strings anything that is uttered; if we want to call If the phone 'ungrammatical', then we must base our grammar on our 'intuitive knowledge' that If the phone is incomplete.

But, again, this attitude represents a misunderstanding of the nature of science. The misunderstanding in this case has to do with the way in which the various branches of knowledge are mutually interdependent. Our grammar says that If the phone is ungrammatical, and thereby predicts that, other things being equal, the sequence If the phone will not be uttered. But other things are not equal. Our theories confront reality in a body, not one by one; each individual branch of knowledge makes predictions about observable facts only when we take into account other relevant pieces of knowledge. Now, quite apart from linguistics, we know as a fact about human behaviour that we often interrupt one task when a higher-priority task intervenes. Given this knowledge, we can predict that we will sometimes hear strings which constitute the beginnings of grammatical sentences without their ends. As long as the grammar treats If the phone rings, could you answer it? as grammatical, we do
not care whether it also treats If the phone as grammatical. Given the longer string, we can predict, from non-linguistic (but perfectly empirical) knowledge, that there are circumstances in which we will hear the shorter string, whether or not it is grammatical. And, if it does not matter whether the grammar does or does not permit If the phone, then it will be simpler not to permit it.

This case is particularly clear, because we could see the boiling saucepan. Sometimes we shall want to discount uttered strings as 'incomplete' or as 'mistaken', without being able to point to an external factor disturbing the speaker's behaviour. But this is all right. It is a matter of empirical, non-linguistic observation that people sometimes interrupt what they are doing for no apparent reason, and that people make mistakes when performing complex tasks. We do not need linguistics to tell us that the same will be true of speaking. If an Englishman utters a string which we can treat as grammatical only at the cost of greatly complicating a grammar that accounts well for other data, then we are free to say that he has made a mistake as long as we do not use this escape route too often. Of course, we predict that this deviant behaviour will be rarer in writing or in formal public speaking than in informal chat, just as more false notes are played at rehearsals than in the concert itself.

There will also be cases where the grammar permits strings which we know will never be uttered. One of the rules of a standard English grammar, for instance, says that a noun may have a relative clause in apposition to it; and the definition of 'relative clause' implies that a relative clause may contain nouns. In this respect the standard grammar of English is recursive: main clauses have subordinate clauses upon their backs to bite 'em, and subordinate clauses have sub-subordinate clauses, if not ad infinitum, then at least ad libitum. For instance: This is the dog, that chased the cat, that killed the rat, that ate the malt, that lay in the house that Jack built. But, although the grammar imposes no limits to this process, there certainly is a limit in practice: the nursery rhyme about the house that Jack built is already something of a tour de force. That rhyme, when it reaches its climax, has about a dozen levels of subordination; we would surely be safe in predicting that one will never get more than, say, twenty levels. So, is our standard grammar, which fails to distinguish between strings with two levels of subordination and strings with two hundred levels, not intolerably weak? Should we not either add a rule limiting the process of subordination, or else recognize that the standard grammar is justified because we intuitively know that sentences with many levels of subordination remain 'grammatical' (even if they are not 'acceptable' in practice)?

Once more, no. It is again part of our general knowledge, quite independent of linguistics but perfectly empirical, that behaviour patterns are less and less likely to be executed as they become more and more long and complex. Many people hum the latest pop tune, few try to hum a Bach fugue. From this knowledge we predict that relatively long strings are relatively improbable in practice. To incorporate a rule to this effect in the grammar of English would be simply to duplicate within linguistics this piece of extra-linguistic knowledge. There is no reason to do so. It is quite difficult enough to construct linguistic theories which do the tasks that only they can do — for instance that of distinguishing between the child seems sleepy and the child seems sleeping — without taking on extra work unnecessarily.
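As a toy illustration of the recursive relative-clause pattern just discussed, the following sketch (my own, with an invented miniature vocabulary) builds 'house that Jack built'-style noun phrases to any depth; nothing in the rule itself supplies the practical cut-off.

# A toy rendering of the recursive relative-clause pattern. The nouns and
# verbs below are an invented miniature vocabulary; the point is only that
# the rule 'NP -> noun (that VP NP)' can reapply without any built-in limit.
PAIRS = [("dog", "chased"), ("cat", "killed"), ("rat", "ate"), ("malt", "lay in")]

def noun_phrase(depth):
    """An NP containing `depth` levels of relative-clause subordination."""
    if depth == 0:
        return "the house that Jack built"
    noun, verb = PAIRS[(len(PAIRS) - depth) % len(PAIRS)]
    return f"the {noun} that {verb} {noun_phrase(depth - 1)}"

print("This is " + noun_phrase(4))
# The grammar is equally happy with two hundred levels of subordination,
# although no one would ever utter the result:
print(len(noun_phrase(200).split()), "words")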
We can sum up the situation I have been describing in a diagram (see Figure 8.1). Here, the large circle represents all possible strings of English words. The circle X represents the set of strings which some grammar of English defines as grammatical. But the grammar is only one of the factors which jointly determine what is actually uttered: Y represents the set of strings which are predicted to be possible when we take into account both the grammar and all the knowledge that we have apart from linguistics. Since they overlap partially, X and Y divide the complete set of strings into four subclasses: A, B, C, and D. Strings in B are both grammatical and utterable: for instance This is the dog that chased the cat. Strings in A are grammatical but will not be uttered in practice: for instance a sentence like the 'house that Jack built' one, but with two hundred rather than a dozen levels of subordination. Strings in C are ungrammatical, but may be uttered: for instance If the phone. Finally, strings in D are neither grammatical nor will they be observed: for instance Of of the of.

Figure 8.1

The border of Y is drawn 'fuzzy', because the non-linguistic factors that help settle whether a string is likely to be uttered are many and various, and many of them are intrinsically 'gradient' rather than yes-or-no factors. For instance, Of of the of is not likely to occur as a mistake in place of some grammatical sentence (as one might say The bus we made all the trips in are outside as a careless slip for . . . is outside), and accordingly I assigned Of of the of to set D. But Of of the of becomes quite a likely utterance in the context of a book or lecture on linguistics, and accordingly it should perhaps be assigned to set C. (Some strings really will be in D, of course. For instance, the string consisting of the word of repeated one thousand times will not occur in full even in a book on linguistics.) Although grammatical sentences become more and more implausible as their length increases, there is no sharp cut-off point below which they are possible but above which they are impossible.
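The partition shown in Figure 8.1 can be rendered schematically as two predicates and their four combinations. Both predicates below are crude invented stand-ins for the purposes of the illustration; neither is a real grammar nor a real model of utterance probability.

# A schematic rendering of Figure 8.1. One stand-in predicate for
# 'permitted by the grammar' (the set X), one for 'predicted utterable
# once extra-linguistic knowledge is added in' (the set Y); their four
# combinations give the regions A, B, C and D.
def grammatical(s):
    return s in {"This is the dog that chased the cat",
                 "a two-hundred-level 'house that Jack built' sentence"}

def plausibly_uttered(s):
    return s in {"This is the dog that chased the cat", "If the phone"}

def region(s):
    g, u = grammatical(s), plausibly_uttered(s)
    if g and u:
        return "B (grammatical, utterable)"
    if g:
        return "A (grammatical, never uttered)"
    if u:
        return "C (ungrammatical, may be uttered)"
    return "D (neither)"

for s in ["This is the dog that chased the cat",
          "a two-hundred-level 'house that Jack built' sentence",
          "If the phone", "Of of the of"]:
    print(region(s), "-", s)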
6 The irrelevance of intuition

The foregoing has shown that generative linguists are unjustified in claiming that linguistics rests on introspective evidence. The sorts of considerations which make generative linguists think that linguistics needs intuitive data can in fact be explained perfectly well on the basis of exclusively behavioural data, provided we impose on our grammars quite standard methodological requirements of strength and simplicity, and provided we realize that the predictions made by our grammars are affected by the rest of our knowledge. We do not need to use intuition in justifying our grammars, and, as scientists, we must not use intuition in this way.

This is not to deny that we have intuitive knowledge, if one wants to call it that, about our own language or about language in general. I am concerned only to emphasize that the intuitions we have must not and need not be used to justify linguistic theories. Certainly we have opinions about language before we start doing linguistics, and in many cases linguistics will only confirm these pre-scientific opinions. But then we have opinions about very many subjects: for instance, one does not need to be a professional meteorologist to believe that red sky at night means fine weather next day. In some cases our pre-scientific opinions about language come from what we are taught in English lessons, or lessons on other languages, at school (the division of our vocabulary into parts of speech, for instance); in other cases they are worked out by individuals independently (as when a non-linguist judges that Of of the of is not English and that Three zebras stayed behind is English, for instance). Our pre-scientific opinions, both about the weather and about English, may well be right; but it is the job of meteorology and linguistics to find out whether they are right or wrong, to explain why they are right if they are right, and to show where they are wrong if they are wrong.
What we are taught about English at school, and what we work out for ourselves, is rudimentary linguistics, just as the proverb about red sky is rudimentary meteorology; but it is the job of a science to attempt to improve on the rudimentary theories of Everyman, not simply to restate them in novel jargon. One may object that the analogy between linguistics and meteorology is unfair. If there are links between red evening sky and future fine weather, these links have to do with the physics of air, water vapour, and the like, and are quite independent of human beings: any opinion that a human has about meteorology is one he has formed solely as an observer. On the other hand, whether such and such a string of words is English or not depends on our own behaviour as native speakers of English. So it may be reasonable to suggest that we may have 'infallible intuitions' about what is or is not English, in a way that our opinions about the weather clearly cannot be infallible. As I put it figuratively in Chapter 1, the part of our brain which makes conscious judgements about the English language could perhaps have a 'hot line' to the part of our brain which controls our actual speaking, so that we know what we can and cannot say in English in the same direct, 'incorrigible' way that, say, I know I have toothache. This might be so, and it would be very convenient for linguists if it were. The very fact that we can ask the question shows that behaviour is the ultimate criterion: to decide whether we have such a 'hot line', we have to construct a description of English based on behaviour, and then see whether it coincides with our 'intuitive' opinions about English. And in fact the empirical evidence is negative: it is easy to show that people believe quite sincerely that they say things which they never do say, and, more strikingly, that they never say things which they in fact frequently say. Even with respect to the basic issue 'Are such and such strings grammatical for me?', though people's opinions tend to be fairly accurate, they are very far from infallible (cf. Fillmore (1972), Householder (1973), Labov (1975: §2.3)). As for subtler intuitive abilities which have been attributed to the native speaker, such as the ability to parse his sentences, linguists who state that this is an innate endowment must surely have forgotten the struggles their less language-minded classmates went through in the English lessons of their childhood. Someone wishing to defend the use of intuition may object that all linguists, including those who worked in the pre-generative period and who prided themselves on their empirical approach, have in fact relied heavily on intuition in formulating their grammars. The descriptive linguists of the mid-twentieth century sometimes took great pains to gather an objective corpus of data before commencing analysis, even when investigating their native language — for instance, Charles Fries's grammar of American English (Fries 1952) was based on a collection of bugged telephone conversations; but even Fries could not move from data to grammar without using intuitive 'guesses' (if you will) about the status of strings not in his corpus. And I freely admit that I myself, though a believer in empirical techniques,
have often published analyses of points of English syntax based exclusively on my intuitions about my mother tongue.

This objection again misses its mark through failure to appreciate how science works. We do not care where a scientist gets his theory from, only how he defends it against criticism (cf. Popper 1968: 31). Any scientific theory is sure to incorporate many untested assumptions, guesses, and intuitive hunches of its creator. All that matters is that any feature of the theory which is doubted can be confirmed or refuted on empirical grounds. It seems to be true in the case of language that people's pre-scientific intuitions tend to be more reliable on the question of grammaticality of individual strings than on the structure of grammars. That being so, it is a sensible research strategy for a linguist to assume that his opinions about what is or is not English are correct, and to use these grammaticality judgements as evidence for or against grammars of English. But, should his grammaticality judgements about individual strings be challenged, the thing to do is to see whether English speakers actually utter strings like that — not to quarrel about whose intuitions are clearest. What one must never do is to say: 'I intuit that the string is grammatical/ungrammatical in my kind of English; and my idiolect is an attested language too, so the theory of language must be able to handle my idiolect even if all other English speakers speak differently.' Short of following the author of such a comment round with a tape-recorder for a few months, there is simply no way of checking his claim. If what he claims is awkward for one's general theory of language, it is more sensible to reject his claim for lack of evidence than to change one's theory of language. Consequently, it is better to choose the speech of nations rather than that of individuals as the subject of linguistics; it is easy to check a claim about English, but hard to check a claim about Sampsonese.

7 Nonsensicality versus ungrammaticality

Reliance on introspective data has undoubtedly led to many distortions in linguists' descriptions of languages. One respect in which this is particularly obvious concerns the treatment of contradictory or nonsensical sentences. In his first book, Syntactic Structures (Chomsky 1957), Chomsky argued that a grammar should distinguish between grammatical and ungrammatical strings, but should not distinguish, among the grammatical strings, between sensical and nonsensical strings. Strings like The books is on the table or Of of the of should be forbidden by a grammar of English, but Sincerity is triangular, like Sincerity is admirable, should be permitted even though no one ever says Sincerity is triangular. The principle is by now familiar: it is not for linguistics to tell us things that we know independently of linguistics. If someone asks why Englishmen never utter the string Sincerity is triangular, one can reply 'Sincerity is a character trait and as such has no shape.' This is a statement about sincerity and triangularity, not about words, but it implies that there will be no point
in uttering Sincerity is triangular: we do not also need the grammar of English to tell us that the sentence is odd. On the other hand, if one asks what is wrong with The books is on the table, the reply has to be along the lines: 'The verb does not agree with the subject'; in other words, it has to use linguistic terminology, and no non-linguistic knowledge will rule this string out. Similarly, what is wrong with Of of the of is that it 'simply is not English' — again we have to refer to a linguistic notion, namely 'English'.

If the linguist relies on his intuition to tell him what his grammar should permit, then we can understand that he may decide to rule out Sincerity is triangular along with The books is on the table. Our intuition does not seem particularly sensitive to the distinction between nonsensicality and ungrammaticality, but simply registers a general feeling of 'oddity'. Here is a case where intuition lets the linguist down; if linguistics is an empirical science, we have excellent reasons to distinguish the two kinds of 'oddity'.

To make clearer the idea that sentences like Sincerity is triangular should be treated as 'good' sentences by our grammar, let me give an analogy. Suppose the Highway Code were revised so that, in order to indicate one's intentions at a crossroads, one simply pointed in the appropriate direction. (This might be sensible if we drove horse-drawn carriages rather than sitting behind windscreens.) Now the revised Highway Code would not need explicitly to exclude the gesture of pointing up in the air, or pointing left and right with both hands simultaneously. As a signal of one's intentions, the former is patently false, and the latter contradictory; this is a quite adequate explanation of why drivers do not use these signs. Indeed, it is only because these signs do fit the system as defined that we can recognize them to be respectively false and contradictory. A sign outside the defined system (say, folding the arms) is not 'false' or 'contradictory' but simply 'not in the Code'. Similarly, it is only because it is a sentence of English that we can recognize Sincerity is triangular to be contradictory. We cannot call Of of the of 'false' or 'contradictory': it is not an English sentence, so the question of its meaning does not arise.

In Syntactic Structures, as I said, Chomsky recognized this point. Unfortunately, by the time he published Aspects of the Theory of Syntax eight years later (Chomsky 1965), Chomsky had changed his mind. Chapter 2 of Aspects is largely concerned with the problem of how to reorganize linguistic theory so as to allow grammars to exclude nonsensical as well as ungrammatical strings. Chomsky does not present this as a change of mind. Rather, he claims that, while a grammar should permit genuinely contradictory but grammatical strings such as Both of John's parents are married to aunts of mine, the oddity of strings such as The book dispersed (to use one of his examples) is a fact about the English language and must therefore be stated in the grammar of English. But this distinction is unfounded. The book dispersed is an odd thing to say because only aggregates which are not physically linked can disperse, whereas a book is a single continuous physical object: this is a statement in the 'material mode' of speech, referring to books and to dispersing but not to English. One can recast it in the 'formal mode' by saying, 'The verb disperse
cannot be predicated of the noun book in the singular', but it is not necessary to do so. (The oddity of The books is on the table, by contrast, can be explained only in the formal mode.) The only difference between the oddity of The book dispersed and that of Both of John's parents are married to aunts of mine is that it takes a slightly longer chain of reasoning to spell out the contradiction in the latter case. One group of linguists who understood this point well was Richard Montague and his followers. For instance, Richmond Thomason, a Montague grammarian, suggests (Thomason 1976: 83) that a string such as John lends Mary that a woman finds her father is grammatical in English. The quoted string is very bizarre; but that is because a woman's finding her father is a fact, and one cannot lend a person a fact, although one can tell a person a fact or lend him a book. This is a truth about facts and lending, not about English. To treat strings like this as grammatical greatly simplifies English grammar: this string is syntactically quite parallel to John tells Mary that a woman finds her father, so that it would be relatively awkward to permit the latter while ruling out the former. Unfortunately, this way of thinking about grammaticality never made much headway among mainstream generative linguists. Recent standard textbooks, for instance Radford (1988: 369 ff.), Fromkin and Rodman (1998: 184-5), and Ouhalla (1999: 46-9), continue to assert that nonsensical sentences such as The boy frightens sincerity represent 'violation[s]' of 'rules of language' (quoted from Fromkin and Rodman). Consequently, the generative approach to linguistic description continues to be weighed down with complex mechanisms designed to mirror generative linguists' intuitions about what strings are valid or invalid examples of the languages described, which from the viewpoint of empirical science are largely redundant.
8 Conclusion

If the impossibility of observing 'negative evidence' forced scientists to base their theories on the data of intuition, then hard sciences like physics or meteorology would have to be as dependent on intuition as generative linguistics has been. In those domains such a suggestion would seem absurd; and it is equally inappropriate in our domain. Linguistics is not a subject which used to be based on crude observational evidence but has progressed by replacing observational with intuitive data. As William Labov has said, it is a subject which was once sufficiently simple that intuitive data may have been adequate, but which has come to develop theories of such subtlety that objective evidence is crucial for winnowing truth from falsity. What Labov said a quarter-century ago is even truer today. With the widespread availability of large electronic language corpora, there is really no good reason nowadays why linguistic research should be pursued in a less empirical manner than any other science.
Notes

1 It is surprising how wrong people's judgements can be even in simple cases. One of my part-time researchers also develops software for the Department of Health and Social Security. He asked me recently to comment on a disagreement that had arisen with a colleague there, relating to software which generated English wording. The colleague insisted that choice between a and an forms of the indefinite article should be made to depend on the following noun, so as to produce 'correct' sequences such as an good egg rather than sequences such as a good egg, which he regarded as 'incorrect'. My researcher had different intuitions of correctness, but did not feel confident in preferring his own intuitions until they were supported by my professorial authority. One can see roughly what must have happened: the programmer had misunderstood something he had been told or read about the indefinite article rule, and this misunderstanding of an explicit statement had overridden anything he might have noticed about how the article is used in practice. (I am sure that he had never heard himself or anyone else saying an good egg in ordinary, spontaneous usage.) The fact remains that, if linguistic descriptions were answerable to native-speaker intuitions, the facts quoted would imply that a grammar of English which generates a good egg rather than an good egg is wrong.

2 Subsequently to Bowerman's article, Steven Pinker co-authored a paper (Gropen et al. 1989) which comes much closer to explaining the negative-evidence problem in terms of a general tendency to maximize Popperian theory-strength ('conservatism'). See also Brooks and Tomasello (1999).

3 Where more than one grammar is compatible with the data about a given language, according to Chomsky the choice between them will be made by an 'evaluation measure' which can be established empirically as part of the general theory of language, so that there is no room for an a priori simplicity criterion to operate (although Chomsky, following Willard Quine, explicitly stated the opposite in Syntactic Structures, Chomsky 1957: 14). With respect to the further question (whether an a priori simplicity criterion is needed to choose between alternative empirically adequate general theories of language), Chomsky appears to hold (1965: 39) that the principle of maximum refutability will suffice to eliminate all but one such theory, so that again the a priori criterion has no place. (However, p. 38 and p. 39 of Chomsky 1965 contradict each other on the question whether an a priori simplicity criterion applicable to theories of language is even available, let alone needed; subsequent remarks by Chomsky did not make his views any clearer.) There is in reality little doubt that the principle of maximum refutability will be insufficient, either to select a unique theory of language, or to select unique grammars of attested languages given a well supported theory of language; so that a priori judgements of relative simplicity will be needed at both levels of theorizing.

4 I have slightly adapted the wording of Thomason's example in order to make the point clearer.
9
What was Transformational Grammar?
1 A hermetic manuscript
In the previous chapter, we confronted the puzzle of why many linguists since the 1960s have been strangely reluctant to use empirical techniques. One answer, I suggested, lay in a mistaken conviction that language description required a kind of 'negative evidence' which is by its nature unobservable.

But there is another puzzle about the direction linguistics took in the 1960s and 1970s (and whose consequences live on today). Noam Chomsky's fairly unempirical theory of 'Transformational Grammar' found acceptance internationally, as the way to study and define language structure, with surprising rapidity. As Akmajian et al. (1995: 179) put it: 'This general sort of model (including numerous variations) has dominated the field of syntax ever since the publication of Noam Chomsky's 1957 book Syntactic Structures'. Transformational Grammar has continued to enjoy this status for the subsequent four decades; O'Grady (1996: 182) continues to describe it, correctly, as 'the most popular and best known approach to syntactic analysis . . . the usual point of departure for introductions to the study of sentence structure'. Yet, from the word go, it has often been clear that academic linguists were assenting to the proposition that Transformational Grammar was the best available theory of language structure, while lacking any clear grasp of what distinctive claims the theory made about human language.

This is surely not a very usual situation in intellectual history. Normally, one would suppose, understanding precedes acceptance. A Darwin, or a Wegener, proposes evolution by natural selection, or continental drift: at first their audience understand what they are saying (at least in broad outline), but in many cases feel sceptical if not downright hostile, and only after awareness of the theories has been widespread for some considerable time do they, in some cases, win sufficiently many converts to become conventional wisdom. Why was the sequence reversed, in the case of Transformational Grammar? Undoubtedly this question has many answers. One important factor, though, was the 'hermetic' status of one of the most central documents on which the reputation of the theory was based.
Probably most people who read Chomsky, in the years when his approach to linguistics was striving to win acceptance, read only his relatively nontechnical books, which alluded to his theory of Transformational Grammar without spelling it out in detail. It is quite reasonable and necessary for laymen to take such matters on trust, relying on experts to blow the whistle if there is anything wrong with the technical material. However, those linguists who did wish to subject transformational-grammar theory to critical examination (and who had sufficient competence in mathematical and logical techniques to do so) soon ran up against a barrier. The available published expositions, notably Chomsky's first book, Syntactic Structures, contained many gaps that were bridged by means of references to a fundamental work, The Logical Structure of Linguistic Theory (I shall refer to it by the acronym LSLT). Although this book was written in the mid-1950s, before Syntactic Structures, it did not reach print until twenty years later, in 1975. As a result, during the years when Chomsky and his followers were seeking to rewrite the agenda of linguistic research, the only people who could claim real knowledge of his theory were a coterie in and around the Massachusetts Institute of Technology, who by and large were partisans. The possibility of Popperian critical dialogue was eliminated. In other words, during the period when Transformational Grammar achieved dominance as a linguistic theory, its hundreds or thousands of advocates in the universities of the world not only did not but could not really know what they were talking about. Linguists at a distance from the 'inner circle' who might have felt sceptical about some aspect of the theory had no way to check, and by and large what they were unable to assess they took on trust. It is true that copies of the manuscript of LSLT were circulated fairly widely, from an early date. Shortly before the book was finally published, I myself borrowed a copy from a colleague who had acquired it years earlier during a sabbatical at Harvard. But it would be naive to take this as meaning that the book was equally widely read. A densely formal text, such as LSLT is, is daunting enough when properly printed; but the document I borrowed was a reproduction, in the faint purple ink that was usual before the advent of modern photocopying technology in the late 1960s, of three volumes of old-fashioned typescript with many handwritten symbols and annotations. I borrowed it with good intentions, and I believe I am more tolerant than many linguists of mathematical formalisms; but I soon laid it aside, making a mental note that I really must go back to it at some future time which never came. In practice, the fact that people knew these manuscript copies were in circulation functioned to make the theory seem to be publicly testable; it allayed the suspicions which could not have failed to arise if the theory had explicitly been private and unpublished. In this chapter we shall examine the version of LSLT that was eventually published in 1975. We shall find that, in the light of its contents, it is open to question whether there ever was anything clear and definite enough to call a 'theory' of transformational grammar.
But by the latter half of the 1970s, the generativists had won: the agenda of linguistics had been rewritten. Controversies in the field by that period were about the philosophical implications of innate linguistic knowledge for our understanding of human nature, or about alternative models of the relationship between grammar rules and meaning ('generative v. interpretative semantics', see e.g. R. A. Harris 1993). For the overwhelming majority of linguists with any interest in these matters, Chomsky's initial thesis - that phrase-structure grammar was an inadequate model of human language structure and transformational grammar should be preferred - was an old story, and LSLT seemed to be a publication of historical interest only. Before 1975, the book was not read because it was not available in reasonably legible form. After that date, it was very little read because it seemed irrelevant to current issues. Transformational Grammar could hardly have succeeded if the sketchy versions of the theory offered in publications such as Syntactic Structures had not used references to LSLT in order to join up the dots; but references to specific passages in LSLT have always been exceedingly rare in the literature of linguistics.

2 Chomsky's early contributions
All this is not to suggest that Chomsky's early published writings consisted of nothing but intellectual IOUs for a theory which was published twenty years late. Indeed, what I take to be Chomsky's chief contributions to linguistics were contained quite explicitly in his early publications. But those contributions were to the conceptual foundations of the discipline, rather than to its substantive theoretical content. They consisted of novel accounts of the tasks which ought to be fulfilled by linguists' theories, rather than novel recipes for carrying those tasks out.

As I read him, Chomsky's main achievement was to define two goals for the linguist to aim at. The first of these can be put as follows: a theory, or 'grammar', of an individual language ought to define the class of all and only the well formed sentences of that language. In the syntactic field, at least, Chomsky's predecessors had not set themselves this aim. Rather, they gave themselves the less demanding (though by no means trivial) goal of describing various syntactic patterns that do occur in a language, without worrying too much about completeness of coverage (and hence about the boundary between what is grammatical and what is not). The man who came closest to anticipating Chomsky in writing 'generative' rather than merely 'descriptive' grammars was perhaps his teacher Zellig Harris; but Harris did not quite do so, and it is very understandable that he should not have done - after Chomsky made the goal explicit, Harris came to realize that he (Harris) was bound to reject it as inappropriate, since he did not himself believe that there are in principle well defined bounds waiting to be discovered to the class of grammatical sentences in a natural language (Z. S. Harris 1965: 370-1).1
This position — that we to a large extent 'make up the grammar of our language as we go along', so that there is little sense in speaking of a class of potential sentences distinct from the class of sentences that have actually been uttered in a language — is one which is more or less explicit in the writings of such eminent linguists as Schleicher and Saussure (and which was argued at length against Chomsky by Hockett 1968). Chomsky never defended his, contrary, position; but if there is even a possibility that Chomsky is right and Schleicher, Saussure, Harris and Hockett wrong about the syntactic well-definedness of natural languages, then Chomsky's explicit formulation of this concept of the grammarian's task was certainly a valuable one.

The second novel goal that Chomsky advocated was that of stating a general theory of language which will draw a rigorous line of demarcation between 'natural', humanly possible languages and 'unnatural' languages that could not be used by humans - a notion which perhaps makes sense only on the assumption that it is possible to define individual languages rigorously, as Chomsky's first goal supposes. This second goal was not considered by Chomsky's predecessors, and it seems certain that many of them would have emphatically rejected it if it had been proposed to them explicitly. One of the beliefs or presuppositions common to those of Chomsky's predecessors whom he described (adversely) as 'empiricists' was that there are no fixed bounds to the diversity of natural languages, because the human mind is an immensely flexible organ not innately limited to any particular patterns of thought, which seems to imply that separate human communities will go different and unpredictable ways in developing their systems of communication. In my Educating Eve, and in other writings, I have argued that Chomsky is mistaken in his belief that the contents of human minds are subject to strong genetic constraints that lead to limits on the diversity of languages. But Chomsky's position is a defendable one, and if he were right it would follow that the goal of stating a general theory of language would be an appropriate aim for linguistics.

3 The received view of the early contributions
Whatever the virtues or vices of these methodological contributions, however, they are not what most linguists have seen as Chomsky's main achievement. According to the received version of the recent history of linguistics, what Chomsky did in his first published book, Syntactic Structures, was to propose a new linguistic theory, namely the theory of Transformational Grammar, superior to the general theory (Phrase-Structure Grammar) which underlay the grammatical work of his predecessors. Chomsky is alleged to have provided, not a novel question for general linguistics, but a novel answer to that question. As an account of the material actually printed in Syntactic Structures and other early publications, this seems fanciful. In the first place, I have already suggested that linguists before Chomsky did not believe in the existence of
any general linguistic theory limiting what can occur in the grammar of a human language; a fortiori, they did not believe that human languages were restricted to the kind of phenomena describable by (context-free or context-sensitive) phrase-structure rules. It is true that they talked mainly about immediate-constituent analysis, but that is a natural consequence of the fact that they aimed to describe the syntactic structures of individual grammatical sentences, rather than to define the class of all grammatical sentences: Chomsky, too, held that the (deep or surface) structure of any particular sentence is a constituency tree. Furthermore (apart from the fact that Chomsky explicitly borrowed the notion of 'syntactic transformation' from Zellig Harris), the kinds of phenomena for which a generative grammar needs structure-modifying rules such as Chomsky's transformations were perfectly well known to many others of Chomsky's predecessors, who did not talk in terms of transformations simply because they were not interested in constructing generative grammars. The notion 'discontinuous constituent', for instance, was discussed at length by Rulon Wells (1947: §v), who credits the notion to a 1943 paper by Kenneth Pike, and it was widely used by other linguists. To say that a language contains discontinuous constituents is to say that, if we aim to state a generative grammar of the language, we shall need rules for permuting elements of constituency structures as well as rules for creating such structures. Once one has the notions of 'generative grammar' and 'discontinuous constituent', the notion of 'permutation rule' is not an additional intellectual advance, but an obvious, immediate necessity.

Leonard Bloomfield talked explicitly about syntactic operations of replacement and deletion, for instance in his 1942 paper 'Outline of Ilocano syntax':

After di 'not' . . . ku ['by-me'] is . . . usually replaced by ak ['I'] . . . The form for 'by-me thou' omits ku . . . When the first two persons come together, the attributive forms of the third person are used . . .
The man who pointed out that American linguists were unduly concerned with the arrangement of forms into larger structures, and were tending to overlook the virtues of description in terms of processes applied to structures, was Charles Hockett in 1954, not Chomsky in 1957 (though it is true that Hockett's 'Two models' article discussed almost exclusively morphology rather than syntax). Nelson Francis's Structure of American English (1958) is full of process descriptions of syntax:

The interrogative status is marked by a change in word order, involving the inversion of the subject and the auxiliary, or the first auxiliary if more than one are present (p. 337)

The negative status is marked by the insertion of ... not ... immediately after the first auxiliary (ibid.)
[in the example Rather than starve he chose to eat insects] the direct object of chose is a structure of co-ordination, the second part of which has been front-shifted to the beginning (p. 364)

Questions involving interrogative pronouns . . . reveal split and dislocated structures of considerable complexity (p. 388)
— and so on. Admittedly, Francis's book was published in 1958; but it would be extremely implausible to suggest that Francis was able to write as he did only because of the enlightenment brought about in the preceding year by Chomsky (whom Francis never mentions) with his book Syntactic Structures.

Will it be objected that these writers were using terms like 'omit', 'dislocate', 'replace' only metaphorically? — that when, for instance, Bloomfield writes that an Ilocano form such as inismanka 'omits ku' he does not literally mean that the form is 'underlyingly' inisman-ku-ka and that this is changed by rule to inisman-0-ka, but merely that forms such as inismanka are found where the patterning of the rest of the language would have led us to predict *inismankuka? Bloomfield would undoubtedly have concurred with this interpretation of his usage (cf. p. 213 of his book Language). But Chomsky also is careful to stress that his process grammars must not be understood as models of speaker or hearer (cf. Chomsky 1965: 9). Chomsky's 'transformations' are not rules which speakers use to modify structures drawn up on some sort of cerebral blackboard in preparation for uttering them; rather, they are abstract rules which succeed in defining the range of sequences that actually do occur. I am not enough of a theologian to see how one could describe Chomsky's use of process notions as more 'literal' than Bloomfield's or Francis's. It is more formally explicit, but that is not the point here.

In order to go beyond a programmatic statement of the goal of general linguistic theory and actually to produce such a theory, Chomsky would have had to be precise about his notion of Transformational Grammar. He would have had to state some specific hypothesis about just what can and what cannot be said within a transformational grammar; and this hypothesis, to count as novel, would have had to do much more than merely say that the rules of such grammars can delete, insert, and permute, since this merely expresses what was well known already, in the terms that become appropriate when one thinks of grammars as generative rather than descriptive. In Syntactic Structures, Chomsky did not produce such a hypothesis. All he said there in explanation of the term 'transformational rule' was:

A grammatical transformation T operates on a given string (or . . . on a set of strings) with a given constituent structure and converts it into a new string with a new derived constituent structure. To show exactly how this operation is performed requires a rather elaborate study which would go far beyond the scope of these remarks . . . (p. 44)

To specify a transformation explicitly we must describe the analysis of the strings to which it applies and the structural change that it effects on these strings (p. 61)
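To see roughly what such a pairing of 'analysis' and 'structural change' amounts to, here is a toy sketch of my own, not Chomsky's formalization: a rule is a sequence of category labels plus a rearrangement of the numbered constituents, in the spirit of the Francis-style inversion statement quoted earlier. The category labels, the example rule and the sample clause are all invented.

# A toy rendering of a transformation as a pair of structural description
# and structural change: the description is a sequence of category labels,
# the change rearranges the numbered constituents. Invented stand-ins only.
def apply_transformation(constituents, description, change):
    """constituents: a list of (category, words) pairs for one clause."""
    categories = [category for category, _ in constituents]
    if categories != description:      # structural description not satisfied
        return None
    return [constituents[i] for i in change]   # structural change by position

# An 'interrogative' rule in the spirit of the Francis statement quoted
# earlier: invert the subject and the (first) auxiliary.
DESCRIPTION = ["NP", "Aux", "VP"]
CHANGE = [1, 0, 2]                     # NP Aux VP  ->  Aux NP VP

declarative = [("NP", "the children"), ("Aux", "will"), ("VP", "eat insects")]
question = apply_transformation(declarative, DESCRIPTION, CHANGE)
print(" ".join(words for _, words in question))   # will the children eat insects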
A footnote refers the reader who wants more detail to three works, of which one is LSLT and another is one chapter of LSLT. Only the third, 'Three models for the description of language' (Chomsky 1956), was in print by the time that Syntactic Structures appeared, in a journal for radio engineers to which few linguists would have had access. (In any case 'Three models' deals in a few paragraphs with material that takes some ninety dense pages of LSLT.) If people took Syntactic Structures to represent a new substantive theory of language - and they did - this can only be because of these references to LSLT. Syntactic Structures did not give us a general linguistic theory; it suggested that LSLT contained such a theory and gave us hints as to its nature, and linguists proved willing to take these suggestions on trust. One cannot even claim for Syntactic Structures that it established a Lakatosian 'research programme', by providing the hard core of a general theory which could then be modified by later research. One cannot modify the details of a theory, unless one knows what the details are.

Syntactic Structures included many examples of transformational rules proposed for English, expressed in what is in reality an informal, inexplicit notation — though it looked offputtingly formal to some linguists who were less algebra-minded than Chomsky. In practice, it was these examples which took on the role of statement of the theory, as far as most linguists were concerned. When transformationalists claimed that the best way to define the syntax of any natural language was by means of a 'transformational grammar', what they meant was something like 'by means of rules of the kind found on pp. 111-14 of Syntactic Structures'. But this is so vague that it can hardly be called a theoretical claim at all. How similar to Chomsky's examples must a rule be, in order to count as a rule of the same kind? Some of Chomsky's rules in Syntactic Structures contain devices which look rather ad hoc, for instance the structural description of the rule which came to be called 'Affix Hopping' (Chomsky's term was 'Auxiliary Transformation') includes elements Af and v that have to be defined in a rubric appended to the structural description, since these symbols do not appear in the phrase-structure base. This particular case is an important one, because Affix Hopping was one of Chomsky's best rules from the 'public relations' point of view; its apparent success in reducing chaos to an unexpected kind of order won many converts to Transformational Grammar. (It has remained a standard component of the transformational description of English, though with modifications down the years - see Ouhalla 1999: 92-9 for a recent restatement.) So it will be worth examining this case in detail.

4 A transformational rule

Chomsky's Affix Hopping rule1 dealt with the diverse patterns of verb group (sequences of auxiliary and main verbs) found in English. The subject of a
finite clause may take a simple verb, say eat or eats; but we can also find sequences of many kinds, for instance:

was eating
may have eaten
has been eating
could have been eaten
can eat
must have been being eaten
etc.

Generating all the possibilities using a pure phrase-structure grammar is not straightforward, Chomsky claimed (1957: 38). A simpler approach is to generate the elements in the wrong order, and then apply a transformational rule which makes affixes 'hop' over adjacent verb stems. The phrase structure base allows the alternative sequences shown in Figure 9.1, where the contents of any pair of brackets may be either included as a whole or omitted as a whole - if you choose to take have, you must also take -en, representing the past participle ending. The term Modal stands for the uninflected form of any of the modal verbs, such as can, may, will; and V stands for any uninflected main verb.

(Past-Tense) (Modal) (have -en) (be -ing) (be -en) V
Figure 9.1

Then, having made some particular selection, say the selection shown in Figure 9.2, one applies Chomsky's Affix Hopping rule to the sequence.

Past-Tense can have -en be -en eat
Figure 9.2

The rule says that each 'affix', in this case the elements Past-Tense and -en, 'hops' rightwards over whichever verb stem it is next to, so that the sequence of Figure 9.2 is transformed into that of Figure 9.3, which corresponds to the word-sequence could have been eaten.

can Past-Tense have be -en eat -en
Figure 9.3

Some minor 'tidying-up' rules are needed to get the morphology right, specifying for instance that can with the past-tense suffix comes out as could, and that the past-participle suffix is -en after eat but -ed after, say, invite. The essence of the job, though, is achieved by the simple pattern of Figure 9.1 together with the Affix Hopping rule. Affix Hopping is essential. Without it, Figure 9.2 would correspond to a sequence something like -ed can had been eat - even without the past-tense affix stranded at the beginning without a stem to attach itself to, can had been eat makes no sense in any normal version of English. On the other hand, no matter which selections you make from Figure 9.1, after applying Affix Hopping to the selections you end up with a sensible English verb group.
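A minimal sketch of the mechanism may help. The following Python fragment is my own illustration, simplified to a single modal and a single main verb, with an invented and partial tidying-up table; it generates the Figure 9.1 selections, applies the hopping operation, and spells out the results.

import itertools

# A toy rendering of the Figure 9.1 base plus the Affix Hopping rule.
# One modal (can), one main verb (eat); the tidying-up table is an
# invented stand-in for the morphological rules, not Chomsky's own.
SLOTS = [["Past"], ["can"], ["have", "-en"], ["be", "-ing"], ["be", "-en"]]
AFFIXES = {"Past", "-en", "-ing"}

def base_sequences(verb="eat"):
    """Every way of including or omitting each optional bracket of Figure 9.1."""
    for choice in itertools.product([False, True], repeat=len(SLOTS)):
        seq = [w for slot, keep in zip(SLOTS, choice) if keep for w in slot]
        yield seq + [verb]

def affix_hop(seq):
    """Each affix hops rightwards over the verb stem immediately to its right."""
    out, i = [], 0
    while i < len(seq):
        if seq[i] in AFFIXES and i + 1 < len(seq):
            out.extend([seq[i + 1], seq[i]])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

TIDY = {("can", "Past"): "could", ("have", "Past"): "had", ("be", "Past"): "was",
        ("eat", "Past"): "ate", ("be", "-en"): "been", ("be", "-ing"): "being",
        ("eat", "-en"): "eaten", ("eat", "-ing"): "eating"}

def spell_out(seq):
    """Merge each stem with a following affix, using the toy table above."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) in TIDY:
            out.append(TIDY[(seq[i], seq[i + 1])])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return " ".join(out)

# The Figure 9.2 selection comes out as 'could have been eaten' ...
print(spell_out(affix_hop(["Past", "can", "have", "-en", "be", "-en", "eat"])))
# ... and every other selection from the base also yields a normal verb group,
# such as 'was eating' or 'could have been being eaten'.
for seq in base_sequences():
    print(spell_out(affix_hop(seq)))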
Some later writers (e.g. Gazdar et al. 1985) would argue that Chomsky's exposition underestimated the potential of pure phrase-structure grammar, and that Chomsky needed an Affix Hopping rule to reorder the sequences generated by Figure 9.1 only because Figure 9.1 was expressed in an unnecessarily crude manner. But that is beside the point here. What matters for present purposes is that Chomsky believed that pure phrase-structure grammar could not do the job elegantly (he said so, and in terms of the kind of phrase-structure grammar he was using in 1957 he was correct), and he argued that Transformational Grammar solved the problem by permitting rules like Affix Hopping to be stated. Many readers took this as a telling argument in favour of Transformational Grammar. But it could be that only if Affix Hopping was a kosher 'transformational rule'. Was it?

The question arises, as already indicated, because Chomsky's formalization of the 'Affix Hopping' or 'Auxiliary Transformation' rule (1957: 113) included (and needed to include) a rubric glossing the symbols Af and v which appear in the structural description of the rule: 'where Af is any C or is en or ing; v is any M or V, or have or be'. Just what was it supposed to be legitimate to include in rubrics appended to transformational rules? As an example of the sort of operation which transformations cannot do, Chomsky often cited a hypothetical 'unnatural' language in which yes/no questions are formed by reversing the order of words in the corresponding statements. But, if there are no limits to what may appear in transformation-rule rubrics, even this could be put into Chomskyan notation: 'Structural analysis: S; Structural change: X1 → X2 (where X2 contains the words of X1 in the reverse order)'. Of course, we feel that this stretches Chomsky's notation further than is proper, because we know that real languages do not contain operations of this kind. But what Transformational Grammar purported to do was to express some specific hypothesis about the location of the borderline between operations (such as Affix Hopping) which do occur in human languages, and operations (such as sentence reversal) which do not occur. The published works on which most linguists' knowledge of Transformational Grammar was based contained, in reality, no such hypothesis.

Transformational grammars which were published in the decades following Syntactic Structures contained a great diversity of apparently ad hoc elements in their rules. Consider, for example, Marina Burt's From Deep to Surface Structure (Burt 1971), which is a good example for our purposes because (unlike, say, Jacobs and Rosenbaum's English Transformational Grammar, 1968) Burt's rules are always stated explicitly, while at the same time (unlike Stockwell, Schachter and Partee 1973) she aimed to make her grammar conform to orthodox Chomskyan theory. (Indeed, Burt's book carried the imprimatur of Chomsky and his MIT colleagues.) Many of
Burt's structural descriptions include disjunctions of symbols in curly brackets; sometimes (e.g. in her Complementizer Placement rule) she used 'parallel' sets of curly brackets (if the second element is chosen from one pair, the second element must be chosen from the other pair, and so on). Burt's Equi NP Deletion rule includes the rubrics '2 = 5 or 5 = 8' (numbers referring to elements of the structural description), and '2 and 8 may not both be null'. Tag Formation is marked as optional 'if 1 = Imp'. In Do-Support, the variable letter X in the structural description 'cannot end with a [Verb]'. (There are further examples which I shall not cite here.) For each of these devices, the question arises whether it was a legitimate possibility permitted by Chomsky's general theory, or an unfortunate piece of adhockery made necessary by intractable data that seem, at least prima facie, to constitute an anomaly for the theory. Scientific progress being the uncertain sort of enterprise that it is, one is willing to tolerate a certain proportion of cases of the latter kind, provided they are recognized as worrying anomalies. However, Burt did nothing to indicate which status she attributed to the examples I have quoted. And no blame can attach to her for this omission; the published statements of transformational theory gave her no way of knowing where to draw the distinction.2

In other words, the linguists who argued in the 1960s and early 1970s against Phrase-Structure Grammar and in favour of Transformational Grammar were in essence arguing that a format for linguistic description which is rigorously restricted to statements of certain very limited kinds runs into difficulties that do not arise, if one chooses to describe languages using an alternative format which not only explicitly permits a much wider range of statements, but furthermore is quite vague about the ultimate limits of that range. But that is an uninteresting truism. It becomes difficult to understand how Transformational Grammar succeeded in keeping linguists under its spell for so long.

In my view the answer must be cast partly in sociological rather than wholly in intellectual terms (as suggested, for example, by Hagège 1981, Sampson 1999a: 159 ff.). Some of the respect accorded to Transformational Grammar was an emotional reaction of submission to the appearance of authority, rather than an intellectual response to a scientific theory in the ordinary sense. The scientific shortcomings of early statements of the theory were not addressed in later publications. Almost thirty years after Syntactic Structures, Gerald Gazdar and co-authors (Gazdar et al. 1985: 19) described how transformational theory had developed as 'a twilight world in which lip-service is paid to formal precision but primitives are never supplied and definitions are never stated'. (This was an unusual view in its day, though by now similar remarks are frequently encountered. Annie Zaenen (2000) describes a book she is reviewing as making 'the usual points about Chomsky's sloppiness and vagueness', and associates herself with its complaints about 'the lack of rigor and explicitness of syntactic work by the current followers of Chomsky'.) And, if Transformational Grammar succeeded partly through charisma
rather than scientific merit, LSLT was an important factor contributing to that charisma. If it had been available in print from the beginning, it seems likely that most linguists would have rightly concluded that they could play no useful part in the development of such a formidably mathematical theory and would have concerned themselves with other issues, while the minority who could read LSLT would have appreciated that it is possibly right in some places and probably wrong elsewhere, and would have perceived it as just one more book about syntax - much more formal than most, but not by that token more deserving of respect. As it was, the doctrine was offered to the public in admittedly watered-down, sketchy fashion, but linguists knew that the inner circle of MIT men who came to be regarded as leaders of the field had access to the real thing, in the shape of this manuscript which (it was presumed) answered all one's questions, but answered them in a style which the great majority of linguists could not begin to understand even if they got their hands on a copy - so that it made far better sense to leave that kind of thing to the high priests, and to teach one's students in the sketchy, popular style of Syntactic Structures. Transformational theory as actually taught was sufficiently flexible that any practitioner could make virtually what he wanted of it, and it thus ran little or no risk of refutation; but the pointlessness of a 'theory' with this degree of flexibility was masked by the aura of respectability which derived from the knowledge that LSLT lay lurking in the background. (Similarly, Chomsky's later best-seller Aspects of the Theory of Syntax (1965), which was read much more widely than Syntactic Structures, by philosophers and others with no special expertise in linguistics, in turn derived its authority partly from the readers' awareness that its references to formal syntactic analysis were underpinned by Syntactic Structures, which was understood to be highly regarded by the experts.)

5 The theoretical contents of LSLT
So much for the effect that LSLT had on the development of linguistics in the latter half of the twentieth century. What of the book itself? LSLT grew out of Chomsky's doctoral dissertation, 'Transformational Analysis', which was essentially an attempt to define in rigorously formal terms a version of the concept of 'syntactic transformation' which was being applied in linguistic description by Zellig Harris. Chomsky found it profitable to depart from Harris's concept in certain respects. That work is chapter IX of LSLT. LSLT as a whole was completed in the spring of 1955, when Chomsky was 26. He revised some portions of the MS in January 1956, and later that year carried out a further revision of what became the first six (out of ten) chapters with a view to publication (though the publisher he approached turned him down). In 1958-9 Chomsky again substantially revised the first six chapters in the hope of publication, but broke his work off before moving on to the later chapters. Chomsky tells us that he had hoped to use the improved material of 1958-9 in the eventually published version,
but found problems in integrating this material with the later part of the MS. In the book as eventually published in 1975, then, chapter I (a summary of the whole), and chapters II to VI, which are largely concerned with methodological issues, are the version of late 1956; chapters VII and VIII, on phrase-structure grammar, are as revised in January 1956; and the heart of the book, namely chapter IX (the PhD thesis defining the notion of 'transformational grammar') and chapter X (applying this notion to English), are the original material of 1955, which Chomsky never found it possible to revise even once. Certain appendices included in some stages of the MS were dropped in the published version. The most notable omission is 'The Morphophonemics of Modern Hebrew', which Chomsky submitted as a BA thesis in 1949, expanded into an MA thesis in 1951, and incorporated into LSLT as an appendix to what became chapter VI; this was thus Chomsky's first attempt at writing a grammar — indeed his only attempt, so far as I know, at a grammar of a language other than English. For the published version of LSLT, Chomsky added a 53-page introduction discussing the genesis of the MS and of his ideas, and the subsequent fate of the theory; but, in other respects, the book is exactly as he left the MS in the 1950s. How far was one right, in the 1960s, to suppose that LSLT provides the detail needed to convert the sketch of a theory contained in Syntactic Structures into a specific, testable hypothesis about the kinds of syntactic operation found in human languages? One way to seek an answer to this question is to take the case of Affix Hopping, already discussed. Does the rubric in the Syntactic Structures statement of Affix Hopping make the rule merely an ad hoc approximation to a transformational rule, or does such a statement fall within the rigorous definition of 'transformational rule' spelled out in LSLT? The answer is that it is very hard to tell. The phenomena which came to be described as Affix Hopping are first discussed in LSLT on p. 233, in chapter VIII - that is, before the notion of transformational analysis has been introduced. At that point the rule is stated informally, in much the same way as in Syntactic Structures. When the notion of 'transformational rule' is defined in chapter IX, it is treated as a pairing of a structural description which defines the conditions a phrase-marker must meet for the transformation to be applicable, stated in terms of a sequence of syntactic categories required to occur in the phrase-marker, with a structural change or 'elementary transformation' which specifies what happens to the various constituents identified by the structural description, identifying the constituents purely in terms of their numerical positions within the sequence. Thus the problem with Affix Hopping, relating to the fact that Chomsky's symbol Af for 'affix' is not a single category defined by the base grammar but an informal abbreviation for a disjunction of several alternatives (and similarly for his symbol v), arises on the structural-description side only. But when we return from the general definition
of 'transformational rule' to the particular case of Affix Hopping, on pp. 364—5, only the structural-change side of the rule is discussed; and the chapter on transformational analysis of English, chapter X, treats Affix Hopping merely by implicitly referring back to chapter VIII (see e.g. p. 422). The portion of the book revised after 1955 ends, on pp. 276-87, with a résumé of English phrase-structure grammar, which draws together the analyses scattered through the preceding pages; but nothing of the sort is provided after the notion of 'transformation' has been introduced, so that Chomsky nowhere shows us just how Affix Hopping could be stated as a transformation. From the definition of 'proper analysis' on p. 313, which seems to make no allowance for symbols to stand for disjunctions of other symbols, I infer that Affix Hopping cannot be stated formally as a single transformation. As a generalization about English grammar, Affix Hopping is accurate and even insightful, but it seems not to be the kind of generalization permitted by Transformational Grammar. If that is correct, this rule - which did so much to win converts to Chomsky's theory - has in reality constituted a standing refutation of that theory throughout its history. But Chomsky nowhere forced himself to confront the question explicitly. Chomsky was much less concerned in LSLT to demarcate the border between 'natural' and 'unnatural' syntactic processes, than to dot every i and cross every t in his algebraic definition of certain kinds of process that, he believed, do occur in natural languages. Long passages of the book consist of sequences of definitions packed with algebraic symbolism. These passages are very difficult indeed to read. Because of this emphasis, which occurs elsewhere in Chomsky's writings, some commentators on linguistics appear to have seen Chomsky as a scholar who turned to linguistics after many years as a mathematician, and whose theorizing is inevitably difficult for the reader who comes to linguistics from the humanities, simply because the language of mathematics is so far removed from what readers of that class are used to. Biographically, this is wrong. Chomsky majored in linguistics as an undergraduate; he turned to logic and mathematics as a graduate student in his early twenties because of his interest in formal linguistics, rather than vice versa, collaborating with M. P. Schützenberger of the IBM research centre. There is no question that Chomsky produced some good work in mathematical linguistics — notably the theorems in 'On certain formal properties of grammars' (1959). If LSLT is difficult to read, however, this is largely because Chomsky's mathematical expressions in that book are often maladroit, in a way that is forgivable in a newcomer to mathematical writing but which would normally disqualify a book from being considered as a seminal contribution to its field. Sometimes Chomsky's mathematical remarks are just plain wrong, as appears to be the case for instance in the following sentence quoted from p. 317: 'Each T ∈ 𝒯 is a single-valued mapping on a subset of {(ζ, K)}, where ζ is a string in P and K is a class of strings, one of which is in ζ.' A pair of names separated by a comma and surrounded by round brackets, as in '(ζ, K)', is normally used to represent the algebraic entity called an 'ordered
pair', and Chomsky commonly uses these symbols in that way. It follows that the result of surrounding this symbol-sequence with curly brackets is a name of a set with a single member, namely the ordered pair in question (or with two members if an ordered pair (a, b) is treated as an unordered set {{a, b}, a}, as is sometimes done). In that case, {(ζ, K)} has subsets in only a trivial sense. What I think Chomsky probably means is that T is a partial mapping on the set of all pairs (ζ, K) such that ζ is a string in P and K is a class of strings one of which is an element of ζ - but that is different. Elsewhere Chomsky's mathematics is not so much clearly wrong as perverse, stated in a way that a professional mathematician would never choose. Thus, consider the following crucial, and typical, passage from p. 324:

Suppose now that we have a system S which is a level or a sum of levels.
Definition 4. 𝒯_el = {t_i | i ≥ 1} is a set of elementary transformations defined by the following property: for each pair of integers n and r such that n < r, there is a unique sequence of integers, (a_0, a_1, ..., a_k), and a unique sequence of strings in S, (ζ_1, ..., ζ_{k+1}), such that (i) ... (ii) for each Y_1, ..., Y_n, ...
That is, the domain of t_i is the set of ordered pairs (P_1, P_2), where P_1 is an n-ad of strings, P_2 is an (r − n + 1)-ad of strings, and the last element in P_1 is the first element in P_2.
In the first place, I believe that Chomsky has not defined the term 'level' other than informally (since there is no index, it is difficult to check this). A footnote to the first quoted sentence refers to 'a more exact formulation' a few pages later, on p. 333, but this passage again merely assumes that we know what a 'level or sum of levels' is — indeed, I cannot see in what sense the passage referred to is a reformulation of anything in the passage quoted above. Now, consider the item following the equals sign in the first line of the definition. Normally, the device of a vertical bar separating elements within curly brackets is itself used for defining; the equation here would mean 'You already know what t_1, t_2, etc., are; now I am going to use "𝒯_el" to stand for the set comprising all of those items with subscripts not less than 1'. But, in this case, we do not know what the t's are, when we start reading the definition — they are defined by what follows. Then, what is the force of the indefinite article in the first line of the definition? Does Chomsky mean that, for any particular S, it is a mathematical truth that there is just one set 𝒯_el fitting his definition? - or, that any set 𝒯_el fitting the definition counts as a set of elementary transformations? Even if it
were possible by taking thought to eliminate one of these interpretations as inconsistent with what follows, it is poor mathematical practice to burden the reader with this kind of problem. More seriously: why does the sequence of integers mentioned in the definition begin with a_0, which is required by (i) always to be zero, and which plays no role at all in (ii)? It appears that the two references to a_0 could simply be deleted, without the effect of the definition being altered in the slightest. Unpacking the sense of the 'property' attributed to an individual 'elementary transformation' is no easy task. As I understand it, Chomsky was defining an elementary transformation, in effect, as any function having as domain the set of all pairs (φ, a) such that φ is a finite string and a is an element of φ, and which takes any such pair (φ, a) into some string whose length is an odd number, and in which the elements occupying the odd-numbered places are strings in S and those occupying the even-numbered places are elements of φ. This appears to say the same as everything following the colon near the beginning of Chomsky's definition, though my paraphrase uses three different algebraic symbols, compared with ten or more in Chomsky's statement. It may be that Chomsky's use of the letter Y is intended to make the nature of the domain of an elementary transformation more specific than I have done in the words 'φ is a finite string' (in other passages of LSLT, Chomsky regularly uses Y for the elements of a 'proper analysis'). But it is surely improper to throw a heavy burden of sense on a mere choice of letter. It may also be that I have wholly misinterpreted Chomsky's definition; but, if so, I am not convinced that the fault is mine. Alan Sokal and Jean Bricmont have recently documented the way that a number of twentieth-century 'intellectuals' created an air of profundity in their writings on social and psychological topics by using mathematical terminology and symbols which, to readers ignorant of mathematics, look impressive, though to professional mathematicians they are nonsensical (Sokal and Bricmont 1998). (Sokal is a physicist who came to wide attention in 1996 when he exposed the pretensions of this style of academic discourse by writing a spoof article, 'Toward a transformative hermeneutics of quantum gravity', which as its author he knew to be meaningless, and getting it accepted and published by the sociology journal Social Text.) Sokal and Bricmont write as if this vice is exclusively French, quoting from the works of authors such as Jacques Lacan and Julia Kristeva. No doubt Chomsky's failings in LSLT represent limited competence, rather than the blatant intellectual abuse which Sokal and Bricmont impute to the French gurus; but it is hard not to be reminded of their analysis, when one contemplates the gap between the reality of Chomsky's mathematics in LSLT, and the reputation it achieved. One respected scholar (Gazdar 1979), for instance, described LSLT in all seriousness as representing the same kind of achievement as foundational treatises by such pre-eminent philosophers of mathematics as Frege and Hilbert. (As shown by the quotation on p. 150,
Gerald Gazdar later saw the light; I am sure he would not write in similar terms today.) Not only is LSLT maladroit in expressing its mathematical messages; the content of those messages often seems perversely roundabout. Chomsky frequently appears determined to treat all aspects of syntax in terms of strings and sets of strings, when algebra offers much more straightforward ways of achieving his aims, in terms of structures other than strings. James McCawley (1968) pointed out that it is clumsy to treat phrase-markers, in the way Chomsky does, as derived from sequences of symbol-strings ('derivations') produced by phrase-structure grammars interpreted as rules for rewriting symbols. Why not simply interpret phrase-structure grammars as well-formedness conditions on trees, and dispense with the intermediate concept of 'derivation'? Even in his introduction to the published version of LSLT, Chomsky would not concede that this is an improvement: 'Plainly, nothing is at stake in this choice of interpretations' (p. 48, n. 21). The word 'Plainly' here is reminiscent of a note said to have appeared in the margin of one of the Soviet leader Nikita Khrushchev's speeches to the United Nations: 'Weak point — shout'. One thing at stake is that Chomsky's approach forces him to impose two quite arbitrary restrictions on phrase-structure rules, namely, that no rule may rewrite any symbol A as either the null string, or as a sequence including A. Both of these forbidden types of rule frequently seem appropriate in describing real languages, and under the alternative view of phrase-structure grammars there is no objection to them. LSLT includes a long and complex passage (pp. 176—201) concerned exclusively with the relationship between a phrase-structure grammar and the phrase-markers it generates, all of which could be collapsed to a paragraph or two under the tree-centred approach; pp. 187-8 of this passage deal with the latter of the forbidden rule-types just mentioned. Chomsky's undue preoccupation with strings causes problems not only for the treatment of phrase-structure grammar but also, and I believe more seriously, for the treatment of transformations. Transformational rules are described informally as functions from trees to trees, but they are not defined by Chomsky as such. In LSLT, trees are rarely mentioned, and transformations are treated as mappings from terminal strings into terminal strings controlled by strings of non-terminal symbols acting as analyses of the input terminal strings. This leads to endless complications. The effect is rather as if someone watching a film of some activity in three dimensions were to force himself to describe it in terms of the movements of two-dimensional shapes on the surface of the cinema screen.
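To return for a moment to McCawley's point: the tree-based reading of a phrase-structure grammar is easy to make concrete. The sketch below is my own toy illustration (the categories and the data representation are invented, not taken from LSLT): the grammar is simply a table of admissible local trees, a tree is checked node by node against that table, and the two rule types which the string-rewriting interpretation obliged Chomsky to forbid (a null expansion, and an expansion containing the expanded symbol itself) raise no difficulty at all.

```python
# A phrase-structure grammar read as node-admissibility conditions on trees.
CONDITIONS = {
    "S":  [("NP", "VP")],
    "NP": [("Det", "N"), ("NP", "PP"), ()],   # a self-embedding and a null expansion
    "VP": [("V", "NP")],
    "PP": [("P", "NP")],
}

def admissible(tree):
    """A tree is (category, daughters); a daughter is a sub-tree or a word (a string)."""
    category, daughters = tree
    if category in CONDITIONS:                          # phrasal node
        if not all(isinstance(d, tuple) for d in daughters):
            return False
        labels = tuple(d[0] for d in daughters)
        return labels in CONDITIONS[category] and all(admissible(d) for d in daughters)
    return len(daughters) == 1 and isinstance(daughters[0], str)   # lexical node dominates one word

tree = ("S", [("NP", [("Det", ["the"]), ("N", ["dog"])]),
              ("VP", [("V", ["chased"]), ("NP", [])])])
print(admissible(tree))    # True: the empty NP, like NP-within-NP recursion, causes no trouble
```

Nothing corresponding to a 'derivation' is needed on this view; the grammar constrains trees directly.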
6 Linguistic methodology

So much for the substantive theoretical aspects of LSLT. Let us turn now to the parts of it concerned with methodology. Here the book induced a radical change in my understanding of the development of Chomsky's thought. Before I read LSLT, I had taken the
history of Chomsky's methodological beliefs and assumptions to be roughly as follows. (It may well be that I was idiosyncratic in my interpretation, but I believe that my interpretation was plausible with respect to the literature available before the publication ofLSLTin 1975, and furthermore it had the merit of being more charitable to Chomsky than the interpretation which turns out on the basis ofLSLTto have been correct.) In the early years, I took it, Transformational Grammar had been essentially a theory about how most elegantly to define the sets of word-sequences from among which the utterances of a language community were drawn. What mattered about a grammar was its 'weak generative capacity' (Chomsky 1965: 60), that is the sequences of words it generated, which could be tested more or less directly against the evidence of speech or writing, rather than its 'strong generative capacity' the structures it assigned to the word-sequences it permitted - which could not be checked against any obvious category of objective data. Even 'weak generative capacity' is not immediately testable against the raw data of usage, because various factors, such as memory limitations and slips of the tongue, will interact with the factor of grammaticality to modify what is uttered, in such a way that some grammatical sequences will systematically be excluded from the data, while some sequences found in the data will be non-grammatical. Ultimately, we expect either the linguist, or scientists in other disciplines, to provide explicit theories not only of grammaticality, but also of these interfering factors; and, when we have theories about all these phenomena, the conjunction of the theories will generate predictions about speech that are immediately testable. In the meantime, since we have what seem to be reasonably satisfactory intuitive, inexplicit theories about many of the interfering factors, heuristically it is a good strategy to rely on these theories, and hence to base our grammars on our intuitions about which word-sequences are grammatical. Furthermore, language is a universally fascinating phenomenon, which has been studied for many centuries. Most of us are taught quite a lot about the results of this study at school, and many people think about the grammar of their language, in a less systematic and explicit way, also outside school; so that people often have opinions not only about the grammaticality of individual word-sequences, but about subtler grammatical matters such as constituent structure. It would be surprising if these opinions were regularly quite wrong; so, again, it makes heuristic sense to allow them to guide us in making decisions about grammatical analysis, in cases where we have not yet found empirical data which settle the issue. These uses of speakers' opinions in taking convenient short-cuts to plausible grammatical theories are entirely compatible with the claim that the theories arrived at are empirical, that is that what makes them right or wrong is ultimately the accuracy or otherwise of predictions they can be made to yield about interpersonally observable phenomena. Syntactic Structures gave us little or no reason to question that Chomsky saw grammars as empirical in this sense, with 'speakers' intuitions' having merely interim
significance, pending the elaboration of more complete theories. Here and there, even in the early years, Chomsky made remarks which seem to give intuition a more important role, for instance his statement, quoted more than once already, 'The empirical data that I want to explain are the native speaker's intuitions.' But this might be explained away as a hasty and incautious comment at a conference by someone who was primarily interested in substantive questions about the shape of grammars, rather than in methodological purity. (After all, it is flatly contradictory to describe 'intuitions' as 'empirical' data, so Chomsky cannot have meant exactly what he said.) As time went on, Chomsky became more interested than he originally appeared to be in the implications of his grammatical theory for the philosophy of mind, and specifically in the notion that the grammar of a language must be seen as 'tacit knowledge' in an individual speaker's mind, rather than as a summary of a set of mechanical dispositions to respond to stimuli. It was this, I took it, which led him gradually to succumb to an attractive and convenient fallacy: namely that, because grammars are about speakers' 'knowledge of language', the ultimate data on which they are based are speakers' opinions about their language. This is a fallacy, because, if one admits that some of a speaker's statements about his language are mistaken (and Chomsky has always recognized that this is so), one is forced to admit that the grounds for respecting some nativespeaker 'language intuitions' and not others are that the former but not the latter agree with observation - which means that the intuitions are ultimately redundant. But, perhaps because armchair methods make grammars so much easier to construct than they are when they have to be validated by painstaking fieldwork, this point was lost sight of. Chomsky, and his followers, seemed instead to solve the difficulty of mistaken intuitions by supposing (in an inexplicit way, and without explaining how they could be sure of this) that linguistic training can be relied on somehow to remove the fallibility that attaches to the lay speaker's intuitions (cf. e.g. Labov 1975: 100-1). As a statement of Chomsky's position in recent decades, the preceding two paragraphs are not affected by anything in LSLT. What does emerge from that book is that I was quite wrong to think that the view of linguistics as intuition-based rather than observation-based resulted from developments in Chomsky's views subsequent to his first published book. Intuition is given as central a role in LSLT as, in Chomsky's later writings; and the contradictions that arise from confusing intuition and observation are perhaps more salient in LSL Tthan anywhere else in Chomsky's works. In the first place, it turns out that the discussion of weak generative capacity in Syntactic Structures was a mere afterthought. As Chomsky pointed out on pp. 5 and 8 of the 1975 introduction to LSL T, weak generative capacity is never mentioned in that book, which is concerned with the construction of grammars to reflect grammatical structures that are taken as being known independently of the linguist's analysis. The very first paragraphs of LSLT describe in some detail the categories of'knowledge about his language' and
'intuitions about linguistic form' allegedly possessed by any speaker, and imply that the task of general linguistics is to provide for grammars which reflect this knowledge and these intuitions. Moreover, Chomsky does not even restrict his examples to cases (such as grammaticality of simple word-sequences) where it is plausible to equate 'what a speaker "tacitly knows" qua speaker' with 'what a speaker is consciously aware of and can state accurately in response to an enquiry'. Far from it. Chomsky's first example of what 'any speaker of English knows' is that '"keep" and "coop" begin with the same "sound"'. Phonetically, they do not: the velar stop before the front vowel of keep is advanced, that before the back vowel of coop is retracted. It may be that these two physically fairly distinct sounds are to be treated as variants of a single phoneme in the 'correct' grammar of English, and that this grammar is 'correct' because it corresponds somehow to the way the language is organized in an Englishman's head; but speakers clearly do not have conscious access to that organization, or there would be fewer disagreements about phonemic analysis than there are. The variation between different velar positions is thought to have been no more contrastive in Etruscan than it is in English, but that did not hinder the Etruscans in systematically distinguishing the advanced, neutral, and retracted velar stops in writing by spelling them with gamma, kappa, and koppa, respectively, when they borrowed the alphabet from the Greeks. Apparently the Etruscans 'knew' that the phones in question were 'different sounds'. If English-speakers typically think of the initial sounds of keep and coop as the same, that is explained by the fact that they have mastered an orthography which happens to treat the letters K and C as largely interchangeable (C is advanced in cube and cape, retracted in coop; K is advanced in keep, relatively retracted in kite, kerb, or kudu). Chomsky's second example of native-speaker knowledge is the fact that both flee and fly are related to flight in the same way as see and refuse are related to sight and refusal. There must surely be very many English speakers who can use the word flee and the phrase put X to flight competently enough, while having no conscious idea of a special relationship between flee and flight. (Overwhelmingly the commoner sense of flight in modern English is the noun related to the verb fly, not flee.) Turning from phonology and morphology to syntax, Chomsky (like Langendoen, as discussed on p. 123 above) attributes to speakers knowledge of various aspects of the classification of sentence-types which seems to go far beyond what the average speaker who is not a professional linguist consciously knows, though the facts in question are elementary enough to speakers who have made a special study of grammar (and the claim that the layman knows them 'tacitly' is unrefutable). There is nothing intrinsically illogical about a theory which sets out to reflect in systematic form speakers' opinions about their language - though this is a very different task from the one that linguists before Chomsky saw themselves as engaged in. Likewise, it might be possible, and perhaps
interesting, to set out to collect and systematize laymen's beliefs about illness, say - though that is not the normal occupation of professors of pathology. But Chomsky in LSLT tried to have it both ways. Having opened chapter I by describing various rather subtle kinds of speakers' knowledge as the data of a grammar, on the first page of chapter II Chomsky wrote that

A grammar ... can be considered ... to be a ... scientific theory ... [which] seek[s] to relate observable events by formulating general laws ... In a particular grammar, the observable events are that such and such is an utterance of the language ...
On p. 80, Chomsky even suggested that we require the various theoretical constructs, by means of which a grammar makes predictions about utterances, themselves to have 'observable correlates' - a very strong demand to make of a scientific theory. In the final paragraph of chapter II, Chomsky suggests that we should solve the problem of the unreliability of speakers' intuitions by 'formulating behavioral criteria to replace intuitive judgments'. But how do we decide what 'behavioural criteria' are relevant? As Chomsky has just pointed out on the previous page, we can do so only by checking whether they match the data of intuition, which means that observation of behaviour is ultimately redundant and the problem of unreliable intuitions remains unsolved. In practice, any linguist will no doubt draw on both intuition and observation in formulating his grammatical generalizations. But either the reliance on intuition is a temporary expedient made progressively redundant as more and more phenomena are described in terms of empirical theories, or else the use of behavioural data is treated as a convenient tell-tale short-cut to the informant-introspections which are the real subject of the theory. Contrary to some handwaving remarks by Chomsky (e.g. 'We might hope that some more general account of the whole process of linguistic communication than we possess now may permit us to reconstruct the criteria of adequacy for linguistic theory in more convincing and acceptable terms', p. 102), there is no middle path between these two approaches to the enterprise of linguistic analysis. Chomsky fudged the issue in LSLT; he went on fudging it ever since.

7 Discovery procedures
A less important but still interesting respect in which LSLT sheds light on the development of Chomsky's methodological ideas has to do with the concept of 'grammar-discovery procedures'. One of the influential, widely quoted passages of Syntactic Structures is §6, in which Chomsky attacked the notion that general linguistic theory should provide a means of mechanically deriving the grammar of a particular language from examples of that language. It is quite right not to impose such a demand on linguistics; in science more generally, there are no systematic
procedures for formulating good theories, only for assessing theories which have somehow been formulated (Popper 1968: 31-2). However, there were two problems with this principle as stated by Chomsky. The first was that, as Chomsky presented the tasks of general linguistic theory, in later books such as Aspects of the Theory of Syntax, those tasks did in fact include that of formulating discovery procedures. Pages 25 and 31 of Aspects suggest, as a desirable goal for a general linguistic theory, the goal of 'explanatory adequacy': meaning that, given an input of 'primary linguistic data', the theory succeeds in selecting, from an enumeration of the class of possible grammars, the particular grammar which an infant would in fact induce from those data. A theory which achieves this goal is a theory about 'discovery procedures', whether or not that term is used. Apparently Chomsky changed his mind about the virtues of discovery procedures between 1957 and 1965. The other problematic point about the Syntactic Structures passage was why Chomsky felt the attack on the discovery-procedure approach was worth making in the first place; contrary to what is often suggested, it is just not true that most of Chomsky's predecessors had advocated or practised this approach. (On this point, see J. Miller 1973.) One supposed that the passage had been included as a reaction to the ideas of Zellig Harris, since Chomsky's teacher was the one man who more than anyone else had publicly espoused the discovery-procedure approach (e.g. in his 1951 book Methods in Structural Linguistics]. However, it seems from the Introduction to LSLT that Chomsky's concern with discovery procedures stemmed from the fact that he himself had taken this approach for granted in his early work on linguistics: I was firmly committed to the belief that the procedural analysis of Harris's Methods and similar work should really provide complete and accurate grammars if properly refined and elaborated . . . (p. 29) During much of the period (1947-53) in which he was engaged on this task of refinement and elaboration, Chomsky was also working on the generative phonology of Hebrew in a Popperian, guess-and-test style very different from that of mechanical discovery procedures, but I ... assumed that whatever I was doing [in the work on Hebrew], it was not real scientific linguistics, but something else, obscure in status.... The work that I took more seriously was devoted to the problem of revising and extending procedures of analysis so as to overcome difficulties that arose when they were strictly applied. (p. 30) Chomsky went on to say that he became troubled by doubts about the discovery-procedure approach, and gave it up by 1953; and LSLT itself included a brief passage (on p. 79) prefiguring §6 of Syntactic Structures. But the reasons Chomsky gave in the 1979 Introduction for questioning
discovery procedures in the early 1950s were not really objections to the principle of discovery procedures at all. Rather, they were objections to a particular type of discovery procedures (what he called 'taxonomic' procedures) which make few initial assumptions about the nature of the language underlying the data, and consequently have to be limited to relatively unsophisticated operations on the data - 'segmentation and classification'. But Chomsky's attack on the discovery-procedure approach in §6 of Syntactic Structures is not qualified so as to refer to 'taxonomic' procedures only. Thus it turns out that what Chomsky actually wrote in Syntactic Structures is something which he perhaps did not believe even at the time of writing, and certainly did not believe either a few years earlier or a few years later. In the 1979 Introduction to LSLT (p. 30), Chomsky states as if it were unarguable that 'every child serves as an "existence proof"' of mechanical discovery procedures. He does not even consider the possibility that children acquire the grammar of their first language by a guess-and-test process as open-ended and unmechanical as that by which (according to Syntactic Structures) a scientist invents a novel theory.

8 Finite-state grammar
A third respect in which Syntactic Structures turns out to have been oddly unrepresentative of Chomsky's thought has to do with the finite-state model of language. In Syntactic Structures, three increasingly powerful theories of grammar are presented in succession: finite-state grammar, phrasestructure grammar, and transformational grammar. The logic seems to be as follows: an individual human being is physically finite, and therefore (if we may ignore the issue of continuous v. discrete physical variables) there is a sense in which it may seem virtually axiomatic that he must be an (enormously complex) finite-state automaton. Consequently, the most obvious initial hypothesis about the class of human grammars is that it is identical to the class of grammars defined by finite-state automata (which, when expressed in terms of rewrite rules, are called 'one-sided linear' grammars). It emerges that this class is too restricted to cope with phenomena common in various human languages; if humans are finite-state automata, they must be ones which simulate to a close approximation some more powerful kind of automaton. By dropping various conditions in the definition of one-sided linear grammars (such as the requirement that there be at most one nonterminal symbol to the right of the arrow in any rule), we produce various classes of grammars corresponding to particular classes of automaton. By exploring which of these classes of grammar come closest to matching the class of languages which humans are observed to speak, we obtain not merely a hypothesis about human language but also, by reference to the corresponding class of automaton, some sort of hint, at least, at a hypothesis about the structure of the human mind - which is something that would be of first-rank importance outside the parochial world of linguistics. The grammar/automaton relationship which Chomsky makes very
explicit when discussing his first model appears to provide a main motive for the whole enterprise of Syntactic Structures, and it excuses some of the shortcomings of the enterprise. Thus, if phrase-structure grammar is intended as a formalization of the analytic practice of Chomsky's predecessors, it seems unreasonably rigid - as we have seen earlier in this chapter, many of those predecessors were much more resourceful than that, and Chomsky's definition of phrase-structure grammar is a Procrustean distortion of their work. But only by cutting away all the many complications that occurred in the real grammars of the 1940s and 1950s can one arrive at a class of grammars simple enough to be equivalent to a known class of automata. The interest inherent in the demonstration that contemporary grammatical theories relate to a particular, specifiable automaton-class is sufficiently great to outweigh a considerable measure of looseness in the relationship (at least pending a more sophisticated approach to the question). After the early 1960s, Chomsky ceased to write about automata theory. He continued to argue that in principle language is a 'mirror of the mind', and that an examination of the bounds to the diversity of human languages will reveal the structure of our innate cognitive machinery - indeed, this is perhaps the leading idea in most of Chomsky's later writings. But in recent decades, this idea was expressed purely programmatically. Chomsky no longer put forward specific hypotheses about mental mechanisms; and, following Chomsky, most theoretical linguists gave up consideration of the grammar/automaton relationship. (The revival of interest in this issue since the 1980s in connexion with the problem of automatic parsing was a consequence of the wish to include natural language processing in the computer revolution; it owed little to Chomsky.) The surprising thing is that this relationship turns out not to have been considered in LSLT either. As Chomsky pointed out in his 1975 Introduction (p. 53, n. 75), the topic of finite-state grammar, like that of weak generative capacity, was an afterthought in his work. Finite-state grammar was not mentioned in the 1955 version of LSLT, and the material of 'Three models for the description of language' was a mere appendix (omitted in the published text) to chapter VII of the 1956 version.
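For readers who have not met the formal point at issue in this section, the notion of a 'one-sided linear' grammar is easily illustrated. The toy grammar below is my own invention, not a serious fragment of English: because every rule introduces one terminal followed by at most one nonterminal, each nonterminal behaves exactly like a state of the corresponding finite automaton, and recognizing a string needs no memory beyond the current state.

```python
# A one-sided (right-)linear grammar: nonterminal -> terminal (+ at most one nonterminal).
RULES = {
    "S": [("the", "N"), ("a", "N")],
    "N": [("dog", "V"), ("cat", "V")],
    "V": [("barks", None), ("sleeps", None)],   # None: the derivation stops here
}

def generates(state, words):
    """True if the grammar derives exactly this word sequence from the given nonterminal."""
    if state is None:
        return not words
    if not words:
        return False
    return any(words[0] == terminal and generates(next_state, words[1:])
               for terminal, next_state in RULES.get(state, []))

print(generates("S", "the dog barks".split()))   # True
print(generates("S", "barks the dog".split()))   # False
```

Dropping the restriction to a single nonterminal on the right (allowing, say, S → NP VP) is precisely the step that takes one from this class to phrase-structure grammar, and to a correspondingly more powerful class of automata.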
9 Conclusion

In all three respects discussed in the previous few pages, namely on the question of the empirical status of linguistics, the question of discovery procedures, and the question of the relationship between grammar and automata theory, the position that appears to be taken in Syntactic Structures is much superior to that of Chomsky's later publications. But, on the evidence of LSLT, it seems that Syntactic Structures did not faithfully represent Chomsky's position even when he wrote it in the 1950s. Chomsky was not, as I had supposed before reading LSLT, a man who did important work in his youth and then switched to a different and less fruitful tack in middle life. Apparently he always approached language in a way which (to the present
author, at least) is quite uncongenial; but, in boiling down his early magnum opus to turn it into a short, publishable first book, by what was presumably a remarkable chance he made himself seem to be taking a much more attractive line than the one he actually believed in. What was unattractive about Syntactic Structures, on the other hand, was the vagueness of Chomsky's positive claims about the theory of grammar which he advocated as the correct one. It seemed quite understandable that Chomsky could do no more than sketch his transformational theory in a book of that nature, and this was excusable in view of his repeated references to the fuller account in LSLT. But, when LSL Tdid eventually become available, it turned out not to answer the questions left open by Syntactic Structures. Surely it was a most fortunate thing for Transformational Grammar that in 1956 the Technology Press of MIT declined to publish LSLT, so that for many years the linguistic world encountered only the sketchy version of Chomsky's ideas found in Syntactic Structures. Chomsky might have established a reputation, with or without LSLT, on the basis of his philosophical writings. But, if the Technology Press had been more compliant, one wonders how many of us would be familiar today with the theory of Transformational Grammar. Notes 1 The name 'Affix Hopping' became so usual in later writings on Transformational Grammar that it would be pedantic to retain Chomsky's original name 'Auxiliary Transformation' in the following discussion. 2 LSLT seems to have proved fairly impenetrable even for some of those who did consult it. The first widely available book which purported to go beyond Syntactic Structures and the other published sources and to expound the rigorous version of Chomsky's theory was J. P. Kimball's Formal Theory of Grammar (1973). But Kimball's use (p. 47) of the key term 'elementary transformation' appears to contrast sharply with Chomsky's use - for Chomsky in LSLT (pp. 324, 404), a 'transformation' in the ordinary sense comprises one 'elementary transformation' together with things of other categories, but Kimball represents an ordinary transformation as including a list of elementary transformations. 3 In 1979 this document was published separately by Garland Press.
10
Evidence against the grammatical/ungrammatical distinction
1 'Starred sentences' are a novel category
A fundamental distinguishing feature of the generative approach to language criticized in this book is its belief in a contrast between 'grammatical' and 'ungrammatical' sequences of words. In the artificial formal languages studied by logicians and computer scientists - systems such as the propositional calculus, or Java - this distinction is crucial. Certain sequences of characters or words are 'well-formed' or 'valid', others are not; a grammatically ill-formed computer program will fail to compile, an ill-formed formula of a logical calculus will play no part in derivations of conclusions from premisses. In both logic and computer science, the valid sequences of a particular formal language are defined by a generative grammar which is typically not just finite in size but reasonably short. For human languages, this was not a usual perspective before the rise of generative linguistics. Traditional grammatical descriptions of English and other natural languages described various structural patterns and turns of phrase which do occur in the language in question, but they did not normally contrast these, overtly or implicitly, with another large class of word-sequences which never occur. Even if, on occasion, discussion of some particular construction was made clearer by giving a negative example of a mistaken attempt to realize the construction, that would usually only be intended to tell the reader 'This is not the right way to formulate this particular construction', not to tell him 'This is a sequence of words which has no use whatever in the language' - traditional grammarians were not interested in identifying sequences of the latter kind. So, for instance, Meiklejohn (1902: 103) illustrated the concept 'adjectival sentence' (what we would now call 'relative clause') with the example Darkness, which might be felt, fell upon the city, and he commented that the relative clause 'goes with the noun darkness, belongs to it, and cannot be separated from it': the last remark was plainly intended to mean that a reordering such as Darkness fell, which might be felt, upon the city cannot be used to express the tangibility of the darkness. It did not mean that the reordered sentence has no use in English at all (it has, but it means something different). Indeed, when traditional grammarians
described some turn of phrase as not part of the language, they did not usually mean that it was never used by native speakers; they meant that it was often used, but was frowned on as substandard. Nobody to my knowledge envisaged a human language as dividing the class of word-sequences over its vocabulary into well-formed and ill-formed subclasses, before Chomsky wrote, in the opening pages of his first published book, that:

The fundamental aim in the linguistic analysis of a language L is to separate the grammatical sequences which are the sentences of L from the ungrammatical sequences which are not sentences of L and to study the structure of the grammatical sequences. (Chomsky 1957: 13)
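For an artificial language of the kind mentioned at the start of this chapter, that aim is straightforwardly achievable: a short generative grammar really does partition the set of all symbol-sequences into the well-formed and the ill-formed. A minimal sketch, for a toy fragment of the propositional calculus of my own devising (fully parenthesized formulas over three atoms):

```python
# formula -> atom | ~ formula | ( formula op formula )
ATOMS, OPS = {"p", "q", "r"}, {"&", "v", "->"}

def formula(tokens):
    """Return the tokens left over if a formula can be read off the front, else None."""
    if not tokens:
        return None
    if tokens[0] in ATOMS:
        return tokens[1:]
    if tokens[0] == "~":
        return formula(tokens[1:])
    if tokens[0] == "(":
        rest = formula(tokens[1:])
        if rest and rest[0] in OPS:
            rest = formula(rest[1:])
            if rest and rest[0] == ")":
                return rest[1:]
    return None

def well_formed(string):
    return formula(string.split()) == []

print(well_formed("( p & ~ q )"))    # True  - a grammatical sequence
print(well_formed(") p & ( q"))      # False - an ungrammatical sequence
```

The question at issue in this chapter is whether anything comparable is true of a human language such as English.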
If natural languages resemble artificial formal languages in being defined by a limited number of generative rules (perhaps rather more rules, with more exceptions, in the case of natural languages, but still a finite system of rules for each language), then this specification of the aim of linguistics is reasonable. Each grammar rule of a language defines a specific construction that can occur in the language; any valid sentence will consist of some combination of valid constructions, and those word-sequences which instantiate valid combinations of valid constructions will be well-formed — all other word-sequences will be ill-formed. (Some ill-formed sequences might crop up from time to time in speakers' 'performance' in practice, as essentially erroneous deviations from their internal linguistic 'competence'.) Not all linguists found this a plausible way of thinking about human languages. People repeatedly pointed out how difficult it is in practice to construct sequences of English words which have no possible use in any context. For instance, F. W. Householder commented (1973: 371) that one needs to 'line up several articles and several prepositions in such examples [as I did for my example Of of the of in Chapter 8], otherwise they become intelligible as some kind of special style'. One might feel that human grammatical behaviour is more creative than the Chomsky quotation suggests. It is not that the English language (or any other language) presents us with a fixed, finite range of constructions which rigidly constrains our linguistic behaviour; rather, our speech and writing make heavy use of the best-known patterns of the language, but we are free to adapt these and go beyond them as we find it useful to do so, and there are no such things as word sequences which are absolutely 'ill-formed in English' — only sequences for which it is relatively difficult to think of a use, or for which no one happens yet to have created a use. 2 Some properties of a data set
Empirical corpus data allow us to examine this question quantitatively. In the 1980s I used an early version of the Lancaster—Leeds Treebank (cf. Chapter 3) to assess the idea of a clear-cut grammatical/ungrammatical
distinction in English. The treebank version in question comprised manual structural annotations of extracts totalling about forty thousand words (counting punctuation marks, and enclitics such as -n't, as separate 'words') drawn from most genres of the LOB Corpus.1 The annotations conformed to an early and simple version of the SUSANNE structural annotation scheme. Being a sample of authentic material, the database included very many word-sequences that were quite unlike the 'core' constructions commonly discussed by theorists of linguistic competence, though in context they pose no problems to an ordinary English-speaking reader. This database was smaller than some which various researchers, including myself, have subsequently published, so for the present chapter I might in principle have repeated the analysis to check whether similar results occur with one of these newer and larger resources. But although (as we shall see later) there has been disagreement, since I published the original version of this research in 1987, with my conclusions about the grammatical/ ungrammatical distinction, all the disputes have focused on the inferences to be drawn from the quantitative data patterns I reported. No one to my knowledge has urged that those patterns would look different if a larger sample were analysed. Therefore in the present chapter I have allowed myself to reuse the same quantitative findings already published in 1987, while updating the theoretical analysis to take account of subsequent discussion. It is worth making the point that I had no thought of carrying out the type of investigation discussed shortly at the time when our group was developing this database, so I was not subject to any unconscious pressure to strengthen the case argued here by indulging in inconsistent parsing practices. The Lancaster—Leeds Treebank was developed as a resource for use in an automatic natural-language analysis system; all the psychological pressures operated in the direction of making things easy for our automatic parser by assigning like structures wherever possible even to word-sequences which some linguists might prefer to see as structurally distinct. Furthermore, a number of other factors might be thought to bias the data towards more-than-normal uniformity (thus making my present thesis more difficult to sustain). The texts of the LOB Corpus are extracted from documents that had undergone the editorial disciplines of publication (rather than being, for instance, transcriptions of impromptu speech); any clear misprints were corrected when the LOB corpus was assembled; the omission of 2 out of 15 genre categories from the treebank used in this investigation, if it had any relevant effect, should presumably have made the remaining material slightly more homogeneous than it would otherwise have been. By far the commonest single grammatical category in the sample was the category 'noun phrase', so I chose this category in order to investigate the range of diversity in its realizations. I ignored noun phrases occurring in coordinate structures (whether as individual conjuncts or as co-ordinations), since phenomena such as co-ordination reduction are likely to create special complications among these cases; and I also omitted a small number (44) of
examples whose parsings seemed to me questionable and possibly inconsistent with material elsewhere in the database. There remained a total of 8,328 noun phrase tokens in the sample. For the purposes of this investigation, these were classified in terms of their immediate constituents (whether individual words, phrase or clause constituents, or punctuation marks), which were categorized in a manner even coarser than the classification discussed on pp. 27—8 earlier; the categorization used for the research of this chapter distinguished just 47 different constituent-types within noun phrases (14 classes of phrase and clause, 28 word-classes, and five classes of punctuation mark). I omitted almost all the fine subcategories of the SUSANNE scheme; for instance, all finite subordinate clauses were lumped together as a single category, with no distinction made between relative clauses, nominal clauses, and so on. Again, the decision to classify constituents in a very coarse fashion ought to reduce the diversity of noun-phrase types and is thus a conservative decision with respect to my present argument. For present purposes, two or more noun phrases were regarded as tokens of the same type if their respective immediate constituents (ICs) represent the same sequence of possibilities drawn from this 47-member set of constituent-types; an example of a noun phrase type, expressed in terms of our coarse categories, is 'determiner + plural noun + comma + finite clause'. The mean length of noun phrase tokens in the sample was 2.32 ICs, with a standard deviation of 0.94. (A phrase which is short in the sense that it contains few ICs may of course be long in terms of words, if one of its ICs is a complicated clause; length in words is not relevant to the present discussion.) Many noun phrases consisted of a single IC (e.g. pronouns); the longest noun phrases in the data contained ten ICs each (there are three types this long, each represented by a single token). The 8,328 noun phrase tokens were shared between 747 different noun phrase types, but the distribution was very unequal. At one extreme, the commonest type (namely 'determiner + singular noun') accounted for 1,135, or about 14 per cent, of all tokens. At the other extreme, there were 468 different noun phrase types which were each represented by a single token in the data. This last observation is one point which might seem to call into question the concept of defining a human language by means of a comprehensive generative grammar. In a reasonably large and representative sample of authentic English, analysed in terms of a very coarse scheme that recognizes far fewer grammatical distinctions than are likely to be relevant for any practical language-engineering system, it seems that almost two-thirds of all attested noun phrase types are represented by just a single instance; this must surely suggest that in practice it will often be a matter of chance whether or not a language-processing system based on a generative grammar recognizes constructions occurring in texts submitted to it. If so many constituent-types occur just once in a sizeable sample of language, it is reasonable to infer that there must be many other constituent-types, about equally common in the language as a whole, which happen not to occur in our sample.3
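The tabulation behind these figures is mechanically simple. The sketch below shows the kind of computation involved; the three noun phrases listed are invented stand-ins for the treebank data (which comprised 8,328 such IC sequences), and the category names are merely illustrative.

```python
from collections import Counter

# Each noun phrase token is reduced to the tuple of coarse categories of its
# immediate constituents (invented examples, not the real treebank data).
noun_phrases = [
    ("determiner", "singular noun"),
    ("determiner", "singular noun"),
    ("determiner", "plural noun", "comma", "finite clause"),
]

type_counts = Counter(noun_phrases)                 # token frequency of each NP type
tokens = sum(type_counts.values())
types = len(type_counts)
singletons = sum(1 for count in type_counts.values() if count == 1)
mean_length = sum(len(np) for np in noun_phrases) / tokens
print(tokens, types, singletons, round(mean_length, 2))

# The quantity examined in the next section: for each frequency level, the
# proportion of all tokens belonging to types at or below that frequency.
commonest = max(type_counts.values())
for freq in sorted(set(type_counts.values())):
    share = sum(count for count in type_counts.values() if count <= freq) / tokens
    print(freq / commonest, share)
```

Applied to the real data, this style of tabulation yields the figures just quoted (8,328 tokens, 747 types, 468 singletons, a mean length of 2.32 ICs), and the cumulative proportions in the final loop are the quantity plotted, on logarithmic scales, in Figure 10.1 below.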
Note that the problem is not that the number of distinct noun phrase types is very large. A generative grammar can define a large (indeed, infinitely large) number of alternative expansions for a symbol by means of a small number of rules. The problem is rather that the boundary between expansions of the category 'noun phrase' which are grammatical (and should therefore be permitted by a generative grammar) and expansions which are ungrammatical (and should therefore be forbidden by a grammar) is extremely difficult to determine, if a high proportion of grammatical expansions are very rare. 3 Should rare types be counted as deviant?
However, there are various ways in which the generative approach could be reconciled with what has been said so far. One might suppose that the many constituent-types which occur once each represent a sort of penumbra of randomly deviant cases which should be seen as qualitatively different from the core range of well-formed constructions, each of which would be represented by many more than one example. In that case it might perhaps be plausible to suppose that the most fruitful approach to automatic natural language processing would be via a generative grammar of'linguistic competence' to handle the core constructions, supplemented by ad hoc techniques for coping with the deviant cases (though, if 'deviance' is so common, a system would presumably have little practical value unless it did indeed somehow succeed in treating these cases as well as the core constructions — and it might be surprising to find that 'performance error' was an important factor in this treebank, which was based on published writing). Alternatively, one might hope that the occurrence of unique noun phrase tokens in our sample reflects the fact that this sample, although not insignificant in size, is not large enough to cover the core of linguistic competence adequately. It could be that enlarging the sample would eventually lead to a situation in which all grammatical noun phrase types were represented by enough examples to establish their well-formed status unambiguously, while few or no further unique types emerged to blur the grammatical/ ungrammatical borderline. In order to assess the plausibility of these views, I made a detailed examination of the incidence of different token-frequencies among the 747 noun phrase types in our sample; the 747 types are divided between 59 different frequencies, from 1,135 examples at the highest to one example at the lowest. An instructive way to represent the findings graphically is shown in Figure 10.1. Here the x axis represents the logarithm of the frequency of a noun phrase type in our sample, expressed as a fraction of the frequency of the commonest noun phrase type. Thus, a noun phrase type which is instantiated ten times in the sample (i.e. 10/1135 times as often as the type 'determiner + singular noun') corresponds to the location x = log( 10/1135) = —2.055. The y axis represents the logarithm of the fraction of noun
[Figure 10.1]
phrase tokens in the sample which belong to types of not more than a given frequency. Thus, consider the point (—2.055, —0.830). We have seen that the x value corresponds to the frequency 10; three different noun phrase types are each represented ten times in the sample, making 30 noun phrase tokens belonging to types with frequency 10, and when these are added to the 63 tokens representing types with frequency 9, the 56 tokens representing types with frequency 8, and so on, the total is 1,231 noun phrase tokens belonging to types with frequencies of 10 or less, that is 0.148 of the total sample population of 8,328 noun phrase tokens: the logarithm of 0.148 is -0.830. The fact that observations are spaced widely relative to the x axis at the left-hand side of Figure 10.1, and more densely near the centre, corresponds to the fact that frequency differences between the few commonest noun phrase types are naturally large (the three commonest types, namely determiner + singular noun, subject-marked pronoun, and non-subject-marked pronoun, are represented by 1,135, 745, and 611 tokens respectively), whereas the many less-common types exhibit small differences in frequency (for instance, various noun phrase types are represented respectively by 30, 28, 27, 26, 24, 23, 22, 21, and 20 tokens). On the other hand, the fact that observations are widely spaced relative to the x axis at the right-hand side of the figure, and that no observation falls to the right of x = —3.055, is an uninteresting consequence of the particular size of sample used: the rightmost three points represent the many noun phrase types which occur three times, twice, or once in the forty-thousand-word sample - clearly no type could be
represented by one-and-a-half tokens, or by a third of a token, though as proportions of the frequency of the commonest noun phrase type these would be perfectly meaningful values of x for a larger sample.

Figure 10.1 offers no evidence at all of a two-way partition of noun phrase types into a group of high-frequency, well-formed constructions and a group of unique or rare 'deviant' constructions; instead, noun phrase types in the sample appear to be scattered continuously across the frequency spectrum. The property of Figure 10.1 which came as a surprise to me, and which seems particularly significant, is that the observations come very close to a straight line, y = 0.4x. The concrete meaning of this equation may not be obvious. It might be expressed as follows: if m is the frequency of the commonest single noun phrase type (in our data, m was 1,135 instances in 39,969 words, or about 28 per thousand words), and f is the relative frequency of some type (fm is its absolute frequency), then the proportion of all construction-tokens which represent construction-types of relative frequency less than or equal to f is about f^0.4. As the fraction f declines towards zero, the fraction f^0.4 declines far more slowly. Although a rare type is by definition represented by fewer tokens in a sample than a common type, as we move to lower type-frequencies the number of types possessing those frequencies grows in a regular manner - so that the total proportion of tokens representing all 'rare' types remains significantly large, even when the threshold of 'rarity' is set at relatively extreme values. If we call a constituent-type rare when it occurs no more than once for every hundred occurrences of the commonest type, then about one token in six will represent some rare type. If the threshold is one per thousand occurrences of the commonest type, then about one token in sixteen will represent some rare type.

Consider what this mathematical relationship would mean, if it continued to hold true in much larger data sets than I was in a position to investigate. If the relationship held true in larger bodies of data, then:

one in every ten construction-tokens would represent a type which occurs at most once per 10,000 words
one in every hundred construction-tokens would represent a type which occurs at most once per 3.5 million words
one in every thousand construction-tokens would represent a type which occurs at most once per billion (i.e. thousand million) words

The number of distinct construction-types at these frequency levels would have to be very large indeed, in order to accommodate so many tokens between them. (Mathematically this is entirely possible; the number of different sequences of six or fewer elements drawn from a 47-member alphabet is more than 10,000,000,000, and we have seen that some noun phrase types
have well over six ICs.) A very large proportion of the construction-types that a generative grammar would be required to predict would be types that are individually so rare that their occurrence or non-occurrence in the language would in practice be quite impossible to monitor. One per cent of all construction-tokens is surely far too high a proportion of language material for linguistic science to wash its hands of as too unusual to consider. One might feel that one in a thousand is still too high a proportion for that treatment. Yet, if we have to search millions, or even thousands of millions, of words of text to find individual examples, we seem to be a long way away from the picture of natural-language grammars as limited sets of rules, like programming languages with a few irregularities and eccentric features added. Richard Sharman's depiction of languages as fractal objects would seem more apposite. A natural fractal object such as a coastline can never be fully described - one has to be willing to ignore detail below some cut-off. But the degree of detail which linguistics ought to be taking into account extends well beyond the range of structures described in standard grammars of English.

Of course it would be foolish to assume that the linearity of the relationship between the observed data can be reliably extrapolated beyond the observed frequency range. But my main point is that those researchers who believe in the possibility of comprehensive generative grammars appear to assume that they can rely on the linear relationship breaking down in due course; and the empirical evidence offers no support at all for this assumption. If it were correct that examination of some large but not absurdly astronomical sample of authentic data would reveal examples of noun phrase types comprising all but a truly negligible proportion of the noun phrase tokens occurring in real-life language, then the line of Figure 10.1 would need to become much steeper somewhere to the right of the frequency range displayed (and not very far to the right, presumably, if language-analysis systems are to be based on quantities of data which can realistically be examined with limited research resources). There is no tendency in Figure 10.1 for the line fitting the observations to grow steeper towards the right. On the contrary: the line of best fit to the full set of observations is slightly less steep than the line of best fit to the leftmost half of the points, though the difference is so small that I take it to lack significance. (The definitions of the x and y axes imply that the leftmost point must fall at (0, 0); the line passing through this origin which best fits the leftmost 29 points is y = 0.4165x, the line through the origin which best fits all 58 points is y = 0.4104x.) Only careful examination of a significantly larger data sample can tell us anything reliable about the distribution of noun phrase types at frequencies lower than those observed here; but the onus must surely be on those who believe in the possibility of natural-language analysis by means of comprehensive generative grammars to explain why they suppose that the shape of constituent type/token distribution curves will be markedly different from the shallow straight line suggested by our limited - but not insignificant — database.
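For readers who want to check the arithmetic, the following sketch reproduces the calculations behind Figure 10.1 and the extrapolations above, using only the figures quoted in the text. It assumes, as the text does for the sake of argument, that the fitted relation 'proportion of tokens = f^0.4' can be extrapolated beyond the observed frequency range, which is precisely the assumption under discussion.

    import math

    TOTAL_TOKENS = 8328    # noun phrase tokens in the sample
    TOTAL_WORDS = 39969    # words in the sample
    COMMONEST = 1135       # tokens of 'determiner + singular noun', the commonest type
    SLOPE = 0.4            # approximate slope of the line in Figure 10.1

    def coordinates(type_freq, cumulative_tokens):
        # Co-ordinates of one observation in Figure 10.1.
        x = math.log10(type_freq / COMMONEST)
        y = math.log10(cumulative_tokens / TOTAL_TOKENS)
        return x, y

    # The worked example from the text: 1,231 tokens belong to types of
    # frequency 10 or less, giving the point (-2.055, -0.830).
    print(coordinates(10, 1231))

    # Extrapolating 'proportion of tokens = f ** 0.4' beyond the sample:
    # for a target proportion p, the relative frequency is f = p ** (1 / 0.4),
    # and the absolute rate per word is f * (COMMONEST / TOTAL_WORDS).
    for p in (0.1, 0.01, 0.001):
        f = p ** (1 / SLOPE)
        words_per_occurrence = 1 / (f * COMMONEST / TOTAL_WORDS)
        print("1 token in", int(round(1 / p)),
              "-> type occurring about once per",
              format(round(words_per_occurrence), ","), "words")

Running this yields roughly 11,000, 3.5 million and 1.1 billion words respectively, which is where the figures in the list above come from.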
4 Ungrammaticality defended
Because the generative style of linguistics has been so influential in recent decades, some people nowadays seem to feel that the ability to speak a language must necessarily imply mastery of a finite set of grammatical rules. Even if data of the kind presented earlier seem to show some open-endedness in the range of constructions available in a language, this might be just a consequence of the fact that a corpus jumbles together the usage of many different individuals. In the mind of any one speaker, the suggestion would run, there must reside some definite and limited range of rules. (Chomsky (1962: 180~ 1) made this rather explicit; for him, 'the learnability of language forces us to assume that it is governed by some kinds of rules'.) However, it just is not true that this is the only rational way of thinking about how a person can know how to speak. One of the most interesting recent attempts at modelling individual human grammatical ability, the approach called 'data-oriented parsing' (e.g. Bod 1998), explicitly denies that grammatical rules have any relevance at all to the cognitive processes at work in the individual speaker. Data-oriented parsing is so new that the jury is still out on whether it will prove adequate to model the complexities of real-life grammatical behaviour, but the work done already is certainly enough to establish that linguistic ability without rules of language is a thinkable possibility. It is an empirical hypothesis, not a conceptual necessity, that an individual's linguistic knowledge might be representable as a finite set of grammar rules. Nevertheless, denying the reality of the grammatical/ungrammatical distinction undermines the validity of the majority approach to linguistics in a rather fundamental way, so it is not surprising that several linguists have defended that distinction against my arguments above, after these were published in their original version. Counter-arguments have been published by Edward Briscoe (writing with co-authors in Taylor, Grover and Briscoe 1989, and as sole author in Briscoe 1990), and by Christopher Culy (1998). All of these objectors claim in essence that my analysis underestimates the ability of generative, rule-based grammars to predict the kind of diversity of constructions and construction frequencies which I described as occurring in the Lancaster Leeds data. Briscoe and his co-authors argued that not merely might rule-based grammars of limited size in principle be adequate to deal with these data, but that a specific grammar and associated automatic parsing system which they were currently involved in developing, the 'ANLT' system, already came strikingly close to achieving this. Culy does not make that type of claim (and Culy does not accept that Briscoe et al. succeeded in making their case against me), but he argues that the statistical data summarized in Figure 10.1 are simply irrelevant to the debate about the grammatical/ungrammatical distinction, and he gives another reason to believe in the distinction. I discussed the Briscoe papers in detail in Sampson (1992), and at this late date it would not be appropriate to go again over all the problems I noted
with them. As noted by Culy (1998: 12 n. 7), Edward Briscoe himself appeared in a later article (Briscoe and Waegner 1992: 33) to retreat from his opposition to the point of view I am putting forward. I shall limit myself here to the issues that are general enough to have continuing relevance to present-day research.

Briscoe and his co-authors pointed out that the generative rules for English noun phrases contained in their grammar covered about 88 per cent of the noun phrase types, or about 97 per cent of the noun phrase tokens, in my data-set; their grammar achieved this using 54 distinct rules (often, as one would expect with a generative grammar, a particular type of noun phrase involved applying multiple rules, yielding a tree structure for the phrase with various intermediate nodes between the root node of the noun phrase and the words themselves). But, in the first place, eight of those rules were added to the grammar in order to adapt it to this data-set. Before these writers looked at the material discussed in my 1987 paper, the grammar contained 46 rules for noun phrases, and they added 8 more in order to be able to cover this material. Taylor et al. write that these extra rules 'express uncontroversial generalizations and represent "oversights" in the development of the grammar rather than ad hoc additions'; but the fact remains that, after a respected research team had evolved the best generative grammar for English that they could, confrontation with a smallish sample of edited (and therefore relatively well-behaved) real-life English prose immediately led to an increase in the size of the relevant rule-set by more than a sixth. Of course the new rules were uncontroversial: but there always will be uncontroversial additional rules needed to cure what look like 'oversights', whenever a finite generative grammar confronts real-life linguistic data.

Furthermore, even with the additional rules, the grammar still failed to cover 3 per cent (by tokens) or 12 per cent (by types) of the Lancaster—Leeds data. Taylor et al. repeatedly concede the truth of the well known linguistic aphorism 'all grammars leak' (that is, no explicit formal grammar is ever fully adequate to real-life usage), but they do not seem to appreciate how much this undercuts their case. Someone who truly believes in the generative grammar programme ought surely to hold that, while all grammars to date may leak, the day will come when a leak-free grammar is constructed (or else they ought to claim that the minority of constructions not covered by the best available grammar in some sense do not count — that they are 'performance deviations' rather than genuine products of linguistic competence). Briscoe and his co-authors do not suggest any reason why their grammar is entitled to fail on a minority of constructions found in a real-life sample of careful edited prose; and, if one limited sample throws up one range of constructions on which the grammar fails, we can be rather confident that other samples would reveal plenty more such constructions.

Part of Briscoe's case against my 1987 article seems to rest on a simple misunderstanding. Because the SUSANNE grammatical annotation scheme attributes relatively 'shallow' tree structures to sentences, with few (often, no)
intervening nodes between the root node of a noun phrase and the individual words of the phrase, Briscoe and his co-authors appear to think that I was assuming that a rule-based grammar would need something approaching a separate rule for each distinct sequence of word-classes that can occur as a noun phrase. They repeatedly stress the point that generative grammar formalism permits elegant sets of rewrite rules allowing a large range of distinct word-class sequences to be generated by repeated application of a far smaller set of rules, so that in the case of their ANLT grammar 54 rules suffice to generate 622 of the word-class sequences observed in the Lancaster—Leeds data (together, no doubt, with many — perhaps infinitely many — other hypothetical sequences that happen not to crop up in that data-set, but might crop up in other samples). A linguist who did not appreciate that a few generative rules can cover a large variety of word-sequences would be ignorant indeed. But one cannot design an elegant, parsimonious rule-set for a language unless one has a comprehensive knowledge of which sequences count as valid, grammatical constructions in the language. The point of the statistical analysis of Figure 10.1 was to suggest that very many constructions of English might only be found once in millions or billions of words, in which case we can never hope to assemble a comprehensive listing of the word-class sequences to be summarized by our grammar. If we had such a comprehensive list, perhaps it would display the kind of internal patterning that would enable it to be covered by an elegant generative grammar of relatively few rules, or perhaps it would not — we never could get such a list, so the question is an empty one. Culy argues against my thesis not by reference to a specific grammar, but, in the first instance, by arguing that the kind of statistical distribution displayed in Figure 10.1, with frequencies of observed constructions scattered continuously along a line of slope 0.4 on a log—log scale, is perfectly compatible with the existence of rigid grammatical rules excluding many wordclass sequences as ungrammatical. Culy establishes this point using simple artificial grammars, generating 'sentences' consisting of sequences of letters of the alphabet, not related to naturalistic language data. He proposes two probabilistic phrase-structure grammars (i.e. phrase-structure grammars in which alternative expansions of individual nonterminal symbols are assigned probabilities summing to one); one of these grammars allows every possible string of letters over the given alphabet as a grammatical sequence, the other excludes many sequences as ungrammatical. Figure 10.2 gives the distributions of probabilities of different word-sequences resulting from these two grammars; graph (a) is for the grammar which allows all sequences, graph (b) is for the grammar which has a concept of 'ungrammaticality'. Thanks to judicious choice of probabilities for the various grammar rules, both graphs show a slope close to 0.4. One objection which immediately comes to mind on looking at Culy's figures is perhaps less telling than it at first seems. My argument from the data of Figure 10.1, earlier, was based not merely on the fact that the datapoints fell near a straight line of a particular slope (on a log-log graph) but
also on the fact that they were scattered rather continuously, with no noticeable division between a group of relatively high-frequency 'competence' constructions and scattered low-frequency 'performance deviations'.

[Figure 10.2]

Culy's graph (b) (for the grammar which excludes some strings as ungrammatical) does display obvious gaps towards the upper right. However, in comparing Culy's figures with my Figure 10.1 it is important to note that Culy has
plotted his graphs with both axes running in the opposite directions to mine: for Culy, the commonest constructions are plotted at the upper right, and rarer constructions correspond to points lower and further left. This means that the obvious gaps in Culy's graph (b) represent unpopulated frequency intervals within the range of highest construction frequencies, so they are probably irrelevant to the debate between us: the gaps are just a consequence of the artificial simplicity of Culy's grammar. (The concept of one-off 'performance deviations' does not apply to artificial data such as Culy's.)

The real objection to Culy's argument is essentially the same as to Briscoe's. Undoubtedly it is possible to devise artificial probabilistic grammars which exclude certain sequences as ungrammatical but whose construction frequencies exhibit a log-linear distribution which is more or less continuous out to the lowest frequencies that it is practical to monitor. But, if real language data show such a distribution, what evidence can we have for positing a grammar of that sort, in preference to the simpler idea that there is no fixed distinction between grammatical and ungrammatical constructions?4 The concept of grammaticality being controlled by a fixed set of rigid generative rules is redundant, if the data observed in practice are equally compatible with an 'anything goes' approach to grammar.

Culy wants to cling to rigid grammaticality rules, because he believes that 'anything goes' is just not true of English. His final point against me consists of citing a construction, or class of constructions, which never occurs in a body of data larger than the one I used for the research discussed above, and which Culy thinks never could occur in English. His example is a noun phrase consisting of noun + article, or more generally consisting of any sequence containing an article following a noun, possibly with other material before and after. Undoubtedly, '[noun-phrase noun article]' is at best a very low-frequency construction in English; in terms of my Figure 10.1, it would probably correspond to a point so far out to the right of the displayed distribution that its frequency could not in practice be checked empirically. But to suggest that the construction is not just very unusual but actually impossible in English is merely a challenge to think of a plausible context for it, and in this case (as usual in such cases) it is not at all hard to meet the challenge. There would be nothing even slightly strange, in a discussion of foreign languages, in saying Norwegians put the article after the noun, in their language they say things like bread the is on table the — an utterance which contains two examples of Culy's 'impossible construction'. Talking about foreign languages is one valid use of the English language, among countless others.

5 No meaningful cut-off

Perhaps a defender of the ungrammaticality concept might argue that using English words to simulate the grammatical patterns of a different language is such a 'special' linguistic activity that it must be excluded from the class of uses to which linguistic 'competence' is relevant. But there are very many
uses of language which are more or less 'special', and it is not clear what we would be left with if we tried to eliminate all the special cases from our enquiries. The example I cited sounds unusual, because for English speakers the Norwegian language happens not to be culturally influential. On the other hand, for historical reasons the classical languages, and French, have at various periods been highly influential, and as a result modern English contains many constructions which are felt to be perfectly standard, but which came about only through using English words to simulate grammatical patterns of those other languages. If we produce language descriptions by ruling out as irrelevant every particular use of language that tends to yield unusual structural patterns, presumably we end up with a grammar that allows just the most obvious kinds of utterance - sentences such as John kissed Mary or The farmer killed the duckling, together with cases in which subordinate clauses are embedded in the most standard ways. Textbooks of generative grammar do tend to focus heavily on such examples; and if they are the total of what one counts as 'true English', then English might indeed be describable by a finite number of heavily used rules. If usage leading to uncommon constructions is systematically excluded, what is left will contain only common constructions. But this does not have much to do with languages as real-life systems of behaviour. Human languages are used for a vast range of purposes, some of which recur very frequently (say, describing physical interactions between medium-sized physical entities such as farmers and ducklings), some of which are a little less common (say, specifying numerical data — an area of usage which often leads to unusual constructions that generative linguists might be tempted to bar as falling outside competence grammar), and some of which are very much less common (say, describing the structure of relatively obscure foreign languages). It would be entirely arbitrary to impose a sharp cut-off somewhere within this continuum of higher-frequency and lower-frequency uses and to say that only grammatical structures associated with language uses more frequent than the cut-off count as 'true' examples of the language in question. Since there is no meaningful cut-off, there are no grounds for believing in a distinction between grammatical and ungrammatical word sequences.
Notes

1 The two LOB genres not included in the pre-final version of the treebank used for this research were G, 'belles lettres, biography, essays', and P, 'romance and love story'. The precise word count was 39,969.

2 I use 'database' in its general computer-science sense, i.e. an electronic collection of 'persistent' data (see e.g. Date 1994: 9). With the increasing use of relational-database management technology, some linguists (e.g. Montemagni 2000: 110) prefer to reserve 'database' for data collections formatted and controlled by systems of this type. Natural-language treebanks, including those discussed in this book, are not usually databases in that narrower sense.
3 The figures of the previous paragraph obviously relate closely to the discussion of Good–Turing frequency estimation in Chapter 7, but the reader will recall that one issue on which the Good–Turing technique offers no help is estimating the number of different 'unobserved species', which is what concerns us here.

4 Culy's graphs (a) and (b) in Figure 10.2 plot the frequency of 'sentences' generated by his artificial grammars, that is sequences of terminal symbols, rather than of 'constructions' in the sense of applications of individual rules. But in the context this is not an important distinction; if English were governed by a finite generative grammar, Briscoe et al. would very likely be right to argue that the individual constructions plotted in my Figure 10.1 were in many cases the outcome of several rule-applications per construction, rather than corresponding to single rules.
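To make note 4 more concrete, here is a minimal sketch of the kind of object Culy worked with: a probabilistic phrase-structure grammar over an alphabet of letters, in which each nonterminal's alternative expansions carry probabilities summing to one. The grammar, the probabilities and the sample size below are invented for illustration (they are not Culy's actual grammars), but sampling from such a grammar and tallying string-type frequencies is all that is needed to produce distributions like those in his Figure 10.2.

    import random
    from collections import Counter

    # A toy probabilistic phrase-structure grammar: for each nonterminal, a list
    # of (expansion, probability) pairs whose probabilities sum to one.
    # Lower-case symbols are terminals; the rules are invented for illustration.
    GRAMMAR = {
        "S": [(("a", "S"), 0.4), (("b", "X"), 0.3), (("c",), 0.3)],
        "X": [(("a",), 0.5), (("b", "S"), 0.5)],
    }

    def generate(symbol, rng):
        if symbol not in GRAMMAR:                 # terminal symbol
            return [symbol]
        expansions = [e for e, p in GRAMMAR[symbol]]
        weights = [p for e, p in GRAMMAR[symbol]]
        chosen = rng.choices(expansions, weights=weights)[0]
        result = []
        for s in chosen:
            result.extend(generate(s, rng))
        return result

    rng = random.Random(0)
    sample = Counter("".join(generate("S", rng)) for _ in range(10000))
    for string, freq in sample.most_common(5):
        print(string, freq)      # string-type frequencies, as plotted in Figure 10.2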
11
Meaning and the limits of science
1 Meaning in the generative tradition

Throughout this book I have been urging that linguistics should be treated as an empirical science, making falsifiable claims about observable realities. If a body of discourse purports to tell us interesting general truths about a domain of observable facts — as any variety of linguistics does — then it is not worth much unless it satisfies Popper's falsifiability criterion. But that is not to say that falsifiable, empirical science is the only kind of valid knowledge there is. Popper's criterion of falsifiability does not pretend to demarcate science from nonsense; it demarcates science from all kinds of non-scientific discourse, some of which is highly meaningful and important. We have seen that moral principles, for instance, can never be translated into testable predictions about observables - one cannot derive an 'ought' from an 'is' — but that does not make moral discussions empty or senseless. They concern issues which are deeply significant in human life. An advocate of the empirical scientific method cares merely that topics which can be treated empirically should be, and that the intellectual environment should not be polluted by quantities of discourse which purports to theorize about observables but does so in a non-falsifiable style. He does not claim that all topics can be treated empirically.

Traditionally, scholars distinguished the natural sciences, dealing with phenomena taken to be governed by fixed laws of nature, from the humanities, whose subject-matter was generated by conscious, sometimes imaginatively creative human minds and hence could not be reduced to fixed predictive laws. History, for instance, is a subject belonging to the humanities. Historians can find out and help us to understand what happened at different periods of the past, but one cannot expect study of an early period to offer more than a small amount of help towards understanding events at a much later period, because in the intervening time people will have developed new ideas about social and political life, and made new inventions which change economic and social relationships, and historical events at the later period will be conditioned by these new ideas and new ways of life. The laws that governed the motions of the planets a million years ago are the same laws which continue to govern them now, but a thorough understanding
of mediaeval English society would be little use when trying to understand a political event in today's England.

Language links sound waves in the air, and marks on paper, to human thought: it bridges the worlds of natural science and the humanities. So, even if some areas of linguistics can be made more scientific than has been usual recently, one might expect to find other areas which cannot be treated scientifically at all. The outstanding example, to my mind, is word meaning. Generative linguistics has treated the meanings of words with similar kinds of formal apparatus to those it has introduced for areas such as grammar and phonology. Because generative linguists have had little interest in questions of empirical falsifiability in general, they have not appreciated that this similarity of style glosses over a large contrast in scientific status between these domains. Formal theories about grammatical or phonological patterns in a natural language may at least potentially be testable predictions, but human lexical behaviour is such that analysis of word meaning cannot be part of empirical science.

The standard generative-linguistic approach to word meaning, often called 'componential analysis', was introduced in a 1963 article by Jerrold Katz and Jerry Fodor, 'The structure of a semantic theory' (J. J. Katz and Fodor 1963). Katz and Fodor's core idea was that the meaning of an ordinary word of a natural language (or a given one of its meanings, if it is an ambiguous word) is to be defined as a grouping of components, often called 'semantic markers', and commonly treated as binary features taking positive and negative values. (In later writings, semantic markers have conventionally been shown in small capitals within square brackets, to contrast their status, as theoretical entities, with the ordinary words they are used to define.) For instance, the most frequent sense of the English word bachelor might be defined as '[+HUMAN], [+MALE], [-MARRIED]'. The original 1963 paper drew a distinction between two sorts of semantic component: 'markers' which were relevant to the meanings of large numbers of words, and 'distinguishers' which were the residue of semantic content unique to one or a few words, after recurring marker properties had been subtracted. (For instance, Katz and Fodor originally treated 'who has never married' as a distinguisher rather than a marker within the meaning of bachelor, because while +HUMAN and +MALE are each part of the meaning of many English words, 'who has never married' is relevant only to bachelor and spinster.) Later publications (by these and other authors) fairly soon dropped the marker/distinguisher distinction, seeming to imply that what had been taken for distinguishers were really complexes of markers which had not been fully analysed. Generative linguists such as Steven Pinker, who lay emphasis on the claim that natural language structure is innate rather than learned, argue that the semantic markers correspond to vocabulary elements of a universal innate language of thought (what he calls 'Mentalese', Pinker 1994: ch. 3), implying that even if, in one particular language, some semantic marker is relevant only to a handful of words, that same marker is likely to recur in the meanings of words of many other languages.
This core idea, that word semantics is about defining the meanings of words as constellations of simple semantic components, has continued to be the standard linguistic approach since Katz and Fodor's 1963 article. It remains the main, or only, approach described in recent introductory textbooks, whether to linguistics in general (e.g. Fromkin and Rodman 1998: 158 ff.) or to semantics in particular (e.g. Saeed 1997: ch. 9). The electronic resource used in virtually all computational language-processing research that involves word meaning, namely 'WordNet' (Fellbaum 1998 - cf. Sampson 2000), is based on a concept of meaning structure very close to that of Katz and Fodor's 1963 paper. Some introductory books do draw attention to problematic points (Lyons (1995: 103) comments that componential analysis is 'no longer as widely supported by linguists as it was a decade or so ago, at least in ... its classical formulation'); but in context Lyons's comment, and similar remarks by other authors, read more like minor quibbles about detail than radical disagreements. There are actually two problems about componential analysis, one of which can be solved within the spirit of generative linguistics, but the other of which is more fundamental.

2 Markers or meaning postulates

The relatively solvable problem has to do with what linguists are claiming to achieve when they translate ordinary words into sets of 'markers'. In itself, replacing a sequence of ordinary words by a structure of theoretical entities called semantic markers does nothing to define the meaning of an utterance. What we are implying when we say that 'part of the meaning of bachelor is male (or [+MALE])' is that people who understand English can draw inferences involving these terms — for instance, from Hilary is a bachelor an English-speaker may infer Hilary is male. Semantics is not primarily about properties of individual words or sentences considered in isolation; it is about relationships of entailment between sentences (or, more precisely, between propositions expressed by sentences - but in what follows I shall keep things simple by ignoring that logical nicety). An entailment is a relationship between a set of sentences on the one hand, and an individual sentence on the other hand, such that if the former set (the premisses) are true then the latter sentence (the conclusion) must also be true. The case of Hilary is a bachelor and Hilary is male is a case where a conclusion follows from a one-member set of premisses. But other entailments involve sets of more than one premiss. For instance, the conclusion John has no daughter follows from the set of two premisses {Hilary is a bachelor, Hilary is John's only child}, though neither premiss on its own entails that conclusion. In the limiting case, some conclusions are entailed by the empty set of premisses. An English-speaker knows that Bachelors are male is true, without needing any premisses from which it follows. Such sentences are called analytic: an 'analytic' sentence is one whose truth is determined by the semantic structure of the language. The converse of analytic is synthetic: a 'synthetic'
sentence is one, like My car is white, whose truth-value is left open by the semantics of the language, and can only be established by investigating extra-linguistic reality.

Chomsky equated the task of specifying the grammar of a language L with the task of dividing the class of all possible finite sequences of words of L into two subclasses, the grammatical sentences of L and the sequences which are not grammatical. The linguist's apparatus of rules of grammar of various kinds was justified as a set of mechanisms which successfully discriminate good sentences of a natural language from word-salad. In similar vein, one might define the task of specifying the semantics of a natural language L as being to formalize a division between two subclasses of the class of all possible pairs (P, C), where P is some set of grammatical sentences of L, and C is some grammatical sentence of L. A semantic specification of L would distinguish the class of valid entailments, where C follows from the premiss-set P, from the class of invalid entailments, where C does not follow from P. This is an oversimplification (for one thing, it ignores relationships between utterances and the non-linguistic realities which they describe), but it identifies an important core part of what we mean by describing the semantics of a natural language - and this is broadly the aspect of the task which the system of semantic markers might help to achieve. (O'Grady, Dobrovolsky and Katamba 1996: 175, while giving componential analysis the usual prominent place in their introduction to semantics, note that meaning components in themselves do nothing to link words to extra-linguistic reality.)

Now, although Katz and Fodor's original paper did not put things that way, one virtue of their system might be that an efficient, elegant system for specifying the class of valid entailments in a natural language could be developed in terms of their system of markers. If bachelor translates into '[+HUMAN], [+MALE], [-MARRIED]', and male translates into '[+MALE]', then the system might be able to predict the validity of the entailment ({Hilary is a bachelor}, Hilary is male) by means of a general rule about cases where one set of markers is a subset of another set. In fact, though, this would soon run into difficulties. The semantic marker system works well for what logicians call 'one-place predicates': properties of individual entities, such as humanness or maleness. Many entailments, though, depend on words that refer to relationships between two or more entities. If Jeremy bought a car from Arthur is true, then it must be true that Arthur sold a car to Jeremy: buy and sell both stand for three-place predicates, relationships between a buyer, a seller, and a thing traded. In general it seems easier to see how semantic markers might be used to represent the meanings of adjectives and nouns, many of which stand for one-place predicates, than the meanings of verbs, which commonly stand for multiple-place predicates. To capture the buy/sell relationship, what we surely need is not a translation of either word into a constellation of features, but a rule something like 'From A buy B from C infer C sell B to A (and vice versa)'. Rules like this are
called meaning postulates. Once we introduce meaning postulates into the apparatus for defining word meanings, componential analysis becomes redundant. Rules such as 'From X is a bachelor infer X is human, X is male, and X has never married' do the same work as a translation of bachelor into a set of semantic markers, so there is no point in using both systems; and meaning postulates can represent many-place predicates while semantic markers can represent only one-place predicates, so it is better to describe natural-language semantics using meaning postulates rather than componential analysis. Some generative linguists appreciated this point from an early stage (see e.g. Bierwisch 1970), and by now the meaning-postulate approach to semantic description is widely accepted by linguists as an alternative, perhaps a preferable alternative, to componential analysis (see e.g. Lyons 1995: 126 ff., Saeed 1997: 292 ff.). If componential analysis is still by far the most visible approach in introductory textbooks, that is probably just because componential analysis is easier to explain briefly to beginners, rather than because linguists actively favour it over the meaning-postulate approach.
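The contrast between the two notations can be made concrete in a few lines of code. The sketch below is purely illustrative (the feature sets and the tuple representation of facts are invented conventions, not a claim about how a real semantic component should be built), but it shows why marker sets, which support entailment only through set inclusion over one-place predicates, cannot do the work of a meaning postulate that mentions several argument places at once.

    # Componential analysis: word senses as sets of semantic markers.
    MARKERS = {
        "bachelor": {"+HUMAN", "+MALE", "-MARRIED"},
        "male": {"+MALE"},
    }

    def entails_by_markers(word_a, word_b):
        # 'X is a word_a' entails 'X is a word_b' if b's markers are a subset of a's.
        return MARKERS[word_b] <= MARKERS[word_a]

    print(entails_by_markers("bachelor", "male"))        # True

    # Meaning postulates: inference rules over predicate-argument structures.
    # A fact is a tuple such as ("buy", buyer, goods, seller).
    def buy_sell_postulate(fact):
        # 'From A buy B from C infer C sell B to A'
        if fact[0] == "buy":
            _, buyer, goods, seller = fact
            return ("sell", seller, goods, buyer)
        return None

    print(buy_sell_postulate(("buy", "Jeremy", "a car", "Arthur")))
    # -> ('sell', 'Arthur', 'a car', 'Jeremy')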
3 The fluidity of meanings
The precise nature of the notational apparatus for defining word meanings, though, is only the first and more superficial of the two problems I flagged as arising with the generative-linguistic approach to semantics. A much larger problem is linguists' assumption that words have definite meanings capable of being captured in symbols of any formal notation - whether semantic markers, meaning postulates, or something else. Word meanings are the area of language which relates most closely to the imaginative, inventive aspects of human nature that cause the humanities to be different in style from the natural sciences. Grammar is a neutral framework for all kinds of thought. The grammar of mediaeval English, as it happens, was not quite the same as modern English grammar, but it surely could have been the same: there is no obvious way in which mediaeval thought would have been incompatible with the use of modern English syntax, and anything we want to say or write nowadays could have been couched in Middle English grammar. But the meanings of our words have in many cases changed massively in response to changes in prevailing ideas about society and the natural world. It would be quite hopeless to try to conduct a modern discussion about almost any topic while using words with the meanings — the entailment relations — which they had in the fourteenth century: if the words were understood with those meanings, we would not be saying the things which a modern person wants to say. This coupling between word meanings and innovative human thought means that word meanings have an unpredictability that, arguably, makes them incapable of being brought within the purview of empirical scientific
theorizing. Fluidity of word sense is not merely a phenomenon arising from changes over centuries in the use of words relating to lofty subjects: it applies to all words all the time. In consequence, trying to produce a rigorous, scientific account of the semantics of a human language may be a task as futile as chasing a rainbow. Consider a typical question of the kind which would have to be given some definite answer in a semantic description of English: the question whether possession of a handle is a criterial feature for the application of the word cup to an object. Cup is one of a group of words, including mug, tumbler, beaker, vase, and others, whose meanings are similar but not identical; prima facie it seems plausible but not certain that possession of a handle might be a factor relevant for the use of the word cup as opposed to some of the other words. A generative linguist would typically phrase the question as: 'Should the lexical entry for cup include a feature [+HANDLED]?'. Given my earlier comments about meaning postulates, it might be more sensible to reword this as: 'Should a semantic description of English contain a rule "From X is a cup infer X has a handle"?'. However, as questions about the English language, I suggest that neither of these is meaningful. At best they could be construed as questions about the state of mind of an individual English-speaker on a particular day; but it is not at all clear that the questions will necessarily have answers (let alone discoverable answers) even when understood in this way, and certainly an answer to such a question could not count as a scientific prediction - this is just the kind of issue on which a thinker is liable to change his mind at any time, which means that no future behaviour by an individual could be incompatible with a positive or negative answer to the question with respect to that individual at a given moment. How, after all, does an individual acquire a word - cup, or any other? It is fairly unusual to learn the meaning of a new word by being given an explicit definition; and even when this does happen, unless the word is a technical term the definition is unlikely to exert a decisive influence on how one subsequently understands and uses the word. (If theoretical linguists admit that they are very far as yet from being able to specify the correct analysis of most elements of our vocabulary into markers, it is hardly likely that the average parent will manage a masterpiece of semantic analysis on the spur of the moment when asked by his child for the meaning of a word; yet the child will grow up using the word as competently as anyone else, unless it is unusual enough for the child to have no opportunity to correct his understanding of it.) Indeed, even in the case of technical words such as atom, though society arranges things in such a way that they receive standard definitions to which a considerable measure of authority is ascribed (this is what we mean by calling them 'technical'), nevertheless there is a limit to the influence of that authority. Were there not, it would have been impossible for scientists ever to have discovered that atoms are composed of particles, since at one time (as the etymology suggests) the stipulative definition of atom implied that atoms have no parts. Had the meaning of the word atom been determined by
the stipulative definition, the discovery that things previously called 'atoms' were composite would have led people to cease applying the word atom to them; what happened instead was that the word remained attached to the same things, and the new knowledge about those things changed the meaning of the word. In general we do not learn new words of our mother tongue through definitions; we learn them by hearing them applied to particular examples. We encounter a noun applied in practice to certain physical objects, say, and we guess as best we can how to extend the noun to other objects. The child hears its parent call certain things which resemble one another in some ways but differ in other ways cups; he may notice that one property shared by all of them is the possession of a handle, and if so he may conjecture that cups have handles as a matter of course. But any particular object has an endless variety of properties; there can be no guarantee that the resemblances which the child notices coincide with the features which led the parent to apply the word cup to the various objects in question. (The child may of course be aided in his conjectures about the criterial features for applying a given word by working out more general rules about the kinds of feature relevant for applying whole classes of words, for instance he may notice that colour is very rarely relevant for applying nouns, while function is frequently crucial so that, if a cup is used for drinking, the chances are that a similar object used to put flowers in will not be called a cup. But these more general rules are themselves creative and fallible conjectures.) Our understanding of word meanings depends on the same interplay between conjectures and experience from which, according to Karl Popper, scientific knowledge emerges, and consequently it shares the provisional status which Popper ascribes to scientific knowledge. Some linguists seem to accept that this is how we learn words, but to suggest that it is a fact only about the process of language-acquisition rather than about the nature of the language acquired. They suggest that the child has to guess his way towards the correct set of criterial features which control the adult's usage. But we are all in the same boat; we all spend our time guessing what sets of criterial features would explain the application of given words to given things in the speech we hear around us (and in the writing we read), while trying to conform our own usage to our conjectural reconstructions of each other's criteria in order to be understood. It is not clear what it could mean in such a situation to talk of a standard of correctness which a given speaker either has or has not achieved with respect to his use of a given word. That notion suggests that learning to use words correctly is like discovering the right combination to open a combination lock, which is a problem with an objective solution independent of people's opinions. It would be more appropriate to compare the acquisition of a word meaning to the problem of dressing fashionably in a very clothes-conscious society. In this case it would seem absurd to suggest that there was an objectively 'right' costume to wear; rather, what is right is what people think right.
4 The fashion analogy
It is worth pursuing this analogy a little way. In the dress case, one solution which may be perfect if it is practicable is to dress exactly like someone recognized to be fashionable; I believe there have been societies in which fashion has led to rigid uniformity in this way. The counterpart of this principle in the case of language would be that one cannot go wrong in applying a word to the very same individual thing to which it has been applied by people who already know the word. But in the case of language one cannot restrict oneself to doing this. One constantly needs to refer to things (whether physical objects, actions, properties, or any other category of referent) about which, as individual instances, one has not heard anyone else talk. We have to extend terms that we have encountered used of other individual entities to the individual entity that confronts us now - an entity which, perhaps, no one has ever mentioned before. The thing we want to refer to now will in various respects resemble various other individual things to which we have heard people give names, but it will not be identical to any of them; so we have to make a guess (an 'educated' guess, of course, not just a random stab in the dark) that entity Alpha, which we have heard referred to by words X, Y, and Z, was called Y by virtue of possessing the very properties which Alpha shares with the entity we want to talk about now - so that it is legitimate for us to call this thing a Y too.

The counterpart of this situation in the case of dress would be a situation in which it is normally impossible to copy someone else's costume exactly because garments are not made up to customers' specifications, rather there exist a large and diverse range of individual garments and accessories lying around for anyone to take and wear, each of which is unique in its details. In such a situation one would have to guess what properties of the costume of some dandy — let us call him Dandy Lion — lead to his being admired; given any limited set of properties one could assemble a costume of one's own sharing those properties, but duplication of the admired costume in all respects would not be possible. (The situation in our own fashionable society is actually not unlike this, not because exact duplication is impossible but because it is not admired — a point to which there is no analogy in the case of language.) One would ask oneself (consciously or unconsciously) questions such as 'Is it the cut of Dandy Lion's clothes that matters, or the colour, or the texture? If the colour plays a role, is it the general hue which counts, or the degree of contrast between the various elements, or what? Is the vaguely Oriental appearance of his costume relevant, so that I could get high marks by wearing something quite different but also Oriental in flavour? Perhaps it is only on tall men that three-piece suits are admired?' — and so on.

The counterpart of a generative linguist would approach the dress situation with the assumption that the range of possible questions of this sort that a would-be dandy could ask himself form a finite, specifiable set, and that underlying the dress behaviour of the fashionable crowd are definite answers to these questions. Newcomers to the community know the
questions when they arrive, work out the answers by observation, and then dress fashionably, and that is all there is to it (except that occasionally the fashion changes in some definite respect — i.e. a word undergoes a well defined 'change of meaning' by dropping one feature or acquiring another). But in the dress case, surely, nobody would entertain this account for a moment? It seems quite obvious that there is no determinate set of possible questions to be asked about what makes fashionable costume fashionable, and that there are no authoritative answers to the questions one thinks of asking. Dandy Lion may be better at dressing fashionably than a newcomer to society, but he has no rules that guarantee fashionability. Everybody has to make up new ensembles day by day (i.e. mature speakers, as well as learners, have to extend words to items not exactly like things they have already heard words used for); Dandy Lion keeps as sharp an eye on the rest of the smart set as they do on him, since they are all equally at risk of being seen to have missed the trend. In such a world, standards of fashionability would gradually drift in various directions over time, but not in identifiable, discrete steps; and at any given time one could give only vague, sketchy indications of what currently counted as fashionable - the concept of a rigorous, predictive scientific theory of contemporary fashion would be inapplicable. (Many readers will very reasonably feel that a society as narcissistic as the one I describe would be too contemptible to be worth thinking about, and they may take me to be suggesting that there is something frivolous about our semantic behaviour. But this would be to press the analogy far beyond the limits of its applicability. The dandies I describe seek to imitate each other's dress behaviour because they want to be admired, but real humans seek to imitate each other's semantic behaviour because they want to be understood, and there is nothing frivolous or untoward in that motive.) If the reader agrees that there is no 'right answer' to a question about what is fashionable in the case of dress, but yet feels that the notion of scientific semantic description is an appropriate goal for the linguist, he must explain what relevant difference between the two domains has been suppressed in the above discussion. I believe the analogy is a fair one. I know no evidence in favour of the idea that words of ordinary language have meanings that are more well defined than the fashion analogy suggests. Consider for instance the work of Jeremy Anglin, a psychologist who was heavily influenced by the Katz/Fodor approach and tried to find experimental evidence for componential analysis. Anglin (1970) tells us that he began work on his doctoral thesis with, among other assumptions, the 'preconceptions' that the meanings of words were (at least to a close approximation) determinate sets of semantic features, and he used various experimental techniques to try to check that the feature composition of certain English words in the speech of children and adults had the particular structure he expected. When the design of Anglin's experiments made it possible for them to do so, however, his subjects (both children and adults) persisted in behaving linguistically in ways which refuted successively weaker hypotheses about the observable consequence of any feature analysis,
though their behaviour was recognizably sensible. Anglin concludes by asking 'how helpful has it been to view semantics in terms of features ...?', and answers 'The notion of a feature ... has taken a beating. In retrospect it required considerable innocence to attempt to specify' the semantic features relevant to Anglin's word-set.

Adult subjects appear to be able to generate a myriad of equivalence relations [i.e. common features] which for them make two words similar. A boy and a horse may both be animals, beings, objects, and entities, but they also both eat, walk, and run, they both have legs, heads, and hair, they both are warm-blooded, active, and social. (Anglin 1970: 94)
Anglin here discusses the problem of identifying shared semantic features common to a pair of words; it is all the less likely that one could specify a determinate set of semantic properties for any individual word, when subjects can imagine a 'myriad' of properties possessed by the entities to which they know the word to apply. Fashion — to return briefly to my analogy — can be described, but only in the anecdotal discourse style of the humanities rather than by the rigorous techniques of science. One can make impressionistic statements about current fashion, and write histories of the vicissitudes of fashion up to the present, but any attempt to be more exact about the present state of play or to make predictions about future developments would represent a misunderstanding of the nature of the subject. Similarly, in the case of language one can give approximate, nondefinitive accounts of the current meanings of words (which is what most dictionaries do), and describe the developments of word meanings during the past (which is the additional goal of a dictionary such as the Oxford English Dictionary), and for either of these tasks ordinary English is as adequate a medium as any. When generative linguists offer to formalize the meanings of words in a quasi-mathematical notation, they are claiming (at least implicitly, but usually explicitly) to be able to define word meanings more rigorously than is possible in the ordinary English of dictionary entries, and to make testable predictions about utterances of words other than the particular utterances they have observed. To suppose that these goals are achievable even in principle is to misunderstand the nature of language.

5 Not statistical but indeterminate
Notice that the point I am making is not answered by the suggestion that the 'semantic interpretations' of words should be probabilistic rather than absolute. I chose the word cup as my example because it was used in an interesting series of experiments by William Labov (1973a), who set out to attack the generative dogma that words have essences - criterial features necessary and sufficient for the word to be applicable. Labov elicited from subjects names for a variety of containers differing in respect of a number of variables,
some of which were discrete (e.g. having or not having a handle) and others continuous (e.g. how wide they were for their height). The orthodox generative linguist should predict that some of the features will be crucial and the others wholly irrelevant to the question whether a given container is called a cup rather than anything else; and that, if a continuous variable such as width-to-height ratio is relevant at all, what will count will be whether or not some more or less sharp threshold value is exceeded. Labov found instead that the different variables contribute various increments of probability that an object will be called a cup, and increasing values on a continuous variable contribute smoothly increasing probability increments with no threshold value. One might conclude from such findings that all that is wrong with the picture of a semantic description as a set of inference rules linking the words of the language is that the rules need to be equipped with statistics. Rather than a rule 'From X is a cup infer X has a handle', say (which makes a handle part of the essence of cup, so that something without a handle just cannot be a cup no matter how cup-like it may be in other respects), we should write, perhaps, 'From X is a cup infer with n per cent reliability X has a handle', filling in some value of n. The mathematics might need to be a little more sophisticated than this in order to explain how values of different variables interact to produce a decision about naming, and how the contributions of continuous variables are factored in, but this would be a purely mathematical problem (about which Labov makes some detailed suggestions) — it would leave unaffected the principle that natural-language semantics can be rigorously defined.2

From the point of view advocated here, however, a Labovian semantic description - although it can perhaps be welcomed as a step towards a better understanding of semantics — embodies the same fallacy as that underlying a non-statistical semantic description of the kind envisaged by Katz and Fodor. The variables differentiating Labov's containers were supplied by the experimenter. Labov decided that possession of a handle, width-to-height ratio, use for liquid rather than solid food, and half a dozen other characteristics might be relevant to the meaning of cup, and he constructed his test objects accordingly. In other words, his experiments did not test whether the range of properties of an object relevant for calling it a cup are determinate; the experiments tested only the manner in which a certain few properties that are relevant influence the application of the word. What I am arguing is that the set of properties that speakers may use to justify the application of a word is not determinate, because objects do not come to our attention labelled with a fixed set of properties and our minds do not impose a fixed set of categories on our experience. It was not for nothing that Anglin, in the passage quoted above, wrote of his subjects 'generating' (rather than 'noticing') a 'myriad' common properties for pairs of words; we create ways of seeing things as similar. The coining of the word picturesque, for instance, institutionalized a way of seeing-as-similar two sights as physically dissimilar as, say, a romantic garden and a Mediterranean fish-market. It
seems absurd to suggest that picturesque might abbreviate some compound of simple ideas (in the empiricist philosopher John Locke's sense — that is, direct features of sense-data). Thus, one of the features of a garden that may lead us to call it 'picturesque' is that it is 'romantic' rather than 'classical'; romantic seems no closer to the immediate properties of sense-data than picturesque itself is. And it would surely be a gratuitous assumption to hold that picturesque corresponds to a category innately implicit in the human mind: the concept is a quite recent product of our culture, lacking near equivalents in languages of other cultures or, as far as I know, in the English of Chaucer or Shakespeare. We are surely bound to admit that the concept 'picturesque' is an original creation by an individual which has succeeded in becoming institutionalized in our society. It is always possible, of course, to keep saying of each apparently novel concept thrown up by an individual or a culture that it must have existed implicitly all along; but if this becomes merely a mechanical assertion of dogma, devoid of any testable implications about future conceptual innovation, what value has it? I do not suggest that in seeking to master the use of the word cup an English speaker will typically invent properties as novel as the property 'picturesque' was when it was invented. Commonly, perhaps, one distinguishes in the objects one hears called cups properties which can be spelled out in terms of words one has already mastered - 'having a handle', for instance. Given one's previous vocabulary, the set of properties which can be defined in its terms is in principle specifiable, though inconceivably diverse. But examples such as picturesque establish the point that we do, at least sometimes, invent properties that cannot be spelled out in terms of previous vocabulary, and indeed such invention is probably relevant for cup: no doubt for many speakers, as for Labov's subjects, width-to-height ratio is relevant (too wide a container is a bowl rather than a cup, too tall is a mug or a vase), but how many people have an adjective for the particular width-to-height ratio of a 'good' cup? Furthermore, although many of the properties we pick out as possibly relevant for the use of a word are definable in terms of our previous vocabulary, very few of them will be reducible in this way to ideas that it is plausible to regard as innate. One obvious property of many cups is that they have a handle, but the concept 'handle' is no more a mere compound of sense-qualities than 'cup' is. There is no escaping the creativity of semantic behaviour. 6 The example of'gay'
Cup is not in fact the best of examples for the case I am arguing, for reasons which have to do with the fact that cups are artefacts. Manufacture, and modern mass-production even more, lead to the existence of ranges of objects which appear to lend support to the 'essentialist' view of semantics because they are designed by speakers of the language. Thus, suppose it is commonly held that from X is a cup one can reliably infer X has a handle but not X has a saucer, if a few pottery manufacturers believe this, it may soon
happen that everything sold in a package marked 'CUP' does indeed have a handle, though some come with saucers and others without. Massproduction creates a series of object-lessons in the theory of absolute semantic essences, and it may be no accident that that theory has appealed to late-twentieth-century linguists who in most cases inhabit highly urban environments in which they are surrounded by human manufactures. It is possible that words for manufactured objects may be relatively stable in meaning for this reason. Let me give a more favourable example from the less mass-produced area of human emotions and relationships, namely the recent development in meaning of the word gay. This is of course a word which has modified its meaning so greatly in a short period as to have become notorious; that makes it an unusually clear example for expository purposes, but I suggest that the process exemplified by gay is typical qualitatively though not quantitatively of the life of much of our vocabulary. In my youth, gay (applied to a person or a social environment) meant, to me, something like 'happy or enjoyable in a witty and carefree way'. (I deliberately give a personal impression of the meaning rather than quoting an authoritative dictionary, because part of the point I am making is that shifts in meaning occur through the workings of individual, nonauthoritative minds.) By some time in the 1970s I had become used to the fact that it had come to mean something like 'homosexual, but with connotations of allegiance to a style of life deemed to be as honourable as others rather than in a clinical sense'; since people are anxious not to commit unintended sexual double-entendres, the previous sense of gay became more or less extinct. It is easy to see how such a transition was possible, given that the hearer of a word can only guess what properties of a referent evoked that word from a speaker. If people are carefree, it is often because they are irreverent, disrespectful of authority. While homosexual acts between males were illegal, groups of homosexuals who decided not to be ashamed of their inclinations were necessarily disrespectful towards the authority of the law (and although the law eventually changed, public opinion was another and more conservative kind of authority). Homosexual society is also supposed to be unusually 'bitchy', and witty talk is often malicious although the two ideas are quite distinct. One can well see how a speaker might have referred to a 'gay crowd' thinking of them as carefree and amusing, and a hearer have taken him to refer to the characteristics of irreverence and verbal malice which the group in question also displayed; when the hearer in turn became a speaker he would then refer to other groups as 'gay' who were irreverent without necessarily being carefree, and so the semantic drift would proceed.3 Indeed, it is not necessary to restrict ourselves to the hypothetical; I can quote a real example of the kind of usage that could have encouraged the semantic drift gay has undergone. On p. 57 of Angus Wilson's 1952 novel Hemlock and After, Elizabeth Sands confronts her father Bernard with her discovery of his homosexuality; she has met Sherman Winter, one of his
homosexual associates, who, affecting not to know her relationship with Bernard, '. . . treated me, well, like any other queen's woman, dear.' She at last got out one of the terms ['which she had so determined not to evade'], but it really made her seem more gauche than her straight embarrassment. 'He remarked on your absence from the gay scene ...'
Anyone who had not already got a fixed view about the precise meaning of gay might well take this passage to suggest at least that it had connotations of social and/or sexual unorthodoxy, and if his previous experience of the word did not rule out such an interpretation he might even take gay here to mean 'homosexual', as it does nowadays. This might then be reinforced by the passage on p. 193 of the same novel where Wilson refers to the gaiety of Eric Craddock, an adolescent who is being drawn into the world of homosexuality. In fact the interpretation of gay as 'homosexual' is wrong here; in a letter dated 3 October 1978 Wilson told me that when writing Hemlock and After he had never heard gay used for 'homosexual', and he explained his usage by saying that 'people like Sherman (but not necessarily homosexuals) aspired to being thought always on the crest of a wave'. Indeed, on p. 194 of Hemlock and After there is a use of gay which is incompatible with the modern equation gay = 'homosexual', since it refers to the relationship between Eric and his mother; but this relationship is sufficiently emotionally charged to make a slight hint of sexual unorthodoxy in our understanding of gay not altogether inappropriate even here. The reader of Hemlock and After could easily be led to add an implication of vague sexual unorthodoxy to his understanding of gay whether or not such an implication existed for the novelist. Furthermore, it seems unlikely to have been pure coincidence that Wilson repeatedly used a word which has subsequently acquired an explicitly sexual sense in contexts of sexual unorthodoxy; though Wilson thought of the word as referring to a quality of emotion or behaviour not peculiar to homosexuals, it is permissible to suppose that the semantic drift which has by now run its full course had already begun for Wilson, so that the emotions or behaviour which the word implied for him were ones more commonly displayed or more commonly aspired to by homosexuals than by the population at large.

Such a view is unavailable to the generative linguist, however. For him, a change of meaning is a sharp transition corresponding to the acquisition or dropping of a particular semantic 'marker', so that Wilson's use of gay must be dismissed as mere coincidence and his novel cannot be seen as relevant to the recent development in meaning of the word. Yet the ambiguities of interpretation I have discussed in connexion with gay are the very stuff of real-life discourse. Even the single short passage already quoted from Hemlock and After offers another good example, in its use of straight. Straight often means 'ordinary, basic, without special admixture', and it is also used nowadays to mean 'heterosexual' — whether as a
development from the usage just given, that is 'ordinary in sexual inclinations', or from straight as 'morally upright', given widespread assumptions about the immorality of homosexuality. When Wilson writes about Elizabeth Sands's 'straight embarrassment' this can perfectly well be read either as 'basic embarrassment deriving from the overall situation, as opposed to the special awkwardness of talking about "queens" in those circumstances', or as 'embarrassment of a heterosexual confronted with homosexuality'. In this case there is no objection to taking the word in either sense, or even to supposing that Wilson intended it in both senses at once, in the way described by William Empson (1930) as characteristic of poetic language. But, however the word was intended, the fact that one interpretation could plausibly be imposed on the word by a reader when the writer had intended the other illustrates again the notion that usage does not rigorously determine meaning. There remains a question why one particular development of meaning becomes widespread, given that on my account a particular word will presumably be developing in unrelated ways in different individuals' conversations. The question is not unanswerable; in the case of gay, perhaps the new sense caught on because it answered a need on the part of homosexuals recently emancipated from legal discrimination for a non-pejorative means of self-description. Whether that is right or not, surely it is incredible that a semantic drift of the sort exemplified by gay could have occurred in discrete steps, at each of which the current implications of the word were clear-cut? A human social group is a very obvious case of an entity in which an imaginative onlooker can see an endless diversity of characteristics. Now that the new sense of gay has led to incorporation of the word into the title of formal institutions such as the Gay Liberation Movement, it may be that the sense has become relatively rigid — such an institution is likely to equip itself with articles of association defining the meaning of gay, which turns the word into a technical term like atom. But during the transitional period the implications of a given use of the word must surely have been very fluid, indeterminate, unpinnable-down? 7 Meaning in a changing world
So far I have argued that the meanings of words must be indeterminate because individuals have to learn which aspects of their world correlate with use of a word, and the possible hypotheses are endless while the data are limited. There is a further factor: the 'world' to which individuals have to try to relate the words they encounter is itself changing in unpredictable ways, both subjectively (i.e. the individual is learning more about it) and objectively (natural conditions change, and human life is altered as a result of innovatory thinking by other individuals, e.g. technologists or politicians). This means that the semantic fluidity or indeterminacy of language is a very good thing. There are occasional changes in our world which are so clear-cut and public that we can react to them by clear-cut, public changes in our language, specifically by the creation of new words. A quite novel
form of transport is invented, and a new word hovercraft is coined to go with it. But the great majority of the changes in an individual's understanding of the world will be subtler and more gradual than this, and furthermore they will not in general run in parallel with the learning of other individuals. Much of what we learn about the world is learned from our own experience rather than from school lessons, and different individuals often reach different conclusions. Such developments in people's structures of belief cannot possibly be reflected by public changes in the language such as innovation in vocabulary; even if we neglected the problem of disparities between different individuals' learning, the gradual and unceasing character of many of the developments in our world would require a state of permanent revolution in our lexicon, and people simply could not learn all the new words. So, given that the words themselves must remain broadly constant with only a limited number of additions and subtractions to reflect the grosser, more sudden and public modifications in the pattern of our lives, then if our vocabulary were not semantically fluid either the language would rapidly become unusable in face of the many subtler and more private developments in our beliefs, or those developments would be prevented from occurring. Indeterminacy of word meaning is a necessary condition for the growth of individual thought; and it is the fact that individual humans can think creatively, rather than being tied from birth to fixed instinctual patterns that remain rigid in face of an unpredictable world, that places Man in a position so much superior to brute beasts.4

Modifications to word meanings induced by modifications in purely individual knowledge or belief are difficult to illustrate, simply because any particular process of evolution of an individual's thought is not likely to be paralleled in the average reader's biography. But it is easy to illustrate the dependence of word meanings on beliefs in connexion with more public belief-changes. Consider, for instance, the case of the word father. Typically, the generative linguist will analyse father as a semantic molecule including the atoms MALE and PARENT (see e.g. Fromkin and Rodman 1998: 161). The unit PARENT, as a primitive, is left undefined, but the analysis is plausible only if we take PARENT to be synonymous with the English word parent, i.e. one of the partners in the act of conception. The difficulty is that there exists a large class of English-speakers who use the word father frequently and, on the great majority of occasions, in a quite standard fashion - who certainly do not mean this by the word: namely young children. The fact that children use father in another sense is demonstrated by occasional mismatches between their usage and that of adults; thus the linguist Barbara Partee (1975: 198) reported that

One year when I was living alone in a house in an area populated mainly by families, the small children in the neighborhood would repeatedly ask me whether I had a father and if so, where he was, and we had a number of perplexing conversations on the subject before someone pointed out to me that what they really wanted to know was whether I had a husband.
For the children, a father was the adult male who acts as head of a family unit, and conception did not enter the picture for the obvious reason that they were too young to know about the role of males in reproduction. When in due course children are taught about this, they recapitulate what was once an intellectual discovery made by some long-past generation; Malinowski (1929: ch. 7, §§3—4) claimed that there were cultures in our own age that had not yet made the discovery.

One might explain the facts by saying that children first analyse father as something like MALE HEAD-OF-HOUSEHOLD, and that when they learn about sexual reproduction they drop the latter feature and slot in PARENT in its place. But this is surely implausible. When children growing up in a society of stable marriages learn the sexual facts of life, they do not suddenly take themselves to have misunderstood the word father until that time; rather, they register a new fact about what fathers do, and the word father is not thereby changed in meaning. Only later, as they become gradually aware of the variety of child-rearing patterns in the larger world, are they forced to make a decision as to whether initiation of conception or role as head of a child's family is decisive for the use of father, and they find that when the criteria clash, members of their society invariably prefer the former (if we ignore compound expressions such as stepfather, foster-father). While the child's experience remains compatible with either the head-of-household or the initiator-of-conception sense of father (i.e. while the two criteria always or almost always pick out the same individuals in the child's world) it is surely meaningless to ask which of the two is the sense of father for the child; the matter is genuinely indeterminate. If the individual child discovers independently of talk about 'fathers' that there are many cases of male conception-initiators deserting their families, there is no saying whether the child will decide to think of such men as fathers or not — the decision is not pre-empted by any aspect of his prior mental representation of the word father, and children invariably opt in due course for the same usage only because there already exists a convention, the many adult parties to which 'outvote' any contrary decision by an individual child.

But (the generative linguist may reply) however he manages it, the child does after all end up with the analysis MALE PARENT for father. So long as we claim only to describe adult English — and granted that it may be preferable to substitute meaning postulates, such as 'From A is B's father infer A is male and A was a partner in the conception of B', for reduction of words to molecules of semantic 'atoms' written in small capitals - we are perfectly justified in setting out to produce a rigorous semantic analysis of the language.

This is not true, because people do not stop learning when they reach adulthood. Again the point can be illustrated from the word father. Until recently, a person's sex was a fixed given. In the twentieth century, a surgical operation was developed which turns men into women; the first case occurred in 1952, and in Britain the phenomenon decisively entered the arena of public awareness when the well-known writer James Morris became Jan Morris and published Conundrum in 1974. A few years after that,
instances had become scarcely more a matter of surprise than, say, divorces were a hundred years ago. Persons who undergo the operation do not, of course, emerge identical in all respects to persons who were born female, but they change sufficiently for them to be counted as 'women' and referred to as 'she' (even by those who know that they used to be men) - indeed this, as I understand it, is part of the purpose of the operation. Now consider the case of someone who fathers children and then undergoes the operation. Does she remain the children's father, or become their second mother, or is neither word now applicable? In fact I have the impression that the first of these three usages has become standard. If so, then it is no longer true that a father is 'by definition' a male parent; some fathers are female. We would need to correct a semantic description of the word to something along the lines of 'parent who was male at the time of conception' or 'parent who supplied the seed'. But the chief point is that it would surely be misleading to describe what has happened, as a generative linguist presumably must, by saying that before the invention of the sex-change operation linguists had difficulty in discovering whether father meant 'male parent' or 'seed-supplying parent', and that cases such as Jan Morris's have demonstrated that their original guess was wrong when it was made. Rather, before the invention of the operation the question whether father meant 'male parent' or 'seed-supplying parent' did not arise - not merely did linguists not think to ask the question, but it would have been a meaningless one if they had thought of it. Once the operation arrived on the scene, the language had to move in one direction or another, but nothing in its previous state pre-empted the decision. Nobody could know beforehand which way the cat would jump.

In view of cases such as this, I suggest, the notion of trying to establish a rigorous scientific analysis of word meanings is as misguided as would be the proposal that scientists ought to work out a theory enabling us to predict, now, whether there will be a General Election in England in the year 2020. There is no answer, now, to that question; when the year 2020 comes, the question of an election will be determined by free decisions that have yet to be made and which will be taken in the light of considerations that we cannot now know about. In the linguistic case, before sex-change operations became a reality an individual Englishman could have asked himself how he would talk about hypothetical cases of the kind discussed, and he would have been free to decide to call ex-male parents 'fathers' or 'mothers'; but whichever decision he made he could not have been accused of imperfect mastery of his mother tongue, and whichever decision he made he would have remained free to change his mind. Thus to ask for an analysis of the meaning of a word of 'the English language' is to ask what decisions would be made by English-speaking individuals about an indefinitely large series of hypothetical questions which many of the individuals may never have consciously considered; the reply must always be that, unless we are told what concrete situation (such as the invention of the sex-change operation) might force Englishmen in general to reach a decision about a particular question,
we cannot even guess which answer to it would be likely to become conventional in the English linguistic community (and even if we are given the situation we can do no more than guess). As the nineteenth-century linguist Wilhelm von Humboldt put it, language 'is not a finished product, but an activity'.7

The generative linguist's next move may be to suggest that I am cheating by representing an exceptional phenomenon as normal. Granted that the sex-change operation has shaken up some of the semantic relationships in our language, he may say, nevertheless such innovations are very rare; in between times the semantic relationships are static and can be rigorously described, and it is not objectionable to say that novelties as unusual as this do indeed force a well defined change in the language so that adults have to relearn certain aspects of English.

Clearly there is a sense in which the sex-change operation is an exceptional phenomenon; the belief it overthrew, that sex is fixed, was particularly well entrenched (and emotionally significant). But, in the first place, the sex-change operation did not (or did not only) change the semantic structure of English when it was invented; it drew attention to an indeterminacy in that structure that had existed for centuries if not millennia previously, and thus cast doubt on the notion that a language ever has a determinate semantic structure. Furthermore, a belief does not have to be as long-standing and firmly established as the belief that sex is fixed to play a part in inference. If I decide that Manchester is a poor town for shopping I shall be disposed to infer, from sentences such as John has a lot of shopping to do, that John had better not go to Manchester. The difference is that no one would be inclined to argue that a semantic description of English should include a rule 'From Manchester infer bad for shopping' (or that the word Manchester should include a feature [-GOOD-FOR-SHOPPING]). My opinion about Manchester is clearly 'knowledge about the world' rather than 'knowledge of English'. What the sex-change example shows is that even the kind of principles which one might think of as part of the English language are sometimes overthrown by events independent of the language; there is no real distinction between knowledge of the world and knowledge of the semantics of the language. We are tempted to assign a principle to the latter rather than to the former category when it is one of relatively long standing, so of course it is true that the kind of beliefs implicit in a typical essay by a linguist at semantic description change less often than the sort of beliefs which the linguist ignores. It is tautologous to say that beliefs which remain constant for long periods do not often change.

I might add, however, that I do not see cases such as the sex-change operation as being so very exceptional as one might imagine from the writings of generative linguists. These scholars often seem to presuppose a dreary, static world in which nothing ever crops up which is the least bit unusual from the point of view of a sheltered academic living in their particular time and place. Such a world is not recognizable to me as the one that I or anyone else inhabits; real life is full of surprises. (To quote only one, by no means
extreme, case of this attitude: Katz and Fodor's paradigm example, in their 1963 article, of a semantically 'anomalous' sentence is He painted the walls with silent paint. It would leave me quite unstunned to learn that, say, when spray guns are used for applying paint, certain types of paint make a noise on leaving the nozzle; and in that case I should expect sentences like Katz and Fodor's to occur in the speech of paint-sprayers.) The generative linguist's attitude contrasts with a more widely held principle of academic life which suggests that thinkers should be open to new possibilities and should not expect the world to conform to their prior assumptions. Only given the premiss that reality is almost always unsurprising, however, does the generative linguist's programme of semantic analysis become at all plausible. As Imre Lakatos put it (1976: 93 n. 1): 'Science teaches us not to respect any conceptual-linguistic framework lest it should turn into a conceptual prison - language analysts have a vested interest in at least slowing down this process [of conceptual change].' 8 The scholarly consensus
Little in the preceding critique of the generative approach to word meaning is original. Almost everything I am saying was stated, with a wealth of examples, in an excellent book by Karl Otto Erdmann published a hundred years ago (Erdmann 1900).8 Erdmann's book went through a number of editions, the most recent being issued in 1966, but I have never encountered a reference to it in the writings of the generative linguists. More recently, the idea that words do not have definite meanings became virtually a cliché of philosophy in the decades after the Second World War - indeed, for much of that period it ranked as one of the central doctrines of philosophy in the English-speaking world.9 Ludwig Wittgenstein (1953) argued that what links the various things to which a word applies are only an overlapping range of 'family resemblances', like the individual fibres which make up a rope - although no single fibre runs for more than a fraction of the length of the rope. Morton White (1950) and Willard van Orman Quine (1951) argued that the contrast between analytic and synthetic truths is not an absolute distinction encoded in rules of language structure, but is only a difference of degree between beliefs that we feel less willing or more willing to abandon when we encounter evidence that requires some change to our overall system of beliefs. We met Imre Lakatos objecting in the last section to the assumption that words have fixed conceptual essences; in this he was concurring with repeated arguments by his colleague Karl Popper against 'essentialism'. Almost all English-speaking philosophers of recent decades who discuss language have done so within the general framework of assumptions of one or all of these figures.11 (For that matter, some of the obvious difficulties in the Katz/Fodor notion of componential analysis were pointed out from within the discipline of linguistics by Dwight Bolinger in an article (Bolinger 1965) which went largely unmentioned in subsequent linguistic writings about semantics.)
Some readers may suspect that philosophers' objections to essentialism relate to airy-fairy considerations, perhaps relevant only to a limited number of words in lofty areas of vocabulary, or so abstract that in normal, practical circumstances they can be ignored. That is not at all true. Consider, for instance, the work of Adam Kilgarriff on automatic word-sense disambiguation. Kilgarriff conducts hard-nosed research on computational processing of natural language. He initiated the Senseval series (URL 12) of competitive trials of word-sense disambiguation systems, similar to longer-established series in areas such as automatic parsing and information extraction. For commercially valuable natural language processing applications, such as machine translation, automatic detection of the sense in which an ambiguous word is used in a given context is a crucial function: an English-to-French translation system needs to know whether a particular use of bank refers to a financial institution or a landscape feature, because the respective French equivalents are different words. Dictionaries list alternative senses for ambiguous words, and if the generative linguistic conception of word meanings as formally specifiable were right, then questions about how many and what distinct senses a given word has should have clear-cut answers: if two uses of a word correspond to different structures of semantic markers, they would be distinct senses, and if the marker-structures were the same then the uses would be different aspects of a single sense. (The original Katz and Fodor paper was largely concerned with the use of the marker system for discriminating in context between different word senses listed in an ordinary dictionary.)

However, Kilgarriff's detailed studies (1993, 1997) of the ways in which dictionaries distinguish between word senses, and the relationship between uses of words in real-life contexts and the sense discriminations identified in dictionaries, have led him to conclude (to quote the title of one of his papers) 'I don't believe in word senses.' The realities are so blurred and fluid that, for Kilgarriff, the sense distinctions listed in published dictionaries are not to be seen (as generative linguists suppose) as informal statements in layman's language of facts which could be made more scientifically exact via a formal notation system. Rather, published dictionaries impose distinctions, for their own institutional purposes, on a reality from which they are largely absent. The bank example is repeatedly used in the linguistic literature; but, contrary to what these discussions suggest, it is not a typical case — it is quite exceptional in the clarity of the distinction between two meanings. Kilgarriff believes that in general it makes no sense to ask how many meanings a word has in the abstract; one can only ask what meanings it might be convenient to distinguish for the purposes of some specific task.

In this situation, where scholars representing disciplines as diverse as abstract philosophy and practical computer science have been urging for a hundred years and more, and continue to urge, that words do not have definite meanings, it seems remarkable that generative linguists have felt so little obligation to defend their contrary point of view.
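For readers unfamiliar with what a word-sense disambiguation system of the kind just mentioned actually does, the sketch below may help. It implements the classic dictionary-overlap heuristic usually associated with Michael Lesk - not Kilgarriff's own work, and the miniature glosses for bank are invented for illustration - by scoring each listed sense according to how many words its dictionary definition shares with the surrounding context.

```python
# A minimal word-sense disambiguator in the spirit of the Lesk dictionary-
# overlap heuristic.  The two glosses for "bank" are invented; a real system
# would use full dictionary entries and far richer context processing.
SENSES = {
    'bank (financial institution)': 'institution that accepts deposits of money and makes loans',
    'bank (landscape feature)': 'sloping ground at the edge of a river or lake',
}

def disambiguate(context):
    """Pick the sense whose gloss shares most words with the context."""
    context_words = set(context.lower().split())
    def overlap(gloss):
        return len(context_words & set(gloss.split()))
    return max(SENSES, key=lambda sense: overlap(SENSES[sense]))

print(disambiguate('she rowed the boat to the bank of the river'))
# -> 'bank (landscape feature)'
print(disambiguate('he paid the money into the bank and asked about loans'))
# -> 'bank (financial institution)'
```

The point of Kilgarriff's critique, of course, is that for most words the inventory of senses such a system has to choose between is nothing like as clear-cut as this toy case suggests.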
The fact that numerous distinguished thinkers have concurred on a particular point of view over a long period does not make that view necessarily correct, of course. The distinguished thinkers might all be mistaken. But scholarship is supposed to be cumulative. People who put forward an idea contrary to the consensus of established opinion are normally expected to show, at length and in detail, why established opinion is misguided - not just to ignore it. If scholars fail to respect that obligation, then changes in the scholarly consensus cease to represent progress towards greater understanding; such changes would be as trivial as the year-to-year fluctuations in the length of girls' hemlines. Within the tradition inaugurated by Katz and Fodor's 1963 paper, most publications just take for granted that it makes sense to formalize word meanings, and the only debate is about the notation to be used. The 1963 paper made no mention at all of figures such as Wittgenstein, White, Quine, or Popper, and that continues to be usual. For instance, Saeed (1997) is a detailed and highly regarded 360-page textbook on semantics for linguists; the names Wittgenstein, White and Popper are not in its index, and Quine is mentioned only in a different connexion — there is just one brief allusion (pp. 87—8) to problems about the analytic/synthetic distinction, which if taken seriously would undermine much of the rest of the book.

Two exceptions to this unhappy tradition of ignoring well known problems are George Lakoff and Jean Aitchison, in their respective books Women, Fire, and Dangerous Things (Lakoff 1987), and Words in the Mind (Aitchison 1994), both of which draw attention to Wittgenstein's family-resemblance account of word usage, and invoke Eleanor Rosch's 'prototype' theory (Rosch 1975) as a possible solution to the problems about essentialism. (Rosch argued that, rather than words being defined by sets of necessary and sufficient properties for their application, they are associated with prototype examples, and a word is extended to other items if those items are sufficiently similar to the prototype.) The prototype idea is reasonable in itself, though (contrary to what Aitchison seems to suppose) it does not answer the problem raised by Wittgenstein: the difficulty in specifying defining properties for a word does not go away in prototype theory, it reappears as a difficulty in specifying the features in terms of which something is similar to or dissimilar from a prototype. Lakoff makes less large claims for prototype theory. His interesting book is written in too allusive a style for me to know for sure whether he agrees with the point I am arguing in this chapter, but some of his remarks (for instance 'indirectly meaningful symbolic structures are built up by imaginative capacities (especially metaphor and metonymy)', Lakoff 1987: 372), are congenial in tone. Whatever precisely Lakoff means to say about word meaning, though, his view cannot be said to have become representative of the generative linguistic mainstream.

9 Semantic competence and semantic performance

It is true that, after the publication of the original 1963 paper, Jerrold Katz
did more than once attempt to construct responses to Quine's arguments against the analytic/synthetic distinction. But it is hard to take Katz's responses very seriously. His first try (J. J. Katz 1964) simply showed (correctly) that the Katz/Fodor semantic-marker system assumed a sharp distinction between analytic and synthetic propositions, and asserted that this refuted Quine. That is like inventing a system of map-making based on the assumption that the Earth is flat, and treating it as a refutation of the spherical Earth. Later, Katz developed a slightly more sophisticated form of response (e.g. J. J. Katz 1972: 249-51); but this 'refutation' of Quine boiled down in essence to describing an experiment (in which speakers sort sentences into categories) that might settle the issue, explaining how the results of the experiment would establish the validity of the distinction, but not actually performing the experiment - the results Katz quoted were what he expected to find rather than what he had found. This is all the more remarkable when one considers that Quine's original writings quoted similar experiments that actually had been performed, and had yielded results unfavourable to the analytic/synthetic distinction (Quine 1960: 67 n. 6). A number of later experiments (e.g. Stemmer 1972) were no more successful from Katz's point of view. Katz referred to none of the actual experiments. This is surely a very rum way to carry on an academic debate. Perhaps Katz might seek to defend himself against the negative findings of published experiments by referring to the distinction between linguistic 'competence' and linguistic 'performance' — individuals' observable linguistic behaviour does not always conform perfectly to the linguistic rules stored in their minds (and which linguists set out to describe) because the behaviour is subject to interference from extraneous factors. In this case, the idea would be that the semantics of our language is governed by clear-cut inference rules, but that when drawing inferences in practice we sometimes make mistakes in following the rules. This would be a fairly desperate defence of Katz's position, since if we make so many mistakes that the rules cannot be inferred from what we say or do, it is not clear what substance there could be in the claim that such rules exist inside us. In case that difficulty is felt not to be fatal, however, let me quote the results of a further experiment on the analytic/synthetic distinction, carried out by a Lancaster University student, which is particularly interesting in this connexion. Linguists who appeal in their research methodology to the notion that performance diverges from competence (a notion which Chomsky introduced in connexion with grammatical research) hold that it is possible to overcome the difficulty by using the intuitions of people trained to discount the disturbing factors summed up as 'performance'. Thus a trained grammarian will be able to say whether a given word-sequence really is grammatical or not in his language, although a lay speaker of the language might take the sentence to be 'ungrammatical' merely because, say, it reports a bizarre situation or uses vocabulary associated with stigmatized social groups. (There are of course important objections to this appeal to 'trained intuition', but I ignore them here in order to give the other side the benefit of
the doubt.) In the case of the analytic/synthetic distinction, if one seeks a group who have been formally trained to draw the distinction explicitly, the answer will presumably be the class of professional philosophers. Accordingly, Nigel Morley-Bunker (1977) compared the behaviour with respect to the analytic/synthetic distinction of two groups of subjects, one composed of university teachers of philosophy and the other consisting of people lacking that training (it included undergraduates and teachers of another subject). The subjects were not, of course, asked in so many words to classify sentences as analytic or synthetic, since that would raise great problems about understanding of the technical terms. Instead, Morley-Bunker adopted the experimental technique advocated by Katz, presenting subjects with two piles of cards containing sentences which to members of our contemporary society seem rather clearly analytic and synthetic respectively, and allowing subjects to discover the principle of classification themselves by sorting further 'clear cases' onto the piles and having their allocations corrected by the experimenter when they clashed with the allocations implied by the analytic/synthetic principle. After subjects had in this way demonstrated mastery of the distinction with respect to 'clear cases', they were given 18 sentences of more questionable status to sort. The experiment involved no assumptions about whether these sentences were 'in fact' analytic or synthetic; the question was whether individuals of either or both groups would be consistent with one another in assigning them to one or other of the two categories.

On the whole, as would be expected by a non-believer in the analytic/synthetic distinction, neither group was internally consistent. Two sentences were assigned to a single category by a highly significant (p < 0.005) majority of each group of subjects, and for each group there were three more sentences on which a consistent classification was imposed by a significant majority (p < 0.05) of members of that group; on ten sentences neither group deviated significantly from an even split between those judging it analytic and those judging it synthetic. (To give examples: Summer follows spring was judged analytic by highly significant majorities in each group of subjects, while We see with our eyes was given near-even split votes by each group.) Over all, the philosophers were somewhat more consistent with each other than the non-philosophers. But that global finding conceals results for individual sentences that sometimes manifested the opposite tendency. Thus Thunderstorms are electrical disturbances in the atmosphere was judged analytic by a highly significant majority of the non-philosophers, while a (nonsignificant) majority of the philosophers deemed it synthetic. In this case, it seems, philosophical training induces the realization that well established results of contemporary science are not necessary truths; in other cases, conversely, clichés of current philosophical education impose their own mental blinkers on those who undergo it - Nothing can be completely red and green all over was judged analytic by a significant majority of philosophers but only by a non-significant majority of non-philosophers.
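For readers who want to see what the significance criterion amounts to in concrete terms, the sketch below applies an exact two-sided binomial test to one group's votes on a single sentence. This is one standard way of assessing such data, not necessarily Morley-Bunker's own procedure, and the vote counts are invented for illustration; the test simply asks how improbable the observed split would be if each subject were in effect tossing a coin.

```python
from math import comb

def two_sided_binomial_p(analytic_votes, n):
    """Exact two-sided binomial test against the null hypothesis of an
    even (coin-tossing) split between 'analytic' and 'synthetic' votes."""
    observed = abs(analytic_votes - n / 2)
    return sum(comb(n, k) for k in range(n + 1)
               if abs(k - n / 2) >= observed) / 2 ** n

# Invented illustrative counts: 'analytic' votes out of 20 subjects in a group.
votes = {
    'Summer follows spring': 19,   # near-unanimous
    'We see with our eyes': 11,    # near-even split
}

for sentence, analytic in votes.items():
    p = two_sided_binomial_p(analytic, 20)
    verdict = 'consistent classification' if p < 0.05 else 'no consensus'
    print(f'{sentence}: {analytic}/20 analytic, p = {p:.4f} ({verdict})')
```

Nothing in such a test, of course, tells us whether the inconsistency it reveals reflects 'performance error' or, as argued here, the unreality of the question being put to the subjects.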
All in all, Morley-Bunker's results argue against the notion that our inability to decide consistently whether or not some statement is a necessary truth derives from lack of skill in articulating our underlying knowledge of the rules of our language. Rather, the inability comes from the fact that the question as posed is unreal; we choose to treat a given statement as open to question or as unchallengeable in the light of the overall structure of beliefs which we have individually evolved in order to make sense of our individual experience. Even the cases which seem 'clearly' analytic or synthetic are cases which individuals judge alike because the relevant experiences are shared by the whole community; but, even for such cases, one can invent hypothetical future experiences which, if they should be realized, would cause us to revise our judgements.13

As already suggested in connexion with cup, the notion that word-semantics is rule-governed is most plausible in fields where human decisions control the realities to which words correspond; then, if people decide to adopt a certain semantic rule, they can manipulate reality so as to avoid the need to drop the rule. The notion of analysing words as clusters of meaning-features, which Katz and Fodor saw as applicable to the entire vocabulary of a language, was originally invented, by anthropological linguists, as a tool to describe one particular area of vocabulary, namely that of kinship terms. (The idea stemmed from a pair of papers published jointly in the journal Language: Lounsbury 1956, Goodenough 1956.) Kinship classification is very clearly an area in which reality is determined largely by the wish to make situations clear-cut; our society, for instance, goes to considerable lengths (in terms of ceremonial, legal provisions, etc.) to ensure that there shall be no ambiguity about whether or not someone is a wife. And it is also, of course, an area in which Nature has happened to operate in a relatively cut-and-dried way; until the advent of in vitro fertilization and the possibility of surrogate motherhood in the 1970s, there had never in human history been a half-way house between being and not being a given individual's mother. It is not at all clear that the inventors of componential analysis intended it as a philosophical claim about the nature of human thought, or even as a useful practical descriptive tool outside the limited area for which it was designed. Even within that limited area, furthermore, the technical device of 'meaning-components' had already turned out to be inadequate (see Lounsbury 1964) by the time that Katz and Fodor set out to extend it to the whole vocabulary.
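To show concretely what a componential treatment of kinship vocabulary looks like, here is a deliberately simplified sketch; the feature names are my own inventions for illustration, not the analyses actually proposed by Lounsbury or Goodenough.

```python
# A toy componential analysis of a few English kin terms.  Each term is a
# bundle of discrete features, and naming a relative is exhaustive
# feature-matching; the features themselves are invented for illustration.
KIN_TERMS = {
    'father':   {'sex': 'male',   'generation': +1, 'lineal': True},
    'mother':   {'sex': 'female', 'generation': +1, 'lineal': True},
    'uncle':    {'sex': 'male',   'generation': +1, 'lineal': False},
    'aunt':     {'sex': 'female', 'generation': +1, 'lineal': False},
    'son':      {'sex': 'male',   'generation': -1, 'lineal': True},
    'daughter': {'sex': 'female', 'generation': -1, 'lineal': True},
}

def name_relative(relative):
    """Return every kin term whose feature bundle exactly fits the relative."""
    return [term for term, features in KIN_TERMS.items()
            if all(relative.get(f) == v for f, v in features.items())]

print(name_relative({'sex': 'male', 'generation': +1, 'lineal': True}))
# -> ['father']

# The ex-male parent discussed in section 7 - female now, but the children's
# begetter - can only come out as 'mother' on this scheme, although usage
# seems in fact to have settled on 'father'.
print(name_relative({'sex': 'female', 'generation': +1, 'lineal': True}))
# -> ['mother']
```

Within the narrow domain of kin terms the scheme looks workable precisely because, as argued above, the domain is one where clear-cut criteria are socially enforced; the difficulties begin as soon as the same apparatus is extended to the vocabulary at large.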
10 Precision so far as the nature of the subject admits
Katz's dispute with Quine over analyticity is the only real attempt I have encountered by a generative linguist to answer the objections to the 'rule-governed' view of semantics that arise from the consensus of ideas discussed earlier; and we have seen that it was not a very serious attempt. Most generative linguists ignore the problem. Chomsky, for instance, simply
dismisses out of hand the possibility that the analytic/synthetic distinction might be open to question: 'There are no possible worlds in which someone was ... an uncle but not male [etc.] ... The necessary falsehood of "I found a female uncle" is not a matter of syntax or fact or belief' (Chomsky 1975b: 170). Yet, in view of my discussion of the sex-change operation, it seems likely that one could find a female uncle in the actual world. James McGilvray, a recent interpreter of Chomsky's oeuvre with more knowledge than the average linguist about twentieth-century philosophy, brushes aside both Nelson Goodman and Ludwig Wittgenstein: unlike Chomsky, Goodman 'fell back into the Wittgensteinian quagmire of socialized, plastic linguistic meanings' (McGilvray 1999: 163). McGilvray is using the word 'quagmire' as a substitute for argument here; at no point does he take the trouble to explain why Goodman's or Wittgenstein's views were wrong.

By the turn of the twenty-first century it has become difficult to avoid the conclusion that academics who broadly sympathize with the generative approach to linguistics - which means probably the majority of staff in university linguistics departments - are willing to take for granted that word meanings can be formalized, and they are simply not interested in considering the weight of contrary evidence and argumentation provided by writers who do not owe allegiance to generative linguistics. The generative linguistic paradigm has become something of a closed world, whose inhabitants listen to one another but ignore outsiders and set little value on observational evidence. So their pet ideas never have to be given up, by themselves.

One symptom of this unhealthy state of affairs is the way that the phrase 'linguistic semantics' has come into use as the title of a distinct academic discipline - semantics as studied by linguists - which is presented as independent, and properly independent, of 'philosophical semantics'. Textbooks from G. L. Dillon's Introduction to Contemporary Linguistic Semantics (Dillon 1977) to Lyons's Linguistic Semantics: An Introduction (Lyons 1995) express this independence either in their titles or in the body of their contents — the Preface to Hurford and Heasley (1983) begins by saying 'This book presents a standard and relatively orthodox view of modern linguistic semantics ...'. Many of these books pay little if any attention to the weighty grounds that philosophers have offered for scepticism about generative semantic analysis. But boundaries between academic subjects are matters of administrative convenience only. The truth is that the nature of the meanings of ordinary words in human languages is one topic, whether addressed by linguists or by philosophers. For generative linguists to ignore everyone else's views in order to preserve their own belief in rigorous, scientific semantic description makes what they themselves say unpersuasive.

One way of stating in general terms the point which generative linguistics (and some other contemporary 'human sciences') are failing to grasp is to quote remarks made, more than two millennia ago, by Aristotle in the Nicomachean Ethics (David Ross's translation):
precision is not to be sought for alike in all discussions ... it is the mark of an educated man to look for precision in each class of things just so far as the nature of the subject admits (1094b)

the whole account of matters of conduct must be given in outline and not precisely (1104a)
In other words, we cannot expect to find the exactness which is proper to mathematics in analyses of human behaviour, which is produced by imaginative and unpredictable human minds. Word meanings have been described, for practical purposes rather well, in published dictionaries for centuries. Formal theories such as those of Katz and Fodor and their generative successors seem to be offering rigorous, scientifically exact statements of word meaning in place of the useful but unscientific definitions found in dictionaries. But no such offer can ever be made good: word meanings - speakers' propensities to draw inferences from uses of words — are one of the aspects of human behaviour which are necessarily described 'in outline and not precisely'. The best dictionary does not pretend that its definitions have the exactness of a physical equation, but that is not a shortcoming in the dictionary: good but imperfect definitions are as much as can ever be available, for the bulk of words in everyday use in a human language. Word meanings are not among the phenomena which can be covered by empirical, predictive scientific theories.

Notes

1 There are certainly respects in which the analogy breaks down. Thus we commonly find ourselves choosing a word to name a given thing rather than seeking a thing to which a given word applies, which is the analogue of choosing a costume in order to be recognized as fashionable; and there is no analogue in the dress case of the fact that we have a vocabulary of many different words for each of which the conditions of use must be guessed — although this could be built into the analogy, e.g. by saying that what counts as fashionable is very different in different domains (the 'right' clothes for a cocktail party are unrelated to the 'right' clothes for going to the races, the 'right' kind of interior decoration is different again, etc.) and that people must simultaneously make independent guesses for each domain. These limitations of the analogy, in any case, do not affect its force as against the assumptions of generative linguistics.

2 The earliest suggestion known to me that natural-language semantics involves probabilistic rules was made by Yehoshua Bar-Hillel 1969.

3 It may be that someone can produce evidence that gay changed its meaning by a route quite different from that reconstructed here; this would not destroy my general point, though it might rob the example of its persuasive force.

4 I do not mean to presuppose here that Man is the only animal species with a creative intellect; I do not need to take a position on this issue either way. Experiments with other apes have shown them to exhibit many apparently intelligent patterns of behaviour (Weiskrantz 1997). Jane Hill (1972: 311) went so far as to argue that apes are likely to be fully as intelligent as men, and that their failure to evolve complex languages and high culture can be explained by factors unrelated to intelligence which prevent them using language in the wild.

5 Fromkin and Rodman add the marker HUMAN, though not all speakers would limit the use of father to our species.

6 Note that it is axiomatic for generative linguists that 'semantic primitives', being innate, do not themselves evolve in meaning; the word parent presumably develops its meaning in parallel with the word father, but the semantic feature PARENT must always mean something to do with conception — young children simply are not yet consciously aware of the idea PARENT which nevertheless exists implicitly in their minds.

7 'Die Sprache ... ist kein Werk (Ergon), sondern eine Thätigkeit (Energeia)' (von Humboldt 1836: 57; my translation).

8 For references to earlier, eighteenth- and nineteenth-century statements of the ideas in question see Cohen 1962: ch. 1, §3.

9 It is no longer as true now as it was in the third quarter of the twentieth century that an undergraduate studying philosophy is likely to learn about the indefiniteness of word meaning if he learns nothing else; professional philosophers shifted their main focus away from language to other issues, after the 1970s. But, if the issue has lost its centrality, that is because new areas of interest have come to the fore, not because earlier beliefs have been refuted.

10 The idea that there is a clear distinction between analytic and synthetic sentences is a special case of the idea, discussed earlier in this chapter, that all possible pairs (P, C) of a set of premisses with a conclusion can be divided into subsets of valid and invalid entailments. For ease of exposition, philosophers of language commonly discuss the special case of analytic and synthetic sentences, i.e. valid v. invalid entailments where the premiss-set is empty.

11 One leading modern philosopher whose treatment of language might seem somewhat more compatible with the linguists' approach is J. L. Austin, who would have counted as a 'language analyst' in the terms of the Lakatos quotation in §7 above. However, Austin's view of language seems self-defeating, for reasons elegantly set out by Keith Graham 1977: ch. 2.

12 Jerrold Katz and T. G. Bever (1977) adopted this defence explicitly.

13 This is not intended to call into question the special status of the 'truths of logic', such as Either it is raining or it is not. I am inclined to accept the traditional view according to which 'logical particles' such as not, or are distinct from the bulk of the vocabulary in that the former really are governed by clear-cut inference rules. I shall not expand on this point here.
References
Details of reprints in anthologies are included where possible for papers which may be inaccessible in their original edition. Page references are to original publications except when explicitly stated. Place of publication is omitted for publishers who list an office in London, or whose name includes their location.

Aarts, J. and van den Heuvel, T. (1985) 'Computational tools for the syntactic analysis of corpora', Linguistics, 23: 302-35.
Aho, A. V., Hopcroft, J. E. and Ullman, J. D. (1974) The Design and Analysis of Computer Algorithms, Reading, Mass.: Addison-Wesley.
Aitchison, J. (1994) Words in the Mind, 2nd edn, Oxford: Blackwell.
Akmajian, A., Demers, R. A., Farmer, A. K. and Harnish, R. M. (1995) Linguistics: An Introduction to Language and Communication, 4th edn, MIT Press.
Allen, D. E. (1994) The Naturalist in Britain: A Social History, 2nd edn, Princeton University Press.
Ammon, U. (1994) 'Code, sociolinguistic'. In R. E. Asher (ed.) Encyclopedia of Language and Linguistics, Oxford: Pergamon, vol. 2, pp. 578-81.
Anglin, J. M. (1970) The Growth of Word Meaning (Research Monograph 63), MIT Press.
Apostel, L., Mandelbrot, B. and Morf, A. (1957) Logique, langage, et théorie de l'information, Paris: Presses Universitaires de France.
Atkinson, M., Kilby, D. and Roca, I. (1988) Foundations of General Linguistics, 2nd edn, Unwin Hyman.
Austerlitz, R. (ed.) (1975) The Scope of American Linguistics: Papers of the First Golden Anniversary Symposium of the Linguistic Society of America, Lisse: Peter de Ridder.
Bachenko, J. and Gale, W. A. (1993) 'A corpus-based model of interstress timing and structure', Journal of the Acoustical Society of America, 94: 1797.
Baker, C. L. (1979) 'Syntactic theory and the projection problem', Linguistic Inquiry, 10: 533-81.
Bar-Hillel, Y. (1969) 'Argumentation in natural languages'. In Akten des XIV. Internationalen Kongresses für Philosophie, vol. 2, Vienna: Verlag Herder. Reprinted as ch. 6 of Y. Bar-Hillel, Aspects of Language, Amsterdam: North-Holland, 1970.
Bernstein, B. (1971) Class, Codes and Control, vol. 1: Theoretical Studies Towards a Sociology of Language, Routledge & Kegan Paul.
Biber, D. (1995) Dimensions of Register Variation: A Cross-Linguistic Comparison, Cambridge University Press.
Bierwisch, M. (1970) 'Semantics'. In J. Lyons (ed.) New Horizons in Linguistics, Penguin, 166-84.
Bloomfield, L. (1933) Language, New York: Holt.
Bloomfield, L. (1942) 'Outline of Ilocano syntax', Language, 18: 193-200.
Bod, R. (1998) Beyond Grammar: An Experience-Based Theory of Language, Stanford, California: Center for the Study of Language and Information.
Bolinger, D. (1965) 'The atomization of meaning', Language, 41: 555-73.
Bowerman, M. (1988) 'The "no negative evidence" problem: how do children avoid constructing an overly general grammar?' In J. A. Hawkins (ed.) Explaining Language Universals, Oxford: Blackwell, 73-101.
Box, G. E. P. and Tiao, G. C. (1973) Bayesian Inference in Statistical Analysis, Addison-Wesley.
Braine, M. D. S. (1971) 'On two types of models of the internalization of grammars'. In D. I. Slobin (ed.) The Ontogenesis of Grammar, Academic Press.
Briscoe, E. J. (1990) 'English noun phrases are regular: a reply to Professor Sampson'. In J. Aarts and W. Meijs (eds) Theory and Practice in Corpus Linguistics, Amsterdam: Rodopi, 45-60.
Briscoe, E. J. and Waegner, N. (1992) 'Robust stochastic parsing using the inside-outside algorithm'. In Statistically-Based Natural Language Programming Techniques: Papers from the AAAI Workshop, Technical Report WS-92-01, Menlo Park, California: American Association for Artificial Intelligence, 39-53.
Brooks, P. J. and Tomasello, M. (1999) 'How children constrain their argument structure constructions', Language, 75: 720-38.
Burt, M. K. (1971) From Deep to Surface Structure, Harper & Row.
Chao, Y. R. (1968) A Grammar of Spoken Chinese, Berkeley and Los Angeles: University of California Press.
Chitashvili, R. J. and Baayen, R. H. (1993) 'Word frequency distributions'. In L. Hřebíček and G. Altmann (eds) Quantitative Text Analysis (Quantitative Linguistics, 52), Trier: Wissenschaftlicher Verlag, 54-135.
Chomsky, A. N. (1956) 'Three models for the description of language', Institute of Radio Engineers Transactions on Information Theory, IT-2: 113-24. Reprinted in Luce, Bush and Galanter 1965: 105-24.
Chomsky, A. N. (1957) Syntactic Structures, 's-Gravenhage: Mouton.
Chomsky, A. N. (1959) 'On certain formal properties of grammars', Information and Control, 1: 91-112. Reprinted in Luce et al. 1965, 125-55.
Chomsky, A. N. (1961) 'Formal discussion: the development of grammar in child language'. A conference contribution reprinted in J. P. B. Allen and P. van Buren (eds) Chomsky: Selected Readings, Oxford University Press, 1971, 129-34, to which my page references relate.
Chomsky, A. N. (1962) Various contributions to A. A. Hill (1962).
Chomsky, A. N. (1964) Current Issues in Linguistic Theory, Mouton.
Chomsky, A. N. (1965) Aspects of the Theory of Syntax, Cambridge, Mass.: MIT Press.
Chomsky, A. N. (1975a) The Logical Structure of Linguistic Theory, Plenum.
Chomsky, A. N. (1975b) 'Questions of form and interpretation'. In Austerlitz 1975, 159-96.
Chomsky, A. N. (1976) Reflections on Language, Temple Smith.
Church, K. W. (1982) On Memory Limitations in Natural Language Processing, Bloomington, Indiana: Indiana University Linguistics Club.
Church, K. W. (1988) 'A stochastic parts program and noun phrase parser for unrestricted text', Proceedings of the 2nd Conference on Applied Natural Language Processing, Austin, Texas, 136-43.
Church, K. W. and Gale, W. A. (1991) 'A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams', Computer Speech and Language, 5: 19-54.
Church, K. W., Gale, W. A. and Kruskal, J. B. (1991) 'The Good-Turing theorem'. Appendix A in Church and Gale 1991.
Cohen, L. J. (1962) The Diversity of Meaning, Methuen.
Cole, R., Mariani, J., Uszkoreit, H., Varile, G. B., Zaenen, A., Zampolli, A. and Zue, V. (eds) (1997) Survey of the State of the Art in Human Language Technology, Cambridge University Press and URL 14.
Culicover, P. W. (1999) 'Minimalist architectures', Journal of Linguistics, 35: 137-50.
Culy, C. (1998) 'Statistical distribution and the grammatical/ungrammatical distinction', Grammars, 1: 1-13.
Cunningham, H. (1999) 'A definition and short history of Language Engineering', Natural Language Engineering, 5: 1-16.
Curtiss, S. (1977) Genie: A Psycholinguistic Study of a Modern-Day 'Wild Child', Academic Press.
Date, C. J. (1994) An Introduction to Database Systems, 6th edn, Reading, Mass.: Addison-Wesley.
De Roeck, A., Johnson, R., King, M., Rosner, M., Sampson, G. R. and Varile, N. (1982) 'A myth about centre-embedding', Lingua, 58: 327-40.
Dillon, G. L. (1977) Introduction to Contemporary Linguistic Semantics, Prentice-Hall.
Edwards, J. (1992) 'Design principles in the transcription of spoken discourse'. In Svartvik (1992), 129-44.
Efron, B. and Thisted, R. (1976) 'Estimating the number of unseen species: how many words did Shakespeare know?', Biometrika, 63: 435-47.
Ellegård, A. (1978) The Syntactic Structure of English Texts: A Computer-Based Study of Four Kinds of Text in the Brown University Corpus, Stockholm: Almqvist & Wiksell, Gothenburg Studies in English, 43.
Empson, W. (1930) Seven Types of Ambiguity, Chatto & Windus.
Erdmann, K. O. (1900) Die Bedeutung des Wortes, Leipzig: Eduard Avenarius.
Eriksson, G. (1983) 'Linnaeus the botanist'. In T. Frängsmyr (ed.) Linnaeus: The Man and His Work, University of California Press, 63-109.
Fellbaum, C. (ed.) (1998) WordNet: An Electronic Lexical Database, MIT Press.
Fienberg, S. E. and Holland, P. W. (1972) 'On the choice of flattening constants for estimating multinomial probabilities', Journal of Multivariate Analysis, 2: 127-34.
Fillmore, C. J. (1972) 'On generativity'. In S. Peters (ed.) Goals of Linguistic Theory, Prentice-Hall.
Fisher, R. A. (1922) 'On the mathematical foundations of theoretical statistics', Philosophical Transactions of the Royal Society of London, A, 222: 309-68. Reprinted in J. H. Bennett (ed.) Collected Papers of R. A. Fisher, vol. I: 1912-24, University of Adelaide Press.
Fisher, R. A., Corbet, A. S. and Williams, C. B. (1943) 'The relation between the number of species and the number of individuals in a random sample of an animal population', Journal of Animal Ecology, 12: 42-58.
Fodor, J. A., Bever, T. G. and Garrett, M. F. (eds) (1974) The Psychology of Language: An Introduction to Psycholinguistics and Generative Grammar, McGraw-Hill.
Fodor, J. A. and Katz, J. J. (eds) (1964) The Structure of Language: Readings in the Philosophy of Language, Prentice-Hall.
Francis, W. N. (1958) The Structure of American English, New York: Ronald Press.
Fries, C. C. (1952) The Structure of English, New York: Harcourt Brace.
Fromkin, V. and Rodman, R. (1983) An Introduction to Language, 3rd edn, Holt, Rinehart & Winston.
Fromkin, V. and Rodman, R. (1998) An Introduction to Language, 6th edn, Harcourt Brace.
Gale, W. A. and Church, K. W. (1994) 'What is wrong with adding one?' In N. Oostdijk and P. de Haan (eds) Corpus-Based Research into Language, Amsterdam: Rodopi, 189-98.
Garside, R. G., Leech, G. N. and McEnery, A. (eds) (1997) Corpus Annotation: Linguistic Information from Computer Text Corpora, Longman.
Garside, R. G., Leech, G. N. and Sampson, G. R. (eds) (1987) The Computational Analysis of English: A Corpus-Based Approach, Longman.
Gazdar, G. (1979) Review of books by Stockwell, Journal of Linguistics, 15: 197-8.
Gazdar, G., Klein, E., Pullum, G. and Sag, I. (1985) Generalized Phrase Structure Grammar, Oxford: Blackwell.
Ghezzi, C., Jazayeri, M. and Mandrioli, D. (1991) Fundamentals of Software Engineering, Prentice-Hall.
Gibbon, D., Moore, R. and Winski, R. (eds) (1997) Handbook of Standards and Resources for Spoken Language Systems, Berlin: Mouton de Gruyter, and URL 10.
Gleick, J. (1994) Genius, Abacus Books.
Good, I. J. (1953) 'The population frequencies of species and the estimation of population parameters', Biometrika, 40: 237-64.
Good, I. J. (1965) The Estimation of Probabilities: An Essay on Modern Bayesian Methods, Cambridge, Mass.: MIT Press.
Good, I. J. and Toulmin, G. H. (1956) 'The number of new species, and the increase in population coverage, when a sample is increased', Biometrika, 43: 45-63.
Goodenough, W. H. (1956) 'Componential analysis and the study of meaning', Language, 32: 195-216.
Goodman, L. A. (1949) 'On the estimation of the number of classes in a population', Annals of Mathematical Statistics, 20: 572-9.
Goodman, N. (1961) 'Safety, strength, simplicity', Philosophy of Science, 28: 150-1. Reprinted in P. H. Nidditch (ed.) The Philosophy of Science, Oxford University Press, 1968, 121-3.
Goodman, N. (1965) Fact, Fiction, and Forecast, 2nd edn, Indianapolis: Bobbs-Merrill.
Graham, K. (1977) J. L. Austin: A Critique of Ordinary Language Philosophy, Hassocks, Sussex: Harvester.
Gropen, J., Pinker, S., Hollander, M., Goldberg, R. and Wilson, R. (1989) 'The learnability and acquisition of the dative alternation in English', Language, 65: 203-57.
Hagège, C. (1981) Critical Reflections on Generative Grammar, Lake Bluff, Illinois: Jupiter Press.
Harris, R. A. (1993) The Linguistics Wars, Oxford University Press.
Harris, Z. S. (1951) Methods in Structural Linguistics, University of Chicago Press.
Harris, Z. S. (1965) 'Transformational theory', Language, 41: 363-401.
Hempel, C. G. (1966) Philosophy of Natural Science, Prentice-Hall.
Hill, A. A. (ed.) (1962) Third Texas Conference on Problems of Linguistic Analysis in English (Studies in American English), Austin, Texas: University of Texas Press.
Hill, J. H. (1972) 'On the evolutionary foundations of language', American Anthropologist, 14: 308-17.
Hockett, C. F. (1954) 'Two models of grammatical description', Word, 10: 210-31.
Hockett, C. F. (1968) The State of the Art, The Hague: Mouton.
Hodges, A. (1983) Alan Turing: The Enigma of Intelligence, Hutchinson. My page reference is to the Unwin Paperbacks edn, 1985.
Hofland, K. and Johansson, S. (1982) Word Frequencies in British and American English, Bergen: Norwegian Computing Centre for the Humanities.
Householder, F. W. (ed.) (1972) Syntactic Theory I: Structuralist, Harmondsworth: Penguin.
Householder, F. W. (1973) 'On arguments from asterisks', Foundations of Language, 10: 365-76.
Hudson, R. A. (1994) 'About 37% of word-tokens are nouns', Language, 70: 331-9.
von Humboldt, K. W. (1836) Ueber die Verschiedenheit des menschlichen Sprachbaues, Berlin: Dümmler. English translation published as On Language: The Diversity of Human Language-Structure and its Influence on the Mental Development of Mankind, Cambridge University Press, 1988.
Hurford, J. R. and Heasley, B. (1983) Semantics: A Coursebook, Cambridge University Press.
Hutchins, W. J. and Somers, H. L. (1992) An Introduction to Machine Translation, Academic Press.
Jacobs, R. A. and Rosenbaum, P. S. (1968) English Transformational Grammar, Blaisdell.
Jakobson, R. (ed.) (1961) Structure of Language and Its Mathematical Aspects, Proceedings of Symposia in Applied Mathematics, 12, Providence, Rhode Island: American Mathematical Society.
Jeffreys, H. (1948) Theory of Probability, 2nd edn, Oxford: Clarendon Press.
Jelinek, F. and Mercer, R. (1985) 'Probability distribution estimation from sparse data', IBM Technical Disclosure Bulletin, 28: 2591-4.
Johnson, W. E. (1932) 'Probability: the deductive and inductive problems', Mind, new series, 41: 409-23.
Katz, J. J. (1964) 'Analyticity and contradiction in natural language'. In Fodor and Katz 1964: 519-43.
Katz, J. J. (1971) The Underlying Reality of Language and Its Philosophical Import, Harper & Row.
Katz, J. J. (1972) Semantic Theory, Harper & Row.
Katz, J. J. (1981) Language and Other Abstract Objects, Totowa, New Jersey: Rowman & Littlefield.
Katz, J. J. and Bever, T. G. (1977) 'The fall and rise of empiricism'. In T. G. Bever, J. J. Katz and D. T. Langendoen (eds) An Integrated Theory of Linguistic Ability, Hassocks, Sussex: Harvester, 11-64.
Katz, J. J. and Fodor, J. A. (1963) 'The structure of a semantic theory', Language, 39: 170-210. Reprinted in Fodor and Katz 1964: 479-518.
Katz, S. M. (1987) 'Estimation of probabilities from sparse data for the language model component of a speech recognizer', IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-35: 400-1.
Kilgarriff, A. (1993) 'Dictionary word sense distinctions: an enquiry into their nature', Computers and the Humanities, 26: 365-87.
Kilgarriff, A. (1997) 'I don't believe in word senses', Computers and the Humanities, 31: 91-113.
Kimball, J. P. (1973) The Formal Theory of Grammar, Prentice-Hall.
Knuth, D. E. (1973) The Art of Computer Programming: Vol. 3, Sorting and Searching, Addison-Wesley.
Kornai, A. (forthcoming) 'How many words are there?'
Labov, W. M. (1970) 'The study of language in its social context', Studium Generale, 23: 30-87. My page reference relates to the reprint as ch. 8 of W. M. Labov, Sociolinguistic Patterns, Oxford: Blackwell, 1978.
Labov, W. M. (1973a) 'The boundaries of words and their meanings'. In C.-J. N. Bailey and R. W. Shuy (eds) New Ways of Analyzing Variation in English, Washington, DC: Georgetown University Press, 340-73.
Labov, W. M. (1973b) 'The place of linguistic research in American society'. In E. P. Hamp (ed.) Themes in Linguistics: The 1970s, The Hague: Mouton, 97-129.
Labov, W. M. (1975) 'Empirical foundations of linguistic theory'. In Austerlitz 1975: 77-133. Reprinted separately as What Is a Linguistic Fact?, Lisse: Peter de Ridder, 1975.
Lakatos, I. (1970) 'Falsification and the methodology of scientific research programmes'. In I. Lakatos and A. Musgrave (eds) Criticism and the Growth of Knowledge, Cambridge University Press, 91-195.
Lakatos, I. (1976) Proofs and Refutations, Cambridge University Press.
Lakoff, G. (1987) Women, Fire, and Dangerous Things: What Categories Reveal About the Mind, University of Chicago Press.
Langendoen, D. T. (1969) The Study of Syntax: The Generative-Transformational Approach to the Structure of American English, Holt, Rinehart & Winston.
Langendoen, D. T. (1997) Review of Sampson 1995, Language, 73: 600-3.
Langendoen, D. T. and Postal, P. M. (1984) The Vastness of Natural Languages, Oxford: Blackwell.
Lees, R. B. (1961) Comments on Hockett's paper. In Jakobson 1961: 266-7.
Lenneberg, E. H. (1967) Biological Foundations of Language, John Wiley.
Lesk, M. (1988) Review of Garside, Leech, and Sampson 1987, Computational Linguistics, 14: 90-1.
Lidstone, G. J. (1920) 'Note on the general case of the Bayes-Laplace formula for inductive or a posteriori probabilities', Transactions of the Faculty of Actuaries, 8: 182-92.
Lounsbury, F. G. (1956) 'A semantic analysis of the Pawnee kinship usage', Language, 32: 158-94.
Lounsbury, F. G. (1964) 'The structural analysis of kinship semantics', Proceedings of the Ninth International Congress of Linguists, Cambridge, Massachusetts, 1962, The Hague: Mouton, 1088-90.
Luce, R. D., Bush, R. R. and Galanter, E. (eds) (1965) Readings in Mathematical Psychology, John Wiley, vol. II.
Lyons, J. (1991) Chomsky, 3rd edn, Fontana.
Lyons, J. (1995) Linguistic Semantics: An Introduction, Cambridge University Press.
McCawley, J. D. (1968) 'Concerning the base component of a transformational grammar', Foundations of Language, 4: 243-69.
McGilvray, J. (1999) Chomsky: Language, Mind, and Politics, Cambridge: Polity Press.
McNeil, D. (1973) 'Estimating an author's vocabulary', Journal of the American Statistical Association, 68: 92-6.
Malinowski, B. (1929) The Sexual Life of Savages in North-Western Melanesia, Routledge.
Mandelbrot, B. (1982) The Fractal Geometry of Nature, rev. edn, New York: W. H. Freeman.
Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing, MIT Press.
Marcus, G. F. (1993) 'Negative evidence in language acquisition', Cognition, 46: 53-85.
Marks, L. E. (1968) 'Scaling of grammaticalness of self-embedded English sentences', Journal of Verbal Learning and Verbal Behavior, 7: 965-7.
Marshall, I. (1987) 'Tag selection using probabilistic methods'. In Garside, Leech and Sampson 1987: ch. 4.
Meiklejohn, J. M. D. (1902) The English Language: Its Grammar, History, and Literature, 23rd edn, Alfred M. Holden.
Mendenhall, W. (1967) Introduction to Probability and Statistics, 2nd edn, Belmont, California: Wadsworth.
Miller, G. A. (1957) 'Some effects of intermittent silence', American Journal of Psychology, 70: 311-14.
Miller, G. A. and Chomsky, A. N. (1963) 'Finitary models of language users'. In R. D. Luce et al. (eds) Handbook of Mathematical Psychology, John Wiley, vol. ii, ch. 13.
Miller, J. (1973) 'A note on so-called "discovery procedures"', Foundations of Language, 10: 123-39.
Montemagni, S. (2000) Review of van Halteren, Excursions in Syntactic Databases, International Journal of Corpus Linguistics, 5: 106-10.
Morley-Bunker, N. (1977) 'Speakers' intuitions of analyticity', unpublished Psychology BA thesis, Lancaster University.
Morris, J. (1974) Conundrum, Faber.
Mosteller, F. and Wallace, D. L. (1964) Inference and Disputed Authorship: The Federalist, Addison-Wesley.
Nadas, A. (1985) 'On Turing's formula for word probabilities', IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-33: 1414-16.
Obermeier, K. K. (1989) Natural Language Processing Technologies in Artificial Intelligence: The Science and Industry Perspective, Chichester, Sussex: Ellis Horwood.
Office of Population Censuses and Surveys (1991) Standard Occupational Classification, vol. 3: Social Classifications and Coding Methodology, HMSO.
O'Grady, W. (1996) 'Syntax: the analysis of sentence structure'. In O'Grady et al. 1996: ch. 5.
O'Grady, W., Dobrovolsky, M. and Katamba, F. (eds) (1996) Contemporary Linguistics: An Introduction, 3rd edn, Longman.
Ouhalla, J. (1999) Introducing Transformational Grammar: From Principles and Parameters to Minimalism, 2nd edn, Arnold.
Partee, B. H. (1975) 'Comments on C. J. Fillmore's and N. Chomsky's papers'. In Austerlitz 1975: 197-209.
Perks, W. (1947) 'Some observations on inverse probability including a new indifference rule', Journal of the Institute of Actuaries, 73: 285-312.
Pinker, S. (1994) The Language Instinct: The New Science of Language and Mind, New York: William Morrow. My page references relate to the reprinted version published by Penguin, 1995.
Popper, K. R. (1963) Conjectures and Refutations: The Growth of Scientific Knowledge, Routledge & Kegan Paul.
Popper, K. R. (1968) The Logic of Scientific Discovery, rev. edn, Hutchinson. (English translation of a book originally published in German in 1934.)
Press, W. H., Flannery, B. P., Teukolsky, S. A. and Vetterling, W. T. (1988) Numerical Recipes in C, Cambridge University Press.
Quine, W. van O. (1951) 'Two dogmas of empiricism', Philosophical Review, 60: 20-43. Reprinted as ch. 2 of W. van O. Quine, From a Logical Point of View: 9 Logico-Philosophical Essays, 2nd edn, New York: Harper & Row, 1963.
Quine, W. van O. (1960) Word and Object, Cambridge, Mass.: MIT Press.
Radford, A. (1988) Transformational Grammar: A First Course, Cambridge University Press.
Rahman, A. and Sampson, G. R. (2000) 'Extending grammar annotation standards to spontaneous speech'. In J. M. Kirk (ed.) Corpora Galore: Analyses and Techniques in Describing English, Amsterdam: Rodopi, 295-311.
Reich, P. A. (1969) 'The finiteness of natural language', Language, 45: 831-43. Reprinted as ch. 19 of Householder 1972.
Reich, P. A. and Dell, G. S. (1977) 'Finiteness and embedding'. In R. J. Di Pietro and E. L. Blansitt (eds) The Third LACUS Forum, 1976, Columbia, South Carolina: Hornbeam Press, 438-47.
Rosch, E. (1975) 'Cognitive representations of semantic categories', Journal of Experimental Psychology: General, 104: 192-233.
Rudner, R. S. (1966) Philosophy of Social Science, Prentice-Hall.
Saeed, J. I. (1997) Semantics, Oxford: Blackwell.
Sampson, G. R. (1987) 'Probabilistic methods of analysis'. In Garside, Leech and Sampson 1987: ch. 2.
Sampson, G. R. (1991) 'Natural language processing'. In C. Turk (ed.) Humanities Research Using Computers, Chapman & Hall, ch. 8.
Sampson, G. R. (1992) 'Probabilistic parsing'. In Svartvik 1992: 425-47.
Sampson, G. R. (1995) English for the Computer: The SUSANNE Corpus and Analytic Scheme, Oxford: Clarendon Press.
Sampson, G. R. (1999a) Educating Eve: The 'Language Instinct' Debate, rev. paperback edn, London: Continuum.
Sampson, G. R. (1999b) Documentation file for the CHRISTINE Corpus: URL 7.
Sampson, G. R. (2000) Review of Fellbaum 1998, International Journal of Lexicography, 13: 54-9.
Schütze, C. T. (1996) The Empirical Base of Linguistics: Grammaticality Judgments and Linguistic Methodology, University of Chicago Press.
Snow, C. and Meijer, G. (1977) 'On the secondary nature of syntactic intuitions'. In S. Greenbaum (ed.) Acceptability in Language, The Hague: Mouton, ch. 11.
Sokal, A. and Bricmont, J. (1998) Intellectual Impostures: Postmodern Philosophers' Abuse of Science, Profile. (Published in the USA under the title Fashionable Nonsense.)
Sommerville, I. (1992) Software Engineering, 4th edn, Wokingham, Berks.: Addison-Wesley.
Sproat, R., Shih, C., Gale, W. A. and Chang, N. (1994) 'A stochastic finite-state word-segmentation algorithm for Chinese', Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 66-73.
Stabler, E. P. (1994) 'The finite connectivity of linguistic structure'. In C. Clifton, Jr., L. Frazier and K. Rayner (eds) Perspectives on Sentence Processing, 303-36, Hillsdale, New Jersey: Lawrence Erlbaum.
Stafleu, F. A. (1971) Linnaeus and the Linnaeans, Utrecht: A. Oosthoek's Uitgeversmaatschappij.
Stemmer, N. (1972) 'Steinberg on analyticity', Language Sciences, 22: 24.
Stockwell, R. P., Schachter, P. and Partee, B. H. (1973) The Major Syntactic Structures of English, New York: Holt, Rinehart and Winston.
Svartvik, J. (ed.) (1992) Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82, Berlin: Mouton de Gruyter.
Taylor, L., Grover, C. and Briscoe, E. J. (1989) 'The syntactic regularity of English noun phrases', Proceedings of the Fourth Conference of the European Chapter of the Association for Computational Linguistics, UMIST (Manchester), April 1989, 256-63.
Tesnière, L. (1965) Éléments de syntaxe structurale, 2nd edn, Paris: Klincksieck.
Thomason, R. H. (1976) 'Some extensions of Montague grammar'. In B. H. Partee (ed.) Montague Grammar, Academic Press, 77-117.
Trudgill, P. (1990) The Dialects of England, Oxford: Blackwell.
Wald, B. (1998) Review of Schütze 1996, Language, 74: 266-70.
Weinberg, G. (1971) The Psychology of Computer Programming, New York: Van Nostrand Reinhold.
Weisberg, S. (1985) Applied Linear Regression, 2nd edn, Wiley.
Weiskrantz, L. (1997) 'Thought without language: thought without awareness?' In J. Preston (ed.) Thought and Language (Royal Institute of Philosophy Supplement, 42), Cambridge University Press, 127-50.
Wells, R. S. (1947) 'Immediate constituents', Language, 23: 81-117.
White, M. G. (1950) 'The analytic and the synthetic: an untenable dualism'. In S. Hook (ed.) John Dewey: Philosopher of Science and Freedom, New York: Dial Press, 316-30. Reprinted in L. Linsky (ed.) Semantics and the Philosophy of Language, University of Illinois Press, 1952, ch. 14.
Whitney, W. D. (1885) The Life and Growth of Language, 5th edn, Kegan Paul, Trench.
Wilson, A. (1952) Hemlock and After, Secker and Warburg.
Wittgenstein, L. (1953) Philosophical Investigations, Oxford: Blackwell.
Yngve, V. H. (1960) 'A model and an hypothesis for language structure', Proceedings of the American Philosophical Society, 104: 444-66.
Yngve, V. H. (1961) 'The depth hypothesis'. In Jakobson 1961: 130-8. Reprinted in Householder 1972: ch. 8.
Zaenen, A. (2000) Review of Johnson and Lappin, Local Constraints vs. Economy, Computational Linguistics, 26: 265-6.
Zipf, G. K. (1935) The Psycho-Biology of Language: An Introduction to Dynamic Philology, Houghton Mifflin. Reprinted by MIT Press (Cambridge, Mass.), 1965.
Zipf, G. K. (1949) Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology, Addison-Wesley. Reprinted by Hafner, 1965.
URL list
(1) http://www.hit.uib.no/icame.html/
(2) http://www.systransoft.com
(3) http://www.grsampson.net/Resources.html → SUSANNE Corpus
(4) http://www.grsampson.net/Resources.html → download SUSANNE
(5) http://info.ox.ac.uk/bnc/
(6) http://www.grsampson.net/Resources.html → CHRISTINE Corpus
(7) http://www.grsampson.net/Resources.html → CHRISTINE documentation
(8) http://www-tei.uic.edu/orgs/tei/
(9) http://www.ilc.pi.cnr.it/EAGLES96/intro.html
(10) http://coral.lili.uni-bielefeld.de/EAGLES/eagbook
(11) http://www.ldc.upenn.edu/readme_files/timit.readme.html
(12) http://www.itri.brighton.ac.uk/events/senseval
(13) http://www.cis.upenn.edu/~treebank/home.html
(14) http://cslu.cse.ogi.edu/HLTsurvey/
(15) http://www.grsampson.net/Resources.html → download Simple Good-Turing
Index
Frequently-used technical terms are indexed only for passages which contribute to defining them.

Aarts, J. 80
Add-Half 109, 111, 120 n. 16
Add-One 96, 109, 111, 120 n. 16
Add-Tiny 109, 111
additive method 96, 109-12, 118, 120 n. 17
Adler, A. 5
Affix Hopping 147-9, 152-3, 164 n. 1
Aho, A. V. 55 n. 1
Aitchison, J. 201
Akmajian, A. 141
Allen, D. E. 84
Ammon, U. 72 n. 1
analytic v. synthetic/analyticity 182, 199, 201-5, 207 n. 10
Anglin, J. M. 188-90
ANLT (Alvey Natural Language Tools) 173, 175
Apostel, L. 119 n. 6
applied linguistics 93 n. 1
Aristotle 205-6
Atkinson, M. 123-4
Austin, J. L. 207 n. 11
automata theory 162-3
Auxiliary Transformation, see Affix Hopping
Baayen, R. H. 120 n. 15
Bachenko, J. 97
Baker, C. L. 128
Baker, P. 73 n. 9
Bar-Hillel, Y. 206 n. 2
Barrett, K. 16
Benn, A. N. W. 88
Berber Sardinha, A. 12 n. 2
Bernstein, B. 8, 57-9, 64-5, 67, 72 n. 1
Bever, T. G. 39, 207 n. 12
Biber, D. 35 n. 1
Bierwisch, M. 184
binary branching 45
binomial distribution 98-9, 110, 113, 118, 118 n. 4, 119 n. 9, 120 n. 16
Bletchley Park 9, 94-5
Bloomfield, L. 68, 145-6
Bod, R. 173
Bolinger, D. 199
Bowerman, M. 130, 140 n. 2
Box, G. E. P. 118 n. 2, 120 n. 16
Boyle's Law 4-5
Braine, M. D. S. 128
Bricmont, J. 155
Briscoe, E. J. 173-5, 177, 179 n. 4
British National Corpus 57-72, 78
Brooks, P. J. 140 n. 2
Brown, R. 128
Brown Corpus 34, 41, 43, 55 n. 3, 56 n. 5, 79
Burt, M. K. 149-50
central embedding 13-23, 39
Chao, Y. R. 105
Chaucer, Geoffrey 191
Chitashvili, R. J. 120 n. 15
Chomsky, A. N. 1-2, 4, 6, 14, 68, 74-5, 80, 122-4, 131, 137-8, 140 n. 3, 141-64, 166, 173, 183, 204-5
CHRISTINE Corpus and analytic scheme 42, 57-72, 86-8
Church, K. W. 23 n. 1, 96, 98, 100-1, 113, 118 n. 3, 119 n. 10
Cohen, L. J. 207 n. 8
Cole, R. 81, 93 n. 2
competence 80, 84, 166, 169, 174, 176-8, 202
Complementizer Placement 150
componential analysis 181-4, 188, 199, 204
computational linguistics 6
conservatism in language acquisition 140 n. 2
context-free grammar 54, 145
contradictory sentences 137-9
Corbet, A. S. 96
corpus linguistics 6
critical period 67-71, 73 n. 11
cross-validation 109-12, 118
Culicover, P. W. 12 n. 4
Culy, C. 173-7, 179 n. 4
Cunningham, H. 93 n. 1
Curtiss, S. 68
Darwin, Charles 141
data-oriented parsing 173
database 178 n. 2
Date, C. J. 178 n. 2
De Roeck, A. 13-15, 19-20
degenerating research programme 5, 71
Deleted Estimate 109-10
Dell, G. S. 14, 19
dependency structure 56 n. 9
depth 37-56
depth-based measure 51-5, 56 n. 7
Dillon, G. L. 205
direct v. indirect speech 87-8
discontinuous constituent 145
discovery procedures 160-3
distinguisher 181
Do-Support 150
Dobrovolsky, M. 183
dysfluency, see speech repair
EAGLES (Expert Advisory Group on Language Engineering Standards) 90-1
Edwards, J. 85
Efron, B. 96
egoless programming 77
elaborated code 8, 57, 65
elementary transformation 152, 154-5, 164 n. 2
elicitation 122-3
Ellegård, A. 34, 35 n. 2, 55 n. 2
embedding index 59-70, 72 n. 4
Empson, W. 194
empty node 42-3; see also ghost
entailment 182-4, 207 n. 10
Equi NP Deletion 150
Erdmann, K. O. 199
Eriksson, G. 93 n. 3
essence/essentialism 189-92, 199-201
Estoup, J. B. 119 n. 6
evaluation measure 140 n. 3
Eversley, J. 73 n. 9
expected frequency 98-9, 101
expected likelihood estimator, see Add-Half
explanatory adequacy 161
falsifiability, see refutability
family resemblances 199, 201
fashion 186-9, 206 n. 1
Fellbaum, C. 182
Feynman, R. P. 19
Fienberg, S. E. 109, 120 n. 17
Fillmore, C. J. 136
finite-state model 14, 162-3
first-language acquisition, see language-acquisition
Fisher, R. A. 96, 118 n. 2, 120 n. 16
Fodor, J. A. 39, 181-3, 188, 190, 199-202, 204, 206
formal v. material mode 138-9
fractals 10, 12 n. 3, 93, 172
Francis, W. N. 145-6
Frege, G. 155
Fries, C. C. 136
Fromkin, V. 92, 139, 182, 195, 207 n. 5
Gale, W. A. vi, 95-120
Galileo 1
Garrett, M. F. 39
Garside, R. G. 22, 26, 85
Garton Ash, Timothy 21
Gazdar, G. J. M. viii, 149-50, 155-6
generative linguistics 11, 25, 143-5, 165-6
generative v. interpretative semantics 143
Genie 68
genre 7, 24-36, 55 n. 3, 178 n. 1
Ghezzi, C. 76, 82
ghost 42-3, 59, 61
Gibbon, D. 91
Gleick, J. 19
Good, I. J. viii, 9, 94-120, 179 n. 3
Good-Turing frequency estimation 94-120, 179 n. 3
Goodenough, W. H. 204
Goodman, L. A. 96
Goodman, N. 11, 131, 205
Graham, K. 207 n. 11
grammar 1, 122, 143
grammar discovery procedures, see discovery procedures
grammatical idiom, see idiom
Grant, C. 18
Gries, S. viii
Gropen, J. 128, 140 n. 2
Grover, C. 173
Hagège, C. 150
Haigh, R. vii, 24-36
Hanlon, C. 128
Harris, Randy A. 143
Harris, Robert 95
Harris, Z. S. 143-5, 151, 161
Hastings, Battle of 126
head 56 n. 9
Heasley, B. 205
held-out estimator 110, 115
Hempel, C. G. 131
Highway Code 138
Hilbert, D. 155
Hill, J. H. 206 n. 4
Hockett, C. F. 144-5
Hodges, A. 77
Hoey, M. 6
Hofland, K. 29, 55 n. 3
Holland, P. W. 109, 120 n. 17
Hopcroft, J. E. 55 n. 1
Householder, F. W. 136, 166
Hudson, R. A. 32
Hurford, J. R. viii, 205
Hutchins, W. J. 77
idiolect 137
idiom 36 n. 3, 43, 59
immediate constituent/immediate constituent analysis 30, 145
immediate-dominance constraints 56 n. 4
indirect speech 87-8
inference/inference rule 190, 202, 206, 207 n. 13; see also entailment; meaning postulate
inferiority complex 5
innate ideas/innate knowledge 11, 13, 80, 84, 128, 130, 136, 143-4, 181, 191, 195, 207 n. 6
interpretative semantics 143
introspection 2-5, 8, 10, 22, 79-80, 122-40, 157-60, 202
intuition, see introspection
Jacobs, R. A. 149
Jazayeri, M. 76
Jeffreys, H. 96
Jelinek, F. 109-10
Jerome, Jerome K. 20
Johansson, S. 29, 55 n. 3
Johnson, W. E. 96
Katamba, F. 183
Katz, J. J. 74-5, 125, 181-3, 188, 190, 199-204, 206, 207 n. 12
Katz, S. M. 120 n. 15
Khrushchev, Nikita S. 156
Kilby, D. 123-4
Kilgarriff, A. 200
Kimball, J. P. 164 n. 2
kinship terms 195-7, 204-5
Knuth, D. E. 55 n. 1
Kornai, A. 104
Kristeva, Julia 155
Kruskal, J. B. 98
Labov, W. M. 3-5, 14-15, 125, 136, 139, 158, 189-91
Lacan, Jacques 155
Lakatos, I. 5, 71, 147, 199, 207 n. 11
Lakoff, G. 201
Lancaster-Leeds Treebank 26-36, 41, 83-5, 89, 92, 166-75
Lancaster-Oslo/Bergen Corpus, see LOB Corpus
Langendoen, D. T. 41, 122-6, 159
language-acquisition 67-71, 128-30, 186
language-acquisition device 67, 70
language engineering 75, 86, 93 n. 1
Le Bailly, L. 18
Leech, G. N. 22, 26, 83, 85, 90
Lees, R. B. 39
Lenneberg, E. H. 68, 70, 72-3 n. 8
Lesk, M. 22-3, 78
Lidstone, G. J. 96
likelihood 118 n. 2
lineage 44
linear-precedence constraints 56 n. 4
linguistic semantics 205
Linnaeus, C. 84, 93 n. 3
LOB Corpus 16, 22, 25-35, 79, 90, 167, 178 n. 1
Locke, J. 191
London-Lund Corpus 86
Lounsbury, F. G. 204
Lyons, J. 22, 39, 182, 184, 205
McCawley, J. D. 156
McEnery, A. M. 85
McGilvray, J. 205
McNeil, D. 96
Malinowski, B. 196
Mandelbrot, B. 10, 119 n. 6
Mandrioli, D. 76
Manning, C. D. 9
many-place predicate, see multiple-place predicate
Marcus, G. F. 128
marker 181-9, 193, 195-6, 200, 202, 204, 207 n. 6
Markovian utterance 88
Marks, L. E. 18
Marshall, I. 96
material mode 138-9
maximum likelihood estimator 95-6, 118 n. 2, 120 n. 16
meaning-component, see marker
meaning postulate 184, 196
Meijer, G. 17, 129
Meiklejohn, J. M. D. 165
memory limits 7, 37, 39, 55, 157
Mendenhall, W. 30, 63, 70
Mentalese 181
Mercer, R. 109-10
Miller, G. A. 14, 119 n. 6
Miller, J. 161
Montague, R. 139
Montemagni, S. 178 n. 2
Moore, R. K. 91
Morf, A. 119 n. 6
Morley-Bunker, N. 203-4
Morris, J. 196-7
Mosteller, F. 118 n. 4
multiple central embedding, see central embedding
multiple-place predicate 183-4
Nadas, A. 109
nativism, see innate ideas
negative binomial distribution 118-19 n. 4
negative evidence 126-9, 139, 141
nonsensical sentences 137-9
O'Grady, W. 141, 183
Obermeier, K. K. 81
one-place predicate 183-4
one-sided linear grammar 162
Ouhalla, J. 139, 147
Partee, B. H. 149, 195
performance 80, 166, 169, 174, 176-7, 202
Perks, W. 109
phrase-structure grammar/rules 56 nn. 4 & 9, 143-5, 147-50, 153, 156, 162-3, 175
Pike, K. L. 145
Pinker, S. 11, 68, 130, 140 n. 2, 181
Popper, K. R. 4-5, 129-30, 137, 140 n. 2, 142, 161, 180, 186, 199, 201
population frequency 95, 118 nn. 1 & 2
population probability 118 n. 1
positive any more 3-4
Postal, P. M. 126
potential falsifier 5
Powell, J. Enoch 20
Press, W. H. 107
production 24-5
production-based measure 51-5, 56 n. 7
progressive research programme 5
proper analysis 153, 155
prototype theory 201
puberty 68, 70-1
Pulleyblank, E. G. 17
Pythagoras's theorem 125
Quennell, Sir Peter 21
Quine, W. van O. 140 n. 3, 199, 201-2, 204
Radford, A. 139
Rahman, A. M. 86, 88
Ransome, A. M. 16
raw proxy 102, 106, 112, 115, 118, 119 n. 8
Ray, J. 35, 40, 84
realization-based measure 51-3, 56 n. 7
recursion 58, 133
refutability 4-5, 129, 140 n. 3, 180-1
Registrar-General's social classification 61-3, 72 n. 5
regression analysis 107, 120 n. 14
Reich, P. A. 14, 18-19
relational database management systems 178 n. 2
restricted codes 8, 57, 65
right-branching 22, 37-9
Roca, I. 123-4
Rodman, R. 92, 139, 182, 195, 207 n. 5
rootrank node 43
Rosch, E. 201
Rosenbaum, P. S. 149
Ross, D. 205
Rudner, R. S. 131
Saeed, J. I. 182, 184, 201
sample frequency 95
Saussure, M.-F. de 144
Schachter, P. 149
Schleicher, A. 144
Schütze, C. T. 128
Schütze, H. 9
Schützenberger, M. P. 153
semantic anomaly 199; see also contradictory sentences
semantic atom/feature/marker/primitive, see marker
Senseval 200
sex-change 196-8, 205
Shakespeare, William 191
Sharman, R. 12 n. 3, 93, 172
Simple Good-Turing frequency estimator 95-120
simple idea 191
singulary branching 43-4, 54
smoothed proxy 102, 106, 112, 115, 118, 119 n. 8
Snow, C. 17, 129
sociolinguistic code 8, 57-8, 65, 72 n. 1
software crisis 75-7
software engineering 75-7, 81, 90
Sokal, A. 155
Somers, H. L. 77
Sommerville, I. 76
speech code, see sociolinguistic code
speech repair 58-9, 61, 86-7, 157
speech-management phenomena 42, 86
Sproat, R. 104
Stabler, E. P. 23 n. 1
Stafford, G. H. 73 n. 10
Stafleu, F. A. 93 n. 3
steady state 68-9
Stemmer, N. 202
Stockwell, R. P. 149
strong generative capacity 157
Stuttaford, T. 21
subordinate conjunct 61, 72 n. 4
Subset Principle 130
SUSANNE Corpus and analytic scheme 36 n. 2, 41-56, 72 n. 4, 89, 116, 167-8
synthetic, see analytic v. synthetic
systematics 92
Systran 77
tacit knowledge 158-9; see also innate knowledge
Tag Formation/tag question 126, 150
tagma 13
taxonomic model 74, 162
Taylor, L. 173-4
Tesnière, L. 56 n. 9
Text Encoding Initiative 90
Thisted, R. 96
Thomason, R. 139, 140 n. 4
Tiao, G. C. 118 n. 2, 120 n. 16
TIMIT database 97, 99
Toma, P. 77
Tomasello, M. 140 n. 2
Toulmin, G. H. 96
trace, see ghost
Transformational Grammar 141-64
treebank 26, 35-6 n. 2, 89, 178 n. 2
Trudgill, P. 62
truth of logic 207 n. 13
Turing, A. M. 9, 77, 94-120, 179 n. 3
two-way cross-validation, see cross-validation
type collection 89
Ullman, J. D. 55 n. 1
Universal Grammar 128
unseen species 95-6, 107, 110, 112, 115-17, 178 n. 2
van den Heuvel, T. 80
von Humboldt, W. 198, 207 n. 7
Waegner, N. 174
Wald, B. 128
Wallace, D. L. 118 n. 4
weak generative capacity 157-8, 163
Webb, I. R. 20-1
Wegener, A. L. 141
Weinberg, G. 77
Weisberg, S. 120 n. 14
Weiskrantz, L. 206 n. 4
Wells, R. S. 145
Wheeler, Max viii
Whitby, Synod of 126
White, M. G. 199, 201
Whitney, W. D. 68
wild children 67-8
Williams, C. B. 96
Wilson, A. F. J. 192-4
Winski, R. 91
Wittgenstein, L. 199, 201, 205
word-sense disambiguation 200
WordNet 182
Yngve, V. H. 7, 22, 35-56
younger sister 44
Zaenen, A. 150
Zipf, G. K. 101
Zipf's Law 108, 119 n. 6