Corpus Linguistics: Refinements and Reassessments
LANGUAGE AND COMPUTERS: STUDIES IN PRACTICAL LINGUISTICS No 69
edited by Christian Mair Charles F. Meyer Nelleke Oostdijk
Corpus Linguistics: Refinements and Reassessments
Edited by
Antoinette Renouf and Andrew Kehoe
Amsterdam - New York, NY 2009
Cover image: Collocational “heat map” for the word credit (detail); from the paper “Weaving web data into a diachronic corpus patchwork”, by Andrew Kehoe & Matt Gee. Cover design: Pier Post The paper on which this book is printed meets the requirements of "ISO 9706:1994, Information and documentation - Paper for documents Requirements for permanence". ISBN: 978-90-420-2597-4 E-Book ISBN: 978-90-420-2598-1 ©Editions Rodopi B.V., Amsterdam - New York, NY 2009 Printed in The Netherlands
Contents

Introduction
Antoinette Renouf and Andrew Kehoe

1. Looking more closely at existing boundaries of the discipline

Corpus linguistics meets sociolinguistics: the role of corpus evidence in the study of sociolinguistic variation and change
Christian Mair

Creating corpora from spoken legacy materials: variation and change meet corpus linguistics
Joan C. Beal

Discourse linguistics meets corpus linguistics: theoretical and methodological issues in the troubled relationship
Tuija Virtanen

'Tis well known to barbers and laundresses: Overt references to knowledge in English medical writing from the Middle Ages to the Present Day
Turo Hiltunen and Jukka Tyrkkö

Comparing type counts: The case of women, men and -ity in early English letters
Tanja Säily and Jukka Suomela

2. Examination of a known language feature from a new point of view

Does English have modal particles?
Karin Aijmer

A reassessment of the syntactic classification of pragmatic expressions: the positions of you know and I think with special attention to you know as a marker of metalinguistic awareness
Julie Van Bogaert

The functions of expletive interjections in spoken English
Magnus Ljung

3. Examination of the potential of a new corpus, tool, model or technique to extend linguistic knowledge

Change and constancy in linguistic change: How grammatical usage in written English evolved in the period 1931-1991
Geoffrey Leech and Nicholas Smith

Joseph Wright’s ‘English Dialect Dictionary’ in electronic form: A critical discussion of selected lexicographic parameters and query options
Alexander Onysko, Manfred Markus and Reinhard Heuberger

How representative are the ‘Philosophical Transactions of the Royal Society’ of 17th-century scientific writing?
Lilo Moessner

A multi-dimensional analysis of a learner corpus
Bertus van Rooy and Lize Terblanche

Weaving web data into a diachronic corpus patchwork
Andrew Kehoe and Matt Gee

4. Re-examination of known linguistic phenomenon in light of further/new data

“To each reader his, their or her pronoun”. Prescribed, proscribed and disregarded uses of generic pronouns in English
Elisabetta Adami

The interpersonal function of going to in written American English
Anna Belladelli

Re-analysing the semi-modal ought to: an investigation of its use in the LOB, FLOB, Brown and Frown corpora
Marta Degani

On the use of split infinitives in English
Javier Calle-Martín and Antonio Miranda-García

Exploring change in the system of English predicate complementation, with evidence from corpora of recent English
Juhani Rudanko

Encoding of goal-directed motion vs resultative aspect in the COME + infinitive construction
Sara Gesuato

A corpus-based analysis of invariant tags in five varieties of English
Georgie Columbus

Discourse presentation in EFL textbooks: a BNC-based study
Christoph Rühlemann

Awful adjectives: a type of semantic change in present-day corpora
Göran Kjellmer

5. Discussion Panel

Global English – Global Corpora: Report on a panel discussion at the 28th ICAME conference
Marianne Hundt
Introduction
Corpus Linguistics: Refinements and Reassessments
Antoinette Renouf and Andrew Kehoe
Research & Development Unit for English Studies, Birmingham City University

Stratford-upon-Avon as conference venue for the 28th International ICAME Conference provided an ideal setting for a field which sits on a methodological continuum of word-based English textual enquiry stretching from the index verborum, primarily biblical, of the years of early printing1, to today’s technologically full-blown corpus-based studies, by way of the miscellany of ‘partial’ and ‘complete’ concordances of Shakespeare2, produced with lesser or greater degrees of computational assistance, over the past 200 years. That continuum inevitably encompasses an evolution in the definitions and assumptions underlying notions such as ‘index’ and ‘concordance’ which are central to the study of English corpora. Throughout history, linguists and literary scholars have been impelled by their curiosity about a particular linguistic or literary phenomenon to seek to observe it in source texts by means of the prevailing technological tools. The fruits of each earlier enquiry in turn nourish the desire to acquire further knowledge, through more detailed or extensive observation, of other or newer linguistic facts becoming available at the frontiers of newer technology. As time goes by, the corpus linguist operates increasingly from a position of awareness of the known linguistic facts, the standard methodologies, the existing corpora and the available tools and text-processing technology. Corpus Linguistics, thirty years on, is less characterisable as an innocent sortie into corpus territory on the basis of a hunch, and increasingly as an informed, critical reassessment and/or extension of existing analytical orthodoxy and descriptions, in the light of the potential offered by new data and tools coming on stream.
The role of ICAME conference host afforded us the opportunity to foreground this aspect of corpus linguistics, and accordingly, the theme of our conference was ‘Corpus Linguistics Reassessed’. The response to this invitation was rich and, though diverse, showed that critical and informed reappraisal of the available facts, data, methods and tools was indeed a central preoccupation of the corpus linguistic research community. The title of this volume is thus ‘Corpus Linguistics: Refinements and Reassessments’. The selected papers, whilst categorisable across all these aspects, are grouped under the following headings:

1. Looking more closely at existing boundaries of the discipline
2. Examining a known language feature from a new point of view
3. Examining the potential of a new corpus, tool, model or technique to extend linguistic knowledge
4. Examining a known linguistic phenomenon in the light of further/new data
5. Discussion Panel
1. Looking more closely at existing boundaries of the discipline
Christian Mair opens section one on cross-boundary studies with a paper which looks beyond corpus linguistics, to the issues arising at its intersection with sociolinguistics. He shows how corpus data can provide new insights into sociolinguistic variation and change, specifically into patterns of variation not noticed or accurately described in previous sociolinguistic research, with reference to new data: the Jamaican component of the International Corpus of English. Joan Beal examines the intersection between traditional corpus linguistics and variationist studies, the latter traditionally focussing on spoken language and collecting private data sets. Professor Beal discusses 1960s Tyneside speech, the challenges and solutions involved in converting data on audio tape into a conventional corpus (NECTE), and plans for developing further corpora and common standards. Tuija Virtanen explores the ‘troubled relationship’ between corpus linguistics and discourse linguistics. She considers the theoretical and methodological issues involved in the application of corpus linguistic techniques to discourse analysis. She acknowledges that the two fields are difficult to interweave, but sets out the primary areas of commonality, discussing the potential benefits to practitioners in both fields of combining forces. Turo Hiltunen and Jukka Tyrkkö explore the intersection between traditional corpus linguistics and one aspect of discourse linguistics, namely discourse analysis. They examine the benefits of using corpus-linguistic techniques and tools to search for key lexis in the diachronic study of certain discourse features from Late Middle English onwards. This sheds light on unexplored discourse features and suggests interesting new hypotheses. 
Tanja Säily and Jukka Suomela venture beyond the standard repertory of corpus linguistic methods of quantification, and draw on the field of lexical statistics for more advanced measures, namely non-parametric statistics, in order to study morphological productivity and gender issues in a corpus of early English letters.

2. Examining a known language feature from a new point of view
Karin Aijmer opens this section with a re-analysis of English modality in the light of translation correspondences across parallel corpora. Professor Aijmer builds on her argument for the existence of a ‘modal particle’ in English, this time with reference to the discourse marker of course, which example she uses to demonstrate that ‘discourse marker’ and ‘modal particle’ are not just alternative labels for the same concept, but denote a functional split. Aijmer is one of the inspirations for Julie Van Bogaert’s study of the pragmatic expressions you know and I think, which she points out have been referred to as both ‘modal particle’ and ‘discourse particle’ by Aijmer (1997,
2002). Van Bogaert reassesses the syntactic classification of these pragmatic expressions in the literature, and overcomes limitations found there with a new classificatory system based on ‘scope’. Magnus Ljung makes novel use of an existing linguistic model of spoken interaction (Stenström, 1994) to conduct a pragmatic reassessment of expletive interjections. He acknowledges that the notion of interjections being pragmatic markers is controversial, but references Aijmer (2002) as supporting his position.

3. Examining the potential of a new corpus, tool, model or technique to extend linguistic knowledge
Geoffrey Leech and Nicholas Smith contribute the first paper to this section, reporting on their exploitation of the important new Lancs-31 corpus, the final part of the trio of corpora of text covering the period 1931 to 1991, to reassess how far trends of change already observed in the comparison of LOB (1961) and FLOB (1991) have themselves been undergoing change over the period in question, and to suggest motivations for aspects of ‘grammaticalization, colloquialization, Americanization and densification’. The next two papers each discuss the benefits and challenges of transforming an existing electronic textual data resource into a corpus. Alexander Onysko, Manfred Markus and Reinhard Heuberger discuss critically issues of digitisation, dialectology, lexicography and computational linguistics in the processing of Joseph Wright’s English Dialect Dictionary. Lilo Moessner examines critically the Philosophical Transactions of the Royal Society and the degree to which they can be deemed representative of 17th-century scientific writing. She achieves this by submitting them to Biber’s multidimensional analysis. Bertus van Rooy & Lize Terblanche also take Biber’s model but they adapt it to separate ‘style dimensions from grammar and information presentation dimensions in a way that the original model did not allow’. Their new multidimensional model allows the authors to move on from isolated linguistic features and examine their combined functional effects. The Tswana Learner English corpus is compared to the Louvain LOCNESS corpus. Andrew Kehoe and Matt Gee round off section three with an account of the special role of the WebCorp Linguist’s Search Engine in supplementing the picture of language provided by existing corpora, not simply by supplying the latest coinages from the web but by filling in vital information gaps about lexical change across time in British and American English from a ‘patchwork’ of corpora.
They distinguish this approach from that of Leech and Smith, who take the ‘thirty-year interval’ approach to the study of grammatical change.
4. Examining a known linguistic phenomenon in the light of further or newer data
Section four gathers together a set of nine papers in which authors assess the literature on a known feature of language, and then seek to extend the established description in one direction or another with reference to further, often newer, data. Most take as their object of study an aspect of grammar, though where they steer their fresh investigation varies, facilitated by the nature of the new data which is consulted. Elisabetta Adami reassesses the uses of generic pronouns, contrasting established descriptions with her new findings in recent British and American corpus data, namely the academic written sections of BNC, ANC, the Brown family, and several ICE components. Several writers bring a newly diachronic perspective to existing studies. Anna Belladelli takes a diachronic look at the causes of a spread in the use of going to, going beyond the ‘colloquialisation’ explanation offered by others, including Leech and Smith (this volume), with reference to the Brown and Frown corpora of American English. Marta Degani fills a gap in the description of modals by analysing the use of the hitherto poorly investigated semi-modal ought to more fully, from a similarly short-term diachronic perspective, in the Brown corpus family of British and American English. She finds that her data confirm the general pattern of decrease in the frequency of modal verbs from the period found by Leech (2003, 2004, 2006 and this volume), and ‘sustain Leech’s observation that this decline has been more drastic in the case of infrequent modals such as shall, ought to and need (Leech 2003: 228-9)’. Javier Calle-Martín and Antonio Miranda-García seek to account for a longer-term diachronic change, reporting on their survey into existing work on the use and acceptability of split infinitives from the 17th Century to the present day. They are able to improve on this through the evidence provided by the Lampeter Corpus of Early Modern English Tracts, CLMET, CEN and the BNC. 
Two writers extend existing descriptions by taking a semantic perspective: Juhani Rudanko reassesses English predicate complementation in this light, using CLMET (3rd part) and the ‘UK Books’ subcorpus of the Collins-Cobuild Demonstration Corpus; while Sara Gesuato supplements existing descriptions of complex predicates with new findings in the Collins-Cobuild Bank of English online about the semantic preferences, as well as the frequency and syntactic environments, of resultative come constructions. Adding a variationist component to existing descriptions of the sociolinguistic features and functions of single invariant tags, Georgie Columbus moves beyond individual language varieties to devise a full corpus linguistic description of the class conducted across five ICE corpus varieties of English (British, Indian, New Zealand, Singapore and Hong Kong). Meanwhile, Chris Rühlemann takes a new, discourse-oriented perspective on the class of reporting verb BE + like. Building on previous studies, Rühlemann
examines this structure in relation to its presentation in the BNC and in EFL textbooks. Göran Kjellmer rounds off section four by shifting the focus from grammar to lexis, and in particular to lexical semantics, and studies the change undergone in the CobuildDirect corpus by adjectives conventionally expressing the sense of ‘awfulness’.

5. Discussion Panel
Section five reports on an ICAME panel discussion, entitled ‘Global English – Global Corpora’. A panel consisting of Anna Mauranen, Joybrato Mukherjee and Pam Peters, with Marianne Hundt as Chair, take the timely opportunity to air their views on what are widely-used varieties of ‘International English’, touching on a number of issues ranging from ‘ownership’ to whether adequate descriptions are available or even possible from the language learning point of view. Peters assesses the adequacy of language corpora to support such ambitions, deciding that there is a need for improvement, not just in corpus content but in range; a concluding note to the panel report also criticises a current lack of corpus compilation documentation which could ensure caution in interpretation. Mauranen and Mukherjee usefully set up an opposition on the status of these language variants. Mukherjee sees ELF (English as a lingua franca) not as ‘a well-defined variety of English’ but as ‘an umbrella term for a multitude of variants’, a ‘makeshift code’ without a locality; while Mauranen asserts that ‘many communities of practice have adopted ELF and their de facto language, and… the ensuing norms of use are regulated by the participants…ELF is also the language of wide and diffuse networks of uses and users’. Questions from the floor are summarised, together with discussion on such issues as accommodation, nativeness, norms and ‘common core’ English. The assembled gathering concludes that the international core of English cannot yet be described; that ‘ownership’ is still a controversial question; and that what Mair & Mollin (2007) call ‘standard ideology’ is an issue affecting the status of ELF and norms for teaching.

Notes
1
The first concordance to the New Testament in English was published in London ca. 1535 by Thomas Gybson; the first English concordance to the whole Bible was that of John Marbeck (London, 1550); Alexander Cruden's concordance to the whole English Bible was completed in 1737 (London, 1738).
2
Shakespeare concordances were first created manually, as in Bartlett (1889) or Steveson (1953), and were later derived electronically, as in Spevack (1968-80).
References
Bartlett, J. (1960 [1889]), A Complete Concordance or Verbal Index to Words, Phrases and Passages in the Dramatic Works of Shakespeare with a Supplementary Concordance to the Poems. London: Macmillan.
Mair, C. & S. Mollin (2007), “Getting at the standards behind the standard ideology: what corpora can tell us about linguistic norms”, in: S. Volk-Birke and J. Lippert (eds.) Anglistentag 2006 Halle: Proceedings, Trier: WVT, 341-353.
Spevack, M. (1968-1980), A Complete and Systematic Concordance to the Works of Shakespeare. 9 vols. Hildesheim: Georg Olms.
Steveson, B. (1953), The Folger Book of Shakespeare Quotations, New Jersey: Folger.
Corpus linguistics meets sociolinguistics: the role of corpus evidence in the study of sociolinguistic variation and change
Christian Mair
University of Freiburg

Abstract
The contribution opens with a general discussion of the relationship between sociolinguistics and corpus-linguistics. The point is made that while the concerns of these two traditions in the study of linguistic variability and variation were rather different at the outset they have meanwhile developed in such a way as to make co-operation fruitful and, indeed, necessary. This point is illustrated from the author’s own work on the recently completed Jamaican component of the International Corpus of English. The variables analysed are the use of person(s) as a synonym for people, the presence or absence of subject-verb inversion in questions, the modals of obligation and necessity, negative and auxiliary contraction and, finally, the use of the “new” quotative be like.1
1. Introduction
By a chronological accident both computer-aided corpus linguistics and variationist sociolinguistics emerged as new subfields of linguistic research at about the same time – in the early 1960s. Both, as we know, have gone on to expand and prosper. However, in the early days there was little to suggest that important contact zones might develop in which the two fields would cross-fertilise in unforeseen ways. In early corpus linguistics an understandable bias developed towards the study of the written standard (that is precisely the variety which remained outside the scope of classical sociolinguistics) and towards the study of lexico-grammar (whereas the investigation of phonetic variation dominated in early sociolinguistics). With few commendable exceptions, such as, for example, the London-Lund Corpus, which contained extensive prosodic mark-up, corpora of spoken English reduced the complexity of live speech to orthographic transcription, thus rendering the material unsuitable for the study of pronunciation. This bias towards written and standard English in corpus linguistics is now gradually being redressed. Owing to the immense amount of work necessary in the compilation, there is still a dearth of spoken-language corpora which allow access to pronunciation and prosody, but unlike the earliest corpora of spoken English more recently compiled resources such as the British National Corpus (spoken-demographic component) or the Longman Corpus of Spoken American English (http://www.pearsonlongman.com/dictionaries/pdfs/Spoken-American.pdf) make available the speech of a broad social range of informants. Corpora devoted
to the New Englishes and emerging standards inevitably contain instances of nonstandard usage, and a small number of corpus projects – such as the Freiburg Corpus of English Dialects (FRED) or the Lancaster Corpus of Written British Creole – are explicitly devoted to the documentation of non-standard varieties. In sociolinguistics there has been a similar broadening of the database. Whereas in the early days the focus was almost exclusively on the spontaneous language use of precisely defined “local” speech communities, recent work has placed emphasis also on communities of practice, larger, more unstable and more difficult-to-define networks of communication, frequently characterised by elements of stylized and conscious language use.2 One result of this trend has been that public speech, language use in the media and even written language are no longer beyond the pale in sociolinguistics. Consider, for example, an important recent (2003) special issue of the Journal of Sociolinguistics on “Sociolinguistics and globalisation,” which will be referred to again in Section 7 below and which, alongside more mainstream sociolinguistic fare, devotes three articles to subjects such as “Global schemas and local discourses in Cosmopolitan” (Machin & Leeuwen 2003), language use in Japanese rap music (Pennycook 2003) or inflight magazines (Thurlow & Jaworski 2003). The technicalities of corpus compilation and use of corpora came to the fore as one of the central concerns at a recent major sociolinguistics conference (cf. Beal et al., eds. 2007). The successive widening of the database both in corpuslinguistics and in sociolinguistics has led to a blurring of formerly fixed boundaries and the emergence of a contact zone between the two subfields.
A corpus linguist working on the spoken-demographic portions of the BNC requires profound knowledge of the urban dialectology of contemporary Britain; conversely, the rapidly growing number of publicly available corpora of English contains an increasing amount of material which sociolinguists would disregard at their peril. We have thus arrived at a situation in which the question providing the title for Meyer (2004) – “Can you really study language variation in linguistic corpora?” – tends to convey not so much genuine scepticism as a note of irony and mock-disbelief. One controversial point between sociolinguists and corpuslinguists will probably remain the definition of what constitutes proper fieldwork. To the purist, true fieldwork requires that the researcher has full control over every aspect of data collection, annotation and processing. However, less risky and less laborious strategies – such as researchers inviting international student informants into their offices to elicit data on non-standard usage – have been known to be honoured by the encomium “field work”. On such a more generous definition, a corpus linguist looking for instances of the affirmative aye in the speech of middle-class and working-class males in the spoken-demographic portions of the BNC could well claim that he or she was engaged in sociolinguistic fieldwork of sorts. To sound the programmatic claim that corpuslinguistics and sociolinguistics have now developed to a stage where they simply must pool their resources for mutual benefit is one thing. To look for existing successful
corpuslinguistic contributions to variation studies which might impress sociolinguists sufficiently to consider closer cooperation is, of course, another. Corpus studies can boast a proud record in one area of variation which is somewhat marginal to sociolinguistics, i.e. the study of variability within the standard conditioned by style, register, medium (speech/writing) or text type (see Johansson (forthcoming) for a convenient summary). The work of Douglas Biber and his associates may be singled out here, both for its quality and originality and for its comprehensiveness, in that it places equal emphasis on synchronic regional and stylistic variability (Biber 1988, Biber, ed. 1994, Biber et al. 1999) and diachronic change (Biber & Finegan 1989, Biber 2003). Another area of success comes a little closer to the core concerns of sociolinguistics: empirical documentation of regional variation in standard Englishes around the world. The study of contrasts between British and American English, first on the basis of the Brown and LOB corpora and subsequently including many further resources, has been one mainstay of corpuslinguistic research since its inception. Prominent among current projects devoted to this problem is, of course, the International Corpus of English (ICE – see Greenbaum, ed. 1996). Interestingly enough, however, the most substantial dialogue between corpuslinguistics and sociolinguistics so far has developed not around the study of present-day English but of variability and change in older stages of the language. Corpus-based historical sociolinguistics has already come to the fore as a mature area of research (Nevalainen & Raumolin-Brunberg 2003, Nevalainen, ed. 2006) – probably because in this area the data is scant and no battles of faith can arise about the proper methods of fieldwork. 
After this general introductory survey, I will focus on the discussion of specific empirical and theoretical issues which are bound to arise when corpuslinguistics meets sociolinguistics. I will do so mainly on the basis of my own experience working on the recently completed Jamaican component of the International Corpus of English (ICE).

2. ICE Jamaica: potential and limitations of corpus-based sociolinguistics
Linguistic research on the language situation in the Anglophone Caribbean has traditionally focused on the English-lexifier creole languages of the region (or the basi- and mesolectal parts of the creole-English continuum), neglecting the emerging local variety of standard English. To redress this imbalance, the English Department of the University of Freiburg and the Department of Language, Linguistics and Philosophy at the University of the West Indies, Mona, Jamaica, have cooperated to produce the Jamaican component of the International Corpus of English (ICE). In line with ICE guidelines,3 the corpus comprises about one million words, sampled over a broad range of written and spoken textual genres but generally produced by educated speakers (and not a demographically representative cross-section of the population as a whole).
With text-collection, transcription and mark-up approaching completion, project-related research is currently moving from the pilot stage into the main phase. The project aims at contributing toward a linguistic geography of English in the Caribbean by providing detailed phonetic and lexico-grammatical descriptions of Jamaican English, as well as by examining important pragmatic and sociolinguistic aspects of the use of this variety by educated Jamaicans, including its use in code-switching with Creole/Creolised English. Furthermore, it is hoped that our results will help to shed further light on questions of standardisation in the context of English as a world language, by comparing the language situation of former colonies with English as a native language (e.g. New Zealand) or a second or official language (e.g. India) to that of the Caribbean, which is of particular interest in this respect due to the existence of its creole substrate. Such “cross-variety,” comparative research is much needed in studies on World Englishes and was one of the foremost research goals envisaged by the founders of the ICE project. Important among the “beyond the corpus” questions are attitudes towards this emerging standard held by speakers and writers and its position with regard to Jamaican Creole, the local mass vernacular. The emerging Jamaican standard is being shaped by three major forces:

(i) the persistent but probably declining influence of a traditional colonial British norm;
(ii) growing influence from the US;
(iii) growing direct and indirect influence of the Jamaican Creole substrate.
In addition to these, some independent innovation of the type to be expected in any living language is likely to be encountered, as well. Clearly, none of the available ICE corpora was originally designed for sociolinguistic research. The focus was on regional variability in standard English, on the documentation of the New Englishes, including the second-language varieties that have arisen in the wake of decolonisation in the second half of the 20th century, and on stylistic variation within any one of these standards. High hopes were pinned on the opportunity to compare features across varieties in currently ten, and ultimately sixteen, parallel corpora.4 Indeed, this comparative perspective figures prominently in current research undertaken on the basis of ICE Jamaica. Thus, Andrea Sand (2004, and forthcoming) has used ICE Jamaica in conjunction with several other ICE corpora in order to identify the pre-determined breaking points in English grammar or, in other words, those intransparent or otherwise fragile areas of the linguistic system which will give rise to variability whenever the language is transported into new regions, adopted by new groups of speakers as a second or first language or even learned by foreigners. The focus in this type of corpus-based variation studies is on grammatical theory and typology as much as on the narrowly sociolinguistic issues of community-internal social variation and the assignation of prestige and stigma to variant forms of a given variable.
Being a sample of the local acrolect or emerging standard, ICE Jamaica is obviously unsuitable as a stand-alone resource for a sociolinguistic investigation of the use of English in Jamaica. Any analysis based on it would have to be complemented by studies of language use in the mesolectal range (such as were carried out – using a Labovian approach – by Patrick (1999)). As I intend to demonstrate in the following five case studies, though, ICE Jamaica does have considerable sociolinguistic potential once ways are found to identify that portion of corpus-internal variability which is sociolinguistically relevant. In other words, the question is how to use the corpus in order to access and reconstruct a sociolinguistic space beyond the corpus. The first of the variables to be investigated is a lexical one – choice between neutral people and formal persons to refer to a plurality of human beings. The second and third – subject-operator inversion in main-clause wh-questions and modal expressions of obligation and necessity – are grammatical. The fourth is morphological in terms of form, but pragmatic-stylistic in terms of textual function: choice between full and cliticised or contracted forms of certain auxiliaries and the negator not. The fifth and final phenomenon to be looked at will be instances of the “new” quotative be like in Jamaican English. At first sight, this seems to be a straightforward case of lexical innovation under American influence, but on closer inspection it turns out to involve complicated discursive processes of the “globalisation of vernacular features.”5

3. Too much person? “Person/people” as a sociolinguistic marker in Jamaican English
Before becoming tangled in the complexities of the Creole-English continuum which informs the actual use of English in Jamaica, it is useful to establish its two extreme ends with regard to the variable studied here. In traditional Creole the noun/pronoun smadi (from English somebody) is the most general reference to an individual human being. It functions as an indefinite pronoun but, depending on the context, could also be considered one translational equivalent of English person.6 The plural of smadi is piipl (obviously derived from the English people). In all varieties of English the noun person can, of course, be pluralized but persons is rarely used outside formal or technical contexts; the usual way of referring to a plurality of human beings is people. In Jamaican English, however, the word person (in singular and plural) is firmly established in mesolectal and acrolectal usage and even displays a number of interesting grammatico-semantic properties which have no immediate equivalent in other varieties of English (as we shall see below). On the basis of the then available written data from ICE-Jamaica, Mair (2002: 48) noted that the plural form persons was far more frequent in Jamaican texts than in texts from corresponding ICE material from Britain, New Zealand and East Africa. With ICE-Jamaica now completed, it is of course tempting to
12
Christian Mair
investigate whether this peculiarity is confined to written usage only or also evident in the spoken domain. As the following examples show, the first lesson to be learnt was that it was not feasible to restrict attention to the plural persons in the spoken data from ICE Jamaica:
(1) No no but they’re not around but what you find is that the persons who are teaching JAMALs [Jamaican Movement for the Advancement of Literacy teaching modules] are person like me who no know nutten but are scared of word …
(2) Worst if you value the person friendship and you think the person is somebody you want to keep in touch with there’s no way you’re going to I mean let that candle [go] out – you’re going to always try to keep the candle burning …
(3) And who was you, uhm, who were the person you [word] Who was the person that you went with?
Example (1) exhibits a clear code-switch into (fairly basilectal) Jamaican Creole, and the second mention of person is, hence, not marked for plural. In (2), the genitive is not marked, which, like the absence of inflectional plural marking, is an occasional option in (upper mesolectal) informal Jamaican English. Example (3) similarly shows the two conflicting or complementary linguistic systems interacting in the online production of speech, this time involving subject-verb agreement and inflectional plural marking. Table 1 below summarises the findings from the now available face-to-face conversations in ICE-Jamaica (= texts S1A-1 through 90, c. 180,000 words), in comparison to the corresponding British, New Zealand and Irish material from ICE:
Table 1: Frequency of people vs. person(s) in the direct conversations of ICE-GB, ICE-NZ and ICE-JA
          ICE-GB   ICE-NZ   ICE-IE   ICE-JA
people    411      449      275      663
person    76       66       48       157
persons   2*                         113
[*of which one read aloud]
significances (χ²): people:person – p < 0.01; people:person+persons – p = 0
Note the virtual absence of the plural persons from contemporary spoken British, Irish and New Zealand English, whereas it remains a viable synonym for people in spoken Jamaican English. A first explanation for this state of affairs might be that we are dealing with archaic usage. Some support for this view is provided by data from the OED quotation base which are summarised in Figure 1 below:
[Figure 1 (chart): proportion of people : persons in the OED quotation base, 0–100%, for the periods 1351–1400, 1451–1500, 1551–1600, 1651–1700, 1751–1800, 1851–1900 and 1951–2000]
Figure 1: People vs. persons in the OED quotation base
The relative frequencies of people vs. persons were calculated for the second half of every century since the 14th and, as can be seen, the frequency of persons has diminished from a high of c. 40 per cent in the latter half of the 17th century to below 10 per cent. What we find in written (Mair 2002) and spoken Jamaican usage today is roughly comparable to the British English of the 18th and 19th centuries (as it is documented in the very heterogeneous written quotations from the OED). As Jamaican English is certainly not the only ex-colonial variety which has on occasion been considered to tend towards archaic or old-fashioned usage, it is instructive to compare the findings from ICE-Jamaica to second-language varieties from India, Singapore, Hong Kong and the Philippines:7
Table 2: Frequency of people vs. person(s) in the direct conversations of ICE-JA, ICE-India and ICE-Singapore, ICE Hong Kong, and ICE Philippines
          ICE-JA   ICE-India   ICE-Sin   ICE-HK   ICE-PH
people    663      556         345       1302     330
person    157      103         109       155      143
persons   113      35          3         6        4
significances (χ²): people:person – p = 0; people:person+persons – p = 0
As Table 2 shows, parallels are restricted to the singular. As for the plural persons, Indian English displays some weak similarity with Jamaican English, whereas Singaporean, Hong Kong and Philippines English pattern like the two natively spoken varieties (GB and NZ).8 Once again, the “colonial lag” has not provided an over-arching explanatory framework for developments in World
Englishes but has been exposed as the myth it probably is (cf. Görlach 1987, Hundt forthcoming). The appropriate strategy of investigation thus is to treat each variety in its own right and draw up synchronic formality profiles which ideally would be based on a large number of lexical and morphosyntactic formality markers – for example pairs of near synonyms of etymologically Germanic and Romance origin such as fight-combat, help-assist(ance), spending-expenditure or surviving archaisms such as upon for on. Unfortunately, though, given the size of the ICE corpora, search results for most purely lexical variables are bound to remain tentative. For example, it is interesting to note that the direct conversations from ICE-GB contain not a single relevant instance of either the verb assist or the noun assistance, two formal synonyms of help. (The one instance of assistant found occurred in the collocation assistant manager, in which it is not interchangeable with helper). By contrast, fifteen instances were found in the corresponding portions of ICE Jamaica.9 The results for the on-upon variable are inconclusive in the specific instance of Jamaican English because in this variety upon is not necessarily archaic but could be motivated by Jamaican Creole pan “on”. One relevant morphosyntactic formality indicator, namely auxiliary and negative contractions, will, of course, be treated in depth in Section 6 below. Seen in conjunction with evidence from other formality markers, it is plausible to assume that the noticeable frequency of the word person(s) is at least partly due to the fact that in the Jamaican sociolinguistic situation English is per se a formal choice, particularly in the spoken domain. Additionally, there may well be a tendency towards hyper-correction, i.e. to avoid lexical material such as piipl which is also present in Jamaican Creole. 
Note, however, that the corpus contains many examples, including those listed in (1) to (3) above, which are far from formal, as is shown, for example by the fact that the noun person occurs in passages otherwise displaying Jamaican Creole features and itself occasionally lacks standard English inflectional endings. Therefore, we should consider a third factor: incipient grammaticalisation, with person developing features of an indefinite pronoun. A “general process whereby generic nouns give rise to pronominal categories” is richly attested in the languages of the world, and person is indeed one of several starting points for this pathway of grammaticalisation (Heine/ Kuteva 2002: 232-233). It is familiar from English-based pidgins and creoles, particularly in West Africa.10 In Caribbean English-lexifier creoles, person is not the typical exponent of the category “indefinite pronoun”, but cases of incipient grammaticalisation are documented. Thus, Allsopp (1996: 437, s.v. person) draws attention to a number of common uses in which person is a translational equivalent of various English indefinite and interrogative pronouns, giving the following citations from Barbadian usage: No person is there, at the door, Which person goin[g] pay all dat money?, and Who the person is? Of these constructions at least the last two seem to be of wider currency in Caribbean Englishes. Thus, which person gwine pay all dat money? and who de person? are acceptable in informal Jamaican English.11
Allsopp notes a similar tendency for the word people to be used “as a casual indef[inite] pron[oun], in contexts signalling contempt” (1996: 436, s.v. people). One of his illustrative examples is Those are the underlying evils of Trinidad society. Each man thinks he is people. Is time to stop all that, which shows people being used as equivalent of somebody [important]. It is tempting to assume that the occasional vernacular use of person in pronominal function is a direct boost on the frequency of the word in acrolectal regional Englishes, and that a similar usage involving plural people is an added indirect motivation to use persons – on the assumption that a spontaneous impulse to use people is checked in formal English through the tendency towards hypercorrection noted above, which is expected to encourage a realisation as persons instead. Regardless of how we account for the diachronic origin of the phenomenon, however, one thing is clear. Synchronically, the use of person(s) for people is attested in Jamaican English to an extent which goes beyond any other available ICE corpus, be it native- or second-language, and thus presents a clear case of a statistical regionalism.
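The notion of a statistical regionalism can be made concrete by expressing the plural variable as a realisation rate. The following is a minimal sketch, offered as an illustration rather than as part of the original analysis: it takes the people and persons counts from Table 2 and computes the share of persons per corpus. The names TABLE_2, persons_share and ranked_shares are hypothetical.

```python
# Counts of "people" and "persons" copied from Table 2 (direct conversations).
TABLE_2 = {
    "ICE-JA": {"people": 663, "persons": 113},
    "ICE-India": {"people": 556, "persons": 35},
    "ICE-Sin": {"people": 345, "persons": 3},
    "ICE-HK": {"people": 1302, "persons": 6},
    "ICE-PH": {"people": 330, "persons": 4},
}


def persons_share(cell):
    """Percentage of plural references realised as 'persons' rather than 'people'."""
    return 100 * cell["persons"] / (cell["people"] + cell["persons"])


def ranked_shares(table):
    """Return (corpus, share) pairs, highest share of 'persons' first."""
    shares = {corpus: persons_share(cell) for corpus, cell in table.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


for corpus, share in ranked_shares(TABLE_2):
    print(f"{corpus:<10}{share:5.1f}%")
```

On these figures persons accounts for roughly one in seven plural references in ICE-JA, against roughly one in seventeen in ICE-India and less than 1.5 per cent in each of the remaining three corpora.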
4. Main-clause order in wh-questions
Along with the use of me instead of I in co-ordinate noun phrase subjects (me and my Dad went fishing), the use of never as an invariable past-tense negator (I never met him last night) and the use of the base form of adjectives in adverbial function (some people work good under pressure), the lack of subject-operator inversion (or do-support) in questions is one of the four non-standard morphosyntactic features which Kortmann and Szmrecsanyi (2004: 1193) have shown to have the widest distribution in non-standard varieties of English around the world in their discussion of “vernacular universals” or “Angloversals.” The direct conversations of ICE-Jamaica contain more than enough material to investigate the spread of this phenomenon in the emerging local standard. A search for all “wh”-interrogative pronouns (including, of course, how) was undertaken which showed that while “correct” question grammar of course remains the statistical norm in the data, questions without inversion are common and thus belong among the non-standard syntactic variants which apparently have very little stigma attached to them, comparable, for example, to the stopping of the voiced dental fricative ([ð] → [d]) on the phonetic plane (on which see Irvine 2004). Note that the absence of inversion in main-clause questions seems to be exceedingly rare in ICE-GB. A spot-check of the 77 relevant questions in texts S1A 1 to S1A 10 did not yield a single clear example.12
Table 3: Subject-verb inversion in main-clause wh-questions in ICE Jamaica (direct conversations)13
               extrapolated     extrapolated     extrapolated     per cent
               frequency wh*    frequency how    frequency/all
inversion      1259             261              1520             77.6
no inversion   378              60               438              22.4
total          1637             321              1958             100.0
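The percentages in Table 3 are given for the pooled figures only. As a minimal illustration (not part of the original analysis), the sketch below recomputes the inversion rate from the extrapolated frequencies, separately for wh* and how as well as for both together. TABLE_3 and inversion_rates are hypothetical names.

```python
# Extrapolated frequencies copied from Table 3.
TABLE_3 = {
    "wh*": {"inversion": 1259, "no inversion": 378},
    "how": {"inversion": 261, "no inversion": 60},
}


def inversion_rates(table):
    """Percentage of inverted questions per group and for all groups pooled."""
    rates = {}
    pooled_inverted = pooled_total = 0
    for group, cells in table.items():
        total = cells["inversion"] + cells["no inversion"]
        rates[group] = round(100 * cells["inversion"] / total, 1)
        pooled_inverted += cells["inversion"]
        pooled_total += total
    rates["all"] = round(100 * pooled_inverted / pooled_total, 1)
    return rates


print(inversion_rates(TABLE_3))
```

On the extrapolated figures, inversion is somewhat more frequent with how (81.3 per cent) than with the wh* forms (76.9 per cent); the sketch says nothing about whether this difference is significant.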
Apart from the syntactically motivated absence of inversion, there is phonetically driven ellipsis of do/did or the auxiliary are through assimilation in rapid speech which is found in many kinds of informal English (what did she say → what she [] say; what are you doing → what you [] doing). A possible instance from ICE-GB could be the following, for which we could assume the pronunciation [], without an overtly realised operator do:
(4) Oh what d’you mean by programming in Pascal (S1A 8)
However, the original sound recording made available with the second release of ICE-GB has [ ] in this case and thus supports the transcription. By contrast, ICE Jamaica contains several examples which could be regarded as phonetically conditioned deletion of do or are.
(5) What you think about that?
(6) How we going to do it?
However, in view of the far greater number of instances which are unambiguously syntactic in nature it is questionable whether there is even a need to invoke such phonetic factors. Consider the following typical instances:
(7) And where you went to high school?
(8) Why you choose to do psychology?
(9) What exactly they do up here honestly?
(10) So why it not happening at that school?
(11) What that has to do with it though?
In none of these examples could phonetic assimilation lead to the deletion of the operator (where did you go …, why did you choose …, what exactly did they do …, so why’s it not, what does it have14). In many others, the operator is retained, but stays in place after the subject:
(12) Why you don’t like to stay home with your mother?
(13) So how long you’ve been working here?
(14) When you’re going?
Note that all examples so far have been taken from passages of text which are located very much at the (standard) English end of the Creole-English continuum, as with the exception of the lacking subject-operator inversion they display no direct influences of the Creole substrate. This means that this construction does not have much stigma associated with it, and that we should not assume code-switching into Creole when we find it occurring on its own. Such code-switches, however, do occur when absence of inversion combines with clearer (and more stigmatised) Creole features such as lack of inflection for the 3rd person singular or absence of the copula be, as it does in a small number of cases:
(15) How much it cost?
(16) So I went to him afterwards and I said uhm what wrong?
The material additionally contains a number of self-corrections by speakers which open up interesting discourse-analytical and processing perspectives. There are cases in which speakers move from an inverted question to an uninverted one, presumably in an attempt to create a more relaxed conversational atmosphere (17 and 18 below), and there are instances of the reverse, speakers correcting a spontaneously produced non-standard form to a standard one (19):
(17) A: So how do you think that impact because they see it as a drug
     B: Impact on what?
     A: On the children and on society on y you know because they associate Rasta with uhm weed and you do smoke so how you think that impact on on on your relationship
(18) So what do you suggest What do you suggest What you suggest that we do to to uhm to rectify that situation
(19) What was primary school what uhm primary school prep school primary school you go to did you go to
Given that the absence of subject-operator inversion (or do-support) in questions has been identified as one of the most widespread grammatical features of the New Englishes and non-standard varieties in general, its presence in Jamaican English is not a surprise in itself. A comparative look at the spread of the phenomenon across several ICE corpora (which because of the extremely high frequency of questions remains beyond the scope of the present paper) would be very useful, however, in order to find out whether we are dealing with an “Angloversal,” an unmarked choice in the New Englishes which tends to arise irrespective of the particular local linguistic ecology, or with a contact phenomenon, because – after all – uninverted questions are normal in Jamaican Creole. Assessing the relative impact of universal and language-specific factors in variety formation is an important task in contact linguistics. With regard to a more specific sociolinguistic research agenda, the role of the variable in managing conversational atmosphere and accommodation among participants, which has become obvious from the illustrations in examples (17) to (19), is of great interest in a qualitative interaction-based sociolinguistic approach.
5. The modals of obligation and necessity
The modals of obligation and necessity represent one of the well-documented areas of grammatical contrast between British and American English, the globally dominant reference standards. Moreover, this fragment of the grammar has been subject to fairly rapid diachronic change over the past three centuries, with relevant phenomena including the spread of have got to (on the back of earlier have to – see Krug 2000), the decreasing frequency of must and the rapid spread of need to (Mair 2006: 103-108, Mair/ Leech 2006: 326-329). These modals are thus an almost perfect diagnostic to assess the synchronic regional orientation of a New English with regard to British or American norms and also its degree of linguistic conservatism. Table 4 below presents the findings from the Santa Barbara Corpus of Spoken American English (in the absence of an ICE-USA) and five ICE corpora.
Table 4: Obligation and necessity in the Santa Barbara Corpus of Spoken American English and the conversation components of five ICE corpora (S1A 1 – 100)
Form:               Santa Barbara   ICE-GB   ICE-NZ   ICE-IE   ICE-India   ICE-JA
must                59              97       136      118      206         124
must not/mustn’t    0               6        3        3        1           3
need not/needn’t    0               1        0        3        11          0
NEED* to            111             51       57       50       18          156
NOT* need to        7               8        15       6        1           4
HAVE* to            448             269      364      430      585         627
NOT* have to        51              27       29       22       16          14
HAVE* got to        12              118      114      11       4           2
HAVE* gotta         18              0        0        0        0           1
got to              4               9        42       0        4           6
gotta               96              0        0        1        0           6
*CAPITALISED forms stand for all morphological variants, in this case need, needs, needed, needing; NOT stands for do not, does not, did not, don’t, doesn’t, didn’t, shouldn’t, etc.
Owing to the different sizes of the corpora, the findings from the Santa Barbara corpus and the ICE-GB conversations are not straightforward to compare, but one thing which they do show is the expected contrast in the frequency of have got to
– high in British English and very low in American English. Note also, on an issue which is not directly related to the concerns of the present paper, that while HAVE got to is attested at a rate comparable to British English in New Zealand, it is rare in Irish English. As regards the findings from the five ICE corpora themselves there is no easy explanation for the fact that have to, the most common form in all corpora, should be so much more frequent outside Britain.15 Other than that, we note an almost uncanny similarity of preferences between British English and New Zealand English. Indian English stands out through its markedly conservative profile, reflected in high frequencies for must and low frequencies for the innovative forms need to and have got to. Jamaican speakers do not align with British norms in the same way that New Zealanders seem to be doing. Note that while they even lead in the use of the innovative need to, on the whole they avoid the British have got to. The resulting profile thus resembles an American English one. For the time being, we must leave open the question of whether this similarity has come about gradually and independently or whether it reflects recent exposure to and re-orientation towards a US English norm on the part of a growing number of Jamaicans. The most intriguing explanandum in Table 4 is the frequency of need to in Jamaican English. As this form is spreading rapidly in British and American English at the moment (Mair/ Leech 2006: 326-329), the conservative explanation would be to point out that the spoken texts of ICE Jamaica were recorded in the early 2000s, that is at least ten years later than those of most other ICE corpora (except ICE Ireland). However, whether this factor is enough to account for the entire disparity must remain open. The most robust result of Table 4, on the other hand, is the solidly non-British or even North American profile of variation in the use of modals which emerges from the ICE-JA data.
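Because the corpora differ in size, the raw counts in Table 4 are hard to compare directly. One size-independent view, offered here only as an illustrative sketch and not as the procedure followed in this paper, is to express each affirmative form as a share of all affirmative obligation forms within the same corpus. The names FORMS, TABLE_4 and profile are hypothetical, and the negated rows of Table 4 are left out for simplicity.

```python
# Affirmative forms from Table 4; each list follows the order given in FORMS.
FORMS = ["must", "NEED to", "HAVE to", "HAVE got to", "HAVE gotta", "got to", "gotta"]

TABLE_4 = {
    "Santa Barbara": [59, 111, 448, 12, 18, 4, 96],
    "ICE-GB": [97, 51, 269, 118, 0, 9, 0],
    "ICE-NZ": [136, 57, 364, 114, 0, 42, 0],
    "ICE-IE": [118, 50, 430, 11, 0, 0, 1],
    "ICE-India": [206, 18, 585, 4, 0, 4, 0],
    "ICE-JA": [124, 156, 627, 2, 1, 6, 6],
}


def profile(counts):
    """Share (in per cent) of each form among all affirmative forms of one corpus."""
    total = sum(counts)
    return {form: 100 * n / total for form, n in zip(FORMS, counts)}


PROFILES = {corpus: profile(counts) for corpus, counts in TABLE_4.items()}

for corpus, shares in PROFILES.items():
    print(corpus, {form: round(share, 1) for form, share in shares.items()})
```

Computed this way, HAVE got to makes up more than a fifth of the affirmative forms in ICE-GB but under two per cent in the Santa Barbara and ICE-JA data, while the share of NEED to is highest in ICE-JA and the share of must is highest in ICE-India.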
This profile is only partly corroborated by searches for several other demarcators of British and American usage. British and British-influenced Englishes, for example, are known to be characterised by a preference for towards over toward, whereas the reverse is true for American English and varieties related to it. Table 5 lists some pertinent figures from a number of ICE corpora and an American reference database, namely the Corpus of Spoken Professional American English (CSPAE):
Table 5: Towards vs. toward in selected corpora
          ICE-GB   ICE-NZ   ICE-IE   ICE-JA*   ICE-India   ICE-Philippines   CSPAE
towards   311      342      253      204       273         126               124
toward    9        25       5        50        7           61                264
* Figures are based on the 470 out of 500 texts available at the time of this writing.
significances (χ²): p = 0
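The chi-square result for Table 5 is reported without the computation behind it. A minimal sketch of a Pearson chi-square test over the full 2 × 7 table follows; treating all seven corpora in a single test is an assumption, as are the names pearson_chi_square and TABLE_5. The statistic is compared with 16.812, the standard critical value for six degrees of freedom at the 0.01 level.

```python
def pearson_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Columns: ICE-GB, ICE-NZ, ICE-IE, ICE-JA, ICE-India, ICE-Philippines, CSPAE.
TABLE_5 = [
    [311, 342, 253, 204, 273, 126, 124],  # towards
    [9, 25, 5, 50, 7, 61, 264],           # toward
]
CRITICAL_VALUE_DF6_P01 = 16.812

statistic, dof = pearson_chi_square(TABLE_5)
print(f"chi-square = {statistic:.1f}, df = {dof}")
print("significant at 0.01:", statistic > CRITICAL_VALUE_DF6_P01)
```

Under this assumption the statistic lies far above the critical value, which is consistent with the very small p-value reported under the table.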
All historically British-influenced varieties, and even Philippine English, share the British preference for towards over toward, though the “American” variant has slightly higher frequencies in the Jamaican and Philippine corpora than in the others. Similar observations can be made for the use of gotten as a variant of the past participle of the verb get. While at frequencies of 2, 2, 6 and 8 in ICE-GB, ICE-India, ICE-NZ and ICE-Ireland respectively, the form is marginal in these varieties, ICE-JA has 34 instances.
6. Contractions
The contraction of certain auxiliary verbs (e.g. he’s for he is) and that of the negation particle not (e.g. isn’t for is not) are variables which are extraordinarily well suited to an approach combining corpus linguistics and sociolinguistics. As precisely definable search strings, such forms are easily retrievable from digitised text, and at the same time contractions of this type are one of the most reliable indicators of stylistic (in)formality (cf., e.g., Diller 1999, Peters 2001, Yaeger-Dor, Hall-Lew & Deckert 2002). Formality levels in the conversational texts of ICE Jamaica provide crucial evidence when it comes to determining the status of standard English in Jamaica. If the level of formality were high and if the range of observed stylistic variability were narrow,16 this would mean that the role of acrolectal English is marginal in spoken usage and that, unlike writing, where it clearly dominates, it is an extraneous or “adoptive” (Shields-Brodber 1989, 1997) standard in oral communication. The great advantage of the corpus-linguistic working environment provided by ICE is that the frequency of contractions in spontaneous speech can be compared across varieties. Thereby, contraction frequencies in ICE Great Britain, ICE Ireland and ICE New Zealand can be taken to represent the norm for uncontroversial instances of contemporary native-speaker usage in largely monolingual contexts. By contrast, ICE India illustrates the situation in a typical multilingual environment in which English serves as a prestigious and formal second language. The working hypothesis is that contraction rates will be uniformly high in native-speaker usage, because here it is English which is the default choice for the informal baseline style of face-to-face talk. Whether it is possible to have a conversation in English and remain informal is an open question in the Indian sociolinguistic context, and – probably to a lesser extent – also in the Jamaican one.
For the following experiment, all combinations of a pronominal subject and a form of the verb be in the present tense were investigated in the spontaneous-dialogue sections of ICE Great Britain, ICE New Zealand, ICE India and ICE Jamaica. The findings are thus based on text samples S1A-1 to S1A-100, that is a total of c. 200,000 words of transcribed dialogue per corpus.17 Table 6 lists the search strings in question:
Table 6: Be-contractions searched in five ICE corpora
not contracted/      not contracted/      subject-verb          negative
not negated          negated              contraction           contraction
I am                 I am not             I’m (not)             I amn’t
you are              you are not          you’re (not)          you aren’t
he/ she/ it is       he/ she/ it is not   he/ she/ it’s (not)   he/ she/ it isn’t
we are               we are not           we’re (not)           we aren’t
they are             they are not         they’re (not)         they aren’t
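The search strings in Table 6 translate naturally into a pattern-matching routine. The following is a hedged sketch of how such a search could be run over plain transcribed text; it is not the retrieval procedure actually used for the ICE corpora, and the names AGREEMENT, PATTERN and classify_be_forms are hypothetical. It makes no attempt to separate he’s = he is from he’s = he has, and it ignores inverted orders such as are you, so real counts would still need manual checking.

```python
import re
from collections import Counter

# Pronoun -> (full form of "be", clitic form), following the strings in Table 6.
AGREEMENT = {
    "i": ("am", "'m"),
    "you": ("are", "'re"),
    "he": ("is", "'s"),
    "she": ("is", "'s"),
    "it": ("is", "'s"),
    "we": ("are", "'re"),
    "they": ("are", "'re"),
}

# pronoun + (clitic | full form with optional n't), optionally followed by "not"
PATTERN = re.compile(
    r"\b(i|you|he|she|it|we|they)"
    r"(?:('m|'re|'s)|\s+(am|are|is)(n't)?)\b"
    r"(\s+not\b)?",
    re.IGNORECASE,
)


def classify_be_forms(text):
    """Count pronoun + present-tense 'be' strings by the four Table 6 categories."""
    counts = Counter()
    text = text.replace("\u2019", "'")  # normalise typographic apostrophes
    for match in PATTERN.finditer(text):
        pronoun, clitic, full, nt, free_not = match.groups()
        expected_full, expected_clitic = AGREEMENT[pronoun.lower()]
        if clitic:
            if clitic.lower() == expected_clitic:
                counts["subject-verb contraction"] += 1
        elif full.lower() != expected_full:
            continue  # e.g. "they is": not one of the Table 6 strings
        elif nt:
            counts["negative contraction"] += 1
        elif free_not:
            counts["not contracted/negated"] += 1
        else:
            counts["not contracted/not negated"] += 1
    return counts


sample = "I'm not sure they are coming. She isn't here and we are not late. It's fine."
print(classify_be_forms(sample))
```

Listing the agreement pairs explicitly keeps the search limited to the strings in Table 6; non-standard combinations are skipped rather than counted.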
Recall that our working assumption was that contraction frequencies would be uniformly high in spoken British, Irish and New Zealand English. Figures for Indian English were expected to be low. As is shown in Table 7 (note 18), this expectation is substantially borne out. Interestingly enough, Jamaican English does not reach the very high contraction rates of the uncontroversially native-speaker corpora, but remains nevertheless much closer to them than to a clear second-language variety such as Indian English.
Table 7: Be-contractions in five ICE corpora – global frequencies19
            uncontracted   contracted   total   contraction rate in per cent
ICE-GB      232            4036         4258    94.8
ICE-NZ      90             3809         3899    97.7
ICE-IE      336            4092         4428    92.4
ICE-JA      582            3214         3796    84.7
ICE India   2297           1588         3885    40.9
significances (χ²): p = 0
It is, of course, possible to refine the analysis further by looking at the returns for individual pronouns and the corresponding forms of the verb be (see Appendix for figures). This more delicate analysis shows, for example, that contractions of is are significantly more common than contractions of are in Indian English, or that the form amn’t, a marginal presence in British English, is practically absent from all other varieties. In addition, the relatively low values for negator contractions (n’t) are, of course, due to the fact that the search was restricted to pronominal subjects. Such considerations notwithstanding, the general trend documented in Table 7 remains robust. In sum, the analysis shows that with regard to the variable at issue Jamaican English does not exhibit the formality-profile of a typical second-language variety (Indian English), but tends towards the native ones without fully reaching their high contraction rates. Seen as a corpus, ICE-JA thus appears to present material which is very much like natively spoken English. However, this does not mean that English should be considered the native variety of each and every speaker recorded in the corpus. A promising direction for further
sociolinguistic analysis would thus be to determine the extent of inter-speaker and intra-speaker variability in the corpus material, as the somewhat “mixed” character of Jamaican English might result from the fact that the sample contains a number of speakers who have contraction rates comparable to those found in British or New Zealand English (i.e. native speakers of English who use the language across the entire formality range) and others whose profile matches that of second-language speakers (i.e. the speakers of “adoptive” English in the sense of Shields-Brodber 1989 whose natural mode of informal expression is a mesolectal variety of Jamaican Creole).
7. “New quotatives” in Jamaican English and the globalisation of vernacular features
The new quotatives go and be like – first identified as innovations in American English by Butters (1980, 1982)20 – are among the fastest-spreading grammatical constructions in English today. In particular, be like is not only spreading in the variety in which it originated, American English, but has been reported as an innovation in Australian English, Canadian English and Newfoundland English, British (=English) and Scottish English (see Barbieri 2005: 223 and Buchstaller 2006b: 363 for a review of pertinent research). Thus, its presence in ICE Jamaica does not come as a surprise.
(20) I don’t know what they were thinking some chicken stuff and fish and whatever it is with uhm what’s that dressing vegetable dressing on the chicken and Okay well who eat that I’m like hello we are black people from the Caribbean please no white people here You know No maybe white people would eat stuff like that
(21) You know she knows nothing about these people. Me fraid you know the man a call her she run gon go go take picture So I’m like where’s the picture we thought it was a instant thing. She’s like no him have it.
Note that while the direct conversations of ICE Jamaica contain c. 50 clear instances of quotative be like, quotative go seems to be absent from the data. There is, however, one informal quotation-introducing device which is in competition with be like, namely Jamaican Creole mi say, him say etc. As is not surprising in such a case of rapid change in progress, the use of be like is influenced by diverse extralinguistic and structural factors “such as age and sex of the speaker […] grammatical person of the subject, discourse function of the quotation and tense” (Barbieri 2005: 223) and – the point of Barbieri’s (2005) own paper – register. Summarising the results of previous research on the new quotatives, Buchstaller reports that “a number of studies have suggested that be like might eventually push out go” and that “U.S. respondents associate quotative be like […] with younger speakers and women. It also triggers a range
of associations with personality traits, many of which can be subsumed in the category ‘social attractiveness’, or solidarity traits” (2006b: 363). Buchstaller subsequently investigates the use of and attitudes towards the new quotatives in British English, focussing specifically on the question of whether the adoption of a new form also implies the adoption of the functional and attitudinal indexicality associated with it in the variety in which it originated. She concludes:
[…] that if be like has been imported from the U.S., speakers in the British Isles have not simply passively adopted the social attitudes attached to it. Rather, the adoption of global resources is a much more agentive process, whereby travelling features are actively re-evaluated and manipulated on the perceptual level. As linguistic resources are borrowed across the Atlantic, they may lose or gain associations during the process or, alternatively, already existing percepts may be re-analyzed and re-evaluated. Consequently, for speakers of the borrowing variety, new associations interact with possibly secondhand ones and aspects of existing meaning can become more or less salient during the process. (2006b: 375)
There is reason to assume that similar processes of dissociation and “re-allocation of attitudes” (Meyerhoff & Niedzielski 2003) are at work in the spread of be like in Jamaican English. What is in line with many observations made on varieties of English spoken outside Jamaica is the concentration of be like among younger female speakers: of the ca. 50 instances collected, all but 5 examples are from speakers younger than 25 years, and only three are produced by males (by two different speakers, both in the 26-45 age bracket). However, what is sociolinguistically unique about the linguistic situation in Jamaica is that the strongest non-standard and informal competitor of quotative like is not go, but Jamaican Creole quotation-introducers such as mi say/ dem say/ him say.
This means that there are two different ways of being informal, an international one imported recently from informal American English and a local one, from Jamaican Creole, with a long historical standing. Note finally, that the sheer frequency with which be like is attested in the Jamaican data is striking. Although normalised frequencies per million words are difficult to reconstruct from Buchstaller’s analysis,21 it is safe to say that quotative be like seems, somewhat surprisingly, to be as common in Jamaican English as in American English, the variety it originated in a mere four decades ago. Given its rapid recent spread in so many varieties of English, we would, of course, have to ask whether be like would not even be more frequent in more recent British and American material. As for the ICE working environment in general, the lesson taught by this exploratory look at quotatives in Jamaican English is that the various corpora are clearly not comparable to a sufficient degree in this case of rapid change in progress. When ICE-GB was sampled, quotative like had barely reached Britain and is therefore not attested. Quotative like is amply attested in ICE Ireland,
whose spoken texts were recorded 10 to 15 years later, at roughly the same time as those of ICE Jamaica. Quotative like is by and large unattested in all those second-language ICE corpora (East Africa, India, Singapore, Philippines) which were collected in the time in between. But whether this is a sign that second-language varieties resist this particular innovation more than natively spoken ones is uncertain; it may well be that the spoken texts for these corpora were sampled too early.
8.
Conclusion
The study of selected types of local or non-standard usage in ICE Jamaica shows very clearly what corpus linguistics and sociolinguistics have in common, namely an interest in linguistic variation. However, it also shows very clearly what still sets them apart, namely their different analytical perspectives. In Barbieri’s terms, corpus linguists start out from charting the “frequency patterns of use” observed in their data, whereas sociolinguists working in the variationist paradigm first define the variable and then aim to identify “the contribution of particular factors to the probability of the choice” between particular variants (2005: 224). The two perspectives are by no means incompatible, but the different emphases they engender for research practice need to be spelled out. First, while both corpus linguistics and sociolinguistics generally use quantification and statistics, their approaches differ. A typical corpus-linguistic frequency measure, for example, is absolute or normalised frequency (say, per million words). Sociolinguists, on the other hand, give (and tend to think in) group-specific realisation rates (e.g. per cent of realisation of a variable as variant X). In many sociolinguistic studies (including Buchstaller’s study of quotatives reported above), absolute corpus size is thus difficult to infer, which may make comparison to corpus-based studies rather difficult. Secondly, corpus data is usually in the public domain, which allows easy replication of studies and, ideally, cumulative progress as a research community builds up around a corpus and profits from and builds on one another’s work. The raw data of sociolinguistic studies, by contrast, is rarely made available to the general academic public. The starting point for most corpus analysis is concordancing.
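The contrast between the two frequency measures can be made concrete with a minimal sketch. The token counts and corpus sizes are the approximate figures given in note 21 and in the discussion of quotative be like; the function names and the quotative-frame total used for the realisation rate are hypothetical and serve only as illustration.

```python
def per_million(hits: int, corpus_words: int) -> float:
    """Normalised frequency: occurrences per million running words."""
    if corpus_words <= 0:
        raise ValueError("corpus_words must be positive")
    return hits * 1_000_000 / corpus_words


def realisation_rate(variant_tokens: int, variable_tokens: int) -> float:
    """Per cent of all tokens of a variable realised as one variant."""
    if variable_tokens <= 0:
        raise ValueError("variable_tokens must be positive")
    return variant_tokens * 100 / variable_tokens


# Approximate figures from the text: (be like tokens, corpus size in words).
corpora = {
    "ICE Jamaica conversations": (50, 200_000),
    "British spontaneous speech": (93, 1_000_000),
    "Switchboard portion (c. a quarter of 3 million)": (121, 750_000),
}

for name, (hits, words) in corpora.items():
    print(f"{name}: {per_million(hits, words):.0f} per million words")

# Hypothetical variationist count: 50 be like tokens out of an assumed
# 400 quotative frames. The total of 400 is invented for illustration.
print(f"be like realisation rate: {realisation_rate(50, 400):.1f} per cent")
```

The first measure needs only the corpus size, whereas the second needs every token of the variable to be identified, which is why a study reporting only realisation rates is hard to convert into frequencies per million words.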
Quantification chiefly focuses on establishing collocational patterns, the influence of structural context on the choice of variants, and on corpus-internal variability by register or genre. The chief aim of variationist sociolinguistics, on the other hand, remains finding out about “the correlation of dependent linguistic variables with independent social variables [which] has been at the heart of sociolinguistics since its inception more than three decades ago” (Chambers 1995: Preface). Of course, this does not mean that the linguistic context in which a variable occurs is irrelevant for sociolinguists. Any decent variationist study of word-final consonant-cluster deletion or some such classic variable would distinguish between utterance-final, pre-consonantal or pre-vocalic environments
at least. It merely means that such aspects will usually not remain the major preoccupation of a study. Similarly, a corpus linguist is free in principle to access any ICE corpus as a stratified sample of speech produced by older and younger speakers, male and female speakers, and so on. In practice, though, this approach is not supported by standard corpus-analytical software tools and may therefore tend to be avoided. And if one is willing to shoulder the necessary work, one may still be disappointed, as the sociolinguistic information in many a file-header may be very generic (“male, English”) or even missing for many a participant in a conversation. Among many hundreds of individual informants contributing to the spoken-demographic portions of the BNC there is “‘Rudy,’ 61, West Indian, warehouse manager, social class C1 (junior management, supervisory or clerical)”, who has contributed his c. 10,000 words to text KCP, but it is a long way to get to him. To turn from these general considerations to the specific sociolinguistic constellation investigated in the present paper: what does ICE-Jamaica tell us about the current state of development of the emerging standard of English usage in Jamaica? As was pointed out above, this emerging standard is developing in a pull among three competing orientations: British, American, and local (that is, in contact with Jamaican Creole). In addition, it is a legitimate question to ask whether Jamaican English shares features with other New Englishes with which it has not been in direct contact (in the spirit of the “Angloversals” debate reported on in Section 4). The following tabular survey shows how this pull plays itself out with regard to the five variables investigated here. They are displayed along the vertical axis of the table.
The horizontal axis lists historical and current contact influences and orientations and, in the rightmost column, possible similarities to other New Englishes which are not motivated by direct contact. A “+” sign indicates similarity between Jamaican usage and the norm in question; a “-” stands for distance to it:
Table 8: Competing orientations in the Jamaican standard
Variable ↓ / Orientation →
people/ persons
+/- inversion in main-clause questions
modals of obligation and necessity
contractions
quotative be like
GB
US
-
local/ Jam. Creole + +
-
+
-
+
-
-
-
+
n.a. -
-
“Angloversals”
On the evidence of this partial survey (restricted as it is to five variables), there is little reason to continue including Jamaican English among British-influenced post-colonial standards such as Australian English or New Zealand English.
Jamaican Creole, mesolectal informal English and even American English seem to have become more important contact varieties today than the now remote former colonial British standard. In addition, there are limited parallels between Jamaican English and second-language standards such as Indian English, which show that English tends to be restricted to formal domains of use in spoken communication. While many speakers of educated Jamaican English continue to believe in the essentially “British” nature of their standard, hard evidence for such a view seems to be disappearing outside the relatively firmly regulated area of spelling. Such is the state of linguistic development 47 years after Jamaican independence in 1962.
Notes
1
This research is supported by external funding from the Deutsche Forschungsgemeinschaft (DFG MA 1652/4 “Educated Spoken English in Jamaica: Phonetische/ lexikogrammatische Normierung und soziolinguistischer Status”), which is gratefully acknowledged. In addition I would like to thank Dr. Dagmar Deuber, Freiburg, for her insightful comments on a previous version of this paper. Dr. Birgit Waibel and Luminiţa-Irinel Traşcă have helped with the corpus counts.
2
To describe the successive extensions of scope in sociolinguistics over the past four decades, Penelope Eckert has recently used the metaphor of three “waves.” The first wave is classic variationism as exemplified in Labov’s 1966 Social Stratification of English in New York City, exploring the “big picture” by establishing quantitative correlations between independent social variables and dependent linguistic variables. Like the first wave, the second wave of sociolinguistic studies is focussed on the use of a given variety by its community of speakers, but uses ethnographic methods to gain a deeper understanding of how variation operates in and for a community. The third and most recent wave goes beyond the study of variables in localised speech communities and studies variation “not as a reflection of social place, but as a resource for the construction of social meaning” (Eckert 2005: 1). This means that the focus of interest shifts from the linguistic variable, chosen frequently because of its intrinsic linguistic interest – for example as a presumed instance of change in progress –, to the study of communicative styles which are not necessarily localisable any longer.
3
For further details see, e.g., Greenbaum, ed. 1996 or the project’s homepage at http://www.ucl.ac.uk/english-usage/ice/.
4
The following ICE corpora are publicly available: Great Britain, New Zealand, East Africa, India, Hong Kong, Ireland, Singapore, Philippines. ICE Australia is completed and can be consulted on request through a server at Macquarie University. Work on ICE Jamaica is substantially
complete, and publication is imminent. Data collection is still in progress for ICE Canada, Fiji, Malaysia, South Africa, Sri Lanka, USA. Cf. http://www.ucl.ac.uk/english-usage/ice/index.htm. Further projects, such as a corpus documenting Maltese English, are in the planning stage.
5
Cf. Buchstaller 2006b: 362, who writes that in such “cases of borrowing, the stereotypes attached to linguistic items are not simply taken over along with the surface item. Rather, the adoption of global resources is a more agentive process, whereby attitudes are re-evaluated and re-created by speakers of the borrowing variety.”
6
See the entries for smadi, s’madi and somebody in Cassidy/ LePage 1980.
7
ICE-East Africa was not included in this comparison, because it contains an insufficient amount of spontaneous speech.
8
The exceptionally high figure for people in ICE Hong Kong is a matter which cannot be pursued here. It is partly due to an apparent preference in this variety for analytical expressions such as Hong Kong people (rather than Hong Kongers) or Chinese people (rather than the Chinese).
9
The total returns were 19, from which four irrelevant hits were discarded. For comparison, ICE India yielded 15 returns, from which 3 turned out to be genuine. The figures obtained for help* were 49, 107 and 89 in ICE-GB, ICE-JA and ICE-India respectively.
10
The process of grammaticalisation has been completed in Nigerian Pidgin, for example (Dagmar Deuber, personal communication).
11
Joseph Farquharson (personal communication) points out that for him as a native speaker there is an assumption that which person, unlike who, implies that there is a known group from which an individual is selected, while who makes no such assumption.
12
In fact, there was one clear instance of the opposite of what was looked for: inversion in an apparently dependent clause: “Well we’re heading to how d’you get into working with disabled people.” (S1A 4)
13
The following procedures were adopted. To identify the relevant questions from the corpus a search was undertaken for all instances of wh* and how in S1A 1 to S1A 90, which yielded 5246 returns for manual post-editing. The extrapolated frequencies and percentages in Table 3 are based on an inspection of 400 instances of wh* and 100 of how (i.e. a total of 500 cases). 191 (= 143 + 48) of the 500 concordance hits were identified as syntactically independent questions. Of these 42 (= 33 + 9) did not display inversion. From among the borderline cases, I excluded why (not) + inf.
questions, what if questions, echo-questions (e.g. you do what?), and verbless or incomplete wh-/how questions (i.e. Why?, How?, What else?, What about + NP?, etc.). Questions in passages of direct speech, on the other hand, were treated as syntactically independent and therefore included (e.g. Sometimes even if you just ring the phone one time and say hi how are you doing).
14
This analysis presupposes that the use of the operator do is normal with have to in questions and negations in Jamaican English, and that an older British variant – what has that to do with it, though? – is no longer relevant. If it was, the example would have to be re-classified with (12) to (14) below.
15
Standard significance tests are not available for this table, as there are too many cells with less than five members. Although hafi, “have to,” is common in Jamaican Creole, it is difficult to gauge the extent of “substrate” influence in Jamaican English here, as similarly high values can be observed in Indian English. The presence of hafi in Jamaican Creole, on the other hand, may work as an impediment to the spread of have got to/ gotta.
16
Shields-Brodber 1989 observes a tendency towards “monostylistic” usage among contemporary habitual users of English in Jamaica.
17
In addition to the 90 samples of direct conversations analysed in sections 3 and 4, the investigation thus includes the 10 samples of telephone conversations.
18
Table 7 gives global frequencies; for a detailed break-down of individual results from five corpora see Appendix.
19
Note that only those uncontracted forms were counted which could in theory have been contracted. Thus, I am would have been counted in I am here, but not in the short affirmative answer Yes, I am. For the sake of completeness it should be added that in addition to the forms listed in Table 2 these figures contain two instances of ain’t from ICE-NZ and one from ICE Jamaica.
20
For important follow-up studies on the phenomenon in American English see Blyth, Recktenwald & Wang 1990, Romaine & Lange 1991, or Barbieri 2005.
21
Buchstaller (2006a: 8-9) reports finding 93 instances in a corpus of British English spontaneous speech comprising roughly a million words and 121 in the portion of the American Switchboard Corpus which she used (which apparently is about a quarter of the total 3 million words). As the conversations from ICE Jamaica make up only c. 200,000 words, the
normalised frequency (per million words) for this variety would have to be estimated at about 250.
References
Allsopp, R. (1996), Dictionary of Caribbean English usage. Oxford: OUP.
Barbieri, F. (2005), ‘Quotative use in American English: a corpus-based cross-register comparison’, Journal of English Linguistics, 33: 222-256.
Beal, J., K.P. Corrigan and H. Moisl (eds.) (2007), Creating and digitizing language corpora. Vol 1: Synchronic databases. Basingstoke: Palgrave Macmillan.
Biber, D. (1988), Variation across speech and writing. Cambridge: Cambridge University Press.
Biber, D. (ed.) (1994), Sociolinguistic perspectives on register. New York: OUP.
Biber, D. (2003), ‘Compressed noun-phrase structures in newspaper discourse: the competing demands of popularization vs. economy’, in: J. Aitchison and D.M. Lewis (eds.) New media language. London: Routledge. 169-181.
Biber, D. and E. Finegan (1989), ‘Drift and evolution of English style: a history of three genres’, Language, 65: 487-517.
Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan (1999), The Longman grammar of spoken and written English. London: Longman.
Blyth, C., S. Recktenwald and J. Wang (1990), ‘I’m like, ‘Say what?!’: a new quotative in American oral narrative’, American Speech, 65: 215-227.
Buchstaller, I. (2006a), ‘Diagnostics of age-graded linguistic behaviour: the case of the quotative system’, Journal of Sociolinguistics, 10: 3-30.
Buchstaller, I. (2006b), ‘Social stereotypes, personality traits and regional perception displaced: attitudes towards the ‘new’ quotatives’, Journal of Sociolinguistics, 10: 362-381.
Butters, R. (1980), ‘Narrative Go ‘Say’’, American Speech, 55: 304-07.
Butters, R. (1982), ‘Editor’s note [on be like ‘think’]’, American Speech, 57: 149.
Cassidy, F.G. and R.B. LePage (1980), Dictionary of Jamaican English. Cambridge: CUP.
Diller, H.-J. (1999), ‘Some thoughts on the stylistic function of contractions in written texts’, in: U. Carls and P. Lucko (eds.) Form, function and variation in English. Frankfurt: Lang. 235-245.
Eckert, P. (2005), ‘Variation, convention, and social meaning’ [Presidential Address, 2005 LSA Meeting]. http://www.stanford.edu/~eckert/EckertLSA2005.pdf
Görlach, M. (1987), ‘Colonial lag? The alleged conservative character of American English and other ‘colonial’ varieties’, English World-Wide, 8: 41-60.
Greenbaum, S. (ed.) (1996), Comparing English worldwide: the International Corpus of English. Oxford: Clarendon Press.
Heine, B. and T. Kuteva (2002), World lexicon of grammaticalization. Cambridge: CUP.
Hundt, M. (2009), ‘Colonial lag, colonial innovation, or simply language change?’ in: G. Rohdenburg and J. Schlüter (eds.) One language, two grammars: morphosyntactic differences between British and American English. Cambridge: CUP. 13-37.
Irvine, A. (2004), ‘A good command of the English language: phonological variation in the Jamaican acrolect’, Journal of Pidgin and Creole Languages, 19: 41-76.
Johansson, S. (forthcoming), ‘Interpreting textual distribution: social and situational factors’. Arbeiten aus Anglistik und Amerikanistik 34.
Kortmann, B. and B. Szmrecsanyi (2004), ‘Global synopsis: morphological and syntactic variation in English’, in: B. Kortmann et al. (eds.) A handbook of varieties of English. Vol II: Morphology and syntax. Berlin: Mouton de Gruyter. 1142-1202.
Machin, D. and T. van Leeuwen (2003), ‘Global schemas and local discourses in Cosmopolitan’, Journal of Sociolinguistics, 7: 493-512.
Mair, C. (2002), ‘Creolisms in an emerging standard: written English in Jamaica’, English World-Wide, 23: 31-58.
Mair, C. (2006), Twentieth-century English: history, variation, standardization. Cambridge: CUP.
Mair, C. and G. Leech (2006), ‘Current changes’, in: B. Aarts and A. McMahon (eds.) The handbook of English linguistics. Oxford: Blackwell. 318-342.
Meyer, C. (2004), ‘Can you really study language variation in linguistic corpora?’ American Speech, 79: 339-355.
Meyerhoff, M. and N. Niedzielski (2003), ‘The globalization of vernacular variation’, Journal of Sociolinguistics, 7: 534-555.
Nevalainen, T. and H. Raumolin-Brunberg (2003), Historical sociolinguistics: language change in Tudor and Stuart England. London: Longman.
Nevalainen, T. (ed.) (2006), Types of variation: diachronic, dialectal and typological interfaces. Amsterdam: Benjamins.
Patrick, P.L. (1999), Urban Jamaican Creole: variation in the mesolect. Amsterdam: Benjamins.
Pennycook, A. (2003), ‘Global Englishes, Rip Slyme, and performativity’, Journal of Sociolinguistics, 7: 513-533.
Peters, P. (2001), ‘Corpus evidence on Australian style and usage’, in: D. Blair and P. Collins (eds.) English in Australia. Amsterdam: Benjamins. 163-178.
Romaine, S. and D. Lange (1991), ‘The use of like as a marker of reported speech and thought: a case of grammaticalization in progress’, American Speech, 66: 227-279.
Sand, A. (2004), ‘Shared morpho-syntactic features of contact varieties: article use’, World Englishes, 23: 281-298.
Sand, A. (forthcoming), ‘Angloversals? Shared morpho-syntactic features in contact varieties of English’, unpublished “habilitation” thesis, University of Freiburg.
Shields, K. (1989), ‘Standard English in Jamaica: A case of competing models’, English World-Wide, 10: 41-53.
Shields-Brodber, K. (1996), ‘‘Old skeleton, new skin’: the relationship between open syllable structure and consonant clusters in Jamaican English’, in: P. Christie (ed.) Caribbean Language Issues: Old and New. Kingston: UWI Press. 4-11.
Shields-Brodber, K. (1997), ‘Requiem for English in an ‘English-Speaking’ Community’, in: E. Schneider (ed.) Englishes around the World II: Caribbean, Africa, Asia, Australasia – Studies in Honour of Manfred Görlach. Amsterdam: Benjamins. 57-67.
Thurlow, C. and A. Jaworski (2003), ‘Communicating a global reach: inflight magazines as a globalizing genre in tourism’, Journal of Sociolinguistics, 7: 579-606.
Yaeger-Dror, M., L. Hall-Lew and S. Deckert (2002), ‘It’s not or isn’t it? Using large corpora to determine the influences on contraction strategies’, Language Variation and Change, 14: 79-118.
Appendix
A: Be-contractions in five ICE corpora (conversations only) – raw data
                      ICE-GB  ICE-NZ  ICE-IE  ICE-JA  ICE-India
I am                      25       4      39      73        275
I am not                   2       1       2      12         41
I’m                      678     505     620     732        397
I’m not                  135      88      95     150         66
I amn’t                    0       0       0       0          1
you are                   30       6      29      86        321
you are not                1       2       1       9         27
you’re                   388     272     346     462         91
you’re not                63      33      56      66          2
you aren’t                 2       1       0       2          0
he/she/it is             114      51     244     229        905
he/she/it is not           7       3       5      19         81
he’s/she’s/it’s         2087    2099    2234    1201        881
he’s/she’s/it’s not      208     216     231     169         87
he/she/it isn’t           26       9      13       9          2
we are                    16       9      12      60        185
we are not                 1       0       0       2         28
we’re                    147     152     113     158         22
we’re not                 11      11      14      12          2
we aren’t                  1       0       0       0          0
they are                  32      14      28      77        390
they are not               4       0       0      15         44
they’re                  258     387     340     227         32
they’re not               32      34      27      25          5
they aren’t                0       0       3       0          0
B: Be-contractions in five ICE corpora (conversations only) – summary
            not contracted/   not contracted/   subject-verb   negative
            not negated       negated           contraction    contraction
ICE-GB      217               15                4007           29
ICE-NZ      84                6                 3797           12*
ICE-IE      352               8                 4076           16
ICE-JA      525               57                3202           12*
ICE-India   2076              221               1585           3
* Figures include two instances of ain’t (ICE-NZ) and one instance (ICE-JA) respectively.
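The summary counts in Appendix B can be turned into a single comparable figure per corpus. The sketch below is only an illustration: the rate it computes (all contracted forms as a share of all contractible be-forms) is an assumed definition, not necessarily the measure behind Table 7, and the names SUMMARY, RATES and contraction_rate are hypothetical.

```python
# Appendix B summary counts per corpus, in the order:
# (not contracted / not negated, not contracted / negated,
#  subject-verb contraction, negative contraction)
SUMMARY = {
    "ICE-GB": (217, 15, 4007, 29),
    "ICE-NZ": (84, 6, 3797, 12),    # negative contractions include 2 x ain't
    "ICE-IE": (352, 8, 4076, 16),
    "ICE-JA": (525, 57, 3202, 12),  # negative contractions include 1 x ain't
    "ICE-India": (2076, 221, 1585, 3),
}


def contraction_rate(counts):
    """Per cent of contractible be-forms that are actually contracted.

    Assumed definition for illustration: subject-verb plus negative
    contractions, divided by the total of all four categories.
    """
    full_plain, full_negated, sv_contracted, neg_contracted = counts
    contracted = sv_contracted + neg_contracted
    total = contracted + full_plain + full_negated
    return contracted * 100 / total


RATES = {corpus: contraction_rate(counts) for corpus, counts in SUMMARY.items()}

# Print the corpora from the lowest to the highest contraction rate.
for corpus, rate in sorted(RATES.items(), key=lambda item: item[1]):
    print(f"{corpus:<10} {rate:5.1f}% contracted")
```

On this assumed measure the corpora are ordered by how strongly their conversations favour contracted forms, which makes the distance between ICE-India and the other four corpora easy to see at a glance.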
Creating corpora from spoken legacy materials: variation and change meet corpus linguistics
Joan C. Beal
University of Sheffield
Abstract
Contrasting the aims and methodologies of corpus linguists and variationists, Charles Meyer writes that the latter ‘have been more interested in spoken language’ and ‘have tended to collect data for private use and have not generally made public their data sets’ (2006: 169). Since the advent of sociolinguistics in the 1960s, individual scholars and research teams have been amassing recordings of spoken data, often for the purpose of investigating variation across a limited number of linguistic features. Surprisingly little of this material has, however, been made accessible to the wider community of scholars. As John Widdowson points out, ‘much of this data remains hidden and inaccessible, scattered in numerous, often obscure, repositories’ (2003: 81). What is more, these valuable legacy materials are often kept in inadequate storage facilities, and in obsolescent media, leading to the danger of them being lost forever. The Newcastle Electronic Corpus of Tyneside English (NECTE) was created with the aid of a Resource Enhancement Grant from the then AHRB with the primary objective of ‘rescuing’ legacy materials from the Tyneside Linguistic Survey collected c.1969 and creating an accessible corpus by combining these with more recently-collected data from the Phonological Variation and Change project, collected c.1994. More specifically, the resultant corpus was designed to be of use to as wide a range of end-users as possible and therefore available in a number of formats: sound, phonetic transcription, orthographic transcription and grammatical mark-up. The challenges posed by this project, and the ways in which the project team overcame them, will be the main focus of this paper, and should provide useful pointers to anybody intending to embark on creating a corpus of spoken language, whether from legacy materials or from newly-collected data. The topics to be covered are: (i) ethical and legal issues surrounding the making accessible of data collected in an era before ethics review or the UK’s 1998 Data Protection Act; (ii) the challenges involved in gathering metadata and digitising ‘old’ audio material; (iii) standards of transcription and mark-up. Finally, there will be some discussion of plans to process other ‘legacy’ materials, and progress made towards developing common standards, as set out in Kretzschmar et al. (2006).
1.
Introduction: Corpus Linguists and Sociolinguists
In his introduction to the special volume of Journal of English Linguistics devoted to papers from ICAME 2005, Charles Meyer notes that ‘although corpus linguists and variationists…have always had a shared interest in the analysis of empirical data, they have approached the analysis of variation in different ways’ (2006: 169). He goes on to contrast the approaches of corpus linguists and variationists in the following ways:
1. Whilst corpus linguists have tended to study both spoken and written language, variationists have concentrated on spoken data;
2. Corpus linguists create public corpora, whilst sociolinguists mainly collect data for private use;
3. Corpus linguists have concentrated on standard varieties, whereas sociolinguists have paid more attention to non-standard accents and dialects.
The title of the conference session whose papers appear in this issue of JEL, ‘Corpora and the Study of Regional and Social Variation’, itself indicates that there is increasing convergence between variationists and corpus linguists on point 3. The availability of corpora of different national varieties of English, most notably the ICE corpora, and of regional varieties, such as the Freiburg Corpus of English Dialects (FRED), has allowed corpus linguists to turn their attention to variation and variationists to have access to large amounts of comparable data. At Sociolinguistics Symposium 15 in 2004, a workshop entitled ‘Models and Methods in the Handling of Unconventional Digital Corpora’ included fourteen contributions from a diverse group of scholars, some of whom would consider themselves corpus linguists, some socio- or historical linguists, but all of whom had developed or were developing corpora which incorporated historical, regional or social variation. The very fact that such a wide range of scholars participated in this workshop bears witness to this convergence between the disciplines.1 Point 1 is true to some extent, though some variationists have looked at corpora of written language: for instance, Sali Tagliamonte (2007) has compiled a corpus of data from instant messaging in order to analyse adolescents’ use of language online. What I would like to concentrate on in this paper is point 2: is it still true that variationists collect data for private use, and, if so, what are the obstacles to making this public?
2.
‘Hidden and Inaccessible’: the legacy of sociolinguistics
In a paper first delivered at the first UK Language Variation and Change conference in Reading (1997), but published in 2003, John Widdowson called for a corpus to be created from all the material collected by variationists during the 20th century, or at least as much as survives. He laments the fact that: much remains in often widely dispersed and inaccessible locations in departmental collections, or, we must admit to our shame, kept in inadequate storage conditions in our own offices, or even at home, gathering dust, wow and flutter, print-through and meltdown, silently shedding the hard-won sounds of twentieth-century speech in the
constantly dispersing particles of ferric oxide of an obsolescent recording system. (Widdowson 2003: 84)
Widdowson’s description of the vast amount of linguistic data languishing unloved and undiscovered would melt the hardest heart, but the idea of gathering all these into a national repository is impractical, to say the least. Issues of copyright, ownership and data-protection alone would strangle such a project at birth. Any scholar with a box of audio-tapes in the attic, perhaps recorded for a student project, needs to ask questions such as: who owns the intellectual property in them, the researcher or the university at which he or she was studying at the time? Was informed consent obtained from the speakers recorded, and is there a record of this? Did this consent include the recording being made available to other researchers? Did the World Wide Web even exist when the recordings were made? Is there a record of the speakers’ names and addresses from which they could be contacted in order to obtain consent retrospectively? As I hope to demonstrate, these problems are not insuperable, and for recently-collected data the requirement of the major research councils that data from funded projects be deposited with Qualidata or AHDS will protect the legacy for future researchers2, but when dealing with legacy materials I would argue that it is better to start from the bottom up, dealing with individual collections whose provenance is known, rather than attempting the mass rescue advocated by Widdowson.
3.
A Case Study: The Newcastle Electronic Corpus of Tyneside English
3.1
Overview
The Newcastle Electronic Corpus of Tyneside English (NECTE)3 can be described as a legacy corpus in that it brings together materials that had been collected for two sociolinguistic projects conducted in Tyneside, North-east England, at the beginning and the end of the second half of the 20th century. These were (i) the Tyneside Linguistic Survey (TLS), collected in 1969 in Gateshead on the South bank of the River Tyne and Newcastle on the North bank, and (ii) the Phonological Variation and Change (PVC) project, collected in 1994 in Newcastle. The aim of the NECTE project team was to create an accessible database which would make the materials available to as wide a range of users as possible, and which would be, as far as this is possible, ‘future proof’. NECTE is in no sense a ‘balanced’ corpus like the BNC: it simply preserves and makes available the data that we inherited. In the case of the more recent of the two sub-corpora, this is less problematic, in that the research design of the ESRC-funded PVC project required a balanced sample, and the data, already digitised and properly stored, did not need to be rescued. The TLS materials are another story, exemplifying Widdowson’s notion of ‘hidden and inaccessible’ data.
The aims and methodology of the TLS project are outlined in Strang (1968) and Pellowe, Nixon, Strang & McNeany (1972). The plan was to conduct loosely-structured interviews with 150 informants drawn from a stratified random sample of Gateshead. A grid was drawn over a map of Gateshead and equal numbers of informants were contacted from within each square on the grid. We are lucky in that a single individual conducted all the interviews - Vincent McNeany - as we have learned within sociolinguistics that the kind of data produced often depends very much on who is collecting the data. Different interviewers can potentially produce different kinds of data and this is not what you would want if you are trying to compare speakers with one another. McNeany was a postgraduate student at the time, but had been born and raised in the community from which he was collecting the data, and still lived there at the time of the project. He had a local accent and was able to put participants at their ease, often referring to shared experiences. The interviews were recorded onto reel-to-reel tapes, 103 of which remain, of which 3 are badly damaged. The whereabouts of the remaining tapes are, at the time of writing, unknown. The TLS team also set out to interview a matching number of informants from Newcastle, but, sadly, none of these recordings have ever been found. John Pellowe, the Principal Investigator on the TLS project, left Newcastle in 1980. Thereafter, the only published work based on the TLS material was Jones-Sargent (1983), though the data was occasionally used by individual researchers. I remained aware of its existence and whereabouts because, during the period between 1977 and 2001 when I was employed by the University of Newcastle, I was frequently asked for samples of ‘traditional’ Tyneside speech, and, with a small legacy from an alumnus, had one recording transcribed and transferred to audiocassette for this purpose. 
The majority of the recordings and other materials remained in storage in what is now the School of English Literature, Language and Linguistics at Newcastle University. By ‘storage’, I am not referring to controlled archival conditions, but to boxes in cupboards, not exactly ‘hidden and inaccessible’ but at the very least in danger of deterioration. Some came to light only after our project began: John Local, who had worked on the TLS project as a graduate student, but subsequently took up a post at the University of York, brought in a number of recordings which he had taken with him, and alerted us to the fact that others had been deposited with the British Library. There may, for all we know, be others ‘out there’. In 1994, I began the resurrection of the project with a small grant from the Catherine Cookson Foundation, a charitable trust financed by the eponymous author of historical romances. This involved transferring the original reel-to-reel materials onto audio-cassettes: without this intervention, much of the corpus would today be unusable. As it happens, this transfer to what has now become an obsolescent medium happened not a moment too soon. By the time we were able to digitise the recordings, some of the reel-to-reel tapes had deteriorated so much that we had to digitise from the audio-cassette copies. We subsequently learned that the shelf-life of reel-to-reel analogue tapes is estimated at about 25 years.
Having thus ‘rescued’ the data, the NECTE project team faced a number of challenges. The following sections will outline the nature of these challenges and provide an account of the NECTE team’s response to each of them in turn.
3.2 Challenge 1: Ethics and ‘informed consent’
To comply with the ethical review procedures of the AHRC, and of our own universities, the NECTE project team had to be able to demonstrate that the subjects of both the TLS and PVC had given their informed consent to be recorded, and, more importantly given that the whole purpose of the NECTE project was to make this data more widely available, that they agreed to the recordings being accessed by other researchers. In the case of the PVC project, this was unproblematic, since it was conducted under the auspices of the ESRC, and in compliance with the 1984 Data Protection Act. However, the TLS researchers in 1969 had no Data Protection Act to comply with, and there were, of course, no university ethics committees. Even at this early stage, though, the SSRC, precursor to the ESRC, had an ethics policy in place, and we were fortunate enough to be able to recover evidence that the subjects had indeed given informed consent to being recorded, and to the recordings being made available to future researchers. A letter to subjects was found which stated ‘The results of the survey will in due course be published, but no resident who has helped by talking in this way will be referred to in such a way that they could be identified’ and which was signed by Barbara Strang, Professor of English Language and General Linguistics, University of Newcastle upon Tyne. Of course, these subjects could have had no idea that there would one day be such a thing as the World Wide Web, and that the recordings might be available to anybody in the world at the click of a mouse. This creates something of a grey area: the 1969 agreement guarantees anonymity, but is a voice ever truly anonymous?
From the outset of the project, we were aware of the importance of taking advice from the Arts and Humanities Data Service, but it also became apparent that we were breaking new ground, and we were subsequently invited to give a paper on the legal and ethical issues involved at an AHDS one-day course on copyright and data-protection issues in 2003. We also took advice from Newcastle University’s Data Protection Officer. Although compliance with the DPA is essential, where the material is older and the subjects no longer alive, it may be necessary to take a more pragmatic view. The ‘Sounds Familiar’ website at the British Library which, in connection with the BBC Voices project, has made some of the recordings from the Survey of English Dialects (SED) available, could not have got off the ground had such a strict view of data protection been taken. There is no official record of consent for publishing the recordings from the SED informants themselves and any attempt at securing these retrospectively would have been impossible given that none of the speakers is still alive. It was felt that sufficient time had elapsed to consider making the recordings more widely available. Much consideration was given to the close relationships that the fieldworkers developed with the informants, and there is a great deal of reference in the SED peripheral literature
to the pride the informants felt in being asked to take part in the survey. It was felt that using extracts (sympathetically selected so that no individual would be compromised in any way) would be appropriate. In any case, the informants at the time were all aware of and comfortable with the idea that their responses would be published (e.g. in Orton & Halliday (1962)), used in lectures, talks etc. and even occasionally broadcast on the BBC. The only condition was that the recordings should be streamed and therefore not downloadable. In the light of the numerous responses that the BL have had from descendants of the original informants they feel it was indeed the right decision; they have had contact with a number of people and been able to supply copies of the recordings for their family archives, for instance (see note 4).
The TLS subjects had been promised anonymity. To achieve this, the NECTE project removed all names from recordings and transcripts. A table with names and ID codes was created which could only be accessed by the project team, and this was securely stored. The original audio data are now stored in a safe, on two password-restricted computers, and on a computer in a locked archive with access restricted to the NECTE research team and legitimate associated scholars. Because the free-wheeling nature of the TLS interviews meant that subjects spoke about matters considered ‘sensitive’ under the 1998 Data Protection Act (health, religion, politics, trade union membership), and because some were minors at the time of recording, it would not be acceptable to make the recordings freely available on the web. For this reason, researchers wishing to access the NECTE corpus must complete and sign a form, stating their credentials and reasons for wanting to use the material and agreeing to comply with the DPA.
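The anonymisation step described above (names removed from transcripts, with a separately stored table linking names to ID codes) can be illustrated with a minimal sketch. This is not the NECTE team's actual procedure: the informant names, the `ID001` code format and the functions `build_table` and `pseudonymise` are all hypothetical, and the sketch assumes that the names to be removed are already known.

```python
import re

def build_table(names, prefix="ID"):
    """Assign each known name a stable ID code (hypothetical format)."""
    return {name: f"{prefix}{i:03d}" for i, name in enumerate(sorted(names), start=1)}

def pseudonymise(text, table):
    """Replace every whole-word occurrence of a known name with its ID code."""
    if not table:
        return text
    # Try longer names first so that a full name wins over a shorter name inside it.
    ordered = sorted(table, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in ordered) + r")\b")
    return pattern.sub(lambda m: table[m.group(1)], text)

# Invented names; in practice the table would be stored securely, apart from the corpus.
table = build_table(["Mary Smith", "Tom Brown"])
safe = pseudonymise("Mary Smith lived near Tom Brown.", table)
```

Keeping the lookup table apart from the transcripts is the point of the design: the published text carries only the codes, while the link back to individuals stays with the project team.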
Projects such as the SCOTS corpus (www.scottishcorpus.ac.uk), for which spoken data is deliberately collected rather than ‘rescued’, can build informed consent into the design from the beginning, and thus make their material much more widely available, but with legacy materials this is not possible, unless a difficult and lengthy process of contacting subjects is undertaken. Where subjects have died, we were advised that we would be in a ‘Catch 22’ situation: to gain the informed consent of their family would mean breaching the confidentiality of the subject. Compliance with the Data Protection Act (1998) has thus involved putting in place a number of safeguards which restrict immediate access to the NECTE materials. However, these safeguards have not made NECTE inaccessible. To access the corpus, one has to be serious and put in some effort, so it is not likely to be accessed by the casual ‘surfer’. Nevertheless it has proved useful for research and pedagogy at various levels: it has been used for research on phonology, discourse, morphology and syntax; for teaching at high school (GCE AS and A2), undergraduate and Masters levels; and by scholars in the UK, Europe, North America and China.
3.3 Challenge 2: gathering the materials
In a recent account of the NECTE corpus, Allen et al. admit that ‘as restoration and digitization efforts progressed, it became evident that only a fragment of the projected TLS corpus had survived’ (2007: 20). The information in unpublished TLS project documentation (as well as that in the public domain) did not allow the NECTE team to decide with any certainty how large the corpus originally was. We are not sure, for example, how many interviews were conducted, and the literature gives conflicting reports of 150 and 200. It is also unknown how many of the original interviews were orthographically and phonetically transcribed. Jones-Sargent (1983) used 52 (digitally-encoded) phonetic transcriptions in her computational analysis, but the TLS material includes seven electronic files that we recovered from the Oxford Text Archive, but that she did not use. As such, there were clearly more than 52 phonetic transcriptions, but was the ultimate figure 59, or were further files digitized but never passed to the OTA? The ‘legacy’ of the TLS project currently held by the NECTE project is as follows:
• 103 audio recordings, of which 3 are badly damaged. For the remaining interviews, the corresponding analogue tape is either blank or simply missing.
• 57 index card sets, all of which are complete.
• 61 digital phonetic transcription files.
• 64 digital social data files.
This is still a lot of data, but mystery surrounds the missing materials: were there ever 200 or even 150 recordings, and if so, where are the others?
The TLS was innovative and ground-breaking, in many ways ahead of its time. It is difficult to get anyone under the age of 30 to understand the concept of a reel-to-reel analogue tape, but when I start talking about the fact that data for the TLS had to be input to a vast computer in the form of cards punched by a team of data processors, people of this age are astonished.
The TLS team pioneered multivariate analysis, using an early version of the cluster analysis programme, CLUSTAN (see note 5). Rather than transcribe the data into IPA, they developed a hierarchical coding system, and the research associate Vince McNeany became so familiar with this that he transcribed straight into the code. Figure 1 shows an extract from the TLS coding system, which was preserved both in a manual, and on a chart made out of old wallpaper. We were able to digitize this historical artefact for posterity. It shows the meticulous phonetic detail of the TLS transcriptions and coding.
Figure 1: The TLS coding system
The coding system involves three levels: the symbols in the boxes at the top of Figure 1 represent Overall Units (OUs), equivalent to the lexical sets used by Wells (1982) to enable comparison of different accents. The next level is that of the Putative Diasystemic Variant (PDV): these are represented by the IPA symbols in the left-hand column under each OU, and are roughly equivalent to the phonemic level of transcription. The symbols which appear to the right of each PDV are ‘states’, each representing a different phonetic variant. Each of these has a number, such that the code for any output indicates not only its precise phonetic nature but the phoneme of which it is an allophone and the lexical set in which it was used. The TLS transcriptions were hand-written on index cards like the one that appears in Figure 2.
Figure 2: TLS transcription card
Initially, from NECTE’s perspective, these electronic files appeared to be a labour- and time-saving alternative to keying in the numerical codes from the index cards. However, a peculiarity that stems from the original electronic data
entry system used by the computing staff who had input the data from the TLS team’s original index cards meant that the resulting files had to be extensively edited by members of the NECTE team when they were returned to us from the OTA. The problem arose from the way in which the five-digit codes were laid out by the TLS researchers on the index cards, as you can see in Figure 2. For reasons that are no longer clear, all the consonant codes (beginning (0294(1)) in line 4) were written on one line, and all of the vowel codes appear on the line below ((0134(1)) on line 5). When the TLS gave these index cards to the University of Newcastle data entry service, the typists entered the codes line by line, with the result that, in any given electronic line, all the consonant codes come first, followed by the vowel codes. This difficulty pervades the TLS electronic phonetic transcription files. While it had no impact on the output of the TLS team (given that they were examining codes in isolation and that phonetic environment had already been captured by their hierarchical scheme), it was highly problematic for the NECTE enhancement of the original materials. Simply to have kept this ordering would have made the phonetic representation difficult to relate to the other types of representation planned for the NECTE enhancement scheme. The TLS files were therefore edited with reference to the index cards so as to restore the correct code sequencing, and the result was proof-read for accuracy. The example in Figure 3 shows the intermediate (PDV) TLS phonetic representation, equivalent to a broad segmental phonetic IPA representation. In the corpus, each PDV segment is, however, indexed into up to 10 state variants, equivalent to a (very) detailed phonetic IPA representation.
Orthographic: Down by Clark Chapman’s
Segmental Phonetic (PDV): dũƘn baŸ klşk Ƶæpmԥnz
Figure 3: Example of NECTE transcriptions
As already indicated in 3.1, the TLS recordings were, in the event, digitised just in time. Some of them had deteriorated considerably, and even where the sound quality was still acceptable, there were problems. The interviewer had carried an UHER portable recorder to subjects’ houses. These machines allow recording and playing at different speeds. If he thought the tape was going to run out before the end of the interview, he would simply change the recording speed. This meant that the digitised recordings would change speed at random points, and the speakers would sound like the cartoon characters The Chipmunks. This had to be put right at a later stage. The original analogue recordings, both reel-to-reel and cassette versions, were first digitized at a high sampling rate; a graphic equalisation process was then applied to clarify the sound; a hiss reduction filter and a click eliminator were applied; and variations in tape recording speed were eliminated (see note 6). Other consequences of recording in subjects’ houses include traffic noise, interruptions, and in one case a rather loud budgie in the background. Nevertheless, the recordings available on the NECTE website, whilst perhaps not
suitable for acoustic analysis, are clear enough to be comprehensible, and to bring the voices of late 1960s Gateshead to life.
3.4 Challenge 3: transcription
A more detailed account of the principles and methods we used for transcription of the NECTE corpus can be found in Beal, Corrigan, Smith and Rayson (2007). The audio content of the TLS and PVC corpora has been transcribed into British English orthographic representation, and this, too, is included in its entirety in the NECTE corpus. Two problems were encountered and, we hope, resolved in creating this representation: (i) application of English orthography to non-standard spoken English and (ii) transcription accuracy. Since NECTE makes sound files and some phonetic transcriptions available, and since the practice of representing non-standard phonology by semi-phonetic spelling has been discredited by e.g. Preston (1985, 2000), we took the principled decision to use Standard British English spelling in our orthographic transcriptions, except where the item was lexically or morphologically distinct. Thus, for example, the characteristic Tyneside pronunciation /na:/ for SE know would be given a semi-phonetic spelling in popular representations of the dialect (see note 7), but it is transcribed ‘know’ in NECTE. Transcribers adhered to a strict protocol, which can be found on the NECTE website. Any large-scale textual transcription project will be subject to human error so, to maximize accuracy, we conducted two correction passes on our primary transcription. These were carried out by two different members of the NECTE team who were themselves not involved in the primary transcription; the decision criterion was majority agreement.
[TLS/G052] and eh I I lived in with my mother for not quite two year but varnigh
[TLS/01] aye
[TLS/G052] and I went to lobley-hill that was my first house
[TLS/01] ah yes yes
[TLS/G052] and I shifted I got an exchange to be near my mother you-know
[TLS/01] yeah
[TLS/G052] {xx} in the flat
[TLS/01] oh aye
[TLS/G052] well I lived in there for about oh .. eighteen or nineteen year, maybe a little bit longer I divven’t know but eh then I come over here because they were modernising the flats you see
Figure 4: Extract from a TLS transcription
Figure 4 is an example of the kind of transcription file that was produced by the NECTE transcribers. Notice that /na:/ is spelt ‘know’, but ‘divven’t’ is not
represented as Standard English ‘don’t’. This is because, in this case, the difference is morphological rather than just phonetic. In fact Heike Pichler of the University of Aberdeen has accessed the corpus to provide comparable material for her (2008) study of ‘divven’t’ in Berwick upon Tweed. Had we not decided to represent morphological alternations like this in the transcription, her task would have been much more difficult. Varnigh is a rather archaic word meaning ‘very nearly’ and, as such, is transcribed according to an agreed protocol recorded in the NECTE appendix 2, which is a lexicon of dialect terms used in the corpus and can be found at http://www.ncl.ac.uk/necte/appendix2.htm.
3.5 Challenge 4: Tagging
With regard to tagging, the challenge presented to the NECTE team was that existing tagging software had to be used and the tools in question had to encode non-standard English reliably, that is, without the need for considerable human intervention in the tagging process and / or for extensive subsequent proofreading. As was the case with transcription, I do not intend to go into too much detail here concerning the tagging of NECTE, because Nick Smith and Paul Rayson have covered this in the paper from the 2006 ICAME conference which is published as Beal, Corrigan, Smith and Rayson (2007). What I can say is that both the NECTE team and our colleagues at UCREL learned a great deal from our successful attempt to modify the CLAWS tagger for use with non-standard English. The additions to CLAWS include the following:
• pronouns: wor = ‘our’ (= possessive form of personal pronoun); tagged APPGE;
• mesel, hisself, theirself, theirselves, etc. (= reflexive personal pronoun); tagged PPX1 or PPX2;
• auxiliaries: div = a regional variant of the auxiliary do, non-3rd singular present tense; tagged VAD0.
Some of the more idiosyncratic usages in Tyneside English could simply be added to the lexicon, even though I might prefer to classify them as morphological variants. Tyneside English is distinguished from Standard English, or at least the kind of English found in standard corpora like BNC, by its diversity of discourse markers. In Tyneside English you can get strings of discourse markers like ‘way ye bugger man’ which together simply express surprise. Examples of discourse markers found in the NECTE corpus are wey, like, aye, well, uhhuh, huh, ah, you know, and I mean. CLAWS did not have a specific tag for these, but it proved a satisfactory solution to use the existing CLAWS tag for an interjection, UH, for these.
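The lexicon additions and the UH fallback described above can be sketched as a simple lookup. This is an illustration only, not the CLAWS implementation: the dictionary, the `tag` function and the `UNTAGGED` placeholder are hypothetical, the assignment of PPX1 to mesel and hisself and PPX2 to theirselves is an assumption (the text gives only ‘PPX1 or PPX2’), and only those discourse markers that are unlikely to be ambiguous are included.

```python
# Hypothetical supplementary lexicon: Tyneside forms mapped to the CLAWS tags named in the text.
TYNESIDE_LEXICON = {
    "wor": "APPGE",        # 'our', possessive form of personal pronoun
    "mesel": "PPX1",       # reflexive pronoun (singular assumed)
    "hisself": "PPX1",     # reflexive pronoun (singular assumed)
    "theirselves": "PPX2", # reflexive pronoun (plural assumed)
    "div": "VAD0",         # regional variant of auxiliary 'do'
}

# Discourse markers that receive the existing interjection tag UH.
# 'like' and 'well' are left out because they also occur as ordinary words.
DISCOURSE_MARKERS = {"wey", "aye", "uhhuh", "huh", "ah"}

def tag(tokens, fallback="UNTAGGED"):
    """Tag each token from the supplementary lexicon; leave the rest for the main tagger."""
    tagged = []
    for token in tokens:
        form = token.lower()
        if form in TYNESIDE_LEXICON:
            tagged.append((token, TYNESIDE_LEXICON[form]))
        elif form in DISCOURSE_MARKERS:
            tagged.append((token, "UH"))
        else:
            tagged.append((token, fallback))  # placeholder, not a CLAWS tag
    return tagged

result = tag("wey aye wor house".split())
```

A lookup of this kind only covers forms with a single function; forms whose function depends on context need more than a lexicon entry.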
Certain forms still proved difficult to tag automatically, especially where forms in Tyneside English have different functions to the same surface form in Standard English. Examples of this are: went as past participle, as in ‘If I’d went’; give, come, seen and done as preterites; and we as first person plural object pronoun, as in ‘She sent we’. The use of forms identical to the Standard English
preterite as past participle, such as ‘If I’d went’, could be caught if an auxiliary is detected before it, and preterite ‘give’ could be identified as such if a 3rd person singular pronoun preceded it, but these forms proved impossible to tag. However, we were pleasantly surprised by the extent to which the CLAWS tagger could be adapted to deal with this non-standard variety, and, in practice, any researcher investigating morphological or morpho-syntactic variation in Tyneside would be aware of these forms and search for them in context.
3.6 Challenge 5: ‘Future-proofing’
One of the principal aims of NECTE was to ‘future-proof’ this important resource. Since I became involved in the world of archives, I have encountered a great deal of scepticism about the longevity of digital materials. When a similar collection of recordings made in Sheffield in the early 1980s, the Survey of Sheffield Usage, was digitized and made available on CD, questions were asked about the relative shelf-lives of CD versus archival audiotape. The truth is, we do not know, but digitising these audio collections gives them their best chance of survival. By depositing the digitised materials with the AHDS as well as on a secure server at the University of Newcastle, we have done the best we can to future-proof them. Of course, we do not envisage these materials being left in a cupboard, virtual or real, again, and the many users to whom DVD copies of the corpus have been distributed provide further safeguards against loss. We keep a record of all these requests and so, in the event of catastrophe, could ask them to return the favour. In order to ensure that the corpus would work on all platforms and with all software applications, we encoded NECTE using Text Encoding Initiative (TEI)-conformant Extensible Markup Language (XML) syntax. XML (http://www.w3.org/XML/) aims to encourage the creation of information resources that are independent both of the specific characteristics of the computer platforms on which they reside (Macintosh versus Windows, for example), and of the software applications used to interpret them. To this end, XML provides a standard for structuring documents and document collections. TEI defines an extensive range of XML constructs as a standard for the creation of textual corpora in particular. Together, these are emerging as world standards for the encoding of digital information, and it is for this reason that NECTE adopted them.
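The platform independence that XML offers can be illustrated with a minimal sketch using Python's standard library. It assumes a TEI-style `u` (utterance) element with a `who` attribute identifying the speaker; the speaker identifiers and the overall layout are invented for illustration and do not reproduce the actual NECTE markup scheme.

```python
import xml.etree.ElementTree as ET

# Two turns taken from the kind of exchange shown in Figure 4; speaker IDs are invented.
turns = [
    ("informant", "I divven't know"),
    ("interviewer", "aye"),
]

# Encode each turn as an utterance element inside a body element.
body = ET.Element("body")
for speaker, words in turns:
    utterance = ET.SubElement(body, "u", who=f"#{speaker}")
    utterance.text = words

# The serialised corpus is a plain text file that any XML-aware tool can read.
xml_text = ET.tostring(body, encoding="unicode")

# Reading it back: the structure (who spoke) and the words are both recoverable.
parsed = ET.fromstring(xml_text)
speakers = [u.get("who") for u in parsed.iter("u")]
plain = " ".join(u.text for u in parsed.iter("u"))
```

The same file therefore serves a user who wants the markup and one who wants only the words, since the plain text can be derived from the encoded version but not the other way round.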
The AHRC in fact strongly recommends that XML is used, but we were surprised to find that NECTE was the first AHRC-funded linguistic corpus to use XML. The reason for this is probably the perceived lack of ‘user-friendliness’ of XML: as we state elsewhere ‘users not familiar with these standards may find the pervasive markup tags in the NECTE files a distracting encumbrance and yearn for the good old days of plain text files’ (Allen et al. 2007: 36). Complaints about the lack of user-friendliness are perhaps not entirely unjustified. XML is a markup language that provides a standard for the structuring of documents and document collections, and, although XML-encoded documents are plain text files that can be read by humans, in general they should not be. For an XML document to be readily legible, software that can represent the structural markup in a visually-accessible way is required. XML-aware
software visualization and analysis tools are gradually becoming available. The Oxford University Computing Service’s Xaira system (http://www.oucs.ox.ac.uk/rts/xaira/), for instance, is ‘a general purpose XML search engine, which will operate on any corpus of well-formed XML documents. It is, however, best used with TEI-conformant documents’. Nicolas Ballier of the University of Paris 13 has successfully used Xaira with NECTE. Mike Scott has reported to us that, with minimal adaptation, he has been able to use NECTE with WordSmith, and Anita Auer has been able to remove the mark-up to present the files to MA students as more user-friendly files for small-scale analysis projects. NECTE is thus fulfilling our aim of making available a corpus which can be used on a variety of platforms and with a variety of analysis tools.
4. Next Steps
I hope that this paper has demonstrated that, whilst the mass rescue envisaged by John Widdowson (2003) may not be feasible, we should not give up hope of creating useful corpora from legacy materials. The NECTE team learned a great deal from colleagues in both sociolinguistics and corpus linguistics in the course of the project, and we hope that our corpus will provide a model for future ‘rescue’ operations which would likewise be informed by corpus linguistics. The Survey of Sheffield Usage, held in the Archives of Cultural Tradition at the University of Sheffield, has been partially digitised and transcribed according to the principles outlined in 3.4 (see note 8), and I hope to produce an accessible Corpus of Sheffield Usage in due course. The networking opportunities offered by events such as the ICAME conferences have led to a group of researchers working towards agreement on common methods for producing corpora for regional and social analysis of languages and varieties (Kretzschmar et al. 2006). Perhaps the bleak future predicted by Widdowson can be avoided, after all.
Notes
1 The papers from this workshop, along with invited contributions from scholars who were not able to attend but had developed similar corpora, have been published in a two-volume collection: Creating and Digitizing Language Corpora (eds. Beal, Corrigan and Moisl 2007). For details of the workshop see http://www.ncl.ac.uk/ss15/panels/
2 This information was correct at the time of the ICAME conference, but, shortly afterwards, the AHRC released the news that they were no longer able to finance AHDS.
3 This project was financed by Resource Enhancement Grant AHRB RE11776 from what was then the Arts and Humanities Research Board (now AHRC), Principal Investigator K.P. Corrigan. The project website is at www.ncl.ac.uk/necte
4 Thanks to Jonnie Robinson, Lead Content Specialist: Sociolinguistics and Education, Social Sciences Collections and Research at the British Library, for this information. The websites can be viewed at http://www.bl.uk/learning/langlit/sounds and http://www.bbc.co.uk/voices.
5 Updated versions of CLUSTAN have since been successfully applied in a wide range of disciplines: see http://www.clustan.com/
6 Thanks to Jonathan Marshall, now at the University of Gloucester, for carrying out this essential restoration work.
7 See Beal (2000) for further discussion of orthographic representation of Tyneside speech in popular literature.
8 I acknowledge the assistance of the British Academy in providing a Small Grant to finance transcription.
References
Allen, W., J.C. Beal, K.P. Corrigan, W. Maguire and H.L. Moisl (2007), ‘A linguistic time capsule: the Newcastle Electronic Corpus of Tyneside English’, in: Beal, J.C., K.P. Corrigan and H.L. Moisl (eds.), Creating and Digitizing Language Corpora, volume 2: Diachronic Databases, Basingstoke: Palgrave Macmillan. 16-48.
Beal, J.C. (2000), ‘From Geordie Ridley to Viz: Popular Literature in Tyneside English’, Language and Literature, 9, 4: 343-359.
Beal, J.C., K.P. Corrigan and H.L. Moisl (eds.) (2007), Creating and Digitizing Language Corpora, volume 1: Synchronic Databases, volume 2: Diachronic Databases, Basingstoke: Palgrave Macmillan.
Beal, J.C., K.P. Corrigan, N. Smith and P. Rayson (2007), ‘Writing the vernacular: Transcribing and tagging the Newcastle Electronic Corpus of Tyneside English’, Studies in Variation, Contact and Change, 1. http://www.helsinki.fi/varieng/journal/volumes/01/beal_et_al
Jones-Sargent, V. (1983), Tyne Bytes. A computerised sociolinguistic study of Tyneside, Frankfurt am Main: Peter Lang.
Kretzschmar, W.A., J.C. Beal, J. Anderson, K.P. Corrigan, L. Opas-Hänninen and B. Plichta (2006), ‘Collaboration on Corpora for Regional and Social Analysis’, Journal of English Linguistics, 34, 3: 172-205.
Meyer, C.F. (2006), ‘Editor’s Note’, Journal of English Linguistics, 34, 3: 169-171.
Orton, H. and W.J. Halliday (eds.) (1962), Survey of English Dialects by Harold Orton and Eugen Dieth. B, The Basic Material, Vol. 1, The Six Northern Counties and the Isle of Man, Leeds: E.J. Arnold for the University of Leeds.
Pichler, H. (2008), A qualitative-quantitative analysis of negative auxiliaries in a northern English dialect: I DON'T KNOW and I DON'T THINK, _innit_?, University of Aberdeen PhD Thesis.
Preston, D.R. (1985), ‘The Li’l Abner syndrome: Written representations of speech’, American Speech, 60, 4: 328-336.
Preston, D.R. (2000), ‘Mowr and mowr bayud spellin: Confessions of a sociolinguist’, Journal of Sociolinguistics, 4, 4: 614-621.
Tagliamonte, S. (2007), ‘Corpora from the virtual world: teenagers, instant messaging and language change’, paper presented at ICAME 28, Stratford upon Avon.
Widdowson, J.D.A. (2003), ‘Hidden depths: Exploiting archival resources of spoken English’, Lore and Language, 17, 1&2: 81-92.
Discourse linguistics meets corpus linguistics: theoretical and methodological issues in the troubled relationship
Tuija Virtanen
Åbo Akademi University, Finland
Abstract
Discourse linguistics and corpus linguistics have an uneasy relationship because of their inherent ontological and epistemological differences. Yet it is a steady relationship going back well into corpus-linguistic history, and one that both fields are highly motivated to maintain despite its many hazards and challenges. Singling out five complementary dimensions of discourse, understood here in a broad sense, this paper shows that not all of them will be equally accessible to users of corpus methods. Two fundamental aspects of discourse are identified as particularly challenging to corpus-linguistic enquiry, i.e. the distinction between product- and process-oriented approaches, and the status of the primary notion of context. The latter raises the issue of authenticity, suggesting a need to rethink what we mean by the notion. The important methodological distinction between a corpus-based and a corpus-driven approach to discourse serves to highlight key issues in the joint history of discourse linguistics and corpus linguistics. The paper is rounded off with a discussion of the benefits to be gained by a combination of discourse linguistic and corpus linguistic approaches and methods: each party can complement the other in constructive ways, to uncover new aspects of discourse that may suggest a reconsideration of our present understanding, and disclose our tacit assumptions about it.
1. Introduction
Discourse linguistics and corpus linguistics have an uneasy relationship because of their inherent ontological and epistemological differences. Yet, it is an established relationship, going back well into corpus-linguistic history, and one that both parties are highly motivated to keep up and develop, despite its many hazards and challenges. The aim of this paper is to contemplate some of the major stumbling blocks in this relationship. I set out to identify similarities and differences between the two approaches to the study of text and discourse, with reference to concrete research projects, in order to consider the ‘added value’ to be gained from combining methods. Keeping the theoretical and methodological discussion as general as possible, the label ‘discourse linguistics’ is here used as an umbrella term for discourse analysis, discourse studies, text linguistics, pragmatics, conversation analysis and other related approaches to the study of text and discourse. ‘Corpus linguistics’ here broadly refers to any linguistic framework which uses computer corpora as data and associated method of enquiry, irrespective of whether we are dealing with ‘linguistics’ of a particular kind (i.e. corpus ‘linguistics’, rather than
corpus ‘studies’). The focus is on the area of overlap between discourse linguistics and corpus linguistics (see note 1).
2. Major stumbling blocks to the relationship
The use of corpus data in analyses of text and discourse raises two issues: (i) the difference between a product and a process view of discourse; and (ii) the status of the textual, situational and socio-cultural context in the particular study. In discourse linguistics, the object of study is the process, rather than its outcome, the product. But it is this product that is stored in the form of a corpus. Furthermore, context is as important as the pieces of speech or writing under analysis, in investigations of discourse as process and as social action. But it is far from straightforward to figure out how linguists can best integrate this inherent aspect of discourse into studies of corpus data. While easy to identify, these two fundamental aspects of discourse, i.e. its process orientation and the interdependency within a particular context, still constitute the major stumbling blocks on the road towards ‘discourse and corpus linguistics’. Corpora are essentially static, consisting of records of spoken or written text that discourse linguists explore in the hope of being able to reconstruct the processes through which these products were shaped to serve particular communicative goals and to function as situated social action for interlocutors, readers and writers. Even though corpora increasingly code contextual information, the inherently dynamic character of context as instigating and affecting discourse, and being in turn created through discourse as social action, remains beyond the reach of corpus linguistics. An analysis of five complementary dimensions of discourse singled out in Section 5 reveals that not all of them will be equally accessible to users of corpus methods. And corpora can be of many different kinds, some more suited to investigations of discourse phenomena than others.
3. Rethinking authenticity
Discourse linguists and corpus linguists both rely on discourse data and each values authenticity, often understood in the sense of ‘real-life’ data, i.e. discourse that has been produced, used or co-constructed by people in a given communicative situation for particular purposes. Although widely used to justify the chosen method, the term ‘authenticity’ is far from straightforward, as testified by recent discussions across disciplines (see e.g. Gill 2008). Questions raised by Gill (2008), which are worth considering in any kind of study of discourse, include whether the data we are investigating are regarded as authentic because they seem, in one way or another, ‘original’, i.e. directly related to some kind of origin. But the dialogism of discourse makes such origins very difficult to define. Another question is whether we talk about ‘authenticity’ because we are, consciously or not, concerned with an object of discovery (in a corpus). What
about the values that we are, perhaps implicitly, attaching to the data at hand; are we, for instance, exploring something as ‘authentic’ in the sense of ‘desirable’ or ‘normative’, including or excluding what we will then interpret as less so? This question is all too familiar to students of EFL data, the status of the ideal native speaker being of central concern. Authenticity in linguistic enquiry may also refer to unedited, non-manipulated data, to discourse that is viewed as relatively spontaneous. It is indeed worthwhile to give these and other questions concerning the notion of authenticity due attention in studies of discourse, irrespective of whether we are using corpus data. One of the main problems is, however, that what is authentic in corpus studies need not be so in discourse studies, because of the status of context in the investigation. Linguists are repeatedly confronted with ethical issues connected to the procedures of collecting data and the extent to which they are at liberty to use such materials. This is especially acute in studies of impromptu speech. Choices have to be made between optimally ‘natural’ data and materials which bear a trace of metalinguistic awareness on the part of the interlocutors who are engaged in the particular discourse practices. Such decisions are bound to affect the degree of authenticity of our data. There is also the classic issue of ‘transcription as theory’ (Ochs 1979) in recontextualizing data for research purposes, highly relevant in both corpus linguistics and discourse linguistics. But students of writing are also confronted with problems of authenticity: corpora are the outcome of the processes of decontextualization and recontextualization of discourse. Our data are not the ‘original’ or ‘authentic’ pieces of writing that they represent, nor are we studying them in a communicative situation matching those of their writers or the expected readership. 
Even linguists vouching for unedited, non-manipulated discourse are still aware of the recontextualization processes that have taken place for the data to end up on their desks and screens. The dynamism of discourse is irretrievably lost in concordances, lists and samples of various kinds. Authenticity is also called into question when we make use of publicly available Internet data, unless we happen to occupy the dual role of discourse participant (‘user’ rather than ‘lurker’) and (external) observer of the discourse under construction. But the user role inevitably influences the discourse that we as linguists are hoping to investigate, which is a problem familiar to anthropological linguists and sociolinguists engaging in participant observation of discourse in particular situational and socio-cultural contexts. The status of collections of Internet data as corpora has recently been debated by corpus linguists wishing to benefit from the easy access to huge quantities of publicly available materials (for discussions of Internet data as corpora, see e.g. Baker 2006; Hoffmann 2007; contributions to Hundt et al. (eds.) 2006; Kehoe & Gee this volume; Yates 2001). The main problems include attempts to analyse computer-mediated conversation in lieu of offline discourse, rather than in its own right, and of course, the central issue of the lack of representativeness of the sample, which corpus linguists have to weigh up when considering any quantification of their data (see the discussion in Section 4). Discourse linguists
investigating Internet data will appreciate programs that register (i) the (lack of) simultaneity of interaction, and (ii) what appears on the screens of each discourse participant at any given stage of the interaction. It is also essential to have access to relevant information concerning other discourse activities, online and offline, in which users are engaged in parallel or between their individual attention spans (for discussions, see e.g. contributions to Herring et al. (eds.) forthcoming).
Questions of authenticity come to the fore in historical linguistics, where studies of language change frequently suffer from a lack of (appropriate) data. Historical linguists, irrespective of whether they work with corpora or individual texts, are used to assessing the relative authenticity, in one or several of the senses referred to at the outset of this section, of the body of data that has survived through time, its internal and external comparability, and hence, their premises for conclusions. Judgements of the relative authenticity of historical data are based on what is known of their origins, relevant situational and socio-cultural contexts, and the extent to which such written records are deemed appropriate for analysis of reflections of spoken discourse (see the discussions in Kytö 2000; Wårvik 1990, 2003). In the following section, the concern is with the methodological differences between discourse linguistics and corpus linguistics, which again raise the issue of the uneasy balance between representativeness and availability of data.
4. Methodological differences: two kinds of discourse
A good place to start exploring the similarities and differences between discourse linguistics and corpus linguistics is with the section on ‘methods and materials’ typically found in studies of concrete linguistic phenomena. The conspicuous differences between discourse linguistics and corpus linguistics concerning the ways in which the methods and materials of the particular study are presented remind the reader of the two main scholarly paradigms prototypically associated with the natural sciences and anthropology. Linguistics, the study of language, is a very broad field indeed, encompassing both ‘hard’ and ‘soft’ scientific approaches. In corpus linguistics, the key notion is ‘frequency’. Even though linguists of other orientations also set out to quantify their data, there will often be decisive differences between their goals and methods and those of corpus linguists which will have a bearing on the results (see, for instance, Mair’s discussion in this volume of corpus linguistics and sociolinguistics). In contrast, discourse linguistics has not traditionally had quantification as its primary method. As the terminology needed to refer to non-quantitative research methods which try to account for text in context and the reflexivity of the contextualization processes is, however, largely missing, such studies are often misleadingly called ‘qualitative’. Both discourse linguists and corpus linguists do, of course, strive for qualitative analyses of their data; the difference lies in the fact that discourse linguists tend to prefer situated analyses of the particular, while corpus linguists
do so through quantification. What are therefore of interest to corpus linguists are the most frequent items in the data – and occasionally also the least frequent ones, in studies of absence, rather than presence, of linguistic elements – while discourse linguists may be able to learn from any instances that are relevant to their study. Hence, the size and the kinds of data necessary for the two different methodologies can be expected to vary considerably. (For a book-length discussion of the use of corpora in discourse analysis, see Baker 2006.) Discussions of data sampling and search procedures help readers of corpus studies to interpret the particular findings accordingly. Discussions of methods and materials in discourse-linguistic studies may similarly form sections in their own right in published work. Not infrequently, however, this information is integrated in the scholarly discussion of the phenomena at hand, as it is usually far less clear-cut and straightforward than the ‘methods and materials’ of corpus linguistic studies. While human language cannot, of course, constitute an object of study on a par with those typically found in the ‘hard’ sciences, where the analyst can be clearly separated from the data, it is still this paradigm that is reflected in the discourse of corpus studies. The discourse of discourse studies is different, as might be expected in light of the focus on the dynamic nature of the data and the theoretical and methodological choices made in delimiting and approaching the object of study. Discourse linguists have to come to terms with a high degree of causal indeterminacy in their studies. 
As a result of their expertise in analysing text and discourse in depth and their continued attempts to get to grips with the dynamism of text-context reflexivity, discourse linguists are highly aware of a fact which is relevant to studies of all orientations: that linguists are indeed constructing discourse through discourse, even when they are writing up the study itself. They therefore attempt to make this aspect of study explicit. Discourse linguists also tend to be very much aware of the status of introspection in their work, present in some form and at some stage in all linguistic enquiry, and they therefore make every effort to signal clearly a separation of speculative elements from findings, in the construction of the argument. The two discourses, those of corpus linguistics and discourse linguistics, constitute a source of possible misunderstandings between the practitioners of the two strands of language study. One of the decisions to make, in view of the purpose of a particular study, concerns the relative balance between representativeness and availability of data, already touched upon in Section 3. Because of the choices concerning quantification, corpus linguists and discourse linguists are likely to provide very different answers if confronted with the question ‘representative of what?’ Both know that their data can never be representative of ‘language as a whole’ but, in view of the need for quantification, corpus linguists rightly put a great deal of effort into ensuring that their materials are representative of some aspect of a particular construct. The representativeness of even very large corpora will, however, always be more problematic in view of the goals of discourse linguistics. In the rare instances where discourse linguists are able to conclude that their data are representative of what they want to study, they may not need
the data at all; usually, however, they cannot be sure that their materials are representative enough to warrant a great deal of generalization (see e.g. the discussion in Mair 1990: 14). And they know that one single text is likely to provide them with more insight into the use and structuring of language than they can ever hope to expose through their analyses of the particular. Problems of availability for them tend to be related to restrictions based on ethical issues, especially prevalent in studies of spontaneous spoken interaction, computer-mediated conversation, and (chains of spoken and written) institutional discourse in many societies. The availability of data may also be reduced on other grounds, such as copyright restrictions and legal constraints of various kinds, the (semi)private nature of much business communication and organizational discourse, or simply because the necessary materials have not survived through time. These problems will, however, affect corpus linguists and discourse linguists alike.
5. Possible points of convergence
This section explores possible points of convergence between corpus linguistics and discourse linguistics in terms of: (i) five different dimensions of discourse (see Enkvist 1984; Virtanen 1997), and (ii) two methodologically different approaches to corpora (see Sinclair 2004).
Discourse linguistics has, over the years, undergone a remarkable expansion of focus. With the discursive turn in social sciences, the relative weight of the reflexive text-context pair of notions has shifted towards its second member. The context to be taken into account in studies of text and discourse has expanded enormously, from co-text (linguistic context) and a particular situational context, to society and culture at large, to the extent that the latter are now judged to be relevant to the study. Yet all dimensions of discourse are still with us and equally relevant, irrespective of their chronological order of appearance on the discourse-linguistic scene – simply because they serve to accomplish different analytical tasks. Situated analyses of discourse practices in text and talk rely on contextualization cues exhibited in the linguistic signals that are present or absent in a piece of discourse.
Starting from (i) a ‘structural’ dimension, present in much work on textuality, we can proceed to (ii) a ‘content-based’ dimension, typically opted for in rhetorically-oriented studies. The ‘cognitive’ dimension (iii) is omnipresent in studies of text and discourse, and it can thus be specifically foregrounded where expedient. The ‘interactional’ dimension (iv), originating in studies of spontaneous speech, cuts across much of the current discussion of discourse phenomena, highlighting the dynamism of discourse practices in both speech and writing. And the ‘socio-cultural’ dimension (v), too, demands consideration of the reflexivity of text and discourse. 
In (v), the focus is on the situational and socio-cultural contexts in which people jointly engage and re-engage in social action through discourse, and in performances through which discourse takes shape; the concern is with ways of
(co-)constructing such contexts and adapting to them, and of maintaining or altering them through discourse. It is obvious that these five dimensions of discourse are not all equally accessible to users of corpus-linguistic methods. In view of the discussion of the status of context in such investigations, corpus-linguistic approaches can be expected to focus predominantly on the structural aspects of discourse and the various content-based phenomena apparent in text and talk. In contrast, the interactional and socio-cultural dimensions of discourse lend themselves less well to corpus studies because what is examined here is the dynamism of discourse as social action. The study of discourse processes and other cognitive issues increasingly has recourse to corpus data but often to ends that are not of primary concern to the corpus linguist.
Sinclair’s (2004) distinction between ‘corpus-based’ and ‘corpus-driven’ approaches constitutes another relevant starting point for the discussion of corpus and discourse linguistics. The ‘corpus-driven’ approach is reminiscent of that of conversation analysts, while the ‘corpus-based’ approach is more in line with much work in text and discourse linguistics and pragmatics.
5.1 Fields of mutual interest
Corpus linguistic and discourse linguistic studies have benefited from one another in a number of fields of mutual interest. These include (i) variation across texts and discourses, (ii) textual and pragmatic collocation, and (iii) the intricacies of spoken interaction. The first of these, the discovery of distributional patterns, is the domain of corpus linguistics par excellence. Investigations of linguistic variation place high demands on corpus design. But variation is also of central importance in the study of text and discourse. Discourse linguists have benefited from corpus-linguistic methods to study variation across texts and discourses, including variation across time in historical linguistics. The usual text classifications include text/discourse types, genres, registers, styles and modes, while fictionality can also constitute a dividing line between text categories (for corpus studies of various kinds of variation across texts and discourses, see e.g. Biber 1988; Dorgeloh 2004; Granger (ed.) 1998; Semino and Short 2004; Stubbs 1996; Taavitsainen 1997). The notions employed in text and discourse categorization are not straightforward, however, and linguists of both orientations should continue to give full attention to decisions in this regard. Some divisions have long been standard in corpus design. Thus, it is only recently that speech and writing have started to appear in the same corpus, and multimodal corpora are likely to grow in importance, along with the current interest in Internet data. Both corpus-based and corpus-driven methods are used in discourse-oriented studies of linguistic variation. In historical corpus linguistics, the models tend to come from our understanding of present-day discourse phenomena, the combination of which has renewed the field of historical linguistics over the past thirty years. 
Corpus-driven approaches, again, invite linguists to explore historical data in their own right, which may facilitate the interpretation of the
findings. Variation is also an important issue in studies of ongoing language change, as can be witnessed, for instance, in data from online contexts (but for corpus-methodological concerns, see the discussion in Section 3). The pros and cons of the two approaches, corpus-driven and corpus-based, to variation across texts and discourses are crystallized in the following two quotations from the relevant literature. The first one serves as an argument for the adoption of corpus-driven methods; the second emphasizes the risk of misinterpretation in approaches that do not take into account fundamental distinctions between categories of text and discourse based on text-internal criteria.
“…despite theoretical frameworks that are general enough, descriptions are too dependent on the text and discourse type.” (Sinclair 2004: 67)
“So determinative of detail is the general design of a discourse type that the linguist who ignores discourse typology can only come to grief.” (Longacre 1996: 7)
If we are interested in the inherent hybridity of discourse and the processes of hybridization (Fairclough 1992), the point of departure must be some kind of categorization of discourse. If, in contrast, we start from large amounts of “uncontaminated text” (Sinclair 2004: 191), we cannot study hybridization per se, at least not until we have identified categories that emerge from the data. Longacre’s point about linguists running the risk of comparing apples and oranges if discourse typology is not taken into account has proven to be crucial in studies of text and discourse, irrespective of the kind of text or discourse categorization we are working with (for a discussion of variation across texts and discourses in the light of text type and genre, see Virtanen, forthcoming). 
Corpus-driven studies promise to uncover categorizations of text and discourse which differ from those in focus in corpus-based studies, though both methods are likely to point to some of the most basic distinctions such as the difference between narrative and non-narrative text. Other distinctions likely to emerge even when using corpus-driven methods include that between ‘evocative’ and ‘operational’ discourse (cf. Enkvist 1985), and between common, and at times adjacent, genres of everyday life (such as news and reviews, or gossip and jokes).
The second area in which corpus linguists and discourse linguists happily meet is in collocational patterns. Access to very large corpora and the Internet has resulted in something of a renaissance in the study of collocation. Texts and discourses exhibit collocation in the very concrete sense of words that like each other’s company. The default definition of collocation as the “co-occurrence of words with no more than four intervening words” (Sinclair 2004: 141; 1991) allows us to contemplate collocations in novel ways, starting from what is present in texts and discourses of various kinds and ignoring for a moment the constraints of grammar. Contextual issues come to the fore when we note that collocational
patterns vary according to discourse type, genre, register and style. But new categorizations are also likely to emerge through the study of collocation in large bodies of data. Firth’s early interest in matters of context invites us to study collocation in relation to the context-of-situation and the cultural context. Extending the scope of ‘collocation’ and ‘colligation’ (Firth 1968) from a sentence-grammatical study of word and tag sequences in a given corpus to entire texts allows us to study ‘textual’ and ‘pragmatic’ collocation. While many linguists select relatively narrow search spans to avoid overwhelming problems of insufficient precision in the procedure, the possibility of varying the search span is of great interest in the study of text and discourse as it helps us to explore collocational phenomena which operate over sentence boundaries. In addition to the study of relatively overt textual collocation, we may be alerted to implicit relations that are not readily noticed using traditional methods. Such pragmatic collocation is of major relevance to the study of text and discourse. This is a field of study where corpus-driven approaches promise new insights into an aspect of text and discourse that is “not subject to any conventions of linguistic realizations, and so is subject to enormous variation, making it difficult for a human or a computer to find it reliably” (Sinclair 2004: 144-145, on ‘semantic prosody’; cf. also the discussion of ‘semantic preference’ in Sinclair 2004: 142; for discussions of textual and pragmatic collocation, see Virtanen 2005; Östman 2005). For the analyses to be meaningful, however, linguists need access to very large bodies of data (cf. the discussion in Sinclair 1991). 
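The span-based definition of collocation quoted above, and the idea of varying the search span, can be made concrete with a short sketch. The Python function below is an illustration only, not a tool described in this chapter: the name `collocates`, the `span` parameter and the toy token list are all assumptions introduced here. It reads "no more than four intervening words" as a maximum distance of five token positions on either side of the node word, and it treats the text as one flat token sequence so that a wider span can reach across sentence boundaries.

```python
from collections import Counter


def collocates(tokens, node, span=4):
    """Count the words co-occurring with `node`.

    `span` is the maximum number of intervening words allowed between
    the node and a collocate (an assumption based on the quoted
    definition), so a collocate may sit up to span + 1 positions away.
    """
    reach = span + 1
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - reach)
        stop = min(len(tokens), i + reach + 1)
        for j in range(start, stop):
            if j != i:  # skip the node occurrence itself
                counts[tokens[j]] += 1
    return counts


# Hypothetical toy data: two short sentences as one flat token list,
# with the full stop kept as a token.
tokens = ("the bank extended credit to the firm . "
          "the firm repaid the credit early .").split()

narrow = collocates(tokens, "credit", span=1)  # stays close to the node
wide = collocates(tokens, "credit", span=8)    # crosses the sentence boundary

print(narrow.most_common(5))
print(wide.most_common(5))
```

Widening `span` trades precision for reach, which is the tension noted in the text. A real study would add proper tokenization, frequency thresholds and an association measure, none of which is attempted in this sketch.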
In this light, the opportunities are now very different from the times of early monitor corpora: huge quantities of text on the Internet can be subjected to investigations of regular co-occurrences of words, also in terms of the two extended senses of collocation, textual and pragmatic. This endeavour is facilitated by tools such as WebCorp (see Renouf et al. 2007). It is, however, crucial to verify the nature of the reliance of such interfaces on existing search engines, so that the results can be interpreted accordingly. Search engines may, for instance, retrieve particular kinds of web data while excluding others, such as discussion boards, blogs or chat rooms, which has important implications for the results of the study. Corpus-driven analyses of collocation have been suggested as a point of departure for cognitive text linguistics (de Beaugrande 2004: 24-26). The hypothesis is that a meaning which is conspicuous in a particular co-text reflects processes of multiple activations in networks with other meanings. Collocation is thus assumed to constitute the ‘missing link’ between language and discourse, explaining why people know what a word of a given language potentially means out of context, while still using and interpreting it in a specific sense in a particular discourse context. Equally interesting for discourse-linguistic purposes would be the prospect of extending the recent corpus-linguistic notion of lexical ‘repulsion’ between word pairs (Renouf & Banerjee 2007) to cover potential ‘textual and pragmatic repulsion’, while still trying to eliminate, in appropriate ways, the all too numerous search results that such an expansion would inevitably involve. In the identification of potential repulsion manifest in texts, added precision might come
through the consideration of the contextual notions of genre and register. Findings about linguistic repulsion are also likely to disclose important aspects of textual silence, not least if related to discourse-linguistic insights into discourse types and styles. Hence, pairs of connectors occurring across units of text of various sizes might be hypothesized to show ‘textual repulsion’ in relation to discourse type or genre. Applications would thus seem to include new ways of narrowing the scope of lexical searches on the web. Investigations of ‘pragmatic repulsion’, again, might take into account sets of lexical items that manifest highly implicit patterns of repulsion vis-à-vis particular function words (such as signals of negation or wh-items). Studies of pragmatic repulsion would necessitate very large bodies of data, and as with explorations of pragmatic collocation, they only seem possible using corpus linguistic methods. A third field of mutual interest, impromptu speech (as well as less unplanned face-to-face interaction), is an area where corpus-based studies have been successful. It is a paradox that this is also the area where corpus compilation is especially cumbersome, and problems of authenticity are foregrounded in the transcription process; not to mention ethical issues that accompany the process of recording spontaneous speech. However small-scale, such corpora still offer linguists a rich source of insight into the workings of planned and unplanned speech. Linguistic elements that have been identified as serving discursive or pragmatic functions of various kinds have been explored in corpus data. This strand of research has given particles and routine expressions a central status in linguistic enquiry, thus extending their study beyond the ground-breaking work by the early enthusiasts of discourse markers and pragmatic particles. 
The starting point has often been a set of predetermined lexical items, selected on the basis of earlier work in discourse linguistics. Important corpus-based studies in this area include those originating in the Lund circle directed by Jan Svartvik, who computerized and analysed the LLC (see e.g. Svartvik 1979; Aijmer 1996; Stenström 1994). Several of its members have subsequently extended this strand of corpus analysis to other corpora and compiled corpora of their own. Brinton (2008), Culpeper and Kytö (1999), and Wårvik (1990) investigate discourse markers and pragmatic particles in historical data, in written records of various kinds that are assumed to reflect some degree of spokenness or orality. Instead of starting from predetermined lexical items, which may have the disadvantage of severely delimiting potential findings, corpus-based studies of spoken interaction have at times chosen as a point of departure a particular discourse-organizing function, such as topic management or conversational openings and closings, or a communicative function, such as disagreeing or making requests (cf. Holmes and Stubbe 2003 on power and politeness manifest in a corpus of workplace discourse). Studies of interaction focusing on politeness and (inter)subjectivity are, however, predominantly grounded in situated discourse analysis because, as Hunston (2004:186) points out concerning evaluation, “reliable automatic identification and quantification can be carried out on only a limited set of realizations”. Situated socio-cultural performance of
politeness and affect through discourse seems beyond the reach of corpus linguistics. Corpora of spoken discourse have offered new insight into the study of overlapping speech, prosody and intonation. But searches over relatively large quantities of data, where possible and expedient, still involve a high risk of misinterpretation, while close-up, context-related analyses of individual occurrences are of less interest to the corpus linguist preferring to rely on large bodies of data. For instance, it is important to keep in mind that not all overlaps are necessarily recognized as interruptions by interlocutors in a given speech situation. The hazards of interrupting and being interrupted constitute a fundamental aspect of face-to-face interaction but their investigation necessitates situated in-depth analyses. Wichmann’s work on discourse intonation (e.g. 2004) shows how demanding a corpus-based study of spoken discourse is and how important it is to ground the findings in close, context-related observations of particular occurrences in the data. It can be expected that corpus studies of spoken interaction will continue to be conducted along with manual analyses of the particular.
Despite fundamental methodological differences, corpus linguistics and discourse linguistics manifest a good number of shared interests and concerns, thus potentially contributing to one another in important ways. Let us therefore turn to some of the most problematic areas in attempts to combine the two approaches.
5.2 Areas of unease
It is in the core areas of the study of text and discourse that corpus-based and corpus-driven analyses have little to offer, simply because it may not be possible to find what the discourse linguist wishes to explore or because the findings point to what we already know. Such areas have to do with (i) text structure or discourse organization, (ii) text-context reflexivity and (iii) situated analyses of ‘doing genre’. The most or least frequent instances are not the primary concern of the discourse linguist trying to determine how coherence works for interlocutors, as individuals and members of groups of various kinds; how words link to worlds and worlds to words simultaneously through discourse; or what kinds of action, or discourse practices people in various interlocutor roles set out to perform and adapt to through discourse in particular situational and socio-cultural contexts. And the processes of co-constructing discourse communities and various communities of practice, or those of (re-)engaging, face-to-face or online, in the ‘discursive struggle’ that is formative of our identities – all of these phenomena are of less value to the corpus linguist trying to get to grips with linguistic variation across established or emergent genres, or with distributional patterns of other aspects of the use of language in as large a sample as possible of representative computerized data. Text structure and discourse organization constitute a shared interest between corpus linguists and discourse linguists. But it is difficult to come up with quantitative findings which respect the inherent dynamism of discourse unless methods are combined so that an in-depth analysis of discourse is also
conducted, and typically a large part of the counting will have to be manual. Small but specialized corpora are easier to handle here but generalizability, essential in corpus linguistics, is then not possible and corpus-driven analyses are not applicable. Studies of corpus data have a lexical focus, which highlights explicitness in the signalling of discourse organization. Yet there are many other cues to discourse phenomena that need to be accounted for if we are to model ways in which people construct coherence, context and culture, through discourse that is at the same time affected by context and culture.
The main obstacle in the relationship between corpus linguistics and discourse linguistics is the issue of text-context reflexivity, which does not readily lend itself to static analyses of decontextualized data in the form of the linguistic output of situated discourse events which have been recontextualized as a corpus. This fundamental aspect of discourse as process and as social action is a familiar issue to the context-sensitive discourse linguist planning how best to approach the object of study.
Central to the study of discourse are people’s intertextual and interdiscursive repertoires, which are constructed, recycled and altered through discourse, in always new and unique communicative situations. The communicative and social contributions of discourse type and genre construction can be accounted for in terms of such repertoires as well as intertextual and interdiscursive chains appearing across texts and discourses. It is through discourse that genres emerge and evolve, as interlocutors keep mediating them in particular communication situations in which they co-construct and make use of them, for and through social action. And it is in discourse that a small number of types or modes are exhibited which facilitate discourse processing and serve the communicative goals of its interlocutors. 
Corpus linguists investigate explicit signals, or the lack thereof, of established or evolving conventions; but the issue of what people set out to do, with and through genres and discourse types in particular situational and socio-cultural contexts, perforce lies beyond the reach of corpus-linguistic analyses. Hence, the development and change of genre conventions is a popular corpus-linguistic topic, while the social action of ‘doing genre’ is more likely to be adopted for study in situated analyses of discourse data.
5.3 A Happy Ending
Corpus-based studies of discourse phenomena may help us to get to grips with cohesion, rather than coherence. Also, aspects of positionally-defined thematic structure will be easier to examine than the intricate interplay of given and new information. Vocabulary-based analyses can help single out rhetorical units pertaining to structure and content. Interactional signals, and to some extent relevant socio-cultural cues, are typically approached through predetermined sets of lexical items. What all of this suggests is a focus on textuality, rather than the dynamic, situated nature of discourse. Corpus-driven studies of collocation and other semantic relations in text, too, disclose co-textual, rather than contextual, information. Even though discourse linguists will be able to make informed guesses on the basis of the outcome of corpus-driven studies, this process is not,
strictly speaking, concomitant with the idea of ‘uncontaminated’ text, guiding such approaches. Practitioners of corpus-based and corpus-driven methods differ in their views of the status, scope and nature of context in the investigation. Similar differences also exist in discourse linguistics. A good deal of context is inferable from the text; yet, corpus-based and corpus-driven analyses might not give access to such information in the way a situated analysis of ongoing discursive struggle in a particular instance of interaction does. But interaction is not only a characteristic of spoken language; writing, too, can be overtly interactive. Corpus linguists can gain insight into interaction, for instance, by analysing corpora consisting of text-based computer-mediated discussions. Yet here too, approaches to interaction that are informed by dialogism are likely to benefit less from corpus study than the monologistic frameworks traditionally adopted in corpus linguistics. Discourse studies tend to require compilation of specialized corpora, which run the risk of being too small to be of interest to corpus linguists. But small-scale corpora may also occasionally provide discourse linguists with findings that are all too familiar to them for the corpus-linguistic methods to be of relevance in the enquiry. Further, small corpora are of no use in corpus-driven studies, which instead demand very large bodies of data to be able to show the existence of systematic lexical and grammatical patterns, which, it is hoped, might serve to ground analyses of (inter)textual relations and contextualization cues. Ultimately, the size and kinds of corpus data will have to be thoroughly (re-)assessed according to the discourse-linguistic goals. The relationship between corpus linguistics and discourse linguistics is thus destined to continue to be a troubled one. 
While not yet necessarily pointing towards a ‘happy ending’ of any kind, there has, however, recently been an increase in the number of corpus-linguistic investigations of discourse structure. Textbooks in corpus linguistics have hitherto included an odd page on discourse or pragmatics, introducing a few studies of explicit, not infrequently predetermined, lexical signals that have been shown to serve pragmatic functions. Similarly, edited volumes of corpus-linguistic enquiry have at times included a chapter or two on discourse organization, usually oriented towards lexical relations identified in or across texts. Recent volumes clearly attempt to remedy the scarcity of corpus-linguistic studies of discourse phenomena. In addition to a larger number of investigations based on lexical elements, we now also find more focus on prosody and discourse intonation (cf. the contributions to Ädel and Reppen (eds.) 2008, Flowerdew and Mahlberg (eds.) 2009, and Partington et al. (eds.) 2004; Baker 2006). There is often a decisive element of manual, in-depth analysis of text and talk in corpus-based studies of discourse phenomena, while appropriate parts of the study are carried out by computer (see Biber et al. 2004; Biber et al. 2007; Du Bois 2007; Reppen et al. (eds.) 2002; Thomas and Wilson 1996; Wichmann 2004). This avenue remains an option in terms of added value of results or in the potential for developing and testing software for the purposes of discourse-linguistic enquiry.
6. Concluding remarks
The main differences between corpus linguistics and discourse linguistics are ontological and epistemological. Corpus linguists and discourse linguists set out to describe and explain very different realities, sustain very different views of what constitutes evidence, and have different views of the kinds of claims that can be made. There is not much to be done about these differences; they are intrinsic. But linguists working within one or the other framework would do well to give thought to these basic differences given the goals of their studies and the concrete decisions that they are making during the research process. With reference to the five dimensions of text and discourse singled out in this paper, it is obvious that not all are equally accessible to practitioners of corpus linguistics. And what can be operationalized in view of a meaningful corpus study is not necessarily news to discourse linguists. Despite attractive solutions ranging from discourse-sensitive tagging to the compilation of focussed corpora, consisting of entire texts where possible, the main problem on the road from discourse to corpora and back again remains the lack of contextual dynamism. It is only through due attention to discourse as process and social action that investigations succeed in truly taking into account the bidirectional relation between actual texts and pieces of discourse, and their situational and socio-cultural contexts. Yet there is a benefit in attempting to combine the two approaches, and developments in software motivate linguists of various orientations increasingly to opt for new avenues in their chosen field of study. In principle, combining methods from corpus linguistics and discourse linguistics allows us to explore the workings of discourse in novel ways. 
In practice, this would seem to involve inclusion in one and the same study of two kinds of analyses: an in-depth context-sensitive analysis of text and discourse, and a corpus-based and/or corpus-driven investigation of some identifiable linguistic elements (or the lack thereof), suggested by the preceding discourse analysis as worthwhile candidates for quantification in a given body of data. Alternatively, a corpus-driven study can greatly benefit from subsequent enrichment by a close analysis of some of its results in a particular discourse context. Complementary or conflicting findings are both welcome: they offer new insights, disclose tacit assumptions and suggest reconsideration of our present knowledge of discourse. An understanding of the premises and goals of both fields will, however, be crucial for a harmonious and happy relationship between corpus linguistics and discourse linguistics.
Note
1 This paper is based on an extensive discussion in my chapter entitled ‘Corpora and discourse analysis’ in Corpus Linguistics: An International Handbook, edited by Anke Lüdeling and Merja Kytö, to be published by Mouton de Gruyter.
References Ädel, A. and R. Reppen (eds.) (2008), Corpora and discourse: the challenges of different settings. Amsterdam: Benjamins. Aijmer, K. (1996), Conversational routines in English: convention and creativity. London: Longman. Baker, P. (2006), Using corpora in discourse analysis. London: Continuum. De Beaugrande, R. (2004), ‘Language, discourse, and cognition: retrospects and prospects’, in: T. Virtanen (ed.), Approaches to cognition through text and discourse. Berlin: Mouton de Gruyter. 17–31. Biber, D. (1988), Variation across speech and writing. Cambridge: Cambridge University Press. Biber, D., E. Csomay, J.K. Jones and C. Keck (2004), ‘Vocabulary-based discourse units in university registers’, in: Partington et al. 23-40. Biber, D., U. Connor and T.A. Upton (2007), Discourse on the move: using corpus analysis to describe discourse structure. Amsterdam: Benjamins. Brinton, L.J. (2008), The comment clause in English: syntactic origins and pragmatic development. Cambridge: Cambridge University Press. Culpeper, J. and M. Kytö (1999), ‘Modifying pragmatic force: hedges in Early Modern English dialogues’, in: A.H. Jucker, G. Fritz and F. Lebsanft (eds.), Historical dialogue analysis. Amsterdam: Benjamins. 293-312. Dorgeloh, H. (2004), ‘Conjunction in sentence and discourse: sentence-initial And and discourse structure’, Journal of Pragmatics 36: 1761-1779. Du Bois, J.W. (2007), ‘The stance triangle’, in: R. Englebretson (ed.), Stancetaking in discourse: subjectivity, evaluation, interaction. Amsterdam: Benjamins. 139-182. Enkvist, N.E. (1984), ‘Contrastive linguistics and text linguistics’, in: J. Fisiak (ed.), Contrastive linguistics, prospects and problems. Berlin: Mouton de Gruyter, 45-67. Enkvist, N.E. (1985), ‘A parametric view of word order’, in: E. Sözer (ed.) Text connexity, text coherence: aspects, methods, results. Hamburg: Helmut Buske. 320-336. Fairclough, N. (1992), Discourse and social change. Cambridge: Polity Press. Firth, J.R. 
(1968), Selected papers 1952-1959. Ed. by F.R. Palmer. London: Longman. Flowerdew, J. and M. Mahlberg (eds.) (2009), Lexical cohesion and corpus linguistics. Amsterdam: Benjamins. Gill, M. (2008). ‘Authenticity’, in: J-O. Östman and J. Verschueren (eds.), Handbook of Pragmatics. Amsterdam: Benjamins. Available also in J-O. Östman and J. Verschueren (eds.) (2005-), Handbook of pragmatics online. Amsterdam: Benjamins, at http://www.benjamins.com/online/hop Granger, S. (ed.) (1998), Learner English on computer. London: Longman. Halmari, H. and T. Virtanen (eds.) (2005), Persuasion across genres: a linguistic approach. Amsterdam: Benjamins.
Herring, S.C., D. Stein and T. Virtanen (eds.) (forthcoming), Handbook of the pragmatics of computer-mediated communication. Berlin: Mouton de Gruyter. Hoffmann, S. (2007), ‘Processing Internet-derived text: creating a corpus of Usenet messages’, Literary and Linguistic Computing, 22 (2): 151-165. Holmes, J. and M. Stubbe (2003), Power and politeness in the workplace. London: Longman. Hundt, M., N. Nesselhauf and C. Biewer (eds.) (2007), Corpus linguistics and the web. Amsterdam: Rodopi. Hunston, S. (2004), ‘Counting the uncountable: problems of identifying evaluation in a text and in a corpus’, in: Partington et al. 157-188. Kytö, M. (2000), ‘Robert Keayne’s Notebooks: a verbatim record of spoken English in early Boston?’ in: S.C. Herring, P. Van Reenen and L. Schøsler (eds.), Textual parameters in older languages. Amsterdam: Benjamins, 273-308. Longacre, R.E. (1996), The grammar of discourse. 2nd ed. New York: Plenum Press. Mair, C. (1990), Infinitival complement clauses in English: a study of syntax in discourse. Cambridge: Cambridge University Press. Ochs, E. (1979), ‘Transcription as theory’, in: E. Ochs and B.B. Schieffelin (eds.), Developmental pragmatics. New York: Academic Press. 43-72. Östman, J-O. (2005), ‘Persuasion as implicit anchoring: the case of collocations’, in: H. Halmari and T. Virtanen (eds.), 183-212. Partington, A., J. Morley and L. Haarman (eds.) (2004), Corpora and discourse. Bern: Peter Lang. Renouf, A. and J. Banerjee (2007), ‘The search for repulsion: a new corpus analytical approach’, in: P. Pahta, I. Taavitsainen, T. Nevalainen and J. Tyrkkö (eds.), Studies in variation, contacts and change in English. VARIENG, University of Helsinki. Accessed 22 September 2008 at http://www.helsinki.fi/varieng/journal/volumes/02/renouf_banerjee/ Renouf, A., A. Kehoe and J. Banerjee (2007), ‘WebCorp: an integrated system for web text search’, in Hundt et al. (eds.), 47-68. Reppen, R., S.M. Fitzmaurice and D. Biber (eds.) 
(2002), Using corpora to explore linguistic variation. Amsterdam: Benjamins. Semino, E. and M. Short (2004), Corpus stylistics: speech, writing and thought presentation in a corpus of English writing. London: Routledge. Sinclair, J. (1991), Corpus, concordance, collocation. Oxford: Oxford University Press. Sinclair, J. (2004), Trust the text: language, corpus and discourse. London: Routledge. Stenström, A-B. (1994), An introduction to spoken interaction. London: Longman. Stubbs, M. (1996), Text and corpus analysis. Oxford: Blackwell.
Svartvik, J. (1979), ‘Well in conversation’, in: S. Greenbaum, G. Leech and J. Svartvik (eds.), Studies in English Linguistics for Randolph Quirk. London: Longman, 167-177. Taavitsainen, I. (1997), ‘Genre conventions: personal affect in fiction and nonfiction in Early Modern English’, in: M. Rissanen, M. Kytö and K. Heikkonen (eds.), English in transition: corpus-based studies in linguistic variation and genre styles. Berlin: Walter de Gruyter. 185-266. Thomas, J. and A. Wilson (1996), ‘Methodologies for studying a corpus of doctor-patient interaction’, in: J. Thomas and M. Short (eds.), Using corpora for language research: studies in the honour of Geoffrey Leech. London: Longman. 92-109. Virtanen, T. (1997), ‘Text structure’, in: J. Verschueren, J-O. Östman, J. Blommaert and C. Bulcaen (eds.), Handbook of pragmatics. Amsterdam: Benjamins. Available also in J-O. Östman and J. Verschueren (eds.) (2005-), Handbook of pragmatics online. Amsterdam: Benjamins, at http://www.benjamins.com/online/hop Virtanen, T. (2005), ‘Polls and surveys show: public opinion as a persuasive device in editorial discourse’, in: Halmari and Virtanen (eds.), 153-180. Virtanen, T. (in press), ‘Corpora and discourse analysis’, in: A. Lüdeling and M. Kytö (eds.), Corpus linguistics: an international handbook. Berlin: Mouton de Gruyter. Virtanen, T. (forthcoming), ‘Variation across texts and discourses: theoretical and methodological perspectives on text type and genre’, in: H. Dorgeloh and A. Wanner (eds.), Approaches to syntactic variation and genre. Berlin: Mouton de Gruyter. Wårvik, B. (1990), ‘On the history of grounding markers in English narrative: style or typology?’ in: H. Andersen and K. Koerner (eds.), Historical linguistics 1987: papers from the 8th international conference on historical linguistics. Amsterdam: Benjamins. 531-542. Wårvik, B. (2003), ‘When you read or hear this story read: issues of orality and literacy in Old English texts’, in: R. Hiltunen and J. 
Skaffari (eds.), Discourse perspectives on English: medieval to modern. Amsterdam: Benjamins. 13-55. Wichmann, A. (2004), ‘The intonation of please-requests: a corpus-based study’, Journal of Pragmatics 36: 1521-1549. Yates, S.J. (2001), ‘Researching Internet interaction: sociolinguistics and corpus analysis’, in: M. Wetherell, S. Taylor and S.J. Yates (eds.), Discourse as data: a guide for analysis. Milton Keynes: The Open University. 93-146.
'Tis well known to barbers and laundresses: Overt references to knowledge in English medical writing from the Middle Ages to the Present Day
Turo Hiltunen and Jukka Tyrkkö
Research Unit for Variation, Contacts, and Change in English (VARIENG), University of Helsinki
Abstract
The discursive representation of knowledge, the fundamental objective of scientific inquiry, reflects underlying epistemic conditions of scientific thought (Bates 1995). Knowledge is communicated in scientific writing by means of lexical choice, discourse conventions and the organization of information. Over the long history of vernacular medicine, the writers of each era – from scholasticism and empiricism to evidence-based medicine – have had their own perspectives on knowledge, revealed by the discursive practices they employed. Lexical items referring to the concept of knowledge (e.g. knowledge, information, doctrine) are investigated from the late Middle English period to Present-day English. We analyze variation and change in the lexicon of knowledge and the discursive contexts in which the terms appear, showing how these have changed over time in different subgenres within learned medicine. The study makes use of several medical corpora with a total word count of roughly one million words: the MEMT is used for the Middle English period, and a selection of texts from the EMEMT corpus (articles from the Philosophical Transactions and other contemporary medical texts) represents the Early Modern English period. For the PDE period, we use a selection of research articles from academic journals and texts from the Medicor.1
1. Introduction and background
From the very beginning of organized scholarship, knowledge has been the primary objective of learned activity. While the understanding of what constitutes knowledge and how one should go about gaining it has changed over the centuries, knowledge has remained the yardstick by which the learned judge one another. Medicine, the oldest field of learning with a continuous written history in the vernacular (Taavitsainen and Pahta 2004), has always had a characteristically dichotomous relationship to knowledge. On the one hand, medicine has always been studied theoretically; on the other, medical knowledge has always had a practical application in the healing of the sick. According to the Canon of Avicenna,2 the most important collection of medical texts in the Middle Ages, “Medicine is the science by which the dispositions of the human body are known so that whatever is necessary is removed or healed by it, in order that health should be preserved or, if absent, recovered.”
This study examines how overt references to knowledge have changed in medical writing from the beginning of vernacular medicine in the late fourteenth century to the present day. Underlying the research question is the claim by French (2003) that presenting oneself to the public and professional colleagues as a “rational and learned physician” was often the main enterprise of Late Medieval and Renaissance physicians – sometimes even at the expense of actually acquiring knowledge. On this basis, it is reasonable to presume that medical writers,3 as a discourse community (see, e.g. Swales 1990: 24-27) with a vested interest in regulating references to knowledge, would always make assertions about the act of knowing or the possession of knowledge deliberately and precisely. Using a series of diachronic medical corpora to examine proportional changes in different classes of nouns and verbs in the field of knowledge, we demonstrate that knowledge references are employed differently at different historical periods. The changing styles of scientific thought, which correspond more or less with these periods, have been identified and used in scholarship under a variety of names. This study follows a popular model which distinguishes four main periods: scholasticism, identified with axiomatic and authority-based knowledge; empiricism, characterized by observation-focused knowledge; rationalism, during which reasoning and ideational constructs came to the forefront; and finally constructivism, typified by the analytical testing of hypotheses (cf. Taavitsainen and Pahta 1998). Given the scope of the paper, this scheme is naturally a generalization, and individual fields of science, let alone fields of learning, may have descriptive models specific to their particular histories. Methodologically this study combines historical discourse analysis with corpus linguistics.
Our approach starts by defining a lexical field, follows with an investigation of its attestations in a series of historical corpora, and finally interprets these as evidence of changes in the discourse of science in different periods in history.
2. Method
The main aim of this study is to examine whether, over a long time line, the occurrences in a corpus of lexical items representing a given conceptual field can be understood to reflect underlying paradigm changes in scientific thinking. The starting point for this hypothesis is that the conceptual field in question has to be lexically attested at a reasonably high frequency and, further, that the field can be considered central to fundamental ways of thinking. In our estimation, the conceptual field of knowledge serves such a purpose in scientific writing. Because discursive features are not annotated in the corpora we use, reaching this goal requires that the phenomena under investigation be described in a way that facilitates meaningful corpus searches. Our solution is to
focus on overt references to knowledge, that is, passages explicitly evoking the concept by using a particular kind of lexical item. These passages can be retrieved from the corpora, once a list of all relevant lexical items has been established. This operationalisation comes with the caveat that the investigation is restricted to passages featuring knowledge words. Those that do not contain any of the search words are not considered, even if they point to “knowledge” by some discursive means. This in turn means that our analysis provides information about overt references to knowledge, that is, about the way in which medical writers evoke the concept of knowledge by using certain lexical items. In our view, such references are not necessarily directly linked to the amount of knowledge that the texts contain, but are rather matters of writing style and as such particularly revealing about the underlying thought style. To study references to a given conceptual field in a corpus is essentially to examine all the lexical items which can be taken to semantically belong to that field. Although this premise is in itself straightforward, it presents three challenges to be addressed before the examination of corpus evidence can begin. First, the conceptual field has to be defined clearly. This task is not easy, particularly in the case of abstract concepts which are especially prone to being approached from a variety of different theoretical perspectives, resulting in overlapping and, at times, contrasting interpretations of conceptual constructs. Once the field has been defined, its lexical composition needs to be determined. In a diachronic study, this involves paying attention to both lexical and semantic changes that occur over time. Finally, the instantiations of those lexical items in the corpora have to be retrieved, a process which involves careful examination of spelling variants, particularly for ME material, and the ruling out of homonyms. 
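The retrieval step described above can be illustrated with a minimal sketch. It assumes a hypothetical variant list for a single lexical item (cunning), restricted to spellings attested in the Book of Surgery passage quoted in this paper; the names VARIANTS, build_pattern and concordance are illustrative and do not describe the searches actually carried out for the study. The sketch only retrieves candidate passages: ruling out homonyms and judging the sense of each hit in context remain manual tasks.

```python
import re

# Hypothetical variant list: only spellings attested in the sample passage are
# included; a fuller list would be drawn from a dictionary and corpus word lists.
VARIANTS = {
    "cunning": ["cunning", "connynge", "konnynge"],
}


def build_pattern(variants):
    """Compile one case-insensitive whole-word pattern for a list of spellings."""
    # Longer spellings first, so that no variant is shadowed by a shorter one.
    ordered = sorted(variants, key=len, reverse=True)
    alternatives = "|".join(re.escape(v) for v in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def concordance(text, lemma, width=30):
    """Return (left context, hit, right context) for every variant of a lemma."""
    pattern = build_pattern(VARIANTS[lemma])
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append((left, match.group(0), right))
    return lines


# Sample sentence from the Late Middle English Book of Surgery.
SAMPLE = (
    "But in specyall ther ar v þat ys to say connynge to wyrke in postumes "
    "and konnynge to teche to wyrke in woundys and konnynge to wyrke in "
    "vlceres and festurys and old sorys and cankyrs and connynge to restore "
    "flesch agayne and awoyd place with medycyns."
)

hits = concordance(SAMPLE, "cunning")
for left, hit, right in hits:
    print(f"{left:>30} [{hit}] {right}")
```

Each printed line is a keyword-in-context view of one candidate passage, which an analyst would then accept or discard according to the sense of the item in that context.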
Although the objective of the study is to examine knowledge references in medical discourse, the lexical field cannot reasonably be limited only to items denoting the core sense of episteme or objective, stable knowledge (realized through lexical items such as know, understand, etc.). While we chose to discard references to knowledge claimed through pure belief, it was apparent that references to knowledge systems (doctrine, science, etc.) and practically oriented knowledge (cunning, craft, etc.) were not to be left out. Lexical items referring to units of itemized knowledge (data, information), a feature closely associated with modern scientific writing, were also included. On the other hand, lexical items which refer exclusively to the adjacent semantic fields of teaching and learning were excluded as we judged them to stray too far from the central issue of how medical authors have positioned themselves in relation to knowledge. Any of the lexical items included can of course be used instructively. Items belonging to the field of doxa (i.e. subjective knowledge through faith or belief) were left out altogether. The sense of each occurrence of pertinent lexical items was evaluated individually in context. Items were included in the analysis if the sense was judged knowledge-related. Finer-grained semantic differences, such as those given in the OED, were not identified for individual lexical items.
In several cases, the issue of polysemy became central. Because of the way the conceptual field was delimited, senses primarily related to cognition or simple practical ability were ruled out. With some lexical items the majority of occurrences had to be discarded as belonging to a different semantic field. A good example of this phenomenon is wit, which as a verb can in most cases be classified as a lexical item of knowledge. The corresponding noun, however, predominantly falls under the semantic field of cognition, as in example 1:
(1) SLuggy & slowe, in spetynge muiche,
    Cold & moyst, my natur ys suche;
    Dull of wit, & fatt, of contnaunc strange,
    fflewmatyke, þis complecion may not change.
    LME: Practical Verse4
To further clarify the semantic categorization, we consulted the respective sections of the Historical Thesaurus of English (hereafter HTE).5 The HTE categorization for knowledge appeared to largely coincide with ours, with the exception that some lexical items of practical knowledge were not to be found under relevant section headings. However, our reading of the primary material clearly confirmed that lexical items such as cunning were frequently used to mean practical skill arising from learned knowledge (see example 2). We, therefore, included such items in the study:
(2) But in specyall ther ar v þat ys to say connynge to wyrke in postumes and konnynge to teche to wyrke in woundys and konnynge to wyrke in vlceres and festurys and old sorys and cankyrs and connynge to restore flesch agayne and awoyd place with medycyns.
    LME: Book of Surgery
At the same time, many of the lexical items listed under the relevant sections in the HTE either were not words of knowledge in the way we use the term, or were not attested in our data, and were therefore excluded from further analysis. Following these criteria, references to the semantic field of knowledge, as defined above, are realized in the corpora using 17 nouns and 3 verbs. Spelling variants of each were discovered through consulting the Oxford English Dictionary and cross-referencing with the full word lists of all pre-PDE corpora, and all occurrences were retrieved (see section 4).
3. The Data
Our approach treats lexical items denoting knowledge and knowing as correlates of the scientific thought style, and we expect to find variation in their frequency
and distribution in medical texts on a par with changes in the thought style. The investigation of this hypothesis is based on a series of corpora that represent different periods in the history of medical writing in English. A major factor in choosing a suitable corpus for the analysis was availability: we wanted to make use of existing corpora to the extent it was possible. Some of the available corpora met these requirements: the MEMT corpus for the late Middle English period, the ARCHER corpus for the 19th century, and the Medicor for Present-day English. No finalized corpus of medical writing is presently available for the Early Modern period, but to examine the full time line of vernacular medical writing we filled in the gap between MEMT and ARCHER with a selection of 17th century texts from the forthcoming Early Modern English Medical Texts (EMEMT) corpus. Our study focuses on the learned end of medical writing. In the LME and EModE corpora this includes both texts written by university educated physicians and practitioners without institutional credentials (see e.g. Wear 1998 and Siraisi 1997), while 19C, PDE1 and PDE2 corpora represent university-based medicine exclusively. Within this category, journal articles and other scholarly writing were considered separately, as the available corpora enabled such a distinction for two periods. The corpora used in this study consist of learned medical texts from the Late Middle English period to the present day. The material comes from six different samples, which together cover four periods, as shown in Table 1. The aggregate size of the corpora is ca. 1.1 million words.

Table 1: Corpora used in this study

Corpus     LME        EModE1     EModE2     19C        PDE1       PDE2
Timeline   1375-1500  1650-1700  1665-1713  1820-1905  1983-1997  2001-2005
Texts      39         36         153        40         63         64
Words      221,646    245,839    195,226    83,970     197,010    252,685
The Late Middle English subcorpus (LME) is a sample from the Middle English Medical Texts corpus (Taavitsainen et al. 2005), containing all the texts in the categories Surgical texts and Specialized treatises. The Early Modern English subcorpus consists of two parts. The first part (EModE1) contains texts from two categories in the forthcoming Early Modern English Medical Texts corpus, General treatises and Surgical treatises. The second part (EModE2) contains articles on medical topics from the Philosophical Transactions of the Royal Society, also to be included in the EMEMT corpus. The nineteenth century subcorpus (19C) consists of all texts included in the category Medicine in the ARCHER corpus. All the texts in this sample come from the Edinburgh Medical Journal (see Biber et al. 1994).
The Present-day English data is again divided into two subcorpora. The first sample (PDE1) contains all texts in three categories of the Medicor corpus: Handbooks, Textbooks, and Editorial articles (Vihla 1998). The second sample (PDE2) contains 64 medical research articles from eight different medical journals representing the specialisms of surgery and orthopaedics. The subcorpora are not of equal size, and the 19th century in particular is represented by a smaller dataset than the other periods under investigation. This is because we did not have access to corpora representing medical writing of the period other than ARCHER, and time constraints did not permit us to collect supplemental material. We take this into account in our analysis by using normalized frequency counts per 1,000 words. All searches were carried out using WordSmith Tools 4.
4. Results
The uses of nouns and verbs of knowledge were analyzed separately.6 To provide a more accurate description of the use of relevant lexical items, one further level of categorization was introduced in each group. Data on nouns with different semantic characteristics were considered separately, and verbs are discussed in relation to their actors. Results of corpus searches in each category are provided, and the most interesting developments are discussed and illustrated with examples.
4.1 Nouns
Nouns in the lexical field of knowledge can be divided into several distinct groupings on the basis of their semantic properties. While some, like knowledge and understanding, refer to the underlying concept on a general level, others have more specific ranges of reference. For the purposes of detailed analysis, we distinguish four groups of nouns:7
General knowledge nouns: knowledge, understanding, wit, wisdom, reason
Nouns denoting knowledge as a learned ability: cunning, craft, skill, mastery
Nouns denoting knowledge as a system: art, mastery, science, practice, doctrine, model, theory
Nouns of itemized knowledge: data, information
Tables 2-5 show the frequencies of groups of nouns in different corpora. The first line shows the raw frequency, and the second the frequency normalized to 1,000 words of running text. Considering each of the noun groups separately, we can observe important changes in their frequencies over time. Taking the general nouns first, we can see in Table 2 that, from the Late Middle English period onwards, there is a gradual decrease in the frequency of these words continuing all the way to the Present-day English corpora.
Table 2: Frequency of general nouns
                 LME    EModE1  EModE2  19C    PDE1   PDE2
Raw frequency    308    165     80      18     51     52
Per 1,000 words  1.39   0.67    0.41    0.21   0.26   0.20
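The normalization behind the second row of each table is simple arithmetic, and a short sketch may make it concrete. The subcorpus sizes are not restated in this section, so the snippet only back-computes the token count implied by one table cell. The function names are illustrative assumptions, and the implied size is approximate because the published rates are rounded to two decimals.

```python
def per_thousand(raw_count, corpus_tokens):
    """Normalize a raw frequency to occurrences per 1,000 running words."""
    return raw_count * 1000 / corpus_tokens


def implied_tokens(raw_count, rate_per_thousand):
    """Back-compute the approximate corpus size implied by a raw count and its rate."""
    return raw_count * 1000 / rate_per_thousand


# LME cell of Table 2 (general nouns): 308 occurrences, 1.39 per 1,000 words.
# The result is only an estimate of the subcorpus size, not a reported figure.
lme_tokens = implied_tokens(308, 1.39)

# Re-normalizing the raw count against the implied size recovers the published rate.
lme_rate = round(per_thousand(308, lme_tokens), 2)
```

Under these assumptions the LME subcorpus comes out at a little over 221,000 running words, which is why a raw count of 308 corresponds to 1.39 per 1,000 words.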
In the late medieval period, general nouns denoting knowledge typically occur in passages where some piece of knowledge is explicitly indicated to be useful or necessary to the reader, as in example (3).
(3) Thow schalt also haue knowlech þat he þat is wunt to ete twyis on þe day, and aftyr chongyth þat dyete and takyth hym to o mele, it is very certeyn þat it schal turne hym to noyauns.
LME: Þe Priuyte Of Priuyteis
In the Early Modern English data, passages of this kind are no longer common. Instead, we find general nouns in first-person narrative accounts, where the writer of the text speaks of his own knowledge (example 4).
(4) There, Sir, are all the Observations I have been able to collect yet: if any thing else material shall hereafter come to my knowledg about these matters, I shall not fail to impart them, God permitting.
EModE2: Glanvill (1669) ‘Observations concerning the Bath-Springs’ The Philosophical Transactions, 4, 49, p. 982
In our PDE data, general nouns occur predominantly in passages indicating a gap in the present state of knowledge, which the research article intends to fill (5):
(5) To our knowledge, no studies of PMF effects on in vivo contusive spinal cord injury (SCI) models have been reported.
PDE2: Crowe et al. (2003) ‘Exposure to Pulsed Magnetic Fields Enhances Motor Recovery in Cats After Spinal Cord Injury’. Spine 28, 24, p. 2660-6.
Nouns in the second group, which denote knowledge as a learned ability, are few in the Late Middle English corpora, and in later periods they are all but absent, except for a few sporadic occurrences (Table 3).
Table 3: Frequency of skill nouns
                 LME    EModE1  EModE2  19C    PDE1   PDE2
Raw frequency    83     10      11      2      4      0
Per 1,000 words  0.37   0.04    0.06    0.02   0.02   0.00
This suggests that while practical knowledge was a relevant part of the lexicon of knowledge in the late medieval period (as in example 6), it no longer appears as such in our data from later periods.
(6) Þerfor þe significaciouns ar to be taken of þe beyng or essencion of þe sekenes which þof all þai be þe bigynnyng and grounde of al þe arte and crafte of medycyne and a parte þer of.
LME: De Ingenio Sanitatis
The picture is more varied for nouns in the third group, nouns denoting knowledge systems (Table 4). It seems that there is a small decrease starting in the Early Modern English period and continuing to the 19th century, but the frequency of these nouns in the Present-day English data is again almost the same as in the LME period.
Table 4: Frequency of system nouns
                 LME    EModE1  EModE2  19C    PDE1   PDE2
Raw frequency    146    141     69      37     90     162
Per 1,000 words  0.66   0.57    0.35    0.44   0.46   0.64
But even while the overall frequency of the noun group remains stable, there are changes in the relative importance of individual nouns within the group. This is particularly obvious when we compare the distributions of two individual nouns, doctrine and model. In the LME data, the noun doctrine is the most frequently attested noun in this group (55 instances, 0.25 per 1,000 words) (example 7). In later periods the frequency decreases steadily and there are no occurrences of the noun in PDE research articles.
(7) But neuerþelattere in þe þridde doctrine of þis same chapitre schal be told in partie of þe pannycles þat beþ vndir þe scolle, closinge þe brayn.
LME: Chirurgie De 1392
The noun model shows almost entirely the reverse pattern of development. Apart from two occurrences in EModE1, the noun is attested only in the PDE data, and it is by far the most common noun denoting knowledge systems in both corpora (59 instances, 0.30 per 1,000 words in PDE1; 141 instances, 0.56 per 1,000 in PDE2 (example 8)).
(8) The model of demineralized bone matrix (DBM)-induced bone formation recapitulates the cell biology of endochondral ossification seen during embryogenesis and fracture healing.
PDE2: Ciombor et al. (2002) ‘Low frequency EMF regulates chondrocyte differentiation and expression of matrix proteins’. Journal of Orthopaedic Research 20, 1, p. 40-50.
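The change in the relative importance of doctrine and model can be made concrete by expressing each noun as a share of its group total. The sketch below is a rough illustration only: it combines the instance counts quoted in the running text with the group totals from Table 4, and it assumes that every quoted instance is included in the corresponding group total. The function name is hypothetical.

```python
def share_of_group(noun_count, group_count):
    """Percentage of a noun group's occurrences accounted for by a single noun."""
    return round(100 * noun_count / group_count, 1)


# Noun counts come from the running text; group totals come from Table 4.
doctrine_lme = share_of_group(55, 146)   # doctrine among LME system nouns
model_pde1 = share_of_group(59, 90)      # model among PDE1 system nouns
model_pde2 = share_of_group(141, 162)    # model among PDE2 system nouns
```

On these figures doctrine accounts for 37.7% of LME system nouns, while model accounts for 65.6% in PDE1 and 87.0% in PDE2, so the stable group frequency conceals an almost complete change in which noun carries it.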
Finally, the first instances of nouns of itemized knowledge are found in the Early Modern English corpora, after which there is a dramatic increase in their frequency in the later periods (Table 5).
Table 5: Frequency of nouns of itemized knowledge
                 LME    EModE1  EModE2  19C    PDE1   PDE2
Raw frequency    0      1       16      12     183    497
Per 1,000 words  0.00   0.00    0.08    0.14   0.93   1.96
This increase coincides with important changes in the dominant research paradigm of medical science, and probably reflects the development towards modern clinical medicine, where the focus is increasingly on results and measurements (example 9), as well as on the implications that they may have for clinical practice and further research (example 10). As the table shows, these nouns are particularly common in Present-day research articles.
(9) The data from our series demonstrate the paramount importance of the extent of the neurological injury for the prediction of the functional outcome.
PDE2: Zelle et al. (2004) ‘Functional Outcome Following Scapulothoracic Dissociation’ Journal of Bone & Joint Surgery, 86, 1, p. 9-16.
(10) Identifying the immediate operative-related risks of instrumented interbody fusion can provide useful information for approach selection.
PDE2: Scaduto et al. (2003) ‘Perioperative Complications of Threaded Cylindrical Lumbar Interbody Fusion Devices: Anterior Versus Posterior Approach’ Journal of Spinal Disorders & Techniques, 16, 6, p. 502-507.
The results show that an overall change takes place in the discourse of knowledge over the centuries. Significantly, this phenomenon is not only a matter of overall frequency change, but can be attributed more specifically to developments within medical discourse, as shown by the comparison of data from the four groupings of nouns.
4.2 Verbs
Next, we move to the significantly more limited lexical field of knowledge verbs. From the ME period onward, the corpora attest only three verbs in this lexical field: know, understand, and wit. The last of these, wit, is only found in the ME and EModE periods, predominantly in formulaic constructions (“it is to wit”, etc.).8 The three verbs are treated as a single lexical field and not subdivided. To study the use of overt verbal references as a reflection of changes in the underlying thought style, we focused on two indicators: overall usage of knowledge verbs and semantic changes in the actor or agent of knowledge verbs. Over the timeline, the usage of knowledge verbs shows a steadily declining trend until the mid-18th century, after which the frequency appears to level off (see figure 1). The overall decline in the use of knowledge verbs roughly coincides with the timeline associated with the changing of scientific paradigms. Although the observation is partly explained by the overall increase of nominalization, particularly in scientific writing from the late seventeenth century onward (see Halliday 2004; Banks 2003),9 the specific nature of the lexical field of knowledge may have contributed to the steep decline.
Figure 1: Frequency of knowledge verbs across the corpora (1/1000 words) [chart not reproduced; x-axis: LME, EModE1, EModE2, 19C, PDE1, PDE2; y-axis: 0 to 2.5]
The scholastic tradition, which persisted in medical writing until the middle of the Early Modern period, is noted for the high level of didactic and author-centred discourse (see Wallis 1995, Taavitsainen and Pahta 1998). Our findings support this view, showing frequent use of deontic modal constructions involving know or understand, as well as the formulaic constructions “it is to wit” or “it is to know”.
(11) It is to wete þat in flebotomie 4 þyngis are principalli attendid: sc., custome, tyme, age, & vertue.
LME: Phlebotomy.
(12) When þu hast ete þi mete, be ware þu ete not eftsonis, vn-til þi mete bifore receiuid be perfitely digestid. And when þat is, þu shalt knowe by .ij. tokenis. One is when þine appetite cummith to þe ayene after þi mete which þu hast receyuid. Anoþir tokin: if þi spettel be sotel, and li3tly will destende in to þi mouth.
LME: Regimen sanitatis.
The declining use of knowledge verbs in the Early Modern period can be interpreted as a reflection of the gradual replacement of the gnostic tradition of knowledge with the epistemic (see Bates 1995), the first major shift of scientific paradigm. As the primary discursive purpose of references to knowledge changes from reinforcing established authorities to evaluating knowledge in light of observations and methodology, the need for verbs explicitly denoting the act of knowing can be expected to decline – a view our corpus evidence appears to support.
The second major shift in scientific thought styles, from Empiricism to Rationalism, comes through in the data. The discovery of new clinical methods, coinciding with the 19C part of the corpus, appears to have changed the way knowledge was discussed in medical writing. As the focus of medical writers shifted from natural philosophy to knowledge derived from increasingly accurate clinical data, the occasions for using knowledge verbs decreased notably.
It is worth keeping in mind that the development of the academic register of writing is a somewhat separate issue from the changing underlying scientific paradigms. While modern scientific practice owes most to Empiricism and subsequent styles of thought, at least some of the stylistic features associated with modern science writing appear to have been established at a slightly earlier date. The gradual stabilization of academic writing came about not only as a result of ideational developments, but also of social and technological developments. From the 17th century onward, the ever-strengthening role of learned societies and universities, the establishment of academic printing in the vernacular and the wider circulation of learned titles all resulted in the development of relatively uniform, genre-specific stylistic features that we today associate with academic writing.
Our findings, showing a clear decline in the use of knowledge verbs until the EModE2 period followed by a relatively steady level thereafter, support the view that at least some stylistic discourse features may have begun to stabilize by the seventeenth century (see Halliday 2004).
4.3 Knowers
In order to take a closer look at the overall patterns of knowing, we were interested in examining whom medical writers of different periods have seen fit to associate with the act of knowing; in other words, whose knowledge has been considered worth mentioning, whether in the positive or negative. To facilitate a systematic analysis, we categorized the semantic role of actors of knowledge verbs – i.e. knowers – into six groups according to the approximate level of knowledge they appeared to represent (see table 6). At the top of the system we placed references to God as the infallible knower, at the bottom references to the layman. In between, we ranked ancient authorities, the author himself, the community of professional medicos, and the reader, in descending order. No distinction was made according to the specific training or background of the medical practitioner; accordingly, class four includes university trained doctors, surgeons, barber-surgeons, and apothecaries.
Table 6: Classification of types of knower
Class  Label              Lexical attestations
1      Divine             Direct reference to God or Christ.
2      Authority          In general (e.g. “auctores”, “the ancients”) or by name, such as Galen, Hippocrates, Avicenna, etc.
3      Author             First person singular
4      Medical community  Direct reference to medical or scientific community or to a specific subsection, such as physicians, surgeons, etc. Can be indicated through the use of first person plural, passive voice, etc.
5      Reader             Second person singular or direct reference to reader, or more specifically, as in 'young physicians'
6      Laymen             By direct reference to a non-medical profession such as “laundresse” or “fishmonger”, or to a generic actor (“boy”, “any man”, etc.)
Under this model, knower classes do not imply a qualitative assessment about the factual correctness of the actor’s knowledge. For example, if the actor of a knowledge verb is an ancient authority it does not necessarily follow that the sentence presents that authority figure as someone who knows (see example 16). Using this system of classification, we examined all knowledge verbs in the corpora (figure 2 and table 7).
Figure 2: Knowledge verbs classified by type of actor [chart not reproduced; percentage scale from 0% to 100% for LME, EModE1, EModE2, 19C, PDE1 and PDE2; legend: All/lay, Addressee, Prof. Comm., Author(s), Authority, Divinity]
Table 7: Knowledge verbs classified by type of actor
Subjects  Divinity  Authority  Author(s)  Prof. Comm.  Addressee  All/lay
LME       3         7          13         52           351        19
EModE1    3         6          77         53           111        71
EModE2    0         12         47         74           13         17
19C       0         2          10         19           0          5
PDE1      0         0          1          95           1          4
PDE2      0         0          12         75           0          2
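The percentage view in Figure 2 can be recomputed from the raw counts in Table 7. The following is a minimal sketch of that calculation; the variable and function names are illustrative assumptions, and the counts are copied from the table.

```python
ACTORS = ["Divinity", "Authority", "Author(s)", "Prof. Comm.", "Addressee", "All/lay"]

# Raw counts of knowledge verbs per actor class, copied from Table 7.
TABLE_7 = {
    "LME":    [3, 7, 13, 52, 351, 19],
    "EModE1": [3, 6, 77, 53, 111, 71],
    "EModE2": [0, 12, 47, 74, 13, 17],
    "19C":    [0, 2, 10, 19, 0, 5],
    "PDE1":   [0, 0, 1, 95, 1, 4],
    "PDE2":   [0, 0, 12, 75, 0, 2],
}


def actor_shares(counts):
    """Convert raw counts for one corpus into percentages of that corpus total."""
    total = sum(counts)
    return [round(100 * count / total, 1) for count in counts]


# Map each corpus to a {actor class: percentage} dictionary.
shares = {
    corpus: dict(zip(ACTORS, actor_shares(counts)))
    for corpus, counts in TABLE_7.items()
}
```

For instance, the addressee accounts for 78.9% of LME knowledge verb subjects (351 of 445), whereas the professional community accounts for 94.1% in PDE1 (95 of 101).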
The vast majority of knowledge verbs in the Late Medieval subcorpus are found to occur with deontic modals, indicating a didactic preoccupation. In such instances the subject of the verb is usually the intended reader, whom the author, positioning himself as a teacher, instructs. Another common strategy is to list the things a member of a particular professional community (physicians, surgeons, apothecaries) is expected to know or be able to do. In these instances, we class the subject under ‘addressee’ if the context makes it clear the nominal reference is used didactically (as in example 13) and not as an assertion of shared understanding about a medical issue.
(13) A surgian muste knowe þat alle bodies þat ben medlid vndir þe sercle of þe moone, ben engendrid of foure symple bodies, her lijknes ech in oþere medlyng.
LME: Lanfranc, Chirurgia Magna 1
The second most common actor of knowledge verbs is the professional community, usually manifest syntactically through passive constructions. Here the tone of the discourse is less imperative, and the function of the reference is usually to indicate that a given piece of knowledge is held by all members of the community as a fact. A typical attestation of the type is found in descriptions of illnesses and their signs:
(14) If þe discrasie be hote, which is knowen bi redne3 & vesicacioun; make colde þe place no3t bi iusquiamy ne bi mandrake, as seiþ G, for þai colde tomych. bot with rosis, plantage & vnguento albo, which infrigideþ moderately driand.
LME: Chauliac, Wounds
One of the more interesting findings concerns the discursive strategy employed in ME references to ancient authorities. Against expectations, corpus evidence shows that the collocative relationship between the names of authority figures and knowledge verbs is relatively weak, and that instead the knowing of such authorities is expressed much more frequently through speech act verbs, particularly say – a practice Taavitsainen (2001: 45-46) ascribes to the virtually infallible status of such authorities’ knowledge, which needs no reinforcement with a knowledge verb (example 15).
(15) Avicenna seiþ þat membres beþ bodyes imaade of þe firste mellinge of humours; oþir, as it is iseide super Iohannitium, a membre is a stedfast and a sad partye of a beest icompouned of þinges þat ben liche oþir vnliche, and is i-ordeynede to somme special office.
LME: Trevisa, On the properties of things
An analogy can be drawn to biblical language, where the word of God is generally expressed through speech act verbs. The actual use of the divine subject (e.g. “God only knows” etc.) is extremely rare in medical writing, showing only three attestations during both the ME and EModE periods and none thereafter.
From the beginning of the Early Modern period, scholasticism began to steadily lose ground to the new and frequently iconoclastic paradigm of empiricism. Somewhat surprisingly, changes in the style of scientific thought appear to be reflected in medical writing by an increase, rather than a decrease, of references to ancient authorities as knowers. Significantly, however, the increase comes with a change in polarity, whereby passages referring to an ancient authority as a knower are increasingly used to point out their mistakes and lack of knowledge (cf. McMullin 1985: 17):
(16) And as for Campher, Galen knew it not. Avicen saith expressely of Campher, that although it bee odorata, yet it is frigida.
EModE1: Jorden, A Discovrse of Natvrall Bathes and Minerall Waters (1631), p. 27.
First person singular subjects appear significantly more frequently than in the ME period. Often the discursive function is to assert the personality of the author, and to use his personal authority to make a point.
(17) I know, and am well assured, that Physicians would frequently advise their Patients to stoving and bathing, had they them in their own houses.
EModE1: Cock, Miscaelanea Medica (1675), p. 37.
Another explanation for the increasing use of the first person singular subject is the empirical paradigm of the personal observation, which often took the form of narrative. In the Philosophical Transactions, for example, many accounts of firsthand medical observations are presented as first-person narratives.
(18) Antimony will recover a Pig of the Measles; by which it appears to be a great purifyer of the Blood. I knew a Horse, that was very lean and scabbid, and could not be fatted by any keeping, to whom Antimony was given for two Moneths together every morning, and that upon the same keeping he became exceeding fat.
EModE2: ‘A Letter lately written by an observing person to a Friend of the Publisher, concerning the vertue of Antimony’ (1668) The Philosophical Transactions 3, 39, p. 774
In the light of our data, Early Modern medical writing (EModE1 and EModE2) also reflected the empirical mindset by representing laymen as people who could be seen as possibly possessing knowledge valuable to the medical community. This practice continues in the 19C subcorpus, but appears only infrequently in PDE1 and PDE2.
(19) ‘Tis commonly known to Barbers and Laundresses, that the same Pump-Water will not so well and uniformly or without little Curdlings, dissolve Wash-balls and Soap, as Rain-Water, and some running Waters usually will.
EModE2: ‘An Account of the Honourable Robert Boyle’s way of examining Waters as to Freshness and Saltness’ (1693) The Philosophical Transactions 17, 196, p. 631
Another major shift can be seen in the 19C subcorpus. Overt references to knowing declined considerably and were increasingly expressed through passive constructions. The frequency of references to the medical community as knower increased, reflecting the increasingly organized and institutional nature of the medical profession.
(20) Strange as it may read, cases are known where the illness merely leads to indisposition, with headache, giddiness, and a bubo in the neck, groin, or armpit.
19C: Robertson, ‘Notes on an outbreak of plague’ (1905)
When it comes to verbal knowledge references, modern medical writing largely follows the trend set in the 19th century. In some respects, modern articles also appear similar in style to the early research articles of the late 17th century. When knowing is mentioned, it is often presented in terms of explaining things which are not yet known and realized through negative polarity (example 21). By doing so, modern medical authors contextualize their findings in terms of the broader field of learning, thereby adding credibility to their own findings by showing areas which are yet to be examined.
(21) Neuroglial cells seem also to be an important mediator for the normal metabolism of neurons, although little is known in this respect.
PDE1: Angevine, The nervous tissue (1986)
As with the analysis of nouns, the closer examination uncovered domain-specific discursive practices which help explain the more general frequency changes over the timeline. The decreasing use of knowledge verbs hides a significant transformation in discursive strategy, from the reader-oriented style of the Late Middle Ages to the community-oriented discourse of the Present Day.
5. Conclusions
This study provides compelling evidence for the changing patterns of overt references to knowledge over a long period of time. The overall trends are clear: the frequencies of both nouns and verbs of knowledge are the highest in the late Middle English data, and considerably lower in later periods, with the exception of nouns denoting itemized knowledge. In part, the decrease can be explained by the waning of the influence of the scholastic thought style on medical writing. In the late Middle English data, overt references to knowledge are mostly encountered in didactic passages which are aimed at the reader of the text. Such passages are characteristic of late Middle English medical writing, but they are no longer common in the Early Modern period and virtually disappear thereafter. The drop in the frequency of knowledge words may therefore be partly attributed to the fact that from Empiricism onward there are fewer contexts in which these words may be used. At the same time, new openness to novel ideas opened the door to seeing even the layman as someone with valuable knowledge.
However, there are other issues that come into play apart from the decline of scholasticism in explaining trends that are observed in frequency data. Partly as a result of the general cultural outlook of the Renaissance and partly in consequence of the growth of printing which identified individual authors more closely than before with their works, the position of the contemporary author as an original and authoritative knower strengthened markedly. Here, in particular, the commercial aspects of medicine highlighted by French (2003) tie in with the business of publishing (cf. Furdell 2002), for the sharp increase in references to the author’s personal knowledge can be seen not only in light of a change in scientific paradigm, but also as a deliberate attempt to assert personal authority for financial reasons. Over the following two centuries, the authority of the individual gradually shifted over to the professional community. The developing register of modern scientific writing began to favour an increasingly nominal style largely devoid of expressions of personal opinion (see Banks 2003). In Present Day medical writing, knowledge is overwhelmingly discussed from this perspective. Additionally, the use of nouns of itemized knowledge increases sharply, a development that can be attributed to the nature of modern clinical medicine and particularly the associated advances in measuring technology. The number of overt references to knowledge is not directly related to how much information a text contains. Rather, we see them primarily as an aspect of writing style, which is contingent on the context in which texts are produced. Therefore the fact that we have observed a decline in the use of verbs and most noun categories does not directly tell us anything about the information content, but it gives us some insight into how that information is expressed. 
In fact, these results could be interpreted as evidence for the increasing certainty about the propositions that are made in the texts. As pointed out already by Lyons (1977: 809), categorical assertions are epistemologically the strongest kind of statements, and Biber’s (2004: 126) study suggests that reliance on such statements has indeed increased in medical prose in the last two centuries. Therefore, it makes sense that modern research articles (whose information content is unquestionably high) only refer to knowing when something is common knowledge in the field, or when something is not known, but not in making a claim for new knowledge. The results of this exploratory study suggest that the approach to discourse analysis we have adopted, based on the analysis of a clearly delineated conceptual field and the investigation of the associated lexical items in corpora, is a viable model with potential future applications in the diachronic study of the expression of ideas. The findings are made particularly interesting by the fact that while they agree with the major results of earlier research, the corpus-driven nature of the method sheds light on unexplored discourse features. Moreover, our study is able to suggest new hypotheses which could account for changes taking place between individual periods, as well as in the relative importance of individual words.
Notes
1 This study was conducted with funding by the Research Unit for Variation, Contacts, and Change in English at the University of Helsinki, funded by the Academy of Finland.
2 Avicenna, Liber Canonis, Book 1, chap 1, F. 1r (Venice, 1507; facsimile, Hildesheim, 1964).
3 As reflected in the composition of the corpus (section 3), we focus on the learned end of medical writing. Although the spectrum of the medical profession was wide and varied until the Enlightenment, the more learned writers can be reasonably approximated as a discourse community.
4 Text labels of the LME corpus refer to the short titles used in the MEMT corpus (Taavitsainen et al. 2005).
5 The Historical Thesaurus of English is available online at http://libra.englang.arts.gla.ac.uk/historicalthesaurus/. We are grateful to Prof. Christian Kay and Dr. Irene Wotherspoon for giving us advance access to the section ‘Knowledge’ of the Historical Thesaurus of English.
6 Adjectives and adverbs denoting knowledge were not included in this study.
7 Mastery is to be found classified as both nouns of learned ability and system nouns. The individual occurrences were evaluated on a case-by-case basis. Lexical items denoting medical signs (e.g. sign, symptom, accident) were not considered units of itemized knowledge in this study. On the use of sign terminology in ME and EModE medical writing, see Tyrkkö (2006).
8 See Taavitsainen and Pahta (1997). Notably, by the ME period English no longer lexically marked the semantic difference between “knowing of” and “knowing about”, attested in Germanic languages (e.g. German kennen and wissen) and in Romance languages (e.g. French connaître and savoir).
9 For a study of nominalization specifically in Early Modern medical writing, see also Tyrkkö and Hiltunen (forthcoming).
References
Banks, D. (2003), ‘The evolution of grammatical metaphor in scientific writing’, in: L. Ravelli, A-M. Simon-Vandenbergen and M. Taverniers (eds.) Grammatical metaphor: views from systemic functional linguistics. Amsterdam: Benjamins. 127–148.
Bates, D. (1995), ‘Scholarly ways of knowing: An introduction’, in: D. Bates (ed.) Knowledge and the Scholarly Medical Traditions. Cambridge: Cambridge University Press. 1–22.
Biber, D., E. Finegan and D. Atkinson (1994), ‘ARCHER and its challenges: Compiling and exploring a representative corpus of historical English registers’, in: U. Fries, G. Tottie and P. Schneider (eds.) Creating and using English language corpora. Amsterdam: Rodopi. 1–14.
Biber, D. (2004), ‘Historical patterns for the grammatical marking of stance’, Journal of Historical Pragmatics, 5, 1: 107–136.
EMEMT = Early Modern English Medical Texts. In preparation.
French, R. (2003), Medicine before Science. The Business of Medicine from the Middle Ages to the Enlightenment. Cambridge: Cambridge University Press.
Furdell, E.L. (2002), Publishing and Medicine in Early Modern England. Rochester: University of Rochester Press.
Halliday, M.A.K. (2004) [1988], ‘The Language of Physical Science’, in: J.J. Webster (ed.) The Language of Science. London: Continuum. 140–158.
HTE = Historical Thesaurus of English (forthcoming). Available online at http://libra.englang.arts.gla.ac.uk/historicalthesaurus.
Lyons, J. (1977), Semantics. Volume 2. Cambridge: Cambridge University Press.
McMullin, E. (1985), ‘Openness and Secrecy in Science: Some Notes on Early History’, Science, Technology, & Human Values, 10, 2: 14–22.
MEMT = Middle English Medical Texts. 2005. Compiled by I. Taavitsainen, P. Pahta and M. Mäkinen. CD-ROM. Amsterdam: Benjamins.
Oxford English Dictionary. 2004-. Online. J. Simpson (ed.). Available at http://www.oed.com/
Siraisi, N. (1997), Medieval & Early Renaissance Medicine. An Introduction to Knowledge and Practice. Chicago and London: University of Chicago Press.
Swales, J. (1990), Genre Analysis. English in academic and research settings. Cambridge: Cambridge University Press.
Taavitsainen, I. (2001), ‘Language History and the Scientific Register’, in: H-J. Diller and M. Görlach (eds.) Towards a History of English as a History of Genres. Heidelberg: C. Winter. 185–202.
Taavitsainen, I. and P. Pahta (1997), ‘The Corpus of Early English Medical Writing: Linguistic Variation and Prescriptive Collocations in Scholastic Style’, in: T. Nevalainen and L. Kahlas-Tarkka (eds.) To Explain the Present: Studies in the Changing English Language in Honour of Matti Rissanen. Helsinki: Société Néophilologique. 209–225.
Taavitsainen, I. and P. Pahta (1998), ‘Vernacularization of Medical Writing in English: A Corpus-Based Study of Scholasticism’, Early Science and Medicine 3. 157–185.
Taavitsainen, I. and P. Pahta (2004), ‘Vernacularization in Scientific and Medical Writing’, in: I. Taavitsainen and P. Pahta (eds.) Medical and Scientific Writing in Late Medieval English. Cambridge: Cambridge University Press. 1–22.
Tyrkkö, J. (2006), ‘From tokens to symptoms: 300 years of developing discourse on medical diagnosis in English medical writing’, in: M. Dossena and I. Taavitsainen (eds.) Diachronic Perspectives on Domain-Specific English. Bern: Peter Lang. 229–255.
Tyrkkö, J. and T. Hiltunen (forthcoming), ‘Frequency of nominalization in Early Modern English medical writing’, in: A. Jucker, M. Hundt and D. Schreier (eds.) Corpora: Pragmatics and Discourse. Papers from the 29th International Conference on English Language Research on Computerized Corpora. Amsterdam: Rodopi. 293–316.
Vihla, M. (1998), ‘Medicor: A corpus of contemporary American medical texts’, ICAME Journal, 22: 73–80.
Wallis, F. (1995), ‘The experience of the book: manuscripts, texts, and the role of epistemology in early medieval medicine’, in: D. Bates (ed.) Knowledge and the Scholarly Medical Traditions. Cambridge: Cambridge University Press. 101–126.
Wear, A. (1998), Health and Healing in Early Modern England. Aldershot: Ashgate.
Comparing type counts: The case of women, men and -ity in early English letters
Tanja Säily (a) and Jukka Suomela (b)
(a) Research Unit for Variation, Contacts and Change in English (VARIENG), Department of English, University of Helsinki
(b) Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki
Abstract
This work is a case study of applying nonparametric statistical methods to corpus data. We show how to use ideas from permutation testing to answer linguistic questions related to morphological productivity and type richness. In particular, we study the use of the suffixes -ity and -ness in the 17th-century part of the Corpus of Early English Correspondence within the framework of historical sociolinguistics. Our hypothesis is that the productivity of -ity, as measured by type counts, is significantly low in letters written by women. To test such hypotheses, and to facilitate exploratory data analysis, we take the approach of computing accumulation curves for types and hapax legomena. We have developed an open source computer program which uses Monte Carlo sampling to compute the upper and lower bounds of these curves for one or more levels of statistical significance. By comparing the type accumulation from women’s letters with the bounds, we are able to confirm our hypothesis.
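The permutation idea summarized in the abstract can be illustrated with a small sketch. This is not the authors' program: the toy data, the function names and the simplified stopping rule (whole letters are added until a token limit is reached) are all assumptions made purely for illustration, and only a single point on the accumulation curve is tested.

```python
import random


def accumulated_types(texts, token_limit):
    """Count distinct types in whole texts, read in order, until token_limit is reached."""
    seen, tokens = set(), 0
    for n_tokens, types in texts:
        if tokens >= token_limit:
            break
        seen |= types
        tokens += n_tokens
    return len(seen)


def permutation_p_value(texts, observed_types, token_limit, n_iter=2000, seed=1):
    """Estimate how often a random ordering of texts yields at most observed_types."""
    rng = random.Random(seed)
    pool = list(texts)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pool)  # one random permutation of the texts
        if accumulated_types(pool, token_limit) <= observed_types:
            hits += 1
    return hits / n_iter


# Hypothetical toy corpus: (running words, set of -ity types) for each letter.
letters = [
    (100, {"ability", "charity"}),
    (80, {"necessity"}),
    (120, {"charity", "quality", "vanity"}),
    (60, set()),
    (90, {"civility"}),
]

# Suppose a subcorpus of about 180 running words contains a single -ity type.
p_low = permutation_p_value(letters, observed_types=1, token_limit=180)
```

A small value of p_low would suggest that so few types are unlikely to arise from a random selection of letters of comparable size. Shuffling whole letters rather than individual words keeps each writer's vocabulary together, which is the reason for treating texts as the unit of permutation in this sketch.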
1. Introduction
The linguistic case we study is as follows. We have two roughly synonymous suffixes, -ness and -ity, which are typically used for forming abstract nouns from adjectives, as in example (1).
(1) a. generous [ ] + -ness → generousness [ ]
    b. generous [ ] + -ity → generosity [ ]
The first suffix, -ness, is etymologically native, while -ity entered the language as a result of contact with French during the Middle English period, and was later reinforced by loans from Latin (Marchand 1969: 312–313). The foreignness of -ity can be readily discerned from the above example: it changes the form of its base from [ ] to [ ], whereas with -ness there is no change (but see Section 2.1). In addition, the meaning of words in -ity is often not entirely compositional, i.e., not deducible from the meanings of the base and the suffix. Thus, -ity is both (morpho)phonologically and semantically more opaque than -ness (cf. Riddle 1985: 443–444; Aronoff and Anshen 1998: 246).
What we are interested in doing with the suffixes is to compare their morphological productivity, a concept famously defined by Bolinger (1948: 18) as “the statistically determinable readiness with which an element enters into new combinations”. More specifically, we wish to examine whether the productivity of each suffix varies between different sociolinguistic groups, as defined by Labovian sociolinguistic categories such as age, gender and social status. Many linguistic features show sociolinguistic variation, but to date this has been studied little in the case of morphological productivity, and not at all with the otherwise closely scrutinised pair of -ness and -ity.
Our data come from the 17th-century part of the Corpus of Early English Correspondence (1998; henceforth known as the CEEC). We have chosen personal letters as our material because they are one of the closest genres to speech, which is the primary medium of language and the most fertile ground for linguistic change (Nevalainen and Raumolin-Brunberg 2003: 28). This time period is interesting because it is to be expected that -ity would by this time have spread to wider use from the more literary genres in which it entered the language. Furthermore, a pilot study by Säily (2005) using the smaller Corpus of Early English Correspondence Sampler (1998) showed a gender difference in the use of -ity in letters of the 17th century.
We believe that -ity, as a learned and etymologically foreign suffix, is less productive with poorly educated social groups, such as women and the lower ranks, than with well-educated groups, such as men and the higher ranks. As to the productivity of -ness, we do not expect to find significant differences between social groups.
1.1 Objectives
The main measure of morphological productivity used in this study is that of type counts, i.e., how many different words in -ity and -ness are used by the different social groups. We seek to study the productivity of the suffixes -ity and -ness in our material by two complementary means:
1. Statistical hypothesis testing. We aim to formulate and test a hypothesis which captures our belief that gender is significant in the case of -ity.
2. Exploratory data analysis. Regardless of whether gender proves to be significant or not, we are interested in studying the correlation between productivity and a number of other variables, such as the age, domicile or social rank of the writers.
We present a unified approach which enables us to tackle both of these tasks.
1.2 Contributions
This work is a case study of applying nonparametric statistical methods to corpus data. We show how to use ideas from permutation testing to answer linguistic
questions related to productivity and type richness. The basic techniques are standard but not widely used in the study of these questions – our hands-on report aims at promoting the use of these powerful tools. With this goal in mind, we have chosen to describe in detail one particular application of these techniques. The emphasis is on depth, not breadth: instead of side-tracking and discussing a number of alternative techniques at each point, we make particular choices and go through all the subtleties that need to be taken into account. We assume a basic knowledge of statistical hypothesis testing, but we have included an informal introduction to permutation tests. We take the approach of computing accumulation curves for types and hapax legomena (i.e., types that occur only once). In particular, we use Monte Carlo sampling to compute the upper and lower bounds of these curves for some predetermined levels of statistical significance. Once we have computed an accumulation curve, we can test a hypothesis by simply plotting a data point on the curve. Exploratory data analysis is equally straightforward, and we can also qualitatively study the shape of the accumulation curves. One of the main technical contributions is described in Section 5: we have developed a computer program which can be used to compute the curves. This is the only part of the method described here which is computationally intensive. In the implementation, the emphasis is on computational efficiency. The program is freely available under an open source licence. The results achieved by using these methods on our data are reported in Section 6. As we shall see, we can conclude that our hypothesis is true: the type richness of -ity is indeed significantly low in the subcorpus which consists of women’s texts. 
Exploratory data analysis reveals an unanticipated feature of the data: the type richness of -ity is also significantly low in the subcorpus which consists of the letters written in 1600–1639.
2. Background and related work
In this section, we justify the use of type counts for measuring morphological productivity, place the study in the framework of historical sociolinguistics, and review related work on using similar methods.
2.1 Type counts as a measure of morphological productivity
According to Dalton-Puffer (1996: 217), there is an obvious correlation between productivity and type counts: “a productive morphological rule produces many different words (types), and it is therefore likely that in a given corpus a productive suffix will occur more often than an unproductive one”. Type counts are by no means a perfect measure of productivity, however. As Cowie and Dalton-Puffer (2002: 416) point out, the existence of a large number of types may be due to aggregation through productivity in the past rather than current productivity. Furthermore, in the case of -ity, some words have been borrowed from French or Latin as a package including the suffix, with no productivity involved at all in
English. This applies to the word generosity in our example (1): according to the Oxford English Dictionary (henceforth the OED), generosity has been in the language since about 1432 and is an adaptation of the Latin word generōsitāt-em. Nevertheless, type counts are frequently used as a measure of productivity, for example by Baayen and Lieber, who call it the extent of use (1991: 818). This measure may not give us a full picture of the productivity of a suffix, but it can certainly be useful despite the above caveats about past productivity and borrowing. In addition, the impact of these caveats could be reduced by restricting the kinds of words that are counted. One possible restriction would be that the suffixed word must have had an extant base at the time when the material was written; another could be that the word must not have been in the language for, say, more than a century, as evidenced by its first attestation date in a major dictionary such as the OED (Cowie and Dalton-Puffer 2002: 419). These restrictions would increase the probability that the word in question was formed productively from suffix and base rather than retrieved as a whole from the mental lexicon of the writer. For this study, however, we have elected to omit the above restrictions and count all words that etymologically contain the suffix in question – as noted by Plag (1999: 29), dropping out “non-productive formations” could mean prejudging the issue of whether the suffix is productive. The latter of the above restrictions at least would certainly be too limiting: To an individual user of the language, a word can be new even if it has been around in the language community for hundreds of years (cf. Baayen and Renouf 1996: 77), and thus even established words can be formed productively by users from the base and the affix.
In fact, even if an affixed word exists in the mental lexicon of the user, he or she may still end up forming it from its constituents, depending on how frequent the affixed word is compared with its base – Hay (2001) claims this is true for processing (e.g., when reading), but we think it holds for producing words as well. As for words with no extant base, they too may contribute to keeping the suffix productive, as they contain its form and meaning, and there is often an adjective related to the missing base that could be seen as the base by the user; see (2).¹ Various restrictions on type counts are explored in Säily (2008: 87–95).
(2) ambiguity ~ ambiguous + -ity
2.2 Historical sociolinguistics and morphology
The application of sociolinguistics to historical material is a fairly new approach: according to Nevalainen and Raumolin-Brunberg (2003: 2), the first systematic attempt at this was made by Suzanne Romaine in 1982. Nevalainen and Raumolin-Brunberg themselves are pioneers in this field, which is now called historical sociolinguistics. While morphology has been studied within this framework, research has so far concentrated on inflectional morphology such as the use of third-person -s vs. -th (Nevalainen and Raumolin-Brunberg 2003). Březina
(2005) is a rare example of a study on the productivity of derivational prefixation from the perspective of historical sociolinguistics. To our knowledge, there have been no studies on suffixation from a similar perspective.
2.3 Methodology
Previous work on comparing the productivity of an affix between subcorpora often relies on the subcorpora being approximately the same size, so that for instance type counts obtained from each subcorpus can be compared directly. Then, if the type counts differ by an order of magnitude, it may be possible to draw conclusions without paying attention to statistical significance (e.g., Dalton-Puffer 1996: 106).
Empirically validated assumptions on modelling productivity have been made by, e.g., Baayen (1992, 1993). For example, the growth rate of the type accumulation curve has been approximated as the ratio between the number of hapax legomena and the total number of tokens with the affix (Baayen 1992: 115). Baayen (2001) studies both parametric and nonparametric models for the class of LNRE (large number of rare events) distributions, such as lexical frequency distributions. These models are based on the assumption that individual words appear randomly in texts; such modelling assumptions make it possible to extrapolate beyond observed sample size. For a recent study on the statistical models for the accumulation of types and hapax legomena, see Evert and Baroni (2005), and for related statistical software, see Evert and Baroni (2007).
Nonparametric methods similar to ours – in particular, Monte Carlo sampling of permutations – have been used in corpus linguistics to some extent. For example, Baayen (2001: 6–7, 24–32) computes Monte Carlo confidence intervals for the accumulation curves of some lexical characteristics. Permutations are generated at the level of individual words, which is consistent with the assumption that individual words appear randomly in texts. However, in many cases the observed values lie outside the confidence intervals (Baayen 2001: 6–7, 24–32; Tweedie and Baayen 1998: 335), indicating that the assumption of randomness causes bias in the results. Tweedie and Baayen (1998) address the bias by permuting words within a randomisation window.
Our approach is to leave the original discourse structure intact and permute only large parts of the corpus. Analogous research questions arise and similar methods can be used in studies of biodiversity in the field of ecology, to enable comparisons of species richness in different areas (see, e.g., Gotelli and Colwell 2001). Our text length corresponds to their number of individual animals; our number of types to their number of observed animal species; our two subcorpora of men and women to their different areas; and our type accumulation curves to their species accumulation curves.
3. Material
Our material in this study comes from the 17th-century part of the 2.7-million-word Corpus of Early English Correspondence (1998 version). The CEEC is an electronic collection of 6,039 letters composed by 778 writers between the years 1410?–1681. It was compiled by Terttu Nevalainen (team leader), Jukka Keränen, Minna Nevala (née Aunio), Arja Nurmi, Minna Palander-Collin and Helena Raumolin-Brunberg. Due to a lack of resources for transcribing and editing, the corpus is based on published editions of letters; however, some of the material has been checked against the originals by members of the CEEC team. The CEEC is designed for studying the English language – more specifically, English English – in its socio-historical context. To this end, the writers have been carefully selected to give as balanced a representation of different social categories as possible. Nevertheless, the dominance of men from the upper ranks has been unavoidable: they were the most literate group, they were considered important enough that their letters were preserved, and their letters were later considered important enough to be published.
Figure 1: Running words written by men vs. women in the CEEC, 1600–1681 [bar chart; y axis: running words; x axis: periods 1600–1639 and 1640–1681]
The 17th-century part of the CEEC consists of 1.4 million words covering the years 1600–1681. Unfortunately, only about a quarter of this material was written by women, as can be seen from Figure 1. The situation between different ranks, regions, etc. is similarly imbalanced. Example (3), from a letter written in 1654 by Dorothy Osborne, illustrates the raw material in the corpus (emphases added).
(3) … to Visett a place you are soe much concern’d in, and to bee a wittnesse your selfe of the probabillity of your hopes though I will beleive you need noe other inducement to this Voyage then … (A 1654 FN DOSBORNE 130:Heading)
For the purposes of this work, we have divided the corpus into samples, each consisting of one person’s letters from a 20-year period in the corpus: 1600–1619, 1620–1639, 1640–1659, and 1660–1681. As an example, all letters in the corpus that were written by Dorothy Osborne in 1640–1659 form a sample called DOSBORNE-1640.
3.1 Input data
The instances of -ity and -ness were extracted from the corpus using the WordCruncher program. Since the corpus was unlemmatised, and the grammatically tagged version was not yet available, this had to be done by searching for all word-forms which had a suitable ending. Different spelling variants of the suffixes were collected from the OED, the Middle English Dictionary (MED) and by browsing the corpus itself, after which they were used one by one in WordCruncher searches. Some of these variants, such as -nes, yielded a vast number of erroneous results, because many other words besides those having the suffix ended in that way, such as plurals of words ending in -n. These had to be weeded out by hand.
A combination of manual work and Perl scripts was used to produce a computer-readable list enumerating all instances of the suffixed words in a normalised form for each sample. The word probabillity in example (3) counts as one instance of the normalised form probability in the sample DOSBORNE-1640. There was a total of 94 occurrences of -ity in this sample, and they were instances of 31 different normalised forms, shown in example (4). Thus, we say that the number of -ity tokens is 94 and the number of -ity types is 31 for the sample DOSBORNE-1640.
(4) antiquity authority calamity charity civility commodity conformity contrariety curiosity equality extremity formality gravity importunity impossibility infirmity insensibility necessity nobility opportunity piety possibility probability quality quantity reality severity society university vanity variety
The information extracted from the corpus can be summarised as two incidence matrices, one for -ity and another for -ness. Each row of a matrix corresponds to one sample and each column corresponds to one type. The element at row i and column j indicates the number of occurrences of type j in sample i. The sum of the elements on row i equals the number of tokens in sample i, and the number of nonzero elements on row i equals the number of types in sample i. This is exemplified for -ity in Table 1.
Table 1: Part of the matrix representation of -ity
                  …  contrariety  credulity  curiosity  …  probability  …
…
ASTUART-1600         0            0          1             0
DOSBORNE-1640        1            0          4             1
SPEPYS-1660          0            1          0             1
…
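The relationship between rows, tokens and types can be checked on the excerpt shown in Table 1. The following is a minimal sketch in Python, not the authors' tooling; the dictionary layout and the names `excerpt`, `tokens` and `types` are illustrative assumptions. Because only four columns of the matrix are shown, the results describe the excerpt only, not the full samples.

```python
# Excerpt of the -ity incidence matrix from Table 1 (four columns only).
columns = ["contrariety", "credulity", "curiosity", "probability"]
excerpt = {
    "ASTUART-1600":  [0, 0, 1, 0],
    "DOSBORNE-1640": [1, 0, 4, 1],
    "SPEPYS-1660":   [0, 1, 0, 1],
}

def tokens(row):
    """Number of tokens: the sum of the elements on the row."""
    return sum(row)

def types(row):
    """Number of types: the number of nonzero elements on the row."""
    return sum(1 for count in row if count > 0)

for sample, row in excerpt.items():
    print(sample, "tokens:", tokens(row), "types:", types(row))
```

Applied to the full matrix rather than the excerpt, the same two functions would be expected to give the per-sample totals reported in the text.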
The number of running words was counted for each sample; for DOSBORNE-1640, the number of running words is 71,299 – the number of distinct words in the sample is not needed in our study. Sociolinguistic information on each person was retrieved from an auxiliary database; this included gender, domicile and social rank. For DOSBORNE-1640, the gender is ‘female’, the domicile is ‘other’ and the social rank is ‘gentry upper’. Our incidence matrices for -ness and -ity are freely available for download (Säily and Suomela 2007).
3.2 Characteristics of the input data
The total number of samples in the corpus is 412, of which 112 consist of letters written by women. The total number of different types of -ity in the corpus is 192 and the total number of different types of -ness is 312. The relative sizes of the samples are illustrated in Figures 2 and 3. In the figures, samples from men are represented by white boxes, while samples from women are grey diamonds. The size of the symbol is in proportion to the number of running words in the sample. The largest samples are labelled, including DOSBORNE-1640 with 71,299 running words, and ASTUART-1600, Arabella Stuart’s letters written in 1600–1619, with 30,472 running words. Figure 2 presents the samples ordered by the number of -ity types they contain per -ity tokens. As noted above, there are 31 -ity types and 94 -ity tokens in the sample DOSBORNE-1640. Figure 3 presents the same information for -ness types. For example, there are 46 -ness types and 188 -ness tokens in DOSBORNE-1640. As can be seen from the figures, the size of the samples varies widely; there are many samples with very few tokens and types, and a few samples with very many tokens and types. From these figures we may observe, e.g., that while DOSBORNE-1640 includes more -ness types and tokens than any other sample, there are many samples from men which have a larger number of -ity types than this sample.
Figure 2: Samples ordered by the number of -ity types per -ity tokens [scatter plot; x axis: suffix tokens; y axis: -ity types; the largest samples are labelled]
Figure 3: Samples ordered by the number of -ness types per -ness tokens [scatter plot; x axis: suffix tokens; y axis: -ness types; the largest samples are labelled]
4. Methods
We are interested in comparing the productivity of a suffix between different subcorpora which consist of several samples, for example, all letters written by women. Our primary measure of productivity is the number of types. In the previous section we defined type counts for samples; this extends naturally to a whole subcorpus. As an alternative measure of productivity, we consider the number of hapax legomena. In precise terms, the measures are as follows; here we use the case of -ity as an example.
(a) Number of types. This is the number of different types of -ity which occur in the subcorpus at least once. For example, if the subcorpus contains occurrences of the word generosity (no matter how many times, regardless of the spelling) and no other -ity words, the number of types is 1.
(b) Number of hapax legomena or hapaxes. This is the number of different types of -ity which occur in the subcorpus exactly once. For example, if the subcorpus contains only one occurrence of the word instability, one occurrence of the word capability, four occurrences of the word generosity (in various spellings) and no other -ity words, the number of hapaxes is 2.
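As a small illustration of measures (a) and (b), the following Python sketch counts types and hapaxes for a subcorpus given as a list of samples. Representing a sample as a `Counter` of normalised forms is an assumption made for this example, as are the function names; the toy data mirror the example given under (b), split over two hypothetical samples.

```python
from collections import Counter

def pooled_counts(samples):
    """Add up the per-sample counts of every normalised form."""
    total = Counter()
    for sample in samples:
        total.update(sample)
    return total

def number_of_types(samples):
    """Measure (a): forms that occur at least once in the subcorpus."""
    return sum(1 for c in pooled_counts(samples).values() if c >= 1)

def number_of_hapaxes(samples):
    """Measure (b): forms that occur exactly once in the subcorpus."""
    return sum(1 for c in pooled_counts(samples).values() if c == 1)

# Hypothetical subcorpus: instability x1, capability x1, generosity x4.
subcorpus = [
    Counter({"instability": 1, "generosity": 3}),
    Counter({"capability": 1, "generosity": 1}),
]
print(number_of_types(subcorpus), number_of_hapaxes(subcorpus))
```

Note that hapaxes are counted over the pooled subcorpus: generosity occurs once in the second sample but four times in total, so it is not a hapax.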
If we view the subcorpus as a matrix where the element at row i and column j indicates the number of occurrences of type j in sample i (recall Table 1), we can give the following equivalent definitions. Form a vector v by adding up all rows of the matrix. Then the number of types is the number of nonzero elements in v, and the number of hapaxes is the number of elements equal to 1 in v.
4.1 Comparing productivity between subcorpora
The measures we defined above have an obvious drawback: they are sensitive to the size of the subcorpus. In our material we have 80 types of -ity in the texts written by women and 183 types of -ity in the texts written by men; however, we cannot immediately say that the type richness of women’s texts is lower, as we have much more material from men (see Figure 1). Furthermore, the relation between the size of the subcorpus and the number of types occurring in it is not necessarily linear. Put simply, at the very beginning of the type accumulation curve, each -ity word is likely to be new, but later we are more likely to meet -ity words which have already occurred in the corpus. With hapaxes, the measure might even decrease as the size of the subcorpus increases. We shall see practical examples of the nonlinear behaviour throughout this work (e.g., Figures 4, 6 and 8). Therefore, attempts to normalise the number of types by, say, dividing by the number of running words are not justifiable (cf. Gotelli and Colwell 2001). Indeed, such attempts give completely misleading results with our data. For example, the number of -ity types per 100,000 running words is approximately 23.5
for women and 17.6 for men in our material. It would appear that the type richness is higher for women, even though the opposite is the case, as we shall see.
We might be able to tackle the problem by making further modelling assumptions on the process which generates the text; we might, for example, assume that the occurrences of the words are independent, and we could then use the input data to estimate the probabilities of each person producing a particular word; this way we could compare the productivity of different persons. However, we are reluctant to make such simplifying assumptions, as the choice of words may have subtle dependencies on the textual context (see, e.g., Baayen 2001: 163).
We take a somewhat extreme approach in assuming nothing. Instead of trying to compare subcorpora of different sizes, we only assume that we can compare subcorpora of equal sizes. We use the following alternative definitions for equal size:
(i) The same number of running words.
(ii) The same number of -ity tokens.
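The misleading per-100,000-word figures mentioned earlier in this section can be reproduced from the subcorpus totals reported in Section 5 (80 -ity types in 340,116 running words for women; 183 -ity types in 1,038,951 running words for men). The short Python check below is illustrative only; the function name is an assumption.

```python
def types_per_100k(types, running_words):
    """Naive normalisation: types per 100,000 running words."""
    return types / running_words * 100_000

women = types_per_100k(80, 340_116)
men = types_per_100k(183, 1_038_951)
print(round(women, 1), round(men, 1))
```

The naive ratio favours the smaller subcorpus because types accumulate more slowly as a corpus grows, which is why the comparison is restricted to subcorpora of equal size.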
For most of this work we focus on definition (i) in conjunction with measure (a), i.e., the number of types. Other combinations may also be of interest, and we can experiment with them by using the same general approach and the same tools. For example, if we use measure (b) and definition (ii), we compare the number of -ity hapaxes in subcorpora with the same number of -ity tokens. Equally well, we could compare the ratios between -ity hapaxes and -ity tokens, arriving at Baayen’s (1992: 115) definition.
4.2 Statistical significance
We are not interested in merely noticing that a particular subcorpus has a lower number of types in comparison with another subcorpus. We are interested in differences which are statistically significant; informally, not likely to be mere random artefacts of the data. We now review some basics of statistical hypothesis testing and apply the ideas to our problem. Let us choose the measure of productivity (a), the number of types, and say that we are willing to compare only subcorpora which are equal by definition (i), the number of running words. The idea that women are significantly less productive than men in this material is captured as follows. Let n be the number of running words in the subcorpus which consists of the texts written by women and let t be the number of types in this subcorpus. Hypothesis. Gender is significant. For a subcorpus with n running words, t is a particularly low number of types. The null hypothesis is that there is no connection between the number of types and gender; the effect is caused by chance. More formally, the null hypothesis is
that the numbers of running words and the rows of the incidence matrices for men and women are samples from the same population. Intuitively, the null hypothesis suggests that the subcorpus of texts written by women could be constructed through the following process. We randomly pick samples from the corpus, labelling them as having been written by women, until the subcorpus we have accumulated is of size n; the rest of the corpus is then labelled as having been written by men. We emphasise that our samples consist of complete letters. We need not assume that the words within each letter are independent of the context; we only assume that samples as a whole are interchangeable under the null hypothesis. We can test the hypothesis by estimating how likely it is that a subcorpus constructed in this way has as few as t types (we apply one-sided testing here). If this turns out to be very unlikely, say, happening on average only once in 100 trials, we reject the null hypothesis and accept the original hypothesis, with p = 0.01. There is a subtlety: as we work at the granularity of samples, and the sizes of the samples vary, it may be that very few labellings – maybe just the original labelling – produce a subcorpus with exactly n running words. In practice, we make a minor adjustment. Informally, we consider subcorpora with at least n running words and not many more than that; making the subcorpus longer certainly cannot have a negative bias on the number of types. The case of hapaxes is more complicated; we come back to this issue in Section 5.3.
4.3 Permutation testing
Now, we have formalised our hypothesis and we are ready to do standard hypothesis testing – all we need to do is estimate the probability p of obtaining such an extreme case as at most t types in a subcorpus with n running words. As we are dealing with type counts, we do not have a simple mathematical formula for calculating p: the probability depends not only on summary information such as the values t and n but on the full incidence matrix. Therefore, we use techniques from permutation testing (see, e.g., Good 2005). Applied to our problem in a straightforward manner, the basic idea would be as follows. We take the intuitive idea of picking samples in a random order quite literally. The order in which we pick the samples forms a permutation (reordering) of the samples. To calculate the probability p, we need to calculate the percentage of permutations which have at most t types in the first n running words. We generate all permutations of the samples, check which of them satisfy this condition, and compute the percentage p. The next section adapts this basic idea to our needs.
5. Implementation
Standard permutation testing would indeed suffice if all we were interested in was testing one hypothesis. However, we are also interested in exploratory data analysis. We want to consider several variables besides gender and see if they
correlate with the number of types. Ideally, we would prefer to avoid repeating extensive computations between each experiment. We also wish to gain more understanding on the accumulation of types as a function of corpus size. We can address all of these requirements by calculating type accumulation curves similar to that shown in Figure 4. This is the output generated by the computer program that we present in this section. First we describe how to interpret and use these curves; then we discuss the implementation which is used to compute the curves.
Figure 4: Bounds for -ity types as a function of the number of running words [line plot; x axis: running words (millions); y axis: -ity types; curves for p < 0.1, p < 0.01, p < 0.001 and p < 0.0001]
Figure 4 shows upper and lower bounds for the number of -ity types. On the x axis, we have the number of running words in the subcorpus. The bounds are plotted for various levels of statistical significance. For example, the solid black curve corresponds to the level p = 0.01; the lower bound for, say, 600,000 running words at this level is 123, and the upper bound at this level is 163. This can be interpreted as follows: in all permutations of the samples that we can construct from the whole corpus, less than 1% have fewer than 123 -ity types within the first 600,000 running words, and less than 1% have more than 163 -ity types within the first 600,000 running words. The p values here refer to a one-sided test; for a two-sided test, the p values need to be doubled.
Once we have computed the curves, we can immediately use them for hypothesis testing, in a very straightforward manner: we simply plot the data point that corresponds to the subcorpus of interest on these curves and see whether the point lies, for example, below the lower bound. If so, we conclude that the number of types is significantly low for a subcorpus of this size. This is merely an (indirect) application of a permutation test.
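For comparison, the direct form of the test for a single subcorpus can be sketched as follows. This is a simplified illustration in Python, not the program described in this section: it shuffles the samples, takes samples until at least n running words have accumulated (the adjustment described in Section 4.2), and estimates the one-sided p value as the share of shuffles with at most t types. The data layout (pairs of a running-word count and a set of types), the function name and the toy corpus are all assumptions.

```python
import random

def estimate_p_low(samples, n, t, rounds=10_000, seed=0):
    """Estimate P(at most t types once at least n running words are covered).

    samples: list of (running_words, set_of_types) pairs, one per sample.
    """
    rng = random.Random(seed)
    order = list(samples)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(order)            # uniform random permutation of samples
        words, seen = 0, set()
        for size, types in order:
            if words >= n:            # stop once at least n words are covered
                break
            words += size
            seen |= types
        if len(seen) <= t:
            hits += 1
    return hits / rounds

# Hypothetical toy corpus: four type-rich samples and four type-poor ones.
rich = [(100, {f"r{i}a", f"r{i}b", f"r{i}c"}) for i in range(4)]
poor = [(100, {"x"}) for _ in range(4)]
corpus = rich + poor

# The type-poor subcorpus has n = 400 running words and t = 1 type.
p = estimate_p_low(corpus, n=400, t=1)
print(p)
```

In this toy corpus only 1 of the 70 ways of choosing four samples out of eight yields a single type, so the estimate should be close to 1/70. The curves make repeated runs of this kind unnecessary, since one table of permutations serves every subcorpus size.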
An example of this is shown in Figure 5. In the subcorpus which consists of the letters written by women, we have 340,116 running words and only 80 -ity types. In the subcorpus which consists of the letters written by men, we have 1,038,951 running words and 183 -ity types. We have plotted both data points on top of the curves already shown in Figure 4. We note that the data point which corresponds to women’s texts lies below the lower bound with p = 0.001. We conclude that it is highly unlikely to come up with such a collection of samples by chance; our main hypothesis is true. We come back to the analysis of the results in Section 6.
Figure 5: Hypothesis testing. Women have significantly few -ity types [the curves of Figure 4 with data points for men and women; x axis: running words (millions); y axis: -ity types]
As we shall see, calculating the curves requires some amount of computation. However, once we have done the computation, we can use the same curves repeatedly to answer various questions. We can test other similar hypotheses easily by plotting more data points on top of the curves. Indeed, we can do exploratory data analysis by plotting data points corresponding to each possible value of each sociolinguistic category, such as gender, domicile, social rank, and time period. We shall see examples of this in Section 6. We can also analyse the curves qualitatively: the shape of Figure 4 illustrates the nonlinear relation between the size of the subcorpus and the number of types occurring in it. Finally, we can calculate similar curves for measure (b), hapaxes, and we can also consider definition (ii), which means that the x axis shows the number of -ity tokens instead of the number of running words in the subcorpus. See Figure 6 for an example.
[Plot: -ity hapaxes (y axis) against suffix tokens (x axis); bounds at p < 0.1, p < 0.01, p < 0.001 and p < 0.0001]
Figure 6: Bounds for -ity hapaxes as a function of the number of -ity tokens
5.1 Basic algorithm
We proceed to present the operation of the computer program. The program performs the computations in two steps. The first step essentially tabulates for each pair (t, n) an approximation of the number of permutations such that there are exactly t types within the first n running words. The second step uses the table to find for each value of n those values of t at which we cross the significance levels of interest, such as p = 0.01 and p = 0.001. The first step is computationally more intensive. It consists of generating a large number of random permutations of the samples – typically, the number of permutations is in the range of tens of thousands to millions. For each permutation, we process the samples one by one, in the order indicated by the permutation. For each new sample, we compute the total number of types observed so far. Each permutation can be interpreted as a type accumulation curve, similar to the two examples illustrated in Figure 7; in the figure, each tick mark corresponds to one sample. Once we have a complete accumulation curve, we increment the counters in the table for each pair (t, n) through which it passes. This is repeated for each permutation, after which we can perform the second step.
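The two steps can be illustrated with a minimal Python sketch. It is not the C program described in this section: the function names, the toy data and the choice to record n only at sample boundaries are assumptions made for the sake of a short example.

```python
import random
from collections import Counter


def accumulation_table(samples, n_permutations, seed=0):
    """Step 1: count, for each pair (t, n), the curves passing through it.

    samples is a list of (word_count, set_of_types) pairs. Values of n are
    recorded at sample boundaries only, which is a simplification.
    """
    rng = random.Random(seed)
    order = list(samples)
    table = Counter()
    for _ in range(n_permutations):
        rng.shuffle(order)            # one random permutation of the samples
        seen, n = set(), 0
        for word_count, types in order:
            n += word_count
            seen |= types             # types observed so far
            table[(len(seen), n)] += 1
    return table


def lower_bound(table, n, p):
    """Step 2: smallest t such that at least a fraction p of the curves
    have t or fewer types at n (so fewer than p lie strictly below t)."""
    column = sorted((t, c) for (t, m), c in table.items() if m == n)
    total = sum(c for _, c in column)
    running = 0
    for t, c in column:
        running += c
        if running >= p * total:
            return t
    return None                       # no curve was recorded at this n


# Hypothetical toy data: four samples of 100 running words each.
toy = [(100, {"a", "b"}), (100, {"b"}), (100, {"c"}), (100, {"a", "c", "d"})]
table = accumulation_table(toy, n_permutations=1000)
print(lower_bound(table, n=200, p=0.01))
```

Because the toy samples all have the same length, every permutation passes through the same values of n; with samples of unequal length, the n dimension would need to be grouped in some way before the second step.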
[Plot: -ity types (y axis) against running words in millions (x axis); two type accumulation curves]
Figure 7: Two type accumulation curves. Each tick mark represents the addition of one sample
5.2 Computational complexity
In the first step, we employ a randomised algorithm to approximate the number of permutations for each (t, n). This is an application of the Monte Carlo method (Mitzenmacher and Upfal 2005: 252), in which one picks a number of objects at random from a suitable probability distribution, checks which percentage of them satisfies the desired properties, and derives an estimate of the total number of such objects. By increasing the number of objects that we choose, we can improve the accuracy of the estimate. As is usual in an implementation of permutation testing (Good 2005: 233), we choose a particularly simple probability distribution, the uniform distribution over all permutations; therefore, we can pick a random permutation by using a simple algorithm for randomly shuffling a list.
By resorting to a randomised approximation algorithm, we have sacrificed some accuracy. This is acceptable, as we only need the first few decimals of the probability p. Approximation is in any case unavoidable, because it is not likely that there exists any efficient algorithm for, say, computing the exact number of permutations which traverse through a given point (t, n). Even determining whether the number is more than zero is hard: this is a generalisation of the SET COVER problem, which belongs to the class of NP-complete problems, and it is generally believed that no efficient algorithm exists for any problem that is NP-complete (see, e.g., Garey and Johnson 2003 [1979]).
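As a small illustration of the Monte Carlo idea, the following Python sketch draws uniformly random permutations and estimates the fraction that satisfies a given property. The text does not name the shuffling algorithm; the Fisher-Yates shuffle used here is a standard choice and is an assumption, as are the function names and the toy property.

```python
import random


def fisher_yates_shuffle(items, rng):
    """Shuffle items in place so that every permutation is equally likely."""
    for i in range(len(items) - 1, 0, -1):
        j = rng.randint(0, i)         # randint includes both end points
        items[i], items[j] = items[j], items[i]
    return items


def estimate_fraction(items, has_property, n_draws, seed=0):
    """Monte Carlo estimate of the fraction of permutations with a property."""
    rng = random.Random(seed)
    work = list(items)
    hits = 0
    for _ in range(n_draws):
        fisher_yates_shuffle(work, rng)
        if has_property(work):
            hits += 1
    return hits / n_draws


# Toy check: exactly one quarter of all permutations of four distinct items
# begin with "a", so the estimate should be close to 0.25.
estimate = estimate_fraction(["a", "b", "c", "d"], lambda p: p[0] == "a", 20000)
print(estimate)
```

Increasing `n_draws` narrows the spread of the estimate around the true fraction, which is the sense in which more random permutations improve accuracy.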
5.3 Implementation details
Next we address the fact that we only have data at the granularity of entire samples. Put simply, based on our input data, we do not know whether the occurrences of the types are at the beginning or the end of the sample; if we are interested in knowing the exact value of t for some n which happens to be in the middle of a sample, we do not know whether we would have already met the new types of this sample by n running words or not. Therefore, our program adopts a safe approach: it always considers the worst case for us and the most favourable case to the null hypothesis, i.e., the case which produces the widest confidence intervals. Finding the worst cases for the number of types is straightforward. For lower bounds, we can proceed as if all types were clustered at the very end of the sample, and for upper bounds we can assume the opposite. The case of hapaxes is more involved, as we need to distinguish between several cases: (a) newly created hapaxes, i.e., types which have not occurred before this sample and which occur only once in this sample; (b) temporary hapaxes, i.e., types which have not occurred before this sample and which occur more than once in this sample; and (c) removed hapaxes, i.e., types which have occurred exactly once before this sample and which occur at least once in this sample. For lower bounds, the worst case is that the types of class (c) occur at the very beginning of the sample, cancelling previously known hapaxes. For upper bounds, the worst case is that the types of class (a) and one instance of each type of class (b) occur at the very beginning of the sample, increasing the number of hapaxes at least temporarily. To develop a program which is computationally efficient in terms of time and memory requirements, we need to address some further issues. First, while the range of possible values of t is typically moderate, the range of possible values of n can be large; in our data, we have more than one million running words. 
A table storing the number of permutations for each (t, n) would therefore be impractically large. We can significantly improve performance by dividing the n dimension into a smaller number of slots; for example, we can interpret the range from n = 0 to n = 4,999 as one slot, the following 5,000 running words as another slot, and so on. The approach of using slots is combined with the approach of finding worst-case bounds. Therefore, the slots can be used safely: they do not introduce any artefacts in the curves which would make some finding seem statistically significant if this is not the case. Naturally, using very large slots may prevent one from finding even statistically significant results. To further improve performance, the computations in the first phase use a data layout in which each element requires only 1 or 2 bits of storage: for types, the single bit stands for “at least 1”; for hapaxes, one bit stands for “at least 1” and the other for “at least 2”. The input is pre-processed into an incidence matrix which is stored in this compact format, and the table containing the counts for each slot is also stored in this manner. The compact memory layout is cache-friendly and allows us to exploit bit-parallelism in the calculations.
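One way to combine slots with worst-case bounds for type counts is sketched below in Python. This is a hedged reading of the description above, not the authors' C code: the function name, the use of Python integers as bit masks and the toy data are all assumptions, and hapaxes are not covered.

```python
from bisect import bisect_left, bisect_right


def slot_bounds(samples, slot_size):
    """Conservative per-slot type counts for one ordering of the samples.

    samples is a list of (word_count, type_mask) pairs; bit k of type_mask
    is set when type k occurs at least once in the sample.
    Returns one (lowest, highest) pair of type counts per slot.
    """
    ends, counts = [0], [0]
    seen = 0
    for word_count, mask in samples:
        seen |= mask                          # union of types via bitwise OR
        ends.append(ends[-1] + word_count)
        counts.append(bin(seen).count("1"))   # number of types seen so far
    total = ends[-1]
    bounds = []
    for start in range(0, total + 1, slot_size):
        last = min(start + slot_size - 1, total)
        # Lower: new types are assumed to appear at the very end of a sample,
        # so only samples completed by the start of the slot are counted.
        low = counts[bisect_right(ends, start) - 1]
        # Upper: new types are assumed to appear at the very start of a
        # sample, so every sample begun before the slot ends is counted.
        high = counts[bisect_left(ends, last)]
        bounds.append((low, high))
    return bounds


# Hypothetical toy data: three samples and four types (bits 0 to 3).
toy = [(3000, 0b0011), (4000, 0b0110), (5000, 0b1000)]
print(slot_bounds(toy, slot_size=5000))
```

Taking the lowest value at the start of a slot and the highest value at its end can only widen the interval, so a coarser slot size costs statistical power but cannot create a spurious significant result.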
The program is written in standard C (ISO/IEC 9899:1999); it should compile and run on any standard-compliant platform. The only essential limitation on the size of the input data is the amount of available memory. Parameters such as the number of iterations and the slot size can be set by using command line switches.
5.4 Performance
The following example illustrates the typical performance of the program. In our input data for the suffix -ity, we had 412 samples and 192 different types of -ity. We used slots of 5,000 running words each; this resulted in 277 slots. We ran the experiments on a desktop PC with a 2.4-GHz Pentium 4 processor, under the Linux operating system; the application was compiled using the C compiler from the GNU Compiler Collection (GCC). We experimented with two different numbers of permutations: 20,000, which is suitable for getting a quick idea of whether there are any statistically significant results in view, and 1,000,000, which is more than enough to produce publication-quality illustrations such as those presented in this work. The running time for computing the type accumulation curves was 1.7 seconds for 20,000 permutations and 82 seconds for 1,000,000 permutations. The running time for computing the hapax accumulation curves was 2.3 seconds for 20,000 permutations and 113 seconds for 1,000,000 permutations.
5.5 Using the implementation
The computer program described in this section is freely available under an open source license (GNU General Public License, version 2.0 or later). For details on obtaining and using the program, see Suomela (2007).
Both the input and the output of the program are plain text files. The program accepts as input data matrices similar to those illustrated in Table 1. The input files can be prepared manually or, as we have done, by using corpus-specific tools. The output consists of the numerical data for curves similar to those in Figure 4. Tools such as statistical software packages or spreadsheets can be used to visualise the results. With our program, we provide a script which illustrates how to draw graphs similar to Figure 4 by using R, the free software environment for statistical computing (R Development Core Team 2007).
As stated above, the program is only needed for computing the upper and lower bounds for type accumulation, and such computation needs to be performed only once for a given data set. In the following section, we use the bounds for both hypothesis testing and exploratory data analysis.
6. Results and conclusions
Our hypothesis was that gender is significant in the case of -ity; as seen from Figure 5, this turned out to be the case. The richness of -ity types is significantly
low (p < 0.001) in women’s letters in the 17th-century part of the CEEC. Naturally, the 17th-century part of the CEEC is not a perfect representation of 17th-century English; neither are type counts a perfect measure of morphological productivity. Nevertheless, a result which is statistically this significant demands an explanation, and we argue that an attractive candidate can be found through examining the socio-historical situation in 17th-century England (see, e.g., Wrightson 1993). As women’s access to education was severely restricted, they would not have had the competence to use the learned and etymologically foreign suffix -ity to the same extent as men.
The situation for -ness is shown in Figure 8. Here the data points for both men and women fall between the upper and lower bounds, and we cannot draw a similar conclusion on the significance of gender.
[Plot: -ness types (y axis) against running words in millions (x axis); data points for men and women; bounds at p < 0.1, p < 0.01, p < 0.001 and p < 0.0001]
Figure 8: Bounds for -ness types as a function of the number of running words
Finally, we explore some other sociolinguistic categories. Subcorpora based on the domiciles of the informants show no significant results. As for social rank, we might have expected to find a significantly low level of productivity for -ity in the lowest ranks, but there is simply too little data from them in the corpus. A more interesting case comes up when we divide the corpus into time periods: letters written in 1600–1639, and those written in 1640–1681. Figure 9, based on the same set of curves as Figure 5, shows that the type richness of -ity is significantly low in the earlier period. One interpretation for this could be that there is a linguistic change in progress: in the course of the 17th century, the use of -ity becomes more common in personal letters. This makes sense – not only was the use of Latinate features socially stratified (they were mostly used by learned men), but it was also register-specific, and began to spread from more
formal contexts to less formal ones during the 16th and 17th centuries (cf. Nevalainen and Tieken-Boon van Ostade 2006: 281–282; Riddle 1985: 455–456). The above examples illustrate the ease with which we can do exploratory data analysis once we have computed the bounds of the type accumulation curves. Even with a relatively small corpus, we were able to not only confirm our hypothesis but also discover unanticipated linguistically interesting results. The bounds for hapax counts turned out to be too wide for significant differences to emerge (see Figure 6). It may be that this measure requires more data to become usable. However, if the problem of wide bounds for hapax accumulation curves persists in larger corpora, this could call into question the use of hapax-based productivity measures in general.
[Plot: -ity types (y axis) against running words in millions (x axis); data points for the periods 1600–1639 and 1640–1681; bounds at p < 0.1, p < 0.01, p < 0.001 and p < 0.0001]
Figure 9: Subcorpora based on time periods
In addition to testing hapax accumulation in larger corpora, future work could include a comparison between our type accumulation curves and those derived from more widely used parametric models. Another opportunity for future research would be a more fine-grained investigation of the differences between men and women in the use of the suffix -ity: as pointed out by an anonymous reviewer, part of the differences observed in this study could be due to women writing about a more restricted set of topics, which may lead to a large vocabulary overlap between women.
As noted in Section 4.1, our work focuses on definition (i) of corpus size – in our type accumulation curves, the x axis is the number of running words in the corpus. Another possibility would have been to compute type accumulation as a function of suffix tokens. Further work is needed in order to better understand the
interplay between the number of running words, the number of affix tokens, and the number of affix types in the context of productivity.
Acknowledgements
We thank Harald Baayen, Terttu Nevalainen, the audience at ICAME 28 and the members of VARIENG for discussions and comments, and anonymous reviewers for their helpful feedback. The database of sociolinguistic information used in the study was compiled by Arja Nurmi. This research was supported in part by the Academy of Finland Centre of Excellence funding for the Research Unit for Variation, Contacts and Change in English (VARIENG) at the Department of English, University of Helsinki, and the Helsinki Graduate School in Computer Science and Engineering (Hecse).
Notes
1 As noted by an anonymous reviewer, this particular example could also be regarded as an instance of affix substitution. This provides an even stronger motivation for not leaving out these kinds of words.
References
Aronoff, M. and F. Anshen (1998), ‘Morphology and the lexicon: Lexicalization and productivity’, in: A. Spencer and A.M. Zwicky (eds.) The Handbook of Morphology. Cambridge, MA: Blackwell Publishers. 237–247.
Baayen, R.H. (1992), ‘Quantitative aspects of morphological productivity’, in: G. Booij and J. van Marle (eds.) Yearbook of Morphology 1991. Dordrecht: Kluwer Academic Publishers. 109–149.
Baayen, R.H. (1993), ‘On frequency, transparency and productivity’, in: G. Booij and J. van Marle (eds.) Yearbook of Morphology 1992. Dordrecht: Kluwer Academic Publishers. 181–208.
Baayen, R.H. (2001), Word Frequency Distributions. Dordrecht: Kluwer Academic Publishers.
Baayen, R.H. and R. Lieber (1991), ‘Productivity and English derivation: A corpus-based study’, Linguistics, 29: 801–843.
Baayen, R.H. and A. Renouf (1996), ‘Chronicling the Times: Productive lexical innovations in an English newspaper’, Language, 72 (1): 69–96.
Bolinger, D.L. (1948), ‘On defining the morpheme’, Word, 4: 18–23.
Březina, V. (2005), The Development of the Prefixes un- and in- in Early Modern English with Special Regard to the Sociolinguistic Background, unpublished MA thesis, Faculty of Arts, Charles University in Prague.
CEEC = Corpus of Early English Correspondence (1998), compiled by the Sociolinguistics and Language History project team (T. Nevalainen, J. Keränen, M. Nevala, A. Nurmi, M. Palander-Collin, H. Raumolin-Brunberg) at the Department of English, University of Helsinki. http://www.helsinki.fi/varieng/domains/CEEC.html.
Corpus of Early English Correspondence Sampler (1998), see above.
Cowie, C. and C. Dalton-Puffer (2002), ‘Diachronic word-formation and studying changes in productivity over time: Theoretical and methodological considerations’, in: J.E. Díaz Vera (ed.) A Changing World of Words: Studies in English Historical Lexicography, Lexicology and Semantics. Amsterdam: Rodopi. 410–437.
Dalton-Puffer, C. (1996), The French Influence on Middle English Morphology: A Corpus-Based Study of Derivation. Berlin: Mouton de Gruyter.
Evert, S. and M. Baroni (2005), ‘Testing the extrapolation quality of word frequency models’, in: P. Danielsson and M. Wagenmakers (eds.), Proceedings of Corpus Linguistics 2005. The Corpus Linguistics Conference Series 1. Available at http://www.corpus.bham.ac.uk/PCLC/.
Evert, S. and M. Baroni (2007), ‘zipfR: Word frequency distributions in R’, in: Proceedings of the ACL 2007 Demo and Poster Sessions. Stroudsburg, PA: Association for Computational Linguistics. 29–32.
Garey, M.R. and D.S. Johnson (2003) [1979], Computers and Intractability: A Guide to the Theory of NP-Completeness. New York: W.H. Freeman and Company.
Good, P. (2005), Permutation, Parametric, and Bootstrap Tests of Hypotheses. 3rd edition. Springer Series in Statistics. Berlin: Springer-Verlag.
Gotelli, J. and R. Colwell (2001), ‘Quantifying biodiversity: Procedures and pitfalls in the measurement and comparison of species richness’, Ecology Letters, 4: 379–391.
Hay, J. (2001), ‘Lexical frequency in morphology: Is everything relative?’, Linguistics, 39 (6): 1041–1070.
Marchand, H. (1969), The Categories and Types of Present-Day English Word-Formation: A Synchronic-Diachronic Approach. 2nd edition. Munich: C.H. Beck’sche Verlagsbuchhandlung.
MED = Middle English Dictionary, 2001 edition. Electronic version. Available at http://ets.umdl.umich.edu/m/med/.
Mitzenmacher, M. and E. Upfal (2005), Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge: Cambridge University Press.
Nevalainen, T. and H. Raumolin-Brunberg (2003), Historical Sociolinguistics: Language Change in Tudor and Stuart England. London: Pearson Education.
Nevalainen, T. and I. Tieken-Boon van Ostade (2006), ‘Standardisation’, in: R.M. Hogg and D. Denison (eds.) A History of the English Language. Cambridge: Cambridge University Press. 271–311.
OED = Oxford English Dictionary, 2nd edition, 1989. OED Online. Available at http://dictionary.oed.com.
Plag, I. (1999), Morphological Productivity: Structural Constraints in English Derivation. Berlin: Mouton de Gruyter.
R Development Core Team (2007), R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Available at http://www.R-project.org.
Riddle, E.M. (1985), ‘A historical perspective on the productivity of the suffixes -ness and -ity’, in: J. Fisiak (ed.) Historical Semantics; Historical Word-Formation. Berlin: Mouton de Gruyter. 435–461.
Säily, T. (2005), ‘Use of the suffixes -ity and -ness in early English letters: Was gender a factor?’, unpublished seminar paper, Department of English, University of Helsinki.
Säily, T. (2008), Productivity of the Suffixes -ness and -ity in 17th-century English Letters: A Sociolinguistic Approach, unpublished MA thesis, Department of English, University of Helsinki. Available at http://urn.fi/URN:NBN:fife200810081995.
Säily, T. and J. Suomela (2007), ‘Incidence matrices for -ness and -ity’. Available at http://www.cs.helsinki.fi/jukka.suomela/ity-ness-data/.
Suomela, J. (2007), ‘Type and hapax accumulation curves’, computer program. Available at http://www.cs.helsinki.fi/jukka.suomela/types/.
Tweedie, F.J. and R.H. Baayen (1998), ‘How variable may a constant be? Measures of lexical richness in perspective’, Computers and the Humanities, 32: 323–352.
Wrightson, K. (1993), English Society, 1580–1680. London: Routledge.
Does English have modal particles?
Karin Aijmer
University of Gothenburg
Abstract
Modal particles are functionally closely related to discourse markers. This raises the issue of whether modal particles have a common ‘class-identifying’ function which distinguishes them from discourse markers (and adverbs) as well as questions about what we mean by modality. Of course has been treated as a discourse marker as well as a modal adverb. However it does not seem to have been discussed as a modal particle. It is argued that we should distinguish between its uses as a discourse marker and modal particle on the basis of its formal properties and its functions.
1. Defining the problem
The interest in modal and evidential particles in different languages of the world in the last decades is evidenced in works such as Chafe and Nichols (1986), Aikhenvald (2004), Palmer (1986) and we can also, as a result, expect more interest in studying particles in the European languages. Modal particles are also said to be a frequent feature of some, mainly Germanic, languages (e.g. German, Swedish, Dutch, Danish, Norwegian). In Swedish, we find ju (‘as you know’), nog (‘probably’), väl (‘surely’), visst (‘evidently’) and descriptions of German regularly identify over twenty modal particles (including schon, wohl, denn, ja) (Hoye 1997: 209). Modal particles are a subclass of pragmatic markers and they share a number of properties with other pragmatic markers. They are not part of the truth-conditional content; they are optional in the sentence and they have textual and interpersonal functions. The definition and classification of (modal) particles rely on a number of formal criteria such as position in the clause, syntactic integration and the lack of stress (Waltereit 2001: 1392; Hansen 1998). Modal particles are for example usually unlike adverbs with regard to stress and position. They do not occupy initial or final position in the clause but ‘particle’ in the relevant languages has a fixed position in the verbal complex (the middle field), a topological notion referring to the position after the initial element of a complex verbal element. The formal criteria are fairly rigid and are influenced by the German tradition of ‘Partikelforschung’ (see e.g. Weydt 1969). Formal factors may not be equally important in all languages although they are part of the definition in German and in Swedish. However, modal particles ‘do not appear to belong to a very clearly defined modal system’ such as the modal auxiliaries (Palmer 1986: 45). Modal particles are generally felt to be semantically and pragmatically elusive (Waltereit
2001: 1392) and ‘the modal functions identified are considerably different in the different languages, or at least are conceptualized in different ways’ (Traugott 2007: 142). They can have meanings which are marginally modal or not obviously modal at all. As Palmer (1986) points out (quoting Curme 1905 (1960)), modal particles in German are paraphrased by ‘modal adverbs which denote in what manner a thought is conceived by the speaker’ and they seem to be ‘essentially comments on the proposition rather than opinions about it, and so not very obviously modal’ (Palmer 1986: 46). This raises the issue of whether modal particles have a common ‘class-identifying’ function which distinguishes them from discourse markers (and adverbs), as well as questions about what we mean by modality. Modal particles are functionally closely related to discourse markers. However the relationship between modality and discourse has not been much discussed (cf. Traugott 2007). For example, in the early literature on discourse markers such as Schiffrin (1987), modality is not discussed as a source of discourse markers. In this paper, my aim is to discuss the relationship between modality and different discourse and pragmatic functions. I will discuss the modal adverb of course which has been studied earlier but not from this perspective (cf. Simon-Vandenbergen and Aijmer 2002/2003; Wichmann et al. forthcoming). The adverb is multifunctional and has a number of pragmatic and discourse functions which are obviously modal but removed from the literal meaning of the adverb. Of course has been treated as a discourse marker as well as a modal adverb. However it does not seem to have been discussed as a modal particle. I will argue that we need to distinguish between its uses as a discourse marker and modal particle on the basis of its formal properties and its functions. In addition of course can be an answer particle. This function is easy to describe in both structural and discourse terms. 
Functionally of course is for example used if the speaker’s and hearer’s assumptions converge. The use as an answer particle will not be further discussed but is of interest if we want to describe the different functions of of course in terms of polysemy and grammaticalization.
2. Modal particles and modal adverbs
Modal adverbs provide the closest approximation to modal particles and it has recently been suggested that modal adverbs in English should be regarded as modal particles in some of their senses, ‘primarily those adverbs with only faint shades of meaning’ (Hoye 1997: 209). This idea fits in well with the hypothesis proposed by several linguists (Diewald 2006, Waltereit and Detges 2007) that modal particles are derived by grammaticalization from adverbs. In traditional descriptions of English grammar there is no place for modal particles. However, according to Hoye, the distinction between adverb and particle can be said to match the classification into different types of adverbs familiar from Quirk et al’s description (1985):
the concept of ‘modal particle’ is relevant to the classification of modal adverbs in English because, … according to the degree of their integration in clause structure and the nature of their association with the modal verb head, they display various degrees of lexical redundancy and grammaticalization (Hoye 1997: 209).
Quirk et al distinguish between adjuncts, disjuncts, subjuncts and conjuncts in terms of their centrality or peripherality in the clause. For example, adjuncts are typically integrated in the sentence and contribute to the propositional content just like other sentence elements. Of course is not an adjunct (a VP adverbial) in present-day English but was used in older English to indicate ‘that something occurred as a natural process’ (Lewis 2003). Of course in present day English has as its core meaning ‘taking for granted’, ‘definitely’, ‘obviously’. Of course as a subjunct is illustrated in (1) where it is subordinate to the subject in the clause.
(1) Many young people of course prefer hip hop to rock music.
It can also be subordinate to the whole clause:
(2) Many young people may of course prefer hip hop to rock music.
Subjuncts ‘have to a greater or lesser extent, a subordinate role in relation to one of the other clause elements or to the clause as a whole. They exhibit considerably less semantic and grammatical independence than disjuncts and are more closely integrated in clause structure and especially the verb phrase.’ (Hoye 1997: 155). As a subjunct of course is ‘concerned with expressing the semantic role of modality in particular emphasis’ (Quirk et al 1985: 587). When of course is a disjunct it is more salient in the clause. Disjuncts ‘have a superior role as compared with the sentence elements; they are syntactically more detached and in some respects ‘superordinate’, in that they seem to have a scope that extends over the sentence as a whole’ (Quirk et al 1985: 613). The semantic role of disjuncts is to express a comment as to ‘the degree of or condition for truth of content’ (Quirk et al 1985: 615). Of course is for instance a high probability adverb conveying ‘the speaker’s strength of conviction or emphasis in the truth of the adjoining proposition; by topicalizing the firmness of the speaker’s belief the effect is, of course, to emphasize it’ (Hoye 1997: 190).
(3) of course he’ll be working with overseas students
(4) of course, when the subject matter concerns very recent events it may not be easy to convey new techniques (Hoye 1997: 190 abbreviated example).
In addition of course can be a conjunct encouraging ‘a particular attitude in the addressee as well as expressing the nature of the connection between the units they conjoin’ (Hoye 1997: 154). In (5), of course is used to express a contrast to the content in the preceding utterance.
(5) A: She could be waiting at the hairdresser’s, I suppose …
    B: Of course she could but all the same I don’t think it likely.
Of course signals concession (I grant that, certainly) followed by an argument in the but-clause. According to Hoye (1997: 212) “it would not be implausible to redefine subjuncts expressing modality as ‘modal particles’, subdivided into the following categories: evidential particles (clearly, obviously); hearsay particles (apparently); reinforcement or emphasising particles (certainly, surely, well); and focus particles (only, simply)”. Of course (not mentioned by Hoye) could presumably be regarded as a modal particle similar in meaning to certainly or to obviously. Hoye’s suggestion sparks off interest in the question whether English can be said to have modal particles. However it is not easy to say what of course means and how many meanings it has.
In this article I will discuss of course as a polysemic marker which has developed functions which are characteristic both of discourse markers and modal particles. It will be shown that the functions of of course can be traced to the (presuppositional) properties of of course and the larger sequences of ‘rhetorical relations’ in which of course plays an important role (Lewis 2003). Hopefully the analysis of of course can also sharpen the analysis of what we mean by discourse markers and by modal particles. Another aim is to show how translations provide a method to circumscribe the meanings and functions of multifunctional and polysemous items by looking at the translations of of course into Swedish.
3. Translations as a model to study multifunctionality
Of course has several meanings which are not always easy to distinguish from each other. Paraphrasing goes some way towards describing what of course means in different contexts. Translations are a more indirect method to arrive at the meanings of a lexical item. The method is particularly interesting when lexical elements are multifunctional since the translator has to interpret the meaning of the lexical item in its context. The translator’s analysis can thus be a complement to the linguist’s analysis based on features such as position, collocation and above all the linguistic and non-linguistic context. The translations of of course range from meanings such as certainty (Swe. naturligtvis, givetvis, förstås) to translations such as ju ‘as you know’ (see Table 1). However the translations only provide ‘raw semantic data’ which have to be evaluated and further analyzed. We need for instance to explain why of course has a certain discourse function. Moreover the translations do not tell us if a new
meaning has been conventionalised or is only implicated. The frequency of a particular translation (or meaning) may, however, be a sign that conventionalisation has taken place. Low-frequency meanings, on the other hand, are more likely to be implicatures or side-effects of more salient meanings. The examples of of course discussed in this study come with a translation taken from the English-Swedish Parallel Corpus (Altenberg & Aijmer 2000), a corpus of almost three million words of fiction and non-fiction. Table 1 shows the translations of of course from English original texts (English originals -> Swedish translations) and the Swedish sources of of course (Swedish originals -> English translations).1 The zero-expressions are also important. Omission of of course can be expected if it has a weakened literal meaning and a mainly pragmatic function as a modal particle or a discourse marker.
Table 1: Swedish translations and sources of of course
                                                      Translations   Swedish
Swedish item                                          from English   sources   Total
naturligtvis                                                    91        74     165
givetvis (‘of course’)                                          41        12      53
det är klart (att) (‘it is clear that’)                         18         5      23
visst (‘certainly’, ‘by all means’)                              6        19      25
fast det är klart (‘but it is clear’)                            4         -       4
ju (‘as you know’)                                               4        66      70
ja (jo) det är klart                                             2         1       3
visserligen (‘admittedly’)                                       2         8      10
javisst (ja), jovisst (‘certainly’)                              2         8      10
men … ju (‘but … of course’)                                     1         -       1
naturellement                                                    1         -       1
självfallet (‘of course’)                                        2         6       8
självklart (‘of course’)                                         1         1       2
genast (‘at once’)                                               1         -       1
naturligt (nog)                                                  1         1       2
för den skull (‘because of that’)                                1         -       1
troligtvis (‘probably’)                                          1         -       1
förstås (‘of course’)                                           71        70     141
nog (‘probably’)                                                 -         7       7
väl (‘surely’)                                                   -         4       4
så klart (‘clearly’)                                             -         2       2
då (‘then’)                                                      -         1       1
alltså (‘consequently’)                                          -         1       1
för all del (‘by all means’)                                     -         1       1
förvisso (‘certainly’)                                           -         1       1
of course                                                        -         1       1
minsann (‘indeed’)                                               -         1       1
det förstår sig (‘that can of course be understood’)             -         1       1
nämligen (causal ‘for’)                                          -         1       1
som bekant (‘as is well-known’)                                  -         1       1
Other                                                            -         4       4
Zero                                                             9        11      20
Total                                                          259       308     567
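The arithmetic behind Table 1, and the suggestion that a frequent translation may point to conventionalisation while rare ones are more likely implicatures, can be checked with a few lines of Python. This is only an illustrative sketch: the counts are transcribed from Table 1 (a dash is entered as 0), translations and sources are added together as in note 1, and the names TABLE_1, column_totals and ranked_shares are hypothetical helpers, not part of the English-Swedish Parallel Corpus or of any existing tool.

```python
# Counts transcribed from Table 1 as (translations from English, Swedish sources).
# A dash in the table is entered as 0. All names here are illustrative only.
TABLE_1 = {
    "naturligtvis": (91, 74),
    "givetvis": (41, 12),
    "det är klart (att)": (18, 5),
    "visst": (6, 19),
    "fast det är klart": (4, 0),
    "ju": (4, 66),
    "ja (jo) det är klart": (2, 1),
    "visserligen": (2, 8),
    "javisst (ja), jovisst": (2, 8),
    "men ... ju": (1, 0),
    "naturellement": (1, 0),
    "självfallet": (2, 6),
    "självklart": (1, 1),
    "genast": (1, 0),
    "naturligt (nog)": (1, 1),
    "för den skull": (1, 0),
    "troligtvis": (1, 0),
    "förstås": (71, 70),
    "nog": (0, 7),
    "väl": (0, 4),
    "så klart": (0, 2),
    "då": (0, 1),
    "alltså": (0, 1),
    "för all del": (0, 1),
    "förvisso": (0, 1),
    "of course": (0, 1),
    "minsann": (0, 1),
    "det förstår sig": (0, 1),
    "nämligen": (0, 1),
    "som bekant": (0, 1),
    "Other": (0, 4),
    "Zero": (9, 11),
}


def column_totals(table):
    """Return (translations, sources, grand total) summed over all rows."""
    translations = sum(t for t, _ in table.values())
    sources = sum(s for _, s in table.values())
    return translations, sources, translations + sources


def ranked_shares(table):
    """Return (item, combined count, share of grand total), most frequent first."""
    grand_total = column_totals(table)[2]
    rows = [(item, t + s, (t + s) / grand_total) for item, (t, s) in table.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    print(column_totals(TABLE_1))
    for item, count, share in ranked_shares(TABLE_1)[:5]:
        print(f"{item:<22}{count:>5}{share:>8.1%}")
```

On these figures the column totals come out as 259, 308 and 567, matching the Total row, and ju is the third most frequent correspondence overall (70 of 567 instances), almost entirely as a Swedish source.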
Of course is translated in different ways, reflecting its meanings as a discourse marker or a modal particle.
4. Of course as a discourse marker
When of course is a conjunct with concessive meaning, I have analysed it as a discourse marker. In the well-known definition of discourse markers by Schiffrin (1987: 31), they are ‘sequentially dependent elements which bracket units of talk’. ‘That is, they do not add so much to the propositional content of utterances as flag the sequential structure of discourse by indicating how discourse relates to other discourse’ (Rühlemann 2007: 116). In (6) of course could be analysed as a discourse marker since it functions ‘conjunctively’:
(6)
Breakfast: Breakfast was your most important meal. He hooked up the percolator and the electric skillet to the clock radio on his bedroom windowsill.
Of course he was asking for food poisoning, letting two raw eggs wait all night at room temperature, but once he’d changed menus there was no problem. (AT1) Frukost — frukost var dagens viktigaste måltid. Han kopplade ihop kaffebryggaren och den elektriska kastrullen med klockradion på fönsterbrädet i sovrummet. Att låta ett par okokta ägg ligga och vänta hela natten i rumstemperatur var naturligtvis att medvetet utsätta sig för matförgiftning, men när han ändrade sin matordning blev det inga problem.
It can be safely inferred (‘taken for granted’) that leaving raw eggs at room temperature will lead to food poisoning, but in this case the person changed menus and therefore avoided being poisoned. However, it is not self-evident how of course should be analysed in similar examples. It functions ‘conjunctively’ but it also expresses the speaker’s attitude. In (7) of course introduces an argument as given information which is later dismissed in the but-clause:
(7)
Of course you haven’t been here long, but you’ll have heard of Davina Flory?” (RR1) Visserligen har ni inte varit här så länge, men nog måste ni ha hört talas om Davina Flory?”
Disjuncts express the speaker’s comments on the message and therefore seem to be less clearly discourse markers. However, Thompson and Zhou (2000) have shown that disjuncts can be weakly connective, although in this case it is more difficult to label or explicate the relation to the preceding utterance (Thompson and Zhou 2000: 137). For example, of course can typically combine with but or be replaced by but. In (8) but of course has the function of closing off a topic (Topic A ‘there might be an interaction’) and presenting or shifting to a new one (Topic B ‘but he would of course dominate it’) (Lewis 2003).
(8)
Then there might possibly be an interaction, but all the time, of course, he’d dominate it with his grasp of the thing and if we were able to come up with anything, if he took hold of it, then he’d elaborate it in his own particular way. (CE1T) Då kunde det möjligtvis ske en växelverkan, men det är ju klart att det var hela tiden han som dominerade med sitt grepp på det och om vi då kunde komma med bidrag, om han högg tag i dem, så vidareutvecklade han ju dem på sitt speciella sätt.
Of course often collocates with but, as in the example above, where the speaker introduces an argument in order to dismiss it as irrelevant. In the following example the translator has added men ‘but’ in the Swedish translation, thus signalling that the adversative meaning is implicit in the use of of course. Of course is a discourse marker since the speaker not only implies that something is absolutely certain but also uses of course to achieve better coherence or to repair a potential coherence gap in the discourse (Waltereit and Detges 2007: 65). Of course marks the transition to a new topic which is dismissed or ‘removed from the centre stage’ (Lewis 2003).
(9)
That left him and Blake, the old man thought. In a way he envied Blake, completely assimilated, utterly content, who had invited him and Erita round for New Year’s Eve. Of course, Blake had had a cosmopolitan background, Dutch father, Jewish mother. (FF1) Nu var det han och Blake kvar, tänkte den gamle mannen. På ett sätt avundades han Blake, som var helt assimilerad och fullständigt belåten och hade bjudit hem honom och Erita på nyårsafton. Men Blake hade förstås också en kosmopolitisk bakgrund — holländsk mor och judisk far.
In (10) of course signals a change of direction in the speaker’s thought (‘on the other hand’). Of course introduces a counterargument which is dismissed before the speaker continues:
(10)
…and he thought perhaps she wasn’t joking. A little later she said: “Of course, you can’t blame Harry Harris too much, considering what his wife’s like.” (FW1) Kanske ligger det i alla fall något bakom det hon säger, tänkte han. Lite senare sa hon: “Fast det är klart, man kan inte bara skylla på Harry Harris, eftersom man vet hurdan hans hustru är.”
The reason for using of course is to suggest that the following proposition is an established truth and therefore ‘unimportant’ in the larger argumentative context. Example (11) is another illustration of the argumentative function of of course:
(11)
“You are a fool!” Hilary had banged on a kitchen cupboard as she spoke and the cups and plates inside trembled. “Of course he’s not coming back. The petty cash is empty. (FW1) “Ni är en idiot!” Hilary hade slagit näven i köksskåpen när hon talade, så att kopparna och tallrikarna därinne skakade. “Det är klart att han inte kommer tillbaka. Handkassan är borta, och jag ringde banken.
Of course as a discourse marker can also be used to add a point in an argument, as in (12):
(12)
He was impressed; it was from the General Secretary of the CPSU personally, handwritten in the Soviet leader’s neat, clerkish script and, of course, in Russian. (FF1) Han blev imponerad; det var från kommunistpartiets generalsekreterare personligen, handskrivet med den sovjetiske ledarens prydliga, bokhållaraktiga stil, och givetvis på ryska.
The fact that the letter was in Russian is not simply inferred from what has been said earlier (the letter was from the secretary of the Communist party personally). It provides ‘a new final idea on a particular topic’ and is used in persuasive discourse to clinch a point in an argument (cf. Lewis 2003). The document received was written by the Soviet leader personally in his own handwriting and, most importantly, it was in Russian. As shown by the translations, of course does not only express certainty or emphasis but has also developed discourse-marking functions, e.g. to shift the topic or to add a point to the argument. Because of its close relationship with adversative markers like but and with additive markers (and), I have regarded of course as an emergent discourse
marker with the function of achieving interpersonal and textual coherence while concealing any disagreement between the participants.
5. Of course — a modal particle?
Modal particles fulfil basic communicative functions in language which differ from those of discourse markers. According to Waltereit (2001), modal particles share the function of modifying the preparatory conditions of the speech act ‘at minimal linguistic expense’. For example, the speaker can say both ‘the great tradition in Cádiz is both dance and song’ and ‘the great tradition in Cádiz is of course both dance and song’. The unmarked form without of course is the preferred one since it respects conversational maxims or heuristics associated with the assertion. The insertion of of course forces the hearer to find a motivation for the speaker’s flouting of the non-obviousness maxim (don’t say something which is obvious to the hearer). Waltereit comments on the German example Die Malerei war ja schon immer sein Hobby (painting has always been his hobby): ‘the effect of ja [a particle with the literal meaning of affirmation] seems to be that the preparatory condition on assertions concerning non-obviousness of p is cancelled, i.e., that the assertion counts as a relevant contribution to conversation even if the propositional content is obvious to the addressee’ (Waltereit 2001: 1398). The reference to the justification for the speech act introduced by of course (or by German ja) can also be achieved by explicit means. The German sentence discussed by Waltereit can be paraphrased: You certainly know that painting has always been his hobby. I’m only saying this because I need this fact for my argumentation (Waltereit 2001: 1399). As suggested by the paraphrase, the motive for using ja can be rhetorical or argumentative. What makes of course special in the example given above for illustration is that it not only focuses on what the speaker knows but signals that the hearer knows (or should know) it as well. The meaning of the modal particle can be understood from the presuppositional properties of the adverb.
Because of its evidential or modal meaning (something is self-evident or certain), of course presupposes that something is given or known information. By means of pragmatic accommodation (Lambrecht 1994), new presuppositions can come into existence and ultimately become conventionalized. Pragmatic accommodation is described as follows: ‘if the presupposition evoked by some expression does not correspond to the presuppositional situation in the discourse it is normally automatically supplied by the speech participants’ (Lambrecht 1994: 67). For example, the speaker can exploit the presuppositional structure of of course in order to gently draw the hearer into sharing an opinion or to signal a step in the argumentation (‘let us assume that this is shared knowledge; it follows that…’). In previous work (Wichmann et al., forthcoming) we described the function of of course as heteroglossic since it opens up the dialogic space for alternative voices (White 2003). Speakers engage in interaction and use of course and other modal adverbs rhetorically or strategically to take up a position of
alignment or disalignment with assumptions or beliefs generated by the preceding discourse (White 2003; Wichmann et al., forthcoming). The heteroglossic function explains why we can use of course as a modal particle with the interpersonal function of taking up a stance challenging what is said or the expectations arising from the preceding text:
(13)
But the great tradition in Cádiz is, of course, as mentioned earlier, dance and song. (BTC1T) Men den stora traditionen i Cádiz är ju, som tidigare nämnts, dansen och sången.
The speaker provides justification for the statement by referring explicitly to the fact that it has been mentioned earlier (‘as mentioned earlier’). Ju in the translation can be paraphrased by ‘as you know’. The example illustrates what I mean by the use of of course as a modal particle. The modal particle has the function of commenting on or making adjustments to the interactants’ common ground in order to avoid misunderstandings. Vaskó and Fretheim refer to the context-adjusting function of modal particles added at strategic points in the discourse, for example ‘to check whether the speaker’s and hearer’s contextual assumptions converge or diverge’ (Vaskó and Fretheim 1997: 245). The examples I will discuss as modal particles are those where of course has been translated as ju, i.e. the translator has interpreted of course as having the meaning ‘as you know’, ‘as everyone knows’. We also need to look at the factors which explain the translator’s choice. Example (13), where the meaning of of course is also signalled by ‘as mentioned earlier’, should be compared with (14), where the clause introduced by of course represents a bridging context in the terminology of Evans and Wilkins (2000: 550): ‘In these contexts… speech participants do not detect any problem of different assignments of meaning to the form because both speaker and addressee interpretations of the utterance in context are functionally equivalent, even if the relative contributions of lexical content and pragmatic enrichment differ.’ When of course co-exists with must, the meanings certainty (emphasis) and ‘evidence’ or justification are present simultaneously and cannot be teased apart.
(14)
If global decisions are to have legitimacy, then of course they must be representative. (EISC1T) För att globala beslut skall ha någon legitimitet, måste de ju vara representativa.
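The selection criterion stated above (treating as modal-particle candidates those instances where of course corresponds to ju) can be approximated mechanically on aligned sentence pairs. The following Python sketch is an illustration under stated assumptions, not the procedure used in this study: the pairs are quoted from examples (7), (11), (13) and (14), the names PAIRS and translated_by_ju are hypothetical, and a bare word match on ju is a crude filter that would also pick up combinations such as men … ju or det är ju klart, which Table 1 lists separately.

```python
import re

# Aligned English/Swedish pairs quoted from examples (7), (11), (13) and (14).
# The data structure and function names are illustrative assumptions.
PAIRS = [
    ("(7)",
     "Of course you haven't been here long, but you'll have heard of Davina Flory?",
     "Visserligen har ni inte varit här så länge, men nog måste ni ha hört talas om Davina Flory?"),
    ("(11)",
     "Of course he's not coming back.",
     "Det är klart att han inte kommer tillbaka."),
    ("(13)",
     "But the great tradition in Cádiz is, of course, as mentioned earlier, dance and song.",
     "Men den stora traditionen i Cádiz är ju, som tidigare nämnts, dansen och sången."),
    ("(14)",
     "If global decisions are to have legitimacy, then of course they must be representative.",
     "För att globala beslut skall ha någon legitimitet, måste de ju vara representativa."),
]

# Match "ju" only as a free-standing word, not as part of a longer word.
JU = re.compile(r"\bju\b", re.IGNORECASE)


def translated_by_ju(pairs):
    """Return the labels of pairs whose English side has 'of course'
    and whose Swedish side contains the particle 'ju'."""
    return [
        label
        for label, english, swedish in pairs
        if "of course" in english.lower() and JU.search(swedish)
    ]


if __name__ == "__main__":
    print(translated_by_ju(PAIRS))
```

On these four pairs the filter keeps (13) and (14) and discards (7) and (11), where the Swedish side has visserligen and det är klart att instead. Any hits would still need the contextual analysis described in the text.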
In (15) of course must be interpreted as the modal particle although the knowledge status of the participants is not referred to explicitly.
(15)
Pasqual Pinon’s two heads are shown on a series of photographs from the 1920s and 1930s; the last was taken only a few days before his death. He had of course acquired a certain international fame by then, and had been the subject of a biography, which was published after his death: this was written by the impresario John Shideler, and called A Monster’s Life. There are pictures enough. They all express sadness and dignity; as if the two heads always looked into the camera conscious that they would never be understood, that those seeing the pictures would never understand. (PE1T) Pasqual Pinons två huvuden finns återgivna på en rad fotografier från 20- och 30-talen; det sista är taget bara några dagar före hans död. Han hade ju då uppnått en viss internationell ryktbarhet, och blev föremål för en biografi publicerad efter hans död: det var impressarion John Shideler som skrivit denna biografi, “A Monster’s Life”. Bilder finns det gott om. De ger alla uttryck för sorg och värdighet; som om de två huvudena alltid såg in i kameran medvetna om att de som såg bilderna aldrig skulle förstå.
On the hierarchical discourse level the clause containing of course is backgrounded. The topic (Pinon’s two heads are shown in a series of photographs) is resumed after the addition of backgrounded information (Pinon had achieved international fame by then) needed to guarantee that misunderstanding will not occur. Of course is inserted with the prophylactic purpose of showing the relationship between the proposition introduced by of course and the preceding sentence, but also of showing that the information is backgrounded in relation to the main topic. Example (16) is another illustration of the need to analyse the function of of course in relation to the organization of the text. The speaker claims that the three Spanish cities Sevilla, Cádiz and Jerez are in fact the cradle of flamenco. This is qualified by the statement that Jerez is best known as the city of wine. Of course introduces the information which is needed in order to justify this claim (the sweet grape is grown, several famous wine cellars are located there). Lewis refers to this as a metatextual backgrounding function: ‘the thread of the narrative is broken… to inform the hearer of a circumstance that will make the narrative more coherent’ (Lewis 2003: 87).
(16)
Sevilla, Cádiz and Jerez all claim to be the cradle of flamenco, and all three cities are, in fact, important names in the history of flamenco. Jerez, of course, is best known as the city of wine. Between the mouths of the Guadalquivir and Guadalete rivers the sweet grape is grown from which the sherry wine with all its variants comes. The great Bodegas (winecellars) with famous names like Domecq, González Byass, Sandeman and several others are located there. In the world of flamenco two types of
flamenco song can be identified: cante flamenco andaluz and cante flamenco gitano, Andalusian flamenco song which is sung by a payo (non-gypsy) and gypsy-flamenco song which is sung by a calé (gypsy). (BTC1T) Sevilla såväl som Cádiz och Jerez gör anspråk på att vara den ort där flamencons vagga stod. Säkert är dock att alla de tre städerna är viktiga namn i flamencons historia. Jerez är ju framförallt känd som vinets stad. Mellan Guadalquivirs och Guadaletes mynningar odlas den ljuva druvan som ger sherryvinet i alla dess varianter. Där finns de stora bodegorna med sina kända namn som Domecq, González Byass, Sandeman och flera till. I flamencovärlden skiljer man på två typer av flamencosång: cante flamenco andaluz och cante flamenco gitano. Andalusisk flamencosång som sjungs av en payo — icke-zigenare — och zigenarflamencosång som sjungs av en calé — zigenare.
The importance of taking into account the larger rhetorical context for the interpretation of of course becomes especially clear when we look at examples where there are no linguistic cues to the interpretation. In the following example of course does not have a discourse marking or topic-shifting function but introduces additional information supporting the main topic as is shown by looking at the text:
(17)
She was so intense it seemed my quiet mother, her hair groomed and elegant legs neatly crossed as if her husband were there to approve of the standard — the self-respect — she kept up, was the one to supply support and encouragement. Of course I know her. That broad pink expanse of face they have, where the features don’t appear surely drawn as ours are, our dark lips, our abundant, glossy dark lashes and eyebrows, the shadows that give depth to the contours of our nostrils. … And even if I hadn’t known her, I could have put her together like those composite drawings of wanted criminals you see in the papers, an identikit. The schoolboy’s wet dream. My father’s woman. But I had no voluptuous fantasy that night. I woke up in the dark. (NG1) Hon var så intensiv att det kom att verka som om min stillsamma mor — med sitt uppsatta hår och de eleganta benen prydligt korsade som om
hennes make fanns där och kunde uppskatta att hon fortfarande höll på stilen, självaktningen — var den som gav stöd och uppmuntran. Visst känner jag henne. Ett sådant där skärt ansikte som liksom breder ut sig och där dragen inte verkar klart avgränsade som våra är, våra mörka läppar, våra täta blanka mörka ögonfransar och ögonbryn, skuggorna som ger djup åt våra näsvingars konturer. … Även om jag inte hade känt henne kunde jag ha plockat ihop henne som de där “spökbilderna” av efterlysta brottslingar som man ser i tidningarna, en nyckel till en identifiering. Skolpojkens sexdröm. Min fars kvinna. Men jag hade inte några vällustiga drömmar den natten. Jag vaknade i mörkret.
Of course signals a break in the topic (the woman was unlike the speaker’s mother), also marked by a change from past to present tense, before the topic is resumed (I woke up in the dark). In the clause introduced by of course the speaker stops to think about a certain type of woman: ‘That broad pink expanse of face they have’, unlike other women the speaker knows. She is the picture of a woman the speaker could have imagined, a schoolboy’s dream. The example is unusual because of course has initial position, a position which is more typical of the discourse marker function.2 Notice that of course is not usually found in initial position when it is backgrounded as in the example above. The Longman Dictionary (LDOCE) observes: ‘Instead of saying: We play a lot of tennis and polo. Of course we have our own swimming pool, you would say: We also have our own swimming pool, of course or …and of course we have our own swimming pool.’ As appears from the dictionary example, of course can also have end position, as in the following corpus example. The function of the modal particle in this example is to refer to something both the speaker and hearer know, in order to establish rapport (Holmes 1988):
(18)
But sometimes I can’t get my breath, I have difficulty in breathing. I’m not as young as I was, of course, and you’ve got to have some ailment. (SC1T) Men ibland har jag svårt för att få luft, svårt att andas. Jag är ju inte så ung längre och nån krämpa ska man ju ha.
As shown by the following example, of course as a modal particle expresses weak connection with both the preceding context and what follows. However, unlike the discourse marker, of course as a modal particle does not change the topic but refers to some circumstance (fact, information) which is needed to establish interpersonal coherence (speaker and hearer share assumptions, beliefs
and knowledge). A closer analysis of the contexts where of course is used shows that it is typically used to introduce a topic which is subordinate to another topic (e.g. the explanation for a claim, background information needed to facilitate the progression of a narrative).
(19)
“Yes,” said Asplund, “that’s the whole idea.” I had a few meetings with Lewerentz when we were drawing the Bredenberg department store. Lewerentz of course took over his father’s factory in Eskilstuna and used a metal window that wasn’t so common in those days, with interlinked arches and double glazing. It was absolutely new, because we had invited tenders from German companies for that sort of design. Then Lewerentz came along and said that he could do it cheaper, but he couldn’t meet the delivery deadlines. (CE1T) “Jo,” sa Asplund, “det är ju det som är meningen.” Jag hade en del sammanträffanden med Lewerentz när vi ritade Bredenbergs varuhus. Lewerentz övertog ju sin fars fabrik i Eskilstuna och körde med ett metallfönster som inte var så vanligt på den tiden med kopplade bågar och dubbla glas. Det var alldeles nytt, för vi hade tagit in anbud från tyska firmor på en sån konstruktion. Då kom Lewerentz och menade att han kunde göra det där billigare, men han kunde inte klara leveranstiderna.
Of course relates a proposition to the preceding utterance which contains the new information (I had a few meetings with Lewerentz). By introducing a reference to shared evidence for the information in the first utterance the speaker makes sure that misunderstandings are avoided. In (20) the new information is that the shipping company was obliged to lay the vessels up. Of course signals that the sentence to which it is attached fits into the context as backgrounded information. The backgrounded utterance marked by of course is followed by a resumption of the topic or narrative.
(20)
Export volumes to Belgium and France were small and the Gällivare company was periodically compelled to lay the vessels up. Narvik did not come into use before the beginning of 1903, of course. (TR1T)
Exportkvantiteterna till Belgien och Frankrike var små och Gällivarebolaget fick periodvis lägga upp fartygen. Narvik kom ju ej i bruk förrän 1903.
The precise interpretation of of course depends on the context. Example (21) is different from other examples discussed because of course modifies a question. The hearer’s wife is from America and the speaker refers to this circumstance as the justification for asking the question.
(21)
And now the Boss stands there, several years closer to Modern Times, and wants to placate, shouts down the stairs. “There was one thing, Aron. I’ve purchased a gramophone and wonder if you know... your wife was from America, of course. (GT1T) Och nu står Patron där, några år närmare det Moderna och vill blidka; ropar neråt trappan. — Det var en sak till, Aron. Jag har inköpt en grammofon och undrar, känner du till... din hustru hon var ju från Amerika.
In a question, of course comes to mean ‘request for confirmation’ rather than reference to evidence or justification. There are some syntactic contexts where of course cannot be interpreted as a discourse marker and which are therefore indications of the conventionalisation of of course as a modal particle. For example, when of course is found in a nonrestrictive relative clause, the information is already backgrounded or ‘parenthetical’. By sneaking in of course the speaker makes it even more difficult for the hearer to avoid the implication of shared knowledge associated with of course. The assumed knowledge (‘as you know’) may be specific to a social network of which both the speaker and the hearer are part, as may be the case in the following example (cf. Holmes 1988, ‘the confidential of course’).
(22)
“But one must not forget the long winters which, of course, for seventy to eighty percent consist of complete darkness”. (GT1T) Men man får då inte glömma de långa vintrarna som ju till sjutti, åttio procent består av rent mörker.
Similarly, in (23) the information in the because-clause is presupposed and therefore backgrounded. Of course denies that the information is new but implies that it is uncontroversial because it is shared by the community:
(23)
‘How do you do, Franklin,’ said Auntie, shaking the boy’s hand (she found herself wondering just whom it had originally belonged to, because of course it was, as you might say, second-hand). (ARP1T)
— Goddag Franklin, sa fastern och tog gossens hand (och hon kom på sig med att undra vem den hade tillhört i original, den var ju numera så att säga second hand).
In the following example of course introduces shared information as an afterthought after a break (and therefore backgrounded):
(24)
“We became friends because we shared some artistic enthusiasms — music, and manuscripts, and calligraphy, and that sort of thing — and of course he made me one of his executors. (RDA1) “Vi blev vänner för att vi hade en del konstnärliga intressen gemensamt — musik och manuskript och kalligrafi och sådana saker, och han gjorde ju mig till en av sina testamentsexekutorer.
In this article I have only discussed the meaning of of course. Other modal particles have a more transparent meaning, for example I think, which can be explained as cancelling or flouting the preparatory condition that the speaker is sincere (Bill is fat I think) (Aijmer 1997). Like other modal particles, it is used to check that the speaker and hearer are on the same wavelength by referring to the background context for the assertion.
6. Conclusion
English has modal particles ‘which look like adverbs’ but can be distinguished from adverbs on the basis of function as well as of the patterns in which they occur. Hoye (1997) referred to the adverbs as modal particles when they were subjuncts, i.e. subordinate in the clause when compared with disjuncts. However, the meaning of the modal particle is not simply a weakening of the modal meaning of the adverb, as suggested by Hoye, but reflects the fact that the adverb has been ‘pragmaticalized’ and has a number of new functions. Moreover, Hoye did not discuss the difference between modal particles and discourse markers, which is important when we analyse the functions of of course. In this light, the translation data from Swedish is particularly interesting because it gives evidence for a functional split between different uses of of course. I have suggested that these differences could be characterised in terms of the difference between discourse markers and modal particles. Modal particles in English are above all a functional category, although they can have certain formal characteristics. Since they are backgrounding, they are for instance normally placed in medial or final position. It has been argued that we can understand their functions by referring to the conditions for the speech act. The speaker can, for instance, say either John of course took over his father’s factory or John took over his father’s factory. With the first alternative, of course has the procedural or signalling meaning to comment on how the
information fits with the background context (the preparatory conditions of the speech act). Thus, for example, of course is incompatible with the ‘non-obviousness’ condition of the assertion and is motivated by the need to avoid misunderstandings caused by divergent opinions. Of course can have a number of different functions which are explained by its presuppositional properties (something is given or true). It can be used dialogically to take up a stance to what the hearer knows (as you know, as you should know) or what is common knowledge in order to agree or disagree. Because it comments on common ground, it can have functions such as solidarity or rapport if used by members of a social group. Other functions are argumentative or manipulative. Of course as a modal particle can also appear in contexts where it has a backgrounding function. By this, I mean that it is used for ‘subordinate’ functions such as elaborating or explaining what is said or remedying a break in the narrative thread. As a discourse marker, on the other hand, of course guides the hearer through the discourse, signalling a topic shift, new turns, or the introduction of new points in the argumentation. Of course was frequently found after but, which marks a deviation from the main topic to a new thought or argument. Moreover, it is foregrounding, i.e. it introduces new information into the discourse. Table 2 summarizes the meanings of the modal particle and the discourse marker:
Table 2: The discourse marking and modal particle functions of of course
Discourse marker                              Modal particle
Foregrounding (new information)               Backgrounding (old information)
Dismissive                                    Context-adjusting
Concessive                                    Argumentative/manipulative
Topic-shifting                                Solidarity (positive politeness)
Marking steps in the discourse or points
in the argumentation
The polysemy of of course in present-day English and the relationship between the different meanings are motivated by the diachronic changes.
The co-existence of variants representing different stages of the language is known as layering (Hopper 1991). Layering and the dynamic view of language it presupposes is also evidenced by bridging contexts where the functional distinctions between different uses of of course seem to be neutralized. In a bridging context of course can for instance both mean certainty and ‘as you know’. Modality is a broad notion as illustrated by this study of of course and should be redefined to take into account its interactional uses. Of course does not only refer to certainty but can be realized in many different ways. For example, of course has interpersonal modal meanings (to refer to what is clearly familiar or true) associated with pragmatic accommodation as well as with meanings
oriented towards ‘interpersonal or dialogical coherence’ (the speaker imposes him- or herself in the discourse in order to shift the topic or to reject it before continuing).
Notes
1 There may be an imbalance between the same item as translation and as the source of a translation due to the translation process itself. I have therefore referred to the frequencies of translations and sources together.
2 Visst (certainly) in the translation emphasises the dialogical aspect of of course although the speaker in this case responds to his own thoughts rather than to the hearer.
References

Aijmer, K. (1997), ‘I think – an English modal particle’, in: T. Swan & O.J. Westvik (eds.) Modality in Germanic languages. Historical and comparative perspectives. Berlin: Mouton de Gruyter. 1-47.
Aijmer, K. and A.-M. Simon-Vandenbergen (2007), The semantic field of modal certainty. A corpus-based study of English adverbs. Berlin and New York: Mouton de Gruyter.
Aikhenvald, A.Y. (2004), Evidentiality. Oxford: OUP.
Altenberg, B. and K. Aijmer (2000), ‘The English-Swedish Parallel Corpus: A resource for contrastive research and translation studies’, in: C. Mair and M. Hundt (eds.) Corpus linguistics and linguistic theory. Papers from the 20th International Conference on English Language Research on Computerized Corpora (ICAME 20) Freiburg im Breisgau 1999. Amsterdam & Philadelphia: Rodopi. 15-33.
Chafe, W. and J. Nichols (eds.) (1986), Evidentiality: The linguistic coding of epistemology. Norwood, N.J.: Ablex.
Curme, G.O. 1905 (1960), A grammar of the German Language. London: Macmillan; (1960 rev. edn) New York: Frederick Unger.
Diewald, G. (2006), ‘Discourse particles and modal particles as grammatical elements’, in: K. Fischer (ed.) Approaches to discourse particles. Amsterdam: Elsevier. 403-425.
Evans, N. and D. Wilkins (2000), ‘In the mind’s ear: The semantic extensions of perception verbs in Australian languages’. Language 76 (3): 546-592.
Fraser, B. (1996), ‘Pragmatic markers’. Pragmatics 6 (2): 167-190.
Holmes, J. (1988), ‘Of course, A pragmatic particle in New Zealand women’s and men’s speech’. Australian Journal of Linguistics 2: 49-74.
Hoye, L. (1997), Adverbs and modality in English. London and New York: Longman.
Lambrecht, K. (1994), Information structure and sentence form. Topic, focus, and the mental representations of discourse referents. Cambridge: CUP.
Lewis, D.M. (2003), ‘Rhetorical motivations for the emergence of discourse particles, with special reference to English of course’, in: T. van der Wouden, A. Foolen and P. Van de Craen (eds.) Particles, Special issue of Belgian Journal of Linguistics 16: 79-91.
The Longman dictionary of contemporary English (1995) [1978] [LDOCE]
Palmer, F.R. (1986), Mood and modality. Cambridge: CUP.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A comprehensive grammar of the English language. London: Longman.
Rühlemann, C. (2007), Conversation in context. A corpus-driven approach. London and New York: Continuum.
Schiffrin, D. (1987), Discourse markers. Cambridge: CUP.
Searle, J.R. (1969), Speech acts. An essay in the philosophy of language. Cambridge: CUP.
Simon-Vandenbergen, A.-M. and K. Aijmer (2002/2003), ‘The expectation marker of course’. Languages in Contrast 4 (1): 13-43.
Thompson, G. and J. Zhou (2000), ‘Evaluation and organization in text: The structuring role of evaluative disjuncts’, in: S. Hunston and G. Thompson (eds.) Evaluation in text. Authorial stance and the construction of discourse. Oxford: OUP. 121-141.
Traugott, E. Closs (2007), ‘Discussion article: Discourse markers, modal particles, and contrastive analysis, synchronic and diachronic’, in: M. Josep Cuenca (ed.) Catalan Journal of Linguistics (6). 139-157. Special issue: Contrastive perspectives on Discourse Markers.
Vaskó, I. and T. Fretheim (1997), ‘Some central pragmatic functions of the Norwegian particles altså and nemlig’, in: T. Swan & O.J. Westvik (eds.) Modality in Germanic languages. Historical and comparative perspectives. Berlin: Mouton de Gruyter. 233-292.
Waltereit, R. (2001), ‘Modal particles and their functional equivalents: A speech-act theoretic approach’. Journal of Pragmatics 33: 1391-1417.
Waltereit, R. (2006), Abtönung. Zur Pragmatik und historischen Semantik von Modalpartikeln und ihren funktionalen Äquivalenten in romanischen Sprachen. Tübingen: Max Niemeyer Verlag.
Waltereit, R. and U. Detges (2007), ‘Different functions, different histories. Modal particles and discourse markers from a diachronic point of view’, in: M. Josep Cuenca (ed.) Catalan Journal of Linguistics (6). 61-80. Special issue: Contrastive perspectives on Discourse Markers.
Weydt, H. (1969), Abtönungspartikel: die deutschen Modalwörter und ihre französischen Entsprechungen. Bad Homburg: Gehlen.
White, P. (2003), ‘Beyond modality and hedging: a dialogic view of the language of intersubjective stance’. Text 23(2): 259-84.
Wichmann, A., A.-M. Simon-Vandenbergen and K. Aijmer (forthcoming), ‘How prosody reflects semantic change: a synchronic case study of of course’, in: K. Davidse and H. Cuyckens (eds.) Subjectification, intersubjectification and grammaticalization. Berlin and New York: Mouton de Gruyter.
A reassessment of the syntactic classification of pragmatic expressions: the positions of you know and I think with special attention to you know as a marker of metalinguistic awareness

Julie Van Bogaert
Ghent University, Belgium; Research Foundation – Flanders (FWO-Vlaanderen)

Abstract

This paper wishes to point out some limitations to the way in which the syntactic classification of pragmatic expressions has traditionally been handled. As an alternative, it proposes a syntactic classificatory system that pivots on the notion of scope. This alternative approach is applied to corpus data of you know and I think. An attempt is made to establish connections between the pragmatic expressions’ syntactic behaviour on the one hand and their functional properties on the other hand and to compare the findings for both expressions. In so doing, special attention is devoted to local you know, a specific syntactic use of this pragmatic expression, which is found to correlate with a particular function, viz. that of marking metalinguistic awareness. The findings of this study may have implications for the way you know is commonly viewed, especially by laypeople, but also in scholarly settings.1
1. Introduction
You know and I think are two of the most frequently used pragmatic expressions that take the form of a verb of cognition with a first or second person singular subject. They are at the top of the list of most frequently used expressions of this type in the London Lund Corpus (Stenström 1995: 293) and this is no different in the data that I have used for this article (cf. section 2). In Scheibman’s study of stance in American English conversation, I think was preceded, in a frequency count of verbs of cognition in the first person singular simple present, only by I (don’t) know, and you know towered over all other cognitive verbs in the corresponding second person singular category (Scheibman 2002: 64-67, 74-76). You know and I think have been studied from various perspectives and have been referred to by a plethora of terms such as discourse marker (Schiffrin 1987), modal particle (Aijmer 1997), discourse particle (Aijmer 2002) or comment clause (Jespersen 1937; Peltola 1983; Stenström 1995; Quirk et al. 1985), to name but a few. Aijmer (this volume) suggests that ‘discourse marker’ and ‘modal particle’ are not just two alternative labels for one and the same concept by pointing out a functional split between of course as a discourse marker and of course as a modal particle. Adopting a sociolinguistic point of view, Bernstein (1971: 98, 109-14) found that “egocentric” I think sequences are typical of middle class speakers while “sociocentric” sequences like you know are more frequently used among working class speakers. Huspek (1989) further developed
this insight and took the factor of group identity into consideration when looking at the use of you know and I think in the language of his working class American informants. The first significant monograph dealing specifically with you know is Östman’s (1981). Subsequent studies that try to come to grips with the pragmatics of this expression, often looking for its core meaning, include Schourup (1985), Erman (1987), Stenström (1995), He and Lindsey (1998), Jucker and Smith (1998) and Fox Tree and Schrock (2002). Some authors approached you know from a variationist-sociolinguistic point of view (Holmes 1986; 1990; Stubbe and Holmes 1995; Erman 2001) and others still investigated speakers’ perceptions of it (Watts 1989; Fox Tree 2007). Virtually all of the aforementioned studies, by their very interest in the ‘meaning’ and functions of you know, gainsay laypeople’s opinions on this “exasperating expression” (Stubbe and Holmes 1995) or “verbal garbage” (Schourup 1985: 94) that has been a popular target for prescriptivists (Schourup 1985; Fox Tree 2007). As regards I think, Urmson, as early as 1952, tackled the issue of first person singular verbs in the present tense combining with a that-clause that have the capacity to occur in positions other than the beginning of a sentence and dubbed them “parenthetical verbs”. This syntactic mobility aroused the interest of transformational grammarians, who explained the phenomenon as a case of S(entence)-lifting (e.g. Ross 1973). Hooper (1975) investigated different types of predicates combining with that-clauses and called I think a weak assertive predicate. In more recent years, a number of studies appeared that apply grammaticalization theory to I think and pragmatic expressions of the same type (e.g. Thompson and Mulac 1991a; 1991b; Palander-Collin 1999; Tagliamonte and Smith 2005; Van Bogaert 2006). The sociolinguistic and stylistic conditions for the use of I think have also been explored. 
Simon-Vandenbergen looked into the expression’s relation with social class (2002) and its use in political discourse (1998; 2000) while Andersen concentrated on its use in teenager talk (2001). The equivalents of I think and other expressions with cognitive verbs have been studied in languages other than English. With reference to the Romance language family, Blanche-Benveniste and Willems (2007), Schneider (2007) and Dendale and Van Bogaert (2007) should be mentioned and Nuyts (1994) compared Dutch mental state predicates to other grammatical means of realizing epistemic modality. Some studies devote attention to the question of the syntactic positions that you know, I think and related expressions can occupy in a sentence (Erman 1987; Thompson and Mulac 1991a; 1991b; Stenström 1995; Aijmer 1997; Simon-Vandenbergen 1998; 2000; 2002; Van Bogaert 2006). This issue will be of central interest in the present article. The aim of this paper is to raise a few points of criticism of the approach that is traditionally adopted to the syntactic classification of you know and I think and to propose an alternative way of looking at the syntactic behaviour of these expressions that is based on the notion of scope. It will be demonstrated that this alternative system can be helpful in explaining certain correlations between syntactic position and function.
After a brief presentation of the data, the canonical approach to the syntactic classification of pragmatic expressions is critically evaluated and an alternative classificatory system is put forward. This alternative is then applied to the data of you know and I think and functional explanations are offered for the syntactic findings. In this discussion, special attention is devoted to a particularly interesting type of you know that will be referred to as local you know.
2. Data
This study makes use of corpus data from the ICE-GB (International Corpus of English – the British Component), which comprises 1,061,264 words. However, only the spoken part of the corpus was taken into consideration as you know and I think are typically used in spoken language (Schourup 1985; Biber et al. 1999: 668-69; Aijmer 2002). The spoken section of the ICE-GB contains 637,562 words and yields 1,081 occurrences of you know and 1,734 of I think. In the I think data, allowance has been made for negative transportation, which means that examples with I don’t think have also been incorporated. Unclear and unfinished utterances have been left out of consideration.
3. The canonical tripartite system for the syntactic classification of pragmatic expressions
Traditionally, pragmatic expressions like you know and I think have been syntactically classified following the well-known tripartite system distinguishing between initial, medial and final position, similar to the syntactic description of adverbs. The application of this by now canonical system to syntactically mobile first person singular cognitive verbs dates back to Urmson’s work on what he called “parenthetical verbs” (Urmson 1952):
I suppose that your house is very old.
Your house is, I suppose, very old.
Your house is very old I suppose. (Urmson 1952: 221)
Quirk et al. proposed a more elaborate system for the syntactic classification of adverbials. They refined the tripartition by splitting up the categories ‘medial’ and ‘final’ into ‘initial medial’, ‘medial medial’, ‘end medial’, ‘end’ and ‘initial end’ (Quirk et al. 1985: 491-500). Alternatively, some scholars have described pragmatic expressions in terms of their position in the turn (Erman 1987) or in the intonation unit (He and Lindsey 1998; Kärkkäinen 2003). However valuable these approaches may be, they do not take syntax as a starting point. In this paper a conscious decision is made to regard the position of you know and I think in relation to syntactic rather than discursive or intonational units.
The main fault that can be found with the canonical tripartite system is that it does not do justice to the pragmatic expressions’ functional specificities. Labelling a token as initial, medial or final reveals rather little about the pragmatic or interpersonal functions that it may have. In other words, the canonical tripartition does not constitute a sufficiently refined tool for explaining why a particular pragmatic expression occurs where it does. In this respect, I feel that this classificatory system misses two important points. Firstly, it does not take account of the syntactic level at which a pragmatic expression occurs. As such, it would group together the following two occurrences of I think in the category ‘medial’.
(1) Cameing Coming out I think was definite
(2) Father McDade d’you you remember in I think lecture three uh Rabbi Sacks said at one point faith is not measured by acts of worship alone
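The conflation described above can be made concrete with a small sketch. It is only an illustration, not part of the study's method: the function name tripartite_position and the decision to treat the whole transcribed utterance as the unit of analysis are assumptions made for this example. The labeller looks at nothing but the linear position of the expression, which is precisely what the canonical system does.

```python
def tripartite_position(tokens, expression):
    """Label an expression as initial, medial or final by linear position only.

    Hypothetical helper: it deliberately ignores syntactic level and scope,
    mirroring the canonical tripartite system criticised in the text.
    """
    n = len(expression)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == expression:
            if i == 0:
                return "initial"
            if i + n == len(tokens):
                return "final"
            return "medial"
    raise ValueError("expression not found in utterance")


# Corpus examples (1) and (2), tokenised on whitespace.
ex1 = "Cameing Coming out I think was definite".split()
ex2 = ("Father McDade d'you you remember in I think lecture three uh Rabbi "
       "Sacks said at one point faith is not measured by acts of worship "
       "alone").split()

labels = [tripartite_position(utt, ["I", "think"]) for utt in (ex1, ex2)]
print(labels)  # both tokens receive the same label, 'medial'
```

Both tokens come out as medial, although the first stands between clause constituents and the second sits inside a phrase; a position-only label cannot register that difference.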
The canonical approach pays no heed to the fact that in (1) I think is placed in between the constituents of a clause whereas in (2) it is used at a lower syntactic level; here I think has been inserted within a phrase functioning as a clause constituent. This syntactic difference is functionally relevant as will become clear in this article. The second drawback to the canonical tripartite system is its failure to take into account the scope2 of the pragmatic expression. It wrongly assumes that a pragmatic expression invariably has scope over the entire clause (complex) in which it occurs, failing to notice that its scope is sometimes limited to one particular phrase or word functioning as a clause constituent or as a phrase constituent.
(3) You know you have a beech wood that might be all beeches and it might be on limestone and it might be on chalk or it could be on flint or gravel soil soil
(4) It’s very uhm <,,> you know solid
As regards the syntactic characterization of corpus example number (3), I agree with the canonical tripartition by classifying it as a case of initial you know with scope over the whole clause complex that it introduces. Number (4), on the other hand, is of a different nature in that in this case, you know is used with very narrow scope; it is restricted to the word solid, the head of the adjective phrase very solid. Therefore, rather than assigning this example to the category of medial position, I consider it as a case of initial you know used with local scope over an adjective functioning as the head of an adjective phrase. The pragmatic expression assumes a position that is initial relative to the element over which its scope applies. Thus, it will be observed that the term ‘initial’, in my framework,
is not restricted to clause-initial position, as is common in the literature, but is to be understood as ‘in front of the scoped element’. Bearing the above criticism in mind, I have attempted to develop a refined syntactic classificatory approach to pragmatic expressions in which syntax is not considered in isolation but rather in relation to interpersonal, discursive and pragmatic properties. Assigning corpus data to syntactic categories must not be an end in and of itself; rather it must be a means to gain more insight into the pragmatic expressions’ functional properties. The syntax of you know and I think should be described in such a way that also discloses relevant information about what these expressions do in discourse. The syntactic classification of a pragmatic expression should not only tell us where the expression is used but also why it is used. In the next section, I will propose an approach to the syntactic description of pragmatic expressions like you know and I think that allows one to account for the interaction between formal, syntactic characteristics and functional ones. Its relevance to functional interpretations of pragmatic expressions will become apparent in section 5.
4. An alternative approach
The essence of the alternative approach to the syntactic description of pragmatic expressions is to specify the form and the function of the scoped element. The basic distinction to start from is that between clausal scope and local scope. Context plays a crucial role in identifying the extent of a pragmatic expression’s scope.
4.1 Clausal scope (A)
4.1.1 Clause functioning as speech act (A1)
Clausal scope implies that a pragmatic expression has scope over a clause or over a proform (so or not, or zero proform as with you know) substituting a clause. The function performed by this clause or proform is mostly that of a speech act. A pragmatic expression with clausal scope may be placed in initial, medial or final position in the clause (complex). Medial position means that the expression is placed in between clause constituents. (5) is an example of I think in medial position having scope over a clause functioning as a speech act. Figure 1 is a schematic representation of this use of I think.
(5) And he also I think wants time and space to himself to sort himself out
Figure 1: Clause functioning as speech act (A1)
It should be noted that it is possible for a pragmatic expression to be inserted within a phrase functioning as a clause constituent whilst holding the whole clause in its scope rather than a constituent of this phrase, which is, as we will see in 4.2.1, most commonly the case when a pragmatic expression is used at this syntactic level. This syntactic type will be referred to as a special kind of medial position, viz. intrusive medial use. It is exemplified in (6) and visually illustrated in figure 2. In this example, I think is used within a verb phrase but it would be untenable to claim that this is done because the pragmatic expression specifically qualifies have or been. The intrusive medial position is not restricted to clauses functioning as speech acts (A1’), but it can also be found in the other clausal scope categories, which will be presented below.
(6) Well the Arabs have I think been a little bit slow with the sole exception of Syria of President Assad of Syria
Figure 2: Clause functioning as speech act: intrusive medial position (A1’)
4.1.2 Clause functioning as clause constituent (A2)
It may be that a clause that is scoped by a pragmatic expression does not perform a speech act but rather that it functions as a constituent, either nominal (A2a) or adverbial (A2b), of another clause. The nominal clauses may be that-clauses, wh-clauses or nominal relative clauses and in theory also exclamative clauses, but none of those were found in the corpus. The adverbial category includes adverbial clauses and sentential relative clauses.3 (7) and (8) exemplify the use of a pragmatic expression in a nominal clause and in an adverbial clause respectively.
(7) It’s much more to do with sociocultural factors <,> and next time I will explain why I think that these sociocultural factors are important and how they’re actually operating
(8) And so Roger we’re doing a we’re doing a we’re putting a a D on the front of each of these notes because I think it needs it really cos it’s a sort of
Figure 3: Clause functioning as clause constituent (A2a/b)
4.1.3 Clause functioning as phrase constituent (A3)
A third and final category within the clausal scope class is that in which the clause in question shifts to an even lower rank, viz. that of a phrase constituent. Most of the time the clause in question is a relative clause, as in (9):
(9) That is the fundamental premise of a new police force which I think we need in this country and should move towards <,>
Figure 4: Clause functioning as phrase constituent (A3)
The use of I think in relative clauses, as in (9), receives attention in Simon-Vandenbergen (2000: 49), who regards occurrences of I think immediately following the relative pronoun as being used in medial position:
...the fact that in this position [immediately after a relative pronoun] I think cannot be followed by that (in contrast with initial I think) provides an argument for classifying such instances as medial rather than initial... (Simon-Vandenbergen 2000: 49)
Nevertheless, I would like to argue that this particular use of I think needs to be viewed as initial because firstly, the pragmatic expression cannot occur any earlier in the relative clause and secondly, the subordinator that can, though very marginally, sometimes be expressed. No such examples were found in the ICE-GB, but the BNC yielded a few as did the World Wide Web.4 A selection of them is listed as examples (10) to (14). The most exceptional examples are (13) and (14), in that most grammars disallow the realization of that when the function of the relative pronoun is that of subject of the relative clause (Quirk et al. 1985: 1050; Huddleston and Pullum 2002: 953).
(10) This is a point that I have made often in the House and on which I think that I have the support of the Adam Smith Institute which I hope will also be supported by many Conservative Members. (BNC HHW)
(11) On this matter, there are two central issues on which I think that those responsible must be held to account (www.publications.parliament.uk/pa/cm200304/cmhansrd/vo040720/debtext/40720-37.htm)
(12) So I wanted to draw on that kind of thing which I think that’s a very important part of Scottish culture which still exists. (www.nationaltheatrescotland.com/content/default.asp?page=s3_1_1&id=1801)
(13) Another track which I think that might be the next Wu-Banger is “R.E.C. Room”, True Master produced this track and I think that this track has a whole lot of potential. (mysite.wanadoo-members.co.uk/rnnr/uncontrolled.html)
(14) Yeah, I think there isn’t a lot to do, but I’m not stopped, I’ve e-mailed someone who I think that can help us getting the SV16 or a way to back up the firmware from the fone. (www.3g.co.uk/3GForum/archive/index.php/t-18493.html)
These marginal constructions in which the relative pronoun is a push-down element (Quirk et al. 1985: 1298) raise a number of questions. As (10) and (11) are not the only attestations in political discourse one may wonder whether the realization of that is a case of hypercorrection or hyperformality. A related question would be whether this atypical use of the subordinator is facilitated by the deliberative meaning of I think. Aijmer (1997: 21ff) distinguishes between on the one hand the tentative use of I think, which softens illocutionary force and expresses uncertainty, and on the other hand its deliberative use, heightening illocutionary force and expressing certainty and commitment. Aijmer’s criteria for differentiating between the two uses are prosodic and syntactic. As regards syntax, deliberative I think is used in initial position and followed by the subordinator that. I think used with zero that is considered tentative as are occurrences in medial or final position. The deliberative function of I think is frequently attested in political language (Simon-Vandenbergen 1998; 2000).5 Examples (10) and (11) are both deliberative and used in political settings. It may be that speakers using this uncommon construction feel that they require the subordinator in order to sound more authoritative and deliberate. Another way of looking at the constructions is to treat them as syntactic amalgams or blends (Bolinger 1961; Lakoff 1974). This seems especially plausible for examples (12) to (14), as these attestations have one foot in the spoken domain, the medium in which syntactic amalgams are most likely to occur. (12) is a transcription of an
interview and (13) and (14), coming from a personal website and an internet forum respectively, can also be thought of as standing rather close to spontaneous spoken language. The BNC data contained a few clearer cases of syntactic amalgams, one of which is given as example (15):
(15) There was an assumption that inflation would be higher than it was and that was cut back to one point five percent, which I think that I would actually support was a sensible way forward.
4.2 Local scope (B)
When a pragmatic expression is used with local scope, it mostly applies, from a formal point of view, over a phrase or a word but it may also have scope over a subclause. The function of the scoped element is either that of a clause constituent (B1) or of a phrase constituent (B2).
4.2.1 Phrase, word or subclause functioning as clause constituent (B1)
Figure 5 clarifies the notion of a pragmatic expression with scope over a phrase functioning as a constituent of the clause. We can see that, similarly to figure 1, representing a pragmatic expression in medial position with scope over the whole clause, the expression is positioned in between clause constituents. This time, however, the arrow points to one of the clause constituents rather than the entire clause. The pragmatic expression may, as in example (16), scope the clause constituent following it, which corresponds to initial position, or it may assume final position relative to the clause constituent preceding it. In the corpus, two cases of I think were found that would seem to occupy medial position in a phrase, i.e. their scope is not restricted to a phrase constituent, which would make them category (B2), but rather, it extends over the phrase that the pragmatic expression breaks up. Under (17), an illustration of this marginal phenomenon is given and it is visualized in figure 6.
(16) I mean all that and he’s he’s popped his clogs at fifty-three and he’s you know not not a particularly nice man
Figure 5: Phrase/word/subclause functioning as clause constituent (B1)
(17) Uh he later told me in the course of his cross-examination that he didn’t think there had been any change i in the ground <,> uhm uh between the <,> date of the accident and the photograph uh with the exception I think of the <,> uhm <,> the slope <,,>
Figure 6: Phrase functioning as clause constituent: medial position (B1’)
As was mentioned, the clause constituent over which a pragmatic expression with local scope applies may take the form of a clause. The importance of distinguishing between this type and categories (A2) and (A3), which also involve a pragmatic expression with scope over a subclause, will be illustrated by means of a comparison between examples (18) and (19).
(18) There was One other name I wanted to to throw in was Gerald Finzi <,> because I think the Clarinet Concerto is the most amazing piece
(19) And you’re made uhm Archbishop of Canterbury I think because you’re thought to have done a tolerably good job as a diocesan bishop
While (18) falls under category (A2) and (19) under (B1), in both cases I think has scope over a clause which in turn functions as a constituent within a clause. Nevertheless, there is a difference in that in (18), I think functions at the level of the subclause; it is an interpersonal modification of the propositional content of the because-clause. I think here indicates that the clarinet concerto being the most amazing piece is the speaker’s personal evaluation. In (19), on the other hand, I think has an interpersonal function in the main clause; it realizes an epistemic modification that singles out the validity status of one particular constituent: an adverbial adjunct that happens to be clausal. This scopal difference also comes to expression at the formal level: in (18) I think is part of the subordinate clause and hence it is placed ‘within’ this clause, after the subordinating conjunction. I think in (19), by contrast, does not have a function inside the subclause as it is placed ‘outside’ of it, preceding the adverbial clause. The difference in scopal relationship can be illustrated by means of a reactance6 involving a cleft construction:
(20) It is because I think the Clarinet Concerto is the most amazing piece that there was another name I wanted to throw in.
(21) It is because you’re thought to have done a tolerably good job as a diocesan bishop that I think you’re made Archbishop of Canterbury.
To put I think in (19) after the conjunction would lead to an entirely different interpretation. The result would be (22) and the difference in meaning is made explicit by means of a cleft reactance in (23).
(22) And you’re made uhm Archbishop of Canterbury because I think you’re thought to have done a tolerably good job as a diocesan bishop
(23) It is because I think you’re thought to have done a tolerably good job as a diocesan bishop that you’re made Archbishop of Canterbury
The difference between the two sentences can be visualized by comparing figure 3, corresponding to (18), to figure 7, a schematic representation of a sentence like (19).
Figure 7: Local scope: clause functioning as a clause constituent (B1)
4.2.2 Phrase, word or subclause functioning as phrase constituent (B2)
In this syntactic category, the scope of the pragmatic expression is as narrow as possible; it is limited to the constituent of a phrase. The visual representation of this type bears some resemblance to both the intrusive medial position (A1’) and to the exceptional use of a pragmatic expression in medial position relative to the phrase that it scopes (B1’). In all three cases, the pragmatic expression is inserted within a phrase but the difference resides in its scope. In the case of intrusive medial position, the scope is the widest as it encompasses the whole clause. (B1) involves a somewhat narrower scope, viz. phrasal scope and in (B2) the scope is at its narrowest. Here only one particular word or phrase that is part of the disrupted phrase falls within the scope of the pragmatic expression. This is the most common scopal behaviour when a pragmatic expression is placed inside a phrase, both for you know and for I think.
Figure 8: Phrase/word/subclause functioning as phrase constituent (B2)
(24) Uh in the uhm <,> I think October issue of Computational uh Linguistics there’s an attempt to do something of this type
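The categories introduced so far in section 4 can be gathered into a small annotation scheme. The sketch below is one possible encoding and not the coding format actually used for the ICE-GB data: the names Scope, Category, SCHEME and PHRASE_INTERNAL, and the wording of the notes, are assumptions made for illustration. Each category records the two properties on which the approach pivots, namely the type of scope and the function of the scoped element.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    CLAUSAL = "A"
    LOCAL = "B"


@dataclass(frozen=True)
class Category:
    label: str            # category label as used in the text
    scope: Scope          # clausal (A) or local (B)
    scoped_function: str  # function of the scoped element
    note: str = ""


# Hypothetical encoding of the categories described in sections 4.1 and 4.2.
SCHEME = [
    Category("A1", Scope.CLAUSAL, "speech act"),
    Category("A1'", Scope.CLAUSAL, "speech act", "intrusive medial position"),
    Category("A2a", Scope.CLAUSAL, "clause constituent", "nominal clause"),
    Category("A2b", Scope.CLAUSAL, "clause constituent", "adverbial clause"),
    Category("A3", Scope.CLAUSAL, "phrase constituent", "mostly relative clauses"),
    Category("B1", Scope.LOCAL, "clause constituent"),
    Category("B1'", Scope.LOCAL, "clause constituent", "medial within the scoped phrase"),
    Category("B2", Scope.LOCAL, "phrase constituent"),
]

# The three cases in which the expression is inserted within a phrase,
# ordered from widest to narrowest scope as described in 4.2.2.
PHRASE_INTERNAL = [
    ("A1'", "the whole clause"),
    ("B1'", "the phrase that is broken up"),
    ("B2", "one constituent of that phrase"),
]


def lookup(label):
    """Return the category with the given label."""
    for category in SCHEME:
        if category.label == label:
            return category
    raise KeyError(label)


for label, extent in PHRASE_INTERNAL:
    print(label, lookup(label).scope.name, "scope over", extent)
```

Keeping scope and function as separate fields means that a token inserted within a phrase is not automatically treated as local: A1' and B2 share a surface position but differ in the scope field.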
Similarly to (B1’), also in this category it may happen that a pragmatic expression is realized within a phrase without scoping any single constituent of this phrase but rather the phrase as a whole, only in this case the phrase in turn functions as the constituent of another phrase, as exemplified in (25). This means that within (B2), the category of local scope over a phrase constituent, we need to allow for an admittedly very marginal category of medial position.
(25) A day a meal <,,> for as long as you like <,> in the imagination of a generation like ours obsessed I think <,> with the attempt to put themselves back in past time as well as to live intensely as they do <,> in the present <,>
It can be seen that the phrase within which I think in (25) occupies medial position is a non-finite subordinate clause functioning as a postmodifier. So in this subcategory of local scope too the scoped element can be a clause.
5. Data discussion and findings
I will now discuss the most important findings that came out of the classification of the ICE-GB data of you know and I think following the alternative approach presented in section 4.
5.1 More local you know than local I think
Upon comparing the scores for clausal as opposed to local scope of you know and I think, one immediately notices that local you know is used with much higher frequency than local I think. As table 1 shows, local you know accounts for nearly 37% of all uses as opposed to a mere 5.42% of local I think.

Table 1: Clausal and local uses of you know and I think

            clausal           local
you know    63.09% (682)      36.91% (399)
I think     94.58% (1640)     5.42% (94)
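The percentages in table 1 follow directly from the raw token counts, and the same counts can be related to the size of the spoken section of the ICE-GB given in section 2. The following sketch recomputes both; the names counts, shares and per_10k are chosen for illustration only, and the normalisation per 10,000 words is an added convenience that the paper itself does not report.

```python
# Raw token counts from table 1 and the spoken corpus size from section 2.
SPOKEN_WORDS = 637_562
counts = {
    "you know": {"clausal": 682, "local": 399},
    "I think": {"clausal": 1640, "local": 94},
}


def shares(row):
    """Percentage of clausal and local uses within one expression."""
    total = sum(row.values())
    return {scope: round(100 * n / total, 2) for scope, n in row.items()}


def per_10k(row, corpus_size=SPOKEN_WORDS):
    """Occurrences per 10,000 words of the spoken section (illustrative)."""
    return round(10_000 * sum(row.values()) / corpus_size, 2)


for expression, row in counts.items():
    print(expression, shares(row), per_10k(row))
```

Working from the raw counts rather than the percentages makes it easy to re-derive the table if tokens are reassigned to another category.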
This observation extends the following claim about you know made by Erman (1987: 98):
[you know] has a narrower scope than the other two PEs [pragmatic expressions] [I mean and you see]. … you know tends to be used more locally (e.g. between and within constituents) than the other two PEs.
On the basis of the ICE-GB data, we can add to Erman’s claim that you know tends to be used with narrower scope not only than I mean and you see, but also than
I think. In the following two sections, explanations for this divergent behaviour of you know and I think will be provided by relating the pragmatic expressions’ syntactic properties to their interpersonal and discursive functions. 5.1.1 I think as an epistemic expression The functional explanation for the tendency of I think to be used with clausal scope resides in its function as an expression of epistemic modality. It was characterized as such by Thompson and Mulac (1991a; 1991b), Aijmer (1997) and Thompson (2002), to name but a few. In Systemic Functional Linguistics I think is known under the name of ‘interpersonal grammatical metaphor’, i.e. a metaphorical realization of epistemic modality, which is congruently expressed by modal auxiliaries and adverbs (Halliday 1985, Halliday and Matthiessen 2004). Being one of the TAM properties, modality is inherent in the finite clause and consequently it is not surprising that I think tends to have clausal scope. On closer examination of the 94 I think tokens that were analysed as having local scope, it turns out that 33 of these, or 35.11%, are ellipses. This means that the scoped element is used on its own, but it stands for a finite clause that can be recovered from the context. The number of ellipses in the you know data is much lower; it amounts to 16.54% of local you know. It would not be entirely indefensible to treat these elliptic uses as clausal ones as one is in fact expected to infer a clause from them. With reference to ik denk / denk ik, the Dutch equivalent of I think, Nuyts (1994: 81) observes that some cases that he was inclined to classify as parenthetical in fact can also be analysed as elliptical complementtaking uses. An example of elliptic I think is given under (26). On the basis of the utterance preceding the ellipsis, a clause like My mum is coming back on Friday I think can be inferred. (26)
B: When’s your Mum coming back
A: Uh <,> Friday I think
In spite of its tendency to have clausal scope, it is nevertheless possible for I think to problematise the validity status of one particular piece of information rather than the clause as a whole. In example (27), the speaker expresses uncertainty specifically about the name of the town that was being besieged by the Scots: (27)
The the s the Scots were besieging <,> I think uh uh Berwick and Edward whoever it was at the time came out to relieve it
5.1.2 You know as a local marker of metalinguistic awareness
The proclivity of you know for local scope can also be explained by having a closer look at its functional properties. In the literature, one of the recurrent core meanings that are attributed to you know is that of negotiating common ground
(Östman 1981; Schiffrin 1987). The data suggest that you know is usually not used to create a general sense of common ground, or what Östman called a “you-know mood” (1981). Rather, you know has the potential to locally create common ground. Local you know is used when the common ground status of particular, rather small items in the information structure needs to be established by the speaker and acknowledged by the hearer. It is used to introduce specific pieces of discourse into the realm of common ground. To be even more precise about the nature of this common ground, most local uses of you know negotiate common ground at the metalinguistic level. A lot of the common ground that speaker and hearer make use of in conversation is of a metalinguistic nature; both parties draw on the knowledge that they share about the language that they are using, especially, in the case of you know, about its lexical possibilities and constraints. The metalinguistic component of common ground will be referred to as metalinguistic awareness and local you know will be subsumed under the umbrella term of ‘marker of metalinguistic awareness’7, which is similar to what Verschueren (2000: 445) called “metapragmatic markers […], which draw attention to the lexical choice-making itself, as a kind of warning against unreflective interpretation”. Local you know works at the level of metalinguistic awareness in that it draws attention to the speaker’s process of lexical selection and the hearer’s acceptance of this choice. The pragmatic expression indicates that the speaker is drawing on their metalinguistic awareness to produce a particular phrase or word and at the same time you know constitutes a request to the hearer to also make use of their metalinguistic awareness in order to understand the speaker’s communicative intentions. It is important to note that local you know is not just about the speaker’s selection of the right words. In fact, you know is highly hearer-oriented.
Selecting the right expression and arriving at the right meaning is a joint enterprise. By using you know the speaker makes a request for cooperation and benevolence on the part of the hearer. The speaker wants the hearer to accept their choice of wording and appeals to the hearer and to the hearer’s metalinguistic awareness to accept their lexical choice. Hence, the hearer is actively involved in the process of creating meaning. Such a cooperative relationship between speaker and hearer presupposes solidarity. No instances were found in the corpus that could be interpreted as authoritarian. The example provided below illustrates the active involvement of the hearer in the process of lexical selection and creating meaning. It can be seen how the hearer supplies the words that the speaker is having difficulty producing. (28)
A: I’m look I’m quite looking forward to seeing them again They’re quite <,> you know Quite n...
B: Very nice guys
A: Yeah they are definitely
According to some scholars, what you know essentially does is to invite addressee inferences (Jucker and Smith 1998; Fox Tree and Schrock 2002). With respect to local you know, we can say that this pragmatic expression suggests that the hearer, on the basis of the common ground s/he shares with the speaker, could have inferred the word, phrase, or clause marked by you know by themselves. In order to infer something, people need to use the resources of their background knowledge, i.e. the common ground they share with their interlocutors, which, in this case, is background metalinguistic knowledge. The overall category of marker of metalinguistic awareness can be differentiated into four functional subcategories, which will be dealt with one by one below.
i) Online planning activities
You know as a marker of metalinguistic awareness can be used as a device that helps speakers plan their utterance as they go along. Its use as a repair marker was described by Schourup (1985), Holmes (1986) and Erman (1987). Schourup (1985) pointed out that repairs performed by means of you know suggest that the hearer could have inferred the repair himself/herself. That is why a repair like the one in (29) does not seem to work:
(29)
? I got a dog you know cat for my birthday. (Schourup 1985: 122)
According to Schourup, the change from dog to cat is too radical and would have been more felicitously realized by I mean. It will be seen that performing a repair involves both interlocutors’ metalinguistic awareness in the sense that the speaker, who decides that they did not use the right word, asks the hearer to accept the dismissal of it and its substitution with another. The examples listed under (30) and (31) illustrate the use of local you know in repair sequences. (30)
Not not as bad you know not as stiff as the other ones
(31)
I’m just wearing leggings and my big baggy you know my big green V-neck jumper...
Local you know can also be found in repetition sequences, as illustrated in (32) and (33). Not infrequently, a function word, such as a determiner (e.g. (32)), or a preposition or a combination of the two (e.g. (33)), is realized before you know and repeated after it, postponing the realization of a content word, most typically a noun or NP. Local you know mostly has scope over a noun or an NP, viz. in 42.11% of all local uses.
(32)
reading that uh you know that book you gave me on Stephen that Stephen King book as well
(33)
And I said well maybe you know I’ll look in my you know in my diary
Local you know is often surrounded by hesitation markers such as pauses and ‘fumbles’ like uhm and uh. This also points towards online planning activities. (34)
They look as if they’ve all had a quick turn under the steam roller <,> uh and yours have that same quality in that you’ve made up your decisions which <,> are <,> you know stylized<,>
(35)
He used he used to be quite <,,> portly you know
The above types of online planning phenomena all involve lexical searching. In prescriptivist and lay discourse they are commonly referred to, rather irreverently, as ‘fillers’ (cf. Fox Tree 2007).
ii) Creative language
Speakers sometimes use local you know to draw their interlocutor’s attention to the fact that they are using expressive or “creative language” (Aijmer 2002). They signal that they are using figurative language or an unconventional turn of phrase and they request the hearer to accept their metaphor, comparison or imaginative use of language that may require an extra processing effort. The speaker indicates that the hearer will also need to be creative with their background metalinguistic knowledge in order to appreciate the meaning of what the speaker is saying. Examples (36) and (37) illustrate this usage of you know. In (36), language is used metaphorically seeing that the interlocutors are talking about painting.
(36)
He’s developed a sort of a you know a language
(37)
It’s like uhm I mean it’s a sort of minor version of uhm you know Paul Vining’s moustache cum beard
We can see in examples (36) and (37) that local you know as a marker of creative language tends to collocate with sort of. According to Willemse et al. (2007), sort of or kind of may also mark creative language. This does not mean, however, that you know and sort of are mutually interchangeable or that the realization of one rather than the other is entirely random. Evidence for motivated use of pragmatic expressions can be found in Fox Tree (2007). In this study involving informant testing, Fox Tree demonstrates that speakers have distinct notions about the meanings of the pragmatic expressions um, uh, you know and like.
iii) Metalinguistic distancing
When a speaker uses you know as a metalinguistic distancing device, they apologize for not having chosen the most appropriate term. They want to make it clear to the hearer that they are not altogether happy with the way they have put things. Quite commonly, speakers distance themselves metalinguistically from a certain word or phrase because they have selected an expression from a different register from the one they consider appropriate. The metalinguistic distancing function of you know shows similarities to that of like as described by Andersen (2001: 243):
(…) like can be construed as a signal that the expression the speaker chooses may not be the most appropriate one, and that an alternative expression might communicate her ideas more efficiently (…) Analogously, like can be construed as a signal that the chosen expression does not fit readily into the linguistic repertoire of the speaker, i.e. that the speaker feels a minor discomfort with its use. (…) The potential alternative might be for instance a stylistically different expression (…).
In the examples provided below, the reasons for metalinguistic distancing are stylistically motivated. The respective speakers of (38) and (39) do not feel entirely comfortable using the rather slangy or crude expressions pigged off and chucked in.
(38)
If I’m sort of you know pigged off with things at school I will pick up Pride and Prejudice (…)
(39)
So I thought God damn it if I ever get close to walking up the aisle and then I get you know chucked in <,> <,> I’ll be I’ll have a nervous breakdown
At this point in the discussion of the functionality of local you know, I consider it appropriate to go into a short digression to point out some commonalities between the functions of this pragmatic expression. The metalinguistic distancing function of you know borders on what I would like to call quotative you know. This usage of you know has not been discussed yet in this article as it tends to be used with scope over a whole speech act and an in-depth discussion of you know with clausal scope falls outside the scope of this study. Nevertheless, I consider it legitimate to attribute a quotative function to you know. It can aid in demarcating the speaker’s own words from quoted discourse, as in (40), which can be interpreted as an act of distancing oneself at a metalinguistic level from somebody else’s words or thoughts or from his/her own words or thoughts at some previous time.8
(40)
And I felt like turning around and sort of saying you know well whose fault is that
Interestingly, the marker sort of / kind of has also been found to have a quotative function (Willemse et al. 2007). This would be the second characteristic that you know and sort of / kind of have in common, besides marking creative language. In fact, the quotative function of you know and sort of / kind of bleeds into that of marking expressive or creative language. The quoted sequences are often cases of creative language. In this respect, the expressions resemble the well-known quotative like, which does not so much render somebody’s words verbatim as the overall ‘feel’, i.e. the emotions, attitudes and dramatic effect of what was said. The like quotative may, for example, also frame non-speech sounds and facial expressions (Fairon and Singler 2006: 326). In example (41), quotative you know is used expressively in that the speaker more than likely wishes to convey his feeling of desperation rather than the words that he actually uttered. (41)
and I used to kind of say you know please t please God get me out of this
iv) Expansion
The fourth and final usage of local you know as a marker of metalinguistic awareness is the expansive function, to be understood in the Hallidayan sense of the term (Halliday 1985, Halliday and Matthiessen 2004). It entails that you know is used to add extra information that elaborates a preceding concept, mostly as an apposition (42) or a clarification, as in (43).
(42)
Uh but I thought that one you know the brie de Meaux ‘s quite good isn’t it
(43)
He was doing it with Golden Grahams You know the <,> breakfast cereal <,,>
Admittedly, the expansive category is the least metalinguistic one; it does not so much appeal to the interactants’ knowledge of the language as to their knowledge of the world. Nevertheless, the door to the metalinguistic realm remains open seeing that appositional you know shades off into reformulation and repair, activities that involve metalinguistic awareness. The thin line between apposition and reformulation or repair is illustrated in examples (44) and (45). (44)
He said he wanted to serve the Government uh you know support the Government
(45)
But she means involved with other people you know prepared to take an interest in <,> <,>
5.2 Local you know: mostly in initial position
We have just seen that you know is used with local scope more often than I think. In this section, a second striking quantitative observation will be related to the functional properties of the two pragmatic expressions. Both you know and I think are mostly used in initial position. The percentages for initial position, regardless of scope, are 71.97% and 85.70% respectively. However, with clausal scope, I think is used in initial position more often than you know, but when used with local scope, you know assumes initial position most often, as becomes clear in Table 2.
Table 2: Initial uses of you know and I think
            clausal           local
you know    65.69% (448)      82.71% (330)
I think     87.38% (1433)     56.38% (53)
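The percentages in Table 2 can be cross-checked against the overall initial-position figures quoted in this section. What follows is only a minimal Python sketch under stated assumptions: the bracketed counts and the 94 local I think tokens come from the text, whereas the other three cell totals (682, 399 and 1640) are not reported and have been back-calculated from the published percentages; all names are illustrative.

```python
# Initial-position counts as reported in Table 2.
INITIAL = {
    "you know": {"clausal": 448, "local": 330},
    "I think": {"clausal": 1433, "local": 53},
}

# Total tokens per cell. Only 94 (local I think) is stated in the text;
# the other three totals are ASSUMPTIONS back-calculated from the percentages.
TOTAL = {
    "you know": {"clausal": 682, "local": 399},
    "I think": {"clausal": 1640, "local": 94},
}


def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


def initial_share(expression, scope=None):
    """Share of initial uses for one scope, or pooled over both scopes if scope is None."""
    scopes = [scope] if scope else list(INITIAL[expression])
    part = sum(INITIAL[expression][s] for s in scopes)
    whole = sum(TOTAL[expression][s] for s in scopes)
    return pct(part, whole)


for expression in INITIAL:
    print(expression,
          initial_share(expression, "clausal"),
          initial_share(expression, "local"),
          initial_share(expression))
```

Under these assumed totals the pooled shares come out at 71.97% for you know and 85.70% for I think, in line with the figures given in the text, which suggests that the table and the running prose are mutually consistent.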
Since local you know as a marker of metalinguistic awareness is used “as a kind of warning against unreflective interpretation” (Verschueren 2000: 445, my italics), it should not come as a surprise that it has a proclivity for preceding the element for which it provides a warning. It gives the speaker more time to think and it signals to the hearer that his/her active involvement in the decoding process is called upon. A speaker’s uncertainty as to the epistemic status of a piece of information, on the other hand, may be added after the informational chunk about which one is not sure, as a kind of ‘afterthought’. I think, when used locally, can be tagged on to a word or phrase to whose truth value the speaker decides that, on second thoughts, s/he does not want to commit. (46) is an example of I think occupying final position in relation to a noun. (46)
The house knows that this matter may be debated on the Queen’s speech specifically tomorrow and again on uh Monday I think
6. Conclusion
Since the canonical tripartite system for syntactically classifying pragmatic expressions like you know and I think was found inadequate in its neglect of the notion of scope and syntactic levels, an alternative approach to the issue of the syntactic positions of this type of expression was attempted. The results of a classification of spoken ICE-GB data following this system were found to throw
more light on the functional properties of you know and I think and, conversely, insights into the pragmatic expressions’ functions helped explain their syntactic patterns. The discussion of local you know, in particular, suggests that some views commonly held about this pragmatic expression, in the first place among laypeople but also, to some extent, among linguists, need to be revised. The insights provided in this article constitute an argument against the notion of ‘random sprinkling’, according to which, as explained in Fox Tree and Schrock (2002), you know can be scattered through the discourse at random. It would mean that it does not matter exactly where the expression occurs, as long as it creates a casual atmosphere (cf. Östman’s notion of a “you-know mood” (1981)). It would seem, instead, that you know is used at strategic, critical points in the discourse which require heightened metalinguistic awareness and where the common ground requires local remedying. To this should be added that no cases of you know in intrusive medial position were found in the data and that even regular medial you know was quite rare (4 occurrences). These observations further strengthen the claim that when you know is inserted somewhere, it is there for a reason. You know is sometimes referred to as an imprecision marker or what James calls a “compromiser” (1983). This view requires some subtle qualification. On the basis of the above discussion, we can say that you know is not used by a speaker who is deliberately being imprecise. Rather, it is used by a speaker who has great concern with being understood by the hearer; the speaker tries to facilitate efficient communication by substituting one word with another word that they consider more effective, by warning the hearer that they are not using the most conventional of expressions or by adding extra information to be absolutely sure that the hearer grasps his/her communicative intentions.
You know can be considered an imprecision marker only to the extent that it is used by a speaker who aims to communicate adequately and clearly but who apologizes for perhaps not always reaching their goal. Given the interesting relationship between syntax and pragmatic, interpersonal or discursive functions that came to expression in this study, the possibilities of the alternative syntactic classificatory system will be further explored in future work on related pragmatic expressions composed of cognitive verbs, e.g. I suppose, I guess and I believe.
Notes
1 I would like to thank my supervisor, Anne-Marie Simon-Vandenbergen, and co-supervisor, Miriam Taverniers, for their constructive comments on this study. Needless to say, the responsibility for any errors is entirely mine.
2 ‘Scope’ needs to be understood as a largely semantic notion rather than a strictly syntactic one. It is to be understood as “the stretch of language affected by the meaning of a particular form” (Crystal 1985: 308) or the way McGregor defined it (1997: 209ff). He distinguished between three types of syntagmatic relationship: constituency, dependency and conjugation. The third category characterizes, defines and is defined by the interpersonal semiotic. In a conjugational relationship, one unit ‘shapes’ the other, indicating how it is to be taken by the addressee. Within each of the three syntagmatic relationships, two dimensions are possible: scoping and framing. Scoping means that a unit applies over a certain domain, leaving its mark on the entirety of this domain.
3 Admittedly, the status of sentential relative clauses as adverbial clauses is debatable. They share characteristics with both content disjuncts and non-restrictive relative clauses (Quirk et al. 1985: 1120).
4 A search restricted to pages from the UK was conducted on www.google.co.uk.
5 In Van Bogaert (2006) it is demonstrated that I believe can also be used with either a tentative or a deliberative meaning and that, like I think, deliberative I believe is typical of political language.
6 The term ‘reactance’ is used as in Whorf’s (1945) terminology to mean that certain syntactic differences are not noticeable in the surface forms of utterances but only come to expression in different ‘reactions’ to particular syntactic operations.
7 Both ‘common ground’ and ‘metalinguistic awareness’ need to be understood not so much as pre-existing constructs but as dynamic concepts that come into being as the discourse unfolds. That is why the term ‘marker’ to refer to you know in these contexts is somewhat infelicitous as marking something presupposes that what is being marked is already there.
8 When used as a quotative, you know is usually accompanied by additional expressions with a quotative function, such as she said in the example given.
References
Aijmer, K. (this volume), ‘Does English have modal particles?’.
Aijmer, K. (2002), English discourse particles. Amsterdam: Benjamins.
Aijmer, K. (1997), ‘I think - an English modal particle’, in: T. Swan and O.J. Westwik (eds.) Modality in Germanic languages: Historical and comparative perspectives. Berlin & New York: Mouton de Gruyter. 1-47.
Andersen, G. (2001), Pragmatic markers and sociolinguistic variation. Amsterdam: Benjamins.
Bernstein, B. (1971), Class, codes and control 1: Theoretical studies towards a sociology of language. London: Routledge and Kegan Paul.
Biber, D. et al. (1999), Longman grammar of spoken and written English. London: Longman.
Blanche-Benveniste, C. and D. Willems (2007), ‘Un nouveau regard sur les verbes à rection faible’, Bulletin de la Société de Linguistique de Paris, 102.1: 217-254.
Bolinger, D. (1961), ‘Syntactic blends and other matters’, Language, 37: 366-381.
Crystal, D. (1985), A dictionary of linguistics and phonetics. 2nd ed. Oxford: Blackwell.
Dendale, P. and J. Van Bogaert (2007), ‘A semantic description of French lexical evidential markers and the classification of evidentials’, in: M. Squartini (ed.) Evidentiality between Lexicon and Grammar. Thematic issue of Italian Journal of Linguistics / Rivista Di Linguistica, 19.1.
Erman, B. (1987), Pragmatic expressions in English: A study of you know, you see and I mean in face-to-face conversation. Acta Universitatis Stockholmiensis/Stockholm Studies in English. Stockholm: Almqvist & Wiksell International.
Erman, B. (2001), ‘Pragmatic markers revisited with a focus on you know in adult and adolescent talk’, Journal of Pragmatics, 33: 1337-59.
Fairon, C. and J.V. Singler (2006), ‘I’m like “Hey, it works!”: Using Glossanet to find attestations of the quotative (be) like in English-language newspapers’, in: A. Renouf and A. Kehoe (eds.) The Changing Face of Corpus Linguistics. Amsterdam: Rodopi.
Fox Tree, J.E. (2007), ‘Folk notions of um and uh, you know and like’, Text and Talk, 27.3: 297-314.
Fox Tree, J.E. and J.C. Schrock (2002), ‘Basic meanings of you know and I mean’, Journal of Pragmatics, 34: 727-47.
Halliday, M.A.K. (1985), An introduction to functional grammar. London: Arnold.
Halliday, M.A.K. and C.M.I.M. Matthiessen (2004), An Introduction to Functional Grammar. London: Arnold.
He, A.W. and B. Lindsey (1998), ‘“You know” as an information status enhancing device: Arguments from grammar and interaction’, Functions of Language, 5.2: 133-55.
Holmes, J. (1986), ‘Functions of you know in women’s and men’s speech’, Language in Society, 15: 1-22.
Holmes, J. (1990), ‘Hedges and boosters in women’s and men’s speech’, Language and Communication, 10.3: 185-205.
Hooper, J.B. (1975), ‘On Assertive Predicates’, in: J.P. Kimball (ed.) Syntax and semantics. Vol. 4. New York: Academic Press. 91-124.
Huddleston, R. and G.K. Pullum (2002), The Cambridge Grammar of the English language. Cambridge: Cambridge University Press.
Huspek, M. (1989), ‘Linguistic variability and power: An analysis of you know/I think variation in working-class speech’, Journal of Pragmatics, 13: 661-83.
James, A.R. (1983), ‘Compromisers in English: A cross-disciplinary approach to their interpersonal significance’, Journal of Pragmatics, 7: 191-206.
Jespersen, O. (1937), Analytic syntax. Copenhagen: Levin & Munksgaard.
Jucker, A.H. and S.W. Smith (1998), ‘And people just you know like ‘wow’: Discourse markers as negotiating strategies’, in: A.H. Jucker and Y. Ziv (eds.) Discourse Markers: Description and Theory. Pragmatics and Beyond New Series. Amsterdam: Benjamins. 171-201.
Kärkkäinen, E. (2003), Epistemic stance in English conversation: A description of its interactional functions, with a focus on I think. Amsterdam: Benjamins.
Lakoff, G. (1974), ‘Syntactic amalgams’, in: M. La Galy, R.A. Fox and A. Bruck (eds.) Papers from the tenth regional meeting of the Chicago linguistic society. Chicago IL: CLS. 321-44.
McGregor, W. (1997), Semiotic Grammar. Oxford: Clarendon Press.
Nuyts, J. (1994), Epistemic Modal Qualifications: On Their Linguistic and Conceptual Structure. Antwerp Papers in Linguistics 81. Antwerp: UIA.
Östman, J.-O. (1981), You know: A discourse functional approach. Pragmatics and Beyond. Amsterdam: Benjamins.
Palander-Collin, M. (1999), Grammaticalization and social embedding: I think and methinks in Middle and Early Modern English. Helsinki: Tome LV.
Peltola, N. (1983), ‘Comment clauses in present-day English’, in: I. Kajanto (ed.) Studies in classical and modern philology. Helsinki: Suomalainen Tiedeakatemia. 101-13.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A comprehensive grammar of the English language. London: Longman.
Ross, J.R. (1973), ‘Slifting’, in: M. Gross, M. Halle and M. Schützenberger (eds.) The formal analysis of natural language. The Hague: Mouton.
Scheibman, J. (2002), Point of view and grammar: Structural patterns of subjectivity in American English conversation. Amsterdam: Benjamins.
Schiffrin, D. (1987), Discourse markers. Cambridge: Cambridge University Press.
Schneider, S. (2007), Reduced parenthetical clauses as mitigators: A corpus study of spoken French, Italian and Spanish. Amsterdam: Benjamins.
Schourup, L.C. (1985), Common discourse particles in English conversation. New York: Garland.
Simon-Vandenbergen, A.-M. (1998), ‘I think and its Dutch equivalents in parliamentary debates’, in: S. Johansson and S. Oksefjell (eds.) Corpora and cross-linguistic research: Theory, method and case studies. Amsterdam: Rodopi. 297-317.
Simon-Vandenbergen, A.-M. (2000), ‘The functions of I think in political discourse’, Journal of Applied Linguistics, 10.1: 41-63.
Simon-Vandenbergen, A.-M. (2002), ‘I think - a Marker of Middle Class Discourse?’, in: E. Kärkkäinen and T. Lauttamus (eds.) Studia Linguistica et Litteraria Septentrionalia. Studies presented to Heikki Nyyssönen. Oulu: Oulu University Press. 93-106.
Stenström, A.-B. (1995), ‘Some remarks on comment clauses’, in: B. Aarts and C.F. Meyer (eds.) The verb in contemporary English. Cambridge: Cambridge University Press. 290-301.
Stubbe, M. and J. Holmes (1995), ‘You know, eh and other ‘exasperating expressions’: An analysis of social and stylistic variation in the use of pragmatic devices in a sample of New Zealand English’, Language and Communication, 15.1: 63-88.
Tagliamonte, S. and J. Smith (2005), ‘No Momentary Fancy! The zero ‘complementizer’ in English dialects’, English Language and Linguistics, 9.2: 289-309.
Thompson, S.A. (2002), ‘“Object complements” and conversation: Towards a realistic account’, Studies in Language, 26.1: 125-64.
Thompson, S.A. and A. Mulac (1991a), ‘A quantitative perspective on the grammaticalization of epistemic parentheticals in English’, in: E.C. Traugott and B. Heine (eds.) Approaches to Grammaticalization. Vol. 2. Amsterdam: Benjamins. 313-39.
Thompson, S.A. and A. Mulac (1991b), ‘The discourse conditions for the use of the complementizer that in conversational English’, Journal of Pragmatics, 15: 237-51.
Urmson, J.O. (1952), ‘Parenthetical verbs’, Mind, 61: 480-96.
Van Bogaert, J. (2006), ‘I guess, I suppose and I believe as pragmatic markers: Grammaticalization and functions’, BELL New Series, 4: 129-49.
Verschueren, J. (2000), ‘Notes on the role of metapragmatic awareness in language’, Pragmatics, 10.4: 439-56.
Watts, R.J. (1989), ‘Taking the pitcher to the well: Native speakers’ perceptions of their use of discourse markers in conversation’, Journal of Pragmatics, 13: 203-37.
Whorf, B. (1945), ‘Grammatical Categories’, Language, 21: 1-11.
Willemse, P., K. Davidse and L. Brems (2007), ‘Synchronic layering of type nouns in English and French’, Paper presented at the 28th ICAME conference. Stratford-upon-Avon, 23rd-27th May 2007.
The functions of expletive interjections in spoken English
Magnus Ljung
University of Stockholm
Abstract
This paper is a study of the functions of ten common expletive interjections in a 1 million-word sub-corpus from the spoken component of the BNC. The findings indicate that about a hundred interjections function as release mechanisms for mostly negative feelings triggered by real-world experiences. The rest are shown to be pragmatic markers as these have been defined in the recent literature and are analysed mainly in terms of the discourse-based analytic model used in Stenström (1994).
1. Introduction
The aim of this study is to explore the use of expletive interjections in modern spoken British English as represented in the spoken component of the BNC. The study focuses on ten common expletive interjections in a sub-corpus made up of 26 conversation texts from the spoken component of the BNC, viz. KB0-KB9, KBA-KBN, KBP, KBR, KNT. The sub-corpus contains 1,000,015 words and has for obvious reasons been named Conv1M. The ten expletive interjections that are the focus of my study are bugger, Christ, cor, damn, fuck, god, gosh, hell, Jesus, and shit. The study considers only interjectional uses of these words and consequently ignores their use as nouns, adjectives, adverbs and as “filler material” in expressions like What the fuck, Who the hell, etc. On the other hand I have included in my study both single interjections like Bugger!, Cor!, Hell! etc. and - with one exception - collocations containing these words which are functionally interchangeable with the single interjections. The exception is God, which occurs in such a plethora of collocations as to make their inclusion impractical. Here I include only the most common collocation with God, viz. Oh God! Table 1 provides a full account of my data.
Table 1: Expletive interjections included in the study
BUGGER    14    Bugger! 1, Oh bugger! 1, Bugger it! 6, Bugger me! 2, Bugger + NP/pron 4
CHRIST    35    Christ! 11, Ah Christ! 1, By Christ! 2, By bloody Christ! 1, Cor Jesus Christ! 1, For Christ’s sake(s)! 7, Jesus Christ! 3, Jesus bloody Christ!, Oh Christ! 8, Oh Jesus Christ! 1
COR       82    Cor! 74, Cor blimey! 2, Cor bloody hell! 1, Cor Jesus Christ 1, Cor strewth! 4
DAMN      4     Damn! 2, Damn it! 2
FUCK      14    Fuck! 2, Fuck 4, Fuck it! 3, Fuck me! 3, Oh fuck! 1, Oh fuck me! 1
OH GOD    179   Oh God! 179
GOSH      30    Gosh! 19, Ah gosh! 1, Oh gosh! 9, Oh my gosh! 1
HELL      121   Hell! 1, Bleeding hell! 1, Bloody hell! 70, By hell! 2, Fucking hell! 12, Oh hell! 3, Sodding hell! 1, God flipping hell! 1, Flipping hell! 1, Oh fucking hell! 2, Oh bloody hell! 26, Oh, oh bloody hell! 1
JESUS     16    Jesus! 7, Ah Jesus! 1, Ah Jesus Christ! 1, Jesus wept! 1, Jesus bloody Christ! 1, Jesus Christ! 1, Oh Jesus 2, Oh Jesus Christ! 1, Cor Jesus Christ! 1
SHIT      18    Shit! 11, Oh shit! 7
Total     513
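The totals in Table 1, and the rates per running word that the discussion of the table derives from them, can be retraced with a short sketch. The Python below is illustrative only: the category totals and the corpus size are taken from the table and the Introduction, the figures 322 and 900 are the ones the text cites for the wider set of interjections, and the helper name words_per_interjection is a hypothetical label.

```python
# Category totals from Table 1.
COUNTS = {
    "BUGGER": 14, "CHRIST": 35, "COR": 82, "DAMN": 4, "FUCK": 14,
    "OH GOD": 179, "GOSH": 30, "HELL": 121, "JESUS": 16, "SHIT": 18,
}
CORPUS_WORDS = 1_000_015  # size of Conv1M as given in the Introduction


def words_per_interjection(n_interjections, n_words=CORPUS_WORDS):
    """Average number of running words per interjection."""
    return n_words / n_interjections


total = sum(COUNTS.values())   # the selected interjections
with_all_god = total + 322     # plus the remaining collocations with God
approx_all = 900               # the text's rough estimate including euphemisms

for label, n in [("selected", total),
                 ("with all God collocations", with_all_god),
                 ("approximate grand total", approx_all)]:
    print(f"{label}: {n} tokens, one per {words_per_interjection(n):.0f} words")
```

On these figures the selected set occurs about once every 1,950 words; adding the 322 further God collocations gives about one per 1,200 words, and a total of roughly 900 gives about one per 1,100 words.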
As Table 1 shows, the total number of expletive interjections in my study is 513, which means that the speakers in the study produce one of the selected expletive interjections per 2000 words. However, the total number of expletive interjections in the Conv1M corpus is much higher than that. Merely including all existing interjectional combinations with the word God would have added another 322 instances. If we also add the motley crew of euphemistic interjections alluding to God, the sum total would rise to about 900 and the production rate for these particular interjections would be very close to one per 1000 words.
2. Subjectivity, interactivity, textuality
The fact that the present paper is about expletive interjections in a way makes it a study of swearing, a large and somewhat ill-defined area of language that has lately attracted the attention of a number of linguists. The last few years, for instance, have seen a number of studies on swearing in British English, for example McEnery (2005), McEnery and Xiao (2003), (2004). These studies offer valuable information about the typology and sociolinguistics of English swearing and provide fascinating historical accounts of British attitudes towards swearing over the years. As I have already mentioned, the aim of the present study is different. The question I want to address here is why people swear, more specifically what functions expletive interjections serve in spoken English. It may seem that there is an obvious answer to my question: a generally held view of interjections and in particular of expletive interjections is that they are used in outbursts of mostly negative speaker feelings like anger and irritation. In his influential 1997 Cambridge Encyclopedia of Language David Crystal sums up this view in the following manner: “The functions of swearing are complex. Most obviously, it is an outlet for frustration and pent-up emotion and a means of releasing nervous energy after a sudden shock” (Crystal 1997: 61).
The functions of expletive interjections in spoken English
When used in this way the expletive interjections reflect, or are usually thought to reflect, the speaker’s inner states and feelings. For this reason they have been referred to as pure interjections, a category which may be thought of as closely corresponding to that of response cries first suggested by Goffman (1978). The pure interjections (PI’s for short) express the speaker’s reaction to a range of stimuli that is in principle impossible to delimit. The stimuli are often thought of as being of a physical and easily observable nature, thus making it possible for those in the speaker’s presence to make deductions about his/her reasons for uttering them.
What I wish to argue in the present paper is that while certain of the expletive interjections in my corpus may be interpreted as clear instances of PI’s, the majority of the expletive interjections are used to express speaker attitudes, to signal the orientation of a text, and to deliver different interactional signals. In short, my claim is that in many of their uses the expletive interjections should be regarded as belonging to a linguistic category that has variously been called pragmatic markers, pragmatic particles and discourse markers.
The notion that expletive interjections, and indeed interjections in general, may be used for pragmatic purposes and should be included among the pragmatic markers is not uncontroversial. Many of the scholars involved in the study of pragmatic markers do not mention interjections at all. Others expressly deny that interjections should be admitted to that category, for instance Andersen (2001: 42). Yet a third group take a more kindly view of interjections, for example Aijmer, who claims that discourse particles include elements as varied as conjunctions (however), main clauses (I think), sentence adverbials (frankly), imperatives (look) and interjections (oh) (Aijmer 2002: 18). What then are the criteria for membership in the pragmatic marker category?
According to Brinton (1996: 33), pragmatic markers have at least the following characteristics: Pragmatic particles (1) constitute a heterogeneous set of forms which are difficult to place within a traditional word class (including items like ah, actually … I mean, I think, you know), (2) are predominantly a feature of spoken rather than written language, (3) are high-frequency items, (4) are stylistically stigmatized and negatively evaluated, (5) have little or no propositional meaning or are at least difficult to specify lexically, (6) occur either outside the syntactic structure or are attached to it and have no clear grammatical function, (7) are optional rather than obligatory features, (8) may be multifunctional, operating on different levels (including textual and interpersonal levels). In my opinion, the expletive interjections satisfy all of these principles. They are definitely a heterogeneous group whose word class membership is often impossible to establish. It is true, however, that they have one factor in common, viz. the fact that, by a process of grammaticalization, they have developed from words denoting matters that are, or once were, taboo.
Magnus Ljung
As for the other criteria, the expletive interjections are certainly a feature of the spoken language and have high frequencies of occurrence; they are definitely stigmatized and negatively evaluated, they have little or no propositional meaning, they are optional rather than obligatory, and they tend to be multifunctional on different levels.
Later scholars like Erman (1998, 2001), Andersen (2001) and Aijmer (2002), the last of whom prefers the term “discourse particle” to “pragmatic particle”, have elaborated on Brinton’s principles for pragmatic particles and distinguish three broad types of pragmatic function, viz. subjectivity, interactivity and textuality. Individual pragmatic particles are typically associated with one of these functions but are usually also connected with the others: pragmatic particles are polyfunctional. Like Brinton’s earlier criteria, those of subjectivity, interactivity and textuality cause no problems for the expletive interjections. Let us take a look at the first one, subjectivity. What this term usually refers to in pragmatic texts is a number of speaker-related functions, in particular those conveying the speaker’s attitude to (the proposition underlying) the following utterance and those expressing the speaker’s epistemic stance towards that proposition. Example (1) is a straightforward example of a speaker using an expletive interjection to express his attitude to what he is saying:
(1) bloody hell look at that old codger behind the wheel (KB7 11226)
Here it would seem that although bloody hell in (1) is probably polyfunctional like most other pragmatic markers, its main function is to express the speaker’s surprise at the age of the driver. The wider context also confirms that this is the intended effect. But it is not always as easy as this to determine just what it is the expletive interjection is meant to express. In (2) for example a case can be made both for an attitudinal and an epistemic stance interpretation, something that reminds us of what we just said about the polyfunctionality of the pragmatic markers:
(2) Cor that was a proper macho man! (KBL 2438)
The mild interjection cor is often used to convey an attitude of surprise, both on its own and with regard to a following proposition. That may well be what it is doing in (2). However, like other clause-initial interjections, cor also places a certain amount of emphasis on the following utterance. Emphasis may be interpreted in many different ways, but a likely interpretation in (2) is that the added emphasis is a way to insist on the veracity of the utterance: what the speaker is saying is that a proper macho man is a true description of the man in question. This leads to the conclusion that (2) expresses both attitude and epistemic stance. Emphasis may also be used to strengthen the speech act force of certain utterances, in particular promises and predictions as in (3) and (4):
(3) Bugger it I’m gonna pay this off! (KB2 1980)
(4) I’m not playing this, bugger it! (KB7 6582)
Epistemic stance expressed by means of expletive interjections may also be negative and be expressed by means of a post-posed expletive interjection as in (5).
(5) A: Have you done it?
    B: Well, I’ve done some of it
    C: Have you fuck! (KBM 995)
The clearest examples of the second main pragmatic marker function – interactivity – are the feedback signals known as backchannels given by listeners to speakers to show that they are listening. At their most colourless, backchannels are mere acknowledgements like Mm, Mhm. However, as e.g. Stenström (1995: 82) demonstrates, backchannels do not have to be colourless but vary along a “feedback gradient” reflecting the listener’s degree of involvement in what the speaker is saying. As an example of such a gradient Stenström offers the series Mm – I see – Oh – Gosh – Really – My goodness – Hell, a series ranging from listener indifference to strong listener involvement. Examples (6) and (7) are examples from my corpus of backchannels expressing, respectively, mild and strong degrees of involvement on the listener’s part:
(6) A: They’ve got 14 lawns here..
    B: Gosh! (KBK 6153)
(7) A: She must be 37.
    B: Bloody hell! (KB1 3983)
The third main function usually attributed to the pragmatic markers is textuality. According to Andersen, textuality or the textual function “describes what the speaker perceives as the relation between sequentially arranged units of discourse” (Andersen 2001: 66), for example the use of Now as an indicator of the transition from one topic to another (cf. also Aijmer 2002: 6). A not unusual type of textual pragmatic meaning expressed by means of expletive interjections in my data is the use of preposed expletive interjections to indicate that what follows somehow exemplifies a previous claim (cf. Aijmer’s point that certain “pragmatic markers are used to mark an elaboration or clarification of the topic” (Aijmer 2002: 86)). This seems to be what is going on in example (8).
(8) A: Ange was saying she’s .. .she gets a bit funny, don’t she?
    B: Cor bloody hell she give I [sic] <pause> three questions the other day (KB6 2186)
Apparently B regards the asking of the three questions as evidence that A is right in what she is saying about someone being “funny” and uses the expletive interjection Cor bloody hell to point this out.
3. The pure interjections
In the preceding section I have tried to show that in many of their uses, expletive interjections meet the same functional demands as the bona fide pragmatic markers and should therefore be admitted to the same category. I have also argued that the popular view of expletive interjection usage (that they are mostly psychological outlets for pent-up feelings of irritation and the like) accounts only for a certain type of interjections that I have called pure interjections. When pure interjections are used in real life, they may be more or less difficult to interpret. If they are triggered by some observable mishap like the accidental cutting of a finger or the breaking of a window, bystanders usually find it easy to construct an explanation for the uttering of the pure interjection by linking it to the mishap. But when pure interjections are caused by non-observable factors like the speaker’s own thoughts or feelings, or are triggered by physical mishaps not observable to others, they are much more difficult to interpret. The same kind of difficulty often arises when we try to interpret pure interjections in a corpus of spoken English, be it on tape or in transcription; since we cannot observe the factors that trigger them, we are reduced to more or less ingenious guesses about what is going on. However, it is only fair to note that we sometimes do get information about speech situations in the transcripts of the spoken component of the BNC. On such occasions, the text actually identifies the event that triggered the expletive interjection. Examples (9) and (10) are cases in point.
(9) <“crash as kid falls over”> Oh my god (KB6 347)
(10) A: Again, this is <“knocking on the wall”> it’s the same as this. Shit! (KBD 7410)
It is obvious that Oh my god! in (9) is a reaction to the child falling over. In (10) we can, I think, make a reasonably good case for interpreting the situation as one in which A is attempting to find out what the wall is made of, perhaps hoping for something solid. On finding that s/he is mistaken, s/he reacts by using the pure interjection Shit! (Obviously, the very opposite may be the case: it may be the sameness that causes the speaker’s irritation.) However, in most cases such direct explanations are missing, and we have to form an idea of the situation in which the utterance is made by studying its immediate context. (11) is a fairly typical example of such a guessing-game:
(11) A: I’ll have the yellow ones.
     B: The yellow ones?
     A: Just <pause> oh bloody hell
     B: The yellow ones were thrown
     A: What do I do? (KB7 4705-4709)
Here A had clearly planned to use something s/he had counted on finding in a cupboard, and when s/he discovers that what s/he is looking for is no longer there, s/he gives vent to her/his disappointment by exclaiming Oh bloody hell! We never do find out what s/he was looking for, either from the preceding or the following text, but we can, I think, confidently characterize oh bloody hell in this context as a pure interjection expressing A’s disappointment on not finding whatever it is s/he is looking for. By engaging in detective work of this nature I eventually managed to identify a number of what I regard as convincing instances of pure interjections: of the 513 expletive interjections in my data, I reckoned that about one fifth belong to the pure interjections category and eventually put the total number of PI’s at 92. That leaves us with 421 expletive interjections which are not PI’s and should accordingly be amenable to pragmatic analysis.
4. A discourse-based analysis of the expletive interjections
In section 2, I discussed the well-established pragmatic notions of subjectivity, interactivity and textuality and gave a few examples of how these three notions might be used to provide pragmatic analyses of those expletive interjections that are not pure interjections. My approach was to study individual instances of interjection usage and try to assign plausible meanings to them. Necessary as it is, this approach needs to be combined with one that attempts to provide an account of the interplay between the meanings of the pragmatic particles and the different surroundings, syntactic and discourse-related, in which they occur. An obvious candidate for the job would be an analytic model operating in terms of turn-taking, like the discourse-related approach developed in the mid-1990s by Anna-Brita Stenström (cf. Stenström 1991 and 1994), which goes back to earlier ground-breaking work on discourse by John Sinclair and Malcolm Coulthard as presented in Sinclair and Coulthard (1975). In the remainder of this paper I will show what an analysis in terms of Stenström’s model, in broad outline, would look like and how it can be used to account for the functions of at least certain of the expletive interjections in the corpus.
In Stenström’s model, communication operates in terms of turns, moves and acts. A turn is everything a speaker says before the next speaker takes over. Turns are realised by moves. A simple turn contains a single move, while a complex turn contains several moves. Moves are realised by acts, of which there is a bewildering array.
One of the key features in Stenström’s model is the distinction between gap fillers and slot fillers (Stenström 1994: 61-62). Gap fillers are turns of their own. Slot fillers, on the other hand, are merely part of a turn. Stenström (1994: 61) illustrates this distinction by contrasting examples (12) and (13). In (12), the exclamation Right! is a gap filler, making up the entire second turn in the exchange, while in (13) it is a slot filler placed before another slot filler, viz. the clause let’s look at the applications.
(12) A: It’s under H for Harry
     B: Right.
(13) A: Well I went about a quarter to
     B: Right, let’s look at the applications
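The contrast between gap fillers and slot fillers can be made concrete with a small sketch. The Python fragment below is only an illustration under strong simplifying assumptions: turns are taken to be already segmented and tokenised by hand, and the span of the interjection within the turn is supplied by the analyst. The names Turn and filler_type are hypothetical and do not come from Stenström’s model or from any corpus tool.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """Everything one speaker says before the next speaker takes over."""
    speaker: str
    words: list  # hand-tokenised words of the turn (an assumption of this sketch)


def filler_type(turn, start, end):
    """Classify the expression occupying turn.words[start:end].

    Gap filler: the expression makes up the whole turn.
    Slot filler: the expression is only part of the turn.
    """
    if not (0 <= start < end <= len(turn.words)):
        raise ValueError("span lies outside the turn")
    if start == 0 and end == len(turn.words):
        return "gap filler"
    return "slot filler"


# Second turns of examples (12) and (13), tokenised by hand.
t12 = Turn("B", ["Right"])
t13 = Turn("B", ["Right", "let's", "look", "at", "the", "applications"])

print(filler_type(t12, 0, 1))  # gap filler
print(filler_type(t13, 0, 1))  # slot filler
```

The sketch captures only the turn-based criterion; deciding where a turn begins and ends, and what counts as the interjection, remains a matter of manual analysis.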
Gap fillers typically function as responses to a previous utterance and characteristically serve as second turns in two-turn exchanges as in example (12) or as third turns in interrogative exchanges like (14):
(14) A: Whose father died then?
     B: Celestian’s.
     A: Oh Christ! (KBH 1166-68)
Note how the responses in (12) and (14) differ in their degree of speaker involvement: Right in (12) is merely a backchannel informing A that B is listening, while the function of Oh Christ! in (14) is to express A’s reaction to the information s/he has received and possibly also to offer A’s sympathy to those affected by the father’s death. My examples of gap fillers so far have all been responses, and it is true that in the case of expletive interjections, there is a strong link between the two. However, “gap filler” is the overall term for any utterance making up a simple turn on its own. Thus the first turns in (12), (13) and (14) are all gap fillers serving as conversation initiators.
Slot fillers display a variety of functions. A very common one among the expletive interjections is to express subjectivity (attitude, epistemic stance) with regard to a following (less often a preceding) utterance in the same turn: in fact my early examples (1) – (5) were all demonstrations of such slot filler functions. Like Stenström, I distinguish between several types of slot fillers depending on where in the turn they occur, but unlike hers, my classification operates in syntactic rather than turn-based terms. I make a distinction between five types of slot filler positions: (1) immediately before a clause, (2) in the middle of a clause, (3) immediately after a clause, (4) immediately before a word or phrase and (5) immediately after a word or phrase. The above description of the five different types of slot fillers concludes my account of the different uses of expletive interjections that I have encountered in my corpus. Together with the gap fillers and the pure interjections that I have
already discussed, they make up a total of seven different functions for expletive interjections. Table 2 shows how the 513 expletive interjections in my corpus are distributed across these seven functions.
Table 2: Distribution of expletive interjections in Conv1M
Slot filler before a clause          226
Gap fillers                          116
Pure interjections                    92
Slot filler after a clause            40
Slot fillers before a word/phrase     30
Slot fillers after a word/phrase       7
Slot filler inside a clause            2
Total                                513
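The proportions implied by Table 2 are easy to recompute. The following Python sketch is offered only as an illustration: the counts are copied from Table 2, while the dictionary keys and the function name shares are labels chosen for this example and are not part of the study.

```python
# Counts copied from Table 2; the labels are illustrative only.
TABLE2 = {
    "slot filler before a clause": 226,
    "gap filler": 116,
    "pure interjection": 92,
    "slot filler after a clause": 40,
    "slot filler before a word/phrase": 30,
    "slot filler after a word/phrase": 7,
    "slot filler inside a clause": 2,
}


def shares(counts):
    """Return each category's percentage of the grand total (one decimal)."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


total = sum(TABLE2.values())                      # all expletive interjections
non_pure = total - TABLE2["pure interjection"]    # candidates for pragmatic analysis
ratio = TABLE2["slot filler before a clause"] / TABLE2["gap filler"]

print(total, non_pure, round(ratio, 2))
for label, pct in shares(TABLE2).items():
    print(f"{label:35s}{pct:5.1f}%")
```

On these counts the before-clause slot fillers are a little under twice as frequent as the gap fillers, and the pure interjections account for just under 18% of the total.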
The statistics in Table 2 show that there are great differences among the different uses of the expletive interjections in spoken English. There are three major uses: as slot fillers immediately before a clause, as gap fillers and as pure interjections. The first of these is by far the most important type, with almost twice as many members as its closest competitor, the gap fillers. With their 92 members, the pure interjections are obviously also a major category. Much further down the list, with 40 and 30 members respectively, we find another two expletive slot-filler positions: immediately after a clause and immediately before a word or a phrase. At the bottom there are the two very small categories: slot fillers following a word or phrase, with only seven members, and finally the use of expletives as slot fillers in the middle of a clause, of which there are only two instances. Below I will comment on all of them in turn, beginning with the slot fillers. We have already seen examples of slot fillers immediately before a clause, viz. (1) and (2), repeated below for convenience:
(1) bloody hell look at that old codger behind the wheel (KB7 11226)
(2) Cor that was a proper macho man! (KBL 2438)
Other examples of the same type are (15) – (17):
(15) Christ that’s gonna be a thousand pounds. (KB1 1181)
(16) Oh hell well he won’t have to bother, bother about a suit will he? (KB2 1870)
(17) Just had a shower, cor feel a bit cold now (KB7 2870)
In my previous interpretation of (1) and (2) I claimed that it seemed reasonable to regard both as expressions of speaker subjectivity with regard to the content of the following clause or rather with regard to the content of the proposition underlying that clause. These are indeed plausible interpretations which can also
be given to (15), (16) and (17). On such a reading we would then claim that in (15) Christ expresses the speaker’s surprise and perhaps even irritation over the cost of something, that in (16) Oh hell adds extra emphasis to the claim he won’t have to bother, and that in (17) cor is used to express the speaker’s mild concern at feeling cold. What makes such interpretations somewhat problematic is the fact that we know nothing about the phonology of (1), (2) and (15) – (17), for the simple reason that, unlike the London-Lund corpus, the spoken component of the BNC is not phonologically annotated. As a result we do not know where the tone unit boundaries go in these examples, nor do we know anything about the intonation. (Apparently the punctuation and spelling in the transcript cannot be relied upon to reflect phonological detail.)
What are the consequences of the lack of phonological annotation? Well, if a phonological analysis were to reveal that there is no tone unit boundary separating bloody hell, cor, Christ and oh hell from the following clause in (1), (2) and (15) – (17), these interjections would not be independent units representing moves of their own, but part of the same move as the following clause or NP. If on the other hand they were followed by a tone unit boundary, they would constitute independent moves of their own expressing the speaker’s strong surprise and/or irritation. What difference does it make? If a collocation like bloody hell in (1) is not a tone unit of its own, does that mean that it no longer qualifies as a pragmatic marker? At least according to one pragmatics scholar, the answer seems to be that it does not. In her discussion of Oh! (Aijmer 2002: 108), Aijmer argues that when followed by a tone unit boundary, oh is a pragmatic marker carrying a strong meaning of surprise.
When oh is not followed immediately by a tone unit boundary it loses much of its “surprise” meaning and is reduced to the role of intensifier of the following item(s) but is still regarded as a pragmatic marker.
Let us turn now to the gap fillers. With its 116 instances this is the second largest discourse function in my data. Let us consider three new examples of this function, viz. (18), (19) and (20).
(18) A: I’ve got thirty in tens.
     B: Ah Jesus!
(19) A: I’m driving, there’s this big bang, and the whole bonnet lit up.
     B: Oh God!
(20) A: Double tennis court?
     B: Mhm.
     A: Gosh!
In all three examples above the expletive interjections have clearly interactive functions. As could be expected, they serve as acknowledgements of the information given in a previous turn, but at the same time they also express a reaction to that information. Take for instance the exchange between A and B in (18). A study of the conversation leading up to the exchange in (18) reveals that B
has asked A to give (lend?) her/him forty pounds and when it turns out that A has only thirty pounds, B utters Ah Jesus! as an expression of disappointment. (19) and (20), on the other hand, do not express much interest from the speakers.
It is interesting to compare my findings concerning gap fillers with Stenström’s results in her 1991 study of the expletives in the LLC, viz. the London-Lund Corpus of Spoken English. Investigating what she called at the time “expletives as separate turns”, i.e. gap fillers, she found that 58% were used as responses in second or third turns, and that the remaining 42% were used as “go-on signals”, viz. as feedback signals interrupting the speech of another speaker. While there are instances of such go-on signals in my corpus, like for example (21),
(21) A: If you take
     B: Cor!
     A: the top off (KBP 1660)
such interruptive constructions are rare in my data. There is no obvious reason for this difference, but it may have to do both with the time at which the two corpora were created and with the kind of speakers involved. The LLC contains speech dating back to the 1960s and the 1970s, while the 26 texts in my data were all recorded in 1991 and 1992. It is possible, though perhaps not very likely, that the use of go-on signals has diminished in the time interval separating the two corpora. A more plausible explanation may be the difference between the speakers in the LLC and the BNC. The aim of the former was to represent educated adult British English and in fact most of the speakers are academics. The aim of the BNC was not to record only educated adult British English but to represent the entire gamut from “educated English” to uneducated and from teenagers to 70-year-olds. As a result, the speakers recorded in the BNC are not nearly as homogeneous as those in the LLC but differ from them with regard to both age and social class. It seems to me that both the age difference and the social difference between the speakers enrolled in the two corpora may have had an effect on the use of go-on signals.
The third largest group of expletives in Table 2 is the pure interjections. I have already pointed to the difficulties involved in finding plausible triggering factors for interjections in corpus data. However, occasionally another, theoretically more interesting interjection-linked difficulty turns up. What I am referring to are cases in which what seems to have started out as a genuine pure interjection is overheard by others and intrigues them to such an extent that they ask the speaker what is the matter. By doing that they change the nature of the original pure interjection, which has now become the first turn in an exchange. Example (22) shows how this may happen:
(22) A: Oh damn it. [Turn 1] (KBA 46)
     B: What? [Turn 2]
     A: This one doesn’t seem to want to come out [Turn 3]
In an analytic model based on turn taking, this is a non-problem: an initial utterance that is linked to another utterance in the way Oh damn it and What? are linked in (22) is by definition the first turn in an exchange involving at least two (in the case of (22), three) turns. But if we forget for a moment the exigencies of a strict turn-taking system, we realize that the key question here is what A’s intentions were. Did s/he intend her/his utterance Oh damn it to be taken as the first move in an exchange, or did s/he just let it slip out without any communicative plans? Cases like these raise other, larger questions, like the nature of self-talk and whether it makes sense to talk about pure interjections as communicative, both of them issues raised in an interesting paper by Erving Goffman from 1978.
Next in the list in Table 2 we find two smaller categories. The first is made up of slot fillers appearing immediately after a clause as in our old example (4), repeated below, and in the new (23).
(4) I’m not playing this, bugger it! (KB7 6582)
(23) Stop dribbling, for Christ’s sake! (KBL 33702)
Like many other examples in the BNC, (23) was uttered while the speaker was watching football on TV. The phrase for Christ’s sake alternates in this position with for God’s sake, for Pete’s sake and occasionally for fuck’s sake. All the pragmatic sake constructions seem to have developed a highly specialised function: they are used by speakers to emphasize the situational relevance of their own utterances (or of elements of these utterances). Stenström (1994) uses the term booster for this function, defining it as “the speaker’s assessment of what s/he says”.
The slot filler position in the middle of a clause is used extremely seldom. One of the few examples I have found is (24), which admittedly could also be interpreted as a pure interjection:
(24) Right, Ann <pause> what wine <end of voice quality> <pause> oh God! <pause> is made in <pause> oh, Department of the Marne <end of voice quality> (KBD 7826)
Slot fillers before a word or phrase, on the other hand, are fairly common. There appear to be at least two ways of using the expletive interjection in such cases. In (25) and - in particular - in (26) the interjections are in all probability tone units of their own expressing the speaker’s feelings concerning the following NP. Thus in (25), Oh God is used to convey the speaker’s irritation with the neighbour’s cats. In (26) the wider context of the quote makes it clear that the speaker expresses surprise at the proposal to locate a night club in a certain street. Example (27), on the other hand, strikes me as another example of the merely intensifying use of an interjection that we noted in the discussion of the expletive interjections used as slot fillers before a following clause in (1), (2),
(15) – (17). (It is hard not to feel that the difference between (25) and (26) on the one hand, and (27) on the other is in fact reflected in the punctuation here as in many other places and that the role of punctuation in the BNC might be worth looking into.)
(25) Oh God, next door’s cats (KB8 8468)
(26) Shit! Down Quinnan Street! (KBD 967)
(27) Oh god yes. (KBP 4686)
There is one interesting case of a slot filler occurring just before an NP in what must be an example of an interjection used with a textual function, more precisely as an act of repair when the speaker realises that s/he has made a mistake and wants to put it right: it was not curtains that the speaker should have ordered, but curtain rails.
(28) No I haven’t ordered any curtains, cor … curtain rails (KBH 3898)
The position immediately after a word or phrase is not as common as that before a word or a phrase, but we do find examples like (29):
(29) Damn paint and stuff, cor strewth. (KBR 531)
The situation here is that the speaker and her/his interlocutor are visiting a building that is being redecorated. The speaker coughs and then exclaims Damn paint and stuff!, adding cor strewth as a booster emphasizing the relevance of her/his exclamation.
5. The distribution of the individual interjections
In the preceding section I explored the different, mostly discourse-based, functions in which the expletive interjections in my data have been used. I will bring this paper to its conclusion with a brief presentation of the distribution of the individual interjections across these functional categories, with a view to establishing whether the individual expletive interjections show any marked tendencies to differ in their choice of function. I present my findings in Table 3. However, before we discuss the results in the table, let me remind the reader that the labels represent all uses of the words involved, whether as single words or as part of a collocation.
Table 3: Distribution of the expletive interjections. PI: pure interjection, GF: gap filler, BC: before clause, MC: mid-clause, AC: after clause, BWP: before word/phrase, AWP: after word/phrase; % in brackets
LABEL     PI           GF           BC           MC         AC           BWP          AWP          TOT
Bugger    1 (7.1)      2 (14.2)     3 (21.4)     -          3 (21.4)     5 (36)       -            14
Christ    11 (31.4)    2 (5.7)      15 (42.9)    -          6 (17.1)     -            1 (2.8)      35
Cor       8 (9.75)     11 (13.4)    58 (70.7)    -          3 (3.7)      -            2 (2.4)      82
Damn      -            1 (25)       2 (50)       -          -            1 (25)       -            4
Fuck      -            2 (14.2)     6 (42.9)     -          1 (7.1)      4 (28.6)     1 (7.1)      14
Oh God    34 (19)      49 (27.3)    67 (37.4)    -          13 (7.3)     15 (8.4)     1 (0.55)     179
Gosh      1 (3.3)      8 (26.7)     19 (63.3)    -          1 (3.3)      1 (3.3)      -            30
Hell      25 (20.7)    31 (25.6)    48 (39.7)    2 (1.7)    10 (8.3)     3 (2.5)      2 (1.7)      121
Jesus     5 (31.2)     7 (43.7)     2 (12.5)     -          2 (12.5)     -            -            16
Shit      7 (38.9)     3 (16.7)     6 (33.4)     -          1 (5.6)      1 (5.6)      -            18
Total     92 (17.9)    116 (22.6)   226 (43.9)   2 (0.4)    40 (7.8)     30 (5.7)     7 (1.4)      513
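The row percentages in Table 3, and the comparison of each interjection with the overall distribution, can be recomputed from the raw counts. The Python sketch below is illustrative only: the counts are transcribed from Table 3 (empty cells are entered as 0), while the names TABLE3, row_share, overall_share and above_average, and the frequency threshold of 30, are assumptions made for this example.

```python
# Counts transcribed from Table 3 (columns: PI, GF, BC, MC, AC, BWP, AWP).
# Empty cells are entered as 0. All names are illustrative assumptions.
FUNCTIONS = ("PI", "GF", "BC", "MC", "AC", "BWP", "AWP")
TABLE3 = {
    "Bugger": (1, 2, 3, 0, 3, 5, 0),
    "Christ": (11, 2, 15, 0, 6, 0, 1),
    "Cor": (8, 11, 58, 0, 3, 0, 2),
    "Damn": (0, 1, 2, 0, 0, 1, 0),
    "Fuck": (0, 2, 6, 0, 1, 4, 1),
    "Oh God": (34, 49, 67, 0, 13, 15, 1),
    "Gosh": (1, 8, 19, 0, 1, 1, 0),
    "Hell": (25, 31, 48, 2, 10, 3, 2),
    "Jesus": (5, 7, 2, 0, 2, 0, 0),
    "Shit": (7, 3, 6, 0, 1, 1, 0),
}


def row_share(label, function):
    """Percentage of one interjection's occurrences that serve a function."""
    row = TABLE3[label]
    return 100 * row[FUNCTIONS.index(function)] / sum(row)


def overall_share(function):
    """Percentage of all occurrences, across interjections, serving a function."""
    i = FUNCTIONS.index(function)
    grand_total = sum(sum(row) for row in TABLE3.values())
    return 100 * sum(row[i] for row in TABLE3.values()) / grand_total


def above_average(function, min_total=30):
    """Labels with at least min_total occurrences whose share exceeds the
    overall share, ordered from highest to lowest share."""
    base = overall_share(function)
    hits = [label for label, row in TABLE3.items()
            if sum(row) >= min_total and row_share(label, function) > base]
    return sorted(hits, key=lambda label: -row_share(label, function))


for label in above_average("BC"):
    print(f"{label}: {row_share(label, 'BC'):.1f}% before clause")
```

With the threshold set at 30 occurrences, only Cor and Gosh exceed the overall before-clause share. Recomputed from the counts, that overall share is about 44.1%, marginally different from the 43.9% printed in the totals row; the ranking of the interjections is unaffected.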
The statistics in Table 3 have been organized from left to right in order to make it possible to observe what percentage of the total number of occurrences each interjection devotes to the different functions. When the total number of occurrences is very low, this becomes a rather uninteresting exercise. However, with interjections with high total frequencies of occurrence, this method sometimes yields interesting information about the functional preferences of individual interjections. Given the information in Table 2, we should not be surprised to find that almost 44% of the totals fall in the slot-filling “before clause” category. By the same logic, the fact that the gap fillers and the pure interjections end up in second and third position will hardly cause any raised eyebrows. What is more interesting is the way the “before clause” percentages for cor and gosh surpass the 43.9% in the totals row by a thumping 26.8 and 19.4 percentage points respectively. Cor has 70.7% of its 82 occurrences in that position while gosh has 63.3% of its 30 occurrences in the same slot, a distribution strongly suggesting that these two items have specialized as expletive clause-initial pragmatic markers. Another surprise may be found among the gap fillers, where Jesus has 43.7% in comparison with the 22.6% value in the totals row. But Jesus is a low-frequency item with a mere 16 occurrences and we may find it more rewarding to study the gap filler figures for real high-frequency interjections like Oh God! and hell. Both of these have gap filler percentages clearly above the totals values for the category. A third set of items with deviant percentages is to be found in the column for pure interjections, but as membership in this category is more difficult to determine than for the other functions, these findings should be taken with a certain amount of scepticism. For what it is worth, however, a study of the percentages in that column reveals that Shit, Christ and Jesus all have substantially higher percentage values than the expected 17.9% found in the totals row. In the case of shit almost 39% of its occurrences are pure interjections; the corresponding percentages for Christ and Jesus are 31.4% and 31.2% respectively.
6. Conclusion
The aim of the present paper has been to explore the functions of expletive interjections in spoken British English as they are used in a one-million-word sub-corpus from the spoken component of the BNC. The study focuses on ten common expletive interjections representing the semantic areas particularly associated with English expletives, viz. bodily waste, religion and sex. As Table 1 shows, the majority of the expletives are religious both in terms of types and tokens.

The data was examined with a view to establishing in what ways the expletive interjections were actually used in conversation. It was found that they may be used in two distinct ways. In about 20% of the 513 utterances making up my data, the interjections are used merely to signal often involuntary speaker reactions to stimuli of various kinds, for example in exclamations of pain, irritation, surprise etc. I refer to the interjections used in this manner as pure interjections. In the utterances making up the remaining 80% of the data, the expletive interjections were used to carry out the communicative functions of subjectivity, interactivity and textuality (see the discussion of examples (1) – (8)), functions strongly associated with the category of pragmatic markers. In addition, it turned out that all the expletive interjections in this category also satisfied the criteria for membership in the pragmatic marker category listed in Brinton (1996). These findings indicate that, unless they are used as pure interjections, there is every reason to regard expletive interjections as pragmatic markers.

The expletive interjections were also subjected to a discourse-based analysis in terms of the distinction between gap fillers and slot fillers found in Stenström (1991, 1994). The analysis revealed that the majority of the interjections were used as slot fillers, in particular before clauses as in (1), where bloody hell expresses the speaker’s attitude to (the proposition underlying) the following clause:
(1) … bloody hell look at that old codger behind the wheel. (KB7 11226)

The second largest category was the interactive gap fillers used as responses to the immediately preceding utterance, as for instance in example (7), where B uses the same expletive interjection as that found in (1) in response to A’s claim that somebody is 37 years old:

(7) A: She must be 37.
    B: Bloody hell!! (KB1 3983)
The final part of the study explored the distribution of the individual expletive interjections. It was found that certain of them have become highly specialized, for instance cor and gosh, both of which favour the slot filler position “before clause” (cf. Table 3).

References

Aijmer, K. (2002), English discourse particles, evidence from a corpus. Amsterdam & Philadelphia: John Benjamins.
Aijmer, K. (2004), ‘Interjections in a Contrastive Perspective’, in: E. Weigand (ed.) Emotion in Dialogic Interaction: Advances in the Complex. Current Issues in Linguistic Theory Vol. 240, pp. 99-120. Amsterdam & Philadelphia: John Benjamins.
Andersen, G. (2001), Pragmatic Markers and Sociolinguistic Variation. A Relevance Theoretic Approach to the Language of Adolescents. Amsterdam & Philadelphia: John Benjamins.
Brinton, L.J. (1996), Pragmatic Markers in English: Grammaticalization and Discourse Functions. Berlin: Mouton de Gruyter.
Crystal, D. (1997), The Cambridge Encyclopedia of Language. Second edition. Cambridge: Cambridge University Press.
Erman, B. (1987), Pragmatic expressions in English: a study of you know, you see, and I mean in face-to-face conversation. Sweden: Almqvist & Wiksell International.
Erman, B. (2001), ‘Pragmatic markers revisited with a focus on you know in adult and adolescent talk’. Journal of Pragmatics 33: 1337-1359.
Goffman, E. (1978), ‘Response Cries’. Language 54: 787-815.
Hughes, G. (1998), Swearing: a Social History of Foul Language, Oaths and Profanity in English. Oxford: Blackwell.
Hughes, G. (2006), An Encyclopedia of Swearing. Armonk, N.Y.: Sharpe.
McEnery, T. and R.Z. Xiao (2003), ‘Fuck Revisited’. Corpus Linguistics 2003, 28-31.
McEnery, T. and R.Z. Xiao (2004), ‘Swearing in Modern English: the case of Fuck in the BNC’. Language and Literature 13: 235-268.
McEnery, T. (2005), Swearing in English. Bad language, purity and power from 1586 to the present. Abingdon: Routledge.
Sinclair, J. and M. Coulthard (1975), Towards an Analysis of Discourse. The English used by teachers and pupils. Oxford: Oxford University Press.
Stenström, A. (1990), ‘Lexical Items Peculiar to Spoken Discourse’, in: J. Svartvik (ed.) The London-Lund Corpus of Spoken English. Lund: Lund University Press.
Stenström, A. (1991), ‘Expletives in the London-Lund Corpus’, in: K. Aijmer and B. Altenberg (eds.) English Corpus Linguistics. Studies in honour of Jan Svartvik. London & New York: Longman.
Stenström, A. (1994), An Introduction to Spoken Interaction. London & New York: Longman.
Svartvik, J. and R. Quirk (eds.) (1980), A corpus of English conversation. Lund Studies in English 56. Lund: Gleerup.
Svartvik, J. (1991), The London-Lund Corpus of Spoken English. Lund: Lund University Press.
Van Lancker, D. and J.L. Cummings (1999), ‘Expletives: Neurolinguistic and Neurobehavioural Perspectives on Swearing’. Brain Research Reviews 31: 83-104.
Change and constancy in linguistic change: How grammatical usage in written English evolved in the period 1931-1991

Geoffrey Leech and Nicholas Smith
Lancaster University and University of Salford, UK

Abstract

The creation of the Lanc-31 corpus (familiarly known as B-LOB - ‘Before LOB’) completes a trio of matching corpora of standard written British English 1931¹ – 1961 – 1991 on the model of the Brown corpus. The short-term history of English in the twentieth century can therefore now be examined using three equidistant broadly-sampled and comparable corpora of the written language, and it is possible to trace how far trends of change already observed in the comparison of LOB (1961) and F-LOB (1991) have themselves been undergoing change over the period in question. We will present in outline the recent history of a considerable range of grammatical features insofar as it can be learned from frequency counts from these three equivalently-sampled corpora. In many cases examined, the trend of increasing or decreasing frequency observed in the later period (1961-91) is found to be a continuation of a similar trend in the earlier period (1931-61).² In other cases there is change in the rate or direction of change. In other words, there is both constancy and change in the rate of change. We provide tentative explanations of these changes, where appropriate, in terms of grammaticalization, colloquialization, Americanization and densification. Comparable developments in American English, based on analysis of the equivalent Brown and Frown corpora, are traced for the 1961-92 period, and provide insight into the relation between the two regional varieties, mostly showing AmE trends to be in advance of those for BrE.
1. Introduction
The first part of our title may seem tautological, and needs explanation. What is meant, in a linguistic context, by ‘change in linguistic change’? To answer this, we first refer to a methodology that has, over the past decade or so, become quite an established way of studying short-term diachronic change. This is the use of comparable corpora – corpora equivalently sampled from the language, though different in temporal as well as geographical provenance – as a means of identifying rather precisely how the use of the language developed over a period. The Brown quartet of matching corpora (the four corpora of written standard English known as Brown, LOB, Frown and F-LOB) have been used in this way (see a range of publications, including Hundt 1997, Hundt and Mair 1999, Leech 2003, Leech and Smith 2006, Smith 2003, Leech et al. in press) as a means of tracking changes in frequency of use during the period between 1961 and 1991/2 in AmE and BrE. Such studies have focused largely on grammar, as the size of the corpora (c. one million words each), while too limited for lexical studies, is
particularly suitable for grammatical studies. For a given grammatical phenomenon, this methodology can establish not only significant changes in frequency (increase or decrease), but rates of change.³

We have now been able to use, in a provisional form, a further recently completed corpus (see Leech and Smith 2005), from 1931 (± 3 years), the Lanc-31 corpus familiarly known as B-LOB (‘before LOB’). This, matching in every achievable respect the LOB and F-LOB corpora of British English, extends the comparable corpus methodology thirty years further into the past. ‘The Brown quartet’ of corpora now becomes ‘the Brown family’, encompassing three generations of language use. With this additional third sampling period, we have a trio of temporally equidistant corpora for BrE, doubling the length of the period for studying change in this way. But, more than this, the new corpus enables us to identify alterations in the direction and rate of change across time. We may observe, for example, an increase (acceleration) or a decrease (deceleration) in the rate of change of frequency in the post-1961 period. Indeed we might find a change in the direction of change (from increase to decrease of frequency or vice versa), using three-point line charts of a kind that will be abundantly illustrated in this chapter. Some of the kinds of pattern we can now observe are shown in Figure 1.⁴
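The three-point patterns just described can be made concrete with a minimal sketch. It assumes that the "rate of change" is measured as the absolute difference in frequency between adjacent sampling points, which the chapter does not specify; the function name `classify_trend` and the `tolerance` parameter are hypothetical, and the frequencies in the usage lines are invented for illustration.

```python
def classify_trend(f1931: float, f1961: float, f1991: float,
                   tolerance: float = 0.1) -> str:
    """Label a three-point frequency series (e.g. B-LOB, LOB, F-LOB).

    Assumption: the rate of change is the absolute frequency difference
    between adjacent corpora; `tolerance` is a hypothetical threshold for
    treating the two rates as roughly equal ("steady").
    """
    d1 = f1961 - f1931  # change over the first thirty-year period
    d2 = f1991 - f1961  # change over the second thirty-year period
    if d1 == 0 and d2 == 0:
        return "no change"
    if d1 * d2 < 0:
        # increase followed by decrease, or vice versa
        return "change of direction"
    direction = "increase" if d1 + d2 > 0 else "decrease"
    bigger = max(abs(d1), abs(d2))
    if abs(abs(d2) - abs(d1)) <= tolerance * bigger:
        return f"steady {direction}"
    if abs(d2) > abs(d1):
        return f"{direction}: acceleration"
    return f"{direction}: deceleration"


# Invented frequencies, for illustration only
print(classify_trend(100, 200, 300))   # steady increase
print(classify_trend(1000, 500, 400))  # decrease: deceleration
print(classify_trend(100, 150, 120))   # change of direction
```

Comparing relative rather than absolute differences would be an equally defensible choice and could label some series differently.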
[Figure 1: Some patterns of linguistic frequency change: B-LOB – LOB – F-LOB. Panels: (a) Steady increase; (b) Steady decrease; (c) Decrease: deceleration; (d) Increase: acceleration.]

2. An example of 1930s English
To begin with, we present a sample from the 1931 corpus – as a reminder of how the written standard language has changed since then:
(1) Poached eggs require skill in the handling, and must be carefully denuded of water when removed from the pan, for nothing is more distasteful than water-sodden toast. (B-LOB F35, category F: Popular lore)
We can scarcely imagine a cookery book using such a seemingly ‘stilted’ style at the present day. In addition to its general formality of lexical choice (e.g. require, denuded, distasteful), the passage contains three grammatical choices, highlighted with italics, which would be significantly less common today:

(a) Like nearly all modal auxiliaries, the modal auxiliary must illustrated here has declined; indeed, it has declined much more than most other modals (see Section 6).
(b) The passive voice, as in must be… denuded, has also declined markedly (see Section 4).
(c) For, as a conjunction of reason, has declined catastrophically since the 1930s (see Section 4).
3. Possible determinants of grammatical change
Before moving on to a more detailed account of changes of grammatical frequency, it is as well to consider the question: how is it that such changes have been taking place? Are there any general trends that can be observed? Although it may seem like putting the cart before the horse, we believe it will be helpful if we list a number of possible determinants of change right away, elaborating on them later.

(i) Colloquialization has been proposed as an explanation of many changes in our corpora (Mair 1998, Leech et al. in press, especially Chapter 11). New developments in a language, it seems, tend to arise in colloquial speech, and to make their way gradually into the written medium. The trend of written language acquiring habits of spoken language, although by no means general, is one that has been observed through corpus studies going back to the seventeenth century (Biber and Finegan 1989, 1997). Colloquialization can also have a negative side: decline in frequency of a structure strongly associated with formal or literary writing may be attributed to an avoidance of such structures due to colloquial influence. This can be part of the explanation of (b) above, the decline of the passive. Turning to (c), the conjunction for, these days, is rarely found in speech, and so part of its decline may again be a negative effect of colloquialization: forms strongly associated with the literary language may become ‘upstaged’ by the increasing prevalence of more colloquial forms.

(ii) Grammaticalization has been another such explanation, widely accepted in accounting for diachronic change in English (see, for example, Traugott and Hopper 2003). If must has declined so drastically, part of the reason could be that it is competing with verbal idioms such as have to, (have) got to – so-called ‘semi-modals’ whose emergence is a textbook example of grammaticalization (Krug 2000; Tagliamonte 2004).

(iii) Americanization? The evidence provided by the Brown family of corpora – especially the comparison between the British corpora (1961, 1991) and the American corpora (1961, 1992) – often shows AmE to be in the lead or to show a more extreme tendency, and BrE to be following in its wake. Thus must, in our data, has declined more in AmE than in BrE, and has become much rarer than have to and (have) got to in AmE conversational speech.⁵ Users of British English are familiar with lexical changes due to American influence, such as increasing use of movie(s) and guy(s), but grammatical changes from the same source are less noticeable. The question mark after ‘Americanization’ above, however, is a warning that a finding that AmE is ahead of BrE in a given frequency change does not necessarily imply direct transatlantic influence – it could simply be an ongoing change in both varieties where AmE is more advanced. If the term ‘Americanization’ is taken to imply direct influence of AmE on BrE, it should be treated with caution.

(iv) Densification is the tendency for the semantic content of written language to become more compactly expressed (see Biber 2003) – as shown, for example, in the frequency increase of noun + noun sequences and of s-genitives (see Section 7) that has been in progress for at least sixty years, and probably for much longer (see Leonard 1968, Rosenbach 2002, 2006). Strangely, this tendency seems to run counter to colloquialization, since a high frequency of nouns and low frequency of pronouns are strongly characteristic of written, not spoken language (Biber 1989).
Ultimately we can only speculate about the causes of such changes: but these ‘-izations’ have a preliminary explanatory value in showing how individual changes belong to a more general class of changes with apparently similar characteristics and motivations. We now consider these trends in more detail, and later briefly discuss the interconnections between these trends.

4. Colloquialization
The easiest and most canonical illustration of colloquialization is the increasing use of contractions (both verb and negative contractions) in written language over the sixty-year period. Restricting attention to negative contractions in -n’t, we begin by showing this trend in two separate line charts illustrating two different ways of measuring change of frequency. The first, or proportionate, method expresses the frequency of a feature’s occurrence as a percentage of its grammatically and semantically allowable occurrences. This is rather easy to do
in the case of negative contractions: we merely have to count the instances of -n’t and divide this figure by the combined instances of -n’t and of uncontracted not following a finite auxiliary or form of be; for example:

\[
\frac{f(\text{don't})}{f(\text{don't} + \text{do not})}
\]
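The proportionate measure above can be written out as a short sketch. The function name `proportionate_frequency` is hypothetical, and the counts in the usage line are invented for illustration; they are not figures from the corpora.

```python
def proportionate_frequency(contracted: int, uncontracted: int) -> float:
    """Contracted forms as a percentage of all allowable occurrences.

    `contracted` counts forms such as don't; `uncontracted` counts the
    matching full forms such as do not (after a finite auxiliary or be).
    """
    total = contracted + uncontracted
    if total == 0:
        raise ValueError("no contracted or uncontracted forms were counted")
    return 100.0 * contracted / total


# Invented counts: 300 instances of don't against 700 of do not
print(proportionate_frequency(300, 700))  # 30.0
```

The denominator is what makes the method demanding: every uncontracted alternative has to be retrieved and counted, which is straightforward for not but much harder for features with open-ended sets of alternatives.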
[Figure 2: Contracted form n’t in BrE as a proportion of all not-contractions. Line chart; x-axis: 1931, 1961, 1991; y-axis: 0%–70%; series: Press, Gen Prose, Learned, Fiction, Overall.]

We see from Figure 2 that not-contractions have increased markedly since 1931, and that this increase has been steady and consistent in the two periods prior to and after 1961. To compare broad groupings of text genres, we have divided each corpus into four subcorpora, which we will refer to as:

‘Press’ (text categories A-C)
‘General Prose’ (text categories D-H)
‘Learned’ (text category J)
‘Fiction’ (text categories K-R)
In comparing the subcorpora, it is no surprise to find that Fiction (the most speech-related subcorpus, because of its extensive inclusion of dialogue) shows by far the highest frequency of contractions, and that Learned (which is remote from speech, consisting chiefly of academic writing) shows the lowest frequency, close to 0%. The intermediate categories General Prose and Press come between Learned and the overall percentage for the whole corpus (shown by a thick black line), but of these two subcorpora, Press shows a sharper increase than General
Prose. The whole picture, however, is remarkably consistent: each subcorpus, even to a small degree the Learned category, shows a constantly increasing use of contractions.

The second way of measuring increase of frequency is the normalization method, which for us is simply to count occurrences per million words (pmw). This is close to the raw frequency count in each corpus, and might be considered a ‘rough and ready’ measure. However, our next chart (Figure 3) indicates a result closely similar to the more sophisticated proportionate frequency measure. In both Figure 2 and Figure 3, all four subcorpora show

(a) a steady and consistent increase,
(b) the same initial order of frequency (Fiction, Press, Gen. Prose, Learned),
(c) a divergence between Press and Gen. Prose, the former climbing more steeply than the latter, and moving from a position where contractions in Press are slightly less frequent than in Gen. Prose, to one where they are substantially more frequent.
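The normalization method described above amounts to a single scaling step, sketched here. The function name `per_million_words` is hypothetical, and the counts and the subcorpus size in the usage lines are invented for illustration.

```python
def per_million_words(count: int, corpus_size: int) -> float:
    """Normalize a raw count to occurrences per million words (pmw)."""
    if corpus_size <= 0:
        raise ValueError("corpus size must be positive")
    return count * 1_000_000 / corpus_size


# For a corpus of exactly one million words, pmw equals the raw count.
print(per_million_words(5_000, 1_000_000))  # 5000.0

# Invented subcorpus of 176,000 words containing 440 hits
print(per_million_words(440, 176_000))  # 2500.0
```

Normalization matters most when subcorpora of different sizes are compared, since raw counts from a small subcorpus would otherwise understate its frequency.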
[Figure 3: Contracted form n’t in BrE expressed as frequency per million words. Line chart; x-axis: B-LOB, LOB, F-LOB; y-axis: 0–8,000; series: Press, Gen Prose, Learned, Fiction, Overall.]

In the rest of the chapter, therefore, we will rely on the normalized (pmw) method as the most convenient and often the only viable one. The main reason is that the proportionate method requires a clearly definable set of alternatives, the occurrences of which have to be counted to determine the non-occurrences of the feature being examined, and such a set of alternatives often cannot be easily defined, let alone easily retrieved from the corpora. Consider, for example, the modal must: alternatives to must should include not only semi-modals expressing
obligation/necessity, such as have to and (have) got to, but other means of expressing a similar meaning, such as main verbs (need, require), adverbs (necessarily, inevitably), adjectival constructions (it is necessary to; we are obliged to), and possibly the uses of need and necessity as nouns. And these are only a sample of the alternatives to must. There is no easy way of drawing a boundary line to identify ‘non-occurrences of must’.

From a negative viewpoint, colloquialization also seems to be the main explanation for a dramatic change in the frequency of relativization devices, especially between 1961 and 1991. The following charts show (a) a dramatic increase in the frequency of relative clauses introduced by that, and (b) a corresponding (though less dramatic) decrease in the frequency of relative clauses introduced by wh- pronouns (which and who/whom/whose).

[Figure 4: Increasing use of that-relative clauses 1961-1991/2 in AmE (Brown → Frown) and BrE (LOB → F-LOB): frequencies pmw. Line chart; x-axis: 1931, 1961, 1991; y-axis: 0–3,000; series: AmE (est.), BrE.]

Figure 4 compares the increases in that-relatives in BrE and AmE since 1961, showing that the American increase is even steeper than the British (the amount of hand-editing required to count that-relatives, as compared with wh-relatives, dissuaded us from looking at that-relatives in the B-LOB corpus). Figure 5, on the other hand, shows a steady decline in wh-relatives in BrE since 1931, while the figures for AmE (available only for 1961-91) again show a somewhat more extreme trend.
[Figure 5: Wh-relatives in AmE and BrE (frequencies pmw). Line chart; x-axis: 1931, 1961, 1991; y-axis: 0–9,000; series: AmE, BrE.]

Analysis of the subcorpora indicates that, while relative clauses as a whole are strongly biased towards expository writing, wh-relatives are particularly associated with Learned texts, whereas that-relatives are more evenly spread in the corpora. In the Brown family of corpora, the wh-relatives (which of course have a virtual monopoly of non-restrictive clauses) are predominant throughout. Overall, in Brown the frequencies of wh-, that and zero relatives are in the proportion 68% : 21% : 11% (changing to 54% : 35% : 12% in Frown). In LOB, the proportions are 74% : 14% : 12% (changing to 70% : 17% : 13% in F-LOB). The predominance is huge in the LOB Learned texts, where the ratios are 84% : 11% : 5%. In LOB Fiction, however, these types are more evenly spread, with the proportions 53% : 22% : 25%, while the wh- relatives still remain in the majority. Comparing Learned, as the most formal-informative variety, with Fiction, the variety closest to speech, we note in the above comparisons that wh- relatives have a distribution contrasting with that relatives, which have their strongest representation (in percentage terms) in Fiction writing.
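Proportions of the kind quoted above (wh- : that : zero) follow from raw counts of each relativizer type. A minimal sketch, with the hypothetical function name `percentage_shares` and invented counts, shows the calculation and also why rounded shares need not sum to exactly 100%, as in the Frown figures (54% : 35% : 12%).

```python
def percentage_shares(counts: dict[str, int]) -> dict[str, int]:
    """Express each count as a whole-number percentage of the total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no occurrences to compare")
    return {name: round(100 * n / total) for name, n in counts.items()}


# Invented counts, for illustration only
print(percentage_shares({"wh-": 740, "that": 140, "zero": 120}))
# {'wh-': 74, 'that': 14, 'zero': 12}

# Rounding each share separately can push the sum to 101
rounded = percentage_shares({"wh-": 537, "that": 347, "zero": 116})
print(rounded, sum(rounded.values()))
# {'wh-': 54, 'that': 35, 'zero': 12} 101
```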
[Figure 6: Wh-relatives in BrE (frequencies pmw). Line chart; x-axis: 1931, 1961, 1991; y-axis: 0–10,000; series: Press, Gen Prose, Learned, Fiction, Overall.]

Figure 6 breaks down the decline of wh-relatives in Figure 5 into subcorpora, showing that wh-relatives have by far the lowest frequency in the Fiction subcorpus, which is in general closest to the spoken language, and hence likely to reflect colloquial influences.⁶ This lends plausibility to the view that the increasing trend to favour that-relatives and disfavour wh-relatives is an aspect of colloquialization. On the other hand, in AmE, it is a massive decline (-34.9%) of which alone in relative clauses that accounts for the overall decline of wh-pronouns, and this has much to do with prescriptive influences.⁷

Our remaining examples of the influence of colloquialization are again negative ones, showing the decline of formal features, some already noted in Section 2. Figures 7 and 8 display the already-noted declining frequency of the be-passive. Figure 7 shows AmE considerably ahead of BrE in the overall passive decline. Figure 8 breaks down the passive decline into subcorpora. Here the four subcorpora are in the opposite order to the order observed with contractions: Learned shows by far the highest frequency of passives, and Fiction shows the lowest. This is consistent with the view that the passive may have been declining because of disassociation with colloquial usage.⁸
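Percentage changes such as the -34.9% quoted above compare a later frequency with an earlier one. The sketch below shows the arithmetic; the function name `percent_change` is hypothetical, and the two frequencies are invented values chosen only so that the result reproduces a change of -34.9%. They are not the corpus figures.

```python
def percent_change(earlier: float, later: float) -> float:
    """Change from `earlier` to `later` as a percentage of `earlier`."""
    if earlier == 0:
        raise ValueError("percentage change is undefined for a zero baseline")
    return 100.0 * (later - earlier) / earlier


# Invented frequencies (pmw), not taken from Brown or Frown
change = percent_change(1000, 651)
print(f"{change:+.1f}%")  # -34.9%
```

A percentage change of this kind is relative to the starting frequency, so it should not be confused with a difference in percentage points between two proportions.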
[Figure 7: The be-passive in AmE and BrE (frequencies pmw). Line chart; x-axis: 1931, 1961, 1991; y-axis: 0–16,000; series: BrE, AmE.]

[Figure 8: The be-passive in BrE (frequencies pmw). Line chart; x-axis: 1931, 1961, 1991; y-axis: 0–25,000; series: Press, Gen Prose, Learned, Fiction, Overall.]

However, the picture is not quite so straightforward as it was with contractions. It seems that passives show no decline in General Prose, and indeed show an increasing trend in Learned, before the 1960s. This needs further investigation, but conceivably the ethos of dispassionate impersonality, encouraged in serious
writing and particularly in science, was still in the ascendant up to the middle decades of the century, but has since lost influence. An additional reason for a passive decline, probably increasing through the century, has been the hostility (especially in the US) of prescriptive forces – including usage gurus, house style manuals, crusaders in favour of ‘plain English’, and latterly, grammar checking software – all either overtly or covertly disparaging the use of the passive.⁹

[Figure 9: Conjunction for in BrE (frequencies pmw). Line chart; x-axis: 1931, 1961, 1991; y-axis: 0–700; series: Press, Gen Prose, Learned, Fiction, Overall.]

Another previously-mentioned case for negative colloquialization is the conjunction for, illustrated by the following:

(2) A proprietary remedy should be used, for this is better than any homemade one. (B-LOB, E36)

(3) In the first place, the statement that a real crime is one about which the good citizen would feel guilty is surely circular. For how is the good citizen to be defined in this context unless as one who feels guilty about committing the crimes that Lord Devlin would classify as ‘real’. (F-LOB, G52)
The corpora attest some ambivalence about the status of for. It is not a fully-fledged subordinator, as unlike its competitors because, as and since, it cannot precede the matrix clause. Example (3) illustrates its use as a sentence-initial form, more akin to a sentence adverb: a proportionately increasing tendency. The remarkable decrease of this conjunction from nearly 600 per million words (pmw) to little more than 200 shows an accelerating decline over the sixty years.
However, here the pattern of subcorpora suggests a somewhat different explanation from that of the passive. Fiction and General Prose are somewhat more retentive of for than the other subcorpora: a sign, perhaps, that for has become increasingly restricted to use in ‘literary’ contexts. It is noticeable that Press, usually in the vanguard of change (see Hundt and Mair 1999), shows the least use of for overall.

Yet a further plausible example of negative colloquialization is the similar fate of the preposition upon, which in many contexts can be used as a more formal or literary variant of on:

(4) But my ideal society would be based upon a certain fundamental personal and social equality. (B-LOB, G65)

(5) The reverse design is based upon the Combined Operations badge of WWII, but with modifications. (F-LOB, E25)

[Figure 10: Preposition upon in BrE (frequencies pmw). Line chart; x-axis: 1931, 1961, 1991; y-axis: 0–1,200; series: Press, Gen Prose, Learned, Fiction, Overall.]

Like for (as a conjunction), upon undergoes a precipitous decline between the 1930s and the 1990s, but in this case most of the loss (in raw frequency terms) takes place in the first thirty years. The pattern of subcorpora is also different, indicating that, at least in the second thirty years, it is the Learned subcorpus that is more retentive of upon. But, as in the cases of the passive and for, Press shows itself the most ‘advanced’ subcorpus in its growing avoidance of this preposition.
5. Americanization
We have already observed more than one trend in which AmE leads while BrE follows some way behind. This has been seen in Figure 4 (the growing use of that-relatives), Figure 5 (the declining use of wh-relatives) and Figure 7 (the declining use of the passive). Other changing frequency trends dealt with in this chapter could also be cited: the increasing use of contractions, the declining use of modal auxiliaries (Figure 13), particularly of must (Figure 14), the increasing use of noun + noun sequences (Figure 22) and of genitives (Figure 20). All these show the characteristic pattern of AmE leading BrE in frequency change and (often) in the steepness of the frequency change. Only one example in this chapter illustrates the opposite phenomenon, whereby BrE leads AmE, and this is the case of semi-modals, on which we will comment in the next section.

To give a particularly compelling illustration of AmE ‘leadership’ in change, we turn to the case of infinitive complementation of the verb help. The contrast we are considering is that between help + to-infinitive and help + bare infinitive, as exemplified by (6) and (7):

(6) …he acts as a detective, helping to unravel the mystery. (F-LOB G09)

(7) Such fame helped unlock the coffers of the Treasury…. (F-LOB G38)
Figure 11 shows from 1931 an increasing use in BrE of the bare infinitive construction, which accelerates after 1961 following a trail already blazed by AmE. In 1931 this construction is apparently rare in BrE, the to-infinitive being much more common. After 1961, the to-infinitive in BrE declines, again following the AmE trend, so that by 1991 the bare infinitive has overtaken the to-infinitive as the more common construction. From 1961, BrE seems to follow almost slavishly the respective increase and decline of the two constructions in AmE.

It is notable that Figure 11 illustrates a rather rare pattern, in which an increase in 1931-61 is followed by a decrease. This applies to the use of the help to construction in BrE. Admittedly, the differences are not statistically significant with these relatively small numbers, but a change in the direction of frequency change may well be a signal of changing circumstances – perhaps in this case a sign of increasing American influence towards the end of the century. The credibility of this explanation is backed up by another example of a change of direction, that of the mandative subjunctive illustrated by (8):

(8) ‘Conditions have dictated that operations be scaled down…’ (F-LOB A38)
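The caveat above about statistical significance with small numbers can be illustrated with a short sketch. The passage does not say which test was applied; the sketch assumes the log-likelihood statistic (G²) that is widely used for comparing frequencies across two corpora, and that choice is an assumption rather than the authors' stated method. The function name `log_likelihood` and all counts are hypothetical.

```python
import math


def log_likelihood(count1: int, count2: int, size1: int, size2: int) -> float:
    """G2 for one item's frequency in two corpora (assumed test, see lead-in)."""
    total = count1 + count2
    # Expected counts if the item were equally frequent in both corpora
    expected1 = size1 * total / (size1 + size2)
    expected2 = size2 * total / (size1 + size2)
    g2 = 0.0
    for observed, expected in ((count1, expected1), (count2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


CRITICAL_P05 = 3.84  # chi-square critical value, 1 degree of freedom, p < 0.05

# Invented counts in two one-million-word corpora
small = log_likelihood(12, 9, 1_000_000, 1_000_000)
large = log_likelihood(30, 10, 1_000_000, 1_000_000)
print(small > CRITICAL_P05)  # False: 12 vs 9 is within chance variation
print(large > CRITICAL_P05)  # True: 30 vs 10 exceeds the threshold
```

With counts in single or low double figures, even a visible change of direction on a line chart can fall below the threshold, which is why the trends for help and the mandative subjunctive are presented as suggestive rather than established.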
[Figure 11: To- vs bare infinitives with help (all construction types) in AmE and BrE (frequencies pmw). Line chart; x-axis: 1931, 1961, 1991; y-axis: 0–250; series: help + to-inf. (AmE), help + to-inf. (BrE), help + bare inf. (AmE), help + bare inf. (BrE).]

[Figure 12: Mandative use of subjunctive and modal should in BrE (freqs. pmw). Line chart; x-axis: 1931, 1961, 1991; y-axis: 0–120; series: subjunctive, should.]
Again, the numbers are small and non-significant, but the change of direction in Figure 12 (this time from decrease to increase) is indicative of a revival of the mandative subjunctive in BrE which has been confirmed in other studies (notably that of Övergaard 1995). The upper line in Figure 12 represents the alternative construction of should-periphrasis (as in operations should be scaled down…), which has been declining in BrE, whereas the subjunctive, which is the dominant construction in AmE, has changed from a virtually terminal decline prior to 1961 to a small but appreciable increase. As the subjunctive in BrE is associated with a more formal register range than should-periphrasis, this change of direction goes against colloquialization, and the only credible explanation seems to be American influence.

6. Grammaticalization, modals and semi-modals
With modals and semi-modals, the three factors of grammaticalization, colloquialization, and Americanization appear to come into play together. Commonly attributed to grammaticalization, as already noted, is the progressively increasing frequency of semi-modals such as have to, be going to and want to (see Krug 2000). Overall, the core modals remain more or less level (although slightly decreasing) between B-LOB and LOB (1931-1961), but then a decline of around 10% sets in – see Figure 13.
[Figure 13: Core modals in AmE and BrE (frequencies pmw). Line chart; x-axis: 1931, 1961, 1991; y-axis: 0–16,000; series: BrE, AmE.]
(The modals, for this purpose, comprise will, would, can, could, may, might, shall, should, ought to, need (+ bare infinitive), and the contracted forms ’ll, ’d, won’t, shouldn’t, etc.) As illustrated in Figure 13, AmE is further ahead than BrE in this trend, something that is more dramatic and noteworthy if we look at must alone: 1400
1200
1000
800 BrE AmE
600
400
200
0 1931
1961
1991
Figure 14: Must in AmE and BrE (frequencies pmw) [line chart; series: BrE, AmE; x-axis: 1931, 1961, 1991]

In contrast, the semi-modals increase by about the same amount in 1961-1991. However, in the Brown family of corpora, core modals as a class are about 6.5 times as frequent as semi-modals in BrE in 1961, moderating to about 5.5 times as frequent in 1991 (see Leech et al. in press, Chapters 4 and 5). (The semi-modals included here are be able to, be going to, (be) supposed to, be to, (had) better, (have) got to, have to, need to, want to, and their contracted or reduced forms gonna, ’s to, (’d) better, ’ve got to, wanna, etc. The reasons for including be able to and need to as marginal members of this list are discussed in Leech et al. in press, Chapter 5.)

Here we have an apparent exception to the ‘rule’ that AmE leads and BrE follows. In Figure 15, AmE starts from a lower point in 1961 and ends at a lower point in 1991. But too much store should not be laid on this finding. The semi-modals are a diffuse group of verbs, which remains relatively rare (apart from have to) in written English even up to the 1990s. On the other hand, other corpora¹⁰ provide evidence of far greater frequency of semi-modals in spoken English, and especially in spoken American English. A study of equivalent spoken subcorpora extracted from the Diachronic Corpus of Present-day Spoken English (DCPSE)¹¹ suggests a strong growth (a percentage growth of nearly 37%) of semi-modal usage in British speech from the 1960s to the 1990s.
Figure 15: Selected semi-modals in AmE and BrE (frequencies pmw) [line chart; series: BrE, AmE; x-axis: 1931, 1961, 1991]

On pure frequency grounds, the Brown family provides little evidence that semi-modals are encroaching on the territory of the core modals. On the other hand, in spoken AmE there is much more potential for rivalry between modals and semi-modals (see fn. 5), and it is conceivable that the decline of core modals in written English since 1961 is an indirect reflection of this rivalry – an argument strengthened by the greater decrease in frequency of core modals in spoken English (according to our study of DCPSE) than in written English.

It is worth observing a possible synergy here between the trends of grammaticalization and colloquialization. As far as the semi-modals are concerned, grammaticalization is a phenomenon of the spoken language.¹² Semi-modals have yet to make huge inroads into the written language, but the increase of semi-modals (9.0% noted between LOB and F-LOB) can reasonably be attributed to colloquialization. Some semi-modals, including (have) got to and be going to, do not increase at all between LOB and F-LOB, and one explanation that suggests itself for this is that some kind of ‘prestige barrier’ discourages the use of these forms in published writing. (Particularly the avoidance of forms of get in the written language, a well-known taboo, might account for the low and even declining usage of (have) got to.) The lower frequency of semi-modals in the American written corpora might again be due to such a ‘prestige barrier’, which could well be more powerful on the west side of the Atlantic.
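The figures quoted for written BrE can be cross-checked against one another. The sketch below starts from the 1961 ratio of about 6.5 core modals per semi-modal, applies the 9.6% fall in modals (note 3) and the 9.0% rise in semi-modals, and projects the 1991 ratio. The function name `projected_ratio` is hypothetical, and because all three inputs are rounded values from the text, only approximate agreement with the reported "about 5.5" should be expected.

```python
def projected_ratio(ratio_start: float,
                    modal_change_pct: float,
                    semi_modal_change_pct: float) -> float:
    """Project a modal/semi-modal ratio after percentage changes in each class."""
    modal_factor = 1 + modal_change_pct / 100
    semi_modal_factor = 1 + semi_modal_change_pct / 100
    return ratio_start * modal_factor / semi_modal_factor


# Inputs are the rounded figures quoted in the text for LOB -> F-LOB.
ratio_1991 = projected_ratio(6.5, -9.6, 9.0)
print(round(ratio_1991, 1))  # 5.4
```

A projected value of roughly 5.4 is in line with the reported ratio of about 5.5, which suggests that the percentages and the ratios describe the same shift.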
In the case of must and its semi-modal rivals, a more specific explanation suggests itself. We can contrast the decreasing use of must in Figures 14 and 16 with a big increase in the use of have to and need to (Figures 17 and 18).

Figure 16: Must in BrE (frequencies pmw) [line chart; series: Press, Gen Prose, Learned, Fiction, Overall; x-axis: 1931, 1961, 1991]

Figure 17: Have to in AmE and BrE (frequencies pmw) [line chart; series: BrE, AmE; x-axis: 1931, 1961, 1991]
Figure 18: Need to in AmE and BrE (frequencies pmw) [line chart; series: BrE, AmE; x-axis: 1931, 1961, 1991]

In the case of must, Learned is the most retentive subcorpus and Press the least retentive, not a surprising result in view of the reputations of these varieties as respectively conservative and innovative (see Hundt and Mair 1999). The subcorpora here do not follow a consistent pattern, but overall the decline of must shows an accelerating trend in the more recent period. Have to and especially need to, in contrast, show sharply increasing frequency profiles. A late and largely unacknowledged newcomer to the class of semi-modals, need to has only recently started to ‘take off’ (see Smith 2003, Taeymans 2004), and is much less frequent than have to and must. It is the steepness of its increase, particularly in 1961-1991, that commands attention. The more frequent semi-modal have to, on the other hand, shows a greater increase in 1931-1961. The evidence provided by the three-point line charts suggests that have to began to approach the peak of its frequency climb earlier, and its increase was decelerating at the time when that of need to was accelerating.

We have argued elsewhere (Smith 2003, Leech 2003; cf. also Myhill 1995) that the exceptionally steep decline in frequency of must is likely to have been influenced by attitudinal factors. The root use of must is associated with the speaker as the deontic centre, and hence with an authoritarian tone, particularly when combined with you and we as subjects:

(9) ‘Well, you can’t [go home]. You’re ill. The doctor says you must stay….’ (F-LOB K27)
(10) But to compete with the world we must adapt to the 21st century. Not the 19th. (F-LOB B14)
In contrast, have to and need to typically avoid the face-threatening tone of must. This is especially true of need to, which typically attributes the necessity or obligation to factors internal to the human protagonist, and hence typically stresses beneficial aspects of the constraint:

(11) Nevertheless the picture in the mind of western man seriously needs to be corrected. The Hungarian people are no longer poor or oppressed according to their standards. (LOB E22)

(12) We may all need to become more aware of how we use water, to learn the ways of managing and conserving supplies. (F-LOB F09)

Notice that if must were used in examples like (11) and (12), the implied obligation on the addressee would seem far more face-threatening.

7. Densification
To illustrate densification, we move from verbal to nominal constructions. Two of the most salient trends observed over the sixty-year period are the increase of noun+noun¹³ sequences and the increase of s-genitives. These can both be considered ways of achieving greater compactness of meaning in the noun phrase, a trend which has been particularly associated with Press writing since the early decades of the twentieth century, and which is found in its quintessential form in newspaper passages such as:

(13) the aviation and casino kingpin Kirk Kerkorian finally sold MGM’s film entertainment division to Pathe boss Giancarlo Parretti in November. (Frown A43)

(All underlined words in (13) are nouns.) Measuring compactness in terms of the number of words expressing a given concept, we notice that in the following, compared with a prepositional construction, the use of a noun + noun sequence or s-genitive + noun reduces the number of words by a third:

(14) (a) N1 of N2: the cigarette lighter of a car
     (b) N2’s N1: a car’s cigarette lighter (Brown, E04)
     (c) N2 N1: a car cigarette lighter
Densification manifests itself in: (a) a decrease of the use of of (Figure 19), (b) a steep rise in s-genitives (Figure 20) and (c) a steep rise also in noun + noun sequences (Figure 21).
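The counting rule for noun + noun sequences set out in note 13 (the second element must be a common noun, and a run of three or more nouns yields overlapping counts) can be sketched as below. This is an illustration under stated assumptions, not the authors' procedure: the tags `NOUN` and `PROPN` are simplified placeholders rather than C8 tags, the function name `count_noun_noun` is hypothetical, and allowing a proper noun in first position is one reading of ‘noun + common noun’.

```python
from typing import Iterable, Tuple

COMMON = "NOUN"    # placeholder tag for common nouns (not the C8 tagset)
PROPER = "PROPN"   # placeholder tag for proper nouns (not the C8 tagset)


def count_noun_noun(tagged: Iterable[Tuple[str, str]]) -> int:
    """Count adjacent noun + common-noun pairs, allowing overlaps."""
    tokens = list(tagged)
    count = 0
    # Slide over adjacent pairs, so a three-noun run contributes two pairs.
    for (_, first_tag), (_, second_tag) in zip(tokens, tokens[1:]):
        if first_tag in (COMMON, PROPER) and second_tag == COMMON:
            count += 1
    return count


# Part of example (13), tagged by hand for illustration.
phrase = [("film", COMMON), ("entertainment", COMMON), ("division", COMMON),
          ("to", "ADP"), ("Pathe", PROPER), ("boss", COMMON),
          ("Giancarlo", PROPER), ("Parretti", PROPER)]
print(count_noun_noun(phrase))  # 3
```

Here film entertainment division contributes two pairs and Pathe boss one, while boss Giancarlo and Giancarlo Parretti are skipped because their second element is a proper noun.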
Figure 19: Preposition of in BrE (frequencies pmw) [line chart; series: Press, Gen Prose, Learned, Fiction, Overall; x-axis: 1931, 1961, 1991]

Figure 20: S-genitive in BrE (frequencies pmw) [line chart; series: Press, Gen Prose, Learned, Fiction, Overall; x-axis: 1931, 1961, 1991]
Figure 21: Noun+noun sequences: BrE (frequencies pmw) [line chart; series: Press, Gen Prose, Learned, Fiction, Overall; x-axis: 1931, 1961, 1991]

The decreasing use of of as shown in Figure 19 is, of course, only a rough and ready indication of what is happening in the noun phrase. Most ofs occur in the noun phrase, but only a minority of them fall into the category of of-genitives, where the s-genitive construction could be substituted. We examined a randomized 2% sample of the ofs in LOB and F-LOB, and discovered that the frequency loss of the of-genitive, based on this limited sample, was 24%, and therefore commensurate with the increase of the s-genitive over the same 1961-91 period. That period also showed a loss of 5% of prepositions as a whole, but the loss of of represented in Figure 19 was greater than this overall prepositional loss. It is reasonable to speculate that the decreasing use of of shown in Figure 19 is in part due to a tendency to switch from of-genitives to s-genitives over the period in question.

Although the increase in noun + noun sequences over the period has been highly significant in all subcorpora, it is the increase in s-genitives that invites most scrutiny. The remarkable rise in frequency over sixty years in Press emphasises that it is journalistic writing that above all spearheads this change. The subcorpus that might be thought to be the most natural home for the s-genitive is Fiction, because of the supposed restriction of this construction to human nouns. But Fiction conspicuously lacks the sharp increase of s-genitive frequency in Figure 20, and so the chart could lend itself to another suggestion: that the increase is not due to increase of human nouns (which would predominate in Fiction texts), but to the extension of the genitive to other ‘quasi-human’ or inanimate categories of noun. However, studies focusing on the genitive (Rosenbach 2002, Hinrichs and Szmrecsanyi 2007) have failed to detect any categorical association between increasing use of the s-genitive and extension of its use to a broader range of nominal categories. There is no doubt that increasing use of the genitive with inanimate possessors is one of the major factors of change, particularly in AmE, but as Rosenbach puts it (2002: 271) ‘the choice between the two genitive variants … is not a matter of categoriality but of preference’.

Part of the explanation for the meteoric rise in the use of the s-genitive, as well as for the similar growth in the use of noun + noun sequences, appears to be that genres of expository writing, above all Press, are moving towards a more densely information-packed style in the use of noun phrases (cf. Biber and Clark 2002). Bearing in mind that English-speaking society has been increasingly dominated by mass media and the information ‘overload’ of recent decades, this is not an implausible explanation, and would explain why the increasing densification trend is strongest in the Press (see Figures 20 and 21). From Figure 22 it is apparent that AmE has led BrE in this increasing use of noun + noun sequences.
Figure 22: Noun+noun sequences in AmE and BrE (frequencies pmw) [line chart; series: BrE, AmE; x-axis: 1931, 1961, 1991]

8. Conclusion: the Constancy or Inconstancy of Change
Of the four processes put forward as explanatory hypotheses, three (grammaticalization, Americanization and colloquialization) can often be seen as cooperative trends. For example, the increasing use of semi-modals suggests a synergy of grammaticalization (accounting for increasing use of semi-modals in speech), colloquialization (leading to a similar, though less pronounced, development in writing) and Americanization (proposing that the more extreme trend in AmE, especially AmE speech, is being increasingly followed in BrE). The fourth hypothesis, densification, refers however to a development primarily associated with the written language, and if anything ‘anti-colloquial’ in its effects. Further research will be needed to give more substance to these claims.

The three-point line charts have the potential to show how change changes. But if anything, it is constancy of change that is most evident in the three-point line charts we have presented. That is, of the patterns illustrated in Figure 1, it is the patterns (a) and (b), which show little or no alteration in the increase or decrease of frequency, that are most noticeable. This is also the picture often shown by subcorpora, a further confirmation of the impression that in many cases the increase or decrease of frequency between 1961 and 1991 is simply a continuation of a trend already in progress over the previous thirty years. On the other hand, where there has been acceleration or deceleration (patterns (c) and (d) of Figure 1), it is tempting to speculate on the alteration of rate of change. All four possible patterns are observed, and are exemplified below:

(a) Decelerating decrease: wh-relative clauses (Figure 5)
(b) Accelerating decrease: the passive (Figure 7)
(c) Decelerating increase: have to (Figure 17)
(d) Accelerating increase: need to (Figure 18)

In addition, we see both kinds of change of direction:

(e) Increase followed by decrease: help to + infinitive (Figure 11)
(f) Decrease followed by increase: mandative subjunctive (Figure 12)
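The six patterns listed above can be told apart mechanically from three frequency readings. The following is a minimal sketch, assuming that the two intervals are of equal length (thirty years each), that change is compared in absolute pmw terms, and that a hypothetical `tolerance` parameter decides when two rates count as the same; the authors may well judge the patterns differently, for instance by proportional change or by inspection of the charts.

```python
def classify_trend(f1931: float, f1961: float, f1991: float,
                   tolerance: float = 0.1) -> str:
    """Label the shape of a three-point frequency trajectory."""
    first = f1961 - f1931    # change over 1931-1961
    second = f1991 - f1961   # change over 1961-1991
    if first == 0 and second == 0:
        return "no change"
    if first > 0 and second < 0:
        return "increase followed by decrease"
    if first < 0 and second > 0:
        return "decrease followed by increase"
    direction = "increase" if first + second > 0 else "decrease"
    a, b = abs(first), abs(second)
    # Treat the rate as constant when the two steps differ by at most
    # `tolerance` (as a proportion of the larger step).
    if abs(b - a) <= tolerance * max(a, b):
        return f"constant {direction}"
    pace = "accelerating" if b > a else "decelerating"
    return f"{pace} {direction}"


# Hypothetical pmw values, chosen only to exercise each branch.
print(classify_trend(100, 200, 250))  # decelerating increase
print(classify_trend(300, 250, 100))  # accelerating decrease
print(classify_trend(100, 80, 120))   # decrease followed by increase
```

The choice of `tolerance` is arbitrary here; a stricter analysis would test whether the difference between the two rates is statistically significant.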
Of these, (a) lacks any obvious explanation, (c) and (d) might possibly be explained as different phases in the development of semi-modals as a consequence of grammaticalization, and the remaining cases seem to fit a pattern of increasing American influence in the 1961-91 period. Thus (b) reflects a steeper decline of the passive in AmE possibly due to prescriptivism; and in (e) and (f) the change of direction shows a convergence towards American preferences. Although the provisional nature of these findings counsels caution, the addition of the 1931 corpus to the ‘Brown family’ brings new precision, depth and insight to our understanding of the recent grammatical history of English.

Notes

1 For practical reasons, the 1931 corpus is actually sampled from the period 1928-34, centring on the year 1931. The corpus is in a provisional state, but is likely to change little. It is to be released within the next two years. A fourth matching corpus (centring on the year 1901) is now nearing completion at Lancaster. For convenience, we refer to the periods covered by the corpora as 1931-1961-1991, although 1931 (as just explained) is something of an idealization, and 1991 (the date of the F-LOB corpus) is slightly inaccurate for the corresponding AmE corpus, Frown, whose texts date from 1992.

2 We are indebted to Marianne Hundt and Christian Mair for their collaboration in research on this trio of corpora; also to Paul Rayson for collaboration and support in the development of the B-LOB corpus. Financial support for the compilation of this corpus was generously provided by the Leverhulme Trust, in the form of an Emeritus Fellowship.
3 For example, according to the F-LOB and LOB corpora, the frequency of modal auxiliaries in written British English has declined by 9.6% in thirty years: a finding which is significant at a very high level (p < .001).

4 Statistical significance at the p < .05, p < .01 and p < .001 levels is not cited in this chapter, although most increases and decreases of frequency shown in the charts are significant at least to the p < .05 level. Where they are not significant, this will be stated in the text.

5 In Leech et al. (in press), in Chapter 5, Tables 5.2 and 5.3, we report relative frequencies of modals and semi-modals in two large and comparably sampled corpora of conversation from the 1990s: the Longman Corpus of Spoken American English and the demographic subcorpus of the British National Corpus. We found that in AmE conversation, must was less than half as frequent as need to and (have) got to; and less than one-eighth as frequent as have to. In BrE conversation, the differences were less extreme, but still must was less than half as frequent as have to, and also appreciably less frequent than (have) got to. (Instances transcribed gotta were counted with (have) got to.)

6 This is confirmed, in proportionate terms, by a frequency comparison of relative clause types across genres, including fiction, speech, and academic writing, in Biber et al. (1999: 610-611).

7 In the US it has been a widely-held prescriptive rule that that should be used instead of which in introducing restrictive relative clauses. See Arnold Zwicky’s ‘Language Log’ posting, July 4, 2005 on ‘the sacred That rule’: http://itre.cis.upenn.edu/~myl/languagelog/archives/002291.html#more

8 See Seoane and Loureiro-Porto (2005) and Seoane and Williams (2006) on colloquialization and the declining use of passives in scientific English.
9 The disparagement of the passive in favour of the active can be found in many influential style guides and the like; see, for example, Strunk and White (2000: 18).

10 Approximate figures from the AmE and BrE conversation corpora discussed in note 5 yield the following ratios: in AmE, 1.79 modals per semi-modal; in BrE, 2.79 modals per semi-modal. These contrast with the ratios for written language in the 1990s corpora Frown and F-LOB: 5.3 and 5.5 modals per semi-modal respectively. The frequency gap between modals and semi-modals is obviously far narrower in speech (especially American speech) than in writing.

11 We are grateful to Bas Aarts for making a copy of this corpus available to us before its official release, to allow us to make comparisons of the kind discussed here. As the earlier part and the later part of the DCPSE were collected under different conditions, and samples from the 1960s and 1990s were not strictly comparable, we extracted ‘mini-subcorpora’ of approximately 140,000 words from the periods 1958-69 and 1990-1992 with the aim of achieving as nearly as possible the comparability of corpora from the 1960s and 1990s, as found in the Brown family. The mini-corpora were clearly too small to provide more than tentative results, although the differences in frequency between the earlier and later corpora, with respect to modals and semi-modals, were highly significant: a loss of 11.8% in the case of modals, and a gain of 36.8% in the case of semi-modals (see Leech et al. in press, Chapters 4 and 5).

12 This is not only because the semi-modals as a class are much more frequent in the spoken language, but also because the phonetic reduction of the semi-modals, which is a salient indicator of the ongoing grammaticalization process, appears only in the spoken language. The reduced written forms gonna, wanna, gotta, better, supposed to (the last two lacking the finite auxiliary) are frequent in transcriptions of speech (for example in the BNC and the LCSAE) but are rare in the Brown family, even in representations of spoken dialogue. The occurrence of these transcribed reduced forms can be taken in general as an indicator of phonetic reduction, although in any given case the transcriber’s subjective impression is involved in the choice of a standard or non-standard spelling.
13 Strictly, ‘noun + noun’ means ‘noun + common noun’. We excluded strings where the second noun was a proper noun, as this would have included personal names such as Hillary Clinton and Gordon Brown, as well as place names spelt as two words (such as Los Angeles and New York), which, according to the tagging conventions of the C8 tagset, are tagged as a sequence of two proper nouns. The frequency count for ‘noun + common noun’ sequences included multiple counts where a sequence of three or more nouns occurred. For example, the sequence film entertainment division in example (13) would count as two noun + noun sequences: film entertainment and entertainment division.

References

Biber, D. (1989), Variation across Speech and Writing. Cambridge: Cambridge University Press.
Biber, D. (2003), ‘Compressed noun-phrase structures in newspaper discourse: the competing demands of popularization vs. economy’, in: J. Aitchison and D.M. Lewis (eds.), New media language. London: Routledge, pp. 169-181.
Biber, D. and E. Finegan (1989), ‘Drift and evolution of English style: A history of three genres’. Language 65(3), 487-517.
Biber, D. and E. Finegan (1997), ‘Diachronic relations among speech-based and written registers in English’, in: T. Nevalainen and L. Kahlas-Tarkka (eds.), To Explain the Present: Studies in the Changing English Language in Honour of Matti Rissanen. Helsinki: Mémoires de la Société Néophilologique de Helsinki, pp. 253-75.
Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan (1999), Longman Grammar of Spoken and Written English. London: Longman.
Biber, D. and V. Clark (2002), ‘Historical shifts in modification patterns with complex noun phrase structures’, in: T. Fanego, M. López-Couso and J. Pérez-Guerra (eds.), English Historical Morphology: Selected Papers from 11 ICEHL, Santiago de Compostela, 7-11 September, 2000. Amsterdam: Benjamins, pp. 43-66.
Facchinetti, R., M. Krug and F. Palmer (eds.) (2003), Modality in Contemporary English. Berlin: Mouton de Gruyter.
Hinrichs, L. and B. Szmrecsanyi (2007), ‘Recent changes in the function and frequency of Standard English genitive constructions: A multivariate analysis of tagged corpora’. English Language and Linguistics 11(3), 335-378.
Hopper, P.J. and E. Closs Traugott (2003), Grammaticalization. Cambridge: Cambridge University Press.
Hundt, M. (1997), ‘Has BrE been catching up with AmE over the past thirty years?’, in: M. Ljung (ed.), Corpus-based Studies in English: Papers from the 17th International Conference on English Language Research on Computerised Corpora (ICAME 17), Stockholm, May 15-19, 1996. Amsterdam: Rodopi, pp. 135-151.
Hundt, M. and C. Mair (1999), ‘“Agile” and “Uptight” Genres: The Corpus-Based Approach to Language Change in Progress’. International Journal of Corpus Linguistics 4, 221-242.
Krug, M. (2000), Emerging English modals: a corpus-based study of grammaticalization. Berlin and New York: Mouton de Gruyter.
Leech, G. (2003), ‘Modality on the move: The English modal auxiliaries 1961-1992’, in: Facchinetti et al., pp. 223-40.
Leech, G. and N. Smith (2006), ‘Recent grammatical change in written English 1961-1992’, in: A. Renouf and A. Kehoe (eds.), The Changing Face of Corpus Linguistics. Amsterdam: Rodopi, pp. 185-204.
Leech, G., M. Hundt, C. Mair and N. Smith (in press), Contemporary Change in English: A Grammatical Study. Cambridge: Cambridge University Press.
Leonard, R. (1968), The Types and Currency of Noun + Noun Sequences in Prose Usage 1750-1950. Unpublished MPhil thesis, University of London.
Mair, C. (1998), ‘Corpora and the study of major varieties of English: Issues and results’, in: H. Lindquist, S. Klintborg, M. Levin and M. Estling (eds.), The major varieties of English: Papers from MAVEN 97. Växjö: Acta Wexionensia, pp. 139-157.
Mair, C. (2006), Twentieth Century English: History, Variation and Standardization. Cambridge: Cambridge University Press.
Myhill, J. (1995), ‘Change and continuity in the functions of the American English modals’. Linguistics: An Interdisciplinary Journal of the Language Sciences 33, 157-211.
Övergaard, G. (1995), The Mandative Subjunctive in American and British English in the 20th Century (Studia Anglistica Upsaliensia 94). Uppsala: Acta Universitatis Upsaliensis.
Rosenbach, A. (2002), Genitive variation in English: conceptual factors in synchronic and diachronic studies. Berlin: Mouton de Gruyter.
Rosenbach, A. (2006), ‘On the track of noun+noun constructions in Modern English’, in: C. Houswitschka, G. Knappe and A. Müller (eds.), Anglistentag 2005 Bamberg. Proceedings. Trier: Wissenschaftlicher Verlag, pp. 543-557.
Seoane, E. and C. Williams (2006), ‘Changing the rules: A comparison of recent trends in English in academic scientific discourse and prescriptive legal discourse’, in: M. Dossena and I. Taavitsainen (eds.), Diachronic Perspectives on Domain-Specific English. Bern: Peter Lang, pp. 255-276.
Seoane, E. and L. Loureiro-Porto (2005), ‘On the colloquialization of scientific British and American English’. ESP Across Cultures 2, 106-118.
Smith, N. (2003), ‘Changes in the modals and semi-modals of strong obligation and epistemic necessity in recent British English’, in: Facchinetti et al., pp. 241-266.
Strunk, W., Jr and E.B. White (2000), The Elements of Style. London and New York: Longman.
Tagliamonte, S.A. (2004), ‘Have to, gotta, must: Grammaticalization, variation and specialization in English deontic modality’, in: H. Lindqvist and C. Mair (eds.), Corpus Research on Grammaticalization in English. Amsterdam: John Benjamins, pp. 33-55.
Joseph Wright’s ‘English Dialect Dictionary’ in electronic form: A critical discussion of selected lexicographic parameters and query options

Alexander Onysko, Manfred Markus and Reinhard Heuberger
University of Innsbruck¹

Abstract

The digitised version of Joseph Wright’s English Dialect Dictionary (EDD, 1898–1905) promises to be a lexicographic milestone for English dialect terms and phrases of the 18th and 19th centuries. In a research project in the English Department at the University of Innsbruck, the c. 5,000 pages of the dictionary have been transferred into machine-readable text and parsed. Our aim is to produce an online version of the dictionary for research on the history of spoken and dialectal Late Modern English. The paper demonstrates the complexity of the entries in the EDD and focuses on the questions of dialect attribution and of the definition of words and phrases as two cases in point. Beyond that, we will provide a survey of the search interface and specifically discuss the implementation of the two issues of dialect area and definition.

1. Introduction
More than 100 years after its first publication, Joseph Wright’s English Dialect Dictionary (EDD) remains the most comprehensive and reliable lexicographic work on English dialects, also going beyond the Oxford English Dictionary (OED) in its treatment of dialectal forms.² The dictionary was published in 6 volumes between 1898 and 1905 and, according to the editor, it “includes, so far as possible the complete vocabulary of all dialect words which are still in use or known to have been in use at any time during the last two hundred years in England, Scotland and Wales” (Wright Vol.1 1898: v.).

In order to draw a comprehensive picture of dialect terms in the period from about 1700 to the time of publication, Wright used a plethora of sources. He gathered information from 274 correspondents, 90 printed dialect glossaries (many compiled by individual members of the English Dialect Society) and 342 unprinted collections of dialect words. Furthermore, Wright incorporated dialect terms from various reference works, literary sources, magazines and press publications, as well as folkloristic sources such as songs and games, which, altogether, amount to more than 2,000 bibliographic entries. Overall, the six volumes of the EDD feature close to 65,000 entries of dialect words including further thousands of variant forms, derivations, compounds and phrases, the majority of which are meticulously labelled according to their use in various counties, regions and nations.

In spite of Wright’s detail-oriented manner of compilation, the analysis of the dictionary entries has shown that he employed multiple, sometimes hardly accessible ways of indicating the areal distribution of dialect terms. The semantic information in the EDD is similarly complex, with features of meaning being included outside the definition proper. Apart from discussing these selected problems, the paper will also provide an overview of the complexity of the planned online search interface, specifically focussing on the retrieval of dialect areas and semantic information.

2. The parameters of the entries
First of all, a few introductory remarks have to be made about the lexicographic structure of the EDD. There are eight basic recurrent units in the dictionary entries: (1) headwords, (2) parts of speech, (3) labels, (4) counties, regions and nations (dialect areas), (5) phonetic transcription and (6) definitions or meaning(s), which are often exemplified by a large number of (7) citations; optionally, Wright provides (8) comments, particularly on etymology, at the end of the entries. Figure 1 (for the sake of brevity only a simple example) illustrates the role of the eight fields (cf. Markus and Heuberger 2007).

(1) AFTER-DAMP, (2) sb. (3) Tech. (4) Nhb. Dur, w.Yks. (5) [aftə-damp.] (6) The noxious gas resulting from a colliery explosion (Wedgwood).
(7) Nhb. & Dur. After-damp, carbonic acid, stythe. The products of the combustion of firedamp, NICHOLSON Coal Tr. Gl. (1888). Nhb.¹ After-damp, the noxious gas resulting from a colliery explosion. This after-damp is called choak-damp and surfeit by the colliers, and is the carbonic acid gas of chymists, HODGSON A Description of Felling Colliery. w.Yks. The after-damp completed their death, N. & Q. (1876) 5th S. v. 325. Miners’ tech. Carbonic acid gas, or choke damp, which the miners call after-damp, CORE (1886) 228.
(8) [After + damp, q.v.; cp. choak-damp.]

Figure 1: Standard entry structure in Wright’s EDD

While the first five parameters are relatively homogeneous in layout, definition has turned out to be rather variable: sometimes it is appended to the paragraph of the lemma, as in Figure 1, but in polysemous entries it is split into separate paragraphs. This quality of the definition as a “trouble maker” has prompted us to insert a caesura after the phonetic transcription and to subdivide the structure of the entries into entry heads and entry bodies. The head comprises the first five units, i.e. from headwords to phonetic transcription, while the body consists of the other three units: definitions, citations and comments.

In our work so far, we have focussed on correct parsing. It soon became obvious that the main entry parts contain various sub-parameters which may be of interest to the user of the electronic version of the EDD, such as spelling variants, compounds, derivations, phrases and sources. However, the two basic pieces of information that almost every entry offers are dialect areas and definition. This paper will, therefore, use these two fields to demonstrate the complexity and wealth of information provided by Wright’s dialectal treasure house.

3. Dialect attribution³
One has to distinguish between general dialect attribution, which is given in the form of dialect labels, and specific information which can be inferred from the so-called source codes or correspondent codes. Both types occur very frequently in the EDD and need to be considered equally for a thorough philological investigation of the dictionary.

3.1 General dialect attribution: dialect labels
The general dialect labels usually appear as abbreviations in the entry head (cf. Figure 1). They signify the area where a headword is used. The prominent location of the labels in the entry head marks this information as relevant for the lemma as a whole. Dialect labels occur on three different levels, i.e. counties, regions and nations. Whenever Wright had precise information about the spatial distribution of a certain dialect term, he listed all counties in which that term was used. In cases where his sources were less precise or when there were too many counties to mention, he preferred to indicate the regions or the nations in which the usage occurred. The three levels of dialect labels are briefly explained in the following. 3.1.1 County labels Wright used 126 county labels in the EDD, 42 of which pertain to England. Scotland (39), Ireland4 (32) and Wales (13) are classified accordingly. Counties are always indicated by means of three- or (occasionally) four-letter codes, e.g. Bnff. (Banffshire / Scotland). As mentioned, they usually occur in the entry head between the usage labels and the phonetic transcriptions, but one of the peculiarities of the dictionary is that they can also be found in the entry body, typically in front of the citations (cf. chapter 3.3.). The various counties in Great Britain and Ireland have not been dealt with in equal measure, presumably as a result of Wright’s sources and informants being unevenly spread. An electronic analysis of the dictionary shows, for example, that Yorkshire has been mentioned most often by far (76.661), whereas
204
Alexander Onysko, Manfred Markus and Reinhard Heuberger
the counties Denbighshire (Wales / 21), Sutherland (Scotland / 54) and Limerick (Ireland / 22) have received comparatively little attention. Some Irish counties, e.g. Kilkenny and Monaghan, have only a single reference in the entire dictionary. In addition, Joseph Wright uses cardinal directions in combination with counties to further specify dialect areas (e.g. n.Cmb. for ‘north Cambridgeshire’, sw.Cor. for ‘south-west Cornwall’). This issue will be discussed in more detail in chapter 5.2.1., as an example of how dialect reference is incorporated in the search interface.

3.1.2 Regional labels

For England, Scotland, Wales and Ireland, Wright has also made regional distinctions. The names of these regions appear in abbreviated form, e.g. e.Cy. (East Country – England), and they generally comprise several counties. The areal definition of the regions mainly follows Wright’s considerations in his English Dialect Grammar (1905: 1-3), which is, in turn, partly based on Ellis’ classification of the dialects of England (cf. 1968 [1889]). Defining n.Cy., for example, Wright states the following:

When I use the expression n.Cy., the northern counties or the northern dialects, I mean thereby Nhb. [Northumberland], Dur. [Durham], Cum. [Cumberland], Wm. [Westmoreland], Yks. [Yorkshire] (except sw. & s.Yks.) and the northern portion of Lan. [Lancashire]. (1905: 2)

Since it is impossible to determine the exact signification of expressions such as “the northern portion of Lan.” and since, in the EDD, Wright most consistently refers to counties as a whole, we have decided to interpret such definitions inclusively, i.e. a search for n.Cy. will include the lemmas of the entire counties concerned. On a general level of regional specification, Scotland and Ireland show a similarly detailed classification while Wales is only regionally subdivided into north and south. The regions that figure most prominently are the English n.Cy. (14,966) and e.An. (8,514). The Scottish regions have a few thousand, the Irish a few hundred references on average.

3.1.3 National labels

The list of nations in the EDD is restricted to countries and continents where English was spoken natively in the 19th century, that is, the USA, Canada, Australia, New Zealand, England, Scotland, Ireland and Wales. The USA is only mentioned 34 times, whereas the hypernym America occurs 2553 times⁵. Interestingly enough, Scotland is cited significantly more often than England (11,359 vs 4,440 occurrences). This striking difference is presumably due to the fact that Wright had less precise information about dialectal usage in the Scottish counties and thus preferred the more general nation tag.
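The inclusive reading of regional labels adopted in 3.1.2 can be illustrated with a minimal sketch. It is not the project's actual implementation: the table `REGION_TO_COUNTIES` and the helper names are hypothetical, and only the n.Cy. row is taken from Wright's definition quoted above. Directional county labels are reduced to the whole county, so that sw.Yks. counts towards n.Cy. even though Wright himself excludes it.

```python
import re

# Hypothetical lookup table: only the n.Cy. row follows Wright's
# definition quoted above (1905: 2); name and layout are assumptions.
REGION_TO_COUNTIES = {
    "n.Cy.": {"Nhb.", "Dur.", "Cum.", "Wm.", "Yks.", "Lan."},
}

# Cardinal-direction prefixes that Wright combines with county abbreviations.
_DIRECTION = re.compile(r"^(?:ne|nw|se|sw|n|s|e|w|m)\.(?=[A-Z])")


def strip_direction(label: str) -> str:
    """Reduce a directional label such as 'sw.Yks.' to the county 'Yks.'."""
    return _DIRECTION.sub("", label, count=1)


def matches_region(entry_labels, region: str) -> bool:
    """True if any label names the region itself or one of its counties."""
    counties = REGION_TO_COUNTIES.get(region, set())
    return any(
        label == region or strip_direction(label) in counties
        for label in entry_labels
    )


print(matches_region(["sw.Yks.", "Chs."], "n.Cy."))  # True
print(matches_region(["Chs.", "Der."], "n.Cy."))     # False
```

Treating the county as the smallest unit keeps the lookup simple, at the cost of the graded precision that the paper accepts for inclusive queries.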
Joseph Wright’s ‘English Dialect Dictionary’ in electronic form

3.2 Specific dialect attribution: source codes
The citations of the EDD are drawn from a great number of printed glossaries, which are referred to in the entries by means of source codes. These codes usually consist of the abbreviation of a dialect area in combination with a superscript figure. The majority of the source codes are mnemonic since the numeral represents the only difference between source codes and the corresponding county abbreviations. Thus, reference to Cheshire in citations from the printed glossaries is immediately recognizable as Chs.1, Chs.2 and Chs.3. Only the abbreviation N.I.1, which stands for the counties Antrim & Down in Northern Ireland, lacks immediate transparency. The following entry shows how source codes are integrated in the dictionary.

BACK-ORDER, sb. Chs. Der. [ba.k.oda(r).] A countermand, a reversal of a previous command. s.Chs.1 Ahy woz tu utoo’kn dhem bee uss tu)th faer, bu mestur sent mi baak au rdurz [I was to ha’ tooken them beas-s to th’ fair, bu’ mester sent me back-orders]. Der. (H.R.)
Figure 2: Source codes in Wright’s EDD (example 1)

The source code s.Chs.1 indicates that the subsequent citation was taken from a book identified in the reference list as The Folk-Speech of South Cheshire by Th. Darlington. The source code is even more precise than the county label Chs. in the head and emphasizes that there can be variance of areal dialect information between different parts of the same entry. This kind of graded precision of dialect areas relates to the fact that Wright uses the head to give the general picture and to summarize information mentioned in other parts of the entries. Figure 3 provides another example of how information in the head can diverge from dialectal source codes since the latter prove the existence of a specific sense of the entry at large.

AGAIN, prep. Var. dial. uses in Sc. Irel. and Eng. Also written agaan, agean, agen, agin, agyen. See below. [agien, age’n, egin.] Used for against, in most of its mod. meanings. I. Of position. 1. Near, beside. n.Yks. Just ageean t’pleeace where Ah wur bred, Broad Yks. (1885) 27 ; n.Yks.2 ne.Yks.1 Oor spot ligs agaan Helmsla. e.Yks.1 w.Yks. Nelly always sits again John (F.P.T.) ; Poor Bill, he wur leynd ageean t’wall, PRESTON
Poems, &c. (1864) 24. Lan.1 Agenvth’ heawseeend wur a little cloof o’ full o brids and fleawrs. Chs.1 He lives agen th’ chapel; [...]
Figure 3: Source codes in Wright’s EDD (example 2)

From a user perspective this shows that dialect information in an entry can be both general and specific. While information in the head commonly bears scope over the whole entry, source codes in citations are more closely bound to their individual usage examples and can characterise only specific meanings. This variance in the identification of dialect areas has led us to link areal abbreviations and source codes in the search application so that a user will get the full picture and be able to retrieve entries for a particular county even if there is no explicit county label in the entry head.

3.3 Specific dialect attribution: correspondent codes
Another type of specific dialect information that can be inferred from the citations is that of oral sources, usually given by the initials of the correspondents in combination with a dialectal abbreviation. In Figure 4, E.S.F. stands for Rev. E. S. Fox from West Yorkshire (w.Yks.), and J.W.P. for J. W. Partridge from Worcestershire (Wor.).

BACK-LANE, sb. Yks. Lin. Rut. Lei. War. Wor. [bak-len, Yks. ba’k-loin.] A narrow, unfrequented street, gen. a by-way leading from the main thoroughfare. w.Yks. The side street in Snaith running parallel to the High Street is usually called Back Lane (E.S.F.). Lin. I tooke to my heels as hard as I could runne and got my selfe into a back-lane, BERNARD Terence (1629) 156. n.Lin.1 Thaay’re buildin’ a sight o’new hooses agean As’by back-laane fer th’ iron-stoan men to live in. Rut.1, Lei.1 War.3 When there is more than one road through a village, the least important is generally known as the back lane. Wor. (J.W.P.)
Figure 4: Correspondent codes in Wright’s EDD

Similar to source codes, the combination of correspondent codes and dialect labels indicates where a word or phrase was used. Since this piece of information can be very helpful in cases where there is no precise dialect label in the entry head, we intend to link dialect area and oral sources. This also implies that the informational value of source codes and correspondent codes is twofold. On the one hand, they explicitly mark specific sources and will be retrievable as such on the search interface. On the other hand, they implicitly mark geographical distribution and will thus be automatically included in queries for specific dialect areas.

3.4 Some selected features and problems of dialect attribution in the EDD
One of the greatest merits of the electronic EDD (eEDD) will be the automated cross-linking between all counties, regions and nations, allowing the user to retrieve a significantly greater number of relevant results. More precisely, we have tried to allocate all counties to the regions to which they belong geographically, and the same has been done with regard to nations. For example, a search for the abbreviation of Wales (Wal.) merely yields 134 lemmas. These results will be multiplied with an inclusive query that uncovers not only the explicitly marked entries but also those referring to the regions of North Wales and South Wales and to all the Welsh counties like Anglesea, Merionethshire and Flintshire. The result list for this inclusive query will be arranged hierarchically so that the most relevant results will be displayed on top. Thus, the hits for Wales-nation will be listed first, followed by Wales-regions and, finally, by the lemmas of individual Welsh counties. This feature will prove helpful for dialectologists who require a vast amount of data and do not mind the graded precision of the results⁶.

Despite this feature, the most apparent difficulty for the dictionary user is the heterogeneous positioning of dialect areas in the EDD and the varying modes of referring to dialect information. As mentioned, general dialect labels occur in the entry head right after the word class labels. But they may also refer to variants or phonetic transcription. Very often, dialect information is included in the citations, specifically in connection with Wright’s written and oral sources. Dialect information may further be provided in the comments section or, more exceptionally, in the definitions. This scattered mode of marking dialect areas needs to be taken into account when devising queries on the search interface (cf. chapter 5.2.1.). To add to the complexity, the EDD also contains fuzzy dialect indications, e.g. in some parts of England, in other parts of Scotland. The problem of separating such fuzzy dialect information from more specific reference demands a layered structure of dialect areas on the search interface (cf. 5.2.1.). Furthermore, counties, regions and nations are sometimes also referred to negatively, i.e. they are actually excluded as dialect areas (e.g. not in Sc., or not in gloss. of s.Chs. and Shr.). Such negated codes were kept apart as a separate subset of dialect information and excluded from further consideration during the parsing process.

4. The parameter of definition
As far as the meaning of the lemmas is concerned, Wright gives their definitions in various layouts, depending on the individual entry. Moreover, definitions can be shaped by aspects of word formation, syntax and semantics. On the other hand, elements of meaning can also be gleaned from parts of the entries other than the definitions proper.

4.1 Layout
The simplest case of layout is demonstrated in Figure 1: the definition is given in the paragraph of the head. However, with polysemous headwords or in the case of a phrase formed with the entry word, meanings are listed in separate paragraphs. Figure 5 shows an excerpt of the complex definition of the verb ACT (definition 6):
Figure 5: Listed meanings in definition 6 of the entry ACT

It is clear from this example that meanings are not listed by Wright as pearls on a string, but mixed with phrases and, in this case, with derivations (Acting/Action). They are also interrupted by citations. This type of layout has caused special problems for our parser, which we have partly solved by classifying definitions manually. In other cases the slot of definition is not irritatingly complex, but empty (Figure 6):
Figure 6: Empty definition slot

Here Wright has not given the definition of acant in his own words, but silently refers the readers to his citations. The parser, of course, misses such taciturn references, unless we mark them manually.
In yet other cases the definition is indicated by a cross-reference, as in Figure 7, definition 3:
Figure 7: Entry of DUMMOCK with cross-reference (see below) in definition 3

The entry in Figure 7 also demonstrates that the layout of meaning can take different shapes in one entry: the first meaning, of the noun, is given as an appendix in the first paragraph; the second meaning is that of the verb, and the third meaning, of the phrase, is not explicitly given but made clear by cross-reference. In sum, it is only up to a point that layout can be helpful for the computational identification of definitions.

4.2 Patterns of word formation and syntax
The diverse layout correlates with patterns of word formation and syntax. Specifically, Hence (see Figure 5) serves as a marker introducing derivations, with the definitions given after the part of speech labels. Figure 8 provides another example of this function of Hence:
Figure 8: Hence, marking derivations, at the beginning of a paragraph

Compounds, by contrast, can be traced by the introductory marker comp. (see Figure 9) and phrases (see Figure 7) by the initial string phr.:
Figure 9: Use of comp. in entry ADDLE (lit. ‘stagnant water’)

In addition to phr. and comp., Wright has sometimes used the tag comb. (‘combination’), obviously in cases where he found it hard to decide whether a word cluster should be classified as a compound or as a syntactic group (phr.). We have decided to abide by this intermediary category so as to respect Wright’s expertise and the historical importance of the text. In other cases, however, Wright has summarised a paragraph of word clusters by the word pair comp. and phr., as in Figure 10. While the expressions concerned can be manually disentangled and attributed to either of the two types, we will, for the time being, keep the mixed bag comp. and phr. untouched as an autonomous category.

4.3 Semantic patterns
Given that meanings are parsed appropriately as definitions, the question arises in what way this information can be retrieved. In the EDD meaning stands for many different things, depending on whether we are talking about Wright’s explanation of the letter A, the definition of a flower term by its Latin name, or the grammatical explanation of a function word. Moreover, dialect terms can be tied to specific phrasal expressions. In this case, the definition of the lemma consists of the actual phrase, i.e. the definition is given by its idiomatic use. This leads to a structural overlap between the search options in phrases and definitions (see Figure 11) and, as such, will have to be corrected manually. Figure 10 (lemma ALL) demonstrates a further problem involved in classifying definitions caused by the interspersed reference of meaning, compounds and phrasal elements. While parsing definitions fragmentised in single words disallows the immediate retrieval of the whole definition and its appropriate designation (e.g. All-a-bits ‘in pieces or rags’; Figure 10), a keyword search in definitions will guide the user close to the defined phrase or compound. In answer to these restrictions, we have planned to make the phrases explicitly available for analysing dialectal variation. In view of the fact that the EDD contains the
abbreviation phr. 9818 times and the related label comb., as a marker of syntactic groups, 3756 times, the definitions of these lexical units are a most promising field of dialectal research; the more so since every single marker may introduce a dozen or more examples, as in the case of ALL in Figure 10.
Figure 10: Variable form of definition

4.4 Semantic elements outside definitions
Elements of meaning can also be found in Wright’s EDD outside the definitions proper, e.g. among usage labels, citations and comments. The latter are given in square brackets at the end of entries⁷. Suffice it to give one example of such implicit meaning: the marker diminutive, which appears 200 times in total, mostly occurs in comments but sometimes also in other entry parts (e.g. head and definition). The use of diminutive suffixes with an emotional connotation is a striking feature of English dialects. This example is meant to remind us that dialectal features may also be couched in morphological productivity and are, thus, more than just the nomenclature affiliated with country life, from hay to horses.⁸ Wherever features of meaning are hidden in an entry, we try to trace them as comprehensively as possible and to make them retrievable via the search interface.

5. Outlining the search interface
In general, the EDD appears very consistent in its lexicographic macrostructure and type-setting (cf. Markus 2007, Thompson 2006: 12-14), which allows postulating formal rules to automatically parse the various structural units (head, definition, citation and comment) to a sufficient extent⁹. This parsing is taken as the basis for retrieving information from the eEDD. For the conception of the search interface our aims are twofold: on the one hand, we follow the premise to map the eEDD as closely to the original as possible. This involves retaining its entry structure, highlighting the various subsections in the entry, and providing the possibility of comparing the digital output with an image of the original dictionary entry. On the other hand, the search interface should allow sophisticated access to the information contained in the dictionary. This means that users should be able to tailor their queries for certain entry parts and that they should be able to search for specific dialectal information (e.g. for finding all verbs that occur in Yorkshire, Northumberland and Cumberland, or all nouns in Scotland that are etymologically marked as deriving from Old Norse).

5.1 Functional outline
In line with these aims, we have devised a search interface that strives to provide flexibility in accessing the dictionary and to offer high granularity of information retrieval. Figure 11 provides a schematic overview of the search interface.
Figure 11: Schematic outline of the search interface

Basically, the user will have two main options of accessing the dictionary. First of all, a headword search will be available as the default mode: any string (including truncation symbols) can be entered in the field Search for and the respective hits will be retrieved from the database. This word (or string) search can be limited in scope to the various structural units in the entries (full text, heads, definitions, citations, comments, variants, compounds, derivations and phrases).
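A string search with truncation, limited to one entry part, might look roughly as follows. This is a sketch under stated assumptions rather than the eEDD's retrieval routine: the entry dictionaries, the function `search` and the `part` parameter are hypothetical, and `*` is assumed to be the only truncation symbol. The lemmas and definitions are those quoted in this paper.

```python
import re
from fnmatch import fnmatchcase

# Toy data: lemmas and definitions as quoted in this paper; the
# dictionary layout itself is an assumption made for illustration.
ENTRIES = [
    {"lemma": "BACK-ORDER",
     "definitions": "A countermand, a reversal of a previous command."},
    {"lemma": "BACK-LANE",
     "definitions": "A narrow, unfrequented street, gen. a by-way "
                    "leading from the main thoroughfare."},
    {"lemma": "BEEF",
     "definitions": "An ox or cow intended for slaughter."},
]


def search(entries, pattern, part="headwords"):
    """Return the lemmas whose chosen entry part matches the pattern.

    The headword search compares the whole lemma; any other part is
    split into words and each word is compared with the pattern.
    """
    pattern = pattern.lower()
    hits = []
    for entry in entries:
        if part == "headwords":
            candidates = [entry["lemma"].lower()]
        else:
            text = entry.get(part, "").lower()
            candidates = re.findall(r"[a-z][a-z'-]*", text)
        if any(fnmatchcase(word, pattern) for word in candidates):
            hits.append(entry["lemma"])
    return hits


print(search(ENTRIES, "back-*"))                   # ['BACK-ORDER', 'BACK-LANE']
print(search(ENTRIES, "cow*", part="definitions"))  # ['BEEF']
```

Matching word by word keeps a query such as cow* from hitting strings inside longer words, which a plain substring search would do.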
The second basic option is to tap the information in the dictionary by selecting Search filters (dialect area, usage label, part of speech, source, phonetic, morphemic, etymology and time span). This will allow a user, for example, to choose Scotland from the list of dialect area (nation) and find all the entries that bear reference to Scotland (for further details on dialect areal queries see 5.2.1.). Similarly, a user can search for entries containing specific usage labels (e.g. frequently, obsolete, slang, etc), etymological information (e.g. Middle High German, Old Norse, Middle Dutch, etc), part of speech categories (e.g. noun, adjective, etc; this also includes basic morphological and syntactic markers, e.g. diminutive, intransitive, etc) and source information (e.g. printed sources, correspondent codes and works from the general bibliography). These filter options are for the most part based on information explicitly provided in the dictionary. Thus, the standard abbreviations for counties, regions and nations, the part of speech labels, etymological references and the abbreviations for correspondents, dialect glossaries and general reference works are extracted from Wright’s lists of abbreviations, correspondents and unpublished works, as well as from the general bibliography at the end of volume 6. Usage label appears as a hybrid category in the sense that abbreviations employed by Joseph Wright are mixed with various keywords (e.g. inanimate, synonym, cant, derogatory, etc). We have defined the latter as separate labels from the dictionary according to their regular occurrence in the entries and for their relevance from a current lexicographic perspective (cf. Béjoint 2000). The remaining options among Search filters incorporate functions going beyond Wright’s basic lexicographic markers as given in lists of abbreviations and sources. 
Thus, the phonetic search (envisaged for a later stage of the project) will feature a separate search window, allowing queries for phonetic symbols. The category morphemic will offer a list of derivational morphemes, and time span will create the possibility of searching dictionary entries per year and period. This means that a user can search for dialect words for any given year and by periods of ten years between 1700 and 1900. In addition, we also intend to implement a search for the earliest date of mention of entries in the EDD. Since dates are based on bibliographic information, they merely provide indirect evidence. Nevertheless, they can facilitate investigating the use of dialect words in literary works cited in the entries. In addition to these features, we are planning to implement parts of the database, in particular the citations, as self-contained corpora, with our software allowing work indexes and concordances. Apart from offering searches in specific entry sections or providing a selection of parameters in scroll-down menus among search filters, the interface will turn into a fully fledged query engine when the user combines search filters, specific string searches and queries in selected entry parts. If users, for example, are interested in terms relating to cow in England, they can type in cow*, search in definitions, and select the nation label England from the search filters. In addition, basic Boolean operators provide the option of combining the categories of search filters and also of selecting several parameters among the different search filters. While cross-categorial combinations between dialect area, usage
label, part of speech, etymology and source are restricted to AND queries, selections within categories allow for AND and OR logic. Consequently, highly complex queries become possible, e.g. (Scotland OR England OR Wales) AND (Substantive AND Plural) AND (Slang) AND (Old Norse). These can basically be run in the various entry parts (cf. Figure 11)¹⁰. In order to maintain an overview of one’s filter selections, a search protocol will automatically be generated, and the last 10 search commands will be stored for immediate retrievability. As the last example indicates, the search engine will have the capacity to minutely tap the deepest wells of the EDD and thus clearly bring to light both its potential and its limitations. In its options of combining query parameters, the search interface might at times appear too powerful and even dig beyond the substance of dialectal information in the EDD.

5.2 Retrieving dialect areas and searching in definitions
After the presentation of the basic structure of the search interface, this section will touch upon issues raised in the preceding chapters and show their implementation and their repercussions on query design. The multi-layered reference to dialect areas (cf. chapter 3) calls for a nested, albeit more complicated, interface structure. The vicissitudes of meaning appear at first sight as limiting factors as far as retrieving information from the entry area definitions is concerned. With the proper scope of the queries, however, the initial limitations can be compensated for.

5.2.1 Structure and representation of dialect areas

In line with the tripartite classification of dialect areas into nations, regions and counties and further sub-classifications into cardinal directions and fuzzy phrasal expressions (cf. chapter 3.1.), the search interface offers the possibility of selecting areal abbreviations among these categories. In addition, we have decided to combine the levels of counties and regions with nations. Accordingly, selecting England will not only yield the entries that contain the explicit abbreviation Eng. but also all the lemmas that are attributed to any of the English counties or regions. On the other hand, if users are interested in investigating general dialectal terms of England, they will have to be able to search for the explicit label Eng. only. These considerations lead to a layering of filter options on the search interface. Figure 12 provides an overview of the types of dialect information. While directional and fuzzy information occurs among nations, regions and counties, Wright did not comprehensively apply these modes of reference to all basic abbreviations. In the case of Yorkshire (Yks.), where he drew on a large pool of informants and other sources, he made more subtle distinctions between n.Yks., s.Yks., e.Yks., w.Yks., m.Yks. (mid Yorkshire), ne.Yks., nw.Yks., se.Yks. and sw.Yks.
Counties for which he lacked substantial information are devoid of further subdivisions. This is particularly evident in Ireland, for which only 5 out
of 32 counties include directional specification. Furthermore, in the majority of Scottish (25 of 44) and Welsh counties (7 of 11), Wright abstained from providing directional specification.

[Figure 12: diagram with the labels Dialect Area; Nation, Region, County; explicit, directional, fuzzy]

Figure 12: Overview of areal dialect information in the EDD

As regards fuzzy areal information, merely 35 out of 126 counties are referred to by means of vague phrasal expressions, such as in some parts of Lancashire, in many parts of Cheshire. Nevertheless, this information of restricted value is incorporated in the search interface. Figure 13 shows the varying types of dialect specification for a few English counties.
Figure 13: Excerpt of search interface: dialect area – county
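The layering of explicit and inclusive queries described in this section, together with the hierarchical ordering of results mentioned in 3.4 and the linking of source codes to counties (3.2), could be sketched as below. Everything here is illustrative: `HIERARCHY` is a hypothetical excerpt, not the project's table, the function names are assumptions, and the two EXAMPLE entries are invented to show a region-only and a source-code-only case. The labels for AGAIN and BACK-ORDER are those of Figures 2 and 3.

```python
import re

# Hypothetical excerpt of the nation > region > county hierarchy.
HIERARCHY = {
    "Eng.": {
        "regions": {"n.Cy.", "e.An."},
        "counties": {"Chs.", "Der.", "Yks.", "Lan."},
    },
}

# Optional direction, area abbreviation, optional source-code numeral,
# e.g. 's.Chs.1' -> direction 's', area 'Chs.', numeral '1'.
_LABEL = re.compile(r"^(?:(ne|nw|se|sw|n|s|e|w|m)\.)?([A-Z][A-Za-z]*\.)(\d+)?$")


def area_of(label):
    """Return the bare area abbreviation of a label, or None if unparsable."""
    match = _LABEL.match(label)
    return match.group(2) if match else None


def rank(labels, nation):
    """0 = explicit nation label, 1 = region, 2 = county, None = no match."""
    areas = HIERARCHY[nation]
    if nation in labels:
        return 0
    if areas["regions"] & set(labels):
        return 1
    if any(area_of(label) in areas["counties"] for label in labels):
        return 2
    return None


def search(entries, nation, explicit_only=False):
    """List matching lemmas, most explicit level of reference first."""
    hits = []
    for lemma, labels in entries.items():
        level = rank(labels, nation)
        if level is None or (explicit_only and level != 0):
            continue
        hits.append((level, lemma))
    return [lemma for level, lemma in sorted(hits)]


ENTRIES = {
    "AGAIN": ["Sc.", "Irel.", "Eng."],
    "BACK-ORDER": ["Chs.", "Der.", "s.Chs.1"],
    "EXAMPLE-REGION": ["n.Cy."],    # invented entry
    "EXAMPLE-SOURCE": ["s.Chs.1"],  # invented entry
}

print(search(ENTRIES, "Eng."))
# ['AGAIN', 'EXAMPLE-REGION', 'BACK-ORDER', 'EXAMPLE-SOURCE']
print(search(ENTRIES, "Eng.", explicit_only=True))  # ['AGAIN']
```

The `explicit_only` switch corresponds to the case of users who want general dialectal terms of England, i.e. the label Eng. alone.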
5.2.2 The elusiveness of meaning

The discussion in chapter 4 has demonstrated that the explication of meaning is far from homogeneous in the dictionary. From the perspective of the eEDD, the different facets of meaning representation evoke questions as to how far definitions can serve as elements of investigation, and which search strategies can be employed in order to adequately retrieve meaning. Preliminary investigations indicate that searching the lemma definitions of the eEDD will prove interesting in terms of lexical fields. Dialects are generally rich in lexical variation for concepts pertaining to speakers’ everyday concerns (cf. Chambers and Trudgill 1998). As the EDD covers the period of Late Modern English¹¹, speakers’ language use is prone to be shaped by an interesting clash of the traditional and the modern, i.e. the unification of rural values of agricultural life, on the one hand, with concepts arising from the heyday of the industrial revolution¹², on the other. Probing the traditional side, terms of the lexical field of domesticated animals were presumably popular among rural dialect speakers in the 18th and 19th centuries. Indeed, the number of concordances found in the dictionary for terms like cattle (TF¹³ 1564), cow (TF 1777), sheep (TF 2674), pig (TF 1062), dog (TF 1496) and horse (TF 3259) hints at the prolific documentation of dialectal terms in this lexical field.

How can a user be guided in finding dialect words by such semantic keyword searches? The primary home of meaning in the eEDD is the entry part classified as definitions (cf. Figure 11). Restricting the search to this part of the dictionary entries will cover the standard case of meaning explication for monosemous and polysemous lemmas. The way this works may be seen from the definition in the entry BEEF. The first four senses are the following:

1. An ox or cow intended for slaughter.
2. A fibrous carbonate of lime, with a texture resembling fossil wood.
3. Riming slang for ‘stop thief!’
4. Comp.
(1) Beef-balks, a shelf or beam for storing beef; (2) -ball, a beef-dumpling; (3) -brewis, beef-broth (4) -case, a laddershaped frame, hung horizontally under the ceiling near the fire, on which beef was placed to dry (5) -eater, see below; (6) -head, a blockhead, fool; (7) -heart, a cow’s heart ready for cooking; (8) -steak rock, (9) -tree, see below. Applying keyword queries, a search for the term cow* will uncover the metonymical extension of the anthropocentrically driven functionalisation of the animal for its meat (definition 1). A user interested in finding dialectal terms for thief will be able to retrieve the meaning of beef as “Riming slang for ‘stop thief!’” from definition 3. The fourth sense of BEEF is an example of the mixture of compounds and their respective meanings under the larger structural umbrella of the definition of a lemma. At the stage of automated parsing such mixed bags
of compounds and meanings were parsed as definitions. In order to reduce the lexical paraphernalia in this entry part, manual classification of the actual meanings would be necessary. The completion of this task is envisaged for the second phase of our research project SPEED in 2009 and after. At that stage, we will also try to minimise another shortcoming of searching in definitions. This is that a keyword search in definitions alone currently neglects instances of meaning present in other entry parts (e.g. citations). To avoid losing this information, it is advisable to extend the semantic keyword search to citations and comments. At present, this is a slight limitation in terms of user friendliness and accessibility of the data in the dictionary.

6. Conclusion
As the remodelling of the EDD into an electronic version is still under way, this article provides an overview of some of its most important lexicographic parameters and discusses their implementation on the search interface. The attribution of dialectal terms to specific areas is one of the characteristic features of the EDD as Wright painstakingly tried to provide comprehensive information on where a dialect expression was used. Accordingly, reference to dialect area is encoded in the dictionary in multiple ways: in abbreviations pertaining to national, regional and county nomenclatures as well as in reference to dialect glossaries and correspondent codes. Furthermore, Wright employed cardinal directions to fine-line areal differences, and, in a few instances, he provided areal information in the form of fuzzy phrasal expressions. To adequately reflect these layers of reference to dialect areas, the electronic version will allow inclusive and combined searches of these parameters while still retaining the flexibility for searching explicit and individual dialect areas. The issue of lexical meaning looms as a further crucial and complex matter since meaning can be incorporated at various locations in an entry (e.g. in citations, cross-referenced, and immediately following compounds, derivations and phrasal expressions). This structural heterogeneity renders the automated recognition of meaning elements a highly difficult undertaking and has forced us to focus our present attention on the standard case of meaning in monosemous and polysemous entries. For the time being, searching among definitions in the electronic version will bear the marks of these structural restrictions. They can, however, be compensated with more integrative Search in selections. 
Overall, the possible combinations of filter parameters and entry parts will allow a comprehensive grasp of the information contained in the EDD and, thus, inspire new insights into the dialectal landscape of Late Modern English.

Notes

1
The research team of the government-funded project SPEED (Spoken English in Early Dialects) includes Manfred Markus (director), Reinhard Heuberger (co-director), Alexander Onysko (project manager), Raphael Unterweger (software developer), Christian Peer and Christoph Praxmarer (junior researchers).
2
A tentative query in the OED 2 on CD-Rom for dial* provides merely 7,942 results, and specific dialect areas such as counties are rarely mentioned. Searching for Worcestershire, for example, yields eleven entries, four of which refer to the word “Worcester(shire)” itself. The current online-version of the OED, however, is in the process of integrating entries of the English Dialect Dictionary (Philip Durkin, personal communication).
3
The term dialect is used here in Wright’s non-modern sense of dialect as regional variation (cf. Quirk et al. 1985: 16).
4
No distinction is made between Northern Ireland and the Republic of Ireland in line with the political reality towards the end of the 19th century.
5
Wright refers to America as a cover term of the U.S.A. and Canada. In some entries, e.g. POINTER, the abbreviations U.S.A. and Can. are listed individually, whereas there seem to be no entries where Amer. is combined with either of the other dialect markers.
6
The geographical precision of the hits is reduced for dialect areas that appear lower in the result list as they extend the literal, i.e. narrow, interpretation of the search item. The benefit of more inclusive searches, however, becomes evident as soon as further search filters (e.g. adj., slang or figurative) are selected since they tend to demand a reasonably sized pool of data to deliver results.
7
Another way of tracing meaningful elements in the various slots is by searching for words such as meaning and sense. These items occur 422 and 754 times in the dictionary.
8
Cf. the keyword lists in various lexical fields as listed by Francis (1983, 54-60) in line with previous research.
9
Manual proofreading shows an error rate of about 15% for automated structural parsing. The head, for example, was specified as the area between the lemma and the phonetic transcription, formally marked by square brackets. For the fairly numerous entries that lack phonetic transcription, a set of function words and auxiliary verbs (e.g. a, the, of, having) was taken as boundary markers since they typically initiate definitions. Apart from the main structural units, however, the proper recognition of variants, phrases, compounds and derivations demands plenty of manual post-editing as these entry segments fail to provide clear boundary markers.
10
For complex queries, however, it is advisable to keep the Search in options fairly broad to obtain results. Furthermore, a few filter parameters are tied to specific entry parts such as etymological abbreviations, which are exclusively given in the comments section.
11
Cf. Beal (2004) for a comprehensive overview of this period.
12
Joseph Wright himself grew up under harsh circumstances during the late industrial revolution. He had to work in a cotton mill from the age of seven and learned to read and write only in his teenage years, attending evening classes and Sunday school (cf. Holder 2004: 229-34).
13
Token frequency was ascertained with WordSmith Tools for the full text of all dictionary entries.
References
Beal, J.C. (2004), English in Modern Times. London: Arnold.
Béjoint, H. (2000), Modern Lexicography. Oxford: OUP.
Chambers, J.K. & P. Trudgill (1998), Dialectology. 2nd ed. Cambridge: CUP.
Darlington, Th. (1887), The Folk-Speech of South Cheshire. English Dialect Society.
Ellis, A.J. (1968 [1889]), On Early English Pronunciation. Reprint. New York: Greenwood Press.
Francis, W.N. (1983), Dialectology. An Introduction. New York: Longman.
Holder, R.W. (2004), The dictionary men: their lives and times. Bath: Bath UP.
Markus, M. (2007), ‘Wright’s English Dialect Dictionary Computerised: Towards a New Source of Information’, Online publication VARIENG Helsinki.
Markus, M. (2007), ‘Wright’s EDD Computerised: Architecture and Retrieval Routine’, Online publication Conference Dagstuhl Dec. 2006.
Markus, M. and R. Heuberger (2007), ‘The Architecture of Joseph Wright’s English Dialect Dictionary: Preparing the Computerised Version’, International Journal of Lexicography 20 (4), 355-68.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A comprehensive grammar of the English language. London: Longman.
Scott, M. (2004), WordSmith Tools. 4th ed. Oxford: OUP.
Thompson, A. (2006), Joseph Wright’s slips: a linguistic evaluation of the making of the English Dialect Dictionary. Unpublished M.A. Thesis: Leeds.
Wright, J. (1898-1905), The English Dialect Dictionary. 6 vols. Oxford: Henry Frowde.
Wright, J. (1905), The English Dialect Grammar. Oxford: Henry Frowde, [repr. 1968].
How representative are the ‘Philosophical Transactions of the Royal Society’ of 17th-century scientific writing?
Lilo Moessner
RWTH Aachen, Germany

Abstract
The focus of the paper is on the notion of representativeness. It is approached from three different angles. In the first section, representativeness as a (desirable? possible?) property of linguistic corpora is discussed. Then the point of view is narrowed down to the R (for ‘representative’) in ARCHER, and here in particular to the register ‘science’. In the following empirical part, a multidimensional analysis of English science texts of the 17th century is presented. It is based on a corpus which comes in equal parts from ARCHER and from other sources. The comparative analysis reveals major differences between the sub-corpora. They are interpreted in section 4 as different degrees of representativeness. The last section contains a summary and the conclusion that the linguistic structure of English science texts of the 17th century is not fully represented by a random sample of texts from the Philosophical Transactions.

1. Representativeness as a property of linguistic corpora
When the pioneers of sociolinguistics applied sociological methods in linguistic studies, they were very explicit about the requirement of representativeness and about the sampling methods to be used in order to achieve this goal: “There is only one way to ensure that the results obtained in an incomplete survey of this kind can legitimately be said to apply to the population as a whole: the section of the population which is to be studied must be selected by ‘accepted statistical methods’ (...). The informants, that is, must constitute a genuine representative sample of the city’s population.” (Trudgill 1974: 21) The ‘accepted statistical methods’, which are described in much detail (Trudgill 1974: 21-25), produced what handbooks of empirical social sciences refer to as quasi-random or stratified samples. They were not completely random, because not every member of the population had the same chance of being selected. In Trudgill’s study on the social stratification of Norwich English it was important that members of all social strata were included in the sample, and this would not have been guaranteed by complete random sampling. The compilers of the Brown Corpus probably aimed at representativeness, too, since Francis’s understanding of a corpus was “a collection of texts assumed to be representative of a given language, dialect, or other subset of a language, to
be used for linguistic analysis” (Francis 1979: 110). It is, however, doubtful if representativeness was achieved. One problem lies in the determination of the universe of texts from which the samples are to be taken, and another in the decision on the number of text categories as well as the number and size of the texts to be included in each category. For the Brown Corpus the universe of texts was defined as “edited English prose printed in the United States during the calendar year 1961” (Brown Corpus Manual). For practical reasons, most texts were taken from the holdings of Brown University Library, which narrows down the scope of the textual universe. The list of the text categories and the number and size of the texts to be included in each category were set up at a conference at Brown University in 1963. After these initial decisions were taken, the actual sampling process could start, and here the rules of random sampling were followed. These shortcomings concerning representativeness are explicitly mentioned in the manual to the LOB Corpus: “... the present corpus is not representative in a strict statistical sense..... The true ‘representativeness’ of the present corpus arises from the deliberate attempt to include relevant categories and subcategories of texts rather than from blind statistical choice.” (Johansson 1978: 14). Since the FLOB and Frown corpora were planned as exact counterparts to LOB and Brown, the same compilation principles were followed. Necessary modifications in the press section of FLOB are described in Sand/Siemund (1992). It is generally agreed that the compilation of historical corpora is an even more daunting enterprise, especially when early periods are to be included (cf. Biber et al. 1998: 251-53; Meyer 2002: 37f.). I need not point out the problems arising from the limited amount of data and their poor quality. 
What is more relevant in the present context is the classification of the textual universe, be it ever so limited and so poor. A lot of work lies behind the division of the texts in the Helsinki Corpus into 33 text types and 7 prototypical text categories, respectively (cf. Kytö/Rissanen 1993: 10-14; Rissanen 1994: 76f.; Kytö 1996: 46). But the principles of classification are as subjective as in the corpora mentioned before, and the universe of texts from which the text selection was made is even less precisely specified. I would like to stress that I have no intention of playing down the merits of the Helsinki Corpus; like so many others I am very grateful that we have it. But we should be aware of its limitations when we want to generalize on the basis of results obtained from its data.
ARCHER is the long-term diachronic English corpus which complements the Helsinki Corpus with texts from the middle of the 17th to the end of the 20th century. Its name documents that it was intended as A Representative Corpus of Historical English Registers. This aim is explicitly stated in the description of the corpus: “It was our aim ... to collect a representative corpus of texts in each of the several registers” (Biber/Finegan/Atkinson 1994: 4). The compilers of ARCHER were aware of the problems ensuing from their ambitious aim, and the guidelines which they followed in the compilation process allow us to assess the amount of representativeness they could hope to attain. Ideally the textual universe would
have been derived from an exhaustive list of all texts written between 1650 and 1999. For practical reasons, the starting point was “the major research libraries of the University of Southern California, the University of California at Los Angeles, and the Huntington Library in San Marino” (ibid.: 5). This is basically the same procedure as that adopted for the Brown Corpus. The selection of the texts and their assignment to 10 categories were probably as much a result of careful deliberation and agreement among the research team as in the compilation of the Brown Corpus. Some categories of the Brown Corpus have counterparts in ARCHER, some Brown categories are missing from ARCHER, and some ARCHER categories are absent from the Brown Corpus.1 Special decisions were taken for the compilation of the register ‘science’, and these will be dealt with separately.
After these initial considerations it seems that representativeness is an unachievable goal and therefore no longer worth aiming at. The first part of this claim was explicitly supported by Mukherjee in his review article of three recent introductions to corpus linguistics,2 and it was implicitly admitted by the expression “the holy grail of representativeness” in Leech (2007: 134).3 This pessimistic attitude is easy to understand if we ask which conditions must be fulfilled for a corpus to be representative. A representative corpus is a random sample of a population, which exactly maps the structure of the population. This, however, can only be guaranteed when we know so much about the structure of the population that sampling becomes superfluous.4 This consequence is not drawn by either of the linguists mentioned before, nor does it correspond to the research reality in corpus linguistics.
This more optimistic point of view is based on the conviction that representativeness is not an either/or property, but that there are degrees of representativeness and that there are methods helping to approach the ideal of maximum representativeness. These methods involve the use of the competence of expert corpus linguists in determining the number and relative importance of genres, the appropriate degree of their subclassification, and the size of the text samples to be included. Leech (2007: 140f.) contrasts two proposals for achieving a still higher level of representativeness. His own consists in making ACEs (Atomic Communicative Events) the yardstick for the inclusion of texts into a corpus. This means that the proportion of texts of a special category is not measured by the number of texts produced, but by the number of text recipients. One of the consequences of this approach would be that in a corpus of PDE, tabloid papers and everyday conversations would be represented by more texts than broadsheet papers and academic lectures or sermons. The other method was proposed by Biber (1993) and taken up in Biber et al. (1998: 250). It consists in a cyclic procedure, starting with the compilation of a pilot corpus. The results of a multidimensional analysis of this corpus reveal those registers in which the linguistic variation of the whole corpus is insufficiently evidenced. More material is added to them, and the cycle starts again with the enlarged corpus. It is repeated again and again, until stable variation is achieved. A similar line of research is also envisaged for the compilation of a web-based corpus in Biber et al. (2007: 126).
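Leech's recipient-based weighting described above can be illustrated with a small allocation sketch. It is only a toy reading of the proposal: the audience figures, the category names and the function `allocate_by_recipients` are hypothetical and are not taken from Leech (2007); the largest-remainder rounding is likewise an assumption made so that the quotas add up to the target number of texts.

```python
def allocate_by_recipients(recipients, total_texts):
    """Split total_texts across categories in proportion to their recipients.

    Uses largest-remainder rounding so that the quotas sum to total_texts.
    """
    total = sum(recipients.values())
    exact = {cat: total_texts * n / total for cat, n in recipients.items()}
    quotas = {cat: int(share) for cat, share in exact.items()}
    leftover = total_texts - sum(quotas.values())
    # Hand the remaining slots to the categories with the largest remainders.
    by_remainder = sorted(exact, key=lambda c: exact[c] - quotas[c], reverse=True)
    for cat in by_remainder[:leftover]:
        quotas[cat] += 1
    return quotas


# Purely illustrative audience sizes (assumed, not empirical figures).
audience = {
    "tabloid": 6_000_000,
    "broadsheet": 1_500_000,
    "academic_lecture": 500_000,
}
quotas = allocate_by_recipients(audience, 100)
print(quotas)
```

Under these assumed figures the tabloid category receives far more text slots than broadsheets or lectures, which is the consequence the paragraph describes.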

2. The register ‘science’ in ARCHER
In the context of representativeness, three registers of ARCHER deserve special attention; these are ‘legal opinion’, ‘medicine’, and ‘science’. The universes from which their texts were chosen were narrowed down to “appellate and Supreme Court decisions of the Commonwealth of Pennsylvania” (Biber et al. 1994: 6), to articles published in the Edinburgh Medical Journal, and to those in the Philosophical Transactions of the Royal Society (PTRS). These restrictions may have been motivated by the research aim to investigate the evolution of “representative journals” (ibid.: 2),5 but it is equally probable that more practical reasons dictated this strategy. That this was definitely so and that this led to even greater restrictions for the register ‘science’, is openly admitted: “as an expedient in the face of diminishing resources, we targeted volumes representing central years within each period rather than all volumes throughout the period” (ibid.: 6). This means that the first 50-year period of the register ‘science’ is represented by articles of the PTRS of the years 1674 and 1675. They deal with very different topics, ranging from “Brief Directions on How to Tan Leather”, “A Phytological Observation concerning Orenges and Limons”, “Advertisements ... upon Frosts in some parts of Scotland” to “Considerations Touching the Compression of the Air”. They come in 10 files of about 2,000 words each. As a consequence of random sampling and due to the fact that scientific articles of the 17th century were often much shorter than those of modern times, the majority of files contain more than one text. The text passages of the 13 identifiable authors range from a little more than 100 to just over 2,000 words. The size of the texts whose authors are not indicated or are marked ‘anonymous’ amounts to nearly 40%. This structure of the science sub-corpus of ARCHER casts some doubt on its representativeness.
A more serious problem lies in the choice of the textual universe from which the samples were taken. Although the leading role of the Royal Society in 17th century science is undisputed and although its language policy had a shaping influence on scientific English,6 the PTRS were not its only, perhaps not even its most important, voice. Papers read during the meetings of the Royal Society could be included in its History, in its Register-book, or be published as monographs. Following Biber (1988: 13), who argues that there is a systematic relation between extralinguistically defined registers, in this case science, and their linguistic structure, it will be assumed that different publishing formats of 17th century science texts resulted in different variation patterns. If this hypothesis can be empirically supported, the representativeness of the sub-corpus ‘science’ of ARCHER is not yet perfect. In the next section I will present results from a comparative analysis of the 17th century sub-corpus ‘science’ of ARCHER and of a matching corpus of science texts written by scholars connected with the Royal Society and published in book form.

3. A multidimensional analysis of 17th century science texts

3.1 The data7
The texts of the 10 science files from ARCHER were originally sent to Henry Oldenburg, the editor of the PTRS, some of them as responses to queries from the Royal Society. Others reached him through intermediaries who wished to bring them to the attention of a wider scientific public. It depended mainly on Oldenburg whether an article was included in the PTRS or not. All ARCHER files are prefixed by an identification label which specifies the date of publication and the author of the first text of the file. In this study, the files will be referred to by abbreviations of the text authors.8
The control corpus also contains 10 files of about 2,000 words each, but each file consists of just one passage from one text. The files were produced as transcripts from microfilm versions of the original texts9 or from facsimile editions accessible through Early English Books Online. Emendations were kept to a minimum; only obvious spelling errors were corrected. Care was taken to include passages from different parts of the texts. The files will also be referred to by abbreviations of the names of their authors. The following texts, which date from between 1661 and 1691, were used:
Nehemiah Grew: Experiments in Consort of the Luctation arising from the Affusion of several Menstruums upon all sorts of Bodies (1678) [= Grew]
Robert Hooke: An attempt for the explication of phaenomena (1661) [= Hook]
Henry Power: Experimental Philosophy (1663) [= Pow]
Robert Boyle: A Continuation of New Experiments (1669) [= Boyle]
Sir Kenelm Digby: Chymical Secrets (1683) [= Dig]10
George Sinclair: Hydrostatical Experiments (1672) [= Sincl]
Hugh Gregg: Curiosities in Chymistry (1691) [= Gregg]
P. Thibaut: The art of chymistry (1675) [= Thib]
John Wallis: Discourse of Gravity and Gravitation (1675) [= Wall]
John Evelyn: A Philosophical Discourse of Earth (1676) [= Eve]
All texts were published as book-length studies, and they deal with topics of the fields of physics and chemistry.
Their authors had close connections with the Royal Society (RS); this is why the files will be referred to as RS-files. Most authors were fellows of the RS. Sinclair was an active correspondent, whose observations can also be found in the PTRS. Thibaut’s book was originally written in French, but the English translation was made by a fellow of the RS. The whole corpus contains 41,770 words, 20,751 words coming from the ARCHER sub-corpus and 21,019 words from the RS-subcorpus. Its structure is shown in Table 1.

Table 1: Structure and size of the corpus

ARCHER    size      RS        size
Ano1      2,054     Grew      2,028
Leew      2,127     Hook      2,026
A.I.      2,042     Pow       2,050
Ano2      2,083     Boyle     2,012
Beal      2,056     Dig       2,024
B.R.      2,057     Sincl     2,168
Hook      1,899     Gregg     2,198
Hugy      2,001     Thib      2,281
Leib      2,181     Wall      2,082
Ray       2,251     Eve       2,150

3.2 Research method and linguistic features
The corpus was analysed with the method of multidimensional analysis (MD analysis); it assumes that
- texts are characterized not by one, but by a combination of several communicative functions,
- these functions can be described as dimensions of variation,
- the dimensions of variation can be derived from the co-occurrence patterns of linguistic features,
- the co-occurrence patterns can be automatically produced by a statistical program as the output of a factor analysis.
The input for the factor analysis consists of frequency counts of the features which are assumed to determine the functional profile of the texts under consideration. Strictly speaking, each new MD analysis should start with a factor analysis. In practice, however, apart from a couple of noteworthy exceptions (Taavitsainen 1993, Biber 2001), it has become customary to take over the dimensions of variation established in Biber (1988) for PDE, but the linguistic features which are counted are adapted to the research interests of the respective linguists and to practical necessities (Biber/Finegan 1997, González-Álvarez/Pérez-Guerra 1998, Atkinson 1999). This is the strategy followed here, too.
The linguistic features were counted in all 20 files, and the absolute frequencies were normalised on the basis of 1,000 words to avoid skewing of the results by the different sizes of the individual files. The mean frequencies of all features were calculated, and their standard deviations in both sub-corpora, i.e. in the ARCHER-files and in the RS-texts, were established. Then all frequencies were standardized to a mean of 0.0 and a standard deviation of 1.0. Text dimension scores were computed on all dimensions as the sums of the standardized frequencies of the features, and their means yielded the genre dimension scores for the two sub-corpora.11 The inventory of linguistic features which were counted comprises the following 17 elements:
- verb forms: present tense, past tense, finite perfect aspect forms, be as main verb, passive verbal syntagms;
- modal auxiliaries: possibility, prediction, and necessity modals;
- pronouns: first, second, and third person forms of personal pronouns, reflexive pronouns and possessive determiners;
- nominal postmodifiers: relative clauses, past participle clauses;
- adverbial subordinate clauses: conditional clauses, all other subordinate clauses.
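The scoring steps described above (normalisation per 1,000 words, standardisation, summing per dimension, averaging per sub-corpus) can be sketched as follows. This is a minimal illustration, not the author's actual procedure: the three files and their counts are invented, the function names are hypothetical, and standardising against the sample's own mean and sample standard deviation is an assumption, since the reference values used in the study are not restated in this section.

```python
from statistics import mean, stdev

# Hypothetical raw counts for three files; NOT data from the study.
files = {
    "file_a": {"words": 2000, "past_tense": 60, "third_person": 40},
    "file_b": {"words": 2100, "past_tense": 21, "third_person": 21},
    "file_c": {"words": 1900, "past_tense": 95, "third_person": 57},
}
DIMENSION_2 = ["past_tense", "third_person"]  # two of the narrativity features


def normalise(count, n_words, basis=1000):
    """Convert an absolute count into a rate per `basis` words."""
    return count * basis / n_words


def standardise(values):
    """z-scores relative to the values' own mean and sample sd (assumption)."""
    m, sd = mean(values), stdev(values)
    return [(v - m) / sd for v in values]


def text_dimension_scores(files, features):
    """Sum the standardised feature rates of each file on one dimension."""
    names = list(files)
    scores = dict.fromkeys(names, 0.0)
    for feat in features:
        rates = [normalise(files[n][feat], files[n]["words"]) for n in names]
        for name, z in zip(names, standardise(rates)):
            scores[name] += z
    return scores


scores = text_dimension_scores(files, DIMENSION_2)
genre_score = mean(list(scores.values()))  # mean of the text dimension scores
print(scores, genre_score)
```

Because this toy example standardises against its own three files, the text dimension scores necessarily average to zero; that property should not be expected of the genre dimension scores reported in the chapter, which depend on the reference means and standard deviations actually used.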
Since the texts of the corpus were written in the 17th century, the realisation of the investigated features was not necessarily identical with that of the corresponding features in PDE (e.g. the feature ‘second person pronoun’ is realised by thou and by you, perfect aspect forms contain the auxiliary have or the auxiliary be). EModE relative constructions required a different treatment from PDE relative constructions. In Biber’s PDE model, relative clauses introduced by that figure on the dimension which marks an overt expression of persuasion, whereas relative clauses introduced by wh-pronouns are treated as markers of elaborate reference. In the 17th century the distribution of these two types of relative markers did not yet follow the PDE rules, and therefore all relative clauses were interpreted as markers of elaborate reference.

3.3 The results
On dimension 1, which measures the degree of involvement and interaction of a text, the following features were counted: present tense verb, second person pronoun, first person pronoun, main verb be, and possibility modal. Table 2 contrasts text and genre dimension scores of the sub-corpora.

Table 2: Text and genre dimension scores of the sub-corpora on dimension 1

ARCHER      text dim    RS          text dim
Ano1        -0.51       Grew        -3.16
Leew        -2.84       Hook         0.06
A.I.         2.20       Pow         -1.79
Ano2        -3.60       Boyle       -2.41
Beal         3.20       Dig         -2.11
B.R.         0.97       Sincl        0.48
Hook         2.05       Gregg       -1.52
Hugy        -2.42       Thib         0.15
Leib        -1.50       Wall        -0.26
Ray         -0.71       Eve         -0.32
genre dim   -0.32       genre dim   -1.09
Six of the ARCHER-files have negative dimension scores, and they are not counterbalanced by high positive dimension scores of the other files. Consequently, the average dimension score of -0.32, which characterizes these texts as a whole and is usually referred to as ‘genre dimension score’, situates them slightly below the dividing-line between involved and informational. Since as many as seven RS-texts have negative dimension scores, it is to be expected that the absolute value of their genre dimension score is higher. The genre dimension score of -1.09 indicates that the RS-texts are more informational than the others. The individual text dimension scores also reveal that the RS-texts are linguistically much more homogeneous than the ARCHER-files. Homogeneity is given when the range of text dimension scores, i.e. the difference between the highest and lowest text dimension score, is low. On this dimension, the range of the ARCHER-files is 6.80, and this contrasts sharply with 3.64 for the RS-texts. Biber (1988: 171) suggests two interpretations of high ranges in a genre; either the genre contains several sub-genres (e.g. the category ‘academic prose’ in his PDE corpus, which contains the sub-genres natural science, medical, mathematics, social science, politics/education, humanities, and technology/engineering), or it is not well-defined (e.g. the category ‘conversation’ in his PDE corpus). A comparison of the extreme values among the text dimension scores yields a further interesting result. The lowest score of an ARCHER-file (Ano2 = -3.60) is lower than the lowest score of an RS-text (Grew = -3.16), and the highest score of an ARCHER-file (Beal = 3.20) is higher than the highest score of an RS-text (Sincl = 0.48). This constellation means that the ARCHER-subcorpus covers more variation patterns than the RS-subcorpus.
On dimension 2, which measures the degree of narrativity, the features past tense verb, perfective verb (perfect aspect forms of verbs), and third person pronoun were counted. Text and genre dimension scores are entered in Table 3.

Table 3: Text and genre dimension scores of the sub-corpora on dimension 2

ARCHER      text dim    RS          text dim
Ano1        -1.05       Grew        -5.25
Leew        -1.27       Hook        -0.94
A.I.         0.93       Pow         -3.11
Ano2         4.41       Boyle       -3.63
Beal         2.72       Dig         -0.95
B.R.         3.61       Sincl       -6.38
Hook        -1.09       Gregg       -2.64
Hugy         4.91       Thib        -3.90
Leib         0.71       Wall        -4.65
Ray          0.45       Eve         -4.11
genre dim    1.43       genre dim   -3.56
Here the difference between the two genre dimension scores is even bigger. The value 1.43 for the ARCHER-files attests to a small degree of narrativity, whereas the genre dimension score of -3.56 situates the RS-texts considerably below the dividing-line between narrative and non-narrative texts. All RS-texts have a negative text dimension score. The range of text dimension scores is again bigger for the ARCHER-files (6.18) than for the RS-texts (5.44). As on dimension 1, the ARCHER-files cover more variation patterns than the RS-texts. But it is important to note that on dimension 2 the range of the RS-texts does not lie within that of the ARCHER-files, but that the ranges overlap. The lowest text dimension score of all files (-6.38) comes from the RS-text Sincl, the highest (4.91) from the ARCHER-file Hugy. This yields an overall range of 11.29.
The linguistic features which were counted on dimension 3 as indicators of the degree of explicit/elaborate reference were relative clauses of three types: relative clauses with the relative marker in subject position, relative clauses with the relative marker in object position, and pied piping constructions. Zero-introduced relative clauses were left out of consideration. Table 4 contains the text and genre dimension scores of the sub-corpora.

Table 4: Text and genre dimension scores of the sub-corpora on dimension 3

ARCHER      text dim    RS          text dim
Ano1        -0.41       Grew        -2.88
Leew        -2.03       Hook        -0.35
A.I.         0.84       Pow         -1.05
Ano2         3.88       Boyle       -1.12
Beal         0.75       Dig          2.64
B.R.         3.56       Sincl        0.13
Hook        -1.73       Gregg       -0.90
Hugy        -1.37       Thib        -2.16
Leib         0.16       Wall        -3.94
Ray          0.60       Eve          1.67
genre dim    0.42       genre dim   -0.80
The text dimension scores of the individual texts yield a positive genre dimension score for the ARCHER-subcorpus (0.42) and a negative value for the RS-subcorpus (-0.80). Consequently, the ARCHER-texts are marked by a small degree of elaborate/explicit reference, whereas the reference system in the RS-texts is more situation-dependent. With respect to their reference system, the degree of heterogeneity does not differ much in the two sub-corpora; the range in the ARCHER-subcorpus is 5.91, that in the RS-subcorpus is 6.58. But for the first time it is the RS-subcorpus which covers more variation patterns than the ARCHER-subcorpus. As on dimension 2, the two ranges overlap. The lowest value (-3.94) comes from the RS-text Wall, the highest (3.88) from the ARCHER-file Ano2.
Dimension 4 measures the degree of open persuasion. Here the following features were counted: prediction modal (the modal auxiliaries will, shall, would), conditional subordination (conditional clauses), and necessity modal (the modal auxiliaries must and should). The usual figures are given in Table 5.

Table 5: Text and genre dimension scores of the sub-corpora on dimension 4

ARCHER      text dim    RS          text dim
Ano1        -1.50       Grew        -2.28
Leew         0.28       Hook         1.16
A.I.         0.26       Pow          2.59
Ano2        -0.72       Boyle       -0.38
Beal        -4.34       Dig          1.55
B.R.        -1.82       Sincl        1.25
Hook        -2.24       Gregg        0.38
Hugy        -3.67       Thib         2.29
Leib        -2.72       Wall         3.93
Ray          0.41       Eve         -2.39
genre dim   -1.61       genre dim    0.81
The genre dimension score of -1.61 for the ARCHER-files is to be interpreted as a very low degree of open persuasion, whereas their genre dimension score of 0.81 situates the RS-texts slightly above the baseline of the scale of persuasion. As on dimension 3, the range of the text dimension scores is smaller in the ARCHER-subcorpus (4.75) than in the RS-subcorpus (6.32). Consequently, the RS-texts cover more variation patterns. The ranges overlap as on dimensions 2 and 3; the lowest value (-4.34) comes from the ARCHER-file Beal, the highest (3.93) from the RS-text Wall.
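The ranges and overlaps reported for dimensions 2 to 4 can be recomputed from the text dimension scores in Tables 3 to 5. The sketch below is only a checking aid; the helper names `score_range` and `overlap` are hypothetical, and the scores are copied from the tables (ARCHER files first, then RS texts, each in table order).

```python
# Text dimension scores copied from Tables 3-5: (ARCHER files, RS texts).
SCORES = {
    2: ([-1.05, -1.27, 0.93, 4.41, 2.72, 3.61, -1.09, 4.91, 0.71, 0.45],
        [-5.25, -0.94, -3.11, -3.63, -0.95, -6.38, -2.64, -3.90, -4.65, -4.11]),
    3: ([-0.41, -2.03, 0.84, 3.88, 0.75, 3.56, -1.73, -1.37, 0.16, 0.60],
        [-2.88, -0.35, -1.05, -1.12, 2.64, 0.13, -0.90, -2.16, -3.94, 1.67]),
    4: ([-1.50, 0.28, 0.26, -0.72, -4.34, -1.82, -2.24, -3.67, -2.72, 0.41],
        [-2.28, 1.16, 2.59, -0.38, 1.55, 1.25, 0.38, 2.29, 3.93, -2.39]),
}


def score_range(scores):
    """Difference between the highest and the lowest text dimension score."""
    return round(max(scores) - min(scores), 2)


def overlap(a, b):
    """Interval shared by the ranges of a and b, or None if they are disjoint."""
    low, high = max(min(a), min(b)), min(max(a), max(b))
    return (low, high) if low <= high else None


summary = {
    dim: {
        "archer_range": score_range(archer),
        "rs_range": score_range(rs),
        "overall_range": score_range(archer + rs),
        "overlap": overlap(archer, rs),
    }
    for dim, (archer, rs) in SCORES.items()
}
for dim, row in summary.items():
    print(dim, row)
```

On dimension 2 the shared interval is very narrow (from -1.27 to -0.94), which makes concrete why the text speaks of overlapping ranges rather than of one range lying within the other.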
Dimension 5 measures the degree of abstractness or impersonality. On this dimension the following features were counted: passive verbal syntagms, nominal postmodifiers realized by past participle constructions, and adverbial subordinate clauses except conditional clauses (they are considered a feature of dimension 4). Table 6 contains the relevant figures.

Table 6: Text and genre dimension scores of the sub-corpora on dimension 5

ARCHER      text dim    RS          text dim
Ano1         1.67       Grew        -0.18
Leew        -1.20       Hook        -1.14
A.I.        -1.67       Pow         -1.31
Ano2        -0.03       Boyle        1.50
Beal        -1.30       Dig          0.31
B.R.         0.28       Sincl        0.84
Hook         0.98       Gregg        3.85
Hugy         1.10       Thib         1.94
Leib        -1.42       Wall         0.88
Ray         -2.42       Eve         -3.44
genre dim   -0.40       genre dim    0.32
The genre dimension score of the ARCHER-files (-0.40) situates them just below the dividing-line between abstract and non-abstract texts, whereas the value for the RS-texts (0.32) situates them slightly above the dividing-line on the abstract side. As on dimensions 3 and 4, the RS-texts show a greater range of text dimension scores (7.29) than the ARCHER-files (4.09). A comparison of the respective positions of the two ranges yields a mirror image of dimension 1. On dimension 5, both the lowest value (-3.44) and the highest value (3.85) come from RS-texts.

4. Interpretation of the results of the analysis
The results of the analysis presented in the preceding section show that the two sub-corpora differ in the following four parameters: the extreme values of the text dimension scores, the position of the range of these values on the dimension scales, the genre dimension scores and their position on the dimension scales. These parameters are illustrated in Figure 1.
On dimension 1, all text dimension scores of the RS-texts lie within the range of the ARCHER-subcorpus, and both sub-corpora have genre dimension scores which mark them as more informational than involved. Consequently, the control corpus did not yield additional variation patterns, and the ARCHER-subcorpus with its wider range represents the genre ‘science’ better than the RS-subcorpus.
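The parameters listed above can be recomputed for dimension 1 and, for comparison, dimension 5 from the scores in Tables 2 and 6. The following sketch is a checking aid under simple assumptions: the function names `profile` and `lies_within` are hypothetical, and the genre dimension score is taken to be the plain arithmetic mean of the ten text dimension scores, as the method section states.

```python
from statistics import mean

# Text dimension scores copied from Table 2 (dimension 1) and Table 6 (dimension 5).
DIM1 = {
    "ARCHER": [-0.51, -2.84, 2.20, -3.60, 3.20, 0.97, 2.05, -2.42, -1.50, -0.71],
    "RS": [-3.16, 0.06, -1.79, -2.41, -2.11, 0.48, -1.52, 0.15, -0.26, -0.32],
}
DIM5 = {
    "ARCHER": [1.67, -1.20, -1.67, -0.03, -1.30, 0.28, 0.98, 1.10, -1.42, -2.42],
    "RS": [-0.18, -1.14, -1.31, 1.50, 0.31, 0.84, 3.85, 1.94, 0.88, -3.44],
}


def profile(scores):
    """Extreme values, range and genre dimension score of one sub-corpus."""
    return {
        "min": min(scores),
        "max": max(scores),
        "range": round(max(scores) - min(scores), 2),
        "genre": mean(scores),
    }


def lies_within(inner, outer):
    """True if every score in `inner` falls inside the range of `outer`."""
    return min(outer) <= min(inner) and max(inner) <= max(outer)


for label, dim in (("dimension 1", DIM1), ("dimension 5", DIM5)):
    for corpus, scores in dim.items():
        print(label, corpus, profile(scores))

print(lies_within(DIM1["RS"], DIM1["ARCHER"]))   # RS inside ARCHER on dimension 1
print(lies_within(DIM5["ARCHER"], DIM5["RS"]))   # ARCHER inside RS on dimension 5
```

Both containment checks hold for the tabulated scores, which is the sense in which dimension 5 is described as the mirror image of dimension 1. The unrounded RS mean on dimension 5 is 0.325, reported in Table 6 as 0.32.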
Figure 1: Ranges of text dimension scores and genre dimension scores

It is interesting to note that the files with the lowest and the highest text dimension scores, Ano2 and Beal, represent different sub-genres. Both files contain one text only; Ano2 is an account of discoveries made during expeditions in search of a north passage to China and Japan, whereas Beal is a letter to the editor of the PTRS, in which the author describes his observations about frosts in Scotland. In the last paragraph of this letter, which does not form part of the file, the author describes his own language like this:
“But you must expect no other language, or composure, than what comes first to a running pen, and agrees with rusticities; for which I have more affections, than spare minutes to offer to you.” (PTRS vol. 10, p. 367)
The text of the letter is in line with this description; it is a first person report with many possibility modals, mostly written in present tense. This is the beginning of the letter:
“It may seem, by the curious Remarks sent to you from Scotland that we are yet to seek out the Causes and original Source, as well as the Principles and Nature, of Frosts. I wish, I were able to name all circumstances that may be causative of Frosts, Heats, Winds, and Tempests. I know by experience, that the situation of the place is considerable for some of these; but after much diligence and troublesome researches, I cannot define the proximity or distance, not all the requisites, that ought to be concurrent for all the strange effects I have observ’d in them.”
First and second person pronouns as well as possibility modals are absent from the beginning of Ano2. The prevalent tenses are past and present perfect.
“It is sufficiently known to those who have made any inspection into the Navigation of this and the former Age, how studiously and sollicitously the Lords of the United Netherlands have, for these eighty years and more, laboured to encourage those that should first discover a more compendious and shorter passage by the North to China, Japon, and other Oriental countries. But those who first adventured upon this Enterprize, found by sad experience, that the success answered not their expectation and hopes: whose calamitous encounters I shall not go about to recite, since their own Narratives have run through most hands.”
The situation on dimension 5 is the mirror image of dimension 1. All text dimension scores of the ARCHER-files lie within the range of the RS-subcorpus. Since additionally the genre dimension scores of the two sub-corpora are situated on different sides of the dividing-line between abstract and non-abstract texts, the control corpus proved particularly helpful in establishing the linguistic structure of science texts. It represents the genre more adequately than the ARCHER sub-corpus. Unlike on dimension 1, the texts with the lowest and the highest text dimension scores cannot be attributed to different sub-genres of scientific writing. Gregg is an account of the chemical properties of natural substances, and Eve describes the consistency of several geological layers. The following extract from Gregg shows that the high score for abstractness is mainly due to the large number of passive constructions: “If you pour Spirit of Salt, by degrees, upon a Lee of Salt of Tartar, (or of any other Alcalisate Salt, ) ‘till it be almost satiated, (which is known by the abating of the Effervescence, ) you shall observe a kind of Earth precipitate out of the fixt Salt, (namely because, upon the mutual conflict, between an Acid and an Alcali, whatsoever heterogeneous substance is contained in either of them uses to precipitate.) The Earthy part of the Salt of Tartar being thus separated, the saline part is thereby render’d Volatile, and would actually fly away, were it not for the Acid that fixes it anew: and if you separate this Acid, by the addition of new Salt of Tartar, it will by this means be set at liberty, and strike your Nostrils with an Urinous odour.” In Eve, by contrast, the properties of the individual substances are foregrounded, and passive constructions are very rare.
This results in a less abstract style, as is illustrated by the following extract: “But all Sand does easily admit of Heat and Moisture, and yet for that not much the better; for either it dismisses and lets them pass too soon, and so contracts no ligature; or retains it too long; especially where the bottom is of Clay, by which it parches, or chills, producing
nothing but Moss, and disposes to Cancerous infirmities: But if, as sometimes it fortunes, that the Sand have a surface of more genial mould, and a fund of Gravel or loose stone; though it do not long maintain the virtue it receives from Heaven; yet it produces as forward springing, and is parent of sweet Grass, which, though soon burnt up in dry weather, is as soon recover’d, with the first rain that falls.” The interpretation of the results on the other three dimensions is less straightforward. The ranges on these dimensions overlap, and where the genre dimension scores are positive for one sub-corpus they are negative for the other and vice versa. The obvious conclusion is that on these dimensions neither subcorpus is sufficiently representative of English scientific writing in the 17th century. The combination of the two sub-corpora would be a promising move in the direction towards a more representative corpus of 17th century scientific writing. Its degree of representativeness would then have to be tested by comparing it with still other texts of the same register, complete representativeness being reached when no new variation patterns are discovered.
5. Summary and conclusion
In this paper I argued that although the existing multi-purpose corpora cannot be considered representative samples, representativeness should nevertheless be an aim of corpus builders. This claim rests on the assumption that representativeness is not an either/or quality, but that corpora can be placed on a scale of representativeness. Then I tested the hypothesis that the 17th century PTRS texts of the register ‘science’ in ARCHER are representative of scientific writing of that period. I analysed the 10 ARCHER-files and a control corpus of the same size with the method of MD-analysis. The range of text dimension scores and their position on the dimension scales as well as the genre dimension scores and their position on the respective scales were established. These parameters were used to measure the degree of representativeness of the two sub-corpora. On dimension 1, the ARCHER sub-corpus proved more representative of English scientific writing in the 17th century, whereas on dimension 5 the control corpus showed a higher degree of representativeness. Only on dimension 1 did the lowest and highest text dimension scores correlate with different sub-genres of the genre science. On dimensions 2-4 neither subcorpus reached a sufficient degree of representativeness. The results of the present study strongly suggest the compilation of a more representative corpus of 17th century scientific writing, which should contain texts from diverse sources. It seems that the situational parameter ‘format’ with the dichotomy “published vs non-published and various formats within ‘published’” (Biber 1993: 245) is of particular relevance here.
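The comparison summarised above rests on two simple quantities per sub-corpus and dimension: the range of the text dimension scores and the genre dimension score. A minimal Python sketch of that comparison follows. It assumes, following common practice in MD analysis, that the genre dimension score is the mean of the text dimension scores; the function names and all numerical values are hypothetical and are not taken from the study.

```python
from statistics import mean


def score_range(text_scores):
    """Return the (lowest, highest) text dimension score of a sub-corpus."""
    return min(text_scores), max(text_scores)


def genre_score(text_scores):
    """Genre dimension score, assumed here to be the mean of the text scores."""
    return mean(text_scores)


def lies_within(inner, outer):
    """True if the range `inner` is fully contained in the range `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def opposite_sides(score_a, score_b):
    """True if two genre scores fall on different sides of the zero line."""
    return score_a * score_b < 0


# Hypothetical text dimension scores for two sub-corpora on one dimension.
archer = [-2.1, 0.4, 1.3]
control = [-3.0, -0.5, 2.2]

print(score_range(archer), score_range(control))
print(lies_within(score_range(archer), score_range(control)))
print(opposite_sides(genre_score(archer), genre_score(control)))
```

On this reading, a sub-corpus whose range contains the range of the other covers more of the attested variation on that dimension, which is the sense in which one sub-corpus is called more representative than the other in the discussion above.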
Notes
1 The Brown Corpus categories A-C (Press: Reportage, Press: Editorial, Press: Reviews) correspond to the ARCHER category ‘News’; the Brown Corpus category J (Learned) is split up in ARCHER into the categories ‘Legal opinion’, ‘Medicine’, and ‘Science’. The Brown Corpus categories E-G (Skills and Hobbies; Popular Lore; Belles Lettres, Biography, Memoirs, etc.) have no clearly recognizable counterparts in ARCHER, and the registers ‘Journals/Diaries’ and ‘Letters’ are absent from the Brown Corpus.
2 “Absolute representativeness is an unattainable ideal” (2004: 214).
3 Leech is more explicit towards the end of the same article when he writes: “... the absolute goal of representativeness is not attainable in practical circumstances” (2007: 140). It is interesting though that he argues in terms of practical circumstances, not on the basis of theoretical considerations.
4 This position was taken in Váradi (2001) and reported by Leech (2007: 136); cf. also Rieger (1979).
5 This is a problematic goal in itself, because the PTRS were the first and for about 200 years the only scientific journal in England (Kronick 1962: 6, Lambert 1985: 9).
6 cf. Moessner, forthcoming.
7 I wish to thank Christian Mair (Freiburg) for letting me have access to the ARCHER-files.
8 The following abbreviations are used: Ano1 (anonymous, PTRS 9), Leew (Antony van Leewenhoeck), A.I. (initials of an unidentifiable author), Ano2 (anonymous, PTRS 10), Beal (John Beal), B.R. (initials of an unidentifiable author), Hook (Robert Hooke), Hugy (Christian Huygens), Leib (Gottfried Wilhelm Leibniz), Ray (John Ray).
9 When the Boyle file was produced, the Hunter/Davis edition of Boyle’s works was not accessible. In the meantime the file has been collated with the edited text.
10 Digby’s book was published posthumously by his assistant George Hartman.
11 For a detailed description of these procedural steps cf. Biber 1988: 93-97.
Primary sources
ARCHER. A Representative Corpus of Historical English Registers.
Boyle, Robert (1669), A Continuation of New Experiments Physico-Mechanical, Touching the Spring and Weight of the Air, and their Effects. Oxford: Henry Hall.
Digby, Sir Kenelm (1683), Chymical Secrets, and rare Experiments in Physick and Philosophy. London: Will. Cooper.
Evelyn, John (1676), A Philosophical Discourse of Earth, Relating to the Culture and Improvement of it for Vegetation and the Propagation of Plants, etc. as it was presented to the Royal Society, April 29, 1675. London: John Martyn.
Gregg, Hugh (1691), Curiosities in Chymistry: Being new Experiments and Observations Concerning the Principles of Natural Bodies. London: Stafford Anson.
Grew, Jeremiah (1678), Experiments in Consort of the Luctation arising from the Affusion of several Menstruums upon all sorts of Bodies. London: John Martyn.
Hall, Marie Boas (ed.) (1966), Experimental Philosophy, in Three Books: Containing New Experiments Microscopical, Mercurial, Magnetical, by Henry Power. New York/London: Johnson Reprint Corporation.
Hooke, Robert (1661), An Attempt for the Explication of the Phænomena, Observable in an Experiment Published by the Honourable Robert Boyle. London: Sam. Thomson.
Sinclair, George (1672), The hydrostaticks, or, The weight, force, and pressure of fluid bodies, made evident by physical, and sensible experiments by G.S. Edinburgh: George Swintoun, James Glen, and Thomas Brown.
Thibaut, P. (1675), The Art of Chymistry: as it is now Practised. Written in French By P. Thibaut, Chymist to the French King. And now Translated into English, By A Fellow of the Royal Society. London: John Starkey.
Wallis, John (1675), A Discourse of Gravity and Gravitation, Grounded on Experimental Observations: Presented to the Royal Society, November 12, 1674. London: John Martyn.
References
Atkinson, D. (1999), Scientific Discourse in Sociohistorical Context. The Philosophical Transactions of the Royal Society of London, 1675-1975. London/Mahwah, NJ: Lawrence Erlbaum.
Biber, D. (2006), University Language: A Corpus-based study of spoken and written registers. Amsterdam/Philadelphia: John Benjamins.
Biber, D. (2001), ‘Dimensions of variation among 18th-century speech-based and written registers’, in: H. Diller and M. Görlach (eds.) Towards a History of English as a History of Genres. Heidelberg: Winter. 89-109.
Biber, D. (1993), ‘Representativeness in Corpus Design’. Literary and Linguistic Computing 8: 241-57.
Biber, D. (1988), Variation across speech and writing. Cambridge: CUP.
Biber, D. and J. Kurjian (2007), ‘Towards a taxonomy of web registers and text types: a multidimensional analysis’, in: M. Hundt, N. Nesselhauf and C. Biewer (eds.) Corpus Linguistics and the Web. Amsterdam/New York, NY: Rodopi. 109-31.
Biber, D., S. Conrad and R. Reppen (1998), Corpus linguistics. Investigating language structure and use. Cambridge: CUP.
Biber, D. and E. Finegan (1997), ‘Diachronic Relations among Speech-Based and Written Registers in English’, in: T. Nevalainen and L. Kahlas-Tarkka (eds.), To Explain the Present: Studies in the Changing English Language in Honour of Matti Rissanen. Helsinki: Société Néophilologique. 253-76.
Biber, D., E. Finegan and D. Atkinson (1994), ‘ARCHER and its challenges: Compiling and exploring a representative corpus of historical English registers’, in: U. Fries, G. Tottie and P. Schneider (eds.), Creating and Using English Language Corpora. Papers from the Fourteenth International Conference on English Language Research on Computerized Corpora, Zürich 1993. Amsterdam/Atlanta, GA: Rodopi. 1-13.
Birch, T. (1756 [1968]), The History of the Royal Society of London for Improving of Natural Knowledge, 4 vols. London: Millar [facsimile reprint Hildesheim: Olms].
Brown Corpus Manual http://khnt.hit.uib.no/icame/manuals/brown
Francis, W.N. (1979), ‘Problems of assembling and computerizing large corpora’, in: H. Bergenholtz and B. Schaeder (eds.), Empirische Textwissenschaft: Ausbau und Auswertung von Text-Corpora. Königstein: Scriptor. 110-23.
González-Álvarez, D. and J. Pérez-Guerra (1998), ‘Texting the written evidence: On register analysis in late Middle English and early Modern English’. Text 18 (3): 321-48.
Gotti, M. (2006), ‘Disseminating Early Modern Science: Specialized News Discourse in the Philosophical Transactions’, in: N. Brownlees (ed.), News Discourse in Early Modern Britain. Selected Papers from CHINED 2004. Bern: Peter Lang. 41-70.
Johansson, S. (in collaboration with G. Leech and H. Goodluck) (1978), Manual of information to accompany the Lancaster-Oslo/Bergen corpus of British English for use with digital computers. Oslo: University of Oslo, Department of English.
Kronick, D.A. (1962), A History of Scientific and Technical Periodicals. New York: The Scarecrow Press.
Kytö, M. (comp.) (1996), Manual to the diachronic part of the Helsinki Corpus of English Texts. 3rd ed. Helsinki: University of Helsinki, Department of English.
Kytö, M. and M. Rissanen (1993), ‘General Introduction’, in: M. Rissanen, M. Kytö and M. Palander-Collin (eds.), Early English in the Computer Age. Explorations through the Helsinki Corpus. Berlin/New York: Mouton de Gruyter. 1-17.
Labov, W. (1966), The Social Stratification of English in New York City. Washington, D.C.: Center for Applied Linguistics.
Lambert, J. (1985), Scientific and Technical Journals. London: Clive Bingley.
Leech, G. (2007), ‘New resources, or just better old ones? The Holy Grail of representativeness’, in: M. Hundt, N. Nesselhauf and C. Biewer (eds.), Corpus Linguistics and the Web. Amsterdam/New York, NY: Rodopi. 133-49.
Meyer, C. (2002), English Corpus Linguistics. An introduction. Cambridge: CUP.
Moessner, L. (2009), ‘The Influence of the Royal Society on 17th-Century Scientific Writing’. ICAME Journal 33.
Mukherjee, J. (2004), ‘The state of the art in corpus linguistics: three book-length perspectives’, English Language and Linguistics 8 (1): 103-19.
Rieger, B. (1979), ‘Repräsentativität: von der Unangemessenheit eines Begriffs zur Kennzeichnung eines Problems linguistischer Korpusbildung’, in: H. Bergenholtz and B. Schaeder (eds.), Empirische Textwissenschaft: Ausbau und Auswertung von Text-Corpora. Königstein: Scriptor. 52-70.
Rissanen, M. (1994), ‘The Helsinki Corpus of English Texts’, in: M. Kytö, M. Rissanen and S. Wright (eds.), Corpora Across the Centuries. Proceedings of the First International Colloquium on English Diachronic Corpora. St. Catharine’s College Cambridge, 25-27 March 1993. Amsterdam/Atlanta, GA: Rodopi. 73-79.
Sand, A. and R. Siemund (1992), ‘LOB - 30 years on ...’. ICAME Journal 16: 119-22.
Taavitsainen, I. (1993), ‘Genre/subgenre styles in Late Middle English?’, in: M. Rissanen, M. Kytö and M. Palander-Collin (eds.), Early English in the Computer Age: Explorations through the Helsinki Corpus. Berlin/New York: Mouton de Gruyter. 171-99.
The Royal Society of London. Philosophical Transactions, Volume 10 (1675) [Facsimile reprint 1963]. New York: Johnson Reprint Corporation and Kraus Reprint Corporation.
Trudgill, P. (1974), The Social Differentiation of English in Norwich. Cambridge: University Press.
Váradi, T. (2001), ‘The linguistic relevance of corpus linguistics’, in: P. Rayson, A. Wilson, T. McEnery, A. Hardie and S. Khoja (eds.), Proceedings of the Corpus Linguistics 2001 Conference. Lancaster University: UCREL Technical Papers 13. 587-93.
A multi-dimensional analysis of a learner corpus
Bertus van Rooy and Lize Terblanche
North-West University, Vaal Triangle Campus, South Africa
Abstract
The present study reports on a multi-dimensional analysis (Biber, 1988) of the Tswana Learner English (TLE) corpus, together with the Louvain Corpus of Native English Essays (LOCNESS). A new multidimensional model is extracted, since the similarities between nativeness and non-nativeness mask differences between linguistic features to such an extent that it is not possible to come to a complete understanding of such differences using the standard 1988 model. A basic five-factor model was extracted. Dimension 1 can be taken to capture advanced literacy, specifically as far as complex noun phrase structure is concerned, with the function of expressing information densely. Dimension 2 can be regarded as an indication of transparency and Dimension 3 captures a range of informal style features. The features that group together as Dimension 4 represent a style of writing that is more nuanced and precise and as a provisional label, we propose contextualisation of information. Dimension 5 can be regarded as the persuasive dimension in student writing, a feature that has been identified as a very important characteristic by Biber and Grabe (1987), and also in our own study of student writing. The most striking differences between the two corpora are on Dimensions 1 and 4. LOCNESS shows more advanced literacy than the TLE, and also contextualises information more extensively than the TLE. On the other dimensions, both corpora contain essays that display the various different styles available, showing that as a register, student writing allows for some internal stylistic variation independent of whether the writers are native or non-native speakers of English. The results confirm the usefulness of the multidimensional model, particularly to the extent that a new model is extracted. Substantial overlap between some of the dimensions in this study and dimensions in other models indicates that multidimensional models are sensitive to particular kinds of feature groupings, which should be taken as evidence in favour of the general validity of this kind of approach.
1. Introduction
Previous research on English second language writing usually focuses either on ‘errors’ and non-standard features that regard New Varieties of English as imperfect versions of standard metropolitan varieties, or examines specific individual linguistic features. Relatively few studies examine broader patterns of co-occurrence of features, with notable exceptions being Nkemleke’s (2006) study of expository writing in Cameroon English and Mesthrie’s (2006) work on non-deletions in Black South African English.
To remedy this, we propose to adopt a multidimensional approach (Biber 1988) and apply it to corpora of student writing. Biber (1988) originally developed the model in an attempt to characterise differences between various spoken and written registers of English on the basis of the distribution of 67 different linguistic features. He extracted the frequencies of all these features from each of the texts that made up his corpus, before submitting the values to a factor analysis. The purpose of such an analysis is to find patterns of co-occurrence of features and to group them in factors that are assumed to reflect meaningful underlying dimensions. Biber (2006: 181) argues that the 1988 model can be applied successfully to new discourse domains. However, he acknowledges that using the 1988 model may make it impossible to describe the dimensions that are most important in a particular domain of use. This means that linguistic features can occur in particular ways in different discourse domains and that these features reflect the specialized properties of the domains (Biber, 2006: 181). Biber (2006: 181) proposes a completely new MD analysis to identify the co-occurrence patterns in a corpus when analysing a new discourse domain with many different text categories. This approach is proposed when comparing various registers. However, Biber’s (2006) argument also suggests that for any new register, it is possible that the application of the 1988 model may overlook certain unique functions of particular linguistic features. Van Rooy & Terblanche (2006) compared native and non-native student writing on Dimension 1 of Biber’s 1988 model. The differences between the two student corpora were so slight that they all but disappeared once they were compared to other registers (Van Rooy & Terblanche 2006: 178).
In an application of the entire original MD model, Van Rooy (2008) finds once again that the construct student writing is a much stronger determinant of the emerging patterns in the data than any differences between native and non-native writing. Furthermore, his analysis shows that certain dimension scores present a misleading characterisation of the non-native data. Therefore, the motivation for a new MD model is that the similarities between L1 and L2 due to both corpora representing the register student writing mask the differences to such an extent that it is not possible to come to a complete understanding of the linguistic differences between native and non-native student writing. As opposed to comparing various registers, the present study looks at the differences between native and non-native writing, while keeping the variable register constant. In this study, a new factor analysis of the data is conducted to uncover dimensions that are able to distinguish more clearly between the two corpora. We attempt to present a comprehensive characterisation of differences and similarities between second language writing and native speaker writing, serving as benchmark for future microscopic investigations into more specific aspects of such differences and similarities. The methodology is presented in the next section, followed by the results and discussion, before conclusions are offered in the final section of the article.
2. Methodology
2.1 Corpora
In order to enable comparisons between the analyses performed using Biber’s original factor analyses (reported by Van Rooy 2008) and the new MD/MF model, the same corpora are used in both studies. A corpus of student writing produced by native speakers of Setswana from South Africa and Botswana, the Tswana Learner English Corpus (TLE), is analysed and compared to a corpus of native speaker writing, the Louvain Corpus of Native English Essays (LOCNESS). Both these corpora are from the International Corpus of Learner English project. In the case of the TLE, the present study analysed 383 essays with an average length of 392 words (S.D. 135). A total of 188 essays from LOCNESS were analysed, with an average length of 1075 words (S.D. 620). The essays were written on a wide range of topics. A list of suggested topics is provided by the ICLE guidelines1, which was adjusted for the South African context in the case of the TLE. The topics for LOCNESS were more diverse, with literary and philosophical topics included alongside those more similar to the TLE2. The TLE data contain essays that were written in class, but under relatively unconstrained contexts. Students had between one and two hours to plan and write. None complained of not having enough time. No reference tools were used, though. The LOCNESS essays include a proportion of exam essays that were written under strict time controls, as well as some untimed essays. While this results in a less homogeneous sample, it is the closest available as basis for comparison with the TLE. No similar corpus of student essays from native South Africans is available, and there is relatively limited contact between the students who wrote for the TLE and native speakers, thus there is very little chance of influence between the two segments of the population.
2.2 Feature extraction
All 67 features originally analysed by Biber (1988) were included in the present study. All spelling errors in the learner data were corrected before further analyses were undertaken. Subsequently, both corpora were part-of-speech tagged with the Support Vector Machine-tagger developed by Giménez and Màrquez (2004) using the Penn Treebank tagset. Due to the non-standard features frequently encountered in learner data, the majority of features were extracted using a combination of manual and automatic procedures. Features that could be extracted using lexical items only, such as the various types of adverbial subordinators, were extracted fully automatically using the TextMiner function in Statistica. In the case of all features that required part-of-speech tags, the data were inspected manually in WordSmith Tools, and only valid cases were accepted. Wherever possibilities for part-of-speech tag confusion existed, related part-of-speech tags were also extracted and inspected, in an
attempt to ensure not only high precision in the classification, but also to maximise recall (in the standard information retrieval senses of the terms, see Van Rijsbergen, 1979: 144-150). Frequencies were normalised to a relative frequency per 1000 words, with the exception of the type-token ratio and average word length. In the case of the type-token ratio, the 1000-word normalisation was not feasible, because most essays contained fewer than 1000 words. Consequently, the type-token ratio was normalised to 200 words for both corpora.
2.3 Statistical analysis
Biber (1988) used factor analysis to cluster the linguistic features together in factors, which were interpreted as functional dimensions underlying the statistical factors. He extended the analysis by computing factor scores on the basis of the dimensions. The linguistic features with absolute factor loadings higher than |0.35| were included in this calculation, with the additional proviso that each feature was included only once, in the factor where it had the highest absolute loading. When new factor analyses have been undertaken since Biber’s 1988 study, e.g. by Reynolds (2005) or Biber (2006), the data is standardised in its own terms, to a mean of 0 and standard deviation of 1 for each variable. Feature 62, split infinitives, was excluded from the present study on account of the almost complete absence of data in both corpora. On the basis of the standardised data, we extracted a new factor model, using Promax rotation as is customary in the MD approach. The solution with five factors was chosen as optimal; beyond the fifth factor, it became increasingly difficult to assign a meaningful functional interpretation to the data, and the number of variables that loaded onto a factor became very small. After the factors were identified, dimension scores on the new factors were calculated for both corpora. Possible differences in means between the two corpora were assessed using Cohen’s (1969) d-statistic, which is simply calculated by dividing the absolute difference in means by the larger of the two standard deviations of the two corpora. Cohen (1969: 18-25) provides guidelines for interpreting the effect size obtained in this manner. For the purposes of this paper, where the focus is on the most salient differences between the corpora, the focus in the interpretation of the data will be on large effect sizes (d>0.8). Cohen (1969: 25) indicates that these differences correspond to ‘grossly perceptible’ differences, such as the difference in height between 13- and 18-year-old girls, or IQ differences between PhD graduates and typical first year university students.
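The post-factoring steps described in this section (standardisation, dimension scores, effect size) can be illustrated with a short Python sketch. It is not the procedure actually run for this study, which relied on Statistica. Two assumptions are made and flagged in the comments: dimension scores are taken to be the sum of standardised frequencies of the salient features, with negatively loading features subtracted, as in Biber (1988); and the sample standard deviation is used throughout. All function names and figures are hypothetical.

```python
from statistics import mean, stdev


def standardise(values):
    """Rescale one feature to mean 0 and standard deviation 1 (sample SD assumed)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def dimension_scores(z_by_feature, loadings):
    """Per-text dimension scores for one factor.

    z_by_feature: {feature: [standardised frequency for text 0, text 1, ...]}
    loadings:     {feature: loading} for the salient features of this factor.
    Assumption: positive features are added, negative features subtracted.
    """
    n_texts = len(next(iter(z_by_feature.values())))
    scores = [0.0] * n_texts
    for feature, loading in loadings.items():
        sign = 1.0 if loading > 0 else -1.0
        for i, z in enumerate(z_by_feature[feature]):
            scores[i] += sign * z
    return scores


def effect_size(scores_a, scores_b):
    """d as defined in the text: |difference in means| / larger of the two SDs."""
    larger_sd = max(stdev(scores_a), stdev(scores_b))
    return abs(mean(scores_a) - mean(scores_b)) / larger_sd


# Hypothetical per-1000-word frequencies of two features in three texts.
z = {
    "feat_pos": standardise([10.0, 20.0, 30.0]),
    "feat_neg": standardise([9.0, 6.0, 3.0]),
}
print(dimension_scores(z, {"feat_pos": 0.61, "feat_neg": -0.53}))
print(effect_size([1.0, 2.0, 3.0], [3.0, 4.0, 5.0]))
```

Note that dividing by the larger of the two standard deviations, as described above, gives a more conservative value than the pooled-SD version of d that is often reported elsewhere.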
3. Results
3.1 The factor model
The basic five-factor model, with the linguistic features loading onto each factor, is presented in Table 1. Only variables with absolute values of 0.3 and higher are included, and variables are only included in the factor where they have the highest absolute value. These dimensions can be interpreted as follows. Dimension 1 can be taken to capture advanced literacy. It includes two of the three typical feature clusters Biber (2006: 186) associates with literacy in contrast with orality: complex structures in noun phrases and information density. Differences in grammatical complexity are a well-known finding from research on second language acquisition (Grant & Ginther 2000; Hinkel, 2002, 2005; Reynolds, 2005). However, most previous studies focus on individual features and cannot give an overview of second language writing as a whole. As second language speakers develop as writers, they increase their use of more complex grammatical structures, such as nominalizations, subordination and passives (Grant & Ginther, 2000: 140). Apart from the two very general features of informational density, type/token ratio and word length, several noun phrase specific structures can be identified: nominalisations (V14), gerunds (V15), total other nouns (V16), attributive adjectives (V40), and predicative adjectives (V41). However, unlike Biber’s (2006) finding that literacy contrasts in the first dimension of MD models with orality, we have no corresponding set of features encoding the orality dimension in our dimension 1. This is not surprising, since we compare two written corpora. In previous research, we showed that while there are some minor correspondences between the TLE and certain spoken registers, these are not substantial enough to show up in multidimensional models (Van Rooy & Terblanche 2006, Van Rooy 2008).
Thus, we propose to regard the data only in terms of the literacy dimension, but then postulate that high positive dimension scores will be indicative of advanced literacy, in contrast to lower literacy levels. This dimension overlaps substantially with the negative features of the first dimension of Biber’s (2006) model of university language, where he terms this collection of features literate discourse. It also includes all negative features of the 1988 model, where they are labelled informational production. Likewise, in Reppen’s (2001) study on language for and by children, a first dimension with positive features overlapping with our features was identified and labelled as edited information discourse, and found in the school textbooks written for but not by children.
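The inclusion rule used for Table 1 (absolute loading of 0.3 or higher, each variable listed only under the factor where its absolute loading is highest) can be sketched in a few lines of Python. The feature names and the full loading rows below are invented for illustration, since the study reports only the highest loading of each variable; the function name is likewise hypothetical.

```python
def assign_features(loadings, threshold=0.3):
    """Assign each feature to at most one factor.

    loadings: {feature: [loading on factor 1, loading on factor 2, ...]}
    Returns {factor number (1-based): [(feature, loading), ...]}, with the
    members of each factor sorted from highest to lowest signed loading.
    """
    pattern = {}
    for feature, row in loadings.items():
        # Index of the factor with the highest absolute loading for this feature.
        best = max(range(len(row)), key=lambda i: abs(row[i]))
        if abs(row[best]) >= threshold:
            pattern.setdefault(best + 1, []).append((feature, row[best]))
    for members in pattern.values():
        members.sort(key=lambda pair: pair[1], reverse=True)
    return pattern


# Hypothetical loadings of four features on three factors.
example = {
    "feat_a": [0.73, 0.10, -0.05],
    "feat_b": [0.31, 0.39, 0.33],   # loads on several factors, kept only once
    "feat_c": [0.12, -0.20, 0.08],  # never reaches the threshold, dropped
    "feat_d": [0.02, 0.11, -0.53],  # salient negative loading
}
print(assign_features(example))
```

Because a feature such as feat_b is kept only under its strongest factor, its weaker loadings elsewhere do not appear in the resulting pattern, which is the reason each variable occurs only once in a table built this way.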
Table 1: New factorial pattern

Dimension 1
V40_attr Adjectives        0.73
V16_noun all               0.67
V64_phras coord            0.64
V39_prep. Phrase           0.62
V44_word length            0.55
V14_nominalisation         0.51
V27_past part WHdel        0.45
V43_TTR                    0.45
V15_gerund                 0.42
V28_pres part relatives    0.40
V41_pred Adjectives        0.40

Dimension 2
V3_present tense           0.61
V8_3p pronoun              0.59
V31_WH rel subj            0.59
V35_causal subord          0.58
V19_be main verb           0.44
V18_passive by             0.40
V38_Adv subord             0.39
V52_mod possibility        0.39
V24_infinitive             0.38
V11_indef pronoun          0.37
V66_synt negation          0.34
V10_dem pronoun            0.32

Dimension 3
V59_contractions           0.56
V7_2p pronoun              0.54
V6_1p pronoun              0.53
V49_emphatic               0.45
V12_do pro-verb            0.40
V13_dir WH-question        0.37
V50_discourse part         0.30

Dimension 4
V1_past tense              0.61
V55_public verbs           0.52
V2_perfect                 0.44
V5_time adverbials         0.43
V23_wh clause              0.42
V46_down toner             0.35
V21_that verb-comp         0.32
V36_concessive             0.31
V33_pied piping            0.31
V20_exthere               -0.38
V4_place adverbials       -0.53

Dimension 5
V63_split auxiliaries      0.75
V53_mod necessity          0.70
V17_pass agentless         0.64
V54_mod prediction         0.61
V37_conditional            0.55
V67_anal negation          0.42
V57_suasive verbs          0.39
V42_adverbs                0.38
V61_stranded prep          0.30
Passage 1 illustrates the very high frequency of nouns, nominalisations and attributive adjectives that occur in the LOCNESS corpus. The frequency of nominalisations is much higher in the native speaker corpus than the TLE.
Overall, passage 1 is a text that uses grammatically complex linguistic features visibly more than passage 2 from the TLE corpus, and as a consequence, information is presented much more densely in passage 1 than in passage 2. While spelling mistakes were corrected before analysing the data, the sample passages below are all from the raw, unedited corpora.
(1) Word count: 90 words; nouns (33/100 words); nominalisations (4/100 words); attributive adjectives (9/100 words)
Alcoholism is a growing problem in the United States today that affects all ages. Too many students fight alcoholism in high school and college environment. This problem could easily be curtailed by lowering the drinking age from twenty-one to eighteen. Changing the drinking age from twenty-one to eighteen would lower the amount of crimes among young adults, encourage a more responsible approach to alcohol in the United States and improve the health of the nation. Allowing alcohol consumption at age would change the way America viewed alcohol-use as a society.
(2) Word count: 196 words; nouns (15/100 words); nominalisations (1/100 words); attributive adjectives (2/100 words)
Poverty is the cause caeus people in Africa are very poor to can surpport themselves and their families so some of those people in order for them to survave they go to the street and just sell their body so that they can get the money and buy food for their famillies. Because of poverty some of us can not go to school and study so that we can get a better jobs and make an hounest living, that is why some of us go out there and sell our selfs. And at the end one endup getting HIV/AIDS because of it - not only can we get HIV/AIDS by selling our bodies. Some of the people do not have places to stay and because it is cold outside and they dont have food they just go to some strangers and ask for some help, so a stranger will take an advantage of that poor person. On the other hand our government is giving out free condoms that are not even 100% safe so people just go for those condoms because they can not afford to by that ones that are bein sold at the camisty.
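The per-100-word rates given with passages 1 and 2 are relative frequencies of the same kind as the per-1000-word normalisation described in Section 2.2, only with a smaller base. The Python sketch below shows the arithmetic. The raw count of 30 nouns is hypothetical (it is merely a count that would yield roughly the rate reported for passage 1), and computing the type-token ratio over the first 200 tokens of a text is only one possible reading of "normalised to 200 words".

```python
def per_n_words(count, text_length, base=1000):
    """Relative frequency of a feature per `base` words of running text."""
    if text_length <= 0:
        raise ValueError("text_length must be positive")
    return count * base / text_length


def type_token_ratio(tokens, window=200):
    """Types divided by tokens within the first `window` tokens.

    Assumption: truncating to a fixed window is how the 200-word
    normalisation is implemented; texts shorter than the window are
    used in full.
    """
    sample = [token.lower() for token in tokens[:window]]
    if not sample:
        return 0.0
    return len(set(sample)) / len(sample)


# Hypothetical raw count: 30 nouns in a 90-word passage.
print(round(per_n_words(30, 90, base=100)))   # rate per 100 words
print(per_n_words(30, 90))                    # the same count per 1000 words
print(type_token_ratio("the cat saw the dog".split()))
```

A fixed window matters for the type-token ratio because the ratio falls as texts get longer, so essays of very different lengths, such as those in the TLE and LOCNESS, cannot be compared on the raw ratio.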
Bertus van Rooy and Lize Terblanche
Dimension 2 can be regarded as an indication of transparency. It overlaps with six of the features on the positive side of Dimension 1 in the Biber (1988) model, labelled involvement. The features that occur in both models are present tense verbs, causative subordination, BE as main verb, adverbial subordinators (which have a higher loading on factor 5), possibility modals, indefinite pronouns and demonstrative pronouns. It likewise shows a degree of overlap with the positive features of the first dimension, oral discourse, in Biber (2006). This can be seen through the use of present tense verbs which describe actions in the immediate context of interaction (Biber 1988: 105). Overt cohesive devices such as causal and other subordinators are used. Various pronominal forms such as third person, indefinite and demonstrative pronouns occur as grammatical means to achieve reference cohesion. Biber’s (1988) model contains a large number of features loading on Dimension 1. The features with a negative loading are associated with high informational density in a text. However, the interpretation of the positive features is more complex. Biber (1988: 105-107) describes these features as representing an interactive focus on the one hand and the effects of real-time planning constraints on the other hand. Our model helps to shed some light on the complex set of positive Dimension 1 features in the Biber (1988) model. Dimension 2 in the present study overlaps with the part of the original dimension that selects fairly plain wording and grammatical structures, and is much more verbal than nominal in focus. Another subset of features from Biber’s Dimension 1 overlaps with our Dimension 3, which can be interpreted in terms of a different style choice.
Our Dimension 2 features tend to overlap more with features that show evidence of real-time constraints, resulting in generalised lexical choice and sequentially structured, non-integrated information, combined with very explicit marking of particular cohesive relations. Timed student essays may well have this effect on occasion, where students start writing before planning adequately, and therefore present fragmented (rather than integrated and dense) information. Passage 3 is an excerpt from the TLE corpus that contains a high frequency of third person pronouns, as well as causal and adverbial subordinators. These features are all typical examples of a plain and direct style. The fourth passage shows that Dimension 2 reflects a choice of style rather than a limited access to grammatical features, since it shows an example where another TLE student avoids using these features: (3)
Word count: 138 adverbial subordinators (1.4/100 words) causal subordinators (1.4/100 words) third person pronouns (4.3/100 words)
One can describe being poor as having no many. Most of people in Africa have no money to survive so they find different ways of find money example of prostitution. There are cases whereby ladies trade
A multi-dimensional analysis of a learner corpus
sex for money in order to have money with different number of people. They usually practice unsafe sex because their costumer cannot pay for a protected sex so there is high risk of getting HIV/Aids through this practice though they get money. Since Africa is not devoloped there is poor health. Most of poor people end up eating unheath food because they do not have money to boy heath food. Africa does not have good and many heath faciliticies which its people can get medicines cheap to cures sexual transmitted diseases, tuberculosis and other HIV/aids related diseases before they develop to Aids. (4)
Word count: 150 words adverbial subordinators 0 causal subordinators 0 third person pronouns (2.7/100 words)
Many countries of Africa are poor and this means that the population also is poor. Most of these countries are over populated, this means that not everyone in the country will be able to get a job even if they are educated and these people who are not working are the ones who are involved in some activities like being prostitudes in order to make a living. Unprotected sex can be dangerous as it spreads an uncurable disease called HIV/AIDS. When more people get infected, the country have to buy or import expensive medicine from other continents in order to cure people. People should be given condoms to reduce the spread of AIDS. People should also be advised by the social workers and nurses on dangers of engaging themselves in unprotected sex. Students should also be taught about the dangers of involving themselves on sex whilst they are still young.

Dimension 3 seems relatively straightforward to interpret; it captures a range of very typical informal style features and overlaps fully with a subset of the positive features in Biber’s Dimension 1 in the 1988 model. The overlapping features are: contractions, second person pronouns, first person pronouns, emphatics, do as pro-verb, direct WH-questions and discourse particles. This means that all of the features that load on our Dimension 3 were originally on Biber’s Dimension 1. High loadings on these factors reflect an informal writing style, typical of texts with a high degree of involvement. Likewise, in Reppen’s (2001) study of children’s language, a third dimension, labelled involved personal discourse, was identified, which overlaps to an extent with our Dimension 3. As noted earlier, the split between Dimensions 2 and 3 draws apart two different aspects of the positive features on the first dimension identified by Biber
(1988). Our Dimension 2 reflects a style choice between presenting information in a planned manner and presenting it in a more fragmented manner, under real-time planning constraints. Dimension 3, on the other hand, is a purer type of style dimension, where greater involvement of the writer and more informal style choices correspond with a high dimension score. Passage 5 is an example from LOCNESS where the frequent use of first person pronouns, contractions and emphatics reflects an informal writing style. This supports the view that Dimensions 2 and 3 reflect a choice of style, since some native speakers evidently choose to write in a more informal manner. Passage 6 is from the TLE and contains only one first person pronoun and none of the other features: (5)
Word count: 157 first person pronouns (8/100 words) contractions (4/100 words) emphatics (2/100 words)
Upon entering college I didn’t know I would still have a curfew. Nor did I know I would be treated as if I were age thirteen. I thought if I had a male guest, friend, brother, or cousin, they could spend the night. I guess if I were a resident of one of the “special” dorms A could, co-ed. if some dorms can have overnight visitation all of them should. Just because a dorm is co-ed doesn’t mean overnight visitation is allowed. They still have a 2 a.m. curfew. A friend of mine that’s a Bates House resident just has another resident of the opposite sex to sign her mate guest in and he spends the night with her. She’s not the only resident doing it. Students in universities and colleges should not have to sneak around just to spend quality time with someone. We’re not at home we don’t have certain luxuries anymore like a car. (6)
Word count: 236 Bold=first person pronouns (0.4/100 words) Italics=contractions (0) Underlined=emphatics (0)
In South-Africa, North-west is one of the best tourists attraction. The proble is that the industry is still growing. It is not like in other country like United State of America were the tourism industry there is very big. They can even see the cannon that was used by the Barolong to defeat the British. The Taung skull heretage sites is also very attractive to the tourist because it is known all over the world. That place is very known because of the skull that was found in a cave at Taung. That
skull scientics they were disagree that it was not a human skull but they end up agree that maybe it was an ape skull. Those animal they are more or less the same as human being. The were working straight like human but their back was to a beat carve. The tourists can go to that place see and the community can benefit. By selling food and some of African pottery and dressing. The built environment can also attract tourist. Places like museum. let us take Mafikang Meseum as example. The Museum is a place were thing that have been used in the past were stored. The tourists can found information of the history of the place and anything that was from the past. The history of the war of Boroling and the British, the warren fought, war between the british and Barolong.

The features that group together as Dimension 4 constitute a slightly less coherent set. A number of them deal with marked forms within the verb phrase, but at least downtoners, concessive adverbials and pied piping constructions do not fall in this category. It is also the only dimension with negative features, the existential there and place adverbials. The positive features overlap in part with the positive features of the third dimension in Biber’s (2006) model, where he regards the features as indicative of a reconstructed account of events. The overlapping features are that verb complements and past tense verbs. There is also some overlap with the positive features of Dimension 2 in Reppen (2001), which she terms lexically elaborate narrative. One way of analysing these features is to see them as representing a style of writing that is more nuanced and precise. Events are properly situated in time through the use of the past tense, perfect aspect and/or time adverbials, suitably hedged by means of downtoners and concessive adverbials, and attributed to appropriate sources of origin through public verbs with that-clause complements or WH-clause complements.
For example, the concessive adverbial subordinators are used to introduce background information or for discourse framing (Biber 1988: 236). On the negative side, the use of the existential there and place adverbials serves to highlight and particularise information, without necessarily presenting it in a more subtle manner. As a provisional label, we propose contextualisation of information for Dimension 4. The use of past tense verbs and the perfect aspect gives a reconstructed account of events in passage 7. The use of concessive adverbial subordinators and public verbs reflects a precise and nuanced text which contains subtle contextualisation, for example through the use of public verbs to specify the acknowledged sources in the text. This contextualisation of information is absent in passage 8 from the TLE, which is emphatic and forceful: (7)
Word count: 182 Concessive adverbial subordinators (1.6/100 words) Public verbs (1.6/100 words) Past tense and perfect aspect (4.9/100 words)
His optimism is however renewed on his arrival in South America. The naïve Candide remarks on how the sea and climate are much better here than in Europe and so decides that this must certainly be ‘le meilleur des mondes possibles’. When Candide remarks that although Pangloss said everything was for the best, he noticed that things always went badly in Westphalia. But this is not a complete rejection of the philosophy of optimism. It is not until his meeting with Cacambo that Candide realises how naïve Pangloss’s views were, and also how restricted they were. He decides that the views of a person can be changed by travel such as has happened to him. At the end of the ‘comte’ we see Candide and Pangloss much more resigned to their fate. Although the thing which Candide has been pursuing all through the novel, that is Cunégonde does not quite turn out as he expected. Although this work would appear to be light hearted, it does contain a very real condemnation of the attitudes of society and the naïve philosophy of optimism. (8)
Word count: 182 Concessive adverbial subordinators 0 Public verbs 0 Past tense and perfect aspect 0
My friend I would so much wish to advise you to open a saving account at ABSA bank, because at ABSA banker are provided with first preverance. The staff of ABSA a well training in serving bankers. They are aware that bankers are the people who brings in money in their bank otherwise the bank would be closed. The warmth, love and the way they welcome you is realy impressive. You will wish to have all your banking with them. Their service is really excellent. You feel so welcome to ask as much questions as you wish. Even if you want to see the manager you are allowed to. The ABSA bank is truelly secured. At the door there is a securityguard always. There are too many people coming to bank and some enquiring about savings, fixed deposit, loans and withdrawings. The que is running so fast that you don’t spend too much time in the bank, and the ABSA bank has many branches. Even in one two town you get too many banks. Their interest are higher than any other banks.

The fifth dimension in our model overlaps largely with the fourth dimension, overt expression of persuasion, in the Biber (1988) model. The five linguistic features that occur in both models are prediction modals, suasive verbs,
conditional adverbial subordinators, necessity modals and split auxiliaries. The only feature that occurs in Biber’s (1988) model, but not in this study, is infinitives, which load on our Dimension 2. Dimension 5 in the present model goes even further by incorporating other features that can simply be regarded as the persuasive dimension in student writing, a feature that has been identified as a very important characteristic by Grabe and Biber (1987). Of course, the topics given to the students invited argumentative writing, so the persuasiveness is not unexpected. What is unexpected in student writing, however, is the extent to which such features are used, outscoring political speeches and newspaper editorials in terms of persuasiveness. The final passage is an example of persuasive writing from the TLE corpus, a style which is typical of student writing in general. The most obvious marker for this dimension is suasive verbs, but necessity and predictive modals, various adverbs, as well as the conditional adverbial subordinator if all reflect a persuasive text: (9)
Word count: 353 words suasive verbs (1.1/100 words) necessity and predictive modals (3.7/100 words) conditional adverbial subordinators (1.4/100 words) adverbs (3.4/100 words)
I fully agree with the topic that poverty is the cause of the HIV/AIDS epidemic in Africa. I think that if it was not for poverty or if everybody was rich in Africa then this HIV/AIDS epidemic would not be spreading so rapidly in our beloved country. Today you will find young people leaving their homes saying that they are going to look for jobs only to find out that there aren’t jobs out there. They end up on the streets and the only way to survive will be to get boyfriends so that you may get some sort of income. You are definitely going to go for the cash thinking that it is only for that time it will pass and at least you’ve got money to buy food and clothes to get going. Everywhere you will hear people say nasty things about prostitutes. The honest fact is those people did not ask to be what they are now. If everyone was rich, the world would be a better place to live on. Everyone will be concentrating on his or her belongings. No one will be short of anything that will make her or him to end up in the street. Now rich people know that they can go out there hunting for those in need and asking for the impossible from them. Now because the HIV/AIDS epidemic does not have its own people or only specific type of people you would not tell if one has it or not. What I think can be done is our government can give us free education and not ask for the so called experience so that we can all get jobs and be able to maintain our families and that way will be
fighting a lot of things. Shooting two birds with one stone is a great thing to do. If our adults can afford then this prostitution and being charmed by people who can afford will come to an end. If we started that way then the HIV/AIDS epidemic would also be stopped. People will now not have a reason for being prostitutes.
3.2 Dimension scores
Table 2: Mean dimension scores for the two corpora, together with standard deviations and Cohen’s d-value. Large effect sizes are indicated by an asterisk.

Dim  Label              Mean LOC  Std.Dev. LOC  Mean TLE  Std.Dev. TLE  d-value
1    Advanced literacy   7.44     4.42          -3.65     5.82           1.91*
2    Transparency        1.48     3.17          -0.73     6.80           0.33
3    Informal style     -0.04     3.62           0.02     3.53           0.02
4    Contextualisation   5.00     3.98          -2.45     3.40           1.87*
5    Persuasion          2.24     3.93          -1.10     5.21          -0.64
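The d-values in Table 2 can be checked with a short calculation. The sketch below is illustrative only and rests on an assumption not stated in this section: that d is the absolute difference between the two means divided by the larger of the two standard deviations. Under that assumption the magnitudes reported in Table 2 are reproduced to within rounding (the sign shown for Dimension 5 is not). The 0.8 cut-off for a large effect is Cohen's conventional threshold; the names `TABLE_2`, `d_value` and `effect_sizes` are hypothetical.

```python
# Means and standard deviations copied from Table 2 (loc = LOCNESS, tle = TLE).
TABLE_2 = {
    "Advanced literacy": {"loc": (7.44, 4.42), "tle": (-3.65, 5.82)},
    "Transparency": {"loc": (1.48, 3.17), "tle": (-0.73, 6.80)},
    "Informal style": {"loc": (-0.04, 3.62), "tle": (0.02, 3.53)},
    "Contextualisation": {"loc": (5.00, 3.98), "tle": (-2.45, 3.40)},
    "Persuasion": {"loc": (2.24, 3.93), "tle": (-1.10, 5.21)},
}

LARGE_EFFECT = 0.8  # Cohen's conventional threshold for a large effect


def d_value(mean_a, sd_a, mean_b, sd_b):
    """Assumed formula: absolute mean difference over the larger standard deviation."""
    return abs(mean_a - mean_b) / max(sd_a, sd_b)


def effect_sizes(table):
    """Return {dimension label: d} for every row of the table."""
    return {
        name: d_value(*row["loc"], *row["tle"])
        for name, row in table.items()
    }


if __name__ == "__main__":
    for name, d in effect_sizes(TABLE_2).items():
        flag = "*" if d >= LARGE_EFFECT else ""
        print(f"{name:<18} d = {d:.2f}{flag}")
```

Dividing by the larger standard deviation gives a conservative estimate of the effect size, which may be why only Dimensions 1 and 4 are marked as large.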
The dimension scores for the five new dimensions are reported in Table 2, alongside their standard deviations and a d-value, which evaluates the difference in means between the two corpora. The comparison makes it clear that there are major differences in the dimensions that incorporate grammatical resources that play a role in information transfer (Dimensions 1 and 4). By contrast, for the purer style dimensions (2, 3 and 5), the results are much closer together. It seems as if the style dimensions, particularly 2 and 3, can be interpreted in terms of choices between more transparent, and informal, or more longwinded, and formal. Both corpora contain essays that have higher positive and higher negative scores on these dimensions, as is clear from the relatively high values for the standard deviations, given the mean values. As far as persuasiveness is concerned, LOCNESS makes more frequent use of the relevant linguistic resources than the TLE. However, compared to other registers analysed by Biber (1988), even the TLE makes substantial use of the resources of persuasion. If dimension scores were calculated in terms of the original Biber dimension, using Biber’s standardisation algorithm, a positive score of 1.4 would be obtained for the TLE. While lower than the 4.5 of LOCNESS, this is higher than the vast majority of registers examined. The situation is very different for Dimensions 1 and 4 in our model. On these two dimensions, the TLE has strong negative scores, and LOCNESS has strong positive scores. It should be clear that the grammatical resources for information packaging and for conveying subtle senses about the information are not as readily available to the TLE writers. The overall effect is perhaps best illustrated by a comparison of extracts 7 and 8. In extract 7, more subtle
argumentation is presented about the information, contextualised in its historical context, with concession to other views. In extract 8, a very forceful argument is presented in the present tense, drawing on the general truth sense of the tense, with little concession to other views. While both passages have many nouns, the density is higher in passage 7, as is the density of adjectives, particularly attributive ones. This means that the expressive flexibility of the TLE writers is constrained by the availability of the relevant grammatical features.

4. Conclusions
Firstly, by extracting a new multidimensional model, it has been possible to detect that there are more grammatical differences than differences relating to a particular writing style. The dimensions that highlight grammatical complexity are Dimensions 1 and 4 on our model. These dimensions illustrate that the TLE writers do not enjoy the same access to linguistic features associated with the kind of grammatical complexity that allows for integrated, yet subtle presentation of information, as opposed to the native speaker writers who regularly incorporate these features into their writing. The scores on Dimensions 2, 3 and 5 are much closer together and do not distinguish between the TLE and LOCNESS to the same extent. This signifies that these dimensions reflect a certain style of writing rather than grammatical complexity. However, LOCNESS uses more of the features that are associated with persuasiveness. Both native and non-native speakers decide to use the linguistic features associated with style to a greater or a lesser degree. Thus, some TLE students write in a direct and plain style, while others write in more elaborate or ornamental ways. Likewise, some LOCNESS students make use of the direct and plain style, but others do not.

Secondly, the results validate the decision to extract a new multidimensional model, since it was possible to gain deeper insights into student writing. These insights would have been impossible if the study had focused on isolated linguistic features, because the conspiracy between different features to achieve functional effects would have been lost. Likewise, extracting a new feature model rather than using the dimensions of the original Biber (1988) model enabled us to separate style dimensions from grammar and information presentation dimensions in a way that the original model did not allow.

A final conclusion is that general dimension patterns exist, which emerge across different multidimensional models. There are three basic patterns that can be isolated: firstly, a dimension with a dense informational/nominal structure, secondly a dimension with a strong oral and informal style and lastly a dimension that reflects the intensely persuasive nature of student writing. The L2 data in the present study differ much more from Standard English than any data used in previous projects. Therefore, the finding of similarities across very different multidimensional studies is strong support for a claim that certain dimensions are invariantly present across registers and varieties of English.
Notes
1. http://cecl.fltr.ucl.ac.be/Cecl-Projects/Icle/icle.htm#heading5
2. http://www.fltr.ucl.ac.be/fltr/germ/etan/cecl/Cecl-Projects/Icle/locness1.htm

References
Biber, D. (1988), Variation across speech and writing. Cambridge: Cambridge University Press.
Biber, D. (2006), University Language: A corpus-based study of spoken and written registers. Amsterdam/Philadelphia: Benjamins.
Cohen, J. (1969), Statistical Power Analysis for the Behavioral Sciences. New York/London: Academic Press.
Conrad, S. & Biber, D. (eds.) (2001), Variation in English: Multi-Dimensional Studies. Harlow: Longman.
Giménez, J. & Márquez, L. (2004), ‘SVMTool: A general POS tagger generator based on Support Vector Machines’, Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC ‘04). Lisbon, Portugal.
Grabe, W. & Biber, D. (1987), ‘Freshman student writing and the contrastive rhetoric hypothesis’, Paper presented at SLRF7, University of Southern California.
Grabe, W. & Kaplan, R.B. (1996), Theory and practice of writing: an applied linguistic perspective. London/New York: Longman.
Grant, L. & Ginther, A. (2002), ‘Using computer-tagged linguistic features to describe L2 writing differences’, Journal of Second Language Writing, 2: 123-145.
Hinkel, E. (2002), Second language writers’ text: Linguistic and rhetorical features. Mahwah: Lawrence Erlbaum Associates.
Hinkel, E. (2005), ‘Analyses of second language text and what can be learnt from them’, in: E. Hinkel (ed.) Handbook of Research in Second Language Teaching and Learning. Mahwah, N.J.: Lawrence Erlbaum. 615-628.
Mesthrie, R. (2006), ‘Anti-deletions in an L2 grammar: A study of Black South African English mesolect’, English World-Wide, 27: 111-145.
Nkemleke, D.A. (2006), ‘Some characteristics of expository writing in Cameroon English’, English World-Wide, 27: 25-44.
Reppen, R. (2001), ‘Register variation in student and adult speech and writing’, in S. Conrad & D. Biber (eds.) Variation in English: Multi-Dimensional Studies. London: Longman. 187-199.
Reynolds, D.W. (2005), ‘Linguistic correlates of second language literacy development: evidence from middle-grade learner essays’, Journal of Second Language Writing, 14: 19-45.
Van Rijsbergen, C.J. (1979), Information retrieval. London: Butterworths.
Van Rooy, B. (2008), ‘A multidimensional analysis of student writing in Black South African English’, English World-Wide, 29 (3): 268-305.
Van Rooy, B. & Terblanche, L. (2006), ‘A corpus-based analysis of involved aspects of student writing’, Language Matters, 37 (2): 160-182.
Weaving web data into a diachronic corpus patchwork
Andrew Kehoe and Matt Gee
Research & Development Unit for English Studies, Birmingham City University

Abstract

This paper offers a reassessment of the role of web data in diachronic linguistic analysis. We introduce the diachronic search facilities provided by the WebCorp Linguist’s Search Engine, including the use of a new ‘heat map’ graph for the analysis of changes in collocational patterns over time. We illustrate how web data can be used to supplement data from standard corpora in lexicological studies. Our focus is on the vogue phrase credit crunch and the paper compares examples from standard corpora (BNC, Brown, LOB, Frown, FLOB) with those found in web-accessible newspaper texts. Contrary to previous studies, we do not rely on the web solely for the most up-to-date usage examples. Instead, we show how web-accessible texts dating back to the beginning of the 20th Century can be used to fill gaps in and sharpen the picture provided by standard corpora.

1. Introduction
The original WebCorp project (Kehoe & Renouf 2002, Renouf 2003) was an experiment to see whether we could develop a system to extract linguistic data from web text efficiently and present this to the linguist in as usable a fashion as it is presented in traditional corpora. The system (http://www.webcorp.org.uk) receives a word or phrase and other requirements from the user, passes these to a commercial search engine (Google, AltaVista, etc.), and extracts the ‘hit’ pages from the search engine results. Each page is accessed and processed and the extracted concordances are presented to the user in a choice of formats. The WebCorp tool established that web text, though problematic, is nevertheless a resource that can complement corpus evidence with examples of usage that is rare, re-emergent, new or productive. The WebCorp Linguist’s Search Engine (WebCorpLSE) is designed to bypass the commercial search engines upon which WebCorp relied as gatekeepers to the web.1 WebCorpLSE is crawling and processing the web to build a 10 billion word (or 7 terabyte) text corpus, including a multi-terabyte ‘mini-web’, designed to act as a microcosm of the web itself (Kehoe & Gee 2007). In addition to the mini-web, WebCorpLSE has built a newspaper subcorpus, containing daily issues of UK broadsheets from 1984-present and recent issues of other UK and international newspapers. We have also worked with our university colleagues to build collections to assist in their research and teaching, including sub-corpora of blogs, science fiction and major English literary works. All collections are searchable via linguistically-tailored front-ends.
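The concordancing step described above can be illustrated with a minimal keyword-in-context (KWIC) sketch. This is not the WebCorp implementation: the search-engine query and page download are replaced here by an in-memory dictionary of page texts, and the names `kwic`, `concordance` and `PAGES` are hypothetical.

```python
import re


def kwic(text, phrase, width=30):
    """Return (left context, match, right context) for each occurrence of phrase."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    hits = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits


def concordance(pages, phrase, width=30):
    """Run kwic over every page and label each hit with its page identifier."""
    lines = []
    for page_id, text in pages.items():
        for left, match, right in kwic(text, phrase, width):
            lines.append((page_id, left, match, right))
    return lines


# Stand-in for downloaded 'hit' pages (assumption: already stripped of markup).
PAGES = {
    "page-1": "Fears of a \"credit crunch\" have prompted policy changes "
              "at the Federal Reserve in recent months.",
    "page-2": "That would cause a severe credit crunch.",
}

if __name__ == "__main__":
    for page_id, left, match, right in concordance(PAGES, "credit crunch"):
        print(f"{page_id}: {left:>30} [{match}] {right}")
```

Keeping the left and right contexts as separate fields makes it straightforward to present the same hits in different formats, for instance aligned on the search term or sorted by the words to its right.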
It is now generally accepted that web data are of value in supplementing evidence from traditional, or ‘standard’, corpora when examining linguistic change over time. Previous work has tended to turn to the web as a source of evidence of the very latest trends in language use and of new coinages not found in standard corpora. Mair, for example, in a study of change and variation in present day English, states that the best way to ‘minimise the risk’ of relying on the web as a corpus is to use it not as a stand-alone source of data, but in conjunction with tried and tested closed corpora. In diachronic work, such corpora are positively indispensable because they add the necessary element of time depth to the web. (Mair 2007: 236) The approach described by Mair is, in part, necessitated by the bias in commercial search engines, like Google, toward the most recently updated pages and the difficulty in extracting older data from the web through these search engines (cf. Kehoe 2006). In this paper, we shall describe the corpus search tools available in WebCorpLSE and the new possibilities which these open up for diachronic linguistic study. We shall illustrate that carefully selected web data can, in fact, provide the necessary ‘time depth’ by overlapping with and filling gaps in the data provided by standard corpora. The web data can, thus, sharpen the diachronic picture presented by standard corpora rather than simply widening it at the most recent end of the timeline. Our analysis will focus on a phrase which, given media preoccupations at the time of writing, may initially seem to be a perfect example of the kind of vogue construction for which linguists have, thus far, turned to the web for evidence: credit crunch. The phrase does not appear in the Oxford English Dictionary (OED) and was named as the Oxford University Press ‘Word’ of the Year for 20082, an honour frequently bestowed on new coinages. 
One may therefore assume from this that credit crunch will not be found in standard corpora but, as we will show in the next section, there are examples of the phrase in corpora. In fact, the corpus-based Cambridge Advanced Learner’s Dictionary (CALD) includes credit crunch in its entry for another phrase, credit squeeze: credit squeeze UK noun [C] (US credit crunch) INFORMAL a period of economic difficulty when it is difficult to borrow money from banks (http://dictionary.cambridge.org/define.asp?key=18170&dict=CALD) This dictionary (latest printed version 2008) is based on the Cambridge International Corpus: just over 240 million words of UK and US writing and speech, with an emphasis on business, legal, academic and financial English. It is likely that the supposed distinction between US credit crunch and UK credit squeeze was drawn from the last of these, a ‘collection of books, journals, newspaper articles relating to economics and finance’3. This distinction between UK and US usage is one which will be investigated in this paper. We shall use
standard corpora of British and American English and data from the web to examine usage patterns over time and determine to what extent the US/UK distinction in the CALD entry holds true. Initially, our analysis will focus on credit crunch and we will return to credit squeeze in section 4.

2. Evidence from standard corpora
The phrase credit crunch appears seventeen times in the British National Corpus (BNC) in texts from 1991-3, shown in Figure 1.
The Economist, 1991
1 The Federal Reserve is struggling to allay fears of a “credit crunch” – when banks are reluctant to lend except to the most creditworthy borrowers. (ABD 81)
2 Fears of a “credit crunch” have prompted policy changes at the Federal Reserve in recent months. (ABD 2335)
3 A credit crunch is the name economists give to a sudden reluctance among banks to lend money. (ABD 2339)
4 Typically, a credit crunch happens when banks start to worry about the creditworthiness of their borrowers. (ABD 2341)
5 A credit crunch – mild, as yet – is undoubtedly under way in America. (ABD 2347)
6 There is a risk, though, that the supply of credit will start to fall faster than the demand; in other words, that a credit crunch will start to drive the process of credit contraction. (ABD 2361)
7 The Bank of England, responding to fears of a credit crunch, has asked banks to think twice before turning away would-be corporate borrowers. (ABD 2367)
8 Such frightening costs undermine the credibility of the FDIC, because, if a banking crisis were to start, the government might find itself facing a credit crunch of its own. (ABD 2381)
9 This demand on the international capital markets raises interest rates, aggravating the problems of debt and credit crunch. (ABD 2386)
10 The Federal Reserve’s Alan Greenspan said the Fed would do what it could to ease America’s credit crunch. (ABG 3211)
11 Yet they, too, complain of aches and pains, of being squeezed by a “credit crunch” under which borrowing has become harder even while interest rates have been falling. (ABJ 3178)
12 There is no generalised credit crunch in Japan, but particular firms are being hurt. (ABJ 3982)
13 That suggests that a credit crunch is taking place, especially since banks are still under orders from the central bank not to increase lending to property companies beyond the overall rate of loan growth. (ABK 2395)
Daily Telegraph, electronic edition of 15/04/1992 (AKJ 453)
14 That would cause a severe credit crunch.
Unigram X, 1993 (CTG 399)
15 Debt-laden Tustin, California-based business systems supplier MAI Systems Corp appears to have hit a credit crunch according to the German weekly Computerwoche.
Keesings Contemporary Archives, Longman, 1991 (HLC 632)
16 It also took measures to ease the so-called “credit crunch “, mainly by relaxing regulatory pressures in order to encourage bank lending.
The Scotsman: Business section, unknown date (K59 3187)
17 Writing in the February issue of the Lloyds Bank Economic Bulletin, he says: “The restoration of financial balance will mean that, far from there being a credit crunch, banks are likely to continue to find very little net demand for loans from companies.”
Figure 1: All examples of credit crunch from the BNC4
It is clear from the BNC concordances that the phrase was current in British English in 1991, and also that a credit crunch was underway in the United States at that time and was in danger of occurring in the United Kingdom. However, the fact that the phrase occurs in double quotes, complete with a full gloss, in three articles from The Economist (whose readers may be expected to be more familiar with economic terms than readers of general audience newspapers) indicates that the phrase was still new and unfamiliar to the majority of UK readers. The BNC data seem to indicate that the phrase credit crunch, like the economic situation it describes, first occurred in the US, thus confirming the CALD definition.
This opens up the possibility of turning to another set of standard corpora: the Brown family, ‘corpora equivalently sampled from the language, though different in temporal as well as geographical provenance – as a means of identifying rather precisely how the use of the language developed over a period’ (Leech and Smith, this volume). The 1961 Brown and LOB corpora, with 1 million words each of written American English (AmE) and British English (BrE) respectively, contain crunch only in a literal sense. FLOB, the 1 million word BrE corpus from 1991, does not contain any instances of crunch (though it does include literal crunching and crunchiness). However, Frown, the 1 million word AmE corpus from 1992, includes three instances of crunch, all of which are used in a metaphorical sense to refer to financial situations, including one occurrence of credit crunch (Figure 2).
1. Hallinan introduced the legislation following an Examiner story that revealed that some city bureaucrats were commuting in style at taxpayer expense despite a severe budget crunch that has required reduction of some vital health services. (A25 15-18)
2. For all of Mr. Kornbluth’s cultural observations, the book is not yet written that closely tracks [US financier Michael] Milken’s persecution with the credit crunch and recession. (C12 89-91)
3. They lend legitimacy to the racist and misogynist stereotypes so popular with conservative politicians and disgruntled taxpayers who feel an economic crunch and are looking for someone to blame. (G23 160-163)
Figure 2: All examples of crunch from Frown corpus5
The limitations of the Brown family for lexical rather than grammatical studies, as pointed out by Leech and Smith (this volume), are clear from these results. A study of crunch based solely on Frown would have little choice but to conclude that the word is used only to refer to negative financial situations (the semantic prosody of severe, disgruntled and blame is clear). It is worth noting, however, that the authors of these AmE texts from Frown, unlike the authors of BrE texts from the same period in the BNC, do not feel it necessary to provide a gloss for credit crunch and use crunch in a wider sense to refer to a variety of financial situations. Indeed, the example of credit crunch in this corpus is mentioned in passing in an article which focuses on a different topic; it is given rather than new information.
3. Turning to the web
Nesselhauf (2007) makes the distinction between two types of web-based diachronic linguistic analysis. The first is the approach taken by us with the original WebCorp system: the analysis of short term changes in texts produced specifically for the web (Kehoe 2006). The second is the analysis of changes in ‘larger and/or earlier time-spans based on texts written for other media and later made available on the internet’ (Nesselhauf 2007: 287). WebCorpLSE moves us toward this second approach to web-based diachronic analysis. As outlined in section 1, the system provides access, via the web, to a variety of sub-corpora, many of which were compiled from web-accessible text collections such as Project Gutenberg.
In this paper we focus on the WebCorpLSE newspaper sub-corpus. With regard to this text-type, Nesselhauf’s distinction between the two kinds of diachronic analysis becomes somewhat blurred in that modern newspaper articles are not produced ‘specifically for the web’ but nor are they made available on the web only at a later date. For the past decade, printed newspaper texts have been made available simultaneously on the web. We shall return to this point in section 3.2, which provides details of our newspaper sub-corpus and the kinds of diachronic search possible using WebCorpLSE. Before looking at this, we outline in 3.1 the restricted (though useful) provision for diachronic linguistic analysis in the web-based Google newspaper archive. Throughout our analyses in sections 3.1 and 3.2, we will attempt to confirm the accuracy of the CALD definition of credit crunch by examining:
i) what web data can tell us about credit crunch in AmE, including first occurrence
ii) what web data can tell us about the introduction of credit crunch into BrE
3.1 Google News
Google News (http://news.google.com) is a ‘news aggregator’: a website that collates, from multiple sources, news stories which may be of interest to an individual user and presents these on a single page. In addition, the Google News site contains an archive of major international newspapers and magazines dating back over 200 years. More specifically, Google News provides a master index to several existing newspaper archives (New York Times, Washington Post, etc) and has begun to digitise print newspapers which were not previously available in electronic form.6 Google is working with publishers to make ‘millions of pages of news archives’ available, in facsimile and in a form searchable by keyword.
The Google News Archive is not a corpus in the sense used by linguists. Accurate word frequency information is not available and only very limited word contexts are provided, as we shall show in the examples below. However, Google News does allow us to pinpoint when a particular word or phrase entered the lexicon of newspapers in the English-speaking world.7 By default, the Google News Archive search interface8 shows results in ‘relevance’ order, in a similar manner to a standard Google search. A secondary ‘timeline’ option allows the results to be viewed in date order, as shown in Figure 3 for the phrase credit crunch.
Figure 3: ‘Timeline’ results from Google News Archive for credit crunch
Figure 3 would initially seem to indicate that there are examples of credit crunch dating back to 1906. However, this output highlights a severe limitation of Google News for linguistic search. Many of the years associated with articles in the results list are not the year the article was written but the year in which the event being discussed took place. For example, the first result in Figure 3 (listed as 1906) is actually from a book published in 2002, and the second result (1926) is from an article dated May 29th 2008 in the New Zealand newspaper Timaru Herald.9 The fundamental difference between the dates required for informational search and for linguistic search (cf. Kehoe 2006) makes Google News an inadequate search interface for the latter. It is undoubtedly useful to know that there was a credit crunch in 1906 but it is also clear that the term itself was not used at that time. The last example in Figure 3 encapsulates this as it is a genuine example from the Chicago Tribune of November 16th 1967, which states that there was a credit crunch the previous year. The point is that this 1966 credit crunch appears to have been referred to as such only in retrospect.
Finding the earliest occurrence of a term with Google News is a rather laborious process. After finding the earliest genuine occurrence on the timeline by experimenting with different date ranges, it is necessary to switch back to the default view to determine if there are any earlier occurrences. As the default view does not show results in date order, all results must be examined.
By carrying out this procedure, we found the earliest examples of credit crunch to be not the November example from the Chicago Tribune but the examples from earlier in 1967 shown in Figure 4.10
New York Times, June 4 1967
avoid a repetition of last year’s credit crunch
Washington Post, June 26 1967
highest interest rate since the 1920s - even a little higher than the rates late last summer during the credit crunch
Washington Post, June 29 1967
Is the Nation heading into another credit “crunch” like last year’s, with soaring interest rates, competition for savers’ funds, and a new slump in the housing industry?
New York Times, June 30 1967
danger that we will be moving toward another “credit crunch”. To avoid this, we urgently need greater fiscal restraint by the Federal Government
Hartford Courant, July 2 1967
Five change in federal housing laws, designed to prevent a credit crunch of the 1966 type, were proposed last week to a Senate committee by the National Assn. of Real Estate
New York Times, July 2 1967
Interest rates, the ‘topic and concern of the and financial these days, have been climbing steadily and fears of a new credit crunch similar to last summer...
Figure 4: Earliest examples of credit crunch in Google News, extracted manually
All the examples in Figure 4 refer to the credit crunch as something which happened the previous year. We cannot say so conclusively but, given that the Google archive contains editions of these and other newspapers from 1966 yet returns no hits from that year, it seems likely that the term did not appear in the public domain in the United States until 1967. In the next section, we shall outline how WebCorpLSE, running on a combination of offline newspaper archives and newspaper data extracted from the web, can be used to trace the introduction of the term credit crunch into the UK.
3.2 Newspaper corpora accessible via WebCorpLSE
We know from the BNC that the phrase credit crunch was used in the UK in 1991 but was not widespread and required explanation. Using the diachronic search facility in WebCorpLSE, we are able to trace the use of the phrase across a 25 year continuous span of UK broadsheet newspapers, segmented into months. The corpus contains 950 million tokens and consists of:11
i) a complete archive of The Guardian (1984-88)
ii) a complete archive of The Independent (1989-99)
iii) The Guardian, downloaded from the web (2000-08)12
This corpus combines the two kinds of web-based diachronic analysis outlined by Nesselhauf (2007). On the one hand, the Guardian articles from 2000 onwards are pre-existing web texts. On the other, the early Guardian and Independent articles are off-line resources, being made available online in a form suitable for linguistic study by WebCorpLSE.13
Figure 5: Frequency of credit#crunch across time in the WebCorpLSE newspaper archive (per million words) 14
The graph in Figure 5 shows the frequency of credit crunch across time in the WebCorpLSE UK newspaper corpus. All frequencies are normalised to account for the varying size of the monthly segments across the years. The dotted line is the normalised monthly frequency and the solid line is a 12 month moving average. We have been examining such graphs for several years but have never seen a case as extreme as this, where the frequency increases from fewer than 1 occurrence per million words to almost 120 per million words within a single year. One of the earliest occurrences of credit crunch in the newspaper corpus is in a sentence from an August 1988 Guardian article, which includes a definition of the term and an indication of its origin: Indeed there is a possibility of a US-style credit crunch, where interest rates are pushed up hard for a short period. However, the phrase is used only 22 times in the 7 years before 1991 and the monthly frequency never rises above 1 per million words. The two noticeable ‘blips’ in Figure 5, prior to the massive upward trend in 2007-8, are accounted for by the concordances in Figures 6 and 7. These concordances were produced by WebCorpLSE, with sentence span selected and the results sorted by date, from earliest to most recent.
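The normalisation and smoothing behind a graph of this kind can be sketched as follows. This is a minimal illustration and not the WebCorpLSE implementation, which is not described here: the function names are our own, the monthly figures are invented, and we assume a trailing (rather than centred) 12 month window.

```python
def per_million(hits, tokens):
    """Normalise a raw count to occurrences per million tokens."""
    return hits * 1_000_000 / tokens


def moving_average(values, window=12):
    """Trailing moving average; early points average over the months seen so far."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical monthly segments: (raw hits for the phrase, tokens in that month).
monthly_segments = [(2, 4_000_000), (0, 3_500_000), (21, 3_900_000)]

# Dotted line: normalised monthly frequency. Solid line: moving average.
monthly_rates = [per_million(hits, tokens) for hits, tokens in monthly_segments]
trend = moving_average(monthly_rates, window=12)
```

Normalising before smoothing means that a month with an unusually large or small segment does not distort the trend line.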
Figure 6: WebCorpLSE concordances for credit#crunch from The Independent, January-February 1991 (case insensitive, sentence span)
Figure 6 shows the occurrences of credit crunch in The Independent in early 1991 which were responsible for the increase in frequency to a peak of 5.4 per million words in February of that year. Again we see some occurrences in quotes, complete with glosses (lines 24, 27, 31, 35 and 37) and other lexical signals such as ‘so called’ (cf. Renouf & Bauer 2001 on ‘contextual clues’). These concordances are contemporary with those from the BNC and it is clear from Figure 5 that, by chance, the BNC compilers captured a phrase, associated with a particular news story, which was at a peak of popularity in BrE.15 This again highlights the limitations of short time-span synchronic corpora in lexical studies. A study of credit crunch based on data from the BNC may overestimate the significance of the phrase in late 20th Century BrE. In order to trace the development of a word or phrase fully, it is necessary to use a larger monitor corpus like the newspaper sub-corpus in WebCorpLSE.
Figure 7: WebCorpLSE concordances for credit#crunch from The Independent, September-November 1998 (case insensitive, sentence span)
After 1991, credit crunch appeared rarely (fewer than 50 occurrences in 6½ years) until late 1998 when it appeared 67 times in 3 months, including the cases shown in Figure 7. This 1998 peak in the frequency of the phrase appears to have been sparked by comments from the chief executive of Barclays Bank (mentioned by name in lines 138-141 and 146). As in 1991, this peak in credit crunch was fleeting and the frequency of occurrence had fallen back below 1 per million words by December 1998. It then remained at that level until July 2007, when the massive increase in frequency began.
Turning to concordances from July 2007 (Figure 8), one is struck initially by the lack of quotation marks around credit crunch and lack of any explanation of the term.16 It may appear that, by this point, the phrase has entered the lexicon of the newspaper to the extent that the journalists no longer feel it necessary to provide an explanation when using it. However, if we then look at a selection of concordances from later in 2007 and into 2008 (Figure 9), with the frequency of credit crunch continuing to rise, we find further examples where credit crunch is defined by the writer. We also see early evidence of the increasing trend for metalinguistic discussion of the phrase credit crunch and its meaning.17
Figure 8: WebCorpLSE concordances for credit#crunch from The Guardian, July 2007 (case insensitive, sentence span)
Figure 9: Filtered WebCorpLSE concordances for credit#crunch from The Guardian, 2007-8 (case insensitive, sentence span)
A possible explanation for this lies in Figure 10, which shows the proportion of occurrences of credit crunch which appeared in each sub-section of The Guardian.
Figure 10: Proportion of occurrences of credit#crunch across sections of The Guardian, 2007-818
In the early months of 2007, the phrase appeared only in the ‘Business’ section. By July 2007 it was also appearing in the ‘Money’ and ‘Comment’ sections, and by August it had spread to ‘Media’ and ‘Life’. Eventually, in December 2008, credit crunch was appearing in all sections of the newspaper, including ‘Sport’, ‘Education’ and ‘Culture’, thus confirming the notion in Figure 9, concordance 11 that ‘the esoteric “credit crunch” has moved out of the so-called “interbank money markets” and into the consciousness and pockets of the British people’.
3.3 Collocational analyses
Although the filtering options in WebCorpLSE can be used to make manual data analysis a more manageable task, the number of results can be prohibitively large when dealing with frequent lexis. In her 1987 study of ‘lexical resolution’, using a corpus of 13 million words, Renouf concluded that ‘eventually a point may be reached in corpus development where all word forms in which there is a lexicological interest are sufficiently exemplified’ (Renouf 1987: 130). It could be argued that we have now gone beyond this point, to a situation where corpora are so large that, for all but the rarest word forms, we are presented with more concordance data than can be analysed manually. As a result, statistical analyses have become increasingly important. One way to examine the growth of credit crunch over time is to produce span 1 collocational statistics for one or both of the words which constitute the phrase. We have chosen to take credit as this is the more frequent of the two words in our corpus and we felt that an analysis of its collocates may provide more information about squeeze and other related words. Figure 11 shows the span 1 collocates of credit for all months up to and including December 1988 (with a stopword filter switched on), whilst Figure 12 shows the same information but with the time period extended 20 years to the end of the corpus (December 2008). A z-score calculation is used to compare the expected frequency of collocation (based on the frequencies of each word) with the actual, observed frequency. Such collocational statistics are now standard in corpus linguistics and they are undoubtedly useful, as in this case where they reveal that crunch, which did not appear as a statistically significant collocate of credit in 1988, had become its most significant collocate by 2008. (In fact, viewed from the opposite perspective, credit accounts for 90% of the significant collocates of crunch in L1 – immediate left – position in the corpus as a whole.) 
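By way of illustration, span 1 counts and a z-score of the kind described above could be computed as in the sketch below. The exact formula used by WebCorpLSE is not given in this paper, so we assume the widely used Berry-Rogghe formulation, in which the expected frequency is derived from the collocate's frequency, the node's frequency, the corpus size and the number of span positions. The function names and the toy sentence are our own.

```python
import math
from collections import Counter


def span1_collocates(tokens, node):
    """Count words immediately left (L1) and right (R1) of each node occurrence."""
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        if i > 0:
            left[tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            right[tokens[i + 1]] += 1
    return left, right


def z_score(observed, node_freq, coll_freq, corpus_size, span=2):
    """Assumed Berry-Rogghe z-score; span=2 covers one slot on each side of the node.

    Intended for collocates that actually occur (coll_freq > 0).
    """
    p = coll_freq / (corpus_size - node_freq)   # chance of the collocate in any slot
    expected = p * node_freq * span             # expected co-occurrences
    return (observed - expected) / math.sqrt(expected * (1 - p))


# Toy example (invented text, not corpus data).
tokens = "the credit crunch hit as the credit card debt grew and credit crunch fears rose".split()
left, right = span1_collocates(tokens, "credit")
total = left["crunch"] + right["crunch"]        # the TOT column: L1 + R1
z = z_score(total, tokens.count("credit"), tokens.count("crunch"), len(tokens))
```

In a real corpus the z-score is only meaningful with a frequency threshold and a stopword filter of the kind mentioned above; neither is implemented in this sketch.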
WebCorpLSE provides an enhanced collocation tool which allows the tracking of changes in collocational patterns across time. We refer to this as a collocational ‘heat map’, where heat is used as a metaphor for collocational strength. To generate a heat map, WebCorpLSE ranks all collocates of the target word in the whole corpus by z-score, and selects the top 200 significant collocates for further analysis. These are then broken down into groups by month and year to create a diachronic table of collocation frequency. The monthly z-scores are used to plot the strength of collocation on a graph by translating them into shades of red.
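The steps just described (rank by whole-corpus z-score, keep the top collocates, then colour each month by its own z-score) can be sketched roughly as follows. This is an assumption-laden illustration rather than the WebCorpLSE code: the linear white-to-red colour scale, the function names and the monthly z-scores are all hypothetical; only the three whole-corpus z-scores are taken from Figure 12.

```python
def top_collocates(overall_z, n=200):
    """Rank collocates by whole-corpus z-score and keep the n strongest."""
    return sorted(overall_z, key=overall_z.get, reverse=True)[:n]


def shade_of_red(z, z_max):
    """Map a z-score in [0, z_max] to a hex colour from white (weak) to red (strong)."""
    if z_max <= 0:
        return "#ffffff"
    strength = min(max(z / z_max, 0.0), 1.0)
    fade = round(255 * (1 - strength))          # green and blue channels fade out
    return f"#ff{fade:02x}{fade:02x}"


def heat_map(overall_z, monthly_z, n=200):
    """Build a collocate-by-month grid of colours from monthly z-scores."""
    rows = top_collocates(overall_z, n)
    z_max = max((z for scores in monthly_z.values() for z in scores.values()),
                default=0.0)
    return {
        coll: {month: shade_of_red(scores.get(coll, 0.0), z_max)
               for month, scores in monthly_z.items()}
        for coll in rows
    }


# Whole-corpus z-scores from Figure 12; the monthly values are invented.
overall_z = {"crunch": 4149.97, "card": 4010.65, "squeeze": 233.01}
monthly_z = {"1988-12": {"card": 20.0}, "2008-12": {"crunch": 40.0, "card": 10.0}}
grid = heat_map(overall_z, monthly_z, n=2)
```

A collocate with no significant score in a given month is left white here, which corresponds to its ‘fading’ from the map.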
Collocate     L1    TOT    R1    Z-score
card           1    807   806     619.54
cards          6    507   501     419.92
consumer     225    225     -     183.95
Suisse         -    104   104     101.68
Family       102    102     -      92.72
family       193    194     1      87.76
boom           -     90    90      80.21
Consumer      80     80     -      76.48
rating         -     82    82      75.50
export        83     83     -      75.42
Export        72     73     1      71.41
Guarantee      -     64    64      62.78
Act            -     67    67      51.52
bank          71     74     3      46.36
Lyonnais       -     44    44      42.67
reference      -     48    48      41.36
scoring        1     47    46      40.94
controls       -     40    40      34.97
facilities     -     41    41      34.52
tax           62     65     3      33.65
lines          -     42    42      31.65
limit          2     35    33      28.89
insurance      1     37    36      26.85
balances       -     26    26      24.45
unions         -     31    31      24.30
Figure 11: Significant span 1 colls. of credit, up to end of 1988
Collocate        L1     TOT      R1    Z-score
crunch            3    7031    7028    4149.97
card             32   17249   17217    4010.65
Suisse            1    3311    3310    2739.34
cards            54    8825    8771    2478.06
Lyonnais          2    1638    1636    1461.61
rating            8    2053    2045     964.25
Consumer       1183    1186       3     774.05
tax            4329    4333       4     597.99
consumer       1430    1435       5     389.45
Agricole          1     369     368     355.72
Tax             486     490       4     334.70
Counselling       2     349     347     325.30
reference         -     751     751     260.96
deserves        518     518       -     256.05
pension        1048    1050       2     248.81
ratings           2     551     549     239.53
squeeze           4     424     420     233.01
export          508     509       1     217.58
balances          2     266     264     196.68
Card              1     236     235     196.14
interest-free   209     209       -     185.52
unions            9     684     675     183.78
Family          429     431       2     181.24
facility          2     343     341     178.28
markets          15     856     841     174.02
Figure 12: Significant span 1 colls. of credit, up to end of 2008
Figure 13 is a heat map for the span 1 collocates of credit from 1985 to 2008.19 This output highlights the fine-grained approach to collocation provided by WebCorpLSE heat maps. We see Lyonnais, a strong span 1 collocate of credit for over 10 years, disappear from the map in 2003, at the point when the French bank Credit Lyonnais became known as LCL. We also see Family disappear and Tax appear in 1998-9, when the ‘Working Families' Tax Credit’ replaced ‘Family Credit’ in the UK welfare benefit system. These are not linguistically interesting examples in themselves but they indicate that the methodology is sound and allow us to draw more meaningful conclusions when, for example, reference, ratings, histories and limit become strong collocates of credit (relating to ‘debt worthiness’). Figure 13 also captures the cyclical nature of credit crunches, with crunch appearing as a significant collocate of credit for specific short periods (1991-2, 1998-9) before ‘fading’ out of use again. We also see squeeze appearing as a span 1 collocate of credit at similar, but not identical, points in time (appearing more gradually from 1988-91, and weakly in 1993-4 and 1998-9). We shall examine squeeze in section 4.
Figure 13: Top of ‘heat map’ for span 1 collocates of credit (case insensitive) in newspaper corpus 1985-2008 (left and right collocates)
Both crunch and squeeze re-emerge as strong collocates of credit in 2007-8 and it remains to be seen how long this particular event will last. Given that the phrase credit crunch is being used more frequently than ever before and that collocates indicating severity (global, crisis) also appear as strongly significant in 2007-8, it would seem that it will be much longer before this particular credit crunch fades from the heat map.
We should also note in our discussion of collocation that WebCorpLSE allows the generation of collocates for any search term and is not restricted to single-word searches. Figure 14 shows the span 4 collocates of the phrase credit crunch over time. Until 2007, the phrase had few statistically significant collocates, though banks first appeared in 1991 and global, fears and markets had appeared by 1999 (the time of the second ‘blip’ in Figure 5). By 2008, there is a long list of words describing the credit crunch, its causes and effects, some of which are classed as significant as a result of their own newness and rarity (subprime, write-downs). It will be interesting to monitor changes in the collocational profile of credit crunch in future years.
4. A brief discussion of credit squeeze
Space does not permit a full discussion of credit squeeze but we have conducted a diachronic analysis of the phrase and will summarise the main findings here. Unlike credit crunch, credit squeeze does appear in the OED, under the headword credit (Figure 15).
Figure 14: Top of ‘heat map’ for span 4 collocates of credit#crunch (case insensitive) in newspaper corpus 1991-2008 (left and right collocates)
14. attrib. and Comb.[...] credit squeeze, the restriction of financial credit facilities through banks etc.
1955 Times 18 July 15/1 As early as last February I applied a little of the curb - what is sometimes called the credit squeeze.
1957 Britannica Bk. of Year 511/2 A verb-form to credit-squeeze, to restrict investment or speculation by reducing financial credits.
1962 H. O. BEECHENO Introd. Bus. Stud. xiv. 138 ‘Credit squeezes’ - i.e. making it more difficult to obtain loans from banks and, perhaps, restricting hire purchase business... This check can be applied selectively.
Figure 15: OED definition of credit squeeze
The whole phrase does not appear in the Brown family of corpora but there is one occurrence of squeeze in this sense in the BrE LOB corpus:
The big “squeeze” means that it is going to be more difficult to arrange a loan or overdraft. (A06 206-207; Daily Sketch, 4 August 1961)
The phrase is not quite as frequent as credit crunch in the BNC, appearing 13 times in texts from 1976-93 (Figure 16).
1 In 1974 his property and investment group also faced problems brought on by a credit squeeze and downturn in the building market. (AAS 11: Guardian Business section, 31/12/89)
2 The capital standards, negotiated through the Bank for International Settlements (BIS), are a natural scapegoat for the credit squeeze that is deepening the recessions in Britain and America and may provoke one in Japan. (ABE 159: Economist, 1991)
3 The higher interest rates and credit squeeze control used by the Conservatives did, however, slow down growth in the economy overall. (CRD 480: Engineers, managers and politicians, 1993)
4 The Conservatives had clearly let the economy overheat for electoral advantage in 1955, but as soon as the election was over, clamped down with a credit squeeze. (CRD 559: Engineers, managers and politicians, 1993)
5 Foreign business also has a more practical complaint: because of China’s credit squeeze, bills are no longer paid on time. (EDU 578: Marxism Today)
6 In Britain the apparently smooth growth during the long boom was marked by dramatic events that, at the time, seemed to be crises: for example, the 1957 credit squeeze and record interest rate jump (FA0 588: Restructuring Britain: the economy in question, 1988)
7 In a less obvious but equally influential manner, if a credit squeeze is applied as a macroeconomic policy, the resulting high interest rates will reduce the number of people able to take out mortgages. (FB2 719: Rural Britain: a social geography, 1985)
8 It won’t be affected by the credit squeeze ...? (G0F 1376: Sweet dreams, 1976)
9 This is true in that consumer demand has collapsed as a result of the credit squeeze (G38 485: Marketing Week, 17/01/92)
10 Britain therefore experienced a credit squeeze in the early 1990s during a period of recession in much the same way -- and for much the same reasons -- that she experienced a credit boom during the period of growth and “overheating “ in the mid-1980s. (H91 296: A treaty too far, 1992)
11 The government responded to the payments crisis with a credit squeeze. (K8U 225: Capitalism since 1945, 1991)
12 This situation would occur in circumstances as in the late 1960s, when due to a credit squeeze, interest rates rose. (K8W 1292: UK financial institutions and markets, 1991)
13 Second and simultaneously, in order not to release a consumer credit squeeze that would second imports, they should introduce controls on the supply of credit (KRT 3495: Fox FM News: radio programme)
Figure 16: All examples of credit squeeze from the BNC
It is noticeable that credit squeeze appears far more in the BNC in books, discussing past events (sources underlined in Figure 16), than in news stories. These results are significantly different from those for credit crunch and may indicate that crunch was in the process of replacing squeeze in this context in BrE texts discussing current events. We cannot, of course, draw this conclusion purely from an analysis of the BNC or other standard corpora, for reasons outlined above. However, a diachronic analysis of our UK newspaper corpus using WebCorpLSE (Figure 17) does provide further evidence for this.
Figure 17: Frequency of credit#squeeze across time in the WebCorpLSE newspaper archive (per million words)
The phrase credit squeeze appears in the newspaper corpus earlier than credit crunch (1984 versus 1987) but there are only 422 occurrences of the former, compared with 7069 of the latter, and squeeze does not reach the same peaks in frequency reached by crunch. We also used Google News to extract the earliest occurrence of credit squeeze in newspapers, in the same way described above for credit crunch. This revealed the earliest occurrences to be in two New York Times articles from 26th March 1929 (complete with Google OCR errors):
alt of which have been recently b3 the stock market. threw out the intimation that a credit squeeze of major proportions was inevitable if the use of ...
The tightest credit squeeze in almost nine years tools place On the S:Ork Stock Exchan=a yesterday, when the call loan rate advanced to 74 per cent
These early occurrences of the phrase in AmE are contrary to the claim in the Cambridge Advanced Learner’s Dictionary (quoted in Section 1) that credit squeeze is a UK term, equivalent to the US credit crunch. It is, of course, conceivable that credit squeeze was once the preferred term in AmE and that, at some point after the coining of the phrase credit crunch in the US in 1967 and before the earliest articles in our newspaper corpus (1984), credit squeeze was still used more widely than credit crunch in BrE. What is certain from our analysis is that, given the recent global credit crunch and massive increase in usage of the phrase in UK newspapers, this distinction between UK and US usage no longer holds true. It is beyond the scope of the current paper to examine the semantics of the two phrases in depth, a task which would require economic as well as linguistic insight, but the phrases credit squeeze and credit crunch do not appear to be as synonymous as the CALD definition implies. It is clear from the OED citations and LOB and BNC concordances that, from the 1950s-1990s, a credit squeeze was a measure applied by a government as a deliberate economic policy. A credit crunch, in its most recent incarnation at least, is something over which governments seemingly have little control.
5. From the credit crunch to the crunch
As we have seen, the vast increase in use of the phrase credit crunch in mid-2007 was mirrored by an increase in the less used credit squeeze, with both phrases being used to describe the same event. During the same period, we have also noted an increase in the elliptical form the crunch and have examined this by using the date filter option in WebCorpLSE to view all occurrences of the phrase in The Guardian from 2007-8. These were then analysed manually and divided into five categories:
i) crunch as a premodifier (e.g. the crunch vote, the crunch game)
ii) the crunch referring to the credit crunch
iii) COME+to the crunch (including the crunch came, etc)
iv) literal crunch (the crunch of gravel)
v) other
A graph of the results (Figure 18) reveals that, whilst the other meanings have remained constant, the crunch as an abbreviated form of the credit crunch has increased in frequency following first occurrence in July 2007.
Figure 18: Frequency of the crunch, 2007-8 (per million words), differentiated by sense: 0=other, 1=premodifier, 2=credit crunch, 3=COME to crunch, 4=literal
The manual analysis also revealed some creative uses of the crunch, where two meanings have been conflated by journalists for effect, including:
1. Analysts at Evolution Securities said the worst was still to come, with the “crunch” for Greene King and other licensed retailers arriving this winter and next spring. (04/07/08)
2. When it comes to the crunch, price matters. (07/12/08)
3. The worry is that when it comes to the crunch multinationals will close overseas plants rather than domestic ones and overseas utilities will not pass on cost decreases arising from oil. (08/12/08)
Examples 2 and 3 here are from articles about the credit crunch and the use of the idiom ‘when it comes to the crunch’ appears to be a conscious decision by the writer, certainly so in 2, a sub-headline. The writer of example 1 uses the COME+to the crunch construction (and signals the play on words with the double quotes around crunch) but then selects arriving instead of coming. We would suggest that this was a deliberate choice by the journalist (or possibly a subeditor) to ensure that the ‘credit crunch’ meaning was not ‘lost’ in the idiom. There appear to be two factors driving the growth in the ‘shorthand’ form the crunch. Firstly, journalists tend to tire of ‘buzz’ phrases quickly and begin to look for ‘snappier’ alternatives. Secondly, the vast increase in usage of the phrase (the) credit crunch over a relatively short period of time has left it (and the associated concept) in the public consciousness to such an extent that the shorthand form the crunch is interpretable instantly, without a gloss.
6. Conclusion
In this paper, we have illustrated how the web can be used to supplement usage examples from standard corpora in diachronic linguistic analysis. When considering a recent linguistic phenomenon such as the rise of credit crunch, the web offers a solution to the restrictions posed by the ‘dearth of corpora of English spanning the whole of the twentieth century, or more particularly spanning the early part of it’ (Leech 2005: 85). We have shown that, through careful data selection and the use of advanced diachronic analysis tools in WebCorpLSE, it is possible to widen the focus and trace the development of a word or phrase across the twentieth century, in British and American English.
Our analysis of credit crunch and associated phrases has highlighted the value of Google News as a repository of twentieth century texts, but has also revealed the limitations, for linguistic search, of the search software provided by Google. The ideal solution would be to access the Google News archive via WebCorpLSE or other similar interface, thus allowing full-scale diachronic linguistic search of twentieth century newspaper text. Of course, newspaper corpora are not an ideal data source for the analysis of all kinds of linguistic phenomena but, as Hundt and Mair (1999) point out, newspapers are usually at the forefront of linguistic change and are, thus, a valuable resource in the kind of linguistic analysis carried out in this paper.
Our analysis has focussed on usage patterns rather than semantics but the work has allowed us to make some observations about the meaning and status of the phrase credit crunch and of crunch individually, as relates to squeeze. In fact, our analysis of the ‘shorthand’ form the crunch in The Guardian uncovered a meta-linguistic discussion of crunch to which we now refer in conclusion:
What exactly is a crunch? Crunch in this context has two meanings, the first being “critical moment”, as in “coming to the crunch”. This is the older meaning of the two, almost certainly dating to Winston Churchill’s use of it in a 1939 Daily Telegraph interview. […] The second, more modern meaning is the sense of “squeeze”, arising from paucity – this is how we get “energy crunch”. […] Generally, the two meanings bisect, so the word conveys an urgent scarcity. […] But the two meanings have not yet coalesced entirely. (Zoe Williams, The Guardian, 7 January 2008)20
What this journalist refers to as the ‘more modern meaning’ is the wider AmE use of crunch which was already apparent in the 1992 Frown examples discussed in Section 2. This meaning has apparently made its way into BrE as a result of the massive surge in frequency of credit crunch. Prior to 2008, the ‘paucity’ example, energy crunch, had only appeared in our newspaper corpus 7 times, but there were then 13 occurrences in 2008 alone (3 of which appeared days before Zoe Williams’ comments and are apparently what sparked them). As we noted in
section 3.3, 90% of the immediate left collocates of crunch in our newspaper corpus are accounted for by case variants of credit, so there is little evidence for the wider use of crunch to refer to other kinds of ‘squeeze’ at present. Apart from energy crunch, we do note a handful of occurrences of other crunches in 2007-8 data (supply, pensions, housing, oil). In our analysis of the crunch, we also note two examples where crunch appears to fill a slot more commonly filled by the semantically related pinch:
1. Harriet Harman, has repeatedly and patronisingly said that “ordinary” families are feeling the crunch from rising fuel and food prices (06/05/08)
2. Budget hotels are raking it in as business people feel the crunch (05/10/08)
The second example here could be interpreted as ‘feel the effects of the credit crunch’, but the first is seemingly equivalent to ‘feeling the pinch’ (the use of from rather than through precluding the interpretation ‘feeling the credit crunch’). This use of crunch is reminiscent of its use in a wider financial sense in the AmE concordances from Frown (Figure 2), where it does indeed convey both scarcity and urgency.
This paper has traced the assimilation of the phrase credit crunch into BrE. During the 1990s, the phrase was used periodically but infrequently in UK newspaper texts, reflecting the cyclical nature of the economic phenomenon it describes. As a result, each time the phrase re-emerged, journalists found it necessary to provide a full gloss. Since mid-2007, however, credit crunch has increased in usage to such an extent that the elliptical form the crunch is now interpretable immediately by the UK public. In fact, as a result of the spread of credit crunch, the word crunch is itself beginning to take on new meanings, including some not linked directly to the financial domain. It, thus, seems unlikely that the phrase credit crunch will require a gloss if it is to re-emerge once again in future years.
Acknowledgement
Development of WebCorpLSE was in part funded by the UK Engineering and Physical Sciences Research Council (EPSRC), grant reference EP/E001300/1.
Notes
1 A recent upgrade to the original WebCorp system (and renaming to ‘WebCorp Live’) has increased processing speed, but the reliance on commercial search engines remains and the range of searches possible is thus still limited. We are maintaining the original WebCorp system for the benefit of those users who wish to conduct ‘live’ searches of the ‘whole’ web, as accessible through commercial search engines.
2 http://www.askoxford.com/worldofwords/wordfrom/wordsoftheyear2008
3 http://www.cambridge.org/elt/corpus/international_corpus.htm
4 For each occurrence, the BNC file and line number are given in parentheses. Concordance lines are grouped according to the publication and article from which they are taken (the latter extracted manually from the source files). The BNC was designed as a synchronic corpus and is not ideally structured for diachronic study. For example, the file ABD contains 9 occurrences of credit crunch but it is not immediately clear that the last 8 of these all occur in the same article. Nor is it clear on exactly which day each newspaper article was published and, in some cases, there is no date information at all other than a wide range (e.g. the article from The Scotsman in Figure 1: 1985-1994). Results are presented in Figure 1 in BNC file order, which is not necessarily date order.
5 The Frown manual (http://khnt.hit.uib.no/icame/manuals/frown) reveals the sources of these examples to be: A25: Press: Reportage: San Francisco Examiner: ‘S.F. Supervisors Crack Down on Use of City Cars’ (06/10/92). C12: Press: Review: Wall Street Journal: ‘The Persecution of Milken’ (25/08/92). G23: Belles Lettres, Biographies, Essays: Ruth Conniff ‘The Culture of Cruelty’, The Progressive (09/92).
6 See http://googleblog.blogspot.com/2008/09/bringing-history-online-onenewspaper.html.
7 There are several limitations, some of which we go on to outline below. The main limitation is that, at present, the compilers of the Google News Archive are focussing their attention, for the earlier periods of history, on US newspapers. This is not so much of a problem in our case, since we are searching for a term which we believe to have originated in the US.
8 http://news.google.com/archivesearch. The searches discussed in this paper were carried out in January 2009.
9 The Google News results pages carry the disclaimer “Dates associated with search results are estimated and are determined automatically by a computer program”. Kehoe (2006) detailed the ways in which a computer program could estimate the authorship date of web texts for use in linguistic analysis, with a high accuracy rate. Newspaper articles contain far more reliable dating information than web pages, so it is unlikely that Google’s program is wildly inaccurate when estimating these dates. It is simply estimating dates for a different purpose.
10 Note that, in most cases, the full text of matching articles is not available. In some cases, a sentence context is available by following the link to the corresponding newspaper archive. In other cases the limited context on the Google News results page is all that is available. Figure 5 shows the
widest context available. There is an apparent OCR error in the last context shown.
11 Though the corpus comes from two different broadsheet newspapers, these are broadly comparable in terms of content, focus and style.
12 Including its Sunday sister newspaper The Observer.
13 Though The Guardian has an archive on its website, this is complete only from 1999 onwards. Only a selection of the 1984-88 articles in our corpus is available on the Guardian site and The Independent does not have a freely accessible archive at all. WebCorpLSE makes limited contexts available from these sources, to registered users only.
14 The # operator in WebCorpLSE matches the three variants ‘credit crunch’, ‘credit-crunch’ and ‘creditcrunch’, a useful option when searching for compounds. As it transpires, the last of these does not occur in our corpus. We use the credit#crunch query syntax throughout this paper. This particular search is also case insensitive.
15 The same is perhaps also true, to a lesser extent, for the Frown corpus, its 1992 AmE texts capturing a credit crunch in the US economy at that time.
16 Line 280 (‘By most definitions, that’s a credit crunch’) is a possible exception. However, we would not class the sentence immediately before this (‘Right now, big buy-outs are impossible: the debt markets are closed until the jam clears’) as a clear definition of the term. The concept of ‘credit crunch’ is not presented in this article as something which may be unfamiliar to the reader.
17 This concordance selection was made possible by the ‘filter’ option in WebCorpLSE, which allows manual removal of individual concordance lines, filtering by date, etc.
18 Some of the categories in this chart are composed of several sub-sections on The Guardian website: COMMENT: Comment, Letters; CULTURE: Artanddesign, Arts, Books, Culture, Film, Music, Stage; LIFE: Lifeandhealth, Lifeandstyle, Cars, Society, Travel, Weekend; NEWS: News, UK News; TECHNOLOGY: Science, Technology; WORLD: EU, Global, International, USA, World.
19 We have included both left and right span 1 collocates for illustrative purposes. WebCorpLSE allows the analysis of right and/or left collocates at spans 1-9 and sentence span. It is possible to conflate the frequencies of case variants, separate part-of-speech variants (e.g. separate entries for crunch_NN and crunch_VV) or view POS collocates only.
20 http://www.guardian.co.uk/business/2008/jan/07/creditcrunch.zoewilliams
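The behaviour of the # operator described in note 14 can be approximated with a regular expression. The following Python sketch is an assumption-laden illustration, not WebCorpLSE's own implementation: the helper name `hash_query` is hypothetical, and the pattern simply allows a space, a hyphen or nothing between the two elements, ignoring case.

```python
import re

def hash_query(query):
    """Compile an 'a#b' style query into a regular expression matching
    'a b', 'a-b' and 'ab', ignoring case (illustrative approximation only)."""
    left, right = query.split("#")
    return re.compile(
        rf"\b{re.escape(left)}[ -]?{re.escape(right)}\b", re.IGNORECASE
    )

pattern = hash_query("credit#crunch")
text = "The Credit Crunch, or credit-crunch, is rarely written creditcrunch."
variants = [m.group(0) for m in pattern.finditer(text)]
print(variants)  # ['Credit Crunch', 'credit-crunch', 'creditcrunch']
```

A single pattern of this kind retrieves the open, hyphenated and solid spellings of a compound in one pass, which is what makes the operator convenient for compounds whose orthography has not yet settled.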
References
Hundt, M. and C. Mair (1999), ‘Agile and uptight genres: The corpus-based approach to language change in progress’, International Journal of Corpus Linguistics 4, 221-242.
Kehoe, A. & M. Gee (2007), ‘New corpora from the web: making web text more “text-like”’, in: P. Pahta, I. Taavitsainen, T. Nevalainen & J. Tyrkkö (eds.) Towards Multimedia in Corpus Studies, University of Helsinki: http://www.helsinki.fi/varieng/journal/volumes/02/kehoe_gee
Kehoe, A. (2006), ‘Diachronic Linguistic Analysis on the Web using WebCorp’, in: A. Renouf & A. Kehoe (eds.) The Changing Face of Corpus Linguistics, Amsterdam: Rodopi, 297-307.
Kehoe, A. & A. Renouf (2002), ‘WebCorp: Applying the Web to Linguistics and Linguistics to the Web’, in: Proceedings of WWW 2002, Honolulu, Hawaii. Electronic publication: http://www2002.org/CDROM/poster/67
Leech, G. and N. Smith (this volume), ‘Change and constancy in linguistic change: How grammatical usage in written English evolved in the period 1931-1991’.
Leech, G. (2005), ‘Extending the possibilities of corpus-based research on English in the twentieth century: A prequel to LOB and FLOB’, in: ICAME Journal No. 29.
Mair, C. (2007), ‘Change and variation in present-day English: integrating the analysis of closed corpora and web-based monitoring’, in: M. Hundt, N. Nesselhauf & C. Biewer (eds.) Corpus Linguistics and the Web. Amsterdam/New York: Rodopi, 233-247.
Nesselhauf, N. (2007), ‘Diachronic analysis with the internet? Will and shall in ARCHER and in a corpus of e-texts from the web’, in: M. Hundt, N. Nesselhauf & C. Biewer (eds.) Corpus Linguistics and the Web. Amsterdam/New York: Rodopi, 287-305.
Renouf, A. (2003), ‘WebCorp: providing a renewable data source for corpus linguists’, in: S. Granger & S. Petch-Tyson (eds.) Extending the scope of corpus-based research: new applications, new challenges. Amsterdam: Rodopi, 38-53.
Renouf, A. & L. Bauer (2001), ‘Contextual Clues to Word-Meaning’, International Journal of Corpus Linguistics, Vol. 5 (2), Amsterdam/Philadelphia: John Benjamins, 231-258.
Renouf, A. (1987), ‘Lexical Resolution’, in: W. Meijs (ed.) Corpus Linguistics and Beyond: Proceedings of the Seventh International Conference on English Language Research on Computerized Corpora. Amsterdam: Rodopi, 121-131.
“To each reader his, their or her pronoun”. Prescribed, proscribed and disregarded uses of generic pronouns in English
Elisabetta Adami
University of Verona, Italy
Abstract
After a brief review of the existing literature, this paper investigates the use of generic pronouns in the academic written sections of several corpora of English, namely, (a) the so-called ‘Brown Family’ of the ICAME collection, (b) six components of the International Corpus of English, (c) the British National Corpus and (d) the current extent of the American National Corpus. The analysis shows that the 1970s and 80s debate about sexism in language has apparently influenced academic writing, to the extent that the frequency of generic he is lower in the post-debate texts, while other alternatives have been introduced, some of which, such as ‘he or she’ are now widely used in academic writing. Furthermore, in a genre which is most concerned with ‘correctness’, some so far proscribed pronouns, like singular they, show a slight increase, while the usually disregarded generic she attests a quite significant use. The data testify to variations in use between BrE and AmE and, less conclusively, between other geographical varieties of English. In addition, the analysis makes some observations on the contexts of use, both in terms of domains and of type of antecedents, of s/he, singular they and of the rare, yet attested, generic she, generally disregarded by the literature on the subject.
1. Introduction
For a number of years grammarians, linguists and teachers have debated which English pronoun should be used to refer individually to gender-indefinite or sex-mixed human categories and roles, in cases like ‘anyone can put aside his, their or her own interests to review a situation dispassionately’.1 When sexism in language became a major topic of debate, both the long-lasting prescription of ‘generic he’ (e.g. ‘anyone can put aside his own interests’) and the proscription of the so-called ‘singular they’ (‘anyone can put aside their own interests’) were questioned and various gender-fair alternatives, such as ‘he or she’ (‘anyone can put aside his or her own interests’), were suggested. Nowadays, the ‘Great He/She Battle’ seems to have exhausted its ink-munitions and, in the absence of an agreed solution, ‘recast the sentence into the plural’ (~ ‘(all) people can put aside their own interests’) and ‘avoid pronouns whenever possible’ (~ ‘(personal) interests can be put aside to review a situation dispassionately’) remain the most frequently suggested strategies. So far, few studies have been carried out to ascertain the current use of generic pronouns, none of which has examined academic writing extensively. In order to fill this gap, after a review of the long-standing debate (section 2), this paper investigates the use of generic pronouns in the academic written sections of
several corpora of English (section 3), namely, (a) the so-called ‘Brown Family’ of the ICAME collection, (b) six components of the International Corpus of English, (c) the British National Corpus, and (d) the current extent of the American National Corpus. The analysis aims to (a) verify the extent of influence on academic writing of what has been termed the ‘Great He/She Battle’2, (b) uncover differences in the use of generic pronouns in different regional varieties of English, and (c) investigate some contexts of use of the newly introduced gender-fair alternatives and, in particular, of the attested, but so far disregarded, generic she.
2. The background
Unlike other Indo-European languages, English has no inflectional category for gender and no gender agreement is needed within and above the noun phrase. In English, ‘[g]ender classes can be differentiated only on the basis of relations with pronouns’ (Huddleston and Pullum 2002: 485) and ‘the choice of pronoun is determined by denotation or reference, not by purely syntactic properties of the antecedent’. Indeed, as is well known, the English pronoun system signals, for the third person singular only, the natural gender of the referent, so that he, his, him, himself stand for antecedents denoting males, she, hers, her, herself stand for antecedents denoting females, and it, its, itself refer to non-human entities.3
Given that the choice of the pronoun follows the sex of the referent, a problem arises when a pronoun is to be used with antecedents referring individually to a mixed-sex human group, role, or category, or to a human entity whose sex is unknown (e.g. the student, the child, someone). Following Latin rules and grammatical tradition, for more than two hundred years, grammarians have retained the masculine as the unmarked case in English (cf. Corbett 1991), hence he has been the prescribed choice to be used in cases like every passenger must show his ID. According to this prescription, he can be both gender-specific (to refer to a male) and gender-inclusive, or generic (to refer to a male + female category).
The prescription of generic he has been paired with the proscription of the so-called ‘singular they’, although its use in sentences like everybody raised their hand/s is widespread and well evidenced throughout the history of English in authoritative examples, from William Shakespeare to Jane Austen and George Bernard Shaw; see, for instance, the examples cited in the entry for they in the Merriam-Webster Online Dictionary (2005) (cf. Bodine 1975 for a detailed history of both generic he prescription and singular they proscription since the 18th Century; Stanley 1978; Sklar 1983; Baron 1986):
[…] <'tis meet that some more audience than a mother, since nature makes them partial, should o'erhear the speech — Shakespeare>
<a person can't help their birth — W. M. Thackeray>
<no man goes to battle to be killed. — But they do get killed — G. B. Shaw>.
2.1 The Great He/She Battle and modern descriptions (and prescriptions)
During the final decades of the last century, with the rise of the feminist movement, a wide debate on the subject of sexism in language began to question the prescription of generic he. Many voices arose in favour of a rehabilitation of singular they (Palmer 1951, Bodine 1975, Green 1977, MacKay 1980, Jochnowitz 1982, MacKay 1983, Abbott 1984, Henley 1984, Baron 1986, Kolln 1986, Sklar 1988, Meyers 1993, Zuber and Reed 1993, Bennett-Kastor 1996). This was strongly opposed by grammarians and other scholars (see, for example, Frank and Treicher 1989, Lloyd 1993, Kearns et al. 1994), who vehemently maintained the ungrammaticality of singular they, because of its non-agreement in number with its singular antecedent. In turn, the supporters of singular they countered this claim by pointing out that generic masculine is also ungrammatical where it does not agree in gender with its antecedent, and supported their claims with the history of the second person pronoun, which has gradually lost the singular/plural opposition, i.e. thou/you (for a detailed account cf. Blaubergs 1980; Hellinger 1990; Newman 1992, 1996; Pauwels 1998).
Along with singular they, several gender-fair alternatives to generic masculine were suggested, such as the phrase ‘he or she’ and a wide range of neologisms (for a detailed list, see Baron 1981). Yet the use of the former is hindered by its supposed ‘clumsiness’, as argued by many grammar textbooks (see Baron 1981 for a full account), and none of the latter seem to have entered into common usage.4 Furthermore, within the ‘Great He/She Battle’, the use of generic she with antecedents referring to stereotyped female roles (e.g. the nurse, the teacher) has either been stigmatised as sexist or raised as evidence against the supposed unmarkedness of he, i.e. if he were the authentic English generic pronoun it should occur in all cases of generic reference (Martyna 1983, Baron 1986: 175-177). Conversely, the few attempts at using an ‘authentically’ generic she (i.e. not confined to stereotyped female roles) are cited in the literature only by way of paradoxical examples, or considered as mere provocations even by the most vigorous opponents of generic masculine (Schultz 1975, Gershuny 1977, Martyna 1978, DeShazer 1981, Jochnowitz 1982, Nilsen 1984, Meyers 1990, Pauwels 2003).
Although, for the sake of brevity, this account may present it as an ‘ordinary’ academic disputation, the debate has, in fact, assumed the sharp tones of a real battle, as only deeply felt issues provoke (and as can be grasped by the titles of the articles of the time; see the reference section). Famous defections (or painful friendly fire incidents) also occurred, as when Robin Lakoff, in her Language and Woman’s Place, declared that ‘my feeling is that this area of pronominal neutralization is both less in need of changing and less open to change than many of the other disparities […], and we should perhaps concentrate our efforts where they will be most fruitful’ (1975: 45).
More recently, the debate has calmed and the issue is apparently no longer in the foreground. However, if the inadequacy of generic he is generally acknowledged (and this is undoubtedly a result of the Great He/She Battle), no agreement has been reached over a satisfying gender-inclusive pronoun, so that the only viable solutions (or, at least, the most often suggested ones) are either (a) to recast the sentence into the plural or (b) to avoid pronouns whenever possible.5 So, for example, the sentence ‘every student must write his essay’ can be more conveniently reformulated as either (a) ‘all students must write their essay’ or (b) ‘every student must write an essay’ (or, analogously, ‘an essay must be written by every/all student(s)’).
Contemporary descriptive grammars and dictionaries of the English language adopt different approaches to the subject. For example, the Longman Grammar of Spoken and Written English (Biber et al. 1999: 316-317) (henceforth LGSWE) gives some account of the debate and mentions two avoidance strategies, namely the ‘[u]se of plural rather than singular forms’ and ‘the use of coordinated pronoun forms’, i.e. s/he, restricting its use to academic writing. With reference to singular they, and without mentioning any alternative pronoun, the authors claim that, due to its violation of ‘prescriptive rules of grammar […], this solution is least likely to be adopted by academic writing, being a register much concerned with correctness’. The Cambridge Grammar of the English Language (Huddleston and Pullum 2002: 492-494) (henceforth CGEL) dedicates more space to the issue, adopting a perspective which is particularly sensitive to arguments against generic he and more optimistic about its achieved and future effects. Besides mentioning two avoidance strategies (i.e. plural reformulation and the use of I and you to refer to the speaker and the addressee), the authors describe a ‘new and very much minority usage’ of a ‘[p]urportedly sex-neutral she’ and state that ‘some writers alternate he and she’. They also mention s/he, used in both formal and informal registers, ‘although regarded as somewhat “clumsy”’. A whole paragraph is dedicated to singular they, ‘very common in informal style’ since Middle English, which ‘has gained greater acceptance in other styles’, particularly with indefinite pronoun antecedents (i.e. everyone, someone, no one), so that – they argue – it is ‘stylistically neutral’. They also mention a reflexive form themself, attested since the 1970s, whose use is ‘likely to increase with the growing acceptance of they as a singular pronoun’. No instances of this usage have been retrieved from the corpora considered in the present analysis.
Dictionaries deal with the topic in different ways too, so that, for example, the Oxford English Dictionary (2nd ed. 1989) (henceforth OED) makes no explicit mention of a generic use, either for he or for she.6 S/he is defined as the written representation of ‘he or she’, used to include both genders, and a second ‘singular’ use of they is mentioned briefly and equated with ‘he or she’. A quite different approach is taken by the Merriam-Webster Online Dictionary (2005) (henceforth M-W); it does mention a generic use both for he and for she (second and third meaning respectively), and also mentions ‘he or she’ and discusses extensively the usage of singular they, which ‘is well established in speech and writing, even in literary and formal contexts’. The special emphasis given to singular they in the most authoritative dictionary of American English is particularly
remarkable, since it has often been stressed that the overseas variety of English is more conservative than the British one, as regards singular they proscription. In his corpus-based study, for example, Cooper (1984: 18) states that ‘British writers and editors are less influenced by the authority of normative grammarians than are American writers and editors’, thus supporting previous scholars, such as Leonard (1929) and McKnight (1925) (cited in Bodine 1975), Pooley (1946), and Bryant (1950). It is as if the description given by ‘America's most trusted authority on the English language’7 is influenced by a need to counterbalance the long-lasting singular they proscription in AmE, given that the Great He/She Battle has seen its major protagonists in the American academic arena.
2.2 Academic prescriptions
As all those in the field know, the academic written genre is particularly subject to editing rules. Academic writers usually conform to the recommendations (i.e. prescriptions) of the most influential style guides in their discipline, and to those adopted by academic journals. Therefore, the use of generic pronouns in academic writing is necessarily influenced by the prescriptions of various style guides on the subject. In this regard, Pauwels (1998) reveals that Australian, British and US guidelines for non-sexist language most frequently cite ‘he or she’, singular they and selective pronoun avoidance as strategies to replace generic masculine. Dates are critical in this kind of study, so that, for example, only six years earlier, Newman (1992: 451) affirmed that singular they ‘is still prohibited by all style guides’. When attempting to interpret the results of a corpus-based study on generic pronouns in academic writing, each source text would ideally be compared with the related style guide in force at the time of publication. It is beyond the scope of the present study to conduct a thorough analysis of changes in style guide prescriptions, but further research on the subject could fruitfully use the data here discussed and compare them with the related style guide recommendations, in order to obtain further insights on what lies behind the changes and the variations attested in the use of generic pronouns.
3. A corpus-based study
3.1 Motivations and aims
Besides arguing against generic masculine and fostering gender-fair alternatives, many of the existing studies support their arguments with empirical evidence (cf. Newman 1992 and 1996 for a detailed review), either drawn from surveys (Bate 1978, MacKay 1980, Bennett-Kastor 1996), or from elicited data (Marcoux 1973; Martyna 1978, 1980; Khosroshahi 1989; Switzer 1990). In addition, some works rely on ad hoc collected corpora focussing on specific text types and regional varieties of English, such as Purnell (1978) on US political speeches and debates,
Cooper (1984) on American newspapers and magazines, Markovitz (1984) and Ehrlich and King (1994) on US university documents, Meyers (1990) on US student writing, Pauwels (1997) on job advertisements in Australian newspapers, Newman (1998) on transcripts of US TV interviews, and Pauwels (2001) on Australian public, non-scripted speech. Holmes (1998) and Pauwels and Winter (2004) rely on data retrieved from existing pre-tagged corpora: the Wellington Corpus of Spoken New Zealand English, and a selection of student and academic texts from the Singapore and Philippine components of the International Corpus of English respectively. In general, corpus-based studies testify to a decreased use of the generic masculine, but there is no univocal trend as far as gender-fair alternatives are concerned. Some works challenge singular they proscription on the basis of empirical evidence of its widespread use: e.g. Abbott (1984), who collects 114 samples from different spoken and written sources; Meyers (1993), who reports over 600 citations of singular they taken from public speeches and published writing; and Newman (1992), who, by analysing data collected from US TV interviews, concludes that ‘singular they is probably the most common epicene pronominal in English’ (p.460), while ‘[a]lternatives to epicene he that have become acceptable to most contemporary prescriptivists, such as ‘he or she’, do not seem to have entered into extemporaneous use’ (p.469). Only Australian English academic spoken discourse seems to favour s/he over singular they, as evidenced by Pauwels (2002).
No extensive analysis appears to have been conducted of academic writing. Hence, the present study focuses on academic written texts in order to (a) ascertain the extent of the effects of the Great He/She Battle on the current use of generic pronouns in academic writing, (b) investigate any possible regional variation in their use, and (c) provide evidence of the rare, yet attested, generic she, whose use has been generally disregarded. Furthermore, observations will be made when the data offer room for comparison between language in use and the grammatical descriptions of English generic pronouns in academic or formal writing.
3.2 The corpora selected
The present study focuses on academic writing for four reasons. Firstly, as mentioned, no extensive corpus-based study has so far been conducted on this text type. Secondly, the written medium needs planning, so it is the most apt to host ‘clumsy’ pronoun forms, such as s/he, and unusual ones, such as generic feminine, whose use is necessarily the result of a conscious choice. The third reason somehow counterbalances the second one, i.e. linguistic prescriptions are more likely to persist and proscriptions are observed much more in academic writing, due to its high level of formality, its submission to editing revisions, and the generally strict editing rules of this genre. Hence, any change in pronoun usage detected in this study can be considered a significant one, for it takes place in a genre which is most concerned with ‘correctness’ (Biber et al. 1999: 317). Finally, the selection of the genre is driven also by a purely pragmatic consideration: academic texts are more likely to
deal with abstract topics, and thus make greater use of generic and hypothesised reference. The analysis has been carried out on several corpora, each selected for a particular purpose. Firstly, in order to verify a possible diachronic change, pre- and post-battle, the so-called ‘Brown Family’ of the ICAME collection are compared (i.e. the British English corpora LOB vs. FLOB and the American English ones Brown vs. Frown). LOB (BrE) and Brown (AmE) contain texts published in 1961, while FLOB (BrE) and Frown (AmE), designed to be entirely comparable to the former, contain texts published in the early 1990s. Therefore, the double comparison can ascertain the extent of the effects of the Great He/She Battle in both British and American English academic writing. (In the absence of a text type labelled ‘academic’ in these corpora, the section labelled ‘learned’ was considered, which is composed of 160,000 words in each sub-corpus.) Secondly, since English use is international, a brief examination of other regional varieties of English is deemed useful. Therefore, six components of the International Corpus of English (ICE) are considered, namely the East-African (ICE-EA), Philippines (ICE-PHI), Singapore (ICE-SIN), Hong Kong (ICE-HK), Indian (ICE-IND) and New Zealand (ICE-NZ) components. These corpora contain texts published in the 1990s, so the data retrieved allow only synchronic investigation. Each academic sub-corpus is composed of 80,000 words, half the size of the learned subsections of FLOB and Frown (though the East African component is doubled, since it is composed of both Kenyan and Tanzanian texts). Finally, much larger corpora have been selected – the American National Corpus (ANC) and British National Corpus (BNC) – in order to make observations in terms of types of antecedents and contexts of use of the most rarely occurring generic pronouns (s/he, singular they, and generic she). 
While the BNC written academic subcorpus is composed of 15.9 million words of texts published between 1975 and 1994, the ANC is still under construction and the extant second release includes 4.3 million words of academic written texts, taken from three sources, namely the Biomed Central online journal, the Public Library of Science (Plos), and the Verbatim journal, published in the 90s (Verbatim) and in the early years of this century (Biomed and Plos). Since Biomed and Plos deal with medicine and biology, while Verbatim deals with linguistics, observations can be drawn in terms of frequency across different domains.
3.3 Method and limitations of the study
Before discussing the findings, it is deemed useful to detail some methodological issues implied in a corpus-based study of generic pronouns and the rationale behind the methodological choices adopted here. Identifying occurrences of generic pronouns out of all pronouns has to be done manually and, while undertaking this very time-consuming task, one realises that there is no way to get the whole picture of generic reference in use. Indeed, as mentioned, for generic reference the following options are available:
1. use of generic he;
2. plural reformulation;
3. pronoun avoidance;
4. use of singular they;
5. use of s/he;
6. use of other pronouns for generic reference (generic she, etc.).
In a corpus, the sentences that are recast into plural (2) and those that avoid pronouns (3) are not retrievable, because no reliable criteria can be devised to identify all possible reformulations. This means that – although never explicitly mentioned in the literature – no corpus-based study on generic pronouns can give quantitative data in terms of relative frequency for all strategies. In other words, one cannot ascertain whether generic he is still the most used strategy for generic reference or whether, for example, most writers choose to recast the sentence into plural for gender-fair reasons. Given these considerations, the present study investigates the only aspects that are ascertainable, i.e. the occurrences of generic pronouns with singular noun antecedents (options 1, 4, 5 and 6 above). All inflected cases for each pronoun have been retrieved using a software tool for corpus-analysis. The tens of thousands of occurrences have then been screened manually by singling out those in which pronouns were either anaphoric or cataphoric to a generic antecedent. Occurrences with no antecedent8 were excluded, as were gender-specific antecedents used to express a generic referent (such as man, chairman, housewife), with the assumption that this latter case should be considered when investigating generic nouns (rather than pronouns). For singular they, only antecedents with indefinite adjectives and pronouns – i.e. every*, each, any*, some*, no* – were elicited, this being the case for which the proscription has been mostly questioned by the literature (cf. the CGEL quotation in 2.1). Admittedly, the limitation of the range of antecedents is motivated by the very high frequency of they in all corpora9, which would have made data-screening a much longer task. 
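As a rough illustration of the retrieval step just described, the following Python sketch flags candidate sentences in which an indefinite antecedent (every*, each, any*, some*, no*) is followed by a form of they. The regular expressions, the function name and the test sentences are assumptions made for illustration only; the study does not name the tool it used, and every hit would still require the manual screening described in the text (a pattern like this also matches plural phrases such as 'some students').

```python
import re

# Indefinite antecedents of the kind searched for: every*, each, any*, some*, no*.
# The exact word lists are an illustrative assumption, not the study's query.
ANTECEDENT = re.compile(
    r"\b(?:(?:every|any|some|no)(?:one|body)|(?:every|each|any|some|no)\s+\w+)\b",
    re.IGNORECASE,
)

# Inflected forms of they.
THEY = re.compile(r"\b(?:they|them|their|theirs|themselves|themself)\b", re.IGNORECASE)


def candidate_singular_they(sentence):
    """Return True if a they-form occurs after an indefinite antecedent."""
    antecedent = ANTECEDENT.search(sentence)
    if antecedent is None:
        return False
    # Only look for the pronoun to the right of the antecedent (anaphoric use).
    return THEY.search(sentence, antecedent.end()) is not None


print(candidate_singular_they("No one felt that they had been misled."))
```

The sketch only covers anaphoric uses within a single sentence; cataphoric uses and antecedents in a preceding sentence would need a wider search window.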
Nevertheless, in order to mitigate this inconsistency in search criteria, a preliminary test screening has been carried out on a random sample of 500 occurrences for each set of corpora, which gave no evidence of singular they with antecedents other than every*, each, any*, some*, no* (+ noun). One last methodological caveat concerning the counting of the occurrences needs to be mentioned. Texts can vary in pronominal density and it is debatable whether all pronoun instances should be considered as different tokens or whether all pronouns referring to the same antecedent should count for one token only. All previous studies have adopted the first solution (i.e. one pronoun occurrence = one token), without discussing the rationale for this decision. In spite of the obvious advantage of eliminating the variable ‘density’ from the data, the second method (i.e. all co-referent pronouns = one token) is not straightforwardly viable, mainly for three not easily resolvable reasons. Firstly, in most cases one cannot determine whether pronoun occurrences refer to the same
antecedent or not, the referent being generic and thus, by definition, unidentifiable. The following two excerpts can better explain this thorny issue:
(1) Thus, the LAD needs to contribute enough (but no more than enough) innate knowledge for the child to learn the grammar of a language from the utterances which she hears in the first four or five years of life.
(2) this approach also implies that not all of a child's understanding of a particular experience may be expressed in language, and that a child may intend to express more than she is actually able to encode formally in language structures.
Generic she in (2) occurs in the same text more than 1,000 words after (1), yet both occurrences have the same antecedent ‘a/the child’. Ultimately, the referent being by definition generic, it is not possible to determine whether she in (1) and in (2) refer to the same child or not. Secondly, this counting method would not account for cases of unique reference expressed with different lexical antecedents, as with ‘the baby’ and ‘the child’ in (3):
(3) In a second phase the baby's unified perceptions are shattered and she begins to understand herself as a subject disturbingly distant from the external world. In the third and highest phase of development the child understands the way in which she differs from and is interdependent with the outside world, and once
Finally, this counting method would not be applicable when texts alternate he and she in co-occurrence with the same antecedent. All these issues considered, in the present study it is deemed wiser to follow the literature and consider all pronoun instances as separate tokens, and highlight possible cases of high pronoun density in single texts (see 3.4.3). Ultimately, the ‘density’ variable is valid for all pronouns (he, they, s/he, or she) so that one can predict with reasonable confidence that density variation affects in a relatively even way the data of all pronouns considered.
3.4 Data analysis
3.4.1 Diachronic analysis: the Brown Family
The analysis in this section compares the data drawn from the 1961 texts of LOB (BrE) and Brown (AmE) and the early 1990s texts of FLOB (BrE) and Frown (AmE). By analysing the data diachronically, it becomes immediately evident that the use of generic pronouns has changed considerably.
BrE and AmE behave analogously, as shown in Fig. 1 and 2: generic he is considerably more frequent in the pre-battle period and its use is greatly decreased in the post-battle period (particularly in AmE); gender-inclusive s/he (Fig. 2) is virtually absent in the 60s texts, while it does occur in the 90s texts (more frequently in BrE than in AmE).
Fig. 1. Generic he in the Brown Family (occurrences per million words).
Fig. 2. Generic s/he in the Brown Family (occurrences per million words).
Also, singular they does not occur in the 60s, while the 90s texts record two and one instances for BrE and AmE respectively. This is, however, not statistically significant (thus it is not represented here in a dedicated graph) and should possibly be verified on a larger dataset.
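The figures are normalised to occurrences per million words. As a minimal sketch of that conversion (the function and constant names are assumptions; the 160,000-word size of each 'learned' section and the raw counts of two and one instances are taken from the text), the singular they counts for the 1990s corpora can be converted as follows:

```python
def per_million(raw_count, corpus_words):
    """Normalise a raw occurrence count to occurrences per million words."""
    return raw_count * 1_000_000 / corpus_words


# Size of each 'learned' section of the Brown Family corpora, as stated in 3.2.
LEARNED_SECTION_WORDS = 160_000

flob_they = per_million(2, LEARNED_SECTION_WORDS)   # two instances in FLOB (BrE)
frown_they = per_million(1, LEARNED_SECTION_WORDS)  # one instance in Frown (AmE)
print(flob_they, frown_they)
```

At this corpus size a single occurrence already corresponds to 6.25 occurrences per million words, which is one way of seeing why counts this small cannot support statistical claims.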
Finally, although again not statistically significant in number (in each of the four corpora less than five instances of generic she were retrieved), the occurrences of she with antecedents denoting gender-indefinite entities represent an interesting finding. In the pre-battle texts ‘generic she’ is attested only with antecedents denoting stereotyped female-roles, e.g. ‘the shopper’ (one occurrence in Brown) and ‘the teacher’ (one occ. in Brown and two in LOB). In turn, in the post-battle texts, the range of antecedents widens (e.g. ‘the user’, ‘the child’ in Frown; ‘anybody’, ‘the voter’, ‘the reader’, ‘the narrator’ in FLOB) and stereotyped ones disappear. Therefore, although, as said, the data cannot lead to generalisations, the more recent texts of the Brown Family exhibit a (rare) use of generic feminine only in nonstereotyped contexts, while the 1960s texts record an exclusive (and equally rare) use of generic she with stereotyped female role antecedents.
Fig. 3. Generic pronouns in the Brown Family (occurrences per million words).
In sum, judging from these data, generic pronoun use seems to have changed in the text type that adheres most closely to prescriptions, i.e. academic writing (arguably, because style guide prescriptions have changed as well, see 2.2). In spite of the many guidelines for sexist-free language currently in circulation, generic he is still by far the preferred pronoun for singular generic reference (as Fig. 3 clearly highlights) and singular they proscription still seems to affect academic writers and editors (perhaps both CGEL and M-W are too optimistic as regards its use in formal writing, cf. 2.1). However, particularly remarkable is the substantial decrease in the use of generic he (58% in AmE and 35% in BrE) and the introduction and use of s/he (in BrE twice as much as in AmE). Furthermore, the gap resulting from the considerable decrease in generic he in AmE is not adequately filled by the increased occurrence of other generic pronouns. This may indicate that, although unquantifiable, strategies of singular pronoun avoidance have also increased in the post-battle period. Indeed, in the 90s BrE texts, generic pronoun occurrences have diminished by only 4% (487.5 occ. in FLOB against 506.25 in LOB), while in AmE the decrease reaches 47% (293.7 in Frown against 550 in Brown). This gap can reasonably be attributed to an increase in plural reformulations and pronoun avoidance strategies, which seem to be preferred in AmE, while BrE rather compensates for the decreased use of he by employing s/he more extensively.
3.4.2 Regional variation: the ICE components
Table 1 details the occurrences retrieved in the six components of the International Corpus of English. The Hong Kong component is the least prone to use generic masculine, while the New Zealand one makes the largest use, immediately followed by Singapore and Philippines (a selection of texts of the two latter components has already been subject to investigation by Pauwels and Winter 2004, see 3.1). On the other hand, East African English makes a considerable use of s/he (268.75 occ. per million words), followed by Philippines (137.5) and Hong Kong English (125), while New Zealand and Singapore English record fewer than 100 occurrences per million words (87.5 and 50 respectively) and Indian English makes the least use (12.5).
Singular they and generic she are almost absent in the ICE components: generic she occurs only once in ICE-EA, with employee as antecedent; singular they occurs twice in ICE-NZ, and once in ICE-EA and ICE-HK. Finally, ICE-EA and ICE-HK are the varieties which record the smallest ratio between he and s/he, which may be indicative of a higher consideration of gender-fair pronoun use in these varieties.
Table 1: Generic pronouns in the six components of ICE (occurrences per million words)
Component   he        s/he     singular they   she    Total
ICE-EA       543.75   268.75    6.25           6.25    825.00
ICE-HK       362.50   125.00   12.50           0       500.00
ICE-IND      737.50    12.50    0              0       750.00
ICE-NZ      1400.00    87.50   25.00           0      1512.50
ICE-PHI     1075.00   137.50    0              0      1212.50
ICE-SIN     1225.00    50.00    0              0      1275.00
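The ratio between he and s/he mentioned above can be recomputed directly from Table 1. The following Python sketch ranks the six components by that ratio; the variable and function names are illustrative assumptions, while the figures are those of Table 1.

```python
# Occurrences per million words from Table 1: (he, s/he).
ICE = {
    "ICE-EA": (543.75, 268.75),
    "ICE-HK": (362.50, 125.00),
    "ICE-IND": (737.50, 12.50),
    "ICE-NZ": (1400.00, 87.50),
    "ICE-PHI": (1075.00, 137.50),
    "ICE-SIN": (1225.00, 50.00),
}


def he_to_s_he_ratio(he, s_he):
    """Number of generic he occurrences for each occurrence of s/he."""
    return he / s_he


# Components ordered from the smallest to the largest ratio.
ranked = sorted(ICE, key=lambda name: he_to_s_he_ratio(*ICE[name]))
for name in ranked:
    print(f"{name}: {he_to_s_he_ratio(*ICE[name]):.2f}")
```

On these figures ICE-EA and ICE-HK come out with the lowest ratios (roughly two and three occurrences of he for each s/he), and ICE-IND with the highest (59).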
Although this diverse distribution of occurrences does not seem to indicate any clear trend, an interesting result can be obtained by comparing the occurrences of generic
he in these texts with those in their coeval texts in the Brown Family. Indeed, Fig. 4 shows that generic he is generally less frequent in BrE and AmE than in the other varieties, with Hong Kong English positioned somewhere in between this divide in distribution. Without further sociolinguistic investigation, no complete explanation can be given for this variation. Perhaps it is due to a varying influence of the Great He/She Battle on these varieties (with consequences in terms of different prescriptive policies in local style guides). It is also possible that the results are skewed by different text composition in the two sets of corpora. Nevertheless, it is deemed useful here to make the data available for future research (also in consideration of the time-consuming task involved in the screening).
Fig. 4. Generic he in Frown, FLOB and ICE (occurrences per million words).
3.4.3 The larger corpora: ANC and BNC
Let us now turn to the data retrieved from larger corpora, starting with the American National Corpus, the composition of which offers the chance to specify another dimension of comparison, namely the domain. Indeed, the data retrieved from Biomed and Plos, which deal with ‘technical’ matters (i.e. medicine and biology), show a slightly more frequent use of gender-inclusive pronouns (s/he, 37 occ., and singular they, 10) than generic he (33). In all the corpora considered, this is the only instance where a gender-fair pronoun occurs more frequently than generic he. The data retrieved from Verbatim (linguistics) confirm a clear preference for generic he (425 occ., compared to 10 and 12 of s/he and singular they respectively). Notwithstanding the ‘under construction’ status of the corpus, judging from these very limited sources, it seems that generic pronoun use varies across different domains of academic writing. Fig. 5 compares ANC and BNC data. Admittedly, in the BNC, the data for generic he have not been fully retrieved since the task would have required several more months of data screening (he/him/his/himself record 82,125 occurrences in the academic written section of the BNC). However, in order to obtain indicative results, the screening was conducted on a sample of 952 occurrences (by retaining the first and last occurrences of each text file)10 and then projected onto the whole corpus. Hence Fig. 5 shows the actual data for all pronouns except generic he, whose count is extrapolated from a tiny sample of all the occurrences. This methodology may explain the apparently anomalous disproportion of he in the BNC, so that no sound observations can be made in detail, apart from the very obvious one that in these two larger corpora, generic he is used more extensively than the other generic pronouns, as was the case in the Brown Family and in the ICE components. Although the sampling of occurrences of he does not allow us to discuss in depth the range of antecedents, when possible, reference will be made when discussing the data retrieved for s/he.
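The projection of a screened sample onto the whole set of occurrences can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the number of generic instances found in the sample (300 below) is invented for the example because the text does not report it, and the interval assumes simple random sampling, which a sample made of the first and last occurrences of each text file is not.

```python
import math


def project_sample(generic_in_sample, sample_size, total_occurrences, z=1.96):
    """Project the share of generic uses in a screened sample onto all occurrences.

    Returns (estimate, lower, upper), where the bounds come from a
    normal-approximation interval around the sample proportion.
    """
    p = generic_in_sample / sample_size
    se = math.sqrt(p * (1 - p) / sample_size)
    lower = max(0.0, p - z * se)
    upper = min(1.0, p + z * se)
    return p * total_occurrences, lower * total_occurrences, upper * total_occurrences


def per_million(count, corpus_words):
    """Normalise a count to occurrences per million words."""
    return count * 1_000_000 / corpus_words


SAMPLE_SIZE = 952                 # screened occurrences (from the text)
TOTAL_HE_FORMS = 82_125           # he/him/his/himself in BNC academic writing (from the text)
BNC_ACADEMIC_WORDS = 15_900_000   # size of the BNC academic subcorpus (from the text)
GENERIC_IN_SAMPLE = 300           # HYPOTHETICAL value, not reported in the study

estimate, low, high = project_sample(GENERIC_IN_SAMPLE, SAMPLE_SIZE, TOTAL_HE_FORMS)
for value in (low, estimate, high):
    print(round(per_million(value, BNC_ACADEMIC_WORDS), 1))
```

The width of the resulting interval gives a rough sense of the uncertainty introduced by working from a sample of a little over 1% of the occurrences.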
Fig. 5. Generic pronouns in the ANC and BNC (occurrences per million words; data for he extrapolated from a sample of 952 occ.).
Setting aside generic he, interesting observations can be made on the use of the other generic pronouns in the two corpora. The comparison confirms the different behaviour of AmE and BrE noticed in the Brown Family, with the former less prone to use s/he (and, therefore, arguably, more prone to use plural reformulations and pronoun avoidance strategies) than the latter. Furthermore, while BrE makes a larger use of s/he and of generic she than AmE, singular they is equally (in)frequent in the two varieties. This, again, confirms on much larger corpora what was observed in FLOB and Frown, but, interestingly enough, does not confirm what is found in the
literature, i.e. that American (written) English is more conservative as regards singular they proscription (cf. 2.1). As a matter of fact, singular they is the least frequent pronoun in both corpora (even less frequent than generic she). Once again, unlike in spoken English (Newman 1992, see 3.1), the use of singular they seems to be deterred in academic writing, which favours the more ‘clumsy’ s/he, as appropriately described in LGSWE (see section 2.2). The virtual absence of singular they in the other varieties of English (3.4.2) leads to generalisations about the persistence of singular they proscription in the whole of English academic writing. In future research, it would be interesting to extend the investigation to the academic spoken sections of these corpora. This could verify the findings of Pauwels (2002), a study of Australian academic speech that evidences a preference for s/he (against the already mentioned widespread use of singular they in the spoken mode in general). Should this be confirmed, it would indicate that singular they proscription is characteristic of English academic discourse as a whole. The relatively high number of occurrences recorded for the least frequent generic pronouns allows us to make some observations on the contexts of occurrence. S/he: While s/he occurs rarely in the ANC, 70% of the texts in the BNC academic subcorpus record at least one use of s/he, for a total of 1819 occurrences. None of the texts ever records more than 2.4% of the occurrences, hence the data are not skewed by any particular pronoun density in one text or by one author’s preference. Only 10 of these texts were published before 1980, so that – as indicated by the data of the Brown Family – s/he appears in academic writing largely in the post-battle period. 
S/he is mostly used in politics, law and education, in social sciences and in humanities and arts, while the so-called ‘hard sciences’ (technical engineering and natural sciences, as well as medicine) employ it less frequently. (It must be remembered, though, that pronouns are generally rarer in hard science writings, and hence no definite conclusion can be made in this regard.) As far as inflected cases and variants are concerned, the possessive occurs most frequently (826 occ.), immediately followed by the nominative (766), while the dative and the reflexive cases are much rarer (170 and 57 occ. respectively). In all inflected cases, the most frequent variant is the coordinated MALE OR FEMALE, i.e. ‘his or her’ (665 occ.), ‘he or she’ (554), ‘him or her’ (124), ‘himself or herself’ (32), followed by the slashed MALE/FEMALE ones. Of these, the variants which prioritise the female element are generally much rarer: s/he (82), ‘her or his’ (11), ‘her or him’ (3), her/himself (2), etc. 151 different antecedents are in co-reference with s/he in the BNC (and 30 in the ANC). Person is the most frequent one (10%), followed by child (8%), individual (6%) and student (5%). Incidentally, these are also the four most frequent antecedents of the sample data retrieved for generic he. Although the data are partial because restricted to three sources, person is the most frequent antecedent of both s/he and he in the ANC, followed by patient with s/he, and by someone/body and,
interestingly, by one with he, although the latter is occasionally mentioned in the literature as a possible alternative to gender-biased reference. Immediately after these four ‘neutral’ antecedents, also very frequent are antecedents which refer to stereotyped male roles, such as (chief executive/police) officer (3.16%) and researcher (2.81%). Other (less frequent) stereotyped male instances include attorney, criminal, director, doctor, employer, head, historian, intellectual, judge, lawyer, leader, manager, master, monarch, (prime) minister, paedophiliac, politician, prisoner, sergeant, Secretary of State (which, however, occurs more frequently in the sample data of generic he), sadist, scientist, solicitor, sociolinguist, sociologist, theorist. Stereotyped male role antecedents in co-reference with s/he are also well attested in the ANC (e.g. director, professor and physician), so that it may be confidently affirmed that the battle against sexist pronoun use has definitely influenced academic writing (and its prescriptions). Among the ‘neutral’ generic antecedents, all discourse roles are attested, such as author, hearer, novelist, reader, receiver, sender, speaker, (story-)teller, writer, etc. The traditional stereotyped female role of teacher ranks seventh (2.46%), while it ranks 12th in the sample data of generic he (and cf. its position as second most frequent antecedent of generic she). Strangely enough, the only other stereotyped female role attested is shopper and no occurrence of stereotyped female role antecedents is attested in the ANC. It would be interesting to investigate the possible reasons for the rarity of this type of antecedent (which are not frequent with generic she either) and whether in these cases, for example, other strategies (e.g. pluralisation or pronoun avoidance) are preferred, or whether female-specific words are used (e.g. 
housemaid instead of houseworker) or, more generally, if stereotyped female roles are less dealt with in academic writing.11 Interestingly, generic s/he is also used once in co-reference with a male-specific antecedent, ombudsman, and, in a few cases, it is used inconsistently, occurring in a co-reference chain, either with generic he (in co-reference with doctor) or with generic she (with historian):
(4) The requirements of the hermeneutic historian are rather different. She does not principally want to know what the data signify in themselves […]. She is not interested in data as entities to be modelled […]. He or she will be interested therefore in the products of information technology as a means to communication rather than as a store of data.
Furthermore, in both corpora, cases have been retrieved in which the antecedent is plural, as in:
(5) the participants’ concentration is not on whether he or she is signalling an emotion but with getting on with solving whatever problem is to hand
These occurrences are interesting because their number disagreement is analogous (yet opposite in the singular/plural relation) to that of singular they, which is the reason that the latter is proscribed; indeed, with singular they the antecedent is grammatically singular while the pronoun is (primarily) plural. Conversely, here with ‘he or she’, the antecedent is plural while the pronoun is singular. These occurrences testify once more to the preference of s/he over singular they in academic writing. Singular They: As said, confirming the trends evidenced in the smaller corpora, singular they is rarely used in either the ANC or BNC. However, 63 academic texts in the BNC record at least one occurrence of singular they, whose frequency among text types exhibits the same trend of s/he. Among the antecedents in the ANC, every(one/body) (7 occ.) and each (6) are the most frequent ones, followed by some(one/body) (5) and any(one/body) and no(one/body) (2 occ. each). The situation is slightly different in the BNC, where someone/somebody is the most frequent antecedent (37 occ.), followed by every(one/body) (28 occ.), any(one/body) (14), and each (13 occ.), while only five instances are attested in co-reference with no(one/body).12 Given that someone/body is semantically singular, its high frequency in the BNC is interesting, since its occurrence in co-reference with they disregards number agreement in a more explicit way than everyone/body and each (which are semantically plural). Furthermore, the overall low frequency of singular they in co-reference with no(one/body) seems to rule out academic writing from the description in CGEL, according to which, singular they ‘use in examples like “no one felt that they had been misled” is so widespread that it can probably be regarded as stylistically neutral’ (Huddleston & Pullum 2002: 493). Both in the ANC and BNC, the antecedent is most frequently an indefinite pronoun. 
When an indefinite adjective occurs, the noun phrase is governed by child and person (4 occ. each), student, individual and patient (2 occ.). Other nouns, such as pupil, adult contributor, solicitor and twin record one occurrence each. Finally, singular they occurs mainly in cases other than the possessive; indeed, their never occurs in the BNC, while it records six tokens in the ANC (with each as its most frequent antecedent). This seems to confirm the greater extent of perceived incorrectness of the possessive, being generally at a shorter distance from its antecedent, as highlighted in the literature (cf. McConnel-Ginet 1979, Jochnowitz 1982, Mühlhäusler and Harré 1991, Newman 1998). Generic She: In the BNC, 37 texts (more than 7% of texts in the corpus), all written after 1980, attest at least one instance of she whose antecedent denotes a gender-indefinite entity, for a total of 542 occurrences. It is useful to focus more extensively on the analysis of this pronoun here, since other studies have generally disregarded it. The range of antecedents of generic she retrieved from the BNC can be divided in three groups:
a. antecedents denoting stereotyped female roles
b. ‘neutral’ antecedents used in texts whose topic refers to a stereotyped female role
c. neutral antecedents used in neutral contexts
Antecedents denoting stereotyped female roles (a) include teacher (34 occ.), nurse (10) and secretary (2). It is true that some other terms might be identified in the same way (e.g. health visitor, 2; health worker, 1; maybe also lover, 5; and tutor, 1), and teacher may not nowadays be perceived as a female role. However, the identification is highly subjective and further sociolinguistic investigation is required. Therefore, the literature (for example, Hellinger 2001: 108) will be followed for this type of antecedent. 47 ‘stereotyped’ generic occurrences were retrieved in a total of 15 texts (one text published in 1985, one in 1986, two each in 1987, 1988 and 1990, four in 1992 and three in 1993). One must be cautious in drawing conclusions from these numbers and inferring that these texts use sexist (or stereotyped) generic reference. Indeed, it must be noted that only four of these texts record she exclusively with stereotyped antecedents, while the other 11 also record authentic generic antecedents, such as speaker, researcher, adult and child. These are texts which often alternate he and she for generic reference. Furthermore, one text is on the topic of feminism (Feminism and Linguistic Theory, Cameron 1992) and the stereotyped antecedent secretary occurs in a context where the author discusses gender differences in the labour market. As for the antecedents in (b), apparently generic antecedents (such as student, tutor, victim, patient, sufferer) and indefinite pronouns (everyone, someone) occur in co-reference with she in contexts dealing with ‘nursing’, ‘sex crime’ and ‘rape’, and ‘anorexia nervosa’ (in this latter case, the first time the pronoun she is used, the text specifically declares ‘for it is more often a woman’). 
These texts (six in total) account for 231 occurrences, 203 of which (almost half of the occurrences of generic she) are in text B33, Teaching Clinical Nursing (published in 1986), which employs exclusively generic she (meaning that any man wanting to become a nurse would probably feel excluded at that time). Once the list of antecedents has been screened for these ‘false’ neutral generic cases, the most frequently recurring antecedents, apart from the stereotyped female teacher (which ranks second), are generally the ‘true’ neutral ones. Instances of authentic generic she have been retrieved in 264 occurrences distributed across 49 texts. The semantic field referring to young human beings is the most frequent (child (67), infant (10), baby (10), 10-year-old (1), and pupil (1)), and is often found in texts alternating generic feminine and masculine pronouns for the same generic reference. Other (less) recurrent antecedents are driver (12), patient (12), person (10), individual (7), parent (4), someone (4), adult (3), colleague (3), student (3), and the data attest a quite remarkable presence of antecedents denoting clearly nonstereotyped female roles, if not stereotyped male roles, such as leader (5), researcher (5), criminal (4), historian (2), proprietor (2), sociologist (2), analyst (1), doctor (1), expert (1), and, maybe also thinker (1). Curiously enough, the very
frequent active role of speaker (25 occ.) is more often in co-reference with generic she than the passive one of listener (5 occ., plus 2, for hearer); the same is true for sender (5) vs. addressee (1). In turn, the ‘passive’ reader (2) and the ‘active’ writer (1 occ., plus 1, as novelist) are equally frequent. One instance of generic she is in a somehow contradictory co-reference with the gender-fair houseworker (instead of housewife or housemaid). Finally, two occurrences of monarch and one of crown, although occurring in abstract contexts, i.e. laws, acts and regulations, and thus not referring to any particular living queen, can be motivated by the female sex of the current British monarch. As anticipated, these authentic generic antecedents of she occur in a total of 49 texts, with a peak of 13 publications in 1992 (fig. 6). Nevertheless, these authentic generic she antecedents occur also in relatively old texts (two in 1981, two in 1983 and one in 1984), published in years in which stereotyped female role antecedents (or topics) are not attested. Further research is required to verify whether, in recent years, there is a move towards increased use of generic she in academic writing. As for the distribution of these authentic generics across texts, only one records a particular high density of occurrences (Early Language Development, which records 60 occ., equal to 22.73% of all occurrences; the topic of the text explains the high frequency of antecedents denoting young human beings). Three texts contain more than 5% of all occurrences (the topic of one of these makes manifest its frequent use of she, i.e. Feminist Perspectives in Philosophy), while none of the other 45 texts ever reaches 5%. Therefore, apart from these few cases, the distribution is not particularly skewed by any given text (or author’s preferences).
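The check for skew by pronoun density amounts to computing each text's share of all occurrences. In the sketch below the occurrence figures are those reported in the text for generic she in the BNC; the function names and the 5% threshold parameter are illustrative assumptions.

```python
def share(occurrences_in_text, total_occurrences):
    """Percentage of all occurrences contributed by a single text."""
    return occurrences_in_text * 100 / total_occurrences


def dense_texts(counts_by_text, threshold_pct=5.0):
    """Return the texts whose share of all occurrences exceeds the threshold."""
    total = sum(counts_by_text.values())
    return [name for name, n in counts_by_text.items() if share(n, total) > threshold_pct]


# Occurrences of generic she reported for the BNC academic subcorpus.
STEREOTYPED_ROLE = 47    # antecedents denoting stereotyped female roles
STEREOTYPED_TOPIC = 231  # 'neutral' antecedents in texts on stereotyped female topics
AUTHENTIC = 264          # neutral antecedents in neutral contexts

print(STEREOTYPED_ROLE + STEREOTYPED_TOPIC + AUTHENTIC)
print(round(share(60, AUTHENTIC), 2))  # Early Language Development: 60 occurrences
```

The three groups add up to the 542 occurrences reported overall, and 60 out of 264 gives the 22.73% quoted for Early Language Development. Given per-text counts, dense_texts would list the texts above the chosen threshold.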
Fig. 6. Distribution of texts across year of publication for stereotyped generic she (GS), stereotyped topic (GST) and ‘authentic’ generic she (GI) in the BNC.
As far as sub-genre is concerned, texts dealing with medicine and social science make the largest use of generic she, although, when data are screened to exclude the texts dealing with stereotyped female topics, medicine turns out to make a very rare use of generic she in neutral contexts. Finally, technical engineering records only one occurrence, while the only BNC academic genre where generic she is totally absent is natural science. The so-called ‘hard-sciences’ appear once again less prone to generic pronoun use.
Unfortunately, since 42% of BNC academic texts are unspecified as regards the sex of their author, no investigation can be made as to whether there is a significant variation in the choice of pronoun according to the gender of the writer. Given the smaller size of the corpus, the ANC records a limited number of generic she occurrences (52), four of which have a stereotyped female antecedent, i.e. secretary (3 occ.) and cook (1, referring to the role played in family contexts, not the profession). The most frequent generic antecedents are person (3) and someone (3); other nouns include neighbour, speaker, child, baker, and operator (found in six different volumes of Verbatim journal, although some occur in the same article), health advisor, resident, biochemist, and therapist (found in separate texts in Biomed and Plos journals). Interestingly, the ANC attests she once in co-reference with an apparently male antecedent (father) in the context of gender reassignment, and so does the BNC with the antecedent transvestite; the former is not an instance of generic reference (the father who is talked about is a definite one), while the latter, which occurs three times in the same text, is indeed used generically. They are worth mentioning, since the literature (Newman 1998) has evidenced an even more complicated situation in terms of generic pronoun usage as transgender topics are increasingly dealt with in texts. In sum, considering the relatively rare frequency of stereotyped female role antecedents in favour of a widespread co-occurrence of generic she with authentic generic antecedents, and considering also that both in the BNC and ANC generic she is even more frequent than singular they, it seems that both CGEL and M-W are particularly accurate in mentioning a generic feminine pronoun use. The same cannot be said of LGSWE and OED, which mention singular they but give no account of generic she¹³, which, nonetheless, seems to be in use, at least in academic writing.
3.4.4 (Tentative) pragmatic observations
In terms of the rhetorical and pragmatic effects of the different generic pronouns, it is possible to predict with confidence that, in encountering a s/he or generic she in a given text, the reader is conditioned to assume conscious gender-fair behaviour on the part of the writer. The same may not be as true of singular they, the supposed incorrectness of which might be perceived differently by different readers and in different contexts (the literature on the subject has discussed extensively a weaker perception of incorrectness when singular they co-occurs at a relatively large distance away from its antecedent, cf. McConnell-Ginet 1979, Jochnowitz 1982, Mühlhäusler and Harré 1991, Newman 1998). Putting aside the writer’s intentions, comment must be made on the image of the referent that the writer’s use of a certain pronoun prompts in the mind of the reader. Extensive surveys and experiments have been conducted on the issue and they are referred to here in a discussion of the different perspectives given to the utterance when different pronouns are used. In excerpt (6) there are two instances of generic he:
(6) A person's knowledge includes knowledge which he might reasonably have been expected to acquire from facts observable and ascertainable by him
Due to its well-established use, generic he does not automatically lead the reader to build a mental image of the referent, and, when it does, the image is culture determined, i.e. a male as the prototype (as evidenced by the results in Bennett-Kastor 1996; Kidd 1973; Martyna 1978, 1980, 1983; Khosroshahi 1989; Gastil 1990). This cultural determination is further supported by linguistic cues, i.e. the primary gender-specific meaning of he. Similarly, by virtue of its gender-irrelevant plural meaning, singular they in (7) does not lead to a concrete mental image of the referent (cf. the results in Bennett-Kastor 1996). When it does, the image is probably culture determined, i.e. males as the prototype, yet it is not further supported by any linguistic cues in the generic pronoun (here the sex of the reader may also be relevant; cf. Switzer 1990, which found more male-oriented imagery generated by young boys and more inclusive imagery by young girls).
(7) This does not necessarily mean letting everyone do exactly what they want to do
(8) When a speaker uses because in normal conversation, it is likely that her intention is to communicate information about causal direction
Conversely, being less traditionally used as a gender-neutral (or better, gender-inclusive) pronoun, she, as exemplified in (8), univocally denotes a female human entity; hence its use for generic reference leads to an unusual mental image, i.e. a female human taken as an instance of a generic category (cf. also the results in Adamski 1976, Bennett-Kastor 1996, Khosroshahi 1989 and Newman 1998). Generic he leads one to infer a male-oriented inclusiveness (male as hyponym of ‘male + female’), and singular they is definitely generic (gender is irrelevant, it is ‘left out of the picture’), both giving a sense of universality (‘valid for all cases’). The use of generic she, on the other hand, gives the sentence the function of an exemplification (‘valid for one and so for each case, valid for any case’), thus stressing the difference and the compositeness of the referent. Each of these generic pronouns gives the utterance a peculiar perspective: gender-neutral singular they de-genderises the human referent and enhances the level of abstraction of the content, while generic she (as well as inclusive s/he) favours a concrete image of the referent, emphasising ‘gender’ over ‘genericity’ and favouring a perspective of exemplification which emphasises the particular over the general. Generic or, better, ‘exemplifying’ she surprises the reader, forcing her to think about (and imagine) what she is reading. In this way, the use of singular they (together with other strategies of generic pronoun avoidance) might be more apt in contexts where gender specification and personification are not needed, in abstract and general rules
for example (cf. the ‘loss of vividness that follows the lowering of individuation’ in Newman 1998: 373). Conversely, alternating generic ‘he and she’ or using s/he might be useful when exemplification is needed. These are, of course, just hypotheses drawn from inferences on the results of previous studies, as well as from the author’s subjective impressions generated by reading and analysing all the occurrences retrieved; they are to be taken as mere suggestions so that further investigation can verify them empirically.
4. Conclusions
After briefly reviewing the long and troubled history and the long-standing debate over prescribed, proscribed, and disregarded uses of generic pronouns, the present study has investigated several pre-tagged corpora of English academic writing, each selected with a view to examining a different facet of generic pronoun usage. The analysis of the data retrieved from the Brown Family has illustrated that the Great He/She Battle has certainly influenced both British and American English academic writing; indeed, in the post-battle texts, the frequency of generic he is reduced and s/he has been used more widely. Furthermore, an ‘authentic’ generic she has eventually made its appearance, while the very few occurrences attested for singular they suggest that its proscription still has some influence on academic writers and editors (unlike in other genres). Though generic he is still used much more extensively than its gender-fair alternatives, the analysis cannot account for the whole phenomenon of generic reference since plural reformulation and strategies of pronoun avoidance cannot be estimated. In this regard, the diachronic analysis has allowed some inferences to be drawn. Indeed, unlike in BrE, the decrease of generic he in AmE is not counterbalanced by an equivalent increase in other generic pronouns, hence it is arguable that, while generic he reduction in BrE is compensated by s/he, in AmE strategies of singular generic pronoun avoidance may be preferred. Although the data retrieved from the ICE components do not signal any clear trend in terms of geographical variation, in all these corpora generic he is attested more extensively than in their contemporary counterparts of the Brown Family. (One must, however, be cautious in concluding that British and American English employ gender-fair pronoun alternatives more extensively than other varieties of English, since the comparison might be skewed by the different text composition of the two sets of corpora.)
Finally, the analysis of the larger corpora has generally confirmed the trends evidenced in the smaller ones, including a preference for s/he in BrE over AmE. In the ANC, texts dealing with medicine and biology make even more frequent use of gender-fair alternatives to generic he, while in texts dealing with linguistics, generic he is by far the preferred choice. Interestingly enough, the ANC and BNC compared, the data for singular they do not signal any different behaviour between British and American academic writing, both being quite compliant with its proscription. In all corpora, the data for both s/he and singular they confirm that the former is much
more used in academic writing than the latter, while the data retrieved especially in the BNC for generic she clearly show that its use is not to be disregarded and, judging from the type of antecedents attested, it is far from being confined to stereotyped female roles. In sum, the Great He/She Battle has definitely achieved considerable results even in the genre which is most concerned with ‘correctness’. Contrary to Lakoff’s (1975) prediction (2.1), it seems that linguistic ‘battles’ can indeed influence even those aspects of language which seem more resistant to change, like the pronominal system. The results testify to a great variability in the use of generic pronouns even in such a rigid genre as academic writing, so that, in spite of some persistence of singular they proscription in this genre, the concluding remark of the entry for they in the M-W dictionary seems to be particularly useful: ‘This gives you the option of using the plural pronouns where you think they sound best, and of using the singular pronouns (as he, she, ‘he or she’, and their inflected forms) where you think they sound best’. In this regard, although empirical investigation is needed, it is arguable that the different rhetorical and pragmatic effects given by the various generic pronouns make each of them suitable for different contexts. Singular they is more suitable in contexts where sex, personification and visual imagery are not relevant (i.e. general rules), while the rare generic she, which we would call ‘exemplifying’ she, catches the reader’s attention, forcing her to visualise a concrete and particular reality; hence alternating he and she (or using s/he) may be particularly apt when visual imagery is needed.
Future investigation could focus on the relation between these results and possible changes (as well as variations across varieties and disciplines) in proscriptions of style guides, which, predictably, have had a great influence on the changes occurring in the use of generic pronouns in academic writing. Sociolinguistic research could also highlight the reasons behind the variation evidenced in the varieties of English in the six components of ICE, while an investigation extended to academic speech texts could ascertain whether the compliance with singular they proscription (evidenced for Australian English by Pauwels 2002) possibly affects the whole academic genre. By way of an afterword, the investigation of generic pronouns leads to some concluding remarks on corpus-based studies. Once again, small corpora prove to be useful for signalling possible general trends and comparing one linguistic fact along large dimensions of variation, i.e. among regional varieties of English or within the same variety along time. Larger – yet, not the largest – corpora can deepen the quantitative analysis to sub-categories, i.e. text types and domains within the same variety. Finally, extremely large corpora are indispensable for linguistic investigation at a qualitative level, but their size is difficult to manage when semantic screening of very frequent lexical items (such as pronouns) is required. Undoubtedly, corpus analysis cannot be ignored in any study concerned with language use, and software tools are indispensable in saving the researcher time and energy. It must be admitted, though, that it is often difficult to ascertain the whole picture.
Language makes use of a wide and differentiated range of strategies for the expression of meaning that go well beyond a limited set of lexemes. These strategies cannot be easily monitored through either software tools or ready-made corpora, but necessarily imply a consideration of subjective factors such as the speaker’s intentions, the addressee’s and the audience’s knowledge, as well as their comprehension of and sensitivity to linguistic issues. Ultimately, all these extra-linguistic factors cannot be captured by an objective scientific investigation but can only be hypothesised about thanks to the researcher’s subjective sensitivity and to her intuition and mastery of the phenomenon at stake.
Notes
1 The example has been taken (and modified accordingly) from an occurrence retrieved in ANC: “we no longer choose to believe that anyone can put aside their own interests to review a situation dispassionately” (ANC\verbatim\vol21_4).
2 Nilsen (1984) titles her article “Winning the Great He/She Battle” and attributes the coinage to Betty Lou Dubois and Isabel M. Crouch, who first used it to label the debate around generic pronouns at the WHIM (Western Humor and Irony Membership) 1982 Humor Conference.
3 The feminine pronoun is also conventionally used to refer to ships, countries, and some personified natural entities, e.g. Nature, the Moon. Generic pronouns denoting non-human referents will not be dealt with here.
4 Indeed, a quick search using WordSmith Tools reveals that none of the proposed neologisms listed in Baron (1981) is attested in the BNC, ANC, ICE, or the contemporary components of the ICAME collection.
5 Cf. Nilsen (1984) and the many web pages concerning guidelines on gender-fair writing, e.g.:
http://home.comcast.net/~infoeng/samples/gender_fair_guide.html
http://www.rscc.cc.tn.us/owl&writingcenter/OWL/Sexism.html
http://owl.english.purdue.edu/handouts/general/gl_nonsex.html
http://leo.stcloudstate.edu/style/genderbias.html
http://www.rpi.edu/web/writingcenter/genderfair.html
6 However, as the OED entry for he states: “The or that man, or person of the male sex (that or who...). Hence Indefinitely, Any man, any one, one, a person (that or who)”; in this respect cf. the entry for man: “A human being (irrespective of sex or age)”, with the following note “Man was considered until the 20th cent. to include women by implication, though referring primarily to males. It is now freq. understood to exclude women, and is therefore avoided by many people. In some of the quotations in this section, it is difficult or impossible to tell whether man is intended to mean ‘person’ or ‘male human being’”. It thus seems that the generic sense of he is to be implicitly inferred from that of man.
7 http://www.merriam-webster.com/info/legacy.htm (accessed 16/09/2008).
8 In the selected corpora, pronouns with no antecedents were found in expressions functioning as meta-linguistic statements of the type “when I say ‘he is not aware’, I mean…”. Quotations and meta-statements (i.e. in texts dealing with the topic of generic pronouns) were excluded from the data.
9 3,739 in the Brown Family, 6,490 in the ICE components, 13,681 in the ANC and 119,253 in the BNC.
10 Of the 501 academic writing text files, only 486 files had occurrences of he, 20 of which had only one. The screening of the sample of 952 occurrences resulted in 221 occurrences of generic he (23.2%).
11 This is not unusual considering that, judging from the higher number of occurrences of (both gender-specific and generic) he than she in all corpora, women are generally less written about than men.
12 The rarity of singular they in co-reference with no(one/body) is confirmed when tested against the overall frequency of each indefinite pronoun; indeed in the corpus no(one/nobody) occurs slightly more frequently than every(one/body) (874 vs. 852 occurrences respectively), which is the second most frequent antecedent of singular they.
13 This remark, however, is stated here only relying on the data retrieved, regardless of issues related to the selection of the texts that constitute the academic subcorpus of BNC. Indeed, the fact that 7% of the texts account for at least one occurrence of generic she might be a consequence of the particular texts included in the corpus.
References
Abbott, G. (1984), ‘Unisex they’, English Language Teaching Journal, 38/1: 45-48.
Adamski, C. (1976), ‘Changes in Pronominal Usage among College Students as a Function of Instructor Use of She as the Generic-Singular Pronoun’, Paper presented at the 1976 meeting, American Psychological Association.
Baron, D.E. (1981), ‘The Epicene Pronoun: The Word That Failed’, American Speech, 56/2: 83-97.
Baron, D.E. (1986), Grammar and Gender. New Haven: Yale University Press.
Bate, B. (1978), ‘Non sexist language use in transition’, Journal of Communication, 28: 139-149.
Bennett-Kastor, T.L. (1996), ‘Anaphora, Nonanaphora, and the generic Use of Pronouns by Children’, American Speech, 71/3: 285-301.
Biber, D., S. Johansson, G. Leech, S. Conrad, and E. Finegan (1999), Longman Grammar of Spoken and Written English. London: Longman.
Blaubergs, M. (1980), ‘An analysis of classic arguments against changing sexist language’, Women’s Studies International Quarterly, 2/3: 135-147.
Bodine, A. (1975), ‘Androcentrism in Prescriptive Grammar: Singular “they”, Sex-indefinite “he”, and “he or she”’, Language in Society, 4: 129-146.
Bryant, M.M. (1950), ‘Person… Their’, College English, 11/6: 345-346.
Cooper, R.L. (1984), ‘The avoidance of androcentric generics’, International Journal of the Sociology of Language, 50: 5-20.
Corbett, G.G. (1991), Gender. Cambridge: Cambridge University Press.
DeShazer, M.K. (1981), ‘Sexist Language in Composition Textbooks: Still a Major Issue?’, College Composition and Communication, 32/1: 57-64.
Ehrlich, S. and R. King (1994), ‘Feminist meanings and the (de)politicisation of the lexicon’, Language in Society, 23: 59-76.
Frank, F.W. and Treicher, P. (1989), Language, gender and professional writing: Theoretical approaches and guidelines for non-sexist usage. New York: Modern Language Association.
Gastil, J. (1990), ‘Generic pronouns and sexist language: The oxymoronic character of masculine generics’, Sex Roles, 23: 629-643.
Gershuny, H.L. (1977), ‘Sexism in Dictionaries and Texts: Omissions and Commissions’, in: A.P. Nielsen et al. (eds.), Sexism and Language. Illinois: National Council of Teachers of English. 161-179.
Green, W.H. (1977), ‘Singular Pronouns and Sexual Politics’, College Composition and Communication, 28/2: 150-153.
Hellinger, M. (1990), Kontrastive Feministische Linguistik. Ismaning: Hueber.
Hellinger, M. (2001), ‘English – Gender in a global language’, in: M. Hellinger and H. Bußmann (eds.) Gender Across Languages. Amsterdam: Benjamins. 105-113.
Henley, N.M. (1984), ‘This new species that seeks a new language: on sexism in language and language change’, in: J. Penfield (ed.) Women and Language in Transition. Albany: State University of New York Press. 3-27.
Holmes, J. (1998), ‘Generic pronouns in the Wellington Corpus of Spoken New Zealand English’, Kōtare, Vol. 1, No. 1: 32-40.
Huddleston, R. and G. Pullum (2002), The Cambridge Grammar of the English Language. Cambridge: Cambridge University Press.
Jochnowitz, G. (1982), ‘Everybody Likes Pizza, Doesn’t He or She?’, American Speech, 57/3: 198-203.
Kearns, E.A., M. Walker, K. McCoy and M. Balorn (1994), ‘Four Comments on “The Politics of Grammar Handbooks: Generic He and Singular They”’, College English, Vol. 56, No. 4: 471-475.
Khosroshahi, F. (1989), ‘Penguins don’t care, but women do: A social identity analysis of a Whorfian problem’, Language in Society, 18: 505-525.
Kidd, V. (1973), ‘A study of the images produced through the use of male pronouns as the generic’, Moments in Contemporary Rhetoric and Communications, 1: 25-30.
Kolln, M. (1986), ‘Everyone’s Right to Their Own Language’, College Composition and Communication, 37/1: 100-102.
Lakoff, R. (1975), Language and Woman’s Place. New York: Harper & Row.
Leonard, S.A. (1929), The doctrine of correctness in English usage 1700-1800. Madison: University of Wisconsin Studies in Language and Literature, 25.
Lloyd, C.A. (1993), ‘Does “Concord Based on Meaning” Justify “Their” Referring to “Each” or “Everybody”?’, College English, 1/3: 267-269.
MacKay, D.G. (1980), ‘On the goals, principles, and procedures for prescriptive grammar: Singular they’, Language in Society, 9: 349-367.
MacKay, D.G. (1983), ‘Prescriptive grammar and the pronoun problem’, in: B. Thorne, C. Kramarae, and N. Henley (eds.) Language, gender and society. Rowley, MA: Newbury House. 25-37.
Marcoux, D.R. (1973), ‘Deviation in English Gender’, American Speech, 48/1: 9817.
Markovitz, J. (1984), ‘The impact of the sexist language controversy and regulation on language in university documents’, Psychology of Women Quarterly, 8/4: 337-347.
Martyna, W. (1978), ‘What does “He” Mean?’, Journal of Communication, 28: 131-138.
Martyna, W. (1980), ‘The psychology of the generic masculine’, in: S. McConnell-Ginet, R. Borker, and N. Furman (eds.) Women and language in literature and society. New York: Praeger. 69-78.
Martyna, W. (1983), ‘Beyond the “he/man” approach: The case of non-sexist language’, in: B. Thorne, C. Kramarae, and N. Henley (eds.) Language, gender and society. Rowley, MA: Newbury House. 25-37.
McConnell-Ginet, S. (1979), ‘Prototypes, pronouns and persons’, in: M. Mathiot (ed.) Ethnolinguistics: Boas, Sapir, and Whorf revisited. The Hague: Mouton. 63-83.
McKnight, G.H. (1925), ‘Conservatism in American Speech’, American Speech, 1: 1-17.
Merriam-Webster Online Dictionary (2007) http://www.merriam-webster.com
Meyers, M.W. (1990), ‘Current Generic Pronoun Usage: An Empirical Study’, American Speech, 65/3: 228-237.
Meyers, M.W. (1993), ‘Forms of they with singular noun phrase antecedents: Evidence from current educated English usage’, Word, 44/2: 181-192.
Mühlhäusler, P. and R. Harré (1991), Pronouns and People: The Linguistic Construction of Social and Personal Identity. Oxford: Basil Blackwell.
Newman, M. (1992), ‘Pronominal disagreements: The stubborn problem of singular epicene antecedents’, Language in Society, 21: 447-475.
Newman, M. (1996), Epicene Pronouns: The Linguistics of a Prescriptive Problem. London: Garland Publishing.
Newman, M. (1998), ‘What can pronouns tell us? A case study of English epicenes’, Studies in Language, 22/2: 353-389.
Nilsen, A.P. (1984), ‘Winning the Great He/She Battle’, College English, 46/2: 151-157.
Oxford English Dictionary (2nd ed. 1989), OED Online. Oxford: Oxford University Press. 16 May 2007.
Palmer, A. (1951), ‘Rules and Concord’, College English, 12/4: 224-225.
Pauwels, A. (1997), ‘Of handymen and waitpersons: A linguistic evaluation of job classifieds’, Australian Journal of Communication, 24/1: 58-69.
Pauwels, A. (1998), Women Changing Language. London: Longman.
Pauwels, A. (2000), Women Changing Language. Feminist Language Change in Progress. Paper presented at the First International Gender and Language Association Conference, Stanford University, May 2000.
Pauwels, A. (2001), ‘Non-sexist language reform and generic pronouns in Australian English’, English World-Wide, 22/1: 105-119.
Pauwels, A. (2002), ‘The sociolinguistics of generic pronouns: women’s and men’s use of gender inclusive, gender neutral and masculine generic pronouns’. Paper presented at the International Sociological Association Congress, Brisbane, Australia, July 2002.
Pauwels, A. (2003), ‘Linguistic Sexism and Feminist Linguistic Activism’, in: J. Holmes, and M. Meyerhoff (eds.) The Handbook of Language and Gender. Malden/Oxford: Blackwell. 551-570.
Pauwels, A. and J. Winter (2004), ‘Generic Pronouns and Gender-Inclusive Language Reform in the English of Singapore and the Philippines’, Australian Review of Applied Linguistics, 27/1: 50-62.
Pooley, R.C. (1946), Teaching English Usage. New York: D. Appleton-Century Co.
Purnell, S. (1978), ‘Politically speaking, do women exist’, Journal of Communication, 28/1: 150-156.
Schulz, M.R. (1975), ‘How Serious Is Sex Bias in Language?’, College Composition and Communication, 26/2: 163-167.
Sklar, E.S. (1983), ‘Sexist Grammar Revisited’, College English, 45/4: 348-358.
Sklar, E.S. (1988), ‘The Tribunal of Use: Agreement in Indefinite Constructions’, College Composition and Communication, 39/4: 410-422.
Stanley, J.P. (1978), ‘Sexist Grammar’, College English, 39/7: 800-811.
Switzer, J.Y. (1990), ‘The Impact of Generic Word Choices: An Empirical Investigation of Age- and Sex-Related Differences’, Sex Roles, 22: 69-82.
The Oxford English Dictionary (2nd ed. 1989), OED Online. Oxford: Oxford University Press. 2007 http://dictionary.oed.com
Zuber, S. and A.M. Reed (1993), ‘The Politics of Grammar Handbooks: Generic He and Singular They’, College English, 55/5: 515-530.
The interpersonal function of going to in written American English
Anna Belladelli
University of Verona
Abstract
The spread of going to as an expression of futurity in recent written American English has been interpreted by previous studies as being part of a large-scale trend towards the ‘colloquialisation’ of written English. Indeed, in the Frown Corpus most occurrences of going to are found within a range of ‘conversational’ and/or interactive contexts: direct speech, reported speech, soliloquy and so forth. The present analysis, focused on genre categories and pragmatic aims, shows that the ‘colloquialising’ force of the semi-modal does not work in isolation: indeed, in running text going to is often found in co-occurrence with other informal and speech-related linguistic features (both lexical and morphosyntactical) that share the same interpersonal function. These elements are concentrated strategically in texts, probably in order to highlight specific claims. Therefore, besides providing further evidence of the general ongoing colloquialisation or informalisation of written English, the use of going to seems to be part of a strategy which involves a voluntary switch in style in order to create an intersubjective connection with the reader.
1. Aim and scope of the study
The compilation of the Frown Corpus and the FLOB Corpus in the 1990s has given rise to a range of studies on short-term diachronic change in 20th-century British (BrE) and American English (AmE), some of which have focused on expressions of futurity. Early studies of Frown, focusing on Reportage, Editorials and Reviews, shed light on the linguistic change occurring in journalistic writing. Hundt (1997) observed that the number of going-to futures had doubled from Brownpress to Frownpress; Mair added that the increase of the semi-modal could be due to its use in quotations from oral utterances, since “a full 51 of the 67 instances in Frown occur in passages of direct speech” (1997b: 205). He interpreted such a change by suggesting that the spread of going to in newspaper language might indicate a new stylistic trend in that particular genre, i.e. the narrowing of the gap between the norms of spoken and written English (1997a: 1541-1542). Leech (2003) investigated Brown, Frown, LOB and FLOB in search of change in the use of modals and semi-modals. He demonstrated that the frequency of going to and gonna had increased by 51.6% in written AmE, whereas not much had changed in written BrE, thus suggesting that the latter variety was not so ‘keen’ on being ‘colloquialised’.
Grammars agree in claiming that the semi-modal¹ going to belongs to informal and spoken language. Huddleston and Pullum (2002) consider the occurrence of going to within relatively informal style to be a differentiating factor as opposed to the more ‘neutral’ will (and would, as its past form; see Palmer 2001: 31). Celce-Murcia and Larsen-Freeman (1999) also stress the predominant informality of this verbal structure. Reference grammars likewise agree on the fact that going to is much less frequent in written English than it is in spoken English. Biber et al. (1999) offer more quantitative corpus-based data on the occurrence of going to within the range of possible expressions of futurity. They claim that going to “is a common way of marking future time in conversation [...] but is rarely used in written exposition” (1999: 490; my emphasis). Drawing on the notion that going to is mainly perceived as a feature of the spoken medium, the present study will analyze the spread of this semi-modal in written AmE, with a focus on genres and pragmatic aims. The employment of this semi-modal to reproduce oral exchanges or to create fictional dialogues in writing is easily explained, since speech-related structures and forms help to make them realistic and plausible. For this reason, I will instead analyze the use of going to in running text, and try to determine under which circumstances and for which pragmatic purposes writers seem to use it in their texts. My hypothesis is that going to is used deliberately to create an intersubjective connection with the reader/audience to encourage understanding and agreement on the part of the addressee(s). In this sense, I argue that going to has an interpersonal function that can be investigated by focusing on its use in running text as opposed to direct speech and reported speech.
As pointed out by Hundt and Mair (1999: 225), we are witnessing – in English as well as in other languages – a large-scale phenomenon that characterises a variety of social domains, and that Fairclough defines as the “apparent democratization of discourse” (Fairclough 1995: 79), i.e. the increasing and systematic choice of certain linguistic options (e.g. the use of first names as opposed to honorifics, or the ongoing informalisation of discourses that traditionally require formal styles) which serve to create the illusion that power relations between humans are weaker than they actually are, thus favouring consent and approval on the part of the less ‘powerful’ members of society, in that they no longer feel threatened by cultural or institutional authorities. In most instances of going to, I observed a number of recurring linguistic elements (listed below) that share the same interpersonal function. The result of this combination appears to be a clear, localisable lowering of the tone of the discourse. It seems that writers are making a deliberate slide down the formality/informality cline, so as to reach a more colloquial, peer-to-peer level of communication, probably in order for their ‘voice’ to be more persuasive and more easily shared with the reader/audience. To prove this observation, I have investigated whether the co-text of these occurrences includes the following elements:²
The interpersonal function of going to in written American English
- questions (also rhetorical);
- slang vocabulary, idioms, metaphoric expressions;
- deictic expressions (temporal, spatial, personal);
- irony or sarcasm.
Although irony and sarcasm can hardly be determined from an objective perspective, their co-occurrence has been considered here as a valid parameter indicating the writer’s attempt to use levity to connect with the audience. Humour, indeed, has often been identified by sociologists “as a force in group formation and behaviour” (Eble 1996: 122). Furthermore, I have taken the use of contracted forms into account as they – like going to – usually mark a switch to a more oral style. Besides, the use of contractions has been shown to be closely related to the choice of informal vocabulary (Chafe and Danielewicz 1987: 94, Westin 2002: 33).

2. Corpus data and method
Occurrences of going to have been retrieved from two matching 1-million-word corpora: the Brown Corpus, first released in 1964 and containing texts published in 1961, and the Frown Corpus, first released in 1999 and containing material published in 1992 (Hundt, Sand and Skandera 1999). Both corpora consist of 500 texts divided into 15 text categories of fictional and non-fictional writing, and are comparable with their British counterparts, i.e. the LOB Corpus and the FLOB Corpus. The text categories have been grouped together according to the four ‘genre’ categories outlined by Leech (2004: 71), in order to create four discrete sub-corpora:
- Press (A-C)
- General Prose (D-H)
- Learned (J)
- Fiction (K-R).
Since these macro-categories are not equal in terms of size,3 data have been converted into rates per million words. WordSmith Tools (version 4.0) was used for the retrieval of instances of going to in the corpora. The analysis of gonna – which has gained the status of a ‘standard’ allomorph of going to in written Present-day English – has been excluded from the present study, due to its low frequency and its being limited to fictional dialogue.4 Both present and past tense forms of the auxiliary verb to be have been included (see Table 1), as well as contracted forms (see Table 3), forms indicating positive and negative polarity, and interrogatives. Instances of prepositional to, as in the structure ‘going to + noun phrase’, have been excluded manually. For different reasons, occurrences such as the following have not been included
Anna Belladelli
either. Example (1) mentions the title of a song; in (2), on the other hand, going to can hardly be qualified as verbal, since the hyphenated cluster in which it appears functions as an adjective. (1)
“But radio was all we had for entertainment at that time, and those songs of his stayed in the dust of my memories – songs like Ain’t Nobody Here But Us Chickens and Saturday Night Fish Fry and What’s the Use of Getting Sober When You’re Going to Get Drunk Again?. Every time I heard those songs I felt better, and I thought maybe they could have that effect on other people.” (Frown E36 167)
(2)
Consequently, whatever money you had was for spending. And we Baby Boomers – deep into our I’m-gonna-live-forever 20s – spent it. (Frown B11 30)
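The retrieval step described above can be approximated outside a concordancer. What follows is a minimal Python sketch under stated assumptions, not the WordSmith Tools procedure used in this study: the pattern `GOING_TO` and the helper `find_candidates` are illustrative names. It finds free-standing going to (and goin' to), ignores hyphenated clusters, and returns each hit with some context, so that prepositional to and titles such as the one in (1) can still be excluded by hand.

```python
import re

# Illustrative pattern (an assumption, not the study's actual query):
# "going to" or "goin' to" as free-standing words, not inside a hyphenated cluster.
GOING_TO = re.compile(r"(?<![\w-])goin(?:g|['’]) to(?![\w-])", re.IGNORECASE)


def find_candidates(text, window=40):
    """Return (offset, keyword-in-context) pairs for manual inspection."""
    hits = []
    for match in GOING_TO.finditer(text):
        left = text[max(0, match.start() - window):match.start()]
        right = text[match.end():match.end() + window]
        hits.append((match.start(), f"{left}[{match.group(0)}]{right}"))
    return hits


sample = ("Why, he's going to kill me, he thought. "
          "She was going to town. "
          "Our I'm-going-to-win attitude was ongoing today.")
for offset, context in find_candidates(sample):
    print(offset, context)
```

The second hit in the sample (going to town) is prepositional; the sketch deliberately leaves that decision to the analyst, in line with the manual exclusion described above.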
After ascertaining the overall distribution of going to across genres, an initial question has to be asked: where and when does going to occur? To answer this question, all the occurrences will be classified according to their context of use:

Direct speech/Quote This category includes clauses or sentences surrounded by quotation marks, as well as sentences in dialogues and transcriptions of speeches where turn-taking is graphically signalled, as in interviews: (3)
Q. President Vassiliou, are you going to ask the United States to pressure Mr. Denktash to make some progress? (Frown H18 191).
Unsignalled direct speech/quote This category includes clauses or sentences which are clearly direct speech but do not display any formal graphic signs marking the quote/speech. I have also considered instances where interior monologue is recorded: (4)
Why, he’s going to kill me, he thought wildly (Brown N09 1510 7).
Reported speech This category includes clauses or sentences introduced by a reporting verb, which paraphrase the content of an original source indirectly, according to various conversion patterns (cf. Celce-Murcia and Larsen-Freeman 1999: 687): (5)
Heath Shuler says Tennessee coach Johnny Majors is going to have to make up his mind this week and pick one quarterback (Frown A17 170).
Running text This category includes clauses or sentences which are found in the body of the text – as opposed to headlines – but do not fit in any of the above groups:
(6)
As far as I am concerned there is continuous piling up of evidence that the creative fresh ideas which are needed in the world are going to be found by educated women unafraid to break traditions (Brown F47 1500 3).
For each genre, only remarkable short-term diachronic change concerning the distribution of going to across contexts has been considered. Finally, the instances of going to occurring in running text have been analyzed in detail, so as to redefine the phenomenon in terms of usage in writing. As stated in Section 1, my hypothesis is that the ‘colloquialising’ force of this semi-modal needs to be reinforced by other interpersonal elements, in order to establish an effective connection with the reader. The following case exemplifies my qualitative approach to data analysis: (7)
OVER ALL these fairly awkward problems Khrushchev was to skate rather lightly; and, though he repeated, over and over again, the spectacular figures of industrial and agricultural production in 1980, the “ordinary” people in Russia are still a little uncertain as to how “communism” is really going to work in practice, especially in respect of food. Would agriculture progress as rapidly as industry? (Brown B25 0540 4)
In (7), going to co-occurs with the following elements: metaphoric expressions (skate over a problem), hyperboles (over and over again, the spectacular figures), interrogative clause (Would agriculture progress as rapidly as industry?), and light sarcasm (“ordinary”, “communism”). As for the last element, it is worth saying that single and double inverted commas often indicate the writer’s attempt to personalise his or her writing by mimicking the emotional directness of oral speech, and are conventionally used “to indicate the writer’s abdication of responsibility, [...] a sort of ironic lift of the eyebrow in print” (Lakoff 1984: 245). In (7) the above listed elements are meant to engage the imagined audience in the criticism and, simultaneously, reinforce the writer’s point.

3. Quantitative analysis
The data retrieved reveal an increase in the use of going to, already attested in previous studies, as shown in Figure 1.
Figure 1: The spread of going to (data per million words) according to the genre categories defined by Leech (2004)

The semi-modal is primarily found in Fiction, although its frequency has not increased much from Brown to Frown. In turn, Press and General Prose are the genres which have undergone the most remarkable change. (Technically, the number of instances in Learned has also increased from the 1960s to the 1990s; nevertheless, since the use of going to is still rare in texts belonging to this genre, such low figures make generalisations difficult.)5 Table 1 shows the distribution of the data retrieved according to tense. When the auxiliary does not help determine the tense, e.g. in case of ellipsis, the context and the overall syntactic structure of the sentence have made it possible to do so. An increase in the use of going to in the present tense can be observed from Brown to Frown. The corresponding decrease in the share of past tense forms (from 35.4 to 27.1 per cent) suggests that the use of going to in narrative and descriptive contexts may have decreased, and that more attention should be paid to other contexts of use, such as direct speech.

Table 1: Distribution of going to according to tense (raw frequencies)

           Brown                Frown
           going to    %        going to    %
present    133         64.6     223         72.9
past       73          35.4     83          27.1
total      206         100.0    306         100.0
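The percentages in Table 1 follow directly from the raw frequencies. The short Python sketch below reproduces them as a check; the names `TENSE_COUNTS` and `shares` are illustrative and do not come from the tools used in this study.

```python
# Raw frequencies of going to by tense, copied from Table 1.
TENSE_COUNTS = {
    "Brown": {"present": 133, "past": 73},
    "Frown": {"present": 223, "past": 83},
}


def shares(counts):
    """Convert raw counts to percentages of their total, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


for corpus, counts in TENSE_COUNTS.items():
    print(corpus, sum(counts.values()), shares(counts))
```

Note that the raw number of past tense forms rises slightly (from 73 to 83); it is only their share of all instances that falls.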
The distribution of going to is also worth measuring with regard to the person it involves. Despite the increase in number, the use of the semi-modal with the first and second person singular has not undergone any changes that are worth investigating, whereas a slight increase in the first person plural emerges. If we
consider the third persons (both singular and plural) together, the use of going to has decreased from Brown to Frown by about 8.2 per cent. The semi-modal seems to be used more often in Frown with person deixis: the occurrence of going to with 1st and 2nd persons (both singular and plural) has increased from 39.7 per cent (Brown) to 46 per cent (Frown).

Table 2: Distribution of going to in Brown and Frown according to person (raw frequencies on the left; percentages on the right)

             Brown                Frown
             going to    %        going to    %
1st sing.    33          16.0     54          17.6
2nd sing.    30          14.6     47          15.4
3rd sing.    113         54.9     141         46.2
1st pl.      18          8.7      38          12.4
2nd pl.      -           -        2           0.6
3rd pl.      12          5.8      24          7.8
total        206         100.0    306         100.0
As far as auxiliary verbs are concerned, to be is used in the vast majority of cases. Tables 3 and 4 show the distribution of full (e.g. ‘I am going to’), contracted (e.g. ‘I’m going to’), and non-Standard (e.g. ‘I ain’t going to’) forms of the auxiliary verb used in the going-to constructions found in the data. Ellipsis, such as in ‘you not going to’, has been considered as non-Standard due to its stigmatised status. Table 3: Distribution of full, contracted and non-Standard forms of the auxiliary verb according to person (raw frequencies)
1st sing. 2nd sing. 3rd sing. 1st pl. 2nd pl. 3rd pl. total
full 9 11 91 11 9 131 (63.6%)
Brown contr. 23 15 20 7 3 68 (33%)
non-S 1 4 2 7 (3.4%)
full 22 9 91 21 1 23 167 (54.6%)
Frown contr. 31 31 48 16 1 127 (41.5%)
non-S 1 7 2 1 1 12 (3.9%)
As Table 3 demonstrates, the ratio between full and contracted forms has undergone a quantitative change, as the percentage of the latter has increased by 13.5 per cent from Brown to Frown. An increase in contracted forms can also be observed in Table 4; however, more than two thirds of the going-to constructions with contracted auxiliaries in Frown are found in signalled and unsignalled direct speech. This may be explained by the need to confer a more
lively and speech-like tone to fictional dialogues and reconstructed spoken exchanges.

Table 4: Distribution of full, contracted and non-Standard forms of the auxiliary verb to be according to genre category (raw frequencies on the left; per million words between brackets)

            Brown                                    Frown
            full         contr.       non-S          full          contr.        non-S
Press       26 (147.7)   5 (28.4)     -              35 (198.9)    31 (176.1)    1 (5.7)
GenProse    26 (63.1)    4 (9.7)      -              41 (99.5)     23 (55.8)     -
Learned     1 (6.2)      -            -              5 (31.2)      -             -
Fiction     78 (312)     59 (236)     7 (28)         86 (344)      73 (292)      11 (44)
Total       131 (529)    68 (274.1)   7 (28)         167 (673.6)   127 (523.9)   12 (49.7)

4. Qualitative analysis
This section provides a qualitative and partly quantitative analysis of the spread of going to. The phenomenon is described by taking each genre category into account separately, with a focus on the occurrences found in running text. Short-term diachronic changes are observed (both in terms of quantitative spread and distribution across contexts) and an explanation of the phenomenon is attempted.

4.1 Press
Quantitatively speaking, Press is the genre in which the use of going to has undergone the most remarkable increase, although such a spread is limited to a specific text type, namely Reportage, as highlighted by Mair (1997a). In turn, while going to seems to have decreased in Editorials, it is still rare in Reviews. As for the distribution across contexts of use, the data show that going to, which used to be employed predominantly in running text, as illustrated in Figure 2, occurs now chiefly in direct speech (also unsignalled). This phenomenon, however, does not mean that nowadays writers use going to more often in direct speech: a more plausible explanation is that the use of quotes and direct speech in Press has increased in general. The change may be accounted for by considering that the ‘apparent democratization’ of language primarily affects genres that originate from institutional and cultural authorities; therefore new styles and trends in newspaper writing are expected to emerge.
Figure 2: Contexts of use of going to in Brown and Frown (Press)

According to the theory of audience design (Bell 1984, 1991), speakers/writers modulate their talk for their audience. The adjustment of the speaker’s style in order to respond to real or imagined addressees is more evident in texts produced by media institutions, which have economic and social interests in shaping and maintaining a relationship with a specific target audience. For this reason, the increase of direct speech is extremely likely: in the search for strategies to “win bigger audiences” (Hundt and Mair 1999: 236) and maintain their trust, the effectiveness of an alleged ‘first-hand’ testimony makes it the preferable reporting technique. The increase of the occurrences of going to in direct speech is most probably due to the spread of direct speech. The co-existence of these two phenomena has already been observed:

The going-to-future has become more frequent [in Press categories] because today’s journalists use more direct-speech quotations than previously to create an illusion of vivid orality in written texts. Thus, the spoken contexts in which going to is frequent are simply represented more often in writing. But of course, the form is also used more frequently outside direct-speech passages, which shows that the stylistic norms governing spoken and written usage are indeed converging. (Mair 1997a: 1541-1542)

The notion of ‘norms’ concerning speech and writing has been addressed critically by Lakoff (1984) and Biber (1986): the former, in particular, questioned the clear-cut dichotomy between planned, non-spontaneous written discourse and spontaneous, direct oral communication (1984: 241). In the pre-Internet era, Lakoff had already observed that – thanks to the influence of “mixed” media and the access to “new” technology (television and audio/video recording devices) in
everyday life – language users were beginning to feel more comfortable with mixing different styles of communication to their fullest advantage, regardless of which ones were traditionally ascribed to spoken or written language. Her view, however, seems to concern only private discourses: conversely, when analyzing texts conceived and produced by mass media, a more convincing explanation of the phenomenon can be found in Hundt and Mair (1999): drawing on Fairclough’s analysis of language as a social practice, they highlight that newspapers are subject to competitive market forces that push them to attract their audiences by means of specific linguistic strategies. The qualitative analysis of the data reveals that most of the occurrences contain at least one of the interpersonal linguistic elements listed in Section 2, and more often a combination of them: 61.9 per cent in Brown and 75 per cent in Frown. In most cases the writer is using irony, either by making a sarcastic comment or by employing hyperboles and word play (although the remaining text is not characterised by irony). As for deixis, in one out of two occurrences it is the case that either the audience is being addressed directly (by means of the personal pronouns you and inclusive we), or the writer is speaking in the first person, also adding time and space references (adverbs, adjectives). The following examples illustrate typical contexts: (8)
Obviously, regardless of what the politicians and the institutional poohbahs say, Americans as a whole are worried or dissatisfied or both. Neither Bill Clinton nor George Bush is going to discuss the really important matters because they are both establishment candidates. But here is the nut of the problem. (Frown B13 158)
(9)
Need for service is here to stay- and the problem is going to be tougher to solve in the sixties. There are two reasons for this. First, most products tend to become more complex. Second, in a competitive market, the customer feels his weight and throws it around. (Brown E28 0360)
In (8), going to co-occurs with a slang term (poohbah), an idiomatic phrase (the nut of the problem), and a spatial deictic element (the adverb here). The tone of this passage is characterised by a note of sarcasm, since the writer is claiming that the two candidates, who should be concerned with ‘important matters’ more than anyone else, always avoid doing so. Example (9) is extracted from a specialised article on marketing strategies, where the editorial staff makes a list of the 14 main concerns entrepreneurs should keep in mind before starting a new business. Drawing on two main discourse types, i.e. professional counselling and how-to guide, the text producer – i.e. an author subject to collective editing (see Fairclough 2001 [1989]) – needs to connect with the readers as clients and learners. To do so, the overall style is kept informal and the representation of the economic situation is oversimplified: besides containing going to, the excerpt (as well as the rest of the sample) includes colloquialisms (throws it around, tougher)
and high-frequency lexical collocations (is here to stay, in a competitive market, become more complex). This analysis shows that, despite the increase in frequency, there has not been a substantial functional change in the employment of going to in discourse, either in terms of motivation or of co-occurring interpersonal elements. Examples from Brown and Frown show that no novel uses of this structure have arisen. In other words, going to is still used predominantly for interpersonal purposes and within specific intersubjective strategies; for this reason, it should not be considered as less informal – i.e. more similar to will/would – on account of its spread. A more accurate explanation is that, in the era of informalisation and false democratization of discourse, press writers resort to those intersubjective strategies much more often.

4.2 General Prose
General Prose is an umbrella label devised by Leech (2004), grouping five text types of the Brown-type corpora (namely D, E, F, G, and H) into a single genre category of non-fictional prose. Figure 1 shows that the number of instances of going to in General Prose has doubled from Brown to Frown. More specifically, a remarkable increase is observed in Skills and Hobbies (E) and Popular Lore (F). Unlike in Brown, in Frown going to is also attested in the category Miscellaneous (H), a text type which contains mainly government reports; this suggests a change in the norms of these texts. As in the Press category, data from the General Prose part of the corpora reveal that the ratio between direct speech and running text has almost undergone an inversion in the thirty-year time span between Brown and Frown, as illustrated in Figure 3. The observations on direct speech made in Section 4.1 hold here as well: the raw number of occurrences within direct speech has increased from 6 to 41 tokens from Brown to Frown, whereas figures within the other contexts have remained very similar (from 21 to 18 in running text; from 3 to 5 in reported speech).
Figure 3: Contexts of use of going to in Brown and Frown (General Prose)
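The near-inversion between direct speech and running text can be made explicit by converting the raw counts reported above into shares. The following Python sketch does so; `CONTEXT_COUNTS` and `context_shares` are illustrative names, and the figures are those given in the text for General Prose.

```python
# Raw frequencies of going to in General Prose by context of use,
# as reported in Section 4.2 (direct speech, running text, reported speech).
CONTEXT_COUNTS = {
    "Brown": {"direct speech": 6, "running text": 21, "reported speech": 3},
    "Frown": {"direct speech": 41, "running text": 18, "reported speech": 5},
}


def context_shares(counts):
    """Return each context's percentage share of all instances."""
    total = sum(counts.values())
    return {context: 100 * n / total for context, n in counts.items()}


for corpus, counts in CONTEXT_COUNTS.items():
    pct = context_shares(counts)
    dominant = max(pct, key=pct.get)
    summary = ", ".join(f"{c}: {p:.1f}%" for c, p in pct.items())
    print(f"{corpus} (n={sum(counts.values())}): {summary}; dominant: {dominant}")
```

On these figures, running text accounts for 70 per cent of the Brown instances but only about 28 per cent of the Frown instances, while direct speech rises from 20 per cent to about 64 per cent.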
Generally speaking, a substantial short-term diachronic change is observable in terms of its usage in reporting real or fictional speech. The instances of going to within direct speech and running text have undergone the same change observed in Section 4.1, as Figure 3 shows; conversely, the decrease in the portion of reported speech is clearly too small to be considered as a remarkable change. The same qualitative analysis described in Section 2 has been carried out on the instances occurring in running text. In more than 70 per cent of the cases going to is used in co-occurrence with at least one of the interpersonal elements (questions, slang and idioms, deixis, or irony). The semi-modal is found predominantly in collocation with deictic pronouns, and also with rhetorical questions and colloquial vocabulary; unlike Press, the use of irony is nearly absent in co-occurrence with going to in General Prose. Diachronically speaking, the co-presence of the semi-modal and the above mentioned elements becomes more frequent from Brown to Frown. Nevertheless, the recurring interpersonal linguistic elements and the effects thereof are different from those observed in Press. Indeed, the predominant element which is found together with going to is exophoric deixis, both referring to the writer (I) and the reader (you); in turn, the use of irony and sarcasm is almost absent.6 Contracted forms are used along with going to in co-occurrence with at least one of the above mentioned elements. Examples (10) and (11) are meant to exemplify the features observed so far, and are representative of the majority of the instances retrieved from the corpora. (10)
Cost of power and machinery is often a serious problem to the small-scale farmer. If you are going to farm for extra cash income on a part-time basis you must keep in mind the needed machinery investments when you choose among farm enterprises. (Brown F13 1040 2)
In this extract, taken from the Farmers Bulletin, the writer is giving advice to readers who might be interested in taking on farming as a recreational activity. The overall style of the text is sober and not utterly informal; conversely, where going to is used, the vocabulary employed in the co-text becomes less technical, as demonstrated by the choice of everyday lexical items, such as serious problem (a highly frequent, poorly descriptive phrase) and power (instead of electricity). The text is meant as a set of tips for beginners and as a reminder for the others: on the one hand, the writer is using the pronoun you so as to address the reader directly; on the other, the presence of the central modal verb must enhances the strength of the suggestion. The use of colloquial going to – which is employed here with the deontic meaning of ‘in case you intend to’ – could be motivated by the need to emphasise a trustworthy and virtually ‘mutual’ exchange of opinions on how to spend money consciously between an experienced writer and an inexperienced, however curious, reader. In (11) the writer shows a similar attitude towards the addressee:
(11)
Suppose your genetic ‘book’ can be read. It tells me, your potential employer, that at such-and-such an age you are liable to develop this-or-that problem, and I am going to have to foot the bill for the problem. So what is my interest in hiring you as compared with somebody whose book is a bit cleaner? (Frown F12 60)
Here, the presence of the author is even more explicit. First, the personal pronouns and the possessive adjectives are used repeatedly and alternately, as if writer and reader were interlocutors engaged in a face-to-face conversation. Secondly, the presence of colloquial vocabulary (such as the verb to foot and the adverbs so and a bit) and hyphenated modifiers contributes to the informality of the text. Moreover, the writer addresses the reader with a question. Summing up, going to is employed in General Prose mainly in intersubjective strategies which aim at enhancing the persuasiveness of the writer’s voice by reducing the distance between him/herself and the reader/audience. In most instances, writers are actually advising readers on a given issue: in order for communication to be smooth and effective, they might find it convenient to level out the role imbalance between themselves (culturally, ‘those who have something to say’) and the readership (‘those who have something to learn’).

4.3 Learned
The use of going to is very rare in texts belonging to this genre. Despite the relatively high increase of instances from the 1960s to the 1990s, going to does not seem to be considered suitable for texts where a high level of formality is needed. All occurrences retrieved are found in running text, as exemplified in (12). However, the lack of use in other contexts may be biased by the low presence of direct and reported speech in academic writing. (12)
The bottom line on studies of written samples is that one is rarely going to find dirty words used in the sample because they are collected from biased material, even though that material may be appropriate to design children’s reading texts, for example. (Frown J32 147)
Data show that going to is used exclusively when the writer is discussing a relatively trivial or ‘light’ issue, thereby allowing for a shift in style. However, the extremely low number of occurrences does not allow for generalisations about this trend. In (12), for instance, the discussion concerns the reliability of written samples for language studies. Despite the overall formality of the discourse, the writer chooses to use going to when mentioning the alleged absence of derogatory expressions from such samples. Interestingly, the semi-modal co-occurs with the direct object dirty words, which may be considered quite a colloquial vocabulary choice, if compared to the specific terminology available to the author. The lowering of formality could be accounted for by the need to show a nonchalant attitude toward a tabooed or unpleasant matter.
4.4 Fiction
The use of going to in fictional texts is very frequent both in Brown and in Frown. Although the number of occurrences is higher in the latter corpus (144 in Brown and 170 in Frown),7 no remarkable diachronic change is observed in terms of contexts. In both corpora, more than half of the instances of going to are found in direct speech. It would be pointless to analyze this phenomenon here: it is well known that going to, along with other grammatical or morpho-syntactical structures which characterise the spoken mode, is exploited in fictional dialogues in order to create lively and realistic verbal exchanges. This lack of remarkable change in the ratio between direct speech and running text may provide extra evidence for the ‘democratization’ theory: authors of fictional texts are not subject to the same pressures from institutional and economic authorities, i.e. they do not need to win the audience’s trust by providing more and more first-hand testimonies. Going to is also used in running text, particularly the past tense variant (almost 90 per cent of the cases). The following example illustrates its recurrent use in description and narration: (13)
She was going to tell Bobby Joe about how mistaken she had been, but he brought one of the cousins home for supper, and all they did was talk about antelope. Bobby Joe was trying to get Linda Kay to say she would cook one if he brought it home. (Brown K26 1280 5)
5. Conclusion
The aim of this paper has been to provide an analysis of the spread of going to in written AmE, with a focus on contexts and qualitative change. Going to was found to occur in four contexts: running text, direct speech, unsignalled direct speech and reported speech. Short-term diachronic change has been highlighted both in terms of frequency of occurrence and in terms of distribution of going to across different contexts. From a quantitative point of view, Press and General Prose have undergone the most remarkable change, as already stated in previous studies. The thirty-year time span between Brown and Frown has seen an increase of going to, particularly in direct speech. Further analysis of the occurrences in running text has revealed that, although going to is considered to be a relatively informal linguistic choice (as opposed to will/would), its ‘informalising’ force is usually not sufficient: as data show, writers often find it more effective to combine more than one linguistic element which has strong connotations of informal styles (including going to) and that has a clear interpersonal function, so as to pretend to be engaged in a relaxed and more peer-to-peer exchange with the reader. This combination of linguistic
options can be described as an intersubjective strategy which aims at establishing a relationship between writers and their audience. Although informalisation (the process by which discourse is made less authority-oriented) and colloquialisation (the process by which discourse is made more similar to speech) are separate phenomena, they merge into Fairclough’s idea of (false) democratization of discourse. The use of going to falls into both phenomena, as this semi-modal is considered by writers as a more informal alternative to will/would on the one hand, and as a feature of spoken English on the other. In the logic of the democratization of discourse, intersubjective strategies are bound to increase in the genres that are closer to (or that originate from) institutional, economic, and cultural authorities, as writers are asked to win the trust of their target audience, especially for the benefit of the market. The results of this study can substantiate this prediction: the genres that have undergone a remarkable change in the frequency of use of going to are indeed the ones that mostly depend on such authorities.

Acknowledgements

I would like to thank Charles Meyer, Antoinette Renouf and two anonymous reviewers for their insightful comments and suggestions on earlier versions of this article.

Notes

1
In Biber et al. (1999: 484) the multi-word verb going to is referred to as a ‘semi-modal’, or ‘periphrastic modal’, or again ‘quasi-modal’; Celce-Murcia and Larsen-Freeman (1999: 139) define it as a ‘phrasal modal’. Huddleston and Pullum (2002: 210), on the other hand, do not mention the notion of modality: they describe it as “a future-time to-infinitival that merits attention”. Finally, Bybee, Perkins and Pagliuca (1994) name it ‘primary’ future marker. In so doing, these reference grammars emphasise either the temporal or the modal functions of going to.
2
In this study, co-text is understood to be a portion of text including up to three sentences: the whole sentence where going to is found and (where possible) the previous and the following one.
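The co-text window defined in this note can be stated as a small function. The Python sketch below is only an illustration: it assumes the text has already been split into sentences, and the name `cotext` is hypothetical rather than part of any tool used in the study.

```python
def cotext(sentences, index):
    """Return the sentence at `index` plus its immediate neighbours, where they exist."""
    if not 0 <= index < len(sentences):
        raise IndexError("index outside the sample")
    start = max(0, index - 1)
    return sentences[start:index + 2]


# Sentences taken from example (8); going to occurs in the second one.
sentences = [
    "Obviously, regardless of what the politicians and the institutional "
    "poohbahs say, Americans as a whole are worried or dissatisfied or both.",
    "Neither Bill Clinton nor George Bush is going to discuss the really "
    "important matters because they are both establishment candidates.",
    "But here is the nut of the problem.",
]
print(cotext(sentences, 1))
```

When the occurrence is in the first or last sentence of a sample, the window shrinks to two sentences, which corresponds to the ‘where possible’ proviso above.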
3
Since each text sample is c. 2,000 words long, with an excess margin of 20 or 30 words, an approximation of the number of words for each genre category in Brown and Frown has been made as follows: Press (176,000), General Prose (412,000), Learned (160,000), and Fiction (252,000).
4
It should be noted that two non-standard forms, namely goin’ to and gon, also occur in these corpora. I have decided to count the occurrences of the former among those of going to, and to exclude the latter since it occurs only once.
5
The number of instances of going to has increased as follows (raw data): from 31 to 67 in Press; from 30 to 64 in General Prose; from 1 to 5 in Learned; from 144 to 170 in Fiction.
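The conversion of these raw counts into rates per million words can be sketched as follows. This is an illustrative Python example, not the procedure actually used: `GENRE_SIZES`, `RAW_COUNTS` and `per_million` are assumed names, and the sizes are the approximate figures given in note 3.

```python
# Approximate sub-corpus sizes in words (note 3) and raw counts (note 5).
GENRE_SIZES = {"Press": 176_000, "General Prose": 412_000,
               "Learned": 160_000, "Fiction": 252_000}
RAW_COUNTS = {  # genre: (Brown, Frown)
    "Press": (31, 67), "General Prose": (30, 64),
    "Learned": (1, 5), "Fiction": (144, 170),
}


def per_million(count, size):
    """Normalise a raw count to a rate per million words."""
    return count * 1_000_000 / size


for genre, (brown, frown) in RAW_COUNTS.items():
    size = GENRE_SIZES[genre]
    print(f"{genre}: {per_million(brown, size):.1f} -> {per_million(frown, size):.1f}")
```

Because the Brown and Frown sub-corpora are assumed to have the same size, a rise in the raw count within a genre always corresponds to a rise in the normalised rate.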
6
For more quantitative data and a discussion on the role of personal pronouns in the creation of a more “involved” style, see Hundt and Mair (1999: 226-227).
7
These are raw frequencies. The rate per million words would be 571.4 and 674.6, respectively (see Figure 1).
References Bell, A. (1984), ‘Language style as audience design’, Language in Society, 13, 2: 145-204. Bell, A. (1991), The language of news media. Cambridge (MA): Blackwell. Biber, D. (1986), ‘Spoken and written textual dimensions in English: Resolving the contradictory findings’, Language, 62: 384-414. Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan (1999), Longman grammar of spoken and written English. Harlow (UK): Longman. Bybee, J., R.D. Perkins and W. Pagliuca (1994), The evolution of grammar: Tense, aspect, and modality in the languages of the world. Chicago: University of Chicago Press. Celce-Murcia, M. and D. Larsen-Freeman (eds.) (1999), The grammar book. Boston: Heinle&Heinle. Chafe, W.L. and J. Danielewicz (1987), ‘Properties of written and spoken language’, in: R. Horowitz and S.J. Samuels (eds.) Comprehending oral and written language. San Diego: Academic Press. 83-113. Eble, C. (1996), Slang and sociability. Chapel Hill and London: University of North Carolina Press. Fairclough, N. (1995), Critical discourse analysis. London: Longman. Fairclough, N. (2001 [1989]), Language and power. London: Longman. Huddleston, R.D. and G.K. Pullum (eds.) (2002), Cambridge grammar of the English language. Cambridge: Cambridge University Press. Hundt, M. (1997), ‘Has BrE been catching up with AmE over the past thirty years?’, in: M. Ljung (ed.) Corpus-based studies in English: Papers from the seventeenth international conference on English Language research on computerised corpora (ICAME 17). Amsterdam and Atlanta: Rodopi. 135-151. Hundt, M. and C. Mair (1999), ‘“Agile” and “uptight” genres: the corpus-based approach to language change in progress’, International Journal of Corpus Linguistics, 4 (2): 221-242. Hundt, M., A. Sand and P. Skandera (1999), Manual of information to accompany the Freiburg-Brown Corpus of American English (‘Frown’).
Available from: http://khnt.hit.uib.no/icame/manuals/frownINDEX.HTM [Accessed May 2007]. Lakoff, R. (1984), ‘Some of my favorite writers are literate: The mingling of oral and written strategies in written communication’, in: D. Tannen (ed.) Spoken and written language: Exploring orality and literacy. Norwood (NJ): Ablex Publishing Corporation. 239-260. Leech, G. (2003), ‘Modality on the move: the English modal auxiliaries 1961-1992’, in: R. Facchinetti, M. Krug and F.R. Palmer (eds.) Modality in contemporary English. Berlin and New York: Mouton de Gruyter. 223-240. Leech, G. (2004), ‘Recent grammatical change in English: data, description, theory’, in: K. Aijmer and B. Altenberg (eds.) Advances in corpus linguistics. Amsterdam and New York: Rodopi. 61-81. Mair, C. (1997a), ‘The spread of the going-to-future in written English: a corpus-based investigation into language change in progress’, in: R. Hickey and S. Puppel (eds.) Language history and linguistic modelling. A festschrift for Jacek Fisiak on his 60th birthday. Berlin and New York: Mouton de Gruyter. 1537-1543. Mair, C. (1997b), ‘Parallel corpora: a real-time approach to the study of language change in progress’, in: M. Ljung (ed.) Corpus-based studies in English: Papers from the seventeenth international conference on English Language research on computerised corpora (ICAME 17). Amsterdam and Atlanta: Rodopi. 195-209. Palmer, F.R. (2001), Mood and modality. Cambridge: Cambridge University Press. Westin, I. (2002), Language change in English newspaper editorials. Amsterdam and New York: Rodopi.
Re-analysing the semi-modal ought to: an investigation of its use in the LOB, FLOB, Brown and Frown corpora
Marta Degani
University of Verona
Abstract
Most of the research on modality in English has been devoted to the study of core modals (Bybee et al 1994; Kemenade 1993; Palmer 1979, 1986; Plank 1984; Roberts 1985; Traugott 1989; Warner 1990). As a consequence, semi-modals have been kept in a state of relative marginality. This holds particularly true in the case of ought to, as confirmed by the lack of substantial work concerning this semi-modal. The present paper addresses the need to fill this gap by providing a description of ought to in British and American English from a short-term diachronic perspective. The study takes a top-down approach since it starts from the working hypothesis that ought to, like other modal verbs, has been gradually undergoing a process of “subjectification” (Traugott 1989, 1995). The hypothesis will be tested on the four corpora constituting the so-called ‘Brown family’ (LOB, FLOB, Brown and Frown), so as to identify any possible short-term diachronic changes in the British and American varieties under scrutiny. In order to measure how and to what extent the phenomenon of “subjectification” has affected ought to, the analysis will be carried out along semantic and syntactic lines.1
1. Introduction
The modal auxiliaries can, could, may, might, must, shall, should, will, would, and auxiliary do have held center stage in recent accounts of the history of English syntax […] and semantics […] However, the so-called quasi-modals (e.g. to be, to be going to, have to, ought to, had better, be about/able/bound to), which are in an intermediate position between raising predicates and modal operators, have largely been relegated to the sideline or been treated as replacements of the true modals after the latter ceased to be inflected for tense […] As a class the quasi-modals have still not received the attention that they deserve (Traugott 1997: 193).
The present paper is an investigation in the area of weak obligation markers in the English modal system; more specifically, it concerns the semi-modal ought to in Present-Day English2. This research is chiefly motivated by the lack of substantial work regarding the quasi-modal in question, as indicated above. In addition, the study is aimed at measuring the extent to which ought to, like other modal expressions, is undergoing a process of semantic change (cf. Traugott and Dasher 2002: 105-147).
As the OED reports, ought to is characterized by a lack of inflection, a lack of morphological tense distinction (in contrast to the modal pairs can/could, will/would, shall/should, may/might3), absence of the do construction in negative and interrogative contexts, retention of the to-infinitive (possibly influenced by the parallel deontic modal expression have to)4, and by the fact that in negative constructions, not negates the action expressed in the infinitive rather than the deontic meaning of the modal (see also must)5. Ought to expresses duty or obligation of any kind (originally a moral obligation) but, in more general senses, it expresses what is proper, correct, advisable, befitting or expected. The subject of ought to clauses is the person or the thing bound by the obligation. The latter is normally expressed by the following to-infinitive, but it may also be contextually implied. In non-personal and/or passive constructions the obligation is on the part of an agent who is not specified in the clause. The set of conceptual properties developed by Heine in his analysis of the German modals can also be used for a fuller depiction of ought to (cf. Heine 1995). In line with Heine’s model, the conceptual properties of ought to can be defined as follows: its force is social and/or moral; the controlling agent (the obliger) is plural, generic or non-specific; the event is dynamic and it typically leads to a change of state; the eventuality time relative to the modal time is generally future-shifted; the event is nonfactual and probable. Like most English modal verbs ought to can be used to express either deontic (1) or epistemic (2) meaning: (1)
Dear Judith, it is sometime since I last wrote to you and though I am awaiting a letter I think that I ought to write to send you the money I owe you (Coates 1983: 71).
(2)
It ought to have been obvious to Tony that nobody in authority there was going to have a person with my sort of reputation writing articles (Coates 1983: 75).
In (1) ought to is employed in its deontic meaning since the speaker expresses a moral obligation. The utterance is an example of self-exhortation as the speaker urges himself to do something. Conversely, in (2) ought to bears epistemic meaning and could be rephrased as ‘one would have expected that’. Here the modal is inferential and indicates a tentative conclusion about the likelihood of an event. The utterance is also counter-factual since it implies that what was expected to be obvious was actually not. The deontic meaning of ought to concerns weak obligation, duty and advisability. It normally has a future time reference and is generally counterfactual (see Palmer 1990: 122-127, 1986: 131-134), sometimes also non-factive (see Coates 1983: 72). On the syntactic level, it occurs chiefly in collocation with animate subjects. Conversely, the epistemic meaning of ought to regards logical and (tentative) inference, assumptions or probability/likelihood (at times also doubt). Deontic and epistemic ought to share some features: their core time
reference is future and they are typically counter-factual. Syntactically, however, epistemic ought to is used with inanimate rather than with animate subjects. Traditionally, the deontic meaning of ought to has been given primacy over the epistemic. For instance, in an 80 million word corpus (otherwise unspecified for its sources), Mindt (1995: 134-141) establishes the following values for the different meanings of ought to: obligation (58%), advisability (15%), regulation/prescription (4%), and necessity (3%), which can be grouped under the general heading of deontic. His remaining smaller categories of: inference/deduction (16%), certainty/prediction (3%), possibility/high probability (1%) can be subsumed as epistemic. ‘Indeterminacy’ is also frequent with ought to (see Coates 1983: 77-81)6. On the one hand, there are cases of ‘ambiguity’ when it is not possible to decide which one of two mutually exclusive interpretations is intended, deontic or epistemic. On the other hand, there are instances of ‘merger’ when the two meanings are present but they are not mutually exclusive, thus being both available and potentially intended. Consider the following example: (3)
- Newcastle Beer is a jolly good beer
- Is it?
- Well, it ought to be at that price (Coates 1983: 78)
In example (3) it is not clear whether the speaker is making a logical assumption of the kind ‘the beer costs a lot, therefore it is good’ (epistemic interpretation) or whether the cost itself, being so high, implies an obligation for the beer to be good (‘the beer costs a lot, therefore it MUST be good’, deontic interpretation). Furthermore, both deontic and epistemic meanings can be interpreted along a continuum from strong to weak. For deontic ought to the cline goes from the strong meanings ‘I advise you’, ‘I recommend you’ to the weak meanings ‘it would be proper, good, advisable’. For epistemic ought to the cline runs from the strong meaning ‘I assume that’ to the gradually weaker meanings ‘probably’, ‘it is reasonable to assume that’ down to the weakest ‘it is meant to’, ‘it is supposed to’ (Coates 1983: 71-75).
2. Aim and scope of the study
The working hypothesis of the present paper is that ought to, like other modal verbs, is undergoing a process of “subjectification”, a term which is used here with explicit reference to Traugott (1989, 1995). According to her, there is a strong tendency for objective meaning to be replaced by subjective meaning over time. In other words, referential meaning, which is a property of the externally described situation, tends to be replaced by meaning that describes speaker-internal features of the encoded situation. This means that, diachronically, speaker-oriented meaning is expected to prevail over context-driven meaning.
The process of “subjectification” is also associated with a gradual increase of epistemic meanings. In order to measure how and to what extent the phenomenon of “subjectification” has affected the semi-modal ought to, the analysis will then be carried out along semantic and syntactic lines. The following parameters have been set for the investigation:
- semantic value: deontic, epistemic or merger
- degree of subjectivity: subjective, weakly subjective or objective
- time reference of the main predication: past, present or future
- actualization: ought to as non-factual, counter-factual or actualized
- type of subject: animate, inanimate or dummy
- voice: passive or active
- polarity: positive or negative
- verb: having strong or weak agentivity
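The eight parameters listed above amount to an annotation scheme applied to every occurrence of ought to. As a rough illustration only, such a scheme could be encoded as a small record type. The class and field names below (OughtToToken, reference and so on) are hypothetical and are not taken from the study; the sample record uses only the two values that the paper itself assigns to example (4) in section 3.2, and leaves the remaining parameters unset.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SemanticValue(Enum):
    DEONTIC = "deontic"
    EPISTEMIC = "epistemic"
    MERGER = "merger"


class Subjectivity(Enum):
    SUBJECTIVE = "subjective"
    WEAKLY_SUBJECTIVE = "weakly subjective"
    OBJECTIVE = "objective"


class TimeReference(Enum):
    PAST = "past"
    PRESENT = "present"
    FUTURE = "future"


class Actualization(Enum):
    NON_FACTUAL = "non-factual"
    COUNTER_FACTUAL = "counter-factual"
    ACTUALIZED = "actualized"


class SubjectType(Enum):
    ANIMATE = "animate"
    INANIMATE = "inanimate"
    DUMMY = "dummy"


class Voice(Enum):
    ACTIVE = "active"
    PASSIVE = "passive"


class Polarity(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"


class Agentivity(Enum):
    STRONG = "strong"
    WEAK = "weak"


@dataclass(frozen=True)
class OughtToToken:
    """One annotated occurrence of ought to (hypothetical record layout)."""
    corpus: str                      # LOB, FLOB, Brown or Frown
    reference: str                   # text and line reference as cited in the paper
    semantic_value: SemanticValue
    subjectivity: Subjectivity
    # Parameters not yet annotated are left as None.
    time_reference: Optional[TimeReference] = None
    actualization: Optional[Actualization] = None
    subject_type: Optional[SubjectType] = None
    voice: Optional[Voice] = None
    polarity: Optional[Polarity] = None
    agentivity: Optional[Agentivity] = None


# Example (4) of the paper, classified there as deontic and objective.
example_4 = OughtToToken(
    corpus="LOB",
    reference="D10 147",
    semantic_value=SemanticValue.DEONTIC,
    subjectivity=Subjectivity.OBJECTIVE,
)
```

Using closed enumerations rather than free strings means that a misspelt category fails immediately instead of silently creating a new one when the tokens are counted.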
Four one-million-word corpora of the ICAME collection have been selected for the study, namely LOB and FLOB for British English, Brown and Frown for American English. Data will be analyzed in terms of both regional variation (British vs. American English) and short-term diachronic change7.
3. Analysis
The analysis is based on data retrieved from LOB, FLOB, Brown and Frown8. As tables 1 and 2 show, each corpus contains about one million words spread across 15 text categories, nine of non-fiction (from A to J) and six of fiction (from K to R). Categories are designed to match as closely as possible, but in some instances they fail to have exactly the same size.9

Table 1: Frequencies of ought to in LOB and FLOB

Subcorpus      Text-type                           LOB (1961) frequency       FLOB (1991) frequency
                                                   n. words   raw   norm      n. words   raw   norm
Press          A. Press: reportage                 88.000     3     34        88.595     1     11.3
Press          B. Press: editorial                 54.000     7     129.6     54.375     4     73.6
Press          C. Press: reviews                   34.000     2     58.8      34.185     2     58.5
General Prose  D. Religion                         34.000     8     235.3     34.265     4     116.7
General Prose  E. Skills and hobbies               76.000     2     26.3      76.342     6     78.6
General Prose  F. Popular lore                     88.000     3     34        88.574     5     56.4
General Prose  G. Belles letters, biography etc.   154.000    25    162.3     154.956    6     38.7
General Prose  H. Miscellaneous                    60.000     5     83.3      59.501     1     16.8
Learned        J. Learned writing                  160.000    14    87.5      160.972    9     55.9
Fiction        K. General fiction                  58.000     4     68.9      58.329     7     120
Fiction        L. Mystery and detective fiction    48.000     11    229.2     66.441     4     60.2
Fiction        M. Science fiction                  12.000     0     0         10.226     0     0
Fiction        N. Adventure and western fiction    58.000     11    189.6     58.237     2     34.3
Fiction        P. Romance and love story           58.000     8     137.9     58.271     6     103
Fiction        R. Humour                           18.000     1     55.5      18.101     1     55.2
Total                                              1.000.000  104   1532.7    1.021.370  58    879.3

Table 2: Frequencies of ought to in Brown and Frown

Subcorpus      Text-type                           Brown (1961) frequency     Frown (1992) frequency
                                                   n. words   raw   norm      n. words   raw   norm
Press          A. Press: reportage                 88.000     1     11.4      88.625     1     11.3
Press          B. Press: editorial                 54.000     5     92.6      54.300     6     110.5
Press          C. Press: reviews                   34.000     3     88.2      34.253     2     58.4
General Prose  D. Religion                         34.000     4     117.6     34.358     4     116.4
General Prose  E. Skills and hobbies               72.000     0     0         72.395     0     0
General Prose  F. Popular lore                     96.000     8     83.3      96.580     3     31
General Prose  G. Belles letters, biography etc.   150.000    6     40        150.950    14    92.7
General Prose  H. Miscellaneous                    60.000     1     16.7      60.386     0     0
Learned        J. Learned writing                  160.000    8     50        160.938    4     24.8
Fiction        K. General fiction                  58.000     4     69        58.377     2     34.3
Fiction        L. Mystery and detective fiction    48.000     5     104.1     48.184     5     103.8
Fiction        M. Science fiction                  12.000     1     83.3      12.032     0     0
Fiction        N. Adventure and western fiction    58.000     7     120.7     58.385     5     85.6
Fiction        P. Romance and love story           58.000     14    241.4     58.318     4     68.6
Fiction        R. Humour                           18.000     3     166.7     18.046     0     0
Total                                              1.000.000  70    1285      1.006.127  50    737.5
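
The normalized values (the "norm" columns of Tables 1 and 2) express each raw count as a rate per million words, so that text categories of different sizes can be compared. A minimal sketch of that calculation follows; the function name per_million is an illustrative choice, not something taken from the study, and the sample figures are read from the tables.

```python
def per_million(raw: int, n_words: int) -> float:
    """Scale a raw count to a rate per one million words."""
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    return raw / n_words * 1_000_000


# Category B (Press: editorial), values taken from Tables 1 and 2.
lob_b = per_million(7, 54_000)     # LOB:   7 occurrences in 54.000 words
flob_b = per_million(4, 54_375)    # FLOB:  4 occurrences in 54.375 words
frown_b = per_million(6, 54_300)   # Frown: 6 occurrences in 54.300 words

print(round(lob_b, 1), round(flob_b, 1), round(frown_b, 1))
```

Rounded to one decimal place, these three calls give the values reported for category B in LOB, FLOB and Frown.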
In Tables 1 and 2, the frequency of ought to in the four corpora is presented. To facilitate comparisons of individual findings, text-categories have been grouped in four sub-corpora: Press (from A to C), General Prose (from D to H), Learned writing (J), and Fiction (from K to R). Raw frequencies have been normalized per 1.000.000 words to enable comparisons between the text-types, which are of different sizes10. The counts also include phonologically reduced forms (e.g. oughta). As the tables indicate, the use of ought to has declined markedly over time in both varieties: from 104 occurrences in LOB down to 58 in FLOB and from 70 in Brown to 50 in Frown. In terms of regional variation, the reduction of the semi-modal is even more drastic in British English than in American English. These data confirm the general pattern of decrease in the frequency of modal verbs from 1961 to 1991-2 commented on by Leech (2003, 2004, 2006). The findings also support Leech’s observation that this decline has been more drastic in the case of infrequent modals such as shall, ought to and need (Leech 2003: 228-9). In this respect, however, claims that American English leads changes in recent British English (see Hundt 1997, Leech 2003) do not seem to hold true for ought to11. Instead, it might be reasonable to explain the decline of ought to in the British corpora in terms of social changes taking place in contemporary British society. As Fairclough (1992) highlights, there is a tendency in Britain for formality to be gradually replaced by informality. In this general process of “democratization of discourse”, overt power markers such as deontic modals tend to be eliminated (Fairclough 1992: 203). As a closer look at individual text-types reveals, the decline of ought to is fairly widespread. More specifically, ought to appears to have drastically decreased in the categories G, H, L, and N in FLOB.
A dramatic reduction of the semi-modal is also observable in the categories F and P of the American counterpart. A few exceptions are the categories E, F, and K for British English, and B and G for American English. Here, the ascendancy of ought to contradicts the general trend of its decline. The only category which has maintained a fairly high relative frequency in both varieties is D (Religion). This finding comes as no surprise if one considers that D contains formal written texts which typically convey moral values. For the remainder of the paper, the analysis will be conducted along the semantic and syntactic lines presented in section 2.
3.1 The semantic value of ought to
The deontic meaning of ought to has been considered as having primacy over the epistemic. Indeed, the earliest examples of this semi-modal are all deontic, and in late ME very few instances of epistemic ought to can be found. In this sense, ought to shows a development which is in contrast with other modals such as may, might, shall and would, whose epistemic meanings are attested in ME and perhaps also in OE (cf. Warner, 1990). Diachronically, ought to has also shown an increasing preference for the type of constructions that Coates (1983: 77-80) defines as ‘mergers’, i.e. contexts where the reading can be either deontic or epistemic.
Figure 1 shows the distribution of ought to as deontic, epistemic and merger across the two regional varieties of British English and American English.
[Bar chart; legend: Deontic, Epistemic, Merger; horizontal axis: LOB, FLOB, Brown, Frown; vertical axis: 0% to 70%]
Figure 1 Distribution of deontic, epistemic and merger ought to across LOB, FLOB, Brown and Frown
Data retrieved from the corpora indicate an increase in the frequency of ought to both as epistemic and as merger. More specifically, epistemic meanings amount to 21% of the occurrences of ought to in LOB and 31% in FLOB, whereas epistemic ought to accounts for 14% and 36% of the occurrences in Brown and Frown respectively. The number of ought to used as merger in the British English corpora ranges from 10% (LOB) to 17% (FLOB), while the figures are higher in the American English corpora (17% Brown, 24% Frown). In terms of short-term diachronic change, the increase in epistemic and merger ought to seems to challenge the primacy of deonticity. As for regional variation, the use of the semi-modal both as merger and as epistemic appears more frequent in the American variety. The increase in the use of ought to as epistemic or merger supports the idea that this semi-modal is undergoing a process of semantic change, namely that of “subjectification”, which is “hypothesized to motivate the shift from deontic to epistemic meanings” in general (Traugott, 1989: 37).
3.2 Ought to on the subjectivity-objectivity cline
All modal verbs can be used more or less subjectively depending on the extent to which they express opinions, attitudes or conclusions of the speaker (see the contributions in Athanasiadou et al 2006 and in Stein and Wright 1995, and the works of Benveniste 1971, Langacker 1990, Lyons 1977). In this sense, modality seems to imply subjectivity, and truly objective modal verbs cannot be considered as pertaining to natural language use. Nonetheless, the degree of subjectivity is generally higher with epistemic modals than with deontic. Coates, for example, regards subjectivity as the “crucial distinction between forms expressing root
possibility in English and forms expressing epistemic possibility in English” (1995: 59). It should be noted, however, that even though epistemic meanings tend to match with subjective uses, there are cases when deontic ought to is employed subjectively rather than objectively. In the present study the term subjective is meant essentially as speaker’s involvement and it is conceived as a gradable notion. Both epistemic and deontic ought to have been considered along a continuum as potentially expressing more or less subjective meanings. In the case of epistemic ought to the cline ranges from the most subjective meaning ‘I assume that’, to the quite neutral ‘probably’, up to the least subjective, almost objective meaning ‘it is reasonable to assume that’. When dealing with deontic ought to, the semi-modal is considered subjective when its meaning is that of ‘I advise you’, ‘I recommend you’, while it is objective if it can be replaced by expressions such as ‘it would be a good idea’, ‘it would be proper’. In this sense, deontic ought to is highly subjective whenever the speaker overtly takes responsibility for the imposing of the obligation, thus showing his/her authority over the addressee. Conversely, deontic ought to is objective when circumstances compel, necessity is external and authority comes from a source external to the speaker. The degree of subjectivity of ought to is therefore context-dependent and it could not be established if the examples were taken in isolation, detached from the specific contexts where they occurred. Apart from clear cases of either objectivity or subjectivity, the data also show instances of ‘weak-subjective’ ought to. The following excerpts exemplify objective, weak-subjective and subjective occurrences of ought to:
(4)
The Prayer Book, not the Bible, can justify the selection of just these five activities, and no more, as the Church’s other ministries of grace. The selection is inherently arbitrary and untheological. This idea behind it is presumably that the catechism ought to mention one ministerial action in the Church of England to correspond with each of Rome’s seven sacraments (LOB D10 147).
(5)
Steventon itself was more a line of cottages than a village, the church and manor house standing half a mile from where its centre ought to be (FLOB G 29 24).
(6)
“Same as I‘ve always said, women rule the roost and no man’s safe from ‘em.” Ought to be a better way of doing things. Take trees. He rattled on very happily. Trees have got the right idea (LOB L01 163).
In (4) ought to is deontic and objective: what it expresses is a formal obligation and the authority clearly comes from a source which is external to the speaker. In (5) ought to is used as epistemic since what is implied is a logical assumption and it is weakly subjective because the logical inference could be conveyed by means of expressions such as ‘it is expected, supposed’. In (6) the semi-modal is epistemic and it is used subjectively since the speaker assumes from his own
knowledge and experience in the world that things can be done differently. In this last example ought to be could be rephrased as ‘I assume that there is’12.
[Bar chart; legend: Objectivity, Weak subj., Subjectivity; horizontal axis: LOB, FLOB, Brown, Frown; vertical axis: 0% to 80%]
Figure 2 Distribution of objective, weak-subjective and subjective ought to across LOB, FLOB, Brown and Frown.
As Figure 2 shows, there has been an increase in weak-subjectivity and in subjectivity diachronically. The former is particularly significant in the British variety: weak-subjective ought to accounts for 15% of the occurrences in LOB and reaches up to 28% in FLOB. The latter is observable in both varieties: subjective ought to amounts to 13% of the occurrences in LOB and 21% in FLOB, and the figures are even higher in the American English corpora (Brown 37% and Frown 56%). Furthermore, the pattern of subjectivity being matched with epistemic meanings and objectivity with deontic meanings seems to be confirmed by the data retrieved from the four corpora, except for Frown. Here, subjectivity features so prominently because it is expressed not only epistemically but also deontically. As the data demonstrate, the path of development from deontic to epistemic meaning referred to in 3.1, which signals a process of “subjectification”, is paralleled by an increase in the subjective usage of ought to: initially objective uses give way to weakly and later more strongly subjective uses. This is the process by which “meanings become increasingly based in the speaker’s subjective belief state/attitude towards the proposition” (Traugott, 1989: 35).
3.3 Ought to: time reference and actualization
Epistemic and deontic construals of modal verbs typically differ from each other in terms of how they interact with tense and aspect (see Murvet 1996). In the case of epistemic modals, when the complement of the modal is stative, the eventuality
time may be understood to coincide with the modal time, yielding a so-called simultaneous reading (read present time reference). This is generally the most natural reading, though in some cases a future-shifted understanding is also possible. When the complement of the modal is eventive, however, it must be future-shifted with respect to the modal evaluation time. Let’s consider the following examples: (7)
A man who gives a good account of himself is probably lying, since any life when viewed from inside is simply a series of defeats. Yet, theoretically, writing autobiography ought not to be so horrendously difficult. The autobiographer, as Leslie Stephen long ago pointed out, has ex officio two qualifications of supreme importance in all literary work (Frown G 71 31).
(8)
For settling the debate, we commend both the Clinton and Bush camps. They did the right thing, and each side compromised to do so. Three presidential debates ought to bring the candidates, and the issues, into focus. All that's required is for voters to tune in. Let the shows begin (Frown B01 144).
In (7), where epistemic ought to is followed by a stative verb (‘be’), the eventuality time can be interpreted as simultaneous. Indeed, what is expressed is a factual statement. In the analysis, such examples have been given a present time reference reading. Conversely, in (8), where epistemic ought to is followed by an eventive verb (‘bring’) the eventuality can only be read as future-shifted. Here, the modalized action is projected from the utterance time, which is present, to the future. Unsurprisingly, if the complement of the epistemic modal contains the perfect form, the complement has a past-shifted interpretation relative to the modal time. So, in a sentence such as “it ought to have been obvious to Tony that nobody in authority there was going to have a person with my sort of reputation writing articles” (LOB G 14 180), the eventuality time is necessarily past-shifted. In contrast, most deontic-modal construals favour a forward-shifted reading of the eventuality time relative to the modal time, regardless of the aspectual class of the complement of the modal. This can be observed in the examples below:
(9)
Naturally, I was very disappointed about not having this holiday with you, Bill. I'd so looked forward to getting away with you alone. As it is, we shall only have a few days to ourselves, for I suppose I ought to get back to mother as soon as I possibly can (LOB P 23 12).
(10)
She convinced him that he ought to be a member of some of the small tea-drinking parties she held at her rooms and in the end he complied with her wishes, although it was only rarely that he added anything to the random conversations (Brown P25 07 40).
In (9) ought to is followed by an eventive verb (‘get back’) whereas in (10) it is complemented by a stative verb (‘be’). In spite of this difference in aspectual class, a future-time shiftedness of deontic ought to can be noticed in both examples. Indeed, in both cases one can perceive the projection of the duty/obligation from the present time of the utterance to the future. Bearing this in mind, occurrences of deontic ought to are expected to be characterized by future time reference whereas instances of epistemic ought to are more likely to be variously marked by present, future, and even past time reference. Therefore, future time reference can be expressed by either epistemic or deontic ought to whereas present and past time generally occur with epistemic ought to. Figure 3 shows the distribution of future, present and past time reference in LOB, FLOB, Brown and Frown.
[Bar chart; legend: Future, Present, Past; horizontal axis: LOB, FLOB, Brown, Frown; vertical axis: 0% to 70%]
Figure 3 Distribution of future, present and past time reference across LOB, FLOB, Brown and Frown.
The findings confirm the expectation that future time orientation prevails, with the sole exception of FLOB, where present time reference is the most frequent. Here, the result is certainly affected by the high number of both epistemic and merger ought to (retrieved from the corpus). Even more significantly, though, the
diagram highlights that present time reference, which is associated with epistemic and merger ought to, is overall quite numerous. Furthermore, the figure indicates the extent to which both present and past time reference have progressively increased over time. In the former case, the rise is from 40% of occurrences in LOB to 45% in FLOB and from 34% of occurrences in Brown to 44% in Frown. In the latter case, the increase is from 6% of occurrences in LOB to 17% in FLOB and from 3% in Brown to 8% in Frown. As the data indicate, the four corpora differ in terms of epistemic meanings having either present or past time reference: in the British English corpora, there is a tendency for epistemic (and merger) ought to to have a past tense orientation whereas in the American English corpora epistemic (and merger) ought to generally yield a present tense reference reading. As far as the expression of factuality is concerned, the semi-modal ought to can occur as either non-factual or counter-factual; in a few cases it may also be actualized. In a sentence such as ‘your car is dirty, you ought to wash it’, ought to is non-factual rather than counter-factual in the sense that the outcome of the directive is not known. From this it follows that whenever the eventuality time relative to the modal time is future-shifted, ought to is interpreted as non-factual. In contrast, when the semi-modal is followed by a present perfect construction (‘you ought to have seen him’) or by a verbal construction with progressive aspect (‘I definitely ought to be settling on the subject’), ought to is understood as counter-factual because the implication is that something which was/is required/likely was/is not done, did/does not occur. Ought to is actualized in expressions such as ‘she convinced him that he ought to be a member of the club and in the end he complied with her wishes’, where the request/desire of the speaker is eventually accomplished on the part of the addressee. 
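The default readings described in this section can be summarized as a small decision procedure. The sketch below is a deliberate simplification offered for illustration, not the classification procedure used in the study: the function names and string labels are hypothetical, merger cases are left out, and the contextual judgements on which the analysis relies (for instance actualized uses, or a future-shifted reading of a stative complement) are not modelled.

```python
from typing import Optional


def default_time_reference(semantic_value: str, complement: str,
                           perfect: bool = False) -> str:
    """Default time reading of the eventuality relative to the modal time.

    semantic_value: "deontic" or "epistemic" (mergers are not covered here)
    complement:     "stative" or "eventive"
    perfect:        True if the complement contains the perfect form
    """
    if complement not in ("stative", "eventive"):
        raise ValueError(f"unknown aspectual class: {complement!r}")
    if semantic_value == "deontic":
        # Forward-shifted regardless of the aspectual class of the complement.
        return "future"
    if semantic_value == "epistemic":
        if perfect:
            return "past"      # past-shifted relative to the modal time
        if complement == "stative":
            return "present"   # simultaneous reading, the most natural one
        return "future"        # eventive complements are future-shifted
    raise ValueError(f"unsupported semantic value: {semantic_value!r}")


def default_factuality(time_reference: str, perfect: bool = False,
                       progressive: bool = False) -> Optional[str]:
    """Default factuality reading; None where only context can decide."""
    if perfect or progressive:
        return "counter-factual"
    if time_reference == "future":
        return "non-factual"
    return None


# Examples (7) to (10) and the perfect construction quoted from LOB G 14 180.
print(default_time_reference("epistemic", "stative"))                # (7)
print(default_time_reference("epistemic", "eventive"))               # (8)
print(default_time_reference("deontic", "eventive"))                 # (9)
print(default_time_reference("deontic", "stative"))                  # (10)
print(default_time_reference("epistemic", "stative", perfect=True))  # perfect
```

Returning None from default_factuality for the remaining cases reflects the fact that actualized and other non-default readings depend on context rather than on the form of the complement alone.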
Figure 4 displays the distribution of ought to as non-factual, counter-factual or actualized in the four analyzed corpora.
[Bar chart; legend: Non-factual, Counter-factual, Actualized; horizontal axis: LOB, FLOB, Brown, Frown; vertical axis: 0% to 80%]
Figure 4 Distribution of non-factual, counter-factual and actualized ought to across LOB, FLOB, Brown and Frown.
As the column chart shows, ought to occurs chiefly as non-factual, a result which matches the high number of future tense oriented ought to just observed. Nonetheless, there are cases when non-factual ought to occurs in present tense verbal constructions. More specifically, non-factual ought to accounts for 62% of occurrences in LOB and 59% in FLOB, and it is even more evident in the two American English corpora, with 63% of occurrences in Brown and 80% in Frown. In the four corpora, counter-factual ought to is generally associated with present tense but also, though less frequently, with past time reference. Conversely, the few instances of actualization tend to occur when the semi-modal has a future tense orientation.
3.4 Type of subject collocating with ought to
Deontic ought to favours co-occurrence with animate subjects, whereas epistemic ought to typically collocates with inanimate subjects or with the so-called dummy subject (impersonal it-construction). As Nordlinger and Traugott (1997) comment, one feature of the idiosyncratic development of ought to in comparison to other modal verbs is its tendency over time to favour non-specific, inanimate or impersonal subjects. This observation is confirmed by the data retrieved from the analyzed corpora. Figure 5 illustrates the distribution of the different types of subject (animate, inanimate, dummy) in collocation with ought to across the four corpora.
[Bar chart; legend: Animate, Inanimate, Dummy sub.; horizontal axis: LOB, FLOB, Brown, Frown; vertical axis: 0% to 80%]
Figure 5 Distribution of animate, inanimate and dummy subject in collocation with ought to across LOB, FLOB, Brown and Frown.
In American English, the data show that the use of inanimate subjects has increased over time (29% of occurrences in Brown, 56% in Frown). In British English, the figure indicates that there has been a rise in the use of dummy subject constructions (6% of occurrences in LOB, 10% in FLOB). Besides signalling regional variation, both these findings confirm the previously observed trend of epistemic ought to gradually replacing deontic ought to.
The number of animate subjects is markedly high only in Brown, where occurrences of deontic ought to are particularly frequent. These results further support the hypothesis that ought to is undergoing a process of “subjectification”.
3.5
Voice and polarity in ought to verbal constructions
In general terms ought to appears to be voice neutral. However, passive voice is a feature normally associated with deonticity whereas active voice is related to epistemic meanings. Figure 6 shows the distribution of active and passive voice constructions across the analyzed corpora.
[Column chart: percentage of active and passive constructions in LOB, FLOB, Brown and Frown]
Figure 6 Distribution of ought to active and passive verbal constructions across LOB, FLOB, Brown and Frown.
As the diagram illustrates, active voice is overall more frequent than passive voice in the four corpora. This result can be explained by the fact that active voice is found across all types of ought to constructions. In other words, active voice combines, with no clear preference, with the epistemic, deontic and merger readings of ought to. Diachronically, a slight increase in the use of active voice is also recorded in both varieties (73% of occurrences in LOB and 79% in FLOB, 94% of occurrences in Brown and 96% in Frown). Conversely, passive ought to constructions express deontic meanings only. This use is generally much less frequent and it has even decreased over time (27% of occurrences in LOB and 21% in FLOB, 6% of occurrences in Brown and 4% in Frown). In terms of regional variation, the data indicate that passive voice in ought to constructions is employed more consistently in British English than in American English. With regard to polarity, negative forms are typically associated with deontic meanings, especially when obligation ought to is employed to express criticism towards oneself or others.
Figure 7 shows the distribution of ought to positive and negative verbal constructions in the corpora.
[Column chart: percentage of positive and negative constructions in LOB, FLOB, Brown and Frown]
Figure 7 Distribution of ought to positive and negative verbal constructions across LOB, FLOB, Brown and Frown.
Positive polarity is much more frequent than negative in all corpora. Indeed, ought to verbal constructions with positive polarity amount to 81% of occurrences in LOB, 97% in FLOB, 94% in Brown and 92% in Frown. As observed for active voice, affirmative form is also a syntactic feature which can characterize ought to as epistemic, deontic or merger. This fact explains the abundance of ought to in affirmative rather than negative contexts. Quite unsurprisingly, the few instances of ought to having negative polarity are also cases where the semantic value of the semi-modal is deontic. This association of deonticity with negative polarity is partly contradicted in the Frown corpus only. Here, occurrences have been retrieved where epistemic ought to appears in a negative construction. From a diachronic perspective, the use of negative polarity has decreased in the British English corpora (19% of occurrences in LOB and 3% in FLOB) whereas it has slightly increased in the American English ones (6% of occurrences in Brown and 8% in Frown). The latter result is in all likelihood due to the presence of negative epistemic ought to in Frown just observed.
3.6
Ought to and agentivity
The term agentivity is used in this study with reference to the semantics of the verbs occurring in collocation with ought to. Verbs having weak agentivity are those which express sensing (thinking, feeling, perceiving), verbal activity (saying) or states of being. They are verbs that do not imply any material action. In contrast, verbs with strong agentivity are those expressing the notion of doing, bodily, physically, materially or of behaving, physiologically. These verbs imply the idea of an action being performed by someone and possibly affecting some entities.
Figure 8 shows the distribution of strong and weak agentivity in ought to constructions across the corpora.
[Column chart: percentage of strong and weak agentivity in LOB, FLOB, Brown and Frown]
Figure 8 Distribution of strong and weak agentivity in ought to constructions across LOB, FLOB, Brown and Frown.
As the column chart illustrates, the reduction in the use of strong agentivity over time is paralleled by an increase in its counterpart. More precisely, the (percentage) distribution of weak agentivity in the corpora is as follows: from 38% of occurrences in LOB up to 48% in FLOB, and from 31% in Brown up to 64% in Frown. This rise is particularly significant since the data show a tendency for verbs with weak agentivity to occur in verbal constructions where ought to is typically epistemic. The data also suggest a connection between strong agentivity and deontic ought to. However, it should be observed that, though rare, some examples have been found of deontic ought to in verbal constructions having weak agentivity. In terms of regional variation, changes over time are once again more marked in American English.
4.
Conclusion
The analysis has shed some light on the identification of a possible path of semantic change featuring the development of ought to. Furthermore, it has brought to the fore some observations concerning regional variation as well as short-term diachronic change. In Traugott’s terms (1989, 1995), “subjectification” is a diachronic phenomenon characterized by the gradual increase of epistemic meaning. Data retrieved from the four corpora seem to validate the hypothesis that ought to is undergoing a process of “subjectification” in that findings signal a progressive move from deontic ought to to merger and epistemic ought to. The concomitant increase of linguistic features generally associated with epistemic ought to can be taken as a further indication that the semi-modal is undergoing a process of
semantic change. More specifically, the following grammatical changes have been observed:
• an increase in the use of subjective ought to, which typically coincides with the use of the semi-modal having the semantic value of epistemic
• an increase in present and past time reference, both generally associated with epistemic meanings
• an increase in the use of inanimate and dummy subject, both chiefly occurring in collocation with epistemic ought to
• a reduction in the use of passive forms, particularly recurrent in deontic ought to verbal constructions
• a reduction in the use of ought to verbal constructions having negative polarity, which normally mark deontic meanings
• an increase in weak-agentivity ought to constructions, which generally signal the presence of epistemic ought to.
In terms of regional variation, this process seems to have developed more rapidly in American English than in British English (see Hundt 1997 for a discussion of the leading role of American English in grammatical change) since the data show that changes are in general more significant when the comparison is between Brown and Frown than between LOB and FLOB. Since one can assume that “subjectification” like other kinds of grammatical change will move from spoken to written language, the observed drift from deontic to epistemic ought to could be more pronounced in spoken discourse. In this sense, looking at spoken data might be an interesting objective of future research. However, one should also bear in mind that ought to, due to its formality, appears to be a modal more typical of the written than the spoken mode. To conclude, it would also be interesting to expand the scope of the present study beyond the period of thirty years covered here, specifically focussing on current language use in different varieties of English and taking into consideration more recently created corpora.
Notes
1
I thank two anonymous reviewers for their helpful comments on an earlier draft of this paper.
2
In the literature, ought to has been variously referred to as quasi-modal (cf. Traugott 1997: 193), semi-modal (cf. Biber et al. 1999: Ch. 6.6), and marginal modal (cf. Quirk et al. 1985: 138-140).
3
The morphological tense distinction between the modals may and might can only be found in American conservative dialects. In standard British English and American English, the modal verb might is the more tentative form of may rather than its past form.
4
The construction with bare infinitive arises in ME and survives to the present day, especially in non-assertive contexts, but it has never become standard.
5
In negative constructions, ought not to (like must not) does not represent a lack of obligation but an obligation not to do something.
6
As clearly highlighted by Coates (1983) and more recently by a number of studies on the topic (Westney 1995, Papafragou 2000, Wärnsby 2006), modality and its linguistic realizations are to be viewed more and more as a fuzzy system affected by ‘indeterminacy’ and often resulting in phenomena like ‘gradience’, ‘ambiguity’ and ‘merger’.
7
LOB and Brown comprise texts published in 1961 whereas FLOB and Frown contain texts published in 1991-2. Thus, the data under analysis only allow for a short-term diachronic investigation. As Leech suggests, dramatic kinds of grammatical changes are not likely to happen in one generation, but if we understand these changes to include changes in frequency, then grammatical changes also occur in a time span of thirty years (see Leech, 2003).
8
According to Krug, this type of corpus-based investigation is a borderline case between synchrony and diachrony, for which he coins the term “brachychronic” (study) (Krug 2000: 34).
9
The Brown family is constituted by comparable corpora in the sense that LOB, FLOB, Brown and Frown were built according to the same principles of design and selection. However, because of unavoidable sampling anomalies, the ideal of completely matching corpora was not totally achieved in practice (cf. Leech 2003).
10
Due to the small number of findings, the author decided not to include any significance testing for the purposes of the article.
11
This comment refers only to the decline in the use of ought to. Indeed, for other kinds of grammatical change, American English appears to have a leading role (see the analysis in the following sections of the present paper).
12
As suggested by an anonymous reviewer, a deontic reading for (5) and (6) would also be possible. This is even more likely in (6), where the speaker seems to express a wish for there being a better way of doing things more than an assumption that there is one.
References
Athanasiadou, A., C. Canakis and B. Cornillie (eds.) (2006), Subjectification: various paths to subjectivity. Berlin/New York: Mouton de Gruyter.
Benveniste, E. (1971), Problems in general linguistics. Coral Gables: University of Miami Press.
Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan (eds.) (1999), The Longman grammar of spoken and written English. London: Longman.
Bybee, J., R. Perkins and W. Pagliuca (1994), The evolution of grammar: Tense, aspect, and modality in the languages of the world. Chicago: University of Chicago Press.
Coates, J. (1983), The Semantics of the modal auxiliaries. London: Croom Helm.
Coates, J. (1995), ‘The expression of root and epistemic modality in English’, in: J. Bybee and S. Fleishman (eds.), Modality in grammar and discourse. Amsterdam: Benjamins. 55-66.
Fairclough, N. (1992), Discourse and social change. Cambridge: Polity Press.
Heine, B. (1995), ‘Agent-Oriented vs. Epistemic Modality: Some Observations on German Modals’, in: J. Bybee and S. Fleishman (eds.), Modality in grammar and discourse. Amsterdam: Benjamins. 17-53.
Hundt, M. (1997), ‘Has British English been catching up with AmE over the past thirty years?’, in: M. Ljung (ed.), Corpus-based studies in English: Papers from the seventeenth international conference on English language research on computerized corpora (ICAME 17). Amsterdam: Rodopi. 135-151.
Kemenade, A. (1993), ‘The history of the English modals: a reanalysis’, Folia Linguistica Historica, 13: 143-166.
Krug, M. (2000), Emerging English modals. A corpus-based study of grammaticalization. Berlin/New York: Mouton de Gruyter.
Langacker, R.W. (1990), ‘Subjectification’, Cognitive linguistics, 1: 5-38.
Leech, G. (2003), ‘Modality on the move: The English modal auxiliaries 1961-1992’, in: R. Facchinetti, M. Krug and F. Palmer (eds.), Modality in contemporary English. Berlin/New York: Mouton de Gruyter.
Leech, G. (2004), ‘Recent grammatical change in English: data, description, theory’, in: K. Aijmer and B. Altenberg (eds.), Advances in Corpus Linguistics. Papers from the 23rd international conference on English language research on computerized corpora (ICAME 23). Amsterdam/New York: Rodopi. 61-81.
Leech, G. (2006), ‘Recent grammatical change in written English 1961-1992: some preliminary findings of a comparison of American with British English’, in: A. Renouf and A. Kehoe (eds.), The Changing face of corpus linguistics. Amsterdam/New York: Rodopi. 185-204.
Lyons, J. (1977), Semantics, vol. 2. Cambridge: Cambridge University Press.
Mindt, D. (1995), An empirical grammar of the English verb. Modal verbs. Berlin: Cornelsen.
Murvet, E. (1996), ‘Tense and modality’, in: S. Lappin (ed.), The Handbook of contemporary semantic theory. Oxford: Blackwell. 345-358.
Nordlinger, R. and E. Closs Traugott (1997), ‘Scope and the development of epistemic modality: evidence from ought to’, English language and linguistics, 1(2): 295-317.
Oxford English Dictionary Online (http://dictionary.oed.com).
Palmer, F.R. (1979), Modality and the English verb. London: Longman.
Palmer, F.R. (1986), Mood and modality. Cambridge: Cambridge University Press.
Palmer, F.R. (1990), Modality and the English modals. London: Longman.
Papafragou, A. (2000), Modality: issues in the semantics and pragmatics interface. Amsterdam: Elsevier.
Plank, F. (1984), ‘The modals story retold’, Studies in language, 8: 305-364.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A comprehensive grammar of the English language. London: Longman.
Roberts, I.G. (1985), ‘Agreement parameters and the development of English modal auxiliaries’, Natural language and linguistic theory, 3: 21-58.
Stein, D. and S. Wright (eds.) (1995), Subjectivity and subjectivisation: Linguistics perspectives. Cambridge: CUP.
Traugott, E.C. (1989), ‘On the rise of epistemic meanings in English: an example of subjectification in semantic change’, Language, 65: 31-55.
Traugott, E.C. (1995), ‘Subjectification in grammaticalization’, in: D. Stein and S. Wright (eds.). 31-54.
Traugott, E.C. (1997), ‘Subjectification and the development of epistemic meanings: the case of promise and threaten’, in: T. Swan and O. Westvik (eds.), Modality in germanic languages: Historical and comparative perspectives. Berlin: Mouton de Gruyter. 185-210.
Traugott, E.C. and R. Dasher (2002), Regularity in semantic change. Cambridge: CUP.
Warner, A. (1990), ‘Reworking the history of English auxiliaries’, in: S. Adamson et al (eds.), Papers from the 5th international conference on English historical linguistics. Amsterdam: Benjamins. 537-58.
Wärnsby, A. (2006), (De)coding modality: the case of must, may, måste, and kan. Lund Studies in English 113. Lund: Lund University.
Westney, P. (1995), Modals and periphrastics in English. Tübingen: Max Niemeyer.
On the use of split infinitives in English
Javier Calle-Martín and Antonio Miranda-García
University of Málaga, Spain
Abstract
A split infinitive construction denotes a particular type of syntactic tmesis in which a word or phrase, especially an adverb, occurs between the infinitive marker to and the infinitive of the verb. Although rare from a statistical viewpoint, the earliest instances of the split infinitive date back to the 13th century, in which a personal pronoun, an adverb or two or more words could appear in such environments (Visser 1984, II: 1038-1045). Its use drops drastically throughout the 16th century, but it begins to gain ground again in the 19th century, hence resisting the severe criticisms of grammarians. Nowadays, however, a search for these types of constructions in a present-day English corpus reveals that the prejudice against split infinitives is receding. Therefore, this paper investigates the actual use of the construction in different corpora with the following objectives: a) to provide the statistics of the construction from a historical perspective; b) to analyse the type of adverbs occurring in these contexts; c) to offer a taxonomy of the adverb from a functional perspective; d) to investigate the combined effect of stress and rhythm in the development of the construction; and e) to review the actual use of a prototype splitting in present-day English usage.1
1.
Introduction
A split infinitive is elsewhere defined as a type of syntactic tmesis in which a word, especially an adverb, occurs between to and the infinitival form of a verb. Different labels have been used to name this particular ordering of English, spiked adverb or cleft infinitive among others, but the term split infinitive has eventually superseded all its predecessors (Smith 1959: 270). From a historical perspective, this type of English splitting has gone through several ups and downs, the earliest instances dating back to the 13th century, wherein a pronominal, an adverb or even two or more words could appear in such environments (Visser 1984, II: 1038-1045; see also Fischer 1992: 329-330). Throughout the 14th century, the construction gained more ground, although it was a kind of stylistic hallmark, being non-existent in some authors whilst occasional or frequent in others (Mustanoja 1960: 515; Fischer 1992: 329). After this sudden rise in Middle English, its usage dropped drastically throughout the 16th and the 17th centuries without any apparent justification for its near disappearance. The definite take-off took place in the 18th century, resisting the severe criticisms of grammarians, who disapproved of the construction on the grounds of a) the prescriptivist objection to its lack of prestige (Crystal 1984: 27-28); and b) the impossibility of such splitting in either classical or other Germanic languages (Crystal 1985: 16). Nowadays, however, the construction
may be deemed in the prime of its life as it has eventually managed to find its way both in speech and writing, literary and scientific prose included. From a scholarly point of view, the split infinitive has been a recurrent topic of discussion in the relevant literature. However, a close reading of the existing publications reveals a number of moot points as regards the origin, use, and function of the construction. First, there is no consensus as to the origin of split infinitives. On the one hand, there are scholars who consider that it is the result of a French influence arguing that the construction first appeared after the Norman conquest, notwithstanding the syntactic differences between the English and the French models (see http://en.wikipedia.org/wiki/Split_infinitive). There are others, on the other hand, who believe in an intrinsic development of the English language in the sense that infinitival clauses progressively adopted the word order of finite verbs. Actually, the pre- or post-verbal position of the adverb in finite verbs serves to modify the actual meaning of an utterance inasmuch as “when the adverb precedes a verb, the verb seems more important to our feeling than the adverb even though the adverb may also be stressed” (Curme 1927: 341; see also Curme 1914: 41-45) as in he completely failed to understand it vs. he failed completely to understand it. The infinitive, therefore, has progressively acquired the status of any finite verb to the extent that an adverb may occupy the preverbal position exactly as in any finite verb, as in boldly to go, to go boldly, or to boldly go (Crystal 1984: 30; see also Bryant 1946: 40; Close 1987: 220; Nagle 1994: 238). Secondly, the relevant literature is also erratic as to whether the infinitive marker to belongs to the finite verb or to the infinitive. 
Whilst the prescriptivist attack is mostly based on the assumption that the infinitive marker is regarded as inseparable from its verb (Hall 1882: 18), there are scholars who consider that no splitting takes place at all because to is not a part of the infinitive, it just goes with it (Jespersen 1930: 191; Malone 1941: 52; Huddleston and Pullum 2002: 581).2 Thirdly, the accounts of the contexts in which the splitting is bound to occur seem even more contradictory. Quirk et al. affirm that it is common “with subjuncts of narrow orientation and hence perhaps especially where the infinitive is a gradable verb” (1985: 497-498; see also Close 1987: 217). Neely, on the contrary, concludes that “[…] the decision to pre-position or post-position an adverb is made before the form of the verb is determined, irrespective of whether the verb is expanded or not” (Neely 1978: 404). Following many present-day English grammars, our starting point in this paper is to consider split infinitives as an intrinsic development of the English language that stems from the speaker’s attempt to avoid ambiguity by preventing the false association of the adverb with the preceding verb (Huddleston and Pullum 2002: 581-582). Even though we acknowledge this as the motivating force of split infinitives, it is our intention to investigate the actual use of the construction in synchronic and diachronic corpora and to examine the influence of other factors on the eventual spread of a particular construction. Therefore, the
present paper pursues the following objectives: a) to count the occurrences and provide the statistics of the construction from a historical perspective; b) to examine the kinds of adverbs and verbs appearing in these contexts; c) to investigate the contribution of stress patterns in the standardization of some constructions; and d) to explore any possible variation in present day English. Accordingly, our study will be divided into four sections: 1) the first lists the sources used for our investigation; 2) the second deals with the history of the split infinitive from a quantitative approach; 3) the third qualitatively analyses the construction by considering whether there is any kind of restriction as to the type and function of the adverb/infinitive, and whether there is a recurrent rhetorical pattern behind the phenomenon; and 4) our conclusions close the paper.
2.
Methodology
Different corpora have been used as sources of evidence to account for the historical period 1640-1920. The major drawback faced by the analyst of these constructions is the strong prescriptive bias against split infinitives throughout the Modern English period, a fact which makes them practically disappear from corpus data, particularly in the case of consciously edited texts. In view of this constraint, more than one corpus was required for our investigation, not only for quantitative purposes to provide a sizeable input, but also for chronological reasons to cover such a large time-span of English history. Accordingly, four different sub-periods have been distinguished on account of the corpus used:
The Lampeter Corpus of Early Modern English Tracts has been used to investigate the period 1640-1710, amounting to 1.1 million words (Siemund and Claridge 1997: 61-70).
The Corpus of Late Modern English Texts (extended version, henceforth abbreviated CLMET) has been the source for the analysis of three further sub-periods of 70 years each (1710-1780; 1780-1850; and 1850-1920). With 15 million tokens of late Modern British English, the corpus stands out as an appropriate source for studying the rise and development of the construction inasmuch as it has been designed to account for any possible “variation in terms of text genre and authorial social background” (De Smet 2005: 70-71).3
The Corpus of English Novels (henceforth abbreviated CEN) has been processed for the retrieval of instances in the period 1850-1920. It is a 25-million-word corpus of late 19th and early 20th-century novels by British and North American novelists, chosen on account of the number and variety of split infinitives found. 
Even though acknowledging that fiction may imply a more artificial language than non-fiction, thereby written to convey a given stylistic effect, CEN is still taken as a reliable input to investigate the phenomenon, particularly from a qualitative standpoint in terms of the type of adverbs which typically co-occur under these circumstances. To avoid any kind of bias, however, CEN has been
exclusively used as supplementary material for the third sub-period covered by the CLMET (1850-1920).
The British National Corpus (henceforth BNC) has been used to investigate variation in the second half of the 20th century. Given the size of the corpus and the impossibility of searching for to followed by an adverb in SARA, the texts were stripped of all tagging and parsing so that the resulting plain text could be processed accordingly.
From a methodological standpoint, WordSmith Tools (Scott 1996) was used for the automatic retrieval of the instances. On the one hand, the occurrences of to + an adverb in –ly were automatically generated whilst split infinitives with other adverbials, on the other, had to be retrieved manually, hence the need for multivariate searches.
3.
Quantitative approach
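The automatic retrieval of to + an adverb in –ly described in the Methodology section can be approximated, purely for illustration, with a regular expression over plain text. The sketch below is not the WordSmith Tools procedure used in the study; the pattern, the function name candidate_splits and the sample sentence are all hypothetical. It also shows why the hits still require manual checking, since any word ending in -ly after to (here a proper name) is returned as a candidate.

```python
import re

# Hypothetical pattern (an assumption, not the study's query):
# the word "to", one word ending in -ly, then one further word.
SPLIT_LY = re.compile(r"\bto\s+(\w+ly)\s+(\w+)", re.IGNORECASE)


def candidate_splits(text):
    """Return (ly_word, next_word) pairs for each 'to + -ly word + word' match."""
    return [(m.group(1).lower(), m.group(2).lower())
            for m in SPLIT_LY.finditer(text)]


# Invented sample text, not taken from any of the corpora.
sample = ("He failed to fully understand it, but she hoped to go boldly. "
          "They wrote to Emily yesterday.")

for adverb, following in candidate_splits(sample):
    # 'emily yesterday' is a false positive that manual inspection would discard.
    print(adverb, following)
```

Split infinitives with adverbs that do not end in -ly (just, quite, ever, almost) fall outside such a pattern, which matches the authors' remark that those instances had to be retrieved manually.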
This section views the development of split infinitives from a statistical perspective from the second half of the 17th century. Figure 1 shows the number of instances found in the different sub-periods. As corpus size differs from one period to another, the figures have been normalized to a text of 10^4 sentences so as to allow their subsequent comparison (Biber 1988: 13-14).4 Normalization in terms of the number of sentences in a corpus is necessary in the case of split infinitives in view of a) the limited number of instances found, otherwise the resulting figures would be irrelevant if based on the total of running words; and b) the syntactic nature of the construction, which does not affect a single word but the actual verb phrase.
[Column chart: split infinitives per 10^4 sentences. 1640-1710: 6.1; 1710-1780: 6.47; 1780-1850: 8.07; 1850-1920: 20.26; 1950-2000: 22.86]
Figure 1: Distribution of split infinitives in each sub-period
As illustrated, split infinitives are found to be sporadically used between 1640 and 1850, amounting to 6.1, 6.47 and 8.07 instances every 10,000 sentences, respectively. However, the definite rise of the construction takes place from the second half of the 19th century inasmuch as the figures between the years 1850-
1920 almost triple the others, a fact plausibly due to the mitigating censure of prescriptivism, which was then becoming less influential (see Huddleston and Pullum 2002: 581). During the second half of the 20th century, the figures practically correlate with those found one century earlier as the phenomenon is observed in 22.86 instances every 10,000 sentences in the BNC. In view of these figures, there is evidence to state that the construction is unlikely to gain more ground in the future as its occurrence has been practically the same for the last 150 years. To a broad extent, this is still the bias of prescriptivism, which has eventually generated some kind of consensus towards the disapproval of the construction. There is yet a long way for split infinitives to equal the actual occurrence of non-splitting positions. A survey of the distribution of infinitival clauses modified by an adverb, either before the infinitive marker to (e.g. sincerely to apologise) or after the infinitive itself (e.g. to apologise sincerely), reveals an overwhelming preference for non-splitting constructions.5
[Column chart: instances per 10^4 sentences, non-splitting vs. splitting constructions. 1640-1710: 97 vs. 6; 1710-1780: 82 vs. 6; 1850-1920: 68 vs. 20]
Figure 2: Distribution of non-splitting and splitting constructions
Figure 2 illustrates the distribution of splitting and non-splitting constructions in the periods 1640-1710, 1710-1780 and 1850-1920. In view of these results, the Lampeter corpus shows that for the period 1640-1710 the ratio of non-splitting to splitting constructions is 97:6 instances every 10^4 sentences, a significant difference because it comprises a specific time-span wherein the construction was left to its practical obliteration. The period 1710-1780, on the other hand, shows a slight decrease of non-splitting constructions to the extent that the ratio becomes 82:6 examples every 10^4 sentences. By contrast, a significant decrease of non-splitting constructions is observed thereafter, thus coinciding with the actual spread of the construction, as the ratio drops to 68:20 instances per 10^4 sentences in the third part of the CLMET.
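The normalization used throughout this section (instances per 10^4 sentences) is a simple rescaling of raw counts by corpus size measured in sentences. A minimal sketch follows; the function name and the raw counts in the first example are hypothetical and serve only to show the arithmetic, whereas the rates compared at the end are the normalized figures reported for Figure 1.

```python
def per_10k_sentences(hits, sentences, base=10_000):
    """Rescale a raw count to a rate per `base` sentences."""
    if sentences <= 0:
        raise ValueError("sentence count must be positive")
    return hits * base / sentences


# Hypothetical raw counts (not from the study): 61 hits in 100,000 sentences.
example_rate = per_10k_sentences(61, 100_000)

# Normalized rates as reported in the text for Figure 1.
reported = {
    "1640-1710": 6.1,
    "1710-1780": 6.47,
    "1780-1850": 8.07,
    "1850-1920": 20.26,
    "1950-2000": 22.86,
}

# How many times higher the 1850-1920 rate is than each earlier sub-period.
ratios = {
    period: reported["1850-1920"] / rate
    for period, rate in reported.items()
    if period < "1850"
}

print(example_rate)
for period, ratio in ratios.items():
    print(period, round(ratio, 2))
```

Because the rates are already expressed on a common base, sub-periods drawn from corpora of very different sizes can be compared directly, which is the purpose of the normalization described above.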
4.
Qualitative approach
This section analyses the phenomenon from several perspectives in order to reevaluate some traditional tenets which have become almost axiomatic in the literature. A corpus-based methodology is accordingly used to provide, to a certain extent, a plausible justification for the conditions which directly or indirectly affect the origin and development of a particular splitting (to the detriment of others). In light of this, we propose the qualitative study of the construction from a fourfold perspective: the first discusses the semantic nature of the splitting adverb; the second approaches the status of the adverb from a functional perspective in terms of adjuncts and subjuncts; the third proposes a rhetorical analysis of the construction to obtain the particular rhythmic patterns behind split infinitives, if any; the fourth reviews the actual use of a prototype splitting in current usage. Methodologically, our study has been restricted to the input of CLMET and CEN, covering the time-span 1710-1920. There are two reasons for this choice. On the one hand, it is the period witnessing the actual spread of the construction, being therefore a suitable input to investigate these early preferences. On the other, a qualitative approach of this kind would be unfeasible by using the BNC on account of the number of instances obtained.
4.1
Semantic approach
The relevant literature is, to a broad extent, often inconclusive as to the adverbs which produce the splitting as it is limited to mentioning that the phenomenon currently occurs with adverbs of degree, of manner, of time or of other categories (Malone 1941: 53). Huddleston and Pullum (2002: 582) go one step beyond other similar present-day English grammars when they affirm that the adverbs that particularly lend themselves to placement in this position are those marking degree as well as actually, even, further, etc. Others just include the list of adverbs appearing in these constructions on a quantitative basis, this being the case of really, completely, entirely, fully and truly (Thomson and Martinet 1960; Alexander 1988: 305). In light of the above statements, the corpus instances were classified on a semantic basis to discern whether there is a historical preference for a particular adverb type. Figures 3 and 4 present in relative figures the results for the periods 1710-1850 and 1850-1920, respectively.6
[Pie charts. Figure 3 (1710-1850): Manner 41.6%, Time 41.6%, Quantity 8.3%, Negation 8.3%. Figure 4 (1850-1920): Manner 74%, Time 12%, Frequency 5%, Quantity 4%, Degree 4%, Negation 1%]
Figures 3 and 4: Distribution of adverbs according to types
As regards the period 1710-1850, Figure 3 reveals that adverbs of manner and time are the most frequent, each totalling 41.6%. These are then followed by those of quantity and negation (each amounting to 8.3%). These figures are misleading on account of the scanty number of instances found. By contrast, Figure 4, more reliable on a quantitative basis, shows the distribution of adverbs in the period 1850-1920, wherein we may conclude that the construction becomes prominent with adverbs of manner, amounting to 73.45% of the occurrences. By comparison, adverbs of time present a significant decrease inasmuch as they represent just 12.38% of the instances. These are followed by adverbs of frequency (5.3%), quantity (4.42%), intensity (3.57%) and negation (0.88%), respectively. From these data, one may tentatively conclude that adverbs of manner are overwhelmingly preferred in both periods. The drastic drop in the use of adverbs of time, in turn, is perhaps misleading on account of the limited number of instances found between 1710 and 1850, which distorts the occurrence of each type. Note, on the other hand, that the earliest instances of the phenomenon with adverbs of frequency and of negation date from the period 1850-1920, plausibly as an imitation of the syntactic position of the other adverbs (i.e. of manner and of time). In light of this, we have to disregard, at least partially, the accounts published in most modern English grammars. Instead of adverbs of intensity (Huddleston and Pullum 2002: 581), the English language shows a historical preference for adverbs of manner and time, of which again, so and immediately predominate in the period 1710-1850. 
By contrast, in the span 1850-1920 the most frequent adverbs of manner are, among others, seriously, openly and quietly; suddenly, always and ever are the most frequent adverbs of time; quite is the corresponding adverb of quantity; and entirely, just, thoroughly, almost and utterly are the most frequent adverbs of degree. Leaving aside the nature of the splitting adverb, there are also scholars who discuss the semantic connection between the adverb and the infinitive. In this vein, Close argues that “the combination of adverb and verb may be a well-established one, in which either the two elements may patently share a semantic component […] or may not do so apparently” (Close 1987: 227). For instance, in to cordially invite, the actual meaning of the infinitive tacitly implies the condition expressed by the adverb. In comparison, in to highly recommend the semantic relation is not so transparent. Of these two possibilities, Close concludes that the latter predominates, particularly because the infinitive is usually in need of a particular semantic modification. In view of this, the instances of our corpus were classified in terms of this kind of semantic relationship to discern the rate at which this semantic association occurs. For this purpose, we have considered only -ly adverbs in our taxonomy because subjuncts such as just, almost, ever or quite hardly have any kind of semantic connection with the infinitive and would thus distort the figures. As a result, a semantic association between these constituents has been observed in just 11.1% of the instances, whilst 88.9% show no relationship between the two constituents. Some of them are particularly noteworthy since they often add an irrelevant gradation to
the utterance, thus becoming a kind of stylistic idiosyncrasy of an author, as in to patiently await, to wantonly insult, to reverently salute, to visibly shine or to openly spouse, among others. We may therefore corroborate Close’s statement here insofar as there is not an implicit semantic relation between the infinitive and its corresponding modifier.
4.2 Functional approach
Following the line of previous publications (Huntsman 1980: 697), one of the most accurate descriptions of the phenomenon is given by Quirk et al., who consider that “split infinitives are commonest with subjuncts of ‘narrow orientation’ and hence perhaps especially when the infinitive is a ‘gradable’ verb”, particularly with subjuncts which do not occupy the initial-medial or final position, such as well or too (Quirk et al. 1985: 497).7 This section takes this argument as the starting point to analyse the status of the adverb from a functional perspective. According to Quirk et al., adjuncts are exclusively those that form part of the basic structure of the clause or sentence and modify a verb, hence resembling the other sentence constituents. The term subjunct, on the other hand, applies to adverbials semantically subordinated either to a clause or a sentence (i.e. wide orientation) or to a part of a clause (i.e. narrow orientation). See the following instances:
(1) Her presence caused Sam to instinctively straighten up […].
(2) […] he conjectured he should find the fugitives here; and, following them with all speed, he happened to just arrive at this the happiest moment of Leontes’s life.
If compared, the function of the adverb in (1) equals the function of the other clause constituents to the extent that, among others, the adjunct can be the focus of a cleft sentence (It was instinctively that Sam straightened up) and can also be elicited by question forms (How did Sam straighten up for her presence? Instinctively). On the contrary, this is not possible in (2) because the subjunct just cannot occur in such environments (i.e. *It was just that he happened to arrive at this the happiest moment […] or *How did he happen to arrive at this moment? Just). Therefore, the corpus instances have been surveyed in terms of the grammatical function of the adverb.8 Table 1 displays the total of adjuncts and subjuncts found in relative figures. As above, the period 1710-1850 has been considered as a whole due to the limited number of instances found during the first seventy years of the same span.
Table 1: Function of the splitting adverb (in percentages)
            1710-1850   1850-1920
Adjuncts    18.18       54.24
Subjuncts   81.81       45.75
Table 1 shows that, after its emergence in the 17th century, there is a major preference for subjuncts in these types of constructions, amounting to 81.81%, compared with 18.18% of adjuncts. Of the former, quite, so, just, still and about happen to be the most frequent. By contrast, from the second half of the 19th century we observe that adjuncts outnumber subjuncts, the former amounting to 54.24%, whilst the latter represent just 45.75%. At first sight, these data apparently disprove the traditional tenet as to the outstanding use of subjuncts under these circumstances. However, the previous figures may be misleading for the analyst of these constructions in the sense that the difference would be even greater if the analysis were based on the number of different adverbs, and not on the total of instances, thereby avoiding the repeated counting of instances in which the same adverb occurs. Accordingly, the traditional account of the split infinitive as a common phenomenon with subjuncts of narrow orientation (Quirk et al. 1985: 497-498) is in need of further clarification. Even though the argument may hold when the total of running instances of a corpus is considered (particularly between 1710 and 1850), it does not apply if the analysis is based on types (not tokens), as the number of adjuncts is found to be notably higher than that of subjuncts, as shown in Table 2, wherein the number of adjuncts is at least three times that of subjuncts in both sub-periods.
Table 2: Function of the splitting adverb (according to types, in percentages)
            1710-1850   1850-1920
Adjuncts    75.1        79.84
Subjuncts   24.9        20.15
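The contrast between Tables 1 and 2 rests on whether every corpus hit (token) or every distinct adverb (type) is counted. A minimal Python sketch of the two counts is given below; the function name, the data layout and the toy hits are assumptions made for illustration, and they reproduce neither the corpus data nor the tools actually used in the study.

```python
from collections import Counter


def function_shares(instances, by="tokens"):
    """Return the percentage of each adverb function.

    instances: list of (adverb, function) pairs, one pair per corpus hit.
    by="tokens" counts every hit; by="types" counts each distinct
    (adverb, function) pair only once.
    """
    if by == "types":
        instances = set(instances)  # collapse repeated adverbs
    counts = Counter(function for _, function in instances)
    total = sum(counts.values())
    return {function: round(100 * n / total, 2) for function, n in counts.items()}


# Hypothetical toy data, NOT taken from the corpus
hits = [
    ("just", "subjunct"), ("just", "subjunct"), ("just", "subjunct"),
    ("quite", "subjunct"), ("quite", "subjunct"),
    ("instinctively", "adjunct"), ("openly", "adjunct"), ("quietly", "adjunct"),
]

print(function_shares(hits, by="tokens"))
print(function_shares(hits, by="types"))
```

In this toy sample two recurrent subjuncts dominate the token count, whereas the type count is dominated by adjuncts that each occur once, which is the kind of reversal described above.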
This fact leads us to deal with the second part of the previous statement, which argues that the splitting of infinitives takes place “with subjuncts of narrow orientation and hence perhaps especially where the infinitive is a gradable verb” (Quirk et al. 1985: 497-498). This description is confusing to the reader, inasmuch as it is not clear whether the category of gradable verbs affects all split infinitives or just those with a subjunct. A verb is gradable when it becomes possible to “imagine the occurrence to which it refers as being at a more, or less, advanced point on a scale than a similar or comparable happening” (Close 1987: 217; 223). In this vein, increase, relieve, fill, smooth or preserve have been considered as examples of gradable verbs whilst get away, touch, realise or address are non-gradable verbs. Table 3 reproduces the distribution of verbs in terms of their gradability in the examples of our corpus according to whether an adjunct or a subjunct is involved. On account of these figures, Quirk et al.’s assumption about the tendency to use gradable verbs in split infinitives may be questioned. The period 1710-1850 presents 59.09% and 22.72% of non-gradable verbs with subjuncts and adjuncts, respectively. The same tendency is replicated in the period 1850-1920, insofar as non-gradable verbs outnumber their gradable counterparts, even though
the latter become more frequent than in the preceding period. In light of this, we may conclude that there is not a restriction as to the kind of verb used in these constructions.
Table 3: Distribution of gradable and non-gradable verbs (in percentages)
            1710-1850                  1850-1920
            Gradable   Non-gradable    Gradable   Non-gradable
Adjuncts    0          22.72           7.42       47.26
Subjuncts   18.18      59.09           12.89      32.42
4.3 Rhetorical approach
More often than not, split infinitives present the repetition of particular rhythmic patterns which, to a certain extent, contribute to the development and eventual standardization of a given combination in the language. This issue has been traditionally overlooked or, at least, tip-toed around in the relevant literature. Quirk et al. are the first to mention the combined effect of stress and rhythm in the development of these structures in the following terms: “the split infinitive is frequently associated with a following focus” (Quirk et al. 1985: 227). In this same vein, Crystal argues that split infinitives comply with the natural rhythm of English, which consists of “the te-tum te-tum rhythm favoured by Shakespeare and which is the mainstay of our poetic tradition” (Crystal 1984: 30). For that reason, he compares the structures boldly to go, to go boldly and to boldly go, concluding that the last is rhythmically very neat if compared with the others, where we find a sequence of two weak or two strong syllables together, respectively. Apart from these brief notes, a detailed study of the underlying rhetorical patterns behind split infinitives in a comprehensive corpus is still awaited. Following the above accounts, we consider that the stress system of a language directly influences the marked and the unmarked word order of that same language, especially in a stress-based language such as English. In the particular case at hand, there are then grounds to posit the existence of a metrical foot as a direct justification for split infinitives. Therefore, our approach is based on the number of syllables of both constituents (i.e. the adverb and the infinitive). Given their variable number of syllables (one, two, three or more) and the different position of the stress in both (initially, medially or finally depending on the item involved), the corpus instances were accordingly classified in terms of the number of syllables of the constituents.
In the figures below, the following notation has been used: the symbol (−) represents a stressed syllable, the symbol (×) represents an unstressed syllable and the bar (|) marks off the boundary between the adverb and the infinitive. Note that the infinitive marker has not been taken into consideration in our scansion as it is in all cases unstressed. Figures 5, 6, 7 and 8 reproduce the occurrence of the different scansions in relative figures.
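The notation can be made concrete with a short Python sketch that builds a scansion string from the stress pattern of the adverb and of the infinitive, and then tallies the share of each pattern. The '1'/'0' encoding of stress, the function names and the hand-assigned stress values are assumptions made for illustration only; the four sample items are examples discussed in this section.

```python
from collections import Counter

STRESSED, UNSTRESSED, BOUNDARY = "−", "×", "|"


def scan(adverb_stress, verb_stress):
    """Build a scansion such as '−×|×−' from two stress strings.

    Each argument is a string of '1' (stressed) and '0' (unstressed),
    one character per syllable. The infinitive marker 'to' is left out,
    as in the scansion adopted in the text.
    """
    def render(pattern):
        return "".join(STRESSED if ch == "1" else UNSTRESSED for ch in pattern)

    return render(adverb_stress) + BOUNDARY + render(verb_stress)


def distribution(scansions):
    """Return the percentage of each scansion in a list of scansions."""
    counts = Counter(scansions)
    total = len(scansions)
    return {pattern: round(100 * n / total, 1) for pattern, n in counts.items()}


# Stress values assigned by hand for illustration (hypothetical encoding)
examples = {
    "to just touch": ("1", "1"),
    "to quite forget": ("1", "01"),
    "to humbly beg": ("10", "1"),
    "to kindly encourage": ("10", "010"),
}

scansions = [scan(adverb, verb) for adverb, verb in examples.values()]
shares = distribution(scansions)
for phrase, pattern in zip(examples, scansions):
    print(phrase, pattern)
print(shares)
```

Applied to a full list of corpus instances instead of four hand-picked ones, the same tally would yield relative figures of the kind reported for each adverb length.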
[Figure 5: −|− 51%; −|×− 20%; −|−× 15%; −|××− 10%; −|×−× 2%; −|−×× 2%]
Figure 5: Split infinitives with monosyllabic adverbs
As shown, the most frequent combination is that comprising a monosyllabic infinitive (− | −), amounting to 53% of the constructions under scrutiny, in examples like to just touch, to so clean, to quite do, to now play, etc. Next, these are followed by infinitives having a disyllabic verb, in which case those finally-stressed are largely preferred (− | × −), totalling 20% of the instances, if compared with just 15% with the pattern (− | − ×). See, for instance, to quite forget, to so address or to still believe, in contrast with to then challenge, to so order or to quite finish. With a three-syllable verb, however, split infinitives become considerably less frequent. Of these, the pattern (− | × × −) predominates over the other combinations, amounting to 10% of the constructions. See, for instance, the cases of to quite understand, to not interrupt if compared with to just remember (− | × − ×) and to so compromise (− | − × ×). Note the occasional occurrence of the latter in the corpus in order to avoid the likely sequence of two stressed syllables in a series.
[Figure 6: −×|− 49%; −×|×− 22%; −×|−× 12%; −×|×−× 6%; −×|−×× 4%; ×−|−(×) 4%; −×|××− 3%]
Figure 6: Split infinitives with disyllabic adverbs
When there is a disyllabic adverb, on the other hand, the most widespread combination is again that with a monosyllabic verb (− × | −), amounting to 49%
of the instances in cases like to humbly beg, to kindly call, to greatly dare, to always start, etc. As illustrated in Figure 6, when there is a disyllabic verb we may then replicate the above results insofar as finally-stressed verbs (22%) outnumber their initially-stressed counterparts (12%). See examples like to always retain or to merely avoid in contrast with to faintly picture. Finally, when there is a three-syllable verb, the pattern − × | × − × exceeds all the other combinations with examples like to kindly encourage or to quickly recover (6%). Accordingly, compare them with to roughly symbolise or to vaguely understand showing − × | − × × and − × | × × −, respectively, notwithstanding the fact that the latter implies the co-occurrence of three unstressed syllables in a series.
[Figure 7: labelled segments: −××|− 36%; −××|−× 31%; −××|×− 8%; −××|××− 5%; −××|−×× 5%. The remaining patterns (−××|×−×, ×−×|×−×, ×−×|××−, ×−×|−×, ×−×|×−, ×−×|−) account for 4%, 4%, 3%, 2%, 1% and 1%; the pairing of these values with individual patterns is not recoverable.]
Figure 7: Split infinitives with three-syllable adverbs
When there is a three-syllable adverb, the same tendency is replicated inasmuch as the most recurrent pattern is again with a monosyllabic infinitive, particularly when the adverb is stressed in initial position (− × × | −), in cases like to suddenly cut, to openly take or to wondrously change, among others (totalling 36%). When there is a two-syllable verb, in turn, the splitting mainly occurs with infinitives whose stress is on the first syllable as they amount to 31%. Such a preference for initially-stressed infinitives is undoubtedly justified on account of the inherent stress pattern of the English language, based on the sequence of stressed and unstressed syllables. According to this, the combination − × × | − × is overwhelmingly chosen (instead of − × × | × −) in order to avoid the series of three unstressed syllables in the construction. Even though the latter represents 8% of the instances in the corpus, it is necessary to say that this figure arises from the continuous repetition of the same example (to thoroughly believe) throughout the entire corpus. When there is a three-syllable infinitive, the patterns found are the following: a) − × × | × × − in to thoroughly understand, to heartily recommend, etc.; b) − × × | − × × in to distinctly recognize or to suddenly constitute; c) − × × | × − × in to instantly remember or to thoroughly depreciate and d) × − × | × − × in to entirely discredit or to entirely abandon.
[Figure 8: −×××|− 26%; ×−××|×− 26%; ×−××|− 22%; −×××|×− 9%; ×−××|−× 9%; −×××|−× 4%; ×−××|××− 4%]
Figure 8: Split infinitives with four-syllable adverbs
Finally, with four-syllable adverbs, we witness the same tendency wherein one-syllable verbs (either − × × × | − or × − × × | −) predominate, irrespective of the relative position of the stress on the adverb. These structures amount to 48% in our data in examples like to reasonably make, to invariably make, etc. With a disyllabic verb, on the other hand, finally-stressed infinitives are largely preferred over initially-stressed ones, totalling 35% and 13% of the examples, respectively. In view of our results, we have grounds to conjecture the actual influence of stress and rhythm on the development and eventual standardization of many split infinitives. To a broad extent, regardless of the adverb involved, the underlying impulse behind the splitting of an infinitive is to avoid the sequence of several stressed or unstressed syllables in a series.9 In this fashion, the phenomenon is then found to occur under the following circumstances: a) with monosyllabic verbs, irrespective of the number of syllables of the adverb; b) with finally-stressed disyllabic verbs, particularly with monosyllabic and disyllabic adverbs; and c) with initially-stressed disyllabic verbs if a three-syllable adverb is present. Apart from this general rule, with which more than 80% of the examples comply, the underlying motivation behind all the other examples is the actual avoidance of several stressed or unstressed syllables in a series within the construction.
4.4 Current usage
Present-day English grammars normally offer a succinct description of the phenomenon, thus concentrating on the prescriptivist side of the construction along with its sociolinguistic implications. In this fashion, Thomson and Martinet recognize that, although split infinitives used to be considered bad style, there is now a more relaxed attitude to splitting (1960). Likewise, Huddleston and Pullum argue that the separation of the infinitive marker and the verb is not uncommon nowadays in speech or writing, the latter including works of prestigious authors (2002: 582). It is significant, however, that the references to the phenomenon are non-existent in other similar publications, corpus-based grammars included
(Biber et al. 1999). Notwithstanding this, as far as we have been able to explore, none of these grammars offers a detailed description of the construction from the point of view of its actual variation, not only across speech and writing, but also across genres or other variables. In light of this, the BNC has been used to investigate these particular aspects of variation. From a methodological standpoint, the demographic data from the BNC have been taken as reliable in most of the cases on account of the corpus description as selected “in a demographically-balanced way”.10 Given the impossibility of searching for split infinitives as a whole in the BNC, our analysis has been restricted to the most frequent combination, i.e. to actually + verb, which shows a total of 701 occurrences in the corpus. The immediate aim of this section is therefore to offer some notes about the use and distribution of the commonest splitting in the BNC. It goes without saying, however, that a more comprehensive study is awaited in this line to corroborate or refute the data provided by to actually + verb. First of all, the construction is found to have 529 occurrences in the spoken part of the BNC, in contrast with just 172 in the written collection. The age and sex of the informants have also been surveyed to explore any likely deviation in the use of these constructions. Given that the number of informants is unequally represented in the BNC, the instances have been normalised to a text of 10,000 sentences (Burnard 2007). Figure 9 reproduces the results, from which we may conclude that the structure to actually + verb becomes frequent in informants between 25-34 years of age, amounting to 37.78 occurrences (per 10,000 sentences). However, the structure peaks with informants between 45-59 years of age, totalling 46.87 instances in the spoken part. It then decreases drastically among informants over 60, as just 10.39 occurrences were found.
[Figure 9: vertical axis 0-50; age groups: Under 15, 15-24, 25-34, 35-44, 45-59, 60 and over]
Figure 9: Variation across the age of the informants
Finally, the BNC also shows a major preference for the construction among male informants since their figures triple those among females. The former amount to 95.47 instances per 10,000 sentences in the spoken part of the BNC while the latter amount to just 32.23 occurrences.
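The normalisation applied in this section is simple arithmetic: the raw number of hits for a group of informants is divided by the number of sentences produced by that group and scaled to a common base. The sketch below assumes that the base is 10,000 sentences; the function name and the counts are invented for illustration and are not BNC figures.

```python
def per_10k_sentences(hits, sentences):
    """Normalise a raw frequency to occurrences per 10,000 sentences."""
    if sentences <= 0:
        raise ValueError("the number of sentences must be positive")
    return round(hits * 10_000 / sentences, 2)


# Hypothetical counts: (raw hits, sentences produced by the group)
groups = {
    "group A": (12, 3_200),
    "group B": (12, 9_600),
}

for name, (hits, sentences) in groups.items():
    print(name, per_10k_sentences(hits, sentences))
```

The two hypothetical groups have the same raw count but different normalised rates (37.5 against 12.5), which is why raw counts cannot be compared across groups of unequal size.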
5. Conclusions
In the present study, we have analysed the use of split infinitives in English from the second half of the 17th century. For this purpose, several corpora were needed as a sizeable input for our investigation in order to counterbalance the strong prescriptive bias against the construction throughout the Modern period. In our case, the Lampeter corpus, CLMET and CEN have been the source of evidence, amounting to more than 40 million running words altogether. The BNC, in turn, was used only to retrieve the frequency and variation of the construction in the latter part of the 20th century. As mentioned elsewhere in the relevant literature, historically speaking, the phenomenon is thought to have stemmed from the fact that infinitives progressively adopted the status of any finite verb, hence allowing splitting adverbs to precede them. The new word order of infinitive clauses had two immediate effects: on the one hand, it avoided ambiguity by preventing the false association of the adverb with the preceding verb and, on the other hand, it also allowed the infinitive to be stressed as the focus of the clause (and not the adverb itself). Even though we acknowledge this as the original source of splitting, we must also consider the direct contribution of other factors which, at a subsequent stage, are conducive to the development of a particular type of splitting in English. In view of the results above, we are then in a position to report that the splitting of English infinitives, after its emergence at the beginning of the 17th century, shows a historical proneness to adverbs of manner inasmuch as these are the adverbs which, having a higher occurrence in the language, semantically typify the peculiarities of the action expressed by the infinitive, as in to graciously give her hand, to successfully cope with such difficulties, etc.
Particularly significant, however, is the scarce occurrence of adverbs of time in the period 1850-1920, as they amount to just 12% as a result of the recurrent use of three adverbs in particular, suddenly, always and ever. Next, we find the bulk of other adverb types (i.e. of frequency, quantity and negation), each of which accounts for about 5% of the occurrences or less. Functionally speaking, on the other hand, we have evidence to conclude that adjuncts predominate over subjuncts to the extent that the former triple the latter in the periods under scrutiny. In view of this, the traditional tenet about the split infinitive as a common phenomenon with subjuncts of narrow orientation (Quirk et al. 1985: 497-498) is in need of some clarification. Although Quirk et al. presumably base their statement on the total running instances of the construction (tokens), different evidence is observed if the analysis is based on the number of different adverbs (types). In our opinion, a functional approach to the status of the adverb in these constructions should rule out the total of instances in order to avoid the repetition of a recurrent splitting (i.e. to always + v, to thoroughly + v, to seriously + v), a fact leading to the possible misinterpretation of the data.
Next, the present paper also surveys the different rhetorical patterns behind the split infinitives of our corpus, concluding that there is an active involvement of stress and rhythm in the actual spread of many structures. In this line, more than 80% of the instances comply with the following variables: a) a monosyllabic verb, irrespective of the number of syllables of the adverb; b) a finally-stressed disyllabic verb, particularly with monosyllabic and disyllabic adverbs; and c) an initially-stressed disyllabic verb, if a three-syllable adverb is present. In all the other cases, on the other hand, the underlying principle behind the splitting of an infinitive is to avoid the sequence of several stressed or unstressed syllables in a series. As noted above, split infinitives are more often than not disregarded in present-day English grammars, which limit themselves to brief notes about their frequency and their sociolinguistic implications. In our opinion, the traditional tenets published in the literature should be re-examined in view of this quantitative and qualitative evidence, whether semantic, functional or rhetorical. We are still in need of more insight into the subject to gain a wider scope both synchronically, to explore variation from a multifaceted perspective in a present-day English corpus, and diachronically, to look into the different mechanisms of splitting in Middle English from a statistical standpoint and check the likely contribution of French to the origin and development of the phenomenon.
Notes
1 The present research has been funded by the Autonomous Government of Andalusia (grant number P07-HUM02609) and by the Caledonian Research Foundation/Royal Society of Edinburgh (2006 European Visiting Research Fellowship). These grants are hereby gratefully acknowledged.
2 According to Quirk et al., this view may be corroborated with examples like the following: Did you ever visit her after she had retired? I intended to often enough, but seldom managed to (see Quirk et al. 1985: 496-497).
3 The period 1710-1780 amounts to 3,037,607 tokens; the period 1780-1850 5,723,988 running words and the third 6,251,564, totalling 14,970,622 altogether.
4 As regards the total number of sentences in the corpora, the Lampeter corpus (1640-1710) contains 32,489 sentences; the period 1710-1780 77,266 sentences; the period 1780-1850 185,670 sentences; and the period 1850-1920 257,308 sentences.
5 For the sake of accuracy, we have just computed those instances in which the adverb strictly modifies the infinitive of the verb. Besides, passive constructions like nothing else is easily to be found along with perfective infinitives such as he seems greatly to have impressed her have been ruled out from our analysis on account of the impossibility to split under these circumstances.
6 Given the scanty material available for the period 1710-1780, it has been taken as a whole.
7 “Since a number of entities, notable not and certain adverbials, often occur within verb phrases, they quite reasonably could be found within the to-marked constructions which are the reductions of such phrases” (Huntsman 1980: 697).
8 Conjuncts and disjuncts have not been considered in our classification given the impossibility for them to occur in these positions.
9 Particularly significant is the use of multiple splittings, which perfectly illustrate the distribution of stressed and unstressed syllables within the construction, as in the following cases: to still further offend, to again exclusively occupy, to gently but steadily convey, to unreservedly and innocently enjoy, to utterly and completely triumph, etc.
10 Cited from the BNC webpage at http://www.natcorp.ox.ac.uk/corpus.
References
Alexander, L.G. (1988), Longman English Grammar. London: Longman.
Biber, D. (1988), Variation across Speech and Writing. Cambridge: Cambridge University Press.
Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan (1999), Longman Grammar of Spoken and Written English. London: Longman.
Bryant, M.M. (1946), ‘The Split Infinitive’, College English, 8.1: 39-40.
Burnard, L. (2007), Reference Guide for the British National Corpus (XML Edition). Oxford: Research Technologies Service at Oxford University Computing Services. http://www.natcorp.ox.ac.uk/XMLedition/URG.
Close, R.A. (1987), ‘Notes on the Split Infinitive’, Journal of English Linguistics, 20.2: 217-29.
Crystal, D. (1984), Who Cares about English Usage? London: Longman.
Crystal, D. (1985), ‘A Case of the Split Infinitives’, English Today, 3: 16-17.
Curme, G.O. (1914), ‘Origin and Force of the Split Infinitive’, Modern Language Notes, 29.2: 41-45.
Curme, G.O. (1927), ‘The Split Infinitive’, American Speech, 2.8: 341-42.
Fischer, O. (1992), ‘Syntax’, in: N. Blake (ed.) The Cambridge History of the English Language. Volume II 1066-1476. Cambridge: Cambridge University Press. 206-408.
Foster, D.W. (1978), ‘A Transformational Justification for English Split Infinitives’, in: D. Nehls (ed.) Studies in Descriptive English Grammar. Heidelberg: Groos. 85-93.
Gelderen, E. van (1989), ‘The Historical Rationale behind Split Infinitives and Kindred Constructions’, Archiv für das Studium der Neueren Sprachen und Literaturen, 226: 1-18.
Hall, F. (1882), ‘On the Separation, by a Word or Words, of To and the Infinitive Mood’, American Journal of Philology, 3.9: 17-24.
Huddleston, R. and G.K. Pullum (2002), The Cambridge Grammar of the English Language. Cambridge: Cambridge University Press.
Huntsman, J.F. (1980), ‘Two Comments on Peter Neely’s To Split or to Not Split’, College English, 41.6: 696-99.
Jespersen, O. (1930), Growth and Structure of the English Language. Oxford: Blackwell.
Malone, K. (1941), ‘The Split Infinitive and a System of Clauses’, American Speech, 16.1: 52-54.
Mustanoja, T.F. (1960), A Middle English Syntax. Helsinki: Société Néophilologique.
Nagle, S. (1994), ‘Infl in Early Modern English and the status of to’, in: D. Kastovsky (ed.) Studies in Early Modern English. Berlin and New York: Mouton de Gruyter. 233-42.
Neely, P.M. (1978), ‘To Split or Not to Split’, College English, 40.4: 402-06.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A Comprehensive Grammar of the English Language. London: Longman.
Scott, M. (1996), Wordsmith Tools 3.0. Oxford: Oxford University Press.
Siemund, R. and C. Claridge (1997), ‘The Lampeter Corpus of Early Modern English Tracts’, ICAME Journal, 21: 61-70.
Smet, H. De (2005), ‘A Corpus of Late Modern English Texts’, ICAME Journal, 29: 69-82.
Smith, W.M. (1959), ‘The Split Infinitive’, Anglia. Zeitschrift für Englische Philologie, 77.3: 257-78.
Thomson, A.J. and A.V. Martinet (1960), A Practical English Grammar. Oxford: Oxford University Press.
Visser, F. Th. (1984), An Historical Syntax of the English Language. 4 vols. Leiden: Brill.
Exploring change in the system of English predicate complementation, with evidence from corpora of recent English
Juhani Rudanko
University of Tampere
Abstract
There are robust grammatical differences between to infinitive and to -ing complements in English, but some predicates have exhibited variation between the two patterns. This study examines one such predicate, the adjective accustomed, and the focus is on the period around the end of the nineteenth century, when the to -ing pattern was starting to emerge as a rival to the to infinitive pattern. The period is studied on the basis of the third part of the Corpus of Late Modern English Texts. Attention is paid to extraction as a syntactic factor bearing on complement selection. From a semantic point of view, the notion of a sense of choice on the part of the referent of the subject is then examined as a semantic property, and it is argued that lack of a sense of choice was associated with the emerging pattern. The article also inquires into the complement selection of the adjective in a corresponding corpus of present-day English, showing that the to -ing pattern is now the rule even in contexts linked to a lack of choice. At the same time, the adjective has become much less frequent in the language.
1. Introduction
Consider sentences (1a-b):
(1) a. John is reluctant to accept advice.
    b. John is averse to accepting advice.
The sentences of (1a-b) share a number of properties. Each has two predications, and the analysis of each involves a higher S node and a lower, or embedded S node. Further, the matrix predicate in each of (1a-b) is an adjective that selects a complement clause as one of its arguments. It is also assumed here that the lower clause in each of (1a-b) has its own understood subject, which may be represented by the symbol “PRO”. The postulation of such an invisible element is a somewhat controversial move, but it is justified for the simple reason that it makes it possible to represent the argument structures of the lower predicates in (1a-b). It also makes it possible to comment on the control properties of (1a-b) in a maximally concise way. It can be said that both of (1a-b) display subject control: the understood subject of the lower clause is controlled by the higher subject in each case. It is also observed, as a further similarity, that the word to is found in both (1a) and (1b).
However, a closer look at the sentences of (1a-b) and the word to in them reveals a fundamental difference between the sentences. In (1a) the word to is followed by an infinitive and the pattern is of the to infinitive type. In (1b) the word to is followed by an -ing form, which in traditional terminology would be called a gerund, and the pattern may be termed the to -ing pattern. The two types of complements also display different types of syntactic behaviour. For instance, the to of (1b) can be followed by an appropriate pronoun, while that of (1a) cannot:
(2) a. *John is reluctant to accept advice, but I am not reluctant to it.
    b. John is averse to accepting advice, but I am not averse to it.
Further, part of the lower clause following the word to may be elided in the case of (1a), but such an elision is not permissible in the case of (1b). Thus the unabridged sentence of (3a) can be reduced to (3b), but the unabridged sentence of (4a) cannot be reduced to (4b): (3)
a. John is reluctant to accept advice, and I am reluctant to accept advice as well. b. John is reluctant to accept advice, and I am reluctant to as well.
(4)
a. John is averse to accepting advice, and I am averse to accepting advice as well. b. *John is averse to accepting advice, and I am averse to as well.
To account for such grammatical differences, it is proposed here that in the case of (1a) the word to is an infinitival marker (Quirk et al. 1985: 1178, note a), or an Aux or an Infl. An Aux is naturally followed by a VP. However, in the case of (1b) the word to is a preposition, followed by a nominal clause, to adopt a term from traditional grammar. A nominal clause may be viewed as a sentence dominated by a NP. With these assumptions made, the differences are accounted for in a straightforward fashion. In (2a-b) the pro-form it, which is a pro-form for an NP, may stand in place of a nominal clause, but such a pro-form cannot substitute for a VP, which explains the ill-formedness of (2a). As for (3a-b) and (4a-b), VP Deletion, or an interpretive analogue of the rule, can apply in (3a), because what follows to is a VP, but it cannot apply in (4a) because what follows to is not a VP, but a nominal sentence. The structural representations proposed for the sentences of (1a-b) are given in (1a´) and in (1b´): (1)
a.´ [[John]NP1 is reluctant [[PRO]NP2 [to]Aux [accept advice]VP]S2]S1 b.´ [[John]NP1 is averse [[to]Prep [[[PRO]NP2 accepting advice]S2]NP]PP]S1
There is clearly a distinction to be made between the sentences of (1a-b) on the basis of their structural representations, based on differences in their grammatical properties. However, in spite of such sharp grammatical differences, there are a number of matrix predicates — verbs, adjectives and nouns — that have shown a degree of variation and change in recent times between the to infinitival and the prepositional patterns. It is the purpose of this study to investigate one such predicate, the adjective accustomed, with the focus on its sentential complements. The adjective accustomed has been investigated in a number of earlier studies, including Kjellmer (1980), Rudanko (2000: 90 f.), Vosberg (2003b: 314 f.), Rohdenburg (2006: 154 f.), Rudanko (2006) and Rudanko (2007). It seems clear on the basis of these studies that in the eighteenth and nineteenth centuries sentential complements of the adjective were overwhelmingly of the to infinitive type and that in present-day English, to -ing complements predominate over to infinitives both in British English and in American English, with the predominance being more pronounced in American English. It also seems clear that the change affecting the adjective accustomed is part of a larger set of changes generally favouring -ing complements and that such changes are a function of a number of factors relating to the nature of to infinitival and -ing complements, including the increasingly more verbal character of the former (cf. Denison 1998: 266; Rudanko 1998: 20-22; 2000: 35 f.). Rudanko (2007), focusing on the adjective accustomed, lays an emphasis on grammatical change in different text types, and adds the point that the predominance of the to -ing pattern is less pronounced in the more conservative text type of books, as opposed to the text types of newspapers and spoken English. 
It is the purpose of the present study to supplement the earlier studies by examining British English data from the period 1850-1920 and a corresponding database of present-day British English. A number of to -ing complements can be found with the adjective that pre-date 1850 (Rudanko 2006), but the period around the end of the 19th century is a key period in the study of the adjective. It is during this period that to -ing complements start to emerge with the adjective in larger numbers, permitting a systematic study. The third part of the CLMET, the Corpus of Late Modern English Texts, the extended version, developed at the Catholic University of Leuven in Belgium, is used as the source of data in the diachronic part of this study. This is a corpus of written British English that represents the text type of books, including fiction and some non-fiction, and covers the period 1850-1920. (For information on the original version of the CLMET, see de Smet 2005.) As far as the present author is aware, the complementation patterns of the adjective accustomed in this particular diachronic corpus have not been investigated in the literature. One concern of the present study is the question of whether the emerging pattern was characterized by a specific semantic property, using a remark by the great Dutch grammarian H. Poutsma in his Dictionary (Poutsma MS), which still regrettably remains unpublished, as a point of departure. It may be presumed, on the basis of illustrations in Poutsma (MS), that the remark was made in the 1930s,
and it was specifically on the senses of the adjective with the two types of complement. A further concern of the present study is to examine the complementation patterns of the adjective in present-day British English. This is done on the basis of the UK Books Corpus of the Collins Cobuild Demonstration Corpus. The reason for this choice is that, of the corpora making up the Collins Cobuild Demonstration Corpus, the UK Books Corpus also comprises fiction and some non-fiction, and therefore most closely resembles the Corpus of Late Modern English Texts with respect to text type.

2. Complements of the Adjective Accustomed in the Third Part of the Corpus of Late Modern English Texts
The third part of the extended version of the Corpus of Late Modern English Texts is a corpus of 6.1 million words. A simple search string consisting of the one item accustomed satisfies the requirements of both precision and recall in the present case. The search string produces 291 hits in the corpus. Among the hits there are a small number of tokens of the verb form accustomed, as in (5), which can be set aside easily enough: (5)
And only when they were together in public did he treat her with courtesy or show her such attentions as western civilization has accustomed women to expect from the men with whom they are connected. (1885, Linton, The Autobiography of Christopher Kirkland)
There are also a number of tokens where the adjective is in a prenominal position inside a NP, as in (6), and such cases can also be set aside in the study of complementation. (6)
He lifted his head a little, a very little, for her accustomed kiss. (1850, Craik, Olive)
In the large majority of the tokens accustomed is an adjective selecting a complement. However, the most frequent type of complement is the NP, as in (7): (7)
Thus no person can thoroughly enjoy elk-hunting who is not well accustomed to it, . . . (1854, Baker, The Rifle and the Hound in Ceylon)
The number of to NP complements is 143, representing a frequency of 23.4 per million words. Setting to NP complements to one side in the present context, there are altogether 99 tokens of sentential complements involving subject control in the material. This represents a frequency of 16.2 per million words. To infinitive complements are clearly predominant, with 88 tokens and a frequency of 14.4 per
million words. The number of to -ing complements is 11, representing a frequency of 1.8 per million words. Here are two initial illustrations of each type: (8)
a. . . . should think a generic resemblance constitutes a portrait, when we see the great public, so accustomed to be delighted with misrepresentations of life and character, which they accept as representations, that they are scandalised when . . . (1883, Blind, George Eliot) b. Accustomed to spend his holidays among the mountains, though (like a true Forsyte) he had never attempted anything too adventurous or foolhardy, . . . (1906, Galsworthy, The Man of Property)
(9)
a. I am getting quite accustomed to being snubbed by Lupin, and I do not mind being sat upon by Carrie, . . . (1894, Grossmith, The Diary of a Nobody) b. And then Aggie put her face close, as women do who are accustomed to talking in the streets, and said . . . (1897, Caine, The Christian)
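As a rough illustration of how hits for the search string might be pre-sorted before manual analysis, the following Python sketch separates tokens in which accustomed is immediately followed by to and a word ending in -ing from other tokens. The function name, the labels and the pattern are assumptions introduced here and are not part of the procedure used in the study. Forms such as thing or morning would be misclassified, and cases where to is not adjacent to the adjective would be missed, so each hit would still need to be inspected by hand.

```python
import re

# Hypothetical first-pass pattern: "accustomed", optionally followed by "to" + one word.
PATTERN = re.compile(r"\baccustomed\b(?:\s+to\s+(\w+))?", re.IGNORECASE)


def triage(line: str) -> str:
    """Return a rough label for one concordance line (labels are illustrative)."""
    match = PATTERN.search(line)
    if match is None:
        return "no hit"
    next_word = match.group(1)
    if next_word is None:
        # Verb uses, prenominal uses and non-adjacent complements end up here.
        return "no adjacent to"
    if next_word.lower().endswith("ing"):
        # Candidate only: nouns in -ing are not excluded by this heuristic.
        return "to -ing candidate"
    # Either a to infinitive or a to NP complement; needs manual sorting.
    return "to infinitive or to NP"


examples = [
    "I am getting quite accustomed to being snubbed by Lupin",
    "Accustomed to spend his holidays among the mountains",
    "who is not well accustomed to it",
    "for her accustomed kiss",
]
for example in examples:
    print(triage(example), "|", example)
```

A heuristic of this kind can only reduce the amount of manual sorting; the distinctions drawn in the text between verb forms, prenominal adjectives, to NP complements and sentential complements still depend on reading each token in context.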
When the dates of the to -ing complements are examined, it is observed that they strongly cluster in the second half of the corpus. If the corpus is divided in two on the basis of the time period that it represents, with the first half covering the years 1850-1885, and the second covering the years 1886-1920, all but one of the to -ing complements are in the second half, and even the one exception is from 1884.

The variation in the material is here examined from two points of view, one syntactic, the other semantic. The syntactic point of view concerns the nature and role of extractions as a factor bearing on complement selection. Here is how Uwe Vosberg formulates what he terms the Extraction Principle:

The Extraction Principle
In the case of infinitival or gerundial complement options, the infinitive will tend to be favoured in environments where a complement of the subordinate clause is extracted (by topicalization, relativization, comparativization, or interrogation etc.) from its original position and crosses clause boundaries. (Vosberg 2003b: 308; see also Vosberg 2003a)

There is little doubt that the Extraction Principle should be recognized as a factor protecting to infinitives when these are in competition with to -ing complements. At the same time, the present author has suggested that the principle should be broadened somewhat (Rudanko 2006: 43). As quoted, the principle makes reference to the extraction of a complement out of a subordinate clause.1 However, extractions are not limited to the extraction of complements, and a broader principle is needed. The broader principle consists in also taking the extraction of adjuncts into account. A way to do this is to replace the reference in the principle to the extraction of a “complement of the subordinate clause” with a reference to the extraction of a “constituent of the subordinate clause.”
The present material may be examined from the point of view of the potential broadening of the extraction principle. Among the 99 tokens of sentential complements involving subject control, there are altogether 19 tokens involving the extraction of a constituent of a subordinate clause dependent on the adjective accustomed out of the subordinate clause. In all 19 cases the complement is of the to infinitive type, as predicted by the principle. Of the 19 cases, as many as ten involve the extraction of an adjunct. In view of the prominence of adjunct extractions, it seems appropriate to broaden the principle. Extractions involving complements are illustrated in (10a-b), and extractions involving adjuncts are given in (11a-b): (10)
a. Their beautiful proportions render them the more striking; there are no gnarled and knotty stems, such as we are accustomed to admire in the ancient oaks and beeches of England, but every trunk rises like a mast from the earth, . . . (1855, Baker, Eight Years’ Wandering in Ceylon) b. This crisis he was accustomed to regard as manifesting itself in a sudden and definite upheaval. (1907, Gosse, Father and Son)
(11)
a. . . . the river happened to be very like those in California in which they had been accustomed to find gold. (1855, Baker, Eight Years’ Wandering in Ceylon) b. Thus some horseflies lay their eggs upon the lips of horses or upon parts where they are accustomed to lick themselves. (1880, Butler, Unconscious Memory)
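The distribution of extraction contexts reported above can be checked with a small calculation. The sketch below is an illustration added here and is not necessarily the test used in the study. It treats the 99 sentential complements as a pool containing 11 to -ing tokens and asks how likely it is that none of the 11 would fall among the 19 extraction tokens if extraction had no bearing on complement selection, which amounts to a one-sided Fisher exact test. The function name is hypothetical.

```python
from math import comb


def prob_no_ing_under_extraction(total_tokens: int, ing_tokens: int,
                                 extraction_tokens: int) -> float:
    """Hypergeometric probability that every to -ing token falls outside
    the extraction contexts, assuming extraction and complement type
    are independent."""
    non_extraction = total_tokens - extraction_tokens
    # Ways of placing all to -ing tokens among the non-extraction tokens,
    # divided by the ways of placing them anywhere in the pool.
    return comb(non_extraction, ing_tokens) / comb(total_tokens, ing_tokens)


# Counts from the text: 99 sentential complements, 11 to -ing, 19 extractions.
p_value = prob_no_ing_under_extraction(99, 11, 19)
print(f"one-sided p = {p_value:.3f}")
```

Under these assumptions the probability comes out at roughly 0.08, which is above the conventional 0.05 threshold.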
The rule extracting adjuncts out of subordinate clauses tends to be Relativization, as in (11a-b). The same rule often moves complements out of subordinate clauses, as in (10a). The total predominance of to infinitives over to -ing complements in extraction contexts is not statistically significant, but it is still worth noting as a qualitative factor bearing on complement selection. Turning to the semantic consideration to be examined here, it may be observed that in his Dictionary Poutsma (MS) made this comment with respect to the incidence of the two constructions and a possible semantic difference between them: The infinitive construction is, presumably, rather more common than the gerund construction, and appears to be used to the exclusion of the latter when mere recurrency of an action or state, without any notion of a habit or custom, is in question. (Poutsma, MS, s.v. accustomed) In Rudanko (2006: 39 f.), the present author gave prominence to Poutsma’s statement about the possibility of making out a semantic difference between the two types of complement in the case of the matrix predicate accustomed, and proceeded to suggest that in the case of to -ing complements, the adjective
“conveys the sense of ‘be used to’, with the complement of the adjective expressing a regular situation” (Rudanko 2006: 39). As for to infinitive complements, the sense of the adjective may be close to that of ‘tend’, with the complement of the adjective expressing a regular practice. There may thus be more of a sense of choice on the part of the referent of the matrix subject in the case of the to infinitive complement than in the case of the to -ing complement. (I am grateful to Ian Gurney, personal communication, for commenting on the distinction.) (Rudanko 2006: 39 f.)

It is of interest here to examine the present set of data, in order to see whether or not the kind of semantic differentiation outlined in these statements might be supported by this material, which was not considered in Rudanko (2006). Indeed, the extended version of the Corpus of Late Modern English Texts, being a recent corpus, had not even been compiled at the time when Rudanko (2006) was written. The suggestion, then, is that a to infinitive complement with accustomed is associated with a “sense of choice on the part of the referent of the matrix subject”, whereas this is not so or is less so in the case of a to -ing complement. The two patterns are patterns of subject control, which means that the higher subject and the lower subject are coreferential. One avenue for investigating the notion of a sense of choice may then be to examine it from the point of view of the semantic role of the lower subject. In particular, an agentive role may be associated with a high degree of choice on the part of the referent of the lower subject. Choice implies a volitional act, and it may be recalled that David Dowty (1991: 572) has identified the notion of “volitional engagement in the event or state” as a prominent ingredient of what he terms the Agent Proto-Role.
By contrast, a low degree of choice and agency is linked to the Patient role, which may be conceptualized as designating a participant in the action or state that undergoes the action initiated or carried out by some other entity. From this point of view, consider the sentences of (12a-b): (12)
a. . . . replied that he was not accustomed to consult with amateur physicians. (1859-1860, Collins, The Woman in White) b. Her voice dropped a little, with a pathetic expostulating intonation in it, as of one accustomed to be rebuked. (1886-1890, Pater, Essays from the Guardian)
In (12a) the predicate of the lower clause consult is such that its understood subject is conceptualized as exhibiting “volitional engagement” in the event in question. The predicate and its subject are here designated as +Choice. On the other hand, in (12b) the understood subject, because of the nature of its predicate
be rebuked, designates a participant that is conceptualized as an undergoer. The predicate and its subject are designated as -Choice. The illustrations in (12a-b) are of to infinitive complements. To -ing complements of the two types are likewise encountered in the material, as in (13a-b): (13)
a. And then Aggie put her face close, as women do who are accustomed to talking in the streets, and said . . . (1897, Caine, The Christian) b. I am getting quite accustomed to being snubbed by Lupin, and I do not mind being sat upon by Carrie, . . . (1894, Grossmith, The Diary of a Nobody)
It may be observed that in an active sentence an NP expressing a low degree of choice often has the syntactic function of a direct object. When such a sentence is in the passive, the derived subject continues to express a low degree of choice. This is the case in (12b) and (13b), where the lower clause is in the passive, so that PRO is in the subject position. However, while direct objects are often prototypically associated with a low degree of choice, it would be reductive only to consider NPs with this syntactic function in the present context. It should be noted that there are also some predicates that may involve a subject of an active sentence that is patientlike in expressing a low degree of agentivity. Consider (14a-b): (14)
a. “Why, man, I’m not accustomed to receive presents, even as a proxy. I haven’t had one since I was a schoolboy.” (1893, Gissing, The Odd Women) b. Accustomed as Daffodil had become to meeting with deference and submission, she nevertheless was struck by a something especially reverential . . . (1884, Webster, Daffodil and the Croaxaxicans)
In (14a) the subject of receive is conceptualized as undergoing an action initiated by someone else and is marked -Choice. As for (14b), the sense of the lower predicate meet with is ‘to experience, undergo (a particular fortune or treatment)’ (OED, part of sense 15.f of meet). The ‘undergo’ part of the gloss emphasizes the patient-like nature of the subject. When the 99 tokens of sentential complements are considered, it is observed that in most cases it is easy enough to make a determination, along the lines indicated, about whether or not the lower predicate implies a sense of choice. However, there are some cases, especially with some predicates that take subjects with the Experiencer role, where the distinction between a sense of choice as opposed to a lack of choice is hard to carry out. For instance, consider (15a-b):
(15)
a. “Dick, I’ve always been accustomed to believe what I was told.” (1904, Galsworthy, The Island Pharisees) b. She had all her life been accustomed to see enterprises, even minor ones, well pondered and then carefully schemed beforehand. (1908, Bennett, The Old Wives’ Tale)
Predicates of the type of believe, as in (15a), and see, as in (15b), appear to be in a grey area as regards the notion of a sense of choice that may or may not inhere in the conceptualization of the lower subject. Rather than attempting to force a black and white decision in such cases, they are set aside here as being indeterminate. There are eight tokens of this type. By contrast, there are some other predicates, also to be linked to Experiencer roles in their subjects, that can be seen as being amenable to an agentive interpretation. Consider (16a-b): (16)
a. We are so accustomed to consider family resemblance a matter of course, that we are sometimes surprised when . . . (1880, Butler, Unconscious Memory) b. During the years we spent there, I had been accustomed to regard the phenomena of life as differing totally from what obtains throughout all other latitudes, . . . (1894, Huxley, Discourses)
The subject of the predicate consider in (16a) and that of regard in (16b) designates a process in the mind of a person and as regards the semantic roles of the NPs in question, these can be viewed as Experiencers. However, the predicates also entail a sense of choice, for a person can for instance choose to regard the phenomena of life in one light rather than another, and therefore the predicates are included as +Choice. With eight predicates set aside, the figures in Table 1 give information about the remaining 91 predicates from the point of view of the +/-Choice criterion and about the types of complement in each case: Table 1: Lower predicates that are +/-Choice and the form of the complement
            to infinitive    to -ing    total
+Choice          71             4         75
-Choice          10             6         16
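The association between the +/-Choice factor and the form of the complement in Table 1 can be tested with a chi-square test with the Yates continuity correction. The following Python sketch is an illustration added here, with hypothetical function names. It uses the standard formula for a 2 x 2 table and obtains the p-value for one degree of freedom from the complementary error function.

```python
import math


def yates_chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic with Yates correction for the 2 x 2 table
    [[a, b], [c, d]]."""
    n = a + b + c + d
    # The correction subtracts n/2 from |ad - bc|, never going below zero.
    corrected = max(abs(a * d - b * c) - n / 2, 0)
    numerator = n * corrected ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def p_value_df1(chi2: float) -> float:
    """Upper-tail probability of a chi-square value with one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


# Table 1: +Choice row (71 to infinitive, 4 to -ing), -Choice row (10, 6).
chi2 = yates_chi_square(71, 4, 10, 6)
print(f"chi-square = {chi2:.2f}, p = {p_value_df1(chi2):.4f}")
```

With the figures of Table 1 the corrected statistic is about 10.85, which exceeds the critical value of 6.63 for p < 0.01 at one degree of freedom.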
In the light of these figures, it seems clear that the perspective of +/-Choice should be taken into account as a factor bearing on complement selection. Using the Chi square test, with the Yates correction factor, the figures are significant at the p < 0.01 level. It may also be observed that of the six to -ing complements in the -Choice line, only one of the six is from the first half of the period, and that of the ten to infinitive complements in the same line, four are from the first half. In other
words, in the second half, the numbers of tokens in the -Choice line are almost equal. Looking at the +/-Choice factor from a more qualitative point of view, it is possible to say that when the to -ing pattern was emerging as an alternative to the to infinitive pattern, the emerging pattern did have a semantic niche to latch on to, at least in the specific case of accustomed. It is possible to say that the to -ing pattern with accustomed was especially associated with lower predicates expressing lack of choice on the part of the referent of the lower (and higher) subject. The semantic perspective is not the whole story, and it does not call into question earlier work on other factors that have been proposed as impacting complement selection, and change in complement selection, with accustomed. However, the semantic perspective discussed here supplies an additional dimension, and opens up an avenue for investigating the potential impact of the factor with other predicates that have undergone change in favour of the to -ing pattern in recent times.

3. Complements of Accustomed in the UK Books Corpus
Turning to present-day English and the Collins Cobuild Demonstration Corpus, it seems appropriate to consider the UK Books subcorpus here. This corpus represents British English, and it is also similar to the Corpus of Late Modern English Texts from the point of view of text type. The two corpora are also of comparable sizes, the UK Books Corpus being 5,354,262 words, which is rounded to 5.4 million words. The simple search string accustomed again seems the most suitable, from the point of view of precision and recall. It yields only 75 hits in the UK Books Corpus. Among the hits there are three verbs, as in (17): (17)
The debate was unusual in that the wartime coalition had accustomed the old members to co-operation and the new ones were all making maiden speeches.
In addition, there are seven tokens where the adjective accustomed is in the prenominal position, which can be set aside from the point of view of the study of complementation. In the remaining 65 cases the adjective selects a complement. The number of NP complements is 46, and that of sentential complements is 19. It is clear from these figures that the frequency of the adjective with both NP and with sentential complements has gone down dramatically in the course of the last century. For NP complements the frequency has gone down from 23.4 per million words to 8.5 per million words. For sentential complements, the decrease is equally dramatic, from 16.2 per million to 3.5 per million words.2 Among the sentential complements, the to -ing pattern is now predominant, with 18 tokens, and there is only one token of a to infinitive.
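The normalised frequencies compared here are obtained by dividing raw token counts by corpus size in millions of words. The short Python sketch below illustrates the arithmetic, assuming the rounded corpus sizes of 6.1 and 5.4 million words given in the text. The function name and the dictionary labels are hypothetical.

```python
CLMET_MILLIONS = 6.1     # third part of the CLMET, extended version
UK_BOOKS_MILLIONS = 5.4  # UK Books Corpus, rounded from 5,354,262 words


def per_million(tokens: int, corpus_millions: float) -> float:
    """Frequency per million words, rounded to one decimal place."""
    return round(tokens / corpus_millions, 1)


# Raw counts as reported in the text for each corpus.
clmet_counts = {"to NP": 143, "sentential": 99, "to infinitive": 88, "to -ing": 11}
uk_books_counts = {"to NP": 46, "sentential": 19, "to infinitive": 1, "to -ing": 18}

for label in clmet_counts:
    earlier = per_million(clmet_counts[label], CLMET_MILLIONS)
    later = per_million(uk_books_counts[label], UK_BOOKS_MILLIONS)
    print(f"{label}: {earlier} -> {later} per million words")
```

Using the exact size of the UK Books Corpus instead of the rounded figure would shift some values slightly, so the rounded sizes are kept here for comparability with the figures in the text.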
Here is the one to infinitive complement in the corpus, and an initial illustration of a to -ing complement: (18)
The very plates from which she is accustomed to eat are apparently not hers at all, . . .
(19)
. . . he’d got quite accustomed to moving around with caution, . . .
It is of interest to observe that the one token of a to infinitive involves extraction. Further, it is observed that the type of extraction in question is adjunct extraction. While the significance of a single example should not be exaggerated, the token is worth noting from the point of view of the broadening of the definition of the Extraction Principle proposed above. An obvious question to raise, in view of the Extraction Principle, is whether there are extractions among the to -ing complements. The presence of such extractions would be an indication of the further consolidation of the to -ing pattern with the adjective. Among the 18 to -ing complements one case of extraction is found: (20)
. . . they would not respond to the kind of phrases which he was accustomed to using.
The presence of extraction in sentence (20) is worth noting, but little can be made of a single example. As for the notion of a sense of choice, it is observed that the one to infinitive complement has a +Choice lower predicate, as is to be expected in the light of the diachronic data. As for the 18 to -ing complements, it is clear that the pattern has spread from its original niche in -Choice contexts to be also available in +Choice contexts. There are two tokens in -Choice contexts, as in (21a), and five cases are in a grey area and indeterminate, as in (21b), but the majority of the to -ing tokens, 11 of the 18, involve a +Choice predicate in the lower clause. Such predicates are found in (19) and (20), and two more are provided in (21c-d). (21)
a. Leona and I are accustomed to encountering that sort of thing in the early stages of a marital separation, . . . b. He had grown accustomed to having Abasio about the place, and even though they argued constantly, he did not want his grandson to go. c. They had three servants but she was more accustomed to looking after the entertaining herself. d. Accustomed as he was to exercising power, Lloyd George could not admit the existence of situations in which men were manacled by forces larger than they.
The findings indicate that the to -ing pattern is now the rule even in +Choice contexts. At the same time, the proportion of +Choice contexts has decreased somewhat. The use of the adjective accustomed has undergone an overall decline in recent English, but the frequency of to -ing complements has risen from 1.8 per million to 3.3 per million, and, as far as sentential complements involving subject control are concerned, it is worth emphasizing that the modern English data from the UK Books Corpus attest to the overwhelming predominance of the to -ing pattern over the to infinitive pattern with accustomed in current British English.

4. Concluding Observations
The overall story of the change in the sentential complementation patterns of the adjective accustomed in recent centuries is by now established in the literature: in the eighteenth and nineteenth centuries to infinitive complements were predominant over to -ing complements, but in present-day English the to -ing pattern is much more frequent than the to infinitive pattern. The present study provides further confirmation of the dominance of the to -ing pattern in current British English, and it sheds new light on the two patterns during the crucial period of 1850-1920, studied on the basis of the extended version of the Corpus of Late Modern English Texts. It is during this period that the to -ing pattern began to emerge as a rival to the to infinitive pattern. The present study examines two factors that may have played a role in the rivalry between the two patterns during this period. The first factor concerns extraction contexts. To study such contexts, it is proposed that the Extraction Principle, formulated by Vosberg (2003b), should be broadened in such a way that it would encompass the extraction both of complements and of adjuncts out of complement clauses. The broader definition was adopted here, and 19 cases of extraction were observed in the CLMET, and in all of them the complement was of the to infinitive type. This finding, while not statistically significant in view of the overall predominance of to infinitives and the relative rarity of to -ing complements in the corpus, is nevertheless worth noting. The second factor examined was of a semantic nature. The question was raised as to whether a to infinitive complement may be associated with a sense of choice, and a to -ing complement with a lack of a sense of choice, on the part of the lower subject, as determined by the nature of the lower predicate. A division of such predicates was carried out by using the labels +/-Choice.
+Choice predicates were linked to predicates with Agent roles, and prototypically -Choice predicates were linked to the Patient role. However, the division was not strictly tied to semantic roles. While some predicates were viewed as being indeterminate from the point of view of the dichotomy, the division was easy to carry out for the great majority of lower predicates.
It was observed that in the great majority of the 99 tokens of sentential complements in the corpus, the lower predicate was of the +Choice type. Among such predicates there were some tokens of the emerging to -ing pattern, but the huge majority were of the to infinitive type. As for predicates of the -Choice type, the to -ing pattern was almost as frequent as the to infinitive pattern. This finding suggests that at the crucial transitional stage the to -ing pattern was associated with a particular semantic domain. This association of the pattern with a semantic niche may well have helped the emergence of the new pattern. When suitable large-scale corpora become available, it will naturally be of interest to examine the adjective accustomed in the period immediately after 1920 to see if the association of the emerging to -ing pattern with the particular semantic niche identified here might have continued during later decades, and how the increasing frequency of the new pattern was related to the +/-Choice factor during such later decades. A further task is to apply the notions identified here to the study of the adjective in other regional varieties around the end of the nineteenth century and in the early decades of the twentieth and to compare British English with such other varieties from this point of view.

Acknowledgements

The author is grateful to the participants of the ICAME meeting for their comments on the conference presentation in Stratford-upon-Avon and to Ian Gurney, of the University of Tampere, for kindly reading, and commenting on, the pre-submission version of this article. All remaining shortcomings are the author’s responsibility.

Notes

1
It may be noted that a more narrow formulation of extraction is proposed in Vosberg (2003a: 202). In that study Vosberg, when describing the nature of the effect of extraction on “infinitival and gerundial complement options,” refers to “environments where the object of the dependent verb is extracted.” This is more narrow than the definition in Vosberg (2003b), quoted in the text, since the notion of a complement, as used by Vosberg (2003a, 2003b) and more generally in the current literature, is broader than that of an object.
2
This finding raises the question of whether some other adjective in the semantic field of accustomed might have correspondingly gone up in frequency in recent English. The adjective used, as in The hard-pressed Japanese are used to living with very little personal space in their tiny houses, . . . (UK Books, Collins Cobuild Demonstration Corpus), comes to mind here as a prominent adjective with a meaning similar to that of accustomed. A pilot study was conducted to examine the incidence of this adjective with sentential complements involving subject control in the two corpora that are considered with respect to accustomed in this article. The search strings used in the pilot study were are/is/was/were+used+to, with
irrelevant tokens filtered out manually. It was indeed seen that the frequency of the adjective with sentential complements involving subject control rises from only seven in the CLMET, extended version, that is, 1.1 per million, to 22 in the UK Books Corpus, that is, 4.1 per million. However, this was a limited pilot study, and the present investigation, focusing on accustomed, invites a full-scale diachronic study of used and of other adjectives in the semantic field of accustomed in recent English. (For some discussion, of a mainly qualitative nature, of some adjectives in this semantic field, see Rudanko 2000: 90-96.)
References
Denison, D. (1998), ‘Syntax’, in S. Romaine, ed., The Cambridge history of the English language, volume 4: 1776-1997. Cambridge: Cambridge University Press, 92-326.
De Smet, H. (2005), ‘A corpus of Late Modern English texts’. ICAME Journal 29, 69-82.
Dowty, D. (1991), ‘Thematic proto-roles and argument selection’. Language, 67, 547-619.
Kjellmer, G. (1980), ‘“Accustomed to swim; accustomed to swimming”: on verbal forms after TO’, in: J. Allwood & M. Ljung (eds.), ALVAR: A linguistically varied assortment of readings. Studies presented to Alvar Ellegård on the occasion of his 60th Birthday. Stockholm Papers in English language and literature 1. Stockholm: Almqvist & Wiksell, 75-90.
OED = Oxford English Dictionary (OED Online). http://dictionary.oed.com/. Accessed on November 1, 2007.
Poutsma, H. (MS), Dictionary of constructions of verbs, adjectives, and nouns. Unpublished. Copyright: Oxford University Press.
Quirk, R., S. Greenbaum, G. Leech, and J. Svartvik. (1985), A comprehensive grammar of the English language. London: Longman.
Rohdenburg, G. (2006), ‘The role of functional constraints in the evolution of the English complementation system’, in: C. Dalton-Puffer et al. (eds.), Syntax, style and grammatical norms. Bern: Peter Lang, 143-166.
Rudanko, J. (1998), Change and continuity in the English language. Lanham, MD: University Press of America.
Rudanko, J. (2000), Corpora and complementation. Lanham, MD: University Press of America.
Rudanko, J. (2006), ‘Watching English grammar change: a case study on complement selection in British and American English’. English Language and Linguistics, 10, 31-48.
Rudanko, J. (2007), ‘Text type and current grammatical change in British and American English: a case study with evidence from the Bank of English Corpus’. English Studies 88, 465-483.
Vosberg, U. (2003a), ‘Cognitive complexity and the establishment of -ing constructions with retrospective verbs in Modern English’, in: M. Dossena and C. Jones (eds.), Insights into Late Modern English. Bern: Peter Lang, 197-220.
Vosberg, U. (2003b), ‘The role of extractions and horror aequi in the evolution of -ing complements in Modern English’, in: G. Rohdenburg and B. Mondorf (eds.), Determinants of grammatical variation in English. Berlin: Mouton de Gruyter, 305-27.
Encoding of goal-directed motion vs resultative aspect in the COME + infinitive construction
Sara Gesuato
University of Padua, Italy
Abstract
The verb COME followed by an infinitive signals goal-directed motion (“He often came to visit me”) or completion of a process (“She came to expect the worst”). In the former case, COME is used literally (‘move close to’), the infinitive encoding a clause of purpose. In the latter, COME is used aspectually (‘end up (V-ing)’), the infinitive encoding the resultative notion of culmination of an event. Concordances from the Bank of English online show that the aspectual sense of COME is more common than the literal one, and that while the latter is associated with verbs denoting deliberate actions, the former is instantiated with verbs denoting involuntary experiences. However, in specific syntactic environments (embedding of COME under “how” or “why”) the construction activates an aspectual interpretation (‘decide, happen, come about’) even when the infinitival complement refers to a voluntary act. Also, while the matrix clause can combine with auxiliaries and morphological markers to encode temporal-aspectual distinctions, it is preferentially encoded in the simple present, the simple past or the present perfect tense. Overall, the data suggest that the aspectual meaning of the COME + infinitive construction is typically associated with the presentation of complete events seen as ‘unintentional consequences’.1
1. Introduction
Non-spatial concepts are often conceptualized in spatial terms, with courses of events being linguistically represented as paths (McIntyre 2001: 150). Some common English motion verbs (e.g. get, go, bring, lead) can be used metaphorically to encode the notion of reaching a non-physical destination or achieving a goal (e.g. “This leads me to the last issue”, “Why don’t you get to the point?”, “It went to pieces”, “Her remark brought the conversation to a close”).2 A similar non-literal usage may also be activated when the verbs are followed by infinitival complements. In such cases, they encode aspectual notions such as inception (e.g. “I got to know him quite well”), culmination (e.g. “That goes to show you can never count on your neighbours”) or causation (e.g. “The scandal led them to resign”, “Everyday contact brought them to understand one another”). In addition, GO and GET have fully grammaticalized two specific metaphorical usages: futurity (e.g. “I am going to quit my job”) and causation (e.g. “I got him to show me the letter”), respectively. The verb COME can also be used either literally or metaphorically. In the former case, it encodes the notion of ‘moving closer towards a location near the
speaker/hearer’ (e.g. “Come in!”, “She came downstairs”, “We are coming home soon”). In the latter, it encodes the notions of ‘reaching a state’ (e.g. “They never came to a conclusion”, “When did the Conservatives last come to power?”) or ‘beginning to exist or act’ (e.g. “When did the word Renaissance come into use?”, “We just don’t know how genius comes to life”). The verb COME encodes this twofold meaning also when followed by an infinitival complement. This can represent a goal to be achieved (e.g. “I came to see you only once last year”) or a result seen as already achieved, whether in the past, present or future (e.g. “I came to see you as a friend”). In the former case, COME is used literally: the construction encodes two events, i.e. (a) movement and (b) a subsequent goal dependent on that movement, with the to of the infinitive meaning ‘in order to, so as to’. In the latter case, COME is used aspectually: the construction encodes a single event, conveyed through the infinitival complement, which COME qualifies with a resultative nuance meaning ‘end up V-ing’.
Research has been carried out on complex predicates in general (e.g. Alsina, Bresnan, Sells 1997; Hinrichs, Kathol, Nakazawa 1998; Müller 2002; Rosen 1990), on infinitival clauses and catenative constructions in English (e.g. Eastlack 1967; Fang 1995; Hoekstra 1988; Mair 1990; Stevens 1972; Whelpton 2001, 2002), including (causative) resultatives (e.g. Baicchi 2007; Carrier, Randall 1992; Claudé 1990; Goldberg, Jackendoff 2004; Ike-Uchi 1994; Ionescu 1994; Kudrnáčová 2005; Müller 2005; Shirai 1998; Tortora 1998; Yamada 1987) and on the spatial meanings of aspectualizers (Brinton 1988).3 However, very little is known about the use of COME with infinitival complements.4 The Cambridge Grammar of the English Language (Huddleston, Pullum, Bauer 2002) only exemplifies resultative COME (p. 1207), while the Longman Grammar of Spoken and Written English (Quirk, Biber 1999) mentions it as an aspectual verb controlling infinitive clauses, which means ‘come about’ or ‘happen’ (p. 708). Also, Brinton (1988: 4) lists COME among ingressive aspectualisers alongside COME ON (p. 61), specifically stating that it may have a motional or an aspectual meaning (p. 82), but does not discuss it.
The goal of this paper is to explore whether the literal and aspectual meanings of COME + infinitive correlate with distinct lexico-grammatical patterns. To this end, examples of the construction have been collected from a corpus of Present-day English and examined for specific morphological, semantic and syntactic co-textual features.
In the following sections, I first illustrate the aspectual notion of resultativity and show its relevance to the COME + infinitive construction. Then I present the data examined: I illustrate variant realizations of the construction and examine its co-text in a set of concordances collected from the Collins-Cobuild Bank of English on-line (BoE; 57,000,000 words). Finally, I outline the typical syntactic-semantic profile of the construction and propose a definition of its grammatical status.
2. Resultatives
The resultative in English is typically interpreted as a causative construction representing a process (encoded in a verb) that affects an entity (encoded in a noun) and that brings about a resultant state (encoded in an adjective or adverbial; e.g. Goldberg, Jackendoff 2004). In the following made-up examples the causal events are italicized, the affected entities bolded and the resultant states underlined: (1) “I kicked the door open” (2) “She broke the vase into a thousand pieces” (3) “We sang ourselves hoarse” (4) “He worked them into a frenzy” (5) “You swam yourself to near collapse”.
However, the resultative has also been interpreted as a construction encoding an event that reaches a final state (e.g. to kill, to find, to eat an apple; Bussmann 1996: 406), independently of whether it includes a causative meaning, it overtly signals the resultant state or it encodes its meaning components morphologically, syntactically or lexically. Following the latter interpretation, resultatives include, for instance:
a. constructions that mention causal events rendered with intrinsically causative verbs like to drive and that make reference to resultant states; e.g.: (6) “He drove me crazy”
b. constructions encoding only the resultant effects of processes, namely changes of state or position (property and path resultatives, respectively; Goldberg, Jackendoff 2004); e.g.: (7) “The door cracked open” (8) “The dragon flew over the hill”
c. constructions encoding causal events through prepositional phrases and resultant effects through verb phrases (sound-emission path and disappearance resultatives, respectively; Goldberg, Jackendoff 2004); e.g.: (9) “A fly buzzed past his ear” (10) “The tow truck disappeared into the sinkhole”
d. expressions that are morphologically marked for resultativity, i.e. denoting states that have been reached rather than (simply) states that exist; e.g.: (11) “[…] had he felt himself to be totally weakened […]” (Rowling 2007: 286)
e. lexical resultatives, both transitive like make, render, turn, and intransitive like become, get, turn; e.g.: (12) “I made it smaller” (13) “It turned red”
f. constructions which encode processes coming to an endpoint, but which merely imply resultant states (i.e. accomplishments rather than sequences of two events; cf. Stewart 1998); e.g.: (14) “He’s opened the door” [‘The door is now open’] (15) “We’ve completely repainted the house” [‘The house is now completely repainted’] (16) “She stubbed her cigarette out” [‘Thus the cigarette was out’] (17) “I’ve eaten” [‘I am now sated’].5
All the above examples count as resultatives because they encode the completion of a process and signal the reaching of a state.
2.1 COME + infinitive as a resultative
The activation of the motional or the aspectual meaning of COME when this occurs with infinitives can be detected through syntactic-semantic tests. Literal COME constructions are compatible with temporal expressions denoting the single point in time at which the motion is realized (e.g. “I came at 5 to meet the sales manager”). Also, they can be queried with how-questions which have scope over the matrix clause, and thus address the issue of the physical way in which motion is performed (e.g. “She came (here in order to) to buy a house” – “How (By what means) did she come?”).6 Finally, they allow the insertion of an adverbial encoding a circumstance of place between the matrix clause and the infinitival complement (e.g. “The volunteers often come [here] to help”). On the other hand, resultative COME constructions are compatible with, or can be paraphrased with, time expressions denoting a period or the passing of time (e.g. “[In just over a month,] I came to realize this was a great opportunity for me”, “[In time,] this discovery affected the arts”, “[As his terror subsided,] he came to see we could find a solution to our problem”, “Slowly, I came to understand what all of this meant” – “It took me a long time to understand what all of this meant”). In addition, they can be queried with how-questions whose scope is the construction as a whole (e.g. “How did she come to understand this was the best solution?” – “How did it come about (How did it happen) that she understood that this was the best solution?”), and not only the matrix clause (e.g. *“How did she get here to understand this was the best solution?”). Finally, they do not allow the insertion of an adverbial encoding a circumstance of place between the matrix clause and the infinitival complement, if the resultative
meaning is to be preserved (e.g. *“I’m glad I’ve come [here] to know you a little better”).
When the resultative meaning is activated, the construction represents an event as the culmination, or final result, of the process encoded in the infinitival complement. It signals the progressive transition from a stage in which the event has not started to take place to a stage in which it has reached its final conclusion, as is the case with accomplishments.7 For instance, “I came to understand” encodes the gradual realization and ultimate completion of the experience of understanding. It can be paraphrased as ‘I got to the point when I finally completely understood’, and it hints at the phases leading to that moment: ‘At first I didn’t understand; then I started to understand; as time went by, I understood more and more; finally I completely understood’.8 In particular, the verb COME signals the approach to and reaching of a target temporal point that coincides with the beginning or realization of a process.9 It is this lexico-syntactic reference to the (intermediate) terminal boundary of an event that makes COME + infinitive a resultative construction.
Resultative COME + infinitive evokes, or can be associated with, the expression of contextual circumstances typical of accomplishments, such as the amount of time needed for the process to be completed (e.g. “over a two-year period”), the cause determining – or contributing to the gradual progression of – the event (e.g. “after much thinking”) or the resultant effect (e.g. “Now I know”); e.g.: (18) “But as their fame grew, they came to relish being pictured […]” (N5000950319) (19) “[…] but after some thought I came to realize […]” (B0000000336) (20) “[…] as increasing numbers of Americans came to share his concern over environmental deterioration” (B9000001429).10
Also, the construction may combine with expressions denoting the (chrono)logical link between the process and result components of the unitary event being represented. These expressions include linkers focusing on the conclusion of the event (e.g. eventually, finally, in the end) or its unfolding over a preceding time span (e.g. as, gradually, more and more) or its connection to a preceding cause (e.g. therefore, in this way, and so); e.g.: (21) “In this way it comes to police the relationship between love and sex” (B0000001312) (22) “Social integration and solidarity would therefore come to be conceptualized in terms of bonding” (B0000000845) (23) “And so it came to pass that our boys sweated in the 80 degree summer heat” (N9119980608).
The specific nuance of meaning that resultative COME activates depends on the internal aspectual nature of the verbs it combines with. With punctual verbs (e.g.
realize), the construction hints at an intervening process – either the mere passing of time or an activity – leading up to the realization of an instantaneous event, whose resultant state (e.g. knowledge acquired) is left unmentioned. With stative verbs (e.g. believe), the construction signals the achievement of a state (i.e. ‘to be a believer’, ‘to be in a state of belief’) at the end of a preceding time span (signalled by COME) characterized by a different state (i.e. ‘not to believe’). With dynamic, non-punctual verbs (e.g. develop), the construction hints at the gradual unfolding of a process over a preceding time span, which results in the continuation or completion of that same process (i.e. ‘I started and continued to develop (until I finished developing)’).11 In the three cases, COME signals the culmination of a process (i.e. the punctual point of its completion), but at the same time imposes a duration on the process leading up to the endpoint (i.e. it hints at the event’s incremental steps, or preliminary stages, which precede its completion).
3. Data description
3.1 Syntactic realizations
Instances of come, comes, came and coming followed by active non-progressive infinitives (henceforth active infinitives), and of COME followed by active progressive and passive non-progressive infinitives (henceforth, progressive and passive infinitives, respectively) were collected from the BoE in April 2007. The total concordances retrieved were 2,913. Most instantiate the construction in the affirmative declarative form, with negative declaratives and interrogatives making up only 2% and 1% of the data, respectively. All the verb forms instantiating the construction show a preference for the written medium, with an average value of 73% (see Table 1).
Table 1: Distribution of forms over the oral and written medium
COME forms         Oral medium    Written medium    Global
come               260 (19%)      1,093 (81%)       1,353
comes              35 (25%)       106 (75%)         141
coming             91 (39%)       143 (61%)         234
came               213 (22%)      758 (78%)         971
COME + be V-ing    5 (36%)        9 (64%)           14
COME + be V-ed     40 (19%)       168 (81%)         208
Total              643 (27%)      2,270 (73%)       2,913
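The percentages in Table 1 can be recomputed from the raw counts. The sketch below is only an illustration, not the author's procedure: it assumes that each percentage is the row share of a medium and that the 73% "average value" is the unweighted mean of the six per-form shares. The names TABLE_1, written_share and mean_written_share are hypothetical.

```python
# Counts copied from Table 1 as (oral, written) per COME form.
TABLE_1 = {
    "come": (260, 1093),
    "comes": (35, 106),
    "coming": (91, 143),
    "came": (213, 758),
    "COME + be V-ing": (5, 9),
    "COME + be V-ed": (40, 168),
}


def written_share(oral, written):
    """Percentage of a form's tokens that occur in the written medium."""
    return 100 * written / (oral + written)


def mean_written_share(table):
    """Unweighted mean of the per-form written shares (an assumed reading of 'average')."""
    shares = [written_share(oral, written) for oral, written in table.values()]
    return sum(shares) / len(shares)


if __name__ == "__main__":
    for form, (oral, written) in TABLE_1.items():
        print(f"{form:<16} {written_share(oral, written):5.1f}% written")
    print(f"mean of per-form shares: {mean_written_share(TABLE_1):.1f}%")
```

Under this reading the mean comes out at about 73%, in line with the figure quoted in the text; pooling all rows before dividing would instead weight the frequent forms come and came more heavily.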
All verb forms of COME and the three types of infinitives are attested in the data, but with different frequency of occurrence (see Table 2). The matrix phrases instantiate mostly the forms come (49%) and came (38%), the forms comes and coming being quite rare. The complements are realized mostly as active infinitives (92%), while the progressive and passive infinitives account for only 0.5% and 7.5% of the data, respectively. In addition, the variant realizations of the matrix phrases and the infinitival complements have different combinatorial possibilities. On the one hand, the active infinitive is typically associated with the form come in the matrix phrase, while the passive and the progressive ones with the form came. On the other, the form coming cannot co-occur with a progressive infinitive.
Table 2: Frequency and distribution of COME forms and infinitival complements
COME      To V     %      To be V-ed   %      To be V-ing   %      Total    %
come      1,353    50     75           36     6             43     1,426    49
comes     141      5      5            2      1             7      147      5
coming    234      9      6            3      0             0      240      8
came      971      36     122          59     7             50     1,100    38
Total     2,699    100    208          100    14            100    2,913    100
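A cross-tabulation like Table 2 presupposes that every hit is classified by COME form and by complement type. The following is a rough sketch of how this could be approximated on plain text. The BoE query actually used is not given here, so the pattern, the stoplist NOT_VERBS and the list IRREGULAR_PARTICIPLES are illustrative assumptions. A surface pattern cannot separate infinitival to from prepositional to, so manual filtering would still be needed.

```python
import re
from collections import Counter

# Hypothetical surface pattern: a form of COME, "to", an optional "be", then one word.
PATTERN = re.compile(r"\b(come|comes|coming|came)\s+to\s+(be\s+)?([a-z]+)", re.IGNORECASE)

# Illustrative stoplist of words that signal prepositional "to" (e.g. "came to a conclusion").
NOT_VERBS = {"a", "an", "the", "my", "your", "his", "her", "its", "our", "their",
             "this", "that", "these", "those"}

# Small illustrative sample of participles that do not end in -ed or -en.
IRREGULAR_PARTICIPLES = {"known", "seen", "made", "understood", "thought", "told", "held"}


def tally(text):
    """Count (COME form, complement type) pairs found in a plain-text string."""
    counts = Counter()
    for match in PATTERN.finditer(text):
        form = match.group(1).lower()
        has_be = match.group(2) is not None
        word = match.group(3).lower()
        if not has_be and word in NOT_VERBS:
            continue  # most likely a prepositional phrase, not an infinitive
        if has_be and word.endswith("ing"):
            kind = "to be V-ing"
        elif has_be and (word.endswith(("ed", "en")) or word in IRREGULAR_PARTICIPLES):
            kind = "to be V-ed"
        else:
            kind = "to V"
        counts[(form, kind)] += 1
    return counts


if __name__ == "__main__":
    sample = ("He often came to visit me. She eventually came to be judged harshly. "
              "How did you come to be setting up a business? "
              "They never came to a conclusion.")
    for key, n in sorted(tally(sample).items()):
        print(key, n)
```

The participle test is deliberately crude: an adjective after be (for instance tired) would be counted as a passive, which is one more reason why hand-checking of the output would remain necessary.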
Just a few syntactic variants of the construction account for most of the data. The tenses most frequently encoded are the simple past, the present or past perfect, and the simple present tense, in this order (see Table 3). In particular, gerundive forms, catenative sequences with auxiliaries – whether marking tense or modality (including be about to, be going to, used to) – and non-finite forms account for 10%, 6% and 12% of the data, respectively.
Table 3: Frequency of syntactic variants
Tense and aspect                            Frequency: tokens    Frequency: %
Present: simple                             556                  19
Present: continuous                         89                   3
Perfect: present                            443                  15
Perfect: past                               258                  9
Past: simple                                1,140                39
Past: continuous                            48                   2
Present participle / gerund                 95                   3
Past participle                             27                   1
Will future                                 75                   2
Conditional: present                        42                   1
Infinitive                                  28                   1
Infinitive in periphrastic constructions    15                   1
Single auxiliary + bare infinitive          86                   3
Imperative                                  7                    0
Other                                       12                   0
Total                                       2,913                100
3.2 Syntactic co-text
Only about 10% of the data (i.e. 300 instances) co-occur with linkers, adverbs or adverbial expressions, whose scope may be the construction or the sentence it is part of. About 53% of these (i.e. about 5% of the data) express temporal or causal relationships (e.g. eventually, thus); e.g.: (24) “But the real costs that it imposes on a society otherwise not fully prepared for so great a shift from traditional structure and customs are only now coming to be understood” (N5000950406) (25) “Business is gradually coming to realize that the economic centre of gravity is moving north between Sydney and Brisbane” (N5000950130) (26) “Modernization therefore comes to mean construction of social systems ‘with a built-in tendency to change in the direction of greater value realization’” (B0000000845) (27) “In the end, it all comes to seem like one rhythm” (N0000000728) (28) “It was Haig’s ironic fate that he, an eminent Edwardian, eventually came to be judged according to the very different standards of another age” (B0000000551) (29) “In time, he was to come to tell the birds apart by their calls” (B9000001423). Reference to time may additionally be realized at the phrasal or clausal level, in the encoding of participants to and circumstances of the events being expressed. 
This is, however, an infrequent choice (less than 2% of the data); e.g.: (30) “I have followed the Bears all season and, in the process, have come to admire the players, administrators and senior coach” (N5000950904) (31) “More and more organizers have come to recognize that neighborhood security, for example, is no longer a function of the numbers of police present” (B9000001375) (32) “[…] even as European societies grew more intertwined ‘national antipathies flourished alongside the increasing pace with which men came to know one another’” (N20000951216) (33) “Hour after hour, day by day, the crowd came to sit and stare at the Parliament” (B000001170) (34) “[…] it is only in the course of conversation that you come to realise there are areas they refuse to talk about […]” (B0000000649) (35) “[…] the American visitor with three or even two weeks would come to know gunbearer or tent servant” (B0000000774) (36) “As time goes on, I hope we come to know each other better” (B0000000906) (37) “As I’ve gotten older, however, I’ve come to realize the importance of what the patient is saying, even though nothing is visible to the naked eye” (B9000000434)
(38) “[…] and their design has evolved so that they have come to incorporate standard details” (N80000000084) (39) “It is to our national credit that this implausible fiction has been overturned and to our national shame that it took so long to come to pass” (N5000950606) (40) “By the time I was fifteen, food had come to play a major part in my life” (B0000001213) (41) “[…] during the years I’ve been giving lessons I’ve come to realise that a large number of amateur golfers hate sand shots […]” (N0000000595). There is a specific, although infrequent, co-text that appears to favour a resultative interpretation of COME + infinitive, and that is when the construction is embedded under how or why. In such cases the focus is on the achievement of a goal. The aspectual nuance conveyed can be glossed as ‘decide’, when reference is made to deliberate actions performed by an agent; alternatively, the construction may convey the meaning of ‘end up, happen’, when the event under discussion makes reference to external circumstances affecting participants; e.g.: (42) “Cross wondered how on earth she had come to be married to a coarse, inhuman character like Blount” (B9000000492) (43) “[…] in the true story of how I came to be turned out of my office” (N5000950319) (44) “[…] and that’s why I come to be paying that amount (S9000000973) (45) “And erm can you tell me how you came to move here” (S9000001606) (46) “How and why did Yeats come to paint the way he did?” (N2000960324) (47) “I mean how did you come to be setting up a business like this” (S9000000907). However, the presence of embedding how or why is no guarantee that an aspectual interpretation is in order, as the larger co-text may reveal. Consider: (48) “[…] and that is why they would not come to give evidence at the inquest” (N6000920519) (49) “And I mean that’s why we came to live here” (S9000001469), which express goal-oriented motion. Overall, embedding how and why occur in 3% and 1%, respectively, of the data. 
Most of the instances of the construction variant with progressive infinitives (11/14, i.e. 79%) are embedded under how.
3.3 Lexical associations
The infinitival complements exemplify 785 lexemes, that is, on average, one every 3.7 concordances (see Table 4). The most common ones are listed in the following hierarchy, each accompanied by their token frequency:
see-440 > know-148 > live-87 > think-81 > visit-74 > realize-71 > believe-60 > pass-48 > expect-47 > stay-46 > understand-45 > call-40 > rest-42 > take-31 > recognise-24 > accept-22
Table 4: Lexeme types and tokens in the infinitival complements
Verb forms            Types    Tokens    Type-token ratio
come + to V           298      1,353     5
comes + to V          62       141       2
coming + to V         82       234       3
came + to V           270      971       4
COME + to be V-ing    12       14        1
COME + to be V-ed     61       208       3
Total / Average       785      2,913     3
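The last column of Table 4 appears to report tokens per type, that is, how many concordances on average instantiate each lexeme, rounded to a whole number. The sketch below rests on that assumption; the names TABLE_4 and tokens_per_type are hypothetical.

```python
# (types, tokens) per construction variant, copied from Table 4.
TABLE_4 = {
    "come + to V": (298, 1353),
    "comes + to V": (62, 141),
    "coming + to V": (82, 234),
    "came + to V": (270, 971),
    "COME + to be V-ing": (12, 14),
    "COME + to be V-ed": (61, 208),
}


def tokens_per_type(types, tokens):
    """Average number of tokens instantiating each lexeme type."""
    return tokens / types


if __name__ == "__main__":
    for variant, (types, tokens) in TABLE_4.items():
        print(f"{variant:<20} {tokens_per_type(types, tokens):.2f}")
    # Overall figure from the table's own total row: 2,913 concordances over 785 lexemes.
    print(f"overall: {tokens_per_type(785, 2913):.1f}")
```

On the total row this gives 3.7, which is the "one every 3.7 concordances" quoted in the text.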
The lexemes instantiated denote either deliberate actions (visit, take) or involuntary experiences (accept, believe, expect, know, realize, recognise, rest, understand). But they may also be ambiguous between the two interpretations; this depends, for instance, on whether a given verb occurs in the active or passive voice (e.g. call - be called), on whether it is polysemous (pass ‘give’ vs ‘spend time’; see ‘perceive visually’ vs ‘pay a visit’; think ‘reflect’ vs ‘have an opinion’) and on whether it is compatible with both an agentive and an experiential interpretation (‘choosing/happening to’ live/stay). In addition, as is the case with how or why embedding, the literal or aspectual interpretation of the construction relies on cues from the larger co-text. For instance, verbs of involuntary experience may be used to encode goals. In such cases, COME is used literally, and the infinitival complement expresses an outcome that the subject hopes or tries to achieve, even if this is not totally under her control; e.g.: (50) “We wanted to win, we came to win” (N9119980615) (51) “[…] when 180,000 fans came to witness the annihilation of the opposition by Nigel Mansell” (N0000000794). Finally, certain cases may remain ambiguous even when the immediate lexico-syntactic environment is taken into consideration. This applies especially, but not only, to subordinate or embedded clauses; e.g.: (52) “[…] always point it away from you and anybody else when you come to open it” (E0000002013) (53) “Although your main rows will be empty when you come to plant out your winter crops […]” (B0000001178) (54) “‘[…] the nature of the species that he has come to redeem’” (B9000001369) (55) “When I came to write about the city, it was very challenging […]” (N6000920227)
(56) “Then Saddlers’ Hall joined in the aggravation as he came to challenge the leaders” (N6000920605) (57) “But it was important not to lose sight of it when the Legal Aid Board came to decide whether to cooperate with a scheme […]” (N2000960405) (58) “[…] who came to exert a mutually transforming influence upon Africans of his time […]” (B0000001159) (59) “And you have to put that into the scales when I came to face the British Aerospace decision […]” (N6000940421) (60) “We come to say that the evil and inhumanity represented by Sandakan […]” (N5000950712).
3.4 Distribution of meanings
Manual coding of the data reveals an uneven distribution of the literal and aspectual meanings of the construction across its syntactic variants (see Table 5). On average, the aspectual meaning is favoured over the literal one (59% vs 39%), but a strong preference for the resultative interpretation applies only to the COME + progressive infinitive and COME + passive infinitive constructions (100% and 98%, respectively). A less marked preference occurs with come + active infinitive. The coming + active infinitive variant displays a strong preference for the literal interpretation (82%), followed by came + active infinitive (60%) and come + active infinitive (39%). Ambiguous cases account for only 2% of the data. The different frequency values for the literal and aspectual meanings are statistically significant (p-value < 0.01) for all forms except comes (p-value > 0.01). Most of the lexemes associated with the encoding of resultative aspect are exclusively reserved for this function; they encode involuntary experiences. A smaller group, however, are also employed in sentences with a literal interpretation (see Table 6).
Table 5: Distribution of literal and aspectual meanings across variant forms of the construction, in percentage values
COME forms         Literal    Aspectual    Other
come               39%        58%          4%
comes              51%        45%          4%
coming             82%        18%          0%
came               60%        37%          3%
COME + be V-ing    0%         100%         0%
COME + be V-ed     2%         98%          0%
Average            39%        59%          2%
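The text does not name the significance test behind the reported p-values. One plausible reconstruction, offered purely as an assumption, is a chi-square goodness-of-fit test of the literal and aspectual counts against an even split, leaving the "Other" cases aside. Raw counts per meaning are not reported either, so the sketch approximates them from the rounded percentages in Table 5 and the active-infinitive totals in Table 2; FORMS, approx_counts and chi_square_equal_split are hypothetical names.

```python
import math

# (active-infinitive total from Table 2, literal %, aspectual % from Table 5).
FORMS = {
    "come": (1353, 39, 58),
    "comes": (141, 51, 45),
    "coming": (234, 82, 18),
    "came": (971, 60, 37),
}


def approx_counts(total, literal_pct, aspectual_pct):
    """Approximate raw counts from rounded percentages (an assumption, not reported data)."""
    return round(total * literal_pct / 100), round(total * aspectual_pct / 100)


def chi_square_equal_split(a, b):
    """Chi-square statistic and p-value (1 degree of freedom) against a 50/50 split."""
    expected = (a + b) / 2
    stat = ((a - expected) ** 2 + (b - expected) ** 2) / expected
    # With one degree of freedom the chi-square tail equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


if __name__ == "__main__":
    for form, (total, lit_pct, asp_pct) in FORMS.items():
        literal, aspectual = approx_counts(total, lit_pct, asp_pct)
        stat, p = chi_square_equal_split(literal, aspectual)
        print(f"{form:<7} literal={literal:<4} aspectual={aspectual:<4} chi2={stat:6.2f} p={p:.3g}")
```

With these approximated counts, only comes fails to reach the 0.01 threshold, which agrees with the pattern reported in the text; a different test or the exact counts could of course give different values.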
Table 6: Colligation of variants of the construction with lexemes encoding only aspect vs lexemes encoding both motion and aspect
COME forms         Lexemes encoding only aspect    Lexemes encoding both motion and aspect
come               40%                             9%
comes              53%                             5%
coming             29%                             4%
came               37%                             6%
COME + be V-ing    100%                            0%
COME + be V-ed     93%                             0%
Average            59%                             4%
The infinitival complements may encode durative processes – whether stative (e.g. seem), dynamic (e.g. use) or envisaging a natural endpoint (e.g. build a hut) – single instantaneous events (e.g. leave), and repeated events (e.g. make each piece of work; suggest every now and then). Table 7 shows their frequency and distribution in the data with regard to those occurrences in which COME + infinitive unequivocally encodes a resultative meaning. In the various corpus subsets there is a consistent preference for durative events, which on average account for about 60% of the data. Punctual events are represented, making up 30% of the data. Habitual events, instead, are rarely instantiated, i.e. about 3% of the time.
To sum up, the concordances reveal that COME + infinitive is a fairly frequent construction, used mostly in writing, and preferably realized in a few tenses marked for perfective aspect, which may express goal-oriented motion or, more frequently, resultative aspect, especially in combination with the encoding of durative events.
Table 7: Temporal characteristics of events in resultative instances of COME + infinitive
COME forms         Durative     Single instantaneous    Repeated    Other
come               498 (63%)    248 (32%)               6 (1%)      33 (4%)
comes              42 (67%)     17 (27%)                0 (0%)      4 (6%)
coming             25 (58%)     16 (37%)                0 (0%)      2 (5%)
came               213 (59%)    109 (30%)               0 (0%)      39 (11%)
COME + be V-ing    8 (58%)      2 (14%)                 2 (14%)     2 (14%)
COME + be V-ed     109 (53%)    85 (42%)                4 (2%)      6 (3%)
Average %          60%          30%                     3%          7%
4. Discussion and conclusion
The COME + infinitive sequence is attested as a frequent syntactic form in a general corpus of English. In its literal usage, it encodes goal-directed motion. In its more frequent aspectual usage, instead, it encodes resultative aspect, that is, the completion of a process or achievement of a goal. In the latter interpretation, it counts as a manifestation of the localist theory of aspect (Brinton 1988: 112-114), according to which, there is “conformity between the spatial meanings of aspect categories and the semantics of the verbs involved” (e.g. ingressive aspect is marked by verbs expressing movement into a situation; p. 95). Resultative COME + infinitive exemplifies the metonymic shift in focus of a motion verb from a spatial meaning to an aspectual meaning, which takes place when it collocates with another verb expressing an action or state (Brinton 1988: 112-114). In general, resultative COME + infinitive manifests the incremental transition of an event to a culmination, or the reaching of a target state, which stands for a metaphorical result-location. It therefore expresses two notions: the development of a process (i.e. a change of state) and the reaching of its endpoint (i.e. the realization of an event). As a result, it can be likened to other structures technically expressing motion but actually denoting change of state, such as going to sleep, falling asleep, putting someone to sleep (Talmy 1975: 234). More specifically, it encodes varying aspectual nuances, depending on the types of verbs it combines with: attainment of a result, with stative verbs like know; inception of a process, with dynamic durative verbs like develop; and realization of a process, with dynamic punctual verbs like arrive.
The interpretation of the construction is strongly influenced by the type of event encoded in its infinitival complement: if this denotes a deliberate act, a literal interpretation is favoured; if it denotes an involuntary experience, an aspectual interpretation is likely to be activated. However, despite this correlation, occasional exceptions are attested: certain instantiations are interpretable either in the sense of ‘get closer so as to’ or in that of ‘decide/happen to’ independently of the verb used (e.g. come to buy/rest/win), and only the surrounding co-text (e.g. time adverbs, temporal clauses, how- or why-embedded clauses) may help disambiguate them.
The construction displays clear semantic preferences. Although it is used with a great variety of lexemes, most of these encode involuntary experiences or events interpretable as being determined or influenced by external circumstances. More specifically, the lexemes include verbs of physical experience (e.g. develop, die, exist, fall, find, form, get, happen, listen, live, look, notice, perceive, receive, pass, rest, see, wear); verbs of emotional experience (e.g. adore, cherish, deserve, despise, dread, face, fear, feel, hate, loathe, love, prefer, regret, relish, resent, worship); verbs of cognitive experience (e.g. believe, consider, decide, doubt, expect, figure, find out, know, learn, realize, reflect, regard, rely, respect, think, trust, understand, view, value); verbs of relation, often with inanimate subjects (e.g. become, challenge, characterize, comprise, define, denote, depend, epitomize, focus, make up, mean, personify, possess, represent, resemble, seem, sound); and
verbs denoting the impact caused by the subject, whether animate or inanimate (e.g. challenge, exert, force, outnumber, overshadow, preserve, reign, share).
The re-interpretation of an original expression of motion as a lexicosyntactic marker of resultative aspect is fostered by two co-textual features. In its resultative instantiations, the construction tends to encode durative events, as is typical of ingressive aspectualizers, although it is also instantiated with punctual ones. Also, the matrix clause is mostly realized in non-progressive forms, while its complement tends to be rendered as an active infinitive. This is in line with the semantics of the construction: the use of perfective forms is particularly suitable for encoding the completion of a process.12
COME + infinitive can be said to illustrate the partial grammaticalization of a spatial expression into a marker of resultativity. On the one hand, its grammatical re-interpretation is not complete: the construction takes on an aspectual, modal-like meaning in a favourable co-text, although it can still retain the literal meaning of goal-oriented motion, and is at times ambiguous between a literal and an aspectual interpretation. On the other, its specific aspectual meaning is resultative because, through a combination of lexical and syntactic means, the construction encodes the accomplishment of a process, whose resultant state can be inferred, even if it is not overtly expressed.13 The link between the literal and the aspectual meaning of the construction is provided by those examples in which COME is used literally, but is followed by a verb denoting a non-deliberate event; e.g.:
(61) “[…] the rain came to bless me with all its clumsy fingers” (S2000910319)
(62) “[…] the nose twisted and came to touch the knees” (B9000001254)
(63) “Air sacs are where blood vessels come to deposit ‘used’ air (carbon dioxide)” (N0000000740).
However, only diachronic data can provide definite insights into the origin of the resultative variant of the construction. The exploration of the diachrony of the phenomenon goes beyond the scope of this study, but it is certainly a worthwhile research goal: by consulting other corpora and/or concordances from texts by 18th- and 19th-century authors, for instance, it should be possible to understand whether resultative COME + infinitive is a recent innovation or a structure that was also available to speakers/writers in the past, but whose frequency of occurrence may have increased in recent times. Additionally, one could trace and compare developmental trends across registers (spoken and written) and geographical varieties (e.g. American and British) over time, which could give insights into the overall grammaticalization process (cf. Mair 2008 on infinitival complements in specificational clefts). More generally, the consultation of additional corpora may shed light on the actual spread and degree of prominence of the construction examined. On the one hand, the higher occurrence of COME + infinitive in written sources (see Table 1) may be due to a bias in the design of the BoE, most of whose components are representative of the written register. On the other hand, the
relative scarcity of narrative texts – with their focus on the past – in the BoE may have downplayed the magnitude of the resultative structure (see Table 3 about the preference of the construction for perfect and past tenses). Either way, it is only by comparing the findings reported here with more data, from varied sources, that the issue can begin to be settled. A step in this direction has already been taken. I have looked at the occurrence of resultative COME + infinitive in various components of the International Corpus of English (ICE; Gesuato 2008a, 2008b). Although fewer instances of the construction have been retrieved, the same kinds of co-textual preferences and phraseological associations have been identified there as in the BoE, but with one exception. The Great Britain component more frequently instantiates the literal than the aspectual meaning, while the Hong Kong component instantiates both to the same degree. The ICE data, therefore, seems to suggest that the native variety of British English is not at the forefront of the aspectual development of the construction, which runs counter to what one would expect in general and also to the BoE findings, where the aspectual meaning is more firmly established than the literal one. Even this limited comparison, therefore, reveals that, while the use of corpora is extremely useful in finding out what grammatical and textual patterns characterize a given expression, no single corpus will actually reveal the whole picture of a given linguistic phenomenon. Only by comparing findings from different corpora is it possible to explore how the performance of single individuals can modify the competence of groups of individuals over time. In addition, it may be advisable to compare corpus data with elicited data: the most frequent sense of a given form is not necessarily its most prototypical meaning, as tested against native speakers’ judgments (Leech 2008). 
An interesting finding from the study is that resultative COME is more common than literal COME (see Table 5) and in statistically significant terms. This suggests that the grammaticalization process affecting COME + infinitive is well under way. Indeed, indirect support for this interpretation is provided by the patterns of comparable grammaticalizing constructions based on motion verbs. For instance, non-progressive forms of GO followed by an active infinitive in the BoE have been found to encode the literal meaning of ‘moving away so as to’ 88% of the time, and to instantiate related, metaphorical meanings outside the domain of tense (‘be transferred and used’, ‘contribute to’, ‘succeed in’ and ‘proceed to’) only 12% of the time (Gesuato forthcoming). Similarly, the BoE has been found to instantiate the have/has/had been to V construction meaning ‘being back from V-ing’ only marginally (i.e. with 41 unambiguous examples; Gesuato 2008c). According to Heine and Kuteva (2002: 2), there are four mechanisms involved in grammaticalization: “(a) desemanticisation (or “semantic bleaching”) – loss in meaning content, (b) extension (or context generalization) – use in new contexts, (c) decategorialisation – loss in morphosyntactic properties characteristic of
lexical or other less grammaticalised forms, and (d) erosion (or “phonetic reduction”) – loss in phonetic substance.” The BoE data suggests that resultative COME + infinitive has reached the second stage. However, only a comparison of instances of motional and aspectual COME collected from a speech corpus could reveal whether the resultative examples are also characterized by phonetic reduction with respect to the literal ones. The same authors (pp. 318-319) also show how cross-linguistically the verb COME can be grammaticalized into a resultative marker to denote a change of state, like other aspectual markers (e.g. go, go to, finish, leave). Their survey, therefore, lends support to an interpretation of the non-literal COME + infinitive as a marker of resultative aspect.
In conclusion, the role of resultative COME + infinitive in the system of the English language is similar to that of other resultative constructions and lexical aspectualizers: it contributes to the encoding of aspect, which is not fully grammaticalized (i.e. not systematically realized through morpho-syntax; cf. Hopper 1979: 239-40; Horrocks, Stavrou 2003: 299). More specifically, COME + infinitive signals the completed development of a process, although this completion is presented not as already achieved, but as an outcome to be achieved, projected into a later stage. Therefore, while resembling ingressive aspectualizers denoting the beginning of durative processes (Brinton 1988), resultative COME actually functions as a forward-oriented or prospective marker of perfective aspect, which expresses the realization of an event as dependent on the conclusion of an introductory phase.
Notes
1 Thanks go to Alberto Mioni and an anonymous reviewer for helpful comments and suggestions on an earlier draft of this paper.
2 Here and elsewhere, made-up examples appear only in double quotes, while examples from the corpus consulted are followed by the specific text reference.
3 There are different views on what syntactic forms count as complex predicates. According to Butt’s (1997: 108) and Mohanan’s (1997: 432) definitions, complex predicate constructions combine two or more semantically predicative elements, which contribute arguments into the flat grammatical function of a single, simple predicate.
4 The semantics of COME, however, has been examined (Goddard 1997).
5 For other types of resultatives, see Horrocks, Stavrou (2003) and Nedjalkov (1988).
6 Otherwise, if the question is made relevant to the larger event encoded in the sentence, the meaning conveyed will actually be resultative (e.g. “How did it happen that she bought a house?”).
7 Cf. Bertinetto and Squartini’s (1995) description of gradual completion verbs.
8 The resultative meaning of COME is not necessarily dependent on the occurrence of an infinitival complement. It may also be instantiated when followed by an indirect object that encodes a state, event or activity, rather than a physical destination; e.g.: COME + to a decision/conclusion/view; + to an end/stop/halt/standstill; + to power/prominence; + into being/existence/operation/effect; + into view/sight. In addition, it can be activated when used with a predicative adjective denoting a resultant state; e.g. COME + apart/unstuck/undone/untied, + true. Finally, it is also encoded in COME-based phrasal verbs, albeit with specific nuances; e.g. COME IN + first/second; + useful/handy; COME OFF + well/badly/worst. It thus parallels other English motion verbs, in that it can be used both literally and non-literally in similar syntactic environments (see section 1).
9 Cf. Klein’s (1994) characterization of aspect in terms of the interaction of source and target states and their relevant pre- and post-time (ch. 6), as well as his description of the meaning of COME along the same lines (note 4 on p. 227).
10 In these and following examples, underlining signals added emphasis.
11 Cf. Brinton’s (1988: 43) description of the meaning nuances conveyed by the perfective depending on the verb type it is applied to.
12 Brinton (1988: 16), for instance, explicitly states that the simple present and the simple past are markers of perfective aspect.
13 Cf. Bertinetto’s (1986: 98, 274 et passim) definition of resultative as ‘+durative’ and ‘+telic’.
References
Alsina A., Bresnan J., Sells P. (eds.) (1997), Complex Predicates. Stanford: CSLI.
Baicchi A. (2007), ‘“He Smiled me into Love”. The Subsumption Process of the Intransitive-Transitive Migration’, paper presented at the 23rd AIA (Associazione Italiana di Anglistica) Conference ‘Forms of Migration, Migration of Forms’. University of Bari, 20-22 September 2007.
Bertinetto P.M. (1986), Tempo, Aspetto e Azione nel Verbo Italiano. Firenze: Accademia della Crusca.
Bertinetto P.M., Squartini M. (1995), ‘An Attempt at Defining the Class of “Gradual Completion Verbs”’, in: P.M. Bertinetto, V. Bianchi, J. Higginbotham and M. Squartini (eds.) Temporal Reference, Aspect and Actionality, 1: Semantic and Syntactic Perspectives. Torino: Rosenberg and Sellier. 11-27.
Brinton L.J. (1988), The Development of English Aspectual Systems. Cambridge: CUP.
Bussmann H. (ed.) (1996), Routledge Dictionary of Language and Linguistics, vol. II. London: Routledge.
Butt M. (1997), ‘Complex Predicates in Urdu’, in: A. Alsina, J. Bresnan, P. Sells (eds.) Complex Predicates. Stanford: CSLI. 107-150.
Carrier J., Randall J.H. (1992), ‘The Argument Structure and Syntactic Structure of Resultatives’, Linguistic Inquiry, 23(2): 173-234.
Claudé P. (1990), ‘La Biprédication Résultative en Anglais’, Sigma: Linguistique Anglaise – Linguistique Générale, 14: 143-56.
Eastlack C.L. (1967), ‘Catenative Verbs in Portuguese and English: A Contrastive Study’, Estudos Lingüísticos, 2(1-2): 43-56.
Fang A.C. (1995), ‘Distribution of Infinitives in Contemporary British English: A Study Based on the British ICE Corpus’, Literary & Linguistic Computing, 10(4): 247-57.
Gesuato S. (2008a), ‘The Resultative Aspectualizer COME + to_Infinitive in Five Varieties of English’, paper presented at the 4th IVACS (Inter-Varietal Applied Corpus Studies International) Conference. University of Limerick, Ireland, 13-14 June 2008.
Gesuato S. (2008b), ‘Motional and Aspectual Usage of COME + To-infinitive in Native and Non-native English Varieties’, in: Associação de Estudos de Investigação Científica do ISLA-Lisboa (ed.) TaLC8 Lisbon, Proceedings of the 8th Teaching and Language Corpora Conference, 3-6 July 2008, Offsetmais Artes Gráficas S.A., 379-385.
Gesuato S. (2008c), ‘Corpus Data and Elicited Data: The Case of HAVE BEEN + to_infinitive’, paper presented at the 9th ESSE (European Society for the Study of English) Conference. University of Aarhus, Denmark, 22-26 August 2008.
Gesuato S. (forthcoming), ‘GO to V: Literal Meaning and Metaphorical Extensions’, in: M. Hundt, D. Schreier, A. Jucker (eds.) Proceedings of the 29th ICAME (International Computer Archive of Medieval and Modern English) Conference ‘Corpora: Pragmatics and Discourse’. University of Zurich, Ascona, Switzerland, 14-18 May 2008.
Goddard C. (1997), ‘The Semantics of Coming and Going’, Pragmatics, 7(2): 147-62.
Goldberg A.E., Jackendoff R. (2004), ‘The English Resultative as a Family of Constructions’, Language, 80(3): 532-68.
Heine B., Kuteva T. (2002), World Lexicon of Grammaticalization. Cambridge: CUP.
Hinrichs E., Kathol A., Nakazawa T. (eds.) (1998), Complex Predicates in Nonderivational Syntax, vol. 30 of Syntax and Semantics. San Diego: Academic Press.
Hoekstra T. (1988), ‘Small Clause Results’, Lingua, 74: 101-39.
Hopper P.J. (1979), ‘Aspect and Foregrounding in Discourse’, in: T. Givón (ed.) Syntax and Semantics, vol. 12: Discourse and Syntax. New York: Academic Press. 213-41.
Horrocks G., Stavrou M. (2003), ‘Actions and their Results in Greek and English: The Complementarity of Morphologically Encoded (Viewpoint) Aspect and Syntactic Resultative Predication’, Journal of Semantics, 20: 297-327.
Huddleston R., Pullum G.K., Bauer L. (eds.) (2002), The Cambridge Grammar of the English Language. Cambridge: CUP.
Ike-Uchi M. (1994), ‘English Resultative Constructions and Wh-movement’, in: S. Chiba et al. (eds.) Synchronic and Diachronic Approaches to Language. A Festschrift for Toshio Nakao on the Occasion of his Sixtieth Birthday. Tokyo: Lieber Press. 361-78.
Ionescu D. (1994), ‘Resultative Small Clauses’, Revue Roumaine de Linguistique, 39(3-4): 353-69.
Klein W. (1994), Time in Language. London/New York: Routledge.
Kudrnáčová N. (2005), ‘On One Type of Resultative Minimal Pair with Agentive Verbs of Locomotion’, in: J. Čermák, A. Klégr, M. Malá, P. Šaldová (eds.) Patterns: A Festschrift for Libuše Dušková. Prague: Charles University. 107-14.
Leech G. (2008), ‘Frequency is Important – and Challenging: A Present-day Corpus Perspective’, paper presented at the 8th TALC (Teaching and Language Corpora) Conference. University of Lisbon, 3-6 July 2008.
Mair C. (1990), Infinitival Complement Clauses in English. A Study of Syntax in Discourse. Cambridge: CUP.
Mair C. (2008), ‘Right in the Middle of the S-shaped Curve: On the Spread of Specificational Clefts in 20th Century English’, paper presented at the 9th ESSE (European Society for the Study of English) Conference. University of Aarhus, 22-26 August 2008.
McIntyre A. (2001), ‘Argument Blockages Induced by Verb Particles in English and German: Event Modification and Secondary Predication’, in: N. Dehé, A. Wanner (eds.) Structural Aspects of Semantically Complex Verbs. Berlin/Frankfurt/New York: Peter Lang. 131-64.
Mohanan T. (1997), ‘Multidimensionality of Representation: NV Complex Predicates in Hindi’, in: A. Alsina, J. Bresnan, P. Sells (eds.) Complex Predicates. Stanford: CSLI. 431-72.
Müller S. (2002), Complex Predicates, Verbal Complexes, Resultative Constructions, and Particle Verbs in German. Stanford: CSLI.
Müller S. (2005), ‘Resultative Constructions – Syntax, World Knowledge, and Collocational Restrictions’, Studies in Language, 29(3): 651-81.
Nedjalkov V.P. (ed.) (1988), Typology of Resultative Constructions. Amsterdam/Philadelphia: John Benjamins.
Quirk R., Biber D. (eds.) (1999), Longman Grammar of Spoken and Written English. London: Longman.
Rosen S.T. (1990), Argument Structure and Complex Predicates. New York/London: Garland Publishing.
Rowling J.K. (2007), Harry Potter and the Deathly Hallows. London: Bloomsbury.
Shirai Y. (1998), ‘Where the Progressive and the Resultative Meet. Imperfective Aspect in Japanese, Chinese, Korean and English’, Studies in Language, 22(3): 661-92.
Stevens W.J. (1972), ‘The Catenative Auxiliaries in English’, Language Sciences, 23: 21-5.
Stewart O.T. (1998), ‘Evidence for the Distinction between Resultative and Consequential Serial Verbs’, in: B. Bergen, M. Plauché, A. Bailey (eds.) Proceedings of the Twenty-fourth Annual Meeting of the Berkeley Linguistics Society, February 14-18, 1998, General Session and Parasession on Phonetics and Phonological Universals. Berkeley, CA: Berkeley Linguistics Society. 232-243.
Talmy L. (1975), ‘Semantics and Syntax of Motion’, in: J.P. Kimball (ed.) Syntax and Semantics, vol. 4. London: Academic Press. 181-238.
Tortora C.M. (1998), ‘Verbs of Inherently Directed Motion are Compatible with Resultative Phrases’, Linguistic Inquiry, 29(2): 338-45.
Whelpton M. (2001), ‘Elucidation of a Telic Infinitive’, Journal of Linguistics, 37(2): 313-37.
Whelpton M. (2002), ‘Locality and Control with Infinitives of Result’, Natural Language Semantics, 10: 167-210.
Yamada Y. (1987), ‘Two Types of Resultative Constructions’, English Linguistics: Journal of the English Society of Japan, 4: 73-90.
A corpus-based analysis of invariant tags in five varieties of English
Georgie Columbus
Department of Linguistics, University of Alberta
Abstract
Discourse markers are a feature of everyday conversation – they signal speakers’ attitudes and beliefs to interlocutors beyond the base utterance. One particular type of discourse marker is the invariant tag (InT), for example New Zealand and Canadian eh. Previous studies of InTs have clearly described InT uses in individual language varieties. Such studies have focused on sociolinguistic features and on sociolinguistic functions of single markers. However, InTs as a class have not yet been fully described, and the variety of approaches taken (corpus- as well as survey-based) means that cross-varietal or crosslinguistic comparison cannot be conducted with the results thus far. This study investigates InTs in five varieties of English using a corpus-based approach. It lists the utterance-final InTs available in NZ, British, Indian, Singapore and Hong Kong English through their occurrences in their respective International Corpus of English (ICE) corpora, and compares frequency of usage across the varieties. The quantitative analysis offers a clearer overview of the InT class for descriptive grammars, and clarifies some usage aspects for ESL/EFL pedagogy. Finally, the results offer an insight into the global status of InTs in English. 1
1. Introduction
Question tags have long been the subject of sociolinguistic and variationist studies. Canonical question tags, such as aren’t you? and isn’t it?, have received much attention in linguistics, perhaps due to their curious syntactic and semantic properties, including inversion and polarity. In the last few decades especially, invariant tags (InTs) such as huh and innit have been equally researched and documented. InTs provide similar attitudinal and evidential meanings above the level of the proposition as canonical tags, but do not undergo changes in structure or polarity. Yet while canonical question tags are the focus of much ESL/EFL clarification in syllabi and texts, their invariant counterparts are rarely formally taught. This imbalance, and the prevalence of one particular tag (eh) in both my home and adopted countries, formed the impetus to investigate the meanings and usage patterns of InTs in different English varieties.
1.1 Previous InT studies
Much research has been undertaken on InTs in particular varieties and/or dialects of English. Most of this research has been within the realms of Conversation
Analysis, focusing on sociolinguistic patterns of use and/or pragmatic contributions of tags in a speech community. Sociolinguistic factors such as distribution of the markers within a speaker population have been investigated by Stubbe and Holmes (1995), Andersen (1997, 1998), Algeo (1998), Stubbe (1999), and Starks, Thompson and Christie (2008). Other studies have focused on InT meaning and functions, for example Holmes’ (1982) description of both canonical and non-canonical (i.e. invariant) tags in New Zealand English. Holmes divided the items into hearer- and speaker-oriented categories, and offered a list of potential functions. Meanwhile, Norrick’s (1995) study of US English hunh looked more at the pragmatic features, such as use indicating sarcasm or irony, and use in (semi-)fixed expressions. Berland (1997), on the other hand, focused on teenagers’ use of a small set of InTs in the Corpus of London Teenage speech (COLT). Lastly, the semantics and pragmatics of Canadian eh have been characterised by several researchers, notably Avis (1972), Love (1973), Gibson (1977) and Gold (2005).
Each of these studies clearly defines tag uses in its respective variety, but taken together they provide heterogeneous classifications of English InTs. Thus, despite this depth in tag description, it is not feasible to compare the tags across varieties using these results, as the studies have been carried out in single varieties and with varying methodologies and sociolinguistic/pragmatic aims. This study, then, aims to work toward such a comparison using five varieties of spoken English. It focuses on the frequency of InTs in the varieties to gain some indication of usage and preferences regarding tags, in order to shed light on global usage of InTs.
1.2 Research goals
This study aims to describe the relative frequencies of the utterance-final tags in BrE, IndE, NZE, Hong Kong English (HKE) and Singapore English (SinE). It investigates InT selection and usage compared across and within the five varieties.
1.2.1 Variety selection
BrE, NZE, IndE, HKE and SinE were chosen as the varieties for this study due to their diversity in geography, linguistic history and speaker populations. It seemed desirable to limit the comparisons to varieties within the same ‘type’. That is, in dictionaries and particularly in ESL/EFL, North American and British English are commonly the divisions used for items with varietal distinctions, before subdivision into (loosely) national varieties and their dialects (if at all). BrE was chosen as a globally-recognised ‘type’ of English. BrE is also noteworthy for its dialectal diversity, and many of its dialects are represented in the ICE corpus used for this study. Given the incomplete status of the relevant corpora, no American-type varieties were considered. NZE is considered to be of the British ‘type’, but has a much smaller speaker base and range of dialects. Additionally, where ESL/EFL materials are concerned, NZE is comparatively under-described. This is also true of IndE, and as a variety with diverse contact
languages and large migrant communities in English-speaking countries it can provide an insight into English as a lingua franca. Furthermore, the high business profile of India makes IndE a common language in business situations, and therefore worthwhile to define for EAP/Business English purposes as much as for purely linguistic reasons. For similar reasons, two other outer circle varieties were chosen. These were SinE and HKE, which share related native contact languages. HKE is used in a prominent global business centre, while SinE has been the subject of much scholarly research. A comparison of which tags are shared in two Englishes that have close L1 connections allows insight into the variation possible between such varieties. Most importantly for the comparative aspect of the study, each of these varieties was available as an ICE corpus, with similar time periods for collection and near-identical mark-up protocols.
1.3 Invariant tag definition
To determine which items are indeed tags, a variety of definitions from previous canonical and invariant tag studies, such as Holmes (1982), Meyerhoff (1992), Stubbe and Holmes (1995), Berland (1997), and Andersen (1997, 1998, 2001) were considered. The working definition employed for this study was extrapolated from Biber et al.’s (1999) definition of what they term ‘response elicitors’ (RE) in the Longman Grammar of Spoken and Written English (LGSWE). This stated that REs have a “speaker-centred role, seeking a signal that the message has been understood and accepted” (p.1089).2 Yet while this includes gestural responses, only one RE (right) is noted as requiring a verbal response. Indeed, the response-eliciting function of these items is not universally accepted (cf. Holmes 1982, Berland 1997, Andersen 1998). The identification of InTs in this study utilised a slightly broader definition, in that the ‘message’ being signalled was considered to include attitudinal information as well as proposition-checking information. Furthermore, I assumed no response was required, having no visual data to check this. The classification also excluded non-discourse markers and non-InT homonyms such as yeah where it expressed surprise or affirmation, and right where it was confirmation or part of a direction. The key criteria were whether the propositional meaning changed when the item was left out (as for all discourse markers, cf. Schiffrin 1987), and whether the item could function with similar (though not identical) uses as a canonical tag. The tags in each variety were analysed individually, and only those items which fulfilled the criteria above were included for this study (viz. the exclusion of isn’t it?/is it? when in canonical use with grammatical agreement, and no in varieties that did not have the InT function). The definitions employed were corroborated by each stage of analysis. Finally, the frequency analysis deals with only utterance-final InTs. 
Non-clausal, utterance-initial and utterance-medial InTs in BrE, IndE and NZE are described along with the utterance-final tags in terms of frequency and meanings/functions in Columbus (in revision) and with respect to the most common meanings in Columbus (forthcoming).
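The restriction to utterance-final position can be pictured as a simple positional filter. The following is a minimal sketch, not the procedure used in the study: it assumes that utterances are already available as plain strings with mark-up removed, and the candidate list, function names and example utterances are all invented for illustration. A form-based filter of this kind cannot distinguish InT uses from homonyms (e.g. yeah as affirmation), so each hit would still need manual checking against the definition above.

```python
import re
from collections import Counter

# Illustrative subset of candidate forms; not the full set used in the study.
CANDIDATES = ("eh", "right", "yeah", "no")


def final_tag(utterance, candidates=CANDIDATES):
    """Return the candidate that closes the utterance, or None.

    An item counts as utterance-final only if at least one other word
    precedes it, so stand-alone (non-clausal) items are left out.
    """
    tokens = re.findall(r"[a-z']+", utterance.lower())
    # Try longer candidates first so multi-word items are not shadowed.
    for cand in sorted(candidates, key=lambda c: len(c.split()), reverse=True):
        parts = cand.split()
        if len(tokens) > len(parts) and tokens[-len(parts):] == parts:
            return cand
    return None


def tally_final_tags(utterances_by_variety):
    """Count utterance-final candidates separately for each variety."""
    tallies = {}
    for variety, utterances in utterances_by_variety.items():
        hits = (final_tag(u) for u in utterances)
        tallies[variety] = Counter(h for h in hits if h is not None)
    return tallies


# Invented utterances, not corpus data.
sample = {
    "variety A": ["It's cold today, eh?", "Eh?", "Right, so we turn left here."],
    "variety B": ["You're coming tomorrow, right?", "That was good, eh"],
}
print(tally_final_tags(sample))
```

Keeping one tally per variety mirrors the cross-varietal frequency comparison the study is working toward; raw counts of this kind are only comparable when the samples are of similar size.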
2. Methods
The study was conducted using the International Corpus of English corpora for British English (ICE-GB, Survey of English Usage, University College London, 1998), Indian English (ICE-IND, Shivaji University, Kolhapur, and the Freie Universität Berlin, 2002), New Zealand English (ICE-NZ, School of Linguistics and Applied Language Studies, Victoria University of Wellington, 1999), Hong Kong English (ICE-HK, Hong Kong Polytechnic University, The University of Hong Kong and The Chinese University of Hong Kong, 2006) and Singapore English (ICE-SIN, The Department of English, The National University of Singapore, 2002). Each corpus was restricted to its Private Conversation texts (S1A-001 to S1A-100), about 200,000 words per corpus. They were analysed based on the transcriptions only, due to the lack of sound files at the time the study was undertaken. The advantage of using these corpora was that they had the same text categories and almost identical mark-up conventions. They were also collected during the same time-period, making their content highly comparable. However, there were some differences in the level of notation seen, because the ICE-GB text was imported into Wordsmith 4 from its custom-made mining program ICE-CUP, which altered the visible mark-up.3 The search itself involved narrowing down a set of discourse markers to a set of potential InTs using the (‘discourse marker’) tag in ICE-GB’s corpus tool, ICE-CUP (Survey of English Usage, University College London, 1998). Additionally, potential discourse markers in ten randomly selected files were analysed manually from ICE-HK, ICE-SIN, ICE-NZ and ICE-IND; these corpora were available in marked-up text but without the discourse marker tagging. From the original search set of approximately fifty potential InTs, seventeen items which all appeared with InT functions in the utterance-final position were selected for this study. 
These were accha, ah, ahn, eh, is it, isn’t it, lah/la, na, no, OK/okay, right, see, wah, yeah, yes, you know, and you see.
2.1 A note on the inclusion of lah/la as an InT
Before describing the search methodology, a brief comment on tag selection is necessary. It may not be obvious why the particle la/lah has been classified as a non-canonical tag in this study. Certainly, much has been written on la/lah over the past thirty years, beginning with Richards and Tay (1977) and Kwan-Terry (1978). To understand why la/lah can be a tag we need to return to our basic definitions of invariant tags and tag questions. Invariant question tags are considered to be like the fixed forms of canonical question tags, with similar functions. However, research on both of these discourse marker types has not shown the question function to be the primary use and/or meaning of the tag. Uses such as emphasis, softening and irony/sarcasm are prevalent in the literature (cf. Holmes 1982, Berland 1997, Norrick 1995 inter alia). Similarly, a not insignificant number of the items considered to be invariant tags or response elicitors in the varying descriptions are interjection-based discourse markers of
some form, such as hey and eh, a descriptor which is also used for la/lah (Lee 2004). Bell and Peng Quee Ser (1983) describe la/lah as a marker of emphasis or contrast, “drawing attention to the literal meaning – the semantic sense, overt and explicit – of an utterance or part of one” (p. 13). Likewise, Kwan-Terry (1978) discusses the marker’s use for persuasion, approval, as a softener or for authority, and for positive and negative humour, as well as for uncertainty and suggestions. If we take these definitions as guides, then the classification of la/lah as an InT is not unjustified: it fits with the prior classifications of other InTs and follows from traditional descriptions of the marker(s). It should be noted again, however, that the classification of an item as an InT in this study took general descriptions and definitions from previous studies into account as background information only. The only criterion used in the analysis and classification process was the definition provided in 1.3.
2.2 Search technique
The initial search was conducted in Wordsmith 4, utilising the tag symbols that mark the start of an utterance in the ICE mark-up. This was the reason for the elimination of non-final InTs in this study. Some of the tagging was not present in the imported version of the ICE-GB concordances, which means that the results for the BrE data may be under-reported. With the concordances in the Wordsmith tool, the search items were then entered followed by the start-of-utterance mark-up symbol. This query returned the start of each utterance, allowing easy visual inspection of the utterance-final instances of the seventeen InTs named above. Table 1 gives examples of InT concordances in each of the five varieties. To ensure that other elements which share the initial tag symbol, such as marked-up pauses and anthropophonics, were not falsely included as ‘utterance-final’, each concordance line was manually checked to eliminate the non-final occurrences. A tally for the InTs in each variety was performed using simple Excel functions.
Table 1: Examples of concordances for BrE, IndE, NZE, HKE and SinE InTs
BrE:
  C: She looks she looks Puerto Rican or something is it
  B: There was this bloke in the in the cafe in Cambridge called the Steps really weird OK
  C: I wrote I turned up the first night right
  C: But then I ‘ve had this about twenty years with the same thing on see
HKE:
  <S1A-006#268:1> A: Oh it ‘s exhausting you know
  Z: Who is twenty-five when he got married A: Twenty-one la
  B: Pronunciation you know not in English I think isn’t it
  A: It’s tape recording conversations okay
  A: Yeah yeah uh may be you will say uh you two always have argue why still can last for four years right
SinE:
  <S1A-099#40:1> A: Ya but other than uhm workwise I guess like I manage to buckle down lah
  B: You read for yourself is it
  A: So when the second application came out I applied again and then notwithstanding the fact that they told us that those who have been rejected or have been offered a place don’t have to apply again they will not consider us you see
IndE:
  B: ...everytime the team keeps losing I mean something should be done isn’t it
  C: But again that caste certificate problem has arrived na
  B: She is very she is very bold no
NZE:
  <S1A-072#400 > P: He’s pretty intelligent eh
  A: It’s like you see things T: True eh
  A: Yeah and all from your sitting room window yeah
  N: They just go though you know
  N: They’re only going through a process eh
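The search-and-tally procedure described in this section can be illustrated with a short Python sketch. It is not the Wordsmith query or the Excel tally actually used in the study. It assumes a simplified transcription in which every utterance opens with the marker `<#>`, and the names `UTTERANCE_START`, `INVARIANT_TAGS` and `tally_final_tags` are hypothetical. As in the study, every hit would still need to be checked manually in context.

```python
import re
from collections import Counter

# Assumption: every utterance opens with "<#>", a simplified stand-in for
# the start-of-utterance symbol in the ICE mark-up.
UTTERANCE_START = "<#>"

# The seventeen items examined in the study (spelling variants listed separately).
INVARIANT_TAGS = [
    "accha", "ah", "ahn", "eh", "is it", "isn't it", "lah", "la", "na", "no",
    "OK", "okay", "right", "see", "wah", "yeah", "yes", "you know", "you see",
]


def tally_final_tags(text):
    """Count candidate utterance-final tags; hits still need manual checking."""
    counts = Counter()
    # Try longer items first so that "you see" is not counted as "see".
    ordered = sorted(INVARIANT_TAGS, key=len, reverse=True)
    for utterance in text.split(UTTERANCE_START):
        words = utterance.strip().lower()
        if not words:
            continue
        for tag in ordered:
            # The tag must be a whole word sequence at the end of the utterance.
            if re.search(r"(?:^|\s)" + re.escape(tag.lower()) + r"$", words):
                counts[tag] += 1
                break
    return counts


sample = (
    "<#> He's pretty intelligent eh "
    "<#> They just go though you know "
    "<#> You read for yourself is it "
    "<#> I wrote I turned up the first night right "
    "<#> It's like you see things"
)
print(tally_final_tags(sample))
```

The last utterance in the sample contains you see in non-final position and is therefore not counted, which mirrors the restriction to utterance-final occurrences.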
For the classification of discourse markers as InTs, context was relied on for clarification. This was due to the written nature of the transcriptions (searchable sound files being unavailable) and the lack of intonational mark-up. To a limited extent, mark-up of punctuation was used where possible to determine the utterance position and classification of an item. A question mark offered a clear indicator of question intonation in the file, but was not used in the BrE, HKE, SinE and IndE data, and thus may have led to under-reporting of question uses. Again, full context was used to clarify the utterance’s intent. Interruption data, where the marker occurred at the end of an utterance which overlapped with another speaker’s, could not necessarily be counted as being utterance-final in intention, and thus was only included where this intention was clear. For
example, items were included when other mark-up indicated an utterance break, such as when pauses were indicated, or when new utterances began which were also included in the overlap. Finally, the rechecking involved in the comparison stage of the analysis offered a chance to confirm and/or adjust previous categorisations and identifications. In general, the classifications were highly reliable across the varieties and tags. It should be noted here that, as with many discourse marker studies, a certain amount of subjective analysis is necessary in determining which items to include for analysis as InTs. It is understood that this does not allow for complete confidence in the results given below, but it is a fact reluctantly accepted as necessary for this type of study (cf. Berland’s lament for the same, 1997).
3. Results
The raw occurrences of the seventeen items in NZE, BrE, HKE, SinE and IndE are given in Figure 1 and Table 2. The only items to reach over fifty occurrences in the utterance-final position in any of the 200,000-word corpora were eh, yeah, la, right, you see, no, na and you know. Of the seventeen items which occurred utterance-finally, only four occurred in all five varieties: okay/OK, right, you know and you see. Of these, right and, to a lesser extent, you know, have the highest frequencies. Another major point to be drawn from the results in Figure 1 is that the total number of utterance-final InTs differs considerably across these varieties: BrE has 268 and HKE has 288, while NZE has almost fifty percent more at 386. IndE, at 696, has almost twice as many as NZE, and more than NZE combined with either BrE or HKE. SinE, however, has the highest usage of InTs, with 776.
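The comparisons of totals in this paragraph are simple arithmetic, and a minimal Python sketch can make them explicit. The totals are those reported in Table 2 and the sample size is the 200,000 words stated in the method section; the names `TOTALS`, `CORPUS_WORDS`, `per_thousand_words` and `ratio` are hypothetical and introduced only for illustration. Because all five samples are the same size, the normalised rates preserve the ranking of the raw counts.

```python
# Totals of utterance-final InTs as reported in Table 2.
TOTALS = {"BrE": 268, "HKE": 288, "NZE": 386, "IndE": 696, "SinE": 776}

# Size of each Private Conversation sample, as stated in the method section.
CORPUS_WORDS = 200_000


def per_thousand_words(raw_count, corpus_words=CORPUS_WORDS):
    """Normalise a raw count to occurrences per 1,000 words."""
    return raw_count / corpus_words * 1000


def ratio(variety_a, variety_b):
    """Return how many times more InTs variety_a has than variety_b."""
    return TOTALS[variety_a] / TOTALS[variety_b]


for variety, total in TOTALS.items():
    rate = per_thousand_words(total)
    print(f"{variety}: {total} raw, {rate:.2f} per 1,000 words")

print(f"NZE / BrE:  {ratio('NZE', 'BrE'):.2f}")
print(f"IndE / NZE: {ratio('IndE', 'NZE'):.2f}")
print(f"SinE / HKE: {ratio('SinE', 'HKE'):.2f}")
```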
Figure 1: Frequencies for InTs in the five varieties with a threshold of 20 raw occurrences
Table 2: Raw frequencies of 17 utterance-final InTs. Shaded cells indicate the InTs found in all varieties. Bold numbers indicate the most frequent tag
Tag: accha, ah, ahn, eh, is it, isn't it, lah/la, na, no, OK/okay, right, see, wah, yeah, yes, you know, you see, Total
BrE 6 1 1 0 7 8 2 34 7 171 31 268
IE 2 18 10 0 12 33 109 237 12 12 2 60 4 158 27 696
NZE 292 0 0 1 7 11 2 35 2 18 18 386
SinE 5 47 14 241 1 14 236 0 7 0 0 110 101 776
HKE 1 25 4 14 5 24 110 0 5 24 5 70 6 288
We now turn to the more detailed comparisons given in Figures 2 and 3. Figure 2 shows only the frequencies for the seven InTs which are shared in BrE, NZE and IndE. We see here that IndE has a high raw frequency of the InT no, and also makes high use of you know. BrE also has frequent use of you know, while in NZE none of these shared InTs is preferred. Instead, NZE has a high use of eh, as seen in Figure 1, and comparable use of yeah to BrE. In Figure 3, we see the full range of InTs in HKE and SinE. The results for these varieties’ InT frequencies show more dissimilarities than likenesses. The two varieties do not show the similar patterns that might be expected given their similar language contact situations. Nor is there a pattern which is similar to the seemingly BrE-influenced IndE, the other outer circle variety. Instead, HKE and SinE have raw frequency patterns which are distinct from each other and from the three other varieties investigated. The results in Figure 3 and Table 2 show that the two varieties do not share frequencies of usage in the utterance-final position, despite having similar contact languages. Most obvious is the wide gap between HKE and SinE in the total number of utterance-final InTs, with SinE having almost two and a half times as many InTs as HKE. Indeed, the only points of similarity between the two varieties are the relatively comparable numbers for wah (7 for SinE and 5 for HKE) and the absence of see in the utterance-final position. The two varieties also both use is it, okay/OK, and you know (with approximately 50-60% fewer uses of is it and you know in HKE than in SinE, but more HKE occurrences of okay/OK). Most notable, however, are the three clear preferences for InTs in SinE: you see, right, and la (see note 4).
Figure 2: Raw frequencies of the seven shared InTs between BrE, NZE and IndE
Figure 3: Raw tag frequencies in SinE and HKE
3.1 Discussion
Several points emerge from the results given above. Firstly, the low number of tags in ICE-GB and ICE-HK may suggest that BrE and HKE do not use tags to a high degree. However, BrE’s tallies for is it? and isn’t it? as canonical (that is, variant) tags were 215 and 156 respectively, with only one example each in noncanonical, invariant use, which suggests that canonical tags are regularly in use in BrE, perhaps more so than the invariant type. Similarly, it is not possible to search for known InTs such as innit (e.g. Berland 1997) in BrE, as it is normalised to isn’t it in the ICE-CUP (and thus the exported ICE-GB) corpus content. This suggests that a higher number of InTs ought to exist in BrE, but that they have been obscured in the corpus by the normalisation process. The implications of such normalisations are discussed further in Columbus (in revision). Another complicating factor may have been the difference in visible mark-up in the ICE-GB transcriptions via Wordsmith. The reason for the comparative lack of InTs in HKE, however, is less clear. While this may also be due to higher canonical question tag use, there is a relatively high number of invariant uses of is it? and isn’t it? More research into canonical question tag use in HKE may clarify the matter. Secondly, there is a strong resemblance between the raw frequencies for IndE and BrE. The group of seven items shared by BrE, NZE and IndE in Figure 2 (OK/okay, right, see, yeah, yes, you know and you see) have very similar rates of occurrence. In particular, the pattern of frequent usage is the same, but IndE has extended the pattern to include indigenous InTs, such as ah, ahn, and accha. This extension from the (likely) BrE base contrasts with the NZE pattern. NZE appears instead to have taken the base set from BrE and changed both the relative frequencies and the preferred items. Where IndE uses no and shares high use of you know with BrE, NZE has relatively little use of the InTs in the set but for eh. A search for potential indigenous NZE InTs (such as Maori kao and ae) also revealed no non-English-based tags in use. With respect to the two English varieties with related contact languages, SinE and HKE, the results above show that there is no apparent similarity between HKE and SinE in InT usage, with the possible exception of the non-use of see. However, while HKE has right as a weakly preferred marker, SinE has a strong preference for la as a tag. Right and you know are more often used in HKE, but given the low occurrence of utterance-final tags in HKE overall, these form a high proportion of the InTs used.
4. General discussion and conclusions
As the results in Figure 1 and Table 2 show, the InT patterns for BrE, IndE, NZE, HKE and SinE are unique to each variety, with the exception of IndE’s extension of the BrE pattern. NZE shares little frequency of InT usage with the other varieties, save the use of yeah in BrE, IndE and HKE, and perhaps you see with
BrE and IndE. For the most part, NZE speakers prefer eh over other InTs. SinE also has one preferred marker, la. The variety has a second preferred item, right, which is used to a lesser degree in HKE (though as HKE speakers’ preferred marker), but rarely in BrE, IndE and NZE. Perhaps surprisingly, only one English variety of the five investigated here shows a clear relationship to another. IndE appears to have taken the base set of InTs from BrE and built upon it. Even the number of new, indigenous-based items (four) is higher than in the other varieties in the private spoken ICE corpora (none in NZE or BrE, two in both HKE and SinE). There are otherwise few InTs which are common across the varieties in terms of rates of usage (at least for the corpus time period of circa 1990-1999). Such a distribution pattern is not without implications; it is to these we now turn.
4.1 Further implications
The relative dissimilarity in the selection and use of InTs in these varieties in the utterance-final position implies that the use of InTs is not comparable across British-type Englishes. This is clear in the relative frequencies of the items and in the preferred tags, or lack thereof, in each variety. Such a lack of similarity in attitudinal nuance could be problematic for global English use; varietal differences at the level above propositional understanding could cause problems for intercultural and global communication. This in turn has implications for pedagogy and materials for ESL/EFL and English for Specific/Business Purposes (ESP): Global English as a lingua franca for both interpersonal and international business needs relies on mutual intelligibility. An awareness of these subtle differences in attitudinal and evidential meaning seems necessary at the varietal level. From an ESL/EFL perspective, these differences are at least as unevenly distributed as accent and vocabulary, with differences in meaning across the English-speaking world. ESP syllabi thus need to go beyond the current focus on polarity and general meaning in canonical tags, and consider the role of invariant tags in conversation when designing curricula and materials. Finally, this study set out to compare the use of InTs in five varieties of English. The variance in use and subtle meanings of a single discourse marker group such as InTs may suggest that a global language cannot in fact guarantee global communication. These differences in frequency may prove challenging for speakers unfamiliar with the variety; however, the results also show similarity in the set, as four items are still shared across the five Englishes. This augurs well for other Englishes, and suggests that with a raised level of awareness, the attitudinal level of tag usage will not be lost in international communication.
Notes
1 I would like to thank John Newman for his comments and suggestions on previous drafts of this paper, as well as the original study which this paper extends. Additionally, I would like to thank two anonymous reviewers for their helpful comments, as well as participants at ICAME 28, in particular Sebastian Hoffmann and Andrea Sand, for their insight and comments on this presentation. All errors, of course, remain my own. Some of the frequency results in this paper relating to the BrE, NZE and IndE study have been submitted for publication (Columbus, in revision).
2 I assume here that “message” means ‘proposition’.
3 While all ICE corpora have the same mark-up options, it is up to individual project teams to determine the completed format. Thus differences exist in the detail of mark-up tags used by each variety and the layout of the corpus and mark-up in its final form.
4 The spelling of la/lah in ICE-SIN is restricted to la; without the intonation information and pronunciation of the tag it is not possible to determine if this is one marker or a combination of the la and lah variants noted by Kwan-Terry (1978) and Bell and Peng Quee Ser (1983). Hence, they are treated together in this analysis.
References
Algeo, J. (1988), The tag question in British English: it’s different, i’n’it? English Worldwide, 9, (2), 171-191.
Andersen, G. (1997), “I goes you hang it up in your shower innit? He goes yeah.” The use and development of invariant tags and follow-ups in London teenage speech. Paper presented at the 1st UK Language Variation Workshop, Reading, United Kingdom.
Andersen, G. (1998), Are tag questions questions? Evidence from spoken data. Paper presented at the 19th ICAME Conference, Belfast, United Kingdom.
Andersen, G. (2001), Pragmatic markers and sociolinguistic variation. Amsterdam/Philadelphia: John Benjamins.
Avis, W. (1972), So eh? Is Canadian, eh? Canadian Journal of Linguistics, 17, 89-105.
Bell, R. and L. Peng Quee Ser (1983), “‘Today la?’ ‘Tomorrow lah!’; the LA particle in Singapore English”. RELC Journal, 14, (2), 1-18.
Berland, U. (1997), “Invariant tags: pragmatic functions of innit, okay, right and yeah in London teenage conversations”. Unpublished master’s thesis, University of Bergen, Norway.
Biber, D., S. Johansson, G. Leech, S. Conrad, and E. Finegan (1999), Longman Grammar of Spoken and Written English. Harlow: Longman.
Columbus, G. (in revision), A comparative analysis of invariant tags in three varieties of English. English Worldwide.
Columbus, G. (forthcoming), “Ah lovely stuff, eh?” On invariant tag meanings and usage across three varieties of English, in: S. Gries, S. Wulff and M. Davies (eds.) Corpus linguistic applications: current studies, new directions. Amsterdam: Rodopi.
The Department of English, The National University of Singapore (2002), The ICE-SIN Corpus.
Gibson, D. (1977), Eight types of ‘eh’. Sociolinguistics Newsletter 8 (1), 30-31.
Gold, E. (2005), Canadian Eh?: A survey of contemporary use, in: M. Junker, M. McGinnis and Y. Roberge (eds.), Proceedings of the 2004 Canadian Linguistics Association Annual Conference. Retrieved November 19, 2006 from: http://http-server.carleton.ca/~mojunker/ACL-CLA.
Holmes, J. (1982), The functions of tag questions. English Language Research Journal, 3, 40-65.
Hong Kong Polytechnic University, The University of Hong Kong and The Chinese University of Hong Kong (2006), The ICE-HK Corpus.
Kwan-Terry, A. (1978), The meaning and source of the “la” and the “what” particles in Singapore English. RELC Journal, 9, (2), 22-36.
Lee, J. (2004), A Dictionary of Singlish and Singapore English. Retrieved September 7, 2007 from: http://home.pacific.net.sg/~willows5/singlish_L.htm
Love, T. (1973), “An examination of eh as question particle.” Honours thesis, University of Alberta.
Meyerhoff, M. (1992), ‘We’ve all got to go one day, eh?’: Powerlessness and solidarity in the functions of a New Zealand tag, in: K. Hall, M. Bucholtz and B. Moonwomon, (eds.) Locating power: Proceedings of the Second Annual Berkeley Women and Language Conference. Berkeley, California: Berkeley Women and Language Group, 409-419.
Norrick, N.R. (1995), Hunh-tags and evidentiality in conversation. Journal of Pragmatics, 23, 687-692.
Richards, J.C. and M.W.J. Tay (1977), The La particle in Singapore English, in: W. Crewe (ed.) The English language in Singapore, 141-155. Singapore: Eastern Universities Press.
Schiffrin, D. (1987), Discourse markers. Cambridge: Cambridge University Press.
School of Linguistics and Applied Language Studies, Victoria University of Wellington (1999), The ICE-NZ Corpus.
Shivaji University, Kolhapur, and the Freie Universität Berlin (2002), The ICE-IND Corpus.
Starks, D., L. Thompson and J. Christie (2008), Whose discourse particles? New Zealand eh in the Niuean migrant community. Journal of Pragmatics 40 (7), 1279-1295.
Stubbe, M. and J. Holmes (1995), You know, eh and other exasperating ‘expressions’: an analysis of social and stylistic variation in the use of pragmatic devices in a sample of New Zealand English. Language and Communication, 15, 63-88.
Stubbe, M. (1999), Research report: Maori and Pakeha uses of selected devices. Te reo, 42, 39-53.
Survey of English Usage, University College London (1998), The ICE Corpus Utility Program (ICECUP 3.1).
Survey of English Usage, University College London (1998), The ICE-GB Corpus.
Discourse presentation in EFL textbooks: a BNC-based study
Christoph Rühlemann
Ludwig-Maximilians-Universität, Munich
Abstract
Following corpus-linguistic research which has shown the representation of certain lexico-grammatical features in EFL textbooks to be at variance with their use in native English, this paper aims to explore the match or mismatch of discourse presentation (often referred to as ‘speech reporting’) in conversation and its representation in EFL textbooks. The analysis of selected textbooks shows that textbook representation is overwhelmingly concerned with indirect and, to a much lesser extent, narratised mode but not direct mode, the free categories and representation of voice. Further, textbooks promote quotatives typical of written registers but not informal everyday speech. Specifically, I show that discourse presentation in EFL textbooks features essential parallels with a written register, namely journalistic writing. The concluding section considers implications for EFL teaching.
1. Introduction
In recent years, an impressive body of applied corpus linguistic research has been accumulated, pointing out gaps between school English and native spoken English as recorded in corpora. The comparative analyses so far have focused on features of lexico-grammar. The features whose treatment in textbooks has been found to be at variance with their use in actual discourse include (i) modal verbs such as can, will, must, may, shall, (ii) conditional clauses and (iii) future time orientation through will and going (Mindt 1996); (i) any, (ii) will and would, and (iii) irregular verbs (Mindt 1997); the linking adverbial though (Conrad 2004); and progressives (Römer 2005). This paper attempts to demonstrate that one crucial discourse area in which the gap is particularly wide is discourse presentation, often also referred to as ‘speech reporting’ (see note 1). Given that, for reasons of applied linguistic grading and simplification, school English will, to some extent, always be at variance with naturally-occurring English, a crucial question to be addressed is whether, in dealing with discourse presentation, we are dealing with some remote or otherwise negligible aspect of conversational behaviour that school English need not be modelled on in great detail or whether it constitutes something more important in the conversational arena which school English should take great care to represent to the best of its abilities. There is evidence to suggest that discourse presentation is indeed central to conversation. An initial indication is the fact that the verb SAY is among the most frequent words in various spoken corpora. According to Kilgarriff’s (1998)
frequency list, said – by far the most frequent form of the lemma SAY – is ranked 42nd in the conversational subcorpus of the British National Corpus (BNC), representing the second most frequent content word (only the content word know is more frequent). Said is ranked similarly highly in the Cambridge and Nottingham Corpus of Discourse in English (McCarthy 1998: 122 f.). Considering that the form know is overwhelmingly used as part of discourse markers such as you know and I don’t know, said might well be ranked first in the list of lexical words in the conversational subcorpus of the BNC – indeed, in the Longman Spoken and Written English Corpus, SAY turned out to be “the single most common lexical verb” (Biber et al. 1999: 373). Thus, the prominent frequency of SAY suggests that sharing with others what was said in anterior situations is fundamental to conversation. Why is this so? The answer becomes obvious when we consider what discourse presentation is used for in conversation: it is an essential ingredient of narrative (cf. Schiffrin 1981: 58). Narrative is seen in Tannen (1986; 1988) as ‘drama’, creating interpersonal involvement and rapport. In her view, discourse presentation (her term being ‘constructed dialogue’) “is a means by which experience surpasses story to become drama” (Tannen 1986: 312). Thus, discourse presentation, as a building block of ‘narrative as drama’, is frequent in, and central to, conversation because it makes a decisive contribution to a fundamental function of language use – what Malinowski identified as ‘phatic communion’: discourse presentation is a means “to establish bonds of personal union” (1923: 480). In sum, discourse presentation is an important component of conversation both in terms of frequency and in terms of its interpersonal function. It is therefore consistent to expect that discourse presentation be covered in very good detail in EFL teaching. 
This paper will demonstrate that, in actual fact, very much ‘good detail’ is still missing from the discourse presentation as represented in most EFL textbooks. The paper is divided into three main parts. The first part summarises research on two major aspects of how native speakers go about presenting discourse, viz. reporting mode(s) and quotative verbs. The second, major, part looks at how discourse presentation is represented in seven internationally marketed EFL textbooks; here, too, the focus will be on reporting modes and quotatives. The analyses of how EFL quotatives distribute across major English registers will be based on the British National Corpus (BNC) (XML Edition). The concluding part briefly juxtaposes the results of the two analyses and outlines what seems to me the main implication of the stark contrast between ‘real’ and ‘school’ discourse presentation: the need to rethink the role of Standard English in the EFL classroom.
2. Discourse presentation in conversation
In this section I briefly summarise sociolinguistic and corpus-linguistic findings related to two central aspects of conversational discourse presentation: reporting mode and quotative verbs. The section starts with a look into the reporting mode which is typically used in conversation.
2.1 Reporting mode in conversation
Broadly, four types of reporting mode can be distinguished. With reference to the examples listed below, discourse presentation can be direct as in (1), indirect as in (3), narratised as in (4), to use McCarthy’s (1998) term (a more convenient label than the corresponding ‘narrator’s representation of speech act’ (NRSA) category of McIntyre et al. 2004), and what McIntyre et al. (2004) refer to as ‘representation of voice’ (RV) as in (5). This latter category “captures minimal references to speech with no indication of the illocutionary force, let alone the propositional content or form of the utterance (part)” (McIntyre et al. 2004: 62). Subtypes of direct and indirect mode are free direct (or ‘zero quotative’) and free indirect mode, that is, presentation without a reporting clause (cf. McIntyre et al. 2004: 64). (2) exemplifies free direct mode:
(1) direct: And then he said here’s the hymns, put those hymns up now. (BNC: KBO 3461)
(2) free direct: [Speaker is reporting how someone asked him/her for change for a fiver]. I said no! [ ... ] only. So ... well can you lend me a pound? I said no! (BNC: KD5 7945)
(3) indirect: Well I phoned Shirley ... and she said she’s fine. (BNC: KB8 3541)
(4) narratised: So we asked for twenty thousand pound upfront. (BNC: KB9 3284)
(5) voice: I was sitting there talking and they had a drop, drop of wine (BNC: KC2 1222)
Structurally, direct and indirect mode, on the one hand, and narratised mode and representation of voice, on the other, are neatly distinguished by the fact that the former typically have two clauses – a reporting clause containing the quotative verb and a reported clause containing the discourse reported – while the latter have only one clause (Semino and Short 2004: 11). Functionally, a fundamental difference between the direct modes and all other modes lies in the speaker perspective (Coulmas 1986: 2): while direct mode is characterized by the presenting speaker switching, as it were, into the non-present speaker’s deictic
system whose discourse is being presented, thus adopting his/her deictic perspective, indirect and narratised modes as well as RV mode presentations relate the (usually anterior) speech event from the presenter’s own deictic perspective. To further understand how the presentation modes are functionally distinguished it is helpful to bear in mind that discourse presentation involves an intertwining of two discourse situations – the current situation where the presentation is being made and the anterior situation where the language presented was originally produced (Short et al. 1996: 114). That is, discourse presentation is a type of mediation between a here-and-now speech situation and a there-and-then speech situation. In mediating between the two, speakers can make the anterior speech situation more or less immediate in the present speech situation depending on their choice of presentation mode: the degrees of immediacy continuously decrease from (free) indirect to narratised mode to RV (cf. Leech and Short 1981; Semino and Short 2004), whereas (free) direct mode serves to re-construct the anterior speech situation with the highest degree of immediacy because, due to the presenter’s switch into the presentee’s deictic perspective (and, additionally, due to imitation of voice-related characteristics such as prosody or voice quality), the presented discourse is uttered as if the speaker whose discourse is being presented were present in the current speech situation. Which is the preferred mode in conversation? There is agreement that discourse presentation in conversation is overwhelmingly in direct mode (e.g., McCarthy 1998: 161; McIntyre et al. 2004: 69). In Halliday and Matthiessen (2004: 444), direct mode presentation (their term being ‘paratactic projection’) accounted for roughly 75 per cent (indirect presentation, or ‘hypotactic projection’, accounted for 25 per cent). 
In a close analysis of a sample of 300 occurrences of said, the most frequent form of the lemma SAY (see section 2.2), which is, in turn, commonly seen as the most frequent quotative verb, said turned out to introduce direct mode presentation in 215 occurrences, representing 72 per cent (Rühlemann 2007: 124). GO and BE like even invariably launch direct mode presentations (e.g., Butters 1980: 305; Schourup 1982: 148), and even THINK, although less clearly, seems to display a preference for direct mode (Rühlemann 2007: Chapter 6; but see McIntyre et al. 2004 who found that THINK introduced mainly indirect mode presentations). The preferred choice of mode in conversation is, thus, the direct mode. In terms of discourse presentation as a building block of ‘narrative as drama’, it will be obvious that direct mode is the most ‘dramatic’. While in indirect and narratised mode “speakers use themselves as the spatiotemporal point of reference” (Romaine and Lange 1991: 229), speakers using direct mode slip out of their deictic system and into that of a displaced speaker’s. In so doing, they effectively lend their voice to somebody not present and, thus, act like an actor on a stage, uttering words which are not their own. Direct mode is also more dramatic than indirect (and narratised mode) because one problem posed by indirect mode is “how to capture the emotive affective aspects of speech. Insofar as these are expressed not in the content, but in the form of the message, they are
not preserved in indirect reporting” (Romaine and Lange 1991: 240). That is, it is only in direct mode presentation that the expressive potential of the human voice can be exploited. Again, it makes sense to interpret this association of conversational discourse presentation with direct mode as a dramatic device which helps the narrator achieve his/her basic aim: to bring the narration as close to the interlocutors as possible and, thus, engage them affectively.
2.2 Quotative verbs in conversation
Which quotatives are most frequent in conversation? It appears that, in conversation, a small set of verbs dominates the quotative system. According to Tagliamonte and Hudson, “[t]he complete inventory of quotatives used to introduce constructed dialogue in British and Canadian English comprise four major verbs, say, go, think, be like and zero” (1999: 155). For an identical set of quotative verbs used in American English see Buchstaller (2002; cf. also Tannen 1986); a similar top five list was observed in Macaulay (2001) for Scottish English. In conversational language use, then, the most frequent quotatives are to a large extent shared across regional varieties of English. The four quotative verbs are briefly characterised in the following.
SAY: It is uncontroversial to view SAY as the ‘default verb’ in conversational discourse presentation both in North-American and British English (e.g., Romaine and Lange 1991: 242; Ferrara and Bell 1995; Buchstaller 2002: 14). In Tagliamonte and Hudson’s (1999: 158) corpus of tape-recorded narratives of personal experience, SAY was the most frequent quotative – 31 per cent in British English and 36 per cent in Canadian English. However, there is evidence that the dominance of SAY is being challenged, particularly because of the influence of the new quotatives BE like and GO. There is good evidence for such waning dominance in the usage of adolescent speakers: SAY was observed to trail far behind BE like in narratives told by Canadian youths (Tagliamonte and D’Arcy 2004), while GO was used more frequently than SAY by adolescents in Glasgow (Macaulay 2001: 10) and London (Stenström et al. 2002).
THINK: Another traditional quotative is THINK. While the present tense form think, particularly when associated with the subject I, is mostly used as a discourse marker (cf. Carter et al. 2000), the past tense form thought seems to be used frequently to introduce discourse presentation. In a sample of 300 randomly selected occurrences, thought acted as a reporting verb in more than half of all occurrences (Rühlemann 2007: 138). In the sample, quotative thought mostly introduced direct presentations. Note, however, that use of quotative THINK dramatically decreases in adolescent speech: in Macaulay (2001) and Tagliamonte and D’Arcy (2004), for example, this quotative accounted for a mere two per cent.
BE like: There is evidence that BE like has gained a notable frequency in U.S. American English – the variety in which it is commonly assumed to have originated (e.g., Fairon and Singler 2006) – and that, as noted above, BE like has made major inroads into Canadian English. Tagliamonte and D’Arcy (2004) note a dramatic increase in the use of BE like compared to an earlier study (Tagliamonte and Hudson 1999): in Tagliamonte and D’Arcy’s corpus, BE like turned out to be by far the most frequent quotative (accounting for 58 per cent of all quotatives), while SAY, GO, and THINK were observed to decline in frequency (Tagliamonte and D’Arcy 2004: 501) (for other varieties in which BE like has been attested see Buchstaller 2008 and references therein). The status of BE like in British English, by contrast, is as yet relatively uncertain. In research carried out on British speech data from the early 1990s, BE like’s frequency was low (cf. Miller and Weinert 1998; Andersen 2001; Rühlemann 2007). However, BE like in British English may well be spreading (e.g., Romaine and Lange 1991; Ferrara and Bell 1995; Andersen 2001; Buchstaller 2002). Strong evidence of this comes from Tagliamonte and Hudson (1999): in their corpus of narratives told by university students in England in 1996, quotative BE like, THINK, and quotative GO were equally represented (18 per cent).
GO: Unlike BE like, whose frequency in current British English is as yet somewhat unclear, there is evidence that quotative GO is very frequent. Biber et al. (1999: 1119) found that quotative use of the third person singular present tense form goes is particularly frequent (for supportive evidence see Stenström et al. 2002). Observations made on non-computerized collections of personal experience narratives also suggest that quotative GO is recurrent in British English: in Macaulay (2001: 10) and Stenström et al. (2002), GO had a higher frequency than SAY in Scottish and London youth respectively, and in Tagliamonte and Hudson (1999: 158) GO was as frequent as THINK and BE like in British youth. High frequencies of quotative GO were also reported for Canadian English (Tagliamonte and Hudson 1999) and U.S. American English (Tannen 1986; Blyth et al. 1990; Ferrara and Bell 1995: 274).
Finally, it should also briefly be noted that the four quotatives fulfil different functions in discourse. While SAY and THINK are relatively straightforward, introducing mainly speech and thought presentations respectively, BE like and GO act as “‘anything-goes’-items” (Buchstaller 2002: 10). That is, they are able to introduce a broad range of different types of content of the quote: both BE like and GO have been observed to introduce not only speech and thought, but also gesture (Butters 1980: 305; Ferrara and Bell 1995: 281), and emotion (Romaine and Lange 1991: 238; Ferrara and Bell 1995: 282 ff.; Adolphs and Carter 2003: 54; Buchstaller 2002: 15; Rühlemann 2007: 149 ff.). Consider (6), where aargh vocalizes the pain the speaker felt after a skiing accident:
(6) I was just like aargh. (BNC: KPV 2371)
Additionally, GO has the capacity to introduce presentations of non-human sound (e.g., Butters 1980: 306 f.; Macaulay 2001: 15). In (7), for example, the speaker is presenting sounds made by a cat:
(7) She sits there she goes [sucking then purring noises] and she stops and you’re just about to go to sleep and she goes [purring noises] so loud! (BNC: KPG 3613)
To summarize this section, discourse presentation in conversation is a richly diversified dramatic activity: presenters ‘report’ not only isolated speech but stage and enact whole scenes from the past animating voice qualities, utterances, thoughts, emotions, and the sounds of people, animals, and things in action. How is this everyday drama reflected in textbooks?
3. Analysis of discourse presentation in selected EFL textbooks
Reporting is generally introduced in EFL teaching at intermediate level. Accordingly, the seven textbooks selected for analysis all cater for that level. They are given in Table 1 in alphabetical order. The textbooks will be referred to in the following sections by their acronyms listed in Table 1:
Table 1: EFL textbooks under examination
Textbook                                                Acronym
Cutting Edge (Intermediate) (2005)                      CUT
Innovations (Intermediate) Workbook (2004)              INN
Inside Out (Intermediate) (2000)                        INS
New Headway (Intermediate) (2003)                       NEW
Reward (Intermediate) (1995)                            REW
Straightforward (Intermediate Student’s Book) (2006)    STR
Touchstone 4 (2006)                                     TOU
The series from which TOU is taken stands out from the others because it draws on the Cambridge International Corpus; it is thus one of the very few textbook series for learners of English which consistently draw on corpus data and insights from corpus research; see also the Collins COBUILD English course (e.g., Willis and Willis 1989) which is based on the Birmingham Corpus – now the Bank of English. Additionally, the textbook puts an extra emphasis on highlighting ‘conversational strategies’ and ‘conversational grammar’. Given the corpus-based approach and the focus on conversation, TOU is a milestone in the history of English textbooks. We saw in the analysis of conversational discourse presentation (see section 2.1) that a crucial choice concerns mode. Which modes are promoted in EFL textbooks?
3.1 Reporting mode in EFL textbooks
To address the above question the relevant units and sections in all seven textbooks were carefully studied.
Table 2: Mentions of different types of reporting mode in textbooks (D: direct; FD: free direct; I: indirect; FI: free indirect; N: narratised; V: representation of voice)
Textbook   D    FD   I     FI   N       V
CUT        no   no   yes   no   (yes)   no
INN        no   no   yes   no   (yes)   no
INS        no   no   yes   no   no      no
NEW        no   no   yes   no   (yes)   no
REW        no   no   yes   no   no      no
STR        no   no   yes   no   no      no
TOU        no   no   yes   no   no      no
Table 2 shows that none of the textbooks, including corpus-based TOU, mention either the free categories FD and FI or ‘representation of voice’ or, most importantly, direct mode as ways of presenting discourse in their own rights. Instead, the focus is exclusively on indirect and, to a much smaller degree, narratised mode. Narratised mode is not taught explicitly in any of the textbooks. In CUT, INN and NEW, narratised mode only occurs implicitly in a few example sentences and fill-in-the-gap exercises, as in this one from CUT: “(…) would you ______ her the truth?” (p. 107) where the learner is expected to fill in tell and where truth gives a mere summary of the utterance(s) presented. The complete absence of explicit mention of direct mode shown in Table 2 is not to say that the notion of ‘direct speech’ did not figure prominently in the textbooks. In fact, both instances of direct mode presentation and the term ‘direct speech’ recur quite frequently across all relevant textbook units and grammar reference sections. However, instances of direct mode presentation only occur in narrative texts (here, interestingly, the preponderance of direct mode typical of non-textbook fiction is often faithfully re-produced but never pointed out to the learners). Further, where explicit attention is drawn to ‘direct speech’ as such, this is invariably done in the context of transformational exercises; that is, in exercises in which direct speech merely serves as raw material for transformations into indirect speech, thereby applying the rules of ‘backshift’ and performing necessary changes in deictic usage. None of the textbooks inform the learner that direct speech presentation is a reporting mode in its own right. Indeed, indirect mode is presented as if it were the norm in any context of use. Consider, for example, this statement introducing the relevant grammar reference section in NEW: “It is usual for the verb in the reported clause to move ‘one tense back’ if
the reporting verb is in the past tense (e.g., said, told)” (p. 150). Learners consulting the language summary section in the back of CUT are informed: “When we report someone’s words afterwards, the verb forms often move into the past” (p. 152). Similar generalized descriptions could be quoted from the other textbooks as well. Mode representation in textbooks, hence, suggests that ‘reported speech’ is synonymous with ‘indirect speech’. Learners are likely to form the impression that the reporting system is a one-way system, admitting only the choice of indirect mode (or, to a far lesser extent, narratised mode); that direct mode is not only an alternative choice but the preferred choice in conversation is not mentioned. Moreover, direct mode is not only the major mode in conversation but also in fictional writing, as research by Leech and Short (1981: 334) and Semino and Short (2004: 89) has shown. Thus, by failing to include treatment of direct mode, textbook representation of discourse presentation fails to represent how discourse is presented not only in conversation but also in fiction. Interestingly, however, indirect mode and narratised mode, while being of secondary importance in conversation and fiction, seem to be primary in journalistic writing. Comparing discourse presentation across three written corpora – fiction, newspaper news reports, and (auto)biographies – Semino and Short (2004: 225) found these two modes to be predominant in their press corpus. The following analysis of quotatives in EFL textbooks suggests that more parallels can be found between discourse presentation in textbooks and discourse presentation in newspaper reportage.
3.2 Quotatives in EFL textbooks
Table 3 lists all reporting verbs mentioned in the seven textbooks:
Table 3: Quotatives in EFL textbooks
7x     ASK, SAY, TELL
2x +   ADVISE, COMPLAIN, EXPLAIN, INVITE, PERSUADE, REMIND, REFUSE, SUGGEST, THINK, WARN
1x     ACCEPT, ADD, AGREE, APOLOGISE, BEG, CLAIM, CONCLUDE, DECIDE, DENY, ENCOURAGE, ENQUIRE, HEAR, HOPE, INFORM, INTRODUCE, INSIST, ORDER, PROMISE, RECALL, WANT to know
As shown in Table 3, ASK, SAY, and TELL are included in all seven textbooks; ten verbs are found in two or more of the seven textbooks while 20 verbs are mentioned in one textbook only. Table 3 allows for three initial observations. First, it will not be surprising that SAY is included in all textbooks; as mentioned above, there is broad agreement that SAY is the ‘default reporting verb’. By contrast, as far as ASK and TELL are concerned, which are also included in all seven textbooks, there is some evidence that these two verbs may be far from frequent, at least in speech. For instance, in Tagliamonte and Hudson’s (1999) corpus of British and Canadian quotatives, ASK and TELL were found to be very infrequent, accounting for very small percentage values. I suspect that the three verbs SAY, TELL, and ASK enjoy such popularity with textbook writers because they are generally considered to be associated with a particular type of mood: SAY is seen as the verbum dicendi for statements, ASK for questions, and ASK and TELL for directives and requests. Indeed, in most of the textbooks, treatment of ‘reporting speech’ is divided into three sections: reporting statements, reporting questions and reporting directives and requests. In STR, for example, the relevant headings read: ‘reported speech & thought’, ‘reported questions’, and ‘tell & ask with infinitive’. Second, given their non-standard nature, it may not be surprising that BE like and GO are not included in any of the textbooks.2 But it may come as a surprise that THINK is not consistently included: it is mentioned only in INS and STR. That is, in five out of seven textbooks no thought is given to the presentation of thought, no doubt an important factor in conversational narrative. 
Third, there seems to be little agreement as to which reporting verbs should be covered: the textbooks differ noticeably from each other as to which and how many quotatives are mentioned (see Appendix 2 for a list of quotatives by textbook). This lack of agreement may be due to the fact that decisions in textbooks regarding the inclusion or exclusion of lexical items are generally made on bases other than frequency analyses in representative corpora (see, for example, Mindt’s (1997) alternative proposal of a list of irregular verbs based on their relative frequency). In light of the above mentioned parallels regarding mode between textbooks and journalistic writing, I thought it interesting to investigate whether the quotatives covered in the textbooks show a general preference for writing and/or a specific preference for journalistic writing. Using the set of pre-defined subcorpora in the BNC XML Edition, a comparative analysis was conducted investigating the frequencies of those reporting verbs which are mentioned in at least two of the seven textbooks across what Biber et al. (1999) refer to as the major English registers, viz. Academic Writing, Fiction and Verse, Newspapers, and Conversation. This distributional analysis did not include the verbs SAY and THINK simply because of the broad agreement that these two quotatives are crucial both in writing and speech. It needs to be admitted, however, that a register-distributional analysis of the lemmas of verbs used as quotative verbs in EFL textbooks is not without problems because we cannot be sure that all the forms are being used as
quotatives in all the four registers considered. Even seemingly straightforward quotatives such as SAY and TELL can be used as non-quotative verb forms. To name only two examples: the formula I say predominantly acts as a discourse marker rather than a quotative (Rühlemann 2007: 172 ff.), and the verb TELL can be used as a mental verb as in Yeah but you can’t tell screws from security can you?. Further, we cannot rule out completely that the quotative proportions of any given verb may exhibit significant variation across the registers. To ensure with sufficient confidence that only quotative uses are being compared it would be necessary to download all instances of each form of each lemma in each of the four registers and to inspect concordance lines; and to ensure that register variation in quotative use is taken into account it would be necessary to compare quotative proportions. Obviously, going to these lengths is not feasible in the present connection. Instead, I will assume that the eleven verbs examined, all of which are clearly verba dicendi, predominantly perform a quotative function in any register and fully acknowledge the inherent dangers in so doing. Bearing these reservations in mind, the results of the following analyses need to be taken as approximate rather than definitive. Table 4 lists in alphabetical order eleven quotatives shared by at least two of the textbooks under examination, the respective raw frequencies and normed frequencies per million words in the four registers; further, it shows the ratios obtained between, on the one hand, the three written registers taken together and, on the other, conversation:
Table 4: Distributional analysis of eleven verbs across Academic Writing (ACW), Fiction and Verse (FIC), Newspapers (NEW), and Conversation (CON) in the BNC XML Edition
Verb       Measure      ACW 18m   FIC 19m   NEW 11m   CON 5m
ASK        RF             4,327    21,576     4,357    2,530
           NFpm             240     1,136       396      506
           Ratio W/C       1.17
ADVISE     RF               703       498       562       27
           NFpm              39        26        51        5
           Ratio W/C       7.73
COMPLAIN   RF               583       783       613      148
           NFpm              32        41        56       30
           Ratio W/C       1.43
EXPLAIN    RF             4,421     3,620     1,340      206
           NFpm             246       191       122       41
           Ratio W/C       4.55
INVITE     RF               634     1,197       685      166
           NFpm              35        63        62       33
           Ratio W/C       1.37
PERSUADE   RF               652       896       579       37
           NFpm              36        47        53        7
           Ratio W/C       6.48
REMIND     RF               469     1,896       335      201
           NFpm              26       100        31       40
           Ratio W/C       1.31
REFUSE     RF             1,622     1,844     1,833       55
           NFpm              90        97       167       11
           Ratio W/C      10.73
SUGGEST    RF             9,537     2,644     1,811      128
           NFpm             530       139       165       26
           Ratio W/C      10.69
TELL       RF             3,031    29,025     8,353    6,731
           NFpm             168     1,528       759    1,346
           Ratio W/C       0.61
WARN       RF               350     1,265     1,692       46
           NFpm              19        67       154        9
           Ratio W/C       8.89
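The arithmetic behind Table 4 can be sketched in a few lines of Python. The sketch rests on two assumptions that are inferred from the figures rather than stated in the text: that the subcorpus sizes are the rounded values given in the column heads (18, 19, 11, and 5 million words), and that the written/conversation ratio is the mean of the three written normed frequencies divided by the normed frequency in Conversation. The function names are hypothetical, and only ASK and TELL are included for illustration.

```python
# Subcorpus sizes in millions of words, as rounded in the Table 4 column heads.
SIZES_M = {"ACW": 18, "FIC": 19, "NEW": 11, "CON": 5}

# Raw frequencies (RF) copied from Table 4 for two of the eleven verbs.
RAW = {
    "ASK": {"ACW": 4327, "FIC": 21576, "NEW": 4357, "CON": 2530},
    "TELL": {"ACW": 3031, "FIC": 29025, "NEW": 8353, "CON": 6731},
}

WRITTEN = ("ACW", "FIC", "NEW")


def normed_per_million(raw):
    """Normed frequency per million words for each register."""
    return {reg: raw[reg] / SIZES_M[reg] for reg in SIZES_M}


def written_to_conversation_ratio(raw):
    """Assumed definition: mean written NFpm divided by conversational NFpm."""
    nf = normed_per_million(raw)
    mean_written = sum(nf[reg] for reg in WRITTEN) / len(WRITTEN)
    return mean_written / nf["CON"]


for verb, raw in RAW.items():
    rounded = {reg: round(value) for reg, value in normed_per_million(raw).items()}
    print(verb, rounded, round(written_to_conversation_ratio(raw), 2))
```

Under this reading, a ratio above 1 points to a preference for the written registers and a ratio below 1 to a preference for conversation.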
The results displayed in Table 4 suggest two major conclusions: (i) that the verbs are typically used in writing rather than informal speech and (ii) that they are mostly used in journalistic writing. The evidence for (i) is twofold. First, as can be seen from the shaded figures highlighting the highest frequency in each row of normed frequencies,
none of the eleven verbs reaches the highest frequency in Conversation. On the contrary, eight of them are least frequent in Conversation; only ASK, REMIND, and TELL do not follow this pattern (all three are least frequent in Academic Writing). Second, the ratios between the three written registers, on the one hand, and conversation, on the other, show that TELL, for which a ratio of 0.61 was obtained, is the only verb which is more frequent in conversation than in the three written context types taken together. The remaining ten verbs are invariably more frequent in writing than in conversation, with four verbs displaying slight preferences for writing – ASK (1.17), COMPLAIN (1.43), INVITE (1.37), and REMIND (1.31) – and, conversely, six verbs displaying strong and very strong preferences for the written mode – ADVISE (7.73), EXPLAIN (4.55), PERSUADE (6.48), REFUSE (10.73), SUGGEST (10.69), and WARN (8.89). Initial evidence for (ii), that the verbs in question are mostly used in journalistic writing, is the fact that, as can be seen in Table 4, five out of eleven quotatives obtain the highest normed frequency in Newspapers compared to Conversation and the two other written subcorpora. Further, this tendency becomes stronger as soon as we take the group of verbs mentioned in only one textbook (see Table 3) into account. Table 5 shows the results of a distributional analysis of these 20 quotatives. Again, the highest normed frequencies per quotative are shaded:
Table 5: Distributional analysis of quotatives mentioned in only one out of seven textbooks across Academic Writing (ACW), Fiction and Verse (FIC), Newspapers (NEW), and Conversation (CON) in the BNC XML Edition
Verb           Measure   ACW 18m   FIC 19m   NEW 11m   CON 5m
ACCEPT         RF          4,381     2,186     1,703      153
               NFpm          243       115       155       31
ADD            RF          3,006     4,001     5,062      278
               NFpm          167       211       460       56
AGREE          RF          3,065     3,483     2,304      264
               NFpm          170       183       210       53
APOLOGISE      RF             33       377       241       16
               NFpm            2        20        22        3
BEG            RF            130       794       142       84
               NFpm            7        44        13       17
CLAIM          RF          3,359       838     3,764      110
               NFpm          187        44       342       22
CONCLUDE       RF          1,968       384       364        4
               NFpm          109        20        33        1
DECIDE         RF          3,386     4,548     2,661      450
               NFpm          188       239       242       90
DENY           RF          1,354       985     1,642       27
               NFpm           75        52       149        5
ENCOURAGE      RF          2,307       544       913       60
               NFpm          128        29        83       12
ENQUIRE        RF             99       778        19       13
               NFpm            6        41         2        3
HEAR           RF          1,778    13,980     3,233    2,407
               NFpm           99       736       180      481
HOPE           RF          1,188     4,898     3,356      949
               NFpm           66       258       305      190
INFORM         RF          1,069       820       379       30
               NFpm           59        43        35        6
INSIST         RF            857     1,386     1,259       37
               NFpm           48        73       115        7
INTRODUCE      RF          3,017       786     1,242       32
               NFpm          168        41       113        6
ORDER          RF          1,105     1,461     1,010      187
               NFpm           61        77        92       37
PROMISE        RF            442     1,834       923       87
               NFpm           25        97        84       17
RECALL         RF            686     1,091       683       21
               NFpm           38        57        62        4
WANT to know   RF            108     1,188       152      204
               NFpm            8        63        14       41
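The count of top-scoring registers for Table 5 can be checked with a short tally. The following Python sketch is an illustration only: the normed frequencies are copied from Table 5 as printed, while the names `NFPM`, `top_register`, and `tally` are hypothetical and not part of the original analysis.

```python
from collections import Counter

REGISTERS = ("ACW", "FIC", "NEW", "CON")

# Normed frequencies per million words (NFpm) as printed in Table 5,
# in the order ACW, FIC, NEW, CON.
NFPM = {
    "ACCEPT": (243, 115, 155, 31),
    "ADD": (167, 211, 460, 56),
    "AGREE": (170, 183, 210, 53),
    "APOLOGISE": (2, 20, 22, 3),
    "BEG": (7, 44, 13, 17),
    "CLAIM": (187, 44, 342, 22),
    "CONCLUDE": (109, 20, 33, 1),
    "DECIDE": (188, 239, 242, 90),
    "DENY": (75, 52, 149, 5),
    "ENCOURAGE": (128, 29, 83, 12),
    "ENQUIRE": (6, 41, 2, 3),
    "HEAR": (99, 736, 180, 481),
    "HOPE": (66, 258, 305, 190),
    "INFORM": (59, 43, 35, 6),
    "INSIST": (48, 73, 115, 7),
    "INTRODUCE": (168, 41, 113, 6),
    "ORDER": (61, 77, 92, 37),
    "PROMISE": (25, 97, 84, 17),
    "RECALL": (38, 57, 62, 4),
    "WANT to know": (8, 63, 14, 41),
}


def top_register(values):
    """Return the register with the highest normed frequency."""
    return max(zip(REGISTERS, values), key=lambda pair: pair[1])[0]


# Count, for each register, how many verbs peak there.
tally = Counter(top_register(values) for values in NFPM.values())
for register in REGISTERS:
    print(register, tally[register])
```

A `Counter` returns zero for registers in which no verb peaks, so Conversation is reported explicitly rather than being omitted from the output.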
Ten out of the 20 quotatives listed in Table 5 are most frequent in Newspapers compared to Conversation, Fiction and Verse, and Academic Writing. Conversely, once again, Conversation remains without a top-scoring quotative, while Academic Writing and Fiction and Verse obtain highest frequencies with five quotatives each. If we combine the results from Table 4 and Table 5, we see that of all 31 EFL quotatives which were compared across registers (remember that SAY and THINK were excluded from this analysis), 15 verbs are most frequent in Newspapers, whereas Academic Writing has seven and Fiction and Verse nine top-scoring quotatives. That is, almost half of all EFL quotatives are used mostly in journalistic writing.
4. Conclusions and implications for EFL teaching
We can now compare the results of the analyses on discourse presentation in natural conversation and EFL textbooks and draw conclusions. As regards reporting mode, I have shown that discourse presentation in conversation is overwhelmingly in direct mode, whereas the modes promoted in textbooks are indirect mode and, to a much lesser extent, narratised mode. The focus on indirect mode in EFL textbooks is such that this mode is presented as if it were the default mode; narratised mode is mentioned only marginally and neither representation of voice, free direct, free indirect nor, most importantly, direct mode are mentioned at all as reporting modes in their own rights. The analyses of the sets of quotatives used in conversation and EFL textbooks suggested that the two sets overlap to some degree – e.g., the default
quotative for speech presentation SAY is included in both sets – but, more importantly, diverge: while none of the ‘new quotatives’ BE like and GO, which play increasingly important roles in conversation, are included in the textbooks, the EFL quotatives exhibit a skew not only towards the written mode but, specifically, to journalistic writing. Moreover, in the analysis of how mode and quotatives are realised in EFL textbooks I found evidence that discourse presentation in EFL textbooks resembles in essential ways discourse presentation in journalistic writing. This resemblance was observed on two levels. First, the heavy emphasis EFL discourse presentation puts on indirect and, to a smaller degree, narratised mode is reminiscent of the emphasis on indirect and narratised mode which Semino and Short (2004) found in their press corpus. Second, the distributional analyses carried out on the quotatives used in EFL textbooks suggest a clear tendency: almost half of all quotatives examined were used most frequently in the Newspapers subcorpus and less frequently in Academic Writing, Fiction and Verse, and Conversation. This double evidence raises the question whether the model underlying EFL discourse presentation is found in discourse presentation in journalistic writing – hence, maybe, the preference in EFL for the term ‘reporting’. Again, however, a cautionary note is in order not only because of the methodological reservations acknowledged above but also, as one reviewer commented, because no large claim can be made on the basis of a number of verbs that occur in just two to maximally seven texts, all of which have been published in one place (the UK). Bearing these reservations in mind, the overlap between EFL discourse presentation and journalistic writing found in this paper is merely sufficient to hypothesize that the former is modelled on the latter, and to leave this hypothesis to be tested for future research. 
In conclusion, the comparison of discourse presentation in conversation and EFL textbooks shows that this is an area in which the gap between school English and real English is particularly wide: EFL textbooks disregard not only a primary reporting mode – direct mode – which is the norm not only in conversation, no doubt a ‘core register’ (cf. Rühlemann 2007), but also in fiction, a similarly important context type, thus creating the impression for EFL learners that indirect mode is the only choice they have for ‘reporting’. Further, EFL textbooks promote quotatives which will help EFL learners as readers of British newspapers but not as conversationalists in informal L2 encounters. Moreover, EFL textbooks fail to equip EFL learners with what is most central to discourse presentation in conversation: an awareness of what end discourse presentation serves in conversation, namely the establishment of ‘bonds of communion’ through the creation of narrative as drama, and the corresponding linguistic means to achieve that end. Thus, a yawning gap divides discourse presentation in natural conversation and EFL textbooks. Can the gap be closed? Unlike, for example, modals or progressives, whose representation in textbooks, it seems, can easily be re-aligned to naturally-occurring English, attempts to re-align the representation of discourse presentation in EFL to conversational discourse presentation will face a major
problem. This problem arises from the fact that conversational discourse presentation is fraught with non-standard English. Take, for example, the reporting verbs GO and BE like: these are generally “considered by many people to be non-standard and grammatically unacceptable” (Carter and McCarthy 2006: 823), an observation supported by an attitudinal survey conducted by Blyth et al. (1990: 223) whose respondents judged the two verbs as “stigmatized, ungrammatical, and indicative of casual speech” (for a more differentiated picture of attitudes towards the two quotatives in the UK see Buchstaller 2006). To complicate matters, the two verbs are by no means the only non-standard features of discourse presentation. The long list of conversational discourse presentation features which are at odds with Standard English includes: I says, a seemingly clear case of ‘subject-verb discord’ (cf. Rühlemann 2007), use of SAY not only for presented statements but also questions, seemingly careless switches between historic past (HP) and narrative past (NP) (but see Schiffrin 1981 who saw HP in association with the Complicating events section in narratives), seemingly careless shifts between reporting verbs, seemingly unmotivated repetitions of reporting clauses, use of past –ing with reporting verbs (cf. McCarthy 1998), and, finally, use of ‘utterance openers’ such as oh and well (cf. Biber et al. 1999). As regards its non-standardness, conversational discourse presentation is far from being exceptional; it is in the good company of a great many non-standard features distinctive of conversational English. So, conversational language generally is ‘vernacular’ language to such an extent that Biber et al. (1999: 1121) declare the notion of Standard English “problematic in talking of the spoken language.” Hence, bringing the representation of conversational discourse presentation into closer correspondence with discourse presentation in natural conversation will be difficult because conversational discourse presentation, just as conversational language generally, conflicts with Standard English, the model which has long been predominant in EFL both for teaching writing and speech (Quirk 1985: 7). Therefore, we need to be aware that teaching conversational discourse presentation, just as teaching most other conversational features, presupposes a readiness to sacrifice, at least partly, Standard English as the ‘one-and-only’ model.3 Instead, it has been suggested, Standard English needs to be reduced to a ‘core variety’ (Bex 1993: 261), underlying the teaching of the written language, while the spoken language should be taught on the basis of the model of ‘conversational grammar’, a more appropriate model that major corpus linguistic studies have elaborated in great detail (e.g., Biber et al. 1999; Carter and McCarthy 2006).4 Such a ‘register approach’ (Rühlemann forthcoming) would tie in well with recent attempts to argue a shift in emphasis from monolithic descriptions to register-specific descriptions of the grammar of English (Conrad 2000). The obvious advantage of this approach would be that it enables EFL teaching to reflect what is seen as a fundamental property of language: its functional diversity (cf. Stubbs 1993; Bex 1993).
Notes
1 For many researchers though, the term ‘speech report’ is a misnomer (cf. Tannen 1986) because neither do conversationalists ‘report’ faithfully what was being said nor is it always speech that is rendered but rather a broad spectrum of types of discourse, including not only actual speech but also habitual and potential speech, thought, emotion, gesture, and sound (e.g., Buchstaller 2008; Rühlemann 2007: 121 ff.).
2 This is not to imply that no findings from corpus research on conversational discourse presentation had found their way into the representation of discourse presentation in TOU. An example of what has been taken up is past –ing with reporting verbs, as in she was saying how nice it was – a form which serves “to focus on the content rather than the actual words” (McCarthy et al. 2006: 90; cf. also McCarthy 1998: Chapter 8; Biber et al. 1999: 1120; Rühlemann 2007: 133 f.). This form is explicitly taught in TOU. Also, TOU mentions BE like as a quotative; however, it does so not in the section on speech reporting but elsewhere (viz. in the context of summarising the various functions of discourse marker like) and without any illustrative examples.
3 For a discussion of the problems surrounding this partial sacrifice see Rühlemann (2008).
4 Noteworthy in this discussion are also attempts to argue an acknowledgment of ‘spoken standard’ complementing (and thus relativising) the ‘written standard’ (e.g., Carter 1999). It is questionable though whether all or most forms of conversational discourse presentation will be accepted as spoken standard. Particularly doubtful cases are, for example, I says, quotative GO (including I goes) or BE like which generally seem to attract rather negative attitudes.
References
Adolphs, S. and R.A. Carter. (2003), ‘And she’s like it’s terrible, like: Spoken Discourse, Grammar, and Corpus Analysis’, International Journal of English Studies 3 (1): 45-56. Andersen, G. (2001), Pragmatic Markers and Sociolinguistic Variation. A relevance-theoretic approach to the language of adolescents. Amsterdam/Philadelphia: John Benjamins. Bex, T. (1993), ‘Standards of English in Europe’, Multilingua 12 (3): 249-264. Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan (1999), Longman Grammar of Spoken and Written English. Harlow: Pearson Education Limited. Blyth, C. Jr., S. Reckenwald and J. Wang (1990), ‘I’m like “Say what?!”: A new quotative in American oral narrative’, American Speech 65 (3): 215-227.
Discourse presentation in EFL textbooks
Buchstaller, I. (2002), ‘He goes and I’m like: The new Quotatives re-visited’, Internet Proceedings of the University of the Edinburgh Postgraduate Conference 1-20. Buchstaller, I. (2006), ‘Social stereotypes, personality traits and regional perceptions displaced: Attitudes towards the ‘new’ quotatives in the UK’, Journal of Sociolinguistics 10 (3): 362-381. Buchstaller, I. (2008), ‘The localization of global linguistic variants’, English World-Wide 29 (1): 15-44. Butters, R.R. (1982), ‘Narrative Go ‘Say’’, American Speech 55 (4): 304-307. Carter, R.A. (1999), ‘Standard grammars, spoken grammars: Some educational implications’, in: T. Bex and R.J. Watts (eds.) Standard English. The widening debate. London: Routledge, pp. 149-166. Carter, R.A., R. Hughes and M.J. McCarthy (2000), Exploring Grammar in Context. Cambridge: Cambridge University Press. Carter, R.A., and M.J. McCarthy (2006), Cambridge Grammar of English. Cambridge: Cambridge University Press. Conrad, S. (2000), ‘Will corpus linguistics revolutionize grammar teaching in the 21st century?’, TESOL Quarterly 34 (3): 548-560. Conrad, S. (2004), ‘Corpus linguistics, language variation, and language teaching’, in: J. McH. Sinclair (ed.). How to Use Corpora in Language Teaching. Amsterdam / Philadelphia: John Benjamins, pp. 67-85. Coulmas, F. (1986), ‘Reported speech: Some general issues’, in: Coulmas, F. (ed.) Direct and Indirect Speech. Berlin/New York/Amsterdam: Mouton de Gruyter, pp. 1-28. Fairon, C. and J.V. Singler (2006), ‘I’m like, “Hey, it works!”: Using GlossaNet to find attestations of the quotative (be) like in English-language newspapers’, in: A. Renouf and A. Kehoe (eds.) The Changing Face of Corpus Linguistics, Amsterdam/New York: Rodopi, pp. 325-336. Ferrara, K. and B. Bell (1995), ‘Sociolinguistic variation and discourse function of constructed dialogue introducers: The case of be + like’, American Speech 70 (3): 265-290. Halliday, M.A.K. and M.I.M. 
Matthiessen (2004), An Introduction to Functional Grammar (3rd edition). London: Edward Arnold. Kilgarriff, A. (1998), ‘BNC database and word frequency lists’, http://www.kilgarriff.co.uk/bnc-readme.html. Leech, G. and M. Short (1981), Style in Fiction. London/New York: Longman. Macaulay, R. (2001), ‘You’re like ‘why not?’ The quotative expressions of Glasgow adolescents’, Journal of Sociolinguistics 5 (1): 3-21. Malinowski, B. (1923), ‘The problem of meaning in primitive languages’, in: C.K. Ogden and I.A. Richards (eds.) The Meaning of Meaning. London: Routledge, 296-336. McCarthy, M.J. (1998), Spoken Language and Applied Linguistics. Cambridge: Cambridge University Press. McIntyre, D., C. Bellard-Thomson, J. Heywood, T. McEnery, E. Semino and M. Short (2004), ‘Investigating the presentation of speech, writing and
thought in spoken British English: A corpus-based approach’, ICAME 28: 49-76. Miller, J. and R. Weinert (1998), Spontaneous Spoken Language: Syntax and Discourse. Oxford: Clarendon Press. Mindt, D. (1996), ‘English corpus linguistics and the foreign language teaching syllabus’, in: J. Thomas and M. Short (eds.) Using Corpora for Language Research, London: Longman, pp. 232-247. Mindt, D. (1997), ‘Corpora and the Teaching of English in Germany’, in: A. Wichmann, S. Fligelstone, T. McEnery, and G. Knowles (eds.) Teaching and Language Corpora. Harlow: Longman, pp. 41-50. Quirk, R., S. Greenbaum, G. Leech, and J. Svartvik (1985), A Comprehensive Grammar of the English Language. London: Longman. Romaine, S. and D. Lange (1991), ‘The use of like as a marker of reported speech and thought: a case of ongoing grammaticalization in progress’, American Speech 66 (3): 227-279. Römer, U. (2005), Progressives, Patterns, Pedagogy. A Corpus-driven Approach to English Progressive Forms, Functions, Contexts and Didactics. Amsterdam/Philadelphia: John Benjamins. Rühlemann, C. (2007), Conversation in Context. A Corpus-driven Approach. London: Continuum. Rühlemann, C. (2008), ‘A register approach to teaching conversation: Farewell to Standard English?’, Applied Linguistics 29 (4): 672-693. Schourup, L. (1982), ‘Quoting with Go ‘Say’’, American Speech 57 (2): 148-9. Semino, E. and M. Short (2004), Corpus stylistics: Speech, writing and thought presentation in a corpus of English writing. London: Routledge. Short, M., E. Semino and J. Culpeper (1996), ‘Using a corpus for stylistics research: speech and thought presentation’, in: J. Thomas and M. Short (eds.) Using Corpora for Language Research. Studies in Honour of Geoffrey Leech. London/New York: Longman, 110-131. Sinclair, J. McH. (ed.) (1989), Collins COBUILD Dictionary of Phrasal Verbs. London: HarperCollins. Stenström, A., G. Andersen and I.K. Hasund (2002), Trends in Teenage Talk. Amsterdam / Philadelphia: John Benjamins. 
Stubbs, M. (1993), ‘British Traditions in Text Analysis – From Firth to Sinclair’, in: M. Baker, G. Francis, and E. Tognini-Bonelli (eds.) Text and Technology. In Honour of John Sinclair. Amsterdam/Philadelphia: John Benjamins, pp. 1-33. Tagliamonte, S. and A. D’Arcy (2004), ‘He’s like, she’s like: The quotative system in Canadian youth’, Journal of Sociolinguistics 8 (4): 493-514. Tagliamonte, S. and R. Hudson (1999), ‘Be like et al. beyond America: The quotative system in British and Canadian youth’, Journal of Sociolinguistics 3 (2): 147-172. Tannen, D. (1986), ‘Introducing constructed dialogue in Greek and American conversational and literary narrative’, in: F. Coulmas (ed.) Direct and indirect speech. Berlin: Mouton de Gruyter, pp. 311-332.
Tannen, D. (1988), ‘Hearing voices in conversation, fiction and mixed genres’, in: Tannen, D. (ed.). Linguistics in Context: Connecting Observation and Understanding. Norwood, NJ: Ablex, 89-113. Willis, J. and D. Willis (1989), Collins COBUILD English course. London: Collins COBUILD.

Appendix 1: Sources for textbook representation of discourse presentation
Cunningham, S. and P. Moor (2004), Cutting Edge (Intermediate). London: Heinle.
Deller, H. and A. Walker (2004), Innovations (Intermediate) Workbook. Harlow: Longman.
Greenall, S. (1995), Reward (Intermediate). Oxford: Macmillan Education.
Kay, S. and V. Jones (2000), Inside Out (Intermediate). Oxford: Macmillan Education.
Kerr, P. and C. Jones (2006), Straightforward (Intermediate Student’s Book). Oxford: Macmillan Education.
McCarthy, M.J., J. McCarten and H. Sandiford (2006), Touchstone. Student’s book 4. Cambridge: Cambridge University Press.
Soars, L. and J. Soars (2003), New Headway (Intermediate Student’s Book). Oxford: Oxford University Press.

Appendix 2: Quotative verbs by textbook
CUT: ask, say, tell
INN: apologise, ask, complain, enquire, invite, persuade, say, suggest, tell
INS: ask, say, tell, think
NEW: advise, ask, beg, invite, order, refuse, remind, say, tell
REW: accept, advise, agree, ask, decide, encourage, explain, hope, introduce, persuade, promise, refuse, remind, say, suggest, tell, warn
STR: ask, claim, complain, deny, inform, insist, know, say, tell, think, want to know, warn
TOU: add, ask, conclude, explain, recall, remember, say, tell
Awful adjectives: a type of semantic change in present-day corpora
Göran Kjellmer
University of Gothenburg

Abstract
Semantic change observable in isolated linguistic items is both frequent and interesting in itself. More interesting, perhaps, are cases of structural change, i.e. cases where one and the same tendency can be discerned in a related group of words. This paper uses modern corpus material in order to sketch the development of one such group, words meaning ‘frightening’, and suggests that they all follow the same trend in the direction of ‘impressive, overwhelming’ although they differ with respect to how far they have advanced along that route. The semantic changes of some 25 words in the chosen area are studied in detail, and their development is illustrated with corpus material. One of the conclusions of the study is that their rate of semantic progress is partly dependent on the time when they entered the semantic field. The paper deals with the adjectives in the group and leaves the adverbs, although equally interesting, out of account for a later investigation.
1. Introduction
Saussure’s division of language study into a synchronic and a diachronic section is not always possible or indeed fruitful to uphold. Many of the perplexing phenomena in the language of today can be naturally explained with reference to historical facts and, perhaps more importantly, there are changes taking place in the modern language before our eyes. To insist on a rigorous synchronic OR diachronic approach in such matters would therefore be counterproductive. The present paper will study a case of ongoing semantic change in modern English, a transition in meaning from negative to positive, a change that is often referred to as amelioration. Amelioration can be found in many, perhaps all, languages, and a few examples may be illuminating. Terribilis is used in a positive sense in the Vulgate translation of the Song of Songs (Snaith 1993: 88), and negative-to-positive changes are found in several Semitic languages (Goitein 1965: 220). Swedish grym ‘cruel’ is used informally in the sense of ‘very good, skilful, “cool”’ (NEO). In English, shrewd has passed from meaning ‘wicked’ to meaning ‘clever, astute’, and nice has passed from meaning ‘foolish, stupid’ to meaning ‘agreeable, delightful’ (OED). And, finally, a recent English parallel is the use of wicked to mean ‘excellent, splendid; remarkable’ (OED, s.v. Wicked 3.b).
All those are thus cases of amelioration, a well-known type of semantic change, although most of the time the change is not as extreme as in the examples just given. The examples of amelioration we have just seen are isolated instances of semantic change. What would be much more interesting would be if more general trends were to be found, in line with Stern’s finding (1964 [1931]: 190) that “English adverbs which have acquired the sense ‘rapidly’ before 1300 always develop the sense ‘immediately’”. This paper will try to find such regularity in the semantic change of words meaning ‘frightening’.
2. Adjectives in the field
A terrible film is a very bad film, but a terrific film is a very good film. Both terrible and terrific originally denote ‘causing terror’1; they have thus developed differently, though not necessarily in different directions. Similarly, an awful place is a bad place, but an awesome place is a positively impressive one.2 Again both awful and awesome originally meant practically the same thing,3 and again they have developed differently. What is common to the two pairs is thus that a negative element that has remained in one member has developed into a positive one in the other member. We might hypothesise that adjectives having (had) the meaning ‘causing fear’ in common will show a degree of similarity in their developmental tendencies. It may be that they will coincide to such an extent that a tendency for the whole group will appear. A study of the group from this perspective could therefore be of interest. The adjectives to be looked into are listed in Table 1.

Table 1: Adjectives meaning ‘causing fear’
alarming       dreadful     hairy       redoubtable  terrifying
appalling      fearful      horrendous  scary        tremendous
awe-inspiring  fearsome     horrible    shocking
awesome        formidable   horrific    startling
awful          frightening  horrifying  terrible
creepy         frightful    ominous     terrific
The words can be said to belong to the same semantic field, where the common factor is ‘causing fear’, or causing closely related sensations such as awe, dread, fright, horror, shock, terror, trembling. There is no suggestion that the words are absolute synonyms at any stage of their development, only that they share or have shared an important element in their semantic make-up. As we shall see, that element is present to varying degrees in the words as used today. The words, particularly in their early uses, overlapped in meaning to a considerable extent, the common element being ‘frightening’. Some of the words also have a prior stage in common, namely that of ‘feeling fear’, ‘frightened’.
OED’s first recorded occurrence of awesome means ‘full of awe’ (1598); similarly that of dreadful means ‘full of dread’ (a1225) and that of frightful means ‘full of terror’ (c1250). In some cases such a sense seems to have developed later than the ‘frightening’ one: the ‘frightening’ sense is found earlier, but awful is recorded as ‘terror-stricken’ (c1590), fearful as ‘frightened’ (c1374), fearsome (“?erron.”) as ‘timid’ (1863) and scary as ‘frightened’ (1800). However, in time they all come together in the meaning of ‘frightening’, a common starting-point for their subsequent development. Note that this change implies a widening of application: only a person or an animal can be full of awe, but living creatures as well as lifeless things can be frightening. There is a certain semantic fluctuation in the present-day use of the words with occasional uncertainty as to the exact meaning of the items in individual cases; the present-day semantics in the field are sometimes indistinct and worth looking into. I will address the subject in two different ways, one synchronic and one that could be called synchronic-diachronic.
3. Synchronic approach. Semantic polarity of head-words
If we go back to the pair terrible: terrific, which had quite different meanings, one positive and one negative, with the head-word film, the question arises, how do we know whether a “frightening” word is positive or negative when the head does not suggest either interpretation, i.e. when it is neutral in the positive-negative dimension? The head, that is, does not seem to be of much help here. Nevertheless, it seems clear that we can say a terrible disaster or a terrific achievement but hardly *a terrible achievement or *a terrific disaster. We may then assume, rather trivially, that the nature of the heads of the NPs in which the adjectives occur will give some indication of the semantics of the adjectives. This is where corpus evidence will be most useful. The terrible: terrific examples suggest that one contrast likely to play a distinguishing role in the nominal heads is that between semantically positive and semantically negative. An achievement can be seen as a positive thing, as something good, and a disaster as a negative thing, as something bad. However, it is obvious that for many, probably most, nouns there is no such semantic charge: a thought, an experience, a feeling, a job are neither good nor bad in themselves. Determining the semantic prosody of the nominal heads with a tripartite classification of the heads as positive, negative and neutral was therefore seen as important. In order to find relevant material for statistical calculations the CobuildDirect Corpus was used. A list was produced of all the relevant adjectives immediately followed by a noun found in the Corpus. For each adjective its most frequent nominal collocates were selected; they were limited to those occurring at least twice, up to a maximum of 20 nouns. The nouns were then classified into the three classes Positive, Negative and Neutral. The following criteria for the
classification were used. If the noun (as used by itself) fits into either of the slots in the formula

{It was / This was / These were / She was / He was} (quite a(n)) ---, so obviously I liked it/them/him/her
This was evidence of ---, so obviously I liked it/them/him/her

it was considered a positive noun (e.g. achievement), but if it fits into the formula

{It was / This was / These were / She was / He was} (quite a(n)) ---, so obviously I DIDN’T like it (etc.)
This was evidence of ---, so obviously I DIDN’T like it (etc.)

it was considered a negative noun (e.g. shock). If, finally, it fits into neither formula, or both, it was classed as a neutral noun (e.g. contrast). A subjective element is as inevitable here as it is in the classification of our adjectives. “VALUE adjectives are thus subjective in the same sense as deictic terms: their referential meaning is largely dependent on their speaker’s identity.” (Adamson 2000:45). The nominal heads are listed in Appendix 1. It appears that the great majority of the nominal heads fall in the category Neutral and that the rest is unequally divided between Positive and Negative, so that Negative is roughly three times as frequent as Positive, as might have been expected given the original meaning of the adjectives. The distribution of the adjectives over the positive, negative and neutral heads is shown in Appendix 2. Without going into too much detail here, the clear difference in meaning that could be observed with terrible and terrific is reflected in the distribution of their nominal collocates: terrible occurs with 8 neutral and 12 negative heads but never with a positive head, whereas terrific occurs with 7 positive heads and 13 neutral ones but never with a negative head. When adjectives that take positive heads occur with neutral heads they normally still convey that positive meaning, as in formidable reputation or tremendous amount. On the other hand, adjectives with frequently occurring negative heads can be seen to convey the same negative meaning with neutral heads, as in horrendous consequences, horrific incident or terrifying experience. The proportion of the adjective’s occurrences with positive or negative heads thus seems to be indicative of the meaning it carries with neutral heads. But although that may be so, the division of collocates into positive, negative and neutral is not enough to explain how the words have moved across
their semantic terrain. We will then have to apply a synchronic-diachronic approach.
4. Synchronic-diachronic approach
Several of the ‘frightening’ words develop semantically from ‘frightening’ to ‘(very) impressive’ or ‘overwhelming’. In order to understand how this is possible one could posit several intermediate steps, from ‘frightening’ to ‘very bad’, from ‘very bad’ to ‘great/big/large’, and from ‘great/big/large’ to ‘impressive, overwhelming’. It is characteristic of the succession of steps that for any two adjoining steps the speaker can intend one and at the same time imply the other one. Traugott (1990: 498f.; Traugott and Dasher 2002: passim) uses the term “invited inference” and shows that invited inferences can become lexicalised. The second step will then take over the main import of the word, without necessarily letting go of the first meaning.4 This is a situation that Stern (1964 [1931]: 380) describes as “adequation”. Referring to the semantic change of horn from ‘animal’s horn’ to ‘musical instrument of a certain kind’, he says, The principal element of its meaning - of the subjective apprehension of the referent - changes; the notion ‘animal’s horn’ recedes to a subsidiary position, and the notion ‘instrument of a certain kind’ takes its place as predominant. It is only when the hearer accepts this added element as being part of the word’s meaning that a semantic change takes place. Semantic change is thus a result of a collaborative (but mostly unconscious) effort: [Meanings have] a starting point in the conventional given, but in the course of ongoing interaction meaning is negotiated, i.e. jointly and collaboratively constructed ... This is the setting of semantic variability and change. (Lewandowska-Tomaszczyk, quoted from Traugott & Dasher 2002: 25) It should be stressed that the originally dominant element need not disappear but can “recede to a subsidiary position”. This will serve as a characterisation of each one of the steps leading from ‘frightening’ to ‘impressive’. 
The original semantic component of fear may even remain as a background component all through the later development of the word, cf. sentence (5). The progression could be viewed as a change from less subjective to more subjective, in which case it would be in line with the principle of semantic change put forward by Traugott (1990: 500). Let us now take a look at the steps separately. First there is the semantic transition from ‘frightening’ to ‘very bad’. An evaluative element is introduced, which will be part of the word all through its
later development. Awful carnage is presumably frightening, but it is also very bad, as in (1)
He is creating racial hatred against ethnic minorities, as he would approve the awful carnage of the Muslims by ethnic cleansing in the former Yugoslavia. Corpus: ukmags/03. Text: N0000000887.
A slight shift in meaning then makes it possible for awful to refer to things that are very bad but may not be frightening, as in (2)
What a vile place, what a bloody awful place to spend a bloody awful afternoon. Corpus: ukbooks/08. Text: B0000000100.
Fear is hardly involved here. The next step, from ‘very bad’ to ‘great/big/large’, follows logically. If something stands out as being very bad, it may be because of its scale or size, as in: (3)
I think political stands will account for an awful lot. Corpus: npr/07. Text: S2000910312.
where there may be no suggestion of ‘badness’, but where the evaluative element is clearly present. A final step will then be that from ‘very great/big/large’ to ‘impressive’ or ‘fascinating’, again a natural and logical development. What is very big is often also impressive, fascinating or even overwhelming. Cf. (4)
There is an awful suspense in watching this self-absorbed creature being taken over, ... Corpus: times/10. Text: N2000960217.
The coupling of big size with impressiveness and fascination leads on to a situation where the words in the field can denote impressiveness without at the same time necessarily denoting magnitude: (5)
Can we ordain to ourselves the awful majesty of God - to decide what cities and villages are to be destroyed, who will live and who will die...? Corpus: usbooks/09. Text: B9000001351.
In (5) the original component of ‘fear, dread’, in this case of the deity, can be seen to remain alongside the new one of fascination. The full scale then stretches from ‘frightening’ over ‘very bad’ to ‘very great’ and culminates in ‘very impressive’, ‘fascinating’, ‘overwhelming’. In a more general way, the change can be seen as going from negative to positive, the first two steps representing the negative side and the last two steps representing the positive side. Ullmann (1962: 137) sees this change as resulting from a tendency to overstatement.5 There is a great deal of continuity in the development and no sudden jumps from one meaning to the next. One developmental stage is always foreshadowed
(“invited”) in the previous stage. A very similar development is described by Gustaf Stern (1921: 261), who discusses the historical change of the Old English adverb fæste from the sense ‘strongly, immovably’ to ‘closely, securely, well’, and says,

The whole process consists of a series of small changes, each representing an imperceptible advance in one direction, and capable of being explained as an association of the simplest kind. It is not necessary, at any point, to assume complicated psychic processes in order to explain the development.

His description is equally relevant to the semantic development of the “fear-inspiring” words. A graphic illustration of the semantic field is presented in Appendix 3, where examples of head-words appear under the relevant senses. The adjectives in the left column have (roughly) the meaning given at the top as their predominant element when they modify nouns of the type given under the meanings. The adjectives differ in the extent to which they have covered the way towards fascination and impressiveness; hairy and creepy have just begun their progress in that direction whereas the adjectives at the bottom of the table have gone all the way. Cf. Traugott (1990: 514): “[S]emantic change very rarely applies to items of the same lexical field at the same time, and thus is rarely capturable in a rule.” Even if hairy and creepy are beginning to move in the same direction as the others, it should be stressed that that is not necessarily the case. A word with the semantic characteristics of the group dealt with here is likely to change in the direction suggested, but it does not have to do so.6 What is very clear is that the words in the field all move in the same direction. It seems less likely that a word should develop in the opposite direction, from ‘fascinating’ to ‘frightening’. This raises the question of directionality in semantic change. 
Particularly within the theory of grammaticalisation claims have been made that changes always move in the same direction, from lexical to grammatical and not the other way round. (Cf. the title of Traugott’s 1990 paper.) Lass (2000) contests these claims in a spirited paper, where he shows that the strong version of the unidirectionality position is untenable. Even if most of the evidence supports the hypothesis that all grammaticalisation is unidirectional, the hypothesis must remain a hypothesis for several reasons. (The number of counterexamples is theoretically infinite, the difference between lexical and grammatical is insufficiently well defined, etc.) Similarly, Olga Fischer uses the story of the to-infinitive to show that “grammaticalisation processes do not always run the same course, that there may be differences between similar languages, that the process may indeed be reverted, and that this relates to the specific grammatical circumstances that a language finds itself in” (2000: 163). The weaker position, that some kind of unidirectionality can normally be observed in semantic change, thus a tendency but not a law, seems in any case defensible, if less revolutionary. This then
applies not only to grammaticalisation but to semantic change more generally. “The crucial point is that if SP/Ws [speakers/ writers] begin to exploit a lexeme in new ways, and the new meanings are adopted by others, the reverse order of change is not expected.” (Traugott and Dasher 2002: 281) And as we saw, the words in our lexical field, although deriving from widely different sources, all follow the same semantic path.7 As there are no given borderlines between the stages of progression, the proportions between the constitutive semantic elements of the words change as time goes by. Figure 1 is a schematic and greatly simplified representation that shows their development in the positive-negative dimension. The adjectives are seen to move from left to right in the semantic spectrum. At the beginning of their career, the negative semantic elements prevail, but with time positive elements grow in relative importance until they are totally predominant. Different adjectives represent different stages of this development.
[Figure: schematic with two regions labelled Positive (‘great’, ‘impressive’) and Negative (‘frightening’, ‘bad’)]
Figure 1: Development of “frightening” adjectives in a positive-negative dimension
5. Conclusion
There is a great deal of dynamism and regularity in the group of ‘frightening’ adjectives. Many words have come together in the common sense of ‘causing fear’. The ‘frightening’ sense is a common starting-point, and a necessary one for the subsequent development. The early stages of this change, the ‘frightening’ element, may remain as part of the word’s semantic set-up throughout its development, as in awe-inspiring, or they may fade away through the process of semantic bleaching, as in terrific. It seems probable that adjectives that have only covered part of the stretch will eventually acquire the sense of ‘impressive’. How long that will take will vary with the individual words, as new words meaning ‘frightening’ are likely to come into use, like the comparatively recent hairy or creepy. But that words meaning ‘frightening’ will develop in the direction of
‘impressive, overwhelming’ is as probable as that Stern’s words meaning ‘rapidly’ developed into words meaning ‘immediately’.
Notes
1
OED, s.v. Terrible 1: “Exciting or fitted to excite terror; such as to inspire great fear or dread; frightful, dreadful.” OED, s.v. Terrific 1: “Causing terror, terrifying; fitted to terrify; dreadful, terrible, frightful.”
2
As in the quotes from the CobuildDirect Corpus: “my problem is that er I make it sound as though place name’s an absolutely awful place and er place name is not an absolutely awful place.” “This is real Lawrence of Arabia country, an awesome place of shimmering sands described in his Revolt in the Desert.”
3
OED, s.v. Awful 1: Awe-inspiring. OED, s.v. Awesome 2: Inspiring awe
4
“[O]ld and new meanings typically coexist in the same text [...] original meanings tend to persist so that no pure synonyms develop” (Traugott and Dasher 2002: 280).
5
“In a less extreme form, the same tendency to overstatement is responsible for countless hyperbolical expressions in everyday life: awful, dreadful, frightful, terrific, tremendous, abysmal, bottomless, deadly, and many more. The meaning of some of these words has been completely cancelled out by their emotional tone: to speak of a ‘terrific success’, a ‘tremendous welcome’, or of something ‘awfully funny’, is really a contradiction in terms.”
6
Cf. “No lexeme is required to undergo the type of change schematized here ... The hypothesis is that if a lexeme with the appropriate semantics undergoes change, it is probable that the change will be of the type specified.” (Traugott and Dasher 2002: 281)
7
It may be of some interest that Fred Householder argued, even in 1992, against any kind of directionality in semantic change. (Householder 1992).
References
Adamson, S. (2000), “A lovely little example. Word order options and category shift in the premodifying string.”, in: Fischer, Rosenbach and Stein, 39-66. Allén, S. (ed.) (1995-96), Nationalencyklopedins ordbok. Göteborg and Höganäs: Språkdata and Bra Böcker. (NEO) American heritage dictionary, see Soukhanov, 1992.
Bright, W. (ed.) (1992), International Encyclopedia of Linguistics. New York & Oxford: Oxford University Press. CobuildDirect corpus, an on-line service: http://titania.cobuild.collins.co.uk. Fischer, O. (2000), “Grammaticalisation: Unidirectional, non-reversable? The case of to before the infinitive in English.”, in: Fischer, Rosenbach and Stein, 149-169. Fischer, O., A. Rosenbach and D. Stein (eds.) (2000), Pathways of Change: Grammaticalization in English. Amsterdam/Philadelphia: Benjamins. Goitein, S.D. (1965), “Splendid like the brilliant stars”, Journal of Semitic studies, 10: 220-221. Householder, F.W. (1992), “Semantic and lexical change.”, in: Bright: 3: 387-389. Lass, R. (2000), “Remarks on (uni)directionality.”, in: Fischer, Rosenbach and Stein, 207-227. Maxidico = Domas, J. (ed.) 1996. Le Maxidico. Dictionnaire encyclopédique de la langue française. Éditions de la Connaissance. NEO, see Allén (1995-96). OALDCE = Wehmeier, S. (ed.) (2000), Oxford advanced learner’s dictionary of current English. 6th ed. Oxford: Oxford University Press. OED = Simpson, J.A., and E.S.C. Weiner (eds.) (1989), The Oxford English dictionary, 2nd ed., online version. Oxford: Clarendon. Saussure, F. de (1922 [1915]), Cours de linguistique générale. 2nd ed. Lausanne & Paris: Bally & Sechehaye. Snaith, J.G. (1993), The Song of Songs: based on the revised standard version. London: Marshall Pickering. Soukhanov, A.H. (ed.) (1992), The American heritage dictionary. 3rd ed. Boston and New York: Houghton Mifflin. Stern, G. (1921), Swift, swiftly, and their synonyms. Göteborg: Wettergren & Kerber. Stern, G. (1964 [1931]), Meaning and change of meaning. Bloomington: Indiana University Press. Originally published as Göteborgs högskolas årsskrift. Traugott, E. Closs (1990), “From less to more situated in language: the unidirectionality of semantic change.”, in: Adamson, S., V. Law, N. Vincent and S. Wright (eds.). 
Papers from the 5th International Conference on English Historical Linguistics. Amsterdam/Philadelphia: Benjamins, 496-517. Traugott, E. Closs, and R.B. Dasher (2002), Regularity in Semantic Change. Cambridge: Cambridge University Press. Ullmann, S. (1962), Semantics. An introduction to the science of meaning. Oxford: Blackwell.
Appendix 1: Positive, negative and neutral nominal heads
Positive: achievement, advantage, boost, clarity, effort, energy, force, friends, fun, originality, performance, potential, power, quality, relief, responsibility, stability, strength, success, support, talent, value
Neutral: act, actions, admission, adventure, agenda, amount, announcement, anticipation, array, arsenal, aspect, attitude, barrier, behaviour, canyon, catalogue, challenge, change, character, claim, climax, colour, comeback, conclusion, condition, consequences, contrast, day, decline, defence, degree, development, difference, display, dream, drop, effect, events, evidence, example, exercise, experience, faces, fact, feeling, figure, film, foursome, frequency, game, group, guy, headlines, horse, idea, ---
Negative: accident, allegation, assault, attack, blow, bore, bully, burden, burns, car crash, cloud, conflicts, cost, crash, crawlies, crime, cruelty, death, disease, error, fall, foe, indictment, injury, loss, mess, mistake, monster, murder, nightmare, obstacle, opponent, opposition, ordeal, pain, problem, rival, sex attack, shame, shock, shortage, slump, strain, threat, tragedy, trouble, warning, waste, violence
Appendix 2: Distribution of adjectives over positive, negative and neutral nominal heads
Columns: TYPES (Pos., Neutral, Neg.); TOKENS (Pos., Neutral, Neg.); % TOKENS (Pos., Neutral, Neg.)
Alarming
18
2
103
7
0.0
93.6
6.4
Appalling
15
5
103
27
0.0
79.2
20.8
0.0
100.0
0.0
53
2
37.5
60.2
2.3
Awe-inspiring Awesome
1 5
14
2 1
33
Awful
14
6
860
34
0.0
96.2
3.8
Creepy
3
1
8
14
0.0
36.4
63.6
12
8
107
45
0.0
70.4
29.6
Dreadful
Fearful
8
Fearsome Formidable
2
Frightening
16
4
1
12
6
19 3
Frightful
0.0
100.0
0.0
13
2
0.0
86.7
13.3
75
41
10.8
57.7
31.5
1
110
3
0.0
97.3
2.7
3
6
7
0.0
46.2
53.8
0.0
100.0
0.0
28
37
0.0
43.1
56.9
147
18
2.4
87.0
10.7
14
Hairy
3
Horrendous
9
11
16
3
Horrific
9
11
50
102
0.0
32.9
67.1
Horrifying
7
4
18
8
0.0
69.2
30.8
11
3
37
11
Horrible
1
Ominous Redoubtable
-
7 4
-
0.0
77.1
22.9
0.0
0.0
0.0
Scary
15
1
48
4
0.0
92.3
7.7
Shocking
15
5
75
14
0.0
84.3
15.7
8.7
91.3
0.0
276
0.0
53.0
47.0
Startling
2
18
Terrible Terrific
8 7
Terrifying
6 12
13
31
15
5
63 311 85
26.7
73.3
0.0
70
24
0.0
74.5
25.5
Tremendous
12
6
2
166
194
30
42.6
49.7
7.7
TOTAL
28
269
91
254
2589
706
7.2
73.0
19.9
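The % TOKENS columns appear to be each token count divided by the row's total tokens, rounded to one decimal place. The following is a minimal sketch under that assumption; the function name `token_percentages` is illustrative and does not come from the paper, and the two rows are the Alarming and Tremendous figures as they can be read from the table above, with a "-" cell taken as 0.

```python
def token_percentages(pos, neutral, neg):
    """Return each token count as a percentage of the row total, to one decimal."""
    total = pos + neutral + neg
    if total == 0:
        # A row with no tokens at all: report 0.0 throughout instead of dividing by zero.
        return (0.0, 0.0, 0.0)
    return tuple(round(100 * n / total, 1) for n in (pos, neutral, neg))


# Token counts (Pos., Neutral, Neg.) as read from Appendix 2; "-" is treated as 0.
rows = {
    "Alarming": (0, 103, 7),
    "Tremendous": (166, 194, 30),
}

for adjective, counts in rows.items():
    print(adjective, token_percentages(*counts))
```

Under this reading the computed values agree with the printed percentages for both rows (0.0, 93.6, 6.4 and 42.6, 49.7, 7.7), which is also a convenient check when re-keying the table.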
Appendix 3: Semantic progression of adjectives
Adjective | ‘frightening’ | ‘very bad’ | ‘v great/big/large’ | ‘impressive’, ‘overwhelming’
 | Negative | Negative | Neut./pos. | Positive
creepy | tale | dope | - | -
hairy | moments | old boats | - | -
ominous | clouds | news | - | -
scary | film | prospect | - | -
alarming | experience | effect | frequency | -
appalling | crime | behaviour | increase | -
dreadful | disease | situation | noise | -
fearful | wrath | racket | energy | -
fearsome | attack | reputation | pace | -
frightening | story | football | speed | -
frightful | catastrophes | mess | lot | -
horrendous | injury | mistake | number | -
horrible | crime | embarrassment | road toll | -
horrific | murder | fall | traffic problem | -
horrifying | violence | moment | kick | -
shocking | picture | waste | speed | -
terrible | accident | loss | cost | -
terrifying | violence | addiction | proportions | -
awe-inspiring | Civil Guards | loss | wingspan | beauty
awesome | task | effect | display |
awful | tragedy | mistake | lot | majesty
formidable | threat | problem | energy | intellect
unnerving | movie | habit | concentration | performance
redoubtable | fighter | - | sceptic | larynx
startling | - | awkwardnesses | contrast | originality
tremendous | - | problem | delight | achievement
terrific | - | - | pace | performance
disgrace
Global English – Global Corpora: Report on a panel discussion at the 28th ICAME conference
Marianne Hundt
Heidelberg University
1. Introduction
At the 28th ICAME conference, a panel discussion was held on the role of corpus linguistics in the study of English as a global language. The panel members were: Pam Peters, Joybrato Mukherjee and Anna Mauranen. The panel was chaired by Marianne Hundt. The topics to be covered were (a) English as an international lingua franca (EIL), (b) the question of ‘ownership’ or who to count as a native speaker, and (c) norms for global English. Since both the title of the panel and the topic areas were rather broad, we decided to focus the discussion by introducing provocative statements on the topic areas. The chair passed the following statements on to the panel members:
1. Corpus linguistics will enable us to describe the international core of English, namely those features that are shared by all L2 varieties of English.
2. One of the core requirements for inclusion in the International Corpus of English (ICE) is that the authors and speakers of the texts were educated through the medium of English – thus ‘English-medium education’ and ‘long-term residence’ have replaced the criterion of ‘nativeness’.
3. With its focus on ‘standard English’ (especially varieties of English as L1), corpus linguistics has (often involuntarily) fed into the ‘standard ideology’.
The idea for the panel discussion was to combine theoretical issues concerning ‘Global English’ with the methodological angle of corpus linguistics. Questions for discussion included: How do our methodological decisions influence our results? How does linguistic theory guide us in our methodological decision making? Do we have the ‘right’ corpora for studying global English? The panel opened with short ‘position statements’ from the panel members. Each of them focussed on a different topic area. The discussion that followed centred mainly on one point: the variety status of English as a lingua franca (ELF) and the norms that might apply to it. Furthermore, and as Anna Mauranen had predicted in her position statement, it was at times a rather emotional discussion. In this report of the panel, the position statements of the panel members are presented first; they were written by the panel members themselves. The summary of the ensuing discussion is based on the notes that David Minugh took
at the time. The names of the participants in the discussion are not mentioned although some statements may come close to verbatim passages in the original discussion.
2. Position statements
2.1 Pam Peters (Macquarie University, Sydney): The ICE corpora and Global English
Q. Do we have the “right” corpora for studying global English? How far do the ICE corpora go in meeting our research needs? A. In a nutshell, only part of the way. The ICE project is remarkable in many ways, providing a larger view of world English than any corpus project before it. It does nevertheless constrain or frame our view of world Englishes in at least two ways. With their fixed size (1 million words, half spoken discourse/half written discourse, and multiple subcategories of each), the ICE corpora inevitably provide only limited coverage of each variety, and a somewhat arbitrary range of lexis, morphology and syntactic constructions. Even high frequency polysemous words may not present identical sets of uses, especially in L2 varieties of English. For example, some uses of until in Singapore English are slightly different from those of international written English, particularly in situation-dependent discourse such as: (1) I waited until I (was) angry; luckily my turn came ten minutes later. Here the wait of the main clause continues all through the until clause, whereas in standard English the until-clause marks the point at which the main clause action ceases. Yet among 200 examples of until in Singapore ICE, there is only one example of this usage, in a rather fractured conversation. Since this probably reflects the Chinese aspectual particle dao, it is of particular interest as an example of the way in which substrate languages may impinge on outer-circle varieties of English. The subtler semantic developments in new Englishes may not emerge from the smallish amounts of interactive discourse in ICE corpora, even if straightforward loans such as the discourse particle lah are represented well enough in the data. The set of Englishes included in ICE is still limited. While it includes quite a few of those based on British English (e.g. 
Australian, New Zealand, Indian, Hong Kong English), there is only Philippine English to represent those based on American English. New ICE projects for the Bahamas, Fiji and Sri Lanka will extend the range, but the ICE network remains much more a coverage of Commonwealth Englishes than of “global English” per se. Without ICE-US and indeed ICE-Canada we still lack key reference points in world English, and the means of comparing the interplay of millennial British and American English on other inner and outer circle varieties of English. Their relative impacts on
expanding circle varieties such as Japanese, Chinese and Thai English could also be more effectively researched were there an ICE-US available alongside the other ICE corpora.
2.2 Joybrato Mukherjee (University of Gießen): Corpus linguistics and linguistic ownership
In an often-quoted programmatic statement, Widdowson (1994) forcefully argues that in the light of the global spread of English, it is no longer native speakers alone who can claim ownership of the English language:
How English develops in the world is no business whatever of native speakers in England, the United States, or anywhere else. They have no say in the matter, no right to intervene or pass judgment. They are irrelevant. […] It is a matter of considerable pride and satisfaction for native speakers of English that their language is an international means of communication. But the point is that it is only international to the extent that it is not their language. It is not a possession which they lease out to others, while still retaining the freehold. Other people actually own it. (Widdowson 1994: 385)
Now, it is true that there are many more non-native than native speakers of English today – in this particular sense, it is obvious that English as a truly global language is no longer exclusively bound to native-speaker communities and their socio-cultural contexts. More specifically, it is generally accepted today that institutionalised second-language varieties around the world such as New Englishes in the Caribbean, in Africa, in Asia and in the Pacific region are norm-developing varieties in their own right that are – to some extent, at least – independent of exonormative standards set by native speakers. It should not go unmentioned, however, that even in well-established English as a Second Language (ESL) communities such as India, one typically observes what Kachru (passim) has repeatedly called ‘linguistic schizophrenia’. D’souza (1997) describes linguistic schizophrenia in the Indian context as follows:
We use English as if it belongs to us but the minute this is brought to our attention we get into a flap and say this is not our language. (D’souza 1997: 95)
Even in India, then, Widdowson’s (1994) position is not entirely reflected by local users’ attitude towards the English language: ownership does not seem to be an all-or-nothing attribute. What is more, the simple fact that one uses the English language regularly and competently does not automatically mean that one also feels one is the owner of the language. Using and owning a language are clearly two different things.
In my statement, I would like to concentrate on the increasing use of English as a lingua franca in intercultural communication by non-native speakers when communicating with other native and non-native speakers. Picking up on Widdowson’s (1994) stance, Seidlhofer (2001) has been in the vanguard of claiming linguistic ownership of English for everyone who uses English as a lingua franca. She writes that ELF speakers are usually not [...] concerned with emulating the way native speakers use their mother tongue within their own communities, nor with socio-psychological and ideological meta-level discussions. Instead, the central concerns for this domain are efficiency, relevance and economy in language learning and language use. (Seidlhofer 2001: 141) It is certainly true that ELF is part of linguistic reality – Seidlhofer (2001) is right in criticising that ELF has been a ‘conceptual gap’ for too long. Once we accept the existence of ELF as an integral part of global English, it is self-evident that this very kind of English needs to be described on the basis of solid data. It is, thus, a very welcome development that various corpus projects – including Seidlhofer and Jenkins’s VOICE project and Anna Mauranen’s ELFA project – have been launched. They will provide us with a comprehensive picture of what ELF actually looks like and what happens in ELF communication. What bothers me is not that ELF corpora are being compiled and analysed – quite the contrary. However, I have a niggling worry that by creating ELF corpora ELF is posited as a well-defined variety of English – which, in my view, it is not. ELF is an umbrella term for a multitude of variants, including all kinds of variants that we find in different learners with different L1 backgrounds and at various competence levels. ELF is a conglomerate of variants, but it is not a variety. What makes a variant – or a set of variants – a variety? 
Nayar (1998) offers a list of ten linguistic, sociolinguistic, political and other features that are characteristic of a variety. At the risk of some gross over-simplification, I have noted down on the right-hand side whether or not the features can be found in ELF:
Linguistic features
1. Identifiably distinct formal features            ?
2. Internal consistency and systematicity           –
3. Lectal range to accommodate variation            –
Sociolinguistic features
4. Ethnolinguistic vitality                         ?
5. Distinctive cultural attributes and pragmatics   –
6. Standardisability and codifiability              –
Political features
7. International acceptance                         ?
8. Socio-political identity                         –
Other (desirable) features
9. Indigenous literature                            –
10. Distinct pragmatics                             ?
(List from Nayar 1998: 285)
As for linguistic features, it is possible to describe formal features of ELF in an ELF corpus – but whether they are sufficiently distinct is a different matter. The level of distinctness is presumably very low because ELF includes variants of speakers with all kinds of L1 backgrounds. There is no internal consistency and systematicity – apart from high-frequency deviances from native norms, which we would traditionally refer to as learner errors. There is not so much a lectal range but, more importantly, a range of different levels of competence. With regard to sociolinguistic features, we get a very similar picture: while we could argue that ELF is ethnolinguistically vital in that it provides a communicative vehicle for intercultural communication, ELF as such is, by definition, independent of any specific culture, distinctive cultural attributes and pragmatics. I cannot see how ELF could develop its own standard and how it could be codified as a well-defined variety. The international acceptance of ELF is a disputed issue, but clearly, ELF has no specific socio-political identity and no indigenous literature (can ELF be truly indigenous in the first place?). It is difficult to conceive of any distinct ELF-specific pragmatics; I would assume that ELF pragmatics is, at best, a convergence of the pragmatic systems of the cultures that are linked via ELF. The overall picture that emerges from this characterisation of ELF is that it is not a variety with which anyone actively and positively identifies himself or herself, that it is a makeshift code that is used to overcome language barriers in intercultural communication, that it is not bound to any specific culture and that, consequently, it is not ‘owned’, as it were, by anyone. ELF is a communicative epiphenomenon. The existence of ELF corpora should not lead us to believe that ELF is a variety of English – although it seems to be an attractive mainstream position at the moment. 
Note in this context that the same holds true for what has been labelled ‘Euro-English’. Mollin (2006) shows that Euro-English is, by and large, a fata morgana – true, English is used in Europe as a lingua franca, but there is no such thing as a Euro-English variety. What is more, Mollin’s (2006) results seem to drag the skeleton from the closet of many advocates of ELF-based models of English – the native speaker: New standards need to be standards in the mind, too. Ideally, the speakers sampled in the [Euro-English] corpus should thus be asked whether they consider features which have emerged in the corpus to be potential markers of the new variety as correct, and whether they would use these themselves. [...] The results of both direct attitude elicitation parts and acceptability tests on supposedly Euro-English sentences, however, have demonstrated that the standard that
European speakers follow and wish to follow is that of native speakers. (Mair and Mollin 2007: 347)
This is where it all comes full circle: it seems that native speakers have a say in the matter – because non-native speakers want them to. Non-native European users of English are a significant part of the lingua-franca users of English worldwide – Mollin’s (2006) study might thus have wider implications for ELF in general. As other studies show, many ELF speakers are oriented towards native-speaker norms and they do not want to learn and use a reduced variant of English that is still more or less intelligible or, as Jennifer Jenkins (passim) would put it, ‘communicatively successful’ despite its deviances from native-like usage. There is no point in ignoring the fact that the native speaker remains a relevant reference point for ELF speakers and learners of English. Thus, I cannot see why it should be useful to describe an international core of English across all ENL and ESL varieties and the myriad of variants of English that we subsume under ELF. The concept of a common core is a very useful one, but it should only be based on – and abstracted away from – full-fledged varieties of English.
2.3 Anna Mauranen (Helsinki University): English as a Lingua Franca
Corpus linguistics is an excellent means for discovering what L2 Englishes have in common – or, indeed, what all Englishes have in common, and where varieties differ. It is hard to think of serious alternatives to corpora for answering such questions. Even though corpora, for obvious reasons, have been heavily dominated by first language use and standard English, we can now move on and accept that L2 speakers constitute an important group of users who are different from ‘learners’. L2 speakers outnumber L1 speakers by about four to one these days, which means that we live in interesting times of potentially rapid changes in English. Large numbers of people use English for a wide range of purposes; many use it regularly in contexts which are important parts of their lives. Even though English is the medium of communication, the context is more often than not transcultural, and the location outside English-speaking countries. English is used as a global lingua franca – but English corpus linguistics is only beginning to take this development on board.
The brief for this panel was to discuss the international core of ELF, the ownership of English, i.e. the status of the native speaker, and the norms for global English. It seems to me that if there is a common core to lingua franca English it can most reliably be discovered by exploring relevant corpus data; but the existence of such a core is an empirical question. The ownership of English is a trickier issue, but at this point suffice it to say that the ownership cannot be limited to those who were born with a given language, because our relationships to the languages we encounter and acquire throughout our lives are prone to change: a new language can become more important than our first language. These changes can be radical and unexpected, especially in today’s globalised and unstable world. Even so, there is every reason to respect the special relationship people have with their first languages.
The question of norms and global English tends to unleash emotions. Some people seem to think that if any concessions are made to the legitimacy of global English, all standards will go down the drain, no norms will be respected and soon communication between different Englishes is going to be impossible. This is a sad picture, and a dire motive for holding on to a native speaker norm.
In the most basic sense, norms define what is normal. They are inherently evaluative, and they exert a powerful influence on people’s behaviour. We can roughly distinguish two kinds of linguistic norms: those which are prescribed and those which arise spontaneously. The first kind, norms which prescribe good usage, are institutionalized and sanctioned in many ways, largely through educational systems and normative reference works. We might call these imposed norms because they are sanctioned by authorities ‘higher’ than ordinary speakers. The second type, which can be called natural norms, originate in the self-regulation of speech communities or communities of practice. No institutional body controls them, and they can deviate considerably from standard language norms. Basically these are norms of use, emergent, uncodified, and a good deal more elusive than fixed standards. Natural norms tend to be receptive to innovations, and insofar as the innovations gain wide acceptance, they result in general language change and eventually find their way to the standard. What the two kinds of norm have in common is an interest in ensuring efficient and effective communication; this is why any community regulates their language use. There is an inevitable tension between actual usage and the imposed standard. But this tension keeps within comfortable limits if the standard gets updated often enough and the updates are informed by changes in use – good corpora are invaluable for judgments of what to treat as a norm.
But how is the norm related to the native speaker? The native speaker is a problematic concept in that it is used to refer to both the ideal native speaker of certain linguistic theories and real-world native speakers, but the distinction is not always kept clear. Corpus linguistics is of course interested in the reality of language use. In the real world, not all native speakers are equally exemplary users of their language, certainly not equally good in all domains of use: while some may be good at giving public talks, and others at writing in an entertaining way, some excel in research writing, others again are fun to chat with. Some skills and genres are more highly valued in the linguistic market than others, and in compiling a norm-informing database we need to assess which genres and what uses we judge as worth including. Although for the non-native speaker it is ‘the native speaker’ that is held up wholesale as a desirable model, it is clear that this makes no sense at all for native speakers. What we need in a norm-informing corpus are instances of ‘good usage’, for example ‘educated English’ or some other limited section of the language, whether broadly or narrowly defined. If the native speaker is not an appropriate basis for an imposed norm in a native language community, is it really any more appropriate for non-natives? I would like to argue that it makes no more sense to define a standard for non-
natives by simply pointing to a group of speakers who have the target language as a mother tongue than it would be for native speakers; a standard must be based on some model of good usage. But good usage need not be limited to native speakers; it ought to be independent of the speaker’s first language, as long as the usage of the target meets the criteria set for it. Non-native standards do not have to be any slacker than native standards, but they must be different because they apply to a different social and cultural context of use. The natural norm is a less sensitive issue than the imposed norm. Natural norms arise in the self-regulating mechanisms that any speech community possesses: what features a speech community adopts, tolerates or rejects. Natural norms are of descriptive and theoretical interest to linguists, because they are manifest in language variation, in non-standard use, in New Englishes – and in ELF. This is where ELF really comes to its own; whether we want to speculate on a need for a world standard or a general ELF standard is not decisive for a scholarly interest in ELF. ELF speaking communities may not be regarded as speech communities in the ordinary sense, since for example they are not associated with a locality, but it is certainly true that many communities of practice have adopted ELF as their de facto language, and that the ensuing norms of use are regulated by the participants of those communities. ELF is also the language of wide and diffuse networks of uses and users. To find out how these specific social contexts of use develop and affect the shape of English, we need databases of their authentic language. We already have corpora of New Englishes and learner English, both of which are interesting and valuable in increasing our understanding of English; one exploring nativised varieties, the other tracking the developmental paths of individuals towards a target. 
We need evidence from ELF to provide a missing link of using English in foreign language contexts outside settings where the speakers are positioned as learners. ELF provides an important basis for establishing what might be the necessary features of language – certainly of English – in situations of demanding and sophisticated use when there is no institutionalized basis for an imposed norm. By exploring these different kinds of databases, we can hope to come closer to answering questions on the similarities and differences in these hybrid Englishes, and trace their impact on English as a whole. ELF corpus data is capable of throwing light on mechanisms of language change, directions and patterns of the ways in which features travel in today’s globalised and multilingual world, and on social contexts of use not captured by other corpora. This has wider significance on language theory, as it reflects the unique situation in which virtually all languages in the world are in contact with one language. ELF research enables us to go beyond contact between two or very few languages, and beyond positing first language interference as the major, let alone only, explanatory factor behind deviations from native usage. It can help us understand the nature of emergent norms, and throw light on possible language universals or necessary features of language from a new perspective.
In sum, what I have suggested in this brief statement on what global English means to norms and corpus linguistics is:
(1) For imposed norms, we need to gather information on good usage independently of its origin.
(2) For natural norms we need to include ELF, for description and theoretical models.
3. Discussion
3.1 Accommodation
A question raised by the audience was whether ELF speakers accommodate more than native speakers. Anna Mauranen replied that we need to accommodate all the time when we speak to people with different language skills. She also pointed out that evidence from the ELF corpus compiled at Helsinki University indicates that speakers, in accommodating and in their use of repair sequences, appear to concentrate on content rather than form.
3.2 ELF – description and norms
Various members of the audience were ready to accept that corpus linguists could (and should) describe ELF, but wondered whether we needed norms for it. A widely held opinion was that we must be able to correct student errors, not merely accept them as part of their interlanguage or ELF. To this, Pam Peters replied that phenomena such as reduced morphology are tolerable in an ELF situation, but that classroom assessment cannot allow this – and that writing was “a different ball game altogether”. Similarly, Joybrato Mukherjee rejected the reduction of, for instance, the third person singular present tense –s as a permissible feature of ELF because native speakers would not accept it. To the question from the audience how we should deal with the assessment of student writing and conversation, Joybrato Mukherjee replied that it was necessary to distinguish between describing ELF and teaching it; he attacked the idea of using ELF as an international language between non-native speakers, pointing out that it was not a goal desired by learners. Anna Mauranen, on the other hand, wanted to separate ELF from teaching norms and was less convinced that native-speaker norms truly dominated. A related issue discussed was the enforcement of native-speaker norms in publishing. Members on the panel pointed out how even linguistics journals for English as a world language recommended that non-native speakers have a native speaker edit their texts prior to submission, and that generally, prescriptive norms are often applied in the editing of L1-area journals. To this, a member from the audience contributed his view as a former editor, stressing that he himself concentrated on content but that he’d had copy-editors to back him up who would
focus on the language side of editing. Anna Mauranen remarked that she had consistently avoided such language checking, maybe advocating indirectly that others do the same? A member from the audience pointed out the much more widespread use of English by immigrants, i.e. English as a second language (ESL) rather than ELF. The example given was the use of Eastern European immigrants moving into the UK. The suggestion was that – for those who did not want to fully master the language, an alternative would be to teach domain-specific forms like Business English, Agricultural English, etc. Another colleague pointed out how even native speakers have to ‘learn’ how to use domain-specific varieties, mentioning Eurocrat-speak for grant applications as an example. At this point in the discussion, Antoinette Renouf tried to elicit a North American response to a hitherto European series of descriptions. A colleague from Canada pointed out how accent and phonology were key features of language use in the global context. An American colleague mentioned the difference between migrating to an L1 context and assimilating in two generations (as in North America), and what is currently happening especially in parts of Asia where English is not the L1 and lacks the cultural roots, one example being the use of English in mainland China or English in Africa. On the question of norms and varieties, Marianne Hundt alluded to a statement by John Algeo that all varieties are fictions, but that they are useful fictions, wondering whether ELF, too, was a useful fiction. A member from the audience saw the creation of new norms (e.g. ELF norms) as possibly useful, but also mentioned them as potential channels for oppression. Marianne Hundt, taking up Anna Mauranen’s suggestion that ELF contexts constitute their own ‘communities of practice’, was wondering what the organizational frame would be that held them together. 
Pam Peters suggested that web-based virtual communities might be one example. The challenge from the chair was that even if such communities of practice for ELF existed, ELF lacked an underlying system and therefore did not qualify as a variety of English. Anna Mauranen countered this argument by pointing out that systems change through usage. A member from the audience suggested that speakers can perhaps create joint sub-varieties. Anna Mauranen added that we collected corpora of learner English and did not find that surprising; collecting data of ELF use, she stressed, did not imply that a system called ‘ELF’ existed; an ELF corpus would merely reflect what existed in the world (pointing out the similarities with other dialects of English). A member from the audience pointed out that defending ELF was often seen as rejecting native norms, but that this was a false perception. From this, the discussion moved on to the political aspects in ascribing variety status to a phenomenon such as ELF. A member from the audience said that we were witnessing a shift to a true lingua franca, and the creation of ELF corpora would be a way of recording this shift. The chair pointed out that, if we research a phenomenon, people assume that the phenomenon has an underlying system and that this could have implications for language teaching which might eventually lead to the short-selling of learners.
3.3 Common core English – myth or reality?
To the question as to how real a common core for English was, Pam Peters replied that this was a highly abstract question. She pointed out that even high-frequency items found in corpora are often polysemous across national varieties, so that the notion of the common core may even be a rather elusive one empirically. A member of the audience added to this set of questions by asking whether something like ‘global English’ existed. A common answer to this question used to be that there were global Englishes, and the question now was whether we could expect norm convergence over time (a possible example mentioned was the world-wide-web as a locus where ‘global English’ might be observed as a result of global convergence). The member of the audience suggested that we should be looking at the divergences instead of converging trends and that the ICE corpora provided a good tool for this. Ending on a critical note, he pointed out that one of the problems was that ICE-GB was a corpus of educated London English, rather than “ICE-GB” for all of Great Britain.
4. Concluding remarks
The cautioning remark on ICE-GB (which is actually a sample of educated London English) brings us back to one of the questions raised in the introduction and that was also addressed by Pam Peters in her position statement, namely the question whether we have the right corpora for studying global English. Despite the wide scope of the ICE project, the corpora that we do have so far represent a tiny slice of the range of Englishes spoken and written within the Commonwealth. Obviously, to compile corpora with the coverage of something approximating the BNC is out of the question on a global scale, so one avenue for future research may be to exploit the world-wide-web for corpus building, both to complement some existing ICE corpora and to cover some of the ground that ICE has not covered so far (and is not likely to cover in the near future). The fact that the compilers of ICE-GB ended up compiling a corpus of educated London English rather than a corpus representative of all of Great Britain is closely connected to practical issues in corpus methodology – and we might have to be somewhat more cautious in our interpretation of results obtained from ICE data (not just with respect to the British component, but also – and especially – when working with the other ICE components).1 Coming back to the initial statements, we may conclude that (a) we are still a far cry from being able to describe the international core of English and might never actually reach that goal; (b) the question of ‘ownership’ is still a controversial one and the panel discussion simply reflects that we are dealing with an unresolved issue; (c) the ‘standard ideology’ was not directly addressed by any of the participants but is an issue that surfaces in the discussion about the status of ELF and norms for teaching.
Marianne Hundt
Notes
1. On a somewhat critical note: more detailed documentation than the existing manuals is needed. The ‘detail’ that we are missing so far is information on the compilation process and the decisions taken along that road – this kind of information would enable the corpus linguistic community to be more cautious in their interpretations of the results.
References
D’souza, J. (1997), “Indian English: some myths, some realities”, English World-Wide 18(1), 91-105.
Mair, C. & S. Mollin (2007), “Getting at the standards behind the standard ideology: what corpora can tell us about linguistic norms”, in: S. Volk-Birke and J. Lippert (eds.) Anglistentag 2006 Halle: Proceedings, Trier: WVT, 341-353.
Mollin, S. (2006), Euro-English: Assessing Variety Status. Tübingen: Gunter Narr.
Nayar, P.B. (1998), “Variants and varieties of English: dialectology or linguistic politics?”, in: H. Lindquist, S. Klintborg, M. Levin & M. Estling (eds.) The Major Varieties of English: Papers from MAVEN 97, Växjö 20-22 November 1997, Växjö: Växjö University, 283-289.
Seidlhofer, B. (2001), “Closing a conceptual gap: the case for a description of English as a lingua franca”, International Journal of Applied Linguistics 11, 133-158.
Widdowson, H.G. (1994), “The ownership of English”, TESOL Quarterly 28(2), 377-389.