Perspectives on Formulaic Language: Acquisition and Communication
Edited by David Wood
Continuum International Publishing Group
The Tower Building, 11 York Road, London SE1 7NX
80 Maiden Lane, Suite 704, New York, NY 10038
www.continuumbooks.com

© David Wood and contributors 2010

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage or retrieval system, without prior permission in writing from the publishers.

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

ISBN: 978-1-4411-5047-9 (Hardback)

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress.

Typeset by Newgen Imaging Systems Pvt Ltd, Chennai, India
Printed and bound in Great Britain by the MPG Books Group
Contents
Notes on Contributors
Acknowledgements

1. Formulaicity and Usage-based Language: Linguistic, Psycholinguistic and Acquisitional Manifestations
   Regina Weinert

Part 1: Formulaic Language in Acquisition and Pedagogy

2. The Development of Collocation Use in Academic Texts by Advanced L2 Learners: A Multiple Case Study Approach
   Jie Li and Norbert Schmitt
3. Idiomatically Speaking: Effects of Task Variation on Formulaic Language in Highly Proficient Users of L2 French and Spanish
   Fanny Forsberg and Lars Fant
4. Effectiveness of Text Memorization in EFL Learning of Chinese Students
   Zhenqiong Dai and Yanren Ding
5. Lexical Clusters in an EAP Textbook Corpus
   David Wood
6. An Investigation of Lexical Bundles in ESP Textbooks and Electrical Engineering Introductory Textbooks
   Lin Chen

Part 2: Identification and Psycholinguistic Processing of Formulaic Language

7. Formulaicity in Code-switching: Criteria for Identifying Formulaic Sequences
   Kazuhiko Namba
8. Holistic Processing of Regular Four-word Sequences: A Behavioural and ERP Study of the Effects of Structure, Frequency, and Probability on Immediate Free Recall
   Antoine Tremblay and Harald Baayen
9. The Phonology of Formulaic Sequences: A Review
   Phoebe Ming Sum Lin
10. Processing MWUs: Are MWU Subtypes Psycholinguistically Real?
   Georgie Columbus

Part 3: Communicative Functions of Formulaic Language

11. A Text in Speech’s Clothing: Discovering Specific Functions of Formulaic Expressions in Beowulf and Blogs
   Matt Garley, Benjamin Slade, and Marina Terkourafi
12. The Semantic Structure of Arabic Idioms
   Ashraf Abdou
13. Formulaicity and Translation: A Cross-corpora Analysis of English Formulaic Binomials and Their Italian Translations
   Salvatore Giammarresi

Index
Notes on Contributors
Ashraf Abdou has obtained degrees in Arabic language and Islamic studies and in Linguistics from Cairo University. He also has an MA in Teaching Arabic as a Foreign Language from The American University in Cairo, and a PhD in Linguistics from the University of Manchester, where his dissertation was a corpus-based study of Arabic idioms. His teaching experience includes Arabic linguistics and Arabic as a foreign language at these three universities.

Harald Baayen is a Professor at the Department of Linguistics at the University of Alberta. His research interests include quantitative linguistics, lexical statistics, exploratory data analysis, stylometry, mixed-effects modeling, morphology and morphological processing.

Lin Chen worked as an electrical engineer for eight years in China before beginning her graduate studies in applied linguistics at Carleton University, where she is now a lecturer in English for Academic Purposes. Her research interests include formulaic language, corpus linguistics, and discourse analysis.

Georgie Columbus is a phraseologist working in corpus linguistics and psycholinguistics. Her secondary interests lie in the variation in discourse markers between English varieties. Georgie is currently researching at the University of Alberta, Canada, on the processing of multiword units in native and non-native speakers.

Yanren Ding is Professor of English, School of Foreign Studies, Nanjing University. His research interests include second language acquisition, discourse analysis and language teaching methodology.

Lars Fant is a Professor in the Department of Spanish and Portuguese at Stockholm University. He has taught Romance languages in a wide variety of contexts, and has researched and published on many aspects of second language use including cross-cultural communication, discourse analysis and pragmatics, and formulaic language.

Zhenqiong Dai is an English instructor, School of Foreign Studies, Nanjing University. Her research and teaching interests centre around formulaic language and the role of memorization in language acquisition. Her article in this volume, co-authored with Yanren Ding, is based on her graduate work.

Fanny Forsberg is a researcher and lecturer in French at Stockholm University’s Department of French, Italian and Classical Languages. She has researched and published extensively on formulaic language and high-level proficiency in second language use.

Matt Garley is a PhD candidate in Linguistics at the University of Illinois at Urbana-Champaign, where he serves as editor of Studies in the Linguistic Sciences. His research interests include sociolinguistics, language contact, the language of hip-hop, and computer-mediated communication. He is currently planning a dissertation project on linguistic borrowings and the construction of identity in the German hip-hop fan community.

Salvatore Giammarresi holds a PhD in Synchronic and Diachronic Linguistics from the University of Palermo, Italy. He is the site creator and moderator of the Formulaic Language Research Network (www.eflarn.ning.com). His academic interests include formulaicity, translation technology, translation theory and localization. He is a lecturer at the University of Palermo, Italy, teaching localization, computer assisted translation tools and global marketing. Professionally, Salvatore is currently serving as the Vice President of Products at a web-based company in San Francisco.

Jie Li is a PhD student at the University of Nottingham. Her academic interests include vocabulary learning and teaching, formulaic language, corpus linguistics, and foreign/second language acquisition. Her thesis focuses on second language (L2) learner acquisition and use of formulaic language in academic writing. She has recently published an article on Chinese advanced L2 learner’s acquisition process of lexical phrases in the Journal of Second Language Writing.

Phoebe Ming Sum Lin is a researcher at the School of English Studies of the University of Nottingham, UK. At the time of writing, she is completing a large study which provides a comprehensive description of the prosody of formulaic sequences. Her recent publications explore formulaic language from the perspectives of phonology, corpus linguistics and second language teaching and learning. Her wider research interests include formulaic language, intonation, corpus linguistics and spoken English.

Kazuhiko Namba is Associate Professor of Applied Linguistics at Kyoto Sangyo University. His main research interest is the role of formulaic language in bilingual children’s language acquisition and the structural aspects of code-switching. He acquired his PhD and MA at Cardiff University. He taught English in Japanese secondary schools for over 20 years and is raising his two children as English-Japanese bilinguals.

Norbert Schmitt is Professor of Applied Linguistics at the University of Nottingham. He is interested in all aspects of second language vocabulary, including vocabulary acquisition, formulaic language, vocabulary testing, the relationship between reading and listening and vocabulary learning, and implicit and explicit vocabulary knowledge. He has most recently completed a vocabulary research manual, to be published by Palgrave Press.

Benjamin Slade is a PhD student at the University of Illinois, currently preparing a dissertation on the history of Sinhala interrogative constructions under the direction of Hans Henrich Hock. His earlier work was on the development of do-support in English and the evolution of Indo-Aryan compound verbs. Forthcoming work includes a study of dragon-slaying formulae in early Indo-European, to appear in Historische Sprachforschung.

Antoine Tremblay obtained his BA in Hispanic Studies and MA in Spanish Morphology from Laval University, Quebec, Canada, and his PhD in Psycholinguistics from the University of Alberta, Edmonton, Canada. He is currently a postdoctoral fellow at Georgetown University Medical Center, Washington DC, in the Department of Neuroscience. His research focuses on the processing of compositional multi-word sequences in the auditory and visual modalities using behavioral and brain imaging methods.

Marina Terkourafi is Assistant Professor in Linguistics, University of Illinois at Urbana-Champaign. She has research interests in post-Gricean pragmatics, theories of (im)politeness, language contact and change, language and ideology, and construction grammar(s). Her work in these areas has appeared in journals such as Cognition & Emotion, Diachronica, Journal of Historical Pragmatics, Journal of Pragmatics, The Journal of Politeness Research, and Journal of Greek Linguistics, as well as in edited collections.

Regina Weinert is Reader in Germanic Linguistics at the University of Sheffield. Her main research interests are syntax and discourse-pragmatics, especially of spoken language, and usage-based approaches to language and language acquisition. Her work includes the analysis of clauses, clause complexes, focusing constructions, particles and pronouns, with an emphasis on German and English.
Acknowledgements
The editor wishes to acknowledge the support and inspiration of colleagues, mentors and friends in the development and realization of this project. Sincere thanks and appreciation go to Ito Harumi and the faculty in the Department of English at Naruto University of Education, in Tokushima, Japan. As well, special thanks to colleagues at Carleton University – Ellen Cray, Randall Gess. Mentors from the past, Mari Wesche and T. S. Paribakht at the University of Ottawa, also deserve lasting gratitude. Those who have contributed to the building of the study of formulaic language are numerous, but Alison Wray, John Sinclair, A. P. Cowie, Andrew Pawley and Norbert Schmitt are among those who have made history. Additional thanks to the positive constants in my life, Jeremy Chee, Beryl Wood, and Deborah Wood-Salter and Donald Wood and their families. A round of applause goes to the contributors to this volume, who have done so much to further our knowledge of formulaic language and its vital role in language acquisition and in communication.
Chapter 1
Formulaicity and Usage-based Language: Linguistic, Psycholinguistic and Acquisitional Manifestations
Regina Weinert
University of Sheffield
Introduction

A single chapter cannot do justice to the myriad approaches and the detail of studies into formulaic language and usage-based accounts of language and language acquisition, which is a sure sign of a flourishing field of enquiry. The various strands also speak for themselves in this volume. What a synthetic chapter can demonstrate is that formulaicity and the close relationship between language use and language representation are now central concerns in the study of language, evident in a cluster of book-length publications and dozens of articles in the last ten years. Questions which I raised in Weinert (1995) and associated methodological challenges are being tackled vigorously with a wide range of tools, while some answers remain elusive. My own work has since focused on the syntax and discourse-pragmatics of spoken language and the implications for linguistic theory, arguing for a usage-based approach. This chapter sets out some of the key arguments and findings in work on formulaic and usage-based language as well as highlighting issues for further research.
Formulaic Language and Usage-based Language

As a non-technical term, the English word formulaic connotes a lack of originality and stasis in cultural or language expression, witness three recent examples from online reviews of literature, music and film: Renowned author Dan Brown staggered through his formulaic opening sentence; Kronos becoming formulaic; Mummy 3 was formulaic, corny, and predictable. Similarly, in linguistic
traditions which place generative, syntactic rules at the centre of theories and inquiries, formulaic and also irregular language becomes a marginal or at best separate phenomenon, and rule-governed, created language is accorded a central, and often by implication a high, status, considered to be indicative of the sophistication reached by human language (e.g. Pinker & Prince, 1991; Pinker, 1994). Wray (2002) was the first book-length treatment which gathered phenomena considered formulaic in adult language, first and second language acquisition and in language disorders, adopting an open-door policy and dispelling the notion that they are marginal in native language. Anyone looking for orientation in the field will find this and her subsequent book, Wray (2008), the most comprehensive guides. Wray (2002, p. 9) sets out to shed the baggage and associations of the label formulaic language accumulated through previous linguistic literature and to clear a path through the fifty or more alternative terms. She defines a formulaic sequence as follows: ‘a sequence, continuous or discontinuous, of words or other elements, which is, or appears to be, prefabricated: that is, stored and retrieved whole from memory at the time of use, rather than being subject to generation or analysis by the language grammar.’ Subsequently Wray reserves the term formulaic sequence for her particular, theory-sensitive definition. The term formulaic language reappears as a ‘neutral mass (uncountable) noun’, including in the titles of her books, and formula ‘is used as the neutral count noun’ (Wray, 2008, p. 8). Most recent studies converge on the label formulaic as an umbrella term and refer to specific manifestations of the phenomenon with additional labels. 
These manifestations include (oral) narratives, prayers, proverbs, social routines, noncompositional idioms, (more or less) transparent idioms, collocations, lexical bundles, sentence stems, complex word forms, frequently used sequences of words and clauses, fixed sequences, sequences with open slots which can be filled subject to varying levels of constraints, community-wide sequences and idiosyncratic sequences. Studying formulaic language essentially involves two related tasks, data or corpus-based analysis and (psycholinguistic) experimentation, with both approaches also benefiting from being tied to theoretical argumentation. The first task is to uncover the extent and nature of formulaic language in healthy and impaired adult language use and in L1 and L2 acquisition. Computer searches of corpus data readily reveal the pervasiveness of re-current and co-occurrent word or other unit combinations. Yet selecting relevant sequences for analysis and interpreting and applying such findings requires a model of language and its psycholinguistic verification. The second task then is fraught with difficulties. Accounts of formulaic language do not only point to the
phenomenon as a linguistic one, they assume or suggest that formulaic sequences are processed and produced as wholes, that is as single units, rather than being analysed or generated. This is said not only to apply to non-compositional language such as opaque idioms, but potentially also to sequences which can in principle be analysed. Formulaic language has tended to be examined as an aspect of the lexicon, and the obvious link to the extensive research under the label phraseology has helped to give it a coherent and substantial body of research. Wray (2002, p. 263) proposes the heteromorphic distributed lexicon, which contains different types of formulaic sequences, including sequences which at the level of the magna-language could be represented in terms of more general patterns or rules. Wray (2008, pp. 33–34) states that in her model the notion of formulaic language disappears as such. The heteromorphic lexicon contains simple as well as complex units which may be quite large and novelty lies in their combination, whether this involves many rules or simple selection from a small set of items. As Wray says herself, the specific characteristics of the units are nevertheless of major interest. However, there is a further point. While formulaic sequences are per definition ‘stored and retrieved whole’, they also per definition contain more than one unit and hence the issue of their internal structure arises, of patterns and of boundaries. Psycholinguistic verification is therefore also still necessary. One drawback of placing formulaic language within the lexicon, even if only metaphorically (Wray, 2008, p. 77), is the separation from syntax. While morphology naturally enters the arena through the inclusion of MEUs, that is, morphologically complex words may be formulaic and/or rule-generated, and while much of formulaic language is of limited generalisability, all kinds of combinatory phenomena are being studied within usage-based language models. 
It is in this context that the notion of formulaic language may also well disappear, that is, once a model of language uncovers and validates multiple levels of abstraction and representation. Formulaic language can in principle take its place in usage-based theories of language and is naturally aligned with such models since there is no claim or expectation of maximal analyticity and minimal representation (e.g. Barlow & Kemmer, 2000; Bybee, 2006). Here we get, potentially, simple and complex forms, general rules, rules of limited generalisability, patterns with open slots, fixed expressions, multiple levels of categorisation and representation, particular exemplars, community-wide sequences, and individual language users’ sequences. We get effects of locality, frequency, salience and recency, a relationship between sequentiality and hierarchical constituency, bi-directionality (from wholes to components, from components
to wholes), interaction (rather than a hierarchy) between phonology, lexis, syntax, semantics and pragmatics and we get dynamism. We get what Langacker calls ‘a sea of particularity’ (Langacker, 1987, p. 46). In addition, cognitive, usage-based approaches such as Langacker (2000) not only reflect the immense complexities of linguistic structure, they posit a unified account of cognition built on basic psychological principles, in preference to a dual-mechanism account which acknowledges the pervasiveness of irregularity but still goes exclusively for compositionality and rules wherever patterns can be discerned. Both written and spoken language can be characterized along usage-based lines and formulaicity has been demonstrated for a variety of written texts, yet it is in spontaneous spoken language that performance factors may be seen to relate more directly to linguistic form. Spontaneous spoken language is subject to processing and production constraints which are very different for much of written language. Furthermore, spoken language is primary in humans. The concept of formulaicity and usage-based models therefore become especially pertinent, given that they include both a linguistic as well as a psycholinguistic component. What then is the evidence of formulaic and usage-based language and how can they be characterized?
Memory and Analysis
The cognitive unity of formulaic sequences and the usage-based nature of grammar are explained in terms of a greater memory than processing capacity of the human brain, in terms of size versus speed (e.g. Weismer, 2004). Dąbrowska (2004a) compares the human brain with a desktop computer, the latter processing at a rate a million times faster than the former. The question is, what would be achievable with the human capacity for neurons to fire about 200 times per second, with their vast number and the connections between them? de Bot (1992) calculates that speakers typically have to make a word choice decision two to five times a second from a store of 30 000 words, which may well be a conservative estimate. One way of meeting this cognitive challenge could therefore be pre-assembly and storage in long-term memory. Sensitivity to frequency is considered an indication of memory storage (the more frequent an item is, the easier it is to access). Yet idioms, especially the classic opaque ones such as kick the bucket, are not necessarily frequent (Moon, 1998). Hence particularity and/or irregularity are also criterial – although irregularity is rarely absolute.
This includes broad idiomaticity where a particular function is expressed conventionally by one form rather than the many potential alternatives (the Pawley & Syder ‘puzzle’ (1983)), for example, telling the time as it’s half ten (Scottish English, meaning 10.30). This brings us then to specific methodologies and findings.
Data and Corpus-based Research

Corpus analysis has been used extensively to search the limitations in actual language use, manifest in co-occurrence and re-currence, in the substantial work on phraseology, collocation, lexical bundles and so on (e.g. Sinclair, 1991; Stubbs, 1995; Cowie, 1998). Erman and Warren (2000), much cited, estimate that well over 50 per cent of their spoken and written English data is formulaic as opposed to novel. Studies vary greatly in estimates, mostly as an effect of differences in methodology (see Wray, 2002, pp. 28–31 for a discussion), that is, some search for fixed expressions, others have a very low threshold of frequency and sequence length, others again work with a pre-selected list of sequences. ‘Novel’ does not necessarily mean ‘ungenerated’ since a frequently used sequence could in principle still be assembled with rules. Measurements therefore always have to be accompanied by a detailed analysis of the sequences themselves. Wray points out that we also need functionally tagged corpora in order to compute the ratio of formulaic expressions to the total number of times a particular function has been expressed. This approach is common in cross-cultural pragmatic research, although mostly with elicited rather than naturalistic data. For instance, Vollmer and Olshtain (1989) note that German speakers make use of a range of linguistic options for apologizing, with some set phrases. English speakers appear to operate with a more restricted set. Finally, there is the question of how functions could have been expressed, beyond those which actually occur, but this is tricky to investigate since the options are virtually infinite. This question should remain within our vision, however, so that we do not overestimate the power of generation. 
It would require psycholinguistic experiments, for example, production tasks which deprive speakers of commonly used expressions or comprehension tasks with grammatically possible alternatives which do not commonly occur. Computer searches are revealing interesting patterns within formulaic language, for example, Butler (1998) notes that many fixed formulas are adverbials, and those which allow preceding or following material are often part of nominal or prepositional groups. Garley, Slade and Terkourafi
(this volume) show that formulaic sequences can serve to characterize different text genres. Attention to local effects is central to usage-based approaches and language users’ representation of language may well also be genre-bound. In addition, the need for examining dense corpora of individual speakers as opposed to cross-speaker corpora is becoming an urgent task in order to test usage-based models, especially of adult native usage. Sequences identified by computer searches may ‘intuitively’ not seem formulaic (e.g. and the) and these are often excluded, for valid reasons, from studies which attempt to compare formulaic and non-formulaic sequences and/or which compare native versus non-native use of formulaic sequences (e.g. Underwood, Schmitt, & Galpin, 2004). ‘Intuitively not formulaic’ typically means that there is no corresponding meaning/function attached. Yet if chunking is at least partially a processing phenomenon, and if formulaicity, frequency and particularity influence language, then even sequences which do not appear to be tied to a semantic or pragmatic function may still have unitary status. We should therefore not discount them irrevocably. Raupach (1984) is an early study which indicates that some syntactic stems such as what I mean is, I think that are formulae. Wray (2002) notes the potential formulaicity of similar sequences which do not conform to standard constituents in aphasic language (all around the). What counts as formulaic is therefore an empirical issue which links into the issue of formulaicity of syntax and mental representation more generally, to be discussed further below.
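The kind of computer search discussed in this section can be illustrated with a minimal sketch. The Python below counts contiguous word sequences of a chosen length in a toy text and keeps those that recur at or above a frequency threshold. The function name `recurrent_sequences`, the invented sample sentence and the threshold values are illustrative assumptions, not the procedure of any study cited here; actual corpus work would need principled tokenization and, as argued above, a detailed analysis of the sequences themselves.

```python
from collections import Counter
import re


def recurrent_sequences(text, n=2, min_count=2):
    """Return {n-gram tuple: count} for word n-grams occurring >= min_count times.

    Hypothetical helper for illustration only; tokenization is deliberately naive.
    """
    # Lowercase and keep runs of letters/apostrophes as tokens (a simplifying assumption).
    tokens = re.findall(r"[a-z']+", text.lower())
    # Slide a window of length n across the token list.
    ngrams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    counts = Counter(ngrams)
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Invented toy data, not drawn from any corpus.
sample = (
    "That's what I mean and the point is that's what I think "
    "and the rest is what I mean"
)
bigrams = recurrent_sequences(sample, n=2, min_count=2)
trigrams = recurrent_sequences(sample, n=3, min_count=2)
```

On this toy input the search returns stem-like sequences such as that's what I alongside and the, which illustrates why a frequency threshold alone does not separate intuitively formulaic sequences from other recurrent ones.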
Morphology and Syntax
Invoking formulaicity in morphology and syntax does not imply abandoning the notion of abstraction. But it allows us to re-adjust and re-size some generalizations. Morphology is often seen as a testing ground for dual versus single mechanism models of language and typically involves getting participants to create forms of nonce words, for example, to form plural nouns, past tense verbs and so on. Dąbrowska (2004b) provides a tour de force discussion of the two types of models and shows that some of the common examples used in this context do not actually allow one to choose between them. For instance, using the English regular and irregular past tense confounds regularity with other factors such as the high token frequency, the lack of transparency of form and the narrow domains of application in irregular
verbs as opposed to the regular verbs. Testing native speakers’ productivity with Polish genitive and dative inflections to study the effect of regularity per se, using nonce words, she found that type frequency and phonological heterogeneity were much better predictors of productivity than regularity. In other words, speakers did not operate with maximally productive rules which apply equally across all contexts. Dąbrowska argues that these results are consistent with a view of a single mental mechanism dealing with schemas of varying degrees of generality. Formulaicity in syntactic constructions has been claimed for lexicalized sentence stems which have specific discourse functions (Nattinger & DeCarrico, 1992), for example, it seems to me, what I’d like to show, it has to be said. Spoken English clefts in particular exhibit some clear patterns of use. Weinert and Miller (1996) showed that the majority of Reverse WH-clefts of the type [NP BE WH-Clause] (that’s what I mean) had in fact deictic cleft heads, the main patterns being that’s what, that is what, that’s where and that is where. Calude (2008) observes a similar trend for Australian English. O’Keeffe, McCarthy, & Carter (2007) confirm this for their data and add that most of these clefts can be accounted for by the formula that’s what I + forms of say, tell, mean, think, wonder and want. In German, where Reverse WH-clefts are rare but carry a strong emphatic function, they are equally limited, occurring as das ist x (das) was (‘that is x that what’), for example, das ist genau das was ich immer sage (‘that is exactly that that I always say’). Weinert and Miller (1996) and O’Keeffe et al. (2007) note preferred WH-clefts, for example, what I being a frequent opening sequence. Bybee (2002) argues that frequency in sequentiality affects constituency, evidenced by phonological fusions such as wanna, hafta. 
Some such fusions affect semantically linked items, or traditional constituents, for example, can’t in English or l’ami (‘the friend’) in French. Many others do not and seem to ‘violate’ constituency, for example, I’ll, I’m, I’ve, or German hörma(l), sagma(l) [Imperative+Modal Particle] (‘listen’, etc). Halliday and Hasan (1976) suggest that the resulting constituents may then assume a functional unity as the interpersonal ‘Modal Element’ versus the ‘Propositional Element’ of a clause. This does not readily appear to work for other combinations such as [Preposition + Determiner] in German (e.g. zum, im, ‘to the’, ‘in the’), yet their functional status remains to be examined. The unitary status of sequences has further been shown for some expressions in experiments. Sosa and MacFarlane (2002) report that language users find it difficult to identify ‘of’ in sort of and kind of. While semantic unity often seems to match formal unity, frequency effects apparently also hold without a form being tied to a meaning (Saffran, Aslin, & Newport, 1996).
Even the seemingly most rule-based and flexible syntactic phenomena illustrate boundedness, at least in spoken language. Let us take the examples of noun phrases and of word order in declarative main clauses, the most basic and frequent of syntactic units. Miller and Weinert (1998) found that in informal spoken English and German noun phrases exhibit a small range of types and simple structure. Almost 50 per cent of the English noun phrases and over 60 per cent of the German ones are single pronouns. If we add to this the single nouns, ca 64 per cent of English and German NPs contain only one constituent. 16.3 per cent of English and 16.9 per cent of German NPs have the structure [Determiner + Noun]. This means that 80 per cent of NPs are accounted for in terms of Pronoun, Noun and [Determiner + Noun]. The remaining NPs include postmodified NPs (with prepositional phrases and relative clauses), some compound NPs and premodified NPs, but the latter amount to only 3.3 per cent in English and 3.6 per cent in German. The data came mostly from young adults; most of the German speakers had a university education. The question then arises as to the nature of the speaker’s mental representation of what a linguist could analyse as a noun phrase with its various pre- and postmodifying structures. At the very least this invites us to examine the level of abstraction language users operate with. Bybee (2002) sees the noun phrase as showing the strongest signs of being a constituent, based on the frequency with which nouns are preceded or followed by certain items. She suggests that various levels of abstraction are found, from very specific (word level), to partially general (e.g. my + Noun), to more general (Possessive + Noun), to fully general (Determiner + Noun). 
It could also be that grammatical function preferences affect such categorisations, for example, 80–90 per cent of German clause-initial pronouns are subjects (Weinert, 2007), which brings me to my second example. Word order in main clauses is also a good testing ground for usage-based models, given the frequency and potential for variation inside main clauses (Weinert, to appear). German is well known for allowing virtually any constituent in clause-initial position, in contrast to English. The difference is found especially with clause-initial object NPs, which in English are associated with strong focus, for example, to highlight a contrast or emphasize an NP, whereas in German they are mainly associated with thematic information. The distribution of elements in this slot is nevertheless not even. Durrell (2002) refers to an estimate that two thirds of initial elements in German in all registers are subjects. It appears that for spoken German especially we may have to revise the view of an entirely open clause-initial position. An examination of 2000 clause-initial NPs in my data revealed very
confined usage. A thousand NPs each were taken from private, informal everyday conversations on the one hand and from public semi-formal discussions and private academic student-lecturer consultations on the other. The picture was strikingly similar for both data sets. Around 80 per cent of clause-initial elements are pro-forms: ca 64 per cent are single pronouns and 16 per cent are deictic adverbs such as da and dann (‘there’ and ‘then’). Only ca 8 per cent are lexical adverbs and ca 12 per cent are full noun phrases. If we look only at noun phrases we find that ca 85 per cent are pronouns (compared with the figure of 60 per cent for all noun phrases referred to above in the section on noun phrases). In the informal data, 13.5 per cent are object NPs and in the formal data 9.7 per cent are object NPs (with one dative object, the others accusative, in each data set; also included are 10 cases of dative mir (8 in the formal data) in constructions such as mir scheint (‘it seems to me’)). To sum up, main clause initial NPs are 85 per cent pronouns and of these 90 per cent are subjects. Taking all clause-initial elements into account, 80 per cent are pro-forms (64 per cent pronouns, the rest mostly da and dann). The similarity between the two data sets is rather surprising, especially given that in the formal data the noun phrases in other clause positions are much more complex. Again the question is, what type of word order generalizations do language users operate with? While the details remain to be investigated, in spontaneous spoken German an open rule for clause-initial position would seem rather too powerful, both for linguists and for language users.
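The figures reported above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch in Python, assuming that the percentages for pronouns, deictic adverbs, lexical adverbs and full noun phrases are all shares of the whole sample of clause-initial elements; the dictionary keys and function names are illustrative only and are not part of the original analysis.

```python
# Approximate shares (per cent) of clause-initial elements in spoken German
# main clauses, as reported in the text. Treating all four figures as shares
# of the whole sample is an assumption made for this sketch.
shares = {
    "pronoun": 64.0,
    "deictic_adverb": 16.0,    # e.g. da, dann
    "lexical_adverb": 8.0,
    "full_noun_phrase": 12.0,
}


def pro_form_share(s):
    """Pronouns plus deictic adverbs, as a share of all clause-initial elements."""
    return s["pronoun"] + s["deictic_adverb"]


def pronoun_share_of_nps(s):
    """Pronouns as a share of noun phrases only (pronouns plus full NPs)."""
    noun_phrases = s["pronoun"] + s["full_noun_phrase"]
    return 100.0 * s["pronoun"] / noun_phrases


if __name__ == "__main__":
    print("total:", sum(shares.values()))
    print("pro-forms:", pro_form_share(shares))
    print("pronouns among NPs:", round(pronoun_share_of_nps(shares), 1))
```

On this reading the four categories exhaust the sample, pro-forms come to 80 per cent, and pronouns make up roughly 84 per cent of the noun phrases, which is in line with the reported figure of ca 85 per cent.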
Psycholinguistic Reality
There is then plenty of surface linguistic evidence of formulaicity and constrained usage, that is, of limited distribution across constructions, even in the potentially most flexible and general ones such as noun phrases and main clauses. Some evidence of an accompanying psycholinguistic effect comes from the ‘fusion’ effects mentioned above, where phonological alternations, mostly reductions, and accompanying orthographic alternations (would of for would have) indicate that sequences are no longer subject to analysis. Usage-based effects have been shown by work on Polish nonce words, which suggests that morphological rules may well be local rather than maximally abstract and regular. The evidence regarding the processing of formulaic sequences and of their advantage over non-formulaic sequences is sparser, not surprisingly, given how difficult this is to examine. Van Lancker, Canter, & Terbeek (1981) report that in a production task
speakers inserted longer pauses between the words in literal than in metaphorical interpretations of idioms such as he was skating on thin ice. But work on idioms is inconclusive and regularly suggests that there is no difference in comprehension speed between literal and metaphorical interpretations (Gibbs, 1985), or between idioms and literal paraphrases (Gibbs, Bogdanovich, Sykes, & Barr, 1997). Schmitt, Grandage, & Adolphs (2004) remind us that corpus-derived recurrent clusters are not necessarily psycholinguistically real. Conversely, as pointed out above, corpus-based analysis may discard formulae which are psycholinguistically real. A lack of processing difference (whatever the method) does not necessarily mean that formulaic and non-formulaic sequences are not processed differently; we may simply be considering the wrong sequences. The pre-selection of formulaic sequences is far from straightforward. Namba (this volume) tackles the issue of identification with a complex set of diagnostic criteria and Lin (this volume) refines the examination of phonological coherence as an indicator of formulaicity. Most processing studies rely on measuring participants’ production speed or reaction times on pre-selected sequences. Bod (to appear), for instance, compared frequently used clauses such as I like it with less frequent ones such as I keep it. He found that native participants were able to decide more quickly that a sequence was a possible English sequence when it was a more frequent one. Columbus (this volume) provides some evidence for non-semantic, non-constituent sequences such as at the end of the being processed faster than ‘compositional’ [their label] control sequences such as to eat a sandwich. How any such frequent, apparently compositional and non-constituent sequences are processed is still an open question.
Conklin and Schmitt (2008) used formulaic sequences whose initial elements made later ones highly predictable, for example, [beauty is in the] [eye of the beholder], verified by means of cloze tests on native speakers. A self-paced line-by-line reading task found that such formulaic sequences were read more quickly than non-formulaic ones. The value of self-paced reading is that it can potentially tap the representation of individual words or items in a sequence, which is precisely what is needed. Schmitt and Underwood (2004) examined the terminal words of sequences in such a task. They found no difference in reading times between words which occurred in formulaic versus non-formulaic sequences in native speakers, which they attribute partially to the word-by-word presentation or to having to press a space bar, as this may have disrupted the holistic processing. Eye-tracking has been used to show that the final word in a formulaic sequence is fixated less often and for a shorter time than the same final
word in a non-formulaic sequence in the case of native speakers (Underwood et al., 2004). Speed is certainly a plausible indicator of holistic processing, given the general consensus on the limits to our processing capacity. Tremblay and Baayen (this volume) make a convincing case that absolute speed (rather than comparative speed between sequences or between native and non-native speakers) indicates holistic processing and a lack of assembly of four-word sequences, based on commonly accepted lexical access speed for single words. In addition, they found evidence of a frequency effect of three-word sequences and of single words within the longer sequences, and suggest that multi-word strings are stored as parts as well as wholes. The notion ‘holistic’ becomes further relativized in Columbus (this volume), who teases out differences according to type of multi-word unit. Wray (1992, 2008) alerts us to the possibility that psycholinguistic tests may not tap formulaicity, given clashes in findings between test and non-test situations in clinical investigations. Some of the authentic and spontaneous tasks referred to above address this problem to some extent. Studies which attempt, scrupulously, to develop reliable criteria for identifying formulaic sequences may in fact be forced into a corner, back to a relatively small group of non-compositional, opaque idioms and possibly phonologically fused items. On the other hand, holistically processed formulaic sequences may be found to exhibit a much greater range – including those based on form, not only on function – once we examine sequences in context and in naturally occurring language, and it is these contexts which may usefully be narrowed and controlled for in studies, rather than just pre-selected sequences. Yet the difficulties in demonstrating that formulaic sequences are processed differently from non-formulaic sequences may not only be a question of methodology.
This is where considering formulaicity as an aspect of usage-based language knowledge may throw light on what the issue is. If language knowledge is usage-based then it should not be surprising that it is difficult to find consistent and absolute differences between formulaic and non-formulaic sequences. This is because all sequences will be subject to factors such as frequency, familiarity, recency, and context (e.g. genre, scripts); rules can be local as well as general; there are different levels of abstraction; and sequences and units have a potential effect on each other. Just as not every generalization has to occur at the most abstract level, so not every formulaic sequence has to be entirely cast in concrete form. What this means in terms of the internal composition of sequences is not clear. For instance, observed general word frequency effects on the
processing of formulaic sequences may not mean that such words are actually parts within the sequences; they may simply be bumps in the landscape, just as certain parts of words, such as the beginning and end, are more prominent, hence the ‘bathtub effect’ of lexical recall (Aitchison, 1987). Columbus (this volume) also observes non-linear frequency effects. One further intractable issue, which is implied but rarely stated explicitly in studies, is that a mental representation of a whole will, in production at least, have to appear in linear order. Prior to being uttered, this whole may not necessarily be of linear form and the nature of the transition also deserves to be studied.
First Language Acquisition
First language acquisition is clearly of immense, central relevance to a theory of language. Usage-based approaches allow for item-based learning and try to chart the relationship between exemplars and rules (e.g. Tomasello, 1992; 2002; 2003; Diessel, 2004; Brandt, Diessel, & Tomasello, 2008). This work regularly comments on children’s use of formulaic language and local schemas, rather than abstract, adult rules (whatever the adult rules may in fact be). Tomasello (1992) very extensively examines the verbs and the structures they occur in for one child from age 15–24 months. The analysis suggests that the child operated with structures associated with particular verbs rather than classes of verbs, that the structures were independent of each other and that recency in the child’s production also played a role. Diessel (2004) proposes how complex syntactic constructions may develop out of exemplars of simple ones, showing again that local restrictions go hand in hand with development. For instance, descriptive relative clauses with main clause predicate complement heads or heads of presentational main clauses are the first to appear, which is consistent with the early appearance of such main clauses. Again, at the very least, such findings invite us to rethink where generalizations may be found and what they may be based on or look like. Furthermore, in terms of input, children have been shown to be exposed to highly repetitive lexical frames, for example, utterance-initial item-based frames such as what, that, it, there, you and so on (Tomasello, 2002). Dabrowska and Lieven (2005) investigate the development of English syntactic questions, looking at the internal structure of the NPs and VPs, not only the position of the auxiliary and a WH-word. Questions were chosen since they are often seen as evidence of children’s abstract rules,
especially in generative accounts. The children were older, that is, 2 and 3 years of age. Dabrowska and Lieven were able to account for 90 per cent of the children’s utterances with a lexically specific grammar based on the child’s linguistic experience. While their corpus was dense, it still only comprised about 7 per cent of the total of this experience. Experimental study is therefore also much needed in this area, but is still relatively sparse (Tomasello, 2000). Extensive work using nonce words has been done on verbs, with some work on morphology (Dabrowska, 2004b; 2006), and such work is gathering momentum. These studies are highly supportive of the exemplar-based learning view, at least in young children. Akhtar and Tomasello (1997), Tomasello and Brooks (1999) and Akhtar (1999) found that 2–3.5-year-old children do not readily produce or comprehend transitive constructions with novel verbs in terms of patterns abstracted from known verbs and learn about new verbs just in terms of these verbs. Childers and Tomasello (2001) found that 2.5-year-olds were best at generalizing transitive constructions to new verbs in stabilising frames with pronouns such as he’s VERB-ing it, regardless of their level of familiarity with the verbs. Overgeneralization, on the other hand, often cited as evidence of abstract rules, hardly occurs before age 2.5–3 (Pinker, 1989). Overgeneralization also appears to be favoured by familiarity and constrained by lack of familiarity and the availability of alternatives, that is, it does not apply across all possible structures (Brooks, Tomasello, Lewis, & Dodson, 1999). In addition, previous studies have shown how specific communicative needs or crisis points can lead learners towards analysis, revealing what was previously a unit (Iwamura, 1980, p. 85). Such detailed analysis of communicative situations therefore continues to hold vital clues.
The full extent to which first language acquisition involves formulaic sequences is not yet known, although evidence for a substantial role is being put forward. Yet we cannot ignore children’s demonstrated use of simple forms and words and their (own) subsequent construction of two, three, four word utterances. In addition, specific languages also yield different learning tasks. More comparative research with children acquiring languages which differ in terms of formulaicity, regularity and irregularity is needed to show when and how abstraction emerges. At the same time, a theory which links formulaic language to the evolution of language offers an intriguing set of questions (Wray, 2008). But just as we should not assume that once analyticity is possible it replaces formulaicity, so it may be premature to suppose that just because formulaicity may have been evolutionarily prior, it remains a default ontogenetically.
Second Language Acquisition
Since the revival of interest in formulaic language in the mid-nineties, studies have examined its role in non-native language as a communicative, production and processing, and learning strategy, but not to equal degrees. Myles, Hooper, & Mitchell (1998) and Myles, Mitchell, & Hooper (1999) report some evidence for a relationship between formulas and the development of rules among classroom learners and Weinert (1994) notes some classroom-induced negative effects. But most prominent has been the issue of native-like idiomaticity in the broad sense, that is, including non-compositional idioms, collocations and other conventional form-meaning pairings, their development and how learners may be helped towards their use (e.g. Granger, 1998; Howarth, 1998; de Cock, 2000; Schmitt, 2004; Siyanova & Schmitt, 2008; Li & Schmitt and Dai & Ding, this volume). Such work includes comparisons of actual language use with the language of textbooks. For example, complex academic texts have been shown to consist of conventionalized combinations, ranging from discourse organisers to stance bundles and a variety of referential expressions (Chen & Wood, this volume). Research which aims to establish the extent of idiomatic language use among non-natives as well as compare the processing of formulaic sequences in native and non-native language reports both quantitative and qualitative differences between native and non-native speakers, with non-natives generally not operating with the appropriate range and function of formulas or not experiencing the same processing advantage when they do. As I have suggested elsewhere (e.g. Weinert, 1995), the (understandable) aim to develop non-native speakers’ knowledge and use of formulaic language through appropriate teaching may run counter to their own processes. Wray (2002) proposes that post-school and especially adult learners are driven more towards analysis than formulaicity.
She sees the difficulty of achieving native-like idiomaticity as arising out of the need to develop strong associations between components, something which is not easy for learners to achieve given their particular social, intellectual and learning experience. One might say that it is difficult to become fully socialized, in a broad sense, a second time around, without the stable linguistic, developmental and motivational environment normally afforded first language learners. Kecskes (2002) adds to this the notion of conceptual socialisation, requiring the development of a common underlying conceptual base for the two languages. A further important aspect which is raised only sporadically is the use of non-native formulaic language, both as a communicative and as a processing strategy (e.g. Rehbein, 1987;
Bolander, 1989). Rehbein talks about migrant speakers’ formulae constituting a ‘self-imposed reduction of their own system of needs’. The learner’s lack of native-like formulaic use may also be due to affective variables and an investigation into which areas are particularly associated with a learner’s identity and goals seems timely. Studies into intercultural communication and competence (e.g. Woodin, 2007) have long acknowledged the complexities of development in both conceptual structure and pragmatic conventions, questioning the aim for learners of achieving native command of a second language (indeed an oxymoron). Socialisation research, including in L2 contexts, has become a field of its own (e.g. Kitzinger, 2000; Kasper, 2001).
Society and Culture
Wray (2002, 2008) suggests that formulaic language is a mechanism for the promotion of self. This cannot readily be maintained, either as an explanation or as a single motivating factor. Created language can serve the self equally, and this is implied by Wray’s NOA (needs-only analysis) hypothesis (i.e. there will be times when analysis is needed to promote the self). In addition, Wray herself (2008) points out that the language of esoteric societies may be characterized by a high level of formulaicity, whereas that of exoteric societies may have a lower level. So-called Western societies value originality, creativity and effort of thought, at least in some areas. What type of formulaic language attracts critical responses or prompts novelty is itself an interesting social phenomenon. Taking exception to you know what I mean can express prejudice against Americans; you know may be considered an unnecessary filler indicative of an empty head; and refusing to use the phrase push the envelope may be a vestige of rebellion in an ever more corporate climate of educational establishments. Creative adaptation of the officialese used in the former GDR was considered a sign of subversion and liberation (v. Polenz, 1993). Even in related languages and cultures differences can be observed. This has long been recognized in relation to classic idioms (see also Abdou, this volume) as well as in work on cross-cultural pragmatics, but there are likely to be further, less obvious effects. Forsberg and Fant (this volume) show differences between Chilean Spanish and French in relation to grammatical and discursive versus lexical formulaic sequences, which they see as being related to structural differences between the languages. To what extent cultural differences are implicated in other differences will require
detailed and sensitive study. In the case of translation (Giammarresi, this volume), this opens up the question of the effect of a change from formulaic to non-formulaic, from the conventional to the novel. Finally, individuals may vary in their attachment to conventions or their thirst for the novel, not only as experience but also as meaningful activity.
Conclusion
The relationship between memory and analysis and between the conventional and the novel will continue to keep researchers busy in the areas of language acquisition, language use, and language representation and processing, and in the study of language as a socio-cultural phenomenon. Three areas in particular seem to me pivotal: the nature of spoken language (child, adult, child-directed, native and non-native), the psycholinguistic status of linguistic units (interpretation of processing speed, unit internal structure, linearity/non-linearity, functionalisation of formal unity, level of abstraction, redundancy in representation, transition from representation to production, production vs. comprehension) and the socio-cultural parameters which influence the level of formulaicity and the balance between convention and novelty (in communities and for individuals, effect on acquisition, implications for translation). Recent research shows a vigorous engagement with the challenges inherent in studying formulaicity and in verifying usage-based accounts of language. In fact, considering formulaic language as an aspect of usage-based language alerts us to the possibility, likelihood even, that we may never find the perfect methodology for demonstrating the difference between formulaic and non-formulaic sequences per se. Instead, as shown by the studies in this volume, fine-grained analyses of formal and functional aspects of recurrent and co-occurrent sequences, carried out for a wide range of purposes, are revealing just how immensely complex language use is and how variously language knowledge may be represented. To adapt a friend’s phrase, there is no time to sleep on our bay leaves.
References
Aitchison, J. (1987). Words in the mind. Oxford: Basil Blackwell.
Akhtar, N. (1999). Acquiring basic word order: Evidence for data-driven learning of syntactic structure. Journal of Child Language, 26, 339–56.
Akhtar, N., & Tomasello, M. (1997). Young children’s productivity with word order and verb morphology. Developmental Psychology, 33, 952–65.
Barlow, M., & Kemmer, S. (Eds.). (2000). Usage-based models of language. Stanford, CA: Centre for the Study of Language and Information.
Bod, R. (to appear). Exemplar-based syntax: How to get productivity from examples. The Linguistic Review, Vol. 23, Special Issue on Exemplar-Based Models of Language.
Bolander, M. (1989). Prefabs, patterns and rules in interaction? Formulaic speech in adult learners’ L2 Swedish. In K. Hyltenstam & L. Obler (Eds.), Bilingualism across the lifespan: Aspects of acquisition, maturity and loss (pp. 73–86). Cambridge: Cambridge University Press.
Brandt, S., Diessel, H., & Tomasello, M. (2008). The acquisition of German relative clauses. Journal of Child Language, 35 (2), 325–48.
Brooks, P., Tomasello, M., Lewis, L., & Dodson, K. (1999). Young children’s overgeneralisations with fixed transitivity verbs. Child Development, 70, 1325–37.
Butler, C. (1998). Multi-word lexical phenomena in Functional Grammar. Revista Canaria de Estudios Ingleses, 36, 13–36.
Bybee, J. (2002). Sequentiality as the basis of constituent structure. In T. Givón & B. F. Malle (Eds.), The evolution of language out of pre-language (pp. 107–34). Amsterdam/Philadelphia: John Benjamins.
Bybee, J. (2006). Frequency of use and the organization of language. New York/Oxford: Oxford University Press.
Calude, A. (2008). Demonstrative clefts in spoken English. Doctoral Thesis, University of Auckland.
Childers, J. B., & Tomasello, M. (2001). The role of pronouns in young children’s acquisition of the English transitive construction. Developmental Psychology, 37, 739–48.
Conklin, K., & Schmitt, N. (2008). Formulaic sequences: Are they processed more quickly than non-formulaic language by native and nonnative speakers? Applied Linguistics, 29 (1), 72–89.
Cowie, A. P. (Ed.). (1998). Phraseology: Theory, analysis and application. Oxford: Oxford University Press.
Dabrowska, E. (2004a). Language, mind and brain. Edinburgh: Edinburgh University Press.
Dabrowska, E. (2004b). Rules or schemas? Evidence from Polish. Language and Cognitive Processes, 19, 225–71.
Dabrowska, E. (2006). Low-level schemas or general rules? The role of diminutives in the acquisition of Polish case inflections. Language Sciences, 28, 120–35.
Dabrowska, E., & Lieven, E. (2005). Towards a lexically specific grammar of children’s question constructions. Cognitive Linguistics, 16, 437–74.
de Bot, K. (1992). A bilingual production model: Levelt’s ‘speaking’ model adapted. Applied Linguistics, 13, 1–25.
De Cock, S. (2000). Repetitive phrasal chunkiness and advanced EFL speech and writing. In C. Mair & M. Hundt (Eds.), Corpus linguistics and linguistic theory (pp. 51–68). Amsterdam: Rodopi.
Diessel, H. (2004). The acquisition of complex sentences. Cambridge: Cambridge University Press.
Durrell, M. (2002). Hammer’s German grammar and usage. Fourth edition. London: Arnold.
Erman, B., & Warren, B. (2000). The idiom principle and the open choice principle. Text, 20, 29–62.
Gibbs, R. (1985). On the process of understanding idioms. Journal of Psycholinguistic Research, 14 (5), 465–72.
Gibbs, R., Bogdanovich, J., Sykes, J., & Barr, D. (1997). Metaphor in idiom comprehension. Journal of Memory and Language, 37, 141–54.
Granger, S. (1998). Prefabricated patterns in advanced EFL writing: Collocations and formulae. In A. P. Cowie (Ed.), Phraseology: Theory, analysis and application (pp. 145–60). Oxford: Oxford University Press.
Halliday, M. A. K., & Hasan, R. (1976). Cohesion in English. London: Longman.
Howarth, P. (1998). The phraseology of learners’ academic writing. In A. P. Cowie (Ed.), Phraseology: Theory, analysis and application (pp. 161–86). Oxford: Oxford University Press.
Iwamura, S. G. (1980). The verbal games of pre-school children. London: Croom Helm.
Kasper, G. (2001). Four perspectives on L2 pragmatic development. Applied Linguistics, 22 (4), 502–30.
Kecskes, I. (2002). Situation-bound utterances in L1 and L2. Berlin: Mouton de Gruyter.
Kitzinger, C. (2000). How to resist an idiom. Research on Language and Social Interaction, 33 (2), 121–54.
Langacker, R. W. (1987). Foundations of cognitive grammar, Vol 1. Theoretical prerequisites. Stanford, CA: Stanford University Press.
Langacker, R. W. (2000). A dynamic usage-based model. In M. Barlow & S. Kemmer (Eds.), Usage-based models of language (pp. 1–63). Stanford, CA: Centre for the Study of Language and Information.
Miller, J., & Weinert, R. (1998). Spontaneous spoken language: Syntax and discourse. Oxford: Oxford University Press.
Moon, R. (1998). Fixed expressions and idioms in English. Oxford: Clarendon Press.
Myles, F., Hooper, J., & Mitchell, R. (1998). Rote or rule? Exploring the role of formulaic language in classroom foreign language learning. Language Learning, 48 (3), 323–63.
Myles, F., Mitchell, R., & Hooper, J. (1999). Interrogative chunks in French L2. A basis for creative construction? Studies in Second Language Acquisition, 21, 49–80.
Nattinger, J. R., & DeCarrico, J. S. (1992). Lexical phrases and language teaching. Oxford: Oxford University Press.
O’Keeffe, A., McCarthy, M., & Carter, R. (2007). From corpus to classroom: Language use and language teaching. Cambridge: Cambridge University Press.
Pawley, A., & Syder, F. H. (1983). Two puzzles for linguistic theory: Nativelike selection and nativelike fluency. In J. C. Richards & R. W. Schmidt (Eds.), Language and communication (pp. 191–226). New York: Longman.
Pinker, S. (1989). Learnability and cognition: The acquisition of verb-argument structure. Cambridge, MA: Harvard University Press.
Pinker, S. (1994). The language instinct. Harmondsworth: Allen Lane, Penguin Press.
Pinker, S., & Prince, A. (1991). Regular and irregular morphology and the psychological status of rules in grammar. Proceedings of the 17th Annual Meeting of the Berkeley Linguistics Society (pp. 230–51). Berkeley, CA: BLS.
Raupach, M. (1984). Formulae in second language speech production. In H. Dechert, D. Möhle, & M. Raupach (Eds.), Second language productions (pp. 114–37). Tübingen: Gunter Narr.
Rehbein, J. (1987). Multiple formulae. Aspects of Turkish migrant workers’ German in intercultural communication. In K. Knapp et al. (Eds.), Analysing intercultural communication (pp. 215–48). Berlin: Mouton de Gruyter.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–28.
Schmitt, N. (Ed.). (2004). Formulaic sequences: Acquisition, processing and use. Amsterdam/Philadelphia: John Benjamins.
Schmitt, N., Grandage, S., & Adolphs, S. (2004). Are corpus-based recurrent clusters psycholinguistically valid? In N. Schmitt (Ed.), Formulaic sequences: Acquisition, processing and use (pp. 127–51). Amsterdam/Philadelphia: John Benjamins.
Schmitt, N., & Underwood, G. (2004). Exploring the processing of formulaic sequences through a self-paced reading task. In N. Schmitt (Ed.), Formulaic sequences: Acquisition, processing and use (pp. 173–89). Amsterdam/Philadelphia: John Benjamins.
Sinclair, J. McH. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Siyanova, A., & Schmitt, N. (2008). L2 learner production and processing of collocation: A multi-study perspective. Canadian Modern Language Review, 64 (3), 429–58.
Sosa, A. V., & MacFarlane, J. (2002). Evidence for frequency-based constituents in the mental lexicon: Collocations involving the word of. Brain and Language, 83 (2), 227–36.
Stubbs, M. (1995). Collocations and cultural connotations of common words. Linguistics and Education, 7 (4), 379–90.
Tomasello, M. (1992). First verbs: A case study of early grammatical development. Cambridge: Cambridge University Press.
Tomasello, M. (2000). Do young children have adult syntactic competence? Cognition, 74, 209–53.
Tomasello, M. (2002). The emergence of grammar in early child language. In T. Givón & B. F. Malle (Eds.), The evolution of language out of pre-language (pp. 309–28). Amsterdam/Philadelphia: John Benjamins.
Tomasello, M. (2003). Constructing a language. Cambridge, MA and London: Harvard University Press.
Tomasello, M., & Brooks, P. (1999). Early syntactic development: A construction grammar approach. In M. Barrett (Ed.), The development of language (pp. 161–90). Hove: Psychology Press.
Underwood, G., Schmitt, N., & Galpin, A. (2004). The eyes have it. An eye-movement study into the processing of formulaic sequences. In N. Schmitt (Ed.), Formulaic sequences: Acquisition, processing and use (pp. 153–72). Amsterdam/Philadelphia: John Benjamins.
v. Polenz, P. (1993). Die Sprachrevolte in der DDR im Herbst 1989. Zeitschrift für Germanistische Linguistik, 21, 127–49.
Van Lancker, D., Canter, G. J., & Terbeek, D. (1981). Disambiguation of ditropic sentences: Acoustic and phonetic cues. Journal of Speech and Hearing Research, 24, 330–35.
Vollmer, H. J., & Olshtain, E. (1989). The language of apologies in German. In S. Blum-Kulka, J. House, & G. Kasper (Eds.), Cross-cultural pragmatics: Requests and apologies. Norwood: Ablex.
Weinert, R. (1994). Some effects of a foreign language classroom on the development of German negation. Applied Linguistics, 15 (1), 76–101.
Weinert, R. (1995). The role of formulaic language in second language acquisition: A review. Applied Linguistics, 16 (2), 180–205.
Weinert, R. (2007). Demonstrative and personal pronouns in formal and informal conversations. In R. Weinert (Ed.), Spoken Language Pragmatics: An analysis of form-function relations (pp. 1–28). London/New York: Continuum.
Weinert, R. (to appear). German free word order – reality or myth? The front-field in spoken main clauses.
Weinert, R., & Miller, J. (1996). Cleft constructions in spoken language. Journal of Pragmatics, 25, 173–206.
Weismer, S. E. (2004). Memory and processing capacity. In R. M. Kent (Ed.), MIT encyclopedia of communication disorders (pp. 349–52). Cambridge, MA: MIT Press.
Woodin, J. (2007). Intercultural positioning: Tandem conversations about word meaning. In R. Weinert (Ed.), Spoken Language Pragmatics: An analysis of form-function relations (pp. 208–28). London/New York: Continuum.
Wray, A. (1992). The focusing hypothesis. Amsterdam: John Benjamins.
Wray, A. (2002). Formulaic language and the lexicon. Cambridge: Cambridge University Press.
Wray, A. (2008). Formulaic language: Pushing the boundaries. Oxford: Oxford University Press.
Part 1
Formulaic Language in Acquisition and Pedagogy
Chapter 2
The Development of Collocation Use in Academic Texts by Advanced L2 Learners: A Multiple Case Study Approach
Jie Li and Norbert Schmitt
University of Nottingham
Introduction
It is now generally agreed that the native-like use of collocations (word combinations such as heavy smoker, make a speech, bitterly cold) is an important element of proficient language use (e.g. Sinclair, 1991; Wray, 2002). However, researchers have found that L2 learners rely heavily on creativity and so produce expressions which are simply not used by native speakers (Pawley & Syder, 1983; Wray, 2002). Skehan (1998) and Foster (2001) claim that non-native speakers, unlike native speakers, generate a great proportion of their language from rules instead of lexicalized routines. Native speakers use conventional expressions to convey meaning, while learners often express meaning with unidiomatic combinations of words.
A number of studies (Granger, 1998; Howarth, 1998; Nesselhauf, 2003) have shown that even advanced L2 learners often experience problems with collocations in written English. For example, Granger (1998) used a corpus-based approach to look at -ly intensifier + adjective collocations automatically extracted from advanced French learners’ academic essays and similar essays written by native English students. She found that one type of intensifier, ‘boosters’ (e.g. deeply, strongly, highly), was underused by the French learners, in terms of the number of both types and tokens, compared with the native writers. She concluded that advanced French learners of English did use collocations, but that they tended to underuse native-like expressions and to overuse unidiomatic word pairs which have direct L1 translation equivalents.
Using a frequency-based statistical approach, Lorenz (1999) also investigated intensifier-adjective collocations in “expository-argumentative” texts
produced by advanced German learners and native British students. By calculating association measures of collocations and type-token ratios, he found that advanced German learners of English had smaller repertoires of collocations (as measured by type-token ratio) and overused a limited number of high frequency collocations (as measured by t-score and MI). Building on Lorenz’s statistical approach, one of Siyanova and Schmitt’s (2008) three studies used corpus-based frequency data and mutual information statistics (MI) to investigate adjective-noun collocations in advanced Russian learners’ and native university students’ written English. By consulting the BNC for frequency counts and calculating the MI value of each collocation, they found that only 45 per cent of the collocations used by advanced Russian learners in their written texts were appropriate (i.e. frequent and strongly associated English word combinations).
Following a phraseological approach, Howarth (1998) focused on restricted verb-noun collocations (e.g. make a claim, reach a conclusion) identified from native and advanced non-native academic written corpora. Based on the norms established in native speaker writing, he reported that advanced non-native MA students employed about 50 per cent fewer restricted collocations than natives. He also found that approximately 6 per cent of collocations produced by advanced learners are non-conventional. Based on Howarth’s analysis, it seems that among the three collocational groups (i.e. restricted collocations, free collocations, and idioms), restricted collocations are most problematic for advanced non-native learners.
Another, more comprehensive study which explored advanced German-speaking learners’ verb-noun collocations (e.g. take a break, shake one’s head) is that of Nesselhauf (2003). Like Howarth, she adopted a phraseological approach and classified collocations into three groups, namely, free combinations (e.g. want a car), collocations (e.g.
take a picture) and idioms (e.g. sweeten the pill). She found that learners made the greatest proportion of errors with collocations (79 per cent), followed by free combinations (23 per cent) and idioms (23 per cent). As can be seen, the existing studies all used a corpus-based native versus non-native comparison to investigate learners’ collocation use and identify gaps between these two populations. In general, three main approaches have been employed to define and identify collocations in written texts. One has studied all word combinations of a particular grammatical form (e.g. –ly amplifier + adjective), regardless of whether they are ‘idiomatic’ or not (Granger, 1998). A second is the so-called phraseological approach, represented by Howarth (1998) and Nesselhauf (2003, 2005). Borrowing the Russian School’s definition and classification of phraseology (Cowie,
1998), collocations are typically identified according to two defining criteria: semantic opacity (the degree to which words are used with their ‘dictionary’ meanings) and fixedness (the degree to which elements of a phrase can be substituted). A final approach uses the occurrence frequency of word combinations within the investigated corpus as the identification criterion. Thus, Lorenz (1999) and Siyanova and Schmitt (2008) compared word pairs in non-native and native equivalent corpora and used statistical ‘association measures’ to identify which pairs were characteristically idiomatic. Although different approaches have been employed to identify and define collocations, they point to the same conclusion: L2 learners have difficulties with collocation use in their language production.
However, existing studies have largely been descriptive in nature, and tend to focus on one-off compositions produced by learners. Little research has focused on an empirical analysis of L2 learners’ collocations over time, which could shed light on how collocational knowledge develops. A small number of longitudinal studies have been undertaken to investigate the role of formulaic language improvement in young L2 learners’ language acquisition (e.g. Wong-Fillmore, 1976; Huang & Hatch, 1978). Apart from Adolphs and Durow’s (2004) longitudinal study of two L2 postgraduates’ three-word formulaic language improvement in spoken English, few studies have done the same for advanced L2 learners’ improvement of formulaic language.
The only truly reliable way to identify patterns of development in the use of collocations by L2 learners is to conduct longitudinal studies of the same learners over time. This study attempts to do this using a multiple case-study approach. The purpose is to provide a rich and detailed description of several individual learners’ use of collocations over a period of one academic year. We are also interested in how the individual results combine into group results.
As our goal is descriptive, we begin with no formal research questions. However, the following general questions helped to focus the investigation:
1. Will advanced Chinese L2 learners improve their collocation use in academic writing assignments over a one-year study abroad postgraduate programme? Are the collocations used by Chinese students similar to those used by published authors?
2. Are the statistical measures of collocations we use valid for the investigation of collocation improvement over an academic year? To what extent can we put our faith in the statistical results of group patterns of collocation development?
Methodology
Participants
The four participants were female Chinese postgraduates on a one-year MA programme in English Language Teaching (hereafter ELT) in the School of Education at the University of Nottingham. All of them were English majors from China, with ages ranging from 26 to 29. Their English language learning experiences in China were similar in that their exposure to the target language came mainly from non-native teacher-dominated classroom instruction, which was generally grammar-based and input-poor. The participants had similar career plans and expectations, namely returning to China to start teaching in colleges or universities. Overall, the four participants were advanced English language learners, who received similar English language training in China, and were exposed to the same L2 environment at a British university. Details of the individual participants are shown in Table 2.1.
Table 2.1 Participants’ Personal Details
Participant  Age  Education background  Teaching experience  IELTS/TOEFL score
LH           29   Technical College     5 years              6.5, Writing: 6.0
TT           26   Bachelor’s Degree     4 years              640 (TOEFL), Writing: 5.5
WL           27   Technical College     5 years              6.5, Writing: 6.0
YJ           27   Bachelor’s Degree     5 years              6.5, Writing: 6.0
Since IELTS and TOEFL scores are not directly comparable, TT’s English language competence cannot be compared precisely with that of the other three participants. Nevertheless, on the basis of her TOEFL marks, she is, like the other members of the participant group, a proficient English language user.
The learner corpus
The learner corpus consisted of 36 academic writing assignments (including eight essays and one dissertation for each participant) written over a period of one academic year (i.e. three terms). Since the four participants are all from the one-year MA programme in ELT, their writing requirements were the same except for the coursework for elective modules. This MA course is
comprised of two core modules (Applied Linguistics and Syllabus Design & Methodology), four elective modules, and a final dissertation. Each module requires a 3,000 word essay for coursework apart from the core modules, which require 6,000 words each (i.e. two essays of 3,000 words). The word count requirement for the dissertation in Term 3 is 12,000 words. Overall, each participant is required to produce 12,000 words in each of three terms over the course of 12 months.
Developing the corpus involved collecting each text in electronic form, cleaning it (i.e. removing unnecessary parts: titles, headers, footers, captions, and reference list), and categorizing it according to the term in which it was written. The resulting corpus contained 149,587 running words (tokens) and 7,259 types, and was divided into three subcorpora: Term 1 (50,376 running tokens), Term 2 (48,530), and Term 3 (50,681). The three subcorpora are, therefore, directly comparable in terms of text length and text style.
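The token and type counts reported for the learner corpus can be illustrated with a minimal sketch. The tokenizer below is an assumption made purely for illustration (lower-cased alphabetic strings, with internal hyphens or apostrophes kept together); the chapter does not say how words were delimited, so counts from this sketch would not necessarily match the published figures. The subcorpus sizes are those given in the text.

```python
import re

# Running tokens per subcorpus, as reported in the chapter.
TERM_TOKENS = {"Term 1": 50_376, "Term 2": 48_530, "Term 3": 50_681}


def count_tokens_and_types(text: str) -> tuple[int, int]:
    """Return (running tokens, distinct types) for an already cleaned text.

    Hypothetical tokenizer: lower-cased alphabetic strings, keeping
    internal hyphens and apostrophes inside a single token.
    """
    tokens = re.findall(r"[a-z]+(?:['’-][a-z]+)*", text.lower())
    return len(tokens), len(set(tokens))


# The three subcorpora add up to the reported corpus size.
corpus_size = sum(TERM_TOKENS.values())
```

Applied to a short sentence such as "The role of the teacher is a key role.", the function counts nine tokens and seven types, since "the" and "role" each occur twice.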
The BNC academic written corpus
The academic written sub-set of the BNC World Edition (2000) was used as the ‘proficient writer’ comparison corpus. It consists of 501 texts totaling over 16 million words, selected from books and journal articles in the six disciplines proposed by Lee (2001): humanities/arts, medicine, natural science, politics/law/education, technical/engineering, and social science.
Procedure
All adjective-noun combinations were extracted from the learner corpus in the following manner. The corpus was searched for the 187 nouns from Sublist 1 of Coxhead’s (2000) Academic Word List (AWL), and those which were used by at least one participant over time (i.e. in at least two of the three terms) were selected. Then WordSmith 5.0 was used to locate adjacent adjective collocates for each of these recurring academic nouns. Collocations were excluded from analysis if they included one of the following constituents: hyphenated adjectives (e.g. corpus-based approach), pronouns, possessives, determiners, numbers/ordinals, adjectives signifying nationalities (e.g. Chinese, English), and terminology (e.g. Lexical Approach, Universal Grammar). This selection procedure produced 41 nouns, leading to 147 different adjective-noun collocation types in Term 1, 95 in Term 2, and 107 in Term 3, a total of 494 collocation tokens and 299 collocation types.
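The extraction step described above can be sketched roughly as follows. This is not the WordSmith procedure the authors used; it is an illustrative stand-in that assumes the text has already been part-of-speech tagged with coarse, hypothetical labels ("ADJ", "NOUN", and so on). The exclusion sets contain only the examples named in the text and are placeholders for the fuller lists the authors would have applied. Pronouns, possessives, determiners and numbers are excluded implicitly, on the assumption that the tagger does not label them "ADJ". Plural or inflected nouns are not lemmatized in this sketch.

```python
from collections import Counter

# Hypothetical stand-ins for the chapter's exclusion lists.
NATIONALITY_ADJECTIVES = {"chinese", "english"}
TERMINOLOGY = {("lexical", "approach"), ("universal", "grammar")}


def extract_adj_noun(tagged, node_nouns):
    """Count adjacent adjective + node-noun pairs in a tagged token list.

    `tagged` is a list of (word, tag) tuples; `node_nouns` is the set of
    lower-cased academic nouns to search for.
    """
    pairs = Counter()
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        adj, noun = w1.lower(), w2.lower()
        if t1 != "ADJ" or t2 != "NOUN" or noun not in node_nouns:
            continue
        if "-" in adj:  # hyphenated adjectives, e.g. corpus-based
            continue
        if adj in NATIONALITY_ADJECTIVES or (adj, noun) in TERMINOLOGY:
            continue
        pairs[(adj, noun)] += 1
    return pairs


sample = [
    ("an", "DET"), ("important", "ADJ"), ("role", "NOUN"), ("in", "ADP"),
    ("the", "DET"), ("corpus-based", "ADJ"), ("approach", "NOUN"),
    ("and", "CONJ"), ("the", "DET"), ("Lexical", "ADJ"), ("Approach", "NOUN"),
    ("plays", "VERB"), ("an", "DET"), ("important", "ADJ"), ("role", "NOUN"),
]
counts = extract_adj_noun(sample, {"role", "approach"})
```

In the sample, only "important role" survives the filters, with two tokens; "corpus-based approach" and "Lexical Approach" are dropped by the hyphen and terminology rules respectively.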
The number of collocate types (i.e. different collocations) and tokens (i.e. occurrences of each type) produced by each participant across terms was counted and recorded. For example, the node noun role was used with different adjectives by participant WL in her academic texts across three terms, that is, important in Term 1, central in Term 2, and key, potential, critical, significant in Term 3. Based on these frequency counts, the type-token ratio (TTR) of the collocations was calculated for each of the four participants. The t-score and MI value for each collocation type were also calculated.
Since it is claimed that low-frequency collocations jeopardize the reliability of all association measures (Manning & Schütze, 1999; Evert & Krenn, 2001), all the extracted collocations with fewer than four occurrences in the BNC academic corpus were excluded from MI and t-score calculations. Various researchers define their cut-off points differently: Manning and Schütze (1999) suggest a minimum of three occurrences; Stubbs (2001) five occurrences; Church and Hanks (1990) five occurrences. We used a cut-off point of four to include as large a set of learner collocations as possible.
To explore the four participants’ collocational development pattern over the period of 12 months, the TTR, t-score, and MI values for each participant were then averaged within each term and these averages compared across terms. Finally, to explore the development of the strongly-associated collocations preferred by expert writers, the adjective-noun collocations were ranked into different bands according to their MI values.
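The chapter does not spell out the formulas behind the two association measures. A minimal sketch, assuming the standard corpus-linguistic definitions (MI as the base-2 logarithm of observed over expected co-occurrence, and t-score as observed minus expected divided by the square root of observed), is given below together with the frequency cut-off of four. The function names and the frequencies in the example are invented for illustration, and specific tools may adjust the expected frequency for the collocation span, which this sketch ignores.

```python
import math

MIN_FREQ = 4  # cut-off used in the chapter for BNC pair frequency


def association_scores(pair_freq, adj_freq, noun_freq, corpus_size):
    """Return (MI, t-score) for an adjective-noun pair.

    Assumes the standard definitions:
      expected = f(adj) * f(noun) / N
      MI       = log2(observed / expected)
      t-score  = (observed - expected) / sqrt(observed)
    """
    expected = adj_freq * noun_freq / corpus_size
    mi = math.log2(pair_freq / expected)
    t_score = (pair_freq - expected) / math.sqrt(pair_freq)
    return mi, t_score


def is_robust(pair_freq, adj_freq, noun_freq, corpus_size):
    """Apply the chapter's criteria: F >= 4, MI > 3 and t-score > 2."""
    if pair_freq < MIN_FREQ:
        return False  # too rare for the measures to be reliable
    mi, t_score = association_scores(pair_freq, adj_freq, noun_freq, corpus_size)
    return mi > 3 and t_score > 2


# Invented frequencies in a 16-million-word reference corpus.
mi, t = association_scores(pair_freq=64, adj_freq=1_000,
                           noun_freq=2_000, corpus_size=16_000_000)
```

With these invented figures the expected co-occurrence is 0.125, giving an MI of 9.0 and a t-score of just under 8, so the pair would count as robust under the criteria used in this study.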
Results
The value of case-studies is the elicitation and analysis of rich data, and so we will report both the group results and the results of each individual participant.
Participants’ overall collocation use
Group
A total of 494 adjective-noun collocation tokens were identified from the learner corpus, made up of 299 types. Of these 299 types, over 40 per cent can be considered frequent and strongly-associated (which we will term robust in this section), at least according to the criteria of appearing four or more times in the 16 million-word BNC academic subcorpus and having association figures of MI>3 and t-score>2 (Table 2.2). On the other hand, there was a similar percentage of rarely-occurring combinations, that is, those appearing fewer than four times in the BNC subcorpus. However, in terms of instances of use (tokens), the participants used the robust collocations considerably more often than the infrequent ones: a TTR of 0.43 for robust collocations compared with 0.90 for infrequent ones. On average, each robust collocation type occurred more than twice in the 36 academic writing texts.
Table 2.2 Participant Group’s Overall Collocation Use
Adjective-noun collocations   Total           Term 1          Term 2          Term 3
                              Tokens  Types   Tokens  Types   Tokens  Types   Tokens  Types
F<4                           143     128     65      62      35      33      43      38
F≥4 & MI>3 & t-score>2        283     123     112     67      81      42      90      54
Total                         494     299     198     147     142     95      154     107
% of robust collocations      57.3    41.1    56.6    45.6    57.0    44.2    58.4    50.5
The table also shows how the participants’ use of robust collocations developed over the academic year. Although the number of types used declined after Term 1, the percentage of robust types used remained essentially the same from Term 1 to Term 2, and then increased slightly by the time the dissertation was written up in Term 3. In terms of tokens, the number of robust collocations used over the three terms showed a similar pattern to that of types, while the percentage of robust tokens only ranged from 56.6 per cent to 58.4 per cent during the academic year, and so was relatively stable. Overall, after the one-year exposure to an English academic environment, it appears that the participant group as a whole displayed little if any improvement in the number of robust adjective-noun collocation types or tokens produced in their academic writing.
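The summary figures quoted for the group follow directly from the counts in Table 2.2, as the short sketch below shows. The counts are taken from the table; the dictionary layout and helper names are illustrative only.

```python
# Overall counts from Table 2.2 (whole learner corpus).
TABLE_2_2 = {
    "infrequent": {"tokens": 143, "types": 128},  # F<4 in the BNC subcorpus
    "robust": {"tokens": 283, "types": 123},      # F>=4, MI>3, t-score>2
    "total": {"tokens": 494, "types": 299},
}


def percentage(part: int, whole: int) -> float:
    """Share of `part` in `whole`, in per cent, to one decimal place."""
    return round(100 * part / whole, 1)


def type_token_ratio(types: int, tokens: int) -> float:
    """Types divided by tokens, to two decimal places."""
    return round(types / tokens, 2)


robust, rare, total = (TABLE_2_2[k] for k in ("robust", "infrequent", "total"))

robust_token_share = percentage(robust["tokens"], total["tokens"])
robust_type_share = percentage(robust["types"], total["types"])
robust_ttr = type_token_ratio(robust["types"], robust["tokens"])
rare_ttr = type_token_ratio(rare["types"], rare["tokens"])
tokens_per_robust_type = round(robust["tokens"] / robust["types"], 1)
```

A lower TTR means more repetition: the robust collocations average about 2.3 tokens per type, whereas most infrequent combinations were used only once.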
Individual participants
Although the group results showed little improvement, an analysis of the individual participants’ results shows quite varied behaviors. Table 2.3 shows that although all participants used about 40 different collocation types in Term 1, LH and WL used steadily fewer over the year, while TT dropped in Term 2, but recovered somewhat in Term 3. YJ remained relatively stable in the number of types she used through the year.
Table 2.3 Participants’ Use of Robust Collocations over Three Terms
Participant  Adjective-noun collocation   Term 1    Term 1   Term 2    Term 2   Term 3    Term 3
                                          (tokens)  (types)  (tokens)  (types)  (tokens)  (types)
LH           F≥4 & MI>3 & t-score>2       20        17       16        13       9         8
             Total                        44        40       36        32       32        27
             % of robust collocations     45.5      42.5     44.4      40.6     28.1      29.6
TT           F≥4 & MI>3 & t-score>2       18        17       18        12       22        19
             Total                        43        41       32        22       38        33
             % of robust collocations     41.9      41.5     56.3      54.5     57.9      57.6
WL           F≥4 & MI>3 & t-score>2       36        19       20        13       17        11
             Total                        57        39       29        21       25        17
             % of robust collocations     63.2      48.7     69.0      61.9     68.0      64.7
YJ           F≥4 & MI>3 & t-score>2       38        28       27        19       42        27
             Total                        54        43       45        37       59        42
             % of robust collocations     70.4      65.1     60.0      51.4     71.2      64.3
The number of collocation tokens used by the four participants across the three terms displays a rather similar developmental trend to that of collocation types, except in the case of YJ. She used more collocation tokens by the end of the academic year (59 in Term 3 compared with 54 in Term 1), largely due to more frequent repetition (a TTR of 0.71 in Term 3 compared with 0.80 in Term 1).
It is also interesting to note the percentage of robust collocations used. In conjunction with her reduced diversity of collocation types/tokens, LH also dropped in the percentage of robust collocations, both types and tokens. Overall, her mastery of these collocations seems to have deteriorated over the year. Conversely, although WL declined in the number of collocation types/tokens used, her percentage of robust collocation types increased over the year, while her percentage of robust collocation tokens reached its peak in Term 2, then dropped slightly in Term 3. She therefore used
fewer types over time, but a greater percentage of those types were similar to those used by proficient English writers. YJ had a dip in Term 2, but ended up in Term 3 essentially where she began in Term 1, both in number of types and percentage of robust types/tokens. Thus, her usage of collocations was relatively stable over the year. TT also had a dip in numbers of types/tokens produced in Term 2, but her percentage of robust collocations (both types and tokens) steadily increased over the three terms. Overall, her figures indicate gradual improvement in collocation mastery.
Development in the diversity of adjective-noun collocations produced
The average number of adjective types used to describe academic nouns provides a general measure of the diversity of adjective-noun collocations produced. The group mean result in Table 2.4 exhibits U-shaped behavior, with the Term 3 figure not recovering to the Term 1 figure. However, this group average does not show the substantial differences between the individual participants. In fact, the group profile only serves to disguise the very real differences in the participants’ individual development of adjective variation.
Table 2.4 Average Number of Adjective Types per Noun across Three Terms
Participant  Term 1  Term 2  Term 3
LH           2.00    1.68    1.59
TT           1.67    1.57    1.83
WL           2.60    1.62    1.89
YJ           2.15    2.18    2.33
Mean         2.11    1.76    1.91
LH started with an average of two adjective types in the first term of her academic year, followed by a consistent decrease to 1.68 and then 1.59 in the next two terms. This continuous decline in the mean number of adjective types over the course of three terms indicates that LH used less diverse adjective-noun collocations (about 20 per cent less) by the end of her study abroad programme. In contrast, participant YJ showed the opposite developmental trend in the use of adjectives to describe academic node nouns in her academic writing tasks over the course of three terms. At the beginning of the academic year, an average of 2.15 adjective types were used. This figure rose to 2.18 and 2.33 respectively in the following two terms. The steady increase suggests that YJ used slightly more diverse (approximately 8.4 per cent more) adjective-noun collocations by the end of her MA course.
Both TT and WL experienced a decrease in the number of adjective types from Term 1 to Term 2, and a substantial increase from Term 2 to Term 3, although their developmental profiles are very different. WL’s employment of adjective types dropped sharply from 2.6 to 1.62 (a fall of about 37.7 per cent), followed by a substantial rise of nearly 17 per cent from 1.62 to 1.89. This left her using a less diverse range of adjective-noun collocations at the end of the year than at the start. On the other hand, TT initially experienced a slight decline of approximately 6 per cent from 1.67 to 1.57, followed by a rise of 16.6 per cent from 1.57 to 1.83. Unlike WL, by the end of the academic year, TT used more varied adjective-noun collocations (9.6 per cent more) compared with those used in Term 1.
Changes in the repetition of adjective-noun collocation
The TTR value provides an indication of how frequently collocations are repeated.
Table 2.5 shows the TTR value of target adjective-noun collocations for each participant and for the participant group as a whole. The TTR pattern for the group shows a rather stable and subtle decline from 0.91 to 0.87 over the academic year. This steady drop in TTR value over time suggests slightly more repetition of collocations by the four participants as a whole. However, as the decrease is less than 4.5 per cent, it is probably not particularly meaningful.
Table 2.5 Type-Token Ratios of Adjective-Noun Collocations across Three Terms
Participant  Term 1  Term 2  Term 3
LH           0.98    0.94    0.86
TT           0.98    0.81    0.95
WL           0.78    0.90    0.81
YJ           0.89    0.94    0.85
Mean         0.91    0.90    0.87
Of more interest is the individual behavior, which again varies substantially among the participants. The only participant with a profile which in any way resembles the group profile is LH, and even here the rate of decrease is much more extreme than in the group profile. She started with a TTR of 0.98 in Term 1, dropping to 0.94 and 0.86 in Terms 2 and 3 respectively. This steady decline (a decrease of about 12 per cent) in the TTR suggests that LH tended to repeat collocations more often at the end of the academic year.
Unlike LH, participant TT’s TTR value underwent a noticeable fluctuation over the course of three terms. Her TTR value began at 0.98 in Term 1, followed by a considerable drop to 0.81 in Term 2, and ended up nearly where it started, at 0.95, by the end of her MA course. It is difficult to say what caused the drop in Term 2, other than to note that it was not based on a single aberrant paper, as TT submitted four papers in this term, as did all the participants. Participants WL and YJ share a similar trend, both experiencing a rise in TTR in Term 2, and a drop afterward in Term 3. Overall, WL showed slightly more repetition of collocations than YJ throughout the year.
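The percentage changes quoted in this section are relative changes between the Term 1 and Term 3 values. A brief sketch of that calculation, using the TTR values from Table 2.5, is given below; the helper name is illustrative.

```python
# Type-token ratios from Table 2.5 (Term 1, Term 2, Term 3).
TTR = {
    "LH": (0.98, 0.94, 0.86),
    "TT": (0.98, 0.81, 0.95),
    "WL": (0.78, 0.90, 0.81),
    "YJ": (0.89, 0.94, 0.85),
    "Mean": (0.91, 0.90, 0.87),
}


def percent_change(start: float, end: float) -> float:
    """Relative change from `start` to `end`, in per cent (one decimal)."""
    return round(100 * (end - start) / start, 1)


# Change over the whole academic year (Term 1 to Term 3) for each profile.
yearly_change = {
    name: percent_change(term1, term3)
    for name, (term1, _term2, term3) in TTR.items()
}
```

For the group mean this gives a fall of 4.4 per cent, and for LH a fall of 12.2 per cent, which is why the group decline is treated as negligible while LH's is not.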
Development of high-frequency and typical collocation use
We have seen changes in the participants’ diversity and repetition of adjective-noun collocation use, and now focus on their production of the type of collocations frequently used by native professional writers in their academic publications, as measured by the t-score statistic and the BNC Academic reference corpus (Table 2.6). The group result indicates essentially no change in t-score from Term 1 to Term 2 (5.31 and 5.30), and then a slight improvement to 5.44 in Term 3. This suggests that the four participants as a group used slightly more frequent/typical adjective-noun collocations in their dissertations than in their earlier assignments.
Table 2.6 T-scores of Adjective-Noun Collocations across Three Terms
Participant  Term 1  Term 2  Term 3
LH           5.20    5.47    5.10
TT           5.09    5.40    5.68
WL           5.60    5.29    5.94
YJ           5.36    5.05    5.04
Mean         5.31    5.30    5.44
However, in this case, the group profile accurately represents none of the individual participants’ profiles. LH’s average t-score in Term 1 was 5.20, which rose to 5.47 in Term 2 and then dropped to 5.10 in Term 3. Thus, over the year, there was no improvement in LH’s higher-frequency collocation use, which suggests that LH did not come to use more of the native-like adjective-noun collocations which are commonly used by professional expert writers in academic texts. WL’s developmental trend is almost a mirror image of LH’s, with t-score averages of 5.60, 5.29 and 5.94 for Terms 1, 2, and 3 respectively. It seems that by the end of the MA programme, when WL wrote up her high-stakes dissertation, she tended to use collocations with higher frequency levels, compared with those used in her earlier assignments. YJ produced a declining profile, dropping from an initial t-score of 5.36 to 5.05 and then 5.04, which indicated a tendency to use adjective-noun collocations which were less frequent and typical by the end of the academic year. Finally, TT produced the type of profile which one might expect given the rich linguistic environment, consistently rising throughout the three terms. She started with an average t-score of 5.09, which thereafter rose to 5.40 and 5.68 in the following two terms. This steady increase suggests that the collocations which occurred in TT’s academic writing assignments over the course of the 12-month postgraduate programme were, generally speaking, increasingly typical of proficient writers.
Development of strongly-associated collocation use
Since MI value is known to emphasize a rather different set of collocations from t-score (Schmitt, in press), a similar analysis was carried out using the MI statistic. It highlights collocations which are typically not very frequent, but which are strongly associated when they do occur (e.g. tectonic plates). The group averages (Table 2.7) show a very shallow U-shaped profile, which can probably be best interpreted as no meaningful change across the different terms. But again, the group averages do not accurately represent any of the individual profiles. LH’s collocation use showed a continuous decline in MI values from 4.33 to 3.95 over the course of three terms. This consistent decrease suggests that participant LH tended to use adjective-noun collocations with less association strength in her dissertation, compared with those used in her writing tasks completed in Term 1.
By contrast, TT’s collocation use displayed the opposite developmental direction. Her MI averages increased over time (4.51, 4.51, 5.48), which indicates that TT’s use of adjective-noun collocations by the end of her study abroad programme was more nativelike, since such strongly-connected collocations characterized the professional writers’ academic texts in the BNC sub-corpus. Although participants WL and YJ underwent completely different developmental trends over the year, they both ended up with lower MI scores in comparison with their initial levels in Term 1. Despite the fluctuations which took place over the 12-month postgraduate programme, both WL and YJ showed little growth in the employment of adjective-noun collocations with stronger association strength. This suggests that the collocations used by participants WL and YJ did not become more expert-writer-like after the one-year exposure to the academic target language environment.
Table 2.7 MI Scores of the Adjective-Noun Collocations across Three Terms
Participant  Term 1  Term 2  Term 3
LH           4.33    4.09    3.95
TT           4.51    4.51    5.48
WL           4.50    4.80    4.46
YJ           5.06    4.42    4.75
Mean         4.60    4.46    4.66
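The point that group averages can mask diverging individual profiles can be made concrete with the MI values from Table 2.7. The sketch below recomputes the group mean per term and the net Term 1 to Term 3 change for each participant. Decimal arithmetic with half-up rounding is an assumption adopted here so that the recomputed means match the two-decimal figures printed in the table; the chapter does not state how its averages were rounded.

```python
from decimal import Decimal, ROUND_HALF_UP

# Mean MI scores per participant from Table 2.7 (Term 1, Term 2, Term 3).
MI_SCORES = {
    "LH": ("4.33", "4.09", "3.95"),
    "TT": ("4.51", "4.51", "5.48"),
    "WL": ("4.50", "4.80", "4.46"),
    "YJ": ("5.06", "4.42", "4.75"),
}


def group_means(scores):
    """Mean per term across participants, rounded half-up to two decimals."""
    per_term = zip(*[[Decimal(v) for v in vals] for vals in scores.values()])
    return [
        (sum(term) / len(term)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
        for term in per_term
    ]


def net_change(values):
    """Difference between the last and the first term for one participant."""
    return Decimal(values[-1]) - Decimal(values[0])


means = group_means(MI_SCORES)
changes = {name: net_change(vals) for name, vals in MI_SCORES.items()}
decliners = [name for name, delta in changes.items() if delta < 0]
```

The group mean moves by only 0.06 between Term 1 and Term 3, yet three of the four participants end the year lower than they began, and the small net rise comes entirely from TT's gain of 0.97.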
Differences in the distribution of collocations according to MI banding
In order to investigate the participant group’s collocation patterns in terms of the distribution of strength of association over time, we classified all the adjective-noun collocations used by the four Chinese MA students into five bands on the basis of their MI score values and raw frequency counts obtained from the reference BNC sub-corpus. The MI statistic tends to highlight collocations which are not frequent, but which are highly associated, and are thus likely to be very salient to native speakers (and perhaps proficient non-natives as well). It is thus useful to explore whether the participants began using more of the higher MI collocations, as these may be particularly important in providing a sense of native-likeness to written
compositions (Durrant & Schmitt, 2009). We use the term ‘non-associated’ to represent those two-word combinations which are either unattested or have a raw frequency of below four in the BNC academic texts. (We recognize that these very infrequent combinations may well be associated, but use this terminology in order to clearly differentiate these combinations from our other categories.) ‘Weak-strength collocations’ are those which occur four or more times in the BNC reference corpus with an MI score of less than 3. ‘Moderate-strength collocations’ have MI scores of 3<MI<5, ‘stronger collocations’ have MI scores of 5<MI<7, and ‘extremely strong collocations’ have MI scores over 7 (all with frequencies of four or more in the BNC sub-corpus).
When we look at the four participants as a group, the developmental pattern measured by MI values across three terms is shown in Table 2.8. The combined percentage of ‘non-associated’ and ‘weak-strength’ collocation types decreased from nearly 50 per cent in Term 1 to 44.6 per cent in Term 3. This subtle drop indicates that the four participants as a whole used a somewhat lower percentage of less native-like lower-strength collocation types in their academic writing. On the other end of the scale, the participants used only small percentages of ‘extremely strong’ collocation types, and the amount of usage remained about the same over the year. The ratios for both ‘moderate-strength’ and ‘stronger’ collocation categories rose slightly from 24.5 per cent to 27 per cent, and from 20.1 per cent to 24.7 per cent respectively. Overall, the group produced slightly fewer non-associated combinations, and slightly more weak/moderate/stronger strength collocations, although the amount of the most strongly associated collocations (extremely strong collocations) decreased slightly.
Table 2.8 Participant Group’s MI Distribution across Three Terms (Collocation Types)
Band          Term 1  Term 2  Term 3
MI>7          5.4%    1.9%    3.6%
5<MI<7        20.1%   31.9%   24.7%
3<MI<5        24.5%   22.5%   27.0%
MI<3          11.6%   16.9%   12.2%
Frequency<4   38.3%   26.9%   32.4%
This group summary can be compared against the individual results. Table 2.9 shows that LH’s combined use of ‘non-associated’ and ‘weak-strength’ collocation types increased over the course of the academic year, from a total of 57.5 per cent in Term 1 to 70.4 per cent in Term 3. The consistently rising figures indicate that LH used increasingly larger proportions of less strongly associated collocation types over the course of the year. As for those collocations with an MI value of above 3, the drop in percentage of ‘moderate-strength’ collocation types is largely offset by the increase in the ‘stronger’ collocation types, which would suggest some shift towards the use of more strongly associated collocations. However, working against this conclusion is the disappearance of all collocations with an MI over 7.
Table 2.9 LH’s MI Distribution across Three Terms (Collocation Types)
Band          Term 1  Term 2  Term 3
MI>7          5.0%    0.0%    0.0%
5<MI<7        10.0%   18.8%   18.5%
3<MI<5        27.5%   21.9%   11.1%
MI<3          10.0%   12.5%   18.5%
Frequency<4   47.5%   46.9%   51.9%
Table 2.10 TT’s MI Distribution across Three Terms (Collocation Types)

| MI band | Term 1 | Term 2 | Term 3 |
| --- | --- | --- | --- |
| MI>7 | 7.3% | 0.0% | 12.1% |
| 5<MI<7 | 14.6% | 40.9% | 21.2% |
| 3<MI<5 | 19.5% | 22.7% | 27.3% |
| MI<3 | 14.6% | 27.3% | 9.1% |
| Frequency<4 | 43.9% | 9.1% | 30.3% |

TT’s MI distribution profile is completely different from that of LH. As shown in Table 2.10, TT used a smaller percentage of ‘non-associated’ and ‘weak-strength’ collocation types in her course assignments written up in Terms 2 (36.4 per cent) and 3 (39.4 per cent) compared to Term 1 (58.5 per cent). Over the same period, she used a larger proportion of the ‘extremely strong’ adjective-noun collocation types. The most noticeable feature of her profile is the great increase in ‘stronger collocations’ in Term 2, followed by an equally dramatic decrease in Term 3. Overall, TT increased her percentage of collocations with MI scores of 3 or above from Term 1 to Term 2, and then remained relatively stable in this regard from Term 2 to Term 3, with just the distribution among the three strongest bands varying. A similar overall summary also describes WL’s profile (Table 2.11), although the increase of collocation use in the 3<MI<5 band is noticeable in her profile.

Table 2.11 WL’s MI Distribution across Three Terms (Collocation Types)

| MI band | Term 1 | Term 2 | Term 3 |
| --- | --- | --- | --- |
| MI>7 | 0.0% | 4.8% | 0.0% |
| 5<MI<7 | 25.6% | 38.1% | 23.5% |
| 3<MI<5 | 23.1% | 23.8% | 41.2% |
| MI<3 | 10.3% | 14.3% | 11.8% |
| Frequency<4 | 41.0% | 19.0% | 23.5% |

Table 2.12 YJ’s MI Distribution across Three Terms (Collocation Types)

| MI band | Term 1 | Term 2 | Term 3 |
| --- | --- | --- | --- |
| MI>7 | 9.3% | 2.7% | 2.4% |
| 5<MI<7 | 30.2% | 29.7% | 35.7% |
| 3<MI<5 | 27.9% | 21.6% | 28.6% |
| MI<3 | 11.6% | 13.5% | 9.5% |
| Frequency<4 | 20.9% | 32.4% | 23.8% |

Table 2.12 displays YJ’s changes of collocation use measured by MI values over the three terms. If we look at the combined ‘non-associated’ and ‘weak-strength’ percentages, we find that although there was a peak in Term 2, the Term 3 figure (33.3 per cent) was essentially the same as at Term 1 (32.5 per cent). Likewise, the moderate-strength percentages remained nearly
the same in Terms 1 and 3. The increase in the percentage of ‘stronger collocations’ (+5.5 percentage points) is mostly accounted for by the decrease in ‘extremely strong collocations’ (–6.9 percentage points). Overall, there was a spike in the percentage of ‘non-associated’ and ‘weak-strength’ collocations in Term 2, but by Term 3 this had been rectified, and YJ ended the academic year largely where she started in Term 1.
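
The five-band scheme behind Tables 2.8 to 2.12 can be summarised as a small classification rule. The Python sketch below is illustrative only: the function name `mi_band` is hypothetical, the example values are invented, and the handling of exact boundary values (an MI of exactly 3, 5 or 7) is an assumption, since the text states the bands as open intervals.

```python
def mi_band(frequency: int, mi: float) -> str:
    """Assign an adjective-noun combination to one of the five bands.

    frequency: raw frequency in the reference corpus (0 if unattested).
    mi: mutual information score in the reference corpus.
    Assumption: boundary values fall into the higher band.
    """
    if frequency < 4:
        # Unattested or very infrequent, whatever the MI score.
        return "non-associated"
    if mi < 3:
        return "weak-strength"
    if mi < 5:
        return "moderate-strength"
    if mi < 7:
        return "stronger"
    return "extremely strong"


# Hypothetical (frequency, MI) pairs, invented purely for illustration.
examples = [(2, 8.1), (12, 1.4), (30, 4.2), (9, 6.5), (15, 9.0)]
for freq, mi in examples:
    print(freq, mi, mi_band(freq, mi))
```

Note that the frequency check comes first, so a rare combination counts as ‘non-associated’ even when its MI score is high, which matches the way the ‘Frequency<4’ row is kept separate in the tables.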
Discussion
The multiple case-study approach used in this study produced a rich set of data, which was analysed in a number of ways: the number of types and tokens produced, the diversity of adjectives used with each academic noun, the amount of repetition of each adjective-noun collocation, how the collocations produced compared to those in the BNC academic reference corpus according to t-score (largely frequency-based) and MI score (largely based on strength of association), and the degree of strength of association according to a five-band MI rating scale.
Overall, the participant group used fewer adjective-noun collocations (both types and tokens) over the course of the academic year, although the percentage of ‘robust’ collocations (types and tokens) increased slightly. The diversity of collocation (average number of adjective types per academic noun) decreased across the year, during which the collocations were repeated slightly more often in later academic writing tasks. This indicates that the group of four Chinese postgraduates as a whole demonstrated a tendency to use a somewhat smaller group of collocations more repetitively by the end of the 12-month MA programme.
In terms of how ‘native-like’ the collocations were, the group produced a modest increase in t-score over the year, while the MI scores remained relatively static. When the strength of association was explored by MI banding, the group profile was largely similar in Terms 1 and 3. In sum, the statistical approach used in this study was able to show relatively little substantial change in the production of adjective-noun collocations over the course of an academic year. The MI bandings showed some improvement from Term 1 to Term 2 in the decreased use of MI<3 collocations, but this plateaued out and there was no further improvement at Term 3. However, for most of the measures, the group results fail to adequately represent the individual participants.
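
To make the contrast between the group and individual profiles concrete, the combined share of ‘non-associated’ and ‘weak-strength’ types can be re-added for each participant from Tables 2.8 to 2.12. The sketch below only sums the published percentages; the names `LOW_BANDS` and `combined_low_share` are illustrative and not part of the study's procedure.

```python
# Percentages of 'Frequency<4' and 'MI<3' types for Terms 1-3,
# copied from Tables 2.8 (group) and 2.9-2.12 (individual participants).
LOW_BANDS = {
    "Group": {"freq_lt_4": (38.3, 26.9, 32.4), "mi_lt_3": (11.6, 16.9, 12.2)},
    "LH": {"freq_lt_4": (47.5, 46.9, 51.9), "mi_lt_3": (10.0, 12.5, 18.5)},
    "TT": {"freq_lt_4": (43.9, 9.1, 30.3), "mi_lt_3": (14.6, 27.3, 9.1)},
    "WL": {"freq_lt_4": (41.0, 19.0, 23.5), "mi_lt_3": (10.3, 14.3, 11.8)},
    "YJ": {"freq_lt_4": (20.9, 32.4, 23.8), "mi_lt_3": (11.6, 13.5, 9.5)},
}


def combined_low_share(name):
    """Return the combined low-strength percentage for each of the three terms."""
    bands = LOW_BANDS[name]
    return [round(a + b, 1) for a, b in zip(bands["freq_lt_4"], bands["mi_lt_3"])]


for name in LOW_BANDS:
    term1, term2, term3 = combined_low_share(name)
    change = round(term3 - term1, 1)
    print(f"{name}: Term 1 {term1}, Term 2 {term2}, Term 3 {term3}, change {change:+}")
```

On these figures the group share falls by about five percentage points between Terms 1 and 3, while LH's rises by about 13, TT's falls by about 19, WL's falls by about 16, and YJ's is almost unchanged.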
LH showed fairly consistent decreases in the ability to use collocations on all of the measures. Conversely, TT produced
improved figures on most of the indices. YJ showed fluctuations through the year, but largely ended up in Term 3 near where she started in Term 1, with some indices slightly up and others slightly down. WL produced a mixed set of results, with fewer types produced, but a great percentage of robust collocations. Her diversity of collocations decreased over the year, although so did the amount of repetition, albeit slightly. She used a higher percentage of collocations with higher t-scores, but those same collocations were about the same at the beginning and end of the year in terms of MI score. Indeed, perhaps the most interesting result of the study is the demonstration that the group figures painted a misleading picture of all the participants. The group results showed little real change, yet the study had one declining student, one improving student, and one with mixed results. There was also one student who ended up nearly where she started, but even here the group figures disguised the amount of variation in YJ’s results. Learners typically have a great deal of variation in their acquisition and use results, and this is particularly true of vocabulary (Meara, 1996). The small number of participants in this study also makes it hard for the variation to be ‘evened out’. Nevertheless, these results provide a warning to researchers of formulaic language to be careful about generalizing individual behavior from group averages. It may be that the acquisition and use of formulaic language is so idiosyncratic that group averages will have difficulty in making useful statements about any of the individuals within the group. In other words, the group scores may serve only to average out all of the variation inherent in the group, and thus provide a misleadingly ‘smooth’ representation of what might be quite different behaviors for all of the participants involved. This is clearly seen in this study, where the participants were very similar to each other. 
They were all Chinese, female, and of a similar age. They went through the same school system, and attended mainly the same MA courses during their stay in Nottingham. They (with the exception of TT who took the TOEFL) had the same IELTS overall score (6.5) and the same writing score (6.0). Thus we might expect that they would develop the adjective-noun collocations in similar ways. Yet in spite of this, they usually ended up with four quite different profiles in each of the measures. If such a small and homogeneous group as this demonstrates such varying behavior, then it appears that researchers will need to be cautious in their approach to group data in formulaic language research. This point has already been acknowledged by Howarth (1996), who argues that researchers
can lose opportunities for identifying significant differences among learners’ processing mechanisms by extracting an average performance from a corpus of various non-native writers. He also goes one step further, and claims that non-native language proficiency is best researched by means of small-scale manual analysis (such as carried out in this study). This raises the interesting issue of how much variation is inherent in the acquisition and use of formulaic language by second language learners, compared to the amount of consistency among learners. On all of our measures there was variation, but without firm benchmarks, it is difficult to interpret those variations. For example, the average t-scores of the collocations moved between 5.00 and 6.00 for our participants. But is this just normal fluctuation and not meaningful? Perhaps it takes 2–3 full points movement to indicate a truly meaningful change in collocation behavior? Without such benchmarks, researchers can describe the variation, but it is difficult to know if the variation represents real change. Unfortunately, to our knowledge, there are no established benchmarks for t-score, MI, or types and tokens which show the degree of change which indicates real improvement or decline for those measures. Indeed, for t-score and MI, the advice seems to be that they are best used for ranking collocations against each other, rather than providing absolute measures of strength of association (Stubbs, 1995; Manning & Schütze, 1999). Most of the measures in this study have been widely used in the study of formulaic language, but it has largely been descriptive up until now. If we are to use them to inform about the acquisition of formulaic language, the field will either need to somehow establish benchmarks to work against, or to move to alternative methodologies. One of these methodologies that worked well in a single case study of the acquisition of formulaic language was expert rating panels (Li & Schmitt, 2009). 
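
For readers less familiar with the two association measures, the sketch below shows one common textbook formulation of each (MI in the spirit of Church & Hanks, 1990). It is not taken from this study's procedure: the exact formulas, corpus size and any span settings used here are not given in this section, the function names are hypothetical, and the counts in the example are invented.

```python
import math


def expected_frequency(f_word1, f_word2, corpus_size):
    """Co-occurrence frequency expected if the two words were independent."""
    return f_word1 * f_word2 / corpus_size


def mi_score(f_pair, f_word1, f_word2, corpus_size):
    """Mutual information: log2 of observed over expected co-occurrence."""
    expected = expected_frequency(f_word1, f_word2, corpus_size)
    return math.log2(f_pair / expected)


def t_score(f_pair, f_word1, f_word2, corpus_size):
    """t-score: (observed - expected) divided by the square root of observed."""
    expected = expected_frequency(f_word1, f_word2, corpus_size)
    return (f_pair - expected) / math.sqrt(f_pair)


# Invented counts in a hypothetical one-million-word corpus.
rare_pair = dict(f_pair=16, f_word1=1000, f_word2=500, corpus_size=1_000_000)
frequent_pair = dict(f_pair=400, f_word1=20000, f_word2=10000, corpus_size=1_000_000)

for label, counts in [("rare pair", rare_pair), ("frequent pair", frequent_pair)]:
    print(label, round(mi_score(**counts), 2), round(t_score(**counts), 2))
```

In this invented example the rare pair scores higher on MI but lower on t-score than the frequent pair, which illustrates why the two measures rank collocations differently and why neither offers an obvious absolute benchmark.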
The present study has attempted to use a statistical approach to investigate a longitudinal learner corpus in order to identify the learners’ improvement in collocation use over the course of one academic year. This is different from previous research which used statistical measurements to explore the differences in the collocation use between native and non-native speakers (e.g. Lorenz, 1999; Durrant & Schmitt, 2009). Such research has successfully used association measurements of collocation (i.e. t-score and MI) as valid discriminators between the two populations with different proficiency levels, but this study seems to suggest that they are less efficient in identifying improvement in collocation of advanced L2 learners over a relatively short period of time. For example, Lorenz
(1999) and Durrant and Schmitt (2009) found that proficient writers prefer collocations with high MI values but relatively low t-scores, but that the reverse is true for less proficient non-native writers. The present study cannot clearly distinguish Chinese L2 postgraduates’ developmental phases of collocation use within the 12-month period of investigation. This is not surprising when we take into account the relatively short-term investigation undertaken. One academic year may simply not be long enough for advanced level Chinese MA students to show meaningful improvement in collocation use.
The present study is not without its limitations. Firstly, the BNC academic sub-corpus, which consists of research articles, books, and book sections from all disciplines, is not a parallel reference corpus. Since a number of studies (e.g. Cortes, 2004; Biber, 2006; Hyland, 2008) have shown that there are considerable variations in the frequency of forms and structures across different types of academic writing texts, the use of adjective-noun collocations may vary from discipline to discipline. If a reference native corpus containing the specific texts from students’ MA reading list were compiled, it would allow a more parallel comparison. In addition, it should be noted that the size of the longitudinal learner corpus is relatively small, consisting of only four participants’ writing tasks written within an academic year. This case-study approach has allowed a detailed exploration of individual progress, but a larger-sized longitudinal learner corpus built up over a longer period may yield a more insightful account of L2 learners’ collocation development over time.
References
Adolphs, S., & Durow, V. (2004). Social-cultural integration and the development of formulaic sequences. In N. Schmitt (Ed.), Formulaic sequences (pp. 107–26). Amsterdam: John Benjamins.
Biber, D. (2006). University language: A corpus-based study of spoken and written registers. Amsterdam: John Benjamins.
Church, K., & Hanks, P. (1990). Word association norms, mutual information and lexicography. Computational Linguistics, 16 (1), 22–29.
Cortes, V. (2004). Lexical bundles in published and student disciplinary writing: Examples from history and biology. English for Specific Purposes, 23 (3), 397–423.
Cowie, A. P. (1998). Phraseological dictionaries: some East-West comparisons. In A. P. Cowie (Ed.), Phraseology: Theory, analysis, and applications (pp. 145–60). Oxford: Oxford University Press.
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34 (2), 213–38.
Durrant, P., & Schmitt, N. (2009). To what extent do native and nonnative writers make use of collocations? International Review of Applied Linguistics, 47, 157–177.
Evert, S., & Krenn, B. (2001). Methods for the qualitative evaluation of lexical association measures. Paper presented at the 39th Annual Meeting of the Association for Computational Linguistics, Toulouse, France.
Foster, P. (2001). Rules and routines: A consideration of their role in the task-based language production of native and non-native speakers. In M. Bygate, P. Skehan, & M. Swain (Eds.), Language tasks: Teaching, learning and testing (pp. 74–93). Harlow: Longman.
Granger, S. (1998). Prefabricated patterns in advanced EFL writing: Collocations and formulae. In A. P. Cowie (Ed.), Phraseology: Theory, analysis, and applications (pp. 79–100). Oxford: Oxford University Press.
Howarth, P. (1996). Phraseology in English academic writing: Some implications for language learning and dictionary making. Tübingen: Niemeyer.
Howarth, P. (1998). The phraseology of learners’ academic writing. In A. P. Cowie (Ed.), Phraseology: Theory, analysis, and applications (pp. 161–86). Oxford: Oxford University Press.
Huang, J., & Hatch, E. (1978). A Chinese child’s acquisition of English. In E. Hatch (Ed.), Second language acquisition: A book of readings (pp. 118–31). Rowley, MA: Newbury House.
Hyland, K. (2008). Academic clusters: Text patterning in published and postgraduate writing. International Journal of Applied Linguistics, 18 (1), 41–62.
Lee, D. (2001). Genres, registers, text types, domains and styles: Clarifying the concepts and navigating a path through the BNC jungle. Language Learning & Technology, 5 (3), 37–72.
Li, J., & Schmitt, N. (2009). The acquisition of lexical phrases in academic writing: A longitudinal case study. Journal of Second Language Writing, 18, 85–102.
Lorenz, G. (1999). Adjective intensification – learners versus native speakers: A corpus study of argumentative writing. Amsterdam and Atlanta: Rodopi.
Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.
Meara, P. (1996). The classical research in vocabulary acquisition. In G. Anderman & M. Rogers (Eds.), Words, words, words (pp. 27–40). Clevedon: Multilingual Matters. Retrieved December 20, 2008, from http://www.lognostics.co.uk/vlibrary/index.htm.
Nesselhauf, N. (2003). The use of collocations by advanced learners of English and some implications for teaching. Applied Linguistics, 24 (2), 223–42.
Nesselhauf, N. (2005). Collocations in a learner corpus. Amsterdam/Philadelphia: John Benjamins.
Pawley, A., & Syder, F. H. (1983). Two puzzles for linguistic theory: Nativelike selection and nativelike fluency. In J. C. Richards & R. W. Schmidt (Eds.), Language and communication (pp. 191–226). New York: Longman.
Schmitt, N. (in press). A vocabulary research manual. Palgrave Press.
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Siyanova, A., & Schmitt, N. (2008). L2 learner production and processing of collocation: A multi-study perspective. The Canadian Modern Language Review, 64 (3), 429–58.
Skehan, P. (1998). A cognitive approach to language learning. Oxford: Oxford University Press.
Stubbs, M. (1995). Collocations and semantic profiles: On the cause of the trouble with quantitative studies. Functions of Language, 2 (1), 23–55.
Stubbs, M. (2001). Texts, corpora, and problems of interpretation: A response to Widdowson. Applied Linguistics, 22 (2), 149–72.
Wong-Fillmore, L. (1976). The second time around: Cognitive and social strategies in second language acquisition. Doctoral dissertation: Stanford University.
Wray, A. (2002). Formulaic language and the lexicon. Cambridge: Cambridge University Press.
Chapter 3
Idiomatically Speaking: Effects of Task Variation on Formulaic Language in Highly Proficient Users of L2 French and Spanish¹
Fanny Forsberg and Lars Fant
Stockholm University
Introduction
Formulaic language in second language use
Formulaic language (FL) is commonly defined as consisting of multiword structures which carry a conventionalized ‘holistic’ meaning and which have not been generated in any sense by the grammar component of the language to which they belong (i.e. they are considered to be ‘non-productive’; cf. Erman & Warren, 2000; Wray, 2002). In second language acquisition (SLA) research, holistic and non-productive multiword structures have been discussed from various perspectives, for instance as communicative strategies for early learners (Krashen & Scarcella, 1978), as a basis for creative rule development (Wong-Fillmore, 1976; Myles, Mitchell, & Hooper, 1998, 1999) or as indexes of idiomaticity (Yorio, 1989; Wray, 2002; Ellis, 2002a, 2002b; Schmitt et al., 2004; Forsberg, 2008). A natural consequence of this diversity of approaches is the fact that different researchers have come to focus on quite different kinds of sequences, depending on the level of the learners studied and on which specific functions are attributed to the sequences. A second language (L2) beginner, for instance, may have learned sequences such as ‘my name is’ or ‘I like’ as non-productive chunks, understood by the analyst to be formulaic; this is the perspective taken by, for example, Krashen and Scarcella (1978). Other researchers such as Schmitt, Grandage et al. (2004), Lewis (2008) and Bardovi-Harlig (2008) are more interested in expressions which are shared by an entire speech community, for example, ‘as a matter of fact’ or ‘you’re welcome’.
An encompassing view, then, would be to regard formulaic language as a concept covering holistic non-productive strategies that range from an individual and often transitory usage to a collective and stable usage. However, only sequences used by an L2 learner which correspond to preferred patterns of an entire speech community will convey true idiomaticity, and this kind of sequence, that is, formulaic sequences which reflect nativelike selection (Pawley & Syder, 1983), is the focus of the present study. A basic assumption, then, is that L2 learners’ use of FL is a valid yardstick for measuring the idiomaticity of their language production as a whole.
What is ‘Advanced’ second language use?
Although there is a well-established belief that FL could constitute a stumbling block for L2 learners, few researchers have set out to investigate in which ways highly proficient L2 speakers actually cope with these sequences. Another problem is that the learners considered in most studies to be ‘advanced’ do not, in fact, seem to have reached a very high level of integration with the target community, let alone to have acquired a degree of proficiency that would qualify as near-native. The subjects of these studies are often higher education language students who studied in language classrooms (e.g. Nesselhauf, 2005) or during a stay abroad (e.g. Schmitt, Grandage et al., 2004). It seems to be the case that FL requires more target language exposure and integration than many other linguistic phenomena (Dörnyei, Durow, & Khawla, 2004). Considering the actual level of proficiency of the subjects in most studies addressing this issue, it is hardly surprising that their authors conclude that the differences to be found between natives and non-natives in their use of FL, whether in speech or writing, are considerable indeed.
In the present study, non-native subjects will be considered in their capacity as L2 users (cf. Cook, 2002) rather than L2 learners. Furthermore, the L2 users studied are people who have lived in the target community for many years, having used the L2 as their prevailing communication code. In fact, Forsberg (2008), looking at this type of learner of French L2, has suggested that there are no significant differences at all between this category of L2 French users and native speakers of French as regards the quantity of FL produced, or the category distribution or even the type frequencies. However, her results were based on data drawn from a limited number of participants and produced in a semi-structured self-presentation interview.
Since this may be considered a relatively easy task to perform, the question
arises whether the same tendency will hold in more difficult tasks, which is why the present study is based on task variation.
Activity types and task variation
The perceived difficulty of a task could arguably be seen as depending on how familiar an individual is with the corresponding activity type. Since formulaic sequences (FSs) emerge in and depend on context (MacWhinney, 2001), it can be assumed that all language users, native or non-native, master formulaic sequences associated with ‘common’ situations better than those occurring in unfamiliar situations (cf. Tavakoli & Skehan, 2005, on the notion of ‘familiarity’). Accordingly, ‘familiar’ activities would have to be contrasted with ‘non-familiar’ activities in order to obtain a clearer picture of possible differences between highly proficient L2 users and native speakers.
The importance of performance differences across tasks has been acknowledged and thoroughly discussed over more than a decade by proponents of the ‘task-based language teaching paradigm’. Most of this body of research is interested in how complexity, accuracy and fluency (the so-called CAF parameters) are affected by task variation. Robinson and Gilabert (2007) have put forward the ‘cognition hypothesis’ according to which, among other things, the learner’s linguistic output will be more complex if the task is also a cognitively more complex one. A different position is taken by Skehan (1998), whose ‘limited attentional capacity model’ suggests that learners find difficulties in attending to all three CAF features at the same time, so that an increased focus on one will entail a diminished focus on the other two. This also implies that some tasks are more suited for testing complexity or accuracy and others more suitable for assessing fluency.
Foster (2001) provides the only study to our knowledge that links FSs with task completion. The author finds that planning time does not affect (intermediate level) learners’ use of FSs, in contrast to native speakers, who use more formulaic sequences when they are not given any planning time.
Foster’s findings suggest, first of all, that intermediate learners’ repertoire of FSs is not very large and therefore planning time is of little importance. More importantly, however, the study suggests that native speakers tend to use more formulaic sequences in time-constrained situations when they have to rely on automaticity. An ensuing question is to what extent high-proficient users will behave like natives in this regard. It should be added that even less advanced learners vary their choice of linguistic expression across different tasks, as shown by Tyne (2005). It is by
no means a far-fetched assumption that people do not speak in the same way in all situations, regardless of whether they are native speakers, L2 learners or high-proficient L2 users.
Aims and research questions
The first aim of this study concerns the effects that task properties may have on highly proficient L2 speakers’ formulaic language as compared to native use. To this effect, two different communicative tasks were designed, one of which is dialogic and represents an activity type based on common experience, while the other is monologic and represents a less ‘natural’ activity type².
The second aim regards the choice of L2s. Most SLA studies on formulaic language to date have focused on English³. Apart from providing an opportunity to compare the degree and nature of formulaicity in different languages, it can be assumed that taking several languages into consideration will yield a clearer and more generalisable picture of L2 use and acquisition. In this study, two L2s, French and Spanish, both of them Romance languages, are involved⁴, and there are four groups of participants: high-proficient L2 users of French, high-proficient L2 users of Chilean Spanish, native speakers of French and native speakers of Chilean Spanish.
Since this contribution should be considered a first approach to the study of very advanced non-native use of FL, the research questions are quite generally stated as follows: The first and most comprehensive question is whether it is possible to discern any significant differences at all between non-native speakers’ and native speakers’ use of FSs. At least some contrasts are to be expected, considering that in other domains of linguistic behaviour there are divergences even between extremely proficient (near-native) and native speakers (Abrahamsson & Hyltenstam, 2009). If the answer to this is yes, an ensuing question is to what extent the non-native use of FL can be seen as influenced by the communicative task.
A plausible answer to this question would be that non-natives are better off on an interactive task oriented towards an everyday-life situation, and will therefore produce a higher proportion of FSs, than on a non-interactive task which involves less frequent vocabulary. Another possible answer would be that even highly proficient non-natives do not attain the pragmatic competence of native speakers and for that reason produce fewer FSs than the natives on an interactive task. The third research question is whether any differences can be found between Swedish non-native speakers of French and of Spanish respectively
as regards their mastery of FSs and preferences in use. A hypothesis would be that, given the similarity between the two non-native groups in terms of length of residence and educational background, and given the typological proximity of the two languages, no greater differences will be found. Cultural differences between the two target communities involved could, however, play in another direction, since non-natives having resided for a long time in the target community can be expected to accommodate to a great extent to the L1 patterns they have been exposed to.
Approaches to Formulaic Language in SLA Research
Psycholinguistic perspectives on FL
According to Wray (2002), humans have access to two different modes of language processing: the analytic mode and the holistic mode. While in L1 acquisition and use holistic processing and ‘needs-only analysis’ are the preferred options⁵, the dominant mode in L2 use, according to Wray (2002), is analytic processing. This is mainly due to the impact of literacy and exposure to writing: L2 learners – especially adults – are practically always literate and they are exposed to the new language in its written shape prevailingly or at least to a considerable extent. The learners will therefore tend to segment speech in the way this is done in writing, in other words, with ‘minimal’ graphic words as basic units. When producing language in writing or speech, they will consequently have to figure out which words go together, an effort which L1 users are mostly spared since they already have the word combinations stored in memory. Wray believes that the only possibility for non-native speakers to obtain nativelike command of FL is through residing and interacting for a longer period of time in the target community (Wray, 2002, pp. 199–213). She also suggests that children and adults follow different paths towards fluency and idiomaticity. In the L1, fluency is obtained through chunking from the start, whereas L2 learners need to go through phases of automatization in order to attain, if at all possible, a comparable degree of fluency.
After Wray (2002) had launched her definition of FSs as holistically stored and retrieved, researchers set out to test the psycholinguistic validity of her claim using various kinds of experiments. Two closely related questions have been addressed: are holistic sequences processed faster than ‘analytic’ sequences and, if this is the case, do native speakers and non-natives benefit from the same processing effects?
Conklin and Schmitt (2008), in a study based on a self-paced reading task, have shown that both natives and
non-natives processed formulaic language faster than non-formulaic language, and that natives did so even faster than non-natives. It should be kept in mind, however, that the FSs of their study were exclusively fixed idioms, and not transparent structures. Schmitt, Grandage et al. (2004), on the other hand, have found that not all formulaic sequences seem to be processed as wholes. Processing benefits seemed to decrease on a continuum ranging from entirely fixed forms to more open combinations.
Nick Ellis’s (2002a, 2002b) proposals partially coincide with Wray’s (2002) and also provide experimental support for Wray’s claim. Ellis’s ideas on FL are part of his more general explanatory model for frequency effects in language processing. According to this view, the natural sequence of acquisition runs from formulas (non-analysed units) via low-scope patterns to truly ‘creative’ constructions, and the extraction of regularly occurring patterns across formulas is what affords the development of a creative capacity. But do FSs necessarily have to break down and evolve into creative rules? Ellis holds that ‘formulas can break down’ (Ellis, 2002b), although there is a huge processing gain in keeping them as chunks. The question then is: can the learning of FSs be seen as a process of learning chunks and extracting regularities from them while striving towards keeping as many chunks as possible unanalysed for processing gains? In answer to this, Nick Ellis suggests that FSs need gradual strengthening in order to become automatized and entrenched in the mental lexicon. This means that initial chunk learning will not suffice in the long run, or in his own words, ‘nativelike idiomaticity takes an awful lot of figuring out which words go together’ (Ellis, 2002a, p. 157). In this perspective, FSs are seen as acquired in second language learning by means of two separate processes: holistic chunking and incremental automatization. Forsberg (2008, pp.
262–65) suggests that holistic chunking prevails in early L2 learning, whereas incremental automatization is more typical of later phases of acquisition. Low automatization due to lack of incremental learning would also explain the differences that may be found between native and high-proficient non-native users.
FL in highly proficient L2 users
Although little research has been carried out to date on the FL of highly proficient L2 users in natural settings6, more studies are to be found concerning university students who are regarded as ‘advanced’ in a wider sense (for an operational definition, see e.g. Bartning, 1997). The use of FL
in this type of data has mainly been examined with regard to written production, reception and performance in psycholinguistic experiments. When it comes to spoken production, the few studies that exist have mainly addressed the role of FSs as fluency devices. Raupach (1984) has found that German L2 learners of French overused certain formulas after a stay in France, contending that the subjects were using these formulas as a production strategy, or as ‘islands of reliability’. Hancock (2000) has come to a similar conclusion in a study on Swedish learners of L2 French. The fluency aspect of formulaic sequences has been thoroughly investigated by Wood (2006), who concludes that the use of formulaic sequences contributes greatly to L2 fluency. It should be kept in mind, however, that researchers who talk of formulas do not always refer to the same phenomena. The type of speaker studied (the early learner/late learner/native speaker) influences the kinds of FSs in focus. In fluency studies, discourse markers tend to be dealt with, whereas studies on written production more often address collocations. In connection with spoken discourse, the only studies we know of that link FL and idiomaticity are Forsberg (2008) and Boers, Eyckmans, Kappel, Stengers, & Demecheleer (2006), both of which come to the conclusion that the learners’ general proficiency level seems to coincide with the degree of their mastery of FSs. With regard to writing, there is a considerably higher number of studies on FL, at least as far as L2 English is concerned. Some FL features in advanced learners’ written production seem to be generally recognized, such as the systematic overuse, or underuse, of specific FS types (cf. Granger, 1998; Jaglinska, 2006; Bolly, 2008).
Granger (1998), Nesselhauf (2005) and Lewis (2008) all regard cross-linguistic influence as an important factor in writing in L2, while Howarth (1998), Nesselhauf (2005) and Bolly (2008) claim that the degree of restriction or fixedness influences the learners’ mastery of FSs. Bardovi-Harlig (2008), whose main research question is whether the lack of pragmatic formulas in L2 English is a reception or a production problem, reports that although her learners seem to recognize formulas, they fail to produce them in discourse completion tests. Observations in the same direction were also made by Abrahamsson and Hyltenstam (2009) in an idiom cloze-test included in a test battery designed for near-native speakers of Swedish. This task was found to be one of the most difficult in the set and also one in which the age of onset was a determining factor. None of the above-mentioned studies can be used for direct comparison with the present study, since they do not include the same type of participants
or tasks. To our knowledge, no earlier studies have examined the production of formulaic sequences in the speech of long-residency L2 users, let alone with a multi-task research design. However, existing studies enable some predictions about our data. Differences are to be expected, considering that both Ekberg (2003) and Abrahamsson and Hyltenstam (2009) have found divergences between native and near-native speakers with regard to certain types of formulaic sequences. Divergences of this kind could at least partly be due to processing constraints, as suggested by Conklin and Schmitt (2008). Finally, differences due to cross-linguistic transfer may surface as both over- and underuse. In spite of all this, since the linguistically and culturally immersed L2 users of the present study constitute a type that has hardly been investigated to date, clear-cut predictions are difficult to make.
Data and Method
Subjects
The participants are split into two study groups and two control groups. The subjects who constitute the study groups are highly proficient and ‘immersed’ users of L2 French residing in Paris, and of L2 Spanish residing in Santiago de Chile, 10 subjects for each country. The control groups are of the same size as the study groups and consist of native speakers (NS) of French and Chilean Spanish who live in the same cities as the non-natives. The Swedish L1 non-native speakers (NNS) were selected according to two main criteria: they should have lived at least five years in the target language country and they should have completed at least upper secondary studies. In fact, a majority had experience of academic studies and all had gone through a period of formal instruction in the target language, although the extent of this instruction varied considerably among the participants. Differences with regard to exposure to formal instruction did not yield any differences with regard to FS use, which corroborates Forsberg’s (2008) earlier findings that formal study does not seem to promote the use of FSs. The French non-native speaker group is more homogeneous than the other groups: it includes females only, all around 30 years of age, and with similar educational backgrounds. Most of them have lived in France since their early 20s, came originally to study and are now working. The French native control group is well matched to the non-natives as far as age and
educational background is concerned, though not with regard to gender: the group consists of six females and four males. The non-native speakers of Spanish are more heterogeneous than the non-natives in France. On average, they are about ten years older, with a wider age range and a higher degree of dispersion. With regard to educational background and professional activities, both non-native groups are fairly similar. The native Spanish-speaking control group is well matched to the non-natives as regards age and educational profile. Both the native and the non-native group are of mixed gender: four females versus six males in the non-native group and six females versus four males in the native control group.

Table 3.1 Sociological Parameters of the Participants (SD = standard deviation)
Group/parameter | French NNS (Paris), n = 10 | French NS (Paris), n = 10 | Spanish NNS (Santiago), n = 10 | Spanish NS (Santiago), n = 10
Average age | 29 (range 25–33, SD 2.9) | 27.3 (range 23–34, SD 3.5) | 39.8 (range 27–59, SD 10.6) | 38.8 (range 22–71, SD 14.4)
Mean length of stay (NNS) | 10.3 years (range 5–14, SD 9.2) | − | 9.9 years (range 4–16, SD 6.4) | −
Gender | 10F | 6F, 4M | 6F, 4M | 4F, 6M
Tasks
The present study is based on elicited, though spontaneously produced, spoken data. Two different tasks were included7, which will be referred to as ‘Boss’ and ‘Charlie’. The first task, ‘Boss’, is a role-play, in which the subject is asked to phone her/his manager to ask for a two-day leave (the ‘manager’ who answers the call is a native speaker hired for this purpose). The situation is complicated by the fact that an important business meeting is scheduled for one of the days and the subject is expected to play an important part in that meeting. In the second task, ‘Charlie’, the subject is asked to perform an online retelling of a 14.5-minute clip from the Charlie Chaplin movie ‘Modern Times’. The participants, who were not allowed any planning time, were instructed to tell what happens on the screen to someone who is unable to see it, and to try to capture as many details as possible.
Table 3.2 Communicative Task Characteristics (based on Tavakoli & Skehan 2005)
Task/Task characteristics | Familiarity of information | Dialogic-monologic | Degree of structure | Degree of routine | Concrete-abstract | Number of elements
‘Boss’ | High | Dialogic | Low | High | Concrete | Low
‘Charlie’ | Low | Monologic | High | Low | Concrete | High
As can be seen, the tasks are very different in nature. ‘Boss’ is a dialogic task representing an activity which the subjects can be expected to be familiar with. It is similar to one of the tasks used by Taguchi (2007) in the sense that one party has more power than the other, there is a considerable hierarchical distance, and the request made by the less powerful party implies a high degree of imposition (high PDR values, see Brown & Levinson 1987, pp. 74–83). The ‘Charlie’ task, in turn, is monologic and does not represent any activity that could be thought of as familiar to the subjects (acting as a sports commentator would come closest and none of the participants had any such experience). It may be considered that two equally difficult8 tasks have been chosen: one can be regarded as pragmatically challenging and the other seen as a complex task in more general terms, due to its real-time processing demands, high number of elements involved and relatively infrequent lexis. In Table 3.2 the characteristics of the two tasks are listed according to the criteria proposed by Tavakoli and Skehan (2005), to which has been added ‘degree of routine’, a factor that could be expected to affect the degree to which formulaic sequences will be used.
Identification of FSs in the data
The point of departure for identifying FSs in the present data is Erman and Warren’s (2000) definition of what these authors refer to as ‘prefabs’, in other words, ‘a combination of at least two words favoured by native speakers in preference to an alternative combination which could have been equivalent had there been no conventionalization’ (Erman & Warren 2000, pp. 31–32). Apart from a good definition, however, the identification of FSs in empirical data requires the application of heuristic procedures. In recent years, corpus linguistics has opened the way for quantitative (and presumably objective) methods such as automatic extraction of frequent multiword sequences, used, for example, by De Cock (2004) or Schmitt, Grandage and
Adolphs (2004). A statistical device for measuring the strength of the connection between the parts of a multi-word combination was used, for instance, by Siyanova and Schmitt (2008) and consists of calculating a so-called MI (mutual information) score. A basic condition, however, for the meaningful use of this or other corpus-linguistic methods is the existence of very large corpora. In view of the fairly modest size of our corpus (94,000 words) and the objective of extracting all FSs occurring in the data, manual procedures were preferred. The main criterion used for teasing true FSs out from non-formulaic multi-word combinations has been the principle of restricted exchangeability (Erman & Warren, 2000, pp. 31–32) in combination with the researchers’ own intuition.9 As a subsidiary tool, database (in particular Google) searches were carried out in cases of doubt in order to ascertain the formulaic status of a given multiword expression. Furthermore, in order to strengthen reliability, the FSs of the present data were manually detected and listed by each author separately and the results were subsequently matched for intersubjectivity.
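The MI score mentioned above can be illustrated with a small calculation. The sketch below assumes the formulation commonly used in corpus linguistics, MI = log2(observed / expected), for a two-word combination. The function name and all frequency counts are hypothetical; they are not taken from the present corpus or from the studies cited.

```python
import math

def mi_score(pair_freq: int, freq_w1: int, freq_w2: int, corpus_size: int) -> float:
    """Mutual information for a two-word combination.

    Compares the observed co-occurrence frequency with the frequency
    expected if the two words were distributed independently.
    """
    expected = freq_w1 * freq_w2 / corpus_size
    return math.log2(pair_freq / expected)

# Hypothetical counts, for illustration only (not from the study's corpus).
score = mi_score(pair_freq=30, freq_w1=500, freq_w2=600, corpus_size=1_000_000)
print(round(score, 2))
```

In a corpus of modest size, the counts feeding such a score will often be very low, which is in line with the reason given above for preferring manual identification.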
Categories of FSs
The identified FSs were classified along two dimensions, one concerning their transferability between the non-natives’ Swedish L1 and the French or Spanish target language (see below), and the other concerning their linguistic function. With regard to the latter dimension, a primary distinction was made between lexical and non-lexical expressions (see below) and a secondary distinction was made between phrasal and clausal expressions (see below). Non-lexical items were furthermore subdivided into a grammatical and a discursive category (see below).
Transferable versus non-transferable
In this study, only such FSs as can be seen as non-transferable from the non-native speaker’s L1 to the L2 have been considered. This choice was made on the assumption that the L1 (and possibly also other L2s previously acquired) plays an important part in the acquisition of a new language, in the sense that learners will find it easier to identify and appropriate FSs that coincide structurally and semantically with FSs of their native tongue (or of an earlier acquired L2). A considerable proportion of FSs are, in fact,
transferable across many languages, such as ‘find a solution’, which is equivalent to French ‘trouver une solution’, Spanish ‘encontrar una solución’ and also Swedish ‘hitta en lösning’. If non-transferable FSs in the L2 are more of a challenge to the learner than transferable FSs, they will also constitute a more interesting object for analysis. These assumptions, which are in line with findings reported by, for example, Cieslicka (2006), will also lead to an increased focus on targetlike idiomaticity. A consequence of this methodological choice is that not only will transferable FSs be excluded from the non-native data, but all FSs in the native data that can be considered transferable to the non-native speakers’ L1 will be excluded as well. All instances of, for example, ‘encontrar una solución’ (‘find a solution’, cf. above) will therefore be discarded, regardless of whether they are produced by a non-native or native speaker of Spanish.

It could be suspected, however, that the elimination of transferable FSs will skew the proportion between the natives’ and the non-natives’ overall production of FL. In order to investigate this possible effect, a random sample covering approximately 10 per cent of the total quantity of words produced by each of the four groups (French and Spanish NNSs vs. French and Spanish NSs) was drawn and the proportion of transferable FSs was calculated for each group. Differences turned out to be fairly slight, with a somewhat lower proportion for the natives (French NSs 25.1 per cent, Spanish NSs 23.1 per cent) than for the non-natives (French NNSs 27.8 per cent, Spanish NNSs 29.3 per cent), which means that the non-native speakers’ production of FSs can be said to be underestimated in the count by 10–11 per cent as a consequence of the choice to eliminate transferable FSs.
Lexical versus non-lexical
This distinction is a common one in semantics and lexicology (although the terminology may vary), which separates ‘content words’ from ‘function words’.10 Provided FSs are understood as items stored as wholes in speakers’ mental lexicons, no difference between single-word and multi-word items should be seen in this regard. Lexical FSs are thus regarded as expressions which have a denotative meaning, such as Fr. ‘faire du sport’ (‘practice a sport’) or Sp. ‘pedir permiso’ (‘ask for a leave’), or a denotative and pragmatic meaning, such as Fr. ‘merci bien’ or Sp. ‘muchas gracias’ (‘thanks a lot’). Non-lexical items, on the other hand, are understood as items which do not carry any denotative meaning of their own but function as operators on words, phrases and clauses and only thereby contribute to sentence (and pragmatic) meaning.
Lexical-phrasal versus lexical-clausal
Lexical FSs are seen as subdivided into phrasal structures (NPs, VPs, AdjPs, AdvPs, PrepPs) and clausal structures. Apart from the fact that these subcategories represent syntactic units of a different hierarchical order, there is another important feature distinguishing them. Phrasal FSs will have to go through syntactic processes such as inflection, determination, insertion in word order or voice transformations on their way from lexical storage to use in actual utterances, whereas clausal FSs can be inserted directly, or almost directly, into a syntagm. Thus, a phrasal FS such as Sp. ‘horario laboral’ (‘work hours’) will appear in the singular or in the plural, and with or without a definite or indefinite determiner, and a phrasal expression such as Fr. ‘mettre NP-PERS au courant’ (‘keep someone informed’) is likely to appear inflected with regard to tense, mood, number and person and to show up in the active, middle or passive voice. On the other hand, clausal expressions such as Fr. ‘je vous remercie de votre compréhension’ (‘thanks for being so understanding’) or Sp. ‘me queda absolutamente claro’ (‘it is perfectly clear to me’) are already there, ready for use in an utterance. So are verbless, though still clausal, expressions such as the above-mentioned Sp. ‘muchas gracias’ and Fr. ‘merci bien’ (‘thanks a lot’).
Non-lexical FSs: grammatical versus discursive
Non-lexical FSs are regarded as analogous to function words, which means their meaning is procedural rather than denotative or referential. Depending on whether a non-lexical FS operates on a syntactic unit such as a word, phrase or clause, or whether it operates on a whole utterance, it will be categorized as grammatical or discursive, respectively. Discursive FSs are, in fact, multi-word discourse markers. Clear-cut examples of grammatical FSs are Fr. ‘un peu’ / Sp. ‘un poco’ (‘a bit’), or Fr. ‘pas du tout’ / Sp. ‘para nada’ (‘not at all’), whereas equally unequivocal examples of discursive FSs would be Fr. ‘c’est vrai que S’ / Sp. ‘la verdad es que S’ (‘as a matter of fact S’), or Fr. ‘par contre’ / Sp. ‘por el contrario’ (‘on the other hand’). As can be deduced from the last two pairs of examples, the distinction ‘phrasal/clausal’ is relevant also for discursive FSs, although that subclassification has not been implemented in the present data analysis. The grammatical/discursive distinction implies a twilight zone, for instance as regards conjunctional expressions which also function as pragmatic operators on whole utterances or stretches of utterances. In such
cases, for example, Sp. ‘a no ser que S’ or Fr. ‘à moins que S’ (‘unless’), the choice was made to register the FS as grammatical.
Results
Word production
In order for meaningful comparisons to be made on the use of FL, the number of words produced by each group in each of the tasks needs to be established. In Table 3.3, these figures are presented together with figures that indicate speech rate in terms of words per minute. In the case of the ‘Charlie’ task, the duration of which is constant (14.5 minutes), the average speech rate for each group is directly calculable. Considering the variable duration of the individual phone calls in the ‘Boss’ task, however, the average duration for each group has to be given in order to obtain the corresponding figures for speech rate. Admittedly, comparing word production between French and Spanish is not an unproblematic issue since it is not obvious that speakers of one language will express an identical conceptual content with the same number of words. The typological similarity between the languages leads us to conclude, however, that no significant differences of the kind are to be encountered.11

Table 3.3 shows considerable differences in some respects, and fewer differences in others. With the exception of the column for speech rate in the ‘Boss’ task, the French speakers (NS or NNS) yield higher figures than the Spanish speakers and this difference is statistically significant at the 0.005 level on the ‘Charlie’ task. Also, the non-native speakers tend to resemble the corresponding natives more than they resemble each other. Furthermore, no significant differences can be found in any of the native/non-native comparison pairs, a result which precludes the attribution of divergences in FS production to word amount.
Table 3.3 Word Production, Duration and Speech Rate; Means for Each Group
Group/Task | ‘Boss’, words | ‘Boss’, minutes | ‘Boss’, words/minute | ‘Charlie’, words | ‘Charlie’, words/minute
French NNS | 568 | 7.67 | 74.1 | 2,130 | 146.9
French NS | 611 | 6.79 | 90.0 | 2,120 | 146.2
Spanish NNS | 518 | 6.22 | 83.3 | 1,419 | 97.9
Spanish NS | 475 | 5.02 | 94.6 | 1,525 | 105.2
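The speech-rate columns of Table 3.3 can be retraced from the word counts and durations. The short sketch below divides the group means for words by the group means for minutes, using the fixed 14.5-minute clip length for ‘Charlie’. It is an illustration only: the variable and function names are hypothetical, and it assumes that the reported rates were derived from group means rather than averaged over individual speakers.

```python
# Group means as reported in Table 3.3; the 'Charlie' clip lasts 14.5 minutes.
CHARLIE_MINUTES = 14.5

boss = {  # group: (mean words, mean minutes)
    "French NNS": (568, 7.67),
    "French NS": (611, 6.79),
    "Spanish NNS": (518, 6.22),
    "Spanish NS": (475, 5.02),
}
charlie_words = {
    "French NNS": 2130,
    "French NS": 2120,
    "Spanish NNS": 1419,
    "Spanish NS": 1525,
}

def words_per_minute(words: float, minutes: float) -> float:
    """Speech rate rounded to one decimal, as in the table."""
    return round(words / minutes, 1)

boss_rate = {g: words_per_minute(w, m) for g, (w, m) in boss.items()}
charlie_rate = {g: words_per_minute(w, CHARLIE_MINUTES) for g, w in charlie_words.items()}

for group in boss:
    print(group, boss_rate[group], charlie_rate[group])
```

Computed this way, the rates agree with the words-per-minute figures given in the table.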
The results of Table 3.3 raise a number of interesting questions. Why is the word production of the French speakers (NS and NNS) so much higher than that of the Spanish speakers on ‘Charlie’, but not on ‘Boss’? Why is the native speakers’ speech rate clearly higher than the non-natives’ on ‘Boss’ when their word production is not? Why is the natives’ speech rate higher than the non-natives’ on ‘Boss’, and not on ‘Charlie’? In connection with ‘Boss’ it should be kept in mind that this is a dialogic task in which another party than the subject participates (viz. the ‘boss’), and that the degree to which this other party dominates the conversation is liable to affect the subject’s performance in terms of both prolixity and speech rate.
Degree of formulaicity
A general overview of the use of FL in the four groups is given in Table 3.4, with figures indicating the number of FSs produced per 100 words. First of all, Table 3.4 shows that on ‘Boss’, both the French-speaking and the Spanish-speaking natives produced significantly more FSs than the non-natives. No such significant difference, however, can be found on ‘Charlie’, which means that ‘Boss’, a dialogic and pragmatic task, seems to yield a greater distance between natives and non-natives than ‘Charlie’, a monologic and narrative task. This result runs counter to the hypothesis suggested at the beginning of the paper according to which the non-natives would perform in a more nativelike way with regard to FL on a familiar, everyday-life-oriented task than on a task which corresponds to an activity which the subjects can be expected to be less familiar with. On the other hand, Table 3.4 does show that all participant groups except the French-speaking non-natives tended to be more formulaic (though not to a statistically significant degree) on ‘Boss’ than on ‘Charlie’.

Table 3.4 Degree of Formulaicity (FS means per 100 words)
Group/Task | ‘Boss’, FS/100 words | ‘Charlie’, FS/100 words
French NNS | 9.0 (SD 1.4) | 9.3 (SD 1.3)
French NS | 11.2* (SD 1.8; p<0.02 in unpaired t-test) | 8.9 (SD 1.3; not significant)
Spanish NNS | 8.1 (SD 2.0) | 6.8 (SD 2.0)
Spanish NS | 11.6* (SD 1.2; p<0.002 in Mann-Whitney U) | 8.0 (SD 1.0; not significant)
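The French native/non-native comparison on ‘Boss’ can be roughly retraced from the summary figures in Table 3.4. The sketch below computes a pooled-variance unpaired t statistic from group means and standard deviations. It assumes 10 speakers per group and that the reported values are per-speaker means and SDs; the original test was presumably run on the raw per-speaker data, so this is an approximation under those assumptions and not the authors' own computation. The function name is illustrative.

```python
import math

def unpaired_t(mean1, sd1, n1, mean2, sd2, n2):
    """Student's t for two independent groups, using the pooled variance.

    Returns the t statistic and the degrees of freedom.
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df

# French NS vs. French NNS on 'Boss' (FS per 100 words, Table 3.4);
# n = 10 per group is taken from the description of the participants.
t_fr, df_fr = unpaired_t(11.2, 1.8, 10, 9.0, 1.4, 10)
print(f"t = {t_fr:.2f}, df = {df_fr}")
```

Under these assumptions the statistic comes out at about 3.05 with 18 degrees of freedom, which lies beyond the two-tailed critical value for p = 0.02 (about 2.55) and is thus in line with the significance level reported in the table. The Spanish comparison relies on a Mann-Whitney U test, which cannot be retraced from means and SDs alone.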
Category distribution
The distribution of FS categories, as defined earlier, is given for the four groups in Table 3.5 and is also represented graphically in Figures 3.1 and 3.2.

Table 3.5 Distribution of FS Categories in ‘Boss’ and in ‘Charlie’
‘Boss’ Group/task | Lexical-clausal | Lexical-phrasal | Grammatical | Discursive | Tot. FS
French NNS | 0.2 (SD 0.42) | 2.4 (SD 0.7) | 2.1 (SD 1.3) | 4.3 (SD 1.1) | 9.0 (SD 1.4)
French NS | 1.4 (SD 1.27) | 3.2 (SD 1.4) | 2.2 (SD 0.7) | 4.4 (SD 1.1) | 11.2 (SD 1.8)
Spanish NNS | 0.7 (SD 0.82) | 3.0 (SD 0.8) | 1.2 (SD 1.0) | 3.2 (SD 0.8) | 8.1 (SD 2.0)
Spanish NS | 1.5 (SD 0.7) | 5.4 (SD 1.6) | 1.1 (SD 0.6) | 3.6 (SD 0.7) | 11.6 (SD 1.2)
‘Charlie’ Group/task | Lexical-clausal | Lexical-phrasal | Grammatical | Discursive | Tot. FS
French NNS | 0.2 | 3.8 (SD 1.3) | 3.6 (SD 0.8) | 1.7 (SD 0.9) | 9.3 (SD 1.25)
French NS | 0.3 | 5.3 (SD 1.2) | 2.5 (SD 0.7) | 0.8 (SD 0.4) | 8.9 (SD 1.28)
Spanish NNS | 0.1 | 4.0 (SD 0.9) | 1.9 (SD 0.6) | 0.8 (SD 0.9) | 6.8 (SD 2.0)
Spanish NS | 0.1 | 5.6 (SD 0.7) | 1.9 (SD 0.6) | 0.4 (SD 0.3) | 8.0 (SD 1.0)

[Figure 3.1 Distribution of FS Categories in ‘Boss’. Bar chart with the series French NNS, French NS, Spanish NNS and Spanish NS over the categories Lex claus, Lex phras, Gram, Disc and Tot FS; vertical axis 0–14.]

In both languages and on both tasks, the native speakers produced significantly more lexical-phrasal FSs than the non-natives. This was also the case for lexical-clausal FSs on the ‘Boss’ task, whereas on ‘Charlie’, none of the groups produced enough clausal FSs to make comparison meaningful. Furthermore, when comparing the tasks, we can see that in all groups, more lexical items in general, and more lexical-phrasal items in particular, were produced on ‘Charlie’ than on ‘Boss’, which is an expected outcome, considering that a greater amount of denotative-referential lexis will be activated in the former task. The fact that natives turn out to be better producers of lexical FSs than non-natives is congruent with the findings in Forsberg (2008), with one important exception: the most advanced group in that study, whose proficiency level was comparable to that of the French-speaking NNSs of the present data, produced a proportion of lexical FSs which equalled that of the native speakers of the control group. The reason why the non-native subjects did not perform as well in the present data most likely has to do with the degree of difficulty of the tasks involved. In Forsberg (2008), the subjects were
respondents in a self-presentation interview, a task which should be regarded as clearly less challenging than either ‘Boss’ or ‘Charlie’.

The figures yielded for the grammatical and discursive categories open up a number of questions. Grammatical FSs are consistently more frequent on ‘Charlie’ than on ‘Boss’, while the reverse is true of discursive FSs. A plausible explanation for this incongruence would be that the argumentative nature of ‘Boss’ requires a more extensive use of discourse markers in general. This could possibly happen at the expense of grammatical formulas, although there is no clear answer as to why. A cross-linguistic comparison yields higher figures for the French-speaking than for the Spanish-speaking groups with regard to both non-lexical categories, which may be due to structural differences between the languages in the sense that grammatical and discursive markers more often correspond to multi-word expressions in French and to single words in Spanish. Further research is needed on this point. Finally, as far as the native/non-native dimension is concerned, hardly any differences can be found on ‘Boss’ with regard to either non-lexical category. On ‘Charlie’, however, the French-speaking non-natives used significantly more non-lexical FSs than the natives12, which may be explained by the fact that markers of this kind tend to be used as a production strategy in order to enhance fluency (Raupach 1984).

[Figure 3.2 Distribution of FS Categories in ‘Charlie’. Bar chart with the series French NNS, French NS, Spanish NNS and Spanish NS over the categories Lex claus, Lex phras, Gram, Disc and Tot FS; vertical axis 0–10.]
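The non-lexical contrasts discussed in this section can be retraced from the category means in Table 3.5. The sketch below adds the grammatical and discursive figures for each group and task and checks that the four categories add up to the reported totals. The data structure and function name are illustrative only; the figures are the group means from the table (FSs per 100 words), with standard deviations left out.

```python
# Group means from Table 3.5, in the order:
# (lexical-clausal, lexical-phrasal, grammatical, discursive, total)
table_3_5 = {
    "Boss": {
        "French NNS": (0.2, 2.4, 2.1, 4.3, 9.0),
        "French NS": (1.4, 3.2, 2.2, 4.4, 11.2),
        "Spanish NNS": (0.7, 3.0, 1.2, 3.2, 8.1),
        "Spanish NS": (1.5, 5.4, 1.1, 3.6, 11.6),
    },
    "Charlie": {
        "French NNS": (0.2, 3.8, 3.6, 1.7, 9.3),
        "French NS": (0.3, 5.3, 2.5, 0.8, 8.9),
        "Spanish NNS": (0.1, 4.0, 1.9, 0.8, 6.8),
        "Spanish NS": (0.1, 5.6, 1.9, 0.4, 8.0),
    },
}

def non_lexical(row):
    """Grammatical plus discursive FSs per 100 words."""
    return round(row[2] + row[3], 1)

for task, groups in table_3_5.items():
    for group, row in groups.items():
        # The four categories should add up to the reported total.
        assert abs(sum(row[:4]) - row[4]) < 0.05
        print(task, group, "non-lexical:", non_lexical(row))
```

On these figures, the non-lexical rate of the French-speaking non-natives on ‘Charlie’ (5.3) exceeds that of the French natives (3.3), which is the contrast discussed above, while the corresponding figures on ‘Boss’ lie close together (6.4 and 6.6).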
Concluding Remarks
Not surprisingly, a number of contrasts have been found in this study between native and non-native performance with regard to FL, some of which proved statistically significant and others not. Both non-native groups converged with regard to many parameters. On ‘Boss’, they showed a lower speech rate, fewer FSs overall and fewer lexical FSs in particular than did the natives, whereas on ‘Charlie’ a corresponding divergence could be found only with regard to lexical FSs (fewer, as on ‘Boss’, in the non-native groups). This high degree of convergence between L2 users of different languages is interesting in that it actually demonstrates something which hitherto was only an expectation, namely that highly proficient L2 users tend to perform quite similarly regardless of the target language. Task and activity characteristics seem to affect the subjects’ production of FSs in various ways. All groups except the French-speaking non-natives
produced more FSs overall on ‘Boss’ than on ‘Charlie’. Furthermore, ‘Charlie’ favoured the production of grammatical FSs, and ‘Boss’ that of discursive FSs, in all groups. In both tasks the results regarding lexical FSs are less nativelike than in the self-presentation interviews used by Forsberg (2008). An interesting detail is the high non-native figures for discursive FSs on ‘Charlie’ (though not on ‘Boss’) in comparison with those of the native speakers, which may be interpreted as a production strategy and a sign of stress in a situation which could easily be perceived as more challenging for a non-native than for a native speaker. Finally, there are some general divergences to be found between the languages, regardless of whether the speakers are native or non-native. Speakers of Chilean Spanish produce fewer words on both tasks than speakers of French, a result which can be interpreted in light of cultural preferences. The speech rate on ‘Charlie’ is also higher in the French than in the Spanish groups, although this does not happen on ‘Boss’, a result for which there is no obvious explanation. The French-speaking groups also produce a considerably higher proportion of non-lexical (grammatical plus discursive) FSs than the Spanish speakers, which may be attributed to structural differences between the languages. The findings of this study are, of course, to be regarded as preliminary. First of all, very little data for comparison has been available for the simple reason that highly proficient and immersed L2 users have been, to date, a rare object of study in SLA research. The results are also preliminary in the sense that the data needs to be examined not only by quantitative measures such as ‘degree of formulaicity’ and ‘category distribution’, but also through a more fine-grained analysis. 
As regards the ‘Boss’ task, pragmatic aspects need to be examined in greater detail by means of argumentation analysis; linking FL with a speech act-based analysis could reveal considerable divergences between native and non-native speakers. As for the ‘Charlie’ task, it would be desirable to investigate the individual types of sequences used. Even though no significant differences could be ascertained with regard to degree of formulaicity, differences were found between natives and non-natives in the number of lexical FSs produced. Important aspects of the lexical FS category, such as the use of infrequent vocabulary, require closer comparison. Hopefully, the present study will contribute not only to research on FL in general, but also to a more fine-tuned description of differences between native and non-native speakers, even at very high levels of L2 proficiency.
Notes 1
2
3
4
5
6
7
8
This study is part of the project ‘Formulaic sequences and communicative proficiency in advanced L2 use’ (involving French, Spanish and English as L2), which in turn is part of the programme High-level Proficiency in Second Language Use, funded by the National Bank of Sweden. The authors also wish to express their gratitude to STINT (the Swedish Foundation for the Internationalisation of Higher Education and Research) for enabling the collection of Chilean data through an institutional grant given the Pontifical Catholic University of Chile and Stockholm University. In a study on L2 use and age of acquisition, Abrahamsson and Hyltenstam (2009.) used as many as 25 different tests to measure their informants’ nativelikeness. The two tests of the present study were chosen in order to represent as different a set of genres as possible. Galkowski and Masiejczyk (2007) made an attempt to compare Polish and English formulaicity and actually claimed that Polish was less formulaic in nature than English. There seems to exist some kind of received wisdom that English is a particularly formulaic language. This is why it is so important to study other languages, apart from English and preferably using the same methodology, in order to allow for comparisons. This type of project is currently being carried out at Stockholm University, where parallel corpora of English, French and Spanish L1 and L2 have been collected and analysed. The present study being the first pilot study of this project, it accounts for results from comparisons of two typologically close languages. A needs-only-analysis means that the speaker will do online parsing and composing of structures only when this, for whatever reason, is necessary, the default mode being holistic interpretation and retrieval. This view, it should be underscored, is discrepant with the generative paradigm, where emphasis is put on the speaker’s creative capacity. 
One study that addresses higher proficiency levels is Ekberg (2003), who investigates near-native speakers of Swedish in a multilingual setting in Sweden and concludes that near-native speakers use less conventionalised lexical and grammatical patterns than monolingual Swedes. Although these results have implications for the present study, the subjects studied – teen-age near-native speakers in a multilingual setting – are quite different. These two tasks are part of a larger test battery including other pragmatic and narrative tasks and also a self-presentation interview. This study is intended as a first sub-study in investigating the task variation effect. As reported in Tavakoli and Skehan (2005), defining the difficulty that a task represents for the participants is a complex endeavour – probably even more difficult in the case of the highly proficient participants of this study than for the subjects used in the task-based learning paradigm. Narrative tasks are often considered difficult (as reported by Tavakoli and Skehan 2005) insofar as they imply the encoding of a large number of elements and spatial-temporal relations. They are also regarded as more constraining in the sense that the speaker has little freedom as regards choice of lexis.
In much earlier work on FL, such as Bahns, Burmeister, and Vogel (1986), the researchers’ own intuition has been used as the only or main identification tool. See also Peters (1983) and Hickey (1993) as regards methods used in L1 acquisition studies, and Myles et al. (1998) or Bardovi-Harlig (2002) as regards SLA research. This distinction is far from being uncontroversial. One question regards the concepts ‘denotative’ versus ‘referential’. Pronouns, and with them pronominal FSs, should be regarded as referential but not denotative. In this study, pronominal FSs count as grammatical (= non-lexical) in spite of the fact that they do not function as operators on other units. However, deictic or pseudo-deictic expressions like En. ‘in a different way’, ‘somewhere else’ or ‘last week’, although they could be regarded as more or less equivalent to pronominal expressions, have been registered as lexical(-phrasal) FSs. Although the common way of indicating speech rate is in terms of syllables per time unit, this measure is less suitable for cross-linguistic purposes, due to phonetic differences between languages. Because of its simpler phonetic syllable structure, Spanish will systematically yield more syllables per time unit than French. On the other hand, the structural-typological similarities between the two languages will ensure that the ‘same’ conceptual-structural content will be expressed by roughly the same number of words in each language. The differences that do exist do not seem to endanger comparability. The increase in word number caused by the obligatory subject pronoun in French, as compared to the Spanish pro-drop, will, for instance, be greatly compensated by the reducing effect of French elision (elided items such as j’ai, c’est or m’a are represented in the word count as single words).
The Spanish-speaking non-natives produced more discursive (but not more grammatical) FSs than the natives on the ‘Charlie’ task, though not to a statistically significant degree.
References

Abrahamsson, N., & Hyltenstam, K. (2009). Age of acquisition and nativelikeness in a second language: Listener perception versus linguistic scrutiny. Language Learning, 59, 249–306.
Bahns, J., Burmeister, H., & Vogel, T. (1986). The pragmatics of formulas in L2 learner speech: Use and development. Journal of Pragmatics, 10, 693–723.
Bardovi-Harlig, K. (2008). Recognition and production of formulas in L2 pragmatics. In Z. H. Han (Ed.), Understanding second language process. Clevedon: Multilingual Matters.
Bartning, I. (1997). L’apprenant dit avancé et son acquisition d’une langue étrangère. AILE, 9, 9–50.
Boers, F., Eyckmans, J., Kappel, J., Stengers, H., & Demecheleer, M. (2006). Formulaic sequences and perceived oral proficiency: Putting a lexical approach to the test. Language Teaching Research, 10(3), 245–61.
Bolly, C. (2008). Les unités phraséologiques: un phénomène linguistique complexe? Séquences (semi-)figées avec les verbes prendre et donner en français écrit L1 et L2. Approche descriptive et acquisitionnelle. Doctoral dissertation. Université Catholique de Louvain.
Brown, P., & Levinson, S. C. (1987). Politeness: Some universals in language use. Cambridge: Cambridge University Press.
Cieslicka, A. (2006). On building castles on the sand or exploring the issue of transfer in the interpretation and production of L2 fixed expressions. In J. Arabski (Ed.), Cross linguistic influences in the second language lexicon (pp. 226–45). Clevedon: Multilingual Matters.
Conklin, K., & Schmitt, N. (2008). Formulaic sequences: Are they processed more quickly than nonformulaic language by native and nonnative speakers? Applied Linguistics, 29(1), 72–89.
Cook, V. (2002). Background to the L2 user. In Portraits of the L2 user. Clevedon: Multilingual Matters.
De Cock, S. (2004). Preferred sequences of words in NS and NNS speech. Belgian Journal of English Language and Literatures, New Series 2. Gent: Academia Press.
Dörnyei, Z., Durow, V., & Khawla, Z. (2004). Individual differences and their effects on formulaic sequence acquisition. In N. Schmitt (Ed.), Formulaic sequences (pp. 87–106). Amsterdam: John Benjamins.
Ekberg, L. (2003). Grammatik och lexikon i svenska som andraspråk på nästan infödd nivå. In K. Hyltenstam & I. Lindberg (Eds.), Svenska som andraspråk: i forskning, undervisning och samhälle (pp. 259–76). Lund: Studentlitteratur.
Ellis, N. C. (2002a). Frequency effects in language processing: A review with implications for theories of implicit and explicit language acquisition. Studies in Second Language Acquisition, 24(2), 143–88.
Ellis, N. C. (2002b). Reflections on frequency effects in language processing. Studies in Second Language Acquisition, 24(2), 297–339.
Erman, B., & Warren, B. (2000). The idiom principle and the open choice principle. Text, 20(1), 29–62.
Forsberg, F. (2008). Le langage préfabriqué – formes, fonctions et fréquences en français parlé L2 et L1. Bern: Peter Lang.
Foster, P. (2001). Rules and routines: A consideration of their role in the task-based language production of native and non-native speakers. In M. Bygate, P. Skehan, & M. Swain (Eds.), Researching pedagogic tasks: Second language learning, teaching and testing (pp. 75–94). London/New York: Longman.
Gałkowski, B., & Masiejczyk, A. (2007). When two become one (or three): Reconsidering the dual-mode models of linguistic processing. Paper presented at the 25th UWM Linguistics Symposium, University of Wisconsin-Milwaukee, 18–21 April 2007.
Granger, S. (1998). Prefabricated patterns in advanced EFL writings: Collocations and formulae. In A. P. Cowie (Ed.), Phraseology: Theory, analysis and applications (pp. 145–60). Oxford: Clarendon Press.
Hancock, V. (2000). Quelques connecteurs et modalisateurs dans le français parlé d’apprenants universitaires. Cahiers de la recherche 16. Doctoral dissertation. Department of French and Italian. Stockholm University.
Hickey, T. (1993). Identifying formulas in first language acquisition. Journal of Child Language, 20, 27–41.
Howarth, P. (1998). Phraseology and second language proficiency. Applied Linguistics, 19(1), 24–44.
Jaglinska, A. (2006). Idiomaticity in learner language: A study of the use of prefabs in the writing of Polish advanced EFL learners. Unpublished doctoral thesis. Marie Curie-Skłodowska University, Lublin.
Krashen, S. D., & Scarcella, R. (1978). On routines and patterns in language acquisition and performance. Language Learning, 28, 283–300.
Lewis, M. (2008). The Idiom Principle in L2 English. Doctoral dissertation. English Department. Stockholm University.
MacWhinney, B. (2001). Emergentist approaches to language. In J. Bybee & P. Hopper (Eds.), Frequency and the emergence of linguistic structure (pp. 449–69). Amsterdam: John Benjamins.
Myles, F., Mitchell, R., & Hooper, J. (1998). Rote or rule? Exploring the role of formulaic language in classroom foreign language learning. Language Learning, 48(3), 323–63.
Myles, F., Mitchell, R., & Hooper, J. (1999). Interrogative chunks in French L2. A basis for creative construction? Studies in Second Language Acquisition, 21, 49–80.
Nesselhauf, N. (2005). Collocations in a learner corpus. Amsterdam: John Benjamins.
Pawley, A., & Syder, F. (1983). Two puzzles for linguistic theory: Nativelike selection and nativelike fluency. In J. C. Richards & R. W. Schmidt (Eds.), Language and communication (pp. 191–226). London: Longman.
Peters, A. M. (1983). The units of language acquisition. New York: Cambridge University Press.
Raupach, M. (1984). Formulae in second language speech production. In H. W. Dechert, D. Möhle, & M. Raupach (Eds.), Second language production (pp. 114–37). Tübingen: Gunter Narr.
Robinson, P., & Gilabert, R. (2007). Task complexity, the cognition hypothesis and second language learning and performance. IRAL, 45(3), 161–77.
Schmitt, N., Dörnyei, Z., Adolphs, S., & Durow, V. (2004). Knowledge and acquisition of formulaic sequences: A longitudinal study. In N. Schmitt (Ed.), Formulaic sequences: Acquisition, processing and use (pp. 55–86). Amsterdam: John Benjamins.
Schmitt, N., Grandage, S., & Adolphs, S. (2004). Are corpus-derived recurrent clusters psycholinguistically valid? In N. Schmitt (Ed.), Formulaic sequences: Acquisition, processing and use. Amsterdam: John Benjamins.
Siyanova, A., & Schmitt, N. (2008). L2 learner production and processing of collocation: A multi-study perspective. Canadian Modern Language Review, 64(3), 429–58.
Skehan, P. (1998). A cognitive approach to learning language. Oxford: Oxford University Press.
Taguchi, N. (2007). Task difficulty in oral speech act production. Applied Linguistics, 28(1), 113–35.
Tavakoli, P., & Skehan, P. (2005). Strategic planning, task structure and performance testing. In R. Ellis (Ed.), Planning and task performance in a second language (pp. 239–75). Amsterdam: John Benjamins.
Tyne, H. (2005). La maîtrise du style en français langue seconde. Unpublished doctoral thesis. Université de Paris X Nanterre/University of Surrey.
Wong-Fillmore, L. M. (1976). The second time around: Cognitive and social strategies in second language acquisition. Unpublished doctoral thesis. Stanford University.
Wood, D. (2006). Uses and functions of formulaic sequences in second language speech: An exploration of the foundations of fluency. Canadian Modern Language Review, 63(1), 13–33.
Wray, A. (2002). Formulaic language and the lexicon. Cambridge: Cambridge University Press.
Yorio, C. (1989). Idiomaticity as an indicator of second language proficiency. In K. Hyltenstam & L. Obler (Eds.), Bilingualism across the lifespan (pp. 55–72). Cambridge: Cambridge University Press.
Chapter 4
Effectiveness of Text Memorization in EFL Learning of Chinese Students
Zhenqiong Dai and Yanren Ding
Nanjing University
Introduction

The role of text memorization in second language (L2) learning is controversial. Repetition and learning by heart are regarded as two ‘outlaws’ (Cook, 1994, p. 133), and Chinese students studying abroad are often seen as ‘rote learners’ (Biggs, 1991, cited in Pennycook 1996, p. 222). Within China proper, the traditional literacy method of learning texts by heart seems to have fallen out of fashion since Chinese learners and teachers of English as a foreign language (EFL) favour a communicative approach today as they have more exposure to English than they had twenty or thirty years ago. At the same time, however, researchers (e.g. Ting, 2004) note that many teachers continue to use the old method of text memorization and many students hold it to be a ‘good medicine that tastes bitter’ (Ting, 2004). Pennycook (1996) argues that the strategy of text memorization used by Asian students should be looked at in a more positive light, and Parry (1998) suggests that ‘literate Chinese have well-developed strategies for appreciation and memorization that they could well turn to the purpose of learning English’ (p. 65).
Literature Review

Theoretical background

The dual-nature view of language

Skehan (1998) points out that linguists often see language as a rule-based analytical system. While not denying the importance of this model, he proposes a language user’s model of language, which is an exemplar-based memory system; to him, language is both rule-based and exemplar-based. This dual-nature view implies that learning language involves both learning rules and learning exemplars, which may or may not be immediately explained by rules. To Skehan, however, development of the knowledge of exemplars seems to be a natural result of language use; the language user ‘chunks free processing resources during communication so that planning for the form and content of future utterances can proceed more smoothly’ (1998, p. 3). Here, no room is left for conscious effort, and Skehan does not discuss what second language learners should do to learn those exemplars.
The noticing and frequency hypotheses

Second language acquisition (SLA) literature does not contain much discussion on text memorization, but some theories have important implications in that regard. While Schmidt (1990, 2001) emphasizes the role of noticing (i.e. noticing an L2 feature to the point of being able to report it verbally) in the development of the interlanguage system, N. Ellis (2003, 2006) emphasizes the role of frequency of exposure and usage. Looked at together, their hypotheses seem to suggest that the more frequently a form occurs in input, the more likely it is to be noticed and then integrated into learner language. In this respect, the practice of text memorization can not only focus learners’ attention on language form but also make use of the form more frequent. SLA researchers are primarily interested in acquisition processes where learners are engaged in meaningful language use (Ding, 2007). They do not answer questions as to how, under the pressure of real time communication, learners can overcome the difficulty with noticing and rehearsing the features they should learn. Neither are they specific as to what these features are.
The learning of formulaic sequences

One role of text memorization is that it helps with the learning of formulaic sequences (FSs), a term Wray (2002) suggests using to include various types of multi-word units and collocations such as idioms, lexical bundles, and lexical phrases. Researchers call FSs by many different names. Bolinger (1975) proposes that much language processing relies upon familiar, memorized material. Pawley and Syder (1983) claim that the average native speaker knows hundreds of thousands of lexicalized sentence stems. Lewis (1993) states that language consists of grammaticalized lexis. Nattinger and DeCarrico (1992) claim that the degree of fluency is not decided by learners’ command of grammar rules but by the amount of formulaic language stored in their memory. The FSs stored in the mind of language users seem to be what Skehan (1998) means by exemplars, which are not grammatically analysed at the time of use or sometimes not analysable in the first place. It is the memory system that provides learners with formulaic sequences to retrieve and use. By nature, these FSs have to be learned through repeated exposure and memorization. They have to be noticed and consciously learned. Nesselhauf (2003) points out that although rote learning has fallen into disrepute along with Behaviourism, it is vitally important that a number of collocations be taught and learnt explicitly.
Empirical research

Studies on text memorization

There is a scarcity of studies on the effects of text memorization. The few studies available (Dong & Fu, 2003; Ting, 2004; Ding, 2007) are largely based on qualitative analysis of student work and reflection. These studies found that text memorization or recitation leads to the learning of formulaic chunks, which in turn helps improve the quality and fluency of learners’ L2 writing and translation and, most of all, helps build their confidence in language use. One difficulty with quantitatively measuring the effects of text memorization is controlling the amount of time spent on language learning outside the classroom. It is assumed that students assigned the task of text memorization would spend more time on learning than those not assigned the task; therefore, the gains of these students may be attributed to the extra time spent learning.
Studies on the learning of formulaic sequences

There is no shortage of research on the importance of FS knowledge in second language learning. Wong-Fillmore studied six Hispanic children acquiring English and concluded that ‘the strategy of acquiring formulaic speech is central to the learning of language’ (1976, as cited in Nattinger & DeCarrico, 1992, p. 25). Ting and Qi (2005) compared the number of FSs used in oral and written texts, the grammatical accuracy, and oral and written test scores of Chinese English majors and claimed that the number of FSs serves as a better predictor of the quality of student texts than grammatical accuracy. That is to say, the learners’ ability to use FSs in writing or speech is related to their achievement in composition or oral skills; the better L2 writers/speakers are also better users of formulaic sequences. In terms of how L2 learners use FSs, Qi’s (2006) study provides a comprehensive picture. She investigated how 56 Chinese English majors improved the frequency, variation and accuracy of the FSs they used in their monologues over a period of four years. Her analysis of the corpus suggests, among other things, that over the years, the learners made limited progress in their use of what she classified as interpersonal and textual FSs, but made significant progress in the use of ideational FSs, especially in that of nominal and prepositional FSs. In Qi’s study, frequency served as an index of fluency in producing FSs; variation was an index of diversity in the FSs used in the text; accuracy was an index of the quality of FS uses. The present study borrows her method of analysing learner use of FSs.
Research questions

Given the controversies over the practice of text memorization, studies are called for that can tease out the effects of such practice on L2 learning. This study followed a quasi-experimental design and attempted to tackle the following questions:

1. Does text memorization have any effects on L2 learners’ proficiency and writing ability? That is, how do learners make progress from the beginning to the end of the experiment?
2. Does text memorization have any effects on L2 learners’ use of FSs in writing?
3. Does text memorization produce the same or different effects on high- and low-achieving learners in the use of FSs in their L2 writing?
Methodology

Settings

The study was set in the undergraduate English programme of a military academy. Although students enter this academy through the national college matriculation examinations just like students in other schools, they live the life of a military camp. All of them take the same classes at the same time. Outside the classes, they all study during designated periods of time; every evening (at 6 p.m.) they are assembled and led to the classroom for ‘self-study,’ which means doing homework assignments and preparing for upcoming classes and exams. At the end of this self-study period (at 9:20 p.m.), they are gathered and led back to the dormitories. No one is allowed to skip the self-study period, and no one can study outside it either, since the students have other tasks and assignments. This unique setting made it possible to control the amount of time spent on learning outside class and therefore to compare a group given the task of text memorization with a group not given the task.
Participants

Two intact sophomore classes (average age 18.5) were chosen from the English Department. One class contained 26 students (23 males and 3 females) and served as the experimental group. Another contained 29 students (25 males and 4 females) and served as the control group. The two classes were taught by the same teacher. Like other English majors in China, these students were all taking courses in intensive reading (eight hours a week), extensive reading (two hours a week), oral English (four hours a week) and listening comprehension (two hours a week). They used the same textbooks. In this study, the final exam of the intensive reading in the semester before the study started was used as the pre-test since intensive reading was a major course and the final exam in this course measures proficiency rather than reading skills alone. The result of an independent-samples t-test of the scores of this pre-test revealed no significant difference between the two groups (t = –1.527, p = 0.133). Five high achievers and five low achievers were chosen from the experimental group for the purpose of exploring the differences in their practice of text memorization. They were selected according to (1) their English scores in the national college matriculation examinations and (2) their scores from the intensive reading course for the three semesters after they entered the academy (see Appendix A for the specific scores).
Instruments

The study employed three instruments in collecting quantitative and qualitative data: pre- and post-tests of proficiency, pre- and post-tests of composition, and interview questions.

1. Pre- and post-tests of proficiency. The post-test used the same format as the pre-test. It consisted of four parts: reading comprehension (20 per cent), paraphrasing (30 per cent), cloze (20 per cent) and translation (30 per cent). The time limit for each test was 90 minutes. Admittedly this might not be an ideal proficiency test since it did not measure speaking and writing skills. It was therefore supplemented by the test of English composition.
2. Pre- and post-tests of composition, which used the same topic: comment on whether it is good or not for college students to get married. The time limit for writing was 40 minutes.
3. Interviews after the experiment was over. The interview questions included: How did you memorize the text in each unit? What difficulties did you encounter when trying to memorize a text? In your opinion, how did such practice affect your English learning? The interviews were conducted in Chinese.
Data collection

The experimental period lasted for about three months (about one semester). In the intensive reading course, the two groups, taught by the same teacher, used the same textbook. The teacher covered eight units during this period. Each unit consisted of a text, that is, a short essay, and a series of exercises. The text in each unit was 1,000 to 1,500 words long. Similar to most intensive reading courses in China’s university English programmes, for each unit, the teacher spent six to seven class hours on expounding the sentences in the text and two to three hours on exercises. Some of the FSs that occurred in these texts are given in Appendix B. The intensive reading and writing pre-tests were administered at the beginning of the experiment. The students were not informed of the essay topic in advance. During the experimental period, the experimental group was assigned to memorize the text from each of the eight units. The students in this group were often asked to recite the text in class, and under the pressure of performance in class, all of them worked hard to memorize the text during the ‘self-study’ period after the class. The control group was not assigned this task. The students of this group studied on their own during the ‘self-study’ period; they might go over the lessons they had learned, complete the exercises in the unit, read English novels and practice listening comprehension. The students of the experimental group, presumably, had much less time for these learning activities. It might also have happened that some students of the control group did some text memorization on their own, but they were never asked to recite any text in class. At the end of the experimental period, the two groups were given the proficiency and writing post-tests. The writing was on the same topic as that in the pre-test. After the tests, the ten students were interviewed individually.
Data analysis

The proficiency tests were scored according to the set criteria by all teachers of the intensive reading courses in the university, including the one who taught the two classes that participated in this study. With each writing test, the student essays were marked by two experienced teachers with similar educational and teaching backgrounds. The scores were keyed in on the computer. The identification of FSs used in student essays followed the practice in previous studies (Howarth, 1998; Nesselhauf, 2003; Qi, 2006): using as reference an authoritative dictionary, in this study, Longman Dictionary of Contemporary English (2006), and manually identifying in student essays all the idioms, phrases and collocations that were included in the dictionary. Each use of an FS was counted as a token; it was counted as a type only the first time it appeared in an essay. If an FS was used inappropriately (e.g. ‘it was interfere with my study’) or contained an error in the use of function words (e.g. ‘the teacher prevented us to look up textbooks’), it was counted as an erroneous token. The numbers of the correct and erroneous FS tokens and types for each essay were also keyed in on the computer. These numbers and scores were analysed with SPSS (Statistical Package for Social Sciences, Version 11.5). FS development was measured using three indices: frequency, variation and accuracy. The calculation of these indices followed Qi (2006):

• Frequency: dividing the total number of FS tokens in an essay by the total number of words of that essay and then multiplying by 1000
• Variation rate: dividing the square of the number of FS types by the number of FS tokens per essay
• Accuracy rate: dividing the total number of error-free FS tokens by the total number of FS tokens per passage
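
The three indices can be illustrated with a minimal sketch. It is not the procedure used in the study, which relied on manual identification and SPSS. The class FSToken, the function fs_indices and the sample essay are hypothetical, and the sketch assumes that FS types are counted over all tokens, whether or not they contain an error.

```python
from dataclasses import dataclass


@dataclass
class FSToken:
    """One occurrence of a formulaic sequence (FS) in an essay (hypothetical type)."""
    form: str      # dictionary form of the FS, e.g. "interfere with"
    correct: bool  # False if used inappropriately or with a function-word error


def fs_indices(tokens, essay_word_count):
    """Return (frequency, variation, accuracy) for one essay."""
    n_tokens = len(tokens)
    if n_tokens == 0 or essay_word_count <= 0:
        raise ValueError("need at least one FS token and a positive word count")
    # Assumption: a type is a distinct FS form, counted regardless of errors.
    n_types = len({t.form for t in tokens})
    n_correct = sum(1 for t in tokens if t.correct)
    frequency = n_tokens / essay_word_count * 1000  # FS tokens per 1,000 words
    variation = n_types ** 2 / n_tokens             # squared types over tokens
    accuracy = n_correct / n_tokens                 # share of error-free tokens
    return frequency, variation, accuracy


# Hypothetical essay of 200 words: four FS tokens, three types, one error.
essay = [
    FSToken("interfere with", correct=False),
    FSToken("interfere with", correct=True),
    FSToken("prevent from", correct=True),
    FSToken("look up", correct=True),
]
frequency, variation, accuracy = fs_indices(essay, essay_word_count=200)
print(frequency, variation, accuracy)
```

For this invented essay the frequency is 20 FS tokens per 1,000 words, the variation rate is 9/4 = 2.25 and the accuracy rate is 3/4 = 0.75.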
To explore whether or not text memorization produces different effects on high and low achievers in the use of formulaic sequences, five high achievers and five low achievers were compared in terms of the frequency, accuracy and variation of the FSs they used in their pre- and post-test compositions.
Results and Discussion

Learners’ progress in English proficiency and writing ability

The results of the proficiency tests and writing tests administered before and after the experiment showed that both the experimental and control groups improved in English proficiency and writing ability. However, the independent-samples t-tests show that the experimental group made significantly greater progress in proficiency than the control group (Table 4.1). The two groups did not show any significant difference in the pre-test (t = –1.527, p = 0.133) although the mean score of the control group (M = 67.5517) was slightly higher. However, they showed a significant difference in the post-test (t = –2.650, p = 0.012); at the end of the experimental period, the experimental group (M = 80.4231) outperformed the control group (M = 77.2759). It can be concluded that text memorization appears to have helped these learners improve their English proficiency. The independent-samples t-tests also show that the experimental group made significantly greater progress in composition than the control group (Table 4.2). The two groups showed no significant difference in the pre-test (t = 0.039, p = 0.969) but a significant difference in the post-test (t = 2.391, p = 0.020); the experimental group (M = 11.0385) outperformed the control group (M = 9.8793). Text memorization appears to have enabled these learners to improve their writing ability.
Table 4.1 Independent-samples T-tests of the Scores of Pre- and Post-tests of Proficiency Comparing the Experimental and Control Groups

Test                           Group   N    Mean      Standard deviation   T-value   P
Proficiency test (pretest)     1       26   65.6538   12.66473             –1.527    0.133
                               2       29   67.5517   11.12778
Proficiency test (post-test)   1       26   80.4231   9.90827              –2.650    0.012
                               2       29   77.2759   5.65620

Notes: 1 = the experimental group; 2 = the control group.
Table 4.2 Independent-samples T-test of the Scores of Pre- and Post-tests of Composition Comparing the Experimental and Control Groups

Test                Group   N    Mean      Standard deviation   T-value   P
Pretest writing     1       26   8.6923    2.11224              0.039     0.969
                    2       29   8.6724    1.70752
Post-test writing   1       26   11.0385   1.72002              2.391     0.020
                    2       29   9.8793    1.85960

Notes: 1 = the experimental group; 2 = the control group.
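
The T-values in Table 4.2 can be recomputed from the reported means, standard deviations and group sizes. The sketch below is an illustration only: it assumes the equal-variances (pooled) form of the independent-samples t-test, which is a guess about the option used in SPSS, and the helper name pooled_t is hypothetical.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Student's t for two independent samples, assuming equal variances."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_err, df


# Summary statistics copied from Table 4.2 (group 1 = experimental, 2 = control).
t_pre, df = pooled_t(8.6923, 2.11224, 26, 8.6724, 1.70752, 29)
t_post, _ = pooled_t(11.0385, 1.72002, 26, 9.8793, 1.85960, 29)
print(f"pretest:   t({df}) = {t_pre:.3f}")
print(f"post-test: t({df}) = {t_post:.3f}")
```

Under that assumption the sketch gives t(53) = 0.039 for the pretest and t(53) = 2.391 for the post-test, in line with the values reported in the table.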
Table 4.3 Correlation Analysis between Writing Scores and the Use of FSs

Writing score             FS measure                   Pearson correlation
Pretest writing score     FS types in the pretest      0.454**
                          FS tokens in the pretest     0.463**
Post-test writing score   FS types in the post-test    0.369**
                          FS tokens in the post-test   0.368**

Notes: ** = significant at the 0.01 level.
Changes in FS use over the experimental period

The relationship between writing scores and the use of FSs

Table 4.3 summarizes the results of the correlation analysis of FS types/tokens and the writing scores. As shown in the table, all the correlation coefficients were significant at the 0.01 level. It can be concluded that the scores of the learners’ L2 compositions were positively correlated with the numbers of FS types and tokens used in those compositions. The learners who wrote well also tended to use more FSs than others. This finding provides support for the finding of Ting and Qi (2005) that the learners’ ability to use formulaic language is a better predictor of their written English scores than is their grammatical accuracy.
General development in the use of FSs

The findings about FS use in terms of frequency, accuracy and variation are shown in Table 4.4. These indices demonstrate the progress made by the two groups over the period of three months except for the index of variation. To be specific, the standardized frequency of FS tokens per essay increased from 143.1 to 150.8 (5.38 per cent) for the experimental group and from 136.2 to 139.7 (2.57 per cent) for the control group. The accuracy rate of FS tokens rose from 0.90 in the pre-test to 0.95 in the post-test (5.56 per cent) for the experimental group and from 0.86 to 0.89 (3.49 per cent) for the control group. However, the mean variation per passage increased from 13.56 to 15.57 (14.82 per cent) for the experimental group but decreased from 13.11 to 13.01 (–0.76 per cent) for the control group. In addition, between-test differences of the two groups in these three indices were examined with paired-samples T-tests. The results are shown in Table 4.5. The accuracy and variation for the experimental group both significantly increased (p = 0.000 for accuracy and p = 0.037 for variation). On the whole, the experimental group made more improvement than the control group.

Table 4.4 Overall Description of FS Use by All Learners in Both Groups

                    Experimental group       Control group
Index               Pretest    Post-test     Pretest    Post-test
Frequency   Mean    143.1      150.8         136.2      139.7
            SD      23.11      21.34         23.82      32.94
Accuracy    Mean    0.90       0.95          0.86       0.89
            SD      0.05       0.03          0.12       0.09
Variation   Mean    13.56      15.57         13.11      13.01
            SD      4.56       4.40          5.08       4.57

Table 4.5 Paired-samples T-test of Between-test Differences for the Overall Sample

            Experimental group (pretest – post-test)   Control group (pretest – post-test)
Index       Mean difference   Trend   P                Mean difference   Trend   P
Frequency   7.7               –       0.144            4.5               –       0.432
Accuracy    0.05              ↑       0.000            0.03              –       0.214
Variation   2.01              ↑       0.037            –0.1              –       0.922

Notes: ↑ = statistically significant increase; – = no statistically significant change.
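
The percentage changes quoted in this section can be recomputed from the group means in Table 4.4. The following is a minimal sketch, assuming each percentage is the change relative to the pretest mean; the function name percent_change and the dictionary layout are illustrative choices, not part of the study.

```python
def percent_change(before, after):
    """Change from `before` to `after`, as a percentage of `before`."""
    return (after - before) / before * 100


# Group means from Table 4.4: (pretest, post-test).
means = {
    ("frequency", "experimental"): (143.1, 150.8),
    ("frequency", "control"): (136.2, 139.7),
    ("accuracy", "experimental"): (0.90, 0.95),
    ("accuracy", "control"): (0.86, 0.89),
    ("variation", "experimental"): (13.56, 15.57),
    ("variation", "control"): (13.11, 13.01),
}

changes = {key: round(percent_change(*pair), 2) for key, pair in means.items()}
for (index, group), change in changes.items():
    print(f"{index:<10}{group:<14}{change:+.2f}%")
```

Rounded to two decimals, these are the figures given in the text: 5.38 and 2.57 per cent for frequency, 5.56 and 3.49 per cent for accuracy, and 14.82 and –0.76 per cent for variation.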
Table 4.6 Differences in the Frequency of FS Use between the High and Low Groups

        Pretest           Post-test
        Mean    SD        Mean    SD
High    152     17.89     166     20.74
Low     124     19.49     146     16.73

                                                 MD
Between-group difference     Pretest             –28
(High-low)                   Post-test           –20
Between-tests difference     High                14
(Pretest – post-test)        Low                 22
Discrepancy between high- and low-achievers in the use of FSs A comparison of high and low achievers suggests that on the whole, low achievers benefited more from the practice of text memorization than high achievers; that is, the practice produces more positive effects for low achievers in the use of formulaic sequences.
Changes in frequency The low achievers made greater progress than the high achievers in the frequency of the FSs used in their compositions (Table 4.6). As expected, the high achievers used more FS tokens than the low achievers in both pre- and post-tests. However, the low achievers made greater progress than the high achievers (22 vs. 14); the difference between them narrowed from 28 in the pre-test to 20 in the post-test.
Changes in variation
The low achievers made greater progress than the high achievers in the variation of the FSs used in their compositions. As shown in Table 4.7, the mean of the variation ratio of the low achievers (M = 9.7160) was much smaller than that of the high achievers (M = 17.6640) in the pre-test, indicating that the low achievers were only able to use a narrow range of FSs. However, at the end of the three-month experimental period, the low achievers achieved a variation ratio (M = 18.2060) that was even slightly higher than the high achievers’ (M = 17.4380); the high achievers made no progress in this respect.
Table 4.7 Differences in Variation of FS Use between the High and Low groups

          Pretest                 Post-test
          Mean       SD           Mean       SD
High      17.6640    3.15377      17.4380    1.29239
Low       9.7160     2.11110      18.2060    3.08046

MD
Between-group difference (High-low)               Pretest     7.948
                                                  Post-test   –0.768
Between-tests difference (Pretest – post-test)    High        –0.226
                                                  Low         8.49
Table 4.8 Differences in the Accuracy of FS Use between the High and Low groups

          Pretest                 Post-test
          Mean       SD           Mean       SD
High      0.9120     0.05541      0.9660     0.03209
Low       0.9260     0.02702      0.9220     0.02775

MD
Between-group difference (High-low)               Pretest     0.014
                                                  Post-test   –0.044
Between-tests difference (Pretest – post-test)    High        0.054
                                                  Low         0.004
Changes in accuracy
In contrast to frequency and variation, the high achievers made greater progress than the low achievers in the accuracy of the FSs used in their compositions. As shown in Table 4.8, the high achievers (M = 0.9120) started out lower than the low achievers (M = 0.9260) in their accuracy rate, but in the post-test at the end of the study they (M = 0.9660) did much better than the low achievers (M = 0.9220).
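The gains and between-group gaps discussed in this section can be recomputed from the group means reported in Tables 4.6 to 4.8. The sketch below is illustrative only: the dictionary layout and the function name are assumptions, and it uses one fixed sign convention (post-test minus pre-test for gains, high minus low for gaps), so the signs it produces may differ from those printed for some MD values in the tables.

```python
# Group means as (pre-test, post-test), copied from Tables 4.6 to 4.8.
means = {
    "frequency": {"high": (152, 166), "low": (124, 146)},
    "variation": {"high": (17.6640, 17.4380), "low": (9.7160, 18.2060)},
    "accuracy": {"high": (0.9120, 0.9660), "low": (0.9260, 0.9220)},
}


def summarize(index: str) -> dict:
    """Gains (post minus pre) and gaps (high minus low) for one index."""
    high_pre, high_post = means[index]["high"]
    low_pre, low_post = means[index]["low"]
    return {
        "gain_high": round(high_post - high_pre, 4),
        "gain_low": round(low_post - low_pre, 4),
        "gap_pre": round(high_pre - low_pre, 4),
        "gap_post": round(high_post - low_post, 4),
    }


for index in means:
    print(index, summarize(index))
```

Read this way, the low achievers show the larger gains in frequency and variation, and the high achievers show the larger gain in accuracy, which matches the pattern described in the text.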
Possible reasons for the discrepancies
The interview data showed that the five high achievers had sometimes practiced text memorization since high school. They said that the practice was a ‘piece of cake’ to them (High-2), and even without the teacher’s instructions, they would go ahead to memorize a passage if they found it ‘contained beautiful words’ (High-1). The five low achievers, by contrast, all said that text memorization had been ‘boring’ (Low-2) and ‘extremely painful’ to them (Low-1); towards the end of the experiment, however, they realized that the practice could help them ‘with further understanding of the text’ (Low-2). One of them (Low-2) said: ‘Sometimes, some of the phrases in the text “jumped” to my brain. . . . This gave me confidence I did not have before.’
The practice of text memorization (or the lack of it) may explain the differences in the command of FSs between the high and low achievers. It may also help explain the differences in the focus of their attention when they were memorizing the text materials. Since the high achievers were already good at text memorization, they may have paid more attention to the linguistic, textual context of FSs, resulting in progress in the accuracy of their FS use. The low achievers, by contrast, may have made more effort to absorb the FSs they were not familiar with, resulting in progress in the frequency and variation of their FS use. At the same time, the use of newly learned FSs may have affected the accuracy; as a result, the low achievers did not make much progress in this respect.
The interview data also suggested that text memorization helped foster in the learners a habit of attending to language while engaged in reading or listening. This may help explain why the interviewees claimed that, in their post-test compositions, they had not used many of the FSs they had learned from the text materials assigned to them for memorization but had instead used many FSs they had learned elsewhere during the experimental period. They themselves, however, attributed this to the fact that the topic of the composition was not related to the content of the text materials they had memorized.
Conclusion
Major findings
The study produced the following findings:
1. Learners who practice text memorization make faster and greater progress in English proficiency and writing ability than those without such practice.
2. Learners who practice text memorization make significantly greater progress in the accuracy and variation of formulaic sequences used in their compositions than those without such practice.
3. The progress resulting from text memorization is especially marked for low achieving students in the frequency and variation of the FSs they used in writing and for high achieving students in accuracy. That is to say, through text memorization, low achievers learn to use more FSs in a broader range of variety while high achievers learn to use them more accurately.
Implications
The findings of the study suggest that text memorization is an effective second language learning strategy. In fact, as shown by the performance of the two groups, the strategy is more cost-effective than other learning practices when the total amount of time spent on learning is controlled for. There should be little doubt that teachers should encourage such practice, guiding students to attend to, imitate, memorize, and learn to use the collocations and sequences in the input, which will markedly improve the quality of their output.
Text memorization is effective in language learning largely because language is exemplar-based (Skehan, 1998). Learning language involves learning rules as well as exemplars or FSs, and the learning of FSs depends on frequent exposure and memorization rather than on analysis and reasoning. The practice of text memorization, among other things, improves learner command of these unanalysed or unanalysable sequences.
The current SLA literature does not answer questions as to how learners can overcome the failure to notice and to rehearse when they are under the pressure of real time communication. However, as Ding (2007) points out, because the practice of text memorization takes place outside of the settings of online meaning comprehension and expression, it enhances noticing as learners attend to formulaic sequences and other features of usage that tend to be ignored during real time communication. It also enhances rehearsal with repeated reading and recitation. The findings of this study lend support to this view.
Acknowledgements
The authors would like to thank Dr. Don Snow, a colleague from Nanjing University, who helped with the editing and revision of this manuscript.
References
Biggs, J. B. (1991). Approaches to learning in secondary and tertiary students in Hong Kong: Some comparative studies. Education Research Journal, 6, 27–39.
Bolinger, D. (1975). Meaning and memory. Forum Linguisticum, 1, 2–14.
Cook, G. (1994). Repetition and learning by heart: An aspect of intimate discourse, and its implications. ELT Journal, 48, 133–41.
Ding, Y. (2007). Text memorization and imitation: The practices of successful Chinese learners of English. System, 35, 271–80.
Dong, W., & Fu, L. X. (2003). The role of recitation input in the teaching of college English (in Chinese). Foreign Languages Society, 4, 56–59.
Ellis, N. (2003). Constructions, chunking, and connectionism: The emergence of second language structure. In C. J. Doughty & M. H. Long (Eds.), The handbook of second language acquisition (pp. 63–103). Malden, MA: Blackwell.
Ellis, N. (2006). Language acquisition as rational contingency learning. Applied Linguistics, 27, 1–24.
Howarth, P. (1998). Phraseology and second language proficiency. Applied Linguistics, 19, 24–44.
Lewis, M. (1993). The lexical approach: The state of ELT and a way forward. Hove, UK: Language Teaching Publications.
Nattinger, J. R., & DeCarrico, J. S. (1992). Lexical phrases and language teaching. Oxford: Oxford University Press.
Nesselhauf, N. (2003). The use of collocations by advanced learners of English and some implications for teaching. Applied Linguistics, 24, 223–42.
Parry, K. (Ed.). (1998). Culture, literacy, and learning English: Voices from the Chinese classroom. Portsmouth, NH: Heinemann.
Pawley, A., & Syder, F. (1983). Two puzzles for linguistic theory: Nativelike selection and nativelike fluency. In J. C. Richards & R. Schmidt (Eds.), Language and communication (pp. 191–226). London: Longman.
Pennycook, A. (1996). Borrowing others’ words: Text, ownership, memory, and plagiarism. TESOL Quarterly, 30, 201–30.
Qi, Y. (2006). A longitudinal study on the use of formulaic sequences in monologues of Chinese tertiary-level EFL learners. Unpublished doctoral dissertation, Nanjing University.
Schmidt, R. (1990). The role of consciousness in second language learning. Applied Linguistics, 11, 17–46.
Schmidt, R. (2001). Attention. In P. Robinson (Ed.), Cognition and second language instruction (pp. 3–32). Cambridge, UK: Cambridge University Press.
Skehan, P. (1998). A cognitive approach to language learning. Oxford: Oxford University Press.
Ting, Y. R. (2004). Learning English text by heart in a Chinese university. Xi’an: Shanxi Normal University Press.
Ting, Y. R., & Qi, Y. (2005). A study of the correlation between the use of formulaic sequences and English oral and writing proficiency (in Chinese). Journal of PLA Institute of Foreign Studies, 5, 49–53.
Wong-Fillmore, L. (1976). The second time around: Cognitive and social strategies in second language acquisition. Unpublished doctoral dissertation, Stanford University.
Wray, A. (2002). Formulaic language and the lexicon. Cambridge: Cambridge University Press.
Appendix A
The interviewees’ English scores (and ranks in the class) in the College Entrance Examinations (CEE) and for their intensive reading courses in three semesters

Table for Appendix A

                   CEE    Rank   Term 1   Rank   Term 2   Rank   Term 3   Rank   Pretest   Rank
High achiever 1    140    1      87       1      86       1      87       1      87        1
High achiever 2    138    2      78       2      85       2      82       2      84        2
High achiever 3    135    3      73       4      82       3      80       3      83        3
High achiever 4    135    3      75       3      78       4      80       3      81        4
High achiever 5    133    5      72       5      76       5      79       5      77        5
Low achiever 1     103    22     63       22     65       22     66       22     64        22
Low achiever 2     100    23     63       22     62       23     65       23     63        23
Low achiever 3     100    23     62       24     60       25     64       24     60        24
Low achiever 4     99     25     56       25     61       24     57       25     53        25
Low achiever 5     95     26     51       26     54       26     53       26     49        26
Note: CEE = score for the English test in the national college matriculation examinations; Term 1 = score for the intensive reading course in the first semester; etc.; pretest = score for the pretest.
Appendix B
Some of the FSs occurring in the texts the experimental group learned by heart:
used to
be concerned with
spend time doing sth.
look through
end up in
the result of
have to do sth.
break silence
a hint of
lose control of sb’s temper
shrug one’s shoulders
one’s first impressions of
take measures to do sth.
take/tear sb. away from
a capacity for
a great surprise
in accordance with
wait for
happen to do sth.
in anger
in one’s youth
catch sight of
bring sb. sth.
seem to do sth.
come across
have an instinct about
in a way
in a sense
sort of
from one’s experience
ask sb. for sth.
do sth. for sb.
go (all) to pieces
drive at
on account of
be taken aback
in good condition
so… that …
part of
in the case of
be of great importance
for example
the influences of
have influence on
live in
correspond to
the question of
distinguish between
ideally speaking
the role of
more than
take sth. for granted
because of
the problem of
change one’s opinions
change one’s mind
of course
the possibility of
be capable of
to a large extent
the nature of
the personality of
in other words
so that
the importance of
the style of
go out
no doubt
for one thing/for another
no exception
no matter how
both … and …
afford to do sth.
as well as
the nerves of
a memory of
in one’s eyes
at the same time
a bottle of
in front of
in despair
all manner of
the essence of
mix with
the pressure of
let alone
tend to do sth.
in fact
depend on
as long as
between … and …
in need of
lead to
an example of
learn from
log on
access to
add sth. to sb.
argue over
blame sb. for sth.
break the law
bring about
challenge sb. to do sth.
direct attention to
develop one’s ability
forbid sb. from doing sth.
have a long way to go
hear of
interfere in/with
lag behind
leave behind
look before you leap
lay sth. on sb.
make a solid basis for
draw a conclusion
pass by
prefer to do sth.
produce effect on
quit school
succeed in sth.
support sb. in doing sth.
take responsibility for sth.
take responsibility to do
take advantage of
travel through
in terms of
Note: The italicized FSs are those that the experimental group used in the post-test writing.
Chapter 5
Lexical Clusters in an EAP Textbook Corpus
David Wood
Carleton University
The present study investigates the most frequently used formulaic sequences in textbooks widely used in English for academic purposes (EAP) courses in North America. The overall objective is to determine what kind of exposure EAP learners receive to these important elements of academic prose, and whether EAP textbooks and materials promote the noticing, processing, and use of the formulaic sequences most relevant to reading and writing in the academy. Since EAP courses are meant to prepare non-native speakers of English to deal with the demands of postsecondary academic reading and writing, it stands to reason that they should deal with the formulaic sequences most frequently used to construct academic text in English. At the least, learners should be guided to notice and understand the functions and meanings of the sequences in texts drawn from appropriate academic genres.
Formulaic sequences are defined here as multi-word strings or frames which are processed mentally as if single words. The sequences are generally seen as serving a wide range of functions in discourse, and as a way of expressing concepts and textual relationships to facilitate efficient and effective communication. Formulaic sequences have been the focus of a variety of types of research over the past three or four decades, and have been given as many as 40 different labels (Wray & Perkins, 2000), but the term formulaic sequence has become the generic term for the collocations, idioms, metaphors, lexical phrases, and other labels and types of multiword units studied. Due to the nature of the language corpora to be analysed in the present study, and the process required to identify formulaic sequences, one specific type of formulaic sequence is under consideration.
In previous studies of formulaic sequences in academic language, using corpora of textbooks, lectures, and articles published in journals, as well as student writing and speech, lexical bundles have been the focus of inquiry. Lexical bundles
can be defined as frequently occurring combinations of three or more individual words which occur at a certain frequency across a range of texts in a given register. Frequency counts using corpus analysis software can locate the bundles, which can then be ranked as to frequency, and categorized by type, discourse function, or meaning. In the present study, the term lexical cluster is used, because the focus on a small corpus composed of topically diverse texts militated against the requirement of occurrence across a range of texts.
Previous research of this type has mainly focused on comparing work published in academic journals to that of novice and advanced regular university student writers in the same disciplines (Biber, Conrad, & Cortes, 2004; Cortes, 2004). EAP-focused studies are very few. One recently published work with this intent is by Jones and Haywood (2004), who studied the teaching and acquisition of particular formulaic sequences by EAP learners and found limited success in teaching a pre-set group of formulaic sequences. In EAP textbooks and materials, it is often the case that certain lexical bundles are presented as types of connectors or discourse markers, although they are seldom presented in a context or arranged as to frequency (Oshima & Hogue, 2004; Williams, 2005). It is usually assumed that lexical bundles will be acquired subconsciously through input and exposure to text, or that teaching lexical bundles as connectors and discourse markers from lists and through written exercises using decontextualized sentences can suffice.
Some movement is occurring in the EAP textbook field toward a more corpus-based approach to materials and textbook design (see Harwood, 2005 for an overview), but the mainstream EAP textbook market at present most likely relies on the tried and true, providing large amounts of academic reading and listening input and requiring learners to write extended texts. While providing learners with large amounts of input on various academic themes is part of a valid approach to EAP instruction, several questions arise which are related to the material used in such programmes:
1. Does the material contain rich reading input containing lexical clusters likely to be used in the actual texts the learners will encounter in regular introductory texts in their major disciplines?
2. Are learners made aware of the importance and the range of purposes to which the lexical clusters are put in the types of reading texts they are likely to encounter in regular introductory classes in their major disciplines?
Formulaic Language and EAP
For the past four decades, researchers have been interested in formulaic sequences, multi-word lexical strings or frames which are processed mentally as if single words. The sequences have a large number of discourse functions, and are often genre-specific ways to express concepts and textual relationships for efficient and effective communication. A relatively small but growing body of research explores relationships between the formulaic sequences in university written and oral discourse and materials used in the teaching of EAP. In academic writing, for example, some evidence has emerged to show that formulaic sequences are at high frequency in the writing of authors in academic disciplines, but they are comparatively rare in the production of both regular students and EAP students in those disciplines.
In 1983 Pawley and Syder noted a connection between formulas and fluent language use, observing that we tend not to make use of the endless lexical and grammatical potential of language, and they were among the first to observe that language production may be only partly based on rule-governed formation of utterances from lexis through syntax, morphology and phonology. In spontaneous speech, such a laborious method of language production would seem improbable, particularly in light of the limitations of human memory and attention. Over time, the development of computer technology, corpus study, and phraseology have provided discourse-based evidence of how words tend to collocate and cluster. Phraseologists have pointed out that collocations cross a wide spectrum, and can include phrasal verbs, prepositional phrases, and more (Mel’čuk, 1998).
A small number of studies have examined the role of formulaic sequences in adult L2 acquisition. Yorio (1980) found that formulaic language was used by adults as a production strategy, to save effort and attention in speaking.
Schmidt (1983), in a case study of an adult L2 learner in a naturalistic context, found the learner used increasing numbers and varieties of formulaic sequences while showing little development of grammatical and other aspects of language. Bolander (1989), in a longitudinal study of adult L2 acquisition, found that learners used formulaic sequences containing particular language structures long before they could show they had actually acquired the structures themselves. The links between formulaic sequences and pragmatic competence have been researched. Coulmas (1979, p. 241) states that ‘As they provide the verbal means for certain types of conventional action, their meanings are conditioned by the behavior patterns of which they are an integrated part,’ and goes on to note that
formulas help to facilitate unambiguous communication. Bygate (1988), in a study of adult learners, found pragmatic uses of formulas to include repetition, questioning, agreeing, confirming, clarification, and focusing attention. Formulaic sequences are ubiquitous in both written and spoken discourse (Schmitt & Carter, 2004). A study of spoken and written discourse by Erman and Warren (2000) found that word combinations comprised 58.6 per cent of the spoken corpus, and 52.6 per cent of the written corpus. The use of recurring word combinations has long been considered a marker of proficiency in many genres, including academic writing (Bamber, 1983; McCully, 1985). Over time, studies on the acquisition and use of formulaic sequences have employed a variety of research methods, including ethnography (Hakuta, 1974; Fillmore, 1979; Peters, 1983), conversation analysis (Manes & Wolfson, 1981; Tannen, 1987), and quantitative investigations of the use of multi-word units (Kjellmer, 1991; Sinclair, 1991), to name a few. Such research has tended to adopt one of two relatively distinct areas of focus. One of the foci has been sequences in which the meaning or function of the whole unit does not equal the sum of its component lexical parts, the so-called pure idioms (Chafe, 1968; Moon, 1998). The other area of focus has been fixed expressions in spoken language, particularly those which are related to particular speech events or which further the fluent production of speech (Coulmas, 1981; Wood, 2006; Yorio, 1980). The advent of corpus analysis has greatly benefitted and enriched the study of formulaic language, and the research has tended to adopt one of two approaches – the phraseological and the distributional (see Granger and Paquot, 2008 for a detailed review). The more traditionally phraseological approach involves a search for sequences which have been previously identified by researcher intuition or lexicographic or other means (e.g. Nattinger & DeCarrico, 1992). 
The distributional approach is more bottom-up, using corpus analysis software to identify co-occurrences of two, three, or more words, at different frequency cut-off points (Altenberg, 1998; Biber, Johansson, Leech, Conrad, & Finegan, 1999). Phraseological methods of corpus-based research into formulaic sequences have generally been used in the study of smaller corpora, particularly those composed of spoken language, and for identifying the meanings and discourse functions of the sequences. Distributional methods have yielded quite useful results with larger corpora and for the examination of genres such as academic prose. Another line of inquiry has focused on formulaic language in native and nonnative speaker writing in English (Bahns, 1993; Granger, 1998; Howarth,
1998), often uncovering a connection between use of formulaic sequences and pragmatic competence. As Granger (1998, p. 145) points out, it is often the case that rules of pragmatics are realized through the use of formulaic language. Some recent research has addressed the question of whether the formulaic language needed for academic writing can be acquired by native speakers through input and experience. Although this research has not addressed issues specific to EAP students, it is possible that English language learners could also acquire formulaic language through general exposure to it in both EAP and regular university classes. Hewings and Hewings (2002) conducted a key study comparing the use of certain linguistic features in published academic writing and that of regular university students. Cortes (2004) conducted a corpus-based study of the use of formulaic sequences in published and student writing in the specific disciplines of history and biology, finding that students tended not to use the formulaic sequences which were most frequent in published writing, and that when they did, they tended to do so for different pragmatic purposes. If these regular university students rarely used the target lexical bundles in their writing, EAP learners should have even less facility with them, which suggests that EAP programmes should include some explicit treatment of formulaic language. The use of formulaic sequences has been linked to perceptions of proficiency in academic writing. Wray (2002) noted that the use of formulaic sequences has two benefits for academic writers – they are connected to the expression of identity within disciplinary communities, and they reduce reading effort. Haswell (1991) pointed out that recurrent word combinations are a marker of mature academic writing.
Cortes (2004) remarked that the use of formulaic sequences ‘seems to signal competent language use within a register to the point that learning conventions of register use may in part consist in learning how to use certain fixed phrases’ (p. 398). All of these claims about the value of formulaic language have been analysed in research into a particular type of frequently occurring word combination called lexical bundles (Biber et al., 1999; Biber & Conrad, 1999). Lexical bundles are combinations of three or more words which are identified in a corpus of natural language by means of corpus analysis software programmes. An additional characteristic of lexical bundles is that they occur across a range of texts or, in the case of academic language, a range of disciplines. Biber and Conrad (1999) noted that these word combinations ‘are so common, it might be assumed that lexical bundles are
simple expressions, and that they will be acquired easily’ (p. 188). However, the acquisition and use of lexical bundles does not appear to occur naturally. Lexical bundles have been shown to be used at high frequency in published academic writing, and particular types of the bundles are characteristic of particular disciplines (Cortes, Jones, & Stoller, 2002). Academic disciplines have different ways of seeing the world, connected with different communicative conventions (Hyland & Hamp-Lyons, 2002). Biber (2006) presented a comprehensive corpus-based analysis of university language, including an examination of lexical bundles in textbooks. He found that academic disciplines differed in their use of lexical bundles, with natural and social sciences relying on them more than the humanities. Overall, the distribution of lexical bundles across functional categories in Biber’s study shows that referential bundles – making direct reference to real or abstract entities or to textual content or their attributes – are the most common. Stance bundles – expressing attitudes or assessments of certainty – are the second most common type of function for lexical bundles in the textbooks, whereas discourse organizers – reflecting relationships between previous and subsequent discourse – are the least common. Within the category of referential functions, it appears that quantity and intangible framing subfunctions represent the largest categories.
Method
The present study is an examination of the occurrence of lexical clusters in EAP textbooks. The corpus was compiled from six commercially available textbooks, at the intermediate-advanced level, published within the past seven years by major publishers:
• four multi-skills, comprehensive intermediate-advanced English for academic purposes (EAP) textbooks
• two reading and writing skills-focused textbooks
EAP materials generally contain both reading texts and instructional material, the latter consisting of exercises, instructions for activities, and so forth. The corpus used in the present study was divided into two subcorpora, a textual subcorpus made up of the reading texts from the textbooks, and an instructional subcorpus made up of the instructional material from the textbooks.
The corpus overall consisted of:
• Total corpus – 579,345 tokens (running words)
• Textual subcorpus – 187,959 tokens (running words)
• Instructional subcorpus – 391,386 tokens (running words)
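The relative size of the two subcorpora follows directly from the token counts listed above. A minimal sketch of that arithmetic is given here; the variable names are illustrative and are not taken from the study.

```python
# Token counts as listed above.
total_tokens = 579_345
subcorpora = {"textual": 187_959, "instructional": 391_386}

# The two subcorpora should account for the whole corpus.
assert sum(subcorpora.values()) == total_tokens

# Share of each subcorpus in the total corpus, in per cent.
shares = {name: count / total_tokens * 100 for name, count in subcorpora.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1f} per cent of the corpus")
```

On these counts the textual subcorpus is a little under one third of the corpus and the instructional subcorpus a little over two thirds.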
Before being compiled into a corpus, the textbooks were pruned of superfluous text which did not play a direct role in either instruction or input, for example titles, headings, captions, tables of contents, prefaces and overviews, and so on. The textual subcorpus made up roughly 32 per cent of the total corpus, a rather telling piece of data in that it indicates that actual textual input made up a relatively small part of the content of the textbooks. In contrast, the instructional subcorpus made up almost 68 per cent of the total corpus, which reveals that the language related to tasks and activities took up the great majority of the textual space. Textbook writers are responsible for creating valid learning activities in the materials they produce, and it stands to reason that the sets of instructions, explanations, questions, and advice that they present in textbooks are in some ways the heart of their craft. In EAP materials, instructional material tends to be centred around texts extracted from original academic sources, and that was the case in all of the textbooks selected for this analysis. However, the fact that instructional language predominated in this EAP corpus might seem to indicate that a wide range of tasks involving analysis, synthesis, and extrapolation followed from the texts, and that a certain amount of this activity would focus on formulaic language, either directly or implicitly. This was tested by searching page by page for reference material, exercises, or activities which overtly focused on the lexical clusters and other types of formulaic language. It is important to note that lexical bundles can be viewed as a subset of formulaic language, which is often defined as multi-word lexical strings or frames processed mentally as if single words, serving a wide variety of functions in discourse, and providing generally agreed upon means of expressing concepts and textual relationships which facilitate efficient and effective communication. 
This last is crucially important for EAP, and it might be said that formulaic language is key to the nature of particular discourses as certain recurring word combinations are used within such communities as commonly agreed upon ways to express ideas, to link ideas, to express stance or opinion, and so forth. Lexical bundles may be defined as frequently occurring combinations of three or more individual words
identified by means of corpus analysis, usually at a frequency of 20 per million words, across a certain number of texts or types of texts. The present study incorporated the methods used by researchers such as Biber, Cortes, and others, with certain modifications. For one, while adopting the frequency cut-off of 20/million words, no requirement of text coverage was used, because of the small size of the corpus and the fact that it was comprised of texts on a variety of topics seldom repeated. The unit of analysis in this study is four-word bundles for two reasons: they are more common than five-word bundles; they serve a wider range of function and structure than three-word bundles. WordSmith Tools 4.0 (Scott, 2004) was used to scan the corpus. The two subcorpora in this study, the instructional and the textual, were treated as separate in determining frequency of occurrence of clusters. That is, the frequency cut-off of 20/million words was used within the instructional subcorpus and within the textual subcorpus, and the overall corpus word count was not used in measuring frequencies. This yielded a remarkably high number of identified clusters in the instructional subcorpus, since the text in that subcorpus consisted to a great extent of repeated instructions, directives, and so on. This language is by necessity quite formulaic, and might be said to be an indication that the instructional language present in EAP textbooks is representative of a discourse community. In addition, publishers tend to try to ensure consistency within a textbook or series in the way instructional language is presented and phrased, which adds to the likelihood of repeated formulaic ways of expression. It may be argued that instructional language is not necessarily input for readers/learners, but rather organizational language used largely by classroom instructors. 
This view is an assumption which has been largely untested, however, and it is entirely possible that readers/learners read or take note of the written instructional language in textbooks. With this in mind, the two subcorpora were both treated in this study as sources of input for learners.

In classifying the bundles or clusters as to discourse function and so on, we used the very workable and user-friendly taxonomy elaborated by Hyland (2008), which reflects the three major metafunctions of language: ideational, textual, and interpersonal. The ideational functions are termed research-oriented by Hyland:

Research-oriented – help structure experience and activity of the real world.
• Location (at the same time, at the beginning of)
• Procedure (the use of the, the purpose of the)
• Quantification (a wide range of, one of the most)
• Description (the structure of the, the size of the)
• Topic (in the United States, the currency board system)

The textual functions are labeled by Hyland as text-oriented:

Text-oriented – deal with meaning of text and its organization.
• Transition signals (on the other hand, in addition to the)
• Resultative signals (as a result of, it was found that)
• Structuring signals (in the present study, in the next section)
• Framing signals (in the case of, on the basis of)

The interpersonal functions are labeled participant-oriented:

Participant-oriented – focused on the writer or the reader.
• Stance features (may be due to, it is possible that)
• Engagement features (as can be seen)
After scanning the corpora for lexical clusters and classifying them according to discourse function, each of the textbooks was examined page by page for reference material, exercises, or activities which overtly focused on the lexical bundles.
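The identification step described above can be illustrated with a short Python sketch. It is not the WordSmith Tools procedure used in the study; it only shows, under stated assumptions, how four-word clusters might be counted and how a cut-off of 20 per million words can be applied within a single subcorpus, using that subcorpus's own word count. The function names, the simple tokenizer, and the `min_raw` parameter are hypothetical. The `min_raw` floor is an added assumption: in a subcorpus of fewer than 50,000 words a single occurrence already exceeds 20 per million, so a minimum raw count may be worth setting.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic word tokens (a simplification)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def count_clusters(tokens, n=4):
    """Count every contiguous n-word sequence in the token list.

    Sequences are allowed to cross sentence boundaries in this sketch.
    """
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


def clusters_above_cutoff(text, per_million=20.0, n=4, min_raw=2):
    """Return {cluster: (raw_count, rate_per_million)} for one subcorpus.

    The rate is normalised by the word count of this subcorpus only.
    min_raw is a hypothetical safeguard, not part of the study's method.
    """
    tokens = tokenize(text)
    total = len(tokens)
    selected = {}
    for cluster, raw in count_clusters(tokens, n).items():
        rate = raw / total * 1_000_000
        if raw >= min_raw and rate >= per_million:
            selected[cluster] = (raw, rate)
    return selected


if __name__ == "__main__":
    # Invented miniature "instructional subcorpus", for illustration only.
    instructional = (
        "Answer the following questions. Listen to the lecture. "
        "Answer the following questions."
    )
    for cluster, (raw, rate) in clusters_above_cutoff(instructional).items():
        print(f"{cluster}: {raw} occurrences, {rate:.0f} per million")
```

Calling `clusters_above_cutoff` once per subcorpus, rather than once on the pooled text, mirrors the decision to measure frequency within each subcorpus separately.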
Results

The clusters

One of the most obvious features of the lexical clusters in the corpus is the fact that of the top 40 most frequent in the entire corpus, 26 or 80 per cent are from the instructional materials, not the texts. Table 5.1 presents a list of the clusters in the instructional subcorpus, that is, the instructions and so on, not the reading texts. Not surprisingly, a remarkably high number of clusters were identified in this subcorpus, over 800, of which the top 30 most frequent are presented in Table 5.1. The clusters in the instructional subcorpus vary in structure and type, and many of them are immediately recognizable as samples of instructional language, with references to lectures, words, reading resources, class, sentences, underlined words, and so on.

Table 5.1 Lexical Clusters in the Instructional Subcorpus
At the end of
Do you think the
This part of the
Listen to the lecture
On the left with
Of the reading selection
The meaning of the
The end of the
Scan the reading to
In the case of
Rest of the class
Form of the word
The words in the
In the united states
In the box below
Of the reading resources
Of the words in
You listen to the
The rest of the
What do you think
To answer the following
With the rest of
In the following sentences
Of the underlined words
Answer the following questions
Guess the meaning of
Is the main idea
At the beginning of
Answer the questions that
And the number of

Table 5.2 presents a list of the lexical clusters which occurred in the textual subcorpus, that is, the reading texts. Given the nature of this subcorpus, which consisted of a number of short texts extracted from various sources and from a broad range of topic areas, it would be expected that relatively few clusters would meet the frequency cut-off of 20/million words. In fact, 65 clusters were identified in these texts, and the top 30 most frequent are presented in Table 5.2. It is obvious from a glance that these clusters are different in lexical content from those in the instructional subcorpus presented in Table 5.1 earlier. Here we see no consistent lexical theme, and there appear to be more standard academic discourse functions attached to these clusters. For example, the most frequent cluster is on the other hand, which links contrasting information in texts. Clusters such as in the case of and a role in the are clearly recognizable as having specific textual discourse functions even when encountered in a decontextualized environment such as a list.

Table 5.2 Lexical Clusters in the Textual Subcorpus
On the other hand
In the form of
In the labor force
In the united states
At the beginning of
At the time of
At the same time
Is one of the
And the united states
One of the most
The size of the
The source of the
The nature of the
The beginning of the
In the level of
The end of the
According to the multistore model
A role in the
Parts of the world
Of the united states
In the northern hemisphere
Was one of the
The rest of the
Played a role in
To the multistore model
At the university of
In the U.S.
In the case of
Increase the amount of
Have the right to
The functions

The instructional subcorpus

Figure 5.1 is a chart illustrating the percentages of the bundles in the instructional subcorpus which were research, text, or participant oriented. As expected, most are research oriented, but over a third are participant oriented, as befits instructions and so on, directed to learners. Clusters such as listen to the lecture, do you think the, answer the following questions, and what do you think are actually directly addressed to the reader/learner.

Figure 5.1 Functional Categories of Clusters in Instructional Subcorpus
[Chart values: N/A 3.1%; Research-oriented 60.9%; Text-oriented 1.6%; Participant-oriented 34.4%]

Figure 5.2 is a chart illustrating functional subcategories of clusters in the instructional subcorpus. Looking a bit closer at the clusters in the instructional subcorpus, we find that research-oriented clusters dealing with location and description – bars two and four from the left – were most common, along with participant oriented clusters of engagement, the second last bar on the right. This latter category is exemplified by the examples listed above as being directly addressed to the reader/learner. The clusters dealing with location and description include on the left with, at the end of, at the beginning of, in the box below, to answer the following, and this part of the. An examination of the cluster list and the functional categories and subcategories from the instructional subcorpus reveals lexical content and functions consistent with classroom and instructional language – questions and commands and directives addressed to the reader/learner, along with clusters drawing attention to texts, lists, and other phenomena relevant to the instructional foci, as well as features of these foci. In some respects these types of clusters are similar to those used in classroom talk, as reported by Biber (2006), in that they tend to be more stance-oriented and discourse-focused than those in written academic text.

Figure 5.2 Functional Subcategories of Clusters in Instructional Subcorpus
[Bar chart, number of clusters, left to right: N/A 2; Research-oriented: Location 13; Research-oriented: Quantification 8; Research-oriented: Description 17; Research-oriented: Topic 1; Text-oriented: Framing Signal 1; Participant-oriented: Engagement 19; Participant-oriented: Stance 3]
The textual subcorpus

Figure 5.3 is a chart illustrating the percentages of the bundles in the textual subcorpus which were research, text, or participant oriented. As expected, the vast majority are research oriented, roughly a fifth are textual, and almost none are participant oriented or directed to learners. In this subcorpus, the clusters are overwhelmingly linked to the discourse of the texts themselves and the ideas they contain. The text-oriented clusters, such as on the other hand and at the time of, are virtually completely absent from the instructional subcorpus, which generally contains brief stretches of more isolated text which does not need a great deal of cohesion marked by lexical clusters.

Figure 5.3 Functional Categories of Clusters in Textual Subcorpus
[Chart values: N/A 3.1%; Research-oriented 75.4%; Text-oriented 20.0%; Participant-oriented 1.5%]

Figure 5.4 is a chart illustrating functional subcategories of clusters in the textual subcorpus. Looking more closely at the clusters in the textual subcorpus, we find that research oriented bundles of location, quantification, description and topic are most common. Given that the texts which comprise this subcorpus are drawn by the textbook authors from a wide range of original authentic sources linked to a wide range of themes, it is likely that these subcategories of functions are roughly similar to those to be found in academic writing in general.

Figure 5.4 Functional Subcategories of Clusters in Textual Subcorpus
[Bar chart, number of clusters: N/A 2; Research-oriented: Location 17; Research-oriented: Quantification 9; Research-oriented: Description 10; Research-oriented: Procedure 1; Research-oriented: Topic 12; Text-oriented: Transition 4; Text-oriented: Resultative 2; Text-oriented: Structuring 1; Text-oriented: Framing Signal 6; Participant-oriented: Stance 1]

The interesting broad conclusion to be made from examining this corpus is that an extremely high number of high frequency clusters are from instructions, not reading input, and that the nature of the clusters in the instructional subcorpus is different from that of the textual subcorpus. So if frequency alone is most important, as is the common assumption in many textbooks, and if the instructional language is a source of input for learners, we could expect learners to acquire bundles such as answer the following questions, scan the reading selection or listen to the lecture possibly ahead of or instead of more intuitively useful academic language such as in the case of, on the other hand, or one of the most. In other words, classroom procedural language dominates here, if frequency is used as the sole criterion.
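The category percentages in Figures 5.1 and 5.3 follow from the subcategory counts in Figures 5.2 and 5.4, and the arithmetic can be checked with a few lines of Python. The sketch below is only an illustration: the dictionary and function names are invented, and the counts are transcribed from the figures. The single structuring-signal cluster in the textual subcorpus is counted here as text-oriented, in line with the taxonomy set out earlier; this is an assumption of the sketch, and it is the grouping under which the text-oriented share comes to the 20.0 per cent shown in Figure 5.3.

```python
from collections import defaultdict

# Subcategory counts transcribed from Figures 5.2 and 5.4.
INSTRUCTIONAL = {
    ("N/A", "N/A"): 2,
    ("Research-oriented", "Location"): 13,
    ("Research-oriented", "Quantification"): 8,
    ("Research-oriented", "Description"): 17,
    ("Research-oriented", "Topic"): 1,
    ("Text-oriented", "Framing"): 1,
    ("Participant-oriented", "Engagement"): 19,
    ("Participant-oriented", "Stance"): 3,
}

TEXTUAL = {
    ("N/A", "N/A"): 2,
    ("Research-oriented", "Location"): 17,
    ("Research-oriented", "Quantification"): 9,
    ("Research-oriented", "Description"): 10,
    ("Research-oriented", "Procedure"): 1,
    ("Research-oriented", "Topic"): 12,
    ("Text-oriented", "Transition"): 4,
    ("Text-oriented", "Resultative"): 2,
    # Assumption: structuring signals belong to the text-oriented category.
    ("Text-oriented", "Structuring"): 1,
    ("Text-oriented", "Framing"): 6,
    ("Participant-oriented", "Stance"): 1,
}


def category_shares(counts):
    """Sum subcategory counts per main category and convert to percentages."""
    totals = defaultdict(int)
    for (category, _subcategory), n in counts.items():
        totals[category] += n
    grand_total = sum(totals.values())
    return {cat: round(100 * n / grand_total, 1) for cat, n in totals.items()}


if __name__ == "__main__":
    for name, counts in (("Instructional", INSTRUCTIONAL), ("Textual", TEXTUAL)):
        print(name, sum(counts.values()), category_shares(counts))
```

The textual counts sum to 65, the number of clusters reported for that subcorpus, and the instructional counts sum to 64.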
Lexical clusters in the instructional material

As noted earlier, the texts presented in EAP material are not necessarily selected or sequenced to function as extensive sources of input. The job of a textbook writer entails establishing learning activities which exploit certain features of the texts, at times using them as exemplars of features of academic prose. With this in mind, after the identification of the lexical clusters and their functions in the corpus, the textbooks were searched in depth and all the tasks in them completed, in order to see and feel how the material might have made up for a weak set of lexical clusters by teaching or drawing attention to their important roles in discourse and communication. Each textbook was searched page by page for reference material, exercises, or activities which overtly focused on the lexical clusters. The findings are as follows:
Textbook 1: 4 skills
In the case of may be used to complete one task.
Is one of the, one of the few, in the United States may be used to complete one task.
The source of the may be used to complete one task.
No tasks highlight the clusters in input.
No tasks require focus or practice of clusters.

Textbook 2: 4 skills
Unit on compare/contrast: instruction about in contrast; text shows that on the other hand and at the same time can indicate contrast, but not highlighted.
Unit on expressing purpose: instruction about in order to, so as to, so that, for the purpose of.

Textbook 3: 4 skills
The end of the helps comprehension in one task.
On the other hand and on the one hand help in completing one task.
In the labor force is highlighted in the input in one reading task.
A role in the helps with completing a written task.
One of the most is highlighted as valuable in completing one reading task.
With the idea of is useful in completing one reading task, but is not highlighted.

Textbook 4: 4 skills
If the suggested change occurs frequently in input for a task and is valuable – but not highlighted.
No tasks highlight the clusters in input.
No tasks require focus or practice of clusters.
Textbooks 5 and 6: Reading and Writing focus
No tasks highlight the clusters in input.
No tasks require focus or practice of clusters.

The picture which emerges from this examination of the pedagogical treatment of lexical clusters in the textbooks is that the textbook authors do not exploit this aspect of language in any systematic or meaningful way. The clusters are not taught or highlighted in the instructions. Tasks in the textbooks do not draw attention to the clusters, nor to any form of formulaic language. In some cases certain lexical clusters are useful for learners to complete some tasks, but these clusters are not explicitly attended to by the authors.

The textbooks are quite rich in lexical clusters, and both the instructional and textual subcorpora contain plenty of them. In terms of frequency, however, the instructional subcorpus contains clusters of quite high frequency which are not clearly relevant to academic reading and writing. The clusters which appear in the textual subcorpus may be very useful in academic discourse, and we may even assume that they likely are representative. However, these clusters do not occur at a high frequency and are therefore unlikely to be acquired from reading. This problem is exacerbated by the fact that this corpus is comprised of six textbooks, whereas the average EAP learner is likely to deal with only a few textbooks over the course of an EAP programme, and will therefore encounter fewer clusters or other types of formulaic sequences than we can see in this corpus. To compound the problem, in addition to the low frequency of most clusters in the textual subcorpus here, the authors do not exploit this resource in any meaningful way in the materials. From this representative sample of recent EAP textbooks, it appears that formulaic language is definitely not a priority of EAP textbook authors. They provide neither substantial input with high frequencies of lexical clusters, nor instructional support in recognizing and utilizing formulaic language.
Conclusion

In general, it can be concluded that the textbooks examined in this study are not particularly effective in dealing with lexical clusters and formulaic language. The highest frequency clusters appear in the instructional material, not the texts, meaning that the types of lexical clusters which students would be most likely to encounter in these textbooks are language classroom based and instructional, not those characteristic of academic prose. Furthermore, the lexical clusters which appear in the texts in these books are not present at high frequency, and the input the learners receive is impoverished in this regard. These weaknesses in the texts and discourse structures of the material in the textbooks might have been balanced had there been tasks and activities which drew attention to the clusters and their functions. However, virtually no exercises, reference materials, or activities in these textbooks focus overtly or even implicitly on lexical bundles.
It is clear from this small-scale general overview of formulaic language in EAP textbooks that more work is needed which focuses on classroom materials and the role of formulaic language in them. Larger scale study of corpora of EAP materials can further investigate the nature of the gap between EAP materials and the language of the academic world, examining the use of lexical clusters and bundles in texts, the types of activities and tasks which can facilitate acquisition of this aspect of language, and so on. A conscious awareness of the role played by lexical bundles in regular academic textbooks, for example, might help EAP materials designers to craft textbooks which include a focus on the relevant bundles. More research into this is needed – see the chapter by Chen in the present volume for an examination of how the lexical bundles in electrical engineering introductory textbooks match those in ESP materials designed specifically for electrical engineering learners.

If EAP materials are to be authentic, that is, if they are to present academic language which exemplifies that which is common in the academy, then materials developers need to attend to key features of academic language. Among these features is formulaic language, with its vital role in the creation of meaning and the structure of discourse. Materials developers should provide learners with input which is rich in formulaic sequences and activities which raise awareness of and facility with these elements of language.
References

Altenberg, B. (1998). On the phraseology of spoken English: The evidence of recurrent word-combinations. In A. P. Cowie (Ed.), Phraseology: Theory, analysis, and applications (pp. 101–22). Oxford: Clarendon.
Bahns, J. (1993). Lexical collocations: A contrastive view. English Language Teaching Journal, 47 (1), 101–14.
Bamberg, B. (1983). What makes a text coherent? College Composition and Communication, 34, 417–29.
Biber, D. (2006). University language. Amsterdam: John Benjamins.
Biber, D., & Conrad, S. (1999). Lexical bundles in conversation and academic prose. In H. Hasselgard & S. Oksefjell (Eds.), Out of corpora: Studies in honour of Stig Johansson (pp. 181–89). Amsterdam: Rodopi.
Biber, D., Conrad, S., & Cortes, V. (2004). If you look at . . . lexical bundles in academic lectures and textbooks. Applied Linguistics, 25, 371–405.
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Longman grammar of spoken and written English. London: Longman.
Bolander, M. (1989). Prefabs, patterns and rules in interaction? Formulaic speech in adult learners’ L2 Swedish. In K. Hyltenstam & L. K. Obler (Eds.), Bilingualism across the lifespan: Aspects of acquisition, maturity, and loss (pp. 73–86). Cambridge: Cambridge University Press.
Bygate, M. (1988). Units of oral expression and language learning in small group interaction. Applied Linguistics, 9 (1), 59–82.
Chafe, W. (1968). Idiomaticity as an anomaly in the Chomskyan paradigm. Foundations of Language, 4, 109–27.
Cortes, V. (2004). Lexical bundles in published and student disciplinary writing: Examples from history and biology. English for Specific Purposes, 23, 397–423.
Cortes, V., Jones, J., & Stoller, F. (2002, April). Lexical bundles in ESP reading and writing. Paper presented at TESOL Conference, Salt Lake City, Utah.
Coulmas, F. (1979). On the sociolinguistic relevance of routine formulae. Journal of Pragmatics, 3, 239–66.
Coulmas, F. (1981). Conversational routine. New York: Mouton Publishers.
Erman, B., & Warren, B. (2000). The idiom principle and the open-choice principle. Text, 20, 29–62.
Fillmore, L. W. (1979). Individual differences in second language. In C. Fillmore, D. Kempler, & W. Wang (Eds.), Individual differences in language ability and behavior (pp. 203–28). New York: Academic Press.
Granger, S. (1998). Prefabricated patterns in advanced ESL writing: Collocations and formulae. In A. Cowie (Ed.), Phraseology: Theory, analysis, and applications (pp. 145–60). Oxford: Oxford University Press.
Granger, S., & Paquot, M. (2008). Disentangling the phraseological web. In S. Granger & F. Meunier (Eds.), Phraseology: An interdisciplinary perspective (pp. 27–50). Amsterdam: John Benjamins.
Hakuta, K. (1974). Prefabricated patterns and the emergence of structure in second language acquisition. Language Learning, 24 (2), 287–97.
Harwood, N. (2005). What do we want EAP teaching materials for? Journal of English for Academic Purposes, 4 (2), 149–61.
Haswell, R. H. (1991). Gaining ground in college writing: Tales of development and interpretation. Dallas: Southern Methodist University Press.
Hewings, M., & Hewings, A. (2002). ‘It is interesting to note that . . . ’: A comparative study of anticipatory ‘it’ in student and published writing. English for Specific Purposes, 21, 367–83.
Howarth, P. (1998). The phraseology of learners’ academic writing. In A. Cowie (Ed.), Phraseology: Theory, analysis, and applications (pp. 161–86). Oxford: Oxford University Press.
Hyland, K. (2008). As can be seen: Lexical bundles and disciplinary variation. English for Specific Purposes, 27, 4–21.
Hyland, K., & Hamp-Lyons, L. (2002). EAP: Issues and directions. Journal of English for Academic Purposes, 1, 1–12.
Jones, M., & Haywood, S. (2004). Facilitating the acquisition of formulaic sequences. In N. Schmitt (Ed.), Formulaic sequences: Acquisition, processing and use (pp. 269–92). Amsterdam: John Benjamins.
Kjellmer, G. (1991). A mint of phrases. In K. Aijmer & B. Altenberg (Eds.), English corpus linguistics: Studies in honour of Jan Svartvik (pp. 111–27). London: Longman.
Manes, J., & Wolfson, N. (1981). The compliment formula. In F. Coulmas (Ed.), Conversational routine: Explorations in standardized communication situations and prepatterned speech (pp. 115–32). New York: Mouton Publishers.
McCully, G. (1985). Writing quality, coherence, and cohesion. Research in the Teaching of English, 19, 269–82.
Mel’čuk, I. (1998). Collocations and lexical functions. In A. P. Cowie (Ed.), Phraseology: Theory, analysis, and applications (pp. 23–53). Oxford: Clarendon.
Moon, R. (1998). Frequencies and forms of phrasal lexemes in English. In A. P. Cowie (Ed.), Phraseology: Theory, analysis, and applications (pp. 79–100). Oxford: Clarendon.
Nattinger, J. R., & DeCarrico, J. S. (1992). Lexical phrases and language teaching. Oxford: Oxford University Press.
Oshima, A., & Hogue, A. (2004). Writing academic English (4th ed.). Montreal, QC: Pearson.
Pawley, A., & Syder, F. H. (1983). Two puzzles for linguistic theory: Nativelike selection and nativelike fluency. In J. C. Richards & R. W. Schmidt (Eds.), Language and communication (pp. 191–226). New York: Longman.
Peters, A. M. (1983). Units of language acquisition. Cambridge: Cambridge University Press.
Schmidt, R. W. (1983). Interaction, acculturation, and the acquisition of communicative competence: A case study of an adult. In N. Wolfson & E. Judd (Eds.), Sociolinguistics and language acquisition (pp. 137–74). Rowley, MA: Newbury House.
Schmitt, N., & Carter, R. (2004). Formulaic sequences in action: An introduction. In N. Schmitt (Ed.), Formulaic sequences: Acquisition, processing and use (pp. 1–22). Amsterdam/Philadelphia: John Benjamins.
Scott, M. (2004). WordSmith Tools (Version 4). Oxford: Oxford University Press.
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Tannen, D. (1987). Repetition in conversation as spontaneous formulaicity. Text, 7 (3), 215–43.
Williams, J. (2005). Learning English for academic purposes. Montreal, QC: ERPI.
Wood, D. (2006). Uses and functions of formulaic sequences in second language speech: An exploration of the foundations of fluency. Canadian Modern Language Review, 63 (1), 13–33.
Wray, A., & Perkins, M. R. (2000). The functions of formulaic language: An integrated model. Language and Communication, 20, 1–28.
Wray, A. (2002). Formulaic language and the lexicon. Cambridge: Cambridge University Press.
Yorio, C. (1980). Conventionalized language forms and the development of communicative competence. TESOL Quarterly, 14 (4), 433–42.
Chapter 6

An Investigation of Lexical Bundles in ESP Textbooks and Electrical Engineering Introductory Textbooks

Lin Chen
Carleton University
Formulaic language plays a prominent role in structuring spoken and written discourse, as demonstrated in a range of studies (Nattinger & DeCarrico, 1992; Partington, 1998; Altenberg, 1998; Biber, Conrad, & Cortes, 2003, 2004; Hyland, 2008). Defined as multiword strings ‘[. . . ] stored and retrieved whole from memory at the time of use, rather than being subject to generation or analysis by the language grammar’ (Wray & Perkins, 2000, p. 1), formulaic language serves to relieve processing stress in time-demanding tasks (Wray, 2002). For example, sport commentators in live programmes rely heavily on formulaic language for the description of actions. In addition, formulaic language accomplishes pragmatic functions in spoken and written discourse. For example, research into business dialogues has found that ‘we need to’ frames requests from an authority (the board of a company) in a polite way and softens the authoritative power of the commands made by the management (O’Keefe, McCarthy, & Carter, 2007). Medical writers frequently use ‘it is/has been (often) . . . that X’ to introduce outside information as support and add authority to their own stances or positions (Oakey, 2002).

For a systematic description of formulaic language, researchers have proposed a range of linguistic descriptors and types, including ‘lexical phrase’ (Nattinger & DeCarrico, 1992), ‘formulaic sequence’ (Wray, 2002), ‘lexical bundle’ (Biber, Johansson, Leech, Conrad, & Finegan, 1999), ‘collocation’ (Partington, 1998), and more. From among these, the lexical bundle was chosen as the focus of the present study, since lexical bundles feature in significant proportions and have distinct pragmatic functions in spoken and written discourse. In the 40-million-word Longman Spoken and Written English (LSWE) Corpus, 28 percent of words in conversation and 20 percent of words in academic prose occur in three- and four-word lexical bundles (Biber et al., 1999). What is more, lexical bundles have been shown to fulfill specific functions of expressing stances, organizing discourse and specifying referential details in academic discourse (Biber et al., 2003, 2004).

The present study focuses on lexical bundles and their pragmatic functions in electrical engineering (EE) introductory textbooks and English for specific purposes (ESP) textbooks. With the aim of investigating a gap in language use between the two types of textbooks, this study examines how lexical bundles with consistent pragmatic functions construct EE text, whether these lexical bundles are present in ESP textbooks, and whether pragmatic functions of the lexical bundles in EE textbooks are maintained in pedagogical materials.
Background of the Present Study

Lexical bundles, defined as ‘sequences of word forms that commonly go together in natural discourse’ (Biber et al., 1999, p. 990), are identified mainly by the statistical feature of frequency. For example, for four-word sequences to be qualified as lexical bundles in the LSWE Corpus, Biber et al. (1999) specified that (1) the minimal cut-off frequency is five times per million words, and (2) the sequences should occur in at least five texts to avoid idiosyncrasies of individual speakers and writers. Lexical bundles are characterized by their pragmatic functions across four university registers, including classroom teaching, textbooks, conversation and academic prose (Biber et al., 2003, 2004). Biber et al. (2004) proposed a functional taxonomy, grouping four-word lexical bundles into three main categories: (1) stance bundles, (2) discourse organizers, and (3) referential bundles. Under each main category, Biber et al. (2004) identified subcategories with more specific functions. Stance bundles (e.g. are more likely to) express epistemic certainty or attitude/modality towards following propositions (see example 1). Discourse organizers (e.g. want to talk about, on the other hand) indicate the logical relationship between discourse segments, functioning as markers to introduce new topics or elaborate on current issues (see example 2). Referential bundles (e.g. those of you who, the nature of the) ‘make direct reference to the physical, temporal, and contextual context’ (Biber et al., 2003, p. 81). For example, ‘the nature of the’ specifies an abstract attribute of the entity coming after (see example 3).
1. Boys are more likely to be hyperactive, disruptive, and aggressive in class. (p. 390)
2. But, before I do that, I want to talk about Plato. (p. 392)
3. . . . students must define and constantly refine the nature of the problem. (p. 395)

While the studies mentioned above investigated general language use in written and spoken registers (Biber et al., 2003, 2004), other studies have focused on language use in published research articles from a variety of disciplines (Cortes, 2004; Hyland, 2008). Disciplines in tertiary settings have their own communicative practices and norms; writers in the discipline establish an appropriate reader/author relationship and precisely present propositional contents by drawing on particular lexical bundles to express their evaluation, engage readers, describe propositional facts, and organize the flow of discourse (Cortes, 2004; Hyland, 2008). For example, since biology and electrical engineering take a ‘[. . . ] linear and problem-oriented approach to knowledge construction’ (Hyland, 2008, p. 19), writers of research articles in the two disciplines employ stance bundles to exert directive forces on readers by addressing them directly (e.g. we can see that, it is important that). In addition, these writers use a large number of referential bundles for a precise description of research procedures in disciplines based on experimental research (Hyland, 2008). Writers of biology journal articles use referential bundles to specify time/location information (at the beginning of, at the University of), describe physical attributes (the depth of), and/or state quantities (a large number of) (Cortes, 2004).

With the research focus on lexical bundles in general spoken/written registers or discipline-specific research articles, previous studies seem to have ignored language use in introductory textbooks of particular disciplines.
Serving as an important type of text in tertiary settings, entry-level textbooks present valid knowledge in the disciplines and offer the basis for academic lectures and evaluated tasks (exams, graded assignments, oral and written reports) (Olson, 1989; Carson, 2001). However, to date, little research has been conducted on describing how lexical bundles construct the text in introductory textbooks of different disciplines.

Academic neophytes, including both native and non-native English speakers, form a major readership of introductory textbooks. Some L2 learners simultaneously take introductory courses and preparatory ESP courses that are designed to focus on the development of discipline-specific language skills. As ESP textbooks are often the main reading materials for non-native speakers in English classrooms, these textbooks determine the language input that these students are immersed in and construct the basis of learning tasks in language classrooms (McDonough, 1998). The websites and prefaces of ESP textbooks explicitly state that the textbooks use authentic, discipline-specific readings as the main contents. However, it is not clear how authentic these pedagogical materials are and whether a gap exists in language use between ESP textbooks and introductory textbooks, since little empirical research has been conducted regarding this concern.

Understanding how lexical bundles construct the text in introductory textbooks can help academic neophytes develop their discipline-specific literacy skills. In addition, identification of a gap in language use between ESP textbooks and discipline introductory textbooks could increase non-native speakers’ chances of boosting discipline-specific language skills in language classrooms. Taking these factors into consideration, the present study initiates the investigation of lexical bundles in electrical engineering (EE) introductory textbooks and explores a gap in language use between ESP textbooks and EE introductory textbooks. Three main research questions are addressed:

1. What are the lexical bundles and their pragmatic functions in EE introductory textbooks?
2. Do these lexical bundles occur in ESP textbooks?
3. Do the pragmatic functions of the bundles differ in ESP textbooks?
Methodology
This corpus-based study examined and compared the use of lexical bundles in EE introductory textbooks and ESP textbooks in three steps. First, two corpora were set up, one representing EE introductory textbooks and the other ESP textbooks, to serve as the database for textual analysis. Second, lexical bundles were identified in the EE textbooks. WordSmith Tools 4.0 (Scott, 2004), a text analysis software package, generated frequency and distribution data for four-word sequences. Only word sequences that satisfied both the frequency and the distribution criteria qualified as lexical bundles (see Biber et al., 1999). Finally, WordSmith Tools 4.0 generated concordance lines or longer stretches of text from the EE textbooks and ESP textbooks, which were examined for the pragmatic functions of lexical bundles.
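The third step relies on concordance lines, that is, every occurrence of a bundle shown with a window of surrounding text. The study used the Concord function of WordSmith Tools for this; the Python sketch below is only an illustrative stand-in, and the function name `concordance` and the `width` parameter are assumptions introduced here, not part of the original procedure.

```python
import re


def concordance(text, phrase, width=40):
    """Return keyword-in-context lines for every occurrence of `phrase`.

    Hypothetical helper: `width` is the number of characters of context
    kept on each side of the match (an assumption, not a WordSmith setting).
    """
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that the matches line up vertically.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


if __name__ == "__main__":
    sample = "Note that the power supplied is equal to the power absorbed."
    for line in concordance(sample, "is equal to the"):
        print(line)
```

Reading the bundle together with its left and right context is what allows a pragmatic function to be assigned to each occurrence.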
The corpora
To examine the gap in language use between EE textbooks and ESP textbooks, the electrical engineering introductory textbook corpus (EEITC) and the English for specific purposes textbook corpus (ESPTC) were set up. With 247 346 running words, the EEITC consists of all 16 chapters of Basic engineering circuit analysis (Irwin, 2002), the first five chapters of Microelectronic circuits (Sedra & Smith, 2004) and a laboratory manual for an introductory course. The decision to include the two EE introductory textbooks in the EEITC was based on an online survey of the EE undergraduate programmes at three Ontario universities and an EE professor’s evaluation of the representativeness of selected EE introductory textbooks. The survey indicated that the core introductory courses in the three programmes designate a total of four textbooks as course books, which offer EE novices basic concepts in circuit analysis regardless of their future specialization. A professor currently teaching an EE introductory course at one of the universities was consulted to evaluate the representativeness of the four textbooks. The professor identified two textbooks as representative, Basic engineering circuit analysis (Irwin, 2002) and Microelectronic circuits (Sedra & Smith, 2004). He considered the first textbook a good introductory course book because of its coverage of basic concepts in circuit analysis. As for the second book, he recommended the first five chapters for a one-term introductory course. The laboratory manual for Circuits and signals was included in the EEITC because it offered experiment instructions for students to follow in laboratory sessions. With 99 774 running words, the ESPTC draws its content from two ESP textbooks for EE students.
Due to the limited supply of ESP textbooks designed for EE students on the market, only two ESP textbooks were included in the ESPTC, which are Oxford English for electronics (Glendinning & McEwan, 1993) and Oxford English for electrical and mechanical engineering (Glendinning & Glendinning, 1995). The ESP textbooks contain authentic reading materials for real communicative purposes (not created for ESP pedagogical purposes) from a wide range of sources with topics relevant to EE (e.g. electronics in the home, block diagrams and circuits).
Analytical procedures
After the two corpora were established, the frequency and distribution criteria for identifying lexical bundles were set. Since the EEITC is similar to the corpus used in Cortes’ (2004) study in size and specificity of language use, the same cut-off frequency was adopted: a minimum of 20 occurrences per million words for four-word lexical bundles. The distribution criterion of Biber et al. (1999) was also followed, requiring a lexical bundle to occur in at least five texts. Concord and Wordlist, two functions of WordSmith Tools 4.0 (Scott, 2004), were used to identify lexical bundles and to provide concordance lines. Wordlist produced a list of four-word clusters that met the minimum cut-off frequency criterion. Concord was used to check whether the clusters met the distribution criterion and to extract concordance lines of lexical bundles or, if necessary, longer stretches of text from the original text. The functional taxonomy of Biber et al. (2004) served as the analytic framework for determining the pragmatic functions of lexical bundles in the two corpora. If a lexical bundle had multiple functions, only its primary function was categorized for further analysis. The final decisions about the pragmatic functions of lexical bundles were made after consultation with two university professors with expertise in applied linguistics. Lexical bundles identified in the EEITC were examined for their presence and pragmatic functions in the ESPTC. The comparison of lexical bundles and their functions in the two corpora is reported in detail below.
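The two criteria can be illustrated with a short Python sketch. It is not the WordSmith Tools procedure used in the study: the tokenizer, the function names and the decision to let four-word sequences run across sentence boundaries are all simplifying assumptions made here. The sketch converts the normalized cut-off (20 per million words) into a raw count for the corpus at hand and then keeps only the four-word sequences that also occur in at least five texts.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase word tokenizer (an assumption; WordSmith's settings may differ)."""
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())


def raw_threshold(corpus_size, per_million=20):
    """Convert a per-million cut-off into a minimum raw count for this corpus."""
    return max(1, math.ceil(per_million * corpus_size / 1_000_000))


def find_bundles(texts, per_million=20, min_texts=5, n=4):
    """Return {n-gram: count} for n-grams meeting both criteria.

    `texts` maps a text identifier (e.g. a chapter) to its raw string.
    """
    freq = Counter()
    spread = defaultdict(set)  # n-gram -> identifiers of texts containing it
    total_tokens = 0
    for text_id, text in texts.items():
        tokens = tokenize(text)
        total_tokens += len(tokens)
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            freq[gram] += 1
            spread[gram].add(text_id)
    minimum = raw_threshold(total_tokens, per_million)
    return {
        gram: count
        for gram, count in freq.items()
        if count >= minimum and len(spread[gram]) >= min_texts
    }


if __name__ == "__main__":
    # For a corpus the size of the EEITC, 20 per million is about 4.95 raw hits.
    print(raw_threshold(247_346))
```

Under these assumptions, a sequence in a corpus of 247 346 words needs at least five occurrences, spread over at least five texts, to count as a lexical bundle.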
Results
Lexical bundles in electrical engineering introductory textbooks
A total of 105 four-word lexical bundles were identified in the EEITC (see Table 6.1). These bundles were categorized as stance bundles, discourse organizers, and referential bundles. Each main functional category has several subcategories with more specific meanings and functions. Within each functional subcategory, the lexical bundles were sequenced in descending order of frequency.

Table 6.1 Functional Classification of Four-Word Lexical Bundles in the EEITC
(Categories, each followed by its lexical bundles, sequenced in descending order of frequency)

I. Stance bundles
  A. Epistemic stance
    A1. Impersonal: the fact that the, from the fact that, to note that the, we assume that the, can be used to, can be obtained by, can be written as, can be found from, be found from the
    A2. Personal: we note that the, we see that the, we find that the
  B. Attitudinal/modality stance
    B1. Desire: we wish to determine, wish to determine the, we wish to find, wish to find the
    B2. Obligation/directive: consider the circuit shown, let us determine the, let us find the
    B3. Intention/prediction (personal): we will use the
II. Discourse organizers
  Topic elaboration/clarification: on the other hand, as well as the, as long as the
III. Referential expressions
  A. Identification/focus: is known as the, is determined by the, in the case of, referred to as the, is independent of the, is defined as the
  B. Specification of attributes
    B1. Quantity specification: is equal to the, in the range of, the sum of the, is proportional to the, is the sum of, is directly proportional to, the remainder of the
    B2. Tangible framing attributes: the magnitude of the, the value of the, the voltage across the, in the design of, the current in the, the gain of the, the operation of the, the ratio of the, the slope of the, the frequency response of, the voltage at the, input resistance of the, the output of the, the input resistance of, the values of the, gain of the amplifier, on the value of, the current through the, of the input signal, voltage across the capacitor, the polarity of the, the voltage gain of, equivalent circuit for the, the difference between the, energy stored in the, for the network in, the relationship between the, the direction of the, voltage gain of the, an expression for the, the energy stored in, the expression for the, the amplitude of the, a plot of the, a sketch of the, value of the current, the location of the, the resistance of the, power delivered to the, a function of the, the details of the
    B3. Intangible framing attributes: in terms of the, with the result that, with respect to the, a result of the, the effect of the, the use of the, the case of the, the presence of the
  C. Time/place/text reference
    C1. Place reference: is connected to the, in series with the, in the frequency domain, in the time domain, in parallel with the, in the circuit in, in the circuit of, in the network in, present in the network, delivered to the load, be connected to the
    C2. Text deixis: is of the form, it follows that the, given by the expression, in this case the, is given by the
    C3. Multi-functional reference: at the end of, the end of the
Stance bundles in the EEITC Stance bundles in the EEITC express assessment or attitudes toward following propositions. They fall into two categories: epistemic stance bundles evaluating the knowledge status of the proposition coming after, and attitudinal/modality stance bundles expressing the authors’ attitudes towards ongoing events. As seen in Table 6.1, the majority of epistemic stance bundles are impersonal ones (e.g. the fact that the, can be used to), demonstrating that EE introductory textbook authors prefer to assess the proposition after lexical bundles without involving their personal opinions. This usage is in line with the impersonal style of discipline-specific textbooks (Conrad, 1996; Reppen, 2004), where textbook authors avoid personal stances in arguments and claim to maintain text objectivity and an authoritative control over the data and the audience (Hyland, 2002). With the form of a modal verb + passive verb, impersonal epistemic stance bundles (e.g. can be used to, can be obtained by) express logical possibility towards the following proposition, hedging the statements without blocking the possibility of potential alternatives (see example 4): (4) Ebers and Moll, two early workers in the area, have shown that this composite model can be used to predict the operation of the BJT in all of its possible modes. (ME Chapter 05) Personal epistemic bundles (e.g. we note that the, we see that the) explicitly state knowledge claims or analysis results (see example 5). Verbs such as find, see, and note guide readers’ attention to the propositional facts. In
addition, the use of the collective pronoun we considers readers as active participants of the analysis/argument procedure. (5) In addition, we note that the solution of sinusoidal steady-state circuits would be relatively simple if we could write the phasor equation directly from the circuit description. (CA Chapter 07) Attitudinal/modality stance bundles in the EEITC mark textbook authors’ attitude towards the events in the following proposition, including desire, obligation/directive and intention (see Table 6.1). Desire bundles (e.g. we wish to determine, we wish to find) often occur in the instructions for learning tasks where the authors clarify the ends of these tasks (see example 6). Frequently appearing in imperative clauses, obligation/directive bundles (e.g. consider the circuit shown) express the writer’s stance of directing readers to participate in particular learning tasks (see example 7). The only intention bundle, we will use the, indicates the writer’s intention (see example 8). (6) We wish to find all the currents and voltages labelled in the ladder network shown in Fig. 2.30a. (CA Chapter 02) (7) For example, consider the circuit shown in Fig.2.18a. (CA Chapter 02) (8) In general, we will use the initial condition value since it is generally the one known, but the value at any instant could be used. (CA Chapter 06) Discourse organizers in the EEITC In the EEITC, only three bundles, on the other hand, as well as the, and as long as the function as discourse organizers connecting prior and coming discourse. For example, on the other hand signals the contrast between coming and prior discourse (see example 9): (9) . . . , the power delivered to the load (from the secondary side of the transformer) is less than or at most equal to the power supplied by the signal source. On the other hand, an amplifier provides the load with power greater than that obtained from the signal source. 
(ME Chapter 01) Referential bundles in the EEITC Three main subcategories of referential bundles were identified in the EEITC: referential identification/focus bundles, attribute specification bundles and time/place/text reference bundles (see Table 6.1), which identify important entities, specify concrete or abstract characteristics, and make reference to time, place and text. Referential identification/focus
bundles direct readers’ attention to the noun phrases following the bundles, and thus are widely used for the purpose of introducing concepts/definitions or identifying particular situations in the EEITC. In example 10, is known as the introduces a loop technique with the term the supermesh approach. (10) One is a special loop techniques and the other is known as the supermesh approach. (CA Chapter 03) Attribute specification bundles in the EEITC specify quantity or frame tangible/intangible attributes of the noun phrases after the bundles. For example, is equal to the, a quantity specification bundle, specifies quantities of particular items as in example 11: (11) Note that the power supplied is equal to the power absorbed. (CA Chapter 01) In addition to quantity specification bundles, most of the attribute specification bundles in the EEITC frame physical or abstract characteristics of following noun phrases, correspondingly labelled as tangible or intangible attribute framing bundles. Examples of the former subcategory include the magnitude of the, the value of the, and so on, while bundles of the latter type are like in terms of the, with the result that, and so on. The magnitude of the, a tangible attribute referential lexical bundle with the highest occurrence of 63 times in the EEITC, captures magnitude as an important specification parameter for electric current, voltage, and many other variables in circuit analysis (see example 12). Intangible attribute framing bundles frame abstract properties of noun phrases after the bundles (see example 13). (12) Therefore, it is important to specify not only the magnitude of the variable representing the current, but also its direction. (CA Chapter 01) (13) Write the equation that defines the voltage relationship between the two nonreference nodes as a result of the presence of the voltage source. 
(CA Chapter 03)
Among the total of 18 place/text/multi-functional reference bundles (see Table 6.1), 11 are place reference bundles (e.g. in series with the, in the frequency domain), which offer location information or describe the spatial connection among electrical components in electrical circuits (see example 14). The rest are text deixis bundles and multi-functional reference bundles. Text deixis bundles (e.g. is of the form, it follows that the) make reference to text or formulas, connect written texts with diagrammatic texts (e.g. formulas, diagrams), integrate the two modes of text and create meaning out of the new combination (see example 15). Multi-functional
reference bundles refer to time or place information, depending on the original context. In example 16, at the end of in the first concordance line refers to a particular location while the bundle in the second line locates a specific time point.
(14) ... we exchange the current source and parallel impedance for a voltage source in series with the impedance, as shown in Fig. 7.19a. (CA Chapter 07)
(15) Since the roots are real and unequal, the circuit is overdamped, and v(t) is of the form $v(t) = K_1 e^{-2t} + K_2 e^{-0.5t}$. (CA Chapter 06)
(16) 1. numerous resistors, it is recommended that the analysis begin at the end of the network opposite the terminals. (CA Chapter 02)
2. requires that the energy stored in the inductor must be the same at the end of each switching cycle. (CA Chapter 06)
Lexical bundles in ESP textbooks
Table 6.2 includes the lexical bundles present in the ESPTC, which are classified into functional subcategories according to their pragmatic functions in the original text. Although the lexical bundles present in the ESP textbooks retain the same pragmatic functions as in EE introductory textbooks, the total number of bundles in the ESPTC is much lower than in the EEITC.

Table 6.2 Functional Classification of Four-Word Lexical Bundles in the ESPTC
(Categories, each followed by its lexical bundles, sequenced in descending order of frequency)

I. Stance bundles
  Impersonal epistemic stance: the fact that the, from the fact that, can be used to
II. Discourse organizers
  Topic elaboration/clarification: as long as the
III. Referential expressions
  A. Identification/focus: is known as the, is determined by the, in the case of, referred to as the
  B. Specification of attributes
    B1. Quantity specification: is proportional to the
    B2. Tangible framing attributes: the magnitude of the, the value of the, in the design of, the current in the, the operation of the, the frequency response of, the output of the, the values of the, the current through the, of the input signal, the direction of the, the amplitude of the, value of the current, the resistance of the
    B3. Intangible framing attributes: with the result that, with respect to the, a result of the, the effect of the, the presence of the
  C. Place/time reference
    C1. Place reference: is connected to the, in series with the, in parallel with the
    C2. Time reference: at the end of, the end of the

Impersonal epistemic stance bundles in the ESPTC (e.g. the fact that the, can be used to) indicate assessment of the following proposition (see Table 6.2). For instance, the fact that the expresses certainty towards the coming proposition by emphasizing it as a fact (see example 17).
(17) They stem from the fact that the tolerance extremes of a value reach the extremes of adjacent values, thereby covering the whole range without overlap. (English for electronics)
The only discourse organizer in the ESPTC, as long as the, connects two propositions and clarifies the following proposition as the necessary condition for the prior proposition to happen (see example 18).
(18) The armature continues turning as long as the direction of the current, and therefore its magnetic poles, keeps being reversed. (English for electrical and mechanical engineering)
Identification/focus reference bundles in the ESPTC (e.g. is known as the, is determined by the) (see Table 6.2) guide readers’ attention to the noun phrases after the bundles (see example 19):
(19) The overall effect of this phenomenon for the whole amplifier is known as the total harmonic distortion (THD). (English for electronics)
Attribute specification bundles in the ESPTC (see Table 6.2) specify quantity, tangible or intangible features of particular entities. The majority of these bundles are tangible/intangible attribute framing bundles. Tangible attribute framing bundles (e.g. the magnitude of the, in the design of) specify physical characteristics of the following entities while intangible attribute framing bundles (e.g. with the result that, with respect to the) specify the abstract characteristics of particular entities. Place reference bundles (e.g. is connected to the) and time reference bundles (e.g. at the end of) in the ESPTC respectively specify spatial connections among electrical components and indicate exact time points in detailed description of research procedures.
The comparison of lexical bundles in the ESPTC and the EEITC The comparison of lexical bundles and their pragmatic functions in the EEITC with those in the ESPTC indicates a gap in language use between the two corpora. Table 6.3 lists the percentages of functional subcategories of lexical bundles in both corpora. Figure 6.1 compares subcategory distributions of lexical bundles in the two corpora by stacking subcategory percentages of lexical bundles in the EEITC with those in the ESPTC. The data from Figure 6.1 and Table 6.3 are used to interpret how lexical bundles with pragmatic functions have different distributions in the two corpora and how in different ways these bundles structure the text in EE introductory textbooks and ESP textbooks.
Table 6.3 Percentages of Functional Subcategories of Lexical Bundles in the EEITC and ESPTC

Main functional type    Functional subcategory                                       EEITC (%)   ESPTC (%)
Stance bundles          Epistemic stance bundles: personal                           2.86        0
                        Epistemic stance bundles: impersonal                         8.57        9.09
                        Attitudinal/modality stance bundles: desire                  3.81        0
                        Attitudinal/modality stance bundles: directive/obligation    2.86        0
                        Attitudinal/modality stance bundles: intention               0.95        0
                        Total                                                        19.05       9.09
Discourse organizers    Topic clarification                                          2.86        3.03
Referential bundles     Identification/focus                                         5.71        12.12
                        Attribute specification: quantity                            6.67        3.03
                        Tangible attribute specification                             40.95       42.42
                        Intangible attribute specification                           7.62        15.15
                        Place reference bundles                                      10.48       9.09
                        Text deixis bundles                                          4.76        0
                        Time reference bundles                                       0           6.06
                        Multi-functional reference bundles                           1.90        0
                        Total                                                        78.09       87.87
Figure 6.1 Proportional Distribution of Functional Subcategories of Lexical Bundles in the EEITC and the ESPTC [stacked bar chart; x-axis: functional subcategories 1 to 14; y-axis: per cent, 0.0% to 100.0%; legend: corpora, the EEITC and the ESPTC]
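The comparison that follows can be retraced from the percentages in Table 6.3. The Python sketch below is an illustration rather than part of the original analysis: the row labels are shortened, and the function names are assumptions introduced here. It totals the percentages by main functional type, lists the subcategories attested in the EEITC but absent from the ESPTC, and ranks subcategories by the difference between the two corpora.

```python
# (main type, subcategory, EEITC %, ESPTC %), transcribed from Table 6.3.
ROWS = [
    ("stance", "epistemic: personal", 2.86, 0.0),
    ("stance", "epistemic: impersonal", 8.57, 9.09),
    ("stance", "attitudinal: desire", 3.81, 0.0),
    ("stance", "attitudinal: directive/obligation", 2.86, 0.0),
    ("stance", "attitudinal: intention", 0.95, 0.0),
    ("discourse", "topic clarification", 2.86, 3.03),
    ("referential", "identification/focus", 5.71, 12.12),
    ("referential", "quantity specification", 6.67, 3.03),
    ("referential", "tangible attribute", 40.95, 42.42),
    ("referential", "intangible attribute", 7.62, 15.15),
    ("referential", "place reference", 10.48, 9.09),
    ("referential", "text deixis", 4.76, 0.0),
    ("referential", "time reference", 0.0, 6.06),
    ("referential", "multi-functional reference", 1.90, 0.0),
]


def type_totals(rows):
    """Sum the percentages per main functional type for each corpus."""
    totals = {}
    for main, _, eeitc, esptc in rows:
        ee_sum, esp_sum = totals.get(main, (0.0, 0.0))
        totals[main] = (ee_sum + eeitc, esp_sum + esptc)
    return {k: (round(a, 2), round(b, 2)) for k, (a, b) in totals.items()}


def absent_from_esptc(rows):
    """Subcategories attested in the EEITC but not in the ESPTC."""
    return [sub for _, sub, eeitc, esptc in rows if eeitc > 0 and esptc == 0]


def gaps(rows):
    """(subcategory, ESPTC minus EEITC) pairs, most under-represented first."""
    diffs = [(sub, round(esptc - eeitc, 2)) for _, sub, eeitc, esptc in rows]
    return sorted(diffs, key=lambda pair: pair[1])


if __name__ == "__main__":
    print(type_totals(ROWS))
    print(absent_from_esptc(ROWS))
    for sub, diff in gaps(ROWS):
        print(f"{sub:35s} {diff:+.2f}")
```

Because the two corpora contain different numbers of bundles, percentage-point differences describe the relative weight of each subcategory rather than absolute counts.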
The comparison of stance bundles in the EEITC and the ESPTC Stance bundles in the EEITC indicate how textbook writers construct their role as discipline experts and engage readers as participants in tasks or as observers of particular lines of argument. Textbook writers’ preference for impersonal epistemic stance bundles (can be obtained by, from the fact that) introduces shared evaluation towards particular propositions, keeps authors’ personal opinions out of the text, and thus helps to maintain objectivity and authority of the texts (Conrad, 1996; Reppen, 2004). The common use of the collective first person pronouns we and us in four other subcategories of stance bundles (we note that the, let us determine the) signposts writers’ invitations to readers as participants in the text and leads readers to become aware of the intellectual tasks or lines of arguments deemed essential by writers. Through the use of stance bundles, EE introductory textbook authors position readers as learners and present them with an integrated and
coherent picture of well-established knowledge in the discipline. To initiate novices into the discipline norms, textbook authors apply stance bundles to exert directive forces (Hyland, 2002) on readers in three ways, mainly at the cognitive level, (1) by directing readers’ attention towards the findings or facts that writers believe are important (to note that the, we see that the), (2) by guiding readers to new lines of argument (we assume that the, from the fact that the), or (3) by directly instructing readers on the next step of an analysis procedure (we will use the, we wish to determine). Compared with the broad coverage of five subcategories of stance bundles in the EEITC, the ESPTC has only one subcategory of stance bundles (impersonal epistemic stance bundles) (see Figure 6.1). The missing four subcategories of stance bundles in the ESP textbooks mean that ESP texts do not offer novices chances to observe and explore the interpersonal relationship between readers and writers constructed by stance bundles in EE textbooks. For example, the lack of personal epistemic stance bundles and attitudinal/modality stance bundles in the ESPTC means that ESP learners are not allowed the chance to observe how these bundles (we see that the, we find that the) direct the readers’ attention to important points of knowledge or to explore how these bundles (e.g. we will use the, we wish to determine) explicitly signal novices about writers’ intentions for the next stage of the analysis procedure.
The comparison of discourse organizers in the EEITC and the ESPTC Only a few four-word lexical bundles in both the EEITC and the ESPTC serve as discourse organizers by linking prior and coming discourse in original text. The scarcity of discourse organizers in the EE introductory textbooks is in line with the research finding by Biber et al. (2004) that discourse organizing bundles are extremely rare in textbooks. Unlike the situation of real-time production (e.g. classroom teaching), authors of EE introductory textbooks tend to have more diverse lexical choices and more time to organize the text in a cohesive and coherent way without depending on recurrent lexical patterns.
The comparison of referential bundles in the EEITC and the ESPTC Accounting for 78 percent of the lexical bundles in the EEITC (see Table 6.3), referential bundles help present norms of the field and construct a body of knowledge by describing referential characteristics relevant to propositional content or context of the research process. Referential
bundles identify important entities (is known as the, is determined by the), specify concrete/abstract attributes of variables or electrical components (the magnitude of the, with the result that), or make reference to time, location or text (at the end of, is connected to the, is of the form). Compared with the EEITC, the ESPTC lacks text deixis bundles and multi-functional reference bundles (see Figure 6.1). Text deixis bundles in the EEITC (is of the form, it follows that the) indicate interaction with readers and engage readers in textual acts (Hyland, 2002) by referring to certain texts or formulae. Furthermore, these bundles in the EEITC contribute to the construction of knowledge and the creation of a context in readers’ minds. Text deixis bundles link prior text with following text/formulas so that the new content is understood and ‘a context is created in the mind and signaled in the text in the process of its production’ (Widdowson, 2007, p. 43). In addition to text deixis bundles, multi-functional reference bundles are also absent in the ESPTC, which make both time and place references in the EEITC (e.g. at the end of). Not only does the ESPTC lack the two functional subcategories of referential bundles mentioned above, the rest of the subcategories show proportional distributions different from those in the EEITC, indicating that the two types of textbooks do not share the same focus on the presentation of referential information. In EE introductory textbooks, the relatively high percentages of quantity specification bundles (6.67 percent), text deixis bundles (4.76 percent) and place reference bundles (10.48 percent) (see Table 6.3) suggest textbook writers’ preference for describing quantity characteristics, making textual reference and depicting the physical connections among electrical components in circuit analysis. However the three subcategories of lexical bundles are either underrepresented or totally ignored in ESP textbooks. 
Instead the ESP textbooks seem to have an explicit interest in introducing new concepts/definitions and providing time information, evidenced by high percentages of identification/focus referential bundles (12.12 percent) and time referential bundles (6.06 percent) (see Table 6.3). The results of the present study indicate that ESP textbooks as a preparatory type of text for discipline novices misrepresent referential information in EE introductory textbooks. Consequently the ESP textbooks may misguide novices in terms of how referential bundles help construct the body of knowledge in target disciplines. As such, it may be advisable to supplement referential bundles and their pragmatic functions in target language use in ESP pedagogies for the purpose of increasing learners’ awareness of the presentation of referential information in target disciplines.
Conclusion
Referential bundles and stance bundles in EE introductory textbooks construct text with consistent pragmatic functions. Referential bundles in the textbooks specify referential information relevant to research processes and circuit analysis, facilitating the presentation of propositional facts and the construction of the body of knowledge. Stance bundles help textbook authors construct their role as discipline experts and engage readers as participants in intellectual tasks or as active observers of lines of argument. The comparison of lexical bundles and their pragmatic functions in the EEITC and the ESPTC has indicated that a gap in language use exists between EE introductory textbooks and ESP textbooks. Only one-third of the lexical bundles identified in EE introductory textbooks appear in the ESP textbooks, covering a much narrower scope of functional subcategories. Compared with introductory EE textbooks, four functional subcategories of stance bundles are missing in the ESP textbooks. As for the use of referential bundles, the ESP textbooks underrepresent quantity and spatial specifications but overemphasize referential information which is not central in target language use, such as the introduction of new concepts/definitions and the provision of time information. The gap in language use between ESP textbooks and introductory EE textbooks suggests that lexical bundles with pragmatic functions in the EEITC should be included in ESP pedagogies. Simply exposing undergraduates to discipline-specific texts does not necessarily mean that students will acquire the use of lexical bundles in target disciplines. For instance, Cortes (2004) found that third- and fourth-year undergraduate students in history and biology did not exhibit adequate use of referential bundles in their writing. One student repeatedly mentioned the same exact date in a short stretch of text without adopting temporal referential bundles (e.g. at the end of, at the beginning of).
As a result, it seems reasonable to include lexical bundles in pedagogical materials, which would not only enrich the description of target language use but also increase novices’ awareness of the patterns of use in the discipline. Concordance lines of referential bundles can serve as examples of describing details of research procedures and help students deal with discipline-specific tasks such as explaining diagrams and reporting their laboratory procedures. Introducing stance bundles to ESP classrooms can allow students to observe how EE introductory textbook authors use these bundles to signal key knowledge points and express their stances.
The present study is limited by the small size of the ESPTC and by its restriction to first-year EE undergraduates as the group of beneficiaries. Future research could build a larger corpus by including more ESP textbooks and test whether the conclusions of this study hold for it. With the research focus on language use in entry-level EE textbooks, the findings of this study are mainly of concern for teaching EE students in their first-year undergraduate programmes. Future studies may consider setting up multi-discipline corpora with materials drawn from introductory textbooks of other disciplines so that novices can have the opportunity to learn how lexical bundles in their disciplines construct the body of knowledge and indicate the authors’ stances towards propositional content.
Acknowledgements I would like to thank Dr. Ellen Cray, Dr. David Wood and Dr. Tom Cobb for bringing valuable insights into key issues of this research. And special thanks to David Cooper and Rachelle Freake for cogent comments on drafts.
References
Altenberg, B. (1998). On the phraseology of spoken English: the evidence of recurrent word combinations. In A. P. Cowie (Ed.), Phraseology: Theory, analysis, and applications (pp. 101–22). Clarendon: Oxford University Press.
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Longman grammar of spoken and written English. Essex: Pearson Education Limited.
Biber, D., Conrad, S., & Cortes, V. (2003). Lexical bundles in speech and writing: an initial taxonomy. In A. Wilson, P. Rayson & T. McEnery (Eds.), Corpus linguistics by the Lune: a festschrift for Geoffrey Leech (pp. 71–92). Frankfurt/Main: Peter Lang.
Biber, D., Conrad, S., & Cortes, V. (2004). If you look at . . . : lexical bundles in university teaching and textbooks. Applied Linguistics, 25 (3), 371–405.
Carson, J. G. (2001). A task analysis of reading and writing in academic contexts. In D. Belcher & A. Hirvela (Eds.), Linking literacies: perspectives on L2 reading-writing connections (pp. 48–83). Ann Arbor, MI: University of Michigan Press.
Conrad, S. M. (1996). Investigating academic texts with corpus-based techniques: An example from biology. Linguistics and Education, 8, 299–326.
Cortes, V. (2004). Lexical bundles in published and student disciplinary writing: Examples from history and biology. English for Specific Purposes, 23, 397–423.
Glendinning, E. H., & Glendinning, N. (1995). Oxford English for electrical and mechanical engineering. Oxford: Oxford University Press.
Glendinning, E. H., & McEwan, J. (1993). Oxford English for electronics. Oxford: Oxford University Press.
Hyland, K. (2002). Directives: argument and engagement in academic writing. Applied Linguistics, 23 (2), 215–39.
Hyland, K. (2008). As can be seen: Lexical bundles and disciplinary variation. English for Specific Purposes, 27, 4–21.
Irwin, J. D. (2002). Basic engineering circuit analysis (7th ed.). New York: John Wiley & Sons.
McDonough, J. (1998). Survey review: recent materials for the teaching of ESP. ELT Journal, 52 (2), 156–65.
Nattinger, J. R., & DeCarrico, J. S. (1992). Lexical phrases and language teaching. New York: Oxford University Press.
Oakey, D. (2002). Formulaic language in English academic writing: a corpus-based study of the formal and functional variation of a lexical phrase in different academic disciplines. In R. Reppen, S. M. Fitzmaurice, & D. Biber (Eds.), Using corpora to explore linguistic variation (pp. 111–29). Amsterdam/Philadelphia: John Benjamins.
O’Keeffe, A., McCarthy, M., & Carter, R. (2007). From corpus to classroom: Language use and language teaching. Cambridge: Cambridge University Press.
Olson, D. R. (1989). On the language and authority of textbooks. In S. de Castell, A. Luke, & C. Luke (Eds.), Language, authority, and criticism: Readings on the school textbook (pp. 233–44). London: The Falmer Press.
Partington, A. (1998). Patterns and meanings. Amsterdam/Philadelphia: John Benjamins.
Reppen, R. (2004). Academic language: an exploration of university classroom and textbook language. In U. Connor & T. A. Upton (Eds.), Discourse in the professions (pp. 65–86). Amsterdam/Philadelphia: John Benjamins.
Scott, M. (2004). Oxford WordSmith Tools: Version 4.0. Retrieved from http://www.lexically.net/downloads/version4/wordsmith.pdf
Sedra, A. S., & Smith, K. C. (2004). Microelectronic circuits (5th ed.). Oxford: Oxford University Press.
Widdowson, H. G. (2007). Discourse analysis. Oxford: Oxford University Press.
Wray, A. (2002). Formulaic language and the lexicon. Cambridge: Cambridge University Press.
Wray, A., & Perkins, M. A. (2000). The function of formulaic language: an integrated model. Language and Communication, 20, 1–28.
Part 2
Identification and Psycholinguistic Processing of Formulaic Language
Chapter 7
Formulaicity in Code-switching: Criteria for Identifying Formulaic Sequences
Kazuhiko Namba
Kyoto Sangyo University
This chapter examines the possibility that code-switched units are formulaic sequences (a term defined presently). This idea comes from a paper by Backus (2003), in which he proposes that code-switching (CS) always involves a ‘unit’ that is produced holistically. Backus’ claim contradicts the position taken by Azuma (1996), that CS entails complete syntactic constituents. These two proposals will be compared and evaluated, with particular attention to examples in their own work and in a dataset of bilingual children, which can help sort out the different predictions that their positions arrive at. In short, are there examples that can reasonably be defined as formulaic sequences but which cannot be viewed as complete syntactic constituents? The reverse is less likely, since there is no reason why a syntactic constituent should not also be a formulaic sequence. A major part of the present chapter will be taken up with resolving a procedural problem: namely, how to identify a formulaic sequence. After reviewing the basic claims of Backus (2003) and Azuma (1996), the definition of formulaic sequence is considered and a means of identifying formulaic sequences in text is presented and tried out on examples from monolingual and bilingual data.
Does Code-switching Entail a Syntactic Constituent or a Formulaic Sequence?
Embedded language insertions1 in CS phenomena are a valuable test case for theories of how language is processed. Wray’s Needs Only Analysis model (2008, p. 17) proposes that virtually any kind of wordstring, continuous or a frame with gaps, can be a single lexical unit if (1) it has a reliable meaning as it stands, and (2) the input experienced by the speaker, and
his/her output needs, have not required it to be broken down further. Clearly, this approach predicts that a ‘unit’ need not be a recognizable syntactic constituent. In contrast, standard syntactic models would assume that if a ‘unit’ is to be taken from another language, and that unit is more than a single word, it will be a syntactic constituent. The two sides of the argument are taken up by Backus (2003) and Azuma (1996) respectively.

Azuma examined CS examples of spontaneous speech from the literature in order to verify his hypothesis that ‘bilinguals switch at syntactically definable constituent boundaries’ (1996, p. 397). The naturalistic data suggest that switching occurs at the boundaries of grammatical constituents; however, such data have their limits when it comes to detecting non-constituent switching, because the corpus might not have been large enough for researchers to encounter such switches when gathering data (1996, p. 403). Subsequently Azuma conducted an experiment in which Japanese-English bilinguals were asked to switch language in response to randomly generated tones. The proportion of subjects who continued their speech in the same language for at least one word after the tone was quite high (68 per cent), and only 5 per cent of the time was the material preceding the switch judged not to be a syntactic constituent (Azuma, 1996, p. 406). It should be noted that this experiment can elicit only alternational CS, not insertional CS.2 The subjects were asked to change language when they heard the tone, which is alternational CS by definition.

Backus (2003) argues that inserted EL items are not always syntactic constituents but are ‘lexical units’, which are ‘any recurrent combinations of two or more morphemes that together exhibit idiomatic meaning’ (2003, p. 90). He proposes the ‘Unit Hypothesis’, saying that ‘Every multimorphemic EL insertion is a unit, inserted into a ML clausal frame’ (2003, p. 91). The Unit Hypothesis is tested using his data from Turkish-Dutch CS, and the use of lexical units as inserted EL items is shown. The tested EL items in Backus’ paper (2003) are basically all syntactic constituents, and therefore the role of formulaicity is not independently verified. However, Backus finds that the Unit Hypothesis can apply to many cases of alternational CS.

Our aim was to find EL items which are not syntactic constituents but are formulaic sequences. As it happens, all the examples of insertional CS in both Azuma’s3 and Backus’ papers map onto syntactic constituents, so there is no scope there for testing Backus’ hypothesis. However, it turned out that in both papers several non-syntactic constituents were observed in alternational CS. If we find non-syntactic constituents, the next step is to examine the formulaicity of the units. The inherent difficulty here lies in the identification of formulaic sequences, whereas that of syntactic constituents is more
straightforward if we use already established syntactic theories. In order to identify formulaic sequences, we first need to define them.
Formulaicity in Language Processing
In monolinguals’ language processing, syntactic structure and formulaicity can be described as two processing systems. The Chomskyan view of language processing (e.g. Radford, 2004, p. 5) entails an analytic approach in which morphemes and words are combined into phrases and sentences by grammatical rules for output, while input is broken down into words and morphemes. Sinclair (1991, p. 109) calls this analytic processing the ‘open choice principle’. But Sinclair also identifies another type of processing, the ‘idiom principle’, in which ‘a language user has available to him or her a large number of semi-preconstructed phrases that constitute single choices, even though they might appear to be analysable into segments’ (Sinclair, 1991, p. 110). In essence, Sinclair’s ‘single choices’ are the same as what Wray (2002) terms ‘formulaic sequences’. Both Sinclair and Wray propose that the production of these units is holistic, and therefore bypasses the analytic process of construction that occurs with the open choice principle.

Wray (2002, p. 14) notes the advantages of having a dual language processing system. Whereas the analytic system has ‘flexibility for novel expression and the interpretation of novel and unexpected input’ (2002, p. 18), the holistic system contributes to ‘the reduction of processing effort’ (2002, p. 18). She argues that for native speakers it is ‘the accessing of large prefabricated chunks, and not the formulation and analysis of novel strings, that predominates in normal language processing’ (2002, p. 101). Wray (2002) reviews the literature comprehensively and defines formulaic sequences as follows4:

a sequence, continuous or discontinuous, of words or other elements, which is, or appears to be, prefabricated: that is, stored and retrieved whole from memory at the time of use, rather than being subject to generation or analysis by the language grammar (p. 9).

This definition aims to be as inclusive as possible but ‘is, or appears to be’ can be a problem in identifying a unit in a different language. A CS unit is strongly marked out as different, and this might make it ‘appear to be’ formulaic. Another problem is how we are to tell whether a formulaic sequence is prefabricated with sufficient confidence to use for testing the hypothesis. Here the issue is one of differentiating between definition and
identification. The purpose of a definition is to draw a box around what we are going to include in our discussion, when we start trying to figure out what is occurring. It is therefore not suitable for identifying examples. We need to employ this definition as a starting point and establish some criteria by which formulaic sequences can be independently detected.
Eleven Diagnostic Criteria for Identifying Formulaic Sequences
Wray (2002, Chapter 2) identifies four major characteristics in the existing descriptions of formulaic sequences in the literature: ‘form’, ‘meaning’, ‘function’, and ‘provenance’. The four characteristics are not mutually exclusive, but overlap. Some wordstrings which are not marked in relation to ‘form’ can be formulaic from other perspectives.5 Wray and Namba (2003) offer eleven diagnostic criteria6 (see Figure 7.1) which should capture the multifaceted features of formulaic sequences. Table 7.1 shows which of the four characteristics criteria A to K cover.

Table 7.1 Eleven Criteria Coverage of the Four Characteristics of Formulaic Sequences
Characteristics (columns): Form, Meaning, Function, Provenance
Criteria (rows):
A  Grammatical irregularity
B  Semantic opacity
C  Situation/register specificity
D  Pragmatic function
E  Idiolect
F  Performance indication
G  Grammatical/lexical indication
H  Previous encounter
I  Derivation
J  Inappropriate application
K  Mismatch with maturation
[The check marks showing which characteristics each criterion covers are not recoverable from this copy.]

The criteria support a researcher’s intuitive judgement rather than being a stand-alone check-list. When a researcher judges a wordstring as formulaic, the criteria can be employed to explain why he or she does so, although disagreement can occur between judgements by different researchers. Approaches which use intuition have weaknesses since they are subjective. Wray (2002, Chapter 2) reviewed means of identifying formulaic sequences, for example, intuition, corpus research and phonological analysis, and found that no single criterion identifies formulaic sequences in a consistent way. The difficulty lies in distinguishing them from novel strings, because they can be grammatically regular and semantically transparent. In order to address these problems, the 11 diagnostic criteria (see Figure 7.1) employ intuition as a starting point, and each criterion plays a role in establishing reliable justification.
A: By my judgement there is something grammatically unusual about this wordstring.
B: By my judgement, part or all of the wordstring lacks semantic transparency.
C: By my judgement, this wordstring is associated with a specific situation and/or register.
D: By my judgement, the wordstring as a whole performs a function in communication or discourse other than, or in addition to, conveying the meaning of the words themselves.
E: By my judgement, this precise formulation is the one most commonly used by this speaker/writer when conveying this idea.
F: By my judgement, the speaker/writer has accompanied this wordstring with an action, use of punctuation, or phonological pattern that gives it special status as a unit, and/or is repeating something s/he has just heard or read.
G: By my judgement, the speaker/writer, or someone else has marked this wordstring grammatically or lexically in a way that gives it special status as a unit.
H: By my judgement, based on direct evidence or my intuition, there is a greater-than-chance-level probability that the speaker/writer will have encountered this precise formulation before, from other people.
I: By my judgement, although this wordstring is novel, it is a clear derivation, deliberate or otherwise, of something that can be demonstrated to be formulaic in its own right.
J: By my judgement, this wordstring is formulaic, but it has been unintentionally applied inappropriately.
K: By my judgement, this wordstring contains linguistic material that is too sophisticated, or not sophisticated enough, to match the speaker’s general grammatical and lexical competence.

Figure 7.1 The Eleven Diagnostic Criteria for Identification of Formulaic Sequences (Wray and Namba, 2003)
In addition to the 11 criteria, the fixedness of formulaic sequences should be noted. Some formulaic sequences are highly fixed or frozen and do not show much variability.7 Others have one or more fillable slots. The slots are often filled by open class items, in many cases referential noun phrases, and we need to keep an eye out for formulaic sequences with gaps, since they will contain material that is not in itself formulaic.

Before the criteria can be applied to the data, it must be noted that not all of them can or should be applied to all data types. In order to cater to different types of data,8 guidelines are set as follows (see Table 7.2). The judgement is made on a five-point scale, that is, ‘Strongly agree’, ‘Agree’, ‘Not applicable’, ‘Disagree’ and ‘Strongly disagree’. Some examples might end up with more ‘Agree’s/‘Strongly agree’s than others. We cannot say that one example is more formulaic than another on the basis of the count of ‘SA’s and ‘A’s. The most we can say is that there are more individual indicators of formulaicity for one example than for the other. Formulaic status is a question of storage and access, to which tests of form, meaning and function can only give us partial access.
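The scoring scheme just described can be sketched in code. The following Python fragment is only an illustration and is not part of Wray and Namba’s (2003) procedure: the names `Judgement`, `CRITERIA`, `POSITIVE` and `indicators` are hypothetical, and the example profile is invented. It records one judgement per criterion and lists the criteria scored ‘Agree’ or ‘Strongly agree’.

```python
from enum import Enum


class Judgement(Enum):
    """The five-point scale applied to each criterion."""
    STRONGLY_AGREE = "Strongly agree"
    AGREE = "Agree"
    NOT_APPLICABLE = "Not applicable"
    DISAGREE = "Disagree"
    STRONGLY_DISAGREE = "Strongly disagree"


# Criterion letters with the short names used in the chapter's tables.
CRITERIA = {
    "A": "Grammatical irregularity",
    "B": "Semantic opacity",
    "C": "Situation/register specificity",
    "D": "Pragmatic function",
    "E": "Idiolect",
    "F": "Performance indication",
    "G": "Grammatical/lexical indication",
    "H": "Previous encounter",
    "I": "Derivation",
    "J": "Inappropriate application",
    "K": "Mismatch with maturation",
}

# Judgements that count as an individual indicator of formulaicity.
POSITIVE = {Judgement.STRONGLY_AGREE, Judgement.AGREE}


def indicators(profile):
    """Return the letters of the criteria judged 'Agree' or 'Strongly agree'."""
    unknown = set(profile) - set(CRITERIA)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    return sorted(letter for letter, j in profile.items() if j in POSITIVE)


# Invented profile for illustration only; it is not data from the study.
profile = {
    "B": Judgement.STRONGLY_AGREE,
    "D": Judgement.AGREE,
    "H": Judgement.DISAGREE,
    "J": Judgement.NOT_APPLICABLE,
}
print(indicators(profile))
```

The function returns the marked criteria rather than a single score on purpose: the list shows which individual indicators were found for a wordstring, and its length should not be read as a ranking of how formulaic one wordstring is relative to another.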
Table 7.2 Application of Criteria to Different Data Types (based on Wray and Namba, 2003, p. 28)
Data-type labels: Error Free, Error in Form, Error in Usage, Original, Original Correct, Appropriate
Criteria (rows):
A  Grammatical irregularity
B  Semantic opacity
C  Situation/register specificity
D  Pragmatic function
E  Idiolect
F  Performance indication
G  Grammatical/lexical indication
H  Previous encounter
I  Derivation
J  Inappropriate application
K  Mismatch with maturation
[The check marks showing which criteria apply to which data type are not recoverable from this copy.]
Testing the Eleven Criteria
Applying the criteria to formulaic sequences in the research literature is a good test. Since they are usually classic citation examples, contextual information is not available, but if we can judge from our intuition, at least ‘Agree’ will be given. They are not error forms, therefore criterion J is not applied. The idiom ‘kick the bucket’ (i.e. in its meaning of ‘die’) is noncompositional and the meaning is totally opaque from its parts. Therefore it is scored as ‘Strongly agree’ on criterion B (see Table 7.3). The institutionalized routine ‘Happy birthday!’ (Table 7.4) is said on a specific day (C) and has the function of congratulating the addressee on their birthday (D). It is often said with a gesture, facial expression or special prosodic features, or indeed is often sung (F). Even without evidence, one can assume that this wordstring is learned as a whole from other people, probably family members (E).

Next we will test the criteria on a monolingual’s speech data in context. Example (1) occurs in a conversation on a busy roadside on the outskirts of York. A motorist asks a pedestrian how to get to the city centre. Here is part of the pedestrian’s explanation:

(1) if you just carry on:: down this road this is Heslington Lane:: (Langford, 1994, p. 158)9
Table 7.3 Formulaicity in ‘kick the bucket’
Criteria                                  Judgement
A  Grammatical irregularity               SD
B  Semantic opacity                       SA
C  Situation/register specificity         NA
D  Pragmatic function                     SD
E  Idiolect                               NA
F  Performance indication                 NA
G  Grammatical/lexical indication         SD
H  Previous encounter                     SA
I  Derivation                             SD
J  Inappropriate application              NA
K  Mismatch with maturation               NA

Table 7.4 Formulaicity in ‘Happy birthday!’
Criteria                                  Judgement
A  Grammatical irregularity               A
B  Semantic opacity                       SD
C  Situation/register specificity         SA
D  Pragmatic function                     SA
E  Idiolect                               SA
F  Performance indication                 SA
G  Grammatical/lexical indication         SD
H  Previous encounter                     SA
I  Derivation                             SD
J  Inappropriate application              NA
K  Mismatch with maturation               NA
A wordstring consisting of ‘carry on’ and an adverbial10 (Table 7.5) is observed four times in the same speaker’s speech, which indicates ‘Strongly agree’ on idiolect (E). It is specific to the situation of giving directions – ‘Strongly agree’ on C. With regard to criterion B, the meaning of ‘carry’ does not directly correspond to the meaning of the whole wordstring, that is, ‘continue’; however, ‘on’ does correspond to it. Therefore ‘Agree’ is given to this criterion. Without the contextual information, criteria C and E would not be scored ‘Strongly agree’.
Table 7.5 Formulaicity in ‘carry on + [Adverbial]’
Criteria                                  Judgement
A  Grammatical irregularity               SD
B  Semantic opacity                       A
C  Situation/register specificity         SA
D  Pragmatic function                     NA
E  Idiolect                               SA
F  Performance indication                 A
G  Grammatical/lexical indication         NA
H  Previous encounter                     A
I  Derivation                             SD
J  Inappropriate application              NA
K  Mismatch with maturation               NA
Lastly we will discuss the key examples from Azuma and Backus. Our aim is to find examples which are not grammatical constituents but are formulaic. Example (2) is a counter-example to Azuma’s hypothesis because switching does not occur between syntactic constituents. The switching occurs after a couple of words (the arrow indicates the tone).

(2) ↓ and then he’ll go to otearai ni ikimasu
                           toilet  to go
    {and then he’ll go to the toilet} (Azuma, 1996, p. 412)
The wordstring ‘and then NP will go to’ does not have evidence for formulaicity in most criteria (Table 7.6). However, it should be noted that Backus (2003, p. 115) cited this particular example of Azuma’s as a ‘construction’ and therefore a ‘lexical unit’, which means that he considered it to be formulaic.

Table 7.6 Formulaicity in ‘and then NP will go to’
Criteria                                  Judgement
A  Grammatical irregularity               SD
B  Semantic opacity                       SD
C  Situation/register specificity         SD
D  Pragmatic function                     SD
E  Idiolect                               A
F  Performance indication                 NA
G  Grammatical/lexical indication         NA
H  Previous encounter                     NA
I  Derivation                             SD
J  Inappropriate application              SD
K  Mismatch with maturation               SD

Next we will examine Backus’ example of a ‘construction’ (see example 3).

(3) die is de slechte persoon, ondan sonar coğunlunkla yapıncık, mesela altı kişi yapıncık o zaman artık o ja normal görünür, ama o ilk kişi DIE IS GEWOON, ja, en berbatı
    {she’s the bad person, and then the majority will do it, for example if six people will do it, from then on it’s seen as ‘oh sure,’ as normal, but that first person, SHE’S JUST, well, the worst} (Backus, 2003, p. 118)
    (Turkish in normal font, Dutch in italics, lexical units are capitalized)

According to Backus (2003, p. 119) the Dutch adverb gewoon has a basic meaning of ‘normal’ but in this construction its meaning is ‘just’ – ‘Agree’ on semantic opacity (B) (see Table 7.7). It is used ‘if one wishes to make an emphatic statement about someone, stating a quality that is either surprising or characteristic’ (2003, p. 119). At least ‘Agree’ can be given to D. There isn’t direct evidence but this construction might have been acquired from someone else and might be used again – ‘Agree’ on E and H.

Table 7.7 Formulaicity in ‘die is gewoon’
Criteria                                  Judgement
A  Grammatical irregularity               SD
B  Semantic opacity                       A
C  Situation/register specificity         SD
D  Pragmatic function                     A
E  Idiolect                               A
F  Performance indication                 NA
G  Grammatical/lexical indication         SD
H  Previous encounter                     A
I  Derivation                             SD
J  Inappropriate application              NA
K  Mismatch with maturation               NA
Modification of the Criteria
In the last section, three different groups of wordstrings were tested to verify the 11 criteria. Whereas the criteria work well for the monolingual examples, they don’t detect strong formulaicity in the bilingual examples of non-syntactic constituents from Azuma’s and Backus’ papers. In this section we will assess the 11 criteria and propose solutions to problems. The analysis has not been exhaustive but the result of this small examination shows that formulaic sequences are multi-faceted phenomena. Out of the 11 criteria, criterion B ‘semantic opacity’ and D ‘pragmatic function’ seem to be strong ones. Even when other criteria are not marked, marking
on either of these two alone can be evidence for formulaicity. Criteria C, E, F and H can be common if more contextual information is available. Criterion C can be applied if the situation or register is specific, such as a dialogue about finding directions. Criteria E (idiolect) and H (previous encounter) can be judged by a researcher’s intuition and at least ‘Agree’ can be scored. When direct evidence is available ‘Strongly agree’ can be given. With criterion F (performance indication), contextual information is crucial. Criteria E, H or F alone cannot be strong indicators of formulaicity but they will support other criteria. Formulaic sequences marked by criteria A, I, J, K have not been encountered, but this might be attributed to the limited number of samples we tested.

Some gaps in the criteria have been noticed. In the literature ‘genre’ has been mentioned: Gläser (1998, p. 143) examines the use of formulaic sequences in a variety of genres from popular science articles to literary texts and finds specific roles of formulaic sequences in specific genres, for example, in textbooks they are employed to ‘enhance the intelligibility and memorability of a text’. Perhaps we should include ‘genre’ in criterion C. It will be modified as follows:

C: By my judgement, this wordstring is associated with a specific situation, register and/or genre.

Biber, Johansson, Leech, Conrad and Finegan (1999, p. 989) point out that idioms such as ‘kick the bucket’ are used occasionally in fiction but rarely in other genres. Therefore ‘kick the bucket’ is given ‘Agree’ on criterion C (it was previously NA – see Table 7.3).

The non-syntactic constituents observed in alternational CS in Azuma’s and Backus’ papers are good examples with which to verify formulaicity independently. However, the test doesn’t show them to be strongly formulaic. Pawley and Syder’s (1983, p. 210) ‘lexicalized sentence stems’ are similar to ‘constructions’. An example of a sentence stem is ‘NP be-TENSE sorry to keep-TENSE you waiting’. An actual form can be ‘I’m sorry to have kept you waiting’, for example, which looks like a novel sentence but there is an underlying frame that is formulaic. In the current criteria, there is no way to ensure that this sort of example is captured, and we need another criterion to capture underlying frames.

L: By my judgement, there is an underlying frame and one or more gaps in this wordstring. The frame is formulaic and the gaps can be filled with any lexical items.
This criterion alone is not strong enough to verify formulaicity. If other criteria, such as pragmatic function (D) or semantic opacity (B), are marked, this criterion will be more robust. The example ‘NP be-TENSE sorry to keep-TENSE you waiting’ is analysed with the modified criteria in Table 7.8. The examples in the dialogue about directions (example 1) can also be scored on this criterion. The wordstring ‘carry on + [ADV] + [Place]’ is given a score of ‘Strongly agree’ on criterion L. Azuma’s example (2) can be given a score of ‘Agree’ on criterion L. In this example, ‘[NP] go to [Destination]’ can be a formula. The case is weakened by the fact that ‘the toilet’ is not a destination and ‘go to the toilet’ is formulaic11; however, the Japanese part otearai ni ikimasu ‘go to the toilet’ is formulaic, which indicates that CS occurs at the boundary of a formulaic sequence. The switch can be either before or after a formula. Backus’ example (3) die is gewoon [ . . .] ‘she’s just’ will be given ‘Strongly agree’ since it has a pragmatic function which supports formulaicity strongly. This criterion seems to be a useful and potentially robust one, so it will be adopted in the remainder of the analysis. While alternational CS is seen as a good place to explore such ideas, these examples are too few and too problematic (as Azuma and Backus found). We will therefore examine alternational CS in another dataset to arrive at a judgement on the question of whether CS occurs at formulaic or constituent boundaries.
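One way to make the relative weight of the criteria explicit is sketched below. This is an illustration of the reasoning in this section, not a procedure proposed in the chapter: the function name `summarise_evidence`, the three group labels and the decision to leave the remaining criteria unweighted are all assumptions. It treats B and D as criteria that can stand alone, and E, F, H and L as criteria that only support others.

```python
# Grouping based on the discussion in this section; the set names and
# the three-way split are illustrative assumptions.
STAND_ALONE = {"B", "D"}           # semantic opacity, pragmatic function
SUPPORTING = {"E", "F", "H", "L"}  # informative only alongside other criteria


def summarise_evidence(marked):
    """Group criteria scored 'Agree' or 'Strongly agree' by their weight."""
    marked = set(marked)
    return {
        "stand_alone": sorted(marked & STAND_ALONE),
        "supporting": sorted(marked & SUPPORTING),
        "unweighted": sorted(marked - STAND_ALONE - SUPPORTING),
    }


# Criteria marked for 'die is gewoon' in the discussion of example (3).
print(summarise_evidence({"B", "D", "E", "H", "L"}))
```

The sketch deliberately returns groups rather than a verdict, since the section argues that the criteria support a researcher’s judgement and do not replace it.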
Table 7.8 Formulaicity in ‘[NP] be[-TENSE] sorry to keep [-TENSE] you waiting’
Criteria                                  Judgement
A  Grammatical irregularity               SD
B  Semantic opacity                       SD
C  Situation/register/genre specificity   SD
D  Pragmatic function                     A
E  Idiolect                               A
F  Performance indication                 NA
G  Grammatical/lexical indication         SD
H  Previous encounter                     A
I  Derivation                             SD
J  Inappropriate application              NA
K  Mismatch with maturation               NA
L  Underlying frame                       SA
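The idea of an underlying frame with gaps can be illustrated with a pattern match. The Python sketch below is a rough approximation of the stem ‘NP be-TENSE sorry to keep-TENSE you waiting’, offered only as an illustration: the regular expression, the short lists of verb forms and the function name `match_stem` are assumptions, and a real analysis would need a much fuller treatment of tense and of what counts as an NP.

```python
import re

# Hypothetical approximation of the sentence stem; the verb-form lists
# are illustrative and far from exhaustive.
STEM = re.compile(
    r"^(?P<np>.+?)"                                  # NP slot (any filler)
    r"(?:\s+(?:am|is|are|was|were)|['’](?:m|s|re))"  # be-TENSE, full or contracted
    r"\s+sorry\s+to\s+"
    r"(?:keep|have\s+kept)"                          # keep-TENSE
    r"\s+you\s+waiting\b",
    re.IGNORECASE,
)


def match_stem(utterance):
    """Return the filler of the NP slot if the utterance fits the stem, else None."""
    m = STEM.match(utterance.strip())
    return m.group("np").strip() if m else None


print(match_stem("I'm sorry to have kept you waiting"))
print(match_stem("I'm sorry to hear that"))
```

A surface match of this kind only shows that a wordstring is compatible with the frame. Whether the frame is stored and retrieved as a unit is still a matter for the researcher’s judgement on criterion L.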
Data Analysis of Alternational CS
The data of alternational CS are taken from a longitudinal case study of two bilingual and bicultural children (Namba, 2008). The two siblings’ interactions were recorded in a naturalistic way while they were playing with toys and games.
Preliminary analysis with insertional CS
Before analysing alternational CS patterns, we will look at a more straightforward formulaic sequence and examine whether CS occurs outside it rather than within it. The following multi-word item (example 4), an N’, that is, a combination of an adjective + a noun, appears to be formulaic.

(4) sore wa  naughty boy -na  koto  da
    that TOP             INFL thing COP
    {That is something a naughty boy would do.}
This is an extract from an interaction between one of the bilingual children and his father. The child is talking about his brother’s behaviour. The formulaicity of ‘naughty boy’ is checked as follows (see Table 7.9).
Table 7.9 Formulaicity in ‘naughty boy’
Criteria                                  Judgement
A  Grammatical irregularity               A
B  Semantic opacity                       SD
C  Situation/register/genre specificity   N
D  Pragmatic function                     A
E  Idiolect                               SA
F  Performance indication                 N
G  Grammatical/lexical indication         SA
H  Previous encounter                     SA
I  Derivation                             N
J  Inappropriate application              N
K  Mismatch with maturation               N
L  Underlying frame                       N
The wordstring is not semantically irregular – ‘Strongly disagree’ on B. Grammatically, the occurrence of NP without a determiner, as an N’, is irregular – ‘Agree’ on A. It is placed in the slot for Japanese nominal adjectives or loan adjectives – ‘Strongly agree’ on grammatical indication (G). The child must have heard this wordstring when he or his brother was scolded by his mother – with direct evidence of previous encounter, ‘Strongly agree’ on H. He employs this wordstring frequently – with direct evidence, ‘Strongly agree’ on E. The wordstring ‘naughty boy’ itself has a function of reproaching – ‘Agree’ on D. In theory, the child could have inserted just the English word ‘naughty’, which would have fit better into the slot in the Japanese ML. Yet he used ‘naughty boy’ instead, a formulaic expression, as established above. In this example, CS doesn’t occur inside the formulaic sequence but outside of it. This example is fairly straightforward but when constructions have a gap, for example, ‘it’s [ . . .]’, the story is more complicated. We will analyse such a pattern in the following section.
‘it’s [Jp-CLAUSE]’ construction
First we will look at the ‘it’s [Jp-CLAUSE]’ construction12. If Japanese NPs or adjectives are inserted into the slot of the ‘it’s [ . . .]’ construction, it will be called insertional CS. However, when a Japanese clause follows ‘it’s’, the switching point doesn’t correspond with the boundary of the syntactic constituent and alternational CS appears to be a reasonable account. We will examine the formulaicity of the ‘it’s [Jp-CLAUSE]’ in alternational CS here. All the following examples (5–10) occurred when the two siblings were playing with monster and super-hero figures.
(5) If it’s karaa taimaa ga kuro then it’s shinderu E?
    energy indicator NOM black dead
    {If the energy indicator is off then it’s dead E(name).}

(6) now it’s (.) taiyoo wa nai
    sun TOP doesn’t exist
    {Now, the sun doesn’t exist.}

(7) then it’s (.) me o hiraiteru
    eye ACC open
    {then the eyes are open}

(8) but it’s (.) ashita dat- tara dekiru
    tomorrow COP if can-do
    {but I can do it tomorrow}

(9) because it’s Gaochibi toka Gao-rainosu wa nige- ta kara right?
    PropN or PropN TOP escape PAST because
    {because Gaochibi (a monster) or Gaorainosu, for example, escaped right?}

(10) it’s minna waruku natta except for sono futari right
     everyone bad become PAST those two
     {everyone became bad except these two, right?}

The roles of the English and Japanese parts are fairly clear. All the examples above show that the Japanese parts are in charge of describing the toy’s movements or states, that is, they express propositional meanings. On the other hand the English parts appear to function as discourse markers. The formulaicity of the construction ‘it’s [Jp-CLAUSE]’ is examined as follows (see Table 7.10). The second clause of example (5) is not an example of ‘it’s [Jp-CLAUSE]’ because ‘it’ refers to the toy, therefore this is not included in this analysis. The wordstring ‘it’s’ itself is grammatically fine but the whole frame ‘it’s [CLAUSE]’ is grammatically irregular – ‘Agree’ on criterion A. This wordstring itself means ‘the state of affairs’, which is different from ‘it’s’ in normal usage such as the second ‘it’s’ in example (5). This is semantically opaque – ‘Strongly agree’ on criterion B. This pattern is frequently observed (68 tokens) in our dataset – ‘Strongly agree’ on E. The wordstring ‘it’s’ functions as a filler between the conjunctions, for example, ‘if’ (5), and the following Japanese clauses. This filler appears to be buying time for the Japanese clause to occur. The corresponding pattern ‘it’s [En-CLAUSE]’ hasn’t been observed in our dataset. Therefore ‘it’s’ appears to function as a CS indicator – ‘Strongly agree’ on pragmatic/discourse function (D). The examples show that this English frame ‘it’s [. . .]’ is combined with the conjunctions and functions as a pragmatic/discourse frame – ‘Strongly agree’ on L. The result of the diagnostics supports the idea that the ‘it’s [Jp-CLAUSE]’ construction is formulaic. Although identifying CS in this frame appears complicated, a possible answer is that ‘it’s [Jp-CLAUSE]’ is formulaic and CS at the boundary between ‘it’s’ and the clause is compulsory because the CS instruction is actually built into the substance of the formula – that is, the formula is a device for effecting CS.

Table 7.10 Formulaicity in ‘it’s [Jp-CLAUSE]’
Criteria                                  Judgement
A  Grammatical irregularity               A
B  Semantic opacity                       SA
C  Situation/register/genre specificity   N
D  Pragmatic function                     SA
E  Idiolect                               SA
F  Performance indication                 N
G  Grammatical/lexical indication         N
H  Previous encounter                     A
I  Derivation                             A
J  Inappropriate application              N
K  Mismatch with maturation               N
L  Underlying frame                       SA
The portmanteau structure
In this section we will look at another pattern of alternational CS, the portmanteau structure, in which the English part and the Japanese part have the same meaning and occur as if they were mirror images (Nishimura, 1997, p. 103). The following example (11) appears to be a variation of the ‘it’s [. . .]’ construction; however, it turns out to be an instance of the portmanteau structure.

(11) It’s really it was tadano jaakuna ishi da-tta
just evil will COP-PAST
{It was just an evil will}
Table 7.11 Formulaicity in ‘it was [Jp-CLAUSE]’
Criteria | Judgement
A Grammatical irregularity | SD
B Semantic opacity | SD
C Situation/register/genre specificity | SD
D Pragmatic function | SD
E Idiolect | N
F Performance indication | N
G Grammatical/lexical indication | N
H Previous encounter | N
I Derivation | N
J Inappropriate application | SD
K Mismatch with maturation | SD
L Underlying frame | N
The switching point does not match a syntactic constituent boundary, because the Japanese VP occupies the slot for an English NP or AP, so alternational CS is a reasonable explanation. However, this pattern differs from the ‘it’s [Jp-CLAUSE]’ pattern because ‘it’ functions as a pronoun and refers to a concrete thing (a monster). The Japanese past tense copula da-tta has the same meaning as the English past tense copula ‘was’. The English part and the Japanese part appear to be hinged on the Japanese NP tadano jaakuna ishi. This pattern is one example of the portmanteau structure.

We will examine formulaicity in this example. First, the formulaicity of ‘it was [Jp-CLAUSE]’ is verified (see Table 7.11). The grammatical form and meaning of ‘it’ and ‘was’ are clear – ‘Strongly disagree’ on A and B. We cannot find any ‘Strongly agree’ or ‘Agree’ on the other criteria. There is no strong evidence that this wordstring is formulaic. If we turn to the ‘hinge’ part tadano jaakuna ishi, however, it appears to be formulaic. The criteria are applied to the Japanese NP as follows (see Table 7.12). The child is just copying what a character in a super-hero programme said, and he would never create this phrase as a novel one – ‘Strongly agree’ on H. This wordstring is employed to describe monsters in the programme – ‘Agree’ on criterion C. The word jaaku does not match the child’s maturation – ‘Strongly agree’ on K. In this example, ‘it was’ is a novel construction followed by an attribute. Therefore, when the child needs to find the attribute, he goes to his lexicon and pulls out the formula tadano jaakuna ishi. The boundary here is essentially that of the lexical unit. The past tense copula da-tta is triggered by this Japanese formula. This also supports the notion that CS occurs at the boundaries of formulaic sequences, whether before or after.

Table 7.12 Formulaicity in ‘tadano jaakuna ishi’
Criteria | Judgement
A Grammatical irregularity | SD
B Semantic opacity | SD
C Situation/register/genre specificity | A
D Pragmatic function | N
E Idiolect | N
F Performance indication | N
G Grammatical/lexical indication | N
H Previous encounter | SA
I Derivation | N
J Inappropriate application | SD
K Mismatch with maturation | SA
L Underlying frame | SD

There is another example of the portmanteau structure, in which a loanword works as a trigger word and alternation from English to Japanese occurs.
(12) I want to be goorukiipaa ni nari-tai
goal keeper RSL become-MOD
{I want to be goal keeper}
From a syntactic point of view, an NP should occur after the wordstring ‘I want to be’, but a Japanese VP follows. The switch therefore does not occur at a syntactic constituent boundary. The formulaicity of the English part ‘I want to be [. . .]’ and of the Japanese part [. . .] ni naritai is analysed with the diagnostics (see Table 7.13). The English part ‘I want to be [. . .]’ is a frame with a gap, therefore ‘Strongly agree’ on L, which is supported by other criteria. It is used when he is playing and deciding which role to play – ‘Strongly agree’ on C. By using this wordstring he is claiming the role – ‘Strongly agree’ on D. He may have learned this when he was playing with other children – ‘Agree’ on H (not ‘Strongly’ because there is no direct evidence). There is an example in the dataset where he says ‘I want to be, I want to be’ when he is competing to grab a role when playing with his brother – ‘Strongly agree’ on E, which also supports the judgement on C.

Table 7.13 Formulaicity in ‘I want to be [ ]’
Criteria | Judgement
A Grammatical irregularity | SD
B Semantic opacity | SD
C Situation/register/genre specificity | SA
D Pragmatic function | SA
E Idiolect | SA
F Performance indication | N
G Grammatical/lexical indication | SD
H Previous encounter | A
I Derivation | SD
J Inappropriate application | N
K Mismatch with maturation | SD
L Underlying frame | SA

If we turn to the Japanese part, we obtain a result similar to the English one (see Table 7.14). The Japanese frame [. . .] ni naritai is also used in a playing situation – ‘Strongly agree’ on C – and it has the function of claiming a role – ‘Strongly agree’ on D. There is no direct evidence of his use of it in the data, but he might have learned this wordstring from his friends when playing – ‘Agree’ on H – and will use it again – ‘Agree’ on E.

Table 7.14 Formulaicity in ‘[NP] ni naritai’
Criteria | Judgement
A Grammatical irregularity | SD
B Semantic opacity | SD
C Situation/register/genre specificity | SA
D Pragmatic function | SA
E Idiolect | A
F Performance indication | NA
G Grammatical/lexical indication | SD
H Previous encounter | A
I Derivation | SD
J Inappropriate application | N
K Mismatch with maturation | N
L Underlying frame | SA
In this example, both the English wordstring ‘I want to be [. . .]’ and the Japanese wordstring [. . .] ni naritai appear to be formulaic. CS occurs at the gap, which is the finishing point of the English formulaic frame and, at the same time, the starting point of the Japanese formulaic frame. In this section we have seen two patterns of the portmanteau structure, which can be described as the English part – the hinge part (Jp) – the Japanese part. In example (11) the NP which works as a hinge is formulaic. In example (12), on the other hand, the English and Japanese parts surrounding the hinge are formulaic. This strongly supports Backus’ position that CS occurs at the boundary of formulaic sequences.
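The judgements in Tables 7.10 to 7.14 can also be compared side by side. The following Python sketch is illustrative only and is not part of the diagnostic procedure itself: the dictionary layout and the function names `supporting_criteria` and `summarize` are assumptions introduced here, and listing the criteria judged ‘Agree’ or ‘Strongly agree’ is simply a convenient way to read the tables, not a scoring rule proposed in this chapter.

```python
# Judgements transcribed from Tables 7.10-7.14, in the order of criteria A-L.
# SA = strongly agree, A = agree, SD = strongly disagree; any other code is
# kept exactly as it appears in the tables.
CRITERIA = "ABCDEFGHIJKL"

JUDGEMENTS = {
    "it's [Jp-CLAUSE]": "A SA N SA SA N N A A N N SA",
    "it was [Jp-CLAUSE]": "SD SD SD SD N N N N N SD SD N",
    "tadano jaakuna ishi": "SD SD A N N N N SA N SD SA SD",
    "I want to be [...]": "SD SD SA SA SA N SD A SD N SD SA",
    "[NP] ni naritai": "SD SD SA SA A NA SD A SD N N SA",
}


def supporting_criteria(judgements: str) -> list[str]:
    """Return the criteria letters judged 'A' or 'SA' for one wordstring."""
    codes = judgements.split()
    if len(codes) != len(CRITERIA):
        raise ValueError("expected one judgement per criterion A-L")
    return [c for c, code in zip(CRITERIA, codes) if code in {"A", "SA"}]


def summarize() -> dict[str, list[str]]:
    """Map every wordstring to the criteria that support its formulaicity."""
    return {ws: supporting_criteria(j) for ws, j in JUDGEMENTS.items()}


if __name__ == "__main__":
    for wordstring, letters in summarize().items():
        print(f"{wordstring:22s} {''.join(letters) or '(none)'}")
```

Read this way, ‘it was [Jp-CLAUSE]’ is the only wordstring with no supporting criterion, which matches the conclusion above that the hinge, not the English frame, is formulaic in example (11).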
Conclusion
In order to answer the research question of whether CS occurs at the boundary of formulaic sequences rather than of syntactic constituents, the criteria for the identification of formulaic sequences devised by Wray and Namba (2003) have been verified and revised in this paper. A new criterion, which identifies an underlying frame with gaps, was added. Examples of alternational CS were taken from bilingual children’s data and examined using the revised criteria. The data show that CS occurs inside formulaic frames, but only at the boundary between fixed and variable items, as well as at the boundary between two formulas. The frames are, in many cases, diagnosed as formulaic, which appears to support Backus’ claim rather than Azuma’s. However, the occurrence of formulaic frames is a challenge to both Azuma and Backus, since the frames themselves can entail CS that is not at a constituent boundary. Moreover, the fact that the frame allows internal CS seems to imply that Backus’ claim is too simplistic: CS does occur inside a formulaic sequence, but only in order to fill the gaps in the frame. This makes it harder to make a strong theoretical claim, since the definition of what the frame actually is will depend on the variation seen in examples, including CS examples. Hence there is a danger of circularity: we know this is a frame because we see variation, including CS, in these gaps. The criteria employed to identify formulaic sequences still depend on the researcher’s subjective judgement. In applying them, additional assessors¹³ would strengthen the judgement. Since the researcher’s intuition is the starting point, consensus among the assessors would greatly strengthen the internal reliability.
Acknowledgement
I would like to express my gratitude to Professor Alison Wray at Cardiff University, who steered me in the right direction in my PhD study and gave insightful comments on an earlier version of this paper. I would also like to thank the editor, Dr. David Wood, for his comments, suggestions and support.
Notes
1 The language which provides the abstract morphosyntactic frame of the bilingual clause and the frame itself is called the Matrix Language (ML) and the other participating language is called the Embedded Language (EL) (Myers-Scotton, 2002, p. 66).
2 In ‘insertional CS’, a single constituent, either a single word or a multi-word item, is inserted into the matrix language, whereas in alternational CS the speaker changes language, even halfway through the clause (Muysken, 2000, p. 6).
3 Azuma’s analysis of the naturalistic data deals with insertional CS.
4 The conclusions Wray comes to as a result of the review are: multiword lexical units are morpheme equivalent (2002, p. 265); they arise on account of Needs Only Analysis, that is, exist mostly because they have not been broken down, not because they have been fused (p. 130); they persist because they have sociointeractional purposes (p. 204); they are distributed around the brain on the basis of how they are used (p. 251).
5 For example, ‘very funny’ is not marked from the perspective of ‘form’. However, it can be used when the actual event is not funny, because it is marked from the perspective of ‘meaning’ or pragmatics of use.
6 The criteria introduced here were developed collaboratively with Professor Alison Wray, tested on some of the data in the present study, and published in 2003. I am grateful to Alison Wray for her input into the development of the criteria. In the account below, the criteria will be evaluated in the light of subsequent work on the data, and some modifications to them will be suggested.
7 For example, ‘by and large’ always appears in this form, that is, syntactic or morphological changes never occur. Other formulaic sequences can be inflected. For instance, ‘kick the bucket’ can be used in different tenses but cannot be passivized. On the other hand, ‘spill the beans’ can be passivized and shows more variability.
8 This is an application for children and learners. Regarding the application of the criteria for adult native speakers, see Wray and Namba, 2003, p. 28.
9 capitalization = loudness, underline = stress, ↑ = raised pitch height, colons = stretched segment (Langford, 1994, p. 49).
10 The adverbial is realized as an adverb phrase or a prepositional phrase.
11 ‘Agree’ on B, ‘Strongly agree’ on C and E.
12 The constructions described here are based on Backus’ use of the term. There are strong similarities to the notion of the ‘construction’ in Construction Grammar (Goldberg, 2003, 2006) but there is no scope within this paper to engage directly with that theory. The frame and gap formulation is, in fact, much more longstanding than these recent works, having been described as productive elements of formulaic language as long ago as Pawley and Syder (1983, p. 210).
13 Foster (2001) employed seven native speakers as assessors to identify formulaic sequences.
References
Azuma, S. (1996). Speech production units among bilinguals. Journal of Psycholinguistic Research, 25 (3), 397–416.
Backus, A. (2003). Units in code switching: evidence for multimorphemic elements in the lexicon. Linguistics, 41 (1), 83–132.
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Longman grammar of spoken and written English. Harlow: Pearson Education Limited.
Foster, P. (2001). Rules and routines: a consideration of their role in the task-based language production of native and non-native speakers. In M. Bygate, P. Skehan, & M. Swain (Eds.), Researching pedagogic tasks: Second language learning, teaching and testing (pp. 75–93). London: Longman.
Gläser, R. (1998). The stylistic potential of phraseological units in the light of genre analysis. In A. P. Cowie (Ed.), Phraseology: Theory, analysis and applications (pp. 125–43). Oxford: Clarendon Press.
Goldberg, A. E. (2003). Constructions: a new theoretical approach to language. Trends in Cognitive Sciences, 7 (5), 219–24.
Goldberg, A. E. (2006). Constructions at work. Oxford: Oxford University Press.
Langford, D. (1994). Analyzing talk. London: The Macmillan Press Ltd.
Muysken, P. (2000). Bilingual speech: A typology of code-mixing. Cambridge: Cambridge University Press.
Myers-Scotton, C. (2002). Contact Linguistics. Oxford: Oxford University Press.
Namba, K. (2008). English-Japanese bilingual children’s code-switching: A structural approach with emphasis on formulaic language. PhD thesis, Centre for Language and Communication Research, Cardiff University.
Nishimura, M. (1997). Japanese/English code-switching: Syntax and pragmatics. New York: Peter Lang.
Pawley, A., & Syder, F. H. (1983). Two puzzles for linguistic theory: nativelike selection and nativelike fluency. In J. C. Richards & R. W. Schmidt (Eds.), Language and Communication (pp. 191–226). New York: Longman.
Radford, A. (2004). English syntax: An introduction. Cambridge: Cambridge University Press.
Sinclair, J. M. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Wray, A. (2002). Formulaic language and the lexicon. Cambridge: Cambridge University Press.
Wray, A. (2008). Formulaic language: Pushing the boundaries. Oxford: Oxford University Press.
Wray, A., & Namba, K. (2003). Use of formulaic language by a Japanese-English bilingual child: A practical approach to data analysis. Japan Journal of Multilingualism and Multiculturalism, 9 (1), 24–51.
Chapter 8
Holistic Processing of Regular Four-word Sequences: A Behavioural and ERP Study of the Effects of Structure, Frequency, and Probability on Immediate Free Recall
Antoine Tremblay and Harald Baayen
University of Alberta
Introduction
It is generally accepted that we store representations of words in a mental dictionary, which we call the lexicon. However, what exactly is stored in the mental lexicon remains an open question. For example, do we store the word dog as well as its plural form dogs, or do we only store dog and have a rule (NOUN + -s = plural) to compute the plural form? A similar question arises regarding the storage versus computation of multi-word units, wherein a single meaning is attached to a string of words. The canonical examples are phrasal verbs (give up), compounds (jailbird), and idioms (kick the bucket). By their very nature, these items offer us an opportunity to understand the interplay between storage and computation. Corpus-based research has shown that the tendency for words to occur together in discourse extends far beyond the canonical (e.g. Biber, Johansson, Leech, Conrad, & Finegan, 1999; Bod, Scha, & Sima’an, 2003). In fact, other sequences of words, such as in the middle of, pattern together with such frequency that it may be enough to treat them as single units in their own right (Biber et al., 1999).

There is a good psycholinguistic basis for proposing that the mind stores and processes these multi-word units as wholes (e.g. Bod, 2001; Schmitt & Underwood, 2004; Underwood, Schmitt, & Galpin, 2004; Jiang & Nekrasova, 2007; Conklin & Schmitt, 2008; Tremblay, Derwing, Libben, & Westbury, in press). The main reason may be the structure of the mind itself, which stores a vast number of information units in long-term memory, but is only able to process about 4–7 of them online, in working memory (Miller, 1956). In effect, the mind might make use of a relatively unlimited resource (long-term memory) to compensate for a relatively limited one (working memory) by storing a number of frequently needed/used multi-word units as wholes. Such units could be easily retrieved and used as wholes without the need to compose them online through word selection and grammatical sequencing. Such an ability would place less demand on cognitive resources because the multi-word units would be ‘ready to go’ and require little or no additional processing.

In the realm of psycholinguistics, research on questions of storage and computation has for the most part disregarded sentences on the grounds that they are necessarily derived via general rules from individual words (Chomsky, 1988). That is, the meaning of a sentence such as I play soccer can be derived from the individual words that compose it and is therefore not stored in the lexicon. Such approaches to language are further supported by the observation that storing every possible utterance one has ever heard and/or seen is clearly infeasible. There is mounting evidence, however, suggesting that the repertoire of sentences native speakers commonly use is more restricted and repetitive than was previously thought (e.g. Biber et al., 1999). As a result, the notion that we store regular and irregular utterances becomes more credible. This has led researchers such as Goldberg (1995) and Bod et al. (2003) to propose models of language where more or less abstract ‘patterns’ or ‘constructions’ of variable lengths and degrees of idiosyncrasy emerge from the accumulation of stored instances (e.g. to pull X’s leg, It is X that Y, Subject – Verb – Object). When confronted with the need to produce a novel sentence, for example, one would choose the appropriate construction and fill out its open slots with the appropriate material (potentially other constructions).

Recent findings suggest that, in addition to full sentences, regular sentence fragments are also stored in the mental lexicon.
For instance, Biber et al.’s (1999) study of the British National Corpus found that frequent regular multi-word strings such as I think that and I don’t know are more likely to be repeated as wholes (e.g. I think that I think that DNA is a very good example, because erm, it presumably, it was initially a piece of jury search) and that pauses frequently occur at their boundaries (e.g. I mean they fought valiantly for peace but I, I think that erm <pause> the maternity bill I think is what everybody admits that we shall always go down as being noted for). Such a hypothesis implies that the mental lexicon keeps track of how many times it has experienced not only words, but also regular sentences and sentence fragments. To put it in terms of Hebb’s law of neural plasticity, one could say that words used together wire together. Supporting evidence for this idea is provided by a handful of recent psycholinguistic studies that report reduced processing loads for regular high frequency multi-word sequences (e.g. I said to her) relative to regular
low frequency items (e.g. I was to her) in L1 and L2 speakers of English (e.g. Bod, 2001; Jiang & Nekrasova, 2007; Tremblay et al., in press). In a recent study, Tremblay et al. (in press) found that highly frequent four- and five-word sequences referred to as lexical bundles (at least 10 and 5 occurrences per million, respectively) provide on-line processing advantages over comparable low-frequency sequences (fewer than 10 and 5 per million, respectively). The impression arising from such findings is that of a sharp lexical versus non-lexical bundle dichotomy. It is conceivable, however, that these categories are epiphenomenal to the factorial design Tremblay et al. (in press) used in their study. Would the same distinction have emerged had they considered sequences ranging from very low to very high frequencies? Moreover, is it reasonable to believe that non-lexical bundles with a frequency of 1 per million behave (exactly) like those with a frequency of 9 per million? Are the latter strings radically different from lexical bundles with a frequency of 10 or 11 per million? What about lexical bundles with a frequency of 20, 50, or 100? In order to investigate these issues, we conducted an immediate free recall task where the stimuli consisted of 432 regular four-word sequences with whole-string frequencies ranging roughly from 0.01 to 100 per million.
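The sharpness of the cut-off questioned above is easy to see when the dichotomy is written out. The following Python sketch is only an illustration: the function name `is_lexical_bundle` and the example frequencies are hypothetical, and only the two thresholds (10 per million for four-word and 5 per million for five-word sequences) are taken from the description of Tremblay et al. (in press).

```python
# Frequency thresholds (occurrences per million words), keyed by sequence
# length in words, as reported above for Tremblay et al. (in press).
BUNDLE_THRESHOLDS = {4: 10.0, 5: 5.0}


def is_lexical_bundle(sequence: str, freq_per_million: float) -> bool:
    """Apply the binary lexical-bundle cut-off to one word sequence."""
    n_words = len(sequence.split())
    if n_words not in BUNDLE_THRESHOLDS:
        raise ValueError("thresholds are only given for 4- and 5-word sequences")
    return freq_per_million >= BUNDLE_THRESHOLDS[n_words]


if __name__ == "__main__":
    # Hypothetical frequencies chosen to straddle the four-word threshold.
    for freq in (1.0, 9.0, 10.0, 11.0, 100.0):
        print(freq, is_lexical_bundle("in the middle of", freq))
```

Under such a rule, sequences at 1 and 9 per million fall into the same class while 9 and 10 per million fall into different ones, which is why a continuous frequency range is used in the present study.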
Immediate Free Recall
In immediate free recall tasks, participants are asked to recall without delay, and in any order, items from a previously studied list. In such tasks, single word frequency has proved to be a paradoxical predictor. When comparing pure lists of high-frequency words (e.g. letter, money, people) to pure lists of low-frequency items (e.g. dike, strong, key), recall is usually better for high-frequency items (e.g. DeLosh & McDaniel, 1996; Merritt, DeLosh, & McDaniel, 2006). Surprisingly, however, in lists consisting of mixed high- and low-frequency words, the advantage robustly goes to low-frequency items (e.g. DeLosh & McDaniel, 1996; Merritt et al., 2006; Tse & Altarriba, 2007). In line with classical theories of information processing (e.g. Johnston & Heinz, 1978), DeLosh and McDaniel (1996) argue that this effect is attributable to the fact that a greater amount of attentional resources is allocated to the processing and interpretation of salient low-frequency items than trivial high-frequency items, which allows for suppression and inattention. In light of this and given the mixed-frequency stimulus list used in this experiment, we expect that items associated with lower frequencies and lower frequency-related measures such as probability of occurrence
(e.g. LogitABCD; see Table 8.1 below) will be correctly recalled more often than strings associated with higher frequencies and higher frequency-related variables.

Focus of attention, probability and frequency are known to modulate a number of ERP components, among others the P1, N1, and P2 deflections. The P1 is the earliest visual event-related potential known to vary with spatial attention, state of arousal, lexical frequency, and probability. It arises at occipital scalp sites 60–90 msec and peaks 100–150 msec post-stimulus (e.g. Luck, 2005, and references cited therein; Penolazzi, Hauk, & Pulvermüller, 2007, and references cited therein). The early portion of the P1 (peak latency 98–110 msec) is believed to have extrastriate generators (in the middle of the occipital gyrus) that possibly include areas V2 and V4 of the visual cortex, whereas the later portion (peak latency 136–146 msec) arises from the ventral extrastriate cortex of the fusiform gyrus (Hillyard, Teder-Sälejärvi, & Münte, 1998; Di Russo, Martinez, Sereno, Pitzalis, & Hillyard, 2002; Luck, 2005).

The N1 is composed of at least three subcomponents, one which peaks at frontal and central sites ~100–150 msec after stimulus onset (N1a), and two later ones at posterior and occipital scalp sites with a peak latency of ~150–200 msec (N1b). The N1, and particularly the anterior N1, believed to originate from centro-parietal sources (Di Russo et al., 2002), is known to be sensitive to spatial attention (Luck, 2005, and references cited therein) as well as lexical frequency and probability of occurrence (e.g. Sereno, Rayner, & Posner, 1998; Penolazzi et al., 2007, and references cited therein).

The P2 typically onsets 150 to 220 msec after stimulus presentation at frontal and central scalp sites. It is known to be modulated by the amount of attention directed at features of an event as well as stimulus probability, expectancy, and frequency (e.g. Luck, 2005; Dambacher, Kliegl, Hofmann, & Jacobs, 2006; Wlotko & Federmeier, 2007).

Given the word frequency effect on probability of recall mentioned above, we expect that lower frequency sequences will elicit larger P1, N1, and P2 deflections. Furthermore, we anticipate these early components to be followed by a slow wave at frontal sites known as the slow anterior negativity, which onsets ~250 msec post-stimulus, peaks at ~400 msec, and lasts until ~500 msec. This wave is thought to reflect short-term memory processes (e.g. Kluender & Kutas, 1993). Given that lower-frequency sequences are expected to attract more attentional resources than higher-frequency items and therefore be recalled more readily, the amount of resources devoted to short-term memory processes indexed by slow anterior negativity amplitudes is expected to decrease as whole-string frequency increases.
Participants
Eleven female students from the University of Alberta were paid for their participation in the experiment (mean age = 23.4; SD = 1.6; min = 22; max = 27). All were native speakers of English. The Research Ethics Board approved the study. Participants gave informed consent after the nature of the study was explained to them. They were asked to fill out the Edinburgh Inventory handedness questionnaire (Oldfield, 1971), which was presented on a PC using E-Prime (a stimulus presentation software package). Ten were right-handed (mean handedness score = 79.5/100; SD = 15.8; min = 54.5/100; max = 100/100) and one was left-handed (handedness score = –47.4/100). We also assessed participants’ reading span and working memory capacity (henceforth WMC) using an adaptation of the Salthouse and Babcock (1991) test (mean WMC score = 73.3/100; SD = 10.4; min = 53.6/100; max = 87.5/100). The WMC test items were presented on a PC using E-Prime.
Materials
The stimulus list consisted of 432 four-word sequences taken from the British National Corpus. Frequencies, obtained from the Variations in English Words and Phrases search engine, ranged from 0.03 to 105 occurrences per million.
Experimental design and procedure
Participants first completed a practice block, which consisted of six trials. In each trial, six three-word sequences were presented in a random order (for a total of 36 practice items). At the end of each trial, participants were asked to recall as many sequences as possible. The experimental portion consisted of 72 blocks. Each block was divided into 18 trials, where, in each trial, six four-word sequences were randomly presented. A trial proceeded as follows. Participants first saw the word ‘Ready …’ for 2,500 msec (font: Courier New; size: 18; position: Center), then a fixation cross ‘+’, presented for a uniformly distributed duration of 250 to 1,000 msec (font: Times New Roman; size: 16; position: Center), then a blank screen for 1,500 msec, followed by the first of six four-word sequences, presented all at once for 1,500 msec (font: Times New Roman; size: 14; position: Center), followed by a fixation cross (as previously described) and the second of six sequences (as previously detailed), and so on until six four-word sequences had been shown. At the end of each trial, participants were prompted to type in as many sequences as they could recall. Participants had three two-minute breaks. Sequences subtended on average ~5° × 0.4° visual angle; the longest four-word string (becoming increasingly clear that) subtended ~8° × 0.4° visual angle.
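The event sequence of a single trial can be laid out programmatically. The Python sketch below is a hedged illustration, not the E-Prime script used in the experiment: the function name `trial_timeline` and the six example strings (taken from examples mentioned in this chapter, not from an actual trial) are hypothetical, and it assumes that the 1,500 msec blank screen appears only once, before the first sequence, which is one possible reading of the description above.

```python
import random

# Durations in msec, as stated in the procedure.
READY_MS = 2500
BLANK_MS = 1500
SEQUENCE_MS = 1500
FIXATION_RANGE_MS = (250, 1000)  # fixation duration, uniformly distributed
SEQUENCES_PER_TRIAL = 6


def trial_timeline(sequences, rng):
    """Return (display, duration_ms) events for one six-sequence trial."""
    if len(sequences) != SEQUENCES_PER_TRIAL:
        raise ValueError("a trial presents exactly six sequences")
    events = [("Ready ...", READY_MS)]
    for i, seq in enumerate(sequences):
        events.append(("+", rng.uniform(*FIXATION_RANGE_MS)))
        if i == 0:
            # Assumption: the blank screen occurs only before the first item.
            events.append(("", BLANK_MS))
        events.append((seq, SEQUENCE_MS))
    return events


if __name__ == "__main__":
    demo = ["in the middle of", "I said to her", "I was to her",
            "in the United States", "he shook his head", "she was going to"]
    timeline = trial_timeline(demo, random.Random(1))
    for display, duration in timeline:
        print(f"{duration:7.0f} msec  {display!r}")
    print("total msec:", round(sum(d for _, d in timeline)))
```

Under this assumption the presentation phase of a trial lasts between 14,500 and 19,000 msec, depending on the six fixation durations drawn.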
Behavioural analysis and results
While examining the data, we realized that one item was a three-word sequence and that another appeared twice in the list; both were removed, leaving us with 430 items. The remaining data were analysed using linear mixed-effects regression (LMER; Baayen, 2008; Baayen, Davidson, & Bates, 2008). Our main interest here was to determine whether the number of times a sequence would be correctly recalled varied as a function of whole-string frequency/probability. Responses were coded as ‘correctly recalled’ or ‘incorrectly recalled’. In order to be coded as correct, the sequence had to be recalled exactly. That is, if the target sequence was in the middle of, any other response, such as in the middle, in the middle and, in the middle of a, or at the middle of, was considered incorrect. We did accept, however, minor misspellings such as in the mdle of or n the midle of. Whole-string frequency and probability correlate with a number of variables, such as a sequence’s length, the frequencies of the words that compose it, and sequence-internal bigram and trigram frequencies and probabilities. In addition to whole-string frequency and probability, we therefore considered a number of further variables (fixed effects), which are listed and briefly described in Table 8.1. This ensures that other potential sources of variation in recall are controlled for and that a significant whole-string frequency/probability effect, if found, is independent of confounded variables. Subjects and items were entered in the model as random effects. The most parsimonious and generalizable model consisted of WMC, Position, FreqABC, FreqBCD, PhraseABCD*FreqC, PhraseABCD*FreqD, and PhraseABCD*LogitABCD. Collinearity between model variables was acceptable, that is, there was no significant overlap in predictive power between model variables.
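The probability measures among these variables follow the formula given in Table 8.1, LogitAB = log(FreqAB/((FreqA* – FreqAB)+1)). The Python sketch below shows how such a value might be computed. It is an illustration under stated assumptions: the function name `logit` and all counts are hypothetical, and applying the same formula to LogitABCD (with FreqABC* as the context frequency) is an extension by analogy, since the table spells out the formula for LogitAB only.

```python
import math


def logit(freq_target: float, freq_context: float) -> float:
    """log(freq_target / ((freq_context - freq_target) + 1)).

    freq_target  -- frequency of the full string, e.g. FreqAB
    freq_context -- frequency of its beginning followed by any single
                    word, e.g. FreqA* (the * is a one-word wildcard)
    """
    if freq_target <= 0:
        raise ValueError("freq_target must be positive")
    if freq_context < freq_target:
        raise ValueError("the context cannot be rarer than the target")
    return math.log(freq_target / ((freq_context - freq_target) + 1))


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    freq_a_star, freq_ab = 1000, 400      # "in *" and "in the"
    freq_abc_star, freq_abcd = 60, 45     # "in the middle *" and "in the middle of"
    print("LogitAB   =", round(logit(freq_ab, freq_a_star), 3))
    print("LogitABCD =", round(logit(freq_abcd, freq_abc_star), 3))
```

The value is positive when the target continuation is more frequent than all other continuations combined (plus one) and negative otherwise.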
Results of the linear mixed-effects regression are summarized in Table 8.2. Figure 8.1 illustrates the effects of each predictor on probability of recall. Note that the modulation of each variable is independent of other model predictors and additive. That is, the probability of recall of an item in this particular case is equal to the sum of the effects of WMC, Position, FreqABC,

Table 8.1 Variables Taken into Consideration in the Statistical Analysis

Variable | Description
WMC | Reading span and working memory capacity score.
EI | Handedness score.
Trial | The block in which an item was presented (out of 72 blocks).
Position | Position of an item within a trial (either 1, 2, 3, 4, 5, or 6).
Length | Length of the whole sequence in number of letters.
PhraseABCD | Whether the whole sequence was a phrase (e.g. in the United States) or a non-phrase (e.g. I think it’s the). Phrases are sequences that can stand alone, such as in the United States, they don’t have to, she was going to, and he shook his head; non-phrases, such as but there is no, the result of a, and I don’t think it’s, cannot.
WordTypeABCD | Patterns of content (Con) and non-content (N) words. For example, in the middle of has the structure NNConN.
FreqA, FreqB, FreqC, FreqD | Frequency of the first, second, third, and fourth word of the sequence. Considering the sequence in the middle of, FreqA = frequency of in, FreqB = frequency of the, FreqC = frequency of middle, and FreqD = frequency of of.
FreqAB, FreqBC, FreqCD | Frequency of the sequence formed by the first and second word (FreqAB), the second and third word (FreqBC), and the third and fourth word (FreqCD).
FreqABC, FreqBCD | Frequency of the sequence formed by the first, second, and third word (FreqABC) and the second, third, and fourth word (FreqBCD) of a sequence.
FreqABCD | Frequency of the whole sequence (e.g. in the middle of).
LogitAB, LogitBC, LogitCD | The (log) probability of obtaining word B, C, or D given word A, B, or C, respectively. For example, LogitAB = log(FreqAB/((FreqA* – FreqAB)+1)).
LogitABC, LogitBCD | The (log) probability of obtaining word C or D given the sequence AB or BC, respectively.
LogitABCD | The (log) probability of obtaining word D given the sequence ABC.

Note: The capital letters A, B, C, and D refer to words in the first, second, third, and fourth position of a four-word sequence (e.g. in the middle of, where A = in, B = the, C = middle, and D = of). The asterisk * is a wildcard representing any single word; if A = in then A* could stand for in the, in a, in your, etc. Con stands for ‘content word’ (e.g. middle), and N for ‘non-content word’ (e.g. the).

FreqBCD, PhraseABCD*FreqC, PhraseABCD*FreqD, and PhraseABCD*LogitABCD. Given space constraints, we will only discuss results regarding the PhraseABCD and LogitABCD variables, which are the two variables of main interest. Analysis results are given in Table 8.2. Previous studies uncovered a positive correlation between number of words recalled and the amount of linguistic structure existing between them (e.g. Miller & Selfridge, 1950; Tulving & Patkau, 1962). It was thus expected that, in general, phrasal four-word

Table 8.2 Linear Mixed-Effects Regression Results

Random effects
Groups | Name | Variance | SD
Subject | (Intercept) | 0.0547 | 0.2339
Item | (Intercept) | 0.1663 | 0.4078

Fixed effects
Predictor | Estimate | SE | z value
Intercept | –2.1450 | 0.6814 | –3.1**
WMC | 2.6230 | 0.7737 | 3.4***
1st restricted cubic spline for Position | –0.2574 | 0.0555 | –4.6***
2nd restricted cubic spline for Position | 0.6856 | 0.0608 | 11.3***
PhraseABCD (phrases) | 0.4798 | 0.6453 | 0.7
FreqC | –0.1025 | 0.0250 | –4.1***
FreqD | –0.0189 | 0.0365 | –0.5
FreqABC | 0.1137 | 0.0333 | 3.4***
FreqBCD | –0.0833 | 0.0386 | –2.2*
LogitABCD | 0.1074 | 0.0394 | 2.7**
PhraseABCD (phrases) by FreqC | 0.1850 | 0.0538 | 3.4***
PhraseABCD (phrases) by FreqD | –0.1338 | 0.0612 | –2.2*
PhraseABCD (phrases) by LogitABCD | 0.2128 | 0.0698 | 3.1**

Note: Estimates and standard errors correspond to log probability of recall (i.e. logit(P) = log(P/(1–P))). Probabilities (per cent) are obtained from the following equation: P = exp(logit(P))/(1+exp(logit(P))). Restricted cubic splines (rcs) with three knots were used for Position, indicating that the effect is non-linear. 4,730 observations, where one observation is equal to one four-word sequence correctly recalled or not by one participant. Collinearity index between model variables is 12.6, which is acceptable (15 is considered to be too high). * = p < .05. ** = p < .01. *** = p < .001.
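
The conversion stated in the note to Table 8.2 can be sketched in a few lines of Python. This is only an illustration of the formula P = exp(logit(P))/(1+exp(logit(P))); the function names are hypothetical, and holding every other predictor at zero is an assumption made for the sketch rather than a condition reported in the chapter.

```python
import math


def inverse_logit(logit_p: float) -> float:
    """Map a logit value to a probability: P = exp(x) / (1 + exp(x))."""
    return math.exp(logit_p) / (1.0 + math.exp(logit_p))


def logit(p: float) -> float:
    """Map a probability to the logit scale: logit(P) = log(P / (1 - P))."""
    return math.log(p / (1.0 - p))


# Estimates copied from Table 8.2 (logit scale).
INTERCEPT = -2.1450
LOGIT_ABCD_SLOPE = 0.1074

# Illustration only: all other predictors are held at zero (an assumption
# of this sketch, not a condition described in the chapter).
baseline = inverse_logit(INTERCEPT)
one_unit_higher = inverse_logit(INTERCEPT + LOGIT_ABCD_SLOPE)

print(f"baseline probability: {baseline:.3f}")
print(f"after +1 LogitABCD:   {one_unit_higher:.3f}")
```

Because the model is additive on the logit scale, a positive estimate such as the one for LogitABCD raises the probability of recall, but the size of the change in probability depends on where on the logit scale one starts.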
sequences such as in the United States would be recalled more readily than non-phrasal strings such as by the end of. We believe this is due to the fact that phrases instantiate (relatively) complete concepts compared to non-phrases. The finding that higher whole-string probability (LogitABCD) facilitates recall is contrary to expectations. Indeed, it was predicted that lower frequency/probability sequences would have been more readily recalled, as was found elsewhere for words in mixed-frequency lists (e.g. DeLosh & McDaniel, 1996; Merritt et al., 2006; Tse & Altarriba, 2007). If more salient items are more easily recalled, then saliency, in the case of regular multi-word sequences, appears to be related to lexical activation rather than to novelty: lower activation thresholds and/or higher levels of activation relate to higher multi-word string saliency, which in turn is associated with

Figure 8.1 Results of the Linear Mixed-Effects Regression Analysis. Each panel shows the effect sizes of significant variables on probability of recall. From top left to bottom right: WMC, Position, FreqC, FreqD, FreqABC, FreqBCD, PhraseABCD by FreqC, PhraseABCD by FreqD, and PhraseABCD by LogitABCD. (In every panel the y-axis is probability of recall; the x-axes are working memory capacity, serial position, log frequency of the 3rd word, log frequency of the 4th word, log frequency of the 1st 3-gram, log frequency of the 2nd 3-gram, and whole-string log probability. In the interaction panels, p = phrase and np = non-phrase.)

higher probability of recall. While token frequency provides an indication of an item’s salience relative to all other items in a language, whole-string probability offers an indication of its salience relative to its ‘family’. The following will clarify this notion. Let us first restate the equation used to calculate the LogitABCD value of a four-word sequence.

(1) LogitABCD = log(FreqABCD/((FreqABC* – FreqABCD)+1))

That is, LogitABCD is the log of the frequency of the whole string divided by the summed frequency of every other four-word sequence that shares its first three words (i.e. the frequency of all sequences beginning with ABC minus the frequency of the string itself), plus one. By way of example, let us consider the sequence in the middle of, which has a token frequency of 28.46 occurrences per million in the British National Corpus. There are 243 other sequences that begin with the words in the middle; we refer to the set of sequences that share their first three words as a sequence’s ‘family’. Some examples are given in (2), where whole-string frequencies and their respective rankings relative to other members of the family appear in parentheses (in the middle of is the most frequent sequence and thus ranked 1).

(2)
a. in the Middle East (frequency = 4.99; rank = 2)
b. in the Middle Ages (frequency = 2.2; rank = 4)
c. in the middle and (frequency = 1.23; rank = 6)
d. in the middle to (frequency = 0.2; rank = 9)
e. in the middle are (frequency = 0.12; rank = 17)

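
The definition in (1) can be sketched as a small Python function. This is a minimal illustration, not the authors' own code: the function names are hypothetical, the natural logarithm is an assumption (the chapter only writes "log"), and the summed family frequency of 47.27 per million is the value used in the chapter's worked example for sequences beginning with in the middle.

```python
import math


def family_ratio(freq_abcd: float, freq_abc_any: float) -> float:
    """Ratio inside equation (1): FreqABCD / ((FreqABC* - FreqABCD) + 1).

    freq_abcd    -- frequency of the whole four-word string
    freq_abc_any -- summed frequency of all four-word strings that share
                    the same first three words (FreqABC*)
    """
    return freq_abcd / ((freq_abc_any - freq_abcd) + 1.0)


def logit_abcd(freq_abcd: float, freq_abc_any: float) -> float:
    """Equation (1); the natural log is an assumption of this sketch."""
    return math.log(family_ratio(freq_abcd, freq_abc_any))


FAMILY_TOTAL = 47.27  # per million, from the chapter's worked example

examples = {
    "in the middle of": 28.46,
    "in the middle and": 1.23,
}

for sequence, freq in examples.items():
    ratio = family_ratio(freq, FAMILY_TOTAL)
    print(f"{sequence}: ratio = {ratio:.2f}, logit = {logit_abcd(freq, FAMILY_TOTAL):.2f}")
```

A ratio above 1 gives a positive logit (the string dominates its family), and a ratio below 1 gives a negative logit.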
If we only consider the part of the equation that provides the actual ratio between the frequency of in the middle of and the summed frequencies of all other sequences of its family, that is, FreqABCD/((FreqABC* – FreqABCD)+1), we obtain the value 28.46/((47.27–28.46)+1) = 28.46/19.81 = 1.44, which points to the fact that in the middle of stands out from other sequences in the family (in fact, it is the most salient one). Indeed, a ratio greater than 1 indicates that a sequence is more frequent than (most) other members, whereas a ratio smaller than 1 means that it is less frequent. Compare in the middle of to in the middle and, which has a ratio of 1.23/((47.27–1.23)+1) = 0.03, or to in the middle portion, which has a ratio of 0.01/((47.27–0.01)+1) = 0.0002.

To summarize, the fact that sequence-internal trigrams and single words modulate recall in addition to whole-string probability of occurrence (see analysis results in Table 8.2) suggests that four-word sequences are stored both as wholes and as parts. These results are in line with usage-based
accounts of grammar (e.g. Goldberg, 1995; Bod et al., 2003) according to which regular multi-word sequences as wholes leave memory traces in the brain (whatever the definition of the term ‘memory trace’). The behavioural results, however, are silent as to what type of memory trace might be left behind. Are they procedural or lexical/conceptual memory traces? In other words, are four-word sequences put together online or retrieved as parts and wholes? Because of its high temporal resolution (to the millisecond), electroencephalography is the perfect tool to distinguish between fast computation and holistic retrieval. If it turns out that whole-string probability affects early ERP components such as the P1, the N1, and the P2 deflections, one could argue for holistic retrieval. Indeed, it is believed that words are accessed within ~110–180 msec of presentation (e.g. Sereno, Rayner, & Posner, 1998) irrespective of whether they appear in or out of the context of a sentence. It would thus be impossible to retrieve four words, let alone perform the necessary computations to integrate them, within 110–180 msec.

EEG recordings and processing

Electroencephalogram (EEG) recordings were made with Ag/AgCl active electrodes from 32 locations according to the international 10/20 system (www.biosemi.com/headcap.htm) at the midline (Fz, Cz, Pz, Oz) and left and right hemisphere (Fp1, Fp2, AF3, AF4, F3, F4, F7, F8, FC1, FC2, FC5, FC6, C3, C4, T7, T8, CP1, CP2, CP5, CP6, P3, P4, P7, P8, PO3, PO4, O1, O2). Electrodes were mounted on a nylon cap. Additional electrodes were placed at the left and right mastoids, which served as off-line re-reference. Eye movements were monitored by electrodes placed above and below the left eye and at the outer canthi of both eyes, which were bipolarized off-line to yield vertical (VEOG) and horizontal (HEOG) electrooculograms.
Analogue signals were sampled at 8,192 Hz using a BioSemi (Amsterdam, Netherlands) Active II digital 24-bit amplification system with an active input range of –262 mV to +262 mV per bit and were band-pass filtered between 0.01 and 100 Hz. The digitized EEG was initially processed off-line using Analyser version 1.05; it was downsampled to 128 Hz, DC detrended 100 msec before stimulus markers (Hennighausen, Heil, & Rosler, 1993), band-pass filtered from 0.01 to 32 Hz using an inverse discrete wavelet transform (14 levels), and corrected for eye movements and eye blinks using vertical and horizontal EOGs (Gratton, Coles, & Donchin, 1983). The processed signal was then segmented into epochs of 3,000 msec (1,500 msec before stimulus onset and 1,500 msec after). Each epoch was baseline corrected on the 1,500 msec segment immediately preceding stimulus onset using the baseline correction option of the inverse discrete Haar-Daubechies
2 wavelet transform in Analyser. This was done in order to obtain brain activity measures for each item that, as much as possible, would be uncontaminated by activity from previously presented segments. The data were then exported for further processing and analysis in R version 2.7.2. Data points exceeding ±100 μV at any channel were excluded from the analysis. We further inspected the data by drawing voltage density and quantile-quantile plots for each channel of each subject; channels showing a significant departure from the normal distribution and failing to reach a peak voltage density of 0.035 were removed (see Note 1). Overall, we discarded 10.4 per cent of our data (2,752,128 of 26,419,200 data points).

Electrophysiological analysis and results

We used the generalized additive modelling approach (henceforth GAM) to analyse the ERP data (see Wood, 2006; see Baayen, Hendrix, & Tremblay, 2008, for an application of GAM to ERP data analysis). In essence, GAM determines a linear and/or non-linear equation that strikes a balance between overfitting and overgeneralizing a set of data through a process called penalized iteratively re-weighted least squares (see Wood, 2006, for details). The main advantages of using the GAM method for ERP analysis over the traditional ERP averaging method are as follows: (1) the possibility to fully appreciate the effects of graded variables, such as frequency; (2) the potential to identify non-linear effects; (3) the ability to estimate longitudinal effects in the data; and (4) the power to determine a predictive model of brain activity. In short, this data analysis technique affords the opportunity both to conceive of and investigate research questions previously unthought of or dismissed as untestable.

Though we could have included the sole left-handed participant in the ERP analysis, we restricted our analysis to the ten right-handed participants in order to reduce variability between subjects and increase statistical power.
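
The amplitude criterion described above (data points exceeding ±100 μV were excluded) can be illustrated with a short Python sketch. The function name, the data layout (a dictionary of channel names to voltage samples), and the choice to mark rejected samples per channel rather than dropping the whole time point are all assumptions made for illustration; the chapter does not specify these details, and the authors worked in R.

```python
from typing import Dict, List, Optional, Tuple

THRESHOLD_UV = 100.0  # rejection criterion stated in the text (±100 μV)


def reject_extreme_samples(
    channels: Dict[str, List[float]],
    threshold: float = THRESHOLD_UV,
) -> Tuple[Dict[str, List[Optional[float]]], float]:
    """Mark samples whose absolute voltage exceeds the threshold as None.

    Returns the cleaned data and the fraction of samples rejected.
    """
    cleaned: Dict[str, List[Optional[float]]] = {}
    n_total = 0
    n_rejected = 0
    for name, samples in channels.items():
        kept: List[Optional[float]] = []
        for voltage in samples:
            n_total += 1
            if abs(voltage) > threshold:
                kept.append(None)  # excluded data point
                n_rejected += 1
            else:
                kept.append(voltage)
        cleaned[name] = kept
    fraction = n_rejected / n_total if n_total else 0.0
    return cleaned, fraction


# Toy data (hypothetical values in μV), not taken from the study.
toy = {"FC1": [3.2, -120.5, 8.0, 99.9], "P7": [-4.1, 0.5, 101.0, 12.3]}
cleaned, fraction_rejected = reject_extreme_samples(toy)
print(cleaned, fraction_rejected)

# Proportion of data discarded overall, from the counts reported in the text.
reported_fraction = 2_752_128 / 26_419_200
print(f"{reported_fraction:.1%}")
```

Note that the reported 10.4 per cent covers both the amplitude criterion and the removal of noisy channels, so the toy rejection rate above is not comparable to it.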
Given that our main interest in this study is to determine whether whole-string frequency and/or probability affects the retrieval and processing of regular multi-word sequences, and that we do not know exactly what stimuli elicited the event-related potentials recorded to incorrectly recalled sequences, we decided to restrict the ERP analysis to correctly recalled sequences only (i.e. 32.3 per cent of our processed data, which represents 7,635,149 of 23,667,072 data points). We relegate the comparison of ERPs to both types of responses to future work. In the ERP analysis, we used only those variables that reached significance in the behavioural analysis. Baseline corrected epochs were segmented
into seven 250 msec windows overlapping 50 msec at edges (mostly because models are not as robust at the edges). For each time window we averaged over subjects. Using GAM, we also removed main time trends and variability due to individual items, as well as the following interactions: Time*FreqC*PhraseABCD, Time*FreqD*PhraseABCD, Time*FreqABC, and Time*FreqBCD. We subsequently assessed, again using GAM, whether the remaining voltage variability was modulated by Time*LogitABCD*PhraseABCD. Main voltage trends (in microvolts) for the first time window are shown in Figure 8.2. We will not be concerned with the other six time windows here given space limitations and given that our analysis focuses on very early ERP components. In Figure 8.2, each panel represents an electrode: Fp1 and
Figure 8.2 0–250 msec Time Window
Fp2 are at the top (i.e. the front of the head) and O1, Oz, and O2 at the bottom (i.e. the back of the head). The x- and y-axes represent time in milliseconds and (baseline-corrected) microvolts respectively; positive is plotted up. Grey circles are scalp voltages averaged over items; solid black lines correspond to fitted curves for time obtained from the GAM analysis. Figure 8.3 depicts main effects of Time*LogitABCD on scalp voltages and Figure 8.4 Time:LogitABCD:PhraseABCD (phrase) interactions. As in Figure 8.2, each panel represents an electrode: Fp1 and Fp2 are at the top of the plots (the front of the head) and O1, Oz, and O2 are at the bottom (the back of the head). At the top of each panel appears the name of the electrode it represents. The x-axis represents time (msec); the first vertical

Figure 8.3 Time*LogitABCD Main Effect in the 0–250 msec Time Window (one panel per electrode, labelled with the electrode name; time axis marked at 0, 50, 100, and 200 msec; a p-value appears at the bottom of each panel)


Figure 8.4 Time:LogitABCD:PhraseABCD (phrase) Interaction in the 0–250 msec Time Window (one panel per electrode, labelled with the electrode name; time axis marked at 0, 50, 100, and 200 msec; a p-value appears at the bottom of each panel)

dashed line represents the 50 msec time point and the following two broken lines the 100 and 200 msec time points. The y-axis represents LogitABCD values (log probability of occurrence) from very small at the bottom of the panel (≈ –6) to very high at the top of the panel (≈ 4). Scalp voltages in microvolts are represented by both little black contour lines and shades of grey. Voltage values are indicated on the contour lines and are also given via colour-coding: the lighter the grey, the more positive the voltage and, similarly, the darker the grey, the more negative the voltage. These voltage maps are very similar to topographic maps, where the height of a mountain or the depth of a valley is indicated by values appearing on the lines that form
them, and their steepness by the amount of space separating those lines (the closer the lines, the steeper the incline/decline). At the bottom of each panel p-values are provided; significant effects are marked by an asterisk (Bonferroni corrected significance threshold = 0.05/32 electrodes/7 time windows = 0.00022).

A significant Time*LogitABCD main effect (after Bonferroni correction) was found at electrode FC1 (0–250 msec time window; p = 0.00003), and significant Time*LogitABCD*PhraseABCD (phrase) interactions at electrodes P3 and P7 (0–250 msec time window; p = 0.00005 and 0.00007 respectively). We believe the effects found at these sites are real given that other electrodes in their vicinity also recorded the same electrical pattern (though they did not reach significance). We did not find any significant LogitABCD modulations on either the P2 or the slow anterior negativity. We discuss the electrophysiological results in the following section.

Early fronto-central negativity (N1a)

A significant Time*LogitABCD main effect was found in the 0–250 msec time window at electrode FC1. In order to interpret this effect, it is necessary to consider it in the context of its associated Time smooth (i.e. the black line in Figure 8.2, electrode FC1). Indeed, Figure 8.3 merely illustrates the manner in which LogitABCD (for phrases and non-phrases alike) modulates the electroencephalogram (EEG) in this time window, that is, that the N1 component is more positive for lower probability sequences and more negative for higher probability ones. To observe the actual N1, it is necessary to add the Time*LogitABCD curve to the Time curve (hence the term ‘additive’ in ‘generalized additive model’). This is shown in Figure 8.5. Note that the bottom panel of Figure 8.5 is merely intended to give an approximate representation of what the actual EEG at this time window looks like.
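
The additive step described above, in which the Time*LogitABCD surface is added to the Time smooth, can be sketched in Python. Both smooth functions below are invented stand-ins with made-up shapes and amplitudes; they are not the fitted GAM smooths from the study and serve only to show how the two components combine.

```python
import math


def time_smooth(t_msec: float) -> float:
    """Hypothetical main time trend (μV): a negative deflection near 120 msec."""
    return -2.0 * math.exp(-((t_msec - 120.0) / 40.0) ** 2)


def time_by_logit_smooth(t_msec: float, logit_abcd: float) -> float:
    """Hypothetical Time*LogitABCD surface (μV).

    The sign convention follows the text: more positive for lower
    probability sequences, more negative for higher probability ones.
    """
    return -0.2 * logit_abcd * math.exp(-((t_msec - 120.0) / 40.0) ** 2)


def predicted_voltage(t_msec: float, logit_abcd: float) -> float:
    """Additive prediction: Time smooth plus Time*LogitABCD surface."""
    return time_smooth(t_msec) + time_by_logit_smooth(t_msec, logit_abcd)


for logit_value in (-6.0, 0.0, 4.0):
    print(logit_value, round(predicted_voltage(120.0, logit_value), 2))
```

With these stand-in curves, the summed deflection near 120 msec grows more negative as LogitABCD increases, which is the pattern the chapter describes for the N1 at FC1.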
In Figure 8.5, it can be observed that the N1 component increases in amplitude as the probability of occurrence of a regular four-word sequence increases. Given that stimulus characteristics such as length are known to affect N1 amplitudes, it is possible that the modulations we observe in our data are attributable to Length rather than to LogitABCD. Note that we did not include Length from the start of the ERP analysis because this variable did not reach significance in the behavioural analysis (recall that we only considered those variables that significantly accounted for variability in the behavioural data, which did not include Length; see Note 2). We thus examined

Figure 8.5 Adding the Time*LogitABCD Smooth to the Time Smooth at Electrode FC1

Top panel: Time smooth (points: voltages averaged over items; line: fitted Time smooth); the x-axis is time in msec and the y-axis is voltage in μV (positive is plotted up). Middle panel: Time*LogitABCD smooth; the x-axis is time in msec; the y-axis is LogitABCD; the z-axis (contours and colors) is voltage in μV (lighter shades are positive and darker shades are negative). Bottom panel: The sum of the top two panels (only approximate).
whether the LogitABCD effect would survive the addition of Length to the model. We first took out from the total variance that portion explained by individual items, Time, Time*FreqC*PhraseABCD, Time*FreqD*PhraseABCD, Time*FreqABC, Time*FreqBCD, and Time*Length*PhraseABCD and then fitted another model on the remaining variance where Time*LogitABCD*PhraseABCD was entered as the only predictor. The correlation existing between Length and LogitABCD is very small (r = –0.1). It was thus probable that the LogitABCD effect would remain even after removing the variability due to Length. Neither the Time*Length main effect nor the Time:Length:PhraseABCD (phrase) interaction reached significance (p = 0.9072 and 0.0328 respectively). As expected, the Time*LogitABCD main effect survived the addition of Length to the model (p = 0.00001; αBonferroni = 0.00022).

Early parietal positivity (P1)

We now turn to the Time:LogitABCD:PhraseABCD (phrase) interaction found at electrodes P3 and P7 in the 0–250 msec time window. Figure 8.6 depicts how P1 amplitudes vary as a function of time and the probability of occurrence of phrasal four-word sequences (electrode P7 shown). Stimulus characteristics such as length are also known to affect the P1 deflection. We thus assessed whether the addition of Length to the model would remove the Time:LogitABCD:PhraseABCD (phrase) effect. Neither the Time*Length main effect (P3: p = 0.00085; P7: p = 0.961) nor the Time:Length:PhraseABCD (phrase) interaction (P3: p = 0.01293; P7: p = 0.755) reached significance. The Time:LogitABCD:PhraseABCD (phrase) interaction is robust to the addition of Length to the model at both electrodes (P3: p = 0.00016; P7: p = 0.00018; α = 0.00022). As mentioned in the introduction, the early P1 (peak ~98–110 msec) is believed to originate from the dorsal extra-striate cortex of the middle occipital gyrus and the late P1 (peak ~136–146 msec) from the ventral extra-striate cortex of the fusiform gyrus (Di Russo et al., 2002).
Functional magnetic resonance imaging (fMRI) and positron emission tomography (PET) studies have reported activation of this complex during word, object, and face presentations, which diminished with repeated presentations (Rossion, Schiltz, Robaye, Pirenne, & Crommelinck, 2001, and references cited therein). These observations are generally attributed to ‘better (or faster) performance at processing these stimuli, thus indicating the neural correlates of perceptual priming or implicit memory processing . . . In other words, these deactivations reflect a facilitation in neural computations when the same information is processed again’ (Rossion et al., 2001,

Figure 8.6 Adding the Time:LogitABCD:PhraseABCD (phrase) Smooth to the Time Smooth at Electrode P7

p. 1027). Given these findings, it is conceivable that the P1 amplitudes observed in the present study, which decrease as the probability of phrasal four-word sequences increases, reflect the level of entrenchment of at least some aspects of these items in the occipito-temporal pathway.
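
The significance decisions reported for the ERP effects rest on the Bonferroni-corrected threshold of 0.05/32 electrodes/7 time windows. A minimal Python sketch of that arithmetic follows; the function name and dictionary labels are hypothetical, while the p-values are those reported in the text for the 0–250 msec window.

```python
ALPHA = 0.05
N_ELECTRODES = 32
N_TIME_WINDOWS = 7


def bonferroni_threshold(alpha: float, n_electrodes: int, n_windows: int) -> float:
    """Divide the nominal alpha by the number of electrode-by-window tests."""
    return alpha / n_electrodes / n_windows


threshold = bonferroni_threshold(ALPHA, N_ELECTRODES, N_TIME_WINDOWS)

# p-values as reported in the text (0-250 msec time window).
reported_p = {
    "FC1 Time*LogitABCD": 0.00003,
    "P3 Time*LogitABCD*PhraseABCD": 0.00005,
    "P7 Time*LogitABCD*PhraseABCD": 0.00007,
    "P3 Time*Length": 0.00085,
    "P7 Time*Length": 0.961,
}

# An effect counts as significant only if its p-value is below the threshold.
significant = {name: p < threshold for name, p in reported_p.items()}

print(f"threshold = {threshold:.5f}")
for name, flag in significant.items():
    print(name, "significant" if flag else "not significant")
```

The check makes explicit why a p-value such as 0.00085 for Time*Length at P3 is treated as non-significant despite being well below the conventional 0.05 level.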
Conclusion

We investigated the processing of regular four-word sequences from both a behavioural and an electrophysiological perspective. The fact that whole-string probability as well as sequence-internal word and trigram frequency affected recall suggests that multi-word strings are stored both as parts and wholes. Furthermore, frequency/probability was found to modulate recall and event-related potentials in a continuous rather than categorical manner, thus indicating that lexical bundles and non-lexical bundles are best viewed as two extremes of a ‘whole-string frequency/probability’ continuum. It was unclear from the behavioural results whether the whole-string probability effect reflected fast computation or holistic retrieval. The electrophysiological results provided evidence to the effect that (at least some aspects of) four-word sequences are retrieved in a holistic manner (whatever the definition of the term holistic) rather than computed online via rule-like processes. Indeed, the fact that whole-string probability modulated P1 and N1 amplitudes ~110–150 msec after stimulus onset strongly supports this deduction. If the earliest frequency/probability effect on event-related potentials to single word processing is reported to be ~110–180 msec (e.g. Sereno et al., 1998; Penolazzi et al., 2007), it is most unlikely that four words can be accessed, let alone strung together, within this time frame. The results reported here strongly suggest that phrasal and non-phrasal four-word sequences leave memory traces in the brain; there would otherwise be no LogitABCD effect on brain activity ~110–180 msec after stimulus onset. These results are in line with usage-based accounts of grammar (e.g. Goldberg, 1995; Bod et al., 2003; Bybee & McClelland, 2005; McClelland & Bybee, 2007).
Author Note

This paper is supported by a Major Collaborative Research Initiative Grant (number 412-2001-1009) and a Doctoral Fellowship (number 7522006-1315) from the Social Sciences and Humanities Research Council of
Canada (SSHRC). We wish to thank Dr. Gary Libben, Dr. Bruce Derwing, Dr. Ruth Ann Atchley, Dr. Patrick Bolger, Dr. Jeremy Caplan, Dr. Kathryn Conklin, and several attendees from the 6th International Conference on the Mental Lexicon, held in Banff, 7–10 October 2008, the Canadian Linguistic Association conference, held at the University of British Columbia, 31 May–2 June 2008, as well as from the Formulaic Language Research Network Third International Postgraduate Conference, held in Nottingham, UK, 19–20 June 2008. Correspondence concerning this article should be addressed to Antoine Tremblay, Brain and Language Lab, Department of Neuroscience, Georgetown University, Building D, room 237, Box 571464, Washington, DC, 20057, USA. E-mail:
[email protected]; Web site: www.ualberta.ca/~antoinet
Notes

1. A peak voltage density under 0.035 reflects the fact that voltages are too widely distributed around the channel’s peak density (usually ~0 μV). In other words, the channel is too noisy. This threshold value was obtained by visually comparing EEG epochs and their voltage density plots.
2. Given that recalling an item and encoding it in short-term memory are two different cognitive processes (with some possible overlap), strictly speaking the ERP data should have been modelled independently; variables that did not significantly account for variability in the behavioural data might have done so in the ERP data.
References Baayen, R. H. (2008). Analyzing linguistic Data: A practical introduction to statistics using R. Cambridge: Cambridge University Press. Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. To appear in Journal of Memory and Language, special issue on Emerging Data Analysis Techniques. Baayen, R. H., Hendrix, P., & Tremblay, A. (2008). Generalized additive modeling: An application to event-related brain potential data from word naming and free recall tasks. Manuscript preparation. Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Grammar of spoken and written English. Harlow: Longman. Bod, R. (2001). Sentence Memory: Storage vs. Computation of Frequent Sentences. Abstract retrieved on October 29, 2006, from http://staff.science.uva.nl/~rens/ cuny2001.pdf. Bod, R., Scha, R., & Sima’an, K. (2003). Data-oriented parsing. Stanford, CA: Studies in Computational Linguistics.
172
Perspectives on Formulaic Language
Bybee, J. & McClelland, J. L. (2005). Alternatives to the combinatorial paradigm of linguistic theory based on domain general principles of human cognition. The Linguistic Review, 22, 381–410. Chomsky, N. (1988). Language and problems of knowledge. Cambridge, MA: MIT Press. Conklin, K., & Schmitt, N. (2008). Formulaic sequences: Are they processed more quickly than Nonformulaic language by native and nonnative speakers? Applied Linguistics, 29, 72–89. Dambacher, M., Kliegl, R., Hofmann, M., & Jacobs, A. M. (2006). Frequency and predictability effects on event-related potentials during reading. Brain Research, 1084, 89–103. DeLosh, E. L., & McDaniel, M. A. (1996). The role of order information in free recall: Application to the word-frequency effect. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22, 1136–46. Di Russo, F., Martinez, A., Sereno, M. I., Pitzalis, S., & Hillyard, S. A. (2002). Cortical sources of the early components of the visual evoked potential. Human Brain Mapping, 15, 95–111. Goldberg, A. (1995). Constructions: A construction grammar approach to argument structure. Chicago: The University of Chicago Press. Gratton, G., Coles, M. G. H., & Donchin, E. (1983). A new method for off-line removal of ocular artifact. Electroencephalography and Clinical Neurophysiology, 55, 468–84. Hennighausen, E., Heil, M., & Rosler, F. (1993). A correction method for DC drift artifacts. Electroencephalography and Clinical Neurophysiology, 86, 199–204. Hillyard, S. A., Teder-Sälejärvi, W. A., & Münte, T. F. (1998). Temporal dynamics of early perceptual processing. Current Opinion in Neurobiology, 8, 202–10. Jiang, N., & Nekrasova, T. M. (2007). The processing of formulaic sequences by second language speakers. The Modern Language Journal, 91, 433–45. Johnston, W. A., & Heinz, S. P. (1978). Flexibility and capacity demands of attention. Journal of Experimental Psychology, 107, 420–35. Kluender, R., & Kutas, M. (1993). 
Bridging the gap: Evidence from ERPs on the processing of unbounded dependencies. Journal of Cognitive Neuroscience, 5, 196–214. Luck, S. J. (2005). An introduction to the event-related potential technique. Cambridge, MA: MIT Press. McClelland, J. L., & Bybee, J. (2007). Gradience of gradience: A reply to Jackendoff. The Linguistic Review, 24, 437–55. Merritt, P. S., DeLosh, E. L., & McDaniel, M. A. (2006). Effects of word frequency on individual-item and serial order retention: Tests of the order-encoding view. Memory and Cognition, 34, 1615–27. Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. The Psychological Review, 63, 81–97. Miller, G. A., & Selfridge, J. A. (1950). Verbal context and the recall of meaningful material. The American Journal of Psychology, 63, 176–85. Oldfield, R. C. (1971). The assessment and analysis of handedness: The Edinburgh inventory. Neuropsychologica, 9, 97–113. Penolazzi, B., Hauk, O., & Pulvermüller, F. (2007). Early semantic context integration and lexical access as revealed by event-related brain potentials. Biological Psychology, 74, 373–88.
Rossion, B., Schiltz, C., Robaye, L., Pirenne, D., & Crommelinck, M. (2001). How does the brain discriminate familiar and unfamiliar faces? A PET study of face categorical perception. Journal of Cognitive Neuroscience, 13, 1019–34. Salthouse, T. A., & Babcock, R. L. (1991). Decomposing adult age differences in working memory. Developmental Psychology, 27, 763–76. Schmitt, N., & Underwood, G. (2004). Exploring the processing of formulaic sequences through a self-paced reading task. In N. Schmitt (Ed.), Formulaic sequences (pp. 171–89). Amsterdam and Philadelphia: John Benjamins. Sereno, S. C., Rayner, K., & Posner, M. I. (1998). Establishing the time-line of word recognition: Evidence from eye movements and event-related potentials. NeuroReport, 9, 2195–2200. Tremblay, A., Derwing, B., Libben, G., & Westbury, C. (in press). Processing advantages of lexical bundles: Evidence from self-paced reading and sentence recall tasks. Language Learning, 61:3. Tse, C.-S., & Altarriba, J. (2007). Testing the associative-link hypothesis in immediate serial recall: Evidence from word frequency and word imageability effects. Memory, 15, 675–90. Tulving, E., & Patkau, J. E. (1962). Concurrent effects of contextual constraint and word frequency on immediate recall and learning of verbal material. Canadian Journal of Psychology, 16, 83–95. Underwood, G., Schmitt, N., & Galpin, A. (2004). The eyes have it: An eye-movement study into the processing of formulaic sequences. In N. Schmitt (Ed.), Formulaic sequences (pp. 153–72). Amsterdam and Philadelphia: John Benjamins. Wlotko, E. W., & Federmeier, K. D. (2007). Finding the right word: Hemispheric asymmetries in the use of sentence context information. Neuropsychologia, 45, 3001–3014. Wood, S. N. (2006). Generalized additive models: An introduction with R. Boca Raton, FL: Chapman and Hall/CRC.
Chapter 9
The Phonology of Formulaic Sequences: A Review
Phoebe Ming Sum Lin
University of Nottingham
It’s not what you say, it’s the way you say it.
Anonymous
In the past decade, many corpus-based studies have revealed the use of formulaic sequences in spontaneous speech (e.g. Altenberg & Eeg-Olofsson, 1990; Altenberg, 1998; Biber, 2006; Biber & Barbieri, 2007; De Cock, 1998, 2000, 2007). A very popular approach taken by these studies is to examine the most frequent lexical bundles (Biber, Johansson, Leech, Conrad, & Finegan, 1999) extracted from a spoken corpus using automatic extraction tools like WordSmith 4.0 (Scott, 2004). While this approach has yielded many important findings, it has masked the essential properties that distinguish spoken formulaic sequences from written formulaic sequences. In fact, with formulaic sequences in spontaneous speech, much of the information crucial for interpreting their meanings is encoded in speech context and prosody. It is not only what is said that counts but also how it is said (see Crystal, 2003). It follows that an investigation of spoken formulaic sequences should go a step beyond listing spoken lexical bundles as if they were merely textual in nature, and should also address the prosodic aspects which give colour to these spoken sequences. This paper presents a summary of the prosodic characteristics of formulaic sequences in speech. Describing the prosodic patterns of formulaic sequences is a complex task because prosody itself is conditioned by many aspects of context, including neighbouring words, syntactic and lexical structure, semantic focus and emphasis, discourse factors and so on. That is why there was a tendency for prosody research in the past to be done in an experimental setting with participants reading aloud scripted sentences. In this way, the contextual factors mentioned above are easily controlled so that the target
variables can be isolated and examined independently. However, prosody of read aloud texts cannot be treated as a more ‘purified’ version of spontaneous speech with the extraneous variables controlled because, as Chafe (2006) shows, their prosodies are two different varieties. While some researchers may see the contextual variables in spontaneous speech samples as extraneous, they actually reflect the speakers’ real time choices of words and utterances when they are under pressure to process speech online. The way the speakers chunk the utterances, stress individual words, repeat and hesitate are all indicative of the mental processes involved. That is to say, an analysis of the phonological aspects of formulaic sequences may be a window for us to see the processing of formulaic sequences from a different perspective. This paper provides an overview of the relevant research on the phonological features of formulaic sequences. First, we look at what is said in the literature about the phonological features of formulaic sequences in spontaneous speech, which can be summed up as phonological coherence. Then, we consider the psycholinguistic underpinnings of these phonological properties. In the next section, we discuss the implications of phonological coherence for different areas of research, and finally, we look at some recent empirical studies on the phonology of formulaic sequences in spontaneous adult native speaker speech.
Aspects of the Phonology of Formulaic Sequences
While it is widely accepted that formulaic sequences demonstrate some degree of fixedness in lexical form (Zgusta, 1967; Peters, 1983; Baker & McCarthy, 1988; Hickey, 1993; Weinert, 1995; Wray & Namba, 2003), we can perhaps imagine there is also a certain level of fixedness in the phonological form of formulaic sequences, which can be illustrated using Ashby’s (2006) examples. While it is normal to say (1a) and (2a), the variants (1b) and (2b) destroy the idiomatic meaning of the idioms or, as Ashby says, produce humorous effects.
(1a) She has eyes in the back of her HEAD.
(1b) She has eyes in the BACK of her head.
(2a) They’re ROLLing in money.
(2b) They’re rolling in MONEY.
This fixedness extends beyond stress placement, which is obvious in these two examples, to intonation and tempo. In fact, discussion on these two
aspects has a longer history than the stress placement aspect because the first mentions of intonation and tempo can be traced back to the 1970s in child language literature (e.g. Bloom, 1973; Scollon, 1974; Dore, 1976; Rodgon, 1976; Peters, 1977, 1983). Peters observes that children are able to produce utterances that are above their current expected level of grammatical competence. According to Peters (1983), these utterances are ‘always produced fluently as a unit with an unbroken intonation contour and no hesitations for encoding’ (p. 8). This is the first use of the term phonological coherence in the literature. There are other descriptions of the phonological characteristics elsewhere, for example, these utterances have ‘a “melody” unique enough so that [they] can be recognized even if rather badly mumbled’ (Peters, 1977, p. 563), and they are articulated fluently but imprecisely by children (Plunkett, 1990). These phonological features provide clues based on which researchers can distinguish utterances that have been learned as a chunk from those that are produced by saying single words in succession. Bloom (1973) traced the emergence of grammar in her daughter’s speech from when the child said her first words at nine months to when she uttered sentences at 22 months. Analysing diary records and video recordings of the child, Bloom observes that words said in succession as single word utterances have ‘terminal falling pitch contour, and relatively equal stress, and there was a variable but distinct pause between them so that utterance boundaries were clearly marked’ (1973, p. 41). Based on similar observations, Peters (1983) includes phonological coherence as one of her list of six criteria for identifying formulaic sequences in child language.1 This list of identification criteria is advanced by Hickey (1993) who expands the number of criteria to nine and introduces the distinction between the typical and the necessary criteria. 
Only two out of the nine criteria are categorized as necessary (i.e. the satisfaction of the criteria is essential), and phonological coherence is one of them. This shows the importance of the phonological criterion in detecting formulaicity in child language. In fact, Plunkett’s (1990) success in applying his articulatory/fluency criteria to identify formulaic expressions in the longitudinal data from two Danish children demonstrates how phonological coherence can be used as the sole identification criterion. Researchers of formulaic sequences in adult language are quick to spot the potential benefits of the finding that formulaic sequences demonstrate unique phonological properties in child language. This finding on child language, combined with the report of Dechert (1983) and Raupach (1984) that distinctively fluent stretches of speech are found enclosed by pauses and/or falling tones, leads some researchers to speculate on the use of phonological coherence to identify formulaic sequences in adult native
speaker’s speech. Gradually, we see proposals in the literature like ‘If a formulaic string is treated as a single, holistic unit, it ought to be relatively resistant to internal dysfluency and inaccuracy . . . we can make the prediction that there would be far fewer pauses and errors within formulaic strings than between them’ (Wray, 2004, p. 260), ‘multiword items often form single tone units’ (Moon, 1997, p. 44) and ‘MWUs may be tested by examining whether or not they are amenable to crossing the boundaries of tone-units. [...] MWUs up to clause-length normally occupy one tone-unit and only one’ (Baker & McCarthy, 1988, p. 14). However, while there is empirical evidence to support the use of phonological coherence to identify formulaic sequences in child language, researchers have yet to test whether the phonological criteria work in adult native speaker speech, and whether they are essential to the identification. Until empirical evidence is found for adult native speaker speech, these suggestions remain theoretical. Whether it is valid to apply the phonological criteria directly in the identification of formulaic sequences in adult spoken language is an issue we will return to later in the chapter. It is worth noting that this borrowing of ideas from child language research might have overlooked the phonological aspects not mentioned in the original child language literature. A closer look at the phonological properties above reveals that they mainly concern two aspects of phonology, that is, intonation (i.e. unbroken contour and single tone units) and tempo (i.e. resistance to internal dysfluency, no hesitations, few pauses), and leave out the phonemic aspect and the stress placement of formulaic sequences. However, recent studies by Ashby (2006), Bybee (2000, 2001, 2002) and others fill this research gap.
On the phonemic aspect, empirical studies show that high frequency words and phrases undergo phonetic reduction at a faster rate than low and mid frequency sequences (Bybee, 2002, 2006). These include Hooper’s (1976) study on English schwa deletion as in memory versus mammary, Bybee’s studies on American English t/d deletion (2000) and Spanish intervocalic d-deletion (2001), Bybee and Scheibman’s (1999) study on the reduction of don’t in sequences like I don’t know, I don’t think, I don’t have, and Bush’s (2001) study on palatalization in sequences like did you, don’t you, would you and that you. These studies, therefore, lead researchers to believe that formulaic sequences, which tend to have relatively high frequency, experience phonetic reduction more readily than non-formulaic language, which tends to have relatively low frequency. On the aspect of stress placement, Ashby (2006) studies the accentual patterns of idioms and suggests that idioms can be put into three classes depending on the type of fixedness of their accentual patterns.
His Case (i) idioms have an accentual pattern that is the same as the least-marked literal version, e.g. to have a CHIP on one’s shoulder compared with to have a BEE on one’s shoulder. Case (ii) idioms have an accentual pattern different from the corresponding literal expression, e.g. POUR down (idiomatic), pour DOWN (literal), be ROLLing in money (idiomatic), be rolling in MONEY (literal). Case (iii) idioms have a very strict accentual pattern and even tone choice is highly constrained, e.g. I could eat a \HORSE (falling tone) instead of I could eat a \/HORSE (falling-rising tone). After analysing sound-play jokes involving idioms such as My EARS are burning, Ashby suggests that they work by inviting listeners to look for an interpretation other than the obvious. This is achieved prosodically by imposing a narrow focus distinction on the non-compositional part of the idioms. This finding highlights the part played by narrow focus distinction in the phonological fixedness of formulaic sequences.
Psycholinguistic Underpinnings of Phonological Coherence
So far we have reviewed the literature that discusses the phonological features of formulaic sequences. These phonological features, which are manifest in the intonational, temporal, phonemic, and stress aspects, seem to be particular to formulaic sequences. This leads us to speculate that the phonological form of formulaic sequences may be relatively fixed in the same way as the lexical form may be. To explain the formal fixedness of formulaic sequences, the most prominent and popular theory by far is the theory of holistic storage.
The theory of holistic storage
The idea that formulaic sequences are stored and retrieved as unanalysed wholes in the mental lexicon is very powerful and convincing. It explains the phenomena of multiword utterances in child speech (e.g. Peters, 1983; Plunkett, 1990), fossilized errors (Myles, Mitchell, & Hooper, 1999) and distinctively fluent chunks (Dechert, 1983; Raupach, 1984) in language learners’ speech, and ‘stereotyped, conventional utterances’ (Van Lancker & Canter, 1981, p. 64) in some aphasic patients. In adult native speaker speech, what lends support for holistic storage and retrieval are semantic non-compositionality (i.e. the meaning of the whole does not equal the sum of the meanings of the component words) in idiomatic expressions (e.g. spill the beans, kick the bucket and at the end
of the day), cranberry collocations (e.g. to and fro and put the kibosh on from Moon, 1998), and syntactically irregular structures which may be passed down from archaic English (e.g. here comes X, attached please find X and believe you me). The theory of holistic storage is also in line with the assumption that formulaic sequences should form single intonation units, have fewer internal dysfluencies such as hesitations and pauses, be uttered faster than rule-based language, and require specific accentual patterns or focus distinction. While holistic storage is the most popular theory, it is only one among many. There are alternative theories which explain other aspects of the phonological properties of formulaic sequences better than holistic storage. Therefore, it is important that we recognize that the phonology of formulaic sequences is best explained by combining the perspectives of different theories. The theory of holistic storage is the most widely accepted explanation for the behaviour of formulaic sequences, which includes formal fixedness, semantic non-compositionality and situation-dependence (i.e. form-function pairing). As far as the phonological aspect is concerned, this theory is especially persuasive when we consider the case of formulaic sequences in child language. When a child says an utterance with a syntactic structure clearly beyond the developmental stage of the child, and the utterance demonstrates phonological coherence (including distinctive intonation (Peters, 1977), an unbroken intonation contour (Peters, 1983) and articulatory fluency (Plunkett, 1990)), it is highly probable that the whole sequence has been learned as an unanalysed chunk and stored as a holistic unit in the child’s mental lexicon. That is to say, the theory of holistic storage explains phonological coherence, but at the same time, the observation of phonological coherence also strengthens the credibility of the theory.
The interdependence that exists between the theory of holistic storage and phonological coherence seems logically circular when we point it out, but this is really the logic underlying many researchers’ approach to the problem. Some researchers (e.g. Moon, 1997) predict that formulaic sequences are phonologically coherent because it is believed that they are stored as holistic units, and there are others (e.g. Peters, 1977; Wray, 2004) who infer the holistic storage of formulaic sequences on the basis of phonological coherence. It is difficult to say which of these two approaches is more justifiable or viable than the other. It is perhaps useful to point out that phonological coherence is more of a fact because it is measurable. Holistic storage, however, is more of a claim because its existence is inferred from facts. On this foundation, we need to evaluate critically, case by case, whether the evidence is strong enough for us to infer holistic storage from
phonological evidence. We have seen that it is reasonable to equate phonologically coherent stretches of talk with formulaic sequences in the case of child language because the young child’s speech production appears less complex. In the situation of adult speech, however, to equate phonological coherence with formulaic sequences is debatable because, given the maturity of their linguistic system and the capacity of their memory, adults are capable of producing a long stretch of speech within the span of a single phonologically coherent unit. This long stretch of speech can be made up of a chain of one or several formulaic sequences linked together by rule-based language. In other words, a phonologically coherent unit can be equivalent to one or more formulaic sequences combined with other single-word items. Ultimately, we are in a dilemma over whether to equate a phonologically coherent unit with a formulaic sequence. The equation works well in child language, and maybe also in EFL learners’ speech, but it is still unknown whether it works in adult native speaker speech.
Chunking and the theory of holistic processing
This issue leads us to consider the idea of chunks and chunking in the psycholinguistics literature, which provides a sound alternative to the theory of holistic storage. In the psycholinguistics literature, a chunk is a unit of language processing. A long sentence has to be processed in chunks because of the limitation in the amount of information our short term memory can handle at one time, and this applies to the speaker who has to plan his speech online as well as to the hearer who decodes the speech simultaneously. According to Chafe (1987), an intonation unit represents ‘a single focus of consciousness’ (1987, p. 32, see above), which is exactly the meaning of a chunk. This is why a chunk and an intonation unit are believed to be equivalent, and both of them are a representation of a processing unit in the brain.
It is important that we note the distinction between holistic storage and holistic processing. Arguably, a unit of storage is more precise than a unit of processing. A unit of storage may be the same as or smaller than a unit of processing. This idea can be illustrated with the help of a metaphor. We can imagine if a colleague brings in a stack of papers which you have requested. If he delivers the whole stack of papers in one go, the whole stack is a processing unit. In the stack, there are journal articles which have been stapled together but there are also loose single sheets. Each detached item, be it a sheet or a stapled collection of sheets, is a storage unit. When we look at the stack of paper stacked together, we know that there are stapled sheets
and loose sheets, but we cannot tell just by looking at the stack of paper whether a particular sheet of paper has been stapled with other sheets or not. We have to be satisfied with ‘a stack’ as the most precise unit achievable with the information available. If we focus on holistic processing, we are just concerned with the fact that the stack is delivered in one go. This is the same with the idea of the intonation unit. What an intonation unit reveals is a unit of processing, and the fact that a sequence of words is processed as a unit. It does not reveal a unit of storage (and I argue that in reality no psycholinguistic method can ever do that). By reverting to holistic processing, we can accommodate the case in which, in adult native speaker speech, an intonation unit or a phonologically coherent unit does not map exactly onto a formulaic sequence. That can therefore reconcile the divided acceptability of equating formulaic sequences with phonologically coherent units in child language and in adult native speaker speech, which was discussed earlier. With regard to the level of evidence phonological coherence provides, we can be certain that phonological coherence indicates holistic processing. There is the possibility that a unit of storage may span a whole unit of processing, but a unit of storage can also be smaller than a unit of processing; empirical research is needed to determine the distribution of these two cases (see Lin and Adolphs, 2009).
The frequency-based approach
Apart from holistic storage and holistic processing, the frequency-based approach to language also sheds light on the psycholinguistic underpinnings of the phonological features of formulaic sequences, especially concerning phonological reduction (see Bybee, 2002). The frequency-based approach argues that articulation of speech is made up of neuromotor routines whose process can be smoothed by frequent practice (Bybee, 2002).
The more a series of routines is practiced, the more fluent the articulation becomes. Formulaic sequences, which are supposed to be highly frequent in everyday speech, should therefore demonstrate phonological reduction more readily than non-formulaic language. This explains why researchers expect formulaic sequences to have a higher articulation/speech rate and markedly fewer dysfluency phenomena like pauses and errors. In fact, this assumption finds support from Dechert’s (1983) and Raupach’s (1984) observations that second language learners can utter distinctively smooth and fluent stretches that seem to be formulaic amid their other dysfluent productions.
So far we have looked at the theory of holistic storage, the concept of chunking and the frequency-based approach to language. We discussed how the phonological features of formulaic sequences can be seen from the perspectives of these three areas. As we can see, some phonological features are explained better by one theory than by the others (e.g. the frequency-based approach is particularly suited to account for the phonemic aspect of phonological coherence). For this reason, it is important that we combine the strengths of each theory in order to develop a deeper understanding of the phonology of formulaic sequences.
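The contrast drawn above between a unit of processing and a unit of storage can be illustrated with a minimal sketch. It assumes, purely for illustration, that a transcript has already been segmented into intonation units and that a small lexicon of candidate formulaic sequences is available; the function names, the lexicon and the sample units are hypothetical and are not taken from any of the studies cited. Each intonation unit is sorted into one of three cases: it coincides with a single candidate sequence, it contains one or more candidate sequences alongside other words, or it contains none.

```python
from collections import Counter

def find_sequences(unit, lexicon):
    """Return (start, end) spans of lexicon sequences found in one intonation unit.

    Scans left to right and prefers the longest match at each position.
    """
    spans = []
    i = 0
    while i < len(unit):
        match_len = 0
        for seq in lexicon:
            n = len(seq)
            if n > match_len and tuple(unit[i:i + n]) == seq:
                match_len = n
        if match_len:
            spans.append((i, i + match_len))
            i += match_len
        else:
            i += 1
    return spans

def classify_unit(unit, lexicon):
    """Label how an intonation unit relates to the candidate sequences it contains."""
    spans = find_sequences(unit, lexicon)
    if not spans:
        return "no_sequence"
    covered = sum(end - start for start, end in spans)
    if len(spans) == 1 and covered == len(unit):
        return "unit_equals_sequence"       # processing unit coincides with one sequence
    return "unit_larger_than_sequence"      # sequence(s) embedded in a longer unit

# Hypothetical lexicon and intonation units, for illustration only.
LEXICON = {
    ("you", "know"),
    ("i", "mean"),
    ("at", "the", "end", "of", "the", "day"),
}
UNITS = [
    ["at", "the", "end", "of", "the", "day"],
    ["i", "mean", "it", "rains", "a", "lot"],
    ["she", "walked", "slowly"],
]

distribution = Counter(classify_unit(u, LEXICON) for u in UNITS)
print(distribution)
```

A tally of this kind says nothing about storage itself; it only describes how often a processing unit and a candidate sequence happen to coincide in a given sample.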
Implications of the Phonology of Formulaic Sequences
Having looked at the psycholinguistic underpinnings of the phonology of formulaic sequences, we can now consider why it matters whether formulaic sequences are phonologically coherent. There are many ways in which researchers, teachers and learners can benefit from a greater understanding of phonological coherence in formulaic sequences, but our focus here concerns three areas in particular. First, as suggested by Baker and McCarthy (1988), Moon (1997) and Wray (2004), phonological coherence may help us identify formulaic sequences in adult speech. Secondly, EFL learners may be able to improve their perceived speech fluency if they also attend to the intonation patterns specific to formulaic sequences (see Wennerstrom, 2001, 2006). Thirdly, researchers of natural language processing (NLP) may be able to make use of this model of the phonology of formulaic sequences to enhance robotic speech and to make it sound even closer to human spontaneous speech. Identifying formulaic sequences in spontaneous speech has long been a problem in the field of formulaic language research. While it is easy for researchers to give examples of prototypical formulaic sequences, it is difficult to extract formulaic sequences from a corpus which may have as many as a million words. The most popular corpus method nowadays is to use automatic extraction tools to search for highly recurrent word combinations, or lexical bundles (Biber et al., 1999). However, many researchers, such as De Cock (1998), have noted that amid the prototypical multiword expressions like I think, in fact and and things like that, sequences like it was it was, to to and it you, which may not appear to be psycholinguistically valid, are also retrieved in the corpus results. There are only very few ways of examining the psycholinguistic validity of formulaic sequences,
including eye-tracking (Underwood et al., 2004), self-paced reading (Schmitt & Underwood, 2004; Conklin & Schmitt, 2007), reaction-time experiments (Jiang & Nekrasova, 2007) and so on. All these methods have to be used in an experimental setting using prompts carefully designed in advance to control for intervening variables, and so far they only focus on written formulaic sequences. That said, if phonological coherence can really help researchers examine the psycholinguistic validity of potential formulaic sequences, it will provide a unique window into the processing of spoken formulaic sequences in spontaneous speech. The way this works is, for instance, if lexical bundles extracted from a spoken corpus demonstrate features like a complete alignment with intonation units, lower tendency to contain dysfluency phenomena or errors, or phonological reduction, we can be more certain that the lexical bundles are formulaic. The phonological criteria have to be applied in conjunction with another criterion such as frequency of occurrence because intonation units cannot be equated with formulaic sequences in adult speech as they can in child speech. Another area which may benefit from a comprehensive investigation of phonological coherence is second language learning. It has long been suggested in the literature that the use of formulaic language is one of the keys to speech fluency in native speakers and language learners alike. The idea is that while some learners of English have to pause to plan their speech online, native speakers would make use of formulaic sequences as fillers (e.g. I mean, you know or you know what I mean) to buy time for online planning (Skehan, 1998). This way native speakers can maintain what Fillmore (1979) calls disc jockey fluency – the ability to keep the speech flow going. It seems that the focus of this idea is on the production of formulaic sequences and it does not concern how formulaic sequences are uttered. 
However, it is exactly this aspect of phonological coherence that is also important to speech fluency. To illustrate this point with Ashby’s (2006) idiom examples again, if the intended utterance is (3a) but a learner says (3b) or (3c), it hardly gives an impression of fluency in Lennon’s (2000) broader sense of the term. Wennerstrom (2001, 2006) makes a similar point about the important role of intonation in second language fluency. It is clear that the influence of intonation on perceived fluency affects not only idioms like (3a) but also other forms of conventionalized language and even general spoken English. For instance, if a speaker intends to say ‘it rains a LOT in the UK’ but actually says ‘it rains A lot in the UK’, he/she does not come across as a fluent speaker either.2 This relationship between meaning and accurate corresponding stress placement, which is what
Wennerstrom (2006) called intonational meaning, seems to be even more rigid with idioms than with general spoken English, and this point is not immediately obvious to learners of English. Wennerstrom (2006) also suggests that a more holistic style with less self-monitoring and a greater reliance on routinized language chunks would be characterized by longer intonational phrases, which would in turn help learners who struggle with word-by-word speech to come across as more fluent speakers. That is why a language syllabus that aims to promote nativelike proficiency and fluency should encourage not only a greater use of formulaic sequences but also an awareness of the way formulaic sequences should be said.
(3a) It was raining cats and DOGS.
(3b) It was raining cats AND dogs.
(3c) It was raining CATS and dogs.
The final area which may benefit from a comprehensive study of phonological coherence is natural language processing (NLP) research which models human speech for developing applications in automated speech synthesis. While the important role of intonational meaning (Wennerstrom, 2006) applies to this area as well, another challenge in this area is to predict the assignment of tempo and pauses, and where an intonation contour begins and ends. In text-to-speech technology, researchers rely mainly on syntactic structure and punctuation to assign temporal and intonation features, but the results are still far from ideal, as research by Altenberg (1990a, 1990b) and Knowles and Lawrence (1987) suggests (see below). Chafe’s (1988) study also implies that punctuation is at best a rough guide to temporal or intonation units. It does not predict satisfactorily where an intonation unit should begin and end because a punctuation unit (i.e. one punctuation mark every 8.9 words) is almost twice as long as an intonation unit (i.e. one intonation unit every 5–6 words).
As various corpus-based research shows, recurrent contiguous word combinations take up 58.6 per cent (Erman & Warren, 2000) to 80 per cent (Altenberg, 1998) of spoken language (depending on the selection criteria, corpus and method of calculation). This high proportion of recurrent sequences, in which formulaic sequences are found, means that if there is a model that describes the phonology of formulaic sequences, a significant proportion of the task to assign phonological features to computer generated language can be tackled. Over the decades, there have been at least two attempts in the linguistic literature to make computer generated voice more human-like in text-
to-speech synthesis. This is achieved by predicting intonation unit boundaries and tempo based on syntactic parsing. Inspired by Crystal (1975), Altenberg (1990a, 1990b) took the top-down approach and based his prediction of intonation unit boundaries predominantly on grammatical units. This approach has achieved an impressive 93 per cent success rate in predicting boundaries at the between-clauses level but it fails to predict intonation unit boundaries below the between-clauses level including the clause level (i.e. between clause elements) and phrase level (i.e. between phrase constituents). Knowles and Lawrence (1987), however, took the bottom-up approach by starting at the word level and working out when the boundary between each word has to be removed. For instance, the boundary between a ‘weak’ grammatical word and a lexical word as in the man has to be removed. This is the same with combinations like adjective + noun (old man) or verb + adverb (walked slowly). Like the top-down approach, this bottom-up system of Knowles and Lawrence (1987) also relies on syntactic parsing. That is perhaps because that might well have been the only relatively mature automatic language analysis tool in the late 1980s to early 1990s, and automatic speech synthesis has to be developed on the basis of existing automatic technology available. But the benefits of incorporating insights from formulaic language research are that: (1) formulaic sequences account for at least half of spoken English; (2) formulaic language is a system independent of syntax/grammar so it does not need to be founded on syntactic parsing; and (3) formulaic sequences typically concern the more subtle clause level and phrase level which Altenberg’s (1990a, 1990b) project failed to predict. 
Future models of speech synthesis in NLP could therefore be enhanced by adding the concept of formulaic sequences to the current top-down and bottom-up approaches when predicting the assignment of intonation unit boundaries.
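The bottom-up procedure described for Knowles and Lawrence (1987) can be illustrated with a minimal Python sketch. It assumes the input has already been part-of-speech tagged. The coarse tag names and the three boundary-removal rules are illustrative assumptions based only on the examples given in this section (the man, old man, walked slowly); they are not a reconstruction of the original system.

```python
# Hypothetical coarse tag sets; the tag inventory of the original system is not given here.
WEAK_GRAMMATICAL = {"DET", "PREP", "PRON", "AUX", "CONJ"}
LEXICAL = {"NOUN", "VERB", "ADJ", "ADV"}


def remove_boundary(left_tag, right_tag):
    """Return True if the boundary between two adjacent words should be removed."""
    if left_tag in WEAK_GRAMMATICAL and right_tag in LEXICAL:
        return True  # weak grammatical word + lexical word, e.g. "the man"
    if left_tag == "ADJ" and right_tag == "NOUN":
        return True  # adjective + noun, e.g. "old man"
    if left_tag == "VERB" and right_tag == "ADV":
        return True  # verb + adverb, e.g. "walked slowly"
    return False


def chunk(tagged_words):
    """Start with a boundary after every word, then merge words whose boundary is removed."""
    chunks = []
    for i, (word, tag) in enumerate(tagged_words):
        if i > 0 and remove_boundary(tagged_words[i - 1][1], tag):
            chunks[-1].append(word)  # boundary removed: extend the current unit
        else:
            chunks.append([word])    # boundary kept: start a new unit
    return [" ".join(c) for c in chunks]


example = [("the", "DET"), ("old", "ADJ"), ("man", "NOUN"),
           ("walked", "VERB"), ("slowly", "ADV")]
print(chunk(example))
```

A formulaic-sequence component of the kind proposed above could, under the same assumptions, be added as one further rule that removes boundaries inside any word string found in an inventory of formulaic sequences.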
Some Recent Empirical Work on the Phonology of Formulaic Sequences in Adult Native Speaker Speech
In the previous sections we have discussed the origin of the idea of phonological coherence and what it entails. With particular reference to phonological coherence in adult native speaker speech, we have examined the different psycholinguistic underpinnings of the phonology of formulaic sequences. This final section reviews the few recent empirical studies that specifically address the phonology of formulaic sequences in adult native speaker speech.
The Van Lancker, Canter & Terbeek (1981) study
In the literature, Van Lancker, Canter & Terbeek’s (1981) study of the phonological cues that help hearers determine whether a speaker intends the idiomatic or the literal meaning of an idiom has often been cited as providing ‘the empirical evidence’ supporting the use of phonological criteria to identify formulaic sequences in adult spontaneous speech. The five phonological features that Van Lancker et al. (1981) observe have since been taken as a sketch of the phonological profile of formulaic sequences.3 However, this appears to be a mistaken interpretation of the study’s findings, because Van Lancker et al. made no attempt to extend their findings beyond the context they focused on (i.e. reading aloud of idioms in an experimental setting) to cover all types of formulaic sequences in the spontaneous speech of adults. In fact, the researchers are also clear that hearers could not discern with a statistically significant level of accuracy whether the idiomatic or the literal meaning was intended unless the speakers were asked to exaggerate, in their own ways, the prosodic differences between the two readings. So instead of accepting that the five observed phonological properties (which result from the speakers’ exaggeration) constitute a simple phonological profile of formulaic sequences, Ashby (2006) offers an alternative interpretation of Van Lancker et al.’s results. His idea, as mentioned earlier, is that there is a certain restriction on introducing focus distinctions to the non-compositional part of idioms. If this restriction is tampered with, it serves as a hint to the hearers to look for an interpretation other than the obvious one.
In other words, what Van Lancker et al.’s (1981) study shows is the different strategies that speakers can employ to tamper with the restriction on focus distinction, and this restriction concerns a small class of semantically non-compositional idioms in the read-aloud experimental condition. Empirical studies that begin to investigate broadly defined idioms (i.e. prefabricated language that appears semantically transparent) in spontaneous adult native speaker speech are Dahlmann (2009), Erman (2006, 2007), Lin (2010) and Lin and Adolphs (2009).
The Dahlmann (2009) study
Dahlmann (2009) studies the distribution of pauses around automatically extracted multiword units in her two corpora, the English Native Speaker Interview Corpus (ENSIC) for adult native speaker speech and the Nottingham International Corpus of Learner English (NICLE) for second/foreign language learner speech. Drawing on the findings of Raupach (1984) and Wray (2004), she assumes that pauses should not normally be found within multiword units. This assumption forms the main basis of her phonological criterion for developing an inventory of spoken formulaic sequences, intended especially for second/foreign language learners. She extracted lists of multiword units of various lengths automatically from her two corpora and then used the phonological criterion as a filter to select psycholinguistically valid formulaic sequences for inclusion in the inventory.
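The idea of using pauses as a filter on automatically extracted multiword units can be sketched in Python as follows. This is only an illustration of the general principle, not Dahlmann’s (2009) actual procedure: the pause marker, the tokenised transcript format and the 20 per cent tolerance threshold are all hypothetical.

```python
PAUSE = "<pause>"  # hypothetical pause marker in a tokenised transcript


def occurrences(tokens, unit):
    """For each occurrence of `unit`, yield True if a pause interrupts it, else False."""
    n = len(unit)
    for start, tok in enumerate(tokens):
        if tok != unit[0]:
            continue
        matched, interrupted, i = 1, False, start + 1
        while matched < n and i < len(tokens):
            if tokens[i] == PAUSE:
                interrupted = True      # pause found inside the candidate unit
            elif tokens[i] == unit[matched]:
                matched += 1            # next word of the unit found
            else:
                break                   # some other word intervenes: not an occurrence
            i += 1
        if matched == n:
            yield interrupted


def passes_pause_filter(tokens, unit, max_interrupted=0.2):
    """Keep a candidate unit if at most `max_interrupted` of its occurrences contain a pause."""
    hits = list(occurrences(tokens, unit))
    if not hits:
        return False                    # unit never occurs in this transcript
    return sum(hits) / len(hits) <= max_interrupted


transcript = ["you", "know", PAUSE, "i", "mean", "you", "know", PAUSE,
              "at", PAUSE, "the", "end"]
print(passes_pause_filter(transcript, ("you", "know")))
print(passes_pause_filter(transcript, ("at", "the", "end")))
```

The threshold is a parameter because the assumption is that pauses should not normally, rather than never, occur within a multiword unit.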
The Erman (2006, 2007) studies
Erman (2006, 2007) is interested in the distribution of pauses in spontaneous speech collected in the London-Lund Corpus (LLC) and the Bergen Corpus of London Teenager Language (COLT). Her perspective is that pauses reflect the cognitive processes of speech production and that the duration of some types of pauses reflects differences in cognitive processing effort. She therefore compares the duration of pauses that come before prefabricated language (or prefabs) with the duration of those that come before non-prefabricated, creative language. Her results show that the retrieval of prefabs demands less cognitive processing effort than the retrieval of non-prefabricated language, as manifested in pause duration. Her study also investigates whether adolescent and adult speakers differ in the distribution of pauses over prefabs and non-prefabricated language. Although the results in this respect are not conclusive, there is some indication that the adolescent speakers are more uncertain about their prefabs, as indicated by the proportionally higher frequency of pauses within prefabs.
The Lin and Adolphs (2009) and the Lin (2010) studies
Unlike Dahlmann (2009) and Erman (2006, 2007), a recent study by Lin and Adolphs (2009) is interested in the intonation of formulaic sequences in the spontaneous speech of adult learners of English. Following suggestions in the literature (Baker & McCarthy, 1988; Moon, 1997), the study tests the assumption that formulaic sequences form single intonation units. Using WordSmith 4.0 (Scott, 2004), 56 instances of the five-word lexical bundle I don’t know why were extracted from the Nottingham International Corpus of Learner English (NICLE). The intonation features of these instances were analysed using both auditory and acoustic approaches. The results reveal that more than half of the instances of I don’t know why form single intonation units, i.e. they align with intonation unit boundaries on both sides of the sequence. In 85 per cent of the cases the sequence matched an intonation unit boundary on either the left or the right side. The significance of these results is two-fold. First, the study is the first to test, and provide empirical evidence for, the prediction that formulaic sequences often form single intonation units. Secondly, it reveals that many other factors affect whether a formulaic sequence forms a single intonation unit, for instance the syntactic makeup of the sequence, the position of the sequence in an utterance and the function of the sequence in context. A point to note, however, is that this study investigates the alignment of formulaic sequences with intonation unit boundaries in adult English learners’ speech. The results may be different with native speakers.
Lin (2010) extends the investigation of the alignment of formulaic sequences with intonation units to adult native speaker speech. Instead of taking a bird’s-eye view of the phonology of a formulaic sequence regardless of the co-text the sequence is in, this study aims to capture the level of alignment with full attention to the effects of intervening co-textual and contextual factors such as the neighbouring words, syntactic and lexical structure, the intended emphasis of the utterance, the prosodic features of the speech genre and even the online processing capacity of the speakers.
Preliminary results from the analysis of 62 formulaic sequences in a lecture speech extract indicate that 82 per cent of the 62 different formulaic sequences investigated align with intonation unit boundaries on at least one side and 40 per cent align on both sides. Compared with the findings of Lin and Adolphs (2009), these percentages indicate a slightly lower level of alignment of formulaic sequences with intonation units in adult native speaker speech. This lower level of alignment may illustrate a difference in the phonological features of formulaic sequences between native speakers and learners of English, but it can also be put down to differences in genre (Lin and Adolphs, 2009, look at interview speech, whereas Lin, 2010, examines lecture speech) or to the other co-textual and contextual factors suggested earlier. For detailed findings and discussion, readers are referred to the study.
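The two alignment measures reported in these studies (alignment on both sides, and alignment on at least one side) can be made concrete with a small Python sketch. The data representation is an assumption made for illustration: intonation unit boundaries are stored as a set of token positions, and each formulaic sequence as a (start, end) pair of token positions, with end exclusive. The toy figures below are invented and do not reproduce either study’s data.

```python
def alignment(start, end, boundaries):
    """Classify a sequence spanning tokens [start, end) against intonation unit boundaries."""
    left = start in boundaries   # a boundary falls immediately before the sequence
    right = end in boundaries    # a boundary falls immediately after the sequence
    if left and right:
        return "both"
    if left or right:
        return "one"
    return "none"


def summarise(spans, boundaries):
    """Return the percentage of sequences aligned on both sides and on at least one side."""
    if not spans:
        raise ValueError("no sequences to summarise")
    labels = [alignment(s, e, boundaries) for s, e in spans]
    both = labels.count("both")
    at_least_one = both + labels.count("one")
    return {"both": 100 * both / len(labels),
            "at_least_one": 100 * at_least_one / len(labels)}


# Invented toy data: boundaries before tokens 0, 4, 9 and 12.
toy_boundaries = {0, 4, 9, 12}
toy_spans = [(0, 4), (4, 7), (5, 8), (9, 12)]
print(summarise(toy_spans, toy_boundaries))
```

Note that under this definition the "at least one side" figure always includes the "both sides" cases, which is how the percentages above are read here.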
Conclusion
In this chapter we reviewed discussions in the literature over the past four decades concerning the phonological features of formulaic sequences, which have been referred to as phonological coherence in some of the literature (e.g. Peters, 1983; Hickey, 1993). We looked at how phonological coherence is manifest in the phonemic, intonation, tempo and stress aspects, not only in child language, where the notion of phonological coherence originates, but also in adult native speaker speech. In the earlier sections we also examined the psycholinguistic or cognitive basis of the notion of phonological coherence, and what level of information the phonological evidence provides regarding the holistic storage and processing of formulaic sequences. We considered how researchers of formulaic language, researchers in natural language processing, and learners of English can benefit greatly from a deeper understanding of the phonological features of formulaic sequences. Finally, we looked at some recent empirical studies on the phonological coherence demonstrated in formulaic sequences in the spontaneous speech of adult native speakers. In our discussion, we stressed the great need for empirical research on how phonological coherence is manifest in spontaneous adult native speaker speech. While child language researchers began research in this area in the 1970s, work that targets adult native speaker speech is still in its infancy. A description of the phonological coherence of formulaic sequences in adult speech is expected to be much more difficult, because co-textual, contextual, generic and even cognitive factors can also influence the prosodic shape of formulaic sequences in spontaneous speech. That is why much work remains to be done on this rich phonological aspect of formulaic sequences.
Notes
1 Peters’ (1983, pp. 7–12) six criteria for the identification of formulaic sequences in child language are: (a) Is the utterance an idiosyncratic chunk that the child uses repeatedly and in exactly the same form? (b) Is the construction of the utterance unrelated to any productive pattern in the child’s current speech? (c) Is the utterance somewhat inappropriate in some of the contexts in which it is used? (d) Does the utterance cohere phonologically? (e) Is the usage of the expression situationally dependent for the child? (f) Is the expression a community-wide formula?
2 The author would like to thank Ron Martinez for suggesting the example.
3 Van Lancker et al.’s (1981) five observations concerning the phonological features of the idioms include: (1) literal utterances have longer duration than idiomatic utterances (i.e. idiomatic utterances are spoken faster); (2) literal utterances have 5 times as many pauses compared to idiomatic utterances; (3) juncture (Lehiste, 1973) occurred almost 3 times more in literal than in idiomatic utterances; (4) there are more pitch contours in literal than in idiomatic utterances (0.33 to 0.5 times); and (5) literal utterances are systematically marked by what Bolinger (1965) defined as Accent A.
References
Altenberg, B. (1990a). Automatic text segmentation into tone units. In J. Svartvik (Ed.), The London-Lund Corpus of Spoken English: Description and research (pp. 287–324). Lund: Lund University Press.
Altenberg, B. (1990b). Predicting text segmentation into tone units. In J. Svartvik (Ed.), The London-Lund Corpus of Spoken English: Description and research (pp. 275–86). Lund: Lund University Press.
Altenberg, B. (1998). On the phraseology of spoken English: The evidence of recurrent word-combinations. In A. P. Cowie (Ed.), Phraseology: Theory, analysis and applications (pp. 101–22). Oxford, England: Clarendon Press.
Altenberg, B., & Eeg-Olofsson, M. (1990). Phraseology in spoken English: Presentation of a project. In J. Aarts & W. Meijs (Eds.), Theory and Practice in Corpus Linguistics (pp. 1–26). Amsterdam: Rodopi.
Ashby, M. (2006). Prosody and idioms in English. Journal of Pragmatics, 38(10), 1580–97.
Baker, M., & McCarthy, M. (1988). Multiword units and things like that. In Mimeograph. Birmingham: University of Birmingham.
Biber, D. (2006). University Language: A corpus-based study of spoken and written registers. Amsterdam; Philadelphia: John Benjamins.
Biber, D., & Barbieri, F. (2007). Lexical bundles in university spoken and written registers. English for Specific Purposes, 26(3), 263–86.
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Longman grammar of spoken and written English. Harlow, Essex: Longman.
Bloom, L. (1973). One word at a time: the use of single word utterances before syntax. The Hague: Mouton de Gruyter.
Bolinger, D. (1965). On certain functions of accents A and B. In I. Abe & T. Kanekiyo (Eds.), Forms of English (pp. 57–66). Cambridge: Harvard University Press.
Bush, N. (2001). Frequency effects and word-boundary palatalization in English. In J. Bybee & P. J. Hopper (Eds.), Frequency and the emergence of linguistic structure (pp. 255–80). Amsterdam: John Benjamins.
Bybee, J. (2000). The phonology of the lexicon. In M. Barlow & S. Kemmer (Eds.), Usage-based models of language (pp. 65–85). Stanford, CA: CSLI Publications.
Bybee, J. (2001). Phonology and language use. Cambridge: Cambridge University Press.
Bybee, J. (2002). Phonological evidence for exemplar storage of multiword sequences. Studies in Second Language Acquisition, 24, 215–21.
Bybee, J. (2006). From usage to grammar: the mind’s response to repetition. Language, 82(4), 711–33.
Bybee, J., & Scheibman, J. (1999). The effect of usage on degrees of constituency: The reduction of don’t in American English. Linguistics, 37, 575–96.
Chafe, W. (1988). Punctuation and the prosody of written language. Written Communication, 5, 395–426.
Chafe, W. (2006). Reading aloud. In R. Hughes (Ed.), Spoken English, applied linguistics and TESOL: Challenges for theory and practice (pp. 53–71). London: Palgrave.
Chafe, W. L. (1987). Cognitive constraints on information flow. In R. S. Tomlin (Ed.), Coherence and grounding in discourse (pp. 21–51). Philadelphia: John Benjamins.
Conklin, K., & Schmitt, N. (2007). Formulaic sequences: are they processed more quickly than nonformulaic language by native and nonnative speakers? Applied Linguistics, 1–18.
Crystal, D. (1975). The English tone of voice: Essays on intonation, prosody and paralanguage. London: Edward Arnold.
Crystal, D. (2003). Prosody. In The Cambridge encyclopedia of the English language (2nd ed., pp. 248–49). Cambridge: Cambridge University Press.
Dahlmann, I. (2009). Towards a multi-word unit inventory of spoken discourse. Unpublished PhD thesis, University of Nottingham, UK.
De Cock, S. (1998). A recurrent word combination approach to the study of formulae in the speech of native and non-native speakers of English. International Journal of Corpus Linguistics, 3(1), 59–80.
De Cock, S. (2000). Repetitive phrasal chunkiness and advanced EFL speech and writing. In C. Mair & M. Hundt (Eds.), Corpus linguistics and linguistic theory (pp. 51–68). Amsterdam: Rodopi.
De Cock, S. (2007). Routinized building blocks in native speaker and learner speech: Clausal sequences in the spotlight. In M. C. Campoy & M. J. Luzón (Eds.), Spoken corpora in applied linguistics (pp. 217–33). Bern: Peter Lang.
Dechert, H. W. (1983). How a story is done in a second language. In C. Faerch & G. Kasper (Eds.), Strategies in interlanguage communication (pp. 175–95). London: Longman.
Dore, J. (1976). Holophrases, speech acts and language universals. Journal of Child Language, 2, 21–40.
Erman, B. (2006). Non-pausing as evidence of the idiom principle. Paper presented at the First Nordic Conference on Syntactic Freezes, University of Joensuu, Finland, May 19–20, 2006.
Erman, B. (2007). Cognitive processes as evidence of the idiom principle. International Journal of Corpus Linguistics, 12(1), 25–53.
Erman, B., & Warren, B. (2000). The idiom principle and the open choice principle. Text, 20(1), 29–62.
Fillmore, C. J. (1979). On Fluency. In C. J. Fillmore, D. Kempler, & W. S. Y. Wang (Eds.), Individual differences in language ability and language behaviour (pp. 85–101). New York: Academic Press.
Hickey, T. (1993). Identifying formulas in first language acquisition. Journal of Child Language, 20(1), 27–41.
Hooper, J. B. (1976). Word frequency in lexical diffusion and the source of morphophonological change. In W. M. Christie (Ed.), Current progress in historical linguistics (pp. 96–105). Amsterdam: North Holland.
Jiang, N., & Nekrasova, T. M. (2007). The processing of formulaic sequences by second language speakers. The Modern Language Journal, 91(3), 433–45.
Knowles, G., & Lawrence, L. (1987). Automatic intonation assignment. In R. Garside, G. Leech, & G. Sampson (Eds.), The Computational analysis of English: A corpus-based approach. London: Longman.
Lehiste, I. (1973). Phonetic disambiguation of syntactic ambiguity. Glossa, 7(2), 103–22.
Lennon, P. (2000). The lexical element in spoken second language fluency. In H. Riggenbach (Ed.), Perspectives on fluency (pp. 25–42). Ann Arbor, MI: University of Michigan Press.
Lin, P. M. S. (2010). The prosody of formulaic sequences in spontaneous speech. Unpublished PhD thesis, The University of Nottingham, UK.
Lin, P. M. S., & Adolphs, S. (2009). Sound evidence: Phraseological units in spoken corpora. In A. Barfield & H. Gyllstad (Eds.), Collocating in another language: Multiple interpretations. Basingstoke, England: Palgrave Macmillan.
Moon, R. (1997). Vocabulary connections: Multi-word items in English. In M. McCarthy (Ed.), Vocabulary: Description, acquisition and pedagogy (pp. 40–63). Cambridge; New York: Cambridge University Press.
Moon, R. (1998). Frequencies and forms of phrasal lexemes in English. In A. P. Cowie (Ed.), Phraseology: Theory, analysis, and applications (pp. 55–78). Oxford: Clarendon Press.
Myles, F., Mitchell, R., & Hooper, J. (1999). Interrogative chunks in French L2: A basis for creative construction? Studies in Second Language Acquisition, 21, 49–80.
Peters, A. M. (1977). Language learning strategies: Does the whole equal the sum of the parts? Language, 53(3), 560–73.
Peters, A. M. (1983). The units of language acquisition. Cambridge: Cambridge University Press.
Plunkett, K. (1990). The segmentation problem in early language acquisition. Center for Research in Language Newsletter, 5(1), 1–17.
Raupach, M. (1984). Formulae in second language speech production. In H. W. Dechert, D. Möhle & M. Raupach (Eds.), Second language productions (pp. 114–37). Tübingen: Gunter Narr.
Rodgon, M. (1976). Single-word usage, cognitive development and the beginnings of combinatorial speech. Cambridge: Cambridge University Press.
Schmitt, N., & Underwood, G. (2004). Exploring the processing of formulaic sequences through a self-paced reading task. In N. Schmitt (Ed.), Formulaic sequences: Acquisition, processing and use (pp. 173–90). Amsterdam; Philadelphia: John Benjamins.
Scollon, R. (1974). A real early stage: an unzippered condensation of a dissertation on child language. University of Hawaii Working Papers in Linguistics, 6, 67–81.
Scott, M. (2004). WordSmith Tools (Version 4). Oxford: Oxford University Press.
Skehan, P. (1998). A cognitive approach to language learning. Oxford: Oxford University Press.
Underwood, G., Schmitt, N., & Galpin, A. (2004). The eyes have it: An eye-movement study into the processing of formulaic sequences. In N. Schmitt (Ed.), Formulaic sequences: Acquisition, processing and use (pp. 153–72). Amsterdam; Philadelphia: John Benjamins.
Van Lancker, D., Canter, G., & Terbeek, D. (1981). Disambiguation of ditropic sentences. Journal of Speech and Hearing Research, 24, 330–35.
Weinert, R. (1995). The role of formulaic language in second language acquisition: A review. Applied Linguistics, 16(2), 180–205.
Wennerstrom, A. K. (2001). The role of intonation in second language fluency. In H. Riggenbach (Ed.), Perspectives on fluency (pp. 102–27). Ann Arbor, MI: University of Michigan Press.
Wennerstrom, A. K. (2006). Intonational meaning starting from talk. In R. Hughes (Ed.), Spoken English, TESOL and applied linguistics (pp. 72–98). Basingstoke, England: Palgrave Macmillan.
Wray, A. (2004). ‘Here’s one I prepared earlier’: Formulaic language learning on television. In N. Schmitt (Ed.), Formulaic sequences: Acquisition, processing and use (pp. 249–68). Amsterdam; Philadelphia: John Benjamins.
Wray, A., & Namba, K. (2003). Formulaic language in a Japanese-English bilingual child: a practical approach to data analysis. Japan Journal of Multilingualism and Multiculturalism, 9, 24–51.
Zgusta, L. (1967). Multiword lexical units. Word, 23, 578–83.
Chapter 10
Processing MWUs: Are MWU Subtypes Psycholinguistically Real?
Georgie Columbus
University of Alberta1
Introduction
The study of multiword units has a long history of theoretical and experimental research in the fields of semantics, morphology, phonology, syntax and psychology. These multiword units (MWUs) include a wide range of types of multiword strings. Some of these MWUs are as follows: restricted collocations (RCs) such as to pay attention, where *to take attention is not available for use despite its availability in the semantically analogous take care (to); frequent word strings (lexical bundles, or LBs), including functional phrases such as the store assistant’s Can I help you? and nonsemantic units such as at the end of the; and finally, idioms (IDs) such as Too many cooks spoil the broth. One question that is still to be definitively answered, and one that is being worked on in several research centres with differing goals and approaches, is what happens when speakers process MWUs. We do not yet know whether they are processed in the same way as conventional propositions, whether this is the same for L1 and L2 speakers of English, or whether it is the same for different types of MWUs. Only once these questions have been answered can we determine how this information can be applied to proposed models of language processing (viz. superlemma theory of Sprenger, Levelt, & Kempen, 2006) and language pedagogy (cf. Lewis’ Lexical Approach, 1993, inter alia).
Literature Review
Research into MWUs has been conducted for over a century now. The majority of previous formulaic sequence studies have focused either on semantics and the differences between literal and nonliteral idiom processing (e.g. Swinney & Cutler, 1979; Gibbs, 1985, 1986; Cacciari & Tabossi, 1988; Cacciari & Glucksberg, 1991; Titone & Connine, 1994; Cutting & Bock, 1997) or on the production/use of MWUs in corpus data (e.g. Moon, 1996, 1998; Biber, Conrad, & Reppen, 1998; Schmitt, 2005). There are several notable exceptions to this focus, however, and each of these authors notes the lack of psycholinguistic studies of MWUs to date. The underlying question in psycholinguistics regarding MWUs is whether or not these co-occurring strings are stored, accessed and processed in the same way as compositional language, regardless of their internal make-up and semantic and/or functional features. Studies which have added to the debate on holistic and/or probabilistic storage and processing have had varying results, and call for further investigations.
One such study is Schmitt, Grandage & Adolphs (2004), who looked at fluency and accuracy of MWU production in native and non-native speakers. They had participants repeat stories with embedded MWUs, with an intermediate addition problem for the native speakers in order to compensate for the difference in working memory load between this group and the non-native speakers. The target items were ‘Recurrent Clusters’ (viz. frequently co-occurring words) made up of both LBs (Biber, Johansson, Leech, Conrad, & Finegan, 1999) and formulaic sequences (Wray, 2002) which were derived from corpora. Their underlying goal was to determine whether these statistically robust combinations were also psycholinguistically valid, as measured by facilitated reproduction. However, they did not find the expected results. Instead, they found that the MWUs were generally not reliably reproduced as whole clusters. The exception to this was very frequent clusters such as I don’t know what to do. The authors suggested that the knowledge of these frequent LBs is part of individuals’ phrasalects,2 and that semantically or functionally linked MWUs are more likely to be reproduced.
Yet this explanation takes into account neither the semantic differences between the formulaic sequences and LBs nor the effect of context in story-retelling; it seems possible that the combination of the discourse task and the semantic task of retelling a story may override the interest in verbatim accuracy (which the pilot studies showed native speakers were able to achieve with near-perfect accuracy before inclusion of the arithmetic task). Hence, while there was some validation of the holistic processing/storage of MWUs here, it is possible that the context issues (i.e. semantic parsing) may have overridden the production task.
Unlike the production study above, Underwood, Schmitt & Galpin’s (2004) eye-movement study did find significant processing advantages for final words within MWUs compared to final words within matched
non-MWUs. The authors did not determine any differences between the MWU types used (transparent metaphors, sayings/proverbs, lexical phrases and idioms). Underwood et al. (2004) focused on analysing the terminal word in an MWU. The final word was considered the best position to measure MWU advantages given native speakers’ tendency to fixate on some but not all words in any sentence. This skipping is arguably where the processing advantages of MWUs truly come into play, and is therefore worth further investigation. Similarly, while the faster reading times for MWU terminal words are put forward as support for holistic processing (and therefore for holistic storage also), this cannot be taken as a complete measure of MWU processing. This is because the MWUs can span between three and eight words; measuring only the last word of this unit may obscure further effects. Thus any advantage of the full MWUs cannot be inferred from these results. Also, it was not clear whether there were differences in processing times between the units within the MWUs, or for the different types of MWUs used. Overall, Underwood et al. (2004) do offer further evidence of MWU reading advantages. However, with a terminal unit analysis, low generalizability (20 stimuli), and no distinction among MWU types, they cannot give a complete picture of MWU processing.
In a follow-up study, Schmitt and Underwood’s (2004) self-paced reading task investigated processing advantages for a mixture of MWUs in word-by-word reading times. Twenty MWU sequences were presented within a story-like passage. The lack of facilitation for MWUs in reading speed here (a slight and nonsignificant speed increase over controls) is perhaps due to stimuli differences in transparency, literality, flexibility and familiarity. The authors also note the inability to skip over a word in self-paced reading paradigms when presented word-by-word.
This may have weakened or nullified the reading advantage of the MWUs, adding to the lack of significance. In contrast, Tremblay, Derwing, Libben & Westbury’s (submitted) word-by-word self-paced reading study found a reliable processing advantage for LBs, though small in millisecond terms. Given the range of results found for the MWU tasks in the studies mentioned above, some scholars have attempted to determine the possible driving force for this variation. Schmitt (2005) discusses in detail the variation between and within different types of MWUs, and theorizes that processing and storage of these is also likely to vary by type. Schmitt goes on to suggest that there is a core template or exemplar as the memorised unit of an MWU. Access to any of the content words in this template should then facilitate access to the whole MWU.
The role of MWUs in processing should be made clearer by generally increased reading speeds over entire MWUs as compared to non-MWUs. Thus eye-movement data is likely a promising measure of MWU processing and of differences between MWU types, especially given the preliminary results found by Underwood et al. (2004). The eye-tracking experiment undertaken in this study employs timed reading and eye-movement data (fixation durations) to investigate the role of MWUs in the processing of English sentences by native readers. The research questions are as follows:
1. Are the various MWU subtypes processed differently by native readers? And if so, how?
2. Do individual MWU type predictors play a role in processing? And if so, which?
To be able to differentiate among the MWU subtypes and other potential predictors, the stimuli set includes three types of MWUs and two controls. Each is collated from authentic data sources (i.e. sentences from a corpus of English).
Methodology
This study reports mixed-effects modelling. The key independent variables are the stimulus sentence types. There are three different MWU subtypes (restricted collocations, e.g. to pay attention, lexical bundles, e.g. at the end of the, and figurative idioms, e.g. to let the cat out of the bag) and two control structures (conventional compositional strings, e.g. to eat a sandwich, and semantically anomalous strings, e.g. to eat the criticism). Given Tremblay et al.’s (submitted) success in finding lexical-bundle effects in sentential contexts, we decided that presenting the MWUs in such contexts would also be sufficient. The dependent variables in this study are the reading time measures. Covariates in the mixed-effects modelling include frequency of individual key words of the MWUs compared to frequency of words in the carrier sentence, reading speed of the MWU unit compared to non-MWU units in each sentence, trial presentation order, and sentential location of the MWU. These covariates are random, since the stimuli were randomly selected from a corpus without any attempt to factorialize or otherwise control the values of the covariates. As instances from a corpus, the stimuli are
actual exemplars of language usage, rather than artificial structures. The intention here is to reflect real language more accurately.
Stimulus selection
The stimuli, divided into the five types described above, were collected from the British National Corpus (BNC) of English. The items were selected using the BNCweb search tool (Lehmann, Hoffmann, & Schneider, 2002). A set of key words for various idioms, restricted collocations and lexical bundles were first randomly selected from pages of various dictionaries of idiomatic language (e.g. Benson, Benson & Ilson’s Combinatory Dictionary of English, 1986, and Kuiper, McCann, Quinn, Aitchison, & van der Veer’s A syntactically annotated idiom database [SAID], 2003) as well as from previous studies, such as Moon’s (1996; 1998) fixed expression searches in the Collins COBUILD corpus. We searched for either a key word or key pair of words from the MWU. A results-thinning tool was then used to randomly select a smaller number of concordance lines from the total results per search word.3 A random selection of 50 sentences containing an example of an MWU was then made for each subtype. This involved choosing the first possible stimulus going from either the top or the bottom of the list, with the only exclusionary factors being extreme length, non-standard carrier structures (i.e. sentences which used literary but unlikely word ordering, or partial responses from spoken or written dialogue), incomplete sentences or contexts (e.g. I don’t!) and illogical/speech error exemplars. Some short contexts were retained, however, due to limited results and also to ensure a more varied stimulus set in terms of sentence length. This quasi-random process ensured that the item selection conformed to the requirements for random factors in a mixed-effects design, at the same time minimizing the possibility of experimenter bias effects (Forster, 2000). The control sentence types were produced in two ways.
The compositional sentences were selected randomly from searches containing common words such as it, the and a, resulting in stimuli such as The teaching is excellent. The semantically anomalous data were collected in the same manner, but each real sentence had one content word replaced so as to create a grammatical but illogical sentence which should cause a processing problem for the participants (viz. Please join in the fun, buy a T-shirt >> Please sweep in the fun, buy a T-shirt). Finally, the ID stimuli were checked for familiarity with several native North American English (NAE) speakers to ensure use in NAE. This eliminated several stimuli of more British usage.4 Other, new, stimuli accepted by the raters were introduced to make up the set of fifty, collected
Table 10.1 Example Stimuli

Sentence | Sentence type | Register
Don’t let him take me. | CON | W
Alcohol might have played some part in this serious miscalculation. | RC | W
‘While I’m getting ready, would you like to come up and see my room?’ | LB | W
The old boy had lost his marbles somewhere along the line. | ID | W
I’m actually in the carpet in Tower Street. | SA | S
in the manner described above. Table 10.1 illustrates a stimulus for each of the five conditions.
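The thinning and selection steps described in this section can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the word-count thresholds and the punctuation test are hypothetical stand-ins, not the BNCweb thinning tool or the exclusion criteria actually applied, and only two of the exclusionary factors (length and completeness) are modelled.

```python
import random


def thin(lines, k, seed=0):
    """Randomly keep at most k concordance lines (stand-in for a thinning tool)."""
    rng = random.Random(seed)
    if len(lines) <= k:
        return list(lines)
    return rng.sample(lines, k)


def is_usable(sentence, min_words=3, max_words=30):
    """Hypothetical exclusion filter: reject very short, very long or unfinished lines."""
    n_words = len(sentence.split())
    if n_words < min_words or n_words > max_words:
        return False
    # Assumption: a complete sentence ends in ., ! or ? (ignoring closing quotes).
    return sentence.rstrip("'\"’”").endswith((".", "!", "?"))


def select_stimuli(lines, n, from_top=True):
    """Take the first n usable lines, scanning from the top or the bottom of the list."""
    ordered = lines if from_top else list(reversed(lines))
    return [s for s in ordered if is_usable(s)][:n]
```

For instance, `thin(results, 100)` followed by `select_stimuli(thinned, 50)` would mirror reducing a large result set to 100 concordance lines and then taking 50 carrier sentences from it.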
Stimulus variables
Since the data will be analysed with mixed-effects regression modelling, rather than factorial analysis, it is necessary to define the variables to be considered. The five conditions for the stimulus sentence types are the three MWU subtypes and the semantically anomalous and normal compositional control sentences. As stated above, the covariates that are to be investigated for their predictive power, but critically allowed to vary randomly, are word frequency, the position of each of the units in the MWU within the carrier sentence, and the sentential position of the word being read, insofar as the placement of the non-MWU strings and MWUs is concerned. Each carrier sentence’s original modality was also documented. The ratio of written to spoken sentences was 233 to 17. This does leave a clear written bias in the stimuli, though since eye-tracking is visual, this is appropriate. It is also difficult to avoid given the largely written (90 per cent) modality of the BNC. The modality differences were thus left out of the current analysis.
Word frequency measures
The surface form frequency (SFF) of each word in the set of carrier sentences was calculated using the BNCweb tool’s search functions. The range of frequencies is 0.01–62014.49 per million words (pmw). The reasons for concentration on the surface forms are many. The presentation
of the items is visual; therefore all shapes (i.e. orthographic forms) which are identical to a word in its stimulus form are potentially accessed with the presentation of the stimulus. On the same grounds regarding visual interpretation of shapes, all compound words which are hyphenated are likely to appear as one word (i.e. a contiguous string of letters) in peripheral view (Rayner, 1998). Thus these are given a frequency count per million for the hyphenated compound as a whole, not for each of the parts. Likewise, the SFFs of numbers are calculated from their representations as Arabic numerals only, and $, €, and £ symbols are excluded from any frequency counts. If there is a drawback to this frequency count, it is that the homophone frequencies for the words in the stimuli are not calculated, which fails to address phonetically-motivated processing accounts. However, such an account would also have to include stress, prosody and other suprasegmental features, variables unavailable in the corpus. Thus, we have no choice but to set aside the issue of homophone frequencies for another study. As a result, word frequency in this study refers to SFFs of homographs alone.
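The counting conventions described above (hyphenated compounds as single units, currency symbols excluded, frequencies expressed per million words) can be illustrated with a minimal sketch. The tokenizer, the case-folding and the function names are assumptions made for illustration; they are not the BNCweb search functions used in the study.

```python
from collections import Counter

CURRENCY = "$€£"
PUNCTUATION = ".,;:!?\"'()[]"


def tokenize(text):
    """Split on whitespace so that hyphenated compounds stay single tokens."""
    tokens = []
    for raw in text.split():
        # Assumption: surface forms are compared after case-folding.
        tok = raw.strip(PUNCTUATION).lower()
        tok = tok.strip(CURRENCY)  # currency symbols are left out of the counts
        if tok:
            tokens.append(tok)
    return tokens


def per_million(raw_count, corpus_size):
    """Normalise a raw count to occurrences per million words (pmw)."""
    return raw_count * 1_000_000 / corpus_size


def surface_form_frequencies(corpus_text):
    """Return the pmw frequency of every surface form in a text."""
    tokens = tokenize(corpus_text)
    counts = Counter(tokens)
    return {form: per_million(c, len(tokens)) for form, c in counts.items()}
```

In a corpus of roughly 100 million words, a single occurrence corresponds to `per_million(1, 100_000_000)`, that is 0.01 pmw, which matches the lower end of the range reported here.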
Participants
Twenty-one undergraduate and graduate students from the University of Alberta participated in this experiment. All were native speakers of English with normal or corrected-to-normal vision. They were paid $15 per hour for their participation.
Materials
Stimuli were presented through a PC laptop computer onto a 21-inch monitor, using Experiment Builder® software from SR Research. Eye-movement data were collected through a head-mounted eye-tracker system (EyeLink® II) from the same company. Calibration used three horizontal points. Participants used a foot pedal to cue the next stimulus, and a button box (Cedrus RB-530®) to indicate responses to the comprehension questions.
Design
There were 250 experimental stimuli with 50 sentences per condition, preceded by 12 practice stimuli. Trials within the practice and experimental
blocks were randomly ordered. During the task, the participants read sentences at a comfortable distance from the computer screen while wearing the head-mounted eye-tracker, with rests and then recalibrations after every 25 stimuli. To ensure they were paying attention to the reading task rather than scanning, participants also answered yes/no comprehension questions at a rate of approximately one for every five to six sentences (see Table 10.2 for an example). The stimuli were presented in yellow on a black screen, with the comprehension questions and instructions appearing in white. Longer sentences (>15 words) were presented in size 14 Times New Roman font, and shorter sentences in size 18. Each sentence appeared in the centre-left position of the screen following a fixation cross in the same location. Participants were told to read the sentences silently and as quickly as they could, but not so fast that they could not expect to answer the comprehension questions. They indicated that they had finished reading the sentence by pressing a foot pedal. This would call up the next fixation cross and sentence, unless there was a break forthcoming. The eye-tracking equipment estimated eye positions at a sampling rate of 250 or 500 Hz (depending on participant conditions), and associated each with a time stamp. From these, it also estimated the temporal and spatial attributes of both saccades and fixations. Several dependent variables were later calculated, such as total sentence reading time, and reading times on individual words.
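As a rough illustration of how reading-time measures can be derived from time-stamped fixations, the sketch below computes a first fixation duration, a total word reading time and a sentence reading time. The `Fixation` record and all function names are hypothetical, and the definitions are assumptions (for example, sentence reading time is taken here as the sum of fixation durations); the eye-tracker software's own reports may define these measures differently.

```python
from dataclasses import dataclass


@dataclass
class Fixation:
    word_index: int   # position of the fixated word in the sentence
    start_ms: float   # fixation onset time stamp
    end_ms: float     # fixation offset time stamp

    @property
    def duration(self):
        return self.end_ms - self.start_ms


def sample_interval_ms(rate_hz):
    """Time between two eye-position samples at a given sampling rate."""
    return 1000 / rate_hz


def first_fixation_duration(fixations, word_index):
    """Duration of the first fixation on a word, or None if it was skipped."""
    for f in fixations:
        if f.word_index == word_index:
            return f.duration
    return None


def word_reading_time(fixations, word_index):
    """Summed duration of all fixations on a word, including re-readings."""
    return sum(f.duration for f in fixations if f.word_index == word_index)


def sentence_reading_time(fixations):
    """Assumed here to be the summed duration of all fixations in the sentence."""
    return sum(f.duration for f in fixations)
```

At the two sampling rates mentioned above, `sample_interval_ms` gives 4 ms and 2 ms between samples respectively.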
Results
The analyses of the data were all carried out in R (R Development Core Team, 2007) using mixed-effects regression models, following a stepwise variable selection procedure. The dependent variables in this study were the time spent reading the word on the first pass (first fixation duration, or FFD), the total sentence reading time (SRT), and the total word reading time (WRT). These values were log-transformed before analysis to
Table 10.2 Example Stimulus and Related Comprehension Question

Sentence: An injury to the Motherwell veteran Davie Cooper at the eleventh hour meant that Fleck was summoned to join the squad.
Question: Did Cooper play? (Yes or No?)
Answer: No.
minimize the effect of outliers and rightward skew. Each covariate was analysed in a model for each of the reading time measures, and predictors which failed to reach significance were removed from the model. Regression models were validated with ANOVAs. All significance values reported below are based on t-values unless otherwise indicated, with an absolute t-value greater than 2 taken to indicate significance (Baayen, Davidson & Bates, 2008). Additionally, all models included participants and items as crossed random effects. The covariates are being investigated for their ability to explain MWU processing effects. As mentioned above, each variable is analysed for its predictive power in each of the reading time measurements. These reading times form the dependent variables for each model: the FFD measure offers insight into the ballistic first pass process of reading, including facilitation for reading MWU sentences and strings. Meanwhile, WRT gives the effects of total reading time for each word, and the SRT illustrates the effects of each covariate and independent variable through the time it takes the participant to read the entire sentence. The first covariate is sentential position, which was included to determine if the reading process is affected by where the reader is in the sentence. Next, the individual surface form frequency was included to consolidate past findings of word frequency effects in parts of MWUs (Columbus, 2008; Kuiper, Columbus & Schmitt, 2009). This contrasts with traditional measures, which have investigated the frequency of each MWU as a whole. We inspected the frequency plots and found that the effect of frequency was nonlinear, and thus made it a quadratic predictor in both word reading time models. Reading speed within the MWU region in the carrier sentence was measured to determine if the MWU itself does offer real processing advantages compared to non-MWU words in the complete sentence. This is a binary covariate.
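The variable preparation described so far can be sketched in a few helper functions. This is a hedged illustration only: the analyses themselves were run in R, and the function names, the use of the natural logarithm, the z-scaling helper (of the kind that could serve for a trial-order term) and the uncentred quadratic term are all assumptions rather than the specification actually used.

```python
import math
import statistics


def log_rt(rt_ms):
    """Log-transform a reading time to reduce rightward skew (natural log assumed)."""
    return math.log(rt_ms)


def z_scale(values):
    """Centre values on their mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def quadratic_terms(freq):
    """Linear and squared terms for a predictor with a nonlinear effect."""
    return freq, freq ** 2


def in_mwu_region(word_pos, mwu_start, mwu_end):
    """Binary covariate: 1 if the word lies inside the MWU, otherwise 0."""
    return 1 if mwu_start <= word_pos <= mwu_end else 0


def is_significant(t_value, threshold=2.0):
    """Rule of thumb used in the text: |t| greater than 2 counts as significant."""
    return abs(t_value) > threshold
```

A model-fitting step is deliberately left out, since fitting crossed random effects for participants and items depends on the statistical package chosen.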
The order of presentation of the stimuli was also included. Inspection of this variable showed that Trial Order had to be scaled before being added to the models. Finally, the Sentence Type independent variables underwent pairwise comparisons in each model to measure the effect on reading times between MWU sentences (i.e. ID, LB and RC stimuli) and the two controls (CON and SA). Table 10.3 shows the results for each covariate in each of the time measures, illustrating the level of predictive capacity each variable has for MWU reading times. Word frequency is consistently significant, as is reading within the MWU region of a sentence. The presentation order of the stimuli results in slower reading nearer the end of the experimental session, and reading is also inhibited by the length of the word or sentence. The sentence
Table 10.3 Results for Each Predictor in Each Reading Time Measure

Predictor | WRT | SRT | FFD
Control sentence type | *fac | * | ~
Idiom sentence type | ~/*fac | *fac | ~
Restricted collocation sentence type | *fac | *fac | ~
Lexical bundle sentence type | *fac | trend | ~
Semantically anomalous sentence type | *in | *in | ~
Trial order | * | ** | trend
Sentential position | **in | N/A | ***in
In the MWU region | **fac | N/A | **fac
Number of words in sentence (quadratic) | N/A | **in | N/A
Word length | **in | N/A | ---
Frequency (linear) | *fac | N/A | ---
Frequency (quadratic) | *fac | N/A | **fac
Interactions
ID by frequency | trend | N/A | ---
LB by frequency | ~ | N/A | ---
RC by frequency | ~ | N/A | ---
SA by frequency | ~ | N/A | ---
ID by frequency (quadratic) | trend | N/A | ---
LB by frequency (quadratic) | ~ | N/A | ---
RC by frequency (quadratic) | ~ | N/A | ---
SA by frequency (quadratic) | ~ | N/A | ---

Keys:
* = significant at p<0.05; **, *** = more significant
N/A = not applicable to this reading time measure
~ = not significant
--- = not significant, thus left out of the best fit model
trend = nearing significance
in = inhibitory (= slower reading)
fac = facilitory (= faster reading)
type facilitations should be interpreted as between types: where CON is significant, it is read more slowly than the MWU type sentences, that is, LB, RC and ID. The MWU sentence types are read faster than the CONs, and SAs are read more slowly than all other sentence types. This is not always significant, but the pattern of reading times between sentence types is clear for WRT and SRT in Figure 10.1, below.
[Figure 10.1 Sentence Type Distribution across All Reading Times. Panels plot log First Fixation Duration, log Word Reading Time and log Sentence Reading Time by Sentence Type (CON, ID, LB, RC, SA).]

Results summary
The results above show that the MWU types do differ significantly from CON (and, as would be expected, SA), but these differences are not consistent across the reading time variables (e.g. WRT vs. FFD). Broadly
speaking, the further into the sentence the participant reads, the slower the reading for SAs and the faster the reading for IDs, with other MWU types filling the gradient between. It is not surprising that the length of the sentence has a significant role in processing time. This is equally true of word length. However, the strong facilitory effects for sentence type and MWU region in most models show that both the MWU types and the MWU strings within carrier sentences do promote faster processing compared to non-MWU strings and sentences. Overall, there is evidence that processing of MWUs differs according to certain predictors, which suggests that the MWU types introduce differences in processing. It is not clear, however, if it is the types per se or their inherent features, such as predictability, semantic relatedness and/or clarity, which explain the differences.5
Discussion
The results for MWUs in this eye-movement experiment offer an insight into the processing differences and predictors involved in parsing MWUs. This study set out to determine if MWUs can be differentiated psychologically by type, and if so, which variables were key predictors in this distinction. Several observations can be made from the results above. Overall, there is faster processing of MWUs of all types over control sentences. Equally importantly, the significance of the reading speed advantage of MWU sentence types over non-MWU sentence types is not modulated by the slower reading times of the SA sentences, as was seen in the comparisons of all sentence types with each MWU sentence type. In terms of the predictors, there were also differences found with respect to processing MWUs. For example, being in the MWU region does explain the significantly faster reading times for MWU sentences over control sentence types, as both word reading time measures are facilitated. On the other hand, how far into the sentence the participant is reading predicts the inhibitory effect in whole sentence reading time, but it had no predictive power for reading of the MWU region. This suggests that the MWU processing advantage is available in all sentential positions, a finding which could not be made in strictly controlled MWU experimental designs. The best time variable for predicting faster MWU processing is WRT, that is, the total time spent reading each word in a sentence containing an MWU.
Additionally, word length had a predictably significant effect on both total and first pass reading times, and was found to be nonlinear. Trial order was also a significant inhibitor on reading times, with the fatigue effect leading to slower reading later in the experimental session. The strong facilitation found for frequency was significant for both word reading time measures. (Word frequency is not a factor in sentence reading time.) These findings go against weak frequency effects found in some previous studies, such as Schmitt et al. (2004), although that study used whole-MWU frequency rather than word frequency. It is possible that the whole MWU frequency (of the unit and/or transitional probability) adds to this effect, meaning there may be another frequency effect at a different level. Further investigation into this is necessary, to determine whether frequency effects which have not been found in previous studies are due to assuming linear frequency (cf. the nonlinear frequency found here). It is perhaps not the MWU as a whole but the linking between the components which leads to the processing/reading advantages found. This is proving to be an interesting factor in production of four-word LBs in Tremblay et al.’s (in preparation) ERP paradigm results. In that study, a key predictor in speed of producing LBs is the transitional probabilities in function versus content words. A clarification of which frequency effects are relevant to MWU processing may be able to eliminate the possibility that previous studies’ non-effects with respect to frequency are due to contextual issues (i.e. the semantic content of the carrier passages). When comparing the MWU facilitation effects found here with those of previous studies, we need to consider power. Schmitt and colleagues’ studies (discussed earlier) failed to find clear effects across several paradigms, but it must be acknowledged that they tested approximately 20 stimuli in each experiment.
It is possible that the stronger processing advantages found in the current experiment were a result of more power, in that 50 stimuli were presented for each condition. This presentation method was made possible because of the shorter sentence contexts. The reliable findings for sentential positions may also have benefitted from the shorter contexts. Indeed Underwood et al.’s (2004) eye-movement study could only look at the terminal items of an MWU. It may be that shorter contexts reveal better the reader’s processing by reducing the working memory load needed in processing longer passages. That is, the reading of full texts requires the participant to retain more semantic information for longer periods of time. Conversely, short but complete sentences only require the participant to retain the relevant information until the next stimulus is presented.
Finally, the central question of this study was whether the different types of MWUs discussed here, ID, LB and RC, are psychologically valid. The answer to this question appears to be ‘yes’. The SRT and WRT reading times both showed significant differences between MWU sentence types. Idiom reading times were faster than both LB and RC in the WRT measure, and LB and RC reading times were significantly different from each other in the SRT measure. There is also a strong pattern of differences in reading times between all sentence types, even though the distances do not reach significance in every model. The U-shaped curves and order of facilitation are highly similar for SRT and WRT, as can be seen in Figure 10.1. The fact that this is not echoed in the FFD measure may reflect the lack of full parsing of the sentence in the first pass. Thus, we can be almost certain that the three MWU sentence types here are psychologically distinct as measured through reading times. Indeed, the fact that different variables produce different effects suggests that the MWUs may not be different because of their types per se, but due to the inherently different features among those types, such as semantic transparency and transitional probability.6 This is an area which requires further investigation before the implications of MWU subtype differences are brought into computational, cognitive, pedagogical and pathological models of language processing and storage.
Conclusion
By running an eye-movement study of three different MWU subtypes in a timed reading task, we have found evidence for differences among MWUs. The results offer a better understanding of how native speakers of English read MWUs. These findings contribute to the ongoing MWU debate, particularly by clarifying how three of the MWU types are processed by native speakers of English. This study has also shown that while it seems clear that there are psychologically valid types of MWUs, there is much more left to determine. Special attention needs to be paid to the nonlinear frequency, unit versus word frequency, transitional probability, unit length, semantic transparency, and complexity issues raised here in order to have a clearer understanding of how the differences found above are instantiated. For now we can only say that there are separate MWU types which require separate treatment in the research into reading processes, and that we have a better idea of which predictors may be key to the processing advantages gained.
Notes
1 I would like to thank Gary Libben, Patrick Bolger and Harald Baayen for their help over the course of this study. Thanks are also due to the participants at the Postgraduate Conference for the Formulaic Language Research Network 2008 and the Psycholinguistics In Flanders 2008 conference for their comments and suggestions on parts of this research. All errors, of course, remain my own.
2 That is, the phrasal knowledge in the mental lexicon of each individual speaker.
3 For example, if the total number of results was 768 concordance lines with the MWU word in them, the thinning tool would randomly select 100 or 200 of these as the set we used.
4 That is, items that were not known to most or all of the raters. This removed the potential of these items being recognised but not known through use in British films, and so on.
5 These differences are currently under investigation (Columbus, in preparation) in EEG analyses and further eye movement studies with respect to L1 and L2 processing.
6 These possibilities are under investigation (Columbus, in preparation) using the same stimuli.
References
Baayen, R. H., Davidson, D. J., & Bates, D. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59 (4), 390–412.
Benson, M., Benson, E., & Ilson, R. (Eds.). (1986). The BBI combinatory dictionary of English: A guide to word combinations (1st ed.). Amsterdam/Philadelphia: John Benjamins Publishing Company.
Biber, D., Conrad, S., & Reppen, R. (1998). Lexico-grammar. In D. Biber, S. Conrad, & R. Reppen, Corpus linguistics: Investigating language structure and use (pp. 84–105). Cambridge: Cambridge University Press.
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Longman grammar of spoken and written English. Harlow: Longman.
Cacciari, C., & Glucksberg, S. (1991). Understanding idiomatic expressions: the contribution of word meanings. In G. B. Simpson (Ed.), Understanding word and sentence (pp. 217–40). North Holland: Elsevier.
Cacciari, C., & Tabossi, P. (1988). The comprehension of idioms. Journal of Memory and Language, 27, 668–83.
Columbus, G. (in preparation). Processing of different MWU-types in L1 and L2 speakers of English. PhD dissertation, University of Alberta, Canada.
Columbus, G. (2008). An eye-movement study for disambiguating types of multiword units. Poster presented at The Sixth Mental Lexicon, University of Alberta, Banff, Canada, October 7–10, 2008.
Cutting, J., & Bock, K. (1997). That’s the way the cookie bounces: Syntactic and semantic components of experimentally elicited idiom blends. Memory and Cognition, 25 (1), 57–71.
Forster, K. I. (2000). The potential for experimenter bias effects in word recognition experiments. Memory & Cognition, 28 (7), 1109–15.
Gibbs, R. (1985). On the process of understanding idioms. Journal of Psycholinguistic Research, 14 (5), 465–72.
Gibbs, R. (1986). Skating on thin ice: Literal meaning and understanding idioms in conversation. Discourse Processes, 9, 17–30.
Kuiper, K., Columbus, G., & Schmitt, N. (2009). The acquisition of phrasal vocabulary. In S. Foster-Cohen (Ed.), Advances in language acquisition (pp. 216–240). Basingstoke: Palgrave Macmillan.
Kuiper, K., McCann, H., Quinn, H., Aitchison, T., & van der Veer, K. (2003). A syntactically annotated idiom database. Philadelphia: Linguistic Data Consortium, University of Pennsylvania.
Lehmann, H-M., Hoffmann, S., & Schneider, P. (2002). BNCweb interface, v. 2.0. http://homepage.mac.com/bncweb/manual/bncwebman-home.htm.
Lewis, M. (1993). The lexical approach. Hove: LTP.
Moon, R. (1996). Fixed expressions and idioms in English. Oxford: Clarendon Press.
Moon, R. (1998). Frequencies and forms of phrasal lexemes in English. In A. P. Cowie (Ed.), Phraseology: Theory, analysis, and applications (pp. 79–100). Oxford: Clarendon Press.
R Development Core Team. (2007). R: A language and environment for statistical computing. R Foundation for Statistical Computing. Vienna, Austria. http://www.R-project.org.
Rayner, K. (1998). Eye-movements in reading and information processing: 20 years of research. Psychological Bulletin, 124 (3), 372–422.
Schmitt, N. (2005). Formulaic language: Fixed and varied. ELIA: Estudios de Lingüística Inglesa Aplicada, 6, 13–39.
Schmitt, N., Grandage, S., & Adolphs, S. (2004). Are corpus-derived recurrent clusters psycholinguistically valid? In N. Schmitt (Ed.), Formulaic sequences (pp. 127–51). Amsterdam/Philadelphia: John Benjamins Publishing Company.
Schmitt, N., & Underwood, G. (2004). Exploring the processing of formulaic sequences through a self-paced reading task. In N. Schmitt (Ed.), Formulaic sequences (pp. 173–89). Amsterdam/Philadelphia: John Benjamins Publishing Company.
Sprenger, S., Levelt, W., & Kempen, G. (2006). Lexical access during the production of idiomatic phrases. Journal of Memory and Language, 54, 161–84.
Swinney, D., & Cutler, A. (1979). The access and processing of idiomatic expressions. Journal of Verbal Learning and Verbal Behavior, 18, 523–34.
Titone, D., & Connine, C. (1994). Comprehension of idiomatic expressions: effects of predictability and literality. Journal of Experimental Psychology: Learning, Memory and Cognition, 20 (5), 1126–38.
Tremblay, A., Derwing, B. L., Libben, G., & Westbury, C. (submitted). Processing advantages of lexical bundles: Evidence from self-paced reading experiments, word and sentence recall tasks, and off-line semantic ratings.
Tremblay, A., Tucker, B. V., Gagnon, C., & Lemke, S. (in preparation). The production of four-word sequences.
Underwood, G., Schmitt, N., & Galpin, A. (2004). The eyes have it: An eye-movement study into the processing of formulaic sequences. In N. Schmitt (Ed.), Formulaic sequences (pp. 153–72). Amsterdam/Philadelphia: John Benjamins Publishing Company.
Wray, A. (2002). Formulaic language and the lexicon. Cambridge: Cambridge University Press.
Part 3
Communicative Functions of Formulaic Language
Chapter 11
A Text in Speech’s Clothing: Discovering Specific Functions of Formulaic Expressions in Beowulf and Blogs
Matt Garley, Benjamin Slade, and Marina Terkourafi
University of Illinois at Urbana-Champaign
Introduction
In this chapter, we consider the functions that formulae perform in two genres which exist in written format as texts, but maintain close links to oral forms, namely Old English (OE) verse, specifically the epic poem Beowulf, and weblogs, or ‘blogs’. We identify five important functions of formulae found in common across OE verse and blogs, classifying these functions as discourse-structuring functions, filler functions, epithetic functions, gnomic functions, and tonic functions. In addition, a sixth type of formulaic function necessarily tied to the written medium, the acronymic function, is identified in both genres. The aim of this chapter is to demonstrate that the formulae found in the Beowulf and blog samples fulfill certain functions which alternately (1) link these emergent text genres to analogous oral forms, in the case of the first five functions mentioned above, and (2) mark the genres as written forms, in the case of the sixth function. The remainder of this chapter is structured as follows: to begin with, we survey previous work on formulaicity and discuss several formal characteristics of formulae identified in the literature. Then, we briefly introduce the blog and Beowulf samples from which we take data for the present study. Next, we introduce the five functions of formulae found in our samples through the discussion of specific excerpts. Finally, we summarize our findings, indicating directions for future research.
Characterizations of Formulae and Their Functions in Previous Work
For the purposes of the current analysis, Wray and Perkins’ (2000, p. 1) characterization of formulaic sequences offers a useful starting point. According to this, a formulaic sequence is:

a sequence, continuous or discontinuous, of words or other meaning elements, which is, or appears to be, prefabricated: that is, stored and retrieved whole from memory at the time of use, rather than being subject to generation or analysis by the language grammar.

This definition can be considered a processing-based definition as it does not identify specific formal properties of formulae, but rather resorts to the notion of holistic storage and retrieval. Despite the psycholinguistic flavor of the above definition, Wray and Perkins (2000) nevertheless engage the notion of function with regard to formulaic expressions, offering what is ultimately a descriptive model reconciling two main functions of formulae: to compensate for the limitations of memory, e.g. as processing short-cuts, time-buying devices, or mnemonics; and to function as identity-marking devices in social contexts. These two functions are unified in a speaker-hearer model, with the compensatory function serving to ensure felicitous production on the part of the speaker, and the socio-interactional function serving to ensure felicitous comprehension on the part of the hearer. Wray and Perkins draw data from various descriptions of formulaic language use in multiple populations, making their type of broad approach useful in making high-level generalizations about formulaic language use. However, while several functions they identified are also found in the samples analysed in the present study, the functions we identify here are more specifically linked to the registers of written and oral texts, and emerge from a close engagement with the texts in question.
Our study thus has the advantage of picking out functions which are intimately related to the genres in question, thereby enabling us to comment also on the function of formulae as stylistic devices delimiting different genres.
Data and Methodology
This study involves two corpora representing different genres: the first is a sample of approximately 300 lines from the Old English epic poem Beowulf, and the second is a blog sample of approximately 800 lines.
For the present analysis, the annotated corpora were examined with a view to the functions of the formulaic expressions they contain, keeping in mind the following questions: what are authors doing with formulae, and what are formulae doing for authors?
The Old English sample
Beowulf is an heroic epic poem written in alliterative Old English verse.1 The date of its composition is much debated, but on the basis of linguistic and palaeographic evidence, it must have been composed sometime between the late seventh and early eleventh centuries C.E.2 Albert Lord, John Miles Foley and others developed oral-formulaic analyses, which ultimately derive from Milman Parry’s research relating traditional poetry (such as the Homeric epics) to modern orally-composed verse (such as that found among largely unlettered poets in the former Yugoslavia). Parry (1928, 1930) argued convincingly that the Homeric epics were composed by poets working within an oral tradition, shown by repeated use of formulae.3 Analysis of unlettered Yugoslavian poets (cf. Lord 1991 and the references therein) demonstrated further that epic poems (of up to 12,000 lines or more) could be composed in real time using ‘ready-made’ phrases established within their poetic tradition. This use of formulae by modern Yugoslavian oral poets in what Lord (1991, p. 77) calls their ‘composition in performance’, that is, their real-time composition of poetry, and the appearance of similar formulae in ‘traditional’ poetry like Beowulf suggests that it too was composed through the use of ‘ready-made’ formulae (cf. Magoun 1953). However, Benson (1966) demonstrated that obviously lettered poets (like Cynewulf) also used formulae and thus that appearance of formulae is no guarantee of ‘orality’. For this reason, Foley (1991) discusses Anglo-Saxon texts as being ‘oral-derived’; that is, composed using the traditional ‘oral’ style, even if the extant texts themselves were produced as written documents.4 The fact that lettered poets continued to employ traditional poetic formulae strengthens the case for the importance of poetic formulae in Anglo-Saxon society, since composition-by-formula clearly survived the development of literacy.
Interestingly, O’Keeffe (1987, 1990) has shown that Anglo-Saxon scribes commonly substituted one formula for another (usually one that is grammatically and semantically equivalent) in the process of copying, suggesting that such formulae were salient even for non-poets. In other words, such oral-derived texts persist in utilizing what Foley (1991, pp. 6–9) calls ‘traditional referentiality’.
Traditional referentiality is similar in some ways to literary allusion, except that, rather than making reference to some particular scene/image in a particular text, traditional referential elements ‘reach out of the immediate instance in which they appear to the fecund totality of the entire tradition . . . bear[ing] meanings as wide and deep as the tradition they encode’ (Foley 1991, p. 7). Thus, unlike literary allusion, a listener/reader need not be familiar with any particular text to understand the deeper meaning of a passage; rather the deeper (connotative) meanings of traditional referential elements are accessible simply through being part of the traditional culture. Put another way, if the formulae are stored (and learned) like lexical items, simply by speaking the language one has access to the meanings (connotative as well as denotative) of the formulae, since the formulae are common to the culture and not embedded in any one particular text. And this is why poets in traditional cultures, even after the advent of literacy, continue to compose texts using the traditional referentiality of poetic formulae, not ‘out of a misplaced antiquarianism or by default, but because, even in an increasingly textual environment, the ‘how’ developed over the ages still holds the key to worlds of meaning that are otherwise inaccessible’ (ibid.).

To investigate the functions of formulaic sequences in Old English epic verse, 323 randomly selected lines (1,757 words) of Beowulf (Klaeber 1950) were scanned for formulae: ll. 1–52, 189–240, 499–661, 1506–1572, 2538–2583a. Formulae in this sample were identified with the help of the following four heuristics:

1. a list of repeated verses in Beowulf drawn primarily from the formulae lists of Orchard (2003) and secondarily from Hutcheson (1995);
2. recurrence of sequences/collocations, determined through search of the Internet-based corpus of the Dictionary of Old English (Healey et al. 1998);5
3. the presence of verse-internal alliteration;
4. non-literality of a sequence.

Any potentially formulaic sequence had to exhibit at least two tokens somewhere in the OE poetic corpus in order to be considered a formula (essentially heuristics (1) and (2)). The last three heuristics served primarily as guides for identifying possible additional formulae not present in the lists of Orchard or Hutcheson. Some recurrent sequences were excluded if their frequent occurrence was expected given the nature of their components, for example, sequences of modal + verb.
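The recurrence requirement described above (a candidate sequence must be attested at least twice in the corpus) can be illustrated with a minimal sketch. This is not the procedure actually used for the study, which relied on published formula lists and searches of the Dictionary of Old English corpus; the function names and the toy corpus below are hypothetical, and the sketch covers only raw recurrence, not the exclusion of expected sequences such as modal + verb or the judgments of alliteration and non-literality.

```python
from collections import Counter


def ngram_counts(lines, n):
    """Count every contiguous n-word sequence across a list of tokenized lines."""
    counts = Counter()
    for tokens in lines:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def recurrent_candidates(lines, n, min_tokens=2):
    """Keep only the sequences attested at least `min_tokens` times."""
    counts = ngram_counts(lines, n)
    return {seq: c for seq, c in counts.items() if c >= min_tokens}


# Toy corpus for illustration only (not the DOE corpus): one verse per entry.
toy_corpus = [
    "wod under wolcnum".split(),
    "wan under wolcnum".split(),
    "heard under helme".split(),
    "wan under wolcnum".split(),
]

# Two-word sequences that recur at least twice in the toy corpus.
candidates = recurrent_candidates(toy_corpus, n=2)
```

Sequences that pass such a filter would still be only candidates; deciding whether they count as formulae remains a matter for the remaining heuristics and for manual inspection.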
Since we are here considering poetic formulae from the standpoint of what is considered formulaic in the linguistics literature (see also above), sequences which are to be considered the ‘same’ formula are taken to be those which have the same essential meaning/function. So if one formulaic sequence can be substituted for another, they are considered to be tokens of the same formulaic type. For example, ‘X under the clouds’ and ‘X under the stars’ are taken to be tokens of the same formulaic type, since they have the same basic meaning and can be substituted for one another (under the right alliterative conditions). Our criterion for categorization is thus akin to Keller’s (1981, p. 100) ‘loose substitutability’, proposed to deal with conversational ‘gambits’. Overall, our system of evaluation is similar to that used in oral-formulaic analysis by OE specialists (esp. Fry 1967, 1968a, 1968b; Niles 1981; Riedinger 1985).

The blog sample

Weblogs, or ‘blogs’, are texts written by one or more authors and posted to a specific Internet address. The blog is a diverse medium with several subgenres, and the format can be used for many purposes: popular blogs are often topical,6 and feature links and commentary. Blogs can also function as photograph repositories, news commentary,7 records for collaborative projects, or advertisements. A survey by Herring et al. (2004) found that the vast majority of blogs are single-author personal journals.

The blog excerpts analysed here were automatically collected in May 2006 for the ICWSM dataset8 by Nielsen Buzzmetrics from blogs published on the Internet. Only blog entries written in English were used, and while there is no way of being certain about the dialect region or even the country of origin of bloggers, those entries that were clearly written by non-native speakers were discarded. Surnames included in the text were changed to ‘NAME’.
The dataset obtained in this way consisted of 7,178 words in 34 blog entries, each entry from a different blog. Due to the anonymity of the medium, specific and accurate demographic information is impossible to ascertain, but both purported males and purported females are represented. The blogs were hosted at three different blogging sites: Blogspot,9 LiveJournal,10 and Wordpress,11 with the majority of the texts being from LiveJournal. These data were annotated by three annotators, who identified formulae according to a quasi-naïve definition: if a stretch of text sounded like a common saying or expression, or functioned in context in a way comparable to
a common saying or expression, it was marked as formulaic. Inter-annotator agreement between two of three annotators sufficed to identify the sequence as a formula.
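The two-out-of-three agreement rule can be sketched as a simple vote count. This is only an illustration under stated assumptions: the chapter does not say how annotated stretches were compared, so the sketch assumes that each annotator's marks are stored as (start, end) token offsets and that only exactly matching spans count as agreement. The function name and the sample spans are hypothetical.

```python
from collections import Counter


def agreed_formulae(annotations, min_votes=2):
    """Return the spans marked as formulaic by at least `min_votes` annotators.

    `annotations` holds one collection of (start, end) spans per annotator.
    """
    votes = Counter()
    for spans in annotations:
        votes.update(set(spans))  # each annotator votes at most once per span
    return {span for span, n in votes.items() if n >= min_votes}


# Hypothetical marks from three annotators on the same blog entry.
annotator_a = {(0, 1), (5, 8)}
annotator_b = {(5, 8)}
annotator_c = {(0, 1), (10, 12)}

formulae = agreed_formulae([annotator_a, annotator_b, annotator_c])
```

Under a looser assumption, such as counting overlapping rather than identical spans as agreement, the set of accepted formulae could differ.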
Functions of Formulae in OE Poetry and in Blogs

In this section, we present the major findings with regard to the textual functions that formulae serve in our data. We find six major categories of functions: the first five, namely discourse-structuring functions, filler functions, epithetic functions, gnomic functions, and tonic functions, link the texts in question to oral analogues. The remaining category, acronymic functions, serves to situate the texts firmly in the written register by exploiting specific properties of written texts. In the examples given in the following analysis, bold typeface indicates stretches of text identified as formulaic by the previously introduced heuristics.
Discourse-structuring functions

Formulae with discourse-structuring functions serve primarily to organize discourse, either as (1) instructions allowing the author and reader to organize sections of text, or (2) instructions which qualify the manner in which the text is to be taken; hedges, for example, would fall into this latter category. With regard to previous work on formulaic language, discourse-structuring formulae would include most of Keller’s (1981) conversational gambits, particularly those which introduce the frame of the conversational topic and identify the social context of the conversation.
Discourse-structuring formulae in the Beowulf sample

In the sample from Beowulf, discourse-structuring formulae appear less often than in the blog sample, but occupy especially salient positions in the poem. One prominent discourse marker occurs as the first word of the poem: hwæt. Hwæt literally means ‘what’, but in this context it is usually rendered in Modern English as ‘listen’, ‘lo’, ‘well’ and so on. In order to understand the function of the introductory hwæt, we discuss the employment of hwæt elsewhere in the poem.

Hwæt is a marker employed in the representation of spoken discourse. It occurs a total of five times in Beowulf, once as the first word of the poem, where it occurs as part of the ‘narrator’s’ text. The other four instances all
occur in the discourse of the characters, twice as the opening of their speech (ll. 530, 1652). When it occurs as the first word of a character’s speech, hwæt signals the character’s intention to begin a dialogue or a narrative, as in Bwf. 1652, where it is the first word spoken by Beowulf to King Hrothgar when he triumphantly returns to the king’s meadhall after slaying the she-troll who had murdered one of the king’s retainers. More interesting perhaps is the use at Bwf. 530, where it indicates the speaker’s intention to take the floor. Here one of Hrothgar’s men, Unferth, has been verbally challenging and disparaging Beowulf, and hwæt is the first word Beowulf speaks in his reply (perhaps interrupting Unferth).

(1) ‘Hwæt, þu worn fela, wine min Hunferð,
beore druncen ymb Brecan spræce;
sægdest from his siðe . . .’ (Bwf. 530–2a)12
‘Hey, Unferth my friend, drunk on beer, you’ve had a lot to say about Breca, talked about his adventure . . .’

However, hwæt may also occur in the midst of a character’s speech. It then serves to focus the addressee’s attention on what follows, as in the midst of Hrothgar’s long speech to Beowulf. As the poem’s opening word, hwæt serves a different function: it signals to the audience that the poet is about to speak (rather than, say, write).

(2) Hwæt! We Gardena in geardagum
þeodcyninga þrym gefrunon;
hu ða æþelingas ellen fremedon. (Bwf. 1–3)
Listen! We heard of the glory of the Spear-Danes in days of yore, of those clan-kings: how those nobles performed courageous deeds.

The orality of hwæt is further emphasized by the choice of verb in the following sentence, namely gefrunon ‘heard’. The story of the Danish kings is not something to be read, but something to be heard – notwithstanding the fact that this occurs in a written text.
The use of we ‘we’ is also noteworthy, implying the collective of poet/‘speaker’ and audience/‘hearers’ and their joint knowledge of the tradition within which the following discourse is embedded (see the discussion of ‘traditional referentiality’ above). The use of hwæt as the opening word is not unique to Beowulf: eight other OE poems begin the same way,13 testifying to hwæt’s formulaic character in this function. The importance of these examples is not simply that they are
formulaic, but that they unmistakably invoke an ‘oral’ setting. In other words, above and beyond its discourse-structuring function, the use of the hwæt-sequence marks the text as spoken.
Discourse-structuring formulae in the blog sample

Discourse-structuring formulae are quite common in the blog sample. This is not unexpected, given Herring, Scheidt, Bonus, & Wright’s (2004, p. 1) characterization of ‘journal’ blogs as ‘internal (the blogger’s thoughts and internal workings)’. The present corpus consists solely of ‘personal journal’ type blogs, which function to capture the author’s internal dialogue. This stream-of-consciousness style is characteristic of personal journal blogs, and discourse-structuring formulae help arrange and organize each text into a coherent whole. For this reason, blog authors commonly use formulae with discourse-structuring functions.

A formulaic introduction to a blog post is generally a one-word formula which can be considered formulaic because it consistently appears at a specific position in the text:

(3) [Beginning of post] Okay, I need help.

The use of okay, a very frequent element in spoken discourse, serves a discourse-structuring function when introducing a blog post. Condon (2001, p. 492) discusses okay and well as discourse markers which ‘orient to a default organization that contrasts unmarked, routine sequences and marked, nonroutine departures from expected events’. In our data, okay in turn-initial position enacts a transition – for both the author and the audience – by directing attention from unknown previous engagements to the blog author’s text.

Another common formula with a discourse-structuring function is I guess. This hedge, which softens the impact of the following statement, can function to display apathy about a proposition. I guess also distances the text from more formal genres of writing, for example, academic texts or news – linking the text to oral forms and informal written forms, such as personal diaries.

(4) he is turning me into an uber nerd. i guess it’s ok

The third type of discourse-structuring formula presented here from the blog corpus is speaking of X.
This formula links a newly-introduced topic to
a previously-mentioned item by pointing out a relation (however tenuous) between them. By linking the two discourse items, this formula indicates a topic change while maintaining discourse coherence between the old and new topics.

(5) Yeaah. I shall pick Joci up and amuse her, then fall asleep. Speaking of sleeping, i’m exhausted[. . . ]

These three types of formulae are representative of the broad range of discourse-structuring formulae found in blogs.
Filler functions

We define fillers as words or phrases which do not significantly advance the discourse or introduce new information. Wray and Perkins (2000, p. 16) mention fillers as one of several subtypes of time-buying formulae which ensure ‘planning time without losing the turn’. Examples given include ‘If the truth be told’ and ‘If you like’. In our analysis, where the notion of ‘turn’ is problematic due to the nature of the texts, we posit two functions for fillers: fillers can buy time, when composing texts in real time, or serve to acknowledge and reinforce the oral conventions of each genre through reference to real-time composition. In OE verse, fillers can additionally be helpful in satisfying metrical constraints, and in blog texts, fillers serve to ‘pad’ entries which might otherwise be considered too short.
Filler formulae in the Beowulf sample

Some frequent formulae in OE verse appear to be employed solely in order to satisfy metrical requirements. One especially frequent formulaic type of this sort is X beneath heavens/clouds/skies/stars. All of these essentially mean ‘on earth’ and do not add any propositional content as they do not typically occur in contexts where there is any doubt that the action is taking place on earth. Examples in Beowulf include Bwf. 714a wod under wolcnum ‘(he) waded/advanced under the skies/clouds’, and Bwf. 505a gehedde under heofenum ‘(he) heeded under the heavens’.

Another type of filler formula is the alliterative bridge, a verse whose primary function is to fulfil the metrical requirements of OE verse by providing an alliterator for another verse which would otherwise lack alliteration.14 Such bridges often involve what is known in Old English
studies as ‘variation’, that is, the repetition of a concept or term present in the preceding verse or line, usually in the on-verse15 of the following line. Bwf. 231–3, which describes the Danish coastguard’s curiosity about the unknown men he observes approaching the kingdom, displays two ‘bridges’ in swift succession (232a fyrdsearu fuslicu ‘eager war-devices’, a variation on 231b beorhte randas ‘bright shields’; and 233a modgehygdum ‘mind-thoughts’, a variation on fyrwyt ‘curiosity’):

(6) beran ofer bolcan beorhte randas
fyrdsearu fuslicu; hine fyrwyt bræc
modgehygdum hwæt þa men wæron. (Bwf. 231–33)
(They) bore over the gang-plank, bright bossed shields, ready war-gear; in him (the coast-guard) curiosity rose up, the thoughts of his mind (about) who these men were.

Both 232a and 233a are formulaic, and here it appears that these formulaic alliterative bridges are used solely as fillers, as they simply paraphrase nouns of the preceding b-verses. However, without these formulaic fillers, the lines would violate the alliterative requirement of OE meter (as there would be no stressed alliterator in common between the two verses).

Filler formulae in the blog sample

The blog sample, which can be characterized by an overarching stream-of-consciousness style, displays two primary principles at work with regard to fillers. First, blog authors are under a compulsion to post at regular intervals, and furthermore produce sizeable chunks of text. Because of this, the author in each case is searching for further topics of discussion. And second, bloggers often type as they think, in which case items like Um . . . function as quasi-time-buying devices, despite the editable written format.

(7) [ . . .] Well yeah and then on sunday me and chelsey went to kmart that was fun. Yeah i have nothing else to say. [End of post]

(8) First entry . . . nothing to say really . . .
The last part of (7) above may be considered a formulaic closing, which would fall under the set of discourse-structuring functions, but the repeated yeah here does fill space as well. In this sense, (7) is an apt reminder of the frequent complementarity of functions displayed by the formulae in this
paper. (8) is an example of a formula serving a filler function occurring at the beginning of the post. Together, these serve as examples of space-filling text as well as evidence that blog-writers do in fact search for things to ‘pad’ their entries. Even though there is no concrete deadline for submission of a blog, and the author theoretically has an unlimited amount of time to revise their text, the established oral-like conventions provide an impetus to produce this style.
Epithetic functions

Epithets are words or phrases conventionally associated with certain characters or people. Formulae with epithetic functions serve to evoke entire characters or personalities by reference to some prototypical characteristic attribute. In typical spoken discourse, the participants share a great deal of background knowledge, and it is infelicitous and inconvenient to repeat lengthy descriptions word-for-word. By exploiting prior knowledge of characters or personages under discussion, formulae with epithetic functions refer to these characters or personages by way of their most salient characteristics, additionally reinforcing the characterization of the subject in those terms. In contrast, epithets are less useful in most written texts, which are prototypically aimed at broader, possibly unfamiliar audiences, who may not share the relevant background knowledge.
Epithetic formulae in the Beowulf sample

Foley (1991), analysing epithets like ‘swift footed’ as characterizing Achilleus in Iliad XXIV.559, argues that the use of this epithet (in this case, while Achilleus is not in motion) is a traditionally established method of bringing the full image of Achilleus to the minds of the audience (Foley 1991: 142f.). In our terms, this involves the formula’s Relation to a (Culture-Specific) Frame – namely the figure of Achilleus and his characteristic attribute of being ‘swift footed’, whether or not this attribute is directly relevant to the immediate context of use of the formula.

Likewise, when Beowulf is referred to at 2539a as heard under helme ‘hard/fierce under [his] helmet’ it does not seem to be because the poet is trying to convey that Beowulf was unusually fierce (or unusually wearing a helmet) due to anything in the immediate context of this verse. Rather, heard under helme is a characteristic phrase applied repeatedly to Beowulf (and in fact solely to Beowulf within the entire OE corpus) and invokes a
distinctive image of the character. Arguing that heard under helme is simply a metrical filler in this case blatantly ignores this character-identifying function. Consider the context of the verse:16

(9) Aras ða bi ronde rof oretta
heard under helme, hiorosercean bær
under stancleofu strengo getruwode. (Bwf. 2538–40)
Then the bold warrior (= Beowulf) arose with his shield, severe under his helm; he wore a battle-shirt, under the stone cliffs, trusted in the strength . . .

The employment of epithets in Old English verse can be seen as a ‘shortcut’, a way of summoning up the essence of a character by way of a short phrase referring to one of the character’s defining qualities, something which sets him/her apart from others. Epithets are thus a ‘low-cost’ device employed by poets to create a narrative which is readily comprehensible to an audience who share the same system of traditional referentiality (observed above).

Epithets, at least in OE verse, may sometimes serve an additional function, that of a (metrical) filler, discussed above. In other words, an epithet can be used to ‘buy time’ for the poet so that he can use his processing resources for other purposes, such as the composition of the next line. The formula may thus fulfill both goals: its use eases the processing burden (speaker-oriented), while at the same time it invokes an easily identifiable and traditionally-licensed image of the character (hearer-oriented).
Epithetic formulae in the blog sample

In the blog sample, no convincing examples of epithets were marked as formulaic by the annotators. This is perhaps unsurprising: epithets serve to summarize particular ‘characters’, and in a hypothetical blog-genre epithet, this would refer to the summary of a person’s characteristics in a way that would be formulaic only within the frame of reference to that particular person. As the corpus contained at most one blog post for any given blog, there was simply no chance for annotators to identify characters built by blog authors through successive posts, for example, through the use of a catchphrase. However, we expect that examples like this exist, and a different kind of analysis involving multiple successive posts from the same blog would be more likely to yield such data. The following examples from blogs
found through Google searches provide some encouraging preliminary data in this direction:

(10) Miss ‘I’mqueenofthehouseoratleastIactlikeIam’ Madison started school this year.17

(11) ‘I rely on the NRA because I’m too god damned lazy to do it myself and rather use my time to bitch from the sidelines.,’ said Mr. ‘I’m not comfortable with McCain.’18

(12) Aaaaanyways, the real blog prompt today is a certain person’s blog. Miss ‘I-don’t-give-a-fuck-about-what-people-think-about-me’ (Miss Idga) says at one time, ‘I don’t fucking care’, then says [ . . .]19

This Miss-X/Mr.-X is a formula which would be expected in spoken discourse, or in informal written texts. It fulfills an epithetic function in that it summarizes a character (in this case most likely a real person) with reference to either previous actions (I’m the queen of the house or at least I act like I am) or an attributed quote (I’m not comfortable with McCain). The incidence of this epithetic function thus further links the blog genre to oral forms.
Gnomic functions

Formulae with gnomic functions serve to express the speaker’s conceptions of general truths about the world. These conceptions are not idiosyncratic to individual speakers, but instead must have a broader socio-cultural currency.
Gnomic formulae in the Beowulf sample

Gnomic expressions in Beowulf include the example in (13), spoken to Hrothgar by Beowulf, when Hrothgar is grieving for his lost friend:

(13) selre bið æghwæm
þæt he his freond wrece þonne he fela murne . . . (Bwf. 1384–5a)
it is better for every man that he avenge his friend, than mourn over-much . . .
Here perhaps it is not surprising to find such an expression, since it is part of the cited discourse of the poem. But the narrator frequently makes such gnomic comments as:

(14) . . . Swa sceal mæg don. . . (Bwf. 2166b)
So should a kinsman act . . .

Often these gnomic statements take the form swa sceal/sceolde Noun Verb ‘So ought N V’ or swylc sceal/sceolde Noun Verb ‘Such ought N V’. Plot-elements of Beowulf conform to gnomes found in other OE verse:

(15) . . . þyrs sceal on fenne gewunian
ana innan lande . . . (Maxims II 42a–3b)
. . . A troll shall dwell in the fen, alone in the land . . .

(16) . . . Draca sceal on hlæwe,
frod, frætwum wlanc . . . (Maxims II 26b–7a)
. . . A dragon shall be in a barrow, old and wise, proud in treasure . . .

The þyrs ‘troll’ Grendel lives in a fen, and the dragon of Beowulf lives in a barrow, guarding treasure for over three hundred years.
Gnomic formulae in the blog sample

In personal journal blogs, formulae with gnomic functions are also quite common, in the form of truisms that further illuminate or reinforce a certain point with respect to the blogger’s life. The following excerpts from the blog sample provide evidence of these kinds of expressions:

(17) Taking a stand takes too much effort. Standing aside hurts just as much. Fucked either way.

(18) It became quite a party and of course there are cupcakes, signage, boxes and foil laying around. DISHES too . . . Fun always has its consequences.

As Sorrell (1992, p. 33) remarks, ‘[t]he sententious expression of wisdom is a hallmark of oral cultures’ (cf. Frye, 1969, p. 7; Bloomfield and Dunn, 1989, pp. 106–49, esp. pp. 135–37). In other words, the fact that gnomic
formulae appear in OE verse and blogs reflects the oral-derived nature of both genres.
Tonic functions

Formulae with tonic functions deal with emotional impact and tone along several dimensions, for example, seriousness versus levity, irony versus earnestness, and so on. Perhaps one of the most oft-discussed formulae in the literature, kick the bucket, serves this sort of function by backgrounding the event of death, replacing the more serious alternative expression with a non-serious and light-hearted euphemism. However, formulae with tonic functions do not always lean toward the non-serious pole; the opposite is also possible. As will be seen below, while the examples of formulae with tonic functions from Beowulf impart a darker, more serious mood, the blog examples generally serve to distance the speaker from his subject emotionally, making the overall tone more light-hearted. Tonic functions must then be broadly understood as serving to alter the type or degree of emotive tone.
Tonic formulae in the Beowulf sample

As discussed above, X under wolcnum ‘X under clouds’ usually appears as a filler. However, one particular instance of X under wolcnum acts to signal a particular tone. This formula, discussed by Riedinger (1985, pp. 299–303), is wan under wolcnum ‘dark under clouds’, which always signals ‘ominous darkness accompanying supernatural events’ (Riedinger, 1985, p. 300). It occurs at Bwf. 651a, describing the monster Grendel’s approach.20 Similarly, Bwf. 528a nihtlongne fyrst ‘the space of a whole night’ is formulaic, found in other OE poems, and is another example of a tonic formula which ‘always signifies a terrifying period of time prior to a battle’ (Riedinger, 1985, p. 296).
Tonic formulae in the blog sample

Tonic formulae are fairly prominent in the blog sample, and this is perhaps expected: as noted in Herring et al. (2004, p. 1), blogs of the personal journal type reflect the blogger’s internal thoughts and workings. In the following cases, as in the prototypical case of kick the bucket, the formulae
serve to bring some measure of levity to what might otherwise be considered overly serious.

(19) Hah, I wonder how many times this has been asked but it does seem relevant. How is it I can ‘spill my guts’ to strangers but not to my closest of friends.

(20) I’m going to have a big overdraft fee before too long. Then shit will really hit the fan.
Acronymic functions

The final category of functions of formulae found in our texts is a special case among the categories: acronymic functions rely on writing systems, calling upon knowledge of alphabets and other written systems to concisely represent a larger item. We argue that these formulaic functions, in contrast to the previous five, necessarily situate these texts in a written register. Crucially, the combination of the acronymic function of formulae and the five earlier functions related to oral form marks these texts as transitional or hybrid genres.
Acronymic formulae in the Beowulf sample

In contrast to the gigabytes of server space afforded to bloggers (see below), the technology available to Anglo-Saxon scribes was somewhat more limited, as the production of vellum for manuscripts was a costly process. Therefore scribes employed a number of abbreviating devices in order to conserve vellum. Some of these are unremarkable, such as the omission of the -m of dative -um endings, where the omitted m is indicated by a line over the u. However, such abbreviating devices – amounting essentially to little more than spelling conventions – remain distinct from the type of formulaic sequences we are concerned with in this paper.

A more interesting abbreviating device is the use of runes. The runic alphabet(s) were used to write Germanic languages prior to and for some time after the Christianization of the British Isles and Scandinavia. The runic symbols have acrophonic names, thus the symbol used to represent /m/, ᛗ, is called mann ‘man’. In a number of OE manuscripts (which are written primarily in roman characters), runic symbols are not intended to
be read as their phonetic values but instead as their acrophonic names. For instance, in Beowulf, the rune eðel ‘homeland’, used in the Anglo-Saxon runic alphabet to represent /œ/, is used three times in its acrophonic value eðel, 520b, 913a, 1702a.

The OE poet Cynewulf (ca. ninth/tenth c. C.E.) employs runes with both acrophonic and phonetic values. In the poems themselves the runes must be given their acrophonic values, but given their phonetic values they also make up an acrostic signature spelling out the name of the poet.21 The fact that Cynewulf can put runes to this dual purpose reflects the written side of the character of the extant OE verse. Read aloud, the runes would have to be given their acrophonic values (otherwise both meter and sense would be violated), and thus could not be given their phonetic values. Cynewulf’s acrostic signatures work only when the runes are given their dual acrophonic-phonetic values.22
Acronymic formulae in the blog sample

Internet-specific acronyms like lol (laughing out loud), wtf (what the fuck), btw (by the way), are by their very nature formulaic – what they refer to must be well known in order for them to be correctly interpreted. While example (22) below, from the blog corpus, makes clear that wtf can stand in for ‘what the fuck’ as part of a sentential unit, lol in (21), generally accepted as ‘laughing out loud’, cannot occur in a construction like *I’m lol. This suggests that lol, in particular, has become lexicalized as a separate unit from laughing out loud, possibly along the lines of an uninflected discourse marker.

(21) lol I made a site today.

(22) Thought bolt through my hollow dome as I think about WTF am I gonna do. Well not so much wtf am I going to do as [ . . .].

In other genres of computer-mediated communication like text-messaging, where message size is limited, these acronyms function as space-saving devices. In the blog genre, however, there is a theoretically unlimited space for the conveyance of a message, so acronyms instead serve to position the text as a written genre, more specifically as a subgenre of computer-mediated communication. We suggest that these acronyms have been conventionalized to serve as markers of group membership among Internet users, exemplifying the function of identity-marking observed above.
Conclusions

In this study, we identified common functions of formulae in Old English verse and blogs, in both cases simultaneously linking them to oral and written genres. Thus both the Beowulf and blog samples appear to occupy a transitional position with respect to the macro-level genres of oral and written communication. With regard to the broader field of study of formulaic language, we have demonstrated that the majority of the functions identified in our samples from OE poetry and blogs – despite their physical graphic encoding – link the respective genres with more prototypically oral texts, though one of the functions identified the texts as belonging to the written register. Examination of formulae and their functions thus emerges as a useful tool for the study of genres. This line of research could be fruitfully extended to other genres in order to determine the types of functions formulae may fulfill therein, adding genre studies to the long list of disciplines that stand to profit from an in-depth study of formulaic language.
Notes
1 Alliterative verse uses alliteration as its main structural device for unifying lines, rather than, for instance, rhyme. Two words alliterate when they begin with the same consonant (all vowel-initial words alliterate, as they all begin with no initial consonant). In OE poetry, each line is divided into two verses. The first is called an on-verse or a-verse, the second is the off-verse or b-verse. The first stressed word of the off-verse of a line must alliterate with a stressed word of the on-verse.
2 On the controversy surrounding the date of composition of Beowulf, see the collection of papers in Chase (1997). For persuasive linguistic arguments for dating Beowulf between 685–825 C.E., see Fulk (1992).
3 Cf. Nagy (1996, 2004a, 2004b) who continues the oral-formulaic analysis of Homer.
4 Such ‘oral-derived’ texts may then exhibit some characteristics of literary texts as well, since they occupy a position on the boundary between ‘oral’ and ‘written’.
5 This corpus contains all extant Old English text, including all OE verse, prose and glosses.
6 See, e.g. the Language Log for an example of a well known linguistics blog – Google lists nearly 6,000 links to this site.
7 TIME.com, for instance, the website of the newsmagazine TIME, has several high-profile blogs.
8 ICWSM: http://www.icwsm.org
9 Blogspot: http://www.blogspot.com
10 LiveJournal: http://www.livejournal.com
11 Wordpress: http://www.wordpress.com
12 All quotations from Beowulf follow the text of Klaeber (1950), with macrons and underdotting removed. Citations from Beowulf are referred to as Bwf. followed by a line number. All other Old English poems referred to can be found in Krapp and Dobbie (1931–53), and are referred to by the accepted title followed by line number. All translations from Old English are ours.
13 These are: Andreas, Exodus, Fates of the Apostles, Dream of the Rood, Juliana, Vainglory, Solomon and Saturn, and Judgement Day II.
14 On alliterative bridges in OE verse, see, for instance, Creed (1959:448–49) and Foley (1990:230).
15 See fn. 1 above.
16 Of course this verse alliterates (on heard) with the following verse (on hiorosercean) in accord with the metrical requirement of OE meter, and if it were removed the structure of the poem would be compromised, likewise if it were a verse without an h-alliterator since something must alliterate with hiorosercean in 2539b. So one might argue that it is simply an (essentially meaningless) metrical filler used to fulfil the requirements of the meter. Yet hiorosercean bær is likely also somewhat redundant, since at 2523b–24a we were already told that Beowulf has armour and shield. So in fact all of line 2539 is propositionally redundant in terms of the audience’s knowledge, and nothing would be lost in plot or sense if the entire line were missing. Hence the formula may be used to invoke Beowulf’s full (traditional) image.
17 http://www.kylieclark.com/blog/ Accessed 7 April 2008
18 http://michaelbane.blogspot.com/2008/03/butt-alligators-etc.html Accessed 10 April 2008
19 http://draco8876.com/personal/blog.html Accessed 7 April 2008
20 The same formula is found in the Dream of the Rood 55a, where it describes the darkness that covered the earth in the hours of Christ’s crucifixion; in Guthlac B 1280a, as the sun sets on the dying saint’s last day; and in Andreas 837a, before Andreas is eaten by cannibals.
21 An acrostic is a piece of writing in which some recurrent feature (e.g. first letter of the line) spells out another message. Here acrostic refers to the fact that the runes in with their phonetic values spell out Cynewulf’s name.
22 See Das (1942) on Cynewulf’s poems.
References Benson, L. D. (1966). The literary character of Anglo-Saxon formulaic poetry. Publications of the Modern Language Association of America, 81, 334–41. Bloomfield, M. W., & Dunn, C. W. (1989). The role of the poet in early societies. Cambridge: D. S. Brewer. Chase, C. (Ed.). (1997). The dating of Beowulf. Toronto: University of Toronto Press. Condon, S. L. (2001). Discourse ok revisited: Default organization in verbal interaction. Journal of Pragmatics, 33, 491–513. Creed, R. P. (1959). The making of an Anglo-Saxon poem. English Literary History, 26 (4), 445–54.
Das, S. K. (1942). Cynewulf and the Cynewulf canon. Calcutta: University of Calcutta Press. Foley, J. M. (1990). Traditional oral epic: ‘Beowulf’, the ‘Odyssey’ and the Serbo-Croatian Return Song. Berkeley: University of California Press. Foley, J. M. (1991). Immanent art: From structure to meaning in traditional oral epic. Bloomington IN: Indiana University Press. Fry, D. K. (1967). Old English formulas and systems. English Studies, 48, 193–204. Fry, D. K. (1968a). Old English formulaic themes and type-scenes. Neophilologus, 52, 48–54. Fry, D. K. (1968b). Variation and economy in Beowulf. Modern Philology, 65, 353–56. Frye, N. (1969). Mythos and Logos. Yearbook of Comparative and General Literature, 18, 5–18. Fulk, R. D. (1992). A history of Old English meter. Philadelphia, PA: University of Pennsylvania Press. Healey, A. d., Haines, D., Holland, J., McDougall, D., McDougall, I., & Xiang, X. (1998). The dictionary of old English corpus in electronic form. Toronto: University of Toronto. [http://www.press.umich.edu/titleDetailDesc.do?id=6544] Herring, S. C., Scheidt, A. L., Bonus, S., & Wright, E. (2004). Bridging the gap: A genre analysis of weblogs. Proceedings of the 37th Hawaii International Conference on System Sciences (HICSS-37). Los Alamitos: IEEE Computer Society Press, Track 4, p.40101.2. Hutcheson, B. R. (1995). Old English poetic metre. Cambridge: D.S. Brewer. Keller, E. (1981). Gambits: Conversational strategy signals. In F. Coulmas (Ed.), Conversational routine: Explorations in standardized communication situations and prepatterned speech (pp. 93–113). The Hague: Mouton de Gruyter. Klaeber, F. (Ed.). (1950). Beowulf and the Fight at Finnsburg. Boston: Heath. Krapp, G. P., & van Kirk Dobbie, E. (Eds.). (1931–53). The Anglo-Saxon poetic records: A collective edition. Vols. I–VI. New York: Columbia University Press. Lord, A. B. (1991). Epic singers and oral tradition. Ithaca, NY: Cornell University Press. Magoun, F. P., Jr. (1953). 
The oral-formulaic character of Anglo-Saxon narrative poetry. Speculum, 28, 446–67. [reprinted, Essential articles for the study of Old English poetry, In J. B. Bessinger, Jr., & S. J. Kahrl (1968) (Eds.), pp. 219–51. Hamden, CT: Archon]. Nagy, G. (1996). Poetry as performance: Homer and beyond. Cambridge: Cambridge University Press. Nagy, G. (2004a). Homeric responses. Austin: University of Texas Press. Nagy, G. (2004b). Homer’s text and language. Champaign, IL: University of Illinois Press. Niles, J. D. (1981). Formula and formulaic system in Beowulf. In J. M. Foley (Ed.), Oral traditional literature: A festschrift for Albert Bates Lord (pp. 394–415). Columbus, OH: Slavica. O’Keeffe, K. O. (1987). Orality and the developing text of Caedmon’s Hymn. Speculum, 62, 1–20. [reprinted in R. M. Liuzza (2002) (Ed.), Old English Literature (pp. 79–102), New Haven, CT: Yale University Press]. O’Keeffe, K. O. (1990). Visible song: Transitional literacy in Old English verse. Cambridge: Cambridge University Press.
Orchard, A. (2003). A critical companion to Beowulf. Cambridge: D.S. Brewer. Parry, M. (1928). L’Epithète traditionnelle dans Homère: essai sur un problème de style homérique. Paris: Société Editrice ‘Les Belles Lettres’. [trans. A. Parry and rpt. in Parry, M. (1971), pp. 1–190.] Parry, M. (1930). Studies in the epic technique of oral verse-making. 1. Homer and Homeric style. Harvard Studies in Classical Philology, 41, 73–147. [rpt. in Parry, M. (1971), pp. 266–324]. Parry, M. (1971). In A. Parry (Ed.), The making of Homeric verse: The collected papers of Milman Parry. Oxford: Clarendon. Riedinger, A. R. (1985). The Old English formula in context. Speculum, 60, 294–317. Sorrell, P. (1992). Oral poetry and the world of Beowulf. Oral Tradition, 7 (1), 28–65. Wray, A., & Perkins, M. (2000). The functions of formulaic language: An integrated model. Language & Communication, 20, 1–28.
Chapter 12
The Semantic Structure of Arabic Idioms Ashraf Abdou The University of Manchester
Introduction Idioms have often been regarded as examples of conventionalized use of figurative language. This standpoint suggests two phases in the life history of an idiom. First, the expression comes out as an instance of ingenious exploitation of one or more figures of speech such as metaphor and metonymy. Second, gradually, because of its repeated occurrence, the originally creative expression shades into an institutionalized unit and becomes part of the phraseological repertoire of the language. Corpus-based studies of idioms help reveal their semantic properties that could be otherwise difficult to uncover (see e.g. Hümmer, 2007). In the case of Arabic idioms, for instance, relying only on introspection and existing dictionaries has some serious shortcomings. The former is by definition highly subjective and, moreover, can be influenced by memory. As to existing Arabic dictionaries, they are not being updated regularly, and, in general, are not based on comprehensive examinations of large corpora. This gives rise to the subjectivity problem again, and also, it may lead to the absence of some (new) idiomatic meanings from their entries, not to mention the absence of the idioms themselves in some cases. Due to its very nature, corpus data can help tackle these problems in a more satisfactory, objective manner. For example, as to the subjectivity problem, corpus data may be regarded as a poll of how different speakers use and perceive idioms (Riehemann, 2001). Also, drawing on sizeable, appropriately designed corpora increases the chances of detecting all the meanings an idiom might have and any (new) developments in its usage. Previous research on idioms has shown that investigating their semantic properties is a crucial step for explaining their formal behaviour, in particular their lexical and syntactic variation. For instance, Riehemann (2001) found that certain types of syntactic variation, such as the modification of
individual idiomatic constituents and passivization, are only possible with semantically decomposable idioms. Examining the semantic structure of idioms may also shed light on the nature and range of their discursive functions. For example, Moon (1998) found that 89 per cent of the expressions in the class designated metaphors in her data (this class practically subsumes all the range of phenomena labelled idioms by the present study) have some evaluative content. As a step further in this direction, examining the figurative processes underlying evaluative idioms could help explain how and why they differ in respect of their orientation, to convey positive, negative, or ambivalent attitudes (see Fernando, 1996 and Moon, 1998). Furthermore, psycholinguistic research shows that often in idiom comprehension the literal meaning of the expression is not deactivated or absent (see e.g. Cacciari & Tabossi, 1988). In cognitive-linguistic terms, this means that when processing an idiom, a complex scene that comprises both its literal and non-literal meanings is evoked (see Langacker, 1987). Backed up by these findings, and others, Langlotz (2006) argues that ‘idioms grossly derive their communicative and cognitive functionality from their status as complex scenes’ (p. 108). Despite a growing need for corpus-based research on Arabic idioms (and Arabic phraseology in general) to meet both practical and theoretical ends, there is still a scarcity of this type of research. The present work is part of a larger project that takes a step to fill this gap. The focus here is on two significant aspects of the semantic structure of Arabic idioms: their underlying figurative patterns and their isomorphism.
Structure and Conventions of the Paper
The following section presents a characterization of the term idiom and briefly discusses the notion of motivation which features in many linguistic studies of idioms. The next section sets forth the methodology used. Subsequent sections present the results of the study. The final section discusses these results. When appropriate, the Arabic words shakhṣ ‘person’ and shay’ ‘thing’ have been used in the transliterations to stand for any elements that are supplied by the context to fill any open slots in the structure of the idiom. These elements are represented in the glosses, literal translations, and proposed English translations by the abbreviations sb and sth for ‘somebody’ and ‘something’, respectively. Italics have been used in all these cases.
Generally, when providing examples of verbal idioms, the verbs have been presented in the third person, singular, and masculine form.
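The glossing conventions described here can be pictured as a small record type. The Python sketch below is only an illustration and not part of the study: the class name `GlossedIdiom`, its fields, and the layout produced by `render` are assumptions made for this example, and the sample entry anticipates example (7) from the section on metaphor.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GlossedIdiom:
    """One idiom example: transliteration, word-by-word gloss, translations."""
    number: int
    words: List[str]       # transliterated words; open slots appear as shakhṣ/shay’
    glosses: List[str]     # one gloss per word; open slots appear as sb/sth
    translation: str       # proposed English translation
    literal: Optional[str] = None  # literal translation, when one is given

    def __post_init__(self) -> None:
        # The word line and the gloss line must pair up one-to-one.
        if len(self.words) != len(self.glosses):
            raise ValueError("each transliterated word needs exactly one gloss")

    @property
    def has_open_slot(self) -> bool:
        """True if any gloss contains the placeholder sb or sth."""
        return any(
            part in ("sb", "sth")
            for gloss in self.glosses
            for part in re.split(r"[-/]", gloss)
        )

    def render(self) -> str:
        """Return the example with each gloss aligned under its word."""
        widths = [max(len(w), len(g)) for w, g in zip(self.words, self.glosses)]
        top = "  ".join(w.ljust(n) for w, n in zip(self.words, widths)).rstrip()
        bottom = "  ".join(g.ljust(n) for g, n in zip(self.glosses, widths)).rstrip()
        prefix = f"({self.number}) "
        indent = " " * len(prefix)
        lines = [prefix + top, indent + bottom]
        if self.literal:
            lines.append(f"{indent}lit. {self.literal}")
        lines.append(indent + self.translation)
        return "\n".join(lines)


example_7 = GlossedIdiom(
    number=7,
    words=["’afala", "najm-u", "shakhṣ/shay’"],
    glosses=["set.PST", "star-NOM", "sb/sth"],
    literal="the star of sb/sth set",
    translation="the glory/fame of sb/sth came to an end",
)
print(example_7.render())
```

Keeping words and glosses as parallel lists makes a missing gloss an error at construction time, before an example is ever printed.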
Key Terminology A characterization of idiom In this study, the term idiom is understood as: a multiword unit that occurs within the clause and has a figurative meaning in terms of the whole or a unitary meaning that cannot be derived from the meanings of its individual components. This characterization excludes several types of multiword units from the study, for example, proverbs, restricted collocations, and discourse structuring devices.
Motivation
Motivation refers to the relationship between the literal meaning of the expression and its idiomatic meaning. This notion has two facets: global and constituental.
The former refers to the relationship between the original meaning of the idiomatic expression as a whole and its derived meaning. [Whereas the latter . . . ] involves the relationship between the original, literal meaning of the constituent parts of the idiomatic expression, and the interpretation that those parts receive within the derived reading of the expression as a whole [if they receive any]. (Geeraerts, 2003, p. 436)
A number of points are in order. First, some idioms do not have (full) literal meanings for many native speakers. This can happen for one or more of the following reasons. The idiom could contain one or more unique components that do not occur outside it and do not bear any meaning for most native speakers today. These items are often called cranberry elements. The idiom could also be syntactically idiosyncratic, that is, its structure could violate the familiar rules of the grammar. Finally, the idiom could contain a highly specialized lexical item that is not known to most general language users. Idioms showing one or more of these features are sometimes labelled literally non-compositional idioms (Langlotz, 2006).
Second, even when the idiom has a synchronic, literal meaning, whether it is motivated or not for a certain speaker depends on the depth and breadth of their knowledge of the world. As a consequence, an idiom could be motivated for one speaker but not for another. Third, native speakers may assign the idiom a literal interpretation that is different from its etymological one. This ‘newly’ assigned reading could have a different relationship with the idiomatic meaning, or it could fully/ partially block the semantic motivation of the idiom, rendering the expression opaque. This phenomenon may take place, for example, because one (or more) of the idiomatic components has undergone a process of semantic change, is a homonym of another lexical item, or is polysemous. Idioms containing such elements are sometimes called idioms with garden-path constituents (Langlotz, 2006), as the presence of these items often ‘leads contemporary native speakers up the garden path’ about the etymological, literal meaning of the idiom. Finally, motivation can be lost or weakened. This ‘often results from cultural changes. More often than not, the background image that motivates the figurative shift is an aspect of the material or the immaterial culture of a language community – and when the culture changes, the imagistic motivation may lose its force’ (Geeraerts, 2003, p. 442).
Methodology The sample The initial list of idioms gathered for this study contains 654 examples that belong to Modern Standard Arabic, a primarily written variety of Arabic that is used, for instance, in journalism and literary, religious, and scholarly writings. These examples have been culled from some published dictionaries and examples from everyday interactions and readings. These idioms have been classified according to their syntactic class and the examples belonging to each category have been arranged alphabetically within a distinct list. As a result, six sub-lists have been compiled, containing 58 verb-subject idioms (i.e. idioms that consist at least of a verb and (the syntactic head of) its subject), 304 VP idioms, 162 NP idioms, 92 PP idioms, 31 AP idioms, and 7 AdvP idioms. I have decided to investigate a sample that consists of 10 per cent of the idioms in each of these lists. Two major steps have been taken to select this sample. First, I have calculated how many idioms in each list are going to be examined. Here, as a rule, I have rounded up the fractions to the nearest
higher whole numbers. This step applies to all the lists except for the AdvP idiom one, since it contains only seven examples. I have therefore decided to examine two AdvP idioms. Accordingly, the sample contains 6 verb-subject idioms, 31 VP idioms, 17 NP idioms, 10 PP idioms, 4 AP idioms, and 2 AdvP idioms, that is, 70 examples. Second, the actual examples have been chosen using a piece of online software that generates random numbers within ranges that can be set by the user.1
The data
The study relies on analysing the occurrences of the idioms in the sample mainly in the Newspapers section of Arabicorpus (AC) 2, an online corpus of Arabic developed by Dilworth Parkinson, at Brigham Young University. This section contains texts from five Arabic newspapers, with a total word count of 83,519,701. The newspapers included are Al-Hayat (London-Saudi Arabia-Lebanon), Al-Ahram (Egypt), Al-Thawrah (Syria), Al-Tajdid (Morocco), and Al-Watan (Kuwait). Besides this corpus data, some of the judgements in the paper are based on additional data from some informal interviews with native informants.
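The two sampling steps described under 'The sample' (10 per cent of each list, rounded up, with the AdvP list fixed at two, followed by a random draw within each list) can be sketched in Python as follows. This is a hedged illustration only: the study used an unnamed online random-number tool, so the use of Python's `random` module, the seed, and all function and variable names are assumptions introduced here. Only the list sizes and the rounding rule come from the text.

```python
import random

# Sizes of the six alphabetically ordered sub-lists, as reported in the text.
LIST_SIZES = {"V-SUBJ": 58, "VP": 304, "NP": 162, "PP": 92, "AP": 31, "AdvP": 7}
ADVP_SAMPLE_SIZE = 2  # exception stated in the text for the seven-item AdvP list


def sample_size(category: str, list_size: int) -> int:
    """Return 10 per cent of the list, rounded up to a whole number."""
    if category == "AdvP":
        return ADVP_SAMPLE_SIZE
    return -(-list_size // 10)  # integer ceiling of list_size / 10


def draw_positions(list_size: int, k: int, rng: random.Random) -> list:
    """Pick k distinct positions (1-based) in a list of the given size."""
    return sorted(rng.sample(range(1, list_size + 1), k))


sizes = {cat: sample_size(cat, n) for cat, n in LIST_SIZES.items()}

# The seed is arbitrary and only makes this illustration repeatable.
rng = random.Random(2010)
selection = {cat: draw_positions(LIST_SIZES[cat], k, rng) for cat, k in sizes.items()}

print(sizes)
print(sum(LIST_SIZES.values()), sum(sizes.values()))
```

Under these assumptions the computed sizes agree with the figures given in the text: 654 idioms in the initial list and 70 in the sample. The positions drawn will of course differ from the ones actually used in the study.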
Figurative patterns in Arabic idioms
The figurative patterns underlying Arabic idioms in the data include metaphor, metonymy, the interaction of metaphor and metonymy, semantic extension based on conventional knowledge, hyperbole, and emblematizing. Yet, in five cases, it was not straightforward to classify the idioms under any single one of these categories. Therefore, these examples have been gathered and discussed in a separate section called special cases. To determine these patterns, the study relies on scrutinizing the relationship obtaining between the literal and idiomatic meanings of its examples. In this context, some points should be borne in mind. First, four idioms in the data are literally non-compositional:
(1) qalaba           ẓahr-a    l-mijann-i      l-shakhṣ/shay’
    turn.around.PST  back-ACC  DEF-shield-GEN  to-sb/sth
    lit. he turned around the back of the shield towards sb/sth
    he became hostile to sb/sth
(2) shakhṣᵢ/shay’ᵢ  bi-rummat-i-hi
    sbᵢ/sthᵢ        with-worn.out.piece.of.rope-GEN-POSS
    lit. sbᵢ/sthᵢ with their/its worn out lead
    all sbᵢ (e.g. a group of people) or all sthᵢ
(3) allutayyā            wa-llatī
    REL.F.SG.DIMINUTIVE  and-REL.F.SG
    much ado/lengthy difficult discussions
(4) shakhṣᵢ/shay’ᵢ  ‘alā  bakrat-i           ’ab-ī-hi
    sbᵢ/sthᵢ        on    young.camel.F-GEN  father-GEN-POSS
    sbᵢ (e.g. a group of people) all together or all sthᵢ
Examples (1), (2), and (3) contain cranberry elements that carry no meaning to many native speakers today. These are mijann, rummah, and allutayyā, respectively. Moreover, example (3) is syntactically ill-formed, wherein two relative nouns are coordinated with no relative clauses present. The idiom in (4) contains bakrah which only occurs within very specialized contexts and (therefore) bears no meaning for most Arabic native speakers. Apparently, the presence of these lexical elements and grammatical features renders the idioms unmotivated for native speakers. It might be argued that it is irrelevant from a synchronic point of view to establish the semantic relationships between the idiomatic meanings of such examples and the literal meanings they had for native speakers at some point in the past. However, establishing these relationships, whenever possible, has proved useful in explaining some aspects of the linguistic behaviour of idioms today. Therefore, these examples have been included below. Second, two idioms in the data contain garden-path constituents. In both cases, this veils their etymological motivation.
(5) nār-u-n        ‘alā  ‘alam-i-n
    fire-NOM-INDF  on    mountain-GEN-INDF
    a well-known entity
(6) ḥalqat-u-n     mufragh-at-u-n
    ring-NOM-INDF  cast.in.a.mould-F-NOM-INDF
    lit. a-cast-in-a-mould ring
    a vicious circle
The idiom in (5) is originally based on the fact that a fire that is set on a mountain, particularly during night, is very visible to people passing by,
hence the semantic extension to the state of being well-known. However, the word ‘alam does not have the meaning ‘mountain’ for many native speakers today. Rather, its most frequent meaning is ‘flag’. Therefore, even well-educated native speakers may not be able to fully construct the original, literal meaning, and, as a result, may not be able to comprehend the relationship depicted above. In (6), etymologically the literal meaning of the idiom is ‘a metal ring that is cast in a mould’ and, therefore, is seamless. Classically, the expression was used, typically as a simile, in situations wherein it is impossible to decide on the favourite member of a group of people. In such situations, the group, for example, the sons of a man, is considered similar to a seamless ring of which one cannot come to a conclusion on where it begins/ ends. This meaning of the word mufraghah is still in use. However, the word more frequently denotes other related meanings, that is, ‘emptied’ or ‘hollow’. Native speakers seem to think only of one of these related meanings rather than of the etymologically accurate one when they are asked to re-motivate the expression. The last point that should be taken into consideration here is that, as Radden (2003) rightly points out, people’s experiences and knowledge systems are subjective. Therefore, their ‘characterizations of semantic structures including figurative language may be different’ (p. 408). Therefore, some native Arabic speakers might disagree on some of the analyses below. Table 12.1 shows the distribution of the data over the figurative patterns revealed in the analysis.
Table 12.1 Distribution of Idioms over Figurative Patterns

Figurative pattern                     V-SUBJ   VP   NP   PP   AP   AdvP   Total
Metaphor                                    2   14    8    2    3      -      29
Metonymy                                    2    -    -    -    -      -       2
Interaction of metaphor and metonymy        2   15    3    6    1      2      29
Conventional knowledge                      -    1    2    -    -      -       3
Emblems                                     -    -    1    -    -      -       1
Hyperbole                                   -    1    -    -    -      -       1
Total                                       6   31   14    8    4      2      65
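The totals in Table 12.1 can be cross-checked against the sample size with a few lines of Python. The sketch below is an illustration under stated assumptions: the cell values are transcribed from the table as read here, empty cells are treated as zero, and the variable names are hypothetical. The five special cases and the sample of 70 idioms are the figures given in the text.

```python
# Syntactic classes in the column order of Table 12.1.
CLASSES = ["V-SUBJ", "VP", "NP", "PP", "AP", "AdvP"]

# Cell values per figurative pattern; empty table cells are entered as 0.
TABLE = {
    "Metaphor": [2, 14, 8, 2, 3, 0],
    "Metonymy": [2, 0, 0, 0, 0, 0],
    "Interaction of metaphor and metonymy": [2, 15, 3, 6, 1, 2],
    "Conventional knowledge": [0, 1, 2, 0, 0, 0],
    "Emblems": [0, 0, 1, 0, 0, 0],
    "Hyperbole": [0, 1, 0, 0, 0, 0],
}

SPECIAL_CASES = 5   # idioms discussed separately, outside the table
SAMPLE_SIZE = 70    # total number of idioms in the sample

# Sum across each row, then down each column.
row_totals = {pattern: sum(cells) for pattern, cells in TABLE.items()}
column_totals = dict(zip(CLASSES, (sum(col) for col in zip(*TABLE.values()))))
grand_total = sum(row_totals.values())

print(row_totals)
print(column_totals)
print(grand_total, grand_total + SPECIAL_CASES == SAMPLE_SIZE)
```

On this reading, the 65 idioms in the table and the five special cases together account for the full sample.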
Metaphor
Metaphor is typically based on the relationship of similarity between two concepts that belong to two different domains of human experience. As Kövecses (2002) points out, in addition to real similarity, metaphors may be founded on ‘perceived resemblance and correlations in experience’ (p. 146). Some examples include:
(7) ’afala   najm-u    shakhṣ/shay’
    set.PST  star-NOM  sb/sth
    lit. the star of sb/sth set
    the glory/fame of sb/sth came to an end
(8) bala‘a       l-ṭu‘m-a
    swallow.PST  DEF-bait-ACC
    he swallowed the bait
(9) rimāl-u-n         mutaḥarrikat-u-n
    sand.PL-NOM-INDF  moving-NOM-INDF
    quicksand (i.e. a dangerous situation that is difficult to escape from)
Metonymy
Metonymy is a type of semantic extension that takes place between two concepts that belong to the same domain of experience. In this case, the two concepts conveyed through the literal and figurative readings stand in a relationship of contiguity. Different types of metonymic association exist, for example, the relationships between the parts and the whole, the container and the contained, and the cause and the effect (see e.g. Saeed, 2003). Two examples in the data belong to this category:
(10) inqabaḍa      ṣadr-u     shakhṣ
     contract.PST  chest-NOM  sb
     lit. sb’s chest contracted
     sb felt depressed
(11) tajammada   l-dam-u        fī  ‘urūq-i      shakhṣ
     freeze.PST  DEF-blood-NOM  in  vein.PL-GEN  sb
     the blood froze in sb’s veins
In both cases, the (assumed) bodily reactions stand for their emotional causes. In (10), the tension experienced in the muscles of the chest (or maybe in the heart) stands for its cause: depression. In (11), although the literal meaning violates the truth conditions (after all, blood does not freeze inside people’s veins when they are frightened), it is still the case that ‘the emotional experience [of fear] is felt to be associated with assumed or real changes in body temperature’ (Kövecses, 2002, p. 71) in which this temperature becomes lower than usual. Also, a hyperbolic element may be involved in this idiom. This explains why its literal reading violates the truth conditions.
The Interaction of Metaphor and Metonymy Metaphor and metonymy may interact to provide the basis for idioms. The interaction between these two patterns has been studied by Goossens (2003). However, some examples of the interaction between metaphor and metonymy in my data do not appear to fit in any of the categories put forward in his account. Therefore, this section has been divided into two parts. The first deals with the examples that can be categorized according to Goossens (2003), and, therefore, I rely here heavily on his characterization of the phenomenon. The other part deals with the data examples that do not fit in his categorization.
Metaphtonymy Goossens (2003) uses the label metaphtonymy as a covering term for the interaction between metaphor and metonymy. His characterization of the phenomenon seems to account for 24 of the 29 idioms that feature interaction between these two patterns in the data. Three of the subtypes described in his work can be identified in Arabic idioms: metaphor from metonymy, metonymy within metaphor, and metaphor within metonymy.
Metaphor from Metonymy In characterizing this type, Goossens (2003) states that ‘the main point here is that underlying the metaphor there is an awareness that the donor domain and the target domain can be joined together naturally in one complex scene, in which case they produce a metonymy’ (p. 366, his italics).
Under this category, there are 21 idioms that can be classified: 14 VP idioms, 5 PP idioms, and the 2 AdvP idioms. For example,
(12) ’adāra    l-khadd-a      l-’ākhar-a
     turn.PST  DEF-cheek-ACC  DEF-other-ACC
     he turned the other cheek
(13) ’arghā    wa-’azbada
     foam.PST  and-froth.PST
     he fumed with rage
In (12), the actual turning of the other cheek after being slapped is metonymically connected (at least given its Biblical reference) to refraining from retaliating after an attack. When the expression is used to indicate refraining from retaliating in general after being attacked, whether physically or not, without intending its literal meaning, it turns into a metaphor. In (13), some people produce a lot of saliva that might come out of their mouths while shouting because of anger. This is the metonymic basis of this idiom, wherein the effect stands for its cause. When it is used with abstract entities as subjects, it becomes metaphorical. A corollary of the foregoing characterization is that, in principle, idioms that are based on this type of figuration can be used as pure metonyms, that is, they may occur in contexts where both the literal and non-literal meanings are relevant. However, in actual usage, only some of these idioms can be used in such a way.
Metonymy within Metaphor In this type, ‘a metonymically used entity is embedded in a (complex) metaphorical expression. The metonymy functions within the target domain’ (Goossens, 2003, p. 367). In other words, there are both metaphorical and metonymic relationships between the literal and idiomatic readings, wherein the metonymic relationship is present within a metaphor that functions at the level of the meaning of the expression as a whole. The metonymic element is shared by both the source and target domains of the metaphor. It ‘functions metonymically in the target domain only, whereas it is interpreted literally or (more often) (re)interpreted metaphorically in the donor domain’ (Goossens, 2003, p. 363).
Two idioms in the data belong to this type:
(14) lam      yuṣaddiq         ‘aynay-hi
     NEG.PST  believe.JUSSIVE  eye.DU.ACC-3SG.M.POSS
     he did not believe his eyes
(15) naẓīf-u    l-yad-i
     clean-NOM  DEF-hand-GEN
     lit. clean-handed
     honest
In (14), at the highest level, that of the embedding metaphor, this idiom features a metaphorical extension based on a perceived similarity, probably, between the reaction of a person who is very surprised by something he/she sees and that of a person who does not accept what someone tells them as true. The word ‘aynayn ‘two eyes’ then is interpreted metaphorically in the source domain by attributing a human quality to it, that is, the quality that a human being may or may not be believed as a source of information. In the target domain, that is, that of the idiomatic meaning, on the other hand, eyes, as the organ of vision, stand in a metonymic relationship with the what-is-seen part of the idiomatic reading. The idiom in (15) is based on the interaction between a metonymy in which yad ‘hand’ stands for the activity and a metaphor in which what is ethical is considered clean.
Metaphor within Metonymy
In this case, a metaphorically used element is embedded into a metonymy. Only one example in the data belongs here:
(16) al-yad-u      l-‘ulyā
     DEF-hand-NOM  DEF-upper
     the upper hand
In this idiom, yad ‘hand’ stands metonymically for power, as it is the ‘instrument’ with which power is often exercised. The metaphorical extension of ‘ulyā ‘upper’ is based on perceiving victory in terms of being at a higher position. Although it might be argued that the presence of this metaphor metaphorises the whole expression, the saliency of the hand-for-power
metonymy provides the basis for classifying this idiom as a metaphor buried into a metonymy.
Another Type of Interaction between Metaphor and Metonymy
Five idioms in the data feature an interaction between metaphor and metonymy that does not seem to fit in any of the above classes:
(17) lā   taqūm-u       li-shakhṣ/shay’  qā’imat-u-n
     NEG  stand.up-IND  for-sb/sth       pillar-NOM-INDF
     lit. none of sb/sth’s pillars is up
     sb/sth is ruined (e.g. socially/professionally)
(18) shay’ᵢ  ’akhadha  ba‘ḍ-u-huᵢ         bi-riqāb-i      ba‘ḍ
     sthᵢ    hold.PST  part.of-NOM-POSSᵢ  at-neck.PL-GEN  part
     lit. the parts of sthᵢ held on to each other’s necks
     sthᵢ (e.g. a set of ideas/texts) interconnected
(19) ‘alā  ṣafīḥ-i-n            sākhin-i-n
     on    sheet.iron-GEN-INDF  hot-GEN-INDF
     lit. on hot sheet iron
     anxious
(20) shadd-u-n         wa-jadhb-u-n
     tugging-NOM-INDF  and-pulling-NOM-INDF
     great tension (e.g. in a strained political state)
(21) nār-u-n        ‘alā  ‘alam-i-n
     fire-NOM-INDF  on    mountain-GEN-INDF
     a well-known entity
At first glance, these idioms might be categorized under the metaphor from metonymy type as described by Goossens (2003) and outlined above. Yet, a closer look helps see the dissimilarity between the two cases. According to him, the main point in the metaphor from metonymy type is that ‘the donor domain and the target domain can be joined together naturally in one complex scene, in which case they produce a metonymy’ (p. 366, his italics). This is probably not the case in these idioms. They cannot be used
as pure metonymies, wherein both their literal and idiomatic readings are relevant. For instance, in (17), being socially/professionally ruined is not contiguous with ‘a tent/building having none of its pillars up’. And, in (21), being well-known is not contiguous with ‘a fire on a mountain’. In these five cases, the similarity relationships obtain between the idiomatic meanings and entailments of the literal meanings rather than the literal meanings themselves. For instance, in (17) and (21) respectively, these relationships obtain between the entailments: physical ruination and clear visibility, on the one hand, and professional/social failure and ‘well-knownness’, on the other. The idiom in (21), additionally, may be linked to a general model in which ‘knowing’ is perceived in terms of ‘seeing’.
Conventional Knowledge
Shared knowledge about the world within the speech community may provide the basis for some idioms (Kövecses, 2002). The sources of shared knowledge in my data vary. They include general knowledge of the world, religion, and superstitions. Only one example is given below:
(22) ḥibr-u-n       ‘alā   waraq-i-n
     ink-NOM-INDF   on     paper-GEN-INDF
     inactive (e.g. of decisions/laws)
From what we know about the world, decisions, truces, and so on, are usually written, so when these are ‘only ink on paper’, this implies that they are not in force.
Emblems
The term emblem refers to ‘a stereotypical conceptual prototype that works as the material representation of a very abstract quality or attribute’ (Langlotz, 2006, p. 72). The emblematic relationship features in:
(23) ghuṣn-u      l-zaytūn-i
     branch-NOM   DEF-olive-GEN
     olive branch
Hyperbole
One idiom in the data may be considered hyperbolic:
(24) shakhṣ/shay’   sābaqa     l-rīḥ-a
     sb/sth         race.PST   DEF-wind-ACC
     sb went fast or sth (e.g. the performance of a company) progressed fast
The first meaning is based on deliberate exaggeration and the other one is a further abstraction from it.³
Special Cases
Five idioms show characteristics that make it difficult to categorize them under any single one of the classes above.
(25) dam-u-n          ’azraq-u
     blood-NOM-INDF   blue-NOM
     blue blood
(26) shakhṣᵢ/shay’ᵢ   ‘alā   bakrat-i            ’ab-ī-hiᵢ
     sbᵢ/sthᵢ         on     young.camel.F-GEN   father-GEN-POSSᵢ
     sbᵢ (e.g. a group of people) all together or all sthᵢ
(27) shakhṣᵢ/shay’ᵢ   bi-rummat-i-hiᵢ
     sbᵢ/sthᵢ         with-worn.out.piece.of.rope-GEN-POSSᵢ
     lit. sbᵢ/sthᵢ with their/its worn out lead
     all sbᵢ (e.g. a group of people) or all sthᵢ
(28) allutayyā             wa-llatī
     REL.F.SG.DIMINUTIVE   and-REL.F.SG
     much ado/lengthy difficult discussions
(29) marbaṭ-u                      l-faras-i
     place.where.sth.is.tied.up-NOM   DEF-mare-GEN
     lit. the place where the mare is tied up
     the central point (e.g. in a problem/argument)
As to the idiom in (25), the following extracts from The Oxford English Dictionary about blue blood are enlightening:
    Blood is popularly treated as the typical part of the body which children inherit from their parents and ancestors; hence that of parents and children, and of the members of a family or race, is spoken of as identical, and as being distinct from that of other families or races.
    blue blood: that which flows in the veins of old and aristocratic families, a transl. [i.e. translation] of the Spanish sangre azul attributed to some of the oldest and proudest families of Castile, who claimed never to have been contaminated by Moorish, Jewish, or other foreign admixture; the expression probably originated in the blueness of the veins of people of fair complexion as compared with those of dark skin; also, a person with blue blood; an aristocrat.
Therefore, blood may be linked metonymically to the race or family. Blue relates to the group of aristocrats to whom the expression historically referred, that is, because of the blueness of their veins. Both blood and veins are connected, since the latter is the container of the former. The literal reading of the idiom, however, violates the truth conditions, as blood cannot be blue. Finally, it seems that at some point, probably due to some similarity in status and/or behaviour, the narrow meaning of the expression as used within its etymological context was extended through a generalization process to refer to any race/group of people that is considered noble/aristocratic. Because of this generalization, the literal meaning of blue does not relate to the current idiomatic reading, as they are not contiguous with each other. However, the Arabic idiom seems to be a loan translation of the English, or maybe other Western, version. Therefore, the preceding analysis might not be relevant in its entirety to the Arabic context.
The reason for this is that it is possible, at least in principle, that the expression was borrowed after the probable generalization process referred to had taken place.
In (26), this literally non-compositional idiom originates in the story of three sons who were killed by an enemy during a journey in the desert, and then their bodies were sent back home carried ‘all together’ on their father’s camel. The idiom therefore has a metonymic basis. It seems that at a later stage, a generalization process took place, whereby the expression has come to mean ‘all together’ in a wide range of situations, without any reference to its original context.
In (27), this literally non-compositional idiom belongs to the context of selling and buying farm animals. The word rummah means ‘a worn out piece of rope’ that could be used as an animal’s lead. Thus, a sentence such as
     khudh-hu               bi-rummat-i-hi
     take.IMP.M.3SG-M.3SG   together.with-a.worn.out.piece.of.rope-GEN-M.3SG.POSS
would literally mean ‘take it [even] together with its worn out lead’. This shows that the idiom has a metonymic basis. A generalization process seems to have taken place, by which the expression has come to mean ‘all/completely’ in a range of contexts that is much wider than its original one.
The study has arrived at no satisfactory conclusions regarding the nature of the semantic extension underlying the idioms in (28) and (29). However, as far as the former is concerned, one plausible account, which has been referred to in some classical Arabic dictionaries, is that the expression is the remnant of two coordinated, complete relative clauses that denoted an unpleasant state of affairs.
Isomorphism in Arabic Idioms
The term isomorphism refers to:
    a one-to-one correspondence between the formal structure of the expression and the structure of its semantic interpretation, in the sense that there exists a systematic correlation between the parts of the semantic value of the expression as a whole and the constituent parts of that expression. [… Isomorphism is a static notion] that merely notes that a one-to-one correspondence between the parts of the semantic value of the expression as a whole and the meanings of the constituent parts of the expression can be detected, regardless of the question whether this correspondence has come about through a process of bottom-up derivation or through a top-down interpretative process (Geeraerts, 2003, p. 437).
In order to decide whether an idiom is isomorphic or not, a body of interdependent information needs to be at the analyst’s disposal. This includes (1) a precise description of the meaning(s) that the expression takes on in
different contexts, (2) information on the presence of any ‘discourse entities’ (Pulman, 1993) or ‘discourse referents’, whether stated explicitly or not, that can be linked to the idiomatic constituents in a one-to-one fashion, (3) information on the extent to which the idiom allows for modification of its individual parts, which is in many cases, though not always, an indication that it is possible to attribute identifiable parts of the idiomatic meaning to these components (see Nunberg, Sag, & Wasow, 1994, and Stathi, 2007), and, very importantly, (4) information on whether or not the idiomatic words are used in their figurative senses outside the context of the idiom, since, if they are, this would be a clear indication of the idiom’s isomorphism/analysability (see Langlotz, 2006, and Geeraerts, 2003). Analysing the data obtained from AC has been of much help in providing this information.
The analysis shows that it is not adequate to classify Arabic idioms as either isomorphic or not. A third class is needed to subsume examples that can be, and are, used in both ways. Tables 12.2 and 12.3 show the distribution of the data over these three patterns. In the former, the classification is produced with reference to the syntactic types of the idioms, while in the latter, it is produced with regard to their figurative patterns. The five examples discussed under ‘special cases’ above have been considered in both tables.
Examples of isomorphic idioms include:
(30) ’afala    najm-u     shakhṣ/shay’
     set.PST   star-NOM   sb/sth
     lit. the star of sb/sth set
     the fame/glory of sb/sth came to an end
(31) bala‘a        l-ṭu‘m-a
     swallow.PST   DEF-bait-ACC
     he swallowed the bait
(32) dam-u-n          ’azraq-u
     blood-NOM-INDF   blue-NOM
     blue blood
Examples of non-isomorphic idioms include:
(33) tajammada    l-dam-u         fī   ‘urūq-i       shakhṣ
     freeze.PST   DEF-blood-NOM   in   vein.PL-GEN   sb
     the blood froze in sb’s veins
Table 12.2 Isomorphism and Syntactic Types of Idioms
                            V-SUBJ   VP       NP       PP       AP       AdvP     Total
                            idioms   idioms   idioms   idioms   idioms   idioms
Isomorphic                  2        15       9        2        …        …        32
Non-isomorphic              4        14       7        8        …        …        35
Isomorphic/Non-isomorphic   …        …        …        …        …        …        3

Table 12.3 Isomorphism and Patterns of Semantic Extension in Idioms
Columns: Metaphor; Metonymy; Interac. of metaphor and metonymy; Con. Know.; Emblematizing; Hyperbole; Special cases; Total
Isomorphic                  Metaphor 28   …   Total 32
Non-isomorphic              …             …   Total 35
Isomorphic/Non-isomorphic   …             …   Total 3
(34) ḥibr-u-n       ‘alā   waraq-i-n
     ink-NOM-INDF   on     paper-GEN-INDF
     inactive (e.g. of decisions/laws)
(35) waqafa      fī   ṭarīq-i   shay’
     stand.PST   in   way-GEN   sth
     he stood in the way of sth
Three idioms can be used in both ways, for example,
(36) shadd-u-n          wa-jadhb-u-n
     tugging-NOM-INDF   and-pulling-NOM-INDF
This idiom can be used to mean ‘much social/political tension’, wherein it is not isomorphic. But it also occurs in contexts where each coordinate can be linked to one of two opposing forces that try to pull some benefits towards themselves and that are detectable in the context. The following corpus examples show these two uses respectively.
(37) wa-‘alāqat-i-ni             ttaṣaf-at            bi-l-shadd-i         wa-l-jadhb-i
     and-relationship-GEN-INDF   was.characterised-F  by-DEF-tugging-GEN   and-DEF-pulling-GEN
     lit. and a relationship that was characterised by the pulling and tugging
     and a relationship that was characterised by much tension
(38) bayna     shadd-i       l-’īrāniyyīna   wa-jadhb-i        l-’ūrubbiyyīna   li-daf‘-i
     between   tugging-GEN   DEF-Iranians    and-pulling-GEN   DEF-Europeans    for-forcing-GEN
     ṭihrān-a     li-l-takhallī      ‘an     barnāmaj-i-ha        l-nawawiyy-i
     Tehran-GEN   to-DEF-giving.up   about   programme-GEN-POSS   DEF-nuclear-GEN
     lit. between the pulling of the Iranians and the tugging of the Europeans to force Tehran to give up its nuclear programme
In (37), although it is possible to notice two opposing forces, the meaning of the idiom seems to be centred on the tension entailed by their concurrent presence. On the other hand, in (38), it seems that each idiomatic component could receive a rather general reading that represents one of the two challenging actions, and that the specific interpretation can only be determined depending on the context. Even when it is possible to link all the formal constituents to identifiable parts of the idiomatic meaning, occasionally some meaning parts may not have parallel formal constituents, for example, in
(39) ’ibrat-u-n        fī   kawmat-i    qashsh-i-n
     needle-NOM-INDF   in   stack-GEN   hay-GEN-INDF
     a needle in a haystack
an essential part of the meaning is that it is hard to locate the entity that the needle metaphorically represents. This meaning part does not map onto any single constituent.
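The three patterns distinguished in this section can be made concrete with a small Python sketch. It is an illustration only and not part of the study: the IdiomAnalysis structure, the classify function, the English stand-ins for the Arabic constituents and the paraphrased meaning parts are all assumptions introduced here; only the resulting classifications of (31), (34) and (39) follow the discussion above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class IdiomAnalysis:
    """Hypothetical record of an analyst's decisions about one idiom."""
    constituents: List[str]    # formal parts of the idiom
    meaning_parts: List[str]   # identifiable parts of the idiomatic meaning
    links: Dict[str, str]      # constituent -> meaning part it is taken to express


def classify(analysis: IdiomAnalysis) -> str:
    """Check for a one-to-one correspondence between form and meaning."""
    linked_meanings = set(analysis.links.values())
    every_constituent_linked = set(analysis.links) == set(analysis.constituents)
    one_to_one = len(linked_meanings) == len(analysis.links)
    if not (every_constituent_linked and one_to_one):
        return "non-isomorphic"
    if linked_meanings == set(analysis.meaning_parts):
        return "isomorphic"
    # All constituents map, but some meaning part has no formal counterpart.
    return "partially isomorphic"


# (31) 'he swallowed the bait': each constituent maps onto one meaning part
# (the paraphrases are assumed for illustration).
bait = IdiomAnalysis(
    constituents=["swallow", "bait"],
    meaning_parts=["be taken in by", "deception"],
    links={"swallow": "be taken in by", "bait": "deception"},
)

# (34) 'ink on paper': the meaning 'inactive' is not distributed over the parts.
ink = IdiomAnalysis(
    constituents=["ink", "paper"],
    meaning_parts=["inactive"],
    links={},
)

# (39) 'a needle in a haystack': 'hard to locate' has no constituent of its own.
needle = IdiomAnalysis(
    constituents=["needle", "haystack"],
    meaning_parts=["sought entity", "mass being searched", "hard to locate"],
    links={"needle": "sought entity", "haystack": "mass being searched"},
)

for name, idiom in [("(31)", bait), ("(34)", ink), ("(39)", needle)]:
    print(name, classify(idiom))
```

The sketch deliberately leaves the linguistic judgement (which constituent expresses which meaning part) to the analyst; it only shows how such judgements translate into the three classes.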
Discussion
This section discusses the following major points: (1) any correlations between the patterns of figuration and the syntactic types of idioms, (2) the high frequency of the metaphor from metonymy pattern compared to the
other two metaphtonymic types, and (3) possible explanations for the differences between idioms with regard to their isomorphism.
Eight out of the ten PP idioms in the sample contain a metonymic element. These include six idioms that involve an interaction between metaphor and metonymy and the two PP idioms that have been discussed among the five special cases. The correlation between PP idioms and the metonymic extension is not surprising in view of the range of the meanings that Arabic prepositions convey. They communicate various associations in time or space or express different relationships of, for example, causality, instrumentality, and accompaniment (see Ryding, 2005, and Badawi, Carter, & Gully, 2004). Since metonymy is by definition concerned with meaning extensions that are based on such relationships, when prepositional phrases are used figuratively, this use often involves metonymy. A similar explanation applies to the AdvP idioms in the sample. Since adverbs/adverbials typically express meanings related to time, space, and manner, it is not surprising to find that adverbial idioms involve metonymic extensions.
The metaphor from metonymy pattern is far more frequent in the data than the other two metaphtonymic types. In characterizing this pattern, Goossens (2003) states that
    [m]etaphor from metonymy implies that a given figurative expression functions as a mapping between elements [A and B] in two discrete domains, but that the perception of ‘similarity’ is established on the basis of our awareness that A and B are often ‘contiguous’ within the same domain. This frequent contiguity provides us with a ‘natural,’ experiential, [sic] grounding for our mapping between two discrete domains. (p. 368)
In fact, this experiential basis not only provides the grounding for the mapping between the two distinct concepts, but can also be regarded as a plausible explanation for the high frequency of this pattern as opposed to the metonymy within metaphor and metaphor within metonymy types, which do not share the same degree of embedding in human everyday experience. As far as the metonymy within metaphor pattern is concerned, the condition to incorporate a shared element, that is, one that functions both in the source and target domains, might at some level be considered a restriction on the productivity of this type, and hence provides at least a partial explanation for its infrequency. However, in order to better understand this, other studies need to be conducted on different sets of data. In those
studies, effort should be made to categorize idioms in semantically coherent groups and to pay attention to the question of how distinct the target domains are from the source domains. It is important in this context to mention that Goossens (2003), in his study on the figurative expressions conveying linguistic action, reported a high frequency of the metonymy within metaphor type. This was the case when the source domain involved a body part that ‘can be functional in linguistic action’ (Goossens, 2003, p. 365). This shows the significance of the suggestion to assemble idioms in semantically consistent groups in preparation for a closer investigation. Finally, the metaphor within metonymy type is represented by only one example in the sample. ‘The fact that if we embed a metaphor into a metonymy, it tends to “metaphorise” the whole expression’ (Goossens, 2003, p. 367) may explain this rarity.
As Table 12.3 shows, 28 out of the 32 isomorphic idioms in the data are metaphorical, and 27 out of the 35 non-isomorphic examples have at least a metonymic ingredient; these include 2 idioms discussed as special cases. Metaphors are typically based on noticing a set of similarities between two concepts. This usually leads to the analysability of purely metaphorical idioms, as it is often possible to establish the connections between parts of the idiomatic meaning and parts of the constituent structure encoding the literal meaning. On the other hand, the presence of a metonymy in the semantic structure of the idiom, whether or not it is the only figurative pattern at work, usually renders the idiom non-isomorphic. This accords with the findings of Langlotz (2006) in his work on English idioms. Within his cognitive-linguistic account, this correlation is explained as follows: By establishing deferred foci on conceptual relationships, metonymies can prevent direct ontological correspondences between constituents and possible correspondents in the idiomatic meaning.
The constituental structure only provides indirect access to these correspondences via metonymic contiguity relationships. (p. 120)
Nevertheless, four idioms which involve metonymies are isomorphic, for example,
(40) al-yad-u       l-‘ulyā
     DEF-hand-NOM   DEF-upper
     the upper hand
(41) dam-u-n          ’azraq-u
     blood-NOM-INDF   blue-NOM
     blue blood
The analysability of these examples is due to the fact that their metonymic extensions are constituental rather than global. The noun in (40) and the noun and adjective in (41) are motivated, at least as far as their etymology is concerned, by individual metonymic relationships and stand, each on its own, for other contiguous concepts. This individual motivation by separate patterns of figuration helps explain why these elements map onto identifiable parts of the idiomatic meaning, rendering the examples isomorphic. Partial isomorphism of the type observed in (39) can be attributed to the fact that the meaning ‘hard to locate’ is an institutionalized implication of the literal reading of the expression. This meaning part is implied by what we know about the physical characteristics of both needles and haystacks. It is not part of the compositional meaning of the expression, and, therefore, does not have a place in a structured semantic representation based only on this meaning. In fact, such an example shows how difficult it is in some cases to set a clear-cut distinction between metaphor and metonymy.
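The strength of the correlation reported from Table 12.3 is easier to judge as proportions. The following Python lines are a reader’s check rather than part of the original analysis; the helper name share is hypothetical, and the four counts are the ones cited in the Discussion above.

```python
def share(part: int, whole: int) -> float:
    """Percentage of `whole` accounted for by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


# Counts as cited in the Discussion from Table 12.3.
metaphorical_among_isomorphic = share(28, 32)
metonymic_among_non_isomorphic = share(27, 35)

print(metaphorical_among_isomorphic)    # 87.5
print(metonymic_among_non_isomorphic)   # 77.1
```

On these figures, roughly seven in eight isomorphic idioms are metaphorical and roughly three in four non-isomorphic idioms involve metonymy.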
Notes
1 This software is available at: http://www.graphpad.com/quickcalcs/randomN1.cfm
2 Arabicorpus is available at http://arabicorpus.byu.edu
3 This idiom can be classified in some other ways too.
References
Badawi, E., Carter, M. G., & Gully, A. (2004). Modern written Arabic: A comprehensive grammar. London: Routledge.
Cacciari, C., & Tabossi, P. (1988). The comprehension of idioms. Journal of Memory and Language, 27, 668–83.
Fernando, C. (1996). Idioms and idiomaticity. Oxford: Oxford University Press.
Geeraerts, D. (2003). The interaction of metaphor and metonymy in composite expressions. In R. Dirven & R. Pörings (Eds.), Metaphor and metonymy in comparison and contrast (pp. 435–65). Berlin: Mouton de Gruyter.
Goossens, L. (2003). Metaphtonymy: The interaction of metaphor and metonymy in expressions for linguistic action. In R. Dirven & R. Pörings (Eds.), Metaphor and metonymy in comparison and contrast (pp. 349–77). Berlin: Mouton de Gruyter.
Hümmer, C. (2007). Meaning and use: A corpus-based case study of idiomatic MWUs. In C. Fellbaum (Ed.), Idioms and collocations: Corpus-based linguistic and lexicographic studies (pp. 138–51). London: Continuum.
Kövecses, Z. (2002). Metaphor: A practical introduction. Oxford: Oxford University Press.
Langacker, R. W. (1987). Foundations of cognitive grammar. Volume I: Theoretical prerequisites. Stanford, CA: Stanford University Press.
Langlotz, A. (2006). Idiomatic creativity: A cognitive-linguistic model of idiom-representation and idiom-variation in English. Amsterdam: John Benjamins Publishing Company.
Moon, R. (1998). Fixed expressions and idioms in English: A corpus-based approach. Oxford: Clarendon Press.
Nunberg, G., Sag, I. A., & Wasow, T. (1994). Idioms. Language, 70, 491–538.
Pulman, S. (1993). The recognition and interpretation of idioms. In C. Cacciari & P. Tabossi (Eds.), Idioms: Processing, structure, and interpretation (pp. 249–70). Hillsdale, NJ: Lawrence Erlbaum Associates.
Radden, G. (2003). How metonymic are metaphors? In R. Dirven & R. Pörings (Eds.), Metaphor and metonymy in comparison and contrast (pp. 407–34). Berlin: Mouton de Gruyter.
Riehemann, S. (2001). A constructional approach to idioms and word formation. PhD dissertation. Stanford University.
Ryding, K. C. (2005). A reference grammar of modern standard Arabic. Cambridge: Cambridge University Press.
Saeed, J. I. (2003). Semantics (2nd ed.). Oxford: Blackwell Publishing.
Stathi, K. (2007). A corpus-based analysis of adjectival modification in German idioms. In C. Fellbaum (Ed.), Idioms and collocations: Corpus-based linguistic and lexicographic studies (pp. 81–108). London: Continuum.
Chapter 13
Formulaicity and Translation: A Cross-corpora Analysis of English Formulaic Binomials and Their Italian Translations
Salvatore Giammarresi
University of Palermo, Italy
Equivalence in difference is the cardinal problem of language and the pivotal concern of linguistics. Roman Jakobson (1959: 233)
This paper reports on a study investigating English formulaic binomials of the type VERB and go; go and VERB; VERB or go and their Italian translations. This research was designed to investigate whether their formulaicity is a linguistic trait that is preserved during translation. A better understanding of the relationship between formulaicity and translation can improve translation practice, translation teaching, bilingual lexicography and the development of better computer-assisted translation tools and machine translation systems. This research is based on the claim (Makkai, 1978) that formulaicity is a language universal, that is, formulaicity is so common in natural languages that it can be considered a defining feature of language itself. The results of this research suggest that while formulaicity, as it pertains to English binomials based on the verb to go, is pervasive in original texts, it tends to disappear or be highly reduced in translated texts. If these findings are supported by similar findings in other language pairs and with regard to other single and multi-word lexical units, then it will be possible to hypothesize that the absence or low frequency of formulaicity might be considered a translation universal (Baker, 1992; Kenny, 1998). Theoretically and methodologically this research is corpus-driven (Tognini Bonelli, 2002), in the sense that the analysed corpus data informed the investigation and shaped its theoretical background. In other words,
corpus data was not used to confirm any pre-determined ideas; instead, fully embracing the revolution initiated by corpus linguistics, corpus data was collected and analysed with an open mind. This research is based on frame theory (Fillmore, 1971), in the sense that framing is granted as a basic, common and shared human cognitive skill that underlies language and many other human abilities. As such, it permits explanations of linguistic phenomena beyond purely linguistic parameters. Finally, this research adheres to a usage-based theory of language (Bybee, 2006; Tomasello, 2003), that is, the idea that language usage determines the nature and status of the language we interact with, both personally and collectively. In this sense linguistic phenomena are not just the result of education and set rules, but are the result of our exposure to previous linguistic phenomena, measured in terms of frequency but also in terms of type of occurrence. A usage-based model can explain how framing processes dynamically shape our personal and collective languages. The theoretical and methodological framework of this research can be applied to any language pair. The following sections will give a brief overview of previous research on binomials, the linguistic resources used during this research, the methodology used and specifically the parameters that were used to determine the formulaic status of each sequence. A section will be dedicated to the analysis of the binomial come and go, one of the more than 2500 types of sequences that were analysed during this study. A summary of the final results of the study will be briefly reviewed in the conclusion.
Previous Research on Binomials
While there is a steadily growing scientific literature on formulaic language, and a long tradition of theoretical studies on translation, no previous study seems to have tackled the issue of formulaicity in translated texts in general or, in particular, the analysis of English frame-type formulaic binomials based on the verb to go. Malkiel (1959: 113) defined a binomial as ‘the sequence of two words pertaining to the same form-class, placed on an identical level of syntactic hierarchy, and ordinarily connected by some kind of lexical link’. Malkiel (1959) focused in particular on a specific kind of binomial, called ‘irreversible binomials’, where the two connected words are always used in the same order and never change places. Examples of irreversible binomials in English are odds and ends (not ends and odds) and husband and wife. Lambrecht (1984) dealt with German Bare Binomials (BB) which he described as of the form ‘Noun und Noun’, where the nouns
are not preceded by determiners. According to Lambrecht (1984) what makes binomials, irreversible binomials and bare binomials interesting to linguists is the fact that these binomials are similar to idioms, but are not really fixed idiomatic expressions since at times they can be used productively for the creation of new pairs. Lambrecht (1984) divided BBs into three main groups: lexicalized and irreversible; novel but semantically motivated; and semantically unmotivated but pragmatically constrained. Moon (1998: 152) listed binomials as frame-type formulaic sequences and provided examples of the various types of binomials she found in the Oxford Hector Pilot Corpus. Benor and Levy (2006), searching for binomial constructions of all lexical categories (Nouns and Nouns, Verbs and Verbs, Adjectives and Adjectives and Adverbs and Adverbs), from three tagged corpora (the Switchboard corpus, the Brown corpus, and the Wall Street Journal corpus), found 3680 distinct binomials. After removing binomials formed from personal names with extender phrases (such as and stuff or and everything) Benor and Levy had a list of 411 binomials, and of these only the ones listed below (sorted based on their frequency) included the verb to go.
Table 13.1 Go sequences and their frequencies in Benor and Levy (2006: 44–48)
Go sequences      Frequency
come and go       4
go and vote       2
been and gone     1
voted and went    1
hid and went      1
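Malkiel’s notion of irreversibility lends itself to a simple corpus check: count how often a pair occurs in its attested order and how often in the reversed order. The Python sketch below is an illustration only and not a procedure used in the studies cited; the function name binomial_order_counts and the sample sentences are invented for the example.

```python
import re
from collections import Counter


def binomial_order_counts(text: str, first: str, second: str, link: str = "and") -> Counter:
    """Count 'first link second' versus 'second link first' in a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter({"attested": 0, "reversed": 0})
    # Slide a three-word window over the token list.
    for left, middle, right in zip(tokens, tokens[1:], tokens[2:]):
        if middle != link:
            continue
        if (left, right) == (first, second):
            counts["attested"] += 1
        elif (left, right) == (second, first):
            counts["reversed"] += 1
    return counts


# Invented sample text, for illustration only.
SAMPLE = (
    "They sold odds and ends. Husband and wife arrived. "
    "The odds and ends were cheap. Wife and husband left."
)

print(binomial_order_counts(SAMPLE, "odds", "ends"))
print(binomial_order_counts(SAMPLE, "husband", "wife"))
```

In a real corpus a pair would count as irreversible only if the reversed order were absent or vanishingly rare; where exactly to draw that line is a decision the sketch leaves open.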
Resources
This research was based on multiple English, Italian and bilingual corpora and dictionaries. The English corpora directly accessed and used in this research were the British National Corpus (BNC), the Time Corpus of American English (TCOAE), and the Collins WordbanksOnline English Corpus (CWOEC). Parts of the Cambridge International Corpus (CIC) were accessed indirectly through both the definitions and examples of the Cambridge Advanced Learner’s English Dictionary (CALD).
The Italian corpora directly accessed and used in this research were the Corpus di Italiano Scritto (CORIS) and the La Repubblica Corpus (LRC). Ideally it would have been more desirable to rely on a larger number of Italian corpora; however there is at present a lack of freely accessible reliable Italian corpora. Unfortunately not much has changed since Philip (2004: 3) stated, ‘publicly-accessible corpora for Italian are few and far between’. This research relied on the English-Italian portion of the European Parliament Proceedings Parallel Corpus 1996–2006 (EUROPARL) containing 1,251,315 sentences, consisting of 36,411,166 Italian words and 36,510,033 English words (Koehn, 2005). Given the setting in which these texts were produced, the language and the style used are mostly legal and bureaucratic in nature. Nevertheless, the volume of text, plus the fact that a good portion consists of transcribed spoken language, still make it a good test bed for researching linguistic phenomena. This research also relied on the 2006 edition of the Italian-English-Italian Ragazzini Zanichelli Dictionary (IEIRZD). Following Hoey’s (2005) lead, the idea was to use the IEIRZD to either confirm or question corpus-driven data, providing the point of view of traditional, that is, non-corpus-based, lexicography.
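A sentence-aligned parallel corpus of the kind described above pairs each English sentence with its Italian counterpart, so that the translations of a given sequence can be retrieved by searching the English side. The following Python sketch illustrates the idea under stated assumptions: the two sentence pairs are invented for the example and are not taken from EUROPARL, and find_translations is a hypothetical helper rather than part of any corpus toolkit.

```python
import re
from typing import List, Tuple

# Invented English-Italian sentence pairs, for illustration only.
SAMPLE_PAIRS: List[Tuple[str, str]] = [
    ("Presidents come and go, but the problems remain.",
     "I presidenti vanno e vengono, ma i problemi restano."),
    ("We must go and see the situation for ourselves.",
     "Dobbiamo andare a vedere la situazione di persona."),
]


def find_translations(pairs: List[Tuple[str, str]], sequence: str) -> List[Tuple[str, str]]:
    """Return the aligned pairs whose English side contains `sequence` as whole words."""
    pattern = re.compile(r"\b" + re.escape(sequence) + r"\b", re.IGNORECASE)
    return [(english, italian) for english, italian in pairs if pattern.search(english)]


for english, italian in find_translations(SAMPLE_PAIRS, "come and go"):
    print(english)
    print(italian)
```

The retrieved Italian sentences still have to be read by the analyst, since the sketch does not identify which Italian words render the English sequence.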
Methodology
In order to identify, establish and list a series of go frame-type formulaic sequences in English (L1), a series of wild card searches (* and go; go and *; * or go) were performed in the BNC, TCOAE, CWOEC and CALD. This allowed the researcher to cast the widest net possible for sequences contained in those corpora, and therefore to truly view the data from a new perspective, as independent as possible from pre-defined theories or expectations. Each search resulted in a series of sequence types (for example the sequence type come and go). All sentences associated with each sequence type in each corpus were reviewed and analysed based on the criteria outlined below to test their formulaic status. Once the formulaic nature of an English sequence was established, the same sequence was searched in both EUROPARL and in IEIRZD to discover how each sequence was translated into Italian (L2). Finally the L2 sequences were searched within CORIS and LRC and analysed to study whether they were formulaic as well. For each formulaic binomial sequence its frequencies, collocations,
colligations and semantic prosodies were analysed and compared in English and Italian. This study was based on the combined pool of data of the BNC, TCOAE, CWOEC and CALD on the English side, and CORIS and LRC on the Italian side, plus the EUROPARL corpus. This research then was not only corpus-driven, but also relied on the cross-analysis of multiple corpora, cross-linguistically between Italian and English. Doing so was an attempt to overcome two of the key limitations of corpus studies in general, that is, representativeness and the idiosyncrasies of one particular corpus. By relying on a larger and diverse pool of data, each finding in one corpus was either confirmed or denied by cross-referencing the other corpora. If a pattern was found across multiple corpora, then one could be more confident that perhaps it was a real pattern of the language. The initial inspiration for the methodology used in this research stemmed from Moon (1998), who claimed that statistically relevant phenomena are reproduced across corpora. The methodology also was an attempt to overcome some of the limitations of Tognini Bonelli’s (2002) methodology. Ideally based on a dictionary of formulaic sequences in English and with a large enough and reliable bilingual corpus of original English texts aligned with their Italian translations, the methodological aspect of this research would have been little more than a statistical exercise on collected data. Unfortunately such a dictionary and such a corpus did not exist. In theory the closest corpus available of this type was the CEXI corpus (Bernardini, 2003). However, its limited size did not allow it to be used as a source for any meaningful research data on formulaic sequences based on the verb to go. The EUROPARL corpus, on the other hand, had many limitations that made it unfeasible to be used as the sole corpus of this research. 
Specifically the English and Italian components were not big enough; the genre was very specific (and therefore EUROPARL lacked otherwise common binomial formulaic sequences); it was not always clear what was L1 (source language) and what was L2 (translation) within the corpus and last but not least, many sequences were not aligned or translated properly (probably one of the biggest flaws encountered during this research). In general, while a parallel corpus can provide valuable information about the translation habits of one or more translators, in one or more genres or styles, depending on how its texts were collected, it does not help to pinpoint the general behavior of L2 as a language. A parallel corpus, like CEXI or EUROPARL, had the limitation of presenting L2 data in only a mediated form. In order to capture in full the original L2 behavior
it was therefore better to rely on an L2 corpus comparable to an L1 corpus. Given the present state of the art of corpus design and implementation, it was deemed acceptable to compare general reference corpora, although their size, time frame and specific sub-topics may vary. Comparability was thus a matter of degree. For this research the BNC and CORIS were deemed comparable, in that they represented the best available general reference corpora in English and Italian respectively. While they have similar word counts, they might well not match on closer inspection in terms of composition and percentage of genres. Most glaringly, CORIS did not have a spoken component, but CORIS was the only generally available and reliable reference corpus of 100 million words in Italian. By the same token, the LRC and the TCOAE were deemed comparable. Their sizes vary; however, both are journalistic in style, and newspapers and magazines in the western world usually cover the same types of topics (and often the very same news stories), such as politics, art, technology, society, religion and sports.
Signs of Formulaicity
The problem of identifying formulaic binomial sequences in English was still a crucial one, and it needed to be addressed before analysing how those sequences were translated into Italian. If formulaic sequences are sequences of language which are stored and used as a whole in order to save processing effort (Sinclair, 1991; Kuiper, 2000; Wray, 2002; Kuiper, 2004), then it was reasonable to expect that, given a choice between two different ways of conveying the same message, a speaker/writer would be more likely to use the formulaic sequence than a randomly generated one. As stated by Patrick Hanks (1996: 85), ‘the creative potential of language is undeniable, but the concordances to a corpus remind us forcibly that in most of our utterances we are creatures of habit, immensely predictable, rehearsing the same old platitudes and the same old clichés in almost everything we say. If it were not so, language would become unworkable. Humankind cannot bear very much creativity.’ Indeed, corpus linguistics research seems to confirm this intuition: comparing frequency counts across multiple corpora, one can easily see that what are widely regarded as formulaic sequences usually have higher frequency counts than non-formulaic sequences. However, it would be wrong to use frequency
counts as the only criterion for identifying formulaic sequences. First of all, there is a procedural limitation, since computers do no more and no less than what they have been instructed to do. In addition, many highly frequent sequences found across corpora are not formulaic. This puts the burden on the linguist, who needs to analyse carefully the results found within each corpus. Another major problem with frequency counts is that ‘corpora are unable to capture the true distribution of certain kinds of formulaic sequences’ (Wray, 2002: 27). There is no doubt that corpora have more to offer in terms of linguistic data than anything available before; however, as Sinclair (1991) has demonstrated, corpora cannot be seen as magic wands. The size and the composition of a corpus determine very heavily what can be found in it. The size of the data pool is always a critical factor when working on a statistical basis. In general, the larger the corpus, the better the chances of detecting patterns of use. Sinclair (1991) argued for the use of very large corpora of at least 100 million words, on the grounds that smaller corpora too easily miss some important linguistic phenomena. Today, the latest generation of corpora of over a billion words would seem in theory to be the answer; however, corpora of that size raise processing issues, and one must always be aware of the type of data contained within the corpus. Therefore, while it cannot be solely relied upon, frequency can be a good means of tracing formulaicity in translation. As this research shows, when a parallel corpus containing a collection of translations reveals that different translators favor or avoid a certain sequence in L2 to express the semantic/syntactic/pragmatic/functional value of an original sequence in L1, then frequency is relevant. The frequency with which a sequence appears in a parallel corpus suggests how much translators favor that sequence.
Another sign of formulaicity, although the slowest to manifest, is a change in spelling. Some formulaic binomial sequences tend to be written with hyphens between the lexical items of the sequence or, in very rare cases, to fuse the lexical items into one word, either after dropping the hyphens or by skipping the hyphen phase altogether. Corpora are the best tools to measure and track such spelling variations and evolutions. The sequences touch-and-go and stop-and-go are recorded in hyphenated form in many sentences found in the corpora used in this research. The use of hyphens, and even more so of fused forms, is the typographical manifestation of the strong ties among the lexical units participating in the sequence. The strength of the semantic and syntactic ties between the
lexical units in a sequence is a decisive factor contributing to the formulaic status of a sequence. In part inspired by Hickey’s (1993) list of nine conditions that a sequence should adhere to in order to be considered formulaic in her study on first language acquisition, the following is the list of conditions that, during this research, was used to detect formulaicity.
• The sequence forms a semantic, syntactic and/or pragmatic unit, that is, some or all of the semantic, syntactic, pragmatic, collocational and colligational ties existing among the words forming the sequence (regardless of their position) are stronger than the same ties any of the words might have with other words in the sentence. This is a necessary condition although hard to measure a priori but quite easy to test, for a native speaker, a posteriori.
• The total meaning of the sequence is more than the sum of the meanings of the single words within the sequence. The usual semantic compositionality is either suspended or altered. This is in many respects similar to what Tognini Bonelli (2002: 76) refers to as ‘functionally complete unit of meaning’.
• The sequence is a community-wide formula. This is also a necessary condition and is hard to measure a priori but possible to test, for a native speaker, a posteriori.
• The sequence is an idiosyncratic chunk.
• The sequence is used repeatedly in the same form.
• The sequence is situationally dependent.
• The sequence is hyphenated. This is not a necessary condition, but when present it is a strong indicator that the sequence acts as one word particularly with binomials but also with some multi-word items.
• The sequence is often placed at the end of a sentence before a full stop. Also this condition is not necessary but it reinforces the fact that the sequence acts as one word and therefore the binomial does not require right side collocates.
• The meaning, form and function of the sequence are constant across multiple corpus-driven examples.
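The hyphenation condition lends itself to a simple count of spelling variants. The following Python sketch is only an illustration and not the procedure used in this study: it assumes the corpus is available as plain text, and the function name spelling_variants and the sample sentences are invented for the example.

```python
import re
from collections import Counter

def spelling_variants(text, words):
    """Count spaced, hyphenated and fused spellings of a word sequence.

    `words` is the sequence as a list, e.g. ["stop", "and", "go"].
    Matching is case-insensitive and anchored at word boundaries.
    """
    escaped = [re.escape(w) for w in words]
    patterns = {
        "spaced": r"\b" + r"\s+".join(escaped) + r"\b",
        "hyphenated": r"\b" + "-".join(escaped) + r"\b",
        "fused": r"\b" + "".join(escaped) + r"\b",
    }
    counts = Counter()
    for label, pattern in patterns.items():
        counts[label] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

# Invented sample text, used only to exercise the function.
sample = (
    "It was touch and go for a while. The rescue was a touch-and-go affair. "
    "Stop-and-go traffic slowed us down, and stop-and-go driving wastes fuel."
)

touch = spelling_variants(sample, ["touch", "and", "go"])
stop = spelling_variants(sample, ["stop", "and", "go"])
print(dict(touch))
print(dict(stop))
```

Run over dated sub-corpora, counts of this kind could show whether the hyphenated or fused spelling gains ground over time, although the judgement about formulaic status would still rest on the other conditions in the list.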
VERB and go Binomial Sequences
A search of the BNC and the TCOAE for binomials with the structure * and go returned 856 sequence types in the BNC and 792 in the TCOAE. Of these, 646 types from the BNC and 643 from
the TCOAE were discarded because they had a frequency of 1 and did not seem to have any of the formulaic features described above. The decision to discard types with a frequency of 1 was based on the guiding principle that more frequent sequences are more likely to be formulaic. Future research should also take into account the sequences with a frequency of 1, since even the largest latest-generation corpora might fail to register some linguistic phenomena. The threshold of 1 was arbitrary; however, the discarded types did not seem to contain potentially formulaic sequences, which supports the notion that truly formulaic sequences usually have higher frequency counts. The remaining 210 types from the BNC and 148 types from the TCOAE were then compared, with particular attention to the types shared by both corpora, resulting in a final list of 57 types. The first four shared sequence types are the four most frequent sequence types in both the BNC and the TCOAE. This fact, together with the fact that these four types have matching relative rankings, strengthens the argument for a cross-corpora-driven type of analysis and might be further evidence that the most frequent sequence types are not corpus-dependent and thus might actually reflect a common linguistic phenomenon in English. Judging by the number of tokens alone, come and go is clearly the most frequent binomial of type * and go shared by the BNC and the TCOAE. The second most common sequence type is ‘, and go’, which was not analysed because it is clearly not formulaic in the sense described.
The third most frequent * and go sequence type in the BNC and the TCOAE, up and go, had about half the number of tokens of come and go and prima facie, based on frequency only, could have been a candidate for formulaicity. However, analysing the sentences one by one showed that the word up was always part of a phrasal verb such as wake up, get up, dress up or pack up. Since internal cohesion was one of the defining factors of formulaic binomials, and given that the components of a phrasal verb should act as one word, it was concluded that up and go was not a formulaic sequence. Of all 57 * and go types of binomials shared by the BNC and TCOAE, after removing sequence types with punctuation marks, those with words belonging to phrasal verbs, and finally clearly non-formulaic sequences, the following six sequence types were analysed in greater detail: come and go, touch and go, try and go, stop and go, leave and go and shop and go.
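The mechanical part of this selection (collecting * and go types, discarding types with a frequency of 1, dropping types whose first slot is a punctuation mark, and keeping the types shared by both corpora) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tooling actually used: the function names and the two toy token lists are invented, and the exclusion of phrasal-verb particles such as up, which in this study depended on reading each sentence, is not automated here.

```python
from collections import Counter

def and_go_types(tokens):
    """Count every three-token sequence of the form '<token> and go'."""
    counts = Counter()
    for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
        if second == "and" and third == "go":
            counts[f"{first} and go"] += 1
    return counts

def candidate_types(counts, min_freq=2):
    """Keep types at or above the threshold whose first slot is a word."""
    return {
        seq for seq, n in counts.items()
        if n >= min_freq and seq.split()[0].isalpha()
    }

# Invented toy corpora, already lower-cased and tokenised.
corpus_a = (
    "they come and go all day . fashions come and go . "
    "it was touch and go . still touch and go . "
    "wait , and go . stop , and go . we pack and go"
).split()
corpus_b = (
    "visitors come and go . ideas come and go . "
    "it is touch and go . very touch and go . "
    "try and go . try and go"
).split()

shared = (candidate_types(and_go_types(corpus_a))
          & candidate_types(and_go_types(corpus_b)))
print(sorted(shared))
```

The default min_freq=2 mirrors the decision to discard types with a frequency of 1; lowering it to 1 would keep the single-occurrence types that future research might want to revisit.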
Table 13.2 Top four shared * and go sequence types in BNC and TCOAE and their frequencies

Position   Sequences       Tokens BNC   Tokens TCOAE
1          come and go     241          183
2          , and go        212          96
3          up and go       121          73
4          touch and go    44           26
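The claim that the top four types have matching relative rankings can be checked directly from the counts in Table 13.2. The short Python sketch below is illustrative only; the helper names ranking and relative_to_top are invented for the example.

```python
def ranking(freqs):
    """Return sequence types ordered from most to least frequent."""
    return sorted(freqs, key=freqs.get, reverse=True)

def relative_to_top(freqs):
    """Express each count as a proportion of the most frequent type."""
    top = max(freqs.values())
    return {seq: n / top for seq, n in freqs.items()}

# Token counts copied from Table 13.2.
bnc = {"come and go": 241, ", and go": 212, "up and go": 121, "touch and go": 44}
tcoae = {"come and go": 183, ", and go": 96, "up and go": 73, "touch and go": 26}

same_order = ranking(bnc) == ranking(tcoae)
print(same_order)  # True
for seq in ranking(bnc):
    print(seq, round(relative_to_top(bnc)[seq], 2), round(relative_to_top(tcoae)[seq], 2))
```

The ordering is identical in the two corpora, while the proportions relative to come and go differ somewhat, which is why the text speaks of matching relative rankings rather than matching frequencies.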
Come and go
The sequence come and go was by far the most frequent * and go type binomial found in both the BNC and the TCOAE. This finding mirrors and validates Benor and Levy’s (2006) results. According to CALD, come and go deserves a dictionary entry of its own, since its use and meaning cannot be deduced compositionally, that is, from the use and meaning of the single words come, and, go. Cross-corpora confirmation was yet another piece of evidence that the higher frequencies were not accidental or caused by the peculiar setup of a particular corpus, but reflected a general tendency, a usual pattern of the language in use. The sequence come and go, then, is most probably a completely lexicalized syntagma stored as a whole in the mind of most speakers. High frequency, while a good indicator, is however not sufficient for formulaicity as defined earlier; therefore the collocational and colligational patterns and the semantic prosodies associated with the sequence were analysed. In order to discover patterns that could be overlooked on a frequency basis alone, the following inflected forms were also investigated and analysed: coming and going, come and gone, came and went. The sequence coming and going appears 92 times in the BNC and 35 times in the TCOAE. In the case of this sequence the relative frequencies did not fully match: while coming and going was the second most frequent sequence in the BNC, followed by come and gone, the order was reversed in the TCOAE. CALD did not list the sequence come and gone as a separate entry, while the BNC reported this sequence 67 times and the TCOAE 102 times. One clue to the formulaic status of come and gone was the fact that it was often found at the end of a sentence, followed by a full stop. Indeed the sequence come and gone appeared to be formulaic, as defined previously, since in all examples it formed a lexical unit that acted as one word.
The sequence came and went was found 163 times in the BNC and 132 times in the TCOAE. From a frequency perspective this sequence seemed quite frequent, and its behavior seemed very similar to that of come and go, allowing for the difference in inflection. More than 50 per cent of the left collocates of the sequences come and go and came and went referred to people (such as presidents, secretaries of trade, stars, and officers). This percentage seemed too high to be due to chance, especially considering that the second most frequent collocate was fashion, with only 5 per cent of occurrences. In third place were lexical items such as thoughts and pains, with 3 per cent, followed by animals and feelings, with 2 per cent. Among the right-side collocates, a relatively high percentage (14 per cent) consisted of sequences such as ‘as you please’ and ‘as they/you wish’. The other right collocates occurred only once in one hundred sentences and were lexical items which mostly indicated manner or time, such as forgotten, at irregular intervals, suddenly, and all day long. There seemed to be a strong collocational pattern in the sentences containing the sequence come and gone. In these sentences the sequence was almost always preceded by an inflected form of the verb have and, in addition, in about 50 per cent of the sentences the words already or all were found between the inflected form of have and the sequence come and gone. There did not seem to be predominant colligational patterns associated with the sentences reviewed containing come and go, came and went, coming and going and come and gone. In general, conditional or modal expressions such as may, could, would, and planned to tended to appear before the formulaic sequence. From a frequency perspective there was no one pattern that stood out. The semantic prosody of come and go, came and went, come and gone and coming and going seemed to express a somewhat negative and nostalgic view, in the sense of ‘not lasting’.
There was a general meaning of possibility (expressed by words like seemed, can, could, may, would), uncertainty, inevitability of time, sense of hustle and bustle (noise), sense of freedom, and continuous action.
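Counting left and right collocates of a sequence is the first step behind percentages of this kind. The Python sketch below shows one simple way to do it, assuming whitespace-tokenised concordance lines; the function name collocates and the three sentences are invented for illustration. Grouping collocates into semantic classes such as people or fashion, as was done in this analysis, would still require manual annotation.

```python
from collections import Counter

def collocates(sentences, sequence, window=1):
    """Count words within `window` tokens to the left and right of `sequence`."""
    target = sequence.split()
    n = len(target)
    left, right = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == target:
                left.update(tokens[max(0, i - window):i])
                right.update(tokens[i + n:i + n + window])
    return left, right

# Invented sentences, for illustration only.
sentences = [
    "Presidents come and go as they please",
    "Fashions come and go quickly",
    "Presidents come and go as usual",
]
left, right = collocates(sentences, "come and go")
print(left.most_common())
print(right.most_common())
```

A wider window (for example window=3) would capture items such as already or all that stand between have and come and gone.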
Translating Come and Go
The English binomial come and go can have different translations in Italian. According to IEIRZD, the sequence come and go is usually translated as andare e venire (and its inflected forms such as vai e vieni, vanno e vengono) or andirivieni. Other translations are obviously possible but were not considered during this research. The table below lists the absolute frequencies and percentages of each sequence in CORIS, LRC and EUROPARL.
Table 13.3 Absolute frequencies and percentages in CORIS, LRC and EUROPARL

ITALIAN            CORIS AF   CORIS %   LRC AF   LRC %   EUROPARL AF   EUROPARL %
andirivieni        264        0.264     660      0.173   6             0.020
vanno e vengono    96         0.096     489      0.128   3             0.010
andare e venire    73         0.073     147      0.038   3             0.010
vai e vieni        6          0.006     50       0.013   0             0
It is important to note, once again, that the relative frequencies and percentages did not vary much between CORIS, LRC and EUROPARL. This match reinforced the validity of a cross-corpora methodology, and the percentages indicated that the perceived (and dictionary-sanctioned) default translations might not be the most frequently used. From a purely frequency-count perspective it would seem that the Italian andirivieni was a better candidate, but frequency alone, as mentioned before, is not a sufficient indicator. Therefore, to verify the comparability of andirivieni and come and go, their respective collocational and colligational patterns and semantic prosodies within CORIS, LRC and EUROPARL were analysed and compared. According to CORIS and LRC, andirivieni collocated primarily with the articles l’ and un (the and a) and the words frenetico (frenzied, hectic, frantic) and continuo (continuous) on the left, and with di (of), with or without an article, and tra (in between) on the right. Andirivieni was found 6 times in EUROPARL. From a collocational perspective it was preceded by the word tra or fra (in between) in four out of six sentences, and in three of those four (50 per cent of all sentences) tra and fra were followed by the names of two cities. Thus in Italian andirivieni, at least according to EUROPARL’s data, seemed to be used mostly when referring to physical motion between two locations (namely cities), whereas come and go, as noted earlier, referred mostly (50 per cent) to people and to a much lesser degree (5 per cent) to fashion. No left-side collocates seemed to predominate. From a colligational perspective it is important to note that andirivieni is a noun and as such is usually preceded by an article and/or an adjective. The semantic prosody was mainly negative, highlighting the burden and absurdity of the going back and forth and the desire to end it.
Searching for come and go in the English-Italian section of the EUROPARL corpus yielded 16 hits (English and Italian paired sentences). Three sentences were not considered since the English and Italian sentences did not
match. Of the remaining 13 sentences, come and go was translated in 11 different ways. Only three times was it translated with the default translation andare e venire (including the inflected form vanno e vengono). Here are the translations found in EUROPARL: arrivano e partono, dilagano, entrare e uscire, la visita(?), si susseguono, si fanno e si rifanno, fanno i pendolari, attraversare, andare e venire, vanno e vengono, passano, spostarsi. The findings in EUROPARL showed that the formulaic nature of the sequence come and go, as demonstrated by frequency data and the analysis of colligational, collocational and semantic patterns, was lost when it was translated into Italian. In addition, they showed that dictionary-based translations were hardly helpful: out of the 13 sentences found in the EUROPARL corpus, only three translated come and go according to a translation proposed by IEIRZD. The following table shows frequency counts in BNC, TCOAE, CORIS, LRC and EUROPARL of come and go (and three of its inflected forms) and some of the corresponding Italian translations.
Table 13.4 Comparing frequencies across BNC, TCOAE, CORIS, LRC and EUROPARL

                    BNC   TCOAE   CORIS   LRC   EUROPARL
ENGLISH
come and go         241   183     -       -     16
coming and going    92    35      -       -     1
came and went       163   132     -       -     0
come and gone       67    102     -       -     4
ITALIAN
andirivieni         -     -       264     660   6
vanno e vengono     -     -       96      489   3
andare e venire     -     -       73      147   3
vai e vieni         -     -       6       50    0
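The spread of renderings for come and go in the EUROPARL sentence pairs can be summarised with two simple figures: the number of distinct renderings and the share of the dictionary default. The Python sketch below uses only the numbers given in the text. The count of one for each non-default rendering is not reported directly; it follows arithmetically from 13 sentences, 11 different renderings and three uses of the default. The variable names are invented for illustration.

```python
from collections import Counter

# Default dictionary rendering; includes the inflected form vanno e vengono.
DEFAULT = "andare e venire"

# Renderings of 'come and go' in the 13 usable EUROPARL sentence pairs.
renderings = Counter({
    DEFAULT: 3,
    "arrivano e partono": 1,
    "dilagano": 1,
    "entrare e uscire": 1,
    "la visita(?)": 1,
    "si susseguono": 1,
    "si fanno e si rifanno": 1,
    "fanno i pendolari": 1,
    "attraversare": 1,
    "passano": 1,
    "spostarsi": 1,
})

total = sum(renderings.values())
distinct = len(renderings)
default_share = renderings[DEFAULT] / total
print(total, distinct, round(default_share, 2))  # 13 11 0.23
```

On these figures the default translation accounts for fewer than one in four renderings, which is the sense in which dictionary-based translations were found to be of limited help.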
Although the sequence came and went was the most frequent inflected form of the sequence come and go, it was not present in the EUROPARL corpus. This finding can be interpreted in different ways. It could, once again, indicate that the sequence came and went was completely transformed when translated and thus lost its formulaic status. It could point to a deficiency of the EUROPARL corpus due to its size and composition. Maybe a different and/or larger corpus would have yielded
different results. It could also be explained by the highly formulaic nature of the sequence, which, in an official setting such as the European Parliament, could be avoided by speakers, writers and translators because it is considered too colloquial. Unfortunately, the scarcity of data from EUROPARL (there is just one sentence) did not allow a complete investigation of the sequence coming and going. EUROPARL returned 4 hits for the sequence come and gone. From the EUROPARL results it would seem that there was no default translation for the English sequence come and gone: all four results presented different translations. As expected, and as explained above, all four English occurrences of come and gone were preceded by the verb have or one of its inflected forms. Again the formulaicity of the English sequence was absent in its Italian translations. The Italian binomial andare e venire is formulaic, based on the analysis of all the relevant sentences found in CORIS and LRC. The noun andirivieni is a fused form of the imperatives of the verbs andare (to go) and rivenire (to come back). As noted earlier, fused forms are often a result, over time, of the very strong ties between elements of a formulaic sequence. This research also confirmed Benor and Levy’s (2006) claim that come and go was an irreversible binomial. Indeed, the BNC returned 241 sentences for the string come and go but only four for go and come. TCOAE did not contain a single sentence with the sequence go and come. Searching for go and come in CWOEC yielded only five sentences. Analysis of the sentences in BNC and CWOEC containing go and come made it evident that the sequence did not act as a unit and that come was often an integral part of another sequence such as come back, come out, come in and come through.
The Italian translation andare e venire inverts the positions of the two verbs (literally, go and come). However, this study confirmed that, while Italian might invert the order of the words, the irreversibility of the corresponding formulaic binomial translation is maintained.
Table 13.5 Frequency counts of andare e venire and venire e andare in CORIS, LRC and EUROPARL

Corpus      andare e venire   venire e andare
CORIS       73                0
LRC         147               1
EUROPARL    3                 0
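Irreversibility can be expressed as the proportion of occurrences that follow the dominant order. The Python sketch below applies this to the counts reported in the text and in Table 13.5; the function name order_preference is invented for illustration, and the measure is a simple descriptive ratio rather than a statistic used in this study.

```python
def order_preference(forward, reverse):
    """Share of occurrences in the forward order; None if the pair never occurs."""
    total = forward + reverse
    if total == 0:
        return None
    return forward / total

# Counts reported in the text (BNC) and in Table 13.5 (CORIS, LRC, EUROPARL).
pairs = {
    "BNC: come and go / go and come": (241, 4),
    "CORIS: andare e venire / venire e andare": (73, 0),
    "LRC: andare e venire / venire e andare": (147, 1),
    "EUROPARL: andare e venire / venire e andare": (3, 0),
}
for label, (fwd, rev) in pairs.items():
    print(label, round(order_preference(fwd, rev), 3))
```

A value close to 1 indicates a strongly fixed order. The BNC figure slightly understates the preference, since the four go and come sentences were found not to act as a unit.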
Conclusions
In this study more than 2500 types of go and VERB, VERB and go, and go or VERB binomials were analysed. Of these, based on the criteria explained above and especially on the resources used (and their limitations), about 100 types were studied in greater detail. Of these, the following 12 sequence types were considered formulaic according to the criteria explained earlier.
• Come and go
• Coming and going
• Come and gone
• Came and went
• Touch and go
• Stop and go
• Shop and go
• Dead and gone
• Been and gone
• Go and get
• Go and see
• Go and buy
The first finding, which directly confirmed the validity of the methodology used, was that formulaic sequences appeared consistently, and often with very similar relative frequencies, across the corpora used in this research. This finding confirmed that these formulaic binomials were pervasive, and the consistency of their presence could indicate that they were not dependent on a specific corpus. Two binomials did not follow this rule: shop and go and been and gone. The binomial shop and go seemed to be a locale-specific formulaic sequence, since in most US regions it was found hyphenated and used formulaically but not so in other English-speaking areas. The binomial been and gone, while present in the BNC and EUROPARL, was not found in the TCOAE. Another important finding was that, of all 12 formulaic binomials listed above, only the binomial go and buy retained its formulaic status consistently when translated into Italian. Despite the fact that binomials like come and go and its Italian translation andare e venire are widely used formulaically in both English and Italian original texts, when translated, all except the sequence go and buy lost their formulaic status.
The data suggest, through frequency statistics and the analysis of collocational, colligational and semantic prosody patterns, that while the English language is rich in formulaic binomial sequences based on the verb to go, the idiom principle (Sinclair, 1991) does not hold true for the Italian translations of the same formulaic sequences. If this finding is further confirmed by similar research focusing on other types of formulaic sequences and on other language pairs, then it might be possible to affirm that translated texts are characterized by a lack, or a low frequency, of formulaic sequences; that is, the lack or low frequency of formulaic sequences could be a translation universal, a common trait of translated texts (Baker, 1992). The methodology used in this research can be applied to any language pair, the main limitation being the availability of reliable and sufficiently large parallel corpora and monolingual comparable corpora. This research represents a step towards fully understanding the phenomenon of formulaic binomial sequences in English, in particular as they relate to their translations in Italian. There are at least five areas in which the results of this study, and more generally a better understanding of the relationship between formulaicity and translation, could be applied: translation practice, translation teaching, bilingual lexicography, computer assisted translation (CAT) tools and machine translation.
References
Baker, M. (1992). In other words. A coursebook on translation. London, New York: Routledge.
Benor, S. B. & Levy, R. (2006). The chicken or the egg? A probabilistic analysis of English binomials. Language, 82 (2), 233–78.
Bernardini, S. (2003). Bi-directional corpora and translation: The CEXI corpus. In S. Conrad (Ed.) TESOL Quarterly Special Issue on Corpus Linguistics in TESOL, 528–37.
Bybee, J. (2006). From usage to grammar: The mind’s response to repetition. Language, 82 (4), 711–33.
Fillmore, C. J. (1971). How to know whether you’re coming or going. In K. Hyldgard-Jensen (Ed.) Linguistik. Frankfurt/M.: Athenäum, 369–79.
Hanks, P. (1996). Contextual dependency and lexical sets. International Journal of Corpus Linguistics, 1 (1), 75–98.
Hickey, T. (1993). Identifying formulas in first language acquisition. Journal of Child Language, 20, 27–41.
Hoey, M. (2005). Lexical priming: A new theory of words and language. London/New York: Routledge.
Jakobson, R. (1959). On linguistic aspects of translation. In R. A. Brower (Ed.) On translation (pp. 232–39). Cambridge, MA: Harvard University Press.
Kenny, D. (1998). Creatures of habit? What do translators usually do with words. Meta, XLIII (4), 515–23.
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. Proceedings from The Tenth Machine Translation Summit, Phuket, 79–86.
Kuiper, K. (2000). On the linguistic properties of formulaic speech. Oral Tradition, 15/2, 279–305.
Kuiper, K. (2004). Formulaic performance in conventionalized varieties of speech. In N. Schmitt (Ed.) Formulaic sequences: acquisition, processing and use (pp. 37–54). Amsterdam/Philadelphia: John Benjamins.
Lambrecht, K. (1984). Formulaicity, frame semantics, and pragmatics in German binomial expressions. Language, 60 (4), 753–96.
Makkai, A. (1978). Idiomaticity as a language universal. In J. Greenberg (Ed.) Universals of human language, vol. 3 (pp. 401–48). Stanford, CA: Stanford University Press.
Malkiel, Y. (1959). Studies in irreversible binomials. Lingua, 8, 113–60.
Moon, R. (1998). Fixed expressions and idioms in English. Oxford: Clarendon Press.
Philip, G. (2004). Arriving at equivalence: Making a case for comparable corpora in translation studies. Paper presented at CULT-BCN: Corpus Use and Learning to Translate, Barcelona, Spain. Retrieved November 3, 2008 from http://amsacta.cib.unibo.it/archive/00002124/
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Tognini Bonelli, E. (2002). Functionally complete units of meaning across English and Italian: Towards a corpus-driven approach. In B. Altenberg & S. Granger (Eds.) Lexis in contrast (pp. 73–95). Amsterdam/Philadelphia: John Benjamins.
Tomasello, M. (2003). Constructing a language. Cambridge/London: Harvard University Press.
Wray, A. (2002). Formulaic language and the lexicon. Cambridge: Cambridge University Press.

Dictionaries and corpora (with abbreviations used)
BNC        British National Corpus. http://www.natcorp.ox.ac.uk/
CALD       The Cambridge Advanced Learner’s Dictionary. Cambridge University Press, 2007. http://dictionary.cambridge.org/
CIC        Cambridge International Corpus. http://www.cambridge.org/elt/corpus/international_corpus.htm
CORIS      CORpus di Italiano Scritto. http://corpora.dslo.unibo.it/coris_eng.html
CWOEC      Collins WordbanksOnline English Corpus. http://www.collins.co.uk/corpus/CorpusSearch.aspx
IEIRZD     Italian-English-Italian Ragazzini Zanichelli Dictionary. G. Ragazzini (Ed.) Zanichelli Editore, 2005
EUROPARL   European Parliament Proceedings Parallel Corpus 1996–2006. http://www.statmt.org/europarl/
LRC        La Repubblica Corpus. http://dev.sslmit.unibo.it/corpora/corpora.php
TCOAE      Time Corpus of American English. http://corpus.byu.edu/time/
Index
abbreviating devices in old English verse 228 abstract rules children’s 12–13 academic writing formulaic language in 90, 92 lexical bundles in 93 accuracy of formulaic sequence use 74, 77, 79–80, 84 of formulaic sequence use, high- vs. low-achievers 82, 83 of formulaic sequence use, impact of task variation on 49 of multi-word unit production 195 Achilleus epithetic use 223 acronymic functions of formulaic sequences 218, 228–9 acrostics 229, 231n. 21 adjective-noun collocations non-native diverse production of 31–2 non-native repetition frequency of 32–3 non-native use of 25–31, 41–4 non-native use of strongly-associated 35–6 non-native use of typical 34–5 non-native use over time 36–41, 43–4 non-native vs. native use of 24, 25 adolescent speakers use of prefabs 187 adult native spoken language equating phonological coherence with formulaic sequences in 180 holistic storage and retrieval 178–9 phonological coherence in 177, 182, 184–8 adult second language acquisition role of formulaic sequences in 90–1 adverbial idioms 253 alignment of formulaic sequences 188 alliteration 230n. 1
alliterative bridges as fillers 221–2 alternational code-switching 149n. 2 formulaic sequences in 142–8 non-syntatic constituents in 139–40 analytic language processing 51, 131 Arabic idioms conventional knowledge and 238, 246 corpus-based research on 235, 237–8 figurative patterns in 238–40 garden-path constituents 239–40 isomorphism in 250–2, 254–5 metaphor and metonymy interactions 242–6, 253–4 metonymic extensions 241–2, 253, 255 special cases 238, 247–8 Arabicorpus 238, 255n. 2 attitudinal/modality stance bundles 113, 114, 115, 119, 120, 121 attribute specification bundles 113, 115, 116, 118, 119, 120, 122 automatic speech synthesis 184–5 bare binomials 258–9 behavioural analysis of regular multi-word sequences processing 156–61 Beowulf (epic poem) 213, 215, 231n. 12 acronymic formulae in 228–9 discourse-structuring formulae in 218–20 epithetic formulae in 223–4 filler formulae in 221–2 formulaic sequence functions in 216 gnomic formulae in 225–6 tonic formulae in 227 Bergen Corpus of London Teenager Language (COLT) 187 binomials 259 definition of 258 types of 259
276
Index
blogs 217 acronymic formulae in 229 discourse-structuring formulae in 220–1 epithetic formulae in 224–5 filler formulae in 222–3 formulaic sequence functions in 213, 215–16, 217–18 gnomic formulae in 226–7 tonic formulae in 227–8 Blogspot 217 ‘blue blood’ idiomatic reading of 248 BNC see British National Corpus British National Corpus (BNC) 155, 198, 259, 260, 261, 262, 264, 266, 269, 270 British National Corpus (BNC) World Edition (2002) academic written corpus 27, 44 CAF parameters see complexity, accuracy and fluency parameters Cambridge Advanced Learner’s English Dictionary 259, 260, 261, 266 Cambridge International Corpus (CIC) 259, 261 child language code-switching 141–8 criteria for formulaic sequences identification in 176, 189n. 1 formulaic language in 12–13, 51 phonological coherence 176–7, 179 repetitive lexical frames 12–13 Chilean Spanish L2 learners degree of formulaic language use by 61–2 formulaic sequence categories used by 62–4 native speakers vs. 60, 61, 62–3 word production by 60–1 Chilean Spanish SLA studies cross-linguistic comparison 15–16, 61, 64, 65 on task variation impact on formulaic sequence use 50–1, 54–64 Chinese L2 learners collocation use 25–32, 41–4 collocation use, high-frequency and typical 34–5 collocation use over time 36–41, 43–4 collocation use, repetition frequency of 32–3, 41 collocation use, strongly-associated 35–6, 41 formulaic sequence use 74–8, 82–4 formulaic sequence use accuracy 82
formulaic sequence use development 79–80 formulaic sequence use frequency 81 formulaic sequence use variation 81–2 chunks and chunking 6, 47, 51, 52, 73, 176, 178, 179, 180, 184 CIC see Cambridge International Corpus clausal formulaic sequences 59, 62 code-switching (CS) syntactic constituent vs. formulaic sequence 129–31, 134 cognition hypothesis 49 Collins WordbanksOnline English Corpus (CWOEC) 259, 260, 261, 270 collocations 23 non-native vs. native use of 23–5 see also adjective-noun collocations COLT see Bergen Corpus of London Teenager Language come and go (binomial) 258, 266–7, 271 Italian translation of 267–70 communicative tasks impact on formulaic sequence use 50, 55–6, 60–5 complexity, accuracy and fluency (CAF) parameters task variation impact on 49–50 see also accuracy computer-mediated communication space-saving devices in 229 conceptual socialisation 14 constituency formulaic sequence frequency effect on 7 conversational gambits 218 CORIS see Corpus di Italiano Scritto corpus-based analysis 2, 5–6, 91 of Arabic idioms 235, 237–8 of formulaicity and translation 257–8 of lexical clusters in EAP textbooks 93–104 limitations of 261 of non-native vs. native collocation use 23–5 significance of size and composition of corpus 263 Corpus di Italiano Scritto (CORIS) 260, 261, 262, 267–8, 269, 270 cranberry elements 236 Arabic idioms 239 cross-cultural pragmatics 15 corpus analysis in 5 CS see code-switching
cultural differences impact on Chilean Spanish and French formulaic sequence use 15–16 impact on formulaic language use 15–16 impact on formulaic sequence use 50–1 CWOEC see Collins WordbanksOnline English Corpus Cynewulf 215, 229 derivation as criterion of formulaic sequence identification 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 144, 145, 146, 147 desire stance bundles 113, 115 disc jockey fluency 183 discourse organizers 93, 108 in electrical engineering textbooks 113, 115 in electrical engineering vs. ESP textbooks 119, 120, 121 in ESP textbooks 117, 118 discourse-structuring functions of formulaic sequences 218–21 discursive formulaic sequences 59–60, 64, 65, 67n. 12 discursive functions of idioms 235 distributional approach of corpus-based research 91 EAP textbooks see English for academic purposes course materials early fronto-central negativity 166–8 early parietal positivity 168–70 EEG see electroencephalography EL see embedded language electrical engineering introductory textbooks functional classification of four-word lexical bundles in 112–17 lexical bundles in 108, 110–12 lexical bundles in ESP textbooks vs. 119–22 referential bundles in 123 stance bundles in 123 electroencephalography (EEG) regular multi-word sequences processing 161–2 electrophysiological analysis (ERP) GAM method vs. averaging method 162 of regular multi-word sequences processing 162–9
embedded language (EL) 149n. 1 emblematic idioms 238, 246 empirical research on formulaic sequence learning 73–4 on phonological coherence in spontaneous adult native spoken language 185–8, 189 on text memorization 73 English for academic purposes (EAP) course materials formulaic language in 90–3, 104 lexical bundles in 89 lexical clusters in 93–7, 101–3 lexical clusters in instructional subcorpus of 98–9, 103 lexical clusters in textual subcorpus of 99–101, 103 see also English for specific purposes textbooks English for specific purposes (ESP) textbooks 109–10 gap in language use between electrical engineering textbooks and 110 lexical bundles in 108, 110–12, 117–18 lexical bundles in electrical engineering introductory textbooks vs. 119–22 see also English for academic purposes course materials English formulaic binomials 257, 271 identification of 264–7 Italian translation of 258–9, 260–2 Italian translation of, formulaicity in 271–2 English formulaicity 66n. 3 in code-switching 141–8 English Native Speaker Interview Corpus (ENSIC) 187 English second language acquisition studies 53 on collocation use 23–5 on collocation use, multiple case study approach 25–44 on formulaic sequence use 74–8 on intonation in formulaic sequence 187–8 on phonological coherence 183–4 on proficiency and writing ability 78–9, 83 ENSIC see English Native Speaker Interview Corpus epistemic stance bundles 112, 114, 117, 118, 119, 120–1
epithetic functions of formulaic sequence 218, 223–5 ERP see electrophysiological analysis ESP textbooks see English for specific purposes textbooks EUROPARL see European Parliament Proceedings Parallel Corpus European Parliament Proceedings Parallel Corpus (EUROPARL) 260, 261, 267–8, 269, 270 eye-tracking studies of multi-word unit processing 195–6, 197–205 of terminal words in formulaic sequences 10–11 FFD see first fixation duration figurative idioms 197 filler functions of formulaic sequences 218, 221–3 first fixation duration (FFD) 201–2, 204, 206, 207 first language (L1) acquisition formulaic language in 92 formulaic sequences 13 usage-based approaches 12–13 FL see formulaic language fluency of language formulaicity and 51, 90 fluency of multi-word unit production 195 fluency of second language formulaic sequence and 53 intonation impact on 183–4 fMRI see functional magnetic resonance imaging formulaic binomials identification criteria for 263–4 see also English formulaic binomials formulaic language (FL) 2–3, 257 advanced non-native use of 52–3, 61–2 as aspect of lexicon 3 children’s use of 12–13, 51 definition and notion of 1–2, 47, 107 in EAP 90–3 in EAP textbooks 104 non-native learners use of 14–15 novelty in 15 psycholinguistic perspectives of 51–2 in SLA research 47–8 usage-based language models and 3–4, 16
formulaic language (FL) development non-native learners 25 non-native learners, inherent variation amongst 43–4 formulaic sequence(s) (FS) acquisition of 52 categories of 57–60 categories of, distribution of 62–4 characteristics of 132 definition and notion of 3, 88, 131–2, 214, 262 as fluency device 53 L1 acquisition and 13 lexical 58, 59, 62–3, 64, 65 modification of criteria for 139–40 non-formulaic sequences vs. 11 non-lexical 58, 59–60, 64, 65 phonological aspects of see phonological coherence processing of 10–11 psycholinguistic validity of 182–3 word frequency impact on processing of 11–12 formulaic sequence (FS) development of Chinese L2 learners 79–82 measurement indices 77 formulaic sequence (FS) functions acronymic 218, 228–9 in blogs 213 compensatory 214 discourse-structuring 218–21 epithetic 218, 223–5 genre studies and 230 gnomic 218, 225–7 identity-marking 214 in Old English verse 213, 215–16 time-buying 218, 221–3 tonic 218, 227–8 formulaic sequence (FS) identification 6, 10, 56–7, 77, 129 application of criteria for 134 assessment of criteria for 135–9 in child language 189n. 1 criteria for 10, 11, 132–4, 148–9, 262–3 phonological criteria for 182–3, 186 phonological criteria for, in child language 176 formulaic sequence (FS) learning/use 86–7 corpus-based analysis 5–6 cultural differences impact on 50–1 EAP learners 88, 89
in variety of genres 139 high- vs. low-achievers 81–3 non-native vs. native 49–50, 53–4, 64 research methods to study 91 SLA studies on 73–8 task completion and 49 task variation impact on 49–50, 54–6, 62–5 text memorization impact on 72–3, 82–4 writing scores and 79 frame theory 258 French L2 learners degree of formulaic language use by 61–2 English collocation use by 23 formulaic sequence use by 62–5 native speakers vs. 48–9, 60, 61, 62–3 word production by 60–1 French SLA studies 48–9, 53 cross-linguistic comparison 15–16, 61, 64, 65 on task variation impact on formulaic sequence use 49–50, 54–64 frequency-based approach collocation use 23–4, 25 on psycholinguistic underpinnings of phonological coherence 181–2 frequency of collocation use Chinese L2 learners 32–3, 41 frequency of formulaic sequence use 77 by high- vs. low-achievers 81 impact on constituency 7 limitations 262–3 FS see formulaic sequences functional categories of lexical bundles 108 in electrical engineering textbooks 112–17 in ESP and electrical engineering textbooks compared 119–22 in ESP textbooks 117–18 functional categories of lexical clusters 95–6 in EAP instructional materials 98–9 in EAP textbooks 99–101 functional magnetic resonance imaging (fMRI) 168 GAM see generalized additive modelling approach generalized additive modelling approach (GAM) 162, 163 German L2 learners collocation use by 24 French formulaic sequence use by 53
gnomic functions of formulaic sequences 218, 225–7 go frame-type formulaic sequences 260 grammatical formulaic sequences 59–60, 64, 65, 67n. 10 grammatical/lexical indication 132, 133, 134, 135, 136, 137, 138, 140, 141, 144, 145, 146, 147 grammatical irregularity 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 144, 145, 146, 147 happy birthday! formulaicity in 135, 136 heteromorphic distributed lexicon 3 holistic language processing 51, 131 of formulaic sequence 11, 52 holistic storage vs. 180–1 of regular multi-word sequences 151–3, 170 holistic storage high frequency multi-word units and 182 holistic processing vs. 180–1 phonological coherence and 178–80 hwæt (‘listen’, ‘lo’, ‘well’) as discourse marker 218–20 hyperbolic idioms 238, 247 I guess as discourse marker 220 identification/focus referential bundles 113, 115–16, 117, 118, 119, 120, 122 idiolects 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 144, 145, 146, 147, 148 idiom(s) 194, 234 corpus-based studies on 234 definition of 236 garden-path constituents 237 phonological features of 186, 190n. 3 reading time 207 semantic properties of 234–5 see also Arabic idioms idiom principle 131 idiomaticity non-native vs. native use of 14, 23–5, 50 IEIRZD see Italian-English-Italian Ragazzini Zanichelli Dictionary immediate free recall 153–4 impact of whole-string frequency/ probability on 154–6
impersonal epistemic stance bundles 120, 121 inappropriate application as criterion of formulaic sequence identification 132, 133, 134, 135, 136, 138, 139, 140, 141, 144, 145, 146, 147 incremental automatization 52 insertional code-switching 141–2, 149n. 2 intensifier-adjective collocations non-native vs. native use of 23–4 intercultural communication and competence 15 intonation 177, 182 chunks and 180 phonological features of 187–8 unity boundaries 185 intonational meaning 183–4 introductory textbooks 109 gap in language use between ESP textbooks and 110 lexical bundles in 109 irreversible binomials 258, 259, 270 isomorphism in Arabic idioms 250–2, 254–5 definition and notion of 249–50 Italian-English-Italian Ragazzini Zanichelli Dictionary (IEIRZD) 260, 269 Japanese language formulaicity in code-switching between English and 141–8 kick the bucket 149n. 7, 151, 178 formulaicity in 5, 135, 139 tonic function of 227 L1 acquisition see first language acquisition L2 learners see non-native learners L2 users see non-native users La Repubblica Corpus (LRC) 260, 261, 262, 267–8, 269, 270 language dual-nature view of 71–2 in EE introductory textbooks vs. ESP textbooks 110, 111, 119–24 of esoteric societies 15 language processing 10 formulaic sequences in 131–2 frequency effects in 52 modes of 51–2
lexical bundles 88–9, 108, 153, 182, 194, 195, 197 definition and notion of 92–3, 94–5, 108 in discipline-specific books 109 in EAP course materials 89 in electrical engineering introductory and ESP textbooks compared 119–22 in electrical engineering introductory textbooks 108, 110–11, 112–17 in ESP course materials 108, 110–12, 117–18 functional categories of 108, 112 L1 learners use of 92 non-lexical bundles vs. 153 pragmatic functions of 107–8 reading time 207 in spoken corpus 174 lexical clusters 89 functional categories of 95–6 lexical clusters in EAP course materials 93–6 children’s use of 101–3 frequency of 101, 103 identification and functions of 96–101 lexical frames, repetitive children’s use of 12–13 lexical recall bathtub effect of 11–12 lexicalized sentence stems 139 lexicon, mental 151–3 limited attentional capacity model 49 linguistic structure impact on immediate free recall and 157–8 literally non-compositional idioms 236, 238–9, 248–9 literary allusion traditional referentiality vs. 216 LiveJournal 217 LLC see London-Lund Corpus London-Lund Corpus (LLC) 187 longitudinal studies on adult L2 acquisition 90 on formulaicity in code-switching by children 141–8 on non-native learners formulaic language development 25, 43–4 Longman Spoken and Written English (LSWE) corpus 107–8 loose substitutability 217
LRC see La Repubblica Corpus LSWE corpus see Longman Spoken and Written English corpus main clauses word order in 8–9 matrix language 149n. 1 memory analysis and 4–5, 16 metaphor 238 metonymy and 238, 242–6 metaphorical idioms 241, 254 metaphtonymy 242–5 metonymy 238, 241–2, 253 metrical fillers epithets as 223–4 MI statistics see mutual information statistics mismatch with maturation as criterion of formulaic sequence identification 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 144, 145, 146, 147 mixed-effects modelling 197 morphology formulaic language in 6–7 motivation definition and notion of 236 multiple case study approach of collocation use 25 limitations of 44 methodology of 26–8 results of 28–44 multi-word unit(s) (MWU) distribution of pauses in 187 memory traces in brain 160–1 psycholinguistic validity of 195–7, 207 types of 194 multi-word unit (MWU) processing 194 differences between types of 204–5, 207 eye-tracking experiment 197–205 holistic 151–3, 170 holistic, behavioural perspective 156–61 holistic, electrophysiological perspective 162–9 word frequency and 206 multi-word units (MWU), high frequency holistic storage of 182 multi-word units, terminal words of 206 eye-tracking experiment 195–6
mutual information (MI) statistics adjective-noun collocation use 28 collocation use 24, 35–41, 43 MWU see multi-word units natural language processing (NLP) phonological coherence and 182, 184–5 needs-only analysis (NOA) 15, 51, 129–30, 149n. 4 notion of 66n. 5 NICLE see Nottingham International Corpus of Learner English NLP see natural language processing nonce words experimental studies 13 non-native learners collocation use by 25 language development of 25 non-native users, advanced 48–9 formulaic language of 52–4, 64 task variation impact on 49–50 non-native vs. native learners collocation use, corpus-based analysis 23–5 idiomaticity use by 14, 50 performance differences 49–50 non-native vs. native users fluency and accuracy of multi-word unit production 195 formulaic language use 61–2, 64 formulaic sequence use 62–4, 67n. 12, 91–2 word production 60–1 non-syntactic constituents in alternational code-switching 139–40 noticing L2 features 72 Nottingham International Corpus of Learner English (NICLE) 187, 188 noun phrases 8–9 ‘noun und noun’ 258–9 okay as discourse marker 220 Old English verse formulaic sequence functions in 213, 215–16 metrical requirements 221–2, 231n. 16 see also Beowulf open choice principle 131 oral-formulaic analyses 215–17 overgeneralization 13
participant-oriented lexical clusters 96, 98, 99, 100 pauses, distribution of in multi-word units 187 performance indication as criterion of formulaic sequence identification 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 144, 145, 146, 147 personal epistemic bundles 113, 114–15, 119, 120 personal journal type blogs 217, 220 phonologic reduction frequency of formulaic sequences and 177 phonological coherence 175–8 in adult native spoken language 182, 184–8 in adult spoken language 177, 185 in child language 176–7, 179 as formulaic language indicator 10 implications of 182–5 psycholinguistic underpinnings of 178–82 phrasal formulaic sequences 59, 62, 67n. 10 phraseological approach of corpus-based research 91 on collocation use 24–5 portmanteau structure 144–8 positron emission tomography study 168 pragmatic functions as criterion for formulaic sequence identification 132, 133, 134, 135, 136, 137, 138–9, 140, 141, 144, 145, 146, 147, 148 formulaic language and 107 prefabricated language duration of pauses before 187 previous encounter as criterion of formulaic sequence identification 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 144, 145, 146, 147, 148 prosody in spoken formulaic sequences 174–5 psycholinguistic research studies 2–3, 5, 9–11, 16 on formulaic sequences 51–2, 182–3 on multi-word units processing 195–7, 207 on multi-word units storage and computation 151–3 phonological coherence and 178–82
RC see restricted collocations referential bundles 93, 108 in biology textbooks 109 compared in electrical engineering and ESP textbooks 119, 120, 121–2 in electrical engineering textbooks 123 in ESP textbooks 117, 118 referential expressions in electrical engineering textbooks 113–14, 115–17 related languages 50 comparative study of impact of task on formulaic sequence use 60–1, 64, 65 cultural differences in 15–16 research-oriented lexical clusters 95–6, 98, 99, 100, 101 restricted collocations (RC) 194, 197 reading time 207 restricted exchangeability principle 57 restricted verb-noun collocations non-native vs. native use of 24 robotic speech 182, 184–5 runes acrophonic value 228–9 second language acquisition (SLA) research 14–15 formulaic language in 47–8, 51–4 on formulaic sequences 73–4 limitations of 72 on text memorization 72, 73 second language (L2) learners see non-native learners second language (L2) users see non-native users self-paced reading of formulaic sequences 10, 51–2 multi-word processing 196 semantic opacity 132, 133, 134, 135, 136, 137, 138–9, 140, 141, 144, 145, 146, 147 sentence reading time (SRT) 201–2, 207 single choices 131 see also formulaic sequences situation/register specificity 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 144, 145, 146, 147 SLA research see second language acquisition research socialisation research 15 Spanish L2 learners see Chilean Spanish L2 learners
Spanish SLA studies see Chilean Spanish SLA studies speaking of X as discourse marker 220–1 spoken formulaic sequences 91, 174 prosodic characteristics 174–5 in spontaneous speech 182–3 spoken language 4 boundedness in 8–9 lexical bundles in 174 spontaneous speech distribution of pauses in 186–7 formulaic sequence identification in 182–3 idioms in 186 intonation in 187–8 SRT see sentence reading time stance bundles 93, 108 in biology and electrical engineering textbooks 109 compared in electrical engineering and ESP textbooks 119, 120–1 in electrical engineering textbooks 112–13, 114–15, 123 in ESP textbooks 117, 118 stress placement 177–8 Swedish L1 learners 54 differences in French and Spanish formulaic sequence use 50–1, 57 syntactic constituents in code switching 129–30, 134, 137–8 syntactic models 130 syntactic parsing 185 syntax formulaic language in 7–9 t-score non-native collocation use 28, 30, 34–5, 41, 43 task variation 55–6 impact on formulaic sequence use 49–50, 54–6, 62–5 TCOAE see Time Corpus of American English tempo 177 in robotic speech 184–5 terminal unit analysis eye-tracking experiment 195–6 text-oriented lexical clusters 96, 98, 99, 100 text-to-speech technology 184 text memorization 71 effectiveness as SLA strategy 84
impact on formulaic sequence learning 72–3, 74, 82–4 SLA studies on 72, 73, 74–8 textbooks of EAP see English for academic purposes course materials Time Corpus of American English (TCOAE) 259, 260, 261, 262, 264, 265, 266, 270 time/place/text reference bundles 114, 115, 116–17, 119, 120, 121–2 tonic functions of formulaic sequences 218, 227–8 traditional referentiality 215–16 translation formulaicity in 257, 258, 266–72 TTR see type-token ratios type-token ratios (TTR) of collocation use 24 of adjective-noun collocation use 28, 32–3, 41–2 underlying frames as criterion of formulaic sequence identification 139–40, 141, 144, 145, 146, 147, 148 unit hypothesis 130 usage-based language 1 formulaic language and 3–4, 16 framing of processes and 258 L1 acquisition and 12–13 testing ground for 8–9 variation in Old English studies 221–2 variation rate of formulaic sequence use 77, 84 Variations in English Words and Phrases 155 verb and go binomial sequences 264–7 verb-noun collocations non-native vs. native use of 24 weblogs see blogs well as discourse marker 220 whole-string frequency/probability 153–4, 170 impact on immediate free recall 154–6 impact on immediate free recall, behavioural study 156–61 impact on immediate free recall, electrophysiological study 162–9
word frequency impact on formulaic sequence processing 11–12 measures of 199–200 multi-word unit processing and 206 word length impact on total and first pass reading times 206
word reading time (WRT) 201–2, 204–5, 207 Wordpress 217 written language 4 formulaic sequences in 91 see also academic writing WRT see word reading time