Corpora and Language Learners
Studies in Corpus Linguistics

Studies in Corpus Linguistics aims to provide insights into the way a corpus can be used, the type of findings that can be obtained, the possible applications of these findings as well as the theoretical changes that corpus work can bring into linguistics and language engineering. The main concern of SCL is to present findings based on, or related to, the cumulative effect of naturally occurring language and on the interpretation of frequency and distributional data.

General Editor: Elena Tognini-Bonelli
Consulting Editor: Wolfgang Teubert

Advisory Board
Michael Barlow, Rice University, Houston
Graeme Kennedy, Victoria University of Wellington
Robert de Beaugrande, Federal University of Minas Gerais
Geoffrey Leech, University of Lancaster
Douglas Biber, North Arizona University
Anna Mauranen, University of Tampere
Chris Butler, University of Wales, Swansea
John Sinclair, University of Birmingham
Sylviane Granger, University of Louvain
Piet van Sterkenburg, Institute for Dutch Lexicology, Leiden
M. A. K. Halliday, University of Sydney
Michael Stubbs, University of Trier
Stig Johansson, Oslo University
Jan Svartvik, University of Lund
Susan Hunston, University of Birmingham
H-Z. Yang, Jiao Tong University, Shanghai

Volume 17

Corpora and Language Learners
Edited by Guy Aston, Silvia Bernardini and Dominic Stewart
Corpora and Language Learners

Edited by
Guy Aston, Silvia Bernardini, Dominic Stewart
University of Bologna at Forlì

John Benjamins Publishing Company
Amsterdam / Philadelphia
The paper used in this publication meets the minimum requirements of American National Standard for Information Sciences – Permanence of Paper for Printed Library Materials, ANSI Z39.48-1984.
Cover design: Françoise Berserik Cover illustration from original painting Random Order by Lorenzo Pezzatini, Florence, 1996.
Library of Congress Cataloging-in-Publication Data

Corpora and language learners / edited by Guy Aston, Silvia Bernardini, Dominic Stewart.
p. cm. (Studies in Corpus Linguistics, ISSN 1388–0373; v. 17)
Includes bibliographical references and index.
1. Language and languages--Computer-assisted instruction. I. Aston, Guy. II. Bernardini, Silvia. III. Stewart, Dominic. IV. Series.
P53.28.C68 2004
418’.0285-dc22    2004057693
ISBN 90 272 2288 6 (Eur.) / 1 58811 574 7 (US) (Hb; alk. paper)
© 2004 – John Benjamins B.V.
No part of this book may be reproduced in any form, by print, photoprint, microfilm, or any other means, without written permission from the publisher.
John Benjamins Publishing Co. · P.O. Box 36224 · 1020 ME Amsterdam · The Netherlands
John Benjamins North America · P.O. Box 27519 · Philadelphia PA 19118-0519 · USA
Contents

Introduction: Ten years of TaLC (D. Stewart, S. Bernardini, G. Aston) 1

A theory for TaLC?
The textual priming of lexis (Michael Hoey) 21

Corpora by learners
Multiple comparisons of IL, L1 and TL corpora: The case of L2 acquisition of verb subcategorization patterns by Japanese learners of English (Yukio Tono) 45
New wine in old skins? A corpus investigation of L1 syntactic transfer in learner language (Lars Borin and Klas Prütz) 67
Demonstratives as anaphora markers in advanced learners’ English (Agnieszka Leńko-Szymańska) 89
How learner corpus analysis can contribute to language teaching: A study of support verb constructions (Nadja Nesselhauf) 109
The problem-solution pattern in apprentice vs. professional technical writing: An application of appraisal theory (Lynne Flowerdew) 125
Using a corpus of children’s writing to test a solution to the sample size problem affecting type-token ratios (N. Chipere, D. Malvern and B. Richards) 137

Corpora for learners
Comparing real and ideal language learner input: The use of an EFL textbook corpus in corpus linguistics and language teaching (Ute Römer) 151
Can the L in TaLC stand for literature? (Bernhard Kettemann and Georg Marko) 169
Speech corpora in the classroom (Anna Mauranen) 195
Lost in parallel concordances (Ana Frankenberg-Garcia) 213

Corpora with learners
Examining native speakers’ and learners’ investigation of the same concordance data and its implications for classroom concordancing with ELF learners (Passapong Sripicharn) 233
Some lessons students learn: Self-discovery and corpora (Pascual Pérez-Paredes and Pascual Cantos-Gómez) 245
Student use of large, annotated corpora to analyze syntactic variation (Mark Davies) 257

A future for TaLC?
Facilitating the compilation and dissemination of ad-hoc web corpora (William H. Fletcher) 271

Index 301
Bionotes 307
Introduction: Ten years of TaLC
D. Stewart, S. Bernardini, G. Aston
School for Interpreters and Translators, University of Bologna, Italy
1. Looking back over 10 years of TaLC

TaLC was born in 1994, as a result of discussions among members of ICAME (the International Computer Archive of Modern and Medieval English), who realized that there was a growing interest in the use of text corpora in the teaching of languages and linguistics. The first TaLC conference was held in Lancaster in 1994, and its established purpose was well summed up in the announcement of the second conference (Lancaster 1996), which declared:

While the use of computer text corpora in research is now well established, they are now being used increasingly for teaching purposes. This includes the use of corpus data to inform and create teaching materials; it also includes the direct exploration of corpora by students, both in the study of linguistics and of foreign languages.
The 5th TaLC conference, held in the hilltop town of Bertinoro in the summer of 2002, provided an opportunity to reflect on many of the developments that have taken place over the last decade. Perhaps the most striking development concerns the nature of the corpora investigated. Back in 1994, contributors were primarily concerned with what we might term “standard” or “reference” corpora, which were carefully designed to provide representative samples of particular language varieties. Thus there was much quotation of data from the Brown and LOB corpora, which aimed to provide representative samples of American and British written English, and a “bigger the better” enthusiasm for the growing Bank of English and the about-to-be-published British National Corpus. Comparisons of Brown and LOB had stressed the importance of geographical differences, so there was also considerable attention to the International Corpus of English project, with its
attempt to produce corpora for a large number of different varieties.1 There was also attention to domain- and genre-specific corpora, restricted to such areas as the oil industry and newspapers.

Ten years later, it is clear that the distinctions now being made have become much more subtle. Geography and topic no longer seem to be the main criteria by which the type of corpora used in TaLC can be distinguished. Many of the papers in this volume, for instance, are concerned with corpora consisting of writing or speech produced by language learners, or of materials written for language learners. The question repeatedly implied is: what kinds of corpora are relevant for teaching?
2. Corpora AND learners

At TaLC 5, Henry Widdowson drew attention to the least prominent part of the TaLC acronym, namely the small “a” of “and”. He reminded us that “the conjunction ‘and’ [is] a very common word, number 3 in most frequency lists, and like most very frequent words, it has multiple functions […]”. He pointed out that it is not only “T” and “LC” that matter, but also the way they are related by this small, apparently insignificant conjunction “and”. Similarly, the interaction between language corpora “and” language learners may be of different kinds. In this volume we have identified three macro-areas, which appear to lie at the core of current research and applications of corpus linguistics to language teaching. Learners may be the authors or providers of corpus materials, they may be the ultimate beneficiaries of corpus insights, e.g. through the intermediation of the teacher or materials designer, or they themselves may be the main users of a corpus. This volume has accordingly been structured around three main sections, corresponding to these three different functions of the conjunction “and”.
2.1 Corpora BY learners

The first section is concerned with corpora consisting of materials produced BY learners. Following the pioneering work by Sylviane Granger and her team in developing the International Corpus of Learner English (ICLE: Granger 1998),2 there has been rapidly growing interest in producing corpora which can be used to study features of interlanguage (often in comparison with the
language produced by native speakers) and to analyse “errors” – the latter raising considerable questions as to identifying and classifying errors, and hypothesising “correct” versions corresponding to the learner’s intentions. The general assumption underlying such work is that by identifying features of learner language it may be possible to focus teaching methods and contents more precisely so as to speed acquisition. This section therefore presents a series of studies of learner corpora, dealing with both lexico-syntactic and discoursal aspects of learner language, in almost all cases by means of a comparison with a TL corpus of English. Some contributions, however, add a third variable: a corpus of the learners’ L1. This latter category, with which the section begins, thus examines learner language by means of a comparable corpus made up of three subcorpora: the students’ L2, L1 and TL.

Tono investigates the acquisition of English verb subcategorization frame (SF) patterns on the part of Japanese learners by drawing multiple comparisons between the three corpora comprising his Japanese EFL Learner (JEFFL) corpus. These are: (i) L1 Japanese, made up of newspapers and student compositions, (ii) TL English in the form of ELT textbooks at both junior and senior school level, and (iii) L2 English, i.e., his students’ interlanguage (IL), consisting of compositions and picture descriptions produced by students of varying levels of proficiency. The author’s aim is to study how various factors, principally the influence of verb SF patterns in Japanese, the degree of exposure to English SF patterns in the foreign language classroom, and the properties of inherent verb meanings in English, can influence the acquisition of such patterns on the part of Japanese learners.

Borin and Prütz also investigate aspects of syntax, in this case the frequencies of POS sequences, using a similarly-constructed three-way comparable corpus. As was the case with Tono, the authors’ corpus consists of (i) texts in L1, in this case Swedish (the Stockholm Umeå Corpus of written Swedish), (ii) TL English in the form of the written part of the BNC sampler, and (iii) L2 English (IL), namely the Uppsala Student English Corpus. The three-way comparison favoured by both Tono and Borin and Prütz represents a move away from most other studies of learner language corpora, where the IL has been compared only with TL native-speaker production. The methodology adopted thus reflects a shift of emphasis towards considerations of L1 interference in IL. In the case in point it is claimed by Borin and Prütz that, by comparison with native-speaker English, there is significant overuse or underuse of specific
POS sequences in the IL of advanced Swedish learners of English, and that such discrepancies are due to the influence of L1, inasmuch as Swedish is characterized by POS sequences analogous to the IL.

The overuse or underuse of specific elements of usage on the part of learners by comparison with native speakers is also taken up by Leńko-Szymańska, who is one of a number of contributors who prefer a two-way comparison to investigate learner language, i.e., between a learner corpus and a native-speaker corpus. The author uses the PELCRA corpus of learner English (essays written by Polish university students of varying levels of proficiency) and the BNC sampler to identify misuse of demonstratives as anaphora markers on the part of her students, and concludes that native-like use of demonstratives is unlikely to be acquired implicitly by Polish learners, who therefore need specific assistance in this area, particularly in view of the fact that this feature of language is given little emphasis in language programmes and ELT materials. On a broader level Leńko-Szymańska observes that the finer details of many interlanguage problem areas, whether L1 dependent or not, remain unexplored and often not specifically focused upon in class, and that learner corpora must be seen as a vital resource in throwing light upon such details.

The methodology of comparing a learner corpus with a native-speaker corpus is also adopted by Nesselhauf as part of her survey of support verb constructions (e.g., give an answer, have a look, make an arrangement) as used by advanced German-speaking learners of English. The analysis takes its data from a subcorpus of ICLE (the International Corpus of Learner English) containing essays written by native speakers of German. The support verb constructions extracted were then judged for acceptability via consultation not only of the written part of the BNC but also of a number of monolingual English dictionaries, along with native speakers where necessary. The author identifies constructions which would appear to be particularly problematic for German learners, subsequently suggesting ways in which her results could inform teaching strategies. Nesselhauf is wary, however, of drawing potentially glib conclusions from learner corpus studies. Most of these claim to have implications for language teaching, recommending that whatever is discovered to deviate significantly from native-speaker usage should be prioritized in the classroom. Yet by endorsing this view, the author argues, learner corpus researchers expose themselves to the kind of criticism that NS corpus analyses have encountered for some time now, i.e., that they rely exclusively and unimaginatively on
frequency counts in order to reach their conclusions about what learners should be taught. Frequency is a crucial criterion, the author continues, but needs to be refined and elaborated within a more ample framework of associated criteria such as (i) the language variety the learners aim to acquire, (ii) text typology, (iii) the degree of disruption provoked for the recipient by inappropriate usage, and (iv) the frequency of those features of language that learners appear to find particularly useful.

Flowerdew continues the series of papers which offer results stemming from comparisons between corpora of IL and TL. The author gives priority to discoursal aspects, focusing on problem-solution patterns used in technical reports by (i) apprentices and (ii) professionals. Salient lexis present in such patterns is identified and classified within the Hallidayan-based APPRAISAL framework, which is concerned on the Interpersonal level with the way language is used to evaluate and manage positionings. This contribution represents a departure from previous studies in two ways: firstly in its choice of raw materials, since APPRAISAL surveys to date have been applied mainly to media discourse, casual conversation and literature, and secondly within corpus linguistics itself, considering that problem-solution patterns have received scarce attention in corpus-based research by comparison with other areas of text linguistics.

The contribution by Chipere, Malvern and Richards, which concludes the first section of the volume, also discusses a learner corpus, but with some important differences. In the first place the learners are native speakers, i.e., children writing in their first language, and secondly there is a much stronger methodological emphasis. The principal objective of the paper is to highlight sample size problems attendant upon the use of the Type-Token Ratio measure, and in particular to discuss what the authors suggest to be flawed strategies adopted in the literature over the years in order to address such problems. The authors then propose their own solution, based upon modelling the relationship between TTR and token counts, applying this to their corpus of children’s writing. It is subsequently claimed that the procedure adopted not only provides a reliable index of lexical diversity but also demonstrates that lexical diversity develops hand in hand with other linguistic skills.
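To make the sample size problem concrete, here is a minimal Python sketch. It is not the authors' own modelling procedure: it simply generates an artificial Zipf-like text and shows how the raw type-token ratio falls as more tokens are included, which is why raw TTR cannot be compared across samples of different lengths.

```python
# Minimal sketch of why raw TTR falls as sample size grows (not the
# authors' model; the vocabulary and text below are invented).
import re
from random import choices, seed

def ttr(tokens):
    """Raw type-token ratio: distinct word forms / total word forms."""
    return len(set(tokens)) / len(tokens)

seed(1)
# Toy "corpus": words drawn from a small vocabulary with Zipf-like weights.
vocab = [f"word{i}" for i in range(500)]
weights = [1 / (rank + 1) for rank in range(500)]
text = " ".join(choices(vocab, weights=weights, k=5000))

tokens = re.findall(r"[a-z0-9]+", text.lower())
for n in (100, 500, 1000, 5000):
    print(f"{n:>5} tokens -> TTR = {ttr(tokens[:n]):.3f}")
# Longer samples repeat more words, so the raw ratio drops even though the
# underlying vocabulary is unchanged -- the problem Chipere, Malvern and
# Richards address by modelling TTR as a function of token count.
```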
2.2 Corpora FOR learners

The second section is concerned with corpora which are designed to benefit learning by allowing teachers and materials designers to provide better descriptions of the language to be acquired, and hence to decide what learners should learn: corpora FOR learners. This use of corpora was already well established in 1994, following the publication of the Cobuild project’s frequency-based dictionaries and grammars: there seems little point in teaching learners very rare uses, or failing to teach them common ones. The argument has extended itself from general surveys to more specific ones using corpora comprising language from situations in which particular groups of learners are likely to find themselves, such as the university settings considered in the construction of the Michigan Corpus of Academic Spoken English (MICASE) corpus.3 This approach raises questions of corpus construction, with the need for teachers of specific groups to be able to rapidly compile ad-hoc corpora which can be used to assess the linguistic characteristics of particular domains and genres – an ever-easier (but also in a way more complex) task given the massive quantity of electronic texts available on the internet.

The central issue here remains just what language and what texts should be proposed to learners as models. Should learners be expected to imitate native speaker language? As far as English is concerned, it is increasingly argued that they above all need to acquire English as a lingua franca (ELF), and that in consequence, what should be analysed are corpora of speech and writing involving non-native speakers. This argument can be overstated, since most learners are likely to need to understand the speech and writing of native speakers, even if not necessarily to imitate it. But the move towards the study of ELF is an important reminder that language use is recipient-designed, to use the term of conversational analysts, and that it may not always be appropriate to take the language of native speakers for native speaker recipients as a model for learners’ own production: corpora for language learners may not be the same as corpora for linguists. The selection of appropriate corpora will be determined by the teacher’s and material writer’s assessment of learners’ needs and objectives, as a means of deciding what they should learn.

The opening paper, Römer’s comparison of real and ideal language learner input, has links with the paper by Tono which opened Section 1, in that it is concerned with the use of a corpus of EFL textbook texts. Römer justifies her study by pointing out that while considerable attention is being devoted in
corpus studies to learner output (clearly the first section of this book is a testimony to this), relatively little interest has been shown as regards learner input, and in particular the (substantial) input from EFL textbooks. The author has accordingly constructed a “pedagogical corpus” (Hunston 2002:16) of EFL material. The texts selected by Römer are all (written) representations of spoken English in EFL texts. These were compared with the spoken part of the BNC, with particular focus on if-clauses, in order to seek insights into a topical question – of relevance not only to English language teaching but to language instruction in general – i.e., whether the input from foreign language textbooks is a fair reflection of the type of language students are likely to encounter in natural communicative situations.

The section then moves from Römer’s pedagogical corpus to more “classic” types, i.e., target language reference corpora and parallel corpora, although it will be seen that the different types have some common goals. Kettemann and Marko consider the use of TL reference corpora in the classroom, yet their focus is different from other papers in the volume in that they propose the analysis of corpora of literary texts (in particular the complete works of the writers examined). This move reflects the belief that approaching literary texts through corpora is a worthy pedagogical enterprise in many respects, not only in terms of foreign language acquisition but in particular in terms of awareness raising, whether this be language awareness, discourse awareness or methodological awareness. Examining corpora of classic British and American authors, Kettemann and Marko aim to raise the status of the literary corpus from its “subordinate position” in the TaLC sector, a position until now “too low-case to be assigned the capital L in the acronym”.

The paper by Mauranen also describes classroom use of a TL reference corpus, though in this case the data examined are spoken rather than written. The author describes use of the MICASE Corpus within the framework of an experimental postgraduate course in English for Finnish students, using this as a springboard for reflections upon a number of topical issues in the TaLC sector. These include (i) the degree of authenticity of a spoken corpus, which is in a sense twice removed from its original context, (ii) the communicative usefulness of a speech corpus, and (iii) – an issue clearly close to the author’s heart – is it fair that almost all spoken corpora consist of L1 adult data, i.e., is there a place for L2 spoken corpora, particularly as a model for international English? This goes hand in hand with the question of how necessary or relevant a highly idiomatic command of native-like English might be for most users of English
8
D. Stewart, S. Bernardini, G. Aston
as a foreign language. The author ends by stressing that, for most teachers not specialized in corpus use, making the corpus leap can be a daunting task. She therefore calls for both more sensitive training and more user-friendly corpus materials, in order to spread the word more effectively.

The section closes with Frankenberg-Garcia’s paper on possible uses of a parallel corpus in second language learning, thus providing a variation upon the monolingual emphasis that has characterized the volume thus far. The issue is an interesting one because until now, as the author notes, parallel concordancing has been primarily associated with translation activities and lexicography, while it is monolingual concordancing which has prevailed in the language learning domain. Drawing upon examples from COMPARA, a parallel, bi-directional corpus of English and Portuguese, the author seeks to identify (i) which language learning situations might derive benefit from a parallel corpus, and (ii) how the corpus might best be exploited in the language classroom by teachers and learners alike. Such questions are not easily answered – and in any case lie on a different axis by comparison with monolingual corpora in pedagogy – precisely because of the dual nature of the corpus. Parallel concordancing offers contrasts not only between translational and non-translational language, but also between L1 and L2. It goes without saying that earnest reflection is required if such contrasts are to be converted to productive use in language teaching, and in this respect Frankenberg-Garcia furnishes some much-needed insights.
2.3 Corpora WITH learners

The papers in the third section testify to a different perspective, again implying distinctive criteria for corpus selection or construction. Rather than what should be learned, they focus on how learning should take place. Right from the first TaLC conference, there were papers which viewed corpora primarily as tools which learners could use to find out about the language (and the culture behind that language) for themselves, with or without the help of their teachers. The section “Corpora WITH learners” includes discussions of a number of activities designed to help learners use corpora and to acquire linguistic knowledge and skills through their use. Here, the choice of corpora will depend on their appropriacy not as descriptive, but as learning tools.

Sripicharn’s focus is on the processes and strategies adopted by users during concordance-based activities. He conducts an experiment to assess the
performance of a group of Thai students against that of a group of English native speakers, asking them to perform a number of concordance-based tasks. The author underlines the significantly different approaches used by the respective groups, with the Thai students privileging data-driven hypothesis-testing strategies, while the English students paid less attention to the data and relied more on intuitive reactions, though both groups came up with sophisticated observations. Sripicharn, however, warns against the dangers of learners overgeneralising from the kind of restricted data attendant upon a small-scale study such as this.

Pérez-Paredes and Cantos-Gómez also provide an example of how corpora can be used with learners. However in this case it is first and foremost the student, rather than the researcher/teacher, who examines the results. The authors collected samples of oral output from a group of Spanish students, then returning the findings to the group in the form of a spreadsheet complete with data on tokens, types, content words, frequency bands and other aspects of the students’ performance. The members of the group were then invited to compare their own individual production with mean data for the whole group, and consequently to provide an appraisal of their strengths and weaknesses. By confronting students with their own output, the authors aim to encourage learner autonomy through a guided process of self-discovery.

Davies’ survey of classroom use of Spanish reference corpora occupies a shared middle ground between this and the previous section. It qualifies as corpora for learners inasmuch as it involves the use of TL reference corpora, but its chief emphasis is that of corpora with learners, since, like the contribution by Pérez-Paredes and Cantos-Gómez, it focuses upon learners’ ability to assimilate and draw conclusions from the available data. Davies reports the findings of an on-line course entitled “Variations in Spanish Syntax” for graduates in Spanish from different parts of the United States. The participants were trained in the use of a number of reference corpora, including the author’s own 100-million word Corpus del Español, and then assigned tasks regarding complex features of Spanish syntax where they were required to compare the corpus data with specific explanations provided in a well-known reference grammar of Spanish. Davies is especially interested in the role of the learners both as researchers, in locating useful corpus data, and as critics, in evaluating the findings of fellow learners. The survey thus has clear links with Römer’s paper in Section 2, since it takes as its justification and premise the notion that learners need to move beyond the sometimes simplistic usage and
rules provided in foreign language manuals and textbooks.

In brief, it is now recognized that what corpora should be used in the context of language teaching and learning depends on what they are to be used for. A variety of uses implies a variety of corpora. The papers in this volume indicate the richness of the issues raised, and the vitality of the field.
3. Looking out and around: A theory for TaLC, a future for TaLC?

Two papers escape any obvious grouping along the lines just discussed. These papers have been given very prominent positions in the volume: one opens it, and one closes it. The first is an important contribution to modern linguistic thought in the form of Michael Hoey’s discussion of the pervasiveness of priming in language use, and in particular the textual priming of lexis. Hoey argues that all lexical items are primed for grammatical and collocational use, i.e., every time we encounter a word it becomes “loaded with the cumulative effects of those encounters such that it is part of our knowledge of the word that it regularly co-occurs with particular other words or with specific grammatical functions” (p. 21). The author then underlines that priming goes beyond the sentence, i.e., that a lexical item may be primed (i) to appear in particular textual positions with particular textual functions, a phenomenon heavily influenced by text domain and genre, and (ii) to participate in cohesive chains. Because as individuals our exposure to language is unique, i.e., different from everybody else’s, it follows that a word is primed for the individual language user. In other words, priming belongs to the individual and is constantly in flux. Hoey concludes with a discussion of the relevance of his theory of priming for pedagogical issues: firstly, its implications for language learners, i.e., how priming could be tackled within the walls of the classroom, and secondly, what bearing it has on the production of language in terms of routine and creativity.

The second paper, by William Fletcher, is concerned specifically with how to exploit the world wide web to create ad hoc corpora, i.e., how to harness more efficiently and more selectively the prodigious quantities of machine-readable data available on-line, and more generally how to prioritize quality and relevance over quantity. Fletcher argues that there are various obstacles which hamper on-line searches and thus prevent the web from realizing its full
potential. The most persistent drawback, the author claims, is the difficulty in identifying documents which are (i) germane to the user, and (ii) reliable. As a possible remedy for such problems, the author discusses his web concordancer KwiCFinder, which automates and renders more streamlined the process of retrieving relevant documents. However, searches can be time-consuming nonetheless, and with this in mind the author sketches an outline of two rather more visionary proposals. Since the orientation of most existing search engines is towards the general public and in any case driven by commercial requirements, it would be useful for learners and language professionals to have access to (i) a selective web archive and (ii) a specialized search engine, specifically tailored to their needs. With regard to the first, Fletcher states his intention to implement the Web Corpus Archive of web documents, which will collect, disseminate and build upon users’ searches, with each member of the user community benefiting from the efforts of others. As regards the second proposal, Fletcher details his plan to create a Search Engine for Applied Linguists, which would enable sophisticated queries and furnish information such as the frequency and dispersion of a given form across the web pages included in the corpus. Finally, after a review of what is currently available on the market in terms of web concordancers, web corpora, and search engines for applied linguists, the author recommends some useful web search resources for language teaching and learning.

While Hoey sketches a theory of language within which the papers that follow, and pedagogic applications of corpora in general, can be situated, Fletcher gives an exhaustive account of the role the WWW is playing today, and might play in the future of TaLC. The perspectives adopted are very different, yet both are invaluable in providing insights which set the papers that form the core of this volume against the wider pictures of linguistic theory and language technology. In different ways, they suggest that there is indeed a future for TaLC.
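As a small illustration of the kind of “frequency and dispersion” figures such a search engine might report, the following hypothetical Python sketch counts hits for a form across a toy set of pages and reports the proportion of pages containing it. It is in no way a model of Fletcher's proposed Search Engine for Applied Linguists; the URLs and page texts are invented.

```python
# Hypothetical illustration of "frequency and dispersion of a given form
# across the web pages included in the corpus" (not Fletcher's tool).
import re

def frequency_and_dispersion(form, pages):
    """Return total hits of `form` and the share of pages it occurs in."""
    pattern = re.compile(rf"\b{re.escape(form)}\b", re.IGNORECASE)
    hits_per_page = [len(pattern.findall(text)) for text in pages.values()]
    total = sum(hits_per_page)
    dispersion = sum(1 for h in hits_per_page if h > 0) / len(hits_per_page)
    return total, dispersion

# Toy "web corpus": page URL -> retrieved text (invented examples).
pages = {
    "http://example.org/a": "Corpora are collections of texts. Corpora vary in size.",
    "http://example.org/b": "A corpus can be compiled from the web.",
    "http://example.org/c": "Learners can explore concordances drawn from corpora.",
}
total, dispersion = frequency_and_dispersion("corpora", pages)
print(f"'corpora': {total} hits, found in {dispersion:.0%} of pages")
```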
4. Authenticity: A common thread running through TaLC

4.1 What is authentic language?

For over a decade, authenticity has arguably been the one fundamental theoretical and methodological issue which all those with an interest in applying corpora to didactic uses have sooner or later had to confront. Several papers
in this collection tackle this central issue, which was also the main object of a joint keynote session by John Sinclair and Henry Widdowson at TaLC 2002, discussing “Corpora and language teaching tomorrow”. The issue here is whether the language that foreign language learners are exposed to (from example sentences in grammar books or on blackboards to readings, videos etc.) should necessarily be “authentic”. Authenticity in this sense refers to a piece of text being “attested”, having occurred as part of genuine communicative (spoken or written) interactions. According to Hoey (this volume), exposure to authentic data is crucial since “only authentic data can preserve the collocations, colligations, semantic associations of the language” (p. 37). Indeed, it is this belief that motivates more and more teachers to introduce corpora into their classrooms.

Römer (this volume) provides an example of the difference between “authentic” and “made-up” (or, more precisely, “made-up sounding”) examples. She cites an example from her EFL textbook corpus where the following exchange is used to illustrate the present progressive in yes/no questions:

(1) MR SNOW: Hello, Wendy.
    MRS SNOW: Hello, Ron.
    MR SNOW: Where are the girls? Are they packing?
    MRS SNOW: Yes, they are.
    MR SNOW: Or are they playing?
    MRS SNOW: No, they aren’t, Ron. They are packing.
On the other hand, as Römer points out, a search of the BNC spoken component retrieves utterances such as:

(2) What’s happening now, does anybody know?
(3) What are we talking about, what’s the subject?
(4) Are you listening to me?
(5) Are you staying at your mum’s tonight?
    No. I’m staying at Christopher’s.4
Competent speakers of English might consider the corpus examples to be more “natural” than the textbook examples.5 Römer goes on to claim that the corpus backs up this impression, confirming that the two verbs, “packing” and “playing”, are not at all frequent in the pattern “are they VERB-ing”. It would seem,
as claimed by Sinclair (in many places, among these also at TaLC 2002), that “we cannot trust our ability to make up examples”… But corpora are great sources of serendipitous findings, as we all know. So let us stick for a moment with “are they”, and look at a concordance of this string as the first element of a spoken utterance in the BNC. To start with, “are they” does not seem to colligate very often with the progressive. Out of 393 solutions, only 41 are followed by a verb in the progressive form. Of these, 17 are instances of the pattern “are they going to/gonna VERB”, leaving only 24 “good” candidates. An example of these is the following (KDE):

(6) PS0M4 >: Alia and Aden are coming around to play with you this afternoon.
    PS0M5 >: Are they coming now?
    PS0M4 >: In a minute.
This short exchange may appear somewhat more similar to the textbook examples than the corpus examples in 2–5, and as such possibly less natural than the latter. Let us consider another short extract from the same conversation:

(7) PS0M5 >: Who who bought this?
    PS0M4 >: Mummy and daddy bought it.
    PS0M5 >: Where did it came from?
    PS0M4 >: It comes from the Gap.
If we remove the hesitation in line 1 and correct the grammar in line 3, we have a typical textbook example of WH-questions. On the contrary, the following example comes across as more natural:

(8) PS04U >: What’s Ken and Marg having turkey at Christmas or
    PS04Y >: Mm?
    PS04U >: are they having turkey at Christmas or don’t they, don’t you know?
    PS04Y >: I don’t know what there’ll [sic] have, you see Naomi and Mitch are vegetarian ...
And yet both 6–7 and 8 are authentic, attested corpus examples. The exchanges in 6–7, not unexpectedly perhaps, take place between a mother and her son aged 3. The one in 8 between two housewives. Could it be the case, then, that authenticity of language is to be treated not as an absolute feature, but rather as a gradient feature? Or, in other words, could it be the case that some instances of attested language use are more
14
D. Stewart, S. Bernardini, G. Aston
“prototypically” authentic than others? And that in evaluating authenticity we should take into account what words are being spoken/written, as well as to whom they are addressed, for what purpose(s) and so forth?
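The “are they” figures quoted above (393 hits, 41 with a following progressive, 24 once going to/gonna is discarded) amount to a simple filtering exercise over concordance lines. The Python sketch below is only a rough, hypothetical reconstruction of that kind of filtering, with invented utterances standing in for concordance lines; it is not the query actually run on the BNC.

```python
# Rough illustration of the "are they" filtering described above
# (not the actual BNC query; the sample utterances are invented).
import re

PROGRESSIVE = re.compile(r"^are they\s+\w+ing\b", re.IGNORECASE)
GOING_TO = re.compile(r"^are they\s+(going to|gonna)\b", re.IGNORECASE)

utterances = [                      # toy stand-ins for concordance lines
    "Are they packing?",
    "Are they coming now?",
    "Are they going to phone back?",
    "Are they gonna stay for tea?",
    "Are they all right?",
    "Are they the same size?",
]

hits = [u for u in utterances if u.lower().startswith("are they")]
# Count an utterance as progressive-like if a V-ing form or going to/gonna follows.
going_to = [u for u in hits if GOING_TO.search(u)]
progressive = [u for u in hits if PROGRESSIVE.search(u) or GOING_TO.search(u)]
# Discard the "are they going to/gonna VERB" pattern, as in the discussion above.
good = [u for u in progressive if u not in going_to]

print(f"{len(hits)} utterances beginning with 'are they'")
print(f"{len(progressive)} followed by a progressive (including going to/gonna)")
print(f"{len(good)} 'good' candidates after discarding going to/gonna")
```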
4.2 A richer view of authenticity

Mauranen (this volume), for instance, proposes a distinction between “subjective” authenticity (as perceived by learners) and “objective” authenticity (as evaluated by a teacher or researcher). She also acknowledges that at least certain instances of spoken corpus material (e.g. dialogue) may be seen as less authentic than written corpus material. While the latter requires a reader in order to be interactively complete, the former is a record of an interactional event that is complete in itself. The learner can only interact with this type of spoken material as an external observer. And yet, she argues, observing interaction is as important as participating in interaction. By highlighting repeated patterns, spoken corpora offer a more form- and function-oriented approach to interaction than in real-life situations, where observers are more likely to be led to focusing on content and the unfolding situation.

Nesselhauf (this volume) would similarly appear to endorse the richer view of authenticity described above. She suggests that, alongside frequency in native speaker usage, there are a number of other criteria on which recommendations for teaching should be based. Among these, for instance, the “degree of disruption of an unacceptable expression for the recipient”: if a mistake is a likely cause of misunderstandings, for instance, it should be insisted upon more. Similarly, we might add, learners are likely to need sophisticated repair strategies and routines which make up for their language deficiencies. Whether and to what extent these are attested in monolingual reference corpora of the target language is an open question.

The debate on authenticity thus feeds into a more general debate over the most appropriate model of language for learners. Current work on ELF (Mauranen this volume, Seidlhofer 2001) suggests that native speaker corpora of the target language might, by definition, not provide an ideal model, and that a better alternative could be “good international English spoken in academic and professional contexts” (Mauranen this volume). The latter would be contextually more appropriate, recording language spoken in situations in which learners are likely to find themselves. They would provide indications of successful (and unsuccessful) strategies that competent non-native speakers use in
interaction with each other and with native speakers. And they would be fairer to foreign learners and teachers, setting them a more achievable and more coherent target than that of an idealized community of native speakers. ELF corpora are just beginning to see the light: substantial efforts are needed to build them, evaluate their contribution to language teaching, and get through the likely resistance of teachers and learners, who might not like the idea of doing without the useful fiction of the “native speaker” model. But the debate is open.
4.3 A decade-long controversy: What next?

As mentioned above, it is no coincidence that authenticity in language teaching/learning features so prominently in this volume. The discussion was rekindled at TaLC 2002 by the joint plenary on “teaching and language corpora tomorrow” given by John Sinclair and Henry Widdowson, who agreed to discuss their current position with respect to this topic, a decade after two well-known articles first sparked off interest in it (Sinclair 1991, Widdowson 1991). As it happens, their positions turned out to be distinct in theory, and yet far from irreconcilable in practice.6

Building on a “syntagmatic” view of language, Sinclair suggests that at the foundation of language teaching in the future there is likely to be the lexical item, a unique form of expression that goes together with a unique meaning. Like words, lexical items are not regulated by the open-choice rules of grammar. They can undergo modifications (expansion, contraction, (ironic) exploitation etc.) which are regulated by convention, by the idiom principle. Unlike words, however, lexical items are unambiguous. Sinclair has provided several memorable examples of lexical items, e.g. those whose core constituents are the words brook (VB), budge, gamut or naked eye. In the case of gamut, for instance, he suggests (Sinclair 2003) that this lexical item consists of a verb, usually run, followed by a noun group containing an article, usually the, an optional adjective, e.g. whole or synonyms, the node word gamut and a prepositional phrase or another adjective referring to the area over which the phrase ranges. This lexical item, whose simplified base form might thus be RUN the whole gamut of …, has the unified function of referring to a set of events, highlighting its size/complexity and the extensiveness of the coverage achieved.
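Sinclair's description of the gamut item reads almost like a search pattern. The sketch below turns it into a regular expression as a hedged approximation only: the verb and adjective alternatives are illustrative guesses, not Sinclair's inventory, and the example sentences are invented.

```python
# Approximate search pattern for the "gamut" lexical item,
# RUN the (whole) gamut of ...; the verb and adjective lists are
# illustrative assumptions, not Sinclair's own description.
import re

GAMUT = re.compile(
    r"\b(run|runs|ran|running|cover|covers|covered)\s+"  # verb, usually RUN
    r"(the|a|an)\s+"                                     # article, usually "the"
    r"((whole|entire|full|complete)\s+)?"                # optional adjective
    r"gamut\s+of\b",                                     # node word + "of" phrase
    re.IGNORECASE,
)

examples = [
    "Her work runs the whole gamut of human emotion.",
    "The exhibition covered the full gamut of twentieth-century design.",
    "He ran a gamut of tests before giving up.",
    "We discussed the gamut briefly.",   # no RUN-type verb, no "of" phrase: not matched
]
for sentence in examples:
    match = GAMUT.search(sentence)
    print("MATCH  " if match else "no match", sentence)
```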
A syntagmatic model of language centred around the lexical item should make learning easier, safer, and arguably more successful, Sinclair claims, since learners do not have to cope with lexical ambiguity and to worry about lexicogrammatical choices below the level of the lexical item. This change of perspective implies that, if contrived examples could be acceptable in a paradigmatic approach, in a syntagmatic approach they would not, since intuition is notoriously unreliable when it comes to identifying, exemplifying or describing lexical items. Now this is not to imply that any real example is fine, but rather that “to have occurred in communication is a necessary, but not a sufficient condition for [a piece of text to be] presented as a model of language” (Sinclair 2002).

Widdowson’s approach is complementary rather than opposed to Sinclair’s, shifting the perspective, in Widdowson’s words, from LCat to Talc, from language theories and descriptions which have (crucial) implications for teaching, to language theories and descriptions which are subservient to teaching and a means towards learning. He does not deny “the enormous contribution that corpora have made over the years to linguistic descriptions”, but suggests that, especially when time and resources are limited, as in most language courses, decisions have to be made about what to teach based not only on (frequency of) occurrence in the target community, but also on what language is the best investment for learners:

So here the question has to do with what has to be taught to provide an impetus for learning, how do you create the conditions for learning to take place beyond the end of the course … an acceptance that some things are teachable, and some are only learnable, in the sense that we could only point learners in the right direction, developing “vectors of learning”. (Widdowson 2002)
It might turn out that the most frequent lexical items attested in a general corpus of the target language, taught using corpus materials, provide just this impetus, and constitute a valid basis on which to build a language course syllabus. The work of Tim Johns and colleagues on Data-Driven Learning (Johns and King 1991) goes in this direction. But once again, this is an open question that awaits empirical verification.

The importance and value of LCat is nowadays undisputed. The syntagmatic model which owes so much to the work of Sinclair is generally perceived as a more accurate model of language for teaching purposes than the paradigmatic one. And the fact that virtually every new learner dictionary to come out is corpus-based bears eloquent testimony to this.
Evidence in favour of Talc is, on the other hand, still limited (exceptions are Cobb 1997, Gitsaki 1996, Sripicharn this volume). We do not know for sure if learners become better at using the language for their intended purposes when taught within a framework which follows the underlying principles of a syntagmatic model. After five TaLCs, and a decade of discussion, there is still much we have to learn about the effects of our teaching practices on learners, whether corpus use and corpus-inspired materials affect their learning path, and whether they do so in a positive manner. In the words of Vivian Cook (2002:268):

Memorable, interesting, invented sentences may lead to better conscious learning of language and ultimately to better unconscious language use; on the other hand the more neutral the sentence the more its language elements may be absorbed into the students’ competence. […] It may be better to teach people how to draw with idealized squares and triangles than with idiosyncratic human faces. Or it may not. The job of applied linguists is to present evidence to demonstrate the learning basis for their claims […].
Hopefully, the search for this evidence will feature prominently in the TaLC agenda for the next decade.
Notes

1. Brown: http://helmer.aksis.uib.no/icame/brown/bcm.html (visited 17.5.2004)
   LOB: http://helmer.aksis.uib.no/icame/lob/lob-dir.htm (visited 17.5.2004)
   BoE: http://www.cobuild.collins.co.uk/ (visited 17.5.2004)
   BNC: http://www.natcorp.ox.ac.uk/ (visited 17.5.2004)
   ICE: http://www.ucl.ac.uk/english-usage/ice/ (visited 17.5.2004)
2. ICLE: http://www.fltr.ucl.ac.be/fltr/germ/etan/cecl/Cecl-Projects/Icle/icle.htm (visited 17.5.2004)
3. MICASE: http://www.hti.umich.edu/m/micase/ (visited 17.5.2004)
4. The reference in this last example is of course to future time, i.e., it is a “present progressive as future”, as grammar books sometimes have it.
5. We do not intend to go into the thorny question of the difference between genuineness, naturalness and authenticity, but it is clear that contextual issues are key in any such discussion. The textbook example cited might not appear at first sight to be particularly typical, but it would not be too arduous a creative task to imagine a situation in which it might actually be attested (a tense, awkward, in part sarcastic exchange between an estranged husband and wife, where the husband, who has come to pick up the kids and take them on holiday, doubts
his wife’s capacities as a mother). In any case – paradoxically enough – the textbook example is now attested, and in a number of places to boot: in an EFL textbook, in an EFL corpus, and in this book (twice). Is it therefore “more” authentic?
6. References to “Sinclair 2002” and “Widdowson 2002” refer to their (unpublished) talks at TaLC 2002.
References

Cobb, T. 1997. “Is there any measurable learning from hands-on concordancing?”. System 25, 3:301–315.
Cook, V. 2002. “The functions of invented sentences: A reply to Guy Cook”. Applied Linguistics 23, 2:262–269.
Gitsaki, C. 1996. The development of ESL collocational knowledge. PhD thesis, The University of Queensland.
Granger, S. (ed.) 1998. Learner English on Computer. London and New York: Longman.
Hunston, S. 2002. Corpora in Applied Linguistics. Cambridge: Cambridge University Press.
Johns, T. and King, P. (eds) 1991. Classroom Concordancing. ELR Journal 4. Birmingham: University of Birmingham.
Seidlhofer, B. 2001. “Closing a conceptual gap: The case for a description of English as a lingua franca”. International Journal of Applied Linguistics 11, 2:133–158.
Sinclair, J.McH. 1991. “Shared knowledge” in Georgetown University Round Table on Languages and Linguistics 1991, J. Alatis (ed.). Washington, D.C.: Georgetown University Press, 489–500.
Sinclair, J.McH. 2003. Reading Concordances. Harlow: Longman.
Widdowson, H.G. 1991. “The description and prescription of language” in Georgetown University Round Table on Languages and Linguistics 1991, J. Alatis (ed.). Washington, D.C.: Georgetown University Press, 11–24.
A theory for TaLC?
The textual priming of lexis
Michael Hoey
University of Liverpool, UK
This paper sketches a theory of language that gives lexis and lexical priming a central role. All lexical items are primed for grammatical and collocational use, i.e., every time we encounter a lexical item it becomes loaded with the cumulative effects of those encounters, such that it is part of our knowledge of the word that it regularly co-occurs with particular other words or with specific grammatical functions. Priming also goes beyond the sentence, i.e., a lexical item may be primed (i) to appear in particular textual positions with particular textual functions, a phenomenon heavily influenced by text domain and genre, and (ii) to participate in cohesive chains. Because as individuals our exposure to language is unique, i.e., different from everybody else’s, it follows that a word is primed for the individual language user. In other words, priming belongs to the individual and is constantly in flux. The theory of priming clearly has relevance for pedagogical issues, with important implications both for language learning, i.e., the way priming is to be tackled within the walls of the classroom, and for language production, in terms of routine and creativity.
Sinclair (e.g., 1991) has argued that the study of lexis leads to results incompatible with the descriptions provided by conventional grammars. Biber et al. (1999) have argued that lexical bundles characterize text types. Farr and McCarthy (2002) argue that the function of conditionals is specific to particular types of interaction. Morley and Partington (2002) argue that syntax is an epiphenomenon of lexis. All this suggests we need a new theory. In this paper I want to put forward a theory of language that places lexis at its very centre and gives to vocabulary the pivotal status once awarded to syntax. What I have to say is only a beginning – a mixture of the self-evident and the unproven. Much of what I am going to say will seem obvious but I want to build from a shared position to positions that may seem novel or wrong, though I shall defend those positions fiercely. The theory I shall briefly outline here has links with Brazil’s work on the grammar of speech (Brazil 1995), with
Construction grammar (e.g. Goldberg 1995) and with Pattern Grammar (Hunston and Francis 2000). It assumes the correctness of Halliday’s interpersonal and ideational metafunctions (Halliday 1967–8) but rejects, and attempts to supersede, his account of the textual metafunction (while retaining the insights that Halliday’s model provides).

The classical theory of the word is epitomized by those two central nineteenth and early twentieth century compendia of lexical scholarship – Roget’s Thesaurus (1852) and the Oxford English Dictionary (Murray et al. 1884–1928). According to such texts, lexis can be described in terms of hyponymy and co-hyponymy, near synonymy and antonymy and has meaning(s) which can be defined using the lexical relations just mentioned. Every word, furthermore, belongs to one or more grammatical categories and has pronunciation, etymology, and history. According to the theory that underpins these positions, words interact with phonology through pronunciation, with syntax through their grammatical categories and with semantics through their senses; they find their place in diachronic linguistics through etymology. In such a theory, the lexical item is reactive to other systems, particularly those of grammar and phonology, and in some versions of the classical theory, the relationship between the word and the other systems has been so weak that grammar has been generated first and the words brought in as the last stage in the process (Chomsky 1957, 1965) or that the semantics have been generated first and the words seen as merely expressing the pre-existent meaning. Systemic-functional linguistics has an altogether more central place for lexis, but even in this model the systems can sometimes make it seem as if lexical choice is the last (because most delicate) choice to be made. Even where theory starts from the assumption that lexis is chosen first, or at least much earlier, the assumption is still that it passes through a grammatical filter which organizes and disciplines it.

I referred above to those great 19th century works of scholarship – the OED and Roget’s Thesaurus. It is interesting that these works have outlived almost all the theoretical work (apart from that of Saussure’s) from the same period. In the same way I am convinced that, when linguists look back at the 20th century, it will not be the grammatical theories that will be admired as permanent works of the highest scholarship but the corpus-backed advanced learners’ dictionaries, starting with Collins COBUILD and continuing with Oxford, Longman and Macmillan, and it is these works, and of course in particular the first Collins COBUILD Dictionary, that have shown the traditional
view of lexis outlined in previous paragraphs to be suspect. In particular what these dictionaries and accompanying corpus-linguistic work have established beyond doubt is the centrality and importance of collocation in any description of lexis. Collocation no longer needs support but it demands explanation. The only explanation that makes sense of its ubiquity, and indeed its existence, is psychological in nature.

Every lexical item, I want to argue, is primed for collocational use. By primed I mean that as a word is acquired through encounters with it in speech and writing, it is loaded with the cumulative effects of those encounters such that it is part of our knowledge of the word (along with its senses, its pronunciation and its relationship to other words in the same semantic set) that it regularly co-occurs with particular other words. As Sripicharn (2002 and this volume) put it in a paper at the conference to which this volume owes its existence, “years of listening to people speaking make me know which words sound right together.” Each use we make of the word reinforces the priming (unless our use runs counter to the priming it has received), as does each new encounter with the word in the company of the same co-occurring other words. Each encounter that we have with the word that does not reinforce the original priming either weakens that priming slightly or complicates it. A word may, and routinely does, accumulate a range of primings which are weighted in our minds in a variety of ways that take account of relative frequency, mode, genre and domain. Part of our knowledge of a word is that it is used in certain kinds of combination in certain kinds of text. So I hypothesize (supported by small quantities of data) that in gardening texts during the winter and during the winter months are the appropriate collocations, but in newspaper texts or travel writing in winter and in the winter are more appropriate; the phrase that winter is associated with narratives.

It follows from the processes involved in collocational priming that it is not in principle a permanent feature of the word. As new encounters alter the weighting of the primings, so they shift in the course of an individual’s lifetime, and as they do so (and because they do so) words shift imperceptibly in their meaning or their use. I suspect that for many older linguists such a shift has occurred in the priming of the word collocation itself! Its collocations, post-Halliday and Hasan (1976) and pre-Sinclair (1991), were, I suspect, predominantly with the words text and sentence, rather than with corpus and word. So collocational priming is context specific and subject to change. It is also, importantly, a matter of weighting rather than requirement. So the relatively rare phrase through winter is English as well as in winter. Priming belongs to
the individual. A word is primed for a particular language user. A corpus cannot demonstrate the existence or otherwise of a priming for any individual. It can only show that a particular combination is likely to be primed for anyone exposed to data of the kind represented in the corpus in question.

If we accept these positions, as I think we must if we are to account for the existence and prevalence of collocation, we open the way to a more general recognition of the notion of priming. To begin with, the grammatical category a word belongs to can be seen as its grammatical priming. Instead of saying “This word is a noun” or “This word is an adjective” I would argue we should say “This word is primed for use as a noun”. In other words, the word is loaded with the grammatical effects of our encounters with it in the same way as it is loaded with collocational effects. If the encounters all point the same way, we assume 100% identification of the word with a particular grammatical category; this happens occasionally with collocation also (e.g. kith with kin). Nevertheless such total identification is not as common as we might imagine (Hoey 2003). Words such as estimated (V, adj), teaching (V, N, adj), human (N, adj), and real (as in real nice, get real, real world, the real and the unreal) are the norm. How, for example, might one categorize red in a red sunset, the colour red, he went red or he saw red? If we agree that words are primed for grammatical category, the question must be regarded as inappropriate.

As with collocational priming, grammatical priming can change through an individual’s lifetime. Anyone British and over 50 is likely to have had the word program shift in its priming from noun to verb in writing. (With the alternative spelling programme or from an American perspective the priming shift will have been different.) As with collocation, grammatical priming is context specific. In the conversation of homophobes, queer is primed as adjective and noun, but in the writings of cinema theorists, queer is primed as adjective only (queer cinema, queer theory). This means that the priming must be tagged for domain, purpose and genre. Again, more controversially, the priming is a matter of weighting not requirement. Margaret Berry once wittily said that you can verb any noun. Strictly this is not true – it does not apply to nouns derived from verbs in the first place – but her observation encapsulates a real fluidity in the language. So routinely do we adjective our nouns that we see it as entirely normal and label the use as a noun modifier or classifier, rather than admitting the protean nature of language; the Oxford Dictionary of Collocations however treats such usage as adjectival. (After all, in a red sunset, we would traditionally treat the
word red as an adjective, despite its nominal use in the colour red.) This grammatical priming does not necessarily assume the prior existence of any grammatical category. Sinclair's masterly analysis (1991) of of as belonging to a grammatical category with just one member in it is a warning about the assumption that grammatical categories are givens in the language. It could indeed be argued that what we call grammatical categories may be post hoc generalisations derived from the myriad individual instances of lexical priming that we encounter and take on board in the course of our language development. One last point needs to be made about both collocation and grammatical categories, and it is a point that equally applies to those categories of priming I have yet to explicate. This is that primings nest. Thus wing collocates with west, west wing collocates with the, and the west wing collocates with in. Similarly, face is in the first place primed for use as a noun or verb. Put into the phrase in your face, however, face loses the verbal priming. Once very is added, the latent ambiguity of in your face disappears, and so does the nominal priming. In its place the phrase in your face (in very in your face) is primed for adjectival use. If we accept, at least for the sake of argument, that words are collocationally and grammatically primed, in other words if we accept that the learning of a word involves learning what it tends to occur with and what grammar it tends to have, it opens the door to the possibility of other kinds of priming. The first of these is semantic association, which in earlier papers (e.g. 1997a, 1997b) I referred to as semantic prosody (following or rather mis-following Louw 1993, and Stubbs 1995, 1996), and which Sinclair (1996, 1999) refers to as semantic preference. I would use Sinclair's term if it were not for the fact that I want to avoid building the term "preference" into one of the types of priming, since one of the central features of priming is that it leads to a psychological preference on the part of the language user. Also, the use of "association" is designed to pick up on the familiar "company a word keeps" metaphor used to describe collocation. The change of term does not represent a difference of opinion. Semantic association is defined as occurring when a word is associated for a language user with a semantic set or class, some members of which are also collocates for that user. The existence of the collocates in part explains the existence, and in part is explained by, the semantic set or class in question. As an example of semantic association, consider the verb lemma train, analysed in considerable detail by Campanelli and Channell (1994) and cited by Stubbs (1996). Train (in my corpus) collocates with as a and the resultant
combination of words has a semantic association with the notion "skilled role or occupation". The corpus has 292 instances of train* as a, of which 262 were followed by an occupation or related role. The data included the following (numbers of occurrences are given in brackets).
train* as a teacher (25)
train* as a doctor (12)
train* as a nurse (11)
train* as a lawyer (11)
train* as a painter (8)
train* as a dancer (7)
train* as a barrister (5)
train* as a chef (5)
train* as a social worker (5)
train* as a solicitor (5)
train* as a braille shorthand typist (1)
train* as a concentration camp guard (1)
train* as a kamikaze pilot (1)
train* as a boxing second (1)
train* as a cobbler (1)
train* as a train waiter (1)
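Counts of this kind are straightforward to reproduce on any corpus one happens to have. The sketch below is not the procedure behind the figures above (those come from Hoey's own Guardian data); it is a minimal, assumed illustration of how the continuations of train* as a might be pulled out of a plain-text file and tallied, with the file name and the one-word window being assumptions of the sketch.

```python
import re
from collections import Counter

# Match train/trains/trained/training followed by "as a(n)" and capture the
# next word. Only the first word of the continuation is kept, so multi-word
# roles such as "social worker" would need a wider window or manual checking.
pattern = re.compile(r"\btrain(?:s|ed|ing)?\s+as\s+an?\s+(\w+)", re.IGNORECASE)

with open("corpus.txt", encoding="utf-8") as f:   # assumed corpus file
    text = f.read()

continuations = Counter(m.group(1).lower() for m in pattern.finditer(text))

# The most frequent continuations are candidate members of the
# "skilled role or occupation" semantic set.
for word, freq in continuations.most_common(20):
    print(f"train* as a {word}\t{freq}")
```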
The combination train* as a has some clear collocates (teacher, nurse, doctor, lawyer, painter, dancer, barrister, chef, social worker and solicitor), as the frequency figures suggest. But it is hard to imagine evidence ever being available to support the idea that braille shorthand typist is a collocation of train as a – except, importantly, in a specialist corpus of, say, minutes of the Royal Society for the Blind. But its occurrence is still accounted for because of the generalization inherent in the notion of "semantic association". Many semantic associations such as the one just given seem to be grammatically restricted. Although there are plenty of instances of train with "skilled role or occupation" in other combinations in my data, particularly as teacher training, the relationship is in part constructed through the structure given in the column of data above. This suggests that for some lexical items there might be restrictions that are not simultaneously instances of semantic association. These can be covered under another type of priming – colligational priming. The term "colligation" was coined by Firth, who saw it as running parallel to collocation. He introduced it as follows:
The statement of meaning at the grammatical level is in terms of word and sentence classes or of similar categories and of the inter-relation of those categories in colligation. Grammatical relations should not be regarded as relations between words as such – between ‘watched’ and ‘him’ in ‘I watched him’ – but between a personal pronoun, first person singular nominative, the past tense of a transitive verb and the third person singular in the oblique or objective form. Firth (1957:13)
As put, it is difficult to distinguish his notion from that of traditional grammar. Interestingly, though, Firth's student, M.A.K. Halliday, used colligation in an apparently different way, and it is to be assumed that his use followed Firth's intention. This is how Halliday introduces colligation:
The sentence that is set up must be (as a category) larger than the piece, since certain forms which are final to the piece are not final to the sentence. Of the relation between the two we may say so far that: 1, a piece ending in liau or j¦e will normally be final in the sentence; 2, a piece ending in s¦ i2, ηa, heu or sanhηgeu2 will normally be non-final in a sentence; 3, a piece ending in lai or kiu may be either final or non-final in a sentence. Halliday (1959:46; cited by Langendoen 1968, as an example of Halliday's use of colligation)
Halliday here uses colligation to mean the relation holding between a word and a grammatical pattern, and this is how the term is currently used. For several decades it disappeared from sight with only the most occasional of references, and returned into use in papers from Sinclair (1996, 1999) and myself (1997a, 1997b). (We were not aware of each other's work, but, as we were colleagues for many years, it is more than possible that I picked up the notion from conversations with him without realising that I had done so. In any case, the earlier of the papers in which he discusses colligation predates mine by a year, so the credit for resurrecting this valuable concept must rest with him.) One point to note about Halliday's formulation is that he formulates the colligational relationship in terms of sentential position. Thus colligation covers not only grammatical relations as conventionally understood but also such matters as Theme/Rheme position – and, I shall later argue, textual positioning too. If one considers the conventional grammatical statements one might make about the first two words of a clause such as
(1) The cat sat on the mat [fabricated, as if you did not know]
they include the following:
a cat is head of the nominal group in which it appears
b The cat is Subject
c The cat is Theme of the sentence.
In other words, we are capable of talking about a word's place in its group, the function that the group plays in the clause and the textual implications of its position. It should be no surprise therefore that colligations can take any of these forms. I define colligation as
a the grammatical company a word keeps (or avoids keeping) either within its own group or at a higher rank.
b the grammatical functions that the word's group prefers (or avoids).
c the place in a sequence that a word prefers (or avoids).
My claim is that every word is primed to occur in certain grammatical contexts with certain grammatical functions and in certain textual positions, and this priming is as fundamental as its priming for collocation or semantic association. I see connections between colligation as I am here describing it and the notion of "emergent grammar" referred to by Farr and McCarthy (2002). There are also clear parallels between the position here formulated and Hunston and Francis's pattern grammar (2000). As an instance of the first type of colligation, the grammatical company a word keeps (or avoids keeping) either within its own group or at a higher rank, consider the word tea, which characteristically is strongly primed to occur as premodification to another noun, e.g.
tea chest
tea pot
tea bag
tea urn
tea break
tea party
It is also typically primed to occur as part of a postmodifying prepositional phrase, usually with of, e.g.
time for tea
a cup of tea
a pot of tea
a packet of tea
her glass of tea
nine blends of tea
Even the Guardian newspaper with which I work provides evidence of this despite the low occurrence of such items as tea pot and tea set (presumably because the mechanics of tea-making are rarely newsworthy). In my data tea occurs as premodification over a quarter of the time (29%) and as part of a postmodifying prepositional phrase just under 19% of the time. When it does not occur as premodification or as part of a prepositional phrase, it is often coordinated or part of a list, this accounting for almost 20% of such cases:
green tea and melon
tea and coffee
tea and sandwiches
tea and refreshments
tea and digestives
tea and toast
tea and scones
tea and sympathy
tea and salvation
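Proportions of this kind can be roughly approximated from untagged text, although a real analysis needs part-of-speech tagging and manual checking. The following sketch is only a crude surface approximation under assumptions of my own: the corpus file name is invented, premodification is detected by a hand-picked list of head nouns, and of-phrases and coordination are read off the immediately neighbouring word.

```python
import re
from collections import Counter

with open("guardian_sample.txt", encoding="utf-8") as f:   # assumed file name
    tokens = re.findall(r"[a-z']+", f.read().lower())

HEAD_NOUNS = {"pot", "bag", "chest", "urn", "break", "party", "cup", "room", "set"}
profile = Counter()

for i, tok in enumerate(tokens):
    if tok != "tea":
        continue
    before = tokens[i - 1] if i > 0 else ""
    after = tokens[i + 1] if i + 1 < len(tokens) else ""
    if after in HEAD_NOUNS:
        profile["premodifier of a noun"] += 1   # tea chest, tea break, ...
    elif before == "of":
        profile["in an of-phrase"] += 1         # a cup of tea, a pot of tea, ...
    elif before == "and" or after == "and":
        profile["coordinated"] += 1             # tea and coffee, tea and toast, ...

total = sum(1 for tok in tokens if tok == "tea") or 1
for label, count in profile.items():
    print(f"{label:24s} {count:5d}  ({count / total:.1%} of {total} instances)")
```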
It will be noted above that colligations can be negative as well as positive, and one of tea's most obvious colligations is negative: it is typically primed to avoid co-occurrence with markers of indefiniteness (a, another, etc.). Just as with other primings, this is a tendency, not an absolute. There are 52 instances of a tea or a …tea in my data, just over 1% of instances, and one instance with another. Examples are:
(2) a lemonade Snapple and a tea, milk no sugar
(3) a tea made from the blossoms and leaves
(4) there was never a tea or a bun at Downing Street
(5) a Ceylon tea with a fine citrus flavour
(6) to enjoy a cream tea or a double brandy
(7) Oh I'll have a tea, two sugars, thank you very much for asking
(8) Another tea and I start dealing with the day's twaddle
Notice that four of the above examples are not interpretable as "a type of tea". So tea can occur with indefinite markers – it is not a matter of grammatical impossibility, nor a matter of a specific type of usage – but typically it is primed to avoid them. It is worth noting, too, that this aversion to indefinite markers is not the result of its being a drink. In my data there are 390 occurrences of the word Coke, referring to the cola rather than the drug or the fuel. Of the 314 instances which refer to the drink, as opposed to the company that markets the drink, 10% occur with a. (I have no instances with another, though there are three occurrences along the lines of a rum and coke). All of the above illustrate the first type of colligation. As an instance of the second kind of colligation, consider the following data (Table 1), where the clausal distribution of consequence is compared with that of four other abstract nouns. It will be seen that there is a clear negative colligation between consequence and the grammatical function of Object. The other nouns occur as part of Object between a sixth and a third of the time. Consequence on the other hand occurs within Object in less than one in twenty cases. To compensate, there is a positive colligation between consequence and the Complement function. Only one of the other nouns – question – comes close to the frequency found for consequence. The others occur within Complement four times less often than consequence. There is also a positive colligation between consequence and the function of Adjunct, consequence occurring here nearly half the time.
Table 1. Distribution of consequence across the four main clause functions in comparison with that of other abstract nouns
              Part of      Part of      Part of         Part of      Other
              subject      object       complement      adjunct
Consequence   24% (383)    4% (62)      24% (395)       43% (701)    5% (74)
Question      26% (79)     27% (82)     20% (60)        22% (66)     4% (13)
Preference    21% (63)     38% (113)    7% (21)         30% (90)     4% (13)
Aversion      23% (47)     38% (77)     8% (16)         22% (45)     8% (17)
Use           22% (67)     34% (103)    6% (17)         36% (107)    2% (6)
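Whether a skewing of this kind is larger than chance variation can be checked with a standard contingency-table test. The sketch below, which uses the raw counts from Table 1, is an assumption of this edition rather than part of Hoey's analysis; the choice of scipy's chi-square test is simply one reasonable option.

```python
from scipy.stats import chi2_contingency

# Raw counts from Table 1: columns are Subject, Object, Complement, Adjunct, Other.
counts = {
    "consequence": [383, 62, 395, 701, 74],
    "question":    [79, 82, 60, 66, 13],
    "preference":  [63, 113, 21, 90, 13],
    "aversion":    [47, 77, 16, 45, 17],
    "use":         [67, 103, 17, 107, 6],
}

chi2, p, dof, expected = chi2_contingency(list(counts.values()))
print(f"chi-square = {chi2:.1f}, df = {dof}, p = {p:.2g}")

# Comparing observed with expected cell counts shows where the skew lies:
# consequence occurs within Object far less often than expected.
print(f"consequence as part of Object: observed {counts['consequence'][1]}, "
      f"expected {expected[0][1]:.0f}")
```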
The other nouns in our sample occur between around a quarter and a third of the time. I would conclude that consequence is characteristically positively primed for Complement and Adjunct functions and negatively primed for Object function. (Interestingly, this is not true for the plural form consequences, which routinely occurs as part of Object, supporting the argument of Sinclair and Renouf (1988) and Stubbs (1996) against too ready an adoption of the lemma as the locus of analysis.) Consequence also illustrates the third type of colligation, in that 49% of instances in my data occur as part of Theme. Given that one would expect, on the basis of random distribution, that around 33% of instances would occur in Theme, this suggests that consequence is typically primed for this textual position. The position we have reached is that lexis is primed for each language user, either at the word or phrase level, for collocations, grammatical categories, semantic associations and colligations. I do not however believe that there is any necessity to assume that priming stops at the sentence boundary. After all, the third kind of colligation, concerned with textual positioning, is an overt claim that priming has a textual dimension, in that choice of Theme is in part affected by the textual surround and therefore we are primed to use consequence to encapsulate the previous text, whether as Adjunct or Subject. We can take this point considerably further. In Hoey (forthcoming) I argue that words may be primed to appear in (or avoid) paragraph initial position. So consequences, for example, is primed to begin paragraphs, but consequence is primed to avoid paragraph-initial position. I secretly hope that you respond to this information with the feeling that I am spelling out the obvious, in that it is obvious that we might start a paragraph with mention of a multiplicity of consequences and then spend the rest of the paragraph itemising and elaborating on those consequences and equally it is obvious that if there is only a single consequence it will be tied closely to whatever was the cause. If you were to so react, then that would be evidence that it is part of your knowledge of the words consequence and consequences that they behave in these textual ways, in short, that you were primed to use them in particular textual positions. It may be objected that consequence and consequences are exceptional in that they have long been recognized to have special text organising functions (e.g. Winter 1977). But the evidence suggests that textual colligation is not limited to a special class of words, nor is priming for positioning only operative at sentence and paragraph boundaries or in the written word only. As an
example of spoken priming, the has an aversion to appearing at the beginning of conversational turns (McCarthy, personal communication). As an example of textual priming at a level higher than the paragraph, take sixty which, when the group in which it appears is sentence-initial, is positively primed for text-initial position. In my newspaper data, 14% of thematized instances of sixty are text-initial. Given that the average length of texts beginning with sixty is 20 sentences (so that only around 5% of thematized instances would be text-initial if position were random), this means that sixty begins a text three times more often than would occur on the basis of random distribution. Sixty begins newspaper texts for a variety of reasons, all of which are specific to the goal of newspaper production. In the first place, sixty is a majority in terms of percentage and therefore potentially newsworthy; a number of such texts begin Sixty per cent of... Newspapers are conscious of time and their place in time; a number of articles begin Sixty years ago... If an event affects sixty people, it may be a significant event; a number of articles begin with phrases such as Sixty spectators... Examples are the following:
(9) Sixty per cent of adults support the automatic removal of organs for transplant from those killed in accidents unless the donor has registered an objection, according to a survey published yesterday.
(10) Sixty years ago Florida was the holiday home of the super-rich and the flamboyant.
(11) Sixty baffled teachers from 24 countries yesterday began learning how to speak Geordie as part of a three-week course run by the British Council on the banks of the Tyne.
The explanations I have given for sentences such as these, which are after all not lexical but discoursal in nature, might be thought at first sight to challenge the notion of priming, in that the choice of sixty would appear to be the product of external factors. This would however be to misunderstand the relationship being posited between lexical choice and discoursal purpose. In the first place, the text-initial priming of sixty does not extend to 60 (nor do many of its other primings – there is no association of 60 with vagueness, for example). So the choice of sixty over 60 is made simultaneously with one of the discoursal choices described above. Secondly, there is no externally driven obligation on a writer to place the phrase of which sixty is a part in sentence-initial position. In theory (rather than practice), news articles and stories could begin:
(9a) A clear majority of adults support the automatic removal of organs for transplant from those killed in accidents unless the donor has registered an objection, according to a survey published yesterday.
(10a) Florida was, six decades ago, the holiday home of the super-rich and the flamboyant.
(11a) Five dozen baffled teachers from 24 countries yesterday began learning how to speak Geordie as part of a three-week course run by the British Council on the banks of the Tyne.
Thirdly, and most importantly, I would argue that the text-initial priming of sixty for journalists and Guardian readers is the result of their having encountered numerous previous examples of sixty in this position. Consequently, Guardian readers do not expect, and journalists do not provide, articles that focus on the views of twenty-two per cent of a sample of interviewees, despite the fact that these views may be original or thought-provoking; readers and journalists do not concern themselves with what happened twenty-two years ago, even though time divisions are arbitrary as a way of talking about changes in the world and what happened twenty-two years ago is, from some perspectives, as interesting as what happened sixty years ago. The possible effects of primings on the way we view the world are perhaps matters for critical discourse analysts to consider. Even more than was the case with collocation, grammatical category, semantic association and colligation (of the non-textual kind), claims about textual colligation have to be domain- and genre-specific. The claim just made for sixty is palpably false for academic articles, for example; on the other hand, I would speculate that for the latter genre the word recent might be positively primed for text-initial position – Recent research has shown..., Recent papers... etc. For some purposes – lexicography, dictionaries of collocations, thesauri, comparable corpora – huge corpora representing a wide range of linguistic genres and styles are extremely useful. Resources like the Bank of English and the British National Corpus have huge value. But I believe, and what I am saying here and in the remainder of this paper provides grounds for believing, that homogenized corpora iron out and render invisible important generalisations – truths even – about the language they sample. For the purposes of identifying primings, specialized corpora are likely to be more productive. Gledhill (2000) shows that no corpus is too specialized: a mini-corpus of the introductions
to cancer research papers revealed distinct differences from that of results sections. Before we leave textual colligation, it is worth noting that it is sometimes the case that one priming only becomes operative when another is overridden. An instance of this phenomenon is the combination of in and an abstract noun, which has a strong negative priming for sentence-initial position. Once, though, the negative priming is overridden, a strong positive priming for paragraph-initial position becomes operative. Textual position is not the only supra-sentential feature for which lexis appears to be primed. I want to argue that lexical items are also primed for cohesion. Certain words (e.g., Blair, planet, gay, and genetic) tend to appear as part of readily cohesive chains, whereas others (e.g., elusive or wobble) form single ties at best, and rarely if ever participate in extensive chains. In order to test this claim, I took a text (The Invisible Influence of Planet X) that I had previously analysed with respect to its cohesion (Hoey 1995) and identified the four lexical items that contribute most to the cohesion of the text. I then selected four items that appear only once in the text. For each of these eight items, I examined 50 lines of a concordance, moving in each case into the text from which the line was drawn and analysing the text in terms of the cohesiveness of the item under investigation. The results of this investigation are presented in Table 2.
Table 2. Cohesive tendencies of eight lexical items from The Invisible Influence of Planet X

            Frequency in     Instances participating in       Occurrences in single cohesive
            original text    cohesive chains across 50 texts  links not forming chains across 50 texts
planet      23               36%                              13%
Uranus      11               68%                              6%
Pluto       10               84%                              3%
planets     8                66%                              8%
week        1                32%                              12%
wobble      1                10%                              6%
wide        1                0%                               8%
weakest     1                0%                               2%
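Hoey's cohesion analysis was carried out by hand, text by text, and there is no simple algorithm that reproduces it. As a very rough proxy, though, one can at least count how often an item recurs in the texts in which it appears; the sketch below does only that, under the assumed convention that three or more occurrences count as participation in a chain and exactly two as a single link, which ignores the pro-forms, co-reference and hyponymy the manual analysis takes into account.

```python
from collections import Counter

def repetition_profile(word: str, texts: list[str]) -> dict[str, float]:
    """Crude proxy for cohesive behaviour: for each text containing `word`,
    classify it by how often the word recurs there (repetition only)."""
    outcomes = Counter()
    for text in texts:
        n = text.lower().split().count(word.lower())
        if n >= 3:
            outcomes["chain"] += 1         # repeated enough to form a chain
        elif n == 2:
            outcomes["single link"] += 1   # one repetition only
        elif n == 1:
            outcomes["no cohesion"] += 1
    total = sum(outcomes.values()) or 1
    return {label: count / total for label, count in outcomes.items()}

# texts = [...]   # e.g. the source texts behind the 50 concordance lines
# print(repetition_profile("planet", texts))
```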
It will be seen that there is a close correlation between the cohesiveness (or otherwise) of the items in the Planet X text and their cohesiveness (or otherwise) across a range of texts. All four of the items forming strong cohesive chains in The Invisible Influence of Planet X participate strongly in cohesive chains in other texts, such that between 36% and 84% of instances in the concordance were participating in such chains. Three of the four words that were not cohesive in the Planet X text also never or rarely participated in cohesive chains. The exception is of course week, which is only slightly less cohesive than planet in the corpus; it is of course predictable from the statistics for the four highly cohesive items that their priming for cohesion is on occasion overridden and this appears to be the case with week also. Obviously the more cohesive an item is, the fewer the texts represented in the corpus (because an item that is repeated twenty times in a single text will generate twenty concordance lines), so we cannot simply read the results off the table, without further investigation, but the correlation is strong for all that. I hypothesize that when we read or listen we bring our knowledge of cohesive priming to bear and attend to those items that are most likely to participate in the creation of the texture of the text. Furthermore, it is part of our knowledge of every lexical item that we know what type of cohesion is likely to be associated with it. So Blair, for example, tends to attract pro-forms – he, his, him etc. – and co-referents – the Prime Minister, the Labour Party leader, while planet tends to attract hyponyms – Mars, Venus, Pluto – and gay favours simple lexical repetition. Grosz and Sidner (1986) and Emmott (1989, 1997) argue that cohesion is better treated as prospective rather than retrospective; the position presented here is in accordance with their view, in that encountering a named person such as Tony Blair in a discourse immediately creates in the reader/listener an expectation that the pronoun he and the co-referent the Prime Minister will follow (as well as simple repetitions of the name); Yule (1981) discusses the conditions under which one rather than the other might be chosen, and Sinclair (1993) discusses the mechanisms of prospection. In the terms presented here, we could say that Tony Blair is characteristically primed to create cohesive chains making use of some or all of pronouns, co-reference and simple lexical repetition (Hoey 1991). The claim that some items are characteristically primed to be cohesive and others are characteristically primed to avoid participation in cohesion is supported by Morley and Partington's finding (2002) that the phrase at the heart
of is non-cohesive. They found 29 instances of the phrase, each one from a different text. Again, as so often in this paper, the observation seems obvious, but it is the obviousness of the observation that most supports my case. With the first kind of textual priming, we associated certain lexical items with certain textual positions (e.g. beginning of the sentence, beginning of the speaking turn, beginning of the paragraph). This was seen as a textual extension of colligation. The kind of textual priming we have just been examining – the cohesive priming of lexis – could likewise be seen as a textual extension of collocation, in that the characteristic cohesion of a word could be seen as an extension of "the company a word keeps". Analogy suggests that there should be a third kind of textual priming of lexis, associated with semantic association, and preliminary investigation supports the suggestion. In addition to being primed for textual position and cohesion, lexical items are, I argue, primed for textual relations. What I mean by this is that the semantic relations that organize the texts we encounter are anticipated in the lexis that comprises these texts. So, for example, ago is typically primed to occur in contrast relations, occurring in such relations in my data 55% of the time, and discovered occurs with (or in) temporal clauses 86% of the time. The word hunt is associated with a shift within a Problem-Solution pattern (Winter 1977; Hoey 1983, 2001; Jordan 1984) or a Gap in Knowledge-Filling pattern (Hoey 2001) 60% of the time; it is also associated with a move from past to present in 67% of cases. I hypothesize that this aspect of textual priming accounts for the average reader/listener's enormous competence at following and making sense of text in very little time. Earlier work on textual signals (Winter 1977, Hoey 1979) only scratches the surface of the signalling that the average text supplies, if this hypothesis proves to be correct; it is possible that evoked and provoked appraisal (Martin 2000; on appraisal see also Flowerdew, this volume) and its textual reflex (Hoey 2001) are also accounted for by this feature of priming. It will be noticed that I talk of items being "typically" or "characteristically" primed. This is of course partly because priming belongs to the individual, not to the language, and so no blanket claim can be made about any word. It is also because, as noted earlier, all claims about priming are domain and genre specific. A claim that a particular lexical item is primed to occur text-initially or form cohesive relations is only valid for a particular narrowly-defined situation. Since my corpus overwhelmingly comprises Guardian newspaper texts, the claims made above about lexical priming are true of those (kinds of) data, but carry no weight, until verified, in any other situations. While Biber et al.'s notion of
the "bundle" may be over-simplified, he and his colleagues are certainly right in saying that they occur in, and are true of, text types. I want to claim that the types of textual colligation I have been describing occur in all kinds of texts but the actualisation of these colligations varies from text type to text type. If we accept the notion that lexical items are primed for collocation, semantic association and colligation (textual or otherwise), there are two possible implications. The first is that this priming accounts for our ready ability to distinguish polysemous uses of a word. Where it can be shown that a common sense of a polysemous word favours certain collocations, semantic associations and/or colligations, the rarer sense of that word will, I would argue, avoid those collocations, semantic associations and colligations (see Hoey in press). The second implication is that in continuous text the primings of lexical items may combine. Thus the words that make up the phrase Sixty years ago today, which begins the text we considered earlier with regard to cohesion, have the primings summarised below (amongst others) in newspaper data. What we have here is colligational prosody, where the primings reinforce each other (or not), the naturalness of the phrase in large part deriving from the non-conflictual nature of the separate primings when combined. I would want to suggest that some of the work currently undertaken by grammar might be absorbed into, or superseded by, colligational prosody. Two questions naturally arise from the preceding discussion. The first is practical in nature: what are the implications of all this for the language learner? The other is theoretical: what place is left in this theory for creativity? To tackle the practical question first, if the notion of priming is correct, the role of the FL classroom is to ensure that the learner encounters the lexis in such a way that it is properly and correctly primed. This can only be a gradual matter; nevertheless, there are grave dangers in teachers or teaching materials incorrectly priming the lexis such that the learner is blocked, sometimes permanently, from correctly priming the lexical items. Furthermore, certain learning practices must be inappropriate, such as the learning of vocabulary in lists (i.e. stripped of all its primings), while others (e.g. exposure to authentic data) are apparently endorsed. Authentic data, however, are usually inauthentically encountered in the classroom, in that they are read or heard for reasons remote from those that gave rise to the data in the first place. On the other hand, only authentic data can preserve the collocations, colligations, semantic associations of the language and only complete texts and conversations can preserve the textual associations and colligations.
[Table: the primings, in newspaper data, of the words making up Sixty years ago today. The layout has not survived extraction; the recoverable entries are: collocation (Sixty with years; years with ago), semantic association (Sixty and years with NUMBER; ago with UNIT OF TIME), positive colligation with text-initial position when paragraph-initial, positive colligation with paragraph-initial position when thematized, and strong (in one case weak) colligation with Theme.]
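One way to make the idea of colligational prosody concrete is to treat each word's primings as a set of features and ask which features the words of a candidate phrase share. The sketch below simply hard-codes a simplified version of the primings summarised above; the feature labels and the set representation are assumptions made for illustration, not a model of how primings are acquired or stored.

```python
# Illustrative primings in newspaper data, following the summary above.
PRIMINGS = {
    "sixty": {"collocates:years", "semantic:NUMBER",
              "text-initial", "paragraph-initial", "theme:strong"},
    "years": {"collocates:ago", "semantic:NUMBER",
              "text-initial", "paragraph-initial", "theme:strong"},
    "ago":   {"semantic:UNIT_OF_TIME",
              "text-initial", "paragraph-initial", "theme:strong"},
    "today": {"text-initial", "paragraph-initial", "theme:weak"},
}

def shared_primings(phrase: str) -> set[str]:
    """Return the primings every word in the phrase has in common -
    the reinforcing, 'prosodic' core of the combination."""
    sets = [PRIMINGS.get(w, set()) for w in phrase.lower().split()]
    return set.intersection(*sets) if sets else set()

print(shared_primings("Sixty years ago today"))
# -> the textual colligations all four words share: text-initial, paragraph-initial
```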
To turn now to the theoretical question, there is of course ample room for the production of original utterances through semantic association, but semantic association will not by itself account either for the ability of the ordinary speaker to utter something s/he has never heard before or for the ability of the more self-conscious creative writer to produce sentences that are recognisably English but have never been encountered before. I would argue that, when speakers go along with the primings of the lexis they use, they produce utterances that seem idiomatic. This is the norm in conversation and writing. If they choose to override those primings, they produce acceptable sentences of the language that might strike one with their freshness or with their oddness but will not seem idiomatic. Crucially, though, even these sentences will conform to more primings than they override. So when Dylan Thomas, a poet famous for his highly creative (and sometimes obscure) use of language, begins one of his poems with A grief ago, he rejects the collocations and semantic associations
of sixty and years but conforms to the primings of ago, such that the phrase functions textually in similar ways to sixty years ago.
[Table: the primings of the words in A grief ago, laid out as for Sixty years ago today above. The layout has not survived extraction; what is recoverable is that the collocational and semantic-association primings are not conformed to (grief is not a NUMBER and does not collocate with ago), while the textual colligations of ago – positive colligation with text-initial position when paragraph-initial, positive colligation with paragraph-initial position when thematized, and strong colligation with Theme – are retained.]
Priming is therefore something that may be partly overridden but not completely overridden. Complete overriding would result in instances of non-language. Thus the task that Chomsky set himself of accounting for all and only the acceptable sentences of the language requires priming as (part of) its answer. Indeed, what we think of as grammar may be better regarded as a generalisation out of the multitude of primings of the vocabulary of the language; it may alternatively be seen usefully as an account of the primings of the commonest words of the language (such as the, of and is). Either way, I hope I have done enough to demonstrate that a new theory of language might need to place priming at the heart of it.1

Note
1. Note that this sentence conforms to the priming for non-cohesion of at the heart of discussed earlier, and in that respect is idiomatic. In so far, however, as this endnote draws attention to the possibility of cohesion, it has created it and thereby demonstrated my ability to override one priming of the phrase while conforming to its other primings – an essential feature of a theory of priming.
References

Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan 1999. Longman Grammar of Spoken and Written English. Harlow: Longman.
Brazil, D. 1995. The Grammar of Speech. Oxford: Oxford University Press.
Campanelli, P. and Channell, J. M. 1994. Training: An Exploration of the Word and the Concept with an Analysis of the Implications for Survey Design. London: Employment Department.
Chomsky, N. 1957. Syntactic Structures. The Hague: Mouton.
Chomsky, N. 1965. Aspects of the Theory of Syntax. Cambridge (MA): MIT Press.
Emmott, C. 1989. Reading between the lines: Building a comprehensive model of participant reference in real narrative. Ph.D. thesis, University of Birmingham.
Emmott, C. 1997. Narrative Comprehension: A Discourse Perspective. Oxford: Clarendon Press.
Farr, F. and McCarthy, M. 2002. "Expressing hypothetical meaning in context: Theory versus practice in spoken interaction". Paper presented at the TaLC 5 Conference, Bertinoro, 26–31 July 2002.
Firth, J. R. 1957. "A synopsis of linguistic theory, 1930–1955" in Studies in Linguistic Analysis, 1–32, reprinted in Selected Papers of J R Firth 1952–59, F. Palmer (ed.), 168–205. London: Longman.
Gledhill, C. J. 2000. Collocations in Science Writing. Tübingen: Gunter Narr Verlag Tübingen.
Goldberg, A.E. 1995. Constructions: A Construction Grammar Approach to Argument Structure. Chicago: The University of Chicago Press.
Grosz, B. J. and Sidner, C. L. 1986. "Attention, intentions, and the structure of discourse". Computational Linguistics, 12(3):175–204.
Jordan, M.P. 1984. Rhetoric of Everyday English Texts. London: Allen & Unwin.
Halliday, M.A.K. 1959. The Language of the Chinese 'Secret History of the Mongols'. Oxford: Blackwell [Publication 17 of the Philological Society].
Halliday, M.A.K. 1967–8. "Notes on transitivity and theme in English" (parts 1, 2 and 3), Journal of Linguistics 3.1, 3.2 and 4.2.
Halliday, M.A.K. and Hasan, R. 1976. Cohesion in English. London: Longman.
Hoey, M. 1983. On the Surface of Discourse. London: Allen & Unwin.
Hoey, M. 1995. "The lexical nature of intertextuality: A preliminary study" in Organization in Discourse: Proceedings from the Turku Conference, B. Wårvik, S-K. Tanskanen and R. Hiltunen (eds), 73–94. Turku: University of Turku [Anglicana Turkuensia 14].
Hoey, M. 1997a. "Lexical problems for the language learner (and the hint of a textual solution)", in Proceedings of the 5th Latin American ESP Colloquium, Merida, Venezuela.
Hoey, M. 1997b. "From concordance to text structure: New uses for computer corpora", in PALC '97: Proceedings of Practical Applications in Language Corpora Conference, B. Lewandowska-Tomaszczyk and P.J. Melia (eds), 2–23. Łódź: University of Łódź.
Hoey, M. 2001. Textual Interaction: An Introduction to Written Discourse Analysis. London: Routledge.
Hoey, M. 2003. "Why grammar is beyond belief" in Beyond: New Perspectives in Language, Literature and ELT. Special issue of Belgian Journal of English Language and Literatures, J.P. van Noppen, C. den Tandt and I. Tudor (eds), Ghent: Academia Press.
Hoey, M. in press. Lexical Priming: A New Theory of Words and Language. London: Routledge.
Hoey, M. forthcoming. "Textual colligation – A special kind of lexical priming", in Proceedings of ICAME 2002, Göteborg, K. Aijmer and B. Altenberg (eds). Amsterdam: Rodopi.
Hunston, S. and Francis, G. 2000. Pattern Grammar: A Corpus-driven Approach to the Lexical Grammar of English. Amsterdam: John Benjamins.
Langendoen, T. 1968. The London School of Linguistics: A Study of the Linguistic Contributions of B. Malinowski and J.R. Firth. Cambridge (MA): MIT Press.
Louw, B. 1993. "Irony in the text or insincerity in the writer? The diagnostic potential of semantic prosodies" in M. Baker, G. Francis and E. Tognini-Bonelli (eds), Text and Technology, 157–76. Amsterdam: John Benjamins.
Martin, J.R. 2000. "Beyond Exchange: APPRAISAL systems in English". In Evaluation in Text: Authorial Stance and the Construction of Discourse, S. Hunston and G. Thompson (eds), 142–75. Oxford: Oxford University Press.
Morley, J. and Partington, A. 2002. "From frequency to ideology: Comparing word and cluster frequencies in political debate". Paper presented at the TaLC 5 Conference, Bertinoro, 26–31 July 2002.
Murray, J.A.H. et al. (eds) 1884–1928. A New English Dictionary on Historical Principles (reprinted with supplement, 1933, as Oxford English Dictionary). Oxford: Oxford University Press.
Roget, Peter M. 1852. Thesaurus of English Words and Phrases. Harlow: Longman.
Sinclair, J. McH. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Sinclair, J. McH. 1993. "Written discourse structure" in Techniques of Description, J. McH. Sinclair, M. Hoey and G. Fox (eds), 6–31. London: Routledge.
Sinclair, J. McH. 1996. "The search for units of meaning". Textus 9:75–106.
Sinclair, J. McH. 1999. "The lexical item" in Contrastive Lexical Semantics, E. Weigand (ed.), 1–24. Amsterdam: John Benjamins.
Sinclair, J. McH. and Renouf, A. 1988. "Lexical syllabus for language learning" in Vocabulary and Language Teaching, R. Carter and M. McCarthy (eds), 197–206. Harlow: Longman.
Sripicharn, P. 2002. "Examining native speakers' and learners' investigation of the same concordance data: a proposed method of assessing the learners' performance on concordance-based tasks". Paper presented at the TaLC 5 Conference, Bertinoro, 26–31 July 2002.
Stubbs, M. 1995. "Corpus evidence for norms of lexical collocation" in Principle and Practice in Applied Linguistics, G. Cook and B. Seidlhofer (eds), 245–256. Oxford: Oxford University Press.
Stubbs, M. 1996. Text and Corpus Analysis. Oxford: Blackwell.
Winter, E. O. 1977. "A clause-relational approach to English texts". Instructional Science 6: 1–92.
Yule, G. 1981. "New, current and displaced entity reference". Lingua 55:42–52.
Corpora by learners
Multiple comparisons of IL, L1 and TL corpora: The case of L2 acquisition of verb subcategorization patterns by Japanese learners of English

Yukio Tono
Meikai University, Japan
This study investigates the acquisition of verb subcategorization frame (SF) patterns by Japanese-speaking learners of English by examining the relative influence of factors such as the effect of first language knowledge, the amount of exposure to second language input, and the properties of inherent verb semantics on the use and misuse of verb SF patterns. To do this, three types of corpora were compiled: (a) a corpus of students' writing, (b) a corpus of L1 Japanese, and (c) a corpus of English textbooks (i.e., one of the primary sources of input in the classroom). Ten high frequency verbs were examined for the learners' use of SF patterns. Log-linear analysis revealed that the overall frequency of verb SF patterns was influenced by the amount of exposure to the patterns in the textbooks whereas error frequency was not highly correlated with it. There were strong interaction effects between error frequency and L1-related and L2 inherent factors such as the differences in verb patterns and frequencies between English and Japanese, and verb semantics for each verb type. Multiple comparisons of IL, L1 and TL (textbook) corpora were found to be quite useful in identifying the complex nature of interlanguage development in the classroom context.
1. Introduction

Each individual language has its own way of realizing elements following a verb. Every verb is accompanied by a number of obligatory participants, usually from one to three, which express the core meaning of the event. Participants which are core elements in the meaning of an event are known as arguments. Other constituents, which are optional, are known as adjuncts. What
core elements follow a verb is accounted for by subcategorization. Different subcategories of verbs make different demands on which of their arguments must be expressed (cf. (1a) – (1c)), which can optionally be expressed (cf. (1d)), and how the expressed arguments are encoded grammatically – that is, as subjects, objects or oblique objects (objects of prepositions or oblique cases). For example, as in (1a), the verb "dine" is an intransitive verb and takes only one argument (i.e., a subject) while verbs such as eat or put can take two or three arguments respectively (see 1b and 1c).
(1) a) Mary dined./ *Mary dined the hamburger. [1 ARG]
    b) Mary ate./ Mary ate the hamburger. [2 ARG]
    c) *Mary put./ *Mary put something./ Mary put something somewhere. [3 ARG]
    d) Tom buttered the toast with a fish-knife. [optional]
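For readers who prefer to see a subcategorization frame stated operationally, the following sketch encodes the examples in (1a)–(1c) as frames (a count of obligatory arguments plus a count of optionally expressible ones) and checks how many expressed arguments each verb tolerates. The representation is a deliberately minimal assumption of this sketch, not the coding scheme used in the study, and it ignores adjuncts such as the with-phrase in (1d).

```python
from dataclasses import dataclass

@dataclass
class SubcatFrame:
    """A simplified subcategorization frame: how many arguments must be
    expressed, and how many more may optionally be expressed."""
    verb: str
    obligatory_args: int
    optional_args: int = 0

    def accepts(self, expressed_args: int) -> bool:
        return (self.obligatory_args
                <= expressed_args
                <= self.obligatory_args + self.optional_args)

FRAMES = [
    SubcatFrame("dine", 1),      # (1a) Mary dined.  *Mary dined the hamburger.
    SubcatFrame("eat", 1, 1),    # (1b) Mary ate (the hamburger).
    SubcatFrame("put", 3),       # (1c) Mary put something somewhere.
]

for frame in FRAMES:
    for n in (1, 2, 3):
        mark = "ok" if frame.accepts(n) else "*"
        print(f"{frame.verb}: {n} expressed argument(s) -> {mark}")
```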
In this paper, I will present a study which investigates the acquisition of verb subcategorization frame (SF) patterns by Japanese-speaking learners of English. For this study, I compiled three different types of corpora: Interlanguage (IL), L1 and Target Language (TL). For IL corpora, students' free compositions were used whilst newspaper texts and EFL textbooks were assembled for L1 and TL corpora respectively. I will discuss the rationale of using textbooks as TL corpora in more detail below. By conducting multiple comparisons of the three corpora, I examined how different factors such as the effect of L1 knowledge, the amount of exposure to L2 input, and the properties of inherent verb meanings in L2 affect the acquisition of verb SF patterns. The acquisition of SF patterns is often associated with the broader issue of the acquisition of argument structure (Pinker 1984, 1987, 1989). The development of argument structure can be influenced by several factors. Four main factors (verb semantics, learning stage, L1 knowledge, and L2 input) were selected and the relationship of these factors to the use/misuse of argument structure was investigated. An L1 corpus was used to define the influence of verb SF patterns in L1 while ELT textbook corpora were used for determining the degree of exposure to certain SF patterns in the classroom. Based on the data from these corpora, I compared the SF patterns of a group of high-frequency verbs in the Japanese EFL Learner (JEFLL) Corpus.
2. Factors affecting the acquisition of SF patterns

2.1 Views from L1 acquisition research

There are competing theories seeking to explain the acquisition of argument structure in L1 acquisition. The major issue is how to explain the children's initial acquisition of argument structure. Do they learn the argument structure patterns from the meaning of verbs they initially acquire or do they acquire the structure first, then move on to the acquisition of verb meanings? The two bootstrapping hypotheses, semantic and syntactic, claim that the acquisition of argument structure is bootstrapped by first acquiring either semantic or syntactic properties of the verbs. Pinker (1987) is keen to identify what happens at the very first stage of syntax acquisition while Gleitman (1990) states the hypothesis in such a way that it applies not only to the initial stage but to the entire process of acquisition. As Grimshaw (1994) argues, however, these two hypotheses could complement each other, once the initial state issue is solved. Despite the difference in the view of how the acquisition of argument structure starts, Pinker and Gleitman both agree that knowledge of the relationship between a verb's semantics and its morpho-syntax is guided in part by Universal Grammar (UG) (cf. Chomsky 1986) because adult grammars go beyond the input available. According to Goldberg (1999), on the other hand, it is a construction itself which carries the meaning. Although verbs and associated argument structures are initially learned on an item-by-item basis, increased vocabulary leads to categorization and generalization. "Light" verbs, due to the fact that they are introduced at a very early stage and are highly frequent, act as a centre of gravity, forming the prototype of the semantic category associated with the formal pattern. The perspective which Goldberg and other construction grammarians have taken on children's grammar learning is fundamentally that of "general" nativism. They reject the claim of "special" nativism in its particular guise of UG, but they still assume other, innate, aspects of human cognitive functioning accounting for language acquisition. As a matter of fact, this position is increasingly widely supported nowadays within more general cognitive approaches, including so-called emergentism (Elman et al. 1996; MacWhinney 1999), cognitive linguistics (Langacker 1987, 1991; Ungerer and Schmid 1996) and constructivist child language research (Slobin 1997; Tomasello 1992).
One of the purposes of this study is to determine the relative effect of L1 knowledge, classroom input, developmental factors and inherent verb semantics on the use/misuse and overuse/underuse of SF patterns by Japanese learners of English. It should be noted that the study does not need to call on a specific acquisition theory at this stage. Rather, this corpus-based study should shed light on the nature of IL development by weighting the factors which are possibly relevant to the acquisition of argument structure. This will help to evaluate the validity and plausibility of the claims made in L1 acquisition research in the light of SLA theory construction. For instance, if the study shows the strong effect of frequencies of verbs used in the ELT textbooks on the use of particular SF patterns, then the results may indicate that L2 acquisition can be better explained by the theory that attaches more importance to the frequency of the items to be acquired in the input. From this viewpoint, Goldberg's theory is more plausible. On the other hand, if the effect of verb semantics is highly significant, one may be inclined to agree with the theory that emphasises the semantic properties of verbs as the driving force for the acquisition of argument structure. Hence one would be more likely to adopt the theoretical framework of semantic bootstrapping theory proposed by Pinker (see 1 above). This study has the potential, therefore, to tease out possible factors affecting L2 acquisition in the light of L1 acquisition theories, making observations on L1, TL, and IL corpus data while controlling all those selected factors, and finally giving each factor a weighting according to the results of the corpus analysis. This weighting of the factors relevant to L2 acquisition will then contribute to decision-making about which L1 acquisition theory is more plausible.
2.2 Views from L2 acquisition research

Whilst a vast literature exists on the L1 acquisition of semantics-syntax correspondences, second language acquisition of verb semantics and morpho-syntax only really attracted detailed attention in the 1990s. The major issues in L2 acquisition of argument structure are: (1) whether or not L1 effects are strong in this area, (2) whether there is any evidence of universal patterns of development, and (3) the role of input in the acquisition of argument structure. From previous SLA studies, L1 effects appear strong in the acquisition of argument structure. SF patterns especially are a case in point. Recently, there
has been much investigation of the proposal that the SF requirements of a lexical item might be predictable from its meaning (Levin 1993:12). The issue here is whether such lexical knowledge in L1 or in UG will affect L2 acquisition. This is usually investigated through the study of the acquisition of diathesis alternations1 – alternations in the expression of arguments, sometimes accompanied by changes of meaning – verbs may participate in. In the case of dative alternations (White 1987, 1991; Bley-Vroman and Yoshinaga 1992; Sawyer 1996; Inagaki 1997; Montrul 1998), most evidence seems to indicate that the initial hypothesis regarding syntactic frames is based on the L1. Studies on the locative alternations (Juffs 1996; Thepsura 1998 cited in Juffs 2000) indicate that there is a difference in the way a hypothesis is formed by learners at different proficiency levels. While beginning learners start off with a wider grammar for non-alternating locative verbs, very advanced learners end up with a narrower grammar (Juffs 1996). There are several studies (Zobl 1989; Hirakawa 1995; Oshita 1997) that indicate an L1 transfer effect on transitivity alternations and the unergative/unaccusative distinction. To recapitulate, L1 effects appear strong in this area of grammar. Based on their L1, learners transfer and overgeneralize in the dative and the locative alternations. They also show a preference for morphology for inchoatives. Consequently, learners are helped if their L1 has certain features which are also in the L2. Advanced learners, however, seem able to recover from overgeneralization errors in some instances by acquiring narrow conflation classes which are not in their L1. Thus there seems to be an interaction effect between L1 influence and proficiency levels. In spite of studies showing L1 effects, there is some evidence of universal patterns of development. Learners from a variety of backgrounds seem to use passive morphology for NP movement in English L2 with pure unaccusatives (Yip 1994; Oshita 1997). English-speaking learners of Spanish seem to use se selectively for the same purpose even when it is not required with unaccusative verbs (Toth 1997). Montrul (1998) found evidence which indicates that L2 learners have an initial hypothesis that all verbs can have a default transitive template, allowing an SVO structure in English even with pure unaccusatives and unergatives. Hence, learners seem to overgeneralize causativity in root morphemes much as children acquiring their first language do. There are not many studies on the role of input in the acquisition of verb meaning and the way such knowledge relates to syntax. Inagaki (1997) argues that the fact that the double-object datives containing "tell"-class verbs were
more frequent in the input than those containing "throw"-class verbs, explains why the Japanese learners distinguished the tell verbs more clearly than the throw verbs. The fact that the English native speakers made a stronger distinction between the tell/whisper verbs than between the throw/push verbs is also consistent with the assumption that the double-object datives containing the tell verbs were more frequent in the input than those containing the throw verbs (ibid:660). Unfortunately, measuring the frequency in L2 input is difficult since so few analyses of input corpora for L2 learners exist (Juffs 2000:202).
3. JEFLL Corpus and the multiple comparison approach

The JEFLL Corpus project aims to compile a corpus of Japanese EFL text produced by learners from Year 7 to university levels. The strength of the JEFLL Corpus is that it contains L1 and TL corpora as an integral part of its design. As was shown in the last section, very few studies have made use of both attested L2 learner data and L1/TL data to identify features of interlanguage development, let alone a corpus-based analysis of these data. Most learner corpus studies to date have made use of NS corpora because the studies are typically focused on learning English, and many native English corpora are readily available as a standard reference, whereas very few studies (except for JEFLL and PELCRA, see Leńko-Szymańska, this volume) collect parallel L1 corpora for comparison. Table 1 shows the overall structure of the JEFLL Corpus. The total size of the L2 corpus is approximately 500,000 running words of written texts and 50,000 words of orthographically transcribed spoken data. The L1 corpus consists of a corpus of Japanese newspaper texts (approximately 11 million words) plus a corpus of student compositions written in Japanese. The L1 Japanese-language essays were written on the same topics as the ones used for the L2 English composition classes. The third part of the JEFLL Corpus comprises the TL corpus. It is a corpus of EFL textbooks covering both junior and senior high school textbooks. The junior high school textbooks are the ones used officially at every junior high school in Japan. There are seven competing publishers producing such textbooks. Irrespective of which publisher one chooses, each publishes three books corresponding to the three recognized proficiency grades for years 7–9.
Table 1. The JEFLL Corpus project: Overall structure

Part 1: L2 learner corpora
– Written corpus (composition): ~500,000 words
– Spoken corpus (picture description): ~50,000 words

Part 2: L1 corpora
– Japanese written corpus (composition): ~50,000 words, same tasks as in relevant L2 corpus
– Japanese newspaper corpus: ~11,000,000 words

Part 3: TL corpus
– EFL textbook corpus: ~650,000 running words (Y7–9: 150,000; Y10–12: 500,000)
Senior high school textbooks are more diversified and more than 50 titles have been published. This corpus contains mainly the textbooks for English I and II (general English). I would argue that textbook English is a useful target corpus to use in the study of learner language. As this claim runs counter to that of other researchers (e.g., Ljung 1990; Mindt 1997), it is important to examine the basis for this claim in some detail. Firstly, the target language which learners are measured by should reflect the learning environment of learners. It is not always appropriate to use a general corpus such as the BNC or the Bank of English to make comparisons with non-native-speaker corpora. The differences you will find between L2 corpora and such general corpora will be those between learner English and the English produced by professional native-speaker writers. Such a comparison may be meaningful in the case of highly advanced learners of English or professional non-native translators. The output of such highly advanced learners, however, is something which the vast majority of L2 learners in Japan never aspire to. We have to consider very seriously what the target norm should be for the learners we have in mind. In the present case, it is certainly not the language of the BNC that the Japanese learners of English are aiming at, but, rather, a modified English which represents what they are more exposed to in EFL settings in Japan. I am fully aware of the fact that the type of language used in ELT textbooks may be unnatural in comparison to actual native speaker usage (see, for instance, Ljung 1990, 1991 and Römer, this volume). Pedagogically, however, beginning- or intermediate-level texts are designed to contain a level and form of English which can facilitate learning. In spite of all their peculiarities in comparison with L1 corpora, these textbooks represent the primary source of input for L2 learners in Japan, and as such their use in explaining and assessing L2 attainment is surely crucial.
The ELT textbook is the primary source of English language input for learners in Japan. Inside the classroom, some teachers will use classroom English, and others will not use English at all as a medium of instruction. Even if they do use English in the classroom, they usually limit their expressions to the structures and vocabulary that have previously appeared in the textbook. Outside the classroom, those who go to "cram" schools – private schools where students study after school to prepare for high school or university entrance examinations – will receive extra input, but this input comprises questions borrowed from past entrance exams, or questions based on the contents of the textbooks (Rohlen 1983). Hence, it is fair to say that the English used in ELT textbooks is the target for most learners of English in Japan. If we exclude textbooks from our investigation, explaining the differences between TL and IL usage may be impossible. However, where textbooks are included in an exploration of L2 learning, they can explain differences between NS and NNS usage (McEnery and Kifle 1998). While the above argument presents the basis for the inclusion of textbooks in my model for the study of learner language, more evidence is required to substantiate this claim. This will be provided below, as part of the description of some of my research results, where the textbook corpus will be called upon to provide an explanation for differences between IL and TL. For the moment I will take the argument presented so far as sufficient evidence to warrant the inclusion of textbook material in my learner corpus exploitation model.

My proposal, therefore, is that standard reference (e.g., the BNC), textbook and learner corpora all have roles to play in a fuller and proper exploration of learner language, a method which we may refer to as the "multiple comparison" approach. Figure 1 illustrates this point diagrammatically. "IL1 ↔ ILx" in Figure 1 refers to the different subcorpus divisions according to academic year into which L2 learner texts may be divided. These IL-IL comparisons can be of several different types, depending on the learner variables. For instance, if the independent variable (i.e., the variable that you manipulate) is age or the academic year of the learners, with all other variables constant, one can make a comparison of different IL corpora from different age groups. In ICLE (International Corpus of Learner English, Granger et al. 2002), on the other hand, the age (or proficiency level) factor is held constant, and research using ICLE centres around the IL characteristics of different L1 groups.

Figure 1. Multiple comparison of L1, TL and IL corpora

A comparison between L2 corpora and TL corpora can also be made (see (B) in Figure 1). One can use either a general standard corpus such as the British National Corpus to look at differences in, for example, lexicogrammar between native speakers and L2 learners, or use a more comparable corpus of native-speaker texts, e.g., LOCNESS (Louvain Corpus of Native English Essays)2 in ICLE, to compare like with like. We can refer to this type of comparison as IL-TL comparison. TL corpora may also be compared with L1 corpora (TL-L1 comparison, cf. (C) in Figure 1) in order to describe the target adult grammar system and identify potential causes of L1 transfer. This analysis should be combined with L2 corpus analysis. TL-L1 comparison could provide significant information on the influence of the source language on the acquisition of the target language. A fourth type of comparison is that between IL corpora and L1 mother tongue corpora (L1-IL comparison, cf. (D) in Figure 1). L1 corpora can provide information on features of the L2 learners' native language, which can help us understand potential sources of L1-related errors or overuse/underuse phenomena. Despite the sophistication of recent error taxonomies, it is rather difficult to distinguish interlingual errors from intralingual ones, unless some empirical data are available on the pattern of a particular linguistic feature in both languages. L1-IL comparisons provide fundamental data in this area. Table 2 summarises each comparison type.
Table 2. Multiple comparison approach

Comparison              Description
IL-IL comparison        Comparisons between different stages of ILs or ILs by learners with different L1 backgrounds.
IL-TL comparison        Comparisons between learner corpora and target language corpora (i.e. ELT textbook corpora in the present study, or general native corpora).
TL-L1 comparison        Comparisons between target language corpora and L1 mother tongue corpora (to identify potential causes of L1 transfer).
L1-IL comparison        Comparisons between L1 corpora and learner corpora (to identify L1-related errors or overuse/underuse phenomena).
IL-L1-TL comparison     Combination of the above comparisons (to identify the complex relationship between IL, L1 and TL corpora on L2 learners' error patterns or overuse/underuse phenomena).
4. The relationship between factors and corpora used

Table 3 shows the factors to be examined in this study and how corpus data can supply the relevant information. It is only through multiple comparisons of L1, TL, and IL corpora that such issues can be fully addressed. Note that the primary purpose of this study is not to identify the role of specific UG constraints in L2 acquisition. Rather, the study aims to capture the cause-effect relationships among those variables and to identify their relative effects on the acquisition of argument structure in L2 English.
Table 3. The relationship between the factors in this study and types of information from different corpora

Factors                    Corpus data
The L1 effects             Frequency of similar/different argument structure properties in L1 corpus
The L2 input               Frequency of subcategorization patterns in ELT textbook corpus
Developmental stages       Frequency of use/misuse of subcategorization patterns from the developmental IL corpus
The L2 internal effects    Frequency of different verb classes and alternations from the IL corpus
5. Research design

5.1 Research questions

This study has the following research questions:
1. Which of the following variables affect L2 acquisition of argument structure (most)?
   • The L1 effects
   • The L2 input effects
   • The L2 internal effects
   • The developmental effects
2. Are there any interaction effects between the variables? If so, what are they?
The clarification of the relationship between the above questions will contribute to current SLA research, especially in terms of the possible role of L1 knowledge, L2 classroom input, and verb semantics-syntax correspondences in the acquisition of argument structure.
5.2 Variables and operational definitions

Each variable is operationally defined as follows:
1. L1 effects: L1 effects were examined with respect to two aspects of the similarity in subcategorization frame (SF) patterns between English and Japanese: (a) the degree of SF matching and (b) the frequencies of similar SF patterns in the L1 Japanese corpus and the COMLEX Lexicon (TL).
2. L2 input effects: L2 input effects were defined in terms of the frequencies of the given SF patterns in the L2 textbook corpus.
3. L2 internal effects: These characteristics pertain to the English verb system. For differences in verb classes and alternation types I follow Levin's (1993) classification.
4. Developmental effects: Developmental effects were simply measured in relation to the three groups of subjects categorized by their school years (Year 7–8; 9–10; 11–12).
5.3 Extraction of SF patterns

For this study, I parsed the learner and textbook corpora using the Apple Pie Parser (APP), a statistical parser developed by Satoshi Sekine at New York University (see Sekine 1998 for details). The accuracy rate of the APP is approximately 70%, so it was not efficient to extract SF patterns automatically using the APP alone. Consequently, after running the parser over the corpus, I exported concordance lines of verbs with the automatically assigned syntactic information into a spreadsheet program and then categorized them into SFs using pattern matching. This proved to be an efficient means of studying verb SFs. The Comlex Lexicon (Macleod et al. 1996; Grishman et al. 1994) was also consulted for frequency information relating to some subcategorization frames in the TL corpus. The Comlex Lexicon itself does not provide complete frequency data for all SF patterns. However, it has frequency information for the subcategorization frames of the first 100 verbs appearing in the Brown Corpus. I calculated the percentages of each SF pattern in the Comlex database and used this information to supplement the data from the textbook corpus. For the L1 corpus, a Japanese morphological analyser, ChaSen (Matsumoto et al. 2000), was used for tokenization and morphological analysis, and the frequencies of SF patterns were detected by pattern matching. SF extraction was done after extracting all the instances of a particular verb under study, so that manual postediting was also possible.
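As a rough illustration of this pattern-matching step, the sketch below classifies the post-verbal constituent labels of each concordance line into SF categories and tallies their frequencies. It is a minimal sketch only: the tab-separated input format, the SF labels and the helper names are assumptions made for illustration, not the actual spreadsheet layout or categories used in the study.

import csv

# Hypothetical mapping from the constituent labels following the verb (as
# assigned by the parser) to subcategorization frame (SF) labels; the labels
# here are illustrative, not the original categories.
SF_PATTERNS = {
    (): "V",                  # intransitive: "He slept."
    ("NP",): "V_NP",          # monotransitive: "She bought a book."
    ("NP", "NP"): "V_NP_NP",  # ditransitive: "She bought him a book."
    ("NP", "PP"): "V_NP_PP",  # "She bought a book for him."
    ("S",): "V_S",            # sentential complement: "I think (that) he left."
}

def classify_sf(postverbal_labels):
    """Map the post-verbal constituent labels of one concordance line to an SF label."""
    return SF_PATTERNS.get(tuple(postverbal_labels), "OTHER")

def count_sfs(concordance_file):
    """Tally SF frequencies from a tab-separated concordance export:
    one line per verb token, verb <TAB> space-separated constituent labels."""
    counts = {}
    with open(concordance_file, encoding="utf-8") as f:
        for verb, labels in csv.reader(f, delimiter="\t"):
            sf = classify_sf(labels.split())
            counts[(verb, sf)] = counts.get((verb, sf), 0) + 1
    return counts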
5.4 Categorization of verb classes

The verb classification in Levin (1993) was used to categorize verbs into groups with similar meanings. Levin organizes English verbs in two ways: (a) according to the diathesis alternations they undergo and (b) into semantically coherent verb classes. While Levin's classification is very important for the study of lexical knowledge in the human mind, it should also be noted that her study is not concerned with the actual usage of those verb classes. Out of the 49 verb classes Levin created, only 22 classes were found among the top 40 most frequent verbs in the BNC. An important fact to note, therefore, is that a small number of categories which meet essential communication needs (e.g., "communication", "motion", and "change of possession") predominate in actual verb usage. The input thus consists of only a handful of highly
frequent verb classes, with the rest of the classes being rather infrequent. The information on Japanese SFs was obtained from the IPAL Electronic Dictionary Project.3 After making a matching database of corresponding verbs in English and Japanese, the frequency information of English SFs was extracted from the Comlex Lexicon. SFs were also extracted from the ELT textbook corpus for TL (English) and from the Japanese corpus I made for L1 Japanese. The next step in the study involved a statistical analysis of these data, taking the various influences into account. Log-linear analysis was the method employed, and the next section gives a summary of the procedure.
5.5 Log-linear analysis

The objective of log-linear analysis is to find the model that gives the most parsimonious description of the data. For each of the different models, the expected cell frequencies are compared to the observed frequencies. A chi-square test can then be used to determine whether the difference between expected and observed cell frequencies is acceptable under an assumption of independence of the various factors. The least economical model, the one that contains the maximal number of effects, is the saturated model; it will by definition yield a "perfect" fit between the expected and observed frequencies, and the associated χ² is zero. In this study, the procedure called backward deletion was employed. This begins with the saturated model; effects are then successively left out of the model, and each time it is checked whether the χ² value of the more parsimonious model passes the critical level. When this happens, the effect that was left out last is deemed essential to the model and should be included. Several statistical packages contain procedures for carrying out a log-linear analysis on contingency tables, e.g., SPSS, STATISTICA and SAS. In this study, STATISTICA was the main program used for model testing.
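The backward-deletion logic can be sketched as follows. This is a simplified illustration, not the STATISTICA routine actually used; fit_model is a hypothetical placeholder standing in for the package's model-fitting step, and the 0.05 significance level is assumed.

from scipy.stats import chi2

def backward_deletion(terms, fit_model, alpha=0.05):
    """Sketch of backward deletion for a log-linear model.

    `terms` is the collection of effects in the saturated model (e.g. tuples of
    factor numbers, so that len(term) gives the order of the effect);
    `fit_model(terms)` is assumed to return (chi_square, degrees_of_freedom)
    for the model containing those terms.
    """
    current = set(terms)
    # Try to drop higher-order effects first, one at a time.
    for term in sorted(terms, key=len, reverse=True):
        candidate = current - {term}
        chi_sq, df = fit_model(candidate)
        if chi_sq > chi2.ppf(1 - alpha, df):
            continue          # fit deteriorates significantly: keep the effect
        current = candidate   # the more parsimonious model still fits: drop it
    return current            # effects deemed essential to the model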
5.6 Subcategorization frame database

For each high-frequency verb, the following information was gathered and put into the database format:
• Parsed example sentences containing the target verb
• School year categories (Year 7–8; 9–10; 11–12)
• Verb name
• Verb class
• Verb meaning
• Alternation type
• SF for each example
• Frequency of SF in COMLEX Lexicon
• TL frequency of the given SF (i.e., textbook corpora)
• Learner errors
• Parsing errors
• Japanese verb equivalents
• L1 frequency of the equivalent SF (i.e., Japanese corpus)
These data were collected for each of the high-frequency verbs and exported to the statistical software used for further analysis. In order to process the data by log-linear analysis, the frequencies of TL and L1 were converted into categorical data ([HIGH]/[MID]/[LOW]). In order to study the acquisition of argument structure, ten verbs were selected for the analysis (bring, buy, eat, get, go, like, make, take, think, and want). While it would have been desirable to cover as many verbs as possible from different verb classes for the study, it should be noted that frequencies of SF patterns become extremely small if low frequency verbs are included. Only the ten most frequent verbs in the data were therefore selected for investigation, since these allowed a sufficient number of observations to be made for each verb. Even though they are frequent, be and have were excluded from the analysis because their status as lexical verbs is very different from that of other verbs. Due to limitations of space, I cannot go into the details of the SF patterns, but interested readers may consult Tono (2002).
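The conversion of raw frequencies into the categorical values used in the analysis, and the shape of one database record, can be sketched as follows; the cut-off points, field names and example values are illustrative assumptions, not those of the original database.

def to_band(value, cutoffs):
    """Convert a raw relative frequency into LOW/MID/HIGH, given two
    cut-off points (low_max, mid_max); the cut-offs are assumed here
    for illustration, not taken from the original study."""
    low_max, mid_max = cutoffs
    if value <= low_max:
        return "LOW"
    if value <= mid_max:
        return "MID"
    return "HIGH"

# One hypothetical record of the subcategorization frame database,
# with TL and L1 frequencies already converted to categorical data.
record = {
    "verb": "buy",
    "verb_class": "get-verbs",            # Levin-style class label (illustrative)
    "alternation": "benefactive",
    "sf": "V_NP_NP",                      # e.g. "She bought him a book."
    "year_group": "Year 7-8",
    "comlex_freq": to_band(0.12, (0.05, 0.20)),    # TL reference frequency
    "textbook_freq": to_band(0.30, (0.05, 0.20)),  # TL input frequency
    "l1_freq": to_band(0.02, (0.05, 0.20)),        # Japanese corpus frequency
    "learner_error": False,
}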
6. Results

6.1 The results of log-linear analysis for individual verbs

Using log-linear analysis, I tested various models using combinations of the six factors in Table 4. The results of the log-linear analysis of each individual verb revealed quite an interesting picture of the relationship between learner errors and the chosen
Table 4. Factors investigated in the study
– L2 learners' developmental factor (Factor 1): 3 levels: Year 7–8/ Year 9–10/ Year 11–12
– Subcategorization matching between L1 and L2 (Factor 2): 2 levels: Matched/ Unmatched
– Subcategorization frequencies of each SF pattern in COMLEX (Factor 3): 3 levels: High/ Mid/ Low
– Subcategorization frequencies of each SF pattern in L1 Japanese Corpus (Factor 4): 3 levels: High/ Mid/ Low
– Subcategorization frequencies of each SF pattern in Textbook Corpus (Factor 5): 3 levels: High/ Mid/ Low
– L2 learner errors (Factor 6): 2 levels: Error/ Non-error
factors. Here let me summarise the results by putting all the best-fitting models together in a table (see Table 5) and examining which factor exerts the most influence on learner performance across the ten verbs. In order to analyse the interactions, graphical interpretations of higher-dimensional log-linear models are sometimes used (e.g., McEnery 1995; Kennedy 1992). However, as I am dealing with six-dimensional models here, attempting to interpret them using graphical models would be extremely complicated. Also, my primary aim is not to interpret individual cases but to capture the overall picture of how factors are related across different verbs. Consequently I will not interpret the models visually, but simply provide an outline of the main results.
6.1.1 Distinctive effects of the school year

Table 5 shows that the school year factor (YEAR) has a very strong effect across all of the verbs. For five out of the ten verbs (buy, get, go, make, and think), the main effect of YEAR was observed. The YEAR effect also has two-way interactions with the factor of text frequency (TEXTFRQ) for four verbs (bring, like, take, want) and with the learner error/non-error factor (LERR) for the verb get. This shows that the number of years of schooling influences the way L2 learners use the verbs. It involves both the use/misuse and the overuse/underuse of verbs.
Table 5. Summary of log-linear analysis
[Table 5 gives, for each of the ten verbs (bring, buy, eat, get, go, like, make, take, think, want), the effects retained in its best-fitting log-linear model, arranged in columns for Factor 1 (YEAR), Factor 2 (SUBMATCH), Factor 3 (COMLEX), Factor 4 (L1FRQ), Factor 5 (TEXTFRQ) and Factor 6 (LERR).]
Note: The numbers correspond to the factors described in Table 4. A single underlined number (e.g. 1) is used for the main effect, two (e.g. 51) for the two-way interaction effect, and three (e.g. 642, 532) for the three-way effects.
6.1.2 Strong effects of the SF frequencies in the textbook corpus

We can also see from the summary table that there are strong two-way effects between YEAR and TEXTFRQ. Note that there is only one case (652 for the verb like) of interaction of the textbook frequency factor (Factor 5) with the learner error factor (Factor 6). This implies that SF frequencies in the textbooks mainly affect the overuse/underuse of the verbs, not the use/misuse.
6.1.3 SF similarities and frequencies in L1 and TL

Factors such as the degree of similarity in SF patterns between English and Japanese (SUBMATCH: Factor 2), the frequency in the COMLEX lexicon (Factor 3), and the frequency of SF patterns in L1 Japanese (L1FRQ: Factor 4) appear many times with the learner error factor (LERR: Factor 6). These factors are different from the school year and textbook frequency factors, as they represent more inherent linguistic features of the verbs and L1 effects. None of these effects is very strong on its own, however: none of them survived backward deletion as a one-way or two-way effect. It seems that only the interactions of these factors affect learners' use/misuse of the verbs.
6.2 The effects of verb classes and alternation types

In order to analyse the relationship between verb classes/alternation types and the results of the above log-linear analysis, I used correspondence analysis (for more details, see Tono 2002). Instead of looking at each verb, I labelled each verb with its verb semantic classes and alternation types. I then gave scores to each factor according to the significance of its effects as shown in Table 5; for instance, if a certain factor has a main (one-way) effect, which is the strongest, I gave it 10 points; if it has a two-way interaction, I gave 5 points to each of the factors involved. Only 1 point was given for each of the factors involved in three-way effects. In this way, I quantified each of the effects in the best model for each verb in Table 5 and used correspondence analysis to examine the relationship between the six factors and verb classes and alternation types. Figure 2 shows the results of the re-classification of the effects found by log-linear analysis for each verb according to verb alternation types. Correspondence analysis plots the variables based on the total chi-square values (i.e., inertia): the more the variables cluster together, the stronger the relationship. Dimension 1 explains 71% of the inertia, so we should mainly consider Dimension 1 as the primary source of interpretation. The figure shows clearly that there are three major groups of effects: the factor of SF patterns in the textbook corpus (TEXTFRQ) in the left corner, three effects (SF frequencies in the L1 corpus, the degree of matching between English and Japanese SFs, and the SF frequencies in COMLEX) in the centre, and the learner error effect and the school year effect toward the right side. As was discussed above, the school year represents the developmental aspect of verb learning, while the three factors in the middle represent linguistic features of each verb, and the textbook frequency represents L2 input effects. There is a tendency for verbs involving benefactive alternations (buy, get, make, and take), sum of money alternations (buy, get, and make), and there insertions (go) to cluster around the school year factor and the error factor.
Figure 2. Correspondence analysis (alternations x effects)
Thus these verb alternation classes seem to be sensitive to the developmental factor of acquisition. Dative (bring, make, take, think, and want), locative (take, go) and as alternations (make, take and think) cluster around inherent linguistic factors such as the degree of SF matching and SF frequencies in L1 and TL. The verbs involving resultative alternations (bring and take) cluster around the textbook SF frequencies factor. Post-attributive and blame alternations are both features of the verbs like and want. These two alternation types also cluster together close to the textbook frequency effect. These are the verbs showing a strong relationship with L2 input effects. There is only one alternation type that did not cluster with any other groups: ingestion (eat). The verb eat was very frequent in the learner data and was thus included in the analysis. However, it turned out that there were neither very many errors nor many varieties of alternations for this verb. The results for eat thus look very different from those for the other nine verbs.
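The quantification step described at the beginning of this section (10 points for a main effect, 5 points for each factor involved in a two-way interaction, 1 point for each factor in a three-way effect) can be sketched as follows; the factor numbering follows Table 4, and the example model is invented for illustration.

POINTS = {1: 10, 2: 5, 3: 1}  # points per factor, by order of the effect

def score_model(best_model_terms, n_factors=6):
    """Turn the effects of one verb's best-fitting model into factor scores.

    Each term is a tuple of factor numbers (e.g. (5, 1) for the two-way
    YEAR x TEXTFRQ interaction); the returned dict supplies one row of the
    matrix submitted to correspondence analysis.
    """
    scores = {factor: 0 for factor in range(1, n_factors + 1)}
    for term in best_model_terms:
        for factor in term:
            scores[factor] += POINTS[len(term)]
    return scores

# Invented example: a main effect of YEAR plus a YEAR x TEXTFRQ interaction.
print(score_model([(1,), (5, 1)]))  # {1: 15, 2: 0, 3: 0, 4: 0, 5: 5, 6: 0}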
7. Implications and conclusions

In this paper, I have discussed some initial findings concerning the developmental effect of schooling, L1 effects, L2 input effects and L2 internal effects
(i.e., verb classes and alternations) on the overall use of a small number of very frequent verbs. I hope to have given an idea of the potential of a multiple comparison approach using IL, L1 and TL corpora for the study of classroom SLA. This study shows that it is valuable to compile corpora which represent the different types of texts L2 learners are exposed to or produce, and to compare them in different ways to identify the relative strength of the factors involved in classroom SLA. In particular, the method of comparing interlanguage corpora assembled according to developmental stage, together with the subjects' L1 corpus and a TL textbook corpus, seems quite promising for identifying the complex nature of interlanguage development in L2 classroom settings. As regards L2 acquisition of verb SF patterns, the results show that the learners' correct use of verb SF patterns seemed to have little to do with the time spent on learning. Learners used more often those verbs which they encountered more often in the textbooks, which is rather unsurprising. What is surprising is that there was no significant relationship between learners' correct use of those verbs and the frequency of those verbs in the textbooks. In other words, learners continue to make errors related to the SF patterns of certain verbs even though the frequencies of those patterns are relatively high in the textbooks. The study also reveals that the misuse of those verb patterns is mainly caused by factors which are inherent in L2 verb meanings and their similarities to and differences from their L1 counterparts. There is a tendency for certain alternation types to be more closely related to certain effects: for instance, benefactive alternations are linked more strongly to the developmental factor, while dative and locative alternations are more closely related to L1 effects. Given that most SLA studies so far have only provided very fragmented pictures of different alternation types, it is beyond the scope of this study to determine the reason for such associations. To date, no SLA research has been conducted to identify the relative difficulties of different verb classes and alternations; this study does so. However, the theoretical implications arising from this study are a moot point until further research in this area is undertaken. Future studies of SLA will also require a large and varied body of L2 learner corpora. As we work together with researchers in natural language processing (NLP), there is the possibility that we will be able to develop a computational model of L2 acquisition. Machine learning techniques will facilitate the testing of prototypical acquisition models and the collection of probabilistic information
on IL using corpora. Computational analyses of IL data will shed light on the process of IL development in a way we never thought possible. For this to happen, well-balanced, representative corpora of L2 learner output, along with appropriate TL and L1 corpora, are indispensable.
Notes

1. Here by alternation I mean "argument-structure" alternation such as in the dative alternation (e.g. John gave a book to Mary/John gave Mary a book), the causative/inchoative alternation (He opened the door/The door opened) among others.
2. http://www.fltr.ucl.ac.be/fltr/germ/etan/cecl/Cecl-Projects/Icle/locness1.htm (visited 10.5.2004)
3. IPAL is a machine-readable Japanese dictionary. For more details see http://www.ipa.go.jp/STC/NIHONGO/IPAL/ipal.html (visited 1.3.2004).
References

Bley-Vroman, R. and Yoshinaga, N. 1992. "Broad and narrow constraints on the English dative alternation: Some fundamental differences between native speakers and foreign language learners". University of Hawai'i Working Papers in ESL 11:157–199. University of Hawaii at Manoa.
Chomsky, N. 1986. Knowledge of Language: Its Nature, Origin and Use. New York: Praeger.
Elman, J.L., E. Bates, M. Johnson, A. Karmiloff-Smith, D. Parisi and K. Plunkett 1996. Rethinking Innateness: A Connectionist Perspective on Development. Cambridge, MA: A Bradford Book.
Gleitman, L. 1990. "The structural sources of verb meaning". Language Acquisition 1:3–55.
Goldberg, A. 1999. "The emergence of the semantics of argument structure constructions". In B. MacWhinney (ed.), 197–212.
Granger, S., Dagneaux, E., and Meunier, F. (eds). 2002. The International Corpus of Learner English. Handbook and CD-ROM. Version 1.1. Louvain-la-Neuve: Presses Universitaires de Louvain.
Grimshaw, J. 1994. "Lexical reconciliation". In The Acquisition of the Lexicon, L. Gleitman and B. Landau (eds), 411–430. Cambridge, MA: MIT Press.
Grishman, R., C. Macleod and A. Meyers 1994. "Comlex syntax: Building a computational lexicon". Proceedings of the 15th International Conference on Computational Linguistics (COLING 94), Kyoto, Japan, August 1994.
Hirakawa, M. 1995. "L2 acquisition of English unaccusative constructions". In Proceedings of the 19th Boston University Conference on Language Development 1, D. MacClaughlin and S. McEwen (eds), 291–302. Somerville, MA: Cascadilla Press.
Inagaki, S. 1997. "Japanese and Chinese learners' acquisition of the narrow-range rules for the dative alternation in English". Language Learning 47:637–669.
Juffs, A. 1996. Learnability and the Lexicon: Theories and Second Language Acquisition Research. Amsterdam: John Benjamins.
Juffs, A. 2000. "An overview of the second language acquisition of links between verb semantics and morpho-syntax". In Second Language Acquisition and Linguistic Theory, J. Archibald (ed.), 187–227. Oxford: Blackwell.
Kennedy, J. 1992. Analyzing Qualitative Data. Log-linear Analysis for Behavioural Research. New York: Praeger.
Langacker, R.W. 1987. Foundations of Cognitive Grammar. Vol. 1: Theoretical Prerequisites. Stanford, CA: Stanford University Press.
Langacker, R.W. 1991. Foundations of Cognitive Grammar. Vol. 2: Descriptive Application. Stanford, CA: Stanford University Press.
Levin, B. 1993. English Verb Classes and Alternations. Chicago: The University of Chicago Press.
Ljung, M. 1990. A Study of TEFL Vocabulary. [Stockholm Studies in English 78.] Stockholm: Almqvist & Wiksell.
Ljung, M. 1991. "Swedish TEFL meets reality". In English Computer Corpora, S. Johansson and A.-B. Stenström (eds), 245–256. Berlin: Mouton de Gruyter.
Macleod, C., A. Meyers and R. Grishman 1996. "The influence of tagging on the classification of lexical complements". Proceedings of the 16th International Conference on Computational Linguistics (COLING 96). University of Copenhagen.
MacWhinney, B. (ed.) 1999. The Emergence of Language. Mahwah, NJ: Lawrence Erlbaum Associates.
McEnery, T. 1995. Computational pragmatics: Probability, deeming and uncertain references. Unpublished PhD thesis. Lancaster University.
McEnery, T. and Kifle, N. 1998. "Non-native speaker and native speaker argumentative compositions – A corpus-based study". In Proceedings of First International Symposium on Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching, S. Granger and J. Hung (eds). Chinese University of Hong Kong.
Matsumoto, Y., A. Kitauchi, T. Yamashita, Y. Hirano, H. Matsuda, K. Takaoka and M. Asahara 2000. Japanese Morphological Analysis System ChaSen version 2.2.1. Online manual (http://chasen.aist-nara.ac.jp/chasen/doc/chasen-2.2.1.pdf).
Mindt, D. 1997. "Corpora and the teaching of English in Germany". In Teaching and Language Corpora, G. Knowles, T. McEnery, S. Fligelstone and A. Wichmann (eds), 40–50. London: Longman.
Montrul, S.A. 1998. "The L2 acquisition of dative experiencer subjects". Second Language Research 14 (1):27–61.
Oshita, H. 1997. "The unaccusative trap": L2 acquisition of English intransitive verbs. Unpublished PhD thesis. University of Southern California.
Pinker, S. 1984. Language Learnability and Language Development. Cambridge, MA: Harvard University Press.
Pinker, S. 1987. "The bootstrapping problem in language acquisition". In Mechanisms of Language Acquisition, B. MacWhinney (ed.), 399–441. Hillsdale, NJ: Erlbaum.
Pinker, S. 1989. Learnability and Cognition: The Acquisition of Argument Structure. Cambridge, MA: MIT Press.
Rohlen, T.P. 1983. Japan's High School. Berkeley: University of California Press.
Sawyer, M. 1996. "L1 and L2 sensitivity to semantic constraints on argument structure". In Proceedings of the 20th Annual Boston University Conference on Language Development 2, A. Stringfellow, D. Cahana-Amitay, E. Hughes and A. Zukowski (eds), 646–657. Somerville, MA: Cascadilla Press.
Sekine, S. 1998. Corpus based parsing and sublanguage studies. Unpublished PhD thesis. New York University.
Slobin, D. 1997. The Crosslinguistic Study of Language Acquisition. Vol. 5: Expanding the Contexts. Mahwah, NJ: Lawrence Erlbaum.
Tomasello, M. 1992. First Verbs: A Case Study of Early Grammatical Development. Cambridge: Cambridge University Press.
Tono, Y. 2002. The role of learner corpora in SLA research and foreign language teaching: The multiple comparison approach. Unpublished PhD thesis. Lancaster University.
Toth, P.D. 1997. Linguistic and pedagogical perspectives on acquiring second language morpho-syntax: A look at Spanish se. Unpublished PhD thesis. University of Pittsburgh.
Ungerer, F. and H.J. Schmid 1996. An Introduction to Cognitive Linguistics. Harlow, Essex: Addison Wesley Longman.
White, L. 1987. "Markedness and second language acquisition: The question of transfer". Studies in Second Language Acquisition 9:261–286.
White, L. 1991. "Argument structure in second language acquisition". Journal of French Language Studies 1:189–207.
Yip, V. 1994. "Grammatical consciousness-raising and learnability". In Perspectives on Pedagogical Grammar, T. Odlin (ed.), 123–138. Cambridge: Cambridge University Press.
Zobl, H. 1989. "Canonical typological structures and ergativity in English L2 acquisition". In Linguistic Perspectives on Second Language Acquisition, S. Gass and J. Schachter (eds), 203–221. Cambridge: Cambridge University Press.
New wine in old skins?
A corpus investigation of L1 syntactic transfer in learner language

Lars Borin* and Klas Prütz**
* Natural Language Processing Section, Department of Swedish, Göteborg University, Sweden
** Centre for Language and Communication Research, Cardiff University, UK
This article reports on the findings of an investigation of the syntax of Swedish university students’ written English as it appears in a learner corpus. We compare part-of-speech (POS) tag sequences (being a rough approximation of surface syntactic structure) in three text corpora: (1) the Uppsala Student English corpus (USE); (2) the written part of the British National Corpus Sampler (BNCS); (3) the Stockholm Umeå Corpus of written Swedish (SUC). In distinction to most other studies of learner corpora, where only the target language (L2) as produced by native speakers has been compared to the learners’ interlanguage (IL), we add a comparison with the learners’ native language (L1) as produced by native speakers. Thus, we investigate differences in the frequencies of POS n-grams between the BNCS (representing native L2) on the one hand, and the USE (representing IL) and SUC (representing native L1) corpora on the other hand, the hypothesis being that significant common differences would reflect L1 interference in the IL, in the form of underuse or overuse of L2 constructions. This makes our study not only one of learner language, or IL in general, but of specific L1 interference in IL. We compare the results of our study to methodologically similar learner corpus research by Aarts and Granger, as well as to our own earlier investigation of English translated from Swedish.
1. Introduction

An important strand of inquiry in second language acquisition (SLA) research is that devoted to the investigation of language learners' successive approximations of the target language, referred to as interlanguage (IL) in the SLA literature. Similarly to the practice in other kinds of linguistic investigation, SLA researchers are concerned with empirical description of various kinds of interlanguage, with discovering correlations between traits in interlanguage and features of the language learning situation, with explaining those correlations, and finally with the practical application of the knowledge thus acquired to language pedagogy. The features of language learning situations which have at one time or another been claimed to influence the shape and development of IL are the following (based on Ellis 1985: 16f):
1. Situational factors (explicit instruction or not; foreign vs. second language, etc.)
2. Linguistic input
3. Learner differences, including learner's L1
4. Learner processes
In this paper, we will be concerned mainly with factor (3), and more specifically with the influence of the learner's L1 on her IL. The phenomenon whereby features of the learner's native language are "borrowed" into her version of the target language – the IL – is referred to as transfer in the SLA literature. Transfer could in principle speed up language learning, if L1 and L2 are similar in many respects, but the kind of transfer which understandably has been most investigated is that where the learner transfers traits which are not part of the L2 system (negative transfer or interference). Interference and other features of IL have long been studied by so-called error analysis (EA), where language learners' erroneous linguistic output is collected. Traditional EA suffers from a number of limitations:
– Limitation 1: EA is based on heterogeneous learner data;
– Limitation 2: EA categories are fuzzy;
– Limitation 3: EA cannot cater for phenomena such as avoidance;
– Limitation 4: EA is restricted to what the learner cannot do;
– Limitation 5: EA gives a static picture of L2 learning. (Dagneaux et al. 1998: 164)
The use of learner corpora is often seen as one possible way to avoid the worst limitations of traditional EA.
1.1 Studying interlanguage with learner corpora

Learner corpora are a fairly new arrival on the corpus linguistic scene, but have quickly become one of the most important resources for studying interlanguage. Like other corpora, a learner corpus is "a finite-sized body of machine-readable text, sampled in order to be maximally representative of the language variety under consideration" (McEnery and Wilson 2001:32). A learner corpus is a collection of texts – written texts or transcribed spoken language – produced by language learners, and sampled so as to be representative of one or more combinations of situational and learner factors. This addresses the first limitation of EA mentioned in the preceding section; by design, learner corpus data is homogeneous. The whole gamut of corpus linguistics methods and tools is applicable to learner corpora, too. Available for immediate application are such tools as concordancers and word (form) listing, sorting and searching utilities, as well as statistical processing on the word form level. Even with these fairly simple tools a lot can be accomplished, especially with "morphologically naïve" languages like English. For deeper linguistic analysis, learner corpora can be lemmatized, annotated for part-of-speech (POS) – or POS-tagged – and/or parsed to various degrees of complexity. Learner corpora can also be annotated for the errors found in them, which raises the intricate question of how errors are to be classified and corrected (Dagneaux et al. 1998). Utilizing methods from parallel corpus linguistics (Borin 2002a; Kilgarriff 2001),1 learner corpora can be compared to each other or to corpora of texts produced by native speakers of the learners' target language (L2) or their native language(s) (L1). Figure 1 illustrates some of the possibilities in this area.

Figure 1. Learner corpora and SLA research.

In Figure 1, case (i) [the double dotted line] is the 'classical' mode of learner corpus use (and of traditional error analysis) – interlanguage analysis (IA).2 Here, the interlanguage (IL), represented by the learner corpus, is compared to a representative native-speaker L2 corpus. Case (ii) [the dotted triangle] is an extension of (i), where different kinds of IL are contrasted to each other and to the L2 (called CIA – contrastive interlanguage analysis – by Granger 1996). The different ILs could be produced by learners with different native languages (as in most investigations based on ICLE; see Granger 1998 and 4.1 below) or by learners with different degrees of proficiency, or, finally, by the same learners at different times during their language learning process, i.e. a longitudinal comparison (Hammarberg 1999), which goes some way towards dealing with limitation 5 of EA (see above). Case (iii) [faint double dashed line] represents a methodological tool which at times has been important in SLA research, but not very much pursued in the context of learner corpora, namely contrastive analysis (CA), where native-speaker L1 and L2 are compared in order to find potential sources of interference.3 Cases (i), (ii) and (iii) are quite general, and are meant to cover investigations on all linguistic levels. For pragmatic reasons, most such investigations have confined themselves to the level of lexis and such syntactic phenomena as are easily investigated through lexis. However, there is an increasing amount of work on (automatically) POS-tagged learner corpora (e.g., Aarts and Granger 1998; see 4.1 below), and even some investigations of parsed learner corpora (see Meunier 1998; Staerner 2001). The present paper addresses case (iv), the double solid lines, which to the best of our knowledge has not been investigated earlier using learner corpora.4 In the future, we hope to be able to also look into case (v) [the double+single solid lines], the extension of case (iv) to more than one kind of IL.
2. Investigating syntactic interference in learner language

In distinction to most other studies of learner language corpora, where the IL has been compared only to native L2 production, in our own investigation we
add a comparison with the learners' L1. Arguably, this makes our study not only one of interlanguage in general, but of specific L1 interference as evidenced in IL, which is relevant, e.g., for the development of intelligent CALL applications, incorporating natural language processing components – our particular area of expertise – e.g. learner language grammars and learner models. We investigated differences in the frequencies of POS sequences (or POS n-grams) between a corpus of native English on the one hand, and two corpora – one of Swedish advanced learner English and one of native Swedish – on the other. The hypothesis is that significant common differences would reflect L1 interference in the IL on the syntactic level, since the POS sequences arguably serve as a rough approximation of surface syntactic structure, at least in the case of languages where syntactic relations are largely signalled by constituent order (both English and Swedish are such languages). The differences found were of two kinds, reflecting overuse or underuse of particular POS sequences, common to Swedish advanced learner English and Swedish, as compared to native English. In what follows, we will refer to those IL traits that we focus on in our investigation as "IL+L1".5
2.1 The corpora and tagsets

For our investigation, we used the following three sets of corpus materials.
1. The learner corpus, the Uppsala Student English corpus (USE; Axelsson 2000; Axelsson and Berglund 2002), contains about 400,000 tokens (about 350,000 words);
2. The native English corpus was made up of the written language portion of the British National Corpus Sampler (BNCS; Burnard 1999), containing about 1.2 million tokens (roughly 1 million words);
3. The native Swedish corpus, the Stockholm Umeå Corpus (SUC; Ejerhed and Källgren 1997), contains roughly 1.2 million tokens (about 1 million words).
The BNCS and SUC corpora come in POS-tagged, manually corrected versions, which we have used without modification. The USE corpus was tagged by us with a Brill tagger trained on the BNC Sampler, giving an estimated accuracy of 96.7%. For the purposes of this investigation, both tagsets were reduced, the English set to 30 tags (from 148) and the Swedish to 37 tags (from 156). The reduced tagsets are listed and compared in the Appendix. The tagsets
were reduced for two reasons: first, earlier work has indicated that training and tagging with a large tagset, and then reducing it, not only improves tagging performance, but also gives better results than training and tagging only with the reduced set. Prütz’s (2002) experiment with a Swedish Brill tagger and the same full and reduced tagsets as those used here gave an increased accuracy across the board of about two percentage points from tagging with the large tagset and then reducing it, compared to tagging with the full set. Tagging directly with the reduced set resulted in lower accuracy, by a half to one percentage point, depending on the lexicon used. Second, coarse-grained tagsets are more easily comparable than fine-grained ones even for such closely related languages as Swedish and English (Borin 2000, 2002b).
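In practical terms, such a reduction is a many-to-one mapping applied to the tagger output. A minimal sketch follows; the handful of mappings shown is merely illustrative of the full 148-to-30 English mapping listed in the Appendix, not a reproduction of it.

# Illustrative many-to-one mapping from fine-grained to reduced tags;
# the real mappings (148->30 for English, 156->37 for Swedish) are larger.
REDUCE_EN = {
    "NN1": "NN", "NN2": "NN",            # singular/plural common noun -> NN
    "VV0": "V", "VVZ": "V", "VVD": "V",  # finite lexical verb forms -> V
    "VVN": "K2",                         # past participle -> K2
    "RR": "R", "RG": "R",                # adverbs -> R
}

def reduce_tags(tagged_tokens, mapping):
    """Map (word, fine_tag) pairs onto the reduced tagset,
    leaving unknown tags unchanged."""
    return [(word, mapping.get(tag, tag)) for word, tag in tagged_tokens]

sample = [("recently", "RR"), ("discovered", "VVD"), ("changes", "NN2")]
print(reduce_tags(sample, REDUCE_EN))
# [('recently', 'R'), ('discovered', 'V'), ('changes', 'NN')]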
2.2 Experiment setup

In Figure 2, the setup of the experiment is shown in overview. We used a similar procedure to that of our earlier investigation of translationese (Borin and Prütz 2001):6
1. First, we extracted all POS n-gram types (for n = 1 ... 4) and their frequencies from the three POS-tagged corpora;
2. From the n-gram lists we removed certain sequences, namely (a) those containing the tag NC (proper noun; we hypothesize that a higher or lower relative incidence of proper nouns is not a distinguishing trait in learner language), (b) those with punctuation tags except for those containing exactly one full-stop tag, in the first or the last position,7 and (c) those not appearing in all three corpora, either by necessity (because of differences between the English and Swedish tagsets) or by chance;
3. For each n-gram length, the incidence of the n-gram types in BNCS (representing native English) and USE (representing learner English) was compared, using the Mann-Whitney (or U) statistic (see Kilgarriff 2001 for a description and justification of the test for this kind of investigation), and instances of significant (p ≤ 0.05, two-tailed) differences (overuse and underuse) were collected ("n-gram ∆ analysis" in Figure 2);
4. BNCS and SUC (representing the learners' native language, i.e. Swedish) were compared in exactly the same way;
5. Finally, the n-gram types which showed significant overuse or significant underuse in both comparisons were extracted, symbolized by the "&" (logical AND) process in Figure 2.
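Steps 1, 3 and 5 of this procedure can be condensed into the following sketch. It is a simplification: per-document raw counts stand in for the frequency comparison, the filtering of step 2 is omitted, and the function and variable names are our own, invented for illustration.

from collections import Counter
from scipy.stats import mannwhitneyu

def pos_ngrams(tag_sequence, n):
    """All POS n-grams of length n from one document's tag sequence."""
    return [tuple(tag_sequence[i:i + n]) for i in range(len(tag_sequence) - n + 1)]

def per_doc_counts(corpus, n):
    """corpus: list of documents, each a list of POS tags.
    Returns one Counter of n-gram frequencies per document."""
    return [Counter(pos_ngrams(doc, n)) for doc in corpus]

def significant_ngrams(corpus_a, corpus_b, n, alpha=0.05):
    """N-gram types whose per-document frequencies differ significantly
    (two-tailed Mann-Whitney U test) between the two corpora."""
    counts_a = per_doc_counts(corpus_a, n)
    counts_b = per_doc_counts(corpus_b, n)
    shared = set().union(*counts_a) & set().union(*counts_b)
    hits = set()
    for gram in shared:
        a = [c[gram] for c in counts_a]
        b = [c[gram] for c in counts_b]
        if mannwhitneyu(a, b, alternative="two-sided").pvalue <= alpha:
            hits.add(gram)
    return hits

# Step 5: keep only the differences common to both comparisons
# (bncs, use and suc are placeholders for the three tagged corpora).
# il_l1_candidates = significant_ngrams(bncs, use, 3) & significant_ngrams(bncs, suc, 3)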
Figure 2. Experiment setup
3. Results by the numbers

In this section, we give a general overview of our results, but defer discussion to section 4, where we compare our findings with those of other similar investigations. In Table 1, you will find the numbers, i.e. how many of each n-gram type occurred in each corpus. We give both the actual and the theoretically expected figures. For unigrams, the expected figure is the cardinality of the tagset, of course, while the figure for the other n-grams is the actually occurring number of unigrams in the corpus in question raised to the corresponding power; thus, 29³ (29 cubed) is the expected number of trigrams in the USE corpus. This simply illustrates the well-known fact that language has syntax, and is not in general freely combinatorial. The longer the sequence, the smaller the fraction becomes that is actually used of all possible combinations. This is what makes it possible to let POS n-grams stand in for real syntactic analyses. In Table 2, underuse and overuse are shown, found by the experimental procedure described in the previous section. The percentage figures shown in the table are calculated by dividing the underuse/overuse figures by the POS n-gram figures for the USE corpus, i.e., the percentage of significantly different (underused and overused) trigrams is calculated as (42+155)/6526 (≈ 0.03019, i.e. 3.0%). An interesting fact reflected by the figures in Table 2 is that there turned out to be more instances of overuse than of underuse for all n-gram lengths.
Table 1. Actually occurring and expected n-gram types in the corpora

corpus:      USE occurring (expected)    BNCS occurring (expected)    SUC occurring (expected)
unigrams     29 (30)                     30 (30)                      34 (37)
bigrams      663 (841)                   807 (900)                    1035 (1156)
trigrams     6526 (24389)                10800 (27000)                13616 (39304)
4-grams      31761 (707281)              60645 (810000)               72770 (1336336)
Table 2. Underuse and overuse per n-gram length

             underuse      overuse       underuse + overuse
unigrams     1 (3.4%)      3 (10.3%)     = 13.7%
bigrams      11 (1.6%)     36 (5.4%)     = 7.0%
trigrams     42 (0.6%)     155 (2.4%)    = 3.0%
4-grams      91 (0.3%)     171 (0.5%)    = 0.8%
In section 3.1, we discuss some representative cases of each n-gram type.
3.1 Distinctive IL+L1 n-grams

3.1.1 Unigrams

Among the unigrams, there was one instance of underuse, "K2" (past participle), while there were three overused parts-of-speech: "V" (finite verb), "R" (adverb), and "C" (conjunction). Possibly, this indicates a less complex sentence-level syntax in the IL+L1 than in native English, with more finite clauses joined by conjunctions, rather than non-finite subordinate clauses.8 The adverbs could be a sign of a more lively, narrative style, and may possibly have nothing at all to do with the fact that these particular narratives happen to be in interlanguage (but see section 4.2).

3.1.2 Bigrams

Just as adverbs by themselves are overused in the USE IL+L1, so are a number of bigrams containing adverbs, e.g. "R C" (adverb–conjunction), "R R"
(adverb–adverb), “R NN” (adverb–common noun), “R V” (adverb–finite verb), “. R” (sentence initial adverb). Sentence initial common nouns (“. NN”) are also overused, perhaps strengthening the impression that sentence syntax is simpler in IL+L1 than in native L2. By way of illustration, we show some examples of the bigram “R R” from the USE corpus (the full tagset is used in this and in the other examples which follow below): (1) I/PPIS1 also/RR recantly/RR descovered/VVN that/CST my/APPGE spelling/NN1 was/VBDZ rather/RG poor/JJ so_that/CS is/VBZ someting/PN1 I/PPIS1 have/VH0 to/TO work/VVI on/RP ./YSTP (2) He’s/NP1 far/RR away/RP ./YSTP (3) So/RG naturally/RR ,/YCOM they/PPHS2 were/VBDR shocked/JJ to/TO find/VVI complete/JJ wilderness/NN1 and/CC a/AT1 nature/NN1 so/RR unlike/II the/AT English/NN1 ./YSTP
Additionally, examples 4–6 in section 3.1.3 below also contain “R R”. All the most consistently underused bigrams have in common the POS tag “K2” (past participle): “K2 I” (past participle–preposition), “K2 R” (past participle–adverb), “NN K2” (common noun–past participle), “V K2” (finite verb–past participle). We give some examples of the “K2 R” bigram in section 3.1.4 below (examples 13–18), from which we see that the adverb is usually the second component (the verb particle) of a phrasal (or particle) verb. Hence, the IL+L1 shows an underuse of either periphrastic tenses or non-finite clauses, or both, with phrasal verbs.9
3.1.3 Trigrams

Many of the overused trigrams contain adverbs: ". R R" (sentence-initial adverb–adverb; example 3), "R R NN" (adverb–adverb–common noun; examples 4–6). Other examples of overused trigrams are "VI I NN" (infinite verb–preposition–common noun; examples 10–12), "V I NN" (finite verb–preposition–common noun).
(4) When/CS I/PPIS1 write/VV0 ,/YCOM I/PPIS1 can/VM spend/VVI as/RG much/RR time/NNT1 as/CSA I/PPIS1 want/VV0 to/TO make/VVI changes/NN2 and/CC corrections/NN2 ./YSTP
(5) They/PPHS2 are/VBR trying/VVG to/TO imitate/VVI their/APPGE action/NN1 heroes/NN2 and/CC not/XX very/RG seldom/RR accidents/NN2 occur/VV0 ./YSTP
(6) That_is/REX however/RR far_from/RG reality/NN1 ./YSTP
Among the underused trigrams we find many which contain adjectives: "A A NN" (adjective–adjective–common noun), "A NN K1" (adjective–common noun–present participle), "A NN K2" (adjective–common noun–past participle), "A NN NN" (adjective–common noun–common noun). Past participles appear among underused trigrams as well. Thus, we find "NN K2 R" (common noun–past participle–adverb) in addition to the already mentioned "A NN K2".
3.1.4 4-grams

Among overused 4-grams, there are a number involving conjunctions and prepositions, e.g.: ". C NN V" (sentence-initial conjunction–common noun–finite verb; examples 7–9), "C NN R V" (conjunction–common noun–adverb–finite verb), "VI I NN ." (sentence-final infinite verb–preposition–common noun; examples 10–12), "V I NN ." (sentence-final finite verb–preposition–common noun).
(7) When/CS people/NN grew/VVD old/JJ they/PPHS2 were/VBDR depending_on/II their/APPGE relatives'/JJ goodness/NN1 ./YSTP
(8) When/CS children/NN2 reach/VV0 a/AT1 certain/JJ age/NN1 ,/YCOM they/PPHS2 tend/VV0 to/TO find/VVI these/DD2 violent/JJ films/NN2 very/RG cool/JJ and/CC exciting/JJ ./YSTP
(9) Because/CS fact/NN1 is/VBZ that/CST New/JJ Lanark/NP1 was/VBDZ a/AT1 success/NN1 ,/YCOM a/AT1 large/JJ one/PN1 ./YSTP
(10) I/PPIS1 have/VH0 always/RR found/VVN it/PPH1 amusing/JJ to/TO write/VVI in/II English/NN1 ./YSTP
(11) We/PPIS2 need/VV0 to/TO teach/VVI them/PPHO2 how/RRQ to/TO defend/VVI themselves/PPX2 in/II today's/NN2 society/NN1 and/CC to/TO turn/VVI away_from/II violence/NN1 ./YSTP
(12) Another/DD1 great/JJ fear/NN1 was/VBDZ that/CST wilderness/NN1 would/VM force/VVI civilised/JJ men/NN2 to/TO act/VVI like/II savages/NN2 ./YSTP
In the set of underused 4-grams, there are quite a few containing past participles, e.g.: “K2 R I A” (past participle–adverb–preposition–adjective), “K2 R I NN” (past participle–adverb–preposition–common noun; examples 13–15), “K2 R I P” (past participle–adverb–preposition–pronoun; examples 16–18), “NN V K2 R” (common noun–finite verb–past participle–adverb). (13) Why/RRQ does/VDZ anyone/PN1 want/VVI to/TO see/VVI a/AT1 man/NN1 get/VV0 his/APPGE head/NN1 chopped/VVN off/RP on/II television/NN1 ?/YQUE (14) Tom/NP1 is/VBZ blown/VVN up/RP with/IW dynamite/NN1 but/CCB is/VBZ still/RR alive/JJ ./YSTP (15) You/PPY can/VM be/VBI swept/VVN away/RP with/IW money/NN1 ,/YCOM towards/II materialistic/JJ values/NN2 ,/YCOM without/IW even/RR realizing/VVG it/PPH1 ./YSTP (16) It/PPH1 is/VBZ essential/JJ to/II all/DB infant/NN1 mammals/NN2 to/TO be/VBI taken/VVN care/NN1 of/IO ,/YCOM and/CC to/TO be/VBI brought/VVN up/RP by/II someone/PN1 who/PNQS knows/VVZ the/AT difficulties/NN2 of/IO life/NN1 ./YSTP (17) However/RR ,/YCOM the/AT Chief’s/NN2 images/NN2 of/IO machines/NN2 are/VBR not/XX only/RR similes/VVZ ,/YCOM he/PPHS1 also/RR suffers/VVZ delusions/NN2 which/DDQ make/VV0 him/PPHO1 think/VVI that/CST there/EX are/VBR actual/JJ machines/NN2 installed/VVN everywhere/RL around/II him/PPHO1 ,/YCOM controlling/VVG him/PPHO1 ./YSTP (18) I/PPIS1 know/VV0 that/CST woman/NN1 is/VBZ naturally/RR and/CC necessarily/RR weak/JJ in_comparison_with/II man/NN1 ;/YSCOL and/CC that/CST her/APPGE lot/NN1 has/VHZ been/VBN appointed/VVN thus/RR by/II Him/PPHO1 who/PNQS alone/JJ knows/VVZ what/DDQ is/VBZ best/JJT for/IF us/PPIO2 ./YSTP
4. Comparisons with similar previous work

In this section we compare our results in more detail to other relevant work. The only similar investigation of learner language that we know of is that made by Aarts and Granger (1998). Their work is methodologically similar to our approach, and therefore a fairly detailed comparison of our findings with theirs seems warranted. Section 4.1 is devoted to such a comparison. Further, it seems reasonable to assume that there should be common traits in translated language (translationese; Gellerstam 1985, 1996) and (advanced) learner language, and in section 4.2, we compare our results here to those obtained in our earlier investigation of translationese.
4.1 Aarts and Granger 1998

Aarts and Granger (1998; henceforth A&G) compared POS trigram frequencies in three learner corpora, the Dutch, Finnish and French components of ICLE, with comparable material produced by native speakers of English, i.e. the LOCNESS (LOuvain Corpus of Native English eSSays) corpus. Their investigation was thus an instance of corpus-based CIA (see above), and did not involve the native languages of the learners, other than indirectly, through the comparison between the three corpora. A&G produced POS trigram frequency lists from all four corpus materials (each about 150,000 words in length). As in our investigation, they worked with a reduced version of the tagset they used for tagging the corpora (the TOSCA-ICE tagset with 270 tags, which were reduced to 19). They then investigate their trigram lists in a number of ways:
1. They calculate significant differences (underuse and overuse in relation to LOCNESS) in the rank orderings of the lists, using the χ² test;
2. They investigate the differences common to the three ICLE components in relation to LOCNESS (the "cross-linguistic invariants"; about 7% of the trigrams),
3. and differences unique to one learner variety ("L1-specific patterns"; about 20–25% of the trigrams, depending on the L1), where only the French variety is discussed in any detail by A&G (see above).
We now proceed to a more detailed comparison between the findings of A&G and our own results (B&P in what follows). We should keep some things in
mind, though. First of all, A&G actually make a different investigation. They investigate over- and underuse of POS trigrams in a learner corpus, compared to a native speaker corpus. Our investigation started out in the same way, but additionally, we remove all POS n-grams which do not differ in the same way between the native L2 corpus and a corpus of native L1, i.e. the native language of the learners. Thus, the POS n-grams that remain in our case should exclude A&G's "cross-linguistic invariants", if indeed their "L1-specific patterns" reflect transfer from the learners' native language. A&G use a smaller tagset (which reflects a partly different linguistic classification) than we do. Also, we have used a different statistical test for significance testing. These circumstances conspire to make comparisons between our investigations not entirely straightforward, and could easily account for the differences in the numbers that the two investigations arrive at (we come nowhere near the at least 20% L1-specific trigrams found by A&G; see Table 2, above). If our respective studies really investigate the same thing, we would make the following two predictions.
1. There could – but need not – be partial overlap between the "L1-specific patterns" A&G found and those that we have uncovered. The overlap should in that case be larger, the closer the L1 in question is to Swedish, i.e. A&G's Dutch ICLE material should show most overlap with our results. Unfortunately, A&G present concrete results only for French L1-specific patterns, which show practically no overlap with our patterns, as expected;
2. We would also predict that those POS trigrams that A&G found to be over- or underused in all the three subcorpora they investigated – the "cross-linguistic invariants" – should not appear in our material. By and large, this prediction holds, i.e. most of the patterns that A&G find to be significantly different in the same way in all three L1-specific subcorpora are indeed not present in our set of significantly differently distributed POS n-grams. The only possible exceptions to this are the sentence-initial patterns shown in Table 3, where the picture is not as clear as in other cases. Although there are so few n-grams that no firm conclusions can be drawn from them, it still seems that there is a difference between those patterns where A&G found overuse and the ones that are underused according to their results. A&G tags should be fairly self-explanatory (except perhaps "#", sentence break), and B&P tags are explained in the Appendix. Differences are noted using "+" (overuse), "–" (underuse), and "h" (no significant difference).
Table 3. Comparison with language-invariant sentence-initial patterns found by A&G (based on Section 4.2, Aarts and Granger 1998: 137)

A&G POS sequence    = B&P POS sequence    A&G    B&P
Overused by A&G:
# CONNEC            . C                   +      +
# ADV               . R                   +      +
# PRON              . P                   +      h
Underused by A&G:
# N                 . NN                  –      +
# CONJ N            . C NN                –      h
# PREP Ving         . I K1                –      h

Differences are noted using "+" (overuse), "–" (underuse), and "h" (no significant difference).
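As an illustration of the kind of POS trigram comparison discussed in this section, the following sketch builds trigram frequency lists for two tagged corpora and flags over- and underused trigrams. This is not the exact A&G or B&P pipeline: the input format (one reduced POS tag per token) and the use of a log-likelihood threshold as the significance test are assumptions made purely for the example.

```python
# Illustrative sketch (not the exact A&G or B&P pipeline): build POS trigram
# frequency lists for two tagged corpora and flag over-/underused trigrams.
# The input format (a flat list of reduced POS tags) and the log-likelihood
# threshold are assumptions made for this example only.
import math
from collections import Counter

def trigrams(tags):
    """Return the sequence of POS trigrams for a list of tags."""
    return [tuple(tags[i:i + 3]) for i in range(len(tags) - 2)]

def log_likelihood(o1, n1, o2, n2):
    """Two-corpus log-likelihood (G2) for one item's observed frequencies."""
    e1 = n1 * (o1 + o2) / (n1 + n2)
    e2 = n2 * (o1 + o2) / (n1 + n2)
    return 2 * sum(o * math.log(o / e) for o, e in ((o1, e1), (o2, e2)) if o)

def compare(learner_tags, native_tags, threshold=6.63):  # ~p<0.01, 1 d.f.
    lc, nc = Counter(trigrams(learner_tags)), Counter(trigrams(native_tags))
    ln, nn = sum(lc.values()), sum(nc.values())
    report = {}
    for tri in set(lc) | set(nc):
        o1, o2 = lc[tri], nc[tri]
        if log_likelihood(o1, ln, o2, nn) >= threshold:
            report[tri] = "overuse" if o1 / ln > o2 / nn else "underuse"
    return report

# Usage: in practice the tag lists would come from full POS-tagged corpora,
# mapped to the reduced tagsets; these toy inputs only show the interface.
learner = ". R P V T NN . C P V R".split()
native = ". P V T A NN . P V I T NN".split()
print(compare(learner, native))
```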
4.2 Borin and Prütz 2001

Intuitively, translated language (translationese; see above) and IL ought to have features in common: "Both are situated somewhere between L1 and L2 and are likely to contain examples of transfer" (Granger 1996: 48). Thus, it is of value to compare the results of the present investigation to an earlier, similar investigation of translationese (Borin and Prütz 2001), in which we looked at news text translated from Swedish into English, using an almost identical experimental procedure to the one presented here. The differences were as follows.

1. Different corpora were used, of course: (a) the English translation and (b) the Swedish original versions of a Swedish news periodical for immigrants, and the "press, reportage" parts of the (c) FLOB and (d) Frown English corpora;
2. In addition to the 1- to 4-grams investigated in IL+L1, we also investigated 5-grams in our translationese study;
3. The initial selection of distinct n-grams was different, being based on an absolute difference in rank in the corpora rather than on a statistical test. The same kinds of n-grams as in the present investigation were then removed from consideration (i.e., those containing proper nouns and certain kinds of punctuation, and those not occurring in all the compared corpora; see above);
4. The statistical test was applied only to the results of the initial selection, resulting in the removal of a number of n-grams. However, we do not know whether the initial selection excluded some n-grams which would have been singled out as significantly different by the statistical test.

If we take as our hypothesis that there should be a fair amount of overlap between the two sets of distinct n-grams, or perhaps even that the n-grams found to be characteristic of translationese should be a subset of those characteristic of learner language, we have to admit that the hypothesis was soundly falsified. What we found was that there was a considerably larger number of significant differences characteristic of learner language than of translationese (506 2- to 4-grams in IL+L1 vs. 41 in translationese), except in the case of unigrams, where IL+L1 had 4, against 6 in translationese. On the other hand, there is almost no overlap – let alone inclusion – between the two sets of n-grams. There are two shared bigrams (". R" and "C VI", both overused), one shared trigram (". I P", overused), and no shared unigrams or 4-grams (see note 10). The one similarity that we did find was a somewhat similar situation with regard to overuse and underuse. There are more overused than underused bigrams and trigrams in both IL+L1 and translationese, while they differ with respect to 4-grams, where translationese displayed more underuse than overuse.

In conclusion: while our results perhaps do not invalidate the intuition that IL and translationese "are situated somewhere between L1 and L2 and are likely to contain examples of transfer" (see above), it certainly seems that they are situated in quite different locations in the region between L1 and L2 (but see the next section). More research is clearly needed here.
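The overlap and inclusion checks described above amount to simple set operations over the two inventories of significantly different n-grams. The sketch below illustrates this; the inventories shown are toy placeholders (only the shared items ". R", "C VI" and ". I P" come from the text), not the actual sets from either study.

```python
# Minimal sketch of the overlap/inclusion check described above. Only the three
# shared n-grams are taken from the text; the remaining entries are invented
# placeholders standing in for the full inventories of each study.
il_l1 = {(".", "R"), ("C", "VI"), (".", "I", "P"), ("T", "NN", "V"), ("P", "V")}
translationese = {(".", "R"), ("C", "VI"), (".", "I", "P"), ("R", "V", "P")}

print("shared n-grams:", il_l1 & translationese)
print("translationese a subset of IL+L1?", translationese <= il_l1)
```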
5. Discussion and conclusion

In this section, we would like to discuss some general issues which bear on the interpretation of our results and on the comparisons we have made of these results with the findings of other similar investigations:

1. Representativeness of the English "standard". We have used (the written part of) BNCS as the L2 standard. Perhaps we should instead have used a native students' essay corpus such as LOCNESS (like Aarts and Granger 1998), or perhaps even a corpus of spoken English, acknowledging the fact that the written English of Swedish learners is held to be influenced by colloquial spoken English (see Hägglund 2001);
2. Representativeness of the Swedish "standard". In the same way, we could question whether SUC really faithfully represents the learners' "point of departure", the form of Swedish most likely to influence their IL English. Perhaps here, too, a corpus of spoken Swedish would serve better (see Allwood 1999), or possibly a corpus of Swedish student compositions;
3. What do the "L1-specific" trigrams found by Aarts and Granger (1998) reflect? Our hypothesis – which informed the way we set up our experiment, described in Section 2 above – was that they represent transfer, i.e., that underuse and overuse of an n-gram type in IL reflect a relatively lower and higher incidence, respectively, of the same n-gram type in the L1. Only if this hypothesis holds are our results comparable with those of Aarts and Granger. If underuse or overuse in IL is due to something else, then obviously we cannot compare our results. For instance, underuse in the IL could be due to avoidance of an L1 structure, in which case it should be correlated with a higher incidence in the L1 or with no significant difference;
4. There is an estimated tagging error rate of slightly more than 3% in the USE corpus (see Section 2.1). If the errors made by the tagger are not random, there will be a bias in the results of our investigation;
5. POS tag sequences are of course not syntactic units; they merely give better clues to syntax than word-level investigations are able to provide. The picture we get of learner (and native speaker) language syntax is therefore likely to be distorted, and it needs careful interpretation to be usable.
In conclusion, we believe that our investigation confirms the observation made by Aarts and Granger (1998) and Borin and Prütz (2001) that a contrastive investigation of POS-tagged corpora can yield valuable linguistic insights about the differences (and similarities) among the investigated language varieties. At the same time, much remains to be done regarding matters of methodology; among other things, the issues mentioned above need to be addressed. In the future, we would like to look into the issue of L1 and L2 corpus representativeness.
We would also like to extend and refine our investigation of L1 interference in learner language syntax in various ways, notably by the use of robust parsing (Abney 1996), which would enable us to look at syntax directly, to investigate, for example, which syntactic constituents and functions are most indicative of learner language.
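To give a very rough sense of what moving from tag sequences towards constituents could look like, the sketch below groups runs of reduced POS tags into noun-phrase-like chunks. This is only an illustration, not the robust parsing of Abney (1996); the tag pattern is an assumption based on the reduced English tagset in the Appendix (T = determiner, A = adjective, NN = noun, NC = proper noun).

```python
# Rough illustration only: a regex-style chunker over reduced POS tags that
# groups determiner/adjective/noun runs into NP-like chunks. The pattern is an
# assumption based on the Appendix tagset; it is not Abney-style robust parsing.
import re

def np_chunks(tags):
    """Return (start, end) token spans of NP-like tag runs."""
    s = " ".join(tags)
    spans = []
    for m in re.finditer(r"(?:T )?(?:A )*(?:NN|NC)(?: (?:NN|NC))*", s):
        start = s[:m.start()].count(" ")
        length = m.group().count(" ") + 1
        spans.append((start, start + length))
    return spans

tags = "T A NN V I T NN .".split()   # e.g. "the old man sat on the bench ."
print(np_chunks(tags))               # -> [(0, 3), (5, 7)]
```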
Acknowledgements

We would like to thank the volume editors for their careful reading of and comments on (the previous version of) this article. The research presented here was funded by the following sources: an Uppsala University, Faculty of Languages, reservfonden grant; Vinnova, through the CrossCheck project; and the Knut and Alice Wallenberg Foundation, through the Digital Resources in the Humanities project, part of the Wallenberg Global Learning Network initiative.
Notes

1. We use the term "parallel corpus linguistics" to subsume both work with parallel corpora – i.e., original texts in one language and their translations into another language or other languages – and work with comparable corpora, i.e., original texts in two or more languages which are similar as to genre, topic, style, etc. At least in the language technology-oriented research tradition, there are interesting commonalities between the two kinds of work (see Borin 2002a), e.g. in the use of distributional regularities for automatically discovering translation equivalents in both kinds of corpora. Work such as that presented here, dealing with comparisons among learner IL corpora and original L1 and L2 corpora, is most similar to work on comparable corpora, of course.

2. But using a learner corpus and (computational) corpus linguistics tools, we can do much more than in traditional EA. Perhaps the major advantage is that we can investigate patterns of deviant usage – i.e., instances of overuse and underuse – rather than just instances of clear errors. Even in the latter case, we can generalize over the normal linguistic contexts (on many linguistic levels, to boot) of particular errors fairly easily using corpus linguistics tools, something which in general was not feasible in traditional EA. This takes care of limitations 3 and 4 of EA mentioned above.

3. In corpus linguistics – at least if we are talking about the more interesting case, namely the development of automatic methods for making linguistically relevant comparisons between texts – the closest thing to CA is the work on parallel and comparable corpora aimed mainly at extracting translation equivalents for machine translation or cross-language information retrieval systems (see, e.g., Borin 2002a).
These methods, although at present used almost exclusively for language technology purposes, could in principle be used for a more traditionally linguistically-oriented "contrastive corpus linguistics" as well, as has been argued elsewhere (e.g. in Borin 2001; cf. Granger 1996), complementing the largely manual modes of investigation used in present-day corpus-based contrastive linguistic research.

4. At least not in the way that we propose to do it. Although it shares some traits with Granger's (1996: 46ff) proposed "integrated CA/CIA contrastive model [which] involves constant to-ing and fro-ing between CA and CIA", we believe that our method provides for a tighter coupling between all the involved language varieties; there is no difference (indeed, there should be no difference) between CA and IA with our way of doing things.

5. Note that our method of investigation is by design unsuited for finding errors, since we count as instances of overuse only such items as actually appear in the native L2 corpus, i.e., if a construction appears in the L1 and IL corpora but not in the L2 corpus, it is not counted as an instance of overuse, even though the difference in itself may be statistically significant. Concretely, this is achieved by taking the L2 corpus – i.e., the British National Corpus Sampler in our case – as the basis for all comparisons; see further 2.2.

6. There were some small differences, which we will return to below, when we compare the results of the two investigations.

7. The motivation for this is possibly less well-founded than in the case of proper nouns, but let us simply say that we wish to limit ourselves, at least for the time being, to looking at clause-internal syntax as imperfectly mirrored in the POS tag sequences found in a text. Of course, at the same time we eliminate, e.g., commas functioning as coordinating conjunctions, i.e., clause-internally. We also do not wish to claim that rules of orthography, such as the use of punctuation, cannot be subject to interference. We are simply more interested in syntax more narrowly construed. The reason for keeping leading and trailing full stops is that a full stop is an unambiguous sentence (and clause) boundary marker, thus permitting us to look at POS distribution at sentence (and some clause) boundaries.

8. English has more possibilities for non-finite clausal subordination than Swedish, which may be relevant here. It seemed that the results of our earlier translationese investigation reflected this circumstance (Borin and Prütz 2001: 36). Granger (1997) finds a similar underuse of non-finite subordinate clauses in non-native written academic English as compared to that of native writers.

9. Here, it would be good to compare our results with Hägglund's (2001) lexical investigation of phrasal verbs in the Swedish component of ICLE, compared to LOCNESS. For the time being, however, this will have to remain a matter for future investigation.

10. Although it is an intriguing fact that our translationese study found significantly more adverbs in Swedish than in all the English materials, and that the English translated from Swedish had more – but not significantly more – than either of the other two sets of English materials (see Section 3.1.1).
References

Aarts, J. and Granger, S. 1998. "Tag sequences in learner corpora: A key to interlanguage grammar and discourse". In Learner English on Computer, S. Granger (ed.), 132–141. London: Longman.
Abney, S. 1996. "Part-of-speech tagging and partial parsing". In Corpus-Based Methods in Language and Speech, K. Church, S. Young and G. Bloothooft (eds). Dordrecht: Kluwer.
Allwood, J. 1999. "The Swedish spoken language corpus at Göteborg University". In Fonetik 99: Proceedings from the 12th Swedish Phonetics Conference. [Gothenburg papers in theoretical linguistics 81]. Department of Linguistics, Göteborg University.
Axelsson, M. W. 2000. "USE – the Uppsala Student English Corpus: An instrument for needs analysis". ICAME Journal 24: 155–157.
Axelsson, M. W. and Berglund, Y. 2002. "The Uppsala Student English Corpus (USE): A multi-faceted resource for research and course development". In Parallel Corpora, Parallel Worlds, L. Borin (ed.), 79–90. Amsterdam: Rodopi.
Borin, L. 2000. "Something borrowed, something blue: Rule-based combination of POS taggers". Second International Conference on Language Resources and Evaluation. Proceedings, Volume I, 21–26. Athens: ELRA.
Borin, L. 2001. "Att undersöka språkmöten med datorn". In Språkets gränser och gränslöshet. Då tankar, tal och traditioner möts. Humanistdagarna vid Uppsala universitet 2001, A. Saxena (ed.), 45–56. Uppsala: Uppsala University.
Borin, L. 2002a. "… and never the twain shall meet?". In Parallel Corpora, Parallel Worlds, L. Borin (ed.), 1–43. Amsterdam: Rodopi.
Borin, L. 2002b. "Alignment and tagging". In Parallel Corpora, Parallel Worlds, L. Borin (ed.), 207–218. Amsterdam: Rodopi.
Borin, L. and Prütz, K. 2001. "Through a glass darkly: Part of speech distribution in original and translated text". In Computational Linguistics in the Netherlands 2000, W. Daelemans, K. Sima'an, J. Veenstra and J. Zavrel (eds), 30–44. Amsterdam: Rodopi.
Burnard, L. (ed.). 1999. "Users reference guide for the BNC sampler". Published for the British National Corpus Consortium by the Humanities Computing Unit at Oxford University Computing Services, February 1999. [Available on the BNC Sampler CD].
Dagneaux, E., Denness, S. and Granger, S. 1998. "Computer-aided error analysis". System 26: 163–174.
Ejerhed, E. and Källgren, G. 1997. "Stockholm Umeå Corpus (SUC) version 1.0". Department of Linguistics, Umeå University.
Ellis, R. 1985. Understanding Second Language Acquisition. Oxford: Oxford University Press.
Gellerstam, M. 1985. "Translationese in Swedish novels translated from English". Translation Studies in Scandinavia. Proceedings from the Scandinavian Symposium on Translation Theory (SSOTT) II, Lund 14–15 June, 1985, L. Wollin and H. Lindquist (eds), 88–95. Lund: Lund University Press.
Gellerstam, M. 1996. "Translations as a source for cross-linguistic studies". In Languages in Contrast. Papers from a Symposium on Text-Based Cross-Linguistic Studies. Lund 4–5 March 1994, K. Aijmer, B. Altenberg and M. Johansson (eds), 53–62. Lund: Lund University Press.
Granger, S. 1996. "From CA to CIA and back: An integrated approach to computerized bilingual and learner corpora". In Languages in Contrast. Papers from a Symposium on Text-Based Cross-Linguistic Studies. Lund 4–5 March 1994, K. Aijmer, B. Altenberg and M. Johansson (eds), 37–51. Lund: Lund University Press.
Granger, S. (ed.). 1998. Learner English on Computer. London: Longman.
Hägglund, M. 2001. "Do Swedish advanced learners use spoken language when they write in English?". Moderna språk 95 (1): 2–8.
Hammarberg, B. 1999. "Manual of the ASU Corpus – A longitudinal text corpus of adult learner Swedish with a corresponding part from native Swedes". Stockholm University, Department of Linguistics.
Kilgarriff, A. 2001. "Comparing corpora". International Journal of Corpus Linguistics 6 (1): 1–37.
McEnery, T. and Wilson, A. 2001. Corpus Linguistics. 2nd edition. Edinburgh: Edinburgh University Press.
Meunier, F. 1998. "Computer tools for the analysis of learner corpora". In Learner English on Computer, S. Granger (ed.), 19–37. London: Longman.
Prütz, K. 2002. "Part-of-speech tagging for Swedish". In Parallel Corpora, Parallel Worlds, L. Borin (ed.), 201–206. Amsterdam: Rodopi.
Staerner, A. 2001. Datorstödd språkgranskning som ett stöd för andraspråksinlärning [Computerized language checking as support for second language learning]. MA Thesis in Computational Linguistics, Department of Linguistics, Uppsala University. Online: http://stp.ling.uu.se/~matsd/thesis/arch/2001–007.pdf (visited: 16.04.2004).
Appendix. Reduced Swedish and English tagsets

Table A1. Reduced Swedish (SV-R) and English (EN-R) tagsets

SV-R     EN-R     description                                      examples
–        –        dash                                             –
!        !        exclamation mark                                 !
"        "        quotes                                           "
(        (        left bracket                                     (
)        )        right bracket                                    )
,        ,        comma                                            ,
.        .        full-stop                                        .
(none)   ...      ellipsis                                         ...
:        :        colon                                            :
;        ;        semicolon                                        ;
?        ?        question mark                                    ?
(none)   $        genitive clitic                                  's
A        A        adjective                                        röd, red
C        C        conjunction                                      och, that
E        E        infinitive mark                                  att, to
F        (none)   numeric expression                               16
G        (none)   abbreviation                                     d.v.s.
I        I        preposition                                      på, on
K1       K1       present participle                               seende, eating
K2       K2       past participle                                  sedd, eaten
L        (none)   compound part                                    hög
M        M        numeral                                          två, two
NC       NC       proper noun                                      Eva, Evelyn
NC$      (none)   proper noun, genitive                            Åsas
NN       NN       noun                                             häst, goat
NN$      (none)   noun, genitive                                   tjuvs
O        O        interjection                                     bu, um
P        P        pronoun                                          vi, we
P$       P$       pronoun, poss. or gen.                           vår, our
Q        (none)   pronoun, relative                                som
R        R        adverb                                           fort, fast
S        S        symbol or letter                                 G
T        T        determiner                                       en, the
V        V        verb, finite                                     såg, ate
VI       VI       verb, infinitive                                 se, eat
VK       (none)   verb, subjunctive                                såge
VS       (none)   verb, supine                                     sett
X        X        unknown or foreign word (tagged at all only in SUC)
Demonstratives as anaphora markers in advanced learners' English

Agnieszka Leńko-Szymańska
University of Łódź, Poland
The aim of this study is to confirm teachers’ informal observations and to identify the specific patterns of misuse of the demonstratives as anaphora markers in Polish advanced learners’ English. The misuse is treated here in terms of underuse or overuse of the particular categories of the demonstrative anaphors in students’ essays: the proximal versus the distal demonstratives and the demonstrative determiners versus the demonstrative pronouns. The specific questions addressed in this study are: (1) do Polish learners of English at higher and lower proficiency levels show different patterns of use of demonstrative anaphors? and (2) to what extent do these patterns differ from native speaker use? The data was drawn from two corpora: the PELCRA corpus of learner English and the BNC Sampler. Three stages of analysis were performed on the data. First, the frequencies of occurrence of the demonstratives in the three samples were compared. Next, the proportions of proximal and distal demonstratives were analysed across the samples. Lastly, the proportions of determiner and pronoun uses for the distal plural demonstrative those were assessed. The log likelihood chi-square and the regular chi-square tests were performed to estimate the statistical significance of the results. The results showed that Polish advanced learners of English overuse demonstratives in argumentative writing and this overuse is particularly robust with distal demonstratives. Moreover, learners show a preference for the selection of distal (as opposed to proximal) demonstratives when compared with the native norm. They also show statistically significant overuse of those as a determiner and underuse of those as a pronoun (results for other demonstratives not available). Finally, the patterns of learners’ misuse do not change significantly with years of exposure and learning. Thus, the results indicate that native-like use of the demonstratives is not acquired implicitly by Polish learners. The finding has important pedagogical implications, since this feature of language use has not been addressed explicitly in syllabi and ELT materials so far.
1. Introduction

In my experience of reading various types of argumentative and academic essays written by Polish advanced learners of English, it has come to my attention that students (and, to be frank, occasionally myself) have problems in using demonstratives. When sharing this intuitive finding with colleagues I learned that they had made a similar observation. The identified problems rarely involve explicit errors, but are frequently related to non-native patterns of use. Two areas of difficulty are the frequency of occurrence and the choice between proximal (this and these) and distal (that and those) demonstratives. The fragment of a student's essay in (1) below illustrates the type of dilemma Polish learners of English encounter in their writing.

(1) The fact is that there are as many approaches to achieving a success as there are people aiming at it. The same goes to what they perceive to be a success. For ____1____ with a superiority complex, ____2____ will be ruling a kingdom.
1.  a) these   b) those
2.  a) this    b) that    c) it
Demonstratives in English are classified as belonging to two different part-of-speech categories: they can be determiners, when they premodify the head of a noun phrase, or pronouns, when they themselves function as the head of a noun phrase. Their two major areas of use are situational and time reference (deixis) and anaphoric reference (Quirk et al. 1985, Biber et al. 1999). Teachers' intuitions indicate that the deictic function of the demonstratives seems to be handled by Polish learners fairly well; moreover, deixis rarely surfaces in argumentative and academic writing. The type of use that is believed to be troublesome for Polish students is anaphoric reference, where the choice of the proximal or distal demonstrative does not relate to physical or temporal distance. Teachers' observations concerning problems in the use of the demonstratives in Polish advanced learners' writing do not go beyond awareness of the problem, and say very little about the exact nature of the difficulty. An aim of this study is to confirm these intuitions and to identify the specific patterns of misuse of the demonstratives as anaphora markers. The misuse will be treated here in terms of underuse or overuse of the particular categories of the demonstrative anaphors in students' essays: the proximal versus the distal
demonstratives and the demonstrative determiners versus the demonstrative pronouns, rather than in terms of the number of errors. The choice of this methodology is motivated by the fact that in the majority of contexts the selection of a proximal/distal demonstrative is not determined (as it is in gap 1 in (1) above) and depends solely on the writer's intended meaning (cf. gap 2 in (1) above). Thus, learners' problems with the demonstrative anaphora rarely involve errors and are rather connected with unnatural tendencies. Before investigating the problem, it can be worthwhile to explore how usage of the demonstratives is presented and explained to learners. A survey of ELT materials has revealed that this grammatical point is never taught explicitly. The coursebooks most widely used in Poland contain notes on the usage of demonstratives only in their deictic function, as a rule in the first units at the elementary level, and never return to this problem at more advanced stages. Even in books designed for students preparing for the Cambridge exams there are no sections devoted to the use of the demonstratives for anaphoric reference. Nor do ELT grammars offer much help in this area. For example, Swan's Practical English Usage (1980) illustrates how singular and plural demonstratives can be used anaphorically and lists many examples of their use, but does not explain the difference between the use of the proximals and the distals. Descriptive grammars of English (cf. Quirk et al. 1985, Biber et al. 1999) also concentrate on the singular/plural distinction and, in the little space devoted to the proximal/distal dichotomy, they present conflicting information. One reason for the lack of adequate explanations of the usage of the proximal and distal demonstratives for anaphoric reference may be that this depends mainly on subtleties of meaning which are very difficult to pinpoint in terms of rules:

The conditions which govern the selection of this and that with reference to events immediately preceding and immediately following the utterance, or the part of the utterance in which this or that occur, are quite complex. They include a number of subjective factors (such as the speaker's dissociation of himself from the event he is referring to), which are intuitively relatable to the deictic notion of proximity/non-proximity, but are difficult to specify precisely. (Lyons, 1977:668)
The selection of a demonstrative (as opposed to, for example, a pronoun or the definite article) and the choice between the proximal and the distal anaphoric markers are mainly considered dependent on the writer's/speaker's perception and intuition. In the process of learning English as a foreign language, students
are very much left to their own devices to acquire these. Thus, a second aim of this study is to investigate whether such acquisition really takes place. If it does, the patterns of use of the demonstrative anaphors displayed by learners at higher proficiency levels should be closer to (if not identical with) native speaker patterns than in the case of learners at lower proficiency levels.
1.1 Research questions

The questions addressed in this study can be summarised in the following way:

1. Do Polish learners of English at higher and lower proficiency levels show different patterns of use of demonstrative anaphors?
2. To what extent do these patterns differ from native speaker use?
2. Study

2.1 Data

The data used in the study was drawn from two corpora: the PELCRA corpus of learner English (compiled at the University of Łódź), which is a collection of essays written by Polish university learners at different proficiency levels, and the British National Corpus Sampler. Three samples from these corpora were analysed:

– 105 essays (57,431 tokens) written by second-year students of English (Comp2)
– 69 essays (48,414 tokens) written by fourth-year students of English (Comp4)
– 23 texts (313,347 tokens) from the domain of World Affairs of the BNC Sampler (BNCS-WA)
The PELCRA corpus consists of essays written for the end-of-year exams by students at the Institute of English Studies, University of Łódź. The data is available at four proficiency levels, from Year I to Year IV. In order to ensure the robustness of the proficiency effect (if it exists) in learners' use of the demonstrative anaphors, the decision was made to select essays at the extreme ends of the proficiency scale. The first-year compositions could not be used because they represented a different genre of writing from the other three groups of essays, and as such could be richer in deictic uses of demonstratives.
A standard reference corpus, the BNC Sampler, was selected as a benchmark for comparison. Such a choice may have its drawbacks as the observed differences may not be a result of native/non-native use but rather the effect of discrepancies in authors’ age or experience in writing. While such a possibility has to be borne in mind when interpreting results, it has been proven elsewhere (Leńko-Szymańska 2003) that comparing databanks of equivalent native and non-native students’ essays is also not free of this problem, and since the target standard of writing for non-native students is native professional rather than native apprentice production, the BNC Sampler seems a suitable base for comparison. ‘World Affairs’ was chosen among other BNC Sampler written domains because the topics and genres covered in this domain compare best with the learners’ essays. Such topics and genres include reports and discussions of current events taken from British dailies and excerpts from books on topics ranging from geography to European integration.
2.2 Tools and procedures

The frequencies of occurrence of the four demonstratives in the samples were calculated using the Wordsmith Tools package (Scott 1999). In the case of this, these and those, raw texts were used. However, for the calculation of the occurrence of that, the learner corpus was first tagged with CLAWS (a part-of-speech tagger developed at Lancaster University), which tags that as a singular demonstrative or a complementizer. Since CLAWS does not handle the task with complete accuracy, the results for the three samples were verified manually. Finally, the concordance lines for those were further sorted into two groups: the lines containing those as a determiner and those containing those as a pronoun. Since the sorting was performed manually, it generated unexpected and interesting observations concerning the post-modification patterns of the pronoun those, which were also quantified.

Three stages of analysis were performed on the data. First, the frequencies of occurrence of the demonstratives in the three samples were compared in order to identify patterns of overuse or underuse in the learners' essays. Next, the proportions of proximal and distal demonstratives were analysed across the samples with the aim of diagnosing learners' potential preferences for one or the other category. Lastly, the proportions of determiner and pronoun uses for the distal plural demonstrative those were assessed in order to explore further the patterns of use of this anaphora marker. The log likelihood chi-square and
the regular chi-square tests were performed to estimate the statistical significance of the results.
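As a minimal sketch of the kind of frequency comparison reported in the next section, the snippet below counts candidate demonstratives in a raw text and computes a two-corpus log-likelihood value. The exact log-likelihood formula used in the study is not spelled out here, so a standard G2 calculation is assumed; with the Table 1 figures it yields a value close to the 31.44 reported in Table 2 for the Comp2 vs. BNCS-WA comparison.

```python
# Minimal sketch only. The exact formula used in the chapter is not given, so a
# standard two-corpus log-likelihood (G2) is assumed; with the Table 1 figures
# it yields roughly the 31.44 reported in Table 2 for Comp2 vs. BNCS-WA.
import math
import re

DEMONSTRATIVES = ("this", "these", "that", "those")

def count_demonstratives(text):
    """Count candidate demonstrative forms in a raw text. Note that 'that' is
    not disambiguated here; the study used CLAWS tagging plus manual checks."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(tokens.count(d) for d in DEMONSTRATIVES), len(tokens)

def log_likelihood(o1, n1, o2, n2):
    """Two-corpus log-likelihood (G2) for one item's observed frequencies."""
    e1 = n1 * (o1 + o2) / (n1 + n2)
    e2 = n2 * (o1 + o2) / (n1 + n2)
    return 2 * sum(o * math.log(o / e) for o, e in ((o1, e1), (o2, e2)) if o)

# Observed figures from Table 1: Comp2 (529 / 57,431) vs. BNCS-WA (2182 / 313,347).
print(round(log_likelihood(529, 57431, 2182, 313347), 2))   # ~31.4
```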
2.3 Results

The first step in the analysis involved a comparison of the overall frequencies of demonstratives in the three samples. Table 1 presents the observed frequencies and Table 2 contains the results of the log-likelihood chi-square tests assessing the differences between the samples. The tests show that both groups of learners overuse demonstratives in comparison to native speakers. There is no statistical difference between the groups of learners, indicating that overuse does not significantly diminish with years of exposure and learning. The frequencies of occurrence of individual demonstratives are presented in Table 3, and Figure 1 displays a graphic representation of the results.
Table 1. Overall frequencies of demonstratives

                  Comp2      Comp4      BNCS-WA
tokens            57,431     48,414     313,347
demonstratives    529        488        2182
Table 2. Results of the log-likelihood chi-square tests comparing the three samples

                   % (first sample)   % (second sample)   LL       p
Comp2 / BNCS-WA    0.92               0.70                31.44    p<0.01*
Comp4 / BNCS-WA    1.01               0.70                50.37    p<0.01*
Comp2 / Comp4      0.92               1.01                2.06     p>0.05
Table 3. Frequencies of occurrence of the four demonstratives

          Comp2           Comp4           BNCS-WA
tokens    57,431          48,414          313,347
this      248 (0.43%)     220 (0.45%)     1185 (0.38%)
that      138 (0.24%)     94 (0.19%)      362 (0.12%)
these     64 (0.11%)      64 (0.13%)      396 (0.13%)
those     79 (0.14%)      110 (0.23%)     239 (0.08%)
Figure 1. Frequencies of occurrence of the four demonstratives
The results reveal interesting differences in the overuse of the four demonstratives. Tables 4, 5 and 6 show a fairly complex pattern of misuse of demonstratives in learner writing, but some characteristic traits can be singled out. The most robust differences are connected with the use of distal demonstratives. Both the lower- and the higher-proficiency learners overuse that and those in comparison with the native speaker norm.
Table 4. Results of the log-likelihood chi-square test comparing the lower-proficiency learner sample with the native speaker sample

          Comp2 (%)    BNCS-WA (%)    LL       p
this      0.43         0.38           3.51     p>0.05
that      0.24         0.12           47.45    p<0.01*
these     0.11         0.13           0.90     p>0.05
those     0.14         0.08           18.57    p<0.01*
Table 5. Results of the log-likelihood chi-square test comparing the higher-proficiency learner sample with the native speaker sample

          Comp4 (%)    BNCS-WA (%)    LL       p
this      0.45         0.38           6.00     p<0.05*
that      0.19         0.12           18.10    p<0.01*
these     0.13         0.12           0.11     p>0.05
those     0.23         0.08           76.15    p<0.01*
Table 6. Results of the log-likelihood chi-square test comparing the two learner samples

          Comp2 (%)    Comp4 (%)    LL       p
this      0.43         0.45         0.30     p>0.05
that      0.24         0.19         2.57     p>0.05
these     0.11         0.13         0.93     p>0.05
those     0.14         0.23         11.78    p<0.01*
The two groups of learners also make use of the singular proximal demonstrative this more frequently, but this overuse is not as strong and in the case of the lower-proficiency students it is not statistically significant. The only demonstrative whose pattern of use is close to the native one is these. The second finding, concerning the differences between the two learner samples, does not match the hypothesised outcome. There are no significant differences between the two groups of learners with respect to the use of three out of the four demonstratives: the singular demonstratives this and that are overused in both learner samples, and these is used with almost identical frequency by both groups of learners and by the native speakers. The most surprising result concerns the distal plural demonstrative those. The log-likelihood chi-square test confirms the significant difference between Comp2 and Comp4, but the direction of the frequency change runs counter to common-sense expectation. The learners' usage in fact moves away from that of natives with greater exposure and learning. Post-hoc analysis of the fourth-year essays did not point to any textual reasons for this phenomenon.

The next stage of the analysis involved comparing the proportions of proximal and distal demonstratives in the three samples in order to detect whether the learners' choices between the two categories of anaphora markers resemble the native speakers' selection. The frequencies of occurrence of the proximals (this and these) and the distals (that and those) in the three samples are presented in Table 7. The results of the chi-square test (χ2=61.30, p<0.01) point to a significant difference in the frequencies of the proximal and distal demonstratives in the three samples. Both groups of learners show a strong preference for the selection of the distal rather than the proximal anaphors in comparison with the native speakers. Although no further statistical (post-hoc) comparisons were performed, the data clearly indicates that there is no statistical difference between the two learner samples: both have almost identical proportions of proximals and distals.
Table 7. Frequencies of proximal vs. distal demonstratives

              Comp2             Comp4             BNCS-WA
total         529               488               2182
this/these    312 (58.98%)      284 (58.20%)      1581 (72.46%)
that/those    217 (41.02%)      204 (41.80%)      601 (27.54%)
Table 8. Frequencies of proximal vs. distal demonstratives grouped into singular and plural categories

              Comp2             Comp4             BNCS-WA
singular      386               314               1547
  this        248 (64.25%)      220 (70.06%)      1185 (76.60%)
  that        138 (35.75%)      94 (29.94%)       362 (23.40%)
plural        143               174               635
  these       64 (44.76%)       64 (36.78%)       396 (62.36%)
  those       79 (55.24%)       110 (63.22%)      239 (37.64%)
Table 9. Frequencies of those in modifying and pronominal functions

              Comp2            Comp4            BNCS-WA
those         79               110              239
determiner    39 (49.4%)       48 (43.6%)       70 (29.3%)
pronoun       40 (50.6%)       62 (56.4%)       169 (70.7%)
Since the overuse of the demonstratives is not evenly distributed, a more detailed analysis of differences in relative frequencies was performed. It aimed at checking whether learners' preferences for distal anaphoric markers were similar for singular and plural demonstratives. Such a comparison seemed worthwhile especially in view of the overuse of those. Table 8 contains the frequencies of the four demonstratives grouped into the singular/plural categories.
Separate chi-square tests were run for both categories (χ2sing = 26.53, p<0.01; χ2pl = 43.31, p<0.01), with the tests demonstrating significant differences in the distribution of proximal vs. distal anaphors among the three groups of texts. Further analysis indicated that both the singular and the plural distal anaphora markers were more often used by learners than by native speakers. But whereas in the case of the singular demonstratives preferences change with greater exposure and learning in the direction of the native preferences, the plural demonstratives show the opposite tendency. This result can be linked to the direction of change in the frequency of those discussed above.

The next area of exploration of the patterns of use of the demonstratives concerns the distribution of pronouns and determiners. Sorting the demonstratives into the two categories had to be done manually (CLAWS does not distinguish between pronominal and modifying functions in this class), so the analysis was restricted to the distal plural anaphor those, which had demonstrated the most surprising and interesting pattern of misuse. Observed frequencies are presented in Table 9. Results of a chi-square test (χ2=13.39, p<0.01) indicate a significant difference in the frequencies of determiners vs. pronouns in the three samples. This phenomenon is illustrated by Figures 2 and 3, which present concordance lines of those drawn randomly from two samples: Comp2 and BNCS-WA. Further analysis revealed that both samples of learner writing display similar distributions of the plural distal demonstrative determiner and the plural distal demonstrative pronoun, but that both of them overuse determiners and underuse pronouns in comparison with the native norm. However, a trend towards the native norm with greater experience and learning can be observed.

The last finding reported in this section relates to the type of post-modification following those in its pronominal function. This is a serendipitous finding which had not been explicitly addressed in the research questions and which resulted from the manual sorting of concordance lines. Both groups of learners differ from native speakers in their patterns of post-modification. This phenomenon can be observed in Figures 4 and 5, which present concordance lines of those as a pronoun drawn randomly from two samples: Comp4 and BNCS-WA. Frequencies of different types of post-modification of the pronoun those in the three samples are presented in Table 10. No statistical analysis was performed on the last group of results since some of the frequencies were below the minima for the chi-square test.
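The determiner/pronoun split described above was carried out manually. Purely as an illustration, and not as the procedure used in the study, a rough heuristic pre-sort of concordance lines could look at the word following those; the cue list below is an assumption and would misclassify many real lines, so manual verification would still be needed.

```python
# Illustration only: a rough heuristic pre-sort of "those" concordance lines.
# The cue list is an assumption, not the study's (manual) procedure.
import re

PRONOUN_CUES = {"who", "whose", "which", "that", "of", "in", "with", "from"}

def classify_those(line):
    """Guess 'determiner' or 'pronoun' for the first 'those' in a line."""
    m = re.search(r"\bthose\b\s+(\w+)", line.lower())
    if not m:
        return None
    return "pronoun" if m.group(1) in PRONOUN_CUES else "determiner"

print(classify_those("For those with a superiority complex"))   # -> pronoun
print(classify_those("those people who claim to be lucky"))     # -> determiner
```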
[Figure 2. Concordance lines of those drawn randomly from Comp2]
[Figure 3. Concordance lines of those drawn randomly from BNCS-WA]
[Figure 4. Concordance lines of the pronoun those drawn randomly from Comp4]
[Figure 5. Concordance lines of the pronoun those drawn randomly from BNCS-WA]
Table 10. Frequencies of different types of post-modification of the pronoun those

                  Comp2           Comp4           BNCS-WA
tokens            40              62              169
relative cl.      32 (80.0%)      49 (79.0%)      71 (42.0%)
participle cl.    1 (2.5%)        2 (3.2%)        44 (26.0%)
adjective         1 (2.5%)        1 (1.6%)        4 (2.4%)
prep. phrase      5 (12.5%)       5 (8.1%)        49 (29.0%)
no postmod.       1 (2.5%)        5 (8.1%)        1 (0.6%)
However, examination of the data points to some interesting differences between the three samples. Both groups of learners demonstrate almost identical preferences for relative clause post-modification, which they use almost twice as often as native speakers. On the other hand, they tend to avoid (again in similar proportions) post-modification using participle clauses and prepositional phrases (see the article by Borin and Prütz, this volume, which also points in the same direction).
2.4 Summary of results

The results of the three stages of data analysis are summarised below:

– Polish advanced learners of English overuse demonstratives in argumentative writing.
– Overuse is particularly robust with distal demonstratives.
– Learners show a preference for the selection of distal (as opposed to proximal) demonstratives when compared with the native norm.
– Learners show statistically significant overuse of those as a determiner and underuse of those as a pronoun (results for other demonstratives not available).
– Polish learners use different types of post-modification of the pronoun those from native speakers. They overuse relative clause post-modification and underuse participle clauses and prepositional phrases as pronoun post-modifiers (results for other demonstratives not available).
On the whole, the patterns of misuse do not change significantly with years of exposure and learning. While in some cases a positive trend could be observed, in others there is an opposite tendency.
2.5 Discussion

The results of the study confirm teachers' intuitions concerning the problems in the use of demonstratives as anaphora markers by Polish advanced learners of English. It should be noted that these problems seldom involve explicit errors, but are frequently related to the complex patterns of overuse and underuse evidenced by the figures above. Several factors may be responsible. Firstly, overuse of the demonstratives, especially demonstrative determiners, may reflect the lack of articles in Polish. Polish nouns are not marked for definiteness unless special contextual requirements necessitate such marking, in which case a proximal demonstrative is often used. Anaphoric reference is much more frequently rendered in Polish with the help of the demonstratives than it is in English, since the latter also offers a choice of the definite article. Thus, demonstrative determiners may often be employed by Polish learners of English as markers of definiteness, in place of the definite article the, as can be seen in the examples below, taken from Comp2.

(2) When we reached our destination we felt really free. The first thing me and my friend Ann wanted to know was if there were going to be any discos organised. Since we both were, and still are, crazy about music and dance, when we got the positive answer we couldn't help jumping from happiness. Immediately, we started all the possible preparations for this 'big day'.

(3) This is a very simple and efficient way of putting people into two categories: the ones who succeed and make money and the ones who do not. It is so easy to fall from one category into the other that we are obsesively afraid of it. This fear of being unsuccessful and condemned by the standards of our world makes us forget the most basic feelings and rules of humanity.

(4) With mum's death came a metamorphosis of my mind. Together with her life she took away my entire knowledge of the people and world that I had felt I knew all about. My friend became just a friend and my father only a parent. The victory of reason was heartbreaking but I still believed I would return to that unique world I was given to live in for a while.
However, learners’ preference for distal demonstratives rather than proximal ones (as compared with the native norm) cannot be explained by interlingual transfer. Distal demonstratives rarely function as anaphora markers in Polish.
The observed tendency may, however, be the result of applying a communication strategy. Learners have clearly developed an awareness that distal demonstratives, particularly that, are less marked than proximal ones:

[...] generally speaking, in English this is marked and that unmarked [...]: there are many syntactic positions in which that occurs in English and is neutral with respect to proximity and any other distinctions based on deixis. (Lyons, 1977:647)
Learners may become aware of the semantic and pragmatic characteristics of the demonstratives and consciously or unconsciously opt for the unmarked forms. The application of this communication strategy can be illustrated by the examples below, taken from Comp4:

(5) With hindsight there is nothing that could be done about it, but the conclusion was drown that wearing seatbelts would have saved Princess' life. Now, still, the question is, who is to blame for that awful accident and who should bear responsibility and incur the price of that mishap.

(6) The last group is particularly vulnerable to the adverse influence of the Internet. Those young people are themselves 'in the making', that is, their personalities and characters are only taking shape.
Learners' patterns of post-modification of the pronoun those are a serendipitous finding of this study. The observed tendencies may reflect more general trends in the patterns of post-modification used by Polish learners of English and deserve separate in-depth investigation. However, these observations constitute a helpful hint for teachers concerning the nature of non-native use of the demonstratives, and complement the outcomes of this study. The lack of differences between the two groups of students indicates that years of exposure and learning do not necessarily bring learners closer to the development of native-like intuitions about the subtleties of demonstrative use. One possible explanation for this phenomenon is that the use of the demonstrative anaphors is often regarded by both teachers and learners as a relatively trivial feature. Since the rules of use and distribution of the various demonstratives are not well described, learners hardly ever make errors which could be highlighted and corrected by the teacher. In any case, selection of an unsuitable marker hardly ever causes a communication breakdown. Thus, students' attention is never drawn to the distributional criteria of the various demonstratives and other anaphora markers. Explicit instruction would seem
indispensable if native-like patterns of use are to be acquired. It should also be pointed out that many differences observed between learner and native use of the demonstratives may be attributable to factors other than non-native and native intuitions. In this study, learner data were compared with published texts written by professional writers such as journalists or essayists. The observed differences could also be a result of a discrepancy between professional and non-professional styles. Thus, this investigation should be complemented by a comparison of learners’ compositions with essays written by non-professional native writers (British or American students with similar general learning experience) in order to show whether overuse of the demonstratives is a feature of non-nativeness or of lack of expertise in writing. One investigation of this kind conducted by Petch-Tyson (2000) showed that learners with different L1 backgrounds also tended to overuse demonstratives in comparison with non-professional American writers. However, only if all three groups of writers – professional native writers, non-professional native writers and learners – are compared, will the picture of the patterns of misuse be complete.
3. Conclusion

The primary aim of this study was to highlight an interesting feature of interlanguage connected with the use of the demonstratives as anaphora markers by Polish advanced learners of English. The results of the study confirmed teachers’ earlier intuitions concerning the problematic nature of the demonstrative anaphors, a feature of language use which is not addressed explicitly in current syllabi and ELT materials. The results of this study suggest that native-like use of the demonstratives cannot be acquired implicitly by Polish learners, who should not be left to their own devices in developing native-like intuitions in this area. In a broader context, this study has illustrated that interlanguage still remains unexplored in its many fine details. Numerous L1-specific and L1-independent problem areas are still waiting to be investigated: learner corpora certainly have a major role to play in this process. The findings not only have important implications for syllabus designers, material developers and teachers, they also point to the need to rethink the current trends in language pedagogy, particularly the communicative approach. The focus on the
development of the macro-linguistic skills, including writing, has put form-focused instruction at the peripheries of language teaching, especially at the advanced level. It is often assumed that learners at higher proficiency levels will benefit from mere exposure to a foreign language, and that finer points of structure and vocabulary do not have to be taught explicitly (e.g. Krashen 1985, Brumfit 1984). The present study provides yet another argument for the importance of explicit form-focused instruction at advanced levels of proficiency.
References

Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E. 1999. Longman Grammar of Spoken and Written English. Harlow: Longman.
Brumfit, C.J. 1984. Communicative Methodology in Language Teaching: The Roles of Fluency and Accuracy. Cambridge: Cambridge University Press.
Krashen, S.D. 1985. The Input Hypothesis: Issues and Implications. London: Longman.
Leńko-Szymańska, A. 2003. “Contrastive interlanguage analysis revisited”. Paper presented at the workshop Learner Corpora: Design, Development and Applications, Lancaster University (UK), 27 March 2003.
Lyons, J. 1977. Semantics. Cambridge: Cambridge University Press.
Petch-Tyson, S. 2000. “Demonstrative expressions in argumentative discourse – A computer-based comparison of non-native and native English”. In Corpus-based and Computational Approaches to Discourse Anaphora, S. Botley and A. McEnery (eds), 43–64. Amsterdam: John Benjamins.
Quirk, R., Greenbaum, S., Leech, G. and Svartvik, J. 1985. A Comprehensive Grammar of the English Language. London: Longman.
Scott, M. 1999. Wordsmith Tools version 3. Oxford: Oxford University Press.
Swan, M. 1980. Practical English Usage. Oxford: Oxford University Press.
How learner corpus analysis can contribute to language teaching: A study of support verb constructions

Nadja Nesselhauf
University of Heidelberg, Germany
In this paper, the difficulties that advanced learners of English have with support verb constructions such as give an answer or have a look at are investigated on the basis of a subcorpus of ICLE (the International Corpus of Learner English). It is found that, contrary to what is sometimes assumed, it is not necessarily the choice of the verb that poses problems; one common problem, for example, is that existing support verb constructions are used inappropriately. The study is followed by a discussion of how its results, and how results from learner corpus studies in general, can be translated into suggestions for language teaching. It is argued that suggestions for teaching have to take account of factors in addition to those that can be directly derived from a learner corpus study.
1. Introduction

The linguistic phenomenon that this paper will focus on can be illustrated by examples such as give an answer, have a look at or make an arrangement. These expressions consist of an eventive noun that carries the bulk of the meaning, and a verb whose contribution to the lexical meaning is comparatively small, and which is therefore often called “delexical”.1 The noun in these expressions is derivationally related to a verb which is roughly synonymous with the whole combination (the meaning of make an arrangement, for example, largely corresponds to the meaning of arrange). There is no established term for these expressions; they have been variously referred to as “expanded predicates” (Algeo 1995), “verbo-nominal phrases” (Rensky 1964), “phrasal verbs” (Stein 1991, Live 1973), “complex verbal structures” (Nickel 1968), “stretched verb
constructions” (Allerton 2002) or “support verb constructions” (e.g. Krenn 2000, Danlos 1992). The term “support verb constructions” is adopted here, as it is more specific than most of the other terms suggested (like “phrasal verbs”, for example, or “verbo-nominal phrases”, which could be taken to refer to all kinds of structures involving a verb and a noun) and seems to have gained some currency in recent years (for a more precise definition of support verb constructions see section 2).

For learners of English, support verb constructions are important for two reasons: they are frequent in English and they seem to be problematic for learners, even at an advanced level. Surprisingly, then, their importance for language teaching has not been noted very often. Among the few exceptions are Sinclair and Renouf (1988), who, on the basis of the pervasiveness of these constructions in English as revealed by corpus analyses, explicitly call for giving them more attention in foreign language teaching. As their investigation of nine course series for English reveals, delexical constructions are almost completely neglected in teaching materials. The assumption that support verb constructions are difficult for learners of English has so far mainly been based on impressionistic observation (e.g. Lewis 2000:116) or inferred from the fact that, while this type of construction is also present in numerous other languages, many support verb constructions do not correspond cross-linguistically (e.g. Dirven and Radden 1977:141, Altenberg 2001:196). Actual investigations of learners’ difficulties with support verb constructions in English are extremely rare, however. Almost all of the few existing studies are more general, i.e. they investigate combinations with certain verbs or verb-noun combinations in general and identify and discuss support verb constructions as a special group among them (Altenberg and Granger 2001, Howarth 1996, Kaszubski 2000). One of the few studies concentrating exclusively on support verb constructions – which is at the same time the one based on the largest amount of data – is Chi et al. (1994), who investigate support verb constructions of six common verbs in a one-million word corpus of learner English produced by speakers of Chinese.

The present study focuses on the use of support verb constructions by advanced German-speaking learners of English – a learner group for which no study of support verb constructions has been carried out so far. The aim of the paper is twofold: firstly, to identify the mistakes typically made by this group and secondly, to discuss how these results, and similar ones obtained from learner corpus analysis, can and should contribute to language teaching.
2. The definition of support verb constructions

Like most features of language, support verb constructions are not a clearly delimitable phenomenon. Unsurprisingly, therefore, the definitions vary widely: the same structures are sometimes named differently, and the same names are sometimes used for different groups of structures. One group of combinations is often regarded as prototypical support verb constructions, namely those consisting of one of the four verbs have, take, make and give in a delexical sense, an indefinite article, and an eventive noun that is identical in form to the verb to which the whole construction is roughly synonymous (cf. for example Algeo 1995, who calls these constructions “core expanded predicates”). Examples of such prototypical support verb constructions are have a smoke or give a smile.2 Some authors restrict their definition (of whatever term they are using) to these combinations (e.g. Labuhn 2001).3 Certain combinations are closely related to these prototypical constructions. These are:

– combinations in which the noun is phonetically or derivationally related to the verb (e.g. take a breath – breathe, make a decision – decide, offer an apology – apologize)
– combinations in which there is no indefinite article (e.g. take action)
– combinations in which the noun is a prepositional object (e.g. take sthg into consideration)
– combinations which contain verbs other than have, take, make and give (e.g. run a risk).
These combinations will all be subsumed under the term “support verb construction” here. For practical reasons, the analysis itself will, however, be restricted to have, take, make and give, which are among the verbs found most frequently in support verb constructions, no matter how narrow or wide one’s definition (cf. e.g. Akimoto 1989:2). Other related combinations, which have also sometimes been included by different authors, but which will be excluded here, are:

– combinations of a delexical verb and an eventive noun which do not have a roughly synonymous verb related to the noun (e.g. make an effort, cf. Altenberg 2001)
– combinations that have an equivalent verb in the passive, or in a reflexive or causative construction (e.g. take offence – be offended, give sb a good feeling – make sb feel good, cf. Allerton 2002)
– combinations of verb and adjective (e.g. be critical – criticize, cf. Allerton 2002)
– copular constructions in which the noun denotes an agent (e.g. be a helper – help, cf. Rensky 1964)
Whether the noun or the related verb is diachronically and/or synchronically primary will not be taken into account in the present definition, which means that combinations such as have a party will be included, although diachronically the verb party is derived from the noun.
3. Data and methodology of the present study

3.1 The learner corpus

The present analysis is based on a subcorpus of the International Corpus of Learner English (ICLE) which contains data from speakers of L1 German.4 ICLE subcorpora consist of essays written by university students of English in their 2nd, 3rd or 4th year, i.e., by learners who mostly strive for near-native competence in English. The essays included in the corpus are all argumentative or descriptive, with topics such as “Television is the opium of the masses” or “My teenage idol” (for a more detailed description of ICLE see e.g. Granger 1996, 1998). A slightly reduced, preliminary version of the German subcorpus was used, containing about 300 essays, or about 150,000 words; this corpus will be called GeCLE (for German ICLE).
3.2 Methodology

After the extraction of all verb-noun combinations with make, have, take and give from the corpus,5 those that could be classified as support verb constructions on the basis of the criteria outlined above were identified. Unacceptable combinations were classified as support verb constructions if the context indicates that the intended meaning is eventive and lies primarily in the noun. All support verb constructions were then judged as to their acceptability in (British and/or American) English. The acceptability judgements proceeded as follows: Combinations that were found in identical form in a dictionary (the Oxford Advanced Learner’s Dictionary 2000, the Collins COBUILD English Dictionary 1995, or The BBI Dictionary of English Word Combinations,
Benson et al. 1997) or at least five times in the written part of the BNC (British National Corpus)6 were judged acceptable. “Identical form” means that not only the lexical elements had to correspond, but also the number of the noun, its (central) determiners and its complementation. If a combination could not be judged acceptable on this basis, it was presented to two native speakers (one of British and one of American English) in the context of one or two sentences. The native speaker was asked to judge the combination as either acceptable, unacceptable or questionable. If the judgement of the two native speakers did not correspond, two additional native speakers (again one of British, one of American English) were consulted. The judgements were then averaged, resulting in five degrees of acceptability: +, (+), ?, (*), *. If a combination seemed more acceptable in one variety than in the other (for example, if the judgements were +, ? and *, *), the average was only formed on the basis of the judgements by the native speakers of that variety (i.e. + and ? in this case, resulting in (+)). For practical reasons, if different corrections were given by different informants, only one was considered, namely either the one that was given more frequently, or, if none was given more frequently than the other(s), the one that seemed most likely in the context.

A few limitations of the methodology have to be pointed out. Firstly, acceptability in the area of support verb constructions can vary quite considerably between speakers, so that the scale of acceptability based on the judgements of a comparatively small number of speakers can only give a rough indication of the degree of acceptability of a certain combination. Secondly, this approach, which is based on certain verbs, cannot capture cases in which another construction is used instead of a support verb construction (for example a single verb such as notice instead of take notice). Nor can it capture cases in which a verb other than the four investigated ones should have been used but was not (such as *do a contribution). Thirdly, the amount of data investigated is fairly small, and results on specific combinations should therefore be regarded as preliminary.
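The decision procedure just described can be summarised in a few lines of code. The sketch below is an illustration only, not the procedure actually used in the study: the numeric scale onto which the judgement symbols are mapped, and all function and variable names, are assumptions introduced for the example, and the variety-specific averaging described above is not modelled.

```python
# Illustrative sketch of the acceptability cascade described above: a combination
# counts as acceptable if it appears in identical form in one of the dictionaries
# or at least five times in the written BNC; otherwise native-speaker judgements
# are averaged onto the five-degree scale. The numeric mapping (+ = 2, ? = 1, * = 0)
# is an assumption introduced for this sketch only.

SCORE = {"+": 2.0, "?": 1.0, "*": 0.0}          # judgements informants may give
SCALE = {2.0: "+", 1.5: "(+)", 1.0: "?", 0.5: "(*)", 0.0: "*"}

def acceptability(in_dictionary, bnc_written_freq, judgements):
    """Return one of '+', '(+)', '?', '(*)', '*' for a support verb construction."""
    if in_dictionary or bnc_written_freq >= 5:
        return "+"                               # attested in identical form
    mean = sum(SCORE[j] for j in judgements) / len(judgements)
    nearest = min(SCALE, key=lambda point: abs(point - mean))
    return SCALE[nearest]

# Worked example from the text: judgements of + and ? average out to (+).
print(acceptability(False, 0, ["+", "?"]))       # -> (+)
```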
4. Results

4.1 Overall results

Overall, 251 support verb constructions (SVCs) containing the verbs give, have, make and take were found in GeCLE. The distribution of the 251 support verb constructions over the four verbs and their distribution on the scale of acceptability is shown in Table 1. While the majority of support verb constructions produced by the learners (168 out of the 251) was found to be undoubtedly acceptable, 45 were judged clearly unacceptable. If those judged *, i.e. clearly unacceptable, and (*), i.e. largely unacceptable, are considered to contain “mistakes” or to be “wrong”, this means that one fourth to one fifth (23%) of all support verb constructions the learners used are wrong, i.e. contain one or several mistakes.7 Table 2 shows the relation of wrong support verb constructions to all support verb constructions for each of the four verbs. It reveals that support verb constructions with make are the most liable to error, closely followed by those with give and take, and that combinations with have are the least liable to error. It should be noted, however, that this cannot directly be taken to mean that
Table 1. Degree of acceptability of the support verb constructions in GeCLE

         +      (+)    ?      (*)    *      total
give     34     2      3      5      10     54
have     70     8      5      5      9      97
make     37     1      3      3      16     60
take     27     2      1      -      10     40
total    168    13     12     13     45     251
Table 2. Number of mistakes

         SVCs judged * or (*)    total SVCs    mistakes per 50 SVCs
give     15                      54            14
have     14                      97            7
make     19                      60            16
take     10                      40            13
total    58                      251           12
combinations with make are most difficult and those with have least difficult for the learner, as only the form in which the combinations occur in the corpus and not their correct form is considered.
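The normalisation behind the last column of Table 2 can be reproduced directly from the published counts. The short sketch below only illustrates that arithmetic (half-up rounding is assumed to match the printed figures); it is not code from the study.

```python
# Reproducing the "mistakes per 50 SVCs" column of Table 2: the number of SVCs
# judged * or (*) is scaled to a common base of 50 constructions. Counts are
# taken from Table 2; half-up rounding is assumed.

counts = {           # verb: (SVCs judged * or (*), total SVCs)
    "give":  (15, 54),
    "have":  (14, 97),
    "make":  (19, 60),
    "take":  (10, 40),
    "total": (58, 251),
}

for verb, (wrong, total) in counts.items():
    per_50 = int(50 * wrong / total + 0.5)       # half-up rounding
    share = 100 * wrong / total
    print(f"{verb:5s}  {per_50:2d} mistakes per 50 SVCs  ({share:.0f}% judged wrong)")
# give 14, have 7, make 16, take 13, total 12 -- i.e. 23% of all SVCs are wrong.
```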
4.2 Typical difficulties

The following types of mistake could be identified for the support verb constructions judged unacceptable:8

wrong verb (15/14): take changes (make; DR–0009.1), make an action (perform; AUG–0067.3), make an experience (have; e.g. AUG–0016.3)
wrong noun (8/7): make a trial (attempt; AUG–0053.3), make a cut (distinction; AUG–0047.1)
wrong verb and wrong noun (3/3): make the Nazi greeting (give the Nazi salute; AUG–0072.1), give action to (put effort into; SA–0004.3)
SVC instead of verb (14/13): give solutions (solve; BAS–0056.1), make distinction (separate; AUG–0095.1), have a look at (look after; DR–0013.1)
wrong noun complementation (12/12): take care for (of; e.g. DR–0022.1), have a look on (at; AUG–0035.2)
wrong determiner (including missing and superfluous determiner) (2/2): have sb under a strict control (under strict control; AUG–0077.3)
wrong syntactic structure (1/1): give a rest (give sb a rest; BAS–0051.1)
multiple mistakes (other than wrong verb and wrong noun) (3/3): make a shopping tour (go shopping; AUG–0014.2)
The numbers in brackets indicate how many mistakes of each particular type occurred; the second number refers to the number of texts in which the different mistakes occurred: 14/13, for example, means that this type of mistake occurred in 14 combinations, but that one unacceptable combination occurred twice in one essay. As only one correction per combination was considered (cf. section 3.2), each unacceptable combination was assigned to one type of
mistake only, although in a few cases the unacceptable construction could plausibly have been assigned to several different types of mistake. Give a solution, which occurred twice in the corpus, for example, was once assigned to the category “wrong verb”, and once to the category “SVC instead of verb”, as in one case all native speaker informants suggested provide a solution as correction (BAS–0014.1), and in the other, three suggested solve, and one provide a solution (BAS–0056.1).

According to the above classification, three or four main types of mistake can be identified for support verb constructions. These are, starting with the most frequent type, wrong verb, SVC instead of verb, wrong noun complementation, and, with a slightly lower number of occurrences, wrong noun. If we compare this result with those of other studies on support verb constructions in learner language, it partly confirms earlier results, and partly provides new insights. Most of the existing studies have stressed the difficulties learners have in choosing the right verb. This is confirmed by the present data, and unsurprisingly so, as the verb in a support verb construction is by definition delexical and therefore to some degree arbitrary. Other types of mistakes are, if at all, only mentioned in passing in previous studies. Chi et al. (1994) point out that sometimes support verb constructions are used instead of single verbs; Altenberg and Granger (2001) in addition mention noun-mistakes and article mistakes. But even in these studies, mistakes other than verb-mistakes seem to be considered far less important and are not quantified. The present results, however, clearly show that there are other frequent types of mistake besides the wrong choice of verb.

In a few of the previous studies, a distinction is made between two types of verb mistake, according to whether the verb that was used can best be corrected by another general verb or by a more specialised one (e.g. Howarth 1996, Chi et al. 1994); again, the relation of these two types has not been quantified. In the present analysis, 12 out of 15 verb mistakes are best corrected by choosing another general verb (e.g. give a decision to make a decision) and only 3 are best corrected by using a (slightly) less general verb (e.g. make an action to perform an action (AUG–0067.3), give a solution to provide a solution; BAS–0014.1). A further point that emerges from the present analysis and that does not seem to have been noted so far is that quite frequently combinations that learners use actually exist in English but are used incorrectly, i.e., they do not convey the meaning that was apparently intended. Examples of such cases are make a cut, where make a distinction seems to have been intended, or take measures for to measure:9
(1) Such people have to understand that they have to make a sharp cut between science fiction as it is described in films and reality. (AUG-0047.1; my italics)

(2) If I wanted to draw a plan I would have to take measures first, I would have to know how much space there is. (SA-0004.3; my italics)
In the category “wrong noun”, 6 out of 8 combinations produced actually exist; in the category “SVC instead of verb”, 11 out of 14.10 If the context of the combinations is not taken into account, such mistakes are easily overlooked, and mostly seem to have been overlooked so far.11

Table 3 shows how the different types of mistake are distributed over the support verb constructions with the four verbs investigated. As the numbers are comparatively small, they have to be interpreted with caution, but some clusters occur and might well point to more general tendencies. Make, and to a lesser extent give, seem to be particularly prone to be used to form (wrong) SVCs by German-speaking learners. Whereas in the case of give four different wrong combinations were formed, in the case of make only three different combinations were formed (in 9 instances altogether), but two of them several times. Six times in five essays, the combination make an experience was used (e.g. AUG–0016.3; probably modelled on German eine Erfahrung machen). This is at the same time also the most frequent single mistake occurring in the data. The other mistake with make as the wrong verb occurring in more than one essay was make a step (take; 2/2; AUG–0058.3, AUG–0081.3). When learners used or created SVCs instead of using a verb, this has particularly often resulted in combinations with have
Table 3. Types of mistake in SVCs with make, take, give, have

Type of mistake                      total    make    take    give    have
wrong verb                           15/14    9/8     1/1     4/4     1/1
wrong noun                           8/7      5/5     2/1     1/1     -
wrong verb & wrong noun              3/3      -       1/1     2/2     -
SVC instead of verb                  14/13    2/2     -       4/4     8/7
wrong noun complementation           12/12    1/1     6/6     1/1     4/4
wrong determiner                     2/2      -       -       1/1     1/1
wrong syntactic structure            1/1      -       -       1/1     -
multiple mistakes                    3/3      2/2     -       1/1     -
and, to a somewhat lesser extent, in combinations with give (e.g. have a gossip for to gossip; AUG-0095.1, or give solutions for solve; BAS-0056.1). Wrong noun complementation primarily affected the verb take. This is due to two combinations only, namely to take care and take a look. The former occurs twice with for (AUG-0048.3, DR-0022.1) and twice with about (AUG-0084.3, AUG-0047.2) instead of of, the latter occurs twice with into (instead of at; SA-0007.3, AUG-0083.1). One of the noun complementation mistakes concerning a combination with have also involves the noun look: have a look on (instead of at; AUG-0035.2). A final clustering of mistakes that emerges is that the wrong noun is used particularly often with make-combinations. The main reason for this is that three attempts (by different learners) to express make a distinction have resulted in a combination which, although containing make, does not contain the correct noun. The combinations produced are make a difference (BAS-0040.1), make a division (BAS-0054.1), and make a cut (AUG-0047.1). All three combinations exist, but they do not express the meaning that seems to have been intended (cf. for example “we must make a difference between heroin and cocaine on the one hand and alcohol and tobacco on the other” (BAS-0040.1; my italics)).

Clustering of mistakes can also occur across the different verbs and the different types of mistakes, as the example have/take a look at has already suggested. In addition to the mistakes in this combination mentioned above (i.e. wrong prepositions), the whole combination is also once used incorrectly, apparently due to confusion with look after: “Although there are a lot of programmes and care for mentally ill people it is quite hard to have a look at all of them” (DR–0013.1; my italics). Another occurrence of have a look was judged “?”: “Suddenly I heard a horrible cry [scream] so that I nearly fell from the swing. I had a look around and saw the reason” (AUG-0053.3; my italics). This was corrected to looked around by both native speaker judges (who both judged the combination “?”). What the learners seem to confuse in these cases is verb-plus-particle sequences containing the verb look and support verb constructions containing the noun look. A similar confusion can be observed with the combination take care, where the prepositions used (about and for) were probably influenced by the existence of the prepositional verbs care about and care for.

Further combinations that were used incorrectly more than once are have an/no intention of -ing, have a chat, and give sb some advice. In the case of have an/no intention of -ing, twice have an/no intention to+inf was produced
(AUG-0020.1, AUG-0058.1), apparently modelled on German (die / nicht die Absicht haben + zu+inf).12 In the case of have no intention, this has led to a complementation mistake (to+inf instead of of -ing), in the case of have an intention to a “SVC instead of verb”-mistake, as intend is usually preferred to have an intention in English (whereas have no intention is quite frequent). Instead of the combination have a chat, the combinations have a gossip (AUG-0095.1) and make a chat (AUG-0041.2) were produced.13 The combination give sb some advice appears problematic, as in one case give sb the advice to+inf was produced when to advise sb to+inf would have been more appropriate (AUG-0098.1).14 In another case (or in two, if a combination judged “?” is included; DR-0020.1), an attempt was apparently made to avoid the complementation problem by using a colon after the noun, which then led to an article mistake: “I can just give you the advice: Buy a bike, ride it every day, […]” (AUG-0053.3; my italics, correction: this).
5. Implications for language teaching

5.1 The use of learner corpus analysis for language teaching

Most learner corpus studies claim to have some relevance for language teaching and typically end with a short section stating pedagogical implications of the study. These are often directly derived from the results, i.e. what was found to deviate frequently from native speaker usage is then recommended for teaching. Learner corpus analyses are therefore prone to a type of criticism similar to what recommendations for teaching based on native speaker corpora have been subjected to for a while: that they only take into account one criterion that is important for teaching, and disregard others. In the case of teaching recommendations based on native speaker corpora, it has often been objected that the only criterion considered is frequency in native speaker usage (e.g. Widdowson 1991, 2000, Cook 1998). Teaching recommendations exclusively based on the criterion of frequency of deviation in non-native speaker usage seem similarly misguided, however.

First, for recommendations on what to teach, frequency in native speaker usage certainly is one of the other important criteria; the criterion needs to be refined, however. Naturally, frequency must mean frequency in the variety or the varieties that the learners are aiming to acquire. In many EFL countries, this will mean frequency in both
American and British English. For a general course in particular, frequency should in addition also mean frequency in many text types, i.e., the features selected should be non-technical and of a medium level of formality. For the selection of features for ESP courses, frequency in particular text types and/or a particular subject area is also essential. In addition to frequency in native speaker language and frequency of deviation in non-native-speaker language, the frequency of elements in learner language should also be taken into account, as learners will use (or attempt to use) particularly often those features that they find particularly useful. Like the criterion of frequency of deviation, this criterion can also be based on learner corpus analysis (again of a corpus that contains text types that a learner is likely to have to produce). A further important criterion that is often ignored when features are suggested for language teaching is the degree of disruption of an unacceptable expression for the recipient. Unacceptable expressions that disrupt communication more, i.e. that draw the recipient’s attention away from the message, or make the recipient misunderstand or fail to understand the message, should clearly be given more attention than those that are less disruptive.
5.2 The use of the present results for language teaching

Assuming that the results of the present analysis will be confirmed by further study, what kinds of suggestions can be made on their basis? Clearly, suggestions can only be made for the learner group that has been investigated, i.e., advanced German-speaking learners. As the analysed text type is fairly neutral, suggestions can be made mainly for general courses, but also to some degree for more technical courses (as these by necessity also contain some non-technical language). The results, together with the fact that support verb constructions are a common feature in English, indicate that for the group of learners in question, there is indeed a need for improved teaching of support verb constructions: they are used quite frequently by the learners and the error rate is rather high. On the basis of the criteria of frequency of deviation, the combinations have an experience, make a distinction, take care of, take/have a look at are candidates for teaching; it can be assumed that frequency of occurrence in native and non-native speaker usage also applies to these combinations. Similarly, the contrast between have no intention of -ing and intend to, give sb some advice and advise sb to+inf, and the combinations have a chat and
make a step seem to deserve attention on the basis of these criteria. In general, it could then be suggested that the major focus of teaching should be – in this order – on choosing the right verbs, the right noun complementation, the right noun, and on contrasting support verb constructions with similar verbs.

If the additional criterion of disruption for the recipient is taken into account, however, these suggestions can, and might even have to, be modified in some way. It can be assumed (although further investigation would be needed to confirm this) that a combination is more confusing for the recipient if several elements in the combination or even the whole combination is wrong than if only one element of the combination is wrong. In addition, it can be assumed that a wrong element is more confusing if this element carries a lot of meaning than if it carries only little meaning. Thus, a wrong noun will probably be more confusing for the recipient than a wrong verb, and a wrong verb will be more confusing than a wrong preposition or other type of noun complementation. The highest degree of confusion probably occurs when the whole combination is used incorrectly although it exists (exactly or almost) in the form in which it is used. Contrasting combinations in which the noun looks like the verb but is used in a (slightly) different sense in the support verb construction, such as take notice and notice, or take measures, take measurements and to measure, would thus seem particularly useful in teaching, and so would contrasting support verb constructions with the same verbs and nouns with similar form and meaning (such as make a distinction and make a difference). Contrasting support verb constructions with preposition and verb-plus-particle sequences with the same lexical element, such as have/take a look at and look around, look after etc., can also be recommended. Focusing on verbs comes next in importance, since verb errors are frequent but arguably not particularly disruptive for the recipient. What seems least important, if the criterion of disruption is taken into account, is a teaching focus on noun complementation (such as take care of or have the intention of -ing). This aspect should therefore probably get specific attention only in very advanced courses.

Some of the results obtained, finally, cannot be translated into suggestions for language teaching at all, or at least not yet. The fact that make, take, give and have are more frequently confused with each other than used instead of more specific verbs, for example, cannot lead to suggestions for language teaching before combinations with more specific verbs have been investigated as well. Questions such as whether to teach that *give a solution does not exist in English must also remain open. But I hope to have been able to
show that learner corpus analysis can make some contribution to language teaching – if the results are translated into suggestions for language teaching with some caution.
Notes

1. As has been shown, the verb does however make some contribution to the lexical meaning, so that the term “delexical” is perhaps a little unfortunate (e.g. Stein 1991, Dixon 1991, Wierzbicka 1982). It will nevertheless be retained here, in the sense indicated above.
2. Some authors consider the second lexical element in such constructions a verb (e.g. Wierzbicka 1982, Dixon 1991).
3. A few authors are even more restrictive, as for example Stein (1991), who only includes combinations with have, take and give in her definition.
4. I would like to thank the coordinator of German ICLE, Gunter Lorenz, and Sylviane Granger and the Centre for English Corpus Linguistics at the Université Catholique de Louvain, Belgium, for integrating me into the ICLE project at a late stage.
5. All verb forms of the four verbs were extracted automatically. On the basis of a manual analysis those which occurred in a verb-noun combination were then selected.
6. Cf. http://www.hcu.ox.ac.uk/BNC.
7. In this paper, no distinction will be made between “error” and “mistake”, or between “unacceptable” and “wrong”.
8. The essay codes given after examples have been slightly simplified: their full form reads “ICLE-GE-DR–0009.1”, but “ICLE-GE” (i.e. ICLE-German subcorpus) is omitted.
9. Orthographic mistakes have been corrected in quotations from the essays; grammatical or lexical mistakes are corrected in square brackets if the mistake makes the comprehension of the sentence difficult.
10. Only lexical elements are considered here, so that for example make distinction is considered an existing combination, although, strictly speaking, only the combination of the lexical elements (i.e. make + distinction) is acceptable, but not the combination as it is (i.e., without an article).
11. Chi et al. (1994), for example, only mention this type of mistake on the basis of a non-existing SVC they found in their corpus (take an interview for to interview).
12. The sentences read as follows: “I had the intention to stay there for a year” (AUG–0020.1); “He had no intention to […] be successful” (AUG–0058.1).
13. The sentences read as follows: “Last week I had one of these little gossips [...] with my neighbour” (AUG–0095.1); “The “little” chats she makes always develop in a monotonous monologue” (AUG–0041.2).
14. The sentence reads as follows: “Then I’ll give you the good advice to visit [go to] McDonalds” (AUG–0098.1).
References

Akimoto, M. 1989. A Study of Verbo-nominal Structures in English. Tokyo: Shinozaki Shorin.
Algeo, J. 1995. “Having a look at the expanded predicate.” In The Verb in Contemporary English, B. Aarts and C.F. Meyer (eds), 203–217. Cambridge: Cambridge University Press.
Allerton, D.J. 2002. Stretched Verb Constructions in English. London and New York: Routledge.
Altenberg, B. 2001. “Contrasting delexical English make and Swedish göra.” In A Wealth of English. Studies in Honour of Göran Kjellmer, K. Aijmer (ed.), 195–219. Göteborg: Acta Universitatis Gothoburgensis.
Altenberg, B. and Granger, S. 2001. “The grammatical and lexical patterning of MAKE in native and non-native student writing.” Applied Linguistics 22 (2), 173–194.
Benson, M., Benson, E. and Ilson, R. 1997. The BBI Dictionary of English Word Combinations. (Rev. ed. of BBI combinatory dictionary of English, 1986.) Amsterdam: John Benjamins.
Chi, M.L.A., Wong, P.K. and Wong, C.M. 1994. “Collocational problems amongst ESL learners: A corpus-based study.” In Entering Text, L. Flowerdew and A.K. Tong (eds), 157–165. Hong Kong: University of Science and Technology.
Collins Cobuild English Dictionary. 1995. London: Harper Collins.
Cook, G. 1998. “The uses of reality: A reply to Ronald Carter.” ELT Journal 52 (1), 57–64.
Danlos, L. 1992. “Support verb constructions. Linguistic properties, representation, translation.” Journal of French Linguistic Study 2, 1–32.
Dirven, R. and Radden, G. 1977. Semantische Syntax des Englischen. Wiesbaden: Athenaion.
Dixon, R.M.W. 1991. A New Approach to English Grammar, on Semantic Principles. Oxford: Clarendon.
Granger, S. 1996. “Learner English around the world.” In Comparing English Worldwide: The International Corpus of English, S. Greenbaum (ed.), 13–24. Oxford: Clarendon.
Granger, S. 1998. “The computerized learner corpus: A versatile new source of data for SLA research.” In Learner English on Computer, S. Granger (ed.), 3–18. London and New York: Addison Wesley Longman.
Howarth, P. 1996. Phraseology in English Academic Writing. Some Implications for Language Learning and Dictionary Making. Tübingen: Niemeyer.
Kaszubski, P. 2000. Selected Aspects of Lexicon, Phraseology and Style in the Writing of Polish Advanced Learners of English: A Contrastive, Corpus-based Approach. Unpublished PhD Thesis, Adam Mickiewicz University, Poznań. Online: http://main.amu.edu.pl/~przemka/rsearch.html (visited: 16.04.2004).
Krenn, B. 2000. The Usual Suspects. Data-oriented Models for Identification and Representation of Lexical Collocations. Saarbrücken: German Research Center for Artificial Intelligence.
Labuhn, U. 2001. Von Give a Laugh bis Have a Cry. Zu Aspektualität und Transitivität der V+N Konstruktionen im Englischen. Frankfurt: Peter Lang.
Lewis, M. (ed.) 2000. Teaching Collocation. Further Developments in the Lexical Approach. Hove: LTP.
Live, A.H. 1973. “The take-have phrasal in English.” Linguistics 95, 31–50.
Nickel, G. 1968. “Complex verbal structures in English.” International Review of Applied Linguistics 6, 1–21.
Oxford Advanced Learner’s Dictionary. 2000. Oxford: OUP.
Rensky, M. 1964. “English verbo-nominal phrases. Some structural and stylistic aspects.” Travaux Linguistiques de Prague 1, 289–299.
Sinclair, J. and Renouf, A. 1988. “A lexical syllabus for language learning.” In Vocabulary and Language Teaching, R. Carter and M. McCarthy (eds), 140–160. London and New York: Longman.
Stein, G. 1991. “The phrasal verb type ‘to have a look’ in Modern English.” International Review of Applied Linguistics 29, 1–19.
Widdowson, H.G. 1991. “The description and prescription of language.” In Linguistics and Language Pedagogy: The State of the Art, J. Alatis (ed.), 11–24. Washington D.C.: Georgetown University Press.
Widdowson, H.G. 2000. “On the limitations of linguistics applied.” Applied Linguistics 21 (1), 3–25.
Wierzbicka, A. 1982. “Why can you have a drink when you can’t *have an eat?” Language 58, 753–799.
The problem-solution pattern in apprentice vs. professional technical writing: an application of appraisal theory

Lynne Flowerdew
Hong Kong University of Science and Technology, PR China
This article reports a corpus-based analysis of the Problem-Solution pattern in an apprentice and in a professional corpus of technical recommendation-type reports. The first part of the article describes how salient lexis for the Problem-Solution pattern has been identified using the Keywords Tool (Scott 1997), which uncovers words of unusually high frequency in a corpus when compared with a larger-scale reference corpus. The second part of the article describes the classification of the keywords, which is based on Martin’s system of APPRAISAL for encoding evaluative lexis. The importance of taking into account the “context of situation” and “context of culture” for the interpretation of the evaluative keyword lexis is also highlighted.
1. Introduction

A few years ago (L. Flowerdew 1998), I advocated the application of corpus linguistic techniques to textlinguistics in the areas of discourse analysis, genre analysis and systemic-functional linguistics. In L. Flowerdew (2002) I noted that recently quite a lot of corpus-based research on EAP has been carried out from a systemic-functional linguistic (SFL) perspective (Halliday 1994). Most of these studies are of a contrastive nature in which various aspects of student writing are compared with expert, or professional, writing. For example, research on the interpersonal level, which shows the writer’s attitude towards a proposition, can be found in Hyland and Milton (1997) and Hewings and Hewings (2002). Thematic structure, “the point of departure of the message”, has been investigated by Green et al. (2000), while Ragan (2001) has analysed topical themes in an annotated learner corpus.
One aspect of Halliday’s system, which has been developed over the past few years by Jim Martin, is APPRAISAL. The APPRAISAL framework is related to the Interpersonal level and specifically concerns the way language is used to evaluate and to manage interpersonal positionings. This framework has mainly been applied to the analysis of media discourse, casual conversation and literature (see White (2001) for an overview of these research studies), but not to the analysis of student or apprentice discourse, as has been the case with other aspects of SFL theory mentioned above. The research reported on in this paper is therefore an attempt to apply the APPRAISAL framework to an analysis of student and professional writing, specifically technically-oriented reports. This genre very often follows a Problem-Solution organizational pattern (Swales 1990), detailed descriptions of which can be found in Hoey (1983; 2001). However, with the exception of Scott’s (2000) research, this discourse pattern has not received nearly as much attention in corpus-based research as other textlinguistic areas have. A corpus-based analysis of the Problem-Solution pattern situated within the APPRAISAL framework would help to shed light on how this important pattern is realised linguistically in student and professional report writing.

The following section describes the student corpus, the professional corpus and the methodological procedures employed in the investigation. This is followed by a description of the APPRAISAL framework which is used for classifying the corpus data.
2. Corpora and methodology

2.1 Description of the corpora

The student (STUCORP) and professional (PROFCORP) corpora consist of approximately 250,000 words each, with STUCORP and PROFCORP made up of 80 and 60 reports, respectively. The PROFCORP reports, commissioned by the Hong Kong Environmental Protection Department from various consultancy companies in Hong Kong, document the potential environmental impact that could arise from the construction and operation of proposed buildings and facilities. These reports also contain a section on suggested mitigation measures to alleviate any possible adverse impacts. The STUCORP reports are written by 2nd and 3rd year undergraduate
students at a tertiary institution in Hong Kong as assessed assignments on a Technical Communications Skills course. Brief assignment guidelines are given to students which stipulate that they are expected to choose an area for investigation where a problem or need can be identified on the basis of evidence from secondary and primary source data (survey questionnaire, interview, observation), and propose a set of recommendations to solve or alleviate an identified problem. The topics of the STUCORP reports are quite wide-ranging and mostly concern different university departmental or service unit issues, such as an evaluation of the existing software or hardware in computer rooms, or the lack of security measures in the laboratories. Unlike the PROFCORP reports, however, the STUCORP reports are unsolicited in that the students write the reports on the basis of a problem perceived by them rather than in response to a request by a department to investigate an issue. For this reason, the STUCORP reports can be regarded as more “problem-oriented” than the PROFCORP ones, as students have to provide persuasive evidence for the existence of some problem or need and not merely give a problem statement.
2.2 Methodological procedures: Keyword and key-key word analyses

In order to uncover the lexis for the Problem-Solution pattern in each corpus, I decided that the Keyword function in WordSmith Tools would be suitable for this purpose, as this tool identifies linguistic items which are of unusually high frequency in a particular corpus when compared with a larger-scale reference corpus. Previously, this tool has been used for delineating particular genres on the basis of the keywords (Bondi 2001; Scott 1997; Tribble 2002), but it could equally be applied to uncovering key lexis for the Problem-Solution pattern. The identification of salient lexis was carried out in two stages: a keyword and a “key-key word” analysis. In the first stage, each corpus, treated as a single whole text, was compared with a large reference corpus, in this case the BNC (Aston and Burnard 1998), for extraction of the keywords, and these keywords were then examined to determine whether any of them would signal the Problem-Solution pattern in PROFCORP or STUCORP. The Log Likelihood for calculating the significance level was set at a p value which would obtain about 40 keywords for the analysis, as Scott (1997) suggests this as a reasonable number for drawing conclusions about a text, and the minimum frequency requirement was left at the default value of 3.
In the second stage, a “key-key word” analysis was conducted, which shows the words that are key in a large number of texts of a given type, and can thus reflect the genre or discourse characteristics of the corpus as a whole. It was thought that a combination of these two related procedures would provide ample evidence for the linguistic items signalling the Problem and Solution elements of the pattern. As both the main elements of the Problem-Solution pattern are essentially evaluative in nature, it was decided that Martin’s system of APPRAISAL, which is set within the SFL tradition, would provide an ideal framework for classifying the keywords and key-key words. A description of the APPRAISAL framework is given below, with particular attention to those aspects of the framework which are relevant to the investigation reported in this paper.
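For readers unfamiliar with the keyword comparison, the sketch below shows the two-corpus log-likelihood statistic commonly used for ranking keyword candidates, i.e. the statistic named above; it is only an illustration of the calculation, not WordSmith's own implementation, and the corpus sizes and frequencies in the example are invented.

```python
import math

# Log-likelihood comparison of a word's frequency in a study corpus against a
# reference corpus; candidate keywords are ranked by this score. This is an
# illustration only; the figures in the example at the bottom are invented.

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Two-corpus log-likelihood; the higher the score, the more 'key' the word."""
    expected_study = size_study * (freq_study + freq_ref) / (size_study + size_ref)
    expected_ref = size_ref * (freq_study + freq_ref) / (size_study + size_ref)
    ll = 0.0
    if freq_study:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

# A word occurring 180 times in a 250,000-word corpus but only 2,500 times in a
# 100-million-word reference corpus receives a very high score (hypothetical data).
print(round(log_likelihood(180, 250_000, 2_500, 100_000_000), 1))
```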
3. APPRAISAL framework and classification of keywords

An excellent overview of the APPRAISAL system can be found on Peter White’s site (http://www.grammatics.com/appraisal/). In short, there are three subtypes of Appraisal: Attitude, Engagement and Graduation. It is the Attitude subtype “Values by which speakers pass judgments and associate emotional/affectual responses with participants and processes” (White 1998/99) which I make use of in this research, specifically the aspect of judgement. Judgement may be activated by either explicit or implicit means, referred to by Martin as strategies involving “Inscribe” and “Evoke” (see Figure 1 below). In other work, Martin refers to these evaluative categories as “Inscribed” and “Evoked”
Figure 1. Strategies for encoding attitude – inscribe, invite, provoke (from Martin 2004: 289)
(Martin 2000). In this paper, I shall retain the term “Inscribed” as it best conveys the meaning of what attitude is inherent in a word. However, following Hoey (2001) I shall adopt the term “Evoking” in preference to “evoke” and “evoked” as the form “Evoking” best reflects the idea that a word evokes some kind of reaction in the reader.
3.1 Inscribed vs. Evoking items

The Inscribed option realises lexis which is explicitly evaluative. With regard to the Problem element, this would include such nouns as problem, fault, drawback, where the evaluation is built into the word, as it were. These tend to be superordinate terms and a type of discourse-organising noun, which overlap with what Winter (1977) terms Vocabulary 3 items, Francis (1986 and 1994) A-Nouns (anaphoric nouns), Ivanič (1991) “carrier” nouns and Schmid (2000) “shell” nouns. Carter (1992:80) also recognises this evaluative quality inherent in such nouns: “An interesting category of A-nouns are those which generally signal attitudes. Such items do more than merely label the preceding discourse. They mark it in an interpersonally sensitive way revealing the writer’s positive or negative evaluation of the antecedent proposition”. Here, Carter touches upon an important distinction between Inscribed and Evoking lexis regarding the reader/writer orientation towards the text, as also pointed out in Hoey (2001:126): “The writer inscribes the evaluation; on the other hand, it is the word that evokes (or provokes) an evaluation in the reader”.

The Evoking option “draws on ideational meaning to ‘connote’ evaluation, either by selecting meanings which invite a reaction or deploying imagery to provoke a stance” (Martin 2004:289). In this model, it is the “invite” option of the Evoking category I am interested in, where the item, taken out of context, would evoke an evaluative response in the reader. For example, items signalling the Problem element such as cancer and pollution would belong to this category. White (2001) notes that while the Evoking option is likely to lead to some inference of good/bad, it still remains a purely “factual” description. This implies that yet another option would seem to exist for the Evoking category where it would only be possible to tell from the context whether an item evokes a positive or negative semantic prosody (see Hunston and Thompson 2000 for a discussion of the role of context in bringing out this element of evaluation). In fact, Inscribed and Evoking lexis can also be viewed as aspects of connotation. In his discussion on connotation, Partington (2001) mentions that where
connotation is so intrinsic to a word it is taken for granted, i.e. writer-initiated, and where the connotation seems less intrinsic it can be based on situational or cultural factors. In Partington’s definitions, the former would seem to relate to Inscribed and the latter to Evoking lexis. Below, I present the analysis of the keywords and key-key words within the APPRAISAL framework of Inscribed and Evoking items described above.
4. Results and discussion

4.1 Keyword analysis

First, an examination of the top 40 key words in each corpus was carried out to determine which of these signalled either Inscribed or Evoking lexis for the Problem-Solution pattern. The Inscribed lexis was identified on the basis of whether the word had an intrinsically negative (e.g. problem) or positive (e.g. solution) connotation. Evoking lexis was largely identified through whether a negative or positive connotation (such as noise and waste for the Problem element) could be inferred by a reader. However, in some cases it was necessary to examine the lexis in the wider context to determine whether a word did, in fact, carry any negative or positive connotation relating to the Problem or Solution element.

The keyword analysis revealed that lexical items were overwhelmingly used for the Problem-Solution pattern. There were no examples of grammatical words such as but or however, which could be acting as signals for the Problem element. Interestingly, in STUCORP, there was only one Inscribed signal, i.e. problem, whereas in PROFCORP 15 out of the 40 keywords can be classified as belonging to one of the categories of the pattern. The Evoking items, noise, impacts, impact, waste, traffic, dust, realise the Problem element, with construction, landfill and reclamation having the potential to be either the Problem or Solution element depending on the context. In contrast, the Solution element is signalled by keywords of an Inscribed nature, e.g. mitigation, measures, proposed, recommended, with monitoring and assessment used for the Evaluation element. Although at first sight, PROFCORP seems to exhibit a more overt Problem-Solution pattern than STUCORP, this is not to say that the reports in STUCORP are not problem-oriented, for reasons given below in the discussion on the key-key word analysis.
Table 1. Comparison of Inscribed and Evoking signals for the Problem element in PROFCORP and STUCORP. INSCRIBED STUCORP
PROFCORP
problem (8) problems (4) need (4) insufficient (5) EVOKING STUCORP
PROFCORP
stolen (4)
impacts (50) impact (26) noise (44) traffic (23) sewage (12) sewerage (6) contaminated (14) contamination (4) waste (20) wastes (5) dust (20) pollution (10) emissions (10) sediments (10) odour (9) effluent (6) discharge (5) discharges (5) NSRS (9) Dba (8) TSP (7) Leachate (6) stormwater (5) groundwater (4) * construction (47) landfill (10)
Note: * indicates lexis which can signal either the Problem or Solution element. Italics indicates vocabulary of a technical nature.
4.2 Key-key word analysis

Tables 1 and 2 present the key-key words, i.e. those keywords which are key in a large number of texts, for the Problem and Solution element of each corpus. (The figure in brackets denotes the number of texts in which the words are found to be key.) I have listed only those items which were found to be key in four or more reports, for the reason that those keywords which were key in three or fewer reports tended to be the names of environmental companies in PROFCORP or university departments in STUCORP. On the other hand, as can be seen from Tables 1 and 2, the most frequent key-key words are those denoting the Problem-Solution pattern, which can thus be considered as reflecting the textual patterning of each corpus as a whole.
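The key-key word step itself amounts to counting, for each keyword, the number of individual reports in which it turns up as key, and keeping those at or above the threshold of four. A minimal sketch of that counting is given below; the per-report keyword lists shown are invented for illustration.

```python
from collections import Counter

# Counting 'key-key' words: a word qualifies if it is key in at least four
# individual reports. keywords_per_report stands for the per-report keyword
# lists produced by the comparison with the reference corpus; sample data only.

keywords_per_report = [
    {"noise", "impacts", "mitigation"},
    {"noise", "dust", "impacts"},
    {"impacts", "waste", "noise"},
    {"noise", "impacts", "traffic"},
    {"impacts", "mitigation"},
]

MIN_REPORTS = 4                                  # threshold used in Tables 1 and 2
report_counts = Counter()
for report_keywords in keywords_per_report:
    report_counts.update(report_keywords)        # one count per report per word

for word, n_reports in report_counts.most_common():
    if n_reports >= MIN_REPORTS:
        print(f"{word} ({n_reports})")           # e.g. impacts (5), noise (4)
```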
While the tables provide clear evidence that both corpora comprise Problem-Solution based reports, the overall profiles of the patterning are somewhat different. The key-key word analysis for the PROFCORP reports mirrors to a large extent the patterning uncovered by the keyword analysis in that the Problem element tends to favour Evoking lexis while the Solution element prefers the Inscribed lexis. It is also to be noted that some of this evaluative lexis is key in a very large number of reports, which is not surprising given that the PROFCORP reports are relatively homogeneous in terms of subject matter and we would therefore expect this to be the case. There are two Evoking items in the data, construction and landfill, which can fill either the Problem or Solution slot. Whether these evoke a positive or negative reaction in the reader can only be determined by the context. For example, in some reports landfill is conceived as a problem causing leachate seepage; in other reports, construction of a landfill is seen as a solution for the disposal of waste material.

In contrast, it is the Inscribed lexis which dominates the signalling of the Problem-Solution pattern in STUCORP. The heavy reliance on superordinate terms such as problem and recommendations can be traced back to the rubrics for the assignment, which include all these Inscribed key-key words in the instructions. It therefore appears that students are incorporating the metalanguage provided in the assignment guidelines into the writing of their recommendation reports to overtly signal the pattern. Moreover, it is also to be expected that there would be far fewer lexical items occurring as key-key words across a large number of reports in STUCORP. As the topics of the student reports cover a much wider subject range than those in PROFCORP, the same lexis would tend not to occur across reports and therefore would not show up as key in four or more of the reports. This explains why other key lexis for the Problem-Solution pattern is present in STUCORP, but only occurs as key in three or fewer of the reports. However, most of this lexis is Inscribed and related to the Problem aspect, either nominal, e.g. concern, failures, difficulties, shortcoming, issue, or adjectival, where the negative import of the word is signalled by the prefix in- or un-, e.g. inadequate, inefficient, thus indicating that the student reports are weighted towards the Problem element. This is not surprising as the teaching materials put emphasis on the identification of a problem through evidence from primary and secondary source data. The absence of any Evoking technical vocabulary in STUCORP can be explained by the fact that students are instructed to write their reports for a non-specialist audience (although the subject matter of the reports may be technical), whereas the
PROFCORP reports are addressed to specialists in the field. The above discussion thus highlights the importance of referring to contextual features, such as the report-writing guidelines, for the interpretation of the keywords and key-key words in corpus-based analyses (see Lea and Street 1999, who advocate a socio-cognitive approach to the analysis of different kinds of texts concerning the writing process, such as "guidelines for dissertation writing"). It also demonstrates that what initially appears as a paucity of Inscribed and Evoking signals in STUCORP, and thus a deficiency in students' writing, may not in fact be the case when contextual and situational factors are taken into account, an aspect which Tognini-Bonelli (2001) regards as important in the analysis of corpus data.
5. Concluding remarks

This paper has shown the potential value of exploiting the APPRAISAL system for coding keyword lexis for the Problem-Solution pattern into Inscribed and Evoking items. I have also made the point that to fully interpret this evaluative lexis, especially the Evoking items, it is necessary to have recourse to the "context of situation", a dimension of language which is inherent in an SFL approach to text analysis. This framework also provides a good starting point from which to examine the lexico-grammatical patterning of the key-key words discussed in this article, which is presented in Flowerdew (2003). Now that corpus studies and SFL are becoming more closely aligned, it is expected that other subdomains of the APPRAISAL system will be exploited in future corpus-based research.
References Aston, G. and Burnard, L. 1998. The BNC Handbook. Edinburgh: Edinburgh University Press. Bondi, M. 2001. “Small corpora and language variation: Reflexivity across genres”. In M. Ghadessy et al. (eds), 135–174. Carter, R. 1992. Vocabulary: Applied Linguistic Perspectives. London: Routledge. Flowerdew, J. (ed) 2002. Academic Discourse. London: Longman. Flowerdew, L. 1998. “Corpus linguistic techniques applied to textlinguistics”. System, 26 (4):541–52.
Flowerdew, L. 2002. “Corpus-based analyses in EAP”. In J. Flowerdew (ed.), 95–114. Flowerdew, L. 2003. “A combined corpus and systemic-functional analysis of the ProblemSolution pattern in a student and professional corpus of technical writing”. TESOL Quarterly, 37(3):489–511. Francis, G. 1986. Anaphoric Nouns. Discourse Analysis Monographs, 11. English Language Research. University of Birmingham. Francis, G. 1994. “Labelling discourse: An aspect of nominal-group lexical cohesion”. In Advances in Written Text Analysis, M. Coulthard (ed.), 83–101. London: Routledge. Ghadessy, M., Henry, A. and Roseberry, R. (eds) 2001. Small Corpus Studies and ELT. Amsterdam: John Benjamins. Green, C., Christopher, C. and Lam, J. 2000. “The incidence and effects on coherence of marked themes in interlanguage texts: A corpus-based enquiry”. English for Specific Purposes, 19 (2):99–113. Halliday, M.A.K. 1994. Introduction to Functional Grammar. London: Edward Arnold. Hewings, M. and Hewings, A. 2002. “‘It is interesting to note that …’: a comparative study of anticipatory “it” in student and published writing”. English for Specific Purposes, 21 (4):367–383. Hoey, M. 1983. On the Surface of Discourse. London: George Allen and Unwin. Hoey, M. 2001. Textual Interaction. London: Routledge. Hunston, S. and Thompson, G. (eds) 2000. Evaluation in Text: Authorial Stance and the Construction of Discourse. Oxford: Oxford University Press. Hyland, K. and Milton, J. 1997. “Qualifications and certainty in L1 and L2 students’ writing”. Journal of Second Language Writing, 6 (2):183–205. Ivanič, R. 1991. “Nouns in search of a context: A study of nouns with both open- and closedsystem characteristics”. IRAL, 2:93–114. Lea, M. and Street, B. 1999. “Writing as academic literacies: Understanding textual practices in higher education”. In Writing Texts, Processes and Practices, C. Candlin and K. Hyland (eds), 62–81. London: Longman. Martin, J.R. 2000. “Beyond Exchange: APPRAISAL systems in English”. In S. Hunston and G. Thompson (eds), 142–75. Martin, J.R. 2004. “Sense and sensibility: Texturing evaluation.” In Language, Education and Discourse: Functional Approaches, J. Foley (ed.), 270-304. London: Continuum Press. Partington, A. 2001. “Corpus-based description in teaching and learning”. In G. Aston (ed.), Learning with Corpora, 63–84. Houston, TX: Athelstan. Ragan, P. 2001. “Classroom use of a systemic functional small learner corpus”. In M. Ghadessy et al. (eds), 207–236. Schmid, H-J. 2000. English Abstract Nouns as Conceptual Shells: From Corpus to Cognition. New York: Mouton de Gruyter. Scott, M. 1997. “PC analysis of key words and key key words”. System, 25 (2):233–45. Scott, M. 1999. WordSmith Tools. Oxford: Oxford University Press.
Scott, M. 2000. “Mapping key words to problem and solution”. In Patterns of Text, M. Scott and G. Thompson (eds), 109–127. Amsterdam: John Benjamins. Swales, J. 1990. Genre Analysis: English in Academic and Research Settings. Cambridge: Cambridge University Press. Tognini-Bonelli, E. 2001. Corpus Linguistics at Work. Amsterdam: John Benjamins. Tribble, C. 2002. “Corpora and corpus analysis: New windows on academic writing”. In J. Flowerdew (ed), 131–149. White, P.R.R. 2001. Appraisal Homepage. online: http://www.grammatics.com/appraisal/ (visited 05.01.2003). White, P.R.R. 1998/99. Outline of Appraisal. online: http://www.grammatics.com/ appraisal/ (visited 05.01.2003). Winter, E.O. 1977. “A clause-relational approach to English texts: A study of some predictive lexical items in written discourse”. Instructional Science, 6 (1):1–92.
Using a corpus of children's writing to test a solution to the sample size problem affecting type-token ratios
Ngoni Chipere, David Malvern and Brian Richards
Institute of Education, The University of Reading, UK
Corpus-based studies often measure vocabulary richness in terms of Type-Token Ratio (TTR) – the number of different word types in a text divided by the total number of word tokens. There is, however, a serious flaw with TTR: it is a function of text length and it decreases inexorably in normal texts with increasing token counts. In view of widespread lack of awareness of this fact, we describe the sample size problem affecting TTR and discuss several kinds of solutions that have been proposed over the years. We argue that most of these solutions either fail to engage fully with the sample size problem or incur undesirable outcomes such as under-utilising data or generating results that are incommensurate across corpora. We then describe a solution that is based on modelling the relationship between TTR and token counts. The index of diversity in the model is a single parameter, D, whose value is obtained by fitting a curve to values of TTR taken from sub-samples of increasing size drawn at random from throughout the text. The model has been implemented in a freely available computer program called vocd. The advantage of D over TTR is demonstrated empirically in this paper by an analysis of a graded corpus of children's writing. The analysis shows that developmental measures such as word length, text length and spelling accuracy correlate significantly and positively with D but not with TTR.
1. Introduction

Corpus linguists have accumulated very large amounts of language data which they often wish to analyse quantitatively. The explosive increase in data has, however, far outpaced the development of the requisite quantitative methods of analysis. This situation can create pitfalls for researchers who are not
intimately familiar with quantitative research methods but who wish to make numerical statements about corpora. A prime example of such pitfalls is the widespread use of the type-token ratio (TTR) to measure lexical diversity. In this paper, we show that this measure is seriously flawed and, consequently, leads to flawed research. We then describe an alternative measure of lexical diversity which avoids the in-built limitations of TTR.

Our discussion is largely based on the child language development literature – an area in which corpus methods took root earlier than in the area of adult foreign language learning. The two fields share common interests in the use of computerized corpora and vocabulary measures such as TTR. Therefore the issues that we address in this paper apply to both fields with equal force, and the superficial differences in data sets should not be allowed to obscure underlying commonalities of method.

TTR is calculated by dividing the number of different word types (V) by the total number of word tokens (N) in a text. The tired example from introspective linguistics "The boy kicked the ball" consists of 4 word types – "the", "boy", "kicked", "ball" – and 5 tokens – "the" x 2, "boy", "kicked", "ball". Many researchers appear to have assumed that the ratio of types to tokens is constant over a given text.1 TTR is clearly a function of text length, however, and in normal texts, it decreases with increasing token counts. A simple demonstration of this fact is provided in the next paragraph.

Suppose we divide a 200-word English text into segments of 100 words each and find that there are fifty different types of words in each segment. Dividing 50 types by 100 tokens gives us a TTR of 0.5 for each segment. In a normal text, certain words will occur in both segments (for instance, the definite article is highly likely to show up in both segments). Therefore, if we add up the number of word types in both segments we should obtain 50 (from the first segment) plus some figure less than 50 from the second segment (given that some of the word types in the second segment will already have appeared in the first segment). Supposing for the sake of argument that the second segment introduces only 25 new word types, the new TTR is (50+25)/200 = 0.375, which is less than the original 0.5. If we had segmented our text into even smaller segments, then repeating the process would give us fewer and fewer new words being introduced in subsequent segments. Thus the value of TTR diminishes inexorably with increasing token count (the sole exception to this being the unlikely scenario in which the number of types is equal to the number of tokens). This argument shows that TTR is dependent on sample size – shorter
texts are likely to produce higher values of TTR than longer texts regardless of the actual diversity of vocabulary in either text. Richards and Malvern (1997) showed how this flaw has resulted in contradictory and uninterpretable research findings in the child language development literature. The examples they provide include cases where the text length effect produces results which indicate a) no differences in TTRs taken from transcripts of children at different levels of language development; b) lower TTRs for more advanced versus less advanced children and c) a lack of correlation between TTR and other measures of language development.

Despite these problems, a large number of researchers in various areas of corpus-based research continue to use TTR uncritically (some recent examples being Girolametto et al. 2002; Delaney-Black et al. 2000; Colwell et al. 2002). This situation is clearly unsatisfactory as it can give rise to flawed research in applied linguistics. The propagation of invalid research findings, in turn, has the potential to mislead language teachers, and this can have undesirable outcomes for learners down the line.

While the use of raw TTR is widespread, other researchers have long been aware of the problem and some have tried to correct it in several ways. At least three kinds of solutions have been proposed, involving either a) controlling for text length; b) transforming TTR or c) modelling the curvilinear relationship between TTRs and token counts. These solutions are described briefly below and the first two are shown to be flawed. The article then describes a solution proposed by Richards and Malvern (1997) in the form of a measure of lexical diversity called D.

In addition to a mathematical demonstration of the advantages of D over TTR, the paper provides an empirical comparison of D to raw TTR through the analysis of a corpus of children's writing. The analysis involved the calculation of various developmental measures, such as spelling, word length and text length, and we would expect these measures to correlate positively with any valid measure of lexical diversity. The mathematical arguments led us to predict that D would correlate more strongly with these other measures than raw TTR, which one would actually expect to be negatively correlated with a developmental measure like text length. We did not seek to compare D to the various transformations of TTR because some of these transformations, such as CTTR and RTTR (described below), are positively correlated with sample size. A strong positive correlation between transformations of TTR and other developmental measures could therefore simply be a sample size effect. We refer readers who wish to satisfy themselves on this
point to a number of empirical demonstrations of sample size effects produced by transformations of TTR (Arnaud 1984, Hess et al. 1989, Hess et al. 1986, Menard 1983 and Richards 1987). We now describe the kinds of solutions that have been proposed to deal with the sample size problem.
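The diminution of TTR with text length described in the preceding paragraphs is easy to verify computationally. The following Python sketch is purely illustrative (the toy vocabulary and segment sizes are assumptions, not data from the study): it measures TTR over increasingly long stretches of a tokenised text and shows the value falling as the token count grows.

```python
import random

def ttr(tokens):
    """Type-Token Ratio: number of distinct word types / number of tokens."""
    return len(set(tokens)) / len(tokens)

def ttr_curve(tokens, step=100):
    """TTR measured over increasingly long prefixes of the text."""
    return [(n, ttr(tokens[:n])) for n in range(step, len(tokens) + 1, step)]

# Toy demonstration: a 'text' drawn from a fixed stock of 60 word types
random.seed(1)
vocabulary = [f"word{i}" for i in range(60)]
text = [random.choice(vocabulary) for _ in range(1000)]
for n, value in ttr_curve(text, step=200):
    print(n, round(value, 3))   # TTR falls steadily as the token count grows
```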
2. Controlling for text length

Solutions in this category seek to address the sample size problem by standardizing the lengths of the texts for which comparisons are to be made. The aim of standardization is to eliminate the artificial advantage that short texts have over longer texts in calculations of TTR. For instance, in child language, Stickler (1987) proposed that text length should be standardized by using 50 utterances taken from the middle of a transcript. However, this method does not actually eliminate the text length effect because advanced children produce more words per utterance than less advanced children. There is, therefore, the likelihood that the TTR values of more advanced children will be at best depressed and, at worst, smaller relative to those of less advanced children.

A refinement on Stickler's proposal is to standardize the number of tokens, but this solution is not entirely satisfactory. Standard text lengths vary from 1000 tokens (Wachal and Spreen 1973 and Hayes and Ahrens 1988), 400 tokens (Biber 1988 and Klee 1992), 350 tokens (Hess et al. 1986) to 50 tokens (Stewig 1994). These variations in token counts are problematic because a set of TTRs calculated from shorter text segments will be higher than those calculated from a set from longer text segments. The two sets of TTRs will therefore be incommensurate, thus preventing comparisons being made across corpora. It should also be noted that the difficulty of arriving at a universal standard cannot be solved simply by negotiating a consensus among researchers. This is because there are wide variations in the lengths of transcripts from different sources. For instance, many transcripts of child language data are much shorter than those of adult language data. A universal standard based on the length of child language transcripts would involve a considerable waste of adult language data.

An alternative to the standard text length solution is the Mean Segmental Type-Token Ratio or MSTTR (Johnson 1944). This measure involves calculating the mean TTR for consecutive equal-length segments of text. This method
is implemented in the popular WordSmith Tools software (Scott, personal communication; Scott 1996). The advantage of MSTTR over standardising the number of tokens is that a) the size of the smallest transcript in a corpus can be used as the size of the standard segment and b) nearly all the data are used. However, a problem remains in that MSTTRs based on short segments are higher than those based on longer segments. Given that different analysts may opt for different segment sizes in order to make the best use of their data, the potential still exists for MSTTR to produce results that are incommensurate across corpora. A more detailed examination of the disadvantages of MSTTR is contained in Malvern and Richards (2002:88).
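For readers who want to experiment with the measure just described, the sketch below shows the MSTTR calculation as it is characterised above; it is a minimal illustration, not the WordSmith Tools implementation, and the default segment size of 100 is an arbitrary choice.

```python
def msttr(tokens, segment_size=100):
    """Mean Segmental TTR: mean TTR over consecutive equal-length segments.
    A trailing partial segment is discarded, so nearly all the data are used."""
    segments = [tokens[i:i + segment_size]
                for i in range(0, len(tokens) - segment_size + 1, segment_size)]
    ttrs = [len(set(seg)) / len(seg) for seg in segments]
    return sum(ttrs) / len(ttrs)
```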
3. Transforming TTR

The second type of solution involves transforming TTRs in various ways. For instance, Guiraud (1960) proposed a measure called Root Type-Token Ratio (RTTR), which he calculated by dividing the number of types by the square root of the number of tokens. Carroll (1964) proposed the Corrected Type-Token Ratio (CTTR), which is obtained by dividing the number of types by twice the square root of the number of tokens. Finally, Herdan (1960) proposed the Bilogarithmic Type-Token Ratio, which is obtained by dividing the logarithm of the number of types by the logarithm of the number of tokens. None of these transformations, however, actually solves the problem of text length, as shown by repeated empirical demonstrations (e.g. Hess et al. 1986, Hess et al. 1989, Richards 1987).
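The three transformations can be summarised in a few lines of code. The sketch below follows the verbal descriptions above; note that CTTR is also often stated in the literature as V/√(2N) rather than V/(2√N), so the version here should be read as an illustration of the formulation given in this paper rather than a definitive reference implementation.

```python
import math

def rttr(types, tokens):
    """Guiraud's Root TTR: types divided by the square root of tokens."""
    return types / math.sqrt(tokens)

def cttr(types, tokens):
    """Carroll's Corrected TTR, as described above: types divided by
    twice the square root of tokens."""
    return types / (2 * math.sqrt(tokens))

def herdan(types, tokens):
    """Herdan's Bilogarithmic TTR: log of types divided by log of tokens."""
    return math.log(types) / math.log(tokens)
```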
4. Modelling the TTR-token curve

Most of the solutions discussed above seek to combat the effect of text length on TTR. The third class of solutions has sought instead to exploit this effect by modelling the diminution of TTR with increasing token counts. This paper will focus on the solution proposed by Richards and Malvern (1997), and readers interested in descriptions of related models may consult Baayen (2001) and Richards and Malvern (1997). An independent evaluation of one application of Richards and Malvern's equation has been undertaken by Jarvis (2002).

Richards and Malvern (1997) present an equation which describes the
family of curves obtained when TTR values are plotted against token counts. These curves lie between the two extremes of total diversity and zero diversity. In the case of total diversity, the number of types equals the number of tokens throughout the text and TTR has a constant value of 1, resulting in a straight line of zero slope. In the case of zero diversity, the total number of types is 1 throughout the text and TTR = 1/N (number of tokens) for increasing values of N. The result is a curve which falls steeply from a value of 1 and then gradually flattens as it approaches the token axis asymptotically. The TTR-token curves of actual texts lie between the two theoretical extremes, with increasing lexical diversity represented by increasingly higher and shallower curves and decreasing diversity represented by increasingly lower and steeper curves. The equation which Richards and Malvern derive from Sichel (1986) to describe this family of curves is:

TTR = (D/N) [ (1 + 2N/D)^{1/2} − 1 ]

where TTR = Type-Token Ratio, N = number of tokens and D is the parameter that serves as the index of diversity.
5. Implementation of D

An algorithm for computing values of D from transcripts is described in McKee, Malvern and Richards (2000). TTR values are obtained by calculating TTRs for increasing values of N from N = 35 to N = 50. Each point is averaged from 100 sub-samples drawn randomly from the text without replacement. D is then obtained by fitting a curve to the points. The algorithm has been implemented in a C program called vocd, also described in McKee et al., which runs on UNIX, PC and Macintosh platforms as part of the CLAN suite of programs (MacWhinney 2000).
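A rough Python sketch of the procedure just outlined is given below. It is not the vocd program itself, only an approximation of the described steps: average TTRs over 100 random sub-samples at each size from 35 to 50 tokens, then fit the model curve to recover D. The use of SciPy's curve_fit and the starting value for D are assumptions made for this illustration; the input must contain at least 50 tokens.

```python
import random
import numpy as np
from scipy.optimize import curve_fit

def ttr_model(n, d):
    """Richards and Malvern's model: TTR = (D/N) * (sqrt(1 + 2N/D) - 1)."""
    return (d / n) * (np.sqrt(1 + 2 * n / d) - 1)

def estimate_d(tokens, sizes=range(35, 51), trials=100):
    """Average TTR over random sub-samples (without replacement) at each
    sample size, then fit the model curve to the points to obtain D."""
    ns, ttrs = [], []
    for n in sizes:
        values = [len(set(random.sample(tokens, n))) / n for _ in range(trials)]
        ns.append(n)
        ttrs.append(sum(values) / trials)
    (d,), _ = curve_fit(ttr_model, np.array(ns), np.array(ttrs), p0=[50.0])
    return d
```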
6. Validation of D

D has been validated in a number of analyses of corpora containing data from first and foreign language learning and academic writing (see Malvern and Richards 2000). Independent applications of D are reported in Berman and Verhoeven (2002), Wright et al. (2002) and Owen and Leonard (2002). This paper reports a study which compares TTR to D in the measurement of lexical diversity in a corpus of children's writing. Developmental writing is an ideal test-bed for the measure because we would expect lexical diversity to increase with increasing language ability, and a valid measure of lexical diversity should reflect this increase. Clearly, though, its implications are relevant for other types of corpus studies as well. The study is described below.
7. Methods

7.1 Materials

918 narrative essays at least 50 words long were analysed. The essays were collected from various schools in England by the Research and Evaluation Department of The University of Cambridge Local Examinations Syndicate (UCLES). The essays cover a cross-section of pupils at the end of three phases of education referred to as Key Stages 1, 2 and 3 in the English school system (i.e., at ages 7, 11 and 14 years). The essays also cover seven (out of a possible eight) levels of writing ability as defined by the National Curriculum for English. All the pupils were asked to write a narrative essay beginning with the sentence "The gate was always locked, but on that day someone had left it open …".
7.2 Data preparation

10 markers employed by UCLES assigned scores to the scripts on the basis of National Curriculum Level descriptors (see QCA 2000). The markers were unaware of the age, sex or ability level of the pupils (though handwriting may have given clues as to age). Scores for each essay were assigned separately by at least two markers and later averaged to obtain the final score. In a few cases, scores from the two markers diverged considerably and the final score was
decided through negotiation. The final score was used to assign each script to one of eight possible National Curriculum Levels of writing ability (in the event, only the seven lower levels were needed). The graded scripts were then keyed into machine-readable form without editing of spelling or punctuation. It was, however, subsequently necessary to correct spelling errors in order to prevent them from being treated as different types and thereby inflating the type counts of inconsistent spellers. Spelling errors were corrected using a computer program that flagged as potential spelling errors any words that were found in the corpus but not in a dictionary list. It was then up to the human editor to decide if a specific word was indeed a spelling error and, if so, what the correct spelling ought to be.
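The flagging step can be approximated in a few lines. The sketch below is only an illustration of the idea under the assumption that a plain word list is available; the actual program used in the study is not specified beyond the description above.

```python
def flag_possible_misspellings(corpus_words, dictionary_words):
    """Return corpus word types that are absent from the dictionary list;
    a human editor then decides which are genuine spelling errors."""
    dictionary = {w.lower() for w in dictionary_words}
    return sorted({w.lower() for w in corpus_words} - dictionary)
```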
7.3 Analytical procedure

Essays were analysed using vocd (McKee et al. 2000) via the CLAN interface using batch processing commands. After all the scripts had been processed, the output files from vocd were concatenated into a single file and a utility program was used to extract values of D for each essay and produce a spreadsheet of all the results. The data were analysed using the SPSS statistical package.
8. Results

The results of a Pearson's correlation test are shown in Table 1 below. All correlations are significant at the p<.01 level except for the correlation between TTR and Level, which is not statistically significant.
Table 1. Correlations between level and text variables

              LEVEL    D       Word length  Spelling  Text length  TTR
LEVEL         1
D             0.577    1
Word length   0.632    0.59    1
Spelling      0.592    0.293   0.36         1
Text length   0.731    0.449   0.433        0.453     1
TTR           –0.035   0.385   0.211        –0.142    –0.263       1
The results show that Level correlates positively and significantly with all the quantitative variables with the exception of TTR. In other words, all the quantitative variables except for TTR are related to the qualitative grading of the essays as measured by Level. As might be expected from purely theoretical considerations, TTR correlates negatively with Text Length.
9. Discussion

TTR is not a valid developmental measure because it suggests that improvements in certain writing skills such as word length, text length and spelling are accompanied by decreasing lexical diversity. Counter-intuitive results of just this sort have been obtained by many other researchers who have used TTR as a developmental measure (see articles reviewed by Richards and Malvern 1997). The result is a direct consequence of the flaw associated with TTR: it decreases with increasing text length, a point underscored by the weak but statistically significant negative correlation between TTR and text length.
10. Conclusion

This paper has described the sample size problem affecting the Type-Token Ratio – a commonly used measure of lexical diversity. A number of solutions to the problem were discussed and it was argued that solutions based on controlling for text length or transforming TTR in some way either fail to address the problem of sample size fully or incur problems which render them unsatisfactory. A solution based on modelling the way TTR falls with increasing token counts was described and applied to the analysis of a corpus of children's writing. The results showed that, in contrast to D, TTR does not correlate positively with other quantitative developmental measures such as word length, text length and spelling accuracy or with the qualitative measurement of writing ability in terms of National Curriculum Levels. This counter-intuitive result has been shown to be an artefact of a flawed measure. In contrast, D produced a meaningful result concerning the relationship between lexical diversity and other developmental measures. D therefore contributes to language learning research in two ways – it provides a reliable index of lexical diversity and it shows, as many have long supposed but have not always been able to
demonstrate, that lexical diversity develops in tandem with other linguistic skills. We hope that our technical description of the sample size problem and its solution does not dissuade readers from using quantitative techniques in corpus research but rather alerts them to the need for greater methodological sophistication in the analysis of corpora.
Acknowledgment

The authors would like to thank Marco Baroni for helpful comments made on a draft of this article.
Note

1. TTR is a proportion rather than a true ratio because the numerator and denominator are not mutually exclusive.
References Arnaud, P.J.L. 1984. “The lexical richness of L2 written productions and the reliability of vocabulary tests.” In Practice and Problems in Language Testing: Papers from the International Symposium on Language Testing, T. Culhane, C. Klein Bradely and D. Stevenson (eds), 14–28. Colchester: University of Essex. Baayen, H. 2001. Word Frequency Distributions. Dordrecht: Kluwer. Berman, R., and Verhoeven, L. 2002. “Cross-linguistic perspectives on the development of text-production abilities: Speech and writing.” Written Language Literacy 5 (1), 1–43. Biber, D. 1988. Variation across Speech and Writing. Cambridge: Cambridge University Press. Carroll, J. 1964. Language and Thought. Englewood Cliffs: Prentice-Hall. Colwell, K., Hiscock, C.K. and Memmon, A. 2002. “Interviewing techniques and the assessment of statement credibility.” Applied Cognitive Psychology 16:287–300. Delaney-Black , V., Covington, C., Templin, T., Kershaw, T., Nordstrom-Klee, B., Ager, J., Clark, N., Surendran, A., Martier, S. and Sokol R.J. 2000. “Expressive language development of children exposed to cocaine pre-natally: Literature review and report of a prospective cohort study.” Journal of Communication Disorders 33(6):463–481. Girolametto, L., Bonifacio, S., Visini, C., Weitzman, E., Zocconi, E. and Pearce, P. 2002. “Mother-child interactions in Canada and Italy: Linguistic responsiveness to late-
talking toddlers.” International Journal of Language and Communication Disorders 37(2):153–171. Guiraud, P. 1960. Problèmes et méthodes de la statistique linguistique. Dordrecht: D. Reidel. Hayes, D. P., and Ahrens, M. G. 1988. “Vocabulary simplification for children: A special case of ‘motherese’”. Journal of Child Language 15:395–410. Herdan, G. 1960. Type-token Mathematics: A Textbook of Mathematical Linguistics. The Hague: Mouton. Hess, C., Haug H.T. and Landry, R. 1989. “The reliability of type-token ratios for the oral language of school age children.” Journal of Speech and Hearing Research 32: 536–540. Hess, C., Sefton K. and Landry, R. 1986. “Sample size and type-token ratios for oral language of preschool children.” Journal of Speech and Hearing Research 32:536–540. Klee, T. 1992. “Developmental and diagnostic characteristics of quantitative measures of children’s language production.” Topics in Language Disorders 12(2):28–41. Jarvis, S. 2002. “Short texts, best-fitting curves, and new measures of lexical diversity.” Language Testing, 19 (1):57–84. Johnson, W. 1944. “Studies in language behavior: I. A program of research.” Psychological Monographs 56:1–15. McKee, G., Malvern, D. and Richards, B. 2000. “Measuring vocabulary diversity using dedicated software.” Literary and Linguistic Computing 15(3):323–338. MacWhinney, B. 2000. The CHILDES Project Vol 1: Tools for Analysing Talk – Transcription Format and Programs. New Jersey: Lawrence Erlbaum. Malvern, D. and Richards, B. 2000. “A new method of measuring lexical diversity in texts and conversations.” TEANGA 19:1–12. Malvern, D. and Richards, B. 2002. “Investigating accommodation in language proficiency interviews using a new measure of lexical diversity”. Language Testing 19:85–104. Menard, N. 1983. Mesure de la richesse lexicale. Geneva: Slatkine. Owen, A. and Leonard, B. 2002. “Lexical diversity in the speech of normally developing and specific language impaired children.” Poster presented at the Symposium for Research in Child Language Disorders 2002, Madison. QCA. 2000. English Tests: Mark Schemes. London, HMSO. Richards, B. and Malvern, D. 1997. Quantifying Lexical Diversity in the Study of Language Development. The University of Reading: The New Bulmershe Papers. Richards, B. 1987. “Type/token ratios: What do they really tell us?” Journal of Child Language 14: 201–209. Scott, M. 1996. WordSmith Tools. Oxford: Oxford University Press. Stewig, J. 1994. “First graders talk about paintings.” Journal of Educational Research 87(5): 309–316. Stickler, K. 1987. Guide to Analysis of Language Transcripts. Eau Claire: Thinking Publications. Wachal R. and Spreen, O. 1973. “Some measures of lexical diversity in aphasic and normal language performance.” Language and Speech 16:169–181. Wright, H., Silverman, S. and Newhoff, M. 2002. “Measures of lexical diversity in aphasia.” Paper presented at the Clinical Aphasiology Conference 2002, Ridgedale.
Corpora for learners
Comparing real and ideal language learner input: The use of an EFL textbook corpus in corpus linguistics and language teaching
Ute Römer
University of Hanover, Germany
While there is substantial research in the field of corpus linguistics and language teaching based on native-speaker and non-native speaker corpora, the language of EFL teaching materials has not been systematically analysed so far. The present paper discusses the construction and use of an electronic corpus of EFL textbook texts, focussing on the potential of such a corpus in applied corpus linguistics. A case study on if-clauses in spoken English and “school” English demonstrates in which ways analyses based on textbook corpora can lead to valuable insights for linguists and language practitioners and how they may help to improve teaching materials.
1. Introduction

It is widely accepted that native speaker corpora and non-native speaker or learner corpora can be very useful in foreign language learning and teaching. Many linguists value the great impact of these data collections and base their research on the analysis of native speaker and learner output. So far comparatively few people have studied learner input. I would like to argue that, next to analysing the language produced by learners and the language produced by more competent speakers of English, it will also be helpful to look at the input pupils actually get in their English lessons. A large part of this language learner input is represented by the textbooks used in teaching English as a Foreign Language (EFL textbooks).1

In the present paper it will be discussed how an electronic corpus of EFL textbook texts can be used to help us answer two crucial questions related to language teaching: "Do we teach our pupils authentic English, i.e. do we confront them with the same type of English they are likely to be confronted with in natural communicative situations?" and "What can we do to improve EFL teaching materials?". The creation and the design of one particular corpus, consisting of texts taken from German EFL textbooks, will be described. To provide evidence of the usefulness of such a corpus for researchers, the results of an empirical study on if-clauses will be presented and evaluated with respect to their significance for English Language Teaching (ELT).
2. The linguistic potential of an EFL textbook corpus

The first question which will be dealt with here is "Why do we need an EFL textbook corpus?". To start with, an EFL textbook corpus can obviously be used to systematically analyse EFL textbook language, i.e. the kind of language pupils are confronted with when learning English. Also, as the title of this paper suggests, such a corpus is what we need if we want to compare what some people term "school English" (cf. e.g. Mindt 1996:232) with authentic or "real" English as represented in a reference corpus; or, in other words, if we want to compare real language learner input with what I would like to call ideal learner input.

Like any other computerised corpus, an electronic corpus of EFL textbook texts can be used to calculate frequencies of occurrence of single lexical items or word combinations (phrases, multi-word items).2 It also allows a closer analysis of contextual phenomena and provides answers to questions like "In which contexts are certain lexemes presented in the textbooks (as opposed to native speaker corpora)?" and "Do we find the same collocational patterns in EFL textbooks that we find in a native speaker corpus?". Different meanings of polysemous words or structures can be found and analysed, and the lexical-grammatical progression in textbooks can be traced. Finally, a textbook corpus may reveal more about the status of authenticity in English language teaching – an issue we will briefly deal with in the next section.
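The frequency and multi-word-unit counts referred to above can be obtained from any tokenised corpus with a short script. The sketch below is a generic Python illustration (the toy sentence is invented for the example and is not taken from GEFL TC or the BNC).

```python
from collections import Counter

def ngram_frequencies(tokens, n=2):
    """Frequency list of n-grams (e.g. two-word clusters) in a tokenised corpus."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "if you want to go if you like".split()
print(ngram_frequencies(tokens, n=2).most_common(3))
# [(('if', 'you'), 2), (('you', 'want'), 1), (('want', 'to'), 1)]
```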
3. Authenticity in the classroom – a controversial topic

The problem of authenticity in ELT has been discussed for many years in numerous scholarly books and articles (e.g. Amor 2002, Breen 1985, Taylor 1994) and the debate has recently resurfaced (cf. Widdowson 2000 and Stubbs 2001). What authenticity really means in a language teaching context, which different types of authenticity play a role, and whether or not we want to teach authentic English to our pupils are highly controversial questions among linguists and didacticians. Basically, the discussion centres around the question "Should English produced in natural communicative situations form the basis of our teaching or should we use invented texts and examples specifically created for the purpose of teaching in our course materials?". It may thus be interesting to see what the situation is like at present and how natural the language which learners are presented with in the EFL classroom really is.

After an analysis of several EFL textbooks used in German grammar schools my observation was that what we tend to find in these books is a simplified, non-authentic kind of English. Pupils are mainly presented with invented sentences, sentences which probably have not occurred in any natural speech situation before (and which probably never will). The short dialogue in (1), from an introductory EFL textbook, may serve to illustrate this phenomenon.

(1) MR SNOW: Hello, Wendy.
    MRS SNOW: Hello, Ron.
    MR SNOW: Where are the girls? Are they packing?
    MRS SNOW: Yes, they are.
    MR SNOW: Or are they playing?
    MRS SNOW: No, they aren't, Ron. They are packing.
    (Schwarz 1997:45)
It is rather doubtful whether texts like this can better serve the purpose of preparing learners for the English they are likely to encounter in real life than (parts of) dialogues that have actually occurred. In order to introduce the use of present progressive forms in yes/no-questions (which is the purpose of the given dialogue extract), it would possibly make more sense to present pupils with authentic examples of progressives that are in fact, as a detailed large-scale analysis has shown (cf. Römer in preparation), comparatively common in interrogative contexts. The forms happening, talking, listening, and staying, for instance, occur particularly often in questions. Hence I would argue that the examples in (2) to (5), all from the spoken part of the BNC, may be a better choice than the playing- and packing-examples in (1) (the phrase "are they packing" does not occur in BNC_spoken; "are they playing" occurs only once).
(2) What's happening now, does anybody know?
(3) What are we talking about, what's the subject?
(4) Are you listening to me?
(5) Are you staying at your mum's tonight? – No. I'm staying at Christopher's.
Doubts about the use and usefulness of invented language in ELT have of course been expressed by several distinguished linguists before, and many arguments have been put forward in favour of authentic examples. Firth observed that many of the examples found in grammar books (he quotes the sentence "I have not seen your father's pen, but I have read the book of your uncle's gardener.") are "just nonsense" from a semantic point of view (1957:24). Sinclair in his 1991 seminal monograph calls it an "absurd notion that invented examples can actually represent the language better than real ones" (ibid: 5). One of the advantages of a real example clearly is that it has in fact occurred in real speech or writing and that it is thus part of the "used" language in Brazil's terms (cf. Fox 1987:143). Sinclair's well-known precept for language teaching, "[p]resent real examples only", is followed by the entirely plausible statement "[l]anguage cannot be invented; it can only be captured" (1997:31). Another reason why we might want to replace invented examples with real ones is that we may "hinder the development of fluency by excluding data samples that fluent native speakers actually say" (de Beaugrande 2001b:39). The confrontation with larger amounts of authentic language material will probably help learners become more confident in their use of the foreign language and help them achieve a greater degree of naturalness (cf. Fox 1987:149).

The assumption that "textbooks are more useful when they are based on authentic native English" (Granger 1998:7) is, however, not shared by all linguists, some of whom do not seem to be very much in favour of authenticity in the classroom or do not consider the use of invented sentences problematic. Cook states that "[t]he utterances in attested data have also been invented, though for communication rather than illustration" (2001:376) and thus plays down the attested versus invented sentences problem, putting both kinds of examples on one level. A major difference, however, still lies in the "for communication" Cook mentions. Attested utterances "invented" by competent speakers in a communicative context will probably differ significantly from utterances invented by materials designers in order to illustrate a certain language phenomenon. The former, being genuine examples, arise from a specific context and serve a particular pragmatic function, while the only purpose of the
latter is the exemplification of the grammatical structure dealt with in the specific textbook unit. Widdowson even considers it "impossible" to use authentic English in a language teaching context and states that "[t]he language cannot be authentic because the classroom cannot provide the contextual conditions for it to be authenticated by the learners" (1998:711). Even if it may sound cogent that we cannot transfer the whole context of an actual conversation into the classroom, it should be possible and worth trying to transfer at least part of it and thus achieve a higher degree of authenticity in ELT. On a similar note, with reference to the problem of contextual transferability, Michael McCarthy has stated that students are used to and usually very good at recontextualising things because they do that all the time, e.g. while watching soap operas on TV.3 Therefore, I would like to claim that there is a need for more authentic, naturally produced, non-invented examples in EFL teaching.
4. GEFL TC: Corpus compilation and composition

Having stressed the need for authenticity in ELT and having said something about the research potential of an EFL textbook corpus, I will now come to a description of one particular corpus, the German English as a Foreign Language Textbook Corpus (GEFL TC). The need to compile this corpus arose as there were no ready-made computerised collections of German EFL textbook texts available and as it would have been a rather time-consuming undertaking to read through several textbook volumes each time I was looking for a particular language item in order to examine its use.4 Besides, all the advantages of any computer corpus (as opposed to non-computerised data collections) also hold true for an electronic collection of textbook text. One such advantage is the possibility to examine at a glance many occurrences of a certain word or phrase in context.

According to Hunston's definition (2002:16) GEFL TC could be classified as a "pedagogic corpus".5 It consists of texts taken from twelve volumes of two introductory course book series (six volumes each) widely used in English language teaching in German secondary schools: Green Line New and English G 2000.6 The texts chosen are all supposed to represent spoken language. Exclusively written material, such as narratives, letters, or excerpts from novels, is not included in the corpus. Spoken texts were selected to enable a comparison with the spoken part of the BNC, which is used as the main source of authentic data in our comparative grammatical studies. This subcorpus was in turn selected because the importance of spoken(-based) language is very much stressed in the language teaching curriculum. There is a strong call for teaching materials which help to improve the pupils' communicative competence and prepare them for any prospective discourse with native or near-native speakers of English. Spoken texts appear to better serve this purpose than written samples. Examples of spoken-type texts included in GEFL TC are dialogues, interviews, speech bubbles, and narrative texts mainly consisting of dialogue.

Figure 1 gives an account of the composition of the corpus. The two subcorpora (English G 2000 and Green Line New) are of similar size and internal structure and thus offer the possibility of inter-textbook comparisons. The whole corpus counts 108,424 tokens, which is not a very impressive size for a language corpus judged by today's standards. It has to be kept in mind, however, that we are here dealing with a specialised corpus, which is supposed to represent the language of German secondary school level EFL course materials. Although such a small corpus can hardly be labelled representative of school English in general, it can still reveal a lot about the typical features of textbook language. With a corpus of this size certain kinds of analyses (e.g. of lower frequency language items) can of course not be made and the method of approaching the data has to be chosen accordingly (cf. Sinclair 2001).7

The corpus was compiled in four steps. First of all, appropriate pages (i.e., pages including spoken-based textual material) from the twelve coursebooks (see Appendix I) had to be selected. Each single page was digitised with the help of a scanner. The resulting image files could then be processed by OCR software. As Figure 2 shows, the parts of the text which were to be included in the corpus had to be highlighted manually (see grey shadings). Thus narrative passages and pictorial material could be excluded. The last step in the compilation of GEFL TC involved a conversion of the data into text format. It was then possible to analyse the text files with a concordance program (in our case WordSmith Tools, Scott 1996).
Figure 1. The composition of GEFL TC

Figure 2. Processing of textbook pages with OCR software (here TextBridge Pro 8.0)

5. "If you eat your hat, you'll be ill" – An example of corpus analysis with GEFL TC8

The starting point for the empirical analysis reported on in the following paragraphs was the fact that if-clauses are often described as a grammatical
problem area for (German) learners (and teachers). Even at an advanced level of language instruction students seem to have constant problems with conditional constructions and use if-clauses which contain errors.9 Especially the choice of the correct sequence of tense forms causes difficulties for learners, as we can see in the following two sentences produced by an advanced German learner of English: "If more and more words from one language will be mixed into another language, a new language form will be created." and "Our progress was not that far if we do not use so many american products."10 It therefore seemed to be worth having a closer look at conditionals in textbook English.

An analysis of all if-clauses (211 altogether) in the two GEFL TC subcorpora was carried out and the findings of this analysis were compared with the results of a query for "if" on the spoken part of the BNC. Among the features under investigation were the sequence of clauses in the conditional sentence (if-part in initial or in final position?), the sequence of different tense forms, and collocations to the left and to the right of the search word (in L2, L1, R1, and R2 position).

A first observation was that in GEFL TC the if-part, specifying the condition, occurs less frequently in if-clause initial position than it does in spoken British English. While 74.6% of the conditions came first in the BNC_spoken data, only 62.1% of the textbook examples started with "If …". Bald notes that in many school grammars the sequence of if-part and main part in conditional clauses does not seem to be determined in any way (cf. 1988:48). Learners might thus be (mis)led into thinking that alterations in the if-part/main part sequence are arbitrary and do not change the meaning of the sentence.

Other interesting results were obtained concerning the tense form distribution in if-clauses from GEFL TC and BNC_spoken. Figures 3 and 4 show the shares of different tense form combinations in the two parts of the if-clauses in both corpora under analysis. The first thing we notice in Figure 3 is that there are far more different combinations of tense forms to be found in spoken BNC data than in GEFL TC conditionals. The very different heights of some of the columns illustrate striking over- and underrepresentations of certain tense form sequences in textbook English when compared to natural spoken English, a phenomenon which becomes even more apparent in Figure 4. Here we can observe a strong overuse of certain tense form combinations in GEFL TC:
Figure 3. Tense form sequences in if-clauses, BNC (spoken) vs. GEFL TC, detailed

Note: The labels on the lower axis give the tense forms in if-part and main part of the clause in abbreviated form. The abbreviation "SPr – prtMODAL+inf", for instance, stands for simple present tense in the if-part and present tense modal verb plus infinitive in the main part, as in e.g. "But I won't win the competition if I just write about life on a space station." (Schwarz 1998:93). All tense form sequences are spelled out and exemplified in Appendix II.
Figure 4. Tense form sequences in if-clauses, BNC (spoken) vs. GEFL TC, summarised
– simple present (if-part) – present tense modal (mainly "will") + infinitive (main part), as in "Kay will be angry if I come back without his sword." (Ashford et al. 1997:52)
– simple past (if-part) – past tense modal (mainly "would") + infinitive (main part), as in "Rodney, if I was worried about stuff like that, it would make me crazy." (Schwarz 1999b:89)
– past perfect (if-part) – past tense modal "would" + have + participle (main part), as in "I wouldn't have been there if I hadn't gone the wrong way." (Ashford et al. 1998:37)
In textbooks and school grammars these three combinations are usually referred to as "type 1", "type 2", and "type 3" conditionals, respectively. A legitimate question one might ask in this context concerns the didactic significance of these labels. According to Bald (personal communication) the type-labels are rather useless as they do not reveal anything about the semantic functions of the clauses or their syntactic structures. Other sequences of tense forms are underrepresented in the textbook corpus, e.g. the simple present – simple present combination, with 23.7% the most frequent if-clause structure in the analysed BNC_spoken concordance lines, if-clauses with past tense modal + infinitive in both parts (e.g. "I'd be very grateful therefore if er you could put your minds to the options", BNC_spoken), and the structures subsumed under "other combinations" in Figure 4. For corpus examples of these and other tense form sequences in if-clauses see Appendix II.

As to the collocational analysis, it could be noticed that "if" has different collocates immediately to the right (in positions R1 and R2) in BNC_spoken and EFL textbook corpus data. While the personal pronoun "you" is by far the most frequent right-hand collocate of "if" in spoken British English, the cluster "if you" is only one among other frequent clusters in GEFL TC. Combinations that are much more frequent in this corpus than in BNC_spoken are "if I", "if we", "if they", and "if he". In R2 position the textbook if-clauses feature a number of past tense forms, especially "was" and "had", none of which could be found in this position in any of the examined BNC_spoken concordance lines.
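Positional collocate counts of the kind reported above (R1, R2 and so on) can be produced with a short script once a corpus is tokenised. The sketch below is an illustration only, not the procedure actually used in the study (which relied on concordance software); the toy sentence is invented for the example.

```python
from collections import Counter

def positional_collocates(tokens, node="if", offset=1):
    """Count the word forms occurring `offset` positions to the right of the
    node word (offset=1 gives R1 collocates, offset=2 gives R2; negative
    offsets give left-hand collocates such as L1 and L2)."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        j = i + offset
        if tok.lower() == node and 0 <= j < len(tokens):
            counts[tokens[j].lower()] += 1
    return counts

tokens = "I will come if you ask and if I can".split()
print(positional_collocates(tokens, "if", offset=1))
# Counter({'you': 1, 'i': 1})
```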
6. Pedagogical implications: What can be done to improve teaching materials?

As we have seen in this short analysis, there are some significant differences between the ways if-clauses are presented in GEFL TC and in the spoken part of the British National Corpus. These differences do not only apply to if-clauses taken from the textbook volumes for beginners (volumes 1–3), in which we would probably expect and approve of a simplified, less complex kind of language, but also to examples from some of the last units in the courses (textbook volumes 5 and 6) aimed at more advanced learners. A closer look at the type of English which pupils are confronted with in the EFL classroom helps us discover what exactly needs to be changed if we want to develop more authentic teaching materials, i.e. if we want to bring the linguistic features we teach closer to the linguistic features we observe in use.

The use of examples from (spoken) corpora instead of invented or constructed sentences like "If you eat your hat, you'll be ill" may be a step in the right direction. Lexico-grammatical items (e.g. different types of if-clauses) could be presented in roughly the same proportions as used in "real" English. We are not suggesting here that one should take our empirical findings to the limit and use exactly the same distribution of if-clause types in EFL textbooks that was found in spoken corpus data. It would probably mean asking too much of the pupils were they confronted with this huge variety of naturally-occurring tense form combinations. However, we ought not to conceal from learners the fact that if-clause types 1, 2, and 3 are not the only possible (and grammatical) types. There probably is not much use in teaching pupils things about a language which we know are not typical of real language use. Learning a variety of English they will rarely encounter in real-life situations is very unlikely to help learners communicate successfully with competent speakers of English. In this context Glisan and Drescher claim quite convincingly that "[…] if grammar is to be taught for communicative purposes, the structures presented should reflect their use in current-day native speaker discourse" (1993:24).

Also, it might be worth paying more attention to collocational patterns and contextual phenomena that are found in native speaker corpus data. A higher degree of authenticity can be achieved if lexical items (such as "if") are presented in the right context, i.e. the kind of context in which they typically appear in actual language use. As a final pedagogical implication, we could try to make language learners less afraid of using a structure for which there is no
example in their textbooks. An if-clause that can be categorised neither as "type 1", nor "type 2", nor "type 3" is not necessarily an ungrammatical or unacceptable if-clause.11 A quote from a contemporary English novel illustrates this point quite nicely. In this novel an English teacher tells one of his foreign students to try to get over her hang-up about speaking English. He says: "Don't let it become too important, okay? … It doesn't matter if it comes out sounding different from the textbooks …" (Parsons 2001:164). In the light of what has been described in the present paper we might even want to add something like "… it is sometimes even more natural if it does not".
7. Conclusion

The mismatches between BNC_spoken and GEFL TC data make it clear that, at least with respect to if-clauses, the language of German EFL textbooks does not mirror authentic language use. Thus, further studies on lexico-grammatical phenomena in textbook corpora may show how an improvement of English language teaching materials can be achieved on the basis of native speaker corpus data. More analytical and comparative studies of textbook language (and ideally of classroom language as a whole) will be necessary to discover more about the kind of English we teach, its differences from "real" English, and the status of authenticity in ELT. I agree with Glisan and Drescher, who state that "authentic language must continue to be examined if we are to use real language as the basis for our teaching" (1993:32), and I am convinced that further corpus-informed comparisons of authentic English and "school" English may lead to fruitful insights for linguists, language teachers, and language learners.
Acknowledgements

I acknowledge the support of my supervisor Wolf-Dietrich Bald and of Walter Pape, Dean of the Arts Faculty, University of Cologne, who helped me fund my participation in the TaLC5 conference in Bertinoro. I would also like to thank David Oakey for helpful comments on an earlier version of this paper and my TaLC audience for stimulating questions and remarks after my presentation.
Notes

1. It has to be kept in mind that the language in an EFL textbook does of course not represent classroom language in its entirety. It constitutes, however, a considerable part of school English, especially if we take into account that ELT (at least in German secondary schools) is very much based on textbooks.

2. For a detailed discussion of different types of multi-word items and for information on lexical connections between words in general see Moon 1997.

3. The statement was made during a discussion at the first Inter-Varietal Applied Corpus Studies Group (IVACS) Conference in Limerick, Ireland, June 15th 2002.

4. The major part of (if not all) empirical analyses of German EFL textbook language so far was carried out manually. Worth mentioning in this context are the works of Dieter Mindt and his colleagues at the Freie Universität Berlin (cf. e.g. Mindt 1987, Haase 1995, Schlüter 2002).

5. Another corpus of this type is the TEFL Corpus assembled by the COBUILD team in the mid-1980s. For a description of this project see Renouf 1987.

6. Pupils who use these textbooks are usually about ten years old at the beginning of the course and about sixteen at the end of it.

7. According to Sinclair the difference between corpora is one of method rather than one of size: "There is thus a fairly sharp contrast in method; the so-called small corpora are those designed for early human intervention (EHI) while the large corpora are designed for late or delayed human intervention (DHI)" (2001: xi).

8. The sentence quoted in this headline is typical of the kind of example if-clauses that can be found in the coursebooks included in the analysis. The example is taken from English G 2000, vol. 3 (Schwarz 1999a:29).

9. On this topic see also Bald (1988), who lists conditionals as one of the core problems in English grammar, not only for German learners but for learners in general.

10. Thanks go to Sven Naujokat for providing me with these examples from one of his grade 12 pupils' in-class essays (grade 12 is the penultimate year before the A-levels).

11. This is not supposed to imply that textbooks exclusively talk about if-clause types 1–3. In textbook and grammar sections for more advanced learners there is some information on mixed conditionals, usually a combination of the type 2 and type 3 if-clauses, e.g. "If I had big ones [muscles] like the Malleys, I'd never have been able to get through that hole in the fence" (Schwarz 2001:63), and on so-called "zero conditionals" (simple present in both parts) used to state general validities, e.g. "If you've got something with "Made in Sheffield" on it, that's quality." (Schwarz 2001:31). The main focus, however, is on the three types mentioned above.
References

Amor, S. 2002. Authenticity and Authentication in Language Learning: Distinctions, Orientations, Implications. Frankfurt: Peter Lang.
Bald, W. D. 1988. "If-Sätze im Englischen". In Kernprobleme der englischen Grammatik, W. D. Bald (ed), 38–50. Berlin: Langenscheidt-Longman.
Beaugrande, R. de. 2001a. "'If I were you…': Language standards and corpus data in EFL". Revista Brasileira de Linguística Aplicada 1 (1):117–154. Online: http://beaugrande.bizland.com/Ifiwereyou.htm (visited: 20.12.2003).
Beaugrande, R. de. 2001b. "Twenty challenges to corpus research. And how to answer them". Online: http://beaugrande.bizland.com/Twenty%20questions%20about%20corpus%20research.htm (visited: 20.12.2003).
Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan. 1999. Longman Grammar of Spoken and Written English. London: Longman.
Breen, M. P. 1985. "Authenticity in the language classroom". Applied Linguistics 6:60–70.
Conrad, S. 2000. "Will corpus linguistics revolutionize grammar teaching in the 21st century?" TESOL Quarterly 34:548–560.
Cook, G. 2001. "'The philosopher pulled the lower jaw of the hen.' Ludicrous invented sentences in language teaching". Applied Linguistics 22 (3):366–387.
Firth, J. R. 1957. Papers in Linguistics 1934–1951. London: Oxford University Press.
Fox, G. 1987. "The case for examples". In Looking Up, J. McH. Sinclair (ed), 137–149. London: Harper Collins.
Glisan, E. and V. Drescher. 1993. "Textbook grammar: Does it reflect native speaker speech?" The Modern Language Journal 77 (1):23–33.
Granger, S. (ed) 1998. Learner English on Computer. London: Longman.
Haase, I. 1995. Konditionalsätze im authentischen Sprachgebrauch und im Englischunterricht. Eine empirische Untersuchung. Berlin: Privatdruck.
Hunston, S. 2002. Corpora in Applied Linguistics. Cambridge: Cambridge University Press.
Mindt, D. 1987. Sprache – Grammatik – Unterrichtsgrammatik. Futurischer Zeitbezug im Englischen I. Frankfurt: Diesterweg.
Mindt, D. 1996. "English corpus linguistics and the foreign language teaching syllabus". In Using Corpora for Language Research, J. Thomas and M. Short (eds), 232–247. London: Longman.
Moon, R. 1997. "Vocabulary connections: Multi-word items in English". In Vocabulary: Description, Acquisition and Pedagogy, N. Schmitt and M. McCarthy (eds), 40–63. Cambridge: Cambridge University Press.
Parsons, T. 2001. One for my Baby. London: Harper Collins.
Renouf, A. 1987. "Corpus development". In Looking Up, J. McH. Sinclair (ed), 1–40. London: Harper Collins.
Römer, U. (in preparation). Progressives, Patterns, Pedagogy: A Corpus-driven Approach to Progressive Forms, their Functions, Contexts, and Didactics (working title).
Schlüter, N. 2002. Present Perfect: eine korpuslinguistische Analyse des englischen Perfekts mit Vermittlungsvorschlägen für den Sprachunterricht. Tübingen: Narr.
Scott, M. 1996. WordSmith Tools. Oxford: Oxford University Press.
Sinclair, J. McH. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Sinclair, J. McH. 1997. "Corpus evidence in language description". In Teaching and Language Corpora, A. Wichmann, S. Fligelstone, T. McEnery and G. Knowles (eds), 27–39. London: Longman.
Sinclair, J. McH. 2001. "Preface". In Small Corpus Studies and ELT: Theory and Practice, M. Ghadessy, A. Henry and R. L. Roseberry (eds), vii–xv. Amsterdam: John Benjamins.
Stubbs, M. 2001. "Texts, corpora, and problems of interpretation: A response to Widdowson". Applied Linguistics 22 (2):149–172.
Taylor, D. 1994. "Inauthentic authenticity or authentic inauthenticity?" TESL-EJ 1 (2):1–12. Online: http://www-writing.berkeley.edu/TESL-EJ/ej02/a.1.html (visited: 20.12.2003).
Widdowson, H. G. 1998. "Context, community, and authentic language". TESOL Quarterly 32 (4):705–716.
Widdowson, H. G. 2000. "On the limitations of linguistics applied". Applied Linguistics 21 (1):3–25.
Appendix I: Coursebooks

Ashford, S. et al. (eds) 1996. Green Line New 2. Stuttgart: Klett.
Ashford, S. et al. (eds) 1997. Green Line New 3. Stuttgart: Klett.
Ashford, S. et al. (eds) 1998. Green Line New 4. Stuttgart: Klett.
Ashford, S. et al. (eds) 1999. Green Line New 5. Stuttgart: Klett.
Ashford, S. et al. (eds) 2000. Green Line New 6. Stuttgart: Klett.
Aston, P. et al. (eds) 1995. Green Line New 1. Stuttgart: Klett.
Schwarz, H. (ed) 1997. English G 2000 A1. Berlin: Cornelsen.
Schwarz, H. (ed) 1998. English G 2000 A2. Berlin: Cornelsen.
Schwarz, H. (ed) 1999a. English G 2000 A3. Berlin: Cornelsen.
Schwarz, H. (ed) 1999b. English G 2000 A4. Berlin: Cornelsen.
Schwarz, H. (ed) 2001. English G 2000 A5. Berlin: Cornelsen.
Schwarz, H. (ed) 2002. English G 2000 A6. Berlin: Cornelsen.
Appendix II: Examples of if-clauses from BNC_spoken and GEFL TC (different tense form sequences, cf. figure 3)

SPr – SPr: Simple present – simple present ("TYPE 0")
BNC_spoken: we usually attend if it's in the South anywhere (d95)
GEFL TC: If you help us, this is yours. (GLN, vol. 4, p. 26)

SPr – prtMODAL+inf: Simple present – present tense modal + infinitive ("TYPE 1")
BNC_spoken: You can use more than one word if you want. (f72)
GEFL TC: If you eat your hat, you'll be ill. (EG 2000, vol. 3, p. 29)

patMODAL+inf – patMODAL+inf: Past tense modal + infinitive – past tense modal + infinitive
BNC_spoken: I'd be very grateful therefore if er you could put your minds to the options (f7a)
GEFL TC: If the carpet makers couldn't sell their carpets, they wouldn't have any reason to use children as cheap workers. (EG 2000, vol. 6, p. 49)

SPr – patMODAL+inf: Simple present – past tense modal + infinitive
BNC_spoken: if we let them both know then, then erm, we might we might speed things up. (dch)
GEFL TC: If you don't, there might easily be misunderstandings and tension. (GLN, vol. 6, p. 53)

SPa – SPr: Simple past – simple present
BNC_spoken: So if you started off in complete darkness you rotate until you get complete darkness or the opposite. (f7u)
GEFL TC: –

SPa – patMODAL+inf: Simple past – past tense modal + infinitive ("TYPE 2")
BNC_spoken: Lightbulbs would give out more light if they were washed every week in soapy water. (d90)
GEFL TC: Would it be all right if I went to London the weekend after next? (EG 2000, vol. 5, p. 121)
PrPr – SPr: Present progressive – simple present
BNC_spoken: if you 're using them in industry, … then that is the real importance of wearing protective gear. (f77)
GEFL TC: –

PrPr – patMODAL+inf: Present progressive – past tense modal + infinitive
BNC_spoken: if we are moving it to Saturday … we could switch the venue to Havstock Park (f7j)
GEFL TC: –

SPa – SPa: Simple past – simple past
BNC_spoken: I think we still sent them out if anyone wanted them (f7c)
GEFL TC: the law only allowed executions if the prisoner was able to defend himself (EG 2000, vol. 6, p. 34)

SPa – prtMODAL+inf: Simple past – present tense modal + infinitive
BNC_spoken: if they removed the local authorities judging we 'll be in a terrible position! (f7v)
GEFL TC: –

SPr – PrPr: Simple present – present progressive
BNC_spoken: if there 's nobody to doing that in Edinburgh they 're going to slip again unfortunately. (f7c)
GEFL TC: that's what's going to happen if you date him again. (EG 2000, vol. 4, p. 74)

SPr – SPa: Simple present – simple past
BNC_spoken: you did n't consider this a venue of body building to be suitable if it 's not erm it 's gon na put bums on seats (d91)
GEFL TC: –

SPa – PrPr: Simple past – present progressive
BNC_spoken: if you did the graph one signature, … what are we going to call pictures? (f7r)
GEFL TC: –
PrPerf – patMODAL+inf: Present perfect – past tense modal + infinitive
BNC_spoken: if somebody has erm you know, by the end of year seven done that, this would be the national curriculum record (f7e)
GEFL TC: –

PrPerf – SPr: Present perfect – simple present
BNC_spoken: If you have n't done that yet do it now (fmc)
GEFL TC: –

PaPerf – patMODAL+have+PP: Past perfect – past tense modal + have + past participle ("TYPE 3")
BNC_spoken: –
GEFL TC: I wouldn't have been there if I hadn't gone the wrong way. (GLN, vol. 4, p. 37)
Can the L in TaLC stand for literature?

Bernhard Kettemann and Georg Marko
University of Graz, Austria
Many European language departments combine the study of language and the study of literature. Though the two fields are thus institutionally linked in addition to sharing a common interest in language and texts, they are still treated as completely different in their meta-theoretical and methodological conceptions and their practices of textual analysis. More efforts to bridge the gaps are required, bringing the two perspectives of literary studies and linguistics together. This paper, which is a think-piece rather than a research piece, argues that approaching literary texts through corpus analysis may benefit language students on many different levels. We suggest that the use of concordancing can help them in their explorations of texts – prior to or after a first reading. This, in turn, will enhance their language awareness or, to be more precise, their awareness of the contributions of individual linguistic structures to possible interpretations of a literary text. It will further strengthen their discourse awareness, i.e., it will make them see the differences between literary and non-literary texts more clearly.
1. Introduction

The analysis of literary corpora takes a subordinate position in the vast field covered by the conception of TALC, a position too low-case to be assigned the capital L in the acronym. The reason for this may be that literary language itself is often seen as distinct from "ordinary" language (the paradigmatic position here is Jakobson 1960) and that its analysis by literary studies also appears to be different from the analysis of other types of language in linguistics and its neighbouring disciplines. Investigating literary language with the help of corpus analysis is therefore considered only marginally relevant to language-related teaching outside the realm of literary studies, i.e., in particular to teaching language and to teaching about language and its interaction with culture
(the main areas of teaching targeted by TALC). As the title of this article suggests, however, we want to challenge this assumption, arguing that the corpus analysis of literary texts may be a worthwhile pedagogical enterprise for all types of teaching.1

The motivation for seeking to establish an important role for corpus analysis of literature stems from the non-integrative approach to teaching often found at Central European language departments, in particular with respect to literary studies. The University of Graz, at whose English Department we both work, is definitely no exception here. Our students are required to take courses in language studies, literary studies and cultural studies in addition to learning the language itself. Although these fields are diverse, they still share a common interest in signs and could thus potentially complement each other in the attempt to achieve the superordinate and superior teaching objective of making students sensitive to and critical of the connections between language, communication and social processes. In times when the survival of university departments does not rest solely on historical and traditional reputation anymore, we think it is of paramount importance to integrate literary studies and particularly its objects of analysis into the framework of the philologies. And we will suggest – and argue in favour of it – in this paper that the corpus analysis of literature could play a key role in this process.

The practical background of this article is the attempt to establish a course on corpus-based stylistics in the curriculum of English and American studies at our university. The main ideas presented in the current article are based on the tenets of this proposal. This implies that we are presenting thoughts standing behind the wish to have a corpus-based stylistics course and therefore that the article is a think-piece rather than a report on empirical research, something which might also explain its occasional tentativeness.

We will begin this article with a short outline of the pedagogical principle of awareness raising, which for us is the fundamental idea standing behind the project, and the different dimensions of awareness that seem relevant to us, namely language awareness, discourse awareness, and methodological and metatheoretical awareness. The three main sections of the article will centre around each of these respectively, providing room for descriptions of how they might be applied in real classroom scenarios. Finally we will also point to some problems that an approach such as the one outlined in the article might pose.
2. Awareness raising: pedagogical and linguistic background

The assumption that corpus analysis can create some common ground between literary studies and the other disciplines in philology is based on the pedagogical principle of awareness raising. Awareness as an important aspect of learning and teaching is, surprisingly enough, not a very well researched phenomenon. The only exception is the more specific dimension of language awareness, which has received ample but sometimes heterogeneous treatment in recent years in language pedagogy (cf., e.g., Hawkins 1987, van Lier 1996, Gebhard and Oprandy 1999, or the articles in James and Garrett 1991, Fairclough 1998, and van Lier and Corson 1997).2 The heterogeneity may be a result of the fact that awareness is a concept which is difficult to pinpoint. Definitions draw upon concepts such as consciousness, knowledge, or attention, which either do not explain anything because they are synonymous (consciousness) or point in completely different directions (attention is a cognitive phenomenon quite distinct from that of knowledge). This is supposed to serve as a caveat because – since we did not focus on awareness in our research in the first place – we will not be able to remedy these inconsistencies, even though our own definition is along the following lines:

For us, awareness has two interrelated senses: firstly, it is the state of being aware, i.e., an alert and attentive state of the mind, and secondly, it is the mind's ability to become aware and thus to notice and attend to particular aspects of experience and to make them explicit. The first is a momentary and transient cognitive state and the second one a more permanent cognitive faculty. Although it will not be possible to keep the two strictly apart, we still think it is a useful distinction because the goal of teaching is the cognitive faculty and not the state, i.e., we do not want to make students aware of something, we want them to be able to become aware on their own. In this sense, language awareness captures more of the spirit of the Council of Europe's recommendation (1994:10) "[to] in fact develop explicit objectives and practices to teach methods of discovery and analysis". The mental ability of awareness covers implicit/intuitive knowledge (= sensitivity) and explicit knowledge (= conscious awareness) of a phenomenon, which are both integral components of the concept. While conscious awareness could not emerge without sensitivity, the latter on its own would not be enough for awareness.
As awareness enables the mind to operate on experiential data, it is a necessary prerequisite for the comprehension of and reflection upon experience, which, in turn, allows for control, intervention and manipulation. Since comprehension and reflection are the ultimate goals of teaching and learning, the importance of awareness is self-evident. And if we are able to become aware outside the classroom, then whenever we are exposed to relevant input, we could use it for learning. This also forms the bridge between awareness and the central principle of learner autonomy, which is also being discussed as one of the pillars of modern pedagogy (for the connection between awareness and autonomy in language teaching, cf. van Lier 1996, Little 1997).

Awareness raising in a discipline such as English and American studies may be divided into five interrelated subareas:

Language awareness: the sensitivity to and conscious awareness of how language (perhaps extendable to other sign systems too) works on the micro-level.
Discourse awareness: the sensitivity to and conscious awareness of the fact that there are different types of language use, which we call discourses, with differences in form going hand in hand with differences in function.
Literary awareness: the sensitivity to and conscious awareness of the particular properties, values and functions of literary texts (cf. Zyngier 1994).
Cultural/social awareness: the sensitivity to and conscious awareness of social and cultural processes and structures, and their interaction with aspects of communication.
Methodological and metatheoretical awareness: the sensitivity to and conscious awareness of how we do research, i.e., how we systematically approach a question and which tools we can use.
The question that remains to be answered is how we can enable students to raise their awareness of phenomena in the areas just mentioned. Possibly the best ideas concerning how awareness (in the sense of cognitive faculty) can be created do not come from researchers on language awareness, as they seem to get entangled in the web of definitional problems, but from a more surprising source, namely the writings of Moshé Feldenkrais (cf. Feldenkrais 1981, 1991). Feldenkrais postulates that in physical education as well as in physical therapy, explicit knowledge of the physiological and anatomical functions of the body,3 in addition to the experience of contrast (experiencing something as different) and distortion (experiencing something as what we are not accustomed to) – basically two sides of the same coin – enables us to become conscious of how
our body works and possibly, at a second stage, to detect, change and modify the patterns of movement which we have got used to but which may nevertheless cause problems. Thus, according to Feldenkrais, raising awareness has a theoretical component and an experiential component, which correspond to the dimensions of conscious awareness and sensitivity mentioned above.

Translated into the context of English and American studies this means that awareness raising in the five areas mentioned entails providing students with explicit knowledge, e.g., knowledge of linguistics, literary studies, cultural studies and sociology, and methodology and metatheory, and making them experience – in or outside the classroom – contrasts and distortions so that they develop an intuitive feel for the problem. This will eventually help students grasp – intuitively and explicitly – (i) how language works on the lower (language awareness) and on the higher (discourse awareness and literary awareness) levels and (ii) how language relates to social and cultural phenomena (cultural/social awareness), as well as helping them understand how sciences work (methodological and metatheoretical awareness).

The basic assumption underlying this paper is that using corpora of literary texts in class can – in combination with providing knowledge about language and literature – contribute to awareness raising on all the five planes mentioned above. This is, roughly speaking, a result of the unique combination of a distorting form of experiencing language (concordancing) with an exceptional form of texts (literature), in addition to the many levels of contrasts that such an approach can open up for experience.

In the following three sections, we will discuss three of the given types of awareness, viz. language awareness, discourse awareness, and methodological awareness. This is not to say that the other two, literary awareness and cultural/social awareness, are less significant, but the decision is based on practical grounds: as linguists we feel that without a tighter cooperation with (i) people doing cultural studies and sociology and (ii) people in literary studies, our considerations in these areas would be too tentative to be of any relevance here.

In our view, the full awareness raising potential of teaching corpus-based stylistics can only be realized in a separate course. But even if we include just a few sessions in either literature or language classes or establish it as a cross-curricular principle – in whatever way this might be implemented – this can create awareness. In what follows, however, we present our ideas as if there were a separate course, complete with background theorizing and exemplary analyses.
Readers should bear in mind that although we shall deal with the different awarenesses separately, assigning to them specific texts and text analyses, in practice they cannot and ideally should not be so neatly separated.
3. Language awareness

Language awareness as the general sensitivity to and conscious awareness of how linguistic structures, i.e., elements, combinations, relationships, contribute to different types of meanings on the local and the global levels should be the pivotal objective of any philology. Why should language awareness be granted this prominent position? Language and the work it performs is usually taken for granted. But the conscious awareness of how it works allows students to constructively intervene in all areas where language plays a part, starting from their own language learning via language teaching to the effective use of language in the media or in business.

Language awareness is crucial for language students because it is claimed (e.g., James and Garrett 1991) that language awareness facilitates language/culture learning in a variety of ways: affective (attitude and motivation, stimulating curiosity), social (tolerance, accepting variety and diversity, de-stereotyping), critical (de-manipulative, cf. Fairclough 1992), cognitive (segmentation, classification, contrast, system, rules) and performative (from "knowing that" to "knowing how", from declarative to procedural linguistic knowledge, cf. Ellis 1994). This is why language awareness has become a central factor in curricula not only in higher education but also in secondary education. And it therefore should also stand at the beginning of any discussion of the relevance of a corpus-based course in stylistics.

Following our argumentation above, to create the kind of awareness just described in students requires first of all the explicit teaching of metalinguistic knowledge. Linguistic theories that focus on connections between formal elements and meanings are functional theories of grammar and lexicon – in particular Hallidayan Systemic Functional Linguistics (cf. Halliday 1994, Thompson 1996, Eggins 1994) – in addition to cognitively-oriented theories of discourse comprehension – in particular schema theory. In other words, we teach students the theoretical foundation that approaches such as Critical Discourse Analysis (cf. Fairclough 1989, 1992 or Kress and Hodge 1993)
or stylistics (cf. Leech and Short 1981, Widdowson 1992, Fowler 1996, Short 1996, Semino 1997) also draw upon.

The second step is the experiential dimension of the creation of language awareness. We are supposed to expose students to different kinds of language in a contrastive fashion to make them experience the contrasts, so as to lay the intuitive foundation of language awareness. Now we think that corpus analysis of literary text is a very good instrument for creating language awareness because literary texts per se are consumed very differently from other texts, with a much closer attention to meanings and meaning creation. In combination with concordances of particular linguistic elements, this provides for a very "distorted" reading experience because students have learned to distil meanings from the texts – in a generally much more focused and conscious process than in other discourses – and are now confronted with this unnatural way of viewing textual components to see how they might be related to meanings.

In our exemplary analyses, we try to ensure that there are further levels of contrasts to enhance the experiential value of the exercise. We therefore, for instance, provide two exemplary analyses for language awareness, in order to explore the wide range of meaning types, one focusing on characterization, the other one on speech acts. These two analyses are the topics of the following two sections.
3.1 Characterization

Characterization, i.e., how characters and their social and psychological attributes are represented in literary texts, serves the purpose of language awareness raising very well in that it is a level of meaning that students are used to exploring. Comparing their interpretations, which they arrive at without usually attending to the linguistic details that create characterization in the first place, with concordances of phrases denoting main characters, should make students see and experience how details in the language of a text can have an impact on their overall understanding.

For the session itself, we take as topic the different characterizations of men and women in an early emancipatory American short story by Mary E. Wilkins Freeman (1890), "The Revolt of Mother," which we have digitized and thus have available as a mini corpus. The story is the perfect starting point: (i) it is short,
thus allowing students to instantly compare the conclusions of their close reading with the insights that they gain by viewing concordances, (ii) it is characterized by two levels of opposition, viz. between men and women and between the early and later stages of the characters' development, therefore providing for the additional contrasts mentioned above, which are so important to drive home a point more easily.

We start with the concordance of she,4 which will yield the verbs that collocate with female characters in the story and allow us insights into the actions performed by the central woman in this short story (at least as far as the subject coincides with the semantic role of agent). This concordance already indicates which aspects of language awareness might be raised by our approach in this area. It suggests that collocations are not mere arbitrary word partnerships, as sometimes suggested in introductions to semantics, but that they show a real influence of one linguistic element on the meaning of another (and possibly vice versa), e.g., as in the case above, of verbs co-occurring with the most common noun phrases referring to one of the story's main characters.

A concordance such as that in Table 1 practically forces students to categorize the verbs, finding housekeeping to be the prominent aspect shared by verbs such as bake, cook, clean, sew, wash. This, in turn, will make them become aware of the relevant semantic features. They will see that the effect of collocates rests on the meaning aspects that they share and which are collectively

Table 1. Concordance of she (selection) from "The Revolt of the Mother"

better than any other kind. She baked twice a week. Adoniram
t. Nanny and Sammy watched. She brought out cups and saucers,
and razor ready. At last she buttoned on his collar and to
to go before the broom; she cleaned, and one could see no
out with a pile of plates. She got the clothes-basket out of
apparently no art. To-day she got out a mixing bowl and a
put in a piece of soap. She got the comb and brush, and s
t down with her needlework. She had taken down her curl-paper
aside. "You wipe 'em," said she, "I'll wash. There's a good
His wife helped him. She poured some water into the tin
after he had washed. Then she put the beans, hot bread, and
pan with a conclusive air. She scrubbed the outside of it as i
cambric and embroidery. She sewed industriously while her
inanimate matter produces. She swept, and there seemed to be
transferred to the conceptualization of a person in the story. In the specific case of the story, the concordance will show students how the semantic preference (see Hoey, this volume) of she for housekeeping verbs creates or at least contributes to the understanding of the woman's role in the story.

The woman's role is then compared to that of the main male character in the story. Here we produce the concordance for he (Table 2). By contrast with she, he collocates with a semantically different class of verbs. He works with the farm animals, designs, fixes, has plans ("is going to"), thinks (stares reflectively), is confident, controls the situation.

As the word "revolt" in the title suggests, there is a development in the story. For this reason, we can divide the story into two parts. The following concordances are from the second part of the story. There is a change in this distribution of power in the course of the story, which becomes obvious from the lines extracted from the she concordance of the second part of the story: now it is she who makes up her mind as to her course of action, forms a maxim, decides on something and sticks to it, looks immovable, full of authority, and is successful with her cause, because in the end, she is "overcome by her own triumph" (Table 3). Her move towards action is mirrored in his lapse into helplessness, disbelief, speechlessness, immobilization, and finally results in his remorse, cf. the concordance lines with he in Table 4.

The four concordances should suffice to make students realize a certain change in the characters in the story. Against the background of traditional male/female roles, a role change is foregrounded. It seems to be a change
Table 2. Concordance of he (selection) from "The Revolt of the Mother"

Adoniram Penn's barn, while her he designed it for the comfort of
did one good thing when he fixed that stove-pipe out there
sternly at the boy. "Is he goin' to buy more cows?" The
except on extra occasions. He held his head high, with a
man said not another word. He hurried the horse into the farm
nessing the great bay mare. He hustled the collar on to her
in on Wednesday; on Tuesday he received a letter which changed
feels about the new barn," he said, confidentially, to Sammy
Adoniram did not reply; he shut his mouth tight. "I know
of blackberry vines. He slapped the reins over the horse
a horse I want." He stared reflectively out of the
Table 3. Concordance of she (selection) from "The Revolt of the Mother", second part

e had on a clean calico, and she bore herself imperturbably.
steady, her lips firmly set. She formed a maxim for herself,
od in the door like a queen; she held her head as if it bore
e talkin', Mr. Hersey," said she. "I've thought it all over a
last buckles on the harness. She looked as immovable to him
f they were bullets. At last she looked up, and her eyes show
she repeated in effect, and she made up her mind to her cour
ut her apron up to her face; she was overcome by her own triumph
Table 4. Concordance of he (selection) from "The Revolt of the Mother", second part

What on airth does this mean, mother?" he gasped. "You come in h
(she) smoothed his thin gray hair after he had washed. Then she p
t the right of the barn, through which he had meant his Jerseys
designed for her front house door, and he leaned his head on his
head and mumbled. All through the meal he stopped eating at int
ut in a dazed fashion. His lips moved, he was saying something,
him, and shook his head speechlessly. He tried to take off his
ther!" The old man's shoulders heaved: he was weeping. "Why, don
towards more justice and a more even distribution of power, authority and responsibility.

This exemplary analysis may be complemented by similar searches in corpora of novels. When comparing impressions with concordancing data, students will not only see the larger relevance of the approach, they will also notice that larger corpora pose more problems for the analysis of linguistic details, e.g., the fact that in examining the characterization of Heathcliff in Emily Brontë's Wuthering Heights the search word "he" underspecifies the target (i.e., gives more examples than wanted). This contributes in turn to methodological awareness.
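For readers who want to reproduce this kind of exercise without a dedicated concordancer such as WordSmith Tools, a keyword-in-context listing of the sort shown in Tables 1–4 can be generated with a few lines of script. The following Python fragment is only a sketch of the general technique, not the procedure used for this chapter, and the file name revolt_of_mother.txt stands in for any plain-text version of the story.

import re

def kwic(text, node, width=40):
    """Return simple keyword-in-context lines for a node word."""
    lines = []
    # \b keeps the match to the whole word 'she', not 'shed' or 'ashes'
    for m in re.finditer(r"\b" + re.escape(node) + r"\b", text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()].replace("\n", " ")
        right = text[m.end():m.end() + width].replace("\n", " ")
        lines.append(f"{left:>{width}} {m.group(0)} {right}")
    return lines

# Hypothetical file name; any plain-text version of the story would do.
with open("revolt_of_mother.txt", encoding="utf-8") as f:
    story = f.read()

for line in kwic(story, "she")[:15]:   # first 15 concordance lines
    print(line)

Sorting the resulting lines on the right-hand context groups the collocating verbs together, which is the effect exploited in the classroom session described above.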
3.2 Speech acts and performative verbs

Language is not only used to represent the world, but also to act upon or together with other people, i.e., for interaction (in Hallidayan terminology, these are called the ideational and interpersonal functions, cf. Halliday 1994). As literature usually places too much emphasis on the representing function,
our secondary analysis, in contrast to the one on characterization, is concerned with the interactive function. It thus tries to raise students' awareness of the social work that linguistic elements perform.

We use a corpus of Shakespeare plays (the complete set; the corpus has been POS tagged)5 for this session. The reason for choosing dramatic texts is that they constitute the paradigmatic interpersonal genre in literature. We have selected Shakespeare because we assume students of English and American studies to be familiar to some extent with his oeuvre. The fact that Shakespeare used poetic forms of Early Modern English poses some problems but also introduces an element of contrast that we deem important in the creation of awareness.

The central topic of the session is speech acts. The latter – at first sight – may evade corpus analysis as they do not seem to have anything in common formally, i.e., no shared features by which we can recognize them. There are, however, speech acts that are clearly marked formally, namely those that are explicitly introduced by a performative verb. A short glimpse into Shakespeare's plays will reveal that the use of performative verbs usually takes the following forms:

"I" + VERB + Comma (as in I promise,)
"I" + VERB + "you/thee" + Comma (as in I warn you,)

Table 5. Concordance of "I *," and "I * you," in the Shakespeare corpus (extract)

by the very fangs of malice I swear, I am not that I play. Are y
and my daughter. PROTEUS I do, my lord. DUKE And also, I
our own device. SILVIUS No, I protest, I know not the contents:
my daughter Katharina, this I know, She is not for your turn,
ere my Romeo comes? Or, if I live, Is it not very like, The hor
is head! Now, by Saint Paul I swear, I will not dine until I see
sit her. BAPTISTA But thus, I trust, you will not marry her.
ht am nothing: but whate'er I be, Nor I nor any man that but man
e the wiser by your answer. I pray you, sir, are you a courtier?
ssing arguments? EDMUND Not I pray you, what are they? CURAN
than I to speak of. ORLANDO I thank you, sir: and, pray you,
sweet Demetrius. DEMETRIUS I charge thee, hence, and do not e
ard from him. But go we in, I pray thee, Jessica, And
ne too, which did awake me: I shaked you, sir, and cried: as
Buck, buck, buck! Ay, buck; I warrant you, buck; and of the
but ask me not what; for if I tell you, I am no true Athenian.
irect means or other; for, I assure thee, and almost with tears
d. Dost thou hear, Camillo, I conjure thee, by all the parts of
Table 6. Most common performative verbs in the Shakespeare corpus

I + V + Comma              I + V + you/thee + Comma
Say        100             pray       228
Prithee     85             beseech     51
Pray        36             thank       51
Swear       19             tell        48
Protest     17             warrant     36
Confess     11             charge      27
Warrant      9             promise      8
Grant        6             assure       6
Profess      3             bid          4
Prophesy     3             conjure      3
Vow          2
Using a truncated word class search, i.e., with a wildcard, e.g., *, helps us to get concordances as in Table 5. If we weed out the non-performative cases (e.g., I do, I live, I be, etc.), we arrive at a list of the most common performative verbs used in these patterns in Shakespeare plays (Table 6).

Once again, such a list should make students start categorizing right away. And they will arrive at the conclusion that there is a clear preponderance of the directive speech act of requesting: I prithee, I pray, I pray you, I beseech you, I charge you, I bid you, I conjure you. The question is why it is this speech act which needs to be marked (of course, we always have to consider non-pragmatic aspects of the use of performative phrases as well – e.g., they might help to make the blank verse work). It is perhaps supposed to signal politeness (pray could be regarded almost as a formulaic phrase to indicate politeness, as is please today – one of the awareness raising contrasts with modern English), since today's use of modal would/could in stereotypically indirect requests (e.g., could you mail this for me?) does not seem to have been that common in Shakespeare's time.

A word that stands out is say. Though it is the most common verbum dicendi, its meaning seems so unspecific and neutral that its use as a performative comes as a surprise. We might therefore have a closer look at concordances with say in this function.
Table 7. Say as a performative verb in the Shakespeare corpus (extract)

of the Garter. Host Peace, I say, Gallia and Gaul, French and
ray you, – PROTEUS Sirrah, I say, forbear. Friend Valentine, a w
ove false! Exit QUEEN Son, I say, follow the king. CLOTEN That
hat fame may cry you loud: I say, farewell. Second Lord Health,
? JOAN LA PUCELLE Why, no, I say, distrustful recreants! Fight
l the empty air, Clifford, I say, come forth and fight with me:
no more. Proceed. CALIBAN I say, by sorcery he got this isle;
s again to cope your wife: I say, but mark his gesture. Marry,
ave you warp. Call hither, I say, bid come before us Angelo.
You mistook, sir; I say, she did nod: and you ask me if
ghost of him that lets me! I say, away! Go on; I'll follow thee.
hrones, and smile at Troy! I say, at once let your brief plagues
s name. Go, take him away, I say, and strike off his head and
ey; let them go: Dispatch, I say, and find the forester. Exit an
the county; Ay, marry, go, I say, and fetch him hither. Now,
unto their husbands: Away, I say, and bring them hither
. HUBERT Give me the iron, I say, and bind him here. ARTHUR
F A plague of all cowards, I say, and a vengeance too! marry,
Even though we might assume that say denotes the most neutral of assertive speech acts and we would therefore expect it to have the function of explicitly marking an assertive speech act – i.e., the act of making a claim – the very fact that it here collocates very often with imperatives (follow, go, dispatch, take away, strike off, marry, fetch, give, bind, etc.) or quasi-imperatives (away) suggests that it has the more dominant function of marking an order or even adding some emphasis to an order. Students may go further here by concluding that those characters having their say are those who are in power.

Comparisons with other corpora of conversation, whether literary or non-literary, are, of course, highly welcome and could further enhance the sensitivity to and consciousness of the interactive function of linguistic elements.
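The wildcard searches behind Tables 5 and 6 ("I *," and "I * you,") can also be approximated with regular expressions where no concordancing package is available. The sketch below is an illustration only, under the assumption of a plain-text file of the plays named shakespeare_plays.txt; the raw counts it produces still include the non-performative cases (I do, I live, I be, ...) that have to be weeded out by hand, as described above.

import re
from collections import Counter

# Hypothetical file: a plain-text version of the complete plays
with open("shakespeare_plays.txt", encoding="utf-8") as f:
    plays = f.read()

# "I" + VERB + comma, e.g. "I swear,"
pattern_v = re.compile(r"\bI (\w+),")
# "I" + VERB + "you/thee" + comma, e.g. "I warn you,"
pattern_v_obj = re.compile(r"\bI (\w+) (?:you|thee),")

counts_v = Counter(m.group(1).lower() for m in pattern_v.finditer(plays))
counts_v_obj = Counter(m.group(1).lower() for m in pattern_v_obj.finditer(plays))

print("I + V + comma:", counts_v.most_common(12))
print("I + V + you/thee + comma:", counts_v_obj.most_common(12))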
4. Discourse awareness

Discourse awareness is concerned with more global aspects of language, i.e., types of language use. Whatever terminology we prefer, i.e., whether we want to call these types text types, genres, registers, styles, varieties, or discourses, the
main point is that they are characterized by a set of contextual features and a set of linguistic features, and that these two sets interact in complex ways. Being aware of these different types of language use and being able to identify them is of paramount importance to students of language as it will ultimately help them not only to produce better texts themselves but also to interpret texts more effectively. Further, it will enable them to appreciate mixtures of types as intertextuality. Raising discourse awareness therefore takes a prominent position. We will proceed along the same lines as with language awareness: first, we teach students about discourses and approaches to analyzing them (e.g., Halliday and Hasan 1991, Leckie-Tarry 1995) in addition to encouraging them to use all the metalinguistic knowledge they have (not only from the first sessions but also from other courses). We secondly have them search corpora with concordances and distil data from these, thus experiencing contrasts and differences. Although there is an infinite list of literary and non-literary discourses which could be compared by means of corpus analysis, we will limit ourselves to devoting two exemplary analyses to discourse awareness, the first one dealing with authors’ personal styles, the second one with the differences between literary and non-literary discourses. We will introduce the first analysis in detail in the subsection below. As regards the second, we will only make brief comments because we still see a host of problems which we have not yet resolved.
4.1 Authors' personal styles

By authors' personal styles we mean the linguistic features (of any kind) peculiar to the language of specific writers. As authors' personal styles span their whole oeuvre, they constitute discourses in their own right. Their prominence in teaching about literature at school warrants examining them in a first session aiming at discourse awareness.

Personal styles can be treated as tools to identify a particular author's work. Like forensic linguistics, this approach is more surface-oriented and works with sophisticated statistics. It would, however, completely leave aside assumptions about functions, meanings, and in turn also about the socio-cultural impact of linguistic elements, which we claim to be the other side of a discourse. We
therefore take a different approach, establishing connections between formal features and meanings in all their aspects, which eventually means relating language to authors' world views, in both their idiosyncratic and collective socio-cultural dimensions.

This session is limited to three authors who can be interpreted as representatives of different literary eras, namely:

Nathaniel Hawthorne (1804–1864) – American Romanticism
Mark Twain (1835–1910) – American Realism
Jack London (1876–1916) – American Naturalism
For the exemplary analysis, we have composed three corpora covering the complete works of the three writers, respectively. The corpora have been POS tagged. What kind of differences will students now expect to find in the three corpora and how might these differences be represented in language? We think that they will draw upon their knowledge of the three literary periods to come up with the assumption that Realism and Naturalism strive for objectivity, i.e., representation of reality "as it is". Apart from Psychological Realism, which neither Twain nor London subscribe to, subjectivity with its distortions, selectivity and unreliability will be considered to be the realm of Romanticism.

In contrast to the first sessions, where we started with concordances immediately, we want to take a more cautious approach here, encouraging students to translate such general assumptions concerning meanings into expectations about the occurrence of concrete linguistic elements. They may thus come to the conclusion that subjectivity could be marked by modality, that is, by epistemic modality, concerned with factuality or non-factuality and expressed by modal verbs such as may or might and modal adverbs such as perhaps, probably, or obviously, or by those aspects of modality that have something to do with the desirability of certain events, expressed by modal verbs such as shall or should. This might lead to the hypothesis that such markers of modality should be more common in Hawthorne than in either London or Twain. Table 8 gives the figures for may/might, perhaps/probably/maybe/possibly and shall/should. The students' expectations seem to be supported by the higher figures for Hawthorne and the lower figures for Twain and London.

A further technique for rendering representations more realistic which might be mentioned by students is the use of direct speech. A concordance
Table 8. Frequencies of certain modal verbs and adverbs in the Hawthorne, Twain and London corpora (per 10,000 words)

                                    Hawthorne   Twain   London
may/might                               29.0     11.5      7.7
perhaps/maybe/probably/possibly         11.0      6.2      3.1
shall/should                            20.3     12.8      9.3
Table 9. Number of direct speeches in the Hawthorne, Twain, and London corpora (per 10,000 words)

                   Hawthorne   Twain   London
Direct speech           90.9   103.3    135.0
searching for all (end of quote) quotation marks yields the numbers given in Table 9. Though the difference between Hawthorne and Twain may not be significant, the number of direct speeches used by Jack London is substantial. Again the expectations would be supported.

These two features are supposed to give only a taste of what could be examined in this session and how we can proceed. More profound and thorough analyses will reveal further differences in form which can be related to differences in meaning so that students will acquire both sensitivity to and conscious awareness of this more global dimension of language.
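To make the procedure behind Tables 8 and 9 concrete, normalized frequencies of this kind can be computed with a short script once the three corpora are available as plain text. The following Python sketch is only a rough illustration under assumed file names (hawthorne.txt, twain.txt, london.txt); the figures it yields depend on the tokenization and on how direct speech is marked in the texts, so they will not necessarily match the tables above.

import re
from collections import Counter

def per_10k(hits, n_tokens):
    """Normalize a raw count to a rate per 10,000 words."""
    return round(hits / n_tokens * 10000, 1)

def report(path):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    tokens = re.findall(r"[A-Za-z']+", text.lower())   # rough tokenization
    counts = Counter(tokens)
    n = len(tokens)
    may_might = counts["may"] + counts["might"]
    adverbs = sum(counts[w] for w in ("perhaps", "maybe", "probably", "possibly"))
    shall_should = counts["shall"] + counts["should"]
    # Closing quotation marks as a rough proxy for direct speech passages
    speeches = text.count("\u201d") or text.count('"') // 2
    print(path, per_10k(may_might, n), per_10k(adverbs, n),
          per_10k(shall_should, n), per_10k(speeches, n))

# Hypothetical corpus files, one per author
for path in ("hawthorne.txt", "twain.txt", "london.txt"):
    report(path)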
4.2 Literary versus non-literary texts

The last exemplary analysis is intended to put an even stronger emphasis on discourse awareness as we contrast literary texts with texts of varying distance to "serious literature". The options range here from pulp literature to academic discourses, all varieties that students are familiar with and which consequently will strongly contribute to raising their discourse awareness.

On the downside, the large set of options also makes the selection process more difficult. At this stage, we have not decided yet which comparative discourse to use for an exemplary analysis. The choice of literary corpora is also
more problematic, since it would have to be one of recent literature; otherwise the contrasts could always be attributed to the difference in time of origin as well.
5. Methodological and metatheoretical awareness

Methodological and metatheoretical awareness, as defined above, is students' sensitivity to and conscious awareness of what they are doing in research, how and why they are doing it. Although only a few students will work in research after their studies, methodological and metatheoretical awareness in our view contributes to a general and critical awareness of problem identification and of problem solving, which we consider a fundamental objective of any form of learning and studying.

How can corpus analysis of literary texts contribute to the creation of methodological and metatheoretical awareness? We think that having to use concordancing as their primary tool of analysis, students will be confronted with problems they do not have to think about in the traditional approach of close reading in literary studies – problems such as corpus composition, search strings, quantitative evaluation of data, etc. Encountering such problems provides the basis for the experiential dimension of awareness raising, which we will complement with explicit information.6 This also implies that the experiential phase usually precedes the theoretical one in this area.

There are several aspects of methodological and metatheoretical awareness which can be promoted in a stylistics course using methods of corpus analysis:

Awareness of the frame of reference: sensitivity to and conscious awareness of what we are making statements about.
Procedural awareness: sensitivity to and conscious awareness of how we proceed in our research.
Awareness of differences in approaches: sensitivity to and conscious awareness of how problems can be approached differently.
Critical awareness: sensitivity to and conscious awareness of the potential fallacy of our own approach or that of others.
We will briefly discuss these in the four sections below. Raising metatheoretical and methodological awareness does not work
independently of the research work students do in the course. We therefore do not deal with it in a separate session but rather focus on it if the situation invites it.
5.1 Awareness of the frame of reference

By frame of reference we mean the part of the world that we make statements about in research. Students – and their teachers, too – often ignore this and leave it for others to decide whether they are, for instance, saying something about language in general, about the English language, about a particular variety of English or just about a specific language event, a specific text or a collection of texts. Having to choose a corpus moves the question of frame of reference to the foreground. In particular the transition from the first exemplary analysis, where we deal with a particular short story, to the second one, where we are concerned with all of Shakespeare's plays, seems to be a case in point. And students will no doubt see the difference between drawing conclusions about a particular story or about Shakespeare's dramatic language as a whole.
5.2 Procedural awareness

Being aware of how they proceed in their research efforts will help students to systematize their approach and make it repeatable in principle. Corpus analysis proves valuable in this respect since it requires a certain research discipline, i.e., the research questions have to be funnelled down into searches for our concordancing programme, which prevents a random approach. In the course, we promote the following procedure in all our exemplary analyses:

1. Clarifying the research question and the central concepts by gathering background information and carrying out exploratory searches.
2. Identifying linguistic indicators: these are the linguistic elements that are considered to contribute to the phenomenon under investigation.
3. Finding out about the technical possibilities of tracing the linguistic indicators in a corpus.
4. Carrying out searches.
5. Presenting, discussing and interpreting results.
In the session about authors' personal styles, for instance, students are required to obtain some background information about the three periods and the three authors so as to get an idea of what might distinguish the three corpora. They are then invited to explore the corpus on their own, e.g., by using the keyword function as a tool. This function of WordSmith Tools (Scott 1996) allows them to see the words that occur significantly more frequently in one corpus as compared to another. They might also want to embark on a series of chain searches – these are searches where the main collocates of a search word can be used as the new search words, etc. – an approach that not only enhances familiarity with the corpus but also makes serendipitous findings easier. This should make them see the relevance of subjectivity and objectivity, which they may connect to Hawthorne's preference for words such as "may", "might", "indeed" or "perhaps", as revealed by a keyword analysis, insights which they can then translate into concrete hypotheses.
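The logic behind such a keyword analysis can be illustrated with a small script that compares the relative frequency of every word in a study corpus against a reference corpus and ranks the overused words by a log-likelihood score. This is only a sketch of the general idea, not a description of how WordSmith Tools computes its keywords, and the file names hawthorne.txt and twain_london.txt are assumptions made for the sake of the example.

import math
import re
from collections import Counter

def freqs(path):
    """Word frequency list and token count for a plain-text corpus file."""
    with open(path, encoding="utf-8") as f:
        words = re.findall(r"[A-Za-z']+", f.read().lower())
    return Counter(words), len(words)

def log_likelihood(a, b, n_a, n_b):
    """Log-likelihood for one word observed a times in n_a and b times in n_b tokens."""
    e1 = n_a * (a + b) / (n_a + n_b)
    e2 = n_b * (a + b) / (n_a + n_b)
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

# Hypothetical files: the study corpus and a reference corpus
study, n_study = freqs("hawthorne.txt")
ref, n_ref = freqs("twain_london.txt")

keywords = [(w, log_likelihood(c, ref[w], n_study, n_ref))
            for w, c in study.items()
            if c / n_study > ref[w] / n_ref]      # overused in the study corpus
keywords.sort(key=lambda pair: pair[1], reverse=True)

for word, score in keywords[:20]:
    print(f"{word:<15}{score:8.1f}")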
5.3 Awareness of differences in approaches

There are different ways to analyze a text. As mentioned above, corpus linguistics represents a radically different approach to those that students are familiar with from literary studies. As corpus linguistics is so different from most things done in literary studies, it will raise students' awareness of the fact that the latter just represents one type of approach. The following features of literary studies are particularly likely to be highlighted.
5.3.1 Co-textual versus transtextual analysis

Co-textual analysis refers to a close reading of a particular passage, taking all relations between linguistic elements in the passage into account. Transtextual analysis, on the other hand, means examining particular structures/elements across a text. (Notice that these two approaches represent the extreme poles of a scale). Literary studies tends towards co-textual analysis, while stylistic corpus analysis leans towards transtextual analysis. Students will probably notice this difference, and they will also see that both have their merits. We are not suggesting that concordances should replace close reading but that a transtextual approach allows a completely different view of a literary text. Students may initially bring forward the argument that transtextual analysis is artificial because readers never see concordances, but they should eventually realize that it is like
the microscopic view, artificial but still able to allow new insights.
5.3.2 Qualitative versus quantitative analysis

Corpus analysis is also an extremely effective device to highlight the difference between qualitative and quantitative analysis and also to show how they can complement each other. In particular the session on authors' personal styles will promote this point as students will realize that styles can only be conceived of in terms of statistical trends. This will then easily persuade students to count more systematically, doing both token counts (frequency of a particular item) and type counts (number of different words in a category). And this should make them see how frequencies, as the elementary form of statistical analysis, can reveal interesting properties of a text or a group of texts.

5.3.3 Content versus linguistic structure

As already discussed, literary studies sometimes work with larger "contenty" chunks in their analyses, while stylistic research is based on examining the nitty-gritty details of linguistic structure. This is a difference that is very much at the centre of attention from the beginning of the course.

5.4 Critical awareness

The last aspect of awareness we want to raise is critical awareness. Critical awareness basically amounts to the ability and also the willingness to question particular assumptions, whether they are our own or those of others. This attitude should be promoted whenever possible in the course. And corpus analysis of literary texts offers such opportunities aplenty.

Take for instance the session about literary periods, during which we look at Jack London as a representative of Naturalism. Now one aspect that is often mentioned in connection with Naturalism is the conception of humans as powerless objects exposed to uncontrollable forces. How could this be reflected in language? We can look at the modality of obligation and necessity. If we must do or have to do certain things, we are forced to act in a particular way and cannot deliberately choose to do so. The frequencies of "must" and "have to" are thus worth examining.
Table 10. Frequencies of must and have to in the Hawthorne, Twain and London corpora (per 10,000 words)

                  Hawthorne   Twain   London
must/have to           11.4    12.8     12.7
Obviously, the figures do not prove our initial hypothesis. Theoretically, this could mean that after all, literary critics were wrong and Naturalist work does not see the individual as subjected to external forces. More constructively, however, students should look at other possible factors:

Obligation has to be distinguished from necessity, as it is less a matter of circumstances necessarily leading to some sort of outcome than of people wanting other people to do something.
Must and have to might occur in direct speeches with considerable frequency (considering the fact that there are ten times as many direct speeches as occurrences of must and have to, this is clearly something that should not be dismissed easily).
Must and have to do not only express obligation and necessity but also some notions of epistemic modality (conclusions from experience, etc.).

This list can, of course, be expanded. Although the course will not include statistical analysis proper, an example such as this one can be taken to bring in the notion of statistically significant differences. Intuitively it may appear to students that these differences might be a result of chance rather than showing a real preference for this structure in the oeuvre of any of the three authors. And we can point them to books that deal with this aspect in detail.

An example such as this one should also make students become aware that even if figures support hypotheses, this does not necessarily preclude the possibility of considering other factors that might have led to the results. You can compare this to the example of storks and birth rates. There may be a significant correlation between the decline in the population of storks and the decline in the birth rate. But we know – or at least strongly believe – that there is no connection. This would show them that to proceed in a scientific manner always means being cautious and suspicious and considering as many factors as possible.
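The question of whether such differences could be due to chance can itself be made hands-on. The sketch below applies a standard two-corpus log-likelihood comparison (values above about 3.84 correspond to p < 0.05) to the must/have to figures in Table 10. Since the chapter reports only normalized rates, the corpus sizes used here are invented purely for illustration; with real corpora one would of course plug in the actual token and hit counts.

import math

def log_likelihood(a, b, n_a, n_b):
    """Two-corpus log-likelihood: is a/n_a significantly different from b/n_b?"""
    e1 = n_a * (a + b) / (n_a + n_b)
    e2 = n_b * (a + b) / (n_a + n_b)
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

# Invented corpus sizes -- the chapter gives only rates per 10,000 words.
n_hawthorne = 1_000_000
n_london = 1_000_000
hits_hawthorne = round(11.4 / 10000 * n_hawthorne)   # must/have to, Table 10
hits_london = round(12.7 / 10000 * n_london)

g2 = log_likelihood(hits_hawthorne, hits_london, n_hawthorne, n_london)
print(f"G2 = {g2:.2f} (3.84 is the conventional threshold for p < 0.05)")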
6. Potential problems

Of course there are some arguments against a corpus-based approach to stylistics. The following concerns about possible shortcomings have been voiced by students in courses where we have used parts of the material presented in this article:

– Destroying the integrity and wholeness of texts: Using concordances means fragmentizing a text (treating it as text, an uncountable noun, rather than as a text, a countable noun), thus destroying the integrity and wholeness of the text and thus also suspending the interpretation triggered in linear readings of texts.
– Promoting uncritical and superficial reading of texts: Using concordances promotes an overemphasis on surface form at the cost of deeper meanings, which precludes the possibility of resistant and critical readings of texts.
– Blurring literary issues: Corpus analysis of literary texts may help to study literary texts linguistically, but won't be able to answer the genuine questions of literary studies. In other words, corpus analysis promotes an approach indifferent to the literariness of literary texts.
It seems to us that these concerns are based in part on the erroneous assumption that corpus analysis is promoted as the only analytical instrument in the study of literary texts, or that it could even replace ordinary linear close reading. Some aspects of the objections, however, seem justified. Focusing on awareness raising might reduce the literary text to an ancillary function (spending more energy on the aspect of literary awareness might help to mitigate this impression), thus minimizing its literary potential for offering insights into all aspects of what it means to be human. Designing a course such as the one introduced here therefore always requires embedding it in the right curricular context, so that it does not stand in isolation but is complemented by literary studies courses which deal with exactly these dimensions.
7. Conclusion

In this article, we have tried to show how using corpus methods in class for the analysis of literary texts can contribute to the metacognitive and experiential
dimension of awareness raising in areas such as language, discourse, literature, society and culture, and metatheory and methodology. In conclusion, we think the claim that the use of corpus analysis generally promotes awareness is uncontested. The question of whether we need literary corpora for this purpose, or whether academic or media texts could prove more valuable, is more difficult to answer. We think that the experiential value of literary texts is perhaps higher than that of the text types mentioned above. But the answer to this question ultimately rests on whether we think that literary texts are particularly valuable cultural products and that to read, interpret and/or analyze them is a valuable cultural skill. If this is so, then using corpus-based stylistics in teaching language and/or literature will hopefully bridge the gap between literary studies and the other disciplines of the philologies. Literature would then occupy at least some of the space of the capital L in TALC.
Notes

1. Literary corpora could be of relevance for literary studies, but corpus analysis has not yet received widespread recognition as a research tool, let alone as a pedagogical tool, in the literary studies community, and consequently only a few researchers from this field have been involved in the TALC community.

2. The launching of the journal Language Awareness in 1992 is another indication of the growing interest in this topic.

3. In fairness, the aspect of metaknowledge, equal to explicit knowledge, is mentioned by almost all contributions on language awareness, e.g., van Lier (1996:77): "One of the support structures that may help guide the L2 learner to focus attention effectively is a metalinguistic awareness and a store of metalinguistic knowledge." For many people, however, this is equal to creating awareness.

4. Some of the descriptions make it sound like just one more top-down course, with "information flowing from teachers to students". Most of the aspects to be analyzed, however, are either discussed in depth in class or students (are expected to) bring them up themselves before we actually embark on the analysis as described here.

5. The tags, however, do not feature in the concordances used in this article.

6. We try to keep the theoretical part to a minimum as we do not think that a full philosophical treatment of metatheory and methodology will be as beneficial to awareness raising as linguistic knowledge is to the creation of language awareness or discourse awareness.
References

Council of Europe. 1994. Language Learning for European Citizenship. A Common European Framework for Language Teaching and Learning. CC-Lang (94)1. Strasbourg: CoE.
Eggins, S. 1994. An Introduction to Systemic Functional Linguistics. London: Pinter.
Ellis, N. (ed.). 1994. Implicit and Explicit Learning of Languages. London: Academic Press.
Fairclough, N. 1989. Language and Power. London: Longman.
Fairclough, N. (ed.). 1992. Critical Language Awareness. London: Longman.
Fairclough, N. 1998. "Political discourse in the media: An analytical framework". In Approaches to Media Discourse, A. Bell and P. Garrett (eds), 142–162. London: Blackwell.
Feldenkrais, M. 1981. The Elusive Obvious. Cupertino (CA): Meta Publications.
Feldenkrais, M. 1991. Awareness Through Movement: Easy-to-do Health Exercises to Improve Your Posture, Vision, Imagination, and Personal Awareness. London: Harper Collins.
Fowler, R. 1996. Linguistic Criticism. Oxford: Oxford University Press.
Gebhard, J.G. and Oprandy, R. 1999. Language Teaching Awareness. A Guide to Exploring Beliefs and Practices. Cambridge: Cambridge University Press.
Halliday, M.A.K. 1994. An Introduction to Functional Grammar. London: Edward Arnold.
Halliday, M.A.K. and Hasan, R. 1991. Language, Context, and Text. Aspects of Language in a Social-Semiotic Perspective. Oxford: Oxford University Press.
Hawkins, E. 1987. Awareness of Language: An Introduction. Cambridge: Cambridge University Press.
Jakobson, R. 1960. "Closing statement: Linguistics and poetics". In Style in Language, T.A. Sebeok (ed.), 350–377. Cambridge (MA): MIT Press.
James, C. and Garrett, P.P. (eds). 1991. Language Awareness in the Classroom. London: Longman.
Kress, G. and Hodge, B. 1993. Language as Ideology. London: Routledge.
Leckie-Tarry, H. 1995. Language and Context. A Functional Linguistic Theory of Register. London and New York: Pinter.
Leech, G. and Short, M. 1981. Style in Fiction. London: Longman.
Little, D. 1997. "Language awareness and the autonomous language learner". Language Awareness 6 (2–3):93–104.
Scott, M. 1996. WordSmith Tools. Oxford: Oxford University Press.
Semino, E. 1997. Language and World Creation in Poems and Other Texts. London and New York: Longman.
Short, M. 1996. Exploring the Language of Poems, Plays and Prose. London and New York: Longman.
Thompson, G. 1996. Introducing Functional Grammar. London: Edward Arnold.
van Lier, L. 1996. Interaction in the Language Curriculum. Awareness, Autonomy & Authenticity. London: Longman.
van Lier, L. and Corson, D. (eds). 1997. Knowledge about Language. Encyclopaedia of Language and Education. Volume 6. Dordrecht: Kluwer Academic Publishers.
Widdowson, H. 1992. Practical Stylistics. Oxford: Oxford University Press.
Zyngier, S. 1994. "Introducing literary awareness". Language Awareness 3 (2):95–108.
Speech corpora in the classroom
Anna Mauranen
University of Tampere, Finland
Pedagogical uses of spoken corpora have been much less discussed and developed than those of written corpora. Nevertheless, spoken English continues to be a central concern in practical teaching. This paper takes up some fundamental questions in the use and relevance of spoken corpora for learning, relating them to a case study of an experimental course in English for Academic Purposes. Questions of authenticity, communicative utility, and the processing of formulaic language are discussed, and the topical issue of the role of English as lingua franca is suggested as a potentially relevant target for international learners. It is argued that spoken corpora can achieve high authenticity, serve communication, and provide valuable models of the target language. In addition to native speaker-based corpora, there is a need for corpora of international English. It is also argued that in order for corpora to properly establish themselves in language teaching, they need to be integrated into teacher education as well as published teaching materials, which are the backbone of most foreign language teaching.
1. Introduction

Despite the centrality of spoken language in most language teaching syllabuses, pedagogical applications of speech corpora have received scant attention, with few exceptions (e.g., McCarthy 1998, 2002; Swales 2001; Zhang 2002; Zorzi 2001). In teaching English for Academic Purposes (EAP), the written mode has occupied centre stage from the start, most likely because the need to read and write in English has been obvious worldwide. That professionals and academics also need spoken English skills has been recognized in places, so for example a spoken component has been part of the foreign language requirement in degree programmes in Finland since the late seventies. But spoken corpora have so far made little headway in teaching. Is this because they are
inherently less suitable for pedagogic use than written or general corpora, or are they more difficult to apply because their special characteristics have not been fully recognized yet? Spoken corpora are particularly sensitive to the varieties represented, because the standardizing and unifying processes of editing that writing is subject to are missing. The question thus arises to what extent native speaker data represents the best model for foreign language learners, especially in the case of English, which is so strongly a global language. This paper addresses the above questions by starting off with experiences from an experimental course at a Finnish university language centre.1 It relates observations of the course to more general matters concerning spoken corpus use: authenticity, communicative utility, and the processing of formulaic language. Finally, the case of English as lingua franca is discussed as a potentially relevant target for future graduates who are not language professionals. Practical issues concerning the integration of corpora into foreign language (FL) teaching are a basic concern throughout. A fundamental background assumption is that the real test of corpora as a learner resource is their adaptability to school settings, as tools for ordinary teachers and learners. The first two sections below (Sections 2 and 3) describe the teaching experiment as a point of departure for the more theoretical and pedagogical discussion which follows in the rest of the paper.
2. An experimental course

My point of departure was an attempt to introduce a speech corpus into an ordinary course of academic spoken English at the University of Tampere. Students attending such courses can expect part of their academic studies and a good deal of their professional lives to involve the use of spoken English, even though they are not likely to become language professionals. Their motives are thus instrumental, and they can expect to use English primarily with other non-native speakers. The course is part of the language requirement for an MA degree; students have 10 to 12 years of English at school behind them and a good number of coursebooks and readings in English. These students came from various disciplinary areas, were of different ages and had varying work experience. What made this group slightly unusual was that they did not seem particularly
interested in English, as they had postponed the language requirement to the last possible moment. Their motivation rose quickly as the course began, thanks to the skills of the teacher, who had long experience in teaching spoken EAP and was highly qualified. She had not used corpora before, and was not particularly interested in or knowledgeable about computers. But she had good ordinary computer skills, and enough experience, background education and self-confidence to take a corpus on board. The corpus used was the Michigan Corpus of Academic Spoken English (MICASE: see Simpson et al. 1999), which was designed for investigating, teaching, and testing English in university settings. It was compiled at the English Language Institute at Michigan, where large numbers of foreign students are tested for their English and offered courses for coping in an all-English academic environment. Trying the corpus out in a situation where English is a foreign language seemed a good test of its usability from a more global perspective. For a concordancing program, I selected Monoconc by Michael Barlow, since it is designed for pedagogical purposes and appears to be quite uncomplicated to use. Teachers at the Tampere Language Centre were introduced to MICASE in three general sessions, including demonstrations and hands-on tasks. Many teachers were interested, and some small-group and individual sessions followed. However, as is often the case with hard-working busy teachers, incorporating a novel component into courses demands more than just interest and willingness. In the end only one teacher actually took up the challenge. The course using the corpus was designed by the teacher: it was thoughtfully structured, largely task-based, and intensive, lasting two weeks full time. Most of it was taken up by student presentations and by a variety of tasks and discussions in preparation for and following them. Evaluation was mainly based on the presentations. The course thus revolved around planned oral production, and more spontaneous but fairly controlled activities relating to this. For instance, one pedagogic sequence focussed on discourse markers:

1. Teacher presents discourse markers;
2. Students find thesis abstracts in their own discipline in English on the university website, and list instances of discourse markers they find interesting or problematic;
3. Students use MICASE to investigate the markers they have listed;
4. Students present their findings to each other in class.

The experimental course was thus one of general EAP, which the teacher normally ran once a year, with the novel component of incorporating work on a corpus of native speakers' spoken academic English. In addition to the group and individual sessions with the teacher, I had an interview session with her when the course was finished, and she shared copies of all her tasks, materials and student feedback with me.
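In the course itself, the students queried MICASE through the Monoconc interface. Purely to illustrate what such a keyword-in-context search amounts to, a generic concordancer over any plain-text transcript might look like the sketch below; the file name, the marker "anyway" and the context width are arbitrary assumptions, not details of the actual course.

```python
import re

def kwic(text, keyword, width=40):
    """Return keyword-in-context lines with `width` characters of left and right co-text."""
    lines = []
    for m in re.finditer(r"\b" + re.escape(keyword) + r"\b", text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()].rjust(width)
        right = text[m.end():m.end() + width]
        lines.append(f"{left} [{m.group(0)}] {right}")
    return lines

# Hypothetical transcript file; "anyway" stands in for a discourse marker a student might query.
with open("lecture_transcript.txt", encoding="utf-8") as f:
    for line in kwic(f.read().replace("\n", " "), "anyway")[:20]:
        print(line)
```

Sorting, sampling and counting such lines is then ordinary list manipulation, which is essentially what step 3 of the sequence above asks students to do with ready-made tools.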
3. Comments from the course

Here I shall briefly go through both the teacher's and the participants' comments on the experimental course, as a basis for reflecting upon some basic issues in using spoken corpora in the classroom.
3.1. Teacher comments

In the post-course interview, the teacher gave me her impressions and comments. For her, the course had meant risk-taking, but at the same time it had been interesting, exciting, and simply fun. She had particularly enjoyed being able to tell students that the information they found in the corpus was not available in standard reference works such as grammars. As specific points she mentioned discovering frequencies of verbs with prepositions, and the connections between prepositions and verbs: "which prepositions go with which verbs". She noted that taking a corpus into the classroom demands that the teacher understand the tentative nature of all knowledge. She also saw a connection between corpus use and experiential education (e.g., Kohonen 2001): principles such as reflexive learning, cooperative learning, and teacher-as-facilitator provided her with a conceptual framework in which to accommodate students' corpus use. The teacher felt that the main benefit for students lay in the opportunity to make their own hypotheses and try to verify them, and that patternings in the texts used had stimulated such hypotheses. She said that the more computer-literate students had particularly liked using MICASE. But despite advance preparation, a number of problems – not only technical – had emerged, making it difficult to use the corpus fully. There had been some unlucky choices in designing exercises – some too wide (prepositions with "see"), others too
narrow (no examples were found of "shrug off"). Difficulties of this kind seem inevitable until sufficient experience of corpus use is accumulated. The teacher also made some suggestions for the future: more training sessions to help gain confidence with the program, and technical assistance during the course. As might be expected in an EAP context, she stressed the need for easier access to subject-specific sets of data. At a more general level, she saw a role for corpora in in-service teacher education.
3.2. Student comments

Written feedback was obtained from the students. They were generally more positive than negative about using the corpus. What they found interesting was discovering a large number of word combinations ("which words precede or follow a given word"), in particular discoveries about prepositions in sentences and with verbs. Some mentioned idiomatic expressions, but added that they had not found many. Some students welcomed the new experience in more general terms, saying they had never seen anything like this in language teaching. What they were less happy with was the ultimate aim, which remained unclear to many: some said the program was not interesting enough to buy. Using the corpus was time-consuming and complicated, and unsuitable for beginners. In contrast to their teacher, some students observed that frequency information was irrelevant to them. And as surmised by the teacher, opinions were divided on technical issues: those with good computer skills found it all very easy, but others found it complicated and/or boring. In the students' view, prospects for the future were not all bright – they thought the corpus would be more useful for advanced learners, researchers and translators. But they hoped for more time to practise with the program, and one student perceptively suggested that a bilingual manual would be a good idea – for learners who are not very advanced it makes more sense to receive instructions in their L1 rather than in the target language (on using the L1 in L2 instruction see Frankenberg-Garcia, this volume). In all, the comments repeated the typical experience of new approaches, being received with some excitement, but at the same time they revealed a need for even experienced and well-educated teachers to get a good deal of hands-on experience to make the new approach work. Since corpora do not only provide new resource material and new exercises, but actually a new way of looking
at language, thereby demanding wholly new types of exercises, the time and effort required for teacher initiation is probably more than for many other pedagogical innovations. Clearly the effort would be worth it, since the overall assessment was very positive. At the same time, working with the corpus seemed risky to the teacher, since she emphasized the relativity of knowledge; nevertheless, the excitement of going beyond what reference books offer was greater than the risk. For less confident teachers, the loss of authority over language could be a threat. What also came out fairly clearly was an implicitly assumed straightforward connection between spoken and written academic discourse, which led to problems. Student commentary reflected the teacher's views, for example in seeing useful patterns in terms of building up larger units from smaller elements. Their assessments of corpora as suitable for more advanced or professional users are hard to evaluate, given that the group consisted of students with unusually little interest in language. Their overall positive attitude was therefore all the more surprising. I shall now move on to tackle issues that arose from these comments, and some which relate to the original motivation behind the experiment – bringing a large database of authentic target speech to the classroom. In the following sections, I first take up the question of linguistic authenticity in the classroom, including the relevance of the student as an observer, then discuss the usefulness of spoken data for immediate vs. delayed needs. For all these questions, the differences between speech and writing are a central concern. After this, I move on to the "patterning" points that the teacher and students made, and the possible processing differences in L1 and L2 use. Finally, I raise the issue of a suitable linguistic learning target for future professionals using English – is it really the native speaker model?
4. How authentic is a spoken corpus?

Authenticity has been a bone of contention in applied linguistics for at least two decades. I limit myself to two points relevant to the kind of corpus we are dealing with.
4.1 Subjective and objective authenticity

A spoken corpus seems even further removed from its origins than a written one, because it tends to appear in transcribed form. It has undergone a mode change, and much that is relevant to the original communicative event gets left out – probably more than with writing. The technical shortcomings (missing sound, missing visual cues) may be remedied by new technology, but the fundamental problems of bringing discourse into the classroom remain. These, as Widdowson has often pointed out, are issues of methodology and learner response rather than of technical quality, quantity, or origin. This has led some scholars (e.g., McCarthy 2001) to suggest that we should measure learner responses rather than speculate about them. However, this is easier said than done – how exactly can we measure authenticity of response? In translation studies, the potential equivalence of the reader's response to a translated text and to its original is often presupposed (e.g., Nida's "dynamic equivalence"; see Nida 1964), and this has with varying terminologies been proposed as a proper criterion for evaluating translations. But reader response cannot be reliably measured, so the criterion remains purely speculative. In essence, similar problems face L2 learning: we would have to make so many preliminary assumptions (how does L2 authenticity relate to L1 authenticity? if different, how is it different? does it vary across cultures? etc.) that to try to operationalize the concept on such a controversial basis would hardly be worth it. But if we take a simpler approach to the assessment of authenticity, I think it is realistic to assume we can at least capture some facets of it. It may be useful to distinguish between a subjective and an objective side. The subjective side is simply how learners experience particular language material – whether they find it interesting, useful, credible, or otherwise. While hardly independent of their learning situations, learners' subjective experiences are of interest as feedback on a method or type of data. The "objective" side – the researcher's perspective – is a matter of assessing the degree of similarity between an observable response and a desired one. So for example responding to a question with a relevant answer, or asking relevant questions in a puzzling situation, could be acceptably authentic responses. It is important however to bear in mind that learners' responses are a matter of social practice; they are mediated rather than immediate or naïve. They are socially conditioned, so that pedagogical practices may play a greater role than the data on its own. Responses are culturally and historically changeable.
Rote learning of teacher-provided rules carries high subjective credibility in teacher-centred cultures; the arrival of tape-recordings in classrooms apparently created a strong sense of authenticity at first, even though the typical tape was a contrived text read aloud.
4.2 The student as observer

A second point concerning authenticity relates to the difference in terms of interactivity between written and spoken material. While written text requires a reader in order to be interactively complete, taped dialogue is a record of discourse already completed in interaction, and the only role that remains for the listener is that of a non-participant observer. In this sense we might say that the learner's response to spoken dialogue is less authentic. If we accept this line of thought, it would seem that public speeches, radio talks, and other monologic events are interactively less saturated than dialogue, and in this respect more authentic. Yet I am sure that few language educators would like to limit teaching material to monologues. One defence of recorded dialogue resides in the fact that the role of the non-participant observer is not insignificant for a second-language learner. Learners quite often make spontaneous use of recorded dialogue in films and television to boost their learning (see e.g., Dufva and Martin 2002), and in social events with native speakers a good proportion of a learner's time is spent not actively participating but observing, and making sense to the best of their ability. Just as in first-language acquisition, non-participant observation seems a major source of input that makes up a necessary ingredient for successful participation. Useful and common as non-participant observation is, speech is impossible to arrest for reflective observation in ongoing discourse. Spontaneous observation tends to be holistic and focus on contents and the unfolding situation. With enough exposure, the most salient repeated sequences no doubt make their way into learners' as well as native speakers' repertoires, but the processing load is usually heavy in just trying to keep up with the gist of the conversation. The rate of learning from exposure alone is slow, and if greater acquisition efficiency is required, this means pedagogic intervention. One advantage of a corpus of spoken language is that it can provide a large number of repeated instances which can be arrested, enabling the learner to focus on forms and functions which play important roles in discourse but may not be
found frequently enough in individual situations to attract particular attention. The importance of form-focussed teaching has been widely emphasized by recent research in second language acquisition (e.g., by Ellis 2002). In written texts, it is easier to point out repeated instances of the same item even within one (longish) text, even if the data has not been manipulated for pedagogic purposes. Reading courses were built around this idea well before corpora were thought of in this connection (e.g., Nuttall 1982). In contrast, textbook dialogues tend to have either very little text with not much repetition, or a deliberate pedagogic focus which may make the dialogue unfaithful to speech as it normally is. It is well known that descriptions of spoken language lag behind descriptions of writing, and this puts pressure on students' skills in "noticing" relevant features of speech. Observing recurrent patterns is a prerequisite for acquisition, and being able to present such patterns is the greatest strength of corpora. Access to speech as recorded in a corpus seems likely to provide a shortcut to observations which are far harder to make in the continuous flow of real speech situations.
5. For communication or for learning?

While a written corpus can be utilized for communicative tasks (such as translation, writing or reading), a spoken corpus is less directly useful. A corpus can provide practice in communication even if nothing is directly learned from it – when, in other words, it can be used to perform a given task. But this requires that the communication be delayed in one way or other, as written communication normally is. Conversation cannot be carried out by referring to a computer, so is there much communicative use for a speech corpus? I think there is. Prepared talks are an increasingly frequent set of genres, not only in academia but also in business (while the need is probably dwindling in politics). This tendency was reflected in the course described above, which was based on the assumption that one of the main skills graduates will need in their professional lives is giving presentations. In this context, a speech corpus is as useful a resource as a written one. A speech corpus also needs to be separate (or easily separable) from a written corpus. Differences between spoken and written language need to be made clear to learners. There is plenty of anecdotal evidence of non-natives using lexis in their speech which is more typical of L2 writing (e.g., Sinclair 1991).
For specifically academic contexts, one might think that academic speech would be close to academic writing, and share a good deal of the same special lexis (indeed this has been the standard assumption behind approaches to EAP spoken courses, and was one of the initial questions that interested the MICASE team). But it now looks clear that academic speech is closer to conversation than it is to academic writing (see Lindemann and Mauranen 2001, Mauranen 2002, Swales and Burke 2001, as well as results from the Arizona corpus T2K-SWAL: Biber 2001). Although it is easy to say with hindsight that this ought to have been predictable, it was not, and it still seems counterintuitive to many people working with EAP, for whom those features which distinguish academic speaking from casual conversation are salient, and therefore should not be overlooked. We certainly need more subtle analyses of the data, since broad categories like "academic speech" and "conversation" do not yield easily applicable results if students aspire to membership in particular discourse communities. Because the differences between spoken and written academic genres are so large, however, insights from written text analysis are unlikely to throw much light on speech. One problem to emerge in the Tampere course was the assumption that the items students looked for would be common to both kinds of data: they looked for expressions in written text (thesis abstracts) and then turned to the speech corpus to test their hypotheses about them. They did not necessarily find instances of what they were looking for, which some students understandably found frustrating. As Biber et al. (1998) emphasize, genre and register differences are very important after the initial stages of learning. From a more research-oriented perspective, Butler (1997:69) has argued that corpora which combine spoken and written data risk fudging important distinctions revealed by separate analyses. In sum, it seems that a speech corpus is vital both for communicative usefulness and for an adequate view of different modes in the target language.
6. Analytic vs. holistic processing

The positive comments concerning language made by students on the course all mentioned "combinations" of some kind: words with other words, prepositions in context, etc. In contrast, none commented on the way that longer chunks of language could be seen to fall into separate parts, or to show
internal variation. These comments were very similar to those that I have received from professional translators (Jääskeläinen and Mauranen 2000), and from my own students who major in English – everyone stresses the combinatory possibilities of the smaller units that they are more familiar with. This seems to be a common expectation: learners have acquired a large number of small items, which are used as building blocks for utterances. They then seem to follow the "open choice principle" rather than the "idiom principle" (Sinclair 1991), reflecting an analytic rather than a holistic learning style. This squares with what is generally thought to characterize L2 learners. Wray (2002) postulates two entirely different relationships to what she calls "formulaic sequences" among native speakers and non-native speakers: while native speakers acquire formulae as wholes suited to their communicative needs and for reducing processing load, non-natives learning an L2 after childhood have the reverse approach: they analyze new sequences for individual words, store these as isolated units, and then construct expressions by putting those units together. This building-up procedure is prone to errors, and so the L2 user often ends up combining elements in a non-idiomatic manner compared to a native speaker who has acquired the larger units as single wholes in the first place. The students on the Tampere course were adult learners, so their analytic approach could be seen as predictable from this perspective – a viewpoint which under different guises is fairly widely accepted, allocating holistic and analytic approaches to L1 and L2 learners respectively. The reasons for this difference are not entirely clear – is it only brain maturation, or also literacy, teaching traditions or other socially variable causes? One complicating point can also be raised here: it has been claimed (see e.g., Wray 2002) that there is a general strategic preference in individual learning styles (whether of the L1 or L2), and the main division is again along the lines of holistic vs. analytic. If this is the case, then it means that some learners retain a holistic strategy even after reaching adulthood (by SLA criteria 8 to 10 years of age), and may require a different teaching approach from that for analytic processors. At the very least, we ought to predict different difficulties for these two groups. Holistic learners should find it hard to make correct or successful breakdowns of stored chunks (if indeed they have had much chance of learning them in the classroom), while analytical learners ought to have difficulties in getting the chunks right in the first place, because these would have to be built up as combinations.
Much more needs to be known about these processing preferences. But even before such evidence has accumulated, it seems that if pedagogic intervention is to be useful, it should not try to desperately impress native-like combinations on learners – if Wray is right, this is hardly likely to lead to great success in any case. Instead, we might perhaps attempt to provide a different conceptualization of that which is communicatively salient. Very often the communicatively useful – and hard-to-learn – units are semi-fixed expressions, rather than either completely fixed routines or maximally open choices; it might be useful to learn to observe these, and see where variation is possible and where not, even if the precise expressions do not come out as native-like. In detecting such patterning, corpus data is obviously at its best. But it is also clear that considerably more work needs to be put into providing students with workable strategies in observing data. The course studied here showed that searches by novices were far from ideal, and similar findings have emerged elsewhere (e.g., Kennedy and Miceli 2002).
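One simple way of making such patterning observable is to list the word sequences that recur in a corpus and let learners inspect where the wording is fixed and where it varies. The snippet below is a rough first pass only: it counts contiguous three-word sequences, whereas genuinely semi-fixed expressions also involve variable slots that a plain n-gram count cannot capture. The file name and the thresholds are assumptions.

```python
from collections import Counter
import re

def frequent_ngrams(text, n=3, min_freq=5):
    """Recurrent n-word sequences, as a crude approximation of formulaic patterning."""
    tokens = re.findall(r"[a-z']+", text.lower())
    grams = Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return [(gram, count) for gram, count in grams.most_common() if count >= min_freq]

# Hypothetical transcript file standing in for a spoken academic corpus.
with open("seminar_transcripts.txt", encoding="utf-8") as f:
    for gram, count in frequent_ngrams(f.read())[:15]:
        print(f"{count:4d}  {gram}")
```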
7. The native speaker model vs. English as lingua franca

One tacit assumption which is virtually never questioned in corpus work is that native speaker language serves as a useful model for those learning that language, who should strive to achieve the adult native standard as closely as possible. The language of corpora is typically restricted: it tends not to include much child language, translated language, or L2 language, among other things. However, we might question the relevance of solely adult L1 speaker data as a model for learners. I would argue that an "authentic" speech corpus need not be an L1 corpus in the case of learners who aspire to use English as an international language. This is of course the situation for most people in today's world, since English is the lingua franca of most internationalized and globalized walks of life. In fact, we could argue that with the exception of language professionals, a highly idiomatic command of native-like English, and in-depth knowledge of one of its major variants, is fairly irrelevant to most users. The English teaching world has been engaged in lively discussion about the relevant models of English in recent years, and lingua franca English has been put forward as a particularly appropriate one (e.g., Jenkins 2000, Knapp and Meierkord 2002, Seidlhofer 2000, 2002), although this discussion has not quite reached the corpus world
yet (see, however, Seidlhofer and Widdowson 2002). For instance Hunston and Francis (1999) concede a certain interest value to corpus research into international English, but show little enthusiasm for or faith in such a possibility (an attitude reiterated in Hunston 2002). For the kind of learner discussed in this paper, one relevant model is good international English spoken in academic and professional contexts. This is a somewhat vague starting-point, but investigating the kind of language that appears to work in academic contexts may give us a handle on the kind of language that might provide a useful model and the variation we can expect. At Tampere we are compiling an L2 academic speech corpus (ELFA: English as Lingua Franca in Academic Settings – see Mauranen 2003). Compiling such a restricted corpus is starting at the conservative end, but given the labour-intensive nature of compiling and transcribing speech, and the unpredictability of the range of variation in wider contexts, I see this as simply realism which may nonetheless open up new vistas. English as lingua franca (ELF) is a problematic model, because it seems excessively liberal to assume that anything goes where ELF is concerned, and that anything an L2 speaker chooses to utter must be acceptable. Clearly, to judge what is acceptable is always problematic, but a good choice would seem to be to focus on expert users – like established and experienced academics who have frequent international contacts, who teach in English, routinely give presentations in English, or participate in international committees, etc. – in short, people whose English works well in practice. The problem is the dividing line: the moment we exclude certain speakers or writers, our criteria are inevitably intuitive and normative. But this is a problem in designing any corpus. If we want to study translation, we want to look at the good translator first, to understand the process and the products as successful activities that people engage in; likewise the "good language learner" who seems to crop up in L2 studies every now and then. So also the successful speaker of ELF is of interest as a relevant model for learners, and thereby also for corpora. ELF speakers are not to be equated with learners, and it is important to distinguish between corpora for modelling successful use, based on good L2 speakers, and learner corpora, designed to track the development of L2 learners who strive towards the target language. Despite the propensity of language educators to view the rest of the world as learners, this may be an alien or irrelevant identity to ELF speakers. For many professionals, communicating in English is a means of going about their work, not a means for learning more of the language. And
while it may be true that a speaker's repertoire of a language changes gradually, this goes for the L1 as well as the L2, or L3. We should give full recognition to the successful use of ELF speakers and treat repeated instances as evidence of successful strategies, just as we do with L1 data. For instance one good L2 strategy is to make a little go far, as illustrated in Altenberg and Granger's (2001) study of "make". NNSs "overused" "make", and combined it with incorrect or untypical collocates. The authors concluded that these users needed more practice until their use of "make" could equal the NS norm, but to me it seemed that most of the unidiomatic uses were at least comprehensible. The difference is one of viewpoint: Altenberg and Granger saw their subjects as learners (which was not unreasonable, given that they were university students of English), but from the point of view of ELF communication, most of their usage could be seen as perfectly adequate.
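This is not the procedure Altenberg and Granger used, but the raw evidence on which such comparisons rest can be sketched very simply: collect the words occurring near "make" in two corpora and set the lists side by side. The file names below stand for native-speaker and learner writing and are assumptions; a serious comparison would also normalize for corpus size and apply an association measure rather than raw counts.

```python
from collections import Counter
import re

def collocates(text, node, span=3):
    """Count words co-occurring within `span` tokens of the node word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            counts.update(tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span])
    return counts

# Hypothetical corpus files.
with open("ns_essays.txt", encoding="utf-8") as f:
    ns = collocates(f.read(), "make")
with open("learner_essays.txt", encoding="utf-8") as f:
    nns = collocates(f.read(), "make")
print("NS top collocates: ", ns.most_common(10))
print("NNS top collocates:", nns.most_common(10))
```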
8. Conclusion

We need spoken corpora for teaching the spoken language. They can achieve high authenticity, serve as communication aids, and provide irreplaceable models of the target language. Academic spoken corpora already exist, with MICASE available to anyone through the Internet. But we also need ELF corpora to model international English, for research as well as for teaching: it is important to know what good target usage is like. The case study described here made no recordings of the students' performances or presentations, and therefore cannot tell what they actually learned from using the corpus, or whether they used their corpus findings in their presentations. It is, for some reason, not part of normal teaching practice to evaluate how much or what students actually learn: while it is common to evaluate students as individuals, it is rare to compare their collective scores with pre-tests or other groups. Assessing learning outcomes is left to researchers, and hopefully, this can be carried out as the next step in Tampere, as corpus use becomes regular practice on similar courses. However, it was already found that the reactions of those involved were more positive than negative, and importantly, the teacher was inspired by the corpus and felt encouraged to continue. To make a serious contribution to language teaching, corpora must be adopted by ordinary teachers and learners in ordinary classrooms. The
dissemination of pedagogic practices needs to be effected not only via the obvious channels of teacher education and in-service training, but also by developing programs and teaching materials. The programs should be more user-friendly to learners than those currently available, including hands-on instructions, and manuals in the L1. Corpora also need to find a place in the materials packages that constitute the backbone of most language courses. In order to adopt something radically new, most ordinary teachers need (1) initiation and practice, (2) materials they can easily use as part of their normal teaching. It is not reasonable to expect teachers to take on large extra workloads in addition to their normal, often demanding duties. It is also important not to underestimate the novelty of corpus thinking to most practising teachers. My observations from sessions with highly skilled and experienced EAP teachers were that it was surprisingly hard to convey the logic of what corpus data is like, how a corpus works, what kinds of questions can be asked, as well as the more mundane logic of the program, which I had purposely selected on the basis of its simplicity. In brief, there is more work in store for corpus enthusiasts if they want to have corpora adopted on a large scale – and thereby have an impact on how languages get learned.
Note

1. I am grateful to Mary McDonald-Rissanen for undertaking to try out the corpus, for giving me access to her course materials, and for sharing with me her insights on problems and successes.
References

Altenberg, B. and Granger, S. 2001. "The grammatical and lexical patterning of make in native and non-native student writing". Applied Linguistics 22:173–194.
Biber, D. 2001. "Dimensions of variation across university registers. An analysis based on the T2K-SWAL corpus". Paper given at the Third North American Symposium on Corpus Linguistics and Language Teaching, University of Boston, 23–25 March 2001.
Biber, D., Conrad, S., and Reppen, R. 1998. Corpus Linguistics. Investigating Language Structure and Use. Cambridge: Cambridge University Press.
Butler, C. S. 1997. "Repeated word combinations in spoken and written text: Some implications for functional grammar". In A Fund of Ideas: Recent Developments in Functional Grammar, C.S. Butler, J.H. Connolly, R.A. Gatward and R.M. Vismans (eds), 60–77. Amsterdam: IFOTT.
Dufva, H. and Martin, M. 2002. "Hyvän kielenoppijan niksit". In Kieli yhteiskunnassa – yhteiskunta kielessä, A. Mauranen and L. Tiittula (eds), 247–262. Jyväskylä: AFinLA.
Ellis, N. C. 2002. "Frequency effects in language processing". Studies in Second Language Acquisition 24:143–188.
Hunston, S. 2002. Corpora in Applied Linguistics. Cambridge: Cambridge University Press.
Hunston, S. and Francis, G. 1999. Pattern Grammar. Amsterdam: John Benjamins.
Jääskeläinen, R. and Mauranen, A. 2001. "Kääntäjät ja kieliteknologia – kokemuksia työelämästä". In Tietotyön yhteiskunta – kielten valtakunta, M. Charles and P. Hiidenmaa (eds), 358–370. Jyväskylä: AFinLA.
Jenkins, J. 2000. The Phonology of English as an International Language. Oxford: Oxford University Press.
Kennedy, C. and Miceli, T. 2002. "The CWIC project: Developing and using a corpus for intermediate Italian students". In Teaching and Learning by Doing Corpus Analysis, B. Kettemann and G. Marko (eds), 183–192. Amsterdam: Rodopi.
Knapp, K. and Meierkord, C. (eds). 2002. Lingua Franca Communication. Frankfurt: Peter Lang.
Kohonen, V. 2001. "Experiential language learning: Second language learning as cooperative learner education". In Experiential Learning in Foreign Language Education, V. Kohonen, R. Jaatinen, P. Kaikkonen and J. Lehtovaara (eds), 8–60. London: Pearson Education.
Lindemann, S. and Mauranen, A. 2001. ""It's just real messy": The occurrence and function of just in a corpus of academic speech". English for Specific Purposes 2001 (1):459–476.
Mauranen, A. 2002. ""A Good Question": Expressing evaluation in academic speech". In Domain-specific English. Textual Practices across Communities and Classrooms, G. Cortese and P. Riley (eds), 115–140. Frankfurt: Peter Lang.
Mauranen, A. 2003. "The corpus of English as lingua franca in academic settings". TESOL Quarterly 37 (3):513–527.
McCarthy, M. 1998. Spoken Language and Applied Linguistics. Cambridge: Cambridge University Press.
McCarthy, M. 2001. Issues in Applied Linguistics. Cambridge: Cambridge University Press.
McCarthy, M. 2002. "What is an advanced level vocabulary?" In Corpus Studies in Language Education, M. Tan (ed.), 15–29. Bangkok: IELE Press.
Nida, E. A. 1964. Towards a Science of Translating. Leiden: Brill.
Nuttall, C. 1982. Teaching Reading Skills in a Foreign Language. London: Heinemann.
Seidlhofer, B. 2000. "Mind the Gap: English as a mother tongue vs. English as a lingua franca". Vienna English Working Papers 9:51–68.
Seidlhofer, B. 2002. "Basic questions". In Lingua Franca Communication, K. Knapp and C. Meierkord (eds), 269–302. Frankfurt: Peter Lang.
Seidlhofer, B. and Widdowson, H. 2002. "Lending VOICE to common concerns about teaching and language corpora". Paper given at TALC 02, Bertinoro, 26–31 July 2002.
Simpson, R. C., Briggs, S. L., Ovens, J., and Swales, J. M. 1999. The Michigan Corpus of Academic Spoken English. Ann Arbor, MI: The Regents of the University of Michigan.
Sinclair, J. M. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Swales, J. 2001. "Integrated and fragmented worlds: EAP materials and corpus linguistics". In Academic Discourse, J. Flowerdew (ed.), 153–167. London: Longman.
Swales, J. and Burke, A. 2001. "'It's really fascinating work': Differences in evaluative adjectives across academic registers". Paper given at the Third North American Symposium on Corpus Linguistics and Language Teaching, Boston, 23–25 March 2001.
Wray, A. 2002. Formulaic Language and the Lexicon. Cambridge: Cambridge University Press.
Zhang, X. 2002. "Corpus, English conversational class, Chinese context". In Corpus Studies in Language Education, M. Tan (ed.), 81–94. Bangkok: IELE Press.
Zorzi, D. 2001. "The pedagogic use of spoken corpora: Introducing corpus concordancing in the classroom". In Learning with Corpora, G. Aston (ed.), 85–107. Bologna: CLUEB.
Lost in parallel concordances

Ana Frankenberg-Garcia
Instituto Superior de Línguas e Administração, Lisbon, Portugal
This paper calls for a reflection on the use of parallel concordances in second language learning. It is centred on two main questions. First, in what language learning situations might parallel concordances be beneficial? Because they encourage learners to compare languages – normally their mother tongue and the language they are in the process of learning – it is argued that it is important to make conscious decisions on whether or not parallel concordances are actually called for. Second, how might language learners and teachers set about navigating through a parallel corpus? Because parallel concordances expose learners not only to two languages at the same time (L1 and L2), but also to two types of language (source texts and translations), it is necessary to consider whether to use L1 or L2 search terms, and whether it is important to distinguish between translational and non-translational L2. The discussion draws on examples of how Portuguese learners of English can learn from concordances extracted from COMPARA, a parallel, bidirectional corpus of English and Portuguese.
1. Introduction

Concordances extracted from monolingual corpora have been used in a variety of ways to promote second language learning. Parallel concordances have more typically been associated with translation studies, translator training, the development of bilingual lexicography and machine translation. Although the potential benefits of parallel concordances in second language learning have not been overlooked (for example, Roussel 1991, Barlow 2000 and Johansson and Hofland 2000), they have certainly been much less exploited than monolingual concordances. This paper calls for a reflection on when and how parallel concordances might be used to enhance second language learning. It is centred on two main questions:
a. In what language learning situations might parallel concordances be beneficial?
b. How might language learners and teachers set about navigating through a parallel corpus?

Any attempt to answer the first question will inevitably rekindle the debate on the use of the first language in the second language classroom. Despite the growing belief that using the first language is not necessarily wrong, it is generally agreed that not every language learning situation calls for it. Given that parallel concordances encourage learners to compare mother tongue and target language, in what kind of setting and in what circumstances are they then appropriate? How to navigate through a parallel corpus in second language learning is another question that must be posed if the fundamental structural difference between monolingual and parallel corpora is to be taken into account: while the former contemplate texts written in a single language, the latter look not only at two languages at the same time (L1 and L2) but also at two types of language (source texts and translations). In what situations is it relevant to distinguish between concordances extracted from corpora of L1 source texts and their translations into L2 and ones of L2 source texts and their translations into L1? When are the differences between searching from source texts to translations and from translations back to source texts important? How do these four factors interact? In this paper I shall concentrate on attempting to address these questions from the perspective of issues that have exclusively to do with parallel, as opposed to monolingual, concordances, and will ignore factors which are common to both types of concordances, such as the availability of a corpus, the representativeness of the corpus, the level of difficulty of the concordances, and the fact that, because concordances rely on a fairly sophisticated level of meta-awareness, learners should ideally be adults, literate and cognitively-oriented.
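The four-way structure just described, two languages crossed with source texts and translations, can be pictured with a toy sentence-aligned corpus. The data and the lookup function below are invented purely for illustration and bear no relation to COMPARA's actual format or search interface.

```python
from dataclasses import dataclass

@dataclass
class AlignedPair:
    source_lang: str   # language the text was originally written in ("en" or "pt")
    source: str        # source-text sentence
    translation: str   # its translation

# Invented toy data; COMPARA itself is consulted through its own search interface.
corpus = [
    AlignedPair("en", "She eventually agreed.", "Ela acabou por concordar."),
    AlignedPair("pt", "Ele ficou a olhar para o mar.", "He stood staring at the sea."),
]

def lookup(term):
    """Show every aligned pair containing the term and whether it was found in a source text or a translation."""
    t = term.lower()
    for pair in corpus:
        if t in pair.source.lower():
            print(f"[{pair.source_lang} source text]  {pair.source}  ||  {pair.translation}")
        elif t in pair.translation.lower():
            print(f"[translation from {pair.source_lang}]  {pair.source}  ||  {pair.translation}")

lookup("eventually")   # an English search term
lookup("mar")          # a Portuguese search term
```

Even this toy version makes the navigational choices visible: the same English search term may turn up in an English source text or in a translation from Portuguese, and these are different kinds of evidence, a distinction the questions above return to.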
2. In what language learning situations might parallel concordances be useful?

Parallel concordances are based on translational relations between texts; as such, they encourage learners to compare languages, normally their mother
tongue and the language they are in the process of learning. It follows that it can only be appropriate to use parallel concordances when it is appropriate to use the first language in the second language classroom. The idea of using the L1 is not novel. It was present in the grammar-translation method used for teaching Greek and Latin in the late eighteenth century, and this is how modern languages began to be taught in the nineteenth century. Considerable emphasis was placed on translation, and the L1 was often used to explain how the target language worked (Howatt 1984). Modern approaches to language teaching have tipped the balance of instruction towards the target language. In doing so, while some approaches began to actively discourage the use of the L1, others took practically no notice of its existence (Atkinson 1987, Phillipson 1992). Probably the most influential and not entirely unreasonable argument behind this is the belief that the first language works against L2 fluency. In addition to this, there are a number of practical reasons for neglecting the L1: it wouldn't work in multilingual classes, native speaker teachers might not know or might not know enough of their students' L1, and many modern L2 teaching materials have been conceived for language learners in general rather than for learners of a single L1 background in particular. In spite of these impediments to the use of the L1, there is a growing belief that it is not just there to impair L2 fluency, and that it can in fact be used productively in second language learning, provided that the bulk of instruction continues to be carried out in the target language. Atkinson's (1993) book Teaching Monolingual Classes explores several different ways in which second language teachers can attempt to make the most of their students' L1. Medgyes (1994) argues that knowledge of their students' L1 is one of the most valuable assets second language teachers can have. For Barlow (2000:110), "learning a second language involves some use of first language schemas as templates for creation of schemas for the second language." Cohen (2001) reports on evidence that despite ESL teachers' general admonitions not to use the first language, learners continually resort to written or mental translation as a strategy for learning. There is also some evidence that the first language may actually contribute towards the development of a second language. Tomasello and Herron (1988, 1989), for example, report that a group of English-speaking learners of French learned more when the influence of English upon French was openly discussed in class than when instruction focused on French only.
Provided they are used wisely, it would hence seem that parallel concordances can carve themselves a legitimate place in second language instruction. To discuss the circumstances under which they might be beneficial, it is useful to distinguish between self-access and classroom use. Parallel concordances can be used for independent study when learners know what they want to say in the L1 and want to find out how to say it in the L2, or when they see something in the L2 and want to understand what it means in the L1. According to Barlow (2000:114), a parallel corpus is like an "on-line contextualized bilingual dictionary" that gives learners access to concentrated, natural examples of language usage. Parallel concordances can therefore be used to complement bilingual and language production dictionaries when writing in a foreign language. They can not only help learners find foreign words they don't know, but they can also give them the contexts in which these words are appropriate. They can also help them come to terms with the fact that there are certain words in their L1 for which there are simply no direct translations available. When reading in a foreign language, learners may also find it useful to resort to parallel concordances to help them understand foreign words, meanings and grammar that they are unfamiliar with. Extracting concentrated examples of chunks of the foreign language that they do not quite understand matched to equivalent forms in their mother tongue can help learners grasp what is going on in the L2. The main point here, however, is that when learners resort to concordances on a self-access basis, their queries are initiated by themselves (Aston 2001). This means that they are engaged in looking for demonstrations of language use that might help them solve problems that are in the forefront of their minds.1 In this sense, learner-initiated concordances are likely to be meaningful, relevant and conducive to successful language learning. The picture changes when it comes to using parallel concordances in the classroom. It is self-evident that parallel concordances will work best with monolingual classes and with teachers who know their students' L1. What is not so obvious is when it is appropriate to resort to them. The idea of looking at differences between the L1 and L2 as a basis for teaching the L2 is not novel: it was the main line of inquiry of the Contrastive Analysis Hypothesis (Lado 1957). The problem with Contrastive Analysis, however, is that not all differences between languages are relevant to L2 learning (Wardhaugh 1970, Odlin 1989). Moreover, even when they are relevant, drawing attention to them may not be unconditionally helpful to all learners at all times. As Sharwood-Smith (1994:184) points out, "consciousness-raising techniques may be
where the insight has already been gained at a subconscious, intuitive level”. Language contrasts that are no longer, or have never been, a problem to learners could provoke overmonitoring and inhibit spontaneous performance. Indeed, those who defend L2-only approaches to language teaching would, in these circumstances, be right to affirm that the first language can undermine second language fluency. Instead of presenting learners with L1-L2 contrasts that do not affect and could even be detrimental to their learning, Granger and Tribble (1996) propose that what matters are the differences between the learner’s interlanguage and the L2, which they call Contrastive Interlanguage Analysis. However, this does not mean that the idea of comparing L1 and L2 need be abandoned altogether. For Wardhaugh (1970), although L1-L2 differences might not be useful for predicting errors, as originally proposed in the Contrastive Analysis Hypothesis, they do help to explain learner errors. Indeed, if you look at the L2 problems that students actually have, while it is true that not all of them have to do with their L1, it is also true that students who share the same native language often experience a significant number of second language problems that can be traced back to the influence of their first language. Lott (1983), for example, describes negative transfer errors that are common to Italian learners of English. Frankenberg-Garcia and Pina (1997) describe problems of crosslinguistic influence that are typical of Portuguese learners of English, which include not only negative transfer, but also the avoidance of transfer, whereby students avoid using perfectly acceptable English forms simply because they perceive them as being too Portuguese-like. Problems of crosslinguistic influence like these can open the door to the use of parallel concordances in the second language classroom. Instead of drawing attention to language contrast per se, or predicting problems of language learning that may fail to materialize, parallel concordances can be brought to the classroom to help learners focus on real interlanguage problems that can be traced back to the influence of the first language. Roussel (1991) appears to have been the first to propose using parallel concordances for this purpose. She showed how French learners of English tend to have problems with tonic auxiliaries and how parallel concordances could help sensitize these students to certain prosodic features of English. Following a similar line of thought, Johansson and Hofland (2000) report that overuse of “shall” is a common error among Norwegian learners of English caused by the influence of Norwegian, and proceed to show how these learners can explore
the English-Norwegian Parallel Corpus to find out that the etymologically equivalent Norwegian modal auxiliary skal does not always correspond to the English shall. Frankenberg-Garcia (2000) provides several further examples of Portuguese learners of English making inappropriate use of prepositions because of the influence of Portuguese, and proceeds to show how a parallel corpus can be a useful source of authentic data for exercises to help them become aware of when they tend to get the first and the second language mixed up. I cannot stress enough, however, that before using parallel concordances in the classroom with a group of learners, it is important for teachers to find out, through observation, whether these learners are experiencing L2 problems that can be traced back to their L1. Parallel corpora enable us to access so many comparable facts of linguistic performance that it is easy to lose sight of the language contrasts that really matter, and to overburden learners with contrasts that bear no relation, and can even be detrimental, to their learning processes (cf. Leńko-Szymańska, this volume). Detecting negative transfer and other forms of crosslinguistic influence can help inform teachers where parallel concordances are likely to be pedagogically relevant to their students (on pedagogic relevance and corpus use see also Seidlhofer 2000).
3. Navigating through a parallel corpus

When using parallel concordances in second language learning, it is not enough to know what language contrasts might be helpful to students. It is also important to consider how to focus on them, for unlike monolingual corpora, which deal with a single language, parallel corpora involve not only two languages – L1 and L2 – but also two types of language – source texts and translations. It is therefore possible to extract concordances taken from the L1 with L2 equivalents (L1→L2), or from the L2 with L1 equivalents (L2→L1), and from source texts (ST) or translations (TT) as starting points. In other words, four types of parallel concordances are possible:

L1ST → L2TT
L1TT → L2ST
L2ST → L1TT
L2TT → L1ST
Given these possibilities, one must ask: (a) in what language learning situations is it relevant to distinguish between L1→L2 and L2→L1 concordances? (b) in what language learning situations is it relevant to distinguish between ST→TT and TT→ST concordances? (c) how do these factors combine?
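By way of illustration, the following is a minimal sketch, in Python, of how the four directions might be operationalized over a bidirectional parallel corpus. It assumes a toy in-memory representation in which each aligned unit records, for both languages, whether the segment is a source text (ST) or a translation (TT); the data structure, field names and the concordance function are invented for the example (they are not the query interface of COMPARA or of any other concordancer), and the two aligned pairs are shortened from examples in Tables 6 and 7 below.

    # Each aligned unit pairs a Portuguese segment with its English counterpart and
    # records, for each side, whether it is a source text ("ST") or a translation ("TT").
    corpus = [
        {"pt": ("ST", "Parecia um sujeito vestido para um baile de carnaval."),
         "en": ("TT", "He looked like someone dressed for a Carnival dance.")},
        {"pt": ("TT", "Agora já é a conferencista principal."),
         "en": ("ST", "Now, she's Principal Lecturer.")},
    ]

    def concordance(term, search_lang, search_role=None):
        """Return aligned units whose search_lang side contains term, optionally
        restricted to source-text ("ST") or translation ("TT") segments."""
        hits = []
        for unit in corpus:
            role, text = unit[search_lang]
            if term.lower() in text.lower() and (search_role is None or role == search_role):
                hits.append(unit)
        return hits

    # The four directions, for a Portuguese learner of English (L1 = "pt", L2 = "en"):
    l1st_to_l2tt = concordance("carnaval", "pt", "ST")   # L1ST -> L2TT
    l1tt_to_l2st = concordance("já", "pt", "TT")         # L1TT -> L2ST
    l2st_to_l1tt = concordance("already", "en", "ST")    # L2ST -> L1TT
    l2tt_to_l1st = concordance("Carnival", "en", "TT")   # L2TT -> L1ST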
3.1 L1→L2 or L2→L1 concordances?

When using parallel concordances for pedagogical purposes, the most basic choice that has to be made is deciding whether the starting point for searches should be an L1 or an L2 term. If the aim of instruction is to promote the development of language production skills, it makes sense to use L1 search terms, which will render concordances in L1 aligned with L2 (L1→L2 concordances). This will enable learners to see how the meanings they formulate in L1 can be expressed in L2. Conversely, if the aim of instruction is to help learners with language reception skills, then the logical thing to do is to use L2 search expressions, which will produce L2 concordances aligned with L1 (L2→L1 concordances). This will enable learners to see how forms they have selected in L2 translate into their L1. Of course, it may be argued that the ultimate aim of instruction is to help learners with both language production and reception, and that for this reason it is important to look at both L1→L2 and L2→L1 parallel concordances. This is an entirely reasonable argument when learners happen to experience the same types of difficulties in language production and reception. False cognates, for instance, often have a negative impact on both. Portuguese learners of English, for example, frequently assume that words like actually and actualmente, eventually and eventualmente, pretend and pretender, and resume and resumir mean the same, and this causes them problems not only when speaking and writing, but also when listening and reading (Frankenberg-Garcia and Pina 1997). In such cases it seems appropriate to use both L1→L2 and L2→L1 parallel concordances (assuming the problems of reception and production occur at the same time). As shown in Table 1, looking up actualmente may help learners see that the equivalent in English can be rendered as present, nowadays, these days, now, and so on.2 Looking up actually can help these same learners find out that it is a word whose equivalents include de resto and na verdade, or, most importantly, that it is often simply left out in Portuguese (cf. Tables 1 and 2).
Table 1. Sample L1→L2 concordances for actualmente (language production)

Com os rendimentos que actualmente tenho, podia dar 10 000 libras por ano sem grande esforço.
I could afford ten thousand a year from my present income, without much pain.
Claro que actualmente tenho posses para mandar fazer camisas por medida, mas o ar snob dos camiseiros de Picadilly dissuade-me de lá entrar e as popelines às riscas expostas nas montras são demasiado afectadas para o meu gosto.
Of course, I could afford to have my shirts made to measure nowadays, but the snobby-looking shops around Picadilly where they do it put me off and the striped poplins in the windows are too prim for my taste.
Deixem-me concentrar por um momento nessa lembrança, fechar os olhos e tentar absorver toda a infelicidade que nela existia, para apreciar melhor o conforto de que actualmente desfruto.
Let me just concentrate for a moment on that memory, close my eyes and try and squeeze the misery out of it, so that I will appreciate my present comforts.
Por que será que actualmente só sinto apetite sexual em Londres, onde tenho uma namorada que se satisfaz com a sua castidade, e quase nunca em casa, em Rummidge, onde tenho uma mulher cujo apetite sexual é inesgotável?
Why do I only seem to get horny these days in London, where my girlfriend is contentedly chaste, and almost never at home in Rummidge, where I have a partner of tireless sexual appetite?
O meu irmão mais novo, o Ken, emigrou para a Austrália no princípio dos anos 70, quando era mais fácil do que actualmente, e foi a melhor decisão que tomou na vida.
My young brother Ken emigrated to Australia in the early seventies, when it was easier than it is now, and never made a better decision in his life.
It is not always the case, however, that the problems that learners experience at the level of language reception are the same or occur at the same time as the ones they experience at the level of language production. Generally speaking, reception comes before production. Portuguese learners of English, for example, don’t seem to have much difficulty understanding the English words lose and miss. When producing the language, however, a common error is for them to say lose when they mean miss:

* I’m sorry I’m late. I lost the train.

This particular problem seems to stem from the fact that both concepts are normally expressed by a single Portuguese verb, perder. Looking up miss in
is particular problem seems to stem from the fact that both concepts are normally expressed by a single Portuguese verb, perder. Looking up miss in
Lost in parallel concordances 221
Table 2. Sample L2àL1 concordances for actually (language reception) I actually went so far as to blindfold myself, Fui, de resto, ao ponto de pôr uma venda with a sleeping mask British Airways gave nos olhos, que me deram em tempos num me once on a flight from Los Angeles. voo da British Airways vindo de Los Angeles. «So you're actually making a positive contri- – Então, você está na verdade fazendo uma bution to the nation’s trading balance?» contribuição positiva para a balança comercial do país. (e guy's name is actually pronounced «Kish», he’s Hungarian, but I prefer to call him «Kiss».
(O nome do tipo pronuncia-se «Kish», é húngaro, mas prefiro chamar-lhe «Kiss».
In the ease of the family presence we oen didn't actually greet each other at meals; it would have been like talking to oneself.
No aconchego familiar, não cumprimentávamos os outros na hora das refeições; seria como falar consigo mesmo.
Well, when I imagined them, I never saw myself as actually experiencing them later on.
Pois bem: nunca me vi ao fantasiá-las, como existindo-as mais tarde.
Table 3. Sample L2→L1 concordances for los.*

But somewhere, sometime, I lost it, the knack of just living, without being anxious and depressed.
Mas houve um momento, uma altura qualquer, em que perdi o treino de viver, viver apenas, sem andar ansioso nem deprimido.
I was rapidly losing faith in this hospital.
Eu estava perdendo rápido a confiança no hospital.
But when they got to the brothel, Frédéric lost his nerve, and they both ran away.
Mas quando chegaram ao bordel, Frédéric perdeu a coragem e fugiram ambos.
If it is to appear next winter, I haven't a minute to lose between now and then.
Se é para sair no próximo Inverno, não tenho um minuto a perder.
He savors his freedom but doesn't lose sight of his master.
Saboreia a liberdade, mas não perde o amo de vista.
Table 4. Sample L2→L1 concordances for miss.*

I agreed enthusiastically, but I spent most of the flight home wondering what I'd missed.
Concordei entusiasticamente, mas passei a maior parte da viagem de regresso a pensar no que teria perdido.
I meant to catch the 4.40, but just missed it.
A minha intenção era apanhar o das 4.40, mas acabei de perdê-lo.
I got so carried away by that bit of description that I discovered I'd missed the 5.10 as well.
Embrenhei-me tanto na descrição que estava a fazer que descobri que também perdi o comboio das 5.10.
Anyway, I'd better stop, or I'll miss the 5.40 as well.
Bom, é melhor parar por aqui, senão vou perder o das 5h40 também.
But he had found a guide, and didn't want to miss out on an opportunity.
Mas tinha encontrado um guia, e não ia perder esta oportunidade.
Table 5. Sample L1→L2 concordances for perd.*

Mas houve um momento, uma altura qualquer, em que perdi o treino de viver, viver apenas, sem andar ansioso nem deprimido.
But somewhere, sometime, I lost it, the knack of just living, without being anxious and depressed.
Passou uma hora, depois outra; a neve juntava-se nas dobras das roupas; perderam-se.
An hour passed, then another; snow gathered thickly in the folds of their clothes; they missed their road.
– Mas que se perde em experimentar?
«What can you lose by trying?
José Dias não perdia as defesas orais de tio Cosme.
José Dias never missed a single one of his speeches for the defense.
– Ser adoptada e depois perder a mãe.
` To be adopted and then to lose your mother?
the English to Portuguese direction of a parallel corpus would not tell learners what they need to know, any more than looking up lose. In both cases, the Portuguese equivalent is perder (cf. Tables 3 and 4). However, looking up perder in the Portuguese to English direction returns results that can help learners notice the difference between lose and miss, and fix the difference in their minds (cf. Table 5).
Second language problems that affect reception but not production are not as common, and detecting them is not as simple, for they do not always result in visible errors. Still, language reception problems can sometimes be spotted through reading comprehension exercises, or during conversations, when communication breaks down. Whatever problems learners of a given native language happen to have, what is important is to be aware that L1→L2 parallel concordances are different from L2→L1 parallel concordances, and that the two directions serve different purposes. L1→L2 concordances are more likely to enhance language production, while L2→L1 concordances are better suited to improving language reception.
3.2 ST→TT or TT→ST concordances?

Learners using parallel concordances are typically exposed to source texts on one side of the corpus and to translations on the other. This means that, just as it is possible to extract concordances from L1 to L2 or from L2 to L1, it is also possible to present learners with parallel concordances going from source texts to translations (ST→TT), or from translations to source texts (TT→ST). In unidirectional parallel corpora, the relationship between these factors is constant. If the learners’ L1 happens to be the language of the source texts, the L2 will be the language of the translations. Or the other way round: if the L1 is the language of the translations, then the source texts will necessarily be in the L2. St John (2001) describes a case study of an English-speaking learner of German using the German-English INTERSECT corpus (Salkie 1995), where the source texts are in German and the translations in English. For this learner, the L1 part of the concordances consists of translations while the L2 part consists of source texts. For a German learner of English using the same corpus, the opposite would be the case. For learners using bi-directional parallel corpora like COMPARA (Frankenberg-Garcia and Santos 2003), CEXI (Zanettin 2002) or the ENPC (Johansson et al. 1999), the part of the corpus in their L1 contains both translations and source texts, as does the part of the corpus in their L2. This means that when searching L1→L2, it is possible for learners to work from translations to source texts, from source texts to translations, or even from both to both. The same applies to situations in which learners are working with L2→L1 concordances. Given these possibilities, one must ask in what language learning situations it may be relevant to distinguish between them.
Table 6. L1TT→L2ST concordances for já (sheltering learners from translational L2)

Agora já é a conferencista principal.
Now, she's Principal Lecturer.
Quando espreitei outra vez às 7.30 da manhã, já se fora embora.
When I looked again at half-past seven this morning, he had gone.
E quanto às visitas subsequentes, quando já era o autor da escandalosamente famosa Madame Bovary?
And what of subsequent visits, when he had become author of the notorious Madame Bovary?
– O pai dela já morreu.
' Her father's dead.
Não acha que ele já estudou muito, ficou nisso o dia inteiro, Sonny, ele deveria fechar os livros e ir dormir cedo.
Don't you think he's done enough, he's been at it all day, Sonny, he should close his books and have an early night.
Table 7. L1ST→L2TT concordances for carnaval.* (helping learners with L1 culturally-bound concepts)

Parecia um sujeito vestido para um baile de carnaval dos anos 1920.
He looked like someone dressed for a Carnival dance in the 1920s.
Um sujeito de nome Áureo de Negromonte, «famoso carnavalesco e campeão de desfiles», segundo a TV, afirmava que a morte de Angélica era uma perda irreparável para o carnaval brasileiro.
A man by the name of Áureo de Negromonte - ' a famous Carnival figure and competition winner ', according to the TV – stated that Angélica's death was an irreparable loss for Carnival in Brazil.
O desfile de carnaval daquele ano, segundo Negromonte, estava irremediavelmente prejudicado.
The Carnival parade that year, according to Negromonte, was irretrievably damaged.
O programa estava sendo transmitido diretamente da nova igreja de Copacabana, lotada, apesar de ser um domingo de carnaval.
The program was being broadcast direct from the new church in Copacabana, which was packed despite it being Carnival weekend.
«São esses blocos carnavalescos», disse o motorista de mau humor, «os filhos da puta gostam de desfilar pelas ruas movimentadas...
' It's those Carnival groups, ' the driver said ill-humouredly. ' The sons of bitches like to parade down the busy streets...
Table 8. L1ST+TT→L2TT+ST concordances (helping learners with contrastive prepositions)

[TT] O último deles consistia em ficar de cabeça para baixo por uns minutos para fazer o sangue ir à cabeça.
[ST] The last one consisted of hanging upside down for minutes on end to make the blood rush to your head.
[TT] Não acreditei no que estava acontecendo comigo.
[ST] I couldn't believe what was happening to me.
[TT] Alexandra acha que eu estou sofrendo de falta de auto-estima.
[ST] Alexandra thinks I'm suffering from lack of self-esteem.
[ST] – A minha saúde depende do contrário.
[TT] «My health depends on just the opposite.
[ST] Mais tarde, na cama, depois do sexo, Fúlvia me encheu de elogios, disse que eu era muito bom naquilo.
[TT] Later, in bed, after sex, Melissa showered me with praise, told me I was very good at it.
It is well documented in the literature that the language of translation is not the same as language which is not constrained by source texts from another language (for example, see Baker 1996). According to Gellerstam (1996), the differences between translational and non-translational language weigh against the use of parallel corpora in language learning. Indeed, exposing language learners to translational language may be problematic. COMPARA 1.6 contains equal amounts of translational and non-translational English, but if one looks at the distribution of the adverb “already”, only 35% of its occurrences come from texts originally written in English, whereas 65% come from translated texts. This suggests a much greater tendency to use “already” in translated English than in English source texts. Portuguese learners of English, in their turn, also tend to use the English adverb “already” in situations in which it is not required. You can often hear them say Have you already had lunch? when what they mean is simply Have you had lunch?. In other words, they use already to ask whether or not lunch has taken place, without intending to convey the idea that it took place earlier than expected. This particular problem seems to stem from the fact that there is no grammatical difference between these two sentences when they are translated into Portuguese. The Portuguese adverb já (the literal equivalent of the English adverb already) would be used in both cases: Já almoçaste? Presenting Portuguese learners of English who overuse already with parallel concordances containing this adverb in translated English would not seem
such a good idea, for already appears a lot more frequently in translational English than in non-translational English. The concordances would certainly not help the learners in question develop a feeling for the situations in which already might be left out. Having said this, the fact that parallel concordances expose learners to translational language does not necessarily mean that they cannot be used constructively. In fact, parallel concordances can (and should) be used in such a way that the translational/non-translational language distinction is put to good use. If there happens to be a need to shelter learners from translational instances of the target language, one can restrict the L2 side of parallel concordances to source texts. This might be of consequence when parallel concordances are used to draw attention to elements that exist both in the L1 and the L2, but which occur more typically in only one of the languages, as in the case of the English adverb already and the Portuguese já. Table 6 illustrates how Portuguese-English TT→ST concordances can be used precisely to show Portuguese learners of English that they needn’t say already in English every time they mean já in Portuguese.3 Observing translations in the L2 side of the corpus can in turn be useful to help learners come to grips with L1 terms that are difficult to express in L2, or for which there are no straightforward L2 translations, such as culturally-bound concepts. Table 7 shows how Portuguese-English ST→TT concordances can be used to help Portuguese learners of English describe the Brazilian carnival in English. There are times, however, when distinguishing between source texts and translations is less important. When the aim of instruction is simply to draw attention to certain isolated morphological, syntactic and even lexical contrasts, TT→ST concordances can be just as helpful as ST→TT ones. Table 8 shows how both types can be used to focus on the contrastive use of prepositions in English and Portuguese.
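One rough way of deciding whether such sheltering is needed is to check how an L2 item’s occurrences split between source texts and translations before showing it to learners. The sketch below is a minimal illustration in Python: the hit labels are invented for the example (they do not reproduce the COMPARA figures cited above), and the function simply computes the translational share of a set of concordance hits.

    def translational_share(hit_roles):
        """Given a list of labels marking each L2 hit as coming from a source
        text ("ST") or a translation ("TT"), return the percentage of hits
        that are translational."""
        if not hit_roles:
            return 0.0
        tt_hits = sum(1 for role in hit_roles if role == "TT")
        return 100.0 * tt_hits / len(hit_roles)

    # Illustrative labels only: an item drawn mostly from translated text may be
    # a poor model of non-translational L2 usage, so the teacher might restrict
    # the query to source texts instead.
    already_hits = ["TT", "TT", "ST", "TT", "ST", "TT"]
    print(f"already: {translational_share(already_hits):.0f}% of hits are translational")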
3.3 Putting it all together

Navigating through a parallel corpus involves deciding whether an L1 or an L2 search term is to be used and deciding whether the search term in question is to be in translational or non-translational language, or a mix of both. The basic decision is the first one: in section 3.1 I argued that L1→L2 concordances
(based on L1 search terms) are best for promoting language production, and that L2→L1 concordances (based on L2 search terms) are more suitable for language reception. It is only after this decision has been made that one should worry about the translational/non-translational language distinction. In section 3.2 I argued that there are situations in which it is best to shelter learners from translational L2, situations in which translational L2 can be especially useful to learners, and situations in which the distinction between translational and non-translational L2 is not so important. Putting it all together, this means that if the distinction between translational and non-translational language is not an issue, then unidirectional and bi-directional parallel corpora can be used in either direction. However, should the need arise to shelter learners from translational L2, then unidirectional parallel corpora should be used in only one direction, which will depend on whether the learner’s L1 is the source text or the translation language of the corpus. The same applies to situations in which parallel concordances are used to deliberately expose learners to translational L2. In contrast, bi-directional corpora can be interrogated in any direction, provided only the part of the corpus which shelters learners from (or exposes them to) translational L2 is used.
4. Conclusion

In addition to the undeniable utility of parallel concordances in translation studies, translator education, the development of bilingual lexicography and machine translation, I have argued in this paper that there is also room for the use of parallel concordances in second language learning. However, I also hope to have made it clear that it is important to make conscious decisions on whether or not parallel concordances are called for, on whether to use L1 or L2 search terms, and on whether it is important to distinguish between translational and non-translational L2.
Notes

1. The term “engagement” is borrowed from Smith (1982:171), who defines it as “the way a learner and a demonstration come together on those occasions when learning takes place”.

2. The parallel concordances shown in this paper were taken from COMPARA 1.6. Online: http://www.linguateca.pt/COMPARA/ [visited: 9.7.2002]

3. The fact that the translational, Portuguese side of TT→ST concordances such as these may sound odd or unnatural to native speakers of Portuguese can even help Portuguese learners of English develop a better grasp of the differences between Portuguese and English.
References

Aston, G. 2001. “Learning with corpora: An overview”. In Learning with Corpora, G. Aston (ed.), 7–45. Houston: Athelstan.
Atkinson, D. 1987. “The mother tongue in the classroom: A neglected resource?”. English Language Teaching Journal 41 (4): 241–247.
Atkinson, D. 1993. Teaching Monolingual Classes. London: Longman.
Baker, M. 1996. “Corpus-based translation studies: The challenges that lie ahead”. In Terminology, LSP and Translation, H. Somers (ed.), 175–187. Amsterdam: John Benjamins.
Barlow, M. 2000. “Parallel texts in language teaching”. In Multilingual Corpora in Teaching and Research, S. Botley, T. McEnery and A. Wilson (eds.), 106–115. Amsterdam: Rodopi.
Cohen, A. 2001. “Mental and written translation strategies in ESL”. Paper presented at the 35th Annual TESOL Convention, St. Louis, February 27 – March 3, 2001.
Frankenberg-Garcia, A. 2000. “Using a translation corpus to teach English to native speakers of Portuguese”. Special issue on translation of Op. Cit., A Journal of Anglo-American Studies 3: 65–73.
Frankenberg-Garcia, A. and Pina, M.F. 1997. “Portuguese-English crosslinguistic influence”. Proceedings of XVIII Encontro da APEAA, Guarda, Portugal, 69–78.
Frankenberg-Garcia, A. and Santos, D. 2003. “Introducing COMPARA, the Portuguese-English parallel corpus”. In Corpora in Translator Education, F. Zanettin, S. Bernardini and D. Stewart (eds.), 71–87. Manchester: St. Jerome.
Gellerstam, M. 1996. “Translations as a source for cross-linguistic studies”. In Languages in Contrast: Papers from a Symposium on Text-based Crosslinguistic Studies, K. Aijmer, B. Altenberg and M. Johansson (eds.), 53–62. [Lund Studies in English 88]. Lund: Lund University Press.
Granger, S. and Tribble, C. 1996. “From CA to CIA and back: An integrated approach to computerized bilingual and learner corpora”. In Languages in Contrast: Papers from
a Symposium on Text-based Crosslinguistic Studies, K. Aijmer, B. Altenberg and M. Johansson (eds.), 37–51. [Lund Studies in English 88]. Lund: Lund University Press.
Howatt, T. 1984. A History of English Language Teaching. Oxford: Oxford University Press.
Johansson, S., Ebeling, J. and Oksefjell, S. 1999. English-Norwegian Parallel Corpus: Manual. Online: http://www.hf.uio.no/iba/prosjekt/ENPCmanual.html [visited: 11.12.2003]
Johansson, S. and Hofland, K. 2000. “The English-Norwegian Parallel Corpus: Current work and new directions”. In Multilingual Corpora in Teaching and Research, S. Botley, T. McEnery and A. Wilson (eds.), 106–115. Amsterdam: Rodopi.
Lado, R. 1957. Linguistics across Cultures. Ann Arbor: University of Michigan Press.
Lott, D. 1983. “Analysing and counteracting interference errors”. ELT Journal 37 (3): 256–261.
Medgyes, P. 1994. The Non-Native Teacher. London: Macmillan.
Odlin, T. 1989. Language Transfer: Cross-linguistic Influence in Language Learning. Cambridge: Cambridge University Press.
Phillipson, R. 1992. Linguistic Imperialism. Oxford: Oxford University Press.
Roussel, F. 1991. “Parallel concordances and tonic auxiliaries”. ELR Journal 4: 71–101.
Salkie, R. 1995. “INTERSECT: A parallel corpus project at Brighton University”. Computer and Texts 9: 4–5.
Seidlhofer, B. 2000. “Operationalizing intertextuality: Using learner corpora for learning”. In Rethinking Language Pedagogy from a Corpus Perspective, L. Burnard and T. McEnery (eds.), 207–223. Frankfurt am Main: Peter Lang.
Sharwood-Smith, M. 1994. Second Language Learning: Theoretical Foundations. London: Longman.
Smith, F. 1982. Writing and the Writer. London: Heinemann.
St. John, E. 2001. “A case for using a parallel corpus and concordancer for beginners of a foreign language”. Language Learning and Technology 5 (3): 185–203.
Tomasello, M. and Herron, C. 1988. “Down the garden path: Inducing and correcting overgeneralization errors in the foreign language classroom”. Applied Psycholinguistics 9: 237–246.
Tomasello, M. and Herron, C. 1989. “Feedback for language transfer errors: The garden path technique”. Studies in Second Language Acquisition 11: 385–395.
Wardhaugh, R. 1970. “The contrastive analysis hypothesis”. TESOL Quarterly 4: 123–130.
Zanettin, F. 2002. “CEXI: Designing an English translational corpus”. In Teaching and Learning by Doing Corpus Analysis, B. Kettemann and G. Marko (eds.), 329–344.
Corpora with learners
Examining native speakers’ and learners’ investigation of the same concordance data and its implications for classroom concordancing with EFL learners

Passapong Sripicharn
Thammasat University, Thailand
This paper reports a study designed to compare observations made and strategies employed by native speakers and learners when they work on the same concordance materials. Six Thai undergraduate students and six British undergraduate students were presented with three of the same concordance-based tasks. The students came separately to discuss the materials with the researcher and their discussions were tape-recorded. The results suggested that both native speakers and learners were able to handle the concordance-based tasks as they made useful and sophisticated generalizations. However, the non-native speakers seemed to show greater awareness of the context and adopted data-driven learning strategies such as forming and testing hypotheses against the data, whereas the native speakers tended to base their observations on their intuition and knowledge of the settings of the concordances, as well as cultural or pragmatic aspects associated with the citations. The paper also discusses some implications for classroom concordancing with EFL students in general.
1. Introduction

Since the introduction of classroom concordancing (Tribble and Jones 1990; Johns 1991), two main types of empirical study have been carried out to evaluate the use of concordances in the classroom. Quantitative-oriented research seeks to measure the effect of classroom concordancing on the learning of particular language features (Stevens 1991; Cobb 1999) and on learners’ self-correction (Someya 2000; Watson Todd 2001), and also attempts to compare the
effectiveness of classroom concordancing and other conventional approaches (Gan et al. 1996). Qualitative-oriented studies (e.g., Turnbull and Burston 1998; Bernardini 2000; Kennedy and Miceli 2001) aim to assess the performance of learners in handling concordance-based tasks and corpus investigation, i.e., what kind of generalization learners make from the concordance data, or what difficulty they have when doing concordance tasks. This paper will focus on the second type of research, which looks into the learning process and strategies used during concordance-based activities. Bernardini (2000), for example, worked with 6 Italian students in a translation seminar and examined the learners’ experiences of independent corpus investigation. She found that the learners successfully adopted cognitive and linguistic strategies to interpret and generalize their findings, particularly in observing collocation and semantic prosodies of a search word. In a similar study, Turnbull and Burston (1998) carried out a longitudinal case study of concordancing strategies used for the investigation of self-selected concordances. The results suggested that learners experienced varying levels of success with concordancing strategies, depending on cognitive style and motivation. While most studies have focussed on observations and strategies made by second language learners, little research has been conducted to look into the way native speakers or learners with near-native competence work with concordance data. This is probably because such research reveals very little about the effectiveness of classroom concordancing in helping learners to improve their language proficiency. However, this paper argues that we may gain more insights into classroom concordancing if we look at how similarly or differently learners and native speakers work with concordance data, in particular the strategies they use to process the data and come up with generalizations.
2. The study

A small-scale study was conducted using native speaker data to assess the performance of a group of Thai students in handling concordance-based materials. This section will describe (i) the subjects and the tasks used in the study, and (ii) the experiment itself.
2.1 The subjects

Six Thai undergraduate students and six British undergraduate students took part in this experiment. The Thai students were studying at Thammasat University, Bangkok, Thailand. Four of them were from the English Department and two from the Linguistics Department. Although a specific language proficiency test was not carried out before the study, a university English placement test and the results of the learners’ previous English courses indicated that the students were at an upper-intermediate level. The native speakers were British students on the BA program at the Faculty of Arts, University of Birmingham, UK, all specializing in English Literature. While the Thai students had been introduced to the notion of classroom concordancing and had been working on concordance tasks prior to the study, the British students were only given a brief introduction to concordancing on the spot as they went through all the exercises during the study.
2.2 The tasks

The native-speaker students and the EFL learners were asked to do three identical concordance-based tasks. The teaching units were designed by the researcher, and the concordance lines were retrieved from the Bank of English Corpus. The first task (the “conduct” task: see Figure 1) focuses on different collocates of the verbs conduct and perform, which are translated as the same word in Thai. Concordance lines for conduct and perform were given and the participants were first asked to generalize the similarities between the immediate right collocates of the key word in each set (e.g., survey, experiment, research, test in the “conduct” group; and job, role, duty, concert in the “perform” group), and then notice the differences between the collocates across the sets. In the second task (the “suggest” task: see Figure 2), two sets of concordance lines were provided, with the verbs suggest, recommend and propose as key words. The difference is that in the first set the verbs follow the pattern “verb + that + subject + should + verb”, such as “He also suggested that the organization should adopt new responsibilities outside Europe”, whereas the word should is not found in the sentences in the second set, such as “The product wrapping does suggest that it be eaten at breakfast and lunch, along with fruit”. The participants were invited to look into this particular grammar structure,
What are the differences between the activities “conducted” and the activities “performed”?

Set 1 “conduct”
decided to turn the tables and conducted a survey to find out whether
made it easier for many people to conduct experiments on animals,
skills and techniques necessary to conduct research in the social sciences
to register the students. To conduct the examination of the students
for council. Tests have been conducted which indicate there is potential
He says the investigation was conducted because the expanding market for
According to a nation-wide poll conducted by a Los Angeles firm, 77 per
Rates. According to the study, conducted by the Building Owners and

Set 2 “perform”
was reached he told us: INK performed a very professional job. They
Jos Hubers who organised the event performed a fantastic role as interpreter,
defined as the inability to perform all duties of the insured’s own
a chimpanzee to perform open heart surgery.
the model, by requiring the Cray to perform some operation that overtaxes
They contacted me and I agreed to perform a professional evaluation of
the world. In 1974, the group performed a farewell concert, but seven
in the war with music and drama performed in the evenings. But in

Figure 1. The “conduct” task
which is referred to as the present subjunctive (the use of the infinitive without to in all persons) in standard grammar (e.g. Quirk et al. 1985). Unlike the first two tasks, which are discussion-based, the final task (the “hazardous” task: see Figure 3) takes the form of a test or exercise. Concordances for adjectives with negative meanings such as hectic, tedious, horrible and hazardous were presented. The participants were asked to read the contexts carefully to guess the meanings of the key words, and puzzle out the missing collocate in the last line of each set of concordances. The participants were asked to come to see the researcher separately. As part of a study on the evaluation of classroom concordancing (Sripicharn 2002), the one-to-one discussions with the Thai students were carried out in Thailand from November 1999 to March 2000, while the interviews with the native-speaker students were conducted in Britain in January 2001. In each meeting, which lasted about 30 minutes, the participant discussed one of the three teaching units with the researcher. The discussions were tape-recorded for the purpose of analysis. Two main features of the discussions were closely examined. First, comparisons were drawn between observations made by the
What are the differences between sentences in Set A and Set B?

Set A
1. He also suggested that the organisation should adopt new responsibilities outside Europe.
2. Her doctor suggested that she should be admitted to a psychiatric clinic.
3. Mr Delors is intending to propose that the final treaty should set out a new way of fixing the EC budget.
4. A policy document also proposes that passenger services should be run by regional private companies.
5. The American Heart Foundation recommends that less than 10 per cent of our daily calorie intake should be from fat.
6. Professor Joseph Belth recommends that a company should receive a high rating from at least three of five rating agencies.

Set B
1. The product wrapping does suggest that it be eaten at breakfast and lunch, along with fruit.
2. She suggested that Mrs Haston find another job.
3. The Clinton administration has proposed that the IMF make more money available to Russia.
4. When congressional conservatives proposed that married women be banned from working, she decried the inequality of such a notion.
5. I would therefore recommend that you arrange to take your puppy home with you when he is about 15 weeks old.
6. It’s reported to have recommended that passengers be warned when there is a bomb threat to their flight.

Figure 2. The “suggest” task
Guess the meaning of the key word, and find the missing noun. Also give reasons for your answers.

with waste water contaminated by hazardous chemicals, and processes and
to raise an alarm if particularly hazardous gases breach pre-set danger
joined in a campaign to get rid of hazardous waste and unwanted chemicals.
illegal for under-18s to work with hazardous substances such as oxides of
the movement of waste, especially hazardous __________, the park could be made

The closest meaning of “hazardous”: a) dirty  b) scientific  c) dangerous  d) safe
The missing noun: a) materials  b) experiments  c) factories  d) diseases

Figure 3. The “hazardous” task
Thai students and the native-speaker students. Second, strategies used by the participants for interpreting the concordance data were analyzed.
3. Results

3.1 Observations made by the two groups

The results showed that the observations made by the two groups were not dissimilar. Due to space limitations, only findings from the first task (the “conduct” task) will be discussed here. For example, both groups made similar generalizations regarding the differences between nouns following conduct and nouns following perform. The Thai learners suggested that people conduct something quite serious and formal, and that the nouns are associated with academic work of some sort such as research or study. The nouns after perform are less serious and are concerned with work or jobs, or some kind of action or movement. The native speakers also pointed out that conducting is fairly formal and abstract. The word conduct is used more often in written English (e.g., in reports), whereas performance is less formal and seems to be associated with activities or entertainment rather than something intellectual. Another observation along these lines was that the verb perform has strong associations with the speaker’s own actions. For example, one of the Thai students said that perform is “something to do with ourselves, something coming from ourselves”. Similar observations from the native speaker group included “When you conduct a survey, experiments, or research, you are discovering things about other people, whereas if you perform something you yourself are doing it”. The subject of conduct and perform was another feature that both groups commented on. Some of the Thai students noticed that when reading concordances we often know “who” performs something, whereas the subject of the verb conduct is sometimes omitted, i.e., it appears in passive constructions. In the British group, one native speaker noticed that a lot of people can perform something together, whereas an activity seems to be conducted by an individual or a body. Despite the two groups making similar observations, it should be noted that at times the non-native-speaker students commented on some features that the native speakers were not aware of. For example, one of the learners noticed
that when we conduct something, there is a process or procedure involved. The activities conducted have a kind of time frame. Similarly, the native speakers also came up with generalizations that the Thai students failed to propose, e.g. that the verb carry out could replace the key word in all citations in the “conduct” group and in most sentences in the “perform” group, except the last two lines where perform means performance in front of an audience.
3.2 Strategies used for observing the concordance data

3.2.1 Typical strategies used by the non-native-speaker students

a. Spotting context clues and making a generalization
The results showed that the Thai students were capable of using context clues to eliminate wrong choices and choose the correct answers. In the “hazardous” task, for example (see Extract 1), the student in question (S4) chose dangerous as the closest meaning of hazardous. Having had to decide between dirty and dangerous, the student picked out the clues alarm and danger and suggested that hazardous substances are not just dirty but could be harmful.

(1)
T: Now let’s do the last set. What should be the closest meaning of “hazardous”? What are the clues?
S4: “alarm”, “danger”, “waste”, “chemicals” … What does “contaminated” mean?
T: To contaminate something means to make it dirty or poisonous, for example, the water in the river has been contaminated. So if you have “contaminated” in the context, the choice “safe” is not likely, is it?
S4: The answer is “dangerous”?
T: Well, it can be either “dangerous” or “dirty”. Why did you choose “dangerous” instead of “dirty”?
S4: Like in line two, we have “alarm” and “danger”, so I think it’s not just “dirty”. If it’s just “dirty”, maybe it’s not “a danger”.
b. Forming and testing hypotheses
Another typical strategy used by the learners was hypothesis forming and testing. The following is an extract from a discussion on the “suggest” task. After
reading the concordances, one of the students formed at least two hypotheses, which were (a) the verb propose is used with inanimate subjects; and (b) the verb recommend is used with animate subjects. The student then tested the hypotheses against more concordance data (or in fact by reading the data more carefully). By doing so, she noticed some negative evidence, and finally rejected her original hypotheses.

(2)
S2: I’m still not sure. I was looking at the function or parts of speech [and expected that the words may belong to a different word class], but in both sets, these words are similarly used as verbs.
T: Yes, they are all used as verbs.
S2: I was also looking at the way they are used in different context. At first, I noticed that “propose” is used with inanimate subjects; for example, “the Clinton administration” is inanimate, but the human subject is also used in “Mr Delors is intending to propose that…” The verb “recommend” seems to have animate subjects as well, as in “I would therefore recommend that…” or “Professor Joseph Belth recommends that…”. But again, we have “The American Heart Foundation”, which is inanimate.
3.2.2 Strategies used by the native-speaker students
In general, the native speakers seemed not to employ data-driven strategies when they dealt with concordance data. The interview data showed that they based their observations on three resources.

c. Using intuitive knowledge
While the non-native students mainly used “data-driven” strategies to generalize the concordance data, the six native speakers drew little from the context and the concordances. Instead the native speakers often used their intuition and language experience to answer the questions. For example, they were more confident in pointing out whether two words or phrases typically collocate, as illustrated in Extract 3. In Extract 4, one of the British students referred to such intuitions as a “database in my mind”.

(3) “You can say a study has been done but you can’t say a poll has been done. I don’t know why. “Experiments” can be done, carried out, or conducted”.
(4) “People hear patterns of words such as “hazardous waste”, and they sit together as a unit and you tend to say “hazardous waste” rather than “dangerous waste”. I spent 22 years listening to people speaking in English, so it would be easy for me to identify with the database in my mind what words sound right together”.
d. Generalizing beyond the concordance data
Another “non-data-driven” strategy used by the native speakers was making a generalization beyond the concordance data. For example, the native speakers showed that they were capable of contextualizing the concordance lines, which can be difficult for non-native speakers. That is, the native speakers were able to identify the “setting” of the concordances and the text type from which the concordances were taken. For instance, in Extract 5, they intuitively knew that the first concordance line was probably taken from a text about a water-processing plant.

(5) “The first line would be either in water-processing plants or something like that. The second line would be in some kind of factory where you have to control hazardous gases”
Some generalizations were based on the speaker’s own experience of the language, the kind of information not explicitly provided in the concordances. This can be seen in comments concerning cultural or pragmatic aspects of the target language. For example, when making a point that the word should adds “willingness” or “push”, a native speaker also talked about the culture and her own experience of language use (see Extract 6).

(6) “English people are very sensitive about being pushed about. So I would resent it automatically if my mum said “You should do the washing up” because it implies that I never do the washing up”, so you should say “Is it ok if you do the washing up?”, or something like that”.
e. Questioning the data and identifying exceptions
Another difference in terms of strategies used is that while the Thai students regarded the concordances as “unquestionable” evidence, the native speakers sometimes questioned the given concordances and had the ability to use negative evidence from intuition to balance the positive evidence from the concordance data. For example, in Extract 7, a native speaker argued that it is
acceptable to say “perform experiments on animals” as opposed to the given citation “conduct experiments on animals”. In another example (Extract 8), another native speaker suggested that one of the concordance lines sounded awkward to her and that she would use the word recommend instead of suggest in that particular context.

(7) “You can also say “perform” experiments on animals”

(8) ““Her doctor suggested that she should be admitted to psychiatric clinic” doesn’t sound right to me. I’d use “recommended” rather than “suggested” because the doctor has the qualification so he was not making a suggestion, but a recommendation.”
4. Discussion and implications

There are a few points arising from the findings. First, the results reported in Section 3 show that the Thai students were capable of handling the concordance-based tasks. They were able to observe subtle similarities and differences, and made as useful and sophisticated generalizations as those made by native speakers who obviously have much higher language proficiency. The findings also support observations made by Johns (1991) that some of the learners’ answers or explanations are unexpected and are even more sophisticated than those provided by the teacher (or native speakers in this case). These unexpected answers underline a distinctive aspect of data-driven learning (DDL), that is, “the data is primary” and that the learners “often notice things that are unknown not only to the teacher, but also to the standard works of reference on the language” (Johns, 1991:3). The next point is that there were big differences between the strategies used by the two groups. The non-native-speaker students clearly adopted “data-driven” strategies in the way they formed hypotheses, searched for context clues, and tested the hypotheses. By contrast, when presented with concordance-based tasks, the native speakers seemed to ignore the concordance data and instead used their intuitive knowledge of the language to answer the questions. One possible explanation for this is that the learners had been trained to observe concordance data prior to the study, while the native speakers seemed to treat concordance data as part of the questions and thus ignored them when they gave the answers.
Another interesting finding was that intuition allowed the native speakers to comment on various aspects of the language that the concordances cannot fully reveal. These include the context and settings of the concordances, and cultural or pragmatic aspects associated with the citations. The native speakers also had the ability and confidence to question the concordances and to point out negative evidence to balance positive evidence presented in the concordance data. The study has at least two implications for classroom concordancing with second language learners. First of all, it suggested that not all students will use data-driven strategies when they handle concordance-based tasks. The findings showed that concordance tasks seemed to be more appropriate for language learners than for the native speakers (who in this study were not linguists but students of literature), and that the native speakers felt their intuition was sufficient to cope with the task, and therefore did not resort much to analytical skills. This is why it is important to point out to learners, especially those who have high language proficiency or have native-like language competence, that they may use concordance data as a basis for their generalizations or at least to test their intuition against the authentic data. Another implication is that it was clear that the data presented in the materials were limited, and there seems to be a danger of over-generalization on the part of learners, partly because they do not have what one of the native interviewees called the “database in their mind”, or native speaker intuition to balance positive evidence presented in the concordance lines with negative evidence such as overlapping cases or exceptions. The study also showed that concordance lines do not always provide all the data needed for generalization, and that intuition can be helpful for deducing or extracting some kinds of information such as pragmatic, cultural, and discourse issues, information which is otherwise difficult to obtain from the given concordances. One way of dealing with such problems is to make sure that the concordances chosen to write the materials represent typical use and patterns of the language rather than being deliberately manipulated to represent what is easy for learners to notice or to illustrate particular language points. Another solution is to promote collaboration between the teacher and the students, e.g. in the form of a one-to-one consultation. For example, the teacher may step in and help where there is a potential danger of over-generalization by giving comments or pointing out exceptions or “possible-but-not-typical cases”.
Where appropriate, the teacher may also comment on other aspects of the language that are not easily observed from the concordances, such as cultural and pragmatic issues.
5. Conclusion

This paper has discussed a small-scale study carried out to compare native speakers’ and learners’ performance on concordance-based tasks. The results showed that both groups were capable of handling the concordance tasks as they came up with a number of useful and sophisticated observations. The learners showed more sensitivity to context clues and more often used data-driven strategies to make generalizations, although such results were not surprising because the native speakers were not language learners and were not trained to read and interpret concordance data. The data gathered from the native speakers working on the same concordance materials provided some insights into classroom concordancing with EFL students in general. As the study suggested, working on concordance materials without balancing negative evidence with evidence presented (rather limitedly) in the materials may cause a danger of over-generalization. Collaboration with a teacher or a native speaker may partly solve the problem of over-generalization and may help learners obtain some “beyond-the-citation” information so concordances are not only used to draw attention to a particular language feature but also as an aid to the learning of the target language in general.
References

Bernardini, S. 2000. Competence, Capacity, Corpora: A Study in Corpus-aided Language Learning. Bologna: CLUEB.
Cobb, T. 1999. “Breadth and depth of lexical acquisition with hands-on concordancing”. Computer Assisted Language Learning 12(4): 345–360.
Gan, S., Low, F. and Yaakub, N. 1996. “Modeling teaching with a computer-based concordancer in a TESL preservice teacher education program”. Journal of Computing in Teaching Education 12: 28–32.
Johns, T.F. 1991. “Should you be persuaded – Two samples of data-driven learning materials”. In Johns and King, 1–16.
Johns, T.F. and King, P. (eds) 1991. Classroom Concordancing: English Language Research Journal 4. Birmingham: The University of Birmingham.
Kennedy, C. and Miceli, T. 2001. “An evaluation of intermediate students’ approaches to corpus investigation”. Language Learning and Technology 5(3):77–90. Online: http://llt.msu.edu/vol5num3/kennedy/default.html [visited: 27.4.2004]
Quirk, R., Greenbaum, S., Leech, G. and Svartvik, J. 1985. A Comprehensive Grammar of the English Language. London: Longman.
Someya, Y. 2000. “Online business letter corpus KWIC concordancer and an experiment in data-driven learning/writing”. Paper presented at the 3rd ABC International Conference held at Doshisha University, Kyoto, Japan, August 9, 2000. Online: http://www.cl.aoyama.ac.jp/~someya/ [visited: 27.4.2004]
Sripicharn, P. 2002. Evaluating Data-driven Learning: The Use of Classroom Concordancing by Thai Learners of English. Unpublished Thesis. The University of Birmingham.
Stevens, V. 1991. “Concordance-based vocabulary exercises: A viable alternative to gap-fillers”. In Johns and King, 47–62.
Tribble, C. and Jones, G. 1990. Concordances in the Classroom. London: Longman. (Reprint, Houston: Athelstan)
Turnbull, J. and Burston, J. 1998. “Towards independent concordance work for students: Lessons from a case study”. ON-CALL 12(2). Online: http://www.cltr.uq.edu.au/oncall/turnbull122.html [visited: 27.4.2004]
Watson Todd, R. 2001. “Induction from self-selected concordances and self-correction”. System 29:91–102.
Some lessons students learn: self-discovery and corpora
Pascual Pérez-Paredes and Pascual Cantos-Gómez
Universidad de Murcia, Spain
As a spin-off to the process of compilation of the Spanish component of the Louvain International Database of Spoken Language Interlanguage (LINDSEI) corpus, advanced learners of English as a Foreign Language (EFL) at the University of Murcia have been given the chance to explore their own speaking output through a corpus-aided methodology. This paper presents a glimpse of the materials which have guided our students through the process of self-discovery of their own spoken discourse. Similarly, special attention is devoted to the learning principles underlying our form-focused approach whose target is students of English in their third and final year. Preliminary evidence in the form of face-to-face feedback about the reception of the protocol among the students who completed it suggests that form-focused attention facilitates noticing of often-unnoticed features of learner oral discourse.
1. Introduction

The type of work we propose here falls under the browsing category outlined by Aston (1997) and expanded by Bernardini (2000), where the corpus becomes a source of activity in itself and progressive discovery occurs on a negotiable step-by-step basis. This paper presents a glimpse of the materials which have guided our students through the process of self-discovery of their own spoken discourse, with particular attention devoted to the learning principles underlying our form-focused approach. The materials, as well as the rationale for the students' work, can be accessed on the Intranet of the UMU (University of Murcia) Language Laboratory, and further information on the course can be obtained from http://www.um.es/engphil/profesorado/pperez/reglada/curso2002-2003/lengua3.
The target group of our proposal is students graduating in English, in their third year of studies at the University of Murcia, Spain. The activities described below are carried out in the modern languages laboratory during the regular lesson schedule. The teacher supervises and directs the students' work, explaining the essentials of protocol completion and giving help and assistance on demand. By confronting learners with linguistic evidence, we aim to develop in our students a critical attitude towards the patterns of L2 use in their own discourse. The protocol is, for the most part, a collection of HTML files which exploit the Intranet and Internet resources available in the lab and which are aimed at facilitating the students' exploration of their own pre-recorded oral output, both as digitalized audio files and as text files.
2. Learner autonomy, self-discovery and corpus-based work: The protocol for oral discourse appreciation

This research aims to integrate the theoretical principles of Data-Driven Learning with classroom-based awareness-raising activities, with a focus on noticing acquisition through autonomous work on L2 production. The Protocol (see Appendix 1) is divided into three main sections, each of which corresponds to a different cognitive process/task. It is intended as a quantitative as well as a qualitative approach to students' oral performance, focusing on their vocabulary selection. Due to the brevity of the instances of their oral production, and to the medium through which students access them (transcription), we have not considered other linguistic levels such as pronunciation or discourse features. Part I is devoted to a first, preliminary quantitative analysis of the lexicon used by students in their oral output. To facilitate the task, we elaborated a concise, easy-to-complete spreadsheet, which can be implemented in Microsoft Excel® or any similar program. The spreadsheet contains detailed information gaps that need to be completed and/or calculated by each student. We previously processed all the oral productions and obtained some group mean data relevant to our approach (see Appendix II). This basic but essential information on tokens, types, content words, ratios, frequency bands, etc., in the form of means, maximum and minimum scores, and standard deviations, was incorporated into the spreadsheet (see Note 1).
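The following minimal sketch, which is not the authors' spreadsheet or protocol, illustrates how the Part I indices (tokens, types, content words, the two ratios and the frequency bands) could be computed from a transcribed oral production. The stop-word list and all names are illustrative assumptions.

```python
# Sketch only: computing Part I-style lexical indices from one transcript.
from collections import Counter

STOPWORDS = {"the", "and", "she", "is", "in", "a", "to", "her", "i", "of", "it"}  # assumed function-word list

def part_one_indices(transcript: str) -> dict:
    tokens = [w.lower() for w in transcript.split() if w.isalpha()]
    freq = Counter(tokens)
    types = list(freq)
    content = [w for w in tokens if w not in STOPWORDS]
    return {
        "tokens": len(tokens),
        "types": len(types),
        "content_words": len(content),
        "token_type_ratio": len(tokens) / len(types) if types else 0.0,
        "token_content_ratio": len(tokens) / len(content) if content else 0.0,
        # frequency bands of the kind used in the spreadsheet
        "band_gt_10": sum(1 for t in types if freq[t] > 10),
        "band_5_10": sum(1 for t in types if 5 <= freq[t] <= 10),
        "band_2_4": sum(1 for t in types if 2 <= freq[t] <= 4),
        "band_1": sum(1 for t in types if freq[t] == 1),
    }

print(part_one_indices("the woman in the picture seems to like the painter"))
```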
Based on the information presented, students were then asked to explore and obtain objective data on their performance and to compare it with the group mean. At this stage, students achieved, for the first time, real awareness of their performance by comparing it with the group's performance. Part II is much more specific in this sense, and tries to guide students towards reflection and self-discovery by using the quantitative data of Part I and inviting them to think about their strengths and weaknesses regarding vocabulary selection. At this point, students were in a position to make a more grounded and objective evaluation of their own oral performance. Finally, in Part III, we tried to encourage students to address the weaknesses they had discovered, offering them a range of remedial tasks/exercises manipulating reference corpora. The Protocol's goals and the students' tasks are summarized here:

Part I
Goal: Analysis of students' own oral production
Students' task: Manipulate their own transcribed oral production, calculate several indices and compare them with the group scores

Part II
Goal: Reflection on students' own oral productions
Students' task: Using the data of Part I, self-discovery of one's own strengths/weaknesses

Part III
Goal: Supply remedial material (reference corpora) to address the weaknesses discovered
Students' task: Manipulate corpora and extract specific data/instances (not used previously in their oral productions)

Apart from the students' own exploration and findings regarding their oral production, the information provided by the oral transcriptions is valuable for teachers too. For instance, teachers can easily and quickly compare groups (inter-group analysis), compare a single group's performance through the academic year (diachronic intra-group analysis) or investigate the group's performance, focusing on particular grouping behaviour (intra-grouping). We analyzed the groupings by means of a statistical technique known as cluster analysis.
Figure 1. Cluster analysis
Cluster analysis actually encompasses a number of different classification algorithms. A general question facing researchers in many areas of inquiry is how to organize observed data into meaningful structures, that is, how to develop taxonomies. Note how in this classification (Figure 1), the higher the level of aggregation, the less similar the members of the respective class are. Student 1 has more in common with students 11, 16, 5 and 13 than he/she does with the more “distant” members (e.g., students 3, 17 and 22), etc. The cluster analysis shows, for example, that the group we investigated consisted of two clearly different subgroups: subgroup 1 (students 10 and 19, who formed a group of their own) and subgroup 2 (all the rest). If we analyze the quantitative data for these two students, we immediately realize that their vocabulary scores are above the group means. This indicates that they are/were likely to be more proficient in English than their classmates. A further analysis of subgroup 2 reveals that these students fall into two different sub-subgroups (S-SG1: students 1, 11, 16, 5, 13, 3, 17, 22, 23, 24, 2, 15 and 6; S-SG2: students 9, 25, 4, 14, 18, 8, 20, 7, 12, 21, 10 and 19). Again, this indicates that S-SG2 students performed lexically better than those in S-SG1.
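For readers unfamiliar with the technique, the sketch below shows, with invented scores, the kind of hierarchical cluster analysis summarized in Figure 1. The paper does not specify the algorithm, distance metric or software actually used, so the choices here are assumptions.

```python
# Illustrative sketch only: hierarchical clustering of students' lexical scores.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# rows = students, columns = indices such as tokens, types, content words (invented)
scores = np.array([
    [160, 69, 38],
    [158, 67, 37],
    [250, 110, 65],   # a lexically richer student
    [255, 112, 68],   # another richer student
    [120, 55, 30],
])

Z = linkage(scores, method="average", metric="euclidean")  # build the dendrogram
print(fcluster(Z, t=2, criterion="maxclust"))  # e.g. two subgroups, as in Figure 1
```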
3. Remarks

For operational purposes, we have drawn on Nunan's (1995) conceptualization of learner autonomy in terms of a continuum, where students of a foreign language are gradually introduced to ways of expanding their capacity to exercise control over their own learning. In very constructive terms, autonomous learners are able to self-determine the overall direction of their learning, become actively involved in the management of the learning process and exercise freedom of choice in relation to learning resources and activities (Nunan 2000). An exclusive, all-or-nothing concept of student autonomy is not even considered in our approach, as it fails to reflect the context of learning where the experience reported in this paper actually took place. Besides, a great deal of self-direction is apparent in the rationale we present here. Different authors have claimed that explicit attention to form can facilitate second language learning (DeKeyser 1998, Norris and Ortega 2000). Within a cognitive perspective on language learning, it has been pointed out that noticing, that is noting, observing or paying special attention to a particular language item, is generally a prerequisite for learning (Schmidt 1990, 1993, Robinson 1995, Skehan 2001). In a similar way, it has been argued that acquisition of the noticed form is more likely to take place in higher proficiency learners than in lower proficiency ones. Williams' (2001) study shows that proficiency “seems to provide increasing returns: not only do the more advanced learners generate more language-related episodes (LREs), they also use this information more effectively” (2001:336). It seems that lower proficiency, in this context, prevents learners from a more thorough integration of the new input generated during the LREs mentioned above. Williams' study points to the need for (1) the integration in a variety of ways of attention-to-form procedures, (2) the important role that teachers play in providing negative and positive linguistic evidence and in calling students' attention to it and, finally, (3) a learner-centred approach broadly understood in terms of responding to the needs of the learner (2001:338). In a recent experiment, Nassaji (2000) adopted a form-focused methodology in an integrative approach to L2 learning where, through design and process, communicative activities are given a form-focused plus. In a way, Nassaji's approach pursues the whole 5-step process of autonomy (Nunan 1995). Curiously enough, our protocol, albeit clearly form-focused, and, accordingly, not strictly communicative as an activity in the interactive paradigm
(Savignon 1983) fashion, aims at Nunan's stage 5, where learners become researchers, at least during protocol completion.

Work is based on multimedia digital files of the type proposed by Pérez-Paredes (2003). For the protocol to be implemented, students' oral productions first need to have been recorded and transcribed, and both types of information given digital format – .wav, .ram or similar for the audio and .txt for the text. Obviously, this is not the place to discuss technicalities, as the different options available in terms of digitization will be largely dependent on the lab or computer facility available. Leech (1997) has stated that it is unwise to use corpora as a bandwagon. The analogy goes like this: “Teaching bandwagons, if driven too far and too fast, can do much harm to those on the receiving end”. Although the use of our protocol is a demanding proposal in terms of the infrastructure needed for its implementation and the skills required from teachers and students operating potentially complex software, we find that it is a powerful way to stimulate an exploratory attitude in the students. This intuition is aimed at encouraging learners completing the protocol to observe language phenomena on different levels, and to undertake independent learning journeys and serendipitous scrutiny of their own discourse as a starting point for noticing behaviour that is not present in most teaching situations, especially the self-exploration of one's own oral output.

Preliminary evidence in the form of face-to-face feedback about the reception of the protocol among the students who completed it suggests that form-focused attention facilitates noticing of usually unnoticed features of learner oral discourse, including lexical density, level of constituency (especially in Noun Phrases), prosodic features and cohesion features, as well as general perception of segmental and suprasegmental characteristics of L2 output. Interviews were carried out before and after work with the protocol. Students were invited to provide the interviewer with feedback on particular areas where the learners thought new insights had been gained in terms of language awareness. These areas included, in order of relevance, lack of appropriate vocabulary range according to academic level, absence of a relevant presence of stance adverbials in their discourse, very low levels of pre- and post-modification in Noun Phrases, as well as absence of collocations in both Noun and Adjective Phrases.

The limitations of our work should also be considered. On the research level, a longitudinal study of repeated applications of this protocol would be
necessary in order to assess the validity of these claims. On the pedagogical level, we shall turn to Stevens (1990:8) for a final observation:

Although text manipulation is conveniently implemented and consistent with current language learning pedagogy, its benefits are difficult to intuit; hence the genre is easily misunderstood.
Note

1. We also included a reference on text difficulty. Intuitively, this could be defined as the difficulty a person has in understanding a particular oral output. Of course, this measure is very limited here, as we have not considered some linguistic features such as pronunciation, sentence complexity, etc. The measure used is the Fog-Index (Alderson and Urquhart 1986): R = 0.4 (k + j), where k = the percentage of words with three or more syllables and j = the mean sentence length (R coefficient: 0–12 easy; 13–16 undergraduate; >16 postgraduate).
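A rough sketch of the Fog-Index formula given in the note follows; the syllable-counting heuristic is an assumption of this illustration, not the measure used by the authors.

```python
# Sketch of R = 0.4 * (k + j) from Note 1, with a crude syllable heuristic.
import re

def count_syllables(word: str) -> int:
    # very rough: count vowel groups, at least one syllable per word
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fog_index(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    k = 100 * sum(1 for w in words if count_syllables(w) >= 3) / len(words)
    j = len(words) / len(sentences)  # mean sentence length
    return 0.4 * (k + j)

print(round(fog_index("The woman in the picture seems to be a painter. She is painting a portrait."), 1))
```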
References

Alderson, C. and Urquhart, H. 1986. Reading in a Foreign Language. London: Longman.
Aston, G. 1997. “Small and large corpora in language learning”. In PALC’97: Practical Applications in Language Corpora, B. Lewandowska-Tomaszczyk and P.J. Melia (eds), 51–62. Łódź: Łódź University Press.
Bernardini, S. 2000. “Systematizing serendipity: Proposals for concordancing large corpora with learners”. In Rethinking Language Pedagogy from a Corpus Perspective, L. Burnard and T. McEnery (eds), 225–234. Frankfurt am Main: Peter Lang.
DeKeyser, R.M. 1998. “Beyond focus on form: Cognitive perspective on learning and practical second language grammar”. In Focus on Form in Classroom Second Language Acquisition, C. Doughty and J. Williams (eds), 42–63. Cambridge: Cambridge University Press.
Leech, G. 1997. “Teaching and language corpora: A convergence”. In Teaching and Language Corpora, A. Wichmann, S. Fligelstone, A. McEnery and G. Knowles (eds), 1–23. Harlow: Longman.
Nassaji, H. 2000. “Towards integrating form-focused instruction and communicative interaction in the second language classroom: Some pedagogical possibilities”. The Modern Language Journal 84(2):241–250.
Norris, J.M. and Ortega, L. 2000. “Effectiveness of L2 instruction: A research synthesis and quantitative meta-analysis”. Language Learning 50(3):417–528.
Nunan, D. 1995. “Closing the gap between learning and instruction”. TESOL Quarterly 29(1):133–158.
Nunan, D. 2000. “Autonomy in language learning”. Plenary presentation, ASOCOPI 2000, Cartagena, Colombia, October 2000.
Pérez-Paredes, P. 2003. “Integrating networked learner oral corpora into foreign language instruction”. In Extending the Scope of Corpus-based Research: New Applications, New Challenges, S. Granger and S. Petch-Tyson (eds), 249–261. Amsterdam: Rodopi.
Robinson, P. 1995. “Attention, memory, and the noticing hypothesis”. Language Learning 45:283–331.
Savignon, S.J. 1983. Communicative Competence: Theory and Classroom Practice. Reading, MA: Addison-Wesley.
Schmidt, R. 1990. “The role of consciousness in second language learning”. Applied Linguistics 11(2):129–158.
Schmidt, R. 1993. “Awareness and second language acquisition”. Annual Review of Applied Linguistics 13:206–226.
Skehan, P. 2001. “The role of a focus on form during task-based instruction”. In Trabajos en lingüística aplicada, C. Muñoz, M.L. Celaya, M. Fernández-Villanueva, T. Navés, O. Strunk and E. Tragant (eds), 11–24. Barcelona: Univerbook SL.
Stevens, V. 1990. “Text manipulation: What’s wrong with it, anyway?”. CAELL Journal 1(2):5–8.
Williams, J. 2001. “The effectiveness of spontaneous attention to form”. System 29:325–340.
Appendix 1

Protocol

Part I (the Personal and Diff. columns are completed by each student)

                                          Personal    Group      Diff.
Total tokens used                                     159.84
Total types used                                       68.76
Total content words used                               38.12
Token-type ratio                                        2.3135
Token-content word ratio                                4.2022
Types with frequency > 10 used                          1.52
Types with frequency 5–10 used                          6.72
Types with frequency 2–4 used                          20.40
Types with frequency = 1 used                          40.12
Top-type 1 (‘the’) used                                12.88
Top-type 2 (‘and’) used                                 7.44
Top-type 3 (‘she’) used                                 6.96
Top-type 4 (‘is’) used                                  6.84
Top-type 5 (‘picture’) used                             5.12
Top-type 6 (‘in’) used                                  4.76
Top-type 7 (‘a’) used                                   4.60
Top-type 8 (‘to’) used                                  4.56
Top-type 9 (‘her’) used                                 3.44
Top-type 10 (‘I’) used                                  3.12
Total top-types used                                   59.72
Top-content word 1 (‘picture’) used                     5.12
Top-content word 2 (‘woman’) used                       2.96
Top-content word 3 (‘like’) used                        2.12
Top-content word 4 (‘think’) used                       1.76
Top-content word 5 (‘painter’) used                     1.12
Top-content word 6 (‘seems’) used                       0.88
Top-content word 7 (‘friends’) used                     0.84
Top-content word 8 (‘man’) used                         0.76
Top-content word 9 (‘painting’) used                    0.80
Top-content word 10 (‘portrait’) used                   0.76
Total top-content words used                           17.12
Text difficulty (Fog-Index)                            13.86
Part II

Comment on your oral production:
Note the points you scored above class-average
Note the points you scored below class-average
Which features of your oral production are particularly strong?
Which features of your oral production are less strong?
Overall, do you think your speaking is above or below class-average? Why?
Do you think your speaking could improve/benefit if you take some kind of remedial work/exercises? In which way?
Part III

Using the reference corpus, take your top-five content words and find instances (concordance lines) where these content words have been used with the same, similar and different meanings.
   Same meaning:
   Similar meaning:
   Different meaning:
Which top-types have you not used? Using the corpus, find three instances of each of these non-used types.
Which top-content words have you not used? Using the corpus, find three instances of each of these non-used content words.
Appendix II

Descriptive data (group scores)

                                   N    Range    Min.    Max.     Mean    S.D.
TOKENS                            25     170      95     265    159.84   47.27
TYPES                             25      66      46     112     68.76   16.65
content words                     25      44      24      68     38.12   10.89
ratio token-type                  25    1.30    1.70    3.00      2.31    0.35
ratio token-content word          25    2.63    3.23    5.87      4.20    0.56
token band 1 (freq. > 10)         25       5       0       5      1.52    1.29
token band 2 (freq. 5–10)         25      13       2      15      6.72    2.98
token band 3 (freq. 2–4)          25      26      12      38     20.40    6.00
token band 4 (freq. 1)            25      38      25      63     40.12    9.93
top type 1 (the)                  25      18       3      21     12.88    5.06
top type 2 (and)                  25      16       1      17      7.44    3.61
top type 3 (she)                  25      19       1      20      6.96    4.42
top type 4 (is)                   25      17       0      17      6.84    4.21
top type 5 (picture)              25      10       0      10      5.12    3.37
top type 6 (in)                   25      13       0      13      4.76    3.59
top type 7 (a)                    25       8       1       9      4.60    2.14
top type 8 (to)                   25      15       0      15      4.56    3.44
top type 9 (her)                  25       9       0       9      3.44    2.35
top type 10 (I)                   25       9       0       9      3.12    2.11
total top types                   25      71      27      98     59.72   18.92
top content word 1 (picture)      25      10       0      10      5.12    3.37
top content word 2 (woman)        25       6       0       6      2.96    1.72
top content word 3 (like)         25       8       0       8      2.12    1.83
top content word 4 (think)        25       6       0       6      1.76    1.61
top content word 5 (painter)      25       4       0       4      1.12    1.13
top content word 6 (seems)        25       4       0       4      0.88    1.20
top content word 7 (friends)      25       2       0       2      0.84    0.55
top content word 8 (man)          25       3       0       3      0.76    0.97
top content word 9 (painting)     25       5       0       5      0.80    1.22
top content word 10 (portrait)    25       4       0       4      0.76    1.13
total top content words           25      19       8      27     17.12    5.99
Student use of large, annotated corpora to analyze syntactic variation
Mark Davies
Brigham Young University, USA
This study discusses the way in which advanced language learners have used several large corpora of Spanish to investigate a wide range of phenomena involving syntactic variation in Spanish. The course under discussion is taught via the Internet, and is designed around a textbook that contains both descriptive and prescriptive rules of Spanish syntax. The students carry out studies for those syntactic phenomena in which there is supposedly variation – either between registers, between dialects, or where there is currently a historical change underway. The three main corpora used by the students are the 100 million word Corpus del Español (which I have created), the CREA corpus from the Real Academia Española, and searches of the Web via Google. The students have found that each of the three large corpora has its own weaknesses, and that the most effective strategy is to leverage the strengths of each corpus to find the desired data. The students also learn strategies for comparing results across different corpora, and even within components of the same corpus – such as the frequency of occurrences on the Web from different countries.
1. Introduction

A goal of many language learners is to more fully understand the range of syntactic variation in the second language and thus move beyond the simplistic rules that are presented in many textbooks. This effort can be aided by large corpora of the second language, which allow users to easily and quickly extract hundreds or thousands of examples of competing syntactic constructions from millions of words of text in different dialects and registers. Using this data, the teachers and students can then have a more realistic picture of how the constructions in question vary from one country to another, whether they are
more common in formal or informal speech, and whether their use is increasing or decreasing over time. This paper provides an overview of the way in which students in a recent online course used three very large sets of corpora – involving hundreds of millions of words of text – to study “Variation in Spanish Syntax”. We will examine the goals of the course, how the students carried out their research using several corpora, the way in which they analyzed competing structures from the corpora, and how they were ultimately successful in describing syntactic variation in Spanish. The issue of how students can be trained as corpus researchers to learn about a foreign language has been the focus of a number of recent articles, including Bernardini (2000, 2002), Davies (2000), Osborne (2000), Kennedy and Miceli (2001, 2002), Kirk (2002), and Kübler and Foucou (2002). Although the issue of “student as researcher” is the focus of our study as well, it differs from many of the previous studies in several ways – the corpus used in this course is much larger (hundreds of millions of words of text), the students are more advanced (graduate students, with many of them language teachers themselves), the focus is on variation rather than norms, and the linguistic phenomena studied in the course deal with complex syntactic constructions, rather than simple collocations. Yet the additional data from our course should help to provide a more complete picture of the different ways in which students can use large corpora to study and analyze the grammar of the second language. In terms of the organization of the course, the “Variation in Spanish Syntax” class that we will be discussing was offered for the first time in 2002, and was taught online to twenty language teachers from throughout the United States (see http://davies.linguistics.byu.edu/sintaxis). Each week the students would examine two to four chapters in A New Reference Grammar of Modern Spanish (Butt and Benjamin, 2000), which is an extremely complete reference grammar of Spanish. Table 1 below lists the primary topics for each week’s assignments. After reading the assigned chapters from the reference grammar, each student would identify a particular syntactic construction from among the topics for that week, for which Butt and Benjamin indicated there was some type of variation – between geographical dialects, speech registers, or an overall increase or decrease in the use of the construction. Once the students had identified their topic of study, they would then spend the week using three different sets of corpora or web-based search engines to search for data on the
Table 1. Topics in the “Variation in Spanish Syntax” course

Week 1: Morphology: gender and plurals
Week 2: Articles, adjectives, numbers
Week 3: Demonstratives, lo, possessives
Week 4: Prepositions, conjunctions
Week 5: Pronouns: subject and objects
Week 6: Pronominal verbs, passives, impersonals
Week 7: Indicative verb tenses
Week 8: Progressive, gerund, participles
Week 9: Subjunctive, imperative, conditionals
Week 10: Infinitives, auxiliaries
Week 11: Ser/estar, existential sentences
Week 12: Negation, adverbs, time clauses
Week 13: Questions, relative pronouns and clauses
Week 14: Cleft sentences, word order
constructions in question. The corpora were the Corpus del Español (www.corpusdelespanol.org), the CREA and CORDE corpora from the Real Academia Española (www.rae.es), and the web through the Google and Google Groups search engines (www.google.com, groups.google.com) – all of which will be discussed below. Once they had extracted sufficient data for the construction, they would then write a short summary for that week’s project. In this summary they would point out the four most important findings from the data, summarize how their findings confirmed or contradicted the claims of Butt and Benjamin regarding variation, and then briefly discuss some of the possible motivations for this syntactic variation. The final step for each project, which was completed the following week, was to then review the projects of three other students. By the end of the semester, each student had carried out fairly in-depth research on fifteen different syntactic constructions involving variation in Modern Spanish, and in addition each student had reviewed another forty-five studies by other students. Based on the quality of their projects, it seems clear that these corpus-based activities were extremely valuable in helping the students to move beyond simplistic textbook descriptions of Spanish grammar, and to acquire a much better sense of the actual variation in contemporary Spanish syntax.
2. The corpora

The corpora were the foundation for the entire course, and therefore an understanding of the composition and features of each corpus is fundamental to understanding how the students carried out their research.
2.1 The Corpus del Español

The primary corpus for the course was the 100 million word Corpus del Español that I have created, and which was placed online shortly before the start of the semester. The Corpus del Español has a powerful search engine and unique database architecture that allow the wide range of queries shown in Table 2. These include pattern matching (1), collocations (2), lemma and part of speech for nearly 200,000 separate word forms (3), synonyms and antonyms for more than 30,000 different lemmas (4), more complex searches using combinations of the preceding types of searches (5), queries based on the frequency of the construction in different historical periods and registers of Modern Spanish (6), and queries involving customized, user-defined lists (7). Note also that it would take only about 1–2 seconds to run any of these queries against the complete 100 million word corpus. In short, the Corpus del Español is richly annotated and allows searches for many types of linguistic phenomena, which made it extremely useful for the wide range of constructions that were studied in the “Variation in Spanish Syntax” course.
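As an illustration of what such annotation makes possible (and not a description of the Corpus del Español's actual implementation), the toy example below matches a lemma-plus-part-of-speech pattern of the kind shown in Table 2 (e.g., poder.* *.v_inf) against a small hand-tagged word list; the mini-corpus and tag names are invented.

```python
# Toy illustration: matching "any form of poder + infinitive" against tagged tokens.
tagged = [  # (word form, lemma, POS) - invented mini-corpus
    ("puede", "poder", "v_fin"), ("tener", "tener", "v_inf"),
    ("pudiera", "poder", "v_fin"), ("escapar", "escapar", "v_inf"),
    ("quiere", "querer", "v_fin"), ("comer", "comer", "v_inf"),
]

def match_lemma_then_pos(corpus, lemma, pos):
    """Return word-form bigrams: any form of `lemma` followed by a token tagged `pos`."""
    hits = []
    for (w1, l1, _), (w2, _, p2) in zip(corpus, corpus[1:]):
        if l1 == lemma and p2 == pos:
            hits.append(f"{w1} {w2}")
    return hits

print(match_lemma_then_pos(tagged, "poder", "v_inf"))  # ['puede tener', 'pudiera escapar']
```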
2.2 CREA / CORDE and Google (Groups)

In addition to the 100 million word Corpus del Español, the students used two other sets of Spanish corpora. The first set is the CREA (Modern Spanish) and CORDE (Historical Spanish) corpora from the Real Academia Española, which contain a combined total of about 200 million words of text. The second set comprises the Google and Google Groups search engines. While these search engines are of course not limited just to web pages in Spanish, the main Google index covers more than 100 million words of text in Spanish language web pages, while the Google Groups search engine contains millions or tens of millions of words of text in messages to Spanish newsgroups.
Table 2. Range of searches possible with the “Corpus del Español”

1. est_b* *ndo
   estaba cantando, estábamos diciendo
2. lo * posible
   “as * as possible”: lo mejor posible, lo antes posible, lo máximo posible
3. poder.* *.v_inf
   forms of poder (“to be able”) + infinitive: puede tener, pudiera escapar
4. !dificil.*
   all forms of all synonyms of difícil “difficult”: imposible, duros, compleja, complicadas, . . .
5. estar.* !cansado.* *.prep *.v_inf
   any form of estar + any form of any synonym of cansado (“tired”) + preposition + infinitive: estoy harto de vivir, estaba cansada de escuchar
6. *.adv {19misc>5 19oral=0}
   all adverbs that occur more than five times in newspapers or encyclopedias from the 1900s, but not in spoken texts from the 1900s: inversamente, clínicamente
7. le/les [Bill.Jones:causative] *rse
   le or les + a customized list of [causative] verbs created by [Bill.Jones] + words ending in [-rse]: le mandaban ponerse, les hace sentirse
The main advantage that CREA/CORDE and Google (Groups) have over the Corpus del Español is their ability to limit searches to specific countries. As we will see in Section 3.3, this is useful for students who want to look at the relative frequency of constructions in different geographical dialects. In addition, CREA and CORDE allow users to compare the relative frequency of constructions in several different registers, beyond just the three divisions of the Corpus del Español (literature, spoken, newspaper/encyclopædias). The main disadvantage of CREA/CORDE and Google (Groups) is that they are not annotated in any way, which makes it impossible to search for syntactic constructions using lemma or part of speech. Even the wildcard features of both sets of search engines are rather limited, which means that it is also impossible to look for morphologically similar parts of speech or lemma. Both CREA/CORDE and Google (Groups) are really only useful in searching for exact phrases. However, as we will see in Section 3.3, when they are used in
conjunction with the highly annotated Corpus del Español, the overall collection of corpora does permit students to carry out detailed searches on syntactic variation for a very wide range of constructions in Spanish.
3. Student outcomes: Examining syntactic variation through corpus use

In this section, we will consider some of the challenges that the students faced in using the corpora effectively, how they overcame these obstacles, and how they were ultimately successful in carrying out more advanced research in Spanish and thus moving beyond the simplistic rules of many introductory and intermediate level textbooks. In the sections that follow, we will use as an example the question of clitic placement, in which the clitic can either be pre-posed or post-posed (lo quiero hacer vs. quiero hacerlo; “I want to do it”), and which exhibits variation that is dependent on dialect, register, and several functional factors (cf. Davies 1995, 1998).
3.1 Learning to use the corpora

The first challenge facing the students was simply learning to use the different corpora successfully, in order to limit searches and extract the desired information. To help the students, during the first week of class they spent several hours completing two rather lengthy “scavenger hunt” quizzes using the Corpus del Español, the CREA and CORDE corpora, and Google (Groups). Each question would outline the type of data that they would look for from a particular corpus. For example, one of the questions dealing with the Corpus del Español asked them to look for the most common phrases involving any object pronoun followed by any form of any synonym of querer (“to want”) followed by an infinitive. The students would be responsible for looking at the help file to see how to search for lemma and parts of speech, and would then hopefully use the correct search syntax to query the corpus, in this case [ *.pn_obj !querer.* *.v_inf ]. As a hint to make sure that the students were on the right track in their search, they were told what the first two entries in the results set would be (e.g., te quiero decir, le quiero decir), and they could use this to check their results. They were then responsible for providing the third entry from the list (in this
case, me quiere decir). By the time they had answered sixty such questions for the different corpora during the first week, they were ready to use the corpora to extract data on syntactic constructions of their choosing. By examining Table 1 we can see that the topics in the course were arranged so that the students would start with constructions that were relatively easy to find, and would then move on, as the semester progressed, to constructions that were gradually more abstract and difficult. For example, at the beginning of the semester students started at the word-internal level (morphological variation for gender and number), and then moved to constructions involving adjacent words (e.g., demonstratives and possessives), then semantically more complex localized constructions (e.g., pronominal verbs), then even less well-defined structures (e.g., subjunctives), ending up with fairly abstract and less localized constructions (e.g., cleft sentences and word order).
3.2 Formulating the research question

One of the hardest parts of carrying out linguistic research is knowing how to frame the question and set up the actual search of the corpora. To help the students, during the first four weeks of the course I required that they send me a Plan de Trabajo (Work Plan) before they started the research in earnest. In a short paragraph they would first briefly outline what type of variation was described in the reference grammar. They would then indicate which corpora would be most useful to examine the variation, and show exactly what type of queries would be run against these corpora to extract the data. Sometimes there were problems with the general research question – for example, the topic was much too wide or too narrow. Returning to the clitic placement construction, for example, they might propose to look at clitic placement with all main verbs (too wide) or with just three or four exact phrases (too narrow). Sometimes they intended to use a corpus that was not the best one for the question at hand, or else they had the wrong search syntax. In all such cases, I would help them frame the search correctly before they started. Once they had received this feedback, they were then ready to start the search itself. This procedure seemed to prevent a lot of wasted time and frustration on the part of the students as they were learning to use the corpora. By the end of the fourth week of using corpora, however, most of the students had sufficient experience in framing the research questions and setting up the queries, and it then became optional to submit a work plan before consulting the corpora.
3.3 Extracting data from multiple corpora

The students soon learned that the most productive research was that which incorporated searches from all three sets of corpora, using each of the corpora for those purposes for which it was most useful. Typically, the students would start searching with the Corpus del Español, because it is the only one of the three that is annotated. For example, if they were examining variation in clitic placement, they could search for all cases of pre-positioning (“clitic climbing”) (1a) or post-positioning (1b) with all of the synonyms of a particular verb:

(1a) [ *.pn_obj !querer.* *.v_inf ]   lo quiero hacer, me preferían hablar
(1b) [ !querer.* *.v_inf_cl ]         quiero hacerlo, preferían hablarme
     “I want to do it, he preferred to talk to me”
The students would see all of the matching phrases for both constructions and could easily compare the relative frequency across historical periods – to see whether one construction or the other is increasing over time – and they could also compare the relative frequency in the three general registers of literature, spoken, and newspapers / encyclopaedias. In order to carry out even more detailed investigations of register or dialectal variation, however, the students often turned to the CREA/CORDE or Google (Groups) corpora. Because these corpora only allow searches of exact words and phrases, however, the student would need to select individual phrases from the lists generated in the Corpus del Español (e.g., lo quiero hacer vs. quiero hacerlo; “I want to do it”), and then search for these individual phrases one by one. Although somewhat cumbersome, this would allow students to compare the relative frequency of specific phrases in more than twenty Spanish-speaking countries and (in the case of CREA/CORDE) in a much wider range of register subdivisions than in the Corpus del Español. The two-step process – starting with the Corpus del Español and then using its data (when necessary) to search for individual phrases in CREA/CORDE and Google (Groups) – consistently yielded the best results for the students.
3.4 Organizing the data

After running the queries on the different corpora, the next step was to organize the data so that it would confirm or deny Butt and Benjamin’s claims about syntactic variation in Spanish. At the beginning, this was rather difficult for some students. They might examine two competing syntactic constructions in different geographical dialects of Spanish – for example [clitic + main verb + infinitive] (“Type A”) vs. [main verb + infinitive + clitic] (“Type B”). In their attempt to show whether Type A or Type B was more common in different dialects, they might discover that CREA or Google had many more examples of Type A from Spain than from Mexico or Argentina. Inexperienced students might interpret this to mean that Type A was more common in Spain than in Mexico or Argentina, without realizing that there are more examples from Spain simply because the textual corpus from Spain is so much larger than that of other countries. Of course, the issue is not the relative frequency of Type A in Spain vs. the relative frequency of Type A in Mexico or Argentina, but rather the relative frequency of Type A vs. Type B in each of these three countries. Once students learned to correctly use relative frequencies to compare geographical dialects, different registers, or different historical periods, they were on much firmer footing as regards making valid judgments about the data.
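The arithmetic behind this point can be made concrete with a small worked example; the counts below are invented and serve only to show why raw hit counts mislead while within-country proportions do not.

```python
# Worked illustration of normalising Type A against (Type A + Type B) per country.
counts = {  # invented counts: Type A = clitic climbing, Type B = post-posed clitic
    "Spain":     {"A": 900, "B": 2100},
    "Mexico":    {"A": 400, "B": 600},
    "Argentina": {"A": 350, "B": 450},
}

for country, c in counts.items():
    share_a = c["A"] / (c["A"] + c["B"])
    print(f"{country}: Type A = {share_a:.0%} of A+B")

# Spain yields the most raw Type A hits (900) simply because its subcorpus is
# largest, yet proportionally Type A is less common there (30%) than in
# Mexico (40%) or Argentina (44%).
```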
3.5 Drawing conclusions regarding variation

Once the data had been organized correctly, students were responsible for summarizing the most important findings from the data, and for suggesting whether the data confirmed or denied the original hypothesis regarding variation with the particular syntactic construction. During the first two or three weeks, students found it very difficult to clearly and concisely summarize the findings, and instead preferred to hope that quantity equaled quality. Therefore, starting in the third week I limited them to only four short sentences to explain the major findings from the data. In addition, they were asked to include a two or three sentence conclusion, which showed whether the four points just mentioned confirmed or denied the hypothesis from Butt and Benjamin regarding syntactic variation. My sense was that this “bottom line assessment” was very useful in helping them to organize their data collection and written summary.
3.6 “Explaining” the syntactic variation

If students were able to accomplish all of the preceding tasks, they were judged to have been successful in carrying out the research. Starting in about the fifth week, however, they were presented with an additional goal, which was to suggest possible motivations for the syntactic variation – whether it was geographical, register-based, or diachronic in nature. The second textbook that was required for the course – in addition to the reference grammar – was Spanish-English Contrasts (Whitley 2002), which is an overview of recent research on a wide range of syntactic constructions in Spanish. To the extent possible, the students were asked to use any of the more theory-based explanations in Whitley for the particular construction in question, and see whether this might be useful in helping to “explain” the variation. For example, with the clitic placement construction, they might realize that placement is a function of the semantics of the main verb, with semantically light verbs allowing clitic climbing more often (cf. Davies 1995, 1998). In some cases students were able to identify possible causal factors, but in other cases it was simply sufficient to point out the actual variation and leave it at that. Even in these cases, however, the data that they presented was often more complete than that found in previous studies by much more accomplished researchers, simply because of the size and power of the corpora that were available to the students in the course.
4. Conclusions

As was explained in the introduction, there has been recent interest in the way in which students can be “trained” as corpus linguists to extract data from the foreign language. Many of these studies have focused on intermediate level students looking for “correct rules” for simple constructions in relatively small corpora. In the course that we have described, however, the graduate-level students focused on variation from the norm with rather complex constructions in hundreds of millions of words of data. In spite of the differences between this course and those described in previous studies, the hope is that our experience might provide insight into how students can use corpora to perform advanced research on the foreign language. As we have seen, if there is proper guidance and feedback, even students who are relatively inexperienced in syntactic research can be trained to use the
corpora, formulate research questions and search strategies, organize the data, confirm or deny previous claims about language variation, and perhaps even begin to find motivations for this variation. In accomplishing these goals, these students have been successful in moving beyond the simplistic, prescriptivist rules found in many textbooks, and have begun to use corpora to acquire a much more accurate view of the syntactic complexity of the foreign language.
References

Bernardini, S. 2000. “Systematising serendipity: Proposals for large-corpora concordancing with language learners”. In Burnard and McEnery, 225–234.
Bernardini, S. 2002. “Exploring new directions for discovery learning”. In Kettemann and Marko, 165–182.
Burnard, L. and McEnery, T. (eds). 2000. Rethinking Language Pedagogy from a Corpus Perspective [Łódź Studies in Language, Vol. 2]. Frankfurt am Main: Peter Lang.
Butt, J. and Benjamin, C. 2000. A New Reference Grammar of Modern Spanish. 3rd edition. Chicago: McGraw-Hill.
Davies, M. 1995. “Analyzing syntactic variation with computer-based corpora: The case of modern Spanish clitic climbing”. Hispania 78:370–380.
Davies, M. 1998. “The evolution of Spanish clitic climbing: A corpus-based approach”. Studia Neophilologica 69:251–263.
Davies, M. 2000. “Using multi-million word corpora of historical and dialectal Spanish texts to teach advanced courses in Spanish linguistics”. In Burnard and McEnery, 173–186.
Kennedy, C. and Miceli, T. 2001. “An evaluation of intermediate students’ approaches to corpus investigation”. Language Learning and Technology 5:77–90.
Kennedy, C. and Miceli, T. 2002. “The CWIC project: Developing and using a corpus for intermediate Italian students”. In Kettemann and Marko, 183–192.
Kettemann, B. and Marko, G. (eds). 2002. Teaching and Learning by Doing Corpus Analysis (Proceedings of the Fourth International Conference on Teaching and Language Corpora, Graz 19–24 July, 2000). Amsterdam: Rodopi.
Kirk, J. 2002. “Teaching critical skills in corpus linguistics using the BNC”. In Kettemann and Marko, 155–164.
Kübler, N. and Foucou, P-Y. 2002. “Linguistic concerns in teaching with language corpora. Learner corpora”. In Kettemann and Marko, 193–203.
Osborne, J. 2000. “What can students learn from a corpus? Building bridges between data and explanation”. In Burnard and McEnery, 165–172.
Whitley, M.S. 2002. Spanish-English Contrasts. Washington, DC: Georgetown University Press.
A future for TaLC?
Facilitating the compilation and dissemination of ad-hoc web corpora
William H. Fletcher
United States Naval Academy,1 USA
Since the World Wide Web gained prominence in the mid–1990s it has tantalized language investigators and instructors as a virtually unlimited source of machine-readable texts for compiling corpora and developing teaching materials. The broad range of languages and content domains found online also offers translators enormous promise both for translation-by-example and as a comprehensive supplement to published reference works. This paper surveys the impediments which still prevent the Web from realizing its full potential as a linguistic resource and discusses tools to overcome the remaining hurdles. Identifying online documents which are both relevant and reliable presents a major challenge. As a partial solution the author’s Web concordancer KWiCFinder automates the process of seeking and retrieving webpages. Enhancements which permit more focused queries than existing search engines and provide search results in an interactive exploratory environment are described in detail. Despite the efficiency of automated downloading and excerpting, selecting Web documents still entails significant time and effort. To multiply the benefits of a search, an online forum for sharing annotated search reports and linguistically interesting texts with other users is outlined. Furthermore, the orientation of commercial search engines toward the general public makes them less beneficial for linguistic research. The author sketches plans for a specialized Search Engine for Applied Linguists and a selective Web Corpus Archive which build on his experience with KWiCFinder. He compares his available and proposed solutions to existing resources, and surveys ways to exploit them in language teaching. Together these proposed services will enable language learners and professionals to tap into the Web effectively and efficiently for instruction, research and translation.
1. Aperitivo

Aston (2002) compares learner-compiled corpora to professionally produced corpora through a memorable analogy to fruit salad. While home-made fruit salad (and corpora) can entail various benefits he enumerates, the off-the-shelf variety offers reliability and convenience, supplemented in its corpus analogue by documentation and specialized software. He proposes that learners can follow a compromise “pick’n’mix” strategy, compiling their own customized subcorpora from professionally selected materials. By now this alimentary analogy (but by no means the strategy) must have passed its “best-by” date, yet I cannot resist adapting it to the World Wide Web. Food-borne analogies seem very appropriate for a conference in Bertinoro, the historic town of culinary and oenological hospitality, so I begin and end on this note. For years the Web has tantalized language professionals, offering a boundless pool of texts whose fruitful exploitation has remained out of reach. It is like an old-fashioned American community pot-luck supper, to which each family brings a dish to share with the other guests. As a child at such events I would taste many dishes in search of the most flavourful; usually I wasted my appetite sampling mediocre fare. Similarly I have spent countless hours online seeking and sifting through webpages, too often squandering my time, then giving up, sated yet unsatisfied. Frustration with finding useful content in the World Wide Haystack inspired me to design and implement the Web concordancing tools and strategies described here which enable users to compile ad-hoc corpora from webpages.2 Reflection on essential needs unmet by this model has led me to chart the course for future development to make sharing of Web corpora easier and more rewarding, and to outline an infrastructure for a search engine tailored to the needs of language professionals and learners. My conviction is simple: if online linguistic research can be made effective and efficient, linguists and learners will not have to take pot-luck with what they find on the Web by chance.
2. Web as corpus?!

A haphazard accumulation of machine-readable texts, the World Wide Web is unparalleled for quantity, diversity and topicality. This ever-expanding body of documents now encompasses at least 10 billion (10⁹) webpages publicly available via links, with several times that number in the “hidden” Web accessible only through database queries or passwords. Once overwhelmingly Anglophone, the Web now encompasses languages used by a majority of the world’s population. Currently native English speakers account for only 35% of Web users, and their relative prominence is dwindling as the Web expands into more non-western language areas.3 Online content covers virtually every knowledge domain of interest to language professionals or learners, and incorporates contemporary issues and emerging usage rare in customary sources. With all the Web offers, why have all but a handful of corpus linguists and language professionals failed to exploit this vast potential source for corpora?4 Surely the effort required to locate relevant, reliable documents outweighs all other explanations for this neglect. The quantity of information online greatly surpasses its overall quality. Unpolished ephemera abound alongside rare treasures, and online documents generally seem to consist more of accumulations of fragments, stock phrases and bulleted lists than of original extended text. Among the longer coherent texts, specialized genres such as commercial, journalistic, administrative and academic documents predominate. Assessing the “authoritativeness” of a webpage – the accuracy of its content and representativeness of its linguistic form – demands time and expertise. Despite these challenges, there are compelling reasons to supplement existing corpora with online materials. A static corpus represents a snapshot of issues and language usage known when it was compiled. The great expense of setting up a large corpus precludes frequent supplementation or replacement, and contemporary content can grow stale quickly. In contrast, new documents appear on the Web daily, so up-to-date content and usage tend to be well represented online. In addition, even a very large corpus might include few examples of infrequent expressions or constructions that can be found in abundance on the Web. Moreover, certain content domains or text genres may be underrepresented in an existing corpus or even missing entirely. With the Web as a source one usually can locate documents from which to compile an ad-hoc corpus to meet the specific needs of groups of investigators, translators or learners. Finally, while existing corpora may entail significant fees and
require specialized hardware and software to consult, Web access is generally inexpensive, and desktop computers to perform the necessary processing are now within the reach of students as well as researchers.
3. Locating forms and content on line

3.1 Established techniques

Marcia Bates’ “information search tactics” can be adapted to categorize typical approaches to finding useful material online (Fletcher 2001b). Hunting, or searching directly for specific forms or content online, appears to be the most widely-used and productive strategy. For specialized content, grazing, i.e., focusing on predetermined reliable websites, has also proved an effective strategy for corpus construction.5 In contrast to these goal-oriented tactics, browsing, the archetypal Web activity, relies on serendipity for the user to discover relevant material. What follows shows how all three strategies can be improved to make the Web a more accessible corpus for language research and learning. “Hunting” via Web searches is the most effective means of locating online content. Unfortunately this strategy depends on commercial search engines and thus is limited by their quirks and weaknesses. A dozen main search engines aspire to “crawl” and map the entire Web, yet none indexes more than roughly a fifth of the publicly-accessible webpages. Thousands of specialized search engines focus on narrower linguistic, geographic or knowledge domains. The search process is familiar to all Web users: first one formulates a query to find webpages with specific words or phrases and submits it to a search engine. Some search engines support “smart features” for a few major languages, for example to search automatically for synonyms or alternate word forms (“stemming”). Meta-search engines query several search engines simultaneously, then “collapse” the results into a single list of unique links. In all cases, however, the user must still retrieve and evaluate the documents individually for relevance and reliability. Beyond the tedium of winnowing the wheat from the chaff, this search-and-select strategy has several flaws, starting with the port-of-entry to the Web. Search engines are not research libraries but commercial enterprises targeted at the needs of the general public. The availability and implementation of their
services change constantly: features are added or dropped to mimic or outdo the competition; acquisitions and mergers threaten their independence; financial uncertainties and legal battles challenge their very survival. The search sites’ quest for revenue can diminish the objectivity of their search results, and various “page ranking” algorithms may lead to results that are not representative of the Web as a whole.6 Most frustrating is the minimal support for the requirements of serious researchers: current trends lead away from sites like AltaVista supporting sophisticated complex queries (which few users employ) to ones like Google offering only simple search criteria. In short, the search engines’ services are useful to investigators by coincidence, not design, and researchers are tolerated on mainstream search sites only as long as their use does not affect site performance adversely.
3.2 KWiCFinder Web concordancer

To overcome some limitations of general-purpose search engines and to automate aspects of the process of searching and selecting I have developed the search agent KWiCFinder, short for Key Word in Context Finder. This free research tool7 helps users create a well-formed query and submits it to the AltaVista search engine. It then retrieves and produces a KWiC concordance of 5–15 online documents per minute without further attention from the user; dead links and documents whose content no longer matches the query are excluded from this search report. Here I discuss how it enhances the search process for language analysis as background to the proposals advanced in the solutions sections below; for greater detail see the website referenced in the previous note and Fletcher 2001b.
3.2.1 Searching with KWiCFinder

To streamline the document selection process, KWiCFinder features more narrowly focused search criteria than commercial search sites. For example, AltaVista supports the wildcard *, which stands for any sequence of zero to five characters. KWiCFinder adds the wildcards ? and %, which represent “exactly one” and “zero or one” characters respectively. In an AltaVista query, lower-case and “plain” characters match their upper-case and accented counterparts, so that e.g., a in a query would match any of aáâäàãæåAÁÂÄÀÃÆÅ. KWiCFinder introduces the “sic” option, which forces an exact match of lower-case and “plain” characters. For example, choosing “sic” distinguishes
the past tense of the German passive auxiliary wurde from the subjunctive auxiliary würde, and both are kept separate from the noun Würde “dignity”. Similarly, KWiCFinder supports the operators BEFORE and AFTER in addition to AltaVista’s NEAR to relate multiple search terms, and permits the user to specify how many words may separate them. These enhancements do come at a price: KWiCFinder must submit a standard query to AltaVista and retrieve all matching documents, then filter out webpages not meeting the narrower search criteria. In extreme cases, dozens of webpages must be downloaded and analyzed to find one that matches the searcher’s query exactly.

Obviously the most efficient searches forgo wildcards by specifying and matching variant forms exactly. Especially in morphologically rich languages, entering all possible variants into a query can be most tedious. KWiCFinder introduces three types of “tamecards,” a shorthand notation for such variants. A simple tamecard pattern is entered between [ ], with variants separated by commas: work[,s,ed,ing] is expanded to work OR works OR worked OR working, but it does not match worker, workers, as would wildcard work*. Indexed tamecards appear between { }; each variant is combined with the corresponding variant in other indexed tamecards in the same query field. For instance, {me,te,se} lav{o,as,a} expands only to the Spanish reflexive forms me lavo, te lavas, se lava, but not to non-reflexive te lavo or ungrammatical *se lavo. KWiCFinder’s “implicit tamecards” with hyphen or apostrophe match forms both with and without the punctuation mark and / or space: on-line matches the common variants on line, online, on-line. This is particularly useful for English, with its great variation in writing compounds with and without spaces and hyphens, and for German, where the new spelling puts asunder forms that formerly were joined: kennen-lernen matches both traditional kennenlernen and reformed kennen lernen, which coexist in current practice and are reunited in a KWiCFinder search. As a final implicit set of tamecards KWiCFinder recognizes the equivalence of some language-specific orthographic variants, such as German ß and ss, ä ö ü and ae oe ue.
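To make the tamecard idea concrete, the following minimal Python sketch expands simple and indexed tamecard patterns into explicit alternatives that could then be joined with OR for submission to a search engine. It is an illustration of the notation described above, not KWiCFinder’s actual code, and the function names are invented for the example.

```python
import itertools
import re

def expand_simple(pattern):
    """Expand a simple tamecard like 'work[,s,ed,ing]' into explicit variants.
    Bracketed groups list suffix alternatives separated by commas; an empty
    alternative keeps the bare stem."""
    parts = re.split(r"\[([^\]]*)\]", pattern)          # alternating literal / group
    alternative_sets = [[p] if i % 2 == 0 else p.split(",")
                        for i, p in enumerate(parts)]
    return ["".join(combo) for combo in itertools.product(*alternative_sets)]

def expand_indexed(pattern):
    """Expand indexed tamecards like '{me,te,se} lav{o,as,a}': the n-th variant
    of each group combines only with the n-th variant of the other groups."""
    groups = [g.split(",") for g in re.findall(r"\{([^}]*)\}", pattern)]
    template = re.sub(r"\{[^}]*\}", "{}", pattern)
    return [template.format(*combo) for combo in zip(*groups)]

print(expand_simple("work[,s,ed,ing]"))
# -> ['work', 'works', 'worked', 'working']
print(expand_indexed("{me,te,se} lav{o,as,a}"))
# -> ['me lavo', 'te lavas', 'se lava']
print(" OR ".join(expand_simple("work[,s,ed,ing]")))    # OR-query ready for a search engine
```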
3.2.2 Exploring form and content with KWiCFinder

KWiCFinder complements AltaVista by focusing searches to increase the relevance of webpages matched. The typical “search and select” strategy requires one to query a search engine, then retrieve and evaluate webpages one by one. KWiCFinder accelerates this operation by fetching and excerpting matching documents for the user. Even with a KWiC concordance of webpages, however,
the language samples still must be considered individually and selected for usefulness. KWiCFinder’s browser-based interactive search reports allow one to evaluate large numbers of documents efficiently. The data are encoded in XML format, so results from a single search can be transformed into various “views” or formats for display in a Web browser, from “classic concordance” – one line per citation, centred on the key word or phrase – to table or paragraph layout with key words highlighted. Navigation buttons facilitate jumping from one example to the next. In effect, KWiCFinder search reports constitute mini ad-hoc corpora which can include substantial context for further linguistic investigation.

Users can add comments to relevant citations and documents, call up original or locally saved copies of webpages for further scrutiny, and select individual citations for retention or elimination from the search report. Browser-based JavaScript tools are integrated into the search report to support exhaustive exploration and simple statistical analysis of the co-text. User-enhanced search reports can be saved as stand-alone HTML pages for sharing with students or colleagues, who in turn can annotate, supplement, save and share them. By merging concordanced content and investigative software into a single HTML document that runs in a browser, KWiCFinder interactive search reports remain accessible to users of varying degrees of sophistication and achieve a significant degree of platform independence.8
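The “views” idea can be illustrated with a small Python sketch that renders one XML result set either as a classic one-line-per-citation concordance or as a paragraph view. The element names in the sample XML are hypothetical, since KWiCFinder’s actual report schema is not reproduced here; the sketch shows the principle of separating stored data from display format, not the real file layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical search-report structure; KWiCFinder's real schema may differ.
report_xml = """
<report query="horde of">
  <hit url="http://example.org/a">
    <pre>they were met by a</pre><key>horde of</key><post>angry protesters</post>
  </hit>
  <hit url="http://example.org/b">
    <pre>swamped by a</pre><key>horde of</key><post>tourists every summer</post>
  </hit>
</report>
"""

def concordance_view(root, width=30):
    """Classic KWiC: one line per citation, key word centred."""
    lines = []
    for hit in root.iter("hit"):
        pre = hit.findtext("pre", "")[-width:].rjust(width)
        post = hit.findtext("post", "")[:width]
        lines.append(f"{pre}  {hit.findtext('key')}  {post}")
    return "\n".join(lines)

def paragraph_view(root):
    """Running-text view with the key word marked up and the source noted."""
    paras = []
    for hit in root.iter("hit"):
        paras.append(f"{hit.findtext('pre')} **{hit.findtext('key')}** "
                     f"{hit.findtext('post')}  [{hit.get('url')}]")
    return "\n\n".join(paras)

root = ET.fromstring(report_xml)
print(concordance_view(root))
print(paragraph_view(root))
```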
4. Language-oriented Web search: challenges and solutions
4.1.1 Challenge I: time and effort

Each generation of computers has made us users more impatient: we have grown accustomed to accessing information instantly, and a delay of seconds can seem interminable. Tools such as KWiCFinder can download and excerpt several pages a minute, where the exact value of “several” depends on connection speed, document size and processing capability. Frequently I investigate a linguistic question or look for appropriate readings for my students by searching for and processing 100 or more webpages in 10–15 minutes. For example, to compile a sample corpus of Web documents, I downloaded 11,201 webpages in an afternoon while I was teaching, through unattended simultaneous searches. Typically I run such searches while doing something else and
peruse the results when convenient. Unfortunately this strategy is inadequate for someone like a translator with an immediate information need, and it can be costly for a user who pays for time online by the minute.9

Downloading and excerpting webpages can be accelerated. In an ongoing study based on my sample Web corpus I have evaluated various “noise-reduction” techniques to improve the usefulness of documents fetched from the Web (Fletcher 2002). Document size is the simplest and most powerful predictor of usability: webpages of 3–150 KB tend to yield more connected text, while smaller or larger files have a higher proportion of non-textual overhead or noise, as well as a higher HTML-file size to text-file size ratio. Since document size can be determined before a file is fetched, one could restrict downloads to the most productive size range and achieve tremendous bandwidth savings. While this and other techniques will realize further efficiencies in search agents, even an automated search and concordancing tool like KWiCFinder remains too slow to be practical for some purposes.

Furthermore, formulating a targeted query and evaluating online documents and citations for reliability, representativeness and relevance to a specific content domain, pedagogical concern or linguistic issue can require a significant investment of time and effort. If a search addresses a question of broader interest, the resulting search report and analysis should be shared with others. While one can easily save such reports as HTML files for informal dissemination, there is no mechanism for “weblishing” them or informing interested colleagues about them. Moreover, the relevant, reliable webpages selected by a searcher are likely to lead to productive further exploration and analysis of related issues and to contain valuable links to additional resources, yet as things now stand they will be found in future searches only by coincidence. How can the value added by the (re)searcher be recovered?
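Because a page’s size can be read from the HTTP headers before the body is transferred, the 3–150 KB heuristic just mentioned can be applied very cheaply. The following Python sketch (standard library only, with illustrative URLs) checks the reported Content-Length with a HEAD request and skips pages outside the productive range; it is one possible implementation of the idea, not the procedure used in the study cited.

```python
from urllib.request import Request, urlopen

MIN_BYTES, MAX_BYTES = 3 * 1024, 150 * 1024   # the 3-150 KB range discussed above

def worth_fetching(url, timeout=10):
    """Read the reported Content-Length with a HEAD request before committing
    bandwidth to a full download; skip pages outside the productive size range
    or pages that report no size at all."""
    try:
        req = Request(url, method="HEAD", headers={"User-Agent": "size-filter-demo"})
        with urlopen(req, timeout=timeout) as resp:
            length = resp.headers.get("Content-Length")
    except OSError:
        return False
    if length is None:
        return False            # conservative: many dynamic pages omit the header
    return MIN_BYTES <= int(length) <= MAX_BYTES

candidates = ["http://example.org/page1.html", "http://example.org/huge-dump.html"]
to_download = [u for u in candidates if worth_fetching(u)]
print(to_download)
```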
4.1.2 Solution I: Web Corpus Archive (WCA)

To help searchers with an immediate information need and as a forum for sharing search results, I intend to establish an online archive of Web documents which collects, disseminates and builds on users’ searches. KWiCFinder will add the capability for qualified users to upload search reports with broader appeal to this Web Corpus Archive (WCA). In brief comments, users will describe the issues addressed, classify the webpages by content domain, and summarize the insights gained by analyzing the documents. Such informal weblications will enable language professionals and learners worldwide to
profit from an investigator’s efforts. This model extends Tim Johns’ concept of “kibbitzers”, ad-hoc queries from the British National Corpus designed to clarify some fine point of word usage or grammar, complemented by analysis and discussion of the evidence, which he saves and posts online (Johns 2001).

Whenever a user uploads a search report to benefit the user community, the WCA server will download the source documents from the Web and archive them, preserving the original content from “link rot” and enabling others to verify and reanalyze the original data. Since much of a webpage’s message is conveyed by elements other than raw text – images, layout, colour, sounds, interactivity – these elements should be preserved as well. Links from these pages to related content will be explored to extend the scope of content archived. Since this growing online body of webpages selected for reliability and classified by content domain will reside on a single server, it can provide fast, sophisticated searches within the WCA, yielding browser-based interactive search reports similar to those produced by KWiCFinder. Fruitless searches will be submitted to other search engines to locate additional webpages for inclusion in the Web Corpus Archive. Data on actual user searches with KWiCFinder and on my “Phrases in English” site (Fletcher 2004) would also expand the archive. Available topic recognition and text summarization software could be harnessed to classify and evaluate these automatically retrieved documents.

Clearly, obtaining permission from all webpage creators to incorporate their material into an archive is unfeasible, which raises the question whether this repository would infringe on copyright. Including entire webpages without permission in a corpus distributed on CD-ROM would obviously be illegal – and unethical to boot. But providing a KWiC concordance via the Web of excerpts from webpages cached in their entirety on a “corpus server” clearly falls well within currently accepted practice. While not a legal expert, I do note that for years search engines like Google and AltaVista have included brief KWiC excerpts from documents in their search reports with impunity. Indeed, both Google and the Internet Archive (a.k.a. the Wayback Machine, http://web.archive.org) serve up entire webpages and even images from their cache on demand. Both these sites’ policy statements suggest an implied consent from webpage owners to cache and pass on content if the site has no standard Web exclusion protocol “robots.txt” file prohibiting this practice and if the document lacks a meta-tag specifying limitations on caching. They assert this right in daily practice and defend it when necessary in court. The Internet Archive’s FAQ
explicitly claims that its archive does not violate copyright law, and in accordance with the Digital Millennium Copyright Act it provides a mechanism for copyright holders to request removal of their material from the site as well.10

Besides these familiar sites rooted in the information industry, libraries and institutes in various countries are establishing national archives of online documents to preserve them for future generations. The co-founder of one such repository (who understandably prefers anonymity) has confided in me that his group will proceed despite the unclear legality of their endeavour. Eventually legislation or litigation will clarify the status of Web archives, a recurring topic on the Internet Archive’s [archivists-talk] mailing list.11 Optimistically I assume that a Web-accessible corpus for research and education derived from online documents retrieved by a search agent in ad-hoc searches will fall within legal boundaries. Meanwhile, I intend to assert and help establish our profession’s rights while scrupulously respecting any restrictions a webpage author communicates via industry-standard conventions.12
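The industry-standard conventions mentioned above can be honoured mechanically before a page is cached. As a hedged illustration of how a WCA-style archiver might check them, the Python sketch below consults a site’s robots.txt and looks for a robots meta tag that forbids caching or indexing; the user-agent string is invented, and a production archiver would need more robust HTML parsing than this regular-expression scan.

```python
import re
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

AGENT = "WCA-archiver-demo"   # hypothetical user-agent for the archive crawler

def may_archive(url):
    """Honour the two standard opt-out signals discussed above: a robots.txt
    disallow rule and a <meta name="robots"> tag forbidding caching/indexing."""
    parts = urlparse(url)
    rp = RobotFileParser(urljoin(f"{parts.scheme}://{parts.netloc}", "/robots.txt"))
    try:
        rp.read()
    except OSError:
        return False                      # play safe if robots.txt is unreachable
    if not rp.can_fetch(AGENT, url):
        return False
    try:
        html = urlopen(url, timeout=10).read(200_000).decode("utf-8", "replace")
    except OSError:
        return False
    meta = re.search(r'<meta[^>]+name=["\']robots["\'][^>]*>', html, re.I)
    if meta and re.search(r"noarchive|noindex", meta.group(0), re.I):
        return False
    return True

print(may_archive("http://example.org/some-page.html"))
```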
4.2.1 Challenge II: Commercial search engines

Two concerns prompt me to propose a more ambitious project as well. Firstly, the limitations imposed on queries by the most popular search engines for practical reasons reduce their usefulness for serious linguistic research. Secondly, the demands of survival in a competitive market compromise the viability and continuity of the most valuable search engines.13

4.2.2 Solution II: Search Engine for Applied Linguists (SEAL)

The observations in 4.2.1 point toward one conclusion: if language professionals want a search site that satisfies their needs for years to come, they will have to create and maintain it themselves. With this conviction I now outline a realistic path to this goal of a Search Engine for Applied Linguists (SEAL). While on sabbatical during the academic year 2004–05 I intend to start on this project and hope to report significant progress toward this goal at TaLC 2006.

An ideal Web search site for language learners and scholars would have to support the major written languages and character sets, and allow expansion to any additional language. The search engine would provide sophisticated querying capabilities to ensure highly relevant results, not only matching characters, but also parts of speech and even syntactic structures. Such a site would permit searches on any meaningful combination of wildcards and regular expressions,
which would be optimized for the character set of the target language.14 It also would furnish built-in language-specific “tamecards” to match morphological and orthographic variants. SEAL should not report merely how many webpages in the corpus contain a given form, but also calculate its total frequency and dispersion as well. While mainstream search engines match at the word level, ignoring the clues to linguistic and document structure contained in punctuation and HTML layout tags, our ideal site would also take such information into account. Above all, a search site for language professionals would stress quality and relevance of search results over quantity.

Real-world search sites are resource-hungry monsters. At the input end of the process, “robot” programs “crawl” or “spider” the Web, downloading webpages and adding their content to the search database. Links extracted from these documents point the way to other pages, which are spidered in turn. A “full” Web crawl involves transferring and storing many terabytes (roughly 10¹² characters) of data. When the webpage database is completed, indexed and optimized, the search site calls on it to attend to many thousands of user queries simultaneously, with a tremendous flow of data in the other direction. To perform their magic, major search sites boast batteries of thousands of computers, gigabytes of bandwidth, and terabytes of storage.

How can we academics hope to match their capabilities? Collectively we too have thousands of computers and gigabytes of bandwidth untapped when our learning laboratories and libraries are closed. Why not employ them to crawl and index the Web for a language-oriented search engine? A central server would coordinate the tasks and accumulate the results of these armies of distributed “crawlers.” The inspiration for this distributed approach comes from a project which processes signals from outer space with a screensaver running on volunteers’ desktops around the world; whenever one of the computers is idle, the program fetches chunks of data and starts crunching numbers. Researching the concept online, I discovered both a blueprint for a search engine with distributed robots spidering the Web (Melnik et al. 2001) and a Master’s thesis on Herodotus, a peer-to-peer distributed Web archiving system (Burkard 2002).15 Clearly we need not reinvent the wheel to implement SEAL, only adapt freely available open-source software to the specific requirements of our discipline.16
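The division of labour just sketched, with a central server handing out crawl assignments to otherwise idle machines and collecting what they fetch, can be pictured with a minimal Python worker loop. The coordinator address and its two endpoints are invented for this illustration; an actual SEAL crawler would adapt the open-source components cited above rather than this toy protocol.

```python
import json
import time
import urllib.parse
import urllib.request

COORDINATOR = "http://seal-coordinator.example.edu"   # hypothetical central server

def run_worker():
    """Idle-time crawler: request a batch of URLs from the coordinator,
    download each page, and post the raw HTML back for central indexing.
    The /next-batch and /submit endpoints are invented for this sketch."""
    while True:
        with urllib.request.urlopen(f"{COORDINATOR}/next-batch", timeout=30) as resp:
            batch = json.load(resp)          # e.g. {"urls": ["http://...", ...]}
        if not batch.get("urls"):
            time.sleep(300)                  # nothing assigned; check back later
            continue
        for url in batch["urls"]:
            try:
                html = urllib.request.urlopen(url, timeout=15).read()
            except OSError:
                continue                     # dead link; the coordinator can reassign it
            submit = f"{COORDINATOR}/submit?source={urllib.parse.quote(url, safe='')}"
            urllib.request.urlopen(urllib.request.Request(submit, data=html, method="POST"))

if __name__ == "__main__":
    run_worker()
```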
Once the basic search engine framework has been implemented and tested, the model could be extended to a further degree of “distributedness.” Separate servers hosted by different universities could each concentrate on a specific language or region, or else mirror content for local users to avoid overtaxing a single server. Local linguists would provide the language-specific expertise to create tamecards for morphological and orthographic variants, optimize regular expressions for the character set, and implement part-of-speech and syntactic tagging. Due to the relatively low volume of traffic, such sites could support sophisticated processing-intensive searches which are impractical on general-purpose search engines.17

The specialized nature and audience of a linguistic search engine cum archive would limit its exposure to litigation as long as the exact legal status of such services remains unclear. Indeed, since the goal of SEAL is to build a useful representative searchable sample of online documents, not to cover the Web comprehensively, some restrictions on content would be quite tolerable.
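What “optimizing regular expressions for the character set” might mean in practice can be shown with a small Python sketch for German, where an ASCII-only notion of word characters (the Anglo-centric behaviour noted in note 14) would split words at ä, ö, ü or ß. The character class and helper function are illustrative only, not part of any existing tool.

```python
import re

# Word-character class tuned to German orthography; engines that treat only
# ASCII letters as word characters would break words at umlauts and ß.
DE_CHAR = "A-Za-zÄÖÜäöüß"

def german_word_pattern(stem):
    """Match the stem at the start of a whole German word, so that inflected
    forms are captured but the stem is not found inside a longer word."""
    return re.compile(rf"(?<![{DE_CHAR}]){re.escape(stem)}[{DE_CHAR}]*")

pattern = german_word_pattern("schweizerisch")
text = "die schweizerische Regierung und schweizerischen Banken"
print(pattern.findall(text))   # ['schweizerische', 'schweizerischen']
```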
5. Alternative solutions

This section surveys existing resources comparable to those outlined above. The intention is to be descriptive, not judgemental: while a software application’s usefulness for a specific purpose should be gauged by its suitability for one’s goals, its success must be assessed only by how well it meets its own design objectives. The list of applications derives from variants on the question “How is your software x different from y?” Since the Web Corpus Archive and Search Engine for Applied Linguists are vapourware which may never achieve all that I intend to, I acknowledge that I am comparing an ideal concept to implemented facts.

Before a detailed discussion of the alternatives it is only fair to reveal my background, biases and intentions. Before programming a precursor of KWiCFinder in 1996, I spent 10 years designing, implementing and evaluating video-based multimedia courseware for foreign language instruction.18 The development cycle entailed extensive direct observation of users as well as analysis of their errors and their evaluations of the courseware. My criteria for a good user interface were heavily influenced by Alan Cooper, who preaches that software should make it impossible for users to make errors: errors are a failure of the programmer, not the user (1995, 423–40). Usability is a primary concern in all my software development projects. For instance, studies of online search behaviour such as Körber (2000), Jansen et al. (2000) and Silverstein et al. (1999), summarized in detail in Fletcher (2001a, b), reveal that most
users avoid complex queries (i.e., ones with multiple search terms joined by Boolean operators like AND, OR and NEAR), and those who do attempt them make errors up to 25% of the time, resulting in failed queries. Many features of KWiCFinder and subsequent applications address specific observed difficulties of students and other casual searchers in order to help them produce appropriate, well-formed queries.

As a teacher of Spanish and German, I sought a tool for my students and myself that could handle languages with richer morphology and greater freedom in word order than English. For example, while a typical English verb has only 4–5 variants, Spanish verbs have ten times that number of distinct forms. English sentences tend to be linear, but in German, syntactic and phraseological units are often interrupted by other constituents. In both languages webpage authors use diacritics inconsistently – Spanish-language pages may neglect acute accents, German pages may substitute ae for ä etc. and ss for ß (standard usage in Switzerland). Complex queries allowing matches with the Boolean operators NEAR / BEFORE / AFTER as well as NOT, AND and OR, tamecards for generating variant forms, and flexible character matching strategies are essential to studying these languages efficiently and effectively. None of the alternatives surveyed below offers the full range of Boolean operators and complex queries supported by KWiCFinder.

KWiCFinder was designed as a multipurpose application, to examine not just a short span of text for lexical or grammatical features, but also to assess document content and style when desired. As Stubbs (forthcoming) points out, the classical concordance line may provide too little context to infer the meaning and connotations of a word reliably. In a telling example he shows that the immediate context of many occurrences in the BNC of the phrase horde of appears to suggest neutral or even positive associations. The consistently negative connotations become obvious only after one examines a much larger amount of co-text. KWiCFinder’s options to specify any length of text to excerpt and to redisplay concordances in various layouts (paragraph and table as well as concordance line) allow the flexibility to examine either the immediate or the larger context.
5.1 Web concordancer alternatives to KWiCFinder

Here “Web concordancer” is not to be understood as a Web interface to a fixed corpus like Mark Davies’ (see Davies, this volume) “Corpus del español”
(http://corpusdelespanol.org), the Virtual Language Centre’s “Web Concordancer” (http://www.edict.com.hk/concordance/), or my “Phrases in English” site (http://pie.usna.edu), none of which features language from the Web; I designate the latter “online concordancers”. Rather, the former are Web agents which query search engines and produce KWiC concordances of webpages matching one’s search terms. The first two applications considered are commercial products, but the others were developed by and for linguists. Typically the software is installed on the user’s computer (KWiCFinder, Copernic, Subject Search Spider, TextSTAT, WebKWiC), but WebCorp and WebCONC run on a Web server and are accessed via a webpage, which makes them less daunting to casual users and avoids platform compatibility issues.

For concordancing, these applications follow one of three general strategies: client-side, server-side and search-engine-based processing. Client-side concordancers like KWiCFinder, Subject Search Spider and TextSTAT download webpages to the user’s computer for concordancing. With a slow or expensive connection this can be a significant disadvantage, but once downloaded the texts can be saved for subsequent examination and (re)analysis off line.19 The server-side approach shifts the burden of fetching and concordancing webpages to the WebCorp or WebCONC server. This requires far less data transfer to the user’s computer, but webpages of further interest must be fetched and saved individually by the user via the browser. Depending on the number of concurrent searches, these services can be slow or even unavailable. WebCorp does offer the option to send search results by e-mail, which prevents browser timeout and saves money for those with metered Internet access. One potential limitation of server-based processing is the unclear legality of a service which modifies webpages by excerpting them; client-side processing avoids any such risk. Search-engine-based concordancing is the fastest approach as it relies on the search engine’s existing document indices; for details of the implementations, see the descriptions of Copernic and WebKWiC below.
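The client-side strategy is simple enough to demonstrate in a few lines. The Python sketch below is a deliberately naive toy, not the approach of any of the products surveyed here: it downloads a single page, strips the markup crudely, and prints KWiC lines for a key word, merely making tangible what “downloading webpages to the user’s computer for concordancing” involves.

```python
import re
from urllib.request import urlopen

def kwic_from_url(url, keyword, width=40):
    """Toy client-side concordancer: fetch one page, strip markup,
    and return KWiC lines for every occurrence of the key word."""
    html = urlopen(url, timeout=15).read().decode("utf-8", "replace")
    text = re.sub(r"<script.*?</script>|<style.*?</style>", " ", html,
                  flags=re.S | re.I)            # drop script/style blocks
    text = re.sub(r"<[^>]+>", " ", text)         # drop remaining tags
    text = re.sub(r"\s+", " ", text)             # normalize whitespace
    lines = []
    for m in re.finditer(re.escape(keyword), text, flags=re.I):
        left = text[max(0, m.start() - width):m.start()].rjust(width)
        right = text[m.end():m.end() + width]
        lines.append(f"{left} [{m.group(0)}] {right}")
    return lines

for line in kwic_from_url("http://example.org/", "example"):
    print(line)
```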
Copernic (http://www.copernic.com) is a commercial meta-search agent which queries multiple search engines concurrently for a single word or phrase and produces a list of matching pages sorted by “relevance”. While very fast, its concordances are too short and inconsistent to be useful for linguistic research; they appear to derive from the excerpts shown in search engine results. Copernic includes excellent support for a wide range of languages. The free basic version of the software evaluated constantly reminded me of the many additional features available by upgrading to one of several pay-in-advance variants. These more sophisticated products may offer the flexibility to do serious KWiC concordancing of online texts, and the high-end version (not evaluated) produces text summaries which could be useful for efficient preview and categorization of online content.

Another commercial search product, Subject Search Spider (http://www.kryltech.com), produces KWiC concordances of the search terms in a paragraph layout. All features are available in the 30-day free trial download, including full control over the number of concordances per document and the amount of context to show. SSSpider supports 34 languages, virtually all those of Europe, in addition to Arabic, Chinese, Hebrew, Japanese and Korean, and can search usenet (newsgroups) as well as the Web.20 As with Copernic there are companion text summarization and document management suites available. One free product, SSServer, is deployed on a Web server, where it could easily be customized into an online concordancer for any of the languages supported.

WebCorp (http://www.webcorp.org.uk; Morley, Renouf and Kehoe 2003) from the University of Liverpool’s Research and Development Unit for English Studies has regularly added new features since its launch in 2000. While it offers but a single field for inputting search terms, its support for wildcards and “patterns” (similar to KWiCFinder’s tamecards) gives it flexibility in matching variant forms, and queries can be submitted to half a dozen different search engines to improve their yield. Up to 50 words of preceding and following context are shown, and options allow displaying any number of concordances per document (up to 200 webpages maximum are analyzed). WebCorp’s concordances give access to additional data analysis (e.g., type / token count, lists of word forms), and other tools are available on the site. Online newspapers can be searched by domain (e.g., UK broadsheet, UK tabloid, US), and searches can be limited to a specific Open Directory content domain. With the numerous choices WebCorp offers, its failure to provide a document language option seems inexplicable, since almost every search engine supports it. The user interface would benefit from client-side checking for meaningful, well-formed queries before submission to WebCorp; mistakes in a query can lead to long waits with no results and no explanation. Zoni (2003) describes WebCorp in greater detail and compares it with KWiCFinder.

Matthias Hüning’s WebCONC (http://www.niederlandistik.fu-berlin.de/cgi-bin/web-conc.cgi), another server-based Web concordancer, performs searches on Google and generates KWiC concordances of the search phrase
in paragraph layout. One can also copy and paste text for concordancing onto the search page. Options are minimal: target language, amount of context (maximum of 50 characters before / after the node!), and number of webpages to process (50 maximum, in practice fewer if some pages in the search results are inaccessible or do not match exactly). There is no provision for wildcards (not supported by Google) or pattern matching. Matches are literal, and all occurrences of a search string are highlighted in the results, even as a substring of a longer word. A punctuation mark after a word form is matched too, which can be useful, for example to find clause-final verb forms in German. The server can be slow and may even time out without producing any concordances. WebCONC could be far more useful if it offered more options for search and output format. Of greater potential interest is the author’s TextSTAT package (http://www.niederlandistik.fu-berlin.de/textstat/software-en.html), which can download and concordance both webpages and usenet postings. Programmed in Python, it runs on any standard platform (Windows, Macintosh, Unix / Linux). Hüning’s user license permits modification and redistribution of the software code, making TextSTAT an instructive example and valuable point-of-departure for a customized Web concordancer.

WebKWiC (http://kwicfinder.com/WebKWiC/; Fletcher 2001a, b) is a browser-based application that exploits a feature of Google’s search results: click on the “cache” link to see a version of a webpage from Google’s archives with the search terms highlighted. WebKWiC queries Google, parses the search results, fetches a page from Google’s cache, encodes the highlighted search terms to permit navigation from one instance of the search terms to the next, and displays the page in a new browser window. This “parasitic” approach with JavaScript and DHTML builds on core functionalities of Internet Explorer, works on multiple platforms, and supports any language known to Google. It could be extended to produce and display KWiC excerpts from webpages, or to download and save them in HTML, text or concordance format. A small set of webpages and scripts (70 KB installed), WebKWiC takes full advantage of all Google’s search options.
5.2 Alternative to the Web Corpus Archive

The Internet Archive (http://web.archive.org) “Wayback Machine” preserves many (but by no means all) webpages back to 1996. Archived sites are represented by a selection of their pages and graphics in snapshots taken every
few months. For example, a visit to the Archive reminded me that KWiCFinder was not publicly downloadable until November 1999, and it helped me reconstruct the introduction and evolution of WebCorp. The archive is not searchable by text, only by URL. The ability to step back in time, for example to retrieve a webpage cited in this paper which has since disappeared from the Web, is complemented by comparison of various versions of the same webpage, with the differences highlighted. In contrast to the Internet Archive, the WCA proposed here will not aim to preserve the state of the entire Web, only to ensure immediate text-searchable access to pages which support a user-uploaded “kibbitzoid” search analysis or to documents indexed in its Search Engine for Applied Linguists.
5.3 Alternatives to the Search Engine for Applied Linguists

GlossaNet (http://glossa.fltr.ucl.ac.be) analyzes text from over 100 newspapers in eleven languages, providing both more and less than a linguistic search engine as I conceive it. Originally a monitoring tool to track emerging lexical developments (Fairon and Courtois 2000), GlossaNet now offers both “instant search” of the current day’s newspapers with results in a webpage and “subscription search” (after free registration), which re-queries each daily crop of newspapers and e-mails the results at regular intervals. Concordance lines display 40 characters to the left and right of the node. Clicking on the node displays the original newspaper article with the search terms highlighted, but this feature may be unavailable: an error message warns that most articles are accessible only on the day of publication. Queries can be formulated as any combination of word form, lemma, “regular expression” (less than the name suggests), or word class and morphology, or else as a Unitex “finite state graph” (not documented on the site; manual in French and Portuguese at http://www-igm.univ-mlv.fr/~unitex/).

GlossaNet has its limitations: it is restricted to a single genre, newspaper texts, and to the rather small pool (in comparison to the Web) of one day’s newspapers; searches cannot be replicated on another day, and results may not be verifiable in the context of the original article; syntactic analysis and lemmatization can be faulty; search results do not show the grammatical annotation, so the users cannot learn to tailor their queries to the idiosyncrasies of the analysis engine; documentation is minimal. Clearly it has strengths as well compared to KWiCFinder or WebCorp: the ability to search by syntactic or morphological category can eliminate large
numbers of irrelevant hits; “instant search” delivers results almost immediately; “subscription search” permits monitoring of linguistic developments in manageable increments; newspaper texts are generally reliable, authoritative linguistic sources.

The Linguist’s Search Engine (LSE, http://lse.umiacs.umd.edu:8080) arrived on the scene in January 2004 as a tool for theoretical linguists to test their intuitions by “treating the Web as a searchable linguistically annotated corpus” (Resnick and Elkiss 2004). At its launch LSE had a collection of about 3 million English sentences, a number bound to increase rapidly. The source of these Web documents is the Internet Archive, which ensures their continued availability. New users will likely start with the powerful “Query by Example” feature: enter a sentence or fragment to match, then click “Parse” to generate both a tree and a bracketed representation of the example sentence. LSE uses a non-controversial Penn Treebank-style syntactic constituency annotation readily accessible to most linguists. Queries can be refined in either the graphical tree or the text bracketed representation. For example, I entered “He is not to be trusted”, which yielded this parse in bracketed notation: (S1 (S (NP (PRP He)) (VP(AUX is) (S (RB not) (VP (TO to) (VP (AUX be) (VP (VBN trusted)))))))). After being made more general in the tree editor, the bracketed query (S1(S NP (VP(AUX be )(S(RB not )(VP(TO to )(VP(AUX be )(VP VBN )))))))
matched 76 sentences with comparable constructions such as “Any statements made concerning the utility of the Program are not to be construed as express or implied warranties.” and “In clearness it is not to be compared to it.” LSE’s concordances can be displayed or downloaded in CSV format for importation into a database or spreadsheet, and the original webpages can be retrieved from the Internet Archive for examination.

While such linguistic search of a precompiled Web corpus via an intuitive user interface is impressive, the LSE really advances Web searching by exploiting this functionality to locate examples matching lexical and syntactic criteria with the AltaVista search engine. The user submits a query to AltaVista and LSE fetches the corresponding webpages, parses them, and filters out the ones that fail to meet the user’s structural criteria. Retrieval and analysis are surprisingly rapid. Queries, their outputs, and the original webpages can be saved in personal collections for later analysis and refinement. The tools can also analyze corpora uploaded from the user’s computer.
Despite the LSE’s impressive power and usability, it does not fulfil all the needs the SEAL intends to address. Above all it supports only English, and there are no plans to add other languages except possibly in parallel corpora searchable via the annotation of the corresponding English passages (Resnick, personal communication), while SEAL will start with the major European languages, establish a transferable model for branching out into other language families. LSE is aimed at theoretical linguists seeking to test syntactic hypotheses who are sufficiently motivated to master a powerful but complex system. In contrast, SEAL’s target audience is more practically oriented, including language professionals such as instructors, investigators and developers of teaching materials, translators, lexicographers, literary scholars, and advanced foreign language learners as well as linguists. Many in these groups could be overwhelmed by a resource that requires too much linguistic or technical sophistication at the outset. SEAL will offer tools to leverage users’ familiarity with popular search engines and nurture them along the path from word and phrase search to queries that match specific content domains, phrases structures and sentence patterns as well. As an incrementally implemented companion to the Web Corpus Archive, it will benefit both from analysis of search behaviors and use patterns and from direct user feedback. Aer comparing future plans, Resnick and I have determined that LSE and SEAL will complement rather than compete with each other.
6. Web search resources in language teaching and learning

Suggestions for language teachers and learners to use these tools are surveyed here. Specific examples of instructor-developed learning activities focussing on the levels of word, phrase and grammar are based on my experience teaching beginning and intermediate German and Spanish. Open-ended learner-directed techniques to develop critical searching skills and to encourage writing by example are also described. While some of these tasks could be performed without the specialized software described here, these tools make the process more effective and familiarize the students with valuable research tools and techniques applicable to other disciplines as well.

Since 1996 the Grammar Safari site (http://www.iei.uiuc.edu/web.pages/grammarsafari.html) has been a popular resource on the Web, linked to and expanded on by over 2000 other sites. It offers tutorials and a set of assignments
for learners of English to hunt for and analyze grammatical and rhetorical structures in online documents. The technique entails querying a search engine, retrieving webpages individually, and finding the desired forms on the page. One of the Web concordancers surveyed above could easily automate the mechanics of such activities, leaving more time for analysis and discussion of the examples. Familiarizing learners with an efficient approach to a beneficial but tedious task will encourage them to apply it even when not directed to do so.

Grammatical and lexical exploration can also be based on instructor-prepared mini-corpora. KWiCFinder allows search results to be saved as webpages with self-contained interactive concordance tools which can be used profitably with students. For example, to contrast the German passive auxiliary wurde with the subjunctive auxiliary würde, I assign small groups of students (2–3 per computer) to explore, then describe the grammatical context (e.g., they co-occur with past participle and infinitive respectively) to the class. As instructor I clarify the meaning and use of the structures by translating representative examples. These few minutes spent on “grammar discovery” prepare the students to understand and retain the textbook explanation better.

Recently an in-class KWiCFinder search demonstrated to my students how actual usage can differ from textbook prescription. In a geographical survey of the German-speaking countries I explained that the usual adjective for “Swiss” in attributive position is the indeclinable Schweizer; a student pointed out that our textbook listed only schweizerisch. A pair of KWiCFinder searches rapidly clarified the situation: while forms of the latter typically modified the names of organizations and government institutions, the former was obviously both far more frequent and more general in use.

Students can be assigned similar ad-hoc discovery activities in response to recurrent errors or to supplement the textbook. For example, it is instructive for a learner studying French prepositions to discover that merci à / pour parallel English “thanks to / for”, while merci de + infinitive corresponds to English “thanks for” + -ing. A search of the BNC illustrates the advantage of a bottomless corpus like the Web: this English construction occurs only 53 times in this huge corpus, and could well be lacking entirely in a smaller one. Ideally, after assigned tasks such as this, learners will develop the habit of formulating and verifying usage by example rather than resorting to Babelfish or another online translation engine.

Frand (2000) summarizes what he calls the “mindset” of Information-Age students. Their behaviour with an unfamiliar website or software package typically
exhibits more action than reflection; learning by trial and error replaces systematic preparation and exploration (“Nintendo over logic”). To encourage development of “premeditated” searching habits, I assign students a written pre-search exercise before they undertake open-ended Web-based research for a report or essay. They jot down variants of key words and phrases likely to occur on webpages in contexts of interest for their topic, as well as additional terms that can help restrict search results to relevant webpages.21 This written exercise forces thought to precede action and allows the group to brainstorm about additional possible search terms and variants. Then they search for and evaluate a number of webpages in writing with a checklist based on Barton 2004. Finally, they re-search the sites deemed most useful in order to find additional appropriate content. Without these paper-and-pencil exercises, students tend to choose from the first few hits for whatever search term occurs to them. A concordancing search agent greatly accelerates evaluating webpages for content, reliability, and linguistic level.

One venerable stylistic technique I attempt to pass on to my students is imitatio (not plagiarism!), the study and emulation of exemplary (or at least native speaker) texts in creative work. In major languages the Web is a generous source of texts on almost any topic. After locating appropriate webpages, advanced learners can immerse themselves in the style and language of the content domain they are dealing with before preparing compositions or presentations. This concept parallels translation techniques outlined by Zanettin (2001) based on ad-hoc corpora from the Web. It is a powerful life-long foreign-language communication strategy which builds knowledge as well as linguistic skills.

When the WCA and SEAL and comparable resources become a reality, they will further accelerate the tasks surveyed here. Querying a single archive will be far faster than fetching and excerpting documents from around the Web. Searching a large body of selected documents by content domain and / or grammatical structure will yield a higher percentage of useful hits than the current query-by-word-form approach. User-submitted kibbitzers will supply ready illustration and explanation for linguistic questions and problems (e.g., the wurde / würde and Schweizer / schweizerisch examples above).22 Finally, the linguistic annotation provided by SEAL will help motivated students gain greater insight into grammar.

Admittedly, most of the techniques discussed here are feasible with static corpora as well. By the same token, most applications of corpus techniques to
language learning (surveyed in Lamy and Mortensen 2000) could be adapted to Web concordancing instead. The size and comprehensive coverage of the Web are powerful arguments for this approach, as is the availability of free tools with a consistent, adaptable user interface for exploring everything from linguistic form to document content. If we can acquaint our students with responsible online research techniques and instil in them a healthy dose of scepticism toward their preferred information source, we will have accomplished far more than teaching them a language.23
7. Caffè e grappa oppure limoncello

In this paper we have considered a wide range of challenges and solutions to exploiting the Web as a (source of) linguistic corpus. Such dense, heavy fare leaves us much to digest. Let’s linger over caffè and grappa or limoncello to discuss these ideas – after all, this is not just a declaration of intent, but an invitation to a dialogue.

These proposals outline an incremental approach to implementing the solutions which will yield useful results at every milestone along the way – searchers with an immediate information need should not have to delay gratification as a programmer must. The Web Corpus Archive proposed here will give direct search results, if not the first time, then at least when a query is submitted on subsequent occasions. Posted KWiCFinder search report kibbitzers can exemplify techniques for finding the forms or information one requires, much as successful recipes from a pot-luck supper continue to enrich the table of those who adopt them. Building on the infrastructure of this archive, the Search Engine for Applied Linguists sketched here will afford rapid targeted access to an ever-expanding subset of the Web. In the process, all three information-gathering strategies will be served: hunters will profit from a precision search tool, grazers will be able to locate rich pastures of related documents, and browsers will enjoy increased likelihood of serendipitous finds.

As other linguists join in the proposed cooperative effort, the search engine’s scope can be extended well beyond European languages. Initially, outside funding may be required to establish the infrastructure, but ultimately this plan will be sustainable with resources from the participating institutions. With time, the incomparable freshness, abundant variety and comprehensive coverage added by this Web corpus-cum-search engine will make
it an indispensable complement to the more reliable canned corpora for a “pick’n’mix” approach. Linguists and language learners alike will benefit from examples which clarify grammatical, lexical or cultural points. Foreign language instructors and translators will find a concentrated store of useful texts for instructional materials and translation by example. New software tools will integrate the Web and the desktop into a powerful exploratory environment. The steps outlined here will lead toward fulfilling the Web’s promise as a linguistic and cultural resource.
Notes

1. Research for this paper was supported in part by the Naval Academy Research Council.

2. Ad-hoc corpora – also designated as “disposable” or “do-it-yourself” corpora – are compiled to meet a specific information need and typically abandoned once that need has been met (see e.g., Varantola 2003 and Zanettin 2001).

3. Figures from September 2003 (http://www.global-reach.biz/globstats/, visited 26 February 2004), which estimates the online population of native speakers of English and of other European languages at 35% each, while speakers of other languages total about 30%. These numbers contrast sharply with the late 1990s, when English speakers comprised over three-quarters of the world’s online population.

4. The number of linguists exploiting the Web as a linguistic corpus (beyond the casual “let’s see how many hits I can find for this on Google”) is growing. Kilgarriff and Grefenstette (2003) survey numerous papers and projects in this field. Other representative examples of applying Web data to specific linguistic problems include Banko and Brill (2001), Grefenstette (1999), and Volk (2002). Brekke (2002) and Fletcher (2001a, b) discuss the pitfalls and limitations of the Web as a corpus. Finally, researchers like De Schryver (2002), Ghani et al. (2001) and Scannell (2004) demonstrate the importance of the Web for compiling corpora of minority languages for which other electronic and even print sources are severely limited.

5. Knut Hofland’s Norwegian newspaper corpus (Hofland 2002) follows a “grazing” strategy to “harvest” articles daily from several newspapers. Using material from a limited number of sites offers several advantages: permission and cooperation for use of texts can be secured; recurring page layouts help distinguish novel content from “boilerplate” materials automatically; the texts’ genre and content domain are predictable, and their authorship, representativeness and reliability can be established. Similarly, GlossaNet (http://glossa.fltr.ucl.ac.be), described in greater detail below, monitors and analyzes text from over 100 newspapers in nine languages, but does not archive them for public access.

6. “Paid positioning” and other “revenue-stream enhancers” may put advertisers’ webpages at the top of the search results. The link popularity ranking strategy exemplified by Google
– webpages to which more other sites link are ranked before relatively unknown pages – can mask much of the Web’s diversity by favouring well-known sites.

7. KWiCFinder is available free online from http://KWiCFinder.com. First demonstrated at CALICO 1999 and available online since later that year, it is described in far greater detail in Fletcher 2001b.

8. For a discussion of features of the interactive search reports, refer to http://kwicfinder.com/KWiCFinderKWiCFeatures.html and http://kwicfinder.com/KWiCFinderReportFormats.html.

9. I am indebted to Michael Friedbichler of the University of Innsbruck for this observation and for fruitful discussions of various issues from the user’s perspective.

10. In an interview (Koman 2002) Internet Archive founder Brewster Kahle brushes aside a question about copyright, insists that it is legal and implies that the Internet Archive had never had problems with any copyright holder (subsequent lawsuits nullify that implied claim). The Archive’s terms of use and copyright policy also assert the legality of archiving online materials without prior permission (http://archive.org/about/terms.php [visited 8 October 2004]). Apparently such assertions are based on Title 17 Chapter 5 Section 512 of the US Digital Millennium Copyright Act (DMCA, http://www4.law.cornell.edu/uscode/17/512.html [visited 28 February 2004]), which authorizes providers of online services to cache and retransmit online content without permission from the copyright owner under specific conditions, which include publishing “takedown” procedures for removing content when notified by the owner and leaving the original content unmodified. (Extensive discussion and documentation of these and related issues are found on the websites Chilling Effects http://www.chillingeffects.org/dmca512/ and the Electronic Frontier Foundation http://www.eff.org.) Excerpting KWiC concordances from a webpage clearly constitutes modification, as does highlighting of search terms in a cached version, both services provided by Google and other search engines. Two legal experts I have consulted who requested anonymity find no authorization in US copyright law for these accepted practices, but case law seems to have established and reinforced their legitimacy. Obviously the legal status of these practices under US law has little bearing on the situation in other countries, whose statutes and interpretation may be more or less restrictive.

11. In the United States, a KWiC concordance of webpages appears to fall under the fair-use provisions of copyright law as well. Crews (2000) and Hilton (2001) both argue for more liberal interpretations of this law than that found in the typical academic institution’s copyright policy. I am seeking an official ruling from my institution’s legal staff before establishing the WCA on servers at USNA. If our lawyers do not authorize exposing the Academy to possible risk in this gray area, I can implement the WCA on my KWiCFinder.com website. As a “company” KWiCFinder has neither income nor assets, making it an unlikely target for litigation.

12. An approach proposed by Kilgarriff (2001), the Distributed Data Collection Initiative, would create a virtual online corpus: a classified collection of links to relevant webpages would compile subcorpora from webpages retrieved from their home sites on demand and
serve them to users; as pages disappear they would be replaced by others with comparable content. This alternative avoids liability for caching implicitly copyrighted documents, but it does not provide an instantly searchable online corpus, nor does it guarantee availability of the original data for verification and further analysis.

13. When this was written, AltaVista was the only large-scale international search engine that supported wildcards and the complex queries necessary for efficient searching. Originally a technology showcase for Digital Equipment Corporation, it passed from one corporation to another over the years. In March 2004, the latest owner Yahoo dropped support for wildcards on the AltaVista site and apparently ceased maintaining a separate database for AltaVista. These developments reinforce my point that linguists must establish their own search engine to ensure that their needs will be met.

14. “Regular expressions” are powerful cousins of wildcards which allow precise matching of complex patterns of characters. Unfortunately most implementations are Anglo-centric and thus ignore the fact that characters with diacritics can occur within word boundaries. Regular expression pattern-matching engines could be optimized for specific languages by matching only those characters expected to occur in a given language.

15. Links to these and other resources related to the concepts proposed here are on http://kwicfinder.com/RelatedLinks.html. In early 2003, LookSmart, a large commercial search engine provider, acquired Grub, a distributed search-engine crawling system. Now almost 21,000 volunteers have a Grub client screensaver which retrieves and analyzes webpages, thus helping LookSmart to increase the coverage and maintain the freshness of its databases. (http://looksmart.com [visited 19 June 2004]; http://grub.org [visited 19 June 2004]; http://wisenut.com [visited 19 June 2004]).

16. Open-source software is developed cooperatively and distributed both freely and free. Specific open-source technologies proposed here are the “LAMP platform”: Linux operating system, Apache web server, MySQL database, PHP and / or Perl scripting, all of which cost nothing and run competently on standard desktop PCs costing at most a few hundred dollars. Storage costs have dropped well below 50 cents a gigabyte and are set to plummet as new terabyte technologies are introduced in a few years. The expertise required to develop and maintain a search site encompasses Web protocols, database programming, and server- and client-side scripting, all skills typically available at universities.

17. For example, due to the high processing requirements, Google – currently the most popular search engine in the world by far – does not support any wildcards, and even AltaVista restricts them severely.

18. My experience negotiating rights to incorporate authentic video into multimedia courseware explains my hypersensitivity to copyright issues.

19. KWiCFinder provides the option to save the Web document files automatically in original HTML and / or text format for later analysis by a full-featured concordancer like WordSmith or MonoConc.

20. SSSpider’s heuristics for determining the language of the source text are not entirely reliable: a search for pages in Afrikaans returned many Dutch pages; after switching to
a search term that does not exist in Dutch, I got pages in French and Romanian as well as Afrikaans.

21. KWiCFinder’s inclusion and exclusion criteria are terms which help narrow a search but are not concordanced in the search results. For example, in a search for TaLC, words like “corpus”, “corpora”, “language” and “linguistics” are good discriminators of relevant texts, while “powder” and “talcum” are likely to appear on irrelevant webpages.

22. Perhaps Philip King’s term “kibbitzoids”, premiered at TaLC 5 in Bertinoro (2002), is more appropriate, as these are not strictly speaking what Tim Johns means by kibbitzers.

23. As Frand (2000:16) puts it, “Unfortunately, many of our students do believe that everything they need to know is on the Web and that it’s all free.”
References

Aston, G. 2002. "The learner as corpus designer". In Teaching and Learning by Doing Corpus Analysis: Proceedings of the Fourth International Conference on Teaching and Language Corpora, Graz 19–24 July, 2000 [Series Language and Computers, Vol. 42], B. Kettemann and G. Marko (eds), 9–25. Amsterdam: Rodopi.
Banko, M. and Brill, E. 2001. "Scaling to very very large corpora for natural language disambiguation". ACL–01. Online: http://research.microsoft.com/~brill/Pubs/ACL2001.pdf [visited 2.3.2004].
Barton, J. 2004. "Evaluating web pages: Techniques to apply and questions to ask". Online: http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/Evaluate.html [visited 1.3.2004]. Evaluation form online: http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/EvalForm.pdf [visited 1.3.2004].
Brekke, M. 2000. "From the BNC toward the cybercorpus: A quantum leap into chaos?" In Corpora Galore: Analyses and Techniques in Describing English: Papers from the Nineteenth International Conference on English Language Research on Computerised Corpora (ICAME 1998), J. M. Kirk (ed.), 227–247. Amsterdam and Atlanta: Rodopi.
Burkard, T. 2002. "Herodotus: A peer-to-peer web archival system". Master's thesis, Massachusetts Institute of Technology, Cambridge, MA. Online: http://www.pdos.lcs.mit.edu/papers/chord:tburkard-meng.pdf [visited 8.10.2004].
Burnard, L. and McEnery, T. (eds). 2000. Rethinking Language Pedagogy from a Corpus Perspective: Papers from the Third International Conference on Teaching and Language Corpora. Frankfurt am Main: Peter Lang.
Cooper, A. 1995. About Face: The Essentials of User Interface Design. Foster City, CA: IDG Books.
Crews, K. D. 2000. "Fair use: Overview and meaning for higher education". Online: http://www.iupui.edu/~copyinfo/highered2000.html [visited 8.10.2002].
De Schryver, G. M. 2002. "Web for/as corpus: A perspective for the African languages". Nordic Journal of African Studies 11(2):266–282. Online: http://tshwanedje.com/publications/webtocorpus.pdf [visited 26.2.2004].
Fairon, C. and Courtois, B. 2000. "Les corpus dynamiques et GlossaNet: Extension de la couverture lexicale des dictionnaires électroniques anglais". JADT 2000: 5es Journées Internationales d'Analyse Statistique des Données Textuelles. Online: http://www.cavi.univ-paris3.fr/lexicometrica/jadt/jadt2000/pdf/52/52.pdf [visited 18.2.2003].
Fletcher, W. H. 2001a. "Re-searching the web for language professionals". CALICO, University of Central Florida, Orlando, FL, 15–17 March 2001. PowerPoint online: http://www.kwicfinder.com/Calico2001.pps [visited 2.3.2004].
Fletcher, W. H. 2001b. "Concordancing the web with KWiCFinder". American Association for Applied Corpus Linguistics, Third North American Symposium on Corpus Linguistics and Language Teaching, Boston, MA, 23–25 March 2001. Online: http://kwicfinder.com/FletcherCLLT2001.pdf [visited 8.10.2004].
Fletcher, W. H. 2002. "Making the web more useful as a source for linguistic corpora". American Association for Applied Corpus Linguistics Symposium, Indianapolis, IN, 1–3 November 2002. Online: http://kwicfinder.com/FletcherAAACL2002.pdf [visited 25.8.2003].
Fletcher, W. H. 2004. "Phrases in English". Online database for the study of English words and phrases at http://pie.usna.edu [visited 26.2.2004].
Frand, J. 2000. "The Information-Age mindset: Changes in students and implications for higher education". EDUCAUSE Review 35(5):14–24. Online: http://www.educause.edu/pub/er/erm00/articles005/erm0051.pdf [visited 29.2.2004].
Ghani, R., Jones, R. and Mladenic, D. 2001. "Using the web to create minority language corpora". 10th International Conference on Information and Knowledge Management (CIKM–2001). Online: http://www.cs.cmu.edu/~TextLearning/corpusbuilder/papers/cikm2001.pdf [visited 7.7.2004].
Grefenstette, G. 1999. "The World Wide Web as a resource for example-based machine translation tasks". Online: http://www.xrce.xerox.com/research/mltt/publications/Documents/P49030/content/gg_aslib.pdf [visited 12.10.2001].
Hilton, J. 2001. "Copyright assumptions and challenges". EDUCAUSE Review 36(6):48–55. Online: http://www.educause.edu/ir/library/pdf/erm0163.pdf [visited 8.10.2002].
Hofland, K. 2002. "Et Web-basert aviskorpus". Online: http://www.hit.uib.no/aviskorpus/ [visited 8.10.2004].
Jansen, B. J., Spink, A. and Saracevic, T. 2000. "Real life, real users, and real needs: A study and analysis of user queries on the web". Information Processing and Management 36(2):207–227.
Johns, T. F. 2001. "Modifying the paradigm". Third North American Symposium on Corpus Linguistics and Language Teaching, Boston, MA, 23–25 March 2001.
Kilgarriff, A. 2001. "Web as corpus". In Proceedings of the Corpus Linguistics 2001 Conference [UCREL Technical Papers 13], P. Rayson, A. Wilson, T. McEnery, A. Hardie and S. Khoja (eds), 342–344. Lancaster: Lancaster University. Online: http://www.itri.bton.ac.uk/~Adam.Kilgarriff/PAPERS/corpling.txt [visited 8.10.2004].
Kilgarriff, A. and Grefenstette, G. 2003. "Introduction to the special issue on the web as corpus". Computational Linguistics 29(3):333–347. Online: http://www-mitpress.mit.edu/journals/pdf/coli_29_3_333_0.pdf [visited 7.1.2004].
Körber, S. 2000. "Suchmuster erfahrener und unerfahrener Suchmaschinennutzer im deutschsprachigen World Wide Web: ein Experiment". Unpublished master's thesis, Westfälische Wilhelms-Universität Münster, Germany. Online: http://kommunix.uni-muenster.de/IfK/examen/koerber/suchmuster.pdf [visited 9.1.2004].
Koman, R. 2002. "How the Wayback Machine works". Online: http://www.xml.com/pub/a/ws/2002/01/18/brewster.html [visited 8.10.2004].
Lamy, M. N. and Mortensen, H. J. K. 2000. "ICT4LT Module 2.4. Using concordance programs in the modern foreign languages classroom". Information and Communications Technology for Language Teachers. Online: http://www.ict4lt.org/en/en_mod2-4.htm [visited 1.3.2004].
Melnik, S., Raghavan, S., Yang, B. and García-Molina, H. 2001. "Building a distributed full-text index for the web". WWW10, 2–5 May 2001, Hong Kong. Online: http://www10.org/cdrom/papers/275/index.html [visited 8.10.2004].
Morley, B., Renouf, A. and Kehoe, A. 2003. "Linguistic research with XML/RDF-aware WebCorp tool". Online: http://www2003.org/cdrom/papers/poster/p005/p5-morley.html [visited 19.2.2004].
Pearson, J. 2000. "Surfing the Internet: Teaching students to choose their texts wisely". In Burnard and McEnery, 235–239.
Resnik, P. and Elkiss, A. 2004. "The Linguist's Search Engine: Getting started guide". Online: http://lse.umiacs.umd.edu:8080/lse_guide.html [visited 23.1.2004].
Scannell, K. P. 2004. "Corpus building for minority languages". Online: http://borel.slu.edu/crubadan/ [visited 19.3.2004].
Silverstein, C., Henzinger, M., Marais, H. and Moricz, M. 1999. "Analysis of a very large web search engine query log". SIGIR Forum 33(3). Online: http://www.acm.org/sigir/forum/F99/Silverstein.pdf [visited 26.2.2004].
Smarr, J. 2002. "GoogleLing: The web as a linguistic corpus". Online: http://www.stanford.edu/class/cs276a/projects/reports/jsmarr-grow.pdf [visited 7.7.2004].
Stubbs, M. forthcoming. "Inferring meaning: Text, technology and questions of induction". In Aspects of Automatic Text Analysis, R. Köhler and A. Mehler (eds). Heidelberg: Physica-Verlag.
Varantola, K. 2003. "Translators and disposable corpora". In Corpora in Translator Education, F. Zanettin, S. Bernardini and D. Stewart (eds), 55–70. Manchester: St Jerome.
Volk, M. 2002. "Using the web as corpus for linguistic research". In Tähendusepüüdja: Catcher of the Meaning. A Festschrift for Professor Haldur Õim [Publications of the Department of General Linguistics 3], R. Pajusalu and T. Hennoste (eds). Tartu: University of Tartu. Online: http://www.ifi.unizh.ch/cl/volk/papers/Oim_Festschrift_2002.pdf [visited 7.7.2004].
Zanettin, F. 2001. "DIY corpora: The WWW and the translator". In Training the Language Services Provider for the New Millennium, B. Maia, H. Haller and M. Ulrych (eds), 239–248. Porto: Faculdade de Letras, Universidade do Porto. Online: http://www.federicozanettin.net/DIYcorpora.htm [visited 16.10.2004].
Zoni, E. 2003. "e-Mining – Software per concordanze online". Online: http://applicata.clifo.unibo.it/risorse_online/e-mining/e-Mining_concordanze_online.htm [visited 15.1.2004].
Index

A
Aarts 67, 78, 80, 82
ad-hoc corpus see corpus
AltaVista 277, 278, 281, 297
Altenberg 116, 208
Apple Pie Parser (APP) see NLP tools
Appraisal 5, 125, 126, 128, 130, 132
authenticity 11, 14, 152, 200
autonomy
  learner 247-249, 251
awareness 171, 248
  cultural/social 172
  discourse 170, 172, 182
  language 170, 172, 174, 176, 252
  literary 172
  methodological/metatheoretical 170, 172, 185

B
Bank of English 1, 33
Benjamin 259, 261, 267
Bernardini 234
Biber 36
Bilogarithmic Type-Token Ratio (BTTR) see statistical measures
Boolean operators 285
Borin 3
Brill tagger see NLP tools
British National Corpus (BNC) 1, 4, 7, 12, 13, 33, 52, 53, 56, 113, 127, 153, 155, 158-162, 281, 285, 292
British National Corpus (BNC) Sampler 3, 4, 67, 72, 81, 89, 92, 93
Brown 1
Burston 234
Butt 259, 261, 267

C
Cantos-Gómez 9
characterization 175
ChaSen see NLP tools
child language 138, 140
Chinese/English 127
Chipere 5
chi-square see statistical tests
Chomsky 22, 39, 47
classroom concordancing see concordancing
CLAWS tagger see NLP tools
cluster analysis see statistical tests
Cobuild 6
cohesion 21, 34-37, 252
colligation 27, 28, 30, 31, 37
collocation 23, 28, 31, 37, 161
Comlex Lexicon 55-58, 60, 61
Compara 8, 213, 223, 225, 228
comparable corpus see corpus
Computer-Assisted Language Learning (CALL) 71
concordancing 235, 244, 286-289, 292, 293
  classroom use 216, 233, 234
  parallel 213, 214, 217, 226, 227
  self-access 216
  strategies 239, 243
  tool 280, 285
connotation 129, 130
constituency 252
construction grammar 22
Contrastive Analysis (CA) 70, 216
Contrastive Interlanguage Analysis (CIA) 69, 217
Cook 17, 154
Copernic see web concordancer
corpus
  ad-hoc 274, 279, 295
  apprentice writing 125-127, 130-133, 137, 143
  children's writing 137, 143
  ELT 18, 57
  learner 69, 78, 109, 119
    interlanguage 46
    L1 reference 46
    target 46
  literary 173, 184, 191
  parallel 183, 218, 223, 227
  professional writing 125-127, 130-133
  reference 1, 14
  representativeness 81, 82
  spoken 196, 205, 208
  web archive (WCA) 273, 280, 281, 284, 288, 289, 291, 293, 294, 296
  comparable 83
Corpus de Referencia del Español Actual (CREA) 259, 261-264, 266, 267
Corpus del Español 9, 259, 262-264, 266, 285
Corpus Diacrónico del Español (CORDE) 261-264, 266, 267
Corrected Type-Token Ratio (CTTR) see statistical measures
Critical Discourse Analysis (CDA) 174
cultural/social awareness see awareness

D
D measure see statistical measures
Dagneaux 68
data-driven learning (DDL) 9, 16, 242, 248
Davies 9
demonstratives 89-98, 103-106
discourse awareness see awareness

E
Ellis 68
ELT corpus see corpus
ELT materials 3, 4, 6, 7, 18, 51, 52, 89, 91, 106, 112, 113, 151, 153, 156
Emmott 35
English as Lingua Franca (ELF) 6, 14, 106-208
English as Lingua Franca in Academic Settings (ELFA) Corpus 207
English Norwegian Parallel Corpus (ENPC) 223
error analysis 68
evaluation 125, 129, 133
Evoking lexis 128-133

F
feedback 247, 252
Finnish/English 7
Firth 26, 27
Fletcher 11
form focus 247, 252
formulaic sequences 205
Frankenberg-Garcia 8
Freiburg London Oslo Bergen (FLOB) corpus 80
Frown Corpus 80

G
Gellerstam 78
German 278, 285, 288, 291, 292
German Corpus of Learner English (GeCLE) 114
German English as a Foreign Language Textbook Corpus (GEFL TC) 155-162
German/English 4, 110, 112, 152, 153, 155, 158
Gledhill 33, 34
Gleitman 47
GlossaNet 289, 295
Goldberg 47
Google 259, 261-264, 266, 267, 277, 281, 287, 289, 295
Grammar Safari 291
grammar-translation approach 215
Granger 2, 67, 69, 78, 80, 82, 116, 154, 208
Grimshaw 47
Grosz 35
Guardian, The 29, 33, 36

H
Halliday 27
Hawthorne, Nathaniel 183, 184
Hoey 10-12
Hofland 217
HTML 279, 280, 283, 288
Hunston 7

I
ICAME 1
ideational function 129
idiom principle 205
if-clauses 7, 158, 162
Inagaki 49, 50
Inscribed lexis 128-133
interference 68, 71, 104, 217, 218
interlanguage (IL) 68
International Corpus of English (ICE) 1
International Corpus of Learner English (ICLE) 1, 4, 52, 53, 69, 78, 84, 109, 112
International Sample of English Contrastive Texts (INTERSECT) Corpus 223
Internet Archive 281, 282, 288-290, 296
interpersonal function 125, 126, 178
IPAL Electronic Dictionary Project 57

J
Japanese EFL Learner Corpus (JEFFL) 3, 46, 50, 51
Japanese/English 3, 45, 48, 52, 55, 57, 60
Johansson 217
Johns 16, 242, 281

K
Kettemann 7
keyword analysis 127, 128, 130-133
kibbitzers 281
KwiCFinder see web concordancer

L
language awareness see awareness
language production 219, 220
language reception 219, 221, 227
learner autonomy see autonomy
learner corpus see corpus
learner, analytical 205
learner, holistic 205
Leńko-Szymańska 4
Levin 57
lexical density 252
lexical item 15, 160
Linguist's Search Engine (LSE) 290, 291
literary awareness see awareness
literary corpus see corpus
London, Jack 183, 184, 195
London Oslo Bergen (LOB) corpus 1
Louvain Corpus of Native English Essays (LOCNESS) 53, 78, 82, 84
Louw 25

M
Malvern 5, 139, 141, 142
Mann-Whitney (or U) test see statistical tests
Marco 7
Martin 128, 129
Mauranen 14
McEnery 69
Mean Segmental Type-Token Ratio (MSTTR) see statistical measures
methodological/metatheoretical awareness see awareness
Michigan Corpus of Academic Spoken English (MICASE) 6, 7, 197, 198, 208
Monoconc 197
Montrul 49
Morley 35

N
Nesselhauf 4, 14
newspaper text 29, 33, 36, 289, 295
NLP tools
  Apple Pie Parser (APP) 56
  Brill tagger 72
  ChaSen 56
  CLAWS tagger 93, 98
NP movement 49

O
open-choice principle 205
open-source software 297
oral production see production
Oxford English Dictionary 22

P
parallel concordancing see concordancing
parallel corpus see corpus
Partington 35, 129, 130
Part-of-Speech (POS) n-grams 67, 71-73, 79
pattern grammar 22, 28
Pearson's correlation see statistical tests
PELCRA Corpus 4, 50, 89, 92
Pérez-Paredes 9
performative verbs 178
Pinker 46-48
Polish/English 4, 89, 90, 92, 103-106
Portuguese/English 8, 220-226
priming 10, 23-25, 28
problem-solution pattern 125-127, 131, 141
production
  oral 247, 249, 256
prosody 252
Prütz 3

R
Real Academia Española 259, 261
reference corpus see corpus
Renouf 110
Richards 5, 139, 141, 142
Roget's Thesaurus 22
Römer 6, 7, 9, 12
Root Type-Token Ratio (RTTR) see statistical measures
Roussel 217

S
sample size 137, 139, 140, 145
schema theory 174
Second Language Acquisition (SLA) 48, 55, 63, 68
Seidlhofer 14
self-access concordancing see concordancing
semantic association 25, 26, 28, 31, 37
semantic prosody 129
Shakespeare, William 179
Sidner 35
Sinclair 12, 13, 15, 16, 25, 27, 35, 110, 154
Spanish 9, 259-264, 266, 267, 285, 291
Spanish/English 247
speech acts 178, 180, 181
spoken corpus see corpus
Sripicharn 8, 23
statistical measures
  Bilogarithmic Type-Token Ratio (BTTR) 141
  Corrected Type-Token Ratio (CTTR) 141
  D measure 137, 139, 142, 143, 145
  Mean Segmental Type-Token Ratio (MSTTR) 140, 141
  Root Type-Token Ratio (RTTR) 141
  Type-Token Ratio (TTR) 5, 137-143, 145, 146
statistical tests
  chi-square 57, 61, 93-96, 98
  cluster analysis 249, 250
  Mann-Whitney (or U) 72
  Pearson's correlation 144
stemming 276
Stockholm Umeå Corpus (SUC) 3, 67, 71, 72, 82
Stubbs 25
stylistics 170, 175
Subcategorisation Frame (SF) patterns 3, 45, 48, 61
support verb constructions 109-118, 120, 122
Swedish/English 3, 4, 67, 71, 72, 80, 82, 84
syntactic variation 259, 261, 268
Systemic Functional Linguistics (SFL) 22, 125, 126, 128, 133, 174

T
tagset
  English 87
  Swedish 87
  TOSCA-ICE 78
TextSTAT see web concordancer
Thai/English 9, 235, 236, 242
theme/rheme 27, 125
Thomas, Dylan 38
TOEFL 2000 Spoken and Written Academic Language (T2K-SWAL) Corpus 206
Tono 3
TOSCA-ICE tagset see tagset
transfer see interference
translation 223
translationese 78, 81
TTR see statistical measures
Turnbull 234
Twain, Mark 183, 184, 195
Type-Token Ratio (TTR) see statistical measures

U
Universal Grammar (UG) 47, 49
Uppsala Student English (USE) Corpus 3, 67, 71-73, 82

V
vocd 142, 144

W
web browsing 276
web concordancer
  Copernic 286
  KwiCFinder 11, 273, 278-281, 284, 285, 289, 292, 294, 296, 297
  TextSTAT 288
  Web Concordancer 286
  WebCONC 286-288
  WebCorp 286, 287, 289
  WebKWiC 288
  Subject Search Spider 287, 297
Web Corpus Archive (WCA) see corpus
web crawling 283
web search engine 273, 274, 276, 277, 282-284, 286, 287, 289, 291-294, 296, 298
web spidering 283
WebCONC see web concordancer
WebCorp see web concordancer
WebKWiC see web concordancer
Whitley 268
wh-questions 13
Widdowson 2, 12, 15, 16, 155
wildcard 277, 278, 282, 283, 287, 288, 297
Wilkins, Mary Freeman 175
Wilson 69
Winter 36
Wordsmith Tools 93, 127, 141, 187
World Wide Web 11, 259, 261, 273-276, 279-282, 284, 289, 294-296
Wray 205, 206

X
XML 279

Y
yes/no questions 153
Yule 35
Bionotes

Guy Aston is professor of English linguistics and Dean of the School for Interpreters and Translators of the University of Bologna at Forlì, Italy. His main research interests are contrastive pragmatics, conversational analysis, corpus linguistics and autonomous language learning.

Silvia Bernardini is a research fellow at the School for Interpreters and Translators of the University of Bologna at Forlì, Italy, where she currently teaches translation from English into Italian. Her main research interests are corpora as aids in language and translation teaching and the study of translationese through parallel, comparable and learner corpora.

Lars Borin is Professor of Natural Language Processing in the Department of Swedish, Göteborg University. His main research interests straddle the boundary of computational and general linguistics; in particular he has published on contrastive corpus linguistics, on the use of language technology in linguistic research and in the teaching of languages and linguistics, and on language technology in the service of language diversity.

Pascual Cantos is a Senior Lecturer in the Department of English Language and Literature, University of Murcia, Spain, where he lectures in English Grammar and Corpus Linguistics. His main research interests are in Corpus Linguistics, Quantitative Linguistics, Computational Lexicography and Computer Assisted Language Learning. He has published extensively in the fields of CALL and Corpus Linguistics and is also co-author of various CALL applications: CUMBRE Curso Multimedia para la Enseñanza del Español, 450 Ejercicios Gramaticales and Practica tu Vocabulario, published by SGEL.

Ngoni Chipere is Lecturer in Language Arts at the University of the West Indies. He completed his doctoral thesis in experimental psycholinguistics at the University of Cambridge in 2000. His post-doctoral studies at the University of Cambridge Local Examinations Syndicate and the University of Reading were concerned with quantitative analysis of developmental trends in a corpus of children's writing. His research interests straddle theoretical and applied concerns in linguistics and the psychology of language. His publications include Understanding Complex Sentences: Native Speaker Variation in Syntactic Competence, published by Palgrave in 2003, and – with David Malvern, Brian Richards and Pilar Durán – Lexical Diversity and Language Development: Quantification and Assessment, soon to be published by Palgrave.

Mark Davies is an Associate Professor of Corpus and Computational Linguistics at Brigham Young University in Provo, Utah, USA. He has developed large corpora of historical and modern Spanish and Portuguese, which have been used (by him and by many students) to investigate several aspects of syntactic change and current syntactic variation in these two languages.

William H. Fletcher is Associate Professor of German and Spanish at the United States Naval Academy. His current research focuses on exploiting the Web as a source of linguistic data. He has also authored numerous papers on the role of multimedia in language learning and on the linguistic description of modern Dutch.

Lynne Flowerdew coordinates technical communication skills courses at the Hong Kong University of Science and Technology. Her research interests include corpus-based approaches to academic and professional communication, text linguistics, ESP and syllabus design.

Ana Frankenberg-Garcia holds a PhD in Applied Linguistics from Edinburgh University and is an auxiliary professor at ISLA, in Lisbon, where she teaches English language and translation. She is joint project leader of the COMPARA parallel corpus of English and Portuguese, a public, online resource funded by the Portuguese Foundation for Science and Technology. Her current research interests focus on the use of corpora for language learning and translation studies.

Bernhard Kettemann is professor of English linguistics at Karl-Franzens-University Graz and currently head of the Department of English Studies. His main research interests are corpus linguistics, (media) stylistics, and the teaching and learning of EFL. Recent publications include Teaching and Learning by Doing Corpus Analysis (co-edited with Georg Marko, 2002, Rodopi).

Agnieszka Leńko-Szymańska is a graduate of the University of Łódź, where she is Adjunct Professor and Head of the Teaching English as a Foreign Language (TEFL) Unit. Her research interests are primarily in psycholinguistics, second language acquisition and corpus linguistics, especially in lexical issues in those fields. She has published a number of papers on the acquisition of second language vocabulary. She teaches applied linguistics, foreign language teaching methodology and topics in psycholinguistics and SLA.

David Malvern is Professor of Education and Head of the Institute of Education at the University of Reading. A mathematical scientist by training, he read physics at Oxford and has been a Research Officer at the Royal Society, Visiting Professor in the Department of Educational Psychology, McGill University, Montreal, and a European Union and British Council Consultant. He has been collaborating with Brian Richards on various aspects of language research since 1988.

Georg Marko teaches English linguistics at the Department of English Studies at Karl-Franzens-University Graz and Professional English at the University for Applied Sciences FH-Joanneum Graz. He is interested in the application of corpus linguistics to Critical Discourse Analysis. He is currently finishing his PhD dissertation on the discourse of pornography.

Anna Mauranen is professor of English at Tampere University. She has published widely in corpus studies, translation studies and contrastive linguistics. Her current research focuses on speech corpora and English as a lingua franca. She is running a research project on lingua franca English, and compiling a corpus of English spoken as a lingua franca in academic settings (the ELFA corpus).

Nadja Nesselhauf is an Assistant ("wissenschaftliche Assistentin") at the English Department of the University of Heidelberg, Germany. She holds a PhD in English Linguistics from the University of Basel, Switzerland, where she taught various courses in Linguistics from 1999 to 2003. Her main research interests are linguistics and language teaching, phraseology, second language acquisition, and corpus linguistics.

Pascual Pérez-Paredes works at the English Department in the University of Murcia, Spain. He completed his doctorate in English Philology in 1999, and currently teaches English Language and Translation. His main academic interests are the compilation and use of language corpora, the implementation of Information and Communication Technologies in Foreign Language Teaching/Learning, and the role of affective variables in Foreign Language Learning. He is responsible for the compilation of the Spanish component of the Louvain International Database of Spoken English Interlanguage (LINDSEI) corpus. Recent publications include articles in Extending the Scope of Corpus-based Research. New Applications, New Challenges, edited by S. Granger and S. Petch-Tyson, and How to Use Corpora in Language Teaching, edited by J. Sinclair.

Having studied Egyptology, Linguistics and Computational Linguistics at Uppsala University, Klas Prütz currently works as a Corpus Research Assistant at the Centre for Language and Communication Research, Cardiff University. His research focuses on the development and evaluation of methodologies for large-scale corpus investigations. He is working on a PhD thesis concerning multivariate analysis of part-of-speech-determining contexts for word forms in Swedish texts.

Brian Richards is Professor of Education at the University of Reading and Head of the Section for Language and Literacy. A former teacher of German and English as a Foreign Language, his research interests have extended to early language development and language assessment, as well as foreign and second language teaching and learning. He obtained a doctorate on auxiliary verb acquisition from the University of Bristol in 1987 before moving to Reading to train teachers of French and German. He is the author of Language development and individual differences (1990) and editor of Input and interaction in language acquisition (1994) (with Clare Gallaway) and Japanese children abroad (1998) (with Asako Yamada-Yamamoto). He is the author of numerous articles and book chapters on language and language education and has been a member of the editorial team of the Journal of Child Language since 1992.

Ute Römer studied English linguistics and literature, Chemistry and Education at Cologne University and now works as a researcher and lecturer in English linguistics at the University of Hanover. She is currently finalising her PhD thesis entitled Progressives, Patterns, Pedagogy: A corpus-driven approach to English progressive forms, their functions, contexts, and didactics. The study is based on more than 10,000 progressives in context and tries to demonstrate how corpus work can contribute to an improvement of ELT. Main research and teaching interests include corpus linguistics, linguistics & language teaching, and language & gender. Her most recent research project centres around a monitor corpus of linguistic book reviews and its possible use in corpus-driven sociolinguistics. She has recently co-edited Language: Context and Cognition. Papers in Honour of Wolf-Dietrich Bald's 60th Birthday (2002) and published articles on corpus linguistics and language teaching.

Passapong Sripicharn is currently a lecturer in the English Department, Faculty of Liberal Arts, Thammasat University, Thailand. He received his Ph.D. in Applied Linguistics from the University of Birmingham, UK. His research interests include corpus linguistics, second language writing, and ESP/EAP.

Dominic Stewart is a research fellow at the School for Interpreters and Translators of the University of Bologna at Forlì, Italy, where he currently teaches English language and linguistics. His research interests include issues relating to the validity of corpus data within both the language and the translation classroom, and the use of reference corpora for translation into the foreign language.

Yukio Tono is Associate Professor of Applied Linguistics at Meikai University, Japan. He holds a Ph.D. in corpus linguistics from Lancaster University. His research interests include second language vocabulary acquisition, corpus-based second language acquisition, learner corpora, applications of corpora in language learning/teaching, dictionary use, and corpus lexicography. He serves on the editorial board of the International Journal of Lexicography. His recent work includes Research on Dictionary Use in the Context of Foreign Language Learning (Max Niemeyer Verlag, 2001). He has also led two major learner corpus-building projects, JEFLL and SST, in Japan.
In the series Studies in Corpus Linguistics (SCL) the following titles have been published thus far or are scheduled for publication:

1. PEARSON, Jennifer: Terms in Context. 1998. xii, 246 pp.
2. PARTINGTON, Alan: Patterns and Meanings. Using corpora for English language research and teaching. 1998. x, 158 pp.
3. BOTLEY, Simon and Tony McENERY (eds.): Corpus-based and Computational Approaches to Discourse Anaphora. 2000. vi, 258 pp.
4. HUNSTON, Susan and Gill FRANCIS: Pattern Grammar. A corpus-driven approach to the lexical grammar of English. 2000. xiv, 288 pp.
5. GHADESSY, Mohsen, Alex HENRY and Robert L. ROSEBERRY (eds.): Small Corpus Studies and ELT. Theory and practice. 2001. xxiv, 420 pp.
6. TOGNINI-BONELLI, Elena: Corpus Linguistics at Work. 2001. xii, 224 pp.
7. ALTENBERG, Bengt and Sylviane GRANGER (eds.): Lexis in Contrast. Corpus-based approaches. 2002. x, 339 pp.
8. STENSTRÖM, Anna-Brita, Gisle ANDERSEN and Ingrid Kristine HASUND: Trends in Teenage Talk. Corpus compilation, analysis and findings. 2002. xii, 229 pp.
9. REPPEN, Randi, Susan M. FITZMAURICE and Douglas BIBER (eds.): Using Corpora to Explore Linguistic Variation. 2002. xii, 275 pp.
10. AIJMER, Karin: English Discourse Particles. Evidence from a corpus. 2002. xvi, 299 pp.
11. BARNBROOK, Geoff: Defining Language. A local grammar of definition sentences. 2002. xvi, 281 pp.
12. SINCLAIR, John McH. (ed.): How to Use Corpora in Language Teaching. 2004. viii, 308 pp.
13. LINDQUIST, Hans and Christian MAIR (eds.): Corpus Approaches to Grammaticalization in English. 2004. xiv, 265 pp.
14. NESSELHAUF, Nadja: Collocations in a Learner Corpus. xii, 326 pp. + index. Expected Winter 04-05.
15. CRESTI, Emanuela and Massimo MONEGLIA (eds.): C-ORAL-ROM. Integrated Reference Corpora for Spoken Romance Languages. ca. 300 pp. (incl. DVD). Expected Winter 04-05.
16. CONNOR, Ulla and Thomas A. UPTON (eds.): Discourse in the Professions. Perspectives from corpus linguistics. 2004. vi, 334 pp.
17. ASTON, Guy, Silvia BERNARDINI and Dominic STEWART (eds.): Corpora and Language Learners. 2004. vi, 311 pp.