Corpus Linguistics in Literary Analysis
Corpus and Discourse

Series editors: Wolfgang Teubert, University of Birmingham, and Michaela Mahlberg, University of Liverpool.

Editorial Board: Paul Baker (Lancaster), Frantisek Čermák (Prague), Susan Conrad (Portland), Geoffrey Leech (Lancaster), Dominique Maingueneau (Paris XII), Christian Mair (Freiburg), Alan Partington (Bologna), Elena Tognini-Bonelli (Siena and TWC), Ruth Wodak (Lancaster), Feng Zhiwei (Beijing).

Corpus linguistics provides the methodology to extract meaning from texts. Taking as its starting point the fact that language is not a mirror of reality but lets us share what we know, believe and think about reality, it focuses on language as a social phenomenon, and makes visible the attitudes and beliefs expressed by the members of a discourse community. Consisting of both spoken and written language, discourse always has historical, social, functional, and regional dimensions. Discourse can be monolingual or multilingual, interconnected by translations. Discourse is where language and social studies meet.

The Corpus and Discourse series consists of two strands. The first, Research in Corpus and Discourse, features innovative contributions to various aspects of corpus linguistics and a wide range of applications, from language technology via the teaching of a second language to a history of mentalities. The second strand, Studies in Corpus and Discourse, comprises key texts bridging the gap between social studies and linguistics. Although equally academically rigorous, this strand will be aimed at a wider audience of academics and postgraduate students working in both disciplines.
Research in Corpus and Discourse

Conversation in Context: A Corpus-driven Approach. With a preface by Michael McCarthy. Christoph Rühlemann
Corpus-Based Approaches to English Language Teaching. Edited by Mari Carmen Campoy, Begona Bellés-Fortuno and Ma Lluïsa Gea-Valor
Corpus Linguistics and World Englishes: An Analysis of Xhosa English. Vivian de Klerk
Evaluation and Stance in War News: A Linguistic Analysis of American, British and Italian television news reporting of the 2003 Iraqi war. Edited by Louann Haarman and Linda Lombardo
Evaluation in Media Discourse: Analysis of a Newspaper Corpus. Monika Bednarek
Historical Corpus Stylistics: Media, Technology and Change. Patrick Studer
Idioms and Collocations: Corpus-based Linguistic and Lexicographic Studies. Edited by Christiane Fellbaum
Meaningful Texts: The Extraction of Semantic Information from Monolingual and Multilingual Corpora. Edited by Geoff Barnbrook, Pernilla Danielsson and Michaela Mahlberg
Rethinking Idiomaticity: A Usage-based Approach. Stefanie Wulff
Working with Spanish Corpora. Edited by Giovanni Parodi

Studies in Corpus and Discourse

Corpus Linguistics in Literary Analysis: Jane Austen and her Contemporaries. Bettina Fischer-Starcke
English Collocation Studies: The OSTI Report. With an introduction by Wolfgang Teubert. John Sinclair, Susan Jones and Robert Daley. Edited by Ramesh Krishnamurthy
Text, Discourse, and Corpora: Theory and Analysis. With an introduction by John Sinclair. Michael Hoey, Michaela Mahlberg, Michael Stubbs and Wolfgang Teubert
Corpus Linguistics in Literary Analysis Jane Austen and her Contemporaries
Bettina Fischer-Starcke
Continuum International Publishing Group
The Tower Building, 11 York Road, London SE1 7NX
80 Maiden Lane, Suite 704, New York, NY 10038
www.continuumbooks.com

© Bettina Fischer-Starcke 2010

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage or retrieval system, without prior permission in writing from the publishers.

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

ISBN: 978-1-8470-6437-0 (hardcover)

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress.

Typeset by Newgen Imaging Systems Pvt Ltd, Chennai, India
Printed and bound in India by Replika Press Pvt Ltd
For my family
Contents

List of Tables
List of Figures
Preface

1 Introduction
1.1 Stylistics and style
1.2 The data
1.3 The potential and goals of corpus stylistic analyses

2 Goals, techniques, principles
2.1 The theory
2.2 Corpora, texts, software
2.3 Concluding comments

3 Language and meaning
3.1 Stylistics and meaning
3.2 Stylistics – the background
3.3 The classics
3.4 Cognitive stylistics
3.5 Corpus stylistics
3.6 Concluding comments

4 Summary of Northanger Abbey

5 Keywords and concordance lines
5.1 Keywords in the literature
5.2 The text NA
5.3 Excursus: grammatical negation in NA
5.4 The corpus Austen
5.5 Concluding comments

6 Phraseology
6.1 Phraseology in the literature
6.2 The text NA: data and analysis
6.3 The corpus Austen: data and analysis
6.4 The corpus ContempLit: data and analysis
6.5 Concluding comments

7 Text segmentation
7.1 Cohesion and coherence
7.2 The text NA: the data
7.3 Segmentation of the corpus Austen
7.4 Linguistic homogeneity and heterogeneity
7.5 Concluding comments

8 Conclusion

Appendix
Notes
References
Index of Names
Index of Subjects
List of Tables

Table 2.1 Corpora/text
Table 5.1 Keywords emotions in NA
Table 5.2 Keywords textuality in NA
Table 5.3 Grammatical negations in NA
Table 5.4 Distribution of family and social relationships in Austen
Table 5.5 Pronouns, articles, proper nouns and titles in Austen
Table 6.1 4-grams and 4-frames of NA
Table 6.2 Patterns of 4-grams, NA
Table 6.3 Patterns of 4-frames, NA
Table 6.4 Correspondences Austen – NA, 4-grams
Table 6.5 Correspondences Austen – NA, 4-frames
Table 7.1 Keywords
Table 7.2 Correlation WD and VMP
Table 7.3 Correlation literary criticism and VMP
Table 7.4 Segmentation of NA
Table 7.5 Keywords NA1
Table 7.6 Keywords NA2
Table 7.7 Keywords NA3
Table 7.8 The corpus Austen
Table 7.9 Place names in Austen
Table 7.10 Highest CBDF-values for Austen and ContempLit
Table 7.11 Lowest CBDF-values for Austen and ContempLit
Table A1 Intertextual references NA – The Mysteries of Udolpho
Table A2 Intertextual references NA – The Monk
List of Figures

Figure 3.1 Elements creating meaning in language
Figure 6.1 * i am sure *
Figure 7.1 Lexical cohesion according to Hasan (1984)
Figure 7.2 Place names in NA
Figure 7.3 Textuality in NA
Figure 7.4 Textuality 2 in NA
Figure 7.5 Emotions in NA
Figure 7.6 Textuality – Emotions in NA
Figure 7.7 VMP of NA, interval 35 words
Figure 7.8 VMP of NA, interval 101 words
Figure 7.9 VMP of NA, interval 151 words
Figure 7.10 Comparison of keywords from NA1, NA2 and NA3
Figure 7.11 Family and social relationships in Austen
Figure 7.12 Place names in Austen
Figure 7.13 Diagram VMP of Austen, interval 101
Preface
Stylistics has been a source of interest to me since writing my MA thesis on the language of Janet Frame, a New Zealand author. So when choosing a topic for my PhD thesis, corpus stylistics was my immediate choice, and I decided to look at Jane Austen’s novels to see whether using corpus linguistic techniques in the analysis of literature could give new insight into already thoroughly analysed texts. I selected three main analytic techniques and found that they were indeed highly successful in revealing new literary meanings of the data.

This book is an adaptation of my PhD thesis, which I submitted at the University of Trier, Germany, in 2007. The book presents both the theoretical contexts in which the analyses of my thesis are embedded and the analyses themselves. However, the original appendix of my thesis with its more than 100 pages of data could not be reproduced in this book. This would have included the complete lists of keywords, the complete sets of concordance lines, all CBDF-values and so on that are mentioned in this book, and I would be happy to provide this data to anyone interested in it. Also, the corpora that I use for the analyses do not come with this book.

Neither the book nor the thesis could have been written without the moral and practical support of a number of people. I am grateful to Michael Stubbs for helpful comments and conversations both on my PhD thesis and on earlier versions of this book. I am also grateful to Katrin Oltmann, who gave me permission to use her corpus of literature contemporary to Austen, ContempLit, in this book, and to Isabel Barth, who gave me permission to use her software Word-Distribution. Thank you also to Christian Fischer, who helped me with the statistics and provided moral support whenever I needed it, and to Anna Maria Duplang, Bernd Elzer, Clare Fielder, Kieran O’Halloran, Sabine Starcke and Kurt Ubelhoer for proofreading and helpful comments.
The meaning of a word is its use in the language. Ludwig Wittgenstein
Chapter 1
Introduction
Stylistics is the linguistic analysis of literary texts. Corpus linguistics is the electronic analysis of language data. The combination of both disciplines is corpus stylistics, the linguistic analysis of electronically stored literary texts. Corpus stylistics pursues two goals:

1. to study how meaning is encoded in language and to develop appropriate working techniques to decode those meanings, and
2. to study the literary meanings of texts.

The first goal is a traditional goal in linguistics, which includes gaining knowledge of analytic techniques. The second goal is a traditional goal in literary studies, which includes gaining knowledge of the meanings of a specific text or body of texts.

In corpus stylistics, the use of corpus linguistic techniques and the goals of stylistics complement each other, as both disciplines decode linguistic patterns and their meanings in texts. In both disciplines, the knowledge gained is used to generate a more general understanding of, for instance, literary meanings or the organization of language. The two disciplines therefore complement each other when they are combined to form corpus stylistics.

This combination of the two disciplines is the reason for the great analytic potential of corpus stylistics. It allows for decoding meanings of literary texts that cannot be detected either by intuitive techniques, as in literary studies, or under the necessary restriction to short texts or text extracts, as in traditional stylistics. Corpus linguistic techniques allow (1) the systematic and detailed analysis of large quantities of language data for lexical and/or grammatical patterns and (2) the subsequent decoding of the meanings of these patterns. These patterns are not intuitively recognizable because of
the very size of the data. In the analyses in Chapters 5 to 7 of this book, the data comprises between 77,000 and 4,370,000 tokens.
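To make the notion of tokens and frequency data concrete, the following minimal Python sketch counts the tokens of a text and lists its most frequent word forms. The tokenizer and the function names are illustrative assumptions, not the software used for the analyses in this book; the sample is the opening sentence of Northanger Abbey.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: a token is a run of letters, optionally with internal
    apostrophes. Real corpus software may tokenize differently.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def frequency_list(text, top=5):
    """Return the total token count and the `top` most frequent word forms."""
    tokens = tokenize(text)
    return len(tokens), Counter(tokens).most_common(top)


sample = ("No one who had ever seen Catherine Morland in her infancy "
          "would have supposed her born to be an heroine.")

n_tokens, most_common = frequency_list(sample, top=3)
print(n_tokens)      # total number of tokens in the sample
print(most_common)   # most frequent word forms with their counts
```

On this single sentence the sketch counts 20 tokens, with her as the only word form that occurs twice. On a whole novel the same procedure yields the kind of frequency list that cannot be compiled intuitively.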
1.1 Stylistics and style

Widdowson defines stylistics as the ‘study of literary discourse from a linguistic orientation’ (1975: 3) which ‘treats literature as discourse’ (6). Toolan supports this view by saying that stylistics is ‘the study of language in literature’ (1998: viii) and that it is therefore part of linguistics. By analysing the linguistic patterns of a text, it gives answers to questions such as how literary effects are encoded in language. Weber is even more pointed, demanding that it answer questions such as ‘what is literature? How does literary discourse differ from other discourse types? How do we read and interpret literary texts?’ (1996: 1).

The definition of stylistics in this book is closely related to the views quoted above. Here, stylistics is defined as the linguistic analysis of literary texts and therefore as a linguistic discipline. Its goal is to decode literary meanings and structural features of literary texts by identifying linguistic patterns and their functions in the texts. Consequently, the term style means lexical and grammatical patterns in a text that contribute to its meaning. This ties in with Fowler (ed.), who says that

[w]e must assume that all texts manifest style, for style is a standard feature of all language (. . .). [S]tyle is a manner of expression, describable in linguistic terms, justifiable and valuable in respect of non-linguistic factors. (. . .) it is a facet of language. (1987: 236)

This rather general definition of style is further explained and expanded in Chapter 3 Language and meaning. The definition of style above emphasizes that stylistics is a linguistic discipline. However, apart from the goals of linguistics, namely to gain knowledge of how meaning is encoded in language and of the meaning itself, it also pursues the goal of literary studies, namely to gain knowledge of the literary meanings of a specific text.
In the following, the relationship between style, meaning in language and the analytic techniques of both linguistics and literary studies is discussed, in order to show how the goals of linguistics and literary studies can be combined, and to explain how these goals are pursued and related to each other in stylistics.
Linguistics analyses language systematically to gain knowledge of language patterns either in a specific text or in language in general. Depending on the question posed, the textual basis of an analysis is either a text or a corpus, that is, a compilation of texts or text fragments in electronic form. In most linguistic analyses, the textual data is non-fiction and non-literary. One basic assumption of linguistic analyses is that the linguistic form of the data indicates its meaning. Corpus linguistics further assumes a correlation between the frequency of a pattern and its significance in the data. Frequent linguistic patterns have significance for either the content of the data or its structural organization (Teubert 2005). The frequency with which a feature occurs therefore influences its qualitative analysis. This is further explained in Chapter 2 Goals, techniques, principles.

Literary studies analyse the meanings of literary texts by looking at their language and at extratextual features. This is subsumed under ‘criticism’, which is ‘[t]he conscious evaluation or appreciation of a work of art, either according to the critic’s personal taste or according to some accepted aesthetic ideas’ (Shipley ed. 1970: 66). Bressler, quoting the nineteenth-century critic Matthew Arnold as evidence for his proposition, further defines literary criticism as ‘a disciplined activity that attempts to describe, study, analyze, justify, interpret, and evaluate a work of art’ (2003: 4f.). He goes on to say that ‘this discipline attempts to formulate aesthetic and methodological principles on which the critic can evaluate a text.’ Eagleton goes still further in his definition of ‘literary theory’ and says that it ‘is less an object of intellectual enquiry in its own right than a particular perspective in which to view the history of our times’ (1983: 195).
This is because any body of theory concerned with human meaning, value, language, feeling and experience will inevitably engage with broader, deeper beliefs about the nature of human individuals and society, problems of power and sexuality, interpretations of past history, versions of the present and hope for the future. (195)

The definition and the exact nature of literary criticism therefore change in the course of time as different social and political conditions prevail.

The uniqueness of the linguistic style of a text, as manifest in its lexical, phraseological and grammatical patterns, is of less importance for the analysis of a text in literary studies than in linguistics. In literary studies, language is mainly relevant as a criterion for literariness and the literary meanings of the text. This means that linguistic deviations from language norms are one criterion for culturally valued literature, and such deviation occasionally
distracts attention from a rather banal content of a text (Cook 1986: 150). The analysis of texts frequently conforms to the principles of classical rhetoric and is intuitive, often following a particular school of thought, for example Reception Theory or the New Criticism. Basic questions that are addressed in literary criticism are concerned with the philosophical, psychological, functional and descriptive nature of the text itself:

• Does the text have only one correct meaning?
• Is a text always didactic; that is, must a reader learn something from every text?
• Does a text affect each reader in the same way?
• How is a text influenced by the culture of its author and the culture in which it is written?
• Can a text become a catalyst for change in a given culture?

(Bressler 2003: 5)
Linguistic features that are not prominent in the text are frequently not recognized and are therefore only rarely analysed in literary studies.

Both linguistics and literary studies analyse texts and their meanings, but they differ in their methods of analysis and in their choice of texts. Literary studies are restricted to literary texts and the analysis of their meanings. The question of what ‘literature’ actually is has not been answered definitively. Shipley, for example, says that literature is ‘[w]ritten productions as a collective body. The total preserved writings belonging to a given language or people; that part which is notable for literary form or expression, “belles lettres” (. . .)’ (ed. 1970: 183f.). Eagleton (1983), however, does not offer a single definition of ‘literature’ in his chapter entitled ‘Introduction: What is Literature’, but instead shows that its definition has changed in the course of time and depends on the school of thought proposing the definition. He emphasizes that linguistic, philosophical, social and political factors influence whether a piece of writing is accepted as literature or not. In this book, the term ‘literature’ follows the rather general definition of the Oxford English Dictionary (1989) as a ‘literary work or production (. . .) the realm of letters’ or

[l]iterary productions as a whole; the body of writings produced in a particular country or period, or in the world in general. Now also in a more restricted sense, applied to writing which has claim to consideration on the ground of beauty of form or emotional effect

in order to cover the diverse views on the concept.
Linguistics usually analyses non-fiction texts, text fragments and collections of texts and the functions of patterns in the language of the data. The criteria for the selection of the data in linguistics are often functional and are based on the research question. In literary studies, it is often the literary value of a text which functions as a criterion for it to be selected for an analysis. Consequently, literary studies gain knowledge mainly of a specific text or a literary period. Linguistics, on the other hand, gains knowledge of the specific text or corpus and of the language system with its mechanisms for encoding meaning.

The two disciplines also differ in their definitions of style. While linguistics perceives ‘style as choice’ (de Beaugrande 1993: n.p.) of an author, literary studies perceive ‘style as ornamentation’ (n.p.) of a text. In literary studies, style is an aesthetic choice which makes a text either literary or non-literary and which ‘serves to mark the critic’s approval or disapproval of the quality of a writing’ (Shipley ed. 1970: 314).

A style is a manner of expression, describable in linguistic terms, justifiable and valuable in respect of non-linguistic factors. (. . .) it is a facet of language (. . .) that is given significance by personal or cultural, rather than verbal, qualities. (Fowler 1987: 236f.)

It is an exclusive criterion. In linguistics, and therefore also in stylistics, a text ‘represents the results of a complicated selection process, and each selection has meaning by virtue of all other selections which might have been made, but have been rejected’ (Sinclair 1965: 76f.). The individual style of a text is the author’s or speaker’s choice and its meaning derives precisely from the fact that it was the sender’s choice. This means that style is not an exclusive, but a describable criterion in linguistics which allows for determining the degree to which it is specific to a sender or to which it conforms to conventions.
The sender’s choice of language is not evaluated; it is merely described. This goal of a linguistic analysis is modified when the objects of an analysis are large and representative corpora. A sender’s individual choices are no longer of interest, but instead patterns in the general usage of language of a large number of senders are identified and described. A comparison between the data of various senders leads to the identification of intertextual patterns, the functions of which are subsequently decoded. The main objective of the analysis is to decode the significance of these patterns for the content and the structure of the data. This is the same objective as in
the analysis of a text. However, the analyses of a text and a corpus differ in the quantities of the data.

Stylistics combines the data of literary studies, that is, literary texts, with the analytic techniques and objectives of linguistics. It thereby fills a gap within linguistics, since stylistics is the only linguistic discipline which allows the analysis of literary texts and their literary meanings by way of linguistic techniques. It holds this singular position despite the fact that literature is made of language. But a study of language which is unable to analyse one text type, especially a culturally significant one, is incomplete (Sinclair 1975, 1982). Thus, a further goal of linguistics should be to be able to draw conclusions about the language, the structure and the meanings of literary texts by means of linguistic analytic techniques.

Stylistics is based on the assumption that meaning in language is a linguistic phenomenon which can be decoded by way of linguistic analyses. Corpus stylistics specifies this assumption by choosing corpus linguistic techniques for the analyses. Unlike traditional stylistics, which can analyse only short texts or extracts from longer texts, corpus stylistics also permits the analysis of longer works such as novels. It utilizes software to aid in identifying language patterns which are objectively present in the data. This provides the linguist with detailed and neutral insights into the data, which are independent of, for example, previous knowledge of the reception of the work or of genre conventions. The analysis is text-internal and gives a new perspective on the data, so that the researcher can detect new meanings even in a widely discussed text. The detailed linguistic analysis permits detecting meanings that are virtually invisible in an intuitive approach to the data, as in literary studies. The corpus stylistic focus on the most frequent linguistic features, however, precludes it from detecting infrequent features.
This is the case even though these features may be foregrounded and often contribute to a text’s meaning, especially in literary texts. They are identified in literary critical or in traditional stylistic analyses. In corpus stylistics, conclusions about the meanings of the data are based on the assumption that form and meaning correlate. However, this correlation is neither obvious nor stable. One language pattern can have different meanings in different texts and contexts so that generalizations about the meanings of language patterns are valid only for the data analysed or similar data. Furthermore, different linguists might interpret the same pattern differently, since an interpretation is always a subjective process (cf. Chapter 2 Goals, techniques, principles). The patterns, however, are objective features of the data. The aim of stylistics is therefore to make the
connection between language structures and their meanings explicit and, by doing so, to reveal the linguistic basis of a literary interpretation. Using a corpus stylistic approach to the analysis of literary texts is an explicitly quantitative approach. It makes the quantitative element of many stylistic studies, which is frequently implicit (Fowler 1987: 237f.), explicit.

There are also critics who oppose the use of corpus linguistic techniques in the analysis of literature. One of them is Miall (1995), who argues that analysing a text electronically results in a loss of an analyst’s individual perception of the text, since this perception is based on personal experience and individual knowledge of the text. The use of software counteracts the simulation of the reading process and therefore hampers the understanding of a text. Yet it is precisely this loss of individuality, that is, a reader’s personal textual competence and experiences, that corpus stylistics aims for in the generation of the data that is analysed, as this is what contributes to the intersubjectivity of an analysis. The generation of frequency data as a basis of the analysis of literary meanings is stripped of an analyst’s individual choices and perceptions as far as possible. The ensuing interpretative process of the data, however, is necessarily subjective (cf. Chapter 2 Goals, techniques, principles).

A further point of criticism of corpus stylistics is that it disregards literary elements of texts, such as metaphors, in its analyses (van Peer 1989). However, unlike van Peer (1989), corpus stylisticians do not perceive lexis and grammar to be ‘on the lower levels of linguistic organization’ in comparison to ‘figurative meanings’ (301) in literary texts. On the contrary, lexical and grammatical patterns contribute to the literary character of a text and analysing them contributes to decoding meanings in literary texts.
Even though stylistics uses the same data for its analyses as literary studies, namely literary texts, it does not aim at replacing literary studies. A collaboration of the two disciplines would generate deeper insights into the texts than can be gained by strictly separate analyses. The analyses of this book show that the two disciplines provide different insights into the same texts. A co-operation between the two disciplines could therefore result in more detailed and more extensive knowledge of a text than can be gained by keeping the disciplines strictly separate. The fact that this co-operation does not exist at present is no reason to doubt its benefit in understanding the various shades of meanings of a text, or ultimately to reject it.

In the following analyses, reference to findings by literary critics is given only when it seems appropriate. The main emphasis in this book is on the literary findings from the analyses presented here. This does not, however, mean that I am not aware of the comprehensive discussion and seemingly
exhaustive research on Austen’s novels by literary critics. I use the literary critical studies to complement my own findings and, in turn, complement the literary findings with my corpus stylistic insights.
1.2 The data

The textual bases of the analyses in this book are Jane Austen’s novel Northanger Abbey (henceforth NA) and the corpus Austen, which consists of Austen’s six novels Emma, Mansfield Park, Northanger Abbey, Persuasion, Pride and Prejudice and Sense and Sensibility. The corpus ContempLit, which is analysed in Chapter 6 Phraseology, represents the literary language contemporary to Jane Austen (cf. Chapter 2 Goals, techniques, principles for information on the compilation of the corpora). NA (1818) is both Austen’s first completed and also her last, posthumously published novel.

The reasons for choosing Austen’s text NA and the corpus Austen as data for the analyses in this book are practical ones. First, their original publications date back about 200 years. This means that the texts are no longer under copyright and it is therefore legal to store and analyse them electronically. The same is true for the texts comprising the corpora ContempLit and Gothic, two of the reference corpora for the analyses in Chapters 5 to 7 in this book. Legal access to electronically stored language data is one of the necessary preconditions for corpus linguistic and corpus stylistic analyses.

Second, NA and Austen’s other novels have been intensively discussed and analysed over the past 200 years or so. Jane Austen is one of the most widely read classic British authors; her novels are known worldwide and are still popular today. Evidence for this includes the various film adaptations of Austen’s novels since the 1990s, for instance Sense and Sensibility (1995), Emma (1996) and Pride and Prejudice (2005). The novels’ popularity is based on their ironic, humorous and light-hearted tone and on their ostensibly simple and romantic contents. NA is one of Austen’s least discussed works, even though there is still a significant number of writings on the novel.
The query Northanger and Abbey in the database of the Modern Language Association (MLA) results in 191 hits (26 June 2009). Even Austen’s most popular and most widely discussed novel, Pride and Prejudice, yields only somewhat more than twice as many hits in the MLA database (459, query of Pride and Prejudice, 26 June 2009). In addition, there are numerous writings on Austen’s novels in publications on the author’s
complete oeuvre, so that a query of Austen and novels produces 2588 hits in the MLA database (26 June 2009).

Analyses of Austen’s novels are not restricted to literary studies; linguists have also examined her language. Chapman (1933), for example, discusses ‘Miss Austen’s English’, Phillipps (1970) Jane Austen’s English, Page (1972) The Language of Jane Austen and Tave (1973) Some Words of Jane Austen; Stokes (1991) also looks at The Language of Jane Austen, and Burrows’ (1987) study of Austen’s characters’ idiolects is still influential in linguistics. Systematic linguistic research on Austen’s language, such as Burrows’ (1987), is still the exception, however. Most published secondary literature on the author and her language is situated within literary studies and is therefore intuitive in its analyses.

Nevertheless, because literary critics have examined Austen’s novels closely in the past and still do so today, it might, at first sight, seem unlikely that new insights into their contents and language could be gained. This makes the novels ideal data for evaluating the effectiveness of corpus stylistic analyses, since a comparison of findings from the following analyses with those already published in secondary literature allows one to evaluate the novelty and innovativeness of findings. It is possible to see whether corpus stylistic analyses produce new insights into the novels, or whether they only replicate literary critics’ findings. Corpus stylistic analyses are only successful when they produce new findings on the data (cf. section 1.3 for further explanations of this claim). Consequently, I emphasize new findings on the data analysed in this book and only rarely cite previous findings by literary critics. This shows that many observations made in this book do not seem to have been made previously.

The present book takes Burrows (1987) as a model and demonstrates a systematic and comprehensive linguistic analysis of the data (cf.
Chapter 3 Language and meaning for a discussion of Burrows 1987). The analyses in this book examine the different sets of data, NA, Austen and, in one chapter, ContempLit, by using different analytic techniques. Consequently, the different analyses investigate linguistic units on different hierarchical levels in language, namely lexis, phraseology, text parts and text. Analysing these different linguistic units creates a comprehensive picture of literary meanings in the data and of the effectiveness of the different analytic techniques for the different sets of data. The analytic techniques that are mainly used in the analyses are extracting keywords and frequent phrases from the data, generating distribution diagrams of lexis and analysing concordance lines. This range of analyses goes beyond that of Burrows (1987).
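Two of the techniques named here, extracting frequent phrases and generating concordance lines, can be illustrated with a minimal Python sketch. It is not the software used for the analyses in this book: the simple tokenizer, the function names and the invented sample sentences are assumptions made purely for illustration. Keyword extraction is left out because it additionally requires a reference corpus and a keyness statistic.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def frequent_phrases(tokens, n=3, min_freq=2):
    """Return n-word sequences that occur at least min_freq times, most frequent first."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return [(" ".join(g), c) for g, c in grams.most_common() if c >= min_freq]


def concordance(tokens, node, span=4):
    """Return one line per hit of `node`, with `span` tokens of context on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Invented sample sentences (not quotations from Austen).
sample = ("I do not know what to say. I do not know what to think. "
          "She did not know the abbey at all.")
tokens = tokenize(sample)

phrases = frequent_phrases(tokens, n=4, min_freq=2)
hits = concordance(tokens, "abbey", span=3)

for phrase, count in phrases:
    print(count, phrase)
for line in hits:
    print(line)
```

In this sketch the phrase length `n`, the frequency threshold `min_freq` and the context width `span` are all settings chosen by the analyst, so the output depends on them as much as on the data.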
1.3 The potential and goals of corpus stylistic analyses
In corpus stylistics, we can use the same methods to analyse both an individual text and a corpus. This allows us
• to develop analytic techniques for investigating various research questions,
• to evaluate the success of different research techniques for different sets of data, and
• to gain new literary and structural insights into the data.
All this is demonstrated in the analyses later in this book. The present work is broader in its range of analyses than previous corpus stylistic studies, which usually examine one textual basis using one analytic technique (cf. Chapter 3 Language and meaning). In this book, the potential of various techniques is explored for various sets of data, thereby filling a gap in corpus stylistics and stylistics in general by systematically examining which analytic techniques generate (1) the most insights and (2) hitherto unknown insights into a text and a corpus. This book also enlarges the scope of data that is analysed in comparison to previous publications. In stylistics, including corpus stylistics, the objects of analyses have mostly been short texts, such as poems or extracts from longer works (e.g. Louw 1993, O’Halloran 2007b). The possibility of analysing longer texts in a corpus stylistic analysis has only recently been put into practice (e.g. Stubbs 2005, Starcke 2006, Fischer-Starcke 2009b) and is further developed in the present volume by systematically analysing a text and a corpus. The analyses result in literary and structural knowledge of NA and Jane Austen’s oeuvre in general. In addition, the analysis of the corpus ContempLit in Chapter 6 Phraseology, and its use as a reference corpus in other analyses of this book, gives insight into general literary language contemporary with Austen. This enlarges the scope and the quantity of data of corpus stylistic research. Apart from gaining literary insights into the data, a second goal of this book is to evaluate the use of corpus linguistic techniques in the analysis of literature. The basis for this evaluation is a comparison between the findings from the analyses in this book and interpretations of the data published as secondary literature. Since Austen’s texts are part of the literary canon and have been studied intensively over the centuries (as shown earlier), they are particularly well-suited for this task.
The evaluation of the use of corpus linguistic techniques for the study of literature is based on two criteria:
1. Can literary insights into the data that have been published be reproduced by using corpus linguistic techniques in the analysis?
2. Is it possible to gain new insights into the data by using new analytic, that is, corpus linguistic, techniques?
The latter question is not only the greater challenge, it is also of greater importance for the evaluation of the analyses. Only its affirmation legitimizes corpus stylistics as its negation would also negate the usefulness of the analyses. The mere reproduction of knowledge would make using corpus linguistic techniques in the analysis of literature unnecessary. As the following analyses will show, however, it is possible to gain new insights into the data by using corpus linguistic techniques for their analysis. One example of a new insight into the data is the contribution of the novel’s frequent phrases to characterizing people and places in NA (cf. Chapter 6 Phraseology). This shows that identifying linguistic patterns in large quantities of data, such as a novel or a corpus, by using corpus linguistic techniques, not only complements traditional techniques and methods in the analysis of literature by reproducing findings, but that the analyses also expand findings from previous research and offer new insights into the data. Also, the first question posed earlier can be answered in the affirmative, as previously generated results can be replicated using corpus linguistic techniques for the analysis. The analyses in this book show that insights into the data gained by literary critics, for example on intertextual references to Gothic and sentimental fiction in NA (cf. Chapter 5 Keywords and concordance lines), were reproduced.
This strengthens the legitimacy of corpus stylistics further, since the reproduction of findings demonstrates the probable veracity of insights into the data gained by way of corpus stylistics. Reproducing knowledge about the data increases the significance of new findings, since their plausibility has already been asserted. New insights into the data can be gained since (1) the data is studied in a detailed and systematic way and (2) a larger number of units of meaning in language is analysed than in literary studies. While literary studies have traditionally looked at a text as a unit of meaning, corpus linguistics looks at more than one unit as carriers of meaning. Words, phrases, text parts and the text itself are units of meaning (cf. Chapter 3 Language and meaning)
which all contribute to the literary meanings of the data. They are therefore the objects of the analyses later in this book. Analysing more than one of these units of meaning gives a more comprehensive view of the data than the analysis of a unit of only one kind. This multi-dimensional approach to an analysis in corpus linguistics and corpus stylistics stands in contrast to the traditional approaches to text analysis in literary studies and in other linguistic disciplines, and is made possible only because software is used for the analyses. This multi-dimensionality is one of the major features of corpus linguistics and one of the foundations of the success and the potential of corpus stylistics for analysing literary texts. This is demonstrated in Chapters 5 to 7 of this book.
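As an illustration of how software can look at a unit other than the single word, the following minimal Python sketch counts how often a word occurs in successive parts of a text. The division into equal-sized parts, the function name and the invented token sequence are assumptions for the purpose of illustration; they do not reproduce the distribution diagrams generated later in this book.

```python
def distribution(tokens, word, parts=5):
    """Count occurrences of `word` in `parts` consecutive slices of roughly equal size."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    size = -(-len(tokens) // parts)  # ceiling division, so no token is left over
    counts = []
    for p in range(parts):
        segment = tokens[p * size:(p + 1) * size]
        counts.append(segment.count(word))
    return counts


# Invented token sequence: the word clusters at the beginning and returns at the end.
tokens = "abbey abbey abbey she walked on and on again abbey".split()
counts = distribution(tokens, "abbey", parts=5)

# A crude text rendering of a distribution diagram.
for number, count in enumerate(counts, start=1):
    print(f"part {number}: {'#' * count} ({count})")
```

A word list alone would report only the total of four occurrences; the counts per text part additionally show where in the text the word is concentrated.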
Chapter 2
Goals, techniques, principles
This chapter is divided into two parts. The first part (2.1 The theory) discusses the theoretical concepts and premises upon which the analyses in Chapters 5 to 7 have been based. The second part (2.2 Corpora, texts, software) discusses the practical foundations of this book, namely the data and the software used for the analyses.
2.1 The theory
First, the following section discusses some of the theoretical principles of corpus linguistics and corpus stylistics. Second, evaluation criteria for corpus linguistic and corpus stylistic analyses are developed and discussed. Third, the goals of this book are presented. Fourth, the working techniques used to achieve these goals are explained. The principles and premises presented only for corpus linguistics are also relevant and basic to corpus stylistics. In most instances, this is not stated explicitly, but should be borne in mind for the remainder of this book.
2.1.1 The probabilistic nature of corpus linguistic evidence
Corpus linguistic analyses generate data and evidence for claims which are inherently empirical, quantitative and probabilistic. Qualitative statements are results of the interpretation of this data. This enlarges the knowledge of the data and of the analytic techniques, so that the corpus linguistic approach to an analysis conforms to Lakatos’ position that ‘empiricalness (or scientific character) and theoretical progress are inseparably connected’ (1970: 123, emphasis in the original). Corpus linguistic analyses reveal tendencies and probabilities in language by way of electronically generated quantitative data. These tendencies and probabilities are the results of generalizations of data extracted from a text or a corpus by electronic means. Absolute statements, such as feature x in
corpus y occurs in z per cent of all cases, are the basis for hypotheses and generalizations on the usage of a specific feature in a particular linguistic context. Generalizations regarding a context can only be extensions of the data type in the original analysis. This means that, for example, the analysis of a representative corpus of literary texts can lead to conclusions on general literary language as represented by the corpus, but not on general language usage. The latter would require the analysis of a general language corpus, such as the BNC¹. Therefore, statements on language in the following chapters of this book apply only to that language variety represented by the corpus that is analysed. Language is an open system. This means that a corpus usually contains only a selection of the possible language of the variety in question. Consequently, statements on language are mostly generalizations of features that occur in the corpus. They show the probability with which the feature occurs in the language variety represented by the corpus. The only corpora to which this does not apply are corpora which consist of the entire language of a particular variety. Austen is an example of this kind of corpus as it includes all novels by the author Jane Austen. Analysing this corpus allows absolute statements to be made on Austen’s language in her novels, but not on her language in other writings, such as her letters, since these are not represented by the corpus. Generalizations on linguistic patterns from a representative corpus are one of the potentials of, and indeed an explicit goal in, corpus linguistics. This is because insights into the language system cannot be gained from the analysis of one text only. Generalizations based on the analysis of corpora allow a determination of the significance of linguistic patterns in the language data and a selection of patterns for further detailed analyses.
They permit insights into the language system and the encoding of meaning in language. It is the analysis of large amounts of language data that makes studying actual language use possible (cf. Chapter 3 Language and meaning for a discussion on the language system). Since statements on the usage of language are generalizations, they are inherently probabilistic. This is because the absolute frequency of a feature varies between different texts and corpora; a particular feature might occur several times in some, but not at all in other data. Thus, corpus linguistic analyses establish the average frequency with which a feature is used in language. Whether a specific linguistic pattern occurs in the data is influenced by several factors, including sociolinguistic ones, such as the participants of
a conversation, its situation and its purpose. Other factors influencing language use are, for example, differences between written and spoken language, and genre or general linguistic conventions, for example, on how something is typically phrased. Every utterance is a reaction to previously made utterances (Voloshinov 1929, Firth 1957a, Teubert 2005: 4), since conventions for language use are established by continuous and repeated usage in particular contexts. Language does not stand in isolation, but refers to previously used language. This situates an utterance within the language system of production and reception and makes corpus linguistic analyses inherently comparative in nature.
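The difference between an absolute statement about one set of data and a generalization across data can be made concrete with a small Python sketch. The two miniature "texts", the function name and the normalization per 1,000 tokens are assumptions chosen for illustration only; real analyses work with far larger samples.

```python
from collections import Counter


def per_thousand(tokens, feature):
    """Frequency of `feature` normalized to occurrences per 1,000 tokens."""
    if not tokens:
        raise ValueError("cannot normalize over an empty text")
    return Counter(tokens)[feature] * 1000 / len(tokens)


# Two invented ten-token texts: the feature occurs twice in one and never in the other.
text_a = "the abbey was old and the abbey was very cold".split()
text_b = "the house was new and the garden was very warm".split()

rate_a = per_thousand(text_a, "abbey")            # absolute statement about text_a
rate_b = per_thousand(text_b, "abbey")            # absolute statement about text_b
pooled = per_thousand(text_a + text_b, "abbey")   # average over both texts

print(rate_a, rate_b, pooled)
```

Each of the first two figures is an absolute statement about one text. The pooled figure matches neither text exactly, which is why it can only be read as a tendency in the variety that the two texts are taken to represent.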
2.1.2 The relationship between frequency and significance in corpus linguistics
The assumption of a correlation between the frequency of a linguistic feature and its significance in a text or corpus is one of the basic principles of corpus linguistics (cf. Sinclair 1991 among others). It means that the typical usage of a word is its conventional usage (Hanks 1987) which is part of its meaning. The typical usage of a word is established by analysing a large number of texts, since ‘[m]ultiple evidence from independent sources is necessary to be valid as evidence for conventionality’ (132). And Sinclair adds that
[t]he distinction between frequency and importance is not disputed here (. . .) [and] an initial platform of importance is established by recurrence – where an event of coselection occurs more than once, and as far as can be ascertained the two occurrences are independent of each other. Less than recurrence – the single occurrence of something – is a phenomenon that cannot be objectively assessed, repeated recurrence confirms the identification of a coselection as something that must be incorporated in a description. (1999: 6, emphasis in the original)
Not everyone advocates this assumption, however, as, for example, Widdowson (1991) and Cook (1998) dispute the correlation between frequency and the significance of language features. One of their arguments relates to the fact that knowledge of infrequent features might be important when learning a language. This is why Widdowson (1991) says that it might be prudent to learn infrequent simple language constructions before frequent ones. He argues, for example, that prototypes, such as pea for
vegetables, do not occur as the most frequent collocate of the superordinate term in a corpus, which would be vegetable in this case. Consequently, prototypes cannot be identified by way of a frequency analysis, even though knowledge of prototypes might be of great importance for learners of a language. Widdowson therefore questions the correlation between frequency and significance of language features, at least for language learning. While Widdowson’s argument is valid for language learning, the analyses in this book still assume a correlation between frequency and significance of language features and look at dominant, that is, frequent, features of language use in the data. This is because frequency is an indicator for typicality of language usage and style is the typical language of a given text. Consequently, frequent linguistic features are particularly relevant when discussing the style of writing of a particular text or author. Therefore, the analyses in Chapters 5 to 7 use the most frequent realizations of a search query, for example of phrases, as their starting points. In general, frequent features are given preference over rare ones in the analyses.
2.1.3 Subjectivity versus objectivity in corpus linguistics
Before performing and ultimately evaluating the following analyses, the question of whether it is possible to conduct the research entirely objectively has to be addressed. In research, objectivity means that an analysis is based on independent data and that it is understandable for other researchers. Objectivity in analyses is one of the goals of science and research. The electronic analysis of data in corpus linguistics and corpus stylistics seems to fulfil this goal. Depersonalized software analyses the data, so that the output of the software seems to have been generated without human interference. At first glance, corpus linguistics seems to be an objective discipline.
But when looking at corpus linguistic analyses more closely, subjective decisions in the research process become apparent. These include the choice of data and software for an analysis, the settings of the software, the choice of which data generated by the software is analysed and the interpretation of the data. Subjective and objective elements are therefore interrelated in corpus linguistic analyses. Consequently, in order to evaluate the significance of an analysis, it is necessary to identify and make explicit its subjective and objective elements. Even though Popper explicitly refers to the natural sciences (1979a: 6), his theories and claims on science are useful starting points for distinguishing between subjective and objective elements in an analysis.
Popper defines the goal of science as to find satisfactory explanations, of whatever strikes us as being in need of explanation. (. . .) [S]cientific explanation, whenever a discovery, will be the explanation of the known by the unknown. (1979b: 191) Scientific research aims at furthering knowledge. This new knowledge is based on previous knowledge which is expanded by the research. This is why [a]ll acquired knowledge, all learning, consists of the modification (possibly the rejection) of some form of knowledge, or disposition, which was there previously; and in the least instance, of inborn dispositions. (71) Therefore, ‘[a]ll growth of knowledge [is] (. . .) the improvement of existing knowledge which is changed in the hope of approaching nearer to the truth’ (71). In addition, ‘science (. . .) [should] be visualized as progressing from problems to problems – to problems of ever increasing depth’ (1979a: 17). This means that science is an intertext which feeds from existing research. Previous experiences, intuition and the analyst’s interests serve as starting points for further research. Consequently, completely objective results are no more possible than completely objective research. This means that ‘objective knowledge is conjectural’ while subjective knowledge (. . .) [is] part of a highly complex and intricate but (in a healthy organism) astonishingly accurate apparatus of adjustment, and (. . .) it works, in the main, like objective conjectural knowledge: by the method of trial and elimination of error, or by conjecture, refutations, and self-correction (‘autocorrection’). (1979b: 77) Science is always partly subjective and cannot be completely objective. Even though Popper’s use of the terms subjective and objective deviates from general language use, his statements are still useful for the questions discussed in this chapter. For Popper, objective knowledge is what he calls World Three knowledge. 
This is the entire knowledge of humanity manifest in objects of World One, that is, in physical objects, such as books or articles. The knowledge has been produced by many individuals, but it has become independent from its producers by having been put into words. It has become collective knowledge, the only knowledge relevant for science as it influences individual knowledge, for example by providing ideas for future research.
Knowledge that has been published or formulated becomes knowledge of World Three as it becomes independent from an individual. The fact that it belongs to World Three is not a criterion for its objectivity though; it is merely a criterion for its accessibility. World Three knowledge is constantly questioned in research so that it might be falsified. For Popper, this is one of the goals of science. This focus on the falsification of knowledge entails that scientific processes and the knowledge resulting from these processes are, according to Popper, always partly subjective. This is because researchers need ideas on how to falsify and replace a theory. For Popper, this individual and therefore subjective part of research is one of its constituent features. While Popper emphasizes subjective elements of science, corpus linguistic analyses also entail objective elements. One objective element is the electronic and automatic analysis of data. Once the parameters of the software are set, no manual, and therefore no subjective, interference in the process of an analysis is possible. It is before and after this process that subjective choices constitute major elements of an analysis. These are, for example, the choice of data and software and of data selected for further analyses. Also, the interpretation of the data is a subjective process. Subjective and objective elements are interdependent in corpus linguistics and influence each other; they are inseparable, which is a common phenomenon in research. Analyses are part of a cycle in which several steps are constantly repeated (generating data, interpreting the data, selecting data for further analyses). In corpus linguistics, the data that is analysed is constantly accessible throughout the analysis, so that both the analytic process and its results can be constantly monitored and questioned.
Nevertheless, it is impossible to generate entirely objective results. Because of the analyst’s choices and interpretation of the data, as described above, every research project includes subjective elements. In corpus linguistics, however, the electronically generated data provides a basis for an analysis that is as objective as possible. Furthermore, practical reasons might result in a reduction of the subjective part of an analysis. Corpus linguistic and corpus stylistic analyses depend on access to electronically stored data and software. If neither data nor software can be supplied by the researcher him/herself, his/her choice of both may be limited. This was the case, for example, with the compilation of Gothic as a reference corpus for this book (cf. section 2.2.1 for further details on the corpus and its design). When compiling the corpus, not all texts that were to be included in it were available in electronic form, so that
the corpus now consists of those texts that were available. The restriction to only those resources that are readily available might determine the analyst’s decisions and therefore influence the course of an analysis. This reduces the analyst’s subjective choices. The discussion above has shown that a fully objective analysis by means of corpus linguistic techniques is impossible. Consequently, the aim of the analyses in this book is to gain insights into the data that are as objective as possible. The following analyses are carefully documented and made transparent in order to demonstrate this greatest possible objectivity. This allows for the fulfilment of the following criteria for the evaluation of corpus linguistic analyses.
2.1.4 Criteria for the evaluation of corpus linguistic analyses
The following criteria are designed to allow the evaluation of the applicability and the use of corpus linguistic techniques in stylistics as they make the analyses and their results transparent for a reader. They also permit a mutual comparison of different analyses. The criteria are
1. growth of knowledge resulting from analyses,
2. replicability of results,
3. checkability of results,
4. innovations derived from analyses.
Growth of knowledge
Kenny states that research, particularly in the humanities, is most valuable when it is ‘an original scholarly contribution within its own discipline’ (1992: n.p.). With reference to this book, this means that research is significant when it either demonstrates the usefulness of adopting corpus linguistic techniques in stylistics or when their application is rejected on sound grounds. Both results are relevant contributions to the discipline. Popper is more explicit in his demands on a new scientific theory. He says that
[t]he new theory should proceed from some simple, new, and powerful, unifying idea about some connection or relation (such as gravitational attraction) between hitherto unconnected things (such as planets and apples) or facts (such as inertial and gravitational mass) or new ‘theoretical entities’ (such as field and particles). (1979a: 46, emphasis in the original)
This book does not develop and present a new scientific theory, but it demonstrates the application of already known analytic techniques in a new context, namely in the analysis of literary texts and corpora. This fulfils the demand for a ‘simple, new, and powerful, unifying idea’ (Popper 1979a: 46) of the application of analytic techniques and procedures. Furthermore, the demand on the idea is fulfilled when, as Kenny (1992) requests, either the success or the failure of the analytic techniques for the specific research question is demonstrated. This results either in the approval or rejection of the techniques for similar research questions. The techniques are successful when
• literary insights or, more generally, new and additional information on the data are gained that could not have been or have not been generated without electronic analyses, or when
• already known information on the data or previous interpretations of it can be supported or refuted by way of electronically generated data.
Successful analyses can be used as the basis for literary interpretations of the data. These interpretations are less subjective than the interpretations by literary critics as literary critical techniques are far more intuitive. Interpretations based on corpus linguistic analyses, however, are based on data that is as objective as possible. The use of corpus linguistic techniques in the analysis of literary texts is rejected when
• no new knowledge on the data is gained,
• no additional information to already existing knowledge on the data is gained, or
• previous literary interpretations are neither supported nor refuted by corpus linguistic analyses.
If any of these conditions were fulfilled, corpus linguistic techniques would be of no use for stylistics and would therefore be rejected for the discipline. When evaluating the growth of knowledge from an analysis, the inherently comparative and descriptive nature of corpus linguistics must be taken into account. Language is compared with other language. This comparison is the basis for general statements on linguistic patterns. Language does not stand in isolation, but has to be interpreted in the context of other language. In addition, patterns that do not occur in the data cannot be described even though they might occur in comparative data. Only patterns that do occur
are described and the probabilistic nature of the analyses must be taken into account when formulating the probability with which a feature occurs in language. Evaluating the growth of knowledge resulting from an analysis is only possible when taking this into account.
Replicability
The second criterion for the evaluation of results is their replicability. One of the foundations of science is that researchers can use the methodological approach of other researchers’ works to test their results. The possibility of replicating an analysis makes its method, research technique, and the results it has generated transparent and valuable contributions to research. In order to facilitate this transparency, the various steps of an analysis and its results have to be carefully documented. While transparency permits the replication of an analysis, it also reveals where subjective decisions were taken in its course. Since these cannot be avoided in research (see above), transparency of an analysis allows other researchers to question decisions and to possibly decide differently in their own research. It allows the evaluation of an analysis as a whole.
Checkability
The third criterion, checkability, is closely related to replicability. As discussed above, it is fundamental in research that its results are not only coherent, but also that they can be tested by other researchers. This is necessary for asserting the validity of the results of a specific analysis and of its underlying assumptions and theories. Popper also demands that theories be testable:
[a] new theory should be independently testable. That is to say, apart from explaining all the explicanda which the new theory was designed to explain, it must have new and testable consequences (preferably consequences of a new kind); it must lead to the prediction of phenomena which have not so far been observed.
(1979a: 47, emphases in the original) This central position of testability in science is rejected by Kuhn, though, since [t]o rely on testing as the mark of a science is to miss what scientists mostly do and, with it, the most characteristic feature of their enterprise. (1970: 10)
For Kuhn, this ‘most characteristic feature’ is research itself. His evidence for this proposition is Copernicus who was never able to test his theories on astronomical behaviour, but nevertheless conducted research and, thus, was a scientist. And it was precisely Copernicus’ documentation of his arguments and theories which allowed others to replicate and to check his findings in later centuries. In his argument, Kuhn (1970) opposes Popper’s theory (first published in 1935 and developed further until 1962) that any scientific theory should be empirically falsifiable. According to Popper, scientific theories can never be proven, but only refuted. This is why experiments and observations should aim at falsifying theories and therefore at developing new and improved theories. Research develops through falsified theories, since a new theory replacing an older one must be more successful than its predecessor. According to Popper, this improvement of theories correlates with an increase in the empirical nature of the theory. Checking other researchers’ analyses and the conclusions that they have drawn from them is only possible when the analyses are transparent, as described above. When this transparency is provided, results from corpus linguistic analyses can be corroborated or doubted, but neither verified nor falsified. This is due to the probabilistic nature of corpus linguistics. Results are statements on the probability of the occurrence of a feature and on tendencies in language. Results from corpus linguistic analyses are only valid for the data analysed or comparable data. In order to falsify or to doubt the original conclusion, its probabilistic nature has to be considered. Counterexamples to a hypothesis or a conclusion do not necessarily falsify a hypothesis as they might be due to the probabilistic nature of the analysis. 
A valid evaluation of a conclusion has to take the characteristic features of the analysed text or corpus into account, such as its compilation, since these features determine the significance and the range of conclusions. If, for example, a representative corpus like the BNC is the object of an analysis, conclusions on it concern general language usage. Findings on general language usage can only be gained from analyses of general language corpora as opposed to specialised corpora of, perhaps, newspaper articles. But even the analysis of a representative corpus such as the BNC does not allow for absolute statements on language usage. As explained above, the probabilistic nature of corpus linguistics entails that findings from the analyses describe only tendencies in the language. The fact that the range of results from an analysis depends on the corpus that is analysed has to be borne in mind for the analyses later in this book.
In the analyses, the corpora that are used are mostly specialized ones which were compiled for a specific purpose, e.g. as a reference corpus. Conclusions drawn from their analyses are therefore restricted to either the corpus analysed or the language variety it represents. Statements resulting from the analysis of a text are only valid for this specific text. The difference between replicability and checkability lies in replicability referring to reproducing analyses while checkability refers to testing the method and techniques by conducting one’s own analysis using some parameters of the original research. With regard to wordlists, Kilgarriff describes the first as ‘[w]ould another team, working with the same framework, within the same goals, arrive at the same list’ (1997: 147). Replicability ensures that the techniques used and the decisions taken in an analysis are transparent to other researchers. This ensures that the analysis of the same data would lead to exactly the same results. Checkability, on the other hand, emphasizes the transfer and adaptation of research techniques to different sets of data, different analytic techniques or different research questions. It is based on the assumption that analyses are inherently intertextual and that they are implicit predictions for other analyses.
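What replicability demands in practice can be sketched in a few lines of Python. The sketch assumes a hypothetical `Settings` record and a simple word list routine; neither is taken from the software used in this book. The point is only that the data and every setting are recorded together with the result, so that a second run on the same data with the same settings returns exactly the same list.

```python
import hashlib
import json
import re
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Settings:
    """The analyst's (subjective) choices, made explicit so that others can repeat them."""
    lowercase: bool = True
    min_freq: int = 1
    token_pattern: str = r"[A-Za-z]+"


def wordlist(text, settings):
    """Build a frequency list; ties are sorted alphabetically so the order is fixed."""
    source = text.lower() if settings.lowercase else text
    counts = Counter(re.findall(settings.token_pattern, source))
    kept = [(word, n) for word, n in counts.items() if n >= settings.min_freq]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


def analysis_record(text, settings):
    """Bundle a fingerprint of the data, the settings and the result for documentation."""
    return {
        "data_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "settings": asdict(settings),
        "wordlist": wordlist(text, settings),
    }


sample_text = "The abbey and the Abbey"  # invented sample, not a quotation
record = analysis_record(sample_text, Settings())
print(json.dumps(record, indent=2, sort_keys=True))
```

Changing a single setting, for instance `Settings(lowercase=False)`, produces a different word list from the same data. Checkability in the sense described above would then correspond to reusing the documented settings on a different text or with a different technique.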
Innovation

The fourth evaluation criterion is innovation. It is used to question whether analytic techniques are new and innovative and whether they contribute to a growth of knowledge in corpus linguistics, stylistics or literary studies. Even though this question is closely related to Kenny’s (1992) observation quoted above, its aim differs from Kenny’s. Not everything that furthers the growth of knowledge in corpus linguistics, stylistics or literary studies is innovative. Also, conclusions that are not innovative might be of interest for specific questions since, for instance, an analysis might highlight a different detail or the author might take a different theoretical perspective on a question. Since technical developments have only fairly recently allowed the electronic analysis of corpora, corpus linguistics is still a rather new field in linguistics[2]. This is why innovations achieved using corpus linguistic techniques in the analysis of language data are of particular importance. Especially in a discipline that is still relatively young, the question whether the analytic techniques chosen for an analysis were useful or whether similar results either could be or have been generated by other methodologies or techniques is prominent in the reception of the analyses. With regard to
corpus linguistics, this implies the question whether the effort to analyse a text in its electronic form is necessary and useful, or whether the analyses are mainly interesting because they use modern technologies, that is, the computer and related software. The latter proposition is refuted by the analyses later in this book. The four criteria for evaluating research – growth of knowledge, replicability, checkability and innovations – can be fulfilled by corpus linguistics and corpus stylistics, but not entirely by literary studies, particularly with reference to replicability and checkability. They are only applicable to systematic analyses, because systematic analyses can be compared to other analyses, so that they can be evaluated in the context of other research. Consequently, this study focuses solely on systematic analyses.
2.1.5 Goals

Several goals are pursued in this book. The first goal is to demonstrate the usefulness of corpus linguistic techniques in a stylistic analysis of literary texts and corpora. This is achieved by extracting both linguistic and literary information from a literary text and a corpus that is compiled of literary texts by using corpus linguistic techniques. Analysing different sets of data and using different analytic techniques demonstrates the possible range and the applications of corpus stylistic analyses. Jane Austen’s novel Northanger Abbey (NA) is analysed as a case study throughout this book. In addition, the corpora Austen and, in Chapter 6 Phraseology, ContempLit are analysed by means of the same techniques as the text. Further reference corpora are used for the analyses when made explicit in the text (cf. section 2.2.1 for information on the corpora). Analysing different sets of data by using the same techniques allows an evaluation of the usefulness of the techniques in analysing different kinds of data. So far, corpus linguistic research has mainly analysed non-fiction texts and corpora (cf. Altenberg 1998, Biber & Conrad 1999, Scott 2001, Culpeper & Kytö 2002 among others). The following analyses also demonstrate its relevance for fiction texts and corpora by extracting literary meanings from the data. Only success in the analyses justifies the use of the techniques and demonstrates their usefulness.

The second goal of this book, closely related to its first, is to gain literary and structural insights into the data with the help of corpus linguistic
techniques for the analyses. These insights into the data centre on either the contents or the structure of the data, and they contribute to a literary interpretation of NA. Some insights gained in the following analyses have already been discussed by literary critics, for example the novel’s dominant topic of textuality (cf. Chapter 5 Keywords and concordance lines). But the insights gained in this book are frequently more detailed than those by literary critics. Such is the case in Chapter 5 Keywords and concordance lines, where the corpus linguistic data shows that the Tilney siblings are portrayed as positive characters by way of their reading habits. Some insights gained in this book deviate from conclusions drawn by literary critics. For example, the segmentation of NA into its constituent parts by way of distribution diagrams of frequent lexis (cf. Chapter 7 Text segmentation) differs from the segmentation by literary critics. The analyses performed in this book also generate entirely new insights into the data, such as the fact that the place Bath in NA is characterized as superficial by some of the novel’s most frequent phrases (cf. Chapter 6 Phraseology). The linguistic patterns these observations are based on can only be detected by electronic analyses as in corpus stylistics, but not intuitively as in literary criticism. Nevertheless, they are useful indicators of the structure and of discourse features of the data and therefore of its meaning. They also contribute to explaining readers’ intuitive reactions to the data. This is the case, for example, in Chapter 6 Phraseology, where corpus linguistic evidence shows why readers intuitively perceive Catherine to be insecure in her public appearances in Bath.

The third goal of this book is to develop the techniques demonstrated in the analyses so that they can be adapted to the analysis of other texts and corpora.
This adaptability is demonstrated in Chapters 5 to 7 by using the same techniques in the analysis of two and three sets of data. The goal is not to develop analytic techniques which can be used in an identical form for other data, but rather to develop analytic techniques that can be adapted to other fiction or non-fiction data. The analyses in this book show possible analytic techniques and steps in a corpus stylistic analysis as well as the potential these techniques have for future analyses.

In summary, this book pursues three main goals:

1. to demonstrate the usefulness of corpus linguistic techniques in a stylistic analysis of literary texts and corpora,
2. to gain literary and structural insights into the data analysed, and
3. to develop the techniques demonstrated in the analyses, so that they can be adapted to the analysis of other texts and corpora.
2.1.6 Analytic techniques

To achieve these goals, three main analytic techniques are used in this book:

1. keywords analyses (cf. Scott 1999)
2. phraseological research
3. distribution analyses of keywords and of points in the data where new lexis is introduced into it.

In Chapter 5 Keywords and concordance lines, keywords[3] of NA and Austen are extracted by using the software WordSmith Tools (WST, Scott 1999). As a second step, the typical usage of selected lemmata from the lists of keywords is analysed by way of the lemmata’s concordance lines. The first goal of this chapter is to demonstrate that major topics in the data can be identified through dominant lexis on the lists of keywords. The second goal is to demonstrate that these keyword analyses provide literary insights into the data. The analyses demonstrate, for example, that the protagonists in NA are characterized by way of their reading habits, an observation new to criticism of the novel, because it can really only be readily observed with this kind of technologically-supported analysis.

In Chapter 6 Phraseology, the most frequent phraseological units consisting of four words, 4-grams and 4-frames, are identified and analysed. 4-grams are uninterrupted strings of four words, the most frequent of which in NA is ‘i am sure i’. 4-frames are strings of four words which are variable in one slot. In NA, the most frequent 4-frame is ‘the * of the’. In the data, the asterisk (*) is replaced by various words. Both 4-grams and 4-frames are a means of organizing information and structuring the data. They contribute to, among other things, characterizing places in NA. The data analysed in this chapter is NA, Austen and ContempLit.

In Chapter 7 Text segmentation, the distribution of keywords from NA and Austen is used as a basis for segmenting the two sets of data into their constituent parts.
This segmentation is based on the assumption that textual units are defined by a common content, which is manifest in lexical cohesion. These segmentations are tested by means of the software Vocabulary Management Profiles (VMP, Youmans 2001) which identifies points in the data where new lexis is introduced into it. This is relevant, since the introduction of new lexis into a text indicates the introduction of a new topic into it. This points to structural boundaries which mark transitions between the constituent parts of the data. The sequence of the analyses mirrors the progression of linguistic units that are analysed from word to phrase to text part to text.
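The phrase-level and text-level techniques described in this section can be sketched in a few lines of Python. The block below is only an illustration under stated assumptions: it is not the algorithm of kfNgram or of VMP, the tokenizer is deliberately crude, and all function names (tokenize, ngrams, pframes, new_lexis_profile) are hypothetical. It counts uninterrupted strings of n words, counts frames of n words with one variable slot, and reports for each interval of running text the share of tokens that are first occurrences of their word type, which is one simple reading of ‘points where new lexis is introduced’.

```python
import re
from collections import Counter


def tokenize(text):
    """Crude tokenizer (an assumption): lower-case words, optional apostrophe part."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def ngrams(tokens, n=4):
    """Count every uninterrupted string of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def pframes(tokens, n=4):
    """Count strings of n tokens with exactly one slot replaced by '*'."""
    frames = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        for slot in range(n):
            frame = tuple("*" if j == slot else w for j, w in enumerate(gram))
            frames[frame] += 1
    return frames


def new_lexis_profile(tokens, interval=100):
    """For each interval, return the share of tokens whose type is new to the text."""
    seen = set()
    profile = []
    for start in range(0, len(tokens), interval):
        chunk = tokens[start:start + interval]
        new = 0
        for token in chunk:
            if token not in seen:
                seen.add(token)
                new += 1
        profile.append(new / len(chunk))
    return profile


if __name__ == "__main__":
    toks = tokenize("I am sure I am sure I am")
    print(ngrams(toks).most_common(2))
    print(pframes(toks).most_common(2))
    print(new_lexis_profile(toks, interval=4))
```

A peak in the profile returned by new_lexis_profile marks an interval with much new vocabulary, which, on the assumption made in this section, suggests a possible boundary between text parts. Real phraseological software may apply further filters, for example a minimum number of variants per frame, which this sketch omits.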
2.2 Corpora, texts, software

The following section describes the language data and the software used for the analyses. They are described in alphabetic order.
2.2.1 The data

The corpus Austen consists of the six novels Emma (EM, 1816), Mansfield Park (MP, 1814), Northanger Abbey (NA, 1818), Persuasion (Per, 1818), Pride and Prejudice (P&P, 1813) and Sense and Sensibility (S&S, 1811) by Jane Austen that are stored electronically as one text file. The electronic versions of the texts were provided by Project Gutenberg (www.gutenberg.net) on the world wide web (www). The only changes made to the data concern the bibliographic data and the terms and conditions of Project Gutenberg, which were tagged to make them invisible to the corpus linguistic software. The corpus consists of about 723,500 orthographic words (tokens). The corpus is compiled of the six novels that Austen intended for publication. This excludes her unfinished novels, juvenilia and letters. NA and Per are included in the corpus even though they were published posthumously, since the author had intended their publication. Literary critics consider the six novels to be Austen’s major works – a hypothesis which is followed in this book[4].

Austen5 is used as a reference corpus in this book and is compiled of Austen’s novels EM, MP, Per, P&P and S&S. NA is not included in the corpus. This compilation allows the corpus to be used for comparisons between NA and Austen’s other novels without including NA itself in the reference corpus. The corpus consists of about 645,000 tokens.

The British National Corpus, BNC, consists of about 100 million words of British English from the beginning of the 1990s. It was designed to be as representative as possible of then current British English. The corpus includes about 90 per cent written language and about 10 per cent spoken language. It includes texts and text fragments of various genres, for example, newspaper articles, literature and Hansard. The spoken section includes, for example, transcripts of radio broadcasts and of doctor – patient consultations (for further information on the BNC, see www.natcorp.ox.ac.uk).
The corpus ContempLit consists of literary texts that were published between 1740 and 1859 so that they are roughly contemporary with Jane Austen (1775–1817). The texts were mainly provided by Project Gutenberg on the
world wide web. The choice of texts results from their availability in electronic form and the fact that they are all novels. The latter makes them comparable to Austen’s works. ContempLit is not representative of literary language at Austen’s time, as the number of novels available in electronic form at the time when the corpus was compiled was limited. The corpus consists of about 4,370,000 tokens and includes the following texts:

- Samuel Richardson (1740), Pamela or Virtue Rewarded
- John Cleland (1749/50), Memoirs of Fanny Hill. Memoirs of a Woman of Pleasure
- Henry Fielding (1749), The History of Tom Jones, a Foundling
- Sarah Fielding (1749), The Governess or The Little Female Academy
- Laurence Sterne (1760), The Life and Opinions of Tristram Shandy, Gentleman
- Oliver Goldsmith (1762/66), The Vicar of Wakefield
- Horace Walpole (1764), The Castle of Otranto
- Henry Mackenzie (1771), The Man of Feeling
- Tobias George Smollett (1771), The Expedition of Humphrey Clinker
- Fanny Burney (1778), Evelina or The History of a Young Lady’s Entrance into the World
- Mary Wollstonecraft (1791), Maria or The Wrongs of Woman
- Ann Radcliffe (1794), The Mysteries of Udolpho
- Matthew Gregory Lewis (1796), The Monk
- Maria Edgeworth (1800), Castle Rackrent
- Sir Walter Scott (1814), Waverley, or ’Tis Sixty Years Since
- Mary Shelley (1818), Frankenstein, or the Modern Prometheus
- Benjamin Disraeli (1826), Vivian Grey
- Mary Russell Mitford (1832), Our Village
- Edward Bulwer-Lytton (1834), The Last Days of Pompeii
- Frederick Marryat (1836), Mr Midshipman Easy
- Charles Dickens (1837/39), Oliver Twist
- Anne Brontë (1847), Agnes Grey
- Charlotte Brontë (1847), Jane Eyre
- Emily Brontë (1847), Wuthering Heights
- William Makepeace Thackeray (1847/48), Vanity Fair
- Elizabeth Gaskell (1848), Mary Barton
- Charles Kingsley (1850), Alton Locke, Tailor and Poet
- Charlotte Yonge (1853), The Heir of Redclyffe
- Anthony Trollope (1855), The Warden
- George Eliot (1859), Adam Bede
The corpus Gothic was compiled as a reference for NA and is used to compare the language of NA with that of roughly contemporary Gothic novels. Gothic contains all those Gothic novels that are mentioned in NA and which were available on the world wide web on 20 April 2004. The compilation of the corpus was completed on that day and the corpus has not been changed since. The corpus consists of about 730,000 tokens and includes the following texts:

- Horace Walpole (1764), The Castle of Otranto
- Eliza Parsons (1793), The Castle of Wolfenbach
- Ann Radcliffe (1794), The Mysteries of Udolpho
- Matthew Lewis (1796), The Monk
- Julian Hawthorne (ed.) (1967), The Lock and Key Library Classic Mystery and Detective Stories – Old Time English. This includes:
  - Charles Dickens, ‘The Haunted House No. I Branch Line: The Signal Man’
  - Bulwer-Lytton, ‘The Haunted and the Haunters; or, The House and the Brain. The Incantation’
  - Thomas de Quincey, ‘The Avenger’
  - Charles Robert Maturin, ‘Melmoth the Wanderer’
  - Laurence Sterne, ‘A Mystery with a Moral’
  - William Makepeace Thackeray, ‘On Being Found Out. The Notch on the Ax’
  - Anonymous, ‘Bourgonef’
  - Anonymous, ‘The Closed Cabinet’
Since Gothic was designed as a reference corpus for NA, a novel treating the then popular Gothic fiction ironically, the original idea was to include all Gothic texts in the corpus that are mentioned in NA. But since the number of texts available on the world wide web, that is, in electronic form, was limited, this original plan had to be abandoned. The two novels Camilla or a Picture of Youth (1796) by Fanny Burney and The Italian (1797) by Ann Radcliffe, which are mentioned in NA, are therefore not included in the corpus. Instead, other Gothic texts which are not mentioned in NA were included in order to increase the size of the corpus. This shows that the compilation of Gothic was determined by practical conditions and considerations, such as the availability of texts. It is not representative of the Gothic literature of Austen’s time. This partly non-intentional choice of texts included in the corpus makes Fish’s (1973) criticism of stylistics, namely that researchers select questions and data for their analyses for which they already anticipate the results, as
inapplicable as possible (cf. Chapter 3 Language and meaning for a discussion of Fish 1973). Even though the corpus was compiled for a particular reason, its content could not be planned. The circularity between the choice of data and postulating a previously expected result, as claimed by Fish (1973), is therefore avoided as far as possible in an analysis.

NA is the text Northanger Abbey (1818) by Jane Austen that is used as a case study in the entire book. Its electronic version was provided by Project Gutenberg. The only changes made in the text ensure that bibliographic references and terms and conditions of the project are ignored by corpus linguistic software and are not included in the analyses. NA consists of about 77,000 tokens and is part of Austen, but not of Austen5. Table 2.1 gives a summary of the different texts and corpora and their sizes.

The aforementioned difficulties in the compilation of Gothic show difficulties of corpus stylistics in general. Corpus stylistics depends on texts in electronic form which researchers can either prepare themselves, a time-intensive task, or it depends on texts that are already available in electronic form, as on the world wide web. Consequently, theoretical considerations might be subordinated to practical concerns, for example when a researcher decides against preparing the electronic form of a text, but rather works with pre-existing data. This was the case when compiling Gothic. In addition, copyright regulations limit the number of texts available for corpus stylistic analyses as they forbid storing entire works in electronic form for 70 years after their author’s death. Only those texts that do not fall under the copyright or for which permission has been given can be stored and analysed electronically. This means that recently published texts are usually not available for corpus linguistic research. This is one of the reasons for preferring older works in corpus stylistics to more recent ones (cf.
Chapter 1 Introduction for further reasons for choosing older works for corpus stylistic analyses).

Table 2.1 Corpora/text

Corpus/Text    Number of tokens
Austen         ca. 723,500
Austen5        ca. 645,000
BNC            ca. 100,000,000
ContempLit     ca. 4,370,000
Gothic         ca. 730,000
NA             ca. 77,000
2.2.2 The software

The following software is used for the analyses in this book:

kfNgram (Fletcher 2002) is phraseological software which (1) extracts uninterrupted strings of words, n-grams, of variable, individually specified lengths from a text or corpus, and (2) extracts strings of n words which are variable in one slot, so-called p-frames. An example of an n-gram is the most frequent 4-gram of NA, ‘i am sure i’; the fifth most frequent 4-frame of NA is ‘* i am sure’.

The Vocabulary Management Profiles (VMP, Youmans 2001) generate diagrams which show where new word types are introduced into a text or corpus. For this purpose, the software analyses a text file in individually specified intervals and counts both types and tokens in each interval. It then calculates the type-token ratio for the interval. A high type-token ratio indicates the introduction of a substantial quantity of new lexis into the data at this point. This, in turn, indicates the introduction of a new topic into the text.

Word-Distribution (WD, Barth 2002) enables the generation of diagrams which show where a specified word type occurs within a text or corpus. To accomplish this, the software identifies the exact positions of a type in the data and lists them numerically. These numbers can then be read into MS EXCEL, which generates a diagram of the type’s distribution.

WordSmith Tools (WST, Scott 1999) generates statistical data on a text or corpus and analyses lexical structures in the data. In keeping with their names, concord, wordlist and keywords, the main functions of the software are to generate (1) concordance lines of specified node words and lemmata, (2) wordlists of data, and (3) keywords of data by comparing its wordlist to that of a larger reference corpus. Scott (2001: 115) defines keywords as follows:

the software [WordSmith Tools] (. . .)
analyses KWs [keywords] by comparing the frequency of each word type in the text in question with that of the same word type in a reference corpus. If the word’s frequency is found to be outstanding, it will qualify as a KW. The actual process of computing the keywords is described by Scott (1999) as follows: The ‘key words’ are calculated by comparing the frequency of each word in the smaller of the two wordlists [of the text or corpus that is analysed
and of the reference text or corpus] (. . .) with the frequency of the same word in the reference wordlist. All words which appear in the smaller list are considered, unless they are in a stop list. (. . .) To compute the ‘key-ness’ of an item, the program therefore computes

- its frequency in the small wordlist
- the number of running words in the small wordlist
- its frequency in the reference corpus
- the number of running words in the reference corpus

and cross-tabulates these. Statistical tests include:

- the classic chi-square test of significance with Yates correction for a 2 × 2 table
- Ted Dunning’s Log Likelihood test, which gives a better estimate of keyness, especially when contrasting long texts or a whole genre against your reference corpus.

A word will get into the listing here if it is unusually frequent (or unusually infrequent) in comparison with what one would expect on the basis of the larger wordlist.

This explanation is used as the definition of keywords for the entire book, except when their qualitative definition is made explicit in the text. Quantitative keywords in this book are identified by means of the log likelihood test (cf. Dunning 1993 for information on the test). The probability level set for the calculations is 0.000001, the standard setting of the software. The minimum number of occurrences required for a word to be identified as a keyword is three.

Apart from wordlists, concordance lines and lists of keywords, WST also provides statistical information on the data. This includes its number of words, its type-token ratio and its average sentence length. This is information on the complexity of the data. For example, knowledge of the number of tokens in the data is interesting when it functions as a basis for calculating the statistical probability with which a feature occurs in the data.
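The keyness computation described above can be made concrete with a short sketch. It is an illustration under stated assumptions, not WordSmith Tools’ own code: it cross-tabulates the four counts in a 2 × 2 table, computes the log likelihood statistic G2 over all four cells, and converts it to a probability using the chi-square distribution with one degree of freedom. The function names (log_likelihood, p_value, keywords) and the toy frequencies are hypothetical; the threshold of 0.000001 and the minimum frequency of three are the settings reported in this section.

```python
import math
from collections import Counter


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """G2 for the 2 x 2 table: word vs. all other words, study vs. reference."""
    observed = [
        [freq_study, size_study - freq_study],
        [freq_ref, size_ref - freq_ref],
    ]
    row_totals = [size_study, size_ref]
    col_totals = [freq_study + freq_ref,
                  (size_study - freq_study) + (size_ref - freq_ref)]
    total = size_study + size_ref
    g2 = 0.0
    for r in range(2):
        for c in range(2):
            o = observed[r][c]
            if o > 0:  # a cell with zero observations contributes nothing
                expected = row_totals[r] * col_totals[c] / total
                g2 += o * math.log(o / expected)
    return max(2.0 * g2, 0.0)  # guard against tiny negative rounding errors


def p_value(g2):
    """Upper-tail probability of chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(g2 / 2.0))


def keywords(study_counts, ref_counts, p_level=0.000001, min_freq=3):
    """Return (word, G2, is_positive) sorted by G2, highest first."""
    size_study = sum(study_counts.values())
    size_ref = sum(ref_counts.values())
    result = []
    for word, freq in study_counts.items():
        if freq < min_freq:
            continue
        ref_freq = ref_counts.get(word, 0)
        g2 = log_likelihood(freq, size_study, ref_freq, size_ref)
        if p_value(g2) < p_level:
            positive = freq / size_study > ref_freq / size_ref
            result.append((word, g2, positive))
    return sorted(result, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy frequencies, invented for illustration only.
    study = Counter({"abbey": 50, "the": 500, "rare": 2})
    reference = Counter({"abbey": 5, "the": 50000, "other": 49995})
    for word, g2, positive in keywords(study, reference):
        print(word, round(g2, 2), "positive" if positive else "negative")
```

The flag is_positive separates words that are unusually frequent in the study text from those that are unusually infrequent, mirroring the two directions mentioned in the quotation. Stop lists and lemmatization are left out of this sketch.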
2.3 Concluding comments

Discussing the basic goals, the techniques and principles of this book (1) highlights the theoretical and practical background of the following analyses and (2) makes the analyses replicable and checkable for their recipients. This makes it possible for the growth of knowledge and the
innovations deriving from the analyses to be evaluated. Decisions taken as part of the analyses only become transparent and understandable when their theoretical and practical positions and backgrounds are understood by the recipient. These positions and backgrounds influence the analytic design and how recipients understand and evaluate the results from the analyses. It is therefore of prime importance to make them explicit, as has been done in this chapter. In the following chapter, further basic principles and backgrounds of the analyses are discussed, namely, their theoretical contexts. All analyses demonstrated in this book refer and relate to previous research and are part of the theoretical context of the discipline. The analyses are therefore inherently intertextual, and it is only by knowing this background that their implications, for example, for the usefulness of analytic techniques, can be evaluated.
Chapter 3
Language and meaning
The primary aim of linguistics is to analyse how meaning is encoded in language by looking at regularities and patterns in language data. But since Saussure (1916), the question has arisen as to which level of language linguists analyse. In his Cours de Linguistique Générale, Saussure distinguished between langue and parole. Parole is the actual language use, that is, the language that is manifest when uttered by speakers or writers of a language. Langue, on the other hand, is the language system which allows the production of parole. It is the entire language system which cannot be deduced from or analysed by looking at single utterances. Utterances are realizations of single aspects of the system, but they are not manifestations of langue. This means that the analysis of texts is the analysis of parole. For a long time, it seemed that langue could not be analysed, even though Saussure had demanded that it should be the main object of linguistic research. Since about the 1990s, the analysis of patterns in large amounts of data has become possible as corpus linguistic techniques were developed. The compilation of language data into corpora and their analysis now allows for detecting and researching language patterns in large amounts of data, such as the 100 million words in the BNC or in even larger corpora. The analysis of patterns facilitates insights into the so-called intertext, that is, references to other texts by recurrent linguistic patterns. These patterns exist on all linguistic levels, such as in lexis, phraseology and grammar. Even though intertext is manifest in parole, that is, in actual texts, it indicates langue, since insights into the language system can be induced from recurrent patterns in parole. This is because the analysis of a corpus which represents a language variety allows the induction of the probability with which a particular language pattern occurs in the language variety represented by the corpus. 
However, the probabilistic character of corpus linguistics has to be borne in mind in this process. The induction of the probability with which a feature occurs in language is a hypothesis about langue which can be supported by further data from
other corpora. But there can be no proof for these hypotheses, because langue cannot be observed directly. Any statement about langue is based on probabilistic research of language features as represented in a corpus, that is, in a selection of language. As langue is the sum of all potentially possible units of parole, corpus linguistics is a linguistics of parole from which hypotheses about langue are induced. Recurrent language patterns form networks between texts and serve as pointers to each other. These patterns are called intertext and they can be, for example, grammatical rules, such as the syntax of a sentence, or lexical references to other texts. The latter are called intertextual references. One instance of an intertextual reference is the phrase ‘Once upon a time . . .’, which a receiver identifies as the beginning of a fairy tale. This identification is based on the knowledge of genre conventions, which create expectations about the text that follows the phrase. These expectations can be either fulfilled or disappointed. Intertext is a theoretical construct which is manifest in a text. It is both part of langue and parole. Intertext is part of a text and therefore part of parole. It is also part of patterns that are inherent to the language, like syntactical rules, so that it is also part of langue. Intertext partly depends on human actions. It is created by the sender of a message and is decoded by its receiver. In case either sender or receiver do not decode the intertext, as occurs when they do not recognize linguistic patterns in a text, the intertext nevertheless exists. However, it is not actively decoded, so that no links to either other texts or to the language system are recognized and created by the receiver. This influences the interpretation of a text, but not its form. This same principle applies to linguistic patterns that occur only within one text and which contribute to its meaning. 
These patterns are called intratextual references and form an intertext within one text. Like intertextual references, also intratextual references are not recognized and decoded by all recipients of a text. This means that meaning that is encoded by these patterns is not the same for all recipients of the text, depending on whether or not they recognize the patterns. Factors, such as a receiver’s individual textual competence, the situational context of reception and the receiver’s background knowledge, influence the interpretation of a text and whether or not linguistic patterns are recognized. Meaning is not a unified phenomenon, but is based, at least to some degree, on the receiver’s individual linguistic and encyclopaedic knowledge. These theoretical premises can be illustrated in Austen’s novel Northanger Abbey. When reading the text, one notices that titles of Gothic novels
such as The Mysteries of Udolpho and The Monk are frequently mentioned. But without knowing the contents and developments of these novels, the reader misses implicit references to them in NA, such as when situations and relationships between protagonists of Gothic novels are mirrored (cf. the Appendix for intertextual references in NA). If these implicit references are not recognized, it is impossible for the receiver to see that NA treats its contemporary genre, Gothic novels, ironically. But even though the ironic nature of the novel is not recognized by its receiver and therefore does not influence the reader’s interpretation of the novel, that is, the text’s perceived meaning, the irony still exists in the text. Intertextual and intratextual references are part of a text and are therefore part of parole. They are also part of the language system and are therefore also part of langue. This shows that langue and parole are not two distinct entities, but that they are two aspects of the same phenomenon as described by Firth (1935: 53, 1957b: 2), Halliday (1978: 38) and Sinclair (1991: 103). Halliday’s weather-climate analogy (1991, 1992) is a convincing explanation for rejecting the dualism between langue and parole. When talking about the weather, people discuss the momentary state of atmospheric conditions. It helps to decide which clothes are appropriate for the following day, for example. When talking about the climate, people discuss the atmospheric system which underlies the weather. The accumulation of weather conditions over years or centuries is the climate. A climate can be subtropical, but the weather on a given day can still be cold. Weather and climate are the same phenomenon, but looked at from different perspectives depending on whether its manifestation on a particular day, that is the weather, or the underlying system, that is the climate, is the subject of an enquiry. 
This is also true for langue and parole, which both look at language from two different perspectives. The goal of the following analyses is therefore, (1) to gain insights into the concrete data, parole, and (2) to induce information on the language system, langue, on which the data is based. A corpus linguistic analysis looks at language, that is parole, and the implications of parole for langue by analysing intertext and intratext, that is linguistic patterns in a text or corpus. Recognizing intertext and intratext depends on both sender and receiver of a message and on their individual knowledge. This individual knowledge of, for example, text and genre conventions, linguistic patterns or single texts is henceforth summarized as textual competence. Textual competence is an umbrella term for several competences and forms of knowledge. It is specific for every speaker of a language.
Figure 3.1 Elements creating meaning in language
When summarizing the argument so far, it is clear that textual meaning derives from various factors and their interrelations. Their various connections are shown in Figure 3.1. Figure 3.1 shows that identifying and analysing linguistic patterns, intertext, langue and parole are based on the analysis of a text or corpus, since the factors are manifest in texts or corpora. Encoding and decoding them depends on human actions, that is, on the sender and the receiver of the message and their textual competences. The meaning of a message is a result of all these factors in language. The question of where exactly meaning is encoded in language has been widely discussed in linguistics with four approaches having been dominant:

1. meaning is inherent in the language system
2. meaning is inherent in a text as a unit
3. meaning is encoded in recurrent linguistic patterns in a text
4. meaning exists in human perception and is projected onto a text.
These four approaches focus on the system, the text, the intertext and the recipient of a message and his/her cognition. Literary studies see meaning within a text and they interpret intertextual references as part of the text. The notion that meaning is inherent in the language system is the one postulated by Saussure (1916). As described above, he calls langue the primary object of linguistic research. Perceiving the text as the unit of meaning is Halliday’s (1985) position and that of systemic-functional linguistics. The full meaning of a text can only be revealed when looking at the complete text. Looking for meaning in recurrent linguistic patterns, that is, in the intertext, is the position of corpus linguistics. This is demonstrated by the analyses in Chapters 5 to 7. The notion that meaning is part of human perception and that it is realized by way of cognitive processes is the position of cognitive linguistics.
It assumes that texts contain pointers towards meaning, but that meaning itself can only be fully realized in the receiver’s mind. One example of this is schemata, that is, standardized sequences of actions (cf. section 3.4 for further information on the concept of a schema). According to cognitive linguistics, a text itself does not carry meaning, but it triggers cognition and, therefore, meaning in the human mind. Figure 3.1 shows that the text is central to the various relationships which encode meaning in language. Language patterns, corpora, intertext, textual competence, langue and parole are all connected with the text, even if indirectly. The text functions as a connector between the different elements. A corpus, for instance, consists of texts or text fragments which are analysed for linguistic patterns. These patterns might, in turn, create intertext. Because of their textual competences, the sender and receiver identify this intertext. Parole is manifest in a text and the accumulation of language in a corpus indicates langue. Also, triggers for potential cognitive processes are part of a text. The central position of all these processes and relationships is held by the text. This would support the positions of both Halliday and of literary critics who perceive the text as the unit of meaning in language. When looking more closely at Figure 3.1, however, one notices that meaning is not only part of a text, but part of all factors in the diagram and that it is activated by all of them. This is because all factors are closely connected with each other or are even interdependent. Meaning is not encoded at only one point in language, but it is activated by a receiver’s textual competence, and is part of both the text and linguistic patterns. Even though the patterns are part of the text, they nevertheless function as separate units within the text. Linguistic patterns can be either recurrent or unique linguistic phenomena within a text.
An example of a recurrent pattern is quantitative keywords (cf. Chapter 2 Goals, techniques, principles), which indicate dominant topics in a text or corpus. In NA, the quantitative keywords show the novel’s intertextual references to Gothic novels (cf. Chapter 5 Keywords and concordance lines). An example of a unique meaningful phenomenon is qualitative keywords in the sense of literary criticism. These keywords occur only infrequently in a text, but they have high symbolic value (cf. Chapter 5 Keywords and concordance lines). Turning yet again to NA, the node word ‘gothic’ occurs only four times in the text and is used to describe buildings in all four instances. Despite this infrequent occurrence, the node word has great significance for the content of the novel, since the novel is a satire on Gothic novels.
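Counting a node word and listing its concordance lines, as described here for ‘gothic’, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions and not the procedure used in this book: the function names `tokenize` and `concordance`, the simple regular-expression tokenizer and the span of three words are hypothetical choices, and the sample sentence is invented rather than quoted from NA.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, node, span=4):
    """Return one (left co-text, node, right co-text) triple per occurrence of the node word."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append((left, tok, right))
    return lines

# Invented sample text, NOT a quotation from the novel.
sample = ("The abbey had a Gothic window. She admired the Gothic arch "
          "and the old chapel beside it.")
tokens = tokenize(sample)
lines = concordance(tokens, "gothic", span=3)

print("occurrences:", len(lines))
for left, node, right in lines:
    # Align the node word in a central column, as in a KWIC display.
    print(f"{left:>25} [{node}] {right}")
```

The number of concordance lines gives the frequency of the node word, while the left and right co-texts are what an analyst would read in order to decide, for instance, whether every occurrence describes a building.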
But units of meaning are not only single words or complete texts. Combinations of words, that is, phrases, are also units of meaning. This is again illustrated in NA, where people and places are characterized by way of frequent phrases (cf. Chapter 6 Phraseology). Also, the complete text with its inherent linguistic patterns is a unit of meaning as it conveys meaning in the text’s complete form. In addition, text parts are identified as units of meaning in Chapter 7 Text segmentation. The argument so far has emphasized that meaning can be decoded by analysing linguistic patterns on various linguistic levels. These patterns are part of the form of a text, a fact that supports the assumption of a correlation between form and meaning in language. Because of this correlation, the analyses in this book look at linguistic patterns on different linguistic levels, that is, on the levels of the word, the phrase, text parts, and the text in order to identify their contributions to the meanings of the data.
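The idea that frequent phrases are units of meaning can be made concrete by counting recurrent word sequences (n-grams). The following Python sketch is only an illustration under assumptions: the names `ngrams` and `recurrent_phrases`, the choice of three-word sequences and the frequency threshold are hypothetical, and the sample text is invented. Note that this simple version lets sequences run across sentence boundaries.

```python
import re
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens, joined into strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def recurrent_phrases(text, n=3, min_freq=2):
    """Return (phrase, frequency) pairs for n-grams occurring at least min_freq times."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(ngrams(tokens, n))
    return [(phrase, freq) for phrase, freq in counts.most_common() if freq >= min_freq]

# Invented sample text, not taken from any of the novels discussed.
sample = "I am sure of it. I am sure she will come. I am not sure at all."
phrases = recurrent_phrases(sample, n=3, min_freq=2)
print(phrases)
```

In an actual analysis the recurrent sequences would then be inspected in their co-texts, since frequency alone does not show which people or places a phrase characterizes.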
3.1 Stylistics and meaning
The central concern of stylistics is the relationship between linguistic form and meaning, mostly of a literary text. Its goal is to linguistically analyse a literary text to gain literary insight into it or, in a more general sense, to decode the meanings of the text. Insight gained into the text can either replicate intuitive interpretations of the text, for example interpretations developed in literary criticism, or it can generate new insight into the text. It is the latter which is both of greater interest and of greater significance for stylistic research. Only the generation of new insight into a text legitimizes stylistics as a discipline. The mere replication of results gained by literary critics would not warrant the existence of an additional discipline. Its function would be fulfilled by literary criticism. This is not the case, however. Depending on the stylistic subdiscipline an analysis is situated in, analyses focus on parole, author, reader, linguistic patterns, or intertext of a text. Langue is only implicitly discussed in stylistics, such as when insights into literary language as such or genre conventions are gained. Stylistic analyses are based on a text or a corpus. While there is consensus about the goals of stylistics, the linguistic definition of ‘style’ is highly debated. Mukherjee summarizes the different emphases of definitions as ‘style as choice, (. . .), [s]tyle as deviation from a norm, (. . .) style as recurrence (. . .) [and] [s]tyle as comparison (. . .)’ (2005: 1184f.). Style as choice is relevant, for example, in determining the formality
of an utterance. Style as deviation from a norm is used, for example, as a criterion for authorship attribution or as a feature singling out one text from others. Style as recurrence describes the probabilistic nature of language as made explicit by corpus linguistics and corpus stylistics. Style as comparison is relevant, for example, for contrasting a dialect with the standard variety of a language in order to identify and describe dialectal features in language. In stylistics, this is of particular interest for identifying and describing an author’s or a character’s idiolect. Since Jakobson’s statement (1958) which emphasised the importance of the syntagmatic axis in an analysis, stylistics in the structuralist tradition has moved away from the first characterization of style, that is, style as choice. The other three approaches are still widely held in stylistics, especially in corpus stylistics. This is demonstrated both in this chapter and in the analyses in Chapters 5 to 7. In this book, style is defined as a combination of recurrence, comparison and probability of linguistic features. This combination of factors makes explicit that style is unique to a particular text. Consequently, the style of a text is frequently analysed by comparing the text and its language with other texts or corpora, and style is frequently described by discussing recurrent linguistic patterns in the data. Since absolute statements about the occurrence of patterns can only relate to the particular data that is analysed, statements on the probability with which a feature occurs in the language variety represented by a corpus are frequent. This is due to the probabilistic character of corpus linguistics. Furthermore, the occurrence of style in a particular text makes style part of the syntagmatic axis of a text.
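The comparison of a text with a reference corpus that underlies this definition of style can be illustrated with a keyness calculation. The sketch below uses the log-likelihood statistic in its common two-term form, which is one widely used keyness measure in corpus linguistics; this section does not state which measure the analyses in this book rely on, so the choice is an assumption. The function names and all frequencies are hypothetical and invented for illustration.

```python
import math

def relative_frequency(freq, size, per=1000):
    """Frequency of a word normalized to occurrences per `per` running words."""
    return freq / size * per

def log_likelihood(freq_text, size_text, freq_ref, size_ref):
    """Log-likelihood (G2) of a word's frequency in a text versus a reference corpus."""
    combined = freq_text + freq_ref
    total = size_text + size_ref
    # Expected frequencies if the word were equally probable in both data sets.
    expected_text = size_text * combined / total
    expected_ref = size_ref * combined / total
    g2 = 0.0
    for observed, expected in ((freq_text, expected_text), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2

# Hypothetical counts: a word occurs 20 times in a 1,000-word text
# and 100 times in a 10,000-word reference corpus.
print(relative_frequency(20, 1000), relative_frequency(100, 10000))
print(round(log_likelihood(20, 1000, 100, 10000), 2))
```

A value of zero means that the word is exactly as frequent in the text as in the reference corpus; larger values indicate a greater difference, which still has to be interpreted by the analyst.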
3.2 Stylistics – the background
Mirroring the lack of a single definition of ‘style’ in stylistics, there are also a number of different analytic techniques and research questions within stylistics. This is shown by the large number of studies which adopt research techniques from different linguistic disciplines for the analysis of literary texts. These studies also follow the different traditions of definitions of ‘style’ as described above. Examples of linguistic disciplines which have been adopted for stylistic analyses are discourse analysis (e.g. Short 1981), pedagogical linguistics (e.g. Widdowson 1975, Carter 1986, Starcke 2007, Fischer-Starcke 2009a) and social-critical analyses or social pragmatics (e.g. Burton 1982, Mills 1992). Since about 2000, a growing number of analyses with a cognitive linguistic orientation (for example Culpeper 2002a,
Heywood, Semino and Short 2002) and a corpus linguistic approach (e.g. Stubbs 2005, Starcke 2006, Mahlberg 2007a, Fischer-Starcke 2009b) have also been published. This diversity of approaches within stylistics shows that there is no unified or closed discipline with a specific analytic apparatus at its disposal. Stylistics remains dynamic and diverse by making use of the complete range of linguistic analytic techniques. It mirrors the developments of and within linguistics. The basic assumption of stylistics is the structuralist assumption that a correlation exists between form and meaning in language. This correlation can be analysed on different linguistic levels listed by Fowler, when he describes when and how a linguistic analysis of literature can be successful: linguistics for our purpose [i.e. stylistics] (. . .) should aim to be comprehensive in offering a complete account of language structure and usage at all levels: semantics, the organization of meanings within a language; syntax, the processes and orderings which arrange signs into the sentences of a language; phonology and phonetics, respectively the classification and ordering, and the actual articulation, of the sounds of speech; textgrammar, the sequencing of sentences in coherent extended discourse; and pragmatics, the conventional relationship between linguistic constructions and the users and uses of language. (1986: 192) With this statement, Fowler follows Firth who said that ‘[t]he statement of meaning cannot be achieved by one analysis at one level, in one swoop’ (1950: 44). But while Fowler proposes a comprehensive list of linguistic features that should be analysed in a stylistic analysis, most analyses, in fact, look at single linguistic categories. For example, Sinclair (1975) studies syntax, Cook (1986) studies text-grammar, Cureton (1997) studies phonology, Hori (2002) and Culpeper (2002b) study semantics and Markus (2002) studies pragmatics. 
Only phonetics is not analysed in any of the studies known to me, since it is of little interest for the meaning encoded in a text. If pronunciation is important for the meaning of the text, it is discussed in a phonological analysis. At first glance, it seems to be useful to analyse all linguistic levels in a stylistic analysis as has been called for by Firth and Fowler. But even though technical developments now allow this comprehensive analysis, the time required for it would still be enormous and general doubts about such a comprehensive analysis remain. It is questionable whether the analysis of all linguistic levels of a set of data generates new or improved insights into
its meanings. As this is the main goal of a stylistic analysis, it has to be questioned whether all analyses help to achieve it. It would be beneficial and interesting, for example, to analyse the phonology in Dickens’ novels, in particular, with a focus on the different characters. But it is neither beneficial nor interesting to perform the same analysis on Austen’s novels. This is because while Dickens frequently characterizes his protagonists by their idiolects, this is much less the case with Austen. The interpretation of her novels would not benefit from such an analysis. Consequently, Fowler’s (1986) proposal for a comprehensive analysis of texts must be rejected on the grounds that not every linguistic level is relevant to the particular research question that is addressed. The stylistic analyses discussed and performed in this book therefore concentrate on selected linguistic features of the data, so that they generate insights that are relevant to their respective research questions. In stylistics, the approach to an analysis is that of linguistics. In a first step, linguistic patterns of a text are identified. In a second step, these patterns are interpreted for their meanings. This distinguishes stylistics from literary criticism which prefers an intuitive approach and takes linguistic form to be of relatively little importance for the meanings of the text. But also within stylistics, there are different approaches to the systematic identification of linguistic patterns. These approaches differ in their analytic techniques and in their goals.
3.3 The classics
Following Jakobson’s demand in his ‘Closing statement’ of 1958, one of the basic analytic principles in stylistics is that the syntagmatic axis of a text, as opposed to its paradigmatic axis, is analysed for linguistic patterns. Stylistics and much of linguistics in general follow this proposition, independent of whether prose or poetry is the object of an analysis. Jakobson himself only discusses the analysis of poetry in his statement. Nevertheless, his claims and theoretical insights are also highly relevant to prose and are therefore explained in some depth in the following paragraphs. The starting point of Jakobson’s statement is the different functions of language (emotive, referential, poetic, phatic, metalingual and conative) in a text. A literary text is characterized by the dominance of the poetic function. Unlike the other functions, the poetic function focuses on the message of a text. Therefore, the text’s message is the main object of an analysis of a literary text, so that the focus of the analysis is
self-referential as it is based solely on the text’s literary character. The objects of the analysis are linguistic patterns in the text since, [t]he repetitiveness effected by imparting the equivalence principle to the sequence makes reiterable not only the constituent sequences of the poetic message but the whole message as well. This capacity for reiteration whether immediate or delayed, this reification of a poetic message and its constituents, this conversion of a message into an enduring thing, indeed all this represents an inherent and effective property of poetry. (371) Meaning in a literary text is created by recurrent linguistic patterns such as parallelisms, for example a rhyme scheme or recurrent grammatical patterns. The syntagmatic axis of a text encodes its meanings so that the syntagmatic axis is also the object of an analysis. While the poetic function is the most dominant, the other five functions defined by Jakobson are also relevant for a literary text. However, their analysis is more prominent when looking at non-literary texts. Since linguistic patterns are objective features of a text, Jakobson (1958) states that also the meaning of a text must be objective. Referring to the phonology of poems, Jakobson says that, [s]ound symbolism is an undeniably objective connection between different sensory modes, in particular between the visual and the auditory experience. (372) Consequently, poetics, which is what is called stylistics today, is ‘an objective scholarly analysis of verbal art’ (352) which analyses formal features of a text, namely sequences and parallelisms. Following the identification of these formal patterns, the meanings of a text are extracted from them. The fact that, for Jakobson, they are objective features of the text, also makes the text’s meanings objective for him. From today’s point of view, this objectivity of meaning can no longer be asserted (cf. 
Chapter 2 Goals, techniques, principles on objectivity in corpus linguistics). Nevertheless, stylistic analyses still focus on linguistic patterns in a text in order to interpret the text’s meanings from them. This interpretation is a subjective process, however. Jakobson’s focus on sequences and parallelisms as defining features of literary texts seems to introduce an empirical component into the analysis of a text. He is remiss, though, in not defining the number of elements necessary for a sequence to be considered as such. This causes the paradox that he demands an objective scientific discipline, but leaves the decision as
to what constitutes one of its defining features to every analyst’s personal intuition and interpretation. This introduces a further subjective element into the analysis of the patterns. Jakobson’s emphasis on syntagmatic relationships in a text as objects of stylistic analyses means that the words and phrases the author really uses are analysed. This procedure stands in contrast to earlier traditional text analyses which frequently emphasized the paradigmatic axis of a text, that is, different possible ways of phrasing a content. But according to Jakobson, the meaning of a text is encoded by the language an author chooses and not by the language an author could have chosen. A text is the product of the author’s decision. Consequently, ‘[t]he poetic function projects the principle of equivalence from the axis of selection into the axis of combination’ (358, emphasis in the original). And Jakobson continues by saying that, [e]quivalence is promoted to the constitutive device of the sequence. In poetry one syllable is equalized with any other syllable of the same sequence; (. . .) in poetry the equation is used to build a sequence. (358) Equivalences in the sense of parallel linguistic features or patterns are the basis of the syntagmatic organization of a literary text. Today, structuralists still look at the syntagmatic axis of a text or corpus. The principle of equivalence has been broadened, though, to now encompass linguistic patterns in general. Identifying these patterns is the basis of a stylistic analysis. This means that an analysis is text-focused and that style is a textual phenomenon (Tolcsvai Nagy 1998). Jakobson has defined important parameters of linguistics by emphasizing the significance of the actual language of a text for an analysis as opposed to the potential language that could have been chosen for a text. He calls for a text intrinsic analysis which has the functions of language as its primary analytic objects. 
The target of an analysis is no longer the mere description of language, but the analysis of the functions of language, that is, of its effects on the receivers. The communicative function of language is now at the centre of linguistic research. The emphasis on the text’s poetic function in a stylistic analysis reflects the focus on communication in linguistic analyses in general. This specific focus, however, underlines that literary texts differ from other texts by their self-referentiality as they are characterized by a specific function, the poetic, which is relevant for every literary text and which can only be deduced from the text itself. Non-literary texts on the other hand can be characterized by all other textual functions.
In his analysis of Golding’s The Inheritors, Halliday (1971) uses Jakobson’s (1958) premises by assuming that the linguistic features in the text allow the extraction of its meaning. He calls this a functional approach to text analysis in which the meaning of a text is of central importance. Halliday illustrates the functional approach by convincingly demonstrating that he can draw conclusions about the content and the meanings of the novel by analysing the use of transitive and intransitive verbs in the descriptions of two prehistoric peoples, Lok’s Neanderthal tribe and the ‘new people’ who displace it. Halliday uses his findings to draw conclusions about how the tribes understand processes and about their chances of survival in the course of evolution. He demonstrates that the description of Lok’s tribe frequently uses simple past tense forms, has a preference for non-human subjects and that transitive verbs are almost completely absent. The linguistic patterns in the description of this people create the impression for the reader that the tribe is both inefficient and helpless in its actions. And in fact, the tribe is attacked and defeated by another tribe in the story. In contrast, the language describing the conquering people is characterized mainly by human subjects, by sentences which describe actions and by transitive verbs in most sentences. Halliday interprets this syntax as a sign of activity and goal-oriented actions which enable the tribe to survive. This interpretation is supported by the plot in which this tribe defeats Lok’s people. This characterizes it as more successful and fit for survival in the course of evolution than the conquered tribe. Halliday’s analysis is a prominent example of functional stylistics which assumes that language fulfils functions. It also assumes that linguistic features and patterns in a text have meanings and evoke meanings for the reader.
Consequently, Halliday systematically analyses one linguistic feature, that is, the transitivity of verbs, in the course of a text in order to decode the textual meanings of this feature. The success of this analysis has rightfully made Halliday’s article one of the classics of stylistics. But Halliday’s article also shows the necessary limitations of a linguistic analysis of a longer text which does not use corpus linguistic software. Even though Halliday systematically analyses a linguistic feature which most readers of the text are unlikely to notice intuitively and even though this provides him with insight into the meanings of the text, he is nevertheless restricted to text extracts for his analysis. It was impossible for him to analyse the complete text without electronic means and the relevant software which would have allowed him to discuss, for example, changes between the extracts or developments in the course of the text. Today, it might be possible to tag all verbs within the text for their grammatical forms
and to analyse their distributions across the text. Halliday did not have the means to do so. While Halliday’s analysis is and will remain a classic and influential article in stylistics, it has been severely criticized on several grounds by Hoover (1999). First, Hoover says that Halliday’s analysis ‘is not explicit enough’ (27). In his analysis, Halliday presents a table of clauses of various types that occur within the different passages of the text and which he uses as the basis for his subsequent analysis. However, Hoover finds that Halliday ‘does not indicate how he has divided the passages into clauses’ (27). While this seems unproblematic initially, since clauses seem to be a relatively straightforward concept, Hoover says that he ‘tried to duplicate Halliday’s analysis, and I would not have arrived at precisely the same divisions and figures if I had not had Halliday’s statistics as a goal’ (27). Furthermore, ‘placing the clauses into Halliday’s types were (sic) (. . .) even more difficult (transitive and intransitive action, location/possession, mental processes, attribution and other)’ (37, emphasis in the original). Second, Hoover says that [i]t is simply not true that the first long section (language A) is very intransitive compared with the second short section (language C). While the lack of transitive verbs with human subjects in passages like A does show certain limitations in Lok’s understanding, it is not true that the Neanderthal world as a whole lacks cause and effect, nor that people cannot act as agents in their world, as Halliday claims. (41) Rather, Lok’s difficulties [in understanding the new people’s actions] stems from a lack of knowledge about the culture and artefacts of the new people, not from an inability to understand agency or cause and effect. (49) Instead, the sense of powerlessness and ineffectuality that Halliday ascribes to the syntax seems to inhere (. . .) in the plot.
Lok and his people are unsuccessful in their struggle with the new people, who seem powerful in contrast because of their success. (52) While Hoover agrees on the importance of transitivity patterns and agency for the readers’ perceptions of the text (52), he implicitly accuses
Halliday of overestimating them for the novel’s meaning. This implies criticism in the tradition of Fish (1973) of the original analysis that the analyst, first, detects meaning in the text and, second, finds this meaning in the linguistic patterns of the text. While Hoover’s criticism of (1) inexplicitness and lack of documentation and (2) overinterpretation of linguistic features has to be taken seriously, Halliday’s analysis nevertheless remains a classic in stylistics. Halliday was the first to present a systematic analysis of linguistic features, which he uses to account for literary meanings of the text. Still, the points of criticism cannot be dismissed. In fact, they remind stylisticians of the importance of the documentation of analyses, that is, of replicability and checkability. By revealing the bases for an interpretation, Hoover’s first point of criticism of Halliday could have been avoided. A detailed documentation would also have revealed whether the charge of overinterpretation of features was justified. Also, Culler (1975) discusses the structuralist approaches to text analysis by Jakobson (1958) and Halliday (1971). In his book on structuralist poetics, Culler reviews different approaches to analysing literature and their accompanying theories, for example, by Barthes and Foucault, and states that the task of linguists is not to tell us what sentences mean; it is rather to explain how they have the meanings which speakers of a language give them. (74) Therefore, a ‘linguistic analysis provides a method for discovering the patterns or meanings of literary texts’ (95). But then he proceeds by saying that the direct application of techniques for [a] linguistic description may be a useful approach if it begins with literary effects and attempts to account for them, but that it does not in itself serve as a method of literary analysis. 
The reason is simply that both author and reader bring to the text more than a knowledge of language and this additional experience – expectations about the forms of literary organization, implicit models of literary structures, practice in forming and testing hypotheses about literary works – is what guides one in the perception and construction of relevant patterns. (95) Culler emphasizes not only the necessity of analysing linguistic patterns, but also the reader’s textual competence as meaning-creating elements in
a text. However, what Culler does not seem to notice is that what he describes as subjective impressions are in fact linguistic patterns which are decoded in a linguistic analysis. This is, for example, the case with (1) genre conventions which he calls ‘expectations about the forms of literary organization, implicit models of literary structures’ (1975: 95) and (2) developing and testing hypotheses on language. As in literary criticism, linguistics discusses the meanings of a text by interpreting the linguistic data on the text. It is the kind of data that is interpreted and the techniques used to generate the data which distinguish linguistics and literary criticism. Taking this into account, Culler calls for an analysis of linguistic patterns, that is, of recurrence, which is a stylistic approach to text analysis. He proposes an analysis of literature which looks at the text, that is, at parole and its syntagmatic realizations, the intertext and the reader’s textual competence. Culler also argues that a literary text can only be understood properly in the context of other literary works. Intertextual references, which are based on this context, contribute to the meaning of the work, even though they might not be noticed by its receiver. Therefore, a structuralist poetics must enquire what knowledge must be postulated to account for our ability to read and understand literary works. (1977, reprinted 1998: 304) This is the case, since [t]he work has structure and meaning because it is read in a particular way, because these potential properties, latent in the object itself, are actualized by the theory of discourse applied in the act of reading. (1975: 114) The full meaning of a work can only be decoded by a recipient when its context is understood. 
A further prerequisite for generating meaning of a literary text is the reader’s literary competence which Culler defines as ‘a set of conventions for reading literary texts’ (1975: 118), for example, knowledge of genre conventions or literary traditions. Only this accumulation of knowledge of texts allows a comprehensive interpretation of a literary text. While this claim seems to come straight from the ivory tower of the humanities, Culler’s own illustration that the graphic form of a sentence influences its interpretation demonstrates that this is not the case. He does so by discussing a sentence that was originally published in a newspaper
article as a run-on sentence and which Culler presents in the graphic form of a poem. Based on literary conventions, a recipient expects symbols and metaphors in a poem, which have to be interpreted in order to understand the sentence’s full meaning. This is not the case with a sentence from a newspaper article. Consequently, the same sentence is likely to be interpreted differently by its readers depending on its visualization, either as a poem or as a run-on sentence. This means that its form influences its meaning for the recipient. This is not an example from the academic ivory tower, but a practical demonstration of how the receiver’s textual competence determines his/her interpretation of a text. Culler positions himself within literary criticism, but the analytic parameters he calls for are those of linguistic stylistics. His emphasis lies on identifying linguistic patterns and intertextual references which form the basis of a literary interpretation of a text. According to the title of his 1975 book, he situates himself in the structuralist tradition and supports Figure 3.1 above from a literary critic’s point of view by describing how different linguistic factors create meaning in a text. His emphasis on linguistic patterns and the intertext as meaning-creating elements in texts shows that Culler perceives meaning to be part of parole and its linguistic patterns. There are also critics of stylistics. One prominent critic is Fish (1973) whose criticism is, in part, a reaction to structuralism. In particular, he criticizes the seeming circularity of stylistic analyses. Even though this reproach is directed at stylistics only, it is also applicable to other linguistic disciplines and possibly to science and research in general. Fish (1973) argues that stylistic analyses are based on the analyst’s personal preferences. He says that a linguist selects features for an analysis which s/he is convinced are significant in the data. 
After the analysis, s/he postulates the significance of the feature, which correlates with the interpretation of the text s/he had already had before the analysis. According to Fish, the original hypothesis about the meaning of the text is neither questioned nor checked in most stylistic analyses. Furthermore, Fish denies that linguistic patterns have an intrinsic meaning and refers to the contexts of these patterns, that is, their co-texts, which influence their meaning. This anticipates the position of corpus linguistics. According to Fish (1973), one example of this seeming circularity of stylistic analyses is Halliday’s (1971) work. Fish accuses Halliday of having had a set opinion of the fitness for survival of the two tribes before performing his analysis. Fish claims that Halliday’s interpretation of the text was merely supported by the linguistic patterns identified by the analysis.
This criticism of stylistics denies stylistic analyses any objectivity. However, stylistics aims at performing analyses as objectively and systematically as possible (cf. Chapter 2 Goals, techniques, principles and the discussion on Jakobson above). This is what distinguishes it from traditional literary criticism. For Fish, this argument is invalid because of the perceived circularity of stylistic analyses. For him, there seems to be no difference between a literary critic’s intuition and that of a selective stylistician. While Fish’s (1973) criticism of stylistics may be justified with regard to some analyses, the question of how this circularity could be avoided in an analysis remains. It seems unlikely that an analyst would choose a research object which seems to him/her unpromising and unlikely to produce results. Also, the linguist’s personal interests and preferences surely influence the choice of a research object. However, this does not make results less significant, since neither the function nor the meaning of a linguistic feature in a text is determined or decided on before the analysis. It is still possible that a hypothesis is falsified. Indeed, it is rather useful and even necessary to have a hypothesis about the data before an analysis. Its corroboration or falsification, however, is only possible on the basis of an analysis. This is most likely a standard procedure in all research, and Fish’s criticism can therefore be applied to all research. Although this circularity cannot be avoided completely in any research, it can be minimized by ensuring transparency of all decisions taken within an analysis (cf. Chapter 2 Goals, techniques, principles). Such transparency documents the linguist’s choices and decisions so that they can be checked by other linguists. Fish’s criticism is helpful for stylistics as a discipline as it emphasizes the need for transparency in an analysis. 
It reminds linguists of the necessity to document the different steps of an analysis and to question their motives for an analysis. Even though Fish’s criticism, taken at its full force, calls the procedure of research in general into question, it is useful for identifying potential and avoidable circularity in an analysis. Widdowson’s (2004) criticisms of critical discourse analysis (CDA) and corpus linguistics also apply to corpus stylistics and therefore warrant discussion here. His main criticism of corpus linguistics and CDA is that both disciplines over-interpret linguistic patterns. In his opinion, linguistic patterns are given too much significance in the extraction of meaning from texts while the physical context of a text is unjustly ignored. Widdowson says that ‘[o]ne needs to relate the text externally to the conditions of its production and reception’ (2004: 123) in order to understand its full meaning. This is because ‘[t]he significance is not in the text’ (122). According to Widdowson, the interpretation of a text cannot be reduced to
the analysis of textual patterns, but also has to take the context of its production and reception into account (125). Furthermore, Widdowson criticizes corpus linguistics and CDA for choosing their objects of an analysis on the basis of a previous conscious or unconscious perception of their textual relevance (166). This is not a neutral starting point for an analysis. Widdowson is thereby siding with Fish’s (1973) criticism. While Widdowson’s criticism of corpus linguistics and CDA has to be taken seriously, it can be countered, at least partly, by disclosing all data which forms the basis of an interpretation. This has already been proposed here to counter Fish’s (1973) criticism. Documenting, explaining and, if necessary, giving access to the data makes an analysis and its results transparent and understandable for its recipients. While criticism of an analysis is still possible after these measures, it is based on the exact knowledge of the data and of the analytic procedures. This makes it more powerful than it would be without these insights. Nevertheless, laying open the analytic parameters cannot dispel all criticism of over-interpretation. It can, however, reduce it. Widdowson agrees with the need for transparency in analyses and says that ‘they [students of linguistics] need to be provided with a methodology: a set of explicit and replicable analytic procedures for them to apply not only in producing their own analysis but in evaluating the analyses of others’ (2004: 173). This proposition for tertiary education is useful for linguists and their work in general. But even though Widdowson says that a systematic evaluation of analyses by way of set procedures is desirable, he does not suggest the nature of these procedures. Neither does he suggest criteria according to which an over-interpretation of linguistic patterns could be avoided. Furthermore, Widdowson is liable to commit an intentional fallacy (cf. 
Wimsatt and Beardsley 1946) when he says that ‘we are still left [after a corpus linguistic analysis] with the essential pragmatic question of “what the author was about when he produced the text”’ (123). Wanting or claiming to know an author’s intention cannot be the goal of a stylistic or, more generally, a linguistic analysis, since (1) it is irrelevant for
the interpretation of a text and (2) frequently the intention can no longer be determined. Concentrating on linguistic patterns that exist within the text allows for an interpretation which is based on objectively occurring patterns in the data. This interpretation might deviate from that of other analysts and from that of the author. However, this is irrelevant for its validity. Despite this criticism of Widdowson, his warning of an over-interpretation of data has to be taken seriously. There are different ways to help prevent it, namely by:
1. verifying the relevance of data which is used for an analysis through multiple analyses, as in Chapter 5 Keywords and concordance lines;
2. choosing the data for an analysis on a quantitative basis, as in Chapter 6 Phraseology;
3. reusing data for further analyses which has already been shown to be relevant in previous analyses, as in Chapter 7 Text segmentation.
Even though the selection of data on a quantitative basis is to some degree still open to Widdowson’s criticism, the basis for the selection of features for an analysis becomes transparent and understandable for a recipient. If necessary, this allows sound criticism of an analysis. In corpus linguistics, the analytic techniques and criteria, according to which analyses are performed, frequently determine whether an analysis is corpus-based or corpus-driven (Francis 1993). A corpus-based analysis is one that researches a particular hypothesis by way of corpus linguistic techniques. A corpus-driven analysis is one in which hypotheses are developed by way of corpus linguistic data. In a corpus-driven analysis, the choice of features that are analysed is based on the significance, that is, frequency, with which they occur in the data. This means that the exact objects of an analysis are identified only in its course. In a corpus-based analysis, the features that are analysed are selected before the analysis and the corpus is used as a reference tool. 
The corpus-based approach to an analysis is mainly deductive as it attempts to answer a given question. The corpus-driven approach to an analysis, on the other hand, is mainly inductive as the research question is developed from the data. Both approaches are useful within corpus linguistics and serve different purposes. While corpus-based analyses answer previously known questions, corpus-driven analyses are exploratory and allow the analyst to gain entirely new knowledge. This is because the exact research question is suggested by the first results of the analysis. An analysis might, for example, look at the
most frequent phrases of some data. But which phrases exactly are the most frequent ones, and therefore the objects of the analysis, is only established in a preliminary analysis of the data. The functions of the phrases in the data are established in a second analytic step. The corpus-driven approach to an analysis is much less prone to criticism in the tradition of Fish (1973) than the corpus-based approach. But neither of the two approaches is entirely proof against it. Both approaches include pre-selections, that is, subjective elements, for example the choice of objects for an analysis. These decisions cannot be avoided completely as they are part of research. However, the fact that the results of an analysis are not known before the analysis justifies this pre-selection. In the following sections of this chapter, the stylistic disciplines which are most relevant for this book are introduced and relevant literature is discussed. The discussion of literature relevant for the different analyses takes place in the respective analytic chapters of this book.
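The first step of the corpus-driven procedure described above, in which the most frequent phrases are only established in a preliminary pass over the data, can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions rather than the procedure of any study discussed here: the simple tokenizer, the function names `tokenize` and `frequent_phrases`, and the toy sentences are all invented for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: a deliberately simple rule (letters plus an optional
    apostrophe part); real corpus tools use more careful tokenization.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def frequent_phrases(tokens, n, min_freq=1):
    """Return (phrase, frequency) pairs for all n-word phrases that occur
    at least min_freq times, most frequent first."""
    counts = Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return [(p, f) for p, f in counts.most_common() if f >= min_freq]


# Toy data standing in for a real text; not taken from any corpus.
sample = (
    "I do not know what to say. I do not know what to think. "
    "She said she did not know what to do."
)
tokens = tokenize(sample)
top = frequent_phrases(tokens, n=4, min_freq=2)
for phrase, freq in top:
    print(freq, phrase)
```

The list returned by `frequent_phrases` only identifies the objects of the analysis. Establishing the functions of these phrases in the text remains a separate, interpretative second step.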
3.4 Cognitive stylistics
Taking Fish’s (1973) and Widdowson’s (2004) criticisms into account, it seems that considering cognitive processes in a stylistic text analysis could solve some of the problems they mention. This is because cognitive stylistics has moved away from the purely structuralist and formalist analytic techniques discussed earlier in this chapter. Instead, it looks at meaning as being evoked in the receiver’s cognition by linguistic elements in a text. The focus of cognitive analyses is the receiver of a text. The object of an analysis is parole as it evokes cognition. One way of discussing cognitive processes evoked by language is to analyse the use of schemata (Bartlett 1932) in texts. Schemata are conventionalized series of actions which are stored as complete units in the brain. One example of a schema from NA is travelling to Bath for health reasons. Merely mentioning the journey to Bath triggered contemporary readers’ knowledge of the processes involved, with which Austen expected her readers to be familiar, for example, the procedures in the Pump House in Bath. This is why she does not explain them in the novel. In fact, the Pump House was the local assembly house in which patients and convalescents could enjoy the city’s social life, like the dances that were held there, and drink healing water. It was the social centre of Bath at the time. These functions of the Pump House are not explicit in the text, but are implicitly conveyed to the readers by mentioning the house. Contemporary readers
could then activate the schema of actions performed there and of the social importance of the place. The fact that literary texts use schemata to imply meaning is theorized by Culpeper (2002a) who discusses the impact of cognitive processes on the characterization of literary protagonists. He writes, referring to the iceberg theory mentioned by Toolan (1988, which he in turn probably adapted from Hemingway), that only very little of the protagonists’ characterization happens via language. In fact, most of the characterization perceived by a reader is below the surface, that is, outside of language. It is effected via cognitive processes which are triggered by schemata in the text, that is, by language. While Culpeper discusses these points theoretically and does not demonstrate an analysis, Heywood, Semino and Short (2002) identify and analyse cognitive and linguistic metaphors in extracts of two literary texts. Their theoretical concept is Lakoff’s (1993) description of conceptual domains which the authors use in order to describe and analyse metaphors in extracts of novels by Maitland and Rushdie. In the course of their analysis, the authors demonstrate that decoding these domains provides in-depth insights into the literary meanings of the texts. These insights reach deeper than those that could be gained without a cognitive analysis. But since the interpretation of the conceptual domains in the texts and the discussion of implications for the texts’ literary meanings deriving from these domains take up only little space in the article, questions concerning the number of domains in the text and their significance for the text’s meaning remain open. Also, research by van Peer and Maat (2001) shows that cognitive processes influence the reception and perception of a text. The authors demonstrate that readers of a text empathize with those characters who seem to have legitimate or altruistic motives for their actions. 
Actions performed by characters who seem egoistic are judged more negatively. The motives for actions are inferred from the text by its readers. Textual situations which readers were asked to discuss include accounts of conflicts between a couple described from the perspectives of both parties involved. These texts were manipulated by the authors by integrating either the man’s or the woman’s feelings into the original text. The texts were then used to trigger either empathy or dislike for one member of the couple in the reader. The readers’ reactions to the texts provide insights into the reading process by indicating which stretches of a text and which words are read most intensively and therefore trigger reactions.
Further cognitive stylistic studies are performed by Cervel (1997–1998) and Freeman (2002). Their studies give insights into the psychology of characters, actions and recipients of literary texts. Freeman (2002) compares Dickinson’s and Frost’s poems by looking at their use of schemata. The analysis shows that Frost trusts in human reason and that this trust is based on the past. Dickinson, on the other hand, discusses the destructive potential of higher beings. According to Freeman, this indicates that she is ahead of her time in foreseeing future social tendencies. Cervel (1997–1998) analyses the characterization of protagonists from Austen’s Pride and Prejudice by way of schemata. Even though her analytic steps and her conclusions are partly based on a dubious interpretation of the novel (as an example, she says that the Bennet family ‘stand for the low layers of society’ (242) when they really come from the middle class), Cervel shows that the concept of love as a journey is a recurrent motif in the novel. However, because of the study’s lack of systematic rigour and its partly unclear theoretical basis, this insight does not gain its full weight. This makes the study unconvincing and random in its approach, even though its results could be potentially interesting for Austen scholars. The articles discussed above are intuitive and therefore subjective analyses. Analytic decisions are not based on objective or transparent criteria, so that other linguists might take different analytic decisions. While the results gained from the analyses are highly interesting and valuable contributions to stylistics and possibly literary criticism, they cannot be checked, either by the means proposed as parts of the evaluation criteria set up in Chapter 2 Goals, techniques, principles, or by any other linguistic techniques. They are therefore open to criticism in Fish’s (1973) terms. 
Still, they provide interesting insights into how meaning is encoded in language and develops in a reader’s cognition, and into the readers’ receptions of language.
3.5 Corpus stylistics
While Jakobson (1958) and Halliday (1971) could analyse linguistic patterns only manually and thus intuitively, without electronic means, and in rather small extracts from longer texts or in poems, corpus stylistics now allows the analysis of complete texts or corpora for their lexical, phraseological and grammatical patterns. Following Jakobson, the syntagmatic axis of a text is at the centre of an analysis, such as in the analysis of frequent phrases in a text. Corpus linguistics and corpus stylistics, as understood and performed
in this book, are based on the parameters set up by Jakobson, but expand his approach by providing means for analysing much larger sets of data than those that either Jakobson or Halliday had the means for. The analysis of non-fiction texts by using corpus linguistic techniques has been amply demonstrated. But corpus stylistic analyses, that is, the analysis of literary texts or corpora by means of corpus linguistic techniques, are still rare. Notable exceptions include Burrows (1987) with his impressive analysis of the idiolects of Jane Austen’s characters, and Tabata (1994, 2002) who analyses corpora compiled of Charles Dickens’ novels. Both linguists analyse literary corpora. Corpus stylistic analyses of texts have been presented, in particular, by Burgess (1999, 2000), Hardy and Durian (2000), Stubbs (2005), Starcke (2006), O’Halloran (2007a) and Fischer-Starcke (2009b). In general, corpus stylistic analyses are based on the following parameters: At the very least, a computer-aided analysis can help the reader to detect and keep track of the manifold patterns of the narrative. It may even alert the reader to subtleties and complexities that may otherwise have been missed or overlooked. These may be at various levels: patterns of individual or related words or phrases or image clusters; the association of particular words or phrases with specific figures or events; special or idiosyncratic usage of, e.g., particular turns of phrases by, for example, the narrator at specific moments in the narrative or repeatedly in connection with specific characters. Finally, (. . .) a computer-aided analysis may provide a sound statistical foundation on which to test critical hypotheses (Burgess 1999: 20). One focus in corpus stylistics has been the analysis of authors’ idiolects. This is exemplified by Tabata (1994), who analyses a corpus compiled from extracts from ten novels by Dickens. 
These novels were written within two time spans and each novel is represented by about 20,000 words. Tabata’s research object is the most frequent lexis of the corpus, mostly grammatical words, which he uses to demonstrate the development of Dickens’ style of writing in the course of his life. In his first novels, Dickens used a rather formal style of writing which tended towards written language. This changed in the course of his career towards a more spoken language style. In 2002, Tabata enlarged the scope of his study of 1994 and analysed a corpus compiled from 23 novels by Dickens. The goal of this research was again to look at developments in Dickens’ style of writing, but also to make general statements concerning the author’s language. One illustration of
the latter is the difference between the density of information in texts, which were designed as a series, such as David Copperfield, and in his sketches, such as ‘The Uncommercial Traveller’. The serial publications frequently describe emotions, and they are characterized by a style resembling spoken language. The sketches, on the other hand, frequently feature a nominal style and are decidedly descriptive. Tabata’s research offers new, highly interesting insights into Dickens’ language, which, in a different study, could be used to gain insights into the meanings of every individual text and of the entirety of Dickens’ oeuvre. Generating explicitly literary insights into Dickens’ oeuvre was not, however, the goal of this particular research. Mahlberg (2007a, 2007b) also looks at Dickens and, like Tabata (2002), analyses a corpus compiled from 23 texts by the author. Mahlberg (2007a) uses this corpus to extract phrases of three, four and five words and so-called key clusters. The latter are phrases extracted by WST which occur significantly more frequently in the corpus than in a reference corpus. The concept is analogous to that of quantitative keywords. Elsewhere (2007b), Mahlberg analyses the phrases of different lengths which occur at least five times in the corpus. In both articles, Mahlberg looks at the local textual functions of the phrases. These are specific both for individual characters in Dickens’ novels and for particular functions. And she finds, for example, that phrases which include a body part frequently provide ‘contextual information that accompanies the description of a situation or activity which is more central to the story’(2007a: 25). Mahlberg’s approach to analysing phrases and their functions in a corpus complements research by Tabata (1994, 2002) and Hori (2002) with the latter discussing lexical words and their functions in Dickens’ novels. 
This research provides insights into Dickens’ language and into how meaning is encoded in his texts. It therefore contributes to the study of meaning in Dickens’ texts. Burrows (1987) presents a middle ground between Tabata’s and Mahlberg’s approaches to analysing an author’s oeuvre. He studies Jane Austen’s language, and by looking at frequent, mostly grammatical words, he develops literary interpretations of the novels’ characters and situations and of the author’s idiolect. His analyses were performed by using, necessarily, rather rudimentary software, but highly complex statistics. Burrows demonstrates that word lists and distribution diagrams of single words facilitate knowledge of the protagonists, their characterizations and their relationships to each other. The fact that technical possibilities were limited at the time
Burrows did his research means that his important and interesting conclusions could not go into more depth than what he presents in his book. They could also not be complemented by further analyses. This, however, does not dampen the impressiveness of his study. Also, Burrows has since made up for that (for example in his 2002 publication), in particular with his works on authorship attribution. The present book is partly based on the results of his 1987 research. Corpus linguistic analyses of literary texts necessarily focus on different aspects than analyses of corpora. One example is Hardy and Durian (2000), who analyse collocations and colligations of the lemma SEE* and its past tense form saw in Flannery O’Connor’s language. For this purpose, they compare literary texts by O’Connor with the Brown Corpus1. This provides them with empirical evidence for their conclusion that topics in the texts are to some degree characterized by descriptions of visual impressions. In his analysis of Conrad’s Heart of Darkness, Stubbs (2005) looks at frequent lexis and frequent phrases in the text. This allows him to demonstrate that the use of vague lexis, such as something, contributes to the novel’s air of uncertainty and to the impression that lack of knowledge is omnipresent in the novel. And in fact, both topics are two of the novel’s leitmotifs as identified by literary critics. Stubbs also shows that the novel’s most frequent phrases contribute to its leitmotifs of (1) geographic and psychological space, (2) appearance and reality, and (3) ignorance and uncertainty. The analyses uncover meanings of the text which had not been previously discussed by literary critics. This further demonstrates the potential for discoveries of corpus linguistic analytic techniques in the analysis of literature and of corpus stylistics in general. The most frequent phrases of Austen’s novel Persuasion are analysed in Starcke (2006). 
The article demonstrates how the phrases (1) contribute to characterizing the relationship between two of the novel’s protagonists and (2) help to create the novel’s sombre atmosphere in which time passing is of great significance for the novel’s literary meanings. This analysis gives insights which had yet to be gained in nearly 200 years of literary criticism of the novel. This is further evidence for the usefulness of corpus stylistic analyses for interpretations of literary works. A comprehensive analysis of Joyce’s short story ‘Eveline’ is demonstrated by O’Halloran (2007a). He shows that, already at the beginning of the story, linguistic features indicate that Eveline will not leave Dublin at the end of the story to start a new life in Buenos Aires. The features that O’Halloran identifies are quantitative keywords, constructions which include her + body part and free indirect thoughts in constructions with would. O’Halloran
demonstrates convincingly that they hint at Eveline’s ‘mental paralysis’ (236), as he calls her inability to enter the ship at the end of the story. By doing so, he shows how studies that are based on empirical evidence can counter Fish’s (1973) criticism which he describes as ‘a usefully uncomfortable body-piercing for stylistics’ (241). This emphasizes the importance of Fish’s criticism as a reminder for stylisticians to question their research for possible circularity. Fischer-Starcke (2009b) presents an analysis of quantitative keywords and frequent phrases of Austen’s Pride and Prejudice. The article shows that the collocations and colligations of these items encode literary meanings of the text, especially with regard to family relationships. Semino and Short (2004) adopt a different approach to corpus stylistics than the studies discussed above by systematically analysing three corpora, each one representing one of the three genres prose fiction, newspaper news reports, biography/autobiography, that they themselves compiled and tagged. These corpora are analysed for (in the authors’ sequence):
- narrator’s representation of voice and writing;
- narrator’s representation of speech acts, of writing acts and of thought acts;
- indirect speech, writing and thought;
- free indirect speech, writing and thought;
- (free) direct speech, writing and thought;
- internal narration; and
- inferred thought presentation.
The proportions to which these categories occur in the different corpora are interpreted as genre specific features by Semino and Short. While this does not offer insights into literary meanings of data, the authors identify formal linguistic patterns that are characteristic of the genres, two of which are literary. All research discussed in this section uses corpus linguistic techniques for the analysis of literary works. The studies aim at generating insights into the literary meanings of the works, into an author’s idiolect and style of writing, or into genre conventions. These goals are achieved by all studies discussed, frequently by analysing a single linguistic category or feature. Since corpus stylistics is still a fairly young discipline, however, a number of open questions remain even after the highly interesting and relevant analyses discussed above. One such question concerns the interaction between different linguistic categories and levels which is only rarely analysed, but would offer multi-dimensional insights into the language and the meanings of a text.
In addition, it would show the interdependence of different linguistic levels and the relationships between different levels of meaning in language. Moreover, the unit of meaning that is analysed, often the word, is selected at the beginning and is not questioned during the analysis. Furthermore, analytic techniques are not tested on different sets of data, and data that has been generated for one analysis is not reused in another. There is no testing whether knowledge gained about analytic techniques in one analysis can be adapted to fit other data. The transferability of analytic techniques and steps from one analysis to another is neither questioned nor tested. This is the case even though the validity and relevance of results would greatly increase if the transferability of techniques to other data were shown. To reuse data and adapt analytic techniques from one analysis to another would be a humanistic form of the social scientific method of triangulation. Even though this would not involve using different analytic techniques to answer one question, as has been demanded by, for example, Campbell and Fiske (1959) and Denzin (1970), reusing data from one analysis for another would greatly increase its significance (cf. Chapter 2 Goals, techniques, principles). Furthermore, it would demonstrate the relationships between different linguistic levels, such as word and phrase. This is shown in the analyses of Chapters 5 to 7 in this book.
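Several of the studies discussed in this section rely on quantitative keywords or key clusters, that is, items which occur significantly more frequently in a corpus than in a reference corpus. The following Python sketch shows one way in which such a comparison can be computed. It assumes the log-likelihood statistic, which is a common choice for keyness but is not stated in the text to be the measure used in any of the studies above. The function names, the threshold of 3.84 (the chi-square critical value for p < 0.05 with one degree of freedom) and the toy frequencies are likewise assumptions made for the illustration.

```python
import math
from collections import Counter


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of one item, given its frequency in the
    study corpus and in the reference corpus and the two corpus sizes."""
    total = freq_study + freq_ref
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    ll = 0.0
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll


def keywords(study_tokens, ref_tokens, threshold=3.84):
    """Return (word, keyness) pairs for words that are relatively more
    frequent in the study corpus and reach the threshold, highest first."""
    study, ref = Counter(study_tokens), Counter(ref_tokens)
    n_study, n_ref = len(study_tokens), len(ref_tokens)
    result = []
    for word, freq in study.items():
        ll = log_likelihood(freq, n_study, ref[word], n_ref)
        # Keep positive keywords only (over-used in the study corpus).
        if ll >= threshold and freq / n_study > ref[word] / n_ref:
            result.append((word, round(ll, 2)))
    return sorted(result, key=lambda pair: -pair[1])


# Invented toy frequencies, not real corpus data.
study_corpus = ["the"] * 50 + ["of"] * 40 + ["abbey"] * 10
reference_corpus = ["the"] * 500 + ["of"] * 400 + ["and"] * 99 + ["abbey"] * 1
print(keywords(study_corpus, reference_corpus))
```

In this toy example, only the word that is over-represented relative to the reference corpus is returned, while words with the same relative frequency in both corpora are not. The same comparison can be applied to phrases instead of single words by passing lists of phrases rather than lists of tokens.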
3.6 Concluding comments
As mentioned at the beginning of this chapter, stylistics is not a closed discipline which has a limited number of analytic techniques at its disposal. On the contrary, it uses all techniques of linguistic text analysis to gain insights into the contents and topics of a text or corpus, or into the idiolect of an author. This makes stylistics a creative discipline which can combine different analytic techniques and traditions. This also distinguishes it from other linguistic disciplines which usually use one set of analytic tools only. The variety of analytic techniques in stylistics is, on the one hand, a strength of the discipline. On the other hand, it also shows that there is no consensus in linguistics as to where meaning is encoded in language. Cognitive linguists, for example, argue that cognitive processes have to be included in the analysis of meaning, while corpus linguists prefer a quantitative approach which can be tested by other linguists. There is no proof that either of the two theories is correct, only evidence for the growth of knowledge that is effected by them.
The goals of the different analytic approaches also differ. While cognitive stylistics aims at gaining knowledge about concrete texts, about their reception by readers and about readers’ receptions of texts in general, corpus linguistics gains insights into the language of the specific text or corpus and into the language variety that is represented by the corpus. This is because not only single manifestations of language and their linguistic patterns, but also patterns across different texts are analysed. In a stylistic analysis, this variety of analytic techniques of linguistics can be used not only to analyse literary texts for their meanings, but also to test the techniques and to develop them further. One way of doing this is by analysing texts which have been discussed intensively by literary critics. When knowledge gained by linguistic techniques matches that gained by literary critics, the significance of new insights gained by linguists increases. This is because the possibility of replicating the results demonstrates the potential of the technique for gaining literary insights, including entirely new ones. Gaining new insights into a work or the idiolect of an author is the major goal of stylistic analyses. It is achieved by (1) identifying and (2) interpreting linguistic patterns in a text or corpus. This procedure ensures that statements regarding the contents or structure of a text or corpus are based exclusively on the actual language of the data. Looking at stylistic analyses from a purely linguistic point of view, it becomes clear that the analyses can help to test the linguistic techniques and to develop them further. The question of how meaning comes into and develops within a text is relevant not only for literary texts, but for language on the whole. This is why corpus stylistic analyses contribute to developing techniques for decoding meaning also in general language usage. 
The following analyses in this book (Chapters 5 to 7) use mainly corpus linguistic techniques, in particular the analysis of quantitative keywords, of phrases and of the distribution of lemmata. The choice of these techniques and of corpus linguistics as an overall framework for this book is based on the characteristics of corpus linguistics which are discussed in Chapter 2 Goals, techniques, principles. In Chapter 5 Keywords and concordance lines, the corpus linguistic techniques are supported by schema theory. This is a combination of analytic techniques and theoretical approaches which has already been used successfully by O’Halloran (2007b). The following analyses show that the combination of analytic techniques, for example of corpus linguistics and cognitive linguistics, is useful for the analysis of a text and a corpus. Different analytic techniques reveal different linguistic patterns and their meanings. This multi-dimensionality leads to
more detailed knowledge concerning the contents and the structure of the data than could be gained by using only one technique. The one-dimensional approach, in the sense of using only one technique, of most of the works discussed above is enlarged in the following analyses. The results gained from that also affirm corpus linguistics as the basic framework of the analyses. Chapter 4 Summary of Northanger Abbey gives a brief summary of NA. Chapters 5 to 7 present applications and implications of the theoretical and practical contexts of corpus linguistic and stylistic research presented so far. The units of the three analyses are lexis, phraseology, text parts and text.
Chapter 4
Summary of Northanger Abbey
Since NA is the textual basis for most of the following analyses which aim at literary interpretations of the novel, this very short chapter provides a brief summary of NA. It is designed to help follow the progression of the plot, to understand interpretations of the novel that are based on linguistic data and to match names of protagonists with their actions and characterizations. The protagonist of the novel is Catherine Morland, who is invited by her friends Mr and Mrs Allen to accompany them on their journey to Bath. Catherine gladly accepts this invitation, and she leaves her hometown Fullerton for Bath at the beginning of the novel. In Bath, Catherine meets Isabella Thorpe and her brother John Thorpe. She soon becomes friends with Isabella, who introduces Catherine to Gothic literature. From this point on, Catherine is fascinated by the literature. But in her naïveté, Catherine is unable to completely distinguish between the fiction of her novels and reality. She believes that the Gothic stories of her novels are based on reality. At the same time in Bath, Catherine meets and befriends the siblings Henry and Eleanor Tilney and soon falls in love with Henry. Not long after her arrival in Bath, Isabella Thorpe becomes engaged to be married to Catherine’s brother James. But once Henry Tilney’s brother Captain Tilney arrives in Bath, Isabella breaks off the engagement in the hope of attracting the wealthy Captain Tilney as a better match than James Morland. However, this hope is not fulfilled. Meanwhile, John Thorpe courts Catherine, who neither notices his affections nor is romantically interested in him. Henry and Eleanor’s father General Tilney meanwhile courts Catherine as a future wife for Henry, because the general falsely believes Catherine to be the heir of the wealthy Mr Allen. To further pursue the goal of wedding Catherine and Henry, General Tilney invites Catherine to accompany the family to their home Northanger Abbey after their stay in Bath.
Catherine
enthusiastically accepts this invitation and imagines Northanger Abbey to be an abbey as in her Gothic novels. This includes her belief that crimes have happened at the abbey. It is her dislike of her host, General Tilney, which leads Catherine to suspect him of having murdered his wife. Henry notices her suspicions and convinces her that his father is not guilty of the crime and that his mother died a natural death. After Catherine has stayed at the abbey for a few weeks, John Thorpe, who had told the general about Catherine’s presumed inheritance, informs General Tilney that Catherine will not be the heir to a large fortune and that she comes from a poor family. Even though neither her presumed future wealth nor her present poverty is true, the general believes John Thorpe, and Catherine has to leave the abbey almost immediately without being given a reason. For General Tilney, she is not acceptable as a daughter-in-law without having or inheriting a fortune. Henry is away from the abbey at the time Catherine has to leave it. But once he has found out about her dismissal, he visits her at her parents’ house, explains his father’s conduct and apologizes on his sister’s and his own behalf. He also asks Catherine to marry him. Catherine gladly accepts and the wedding takes place at the end of the novel after Eleanor has married a wealthy gentleman, which has softened her father to the connection between Henry and Catherine. NA is a bildungsroman which describes Catherine’s development from a girl who cannot distinguish between fiction and reality to a grown-up woman. Furthermore, it portrays both sentimental novels and Gothic novels ironically, two genres which were popular at the time Austen wrote NA. Elements of the sentimental novel include the engagement between James Morland and Isabella Thorpe, John Thorpe courting Catherine, Catherine falling in love with Henry, and Catherine and Henry’s marriage at the end of the novel.
Gothic novels are referred to in NA by way of frequent intertextual references, both by mentioning titles of novels and by mirroring situations from Gothic novels (cf. the Appendix and Chapter 5 Keywords and concordance lines for intertextual references to Gothic novels in NA).
Chapter 5
Keywords and concordance lines
Keywords (cf. Scott 1999) are words which occur statistically significantly more frequently in a text or corpus than in a comparable, larger reference text or corpus (cf. Chapter 2 Goals, techniques, principles). Their frequency is due to their importance for either the content or the structure of the data that is analysed. They indicate dominant topics or structural features and hence, meanings of the data. Therefore, analysing the usage of keywords can serve as a starting point for the literary interpretation of data. The difference in size between the two sets of data is the basis for assuming that the words identified as keywords are important for the content or the structure of the data. The larger size of the reference corpus makes it highly unlikely that the keywords’ frequency in the data is due to a general probability of language use, but probable that the motivation for their occurrence lies in the structure or the content of the data. This is why it is possible to formulate hypotheses regarding the data’s content and the structure based on the keywords. Evidence for or against these hypotheses is gained from subsequent analyses of the keywords’ concordance lines. This will be demonstrated later in this chapter. Keywords do not form absolute lexical patterns in a text or corpus, as they depend on the choice of reference corpus. A reference corpus sets a lexical frame for the words that are identified as keywords. As an example, the comparison between NA and a corpus consisting of Gothic literature is unlikely to identify words as keywords which point to Gothic elements in NA. Comparing NA with a general corpus of contemporary literature, however, is likely to generate keywords pointing to Gothic elements in the text. Both kinds of corpora are used as reference corpora for NA later in this chapter. Extracting keywords by comparing one set of data with several reference corpora allows the comparison of the resulting lists. 
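The definition of keywords given above can be made concrete with a small sketch. The following Python fragment is an illustration only and not the procedure used for the analyses in this book: it assumes the log-likelihood statistic as the keyness measure (one of several measures in use in corpus linguistics), an assumed cut-off of 6.63, and token lists that are already available; the function names are invented for this example.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood (G2) for a word that occurs a times in a study text
    of c tokens and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the study text
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def keywords(study_tokens, reference_tokens, threshold=6.63):
    """Return (word, G2) pairs for words that are relatively more frequent
    in the study text than in the reference corpus, strongest first.
    The threshold is an assumption of this sketch."""
    study = Counter(study_tokens)
    reference = Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    result = []
    for word, a in study.items():
        b = reference[word]  # 0 if the word is absent from the reference
        if a / c > b / d:  # positive keywords only
            g2 = log_likelihood(a, b, c, d)
            if g2 >= threshold:
                result.append((word, g2))
    return sorted(result, key=lambda pair: -pair[1])
```

Calling `keywords` once per reference corpus yields one list per comparison. Because the expected frequencies depend on the reference corpus, the same study text can produce different lists, which is the dependency on the reference corpus described above.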
Words that are identified as keywords by more than one comparison have a higher significance and are
therefore more relevant for an analysis than words that are identified by only one comparison. Their higher significance is based on the fact that they do not depend on any single reference corpus and are therefore more objective pointers to characteristic lexical items of the data than keywords that are identified by only one analysis (cf. Chapter 2 Goals, techniques, principles on objectivity in corpus linguistic analyses). This maximizes their significance as a characteristic lexical feature of the data. In the following, NA is compared with four different corpora. The second set of data that is analysed, Austen, is compared with only one reference corpus, namely ContempLit. This is because there are no other reference corpora that (1) are larger, (2) show no diachronic differences to Austen, and (3) belong to the same text type as Austen, that is, literature. Using a comparable reference corpus for Austen was rated higher for the analysis than the use of several comparisons with non-equivalent corpora, as results from the latter would not be significant for the analysis. The results from performing more than one keyword analysis in order to identify the most relevant keywords for the data by comparing the different lists of keywords call findings by Scott and Tribble (2006) into question. They say that ‘while the choice of reference corpus is important, above a certain size, the procedure throws up a robust core of KWs whichever reference corpus is used’ (64). However, they do not define the ‘certain size’ that is required for the production of core keywords, so that the present analysis cannot be evaluated on that basis. Nevertheless, the analyses in this book show that the compilation of specialized corpora clearly influences the keywords that are identified. While this might be due to the relatively small size of the corpora used here, it still emphasizes the usefulness of more than one keyword analysis for detecting relevant lexis.
This is especially the case with specialized corpora as a reference corpus. I will demonstrate the influence of reference corpora on the keywords identified with regard to NA in section 5.2.1 later in this chapter. The analyses in this chapter take the following steps:
1. keywords are extracted by comparing two sets of data with each other
2. semantic patterns, that is, semantic fields, are intuitively identified from the lists of keywords
3. words which fall into these semantic fields and which have been selected for further analyses (cf. section 5.2.1 for the criteria for their selection) are lemmatized and their concordance lines are extracted from the data
4. the concordance lines are analysed for further semantic and grammatical patterns, that is, collocations and colligations of the lemmata
5. the patterns identified allow for insights into the content and structure of the data, for example the characterization of the protagonists and the use of irony in NA.
NA is analysed in the first part of this chapter, Austen in its second part.
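Steps 3 and 4 of the procedure listed above can be sketched as follows. The fragment is a simplified illustration under stated assumptions and not the software used for this book: the tokenizer, the wildcard search standing in for lemmatization (in the manner of HEROINE*), the span of context words and all function names are assumptions of this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens (simplified)."""
    return re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())


def concordance(tokens, pattern, span=5):
    """Return (left context, node, right context) for every token that
    matches a wildcard pattern such as 'heroine*'."""
    regex = re.compile("^" + re.escape(pattern).replace(r"\*", ".*") + "$")
    lines = []
    for i, token in enumerate(tokens):
        if regex.match(token):
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append((left, token, right))
    return lines


def collocates(lines):
    """Count the words that occur within the span around the node word."""
    counts = Counter()
    for left, _, right in lines:
        counts.update(left.split())
        counts.update(right.split())
    return counts
```

The concordance lines returned by `concordance` correspond to the material that is read and interpreted in step 4. The frequency list from `collocates` only points to candidate collocations, and the interpretation of the patterns remains a qualitative task.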
5.1 Keywords in the literature The fact that lexical patterns of a text convey meaning is one of the basic assumptions of modern linguistics. Sinclair et al., for instance, state that ‘evidence of the lexical organisation of the language can be found by studying patterns of significant collocation’ (1970: 77 quoted in Phillips 1989). These lexical patterns are ‘forms of organisation in the language’ (77) which determine both the content and the structure of texts. Identifying lexical patterns in a text therefore contributes to decoding its meaning. The identification of these patterns often depends on automatic, that is, computerized, analyses and is impossible by intuitive means. Such is the case with quantitative keywords. Identifying keywords has a long tradition in linguistic research and has helped to answer different kinds of questions (as discussed later in this section). At first, keywords were identified intuitively; nowadays, this is usually an automatic process. As a common background, intuitive and automatic keyword analyses share the assumption that keywords are useful for recognizing and explaining important concepts in language or society. The focus on important topics in a given text or corpus, as in the analyses in this book, has become possible only by the introduction of software that allows the identification of quantitative keywords in a given set of data. Some of the first researchers to comment on the concept of keywords were Ladendorf (1906) in his Historisches Schlagwörterbuch (Dictionary of Historical Headwords) and Lepp (1908) in his work on headwords of the reformation period. Both Ladendorf and Lepp intuitively identified words which were important for the respective time periods they researched. In German-speaking countries, this approach was later continued by Strauß, Haß and Harras (1989) and Herberg (1997), who analysed political terms and their implications. While Herberg (1997) examines keywords of the German Wende (i.e. 
the collapse of the Communist system in Eastern Germany in 1989), Strauß, Haß and Harras (1989) analyse political concepts and words that were discussed controversially in public discourse.
Teubert (1989) adopts a similar approach to a keyword analysis by analysing politically charged terms. In France, Benveniste (1954), for example, researched connotations and meanings of the French term civilisation. In Britain, Firth (1935, 1957a) and Williams (1983) were the first commentators on keywords. They both intuitively identified words as keywords which they assumed stood for important social and/or cultural concepts in society. Firth (1935) identified, for example, hours, leisure, time and self-respect as well as ‘words associated with dress’, occupations and ambitions of women (45) as keywords. Williams (1983) selected words as diverse as charity, elite, fiction, modern, nature, private and underprivileged. His source for evidence on diachronic changes in the usages and, hence, the meanings of these words, was the Oxford English Dictionary. An alternative means of generating keywords, namely a quantitative approach to identifying keywords, has become possible by modern technology. The quantitative approach is based on software using statistical and empirical calculations to automatically extract keywords from a set of data. The theoretical basis for this analysis is often derived from Scott’s (1999, cf. Chapter 2 Goals, techniques, principles) definition of keywords as words which are statistically salient in a text or corpus compared to a reference corpus. This definition is also the basis for the analyses in this book. The benefit of quantitative keywords as opposed to intuitively identified qualitative keywords is their relevance with regard to the content of the data that is analysed. This is emphasized by Baker (2004) who argues that quantitative keywords ‘direct the researcher to important concepts in a text’ (347), since ‘a keyword analysis will focus only on lexical differences’ (349) of different sets of data. Thus, a keyword analysis is the investigation of text-specific lexis. 
Furthermore, Mason and Platt (2006) point out that keywords are in fact collocations within a text or corpus, since they point to topics present in the data and reveal connections between them. The authors ‘treat the text analogously to the lexical span around a node word, and compare frequencies within the text to word frequencies in a general corpus as a benchmark’ and call ‘[t]hese significant words on the text level (. . .) textual collocates’ (159, emphasis in the original). These textual collocates are ‘conceptually similar’ (159) to Scott’s (1999) keywords. The fact that the keywords or textual collocates depend on the data that is analysed, makes them collocations of the data. Scott and Tribble (2006) also make explicit that the keywords of a given text uncover lexical patterns which indicate the content of the data. For them, the potential of a keyword analysis lies in the use of software, in their
case WST, as it ‘is capable of ploughing through vast quantities of text in a relatively short time (. . .) reducing it to a set of potential patterns’ (5). These patterns then serve as starting points for further analyses. The authors use the software as a tool for selecting and classifying lexis as relevant for further analyses. While Scott and Tribble (2006) offer few new insights into the topic, they provide a useful summary of the theory of quantitative keyword analyses as well as a demonstration of their usefulness by means of case studies, including analyses of genre-typical lexical features and a study of literary meanings in Texts for Nothing, 1 by Samuel Beckett. Some of these analyses further develop previous studies by the authors. One such study seems to be Tribble (2000), which looks at project applications submitted to the European Union and which demonstrates that the texts use a rather specific lexis. Knowing and using this lexis is especially important for learners of the genre in order to maximize their chances of a successful application. Another example is Scott (2002), who shows how social stereotypes are encoded by lexical patterns in language by analysing a corpus of the Guardian newspaper. Discovering these patterns by way of a keyword analysis allows the stereotypes to be decoded. Both studies offer interesting findings with regard to their specific research questions. An analysis which uses a combination of quantitatively (in the sense of Scott 1999) and qualitatively (in the sense of Williams 1983) defined keywords is that by Johnson, Culpeper and Suhr (2003), who examine a corpus of newspaper articles for variants of the term politically correct, for example, political correctness and politically incorrect. The analysis is performed in the tradition of critical discourse analysis and covers a five-year period. Interestingly, the authors discover that the usage of such terms decreases significantly within their period of analysis.
They relate this to the Labour party’s victory in the general elections in Great Britain in 1997 and Labour’s subsequent government. In stylistics, keywords function as pointers to literary meanings in a text. This is demonstrated by Culpeper (2002b), who characterizes the protagonists of Shakespeare’s Romeo and Juliet by means of keywords. One of his findings is that the protagonists’ use of first person singular and plural pronouns allows readers to draw conclusions on their social status and their personalities. Culpeper (2009) later expands this analysis of Romeo and Juliet with an analysis of the play’s keywords, their semantic categories and parts of speech. In his analysis of Munro’s ‘The love of a good woman’, Toolan (2004) also shows that keywords are useful pointers to the topics and thus the content
of a literary text. He demonstrates this by discussing the progression of Munro’s text in which the keywords are unevenly distributed. Both Toolan’s (2004) and Culpeper’s (2002b) analyses are useful and successful combinations of quantitative data and qualitative analyses in stylistics. The analyses presented in this chapter follow those described above. They combine quantitative and qualitative approaches to analysing the data by using quantitative keywords as starting points for qualitative analyses. The goal of these analyses is to identify data-specific patterns which contribute to encoding literary meanings in the data. These meanings are decoded by looking at collocations and colligations of the keywords as manifest in their concordance lines.
5.2 The text NA The first step of the analysis is to compare the text NA with the corpora Austen, Austen5, Gothic and ContempLit. The choice of these reference corpora is based on their contents and their classifications: NA treats Gothic literature highly ironically (cf. section 5.2.4), NA is part of Austen’s oeuvre and it belongs to its contemporary literature. Using Austen5 as a reference corpus allows the comparison of NA with Austen’s other novels, without NA being part of the reference corpus as it is in the comparison of NA with Austen. Comparing the lists of keywords resulting from the comparisons NA – Austen and NA – Austen5 also shows the influence of NA on the corpus Austen and the influence of the choice of reference corpus on the lexis identified as keywords. Both topics are discussed in section 5.2.1. The second step of the analysis is to select words from the lists of keywords for further analyses (cf. section 5.2.1 for the criteria for selection), to lemmatize them and to extract their concordance lines. These concordance lines are analysed for collocations and colligations of the lemmata. The lexical and grammatical patterns identified in the concordance lines function as indicators of literary meanings in the data. Using lemmatized forms of the words helps to detect as many shades of meanings of the node words as possible. This is because the different forms of a lemma pattern differently in language (cf. Sinclair 1991 and Sinclair in Moon 1987: 89) and all forms and patterns of the lexis are relevant for extracting the data’s literary meanings. In the second part of this chapter, the corpus Austen is analysed by following the same sequence of steps as described for NA.
One focus of the analysis of NA is on textuality as a means of characterizing the protagonists and as a feature of the novel’s content, cohesion and coherence, and of ironically portraying the plots of and attitudes toward Gothic literature contemporary to Austen.
In addition, the roles of the novel’s author and narrator (cf. section 5.2.4) as well as the novel’s discourse structure, that is, the use of grammatical negations in the text (cf. section 5.3), are discussed. The choice of textuality as a focus of the analysis is due to the dominance of this semantic field on the lists of keywords for NA1 (cf. section 5.2.1). The part of this chapter in which NA is analysed is divided into two sections. The focus of the first section is on the plot and content of the novel, such as on the characterization of the protagonists and on identifying lexical references to the genres Gothic novels and sentimental novels. Both genres were popular at Austen’s time and NA alludes ironically to them (cf. Brooks and Watson (2000) and Smith (1992) on sentimental novels and Ousby (1992), Hermansson (2000), Neill (2004), and Armstrong (2009) on Gothic novels). However, the focus of the following analysis is on Gothic novels, while sentimental novels are only infrequently mentioned. The second section of the analysis focuses on irony and the roles of author and narrator in the text.
5.2.1 The choice of keywords When looking at the lists of keywords extracted from comparisons of NA with different reference corpora, one notices that lexis expressing emotions occurs on two of the four lists and is dominant on one.
Table 5.1 Keywords emotions in NA
NA – Austen: (none)
NA – Austen5: (none)
NA – Gothic: feelings, engagement
NA – ContempLit: feelings, engagement, attentions, admiration
Table 5.1 shows that lexis from the semantic field emotions is identified as keywords in comparisons of NA with Gothic and ContempLit only, but not in the comparisons with Austen and Austen5. This indicates that both the topic and its respective lexis are present in all novels by Austen, as it would
otherwise be identified as key in NA when comparing the novel with Austen and Austen5. The occurrence of the lexis in the other two lists shows that it is characteristic of NA in comparison to Gothic literature and to its contemporary literature, even though the exact words and the number of words identified by the two comparisons differ. The lists indicate that emotions occupy more space in Gothic novels than in literature as such, since the comparison of NA with Gothic identifies fewer keywords from this semantic field than the comparison of NA with ContempLit. Textuality is a second dominant semantic field in the lists of keywords. Its lexis is identified as keywords in all four comparisons (cf. Table 5.2). This implies that the semantic field is more dominant than emotions in NA. This is why it is the focus of the following analyses. Emotions are analysed in greater depth in Chapter 7 Text segmentation.
Table 5.2 Keywords textuality in NA
NA – Austen: udolpho, heroine
NA – Austen5: udolpho, heroine, novels, journal, manuscript, novel
NA – Gothic: heroine, read, journal, novels, theatre, novel
NA – ContempLit: udolpho, heroine, heroine’s, novels
Table 5.2 shows that udolpho, novel, novels and journal occur in at least two comparisons, and heroine occurs in all four comparisons. Because of the correlation between frequency and importance of linguistic features in language (cf. Chapter 2 Goals, techniques, principles), the occurrence of a node word in more than one list indicates its importance for the data analysed. Since heroine is the most dominant keyword of the semantic field, it is analysed in particular detail later in this chapter (cf. section 5.2.2). The analyses of read, novel, journal and manuscript are presented after this primary analysis of heroine, albeit not in as much detail (cf. section 5.2.3). Even though theatre and udolpho are also identified as keywords, they are not analysed, since udolpho refers to a particular book and theatre refers to the building as a physical entity in all of their occurrences. Neither of the words is used in a sense that makes them part of the semantic field textuality. The comparison between NA and Austen identifies the fewest keywords from the semantic field textuality. The comparisons between NA and Austen5 and between NA and Gothic identify most words from this semantic field. These differences in numbers indicate differences between the contents of the reference corpora. The relatively small number of words from the semantic field identified by the comparison of NA with Austen shows that the lexis of the corpus is
strongly influenced by the text NA, which constitutes nearly one sixth of Austen. This is a significant part of the corpus, so that lexis which is characteristic for NA, for example the semantic field textuality, is not necessarily identified as keywords in the comparison between NA and Austen. This is shown most clearly by a comparison of that list with the one derived from the comparison between NA and Austen5, with the latter list being about three times longer than the one resulting from the comparison NA – Austen. The influence of one text on a complete corpus of which it is part, though, decreases with an increase in the number of texts or text fragments that constitute the corpus as well as with the overall size of the corpus. Since Austen is compiled from six texts only, the influence of each constituent text on the corpus is high. Apart from Austen, no other reference corpus includes NA, so that NA’s individual lexical characteristics, as compared to other corpora and to the texts they include, can be most readily identified by comparing NA with corpora other than Austen. The primary results from these comparisons are that (1) textuality is dominant in NA, but that (2) it is not a dominant topic in all of Austen’s novels, as lexis from this semantic field is identified as key in the comparison between NA and Austen5. This shows that the topic is not prominent in Austen5. Textuality also seems to be less prominent in Gothic novels than in literature in general, since more keywords from the comparison NA – Gothic belong to the semantic field than from the comparison NA – ContempLit. Textuality is a characteristic topic of NA and its prominence is not matched in Austen5, Gothic or ContempLit. The dominance of textuality in the lists of keywords indicates its importance for the content and structure of NA. Mentions of textuality constitute both intra- and intertextual references in NA; they constitute a recurrent topic and thus a cohesive link within the text. 
Textuality is part of the plot and contributes to the characterization of the protagonists. Furthermore, it is a means of treating the genres of the Gothic novel and sentimental novel ironically in NA. Gothic novels are explicitly and implicitly referred to in NA, sentimental novels are only implicitly referred to. References to the titles of Gothic novels – like Radcliffe’s The Mysteries of Udolpho and Lewis’ The Monk – are explicit intertextual references to the genre in NA; parallel plot lines between Gothic novels and NA constitute implicit intertextual references both to the genre as a whole and to particular novels. One example of this is Catherine calling General Tilney a Montoni (1722), referring to the villain from The Mysteries of Udolpho (cf. the Appendix for an analysis of intertextual references to Gothic literature in NA).
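The comparison of keyword lists across reference corpora, as discussed for Table 5.2, can be sketched in a few lines of Python. This is an illustration only: the assignment of words to the four comparisons follows one reading of Table 5.2, which is consistent with the counts reported in the text, and the variable and function names are assumptions of this example.

```python
from collections import Counter

# Keywords from the semantic field textuality, one set per comparison
# (assignment of words to comparisons as read from Table 5.2).
keyword_lists = {
    "Austen": {"udolpho", "heroine"},
    "Austen5": {"udolpho", "heroine", "novels", "journal",
                "manuscript", "novel"},
    "Gothic": {"heroine", "read", "journal", "novels", "theatre", "novel"},
    "ContempLit": {"udolpho", "heroine", "heroine’s", "novels"},
}


def list_counts(lists):
    """Count in how many keyword lists each word occurs."""
    return Counter(word for words in lists.values() for word in words)


def shared_keywords(lists, min_lists=2):
    """Return the words identified as key in at least min_lists comparisons."""
    return {word for word, n in list_counts(lists).items() if n >= min_lists}
```

With `min_lists=2`, the function returns the words that occur in at least two comparisons; with `min_lists=4`, only the words that are key against every reference corpus remain. The size of each set in `keyword_lists` gives the number of keywords per comparison, which is the basis for the statements about the differences between the reference corpora.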
The dominance of the semantic field emotions in the lists of keywords for NA is a reference to sentimental novels, the second genre treated ironically in NA. The lexis mirrors the discussion of sentiments and emotions, which is one characteristic feature of sentimental novels. Furthermore, the genre is also reflected by the plot of NA, for instance by the romances between Catherine and Henry Tilney and between Isabella Thorpe and Catherine’s brother James. As NA refers to both sentimental and Gothic novels, the following analyses show which contents and structural features of the genres are mirrored in NA. Moreover, the analyses demonstrate that textuality and irony are inherently interrelated in NA, since the topic textuality is used to convey irony in the novel. Consequently, the irony also determines the lexis of textuality, since the semantic field is used to create irony. The relationship between textuality and irony is thus inherently circular and interdependent. In the literature on NA thus far, the description of this relationship has been intuitive. The following analysis provides formal evidence for it.
5.2.2 The protagonist – a heroine? The first finding yielded by an analysis of the concordance lines of HEROINE* (24) is that the lemma describes Catherine in 14 instances. The fourfold identification of the lemma in the lists of keywords shows the significance of the node word heroine for the novel. This makes it likely that the character to which the term is attributed plays an important part in the novel. And, in fact, the character described as a HEROINE* is identified as the novel’s protagonist by this label. This matches the readers’ intuitive perception of Catherine’s role in the novel.
Catherine is, however, not only identified as a protagonist of the novel by being called a HEROINE*, she is also given attributes of a heroine and is assigned certain character traits by appealing to the readers’ schematic expectations of what a heroine ought to be (cf. Chapter 3 Language and meaning on schema theory). These expectations are derived mainly from Gothic novels in which female protagonists are frequently called heroines. These heroines are usually young, pretty, modest and courageous as is the case in The Monk and The Mysteries of Udolpho, and in the course of the narratives, they usually live through an adventure and are threatened by a villain. In NA, the narrator refers to the latter explicitly by saying that Catherine might be warned of a possible abduction by a nobleman by her mother before she leaves Fullerton for Bath. However, since Mrs Morland is ignorant of this possible threat to her daughter, she does not warn Catherine (6).
Despite this attribution, Catherine does not fulfil the schematic expectations of a heroine and is characterized as an anti-heroine as early as in the very first sentences of the novel:
No one who had ever seen Catherine Morland in her infancy would have supposed her born to be an heroine. Her situation in life, the character of her father and mother, her own person and disposition, were all equally against her. Her father was a clergyman, without being neglected or poor, and a very respectable man, though his name was Richard, and he had never been handsome. He had a considerable independence, besides two good livings, and he was not in the least addicted to locking up his daughters. (1)
Catherine does not have a problematic family background, her father does not imprison her and he does not suffer from financial difficulties. According to the quotation from NA above, these circumstances preclude Catherine from becoming a true heroine. This characterization of Catherine as an anti-heroine is frequently discussed in literary criticism, by Liddell (1963), Harmsel (1964), Tanner (1986) and Emsley (2005) among others. In fact, the depiction of Catherine as an anti-heroine goes as far back as 1863 to Kavanagh (193)3. Catherine’s characterization as an anti-heroine is linguistically strengthened by the colligation of the adjective heroic (5) with grammatical negations and its collocation with negatively connotated and denotated lexis in 100 per cent of its occurrences. Also, in 100 per cent of its occurrences, the adjective refers to Catherine, as the following concordance lines show:
t not merely to dolls, but to the more heroic enjoyments of infancy, nursing a
anation. Feelings rather natural than heroic possessed her; instead of conside
e she fell miserably short of the true heroic height. At present she did not k
rine, passed away without sullying her heroic importance. He looked as handso
t Catherine, who had by nature nothing heroic about her, should prefer cricket
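Concordance lines of the kind shown above can be produced mechanically. The following is only a minimal sketch in Python, not the software used for this study: the function name kwic, the 30-character context width and the short sample (pieced together from two of the heroic lines quoted above) are assumptions made for illustration.

```python
import re

def kwic(text, pattern, width=40):
    """Return keyword-in-context lines for every regex match of `pattern`."""
    # Collapse line breaks and runs of spaces so that context widths are comparable.
    flat = re.sub(r"\s+", " ", text)
    lines = []
    for m in re.finditer(pattern, flat, flags=re.IGNORECASE):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # Right-align the left context so that the node word forms a column.
        lines.append(f"{left:>{width}}{m.group(0)}{right}")
    return lines

# Illustrative sample only: two fragments of the lines quoted above, joined.
sample = (
    "Feelings rather natural than heroic possessed her; "
    "she fell miserably short of the true heroic height."
)

for line in kwic(sample, r"\bheroic\b", width=30):
    print(line)
```

Because the left context is padded to a fixed width, the node word always starts in the same column, which is what makes recurring patterns to its left and right easy to scan.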
Catherine neither possesses heroic qualities, such as a specific, ‘heroic’, height nor does she develop heroic feelings. This further supports her characterization as an anti-heroine. The usage of the lemma HEROINE* to describe the novel’s protagonist can be read as an implicit reference to Gothic novels as the protagonists in Gothic novels are frequently called heroines. This usage complements other explicit and implicit intertextual references to the genre in NA as,
for example, the mentions of book titles (cf. the Appendix for intertextual references in NA). Also, protagonists in sentimental novels are frequently called HEROINE*. Thus, Catherine’s description as a heroine is an intertextual reference to that genre as well. Furthermore, Catherine’s love for Henry, John Thorpe’s courtship of Catherine and the dominance of the semantic field emotions in the lists of keywords all point to plot parallels in NA with sentimental novels. Explicit references to the genre, such as references to book titles, do not occur in NA. However, not only plot lines and the semantic field on the lists of keywords are intertextual references to sentimental novels; collocations of HEROINE* with words denoting mental concepts (16) also point to the genre. In two thirds of the occurrences of the lemma, HEROINE* collocates with mental concepts. The lemma’s possessive form (5) even collocates with mental concepts in 100 per cent of its occurrences:
ledge, and dreadfully derogatory of an heroine’s dignity; but if it be as new
of one novel be not patronized by the heroine of another, from whom can she e
the sleepless couch, which is the true heroine’s portion; to a pillow strewed w
r is widely different; I bring back my heroine to her home in solitude and dis
would have supposed her born to be an heroine. Her situation in life, the ch
harmless delight in being fine; and our heroine’s entree into life could not ta
en. But when a young lady is to be a heroine, the perverseness of forty surr
ess. Every young lady may feel for my heroine in this critical moment, for eve
spirits can lead me into minuteness. A heroine in a hack post-chaise is such a
an myself." And now I may dismiss my heroine to the sleepless couch, which is
a most pleasing remembrance of all the heroines of her acquaintance; and she t
a heroine; she read all such works as heroines must read to supply their memories
e presumed that, whatever might be our heroine’s opinion of him, his admiratio
ere fortune was more favourable to our heroine. The master of the ceremonies
stances which peculiarly belong to the heroine’s life, and her fortitude under
ipid pages with disgust. Alas! If the heroine of one novel be not patronized
The concordance lines show that the collocation not only points to sentimental novels, but also to Gothic novels since mental concepts, such as the protagonists’ emotions and fears, are important for the contents and plots of both genres. In the case of sentimental novels, a mental concept is already part of the name of the genre (sentimental), which frequently describes the ups and downs of a young protagonist’s romances. In Gothic novels, the protagonist’s experiences and emotions are described in a series of frightening situations. Consequently, the collocation between HEROINE* and mental concepts such as feel, delight and spirits (cf. the concordance lines
above) is a linguistic indicator of the connection between NA and these two genres. It is an intertextual reference to both genres. Moreover, the use of the lemma HEROINE* not only connects the text with the two genres and helps to identify and characterize the novel’s protagonist; collocations of the lemma also reveal structural features of NA. The connection between HEROINE* and textuality, the semantic field from which the lemma HEROINE* itself is derived, is particularly evident, as the concordance lines show that HEROINE* collocates with references to literature and theatre (the latter in a phrase which refers to the theatre) in seven cases, that is, in nearly one third of its total number of occurrences:
permitting them to be read by their own heroine, who, if she accidentally take u
of one novel be not patronized by the heroine of another, from whom can she e
this simple praise than a true-quality heroine would have been for fifteen sonnets
a heroine; she read all such works as heroines must read to supply their memo
to seventeen she was in training for a heroine; she read all such works as
ipid pages with disgust. Alas! If the heroine of one novel be not patronized
me comfort; and now was the time for a heroine, who had not yet played a very distinguished part
The collocation of HEROINE* with words and idioms from the semantic field textuality clearly shows the connection between the characters, the plot and textuality as structural features of NA. It
1. indicates a structural connection between the protagonist and textuality,
2. hints at a dominant topic in the plot,
3. makes the fictional nature of the novel explicit, and
4. makes references to other genres in NA explicit.
The first three points are discussed later in this chapter (cf. sections 5.2.3 and 5.2.4); the fourth point has been discussed earlier.
5.2.3 Further indicators of textuality
Even though textuality is a dominant topic in the novel, as commented on by, for instance, Auerbach (2004) and Todd (2006), reading, and especially reading novels, is portrayed as a surprising and rather unusual activity. This is particularly true for men whose attitudes towards reading fiction are portrayed as more negative than those of women. READ* (70, without ready and readily) occurs 30 times with negatively connotated or denotated lexis or grammatical negations. The following
concordance lines are examples of this usage:
and scarcely ever permitting them to be read by their own heroine, who, if she
on, our foes are almost as many as our readers. And while the abilities of th
"Believe me," &c. Catherine had not read three lines before her sudden chang
ead all the rest of it. Consider – if reading had not been taught, Mrs. Radcl
he Mysteries of Udolpho. But you never read novels, I dare say?" "Why not?"
re perfectly well qualified to torment readers of the most advanced reason and
it avail in such a case? Catherine had read too much not to be perfectly aware
ing work. You are fond of that kind of reading?" "To say the truth, I do not
teresting." "Not I, faith! No, if I read any, it shall be Mrs. Radcliffe’s;
d in not one was anything found. Well read in the art of concealing a treasure
The concordance lines of NOVEL* (19, excluding novelty), JOURNAL* (9) and MANUSCRIPT* (7) also mirror this negative attitude towards textuality. NOVEL* occurs 15 times with negatively connotated or denotated words or grammatical negations, JOURNAL* five times and MANUSCRIPT* three times. The following concordance lines are examples of this usage for NOVEL* and are complete for JOURNAL* and MANUSCRIPT*:
, but he prevented her by saying, "Novels are all so full of nonsense and
no novel-reader -- I seldom look into novels -- Do not imagine that I often r
ls -- Do not imagine that I often read novels -- It is really very well for a n
dolpho! Oh, Lord! Not I; I never read novels; I have something else to do."
reading, Miss -- ?” “Oh! It is only a novel!" replies the young lady, while s
us and impolitic custom so common with novel-writers, of degrading by their co
ne, who, if she accidentally take up a novel, is sure to turn over its insipid
lly thought before, young men despised novels amazingly." "It is amazingly;
ls -- Do not imagine that I often read novels -- It is really very well for a n
and taste to recommend them. "I am no novel-reader -- I seldom look into novel
"I shall make but a poor figure in your journal tomorrow." "My journal!" "
ou to say." "But, perhaps, I keep no journal." "Perhaps you are not sittin
doubt is equally possible. Not keep a journal! How are your absent cousins to
without having constant recourse to a journal? My dear madam, I am not so ig
forth, in the shape of some fragmented journal, continued to the last gasp. O
d, with an unsteady hand, the precious manuscript, for half a glance sufficed
fraught with awful intelligence. The manuscript so wonderfully found, so wond
cold sweat stood on her forehead, the manuscript fell from her hand, and grop
The concordance lines show that the semantic prosody4 (Louw 1993) of textuality is negative. However, this negative portrayal of textuality in NA is slightly modified by a positive portrayal of characters who read fiction (as described later on).
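Counts such as ‘READ* (70, without ready and readily) occurs 30 times with negatively connotated or denotated lexis or grammatical negations’ combine a wildcard search, a list of excluded word forms and a check of each hit’s context. The sketch below illustrates that logic in Python under stated assumptions: the NEGATORS set is a hypothetical, deliberately tiny stand-in for the manual classification used in this study, the five-word span is arbitrary, and the sample joins one line quoted above with an invented second sentence.

```python
import re

# Hypothetical mini-list of negative items; the study's own classification of
# negatively connotated or denotated lexis was made by the analyst, not by a list.
NEGATORS = {"not", "no", "never", "nonsense", "despised", "insipid"}

def lemma_hits(tokens, stem, exclude=()):
    """Indices of tokens that begin with `stem`, minus explicitly excluded forms."""
    return [i for i, t in enumerate(tokens)
            if t.startswith(stem) and t not in exclude]

def negative_share(text, stem, exclude=(), span=5):
    """Return (all hits, hits with a negative item within `span` words)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = lemma_hits(tokens, stem, exclude)
    negative = 0
    for i in hits:
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        if NEGATORS & set(window):
            negative += 1
    return len(hits), negative

# First sentence: from the concordance lines above. Second sentence: invented.
sample = (
    "Oh, Lord! Not I; I never read novels; I have something else to do. "
    "She was ready to read the letter with delight."
)

total, negative = negative_share(sample, "read", exclude=("ready", "readily"))
print(total, negative)
```

An automatic list of this kind can only pre-sort the lines; whether a given co-occurrence really expresses a negative attitude still has to be judged by reading the concordance line.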
The critical attitude towards reading that is visible in the concordance lines reflects Austen’s contemporary situation. During her time, reading fiction was not yet accepted as a respectable leisure activity and novels were not taken seriously. Consequently, literature had only little prestige and relatively few people admitted to reading fiction. Only non-fiction books were deemed worth reading, since they promised to educate their readers. This attitude began to change at the end of the 18th century when lending libraries allowed a greater public access to literature. From then on, the acceptance of novels in the middle classes grew rapidly. At Austen’s time, however, novels were still controversial. Apart from lending libraries, a second factor which heightened the acceptance of novels was the widespread popularity of Gothic novels. They helped to spread novels among readers, particularly among women, while men still preferred non-fiction books (Erickson 1990: 578). In NA, Austen seems to reflect and comment (cf. her ‘Defence of the Novel’, 24f.) on this attitude of her contemporaries. Despite their contemporary background, the negative and sometimes condescending utterances on reading as revealed by the concordance lines above are surprising, since
• Austen constantly reminds her readers of the novel’s fictional nature by making its author/narrator explicit (cf. section 5.2.4)
• characters who read novels are portrayed positively (see below) and Austen herself defends reading novels in her so-called ‘Defence of the Novel’ (24f.)
• Austen creates a ‘female sphere’ of textuality in the novel for her mainly female readers.
The fact that textuality in NA is dominated by and connected with women is visible from further data regarding the content and the language of the novel. Literature, reading and writing are female domains, since, for instance,
• women are attributed with writing better letters than men (‘Everybody allows that the talent of writing agreeable letters is peculiarly female’ (14))
• women meet in order to read books together (‘shut themselves up, to read novels together’ (24))
• the novel The Mysteries of Udolpho, which is frequently mentioned and discussed in NA, was written by a woman, Ann Radcliffe.
Moreover, reading is part of Catherine’s self-education and she tries to qualify as a heroine by reading the relevant books (3). From a linguistic point of view, the connection between textuality and women is most obvious in the analysis of the lemma READ* (70, excluding ready and readily). In 19 cases, that is, in nearly one-third of its occurrences, the lemma collocates with the personal pronoun I which refers to the protagonist Catherine. In a further 16 cases, the lemma explicitly refers to women, while in only seven cases it refers to men. This means that in exactly 50 per cent (35) of its occurrences, the lemma explicitly refers to women, but in only 10 per cent (7) to men. Even though female characters are more numerous in the novel than male characters, the proportion of the sexes who are connected to READ* does not mirror the novel’s gender ratio. There is a female dominance in this domain. However, a distinction between reading women and non-reading men would not do justice to the novel. Rather, the protagonists’ reading habits serve to characterize them as ‘good’ or ‘bad’ characters. Protagonists who read novels are portrayed as reliable and trustworthy, whereas characters who do not read novels are portrayed as unreliable. For example, both Henry and Eleanor Tilney read novels and Isabella Thorpe reads Gothic, and possibly other, novels at the beginning of NA. Isabella even forms her friendship with Catherine partly on the common ground of reading Gothic novels. On the other hand, those characters who do not read fiction are portrayed as ‘bad’. This becomes particularly clear in the descriptions of the Thorpe siblings. Isabella Thorpe’s portrayal changes during her friendship with Catherine. At first, she promises Catherine to read selected books with her (26f.), but later on does not keep her promise. As the plot evolves, she is portrayed as calculating, unreliable and fickle in the same way as her nonreading brother. 
Mr Tilney, on the other hand, is the ‘good’ male character in the novel. He reads The Mysteries of Udolpho to his sister and remarks on how much he enjoyed the novel (94f.). This anticipates his ‘good’ character, which is eventually confirmed by his wedding with Catherine at the end of the novel. In summary, the points made so far show that the topic textuality fulfils several functions in the novel:
1. The protagonists’ reading habits are a recurrent and important topic in the plot of the novel. This makes the topic a cohesive link across the text with regard to the novel’s lexis and content. It also supports coherence in NA.
2. The references to literature are intratextual, intertextual and exophoric references to the fictional world within and the real world outside of NA. The real world is evoked by reference to real book titles in NA, that is, by intertextual references. Their relation to the world outside the text also makes them exophoric references. Their occurrence throughout the text, the implicit references to ironically treated genres and the lexical references to textuality function as cohesive links as they connect different parts of the text. They are, therefore, also intratextual references.
3. The topic textuality reminds readers of the fictional nature of NA, so that the borders between reality and fiction are blurred in the novel. This is elaborated in the following section (section 5.2.4).
5.2.4 Author, narrator, reader, and irony
The focus of the analyses so far has been on textuality as, (1) a structural feature of the novel, (2) a dominant topic in NA, and (3) a means of characterizing the novel’s protagonists. In the following, the focus shifts to the roles of author and narrator and the novel’s inherent irony. The starting points for these analyses continue to be the keywords extracted and discussed above and their concordance lines. In addition, schema theory and theories on irony are used in the analyses.
5.2.4.1 Irony and theories on irony
In linguistics, theories of irony and their linguistic and language philosophical features are frequently based on the traditional definition of irony of classical rhetoric or on Grice’s (1967) work. Sperber and Wilson’s relevance theory, which further develops Grice’s Cooperative Principle (CP) and its maxims, also analyses irony. The basic questions of linguistic approaches to the study of irony can be summarized as follows:
What is it? What forms does it take and how are these related to one another? What makes people ironical? What are the functions and uses of irony? How important is it? (Muecke 1970: 1)
The oldest and most influential theory of irony is that of classical rhetoric. It states that irony occurs when a speaker means the opposite of what s/he actually says and therefore logically negates the utterance. This negation is implicit and only understandable from the context of the utterance.
One of the first researchers to discuss irony was Grice who adopts a language philosophical approach. He locates irony within the study of intended meaning (1967), situates it within the CP and says that it flouts the maxim of quality. This is the case, since one feature of irony is to mean the opposite of what is said. When a speaker speaks ironically, this is an intentional act, thus the ‘maxim of Quality is flouted’ (1975: 53). This statement situates Grice and his theory of irony within the tradition of classical rhetoric, as Grice adopts the notion that irony is an implicit logical negation of what is said. Consequently, the surface meaning of an utterance is untrue in this particular situation and the speaker of an ironic utterance expects his/her listener to decode the intended meaning. However, Grice does not explain how a listener decodes irony. In 1989, Grice expands his theory of irony by saying that it is ‘intimately connected with the expression of a feeling, attitude, or evaluation’ (53) and adds that I cannot say something ironically unless what I say is intended to reflect a hostile or derogatory judgement or a feeling such as indignation or contempt (. . .). (53f.) Apart from Grice (1975), Searle (1979) and Martin (1992) also share the assumption that irony is a hidden logical negation of the propositional content of an utterance. Grice (1975) and Searle (1979) state that a listener discards the explicit meaning of an ironic utterance, since it is clearly either untrue or ‘grossly inappropriate’ (Searle 1979: 113). This necessitates a reinterpretation of the utterance and the ‘most natural way to interpret it [the utterance] (. . .) [is] as meaning the opposite of its literal form’ (Searle 1979: 113). The listener generates the intended message, or the message that s/he thinks was intended, from the sender’s implicit message. This implicit message is that the surface message is either untrue or inappropriate. 
Consequently, the real message of an utterance is implicit. According to Booth (1974), this means that irony can only be decoded on the basis of the context of an utterance. The assumption that negation is an essential component of irony is also shared by Martin, who states that there is an obvious parallel between irony and negation (contrary, and not necessarily contradictory, negation). (. . .) [E]ither the speaker says something false in order to suggest something true, or the speaker says something true in order to reveal something false. (1992: 83f.)
Again, considering the context of an utterance is taken as essential for recognizing irony. Amante (1980) takes up Grice’s notion of intertextuality and argues that irony is always intertextual and exophoric. According to him, irony can only be recognized by the receiver when drawing on his/her knowledge of the world. Since irony is used to logically negate existing norms and maxims, it is inherently propositional in character. Furthermore, since irony refers to events within a text, ironic language is both anaphoric and exophoric in nature. Amante argues further that negations are important components of irony. They can be either indirect logical negations of the reader’s expectations that are driven by the development of an action or they can be explicit negations through grammatical negations or negating sentence structures: ‘(. . .) [I]rony is dependent upon negative-making devices in a language’ (1980: 20). Relevance theory (RT) by Sperber and Wilson further develops Grice’s (1975) hypotheses and his Cooperative Principle. Within RT, Sperber and Wilson (in Smith and Wilson 1992) contrast Grice’s assumption that conversational maxims are consciously flouted in ironic utterances with their own theory. This theory states that irony is created by the logical, grammatical or lexical negation of readers’ expectations. This means that irony is necessarily exophoric and functions as an echo of previously made utterances or conventions. These conventions can relate to both form or content of an utterance. The fact that a speaker distances him-/herself from his/her own utterance is implied by the echo in combination with the aforementioned negation. Wilson and Sperber summarize this as ‘the speaker echoes an implicitly attributed opinion, while simultaneously dissociating herself from it’ (1992: 60). This line of argument was followed in the above demonstration of how the negation of schematic expectations of a heroine characterizes Catherine as an anti-heroine in NA. 
According to the argument of RT, this characterization of Catherine entails that NA treats the genres sentimental novel and Gothic novel ironically. The following analyses follow this argument and demonstrate in detail later in this chapter to what extent literary conventions of the genres are disregarded in the novel. In RT, irony is based on three different elements which Curcó summarizes as follows: (a) it [irony] is a variety of interpretative use in which the proposition expressed by the utterance represents a belief implicitly attributed by the
speaker to someone other than herself at the time of utterance, (b) it is echoic (i.e., it implicitly expresses the speaker’s attitude to the beliefs being represented), and (c) the attitude involved in the echo is one of dissociation from the thoughts echoed. (2000: 261)
Echoes can refer (1) to the linguistic form of an utterance or (2) to cultural or social content or conventions. According to RT, when the echo mainly refers to the linguistic form of an utterance, its message is parodied. When the echo mainly refers to the content of an utterance, thus creating an intertext between the echo and the text that is echoed, the original text is treated ironically. Based on this distinction, NA treats sentimental and Gothic novels ironically. This is demonstrated later in this chapter (cf. section 5.2.4.2). Groeben and Scheele (1984: 3) state that nearly every linguistic feature can be used as a signal for irony. For them, signals for irony are no
abgrenzbar[es] Korpus fixierter, konkreter sprachlicher Phänomene, sondern (. . .) eine Funktions-/Verwendungsweise sprachlicher und parasprachlicher Mittel. (. . .) [Sie wirken] in Verbindung mit diesem Wissen [um zentrale ironische Inkongruenz] (. . .) negierend oder ambiguisierend, d.h. [sie] bestätigen oder verstärken die aus Vorwissen gespeisten Inkongruenzerwartungen (. . .). (1984: 9)
set corpus of fixed and predetermined linguistic phenomena, but (. . .) a way of using and making use of linguistic and paralinguistic features. (. . .) In connection with the knowledge [of ironic incongruence, these features function] as negations or they create ambiguity. This means that they confirm or reinforce expectations of incongruity which derive from previous knowledge (. . .) [translation by B.F-S]
According to Groeben and Scheele, irony is not characterized exclusively by syntactic or semantic features. This means that irony can only be decoded when taking the context of an utterance into account.
Thus, irony can be produced and understood only on the basis of common knowledge between the sender and the receiver of a message. Consequently, irony is an indirect speech act. This is also the position of Franck (1975) and Wunderlich (1975). A corpus linguistic approach to identifying irony is demonstrated by Louw (1993). As a first step, he intuitively selects words from a text which seem to be used ironically in this particular context. Second, he analyses their conventional usage in a corpus of general contemporary English, for
example in the Bank of English. Third, he compares the results from the corpus linguistic analysis with the usage of the words in the text. If the usage of the word differs between the two sets of data, it is very probable that it is used ironically in the non-representative text. The word is likely to function as a signal for irony. This argument implicitly follows the theory of irony from classical rhetoric, since also Louw’s approach defines the logical negation of norms as the basis for irony. These norms are based on the conventional pragmatic usage of words as opposed to their usage in a single text. Data that is generated by corpus linguistic techniques is used as evidence for the analyst’s intuition regarding irony in a text. The approaches to irony discussed above, all see irony as inherently intertextual and exophoric. For both its production and reception, irony depends on an ironically treated element outside the ironic text. This perception of irony is also supported by the analyses in the following sections of this chapter. The analyses show that the exophoric references in NA are mostly to literary works and, thus, are intertextual references. This creates a close structural connection between the two key terms of the analysis of NA in this chapter, textuality and irony. The subsequent identification of irony in NA follows, in particular, Groeben and Scheele (1984), who say that irony does not depend on the occurrence of a particular form, but is manifest in various linguistic features which are part of the text. This means that their ironic usage in a specific context can only be detected by analysing linguistic patterns and their contexts as is demonstrated in the following analyses of NA. From a literary point of view, irony in NA has been discussed by a number of critics, including Brown (1979), Tanner (1986) and Gerster (2000). 
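The comparison at the heart of Louw’s (1993) procedure, described above, can be illustrated with a short Python sketch. It is an assumption-laden toy rather than Louw’s own implementation: the function names, the four-word span and both sample texts are invented for illustration, and the one-sentence ‘reference’ text merely stands in for a large general corpus such as the Bank of English. The first step, the intuitive selection of the word to be examined, remains with the analyst.

```python
import re
from collections import Counter

def collocates(text, node, span=4):
    """Count the words found within `span` tokens of each occurrence of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - span):i])
            counts.update(tokens[i + 1:i + 1 + span])
    return counts

def unshared_collocates(target_text, reference_text, node, span=4):
    """Collocates of `node` in the target text that are absent from the reference."""
    target = collocates(target_text, node, span)
    reference = collocates(reference_text, node, span)
    return sorted(w for w in target if w not in reference)

# Invented toy texts; a real study would use the novel and a reference corpus.
target_text = "she was no heroine and preferred cricket"
reference_text = "the brave heroine faced the villain"

print(unshared_collocates(target_text, reference_text, "heroine"))
```

Collocates that occur only in the individual text are candidates for closer reading; whether they actually signal irony is an interpretative decision that the frequency data can support but not replace.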
5.2.4.2 Irony in NA: the roles of author and narrator
The prominent role of the narrator in NA is already noticeable when reading the novel. She comments on the plot and inserts personal opinions, as in her ‘Defence of the Novel’ (24f.). These comments blur the distinction between the narrator and the author of the novel, since the narrator claims that she is also the author of the text. This is implicit, for example, in utterances such as her comment that she is ‘aware that the rules of composition forbid me the introduction of a character not connected with my fable’ (235). Since the distinction between the author and the narrator in the novel is unclear, the following analysis refers to an author/narrator of NA, even though this counteracts conventions of literary criticism. The term
author/narrator is appropriate, however, as it connects the two entities closely and makes this connection explicit. And since the author of the novel is female, I refer to the author/narrator as ‘she’. In fact, NA is Austen’s only novel that explicitly addresses its readership (Auerbach 2004: 73) and thereby makes its narrator explicit. The prominent position of the author/narrator in NA is already visible in the concordance lines of HEROINE* (24). In eight cases, or one-third of its occurrences, the lemma colligates with the possessive pronouns my (5) and our (3) directly to the left of the lemma:
long visit at Northanger, by which my heroine was involved in one of her most
r is widely different; I bring back my heroine to her home in solitude and dis
harmless delight in being fine; and our heroine’s entree into life could not ta
ess. Every young lady may feel for my heroine in this critical moment, for eve
an myself." And now I may dismiss my heroine to the sleepless couch, which is
e presumed that, whatever might be our heroine’s opinion of him, his admiration
usion to disconcert their measures, my heroine was most unnaturally able to fu
ere fortune was more favourable to our heroine. The master of the ceremonies
These concordance lines show that the auctorial, omniscient author/ narrator comments on the events of the novel. She is detached from the characters and refers to herself as their creator and the mental owner of Catherine. This ownership is expressed by the colligation between HEROINE*, which refers to Catherine in all concordance lines quoted above, and the possessive pronouns. The colligation implies that Catherine is an artificial object created by the author/narrator which only exists in the text. This is true in fact, but rarely made explicit in fiction texts. This close relationship between the protagonist and the author/narrator indicates the character’s dependence on the author/narrator and contributes to the destruction of the schema of an independent, strong heroine. It strengthens Catherine’s characterization as an anti-heroine. In addition, the possessive relationship between Catherine and the author/narrator creates a distance between reader and protagonist, which makes it almost impossible for the reader to identify with Catherine (see later in this section for a discussion of this point). Also the reader is involved in this possessive relationship: The colligation of HEROINE* with our indicates that the reader shares the author/narrator’s superior position as the creator and owner of the text and its characters. However, this does not mirror the real situation of the novel’s production and reception. In fact, the reader receives the product written
by the author, Jane Austen. This is not reflected by the use of our as a colligation of HEROINE*. By using our, the author/narrator creates an emotional bond between the reader, the author/narrator and the plot of the novel. This shifts the reader’s position from that of an independent and detached reader, to one who is involved in the creation of the novel and who stands above the plot. This is a second linguistic device which makes it difficult for the reader to identify with the characters. But the author/narrator also uses possessive pronouns when addressing or speaking in general about or to her readers (7) in order
• to inform them that they should evaluate a situation, for example, ‘I leave it to my reader’s sagacity’ (231)
• to provide them with information, for example, ‘stated, for the reader’s more certain information’ (5)
• to describe the readers’ emotions, for example, ‘fear, to the bosom of my readers’ (234).
By using these devices, the author/narrator linguistically turns her readers and characters into her creations and products whose judgements, evaluations and states of knowledge she influences. She, thus, places readers and characters on the same linguistic level. This is shown by the following concordance lines:

on, our foes are almost as many as our readers. And while the abilities of th
   es have been seen. I leave it to my reader’s sagacity to determine how much
nce in Bath, it may be stated, for the reader’s more certain information, lest
y have now passed in review before the reader; the events of each day, its hop
dly extend, I fear, to the bosom of my readers, who will see in the tell-tale c
me description of Mrs. Allen, that the reader may be able to judge in what mann
re perfectly well qualified to torment readers of the most advanced reason and
The concordance lines show that linguistically, the reader is not on an equal footing with the author/narrator and that the reader depends on textual features for his/her interpretation of the text. This cancels the author/narrator’s concession of an elevated level to the reader, as expressed by the colligation of HEROINE* with our discussed above. The colligation of possessive pronouns with HEROINE* is not evenly distributed across NA; it clusters mainly at the beginning and at the end of the novel. This permits the author/narrator to first introduce the
reader to this unusual situation of reception and, at the end of the novel, to remind the reader of his/her freedom to evaluate its events. The fact that this freedom of evaluation has really been limited by the author, who has written the text and has inserted her own evaluations, is not mentioned. This creates a fiction for the reader which mirrors the novel’s topic textuality.

While the perception of most literary events in literary texts follows the chain reader → events, perception in NA follows the chain reader → author/narrator → events. The reader does not perceive the events directly, but from the author/narrator’s perspective. This is always the case in literature, since a text has always been written by one or more authors with a particular perspective on the events. However, literary texts frequently create the illusion that the reader is an independent judge and observer of events. In NA, the linguistic emphasis on the author/narrator prevents the reader from developing or maintaining this feeling of independence. Moreover, it prevents the reader from identifying with the protagonist. Since a reader of a literary text identifies with the first element in the chain of perception (see the chain above), the reader identifies with the author/narrator, instead of with the events or the protagonist in NA. This distance between the reader and the protagonist allows the reader to recognize that the conventions of sentimental novels and Gothic novels are not followed in NA. Furthermore, the author/narrator distances herself from the convention of a narrator, who seems independent from author and reader. This shows that a number of literary conventions are broken in NA, a fact which emphasizes the fictional nature of the novel and its protagonists. This mirrors and supports the topic textuality in NA.

Making the author/narrator explicit creates two literary levels in NA. These are,
1. the plot level on which the protagonist Catherine plays a major part, and
2. the authorial level on which the author/narrator reminds the readers of her double role as creator and narrator of the novel. This blends the roles of author and narrator, since the narrator explicitly refers to her role as the author of the novel.
In addition to these two literary levels, the reader is also addressed on different levels of reality in the novel: the reader as a real person reads the fictional novel by a real author and a fictional narrator on fictional characters who talk about literature and the theatre and about real literature which they read as part of the fiction. This interwoven structure of reality and fiction on four levels makes Mooneyham (1988:1) characterize NA as ‘structured like a Chinese box of fictions within fictions within fictions’. The border between reality and fiction is blurred, and the emotional distance between the reader and the novel’s protagonists, as described above, further emphasizes the fictional nature of the text. The fact that the reader is involved in the plot and in the evaluation of the novel is not only a renunciation of the real circumstances of the reception of literature, but also a renunciation of literary conventions, and thus of schematic expectations of the text by its readers. According to the chains of perception and identification above, a reader usually identifies with the text’s protagonists. Following the theories of irony, the renunciation of these conventions means that they are treated ironically, since they are logically negated. Furthermore, it is an intertextual reference to novels in which the narrator fulfils the traditional function. The facts that (1) the narrator makes explicit that she is also the author of the novel and (2) the author/narrator linguistically claims that the reader is on an equal level with her create a double irony of literary conventions, namely of the roles of author and narrator in a literary text and of the novels that follow these conventions. The reader’s alleged elevation to the level of the author/narrator is also an ironic treatment of the reader who, consciously or unconsciously, believes in this elevation.
But since, in fact, it is the author/narrator who predetermines the plot and the protagonists’ characterizations, the reader
is not equal to her. This illusion, therefore, functions as a renunciation and logical negation of norms regarding the reader’s position as the recipient of a literary text. Consequently, the reader is treated ironically and is (1) the subject, i.e. agent, of this treatment of literary conventions and the ensuing irony by recognizing the conventions, and (2) its object as the seeming creator of the events in the novel. The ironic treatment of both the reader and real books in NA creates intertextual and exophoric reference points for the irony. Moreover, literary conventions are also exophoric references in the novel. These different references are essential components of the irony discussed here.

The discussion so far has shown that the renunciation of literary conventions and of the reader’s expectations fulfils three functions:
1. It is a means to treat the reader, literary conventions and novels that conform to these conventions ironically.
2. It is an intertextual reference to other novels.
3. It is an exophoric reference to the reality outside of the novel, that is, to literary conventions, to real novels mentioned in NA, to schematic expectations created by real novels and to the reader.

An essential feature that allows the recognition of irony is the distance of perception between the reader, on the one hand, and the protagonists and events on the other hand (see above for the chain of readers’ perceptions of literary events in novels). It is this distance which makes recognising intertextual and exophoric references possible. Meanwhile, the recognition of irony prevents the reader from identifying with the protagonists even further. This means that irony in NA is circular and that both its creation and its recognition depend on three consecutive steps:
1. There is a logical negation of the reader’s schematic expectations which derive from intertextual and exophoric references.
2. This negation prevents the reader from identifying with the protagonists or the literary situation.
3. This allows the recognition of irony.

If one of these conditions is not fulfilled, irony cannot be recognized by the reader and irony as such becomes impossible, since both the reader’s expectations and their grammatical, lexical or logical negation are essential components of irony.
5.2.5 Implications

The analyses have shown that the investigation of keywords provides insight into topics and structural features within a text, namely NA. The keywords that are identified by several analyses are used as pointers to linguistic patterns, which are then analysed further. Analysing the keywords’ concordance lines allows us (1) to identify important topics of the text, (2) to reveal implicit characterizations of the protagonists and (3) to make explicit the importance of the topic textuality in the novel. Furthermore, features of text structure, such as the roles of author and narrator in the text, are analysed and linguistic features triggering irony in NA are identified. For the latter, the cognitive concept of schemata was used in addition to corpus linguistic analytic techniques.

Using statistics to identify keywords, which indicate dominant topics in a text, has been shown to be useful for the analyses. Yet, the keywords are only pointers to dominant topics in the data. They are relative linguistic features deriving from a comparison of different corpora. Furthermore, it has been shown that it is helpful to use several reference corpora for such comparisons in order to prevent a bias resulting from only one such corpus. This ensures that the dominant topics of a text really are identified. The automatic identification of keywords based on the words’ statistical significance in the data makes them relevant for the analyses. The fact that they frequently mirror intuitive impressions on important words or concepts in the data strengthens their significance further, since the intuitive perception of important topics in a text is likely. The statistical identification of intuitively recognisable topics might serve as a trigger for analyses which might otherwise not have been carried out. Reasons for this might be that the topic seems too obvious in the data or that it does not seem to be interesting for the meanings of the text.
This is the case, for example, with the topic family and social relationships in Austen. The dominance and significance of these relationships in Austen’s novels is intuitively obvious so that they do not seem to warrant an analysis. When the analysis is performed, however, it gives interesting and thus far unknown insight into the social structures of the novels (cf. section 5.4.2). The dominance of the semantic field in the list of keywords is the trigger for the analysis of this apparently obvious and not very complex topic. This shows that keywords are useful for, 1. selecting features for an analysis, 2. supporting intuition on important topics, 3. making these intuitions more objective, and 4. detecting entirely new meanings in a text.
This holds true, even though the lists of keywords themselves provide only little information about the text or corpus and require further interpretations and analyses of the words identified. As keywords are ordered according to their statistical significance on the lists, their respective positions point at words or semantic fields which are more characteristic of the data than others. Insights gained from quantitative data and its qualitative interpretation permitted the in-depth analysis of the topic textuality in NA. This has revealed meanings that an intuitive approach would have found nearly impossible to uncover. One example of this is that NA’s protagonists are characterized by their attitudes towards fiction texts. To my knowledge, literary critics have not discussed this aspect of NA, even though it seems intuitively logical and obvious in retrospect. It is the detailed analysis of linguistic data which has helped to make it visible, not an intuitive interpretation. This shows the benefit of analysing quantitative linguistic data of literary works.
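The comparison of a text with a reference corpus that underlies a keyword list can be sketched in a few lines. This section does not say which statistic the software uses, so the example below assumes the log-likelihood measure that is common in keyword tools; the function names, the toy frequency lists and the cut-off value are illustrative assumptions, not the settings used for NA.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood (G2) keyness of one word: study text vs. reference corpus."""
    total = size_study + size_ref
    combined = freq_study + freq_ref
    # Frequencies expected if the word were equally common in both data sets.
    expected = (size_study * combined / total, size_ref * combined / total)
    score = 0.0
    for observed, exp in zip((freq_study, freq_ref), expected):
        if observed > 0:
            score += observed * math.log(observed / exp)
    return 2.0 * score


def keywords(study_counts, ref_counts, threshold=15.13):
    """Positive keywords of the study text, ordered by descending keyness.

    threshold=15.13 is an assumed cut-off (roughly p < 0.0001 at 1 d.f.).
    """
    size_study = sum(study_counts.values())
    size_ref = sum(ref_counts.values())
    scored = []
    for word, freq in study_counts.items():
        ref_freq = ref_counts.get(word, 0)
        # Keep only words that are relatively MORE frequent in the study text.
        if freq / size_study <= ref_freq / size_ref:
            continue
        g2 = log_likelihood(freq, size_study, ref_freq, size_ref)
        if g2 >= threshold:
            scored.append((word, g2))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented toy frequency lists, for illustration only.
study = {"heroine": 50, "the": 950}
reference = {"heroine": 5, "the": 9995}
print(keywords(study, reference))
```

Running the same function against several reference corpora and keeping the words that recur on every list would mirror the use of more than one reference corpus recommended above.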
5.3 Excursus: grammatical negation in NA
The analysis of concordance lines from the semantic field textuality above has shown a large number of grammatical negations and negatively connotated and denotated lexis in the concordance lines. The following analyses test whether the language of the entire novel is characterized by the discussion of topics or circumstances which do not exist, or whether this is only true for the topic textuality. The analysis is performed by extracting grammatical negations, indefinite pronouns and determiners from the lists of keywords generated above (cf. section 5.2.1). In the following, these categories are summarized as ‘grammatical negations’, as the short term makes the argument clearer. The result of this extraction is shown in Table 5.3.

Table 5.3 Grammatical negations in NA

Comparison        Grammatical negations among the keywords
NA – Austen       (none)
NA – Austen5      (none)
NA – Gothic       not, anything, anybody, nothing, any
NA – ContempLit   not, anything, anybody, nothing
Table 5.3 shows that two of the four comparisons identify grammatical negations as keywords of NA, namely the comparisons between NA and Gothic and ContempLit. The fact that the comparisons between NA and Austen and Austen5 do not identify grammatical negations as keywords indicates a general tendency towards using them in Jane Austen’s novels.
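The extraction step described above amounts to filtering a keyword list against a closed set of forms. A minimal sketch follows; the set contains only forms that this chapter itself treats as grammatical negations, and the keyword list passed in is an invented excerpt rather than the actual output for NA.

```python
# Forms the chapter groups under 'grammatical negations'
# (negators plus indefinite pronouns and determiners).
NEGATION_FORMS = {
    "not", "no", "nothing", "never", "cannot", "without",
    "any", "anything", "anybody",
}


def extract_negations(keyword_list):
    """Return the grammatical negations in a keyword list, keeping list order."""
    return [word for word in keyword_list if word.lower() in NEGATION_FORMS]


# Hypothetical excerpt of a keyword list ordered by keyness.
sample_keywords = ["catherine", "not", "tilney", "anything",
                   "abbey", "anybody", "nothing", "any"]
print(extract_negations(sample_keywords))
```

Because the keyword list is ordered by statistical significance, keeping its order in the result preserves the relative prominence of each negation.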
The assumption that Austen tends to frequently use grammatical negations in her novels is further supported by the list of keywords derived from a comparison between Austen and ContempLit, which identifies the grammatical negations not, nothing, any, no, anything, cannot, without, never and anybody as key in Austen. Furthermore, the negative prefixes in- (inconvenience, indifference), dis- (distressing, dislike), un- (unpleasant, undoubtedly), mis- (mistaken) and im- (impossible) occur on the list. This shows that there is a much higher number of grammatical and morphological negations in Austen than in its contemporary literature. NA is no exception to this lexical pattern in Austen’s works. The two lists of keywords for NA which identify grammatical negations as key are the lists from the comparisons of NA with Gothic and with ContempLit. On both lists, not is the first grammatical negation which occurs and it is among the top 15 node words on both lists. This prominent position warrants its analysis as an example of all grammatical negations on the lists. To do so, the concordance lines of the node word are analysed for collocations and colligations of not. [N]ot occurs a total of 972 times in NA; this is an average of about four times per page. The analysis of not shows that NA deals with concrete fictional characters and with situations from their fictional lives. Comments on general events and features of the fictional world are rare and not describes a literary character’s behaviour or perception in most of its occurrences. This is illustrated by the following concordance lines:

 have him engaged to a girl whom we had not the smallest acquaintance with, and
  painful a notion, but Catherine could not believe it possible that any injury
of the family, a fondness for plays was not to be ranked; but perhaps it was be
entment -- a letter which Eleanor might not be pained by the perusal of -- and,
    ime come to love a rose?" "But I do not want any such pursuit to get me out
  in which a doubt is equally possible. Not keep a journal! How are your absen
 ned directly into a shop that he might not speak to me; I would not even look a
  sed to a modest tranquillity. She did not learn either to forget or defend the
by a clandestine correspondence, let us not inquire. Mr. and Mrs. Morland neve
  u stand such a ceremony as this? Will not your mind misgive you when you find
The concordance lines describe, for example, a couple’s engagement to be married, Catherine’s thoughts and Eleanor’s feelings when reading a letter. These are private and individual situations of the characters. Linking these situations with a grammatical negation indicates that the readers’, author/narrator’s or protagonists’ expectations of these situations are not fulfilled. As shown above, the origins of these expectations are
frequently social norms, such as the expectation that women habitually keep a diary. Since NA is a piece of fiction, these social norms can also be determined by literary conventions, such as the fact that heroines in sentimental novels frequently keep a diary. This results in joint social and literary conventions and thus joint social and literary expectations. Disappointments of conventions or unmet expectations, however, are not natural to people and literary characters as ‘there are no negatives in nature but only in the human consciousness’ (Watt 1960: 259). The expectations themselves are thus made explicit by their negation. By using not to grammatically negate expectations, the node word fulfils a similar function as grammatical, lexical and logical negations which may trigger irony in a text. This means that the occurrence of not can trigger irony in NA if the other preconditions of irony mentioned above are fulfilled as well.
5.4 The corpus Austen

In the following, the corpus Austen is analysed by the same means as NA in the previous sections. The corpus’ keywords are used as pointers to dominant topics in the corpus.
5.4.1 Identifying dominant topics

After demonstrating the usefulness of keywords for gaining information about the content and the structure of a text, this section (section 5.4) demonstrates their usefulness in the analysis of a corpus. The aim of the analysis is to identify and gain literary insight into major topics in the corpus. Since Austen is lexically homogeneous and since homogeneity of lexis and content are interrelated (cf. Chapter 7 Text segmentation), the analysis of the corpus’ keywords is a basis for general statements regarding its literary meanings. The appropriateness of general statements on Austen is supported by the node words’ and lemmata’s occurrences throughout the complete corpus (cf. section 5.4.2). The textual basis for the following analysis is the corpus Austen, which includes NA. In order to ensure transparency of the data, in particular with regard to the analysis of NA above, the frequencies of the keywords identified for Austen are given in addition to their frequencies in each of the six novels from which the corpus was compiled (cf. section 5.4.2). Their frequencies are given in absolute numbers and they are normalized to their
occurrence per 1 million words. This makes the distribution of words across the corpus transparent. When extracting keywords from Austen by comparing it to ContempLit, the software identifies 500 keywords. This is the maximum number of words which the software identifies for one query in its standard setting. Of these 500 words, 189 are names of people and places. Names are almost inevitably identified as keywords, because it is unlikely that they occur with equal frequencies in two sets of data, but they are of minor importance for the analysis of dominant topics and structural features of the data. The 189 names are therefore subtracted from the 500 keywords, and the remaining 311 words form the basis for the following analyses. The following semantic fields are intuitively identified from the remaining keywords on the list:
• 67 words describe mental concepts or perceptions, for example feelings, think, manners, attachment and spirits
• 11 words refer to family or social relationships: sister, sisters, acquaintance, cousins, sister’s, family, daughters, friend, cousin, aunt, brother
• 1 word refers to textuality: anhalt, a character of the play performed in MP. Because anhalt is a name, it is excluded from further analyses; it is mentioned here only because of the analysis of NA above.
This list of semantic fields shows that the dominance of textuality in NA is not a general feature of Austen’s novels, which confirms this same conclusion drawn from the analysis of lists of keywords from the comparisons of NA with Austen and with Austen5. Only one word, a name, from this semantic field is identified as a keyword of Austen. This indicates that textuality is not more dominant in Austen than in ContempLit. The list of keywords for Austen, sorted according to statistical significance, shows a dominance of words relating to the female sphere already within the first 25 words. This tendency is continued throughout the entire list. Of the first 25 words, five words are female pronouns and forms of address (her, she, miss, mrs, herself). The first male form (mr) is the 35th word on the list. A further indicator of the female dominance on the list of keywords is that five of the eleven node words from its dominant semantic field family and social relationships refer to women (sister, sisters, sister’s, daughters, aunt), but only one to men (brother). This indicates that the six novels which comprise the corpus have mainly female protagonists whose lives and experiences are
described in the novels. The latter can be deduced from the dominance of expressions describing mental concepts and perceptions on the list of keywords. These 67 words describing mental concepts and perceptions constitute nearly one quarter of the shortened list, which excludes names. In the following section (5.4.2), the first five keywords from the list describing family and social relationships are analysed in analogy to the analysis of NA. In order to accomplish this, the words are lemmatized and their concordance lines are extracted. The concordance lines are then analysed for linguistic patterns which function as evidence for literary meanings. The analysis is restricted to five lemmata for the sake of relevance. Analysing a larger number of lemmata would not lead to further insights into the meanings of the corpus, since the linguistic patterns of other relevant words conform to those that are discussed in this chapter. The aim of the analysis is to develop a kind of sociogram of family and social relationships in Austen. To do so, relationships between members of a family and their relationships to people outside the family circle are analysed. This provides insights into the social worlds of Jane Austen’s novels. A literary critical analysis of families in Austen’s novels is presented by, for example, Perry (2009).

5.4.2 Family and social relationships

A look at the top five lemmatised keywords from the semantic field family and social relationships on the list, FAMIL*, SISTER*, DAUGHTER*, COUSIN* and ACQUAINTANCE*, shows their focus on the family. With four of the five lemmata referring to the family or its members, the family dominates the list. This indicates that families play a major role in the corpus while friends and acquaintances seem to play more minor roles. This impression is confirmed by the following analyses.
Moreover, the list above supports the hypothesis that female protagonists are dominant in Jane Austen’s novels as the list features two expressions which refer to women. There are no expressions referring to men on the list. Table 5.4 shows that all lemmata that are analysed occur in all texts of the corpus. Family and social relationships thus seem to be important elements in the lives and conversations of the protagonists in all novels. Only the numbers of occurrences of the different lemmata and word forms differ between the texts. This can be seen from Table 5.4, in which absolute numbers of the words’ and lemmata’s occurrences are given without brackets, and the numbers normalized to 1 million words are put into square brackets.
Table 5.4 Distribution of family and social relationships in Austen

               Austen         EM             MP             NA             Per            P&P            S&S
family         579 [798.8]    77 [478.3]     126 [785.4]    61 [785]       80 [956.3]     152 [1245.2]   83 [692]
families       47 [64.8]      11 [68.4]      14 [87.3]      4 [51.5]       8 [95.6]       7 [57.3]       3 [25]
FAMIL*         656 [905]      92 [571.5]     147 [916.2]    68 [875.1]     89 [1063.9]    166 [1359.8]   94 [783.7]
sister         865 [1193.3]   33 [205]       197 [1227.9]   53 [682.1]     82 [980.2]     218 [1785.8]   282 [2351.2]
sisters        216 [298]      12 [74.6]      47 [293]       16 [205.9]     17 [203.2]     76 [622.6]     48 [400.2]
SISTER*        1088 [1501]    45 [279.5]     246 [1533.3]   70 [900.8]     99 [1183.4]    297 [2433]     331 [2759.7]
daughter       280 [386.3]    62 [385.1]     27 [168.3]     37 [476.2]     23 [274.9]     85 [696.3]     46 [383.5]
daughters      129 [178]      7 [43.5]       30 [187]       5 [64.4]       10 [119.5]     50 [409.6]     27 [225.1]
DAUGHTER*      409 [564.2]    69 [428.6]     57 [355.3]     42 [540.5]     33 [394.5]     135 [1106]     73 [608.6]
cousin         186 [256.6]    2 [12.4]       93 [579.7]     1 [12.9]       28 [334.7]     49 [401.4]     13 [108.4]
cousins        100 [138]      2 [12.4]       49 [305.4]     2 [25.7]       19 [227.1]     11 [90.1]      17 [141.7]
COUSIN*        288 [397.3]    4 [24.9]       142 [885.1]    3 [38.6]       49 [585.7]     60 [491.5]     30 [250.1]
acquaintance   329 [453.9]    63 [391.3]     40 [249.3]     49 [630.6]     61 [729.2]     52 [426]       64 [533.6]
acquaintances  12 [16.6]      0 [0]          0 [0]          0 [0]          0 [0]          11 [90.1]      1 [8.3]
ACQUAINTANCE*  341 [470.4]    63 [391.3]     40 [249.3]     49 [630.6]     61 [729.2]     63 [516.1]     65 [541.9]
The numbers show that the frequencies of the keywords differ between the texts. But since the collocations and colligations of the lemmata are similar across the texts, this is of no consequence for the following analysis. Their distributions are only briefly mentioned here as they are discussed in more detail in Chapter 7 Text segmentation. Table 5.4 also shows that the singular form of a node word is usually more frequent than its plural form. This indicates that relationships between two people are of greater importance in the novels than those between larger groups, as the singular forms usually refer to relationships between two people in which characters speak, for example, of a COUSIN* or a DAUGHTER*. Talking about several people, and therefore also relationships between several people, seems to be less frequent and therefore of less importance in the novels than relationships between two people.
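The normalization behind the bracketed figures in Table 5.4 is a simple rescaling of an absolute count to a rate per 1 million words. The sketch below illustrates it; the token counts of the corpus and of the individual novels are not given in this section, so the corpus size used here is back-calculated from one table cell and is only approximate because the published rates are rounded.

```python
def per_million(count, corpus_tokens):
    """Normalize an absolute frequency to occurrences per 1 million words."""
    return round(count / corpus_tokens * 1_000_000, 1)


def implied_corpus_size(count, rate_per_million):
    """Back-calculate the approximate corpus size from a count and its rate."""
    return round(count / rate_per_million * 1_000_000)


# 'family' in the whole corpus: 579 occurrences, reported as 798.8 per million.
austen_tokens = implied_corpus_size(579, 798.8)  # an estimate, not a reported figure
print(austen_tokens)
print(per_million(579, austen_tokens))
```

Normalized rates make the novels comparable despite their different lengths, which is why the table reports them alongside the absolute counts.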
The following analysis (section 5.4.2.1) starts by looking at the cover term FAMIL* and proceeds by moving to more distant relationships. For demonstrative purposes, the first analysis is performed in most detail. The analysis adopts a semantic approach to identifying different meanings and implications of the concept ‘family’ by examining the lemma’s concordance lines for its collocations and colligations. These indicate its literary meanings. This approach is also used for the analysis of the other lemmata analysed later in this chapter.
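Concordance lines of the kind examined in this chapter can be produced with a short keyword-in-context routine. The following sketch is not the software used for the analyses; the function name, the context width and the sample sentence are assumptions made for illustration. A wildcard lemma such as FAMIL* is expressed here as the regular expression famil\w*.

```python
import re


def kwic(text, pattern, width=38):
    """Return (left context, node, right context) for every match of `pattern`."""
    flat = " ".join(text.split())  # collapse line breaks and repeated spaces
    lines = []
    for match in re.finditer(r"\b(?:%s)\b" % pattern, flat, flags=re.IGNORECASE):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-justify the left context so that all nodes line up in one column.
        lines.append((left.rjust(width), match.group(0), right))
    return lines


# Invented sample text, for illustration only.
sample = "Her family was large. The families met often."
for left, node, right in kwic(sample, r"famil\w*"):
    print(left + node + right)
```

Like the concordance lines quoted in this chapter, the output cuts the context at a fixed number of characters, so words at the edges may be truncated.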
5.4.2.1 The analysis

The lemma FAMIL* (656, including forms such as familiar) is connected with several distinct meanings. Its most striking usage is the emphasis on the family as a self-contained unit which appears as one unit to the outside world. This results in a distinction between the members of a family and outsiders, as can be seen from the following concordance lines:

ctions at all to meeting the Hartfield family. Don’t scruple. I know you are at
ntinued with them, and made one in the family party on the ramparts. Mrs. Pric
 as you know, and re-admitted into the family; and there it was his constant o
elieve, will soon become a part of the family." "You have a very small park he
icitous of being longer noticed by the family, as Sir Walter considered him un
rous of being really received into the family, was disposed to look up to him
en, he was able to represent the whole family to the general in a most respecta
ill be an unwelcome contraction of our family circle; but I should have been d
ied, and properly supported by her own family, people of respectability as they
 a very old and intimate friend of the family, but particularly connected with
The grammatical indicators for an inclusion into or an exclusion from a family are the use of possessive pronouns, of proper nouns, place names or titles identifying families, and of definite determiners (the, this) in the first (L1) or second (L2) position to the left of the lemma. These grammatical markers occur 398 times, that is in more than 50 per cent of the occurrences of the lemma (cf. Table 5.5).

Table 5.5 Pronouns, articles, proper nouns and titles in Austen

                                                              L1    L2
her                                                           54    15
his                                                           34     6
my                                                             9     6
our                                                            7     3
the                                                          146    55
their                                                          8     2
this                                                           5     0
your                                                          14     3
proper nouns, place names and titles identifying a family     31     0
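Counts of this kind can be obtained by checking the one or two tokens to the left of each occurrence of the lemma. The sketch below is a simplified illustration under stated assumptions: it covers only the closed-class markers of Table 5.5, it does not detect proper nouns, place names or titles, it ignores sentence boundaries, and the sample sentence is invented. How the original analysis treated lines with markers in both positions is not stated, so this version simply counts each position separately.

```python
import re
from collections import Counter

# Closed-class markers from Table 5.5 (proper nouns and titles are not covered).
MARKERS = {"her", "his", "my", "our", "the", "their", "this", "your"}


def left_colligates(text, node_pattern=r"famil\w*"):
    """Count markers in the first (L1) and second (L2) slot left of the node."""
    tokens = re.findall(r"[a-z']+", text.lower())
    node = re.compile(node_pattern)
    l1, l2 = Counter(), Counter()
    for i, token in enumerate(tokens):
        if not node.fullmatch(token):
            continue
        if i >= 1 and tokens[i - 1] in MARKERS:
            l1[tokens[i - 1]] += 1
        if i >= 2 and tokens[i - 2] in MARKERS:
            l2[tokens[i - 2]] += 1
    return l1, l2


# Invented sample text, for illustration only.
sample = ("She wrote to her family. The whole family agreed, "
          "and our own family circle grew.")
l1_counts, l2_counts = left_colligates(sample)
print(dict(l1_counts))
print(dict(l2_counts))
```

Summing both counters over all occurrences of the lemma gives the kind of total that the text compares with the overall frequency of FAMIL*.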
In 20 cases, two of these features co-occur in the same concordance line. In 19 of them, the stands in front of a proper noun, place name or title which identifies a family, for example, the Musgrove family. A family is a social institution. Its members are criticised when they absent themselves from it, do not support the family or are excluded from it. This also results in a loss of prestige and influence within the family. On the other hand, belonging to a family is of great importance. The following concordance lines give evidence for the depiction of not-belonging to a family, of conflicts within a family or with non-family members and of people who neglect their family and property:

 the world, and wholly unallied to the family! Do you pay no regard to the wis
y they were wished away by some of the family. Mrs. Hurst and her sister scarce
 "I cannot comprehend the neglect of a family library in such days as these." "
earing; to care for none beyond my own family circle; to think meanly of all th
ways lodged apart from the rest of the family. While they snugly repair to thei
s to bear, she has no fair pretence of family or blood. She was nobody when h
 "though she has behaved so ill by our family, she may behave better by yours.
g mischief. Even the smooth surface of family-union seems worth preserving, tho
  disrespectfully or carelessly of the family and the family honours, he was q
uine, but he was more negligent of his family, his habits were worse, and his m
These concordance lines depict the image of a family which presents itself as a unit to the outside world and which dissociates itself from it. This indicates a strong bond both within a family and when facing the outside world. Also, it frequently triggers a collocation of FAMIL* with negatively connotated or denotated lexis or a colligation with grammatical negations as can be seen from the concordance lines above. The status of a family, its social relationships and prestige are of great importance for the self-confidence and self-image of a family. They are criteria for evaluating the attractiveness and social status of a family, as can be seen from the following concordance lines:

 s something, but all the honour of the family he held as cheap as dirt. I have
ourable, and ancient--though untitled-- families. Their fortune on both sides i
ns; every body was anxious to be in her family, for she moves in the first circle
 , the younger branch of a very ancient family--and that the Eltons were nobody.
  and to be preferred by her to her own family connections among the nobility
 ll off for a cadet of even a baronet’s family. By the time he is four or five a
 tory. She was certainly not a woman of family, but well educated, accomplished
   disrespectfully or carelessly of the family and the family honours, he was q
 ng her father’s heir, and whose strong family pride could see only in him a pro
 of Highbury, and born of a respectable family, which for the last two or three
Its social status determines the prestige of a family, and its distance from socially inferior groups and families is a basis for the self-definition and self-confidence of a family. The factual or perceived social status of a family is a basis for family pride and indicates its hierarchical level within the society that is portrayed in the novels comprising Austen. The pursuit of status and prestige resulting from this means that one’s family is a retreat from social duties and social hierarchies. This sphere of retreat and comfort within a family is mirrored in the following concordance lines:

 d within so easy a distance of her own family and friends." "An easy distance d
 able and affectionate; wrapt up in her family; a devoted wife, a doating mother
ul, gentle, knowing all the ways of the family, interested in all its concerns,
  good-humoured and pleasant in his own family, and I do not think him so very
garet’s return, and talking of the dear family party which would then be restor
 affection, and domestic comfort of the family to whom accident had now introdu
 wn to wish for any possible good to my family, I should have fixed on Colonel
 convenience of any single being in the family. I am aware that you may be left
  the comfort and cheerfulness of every family meeting and every meal chiefly d
      his decided preference of a quiet family party to the bustle and confusion
Similar to the family as a whole, a sister is also highly valued. Her feelings, needs and comfort are taken care of and family members show consideration for her. Furthermore, she herself is also a source of comfort and an authority who evaluates other people’s actions. She has a central position within the family. This is shown by the following concordance lines of SISTER* (1088):
. You must be a great comfort to your sister, sir." "I hope I am, madam." "And
had already written a few lines to her sister to announce their safe arrival i
does shew a thorough confidence in my sister, and a certain consideration for
m," she added. "Whatever can give his sister any pleasure is sure to be done i
procure so agreeable a visitor for her sister?" "Nothing can be more natural,"
y well off, with all the children of a sister I love so much, to care about. Th
uche arrived, Mr. Crawford driving his sisters; and as everybody was ready, th
s, just articulate, "My Fanny, my only sister; my only comfort now!" She could
very kind and careful guardian of his sister, and you will hear him generally
een them it was more the intimacy of sisters. Even before Miss Taylor had ce
There is only a little criticism of sisters to be found in the corpus and a positive picture of them prevails. However, the following concordance lines are examples of criticism of a sister:
o contend against the unkindness of his sister, and the insolence of his mother
such a mother and such uncompanionable sisters, home could not be faultless, a
g them, an absolute breach between the sisters had taken place. It was the nat
bit her lip, and looked angrily at her sister. A mutual silence took place for
less and dissatisfied every where, her sister could never obtain her opinion o
d’s anxious circumspection! of all his sister’s falsehood and contrivance! the
make it up any better." And when her sisters abused it as ugly, she added, wi
it would not be very likely to promote sisterly affection or delicacy of mind."
und to be intolerable, and the younger sisters not worth speaking to, a wish of
greatly heightened by the shock of his sister’s conduct, and his recovery so m
With SISTER*, the focus of the relationships shifts to within the core family. While the public appearance of a family to non-family members is of great importance for the family as a unit, this is not the case with a SISTER* whose happiness is at the centre of attention. This is a contrast between the connotations of FAMIL* and SISTER*, between the outer appearance and the inner being. Not only a SISTER*, but also a DAUGHTER* (409) is highly valued in the corpus. The concordance lines show that daughters are cared for and that consideration is shown to them. They also give advice and make their parents proud. This is shown by the following concordance lines:
"I leave an excellent substitute in my daughter. Emma will be happy to enterta
s; and never had the general loved his daughter so well in all her hours of co
for, I assure you." "But consider your daughters. Only think what an establishm
, and he might be thankful to his fair daughter Julia that Mr. Yates did yet me
untenance particularly grateful to the daughter. He scarcely needed an invita
e by almost ceaseless attention on his daughter’s side, and by exertions which
se at the earnest advice of her eldest daughter. For the comfort of her childr
entertainment. Fanny was indeed the daughter that he wanted. His charitable
ct. Precious as was the company of her daughter to her, she desired nothing so
himself, and she thought of it for her daughters’ sake with satisfaction, thou
Positive judgements about daughters outweigh criticism of them. The latter occurs in only a few concordance lines such as:
ain herself, began scolding one of her daughters. "Don’t keep coughing so, Kit
light, as comprehending the loss of a daughter, and a disgrace never to be wip
before she could at all forgive their daughter. Mr. Bennet’s emotions were
courtier and fine gentleman to like his daughter the less. But, by G--! if she b
t the rest of his time in scolding his daughter for so foolishly hurrying her "
sort of man, and I thought well of his daughter--better than she deserved, for
l-judging indulgence the errors of her daughter must principally be owing.
is licentiousness of behaviour in your daughter has proceeded from a faulty de
to a parent’s mind. The death of your daughter would have been a blessing in c
ife should never be acknowledged as her daughter, nor be permitted to appear in
The most frequent collocation of DAUGHTER* is with words relating to weddings and marriage. Upcoming or past weddings of daughters are frequent topics of conversations in the corpus. These weddings and marriage as such are perceived as desirable by the speakers. This is shown by the following concordance lines:
olving to choose a wife from among his daughters, that the loss to them might be
she would be able to show her married daughter in the neighbourhood before sh
to have allowed the marriage; that his daughter’s sentiments had been sufficient
the approaching nuptials of my eldest daughter, of which, it seems, he has bee
not be more at Highbury; but now their daughter is married, I suppose Colonel
a delightful thing, to be sure, to have a daughter well married," continued her m
tertained. "If I can but see one of my daughters happily settled at Netherfield
his spirits affected by the idea of his daughter’s attachment to her husband, s
hree or four months. Of having another daughter married to Mr. Collins, she th
duct he pursued. He, who had married a daughter to Mr. Rushworth: romantic de
The concordance lines show that DAUGHTER*, like SISTER*, is mostly positively connotated in the corpus. Relationships within the core family seem to be close and are not subject to the social constraints and representative duties of a family. Consequently, the lemmata denoting the core family and the family as such have positive semantic prosodies. Once relationships become more distant, emotional ties are looser, but one’s affiliation to a family is emphasized even more strongly. In the case of COUSIN* (288), this is achieved by a colligation of the lemma with the possessive pronouns her, his, my, our, their and your in the first (188) and second (40) positions left of the lemma in more than two thirds of its occurrences. This resembles the colligation of FAMIL* with possessive pronouns discussed above. The reason for making one’s membership of a family explicit is either to transfer the prestige and social standing of a COUSIN* to the rest of the family or vice versa. This does not mean that a cousin must be liked, but s/he is at least tolerated. This can be seen in the following concordance lines:
and "Our cousins in Laura Place,"--"Our cousin, Lady Dalrymple and Miss Cartere
and entreaty, to his right honourable cousin. Neither Lady Russell nor Mr Ell
so fashionable; because they were all cousins and must put up with one anothe
honours of the house the names of his cousin Miss de Bourgh, and of her mothe
efore you must make allowance for your cousin, and pity her deficiency. And re
honour nor inclination confined to his cousin, why is not he to make another ch
umbrance." Her representation of her cousin’s state at this time was exactly
e, and it is believed that she and her cousin will unite the two estates." Thi
e; for besides having a regard for his cousin, Charles Hayter was an eldest son
Situated as we are with Lady Dalrymple, cousins, we ought to be very careful no
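The positional counts reported for COUSIN* (possessive pronouns in the first and second positions left of the lemma) can be illustrated with a short sketch. It is not the procedure used for this book: the function name, the crude letters-only tokenization and the rule of counting only the nearest possessive are assumptions made for illustration.

```python
import re
from collections import Counter

# The six possessive pronouns named in the discussion of COUSIN*.
POSSESSIVES = {"her", "his", "my", "our", "their", "your"}

def possessive_positions(text, stem, span=2):
    """Count possessive pronouns in the positions left of a lemma.

    `stem` stands in for the wildcard notation (e.g. "cousin" for COUSIN*).
    Returns (number of node words, Counter mapping position -> hits),
    where position 1 is the word directly left of the node word.
    """
    tokens = re.findall(r"[a-z]+", text.lower())  # assumption: letters-only tokens
    hits = Counter()
    total = 0
    for i, token in enumerate(tokens):
        if not token.startswith(stem):
            continue
        total += 1
        for distance in range(1, span + 1):
            if i - distance >= 0 and tokens[i - distance] in POSSESSIVES:
                hits[distance] += 1
                break  # count only the nearest possessive for each node word
    return total, hits

sample = ("Our cousins in Laura Place. He wrote to his right honourable cousin. "
          "You must make allowance for your cousin. Her dear cousin came.")
total, hits = possessive_positions(sample, "cousin")
print(total, dict(hits))
```

Dividing the summed hits by the total number of node words gives the proportion of occurrences that show the colligation, which is the kind of figure quoted above for COUSIN*.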
A further feature of COUSIN* is that people mostly talk about and not with him/her. This indicates a distance which, on the one hand, emphasizes the family relationship, but, on the other hand, also shows the exclusion of the COUSIN* from the core family. This exclusion is expressed lexically and grammatically by the frequent collocation and colligation of the lemma with grammatical negations and negatively connotated words. The following concordance lines give evidence for this:
Mr. Bertram. I am so glad your eldest cousin is gone, that he may be Mr. Bertr
to an apartment never used since some cousin or kin died in it about twenty ye
ord might not ask. The prospect for her cousin grew worse and worse. The woman w
ing-room. "Dear mama, only think, my cousin cannot put the map of Europe toge
s expectations were fully answered. His cousin was as absurd as he had hoped
find that they did not see more of her cousin by the alteration, for the chief
uch to disquiet and mortify him in his cousin’s behaviour. She had too old a r
ne’s opinion, most unfortunately) were cousins of the Elliots; and the agony w
was anxious to avoid the notice of his cousins, from a conviction that if they
, if he had not happened to say so. My cousins have been so plaguing me! I dec
The emotional distance to a COUSIN*, as shown in the concordance lines, is much larger than that to the members of the core family, SISTER* and DAUGHTER*. This shows that the perception of family members is ambivalent in Austen and that it depends on the closeness or distance of the family relationship between the respective speaker and the object of his/her speech. As a general feature, the perception of a person becomes more critical the more distant the family relationship is. This is also confirmed by an analysis of the relationships between the protagonists and their ACQUAINTANCE* (341), in which critical comments outweigh positive evaluations. Acquaintances are, for example, unwanted, quickly closed or unpleasant. Positive attributes of an acquaintance are rarely made explicit. Negative judgements are manifest, for example, in the following concordance lines:
d in so untoward a moment to admit the acquaintance of a young man whom he fel
as a disappointment at Randalls. The acquaintance at present had no charm for
painful a conclusion of their present acquaintance! and yet, she could not he
perhaps, been accepted on too short an acquaintance, and, on knowing him better
seen the beginning and the end of their acquaintance; but not with a few months
ight, and such has been the end of our acquaintance. And what an acquaintance h
o help calling." "She will drop the acquaintance entirely." But in spite of
a man has committed himself on a short acquaintance, and rued it all the rest o
One feels that it cannot be a very long acquaintance. He has been gone only fo
and yet the danger of a renewal of the acquaintance!-- After much thinking, she
Remarks on an acquaintance are often (121) accompanied by a classification of its duration and are qualified by, for instance, former, week’s, slight, short, recent, new, old, older, half-hour’s, fortnight’s and always. The words former, slight and short on this list indicate that the emphasis of the modifiers is on critical comments concerning one’s ACQUAINTANCE*. Positive evaluations are expressed by OLD*, as an example, since an acquaintance which goes back a long time is a sign of continuity and trust between the ACQUAINTANCE*. This is positively connotated in the corpus. The seeming necessity to define an acquaintance and to qualify it by mentioning its duration is expressed by the colligation of ACQUAINTANCE* with adjectives, adverbs and participles in the first position left of the lemma (97). This allows one to define the duration of the acquaintance, its intensity and the size of one’s circle of acquaintances:
er to affect the air of a cool, common acquaintance than anything else, I watc
ind me in Bath, my first and principal acquaintance on marrying should be your
gathered from the Middletons’ partial acquaintance with him; and she was eage
his profession, he should not have many acquaintance in such a place as this."
day and the next; he had met with some acquaintance at the Crown who would not
e every military man, had a very large acquaintance. When the entertainment was
er, independent of his claims as an old acquaintance, an attentive neighbour, a ch
the neglect or unkindness of slight acquaintance like the Tilneys ought to
the familiarity of a long-established acquaintance. "Well, Marianne," said Eli
Colonel Brandon alone, of all her new acquaintance, did Elinor find a person w
Meanwhile, the concordance lines also show social constraints in the choice of one’s acquaintance, since acquaintances reflect on one’s own prestige. This can be seen from the following concordance lines:
omething to amuse; his journeys and his acquaintance were all of use, and Susan w
likely to be benefited by an increasing acquaintance among his brother-officers.
ed something better; but yet "it was an acquaintance worth having;" and when Anne
in town as in the country, since his acquaintance must now be dropped by all w
very careful not to embarrass her with acquaintance she might not approve. If we
it will be advisable to have as few odd acquaintance as may be; and, therefore, I
ion. He was not at all ashamed of the acquaintance, and did, in fact, think and
g a man whom I could never admit as an acquaintance of my own! I wonder you sho
her; she would detach her from her bad acquaintance, and introduce her into good
be worth knowing: always acceptable as acquaintance." "Well," said Anne, "I cert
The speaker’s own social status and that of the ACQUAINTANCE* are of prime importance for the relationship between them.
5.4.3 Implications
The analyses have shown that (1) the topic family and social relationships is dominant throughout the corpus, (2) its connotations are roughly similar across the novels, and (3) the characters’ perceptions of other people are influenced by the closeness or distance of their family or social relationships. The lemmata’s usages are roughly similar across the corpus, despite the differences in the lemmata’s and the node words’ distributions in the novels that constitute the corpus, and despite their different plots. This indicates that, irrespective of the individual plots of the novels, the corpus is homogeneous in terms of its contents. This homogeneity is based on the social and cultural conventions of late 18th- and early 19th-century England which constitute the background of the plots. The analyses have also shown that the analytic techniques which are successful for a text are also useful for the analysis of a corpus. The analysis of Austen has given insight into dominant topics of the corpus and thus into its plot structures. Looking at the keywords of the corpus identifies its dominant topics, such as family and social relationships. This topic is analysed in depth by looking at its concordance lines, which facilitate literary insights into Austen’s novels, for instance into the correlation between the closeness of a family relationship and the esteem for a person. To my knowledge, this has not yet been discussed in the secondary literature on Austen. The keywords function as pointers to topics which warrant an analysis and which lead to literary insights into the data. The analyses of Austen and NA have shown that even though keywords do not have much information value themselves, they are useful starting points for analyses which, in turn, give insight into the contents of the data.
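The step from a keyword to its concordance lines can be sketched in a few lines of code. The following is only an illustration under stated assumptions, not the software used for the analyses in this chapter: the function name kwic, the default context width of 40 characters and the handling of the wildcard notation (FAMIL* matching family, families and so on) are all hypothetical choices.

```python
import re

def kwic(text, pattern, width=40):
    """Return keyword-in-context lines for every word matching `pattern`.

    `pattern` follows the chapter's wildcard notation: a trailing "*"
    matches any continuation of the stem. Matching ignores case.
    """
    stem = re.escape(pattern.rstrip("*"))
    suffix = r"\w*" if pattern.endswith("*") else ""
    node = re.compile(rf"\b{stem}{suffix}\b", re.IGNORECASE)
    flat = " ".join(text.split())  # collapse line breaks and runs of spaces
    lines = []
    for match in node.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-align the left context so that the node words line up.
        lines.append(f"{left:>{width}}{match.group()}{right}")
    return lines

sample = ("She was certainly not a woman of family, but well educated. "
          "The families met.")
for line in kwic(sample, "FAMIL*", width=10):
    print(line)
```

Cutting the context at a fixed number of characters rather than at word boundaries is what produces the truncated words at the edges of the concordance lines quoted in this chapter.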
5.5 Concluding comments
In this chapter, electronically generated data was used as the starting point for the analysis and as evidence for literary interpretations of a text and a corpus. The data used above is reused for further analyses in Chapter 7 Text segmentation as its relevance for the content and the structure of the data has been asserted. At first glance, the keywords did not seem to provide insight into the data which could not have been gained by traditional literary critical methods. After all, literary critics have already discussed, for example, the importance
of literature in NA (cf. for example Burlin 1975 who mentions Mathison 1957 as the first critic to discuss the role of literature in NA) and of family relationships in Austen’s novels (cf. for example Perry 2009). Yet, at second glance, one notices that insights gained by using corpus linguistic techniques are frequently more detailed and often go beyond those gained by intuitive methods. Examples of this are the characterization of NA’s protagonists by way of their reading habits and the differences in the characters’ perceptions of family members and other social relations in Austen. To my knowledge, neither of these topics has been discussed in literary criticism, even though both findings seem obvious in retrospect. Nevertheless, despite their intensive and detailed analyses of Austen’s texts, literary critics have not discovered them. This is evidence for the large potential for gaining new knowledge about a text’s or corpus’ meanings by way of detailed linguistic analyses. Corpus linguistic analyses identify linguistic patterns which are virtually undetectable intuitively, but constitute meanings in a text. The analyses in this chapter have also shown that seemingly simple and standardized corpus linguistic analytic techniques, such as the extraction of keywords and the analysis of concordance lines, are not only valid, standard techniques, but that they are also extremely useful in a linguistic stylistic analysis. They allow the detection of meanings which have not been discussed in the nearly 200 years of literary criticism of the novels. They have led to insights both into NA and, more generally, into Jane Austen’s oeuvre, which enhance the understanding of the texts by adding new aspects to their interpretations. Yet, the analyses have also shown that often one analytic technique is not enough for decoding meaning in a text, but that a combination of techniques is useful. 
For the analysis of NA, keywords and their concordance lines were extracted, and their subsequent analysis was supported by the cognitive linguistic schema theory. For the analysis of Austen, the analysis of keywords and their concordance lines was supported by the analysis of the lemmata’s distribution across the corpus. This combination of different approaches, techniques and traditions of linguistic analyses allowed for more comprehensive results on the data than the use of only one analytic technique would have. The analysis of the distribution of lemmata and node words, for example, confirmed that the results gained from their analysis are valid for the complete corpus as the lexis is distributed across all texts comprising it. The keywords likewise point to semantic fields which are of interest for an analysis. However, the list of keywords itself does not give insight into the meanings of the data, but it is useful for forming
hypotheses about the content and structural features of the data. These hypotheses can be corroborated or refuted by further analyses. The analyses performed in this chapter are based on lexis and focus on the contents of the data. In the following chapters, these introductory linguistic stylistic analyses of literary data are complemented by analyses of their phraseologies and by the segmentation of NA and Austen into their constituent parts. The following analyses focus more on the structural features of the data than the analyses presented in this chapter.
Chapter 6
Phraseology
Phraseology is the analysis of word chains and chains of grammatical categories of variable lengths. Word chains can be either uninterrupted chains of n words or they can be variable in one place. While the first are called n-grams, the latter are called phrase-frames or p-frames (cf. Fletcher 2002). Chains of grammatical categories are called pos-grams. As the latter are not relevant for the following analyses, though, they are not discussed further in this book. Phraseological analyses are performed in order to gain information about the lexical and grammatical organization of a text or corpus. The analysis of n-grams or p-frames permits us, for example,
• to draw conclusions about frequent or dominant contents of the data based on the lexis which occurs in the phrases
• to analyze recurrent words or phrases as cohesive links in the data
• to identify frequent lexical and grammatical patterns (collocations and colligations) within and in the co-text of the word chains.
In Chapter 5 Keywords and concordance lines, insights into dominant contents of NA are gained by way of keyword analyses. Chapter 7 Text segmentation demonstrates the contribution of recurrent words and semantic fields to the cohesion and coherence of a text and a corpus by way of distribution diagrams. The present chapter focuses on identifying frequent phraseological units, patterns of collocation or colligation of these units and their contributions to encoding implicit meanings and discourse structures in the data. The data for these analyses are the text NA and the corpora Austen and ContempLit. The following analyses contribute toward identifying discourse structure, discourse organization and content of the data. This allows the phraseological positioning of NA in the context of Austen’s oeuvre and in the context
of its contemporary literature in general. In Chapter 7 Text segmentation, NA is lexically positioned in relation to Austen’s other novels and its contemporary literature by discussing Austen’s lexical homogeneity. N-grams and p-frames encode meanings of the data. Analysing them, therefore, contributes to determining whether the data’s frequent phrases have common grammatical or lexical patterns which encode meanings in a conventionalized way. Analysing p-frames is particularly useful for gaining an overview of the grammatical and discursive organization of information in the data. Analysing n-grams is particularly useful for examining the contents of the data. The analyses in this chapter focus on how explicit and implicit information is organized and encoded in language. As readers are likely to perceive patterns in n-grams and p-frames unconsciously, analysing these patterns makes the meanings they encode explicit. Otherwise, these meanings would remain implicit for the receiver. Making the patterns, and thus their meanings, explicit provides quantitative evidence for subjective impressions of the data, for example, of linguistic similarities between Jane Austen’s six novels. Both the software kfNgram, which is used for identifying the word chains, and the terminology n-grams and p-frames are those of Fletcher (2002). N-grams are recurrent phrases of a length n, p-frames are recurrent combinations of p words which are variable in one slot. The most frequent 4-gram of NA is i am sure i, the most frequent 4-frame of NA is the * of the. The most frequent realisations of the asterisk * in the 4-frame are rest, end and course. The terminology for variable and non-variable phrases differs within the literature on phraseology. Renouf and Sinclair (1991) introduced the term framework for ‘a discontinuous sequence of two words, positioned at one word removed from each other’ (128). Altenberg (1998) calls word chains clusters, Biber, Leech, Johansson et al. 
(1999) call them lexical bundles and Stubbs and Barth (2003) call them chains. The reason for adopting Fletcher’s terminology for this book is that his software (2002) is used for the analyses. In the analyses, first, the most frequent 4-grams and 4-frames of NA are extracted and analysed (cf. section 6.2). The steps of this analysis are presented in detail later in this section. The aim of the analysis is to gain insight into how implicit meaning is encoded in the text and into the textual organization of the text’s discourse and information structures. The analysis shows, for example, that two of the most frequent 4-frames of NA, * i am sure and i am sure *, are mostly delexicalized in
their usage and that they are used as discourse markers in the text. The fact that they mainly occur in spoken language shows that their speakers resort to meaningless phrases in their language. This characterizes the speakers as superficial, a finding that is discussed in greater depth in section 6.2.1. Second, the most frequent 4-grams and 4-frames of Austen are extracted and compared to those of NA (cf. section 6.4). This allows for determining whether the literary critical position that NA differs from Austen’s other novels (cf. for instance Pinion 1973, Craik 1965) is confirmed by the phraseology of the novels. Other linguistic features and contents of NA and Austen’s other novels that might contribute to answering this question are not analysed in this chapter if they go beyond the insight gained by looking at the data’s 4-grams and 4-frames. Lexical similarities and differences between NA and Austen’s other novels though are discussed in Chapter 7 Text segmentation. Similarities in terms of content are discussed in Chapter 5 Keywords and concordance lines. Third, the most frequent 4-grams and 4-frames of ContempLit are extracted and analysed. The analysis includes identifying recurrent lexical and grammatical patterns of and in the co-text of the phrases which allow for insight into how meaning is encoded in literary language of Austen’s time. The results from these analyses are then compared to those from NA and Austen in order to determine whether the author Austen conformed to linguistic conventions of literary texts or whether she used deviant patterns in her writings. This identifies individual features of Austen’s style of writing. The goals described above are accomplished by performing several analytic steps:
1. The 4-grams and 4-frames of NA are extracted and analysed for their lexical and grammatical patterns.
2. Preliminary hypotheses on the discourse structure and encodings of information in NA are formulated.
3. These hypotheses are further investigated by analysing four of the most frequent 4-frames of NA as case studies. These case studies focus on the phrases’ semantic and grammatical preferences.
4. Austen and ContempLit are analysed in the same way as NA by extracting their 4-grams and 4-frames and analysing their most frequent realizations for collocations and colligations.
5. The two corpora are compared both with each other and with NA.
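The extraction in step 1 can be sketched as follows. This is a minimal illustration of the definitions of n-grams and p-frames given above, not a description of how kfNgram works internally: the tokenizer and the function names are assumptions, and every n-gram is simply allowed to contribute to one frame per slot.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case orthographic words; apostrophes inside words are kept."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def ngrams(tokens, n=4):
    """Count every uninterrupted chain of n words."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def pframes(ngram_counts):
    """Derive p-frames by opening one slot at a time in each n-gram.

    Returns (frame frequencies, fillers per frame), where a frame is the
    n-gram with one word replaced by "*".
    """
    frames = Counter()
    fillers = {}
    for gram, freq in ngram_counts.items():
        for slot in range(len(gram)):
            frame = gram[:slot] + ("*",) + gram[slot + 1:]
            frames[frame] += freq
            fillers.setdefault(frame, Counter())[gram[slot]] += freq
    return frames, fillers

sample = ("I am sure I shall. I am sure I will. "
          "The rest of the day, the end of the day.")
grams = ngrams(tokenize(sample))
frames, fillers = pframes(grams)
print(grams[("i", "am", "sure", "i")])
print(frames[("the", "*", "of", "the")], dict(fillers[("the", "*", "of", "the")]))
```

In practice one would probably keep only frames with at least two different fillers, since a frame with a single realization carries no more information than the n-gram it was derived from; that filter is left out here to keep the sketch short.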
The length of the n-grams and p-frames in all analyses is four orthographic words. This is for two main reasons. First, Biber, Leech, Johansson et al. (1999: 992) say that phrases that consist of three words are very frequent as they are ‘a kind of extended collocational association’. Longer phrases on the other hand are ‘more phrasal in nature and correspondingly less common’ and therefore more specific to the data. Using 4-word strings for analyses represents the middle ground between these two poles. This allows the identification of collocations and colligations of the phrases while still maintaining reference to the specific data. The second reason for using 4-word strings for the analyses is that collocations mostly occur within a span of four words to the left and four words to the right of a node word (Sinclair 1991). This means that the phrases’ collocations and colligations can be identified in both 4-grams and 4-frames. The analyses of the 4-grams and 4-frames are based on the 25 most frequent realizations (or 26 when the number of occurrences of the 25th and the 26th phrase are identical) of both units. In the case of the 4-grams of NA, this corresponds to those phrases that occur at least five times in the text.
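The cut-off rule for the frequency lists (the 25 most frequent realizations, or 26 when the 25th and 26th are equally frequent) can be written down compactly. The function name and the alphabetical tie-breaking within equal frequencies are assumptions made for this sketch.

```python
def top_with_tie(counts, n=25):
    """Return the n most frequent items, plus the next one if it ties
    with the nth item in frequency.

    `counts` maps each phrase to its frequency. Items of equal frequency
    are ordered alphabetically so that the result is reproducible.
    """
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    if len(ranked) > n and ranked[n][1] == ranked[n - 1][1]:
        return ranked[:n + 1]
    return ranked[:n]

# Toy frequencies with a cut-off of 2 instead of 25.
print(top_with_tie({"a": 5, "b": 3, "c": 3, "d": 3}, n=2))
print(top_with_tie({"a": 5, "b": 3, "c": 1}, n=2))
```

Note that the rule as stated admits only one additional phrase, so a longer run of tied phrases is still cut after the 26th.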
6.1 Phraseology in the literature
Since roughly the mid-1990s, corpus linguistic phraseological analyses have looked at single word forms (e.g. Danielsson 2003), idioms (e.g. Sinclair 1996) and frequent, recurrent phrases of different (e.g. Biber and Conrad 1999) or predetermined (e.g. Culpeper and Kytö 2002) lengths. The data for these analyses vary from general corpora such as the Bank of English (e.g. Moon 2001) or the London-Lund Corpus of Spoken English (e.g. Altenberg 1998) to specialized corpora, such as the International Corpus of Learner English (ICLE, e.g. Granger 1998). Studies investigating the phraseology of literary texts, however, are still rare. Exceptions are, for example, Gläser (1998), Naciscione (2001) and Starcke (2006). Gläser (1998) suggests basic principles for phraseological analyses within stylistics. She writes that particularly, the modification of the phraseological unit in certain contexts and its relation to a particular genre, punning with idioms in the light of intertextuality and so on (125)
of a text or corpus should be analysed. For this purpose, she defines phraseological units as ‘a lexicalized, reproducible bilexemic or polylexemic word group in common use, which has relative syntactic and semantic stability, may be idiomatized, may carry connotations, and may have an emphatic or intensifying function in a text’ (125). The analyses in this book deviate both from her definition of a phrase and from the principles she sets up. In this book, phraseological units are selected for analyses only on the basis of their frequency in the data; variations in their usage across the data are not considered. Only the number of words and the frequency of the word strings are criteria for their inclusion or exclusion from the analyses. The parameters demanded by Gläser involve subjective decisions in the choice of units for an analysis and are therefore rejected for the present purpose. Furthermore, Gläser asks for an intuitive identification of phraseological units. This is another point in which the following analyses do not follow her. Gläser’s (1998) article is concerned with theoretical principles of ‘phraseostylics’ (143), but does not demonstrate their practical applications. Naciscione (2001), on the other hand, referring to Kunin (1969) as the first stylistician to have analysed phraseological units in literary texts, describes and quantifies the use of phraseological units in works by Chaucer and D. H. Lawrence. The basis for these analyses is Naciscione’s definition of a phrase as ‘a stable combination of words with a fully or partially figurative meaning’ (5). 
Even though she does not describe whether she identifies these phrases automatically or intuitively and even though her definition of a phrase does not conform to the mechanical criteria of corpus linguistics (for example, the analysis only of a specific node word or of a phrase of a specified length), her research is an interesting and valuable approach to identifying a link between the phraseology and the linguistic style of a text. The link between phraseology and linguistic style is also identified by Starcke (2006). In this article, I extract phrases of a specified length from Jane Austen’s Persuasion and analyse them for their textual functions. The findings from these analyses are used as the basis for a literary interpretation of the novel. This concerns, for instance, the novel’s sombre atmosphere, which is in part created by recurrent phrases in the text (cf. Chapter 3 Language and meaning for a further discussion of the article). This approach to a phraseological analysis has been refined
in Fischer-Starcke (2009b) where I analyse keywords and frequent phrases of Austen’s Pride and Prejudice for their contributions to the novel’s literary meanings. Phraseological analyses typically focus on either or both of the following two points:
• the use of word strings as meaning-encoding features
• the function of word strings for textual organization.
In both cases, phrases are accepted as units of meaning in language (cf. Chapter 3 Language and meaning for a discussion on units of meaning in language). This is also asserted by, for example, Danielsson (2003) who confirms Sinclair’s (1987) hypothesis that upward and downward collocations of words indicate different meanings of the investigated lemma. In the course of her analysis, she demonstrates that units of meaning within a text do not consist of one word only, but of ‘one or more words’ (110). This calls into question the centrality of the word as a unit of meaning in language. Also, Stubbs (2007) poses the same challenge. He says that word strings have more semantic and grammatical impact on meaning than single words. This is because semantic prosodies are inherent parts of phrases, so that they are also part of their analysis. Furthermore, some words, for example world, are frequent in the language, because they are part of frequent phrases, such as ‘the most natural thing in the world’ (164). The functional and pragmatic use of idioms is one of Moon’s (2001) research objects. Her definition of an idiom is that of a ‘non-compositional, metaphorical expression consisting of two or more words’ (229). In her analysis, she finds that these idioms are used particularly frequently in written language and that their usage is ‘suasive rather than cooperative, and (. . .) rarely associated with pure exposition’ (239). There is a correlation between text type in written language and the frequent occurrence of fixed phrases or idioms. These seem to be independent of the text genre. Biber, Conrad and Cortes (2004) counter Moon’s (2001) finding that fixed phrases mainly occur in written language, as they find examples of phrases both in the oral language used in university teaching and in the written language of university textbooks. 
They therefore conclude that ‘lexical bundles are stored [in the brain] as unanalysed multi-word chunks, rather than as productive grammatical constructions’ (400). Phrases are stored by the speakers, they can be retrieved as complete units, and
they fulfil set functions in language (cf. also Jackendoff 2002 on this point). They serve as discourse framing devices: they provide a kind of frame expressing stance, discourse organization, or referential status, associated with a slot for the expression of new information relative to that frame. (Biber, Conrad and Cortes 2004: 400) Their fixed functions and the fact that they are stored as complete units entail that neither the phrases’ production nor their reception poses problems for the sender or receiver of a message. This explains the highly formulaic character of language and confirms the notion of phrases as units of meaning in language. This position is also held in psycholinguistics (cf. Wray 2002). In his analysis of semantic and functional characteristics of frequent 4-grams, 4-frames and 4-pos-grams of the BNC, Stubbs (2006) describes dominant functions of frequent phrases in the corpus. He finds that most of the 4-word strings are part of nominal or prepositional phrases which describe spatial, temporal and/or logical relationships. He also shows that most [h]igh frequency 4-grams with the structure preposition + determiner + noun + of [are] (. . .) lexically, semantically and pragmatically idiosyncratic. They contain high frequency words used in idiosyncratic ways (. . .), lexical items which are rare elsewhere (. . .), plus other vocabulary from restricted lexical classes. (n.p.) The analyses of the 25 most frequent 4-frames of NA, Austen and ContempLit demonstrated later in this chapter show similar results. The 4-grams correspond only in part to those of the BNC. The above summary of the research on the phraseology of English has shown that further research is needed. Open questions remain, for example, about the relationship between the source of the data and the results obtained in a study. The relevance of this question is shown, for instance, by the divergent results of the studies by Moon (2001) and Biber, Conrad and Cortes (2004).
The results of these studies lead to the question as to whether highly frequent phrases are at least partly genre-dependent. A first approach to answering this question is presented by Starcke (2008) where highly frequent phrases of the BNC are analysed for their collocations and semantic prosodies. One finding of this study is that both collocations and semantic prosodies vary with the lengths of the phrases, even though they all have a common 3-gram, i do n’t, as their nucleus. This shows both
that phrases are units of meaning in language and that the length of the phrases chosen for an analysis might predetermine its results. However, further research is necessary to be able to answer this question definitively.
6.2 The text NA: data and analysis
In the following, the most frequent 4-grams and 4-frames of NA are presented and analysed. As a first step, the phrases are listed together with their frequencies. The numbers in round brackets show their absolute numbers of occurrence in NA, the numbers in square brackets show their average numbers of occurrence per 1 million words. This convention for the presentation of numbers is adopted for the entire chapter and conforms to the conventions of the entire book. The 25 most frequent 4-grams of NA are:
1. i am sure i (13) [169]
2. i do not know (12) [156]
3. in the course of (10) [130]
4. a great deal of (9) [117]
5. i would not have (9) [117]
6. the rest of the (9) [117]
7. at the end of (8) [104]
8. i am sure he (7) [91]
9. i do not think (7) [91]
10. mr and mrs allen (7) [91]
11. all the rest of (6) [78]
12. for all the world (6) [78]
13. i am sure it (6) [78]
14. i dare say he (6) [78]
15. mr and mrs morland (6) [78]
16. quarter of an hour (6) [78]
17. a quarter of an (5) [65]
18. as well as she (5) [65]
19. but i am sure (5) [65]
20. for the first time (5) [65]
21. it seemed as if (5) [65]
22. the pleasure of seeing (5) [65]
23. was not to be (5) [65]
24. what do you mean (5) [65]
25. what do you think (5) [65]
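The two figures printed next to each phrase can be reproduced mechanically. The following Python sketch is illustrative only and is not the software used for this book: it counts contiguous 4-word strings in a tokenized text and converts an absolute count into a rate per 1 million words. The function names and the toy sentence are hypothetical. The size of NA is not stated in this section; the value used below is back-calculated from the printed pair (13) [169] and should be read as an assumption.

```python
from collections import Counter


def ngrams(tokens, n=4):
    """Yield every contiguous n-token window as a space-joined string."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])


def per_million(count, corpus_size):
    """Scale an absolute count to occurrences per 1 million running words."""
    return round(count * 1_000_000 / corpus_size)


# Toy input, invented for illustration (lower-cased, split on whitespace).
toy_tokens = "i am sure i do not know i am sure i do".split()
toy_counts = Counter(ngrams(toy_tokens))

# Assumption: corpus size inferred from the printed ratio 13 -> 169.
NA_SIZE_ASSUMED = 76_923

rate_rank_1 = per_million(13, NA_SIZE_ASSUMED)   # i am sure i
rate_rank_25 = per_million(5, NA_SIZE_ASSUMED)   # what do you think
```

Under the assumed corpus size, every absolute count is multiplied by roughly 13, which matches the bracketed figures in the list above.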
The 26 most frequent 4-frames of NA are:
1. the * of the (133) [1729]
2. the * of her (72) [936]
3. in the * of (61) [793]
4. the * of a (61) [793]
5. * i am sure (56) [728]
6. i am sure * (56) [728]
7. * i do not (46) [598]
8. i do not * (46) [598]
9. in the world * (41) [533]
10. * in the world (41) [533]
11. for the * of (41) [533]
12. the * of his (40) [520]
13. by the * of (40) [520]
14. of the * and (37) [481]
15. to the * of (33) [429]
16. and the * of (29) [377]
17. on the * of (29) [377]
18. she could not * (28) [364]
19. * she could not (28) [364]
20. at the * of (27) [351]
21. it would be * (27) [351]
22. * it would be (27) [351]
23. i would not * (26) [338]
24. * i would not (26) [338]
25. * i dare say (26) [338]
26. i dare say * (26) [338]
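How a list of 4-frames can be derived from the running text may be sketched as follows. The sketch assumes that a 4-frame is any 4-word window in which exactly one position is replaced by the asterisk; the software used for this book may apply further conditions, so this is an illustration under that assumption and not a description of the actual program. The function name and the toy sentence are hypothetical.

```python
from collections import Counter


def frames(tokens, n=4):
    """Yield each n-token window with exactly one position replaced by '*'."""
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        for slot in range(n):
            yield " ".join(window[:slot] + ["*"] + window[slot + 1:])


# Toy input, invented for illustration.
toy_tokens = "in the course of the day in the middle of the night".split()
frame_counts = Counter(frames(toy_tokens))

# Two different 4-grams (in the course of, in the middle of) feed one frame.
in_the_of = frame_counts["in the * of"]
```

Because every window contributes one frame per slot, a frame with a variable first or last position is produced as soon as its three fixed words recur.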
The list of 4-frames above shows one of the difficulties of phraseological research, namely the occurrence of doubles on the lists. Doubles are two phrases with identical word strings with the asterisk once at their beginnings and once at their ends. One example of doubles are the phrases she could not * and * she could not. As most phrases do not occur either at the very beginning of a text or at its very end, a variable automatically occurs at the phrase’s beginning and end. Since the frequency of the three core words of the phrase is high within the text or corpus, they appear on the list of 4-frames even though they are complete 3-grams. Doubles constitute about half of the list of 4-frames for NA, and there are 14 doubles on the list for Austen and six on the list for ContempLit (cf. sections 6.3 and 6.4 for the lists for Austen and ContempLit).
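Doubles as defined above can be found mechanically by comparing the three fixed words of all frames whose asterisk stands at the beginning or at the end. The following sketch applies this to an excerpt of the NA list, namely its ten most frequent 4-frames; the function names are hypothetical and the procedure is an illustration, not the method used for the book.

```python
# Excerpt: the ten most frequent 4-frames of NA, in rank order.
TOP_TEN_FRAMES = [
    "the * of the", "the * of her", "in the * of", "the * of a",
    "* i am sure", "i am sure *", "* i do not", "i do not *",
    "in the world *", "* in the world",
]


def core(frame):
    """Return the three fixed words if '*' is at either edge, else None."""
    words = frame.split()
    if words[0] == "*":
        return tuple(words[1:])
    if words[-1] == "*":
        return tuple(words[:-1])
    return None


def find_doubles(frame_list):
    """Map each shared 3-word core to its pair of edge-variable frames."""
    by_core = {}
    for frame in frame_list:
        fixed = core(frame)
        if fixed is not None:
            by_core.setdefault(fixed, set()).add(frame)
    return {fixed: pair for fixed, pair in by_core.items() if len(pair) == 2}


doubles = find_doubles(TOP_TEN_FRAMES)
```

In this excerpt, the cores i am sure, i do not and in the world each appear once with a leading and once with a trailing asterisk.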
The doubles call the validity of the lists of 4-frames partly into question. Since the lists are used for comparative purposes, all lists for all data should have equal numbers of 4-grams and 4-frames. The fact that doubles are really complete 3-grams raises the question of whether they should be counted as one or two frames. Based on the methodological approach of this book, doubles are counted as two frames in the analyses. The mechanical and quantitative criteria used here dictate that those strings the software identifies as highly frequent in the data should be the basis for the analyses. Joining two frames intuitively to form one would ignore the previously used criteria and undermine the validity of the analyses. Treating two frames as one would be a change of the original research parameters, namely using intersubjectively generated data as the basis for the analyses. These changes should not be undertaken within an analysis, since different analyses within one research project would no longer be comparable. Consequently, she could not * and * she could not and other doubles are treated as two separate frames in the following analyses. This ensures the comparability of the different lists of 4-frames of the different sets of data. An interesting finding with respect to the doubles of NA is that they are frequently phrases which (1) seem to be lexicalized and (2) do not seem to fulfil any obvious grammatical functions. This impression is questioned in the analyses later in this chapter (cf. section 6.2.1). The pattern of seemingly lexicalized doubles is also visible in the data of Austen, but not in the data of ContempLit. It therefore seems to be either text- or author-specific. The 4-grams and 4-frames of NA partly overlap. This is shown in Table 6.1.

Table 6.1 4-grams and 4-frames of NA

4-grams NA | 4-frames NA
i am sure i, i am sure he, i am sure it | i am sure *
i do not know, i do not think | i do not *
in the course of | in the * of
i would not have | i would not *
the rest of the | the * of the
at the end of | at the * of
i dare say he | i dare say *
but i am sure | * i am sure
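The correspondences in Table 6.1 can be recomputed from the two lists of this section. In the following sketch, a 4-gram is taken to realize a 4-frame if the two agree at every position that is not the asterisk. The lists are copied from the text; the function and variable names are hypothetical, and the sketch is an illustration, not the procedure used for the book.

```python
# The two NA lists as printed in this section (rank order preserved).
NA_4GRAMS = (
    "i am sure i; i do not know; in the course of; a great deal of; "
    "i would not have; the rest of the; at the end of; i am sure he; "
    "i do not think; mr and mrs allen; all the rest of; for all the world; "
    "i am sure it; i dare say he; mr and mrs morland; quarter of an hour; "
    "a quarter of an; as well as she; but i am sure; for the first time; "
    "it seemed as if; the pleasure of seeing; was not to be; "
    "what do you mean; what do you think"
).split("; ")

NA_FRAMES = (
    "the * of the; the * of her; in the * of; the * of a; * i am sure; "
    "i am sure *; * i do not; i do not *; in the world *; * in the world; "
    "for the * of; the * of his; by the * of; of the * and; to the * of; "
    "and the * of; on the * of; she could not *; * she could not; "
    "at the * of; it would be *; * it would be; i would not *; "
    "* i would not; * i dare say; i dare say *"
).split("; ")


def realizes(gram, frame):
    """True if the 4-gram equals the frame at every fixed (non-'*') position."""
    g, f = gram.split(), frame.split()
    return len(g) == len(f) and all(fw == "*" or fw == gw for gw, fw in zip(g, f))


# For each frame, collect the listed 4-grams that fill it.
overlap = {}
for frame in NA_FRAMES:
    hits = [gram for gram in NA_4GRAMS if realizes(gram, frame)]
    if hits:
        overlap[frame] = hits

matched_grams = {gram for hits in overlap.values() for gram in hits}
```

Run on the two lists, the comparison yields eight frames with at least one listed realization and eleven 4-grams that realize a listed frame.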
In fact, eleven 4-grams overlap with eight 4-frames of NA. This overlap between the two lists is rather small with a rate of less than 50 per cent. However, the analysis of semantic and grammatical characteristics of the 4-grams and 4-frames shows large overlaps between the two lists. The patterns listed in Table 6.2 are dominant within the 4-grams.

Table 6.2 Patterns of 4-grams, NA

Pattern | Realizations in the data
Eight 4-grams express temporal, spatial and/or quantitative/qualitative relationships, four of which make temporal relationships explicit | in the course of, a great deal of, the rest of the, at the end of, all the rest of, quarter of an hour, a quarter of an, for the first time
Seven 4-grams conform to the pattern i *** and * i **, another four 4-grams include the personal pronouns she, it and you with the it being impersonal | i am sure i, i would not have, i am sure he, i do not think, i am sure it, i dare say he, as well as she, but i am sure, it seemed as if, what do you mean, what do you think
Seven 4-grams include a verb, adjective or noun describing a perception or a mental process | sure, think, say, seemed, pleasure
Four 4-grams include the grammatical negation not | i do not know, i would not have, i do not think, was not to be
Two 4-grams include question words | what do you mean, what do you think
The patterns listed in Table 6.3 are dominant within the 4-frames.

Table 6.3 Patterns of 4-frames, NA

Pattern | Realizations in the data
Fourteen 4-frames express temporal, spatial and/or quantitative/qualitative relationships | the * of the, the * of her, in the * of, the * of a, * in the world, in the world *, for the * of, the * of his, by the * of, of the * and, to the * of, and the * of, on the * of, at the * of
Eight 4-frames follow the pattern i *** and * i **, another two 4-frames following the pattern * she ** and she *** | * i am sure, i am sure *, * i do not, i do not *, she could not *, * she could not, i would not *, * i would not, * i dare say, i dare say *
Six 4-frames include a modal verb | could and would
Six 4-frames include the grammatical negation not | * i do not, i do not *, she could not *, * she could not, i would not *, * i would not
Four 4-frames include a verb or an adjective describing a perception or a mental process | say and sure
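Those rows of Table 6.3 that are defined by the presence of particular word forms can be recounted from the list of 4-frames. The sketch below does this for the negation not, the modal verbs could and would, the items say and sure, and the pronoun i. The classification of frames as expressing temporal, spatial or quantitative/qualitative relationships is an interpretive step and is not reproduced here. The helper name is hypothetical.

```python
# The 26 most frequent 4-frames of NA as printed in this section.
NA_FRAMES = (
    "the * of the; the * of her; in the * of; the * of a; * i am sure; "
    "i am sure *; * i do not; i do not *; in the world *; * in the world; "
    "for the * of; the * of his; by the * of; of the * and; to the * of; "
    "and the * of; on the * of; she could not *; * she could not; "
    "at the * of; it would be *; * it would be; i would not *; "
    "* i would not; * i dare say; i dare say *"
).split("; ")


def count_with(frame_list, word_forms):
    """Number of frames that contain at least one of the given word forms."""
    targets = set(word_forms)
    return sum(1 for frame in frame_list if targets & set(frame.split()))


with_negation = count_with(NA_FRAMES, ["not"])
with_modal = count_with(NA_FRAMES, ["could", "would"])
with_say_or_sure = count_with(NA_FRAMES, ["say", "sure"])
with_first_person = count_with(NA_FRAMES, ["i"])
```

The four counts agree with the figures given in the corresponding rows of the table (six, six, four and eight frames).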
Words that are identified as keywords in Chapter 5 Keywords and concordance lines do not occur on the lists of 4-grams or 4-frames. This indicates (1) the difference between the lexical and phraseological structures of the text and (2) that words and phrases are both independent units of meaning in language.
Tables 6.2 and 6.3 show that the 4-grams and 4-frames share a number of dominant patterns:
• There is a large number of phrases expressing temporal, spatial and/or quantitative/qualitative relationships. While the realizations of the phrases differ between the two lists, they nevertheless fulfil similar functions in language. Analyses later in this chapter will demonstrate that they are frequently delexicalized and function as prepositions.
• Both lists include a large number of personal pronouns. The frequent use of I indicates the dominance of spoken language in NA.
• Expressions describing perceptions and mental processes occur on both lists. In combination with the dominance of personal pronouns, this shows that the novel focuses on the lives and emotions of the protagonists. It is a personalized narrative.
These findings are discussed in detail later in this section. The above summary of structural and functional aspects of the word strings shows that the differences between the two lists are mostly unrelated to the lexical contents of the strings, but that they are formal and due to the functions they fulfil. The occurrence of lexical words, the high number of personal pronouns on both lists and the occurrence of proper nouns among the 4-grams show the text-specific character of the two lists. They reflect the text’s and possibly the author’s idiolect. The latter is investigated later in this chapter by comparing the data of NA with that of Austen and ContempLit (cf. sections 6.3 and 6.4). The dominance of temporal relations in the phraseology of NA allows us to draw conclusions about the source data, namely a fiction text. The fact that a novel is a narrative entails that the description of temporal relations is one of its constituent features as, according to E. M. Forster’s (1927) famous definition, the chronological sequence of events is a defining feature of a narrative:

a narrative of events, the emphasis falling on causality. ‘The king died and then the queen died’ is a story. ‘The king died, and then the queen died of grief’ is a plot. (60)

Labov (1972) also considers a temporal sequence to be an essential component of a narrative:

We define a narrative as one method of recapitulating past experience by matching the verbal sequence of events which (it is inferred) actually occurred. (359f.)
Phrases describing spatial relationships on the lists also mirror the chronological sequence of a narrative, since the spatial positions of the characters change over the course of time. These changes are described by using word chains. Consequently, phrases describing temporal and spatial relationships are essential components of narratives and therefore of the phraseology of fiction texts in general. This is confirmed later in this chapter by the analysis of the phrases of the corpora Austen and ContempLit, which are both compiled from fiction texts (cf. sections 6.3 and 6.4). The phrases expressing temporal, spatial and/or quantitative/qualitative relationships function as structural features of the novel. This means that they fulfil the functions of prepositions and that they are delexicalized in their usage. Their lexical meanings are secondary and their usage is mainly functionally motivated. The dominance of temporal, spatial and/or quantitative/qualitative relationships expressed by the phrases mirrors a general tendency in present-day English, as identified by Stubbs (2006) for the BNC. The phrases are thus not only characteristic of a narrative, but of language in general. Phrases that include personal pronouns are dominant on the lists of both 4-grams and 4-frames, and they relate to the characters of the novel. The frequent occurrence of I among these phrases hints at the dominant position of spoken language in the novel. This is because self-referentiality (I) is much more frequent in spoken than in written language. And in fact, NA includes a large number of conversations between the protagonists¹. The frequent phrases of NA that (1) express temporal or spatial relationships and (2) include personal pronouns therefore mirror characteristic features of narratives, that is, of novels in general, and of NA.
They indicate the temporal sequence of events, which is a constituent feature of the narrative form of the text, and the interactions between protagonists. In order to make the analysis that has been demonstrated so far more concrete, section 6.2.1 presents a case study of selected 4-frames and their co-texts. This relates the rather general findings above to the text NA.
6.2.1 The analysis
In the following, the phrases
• * i am sure and i am sure *
• * i do not and i do not *
are analysed in detail by looking at their concordance lines and one distribution diagram. The analysis identifies lexical and grammatical patterns in the usage of the phrases. The question of whether these patterns contribute to the cohesion and coherence of the text is discussed in Chapter 7 Text segmentation. Chapter 7 also offers definitions of cohesion and coherence and identifies the relationship between these two concepts. The selection of the four 4-frames for the analysis is based on their frequencies in NA:
• i am sure i is the novel’s most frequent 4-gram
• * i am sure is its fifth most frequent 4-frame
• i am sure * is its sixth most frequent 4-frame
• i do not know is its second most frequent 4-gram
• * i do not is its seventh most frequent 4-frame
• i do not * is its eighth most frequent 4-frame.
The 4-frames account for the two most frequent 4-grams of the text, and they are the most frequent 4-frames of NA that include lexical words. This makes the 4-frames and their realizations as 4-grams some of the most frequent phrases in the text. They are therefore particularly suited to a detailed analysis. The phrases are also among the most frequent phrases of the corpus Austen:
• i am sure i is the corpus’ second most frequent 4-gram
• i am sure you, and i am sure, i am sure it and i am sure she are among the top 26 4-grams
• i am sure * is its seventh most frequent 4-frame
• * i am sure is its eighth most frequent 4-frame
• i do not know is its most frequent 4-gram
• i do not think is its eighth most frequent 4-gram
• i do not * is its fifth most frequent 4-frame
• * i do not is its sixth most frequent 4-frame.
The two phrases occur in less prominent positions on the lists of 4-grams and 4-frames of ContempLit:
• i am sure i is the corpus’ 23rd most frequent 4-gram
• * i am sure and i am sure * are its 80th and 81st most frequent 4-frames
• * i do not and i do not * do not occur among its top 100 4-grams or 4-frames.
The fact that the phrases are dominant in NA and Austen but not in ContempLit indicates that they are part of the author’s idiolect. They are author-specific and therefore characteristic features of Austen’s texts. The four 4-frames’ concordance lines show that they occur mostly in the fictional spoken language of the novel, namely in conversations between the characters. Even though these conversations include only few linguistic features of spoken language, they nevertheless represent spoken language in NA. One such example of a phrase representing spoken language is formulaic language such as the recurrent use of * i am sure and i am sure *. The occurrence of the personal pronoun I and the phrase’s recurrence create the impression of spoken language and allow the reader to distinguish between the fictional spoken and written languages of the novel. The fact that four of the most frequent 4-frames of NA, * i am sure, i am sure*, * i do not and i do not *, occur mostly in the novel’s spoken language shows that the text is dominated by fictional conversations and their language. This finding is supported by the dominance of personal pronouns on the lists of 4-grams and 4-frames. The focus on spoken language seems to be a specific feature of Austen’s novels, since there is no evidence of it in the lists of phrases from ContempLit. It seems to be author-specific. For the following in depth analysis of the phrases, the 4-frames * i am sure and i am sure * (56 in total) and * i do not and i do not * (46 in total) are collated to form the 5-frames * i am sure * and * i do not *. This allows a clearer presentation of the analysis, and there is no difference in the data that is generated for the frames, since the four 4-frames have words before and after them in all instances of their occurrence. They are neither the first nor the last words of the novel, and it is this feature of the 4-frames which is reconstructed by the 5-frames. 
Since 4-frames are the objects of the analyses, the frames continue to be called 4-frames in the following. The collation of frames merely simplifies the analysis and the presentation of its results. These 4-frames are then used to extract their concordance lines and to generate one distribution diagram. This data functions as the basis for the following analysis.

6.2.1.1 * i am sure *
The first question to ask in the analysis of * i am sure * is whether the phrase is typical for literary texts of Austen’s time. This is answered by looking at the list of the most frequent phrases of ContempLit. As mentioned before, the phrase * i am sure * occurs as numbers 80 and 81 on the list of the corpus’ most frequent 4-frames and the two frames occur a total of 362 times,
that is, 84 times per one million words in ContempLit. This means that the phrases occur about 50 per cent less frequently in the corpus than in NA. In Austen, they occur 113 times per one million words. Also, Stubbs and Barth (2003) do not identify realizations of * i am sure * among the most frequent phrases of their corpus of fiction texts. The phrase, therefore, does not seem to be a characteristic feature of literary language, neither at Austen’s time nor today. This indicates that it is a feature of Austen’s idiolect. A detailed analysis of the concordance lines of * i am sure * in NA shows that the phrase functions as a discourse marker, with discourse markers being [l]inguistic devices that help structure discourse. (. . .) Discourse markers have many functions (. . .) [for instance they] foreground a topic (. . .), or (. . .) help organize the overall discourse structure (. . .). (Bussmann 1996: 132, original emphases not retained) In NA, * i am sure * organizes discourse structure as it emphasizes the utterance either immediately before or, more frequently, immediately after the phrase. It directs the reader’s attention to its immediate surroundings. This allocates the greatest significance in terms of content within an utterance to the units pointed at by the phrase. In most cases (46 out of 56 occurrences), that part of the utterance with the greatest significance in terms of content comes immediately after the phrase. The phrase therefore functions as an eye catcher to attract the reader’s attention and to direct it to that part of the utterance which is most relevant for its meaning. By doing so, it functions across sentence boundaries, so that the phrase and the content to be emphasized are not necessarily within the same sentence. This function as an emphasizer entails that the phrase is conventionalized and delexicalized in its usage. It does not carry independent meaning, but is used functionally to direct the reader’s attention. 
Examples of this are the following text passages:
• Say, Mr. Morland, you long to be at it, do not you? I am sure you do. (78)
• that I may go away.” “Our brother Frederick!” “Yes, I am sure I should be very sorry to leave you so soon (. . .).” (188)
• As for myself, I am sure I only wish our situations were reversed. (108)
• (. . .) I say it is no bad notion.” “I am sure I think it a very good one. (111)
The examples show that the phrase occurs in combination with other linguistic emphasizers. In the first example sentence, this additional emphasis is the question tag immediately before the phrase. In the second sentence, the reader’s attention is directed by the use of punctuation, that is, the exclamation mark, and by the use of the emphasizer very. The third sentence includes the reflexive pronoun myself and the modal only. The fourth sentence includes a repetition of the personal pronoun I and the emphasizer very. This shows that the phrase * i am sure * as an emphasizer and discourse marker is not independent of other linguistic features, but it is interrelated with its surrounding lexis in terms of content and semantics. The fact that the phrase occurs in combination with other emphasizers shows its functions as a discourse marker and as a means of foregrounding information. Its literal meaning, that is, being sure of something, is not relevant in this context, since it is expressed by the other emphasizers in the utterance. The phrase * i am sure * is not used by only one speaker, but it is used by different characters in the novel. It therefore does not characterize the different protagonists, but rather the place where it is used. The distribution diagram (Figure 6.1) shows that * i am sure * mainly occurs within roughly the first two-thirds of the novel. The end of this frequent usage at about 47,000 words correlates with Catherine’s departure from Bath and her arrival at Northanger Abbey (cf. the dashed line in Figure 6.1). The usage of the delexicalized phrase * i am sure * seems to be a characteristic feature of the language of Catherine’s acquaintance in Bath, mainly with Mr and Mrs Allen and with the Thorpe family. Their frequent usage of the discourse marker indicates their superficiality. 
This means that the lexis points to characteristics of the novel’s protagonists who, in turn, characterize the place where the protagonist Catherine and the reader meet them. This shows that the phrase * i am sure * contributes to characterizing both the place Bath and its society as superficial. It seems that the author Jane Austen recognized the significance of meaningless phrases for the social life in Bath and, consciously or unconsciously, incorporated them into her description of the location. This conclusion is supported by Mr Tilney’s explicit criticism of linguistic inaccuracies and vagueness, which he utters by mocking other people’s usage of AMAZ* (95) and nice (95f.). The superficial and meaningless use of language is therefore not only demonstrated, but it is also made explicit and criticized in NA. The fact that Henry Tilney himself uses the phrase * i am sure * that does not have any content, adds to the irony of his lecturing on the correct usage of language.
Figure 6.1 * i am sure * [distribution diagram; axis ticks from 10,000 to 80,000]
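A distribution diagram of this kind rests on the positions at which a phrase occurs in the running text. The following sketch shows one way to obtain such positions as token offsets and to compute the share of occurrences that fall before a given point, for instance a cut-off such as the one at about 47,000 words discussed above. The function names and the toy text are hypothetical, and the sketch makes no claim about how the diagram in Figure 6.1 was actually produced.

```python
def phrase_positions(tokens, phrase):
    """Return the token offset of every occurrence of a multi-word phrase."""
    target = phrase.split()
    n = len(target)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target]


def share_before(positions, cutoff):
    """Fraction of occurrences that start before a given token offset."""
    if not positions:
        return 0.0
    return sum(1 for p in positions if p < cutoff) / len(positions)


# Toy text, invented for illustration: three early occurrences, one late.
toy_tokens = ("i am sure you do " * 3 + "filler " * 10 + "i am sure").split()

positions = phrase_positions(toy_tokens, "i am sure")
early_share = share_before(positions, cutoff=20)
```

Plotting the offsets along one axis, for example as tick marks, gives a simple dispersion plot; the share before the cut-off summarizes the same information in one number.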
The delexicalization of the phrase is in accordance with its functionalization. While it loses its inherent meaning, namely expressing certainty about something, it fulfils the function of a pointer to characteristics of the protagonists and the place. In addition, it functions as an emphasizer in the text. The interpretation of * i am sure * as a discourse marker and as a signal for linguistic superficiality is confirmed by further analyses of the lexical and grammatical surroundings of the phrase. In 54 instances, * i am sure * occurs within a statement. In two instances, the phrase occurs within a question. While sure seems to express certainty on the content level of the sentences, the grammatical form of a question indicates uncertainty. This divergence can only be resolved by an analysis of the contents of the two questions,
• (. . .) and the Allens, I am sure, are very kind to you? (38)
• She will never forgive me, I am sure; but, you know, how could I help? (105)
In terms of content, the two questions are really statements taking on the grammatical form of questions. They are rhetorical questions and therefore indirect speech acts, as their answers are anticipated or even predetermined before the question is asked. This shows that the uncertainty expressed by the grammatical form of the question is not part of the meanings expressed by the sentences. When relating these aspects of content to the analysis above, the occurrence of the phrase in statements, that is, statements in terms of content and not of grammar, increases to 100 per cent of the phrase’s occurrences. The grammatical form of the sentences and the content of the phrase, conveying certainty about something, seem to correspond. However, the analysis of concordance lines shows that this is not the case since the certainty expressed by the phrase is cancelled out by
• the collocation of the phrase with negatively connotated and denotated lexis such as dreadful or conceited
• the colligation of the phrase with grammatical negations such as not or
• the colligation of the phrase with conditional forms expressed by the modals could, would, should, may and shall and the conditionals had better not and if I had known which indicate uncertainty or a choice between different possibilities
in nearly four fifths of its occurrences. This calls the certainty expressed by the lexis into question. This incongruence between the content of the phrase and its collocations and colligations is particularly prominent in the 15 instances in which the conditionals or modal verbs are used after the phrase. In these instances, the certainty conveyed by the phrase is cancelled immediately after it is expressed. The following concordance lines are examples of that:

hen he talks of being sick of it, that   I am sure   he should not complain, for it
or the world; you are such a sly thing,   I am sure   you would have made some drol
he is not," said Catherine warmly, "for   I am sure   he could not afford it." "
t allowed to receive a letter from me,   I am sure   I had better not write. There
en why did not you tell me so before?   I am sure   if I had known it to be improp
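Concordance lines of the kind shown above place the search phrase in the centre and cut the co-text to a fixed number of characters on either side, which is why words at the margins appear truncated. A minimal sketch of such a keyword-in-context display is given below. It is an illustration under simple assumptions (character-based windows, case-insensitive matching) and not the concordancer used for this book; the function name is hypothetical, and the sample sentence is taken from the third concordance line above.

```python
import re


def kwic(text, phrase, width=40):
    """Return keyword-in-context lines: left co-text, node phrase, right co-text."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    lines = []
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left co-text so that all node phrases line up.
        lines.append(f"{left:>{width}} {match.group(0)} {right:<{width}}")
    return lines


sample = 'he is not," said Catherine warmly, "for I am sure he could not afford it."'
concordance = kwic(sample, "i am sure", width=20)
```

With a width of 20 characters, the left co-text of the sample begins in the middle of the name Catherine, just as the printed lines begin and end in the middle of words.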
The semantic content and the grammatical context of the phrase contradict each other. This is a further indicator for the delexicalization of the phrase. The following sentences are examples of the collocation and colligation between negatively connotated and denotated lexis and grammatical negations and the phrase:
• but I am sure I cannot. (136)
• yes, I am sure Mrs. Tilney is dead (57)
• I am sure you would be miserable if you thought so! (28)
This is a further indicator for the dominance of grammatical negations and negatively connotated and denotated lexis in the novel as discussed in Chapter 5 Keywords and concordance lines. Taking the semantic and grammatical preferences of the phrase into account, the certainty expressed by the lexis (sure) is not part of the phrase’s meaning in its usage. The certainty is lexically and grammatically negated by the co-occurrence of the phrase with negatively connotated and denotated lexis and its colligation with conditionals and grammatical negations. In some cases, the phrase even expresses the opposite of its lexical meaning, namely uncertainty. This shows the formulaic and delexicalized usage of the phrase and supports previous findings that it is mainly used as a discourse marker which emphasizes statements. This points to the superficiality of public and social life in Bath where the phrase is mainly used. Also, Catherine’s frequent usage of the phrase explains why she appears to be insecure in her public appearances in Bath while linguistically expressing security and confidence in her utterances. The phrase * i am sure * does not carry meaning by itself, but functions as phatic talk in the sense of Malinowski (1953).
6.2.1.2 * i do not *
As with * i am sure *, the first step in this analysis is to determine how common the 4-frames i do not * and * i do not (46) are in literary language contemporary to NA. As mentioned above, neither the 4-frames nor the 4-gram i do not know, the most frequent realization of i do not * in NA, are among the top 100 phrases of ContempLit. This indicates that their frequency in NA and Austen is not a typical feature of literary language of Austen’s time, but a feature of Austen’s idiolect. This finding is unlike that for current contemporary literature, where the phrases i don’t know and i don’t know what have been identified as particularly frequent by Stubbs and Barth (2003), indicating a diachronic change in literary language. When analysing the concordance lines of * i do not *, the semantic preference of the phrase for grammatical negations and negatively connotated and denotated lexis becomes visible. It collocates and colligates with these categories in nearly 50 per cent of its occurrences, that is, 21 times. In a further three instances, words which function as negations in this particular context (but, besides, only) occur immediately before the phrase (L1) within the same sentence. This dominance of negatively connotated and denotated words and grammatical negations is countered by only three explicitly positive expressions (yes, he does, well for us) to the left of the phrase. The following concordance lines are examples of the semantic preference of the phrase:
y a wealthy family?” “No, not very. I do not believe Isabella has any fortun
he is unworthy of you." "Unworthy! I do not suppose he ever thinks of me."
ou will not be able to go, my dear." " I do not quite despair yet. I shall no
t before you knew what you were about. I do not think anything would justify m
happy." "Oh! I know no harm of him; I do not suspect him of pride. I belie
The use of grammatical negations and negatively connotated and denotated lexis to the left of the phrase, meaning that they are uttered before the phrase, announces the subsequent negation not. This frequently makes the not redundant within the sentence which, in turn, indicates that the phrase is conventionalized and delexicalized in its usage. It functions as an emphasizer and makes the negation included in the utterance explicit. Its content is secondary to these functions. The phrase’s semantic preference for negations and negatively connotated lexis is independent of whether the phrase occurs within a sentence or at its
beginning. The latter is the case in about 50 per cent of its occurrences (20) and it occurs 22 times within a sentence. The occurrence of the phrase in a sentence-final position, that is, when the * stands for the last word of a sentence, is rare with only three instances. Its preferred position is one where it is immediately followed by its object. The semantic and syntactic context of * i do not * is therefore largely predetermined. The phrase co-occurs with verbs of perception or mental concepts (44) in nearly 100 per cent of its occurrences, for example with know, despair, like and think which mainly occur at positions R1 or R2 of the phrase. In one instance, the * is realized by the verb say that expresses the speaker’s thoughts in this particular context. The following concordance lines are examples of the phrase’s co-occurrence with verbs of perception and mental concepts and of the syntactic sequence in which they occur:
m her, perhaps never to see her again, I do not feel so very, very much afflict
, there is no great harm done, because I do not think Isabella has any heart t
iendship and no flattery in the case, " I do not like him at all," she directl
ght so!" "No, indeed, I should not. I do not pretend to say that I was not
ell him I beg his pardon -- that is -- I do not know what I ought to say -- bu
looked once at you the whole day?" " I do not say so; but he did not seem in
Only two sentences in the novel are exceptions to this syntactic pattern, namely, when the phrase occurs in sentence-final positions and no verbs follow within the sentence. In one of these sentences, the phrase is followed by a question tag (I do not – ought I?); the second sentence is elliptic (I do not indeed.). This collocation between verbs of perception and not indicates that processes of perception are portrayed as being in a state of uncertainty. This conforms to the novel’s general tendency of using grammatical negations and negatively connotated and denotated lexis which is identified and discussed in some detail in Chapter 5 Keywords and concordance lines (cf. section 5.3 in particular). When the most frequent realizations of verbs of perception that co-occur with * i do not *, KNOW* and THINK*, are analysed for their collocations and colligations, the two lemmata show a semantic preference for grammatical negations and negatively connotated and denotated lexis. Independent of the phrase * i do not *, the lemmata collocate with grammatical negations or negatively connotated or denotated words in NA in more than 50 per cent of their occurrences. The following
concordance lines are examples of this usage:
too, without saying a word! You do not know how vexed I am; I shall have no pl
was not affronted. Perhaps you did not know I had been there." "I was not w
tell me so before? I am sure if I had known it to be improper, I would not ha
erine, you would be quite amazed. You know I never stand upon ceremony with su
such a comfort to us, was not it? You know, you and I were quite forlorn at f
and to oblige such a friend -- I shall think you quite unkind, if you still re
I could not sleep a wink all night for thinking of it. Oh! Catherine, the ma
d now, after such behaviour, we cannot think at all well of her. Just at prese
d, "and for all our intimacy! She must think me an idiot, or she could not hav
essed, but always steady. "I did not think you had been so obstinate, Catheri
This indicates that the phrase * i do not * with its grammatical negation and the verbs occurring to its right co-influence each other and their semantic contexts. In order to research this hypothesis further, the phrase * she could not * (28) is analysed below. Its realizations * she could not and she could not * rank 18th and 19th on the list of 4-frames of NA, and, after * i do not *, they are the most frequent phrases on the list that include a grammatical negation. This grammatical negation is the object of the following analysis.
6.2.1.3 * she could not *
The analysis of the concordance lines of * she could not * (28) shows that the phrase co-occurs with a verb of perception or with mental concepts to its right in 18 cases. The following concordance lines are examples of that:
he should engage her again; for though she could not, dared not expect that Mr.
re; till she had spoken to Miss Tilney she could not be at ease; and quickenin
Allen, between whom she now remained. She could not help being vexed at the n
ess fancies and injurious examinations, she could not wonder at any degree of h
ual in degree, however unlike in kind. She could not think the Tilneys had act
Unlike * i do not *, there are also 14 mental concepts or words of perception to the left of * she could not *. With * i do not *, expressions describing mental processes or perceptions are restricted to its right. The number of verbs, adjectives and nouns describing perceptions and mental processes which co-occur with * she could not * is only slightly lower (23) than that of * i do not *. * she could not * collocates in about four fifths of its occurrences (82 per cent) with mental concepts and words of perception
while * i do not * collocates in nearly 100 per cent of its occurrences with this category. This shows that there is a strong semantic preference between the phrases, their grammatical negation and descriptions of mental and perceptive processes in NA. In more than 50 per cent of its occurrences (15), the negative message of * she could not *, expressed by the grammatical negation, is prepared for by negating devices to its left. This is illustrated by the following concordance lines:
l? She must be an unprincipled one, or she could not have used your brother so
ess fancies and injurious examinations, she could not wonder at any degree of h
ual in degree, however unlike in kind. She could not think the Tilneys had act
a heavy and troublesome business, and she could not easily forget its having s
e added some bitter emotions of shame. She could not be mistaken as to the roo
In 15 cases, there is either a grammatical negation, a negating morpheme or a negatively connotated and denotated word to the left of the phrase, for example, not, displease and murderer. Their occurrence is also frequent as collocates and colligates of * i do not *. This indicates that grammatical negations and negatively connotated lexis in general frequently collocate and colligate with negating phrases in NA. They seem to be primed to do so (cf. Hoey 2005 on lexical priming).
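The proportion of occurrences in which a negating item precedes a phrase can be approximated automatically. The Python sketch below is a rough illustration under several assumptions: the word list NEGATORS, the window of five tokens and the simple tokenizer are all invented for the example, and the window ignores sentence boundaries. The category of negatively connotated and denotated lexis used in this chapter depends on reading each concordance line in context and cannot be reproduced by a word list of this kind.

```python
import re

# Illustrative list of negating items; an assumption made for this sketch.
NEGATORS = {"no", "not", "never", "nor", "neither", "nothing", "none", "cannot"}

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def left_negation_share(tokens, phrase, window=5):
    """Return (hits, total): the number of occurrences of `phrase` and how
    many of them have a negating item within `window` tokens to the left."""
    n = len(phrase)
    hits = total = 0
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            total += 1
            left = tokens[max(0, i - window):i]
            if any(tok in NEGATORS for tok in left):
                hits += 1
    return hits, total

# Two fragments taken from the concordance lines of * i do not * above.
text = ("Oh! I know no harm of him; I do not suspect him of pride. "
        "before you knew what you were about. I do not think anything")

hits, total = left_negation_share(tokenize(text), ["i", "do", "not"])
print(hits, total)  # 1 2
```

With the figures reported above for * she could not *, 15 occurrences with a negating device to the left out of 28 occurrences correspond to 15/28, or roughly 54 per cent.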
6.2.2 Implications
The analysis of * she could not * has confirmed the pattern of a collocation and colligation between grammatical negations and negatively connotated and denotated lexis and words describing perceptions and mental processes in the frequent phrases of NA. Furthermore, common patterns of NA’s frequent phrases that include not are their collocation with negatively connotated and denotated lexis and their colligation with grammatical negations. These lexical and grammatical patterns seem to be primed to co-occur in NA. This indicates that the patterns are also likely to occur with other, less frequent phrases of the novel which include grammatical negations. The co-occurrence of negatively connotated and denotated words and grammatical negations with or as parts of frequent phrases is not surprising in the overall context of the novel as shown in the excursus on grammatical negations in NA (cf. section 5.3). It is also noticeable when reading the novel intuitively, for example in the novel’s first four sentences which
characterize Catherine as an anti-heroine (cf. Chapter 5 Keywords and concordance lines):

No one who had ever seen Catherine Morland in her infancy would have supposed her born to be an heroine. Her situation in life, the character of her father and mother, her own person and disposition, were all equally against her. Her father was a clergyman, without being neglected, or poor, and a very respectable man, though his name was Richard – and he had never been handsome. He had a considerable independence besides two good livings – and he was not in the least addicted to locking up his daughters. (1, emphases added)

Already the very first sentences of the novel include an average of two negatively connotated and denotated words or grammatical negations per sentence. This tendency is continued throughout the novel, so that also the last sentence of the novel includes a large number of negatively connotated and denotated lexis, even though its message is positive:

To begin perfect happiness at the respective ages of twenty-six and eighteen is to do pretty well; and professing myself moreover convinced that the general’s unjust interference, so far from being really injurious to their felicity, was perhaps rather conducive to it, by improving their knowledge of each other, and adding strength to their attachment, I leave it to be settled, by whomsoever it may concern, whether the tendency of this work be altogether to recommend parental tyranny, or reward filial disobedience. (235, emphases added)

Despite the fact that this very high frequency of grammatical negations and negatively connotated and denotated lexis is not constant throughout the novel, corpus linguistic analyses confirm that descriptions in the novel frequently focus on objects or issues that do NOT exist or that are NOT present (cf. section 5.3 for a discussion of grammatical negations in NA).
This emphasis on the negative points to the readers’ and Catherine’s expectations that are based on schemata of Gothic novels (cf. Chapter 3 Language and meaning on schema theory). These schemata are negated in the novel.
6.2.3 Evaluation
The analyses of frequent phrases in NA have generated a number of insights into the text. These are functional, lexical, grammatical and literary in nature.
From a functional point of view, the analyses have shown that seemingly lexicalized phrases are in fact delexicalized in NA and that they fulfil conventionalized functions. Their lexical content does not influence their usage; they thus become discourse markers which help to structure the text. They direct the reader’s attention to important parts of a sentence.
From a lexical point of view, the analyses have shown a semantic preference of phrases which include not for negatively connotated and denotated lexis and for further grammatical negations. Furthermore, expressions describing perceptions or mental concepts are frequently close to the negations of the phrases. In fact, the analyses have shown a general preference of frequent verbs of perception or mental processes, exemplified by analyses of THINK* and KNOW*, for grammatical negations and negatively connotated and denotated lexis in NA. The lexis seems to be primed to frequently occur in these patterns.
From a grammatical point of view, the largely predetermined syntactic positions of collocations and colligations in relation to the phrases are especially noticeable. Mobility of parts of sentences and lexis close to the phrases is often strongly restricted. This is a form of grammatical priming.
From a literary point of view, the analyses have shown, for example, that Bath is characterized as a superficial place through the characters’ frequent usage of discourse markers. This has not been discussed in literary critical sources. Moreover, it has also been shown that Catherine’s insecurity in public situations is, at least in part, conveyed to the readers by her frequent usage of discourse markers. The results of these analyses provide a more detailed picture of the novel’s plot than could be gained without them. In particular, identifying the dominance of grammatical negations and negatively connotated and denotated lexis contributes to this picture.
The analyses also confirm that ‘there are no negatives in nature but only in the human consciousness’ (Watt 1960: 259) as already quoted in Chapter 5 Keywords and concordance lines. It is people who experience something as either positive or negative; the actual experience itself is neutral. As demonstrated above, a negative perception of events is linguistically manifest when an expected positive event does not occur or when expectations of either protagonists or readers are disappointed (cf. Chapter 5 Keywords and concordance lines and the quote from page 1 above for expectations that are disappointed in the novel). Judging by the frequency of this lexis in NA, such expectations seem to be frequently disappointed in the text. The phrases also hint at a further textual focus in NA, namely Catherine’s emotions, as she is the most frequent user of the phrases. They show
Catherine’s insecurity concerning her emotions, since her feelings regarding friendship, falling in love and loyalty towards her family are in turmoil. The high frequency of the phrase * i am sure * is her attempt at demonstrating self-confidence and it stands in contrast to her real insecurity. As the self-confidence expressed by the phrase is also negated by further linguistic features, such as collocations and colligations (as shown earlier), her real insecurity and that of other characters who use the phrase is in fact emphasized. This is shown by the analysis of the semantic and grammatical contexts of the phrase. The phrase * i am sure * is delexicalized in its usage and it does not express any content. The fact that it is mainly used in that part of the novel which is set at Bath points to the superficiality of social relationships in this location. It is a characteristic feature of the description and characterization of Bath in the novel. These semantic and functional patterns within the text cannot be identified consciously by readers. Nevertheless, they contribute to the readers’ perception of events and of implicit plot lines in the text, as is the case with Catherine’s insecurity. They help to implicitly convey meaning. Consequently, phraseological analyses contribute to matching connotations and implications of a text with its linguistic patterns. The analysis of frequent word strings therefore allows a detailed analysis of the content of the text which is based on linguistic data.
6.3 The corpus Austen: data and analysis
In the second part of this chapter, the most frequent 4-grams and 4-frames of Austen and ContempLit are analysed in the same way as those of NA. It starts with the analysis of the phrases of Austen. The first goal of this analysis is to gain information on both the contents of the corpus and its structure. The second goal is to gain information on Austen’s style of writing. To reach the latter goal, the data extracted for NA is compared to that of Austen. This comparison allows the novel to be positioned within the context of the author’s oeuvre, that is, within her general style of writing. This contributes to answering the question raised in Chapters 5 Keywords and concordance lines and 7 Text segmentation, as to whether NA differs in its style of writing from Jane Austen’s other novels. Afterwards, comparisons between Austen and ContempLit and between NA and ContempLit are made, so that the novel and the author’s oeuvre are placed within the context of their contemporary literature. The analyses focus on identifying individual features of Austen’s language.
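How lists of frequent 4-grams and 4-frames can be generated is sketched below in Python. This is not the software used for this study. The sketch assumes, as the lists in this chapter suggest, that a 4-gram is a sequence of four consecutive word tokens and that a 4-frame is a 4-gram in which one position is left open (*), so that the frequency of a frame is the summed frequency of all 4-grams that realize it. The tokenizer, the decision to ignore sentence boundaries and all function names are assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into word tokens; punctuation is dropped."""
    return re.findall(r"[a-z]+", text.lower())

def ngrams(tokens, n=4):
    """Yield every sequence of n consecutive tokens."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def frames(gram):
    """Yield the frames of one n-gram: each position in turn replaced by '*'."""
    for slot in range(len(gram)):
        yield gram[:slot] + ("*",) + gram[slot + 1:]

def count_grams_and_frames(text, n=4):
    """Return two Counters: n-gram frequencies and frame frequencies."""
    gram_counts = Counter(ngrams(tokenize(text), n))
    frame_counts = Counter()
    for gram, freq in gram_counts.items():
        for frame in frames(gram):
            frame_counts[frame] += freq
    return gram_counts, frame_counts

# Invented sample sentences, for illustration only.
text = "I do not know. I do not think so. I do not know what to say."
grams, frms = count_grams_and_frames(text)

print(grams[("i", "do", "not", "know")])  # 2
print(frms[("i", "do", "not", "*")])      # 3
print(frms.most_common(3))
```

Because the frame i do not * is realized here by both i do not know and i do not think, its frequency is higher than that of either 4-gram. In a full-length text almost every occurrence of a three-word sequence has a word on both sides, which is why a frame with an initial * and the corresponding frame with a final * tend to receive the same frequency.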
The 25 most frequent 4-grams of Austen are:
1. i do not know (133) [184]
2. i am sure i (89) [123]
3. the rest of the (82) [113]
4. a great deal of (80) [110]
5. in the course of (71) [98]
6. at the same time (67) [92]
7. i am sure you (60) [83]
8. i do not think (55) [76]
9. and i am sure (54) [75]
10. it would have been (49) [68]
11. for the sake of (46) [63]
12. as soon as she (44) [61]
13. it would be a (39) [54]
14. at the end of (38) [53]
15. out of the room (38) [53]
16. was not to be (38) [53]
17. as soon as they (34) [47]
18. i am sure it (34) [47]
19. i am sure she (34) [47]
20. she could not help (34) [47]
21. that she could not (34) [47]
22. put an end to (33) [46]
23. quarter of an hour (33) [46]
24. for a few minutes (32) [44]
25. it could not be (32) [44]
The 25 most frequent 4-frames of Austen are:
1. the * of the (1040) [1435]
2. in the * of (591) [816]
3. the * of her (542) [748]
4. the * of a (418) [577]
5. i do not * (418) [577]
6. * i do not (418) [577]
7. i am sure * (415) [573]
8. * i am sure (415) [573]
9. the * of his (401) [553]
10. to the * of (374) [516]
11. * she could not (332) [458]
12. she could not * (332) [458]
13. of the * and (303) [418]
14. for the * of (285) [393]
15. and the * of (274) [378]
16. * it would be (258) [356]
17. it would be * (258) [356]
18. by the * of (252) [348]
19. * in the world (246) [340]
20. in the world * (246) [340]
21. as soon as * (236) [326]
22. * as soon as (236) [326]
23. all the * of (229) [316]
24. a great deal * (214) [295]
25. * a great deal (214) [295]
(cf. the explanation in section 6.2 on doubles on the list of 4-frames.)
When comparing the data extracted from Austen with that extracted from NA, the lists correspond in content and form in a number of aspects (cf. Tables 6.4 and 6.5).
Table 6.4
Correspondences Austen – NA, 4-grams

Austen 4-grams | NA 4-grams
Fourteen phrases include personal pronouns, seven of them include I | Ten phrases include personal pronouns, eight of them include I
Eleven phrases express temporal, spatial and/or quantitative/qualitative relationships | Nine phrases express temporal, spatial and/or quantitative/qualitative relationships
Six phrases include a verb of perception, five phrases include sure | Five phrases include a verb of perception, five phrases include sure
Six phrases include not | Four phrases include not
Five phrases include the 3-gram i am sure | Four phrases include the 3-gram i am sure
Table 6.5
Correspondences Austen – NA, 4-frames

Austen 4-frames | NA 4-frames
Fifteen frames express temporal, spatial and/or quantitative/qualitative relationships | Twelve frames express temporal, spatial and/or quantitative/qualitative relationships
Ten frames include personal pronouns, three of them are female, I occurs four times | Fourteen frames include personal pronouns, three of them are female, I occurs eight times
Nine frames include the * of | Eleven frames include the * of
Four frames include not | Six frames include not
Two frames include sure | Two frames include sure
A comparison of the lists shows that 10 of the 4-grams and 20 of the 4-frames of Austen also occur on the lists of NA, so that nearly 50 per cent of the lists of 4-grams and about 80 per cent of the lists of 4-frames overlap between the two sets of data. This shows great linguistic, and in particular phraseological, similarities between the data. Similarly to NA, also the lists for Austen include a large number of phrases which function as prepositions to express temporal, spatial and/or quantitative/qualitative relationships. They occur among both 4-grams and 4-frames, most frequently as variants of the 3-frame the * of. These phrases express relationships between people and to events so that they are necessary constituents of a narrative. It is the description of relationships and events which creates a multidimensional story. The occurrence of the phrases on the lists of NA, Austen and ContempLit, as shown in section 6.4, is therefore not surprising. Expressing and describing relationships between people and relationships between people and events in the novel is also manifest in the frequent occurrence of personal pronouns and words of perception or mental concepts that occur within the phrases. They create a personalized perspective on the events of the texts and therefore on the protagonists’ emotions. Words denoting perceptions and mental concepts are particularly prominent on the list of 4-grams. The dominance of female protagonists in Austen’s novels can be seen from the dominance of female personal pronouns in the lists. Furthermore, the dominance of I indicates that conversations between characters and references to the speakers themselves in the conversations are important structural elements of the corpus. This is another common feature of NA and Austen as female personal pronouns and the use of I are dominant among the 4-grams and 4-frames of both sets of data. 
Also, the phrase * i am sure * occurs on the lists of 4-grams and 4-frames of Austen and NA, and its usage as a discourse marker is similar in the two sets of data. In most of its occurrences in Austen and NA, the phrase structures the text and directs the readers’ attention to the most important content of an utterance. The only prominent difference between the usage of frequent phrases in Austen and NA is that of phrases which include a grammatical negation. Even though both sets of data show the phrases’ semantic preference for negatively connotated or denotated lexis or for further grammatical negations, the mobility of these expressions in Austen is not as limited as in NA. Nonetheless, the phrases’ semantic and syntactic preferences are the same in both sets of data.
The analysis above has shown that the contents and narrative structures of the text and the corpus that are manifest in their frequent phrases correspond. There are no significant differences between the two sets of data. Therefore, the first result of this phraseological comparison between NA and Austen is that NA conforms to the phraseological norms of Austen’s other novels. There are no differences in content or discursive and linguistic structures that can be established by an analysis of their most frequent phrases. This indicates that Austen is a homogeneous corpus and supports this same conclusion with regard to the content of the corpus from Chapter 5 Keywords and concordance lines. Analyses of the lexical homogeneity of Austen are presented in Chapter 7 Text segmentation.
6.4 The corpus ContempLit: data and analysis
In the following, the phraseological data of NA and Austen are compared to that of ContempLit. This allows us to determine the degree of similarity or dissimilarity between the phraseological norms in Austen’s oeuvre and literature of her contemporaries. The 26 most frequent 4-grams of ContempLit are:
1. at the same time (323) [74]
2. the rest of the (154) [35]
3. said my uncle toby (148) [34]
4. in the midst of (143) [33]
5. in the course of (131) [30]
6. a great deal of (115) [26]
7. as soon as he (111) [25]
8. at the end of (111) [25]
9. an please your honour (108) [25]
10. quoth my uncle toby (108) [25]
11. for the first time (105) [24]
12. for the sake of (105) [24]
13. the end of the (105) [24]
14. as if he had (102) [23]
15. out of the room (97) [22]
16. one of the most (89) [20]
17. as soon as the (84) [19]
18. to say the truth (82) [19]
19. with an air of (82) [19]
20. by the side of (80) [18]
21. that i could not (79) [18]
22. for my part i (77) [18]
23. i am sure i (77) [18]
24. to be sure i (76) [17]
25. dear father and mother (75) [17]
26. in the world and (75) [17]
The 25 most frequent 4-frames of ContempLit are:
1. the * of the (6365) [1457]
2. in the * of (2442) [559]
3. the * of his (1928) [441]
4. the * of a (1805) [413]
5. to the * of (1784) [408]
6. of the * and (1607) [368]
7. of the * of (1313) [301]
8. at the * of (1251) [286]
9. the * of my (1193) [273]
10. the * of her (1078) [247]
11. the * and the (1070) [245]
12. by the * of (1044) [239]
13. for the * of (1009) [231]
14. with the * of (941) [215]
15. in the * and (931) [213]
16. and the * of (928) [212]
17. * my uncle toby (823) [188]
18. my uncle toby * (823) [188]
19. to the * and (809) [185]
20. on the * of (777) [178]
21. * one of the (767) [176]
22. one of the * (746) [171]
23. * out of the (746) [171]
24. out of the * (745) [171]
25. from the * of (711) [163]
The phrases said my uncle toby, quoth my uncle toby, my uncle toby *, * my uncle toby and an please your honour only occur in the novel The Life and Opinions of Tristram Shandy, Gentleman by Laurence Sterne. The phrase dear father and mother only occurs in Samuel Richardson’s Pamela or Virtue Rewarded.
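Whether a frequent phrase is confined to a single text can be checked by counting, next to its frequency, the number of texts in which it occurs at least once. The following Python sketch illustrates such a check. It is not the procedure of the software used here; the function names, the tokenizer and the toy corpus with its invented titles and sentences are assumptions made purely for illustration.

```python
import re
from collections import Counter

def four_grams(text):
    """Return all 4-grams of a text as space-joined lower-case strings."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [" ".join(tokens[i:i + 4]) for i in range(len(tokens) - 3)]

def frequency_and_range(corpus):
    """corpus maps a text title to its full text.

    Returns two Counters: the total frequency of each 4-gram in the corpus
    and the number of texts in which it occurs at least once.
    """
    freq, text_range = Counter(), Counter()
    for text in corpus.values():
        grams = four_grams(text)
        freq.update(grams)             # every occurrence counts
        text_range.update(set(grams))  # each text counts once per 4-gram
    return freq, text_range

# Toy corpus with invented titles and sentences.
corpus = {
    "text_a": "Said my uncle Toby. At the same time, said my uncle Toby again.",
    "text_b": "At the same time she left the room.",
}
freq, text_range = frequency_and_range(corpus)

# Repeated 4-grams that are nevertheless restricted to one text.
single_text = sorted(g for g in freq if freq[g] > 1 and text_range[g] == 1)
print(single_text)  # ['said my uncle toby']
```

In the toy corpus, said my uncle toby and at the same time both occur twice, but only the latter is spread over both texts.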
These singular occurrences of 4-grams and 4-frames show that it is not only important in an analysis to take the number of occurrences of a phrase into account, but that also their range within a corpus is important. When a phrase occurs in several texts of a corpus, its significance for an analysis is much higher than that of a phrase which occurs in only one or in very few texts. Phrases that occur in only one text do not have any significance for the corpus as a whole, even if they are among the most frequent phrases of the complete corpus. Nonetheless, since the phrases listed above have been identified by the software as some of the most frequent phrases of ContempLit, they cannot be excluded from the lists. This is the case despite their low significance for the corpus. Editing the lists manually would be a subjective interference in the data that would counter the standards set for generating the data electronically and refraining from all manual editing. This mirrors the problem of doubles on the lists of p-frames (cf. section 6.2). Nevertheless, phrases that occur in only one text cannot be included in the analysis for the complete corpus, as they would skew results for the corpus. Consequently, they are identified on the lists of phrases, so that the recipients of the analysis understand the reasons for their inclusion in the lists and their exclusion from the analysis. As the first step of the analysis, the lists of 4-grams and 4-frames of ContempLit and NA and of ContempLit and Austen are compared. These comparisons identify similarities between the lists, mainly with regard to temporal, spatial and/or quantitative/qualitative relationships. These phrases are text-independent and function as prepositions in the data. 
The following six 4-grams occur on the lists of ContempLit and NA: i am sure i, in the course of, a great deal of, the rest of the, at the end of, for the first time
The following twelve 4-frames occur on the lists of ContempLit and NA: the * of the, the * of her, in the * of, the * of a, for the * of, the * of his, by the * of, of the * and, to the * of, and the * of, on the * of, at the * of
The following eight 4-grams occur on the lists of ContempLit and Austen: i am sure i, the rest of the, a great deal of, in the course of, at the same time, for the sake of, at the end of, out of the room
The following ten 4-frames occur on the lists of ContempLit and Austen: the * of the, in the * of, the * of her, the * of a, the * of his, to the * of, of the * and, for the * of, and the * of, by the * of
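A comparison of this kind amounts to intersecting two frequency lists. The Python sketch below does this for the 4-grams of Austen and ContempLit, with both lists transcribed from sections 6.3 and 6.4; the function name shared is an assumption made for the example and does not refer to the software used for the study.

```python
# The 25 most frequent 4-grams of Austen (section 6.3).
austen_4grams = [
    "i do not know", "i am sure i", "the rest of the", "a great deal of",
    "in the course of", "at the same time", "i am sure you", "i do not think",
    "and i am sure", "it would have been", "for the sake of", "as soon as she",
    "it would be a", "at the end of", "out of the room", "was not to be",
    "as soon as they", "i am sure it", "i am sure she", "she could not help",
    "that she could not", "put an end to", "quarter of an hour",
    "for a few minutes", "it could not be",
]

# The 26 most frequent 4-grams of ContempLit (section 6.4).
contemplit_4grams = [
    "at the same time", "the rest of the", "said my uncle toby",
    "in the midst of", "in the course of", "a great deal of", "as soon as he",
    "at the end of", "an please your honour", "quoth my uncle toby",
    "for the first time", "for the sake of", "the end of the", "as if he had",
    "out of the room", "one of the most", "as soon as the", "to say the truth",
    "with an air of", "by the side of", "that i could not", "for my part i",
    "i am sure i", "to be sure i", "dear father and mother", "in the world and",
]

def shared(list_a, list_b):
    """Return the items of list_a that also occur in list_b, in list_a order."""
    members = set(list_b)
    return [item for item in list_a if item in members]

common = shared(austen_4grams, contemplit_4grams)
print(len(common))  # 8
for phrase in common:
    print(phrase)
```

The eight shared 4-grams correspond to those listed above for ContempLit and Austen. Only exact matches are counted, so that she could not and that i could not, for example, are treated as different phrases.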
The results from the comparisons show that the 4-gram i am sure i is the only seemingly lexicalized phrase on the lists. However, as has been demonstrated above for other sets of data, the phrase functions as a discourse marker and is, in fact, delexicalized in its usage. This usage was identified for NA (cf. section 6.2.1.1) and is also dominant in Austen and ContempLit. This shows that it was a typical linguistic feature of fiction in Austen’s time. One feature that distinguishes NA and Austen from ContempLit concerns the personal pronouns that occur on the lists of frequent phrases. Personal pronouns on the lists of NA and Austen are mostly female pronouns or the first person singular. ContempLit differs from this pattern as the numbers of masculine, feminine and neutral (I) pronouns on the lists of the corpus’ most frequent phrases are equal. This does not permit any conclusions to be drawn on the contents and the protagonists of the novels included in ContempLit in the way that is possible for NA and Austen. The focus on female protagonists in NA and Austen, which is mirrored by the dominance of female personal pronouns on the lists of the data’s most frequent phrases, is not manifest in the phrases of ContempLit. This formal criterion allows conclusions to be drawn on dominant features of the contents and structures of Austen’s novels that are distinct from those of other novels of her time. A further individual linguistic feature of Jane Austen’s novels is the high frequency of grammatical negations that is manifest on the lists of 4-grams and 4-frames of NA (4/5) and Austen (6/4). In contrast, there is only one grammatical negation on the list of 4-grams of ContempLit. This shows that this high frequency of grammatical negations is not a general feature of 18th and 19th century fiction, but rather that it is specific to the author Jane Austen.
Generally speaking, the most frequent 4-grams and 4-frames of ContempLit are text-independent phrases that frequently express temporal, spatial and/or quantitative/qualitative relationships. This distinguishes the corpus significantly from NA and Austen. Phrases that are used as prepositions are much less frequent on the lists of the latter data than on the lists of ContempLit. This distinguishing feature between the data from Jane Austen and ContempLit is due to the different natures of the three sets of data. NA is a single, homogeneous text (cf. Chapter 7 Text segmentation). Its content and linguistic structure are mirrored in its 4-grams and 4-frames. Additionally, Austen is a homogeneous corpus (cf. section 6.3 and Chapter 7 Text segmentation). In contrast, ContempLit is a compilation of texts that were written neither by the same author nor do they treat similar topics. The corpus is heterogeneous (cf. Chapter 7 Text segmentation). This entails more variation
of lexical words between the different texts included in the corpus than in a single text or a homogeneous corpus. The most frequent phrases of ContempLit are text-independent and indicate general tendencies in literary language, for example descriptions of temporal sequences as characteristic features of narratives. These text-independent features of the phrases identified for ContempLit closely resemble the structures of the most frequent 4-grams and 4-frames of the BNC (identified by Stubbs 2006), that is, of a general language corpus. Diachronic and genre-specific differences between the corpora do not seem to be relevant here. The analyses have shown that NA and Austen closely resemble each other in terms of their phraseologies and that they are distinct from ContempLit. While the analysis of the first two sets of data identifies author-specific linguistic features, the latter identifies general phraseological tendencies in literary language at Austen’s time. These tendencies differ in part from those of Jane Austen’s works.
6.5 Concluding comments
The analyses in this chapter have provided further evidence for phrases being units of meaning in a text and in language (cf. section 6.1 and Chapter 3 Language and meaning on units of meaning in language). They structure and organize discourse and contribute to encoding meaning in language. This means that they contribute to the encoding of the contents of a text or corpus and therefore carry meaning in language. This has been demonstrated by analyses of the text NA and of the corpora Austen and ContempLit. The comparison of the different analyses has shown that the results of text and corpus analyses largely depend on the data that is analysed and that the corpus compilation is a deciding factor in the analyses. The analysis of a homogeneous corpus, such as Austen, produces results that resemble those of a text analysis: it provides information on the corpus’ discourse structure and organization and on its contents. Its most frequent phrases are often data-specific. The analysis of a heterogeneous corpus such as ContempLit, on the other hand, produces results that identify discursive features of the language represented by the corpus. Insights into the contents of the corpus are only rarely gained, since its most frequent phrases often mirror features of general language use. The two analytic approaches resemble each other insofar as they both offer insight into texts and corpora that could not be gained without these analytic techniques. They helped, for example, to discover that the most
frequent phrases of NA and Austen are frequently delexicalized in their usage and that they function as discourse markers. This is a linguistic explanation for why readers of NA intuitively perceive the society in Bath as portrayed by Jane Austen as superficial. When analysing the novel’s most frequent phrases, which are frequently used in that part of the novel set at Bath, one notices that they occur in spoken language and that they in fact do not have any content as they are frequently used as discourse markers in the text. They can therefore be interpreted as signs of superficiality in the novel. Also, Catherine’s usage of the phrases explains why she appears insecure in her public appearances in Bath. This is linguistic evidence for intuitive perceptions which are made more objective by these means. They are now based on the interpretation of objectively occurring linguistic patterns. The analysis of the phrases has therefore resulted in both literary and structural insight into the data. However, not only insight into the content of texts and corpora can be gained by phraseological analyses. The comparison between Jane Austen’s language and that of contemporary texts has also shown some of the author’s individual linguistic tendencies that either do not occur at all or occur less frequently in general literary language of her time. This enables conclusions to be drawn about features of Austen’s idiolect that is manifest in the phrases she uses most frequently. Based on the assumption of a correlation between frequency and significance of linguistic features in language data (cf. Chapter 2 Goals, techniques, principles), the features identified above are characteristic for Austen’s idiolect. The frequency with which these features occur contributes to the cohesion and coherence of texts and homogeneous corpora. The phrases create a connection within NA and Austen and between the two sets of data as they encode meaning in a consistent form. 
This consistent form makes the meaning noticeable for the readers and contributes to the coherence of every individual text and of Austen’s complete oeuvre. The similarities in the contents of Austen’s novels perceived by the readers are supported and strengthened by linguistic similarities between the texts. Consequently, features of the contents can be decoded by analysing phraseological structures. This has been demonstrated above. Austen’s linguistic continuity becomes a feature of cohesion and coherence across her novels. While the continuity of linguistic style is usually only intuitively perceived by readers, it can now, at least partly, be analysed and identified by phraseological analyses. This provides quantitative and qualitative evidence for the readers’ perception of her novels.
Chapter 7
Text segmentation
The linguistic means for segmenting texts into their constituent parts have hardly been researched in linguistics. This is despite the high relevance of text segmentation for understanding the structure of a text, that is, for gaining a deeper understanding of the data as such. Identifying the constituent parts of a text is useful, for example, for gaining knowledge about the progression of a text which, in turn, contributes to its interpretation. This is demonstrated in the following by segmenting the text NA and the corpus Austen into their constituent parts on the basis of lexical features of the data. The criterion for a text part is its homogeneity in terms of content, indicated by its lexical homogeneity, which is manifest in the frequent occurrence of lemmata from a common semantic field. The result of this analysis is a segmentation of the novel into its constituent parts. This segmentation is an alternative to the literary critical segmentation, and it is based on lexical features as opposed to features of content of the text as in the literary critical segmentation of NA. The basis for the segmentations presented later in this chapter are findings from Chapter 5 Keywords and concordance lines. The data that was found to be relevant for the meanings of the text and the corpus, namely keywords, is used as the basis for the following analyses. As a first step, the keywords’ distribution across NA is analysed. As a second step, the distribution of the text’s keywords is used to segment the text into its parts. As a third step, this segmentation is compared to the points in the text where a large number of new lexis is introduced into it, since the introduction of new lexis is a signal for the introduction of a new topic into a text. This comparison results in the final linguistic segmentation of NA. Following these analyses, the corpus Austen is analysed in the same way. 
The chapter is completed by an analysis of the lexical homogeneity and heterogeneity of the corpora Austen and ContempLit. This continues the discussion of the phraseological homogeneity between the corpora and NA in Chapter 6 Phraseology and the discussion of Austen’s homogeneity in terms of content in Chapter 5 Keywords and concordance lines.
A distribution analysis of lexis identifies where specified word forms or lemmata are used within a text or corpus. To do so, the software Word Distribution (WD, Barth 2002) identifies the numerical positions of a word within a text or corpus and orders the numbers from low to high. As a second step, the numbers are read into MS EXCEL, which generates a flow chart to graphically display the distribution of the lexis across the data. When enlarging the scope of the analysis to several words from one semantic field, it is possible to trace and graphically present the occurrence of a specific topic across the data. Grouping words into semantic fields is the linguist’s task. The basic assumptions of this form of text segmentation are (1) that dominant semantic fields on lists of keywords indicate thematic foci of the text or corpus that is analysed and (2) that the data consists of different parts that are defined by a common content. Since content and lexis correspond, analysing the distribution of lexis allows the analysis of the content structure of the data. This is also emphasized by Morris and Hirst, who state that ‘[w]hen a unit of text is about the same thing there is a strong tendency for semantically related words to be used within that unit’ (1991: 35). This is confirmed by Phillips, who says that ‘the organisation of written text can be fully understood only if the level of lexis is taken into account in any description. (. . .) text must be recognised as a category related to that of lexis’ (1989: 3).
The reasons for a segmentation depend on the data that is segmented. A text might, for example, be segmented into its different parts in order to analyse the progression of its content, while a corpus might be analysed to determine whether single words occur throughout the complete corpus or whether they are restricted to single texts. This shows the distribution of a specific word in the language variety represented by the corpus.
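The distribution analysis described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reimplementation of the WD software: the tokenizer, the function names and the miniature sample text are hypothetical, and the asterisk in patterns such as BATH* is read as a simple wildcard. Only the five place-name patterns are taken from section 7.2.

```python
import re
from fnmatch import fnmatchcase

# Place-name patterns as listed in section 7.2; '*' is treated as a wildcard.
PLACE_NAMES = ["FULLERTON*", "NORTHANGER*", "WOODSTON*", "ABBEY*", "BATH*"]

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def positions(tokens, patterns):
    """Return the 1-based running-word positions, in ascending order,
    of every token that matches at least one wildcard pattern."""
    pats = [p.lower() for p in patterns]
    return [i for i, tok in enumerate(tokens, start=1)
            if any(fnmatchcase(tok, p) for p in pats)]

def binned_counts(pos, text_length, bin_size):
    """Count hits per consecutive stretch of bin_size running words,
    which is the information a distribution diagram displays."""
    n_bins = (text_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for p in pos:
        counts[(p - 1) // bin_size] += 1
    return counts

# Hypothetical miniature text, used only to show the mechanics.
sample = ("Catherine left Fullerton for Bath. In Bath she read novels. "
          "The abbey at Northanger awaited.")
tokens = tokenize(sample)
hits = positions(tokens, PLACE_NAMES)
print(hits)
print(binned_counts(hits, len(tokens), 5))
```

A wildcard such as BATH* would also match forms like bathing, so in practice the matched word forms need to be checked by hand before they are grouped under a semantic field.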
A further research question for the analysis of a corpus might be whether it is possible to identify the transitions of the texts within a corpus by analysing its thematic units as they are manifest in the corpus’ lexis. This knowledge on thematic units within a corpus is knowledge regarding its structure. When looking at a text, knowledge of its thematic units or parts and of the progression of the text enables a better understanding of its argument structure. This, in turn, leads to a more in-depth interpretation of the text than would otherwise be possible. In the following, first the text NA is analysed for its constituent parts. The goal of the analysis is to segment the text on the basis of its lexis and to identify its thematic units, that is, its constituent parts. Second, the corpus Austen is analysed in the same way as NA. The goal of the analysis is to identify
the different texts that make up the corpus and the points of transition between the texts. This is followed by an analysis of the lexical homogeneity of the corpora Austen and ContempLit. The choice of lexis as a basis for the analysis is due to its function as cohesive links and therefore as a basis for the coherence of texts. Lexical cohesion means that lexical patterns within a text are intratextual references that create a formal connection between different units that constitute a text, as between sentences or text parts. It is the recurrent occurrence of linguistic features which makes text parts linguistic units. It is therefore likely that the text parts are also cohesive within themselves and that this cohesion is created by the occurrence of several words from one semantic field or by the occurrence of several realizations of one lemma. Apart from cohesion, this also creates coherence within a text. This is demonstrated by the analyses later in this chapter. Furthermore, the analyses show that the end of one text part is marked by the end of the occurrence of words from a previously dominant semantic field. This is a partial disruption of the cohesion of the text.
7.1 Cohesion and coherence
Cohesion describes formal features of grammar or lexis that create textual unity. According to Halliday and Hasan (1976), there are six main cohesive devices, namely reference, substitution, ellipsis, conjunction, reiteration and collocation. Coherence, on the other hand, is not a formal feature of a text, but it describes a reader’s or listener’s intuitive perception of the connectedness of a text. Halliday and Hasan (1976) distinguish between grammatical and lexical cohesion, since ‘[c]ohesion is expressed partly through the grammar and partly through the vocabulary’ (5). But on the whole, cohesion is a semantic connection across a text (6). This emphasis on semantics is the basis for the following analyses that examine reiteration in NA and Austen. Hasan (1984) slightly modifies the term lexical cohesion that was introduced by Halliday and Hasan (1976) and distinguishes between two types of lexical cohesion, namely general cohesion, which rests on the conventional meaning of a word, and instantial cohesion, which rests on the specific meaning of a word in a text (cf. her diagram in Figure 7.1). Hasan’s (1984) study of different ways of creating and maintaining lexical cohesion is more detailed than the study by Halliday and Hasan (1976). The following analyses look at reiteration as a means of lexical cohesion and follow Hasan (1984) rather than Halliday and Hasan (1976). Also following Hasan (1984), the analyses are concerned with lemmata instead of word forms. Since the different meanings of different forms of one lemma
Figure 7.1 Lexical cohesion according to Hasan (1984)
(Sinclair 1991 and Sinclair in Moon 1987: 89) are of no consequence for the following analyses, they are not taken into account. Hoey (1991: 10) also finds evidence for a connection between lexis and cohesion in his research. He says that ‘lexical cohesion becomes the dominant mode of creating texture. In other words, the study of the greater part of cohesion is the study of lexis, and the study of cohesion in text is to a considerable degree the study of patterns of lexis in text.’ Baayen confirms this finding and writes that ‘[w]ord usage reflects lexical cohesion both at the level of the sentence and at the level of discourse’ (2001: 7f.). The relevance of lexical cohesion for identifying text parts has also received some attention. For example, Reynar (1994) takes lexical cohesion that is manifest in the repetition of lexical features as the basis for his segmentation of texts into their subtopics. These repetitions are established with the help of an algorithm which identifies the positions of lemmatized lexical words within a text. Hearst (1997) uses a similar approach, but without mentioning the term cohesion. She also identifies lexical co-occurrences and distributions with the help of an algorithm in order to identify ‘multi-paragraph units that represent passages or subtopics’ (33). She uses patterns of lexical co-occurrences and their distributions as pointers to transitions between different subtopics in a text. Jobbins and Evett (1998) also work in this tradition, as they identify repetitions and collocations as indicators for connections between words. These connections are the basis for their segmentation of texts into their
different topics. They do not use electronic means for their analyses, though, but rely on their intuition for the segmentation. Lexical cohesion is also the basis for Phillips’ (1983, 1985, 1989) analyses of the structure of lexical networks in texts. He uses the insights gained by this process in order to determine how readers understand the content, that is, the aboutness, of a text. For this purpose, he identifies ‘text elements for whatever patterns of association they might contract with each other as a function of repeated co-occurrence’ (1989: 24). This is the basis for determining which elements are particularly relevant for the reader to understand the meanings of a text and of its parts. In doing so, he also discusses the macro-structures of texts from different genres and identifies lexical patterns that are characteristic for these genres. His units of analyses are book chapters. The works described above show that texts consist of different parts that are interconnected by means of cohesion and coherence. According to Sinclair and Coulthard’s (1975) concept of latent patterning, these lexical patterns cannot be recognized by intuitive means, but only when using corpus linguistic analytic techniques. The occurrence of semantically related words creates cohesion in a text and is used as a criterion for identifying its constituent parts. The following analyses also show that the sudden absence of these words creates a break in cohesion and coherence of a text. In many cases, these breaks are not noticed by a reader, since usually several topics, and therefore several semantic fields, occur parallel in a text. Nevertheless, the breaks indicate the end or an interruption of a topic within the text. In the following, the breaks in the lexis, and therefore the breaks in cohesion, are identified as points of transition between text parts. 
The analysis shows that there is an inherent connection between text structure and cohesion, which allows the identification of the different text parts based on the occurrence and absence of specific lexis.
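The idea shared by the algorithmic approaches surveyed in this section, namely that a drop in shared lexis between neighbouring stretches of text points to a transition, can be illustrated with a short Python sketch. It is a strongly simplified illustration and not the procedure of Reynar (1994), Hearst (1997) or any other author cited above. The fixed block size, the cosine measure, the function names and the toy token list are all assumptions made for this example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-frequency Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def block_similarities(tokens, block_size):
    """Similarity of each fixed-size block of tokens to the block that follows it."""
    blocks = [Counter(tokens[i:i + block_size])
              for i in range(0, len(tokens), block_size)]
    return [cosine(blocks[i], blocks[i + 1]) for i in range(len(blocks) - 1)]

def weakest_link(tokens, block_size):
    """Return the 0-based token offset of the block boundary with the lowest
    lexical overlap, i.e. the most likely break in lexical cohesion.
    Returns None if the text is too short to contain two blocks."""
    sims = block_similarities(tokens, block_size)
    if not sims:
        return None
    lowest = min(range(len(sims)), key=sims.__getitem__)
    return (lowest + 1) * block_size

# Hypothetical token sequence: the lexis changes after the sixth token.
toy = ["novel", "heroine", "read", "novel", "heroine", "read",
       "abbey", "chest", "abbey", "chest", "abbey", "chest"]
print(block_similarities(toy, 3))
print(weakest_link(toy, 3))
```

In a real text, function words would dominate such counts, so a sketch like this would normally be run on lemmatized lexical words only.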
7.2 The text NA: the data
The following semantic fields and their distributions across NA are the basis for the analysis later in this chapter (cf. section 7.2.1):
1. place names (FULLERTON*, NORTHANGER*, WOODSTON*, ABBEY*, BATH*)
2. textuality (HEROINE*, NOVEL*, READ*, MANUSCRIPT*, JOURNAL*)
3. emotions (*LOVE*, ENGAGE*, AFFECTION*, ADMIR*, FEEL*).
There are several reasons for this choice of semantic fields:
• Place names represent geographic locations, which are the main criterion for literary critics in their segmentation of NA into its constituent parts. In order to make the following analysis comparable to that of the literary critics, place names are incorporated into the linguistic analysis.
• Textuality was identified as a major topic of the novel in Chapter 5 Keywords and concordance lines. Since it is likely that important topics are part of the text structure, the semantic field is included in the analysis.
• Emotions were also identified as a dominant semantic field in Chapter 5 Keywords and concordance lines. As this indicates the topic’s importance for the content of the text, emotions are included in the analysis.
The choice of the specific lemmata for the analysis is based on their occurrence on the lists of keywords of comparisons of NA with different reference corpora (cf. Table 7.1 below). Their choice is based either on their multiple occurrence on at least two lists or on their occurrence as two realizations of one lemma on one list of keywords. The occurrence of words on more than one list of keywords strengthens their significance as keywords, which in turn further legitimizes their choice as criteria for the text segmentation. This is the case, since their multiple occurrence minimizes the risk that either the reference corpus or a particular word from a list of keywords was specifically selected to generate predetermined results. The differences between the reference corpora in terms of their contents and their forms allows the greatest possible objectivity in this respect. The multiple occurrence of words, therefore, enhances their significance as criteria for a lexical segmentation, since their selection is legitimized by several means (cf. Chapter 5 Keywords and concordance lines). The number of lemmata that are analysed in the following is restricted to five per semantic field. This is because only five words of the categories place names and textuality fulfil the criteria for their inclusion mentioned above. Emotions is the only category where six words fulfil the criteria. However, in order to have equal numbers of lemmata for each semantic field to allow for comparisons between the analyses of the different categories, the number of lemmata included in emotions is restricted to five. Of the six possible lemmata for emotions, AFFECTION* is excluded from the analysis, since it occurs in two variants on only one list of keywords. The
other lemmata occur on at least two lists, so that their significance as keywords of the text is stronger than that of AFFECTION*. The lemmata subsumed under textuality are identified and shown to be relevant for the content of the novel in Chapter 5 Keywords and concordance lines. Nevertheless, they conform to the same criteria for their selection as the lemmata subsumed under emotions and place names. Table 7.1 lists the keywords of the three semantic fields that are analysed in this chapter. They were generated by comparisons of NA with different reference corpora and are listed in their order of occurrence on the lists.
Table 7.1 Keywords

Reference corpus: BNC
place names: northanger, bath, fullerton, woodston, abbey (excluding street names)
textuality: udolpho, heroine, heroine’s, read, novels
(romantic) emotions: feelings, heart, engagement, affection, admiration, affectionate, attentions, engaged, beloved, tenderness, attachment, felt, loved, love

Reference corpus: Austen
place names: bath, northanger, fullerton, abbey, woodston (excluding street names)
textuality: udolpho, heroine
(romantic) emotions: (none)

Reference corpus: Austen5
place names: northanger, fullerton, bath, woodston, abbey
textuality: udolpho, heroine, novels, journal, manuscripts, novel
(romantic) emotions: (none)

Reference corpus: Gothic
place names: bath, northanger, fullerton, woodston, clifton, abbey, oxford, blaize (excluding street names)
textuality: heroine, read, journal, novels, theatre, novel
(romantic) emotions: feelings, engagement

Reference corpus: ContempLit
place names: bath, northanger, fullerton, woodston, abby, cestershire, oxford, clifton, landsdown (excluding street names)
textuality: udolpho, heroine, heroine’s novels
(romantic) emotions: feelings, engagement, attentions, admiration

Keywords that are identified by comparisons with all reference corpora (for emotions, except in Austen and Austen5)
place names: northanger, bath, fullerton, abbey, woodston
textuality: heroine
(romantic) emotions: feelings, engagement
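The selection criteria for lemmata described before Table 7.1 (occurrence on at least two lists of keywords, or occurrence as two realizations of one lemma on a single list) can be expressed as a small check. The sketch below is one possible reading of those criteria and not the procedure actually used for the study. The function names and the wildcard matching are assumptions; the keyword lists are copied from the emotions column of Table 7.1.

```python
from fnmatch import fnmatchcase

# Emotion keywords per reference corpus, copied from Table 7.1.
# Austen and Austen5 yield no emotion keywords and are therefore empty.
EMOTION_KEYWORDS = {
    "BNC": ["feelings", "heart", "engagement", "affection", "admiration",
            "affectionate", "attentions", "engaged", "beloved", "tenderness",
            "attachment", "felt", "loved", "love"],
    "Austen": [],
    "Austen5": [],
    "Gothic": ["feelings", "engagement"],
    "ContempLit": ["feelings", "engagement", "attentions", "admiration"],
}

def evidence(pattern, keyword_lists):
    """Map each reference corpus to the keywords on its list that match the pattern."""
    pat = pattern.lower()
    hits = {}
    for corpus, words in keyword_lists.items():
        matched = [w for w in words if fnmatchcase(w.lower(), pat)]
        if matched:
            hits[corpus] = matched
    return hits

def qualifies(pattern, keyword_lists):
    """True if the lemma occurs on at least two keyword lists, or in at
    least two different realizations on a single list."""
    hits = evidence(pattern, keyword_lists)
    on_two_lists = len(hits) >= 2
    two_forms_on_one_list = any(len(set(m)) >= 2 for m in hits.values())
    return on_two_lists or two_forms_on_one_list

for lemma in ["FEEL*", "ENGAGE*", "ADMIR*", "AFFECTION*", "*LOVE*", "TENDER*"]:
    print(lemma, qualifies(lemma, EMOTION_KEYWORDS), evidence(lemma, EMOTION_KEYWORDS))
```

Printing the evidence alongside the decision makes it visible which criterion a lemma satisfies: AFFECTION*, for instance, qualifies only through two realizations on the BNC list.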
Street names are excluded from the analysis, since they are mostly used metonymically in the text and stand for the residents of the street; for example, Pultney Street stands for Mr and Mrs Allen and Catherine, and Milsom Street stands for the Tilney family. Furthermore, the streets are part of Bath, so that they do not exclusively indicate the geographical location of the plot. Both the occurrence of words from the three semantic fields in comparisons of NA with three of five reference corpora and the fact that some words of the categories occur on all lists, indicate that these words and the topics they depict are particularly relevant for the content of NA. Only the semantic field emotions is absent from two lists, namely the lists that result from the comparison of NA with Austen and Austen5. In retrospect, this can be explained by the structure and the contents of Austen’s novels. When looking at them superficially, they are all, at least to some degree, concerned with the protagonists’ love lives. The texts frequently discuss similar topics, like emotions, by using similar lexical items. These items are not identified as keywords of NA, since they occur in all of Austen’s novels. This thematic focus does not exist in the other reference corpora and is due to the homogeneity in terms of content and lexis of Austen, Austen5 and NA (cf. section 7.4 and Chapters 5 and 6 for a discussion of homogeneity in terms of content and phraseology, cf. Chapter 5 Keywords and concordance lines on the influence of reference corpora on the keywords that are identified in a comparison).
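Why emotion words disappear from the keyword lists when NA is compared with Austen or Austen5 can be illustrated with a keyness calculation. The sketch below assumes the log-likelihood statistic, which is a common choice for keyword analyses; this section does not state which statistic was used, so the choice is an assumption. The function name and all frequencies and corpus sizes are hypothetical and serve only to show the effect of the reference corpus.

```python
import math

def log_likelihood(freq_text, size_text, freq_ref, size_ref):
    """Log-likelihood keyness of a word, given its frequency in the text
    under study and in a reference corpus, plus the size of each in tokens."""
    total = size_text + size_ref
    expected_text = size_text * (freq_text + freq_ref) / total
    expected_ref = size_ref * (freq_text + freq_ref) / total
    ll = 0.0
    if freq_text:
        ll += freq_text * math.log(freq_text / expected_text)
    if freq_ref:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

# Hypothetical counts: a word occurring 100 times in a text of 77,000 tokens.
# Against a reference corpus that uses the word at the same relative rate
# (as a corpus by the same author might), the keyness is zero.
same_rate = log_likelihood(100, 77_000, 1_000, 770_000)

# Against a large reference corpus in which the word is comparatively rare,
# the same word receives a high keyness value.
rare_in_reference = log_likelihood(100, 77_000, 1_000, 100_000_000)

print(same_rate, rare_in_reference)
```

A word is thus only identified as a keyword relative to a particular reference corpus, which is why lexis shared by all of Austen’s novels does not surface when NA is compared with her other works.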
7.2.1 The analysis
The following analysis of the text structure of NA proceeds in several steps:
1. The distribution diagram of place names is analysed. This results in a preliminary segmentation of the text by means of linguistic data and by using literary critical approaches.
2. The distribution diagrams of textuality and emotions are analysed and interpreted. For illustrative purposes, the analysis of textuality is demonstrated in more depth than that of emotions.
3. On the basis of these analyses, NA is segmented into its constituent parts. This segmentation is compared to results from an analysis of NA by means of Vocabulary Management Profiles (VMP, Youmans 2001). This comparison results in the finalised linguistic segmentation of NA into its constituent parts.
4. The text parts that are identified by steps 1 to 3 are analysed as separate texts and compared to the BNC in a keyword analysis. The aim of this analysis is to test the validity of the semantic fields that were selected for the initial analyses by determining whether the text parts differ from each other in their dominant lexis.
Afterwards, the corpus Austen is analysed by again following steps 1 to 3. This analysis is followed by the analysis of the lexical homogeneity and heterogeneity of the corpora Austen and ContempLit.
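Step 3 relies on the observation that the introduction of new lexis signals a new topic. A simplified version of this idea can be sketched as a moving count of first occurrences. The sketch is not the VMP software and does not reproduce its exact definition; the window size, the function names and the toy token list are assumptions made for illustration.

```python
def first_occurrence_flags(tokens):
    """For each token, record whether it is the first occurrence of its word form."""
    seen = set()
    flags = []
    for tok in tokens:
        flags.append(tok not in seen)
        seen.add(tok)
    return flags

def vocabulary_profile(tokens, window):
    """Number of newly introduced word forms in each moving window of
    `window` consecutive tokens. Peaks indicate stretches in which a
    large amount of new lexis enters the text."""
    flags = first_occurrence_flags(tokens)
    return [sum(flags[i:i + window]) for i in range(len(flags) - window + 1)]

# Hypothetical token sequence: new lexis is introduced from the seventh token on.
toy_text = ("bath ball bath ball bath ball "
            "abbey chest storm abbey chest storm").split()
profile = vocabulary_profile(toy_text, 3)
print(profile)
print(profile.index(max(profile)))  # 0-based start of the window with most new lexis
```

Comparing the peaks of such a profile with the boundaries suggested by the keyword distributions is the kind of cross-check that step 3 describes.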
7.2.1.1 The classic segmentation of NA by literary critics
In literary studies, NA is usually segmented into two parts on the basis of geographical criteria (cf. e.g. Walter 1996, Poplawski 1998, Keller 2000). The first part consists of Catherine’s stay at Bath; the second part consists of her stay at Northanger Abbey. Catherine’s brief stays at her home village Fullerton both at the beginning and at the end of the novel are not considered in this segmentation. This geographical segmentation is supported by a segmentation in terms of the contents of the novel where both segmentations correspond. During her time at Bath, Catherine is introduced to social life. During her time at the abbey, she imagines a pseudo-Gothic story in which her host General Tilney is guilty of having murdered his wife. The first part of the novel is mainly a satire of sentimental fiction, while the second part is mainly a satire of Gothic fiction. At first glance, this literary critical segmentation seems to be supported by the distribution of place names in NA (Figure 7.2). Bath is continually mentioned from the beginning of the novel and Northanger starts to occur when Catherine is invited to stay there. Also, further place names, for example Catherine’s home village Fullerton, Henry Tilney’s parish Woodston and Abbey, as part of the place name Northanger Abbey, start to occur roughly in the middle of the text. This seems to support the segmentation by literary critics who divide the novel roughly in its middle. The distribution of place names in the text can be explained by its content. While Bath is mentioned from the start as Catherine and her family plan her stay there, Northanger, Woodston and Abbey start to be mentioned at about 40,000 words. This point roughly correlates with Catherine’s departure from Bath at about 45,000 words and the beginning of her stay at Northanger Abbey. The reason why the places start to be mentioned before
that is that Catherine’s visit at Northanger is a topic of conversation among the protagonists even before her departure from Bath. Fullerton is an exception to this distribution, since it is already mentioned at the beginning of the text when Catherine’s departure from Fullerton is described. It also re-occurs in the text before Northanger and Abbey are mentioned for the first time. Bath occurs throughout the text, since Catherine’s stay is an ongoing topic of conversation for the protagonists. In the first part of the novel, Bath dominates the list of place names. In the second part of the novel, multiple place names are mentioned. But when looking at the distribution diagram more closely, it suggests a divergence from the literary critical segmentation, since place names are also mentioned when the characters stay somewhere else. Despite the fact that ‘northanger’ and ‘abbey’ are mentioned for the first time roughly in the middle of the novel, the mentioning of Bath does not stop after Catherine’s geographical move away from it. On the contrary, Bath is continually mentioned throughout the novel and the topic is not closed after Catherine has left the town. There is no text part that only deals with Northanger Abbey. Furthermore, place names other than Bath begin to be mentioned before Catherine moves to the abbey. The literary critical segmentation of NA does not entirely correspond to the distribution of place names in the novel. There are also semantic features that disagree with the literary critical segmentation. The analysis of concordance lines of place names shows that they are frequently used metonymically and therefore denote the place, its inhabitants and concepts with which these inhabitants are associated. For example, Bath does not only stand for the place, but also for Catherine’s experiences and for the friends she finds there. The name is not only a place name. 
The same holds true for ‘northanger’ and ‘abbey’, which do not only stand for the abbey, but also for the Tilney family and for Catherine’s image of an abbey in general. This image is strongly influenced by the Gothic novels she has read at Bath. Fullerton (28) is also used metonymically. This is demonstrated, as an example of this usage of place names in general, by providing the concordance lines for Fullerton. In ten instances, that is, in more than one third of its occurrences, Catherine’s home village stands for the Morland family and not for the village: d, I shall come and pay my respects at Fullerton before it is long, if not disa settled as this necessary reference to Fullerton would allow. The circumstan n you return from this lord’s, come to Fullerton?" "It will not be in my pow . We shall be very glad to see you at
Fullerton, whenever it is convenient."
place names 80,000
70,000
50,000
40,000
30,000
20,000
Text segmentation
number of words in the text
60,000
10,000
153
Figure 7.2 Place names in NA
154
Corpus Linguistics in Literary Analysis
that is that Catherine’s visit at Northanger is a topic of conversation among the protagonists even before her departure from Bath. Fullerton is an exception to this distribution, since it is already mentioned at the beginning of the text when Catherine’s departure from Fullerton is described. It also re-occurs in the text before Northanger and Abbey are mentioned for the first time. Bath occurs throughout the text, since Catherine’s stay is an ongoing topic of conversation for the protagonists. In the first part of the novel, Bath dominates the list of place names. In the second part of the novel, multiple place names are mentioned. But when looking at the distribution diagram more closely, it suggests a divergence from the literary critical segmentation, since place names are also mentioned when the characters stay somewhere else. Despite the fact that ‘northanger’ and ‘abbey’ are mentioned for the first time roughly in the middle of the novel, the mentioning of Bath does not stop after Catherine’s geographical move away from it. On the contrary, Bath is continually mentioned throughout the novel and the topic is not closed after Catherine has left the town. There is no text part that only deals with Northanger Abbey. Furthermore, place names other than Bath begin to be mentioned before Catherine moves to the abbey. The literary critical segmentation of NA does not entirely correspond to the distribution of place names in the novel. There are also semantic features that disagree with the literary critical segmentation. The analysis of concordance lines of place names shows that they are frequently used metonymically and therefore denote the place, its inhabitants and concepts with which these inhabitants are associated. For example, Bath does not only stand for the place, but also for Catherine’s experiences and for the friends she finds there. The name is not only a place name. 
The same holds true for ‘northanger’ and ‘abbey’, which do not only stand for the abbey, but also for the Tilney family and for Catherine’s image of an abbey in general. This image is strongly influenced by the Gothic novels she has read at Bath. Fullerton (28) is also used metonymically. This is demonstrated, as an example of this usage of place names in general, by providing the concordance lines for Fullerton. In ten instances, that is, in more than one third of its occurrences, Catherine’s home village stands for the Morland family and not for the village:
d, I shall come and pay my respects at Fullerton before it is long, if not disa
settled as this necessary reference to Fullerton would allow. The circumstan
n you return from this lord’s, come to Fullerton?" "It will not be in my pow
. We shall be very glad to see you at Fullerton, whenever it is convenient."
ion; for to return in such a manner to Fullerton was almost to destroy the ple
est tenants might be "our friends from Fullerton." She felt the unexpected co
preparing to set off with all speed to Fullerton, to make known his situation
lence, Eleanor said, "No bad news from Fullerton, I hope? Mr. and Mrs. Morlan
of the great secret of James’s going to Fullerton the day before, did raise som
had little right to expect a welcome at Fullerton, and stating his impatience t
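Concordance lines of the kind listed for Fullerton can be generated with a few lines of code. The following is a minimal sketch and not the tool used in this study; the function name `kwic` and the context width of 38 characters are assumptions made for illustration.

```python
import re

def kwic(text, node, width=38):
    """Return keyword-in-context lines for every occurrence of `node`.

    `width` is the number of characters of context kept on each side
    (an assumed value, not taken from the study).
    """
    # Collapse line breaks and repeated spaces so each context is one line.
    flat = re.sub(r"\s+", " ", text)
    lines = []
    for match in re.finditer(r"\b" + re.escape(node) + r"\b", flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-align the left context so that the node words line up.
        lines.append(f"{left:>{width}}{match.group()}{right}")
    return lines

sample = ("We shall be very glad to see you at Fullerton, "
          "whenever it is convenient.")
for line in kwic(sample, "Fullerton"):
    print(line)
```

Whether an occurrence is used metonymically still has to be decided by reading each line; the code only retrieves and aligns the contexts.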
These results of linguistic analyses cast doubt on the validity of a text segmentation that is based only on place names. In the text, places are mentioned at times when the protagonists do not reside there. This is particularly clear in the case of Bath, which is continually mentioned throughout the novel. Furthermore, places are tied to people and the protagonists’ experiences. They are therefore not purely geographical features. Thus, the formal segmentation of the text on the basis of the occurrence of place names in the text is called into question and the following sections present an alternative, linguistic segmentation of NA that is based on corpus linguistic analyses. This linguistic segmentation will also entail an alternative segmentation of the text in terms of the content of the text parts.
7.2.1.2 Alternative linguistic segmentation of NA
As described above, an alternative, linguistic segmentation of NA is developed by analysing the distribution of the lemmata from dominant semantic fields on lists of keywords as these fields indicate important contents of the novel. As a first step, the distribution diagram of all lemmata that are subsumed under textuality is analysed. The result of this analysis is a hypothesis on an alternative text segmentation. As a second step, this hypothesis is tested by comparing it to the distribution diagram of the lemmata subsumed under emotions. As a third step, the results from these two analyses are combined to complete the alternative segmentation of the text. This segmentation is supported, in a fourth analytic step, by an analysis of the text by means of the software VMP (Youmans 2001). Figure 7.3 shows the distribution of all lemmata subsumed under textuality. Their distributions are discussed in detail in the following.
HEROINE*
The first cluster of the lemma occurs between words 3 and 38,085, a second cluster occurs at the end of the novel between words 71,200 and 77,399. The phases are divided by a section of about 33,000 words, in which the lemma does not occur.
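The word positions and phases reported for each lemma can be derived from a tokenized text. The sketch below illustrates one possible procedure; it is not the WD software used in this study. The prefix matching (an assumed reading of the asterisk in HEROINE*), the function names and the gap threshold are all assumptions, and the positions in the second call are invented.

```python
import re

def lemma_positions(tokens, stem):
    """1-based positions of all tokens that begin with `stem`.

    Prefix matching also catches unrelated words (e.g. 'ready' for
    'read'), so real result lists need to be checked manually.
    """
    stem = stem.lower()
    return [i for i, tok in enumerate(tokens, start=1)
            if tok.lower().startswith(stem)]

def phases(positions, max_gap):
    """Group positions into phases separated by gaps larger than max_gap.

    Returns one (first position, last position, count) tuple per phase.
    """
    groups = []
    for pos in sorted(positions):
        if groups and pos - groups[-1][-1] <= max_gap:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    return [(g[0], g[-1], len(g)) for g in groups]

opening = ("No one who had ever seen Catherine Morland in her infancy "
           "would have supposed her born to be an heroine.")
tokens = re.findall(r"[A-Za-z']+", opening)
print(lemma_positions(tokens, "heroine"))
# Invented positions, for illustration only.
print(phases([3, 40, 95, 1200, 1260], max_gap=500))
```

The choice of `max_gap` determines how many phases are found, so it plays a role comparable to the analyst's visual reading of the distribution diagram.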
[Figure 7.3 Textuality in NA. Distribution diagram for the lemmata subsumed under textuality; axis: number of words in the text, 10,000 to 80,000.]
This uneven distribution of the lemma in the text is due to the plot structure of the novel. Since HEROINE* refers in most of its occurrences either directly or indirectly to Catherine (cf. Chapter 5 Keywords and concordance lines), its occurrences correspond to those parts of the text in which Catherine is at its focus. The first phase in which the lemma occurs starts by introducing Catherine, then the other protagonists are introduced and their relationships to Catherine are described. The second phase of the lemma’s occurrence corresponds to Catherine’s return to Fullerton at the end of the novel. It also covers Catherine and Henry’s engagement at the end of the novel. The occurrence of the lemma both at the beginning and at the end of the novel creates a frame for the novel which describes Catherine’s social sphere. The usage of the lemma is rather homogeneous.
READ*
The distribution of READ* contains three phases in which the lemma occurs (words 907 to 11,458, words 30,807 to 32,158 and words 42,473 to 66,344). The third phase of the lemma’s occurrence is the least compact one, since it covers some 24,000 words. The three phases mirror foci of the novel’s content. In the first phase, basic characteristics of the novel are introduced, for instance the recurrent inter- and intratextual references. These references are books that are read by the protagonists either because they are interested in the books or because they think the books might be useful to them. The second and third phases are dominated by explicit discussions on and talks about specific novels, about novels in general and about non-fiction works, and by theoretical discussions on the usefulness of reading. The texts mentioned include fiction and non-fiction books, but also other genres such as letters and diaries. The use of the lemma is not specific to a certain kind of situation.
NOVEL*
The distribution of NOVEL* has two phases of occurrence, both remarkably short. The first phase (words 7,389 to 11,316) covers the so-called ‘Defence of the Novel’ (24f.) in which the author/narrator of NA states that reading novels is as valuable and justifiable as reading non-fiction books. This content is repeated in the second phase of the lemma’s occurrence (words 30,808 to 31,063), in which Mr Tilney and Catherine discuss the moral value of novels. The usage of the lemma is homogeneous.
JOURNAL*
The distribution of JOURNAL* has two phases. The first phase is restricted to a conversation between Henry Tilney and Catherine when they first meet at Bath (words 4,113 to 4,412). After that, the lemma occurs only once more (word 58,237), while Catherine stays at Northanger Abbey. The use of the lemma is connected to the relationship between Catherine and Henry Tilney. In its first phase of occurrence, Catherine and Henry Tilney meet. In its second phase, Catherine stays at Northanger and enters a room in the abbey into which she is forbidden to go. Her motivation for nevertheless doing so is that she hopes to find proof there for her suspicion that General Tilney, Henry’s father, had murdered his wife. When entering the room, she is discovered by Henry Tilney who tells her about the real circumstances of his mother’s natural death and who forgives her for her suspicions against his father. The usage of the lemma is homogeneous as it has two referents, namely (1) the relationship between Catherine and Henry Tilney and (2) Gothic novels.
MANUSCRIPT*
Unlike the other lemmata, MANUSCRIPT* is used in only one very restricted phase of the novel. Its seven occurrences are all within 3,500 words, when Catherine supposedly finds a manuscript in her room at Northanger Abbey.
This manuscript later turns out to be a laundry bill of a former guest at Northanger. The usage of the lemma is homogeneous and locally restricted.
7.2.1.3 Segmenting NA on the basis of its dominant lexis
Summarizing the results from the analysis above shows that the usages of the lemmata are often homogeneous. This supports the hypothesis that the lemmata can be used as a means to linguistically segment the text into its constituent parts on the basis of its content. Taking the lemmata’s distribution as an indicator for the text parts allows the segmentation of the text into parts of homogeneous content. This linguistic text segmentation presents an alternative to the literary critical one. As a second step of the analysis, the distributions of all lemmata of the semantic field textuality are analysed. This shows that there are three main phases in which the lemmata occur. This becomes particularly visible in Figure 7.4, a diagram in which the positions of all lemmata subsumed under textuality have been collated into one line which shows where the lemmata occur in the text. The analysis of the diagram in Figure 7.4 identifies three text parts which are hereafter called NA1, NA2 and NA3. The transitions between the parts are marked by dotted lines in the diagram. The dashed line in the diagram shows the transition between the two text parts identified by literary critics. The three parts are:
NA1: words 20 to 15,369
NA2: words 15,370 to 30,403
NA3: words 30,404 to 77,399.
The end of one text part is marked by a phase in which the lemmata occur particularly frequently. NA1 and NA2 are of roughly equal length, but the number of lemmata from the semantic field textuality is higher in NA1 than in NA2. NA3 is longer than NA1 and NA2, and the lemmata are used more frequently in NA3 than in NA2, but less frequently than in NA1. NA1 and NA3 are defined by the lemmata’s occurrence.
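The collation of several lemmata into one line, and the comparison of how frequently they occur in each text part, can be sketched as follows. The spans of NA1, NA2 and NA3 are those given above; the function names and the positions in the usage example are hypothetical and serve only to show the calculation.

```python
from bisect import bisect_left, bisect_right

# Spans of the three text parts as given above (word positions).
PARTS = {"NA1": (20, 15369), "NA2": (15370, 30403), "NA3": (30404, 77399)}

def collate(position_lists):
    """Merge the position lists of several lemmata into one sorted line."""
    return sorted(pos for positions in position_lists for pos in positions)

def density_per_part(positions, parts=PARTS, per=1000):
    """Occurrences per `per` running words in each text part.

    `positions` must be sorted, as returned by collate().
    """
    result = {}
    for name, (start, end) in parts.items():
        hits = bisect_right(positions, end) - bisect_left(positions, start)
        result[name] = per * hits / (end - start + 1)
    return result

# Invented positions for two lemmata, for illustration only.
line = collate([[30, 900, 15000], [20000, 40000, 60000]])
print(density_per_part(line))
```

Normalizing by the length of each part matters here because NA3 is considerably longer than NA1 and NA2, so raw counts alone would overstate its share.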
[Figure 7.4 Textuality 2 in NA. Collated distribution diagram for the semantic field textuality; axis: number of words in the text, 10,000 to 80,000.]
NA2 is defined by the lemmata’s relative absence across long stretches of text. The literary critical segmentation of NA situates the transition between its two text parts at about 45,045 words, that is, with the beginning of Chapter 20 (cf. the dashed line in Figure 7.4). In the linguistic segmentation,
NA2 starts at about the beginning of Chapter 9 and NA3 starts at about the beginning of Chapter 14. An exception to this literary critical segmentation of the text is the Norton Edition of Northanger Abbey in which the second part of the novel starts at the beginning of Chapter 16 when Catherine first dines with the Tilney family. This corresponds roughly to the beginning of NA3. The exact numbers of where in the text certain words occur that are used in the linguistic analyses are provided by WD. The numbers from WD function as points of reference for the beginnings and ends of the text parts, but slight deviations from the numbers are possible. This is the case with the end of NA3, for example, as it finishes only about 350 words before the end of the novel. Practical and logical reasons therefore suggest that NA3, in fact, ends at the end of the novel. This is accepted in the following.
Emotions
Unlike textuality, the second semantic field that is analysed, emotions, occurs throughout the entire novel. This is shown in Figure 7.5 which displays the positions of all lemmata subsumed under emotions in NA. For this diagram, the respective positions of all lemmata from the semantic field were collated to form one line. The design of this diagram is parallel to Figure 7.4 earlier. The continuity of the topic emotions means that emotions and textuality frequently co-occur in NA. This is demonstrated by the diagram in Figure 7.6 which shows the occurrence of the lemmata across the novel. The horizontal dotted lines show the transitions between NA1, NA2 and NA3, the dashed line shows the transition between the two text parts of literary criticism and the bold line shows the transition between the two text parts as identified above on the basis of place names. The diagram in Figure 7.6 shows that the topics textuality and emotions are present throughout the entire novel. While emotions occurs without interruptions, textuality mainly occurs in clusters.
[Figure 7.5 Emotions in NA. Collated distribution diagram for the semantic field emotions; axis: number of words in the text, 10,000 to 80,000.]
[Figure 7.6 Textuality – Emotions in NA. Combined distribution diagram for both semantic fields; axis: number of words in the text, 0 to 80,000.]
This distribution indicates that emotions functions as a background to other dominant topics in the novel, so that the identification of NA1 and NA3 on the basis of the presence and of NA2 on the basis of the absence of lemmata from the semantic field textuality is supported. It also confirms the thesis of the general segmentation of NA into three parts that has been put forward. The focus of NA1 and NA3 is on textuality, the focus of NA2 is on emotions. These semantic fields are dominant within the respective parts of the novel. Both the segmentation of NA on the basis of place names and the literary critical segmentation set the transition between their two text parts later than the transition between NA2 and NA3. Place names identify the transition
at about 10,000 words later than the transition between NA2 and NA3 and literary criticism sets it about 15,000 words later. While some lexical changes in the text support the segmentation of literary critics and one on the basis of place names (cf. section 7.2.1.4), the bulk of the linguistic evidence in a linguistic stylistic analysis, however, suggests earlier points of transition. The segmentation of NA into three parts shifts both the formal boundaries of the text parts and also the contents covered by them from that of literary criticism to the one presented above. NA1 describes Catherine’s departure from Fullerton and the beginning of her stay in Bath where she meets Isabella and John Thorpe, and Henry and Eleanor Tilney. She is also introduced to Gothic novels. Unlike in the segmentation of literary criticism, the setting of NA2 is still Bath where the reader witnesses the romantic relationship between Catherine’s brother James and Isabella Thorpe. This relationship leads to an engagement between James and Isabella. Furthermore, Catherine is repeatedly in the company of Isabella’s brother John, who prevents Catherine from seeing Henry and Eleanor Tilney as often as she would like. The reason for John’s occasionally even rude behaviour is that he is courting Catherine. Catherine, however, neither notices his courtship nor is interested in John.
In NA3, the friendship between Catherine and Henry and Eleanor Tilney develops. This results in an invitation by General Tilney for Catherine to stay at the Tilneys’ family home, Northanger Abbey, after their time in Bath. Catherine’s stay at the abbey and her return to Fullerton are also described in NA3. Catherine’s romantic feelings for Henry Tilney are a background topic of all parts of the novel. NA1 is an introduction to the novel with its protagonists and to the topic textuality. NA2 is dominated by the engagement between James Morland and Isabella Thorpe, that is, by emotions. NA3 has several dominant topics that can all be subsumed under textuality: explicit discussions on literature, Catherine living a Gothic story and Catherine’s friendship with Henry and Eleanor Tilney who become part of a Gothic story, since Catherine suspects that their father is a murderer. NA1 and NA3 include a large number of explicit and implicit references to literature and textuality, NA2 includes fewer of these references. In NA2, the topic emotions is more dominant than textuality (cf. Chapter 5 Keywords and concordance lines and the Appendix for a discussion of intertextual references in NA). A summary of the contents of the three parts of NA clarifies this revised structure of the novel:
1. Theoretical introduction to textuality
2. Intermezzo between the two text parts, which are characterized by their emphases on textuality, focusing on a romantic relationship, that is, on emotions
3. Living textuality.
This structure shows that NA is a Bildungsroman which, on the one hand, portrays Catherine and her imagination as well as her attitude to fiction and textuality. On the other hand, it portrays Catherine’s social competences in her dealings with her friends and with calculating people such as General Tilney and Isabella and John Thorpe.
While NA1 and NA3 are mainly concerned with textuality, NA2 describes mainly interpersonal relationships that are otherwise a secondary plot line in the novel. Furthermore, all relationships portrayed in NA2 turn out to be unsuccessful in the course of the novel. Examples of this are the engagement between James Morland and Isabella Thorpe that is formed in NA2 and which breaks up in NA3, and the friendship between Catherine and Isabella which also breaks up later on. These failures in interpersonal relationships distinguish NA2 from NA1 and NA3. The relationships described in the latter two parts all turn out to be successful in the course of the novel
(cf. Chapter 5 Keywords and concordance lines on the relationship between the characters and textuality). The three-part structure of NA that has been identified by using electronically generated linguistic data deviates from the geographically motivated segmentation by literary criticism, which at first seemed to be confirmed by the distribution of place names in NA. The segmentation of the text into three parts combines the insights gained from corpus linguistic analyses with insights into the content of the novel that derive partly from the corpus linguistic analyses and partly from insights gained by literary critics. In the following section, NA is analysed by using the software Vocabulary Management Profiles (VMP, Youmans 2001). Following this analysis, its results are related to the segmentation presented above in order to establish a final segmentation of the text.
7.2.1.4 Vocabulary Management Profiles as a means for text segmentation
The software Vocabulary Management Profiles (VMP, Youmans 2001) can be used to analyse texts to find out where new word types are introduced into them. To do so, the software takes the absolute number of words in the text (tokens) and sets them in relation to the number of word types (types) that have been used in the text up to a given point. This procedure is repeated across the text at a regular interval that is set by the user at the beginning of the analysis. The aim of the analysis is to identify the distribution of newly introduced lexis across a text. This distribution is presented in the form of a diagram by the software. Youmans (2002) himself describes the analytic procedure of his software as follows:
Output is a concordance [sic] with moving averages of a modified version of type-token ratios. Instead of treating all repeated vocabulary the same – as contributing one new token but zero new types to a text – this program calculates a ratio based upon the fraction of the text that has intervened between a repeated word and its most recent occurrence. These ratios are used to generate a modified version of the VMP that is [sic] does not decline as rapidly over the length of a text and which correlates more successfully with structural boundaries in discourse.
In the analysis of NA, VMP counts a total of 6,197 types and 77,308 tokens. The type-token-ratio is 0.0802.
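The overall type-token ratio reported above is simply the number of types divided by the number of tokens. The following sketch shows the calculation; the tokenizer is an assumption (lower-cased strings separated by whitespace, with surrounding punctuation stripped) and may count words differently from VMP.

```python
import string

# ASCII punctuation plus curly quotation marks.
PUNCTUATION = string.punctuation + "‘’“”"

def tokenize(text):
    """Split on whitespace, strip surrounding punctuation, lower-case."""
    tokens = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [tok for tok in tokens if tok]

def type_token_ratio(tokens):
    """Number of distinct word forms divided by number of running words."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Figures reported for NA above: 6,197 types and 77,308 tokens.
print(round(6197 / 77308, 4))
print(type_token_ratio(tokenize("The abbey, the Abbey!")))
```

Because the ratio falls as a text grows longer, a single value for the whole novel says little about where new lexis is introduced, which is why the analysis works with ratios calculated at regular intervals.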
VMP is particularly useful for the present analysis as the introduction of new topics into a text is marked by the introduction of new lexis into it (Youmans 1991: 1). The reason for this is, according to Youmans, that lexical patterns display ‘patterns of information management’ (1991: 4). This means that the introduction of a great number of new lexical items into a text at a specific point or span within a text marks the introduction of a new topic. In doing so, the software treats different forms of one lemma as different word types. The optical representation of these new types within a text is the rise of the distribution curve in the diagram and ‘major upturns in the VMP tend to occur near major constituent boundaries of a discourse’ (Youmans 1994: 2). What Youmans (1994 and 2002) does not describe, though, is whether and if so, how, he has checked these ‘constituent’ or ‘structural boundaries’. The hypothesis that an increase of new lexis in a text corresponds to the beginning of a new text part is also supported by Stubbs (2002) who uses software resembling VMP to analyse James Joyce’s short story, ‘Eveline’. He discovers that the introduction of new lexis into the text correlates with turns in the story that are also identified by literary critics. Stubbs uses intervals of 35 and of 151 words for his analysis. Both intervals are tested for the analysis of NA later in this section. Since the introduction of new lexis into a text corresponds to the introduction of new topics into it, an analysis of NA using VMP supports the identification of transitions between text parts as, in the present analysis, text parts are defined on the basis of their contents. The analysis is also useful for testing the basic hypothesis of the software that the introduction of new lexis into a text marks the introduction of new topics into it, and therefore the transitions between text parts. 
The latter is an assumption adopted in this analysis and is not explicit in the documentation of the software. Both assumptions are tested in the following by using VMP in the analysis of NA. The analyses performed in this chapter so far involved a degree of subjectivity on the part of the analyst as sorting the lists of keywords into semantic fields is an individual and partly subjective process. This subjectivity is largely avoided in the analysis of NA by using VMP. Only the choice of the interval the software uses for the analysis and the final interpretation of the diagram is left to the analyst. Subjective decisions are thus minimized as much as possible in the analysis. A further argument for using VMP is that the analysis does not involve semantic criteria indicating contents or topics of the text, such as semantic fields. The analysis is based on word types and tokens only, that is, on both
the absolute number of words and on the number of different words in the text. The software assumes an orthographic definition of a word as a string of letters that is separated by spaces on its left and right from other words. This mechanical sorting of lexis into types and tokens gives a text segmentation that relies on both distribution diagrams and VMP a broader and more objective basis than an analysis that is based on only one of the two analytic techniques. This increases the significance of the results.
The following analysis of NA by VMP uses an interval of 101 words (cf. Figure 7.8). This interval is selected following Youmans (1991), who argues that choosing a rather long interval allows the identification of long-term lexical patterns in a text. It is the text-length that determines the interval that is suitable for an analysis. The diagrams in Figures 7.7, 7.8 and 7.9 show differences between the analyses of NA by using different intervals. The diagrams show that choosing a relatively long interval, here 101 words (Figure 7.8), is useful for the analysis of a novel, since the distribution of the introduction of new lexis into the text is made visible on a rather large scale. This is appropriate for a novel of about 77,000 words. An analysis with a shorter interval of 35 words (Figure 7.7) with a rather frequent upturn of the curve indicating the introduction of new lexis into the text would not be suitable for the analysis of a text of this length. This short interval identifies subtopics that are not relevant for the identification of text parts. A span that is longer than 101 words generates no new insights into the text compared to the analysis with an interval of 101 words. As can be seen from Figure 7.9 with its interval of 151 words, the curve remains rather flat, so that thematic units of the text cannot be conclusively identified.
Consequently, the diagram based on the interval of 101 words (Figure 7.8) is the basis for the following analysis. The diagram in Figure 7.8, interval 101, shows that new lexis is continually introduced into NA throughout the novel. Peaks of the type-token-ratio of 0.2 or higher (cf. the dashed line in Figure 7.8) occur at about 16,000, 32,000, 46,000, 51,000 and 56,000 words. After that, the curve indicating the introduction of new lexis becomes flatter. This development is due to the fact that the further advanced a text already is, the more lexis has already been used. Consequently, fewer new word types are introduced into it at its end than at its beginning. When comparing the peaks of the type-token-ratio, that is, the peaks of the introduction of new lexis into the text, with the transitions of the text parts identified by WD above, it becomes visible that they frequently correlate (cf. Table 7.2).
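The effect of the interval and the identification of peaks at or above a threshold can be illustrated with a simplified calculation. The sketch below is not Youmans’s VMP algorithm, which weights repeated vocabulary by the distance to its previous occurrence; it only computes the share of first occurrences within a moving window. The function names are hypothetical.

```python
from itertools import accumulate

def new_type_ratios(tokens, interval=101):
    """Share of first-occurrence tokens in each window of `interval` tokens.

    Entry i of the result covers tokens i to i + interval - 1.
    """
    seen, flags = set(), []
    for tok in tokens:
        flags.append(0 if tok in seen else 1)
        seen.add(tok)
    # Prefix sums make every window total a single subtraction.
    prefix = [0] + list(accumulate(flags))
    return [(prefix[i + interval] - prefix[i]) / interval
            for i in range(len(tokens) - interval + 1)]

def peaks_at_or_above(ratios, threshold=0.2):
    """Indices of local maxima whose value reaches the threshold."""
    return [i for i in range(1, len(ratios) - 1)
            if ratios[i] >= threshold
            and ratios[i] > ratios[i - 1]
            and ratios[i] >= ratios[i + 1]]

# Toy sequence of one-letter 'words', for illustration only.
toy = list("aabbaacdeaaaa")
ratios = new_type_ratios(toy, interval=3)
print(peaks_at_or_above(ratios))
```

Running the same function with intervals of 35, 101 and 151 on a full text shows the trade-off described above: short windows produce many local peaks, long windows flatten the curve.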
[Figure 7.7 VMP of NA, interval 35 words. VMP 2.2 curve; x-axis: Tokens (7,000 to 77,000); y-axis: Types/Tokens (0 to 1).]
[Figure 7.8 VMP of NA, interval 101 words. VMP 2.2 curve; x-axis: Tokens (7,000 to 77,000); y-axis: Types/Tokens (0 to 1).]
[Figure 7.9 VMP of NA, interval 151 words. VMP 2.2 curve; x-axis: Tokens (7,000 to 77,000); y-axis: Types/Tokens (0 to 1).]
Table 7.2 Correlation WD and VMP
Text parts    Segmentation based on WD    Peaks in VMP
NA1           1 to 15,369                 Beginning of the text
NA2           15,370 to 30,403            ca. 16,000
NA3           30,404 to 77,308            ca. 32,000; ca. 46,000; ca. 51,000; ca. 56,000
The first peak of the diagram is at the beginning of the text, i.e. at the beginning of NA1 when nearly all words in the text are still new. The second peak correlates roughly with the beginning of NA2 and the third peak correlates roughly with the beginning of NA3. The peaks at about 46,000, 51,000 and 56,000 words are within NA3. However, VMP also shows that the beginnings of the two parts of NA identified by literary critics are marked by the introduction of new lexis into the text (cf. Table 7.3).
Table 7.3 Correlation literary criticism and VMP
                              Text parts          Peaks in VMP
Literary criticism part 1     1 to 45,045         Beginning of the text
Literary criticism part 2     45,046 to 77,308    ca. 46,000; ca. 51,000; ca. 56,000
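The correlations in Tables 7.2 and 7.3 can be checked by calculating how far each boundary lies from the nearest VMP peak. The sketch below uses the approximate peak positions and the boundary positions reported above; the function name and the dictionary labels are chosen for illustration.

```python
# Approximate VMP peaks (interval 101) as reported above.
VMP_PEAKS = [16000, 32000, 46000, 51000, 56000]

# First word of each text part that follows a transition.
BOUNDARIES = {
    "NA2 (WD)": 15370,
    "NA3 (WD)": 30404,
    "part 2 (literary criticism)": 45046,
}

def nearest_peak(boundary, peaks):
    """Return the peak position closest to the given boundary."""
    return min(peaks, key=lambda peak: abs(peak - boundary))

for label, start in BOUNDARIES.items():
    peak = nearest_peak(start, VMP_PEAKS)
    print(f"{label}: starts at {start}, nearest peak ca. {peak}, "
          f"distance {abs(peak - start)} words")
```

Since the peak positions are only given to the nearest thousand, the distances are rough indications rather than exact measurements.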
The peaks at about 46,000, 51,000 and 56,000 words mark Catherine’s move to Northanger Abbey which occurs during this span in the text. The travellers arrive at the abbey at about 56,000 words after the journey to Northanger has been described. The peaks within NA3 therefore indicate that topics and foci shift within this text part, so that new lexis is introduced into the text. The other peaks of VMP support the segmentation of NA into its three constituent parts, which is based on the distribution of the novel’s keywords. The analysis has shown that both the linguistic and the literary segmentations of the text are mirrored in its language. Consequently, the linguistic segmentation of NA into three parts developed above has to be modified to a new segmentation to accommodate this finding (cf. Table 7.4).
Table 7.4 Segmentation of NA
Text parts    Phases    Spans within the text
NA1                     0 to 15,369
NA2                     15,370 to 30,403
NA3                     30,404 to the end
              NA3.1     30,404 to about 46,000
              NA3.2     ca. 46,000 to the end
In Table 7.4, NA1, NA2 and NA3 are called text parts, as they are based on the double identification of distribution diagrams and VMP. NA3.1 and NA3.2 are called text phases, as they are identified by only one analysis, namely by VMP. This segmentation merges the peaks identified by VMP at about 46,000, 51,000 and 56,000 words into a single peak. This is based on the assumption that long-term patterns of the content of a text are longer than 5,000 words, a span that is itself less than 1/15th of the complete text. This means that the peak at 51,000 words is still part of the introduction of new vocabulary occurring at the beginning of the phase at about 46,000 words. Also, the peak at 56,000 words still belongs to this phase. Taken together, these 10,000 words, during which a significant number of new lexical items is introduced, indicate major changes in the text. This assumption is supported by the content of the span which describes both Catherine’s journey to Northanger and her arrival at the abbey. The segmentation of NA into text parts and phases above therefore reflects both the linguistic and the literary segmentations.
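The merging of neighbouring peaks can be stated as a simple rule: a peak that follows the previous peak by no more than 5,000 words is treated as belonging to the same phase. The sketch below applies this rule to the approximate peak positions reported above; the function name and the chained form of the rule are assumptions made for illustration.

```python
def merge_peaks(peaks, min_distance=5000):
    """Keep only the first peak of each run of closely spaced peaks.

    A peak is dropped when it lies at most `min_distance` words after
    the preceding peak, so runs of neighbouring peaks collapse into one.
    """
    merged = []
    previous = None
    for peak in sorted(peaks):
        if previous is None or peak - previous > min_distance:
            merged.append(peak)
        previous = peak
    return merged

print(merge_peaks([16000, 32000, 46000, 51000, 56000]))
```

Applied to the five reported peaks, the rule leaves three starting points, which correspond to the beginnings of NA2, NA3 and NA3.2 in Table 7.4.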
7.2.2 Analysing keywords of NA1, NA2 and NA3
As a next step of the analysis, the different text parts are analysed for their respective keywords by using the BNC as a reference corpus. The results of this analysis show both semantic differences and common semantic features of the three parts and with reference to the complete text. In particular, the differences between the text parts support and legitimize the segmentation presented above. The fact that mainly semantic differences between the three text parts are identified by this analysis of their keywords shows that they are distinct. This further supports the segmentation of the text into the three parts with NA3 consisting of two phases.
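Keywords are words that are unusually frequent in a text compared with a reference corpus. The keyness statistic used for the lists below is not named in this section; the following sketch assumes the log-likelihood measure that is common in corpus linguistics. The function names, the threshold of 6.63 (the critical value for p < 0.01 with one degree of freedom) and the frequencies in the usage example are assumptions for illustration.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of one word, study text vs reference corpus."""
    total = size_study + size_ref
    combined = freq_study + freq_ref
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    score = 0.0
    for observed, expected in ((freq_study, expected_study),
                               (freq_ref, expected_ref)):
        if observed > 0:
            score += observed * math.log(observed / expected)
    return 2 * score

def keywords(study_counts, ref_counts, threshold=6.63):
    """Words overused in the study text, sorted by descending keyness."""
    size_study = sum(study_counts.values())
    size_ref = sum(ref_counts.values())
    scored = []
    for word, freq in study_counts.items():
        ref_freq = ref_counts.get(word, 0)
        # Keep only words that are relatively more frequent in the study text.
        if freq / size_study > ref_freq / size_ref:
            score = log_likelihood(freq, size_study, ref_freq, size_ref)
            if score >= threshold:
                scored.append((word, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Invented frequencies, for illustration only.
study = {"abbey": 50, "the": 950}
reference = {"abbey": 5, "the": 9995}
print(keywords(study, reference))
```

Applied to three separate text files, one per text part, such a procedure yields one keyword list per part, which can then be sorted manually into semantic fields.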
To conduct the following analysis, NA is divided into its three parts as identified by the analysis above. This results in three separate text files that are each compared to the BNC to identify their keywords.
The keywords
The most prominent semantic fields on the list of keywords for NA1 (118 words) relate to Catherine’s stay in Bath as a place of social life. Furthermore, place names and words from the semantic field textuality occur (cf. Table 7.5).
Table 7.5 Keywords NA1
Pattern: A large number of adjectives (22) indicates that the text is descriptive and that it describes the protagonists’ subjective impressions of physical and emotional circumstances; unlike in NA2, these words are mostly positively connotated or denotated (15), seven words are neutrally or negatively connotated or denotated
Realizations: Positively connotated or denotated words: affectionate, agreeable, amiable, charming, dear, dearest, delighted, delightful, engaged, fond, glad, handsome, pretty, satisfied, unwearied; neutrally or negatively connotated or denotated words: young, sure, amazingly, horrid, insipid, impertinent, eldest
Pattern: 19 positively connotated or denotated words
Realizations: agreeable, dear, friend, dearest, delightful, fond, handsome, pleasure, intimacy, pretty, charming, amiable, affectionate, glad, praise, delighted, admiration, leisure, delight
Pattern: Eight negatively connotated or denotated words or grammatical negations
Realizations: not, unwearied, horrid, scold, insipid, nothing, nor, impertinent
Pattern: Ten words express emotions
Realizations: fond, unwearied, satisfied, pleasure, scold, affectionate, glad, delighted, admiration, delight
Pattern: Eight words describe family and social relationships
Realizations: acquaintance, acquainted, brother, friend, friendship, intimacy, partner, sister, sisters
Pattern: Six words come from the semantic field textuality
Realizations: heroine, journal, laurentina’s, novels, read, udolpho
Pattern: Two words describe clothing, that is, the outer appearance which seems to be of great importance for social life in Bath
Realizations: gown, muslin
Pattern: Two words are place names, excluding street names, cf. section 7.2 for reasons for this exclusion
Realizations: bath and tetbury
NA1 appears to mainly describe social life in Bath, which, because of its novelty and its many diversions, seems positive to Catherine. Furthermore, NA1 is an introduction to textuality in the novel. The atmosphere in NA2 (106 keywords) differs from that in NA1, cf. Table 7.6.
Table 7.6 Keywords NA2
Pattern: Nine words are positively connotated or denotated, three of which are adjectives that are mainly used when addressing someone
Realizations: agreeable, dear, dearest, delightful, sweetest, agreeableness, friend, happiness, pleasure; forms of address: dear, dearest, sweetest
Pattern: Eight words are negatively connotated or denotated or are grammatical negations
Realizations: angry, cannot, dirt, haste, never, no, not, though
Pattern: Nine words describe family and social relationships, which shows the great importance attached to these relationships in NA2
Realizations: brother, friend, engagement, dance, engaged, acquaintance, brother’s, dancing, servant
Pattern: The occurrence of nine auxiliary verbs and one indefinite pronoun indicates insecurity and a concentration on past and future actions and states
Realizations: am, anybody, cannot, could, did, do, had, shall, was, would
Pattern: The importance of emotions in this part of NA is mirrored by the occurrence of four words from this semantic field, two of which are positively connotated and denotated, one of which is negatively denotated.
Realizations: angry, feelings, happiness, pleasure
Pattern: Three words are place names, excluding street names
Realizations: bath, clifton, blaize
The occurrence of this lexis mirrors the plot of NA2. James Morland’s marriage proposal is accepted by Isabella Thorpe, and John Thorpe tries to win Catherine for himself. However, the latter is not successful in his pursuit. On the plot level, the four protagonists of this part of NA are all concerned with romantic emotions. However, this is mirrored in only three keywords (engage, engagement, feelings). NA3 continues the trend started in NA2 of an increase in grammatical negations and negatively connotated and denotated lexis, cf. Table 7.7. This list of semantic fields identified from the list of keywords shows that NA3 continues topics of NA1 and NA2, but does not seem to develop any new topics. Figure 7.10 shows the large overlap between semantic fields among the three parts of NA. Five semantic fields occur on all three lists of keywords, one more field occurs on two lists. The occurrence of textuality and emotions in the lists confirms the conclusions drawn above that emotions are an ongoing topic of the text and that textuality mainly occurs in NA1 and NA3.
Table 7.7 Keywords NA3
Pattern | Realizations
49 words are negatively connotated or denotated or are grammatical negations, eleven of which express emotions | not, never, nothing, scarcely, but, alarm, no, nay, agitation, nor, agitated, ashamed, pretending, torment, uneasiness, dreadful, distress, neither, unworthy, mistaken, unlooked, disagreeable, causeless, folly, pride, anxious, vain, devoured, unkindness, uneasy, ill, vanity, suspense, fearful, impatience, offended, unpleasant, simpleton, absence, misled, dread, devouring, suspicions, tempest, resentment, sorry, detain, cannot, strange
29 words are positively connotated or denotated, thirteen of which express emotions | happiness, dear, engagement, affection, comfort, charming, pleasure, agreeable, kindness, ease, handsome, affectionate, happy, partiality, glad, admiration, dearest, sweetest, prettiest, cheerfulness, fond, attachment, handsomely, loved, tenderness, felicity, delightful, amiable, love
Twelve words belong to the semantic field love and marriage | affection, attachment, attentions, consent, feelings, felt, heart, hearts, love, loved, marrying, partiality
Eleven words express family and social relationships | acquaintance, brother’s, daughter, family, father, friend, friend’s, friends, friendship, mother’s, sister
Five place names, excluding street names, among the top 40 keywords show that the focus of the text shifts from Bath only to several places (cf. also Figure 7.2 for the distribution of place names in NA) | abbey, bath, fullerton, northanger, woodston
The topic textuality occurs in four words on the list | heroine, letter, manuscript, udolpho; staircase is a fifth word that is intuitively associated with Gothic novels in deserted castles and abbeys
When looking at the numbers of words and at the actual words that are identified as key in NA1, NA2 and NA3, a progression from a carefree environment to a darker atmosphere in the text becomes noticeable. This also confirms the conclusions drawn above that NA1 functions as an introduction to textuality, that NA2 is characterized by social relationships that fail in the course of the novel, and that NA3 shows how the protagonist lives textuality by imagining that she is reliving a Gothic story.
Figure 7.10 Comparison of keywords from NA1, NA2 and NA3
This shows that the three parts of NA differ from each other. Apart from the distribution diagrams and VMP, this further legitimizes the segmentation of NA into three parts. The three parts are inherently homogeneous, but are distinct from each other. This shows that each of them is a unit of meaning within the text. The continuity of topics across the different text parts creates cohesion and coherence across the whole novel. This allows for shifts of dominant topics among the text parts as continuous topics form their backgrounds throughout the text. These shifts of topics are mirrored by the differences of semantic fields on the lists of keywords of the three text parts. Semantic fields that occur across the whole text, emotions for example, create lexical cohesion and coherence so that the different text parts cannot be readily identified intuitively. Continuous topics ensure that the reading flow is not interrupted. This creates a coherent text.
The analysis of the keywords of the three parts of NA shows that the semantic fields chosen for the original analysis stand for important topics in the text. The assumption that prominent semantic fields on a list of keywords are criteria for the segmentation of a text into its constituent parts has been corroborated. This corroboration has been further supported by an analysis of the keywords of the separate text parts.
7.2.3 Implications The analyses have shown that it is not enough to analyse only selected lemmata in a distribution analysis that aims to research the content structure of a text, because dominant topics differ between text parts. Consequently, the different text parts are characterized by lexical differences between them. Only the analysis of several semantic fields allows a segmentation which can be assumed to be valid. Furthermore, the analysis has shown that it is useful to combine different kinds of analyses to segment a text into its constituent parts. For the segmentation above, the distribution of lemmata from three different semantic fields was one basis of the linguistic segmentation of NA. The semantic fields and the lemmata analysed for this field had been identified as some of the novel’s keywords. A second basis was the analysis of NA by means of VMP. The final linguistic segmentation of NA is based on the findings of both analyses. While this combination of different bases provides a firm basis for the segmentation, it also allows for the analysis of whether the choice of different semantic criteria results in different segmentations of the text. In the analysis above, the results largely correspond, so that the lexical segmentation of NA is not ambiguous. This shows that every text part is a unit of meaning within the complete text. Also, the complete text is a unit of meaning in language. However, the results from the different analyses performed in the process do not completely correspond, as VMP identifies three peaks within NA3. These peaks indicate developments in the content within NA3 which are mirrored in its lexis. This shows that literary criticism bases its segmentation on both content and lexical changes within the novel, but also that it seems to overlook the beginning of these changes at about 30,000 words. 
The linguistic segmentation of NA into three text parts and two phases therefore presents an alternative to the literary segmentation. This alternative takes into account the lexical features that seemingly influence the literary segmentation.
The linguistic segmentation of NA presented above differs distinctly from that of literary studies. The difference between the beginning of the second part of the novel identified by literary critics and the beginning of NA3 is about 15,000 words with the second part of the literary segmentation starting within NA3. This difference is about one seventh of the text. This is not much in absolute terms, but the differences between the contents of the two parts are significant. In addition to contributing to a linguistic segmentation of NA, the distribution diagrams presented in this chapter have also shown that emotions are an ongoing topic throughout NA, which creates cohesion across the text and forms the background for other topics. The analysis, therefore, has also provided information on the cohesive structure of the data.
7.3 Segmentation of the corpus Austen
Having demonstrated the segmentation of a text by means of distribution diagrams and VMP, what follows is a similar analysis of the corpus Austen. As in the previous analysis, the corpus’ keywords identified in Chapter 5 Keywords and concordance lines are used as the lexical basis of the analysis. The aim of the analysis is to identify the transitions between the texts that form the corpus. The analysis is performed in several steps. First, the distributions of the lemmata from the two semantic fields family and social relationships and place names are established. Second, the possibility to segment a corpus into its constituent parts on the basis of distribution diagrams of keywords is evaluated. The conclusions drawn from this are tested by, third, using VMP to analyse the corpus. This analysis shows that the analytic steps proposed are more successful for segmenting a text than a corpus. The corpus Austen includes Austen’s novels in the order of Table 7.8.
Table 7.8 The corpus Austen
Text | Wordspan in the corpus
MP | 0 to ca. 160,000
P&P | ca. 160,001 to ca. 282,000
EM | ca. 282,001 to ca. 442,500
NA | ca. 442,501 to ca. 520,500
Per | ca. 520,501 to ca. 603,500
S&S | ca. 603,501 to ca. 723,500
7.3.1 Family and social relationships
Family and social relationships were identified as an important topic of the corpus by means of a keywords analysis which compared Austen to ContempLit (cf. Chapter 5 Keywords and concordance lines). The lemmata analysed for their literary meanings in Chapter 5 are FAMIL*, SISTER*, DAUGHTER*, COUSIN* and ACQUAINTANCE*. These lemmata also form the basis of the following analysis. Figure 7.11 shows their distributions across the corpus. It shows that all lemmata except COUSIN* and SISTER* occur throughout the entire corpus. The diagram shows neither clusters nor absences of the lemmata. Phases of relative absences only occur in the distributions of COUSIN* and SISTER*. COUSIN* occurs infrequently in the span between words 234,400 and 532,883, and SISTER* is frequently absent between words 322,421 and 429,075. The overlap between these two spans is about 107,000 words. The absences of the lemmata do not correlate with transitions between the different texts from which the corpus is compiled. The absence of COUSIN* falls into the spans of P&P (words ca. 160,000 to 282,000), EM (words ca. 282,001 to 442,000) and NA (words ca. 442,001 to 520,000). The only novel that is complete within this span of absence is EM. The span of absence of SISTER* includes the novels EM and NA. This shows that the distributions of lemmata from the semantic field family and social relationships do not identify points of transition between the texts that make up Austen. Consequently, the corpus cannot be segmented into its constituent texts by the means that are successful for the text NA.
7.3.2 Place names
In the analysis of NA, not only lemmata denoting textuality, but also place names were analysed for their distributions. This is why the distributions of place names in Austen are analysed as a second means of segmenting the corpus (cf. Table 7.9 for the first 20 place names on the list of keywords from the comparison of Austen with ContempLit). Table 7.9 shows that there is no even distribution of place names from the six novels on the list of keywords. In fact, the first five place names on the list occur in only three different novels. The first place names of NA are numbers 19 and 20 on the list. The first place name which is characteristic for S&S (Weymouth) is number 298 on the list. Apart from NA and S&S, all other novels are represented by the first ten names on the list of keywords.
Figure 7.11 Family and social relationships in Austen (distribution diagram; axis: number of words in the corpus, 0 to 800,000)
Table 7.9 Place names in Austen
Position on the list of keywords | Place name | Novel(s) in which the place name occurs
38 | Mansfield | MP
41 | Hartfield | EM
52 | Highbury | EM
61 | Randalls | EM
64 | Longbourn | P&P
72 | Bath | EM, MP, NA, P&P, Per, S&S
75 | Uppercross | Per
78 | Kellynch | Per
79 | Netherfield | P&P
84 | Lyme | Per
99 | Meryton | P&P
106 | Sotherton | MP
109 | Pemberley | P&P
123 | Donwell | EM
124 | Rosings | P&P
143 | Hertfordshire | P&P
160 | Enscombe | EM
170 | Portsmouth | MP, Per
172 | Northanger | NA
219 | Fullerton | NA
Bath has a special position on the list as it occurs in all six novels of the corpus. It is not characteristic for any of the texts and does not identify one novel exclusively. A diagram of the distributions of the first ten place names on the list of keywords (Figure 7.12) clearly shows the different texts within the corpus. They are indicated by the nearly horizontal lines in the diagram. Figure 7.12 shows that the end of one line mostly corresponds to the beginning of another line, that is, the end of one novel marks the beginning of another. This shows the points of transition between the different texts in the corpus. To identify a specific text though, the analyst needs knowledge of the texts, since the specific places have to be assigned to the texts. This makes an analysis, which aims at identifying the sequence of the specific texts, circular. However, an analysis aiming at identifying the points of transitions between the texts in general does not require knowledge of the texts so that it is both successful and non-circular.
Figure 7.12 Place names in Austen (distributions of bath, uppercross, lyme, kellynch, highbury, randalls, hartfield, netherfield, longbourn and mansfield; axes: number of words in the corpus, 0 to 800,000, and number of occurrences in the corpus)
The beginnings and ends of the distributions of the lemmata indicate the beginnings and ends of the different texts. According to the diagram in Figure 7.12, the six novels cover the following spans:
novel 1: 0 to ca. 160,000
novel 2: ca. 160,000 to ca. 280,000
novel 3: ca. 280,000 to ca. 440,000
novel 4: ca. 440,000 to ca. 520,000
novel 5: ca. 520,000 to ca. 600,000
novel 6: ca. 600,000 to ca. 724,000.
This corresponds to the spans that the novels cover in the corpus.
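The reading of Figure 7.12 described in this section can be approximated programmatically. The following is only a minimal sketch: it assumes the corpus is available as a flat list of tokens, and the function name occurrence_spans is hypothetical and not part of any software used in this book. It records, for every place name, the positions of its first and last occurrence, which correspond to the beginnings and ends of the nearly horizontal lines in the diagram.

```python
def occurrence_spans(tokens, place_names):
    """Return {name: (first_position, last_position)} for each name that occurs.

    Positions are 1-based word counts, matching the word spans used in the text.
    `place_names` is a set of lower-cased names chosen by the analyst.
    """
    spans = {}
    for position, token in enumerate(tokens, start=1):
        word = token.lower()
        if word in place_names:
            # Keep the first position once it is set; always update the last one.
            first, _ = spans.get(word, (position, position))
            spans[word] = (first, position)
    return spans


def ordered_by_first_occurrence(spans):
    """List (name, first, last) sorted by first occurrence in the corpus."""
    return sorted(((name, first, last) for name, (first, last) in spans.items()),
                  key=lambda item: item[1])
```

A name such as bath, which occurs in all six novels, would yield one span covering nearly the whole corpus and therefore would not help to locate a transition; the text-specific names are the informative ones.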
7.3.3 Vocabulary Management Profiles
Youmans’ software VMP (2001) does not allow for a reliable segmentation of the corpus into its constituent texts. This is demonstrated by the diagram in Figure 7.13, which shows the analysis of Austen by using VMP with an interval of 101 words. The curve of the diagram (Figure 7.13) has several peaks which indicate the introduction of a large number of new types into the corpus. This indicates the introduction of new topics into the corpus at these points. But the curve does not only rise at the beginnings of new texts within the corpus, it also rises within the texts. There are peaks that reach above the first horizontal line of the diagram (indicating a type-token ratio of 0.1, marked by a green line) at the very beginning of the diagram, at about 35,000 to 60,000, 140,000 to 160,000, 230,000, 290,000, 360,000, 420,000 to 565,000, 600,000 and 640,000 to 700,000 words. Transitions between the texts occur at the beginning of the corpus, at about 160,000, 282,000, 442,000, 520,000, 603,000 and 723,000 words. This shows that the diagram in Figure 7.13 identifies not only texts, but also text parts as these are defined on the basis of their content as expressed by their lexis. While this confirms the findings for NA from earlier in this chapter, it is impossible to identify single complex texts within a corpus by using VMP; complex texts have more than one topic and more than one text part. Consequently, VMP is not successful in segmenting a corpus compiled from several novels. However, the segmentation of the corpus on the basis of place names is successful.
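To make the kind of curve discussed here more concrete, the following sketch computes a simple moving ratio of new types to tokens. It is an illustration only and not Youmans’ software: it assumes that the curve reports, for every window of 101 tokens, the share of tokens that occur for the first time in the data, and the function name vocabulary_management_profile is hypothetical.

```python
def vocabulary_management_profile(tokens, interval=101):
    """Share of first occurrences in each moving window of `interval` tokens.

    Returns one ratio per window position; an empty list if the data is
    shorter than one interval.
    """
    seen = set()
    is_new = []
    for token in tokens:
        word = token.lower()
        is_new.append(word not in seen)  # True only for the first occurrence
        seen.add(word)

    profile = []
    window_sum = sum(is_new[:interval])
    for start in range(len(tokens) - interval + 1):
        if start > 0:
            # Slide the window by one token: add the entering, drop the leaving one.
            window_sum += is_new[start + interval - 1] - is_new[start - 1]
        profile.append(window_sum / interval)
    return profile
```

Under this assumption, a peak in the returned values marks a stretch in which many new types are introduced, which may coincide with the beginning of a new text but may equally occur within a text.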
Figure 7.13 VMP of Austen, interval 101 (VMP 2.2 curve; horizontal axis: tokens, 0 to 700,000; vertical axis: types/tokens, 0 to 1)
7.3.4 Implications
The segmentation above shows that the procedure which is successful for a text has to be modified for a corpus. The segmentation of a corpus into its constituent texts by analysing the distribution diagrams of lemmata from a dominant semantic field of keywords is not successful, since words from the semantic field occur throughout the whole corpus. It is not possible to segment the corpus into its constituent texts or into text parts by the means that are successful for a text. Instead, the analysis of place names as characteristic features of the different texts in Austen is successful for its segmentation, even though place names were not successful for the segmentation of NA (cf. section 7.2.1). The hypothesis formulated in Chapter 5 Keywords and concordance lines, that proper nouns and place names are of minor significance on lists of keywords since they are text-specific, has to be discarded for the segmentation of a corpus. It is precisely the fact that they are text-specific that allows for the segmentation of a corpus into its constituent texts by analysing their distributions. This finding shows that different strategies are needed for segmenting a text and a corpus. For a text, the analysis of lexis that (1) has been identified by a keyword analysis and (2) indicates important topics in the text is helpful. For a corpus, it is necessary to rely on lexis that is data-specific, for example on proper nouns or place names, in order to identify the different texts compiling the corpus. The success of an analysis is influenced by the corpus’ lexical homogeneity or heterogeneity (cf. section 7.4). The homogeneous corpus Austen consists of Jane Austen’s six novels which all have rather similar contents and which therefore use rather similar lexis. 
Even though every novel has got its own characteristics and protagonists, their lexis is not a distinguishing feature of the texts as their topics and plots are too similar for the lexis to be divergent. Consequently, an analysis of the distribution of lexis that is characteristic of the different texts, that is, place names, as a means of segmenting a corpus is successful for the analysis of the lexically homogeneous corpus Austen. This similarity of lexis between Austen’s novels is illustrated, for example, by the lists of keywords resulting from the comparison of NA with different reference corpora (cf. section 7.2). The fact that no words from the semantic field emotions are identified in comparisons of the novel with Austen and Austen5 indicates the semantic homogeneity of Austen’s novels. This assumption of Austen’s lexical homogeneity is tested in the following section.
7.4 Linguistic homogeneity and heterogeneity
Texts and corpora differ in important aspects. A text is a natural and complete linguistic unit. A corpus, on the other hand, is an artificial collection of various texts and text fragments. The constituents of a corpus do not necessarily form a unit in terms of content or language. Corpora can be compiled from a number of single texts or text fragments that may be unconnected and which represent a particular language or language variety. One example of a corpus which represents a language variety is ContempLit. A second form of corpora are those which include all texts on a specific topic or by a specific author, that is, all the language that is relevant for an analysis. One example of such a corpus is Austen which includes all novels by the author Jane Austen. Due to the differences between texts and corpora and between the different kinds of corpora, they are used for different purposes. This is demonstrated by the analyses in this book. In this book, ContempLit is analysed to gain insights into literary language at Austen’s time, and the text NA is analysed to gain insights into the complete text with its linguistic structures and its content. These insights are gained, since the text is one unit so that conclusions are valid for the entire data. By analysing Austen, insights into the complete oeuvre and the idiolect of the author of NA are gained. The analysis also allows us to test whether conclusions regarding NA can be generalized to all of Austen’s works. This linguistically tests the hypothesis by literary critics that NA is distinct from the author’s other works (cf. for instance Craik 1965). The nature of these differences is not discussed by literary critics, but they nevertheless regret them. This is made explicit by Craik: (. . .) 
it is clearly inconvenient to treat this short, early, unpublished piece [Northanger Abbey] in the midst of the finest and most finished works: it interrupts a perceptible line of development and appears (unnecessarily) to its own disadvantage by comparison with their finished excellence. (1965: 4) In addition, Craik attributes ‘an air of incompleteness’ (1965: 4) to NA. And Litz further says that NA ‘lacks the narrative sophistication of the later works’, because Austen was ‘experimenting (. . .) with several narrative methods she had not fully mastered’ (1965: np, quoted in Auerbach 2004).
The following analyses in this section show that there are no lexical reasons or differences among Austen’s novels for this devastating judgement, since Austen’s novels are lexically homogeneous. The assumption of possible phraseological differences between NA and Austen’s other novels is negated in Chapter 6 Phraseology, differences in content are negated in Chapter 5 Keywords and concordance lines. As a second corpus, also ContempLit is analysed for its lexical homogeneity or heterogeneity. Unlike Austen, ContempLit is a corpus representing a literary period. It includes texts from different decades by different authors and dealing with different topics. Apart from the fact that the texts belong to the same genre, namely prose, and are roughly contemporary with Austen, there are no further formal similarities between them. They can therefore be assumed to be lexically heterogeneous, an assumption that is confirmed by the following analysis. Comparing the findings for Austen and ContempLit with each other gives insights into lexical conventions of literary language of Austen’s time and into the idiolect of Austen, that is, into whether, and if so how, Austen’s language deviates from the conventions of her time. Since there are no linguistic definitions of ‘homogeneity’ and ‘heterogeneity’, the definitions from The Concise Oxford Dictionary (1995) are adopted for the present. The dictionary defines homogeneous as ‘1 of the same kind. 2 consisting of parts all of the same kind; uniform (. . .)’ and heterogeneous as ‘1 diverse in character. 2 varied in content (. . .)’. From a linguistic perspective, this means that data is homogeneous when its language, e.g. its lexis, grammar, and phraseology, is statistically significantly similar in these categories. Conversely, data is heterogeneous when its language differs statistically significantly in these categories. 
Mirroring the absence of linguistic definitions of ‘homogeneity’ and ‘heterogeneity’, Kilgarriff (2001) states that there is also no generally accepted statistical test for linguistic homogeneity or heterogeneity of data: ‘there are no established measures for homogeneity’ (2). And in fact, Kilgarriff intuitively classifies different kinds of corpora as homogeneous or heterogeneous and writes that statistical tests ‘must match our intuitions’ (14) on similarities or differences among different data. He therefore assumes an implicit homogeneity of a corpus which includes texts of the same genre, for example a corpus of software manuals. In contrast, he calls the Brown Corpus heterogeneous. Kilgarriff does not provide statistical evidence for his claims (16), but he proposes a statistical test for determining a corpus’ homogeneity or heterogeneity later in his article. This test is explained and applied in the following analysis. Based on Kilgarriff’s intuitive classification,
Austen would be considered a homogeneous corpus. The classification of ContempLit, however, would not be so straightforward. Also Burrows (2005) assumes that Austen’s novels are linguistically homogeneous. He writes that ‘by definition, it [a corpus compiled of Austen’s six novels] is authorially homogeneous’ (1475), since it includes works by one author only. His classification seems to be intuitive, like that by Kilgarriff (2001), since Burrows neither gives statistical evidence for his claim nor explains his classification. Corpus linguistics provides neither a definition of linguistic homogeneity and heterogeneity nor standardized statistical tests for their calculation. This is confirmed by Kilgarriff (2001: 23) who writes that ‘[c]orpus linguistics lacks a vocabulary for talking, quantitatively, about similarities and differences between corpora.’ In order to close this gap, Kilgarriff suggests the chi-square test for determining linguistic homogeneity or heterogeneity of corpora: χ2 is presented as a suitable measure for comparing corpora, and is shown to be the best measure of those tested. It can be used for measuring the similarity of a corpus to itself, as well as the similarity of one corpus to another (. . .). (2001: 23) The latter is, according to Kilgarriff (2001), best achieved by an analysis which combines the χ2-test with CBDF-values (chi-by-degrees-of-freedom-values). This test procedure is adopted and demonstrated in the analysis in section 7.4.2. In section 7.4.2, the corpora Austen and ContempLit are analysed for their respective lexical homogeneity or heterogeneity. The text NA is not analysed, since lexical homogeneity can be assumed for a text. As the definition of a text says (see above), a text is a natural linguistic unit. While this allows for discontinuity in lexis and content, it is nevertheless minimized because the text’s unity must be maintained. A text can therefore be assumed to be linguistically homogeneous. 
Even though this is only an assumption, it must be accepted at present, since the statistical tests proposed by Kilgarriff (2001) and which are used in the following, cannot be applied to texts. The tests are based on statistical comparisons between two sets of data, in this case two parts of a corpus, which result in numerical values for all comparisons. The mean score of all values generated for one corpus can be compared to that of other corpora. This results in a scale of homogeneity or heterogeneity of the corpora.
While corpora consist of texts or text fragments, texts do not consist of independent and objective units. Their only objective units might be chapters or paragraphs as these are divisions set by the author of a text. But since chapters and paragraphs contain only a limited number of words, they are too short to function as units for the analysis proposed by Kilgarriff (2001). This is because lexical variation is one of the key preconditions of the test’s success. Text parts are not relevant units for the analysis either, since different recipients perceive them differently. Taking NA as an example, literary critics segment the text differently than the linguistic analysis suggests (cf. section 7.2.1). Using text parts as units for the analysis would therefore introduce a highly subjective element into the analysis. Furthermore, even long texts only contain few text parts, so that only few numerical values could be established per text. This would produce too little data to make a valid statement on the text’s linguistic homogeneity or heterogeneity. Consequently, the statistical test proposed by Kilgarriff (2001) is a valid tool for the analysis of corpora, but not of texts. This is implicitly affirmed by Kilgarriff (2001) who only writes about homogeneity and heterogeneity of corpora.
7.4.1 The statistics
According to Kilgarriff (2001), the so-called CBDF-value, that is, the chi-by-degrees-of-freedom-value, is the best measure to establish lexical similarities or differences between the different components that constitute a corpus and therefore differences within the corpus itself. This CBDF-value is generated by way of the χ2-test and quantifies lexical similarities between the two texts that are used for its calculation. CBDF-values are inherently comparative as their calculation is based on comparing two texts with each other. For the following analysis, the corpora Austen and ContempLit are split into their constituent texts which are each used as components of the respective corpus to calculate its CBDF-values. The mean of all values generated for one corpus is a measure of its homogeneity or heterogeneity. The single texts are chosen as corpus constituents, since they are objective units in it. Calculating CBDF-values for comparisons of all texts from one corpus with all other texts from the same corpus and then comparing these values both with each other and with other corpora serves two goals. First, it allows the corpus’ inherent homogeneity or heterogeneity to be established. Second, it allows the corpus’ homogeneity or heterogeneity in comparison with other corpora to be established as well.
Calculating CBDF-values requires several steps which are explained in the following. These explanations closely follow the steps outlined by Kilgarriff (2001). Kilgarriff (2001: 20) takes

\[ \chi^2 = \sum \frac{(o - e)^2}{e} \]

as the basic formula for his statistical test to establish lexical homogeneity or heterogeneity of a corpus. In this formula,
∑ is a mathematical operation which adds up all values resulting from the term $(o - e)^2 / e$ from one corpus,
o is the observed frequency, that is, the absolute frequency of a word in a text A, and
e is the expected value, that is, the expected frequency of a word in text A.
The latter means the absolute frequency of a type that is expected when comparing the percentages of one word type between a text A and a text B. This means, exemplified by a fictitious example, that when one type constitutes five per cent of a text A, but only four per cent of a text B, then the absolute frequency e of this type in text A is calculated as if it constituted four per cent. This is the expected frequency. The formula above is used to calculate a value for every word type with the value being based on a comparison between frequencies of the same word type in a different text. The sum Σ of these values is χ2. In order to calculate comparable values for the different texts which constitute a corpus, all texts are normalized to the same length. This is necessary, since different text lengths influence the values o and e of the formula above. Not normalizing the numbers would therefore result in non-comparable values for the different texts. This is again best illustrated by a fictitious example. Taking a text A with 100,000 words, five per cent of 100,000 are 5,000 words. But taking a text B with 1,000,000 words, five per cent of 1,000,000 are 50,000 words. As these numbers are the basis for the o and the e-values of the formula mentioned earlier, they influence the χ2-value, which is the result of the calculation described by the formula. If the o- and the e-values were not normalized to a common text length, χ2-values from different calculations would not be comparable. This would make valid and comparable statements on the homogeneity or heterogeneity of corpora impossible. Consequently, differences between text lengths are evened out in the following calculations by normalizing all texts to 1,000,000 words. This is the same number as the other normalizations in this book.
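The two fictitious examples above can be retraced in a few lines of code. This is a minimal sketch, not the procedure used to produce the results in this book; the function names normalized_frequencies and expected_frequency are hypothetical, and a text is assumed to be a simple list of word tokens.

```python
from collections import Counter

NORMALIZED_LENGTH = 1_000_000  # common text length named in the text


def normalized_frequencies(tokens, length=NORMALIZED_LENGTH):
    """Scale the absolute frequency of every type to a text of `length` words."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count * length / total for word, count in counts.items()}


def expected_frequency(share_in_b, length_of_a):
    """Frequency a type would have in text A if it had the same share as in text B."""
    return share_in_b * length_of_a


# Fictitious example from the text: the type makes up 5 per cent of text A
# but only 4 per cent of text B; text A is taken to have 100,000 words.
text_a_length = 100_000
observed = 0.05 * text_a_length                        # o: 5,000 occurrences
expected = expected_frequency(0.04, text_a_length)     # e: 4,000 occurrences
contribution = (observed - expected) ** 2 / expected   # this type's term in the sum
```

Once both texts are normalized to the same length, the expected frequency of a type in text A is simply its normalized frequency in text B, which is why the normalization makes the resulting values comparable.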
CBDF-values are calculated in several steps:
1. Every word type of a text A is compared with every word type in a text B. Every word occurring in text A but not in text B is excluded from the following comparison. The reason for this is a mathematical one: the absence of the type in text B would result in a 0-value for the term e, and the formula above requires a division by e, which is not defined for 0.
2. Every text of one corpus is first compared with every other text in the same corpus by the procedure described in the first step. Second, all texts, reduced to their remaining words, are compared with all other texts from the same corpus in both directions, so that every text determines both the o- and the e-values for every word of the texts. The values o and e differ depending on whether a text is text A or text B in this particular comparison. The result of this process is one value per word type for every comparison.
3. The values for all types of one text are added up to generate the sum Σ. This sum is the χ²-value for the comparison of two texts. A χ²-value is generated separately for every text in every comparison for which the text has been used. The more comparisons are performed between texts, the more values exist for every text: if a text is used in two comparisons, they result in two values; if it is used in six comparisons, they result in six values.
4. The χ²-value is a comparative value for two texts which indicates lexical similarities or differences between the occurrences of all word types in both texts. The smaller the χ²-value is, the smaller the differences are between the types’ observed and their expected frequencies, i.e. the smaller the lexical differences between the two texts which have been compared with each other. Conversely, the higher the χ²-value is, the larger the lexical differences are between the two texts.
5. The next step in the calculation of lexical homogeneity or heterogeneity is generating the CBDF-value for every χ²-value by dividing the χ²-value by the so-called degrees of freedom. The degrees of freedom are defined as n − 1, with n being the number of types of text B (Sachs 2004: 130, 209, 230). The result of this calculation is the CBDF-value, which Kilgarriff (2001) accepts as a measure of lexical similarities between texts. This value is calculated separately for every comparison of texts.
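The five steps above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not the procedure actually used for the analyses: it assumes that ‘the formula above’, which is not reproduced in this section, is Pearson’s (o − e)²/e, that texts are normalized before types are excluded, and that n counts all types of text B. The names `normalized_freqs` and `cbdf` are hypothetical.

```python
from collections import Counter

NORM_LENGTH = 1_000_000  # all texts are normalized to this length

def normalized_freqs(tokens, norm_length=NORM_LENGTH):
    """Frequency of every word type, scaled to a text of norm_length words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count * norm_length / total for word, count in counts.items()}

def cbdf(tokens_a, tokens_b, norm_length=NORM_LENGTH):
    """CBDF-value for the comparison of text A (observed) with text B (expected)."""
    o = normalized_freqs(tokens_a, norm_length)  # observed frequencies, text A
    e = normalized_freqs(tokens_b, norm_length)  # expected frequencies, text B
    # Step 1: types of text A that do not occur in text B are excluded (e would be 0).
    shared = [word for word in o if word in e]
    # Steps 2-3: one value per type, summed to the chi-square value (assumed formula).
    chi2 = sum((o[word] - e[word]) ** 2 / e[word] for word in shared)
    # Step 5: divide by the degrees of freedom, n - 1, n = number of types of text B.
    dof = len(e) - 1
    if dof < 1:
        raise ValueError("text B needs at least two word types")
    return chi2 / dof

# The result depends on which text is text A and which is text B.
a_vs_b = cbdf(["a", "a", "a", "b"], ["a", "b"])
b_vs_a = cbdf(["a", "b"], ["a", "a", "a", "b"])
```

Comparing a text with itself yields 0, the smallest possible value, and swapping text A and text B generally changes the result, which is why every pair of texts is compared in both directions.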
CBDF-values are interpreted in the same way as χ²-values: the higher the values are, the larger the lexical differences are between the two sets of data compared with each other. This is because the difference between the observed frequency of a type, o, and its expected frequency, e, decreases as the lexical similarity between the two texts increases, and the CBDF-value decreases with it. Zero is the smallest possible CBDF-value and is reached when the two texts are identical; there is no upper limit to the scale of CBDF-values.
7.4.2 The analysis
The comparisons between (1) all texts of Austen with each other and (2) all texts of ContempLit with each other in the way described above reveal visible differences between the lexical homogeneity or heterogeneity of the two corpora. The significance of these differences is established by a comparative approach. This is because (1) Kilgarriff does not give a value which would classify a corpus into either of the two categories and (2) CBDF-values depend on the corpus size they have been normalized to (cf. section 7.4.1). This means that all classifications of homogeneity or heterogeneity are relative to the size of the data. The first and also the clearest result of the analysis is that the mean score for all CBDF-values for Austen is about half of that for ContempLit. The mean score is calculated by first adding up all CBDF-values for one corpus and then dividing the resulting sum by the number of CBDF-values. This calculation yields a mean of 138.49 for Austen and a mean of 267.21 for ContempLit, so that the mean for ContempLit is nearly double that for Austen. This indicates a markedly higher lexical similarity, and therefore homogeneity, of Austen than of ContempLit. A second indicator of the relative homogeneity of Austen and the relative heterogeneity of ContempLit is provided by the highest and lowest CBDF-values for the respective corpora shown in Tables 7.10 and 7.11.
Table 7.10 Highest CBDF-values for Austen and ContempLit
corpus | highest value | text A | text B
Austen | 757.55 | EM | NA
ContempLit | 3813.37 | The Last Days of Pompeii | Tristram Shandy
Table 7.11 Lowest CBDF-values for Austen and ContempLit
corpus | lowest value | text A | text B
Austen | 39.44 | P&P | MP
ContempLit | 32.38 | The Heir of Redclyffe | The Last Days of Pompeii
While there is not much difference between the lowest values for both corpora, their highest values differ by more than 3,000 points, with the value for ContempLit being higher than that for Austen. This is a second indicator that Austen is lexically more homogeneous than ContempLit, since texts, and therefore also corpora compiled from these texts, are more heterogeneous the higher their CBDF-values are (as shown earlier). A difference of more than 3,000 points between the two pairs of texts which are most dissimilar within their respective corpora therefore suggests that the corpus which includes the pair with the higher value is more heterogeneous than the corpus with the lower value. This again identifies ContempLit as being lexically more heterogeneous than Austen. The third indicator of the relative homogeneity of Austen and the relative heterogeneity of ContempLit is the span of CBDF-values calculated for the two corpora. The span of ContempLit comprises nearly 3,800 points (3813.37 − 32.38 = 3780.99), the span of Austen about 720 points (757.55 − 39.44 = 718.11). The span of ContempLit is thus about 5.3 times as large as that of Austen. This shows that the texts which make up Austen are lexically more similar to each other, and therefore lexically more homogeneous, than those which make up ContempLit. The large span of ContempLit shows that the texts of this corpus differ greatly in their lexis. They are lexically heterogeneous. The analysis above has shown that Austen is lexically homogeneous while ContempLit is lexically heterogeneous. This conclusion is based on three parameters: (1) the comparison between the mean scores for CBDF-values of the two corpora, (2) the comparison between the highest and lowest CBDF-values of the two corpora, and (3) the comparison between the spans of values of the two corpora. Following this classification of homogeneity and heterogeneity of the two corpora, the next step is to analyse whether NA differs in its lexis from Austen’s other texts.
This is examined by looking at the CBDF-values for NA, both as a text A and as a text B. This procedure allows determining
whether the difference that literary critics perceive between NA and Austen’s other novels relates to the lexis of the texts. When first looking at the CBDF-values of Austen, one notices that NA constitutes text B in the comparison between texts which results in the highest CBDF-value of the corpus (cf. Table 7.10). This could indicate lexical differences between NA and the other novels. However, a look at the other comparisons in which NA constitutes either text A or text B shows that all other values of NA are lower than the mean score for the corpus. The suspected lexical differences between NA and the other novels are therefore not confirmed. In fact, NA is lexically homogeneous with Austen’s other novels, so that the literary critics’ impression of a difference between NA and Austen’s other novels is rejected for its lexis. This confirms the results from Chapter 6 Phraseology, where NA is found to be phraseologically homogeneous with Austen, and of Chapter 5 Keywords and concordance lines, where NA is found to be homogeneous in terms of content with Austen. The analysis in the present chapter shows that NA and Austen’s other novels are also lexically homogeneous. The novels which deviate the most from Austen’s other works and from each other in terms of their lexis are EM and Per, both as texts A and texts B in the analyses. Four of the ten CBDF-values for each text are above the mean score for the corpus, including the comparisons between EM and Per themselves with both texts constituting text A and text B. These values indicate that the two texts are the lexically most dissimilar ones within the corpus. Since CBDF-values for the texts in other comparisons are below the mean for the corpus and since the deviations from the mean mentioned above are rather small, especially in comparison to the deviations from the mean observed for ContempLit, EM and Per are still homogeneous with the author’s other novels. 
This confirms the classification of Austen as a homogeneous corpus.
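The three parameters used in this section (mean, highest and lowest value, span) and the check of a single text against the corpus mean can be expressed as a small sketch. The function names `summarize` and `above_mean` are hypothetical, and the toy values are invented for illustration; only the two span calculations at the end use figures reported in Tables 7.10 and 7.11.

```python
from statistics import mean

def summarize(cbdf_values):
    """cbdf_values maps (text_a, text_b) to the CBDF-value of that comparison."""
    scores = list(cbdf_values.values())
    highest_pair = max(cbdf_values, key=cbdf_values.get)
    lowest_pair = min(cbdf_values, key=cbdf_values.get)
    return {
        "mean": mean(scores),
        "highest": (highest_pair, cbdf_values[highest_pair]),
        "lowest": (lowest_pair, cbdf_values[lowest_pair]),
        "span": max(scores) - min(scores),
    }

def above_mean(cbdf_values, text):
    """Comparisons involving `text` (as text A or text B) that exceed the corpus mean."""
    corpus_mean = mean(list(cbdf_values.values()))
    return {
        pair: value
        for pair, value in cbdf_values.items()
        if text in pair and value > corpus_mean
    }

# Invented toy values for a three-text corpus (NOT figures from this book).
toy = {
    ("T1", "T2"): 40.0, ("T2", "T1"): 50.0,
    ("T1", "T3"): 90.0, ("T3", "T1"): 120.0,
    ("T2", "T3"): 60.0, ("T3", "T2"): 300.0,
}
summary = summarize(toy)
outliers_t1 = above_mean(toy, "T1")

# Spans recomputed from the highest and lowest values in Tables 7.10 and 7.11.
span_austen = 757.55 - 39.44
span_contemplit = 3813.37 - 32.38
span_ratio = span_contemplit / span_austen
```

With the reported figures, the spans come to about 718 and 3,781 points, a ratio of roughly 5.3. A text such as NA can be inspected in the same way as `T1` in the toy data: a single value above the mean does not by itself make the text heterogeneous if its other comparisons stay below the mean.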
7.4.3 Implications
The analyses of corpus homogeneity and heterogeneity have shown that these two classifications are statistically relative measures, as they derive from comparisons of corpora. There is no absolute measure of lexical homogeneity or heterogeneity of corpora. But this relativity does not call the classification of corpora into question. On the contrary, it allows us to compare different corpora with each other and determine their relative degrees of lexical homogeneity or heterogeneity. This, in turn, permits the
establishment of a scale of homogeneity and heterogeneity of corpora that is helpful for comparisons of several corpora. Furthermore, this procedure calculates separate values for every constituent of a corpus. These values reveal information about the corpus design, for example whether all texts of a corpus are homogeneous or whether some of the texts are heterogeneous in relation to other texts. Values indicating heterogeneous texts might be evened out in a mean score for a complete corpus. The analysis demonstrated above is restricted to the lexis of a corpus and more specifically to its word types. It does not allow for the analysis of grammatical and phraseological patterns. One approach to analysing the phraseology of a corpus is demonstrated in Chapter 6 Phraseology, but grammatical homogeneity or heterogeneity of corpora cannot be analysed at present. From a linguistic point of view, this seems problematic at first glance, since grammar contributes to the meaning of a text (it makes a difference, for instance, whether an action is told in the active or in the passive voice, since the agent occurs only in the active voice) and since lexis and grammar influence each other (cf. Halliday’s lexico-grammar, 1985). For present purposes, however, these concerns are of less importance than they seem. The content of a text is mainly conveyed by its lexis. Grammatical variations and patterns are of minor importance for the openly visible meanings of a text. This justifies the analysis of lexis as a criterion for corpus homogeneity or heterogeneity. The content of a corpus and of the texts of which it is compiled is of prime importance for its categorization. This is implicitly confirmed by Kilgarriff (2001), who intuitively classifies a corpus consisting of software manuals as homogeneous. Furthermore, the choice of grammatical patterns that should be analysed for this classification would introduce a highly subjective element into the analysis. 
Also, tagging grammatical patterns in a corpus would be a partly subjective process which could cast doubt on the objectivity of the classification. Analysing lexis is the analysis of a straightforward linguistic phenomenon. This makes the analysis replicable, so that it conforms to the evaluation criteria replicability and checkability (cf. Chapter 2 Goals, techniques, principles) and therefore contributes to the transparency of an analysis.
7.5 Concluding comments
The analyses demonstrated above give insights into the structures of the text and corpora, both into their constituent parts and into their lexical
characteristic features. These insights contribute to the interpretation of the data, such as of the progression of topical foci within NA and Austen. The segmentations presented in sections 7.2 and 7.3 indicate topical foci that go beyond the author’s structural segmentations into, for example, paragraphs or chapters. This is one reason why the text parts of NA identified by corpus linguistic techniques cannot be distinguished optically from the rest of the text. Therefore, an intuitive segmentation which takes linguistic changes into account is, at best, difficult to achieve or, at worst, cannot be achieved at all. This is shown by the differences between the segmentations of NA by corpus linguistic means and by literary critics. The additional knowledge gained from the linguistic segmentation of the data allows a more enhanced understanding of the progression and the structure of the data than knowledge gained from the literary critical segmentation alone. It further supports the decoding and the revelation of the argument structure of the data for a reader. Consequently, revealing the structure of a text supports understanding a text in more detail and depth than would otherwise be possible. This is also true for a corpus and its compilation. Furthermore, the analysis has also yielded literary insights into NA, for example, into the characters’ relationships with each other. It has been shown, among other points, that the relationships portrayed in NA1 and NA3 are successful in the course of the novel, while relationships portrayed in NA2 fail. The analysis of the homogeneity and heterogeneity of Austen and ContempLit has facilitated further insight into the structure of the corpora. This is useful in evaluating the significance of findings from a corpus for the complete set of data, that is, for the entire corpus. Furthermore, it allowed NA to be positioned in the linguistic context of Austen’s oeuvre. 
This revealed that the literary critical position claiming differences between NA and Austen’s other novels (as discussed earlier) does not have a linguistic basis. This has enlarged the knowledge on Austen’s works and also demonstrated the success of further corpus linguistic techniques in the analysis of structural features and the meanings of literary works.
Chapter 8
Conclusion
‘[L]iterature is a prime example of language use’ (Sinclair 1982: 51). Similar to language in general, literature consists of words, phrases, sentences and text parts in syntagmatic and linear relations. The different constituents refer to each other, and these references create a text. This linearity of language is also emphasized by Sinclair and Mauranen (2006) in their Linear Unit Grammar. Besides its linear connections, language is also a network created by various linguistic features. One example of these features are intratextual references which refer forwards and backwards within a text, such as the repeated use of words from a common semantic field or the repeated use of one phrase throughout a text. These references occur on all linguistic levels and form networks which encode meaning in language. Unlike a text, a corpus as a compilation of texts or text fragments does not have linear relationships between its constituents. Linearity is restricted to the separate components of the corpus. Connections between the components result from intertextual references which have the characteristics of a network. They are, for example, words or phrases with either a common content, indicated by a common semantic field, or which are general features of language use. Their occurrence is a result of the inherently intertextual nature of language which always refers to previously made utterances (cf. Firth 1957a, Teubert 2005). The analyses in this book have shown some of the networks in a text and a corpus, and how they encode meaning in data. The starting points of the analyses are lexis, phraseology and the text as units of meaning in language. In addition, Chapter 7 Text segmentation also identifies text parts as units of meaning. The analyses in this book have demonstrated that the different units of meaning influence and thus cannot be distinguished clearly from each
other. They are hierarchically ordered and the smaller units constitute the larger ones:
• Phrases, text parts and texts consist of words,
• Text parts and texts consist of words and phrases,
• Texts consist of words, phrases and text parts.
These units are interrelated so that the analysis of one unit might include that of another. This is because one unit of meaning might point to another, as, for example, in Chapter 7 Text segmentation where lexis is the criterion for the segmentation of a text into its constituent parts. The two units of meaning word and text part are inherently connected to and constitute each other. Yet, this interdependence of linguistic units does not result in a restriction of the smaller units to, for example, only one text part. Their distributions could cover every text part, but the frequency of their occurrence might vary across the complete text. Thus, it was sufficient in the demonstrated analyses to scrutinize only selected lexis, and not the complete lexis of the data. Its variable distributions across the data could be used to gain an overview of a text’s and corpus’ lexical structure, since it shows where the lexis, and therefore the topic(s) depicted by the lexis, occurs. This is information on the text and corpus structures and their contents gained from both the occurrence and the absence of a linguistic feature. The function of a unit of meaning depends on its context. A word, for example, can function mainly as a lexical unit, as with a wordlist. However, it can also be a keyword when two sets of data are compared with each other, or it can be part of a semantic field. Despite these different functions and labels, the word as such remains the same. It is its interpretation and significance for an analysis which depend on the form of its identification and on the question set for the analysis. This variability in interpretation and significance means that there might be different reasons for choosing the same entity, for instance a word or a phrase, for an analysis. This is demonstrated by Chapters 5 Keywords and concordance lines and 7 Text segmentation which both look at lexis as their units of analysis. 
In Chapter 5, keywords are generated and their collocations and colligations are used as indicators for their contributions to the data’s topics and plot development. In Chapter 7, the same words are used for analyses in which they are interpreted as structural components of the data. Because of their significance for the content of the data, they are found to also be relevant for its structure, and they are therefore used to
segment the data into its constituent parts. The meanings and functions of the words are the same in both analyses, but they focus on different aspects in order to answer different questions. The choice of aspect in an analysis is based on its respective research question. The choice of linguistic items that are selected for the analyses in this book is based on their frequencies in the data. Only the most frequent features are analysed, so that all findings in this book are based on only a relatively small part of the data. The majority of the data is ignored. The reason for this selection is the correlation between frequency and significance of linguistic data as described in Chapter 2 Goals, techniques, principles. The more frequently a linguistic pattern occurs, the more significant it is for the content and the structure of the data. Choosing the most frequent realizations of words and phrases as units of analyses therefore results in knowledge on important and dominant content, meanings and structural features of the data. Knowledge gained from these analyses is relevant for the complete data and highlights ongoing linguistic and contextual features of it, since the relationship between frequency and significance of linguistic features (cf. Chapter 2 Goals, techniques, principles) asserts their relevance for the complete data. This is demonstrated, for instance, in Chapter 6 Phraseology where people and places in the complete text NA are characterized by analysing the novel’s most frequent phrases. While the correlation between frequency and significance is primary in corpus linguistic analyses, singular or infrequent linguistic features might also be relevant for the meaning of the data, especially in literary works. These features are not identified by the means presented in this book, and they are not objects of analysis in corpus linguistic research as demonstrated here (cf. Chapters 2 Goals, techniques, principles and 3 Language and meaning). 
Within stylistics, they are examined, for example, in an analysis in the tradition of social pragmatics or in one modelled after discourse analysis. They are also discussed by literary critics. Corpus stylistics neither wants nor claims inclusiveness of all linguistic features in an analysis. Its strength and goal are to generate insights into continuous and dominant features of the data by looking at selected, that is, frequent, patterns in the data. Corpus linguistic analyses do not look at singular features in the way literary critical analyses do, but rather cover long stretches of text in linguistic detail and reveal their meanings. This allows for a systematic and comprehensive view of the complete data, which distinguishes corpus stylistics both from literary criticism and from other branches of stylistics. It also results in its large potential for uncovering hitherto unknown meanings of the data. This, in turn, results in corpus
stylistics being an important complement to literary criticism with both disciplines analysing different facets of literary works. In Chapter 2 Goals, techniques, principles, the goals and four evaluation criteria for the analyses in this book were described. The following section is (1) an evaluation as to whether the goals have been fulfilled and (2) an evaluation of the analyses performed in this book on the basis of the criteria set up. The goals for this book were 1. to demonstrate the usefulness of corpus linguistic techniques in a stylistic analysis of literary texts and corpora, 2. to gain literary and structural insights into the analysed data, and 3. to develop the techniques demonstrated in the analyses so that they can be adapted to the analysis of other texts and corpora. The evaluation criteria were growth of knowledge, replicability, checkability and innovativeness of analyses. All analyses in this book have generated insights into the contents and the structures of the different sets of data. These insights frequently extend beyond those of current literary criticism, having generated insights into the meanings of the data that literary critics had not discussed previously. This shows that a detailed analysis of linguistic patterns, which are not intuitively recognizable, decodes meanings which are also intuitively, as in literary criticism, virtually impossible to recognize or even unrecognizable. This is particularly significant, since Austen’s novels have been subjects of literary criticism for the past two centuries. They have been analysed intensively and their interpretations are seemingly comprehensive. Finding new meanings in the texts by using different analytic techniques therefore does not show an inability of the literary critical analytic techniques, but rather the new and great potential of the electronically-aided, that is, corpus linguistic, techniques for the analysis of literary texts. 
Analysing a text and a corpus by these means has demonstrated (1) the techniques’ analytic usefulness and potential and (2) the need to adapt the techniques to different kinds of data. Only this adaptation ensures that the analyses generate insights into different types of data. The kinds of adaptations that are necessary for the analyses depend on the questions set and on the data. This is demonstrated, for example, in Chapter 6 Phraseology where insights into the content and the structures of different sets of data, that is, one text and two corpora, are gained. Also, Chapter 7 Text segmentation shows that an analytic strategy might be successful for a text,
but has to be modified for a corpus. The knowledge of necessary adaptations of the techniques for different analyses gained from the analyses in this book can be transferred to other fiction and non-fiction texts and corpora. The analytic questions determine both the focus of an analysis, for example on intertextual references in NA, and the patterns which are most intensively analysed, for example keywords. By doing so, insights can be gained into either the data’s content or structure or both. This book has offered its readers new methodological, theoretical, interpretative and structural insights into the data and the usage and usefulness of corpus linguistic analytic techniques in stylistics. These insights are based on theoretical discussions and practical applications of the theory in several analyses.
• Methodological insights concern, in particular, the use of corpus linguistic techniques in stylistics in general and, more specifically, the combination of techniques, such as using keywords and distribution analyses for a text segmentation (cf. Chapter 7 Text segmentation).
• Theoretical insights concern, for example, the relationship between langue and parole, between text and corpus, subjectivity and objectivity in corpus linguistics and units of meaning in language (cf. Chapter 2 Goals, techniques, principles).
• Interpretative insights concern, for example, (1) the characterization of the protagonists in NA by way of their reading habits and (2) the dominant role of irony in NA, which is connected to the novel’s dominant topic, textuality (cf. Chapter 5 Keywords and concordance lines).
• Structural insights concern, for example, the frequency of grammatical negations and the use of frequent phrases as a means of characterizing places in NA (cf. Chapters 5 Keywords and concordance lines and 6 Phraseology).
New insights always involve innovations both in terms of generating these insights and in dealing with them. They are frequently generated by using new analytic techniques or by following a new theoretical approach. Furthermore, they may result in a reinterpretation of previous knowledge. Innovations in this book are concerned with both the use of analytic techniques and interpretative insights into the data analysed, with the use of the former having led to the latter. Analytic innovations from the analyses include the use of corpus linguistic techniques in stylistics and the combination of different techniques in one analysis. Interpretative innovations include the influence of family or social closeness or distance
on the characters’ perceptions of other characters in Austen’s novels (cf. Chapter 5 Keywords and concordance lines). One further interpretative innovation is the knowledge of the progression of the narrative’s themes and topics, which is used to segment the novel into its constituent parts (cf. Chapter 7 Text segmentation). Corpus stylistics is still a rather new field within corpus linguistics and its analytic techniques and potential continue to be developed and researched. This makes the scope of the present work, the range of the analyses and the analysis of different sets of data on different linguistic levels an innovation in the discipline. In this book, words, phrases, text parts and the text are analysed as units of meaning in language. The innovativeness of this scope becomes particularly clear when compared to other works in corpus stylistics (cf. Chapter 3 Language and meaning). However, the significance of these innovations is not only relevant for the present data. The analyses have demonstrated that corpus stylistics in general has great potential for gaining insight into the meanings of both literary and non-literary data, since the systematic and empirical analysis of language data leads to an in-depth understanding of its contents and structures. The means for adapting the analyses to other research are given by laying open the data and the different steps of the analyses. This transparency also allows them to be replicated and checked. The basis for methodological and interpretative decisions and the decisions themselves are made explicit. This furthers the significance of the present work for corpus linguistics and corpus stylistics, since the applicability of the techniques and the range of possible sets of data are demonstrated. It allows the adaptation of the analyses and analytic techniques to different sets of data. 
The discussion of theoretical questions of linguistics, corpus linguistics, stylistics and corpus stylistics, and the demonstration of analyses have embedded corpus stylistics in a theoretical framework and have shown its potential for deriving information on the content of the data. It has been demonstrated that corpus stylistics is well-suited to the formal analysis of literature and for gaining insights into the literary meanings of the data. It therefore fulfils Sinclair’s demand on linguistics with regard to literature that no systematic apparatus can claim to describe language if it does not embrace the literature also; and not as a freakish development, but as a natural specialization of categories which are required in other parts of the descriptive system. Further, the literature must be describable in terms which accord with the priorities of literary critics. (1982: 51)
This is precisely what corpus stylistics does: it analyses literary language in the same way as non-fiction data by using corpus linguistic techniques to reveal linguistic patterns in the data. These patterns are the basis of the data’s literary interpretation. The findings from the analyses in this book are therefore valid linguistic, interpretative and structural insights into the data, which reflect the priorities of literary critics, that is, gaining literary insights into the data, in linguistic analyses of literary texts. The detailed interpretative and structural insights into the data that have been gained in this book are powerful arguments for linguistic stylistic and particularly corpus stylistic analyses of literary texts and corpora. Treating literature like general language, but with the goal of gaining literary insights, opens up the entire analytic potential of linguistics in combination with the data potential of literary studies. Using literary texts which have been intensively discussed by literary critics allows linguists to evaluate both their analytic techniques and the knowledge gained from an analysis. This knowledge about the data is as objective as possible. Furthermore, much of it cannot be gained by literary critics and was impossible before the introduction of electronically-aided analyses in stylistics. It is only the detailed, systematic and electronic analysis of large quantities of linguistic data, such as texts and corpora, which allows for the decoding of this meaning. This prepares the field for a fruitful collaboration between literary critics and corpus stylisticians.
Appendix
Intertextual references in NA
Intertextual references, in particular to Radcliffe’s The Mysteries of Udolpho (1794) and Lewis’ The Monk (1796), are integral parts of NA. They are particularly prominent, though, when the narrator describes Catherine and Henry Tilney’s drive from Bath to Northanger Abbey, during which Henry amuses Catherine by telling her about the abbey (142–5). The following is an analysis of intertextual references in NA, which demonstrates how they are interwoven with the plot of NA. In his description of Northanger Abbey (142–5), Mr Tilney draws on situations and descriptions that either occur in or are adapted from The Mysteries of Udolpho or The Monk. The comparisons between NA and these two novels in tables A1 and A2 show parallels between the Gothic novels and Mr Tilney’s account.
Table A1 Intertextual references NA – The Mysteries of Udolpho
Northanger Abbey | The Mysteries of Udolpho
At Northanger Abbey, Catherine is going to stay in a lonely apartment far away from the family and the servants. | At Udolpho, Emily lives in an apartment that is far away from the family and the servants. She is isolated.
Catherine’s room is dark as it is lit by only one candle. | The way to Emily’s room is unlit, there is no fire in her part of the castle, and Emily does not dare leave her apartment to fetch a candle.
Catherine will be told by the ancient housekeeper that her room has not been used for a long time, in fact since a cousin of the family died in it. | Emily’s room seems to have been uninhabited for several years.
The ancient housekeeper will tell Catherine that the part of the abbey Catherine’s room is situated in, is haunted by a ghost. | Annette, Emily’s aunt’s servant, declares that Udolpho and in particular that part of the castle that Emily stays in is haunted by a ghost.
Mr Tilney says that Catherine’s room will not have a lock. | The door between Emily’s room and the room next to hers cannot be locked so that Emily uses a chair to lock herself in.
Northanger is situated in the mountains. | The castle Udolpho is situated in the Apennine, an inaccessible mountain region.
Table A2 Intertextual references NA – The Monk

| Northanger Abbey | The Monk |
| --- | --- |
| Mr Tilney tells Catherine that she will find the memoirs of a Matilda. | The devil is personified by the nun Matilda. |
| Mr Tilney tells Catherine that she will find a secret passageway between her room and the chapel of St. Anthony. | There are secret underground passageways between monasteries in Madrid which allow secret meetings of the monk Ambrosio and Matilda. |
| Mr Tilney tells Catherine that this underground passage leads past dungeons. | One of the novel’s protagonists, Agnes, is imprisoned in an underground dungeon. |
Intertextual references in NA play a major role in creating irony in that part of the novel set at Northanger Abbey. Because Catherine cannot fully distinguish between fiction and reality, she creates a Gothic setting at Northanger, where she believes crimes to have happened. Intertextual and exophoric references in the text, and the allusion to schemata from Gothic novels (cf. chapter 5 Keywords and concordance lines), convey Catherine’s image of a Gothic abbey to the readers of the novel. Even though Catherine herself is an anti-heroine in the sense of Gothic novels (cf. chapter 5 Keywords and concordance lines), she nevertheless entertains fantasies based on the genre. This emphasizes the contrast between Catherine’s imagination and the novel’s fictional reality.

Apart from the intertextual references listed in tables A1 and A2, there are further implicit intertextual references in NA. Waldron (1999) identifies references to Dr. John Gregory’s A Father’s Legacy to his Daughters (1790) and The Female Quixote (1752) by Charlotte Lennox. She also shows that Catherine is a counter-model to Emmeline from Emmeline, or the Orphan of the Castle (1788) by Charlotte Smith.

NA includes not only implicit intertextual references to Gothic novels; a total of twelve novels are also mentioned by name in the text. Ten of them are Gothic novels. The following list of the texts mentioned in NA is ordered alphabetically according to author:

• Camilla or A Picture of Youth (1796) by Fanny Burney
• The History of Tom Jones, a Foundling (1749) by Henry Fielding
• Midnight Bell (1798) by Francis Lathom
• The Monk (1796) by Matthew Lewis
• The Castle of Wolfenbach (1793) and Mysterious Warnings (1796) by Eliza Parsons
• The Mysteries of Udolpho (1794) and The Italian (1797) by Ann Radcliffe
• Clermont (1798) by Regina Maria Roche
• The Orphan of the Rhine (1796) by Eleanor Sleath
• Necromancer of the Black Forest (1794) by Peter Teuthold
• Horrid Mysteries (1796) by Peter Will
Tom Jones and Camilla or A Picture of Youth are the two novels on the list that are not Gothic novels. Apart from novels, non-fiction books such as the History of England and The Mirror are also mentioned. Furthermore, the classic authors Gray, Milton, Pope, Prior, Shakespeare, Sterne and Thomson are named. A distribution analysis of these explicit references to books and authors shows that they mostly occur in that part of the novel set in Bath. This follows from the content structure of the novel: the topic of textuality is more explicit in the part of the novel in which Catherine stays at Bath than in the part set at Northanger (cf. chapter 7 Text segmentation).
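A distribution analysis of this kind can be sketched in a few lines of Python. The following is a minimal illustration only, not the procedure used in this study: the function names, the segment labels and the placeholder strings are assumptions made for the example, and the strings are invented stand-ins rather than quotations from NA. The sketch counts case-insensitive mentions of each title in each named part of a text and sums them per part.

```python
from collections import Counter


def reference_distribution(segments, titles):
    """Count case-insensitive mentions of each title in each named segment.

    segments: dict mapping a segment label to its text (hypothetical input).
    titles:   list of title or author strings to search for.
    """
    counts = {}
    for label, text in segments.items():
        lowered = text.lower()
        # Simple substring counting; a real analysis would tokenize the
        # text and handle spelling variants and abbreviated titles.
        counts[label] = Counter({t: lowered.count(t.lower()) for t in titles})
    return counts


def totals(counts):
    """Sum all mentions per segment."""
    return {label: sum(c.values()) for label, c in counts.items()}


# Placeholder texts, invented for illustration (not taken from the novel).
example_segments = {
    "Bath": "Have you read Udolpho? Udolpho and The Monk.",
    "Northanger": "The abbey was quiet.",
}
example_counts = reference_distribution(example_segments, ["Udolpho", "The Monk"])
print(totals(example_counts))
```

Applied to the full text split at the move from Bath to Northanger, the per-segment totals would give the kind of distribution described above; dividing by segment length would make parts of different size comparable.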
Notes
Chapter 2
1 The British National Corpus (BNC) was compiled between 1991 and 1994 and published in 1995. The corpus consists of about 100 million words of British English which were selected to be as representative of then-current British English as possible. It consists of about 90 per cent written language and about 10 per cent spoken language (cf. section 2.2.1 for further information on the BNC).
2 Technical developments in the 1960s and 1970s made it possible to analyse large amounts of data electronically. Corpus linguistics as a discipline within linguistics has been institutionalized since 1979, when the first ICAME conference (International Computer Archive of Modern and Medieval English) took place.
3 Keywords are words which are statistically salient in a text or corpus in relation to a reference corpus (cf. section 2.2.2 for a detailed explanation of the concept).
4 Craik (1965), for example, calls his book Jane Austen. The Six Novels, and Brown (1979: 37) talks of the ‘novels of Jane Austen’ and lists the six texts discussed in the present volume. Also, Brooke (1999) only accepts these six texts as Austen’s novels.
Chapter 3
1 The Brown Corpus was compiled between 1963 and 1964 at Brown University. It includes about 1 million words of American English in 500 text samples. These samples represent the then-current language use so that the corpus is, despite being small by today’s standards, a representative corpus of American English (cf. the Brown Corpus Manual under http://icame.uib.no/brown/bcm.html for further information on the corpus).
Chapter 5
1 Even though this semantic field has not been identified automatically, but rather manually and intuitively, its occurrence on the list of keywords confirms that it is an important topic of the text. This is supported by the following analyses. The fact that it is a semantic field which points to an important topic makes it a ‘keyfield’ of NA, an analogy to the term ‘keyword’.
2 All page numbers for NA refer to the Penguin edition of the novel (cf. References).
3 This reference has been brought to my attention by one of the anonymous reviewers of the manuscript.
4 Louw defines semantic prosody as a ‘constituent aura of meaning with which a form is imbued by its collocates’ (1993: 157).
Chapter 6
1 Inner monologues would also account for the frequent occurrence of I. But this narrative form does not occur in NA and, in fact, the first time that it occurs in a literary work is in 1877 in W. M. Garsin’s Four Days. It only becomes more common in literature at the beginning of the 20th century (von Wilpert 1989: 411f.). A first-person narrator would explain the dominance of the personal pronoun I as well. As pointed out in Chapter 5 Keywords and concordance lines, however, Austen uses an omniscient narrator in NA who is not one of the novel’s characters.
References
Primary sources
Austen, J. (1818), Northanger Abbey. Penguin Popular Classics. London: Penguin Books, 1994. — (1818), ibiblio P2P [Northanger Abbey]. EBook #121. April, 1994. [last update 4 August 2002] http://www.gutenberg.net/etext/121 (Accessed 5 August 2008). — (1818), Northanger Abbey. A Norton Critical Edition. Susan Fraiman (ed.). New York, London: W. W. Norton & Company, 2004. — (1818), persu11.txt. EText #105. February, 1994. [last update 28 March 2002]. http://www.gutenberg.net/etext/105 (Accessed 5 August 2008). — (1816), emma10.txt. EText #158. August, 1994. [last update 18 August 2002]. http://www.gutenberg.net/etext/158 (Accessed 5 August 2008). — (1814), mansf10.txt. EText #141. June, 1994. [last update 28 March 2002]. http://www.gutenberg.net/etext/141 (Accessed 5 August 2008). — (1813), pandp12.txt. EBook #1342. Jun, 1998. [last update 15 August 2003]. http://www.gutenberg.net/etext/1342 (Accessed 5 August 2008). — (1811), sense10.txt. Etext #161. September, 1994. [last update 30 June 2002]. http://www.gutenberg.net/etext/161 (Accessed 5 August 2008). — Sense and Sensibility. Sony Pictures, 1995. — Emma. Miramax Films, 1996. — Pride and Prejudice. Universal Studios, 2005. Brontë, A. (1847), Agnes Grey. EBook #767. December, 1996. http://www.gutenberg.net/dirs/etext96/agnsg10h.htm (Accessed 5 August 2008). Brontë, C. (1847), Jane Eyre. EBook #1260. March, 1998. http://www.gutenberg.net/dirs/etext98/janey11.txt (Accessed 5 August 2008). Brontë, E. (1847), Wuthering Heights. Etext #768. December, 1996. http://www.gutenberg.net/dirs/etext96/wuthr10.txt (Accessed 5 August 2008). Bulwer-Lytton, E. (1834), The Last Days of Pompeii. tldop10.txt. Etext #1565. December, 1998. http://www.gutenberg.net/dirs/etext98/tldop10.txt (Accessed 5 August 2008). Burney, F. (1796), Camilla or A Picture of Youth. — (1778), Evelina. EBook #6053. July, 2004. http://www.gutenberg.net/dirs/etext04/eveli10.txt (Accessed 5 August 2008). Cleland, J. (1749/50), Memoirs of Fanny Hill. Memoirs of a Woman of Pleasure. http://wiretap.area.com/ftp.items/Library/Classic/fannyhill.txt (Accessed 5 August 2008). Dickens, C. (1837/39), Oliver Twist. EBook #730. November, 1996. http://www.gutenberg.net/dirs/etext96/olivr11.txt (Accessed 5 August 2008).
Disraeli, B. (1826), Vivian Grey. EBook #9840. February, 2006. http://www.gutenberg.net/dirs/etext06/8vvgr10.txt (Accessed 5 August 2008). Edgeworth, M. (1800), Castle Rackrent. Etext #1424. August, 1998. http://www.gutenberg.net/dirs/etext98/rkrnt10.txt (Accessed 5 August 2008). Eliot, G. (1859), Adam Bede. EText #507. April, 1996. http://www.gutenberg.net/dirs/etext96/adamb10.txt (Accessed 5 August 2008). Fielding, H. (1749), The History of Tom Jones, a Foundling. EBook #6593. September, 2004. http://www.gutenberg.net/dirs/etext04/8tomj10.txt (Accessed 5 August 2008). Fielding, S. (1749), The Governess or The Little Female Academy. gvrns10.txt. Etext #1905. September, 1999. http://www.gutenberg.net/dirs/etext99/gvrns10.txt (Accessed 5 August 2008). Garsin, W. M. (1877), Four Days. Gaskell, E. (1848), Mary Barton. EText #2153. April, 2000. http://www.gutenberg.net/dirs/etext00/mbrtn11.txt (Accessed 5 August 2008). Goldsmith, O. (1762/66), The Vicar of Wakefield. vicar10.txt. EText #2667. June, 2001. http://www.gutenberg.net/dirs/etext01/vicar10.txt (Accessed 5 August 2008). Gregory, Dr J. (1790), A Father’s Legacy to his Daughters. Hawthorne, J. (ed.) (1967), The Lock and Key Library Classic Mystery and Detective Stories – Old Time English. lckyl10.txt. EText #1831. July, 1999. http://www.gutenberg.net/etext/1831 (Accessed 5 August 2008). Kingsley, C. (1850), Alton Locke, Tailor And Poet. EBook #8374. July 4, 2003. http://www.gutenberg.net/dirs/etext05/8allk10.txt (Accessed 5 August 2008). Lathom, F. (1798), Midnight Bell. Lennox, C. (1752), The Female Quixote. Lewis, M. (1796), The Monk. tmonk10.txt. EText #601. July, 1996. http://www.gutenberg.net/etext/601 (Accessed 5 August 2008). Mackenzie, H. (1771), The Man of Feeling. EBook #5083. February, 2004. http://www.gutenberg.net/dirs/etext04/mnfl10.txt (Accessed 5 August 2008). Marryat, F. (1836), Mr. Midshipman Easy. EBook #6629. October, 2004. http://www.gutenberg.net/dirs/etext04/measy10.txt (Accessed 5 August 2008). Parsons, E. (1796), Mysterious Warnings. — (1793), The Castle of Wolfenbach. Blackmask Online. 2002. http://www.blackmask.com/books64c/castlewolfdex.htm (Accessed 22 June 2007). Radcliffe, A. (1797), The Italian. — (1794), The Mysteries of Udolpho. newhd10.txt. EText #3268. June, 2002. http://www.gutenberg.net/etext/3268 (Accessed 5 August 2008). Richardson, S. (1740), Pamela or Virtue Rewarded. EBook #6124. July, 2004. http://www.gutenberg.net/dirs/etext04/pam1w10.txt (Accessed 5 August 2008). Roche, R. M. (1798), Clermont. Russell Mitford, M. (1832), Our Village. EText #2496. February, 2001. http://www.gutenberg.net/dirs/etext01/vllg10.txt (Accessed 5 August 2008). Scott, Sir W. (1814), Waverley, Or ‘Tis Sixty Years Hence. EBook #5998. August, 2004. http://www.gutenberg.net/dirs/5/9/9/5998/5998.txt (Accessed August 2008). Shelley, M. (1818), Frankenstein, or the Modern Prometheus. frank14.txt. October, 1993. http://www.gutenberg.net/dirs/etext93/frank14.txt (Accessed 5 August 2008). Sleath, E. (1796), The Orphan of the Rhine.
Smith, C. (1788), Emmeline, or the Orphan of the Castle. Smollett, T. (1771), The Expedition of Humphry Clinker. txohc10.txt. EText #2160. April, 2000. http://www.gutenberg.net/dirs/etext00/txohc10.txt (Accessed 5 August 2008). Sterne, L. (1760), The Life and Opinions of Tristram Shandy, Gentleman. EBook #1079. October, 1997. http://www.gutenberg.net/dirs/etext97/shndy10.txt (Accessed 5 August 2008). Teuthold, P. (1794), Necromancer of the Black Forest. Thackeray, W. M. (1847/48), Vanity Fair. EBook #599. July, 1996. http://www.gutenberg.net/dirs/etext96/vfair12.txt (Accessed 5 August 2008). Trollope, A. (1855), The Warden. EText #619. August, 1996. http://www.gutenberg.net/dirs/etext96/twrdn10.txt (Accessed 5 August 2008). Walpole, H. (1764), The Castle of Otranto. cotrt10.txt. EBook #696. September 2002. http://www.gutenberg.net/etext/696 (Accessed 5 August 2008). Will, P. (1796), Horrid Mysteries. Wollstonecraft, M. (1791), Maria or the Wrongs of Woman. EText #134. May, 1994. http://www.gutenberg.net/dirs/etext94/maria10.txt (Accessed 5 August 2008). Yonge, C. (1853), The Heir of Redclyffe. EText #2505. February, 2001. http://www.gutenberg.net/dirs/etext01/redcl10.txt (Accessed 5 August 2008).
Secondary sources
Altenberg, B. (1998), ‘On the phraseology of spoken English: the evidence of recurrent word-combinations’, in A. P. Cowie (ed.), Phraseology. Theory, Analysis, and Applications. Oxford: Clarendon, pp. 101–22. Amante, D. J. (1980), ‘Ironic language: a structuralist approach’. Language and Style, 13 (1), 15–25. Armstrong, N. (2009), ‘The gothic Austen’, in C. L. Johnson and C. Tuite (eds), A Companion to Jane Austen. Oxford: Wiley-Blackwell, pp. 237–47. Auerbach, E. (2004), Searching for Jane Austen. Madison: University of Wisconsin Press. Baayen, R. H. (2001), Word Frequency Distributions. Dordrecht, Boston, London: Kluwer. Baker, P. (2004), ‘Querying keywords’. Journal of English Linguistics, 32 (4), 346–59. Bartlett, F. C. (1932), Remembering. Cambridge: Cambridge University Press. Beaugrande, R. de (1993), ‘Closing the gap between linguistics and literary study: discourse analysis and literary theory’. Journal of Advanced Composition, 13 (2), 423–48. http://www.beaugrande.com/Linguisticsliterarystudy.htm (Accessed 5 August 2008). Benveniste, E. (1954), ‘Civilisation: contribution d’un mot. Hommage à Lucien Febvre’, reprinted in É. Benveniste, Problèmes de Linguistique Générale. Paris: Gallimard, 1966, pp. 336–45. Biber, D. and Conrad, S. (1999), ‘Lexical bundles in conversation and academic prose’, in H. Hasselgard (ed.), Out of Corpora. Studies in Honour of Stig Johansson. Amsterdam: Rodopi, pp. 181–90.
Biber, D., Conrad, S. and Cortes, V. (2004), ‘If you look at . . . : lexical bundles in university teaching and textbooks’. Applied Linguistics, 25 (3), 371–405. Biber, D., Leech, G., Johansson, S., Conrad, S. and Finegan, E. (1999), Longman Grammar of Spoken and Written English. London: Longman. Booth, W. C. (1974), A Rhetoric of Irony. Chicago, London: University of Chicago Press. Bressler, C. E. (2003), Literary Criticism. An Introduction to Theory and Practice (3rd edn). Upper Saddle River, NJ: Prentice Hall. British National Corpus, http://www.natcorp.ox.ac.uk (Accessed 5 August 2008). Brooke, C. (1999), Jane Austen: Illusion and Reality. Cambridge: D. S. Brewer. Brooks, M. and Watson, N. (2000), ‘Northanger Abbey: contexts’, in D. da Sousa Correa (ed.), The Nineteenth-Century Novel: Realism. London: Routledge, pp. 62–86. Brown, J. P. (1979), Jane Austen’s Novels. Social Change and Literary Form. Cambridge, MA, London: Harvard University Press. Brown Corpus Manual, http://icame.uib.no/brown/bcm.html (Accessed 5 August 2008). Burgess, G. J. A. (2000), ‘Corpus analysis in the service of literary criticism: Goethe’s Die Wahlverwandtschaften as a model case’, in B. Dodd (ed.), Working with German Corpora. Birmingham: University of Birmingham Press, pp. 40–68. — (1999), A Computer-Assisted Analysis of Goethe’s Die Wahlverwandtschaften. The Enigma of Elective Affinities. Lewiston: Edwin Mellen. Burlin, K. R. (1975), ‘“The pen of the contriver”: the four fictions of Northanger Abbey’, in J. Halperin (ed.), Jane Austen. Bicentenary Essays. Cambridge: Cambridge University Press, pp. 89–111. Burrows, J. (2005), ‘Jane Austen’, in D. A. Cruse, Franz Hundsnurscher, Michael Job and Peter Rolf Lutzeier (eds), Lexikologie / Lexicology. Ein internationales Handbuch zur Natur und Struktur von Wörtern und Wortschätzen. An International Handbook on the Nature and Structure of Words and Vocabularies. Berlin, New York: Walter de Gruyter, pp. 1474–7.
— (2002), ‘“Delta”: A measure of stylistic difference and a guide to likely authorship’. Literary and Linguistic Computing, 17 (3), 267–87. — (1987), Computation into Criticism. A Study of Jane Austen’s Novels and an Experiment in Method. Oxford: Clarendon. Burton, D. (1982), ‘Through glass darkly: through dark glasses’, in R. Carter (ed.), Language and Literature. London: Allen & Unwin, pp. 195–214. Bussmann, H. (1996), Routledge Dictionary of Language and Linguistics. Translated and edited by G. Trauth and K. Kazzazi. London, New York: Routledge. Campbell, D. and Fiske, D. W. (1959), ‘Convergent discriminant validation by the multitrait-multimethod matrix’. Psychological Bulletin, 56, 81–105. Carter, R. (1986), ‘Linguistic models, language and literariness: study strategies in the teaching of literature to foreign students’, in C. J. Brumfit and R. Carter (eds), Literature and Language Teaching. Oxford: Oxford University Press, pp. 110–32. Cervel, M. S. P. (1997–1998), ‘Pride and Prejudice : a cognitive analysis’. Cuadernos de Investigacion Filologica, XXIII–XXIV, 233–55.
Chapman, R. W. (1933), ‘Miss Austen’s English’. Appendix to Sense and Sensibility (3rd edn). Oxford: Oxford University Press, pp. 388–421. Cook, G. (1998), ‘The uses of reality: a reply to Ronald Carter’. ELTJ, 52 (1), 57–64. — (1986), ‘Texts, extracts, and stylistic texture’, in C. J. Brumfit and R. Carter (eds), Literature and Language Teaching. Oxford: Oxford University Press, pp. 150–66. Craik, W. A. (1965), Jane Austen. The Six Novels. London: Methuen. Culler, J. (1977), ‘Structuralism and literature’, in H. Schiff et al. (eds), Contemporary Approaches to English Studies, pp. 59–76. Reprinted in D. Keesey (ed.), Contexts for Criticism (3rd edn). Mountain View: Mayfield, 1998, pp. 302–11. — (1975), Structuralist Poetics. Structuralism, Linguistics and the Study of Literature. London: Routledge & Kegan Paul. Culpeper, J. (2009), ‘Keyness. Words, parts-of-speech and semantic categories in the character-talk of Shakespeare’s Romeo and Juliet’. International Journal of Corpus Linguistics, 14 (1), 29–59. — (2002a), ‘A cognitive stylistic approach to characterisation’, in E. Semino and J. Culpeper (eds), Cognitive Stylistics. Language and Cognition in Text Analysis. Amsterdam, Philadelphia: John Benjamins, pp. 251–77. — (2002b), ‘Computers, language and characterisation: an analysis of six characters in Romeo and Juliet’, in U. Melander-Marttala, C. Östman and M. Kytö (eds), Conversation in Life and in Literature: Papers from the ASLA Symposium Uppsala, 8–9 November 2001. Universitetstryckeriet: Uppsala, pp. 11–30. Culpeper, J. and Kytö, M. (2002), ‘Lexical bundles in early modern English dialogues. A window into the speech-related language of the past’, in T. Fanego, B. Méndez-Naya and E. Seoane (eds), Sounds, Words, Text and Change. Selected Papers from 11 ICEHL, Santiago de Compostela, 7–11 September 2000. Amsterdam, Philadelphia: John Benjamins, pp. 45–63. Curcó, C. (2000), ‘Irony: negation, echo and metarepresentation’. Lingua, 110, 257–80. Cureton, R. D. 
(1997), ‘Linguistics, stylistics, and poetics’. Language and Literature, XXII, 1–43. Danielsson, P. (2003), ‘Automatic extraction of meaningful units from corpora. A corpus-driven approach using the word stroke’. International Journal of Corpus Linguistics, 8 (1), 109–27. Denzin, N. K. (1970), The Research Act. Chicago: Aldine. Dunning, T. (1993), ‘Accurate methods for statistics of surprise and coincidence’. Computational Linguistics, 19 (1), 61–74. Eagleton, T. (1983), Literary Theory. An Introduction. Oxford: Basil Blackwell. Emsley, S. (2005), Jane Austen’s Philosophy of Virtues. New York, Houndsmill: Palgrave Macmillan. Erickson, L. (1990), ‘The economy of novel reading: Jane Austen and the circulating library’. SEL: Studies in English Literature, 1500–1900, 30 (4), 573–90. Firth, J. R. (1957a), Papers in Linguistics (1934–51). London: Oxford University Press. — (1957b), ‘A synopsis of linguistic theory, 1930–1955’. Studies in Linguistic Analysis. Special Volume of the Philological Society. Oxford: Blackwell, pp. 1–32.
— (1950), ‘Personality and language in society’. The Sociological Review, XLII, 37–52. — (1935), ‘The technique of semantics’. Transactions of the Philological Society, 36–72. Fischer-Starcke, B. (2009a), ‘Literature and linguistics in foreign language teaching’. VIEWS, 18 (3), 37–40. http://anglistik.univie.ac.at/fileadmin/user_upload/dep_anglist/weitere_Uploads/Views/Views_18_3_2009_special_issue.pdf (Accessed 25 June 2009). — (2009b), ‘Keywords and frequent phrases of Jane Austen’s Pride and Prejudice. A corpus stylistic analysis’. International Journal of Corpus Linguistics, 14 (4), 492–523. — (2007), ‘Korpusstilistik: Korpuslinguistische Analysen literarischer Werke am Beispiel Jane Austens’. Unpublished PhD Thesis, University of Trier. Fish, S. E. (1973), ‘What is stylistics and why are they saying such terrible things about it?’ in S. Chatman (ed.), Approaches to Poetics: Selected Papers from the English Institute. New York, London: Columbia University Press, pp. 109–52. Forster, E. M. (1927), Aspects of the Novel and Related Writings. Cambridge: Edward Arnold. Fowler, R. (ed.) (1987), A Dictionary of Modern Critical Terms (revised edn). London, New York: Routledge & Kegan Paul. Fowler, R. (1986), ‘Studying literature as language’, in T. D’Haen (ed.), Linguistics and the Study of Literature. Amsterdam: Rodopi, pp. 187–200. Francis, G. (1993), ‘A corpus-driven approach to grammar – principles, methods and examples’, in M. Baker, G. Francis and E. Tognini-Bonelli (eds), Text and Technology. In Honour of John Sinclair. Philadelphia, Amsterdam: John Benjamins, pp. 137–56. Franck, D. (1975), ‘Zur Analyse indirekter Sprechakte’, in V. Ehrich and P. Finke (eds), Beiträge zur Grammatik und Pragmatik. Kronberg: Scriptor, pp. 219–31. Freeman, M. H. (2002), ‘Momentary stays, exploding forces’. Journal of English Linguistics, 30 (1), 73–90. Gerster, C. (2000), ‘Rereading Jane Austen: dialogic feminism in Northanger Abbey’, in L. C. Lambdin and R. T. Lambdin (eds), A Companion to Jane Austen Studies. Westport, London: Greenwood Press, pp. 115–30. Gläser, R. (1998), ‘The stylistic potential of phraseological units in the light of genre analysis’, in A. P. Cowie (ed.), Phraseology. Theory, Analysis, and Applications. Oxford: Clarendon, pp. 125–43. Granger, S. (1998), ‘Prefabricated patterns in advanced EFL writing: collocations and formulae’, in A. P. Cowie (ed.), Phraseology. Theory, Analysis, and Applications. Oxford: Clarendon, pp. 145–60. Grice, P. H. (1989), Studies in the Way of Words. Cambridge, MA, London: Harvard University Press. — (1975), ‘Logic and conversation’. Syntax and Semantics, 3, 41–58. — (1967), ‘Logic and conversation’. William James Lectures. Cambridge, MA: Harvard University Press. Groeben, N. and Scheele, B. (1984), Produktion und Rezeption von Ironie. Pragmalinguistische Beschreibung und Psycholinguistische Erklärungshypothesen. Tübingen: Gunter Narr.
Halliday, M. A. K. (1992), ‘Language as system and language as instance: the corpus as a theoretical construct’, in J. Svartvik (ed.), Directions in Corpus Linguistics. Berlin: Mouton de Gruyter, pp. 61–77. — (1991), ‘Corpus studies and probabilistic grammar’, in K. Aijmer and B. Altenberg (eds), English Corpus Linguistics. London: Longman, pp. 30–43. — (1985), Introduction to Functional Grammar. London: Arnold. — (1978), Language as Social Semiotic. London: Arnold. — (1971), ‘Linguistic function and literary style: an inquiry into the language of William Golding’s The Inheritors’, in S. Chatman (ed.), Literary Style: A Symposium. New York: Oxford University Press, pp. 330–65. Halliday, M. A. K. and Hasan, R. (1976), Cohesion in English. London: Longman. Hanks, P. (1987), ‘Definitions and explanations’, in J. Sinclair (ed.), Looking Up. An Account of the COBUILD Project. London, Glasgow: Collins ELT, pp. 116–36. Hardy, D. E. and Durian, D. (2000), ‘The stylistics of syntactic complements: grammar and seeing in Flannery O’Connor’s fiction’. Style, 34, 92–116. Harmsel, H. T. (1964), Jane Austen: A Study in Fictional Conventions. London: Mouton & Co. Hasan, R. (1984), ‘Coherence and cohesive harmony’, in J. Flood (ed.), Understanding Reading Comprehension. Delaware: International Reading Association, pp. 181–219. Hearst, M. A. (1997), ‘TextTiling: segmenting text into multi-paragraph subtopic passages’. Computational Linguistics, 23 (1), 33–64. Herberg, D. (1997), ‘Schlüsselwörter – Schlüssel zur Wendezeit’, in H. Kämper and H. Schmidt (eds), Das 20. Jahrhundert. Sprachgeschichte – Zeitgeschichte. Jahrbuch 1997 des Instituts für deutsche Sprache. Berlin, New York: de Gruyter, 1998, pp. 330–44. Hermansson, C. (2000), ‘Neither Northanger Abbey: the reader presupposes’. Papers on Language and Literature, 36 (4), 337–56. Heywood, J., Semino, E. and Short, M. (2002), ‘Linguistic metaphor identification in two extracts from novels’. Language and Literature, 11 (1), 35–54. Hoey, M. (2005), Lexical Priming. A New Theory of Words and Language. London, New York: Routledge. — (1991), Patterns of Lexis in Text. Oxford: Oxford University Press. Hoover, D. L. (1999), Language and Style in The Inheritors. Lanham: University Press of America. Hori, M. (2002), ‘Collocational patterns of –ly manner adverbs in Dickens’, in T. Saito, J. Nakamura and S. Yamazaki (eds), English Corpus Linguistics in Japan. Amsterdam, New York: Rodopi, pp. 149–63. Jackendoff, R. (2002), Foundations of Language. Oxford: Oxford University Press. Jakobson, R. (1958), ‘Closing statement: linguistics and poetics’, in T. A. Sebeok (ed.), Style in Language. Cambridge, MA: MIT Press, 1960, pp. 350–77. Jobbins, A. and Evett, L. (1998), ‘Text segmentation using reiteration and collocation’. COLING-ACL 1998, 614–18. Johnson, S., Culpeper, J. and Suhr, S. (2003), ‘From “politically correct councillors” to “Blairite nonsense”: discourse of “political correctness” in three British newspapers’. Discourse and Society, 14 (1), 29–47.
Kavanagh, J. (1863), English Women of Letters. Biographical Sketches. London: Hurst and Blackett Publishers. http://www.archive.org/stream/englishwomenlet01kavagoog (Accessed 25 June 2009). Keller, J. R. (2000), ‘Austen’s Northanger Abbey: a bibliographic study’, in L. C. Lambdin and R. T. Lambdin (eds), A Companion to Jane Austen Studies. Westport, London: Greenwood, pp. 131–43. Kenny, A. (1992), ‘Computers and the Humanities’. The Ninth British Library Research Lecture, 1991. London: The British Library. Kilgarriff, A. (2001), ‘Comparing Corpora’. Corpus Linguistic Theory, 1 (2), 2005, 263–76. http://www.kilgarriff.co.uk/Publications/2001-K-CompCorpIJCL.pdf (Accessed 5 August 2008). — (1997), ‘Putting frequencies in the dictionary’. International Journal of Lexicography, 10 (2), 135–55. Kuhn, T. S. (1970), ‘Logic of discovery or psychology of research?’, in I. Lakatos and A. Musgrave (eds), Criticism and the Growth of Knowledge. Cambridge: Cambridge University Press, pp. 1–23. Kunin, A. V. (1969), ‘Osnovniye ponyatiya frazeologicheskoy stilitistiki’. Problemi lingvisticheskoy stilistiki. Moskva: Moskovskiy pedagogicheskiy institut inostrannih yazikov imeni M. Toreza, pp. 71–75 [quoted in Naciscione 2001]. Labov, W. (1972), Language in the Inner City. Studies in the Black English Vernacular. Philadelphia: University of Pennsylvania Press. Ladendorf, O. (1906), Historisches Schlagwörterbuch. Ein Versuch. Straßburg, Berlin: Karl J. Trübner. Lakatos, I. (1970), ‘Falsification and the methodology of scientific research programmes’, in I. Lakatos and A. Musgrave (eds), Criticism and the Growth of Knowledge. Cambridge: Cambridge University Press, pp. 91–196. Lakoff, G. (1993), ‘The contemporary theory of metaphor’, in A. Ortony (ed.), Metaphor and Thought (2nd edn). Cambridge: Cambridge University Press, pp. 202–51. Lepp, F. (1908), Schlagwörter des Reformationszeitalters. Leipzig: Heinsius Nachf. Liddell, R. (1963), The Novels of Jane Austen. London: Longmans. Litz, A. 
W. (1965), Jane Austen: A Study of her Artistic Development. New York: Oxford University Press. Louw, B. (1993), ‘Irony in the text or insincerity in the writer? The diagnostic potential of semantic prosodies’, in M. Baker, G. Francis and E. Tognini-Bonelli (eds), Text and Technology. In Honour of John Sinclair. Philadelphia, Amsterdam: John Benjamins, pp. 157–76. Mahlberg, M. (2007a), ‘Corpus stylistics: bridging the gap between linguistic and literary studies’, in M. Hoey, M. Mahlberg, M. Stubbs and W. Teubert (eds), Text, Discourse and Corpora. Theory and Analysis. London: Continuum, pp. 219–46. — (2007b), ‘Clusters, key clusters and local textual functions in Dickens’. Corpora, 2 (1), 1–31. Malinowski, B. (1953), ‘The problem of meaning in primitive languages’, in C. K. Ogden and I. A. Richards (eds), The Meaning of Meaning. The Influence of Language upon Thought and of the Science of Symbolism (10th edn). London: Routledge and Kegan Paul, 1969, pp. 296–336.
Markus, M. (2002), ‘Towards an analysis of pragmatic and stylistic features in 15th and 17th century English letters’, in P. Peter, P. Collins and A. Smith (eds), New Frontiers of Corpus Research. Papers from the Twenty First International Conference on English Language Research on Computerized Corpora. Sydney 2000. Amsterdam, New York: Rodopi, pp. 179–97. Martin, R. (1992), ‘Irony and universe of belief’. Lingua, 87, 77–90. Mason, O. and Platt, R. (2006), ‘Embracing a new creed: lexical patterning and the encoding of ideology’. College Literature, 32 (2), 154–70. Mathison, J. K. (1957), ‘Northanger Abbey and Jane Austen’s conception of the value of fiction’. English Literary History, 24, 138–52. Miall, D. S. (1995), ‘Representing and interpreting literature by computer’. Yearbook of English Studies, 25, 199–212. Mills, S. (1992), ‘Knowing your place: a Marxist feminist stylistic analysis’, in M. Toolan (ed.), Language, Text, and Context: Essays in Stylistics. London: Routledge, pp. 182–205. Moon, R. (2001), ‘The distribution of idioms in English’. Studi Italiani di Linguistica Teorica e Applicata, XXX (2), 229–41. — (1987), ‘The analysis of meaning’, in J. Sinclair (ed.), Looking Up. An Account of the COBUILD Project. London, Glasgow: Collins ELT, pp. 86–103. Mooneyham, L. G. (1988), Romance, Language and Education in Jane Austen’s Novels. Houndsmill, London: Macmillan. Morris, J. and Hirst, G. (1991), ‘Lexical cohesion computed by thesaural relations as an indicator of the structure of text’. Computational Linguistics, 17 (1), 21–48. Muecke, D. C. (1970), Irony. London, New York: Methuen. Mukherjee, J. (2005), ‘Stylistics’, in P. Strazny (ed.), Encyclopaedia of Linguistics. New York: Fitzroy Dearborn, pp. 1184–6. Naciscione, A. (2001), Phraseological Units in Discourse: Towards Applied Stylistics. Riga: Latvian Academy of Culture. Neill, N. (2004), ‘“The truth which the press now groans”: Northanger Abbey and the gothic best sellers of the 1790s’. 
Eighteenth-Century Novel, 4, 163–92. O’Halloran, K. A. (2007a), ‘The subconscious in James Joyce’s “Eveline”: a corpus stylistic analysis which chews on the “Fish hook” ’. Language and Literature, 16 (3), 227–44. — (2007b), ‘Corpus-assisted literary evaluation’. Corpora, 2 (1), 33–63. Ousby, I. (ed.). (1992), The Wordsworth Companion to Literature in English (revised edition). Cambridge: Cambridge University Press. Page, N. (1972), The Language of Jane Austen. Oxford: Basil Blackwell. Perry, R. (2009), ‘Family matters’, in C. L. Johnson and C. Tuite (eds), A Companion to Jane Austen. Oxford: Wiley-Blackwell, pp. 323–31. Phillipps, K. C. (1970), Jane Austen’s English. London: André Deutsch. Phillips, M. (1989), Lexical Structure of Text. Birmingham: University of Birmingham Press. — (1985), Aspects of Text Structure: An Investigation of the Lexical Organisation of Text. Amsterdam: North Holland. — (1983), ‘Lexical Macrostructure in Science Text’. PhD Thesis, University of Birmingham.
References
Pinion, F. B. (1973), A Jane Austen Companion. A Critical Survey and Reference Book. London, Basingstoke: Macmillan. Poplawski, P. (1998), A Jane Austen Encyclopedia. London: Aldwych. Popper, K. R. (1979a), Truth, Rationality, and the Growth of Scientific Knowledge. Frankfurt (Main): Vittorio Klostermann. — (1979b), Objective Knowledge. An Evolutionary Approach (revised edn). Oxford: Clarendon. — (1962), Conjectures and Refutations: The Growth of Scientific Knowledge. New York: Basic Books. — (1935), Logik der Forschung: Zur Erkenntnistheorie der modernen Naturwissenschaft. Wien: Springer. Project Gutenberg http://www.gutenberg.org/wiki/Main_Page (Accessed 5 August 2008). Renouf, A. and Sinclair, J. (1991), ‘Collocational frameworks in English’, in K. Aijmer and B. Altenberg (eds), English Corpus Linguistics. London: Longman, pp. 128–43. Reynar, J. C. (1994), ‘An automatic method of finding topic boundaries’. ACL 1994, 331–3. Sachs, L. (2004), Angewandte Statistik. Anwendung statistischer Methoden (11th edn). Berlin, Heidelberg, New York: Springer. Saussure, F. de (1916), Cours de Linguistique Générale. Paris: Payot. Scott, M. (2002), ‘Picturing the key words of a very large corpus and their lexical upshots or Getting at the Guardian’s view of the world’, in B. Kettemann and G. Marko (eds), Teaching and Learning by Doing Corpus Analysis. Proceedings of the Fourth International Conference on Teaching and Language Corpora, Graz 19–24 July, 2000. Amsterdam, New York: Rodopi, pp. 43–50. — (2001), ‘Mapping key words to problem and solution’, in M. Scott and G. Thompson (eds), Patterns of Text. In Honour of Michael Hoey. Amsterdam, Philadelphia: John Benjamins, pp. 109–27. Scott, M. and Tribble, C. (2006), Textual Patterns. Key Words and Corpus Analysis in Language Education. Amsterdam, Philadelphia: John Benjamins. Searle, J. R. (1979), Expression and Meaning. Studies in the Theory of Speech Acts. Cambridge: Cambridge University Press. Semino, E. and Short, M. 
(2004), Corpus Stylistics: Speech, Writing and Thought Presentation in a Corpus of English Narratives. London: Routledge. Shipley, J. T. (ed.) (1970), Dictionary of World Literary Terms. Forms, Techniques, Criticism (revised edn). Boston: The Writer. Short, M. (1981), ‘Discourse analysis and the analysis of drama’. Applied Linguistics, 2, 180–202. Sinclair, J. (1999), ‘The computer, the corpus and the theory of language’, in G. Sertoli et al. (eds), Anglistica e . . . : Metodi e percorsi comparatistici nelle lingue, culture e letterature di origine europea, I: Transiti letterari e culturali. Trieste: Università di Trieste, pp. 1–15. — (1996), ‘The search for units of meaning’. Textus, 9, 75–106. — (1991), Corpus, Concordance, Collocation. Oxford: Oxford University Press. — (1987), ‘Collocation: a progress report’, in R. Steel and T. Threadgold (eds), Language Topics – Essays in Honour of Michael Halliday. Amsterdam: John Benjamins, pp. 319–31.
— (1982), ‘Planes of discourse’, in S. N. A. Rizvi (ed.), The Two-fold Voice. Essays in Honour of Ramesh Mohan. Reprinted in J. Sinclair and R. Carter (eds), Trust the Text. Language, Corpus and Discourse. London: Routledge, 2004, pp. 51–66. — (1975), ‘The linguistic basis of style’, in H. Ringbom, A. Ingberg, R. Norrman, K. Nyholm, R. Westman, K. Wikberg and T. Söderholm (eds), Style and Text. Studies Presented to Nils Erik Enkvist. Stockholm: Språkförlaget Skriptor AB & Åbo Akademi, pp. 75–89. — (1965), ‘When is a poem like a sunset?’ A Review of English Literature, 6 (2), 76–91. Sinclair, J., Jones, S. and Daley, R. (1970), English Lexical Studies. Report to OSTI on Project C/LP/08. Department of English, Birmingham University. [quoted in Phillips 1989]. Reprinted: Sinclair, J. et al. (2004), English Collocation Studies: The OSTI-Report. London: Continuum. Sinclair, J. and Coulthard, M. (1975), Towards an Analysis of Discourse. Oxford: Oxford University Press. Sinclair, J. and Mauranen, A. (2006), Linear Unit Grammar. Integrating Speech and Writing. Amsterdam, Philadelphia: John Benjamins. Smith, A. E. (1992), ‘“Julias and Louisas”: Austen’s Northanger Abbey and the sentimental novel’. English Language Notes, 30 (1), 33–42. Smith, N. and Wilson, D. (1992), ‘Introduction’. Lingua, 87, 1–10. Sperber, D. and Wilson, D. (1982), ‘Mutual knowledge and relevance in theories of comprehension’, in N. V. Smith (ed.), Mutual Knowledge. London: Academic Press, pp. 61–85. Starcke, B. (2008), ‘I don’t know – differences in patterns of collocation and semantic prosody in phrases of different lengths’, in A. Gerbig and O. Mason (eds), Language, People and Numbers. Corpus Linguistics and Society. Amsterdam: Rodopi, pp. 199–216. — (2007), ‘Korpuslinguistische Daten als Grundlagen von Literaturrezeption im Unterricht’, in S. Doff and T. Schmidt (eds), Fremdsprachenforschung heute – Interdisziplinäre Impulse, Methoden und Perspektiven. Frankfurt (Main): Peter Lang, pp. 211–24. 
— (2006), ‘The phraseology of Jane Austen’s Persuasion: Phraseological units as carriers of meaning’. ICAME Journal, 30, 87–104. Stokes, M. (1991), The Language of Jane Austen. Houndsmill: Macmillan. Strauß, G., Haß, U. and Harras, G. (1989), Brisante Wörter von Agitation bis Zeitgeist. Ein Lexikon zum öffentlichen Sprachgebrauch. Berlin, New York: de Gruyter. Stubbs, M. (2007), ‘Quantitative data on multi-word sequences in English: the case of the word world’, in M. Hoey, M. Mahlberg, M. Stubbs and W. Teubert (eds), Text, Discourse and Corpora. Theory and Analysis. London: Continuum, pp. 163–189. — (2006), ‘Quantitative data on multi-word sequences in English: the case of prepositional phrases’. Paper presented at Berlin-Brandenburgische Akademie der Wissenschaften. 3 November 2006. — (2005), ‘Conrad in the computer: examples of quantitative stylistic methods’. Language and Literature, 14 (1), 5–24. — (2002), Words and Phrases. Corpus Studies of Lexical Semantics. Oxford: Blackwell. Stubbs, M. and Barth, I. (2003), ‘Using recurrent phrases as text-type discriminators: a quantitative method and some findings’. Functions of Language, 10 (1), 65–108.
Tabata, T. (2002), ‘Investigating stylistic variation in Dickens through correspondence analysis of word-class distribution’, in T. Saito, J. Nakamura and S. Yamazaki (eds), English Corpus Linguistics in Japan. Amsterdam, New York: Rodopi, pp. 165–82. — (1994), ‘Dickens’ narrative style: a statistical approach to chronological variation’. Revue Informatique et Statistique dans les Sciences Humaines, 30, 165–82. Tanner, T. (1986), Jane Austen. Houndsmill, London: Macmillan. Tave, S. M. (1973), Some Words of Jane Austen. Chicago: Chicago University Press. Teubert, W. (2005), ‘My version of corpus linguistics’. International Journal of Corpus Linguistics, 10 (1), 1–13. — (1989), ‘Politische Vexierwörter’, in J. Klein (ed.), Politische Semantik. Opladen: Westdeutscher Verlag, pp. 51–68. The Concise Oxford Dictionary (9th edn). Oxford: Clarendon Press, 1995. The Oxford English Dictionary (2nd edn). 1989. OED Online. Oxford: Oxford University Press. 4 Apr. 2000 http://dictionary.oed.com (Accessed: 25 June 2009). Todd, J. (2006), The Cambridge Introduction to Jane Austen. Cambridge: Cambridge University Press. Tolcsvai Nagy, G. (1998), ‘Quantity and style from a cognitive point of view’. Journal of Quantitative Linguistics, 5 (3), 232–9. Toolan, M. (2004), ‘Values are descriptions or from literature to linguistics and back again by way of keywords’. Belgian Journal of English Language and Literature, 2, 11–30. — (1998), Language in Literature: an Introduction to Stylistics. London, New York: Arnold. — (1988), Narrative: a Critical Linguistic Introduction. London: Routledge. Tribble, C. (2000), ‘Genres, keywords, teaching: towards a pedagogic account of the language of project proposals’, in L. Burnard and T. McEnery (eds), Rethinking Language Pedagogy from a Corpus Perspective. Papers from the Third International Conference on Teaching and Language Corpora. Frankfurt (Main): Peter Lang, pp. 75–90. van Peer, W. 
(1989), ‘Quantitative studies of literature: a critique and an outlook’. Computers and the Humanities, 23, 301–7. van Peer, W. and Maat, H. P. (2001), ‘Narrative perspective and the interpretation of characters’ motives’. Language and Literature, 10 (3), 229–41. Voloshinov, V. N. (1929), Marxism and the Philosophy of Language. First published in Russian. Translated by L. Matejka and I. R. Titunik. New York: Academic Press. 1986. von Wilpert, G. (1989), Sachwörterbuch der Literatur (7th edn). Stuttgart: Kröner. Waldron, M. (1999), Jane Austen and the Fiction of her Time. Cambridge: Cambridge University Press. Walter, J. (ed.) (1996), Kindlers neues Literatur-Lexikon. München: Kindler. Watt, I. (1960), ‘The first paragraph of The Ambassadors: an explanation’. Essays in Criticism, 10, 250–74. Weber, J. J. (ed.) (1996), The Stylistics Reader. From Jakobson to the Present. London, New York: Arnold. Widdowson, H. G. (2004), Text, Context, Pretext: Critical Issues in Discourse Analysis. Oxford: Blackwell.
— (1991), ‘The description and prescription of language’, in J. Alatis (ed.), Linguistics and Language Pedagogy. The State of the Art. Washington D. C.: Georgetown University Press, pp. 11–24. — (1975), Stylistics and the Teaching of Literature. London: Longman. Williams, R. (1983), Keywords. London: Fontana. Wilson, D. and Sperber, D. (1992), ‘On verbal irony’. Lingua, 87, 53–76. Wimsatt, W. K. and Beardsley, M. C. (1946), ‘The Intentional Fallacy’. Sewanee Review, 54, 468–88. Revised and republished in The Verbal Icon: Studies in the Meaning of Poetry. Lexington: University of Kentucky Press, 1954, pp. 3–18. Wittgenstein, L. (1953), Philosophische Untersuchungen. J. Schulte (ed.). Frankfurt (Main): Suhrkamp, 2003. Wray, A. (2002), Formulaic Language and the Lexicon. Cambridge: Cambridge University Press. Wunderlich, D. (1975), ‘Zur Konventionalität von Sprechakten’, in D. Wunderlich (ed.), Linguistische Pragmatik (2nd edn). Wiesbaden: Athenaion, pp. 11–58. Youmans, G. (2002), Homepage. http://www.missouri.edu/~youmansc/ (Accessed 5 August 2008). — (1994), ‘The Vocabulary-Management Profile: Two stories by William Faulkner’. Empirical Studies of the Arts, 12 (2), 113–30. http://www.missouri.edu/~youmansc/vmp/help/Youmans-ESA-VMP.pdf (Accessed 5 August 2008); page numbers in the text refer to the pdf-document. — (1991), ‘A new tool for discourse analysis: the Vocabulary-Management-Profile’. Language, 67 (4), 763–89. http://www.missouri.edu/~youmansc/vmp/help/Youmans-NewTool.pdf (Accessed 5 August 2008); page numbers in the text refer to the pdf-document.
Software
Barth, I. (2002), Word-Distribution. Fletcher, W. H. (2002), kfNgram. http://www.kwicfinder.com/kfNgram/kfNgramHelp.html (Accessed 5 August 2008). Scott, M. (1999), WordSmith Tools. 3.0. Oxford: OUP. Youmans, G. (2001), Vocabulary Management Profiles. http://web.missouri.edu/~youmansc/vmp/index.shtml (Accessed 5 August 2008).
Index of Names
Altenberg, B. 24, 109, 111 Amante, D. J. 83 Armstrong, N. 71 Auerbach, E. 77, 86, 184
Baayen, R. H. 147 Baker, P. 68 Barth, I. 31, 109, 123, 128, 145 Bartlett, F. C. 53 Beardsley, M. C. 51 Beaugrande, R. de. 5 Benveniste, E. 68 Biber, D. 24, 109, 111, 113f. Booth, W. C. 82 Bressler, C. E. 3f. Brooke, C. 205 Brooks, M. 71 Brown, J. P. 85, 205 Burgess, G. J. A. 56 Burlin, K. R. 106 Burrows, J. 9, 56, 57f., 186 Burton, D. 40 Bussmann, H. 123
Campbell, D. 60 Carter, R. 40 Cervel, M. S. P. 55 Chapman, R. W. 9 Conrad, S. 24, 111, 113f. Cook, G. 4, 15, 41 Cortes, V. 113f. Coulthard, M. 148 Craik, W. A. 110, 184, 205 Culler, J. 47ff. Culpeper, J. 24, 40, 41, 54, 69f., 111 Curcó, C. 83f. Cureton, R. D. 41
Danielsson, P. 111, 113 Denzin, N. K. 60 Dunning, T. 32 Durian, D. 56, 58
Eagleton, T. 3, 4 Emsley, S. 75 Erickson, L. 79 Evett, L. 147f.
Firth, J. R. 15, 36, 41, 68, 195 Fischer-Starcke, B. 10, 40, 41, 56, 59, 113 Fish, S. E. 29f., 47, 49ff., 53, 55, 59 Fiske, D. W. 60 Fletcher, W. H. 31, 108, 109 Forster, E. M. 119 Fowler, R. 2, 5, 7, 41f. Francis, G. 52 Franck, D. 84 Gerster, C. 85 Gläser, R. 111f. Granger, S. 111 Grice, P. H. 81ff. Groeben, N. 84f. Halliday, M. A. K. 36, 37f., 45ff., 49, 55f., 146, 193 Hanks, P. 15 Hardy, D. E. 56, 58 Harmsel, H. T. 75 Harras, G. 67 Hasan, R. 146f. Haß, U. 67 Hearst, M. A. 147 Herberg, D. 67
Hermansson, C. 71 Heywood, J. 41, 54 Hirst, G. 145 Hoey, M. 131, 147 Hoover, D. L. 46f. Hori, M. 41, 57 Jackendorff, R. 114 Jakobson, R. 40, 42ff., 47, 50, 55f. Jobbins, A. 147f. Johansson, S. 109, 111 Johnson, S. 69 Kavanagh, J. 75 Keller, J. R. 152 Kenny, A. 19f., 23 Kilgarriff, A. 23, 185ff. Kuhn, T. S. 21f. Kunin, A. V. 112 Kytö, M. 24, 111 Labov, W. 119 Ladendorf, O. 67 Lakatos, I. 13 Lakoff, G. 54 Leech, G. 109, 111 Lepp, F. 67 Litz, A. W. 184 Louw, B. 10, 78, 84f., 206 Maat, H. P. 54 Mahlberg, M. 41, 57 Malinowski, B. 127 Markus, M. 41 Martin, R. 82 Mason, O. 68 Mathison, J. K. 106 Mauranen, A. 195 Miall, D. S. 7 Mills, S. 40 Moon, R. 70, 111, 113, 114, 147 Mooneyham, L. G. 89 Morris, J. 145 Muecke, D. C. 81 Mukherjee, J. 39
Naciscione, A. 111, 112 Neill, N. 71 O’Halloran, K. A. 10, 56, 58f., 61 Ousby, I. 71 Page, N. 9 Perry, R. 96, 106 Phillipps, K. C. 9 Phillips, M. 67, 145, 148 Pinion, F. B. 110 Platt, R. 68 Poplawski, P. 152 Popper, K. R. 16ff., 19f., 21f. Renouf, A. 109 Reynar, J. C. 147 Sachs, L. 189 Saussure, F. de 34ff. Scheele, B. 84f. Scott, M. 24, 26, 31f., 65, 66, 68f. Searle, J. R. 82 Semino, E. 41, 54, 59 Shipley, J. T. 3, 4, 5 Short, M. 40f., 54, 59 Sinclair, J. 5, 6, 15, 36, 41, 67, 70, 109, 111, 113, 147, 148, 195, 200 Smith, A. E. 71 Smith, N. 83 Sperber, D. 81, 83 Starcke, B. 10, 40f., 56, 58, 111f., 114f. Stokes, M. 9 Strauß, G. 67 Stubbs, M. 10, 41, 56, 58, 109, 113, 114, 120, 123, 128, 142, 165 Suhr, S. 69 Tabata, T. 56f. Tanner, T. 75, 85 Tave, S. M. 9 Teubert, W. 3, 15, 68, 195 Todd, J. 77 Tolcsvai Nagy, G. 44 Toolan, M. 2, 54, 69f. Tribble, C. 66, 68f.
van Peer, W. 7, 54 Voloshinov, V. N. 15 von Wilpert, G. 206 Waldron, M. 203 Walter, J. 152 Watson, N. 71 Watt, I. 94, 133 Weber, J. J. 2
Widdowson, H. G. 2, 15f., 40, 50ff., 53 Williams, R. 68, 69 Wilson, D. 81, 83 Wimsatt, W. K. 51 Wray, A. 114 Wunderlich, D. 84 Youmans, G. 26, 31, 151, 155, 164ff., 181
Index of Subjects
analysis combination of techniques 61f., 106 documentation of 19, 22 quantitative 7, 14 significance 15 transparency 19, 21, 22, 50, 51, 200 Austen, Jane – novels 8 author 85ff. British National Corpus 205 Brown Corpus 205 CBDF-values 186ff. checkability 21 chi-square test 186ff. cognitive linguistics 37f. cognitive stylistics 53ff., 61 coherence 80, 143, 146, 174f. cohesion 73, 143, 146ff., 174f. colligation 75, 108 collocation 68, 75ff., 108 cooperative principle 81f. copyright 8, 30 corpus 3, 5, 14, 184ff., 195 nature of 22f. range of occurrences 140 reference corpus 65f., 72f. corpus compilation 29f., 72f. corpus linguistic analyses 18, 24 comparative nature of 20f. corpus-based 52f. corpus-driven 52f. descriptive nature of 20f. evaluation of 19ff. probabilistic nature of 13, 14, 21, 22, 34, 39
corpus linguistic analyses of literary texts 20 corpus linguistic evidence 13 corpus linguistics 1, 3, 18f., 20 discipline 23f. corpus stylistics 1, 6, 7, 9, 10, 11, 12, 24, 55ff., 61, 197f., 201 corpus stylistic analyses – evaluation of 10, 11 corroboration 22 data 5, 14 Austen 5 27, 94 British National Corpus 27 Cf. also British National Corpus ContempLit 27ff. Gothic 29f. Jane Austen 8, 9 Northanger Abbey 8, 30, 63f. nature of 22 deduction 52 delexicalisation 126ff., 133f. discourse 2 discourse markers 123, 127, 133, 143 documentation of analyses 21 dualism 36 Emotions 71f., 74, 160ff. empiricalness 13, 22 falsification 18, 22 Family and social relationships 95ff. fiction – reading habits 79f. frequency 3, 6, 13, 14, 15f., 65, 197
Gothic novels 73ff., 79 heterogeneity 141ff., 185ff., 190ff. homogeneity 94, 105, 138, 141ff., 144, 158, 183ff., 190ff. idiolect 123, 143 indirect speech acts 126 induction 34f., 36, 52 The Inheritors 45ff. innovation 23, 199f. intertext 5, 17, 23, 34ff., 38, 39 interpretation 6f., 19 intuition 6 irony 73f., 90 decoding of 82, 90 identification of 84f. theory of 81ff. keywords 26, 31f., 65ff., 67ff., 91f., 105ff. choice of 71ff., 149 see also proper nouns knowledge 17 growth of 17, 19ff., 199 language 1, 3, 14, 195 functions of 42, 44, 45 language system 5, 14, 15, 34f. language use 14, 15 typicality 16 Langue 34ff., 38, 39 lemma 70, 146f. lexical priming 131, 133 lexis 68 linguistic form 3, 6 linguistics 1, 3, 4, 5, 6, 34 literary conventions 85, 88ff., 94 literary criticism 3, 4 literary studies 1, 2, 3, 4, 5, 24 literature 4 meaning 4, 14, 34ff., 37ff., 43, 47, 65, 67, 106, 109 correlation with form 6, 39, 41 literary 1, 2f., 70
narrative structure 119f., 137 narrator 85ff. negations 83, 92ff., 131ff. objectivity 6, 16ff., 43, 65 Parole 34ff., 38, 39, 53 pattern 3, 6, 14, 34, 38, 39, 61, 70 lexical 65, 67f. recurrence 15, 34f., 38, 39f., 43 phatic talk 127 phraseology 26, 108 Phrases doubles 116f. length of 111 n-grams, p-frames 108f., 142 principle of equivalence 44 Project Gutenberg 27, 30 proper nouns 95, 183 prototype 15 reader 86ff. references exophoric 81, 83, 90 intertextual 34ff., 48f., 73ff., 81, 83, 202ff. intratextual 35ff., 73, 81, 146 relationship literary studies – linguistics 7f., 11, 201 relevance theory 83f. replicability 21, 23 replication 39 research 16ff. rhetoric – classical 4, 81f., 85 rhetorical questions 126 schema 38, 53f., 74ff., 86, 89, 132 science 17 scientific theory 19f., 22 segmentation – text and corpus 26, 144f., 152ff., 183 semantic field 145f., 174f. semantic preference 128f., 133, 137 semantic prosody 78, 102 sentimental novels 74, 76 science 16ff.
software kfNgram 31 Vocabulary Management Profiles 31, 164ff., 181 Word-Distribution 31, 145 WordSmith Tools 31f. spoken language 120, 122, 137 style 2, 3, 5, 16, 39f., 44 stylistics 1, 2, 5, 6, 10, 39ff., 60f. circularity 49f. criticism of 49ff. functional approach 45f. structuralist approach 47ff. subjectivity 6, 16ff. syntagmatic axis 40, 42ff., 48
testability 21f. text 4, 38f., 148, 184, 187, 195 text part 26, 144, 146ff., 158, 165 textual competence 35f., 38, 47ff. textuality 71ff., 77ff., 88f., 158ff. transitivity 45f. triangulation 60 unit of meaning 11f., 26, 38f., 60, 142, 174f., 195f. word usage 15 worlds (world two and three) 17f.