What's in a Word-list? (Digital Research in the Arts and Humanities)


Author: Dawn Archer


Digital Research in the Arts and Humanities

Series Editors: Marilyn Deegan, Lorna Hughes and Harold Short

Digital technologies are becoming increasingly important to arts and humanities research and are expanding the horizons of our working methods. This important series will cover a wide range of disciplines with each volume focusing on a particular area, identifying the ways in which technology impacts on specific subjects. The aim is to provide an authoritative reflection of the ‘state of the art’ in the application of computing and technology. The series will be critical reading for experts in digital humanities and technology issues but will also be of wide interest to all scholars working in humanities and arts research.

AHRC ICT Methods Network Editorial Board

Sheila Anderson, Centre for e-Research, King’s College London
Chris Bailey, Leeds Metropolitan University
Bruce Brown, University of Brighton
Mark Greengrass, University of Sheffield
Susan Hockey, University College London
Sandra Kemp, Royal College of Art
Simon Keynes, University of Cambridge
Julian Richards, University of York
Seamus Ross, University of Glasgow
Charlotte Roueché, King’s College London
Kathryn Sutherland, University of Oxford
Andrew Wathey, Northumbria University

Forthcoming titles in the series

The Virtual Representation of the Past
Edited by Mark Greengrass and Lorna Hughes
ISBN 978 0 7546 7288 3

Modern Methods for Musicology: Prospects, Proposals and Realities
Edited by Tim Crawford and Lorna Gibson
ISBN 978 0 7546 7302 6

What’s in a Word-list?

Investigating Word Frequency and Keyword Extraction

Edited by Dawn Archer
University of Central Lancashire, UK

© Dawn Archer 2009

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise without the prior permission of the publisher.

Dawn Archer has asserted her moral right under the Copyright, Designs and Patents Act, 1988, to be identified as the editor of this work.

Published by
Ashgate Publishing Limited, Wey Court East, Union Road, Farnham, Surrey, GU9 7PT, England
Ashgate Publishing Company, Suite 420, 101 Cherry Street, Burlington, VT 05401-4405, USA

www.ashgate.com

British Library Cataloguing in Publication Data
What’s in a word-list? : investigating word frequency and keyword extraction. - (Digital research in the arts and humanities)
1. Language and languages - Word frequency
I. Archer, Dawn
410.1'51

Library of Congress Cataloging-in-Publication Data
What’s in a word-list? : investigating word frequency and keyword extraction / [edited] by Dawn Archer.
p. cm. -- (Digital research in the arts and humanities)
Includes bibliographical references and index.
ISBN 978-0-7546-7240-1
1. English language--Word frequency. 2. English language--Word order. 3. English language--Etymology. I. Archer, Dawn.
PE1691.W5 2008
428.1--dc22
2008034289

ISBN: 978-0-7546-7240-1 (Hardback)
ISBN: 978-0-7546-8065-9 (E-book)

Contents

List of Figures
List of Tables
Notes on Contributors
Acknowledgements
Series Preface

1 Does Frequency Really Matter?
Dawn Archer

2 Word Frequency: Use or Misuse?
John M. Kirk

3 Word Frequency, Statistical Stylistics and Authorship Attribution
David L. Hoover

4 Word Frequency in Context: Alternative Architectures for Examining Related Words, Register Variation and Historical Change
Mark Davies

5 Issues for Historical and Regional Corpora: First Catch Your Word
Christian Kay

6 In Search of a Bad Reference Corpus
Mike Scott

7 Keywords and Moral Panics: Mary Whitehouse and Media Censorship
Tony McEnery

8 ‘The question is, how cruel is it?’ Keywords, Fox Hunting and the House of Commons
Paul Baker

9 Love – ‘a familiar or a devil’? An Exploration of Key Domains in Shakespeare’s Comedies and Tragedies
Dawn Archer, Jonathan Culpeper, Paul Rayson

10 Promoting the Wider Use of Word Frequency and Keyword Extraction Techniques
Dawn Archer

Appendix 1
Appendix 2 USAS taxonomy
Bibliography
Index

List of Figures

1.1 Results for * ago in the BNC, using VIEW

3.1 The frequency of the thirty most frequent words of The Age of Innocence
3.2 Modern American poetry: Delta analysis (2,000 MFWs)
3.3 Modern American poetry: Delta-Oz analysis (2,000 MFWs)
3.4 Modern American poetry: Delta-Lz (>0.7) analysis (3,000 MFWs)
3.5 Authorship simulation: Delta analysis (800 MFWs)
3.6 Authorship simulation: Delta-2X analysis (800 MFWs)
3.7 Authorship simulation: Delta-3X analysis (800 MFWs)
3.8 Authorship simulation: Delta-P1 analysis (600 MFWs)
3.9 Authorship simulation: Delta-Oz analysis (800 MFWs)
3.10 Authorship simulation: Delta-Lz (>0.7) analysis (1,000 MFWs)
3.11 Authorship simulation: changes in Delta and Delta-z from likeliest to second likeliest author: Delta-Lz (1,000 MFWs)

6.1 Berber Sardinha’s (2004, 102) formula
6.2 Saving keywords as text
6.3 Importing a word-list from plain text
6.4 Detailed consistency view of the twenty-two keyboard sets
6.5 Excel spreadsheet of results
6.6 Precision values for text A6L
6.7 Precision values for text KNG
6.8 Precision values for text A6L with genred BNC RCs

7.1 Words which are key keywords in five or more chapters of the MWC
7.2 Words which are key keywords in all of the MWC texts
7.3 The responsible
7.4 Porn is good
7.5 The call for the restoration of decency
7.6 Pronoun use by VALA
7.7 The assumption of Christianity
7.8 Speaking up for the silent majority
7.9 The use of wh-interrogatives by VALA

8.1 Keywords when p<0.000001


List of Tables

2.1 Verbal spellings in ‘–ise’ and ‘–ize’
2.2 Frequencies of be forms
2.3 Irish loanwords
2.4 Lexicon of other words in ICE-IRL deemed ‘Irish’
2.5 Frequency of modal verbs
2.6 Occurrences of get
2.7 Local names (or onomastic lexicon)
2.8 Dictionaries and word frequency
2.9 List of discourse markers in the SPICE-Ireland corpus

3.1 The sixty MFWs of The Age of Innocence
3.2 Masculine and feminine pronouns in eight novels
3.3 Calculating Delta for ‘the’ in The Professor and six other novels

4.1 Example of 3-grams where lem1 = ‘break’ and word2 = ‘the’
4.2 Example of 3-grams where LEM1 = ‘break’ and WORD2 = ‘the’
4.3 Example of 3-grams where LEM1 = ‘break’ and WORD2 = ‘the’
4.4 ‘Hard N’ (N duro) in the Corpus del Español, by century
4.5 Sequential n-gram ‘tokens’
4.6 Text meta information table
4.7 Grouping by synonyms
4.8 WordNet/BNC integration: frequency of synonyms of ‘sad’
4.9 Synonyms of [bad] + NOUN
4.10 Sequential n-gram table
4.11 Text meta information table
4.12 Searching by register: lexical verbs in legal texts

5.1 HTE entries for ‘war’ and ‘peace’

6.1 Numbers of texts in each RC
6.2 Keywords identified using only Shakespeare RC
6.3 Doctor-patient keywords with two RCs
6.4 Genred RCs

7.1 Text categories in the Brown corpus
7.2 Keywords of the MWC when compared with the LOB corpus
7.3 Keywords in the MWC derived from a comparison of FLOB
7.4 The keywords of the MWC placed into moral panic discourse categories
7.5 Words which are key keywords in five or more chapters of the MWC mapped into the moral panic discourse rôles
7.6 Words which are key keywords in all of the MWC texts mapped into their moral panic discourse rôles
7.7 The distribution of chapter only, text only and chapter and text key keywords across the moral panic discourse categories
7.8 The key keyword populated model
7.9 Corrective action keywords
7.10 The keyword ‘report’
7.11 Collocates of ‘pornography’ and ‘pornographic’
7.12 Enclitics which are negative keywords in the MWC when the MWC is compared to the subsections of LOB
7.13 The relative frequency of genitive ‘’s’ forms and enclitic ‘’s’ forms in the MWC compared to the sub-section of LOB
7.14 The collocates of ‘programme’ and ‘programmes’
7.15 The collocates of ‘film’
7.16 The collocates of ‘television’ and ‘broadcasting’
7.17 The collocates of ‘decency’

8.1 Concordance of ‘criminal’
8.2 Sample concordance of ‘fellow citizens’, ‘Britain’ and ‘people’ (pro-hunt)
8.3 Concordance (sample) of ‘cruelty’ (anti-hunt)
8.4 Concordance (sample) of ‘cruelty’ (pro-hunt)
8.5 Concordance (sample) of words tagged as S1.2.6 ‘sensible’ (pro-hunt)
8.6 Concordance (sample) of words tagged as G2.2 – ‘ethics: general’ (pro-hunt)
8.7 Concordance (sample) of words tagged as S1.2.5 5 ‘Toughness; strong/weak’ (anti-hunt)

9.1 The top level of the USAS System
9.2 The most overused items in the comedies relative to the tragedies
9.3 Participants in intimate/sexual relationships
9.4 Processes in intimate/sexual relationships
9.5 Love-related lexical items which occur in the comedies and tragedies
9.6 Most overused items in the tragedies relative to the comedies
9.7 Domain collocates of S3.2 ‘Intimate relationship’ in the comedies
9.8 Domain collocates of S3.2 ‘Intimate relationship’ in the tragedies

Notes on Contributors

Dawn Archer is a reader in corpus linguistics at the University of Central Lancashire. Much of her work combines both (historical) pragmatics and corpus linguistics. Over the last five years she has published one monograph – Historical Sociopragmatics: Questions and Answers in the English Courtroom (1640–1760) – two edited collections, and numerous journal articles and chapters in books. She is also co-developer (with Dr Paul Rayson) of a Variant Detector (VARD) – a computer program which helps users to ‘detect’ variant spellings and normalize them, enabling the electronic annotation of historical texts.

Paul Baker is a senior lecturer in the Department of Linguistics and English Language at Lancaster University. His books include Using Corpora in Discourse Analysis (2006), A Glossary of Corpus Linguistics (2006, with Andrew Hardie and Tony McEnery), Public Discourses of Gay Men (2005) and Polari: The Lost Language of Gay Men (2002). He is commissioning editor for the journal Corpora, and his research interests include language and identities, and combining corpus linguistics with critical discourse analysis. He has recently completed a project examining constructions of refugees in a corpus covering ten years of British newspaper articles.

Jonathan Culpeper is a senior lecturer in the Department of Linguistics and English Language at Lancaster University. His work spans pragmatics, stylistics and the history of English, and his major publications include History of English (2nd edn, 2005), Cognitive Stylistics (2002, edited with Elena Semino), Exploring the Language of Drama (1998, co-edited with Mick Short and Peter Verdonk) and Language and Characterisation in Plays and Other Texts (2001). Corpora, and corpus linguistics, underpin much of his work, but particularly in current projects relating to the history of English and also corpus stylistics.

Mark Davies is a professor of corpus linguistics at Brigham Young University in Provo, Utah, USA. He has published more than 50 articles and books dealing with corpus linguistics, and with syntactic change and variation in Spanish and Portuguese. He has also created an innovative architecture for large corpora (based on relational databases), and has placed online several large corpora which are based on this architecture. These include the 100-million-word Corpus del Español, the 45-million-word Corpus do Português, a new architecture and interface for the 100-million-word British National Corpus (UK English, 1980s–1990s) and a 100-million-word corpus based on Time Magazine (US English, 1900s). In early 2008, he placed online the Corpus of Contemporary American English. This is the first large, balanced corpus of American English, and contains nearly 400 million words from 1990 to the present time.

David L. Hoover received his PhD in English language from Indiana University in 1980, and is currently professor of English and webmaster at New York University, where he has taught since 1981. He has worked in the areas of Anglo-Saxon metre, linguistic stylistics and humanities computing for over 25 years, recently concentrating on authorship attribution and computational stylistics. He is the author of A New Theory of Old English Meter (1985) and Language and Style in The Inheritors (1999), and has edited Stylistics: Prospect and Retrospect (2007, with Sharon Lattig). He is active in the Poetics and Linguistics Association, The Association for Computers and the Humanities, and The Association for Literary and Linguistic Computing.

Christian Kay is an honorary professorial research fellow in the Department of English Language at the University of Glasgow. Her research interests include contemporary and historical semantics and syntax; the history of the English language; lexicology and lexicography, especially conceptual thesauri; and the use of databases and corpora in linguistic research. Current projects include the Historical Thesaurus of English and the Corpus of Modern Scottish Writing, the successor to the Scottish Corpus of Text and Speech. She is convener of Scottish Language Dictionaries, the body responsible for the major academic dictionaries of Scots. She is co-author of A Thesaurus of Old English (2000) and The Historical Thesaurus of the OED (forthcoming 2009).

John M. Kirk is senior lecturer in English and Scottish language at Queen’s University Belfast. He has compiled the Northern Ireland Transcribed Corpus of Speech and, with Jeffrey Kallen, the Irish Component of the International Corpus of English (ICE-Ireland) and, arising from the latter’s spoken component, the SPICE-Ireland Corpus. He has so far edited 13 books, including Corpora Galore, and is general editor of Belfast Studies in Language, Culture and Politics and Queen’s Scots Texts.

Tony McEnery is professor of English language and linguistics at Lancaster University. He has published widely in the area of corpus linguistics and is the author, with Andrew Wilson, of Corpus Linguistics (1997). His most recent book, Swearing in English (2005), links corpus linguistics, historical linguistics and sociology, and his work in this volume is closely related to the work in that book.

Paul Rayson is the director of the Lancaster University Centre for Computer Corpus Research on Language (UCREL). He has undertaken recent projects in the areas of assisting translators, studying language change in twentieth-century British English and extending a semantic annotation tool for research on metaphor. He has published over 75 papers in corpus-based natural language processing, is production editor on the journal Corpora, and co-editor of the Routledge series of frequency dictionaries. He co-organized the first four international Corpus Linguistics conferences (Lancaster, 2001–2003, and Birmingham, 2005–2007).

Mike Scott has worked at the University of Liverpool since 1990. His main research interest is the computer processing of language. He has developed two software programs, MicroConcord (1993) and WordSmith Tools (1996). WordSmith has passed through several versions, the latest of which is WordSmith 5.0 (2008). The features he regards as most rewarding in his own research concern keyness and the identification of key patterns of clustering. Recent publications include Textual Patterns: Keyword and Corpus Analysis in Language Education, which he co-authored with Chris Tribble.


Acknowledgements

My thanks to the following for their help and/or encouragement in the preparation of this book: Marilyn Deegan, Hazel Gardiner, Lorna Gibson, Lydia Horstman, Lorna Hughes and the fantastic staff at the AHRC ICT Methods Network. I reserve a special thanks to Matthew Davies, for his unending patience when helping me to give shape to this collection.


Series Preface

What’s in a Word-List? is volume 3 of Digital Research in the Arts and Humanities. Each of the titles in this series comprises a critical examination of the application of advanced ICT methods in the arts and humanities. That is, the application of formal computationally based methods, in discrete but often interlinked areas of arts and humanities research. Usually developed from Expert Seminars, one of the key activities supported by the Methods Network, these volumes focus on the impact of new technologies in academic research and address issues of fundamental importance to researchers employing advanced methods.

Although generally concerned with particular discipline areas, tools or methods, each title in the series is intended to be broadly accessible to the arts and humanities community as a whole. Individual volumes not only stand alone as guides but collectively form a suite of textbooks reflecting the ‘state of the art’ in the application of advanced ICT methods within and across arts and humanities disciplines. Each is an important statement of current research at the time of publication, an authoritative voice in the field of digital arts and humanities scholarship.

These publications are the legacy of the AHRC ICT Methods Network and will serve to promote and support the ongoing and increasing recognition of the impact on and vital significance to research of advanced arts and humanities computing methods. The volumes will provide clear evidence of the value of such methods, illustrate methodologies of use and highlight current communities of practice.

Marilyn Deegan, Lorna Hughes, Harold Short
Series Editors
AHRC ICT Methods Network
Centre for Computing in the Humanities
King’s College London


About the AHRC ICT Methods Network

The aims of the AHRC ICT Methods Network were to promote, support and develop the use of advanced ICT methods in arts and humanities research and to support the cross-disciplinary network of practitioners from institutions around the UK. It was a multi-disciplinary partnership providing a national forum for the exchange and dissemination of expertise in the use of ICT for arts and humanities research. The Methods Network was funded under the AHRC ICT Programme from 2005 to 2008.

The Methods Network Administrative Centre was based at the Centre for Computing in the Humanities (CCH), King’s College London. It coordinated and supported all Methods Network activities and publications, as well as developing outreach to, and collaboration with, other centres of excellence in the UK. The Methods Network was co-directed by Harold Short, Director of CCH, and Marilyn Deegan, Director of Research Development, at CCH, in partnership with Associate Directors: Mark Greengrass, University of Sheffield; Sandra Kemp, Royal College of Art; Andrew Wathey, Royal Holloway, University of London; Sheila Anderson, Arts and Humanities Data Service (AHDS) (2006–2008); and Tony McEnery, University of Lancaster (2005–2006).

The project website provides access to all Methods Network materials and outputs. In the final year of the project a community site, ‘Digital Arts & Humanities’ (http://www.arts-humanities.net), was initiated as a means to sustain community building and outreach in the field of digital arts and humanities scholarship beyond the Methods Network’s funding period.

Note on the text

All publications in English are presumed as published in London unless otherwise stated.

I dedicate this edited collection to my husband, Eddie, to my children, Paul, Peter, Jonathan and Jessica, and to my daughters-in-law, Becky and Charlotte. Thank you, Eddie, for your constant love and support. Thanks kids for ensuring that life is always full of surprises!


Chapter 1

Does Frequency Really Matter?

Dawn Archer

Words, words, words

A hypothesis popular amongst computer hackers – the infinite monkey theorem – holds that, given enough time, a device that produces a random sequence of letters ad infinitum will, ultimately, create not only a coherent text, but also one of great quality (for example, Shakespeare’s Hamlet). The hypothesis has become more widely known thanks to David Ives’ satirical play, Words, Words, Words (Dramatists Play Service, NY). In the play, three monkeys – Kafka, Milton and Swift – are given the task of writing something akin to Hamlet, under the watchful eye of the experiment’s designer, Dr Rosenbaum. But, as Kafka reveals when she reads aloud what she has typed thus far, the experiment is beset with seemingly insurmountable difficulties:

‘“K k k k k, k k k! K k k! K ... k ... k.” I don’t know! I feel like I’m repeating myself!’

In my view, Kafka’s concern about whether the simple repetition of letters can produce a meaningful text is well placed. But I would contend that the frequency with which particular words are used in a text can tell us something meaningful about that text and also about its author(s) – especially when we compare word choice/usage against the word choice/usage of other texts (and their authors). This can be explained, albeit in a simplistic way, by inverting the underlying assumption of the infinite monkey theorem: we learn something about texts by focussing on the frequency with which authors use words precisely because their choice of words is seldom random. As support for my position, I offer to the reader this edited collection, which brings together a number of researchers involved in the promotion of ICT methods such as frequency and keyword analysis. Indeed, the chapters within What’s in a Word-list? Investigating Word Frequency and Keyword Extraction were originally presented at the Expert Seminar in Linguistics (Lancaster 2005). This event was hosted by the AHRC ICT Methods Network as a means of demonstrating to the Arts and Humanities disciplines the broad applicability of corpus linguistic techniques and, more specifically, frequency and keyword analysis.

1 The infinite monkey theorem was first introduced by Émile Borel at the beginning of the twentieth century, and was later popularized by Sir Arthur Eddington.
2 D. Ives, Words, Words, Words, Dramatists Play Service, New York.
3 Of course, the extent to which this process is a completely cognitive one is a matter of debate.

Explaining frequency and keyword analysis

Frequency and keyword analysis involves the construction of word lists, using automatic computational techniques, which can then be analyzed in a number of ways, depending on one’s interest(s). For example, a researcher might focus on the most frequent lexical items of a number of generated word frequency lists to determine whether all the texts are written by the same author. Alternatively, they might wish to determine whether the most frequent words of a given text (captured by its word frequency list) are suggestive of potentially meaningful patterns that they might have missed had they read the text manually. They might then go on to view the most frequent words in their word frequency list in context (using a concordancer) as a means of determining their collocates and colligates (i.e. the content and function words with which the most frequent words keep regular company). For example, the word ‘ago’ occurs 19,326 times in the British National Corpus (BNC) and, according to Hoey, ‘is primed for collocation with year, weeks and days’. We can easily confirm this by entering the search string ‘* ago’ into Mark Davies’s relational database of the BNC. In fact, we find that nouns relating to periods of time account for the 20 most frequent collocates of ‘ago’ (see Davies, this volume, for a detailed discussion of the relational database employed here, and Scott and Tribble, for a more extensive discussion of the collocates of ‘ago’).
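By way of illustration, the construction of a word frequency list and the search for the words that sit one slot to the left of ‘ago’ can be sketched in a few lines of Python. This is a minimal sketch rather than the procedure used by VIEW or by any particular concordancer: the function names (`tokenize`, `word_frequency_list`, `left_collocates`), the crude regular-expression tokenizer and the invented sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (assumed, very crude tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def word_frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def left_collocates(tokens, node):
    """Count the words that occur one slot to the left of the node word."""
    return Counter(prev for prev, cur in zip(tokens, tokens[1:]) if cur == node)


# Invented sample text, for illustration only.
sample = ("She left two years ago. He called three weeks ago, "
          "and we met a few days ago. Years ago it was different.")

tokens = tokenize(sample)
freq = word_frequency_list(tokens)
colls = left_collocates(tokens, "ago")

print(freq[:3])               # the head of the word frequency list
print(colls.most_common())    # words immediately preceding 'ago'
```

On a corpus of any real size, counts of this kind would normally be read from a pre-built index or database rather than recomputed from the raw text on every query.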
The researcher(s) who are interested in keyword analysis may also be interested in collocation and/or colligation, but they will compare, initially, the word frequency list of their chosen text (let’s call it text A) with the word frequency list of another normative or reference text (let’s call it text B) as a means of identifying both words that are frequent and also words that are infrequent in text A, statistically speaking, when compared to text B. This has the advantage of removing words that are common to both texts, and so allows the researcher to focus on those words that make text A distinctive from text B (and vice versa). In the case of the majority of English texts, this will mean that function words (‘the’, ‘and’, ‘if’, etc.) do not occur in a generated keywords list, because function words tend to be frequent in the English language as a whole (and, as a result, are commonly found in English texts). That said, function words can occur in a keyword list if their usage is strikingly different from the norm established by the reference text. Indeed, when Culpeper undertook a keywords analysis of six characters from Shakespeare’s Romeo and Juliet, using the play minus the words of the character under analysis as his reference text, he found that Juliet’s most frequent keyword was actually the function word ‘if’. On inspecting the concordance lines for ‘if’ and additional keyword terms, in particular, ‘yet’, ‘would’ and ‘be’, Culpeper concluded that, when viewed as a set, they served to indicate Juliet’s elevated pensiveness, anxiety and indecision, relative to the other characters in the play.

Figure 1.1 Results for * ago in the BNC, using VIEW

4 M. Scott and C. Tribble, Textual Patterns: Keyword and Corpus Analysis in Language Education (Amsterdam: Benjamins, 2006), p. 5.
5 Produced in the 1990s, the BNC is a 100 million-word corpus of modern British English containing registers that are representative of the spoken and written medium.
6 M. Hoey, Lexical Priming: A New Theory of Words and Language (Routledge, 2005), p. 177.
7 Scott and Tribble, Textual Patterns, p. 43.
8 Normative corpus and reference corpus are often used interchangeably by corpus linguists.
9 J. Culpeper, ‘Computers, Language and Characterisation: An Analysis of Six Characters in Romeo and Juliet’, in U. Melander-Marttala, C. Östman and M. Kytö (eds), Conversation in Life and in Literature: Papers from the ASLA Symposium (Uppsala: Association Suédoise de Linguistique Appliquée, 2002), pp. 11–30.

Text mining techniques as indicators of potential relevance

As the example of Juliet (above) reveals, a set of automatically generated keywords will not necessarily match a set of human-generated keywords at first glance. In some instances, automatically generated keywords may also be found to be insignificant by the researcher in the final instance (see, for example, Archer et al., this volume), in spite of being classified as statistically significant by text analysis software. This is not as problematic as it might seem. The reason? The main utility of keywords and similar text-mining procedures is that they identify (linguistic) items which are:

1. likely to be of interest in terms of the text’s aboutness¹⁰ and structuring (that is, its genre-related and content-related characteristics); and,
2. likely to repay further study – by, for example, using a concordancer to investigate collocation, colligation, etc. (adapted from Scott, this volume).

Put simply, the contributors to this edited collection are not seeking (or wanting) to suggest that the procedures they utilize can replace human researchers. On the contrary, they offer them as a way in to texts – or, to use corpus linguistic terminology, a way of mining texts – which is time-saving and, when used sensitively, informative.

Aims, organization and content of the edited collection

The aims of What’s in a Word-list? are similar to those of the 2005 Expert Seminar, mentioned above:

• to demonstrate the benefits to be gained by engaging in corpus linguistic techniques such as frequency and keyword analysis; and,
• to demonstrate the very broad applicability of these techniques both within and outside the academic world.

These aims are especially relevant today when one considers the rate at which electronic texts are becoming available, and the recent innovations in analytic techniques which allow such data to be mined in illuminating (and relatively trouble free) ways. The contributors also identify a number of issues that are crucial, in their view, if corpus linguistic techniques are to be applied successfully within and beyond the field of linguistics. They include determining:

• what counts as a word
• what we mean by frequency
• why frequency matters so much
• the consistency of the various keyword extraction techniques
• which of the (key)words captured by keyword/word frequency lists are the most relevant (and which are not)
• whether the (de)selection of keywords introduces some level of bias
• what counts as a reference corpus and why we need one
• whether a reference corpus can be bad and still show us something
• what we gain (in real terms) by applying frequency and keyword techniques to texts.

10 M. Phillips, ‘Lexical Structure of Text’, Discourse Analysis Monographs 12 (Birmingham: University of Birmingham, 1989).
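The comparison of a text A with a reference text B, described earlier in this chapter, can be made concrete with a short sketch. The chapter does not commit to a particular statistic at this point; the log-likelihood measure used below is one statistic commonly employed for keyness and is chosen here as an assumption, as are the function names, the toy token lists and the cut-off of 3.84 (the chi-square critical value for p < 0.05 at one degree of freedom).

```python
import math
from collections import Counter


def log_likelihood(a, b, total_a, total_b):
    """Log-likelihood score for a word seen a times in text A and b times in text B."""
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    score = 0.0
    if a:
        score += a * math.log(a / expected_a)
    if b:
        score += b * math.log(b / expected_b)
    return 2 * score


def keywords(tokens_a, tokens_b, threshold=3.84):
    """Return (word, score, direction) for words scoring at or above the threshold.

    'overused' means relatively more frequent in text A than in reference text B;
    'underused' means relatively less frequent.
    """
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    n_a, n_b = len(tokens_a), len(tokens_b)
    results = []
    for word in set(freq_a) | set(freq_b):
        a, b = freq_a[word], freq_b[word]
        score = log_likelihood(a, b, n_a, n_b)
        if score >= threshold:
            direction = "overused" if a / n_a > b / n_b else "underused"
            results.append((word, score, direction))
    return sorted(results, key=lambda r: r[1], reverse=True)


# Toy data, for illustration only: 'if' is unusually frequent in text A.
text_a = ["if"] * 10 + ["the"] * 10 + ["love"] * 5
text_b = ["if"] * 2 + ["the"] * 40 + ["love"] * 8

key = keywords(text_a, text_b)
for word, score, direction in key:
    print(f"{word:6s} {score:6.2f} {direction}")
```

In this toy comparison a function word can surface as a keyword simply because its relative frequency departs sharply from that of the reference text, while a word used at a similar rate in both texts drops out of the list. Which words then merit interpretation remains a matter for the researcher.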

Word frequency: use or misuse?

John Kirk begins the edited collection by (re)assessing the concept of the word (as token, type and lemmatized type), the range of words (in terms of their functions and meanings) and thus our understanding of word frequency (as a property of data). He then goes on to refer to a range of corpora – the Corpus of Dramatic Texts in Scots, the Northern Ireland Transcribed Corpus of Speech, and the Irish component of the International Corpus of English – to argue that, although word frequency appears to promise precision and objectivity, it can sometimes produce imprecision and relativity. He thus proposes that, rather than regarding word frequency as an end in itself (and something that requires no explanation), we should promote it as:

• something that needs interpretation through contextualization
• a methodology, which lends itself to approximation and replicability.
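The distinction between tokens, types and lemmatized types mentioned above can be illustrated with a small sketch. It is offered only as an assumption-laden example: the lemma table is hand-made and hypothetical, the tokenizer is deliberately crude, and neither reflects the procedures used in Kirk’s chapter.

```python
import re

# Hypothetical, hand-made lemma table, for illustration only.
LEMMAS = {"is": "be", "was": "be", "are": "be", "were": "be",
          "gets": "get", "got": "get"}


def count_words(text, lemmas=LEMMAS):
    """Return (tokens, types, lemmatized types) for a text."""
    tokens = re.findall(r"[a-z]+", text.lower())       # every running word
    types = set(tokens)                                # distinct word forms
    lemma_types = {lemmas.get(t, t) for t in types}    # forms grouped under a headword
    return len(tokens), len(types), len(lemma_types)


counts = count_words("He was late. She is late. They were late and he got what she gets.")
print(counts)
```

The same stretch of text thus yields three different answers to the question ‘how many words?’, which is one reason why a frequency figure needs to state what has been counted.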

Kirk also argues that there are advantages to be gained by paying attention to words of low frequency as well as words of high frequency. In his concluding comments, he touches on the contribution made to linguistic theory by word frequency studies, and, in particular, the usefulness of authorship studies in the detection of plagiarism.

Word frequency, statistical stylistics and authorship attribution

David Hoover continues the discussion of high versus low frequency words, and authorship attribution, focussing specifically on some of the innovations in analytic techniques and in the ways in which word frequencies are selected for analysis. He begins with an explanation of how, historically, those working within authorship attribution and statistical stylistics have tended to base their findings on fewer than the 100 most frequent words of a corpus. These words – almost exclusively function words – are attractive because they are so frequent that they account for most of the running words of a text, and because such words have been assumed to be especially resistant to intentional manipulation by an author.11 Hoover then goes on to document the most recent work on style variation which, by concentrating on word frequency in given sections of texts rather than in the entire corpus, is proving more effective in capturing stylistic shifts. A second recent trend identified by Hoover is that of increasing the number of words analysed to as many as the 6,000 most frequent words – a point at which almost all the words of the text are included, and almost all of these are content words. The final sections of his chapter are devoted to the authorship attribution community’s renewed interest in Delta, a method for identifying differences between texts that is based on comparing how individual texts within a corpus differ from the mean for that entire corpus (following the innovative work of John Burrows). Drawing on a two million word corpus of contemporary American poetry and a much larger corpus of 46 Victorian novels, Hoover also argues that refinements in the selection of words for analysis and in alternative formulas for calculating Delta may allow for further improvements in accuracy, and result, in turn, in the establishment of a theoretical explanation of how and why word frequency analysis is able to capture authorship and style.

11 This means that their frequencies should reveal authorial habits which remain relatively constant across a variety of texts.

Word frequency in context

In Chapter 4, Mark Davies introduces some alternatives to techniques based on word searching. In particular, he focuses on the use he has made of architectures based on relational databases and n-gram12 frequencies when developing corpora (including the 100-million-word Corpus del Español,13 a BNC-based 100-million-word corpus modelled on the same architecture (Variation in English Words and Phrases, VIEW),14 and a 40-million-word Corpus of Historical English15). Davies’ main proposal is that such architectures can dramatically improve performance in the searching of corpora. For example, the following capture three of the many simple word frequency queries that take no more than one to two seconds on a 100-million-word corpus:

• overall frequency of a given word, set of words, phrase, or substring in the corpus;
• ‘slot-based’ queries, e.g. the most common nouns one ‘slot’ after ‘mysterious’, or z-score rank words immediately preceding ‘chair’; and,
• wide-range collocates, e.g. the most common nouns within a ten-word window (left or right) of ‘string’ or ‘broken’.
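By way of illustration, a ‘slot-based’ query of the kind listed above can be sketched in a few lines of Python. This is a toy example and not Davies’ relational-database architecture: the sentence, the function names (ngram_table, slot_after) and the min_freq parameter are all assumptions introduced here, and the toy data carries no part-of-speech tags, so the query returns any word found in the slot rather than nouns only.

```python
from collections import Counter

def ngram_table(tokens, n):
    """Count every run of n consecutive tokens (a toy n-gram frequency table)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def slot_after(bigrams, word, min_freq=1):
    """Rank the items found one slot after `word`, read off the bigram table."""
    hits = Counter()
    for (first, second), freq in bigrams.items():
        if first == word and freq >= min_freq:
            hits[second] += freq
    return hits.most_common()

# Invented sentence, for illustration only.
tokens = ("the mysterious stranger met the mysterious stranger "
          "near the mysterious island").split()
bigrams = ngram_table(tokens, 2)

print(slot_after(bigrams, "mysterious"))
print(slot_after(bigrams, "mysterious", min_freq=2))
```

Because the counts are computed once and then simply looked up, the query itself does no searching of the running text; raising min_freq imitates a frequency cut-off of the sort that keeps such tables to a manageable size.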

Davies also highlights the importance of developing an architecture that can account for variation through the creation of n-gram frequency tables for each register within a given corpus. The advantage of such an approach is that each n-gram will have an associated frequency according to historical period and register, and this information will be directly accessible as part of a given query. Like Kirk (Chapter 2, this volume), Davies is emphatic that word frequency is something that needs interpretation through contextualization. Indeed, he advocates that word frequency ‘be analyzed not just as the overall frequency of a given word or lemma in a certain corpus, but, rather, as the frequency of words in a wide range of related contexts’ (p. 66). Unlike Kirk, however, he does not seem to be greatly concerned about the inclusion of low frequency words in any given query. This is because of a potential ‘size issue’ which means that n-gram tables can ‘become quite unmanageable’ when dealing with excessively large corpora (p. 57). Consequently, Davies advocates that, for such corpora, we include just those n-grams that occur three times or more. This is not a problem if one is interested in only the highly-frequent n-grams, of course, but it could make a detailed comparison of sub-corpora problematic.

12 An n-gram is a (usually consecutive) sequence of items from a corpus. The items in question can be characters (letters and numbers) or more usually words.
13 .
14 .
15 .

Issues for historical and regional corpora – first catch your word

In Chapter 5, Christian Kay focuses primarily on variable spelling within historical texts, and the difficulties that this occasions when seeking to ‘catch a word’ in corpora, especially corpora such as the Historical Thesaurus of English (HTE), a semantic index to the Oxford English Dictionary,16 which is supplemented by Old English materials (published separately in Roberts et al.’s A Thesaurus of Old English17) and, as such, captures English vocabulary from the earliest written records to the present.18 Kay goes on to point out that spelling variation can also create problems when searching corpora relating to (modern-day) non-standard varieties such as the Scottish Corpus of Texts and Speech and the Dictionary of the Scots Language.
Indeed, even the specialized dictionaries that lemmatize common variants (for example, the Dictionary of the Scots Language) are by no means comprehensive. She also demonstrates how homonymy and polysemy can create additional problems for those working with (historical and dialectal) corpora – and this is something that lemmatization may not be able to solve. Kay concludes by suggesting ways of addressing some of these problems using the resources described above, including the development of a rule-based system which predicts possible variants and maps them to the relevant headwords (see also Chapter 9). In addition, Kay touches on the relationship between e-texts (of which there are many) and structured corpora (of which there are few).

16 Oxford English Dictionary (Oxford: Oxford University Press, 1884–, and subsequent edns); OED Online, ed. J.A. Simpson (Oxford: Oxford University Press, 2000–).
17 J. Roberts, C. Kay and L. Grundy, A Thesaurus of Old English (Amsterdam: Rodopi, 2000 [1995]).
18 Word senses within the thesauri are organized in a hierarchy of categories and subcategories, with up to 14 levels of delicacy. The material is held in a database that can be searched on the Internet, and is likely to be of use in a range of humanities disciplines.

In search of a bad reference corpus

Mike Scott’s contribution to this edited collection tackles the issue of reference corpora. More specifically, he is interested in determining how bad a reference corpus can be before it becomes unusable (in the sense that it generates keywords that do not help to clarify the aboutness of a target text). As previous chapters have revealed, this issue is particularly pertinent, as good reference corpora are not available for all genres / periods / languages. Scott uses the keywords facility of his own text analysis program, WordSmith Tools,19 and takes as his starting point the formula proposed by Berber Sardinha, which suggests that the larger the reference corpus, the more keywords will be detected.20 Berber Sardinha also suggests that, as a reference corpus that is similar to the target text (i.e. the text being analysed) will filter out genre features common to both, an optimum reference corpus is one that contains several different genres. Drawing on a series of reference texts of varying lengths (32 in total: 22 BNC texts and 10 Shakespeare plays), Scott explores the different keyword results that are generated by WordSmith Tools for two target texts: an extract from a book profiling business leaders and a doctor/patient interaction. Scott pays particular attention to their ‘popularity’ and ‘precision’ scores as a means of answering three research questions:

1. To what extent does the size of the reference text impact on the quality of the keywords and, if so, is there a point at which the size of the reference text renders the (quality of the) keywords unacceptable?
2. What sort of keyword results obtain if a reference text is used which has little or no relation to the target text (beyond them both being written in the same language)?
3. What sort of keyword results obtain if genre is included as a variable?
Popularity relates to the presence of each keyword in the majority of the reference texts (for example, 20 out of the 22 BNC texts). This is based on the rationale that keywords which are identified using most of the reference texts are more likely to be useful than those identified in only a minority of the reference texts. Precision is computed following Oakes,21 and involves dividing the number of popular keywords (as determined by the popularity test) by the total number of keywords for each reference text. Whilst Scott admits that usefulness is a relative phenomenon, which is likely to vary according to research goals (and research goals cannot be predicted with certainty), he contends that it is still worth undertaking such a study, not least because it will help to determine the dimensions that appear to affect the meaningfulness (or not) of generated keywords. These include size in tokens (i.e. frequency), similarity of text-type, similarity of historical period, similarity of subject-matter, etc. More importantly, perhaps, this and later studies will provide a useful means of determining the robustness of the keywords procedure and thus, in turn, its potential usefulness in (non-)linguistic fields. And the indications from this preliminary study look promising; indeed, Scott suggests that even relatively restricted reference corpora can give good results in keyword extraction. That said, Scott notes that a small reference corpus containing a mixture of texts is likely to perform better than a larger corpus with more homogeneous texts.

19 M. Scott, WordSmith Tools, Version 4 (Oxford: Oxford University Press, 2004). WordSmith Tools is probably the most popular text analysis program in corpus linguistics. For more information, see .
20 The critical size of a reference corpus is said to be about two, three and five times the size of the node text: A.P. Berber Sardinha, Lingüística de Corpus (Barueri, São Paulo, Brazil: Editora Manole, 2004), pp. 101–103.

Keywords and moral panics – Mary Whitehouse and media censorship

Tony McEnery also utilizes the keywords facility of WordSmith Tools – in conjunction with his own lexically driven model of moral panic theory22 – as a means of determining the extent of moral panic in the books penned by Mary Whitehouse during the period 1967–77. In brief, words that are found to be key (i.e. statistically frequent) in the writings of Whitehouse (relative to a reference corpus23) are classified according to McEnery’s moral panic categories.24 These categories are heavily influenced by the moral panic theory of the sociologist, Stanley Cohen.
Indeed, they capture the discourse roles thought to typify moral panic discourse, including ‘object of offence’, ‘scapegoat’, ‘moral entrepreneur’, ‘corrective action’, ‘consequence’, ‘desired outcome’ and ‘rhetoric’. Some of the categories are also sub-classified according to pertinent semantic fields: for example, the scapegoat category contains the semantic fields of ‘people’, ‘research’, ‘broadcast programmes’, ‘media’, ‘media organisations and officers’ and ‘groups’. These semantic fields have been generated using a ‘bottom-up’ approach: that is to say, they have been constructed by McEnery, rather than being automatically identified by a text analysis tool (see Baker, Chapter 8, and Archer et al., Chapter 9, for a useful comparison with the ‘top-down’ approach). Using this procedure, McEnery is able not only to capture keywords that help establish the aboutness of the moral panic, but also to determine those words (like ‘violence’) that are actually key keywords, i.e. are key in a number of related texts (as well as moral panic categories) in the corpus.25

21 M. Oakes, Statistics for Corpus Linguistics (Edinburgh: Edinburgh University Press, 1998), p. 176.
22 A.M. McEnery, Swearing in English: Bad Language, Purity and Power from 1586 to the Present (Routledge, 2005).
23 McEnery opted to use the Lancaster-Oslo-Bergen (LOB) corpus as his reference corpus. The LOB captures 15 text categories, containing 500 printed texts of British English (approximately 2,000 words each) all of which were produced in 1961.
24 Ibid. See also A.M. McEnery, ‘The Moral Panic about Bad Language in England, 1691–1745’, Journal of Historical Pragmatics, 7/1 (2006): 89–113.

McEnery’s chapter is a good example of the benefits to be gained from combining a keywords methodology with other theories (linguistic and non-linguistic) – not least because it demonstrates the usefulness of corpus linguistic techniques beyond linguistics. In addition, McEnery is one of several authors in this edited collection (see, for example, Baker, Chapter 8, and Archer et al., Chapter 9) who seek to combine a quantitative approach to text analysis with a qualitative approach. Indeed, McEnery specifically focuses on the issue of bad language and, in particular, how bad language was represented by Whitehouse’s organization VALA (Viewers and Listeners’ Association), through an investigation of the collocations and colligations of several of the more prominent key keywords in Whitehouse’s books. Moreover, he discusses those findings not only in respect of their semantic importance, but also in respect of their ideological significance. He argues, for example, that the key keywords within the corrective action category, ‘parents’ and ‘responsible’, serve to generate in- and out-groups, the former being regarded as serious, reasonable and selfless, and the latter, as the antithesis of these qualities.
In addition, a closer inspection of the key keywords in context (using a concordancer) suggests that the in/out-group distinction is closely related to a dichotomy between (religious) conservatism and liberalism, which, in turn, is comparable to the opposition to bad language voiced by seventeenth-century religious organizations.

25 McEnery suggests that the key keywords approach is especially useful when one is working with large volumes of data (and the volume is such that the number of keywords generated is overwhelming). Key keywords are also useful if the transience of particular keywords may be an issue.

Keywords, fox hunting and the House of Commons

Paul Baker is the third author in this edited collection to utilize the keywords facility in WordSmith Tools – in this case, to examine a small corpus of debates on fox hunting (totalling 130,000 words). The debates took place in the (British) House of Commons in 2002 and 2003, prior to a ban being implemented in 2005. For the purposes of this study, Baker split the corpus into two sub-corpora (depending on whether speakers argued for or against a ban on fox hunting) so that they could be compared with each other, rather than with a more general reference text. The bulk of Baker’s chapter is dedicated to a discussion of the different discourses (or ways of looking at the world) that speakers access in order to persuade others of their point of view, which Baker identifies using concordance analyses of pertinent keywords. For example, he notes how the pro-hunt speakers overused ‘people’, relative to the anti-hunt speakers. Moreover, they tended to use the term to identify those:

• who would be adversely affected by the ban if it was implemented (because of losing their jobs and/or their communities and/or facing the possibility of a prison sentence, if they opted to ignore the ban), and
• who do not hunt, but were not upset or concerned by those who do.
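The kind of comparison made between two sub-corpora here can be sketched as follows. The example assumes log-likelihood as the keyness statistic, which is one commonly used choice and not necessarily the setting Baker used; the function name and all of the counts and sub-corpus sizes are invented for illustration.

```python
import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood keyness of one word across two (sub-)corpora."""
    total = freq_a + freq_b
    # Frequencies expected if the word were evenly spread across both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll

# Invented figures: occurrences of one word in two sub-corpora.
score = log_likelihood(300, 70_000, 150, 60_000)
print(round(score, 2))
```

A value above 3.84 corresponds to p < 0.05 at one degree of freedom. The statistic does not show the direction of the difference; overuse or underuse is read from the relative frequencies in the two sub-corpora.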

In addition, the pro-hunt speakers also utilized the keywords ‘fellow’, ‘citizens’, ‘Britain’ and ‘freedom’; the first two occurred together as a noun phrase (‘fellow citizens’) and, when used as such, were preceded in all cases by a first person possessive pronoun (‘my’ or ‘our’). Baker argues that the pro-hunt speakers could thus be seen to use an hegemonic rhetorical strategy to intimate that it was they (and not their opponents) who were able to speak for and with the people of Britain.

Baker also explores additional ways of using keyness to find salient language differences in texts, including the identification of key semantic categories (also referred to as key domains). A tool that enables such analysis to be undertaken automatically is the UCREL Semantic Analysis System (henceforth USAS, also referred to as the UCREL Semantic Annotation System). Developed at Lancaster University, USAS consists of a part-of-speech tagger, which utilizes CLAWS (the Constituent Likelihood Automatic Word-tagging System), and a semantic tagger that, at its conception, was loosely based on McArthur’s Longman Lexicon of Contemporary English,26 but has since been revised in the light of practical application.27 Currently, the semantic tagset consists of 21 macro categories that expand into 232 semantic fields (see Appendix 2). Once again, Baker focuses on just a few of the most salient key semantic categories. For example, he points out how the semantic category ‘S1.2.6 sensible’ is overused by the pro-hunt speakers in the parliamentary debates (relative to the anti-hunt speakers): words like ‘sensible’, ‘reasonable’, ‘common sense’ and ‘rational’ are used when discussing the reasons for keeping hunting, and ‘ridiculous’, ‘illogical’ and ‘absurd’, when describing the proposed ban on hunting, which prompts Baker to suggest that this may be another example of their hegemonic rhetorical strategy (i.e. presenting one’s view of the world as ‘right’ or ‘common sense’).
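The idea behind key domains can be sketched as follows: words that are individually infrequent are counted together under a shared semantic tag, and it is the total for the tag that is then compared across corpora. The lexicon below is a hypothetical stand-in for a semantic tagger (only the tag S1.2.6 and its member words come from the discussion above), and the sentence and function name are invented.

```python
from collections import Counter

# Hypothetical mini-lexicon standing in for a semantic tagger's dictionaries.
LEXICON = {
    "sensible": "S1.2.6", "reasonable": "S1.2.6", "rational": "S1.2.6",
    "ridiculous": "S1.2.6", "illogical": "S1.2.6", "absurd": "S1.2.6",
}

def domain_counts(tokens):
    """Count tokens per semantic tag; words outside the lexicon are pooled."""
    return Counter(LEXICON.get(token, "UNMATCHED") for token in tokens)

# Invented sentence, for illustration only.
tokens = "a sensible and reasonable case against an absurd and illogical ban".split()
word_counts = Counter(tokens)
tag_counts = domain_counts(tokens)

print(max(word_counts[word] for word in LEXICON))  # highest count of any single tagged word
print(tag_counts["S1.2.6"])                        # count for the category as a whole
```

Each of the tagged words occurs only once in this sentence, yet the category as a whole occurs four times, and it is that category total which would be passed to a keyness test.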

26 T. McArthur, Longman Lexicon of Contemporary English (Longman, 1981).
27 See, for example, A. Wilson and J. Thomas, ‘Semantic Annotation’, in R. Garside, G. Leech and A. McEnery (eds), Corpus Annotation: Linguistic Information from Computer Texts (Longman, 1997), pp. 55–65; P. Rayson, D. Archer, S.L. Piao and T. McEnery, ‘The UCREL Semantic Analysis System’, proceedings of the workshop on Beyond Named Entity Recognition Semantic Labelling for NLP Tasks in association with the fourth international conference on Language Resources and Evaluation (LREC, 2004), pp. 7–12.


Baker concludes by suggesting that keywords offer a potentially useful way of focussing researcher attention on aspects of a text or corpus, but that care should be taken not to over-focus on difference/presence at the expense of similarity/absence. He also suggests that the best means of gaining the fullest possible picture of the aboutness of text(s) is to use multiple reference corpora. For example, one might wish to compare texts of the same type against a (larger) corpus of (more) general language usage as a means of capturing those words that, because they are typical of the text-type or genre, may be too similar to show up as keywords in a same text-type comparison (see my discussion of Romeo and Juliet under ‘Explaining Frequency and Keyword Analysis’).

An exploration of key domains in Shakespeare’s comedies and tragedies

Archer, Culpeper and Rayson also utilize USAS, in this case to explore the concept of love in three Shakespearean love-tragedies (Othello, Antony and Cleopatra and Romeo and Juliet) and three Shakespearean love-comedies (A Midsummer Night’s Dream, The Two Gentlemen of Verona and As You Like It). Their aim is to add a further dimension to approaches that:

• use corpus linguistic methodologies such as keyword analysis to study Shakespeare,28 by systematically taking account of the semantic relationships between keywords through an investigation of key domains; and
• study Shakespeare from the perspective of cognitive metaphor theory,29 by providing empirical support for some of the love-related conceptual metaphors put forward by cognitive metaphor theorists.

In brief, their top-down30 approach involves determining how love is presented in the two datasets and then highlighting any resemblances between their findings and the conceptual metaphors identified by cognitive metaphor theorists. They also discuss how the semantic field of love co-occurs with different domains in the two datasets, and assess the implications this has for our understanding of the concept of love.

As the original USAS system is designed to undertake the automatic semantic analysis of present-day English, they have opted to utilize the historical version of the tagger. Developed by Archer and Rayson, the historical tagger includes supplementary historical dictionaries to reflect changes in meaning over time and a pre-processing step to detect variant (i.e. non-modern) spellings.31 The inclusion of a variant detector is important when automatically annotating historical texts as it means that variant spellings can be mapped to spellings that the text analysis tool can recognize; this, in turn, means that standard corpus linguistic methods (frequency profiling, concordancing, keyword analysis, etc.) are more effective (see Kay, Chapter 5).

28 See, for example, Culpeper, ‘Computers, Language and Characterisation’.
29 See, for example, D.C. Freeman, ‘“Catch[ing] the nearest way”: Macbeth and Cognitive Metaphor’, Journal of Pragmatics, 24 (1995): 689–708.
30 Top-down captures the fact that the categories are pre-defined and applied automatically by USAS.

At present, the taxonomy of the historical tagger is the same as that of the original. However, Archer et al. are using studies such as this to evaluate its suitability for the Early Modern English period.32 Indeed, they comment on the semantic domains that seem to capture the data well in their chapter, whilst also pointing out semantic domains that do not work as well. For example, they explain how the overuse of L3 ‘Plants’ in the love-comedies (relative to the love-tragedies) can be explained in large part by ‘Mustardseed’ (a character’s name) and ‘flower’ (part of the phrase, ‘Cupid’s flower’, i.e. the flower that Oberon used to send Titania to sleep in A Midsummer Night’s Dream). In addition, the bulk of the remaining items in the L3 category capture features of the setting (for As You Like It and A Midsummer Night’s Dream are set in the woods). Nevertheless, even within the L3 category, there are items which have a strong metaphorical association with ‘love’ or ‘sex’. By way of illustration, in As You Like It, Silvius uses an agricultural metaphor (‘crop’, ‘glean’, ‘harvest’, ‘reaps’) to confirm that he is prepared to have Phoebe as a wife in spite of her less-than-virginal state. According to Oncins-Martínez,33 the ‘sex is agriculture’ metaphor and its sub-mappings (‘a woman’s body is agricultural land’, ‘copulation is ploughing or sowing’, etc.) were common in the Early Modern English period.
As Archer et al.’s findings demonstrate, then, a keyness analysis does not merely capture aboutness; it can also uncover metaphorical usage, as in this case, or character traits, as in the case of Culpeper,34 discussed above. In addition, their approach can confirm – and also suggest amendments to – existing conceptual metaphors. By way of illustration, they suggest that the container idea within Barcelona Sánchez’s35 ‘eyes are containers for superficial love’ (which, in itself, is a development of Lakoff and Johnson’s36 ‘eyes are containers for the emotions’) is not clearly articulated in the (comedy) data, and that the latter would be better captured by the conceptual metaphor, ‘eyes are weapons of entrapment’. This particular finding is made possible because of their innovative analysis of key collocates at the domain level, using Scott Piao’s Multilingual Corpus Toolkit.37

Like Baker, Archer et al. believe that key domains can capture words that, because of their low (comparative) frequency, would not be identified as keywords in and of themselves.38 However, they are acutely aware that the USAS process is an automatic one, and so will mis-tag words on occasion. Archer et al. therefore suggest that researchers thoroughly check the results of such processes, using a manual examination of concordance lines to determine their contextual relevance. By way of illustration, they comment on the occurrence of ‘deer’, which is assigned to the category L2 ‘Living creatures’ by USAS. Archer et al. found that ‘deer’ (like many items assigned to L2) was used metaphorically, and can be captured by the conceptual metaphor ‘love is a living being’ and the related metaphor ‘the object of love is an animal’. When the concordance lines of these items were checked, they discovered that, although correctly assigned, the bulk of them had strong negative associations, semantically speaking. This finding contrasts with the items that Barcelona Sánchez39 discusses in respect of Romeo and Juliet. Indeed, even the ‘deer’ example is problematic: it is linked to cuckoldry in many of Shakespeare’s plays (e.g. Love’s Labour’s Lost, The Merry Wives of Windsor) and may indicate that it, too, had negative undertones for both Shakespeare and his audience.40

31 D. Archer, T. McEnery, P. Rayson and A. Hardie, ‘Developing an Automated Semantic Analysis System for Early Modern English’, in D. Archer, P. Rayson, A. Wilson and T. McEnery (eds), Proceedings of the Corpus Linguistics 2003 Conference, UCREL Technical Paper Number 16 (Lancaster: UCREL, 2003), pp. 22–31; P. Rayson, D. Archer and N. Smith, ‘VARD Versus Word: A Comparison of the UCREL Variant Detector and Modern Spell Checkers on English Historical Corpora’, Proceedings of the Corpus Linguistics Conference Series On-Line E-Journal 1:1 (2005).
32 Archer and Rayson are also exploring the feasibility of mapping the USAS tagset to, first, the categories utilized by Spevack, in his A Thesaurus of Shakespeare and, then, to the Historical Thesaurus of English.
33 J.L. Oncins-Martínez, ‘Notes on the Metaphorical Basis of Sexual Language in Early Modern English’, in J.G. Vázquez-González et al. (eds), The Historical Linguistics-Cognitive Linguistics Interface (Huelva: University of Huelva Press, 2006).
34 Culpeper, ‘Computers, Language and Characterisation’.
35 A. Barcelona Sánchez, ‘Metaphorical Models of Romantic Love in Romeo and Juliet’, Journal of Pragmatics, 24 (1995): 667–88, 679.
36 G. Lakoff and M. Johnson, Metaphors We Live By (Chicago and New York: University of Chicago Press, 1980).
37 S.L. Piao, A. Wilson and T. McEnery, ‘A Multilingual Corpus Toolkit’, paper given at AAACL–2002, Indianapolis, Indiana, USA, 2002.
38 Given many authors/(public) speakers seek to avoid unnecessary repetition by using alternatives to a given word, I would suggest that key domain analysis provides us with a useful means of capturing low frequency words that (although not key in and of themselves) do become ‘key’ when viewed alongside terms with similar meaning (see Rayson 2003, 100–113, for a more detailed exploration of the advantages of the key domains approach).
39 Barcelona Sánchez, ‘Metaphorical Models of Romantic Love in Romeo and Juliet’, p. 683.
40 Culpeper’s investigation is another useful reminder of the importance of checking – as a means of contextualizing – any generated keywords (or key domains). For Culpeper found that some of the nurse’s keywords in Romeo and Juliet (‘god’, ‘warrant’, ‘faith’, ‘marry’, ‘ah’) did not relate to her character at all – or to aboutness for that matter. Rather, they were surge features (or outbursts of emotion), which occurred at points in the play when the nurse was reacting to traumatic events (involving Juliet, in particular). Culpeper, ‘Computers, Language and Characterisation’.


Their final sentence is devoted to a call for quantitative analysis to be combined with qualitative analysis. For, like Baker (Chapter 8), they recognize that it is the researcher who must determine their cut-off points in respect of (contextual) salience in the final instance. Indeed, how the researcher chooses to interpret the data is probably the most important aspect of corpus-based research.

Promoting the wider use of word frequency and keyword extraction

In this final chapter, I report on several AHRC ICT Methods Network promotional events (some of which were inspired by the Expert Seminar in Linguistics) that have helped to bring frequency and keyword extraction techniques to a wider community of users. I also address ways in which we might promote word frequency and keyword extraction techniques to an even wider community than we have at present (commercial and academic). In particular, I stress the need for (ongoing) dialogue, so that:

• the keyword extraction community can discover what it is that other research communities are interested in finding out, and then determine how their tools might help them to do so; and
• ‘other’ research communities keep the keyword extraction community informed of (the successes and failures of) research that makes use of text mining techniques, which will allow the latter, in turn, to improve (the functionality of) their text analysis tools further.

How to use this book

I have deliberately incorporated detailed summaries of the contributors’ chapters in this introductory chapter so that readers can ‘pick and choose’ those chapters that seem most relevant to their interests. That said, I would encourage readers with the time and inclination to read the edited collection as a whole, so that they gain a better sense of the different issues that must be considered if we are to utilize word frequency and keyword extraction techniques successfully. The most important message of this edited collection, however, is that the researcher who engages in word frequency/keyword analysis has at their disposal a relatively objective means of uncovering lexical salience/(frequency) patterns that invite – and frequently repay – further qualitative investigation.


Chapter 2

Word Frequency: Use or Misuse?

John M. Kirk

Introduction

In this chapter, I shall not be concerned with statistical treatments of word frequency beyond percentage distributions and relativized frequencies per thousand(s) or million(s) words. My primary concern will be frequency as a property of data, and I shall take a critical look at statements like ‘each text comprises 2,000 words’. I shall be concerned with words as tokens, types and lemmatized types; the range of functions and meanings of words; and words and lexemes; and I shall consider words of low frequency as well as of high frequency. In a critical section, I shall ask whether word frequencies are self-explanatory or need explanation, and whether approximation is as useful as precision. I shall refer to a range of well-known corpora of English as well as the three corpora which I have compiled: the Corpus of Dramatic Texts in Scots, the Northern Ireland Transcribed Corpus of Speech (NITCS), and the Irish component of the International Corpus of English (ICE-Ireland). I also wish to discuss, briefly, the following claims:

• Word frequency is the placing of numbers on language or the representation of language through numbers.
• Word frequency provides an instantiation of the claim that ‘linguistics is the scientific study of language’.
• Word frequency promises precision and objectivity whereas the outcome tends to be imprecision and relativity.
• Word frequency is not an end in itself but needs interpretation through contextualization whence the relativity and comparison.
• Word frequency is not a science but a methodology, which lends itself to replicability.
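A relativized frequency of the kind mentioned at the start of this introduction is simple to compute. The sketch below is illustrative only: the function name and the counts are invented, and are not drawn from any of the corpora discussed in this chapter.

```python
def relative_frequency(count, corpus_size, per=1_000_000):
    """Scale a raw count to a rate per `per` words."""
    return count * per / corpus_size

# Invented counts for one word in two corpora of different sizes.
rate_small = relative_frequency(120, 600_000)
rate_large = relative_frequency(150, 1_000_000)

print(rate_small, rate_large)
print(relative_frequency(120, 600_000, per=1_000))
```

Here the word with the lower raw count (120) turns out to have the higher relative frequency, which is why raw counts from corpora of different sizes cannot be compared directly.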

One of the aims of this chapter is to deconstruct statements of the following type: ‘each text contains (approximately) 2,000 words’, in which there are two issues: the concept (word) and the number (2,000).


Classes of words Of the many subclassifications of words, one which might suit our present purposes is the taxonomy proposed by McA rthur which offers eight possible word classes: 1. 2. 3. 4. 5. 6. 7. 8.

T T T T T T T T

he orthographic word he phonological word he morphological word he lexical word he grammatical word he onomastic word he lexicographical word he statistical word

To this list, I wish to add a further two classes:

9. The numeral word
10. The discourse word

Of these eight or ten types, it is class eight – the statistical word – which is usually associated with the notion of word frequency. McArthur provides the following definition:

word in terms of occurrences in texts is embodied in such instructions as ‘count all the words on the page’: that is, count each letter or group of letters preceded and followed by a white space. This instruction may or may not include numbers, codes, names, and abbreviations, all of which are not necessarily part of the everyday conception of ‘word’. Whatever routine is followed, the counter deals in tokens or instances and as the count is being made the emerging list turns tokens into types: for example, there could be 42 tokens of the type the on a page, and four tokens of the type dog. Both the tokens and the types however are unreflectingly spoken of as words.

Statistical words are words or any string of characters bounded by space which can be counted by a computer. No other distinction is made. Such words are regarded as word ‘types’.
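The statistical word test described above can be made concrete with a minimal Python sketch. It assumes nothing more than splitting on white space, as in McArthur's definition; the sample sentence and the function name `statistical_words` are invented for illustration and do not come from any corpus discussed here.

```python
from collections import Counter

def statistical_words(text: str) -> list[str]:
    """Return every string bounded by white space; no other distinction is made."""
    return text.split()

# Hypothetical sample sentence, not taken from any corpus.
sample = "the dog saw the other dog and the cat"

tokens = statistical_words(sample)   # every occurrence counts once
types = Counter(tokens)              # the emerging list turns tokens into types

print(len(tokens))    # number of tokens
print(len(types))     # number of types
print(types["the"])   # tokens of the type 'the'
```

Because only white space delimits a word in this sketch, strings such as `dog`, `dog,` and `Dog` would each be counted as a separate type, which is exactly the limitation the rest of the chapter explores.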

T. McArthur, ‘What is a Word?’, in T. McArthur (ed.), Living Words: Language, Lexicography and the Knowledge Revolution (Exeter: Exeter University Press, 1999 [1992]). OCEL 1992, reprinted in McArthur, ‘What is a Word?’, p. 47, my emphases.

Word Frequency Use or Misuse?


When the statistical word test is applied to ICE-Ireland, what frequency precision do we find? For the present, all figures are based on the beta version of the spoken component. It is regularly stated that the spoken component of an ICE corpus comprises 300 texts each of 2,000 words, thus amounting to 600,000 words in total. In the case of ICE-Ireland, the total is 623,351 words comprising 300 texts ranging from 960 to 2,840 words each. Whereas these totals already exclude markup, they still include X-corpus, editorial comments and partial words (marked up as <.> … and underlined here for presentation), as shown in (1) and (2):

1. Uhm Marie-Louise and I were in you know the Bang <,> and <{> <[> <.> Oluf <#> What is it <#> Olufsen

2. And uh <,> like three thousand eight hundred <#> And there was another one at four <.> hu four thousand two hundred and something
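The gap between the nominal ICE design (300 texts of 2,000 words) and the ICE-Ireland total quoted above can be checked with a few lines of arithmetic. The sketch below uses only the figures given in the text; the variable names are illustrative.

```python
NOMINAL_TEXTS = 300
NOMINAL_WORDS_PER_TEXT = 2_000
ACTUAL_TOTAL = 623_351  # ICE-Ireland spoken component (beta version), as quoted above

nominal_total = NOMINAL_TEXTS * NOMINAL_WORDS_PER_TEXT
surplus = ACTUAL_TOTAL - nominal_total          # words beyond the nominal design
surplus_pct = 100 * surplus / nominal_total     # surplus as a percentage of 600,000
mean_per_text = ACTUAL_TOTAL / NOMINAL_TEXTS    # average statistical words per text

print(nominal_total)             # 600000
print(surplus)                   # 23351
print(round(surplus_pct, 1))     # 3.9
print(round(mean_per_text, 1))   # 2077.8
```

On these figures the 'approximately 2,000 words' of each text averages nearer 2,078, while individual texts range from 960 to 2,840 words.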

The question thus arises whether, in terms of McArthur’s taxonomy, those 623,351 statistical words are also 623,351 orthographic words, 623,351 phonological words, or even 623,351 morphological words. They are not 623,351 lexical words (in the sense of lexical types), even less 623,351 lexemes (in terms of which ‘die’, ‘pass on’ and ‘kick the bucket’ may be considered single lexemes). Let us consider briefly each type of word in turn.

The orthographic word

One instance of an orthographic word is where the word has dual spellings, as in: airplane, aeroplane; esthetic, aesthetic; archeology, archaeology; connection, connexion; counselor, counsellor; gray, grey; instill, instil; jeweler, jeweller; jewelry, jewellery; libelous, libellous; marvelous, marvellous; mollusk, mollusc; mustache, moustache; panelist, panellist; paralyze, paralyse; analyze, analyse; pajamas, pyjamas; skeptic, sceptic; color, colour; honor, honour; labor, labour; traveler, traveller; traveling, travelling; willful, wilful; woolen, woollen.

These are well-known standardized instances of dual spellings which, as a result of institutionalization, are regarded as ‘American’ or ‘British’. When we investigated those spellings in the written texts of ICE-Ireland, all but a few of which had been published in Ireland, we found that ICE-Ireland is actually more British than ICE-GB, as shown in Table 2.1.

ICE-Ireland is the abbreviated name for The Irish component of the International Corpus of English. See J.M. Kirk, J.L. Kallen, O. Lowry, A. Rooney and M. Mannion, The ICE-Ireland Corpus: The International Corpus of English: The Ireland Component (CD) (ICE-Ireland Project: Queen’s University Belfast, 2005 (beta version)).


Table 2.1 Verbal spellings in ‘–ise’ and ‘–ize’

          ICE-NI   ICE-ROI   ICE-GB   Total
‘–ise’    17       9         35       61
‘–ize’    2        1         12       15

Dialect words present another particular instance of the orthographic word as many such words have survived in oral currency and have never had a standardized written form. In Ireland, there are many words for the national crop, the humble potato, which can be listed under the headword ‘potato’, as in the Concise Ulster Dictionary:

Potato: the national crop in all parts of Ireland: potato, pitatie, pirtie, pirta, purta, purty, pitter, porie, pratie, praitie, prae, prata, prater, pritta, pritty, pruta, poota, tater, tattie, totie. (Hiberno-English forms are recorded as pratie, praitie, etc.; Scots forms as pitatie, tattie, tottie; and a southern English form as tater).

For other words, there is no agreed standardized form, as in the various forms of the dialect word for ‘embers’ borrowed from the Irish word ‘griosach’: greeshoch, greesagh, greesach, greesay, greeshagh, greeshaugh, greeshaw, greesha, greshia, greeshy, greesh, grushaw, greeshog, greesog, greeshock
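The effect of such spelling variation on a count can be sketched in Python. A whitespace-based count treats each spelling as a separate type, whereas a lookup table can fold the variants into one lexical type. The lookup table, the choice of ‘greeshoch’ as headword (simply the first form listed) and the sample string are all assumptions made for illustration, not part of any dictionary's or corpus's actual procedure.

```python
from collections import Counter

# Variant spellings listed above for the dialect word for 'embers'.
GREESHOCH_VARIANTS = {
    "greeshoch", "greesagh", "greesach", "greesay", "greeshagh",
    "greeshaugh", "greeshaw", "greesha", "greshia", "greeshy",
    "greesh", "grushaw", "greeshog", "greesog", "greeshock",
}

# Hypothetical lookup: every variant maps to one arbitrarily chosen headword.
VARIANT_TO_TYPE = {variant: "greeshoch" for variant in GREESHOCH_VARIANTS}

def lexical_type(word: str) -> str:
    """Fold a known variant spelling into its headword; leave other words alone."""
    lowered = word.lower()
    return VARIANT_TO_TYPE.get(lowered, lowered)

# Invented sample string, for illustration only.
tokens = "the greeshaw and the greesagh and the greeshy".split()

statistical_types = Counter(tokens)
lexical_types = Counter(lexical_type(t) for t in tokens)

print(len(statistical_types))        # types when each spelling counts separately
print(len(lexical_types))            # types after folding variants
print(lexical_types["greeshoch"])    # tokens of the single lexical type
```

The design choice here is deliberately crude: the mapping has to be supplied by someone who already knows that the spellings belong together, which is a lexicographical judgement rather than something a count can discover.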

Some words are harder to identify. The word for ‘twilight’ or ‘dusk’ is ‘dailygan’ in the Scots dictionary of Ulster, but the Concise Ulster Dictionary lists: daylight going, daylit goin, dayligoin, daylight gone, dayligone, dailagone, dailygan, dayligane, dayagone, dayligo

making it unclear whether the underlying base form is ‘daylight going’ or ‘daylight gone’. As statistical words, these orthographic words would be counted separately – as types – whereas they merely represent various pronunciation variants of the same lexical type. Each of these three lists presents only one lexical type. The BBC is currently running a nationwide dialect project called Voices. It falls into this same trap of counting orthographic variants as separate – it goes so far as to say unique – words. The Voices website states (in September 2005):


The Word Map has been highly successful; an initial look at the data suggests 32,000 users have registered […] 23,000-odd unique words (including spelling variations) …

That list shows that ‘stunnin’, ‘stunnin’’ and ‘stunning’ or ‘smashin’, ‘smashin’’ and ‘smashing’ were counted as three ‘words’ in each set; ‘phaor’, ‘phoar’, ‘pheoar’, ‘phwaar’, ‘phwoaar’, ‘phwor’, ‘phwooooaar’ are each counted as separate ‘words’, etc. Such lists, albeit on a selective basis, are available (August 2007) online. With regard to the issue of word frequency, orthographic words present as many difficulties as statistical words.

The phonological word

Phonological words are conceivable as several subtypes: vocalized words (as the set of ‘phoar’ words in the preceding section indicates), partial words (the initial segments of a word but not the complete word, as in (1) and (2) above), orthographic or pronunciation-variable words (presenting different pronunciation variables, as in ‘economics’ or ‘tomato’, or because of shifting stress positions in ‘controversy’, or in the dialect forms above), syllabic words (e.g. ‘gonna’, ‘hadda’, ‘musta’, ‘needti’, ‘wanna’, etc.), or even clausal or intonation-unit words (e.g. ‘spindona’, ‘gerritupyi’, etc.). In these ways, phonological words either become orthographic words (which in turn become statistical words) or appear as conflated words which, if counted as statistical words, under-represent the actual total. There is no corpus of segments or syllables, although, interestingly, it is claimed by Crystal (2003) that 25 per cent of speech is made up of only 12 syllables. So with regard to the issue of word frequency, phonological words present many other difficulties too.

The morphological word

Morphological words may be lexical or grammatical words. First, let us consider lexical morphemes. The prefixes ‘cyber–’, ‘e–’, ‘eco–’, ‘euro–’, etc. all became frequent as a result of change in technology, politics or attitudes to the environment. The sudden increase of use of such forms helps to construct the discourses about these new realities.

As the Oxford Dictionary of Ologies and Isms shows, many prefixes and suffixes are specific to particular domains – even linguistics can claim ‘glosso–’, ‘grapho–’, ‘logo–’, ‘semio–’, ‘Slavo–’ as prefixes and ‘–eme’, ‘–gram’, ‘–graphy’, ‘–lect’, ‘–lepsis’, ‘–logue’, ‘–onym’, ‘–phasia’, ‘–speak’ and ‘–word’ as suffixes.


In the southern component of ICE-Ireland, we discovered that clipped words with the suffix ‘–o’ marked colloquial speech, perhaps even slang:

Defos     S   Slang; ‘definites’ (used especially in replies)
Invos     S   Slang ‘invitations’
Morto     S   Slang ‘mortified’
Séamo     S   Form of Séamus
Smarmo    S   In ICE as interjection (< smarmy).

Other forms were:

Relies    S   Slang ‘relatives’
Sca       S   Slang ‘news, gossip’ (< scandal).

There are no such forms in ICE-GB. Even if the absolute numbers are few, their presence in one corpus and not in another may be interpreted as significant – indicative of innovating colloquialisms, possibly slang words. The more lexical items adopting this ‘–o’ suffix, the more the pattern becomes established. Frequency can thus reveal cultural innovation.

Grammatical morphemes offer numerous challenges. Some are mere variants of a single form, sometimes conditioned by external factors, such as dialect contact in the case of the past tense form of ‘bring’ as ‘brought’ or ‘brung’, or a negated form of ‘could’ as ‘couldn’t’ and ‘couldnae’. ICE-Ireland has six instances of ‘gotten’ alongside ‘got’, each with a clear dynamic meaning. Grammatical variants and grammatical innovations may be interpreted in terms of external contexts, but they may also be indicative of changes in the particular sub-system itself. The form ‘gonna’ may be seen as the output of a grammaticalized progressive ‘go’ construction, but only if ‘gonna’ is transcribed as such. In ICE-Ireland, it was not so transcribed – only the standard ‘going’ was used for every instance of progressive ‘go’, in stark contrast to the British National Corpus where its inclusion was left to the subjective preference of the audio-typists who were transcribing the tapes.

When the statistical word test is applied to grammatical words, the result can be confusing. ‘Is’ and ‘was’ are often shown to rank among the most frequent words, but they are only verb forms; they are neither verb types nor the most frequent verbs – for that we need the total of all forms of be. Although Table 2.2 presents frequencies of individual forms of ‘be’ in three spoken corpora, which show some consistency across corpora for each form, it does not show the frequency of be itself. Frequencies of will require the sum of ‘’ll’, ‘will’, ‘won’t’ and whatever other spelling variants are used. So, when it comes to word frequency, much caution and qualification is needed concerning the frequency of grammatical words.

Table 2.2 Frequencies of be forms

Form     ICE-IRL Spoken f.   LONDON-LUND f.   MILLER f.
‘–’s’    17.21               21.64            30.60
‘is’     9.46                10.45            7.05
‘was’    10.64               10.52            7.51
‘be’     6.15                5.46             3.44

See also R. Hickey, Dublin English: Evolution and Change, Varieties of English Around the World General Series, vol. 35 (Amsterdam and Philadelphia: John Benjamins Publishing Company, 2006), pp. 138–9.
The British component of the International Corpus of English (CD).

The lexical word

As already shown, lexical words are often mistaken for variants of their realization: phonological words, which are rendered in writing as orthographic words, or morphological words (particularly with different noun or verb forms). Much of the interest shown in lexical words surrounds them as lexical types, or lemmatized types, not as families of realizations. Even if we establish frequencies for lexical types – something the statistical word does not do – how are we to interpret the result? Of the many possible contexts, I raise only three here: semantic prosody, attitude raising, and constructions of identity or reality.

Semantic prosody

Following the pioneering work of Louw, the notion of semantic prosody is now generally accepted. ‘Utterly’ is regarded as having a negative prosody, i.e. it collocates with words expressing a negative meaning, so that, in ICE-Ireland, we find that prosody confirmed in the six examples: ‘utterly boring’, ‘utterly unacceptable’, ‘condemn utterly’ (x2) (ICE-Ireland).

B. Louw, ‘Irony in the Text or Insincerity in the Writer? The Diagnostic Potential of Semantic Prosodies’, in M. Baker, G. Francis and E. Tognini-Bonelli (eds), Text and Technology (Philadelphia/Amsterdam: John Benjamins, 1993).

Attitude raising

The common word ‘happy’ seems innocent enough until put into the literature for boy scouts and girl guides by Lord Baden-Powell, who urges that the purpose for girls in life is to make boys happy, whereas the purpose of boys in life is simply to be happy. The accumulation of overuse in those texts is shown by Stubbs to turn the word ‘happy’ into a sexist term.

Constructions of identity or reality

In a masterly study of keywords, Baker10 shows how gay identity is constructed very differently by different groups of people. For the House of Lords, key words for the pro-reformers were ‘law’, ‘rights’, ‘sexuality’, ‘reform’, ‘tolerance’, ‘orientation’, ‘sexual’, ‘human’, whereas key words for the anti-reformers were ‘buggery’, ‘anal’, ‘indecency’, ‘act’, ‘blood’, ‘intercourse’, ‘condom’. For the British tabloid press covering crimes against gay men, key words were ‘transiency’, ‘acts’, ‘crime’, ‘violence’, ‘secrecy’, ‘shame’, ‘shamelessness’, ‘promiscuity’. In contact ads, British gay men described themselves as ‘guy’, ‘bloke’, ‘slim’, ‘attractive’, ‘professional’, ‘young’, ‘tall’, ‘non-scene’, ‘good-looking’, ‘active’, ‘caring’, ‘sincere’. In gay fantasy literature, gay men are described as brutes (‘socks’, ‘sweat’, ‘beer’, ‘football’, ‘towel’, ‘team’) or emotionless machines (‘lubed’, ‘jacked’, ‘leaking’, ‘throb’, ‘throbbing’, ‘spurt’, ‘spurts’, ‘pumped’, ‘pumping’). In safer sex awareness leaflets distributed to gay men, gay men are described as animals, and gay sex as violent (‘grunted’, ‘groaned’, ‘grabbed’, ‘shoved’, ‘jerk’, ‘jerked’, ‘jerking’, ‘slapping’, ‘pain’); at the same time, gay men’s language is shown to be informal, non-standard and impolite (‘fucker’, ‘cocksucker’, ‘faggot/fag’, ‘stuff’, ‘yeah’, ‘shit’, ‘hell’, ‘fuckin’, ‘ain’t’, ‘wanna’, ‘gotta’, ‘gonna’, ‘’em’, ‘kinda’, ‘real’, ‘hey’, ‘damn’, ‘good’).
In each of these settings, it was the frequency of words occurring above the norm that created very different discourses in each context and foregrounded the perspective or point of view. Baker shows convincingly that by studying word frequencies, common, everyday words like ‘human’ or ‘young’, when over-used in particular texts – and thus having a relatively high frequency – become keywords and agents in the creation of, and also discrimination between, those discourses. Similarly, ICE-Ireland creates Ireland – through the use and frequency of various classes of lexical words: dialect words, Irish loanwords, other words in ICE-IRL deemed ‘Irish’ and institutional words (many of them onomastic words) – that frequency always being relatively high compared to other ICE-corpora, where such items do or will not occur. These words are keywords because none is used in any other ICE corpus, nor could they construct anything but Ireland. An analysis of a sample of the ICE-Ireland spoken component revealed the following Irishisms, as listed in Tables 2.3 and 2.4, where ‘N’ and ‘S’ relate to northern and southern distributions:

Table 2.3 Irish loanwords

Fleadh       N,S   Traditional music festival (< Irish)
Gaeltacht    S     Irish-speaking district (< Irish)
Poitín       S     Illicit distilled spirits (< Irish)
Scór         S     ‘Tally’ (

Table 2.4 Lexicon of other words in ICE-IRL deemed ‘Irish’

‘Maracycle’       N   Long distance cycling race
‘Motorsports’     N   Sports using cars or motorcycles
‘Bogger’          S   HibE dialect ‘person from rural areas’
‘Feck’            S   Slang; variant of fuck
‘Greenkeeper’     S   One who maintains a green
‘Imprimatured’    S   Use of imprimatur as verb
‘Knacker’         S   Derogatory for ‘traveller’; extended to
‘Legger’          S   did a legger ‘ran away’ cf. Sc. and dialect use leg ‘use the legs, to walk fast or run’
‘Liveweight’      S   Weight of live animal

A knowledge of these rather unexceptional words – English in form, but only used in Ireland – is important for a knowledge of Ireland.

M. Stubbs, Text and Corpus Analysis (Oxford: Blackwell, 1996), ch. 4.
10 P. Baker, Public Discourses of Gay Men, Routledge Advances in Corpus Linguistics, vol. 8 (London and New York: Routledge, 2005).

The grammatical word

Grammatical words are not morphological words which are grammatical variants of a lexical word. Although sometimes homonyms, grammatical words are grammatical in their own right, although they may be realized by a subset of phonological/orthographic variants, as in the case of ‘’ll’ for ‘will’ or ‘won’t’ as a contracted form of negated ‘will’. Although benchmark frequencies are sometimes given (The Comprehensive Grammar of the English Language asserts that ‘will’ occurs four times in every 1,000 words), I find that in Scottish English (Scots) it occurs at least eight times in every 1,000 words, the figure attributable not simply to additional functions but to the restructuring of exponents within the system of modality – in effect because of internal systemic differences.
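A benchmark figure for a grammatical word such as will depends on first summing its realizations. The Python sketch below shows one way this might be done, assuming that clitics are transcribed as separate strings (as in ICE-style ‘I 'll’); the set `WILL_FORMS`, the function name and the sample utterance are hypothetical and serve only to illustrate the arithmetic.

```python
from collections import Counter

# Surface forms treated here as realizations of the modal 'will' (an assumption;
# other spelling variants would need to be added for a real corpus).
WILL_FORMS = {"will", "'ll", "won't"}

def lemma_rate_per_1000(tokens: list[str], forms: set[str]) -> tuple[int, float]:
    """Sum the counts of all listed forms and relativize per 1,000 tokens."""
    counts = Counter(t.lower() for t in tokens)
    total = sum(counts[form] for form in forms)
    return total, 1000 * total / len(tokens)

# Invented utterance with clitics separated by white space.
tokens = "I 'll go and she will stay but he won't say if they will".split()

total, rate = lemma_rate_per_1000(tokens, WILL_FORMS)
print(total)            # occurrences of the grammatical word, all forms together
print(round(rate, 2))   # the same figure per 1,000 statistical words
```

Counting the string ‘will’ alone would give half the total in this invented sample, which is the kind of under-count the paragraph above warns against.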

Table 2.5 Frequency of modal verbs

         NITCS   ICE-IRL          LONDON-LUND   MILLER7*
         F       N       F        F             F
will     1.77    2584    4.14     4.28          6.19
would    9.69    3652    5.85     3.51          4.79

*The Miller Corpus of Undergraduate Conversation in Edinburgh Scots.

However, in the Northern Ireland Transcribed Corpus of Speech (NITCS11), the most frequent modal is ‘would’, for which the explanation is external. Although ‘would’ carries a habitual-in-the-past meaning in standard English, Irish English marks habituality in the present, under the transfer of that category from Irish through language contact. Moreover, the majority of the texts in NITCS constitute interviews about childhood reminiscences and reflections on changes experienced by the interviewees during their lifetime. So, contextually, it is not surprising that ‘would’ appears so frequently. These results are presented in Table 2.5.12

Grammatical words not only express a range of meanings, they can occur in a range of syntactic constructions, which may in some cases correlate with particular meanings. High frequencies of get as a lexical word (as in Table 2.6) hide the many possibilities both for complements and for premodification through auxiliary, modal or catenative verbs. By contrast, low frequencies of the after-perfect construction can be shown to correlate directly with different contexts:

After-perfect in ICE-IRL: all nine are southern
After-perfect in NITCS: majority (three out of five) among Catholic speakers
After-perfect in GLASGOW: all nine are ethno-linguistic markers – by the same Catholic speaker

The form-function relationship which lies at the core of grammatical words further complicates the issue of word frequency.

11 See J.M. Kirk, Northern Ireland Transcribed Corpus of Speech (ESRC Data Archive: University of Essex, 1990; rev. edn, Belfast: Queen’s University, 2004).
12 Based on J.M. Kirk, ‘Aspects of the Grammar in a Corpus of Dramatic Texts in Scots’, PhD thesis, University of Sheffield, 1986.

Table 2.6 Occurrences of get

                  N       F/1000
NITCS             2,397   10.06
ICE-IRL Spoken    3,005   4.92
GLASGOW*          723     9.65

*A Corpus of Dramatic Texts from Glasgow, compiled by John M. Kirk (Oxford Text Archive)

The onomastic word

As already indicated, institutional names in ICE-IRL act as key words in the construction of Ireland. But onomastic words raise issues with regard to statistical words … How many words are in a single name? How many ways are there to spell, in particular, an Irish name? Onomastic names may also occur as acronyms. Here is a list of onomastic words from ICE-Ireland (‘N’ and ‘S’ again denoting northern and southern distributions in ICE-NI and ICE-ROI respectively).

Table 2.7 Local names (or onomastic lexicon)

Aer Lingus               N,S   Irish national airlines (< Irish)
Radio Telefis Éireann    N,S   RTÉ; the Irish broadcasting authority
Gardaí                   N,S   Plural of garda, member of Garda Siochána (< Irish)
Taoiseach                N,S   Prime minister in Irish government (< Irish)
DENI                     N     Department of Education Northern Ireland
DHSS                     N     Department of Health and Social Services
UUP                      N     Ulster Unionist Party
UYO                      N     Ulster Youth Orchestra
Forum                    N     Northern Ireland Forum for Peace and Reconciliation
RUC                      N     Royal Ulster Constabulary
EHSSB                    N     Eastern Health and Social Services Board
An Bord Pleanála         S     The Irish planning authority (< Irish)
Ceann Comairle           S     Presiding officer of the Dáil (< Irish)
Coláiste Íde             S     [< Irish; name of local school]
CRC                      S     Central Remedial Clinic
Cultúrlann na hÉireann   S     [< Irish; name for Irish traditional culture centre]
Dáil                     S     Dáil Éireann; the main Irish legislative body (< Irish)
EIS                      S     Environmental Impact Statement
Fáinne                   S     Lapel pin associated with speaking Irish (
Telecom Éireann          S
Fás                      S
Fianna Fáil              S
Féile                    S
Garda Siochána           S
Oireachtas               S
PRSI                     S
RTC                      S
Seanad                   S
Tánaiste                 S
Toisigh                  S
TD                       S

The lexicographical word

The lexicographical word adopts a different approach to word frequency. Dictionaries attempt only to reflect reality of use, so that some frequency information is provided implicitly through the display of spelling variants (as shown above) or different senses (the more senses a word has, the more frequent it is likely to be – thus ‘peregrinate’, which has one basic sense, and ‘to go on’, which has 14 different senses). Although, nowadays, lexicography is heavily corpus-based, the inclusion of frequency information in a dictionary implies a certain predictability about what the dictionary user is likely to find. Lexicographical words are headwords. Most are lexical words; some are grammatical words. They are not orthographic words, nor morphological words (although there is a debate in EFL circles about the choice of headword for verbs in early learner dictionaries given that past tense forms can be more frequent than base forms (e.g. ‘declined’, that well-known example discussed by John Sinclair)). Some headwords proliferate numbered subdivisions – either on the basis of word class (e.g. ‘round’ has several numbered entries, as if separate words are created through polysemy, as in ‘Macmillan’) or as a homograph (i.e. one headword, with several subsenses) as in Collins English Dictionary or Encarta World English Dictionary. The decision about the treatment of subsenses has implications for frequency. Many headword lists also include onomastic words. Thus with regard to frequency implications, the choice of headwords has long been controversial.

A further frequency issue with regard to lexicographical words is the growing practice that current dictionaries have established of providing frequency tables relating to content, similar to the literature on the size of corpora. Consider Table 2.8. All the same, it is hard to know what exactly each dictionary is counting. The Concise Ulster Dictionary boasts ‘over 15,000 words’ on its cover, but in fact there are 19,936 headword entries.13

13 I am grateful to Anne Smyth of the Ulster Folk and Transport Museum for this figure.

Table 2.8 Dictionaries and word frequency

Dictionary   Headwords   References   Text

100,000

400,000

3.5m

NODE

350,000

4.0m

Collins 4E

180,000

3.6m

Encarta

The issue of word frequency raises the question of a frequency dictionary, of which it could be claimed there already are several, especially:

• K. Hofland and S. Johansson, Word Frequencies in British and American English (1982).
• W. Nelson Francis and H. Kučera, Frequency Analysis of English Usage: Lexicon and Grammar (1982).
• S. Johansson and K. Hofland, Frequency Analysis of English Vocabulary and Grammar (1989).
• G. James, R. Davison, A.H.Y. Cheung and S. Deerwester, English in Computer Science: A Corpus-based Lexical Analysis (1994).

Frequency information, however, is already accommodated in several dictionaries using different methods:

• the number system (frequency is indicated by numbers, as shown in frequency dictionaries)
• the star system (frequency is indicated by stars)
    • COBUILD Dictionary of Idioms: *** = ‘1 occurrence per 2 million words’; ** = ‘not as common as *** idioms’; * ‘1–3 occurrences in 10 million words’
    • Macmillan English Dictionary: For Advanced Learners: * ‘fairly common’; *** ‘one of the most basic words’
• the colour system (frequency is indicated by different headword colours)
    • Longman Dictionary of Contemporary English, fourth edition (red), Macmillan (red)
• the graph system (frequency is indicated by distributional graphs)
    • now (speech vs writing) in Longman Dictionary of Contemporary English, third edition
    • notice (types of complement) in Longman Dictionary of Contemporary English, fourth edition


The result is that, on the basis of corpus research, and by treating the item as a closed system within which the variation occurs, both distributional and benchmark frequency information is given. Lexicographers remain cautious about using word frequency as the basis of headword choice, placing their doubt on the adequacy of any sample or choice of corpus material to be a reliable indicator of frequency, and on the methodology to be sophisticated enough to reflect not only form frequency but, in the case of polysemy, sense frequency too. According to the American doyen of lexicography, Sidney Landau,14 The American Heritage Word Frequency Book based on the Brown Corpus and compiled by Francis and Kučera (mentioned above) is ‘valuable but flawed’ because a corpus of one million words is ‘far too slight to give any true indication of the frequency relationships of the entire lexicon. … A statistically useful corpus would have to be many times larger than five million words.’ Nevertheless, according to Landau:

[there is] a sense in which dictionaries do use frequency counts – that of their own citation files. A dictionary citation file is a collection of quotations of actual usage selected to serve as a basis for constructing definitions or for providing other semantic or formal information (such as collocation, degree of formality, spelling, compounding, etymology, or grammatical data). Citation files may also include transcriptions or recordings of spoken forms. The manner of collection and use of citation files for defining will be discussed in the next chapter. Suffice it to say here that as traditionally collected, citation files, however vast – and Merriam-Webster’s files reputedly number over 12 million – have been assembled in too haphazard a manner to be used as a reliable guide to frequency. As James A.H. Murray had occasion to remark in connection with the OED files, citation readers all too often ignore common usages and give disproportionate attention to uncommon ones, as the seasoned birdwatcher thrills at the glimpse in the distance of a rare bird while the grass about him teems with ordinary domestic varieties that escape his notice.15

The numeral word

To McArthur’s taxonomy, I wish to add two new classes of words. As a transcriber of speech and compiler of corpora, I am aware that these two classes reveal themselves in high numbers and present their own difficulties with regard to form and thereby word frequency. Numbers and numeral words are neither lexical words nor grammatical words. They fall between classes. There are always difficulties of transcription as many utterances are no more than a spoken version of a number which exists primarily in writing. It is interesting that some of the most recent dictionaries include a section on ‘Numbers that are used as words’ (Cambridge Advanced Learners’ Dictionary) and ‘Numbers that are entries’ (Macmillan English Dictionary: For Advanced Learners). But even in grammars, the treatment of numbers and numerals can vary, as between inclusion in a chapter on ‘giving information about people and things’ (the COBUILD English Grammar) and a chapter on ‘lexical word-formation’ (the Cambridge Grammar of the English Language).

14 S. Landau, Lexicography (Cambridge: Cambridge University Press, 1984).
15 Ibid., pp. 79–80.

The discourse word

In discourse, words can have a function which is neither referential nor propositional, which may also be pragmatically indeterminate, but which is, nevertheless, relevant for the development and cohesion of the conversation. In the prosodically and pragmatically tagged version of ICE-Ireland,16 we have identified such discourse words as ‘Indeterminate Conversationally-relevant Utterances’ or , as in the following example:

(3) <$A> <#> <exp> I ‘m not even sure 2exActly when I ‘ll 2nEEd somebody from%
    <$B> <#> 2Right%
    <$A> <#> But uhm I would need an 1Extra pair of 2hAnds%

The category includes backchannels. Although discourse words will not be distinguished as statistical words from any other type, a considerable proportion of any spoken text comprises utterances. Discourse markers are also marked in ICE-Ireland. Three types are identified: syntactic, lexical and phonological. Multi-word discourse markers are hyphenated to distinguish their functional use; all discourse markers are asterisked. Transcribers of speech need to think through their policy for encoding such speech features. Whatever decision is taken, there are implications for word frequency. The London-Lund Corpus and the Corpus of American Spoken English transcribe non-lexicalized, non-grammaticalized sounds phonetically, e.g. with the phonetic symbol schwa. Regardless of vowel quality, in ICE-Ireland such vocalized sounds used as hesitation markers or fillers are transcribed uniformly as ‘uh’ and ‘uhm’ (depending on whether there was a final audible nasal).

16 See J.M. Kirk, J.L. Kallen, O. Lowry, A. Rooney and M. Mannion, The SPICE-Ireland Corpus: Systems of Pragmatic Annotation for the Spoken Component of ICE-Ireland, version 1.2 (Queen’s University Belfast and Trinity College Dublin, 2007).

Table 2.9 List of discourse markers in the SPICE-Ireland corpus

Syntactic: Do-you-know; Do-you-see; I-’d-say; I-know; I-mean; I-see; I-suppose; I-think(-that); You-know; You-see; See

Lexical: Ah-no; Ah-well; Ah-right; Actually; All-right, alright; God, Jesus, Jeez; Just; Kind-of, kinda; Like [focus]; My-God, My-gosh; No, naw, no-no; Now; Oh-God, Oh-gosh; Oh-my-God, God-Almighty; Oh-Jesus; Oh-right; Oh-well; Oh-yeah, Oh-yes; Okay; Only; Right; So; Sort-of, sorta; Sure; Then; There; Well; Yeah-no; Yeah-yeah; Yes, yeah, yup, aye

Phonological: Ah; Arrah; Och; Oh
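The consequence of a transcription policy for the statistical word count can be sketched briefly in Python. The example compares an invented utterance transcribed with and without hyphenated multi-word discourse markers of the kind listed in Table 2.9; the asterisk marking is left out because its exact placement is not specified here, and both strings are hypothetical rather than drawn from SPICE-Ireland.

```python
def statistical_word_count(text: str) -> int:
    """Count strings bounded by white space, as the statistical word test does."""
    return len(text.split())

# The same invented utterance under two transcription policies.
unhyphenated = "well I mean you know it was sort of late"
hyphenated = "Well I-mean You-know it was Sort-of late"

print(statistical_word_count(unhyphenated))   # markers counted word by word
print(statistical_word_count(hyphenated))     # each multi-word marker counted once
```

The utterance is the same in both cases; only the encoding decision differs, and with it the total on which any relativized frequency would be based.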

Ten classes of word frequency?

These ten classes of words offer themselves as ten classes of word frequency. What the examples have shown is that, in each word type, frequency is a clear factor. Frequency becomes a factor when a link is inferred between the frequency and the context. The context may be the linguistic system itself (as with exponents of modality or cases of grammaticalization); and the context may be external, which can be interpreted as conditioning the frequency and so the pattern of frequency variation of which it is a part. External factors are many and varied – they may have to do with speakers (whether identified by country, province, region, age, sex, sexual orientation, education, life-history, L1/L2 speaker, etc.), or discourse situations (whether what is spoken is read, prepared, spontaneous, broadcast, or what the audience or purpose or intended effect might be). It may have to do with the method of recording (ICE is fly-on-the-wall without fieldworker presence; NITCS is driven wholly by fieldworker questions), the time of recording as a special moment in history; or the discourse which comes to be constructed, intentionally or otherwise.

Comparing frequencies in corpora

Comparing corpora will always generate different frequencies for interpretation by such external conditioning factors:

• ICE-Ireland vs ICE-GB vs ICE-(whatever)17
• ICE-NI vs ICE-ROI
• ICE-NI Spoken vs NITCS
• ICE(-whatever) vs LLC18 vs LOB19 vs FLOB,20 etc.
• ICE(-whatever) vs BNC
• CDTG (GLASGOW)21 vs Leuven22

Comparing corpora makes it necessary to go beyond raw frequencies – numbers of occurrences – and relativize frequencies per 1,000 or 10,000 or 100,000 or even one million words to compare occurrences from corpora or datasets of different lengths, or to relativize them as one of a closed set (percentage distribution), sometimes to stand as a benchmark figure relativized to 1,000 words or one million words. Comparing corpora finally depends on replicability. The statistical word test may seem to be the obvious and easy answer. But, as the present examples show, the significance of word frequency also demands qualitative interpretation depending on context.

17 The International Corpus of English project comprises some 18 national varieties of standardized English. Those countries completed so far are East Africa (Tanzania, Kenya and Uganda), Fiji, Great Britain, Hong Kong, India, Ireland, New Zealand, the Philippines, and Singapore. See .
18 The London-Lund Corpus of Spoken British English, see .
19 The Lancaster-Oslo/Bergen Corpus, see .
20 The Freiburg Lancaster-Oslo/Bergen Corpus, see .
21 Corpus of Dramatic Texts from Glasgow, compiled by J.M. Kirk (Oxford Text Archive).
22 The Leuven Theatre Corpus of British Dramatic Texts, 1966–72: see D. Geens, L.K. Engels and W. Martin, Leuven Drama Corpus and Frequency List (Leuven: Institute of Applied Linguistics, University of Leuven, 1975).
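The two kinds of relativization just described, a rate per fixed number of running words and a percentage distribution within a closed set, can be sketched in a few lines of Python. This is only an illustration: the function names and the figures in the usage note are invented for the example and are not taken from any of the corpora discussed.

```python
def per_n_words(count: int, corpus_size: int, base: int = 1_000_000) -> float:
    """Relative frequency: occurrences per `base` running words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * base / corpus_size


def percentage_distribution(counts: dict[str, int]) -> dict[str, float]:
    """Share of each item within a closed set, as percentages of the set total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one count must be non-zero")
    return {item: 100 * n / total for item, n in counts.items()}


# Hypothetical figures: the same item in two corpora of different lengths.
rate_small = per_n_words(150, 300_000)    # 150 hits in a 300,000-word corpus
rate_large = per_n_words(420, 1_000_000)  # 420 hits in a one-million-word corpus
shares = percentage_distribution({"must": 1, "should": 3})
```

On these invented figures the smaller corpus has the lower raw count but the higher rate per million words, which is why raw counts from corpora of different lengths cannot be compared directly.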


Theoretical aspects

The invitation to this workshop raised the question of the contribution which word frequency makes to linguistic theory. On the basis of the present evidence, I would suggest that the contribution is both post-hoc and propter-hoc. I have shown that frequencies are factors in items, systems, texts and discourses, that frequencies are discovered as part of distributional preference, that frequencies are used to indicate distributional choices, and that frequencies are quantitative but depend on qualitative interpretation. So I would suggest that frequencies are essentially calibrating – comparing but also establishing identity and discriminating individuality. Frequencies belong to description and prediction.

Conclusion: use or misuse?

In addressing my own question, I would conclude that ‘misuse’ is the statistical word. If all word frequencies were based on the statistical word test, nothing would follow or be revealing. All linguistic interest is in the frequency of the different types of words; as shown, frequency is a factor in the description of each type, not paradigmatic with the other types. There are only ten word types. With regard to good use, I have shown that frequency is a factor with all word classes, that frequency is bound up with the interpretation of the value of the frequency of that word in the social context of occurrence, that frequency has a value in the description of particular lexical and grammatical items, and that frequency is replicable as a basis of systematic comparison and of identity construction. I conclude that it does not matter whether ‘each text contains (approximately) 2,000 words’ – rather, it is the classification and interpretation of those 2,000 words in that particular text and context which will determine the real value of frequency study. To return to the beginning, I have shown that:

• Word frequency is the placing of numbers on language or the representation of language through numbers;
• Word frequency provides an instantiation of the claim that ‘linguistics is the scientific study of language’;
• Word frequency promises precision and objectivity whereas the outcome tends to be imprecision and relativity;
• Word frequency is not an end in itself but needs interpretation through contextualization, whence the relativity and comparative discrimination; and
• Word frequency is not science but methodology, which lends itself to replicability.

Chapter 3

Word Frequency, Statistical Stylistics and Authorship Attribution

David L. Hoover

Introduction

The increased availability of large corpora and electronic texts, and innovations in analytic techniques, have spurred a great deal of recent interest in the venerable topic of word frequency, especially in relation to authorship attribution and statistical stylistics. Until recently, research in these areas was typically based on the 30 to 100 most frequent words (MFWs) of a corpus. These words – almost exclusively function words – were chosen because they are so frequent that they account for a large proportion of the running words (tokens) of a text, and also because it was assumed that their frequencies should be especially resistant to intentional authorial manipulation, and so should reveal authorial habits that remain relatively constant across a variety of texts. Successful as studies based on these words have been, recent work has expanded the list, changed the way the words for analysis are chosen, and proposed new analytic tests and measures of authorial style.

Some problems with word-frequency lists

A discussion of some of these recent developments can usefully begin with an examination of the 60 MFWs of Edith Wharton’s The Age of Innocence (1920; Age, below), shown in Table 3.1. This typical novel has a total of 101,840 tokens and 9,731 types, and roughly half its types (4,873) are hapax legomena (words occurring once). The rapid decrease in word frequencies shown in Figure 3.1 is typical of English texts, as are the words themselves, although the feminine personal pronouns are more frequent than usual. Anyone working with word frequency lists will find this one familiar; it is very similar to those from, for example, the Brown Corpus and the British National Corpus (BNC). Although the Brown Corpus is ten times – and the BNC a thousand times – larger than this novel (and the latter contains British English rather than American English), 47 of these 60 words are also among the 60 MFWs of both corpora.
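For readers who want to build a list of this kind themselves, the quantities mentioned above (tokens, types, hapax legomena, and the most frequent words with their percentage and cumulative share) can be computed roughly as follows. This is a minimal sketch, not the procedure used for Table 3.1: the tokenizer is a deliberately crude assumption (lower-cased runs of the letters a–z, with straight apostrophes allowed inside a word), and different tokenization rules will give different counts.

```python
import re
from collections import Counter


def word_list(text: str, top_n: int = 60) -> dict:
    """Summarize a text as a word-frequency list.

    Tokenization is an assumption of this sketch: lower-cased runs of
    a-z, optionally joined by straight apostrophes (e.g. "don't").
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    counts = Counter(tokens)
    n_tokens = len(tokens)

    rows = []          # (rank, word, frequency, percentage of all tokens)
    cumulative = 0     # tokens covered by the top_n words
    for rank, (word, freq) in enumerate(counts.most_common(top_n), start=1):
        cumulative += freq
        rows.append((rank, word, freq, 100 * freq / n_tokens))

    return {
        "tokens": n_tokens,
        "types": len(counts),
        "hapax": sum(1 for c in counts.values() if c == 1),
        "rows": rows,
        "cumulative_pct": 100 * cumulative / n_tokens if n_tokens else 0.0,
    }


summary = word_list("the cat and the dog and the bird", top_n=2)
```

The returned rows mirror the Rank, Word, Freq. and % columns of a conventional word list, and the cumulative percentage corresponds to the share of running words accounted for by the words retained.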
The presence of two proper names, ‘Archer’ and ‘Olenska’, is also typical. (Any number of proper names from zero to about five


Table 3.1 The sixty MFWs for The Age of Innocence

Rank  Word   Freq.   %       Rank  Word    Freq.  %       Rank  Word     Freq.  %
1     the    5,477   5.38    21    but     689    0.68    41    an       310    0.30
2     and    3,285   3.23    22    for     670    0.66    42    this     302    0.30
3     to     3,002   2.95    23    archer  629    0.62    43    no       297    0.29
4     of     2,749   2.70    24    him     549    0.54    44    which    296    0.29
5     a      2,147   2.11    25    not     512    0.50    45    who      283    0.28
6     her    1,867   1.83    26    be      463    0.45    46    up       280    0.27
7     he     1,762   1.73    27    mrs     414    0.41    47    out      266    0.26
8     in     1,669   1.64    28    were    396    0.39    48    would    266    0.26
9     had    1,422   1.40    29    if      386    0.38    49    there    263    0.26
10    was    1,387   1.36    30    have    382    0.38    50    is       254    0.25
11    that   1,384   1.36    31    been    378    0.37    51    when     254    0.25
12    she    1,314   1.29    32    one     373    0.37    52    into     248    0.24
13    his    1,205   1.18    33    from    355    0.35    53    me       245    0.24
14    it     1,070   1.05    34    they    354    0.35    54    new      244    0.24
15    with   946     0.93    35    all     339    0.33    55    mr       239    0.23
16    i      895     0.88    36    so      333    0.33    56    like     230    0.23
17    as     860     0.84    37    said    327    0.32    57    may      225    0.22
18    you    798     0.78    38    what    326    0.32    58    my       216    0.21
19    at     774     0.76    39    by      322    0.32    59    olenska  215    0.21
20    on     774     0.76    40    their   311    0.31    60    after    206    0.20

Cumulative: 48,434 (47.60%)

could be considered typical.) Aside from ‘Archer’ and ‘Olenska’, only ‘new’ and ‘said’ might be regarded as content words (‘like’ would drop in frequency if verb examples were subtracted), and ‘new’ appears here only because it is so frequent as part of ‘New York’. It seems surprising that comparing such lists of ubiquitous and very frequent words is so regularly effective in distinguishing one author from another. However, even an ordinary list like this one raises significant questions. Deleting the proper nouns before comparing this text with other texts seems

Figure 3.1 The frequency of the thirty most frequent words of The Age of Innocence

appropriate, and many researchers have done so, sometimes without comment, but common nouns that might also be proper nouns remain problematic, and some are far more difficult to identify and deal with than ‘new’. Consider ‘woman’ in William Golding’s The Inheritors, a novel told mainly from the point of view of a Neanderthal whose society is destroyed by a more advanced invading tribe. ‘Woman’ is more than 15 times more frequent in The Inheritors (rank: 43) than it is in the written portion of the BNC (rank: 387), primarily because of the frequency of ‘the old woman’ for the Neanderthal matriarch, and ‘the fat woman’ and ‘the crumpled woman’ for women of the invading tribe. Even without these epithets, however, ‘woman’ ranks 221st, and remains about 2.7 times more frequent in The Inheritors than in the BNC. Such words raise subtle and potentially important analytic and theoretical questions. Is it possible, practical or even desirable to tease out the different functions and meanings of ‘woman’ in The Inheritors? How does this question relate to the thematic importance of ‘woman’ in the novel? Are epithets, or proper nouns themselves, stylistic or authorial markers? Imagine The Wizard of Oz with Dorothy Gale and Toto replaced by Tiffany Spindrift and Fifi. And is the relationship between ‘Gale’ and ‘tornado’ irrelevant? Although the names ‘Archer’ and ‘Olenska’ are unusual enough that they are unlikely to occur in any novels being compared with Age, they certainly could do so, and names like ‘John’, ‘London’ or ‘New York’ could potentially skew results in novels that were otherwise similar. In an analysis of just the 60 MFWs, the absence of ‘Archer’ and ‘Olenska’ from other novels by Wharton would also tend to separate the novels in spite of their common authorship. Yet assuming their absence is unwise; some of Faulkner’s characters,


for example, appear in more than one of his novels, and ‘Archer’ is also frequent in James’s The Portrait of a Lady, in which Isabel Archer is the main character. The problem of ‘new’ as a common noun and as part of ‘New York’ also suggests that part of speech (POS) tagging might be desirable to prevent one text in which ‘new’ is extremely frequent as a common noun from appearing similar to another that happens to be set in New York. Unfortunately, even the most accurate POS taggers introduce errors, and all the taggers I have tried are hopelessly inaccurate for poetry. When, as is often true of my own work, newly-constructed corpora of millions of words are being analysed, manual correction of tagging is impractical. Furthermore, as the case of ‘woman’ in The Inheritors has shown, POS tagging cannot solve subtler problems of classification and function. Although ‘has’, ‘had’, ‘are’, ‘were’, ‘will’ and ‘would’ are all among the 60 MFWs of both corpora, only the past tense forms ‘had’, ‘were’ and ‘would’ appear among the 60 MFWs of Age. And this suggests that lemmatizing the texts might help to overcome variations in tense and reduce generic differences between narration (typically in past tense) and dialogue (often in present tense). To the best of my knowledge, no large-scale tests have been performed to see how or whether POS tagging or lemmatization affects the accuracy of authorship attribution, though Burrows has manually tagged a few very frequent words such as ‘to’ and ‘that’ for function before performing his analyses. As is so often true, a priori assumptions about whether either or both of these interventions would improve authorship attribution are little more than guesses. Carefully constructed tests of their effects would be valuable but extremely labour-intensive and time-consuming. The word ‘I’ ranks 16th in Age, suggesting a good deal of dialogue in this third-person novel and highlighting the problem of point of view.
Ideally, one might compare only third-person or first-person texts in any one analysis, precisely because personal pronouns are so frequent and their use varies widely in texts with different points of view. Unfortunately, for some problems this would drastically reduce the number of texts available for analysis, and some novels (for example, some of Conrad’s) contain first-person narratives within a third-person frame narrative, a situation that requires tedious manual editing if points of view are to be separated. And third-person novels with very large proportions of dialogue are likely to diverge markedly from those with little or none. For example, ‘I’ occurs 418 times in the first 50,000 words of Age, where it ranks 16th, just as it does in the entire novel, but only eight times in the first 50,000 words of Upton Sinclair’s The Jungle, where it ranks 601st. Worse still, ‘I’ jumps from 601st to 46th when the entire novel is analysed, showing that extreme intra-textual variation is possible. Occasionally ‘I’ ranks first in a novel, as it does in the following eighteenth-century novels: Richardson’s Pamela (where ‘the’ ranks fourth, behind ‘I’, ‘and’ and ‘to’), Burney’s Evelina (where ‘the’ ranks third, behind ‘I’ and ‘to’) and Foster’s The

For discussion, see D.L. Hoover, ‘Statistical Stylistics and Authorship Attribution: An Empirical Investigation’, Literary and Linguistic Computing, 16/4 (2001): 421–44.

Table 3.2 Masculine and feminine pronouns in eight novels

                                       Masculine pron.          Feminine pron.
Novel                                  ‘he’    ‘him’   ‘his’    ‘she’   ‘her’
Doyle, The Hound of the Baskervilles   9       13      30       68      59
London, The Call of the Wild           4       8       14       95      75
Kipling, The Jungle Book               6       8       19       119     130
Doyle, The Lost World                  12      53      14       191     238
Foster, The Coquette                   24      26      35       15      8
Montgomery, Anne of Green Gables       34      85      77       11      12
James, The Portrait of a Lady          13      28      22       6       7
Chopin, The Awakening                  10      24      15       6       4

Coquette and Defoe’s Moll Flanders (where ‘the’ ranks second). The epistolary form of the first three of these seems responsible for the extreme frequency of ‘I’, and Moll Flanders is written in the form of an autobiography (a check of six actual autobiographies finds none with ‘I’ as the most frequent word, however). Other personal pronouns seem potentially problematic because they are closely tied to content, and especially to the number and gender of the main characters: obviously, pronouns referring to women tend to be infrequent in texts that do not contain women. Table 3.2 shows the ranks of ‘he’, ‘his’, ‘him’, ‘she’ and ‘her’ in eight novels. In the first four, the rarity of female characters restricts the frequency of feminine pronouns. In the second four, the focus on women reverses the pattern to some extent.

Modified word-frequency lists

My own recent work attempts to cope with the difficulties mentioned above in several ways that are, ultimately, related. I often manually remove all dialogue from novels to eliminate problems arising from differing proportions of dialogue and narration, but this requires long hours of tedious, error-prone work, and runs into difficulty in novels in which dialogue and narration are not clearly


differentiated. And in some novels (The Coquette, for example), dialogue is not distinguished typographically, making the process very difficult and, thus, a matter of interpretation as well as analysis. Deleting the dialogue also deletes more than half of some novels, sometimes resulting in an inconveniently small sample. Also, I normally remove all personal pronouns from the word frequency list before performing an analysis, or do the analysis both with and without pronouns. This is not standard practice, largely because so many analysts simply begin with the 50 most frequent words, and removing personal pronouns often gives poorer results for such analyses. Furthermore, the prevalence of male or female characters, masculine or feminine pronouns, or plural or singular pronouns may sometimes help to differentiate authors. Another innovation I have made is to delete any word that is frequent in the whole corpus because of its frequency in a single text, typically culling words for which a single text accounts for 60 to 80 per cent of all occurrences. This eliminates most proper nouns and other idiosyncratic items that are usually tied closely to content, though it does not eliminate main characters with the same name or place names that are frequent in two or more novels with the same setting. I sometimes remove such words manually, but this conflicts with another innovation: using very large numbers of frequent words. In some of my earliest work on authorship attribution and statistical stylistics, I expanded the list to the 800 MFWs, and, more recently, to the 1,200 MFWs. Following a suggestion by Ross Clement of Westminster University (personal communication), I now typically include the 4,000 MFWs when working with long texts (Clement even reports very good results using the 6,000 MFWs). As we have seen for Age, even the 60 MFWs account for almost half of the tokens of a long novel.
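The culling rule described above can be expressed compactly. The following is a sketch under stated assumptions rather than the procedure actually implemented in the spreadsheet: the function name and data layout are hypothetical, and the default threshold of 0.7 is simply one value chosen from within the 60 to 80 per cent range mentioned in the text.

```python
def words_to_cull(counts_by_text: dict[str, dict[str, int]],
                  threshold: float = 0.7) -> set[str]:
    """Return words for which a single text supplies at least `threshold`
    of all occurrences in the corpus.

    `counts_by_text` maps a text label to that text's word counts.
    The threshold is an assumed value within the 60-80 per cent range.
    """
    totals: dict[str, int] = {}   # occurrences of each word in the whole corpus
    maxima: dict[str, int] = {}   # largest count of each word in any one text
    for counts in counts_by_text.values():
        for word, n in counts.items():
            totals[word] = totals.get(word, 0) + n
            maxima[word] = max(maxima.get(word, 0), n)
    return {w for w, total in totals.items()
            if total > 0 and maxima[w] / total >= threshold}


# Hypothetical three-text corpus with invented counts.
corpus = {
    "Text A": {"archer": 9, "the": 50},
    "Text B": {"archer": 1, "the": 50},
    "Text C": {"the": 40, "new": 3},
}
culled = words_to_cull(corpus)
```

Note that any word occurring in only one text is removed by this rule whatever the threshold, which is consistent with the aim of eliminating character names and other items tied to the content of a single text.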
The 1,000 MFWs typically account for 75 to 80 per cent of the tokens, and the 4,000 MFWs for more than 90 per cent. Using such large numbers of words violates the traditional assumption that authorship markers are most likely to be found among words that are so common and ubiquitous that authors are unlikely to manipulate them consciously. It also bases the analysis on very infrequent words – a practice that statisticians are likely to frown upon. Nevertheless, dozens of unrelated analyses have shown that these large numbers of (mostly) content words, many of which occur only once or twice in any text, are much more effective in authorship attribution than are the 50 to 100 most frequent function words. As far as I am aware, no theoretical justification exists for using such large numbers of words, but the results speak for themselves. Perhaps the main reason for the improved results is simply the much larger amount of information they

Hoover, ‘Statistical Stylistics and Authorship Attribution’.
D.L. Hoover, ‘Delta, Delta Prime, and Modern American Poetry: Authorship Attribution Theory and Method’, paper given at the ALLC/ACH Joint International Conference. BC, Canada: University of Victoria, 16 June 2005; D.L. Hoover, ‘The Delta Spreadsheet’, paper given at the ALLC/ACH Joint International Conference. BC, Canada: University of Victoria, 17 June 2005.


provide. Whether or not the words at rank 1,000 and above are, individually, less reliable or less informative than the 50 most frequent, there are so many of them that they both improve results and insulate the analysis from many of the errors and problems mentioned above. If ‘new’ is left in the analysis because its mixed proper and common noun status goes unnoticed, there are so many other words that any small, inappropriate effect it has is overwhelmed. Similarly, if POS tagging is unavailable or not accurate enough to use, so many words with unambiguous classifications remain that the desirable but unavailable information is not finally necessary. Recent experiments suggest that using large numbers of words allows texts of different points of view, texts with varying proportions of dialogue and narration, and texts in both British and American English to be included in the same analysis without any great degradation of the accuracy of the results. Unfortunately, large word lists are appropriate only for large texts.

New methods for authorship attribution: Delta and Victorian novels

Combining these methods of dealing with some of the difficulties of using word frequencies for authorship and statistical stylistics with Delta, a new measure of the difference between texts developed by John F. Burrows, seems an appropriate way of investigating word frequencies and authorship attribution further. Burrows has demonstrated the effectiveness of Delta on Restoration poetry and has applied the technique to the interplay between translation and authorship. I have published two studies involving Delta that automate the process of calculating and evaluating the results of Delta in an Excel spreadsheet with macros.
The first article demonstrates Delta’s effectiveness on early twentieth-century novels, and shows that using the 700 or 800 MFWs substantially improves the results achieved with smaller numbers of words, as does removing personal pronouns and culling words that are frequent in only one text. It also shows that large drops in Delta from the first to the second likeliest author are strongly associated with correct attributions. The second article shows that the accuracy of attribution can often be improved by selecting subsets of the word frequency list or changing the

D.L. Hoover, ‘Testing Burrows’s Delta’, Literary and Linguistic Computing, 19/4 (2004): 453–75.
J.F. Burrows, ‘Questions of Authorship: Attribution and Beyond’, paper given at the ACH/ALLC Joint International Conference, New York, 14 June 2001; J.F. Burrows, ‘“Delta”: A Measure of Stylistic Difference and a Guide to Likely Authorship’, Literary and Linguistic Computing, 17 (2002): 267–87; J.F. Burrows, ‘Questions of Authorship: Attribution and Beyond’, Computers and the Humanities, 37/1 (2003): 5–32.
J.F. Burrows, ‘The Englishing of Juvenal: Computational Stylistics and Translated Texts’, Style, 36 (2002): 677–99.
D.L. Hoover, ‘Testing Burrows’s Delta’, Literary and Linguistic Computing, 19/4 (2004): 453–75; D.L. Hoover, ‘Delta Prime?’, Literary and Linguistic Computing, 19/4 (2004): 477–95.

Table 3.3 Calculating Delta for ‘the’ in The Professor and six other novels

Author      Novel            z-score   Abs. diff.
Brontë      The Professor    –0.47     –
Brontë      Jane Eyre        –0.65     0.18
Collins     Woman in White   1.40      1.87
Dickens     Dombey and Son   –0.13     0.34
Eliot       Middlemarch      –0.86     0.39
Thackeray   Vanity Fair      1.08      1.55
Trollope    Dr Thorne        –0.83     0.36

formula of Delta itself, as will be described briefly below. It also tests Delta and its variants on contemporary literary criticism, where they continue to perform very well. Further studies are underway by several researchers, involving a ‘real life’ attribution problem on nineteenth-century prose, an application of the technique and its variants to evolutionary biology, and my own projects on the style of Henry James, on narrators’ styles in eighteenth and nineteenth-century novels, and on the authorship of a late Middle English saint’s life. Consider, now, an authorship attribution test involving 46 Victorian novels by Charlotte Brontë, Collins, Dickens, Eliot, Thackeray and Trollope. For this test, I used plain ASCII texts, downloaded from Gutenberg whenever possible, to keep the formats similar, and deleted any Gutenberg information, introductions, prefaces, tables of contents, notes and page numbers, but not chapter or section titles, epigraphs or dialogue. I created a word frequency list for the resulting corpus of about 5,500,000 words, selected the 7,000 MFWs for analysis with Burrows’s Delta, and used The Delta Calculation Spreadsheet to calculate the percentage frequency for each word in all 46 novels and enter a zero record for each word absent from any given text. One large novel by each author served as the primary authorial sample and the other 40 novels formed the secondary test set. Before Delta was calculated, words absent from all six primary samples were removed so that means and standard deviations could be calculated. This left only 3,556 words for analysis, and removed from consideration most of the character names and proper nouns of the test set. Calculating Delta begins with the differences between the frequencies of words in the primary authorial samples and their mean frequencies in the entire primary set.
To allow all of the rapidly declining frequencies to contribute equally, these differences are converted into z-scores by dividing the difference between the mean frequency of the word in the primary set and its frequency in the sample by the standard deviation of the word in the primary set. The result indicates how many standard deviations above the mean for the primary set (positive z-scores) or below it (negative z-scores) each word falls. Delta then measures the difference

<http://www.nyu.edu/gsas/dept/english/dlh/TheDeltaSpreadsheets.html>.


between test texts and primary authorial samples in a simple way. Each word’s frequency in the test text is first compared with its mean frequency in the primary set, as with the primary samples. The difference between the test text and the mean is then compared with the difference between each primary authorial sample and the mean. For example, consider ‘the’ in Charlotte Brontë’s The Professor and the six other novels that comprise the primary authorial samples, shown in Table 3.3. In both The Professor and Jane Eyre (the primary authorial sample for Brontë), the frequency of ‘the’ is about half a standard deviation below the mean. Thus the two are about equally different from the mean, with a difference between their differences of only about 0.18. The differences between the frequency of ‘the’ in the other five authorial samples and the mean are far larger, as reflected in the absolute differences. The final step is to total the absolute differences for all words and calculate their mean, producing Delta, ‘the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text’.10 The primary authorial sample with the smallest Delta is ‘least unlike’ it11 and its author is the most likely of the primary authors to be the author of the test text. Delta is extremely effective for this corpus, attributing all 40 test texts to their correct authors in all analyses based on 200 or more words. Delta is so effective that it makes almost no difference whether words frequent in a single text are culled, and only in analyses based on fewer than the 200 MFWs does removing pronouns improve results. These results are so accurate that there seems little point in testing any of the alternatives to Delta on them.
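The calculation just described can be summarized in a short Python sketch. It is an illustration of the steps in the text, not the spreadsheet’s own code: the function name and data layout are hypothetical, the frequencies in the usage example are invented, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say whether the sample or the population formula is used. Words with no variation across the primary set are skipped because their z-scores are undefined.

```python
from statistics import mean, stdev


def delta_scores(primary: dict[str, dict[str, float]],
                 test_text: dict[str, float],
                 words: list[str]) -> dict[str, float]:
    """Delta between one test text and each primary authorial sample.

    `primary` maps an author label to percentage frequencies per word;
    absent words count as 0.0. At least two primary samples are needed.
    """
    # Mean and standard deviation of each word across the primary set.
    stats = {}
    for w in words:
        values = [sample.get(w, 0.0) for sample in primary.values()]
        sd = stdev(values)  # assumption: sample standard deviation
        if sd > 0:          # z-scores are undefined when sd is zero
            stats[w] = (mean(values), sd)
    if not stats:
        raise ValueError("no word varies across the primary samples")

    def z(freqs: dict[str, float], w: str) -> float:
        m, sd = stats[w]
        return (freqs.get(w, 0.0) - m) / sd

    # Delta: mean absolute difference between the two sets of z-scores.
    return {author: mean(abs(z(test_text, w) - z(sample, w)) for w in stats)
            for author, sample in primary.items()}


# Invented percentage frequencies for two hypothetical primary samples.
primary = {"X": {"the": 5.0, "of": 3.0}, "Y": {"the": 7.0, "of": 1.0}}
scores = delta_scores(primary, {"the": 5.0, "of": 3.0}, ["the", "of"])
likeliest = min(scores, key=scores.get)
```

The primary sample with the smallest score is the one ‘least unlike’ the test text. For a single word the contribution is just the absolute difference of two z-scores, as in Table 3.3, where |–0.47 – (–0.65)| = 0.18 for The Professor and Jane Eyre.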
I turn instead to a more challenging corpus of more than 2,000,000 words of Modern American poetry, returning to poetry but moving forward to the twentieth century.

Delta, Delta Prime and Modern American Poetry

To produce results readily comparable to Burrows’s, I began, as he did, with a primary authorial set of 25 samples by 25 poets and a secondary set of 32 samples – 16 by members of the primary set and 16 by other poets. I downloaded the samples from Chadwyck-Healey’s Literature Online, accessed through New York University’s Bobst Library, and removed section numbers, references, notes, page numbers, some dramatic sections with large numbers of character names, some dialect, all noticed foreign language passages, epigrams and other quotations, and prose sections. The resulting samples ranged from 21,000 to 72,000 words, and I took large samples for the primary set wherever possible. This resulted in a mean sample size of 44,000 words for the primary set and 38,000 for the secondary

10 J.F. Burrows, ‘“Delta”: A Measure of Stylistic Difference and a Guide to Likely Authorship’, Literary and Linguistic Computing, 17 (2002): 267–87, p. 271.
11 J.F. Burrows, ‘Questions of Authorship: Attribution and Beyond’, paper given at the ACH/ALLC Joint International Conference, New York, 14 June 2001, p. 15.

Figure 3.2 Modern American poetry: Delta analysis (2,000 MFWs)

Figure 3.3 Modern American poetry: Delta-Oz analysis (2,000 MFWs)

Figure 3.4 Modern American poetry: Delta-Lz (>0.7) analysis (3,000 MFWs)

set. (One significant difference between my samples and Burrows’s is that he used single long poems as the secondary texts, to more obviously mirror a real authorship attribution problem.) In the best results from a preliminary test on the 20 to 200 MFWs, those based on the 160 MFWs, Delta correctly identified all samples by primary authors, but four samples by others invaded the section of correctly attributed samples. Removing pronouns improved the results slightly, but culling the word list had no effect, removing very few words overall, and none from the 200 MFW list. As noted above, much larger word frequency lists nearly always improve results, and this is clearly true of this corpus. An analysis based on the 2,000 MFWs, shown in Figure 3.2, is much more accurate than that based on the 160 MFWs, and is the best result I was able to achieve using original Delta. Although Delta impressively attributes all 16 of the samples by members of the primary set of authors using the 80, 100, 120 ... 200, 400, 600 ... 4,000 MFWs, these results exaggerate its effectiveness somewhat. For example, as Figure 3.2 shows, if the Rukeyser sample that appears surrounded by texts by others were a test text, it would not be identified correctly by this analysis, and establishing a reasonable threshold of confident attribution is difficult. The five possible improvements on Delta proposed in ‘Delta Prime?’ recapture information about whether a word is more or less frequent than the mean, how


different the test text is from the mean, the size of the absolute difference between the test text and each primary text, and the direction of the difference between the test text and the primary text.12 Delta-Lz and Delta-Oz retain Burrows’s definition of Delta as the mean of the absolute differences but base the mean on a limited set of words. Delta-Lz includes only those words for which the z-score of the test text in question has a large absolute value – words much more frequent or much less frequent than the mean. Delta-Oz includes only those words for which the signs of the z-scores of the primary text and the test text being compared are opposite – words for which the test text and the primary text are different from the mean in opposite directions. Both of these measures compare each pair of texts on the basis of a newly-defined and typically unique subset of the original word frequency list. The other three measures, like Delta, include all the words, but change the definition of Delta itself. Rather than taking the mean of the absolute differences between the test text and each primary sample, Delta-2X doubles the mean of the differences that are positive (words for which the z-score of the test text is greater than that of the primary sample – words more frequent in the test text than the primary sample) and subtracts the mean of the negative differences (words less frequent in the test text than the primary sample). Because the second figure is a negative, in effect, this calculation adds the absolute value of the negative mean. Delta-3X triples the positive mean before subtracting the negative mean. Finally, Delta-P1 adds one to the positive mean and squares that sum before subtracting the negative mean. All three of these measures weigh more heavily those words that are more frequent in the test text than the primary sample, treating presence as more significant than absence.
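The five variants can be set side by side in a small sketch that starts from z-scores already computed for one test text and one primary sample. It follows the verbal definitions above and should be read as an approximation of them: the function name is hypothetical, the cut-off of 0.7 for Delta-Lz is taken from the label ‘Delta-Lz (>0.7)’ and assumed to apply to the absolute z-score of the test text, and the treatment of empty subsets (an empty positive or negative group contributes 0.0; an empty word subset yields NaN) is my own assumption.

```python
import math


def _mean(values, empty=0.0):
    """Arithmetic mean, or `empty` when there are no values."""
    values = list(values)
    return sum(values) / len(values) if values else empty


def delta_variants(z_test: dict[str, float],
                   z_primary: dict[str, float],
                   lz_threshold: float = 0.7) -> dict[str, float]:
    """Delta and five variants for one test text / primary sample pair."""
    words = [w for w in z_test if w in z_primary]
    diffs = {w: z_test[w] - z_primary[w] for w in words}
    # Mean of the positive differences and mean of the negative differences.
    pos_mean = _mean(d for d in diffs.values() if d > 0)
    neg_mean = _mean(d for d in diffs.values() if d < 0)  # zero or negative
    return {
        "Delta": _mean((abs(d) for d in diffs.values()), empty=math.nan),
        # Only words whose test-text z-score is large in absolute value.
        "Delta-Lz": _mean((abs(diffs[w]) for w in words
                           if abs(z_test[w]) > lz_threshold), empty=math.nan),
        # Only words whose two z-scores have opposite signs.
        "Delta-Oz": _mean((abs(diffs[w]) for w in words
                           if z_test[w] * z_primary[w] < 0), empty=math.nan),
        "Delta-2X": 2 * pos_mean - neg_mean,
        "Delta-3X": 3 * pos_mean - neg_mean,
        "Delta-P1": (pos_mean + 1) ** 2 - neg_mean,
    }


# Invented z-scores for three hypothetical words.
result = delta_variants({"a": 1.0, "b": -1.0, "c": 0.5},
                        {"a": 0.5, "b": 0.5, "c": -0.5})
```

In this invented example the differences are 0.5, –1.5 and 1.0, so the positive mean is 0.75 and the negative mean is –1.5; subtracting the negative mean therefore adds 1.5 in each of the last three measures.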
Delta-2X improves slightly on Delta, and Delta-3X and Delta-P1 do better still, though, in all cases, one member’s text falls below the most obvious threshold and would not be correctly identified if it were being tested. Both Delta-Oz and Delta-Lz (>0.7) produce results (shown in Figures 3.3 and 3.4) in which none of the samples by others invade the samples by members, but only Delta-Lz shows a clear threshold, and two members fall below it. Delta and the Delta Primes clearly reflect real authorship information and good results can be replicated on diverse sets of texts. However, correctly attributing texts by members is not enough – the analysis must avoid falsely attributing texts by others to members of the primary set. That no samples by others invade the samples by members is encouraging, but this kind of testing seems inadequate. Over-interpretation of results is all too easy when the truth is known in advance. Beginning with a large group of samples of poetry, many by the authors tested above and some by additional authors, and selecting smaller samples to increase the difficulty in the hope of revealing any differences in effectiveness among Delta and the Delta Primes, I created a simulation in which the true authors were not initially known. I first selected 25 primary samples and 24 secondary samples, 11 by members and 13 by others. I then added 15 more samples (six by members and nine by others), hid

12 Hoover, ‘Delta Prime?’

Word Frequency, Statistical Stylistics and Authorship Attribution

Figure 3.5 Authorship simulation: Delta analysis (800 MFWs)

Figure 3.6 Authorship simulation: Delta-2X analysis (800 MFWs)


What’s in a Word-list?

Figure 3.7 Authorship simulation: Delta-3X analysis (800 MFWs)

Figure 3.8 Authorship simulation: Delta-P1 analysis (600 MFWs)

Figure 3.9 Authorship simulation: Delta-Oz analysis (800 MFWs)

Figure 3.10 Authorship simulation: Delta-Lz (>0.7) analysis (1,000 MFWs)


their identities, put them in random order, and had a helper, who knew nothing about the simulation, select 12 of the 15 to rename TEXT 01 ... TEXT 12, by authors A 01 ... A 12. For the simulation, the mean sample size was about 24,500 words for primary samples, 22,000 for samples by members, 13,500 for samples by others and 16,000 for the test samples. The simulation left me with the unwanted knowledge that at least three and no more than six samples were by members, but the results in Figures 3.5 to 3.10 show that this knowledge could hardly have affected the results. A precise characterization of the relative effectiveness of Delta and the Primes is difficult, but all of the Primes, with the possible exception of Delta-P1, seem more effective than Delta. These results do not show the whole picture, however. Analyses involving the 200, 400 ... 4,000 words for Delta and the five Primes, 120 analyses in all, produced only 21 errors, and never more than one in any analysis. Fourteen of the errors involved Delta-P1, which also produced the only analyses in which test samples other than one, three, four, eight and 11 mingled with the samples by members. Finally, the following five attributions were absolutely consistent across all 120 analyses: TEXT 01 = Frost, TEXT 03 = Gunn, TEXT 04 = Sandburg, TEXT 08 = Creeley, and TEXT 11 = Kooser. None of the suggested attributions of texts known to be by others attained this kind of consistency, though Rich was frequently the most likely author for TEXT 12, as Hart Crane was for TEXT 09. (One or two authors frequently appear as the most likely author for many of the texts by others in an analysis.)
Finally, as Figure 3.11 shows, the changes in Delta and Delta-z from the likeliest to the second likeliest author were generally much greater for these five test texts and the samples by members than for the other seven test texts and the texts by other authors (the texts are arranged in order of increasing change in Delta). At this point it will come as no surprise that these five attributions are correct. As Figures 3.5 to 3.11 show, however, if the Rich sample that appears as a member text had instead been a test sample, none of the analyses would very convincingly have attributed it to her. Only Delta-Oz and Delta-Lz, however, strongly suggest that this sample is not by a member of the primary set. Perhaps these two Primes are more sensitive to intra-author variation than are Delta and the other Primes, but that is a topic for further research. It would be tempting to suggest the small size of this sample compared to Rich’s other samples as a cause (it is only about one third as long), but the two strongest attributions in all of the analyses shown above are the two other pairs with the greatest discrepancy in size. A more promising explanation is suggested by the fact that the larger samples come from Collected Early Poems: 1950–1970, while the smaller one comes from Midnight Salvage: Poems, 1995–1998, and Rich notoriously changed her style during her long career. This ‘failure’ reminds us that some authors’ styles change dramatically over their careers, and that some authors use very different styles in different texts. This calls into question the basic assumption of an invariable wordprint for each author and suggests that statistical methods alone may not always be adequate and may need to be augmented by more time-consuming methods such as those based on extreme differentials in the use of pairs of words.

Figure 3.11 Authorship simulation: changes in Delta and Delta-z from likeliest to second likeliest author: Delta-Lz (1,000 MFWs)

Delta and the Delta Primes are extremely effective in attributing the five samples by members and in rejecting the seven samples by others. The fact that the failures are all failures to attribute rather than failures of attribution is important: in no case does Delta or any of the Delta Primes strongly encourage a false attribution. Further tests on contemporary prose and on samples that are lemmatized or tagged for part of speech would be helpful – not so much to confirm the effectiveness and reliability of Delta and the Delta Primes, which now seem very solidly validated, but, rather, in the hope of more fully understanding why these relatively simple techniques work so well, and in continuing to improve their already impressive power.

Conclusion

Word frequencies have been studied over a long period of time and with a wide variety of methods. In spite of this long history, however, the increasing availability of larger corpora has generated new interest in this venerable topic. One clear trend has been an increase in the number of words studied, but new methods of defining and selecting the words to be counted have also continued to emerge. The recent introduction of new measures of textual difference based on word frequencies, such as Burrows’s Delta, suggests that word frequencies will continue to play a central role in authorship attribution and statistical approaches to style.


Chapter 4

Word Frequency in Context: Alternative Architectures for Examining Related Words, Register Variation and Historical Change

Mark Davies

Introduction

Since the advent of ‘mega-corpora’ that are 100 million words in size or larger, there have been challenges in terms of extracting large amounts of data economically. For example, several years ago it was sufficient to create a query engine that would perform a linear scan of the text, and such an architecture might return the results from a one million word corpus in one to two seconds. Using that same architecture, however, a similar query on a 100 million word corpus might take 100 to 200 seconds. As a result of these performance issues, in the last ten to 15 years a number of alternative architectures have been developed. These include the use of large numbers of indices that contain offset values for each word in the corpus and the use of large hash operations to find nearby words (e.g. SARA/XAIRA) and the relational database architecture of the IMS Corpus Workbench. Over the past five years, we have employed a modified relational database architecture for a number of corpora that we have created. In contrast to the IMS Corpus Workbench approach, however, these corpora rely heavily on an ‘n-gram’ architecture, which will be one of the major topics of this paper. These corpora include the 100 million word Corpus del Español and a new architecture and interface for the 100 million word British National Corpus (BNC), both of which will be discussed below.

G. Burnage and D. Dunlop, ‘Encoding the British National Corpus’, in J. Aarts, P. de Haan and N. Oostdijk (eds), English Language Corpora: Design, Analysis and Exploitation, papers from the 13th International Conference on English Language Research on Computerized Corpora, Nijmegen 1992 (Amsterdam: Rodopi, 1993), pp. 79–95; L. Burnard, Reference Guide for the British National Corpus: World Edition (Oxford: Oxford University Computing Services, 2000).

O. Christ, The IMS Corpus Workbench Technical Manual (Institut für Maschinelle Sprachverarbeitung, Stuttgart, Germany: Universität Stuttgart, 1994).

As with some other competing architectures, this relational database/n-gram approach allows queries like the following:

• the overall frequency of a given word or phrase in the corpus (‘mysterious’, ‘blue skies’)
• the frequency of words with a given substring (*able, *heart*, etc.)
• queries involving part of speech or lemma (e.g. ‘utter’ NN1, ‘as’ ADJ, ‘barely realized’ ADV VVD)

Unlike some other architectures, however, our approach is quite fast: any of the preceding queries on a 100 million word corpus would take less than one second. In addition, as we will demonstrate, our approach allows a number of types of query that would be difficult or impossible to carry out directly in one step with competing architectures. These include – but certainly are not limited to – the following:

• comparison of frequency with related words, e.g. nouns occurring immediately after ‘utter’ but not after ‘complete’ or ‘sheer’, or adjectives within ten words of ‘woman’ but not ‘man’;
• one simple query to find the frequency of words in separate databases, such as user-defined, customized lists (clothing, emotions, technology terms, etc.) or synsets from WordNet;
• register variation, e.g. all words ending in ‘*icity’, or all verbs, or all three-word lexical bundles, which are more common in academic texts than in works of fiction, or in legal or medical texts;
• historical variation, e.g. all words, phrases or collocates of a given word or part of speech, which are more common in the 1900s than in the 1800s.

In the discussion that follows, we will first present the basic architecture (relational databases and n-grams) and provide concrete examples of some of the types of queries that this architecture allows. We will then discuss shortcomings of this architecture, and consider how these issues have been handled in some of the newer interfaces that we have created, such as the VIEW interface for the BNC.

A simple n-gram architecture

Let us first consider the ‘first-generation’ approach to relational databases and n-grams, which was used for the Corpus del Español that we created in 2001–2002, and the subsequent BNC-based ‘Phrases in English’ database and interface that was based on the same architecture and which was created by William Fletcher in 2003. In this approach, one uses a program to create the n-grams of a given corpus, such as the WordList module of WordSmith or the KFN-gram program.10 For example, with WordSmith, one would simply create separate lists of all of the 1-grams, 2-grams, 3-grams, etc., in the corpus, and then import these into a relational database. In the case of 3-grams for the BNC, for example, a small section of the 3-grams table would look like Table 4.1.

M. Davies, ‘Advanced Research on Syntactic and Semantic Change with the Corpus del Español’, in J. Kabatek, C.D. Pusch and W. Raible (eds), Romance Corpus Linguistics II: Corpora and Diachronic Linguistics (Tübingen: Gunter Narr, 2005), pp. 203–14; M. Davies, ‘The Advantage of Using Relational Databases for Large Corpora: Speed, Advanced Queries, and Unlimited Annotation’, International Journal of Corpus Linguistics, 10 (2005): 301–28.

W. Fletcher, ‘Exploring Words and Phrases from the British National Corpus’, Phrases in English (2005), accessed June 2005.

M. Scott, WordSmith Tools, Version 4.

10 .

Table 4.1 Example of 3-grams where lem1 = ‘break’ and word2 = ‘the’

FREQ.  WORD1     WORD2  WORD3
106    breaking  the    law
98     break     the    law
56     broke     the    silence
53     break     the    news
46     broke     the    news
40     break     the    deadlock
24     broken    the    law
23     break     the    habit

Each unique three-word string in the corpus appears in the database, with its associated frequency. For example, in the BNC ‘breaking the law’ occurs 106 times, ‘broke the silence’ occurs 56 times, and so on. It is also possible to create frequency tables that include POS (part of speech) and lemmatization information. In this case, the table might look like Table 4.2.

Table 4.2 Example of 3-grams where LEM1 = ‘break’ and WORD2 = ‘the’

FREQ  WORD1     LEM1   POS1  W2   LEM2  POS2  W3        LEM3      POS3
106   breaking  break  VVG   the  the   AT0   law       law       NN1
98    break     break  VVI   the  the   AT0   law       law       NN1
56    broke     break  VVD   the  the   AT0   silence   silence   NN1
53    break     break  VVI   the  the   AT0   news      news      NN1
46    broke     break  VVD   the  the   AT0   news      news      NN1
40    break     break  VVI   the  the   AT0   deadlock  deadlock  NN1
24    broken    break  VVN   the  the   AT0   law       law       NN1
23    break     break  VVI   the  the   AT0   habit     habit     NN1

With these n-gram/frequency tables, it is a relatively simple process to use SQL queries to extract the desired data. For example, to extract the 100 most common singular nouns (NN1) in the BNC, the SQL query would be as follows:

(1) SELECT TOP 100 * FROM [TABLE_NAME] WHERE
    POS1 = 'NN1'
    ORDER BY FREQ DESC

To select the 100 most common three-word strings where the first word is a form of ‘break’, the second word is ‘the’, and the third word is a noun, the SQL query would be as follows, and the results would be those seen in Table 4.2, above:

(2) SELECT TOP 100 * FROM [TABLE_NAME] WHERE
    LEM1 = 'BREAK' AND
    WORD2 = 'THE' AND
    POS3 LIKE 'NN%'
    ORDER BY FREQ DESC
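As a rough, runnable illustration of this first-generation design, the sketch below counts the 3-grams of a tiny tagged token list, loads them into an in-memory SQLite table laid out like Table 4.2, and runs an analogue of query (2). Everything here is an assumption made for illustration: the table and function names are hypothetical, the toy sentences are invented, and SQLite is used in place of the database engine behind the chapter's queries, so `LIMIT` replaces `TOP` and string comparison with `=` is case-sensitive (hence the lower-case literals).

```python
import sqlite3
from collections import Counter


def build_trigram_table(tagged_tokens):
    """Count all 3-grams of (word, lemma, pos) tokens and store one row per unique 3-gram."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE trigrams (freq INTEGER, "
        "word1 TEXT, lem1 TEXT, pos1 TEXT, "
        "word2 TEXT, lem2 TEXT, pos2 TEXT, "
        "word3 TEXT, lem3 TEXT, pos3 TEXT)"
    )
    # Concatenating three 3-tuples gives the nine columns of one 3-gram.
    counts = Counter(
        tagged_tokens[i] + tagged_tokens[i + 1] + tagged_tokens[i + 2]
        for i in range(len(tagged_tokens) - 2)
    )
    conn.executemany(
        "INSERT INTO trigrams VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
        [(freq, *gram) for gram, freq in counts.items()],
    )
    return conn


def top_break_the_noun(conn, limit=100):
    """SQLite analogue of query (2): forms of 'break' + 'the' + any noun."""
    return conn.execute(
        "SELECT freq, word1, word2, word3 FROM trigrams "
        "WHERE lem1 = 'break' AND word2 = 'the' AND pos3 LIKE 'NN%' "
        "ORDER BY freq DESC, word3 LIMIT ?",
        (limit,),
    ).fetchall()


# Invented toy corpus: (word, lemma, part of speech).
tokens = [
    ("they", "they", "PNP"), ("broke", "break", "VVD"), ("the", "the", "AT0"),
    ("law", "law", "NN1"), (".", ".", "PUN"),
    ("we", "we", "PNP"), ("broke", "break", "VVD"), ("the", "the", "AT0"),
    ("law", "law", "NN1"), (".", ".", "PUN"),
    ("they", "they", "PNP"), ("break", "break", "VVB"), ("the", "the", "AT0"),
    ("news", "news", "NN1"), (".", ".", "PUN"),
]

conn = build_trigram_table(tokens)
for row in top_break_the_noun(conn):
    print(row)
```

Because each unique 3-gram is stored once with its count, the query only has to filter and sort pre-aggregated rows, which is the property the frequency tables rely on for speed.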

Either of queries (1) and (2) would take less than half a second to retrieve the 100 most frequent matching words or strings from the 100 million word corpus. This is the approach used in our Corpus del Español and in the ‘Phrases in English’ interface, and it represents an early approach to the use of n-grams.

Accounting for register or historical variation

There is a serious problem, however, associated with a strict n-gram architecture. Once the frequencies are calculated for each unique n-gram in the corpus, one then loses all contextual information for that n-gram – in other words, in which part of the corpus each n-gram occurs. For example, Table 4.2 above shows that ‘breaking the law’ occurs 106 times in the corpus, but at this point we have no idea how many of these are in the spoken texts, or fiction or newspapers. Therefore, it would probably be impossible to find the most frequent words or phrases in a given register using this approach, or to compare the frequency of words or phrases in two competing (sets of) registers.

There is a way around the lack of context for each n-gram, however. One could calculate the overall n-gram frequency for a set of different registers (as in Table 4.2), and then create n-gram frequency tables for each register individually. One would then ‘merge’ the information from the ‘register’ tables into the overall frequency table, which would contain separate columns (for each n-gram) showing the frequency in each register. In other words, the resulting table might look like Table 4.3.

Table 4.3 Example of 3-grams where LEM1 = ‘break’ and WORD2 = ‘the’

WORD1     LEM1   POS1  …  W3        LEM3      POS3  REG1  REG2  REG3
breaking  break  VVG   …  law       law       NN1   x1    y1    z1
break     break  VVI   …  law       law       NN1   x2    y2    z2
broke     break  VVD   …  silence   silence   NN1   x3    y3    z3
break     break  VVI   …  news      news      NN1   x4    y4    z4
broke     break  VVD   …  news      news      NN1   x5    y5    z5
break     break  VVI   …  deadlock  deadlock  NN1   x6    y6    z6
broken    break  VVN   …  law       law       NN1   x7    y7    z7
break     break  VVI   …  habit     habit     NN1   x8    y8    z8

This is in fact the approach taken in the construction of the 100 million word Corpus del Español. For each n-gram, there are columns that show the frequency of the string in each century from the 1200s to the 1900s (x12–x19, below). There are also separate columns containing the frequency of the n-gram in each of the three registers spoken, fiction and non-fiction in the 1900s. Table 4.4 shows a small section of bi-grams, containing a few of the n-grams that match the query NOUN + lemma DURO ‘hard N’.

Table 4.4 ‘Hard N’ (N duro) in the Corpus del Español, by century

WORD1   WORD2  x12  …  x17  x18  x19  x19SP  x19FIC  x19NF
MANO    DURA   0    …  0    10   22   6      9       7
LÍNEA   DURA   0    …  0    0    12   0      3       9
SER     DURO   0    …  1    9    10   7      1       2
CUELLO  DURO   1    …  0    0    10   9      1       0
MADERA  DURA   0    …  0    5    10   7      0       3
CARA    DURA   0    …  0    3    10   6      4       0

The advantage of using such an approach should be readily apparent. Since each n-gram has the associated frequency in each of the different historical periods and the different registers, this frequency information can be accessed directly as part of the query. For example, in the case of the Corpus del Español, we can find which nouns occur with duro ‘hard’ in the 1900s, but not in the 1800s. The SQL query would look something like the following (here simplified from how it would appear in the actual database):

(3) SELECT TOP 100 * FROM [TABLE_NAME] WHERE
    POS1 = 'NOUN' AND
    LEM2 = 'DURO' AND
    X19 <> 0 AND
    X18 = 0
    ORDER BY X19 DESC

This gives us results like ‘línea dura’ (‘hard line’), ‘disco duro’ (‘hard drive’) and ‘años duros’ (‘hard years’), etc. In spite of the advantages of this approach, one problem is that it is quite costly to run the SQL UPDATE commands that copy the frequency information (for tens of millions of n-grams) from each of the separate tables (1200s, 1500s, 1900s–FIC, etc.) into the main n-gram tables. In addition, this approach may only be practical when there are a limited number of frequency columns, such as the 11 columns in the Corpus del Español (1200s–1900s, and three additional registers for the 1900s). In the case of the BNC, on the other hand, there are nearly 70 different registers, according to the categorization made by David Lee.11 In summary, the frequency information from each sub-corpus can be quite valuable, in terms of being able to compare different historical periods, different registers, and so on. However, because of the difficulty of creating such tables, they are not used in some other competing architectures and interfaces. As a result, it is only possible to look at word and phrase frequency across the entire corpus with these approaches.

11 .
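The merge step and the query in (3) can be sketched as follows. This is a minimal illustration under assumptions, not the Corpus del Español schema: the table name `bigrams`, the helper names and the reduction to two period columns are hypothetical, SQLite stands in for the actual database, and the three rows reuse the x18 and x19 counts shown in Table 4.4. Column names are interpolated into the SQL text, which is only acceptable here because they are fixed by the script rather than supplied by a user.

```python
import sqlite3


def build_wide_table(period_counts):
    """Merge per-period bigram counts into one table with a column per period.

    period_counts maps a column name (e.g. 'x19') to a dict of
    {(word1, pos1, word2, lem2): frequency}.
    """
    conn = sqlite3.connect(":memory:")
    columns = sorted(period_counts)
    column_defs = ", ".join(f"{name} INTEGER DEFAULT 0" for name in columns)
    conn.execute(
        "CREATE TABLE bigrams (word1 TEXT, pos1 TEXT, word2 TEXT, lem2 TEXT, "
        f"{column_defs}, PRIMARY KEY (word1, pos1, word2, lem2))"
    )
    for name, counts in period_counts.items():
        for gram, freq in counts.items():
            # Make sure the n-gram has a row, then copy this period's count into it.
            conn.execute(
                "INSERT OR IGNORE INTO bigrams (word1, pos1, word2, lem2) "
                "VALUES (?, ?, ?, ?)",
                gram,
            )
            conn.execute(
                f"UPDATE bigrams SET {name} = ? "
                "WHERE word1 = ? AND pos1 = ? AND word2 = ? AND lem2 = ?",
                (freq, *gram),
            )
    return conn


def new_in_1900s(conn, limit=100):
    """SQLite analogue of query (3): nouns with 'duro' in the 1900s but not the 1800s."""
    return conn.execute(
        "SELECT word1, word2, x19 FROM bigrams "
        "WHERE pos1 = 'NOUN' AND lem2 = 'duro' AND x19 <> 0 AND x18 = 0 "
        "ORDER BY x19 DESC LIMIT ?",
        (limit,),
    ).fetchall()


# x18 and x19 counts for three of the rows shown in Table 4.4.
period_counts = {
    "x18": {("mano", "NOUN", "dura", "duro"): 10},
    "x19": {
        ("mano", "NOUN", "dura", "duro"): 22,
        ("línea", "NOUN", "dura", "duro"): 12,
        ("cuello", "NOUN", "duro", "duro"): 10,
    },
}

conn = build_wide_table(period_counts)
for row in new_in_1900s(conn):
    print(row)
```

Even in this toy form, every period needs one pass of UPDATE statements over its n-grams, which hints at why the merge becomes expensive with tens of millions of rows and many sub-corpora.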


The issue of size

In addition to the problem of ‘granularity’ in terms of frequency in sub-registers, another problem with a strict n-gram approach has to do with the size of the tables. While there are only about 800,000 rows in the [1-grams] table of the BNC (i.e. 800,000+ unique types in the corpus), this increases to about 11 million unique [2-grams] and 40 million unique [3-grams], and it would move towards 90 to 95 million unique 7-grams for the 100 million words. The problem with this approach, then, is that the n-gram tables become quite unmanageable in terms of size. Even with efficient clustered indexes on the tables, it may take ten to 15 seconds to return the results from a particularly difficult query, such as the most frequent n-grams for [‘the NN1 that’]. The second problem is that once we add up all of the different n-gram tables (1-grams + 2-grams + 3-grams, etc.), we soon find that the total number of rows in these tables is larger than the total number of words in the corpus, especially if we include 4-grams to 7-grams. As a result of the size issue, the approach taken in the construction of the ‘Phrases in English’ database and interface12 is to include just those n-grams that occur three times or more. By eliminating from the tables all n-grams that occur just once or twice, the size of the tables is reduced dramatically – by 75 per cent in the case of the 3-grams, and even more for 4-grams to 7-grams. Therefore, there is a large performance gain by eliminating all n-grams that occur just once or twice. Unfortunately, with this approach, one also completely loses the ability to analyze these less common strings. If one is interested in only the highly-frequent n-grams, this is probably not a significant problem. But for detailed comparison of sub-corpora, or to see which phrases are entering into or leaving the language, it is problematic to exclude 75 per cent or more of all n-gram types.
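The size argument can be made concrete with a small counting sketch. The helper names and the 13-word toy text are invented for illustration, so the numbers say nothing about the BNC itself; the point is only that the combined number of unique n-grams quickly exceeds the number of running words, and that a minimum-frequency cut-off (three, as in the ‘Phrases in English’ tables) removes most n-gram types.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Frequency of every n-gram (as a tuple of words) in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def table_sizes(tokens, max_n, min_freq=1):
    """Rows each n-gram table would hold if only n-grams with freq >= min_freq are kept."""
    return {
        n: sum(1 for freq in ngram_counts(tokens, n).values() if freq >= min_freq)
        for n in range(1, max_n + 1)
    }


# Invented 13-word toy text.
tokens = "the cat sat on the mat and the cat sat on the rug".split()

all_rows = table_sizes(tokens, max_n=3)                  # every unique n-gram kept
pruned_rows = table_sizes(tokens, max_n=3, min_freq=3)   # n-grams occurring 3+ times

print("running words:", len(tokens))
print("rows per table (all n-grams):", all_rows, "total:", sum(all_rows.values()))
print("rows per table (freq >= 3):", pruned_rows, "total:", sum(pruned_rows.values()))
```

With the cut-off applied, almost nothing survives in so small a text, which exaggerates but mirrors the trade-off described above: smaller, faster tables at the cost of the rarer strings.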
As we have seen, there are two basic problems with a strict n-gram approach. First, in terms of sub-corpora, we either have to ignore the context in which each n-gram appears (register, historical period, etc.), or else we have to create tables that are quite difficult to construct, by merging the frequency for each sub-corpus into a column in the main n-gram table. Secondly, in terms of size and speed, we either create tables that eliminate n-grams that occur just once or twice (and thus lose 75 per cent or more of all unique n-grams), or we have a number of separate n-gram tables (2-grams, 5-grams, etc.) whose combined size in terms of rows is probably much larger than the total number of words in the corpus.

12 .

An alternate architecture – all sequential n-grams

There is a ‘second-generation’ approach, however, that avoids both of the problems associated with a strict n-gram approach. In this architecture, we create one single database table that has as many rows as the total number of words in the corpus. Each row contains a sequential word in the corpus (along with part of speech and other information, if desired). Most likely, these sequential words will be one of the central columns of the table. This column is then surrounded on the left and right by a number of ‘contextual’ columns, to create a ‘context window’ for each word in the corpus. For example, Table 4.5 represents the main table in our BNC/VIEW database.13 This table contains about 100 million rows (one for each sequential word in the BNC), with each word in the corpus (column word4) surrounded by three words to the left (word1 to word3) and three words to the right (word5 to word7), as well as the ‘word offset’ ID and the text from which the n-gram is taken (for the sake of brevity, only word3 to word5 are shown here). In addition to this main sequential n-gram table, there is also a small table that contains ‘meta-data’ for each of the 4,000+ texts in the BNC. For example, Table 4.6 shows just a few of the columns of information for a handful of entries from this table, based on the [EAA] text.

Table 4.5 Sequential n-gram ‘tokens’

ID        text  …  word3   pos3  word4   pos4  word5        pos5  …
50891887  EAA   …  a       AT0   small   AJ0   group        NN1   …
50891888  EAA   …  small   AJ0   group   NN1   of           PRF   …
50891889  EAA   …  group   NN1   of      PRF   people       NN0   …
50891890  EAA   …  of      PRF   people  NN0   at           PRP   …
50891891  EAA   …  people  NN0   at      PRP   work         NN1   …
50891892  EAA   …  at      PRP   work    NN1   ,            PUN   …
50891893  EAA   …  work    NN1   ,       PUN   there        EX0   …
50891894  EAA   …  ,       PUN   there   EX0   will         VM0   …
50891895  EAA   …  there   EX0   will    VM0   be           VBI   …
50891896  EAA   …  will    VM0   be      VBI   significant  AJ0   …

Table 4.6 Text meta information table

Text  Register       Topics                     Title
EA8   W_commerce     leadership; commerce       Making it happen…
EA9   W_commerce     hotel management; tourism  The hotel receptionist…
EAA   W_soc_science                             Managing people at…
EAJ   W_soc_science  Public law; politics       Public law and political...
EAK   W_nat_science  scientific research        Nature. London: Macmillan…

13 .
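To make the two-table layout more tangible, the sketch below builds a miniature sequential table (one row per running word, with a neighbour on each side) and a miniature meta-data table, then joins them on the text identifier to restrict a count to one register. It is an assumption-laden illustration rather than the BNC/VIEW implementation: the table, column and function names are hypothetical, the window is cut from seven words to three for brevity, the register labels follow Table 4.6 but the token sequences for EA8, EAJ and EAK are invented, and SQLite stands in for the actual database.

```python
import sqlite3


def build_corpus_db(texts, registers):
    """Store one row per running word (with its neighbours) plus per-text meta-data.

    texts maps a text id to a list of (word, pos) tokens;
    registers maps a text id to its register label.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE seq (id INTEGER PRIMARY KEY, text_id TEXT, "
        "word3 TEXT, pos3 TEXT, word4 TEXT, pos4 TEXT, word5 TEXT, pos5 TEXT)"
    )
    conn.execute("CREATE TABLE meta (text_id TEXT PRIMARY KEY, register TEXT)")
    conn.executemany("INSERT INTO meta VALUES (?, ?)", list(registers.items()))
    empty = (None, None)  # padding at the edges of a text
    for text_id, tokens in texts.items():
        for i, (word, pos) in enumerate(tokens):
            left = tokens[i - 1] if i > 0 else empty
            right = tokens[i + 1] if i + 1 < len(tokens) else empty
            conn.execute(
                "INSERT INTO seq (text_id, word3, pos3, word4, pos4, word5, pos5) "
                "VALUES (?, ?, ?, ?, ?, ?, ?)",
                (text_id, *left, word, pos, *right),
            )
    return conn


def nouns_after(conn, word, register):
    """Nouns immediately following `word`, counted only in texts of one register."""
    return conn.execute(
        "SELECT s.word5, COUNT(*) AS n "
        "FROM seq AS s JOIN meta AS m ON m.text_id = s.text_id "
        "WHERE s.word4 = ? AND s.pos5 LIKE 'NN%' AND m.register = ? "
        "GROUP BY s.word5 ORDER BY n DESC, s.word5",
        (word, register),
    ).fetchall()


texts = {
    "EAA": [("a", "AT0"), ("small", "AJ0"), ("group", "NN1"), ("of", "PRF"), ("people", "NN0")],
    "EAJ": [("the", "AT0"), ("small", "AJ0"), ("group", "NN1")],   # invented
    "EAK": [("a", "AT0"), ("small", "AJ0"), ("sample", "NN1")],    # invented
    "EA8": [("a", "AT0"), ("small", "AJ0"), ("group", "NN1")],     # invented
}
registers = {
    "EA8": "W_commerce", "EAA": "W_soc_science",
    "EAJ": "W_soc_science", "EAK": "W_nat_science",
}

conn = build_corpus_db(texts, registers)
print(nouns_after(conn, "small", "W_soc_science"))
print(nouns_after(conn, "small", "W_nat_science"))
```

Because every row keeps its text identifier, frequencies are counted at query time and no per-register columns have to be merged in advance.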


Let us now consider how these tables can be used to create quick, powerful queries even on large corpora like the BNC.

Basic frequency data for words, phrases, substrings and collocations

Basic queries, even on large corpora such as the 100 million word BNC, are quite fast with the new architecture. They also have the added benefit of including all matching strings – not just the most frequent n-grams, as is the case with some other architectures. For example, to find the most frequent nouns following ‘break’ + ‘a’/‘the’ + NOUN (‘break the news’, ‘break a promise’), the user would enter the following string through the web-based interface. This would be converted to the following SQL command, which would produce the results seen in Table 4.2 above.

(4) QUERY: break a/the [nn*]

    SQL: SELECT TOP 100 COUNT(*),WORD4,POS4,WORD5,POS5,WORD6,POS6
         FROM [TABLE_NAME]
         WHERE WORD4 = 'BREAK' AND WORD5 IN ('THE','A') AND POS6 LIKE 'NN%'
         GROUP BY WORD4,POS4,WORD5,POS5,WORD6,POS6
         ORDER BY COUNT(*) DESC
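The SQL in (4) can be tried out on a toy sequential table. The sketch below is a hedged illustration only: the table name `seq`, the helper name and the token list are invented, only the word4 to word6 slots of the seven-word window are stored, and SQLite is used instead of the actual database, so `LIMIT` replaces `TOP` and the literals are lower-case because SQLite compares strings with `=` case-sensitively.

```python
import sqlite3


def build_seq_table(tagged_tokens):
    """One row per running word: the word itself plus the two words that follow it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE seq (id INTEGER PRIMARY KEY, "
        "word4 TEXT, pos4 TEXT, word5 TEXT, pos5 TEXT, word6 TEXT, pos6 TEXT)"
    )
    # Pad the end so that the last two words still get a (partly empty) row.
    padded = list(tagged_tokens) + [(None, None)] * 2
    rows = [padded[i] + padded[i + 1] + padded[i + 2] for i in range(len(tagged_tokens))]
    conn.executemany(
        "INSERT INTO seq (word4, pos4, word5, pos5, word6, pos6) VALUES (?, ?, ?, ?, ?, ?)",
        rows,
    )
    return conn


# SQLite analogue of the SQL in (4); frequencies are counted at query time.
QUERY_4 = """
SELECT COUNT(*) AS n, word4, word5, word6
FROM seq
WHERE word4 = 'break' AND word5 IN ('the', 'a') AND pos6 LIKE 'NN%'
GROUP BY word4, pos4, word5, pos5, word6, pos6
ORDER BY n DESC, word6
LIMIT 100
"""

# Invented toy sequence of (word, part of speech) tokens.
tokens = [
    ("break", "VVI"), ("the", "AT0"), ("news", "NN1"), ("and", "CJC"),
    ("break", "VVI"), ("a", "AT0"), ("promise", "NN1"), ("or", "CJC"),
    ("break", "VVI"), ("the", "AT0"), ("news", "NN1"),
]

conn = build_seq_table(tokens)
results = conn.execute(QUERY_4).fetchall()
for row in results:
    print(row)
```

Unlike the pre-aggregated n-gram tables, nothing is discarded here: a string that occurs only once still has its row and is simply counted when the query runs.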

Any combination of substrings, words, or part of speech in a seven-word window can be used for the query. In addition to seeing just raw frequencies, one can also see a ‘relevancy-based’ display, which is based on a modified z-score. With a simple raw-frequency sorting, a query like [* chair] would produce results like ‘a chair’, ‘the chair’, ‘his chair’, etc. However, if the user selects [sort by: relevance] in the search form, then the results will be ‘high-backed chair’, ‘sedan chair’, ‘wicker chair’, ‘swivel chair’, etc. To produce this set of results, the script (1) finds the raw percentage for all strings matching [* chair], (2) finds the overall frequency of the words in the [*] slot (e.g. ‘the’, ‘wooden’, ‘swivel’, etc.), and then (3) divides the first figure by the second. In this case, though, the three-step query still only takes about half of one second to return the results from the 100 million word corpus.

Comparing related words

A major advantage of storing the n-grams in a relational database is that this architecture lends itself well to frequency comparisons between sets of words. For example, it is possible to find all of the collocates of a given WORD1 and the collocates of a different WORD2, and then compare these two lists within the database itself. For example, a user can see the difference between two or more synonyms by finding the collocates that occur with one but not with the other, and all of this can be carried out in just one or two seconds. As a concrete example, assume that a non-native speaker of English is interested in the difference between ‘utter’, ‘sheer’ and ‘absolute’. The user would enter the following into the web-based search form:

(5) QUERY:

{utter/sheer/absolute} [n*]

The curly brackets around the three synonyms indicate that the user wants to find the collocates for each of these three words, and then ‘group’ these collocates together for each individual word. The results would look like those shown in Table 4.7. This would indicate to the language learner, for example, that ‘sheer’ occurs with ‘weight’ (60 tokens), ‘force’ (31 tokens) and ‘luck’ (26 tokens), but that none of these words occur with either ‘utter’ or ‘absolute’. ‘Utter’, on the other hand, is the only one of the three synonyms that occurs with ‘confusion’ (ten tokens), ‘condemnation’ (five tokens) and ‘devastation’ (five tokens), and ‘absolute’ is the only synonym that occurs with ‘majority’ (98 tokens), ‘terms’ (84 tokens) and ‘zero’ (54 tokens).

Table 4.7 Grouping by synonyms

%     +   –  SHEER        |  %     +   –  UTTER         |  %     +   –  ABSOLUTE
1.00  60  –  weight       |  1.00  19  –  confusion     |  1.00  98  –  majority
1.00  31  –  force        |  1.00  5   –  condemnation  |  1.00  84  –  terms
1.00  26  –  luck         |  1.00  5   –  devastation   |  1.00  54  –  zero
1.00  23  –  quantity     |  1.00  5   –  disregard     |  1.00  53  –  minimum
1.00  13  –  cliff        |  1.00  5   –  helplessness  |  1.00  51  –  value
1.00  12  –  cliffs       |  1.00  4   –  loneliness    |  1.00  39  –  egalitarianism
1.00  11  –  coincidence  |  1.00  3   –  dejection     |  1.00  33  –  right
1.00  10  –  enjoyment    |  1.00  3   –  ruthlessness  |  1.00  27  –  price

To process this query, the script carries out a number of tasks: (1) it finds all nouns following ‘sheer’; (2) it finds all nouns following ‘utter’; (3) it finds all nouns following ‘absolute’; (4) it stores all of the results for these three queries in a temporary table; and (5) it runs three separate, sequential SQL queries to find the most frequent collocates for each of these words, which do not occur with either of the other two. (It is also possible to sort by raw frequency with each word, rather than WORD1 v. WORD2 / WORD3, as shown above.) Notice that comparisons such as these might be possible, but they would certainly be much more cumbersome with other architectures. For example, users of those interfaces could carry out a query with WORD1, and then WORD2, and so on. They would then copy and paste the results into separate tables of a database, and then (assuming some skill in SQL), carry out cross-table JOINs to determine the relative frequency with the different words. In this case, however, the process would probably take at least three to four minutes, whereas with our interface it takes only about 1.2 seconds.

Integration with other databases

Another important advantage of the relational database approach is that the central n-gram tables can be integrated with other relational databases. For example, the frequency information from the BNC/VIEW databases can be joined with semantic information from other databases such as WordNet,14 or with personalized lists (relating to semantic fields) created by the user. Let us take a concrete example, which we have already discussed elsewhere.15 In order to add a strong semantic component to the BNC/VIEW database, we have imported into a separate database the entire contents of WordNet – a semantically-based hierarchy of hundreds of thousands of words in English. The ‘synsets’ that make up WordNet indicate roughly equivalent meanings (synonyms), more specific words (hyponyms) and more general words (hypernyms) for a given word, as well as ‘parts’ (meronyms) of a larger group (e.g. parts of a body) or the larger groups to which a certain word belongs (holonyms, e.g. leg > table/body).
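The kind of integration described here can be sketched as a join between a word-frequency table and a separate table of synonym pairs. The sketch is illustrative only and makes several assumptions: the `synonyms` table is a hypothetical stand-in for data imported from WordNet (it does not use any WordNet API or reproduce its real schema), the table and function names are invented, SQLite replaces the actual database, and the counts are the BNC figures reported in Table 4.8 for synonyms of ‘sad’.

```python
import sqlite3


def build_joint_db(word_freqs, synonym_sets):
    """Create a frequency table and a separate synonym table in one database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE unigrams (word TEXT PRIMARY KEY, freq INTEGER)")
    conn.execute("CREATE TABLE synonyms (headword TEXT, synonym TEXT)")
    conn.executemany("INSERT INTO unigrams VALUES (?, ?)", list(word_freqs.items()))
    conn.executemany(
        "INSERT INTO synonyms VALUES (?, ?)",
        [(head, syn) for head, syns in synonym_sets.items() for syn in syns],
    )
    return conn


def synonym_frequencies(conn, headword):
    """Corpus frequency of every listed synonym of `headword`, most frequent first."""
    return conn.execute(
        "SELECT u.word, u.freq "
        "FROM synonyms AS s JOIN unigrams AS u ON u.word = s.synonym "
        "WHERE s.headword = ? ORDER BY u.freq DESC",
        (headword,),
    ).fetchall()


# Frequencies as reported in Table 4.8.
word_freqs = {
    "sorry": 10767, "sad": 3322, "distressing": 359,
    "pitiful": 199, "deplorable": 144, "lamentable": 69,
}
# Hypothetical synonym list standing in for an imported WordNet synset.
synonym_sets = {
    "sad": ["sad", "sorry", "distressing", "pitiful", "deplorable", "lamentable"],
}

conn = build_joint_db(word_freqs, synonym_sets)
for word, freq in synonym_frequencies(conn, "sad"):
    print(word, freq)
```

The same join pattern would apply to a user-defined list of words for a semantic field: only the contents of the second table change, not the query.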

14 C. Fellbaum, WordNet: An Electronic Lexical Database (Cambridge, MA: MIT Press, 1998); S. Landes, C. Leacock and R. Tengi, ‘Building Semantic Concordances’, in C. Fellbaum (ed.), WordNet: An Electronic Lexical Database (Cambridge, MA: The MIT Press, 1998), pp. 199–216.

15 M. Davies, ‘Semantically-based Queries with a Joint BNC/WordNet Database’, in R. Facchinetti (ed.), Corpus Linguistics Twenty-five Years On (Amsterdam: Rodopi, 2007), pp. 149–67.

Table 4.8 WordNet/BNC integration: frequency of synonyms of ‘sad’

1  Sorry        10,767
2  Sad           3,322
3  Distressing     359
4  Pitiful         199
5  Deplorable      144
6  Lamentable       69

At the most basic level, a user can simply submit a query like the following:

(6) QUERY:

[=sad]

The script sees the equals sign and interprets this as a query to find all synonyms of ‘sad’. The script then retrieves all of the matching words from WordNet and finds the frequency of each of these words in the BNC (see Table 4.8). Users can also limit the hits by part of speech, and can also intervene before seeing the frequency listing to select just certain synsets (or meanings) from WordNet. The WordNet data can also be integrated into more advanced queries. A user can – with one simple query – compare which nouns occur with the different synonyms of a given word. For example, the following query finds all cases of a noun following a synonym of ‘bad’:

(7) QUERY:

[=bad] [nn*]

In less than four seconds, the user would then see something similar to T able 4.9. (N ote that the format on the web interface is somewhat different to the abbreviated listing shown here.) S uch collocational data can be very useful for a language learner, who is probably unsure of the precise semantic range of each adjective. T he type of listing given above, which shows the most common nouns with each of the adjectives, can easily permit the language learner to make inferences about the semantic differences between each of the competing adjectives. For example, he or she would see that ‘severe illness’ occurs but ‘wicked illness’ does not, and that ‘terrible mistake’ is common, whereas ‘foul mistake’ is not. Finally, it is again worth noting that such a query would be extremely difficult with an architecture that does not allow the user to access other databases that can be tied into the main frequency/n-gram databases. W ithout our approach, users would probably have to create the list of synonyms in another program, perhaps run queries with each of these words individually, and then collate and sort the

Table 4.9 Synonyms of [bad] + NOUN

PHRASE               FREQ.    PHRASE               FREQ.
disgusting thing     16       severe shortage      26
disgusting way       5        severe weather       43
distasteful species  5        severe winter        49
evil empire          10       terrible accident    26
evil eye             24       terrible blow        13
evil influence       9        terrible danger      14
foul language        29       terrible feeling     19
foul mood            21       terrible mistake     48
foul play            72       terrible shock       41
foul temper          14       wicked grin          6
severe blow          44       wicked people        13
severe burn          28       wicked thing         27
severe damage        59       wicked way           14
severe drought       30       wicked witch         12
severe illness       23

results in a spreadsheet. Using our approach, all of this is done by the web-based script in less than two seconds.

Register-based queries

As was discussed previously, one serious shortcoming of many other approaches is that it is either difficult or impossible to compare the results of different subcorpora – for example, different registers or different historical periods. With the newer BNC/VIEW architecture, however, this is both quite simple and quite fast. Recall that with the ‘sequential n-gram’ architecture, each word in the corpus appears on its own line in the database, as shown in Table 4.10. As shown previously, there is also a small table (see Table 4.11) that contains ‘meta-data’ for each of the 4,000+ texts in the BNC. To limit the query to one particular register (or set of registers), or to compare registers, the script relies on an SQL JOIN between these two tables. It looks for all rows that match the n-gram table (Table 4.10), all texts that match the specific register(s) (Table 4.11), and then limits these hits to just those that have a particular text in common (e.g. EAA). Examples of the types of register-based queries might be queries to find which nouns, or verbs, or three-word lexical


Table 4.10 Sequential n-gram table

ID        Text  …  word3   pos3  word4   pos4  word5        pos5  …
50891887  EAA   …  a       AT0   small   AJ0   group        NN1   …
50891888  EAA   …  small   AJ0   group   NN1   of           PRF   …
50891889  EAA   …  group   NN1   of      PRF   people       NN0   …
50891890  EAA   …  of      PRF   people  NN0   at           PRP   …
50891891  EAA   …  people  NN0   at      PRP   work         NN1   …
50891892  EAA   …  at      PRP   work    NN1   ,            PUN   …
50891893  EAA   …  work    NN1   ,       PUN   there        EX0   …
50891894  EAA   …  ,       PUN   there   EX0   will         VM0   …
50891895  EAA   …  there   EX0   will    VM0   be           VBI   …
50891896  EAA   …  will    VM0   be      VBI   significant  AJ0   …

Table 4.11 Text meta information table

Text  Register       Topics                     Title
EA8   W_commerce     leadership; commerce       Making it happen…
EA9   W_commerce     hotel management; tourism  The hotel receptionist…
EAA   W_soc_science  –                          Managing people at…
EAJ   W_soc_science  public law; politics       Public law and political...
EAK   W_nat_science  scientific research        Nature. London: Macmillan…

bundles, or collocates with ‘chair’ occur more in one register (e.g. fiction) than in another (e.g. academic). For example, suppose that a user wants to find which verbs are more common in legal texts than in academic texts generally. He or she would simply select the following in the web-based form:

(8) QUERY: [vvi]
    REGISTER 1: [w_ac_polit_law_edu] (from pull-down menu)
    REGISTER 2: [ACADEMIC]
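The register comparison described above, an SQL JOIN between the n-gram table and the text metadata table followed by a comparison of normalised frequencies, might be sketched roughly as follows. This is a minimal illustration using Python's built-in sqlite3 module, not the BNC/VIEW implementation: the table names `ngrams` and `texts`, the text IDs `T1` and `T2`, the register labels `law` and `academic`, and the tiny corpus sizes are all assumptions made for the example, and the two registers are treated as non-overlapping for simplicity.

```python
import sqlite3


def per_million(tokens, corpus_size):
    """Normalised frequency: occurrences per million words of a (sub)corpus."""
    return tokens / corpus_size * 1_000_000


def compare_registers(conn, pos, reg1, reg2, size1, size2):
    """Rank words tagged `pos` by per-million frequency in reg1 relative to reg2."""
    sql = """
        SELECT n.word3,
               SUM(t.register = ?) AS tokens1,  -- a comparison yields 1 or 0 in SQLite
               SUM(t.register = ?) AS tokens2
        FROM ngrams AS n
        JOIN texts AS t ON t.text = n.text      -- link each token to its text's metadata
        WHERE n.pos3 = ? AND t.register IN (?, ?)
        GROUP BY n.word3
    """
    rows = []
    for word, t1, t2 in conn.execute(sql, (reg1, reg2, pos, reg1, reg2)):
        pm1, pm2 = per_million(t1, size1), per_million(t2, size2)
        ratio = pm1 / pm2 if pm2 else float("inf")
        rows.append((word, t1, t2, pm1, pm2, ratio))
    # Highest ratio first: words most characteristic of reg1.
    return sorted(rows, key=lambda row: row[-1], reverse=True)


# Toy data (hypothetical): two texts, one per register.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ngrams (id INTEGER PRIMARY KEY, text TEXT, word3 TEXT, pos3 TEXT);
    CREATE TABLE texts  (text TEXT PRIMARY KEY, register TEXT);
""")
conn.executemany("INSERT INTO texts VALUES (?, ?)",
                 [("T1", "law"), ("T2", "academic")])
conn.executemany(
    "INSERT INTO ngrams (text, word3, pos3) VALUES (?, ?, ?)",
    [("T1", "sue", "VVI"), ("T1", "sue", "VVI"), ("T1", "show", "VVI"),
     ("T1", "court", "NN1"),
     ("T2", "sue", "VVI"), ("T2", "show", "VVI"), ("T2", "show", "VVI"),
     ("T2", "show", "VVI")],
)

table = compare_registers(conn, "VVI", "law", "academic", size1=4, size2=4)
for word, t1, t2, pm1, pm2, ratio in table:
    print(f"{word:6} {t1:3} {t2:3} {ratio:6.2f}")
```

The same normalisation reproduces the published figures: with the Table 4.12 counts for ‘sue’, `per_million(331, 4_640_346)` comes to about 71.33, and dividing by the corresponding value for the second register (10 tokens in 10,789,236 words) gives a ratio of about 76.96.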

Query (8) searches for all infinitival lexical verbs (VVI) in law texts (w_ac_polit_law_edu), and then all infinitival verbs in academic texts as a whole, and then compares the two sets of words. The resulting words – verbs that are highly frequent in a legal context – are found in Table 4.12. Again, note that such comparisons would possibly be quite cumbersome for competing architectures. To the degree that these architectures allow queries by


Table 4.12 Searching by register: lexical verbs in legal texts

Word/phrase  Tokens  Tokens  Per mil in REG1    Per mil in REG2     Ratio
             REG1    REG2    [4,640,346 words]  [10,789,236 words]
Sue          331     10      71.33              0.93                76.96
Certify      27      2       5.82               0.19                31.39
Adjourn      26      2       5.60               0.19                30.23
Notify       38      3       8.19               0.28                29.45
Waive        38      3       8.19               0.28                29.45
Disclose     190     16      40.95              1.48                27.61
Overrule     23      2       4.96               0.19                26.74
Prohibit     53      5       11.42              0.46                24.65
Plead        62      6       13.36              0.56                24.03

sub-corpora (and not all do), users could carry out a query for [VVI] in Register 1 and then a subsequent query in Register 2. They would then copy and paste the results into separate tables in a database, and then carry out cross-table JOINs to determine the relative frequency in the two registers. Again, however, the process would probably take at least three to four minutes, but using our interface it takes only about two seconds.

Expanding collocate searches

The architecture described to this point works well for determining the frequency of words, substrings, phrases and ‘slot-based’ searches (e.g. adj + ‘world’, synonym of [‘tired’] + noun, etc.). One fundamental disadvantage of the architecture, however, is that it is limited by its very nature to strings that occur within a seven-word window. This is because the ‘sequential n-gram’ table (see Table 4.5 above) contains that number of columns. Therefore, it would not be possible to find, for example, all the nouns that occur in the wider context of a given adjective using this architecture. Recently, however, we changed the searching algorithm to allow for the recovery of collocates from a much wider context – up to ten words to the left and to the right of a given word. There are several steps in the script that carry out such queries. Let us take a concrete example to see how this would work. Assume that we are looking for adjectives that are collocates – ten words to the left or right – of the word ‘woman’. First, the script finds all of the ID values (= word offset values in the corpus) for ‘woman’ – more than 22,000 occurrences in the 100 million word corpus – and these are stored in a temp table. We then find all of the adjectives in the corpus whose ID value is between ten less and ten more than the ID values in the temp table. These are placed in a second temp table, and an


SQL command then finds the most common words in this table. Overall, the script takes about five to six seconds to run, and yields adjectives like ‘young’, ‘old’, ‘beautiful’, ‘married’, etc. As with the ‘slot-based’ queries (e.g. ADJ + ‘woman’) these collocates can then be compared to the collocates of a competing word, such as ‘man’, to determine which collocates are used with ‘woman’ much more than with ‘man’ (e.g. ‘childless’, ‘pretty’, ‘pregnant’, ‘distraught’, ‘fragile’, ‘desirable’), or more with ‘man’ than with ‘woman’ (e.g. ‘honourable’, ‘reasonable’, ‘military’, ‘modest’, ‘rational’). Likewise, one can compare collocates across registers, to look for possible polysemy with a given word. For example, one could look for adjectives with ‘chair’ that are more common in fiction than in academic text (e.g. ‘small’, ‘hard’, ‘rocking’, ‘asleep’) or which are more common in academic texts than in fiction (e.g. ‘senior’, ‘philosophical’, ‘established’, ‘powerful’). Again, this script only takes about 1.5 seconds to produce the list of contrasting collocates.

Conclusion

We hope to have demonstrated two fundamental facts in this chapter. First, it is often insightful and advantageous to look at word frequency in context – by register, in historical terms, in syntagmatic terms (collocations) and paradigmatic terms (a given word contrasted with competing words that fill the same slot, such as synonyms). Secondly, highly structured relational databases lend themselves well to the comparison of contexts. Word frequency, then, can be analyzed not just as the overall frequency of a given word or lemma in a certain corpus, but, rather, as the frequency of words in a wide range of related contexts.
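The three steps of the wider-window collocate search described under ‘Expanding collocate searches’ (node IDs into a temp table, tokens within the window into a second temp table, then a frequency count) might be sketched along the following lines. This is an illustrative reconstruction under stated assumptions, not the chapter's own script: the single-column-per-token table `corpus`, the temp table names `node_ids` and `hits`, and the toy sentence are all hypothetical.

```python
import sqlite3


def wide_collocates(conn, node, pos_prefix, span=10, top=10):
    """Most frequent words with a given POS prefix within +/-span tokens of `node`."""
    cur = conn.cursor()
    # Step 1: store the ID values (corpus offsets) of the node word in a temp table.
    cur.execute("DROP TABLE IF EXISTS temp.node_ids")
    cur.execute("CREATE TEMP TABLE node_ids (id INTEGER)")
    cur.execute("INSERT INTO node_ids SELECT id FROM corpus WHERE word = ?", (node,))
    # Step 2: collect every matching token whose ID falls inside the window
    # around any node ID, and place these in a second temp table.
    cur.execute("DROP TABLE IF EXISTS temp.hits")
    cur.execute("CREATE TEMP TABLE hits (word TEXT)")
    cur.execute(
        """
        INSERT INTO hits
        SELECT c.word
        FROM corpus AS c
        JOIN node_ids AS n ON c.id BETWEEN n.id - ? AND n.id + ?
        WHERE c.pos LIKE ? AND c.id <> n.id
        """,
        (span, span, pos_prefix + "%"),
    )
    # Step 3: find the most common words in the second temp table.
    cur.execute(
        "SELECT word, COUNT(*) AS freq FROM hits "
        "GROUP BY word ORDER BY freq DESC, word LIMIT ?",
        (top,),
    )
    return cur.fetchall()


# Toy corpus (hypothetical): one token per row, IDs assigned in running order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE corpus (id INTEGER PRIMARY KEY, word TEXT, pos TEXT)")
tokens = [("a", "AT0"), ("young", "AJ0"), ("woman", "NN1"), ("and", "CJC"),
          ("an", "AT0"), ("old", "AJ0"), ("man", "NN1"), ("saw", "VVD"),
          ("the", "AT0"), ("young", "AJ0"), ("woman", "NN1")]
conn.executemany("INSERT INTO corpus (word, pos) VALUES (?, ?)", tokens)

near = wide_collocates(conn, "woman", "AJ", span=2)
wide = wide_collocates(conn, "woman", "AJ", span=10)
print(near, wide)
```

Note that in this sketch a token lying within the window of two different occurrences of the node word is counted once for each, which is one possible reading of the procedure described; running the same function for ‘man’ and comparing the two result lists would give the contrast between competing words.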

Chapter 5

Issues for Historical and Regional Corpora: First Catch Your Word

Christian Kay

Introduction

My interest in what words can tell us about a text stems mainly from two electronic projects: the Historical Thesaurus of English (HTE) and the Scottish Corpus of Texts and Speech (SCOTS). I have a further involvement in the Linguistic and Cultural Heritage Electronic Network project (LICHEN), headed by Lisa-Lena Opas-Hänninen at the University of Oulu, Finland, which aims to collect and display languages of the circumarctic region.

TOE and SCOTS

HTE was completed in 2008, and already has a daughter project, A Thesaurus of Old English (TOE), a conceptually organized thesaurus of the surviving vocabulary of Old English (OE, c. AD 700–1100). Both TOE and SCOTS, which are freely available on the internet, make some attempt to deal with word frequency. Much of the vocabulary of OE is assumed not to have survived, and that which does is unlikely to be representative of the whole. As a result, the editors made use of four flags, somewhat similar to the register labels in modern dictionaries. These are ‘o’ indicating infrequent use, ‘p’ for poetic register, ‘q’ for doubtful forms, and ‘g’ for words occurring only in glossed texts or glossaries. Unlabelled words

C. Kay, J. Roberts, M. Samuels and I. Wotherspoon (eds), The Historical Thesaurus of the OED (Oxford: Oxford University Press, forthcoming 2009); see also .
: J. Corbett, J. Anderson, C. Kay and J. Stuart-Smith, funded by AHRB grant B/RE/AN9984/APN17387.
See I. Juuso, J. Anderson, W. Anderson, D. Beavan, J. Corbett et al., ‘The LICHEN Project: Creating an Electronic Framework for the Collection, Management, Online Display, and Exploitation of Corpora’, in A. Hardie (ed.), Digital Resources in the Humanities Conference 2005 Abstracts (DRH, 2005), pp. 27–9.
An electronic version, supported by British Academy grant LRG-37362, can be seen at .


are common by default. The database as a whole contains some 50,700 meanings, deriving from almost 34,000 different forms. Of these meanings, around 30 per cent have one or more of the above flags attached to them, which is an indicator of the peculiar nature of the surviving OE vocabulary. The flags are held in a separate field in the TOE database and can be searched either individually or in combination, yielding information about both individual words and their distribution over semantic categories. From the TOE search menu, the user can select a search on the flags, and then ‘p’ for words occurring only in poetry. The results can be browsed, or a particular semantic field, such as ‘Section 13 Warfare’, can be chosen. The Warfare screen shows that a total of 457 out of 1,450 headwords in this section (around 32 per cent) are marked with a ‘p’ flag, which is the highest proportion in TOE. A sizeable proportion of these (302 or about 20 per cent) are also marked as rare words by the ‘o’ flag. The specialized nature of the vocabulary of this area of Old English is thus confirmed.

SCOTS demonstrates frequency in a different way. If we conduct a search on the form ‘war’, we find that it occurs in 90 documents (17.08 per cent of the published corpus), and that these in turn contain 303,093 words (38.83 per cent of the corpus), while the form ‘war’ itself occurs 344 times. However, a glance at the citations will show that this is not the end of the matter, since at least three meanings of ‘war’ are represented in the selection:

1. They war first biggit (‘they were first built’, where ‘war’ is the third-person plural past tense form of the verb ‘to be’).
2. His faither is away tae the war (‘his father has gone to war’, where it is a noun).
3. Be war of inserting sic lang words hinmest in the line (‘be wary of inserting such long words at the end of the line’, where ‘war’ is an aphetic form of the adjective ‘aware’ or ‘wary’).
Since these three uses of ‘war’ happen to be different parts of speech, grammatical parsing, which we have not yet tackled, would contribute to a solution here. But in general terms, despite many proposed solutions, multiple meaning remains a challenge for corpus searching. The joker in the pack is the third example, which is from a modern lecture text but is quoting from an Essay on Poesie written by James VI (of Scotland) and I (of England) and published in 1585. Such temporal displacement is not unusual in texts. Direct quotation can be dealt with by tagging,

For more detail, see ‘A Thesaurus of Old English Online’ .
These figures will fluctuate slightly, since TOE is periodically updated on receipt of new materials from the Toronto Dictionary of Old English project .
These figures will vary, since SCOTS is updated at regular intervals. It contains over 4 million words, of which 20 per cent are spoken.


and possibly ignored when results are returned, but there remains the problem of allusion – where a word may trigger reference to an event, a text, or whatever, necessitating access to a knowledge base. Perhaps this takes us further than tagging should have to go. Where a word has only one meaning, figures such as those above can give us an idea of its relative frequency in the corpus. In a case like ‘war’, they can still provide useful pointers – for example, what form of the verb is used by those who do not select ‘war’? ‘Were’ (most likely in written Scottish Standard English) or ‘was’ (predictable for Glasgow and elsewhere in the central belt)? For any word, the extensive SCOTS metadata can answer questions which are crucial for interpreting frequency statistics as opposed to merely recording them. Where do the users of a form come from? What is their sociolinguistic profile? Do they use the word primarily in speech or in writing? In which genre does the word occur? From my point of view, however, which is that of someone interested in historical and regional language, the title of this seminar, ‘Word Frequency and Keyword Extraction’, is really the wrong way round. Issues of frequency can only be tackled if we are confident that we are able to retrieve the material we need from our corpora. At least two problems currently stand in the way of such confidence: the problem of variable spelling and, as just demonstrated, the problem of semantic ambiguity. Such problems affect not only linguists but also the wide range of humanities scholars engaged with regional or historical texts.

Spelling

To take spelling first, any work on historical texts has to solve the problem of the variations in spelling which occurred in English until at least the eighteenth century, and which become more marked further back in time.
Even in modern standard varieties there is a degree of variation – as in the resistance of British English writers to using ‘–ize’ forms in words like ‘realise’ – and one might also want to capture erroneous spellings. A parallel problem involves spelling variation between varieties and sub-varieties of English. In some of these, such as Scots, there may be no generally accepted written standard, with orthographic choice being left to individuals or groups, who may not themselves be consistent. This problem is shared by the many languages in the world which have underdeveloped written forms. In the LICHEN project, we will be tackling languages with virtually no written form, such as the Finnish varieties Meänkieli and Kven – but that presents different problems.

A list of recommended spellings, based mainly on frequency of occurrence, is being drawn up by Scottish Language Dictionaries Ltd (SLD), the body with responsibility for developing academic dictionaries of Scots. SLD has, of course, no power to impose its use.


Several methods have been used to deal with spelling variation. The simplest is to present the user with an alphabetized concordance of all the words in a corpus, allowing them to search on likely variants. Such a method presupposes a well-informed user, otherwise variants which are not alphabetically close or obvious, such as ‘fit’ as a variant of ‘what’ in north-eastern Scots, may be missed. A more foolproof method is to lemmatize the variants by tagging while the corpus is under construction, thus building up a spelling dictionary for that particular body of texts.10 This method is likely to be successful, but at considerable cost in human effort, and depends on skilled annotators being available. Moreover, its success cannot be guaranteed beyond the selected texts. A more sophisticated procedure involves extending the use of wildcard searches by writing algorithms predicting the range of possible spellings, either overall or for particular periods or varieties. This is not an easy task, either linguistically or computationally, but if the method were sufficiently generalisable, it would prove invaluable in many humanities disciplines, since it would allow scholars to import and search texts of their own choosing rather than rely on prepared corpora. For the next generation of tools, we should perhaps be looking to such frameworks: in many areas of the humanities, electronic texts are relatively easy to acquire, but annotated corpora are not. A case in point was provided at the Digital Resources in the Humanities Conference 2005 (DRH), where two historians interested in trade and material culture in the early modern period discussed the issues involved in setting up an electronic Dictionary of Traded Goods and Commodities 1550–1820 based on a digitized corpus of primary materials.11 Problems were encountered in retrieving the very variously-spelt terms for items such as foodstuffs, spices and dyes in their database.
A good deal of information about spelling in the history of English is lemmatized under the headwords in the Oxford English Dictionary (OED), where the main variants are given century by century.12 These listings suggest a way forward, but also illustrate the extent of the problem, as a glance at the OED data for two homophonous English words, ‘peace’ and ‘piece’, demonstrates.

An example of this kind of approach is the corpus of Middle English Medical Texts (MEMT) compiled by Taavitsainen and her team at the University of Helsinki, and obviously aimed at sophisticated users: I. Taavitsainen, P. Pahta and M. Mäkinen, Middle English Medical Texts, CD-ROM (Amsterdam: John Benjamins, 2005).
10 Anneli Meurman-Solin of the University of Helsinki has used this method in preparing her forthcoming Tagged Corpus of Scottish Correspondence (personal communication). See also .
11 See N. Cox and K. Dannehl, ‘The Rewards of Digitisation: A Corpus-based Approach to Writing History’, in A. Hardie (ed.), Digital Resources in the Humanities Conference 2005 Abstracts (DRH, 2005), pp. 13–14.
12 OED Online, (ed.) Simpson, J.A. (Oxford: Oxford University Press, 2000–). See also . The revised version, OED3, contains a much wider range of spelling variations.


Peace, noun
Forms: 2–4 pais, 2–6 pes, (3–5 pays, peys, 3–6 peis, 4 payes, 4–5 payse, pese, pees, Sc. and north. pess), 4–6 pece, (5 peese), 5–6 peas, pease, (pesse, Sc. peice, 5–7 peax, 6 Sc. peiss, pace), 6– peace.

Piece, noun
Forms: 3–7 pece (3–5 pees, 4 pise, 4–5 pice, peis, 5 pes, peyce, peese, 5–6 pes(s, pesse); 5– piece, (5 pyece, 5–8 peace, 6 pease, peise, peyss, (Sc. peax), pysse, 6–7 peece, 6–8 peice).

(The figures show the centuries of currency of particular forms; thus ‘2–4’ indicates twelfth to fourteenth century. Sc = Scots.)

It would be possible to use these listings as a starting point, instructing a program to search for all variants. However, this would be only partially helpful since the spellings under the headwords are the most common variants rather than a comprehensive listing, and other forms might well occur. An alternative, as discussed above, would be to attempt to predict the range of possible variants. Some of the resulting algorithms would be very broad, as for the range of vowel variation above, while others would be quite restricted. In the sixteenth to eighteenth centuries, for example, ‘ph’ could be substituted for ‘f’ in words of classical origin such as ‘phantastic’, ‘phrentic’ (‘frantic’) or ‘phanatic’. An added complication is that the ending of such words could be ‘–ic’, ‘–ick’, ‘–ik’, ‘–icke’, ‘–ike’ or even ‘–ique’. These endings, however, have widespread substitutability over a longer period in a greater range of words, and so could be encapsulated in a more general rule. Indeed, it may be that we could make progress by ignoring the vowels and concentrating on the consonants: the Dictionary of Old English (DOE) manages to find variously-spelt phrases by treating all vowels as one in their searches.13

Ambiguity

Even if an adequate system for retrieving spellings is devised, we will still face the problem of semantic ambiguity. Although neatly distinguished in modern usage, the spellings of many homophones overlap historically, as in the ‘peace’/‘piece’ example above, and the researcher could not always be sure which one was being retrieved. For a common word such as ‘piece’, with many meanings and spellings, manual sifting (e.g. of the 7,340 hits returned for ‘piece’ in OED2) would be extremely time-consuming.
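The two kinds of spelling rule discussed in the Spelling section, a restricted substitution rule (‘ph’ for ‘f’, plus the interchangeable ‘–ic’ endings) and a broad rule that treats all vowels as one, might be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the rule set covers just the examples given in the text, counting ‘y’ as a vowel is a choice made here, and nothing is implied about how the DOE search is actually implemented.

```python
import re

# Endings listed in the text as interchangeable in words such as 'fantastic'.
ENDINGS = ["ic", "ick", "ik", "icke", "ike", "ique"]


def predicted_variants(word):
    """Expand a modern '-ic' word into candidate historical spellings."""
    if not word.endswith("ic"):
        return {word}
    stem = word[:-2]
    stems = {stem}
    # Restricted rule: initial 'f' and 'ph' may substitute for one another.
    if stem.startswith("ph"):
        stems.add("f" + stem[2:])
    elif stem.startswith("f"):
        stems.add("ph" + stem[1:])
    return {s + ending for s in stems for ending in ENDINGS}


def consonant_skeleton(form):
    """Broad rule: collapse every run of vowels (here a, e, i, o, u, y) to 'V'."""
    return re.sub(r"[aeiouy]+", "V", form.lower())


variants = predicted_variants("fantastic")
print(sorted(variants))

# Forms recorded for both 'peace' and 'piece' fall together under the broad rule.
print({form: consonant_skeleton(form) for form in ["peace", "piece", "pece"]})
```

The second rule shows why retrieval alone does not settle the matter: ‘peace’, ‘piece’ and ‘pece’ all reduce to the same skeleton, so a search that is generous enough to find every spelling will also return both words.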



This situation raises the more general problem of disambiguating multiple meaning in what are traditionally distinguished as homonyms (‘peace’/‘piece’ coming from different roots) and polysemes (the 17 main meanings of ‘piece’ listed by the OED, not to mention sub-senses, phrases and compounds). This distinction, and especially the role of polysemy in the extension of meaning through such processes as metonymy and metaphor, is of considerable interest in semantics even if computationally irrelevant. Disambiguation of such forms has long been a challenge in Natural Language Processing (NLP), with increasing success being achieved through projects such as WordNet,14 MindNet,15 and the preference-based approach of Wilks’ Pathfinder project at New Mexico State University.16 Such work exploits the compositionality of lexical meaning and the tendency of words to co-occur with others from the same semantic domain. In the case of Wilks and his colleagues, there is a refreshing and demonstrable appreciation of the contribution of information in published dictionaries to creating NLP tools.17

Historical Thesaurus of English (HTE)

This project is effectively a semantic index to the OED, supplemented by Old English materials from TOE. HTE’s projected 650,000 word meanings are presented in 26 major categories, each arranged in a detailed semantic taxonomy of up to 12 hierarchical places, thus showing the position of each meaning within the overall structure. Every section has an explanatory heading in modern English, which can be traced back through the hierarchy to create a definition; indeed, it would be possible to reverse the process and to produce a unique kind of structured dictionary with the headwords in alphabetical order. Each section is organized internally in chronological order, with words retrievable through a unique number in the 29-field database.
Someone searching for ‘piece’ in HTE would find that it occurs at various times in numerous categories, such as ‘land’, ‘bread’, ‘drugs’, ‘people’, ‘armaments’, ‘games’, etc., while ‘peace’ occurs in ‘flowers’, ‘absence of war’, ‘freedom from care’, ‘absence of noise’,

14 , accessed 8 August 2005.
15 , accessed 8 August 2005.
16 For a discussion of developments from the 1950s onwards, see Y.A. Wilks, B.M. Slator and L.M. Guthrie, Electric Words: Dictionaries, Computers and Meanings (Cambridge, MA and London: MIT Press, 1996). More recent work, including the Mappings, Agglomerations and Lexical Tuning (MALT) project, is described at , accessed 1 October 2005.
17 For a defence of dictionaries in linguistic terms, see C.J. Kay, ‘Historical Semantics and Historical Lexicography: Will the Twain Ever Meet?’, in J. Coleman and C.J. Kay (eds), Lexicology, Semantics and Lexicography in English Historical Linguistics: Selected Papers from the Fourth G.L. Brook Symposium, Manchester, August 1998 (Amsterdam: Benjamins, 2000), pp. 53–68.


etc. First steps towards the internet publication of the materials were taken by an AHRC-ICT Strategy Project creating datasets and searches, including delimitation of words by dates of currency, for use in a range of humanities disciplines.18 In addition to being of interest to linguists, the organization of vocabulary in semantic categories can cast light on such topics as the development of material culture, social organization, and intellectual pre-occupations.19 Thesauri have long been employed as a component of automatic parsing tools, as in the use of McArthur’s (1981) Lexicon of Contemporary English20 in Lancaster’s USAS package (UCREL Semantic Analysis System), which is being redeveloped to cope with historical factors in the sixteenth to eighteenth centuries.21 A favourite for such research has been Roget’s well-known Thesaurus of English Words and Phrases (1852),22 which was used in early work of this kind.23 An interesting current example of a historically focussed use of Roget occurs in research by Terry Butler of the University of Alberta to map a tagged version of the notebooks of the English poet Samuel Taylor Coleridge (1772–1834) to the categories of the first edition of Roget (1852), thus creating a contemporary subject index.24 Although the overall structure of HTE was originally devised by a componential analysis of key OED definitions,25 it is essentially a folk taxonomy, more akin to McArthur than Roget. In the example below, 03 represents the major class 3. Society (the other two being 1. The Physical Universe, and 2. The Mind), with 03.03 giving the most general words for the concept of armed hostility. Old English words are given first, marked simply ‘OE’, but earliest and latest dates of currency from the OED are given from 1150 on. An entry like ‘win<(ge)winn OE – c1275’ gives the post-OE form, followed by its OE ancestor, with a last recorded use of around 1275.
‘Conflict 1611—’ indicates a first recorded date of 1611, with continuous currency into present-day English. Interrupted currency or scarcity of examples is

18 J. Smith, S. Horobin and C. Kay, Lexical Searches for the Arts and Humanities, AR112456.
19 See, for example, C.J. Kay, ‘Historical Semantics and Material Culture’, in S.M. Pearce (ed.), Experiencing Material Culture in the Western World (London and Washington: Leicester University Press, 1997), pp. 49–64.
20 T. McArthur, Longman Lexicon of Contemporary English (Longman, 1981).
21 D. Archer, T. McEnery, P. Rayson and A. Hardie, ‘Developing an Automated Semantic Analysis System for Early Modern English’, in D. Archer, P. Rayson, A. Wilson and T. McEnery (eds), Proceedings of the Corpus Linguistics 2003 Conference, UCREL Technical Paper Number 16 (Lancaster: UCREL, 2003), pp. 22–31.
22 P.M. Roget, Thesaurus of English Words and Phrases (Longman, 1852, and subsequent edns).
23 Wilks et al., Electric Words, 130–31, citing M. Masterman, ‘The Thesaurus in Syntax and Semantics’, Mechanical Translation, 4 (1957): 1–2.
24 Improving Access to Encoded Primary Texts, ACH-AHRC abstract , accessed: 19 July 2005.
25 C. Kay and M.L. Samuels, ‘Componential Analysis in Semantics: Its Validity and Applications’, Transactions of the Philological Society (1975): 49–81.


indicated by the use of a plus sign between dates. OED labels such as ‘poetic’ or ‘dialectal’ can be downloaded if desired.

03.03. n Armed hostility: geflit OE, garniþ OE, guþ OE, hild OE, niþ OE, orlege OE, orlegniþ OE, sæcc OE, unfriþ OE, unsibb OE, win<(ge)winn OE – c1275, camp
This paragraph is followed by a sequence of semantic sub-categories, then by parallel categories for other parts of speech. Sub-categories read back to the main heading: in those below, ‘armed hostility’ must be supplied after the preposition, giving ‘outbreak of armed hostility’, etc., as the full heading.

03.03. /01. n (.outbreak of):
03.03. /02. n (.declaration of):
03.03. /03. n (.commencement of):

Nineteen major categories follow 03.03, starting with 03.03.01 War, and moving through Battle, Victory, Defeat, Warriors, Weapons, and so on until we finally reach 03.03.19 Peace/absence of war. Degrees of subordination within sub-categories are represented by the number of dots. The example below shows a pathway, reading from the lowest level, defining a person or ship carrying / a flag of truce / as part of a suspension of hostilities / leading to their cessation / and thus peace.

03.03.19. n Peace/absence of war:
03.03.19. /06. n (.cessation of hostilities):
03.03.19. /06.01. n (..suspension of hostilities):
03.03.19. /06.01.03. n (...flag of):
03.03.19. /06.01.03.01. n (....person/ship carrying):
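The way a sub-category heading reads back through the hierarchy to the main heading can be sketched in a few lines. The following is a hypothetical illustration, not HTE's own software: the function names and the regular expression are assumptions that simply follow the printed layout of the five headings above, and the sketch relies on the category numbers rather than the dots to establish the level of each heading.

```python
import re

# Matches the printed layout: main category number, optional '/sub' number,
# part of speech 'n', then the label with its optional bracket and leading dots.
HEADING = re.compile(
    r"^(\d+(?:\.\d+)*)\.(?: /(\d+(?:\.\d+)*)\.)? n \(?\.*(.+?)\)?:$"
)


def parse_heading(line):
    """Split a heading line into (category, sub-category tuple, label)."""
    match = HEADING.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised heading: {line!r}")
    category, sub, label = match.groups()
    return category, tuple(sub.split(".")) if sub else (), label


def pathway(lines, target):
    """Labels from the target heading back up to the main category heading."""
    entries = {}
    for line in lines:
        category, sub, label = parse_heading(line)
        entries[(category, sub)] = label
    category, sub, _ = parse_heading(target)
    # Drop one sub-category level at a time until only the main category is left.
    return [entries[(category, sub[:depth])] for depth in range(len(sub), -1, -1)]


LINES = [
    "03.03.19. n Peace/absence of war:",
    "03.03.19. /06. n (.cessation of hostilities):",
    "03.03.19. /06.01. n (..suspension of hostilities):",
    "03.03.19. /06.01.03. n (...flag of):",
    "03.03.19. /06.01.03.01. n (....person/ship carrying):",
]

path = pathway(LINES, LINES[-1])
print(" / ".join(path))
```

Joining the labels in this order gives the pathway described in the text, from the person or ship carrying the flag up to Peace/absence of war.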

Both the dates and the semantic structure displayed here could be used in creating a probability-based method of disambiguating historical word forms. One could predict that if the word ‘peace’ or a variant occurred in a context where other words from that HTE category also occur, then it is likely to be OED sense I.1.a., ‘Freedom from, or cessation of, war or hostilities; that condition of a nation or community in which it is not at war with another’, or one of its sub-sections, that is involved, rather than the peace rose or a piece of cake. If the approximate date of the target text is known, then only words contemporary with that text


need to be considered. The matching would not be 100 per cent successful, since unknown words and unexpected contexts are bound to occur. Rules would have to be developed for the amount of context and levels of hierarchy required, bearing in mind that the level of semantic delicacy of HTE is much greater than that of most thesauri. Music, for example, has 7,471 meanings under 2,416 category headings, while Animals has 29,883 meanings and 12,818 headings. Formal novelty is, of course, linked to spelling variation. Anyone in the seventeenth century could produce the forms ‘fantastic’, ‘fantastick’, ‘fantastik’, ‘fantasticke’, ‘fantastike’ or ‘fantastique’, not to mention ‘fantastickal’, ‘fantastikal’, ‘fantastical’ or ‘fantastiqual’; if actual occurrences of any of these have escaped the OED net, they could nevertheless be recognized as potential words. Common types of metonymic extension could be incorporated into a search tool, such as the name of a tree being used for its wood or its fruit (e.g. ‘apple’). Thesauri also reveal metaphoric extension; if there is a noticeable overlap in words between an abstract and a concrete category (as in Anger/Heat), then there is often a metaphorical connexion, with the potential for new metaphors to be added to the set. Overall, HTE could be a useful addition to electronic resources for historical text linguistics, including data-mining of older texts – but only if the spelling problem is solved first. As a first step in achieving this, Dawn Archer and I plan to do further work using the historical version of the VARD variant spelling detector to link HTE headwords to variant spellings in Renaissance texts.26

Keywords

An HTE category or subcategory can also contribute to work on keywords, in the sense of words that reveal cultural preoccupations. Although frequency is not marked as such, absence of labels such as ‘rare’ or ‘dialectal’ indicates general currency.
A long date range indicates likely importance over a period of time, while the semantic clustering of many words may indicate a significant concept. New ideas or technology may be represented by a sudden spurt of words at a particular period. Anyone who wonders about the relative importance of ‘war’ and ‘peace’ as talking points in English might reflect on the relative numbers of words for these concepts in HTE, as shown below. Within the War category there are strikingly long lists for military artefacts and personnel – the material goods

26 On VARD, see Archer et al.’s chapter in this volume and also P. Rayson, D. Archer and N. Smith, ‘VARD Versus Word: A Comparison of the UCREL Variant Detector and Modern Spell Checkers on English Historical Corpora’, Proceedings of the Corpus Linguistics Conference Series On-Line E-Journal 1:1 (2005); P. Rayson, D. Archer, A. Baron and N. Smith, ‘Travelling Through Time with Corpus Annotation Software’, Proceedings of Practical Applications in Language and Computers (PALC 2007) Conference, Łódź, Poland, April 2007 (PALC, forthcoming).


Table 5.1 HTE entries for ‘war’ and ‘peace’

            Records    Headings    Words
‘war’       16,785     3,885       12,900
‘peace’     406        101         305

may change, but the concept unfortunately lingers on. Within Peace, there is very little.

Conclusion

Electronic dictionaries and corpora are now familiar resources in humanities computing, useful both in linguistic research and in the many disciplines where searching for words can produce historical or social information or literary insights. We are reaching a point where these tools are moving on, becoming more complex in their computing architecture and more powerful in what they can achieve. It is to be hoped that, before too long, the considerable achievements of NLP in tools for text analysis can be harnessed for the benefit of work on historical and nonstandard language.

Chapter 6

In Search of a Bad Reference Corpus

Mike Scott

Introduction: Snark or Boojum?

They hunted till darkness came on, but they found
Not a button, or feather, or mark,
By which they could tell that they stood on the ground
Where the Baker had met with the Snark.

In the midst of the word he was trying to say,
In the midst of his laughter and glee,
He had softly and suddenly vanished away –
For the Snark was a Boojum, you see.

(Last two stanzas of The Hunting of the Snark by Lewis Carroll)

We are here hunting a Snark. For those who have not read the poem (Lewis Carroll 1876), a Snark is a mysterious creature sought by some people aboard a ship. It is never very clear what a Snark is, or a Boojum, or why a Snark might turn out to be a Boojum. The ship is captained by the Bellman.

He had bought a large map representing the sea,
Without the least vestige of land:
And the crew were much pleased when they found it to be
A map they could all understand.

“What’s the good of Mercator’s North Poles and Equators,
Tropics, Zones, and Meridian Lines?”
So the Bellman would cry: and the crew would reply
“They are merely conventional signs!”

Likewise, here it is not clear what a really bad reference corpus is and what may happen if we should meet up with one. We are dealing with conventional signs, too, and our principles of navigation are also somewhat unclear.

There may be a variety of methods for identifying keywords in texts – methods which rely on word frequency alone, excluding function words through a stop list, or on human identification, ones which access a previously-identified semantic word-bank, or ones which rely on a combination of these. The procedure for identifying keywords under discussion here, however, is that devised for use in WordSmith Tools, which essentially compares a wordlist based on the text in question and a wordlist based on a reference corpus. The idea is quite simple: by comparing the frequency of each item in turn with a known reference, one may identify those items which occur with unusual frequency. This is done without any attempt to identify or match up the semantics or pragmatics, and is based on a simple verbatim comparison, without even necessarily grouping lemma variants together. It is important to stress that no claim should be made that a set of keywords thus identified (1) will match a set of human-generated keywords, or (2) is significant as a set even if each individual comparison reaches statistical significance. The main utility of the procedure has instead been in identifying items which are likely to be of linguistic interest in terms of the text’s aboutness and structuring, and which can be expected to repay further study, e.g. through concordancing to investigate collocation, etc.

The method will clearly achieve results which are largely dependent on the qualities of the reference corpus itself. An analogy will help to make this clear. Suppose one wishes to identify and evaluate the qualities of a given car. If it is compared with all existing cars of the same category, such as family sedans, comparative features such as price, safety, speed and comfort will be used. The whole set of family sedan cars made by the world’s auto manufacturers will probably be used as the ‘reference corpus’. The mere facts that the motor is made of an alloy, or that the tyres are made of rubber, or that the engine burns fuel are not relevant to the comparison, if all such cars burn fuel, have alloy engines and rubber tyres. The amount of fuel consumed would come into the comparison, but not the fact that fuel is burned. But it would instead be possible to compare the same family saloon car to all means of transport available for a given purpose, e.g. getting to Barcelona for a family holiday. In that case, the comparison will involve different criteria, such as convenience, expense, opportunity to take and bring back luggage, impact on the environment, etc. The reference corpus is now the set of {car, train, ferry, taxi, plane, etc.}. For different research purposes, different reference comparisons are needed.

In general, then, three claims can be made: (1) that the choice of reference corpus will affect the results, (2) that features (such as rubber tyres) which are similar in the reference corpus and the node text itself will not surface in the comparison, and (3) that only features where there is significant departure from the reference corpus norm will become prominent for inspection.

M. Scott, WordSmith Tools (Oxford: Oxford University Press, 1996; with numerous subsequent versions).
See M. Scott and C. Tribble, Textual Patterns: Keyword and Corpus Analysis in Language Education (Amsterdam: Benjamins, 2006), chs 4 and 5.
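The comparison described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the WordSmith Tools implementation: the function names `word_list` and `relative_frequencies`, the simple tokenizing pattern and the two toy sentences are all assumptions made for the example. It builds two verbatim word-lists (with no grouping of lemma variants) and sets each node-text word's rate per thousand tokens beside its rate in the reference list.

```python
import re
from collections import Counter


def word_list(text):
    """Build a verbatim word-list: lower-cased tokens, no lemma grouping."""
    return Counter(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))


def relative_frequencies(node, reference):
    """Map each node-text word to (rate in node, rate in reference), per 1,000 tokens."""
    node_total = sum(node.values())
    ref_total = sum(reference.values())
    rows = {}
    for word, freq in node.items():
        node_rate = 1000 * freq / node_total
        ref_rate = 1000 * reference.get(word, 0) / ref_total
        rows[word] = (node_rate, ref_rate)
    return rows


# Toy data, invented for illustration only.
node = word_list("The cramp eases when the tablets work. The cramp is bad.")
reference = word_list("The dog sat on the mat. The cat saw the dog and the cat ran.")
rates = relative_frequencies(node, reference)

print(rates["the"])    # similar in both lists, so unlikely to surface
print(rates["cramp"])  # absent from the reference, so a candidate keyword
```

In terms of the car analogy, ‘the’ plays the part of the rubber tyres: it is common to both lists and so would not surface in the comparison, whereas ‘cramp’ departs sharply from the reference norm.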


The question then raises its head: how much difference, in the case of words and text, does it make if a somewhat imperfect reference corpus is used? In reality, it might be hard to obtain a large, perfectly-matched reference for some comparisons. For example, the BNC might be considered a good reference for texts in English, but despite its numerous positive features and the enormous effort that went into constructing it, it is still only based on 100 million words – any search of the Internet suggests that the amount of text in English online far exceeds this – and on a sampling procedure which gives about ten times as much weight to written as to spoken English. In the case of the analysis of Sumerian or other dead languages, there is only one body of texts to be used. Suppose one wished to compare a text in Sumerian which belonged to the poetry genre: one might find that a reference corpus of Sumerian poetry was extremely slim indeed and question whether the results would be useful.

Size is not the only condition. Another criterion is date. If one were comparing a corpus of seventeenth-century sermons in English, would it be acceptable to use the 1990s BNC as a reference corpus? Or would that constitute a bad reference corpus? Hence the title of this chapter. Furthermore, what the texts in a reference corpus are about is, presumably, critical, unless we use so many texts that what they are each about is drowned in the whole. A text about scoring goals in a football match will resonate with lots of others in newspaper texts, but one about the culture and customs of a small tribe may not. A further consideration is whether one is comparing a single node text with a reference, or a whole set of texts (e.g. comprising a sub-genre) with the reference corpus. For most of this chapter we shall be considering only the comparison of one text at a time with a reference corpus.
Berber Sardinha’s formula

Berber Sardinha discusses the tendency of a reference corpus which is similar to the node text to ‘filter out’ genre features common to both, and thereby suggests that a reference corpus which contains several different genres of text would be the non-marked choice. In general, he claims that ‘critical reference corpus sizes are 2, 3 and 5 times that of the node text’ and presents a formula for calculating the number of keywords likely to be obtained when comparing two corpora.

KWs = 249.837059 – 0.00002734 * ref corpus tokens + 0.00886787 * ref corpus types + 0.00137131 * node corpus tokens

Figure 6.1 Berber Sardinha’s (2004, 102) formula

A.P. Berber Sardinha, Lingüística de Corpus (Barueri, São Paulo, Brazil: Editora Manole, 2004).
Ibid., p. 102.
Ibid.
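As a worked instance of the formula in Figure 6.1, the following Python sketch simply evaluates the regression for one set of corpus sizes. The function name `predicted_keywords` and its parameter names are assumptions introduced here; the coefficients are those quoted in the formula.

```python
def predicted_keywords(ref_tokens, ref_types, node_tokens):
    """Evaluate Berber Sardinha's regression estimate of the number of keywords."""
    return (249.837059
            - 0.00002734 * ref_tokens
            + 0.00886787 * ref_types
            + 0.00137131 * node_tokens)


# A 100-million-token reference corpus with 400,000 types and a 5,000-token node text.
estimate = predicted_keywords(ref_tokens=100_000_000,
                              ref_types=400_000,
                              node_tokens=5_000)
print(round(estimate))  # 1070
```

Note that the coefficient on reference corpus tokens is negative while that on reference corpus types is positive, so the estimate depends on the balance between the two rather than on size alone.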


Thus, if we have a reference corpus of 100 million tokens and 400,000 types, and a node corpus of 5,000 tokens, we should get 1,070 keywords. The formula works by computing a regression line. It works best with relatively small corpora, up to about five million running words. The formula may be useful for predicting the number of keywords which can be expected to be found using a given reference corpus, but it will not tell us what sorts of keywords are likely to be generated, which is why the exploratory study described below was carried out.

Study

The present investigation was thus designed to study the influence on the keywords generated of varying the number and kinds of texts comprising the reference corpus (RC), while using one and the same routine and settings. The study was carried out in three stages. The research questions were:

1. What distribution is obtained if a set of keyword calculations is made using RCs of different sizes, using randomly selected BNC texts regardless of genre? For example, as the size of the RC increases does the quality of the keyword results increase? If so, is there any noticeable threshold below which the quality is unacceptable?
2. What sort of keyword results obtain if a deliberately strange RC is used, one which has little or no relation to the text in question apart from being based on the same language? Is the quality of results (un)acceptable?
3. What quality of keyword results obtains if genre is included as a variable, so that BNC texts are compared by genre with the source texts?

Materials and methods

The software for all three research questions was WordSmith Tools version 4.0. Two source texts were used as a sample for comparison with the various RCs. These were BNC text A6L, a book profile of leaders of commerce, about 46,000 words in length, and text KNG, a spoken text of only 615 words, between a doctor and a patient. Fragments of these are reproduced without BNC tags in Appendix 1. For research question 1 above, RC texts were chosen from the 4,054 BNC texts (spoken plus written) using a randomizing function, so that 22 different sizes of RC were selected, comprising the numbers of texts shown in Table 6.1, without

A.P. Berber Sardinha, personal communication.
M. Scott, WordSmith Tools, Version 4 (Oxford: Oxford University Press, 2004 [1996]).

Table 6.1 Numbers of texts in each RC

Total number of texts in each RC
10      50      250      2,000
15      60      300      2,500
20      75      400      3,000
25      100     500      4,000
30      150     1,000
40      200     1,500
(total number of RCs = 22)

consideration of genre (i.e. mixed genres). The two source texts mentioned above were then compared with these different RCs. The KeyWords tool settings were as follows: minimum frequency = 3; maximum keywords = 5,000; negative keywords to be excluded; p value = 0.000001; procedure = log likelihood. Keywords were computed for the two source texts using each of the 22 reference corpora. Keywords and frequency information for each were saved as text (see Figure 6.2). These were then imported into the WordList tool (see Figure 6.3). In this way, it was possible to treat each set of keywords as a word-list and examine which of the items were found in the different sets based on the 22 RCs, using WordList’s detailed consistency procedure (see Figure 6.4). This detailed consistency procedure allows one to sort the keywords in a number of ways (in Figure 6.4 they are sorted alphabetically), for example, according to whether they are found to be key in the various RCs. Item frequencies are constant where greater than zero in Figure 6.4, e.g. ‘absolutely’ was found to be key, and had a source text frequency of 15, in all comparisons except that with the smallest RC, where it was not picked up as key. Overall, there were 267 keyword types in the 22 lists from text A6L, and 18 keyword types from text KNG.

These results were exported into MS Word™ so that the numbers could be simplified, in such a way that zeroes were replaced with space and numbers standardized as ones, and these results brought into MS Excel™. Figure 6.5 shows a fragment of the Excel data: for the smallest RC (based on ten texts only), 179 keywords were identified in relation to text A6L, but 156 when using the 15-text RC, rising to 184 with the 30-text RC.

T. Dunning, ‘Accurate Methods for the Statistics of Surprise and Coincidence’, Computational Linguistics, 19:1 (1993): 61–74.
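To make the KeyWords settings listed above concrete, here is a minimal Python sketch of a log likelihood keyword test. It is an assumption that the four-cell form of Dunning's statistic shown here corresponds to what the KeyWords tool computes; the function names, the toy frequency lists and the use of a chi-square distribution with one degree of freedom to obtain the p value are likewise illustrative choices, not details taken from WordSmith Tools.

```python
import math
from collections import Counter


def log_likelihood(freq_node, freq_ref, size_node, size_ref):
    """G2 statistic for one word, from a 2x2 table of word/non-word by node/reference."""
    observed = [
        [freq_node, size_node - freq_node],
        [freq_ref, size_ref - freq_ref],
    ]
    total = size_node + size_ref
    row_totals = [size_node, size_ref]
    col_totals = [freq_node + freq_ref, total - freq_node - freq_ref]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if observed[i][j] > 0:  # a zero cell contributes nothing
                g2 += observed[i][j] * math.log(observed[i][j] / expected)
    return 2 * g2


def p_value(g2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(g2, 0.0) / 2))


def keywords(node, reference, min_freq=3, max_p=0.000001):
    """Positive keywords only, sorted by descending log likelihood."""
    size_node, size_ref = sum(node.values()), sum(reference.values())
    found = []
    for word, freq in node.items():
        if freq < min_freq:
            continue
        ref_freq = reference.get(word, 0)
        if freq / size_node <= ref_freq / size_ref:
            continue  # negative (or neutral) keywords are excluded
        g2 = log_likelihood(freq, ref_freq, size_node, size_ref)
        if p_value(g2) <= max_p:
            found.append((word, g2))
    return sorted(found, key=lambda item: item[1], reverse=True)


# Toy word-lists, invented for illustration only.
node = Counter({"cramp": 10, "the": 40, "tablets": 2, "other": 548})
reference = Counter({"the": 6000, "other": 94000})
result = keywords(node, reference)
print(result)
```

With these invented counts only ‘cramp’ passes: ‘tablets’ falls below the minimum frequency, ‘other’ is under-represented and so would be a negative keyword, and ‘the’ is only marginally more frequent than in the reference.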

Figure 6.2 Saving keywords as text

Figure 6.3 Importing a word-list from plain text

Figure 6.4 Detailed consistency view of the twenty-two keyword sets

Two further variables were computed in Excel: popularity and precision. Popularity was defined (as can be seen in Figure 6.5) as the presence of each keyword in at least 20 of the 22 sets. Thus of the 179 keywords found using the smallest RC, 99 were common to at least 20 of the 22 sets. The rationale was that the keywords identified using most of the RC sets are more likely to be useful than those identified in a minority; this is the only indicator of quality used in the study. (Other indicators of keyword quality might include informant testing with a variety of informants ranging from the naïve – ‘12 good men and true’ – to the linguistically sophisticated, or for example the original authors.) Precision was computed following Oakes, ‘the proportion of retrieved items that are in fact relevant (the number of relevant items obtained divided by the total number of retrieved items)’. In this case the calculation involved dividing the number of popular keywords (99 for the smallest RC) by the total number of keywords (179), which gives a precision value of 55 per cent for the smallest RC. This study uses comparative precision as its main measure.

Results

Research question 1: What distribution is obtained if a set of keyword calculations is made using RCs of different sizes, using randomly selected BNC texts regardless of genre? For example, as the size of the RC increases does the quality of the keyword results increase? If so, is there any noticeable threshold below which the quality is unacceptable?

M. Oakes, Statistics for Corpus Linguistics (Edinburgh: Edinburgh University Press, 1998), p. 176.
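The popularity and precision measures on which the results rely can be sketched as follows. This is a hedged illustration in Python rather than the Excel procedure actually used: the function names, the three hypothetical RC labels and their keyword sets are all invented, and the popularity threshold is scaled down from ‘20 of 22’ to ‘2 of 3’ to keep the example small.

```python
from collections import Counter


def popular_keywords(keyword_sets, min_sets):
    """Keywords that occur in at least min_sets of the keyword sets."""
    counts = Counter(word for words in keyword_sets.values() for word in set(words))
    return {word for word, n in counts.items() if n >= min_sets}


def precision(keywords, popular):
    """Proportion of one RC's keywords that are also popular keywords."""
    keywords = set(keywords)
    if not keywords:
        return 0.0
    return len(keywords & popular) / len(keywords)


# Hypothetical keyword sets obtained with three different RCs.
keyword_sets = {
    "rc_a": {"company", "chairman", "oil", "very"},
    "rc_b": {"company", "chairman", "oil"},
    "rc_c": {"company", "chairman", "shares"},
}
popular = popular_keywords(keyword_sets, min_sets=2)
scores = {name: precision(words, popular) for name, words in keyword_sets.items()}
print(sorted(popular))
print(scores)

# The figures reported for the smallest RC: 99 popular keywords out of 179.
print(round(100 * 99 / 179))  # 55
```

Because popularity is defined by agreement among the sets themselves, precision here measures consistency across RCs rather than agreement with any external standard.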


A first set of results for this research question, based on the two texts, is shown in Figures 6.6 and 6.7. Figure 6.6 shows increasing precision values as the size of the RC increases. The 22 different RC sizes are on the x axis and exact RC sizes for each can be found in Table 6.1. The plot suggests that, after a rocky start, where precision is fairly inconsistent but still high at over 50 per cent, the precision gently increases to a maximum value at or near the biggest RC. The text in question is a lengthy (44,000 words) section of a book profiling well-known business leaders.

For the very short (615-word) doctor-patient interview in Figure 6.7 we get rather different results. There were only 18 keywords identified over the 22 lists. Again the precision values are high, all over 75 per cent, and these are clearly higher values than with the much longer text, which generated many more keywords. Here we do not get increasing precision values as the size of the RC increases; instead, there seems to be a ceiling effect with fairly small RCs based on 50 or 60 texts.

It seems that an appropriate answer to the first research question is that there is no clear and obvious threshold below which poor keyword results can be expected. Precision values when using a mixed bag of RC texts, even if the set is small, are high; there is no obvious cut-off point; very much the same keywords are generated whatever the RC used. We have not yet found a really bad reference corpus.

Research question 2: What sort of keyword results obtain if a deliberately strange RC is used, one which has little or no relation to the text in question apart from being based on the same language? Is the quality of results (un)acceptable?

For this research question, an RC was used that was based on all of Shakespeare’s plays. The genre is drama, the period late sixteenth and early seventeenth century. Will this absurd RC give rise to usefully poor results?

Figure 6.5 Excel spreadsheet of results

Figure 6.6 Precision values for text A6L

Figure 6.7 Precision values for text KNG

Using the leaders of commerce text from the 1990s with the Shakespeare RC, there were 606 keywords altogether. Although the source text is lengthy at 44,000 words, 606 distinct keyword types seems a large number. With the BNC RC, there were just over a quarter of that number, 161 keywords. 143 of these were common to both sets. If we assume that the BNC RC is the better RC, at first sight it might seem that using an inappropriate RC may generate a lot of unwanted keywords.


Table 6.2 Keywords identified using only Shakespeare RC

‘objective(s)’, ‘obviously’, ‘of’, ‘offered’, ‘oil’, ‘on’, ‘one’, ‘only’, ‘opened’, ‘operate’, ‘operations’, ‘opportunity/ies’, ‘option’, ‘organisations’, ‘original’, ‘other’, ‘outside’, ‘over’, ‘overseas’, ‘owned’

A few keywords picked up by the BNC RC were not in the Shakespeare-generated set: common pronouns or conjunctions (‘I’, ‘we’, ‘them’, ‘you’, ‘when’), high frequency verbs and nouns (‘finds’, ‘go’, ‘have’, ‘make’, ‘own’, ‘sir’, ‘take’, ‘taught’, ‘thing’), a couple of numerical and time-related words (‘thousand’, ‘never’) and two which were clearly unknown to Shakespeare (‘Sikorsky’, ‘jojoba’). A further 463 keywords were generated only when using the Shakespeare RC. Those beginning with ‘o’ are presented as a sample in Table 6.2.

Table 6.3 Doctor-patient keywords with two RCs

Shakespeare RC      BNC RC
YES                 YES
THAT’S              THAT’S
RIGHT               DOCTOR
DOCTOR              RIGHT
OH                  CRAMP
CAN’T               TEASPOONFUL
CRAMP               QUININE
JUST                I’LL
AHA                 TABLETS
ER                  NO
MR                  HEARD
QUININE             EASES
TABLETS             YOU
TEASPOONFUL         AHA
I’M
EIGHTY
REALLY
OPERATION
GETTING
INFIRMARY
OCTOBER
PHONE
GET
EASES
ANY
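The overlap between two keyword sets, as discussed for both source texts, amounts to simple set arithmetic. The Python sketch below is illustrative: the function name `compare_keyword_sets` is an assumption, the two word sets are transcribed from Table 6.3 (with plain apostrophes), and the counts for the commerce text are those reported in the text.

```python
def compare_keyword_sets(first, second):
    """Return (shared, only_in_first, only_in_second) for two keyword sets."""
    first, second = set(first), set(second)
    return first & second, first - second, second - first


# Doctor-patient keywords, transcribed from Table 6.3.
shakespeare_rc = {
    "YES", "THAT'S", "RIGHT", "DOCTOR", "OH", "CAN'T", "CRAMP", "JUST", "AHA",
    "ER", "MR", "QUININE", "TABLETS", "TEASPOONFUL", "I'M", "EIGHTY", "REALLY",
    "OPERATION", "GETTING", "INFIRMARY", "OCTOBER", "PHONE", "GET", "EASES", "ANY",
}
bnc_rc = {
    "YES", "THAT'S", "DOCTOR", "RIGHT", "CRAMP", "TEASPOONFUL", "QUININE",
    "I'LL", "TABLETS", "NO", "HEARD", "EASES", "YOU", "AHA",
}
shared, only_shakespeare, only_bnc = compare_keyword_sets(shakespeare_rc, bnc_rc)
print(len(shared), len(only_shakespeare), len(only_bnc))  # 10 15 4
print(sorted(only_bnc))  # ['HEARD', "I'LL", 'NO', 'YOU']

# Commerce text (A6L): 606 Shakespeare-RC keywords, 161 BNC-RC keywords, 143 shared.
only_shakespeare_a6l = 606 - 143
only_bnc_a6l = 161 - 143
print(only_shakespeare_a6l, only_bnc_a6l)  # 463 18
```

The same function could be applied to any pair of the keyword lists saved from the KeyWords tool, once each has been read into a set.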

Table 6.4 Genred RCs

    Text genre                                        Number of texts
1.  commerce texts (A6L is a business text)           112
2.  academic medicine texts                           24
3.  prose fiction texts                               432
4.  non-academic Humanities texts                     111
5.  non-academic politics, law, education texts       93
6.  spoken broadcast discussions                      53
7.  conversations                                     153
8.  oral history interviews                           119
9.  spoken meetings                                   132

Table 6.2 suggests that although many more keywords were picked up using our deliberately inappropriate RC, the keywords themselves are not absurd. Most of these words beginning with ‘o’ have to do with business operations and are indicative of the aboutness [10] of the text.

In the case of the doctor-patient consultation, the numbers are much more manageable. Again we get more keywords when using Shakespeare as the RC. It is, however, difficult to claim that, as a set, those on the left side are in any way worse than those on the right. We are using a necessarily subjective criterion, but research question 2 (is the quality of results unacceptable?) can now be answered provisionally: no, we still have no really bad RC.

Research question 3: What quality of keyword results obtains if genre is included as a variable, so that BNC texts are compared by genre with the source texts?

For this part of the study, sets of keywords based on text A6L were again computed using the BNC as the source for the RCs. A series of nine RCs were constructed, using the sub-sections of the BNC itself shown in Table 6.4 [11], and so were no longer mixed genres. The classification used David Lee’s categories embedded in the BNC World Edition headers. Procedures for computing popularity and precision were as in the first part of the study, except that the criterion for popularity was now based on presence in eight or nine of the keyword sets. Figure 6.8 shows precision figures over the nine RCs. The horizontal numbers correspond to the numbers in the list in Table 6.4, so that number seven, where there seems to be a big dip, represents 153 conversations. These conversations generated a lot of keywords but a low precision as measured by agreement between these various RCs. Precision values here, at between 20 per cent and 70 per cent, are noticeably lower than in the similar graphs above. The line rises and falls much more steeply.

To interpret this finding, let us remind ourselves that the measure of precision here may be much less appropriate than it was earlier. In the first part, there was a fairly straightforward increase from RC1 to RC22, along a single dimension – the number of BNC texts in each RC. Here, on the other hand, we have several variables in operation at once, the number of texts varying quite unsystematically, and the type of genre also in no particular order except that the spoken ones are the last four. To expect there to be agreement between these RCs is to assume that they are alike in some way – we could reasonably assume this in the first part, but not here. There are two main ways in which we may assume they are not alike: (1) because they come from different media and genres, and also (2) because they are about different topics. It is therefore possible that any of the sets generated might be useful – or useless – for a given purpose. Research question 3 must therefore remain unanswered for the time being. We do not from this know what the quality of the keywords is, though it does seem that the keywords generated do differ if genre-different RCs are used, much as the keywords differed when the genre Elizabethan drama was used, in comparison with the mixed bag BNC RC. If this is so, different aspects of the source text’s aboutness are being picked up.

Figure 6.8 Precision values for text A6L with genred BNC RCs

10 M. Phillips, ‘Lexical Structure of Text’, Discourse Analysis Monographs 12 (Birmingham: University of Birmingham, 1989).
11 No claim is made here that the BNC texts are themselves complete; this is recognized by Aston and Burnard: G. Aston and L. Burnard, The BNC Handbook (Edinburgh: Edinburgh University Press, 1998), p. 28.


Conclusions

These three mini-studies have important limitations. The texts in the study are all incomplete extracts from larger texts apart from the doctor-patient interview and the Shakespeare plays. This is probably not a major limitation in itself, since there is no reason to suppose that the keyword method depends exclusively on the identification of clear text boundaries; indeed, it is likely that the method can be used successfully with segments of texts, and it certainly has been used to compare groups of texts with an RC. We have only examined a couple of texts in comparison with our 32 different RCs. Keywords have been found by comparing one text at a time with an RC, not by comparing sub-corpora with larger RC corpora. No other method has been used to evaluate the quality of keyword sets apart from agreement between the different RCs and subjective appreciation. Informant studies could also be carried out.

In conclusion, the first part suggested that, using a mixed bag RC, the larger the RC the better – but not in the case of the small doctor-patient consultation, where a moderate sized RC may suffice. This suggests that the keyword procedure is fairly robust. The second part suggests that keywords identified even by an obviously absurd RC can be plausible indicators of aboutness, which reinforces the conclusion that keyword analysis is robust. The third part suggested that genre-specific RCs identify rather different keywords, which itself led to the conclusion that the aboutness of a text may not be one thing but numerous different ones. The Snark is still out there. Somewhere.


Chapter 7

Keywords and Moral Panics: Mary Whitehouse and Media Censorship

Tony McEnery

Introduction

In this chapter I will use the analytical framework established in McEnery in order to investigate the moral panic encoded in the writings produced by Mary Whitehouse in the 1960s and 1970s in Britain. In so doing, I will use keywords as a way of focussing on the aboutness of the moral panic. Through a study of patterns of colligation and collocation I will explore the moral panic in the Mary Whitehouse Corpus (MWC). To begin with, however, let me briefly review both (1) the corpora used in this chapter, and (2) moral panics and the use of keywords to explore them.

The corpora used in this chapter

The Mary Whitehouse Corpus (MWC)

The MWC includes the major writings of Mary Whitehouse in the period 1967–77. This corpus covers three of her books, namely Cleaning-up TV, Who Does She Think She Is? and Whatever Happened to Sex?, amounting to 216,289 words in total. These books, which had a wide circulation, were the principal public output in this period from the organization that Whitehouse headed – a pressure group called the National Viewers’ and Listeners’ Association (VALA). I regard them as a good focus for a study of how VALA tried to excite a moral panic in the general population of Britain focused on the relationship between immorality, violence and the media.

A.M. McEnery, ‘The Moral Panic about Bad Language in England, 1691–1745’, Journal of Historical Pragmatics, 7/1 (2006): 89–113.
I am indebted to the Faculty of Social Sciences, Lancaster University, for a small grant which enabled me to construct this corpus. I would also like to thank Dan McIntyre, who undertook the bulk of the corpus construction work under my supervision.


Table 7.1 Text categories in the Brown corpus

Code    Text category                                   Number of samples    Proportion (%)
A       Press reportage                                 44                   8.8
B       Press editorials                                27                   5.4
C       Press reviews                                   17                   3.4
D       Religion                                        17                   3.4
E       Skills, trades and hobbies                      38                   7.6
F       Popular lore                                    44                   8.8
G       Biographies and essays                          77                   15.4
H       Miscellaneous (reports, official documents)     30                   6
J       Science (academic prose)                        80                   16
K       General fiction                                 29                   5.8
L       Mystery and detective fiction                   24                   4.8
M       Science fiction                                 6                    1.2
N       Western and adventure fiction                   29                   5.8
P       Romantic fiction                                29                   5.8
R       Humour                                          9                    1.8
Total                                                   500                  100

The Lancaster/Oslo-Bergen (LOB) and Freiburg-Lancaster/Oslo-Bergen (FLOB) Corpora

Both the LOB and FLOB corpora are related to an earlier corpus, the Brown University Standard Corpus of Present-day American English (i.e. the Brown Corpus). The Brown Corpus was compiled using 500 chunks of approximately 2,000 words of written text. These texts were sampled from 15 categories, and all were produced in 1961. The components of the Brown Corpus are given in Table 7.1. LOB and FLOB follow the Brown model. The Lancaster/Oslo-Bergen Corpus of British English (LOB) is a British match for the Brown Corpus. The corpus was created using exactly the same sampling frame with the exception that LOB aims to represent written British English used in 1961. The Freiburg-LOB Corpus of British English (FLOB) represents written British English as used in 1991 using the Brown sampling frame once again. LOB and FLOB, as well as being corpora which allow one to study recent change in British English, may also be used, as they are in this paper, to stand as a proxy for general published written British English in the early 1960s and early 1990s respectively.

See H. Kučera and W.N. Francis, Computational Analysis of Present-Day American English (Providence: Brown University Press, 1967).
See S. Johansson, G. Leech and H. Goodluck, Manual of Information to Accompany the Lancaster/Oslo-Bergen Corpus of British English for Use with Digital Computers, Technical Report (Bergen: Norwegian Computing Centre for the Humanities, 1978).

Moral panic theory

The sociologist Stanley Cohen developed moral panic theory in the late 1960s in order to account for episodes where the media and society at large focus on a particular problem and generate an alarmist debate around it that, in turn, leads to action being taken to resolve the perceived problem. The response to the problem is typically disproportionate to the threat posed. Cohen introduces the idea of a moral panic by saying that:

Societies appear to be prone, every now and then, to periods of moral panic. A condition, episode, person or group of persons emerges to become defined as a threat to societal values and interests; its nature is presented in a stylized and stereotypical fashion by the mass media; the moral barricades are manned by editors, bishops, politicians and other right-thinking people; socially accredited experts pronounce their diagnoses and solutions

McEnery presents a lexically driven model of moral panic theory in which keywords arising from a moral panic can be allocated to one of a number of discourse roles within the moral panic. The roles are as follows:

‘Object of offence’ – that which is identified as problematic;
‘Scapegoat’ – that which is the cause of or which propagates the cause of offence;
‘Moral entrepreneur’ – the person/group campaigning against the object of offence;
‘Consequence’ – the negative results which it is claimed will follow from a failure to eliminate the object of offence;
‘Corrective action’ – the actions to be taken to eliminate the object of offence;
‘Desired outcome’ – the positive results which will follow from the elimination of the object of offence.

See M. Hundt, A. Sand and R. Siemund, Manual of Information to Accompany the Freiburg-LOB Corpus of British English (FLOB) (Freiburg: Freiburg University, 1998). Available online at .
S. Cohen, Folk Devils and Moral Panics, 3rd edn (Routledge, 2002), p. 1.
A.M. McEnery, Swearing in English: Bad Language, Purity and Power from 1586 to the Present (Routledge, 2005).


Table 7.2 Keywords of the MWC when compared with the LOB corpus

Positive keywords: bbc, sex, television, broadcasting, sexual, programmes, programme, pornography, children, public, violence, tv, whitehouse, people, our, viewers, censorship, we, society, greene, campaign, film, intercourse, abortion, listeners, denmark, governors, freedom, education, women, ita, who, permissive, radio, danish, obscene, manifesto, moral, director-general, responsibility, standards, corporation, humanist, child, obscenity, vala, debate, clean-up, pornographic, hugh, what, birmingham, rape, films, legal, parents, media, report, normanbrook, responsible, to, masturbation, my, morality, association, advisory, mrs, fpa, screen, laws, i, that, us, press, rang, crime, young, religious, postmaster-general, which, very, school, sexuality, contraception, concern, lobby, me, shewn, trevelyan, book, daily, itv, decency, dr, its, meeting, liberation, corrupt, viewer, homosexual, phone, porn, hoggart, calder, fox, law, parliament, sixties, support, interview, copenhagen, jury, human, letter, homosexuals, abortions, christian, audience, drama, wrote, of, relationships, girls, cosmo, publication, sexually, speak, opinion, prosecution, believe, homosexuality, kenneth, broadcast, about, reaction, invited, charter, adults, licence, series, listener, family, exploitation, medium, producer, compassion, dpp, buckland, anti-censorship, four-letter, creative

Negative keywords: company, car, s, de, french, eyes, percent, two, looked, n’t, water, ll, him, you, pound, his, her, she, he

‘Moral panic rhetoric’ – lexis which is typically used to amplify the construction of members of any of the other categories in the model, e.g. negatively loaded modifiers such as ‘filthy’, ‘revolting’, ‘brutal’, ‘irresponsible’, ‘weak’ and ‘degradation’ being used to amplify the objects of offence
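The roles listed above lend themselves to a simple lookup structure. The following Python sketch is illustrative only and is not McEnery's own procedure: role membership is decided by the analyst, and the names `Role`, `ALLOCATIONS`, `populate` and `fit` are invented here. The sample allocations follow a handful of entries from Table 7.4, including one keyword ('violence') that is cross-posted to two roles.

```python
from enum import Enum


class Role(Enum):
    OBJECT_OF_OFFENCE = "Object of offence"
    SCAPEGOAT = "Scapegoat"
    MORAL_ENTREPRENEUR = "Moral entrepreneur"
    CONSEQUENCE = "Consequence"
    CORRECTIVE_ACTION = "Corrective action"
    DESIRED_OUTCOME = "Desired outcome"
    RHETORIC = "Moral panic rhetoric"


# Analyst-made allocations (a small sample taken from Table 7.4).
# A keyword may be cross-posted to more than one role.
ALLOCATIONS = {
    "whitehouse": {Role.MORAL_ENTREPRENEUR},
    "bbc": {Role.SCAPEGOAT},
    "pornography": {Role.OBJECT_OF_OFFENCE},
    "censorship": {Role.CORRECTIVE_ACTION},
    "decency": {Role.DESIRED_OUTCOME},
    "violence": {Role.CONSEQUENCE, Role.OBJECT_OF_OFFENCE},
    "our": {Role.RHETORIC},
}


def populate(keywords, allocations):
    """Place each keyword under every role it has been allocated to.

    Returns (model, unclassified): a dict mapping each role to its keywords,
    and the list of keywords that received no role.
    """
    model = {role: [] for role in Role}
    unclassified = []
    for word in keywords:
        roles = allocations.get(word, set())
        if not roles:
            unclassified.append(word)
        for role in roles:
            model[role].append(word)
    return model, unclassified


def fit(keywords, allocations):
    """Share of keywords that receive at least one role (0.0 to 1.0)."""
    keywords = list(keywords)
    if not keywords:
        return 0.0
    placed = sum(1 for word in keywords if allocations.get(word))
    return placed / len(keywords)
```

`fit` gives a rough figure for how much of a keyword list the model absorbs; how large that share must be before one speaks of a moral panic is a judgement the sketch does not make.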

McEnery’s model is both diagnostic and analytical. If keywords arising from a contrast of two corpora fit the model wholly or in large part, this indicates a corpus which contains a moral panic. Simultaneously, populating the model is used as the first step in the analysis of that moral panic. McEnery’s model is used both diagnostically and analytically in this chapter.

Keywords and moral panic

A comparison of the MWC and the LOB corpus produces a large set of positive keywords. Table 7.2 outlines the keywords derived from comparing LOB and the MWC. Before considering the keywords derived from a comparison of the MWC and LOB, I would like to consider the comparability of the MWC and LOB. I do not believe that the differences that I am looking for between the LOB corpus and the MWC are such that the differences in the sampling frame between the two should

Keywords and Moral Panics


Table 7.3 Keywords in the MWC derived from a comparison with FLOB

Positive keywords: bbc, sex, television, broadcasting, pornography, sexual, programmes, our, programme, we, tv, children, violence, whitehouse, public, people, film, viewers, censorship, greene, society, obscene, obscenity, campaign, denmark, freedom, i, governors, films, danish, intercourse, pornographic, my, who, corporation, ita, director-general, that, hugh, which, listeners, me, clean-up, school, abortion, humanist, permissive, vala, meeting, morality, child, young, mrs, of, birmingham, moral, very, education, association, manifesto, this, responsibility, parents, normanbrook, standards, masturbation, postmaster-general, dr, rape, fpa, advisory, contraception, daily, to, book, what, press, debate, letter, decency, speak, responsible, one, girls, contraceptive, laws, not, shewn, upon, screen, hoggart, porn, copenhagen, would, professor, lord, council, legal, members, country, opinion, the, concerned, fox, doubt, ray, report, broadcast, cosmo, lobby, muggeridge, trevelyan, crime, calder, given, viewer, law, pilkington, us, swedish, attitude, husband, protest, many, parliament, rang, girl, values, contraceptives, medium, express, reply, williams, sir, propaganda, national, radio, homosexuality, showing, morals, buckland, anti-censorship, dpp, human, whole

Negative keywords: market, local, water, ll, looked, uk, company, bullet, percent, him, you, n’t, s, his, he, her, she

matter much. In order to explore this, I compared the MWC with the FLOB corpus (see Table 7.3). If the list of keywords is relatively stable across the comparisons, then my hypothesis has some weight. If the list is radically different, then my hypothesis is in serious doubt. Table 7.3 shows the keyword list derived from the comparison of the FLOB and MWC corpora. The similarity between the keywords in Tables 7.2 and 7.3 is quite remarkable and certainly adds weight to my hypothesis that, in this case, the mismatch in the sampling frame is largely irrelevant – or at least is as relevant to a corpus (LOB) sampled some seven years before the first text of the MWC as to a corpus (FLOB) sampled 11 years after the last MWC text was written. For example, the LOB/MWC comparison yields 151 positive keywords, 109 of which are shared with the FLOB/MWC comparison, and 42 of which are unique to LOB/MWC (27.8 per cent of the total). The FLOB/MWC comparison yields 145 positive keywords, 109 of which are shared by both comparisons and 36 of which are unique to FLOB/MWC (24.8 per cent of the total). It is arguable that the 109 shared keywords may be the best focus of this study – in effect one is triangulating the keywords in the MWC by using a pair of reference points, one before the MWC begins and one after, with the aim of extracting the keywords from the corpus that are relatively independent of the sampling frame. Furthermore, one may be able to look at the keywords unique to the LOB set and identify words which, over the span of 20 years, have changed in frequency such that they are no longer key when FLOB is compared to MWC. Similarly the keywords indicate those words which have

Table 7.4 The keywords of the MWC placed into moral panic discourse categories

Category: Positive keywords in that category

Consequence: public, violence, children, people, society, freedom, education, moral, responsibility, standards, rape, women, child, morality, young, concern, corrupt, viewer, human, audience, relationships, girls, adults, family, listener, exploitation, creative

Corrective action: censorship, campaign, ita, manifesto, debate, legal, parents, report, responsible, advisory, laws, press, rang, religious, postmaster-general, school, daily, meeting, phone, support, interview, parliament, jury, letter, wrote, speak, prosecution, reaction, invited, charter, licence, dpp

Desired outcome: clean-up, decency, christian, compassion, opinion

Moral entrepreneur: whitehouse, viewers, listeners, vala, association, buckland

Object of offence: sex, sexual, violence, pornography, intercourse, abortion, obscene, permissive, obscenity, pornographic, masturbation, crime, contraception, school, sexuality, liberation, homosexual, porn, homosexuals, abortions, sexually, homosexuality, broadcast, anti-censorship, four-letter

Scapegoat: bbc, television, broadcasting, programmes, programme, tv, greene, film, governors, denmark, radio, danish, director-general, hugh, humanist, films, legal, media, corporation, report, normanbrook, association, advisory, fpa, screen, press, postmaster-general, book, lobby, trevelyan, daily, itv, sixties, hoggart, fox, law, copenhagen, drama, publication, calder, cosmo, series, medium, producer, dpp

Rhetoric: our, we, who, what, birmingham, to, my, us, that, which, very, me, i, believe, about

Unclassified: shewn, dr, its, of, kenneth, mrs

changed their frequency over the 20-year time frame to become key. Given the corpora available, I decided to proceed with my analysis of the moral panic using the LOB/MWC keyword list.

The MWC/LOB keywords, key keywords and moral panics

A problem arises with the use of keyword analyses here. In spite of setting the p value for the keywords to the maximum allowable by WordSmith, the MWC corpus generates a large number of keywords. While nonetheless analysable, it would be preferable to analyse a smaller coherent subset of the keywords in order to expedite the analysis. In order to do this, I am using key keywords in this chapter. Key keywords are keywords which are key in all, or the majority, of subsections of a corpus.
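The definition of key keywords above can be sketched as a count over subsections. The Python below is a minimal illustration, not WordSmith's interface: it assumes the keywords of each subsection have already been computed against the reference corpus, and the function name `key_keywords` and the toy book labels are invented for the example.

```python
from collections import Counter


def key_keywords(keyword_sets, min_sections):
    """Return words that are keywords in at least `min_sections` subsections.

    keyword_sets: mapping of subsection name -> iterable of its keywords.
    Result: list of (word, number_of_subsections), ordered by descending
    count and then alphabetically.
    """
    counts = Counter()
    for words in keyword_sets.values():
        counts.update(set(words))  # count each word once per subsection
    hits = [(word, n) for word, n in counts.items() if n >= min_sections]
    return sorted(hits, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Toy data: three hypothetical subsections with invented keyword sets.
    books = {
        "book_a": {"bbc", "sex", "women"},
        "book_b": {"bbc", "sex"},
        "book_c": {"bbc", "sex", "women"},
    }
    # Key in every subsection:
    print(key_keywords(books, min_sections=len(books)))
    # Key in a majority of subsections:
    print(key_keywords(books, min_sections=2))
```

Setting `min_sections` to the number of subsections keeps only words that are key everywhere; a lower threshold admits words that are key in a majority, or in some fixed number, of subsections.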


In order to determine the effect of key keyword analyses, I first analysed the keywords into the moral panic categories established in McEnery. In doing this I used the same methodology as McEnery to verify the category membership, and cross-category membership, where appropriate, of each keyword. The results are shown in Table 7.4.

Following the analysis of keywords in the moral panic discourse model categories, I undertook two key keyword analyses. In the first, I calculated the key keywords for each main text of the MWC. In the second, I calculated the key keywords for each chapter in each book of the MWC. In each case, I once again used the LOB corpus as a reference corpus.

What were my intended goals in carrying out these analyses? First, I wanted to see how the key keywords organized themselves in terms of the moral panic categories – are the key keywords spread evenly across the categories? Secondly, I wanted to see what the key keywords were across the whole MWC (i.e. which key keywords are keywords across all of the texts in the MWC) and which keywords drew their strength from particular subsections of the MWC – as small as a single chapter perhaps. The first goal is methodological to some extent, because it allows us to explore the question of whether or not key keywords can give us sufficient data to allow us to populate the moral panic discourse model. Yet, it is also related to the content of the corpus data. The key keywords that we find at the corpus level, i.e. shared between all four books in the MWC, highlight enduring themes of the MWC corpus. Certain other words, while keywords in the whole corpus, may have their keyness attributed to just one book, or perhaps even one chapter. In short, we will be able to differentiate relatively transient keywords (those appearing, say, in the first book but not in later books) from those which are permanent, i.e. key across the whole corpus.
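The distinction between transient and permanent key keywords reduces to set operations over two lists. The sketch below is illustrative rather than the procedure actually used: the helper names `split_shared_unique` and `percent` are invented, and the sample data are the Consequence key keywords that the chapter reports for the book-based and chapter-based analyses.

```python
def split_shared_unique(first, second):
    """Split two keyword lists into (only in first, shared, only in second)."""
    first, second = set(first), set(second)
    return first - second, first & second, second - first


def percent(part, whole):
    """Share of `part` in `whole` as a percentage, rounded to one decimal."""
    return round(100 * part / whole, 1)


if __name__ == "__main__":
    # Consequence key keywords as reported for the two analyses.
    book_based = {"public", "violence", "children", "people",
                  "society", "freedom", "women"}
    chapter_based = {"public", "violence", "children", "people",
                     "society", "freedom", "corrupt", "education"}

    book_only, both, chapter_only = split_shared_unique(book_based, chapter_based)
    print("book only:", sorted(book_only))
    print("both:", sorted(both))
    print("chapter only:", sorted(chapter_only))
```

The same split underlies the earlier comparison of the LOB and FLOB keyword lists: with 109 shared positive keywords, `percent(42, 151)` and `percent(36, 145)` give the 27.8 and 24.8 per cent reported for the keywords unique to each comparison.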
In turn, when we then consider these transient and permanent key keywords in terms of the moral panic discourse categories, we may discover that a pattern emerges, e.g. scapegoats being more transitory and consequences being more permanent. We may see that part of the moral panic is prone to being more static than other parts. Figures 7.1 and 7.2 give the results of the key keyword analyses for all 57 chapters and all three texts in the MWC respectively. In Tables 7.5 and 7.6, the data from the two figures are placed in the moral panic discourse categories. In the figures and tables, I have listed only the key keywords that were key in five or more chapters of the MWC (Figure 7.1 and Table 7.5) and the key keywords which were key in all of the MWC texts (Figure 7.2 and Table 7.6). In Figures 7.1 and 7.2 the words are ordered in descending order of key keyness. Table 7.7 shows, for each moral panic category, which key keywords are moral panic keywords when key keywords are calculated both by book and chapter, as well as solely by book or chapter. Table 7.7 in particular is interesting because it shows that the transience of key keywords is observable. In this table, however,

Ibid. Ibid.


television, broadcasting, bbc, sex, i, programmes, whitehouse, programme, sexual, pornography, tv, children, viewers, cannot, our, violence, we, people, greene, society, censorship, me, campaign, intercourse, my, vala, obscenity, public, phone, permissive, clean-up, women, pornographic, ita, parents, masturbation, freedom, film, sexuality, listeners, corporation, education, meeting, abortion, director-general

Figure 7.1

Words which are key keywords in five or more chapters of the MWC

Table 7.5 Words which are key keywords in five or more chapters of the MWC mapped into the moral panic discourse rôles

Category: Positive keywords in that category

Consequence: public, violence, children, people, society, freedom, corrupt, education

Corrective action: censorship, debate, parents, report, responsible, meeting

Desired outcome: decency

Moral entrepreneur: whitehouse, viewers, listeners

Object of offence: sex, sexual, violence, sexuality, abortion

Scapegoat: television, broadcasting, programmes, programme, greene, radio, director-general, humanist, report, film, corporation

Rhetoric: our, we, who, what

Unclassified: none

broadcasting, television, bbc, sex, i, programmes, whitehouse, programme, sexual, pornography, tv, viewers, our, violence, censorship, children, we, people, greene, campaign, society, me, public, intercourse, my, vala, obscenity, permissive, phone, women, school, ita, pornographic, mrs, masturbation, freedom

Figure 7.2 Words which are key keywords in all of the MWC texts

transience is relative since even the most transient keyword is key in at least five chapters. This transience should become more pronounced if the cut-off of five applied to key keywords in this experiment is further reduced. I will return to transient keywords later in this chapter.

Having calculated the chapter and book-based key keywords, I would now like to consider the key keywords in Table 7.7 and discuss how they act as moral panic key keywords in each category of the moral panic discourse model. However, rather than exploring each category word by word, I will simply present the fully populated model here and then address particularly important/surprising cases in a general discussion. My reason for doing so is that McEnery10 has demonstrated

10 Ibid.

Table 7.6 Words which are key keywords in all of the MWC texts mapped into their moral panic discourse rôles

Category: Positive keywords in that category

Consequence: public, violence, children, people, society, freedom, women

Corrective action: censorship, campaign, ita, school, phone

Desired outcome: none

Moral entrepreneur: whitehouse, viewers, vala

Object of offence: sex, sexual, violence, pornography, intercourse, permissive, obscenity, pornographic, masturbation, school

Scapegoat: bbc, television, broadcasting, programmes, programme, tv, greene

Rhetoric: our, we, my, me, i

Unclassified: mrs

how the model can be populated in this way. Given that the method was applied as easily to the MWC corpus as to that studied by McEnery11 (albeit with a shift of emphasis to key keywords), I do not see any need to discuss the results on a case-by-case basis with the goal of justifying the method.

The key keyword populated model

In this section, I will present the key keywords, placed into moral panic categories and divided into semantic fields, where appropriate.12 See Table 7.8 for the populated model. I will then present a series of more detailed discussions of the key keywords and, to a lesser extent, the keywords. For the detailed discussions I will give the collocates for the words discussed, showing where those collocates are link collocates, i.e. shared between two keywords (emboldened, MI strength for link collocate plus the keywords linked to in parentheses after the MI score). Where the link collocate is also a keyword the word is underlined. So, for example, the entry ‘clean-up’ (6.38, ‘tv’) for the key keyword ‘campaign’ indicates that ‘clean-up’ is a collocate of ‘campaign’. It is also a keyword. The words collocate with an MI score of 6.38, and the word ‘tv’ is a keyword which shares the collocate ‘clean-up’ with ‘campaign’. Table 7.8 represents a diagnostic test of the MWC. The key keywords from the corpus fit the moral panic theory discourse model perfectly. However, in order to understand how the key keywords can move from being a diagnostic tool to being

11 Ibid.

12 Note that some semantic fields contain only one word. This is because the fields were initially developed for the full keyword list. When this is used, the fields with only one member gain further members. For example, the people field in the scapegoat category gains words such as ‘fox’ and ‘hoggart’.

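Table 7.7 The distribution of chapter only, text only and chapter and text key keywords across the moral panic discourse categories

Consequence
Book based key keyword only: women
Both book and chapter based key keyword: children, freedom, people, public, society, violence
Chapter based key keyword only: corrupt, education

Corrective action
Book based key keyword only: campaign, ita, phone, school
Both book and chapter based key keyword: censorship
Chapter based key keyword only: debate, meeting, parents, report, responsible

Desired outcome
Book based key keyword only: none
Both book and chapter based key keyword: none
Chapter based key keyword only: decency

Moral entrepreneur
Book based key keyword only: vala
Both book and chapter based key keyword: viewers, whitehouse
Chapter based key keyword only: listeners

Object of offence
Book based key keyword only: intercourse, masturbation, obscenity, permissive, pornographic, pornography, school
Both book and chapter based key keyword: sex, sexual, violence
Chapter based key keyword only: abortion, sexuality

Scapegoat
Book based key keyword only: bbc, tv
Both book and chapter based key keyword: broadcasting, greene, programme, programmes, television
Chapter based key keyword only: corporation, director-general, film, humanist, radio, report

Rhetoric
Book based key keyword only: i, me, my
Both book and chapter based key keyword: our, we
Chapter based key keyword only: what, who

Unclassified
Book based key keyword only: mrs
Both book and chapter based key keyword: none
Chapter based key keyword only: none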

an expressly analytical tool, I now need to focus on the words in context themselves to show exactly how the sociological processes behind the discourse roles in the moral panic are realized. In the following sections, rather than analysing each key keyword in turn, I focus, instead, on particularly interesting examples from Table 7.8, seeking to show how, through a key keyword analysis, we can focus on the text at a number of levels, e.g. ideology, meaning and rhetoric.

How to lobby without lobbying

The words in the corrective action category speak strongly of the day-to-day lobbying that VALA was engaged in, particularly in the agitation field (see Table 7.9). Letters are written, phone calls made, public debates undertaken, meetings held and interviews, typically in the media, are given. Throughout, there is an attempt to garner and maintain support from a range of organizations, such as the police federation.

Table 7.8 The key keyword populated model

Consequence (Semantic field: Key keywords)
People: public, children, people, women
Acts: violence, corrupt
Abstractions: society, freedom, education

Corrective Action (Semantic field: Key keywords)
Agitation: campaign, debate, meeting, phone
Organizational: school
Public: parents, responsible
Self Regulation: ita
Research: report
Statutory: censorship

Desired Outcome: decency

Moral Entrepreneur: whitehouse, viewers, listeners, vala

Object of Offence (Semantic field: Key keywords)
Crime: violence
Obscenity: obscenity
Pornography: pornography, pornographic

Scapegoat (Semantic field: Key keywords)
People: greene
Research: report
Broadcast programmes: programmes, programme, radio
Media: television, tv, film, broadcasting
Media Organizations and Officers: bbc, director-general, corporation
Groups: humanists

Moral Panic Rhetoric (Semantic field: Key keywords)
Pronouns/Determiners: our, we, who, what, my, me, i

Unclassified: mrs

Table 7.9 Corrective action keywords

Word (Freq.): Collocates in MWC

‘campaign’ (143): clean-up (6.38, tv), supporter, represent (5.12, clean-up, tv), mount, begun, discredit, specifically, launched, tv (4.51, clean-up), swizzlewick (4.38)

‘debate’ (80): parliamentary, opening, roy (4.96, obscene), lords, dealt, continuing, bill (3.84, parliament), annual (3.69, ita, report), union, result (3.51)

‘meeting’ (143): anniversary, library (5.12, public), brighton, interruption, arrange, hall, sponsored (4.71, pornography), demanded, holding, town (4.52, itv)

‘phone’ (43): ringing, calls (7.06, hoggart), feb, call, stopped, hardly, rang, received (4.58, letter), down, next (3.55)
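The MI scores and link collocates reported in Table 7.9 can be illustrated with a short sketch. The chapter does not state the exact formula or collocation window used, so the Python below assumes the common pointwise definition, log2 of observed over expected co-occurrence, with no window-size adjustment. The function names are invented, and the collocate set given for ‘tv’ is a placeholder, not corpus data.

```python
import math


def mutual_information(pair_freq, node_freq, collocate_freq, corpus_size):
    """Pointwise MI: log2(observed co-occurrences / expected co-occurrences).

    Expected co-occurrences are estimated as
    node_freq * collocate_freq / corpus_size (an assumption: many tools
    also scale this by the size of the collocation window).
    """
    if min(pair_freq, node_freq, collocate_freq, corpus_size) <= 0:
        raise ValueError("all counts must be positive")
    expected = node_freq * collocate_freq / corpus_size
    return math.log2(pair_freq / expected)


def link_collocates(collocates_by_keyword, first, second):
    """Collocates shared by two keywords, i.e. their link collocates."""
    return set(collocates_by_keyword[first]) & set(collocates_by_keyword[second])


if __name__ == "__main__":
    # Invented counts chosen so that the arithmetic is easy to follow.
    print(mutual_information(pair_freq=4, node_freq=16,
                             collocate_freq=16, corpus_size=1024))

    collocates = {
        "campaign": {"clean-up", "supporter", "launched"},  # from Table 7.9
        "tv": {"clean-up", "screen"},                       # placeholder set
    }
    print(link_collocates(collocates, "campaign", "tv"))
```

Under this definition a higher score means the pair co-occurs more often than the individual frequencies of the two words would predict, which is why rare but tightly bound pairs tend to rank highly.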

The agitation that VALA was engaged in seems to be a blueprint for other lobbying organizations. Interestingly, however, the keyword ‘lobby’ does not belong in the corrective action category at all. As shown in Table 7.4, ‘lobby’ is most certainly a scapegoat category word. ‘Lobby’ in the MWC is a word with a powerful, negative semantic prosody. As will be shown later (Figure 7.3), ‘lobby’ links to a negatively loaded collocational network with the words ‘homosexual’ and ‘permissive’ being its immediate link nodes. Its collocates (‘anti-censorship’, ‘myths’, ‘tactics’, ‘permissive’, ‘claim’, ‘humanist’, ‘homosexual’) are linked to groups or concepts Whitehouse was opposed to (‘anti-censorship’, ‘permissive’, ‘humanist’, ‘homosexual’) or represent a negative evaluation of the lobby concerned (‘myths’). These lobbies make claims and use tactics. ‘Claim’ and ‘tactics’ are in turn words in the MWC with notably negative collocates. ‘Claim’ in its verbal form collocates with ‘lobby’. This verb is a marker of epistemic modality – a degree of uncertainty is being attributed to the statement made. It is hardly surprising then, that when we explore the verb ‘claim’, we discover that claims are typically made by those groups with whom Whitehouse disagreed. Those who are claiming are ‘the secularist lobby’, ‘advocates of permissiveness’, ‘the new populists’ and ‘the anti-censorship lobby’. ‘Tactics’ is another word which is coloured in a negative fashion in the MWC. Tactics are used by ‘the permissive lobby’, ‘the anti-censorship lobby’, ‘the progressive left’, ‘the New Left’ and ‘the new morality wing of the Anglican church’. They are ‘communist’ and ‘revolutionary’. In short, from Whitehouse’s perspective, lobbies are bad – they are the encapsulation of all that she opposes, and their pronouncements lack certainty. In this way, Whitehouse generates an in- and an out-group.
People who campaign for a cause which she approves of, such as her own group, are not lobbies, and do not collocate with ‘lobby’; those who campaign in exactly the same manner for things that she disapproves of are lobbies. The word ‘lobby’ is in effect a snarl word in the MWC. So, while Whitehouse undoubtedly lobbied, the word is not in the corrective action category,


and is not a key keyword, as the word was taken to embody the activities of the groups to which she was opposed.

Schools

One key keyword, ‘school’, is of interest as it is both a corrective action key keyword and a scapegoat keyword. Certain schools and schooling practices were viewed as not merely acceptable by Whitehouse, but viewed as a means of combating changes she did not welcome in society. Those schools which adopted progressive schooling practices, or which failed to regulate children in a manner which Whitehouse found acceptable, were, however, a significant object of offence for Whitehouse. The contrast between the two types of school in the MWC is stark. An almost Enid Blyton picture of hyper-normality is painted of the ‘good schools’. There is talk of a ‘school choir’, children attending art classes bringing with them ‘little bags and boxes of samples of sand, bark and tea’. These schools are populated with children who always ‘wear gloves when in school uniform’ and ‘carry a clean hankie’ in their pockets. By contrast, the progressive school is interested in launching ‘a campaign to persuade girls to carry condoms in their school satchels’, is not ‘terrified of thirteen year-olds starting sexual intercourse’ and wishes there were ‘several contraceptive machines in every school’. It may also take children to visit ‘a sex show club in the course of their studies’. It is a place where ‘boys were chasing the girls round her school playground punching them in the stomach’. If this were not bad enough, ‘filthy books of no literary merit are to be found in school libraries’ at this kind of establishment. The contrast could not be starker: an idealized version of British childhood and an appalling hell on Earth. The contrast in itself underlines the innate conservatism of Whitehouse’s approach. She lionizes an approach to education based on an idealized version of schooling in the 1950s.
Only the good is emphasized in this argument, with the class bias inherent in the system ignored and the brutal treatment meted out to children in such schools overlooked.13 Anything that deviates from this idealized norm is subversive, and all attempts are made to suppress any good aspect of the alternative, and to highlight and dramatize any possible negative aspect of it. The only people who could possibly agree with such schooling were, in Whitehouse’s view, the constitutionally irresponsible. Indeed, she claims only children themselves could approve of such libertinism:

This is, admittedly, not in agreement with the wish of Swedish parents, 70 percent of whom would wish the school to exercise its authority to promote the ideal of youthful abstinence, but on the other hand in full accord with the wish

13 For some excellent, if harrowing, first-hand accounts of discipline in such schools see .


of schoolchildren, 95 percent of whom are reported to share the view of their radical school authorities.

Whitehouse’s was the responsible view; that of those who agreed with the ‘radical school authorities’ was childish. Schools were key in promoting the responsible view for Whitehouse – hence their appearance as a corrective action key keyword, too. For Whitehouse, schools were to play an important rôle in the imposition of a new moral order, especially in the area of sex education. Yet in the 1960s and 1970s widespread national schooling was established, as was an approach to schooling which ran counter to the views of VALA. Consequently, ‘school’ is cross-posted to both the corrective action and scapegoat categories. Those schools which Mrs Whitehouse approved of were part of her corrective action (grammar schools, for example). Those which she disapproved of were one of the many scapegoats targeted by VALA. Two examples from the MWC show how Whitehouse saw schools participating in her corrective action, in this case with regard to sex education:

1. This made me realize that the true function of the school is to help parents educate their own children, and this is what the majority of parents want to do.

2. Just as in my sex education work at school I worked from the very strong belief that the school’s job in this matter was not to remove the privilege and responsibility from the parents, so I believe the TV screen should help the parent and not rush in to tell all without restraint.

Schools which were approved of by Whitehouse did not impose any morality other than that imposed in the assumed Christian home of the child by the child’s parents.14 In seeking to impose that morality Whitehouse would, where convenient, appeal to published or informal research, as discussed in the next section.

Reports and research

The use of the key keyword ‘report’ in the MWC is usually focused on advisory or research reports commissioned by either a public or private body.
The key keyword does occur in the context of press reportage, but given that only seven of the occurrences of ‘report’ occur with this sense, the key keyword ‘report’ in the MWC refers almost exclusively to commissioned reports. One thing that is clear from looking at the collocates of ‘report’ is that reports are often identified by their chairmen – hence frequent references to the Newsom and Pilkington reports,

14 See later in this chapter for a discussion relating to Whitehouse’s assumptions regarding the Christian nature of Britain.

Table 7.10 The keyword ‘report’

Word (Freq.): Collocates in MWC

‘report’ (127): newsom (6.39, education, religious), pilkington, revealing, debated, council (5.56, advisory, listeners), annual (5.34, debate, ita), biased, authors, secondary (4.66, school)

both commissioned by the UK government, generate the collocates ‘newsom’ and ‘pilkington’ for the key keyword ‘report’15 (see Table 7.10). The Newsom report was published in 1963 and made recommendations for the future of secondary schooling in the UK. The Pilkington report was published in 1960 with a remit of determining the future development of the BBC and the ITA (Independent Television Authority). Given the importance of ‘school’, as a corrective action and a scapegoat, and television output as an object of offence, the prominence of these reports in Whitehouse’s writings is hardly surprising. The surprise wanes further when one observes that these reports broadly support the position that Whitehouse was taking. It was the desire of Whitehouse to see the concerns of reports such as Newsom and Pilkington taken into account that places the key keyword ‘report’ in the corrective action category. These reports are cited as support by Whitehouse for her own position. Talking of the concern generated by media output felt by the mass of citizens that Whitehouse claims she represents, Whitehouse asserts:

‘It cannot be dismissed as the unrepresentative opinion of a few well-meaning but over-anxious critics, still less as that of cranks. It has been represented to us from all parts of the kingdom and by many organisations of widely different kinds’. So wrote the authors of the Pilkington Report in June 1960. ‘Disquiet,’ they said, ‘derived from an assessment which we fully accept’ that the power of the medium to influence and persuade is immense; and from a strong feeling, amounting often to a conviction, that very often the use of the power suggested a lack of awareness of, or concern about the consequences.

The Newsom report produced conclusions in line with Whitehouse’s own, leading her to cite it as a source of evidence and a guide for legislators:

The fundamental questions – by whom should sex education be given? when? and to what end? – have been increasingly submerged in a culture which, by its very nature, negates the basic privacy essential to healthy mental and emotional growth and deals with this most personal of matters in a conformist and impersonal fashion. It was an awareness of this growing threat that caused the

15 H. Pilkington, The Future of Sound Radio and Television (Her Majesty’s Stationery Office, 1962) and J. Newsom, Half Our Future (Her Majesty’s Stationery Office, 1963).


compilers of the Newsom Report, Half our Future, published by the Ministry of Education, 1963 (still the most recent government report on secondary education), to declare that sex education must be given on the basis of ‘chastity before marriage and fidelity within it’.

The summation of a position that it is claimed Newsom supports, the assertion of the recency – and hence one assumes authority – of the report and the use of a quotation from the report that is very supportive of Whitehouse’s views on the dangers presented to society by moral relativism, allow Whitehouse to take upon herself the report’s authority, claiming that her views were ‘not simply my own, but also those set out in the Newsom Report’. While taking support from such reports, Whitehouse also supports them, presenting the reports as a good guide to corrective action. In a discussion of a case where parents in the Knapmann family withdrew their children from schooling because of what they saw as progressive sex education,16 Whitehouse is quick to point out that the corrective action taken was in line with the Newsom Report, and quotes Exeter Local Education Authority as recommending the Newsom guidelines on spiritual and moral development as offering ‘excellent guidance’. These guidelines, Whitehouse notes, coincide with those of VALA in so far as ‘For our part we are agreed that boys and girls should be offered firm guidance on sexual morality based on chastity before marriage and fidelity within it’. The finding that reports are used as support by Whitehouse raises a vexatious issue. Whitehouse often stated her clear opposition to certain forms of academic research, yet the Newsom and Pilkington reports were based on research. A related issue links to another collocate of ‘report’, ‘biased’. It is clear from the references to reports such as Newsom and Pilkington that Whitehouse does not merely see these as the positive results of research, she sees them, especially Newsom, as blueprints for corrective action. Yet, clearly, she also sees some reports as biased. In order to begin to discover whether this split occurs with ‘report’ only, I decided to look at a clearly related word, ‘research’.
Does a similar split occur there also, or is research viewed exclusively negatively? The word ‘research’ occurs 59 times in the MWC. If one distinguishes the cases where research is presented in positive terms from those in which it is presented in negative terms, the picture is somewhat surprising. Whitehouse’s references to research are overwhelmingly positive – 50 out of the 59 cases see Whitehouse presenting research positively, typically in support of her own views. Collocates allow us to begin to see how the division between positively evaluated research and negatively evaluated research may be drawn. For positive mentions of research, ‘audience’ collocates nine times, ‘own’ six times and ‘my’ four times. For the negative mentions of research, two collocates, ‘academic’ and ‘sexual’, occur twice each in complementary distribution. These results indicate that the

16 See M. Whitehouse, Whatever Happened to Sex? (Hove: Wayland, 1977), pp. 28–9.

research she cites is either her own (‘own’, ‘my’) or that derived from viewers (‘audience’). Academic research (either by academics in general or sexologists in particular) is marginalized in the sense that it is referred to fleetingly, and when it is referred to, the reference is negative. T here are only two co-occurrences of ‘academic’ and ‘research’ in the MWC and both of these cases present research negatively. D oes the same divide – research based on the views of non-academics being good, research undertaken by academics being bad – apply to the reports? T he answer to this question is no: the reports referred to by VALA are almost exclusively produced by organizations, whether they be public or private, and are not linked to academic or non-academic research sources explicitly. In order to investigate the split between the positive and negative uses of ‘report’, I categorized each use of the word in the same way that I had categorized ‘research’, either as a positive or negative use of the word. As with the word ‘research’, the number of positive references to a report far outweighs the number of negative references, with counts of 98 and 24 respectively. Mentions of parliamentary reports (10), the Pilkington report (8), the N ewsom report (5) and police reports (4) predominate. O ther reports mentioned include those from religious organizations (e.g. the Church of S cotland) and medical authorities (e.g. the British Medical A ssociation). T here is an interesting link here between the source of the reports and the corrective action category. Parliament, as the ultimate source of the N ewsom and Pilkington reports, is an important focus for corrective action, and its reports are presented as such. T he religious nature of corrective action is underlined by the reference to religious reports also. Yet what of the reports which are presented as problematic? 
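The procedure used above for ‘research’ and ‘report’, in which each occurrence of a node word is first labelled by hand as positive or negative and the collocates are then tallied separately for each label, can be illustrated with a minimal Python sketch. The example lines, the labels, the function name `collocates` and the five-word window are all assumptions made for illustration; they are not taken from the MWC, and the span actually used in the study is not stated in this section.

```python
import re
from collections import Counter


def collocates(text, node, window=5):
    """Count words occurring within `window` tokens either side of `node`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts


# Invented concordance lines with hand-assigned labels (NOT MWC data).
labelled_lines = [
    ("positive", "our own audience research shows what viewers think"),
    ("positive", "my own research confirms this"),
    ("negative", "academic research of this kind misleads the public"),
]

# Keep one collocate tally per evaluation label.
by_label = {}
for label, line in labelled_lines:
    by_label.setdefault(label, Counter()).update(collocates(line, "research"))

for label, counts in by_label.items():
    print(label, counts.most_common(3))
```

Keeping a separate tally for each label is what allows a collocate such as ‘own’ to be associated with one evaluation rather than the other; the labelling itself remains a manual, interpretative step.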
In the case of these reports, it was the presentation of views with which Whitehouse disagreed that caused the negative evaluation. However, the sources of these reports form as coherent a group as those of the positive reports. Rather than linking to the corrective action category, though, they link to the scapegoat category. The negative reports are produced by such scapegoats as ‘bbc’ (3), ‘bha’ (1), ‘hoggart’ (1),17 as well as other organizations which, if they are not scapegoat keywords in the MWC, are certainly organizations which would fall into that category, such as the Greater London Council and the National Council for Civil Liberties. It would, of course, be foolish to claim that the word ‘report’ is at times used positively by Whitehouse because she collocates it with a corrective action keyword. It would also be foolish to make that claim with reference to negative uses of report and scapegoat keywords. However, the relationship between the word ‘report’ and the authors of the report is shown clearly by the collocates here. Those reports written by organizations which Whitehouse approves of, giving advice she agrees with, are evaluated positively. Those produced by organizations she disapproves of, espousing views she disagrees with, are evaluated negatively. The negative evaluation of the reports Whitehouse disagrees with is further intensified by evaluative terms being attached to the word ‘report’ – these reports display ‘ideological bias’,18 are ‘biased’, ‘tendentious’19 and display a mastery of ‘half-truth’.20

By contrast, the reports Whitehouse approves of are typically presented with modifications that amplify the panic Whitehouse is trying to exploit, or are used to strengthen the credibility of the report. For example, Whitehouse states that ‘Recently the Chief Medical Officer to the Ministry of Health, Sir George Godber, presented a report on the disturbing increase in venereal disease among young people and called for “an all-out attack” on the problem’. Whitehouse cites this as evidence in support of her own solution to the problem: sexual abstinence before marriage. No evidence is provided regarding Sir George’s own proposed solution, nor is there any discussion of the possible source of the increase – whether it was an increase in the reporting of the diseases, or an actual increase in the rate of infection. The quote is interpreted within a framework established by Whitehouse to give maximum support to her position. Another report used for this purpose is the British Medical Association’s 1955 report on Homosexuality and Prostitution. This is described by Whitehouse as a ‘famous’ report. The report describes how homosexuality may be ‘cured’ and Whitehouse uses this evidence to support her own view that homosexuality is an aberration which should be treated both physically and mentally. No major medical authority in the world now agrees with this position and even in 1977, when Whitehouse was writing, she was not quoting from current research, and was citing a position from which the medical profession had retreated.21 Nonetheless, such matters were overlooked and the report lionized as ‘famous’.

The use of research and reports by Whitehouse is complex. Research from non-academic sources is welcome. Academic research – which tends to disagree with her ‘common sense’ research – is shunned and vilified. Reports which are in tune with her own thinking, especially by agents of corrective action, are used to support her view, and are granted her approbation – even when this means endorsing out-of-date and discredited research which forms the basis of a report. On the other hand, reports which disagree with her positions, notably those produced by scapegoats, are dismissed as being biased. In looking at what Whitehouse regarded as organizations producing biased reports, we find her describing a practice which could just as well be attributed to her as to any of those individuals and organizations she is complaining about:

So do big doors hang on little hinges – not because of the strength of the hinges themselves but because the intellectually committed believe what they want to believe, see what they want to see, and do their best to ensure that the rest of us see it their way too.

17 For a brief outline of Hoggart’s attack on VALA see C.R. Munro, Television, Censorship and the Law (Farnborough: Saxon House, 1979), p. 132.
18 A claim made of a US report produced by the American Presidential Commission on Obscenity and Pornography, 1970.
19 A claim made of the Arts Council Report on Censorship of the Arts, 1969.
20 A claim made of Enid Wistrich’s report for the Greater London Council’s Film Viewing Committee on the abolition of censorship in films intended for over-18-year-old viewers, 1975.
21 For example, by 1974 the American Psychiatric Association had removed homosexuality from its list of recognized diseases.

In-groups and out-groups – parents and responsibility

In the corrective action category, the key keywords ‘parents’ and ‘responsible’ are interesting because they generate in- and out-groups, and two key in-groups are those who may be viewed as ‘responsible’ and ‘parents’. There is also an assumption of considerable overlap between these two groups, though not all people represented as responsible by Whitehouse are parents (e.g. Pope Paul VI) and not all parents discussed by Whitehouse are assumed to be responsible; Whitehouse is clearly condemnatory of divorced parents, or, as she puts it, those who ‘run to a cigarette or a drink or out through the front door whenever there is trouble’. Yet, these two groups are generally held up by Whitehouse as a crucial source of corrective action. It is parents working with such professionals as educationalists who can work to offset such undesirable practices as teachers who ‘use pornographic books’. Parents, in Whitehouse’s view of society, are the force which will anchor the moral absolutist position in the face of the flood of moral relativism. For Whitehouse, ‘parents’ are typical of the ‘ordinary decent-minded people, who are so cruelly offended and worried’ by moral revolution. As such, they, and other responsible people, i.e. those opposed to this change in British society, represented the ‘silent majority’ that Whitehouse claimed to speak for.22 Whitehouse was very clear on the point that her view was responsible, and those who supported her view were of necessity responsible also, as shown in Figure 7.3. The quotes in Figure 7.3 are illuminating, as they define an in-group, the responsible, while also setting up a series of potential oppositions which define an out-group. The in-group is serious (1), reasonable (2) and selfless (4). By contrast, one may imagine that the out-group is defined by the negation of these qualities.
1. The fact that the Postmaster-General met us in the middle of the postal strike of 1964 was an indication, in the words of one M.P., that ‘this campaign is regarded as the expression of the will of serious and responsible people in the country’.
2. As far as censorship is concerned it is quite clear to me that the people most likely to create a backlash are those in the arts who refuse to listen to the modulated voices of responsible opinion.
3. We may well agree with the Head of Religious Broadcasting when he says ‘We must go on trying to see that every responsible Christian viewpoint is given fair expression within the whole spectrum of religious broadcasting in television and radio’.
4. The FPA was founded fifty years ago to alleviate the child-bearing problems of women in countries all over the world, but it has travelled a long way since then, and not always to the satisfaction of those responsible people who worked so hard and selflessly for its original aims, or to the credit of those who have been involved in its change of emphasis and policy in recent years.

Figure 7.3  The responsible

Similarly, it is established that there are responsible Christian viewpoints as well as ones which are not responsible (3). In terms of this particular delineation of the in-group, the quotation in (3) above continues to impose, clearly, the in/out-group distinction based on religious conservatism (Billy Graham, Cardinal Heenan)23 in the in-group versus religious liberalism (Dr Robinson, Werner Pelz)24 in the out-group:

1. But when he translates that unexceptionable principle into personal terms one cannot but shudder: ‘We must find room for Billy Graham as well as Dr Robinson, for Werner Pelz as well as Cardinal Heenan,’ he tells us.

22 See the discussion of pronouns later in this chapter for a further discussion of Whitehouse’s claim to speak for a majority of people in Britain.
23 Billy Graham is a conservative American Southern Baptist evangelical preacher given to travelling the world trying to attract mass conversions. Cardinal Heenan was the doctrinally conservative Catholic primate of all England in the period 1963–75.
24 Dr John A.T. Robinson was an English bishop who embraced liberal causes – he appeared for the defence in the Lady Chatterley trial, for example. He was also doctrinally liberal, and his book Honest to God (SCM, 1963) espoused a number of radical ideas (e.g. the non-existence of a personal God). Werner Pelz was a sociologist and author of The Scope of Understanding in Sociology: Towards a More Radical Reorientation in the Social and Humanistic Sciences (Routledge, 1974).

Porn, pornography and enclitics

Given that pornography was a major source of offence for VALA, its appearance in the object of offence category is hardly surprising. What is interesting, however, is that its shortened form, ‘porn’, while a keyword, is not a key keyword. On closer investigation, one discovers that there is a marked difference between the collocates of ‘porn’ (‘harmless’, ‘pleasure’, ‘pictures’, ‘industry’) and those for ‘pornography’ and ‘pornographic’ (see Table 7.11).

Table 7.11  Collocates of ‘pornography’ and ‘pornographic’

Word             Freq.  Collocates in MWC
‘pornography’    197    presidential (5.07, obscenity), freely, sell, sale, pictorial, deviant (4.66, sexual), commission (4.52, obscenity), sponsored (4.24, meeting), proof, link (4.24)
‘pornographic’   56     enterprise, gross, sight, blasphemous (5.60, obscene), pictures (5.47, porn, intercourse), explicit (5.25, sexually), cheap, erotic, magazines, material (4.65)

While the collocates of ‘pornography’ and ‘pornographic’ are broadly the type of words which one would expect to imbue these words with a negative semantic prosody, the collocates of ‘porn’ do not merely represent a failure to associate the word with a negative semantic prosody – they associate the word with a positive semantic prosody through collocates such as ‘harmless’ and ‘pleasure’. However, an exploration of the concordances of ‘porn’ reveals an explanation: ‘porn’ is a word Whitehouse rarely uses, though she does report the use of it in the speech of others. In the 22 examples of the word in the MWC, 14 occur in quotation. It is in these examples, where Whitehouse is quoting from those who oppose her views, that the word collocates with ‘harmless’ and ‘pleasure’ and has a positive semantic prosody, as shown in the examples in Figure 7.4. In both examples, ‘porn’ is a word used by others, and the word in itself becomes a marker of approval for pornography, being associated with a positive view of pornography to the extent that Whitehouse avoids the use of the word (using it herself only eight times) in favour of ‘pornography’ (which is used in quotation by Whitehouse 17 times and by Whitehouse herself 180 times in the MWC). In quotation, ‘pornography’ has a negative semantic prosody just as it does out of quotation. The word ‘porn’ itself, it could be argued, is shunned by Whitehouse, and hence fails to become a key keyword; she was aware of the positive semantic prosody of the word, and wished to avoid it, instead favouring its full form, as the semantic prosody of the full form better reflected her own view of pornography.

However, there is another possible explanation for her avoidance of the word ‘porn’. Whitehouse is a formal writer as, amongst other things, she tends to avoid enclitic forms. Note the presence of the enclitic forms ‘s’, ‘ll’ and ‘n’t’ in the negative keyword list given in Table 7.12. Enclitic forms are markers of speech and informal writing,25 and the presence of these enclitics in the negative keyword list indicates that Whitehouse’s writings belong to a more formal register. The formal style of Whitehouse’s writing is one of its most notable features.
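As a rough illustration of how enclitic forms of this kind might be counted in a text, the following Python sketch uses a regular expression to find ‘n’t’, ‘’s’, ‘’ll’, ‘’re’, ‘’ve’, ‘’d’ and ‘’m’. The pattern, the function name `count_enclitics` and the sample sentence are assumptions for illustration only and are not the procedure used to build the keyword lists. In particular, the sketch cannot tell a genitive ‘’s’ from an enclitic verb ‘’s’; that distinction requires grammatical annotation or manual inspection.

```python
import re
from collections import Counter

# Matches n't, or an apostrophe (straight or curly) followed by a common
# enclitic ending, at the end of a word. Genitive and verbal 's are NOT
# distinguished by this pattern.
ENCLITIC = re.compile(r"(n['’]t|['’](?:s|ll|re|ve|d|m))\b", re.IGNORECASE)


def count_enclitics(text):
    """Return a Counter of enclitic forms, normalised to a straight apostrophe."""
    counts = Counter()
    for match in ENCLITIC.finditer(text):
        form = match.group(1).lower().replace("’", "'")
        counts[form] += 1
    return counts


# Invented sample sentence (not a quotation from either corpus).
sample = "She isn't afraid; it’s all in the shop and we'll see Whitehouse's view."
print(count_enclitics(sample))
```

A count of this kind only gives raw frequencies; to compare two corpora of different sizes the frequencies would still have to be related to corpus size and tested for significance.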
1. ‘But,’ says the book, ‘there are other kinds,’ and it goes on to describe, in concrete terms, bestiality (in the specific sense of that term) and sado-masochism. The book’s general comment on what it has thus described is as follows: ‘Porn is a harmless pleasure if it isn’t taken too seriously and believed to be real life’.
2. The ‘soft’ essence of the trendy churchman was encapsulated by the Reverend Chad Varah when, writing in Encounter, he used ‘a great deal of language that most people would call simply filthy’ and went on: ‘In Soho, the soft porn is kept in the front room and the hard in the back ... In Denmark, thanks to the enlightened Danes’ abolition of censorship it’s all in the shop ... The best porn is not only therapeutic but appeals to our sense of wonder.’

Figure 7.4  Porn is good

25 See D. Biber, S. Johansson, G. Leech, S. Conrad and R. Reppen, The Longman Grammar of Spoken and Written English (Longman, 1999), pp. 1048, 1060–62 for a discussion of the use of enclitics in speech.

However, a discussion of the enclitic forms raises two further questions which must be addressed before we can proceed. First, are there genres in LOB which are more similar to the MWC in terms of their use of enclitics and secondly, with reference to the form ‘’s’, is it a negative keyword as an enclitic verb form, a genitive marker or both? In the MWC, ‘’s’ occurs as a genitive 597 times, and as an enclitic form of the verb ‘be’ 140 times. Tables 7.13 and 7.14 compare the distribution of genitive and enclitic forms of ‘’s’ in the MWC and LOB. While a description of the LOB categories is included in Chapter 1 of this book, I have included a description of each category in Table 7.12 for ease of reference.

Table 7.12 shows two things quite clearly. First, the form ‘’s’ is a negative keyword for the MWC irrespective of which of the LOB sub-sections the MWC is compared with. Secondly, LOB H is the sub-section of LOB which, in terms of its usage of enclitics, matches the MWC most closely. Given that it is argued here that avoidance of enclitics is an indicator of formality, it is interesting to note that the LOB H category is composed of very formal texts indeed – largely government documents and official reports. It is also notable that those texts which use a wider variety of enclitics, such as LOB L and LOB P, are clearly more informal genres, composed of popular fiction. Importantly, these are also genres in which representations of speech occur most frequently. However, given the fact that ‘’s’ is a negative keyword for the MWC irrespective of the sub-category of LOB it is compared to, the question of exactly what the ‘’s’ in the corpora is – a genitive, an enclitic or both – becomes all the more pressing.

In Table 7.13, the last two columns give a log likelihood (LL) score which tests the significance of the difference in frequency between the MWC and subsections of LOB for the occurrence of the genitive ‘’s’ form (column four) and enclitic ‘’s’ form (column five). Following each log-likelihood score is a + or a – in parentheses. A plus indicates that the relative frequency of the form is greater in the MWC; a minus indicates that this relative frequency is higher in LOB.
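A log-likelihood comparison of this kind can be sketched in Python as follows. This is a minimal illustration of one commonly used two-corpus formulation, in which observed frequencies are compared with the frequencies expected if the form were spread evenly across both corpora; it is an assumption that the scores reported here were computed in exactly this way. The function name `log_likelihood`, the frequencies and the corpus sizes in the example are invented for illustration, since the token counts of the MWC and of the LOB subsections are not given in this section. The threshold of 10.83 is the chi-square critical value for p < 0.001 with one degree of freedom, which corresponds to the 99.9 per cent level.

```python
import math

CRITICAL_99_9 = 10.83  # chi-square critical value, 1 d.f., p < 0.001


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Return (LL score, sign) for a form in corpus A versus corpus B.

    The sign is '+' if the relative frequency is higher in A,
    '-' if it is higher in B, and '=' if the two are equal.
    """
    total_freq = freq_a + freq_b
    total_size = size_a + size_b
    # Expected frequencies if the form were distributed evenly.
    expected_a = size_a * total_freq / total_size
    expected_b = size_b * total_freq / total_size
    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    score *= 2
    rel_a, rel_b = freq_a / size_a, freq_b / size_b
    sign = "+" if rel_a > rel_b else "-" if rel_a < rel_b else "="
    return score, sign


# Invented figures: 50 hits in 100,000 tokens versus 200 hits in 200,000.
score, sign = log_likelihood(50, 100_000, 200, 200_000)
print(f"{score:.2f} ({sign})", "significant" if score > CRITICAL_99_9 else "n.s.")
```

The sign is reported separately because the score itself is never negative: it measures how surprising the difference is, not its direction.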
Table 7.12  Enclitics which are negative keywords in the MWC when the MWC is compared to the subsections of LOB

LOB section  Category description                  Negative keyword enclitics when MWC is compared to the LOB section
LOB A        Press: reportage                      n’t, s
LOB B        Press: editorial                      n’t, s
LOB C        Press: review                         n’t, s
LOB D        Religion                              n’t, s
LOB E        Skills, trades and hobbies            n’t, s
LOB F        Popular lore                          n’t, s
LOB G        Belles lettres, biographies, essays   n’t, s
LOB H        Miscellaneous                         s
LOB J        Science                               m, s
LOB K        General fiction                       m, d, ll, n’t, s
LOB L        Mystery and detective fiction         m, re, ve, ll, d, n’t, s
LOB M        Science fiction                       ll, n’t, s
LOB N        Adventure and western                 re, ve, d, ll, n’t, s
LOB P        Romance and love stories              m, d, re, ve, ll, s, n’t
LOB R        Humour                                ve, ll, n’t, s

The log likelihood scores have been emboldened where these figures exceed the 99.9 per cent significance level. Table 7.13 shows that, with few exceptions, it is both the genitive and enclitic form of ‘’s’ which is a negative keyword for the MWC. Both overall and in ten of the 15 subsections, singular genitive marking is used significantly less frequently
in the MWC than in LOB. Similarly, overall, and in eight of the 15 subsections, the enclitic ‘’s’ form is used significantly less frequently in the MWC than in LOB. However, the enclitic ‘’s’ form does differ somewhat from the singular genitive: in three of the genres, LOB G, H and J, the enclitic form occurs significantly more frequently in the MWC than in LOB. One infers, therefore, that in Table 7.13 it is the effect of the combination of the genitive and enclitic form of ‘’s’ which makes ‘’s’ a negative keyword when compared to LOB G (Belles lettres, biographies, essays), H (Miscellaneous) and J (Science). When the different types of ‘’s’ are separated, the formality of LOB G, H and J with reference to enclitic forms is highlighted; it is even more formal than the MWC.

Given Whitehouse’s general avoidance of enclitic forms and the formal style that results, the avoidance of the abbreviated form ‘porn’ may simply be explained by her tendency to formality. However, I do not believe that the two explanations for her preference of ‘pornography’ over ‘porn’ are antagonistic. Rather, they are complementary in that together they give an even stronger impetus for Whitehouse to use ‘pornography’ rather than ‘porn’.

Table 7.13  The relative frequency of genitive ‘’s’ forms and enclitic ‘’s’ forms in the MWC compared to the sub-section of LOB

Section    Freq. of genitive ‘’s’  Freq. of enclitic ‘’s’  MWC v LOB genitive ‘’s’ LL score  MWC v LOB enclitic verb LL score
LOB A      608                     52                      240.89 (–)                        0.39 (+)
LOB B      222                     16                      24.8 (–)                          10.58 (+)
LOB C      327                     30                      272.86 (–)                        2.01 (–)
LOB D      124                     30                      7.26 (–)                          2.15 (–)
LOB E      215                     34                      0.12 (–)                          3.95 (+)
LOB F      364                     128                     33.77 (–)                         41.12 (–)
LOB G      800                     41                      137.15 (–)                        29.04 (+)
LOB H      173                     0                       0.4 (–)                           67.87 (+)
LOB J      435                     20                      0.03 (+)                          68.31 (+)
LOB K      233                     155                     18.22 (–)                         133.23 (–)
LOB L      279                     168                     85.24 (–)                         195.7 (–)
LOB M      49                      31                      5.03 (–)                          34.05 (–)
LOB N      260                     248                     32.16 (–)                         313.24 (–)
LOB P      276                     242                     44.08 (–)                         302.3 (–)
LOB R      75                      28                      8.99 (–)                          13.93 (–)
LOB Total  4,440                   1,223                   127.06 (–)                        57.75 (–)
MWC        597                     140                     –                                 –

Bad sex

Given the history of VALA, the presence of a cluster of keywords associated with sex in the discourse of the MWC is hardly surprising. Nor is the rather negative semantic prosody of the words in this cluster, with its emphasis on what Whitehouse
would view as deviance (‘homosexual’, ‘torture’), transgression (‘offences’, ‘premarital’) and indulgence (‘fantasy’, ‘gratuitous’, ‘titillation’). The collocates are revealing. Two out of the four link to the keyword ‘homosexual’, establishing a link between ‘homosexual’ and ‘masturbation’ through ‘intercourse’, and ‘homosexual’ and ‘minorities’ through ‘sexual’. ‘Sex’ is linked to ‘violence’ through the collocate ‘gratuitous’. ‘Masturbation’ is linked to ‘abortion’ through ‘prior’. Then a link is made to the pornography semantic field of the objects of offence through the link collocate ‘pictures’. The impact of such links will be discussed in more detail later. For the moment it is sufficient to say that the ‘sex’ semantic field is tied to the scapegoat category (‘homosexual’) and another ‘object of offence’ semantic field (‘pornography’).

Bad programmes

Words relating to the broadcast of programmes on the television and radio appear in the key keyword scapegoat list (see Table 7.14). The discussion of these broadcasts almost always identifies the broadcast as a problem, and the act of broadcasting the material as a problem leading to negative consequences. The collocate ‘excellent’ for ‘programmes’ may lead us to assume, however, that not all programmes are identified by Whitehouse as having negative consequences. However, a closer inspection of the examples where ‘excellent’ programmes are discussed shows that it is indeed those programmes to which Whitehouse objects that are being discussed. They are being accused of driving excellent programmes off the air or negating their positive effect, as in the following example from the MWC, ‘What a great pity it is to spoil these excellent programmes and the excellent showing we get from the BBC by distasteful programmes’.

Table 7.14  The collocates of ‘programme’ and ‘programmes’

Word           Freq.  Collocates in MWC
‘programmes’   237    olds, types, satirical, excellent, preview, screened (4.39, programme, television), intervals, affairs, related, build (4.17)
‘programme’    282    catholics, transmitted, complained, talkback, screened (4.14, programmes, television), falling (4.14, tv), braden, finished, thames (3.73, broadcast, tv), night’s (3.73)

As well as blaming individuals for the objects of offence and consequences outlined by Whitehouse, the media, broadly conceived, is accused by Whitehouse of broadcasting and distributing the object of offence. Hence collocates with a negative semantic prosody such as ‘horrible’, ‘suffer’, ‘blue’ and ‘x’ occur with the key keyword ‘film’ and the keyword ‘films’ (see Table 7.15).26 Such collocates clearly form a bridge to the objects of offence category, as they are emblematic of the bad language, sex and violence that Whitehouse objects to. It is hardly surprising, therefore, to see a link to the corrective action category: ‘broadcasting’ and ‘television’ are linked to the corrective action category by the use of language associated with the proposed regulation of the media, resulting in the collocates ‘accountable’ and ‘accountability’.

Table 7.15  The collocates of ‘film’

Word     Freq.  Collocates in MWC
‘film’   191    censors, russell’s, makers, horrible, cole’s, distributed, glc, management (4.29, director-general), exceptional, critic (4.09)

Table 7.16  The collocates of ‘television’ and ‘broadcasting’

Word             Freq.  Collocates in MWC
‘television’     399    screens, dimension, consumers (4.23, radio, broadcasting), accountable (4.23, broadcasting, parliament), independent (4.05, ita), correspondent, radio (3.92, broadcast, jury), myths (3.90, lobby), companies, screened (3.64, programmes, programme)
‘broadcasting’   277    accountable (5.49, broadcasting, television), consumers (5.17, radio, television), accountability, range, exempt, authorities, overall, affirm, urges (4.17, corporation), temple (4.17)

26 The phrase blue film is used to refer to pornographic films. At the time when Whitehouse was writing, x was a certificate awarded to films limited to an adult audience. Such films were limited because they contained bad language, sex or violence, either singly or in combination.

Yet, if particular broadcast programmes are often presented as the scapegoat by Whitehouse, she is also clear that the decision to broadcast material is often the decision of individuals within a media organization, and the individuals, or a reified organization, may also be represented as scapegoats. As was demonstrated earlier, individuals are indeed represented as scapegoats in this way. But what
of those bodies and groups associated with the decision to broadcast? As can be seen from the media organization and officers field of the scapegoat category, media organizations are identified as scapegoats by Whitehouse, and so we could say that, for her, the scapegoating process encompasses the entire organization. The principal organization that was the subject of her disapproval is indicated by the key keyword ‘bbc’. As well as the corporation itself being criticized, specific individuals associated with its different layers may be singled out for signal blame, for example, Hugh Greene. The process of Whitehouse moving between levels in an organization while criticizing it is a feature of Whitehouse’s approach to attacking scapegoats, and the BBC in particular.

The fragile nature of decency

The desired outcome identified in the MWC is the advent of a society in which ‘decency’ reigns. A form of Christian values based on absolute morality was intimately linked to decency and compassion for the MWC. This link to Christianity becomes explicit for ‘decency’ through the link collocate ‘faith’. However, it must be noted that in linking Christianity and decency, there is an implicit denial that those who oppose the MWC are either ‘christian’ or ‘decent’. Those people identified in the scapegoat category must be those trying to frustrate this outcome and, hence, cannot be Christian and cannot be decent. Indeed, they are responsible for the state of the media that needs to be cleaned up. Consider the examples in Figure 7.5. The examples in Figure 7.5 claim that the public wanted decency restored, implying that it was under threat and in decline (1). Those who sought to defend decency would have to suffer attacks from those that wanted to attack it (2). The enemies of decency represent an out-group who are opposed to decency, amongst other things, and those who wish to defend decency (2 and 3).
Table 7.17  The collocates of ‘decency’

Word        Freq.  Collocates in MWC
‘decency’   29     offend, taste, petition (6.01, parliament), calling, good, against, feeling, faith (4.52, christian), public (4.32, opinion), standards (3.23)

1. On a point of information, 1.1 million people signed the 1973 Nationwide Petition for Public Decency calling for more effective controls – 85 percent of those who had the opportunity to sign.
2. While they try to discredit with a yell of ‘fascist’ those who defend decency and culture, they themselves launch an assault upon the senses and freedom of the individual which is the essence of the worst kind of dictatorship.
3. She isn’t afraid of being called a moral busy-body, a pedlar in cant, a prude, a hypocrite, or any of the other verbal weapons in the arsenal of those who despise taste, ridicule good manners, resent decency, applaud blasphemy and generally espouse the litter louts of the arts.
4. It was with this warning in mind that National VALA, with the support of the Festival of Light, launched a Petition calling upon the Government so to revise the Obscenity Laws that they become an effective and workable instrument for the maintenance of public decency.

Figure 7.5  The call for the restoration of decency

Decency can be maintained or restored through the work of moral entrepreneurs who will persuade the government to institute changes to the law to enable the curbing of the out-group that wants to attack decency (1 and 4). For VALA ‘decency’ is a fragile object. It is also under attack and must be defended by individuals and, ultimately, the state. Those seeking to attack ‘decency’ are those ‘who despise taste, ridicule good manners, resent decency, applaud blasphemy and generally espouse the litter louts of the arts’. This is a powerful example of an out-group being presented negatively.

Be vague – moral panic rhetoric

The presence of first person singular pronouns (‘I’, ‘me’, ‘my’) as key keywords in the MWC texts is to some extent not surprising, because the texts in this corpus are largely written from the point of view of Mary Whitehouse herself. However, the fact that the corpus also contains a first person plural pronoun (‘we’) and the first person plural determiner ‘our’ as key keywords makes the choice of the first person singular point of view an interesting one, since, through the predominance of singular and plural first person pronoun/determiner forms, the author is able to blur the distinction between the views which she holds, those which she and her supporters hold, and those which are held by a larger group, including both the author and the reader.

So we put our (1) heads together and produced our (2) manifesto.
THE MANIFESTO
1. We (3) women of Britain believe in a Christian way of life.
2. We (4) want it for our (5) children and our (6) country.
3. We (7) deplore present day attempts to belittle or destroy it and in particular we (8) object to the propaganda of disbelief, doubt and dirt that the BBC projects into millions of homes through the television screen.

Figure 7.6  Pronoun use by VALA

First person plural pronouns/determiners
are vague.27 Consider the use of ‘we’ and ‘our’ in the manifesto of VALA as shown in Figure 7.6 (the examples in Figure 7.6 have been given superscript numbers by me to facilitate a discussion of the pronouns). Examples (1) and (2) clearly encompasses only those people who sat down to write the manifesto. H owever, (3) encompasses a larger group, since not all of the Christian women of Britain sat down to write the manifesto and not all British women are Christians. A n in-group and out-group is set up here – the in-group being the Christian women of Britain who agree with the manifesto, the out-group being those women of Britain (whether they view themselves as Christian or not) who do not agree with the manifesto. Examples (4), (5), (6), (7) and (8) may or may not refer to the groups identified by (1), (2) and (3) or to some other group. These pronouns/determiners have a sweeping and vague scope that is difficult to determine with certainty from the text. A s well as exploiting the vagueness of the plural first person pronouns/determiners and generating in- and out-groups, these word forms can also be used to imply that the reader shares the views of VALA , as shown in Figure 7.7.28 In Figure 7.7, (1), (2) and (3) assume that the reader is a Christian. W hile this may indeed have been true for many readers, it is not axiomatic that those who would read W hitehouse’s works would be Christian. H owever, given the central rôle of a variety of Christianity based on absolute morality in the campaigns of 27 See Biber et al., The Longman Grammar of Spoken and Written English, pp. 329– 30 for a discussion of the vagueness of this category of pronouns. 28 I include the keyword ‘us’ in this analysis. While it is not a key keyword, the inclusion of ‘us’ in this discussion seems appropriate in the context of discussing the way in which Whitehouse manipulates first person plurals.

The philosophical concept of the spontaneous apprehension of absolute good has lost all credence in a day when the entire concept of good is challenged, and we (1) need to be aware that it is largely our (2) Christianity and nothing else that has taught us (3) of goodness, justice, love, truth and beauty. And this is not something just for a reluctant Sunday.

Figure 7.7

The assumption of Christianity

1. The Churches – as indeed had happened in other European countries where the pornographer Thorsen had tried to get his film made – not only came vigorously to life but united, one with another, under the leadership of the Queen, the Archbishop of Canterbury, Cardinal Hume of Westminster and the Prime Minister. And they united with those lay people in the country who had been fighting pornography for years. The people spoke with one voice – all except, that is, for some pathetic bleats from some of the anti-censorship lobby whose gods are, of course, those same pieces of silver which betrayed Christ in the first place. This seemed to me the most wonderful thing. No longer could the publicists of the ‘God is dead’ school, or the hot gospellers of the secularist lobby, claim that we live in a post-Christian era. We may not all go to church, but we care.
2. There is at the heart of the nation a sound Christian core. Parents who know what they value for their children and are prepared to see that they get it.

Figure 7.8

Speaking up for the silent majority

VALA, we can see clearly why Whitehouse wanted to assume that readers would be Christian, because she was claiming to represent the views of the silent Christian majority, who abhorred the switch away from absolute moral positions driven along by groups and individuals, such as the humanists or the BBC’s Director General, Hugh Greene, as shown in the examples in Figure 7.8. Whitehouse’s tactic when claiming that Britain was essentially Christian was to argue that, while the population was not visibly Christian, they were, so to speak, closet Christians. This may or may not have been true. What is true is that, having adopted this position, the writings of Whitehouse are then bound to reflect that view, and that view is in itself crucial to the argument Whitehouse is putting forward. If Britain were not full of closet Christians then Whitehouse’s arguments would have no force. Her battle would be lost before it began. Hence, in the use of first person plural pronouns and determiners to encompass groups in society larger than Whitehouse can rationally claim she was representing, Whitehouse was implying a support for the moral panic that she was promoting that she may – or may not – have had.

Keywords and Moral Panics

1. When the Viewers’ and Listeners’ Association was formed Sir Hugh Greene that evening called it a ‘lunatic fringe’. What does this so-called lunatic fringe consist of? Among its members are an Anglican Bishop, the head of the Roman Catholic Church in Britain, a high official of the British Medical Association, many chief constables and many Members of Parliament. I submit that the lunatic fringe who ought to look at their own misconduct are the minority to whom I have referred not the people who are trying to get things put right.
2. A month earlier the Managing Director of STV, the BBC’s Scottish rival, had attracted a good deal of attention by announcing that he intended to act as a censor himself ‘to fight some kind of rearguard action against progressive loosening of moral standards’. Which is the more responsible attitude? The BBC has always prided itself on exercising its own controls and gives this as a reason for rejecting any control from outside. In The Listener Sir Hugh Greene re-stated the theory: ‘We have (and believe strongly in) editorial control …’ That is exactly what all this protest is about. Some simple clear principles must be defined.
3. But who has the time for eternal vigilance? Mary Whitehouse, whom I first knew as a teacher in a School in my Diocese and had met at her parish Church, decided to give herself entirely to this new task and with great courage resigned from her teaching post.

Figure 7.9

The use of wh-interrogatives by VALA

The wh-keywords in the MWC (‘who’, ‘which’, ‘what’) merit some discussion, as they signal another important rhetorical device used in the corpus: the use of questions.29 While not all of the uses of the wh-forms discussed here are questions, in the MWC 131 examples of ‘what’, 37 examples of ‘which’ and 59 examples of ‘who’ are questions. It is their importance as interrogative clause markers in the MWC that has led to their inclusion as wh-interrogatives in this discussion. What is the purpose and nature of questions in a discourse of the sort encoded in the MWC? The examples in Figure 7.9 illustrate the rôle of these wh-interrogatives well.30 In all three cases in Figure 7.9, a question is used as a rhetorical device to allow the writer to provide the answer that they prefer – the lunatic fringe consists of the critics of VALA, not VALA itself (1), censorship is the responsible choice (2), and Mary Whitehouse is the person who can stand watch over the nation’s morals (3). By posing and replying to questions, the texts give a semblance of debate, while remorselessly pursuing an agenda of moral absolutism in a context in which the answers given to questions, and the outcome of a supposed debate, will be in harmony with the views of VALA.

29 The words who and what are key keywords. I will also discuss which here as the word groups logically with the two other wh-forms under discussion.
30 The third example in this figure is from the book Cleaning Up TV and is from the foreword written by the Bishop of Hereford.

Conclusion

This paper, through an exploration of moral panic theory, has demonstrated the worth of keywords and key keywords. When faced with a mass of data, keywords may be of use in exploring that data in a structured and efficient manner. However, there are occasions on which the number of keywords is overwhelming or where the transience of keywords may become an issue. In such cases key keywords can be very useful in approaching corpus data. In using corpora and keyword techniques, one is able to approach and analyse volumes of text which, given a hand-and-eye led analysis, would be prohibitively time consuming. In allowing researchers to engage with large volumes of data rapidly and effectively, corpus linguistics promises not only the prospect of rapid and comprehensive results, but also, as I hope this paper has shown, the gateway to a number of unexpected and illuminating insights into the data in question.

Chapter 8

‘The question is, how cruel is it?’ Keywords, Fox Hunting and the House of Commons

Paul Baker

Introduction

In the UK, fox hunting as it is recognized today had been practised since the seventeenth century. There have been numerous attempts to regulate or ban it, stretching back over half a century. In January 2001, according to the BBC, more than 200,000 people took part in fox hunting in the UK, and it was described as ‘one of the most divisive issues among the population’. Tony Blair’s Labour Party manifesto in 1997 promised a ‘free vote in parliament on whether hunting with hounds should be banned’. In July 1999 he announced that he would make fox hunting illegal, and before the next general election if possible. After a number of parliamentary debates and votes, the ban was implemented in February 2005.

In order to examine discourses surrounding the issue of banning fox hunting I decided to build a corpus of parliamentary debates on the subject. I collected electronic transcripts of three debates in the House of Commons which occurred prior to votes on hunting. These occurred on 18 March 2002, 16 December 2002 and 30 June 2003, and the total corpus size was 129,798 words. In general, the majority of Commons members voted for the ban to be ratified, although in each debate a range of options could be considered and subsequently voted on (for example, a complete ban vs hunting with some form of supervision).

It might be useful to remember that the fox hunting debate has two sides and, ultimately, each speaker had to vote on the issue of banning fox hunting. It is possible that speakers who voted in the same way in fact approached the subject from very different perspectives and had different reasons for the way they voted; that speakers voted, and that their contributions to the debate would be made with an idea of persuading others to vote in the same way, suggests one area where conflicting discourses may be illuminated.
Therefore, it was decided to split the corpus in two: the speech of all of the people who voted to ban fox hunting was placed in one file, while the speech of those who voted for hunting to remain was

R. Scruton, On Hunting (Yellow Jersey Press, 1998). , accessed 11 May 2006.

placed in another. The anti-hunt voters contributed more speech to the debates overall (71,468 words vs 58,330 words).

Keywords in the corpus

Using WordSmith, it is possible to compare the frequencies in one wordlist against another in order to determine which words occur statistically more often in wordlist A when compared with wordlist B and vice versa. Then all of the words that occur more often than expected in one file when compared to another are grouped in another list, called a keyword list. It is this keyword list that is likely to be more useful in suggesting lexical items that could warrant further examination. A keyword list therefore gives a measure of saliency, whereas a simple word list only provides frequency.

In Figure 8.1, the first column (N) numbers the keywords in the order that they are presented; they are ordered here in terms of keyword strength. The second column (WORD) lists each keyword. The third column (FREQ.) gives the frequency of each keyword as it occurred in the anti fox-hunting sub-corpus. The fourth column (AYE.LST %) shows this figure as a percentage of the whole sub-corpus. Where there is no figure at all, the percentage is so small as to be negligible. The fifth and sixth columns show the same figures for the pro-foxhunting sub-corpus. Due to the fact that the two sub-corpora are of different sizes, the best way to compare frequencies is to look at the percentage columns rather than the raw frequency columns. The seventh column assigns a keyness value to each word; the higher the score, the stronger the keyness of that word, whereas the final column gives the p value of each word. As p is set so low here, almost all of the figures in this column are 0.000000. Therefore the keyness value gives a more gradable account of the strength of each word in the table. In Figure 8.1, the keyness score starts high (at 158.8) for the word ‘Michael’, and gradually decreases, to around 24 by the middle of the table. However, after that it starts to rise again. By the last row of the table it has reached 61.7. This is because the table is actually showing two sets of keywords (which explains why about half of the list is shown in a different colour). The first part shows words which occur more frequently in the anti-hunt speeches when compared with the pro-hunt speeches, while the opposite is true for the second part of the list.

Analysis of keywords

The majority of the keywords found consist of what Scott calls the ‘aboutness’ variety (words that tell us about the genre of the corpus), in both parts of the list.

M. Scott, WordSmith Tools Help Manual. Version 3.0 (Oxford: Mike Scott and Oxford University Press, 1999).

Figure 8.1

Keywords when p<0.000001
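The comparison behind a keyness score can be sketched in a few lines. The example below is a minimal illustration and not WordSmith's own code: it assumes the log-likelihood statistic, which is commonly used for keyness (the chapter does not say which statistic was selected), and the function name `log_likelihood` is hypothetical. The figures are those reported in this chapter for ‘criminal’: 2 occurrences in the 71,468-word anti-hunt file and 38 in the 58,330-word pro-hunt file.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-cell log-likelihood keyness for one word in two word lists."""
    total = freq_a + freq_b
    # Expected frequencies if the word were spread evenly by corpus size.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


ANTI_SIZE, PRO_SIZE = 71_468, 58_330  # sub-corpus sizes given in the chapter
criminal = log_likelihood(2, ANTI_SIZE, 38, PRO_SIZE)
print(f"criminal: {criminal:.1f}")
```

The score produced this way need not match the value in Figure 8.1, since the software may use a different statistic or formulation; the point is only that a large gap between observed and expected frequencies yields a high score.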

It should be noted again that the words at the extremes of the keyword list are the strongest in terms of their occurring significantly more often in one side of the debate than the other. Consider the word at row 21 of the table, ‘criminal’. If the proper noun ‘Gray’ is discounted, the word ‘criminal’ is the strongest keyword used by those who were opposed to a ban on hunting. It occurs 38 times in the collective speech of the pro-hunters and only twice in the speech of the anti-hunters. Why is this so? As with ordinary frequency lists, this is unfortunately where the limitations of keyword lists come into play. We may want to speculate on the reasons why ‘criminal’ is used so much by pro-hunters, and looking at some of the other keywords may provide clues. However, without knowing more about the context of the word ‘criminal’, as it is used in both sides of the debate, our theories will remain as just that: theories. Therefore, it is necessary to examine individual keywords in more detail, by carrying out concordances of them. When a concordance of ‘criminal’ was carried out on the corpus data (see Table 8.1 for an excerpt of this concordance), it was found that common phrases containing the word ‘criminal’ included ‘the criminal law’ (14), ‘a criminal offence’ (10), ‘criminal sanctions’ (6) and ‘a criminal act’ (3). The modal verbs ‘would’ and

Table 8.1

Concordance of ‘criminal’

1  Benches. The Bill will turn into a criminal offence an activity now lawfully enjoyed
2  ticularly wrong to invoke the criminal law against people in my constituency w
3  e found so to do. It is the use of the criminal law that would most appal me. I shall not
4  to say that the invocation of the criminal law in these circumstances is somehow ak
5  Mr. Garnier: We are extending the criminal law. Does my hon. Friend think it in the l
6  reason we do not normally use the criminal law in areas of this kind. Of course, we us
7  ed by the new authority would be a criminal act attracting a fine of up to £5,000. The a
8  is view, it should not be part of the criminal law. My hon. Friend the Member for Nort
9  y law that we might pass. Imposing criminal sanctions on anybody is a serious matter.
10  ke to address the issue of imposing criminal sanctions on people who transgress any la

‘should’ occur as strong collocates of ‘criminal’, as do forms of the verb make (e.g. ‘make’ and ‘made’). What seems clear from the table is that the pro-hunters are using a strategy of framing the proposed fox-hunting ban as criminalizing people and that they are against this: see, for example, the use of invoke in lines two and four and impose in lines nine and ten. Here, again, in order to get a better idea of the discourse prosodies associated with these terms, it is useful to refer to a corpus of general English. Interestingly, in the British National Corpus (a reference corpus of 100 million words of written and spoken general British English), invoke collocates strongly with two sets of words – legal terms (‘procedure’, ‘jurisdiction’, ‘law’, ‘legal’) and terms relating to supernatural forces: ‘spirits’, ‘command’, ‘powers’ and ‘god’. Semantically then, invoke implies reference to higher powers (with a connection being made between the legal and the supernatural). The lemma impose, on the other hand, collocates in the BNC with ‘restrictions’, ‘sanctions’, ‘curfew’, ‘fines’, ‘ban’, ‘penalties’, ‘burdens’ and ‘limitations’. It therefore carries an extremely negative discourse prosody: if we use ‘impose’ in relation to criminal law/sanctions, then we are showing that we disapprove of the criminal law/sanctions. We can see, therefore, that once a keyword is made the subject of concordance and collocational inquiry, interesting patterns in the discourse begin to emerge. Terms like ‘invoke’ and ‘impose’ are rhetorical strategies, used to strengthen a particular discourse position – in this case, that a ban on hunting would be wrong.
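A concordance of the kind shown in Table 8.1, and a simple count of neighbouring words, can be approximated with very little code. The sketch below is illustrative only: the function names `kwic` and `collocates`, the crude whitespace tokenization and the window sizes are all assumptions made for this example, and the sample text is stitched together from two lines of Table 8.1 rather than drawn from the full corpus. Raw co-occurrence counts are also a much weaker measure than the collocation statistics a concordancer would normally report.

```python
from collections import Counter


def kwic(tokens, node, width=5):
    """Return (left context, node, right context) for each hit of `node`."""
    hits = []
    for i, token in enumerate(tokens):
        if token.lower().strip(".,;:") == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


def collocates(tokens, node, span=3):
    """Count words occurring within `span` tokens either side of `node`."""
    words = [t.lower().strip(".,;:") for t in tokens]
    counts = Counter()
    for i, word in enumerate(words):
        if word == node:
            window = words[max(0, i - span):i] + words[i + 1:i + 1 + span]
            counts.update(w for w in window if w)
    return counts


# Sample text assembled from lines 9 and 3 of Table 8.1.
sample = ("Imposing criminal sanctions on anybody is a serious matter. "
          "It is the use of the criminal law that would most appal me.")
tokens = sample.split()

hits = kwic(tokens, "criminal")
for left, node, right in hits:
    print(f"{left:>30} | {node} | {right}")

counts = collocates(tokens, "criminal")
print(counts.most_common(5))
```

Aligning the node word in a central column, as the formatted print does, is what makes recurring patterns to the left and right of the keyword easy to spot by eye.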

What of the other keywords in the list? Due to space limitations, it is not possible to examine each one in detail, although all provide something interesting – each is a different piece of a puzzle which gradually helps us to form a clearer picture. The word ‘people’, for example, which is key in the pro-hunt side of the debate, is often used in attempts to reference a large uncountable mass in two ways. First, ‘people’ refers to those who will be adversely affected by the Bill if it is passed (their livelihoods and communities threatened, and their futures involving imprisonment). Secondly, it refers to (a presumably greater number of) people who do not hunt, but are not upset or concerned by those who do. However, the keyword list has only given us a small number of words to examine, and once all of the proper nouns (‘Michael’, ‘Alun’, ‘Atkinson’, ‘Lidington’, ‘Garnier’, ‘Gray’) have been discounted, this leaves us with just 16 words in total. We may also want to discount (or at least to background for the moment) the keywords which relate to text genre, in this case parliament (‘Bill’, ‘Commons’, ‘House’, ‘Minister’s’), which leaves us with only 12 keywords. Twelve keywords do not give us much to analyse. So in order to address this issue, the p value was increased to p < 0.001 and the keywords process was carried out again, producing 120 keywords, which were reduced to 88 when the proper nouns were discarded. Although the keyness scores in this longer list are less impressive, what is interesting about working with a larger list is that it becomes possible to see connections between words, which may not always be apparent at first, but are clearer once they have been subjected to a more rigorous mode of analysis. For example, keywords in the pro-hunt debate include the following words: ‘fellow’, ‘citizens’, ‘Britain’, ‘freedom’, ‘imposing’, ‘illiberal’, ‘sanctions’ and ‘offence’.
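The effect of relaxing the p value can be made concrete. The sketch below assumes that the keyness score behaves like a chi-square statistic with one degree of freedom (true of the log-likelihood and chi-square measures often used for keyness, though the chapter does not say which was selected); under that assumption the p value follows from the complementary error function. The function names `p_from_keyness` and `is_key` are hypothetical.

```python
import math


def p_from_keyness(keyness):
    """Upper-tail p value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(keyness / 2))


def is_key(keyness, p_threshold):
    """A word is retained when its p value falls below the chosen threshold."""
    return p_from_keyness(keyness) < p_threshold


# Scores 158.8, 61.7 and 24 are mentioned in the chapter; 12.0 is an
# illustrative lower score chosen for this example.
for score in (158.8, 61.7, 24.0, 12.0):
    strict = is_key(score, 0.000001)
    relaxed = is_key(score, 0.001)
    print(f"keyness {score:>6}: p<0.000001 {strict}, p<0.001 {relaxed}")
```

Under this assumption a score of about 24 sits close to the p < 0.000001 boundary, which is consistent with the lowest scores reported for the middle of Figure 8.1, while the looser threshold admits words with scores of roughly 11 and above.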
All of these keywords are connected in some way to the findings we have already considered. So ‘sanctions’, ‘offence’, ‘imposing’ and ‘illiberal’ occur in similar ways to the word ‘criminal’ which was examined above. As a different yet related strategy, the keywords ‘fellow’, ‘citizens’, ‘Britain’ and ‘freedom’ are related to the keyword ‘people’ which was discussed earlier. Consider the concordance in Table 8.2. We can see that the term ‘fellow citizens’ is always preceded by a first person possessive pronoun (‘my’ or ‘our’). The use of this term looks like a strategy on the part of pro-hunters to appear to be speaking for and with the people of Britain, thereby implicitly labelling their discourse as a hegemonic one. Note also how, in lines 10 and 11, the debater speaks for the people: ‘the people of Britain are beginning to catch on’ and ‘for most of the 55 million people in England it is of peripheral interest’. Finally, in lines 13 to 16 the lemma restrict and the word ‘individual’ both collocate with ‘freedom’. There is an underlying nationalist discourse being drawn on here, which could be paraphrased as: ‘Britain is a good country because it is a place where people are free’. This discourse is used as an argument to allow fox hunting to continue. Therefore, examining these additional keywords helps us to build on the findings we have already uncovered. A number of discourses are then starting to become apparent, particularly for the pro-hunt speakers. For example, use of terms

Table 8.2

Sample concordance of ‘fellow citizens’, ‘Britain’ and ‘people’ (pro-hunt)

1  able to me and, I believe, to most of my fellow citizens. The killing of an animal is justifia
2  a small but significant minority of our fellow citizens. I agree with one thing the Minist.
3  al freedom, that it will rob some of our fellow citizens of their livelihood and take homes
4  7, when the pensions of millions of our fellow citizens are affected by a deeply serious cri
5  at. Of course, I accept that some of our fellow citizens genuinely disapprove of hunting wi
6  umber of my family and 407,000 of my fellow citizens, I took part in the march for liberty
7  the Third Reich. Down the ages, we in Britain have fought against the persecution
8  an who ripped apart the fabric of rural Britain and passed the most illiberal and di
9  that is being practised on the people of Britain tonight. Mr. Atkinson: There we
10  se to offer the people of Britain, and the people of Britain are beginning to catch o
11  rs speak, but for most of the 55 million people in England it is of peripheral intere
12  ce to a largely urban nation, millions of people recognise that to criminalise
13  unjustifiable restrictions on individual freedom, would increase the suffering of fo
14  unjustifiable restrictions on individual freedom, that it will rob some of our fellow
15  t, illiberal and arbitrary. It will restrict freedom and do nothing to help animal wel
16  e unjustifiable restrictions on individual freedom trying to justify itself, but failing, i

like ‘criminal’, ‘sanctions’, ‘offence’ and ‘imposing’ suggest a discourse of civil liberties, whereas words like ‘Britain’, ‘fellow’, ‘citizens’ and ‘people’ suggest a discourse of shared British identity.

Using a reference corpus

So far our keywords analysis has been based on the idea that there are two sides to the debate, and that by comparing one side with another we are likely to find a list of keywords which will then act as signposts to the underlying discourses within the debate on fox hunting. Our analysis so far has uncovered some interesting differences between the two sides. However, it also raises some issues. In focussing on difference, we may be overlooking similarities – which could be

equally important in building up a view of discourse within text. For example, why do certain words not appear as keywords? Given that ‘barbaric’ occurred as a keyword in the anti-hunting speeches, another word that I had expected to appear as key in the anti-hunting debates was ‘cruelty’. However, this word occurred 124 times in the anti-hunting speeches and 106 times in the pro-hunting speeches. In terms of proportions, taking into account the relative sizes of the two subcorpora, the anti-hunt speakers actually used the word ‘cruelty’ proportionally less than the pro-hunters (0.17 per cent vs 0.18 per cent). So while ‘cruelty’ occurred slightly more often on one side of the debate, this was not a statistically significant difference: clearly, the concept of cruelty is important to both sides. However, how would we know (without making an educated guess) that a word like ‘cruelty’ is worth examining? One solution would be to carry out a different sort of keywords procedure; this time by comparing the entire set of debates against another corpus – one which is representative of general language use. This would produce a keyword list that highlights all of the words which occur in the fox hunting debates more frequently than we would expect in ‘normal’ language. In this case it was decided to use the Freiburg-Lancaster/Oslo-Bergen corpus (FLOB) which consists of one million words of written British English sourced in the 1990s. Although the FLOB corpus contains written texts and the debates were spoken, a good proportion of the debate consists of prepared speech, so, in a sense, it could be argued that it contains elements of written language. A comparison of the hunting debates with the FLOB corpus reveals a different set of keywords; the 20 strongest being ‘hon.’, ‘hunting’, ‘that’, ‘bill’, ‘ban’, ‘I’, ‘friend’, ‘Mr’, ‘foxes’, ‘member’, ‘clause’, ‘fox’, ‘minister’, ‘cruelty’, ‘we’, ‘gentleman’, ‘house’, ‘my’, ‘dogs’ and ‘is’.
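The proportional figures quoted above for ‘cruelty’ can be reproduced directly from the raw counts and the sub-corpus sizes given earlier in the chapter (71,468 words for the anti-hunt speeches and 58,330 for the pro-hunt speeches). This is a small worked check rather than part of the original analysis; the function name `per_cent` is hypothetical.

```python
def per_cent(freq, corpus_size):
    """Frequency of a word as a percentage of all words in a (sub-)corpus."""
    return 100 * freq / corpus_size


anti = per_cent(124, 71_468)  # 'cruelty' in the anti-hunt speeches
pro = per_cent(106, 58_330)   # 'cruelty' in the pro-hunt speeches
print(f"anti-hunt: {anti:.2f} per cent, pro-hunt: {pro:.2f} per cent")
# prints "anti-hunt: 0.17 per cent, pro-hunt: 0.18 per cent"
```

The raw count is higher on the anti-hunt side only because that sub-corpus is larger; once normalized, the two proportions are nearly identical.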
Comparing this list to Figure 8.1 (which showed keywords when the two sides of the debate were examined), it is clear that some of these words are key in the debates when compared to FLOB because they occur very frequently on one side of the debate (for example, ‘I’, ‘clause’, ‘bill’, ‘house’ and ‘dogs’ are key in both lists due to their prevalence of use by anti-hunting speakers). However, other words do not appear in both lists, for example ‘foxes’ and ‘cruelty’. A further line of investigation therefore could be to examine words which are key across the debate when compared to a reference corpus, rather than simply looking at words which are only key on one side of the debate. Examining the word ‘cruelty’ in more detail, it becomes apparent that although it occurs with a reasonably comparable frequency on each side of the debate, the ways that it occurs are quite different for different speakers. The anti-hunters tend to use it in conjunction with words like ‘ban’, ‘outlaw’, ‘unnecessary’, ‘target’ and ‘eradicate’ (Table 8.3). Their speech also tends to assume that cruelty already exists; thus, for example, ‘The underlying purpose of the Bill is to ban all cruelty associated with hunting with dogs.’ However, those who are pro-hunting question this position – using collocates such as ‘test’, ‘tests’, ‘prove’, ‘evidence’ and ‘defining’ (Table 8.4). Therefore, rather than accepting the presence of cruelty, pro-hunting speakers problematize it, e.g. the full text in line one of Table 8.4 is,

Table 8.3

Concordance (sample) of ‘cruelty’ (anti-hunt)

1  ise, no uncertainty, no delay; a ban on the cruelty and sport of hunting in the lifetime of t
2  I see it very clearly in a Bill that bans the cruelty associated with hunting in all its forms.
3  about banning cruelty and eradicating the cruelty associated with hunting. I have tried to
4  to be enforceable and to eradicate all the cruelty associated with hunting with dogs, and
5  issue for many who want to see an end to cruelty and for those who want things to remai
6  en to an organisation that exists to prevent cruelty to animals and I remind the hon. Memb
7  Shrining in law the principle of preventing cruelty as well as the principle of recognising u
8  ffective and enforceable law. It will tackle cruelty, but it also recognises the need to deal
9  is uncompromising in seeking to root out cruelty. It will not allow cruelty through hunti
10  gly, twice, to bring an end to unnecessary cruelty to wild mammals. There can seldom in

‘Cruelty is subjective and comparative, and the Bill entirely fails adequately to define cruelty or utility.’ Comparing a smaller corpus or set of texts to a larger reference corpus is, therefore, a useful way of determining key concepts across the smaller corpus as a whole. Indeed, for many studies where the text or set of texts under scrutiny is relatively uniform, using a reference corpus may be all that is needed. However, in order to address the problem of over-focussing on differences at the expense of similarities, it is recommended that the corpus being analysed is used in the creation of more than one keyword list.

Key categories

A further way of considering keyness is to look beyond the lexical or phrasal level, for example, by considering words that share a related semantic meaning or grammatical function. While a simple keyword list will reveal differences between sets of texts or corpora, it is sometimes true that lower frequency words will not appear in the list, simply because they do not occur often enough to make a sufficient impact. This may be a problem, as low frequency synonyms tend to be overlooked in a keyword analysis. However, text producers may sometimes try to avoid repetition by using alternatives to a word, so it could be that it is not a word itself which is particularly important, but the general meaning or sense that it refers to. For example, the notion of ‘largeness’ could be key in one text when compared to another, and this would be demonstrated by the writer using

Table 8.4

Concordance (sample) of ‘cruelty’ (pro-hunt)

1  the Bill entirely fails adequately to define cruelty or utility. As my hon. Friend the Mem
2  ess or avoidable suffering” when defining cruelty. The phrase “playing the fish” is no eup
3  ct. The arbitrary application of the tests of cruelty and utility to fohunting [sic] is illogical whe
4  nless those who hunt can meet the tests of cruelty and utility described by the Minister. T
5  e whole House has heard the definition of cruelty, as given by the Minister, relating to ne
6  than not, focuses on cruelty or perceived cruelty. I commend the former Home Secretar
7  ill not be for the authorities to prove that cruelty takes place; if the Bill is enacted, hunti
8  scribed as incontrovertible evidence of the cruelty of deer hunting, he must tell us what it i
9  the Minister is so concerned, where is the cruelty test in the autumn for shooting or snari
10  inister said that those would not pass the cruelty or utility tests. How can he know that?

a range of words such as ‘big’, ‘huge’, ‘large’, ‘great’, ‘giant’, ‘massive’, etc. – none of which occur in great numbers, but would, taken cumulatively, appear as key. Thinking grammatically, in a similar way, one text may have more than its fair share of modal verbs or gradable adjectives or first person pronouns when compared to another text. Finding these key categories could help to point to the existence of particular discourse types – they would be a useful way of revealing discourse prosodies. In order for such analyses to be carried out, it is necessary to undertake the appropriate form(s) of annotation. The automatic semantic annotation system used to tag the fox hunting corpus was USAS, the UCREL Semantic Analysis System. This semantic tagset was originally loosely based on McArthur’s (1981) Longman Lexicon of Contemporary English. Once the semantic annotation had been carried out, word lists (consisting of words and semantic tags) of the two sides of the fox hunting debate were created and compared with each other to create a keyword list. From this list, the relevant key semantic tags were singled out for analysis. There is not enough space to look at all of the key tags in detail, so I want to concentrate on a couple of significant findings here. Two key tags which occurred significantly more often in the pro-hunt speeches were ‘S1.2.6 sensible’ and ‘G2.2 ethics – general’. Looking at a concordance of

A. Wilson and J. Thomas, ‘Semantic Annotation’, in R. Garside, G. Leech and A. McEnery (eds), Corpus Annotation: Linguistic Information from Computer Texts (Longman, 1997), pp. 55–65.
T. McArthur, Longman Lexicon of Contemporary English (Longman, 1981).

Table 8.5

Concordance (sample) of words tagged as S1.2.6 ‘sensible’ (pro-hunt)

1  he Bill makes illegal only the perfectly reasonable sensible and respectable occupations
2  continuation of hunting. I appeal to all reasonable hon. Members to support me in seeki
3  inal law rather than fiddle around in an absurd way with this absurd Minister on this
4  rmed roast. The debate has not shown a rational analysis of the facts: misplaced co
5  be justified by scientific evidence. The ridiculous new clause 13 wrecks it further, and i
6  this matter. Most people with common sense will say, “Why don’t they reach a dea
7  eds your protection. Mr. Gray: Calm, sensible and rational people across Britain a
8  ss. Why not? That would be a logical, sensible and coherent approach. As I have to
9  method of control in that time is utterly illogical Mr. Gray: My hon. Friend makes an
10  ng-during that time. This ludicrous and illogical new clause is the result of a shabby d

words that were tagged as S1.2.6 (Table 8.5 shows a small sample of the total number of cases) it is clear that this contains a list of words relating to issues of sense: ‘sensible’, ‘reasonable’, ‘common sense’, ‘rational’, ‘ridiculous’, ‘illogical’ and ‘absurd’. The prevalence of this class of words is due to the way that the pro-hunt speakers construct the proposed ban on hunting (as ridiculous, illogical and absurd) and the alternative decision to keep hunting (as reasonable, sensible and rational). While this way of presenting a position would appear to make sense in any argument, it should be noted that the anti-hunt speakers did not tend to characterize the debate in this way. They did not argue, for example, that their position was sensible, reasonable, etc., and that of their opponents was ridiculous and absurd. It is also worth noting that one feature of hegemonic discourses is that they are seen as ‘common-sense’ ways of thinking. To refer continually to your arguments in terms of ‘common-sense’ is therefore a powerful rhetorical strategy. With this sort of analysis, we are not only seeing the presence of discourses in texts, but we are also uncovering evidence of how they are repeatedly presented as the ‘right’ way of viewing the world. What other key categories of meaning did the pro-hunters tend to focus on? The G2.2 tag was affixed to a set of words relating to ethics, including ‘moral’, ‘rights’, ‘principles’, ‘humane’, ‘morality’, ‘ethical’, ‘legitimate’, ‘noble’ and ‘fair’. It appears that the pro-hunt speakers are more likely to argue their position from an explicitly ethical standpoint – a somewhat surprising finding given that the ethical position of ending cruelty to animals would appear to be a more obvious stance for the anti-hunt protesters to have taken. However, a closer examination of a concordance of words which receive the G2.2 tag (Table 8.6) reveals that the


Table 8.6  Concordance (sample) of words tagged as G2.2 – ‘ethics: general’ (pro-hunt)

 1      e should be careful about imposing our  morality  on other people, someone on the Lab
 2       ople to make up their own minds about  morality  . One of the issues that I dealt with as
 3     In any event, they are surely moral and  ethical   issues to be considered by individu
 4     g, vivisection and slaughter? There are  moral     gradations here and no moral absolut
 5  the Bill that it is based on no consistent  ethical   principle. I was rather pleased when
 6     ere is a complete absence of consistent  ethical   principles in the contents of the Bill.
 7    at not an issue? Is hunting not the more  humane    method of controlling the fox pop
 8       omeryshire (Lembit Öpik). There is no  moral     justification for the Government’s po
 9     questions involved, will he explain the  moral     difference between a gamekeeper us
10     en. Predators do not consider the moral  rights    and wrongs as we do as human bein

pro-hunt speakers are pre-occupied with issues of morality because they wish to question the supposed absolutist ethical standpoint of the anti-hunters. Therefore, their frequent references to ethics are based on attempts to problematize or complicate the ethical position of the anti-hunters: again, this finding complements and widens the analysis of the word ‘cruelty’ above.
What about the other side of the debate? One semantic category which occurred more often in the speech of those who are opposed to hunting was ‘S1.2.5 Toughness; strong/weak’. This category consists of words such as ‘tough’, ‘strong’, ‘stronger’, ‘strength’, ‘strengthening’, ‘robust’, ‘weak’ and ‘feeble’ (Table 8.7 shows a small sample of these cases). On this side of the debate, then, the pro-hunt stance is viewed as weak, whereas the proposed Bill is frequently characterized as tough, strong or robust. So here we have a significant difference in the ways that the two sides of the debate try to position themselves as correct. While the pro-hunt debate frames itself in terms of what is sensible, the anti-hunt debate uses strength as its criterion.
A semantic tagging of the corpus, then, helps to reveal some of the more general categories of meaning which are used in the construction of discourse positions on different sides of the debate. The pro-hunt speakers talk in terms of what is sensible, whereas the anti-hunt speakers talk in terms of what is strong. On their own, individual words like ‘strong’, ‘tough’, ‘sensible’ and ‘rational’ did not appear as keywords – it was only by considering them as a single part of a wider semantic category that their importance became apparent. Widening the scope of keywords beyond the lexical level can, therefore, prove fruitful.


Table 8.7  Concordance (sample) of words tagged as S1.2.5 – ‘Toughness; strong/weak’ (anti-hunt)

 1       to the Bill, we would have incredibly  strong      legislation with which to tackle hunti
 2    lleagues to unite today in getting good,  strong      legislation through the House. I hope
 3    n. However, although the current Bill is  strong      in that respect, it does not set the th
 4      Hon. Lady’s argument is not especially  strong      . The Bill is good in that it takes us
 5  stands is far from imperfect. It is a very  strong      Bill. It deals with the issue of cruelty
 6          the other Government amendments to  strengthen  the Bill are agreed, I can give the Ho
 7   practicable in their area. The measure is  tough       but fair, and it will be simple to
 8              The tests, as I have said, are  tough       but fair. Supporters of hunting say th
 9    Eve in while being seen by the public as  tough       and fair and being strong enough to
10    upport it appear to be unable to see the  weakness    of their case. Having given every op

Conclusion
A keyword list is a useful tool for directing researchers to significant lexical differences between texts. However, care should be taken to ensure that too much attention is not given to lexical differences whilst ignoring differences in word usage and/or similarities between texts. Carrying out comparisons between three or more sets of data, grouping infrequent keywords according to discursive similarity, carrying out analyses on semantically annotated data, and conducting supplementary concordance and collocational analyses will enable researchers to obtain a more accurate picture of how keywords function in texts. Although a keyword analysis is a relatively objective means of uncovering lexical salience between texts, we should not forget that the researcher must specify their cut-off points in order to determine levels of salience; such a procedure requires more analysis to establish how cut-off points can influence research outcomes. When used sensitively, keywords can reveal a great deal about frequencies in texts which is unlikely to be matched by researcher intuition. However, as with all statistical methods, how the researcher chooses to interpret the data is ultimately the most important aspect of corpus-based research.

Chapter 9

Love – ‘a familiar or a devil’? An Exploration of Key Domains in Shakespeare’s Comedies and Tragedies
Dawn Archer, Jonathan Culpeper, Paul Rayson

Introduction
Keyword analysis has proved to be a very useful means of determining the aboutness of a text (or texts) and/or the style of a text, and for focussing researchers’ attention on aspects of a text (or texts) that deserve further enquiry. Importantly, a number of researchers who engage in keyword analysis group their keywords semantically, i.e. according to related or shared semantic space(s). For example, McEnery (Chapter 7) has grouped some of the (key) keywords within his moral panic categories into particular semantic fields so that he can learn more about their aboutness: the ‘scapegoat’ category, for instance, contains ‘people’, ‘research’, ‘broadcast programmes’, ‘media’, ‘media organisations and officers’ and ‘groups’. Culpeper’s grouping of keywords relating to the main characters in Romeo and Juliet was determined by a different motivation, i.e. what they might tell us about characterization within the play. Romeo’s top three keywords – ‘beauty’, ‘blessed’ and ‘love’ – identify him as the lover of the play, for example, whilst other keywords relating to Romeo – ‘eyes’, ‘lips’ and ‘hand’ – highlight a related concern with the physical. Juliet, Romeo’s love interest, has very different keywords, the most key being ‘if’, ‘yet’, ‘but’ and ‘would’. On further investigation, Culpeper has found that many of Juliet’s usages of these lexical items ‘reflect the fact that Juliet is in a state of anxiety for much of the play’. The keywords associated with Juliet’s nurse differ from those of both Romeo and Juliet. Indeed, the majority – ‘god’, ‘warrant’, ‘faith’, ‘marry’, ‘ah’ – can be categorized as surge features, that is to say, they

J. Culpeper, ‘Computers, Language and Characterisation: An Analysis of Six Characters in Romeo and Juliet’, in U. Melander-Marttala, C. Ostman and M. Kytö (eds), Conversation in Life and in Literature: Papers from the ASLA Symposium (Uppsala: Association Suédoise de Linguistique Appliquée, 2002), pp. 11–30.
Ibid., p. 20.


reflect ‘outbursts of emotion’. Interestingly, when Culpeper explored these surge features, he found that they marked occasions when the nurse was reacting to quite traumatic events (involving Juliet, in particular) and therefore should not be regarded as a character trait, per se. This should alert us to the importance of contextualizing keywords – a point often made but not always carried out convincingly.
In contrast to Baker (Chapter 8, this volume), we take the grouping of keywords into related semantic spaces one step further in this chapter, by adopting a procedure that begins with the automatic identification of key domains in six Shakespearean plays – i.e. Othello, Antony and Cleopatra, Romeo and Juliet, A Midsummer Night’s Dream, The Two Gentlemen of Verona and As You Like It – using the UCREL Semantic Analysis System (USAS), and then goes on to identify keywords within these different key domains. The benefit of such an approach is that we are able to identify words that would not have been picked up by a keyword analysis (because they are not deemed to be key in and of themselves) but which nonetheless add to the aboutness of a text, because they share the same semantic space as the keywords. This key domains method is further described in Rayson. As will become clear, our approach also enables us to provide empirical support for the kinds of conceptual metaphor put forward by cognitive metaphor theorists when studying Shakespeare (see, for example, the work of Freeman and Barcelona Sánchez). To illustrate, Barcelona Sánchez discusses the metaphorical basis of romantic love in Romeo and Juliet, in terms of the overarching conceptual metaphor, LOVE IS THE UNITY OF ITS COMPLEMENTARY PARTS.
As love is a common theme within Shakespeare and conceptual metaphors relating to love have been studied in some detail by cognitive metaphor theorists in a variety of literary and non-literary texts (including Shakespeare), we have chosen to explore the concept of love in our Shakespearean dataset. However, rather than focusing on each text individually, A Midsummer Night’s Dream,

I. Taavitsainen, ‘Personality and Styles of Affect in The Canterbury Tales’, in G. Lester (ed.), Chaucer in Perspective: Middle English Essays in Honour of Norman Blake (Sheffield: Sheffield Academic Press, 1999), pp. 218–34.
J. Culpeper, ‘Computers, Language and Characterisation: An Analysis of Six Characters in Romeo and Juliet’, in U. Melander-Marttala, C. Ostman and M. Kytö (eds), Conversation in Life and in Literature: Papers from the ASLA Symposium (Uppsala: Association Suédoise de Linguistique Appliquée, 2002), pp. 11–30.
P. Rayson, ‘From key words to key semantic domains’, International Journal of Corpus Linguistics, 13: 4 (2008): 519–49.
D.C. Freeman, ‘“Catch[ing] the nearest way”: Macbeth and Cognitive Metaphor’, Journal of Pragmatics, 24 (1995): 689–708; A. Barcelona Sánchez, ‘Metaphorical Models of Romantic Love in Romeo and Juliet’, Journal of Pragmatics, 24 (1995): 667–88.
Ibid.
The source data for our research is taken from the Nameless Shakespeare Corpus hosted by Northwestern University. For more details, see .


The Two Gentlemen of Verona and As You Like It are explored collectively as love-comedies and Othello, Antony and Cleopatra and Romeo and Juliet are explored collectively as love-tragedies. The approach we adopt can be described as top-down, in that the categories are pre-defined and applied automatically by the USAS system. A breakdown of those categories and an explanation of the methodology we followed in this investigation are given in the methodology section. We then discuss the results of the automatic analysis, before considering the innovative study of key collocates at the domain level. Finally, we conclude this chapter by reflecting on the results and methodological implications of the research.
Methodology
In order to explore the key domains within our dataset, we initially annotated the Nameless Shakespeare, using USAS, and made manual adjustments where necessary10 before re-tagging the data in their collective groupings, i.e. love-comedies and love-tragedies. As the original USAS system was designed to undertake the automatic semantic analysis of present-day English language, we employed the historical version of the tagger. The Historical Semantic Tagger, which has been developed by Archer and Rayson, includes supplementary historical dictionaries to reflect changes in meaning over time and a pre-processing step to detect variant (i.e. non-modern) spellings11. Indeed, the variant detector (VARD) tool used for this investigation currently searches for over 45,000 variant spellings and inserts the modern equivalent alongside the variant spelling in each case (as a ‘reg tag’).12 We have found that this feature, in particular, greatly facilitates the application of those standard corpus linguistic methods which are otherwise

P. Rayson, D. Archer, S.L. Piao and T. McEnery, ‘The UCREL Semantic Analysis System’, proceedings of the workshop on Beyond Named Entity Recognition Semantic Labelling for NLP Tasks in association with the fourth international conference on Language Resources and Evaluation (LREC, 2004), pp. 7–12.
10 We were aided in the initial tagging and checking process by students from Northwestern University.
11 D. Archer, T. McEnery, P. Rayson and A. Hardie, ‘Developing an Automated Semantic Analysis System for Early Modern English’, in D. Archer, P. Rayson, A. Wilson and T. McEnery (eds), Proceedings of the Corpus Linguistics 2003 Conference, UCREL Technical Paper Number 16 (Lancaster: UCREL, 2003), pp. 22–31.
12 Archer, Rayson and Baron have developed VARD so that it makes use of fuzzy matching algorithms in addition to the simple search and replace scripts. The algorithms allow previously unseen variants to be matched to their ‘correct’ modern equivalents. They are now working with Smith to incorporate context rules: the context rules will allow the detection of real-word spelling variants (for example, ‘bee’ instead of ‘be’) and the detection (and ‘correction’) of morphological inconsistencies (for example, the ‘correction’ of ‘(e)s’ to ‘’s’ where we would expect the genitive today).


Table 9.1  The top level of the USAS System

A. General and Abstract terms
B. The Body and the Individual
C. Arts and Crafts
E. Emotional Actions, States and Processes
F. Food and Farming
G. Government and the Public Domain
H. Architecture, Building, Houses and the Home
I. Money and Commerce
K. Entertainment, Sports and Games
L. Life and Living Things
M. Movement, Location, Travel and Transport
N. Numbers and Measurement
O. Substances, Materials, Objects and Equipment
P. Education
Q. Linguistic Actions, States and Processes
S. Social Actions, States and Processes
T. Time
W. The World and Our Environment
X. Psychological Actions, States and Processes
Y. Science and Technology
Z. Names and Grammatical Words

hindered by multiple variant spellings, i.e. frequency profiling, concordancing, key word analysis, etc.13
The taxonomy employed in the (modern and historical) USAS system presently uses a hierarchy of 21 major domains, expanding into 232 semantic field tags. Table 9.1 shows the top-level domains (see Appendix 2 for the full taxonomy). The USAS system initially assigns part-of-speech tags to each word in a text prior to assigning one or more of the 232 semantic field labels. Portmanteau tags are used for those senses that straddle the borders of two or more semantic fields (such as ‘alehouse’, which borders the domains F and H). A key feature is the marking of multi-word expressions as single units for semantic analysis. The USAS taxonomy was derived from that of McArthur,14 and has been considerably revised in the light of practical application. We are continuing to evaluate its suitability for the Early Modern English period through studies such as the one described in this chapter.15

13 For further details, see P. Rayson, D. Archer and N. Smith, ‘VARD Versus Word: A Comparison of the UCREL Variant Detector and Modern Spell Checkers on English Historical Corpora’, Proceedings of the Corpus Linguistics Conference Series On-Line EJournal 1:1 (2005).
14 T. McArthur, Longman Lexicon of Contemporary English (Longman, 1981).
15 Archer and Rayson are also exploring the feasibility of tagging to other thesauri, including Spevack’s (1993) A Thesaurus of Shakespeare and Historical Thesaurus of English.
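As a rough illustration of the tag structure described above, the following Python sketch splits a semantic tag such as S1.2.6– or E2+ into its top-level domain, its semantic field and any trailing plus/minus markers, and separates a portmanteau tag into its component readings. It is only a sketch under stated assumptions: that portmanteau tags are written with a slash between the component tags, and that the minus sign is typed as a plain hyphen. The function name parse_tag and the sample portmanteau tag F1/H1 are hypothetical, chosen purely for illustration; they are not taken from the USAS software.

```python
import re

# Top-level domains as listed in Table 9.1 (letter -> label).
TOP_LEVEL = {
    "A": "General and Abstract terms",
    "B": "The Body and the Individual",
    "C": "Arts and Crafts",
    "E": "Emotional Actions, States and Processes",
    "F": "Food and Farming",
    "G": "Government and the Public Domain",
    "H": "Architecture, Building, Houses and the Home",
    "I": "Money and Commerce",
    "K": "Entertainment, Sports and Games",
    "L": "Life and Living Things",
    "M": "Movement, Location, Travel and Transport",
    "N": "Numbers and Measurement",
    "O": "Substances, Materials, Objects and Equipment",
    "P": "Education",
    "Q": "Linguistic Actions, States and Processes",
    "S": "Social Actions, States and Processes",
    "T": "Time",
    "W": "The World and Our Environment",
    "X": "Psychological Actions, States and Processes",
    "Y": "Science and Technology",
    "Z": "Names and Grammatical Words",
}

# One tag = domain letter, dotted numeric subdivisions, optional +/- markers.
TAG_PATTERN = re.compile(r"^([A-Z])(\d+(?:\.\d+)*)([+-]*)$")


def parse_tag(tag):
    """Split a (possibly portmanteau) tag into its component readings."""
    readings = []
    for part in tag.split("/"):  # assumption: '/' joins portmanteau components
        match = TAG_PATTERN.match(part)
        if match is None or match.group(1) not in TOP_LEVEL:
            raise ValueError(f"unrecognised tag: {part!r}")
        letter, digits, polarity = match.groups()
        readings.append({
            "domain": TOP_LEVEL[letter],   # one of the 21 major domains
            "field": letter + digits,      # e.g. 'S1.2.6'
            "polarity": polarity,          # '', '+', '-', '++', ...
        })
    return readings


print(parse_tag("S1.2.6-"))
print(parse_tag("F1/H1"))  # hypothetical portmanteau tag spanning F and H
```

Keeping the domain letter separate from the numeric subdivisions makes it straightforward to aggregate frequencies either at the level of the 21 major domains or at the level of individual semantic fields.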


The second stage of the analysis of love in the Shakespearean dataset was to compare semantic tag frequency profiles of the love-tragedies against the love-comedies. This was achieved using the log-likelihood statistic applied to the semantic tag frequencies. This step is analogous to the well-known key words procedure implemented in WordSmith Tools.16 Here, we extended the technique to compare tag frequencies rather than word frequencies using the Wmatrix software.17 By calculating the log-likelihood (LL) statistic for each tag and then sorting the profile by the result, we were able to see the most overused and underused semantic fields in the love-comedies relative to the love-tragedies. This technique has already been applied to a large Forced Migration Online corpus and has shown that improved efficiency over the standard key word technique can be achieved.18
The third stage of the analysis involved finding significant collocates for the key semantic fields. Our motivation was to discover which semantic tags collocate significantly with a small number of the key semantic tags selected from stage two. We were not aware of any off-the-shelf software tool which performs this task. We therefore used a Multilingual Corpus Toolkit,19 which implements a number of well-known collocation statistics. The text was prepared from the tagged version by stripping out the words and leaving only the sequences of semantic tags and sentence breaks. The mutual information (MI) statistic with a window of +/–5 was applied to calculate tag collocates. One concern is that the relatively high frequencies of tags compared with words will result in negative MI values, but all the results we quote have positive MI values.
Results
In discussing our results, below, we will make some reference to cognitive metaphor theory (as developed by the Lakoff, Johnson, Turner group). The application of cognitive metaphor theory to literary texts – Shakespeare in particular – has been established by people like Freeman.20 Given our interest in the concept of love,

16 M. Scott, ‘Focusing on the Text and Its Key Words’, in L. Burnard and T. McEnery (eds), Rethinking Language Pedagogy from a Corpus Perspective (Frankfurt: Peter Lang, 2000), pp. 103–22.
17 P. Rayson, ‘Matrix: A Statistical Method and Software Tool for Linguistic Analysis Through Corpus Comparison’, PhD thesis, Lancaster University, 2003.
18 D. Archer and P. Rayson, ‘Using the UCREL Automated Semantic Analysis System to Investigate Differing Concerns in Refugee Literature’, in M. Deegan, L. Hunyadi and H. Short (eds), The Keyword Project: Unlocking Content Through Computational Linguistics, OHC publications 18 (Office for Humanities Communication Publications, forthcoming).
19 S.L. Piao, A. Wilson and T. McEnery, ‘A Multilingual Corpus Toolkit’, paper given at AAACL–2002, Indianapolis, Indiana, USA, 2002.
20 For example, Freeman ‘“Catch[ing] the nearest way”’.


we will be making particular use of Barcelona Sánchez,21 who analyses the love metaphors in Romeo and Juliet, and Oncins-Martínez,22 who discusses metaphors relating to sexual activity in Early Modern English. However, the reader should note that our focus on conceptual metaphors in this chapter is not meant to suggest that the USAS system can only be used for such analyses; indeed, Archer and McIntyre23 have used the USAS approach to investigate mind style in literary texts, and Archer and Rayson,24 to investigate the different representations of refugees and their plight within refugee literature. Rather, our focus is merely meant to reflect the fact that some of the semantic fields we identify have metaphorical relationships with one another. That said, we do want to show how the USAS system may be used to provide empirical support for metaphor-based research and, importantly, indicate previously undiscovered conceptual metaphors. We will begin with a discussion of the most overused items in the comedies, relative to the tragedies.
The most overused items in the comedies relative to the tragedies
Nine semantic fields received an LL score above 15.13 in the comparison. This means that these semantic fields were significantly overused (at p < 0.0001, 1 d.f.) in the love-comedies relative to the love-tragedies. As length constraints prevent
Table 9.2

The most overused items in the comedies relative to the tragedies

                                       Comedies         Tragedies        Log
Semantic field                         Freq.    %       Freq.    %       Likelihood
S3.2 = intimate/sexual relationship    379      0.64    292      0.36    55.50
L2 = living creatures                  343      0.58    279      0.34    42.30
L3 = plants                            149      0.25    94       0.12    35.99
S1.2.6– = (not) sensible               72       0.12    32       0.04    31.02
X3.1 = sensory: taste                  120      0.20    91       0.11    18.41
E2+ = liking                           325      0.55    321      0.39    17.36
T3– = old, new, young: age             153      0.26    128      0.16    17.12
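The log-likelihood scores in Table 9.2 can be approximated with a short Python sketch of a standard two-corpus log-likelihood calculation; whether Wmatrix computes the statistic in exactly this form is an assumption. The chapter does not report the total number of tagged items in each group of plays, so the two totals below are also assumptions, back-derived from the rounded percentages in the table (for example, 379 / 0.0064), and the result can therefore only approximate the published score. All function and variable names are illustrative.

```python
import math


def log_likelihood(freq_a, total_a, freq_b, total_b):
    """Two-corpus log-likelihood for one item (a word or a semantic tag)."""
    combined = freq_a + freq_b
    # Expected frequencies if the item were spread evenly across both corpora.
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero frequency contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Assumed corpus sizes, estimated from the rounded percentages in Table 9.2.
COMEDIES_TOTAL = 59_200
TRAGEDIES_TOTAL = 81_100
CRITICAL_VALUE = 15.13  # cut-off quoted in the chapter (p < 0.0001, 1 d.f.)

# S3.2 'Intimate/sexual relationship': 379 (comedies) vs 292 (tragedies).
s32 = log_likelihood(379, COMEDIES_TOTAL, 292, TRAGEDIES_TOTAL)
print(f"S3.2: LL = {s32:.2f}, above cut-off: {s32 > CRITICAL_VALUE}")
```

With these estimated totals the score for S3.2 should land in the mid-50s, in the same region as the 55.50 reported in the table; an exact match would require the true corpus sizes.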

21 Barcelona Sánchez, ‘Metaphorical Models of Romantic Love in Romeo and Juliet’.
22 J.L. Oncins-Martínez, ‘Notes on the Metaphorical Basis of Sexual Language in Early Modern English’, in J.G. Vázquez-González et al. (eds), The Historical Linguistics-Cognitive Linguistics Interface (Huelva: University of Huelva Press, 2006).
23 D. Archer and D. McIntyre, ‘A Computational Approach to Mind Style’, paper given at the 25th conference of the Poetics and Linguistics Association, University of Huddersfield, July 2005.
24 Archer and Rayson, ‘Using the UCREL Automated Semantic Analysis System to Investigate Differing Concerns in Refugee Literature’.


Table 9.3  Participants in intimate/sexual relationships

The two participants    Male participant    Male or female participant    Female participant
Couples                 Lover               Love                          Virgin
Lovers                  Suitor                                            Wanton

a detailed discussion of all of the statistically significant fields, we limit our discussion in this section to the seven fields listed in Table 9.2. It is noticeable that S3.2 ‘Intimate/sexual relationships’ and E2+ ‘Liking’ are amongst the most overused semantic fields in the love-comedies (relative to the love-tragedies). This means that S3.2 ‘Intimate/sexual relationships’ and E2+ ‘Liking’ represent two of the most underused semantic fields in the love-tragedies, when compared with the love-comedies. We will say more about this underuse of the love-related semantic fields in the love-tragedies in the next section. Interestingly, the dominant lexical patterns within S3.2 ‘Intimate/sexual relationships’ can be characterized in terms of Hallidayan-type25 participants (see Table 9.3) and processes (see Table 9.4). These results very clearly reflect an Early Modern patriarchal view of love, in which the male (in the role of ‘lover’ or ‘suitor’) undertakes certain acts (e.g. ‘kissing’), which the female suffers (she is ‘seduced’, ‘deflowered’), with the result that she switches from ‘virgin’ to ‘wanton’.
The semantic field L2 ‘Living creatures’ appears as the second most overused field (see Table 9.2). Many of the lexical items within this field can be subsumed by the metaphor LOVE IS A LIVING BEING and related metaphors, such as THE OBJECT OF LOVE IS AN ANIMAL. Perhaps unexpectedly, and contrary to the items which Barcelona Sánchez26 discusses for Romeo and Juliet, the bulk of these items
Table 9.4

Processes in intimate/sexual relationships

Transitive processes    Transitive processes    Intransitive processes    Transitive processes
with male agents        with male or female     with male or female       with female patients
                        agents                  agents
Kiss                    Loves                   Fall in love              Seduced
Kissing                                         Falling in love           Deflowered
Kissed                                          Fallen in love
Kisses                                          Fell in love

25 M.A.K. Halliday, An Introduction to Functional Grammar, 2nd edn (London: Edward Arnold, 1994).
26 Barcelona Sánchez, ‘Metaphorical Models of Romantic Love in Romeo and Juliet’ (1995), p. 683.


(e.g. ‘bears’, ‘serpent’, ‘snail’, ‘monster’, ‘adder’, ‘snake’, ‘claws’, ‘chameleon’, ‘worm’, ‘monkey’, ‘ape’, ‘weasel’, ‘toad’, ‘rat’) have strong negative associations, semantically speaking. Indeed, very few can be described as neutral (e.g. ‘cattle’, ‘horse’, ‘goats’, ‘creature’ and ‘capon’) or positive (e.g. ‘deer’, ‘dove’, ‘nightingale’). Moreover, some of the items within the positive list are problematic; ‘deer’, for example, appears to be used positively in our texts. However, ‘deer’ is linked to cuckoldry in many of Shakespeare’s plays (e.g. Love’s Labour’s Lost, The Merry Wives of Windsor), which may indicate that ‘deer’ had negative undertones for both Shakespeare and his audience.27 It is also worth noting that many of the negative lexical items in the love-comedies are personifications which relate to other metaphors suggested by Barcelona Sánchez.28 They include LOVE IS WAR29 and LOVE IS PAIN, both of which are clearly relevant in the context of unrequited love, as the following example demonstrates:

Lysander
Hang off, thou cat, thou burr! vile thing, let loose,
Or I will shake thee from me like a serpent!
Hermia
Why are you grown so rude? what change is this?
Sweet love, –
(A Midsummer Night’s Dream)

As will become clear, LOVE IS WAR and LOVE IS PAIN are also important conceptual metaphors in the love-tragedies (see below).
L3 ‘Plants’ is the third most overused semantic field in the love-comedies (relative to the love-tragedies). This category is a useful one in that it highlights the importance of thoroughly checking the items captured by the different USAS tags. By way of illustration, the most frequent item in L3 – ‘mustardseed’ – is a character’s name, and the second most frequent item, ‘flower’, occurs as part of the multi-word unit ‘Cupid’s flower’, i.e. the flower that Oberon used to send Titania to sleep in A Midsummer Night’s Dream. More importantly, the bulk of the remaining items in the ‘Plants’ semantic field can be explained by the fact that As You Like It and A Midsummer Night’s Dream are set in woods (something which is not true of any of the tragedies) – and, if they were removed, the keyness of this category would probably decrease substantially. That said, a small number of the L3 items have a strong metaphorical association with ‘love’ or ‘sex’. For example, Silvius states that he is prepared to have Phoebe as a wife in spite of her less-than-virginal state in As You Like It:

27 Prof John Joughin (personal communication).
28 Barcelona Sánchez, ‘Metaphorical Models of Romantic Love in Romeo and Juliet’ (1995).
29 LOVE IS WAR is a long-established metaphor for love. See, for example, G. Lakoff and M. Johnson, Metaphors We Live By (Chicago and New York: University of Chicago Press, 1980), p. 49.


Silvius
So holy and so perfect is my love,
And I in such a poverty of grace,
That I shall think it a most plenteous crop
To glean the broken ears after the man
That the main harvest reaps: loose now and then
A scattered smile, and that I’ll live upon.

According to Oncins-Martínez,30 this type of usage was common in the Early Modern English period. Indeed, he argues that the general conceptual metaphor SEX IS AGRICULTURE and its sub-mappings, A WOMAN’S BODY IS AGRICULTURAL LAND, COPULATION IS PLOUGHING OR SOWING and GESTATION AND BIRTH IS HARVESTING, underlie linguistic expressions which permeate many texts from this period.
The semantic field ‘S1.2.6 (Not) sensible’ does not refer to the older meaning of ‘not having the capacity to sense (feel)’ but to being foolish, silly, stupid, and so on. Interestingly, the metaphorical associations are much stronger in this semantic field than in L3 ‘Plants’. Indeed, many of the items can be accounted for by Lakoff and Johnson’s31 LOVE IS MADNESS metaphor,32 as exemplified here:

Silvius
O Corin, that thou knew how I do love her!
Corin
I partly guess; for I have loved ere now.
Silvius
No, Corin, being old, thou can not guess,
Though in thy youth thou was as true a lover
As ever sighed upon a midnight pillow:
But if thy love were ever like to mine –
As sure I think did never man love so –
How many actions most ridiculous
have thou been drawn to by thy fantasy?
Corin
Into a thousand that I have forgotten.
Silvius
O, thou did then ne’er love so heartily!
If thou remember not the slightest folly
That ever love did make thee run into,
Thou have not loved […]
(As You Like It)

‘T3 Old, new, young: age’ also appears to relate to the LOVE IS MADNESS metaphor, for the reason that being young and being in love are assumed to be states that are accompanied by a lack of rational thought. In the following extract from A

30 Oncins-Martínez, ‘Notes on the Metaphorical Basis of Sexual Language in Early Modern English’ (2006).
31 Lakoff and Johnson, Metaphors We Live By (1980), p. 49.
32 Cf. Barcelona Sánchez’s LOVE IS INSANITY in ‘Metaphorical Models of Romantic Love in Romeo and Juliet’ (1995), p. 679.


Midsummer Night’s Dream, for example, love is ‘said to be a child’ because of its capacity to ‘beguile’:

Nor has Love’s mind of any judgement taste;
Wings and no eyes figure unheedy haste:
And therefore is Love said to be a child,
Because in choice he is so oft beguiled.

Significantly, all the items within this semantic field relate to the early years of life, the most frequent items being ‘youth’ and ‘young’. Some of these lexical items, in turn, modify ‘lover(s)’.
The lexical items that constitute ‘X3.1 Sensory: taste’ fall into three groups:

sweet       bitter        taste
sweetest    bitterness    tastes
sweeter     sourest
            sour

The first group is very much part of ‘sweet talk’ used in courtship. The most frequent item by far is ‘sweet’ with 94 instances;33 the next most frequent item is ‘bitter’ with 12 instances. The connection with love can be seen in the metaphor LOVE IS FOOD.34 In the example below, ‘loving words’ become ‘sweet honey’:

Julia
Nay, would I were so angered with the same!
O hateful hands, to tear such loving words!
Injurious wasps, to feed on such sweet honey
And kill the bees that yield it with your stings!
(Two Gentlemen of Verona)

‘Sweet’ often appears as part of vocative expressions, as in ‘sweet lady’. Although it is used for men and women together (e.g. ‘sweet lovers’) and for men (e.g. ‘sweet Proteus’), the vast majority of the instances refer to women. A possible explanation for this lies in the metaphor A WOMAN’S BODY IS AGRICULTURAL LAND, which we have discussed briefly in relation to L3 ‘Plants’ (see above). As Oncins-Martínez explains,35 the land gives rise to trees, metaphorically representing people, and the trees give rise to fruits, metonymically associated with a woman’s sexual

33 ‘Sweet’ also occurs as a key word, with a LL score of 19.14.
34 Barcelona Sánchez, ‘Metaphorical Models of Romantic Love in Romeo and Juliet’ (1995), pp. 672–3.
35 Oncins-Martínez, ‘Notes on the Metaphorical Basis of Sexual Language in Early Modern English’ (2006).


attributes. Then a link can be established with WOMAN IS AN EDIBLE SUBSTANCE and SEX IS EATING. Hence, referring to a woman as ‘sweet’ can be seen as instantiating metaphors relating to women as edible sexual objects.
The items in the second group in this semantic field, relating to bitter/sour, often relate to the troubles of love (e.g. unrequited love).

Rosalind
He’s fallen in love with your foulness and she’ll fall in love with my anger. If it be so, as fast as she answers thee with frowning looks, I’ll sauce her with bitter words.
(As You Like It)

Note how, in this particular example, the word ‘sauce’ relates to the LOVE IS FOOD metaphor: anger leads to love which is food, so anger will be a bitter sauce. The OED lists this particular example as an illustration of the sense ‘to rebuke smartly’, but the earlier sense ‘to season, dress, or prepare (food) with sauces or condiments’ was still current.
We have left our discussion of E2+ ‘Liking’ until now as our investigations have shown that it is not currently a secure category. For example, the most frequent item ‘like’ is almost always a preposition, and we are interested in ‘like’ as a verb. Moreover, cases like ‘loved’, ‘loving’, ‘beloved’, ‘dotes’, ‘enamoured’, ‘adores’, ‘adored’, ‘adoration’, ‘amorous’ and ‘doting’ are not always dealt with on a sufficiently principled basis (that is to say, different variants of the base form of these particular items occur in different positions in the categorization of both E2+ ‘Liking’ and S3.2 ‘Intimate/sexual relationship’). It is perhaps not surprising that these categories closely inter-mesh, since:

• ‘Liking’ stands in a very close relationship with ‘intimate/sexual relationship’;
• Intimate or sexual relationships presuppose physical closeness;
• Physical closeness can be caused by love (a metonymic ‘effect for cause’ relationship36);
• ‘Liking’ stands in a metonymic relationship with ‘love’ (‘a part of love stand[s] for the whole concept of love’37).

However, currently, Early Modern English words may not accurately be mapped onto such a semantic network of relations. For example, ‘lover’ did not necessarily indicate a physically intimate relationship in the Early Modern English period, but, rather, could also mean ‘friend’.38 In Table 9.5, we have combined the E2+ ‘Liking’ and S3.2 ‘Intimate/sexual relationship’ lexical items for the love-comedies and the love-tragedies together in a love-related macro category. We have also marked in boldface the lexical items that only occur in the comedies or only occur in the tragedies, so that we can more readily highlight the stronger negative associations inherent in some of the lexical items relating to the latter (e.g. ‘bewhored’, ‘carnal’, ‘cuckold’ and ‘sluttish’). Notice that these particular items appear to provide empirical support for Barcelona Sánchez’s claim that the concept of tragical love is ‘characterized by being adulterous [love] inevitably ending in death’.39

36 See Barcelona Sánchez, ‘Metaphorical Models of Romantic Love in Romeo and Juliet’, p. 671.
37 Ibid., p. 675.
38 D. Crystal and B. Crystal, The Shakespeare Miscellany (Penguin, 2005).
39 Barcelona Sánchez, ‘Metaphorical Models of Romantic Love in Romeo and Juliet’, p. 684.

Table 9.5  Love-related lexical items which occur in the comedies and tragedies

Comedies: adoration (1), adore (2), adored (1), affection (7), affections (1), after-love (1), amorous (1), applaud (2), applause (1), apple of his eye (1), beloved (8), chastity (3), cherish (1), cherished (2), copulation (1), couples (4), dear (32), deflowered (1), dote (8), dotes (4), doting (1), enamoured (2), enjoy (2), fall in love (4), fallen in love (1), falling in love (1), fancies (1), fancy (7), fell in love (1), fond (7), gone for (1), impress (1), in love (34), kiss (20), kissed (3), kisses (2), kissing (3), like (117), liked (2), liking (2), love (354), loved (28), lover (32), lovers (26), loves (26), loving (10), paramour (2), precious (5), prized (1), relish (2), revelling (1), revels (5), savours (3), seduced (1), sensual (1), suitor (2), take to (1), that way (1), virgin (4), wanton (5)

Tragedies: adore (1), adores (1), affection (8), affections (7), affinity (1), amorous (6), applauding (1), applause (1), beloved (5), beloving (1), bewhored (1), carnal (1), chamberers (1), chastity (3), cherish (1), cherished (1), cherishing (1), courts (1), cuckold (6), darling (1), darlings (1), dear (59), deflowered (1), devotion (4), dote (1), dotes (2), doting (5), enamoured (1), enjoy (3), enjoyed (1), fall in love (1), fancies (2), fancy (4), fond (8), impress (1), in love (8), kiss (25), kissed (6), kisses (13), kissing (6), liked (1), likes (1), liking (1), likings (1), love (259), loved (21), lover (6), lovers (9), loves (29), loving (12), lust (10), lusts (1), paramour (1), precious (6), prized (1), rate (3), rated (1), relish (1), revel (3), revels (4), sluttish (1), suitor (2), suitors (2), take to (1), that way (1), the other way (2), wantons (1), wooer (1)

Significantly, several of these love-related lexical items are keywords in and of themselves in the love-comedies relative to the love-tragedies. Indeed, ‘love’, ‘in love’ and ‘lover’ all have LL scores above our cut-off point of 15.13, whilst ‘lovers’, ‘dote’, ‘betrothed’, ‘couples’ and ‘virgin’ have LL scores between 6.91 and 15.13 (see italicized items). Interestingly, the USAS system chose to assign three key words with a potential link to love, ‘betrothed’, ‘woo’ and ‘desire’, to other semantic fields when assigning LL scores. The reader should be aware, then, that a key word and key domain analysis of the same data will reveal both overlap and difference.40 As Table 9.5 reveals, the strength of the USAS system is that it can identify words that would not have been picked up by a keyword analysis (because they are not deemed to be key in and of themselves) but which add, nonetheless, to the aboutness of a text, as they share the same semantic space as the keywords. However, as the USAS process is an automatic one, it is important that any results are checked thoroughly to determine their contextual relevance. That said, we would point out that the keyword list showing overused items in the love-comedies (relative to the love-tragedies) stretches to some 275 items in comparison to nine key domains, and that the entries in the keyword list are ambiguous for part-of-speech and sense. Consequently, a manual examination of concordance lines is required in addition to manual grouping into semantic patterns.41

40 For a more detailed discussion of this potential overlap/difference see Culpeper in D. Hoover, J. Culpeper and B. Louw, Approaches to Corpus Stylistics (Routledge, forthcoming).
41 See P. Rayson, ‘Matrix: A Statistical Method and Software Tool for Linguistic Analysis Through Corpus Comparison’, PhD thesis, Lancaster University, 2003, pp. 100–13, for a more detailed exploration of the advantages of the key domains approach.

The most overused items in the tragedies (relative to the comedies)

Twelve semantic fields from the tragedies achieved an LL score of over 15.13, indicating significance at the 99.99 per cent level (p < 0.0001, 1 d.f.). Space is limited, so we will be concentrating on seven specific semantic fields in the love-tragedies relative to the love-comedies (see Table 9.6).

Table 9.6  Most overused items in the tragedies relative to the comedies

Semantic tag (field)                      Tragedies       Comedies        Log
                                          Freq.   %       Freq.   %       Likelihood
G3 = warfare, defence, and the army       425     0.52    57      0.10    213.51
L1– = (lack of) life/living things        490     0.60    170     0.29    77.16
Z2 = geographical names                   399     0.49    153     0.26    49.56
E3– = (not) calm/violent/angry            343     0.42    143     0.24    33.67
M4 = movement (by sea/through water)      92      0.11    21      0.04    28.51
S9 = religion and the supernatural        644     0.79    345     0.58    21.92
S7.1– = (lack of) power/organising        193     0.24    77      0.13    21.55

Notice the lack of semantic fields that relate directly to love (e.g. ‘Intimacy/sexual relationships’, ‘Liking’), and the appearance of fields, such as ‘warfare’, ‘lack of life or living things’ and ‘geographical names’ that seem to have nothing


to do with love. As explained previously, this is because the love-related semantic fields (‘E2+ Liking’ and ‘S3.2 Intimate/sexual relationships’) are in fact amongst the most underused categories in the love-tragedies relative to the love-comedies. However, it is worth noting that many of the categories that are overused do have metaphorical links with love, albeit to differing degrees. For example, some of the items within ‘G3 Warfare, defence, and the army’ reflect the love is war metaphor,42 briefly mentioned above. Romeo, for example, comments that, ‘She will not stay the siege of loving terms, Nor bide th’ encounter of assailing eyes’. Some items in the G3 category have nothing to do with love, of course, capturing, instead, other aspects of the tragedy. The most frequent of these items, ‘soldier’, occurs most frequently in Anthony and Cleopatra, as do most cases of ‘sword’, ‘war’, ‘wars’, ‘army’, ‘battle’, ‘armour’ and ‘navy’. This is not surprising: Anthony and Cleopatra involves military power struggles. There are a few cases from Othello: Othello, and the other men, are military folk, with a military history. ‘General’ and ‘lieutenant’ are vocative forms which nearly always refer to Othello and Cassio respectively. There are very few items from Romeo and Juliet, although 50 per cent of the occurrences of ‘swords’ occur in that play (in the fight between Benvolio and Tybalt), as do most of the occurrences of ‘dagger’ (in the fight between Mercutio and Tybalt, and Juliet’s suicide with Romeo’s dagger). This key semantic field in the love-tragedies parallels the most key field in the love-comedies – ‘S3.2 Intimate/sexual relationship’ – to some extent, in that both capture the distinctive participants and processes that characterize the plots of the two genres. However, the three love-tragedies are not equally characterized by literal warfare. Indeed, the warfare in Romeo and Juliet tends to be metaphorical.
Significantly, where there is literal warfare (or military activity), there is also a strong link with the sea (which helps to explain the keyness of the ‘M4 Movement (by sea/through water)’ domain in the love-tragedies).

The 15 most frequent items in the semantic field ‘L1– (Lack of) life/living things’ are: ‘death’, ‘dead’, ‘die’, ‘kill’, ‘slain’, ‘murder’, ‘dies’, ‘killed’, ‘mortal’, ‘tomb’, ‘dying’, ‘murdered’, ‘corpse’, ‘fatal’, ‘drowned’. Not surprisingly, the bulk of these appear at the ends of the plays, in the death scenes. The field contains a mixture of literal and metaphorical usages. In the following example, ‘death’ is personified as Juliet’s lover:

Ah, dear Juliet,
Why art thou yet so fair? shall I believe
That unsubstantial death is amorous,
And that the lean abhorred monster keeps
Thee here in dark to be his paramour?
(Romeo and Juliet)

42 See Barcelona Sánchez, ‘Metaphorical Models of Romantic Love in Romeo and Juliet’, pp. 678–79.
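The keyness figures reported for the semantic fields above are log-likelihood (LL) scores. As a rough illustration only, the sketch below implements the two-corpus log-likelihood calculation that is widely used for corpus comparison; we assume, without confirmation from the chapter, that this corresponds to the statistic behind the reported scores. The function name `log_likelihood` and the two corpus sizes are hypothetical; only the frequencies 425 and 57 and the cut-off of 15.13 come from the text, so the result is not expected to reproduce the published score exactly.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-corpus log-likelihood for one item (word or semantic tag).

    freq_a, freq_b: observed frequencies of the item in corpus A and corpus B.
    size_a, size_b: total number of tokens in corpus A and corpus B.
    """
    total = freq_a + freq_b
    # Expected frequencies if the item were spread evenly across both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


if __name__ == "__main__":
    # Frequencies from the G3 row; corpus sizes are invented round figures.
    ll = log_likelihood(425, 80_000, 57, 57_000)
    CUT_OFF = 15.13  # threshold quoted in the chapter (p < 0.0001, 1 d.f.)
    print(f"LL = {ll:.2f}; key at the chapter's cut-off: {ll > CUT_OFF}")
```

An item whose relative frequency is identical in both corpora scores zero, and the score grows as the two relative frequencies diverge, which is why a field can be ‘key’ through either overuse or underuse.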


As Barcelona Sánchez43 points out, the concept of tragical love is, ‘characterized by being adulterous inevitably ending in death’.

The ‘Z2 Geographical names’ field contains a number of miscategorizations (e.g. ‘Moor’), and so its appearance as a highly key semantic field must be treated with caution. Nevertheless, this field does reflect the fact that some of the plots of the love-tragedies involve a number of different geographical locations. This is particularly true of Anthony and Cleopatra, in which the action moves back and forth between Rome and Egypt. This field seems to have no obvious link with love, metaphorical or otherwise, however.

Generally, the category ‘E3– (Not) calm/violent/angry’ captures the violent conflicts that characterize the tragedies. ‘Poison’ is the most frequent item. It may appear rather odd in this category, but it is presumably designed to capture the sense of someone who is strongly hated (see OED 3.b). Of course, this is a modern sense that does not apply to our data. Indeed, most of our instances relate to the literal poison in Romeo and Juliet. However, the idea that love is poison is also articulated in the plays, as the following example, where Cleopatra talks about Anthony, makes clear:

He’s speaking now,
Or murmuring “Where’s my serpent of old Nile?”;
For so he calls me: now I feed myself
With most delicious poison. Think on me,
That am with Phoebus amorous pinches black,
And wrinkled deep in time?
(Anthony and Cleopatra)

Some E3– ‘(Not) calm/violent/angry’ items, i.e. ‘angry’, ‘rage’ and ‘fury’, occur in all the plays. Interestingly, although there is no obvious metaphorical link with love, there is a link with the more negative aspects of love, such as adultery, jealousy and revenge. These emotional states arise mainly as a consequence of the tragic plot. Thus, for example, as Othello’s jealousies increase due to the presumed adultery of his wife, Desdemona, Iago innocently asks, ‘Is my lord angry?’ whilst Desdemona, perplexed, asks, ‘What, is he angry?’ All but one of the nine instances of ‘revenge’ come from Othello – the only play in our data that has the plot of a ‘Revenge Tragedy’.

The results of the semantic field ‘S9 Religion and the supernatural’ are slightly skewed by 85 instances of ‘Friar’ as a term of address (and also by ‘holy’ in the vocative ‘Holy Friar’). Similarly, ‘pray’ is usually part of the politeness formula ‘I pray you’. The most frequent item, ‘heaven’, is often used as part of an appeal (e.g. ‘heaven defend your good souls’, Othello). Almost all of the uses of ‘heaven’ are from Othello and, to a lesser extent, Romeo and Juliet. The same is also true of the distribution of ‘soul’ and ‘devil’. This result may be indicative of the characters’ experiences in these plays, in the sense that some of the characters endure more protracted bouts of torment as a consequence of the tragic plot. Alternatively, some of the usages of ‘heaven’, ‘soul’ and ‘devil’ within Othello may reflect the fact that Othello is a deeply religious character. A small number of the ‘S9 Religion and the supernatural’ items can be accounted for by the metaphor object of love is a deity,44 as illustrated by the following examples:

Roderigo  I can not believe that in her; she’s full of most blessed condition. (Othello)

Romeo  If I profane with my unworthiest hand
This holy shrine, the gentle sin is this:
My lips, two blushing pilgrims, ready stand
To smooth that rough touch with a tender kiss.
(Romeo and Juliet)

43 Ibid., p. 684.

Rather fewer of the items in this category appear in Anthony and Cleopatra, but, when they do, they tend to reflect the non-Christian setting of the play (i.e. ‘gods’ and ‘soothsayer’).

Our final semantic field, ‘S7.1– (Lack of) power/organising’, is somewhat skewed by the most frequent item, ‘servant’, which is a character name (e.g. ‘First Servant’). The word ‘wench’ is another possible miscategorization, as the original meaning of ‘young woman’ was still current in this period, alongside newer senses indicating a ‘girl of the rustic or working class’, ‘a wanton’ or ‘a female servant’ (OED) – senses which denote someone of low power. But there are many items – ‘knave’, ‘sirrah’, ‘minion’, ‘churl’, ‘slave’ and so on – that clearly do reflect lack of power, and are often used to hurl abuse at characters. The bulk of these do not connect with love, or even the tragic conception of it. Instead, they reflect the fact that the love-tragedies revolve around hierarchical power structures rather more than the love-comedies.

44 Ibid., p. 674.

Moving towards an analysis of collocations at the domain level

As well as being interested in key domains, we are interested in the extent to which important collocational information can be discovered at the domain level (rather than the word level). Table 9.7, then, captures the domains that the category ‘S3.2 Intimate relationship’ collocated most strongly with in the love-comedies. Due to length constraints, our discussion here will concentrate on two of the four: ‘B1 Anatomy and physiology’ and ‘Z8m Pronouns’.

Table 9.7  Domain collocates of S3.2 ‘Intimate relationship’ in the comedies

O2 = objects                   MI = 2.392   ‘I kiss the instrument of their pleasures’
A1.1.1 = general actions       MI = 1.412   ‘Think true love acted simple modesty’
B1 = anatomy and physiology    MI = 1.317   ‘… a fire sparkling in lovers eyes’
Z8m = pronouns (male)          MI = 1.298   ‘if thou Can cuckold him, thou do thyself a pleasure, me a sport’

The presence of the semantic field ‘B1 Anatomy and physiology’ is not surprising, given that the ‘embodiment’ of meaning is perhaps the central idea of the cognitive view of meaning.45 Moreover, the human body, so close to us as to be tangible, is an obvious source domain for metaphorically understanding abstract targets such as love. ‘Eyes’ (or ‘eye’) and ‘heart’ are the most frequent items within the B1 semantic field, with 175 occurrences. The bulk of the instances of ‘eyes’ occur in A Midsummer Night’s Dream – remember that Puck puts the love potion in Titania’s eyes. Elsewhere, there is a strong notion that a woman’s eyes were an aspect of her beauty that could capture men. Barcelona Sánchez46 suggests that the underlying metaphor here is eyes are containers for superficial love, which seems to be a development of Lakoff and Johnson’s eyes are containers for the emotions. In fact, the idea of a container is not clearly articulated in the comedy data. Indeed, we would suggest that eyes are weapons of entrapment is a more appropriate conceptual metaphor for our data, as below (cf. the related metaphor love is war):

Orlando  Wounded it is, but with the eyes of a lady. (As You Like It)

Valentine  This is the gentleman I told your ladyship
Had come along with me, but that his mistress
Did hold his eyes locked in her crystal looks.
Silvia  Belike that now she has enfranchised them
Upon some other pawn for fealty.
Valentine  Nay, sure, I think she holds them prisoners still.
Silvia  Nay, then he should be blind; and, being blind
How could he see his way to seek out you?
Valentine  Why, lady, Love has twenty pair of eyes.
Thurio  They say that Love has not an eye at all.
Valentine  To see such lovers, Thurio, as yourself:
Upon a homely object Love can wink.
(Two Gentlemen of Verona)

Helena  How happy some o’er other some can be!
Through Athens I am thought as fair as she.
But what of that? Demetrius thinks not so;
He will not know what all but he do know:
And as he errs, doting on Hermia’s eyes,
So I, admiring of his qualities:
Things base and vile, holding no quantity,
Love can transpose to form and dignity:
Love looks not with the eyes, but with the mind;
And therefore is winged Cupid painted blind:
Nor has Love’s mind of any judgement taste;
Wings and no eyes figure unheedy haste:
And therefore is Love said to be a child,
Because in choice he is so oft beguiled.
As waggish boys in game themselves forswear,
So the boy Love is perjured every where:
For ere Demetrius looked on Hermia’s eyne,
He hailed down oaths that he was only mine […]
(A Midsummer Night’s Dream)

45 Z. Kövecses, Metaphor: A Practical Introduction (New York: Oxford University Press, 2002), p. 16.
46 Barcelona Sánchez, ‘Metaphorical Models of Romantic Love in Romeo and Juliet’ (1995), p. 679.

Apart from the literal sense, a ‘heart’ can stand for a person, through a part-whole metonymic relationship. Hearts can be further understood as containers for love, the metaphor being the heart is a container of emotions.47 Thus, hearts can harden (e.g. ‘if your heart be so obdurate, Vouchsafe me yet your picture for my love’, Two Gentlemen of Verona), so denying access to the emotional reservoir inside, or the protective container can be pierced – damaging the emotional contents (e.g. ‘Pierced through the heart with your stern cruelty’, A Midsummer Night’s Dream). Expressions such as ‘with all my heart’ seem to have the sense, ‘with all the emotions within my heart’. Hearts, or more accurately the emotions within, can also be attributed agency (e.g. ‘Here is her hand, the agent of her heart’, Two Gentlemen of Verona), and can be personified (e.g. ‘My heart to her but as guest-wise sojourned, And now to Helen is it home returned, There to remain’, A Midsummer Night’s Dream).

47 Ibid., p. 670.


‘Tears’ are closely tied to unrequited love, as the following example makes clear:

Phoebe  Good shepherd, tell this youth what it is to love.
Silvius  It is to be all made of sighs and tears;
And so am I for Phoebe.
(As You Like It)

Tears have a cause-effect metonymic relationship with pain or emotional distress: they are the effect, and pain is metaphorically related to love (love is pain). Consider the following in which love personified inflicts pain with resultant tears and sighs:

I have done penance for contemning Love,
Whose high imperious thoughts have punished me
With bitter fasts, with penitential groans,
With nightly tears and daily heart-sore sighs.
(Two Gentlemen of Verona)

The collocation between ‘Z8m pronouns (male)’ and ‘S3.2 Intimate relationship’ in the love-comedies is not surprising when one considers the extent to which female characters – in particular, Julia and Rosalind – talk about the men they love (Proteus and Orlando respectively). Significantly, other female characters within the love-comedies appear to share their affinity for male pronouns, as the following extract taken from As You Like It demonstrates:

Phoebe  Think not I love him, though I ask for him;
it is but a peevish boy; yet he talks well;
But what care I for words? yet words do well
When he that speaks them pleases those that hear.
It is a pretty youth: not very pretty:
But, sure, he’s proud, and yet his pride becomes him:
He’ll make a proper man: the best thing in him
Is his complexion; and faster than his tongue
Did make offence his eye did heal it up.
He is not very tall; yet for his years he’s tall:
His leg is but so so; and yet it is well:
There was a pretty redness in his lip,
A little riper and more lusty red
Than that mixed in his cheek; it was just the difference
Betwixt the constant red and mingled damask.
There be some women, Silvius, had they marked him
In parcels as I did, would have gone near
To fall in love with him; but, for my part,
I love him not nor hate him not; and yet
have more cause to hate him than to love him:
For what had he to do to chide at me?
He said mine eyes were black and my hair […]
(As You Like It)

Table 9.8  Domain collocates of S3.2 ‘Intimate relationship’ in the tragedies

N3.2+ / A2.1 = change in size      MI = 5.014   ‘But my true love is grown to such excess’
A5.2+ = evaluation (‘true’)        MI = 2.774   ‘For if he be not one that truly loves you’
M2 = movement/transporting         MI = 2.099   ‘Look, if my gentle love be not raised up!’
S6+ = obligation and necessity     MI = 1.944   ‘I must show out a flag and sign of love’
Z8f = pronouns (female)            MI = 1.844   ‘It can not be long that Desdemona should continue her love to the Moor’
A7+ = certainty                    MI = 1.604   ‘… if thou Can cuckold him, thou do thyself a pleasure, me a sport’
A1.1.1 = general actions           MI = 1.354   ‘O mistress, villainy has made mocks with love!’
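The MI figures given for domain collocates measure how much more often two semantic tags occur near one another than chance would predict. The chapter does not state the formula or the collocation span that was used, so the following is only a minimal sketch of one common formulation of mutual information (log base 2 of observed over expected co-occurrence), applied to a sequence of semantic tags rather than words. The function name `domain_mi`, the span, and the toy tag sequence in the usage note are all assumptions made for illustration.

```python
import math
from collections import Counter


def domain_mi(tags, node, span=5):
    """Score every tag that co-occurs with `node` within +/- `span` positions.

    tags: the text represented as a list of semantic tags, one per token.
    Returns a dict mapping each collocate tag to its MI score.
    """
    n = len(tags)
    freq = Counter(tags)
    cooc = Counter()
    for i, tag in enumerate(tags):
        if tag != node:
            continue
        # Count every tag in the window around this occurrence of the node.
        for j in range(max(0, i - span), min(n, i + span + 1)):
            if j != i:
                cooc[tags[j]] += 1
    scores = {}
    for collocate, observed in cooc.items():
        # Co-occurrences expected by chance given the two overall frequencies.
        expected = freq[node] * freq[collocate] * 2 * span / n
        scores[collocate] = math.log2(observed / expected)
    return scores
```

For example, with a toy sequence such as `["S3.2", "B1"] + ["Z5"] * 8` repeated ten times and `span=1`, the tag `B1` always follows `S3.2` and receives a positive score, while the filler tag `Z5` receives a negative one. Real domain collocation would of course be run over the tagged play texts, and the absolute values depend heavily on the span chosen.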

As Table 9.8 shows, gender pronouns are also a key collocate of the love-tragedies. However, it is the female pronoun that is key here, which suggests that, rather than women characters talking about their love for a man directly (as occurs in the love-comedies), male and female characters in the tragedies are reporting a female’s love for a man (cf. the discussions respecting Desdemona’s love for Othello, Cleopatra’s love for Anthony and, to a lesser extent, Juliet’s love for Romeo).

Summary of main findings

In this chapter, we have reported on an exploration of key domains within three Shakespearean love-comedies and three Shakespearean love-tragedies. We have observed marked differences in the occurrence of ‘love’ in our two datasets. This is clearly represented by the semantic fields of ‘intimate/sexual relationships’ and ‘liking’ appearing as the most underused concepts in the love-tragedies when compared to the love-comedies. The love-tragedies focus, instead, on ‘war’, ‘lack of life/living things’, ‘religion and the supernatural’, ‘lack of power’, ‘movement’, etc., some of which highlight interesting metaphorical patterns. We have also observed that, when love is represented in the love-tragedies, it is much ‘darker’, and may typify the ‘tragical’ love (as opposed to ‘ideal’ or ‘romantic’ love) identified by Kövecses.48 Many of our results have been explained in terms of cognitive metaphor theory. This is not surprising, as abstract concepts such as love are difficult to express, and so metaphor is used. However, it should be noted that key domain analysis is not only concerned with the identification of metaphorical patterns. Indeed, in Anthony and Cleopatra in particular, the key semantic fields tended to identify (or relate to) the tragic plot. As a result of this study, we plan to refine several USAS categories, so that we can capture differences in pronoun usage more readily, e.g. in respect of gender and subject/object positioning. We are also actively investigating the inclusion of further components within the Wmatrix structure, which will allow us to distinguish metaphorical usage and to calculate domain collocation statistics.

Concluding comments

We have shown that the analysis of key domains is a useful methodology in that it enables us to discover links across different semantic fields that may not be readily apparent when using a key words analysis or analysing texts manually. Key domains also provide a way in to cognitive metaphor-type analysis in that we can identify lexical semantic patterns using the USAS system, which, on closer manual inspection, may be found to be linked metaphorically to particular domains. We believe that one of the greatest strengths of the USAS system is that we are able to compare huge amounts of data in a relatively short period of time; however, the analyst must always keep in mind the limits of automatic annotation tools. Thus, we advocate that quantitative analysis is always combined with qualitative analysis.

Acknowledgements

The work presented in this chapter was carried out within two projects: Unlocking the Word Hoard funded by the Andrew W. Mellon Foundation with Martin Mueller of Northwestern University, and Scragg Revisited funded by the British Academy (small research grant scheme SG-40246).

48 Kövecses, Metaphor; see also Barcelona Sánchez, ‘Metaphorical Models of Romantic Love in Romeo and Juliet’.


Chapter 10

Promoting the Wider Use of Word Frequency and Keyword Extraction Techniques

Dawn Archer

Identifying a wider community of users

One of the challenges given to contributors to this volume was to identify a wider community of users that might benefit from using word frequency and keyword extraction techniques. Given the current availability (and seemingly unending growth) of large quantities of information in digital form, the need to gather ever larger collections of data is only going to increase. Word frequency and keyword extraction techniques are therefore likely to be of most benefit to the growing number of academic and non-academic researchers who investigate primary source documents, which are (or can be) digitized in a way that makes them computer-readable. The mining of large volumes of unstructured information is already a key commercial area, of course. IBM’s Unstructured Information Management Architecture, for example, uses a combination of semantic analysis and search components to find information in unstructured texts. Other companies such as nstein offer ‘document intelligence’ in the areas of e-publishing, homeland security and the corporate world. Chapters in this edited collection that demonstrate the applicability of keyword extraction techniques beyond the academic discipline of linguistics include Baker (Chapter 8), McEnery (Chapter 7) and Archer et al. (Chapter 9). For example, Archer et al.’s chapter indicates that corpus linguistic techniques can be used as a means of (in)validating the ‘close reading’ approach of literary scholars. Archer, Baker, McEnery and Rayson have demonstrated elsewhere that word frequency and keyword extraction techniques can also be used to uncover media/political representations of issues/groups in society (be they social, cultural, political and/or religious): they were involved in what became known as ‘the one month challenge’, the aim of which was to carry out investigations on refugee materials (in a very short time frame) and present those findings at a specially convened workshop for experts in Refugee Studies and related fields. Taken together, then, the above would suggest that the list of possible users of word frequency and keyword extraction techniques is likely to include those working within areas as diverse as business, history, literature, politics, religion, sociology, security, marketing and the media to name but a few.

See, for example, M. Deegan et al., ‘Computational Linguistics Meets Metadata, or the Automatic Extraction of Key Words from Full Text Content’, RLG Diginews, 8/2 (2004).

Is there anything left to do?

The simple answer is ‘yes’. Researchers within some disciplines are likely to find the transformation of a text (or texts) into a list of words a rather daunting prospect. Consequently, it is crucial that we begin and/or continue to engage in discussions which seek to address the various ways in which different disciplines (and also sub-disciplines within disciplines) approach texts in particular, and language in general. For example, I would advocate that those of us using frequency/keyword techniques are always careful to stress that the de-contextualization of a text into a list of words is but the first step of a corpus linguistic approach; indeed, as this edited collection reveals, corpus linguists who regularly use (key) word lists emphasize the importance of re-connecting those list(s) of (key) words with the text(s) from which they came and, where possible, with their ‘context of production’ so that we can better appreciate (the meaning behind) the language (as it is) used. In turn, corpus linguists need to appreciate better what constitutes a text for those outside their main discipline – the historian, literature scholar, sociologist, etc. – and what process(es) they utilize when investigating that text.

Encouraging cross-disciplinary collaboration

Of course, the best way of promoting word frequency and keyword extraction techniques to an even wider community of users (commercial and academic) is to begin using these techniques to help answer the questions that researchers within these communities are presently debating.
This means that the keyword extraction community needs to discover what it is that other (commercial and academic) researchers are interested in finding out, and then to determine how their tools might help them to do so. One means of achieving this is to utilize the ‘virtual’ networks created by Reimer and others. The next step, at this point, would then require the developers of word frequency and keyword extraction software to determine whether/how their tools can be ‘tweaked’ (or, if needs be, re-developed) so that functionality can be improved. A possible third step would be for funding bodies to provide the financial means of supporting such activities, so that the ‘new’/redeveloped tools can be made available to/easily accessed by those additional communities of users. As part of this, funding bodies might consider financing one ‘user-friendly’ site, as a means of bringing together software tools, data collections and relevant learning materials (such as those made available on the Methods Network site), and also financing a programme of relevant workshops/network building seminars (along the lines of those made possible under the three-year Methods Network programme).

The organizers of the ‘one month challenge’ hoped to determine the usefulness of computer-based text analysis tools when investigating differing concerns in refugee literature. The challenge was part of a larger pilot project funded by the Andrew W. Mellon Foundation.
M. Scott and C. Tribble, Textual Patterns: Keyword and Corpus Analysis in Language Education (Amsterdam: Benjamins, 2006). This is a particularly good reference for those seeking to use corpus linguistic techniques in a classroom context.
And, indeed, whether the answer to ‘what constitutes a text?’ and ‘what is text analysis?’ changes according to the sub-discipline and, if so, in what way(s)?

Reimer was the Senior Research Co-ordinator for the Methods Network, and instigated the ‘virtual’ Methods Network as a means of encouraging researchers from different disciplines to engage with others outside their main field, ask for help (if needs be) and moot possible inter-disciplinary projects: see .
By way of illustration, the Methods Network (co)-sponsored two workshops, Historical Text Mining (July 2006) and Text Mining for Historians (July 2007), which, as their titles suggest, sought to familiarize historians with the word frequency and keyword analysis approach. The 2007 workshop was co-sponsored by AHDS History and the AHC UK.
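For readers who find the step from text to word list daunting, the two operations stressed in this chapter, reducing a text to a frequency list and then re-connecting each listed word with its context, can be sketched in a few lines. This is a minimal illustration and not the procedure of any particular tool: the tokenization rule and the function names `tokenize`, `word_list` and `concordance` are assumptions, and dedicated software handles punctuation, spelling variation and part-of-speech far more carefully.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple alphabetic tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def word_list(text):
    """Return (word, frequency) pairs, most frequent first."""
    return Counter(tokenize(text)).most_common()


def concordance(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    tokens = tokenize(text)
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


if __name__ == "__main__":
    sample = ("Love looks not with the eyes, but with the mind; "
              "And therefore is winged Cupid painted blind")
    print(word_list(sample)[:5])
    for left, node, right in concordance(sample, "eyes", width=2):
        print(f"{left:>20}  {node}  {right}")
```

The frequency list alone shows only that a word occurs; the concordance lines return it to its co-text, which is the re-connection with the source text that corpus linguists emphasize.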


Appendix 1

Fragment of text A6L

Sir Adrian Cadbury

Born: Birmingham, 1929. Educated: Eton; King’s College, Cambridge.

Sir Adrian Cadbury is chairman of Cadbury Schweppes plc. He joined Cadbury Brothers Ltd in 1952 and became chairman in 1956. After the merger between Cadbury and Schweppes he succeeded Lord Watkinson as Chairman of the combined company at the end of 1974. He was knighted for his services to industry in 1977. He is a director of the Bank of England and of IBM UK Holdings Ltd, and chairman of Pro Ned, an organisation that encourages the appointment of nonexecutive directors to company boards. He also heads the CBI Business Education Task Force. He is chancellor of the University of Aston in Birmingham, a trustee of the Bournville Village Trust and president of the Birmingham Chamber of Industry and Commerce. He was made a freeman of the City of Birmingham in 1982.

Sir Adrian Cadbury is not one of those who subscribes to the popular theory that a truly professional manager can take over the helm of any type of business with only a superficial knowledge of the nuts and bolts. ‘I’m very sceptical of the ability to shift from managing a bank to managing a steel mill, for example. I have grave doubts about that. I think it is essential to understand the key factors for success or failure in your type of business and I’m not convinced you can do that without actually understanding the process in some detail.’


Text KNG

Hello Doctor. Good morning Mrs . Yes. Well young lady, what can we do I’m just up to see about this operation. What’s what operation? You know Royal Infirmary Aha. th that was temporary. Yes. And he’s the consultant told me, it would take two to three days. I think it’s years he means. Have you not heard any more about it? Two cancellations Doctor, two. And that’s all you’ve heard? That’s all I’ve heard. Cos we’ve never heard any more. No. I just thought I’d come up and speak to you about that. Yes. And it’s a thingy that I can’t forget about. I can’t make any appointments for going That’s right. anywhere. That’s right. You know. Right well I’ll get on to them this morning. Will you? That was Mr ? Yes. Mr was That’s right. the man. Yes. Right. Yes. I mean, that’s a long time isn’t it?


Oh yes. Yes. Phone Mr Royal Infirmary Mrs operation. As soon as possible. Now could I have some cramp er tablets Doctor? Yes. For my hands any my feet. And this is where I used to get pain. Now I can be constipatied constipated and I can be the other way. Aha. And Dr that was the only w doctor ever I knew up here. Yes. And he always gave me this bottle That’s the syrupy stuff? Yes, and he told me to take a spoonful That’s at night a teaspoonful A teaspoonful before you go to your bed . Yes. That’s right. So could I have that? Yes. Please. I I’m not really a doctor person really, but this is really troubling me up here you know . Oh yes. Oh aye. You should heard long before this. Oh it’s a terrible thing Doctor . That’s that’s a terrible old thing . And I’m eighty eighty two, I’ll be eighty three in December. Aha. And you’re not getting any younger. I’m not getting any younger, but mind you I’d like to get it done. Yes. Because I can’t take any freedom. That’s right. That’s right. Now.


Mark this in here. We were getting the sun weren’t we? Aye, today, today. Now that was yours Your er Cramp. Your erm quinine. Was it you quinine? Tab tablets. Was it Oh yes. the quinine tablets? The old fashioned ones. For the cramp. was for this c for the cramp? Yes yes it is still. Sometimes I got to get up in the night and walk about and Mhm. my hand’s cold. But that oil, it seemed to help me a lot . Oh yes, it just eases Just a teaspoonful. That’s right, just eases things through. ? No I can’t do with anything. No no no no. Doctor used to say, Never you take a laxative. No no. No that’s the worst thing you could do. Yes. Have you had your holidays Doctor? No no. No? October. October. There we are and I’ll I’ll get on to the Royal this morning. Thanks ever so much Doctor. And I’ll be And we’ll try and get worked out to you this week. greatly obliged to you. Right Mrs ,


Yes. I’ll just go straight through just now. Right and thank you Doctor . Now have we got your phone number? Er. Yes. Wait a bit, Yes that’s right. That’s right, . Okay. we know where to find you. Thanks Doctor, thanks very much. Right that’s enough. Bye bye. Right cheerio now.


Appendix 2

USAS Semantic Tagset

See http://ucrel.lancs.ac.uk/usas/ for more details

A GENERAL AND ABSTRACT TERMS
A1 General
A1.1.1 General actions, making, etc.
A1.1.2 Damaging and destroying
A1.2 Suitability
A1.3 Caution
A1.4 Chance, luck
A1.5 Use
A1.5.1 Using
A1.5.2 Usefulness
A1.6 Physical/mental
A1.7 Constraint
A1.8 Inclusion/Exclusion
A1.9 Avoiding
A2 Affect
A2.1 Affect: Modify, change
A2.2 Affect: Cause/Connected
A3 Being
A4 Classification
A4.1 Generally kinds, groups, examples
A4.2 Particular/general; detail
A5 Evaluation
A5.1 Evaluation: Good/bad
A5.2 Evaluation: True/false
A5.3 Evaluation: Accuracy
A5.4 Evaluation: Authenticity
A6 Comparing
A6.1 Comparing: Similar/different
A6.2 Comparing: Usual/unusual
A6.3 Comparing: Variety
A7 Definite (+ modals)
A8 Seem
A9 Getting and giving; possession
A10 Open/closed; Hiding/Hidden; Finding; Showing
A11 Importance
A11.1 Importance: Important
A11.2 Importance: Noticeability
A12 Easy/difficult
A13 Degree
A13.1 Degree: Non-specific
A13.2 Degree: Maximizers
A13.3 Degree: Boosters
A13.4 Degree: Approximators
A13.5 Degree: Compromisers
A13.6 Degree: Diminishers
A13.7 Degree: Minimizers
A14 Exclusivizers/particularizers
A15 Safety/Danger

B THE BODY AND THE INDIVIDUAL
B1 Anatomy and physiology
B2 Health and disease
B3 Medicines and medical treatment
B4 Cleaning and personal care
B5 Clothes and personal belongings

C ARTS AND CRAFTS
C1 Arts and crafts

E EMOTIONAL ACTIONS, STATES AND PROCESSES
E1 General
E2 Liking
E3 Calm/Violent/Angry
E4 Happy/sad
E4.1 Happy/sad: Happy
E4.2 Happy/sad: Contentment
E5 Fear/bravery/shock
E6 Worry, concern, confident

F FOOD AND FARMING
F1 Food
F2 Drinks
F3 Cigarettes and drugs
F4 Farming and Horticulture

G GOVT. AND THE PUBLIC DOMAIN
G1 Government, Politics and elections
G1.1 Government, etc.
G1.2 Politics
G2 Crime, law and order
G2.1 Crime, law and order: Law and order
G2.2 General ethics
G3 Warfare, defence and the army; Weapons

H ARCHITECTURE, BUILDINGS, HOUSES AND THE HOME
H1 Architecture, kinds of houses and buildings
H2 Parts of buildings
H3 Areas around or near houses
H4 Residence
H5 Furniture and household fittings

I MONEY AND COMMERCE
I1 Money generally
I1.1 Money: Affluence
I1.2 Money: Debts
I1.3 Money: Price
I2 Business
I2.1 Business: Generally
I2.2 Business: Selling
I3 Work and employment
I3.1 Work and employment: Generally
I3.2 Work and employment: Professionalism
I4 Industry

K ENTERTAINMENT, SPORTS AND GAMES
K1 Entertainment generally
K2 Music and related activities
K3 Recorded sound, etc.
K4 Drama, the theatre and show business
K5 Sports and games generally
K5.1 Sports
K5.2 Games
K6 Children’s games and toys

L LIFE AND LIVING THINGS
L1 Life and living things
L2 Living creatures generally
L3 Plants

M MOVEMENT, LOCATION, TRAVEL AND TRANSPORT
M1 Moving, coming and going
M2 Putting, taking, pulling, pushing, transporting, etc.
M3 Movement/transportation: land
M4 Movement/transportation: water
M5 Movement/transportation: air
M6 Location and direction
M7 Places
M8 Remaining/stationary

N NUMBERS AND MEASUREMENT
N1 Numbers
N2 Mathematics
N3 Measurement
N3.1 Measurement: General
N3.2 Measurement: Size
N3.3 Measurement: Distance
N3.4 Measurement: Volume
N3.5 Measurement: Weight
N3.6 Measurement: Area
N3.7 Measurement: Length and height
N3.8 Measurement: Speed
N4 Linear order
N5 Quantities
N5.1 Entirety; maximum
N5.2 Exceeding; waste
N6 Frequency, etc.

O SUBSTANCES, MATERIALS, OBJECTS AND EQUIPMENT
O1 Substances and materials generally
O1.1 Substances and materials generally: Solid
O1.2 Substances and materials generally: Liquid
O1.3 Substances and materials generally: Gas
O2 Objects generally
O3 Electricity and electrical equipment
O4 Physical attributes
O4.1 General appearance and physical properties
O4.2 Judgement of appearance (pretty, etc.)
O4.3 Colour and colour patterns
O4.4 Shape
O4.5 Texture
O4.6 Temperature

P EDUCATION
P1 Education in general

Q LINGUISTIC ACTIONS, STATES AND PROCESSES
Q1 Communication
Q1.1 Communication in general
Q1.2 Paper documents and writing
Q2 Speech acts
Q2.1 Speech, etc: Communicative
Q2.2 Speech acts
Q3 Language, speech and grammar
Q4 The Media
Q4.1 The Media: Books
Q4.2 The Media: Newspapers, etc.
Q4.3 The Media: TV, Radio and Cinema

S SOCIAL ACTIONS, STATES AND PROCESSES
S1 Social actions, states and processes
S1.1 Social actions, states and processes
S1.1.1 General
S1.1.2 Reciprocity
S1.1.3 Participation
S1.1.4 Deserve, etc.
S1.2 Personality traits
S1.2.1 Approachability and Friendliness
S1.2.2 Avarice
S1.2.3 Egoism
S1.2.4 Politeness
S1.2.5 Toughness; strong/weak
S1.2.6 Sensible
S2 People
S2.1 People: Female
S2.2 People: Male
S3 Relationship
S3.1 Relationship: General
S3.2 Relationship: Intimate/sexual
S4 Kin
S5 Groups and affiliation
S6 Obligation and necessity
S7 Power relationship
S7.1 Power, organizing
S7.2 Respect
S7.3 Competition
S7.4 Permission
S8 Helping/hindering
S9 Religion and the supernatural

T TIME
T1 Time
T1.1 Time: General
T1.1.1 Time: General: Past
T1.1.2 Time: General: Present; simultaneous
T1.1.3 Time: General: Future
T1.2 Time: Momentary
T1.3 Time: Period
T2 Time: Beginning and ending
T3 Time: Old, new and young; age
T4 Time: Early/late

W THE WORLD AND OUR ENVIRONMENT
W1 The universe
W2 Light
W3 Geographical terms
W4 Weather
W5 Green issues

X PSYCHOLOGICAL ACTIONS, STATES AND PROCESSES
X1 General
X2 Mental actions and processes
X2.1 Thought, belief
X2.2 Knowledge
X2.3 Learn
X2.4 Investigate, examine, test, search
X2.5 Understand
X2.6 Expect
X3 Sensory
X3.1 Sensory: Taste
X3.2 Sensory: Sound
X3.3 Sensory: Touch
X3.4 Sensory: Sight
X3.5 Sensory: Smell
X4 Mental object
X4.1 Mental object: Conceptual object
X4.2 Mental object: Means, method
X5 Attention
X5.1 Attention
X5.2 Interest/boredom/excited/energetic
X6 Deciding
X7 Wanting; planning; choosing
X8 Trying
X9 Ability
X9.1 Ability: Ability, intelligence
X9.2 Ability: Success and failure

Y SCIENCE AND TECHNOLOGY
Y1 Science and technology in general
Y2 Information technology and computing

Z NAMES AND GRAMMATICAL WORDS
Z0 Unmatched proper noun
Z1 Personal names
Z2 Geographical names
Z3 Other proper names
Z4 Discourse Bin
Z5 Grammatical bin
Z6 Negative
Z7 If
Z8 Pronouns, etc.
Z9 Trash can
Z99 Unmatched


Bibliography

The place of publication is presumed as London, unless otherwise stated.

Archer, D. and McIntyre, D., ‘A Computational Approach to Mind Style’, paper given at the 25th conference of the Poetics and Linguistics Association, University of Huddersfield, July 2005.
Archer, D. and Rayson, P., ‘Using the UCREL Automated Semantic Analysis System to Investigate Differing Concerns in Refugee Literature’, in M. Deegan, L. Hunyadi and H. Short (eds), The Keyword Project: Unlocking Content Through Computational Linguistics, OHC publications 18 (Office for Humanities Communication Publications, forthcoming).
Archer, D., McEnery, T., Rayson, P. and Hardie, A., ‘Developing an Automated Semantic Analysis System for Early Modern English’, in D. Archer, P. Rayson, A. Wilson and T. McEnery (eds), Proceedings of the Corpus Linguistics 2003 Conference, UCREL Technical Paper Number 16 (Lancaster: UCREL, 2003), pp. 22–31.
Aston, G. and Burnard, L., The BNC Handbook (Edinburgh: Edinburgh University Press, 1998).
Baker, P., Public Discourses of Gay Men, Routledge Advances in Corpus Linguistics, vol. 8 (London and New York: Routledge, 2005).
Barcelona Sánchez, A., ‘Metaphorical Models of Romantic Love in Romeo and Juliet’, Journal of Pragmatics, 24 (1995): 667–88.
Berber Sardinha, A.P., Lingüística de Corpus (Barueri, São Paulo, Brazil: Editora Manole, 2004).
Biber, D., Johansson, S., Leech, G., Conrad, S. and Reppen, R., The Longman Grammar of Spoken and Written English (Longman, 1999).
Burnage, G. and Dunlop, D., ‘Encoding the British National Corpus’, in J. Aarts, P. de Haan and N. Oostdijk (eds), English Language Corpora: Design, Analysis and Exploitation, papers from the 13th International Conference on English Language Research on Computerized Corpora, Nijmegen 1992 (Amsterdam: Rodopi, 1993), pp. 79–95.
Burnard, L., Reference Guide for the British National Corpus: World Edition (Oxford: Oxford University Computing Services, 2000).
Burrows, J.F., ‘Questions of Authorship: Attribution and Beyond’, paper given at the ACH/ALLC, Joint International Conference, New York, 14 June 2001.
Burrows, J.F., ‘“Delta”: A Measure of Stylistic Difference and a Guide to Likely Authorship’, Literary and Linguistic Computing, 17 (2002): 267–87.


Burrows, J.F., ‘The Englishing of Juvenal: Computational Stylistics and Translated Texts’, Style, 36 (2002): 677–99.
Burrows, J.F., ‘Questions of Authorship: Attribution and Beyond’, Computers and the Humanities, 37/1 (2003): 5–32.
Christ, O., The IMS Corpus Workbench Technical Manual (Institut für Maschinelle Sprachverarbeitung, Stuttgart, Germany: Universität Stuttgart, 1994).
Cohen, S., Folk Devils and Moral Panics, 3rd edn (Routledge, 2002).
Cox, N. and Dannehl, K., ‘The Rewards of Digitisation: A Corpus-based Approach to Writing History’, in A. Hardie (ed.), Digital Resources in the Humanities Conference 2005 Abstracts (DRH, 2005), pp. 13–14.
Crystal, D., The Cambridge Encyclopedia of the English Language, 2nd edn (Cambridge: Cambridge University Press, 2003).
Crystal, D. and Crystal, B., The Shakespeare Miscellany (Penguin, 2005).
Culpeper, J., ‘Computers, Language and Characterisation: An Analysis of Six Characters in Romeo and Juliet’, in U. Melander-Marttala, C. Ostman and M. Kytö (eds), Conversation in Life and in Literature: Papers from the ASLA Symposium (Uppsala: Association Suédoise de Linguistique Appliquée, 2002), pp. 11–30.
Davies, M., ‘Advanced Research on Syntactic and Semantic Change with the Corpus del Español’, in J. Kabatek, C.D. Pusch and W. Raible (eds), Romance Corpus Linguistics II: Corpora and Diachronic Linguistics (Tübingen: Gunter Narr, 2005), pp. 203–14.
Davies, M., ‘The Advantage of Using Relational Databases for Large Corpora: Speed, Advanced Queries, and Unlimited Annotation’, International Journal of Corpus Linguistics, 10 (2005): 301–28.
Davies, M., ‘Semantically-based Queries with a Joint BNC/WordNet Database’, in R. Facchinetti (ed.), Corpus Linguistics Twenty-Five Years On (Amsterdam: Rodopi, 2007), pp. 149–67.
Deegan, M., Short, H., Archer, D. et al., ‘Computational Linguistics Meets Metadata, or the Automatic Extraction of Key Words from Full Text Content’, RLG Diginews, 8/2 (2004).
Dunning, T., ‘Accurate Methods for the Statistics of Surprise and Coincidence’, Computational Linguistics, 19:1 (1993): 61–74.
Fellbaum, C. (ed.), WordNet: An Electronic Lexical Database (Cambridge, MA: MIT Press, 1998).
Fletcher, W., ‘Exploring Words and Phrases from the British National Corpus’, Phrases in English (2005).
Francis, W.N. and Kučera, H., Frequency Analysis of English Usage Lexicon and Grammar (Houghton Mifflin, 1982).
Freeman, D.C., ‘“Catch[ing] the nearest way”: Macbeth and Cognitive Metaphor’, Journal of Pragmatics, 24 (1995): 689–708.
Geens, D., Engels, L.K. and Martin, W., Leuven Drama Corpus and Frequency List (Leuven: Institute of Applied Linguistics, University of Leuven, 1975).


Halliday, M.A.K., An Introduction to Functional Grammar, 2nd edn (London: Edward Arnold, 1994).
Hickey, R., Dublin English: Evolution and Change, Varieties of English Around the World General Series, vol. 35 (Amsterdam and Philadelphia: John Benjamins, 2006).
Hoey, M., Lexical Priming: A New Theory of Words and Language (Routledge, 2005).
Hofland, K. and Johansson, S., Word Frequencies in British and American English (The Norwegian Computing Centre for the Humanities, Bergen, 1982).
Hoover, D., Culpeper, J. and Louw, B., Approaches to Corpus Stylistics (Routledge, forthcoming).
Hoover, D.L., ‘Statistical Stylistics and Authorship Attribution: An Empirical Investigation’, Literary and Linguistic Computing, 16:4 (2001): 421–44.
Hoover, D.L., ‘Testing Burrows’s Delta’, Literary and Linguistic Computing, 19/4 (2004): 453–75.
Hoover, D.L., ‘Delta Prime?’, Literary and Linguistic Computing, 19:4 (2004): 477–95.
Hoover, D.L., ‘Delta, Delta Prime, and Modern American Poetry: Authorship Attribution Theory and Method’, paper given at the ALLC/ACH Joint International Conference. BC, Canada: University of Victoria, 16 June 2005.
Hoover, D.L., ‘The Delta Spreadsheet’, paper given at the ALLC/ACH Joint International Conference. BC, Canada: University of Victoria, 17 June 2005.
Hundt, M., Sand, A. and Siemund, R., Manual of Information to Accompany the Freiburg-LOB Corpus of British English (FLOB) (Freiburg: Freiburg University, 1998). Available online at .
James, G., Davison, R., Cheung, A.H.Y. and Deerwester, S. (eds), English in Computer Science: A Corpus-based Lexical Analysis (Hong Kong: Longman, 1994).
Johansson, K. and Hofland, K., Frequency Analysis of English Vocabulary and Grammar: vol. 1, Tag Frequencies and Word Frequencies; vol. 2, Tag Combinations and Word Combinations (Oxford: Oxford University Press, 1989).
Johansson, S., Leech, G. and Goodluck, H., Manual of Information to Accompany the Lancaster/Oslo-Bergen Corpus of British English for Use with Digital Computers, Technical Report (Bergen: Norwegian Computing Centre for the Humanities, Bergen, 1978).
Juuso, I., Anderson, J., Anderson, W., Beavan, D., Corbett, J. et al., ‘The LICHEN Project: Creating an Electronic Framework for the Collection, Management, Online Display, and Exploitation of Corpora’, in A. Hardie (ed.), Digital Resources in the Humanities Conference 2005 Abstracts (DRH, 2005), pp. 27–9.


Kay, C.J., ‘Historical Semantics and Material Culture’, in S.M. Pearce (ed.), Experiencing Material Culture in the Western World (London and Washington: Leicester University Press, 1997), pp. 49–64.
Kay, C.J., ‘Historical Semantics and Historical Lexicography: Will the Twain Ever Meet?’, in J. Coleman and C.J. Kay (eds), Lexicology, Semantics and Lexicography in English Historical Linguistics: Selected Papers from the Fourth G.L. Brook Symposium, Manchester, August 1998 (Amsterdam: Benjamins, 2000), pp. 53–68.
Kay, C. and Samuels, M.L., ‘Componential Analysis in Semantics: Its Validity and Applications’, Transactions of the Philological Society (1975): 49–81.
Kirk, J.M., ‘Aspects of the Grammar in a Corpus of Dramatic Texts in Scots’, PhD thesis, University of Sheffield, 1986.
Kirk, J.M., Northern Ireland Transcribed Corpus of Speech (ESRC Data Archive: University of Essex, 1990; rev. edn, Belfast: Queen’s University, 2004).
Kirk, J.M., Kallen, J.L., Lowry, O., Rooney, A. and Mannion, M., The ICE-Ireland Corpus: The International Corpus of English: The Ireland Component (CD) (ICE-Ireland Project: Queen’s University Belfast, 2005 (beta version)).
Kirk, J.M., Kallen, J.L., Lowry, O., Rooney, A. and Mannion, M., The SPICE-Ireland Corpus: Systems of Pragmatic Annotation for the Spoken Component of ICE-Ireland, version 1.2 (Queen’s University Belfast and Trinity College Dublin, 2007).
Kövecses, Z., Metaphor: A Practical Introduction (New York: Oxford University Press, 2002).
Kučera, H. and Francis, W.N., Computational Analysis of Present-Day American English (Providence: Brown University Press, 1967).
Lakoff, G. and Johnson, M., Metaphors We Live By (Chicago and New York: University of Chicago Press, 1980).
Landau, S., Lexicography (Cambridge: Cambridge University Press, 1984).
Landes, S., Leacock, C. and Tengi, R., ‘Building Semantic Concordances’, in C. Fellbaum (ed.), WordNet: An Electronic Lexical Database (Cambridge, MA: The MIT Press, 1998), pp. 199–216.
Louw, B., ‘Irony in the Text or Insincerity in the Writer? The Diagnostic Potential of Semantic Prosodies’, in M. Baker, G. Francis and E. Tognini-Bonelli (eds), Text and Technology (Philadelphia/Amsterdam: John Benjamins, 1993).
McArthur, T., Longman Lexicon of Contemporary English (Longman, 1981).
McArthur, T., ‘What is a Word?’, in T. McArthur (ed.), Living Words: Language, Lexicography and the Knowledge Revolution (Exeter: Exeter University Press, 1999 [1992]).
McEnery, A.M., Swearing in English: Bad Language, Purity and Power from 1586 to the Present (Routledge, 2005).
McEnery, A.M., ‘The Moral Panic about Bad Language in England, 1691–1745’, Journal of Historical Pragmatics, 7/1 (2006): 89–113.
Munro, C.R., Television, Censorship and the Law (Farnborough: Saxon House, 1979).


Newsom, J., Half Our Future (Her Majesty’s Stationery Office, 1963).
Oakes, M., Statistics for Corpus Linguistics (Edinburgh: Edinburgh University Press, 1998).
OED Online, ed. J.A. Simpson (Oxford: Oxford University Press, 2000–).
Oncins-Martínez, J.L., ‘Notes on the Metaphorical Basis of Sexual Language in Early Modern English’, in J.G. Vázquez-González et al. (eds), The Historical Linguistics-Cognitive Linguistics Interface (Huelva: University of Huelva Press, 2006).
Oxford English Dictionary (Oxford: Oxford University Press, 1884–, and subsequent edns).
Pelz, W., The Scope of Understanding in Sociology: Towards a More Radical Reorientation in the Social and Humanistic Sciences (Routledge, 1974).
Phillips, M., ‘Lexical Structure of Text’, Discourse Analysis Monographs 12 (Birmingham: University of Birmingham, 1989).
Piao, S.L., Wilson, A. and McEnery, T., ‘A Multilingual Corpus Toolkit’, paper given at AAACL–2002, Indianapolis, Indiana, USA, 2002.
Pilkington, H., The Future of Sound Radio and Television (Her Majesty’s Stationery Office, 1962).
Rayson, P., ‘Matrix: A Statistical Method and Software Tool for Linguistic Analysis Through Corpus Comparison’, PhD thesis. Lancaster University, 2003.
Rayson, P., ‘From key words to key semantic domains’, International Journal of Corpus Linguistics, 13:4 (2008): 519–49.
Rayson, P., Archer, D. and Smith, N., ‘VARD Versus Word: A Comparison of the UCREL Variant Detector and Modern Spell Checkers on English Historical Corpora’, Proceedings of the Corpus Linguistics Conference Series On-Line E-Journal 1:1 (2005).
Rayson, P., Archer, D., Baron, A. and Smith, N., ‘Travelling Through Time with Corpus Annotation Software’, Proceedings of Practical Applications in Language and Computers (PALC 2007) Conference, Łódź, Poland, April 2007 (PALC, forthcoming).
Rayson, P., Archer, D., Piao, S.L. and McEnery, T., ‘The UCREL Semantic Analysis System’, proceedings of the workshop on Beyond Named Entity Recognition Semantic Labelling for NLP Tasks in association with the fourth international conference on Language Resources and Evaluation (LREC, 2004), pp. 7–12.
Roberts, J., Kay, C. and Grundy, L., A Thesaurus of Old English (Amsterdam: Rodopi, 2000 [1995]).
Roget, P.M., Thesaurus of English Words and Phrases (Longman, 1852, and subsequent edns).
Scott, M., WordSmith Tools (Oxford: Oxford University Press, 1996).
Scott, M., WordSmith Tools Help Manual. Version 3.0 (Oxford: Mike Scott and Oxford University Press, 1999).
Scott, M., ‘Focusing on the Text and Its Key Words’, in L. Burnard and T. McEnery (eds), Rethinking Language Pedagogy from a Corpus Perspective (Frankfurt: Peter Lang, 2000), pp. 103–22.


Scott, M., WordSmith Tools, Version 4 (Oxford: Oxford University Press, 2004).
Scott, M. and Tribble, C., Textual Patterns: Keyword and Corpus Analysis in Language Education (Amsterdam: Benjamins, 2006).
Scruton, R., On Hunting (Yellow Jersey Press, 1998).
Spevack, M., A Thesaurus of Shakespeare (Hildesheim: Georg Olms Publishers, 1993).
Stubbs, M., Text and Corpus Analysis (Oxford: Blackwell, 1996).
Taavitsainen, I., ‘Personality and Styles of Affect in The Canterbury Tales’, in G. Lester (ed.), Chaucer in Perspective: Middle English Essays in Honour of Norman Blake (Sheffield: Sheffield Academic Press, 1999), pp. 218–34.
Taavitsainen, I., Pahta, P. and Mäkinen, M., Middle English Medical Texts, CD-ROM (Amsterdam: John Benjamins, 2005).
Whitehouse, M., Whatever Happened to Sex? (Hove: Wayland, 1977).
Wilks, Y.A., Slator, B.M. and Guthrie, L.M., Electric Words: Dictionaries, Computers and Meanings (Cambridge, MA and London: MIT Press, 1996).
Wilson, A. and Thomas, J., ‘Semantic Annotation’, in R. Garside, G. Leech and A. McEnery (eds), Corpus Annotation: Linguistic Information from Computer Texts (Longman, 1997), pp. 55–65.

Index

algorithms 67, 72, 73
ambiguity, semantic 71, 74
American Heritage Word Frequency Book, The 30
authorship attribution 43, 51
automatic annotation 157

BNC see British National Corpus
bottom up approach 9
British National Corpus 22, 35, 53, 60, 61, 63, 64, 65
Brown Corpus 30, 35

Cambridge Advanced Learners’ Dictionary 31
Cambridge Grammar of the English Language 31
CASE see Corpus of American Spoken English
CLAWS see Constituent Likelihood Automatic Word-tagging System
clustering 77
COBUILD 31
cognitive metaphor theory 12, 141, 157
colligates/colligation 2
colligation(s) 10
Collins English Dictionary 27
collocate(s) 2, 23, 62, 68
collocation(al) 2, 10, 64
concordance/concordancer 2, 4, 10, 72
Constituent Likelihood Automatic Word-tagging System 11
content word(s) 36, 40
corpus
  annotated 72
  historical 78
  historical, ambiguity 74
  historical, spelling 73
  normative 2
  ‘off the shelf’, see BNC, Brown, FLOB, ICE, LLC, LOB
  reference 2, 8, 9, 91
  regional 78
  size 9, 59
Corpus del Español 53, 54, 56, 57, 58
Corpus of American Spoken English see Corpus of American Spoken English
Corpus of Dramatic Texts in Scots 5, 17, 26
Corpus of Historical English 6

data mining 77
Delta 6, 43, 51
Delta-Lz 46
Delta-Oz 46
Delta Prime 51
dialogue 38, 39
discourse(s) 10
discourse marker(s) 31
discourse prosody/ies 128
discourse role(s) 9

editing 40, 41
  manual 38, 39, 40

frequency data 62
function word(s) 3, 35, 40

granularity 59
Gutenberg 42

hapax legomena 35
Historical Thesaurus of English (HTE) 69, 77
historical variation 58
homonym(s) 74
homophone(s) 73
HTE see Historical Thesaurus of English

ICE 19, 33
ICE-GB 19, 22
ICE-Ireland 19, 22, 23, 24, 25, 31
International Corpus of English see ICE

key domain(s) 11, 14
key keyword(s) 10
keywords/keywords analysis 9
KFN-gram 55

lemmatizing/lemmatization 38, 51, 56
LICHEN see Linguistic and Cultural Heritage Electronic Network
Linguistic and Cultural Heritage Electronic Network 69
Literature Online 43
LLC see London-Lund Corpus
LOB see Lancaster-Oslo-Bergen Corpus
London-Lund Corpus 31
Longman Lexicon of Contemporary English 11, 133

Macmillan English Dictionary For Advanced Learners 31
manual analysis sorting/sifting 14, 73
meaning, multiple 70
MEMT see Middle English Medical Texts
MI 101, 141, 153, 156
Middle English Medical Texts 72
MindNet 74
MWC see Mary Whitehouse Corpus

n-gram/n-gram architecture 6, 53, 54, 56, 64, 65
narration 38, 39
Natural Language Processing 74
NECTE see Newcastle Electronic Corpus of Tyneside English
NICTS see Northern Ireland Transcribed Corpus of Speech
NLP see Natural Language Processing
Northern Ireland Transcribed Corpus of Speech 26, 33
noun(s)
  common 37, 38, 41, 56
  pronoun(s) 38, 39, 40
  proper 37, 41

part of speech tagging see tagging
Pathfinder 74
personal pronouns see noun(s)
Phrases in English 55, 56, 59
polysemy 74
POS tagging see tagging
pronoun(s) see noun(s)
proper noun(s) 40

register 58, 67
register variation 68
relational database 2, 54
relational databases 6

SARA 53
SCOTS see Scottish Corpus of Text and Speech
Scottish Corpus of Text and Speech 71
semantic prosody 23
spelling variation 71, 73
SQL query 56, 58, 62
statistical stylistics 51
synonym(s) 68
synsets 63, 64

Tagged Corpus of Scottish Correspondence 72
tagger
  semantic 11
tagging
  manual 38
  manual: correction 38
  part of speech (POS) 38, 41, 51, 56
  pragmatic 31
  prosodic 31
  tagger 38
Thesaurus of Old English 71
TOE see Thesaurus of Old English
token(s) 9, 35, 82
top down approach 12
type(s) 82

UCREL Semantic Analysis System 11, 75, 133, 138
Unstructured Information Management Architecture 179
USAS see UCREL Semantic Analysis System

variable spelling see spelling variation
variant spelling detector 13, 77
VIEW 60, 63
VIEW interface 54, 63, 65

wildcard searches 72
word frequency 1, 5, 42, 51, 68
  list 2, 45
  list, modified 41
  list, problems with 39
WordNet 54, 63, 64, 74
WordSmith 8, 55

XAIRA 53

z scores 42, 61