C-ORAL-ROM
Studies in Corpus Linguistics

Studies in Corpus Linguistics aims to provide insights into the way a corpus can be used, the type of findings that can be obtained, the possible applications of these findings as well as the theoretical changes that corpus work can bring into linguistics and language engineering. The main concern of SCL is to present findings based on, or related to, the cumulative effect of naturally occurring language and on the interpretation of frequency and distributional data.

General Editor
Elena Tognini-Bonelli

Consulting Editor
Wolfgang Teubert

Advisory Board
Michael Barlow, Rice University, Houston
Graeme Kennedy, Victoria University of Wellington
Robert de Beaugrande, Federal University of Minas Gerais
Geoffrey Leech, University of Lancaster
Douglas Biber, North Arizona University
Anna Mauranen, University of Tampere
Chris Butler, University of Wales, Swansea
John Sinclair, University of Birmingham
Sylviane Granger, University of Louvain
Piet van Sterkenburg, Institute for Dutch Lexicology, Leiden
M. A. K. Halliday, University of Sydney
Michael Stubbs, University of Trier
Stig Johansson, Oslo University
Jan Svartvik, University of Lund
Susan Hunston, University of Birmingham
H-Z. Yang, Jiao Tong University, Shanghai
Volume 15 C-ORAL-ROM: Integrated Reference Corpora for Spoken Romance Languages Edited by Emanuela Cresti and Massimo Moneglia
C-ORAL-ROM Integrated Reference Corpora for Spoken Romance Languages
Edited by
Emanuela Cresti Massimo Moneglia University of Florence
John Benjamins Publishing Company Amsterdam/Philadelphia
The paper used in this publication meets the minimum requirements of American National Standard for Information Sciences – Permanence of Paper for Printed Library Materials, ANSI Z39.48-1984.
Cover illustration from original painting Random Order by Lorenzo Pezzatini, Florence, 1996.
Library of Congress Cataloging-in-Publication Data
C-ORAL-ROM : integrated reference corpora for spoken Romance languages / edited by Emanuela Cresti and Massimo Moneglia.
p. cm. (Studies in Corpus Linguistics, ISSN 1388–0373 ; v. 15)
Includes bibliographical references and index.
1. Romance languages--Data processing. 2. Romance languages--Variation. 3. Romance languages--Spoken Romance languages. I. Cresti, E. (Emanuela) II. Moneglia, Massimo. III. Series.
PC44.5.C2 2005
440’0285--dc22 2005041056
ISBN 90 272 2286 X (Eur.) / 1 58811 548 8 (US) (Hb; alk. paper)

© 2005 of the book – John Benjamins B.V.
© 2005 of the DVD – Università di Firenze; Universidad Autónoma de Madrid; Centro de Linguística da Universidade de Lisboa; L’Université de Provence.
No part of the book and DVD may be reproduced in any form, by print, photoprint, microfilm, or any other means.
John Benjamins Publishing Co. · P.O. Box 36224 · 1020 ME Amsterdam · The Netherlands
John Benjamins North America · P.O. Box 27519 · Philadelphia PA 19118-0519 · USA
Table of contents

Acknowledgements
Preface

Chapter 1 The C-ORAL-ROM resource
Massimo Moneglia
1.1 Introduction
  1.1.1 C-ORAL-ROM
  1.1.2 Organisation of the volume
  1.1.3 The issues of representation and comparability
  1.1.4 C-ORAL-ROM sampling strategy
  1.1.5 Comparing C-ORAL-ROM with other corpora
1.2 Prosodic tagging criteria
  1.2.1 Prosodic breaks and utterance limits
  1.2.2 Background of prosodic labelling
  1.2.3 Utterance boundaries and labelling of discourse acts
  1.2.4 Utterance limits in spontaneous speech
  1.2.5 Prosodic labelling procedure and Alignment units
  1.2.6 Conventions for prosodic tagging in the transcripts
    1.2.6.1 Fragmentation phenomena
1.3 Textual format
  1.3.1 Metadata
  1.3.2 Dialogue representation
  1.3.3 Conventions for transcription
  1.3.4 Pauses
  1.3.5 Restrictions on dialogue representation: The Intersection convention
  1.3.6 Dependent lines
  1.3.7 Alignment principle
  1.3.8 Files, filename conventions and folder structure of the resource
    1.3.8.1 Files
    1.3.8.2 Filename conventions
    1.3.8.3 Folder structure
1.4 WinPitch Corpus. A Text-to-Speech Analysis and Alignment Tool (Philippe Martin)
  1.4.1 Text-to-speech alignment
    1.4.1.1 Automatic alignment
    1.4.1.2 Alignment and transcription
    1.4.1.3 Computer-assisted alignment
  1.4.2 WinPitch Corpus features
  1.4.3 Basic layout
1.5 C-ORAL-ROM PoS tagging
  1.5.1 Minimal tagset requirements
  1.5.2 Comparison between tagsets
    1.5.2.1 PoS tagsets
    1.5.2.2 Morpho-syntactic features of verbs
    1.5.2.3 Non-standard tagsets
  1.5.3 Tagging and frequency lists formats
1.6 Measurements of spoken language variability in the Romance languages
  1.6.1 Mid-length of utterances (MLU)
  1.6.2 Mid-length of the dialogic turn (MLTw)
  1.6.3 Speed
  1.6.4 Length of the tone unit (MLTone)
  1.6.5 Fragmentation
  1.6.6 Some conclusions

Chapter 2 The Italian corpus
Emanuela Cresti, Alessandro Panunzi, and Antonietta Scarano
2.1 History of the corpus within the national framework
  2.1.1 Historical overview
  2.1.2 The LABLITA Corpus
2.2 Orthographic transcription
  2.2.1 General criteria
  2.2.2 Orthographic transcription of regional words
  2.2.3 Diacritic marks
  2.2.4 Interjections
2.3 Morpho-syntactic tagging
  2.3.1 Tools and strategy adopted for automatic PoS tagging and lemmatisation
  2.3.2 Tagset
    2.3.2.1 Choices in the PoS tagset
  2.3.3 Extended tagset for spoken language
    2.3.3.1 Non-standard words tagset
    2.3.3.2 Non-linguistic elements
  2.3.4 Other choices
    2.3.4.1 Regional and dialectal forms
    2.3.4.2 Multiwords
    2.3.4.3 Names
  2.3.5 Evaluation
  2.3.6 Specific problems with the morpho-syntactic tagging of spoken language
    2.3.6.1 Words adjacent to utterance boundaries
    2.3.6.2 Interruptions and retracting
    2.3.6.3 PoS assignment in connection with secondary prosodic boundaries
2.4 Main data from lemmatisation

Chapter 3 The French corpus
Estelle Campione, Jean Véronis, and José Deulofeu
3.1 History of the corpus within the national framework
3.2 Orthographic transcription
  3.2.1 General criteria
  3.2.2 Interjections
3.3 Morpho-syntactic tagging
  3.3.1 Tagset
  3.3.2 Multiword expressions
  3.3.3 Tools and strategy adopted for automatic PoS tagging and lemmatisation
  3.3.4 Evaluation
  3.3.5 Main data from lemmatisation

Chapter 4 The Spanish corpus
Antonio Moreno, Gillermo de la Madrid, Manuel Alcántara, Ana Gonzalez, José M. Guirao, and Raúl De la Torre
4.1 History of the corpus in the national framework
  4.1.1 Historical overview
  4.1.2 CORLEC features
  4.1.3 C-ORAL-ROM features
  4.1.4 Final remarks
4.2 Orthographic transcription
  4.2.1 General criteria
  4.2.2 Orthography for non-standard words
  4.2.3 Interjections
4.3 Morpho-syntactic tagging
  4.3.1 Tools and strategy adopted for automatic PoS tagging and lemmatisation
    4.3.1.1 Electronic vocabulary
  4.3.2 Tagset
    4.3.2.1 Tagset adopted
  4.3.3 The notion of “multiword”
  4.3.4 Ambiguous clustering
  4.3.5 Level of morpho-syntactic encoding of forms
  4.3.6 Evaluation
  4.3.7 Specific tagging problems with the Spanish spoken corpus
    4.3.7.1 Retracting and interruption phenomena
    4.3.7.2 Linguistic forms whose distribution is not consistent with the distributional characters of written language
    4.3.7.3 Linguistic forms and non-standard forms used as discourse markers
  4.3.8 Main data from lemmatisation

Chapter 5 The Portuguese corpus
Maria Fernanda Bacelar do Nascimento, José Bettencourt Gonçalves, Rita Veloso, Sandra Antunes, Florbela Barreto, and Raquel Amaro
5.1 History of the corpus within the national framework
  5.1.1 Historical overview
  5.1.2 Reusing materials from existing databases at CLUL
  5.1.3 New materials specifically collected and/or transcribed for the C-ORAL-ROM project
  5.1.4 Final remarks
5.2 Orthographic transcription
  5.2.1 Specific Portuguese conventions
    5.2.1.1 General orthographic norms
    5.2.1.2 Other specific conventions
  5.2.2 Interjections
5.3 Morpho-syntactic tagging
  5.3.1 Tools and strategy adopted for automatic PoS tagging and lemmatisation
  5.3.2 Tagset
    5.3.2.1 Categories and main options
    5.3.2.2 Particular cases
    5.3.2.3 Specific tags for the Portuguese spoken corpus
  5.3.3 Lemmatisation of the spoken corpus
    5.3.3.1 Specific lemmatisation choices
  5.3.4 Evaluation
  5.3.5 Specific problems with the morpho-syntactic tagging of spoken language
  5.3.6 Some options of the Portuguese team
5.4 Main data from lemmatisation
  5.4.1 The 100 most frequent verbs, nouns, adverbs, adjectives: A comparison between C-ORAL-ROM and CORLEX
  5.4.2 Similarities and differences in the two corpora
  5.4.3 Main data from word-forms
  5.4.4 Lexical density
  5.4.5 Multiword expressions

Chapter 6 Notes on lexical strategy, structural strategies and surface clause indexes in the C-ORAL-ROM spoken corpora
Emanuela Cresti
6.1 Premises
  6.1.1 The utterance
  6.1.2 Comparison between speech and writing
  6.1.3 Comparison among spoken Romance languages
  6.1.4 Variation according to the corpus design
6.2 The noun vs. verb lexical strategy in speech
  6.2.1 Lexical strategy and formality
6.3 Informational patterning
  6.3.1 Informational patterning according to the corpus design
6.4 The verbal utterance
  6.4.1 The verbal utterance according to the corpus design
6.5 The ‘non-structuring strategies’ in Italian
6.6 The structural types of utterances
  6.6.1 General tendencies
  6.6.2 Structural types according to the corpus design
6.7 Some remarks on Italian Media and Telephone
6.8 Surface clause indexes
  6.8.1 General percentile data
  6.8.2 Percentile data according to the corpus design
6.9 The informational positions of surface clause indexes
  6.9.1 A general frame of correlation between syntactic functions and informational positions for Italian
  6.9.2 Incidence of informational positions of surface indexes in Italian
6.10 Some remarks on coordination, subordination and negation in the four Romance languages (FRLs)

Appendix Evaluation of consensus on the annotation of terminal and non-terminal prosodic breaks in the C-ORAL-ROM Corpus
Massimo Moneglia, Marco Fabbri, Silvia Quazza, Andrea Panizza, Morena Danieli, Juan María Garrido, and Marc Swerts
A.1 Goals of the evaluation
A.2 Evaluation background
A.3 Experimental setting
A.4 Selection of evaluators
A.5 Measurements and statistics
  A.5.1 Evaluation data
  A.5.2 First step: Binary comparison file
  A.5.3 Second step: Ternary comparison file
  A.5.4 Measurements
  A.5.5 Kappa coefficient
A.6 Results
  A.6.1 Binary Comparison
  A.6.2 Ternary Comparisons
  A.6.3 K-coefficient
A.7 Discussion
  A.7.1 General results
  A.7.2 Specific results

Bibliography
Acknowledgements
The C-ORAL-ROM Corpus for Spoken Romance Languages is the chief result of the C-ORAL-ROM project, which was funded within the Fifth Framework Program of the European Union, under the Information Society Technology Program (IST2000-26228). The official project web page and documentation are at http://lablita.dit.unifi.it/coralrom. The C-ORAL-ROM corpus is now available in two forms:
1. In this publication, which presents the resource in compressed and encrypted format on one DVD. In this form, designed for research and wide distribution in the linguistic community, the resource can be accessed and studied only through the speech software and the text search software found on the DVD.
2. Through the ELDA Catalogue (http://www.elda.fr), on 9 DVDs whose files are non-compressed and non-encrypted. This form is intended for the speech industry and speech laboratories, which can exploit the potential of the resource for the purposes of research and speech technology.
The project started in January 2001 and was completed in three and a half years by the following consortium, which was coordinated and managed by the LABLITA lab of the University of Florence:
Corpus and linguistic studies
– Università di Firenze (Dipartimento di Italianistica, LABLITA); Direction: Emanuela Cresti
– Université de Provence (Description Linguistique Informatisée sur Corpus, DELIC); Direction: Jean Véronis
– Centro de Linguística da Universidade de Lisboa (CLUL); Direction: Maria Fernanda Bacelar do Nascimento
– Universidad Autónoma de Madrid (Laboratorio de Lingüística Informática); Direction: Antonio Moreno Sandoval

Speech software
– Pitch Instruments France SARL; Direction: Philippe Martin

Distribution and dissemination
– European Language Resource Distribution Agency (ELDA); Direction: Khalid Choukri
– Instituto Cervantes; Direction: Antonio Cid

Speech recognition
– Istituto Trentino di Cultura, Centro per la ricerca scientifica e tecnologica (ITCIRST); Direction: Daniele Falavigna
For the representation of spoken language in Media contexts in C-ORAL-ROM, various broadcasting companies allowed the publication of samples of their transmissions. On behalf of the consortium, we wish to acknowledge the following for their cooperation:
– TECHE-RAI
– RTVE (Radio Televisión Española)
– Radio Televisión Madrid
– COPE (Cadena de Ondas Populares Españolas/Radio Popular)
– Onda Cero Radio
– RTP2: Rádio e Televisão de Portugal; SGPS, S.A.
– RDP (Radiodifusão Portuguesa), AS. (Antena 1 and Antena 2)
– SIC (Sociedade Independente de Comunicação), S.A.
– Rádio Notícias Produções e Publicidade
C-ORAL-ROM is in fact a result of the interaction of many different skills and interests in spoken language which characterise academia, the speech technology industry and the EU Commission at the present time. From the EU Commission representatives, we wish to thank project officer Bent Hauschildt, who helped in structuring the project at its start, and project officer Philippe Gelin, who took it over at its mid-term review and brought it to its conclusion: both gave constant support, which has been essential for the coordination of this complex, collective work. We are also especially obliged to the project reviewers Johanna Moore and Louis ten Bosch: their suggestions have been important for enhancing the quality of the corpus and the reliability of its annotation. Besides the consortium and the EU representatives, an extremely valuable advisory board acted within C-ORAL-ROM, representing the various technological and theoretical domains that are presently concerned with spoken language and with multilingualism:
– Nuno Beires, PT-Inovação (Portugal Telecom, Lisboa)
– Claire Blanche-Benveniste, Ecole Pratique des Hautes Etudes (Paris, France)
– Bernard Cerquiglini, INaLF (Institut National de la Langue Française, Paris, France)
– Morena Danieli, Loquendo (Torino, Italy)
– Juan Maria Garrido, Telefonica I+D (Madrid, Spain)
– Marc Swerts, Tilburg University (Tilburg, Netherlands)
– Dominique Willems, Collate Research Network (University of Gent, Belgium)
We wish to thank all members of the board for their support. Their presence and patience in project meetings were crucial in the discussions and decisions involved in determining a corpus design strategy and format. Specific thanks must be extended to Morena Danieli, Juan Maria Garrido and Marc Swerts for their collaboration in the evaluation of the prosodic annotation, and Claire Blanche-Benveniste for her friendship, which provided support, ideas and contributions in a series of issues that are too many to be mentioned, from the conception of the project until its end. We also wish to express our gratitude to Anna Brita Stenström for reading the manuscript of this book and for requesting a number of interesting clarifications; and to Eugenio Picchi, who allowed the use of his technology and assisted us in the morpho-syntactic tagging of the Italian corpus. In addition, we wish to thank Silvia Quazza and her staff at Loquendo, and Kees Vaes at John Benjamins, for their work as sub-contractors in the C-ORAL-ROM project. It has been a pleasure working with them. We are also extremely grateful to Elena Tognini Bonelli for her support and advice on the form of both this book and the DVD. We are indebted too to Lisa Lim for taking on the language editing of this volume which has, after all, been written in English by authors with a Romance language competence. The Italian Department of the University of Florence has been in charge of all administrative matters, and we wish to express our thanks to the Director Anna Dolfi and to the secretarial staff Cinzia Baldi and Lucia Azalin for all their help. We have one regret: the extension of the C-ORAL-ROM project to Romanian and Catalan was not funded, and this collection of corpora thus does not represent two Romance languages whose importance is presently increasing in many respects. 
Finally, on behalf of the whole consortium, we wish to thank the numerous students and colleagues who have worked on the corpus recordings and transcriptions. Without their contribution, the C-ORAL-ROM collection would never have been possible. All responsibility for the work in this volume rests, of course, with the authors.
Emanuela Cresti & Massimo Moneglia
Preface
Having been involved in the C-ORAL-ROM project during the preparation stage, I am glad to welcome this outcome of contrastive studies on four corpora of spontaneous spoken Romance languages: Italian, French, Spanish and Portuguese. The editors, Emanuela Cresti and Massimo Moneglia, had to accept a double challenge: a scientific commitment in defining appropriate contrastive methods for spoken corpora, and a cooperative venture in integrating four types of material, methodological and theoretical resources, supported by different national traditions. Such an enterprise required a common assent. Italian, Spanish, Portuguese and French teams consented to a collective discipline, providing us with the first contrastive Romance study based on prosodic and pragmatic units (although it has to be admitted that French, according to a well-known national trend, proved to be less disciplined). Cresti and Moneglia deserve all the more credit: very fruitful results can be drawn from their report, with possible applications in various theoretical and practical domains, such as language perception, multi-layered linguistic analysis, early acquisition of prosodic patterns, multilingual inter-comprehension, cooperation with computational parsing on spoken languages, information-packaging according to different communicative situations, etc. Comparable data across the four languages were ensured by unified samplings, extracted from large corpora of adult recordings. Apart from traditional distinctions such as dialogue versus monologue and formal versus informal, analytical results imposed another classification involving the channel, by transmission or face-to-face, and an important subdivision between telephone and other media, the telephone being a dialogic event in its pure state, representing the lowest step of speech structuring.
Moreover, media and telephone productions did not seem to be directly related to such features as formality or informality, dialogic or monologic expressions. According to the Italian team’s hypothesis (based on twenty years of research on spoken Italian in the LABLITA lab in Firenze), only prosody could supply the researchers with adequate units for intra- and inter-language comparisons. Usual syntactic schemes were unsuccessful because of the high percentage of non-canonical patterns, and discourse units or speech acts were too elusive. The utterance, an expression marked by a prosodic terminal break, having tight correspondences with speech acts, was thus chosen as the fundamental analytical unit. As the investigators wanted to settle this unit at a psycholinguistic level, they even tested the perceptual recognition of utterances with experimental procedures. Philippe Martin’s WinPitch software offered accurate prosodic measurements for the four languages and an easy access to a general text-to-speech alignment, a great asset of the present work. The whole constitutes a strong plea for considering the utterance to be the best unit for such a comparison. Even if some linguists will not wholly side with this choice (because they prefer more grammatical or more semantic grounds or other perspectives), one has to admit that this prosodically-based unit, acting in the way a useful algorithm would do, proved to be an efficient tool for comparing phenomena in the four languages. It could show, for instance, that tonal unit length varies in a language-dependent way, whereas utterance length, dialogic turns and speed are tightly correlated in all four languages. It also gave a nice point of reference for the distribution of grammatical elements such as coordination, relative-conjunction che/que or negation, by situating them as initial, medial or terminal, relative to prosodic delimitations. Among ensuing consequences, it can be stated, for instance, that, in the four languages, negations are sensitive to situations and are more frequent in informal than in formal speech, with the percentage of negations being the highest in Portuguese and Spanish. It could be stated too that, instead of assuming coordinative or subordinative functions inside the utterances, initial coordinatives, as well as initial subordinatives, often open dialogic turns or connect speech acts, thus revealing an unexpected weight of non-subordinative and non-coordinative functions. Moreover, in formal monologues, initial pseudo-coordinatives and pseudo-subordinatives often aggregate together with negative expressions, creating patterns which could not be described within usual grammatical frames. In order to account for some kind of structural complexity inside utterances, two additional features were considered: a morphological one, the presence or absence of one or several finite verbs; and a prosodic one, the presence of one or several prosodic units.
Here again, the choice might be contested by many specialists who would prefer more syntactic or more pragmatic approaches. But one has to agree that, combined with textual features, these characteristics allow some interesting generalisations as, for instance, the great proportion of verbless utterances in informal speech in the four languages, the regular absence of complexity in dialogic productions and some language-dependent regularities. A wide structural similarity among the four spoken Romance languages emerges from the comparisons, as shown in nice contrastive diagrams. Nevertheless, in many regards, French appears to be, as Raffaele Simone stated it (Simone 1997), “the less Romance of all Romance languages”. French required special processes for calculating word length because of its specific orthographic system, and special patterns for investigating negation (having three negatives ne, pas, non, where others have one or two) and relatives (two forms qui, que where the three other Romance languages have one). When compared to Italian, Spanish and Portuguese, French utterances showed a much greater number of simple verbal phrases (twice as many) and of simple prosodic units. And, on the contrary, French displayed fewer verbless utterances (even in telephone speech, where they tended to be more frequent) and fewer coordinatives in initial position, so significant for the other languages. The French drift, observed in many
previous comparative studies in Romance linguistics (Harris 1978), is given here new confirmation in new domains. In a recent textbook (Renzi & Andreose 2003), Lorenzo Renzi recalled the three main streams followed by usual studies in Romance Linguistics (i tre paradigmi degli studi romanzi): the classical period, the historical-comparative method and the structural-generative trend. A new one seems to be now emerging from linguistic comparisons based on spoken languages and operating with computational procedures. This Integrated Reference Corpora for Spoken Romance Languages moves along new paths in this famous domain. Already a successful outcome, it will certainly expand into important future developments.
Claire BLANCHE-BENVENISTE
Professeur émérite, Université de Provence, Ecole Pratique des Hautes Etudes
Chapter 1
The C-ORAL-ROM resource Massimo Moneglia
1.1 Introduction
1.1.1 C-ORAL-ROM
The setting up of linguistic resources is a current issue in the European Union, and spoken language corpora play a specific role in this context. It is important to gain knowledge of spoken language for the purpose of the development of linguistic engineering, which, in the era of digital communication, must address the primary role played by spontaneous speech within natural communication. This task is even more urgent in a multilingual context such as that of the EU, where citizens have the right to express themselves in their own native language, and it is the responsibility of the EU to enable the fulfillment of this principle. The study of statistically significant samples of everyday language use is also important for linguistic studies, whose long tradition has mostly been based on ideal models. Despite this general need, spontaneous spoken language is still under-represented, and audio information, which is essential to the development of a robust speech technology, is almost unavailable. The need for spoken corpora is even more strongly felt for Romance languages, as compared to English and other languages for which many resources are already available.1 The main goal of the C-ORAL-ROM project thus is to provide a comparable set of corpora of spontaneous spoken language of the main Romance languages, namely French, Italian, Portuguese and Spanish. The project was undertaken by a European consortium, and launched in 1999 by Emanuela Cresti, Massimo Moneglia, Claire Blanche-Benveniste, Fernanda Bacelar, Philippe Martin, Francisco Marcos Marín and Carlota Nicolás. Supported and co-ordinated by the University of Florence, it was then approved and funded within the IST program of the EU (C-ORAL-ROM IST 2000-26228). C-ORAL-ROM consists in total of 772 spoken texts and 121:43:07 hours of speech by 1,427 different speakers.
Four comparable recorded resources of Italian, French, Portuguese and Spanish spontaneous speech sessions (roughly 300,000 words for each language) have been collected respectively by the following providers:2
i. University of Florence (LABLITA, Laboratorio linguistico del Dipartimento di italianistica);
ii. Université de Provence (DELIC, Description Linguistique Informatisée sur Corpus);
iii. Centro de Linguística da Universidade de Lisboa (CLUL);
iv. Universidad Autónoma de Madrid (Departamento de linguistica, Laboratorio de Lingüística Informática).3
Each recorded session in C-ORAL-ROM comes with the following main annotations:
a. Session metadata, containing essential information on the speakers, recording situation, acoustic quality, source, and content of each session, ensuring clear identification of the various speech types documented;
b. Orthographic transcription, in standard format, enriched by the tagging of terminal and non-terminal prosodic breaks;
c. Text-to-speech synchronisation, based on the alignment with the acoustic source of each transcribed utterance;
d. Part-of-Speech (PoS) tagging of each form in the transcribed texts and the corresponding frequency list of forms and lemmas.
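The four annotation layers listed above can be pictured as a single record per session. The sketch below is only an illustration of that idea, not the format actually used in the resource: the class names, field names and the `frequency_list` helper are all hypothetical, chosen for the example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Token:
    form: str   # word form as transcribed
    lemma: str  # lemma assigned by the tagger
    pos: str    # PoS tag (tag names here are placeholders)


@dataclass
class AlignedUtterance:
    speaker: str
    text: str     # orthographic transcription, including prosodic break tags
    start: float  # alignment with the acoustic source, in seconds
    end: float
    tokens: list = field(default_factory=list)


@dataclass
class Session:
    metadata: dict  # speakers, situation, acoustic quality, source, content
    utterances: list = field(default_factory=list)

    def frequency_list(self, key="lemma"):
        """Return (item, count) pairs for forms or lemmas, most frequent first."""
        counts = Counter(
            getattr(token, key) for utt in self.utterances for token in utt.tokens
        )
        return counts.most_common()
```

A frequency list of forms rather than lemmas would be obtained, under the same assumptions, with `session.frequency_list(key="form")`.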
This resource is stored on the DVD accompanying this book, and is integrated with two tools which allow the direct exploitation of the acoustic and textual information,4 respectively, WinPitch Corpus (© Pitch France) and Contextes (© Jean Véronis). The DVD presents C-ORAL-ROM in five sections:
1. Metadata of the recorded sessions of each language resource (see (a) above);
2. Multimedia corpora, which allow simultaneous access to textual and acoustic information through the speech software WinPitch Corpus;
3. Textual corpora, which allow searches on textual and PoS tagged files through the text search engine Contextes;
4. Frequency lists of forms and lemmas of each language collection;
5. Histograms and tables, derived from the main corpus annotation, which present general features and peculiar qualities (lexical, structural, syntactic) of spontaneous speech in the four Romance languages.
The four corpora are orthographically transcribed in standard textual format (CHAT format, MacWhinney 1994), with representation of the main dialogue characteristics; that is, speakers’ turns; the main occurring non-linguistic and paralinguistic events; the prosodic breaks; and the segmentation of the speech flow into discrete speech events.
The identification of speech events is one of the more particular features of C-ORAL-ROM. The speech flow and its transcription are divided into units which rank above the word level and belong to the domain of action, i.e. utterances. An utterance is the linguistic counterpart of a speech act (Austin 1962) and it is identified in C-ORAL-ROM through a heuristic method that is based on the perception of prosodic cues (Cresti 1994, 2000; Cresti & Firenzuoli 1999).5 Each utterance is assumed to end with a perceptively relevant prosodic break having a terminal value. The textual string is divided into utterances following the annotation of prosodic breaks, discriminated in the speech flow through perceptive judgments.6 With WinPitch Corpus, which ensures a natural and meaningful text/sound correspondence, the annotated transcripts are aligned with their acoustic counterpart, with segments defined on independent layers, generating a database of all utterances in the resource (roughly 134,000 in the multilingual corpus).7 Besides text-to-speech and speech-to-text alignment, WinPitch Corpus allows an easy and efficient acoustic analysis of speech with real-time fundamental frequency tracking, spectrographic display, re-synthesis after editing of prosodic parameters, etc. The Contextes text search engine operates on the C-ORAL-ROM textual resource, allowing searches on textual items at different levels of annotation; that is, at word level, at lemma level, at PoS level, at metadata level, and at prosodic tagging level.
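The principle that a textual string is divided into utterances at terminal prosodic breaks can be sketched in a few lines of code. This is a minimal illustration and not the project's tooling: it assumes, purely for the example, that terminal breaks are written `//` and non-terminal breaks `/`; the conventions actually adopted are those set out in the section on prosodic tagging, and the names `Utterance` and `split_utterances` are hypothetical.

```python
from dataclasses import dataclass

# Assumed marker symbols for this sketch only; the real transcription
# conventions are defined elsewhere in the chapter and may differ.
TERMINAL = "//"
NON_TERMINAL = "/"


@dataclass
class Utterance:
    text: str         # utterance text with break markers removed
    tone_units: list  # stretches delimited by non-terminal breaks


def split_utterances(transcript: str) -> list:
    """Divide a tagged string into utterances at terminal breaks."""
    utterances = []
    for chunk in transcript.split(TERMINAL):
        chunk = chunk.strip()
        if not chunk:
            continue  # nothing between two terminal breaks
        units = [u.strip() for u in chunk.split(NON_TERMINAL) if u.strip()]
        utterances.append(Utterance(" ".join(units), units))
    return utterances
```

Under these assumptions, a string such as `"yes // well / I think so //"` yields two utterances, the second of which contains two tone units.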
1.1.2 Organisation of the volume
The C-ORAL-ROM resource aims to represent the variety of speech acts performed in everyday language and to enable the induction of prosodic and syntactic structures in the four Romance languages, from a quantitative and qualitative point of view. This chapter will describe the set of choices adopted to this end. The criteria for the corpus design of the resource and for its prosodic annotation are presented respectively in Sections 1.1 and 1.2. In 1.3 the format for both metadata and dialogue representation is outlined, while in 1.4 the text-to-speech alignment method and tool are described. 1.5 will present the specifications of the C-ORAL-ROM format for PoS tagging, and 1.6 will give some general measurements of the main language variations that can be found in the corpus structure. The next four chapters (2 to 5) will present the Italian, French, Spanish and Portuguese sub-corpora, respectively. Each chapter is similarly structured, providing descriptions along the following lines: (1) the history of the national corpus; (2) the choices made for the orthographic transcription of the oral material; (3) the PoS tagging tool and the strategy adopted, accompanied by the evaluation of PoS tagging correctness; and (4) the main data extracted from lemmatisation, for example, the most frequent lemmas in spoken language. The description of each sub-corpus, however, represents the peculiar point of view of each corpus provider on the resource itself. Other related arguments are also variously included as sub-sections in the four parallel chapters. More specifically, the state of the art of spoken resource collections in each nation, the problems encountered in PoS tagging of that particular language, and the peculiar characteristics of frequency lists derived from each spontaneous speech resource are treated in each chapter with more or less attention, depending on the research tradition of each team.
A set of measurements highlighting the strategies of construction of spoken language utterances and their variation in the C-ORAL-ROM corpus design is also provided. These measurements are reported in a Diagram menu in the DVD, which
Massimo Moneglia
shows the recorded values in the four languages in Frames. The diagrams address the following main language data:
i. lexical strategy: the proportion of nouns to verbs;
ii. structural strategies: the level of structural complexity recorded in spoken texts through the main strategies adopted for the construction of the utterance, i.e. (a) verbal vs. non-verbal; (b) prosodically simple vs. prosodically patterned;
iii. structural types: the main structural types of utterances, i.e. compound verbal, simple verbless, simple verbal, and compound verbless;
iv. clause indexes: the incidence of the main functional expressions, such as indexes of coordination, subordination and negation, on the total utterances.
Clearly, this work offers the reader some general results regarding the speech properties of the four Romance languages which can become a vital platform for comparison both with writing and with other spoken languages (Biber et al. 1999). It is also a first example of how research on speech which wishes to go beyond lexical frequencies and phonetic observations can be conducted on the basis of utterances taken as reference units. A brief presentation of this is given in Chapter 6. Finally, the Appendix presents the results of the evaluation of the prosodic annotation in the C-ORAL-ROM corpus. The evaluation, carried out by independent users from the speech industry sector, is aimed at guaranteeing the overall reliability of the data and computations presented.
The issues of representation and comparability
The domain of spontaneous spoken language has become consolidated only in very recent times (see Biber 1988; Blanche-Benveniste et al. 1990; Cresti 2000; Givón 1979; Miller & Weinert 1998). This tradition agrees that spontaneous speech performances can be defined as every oral performance which does not execute a previous (written or scripted) text. From a positive point of view, speech events that may be considered spontaneous have a typical set of qualities, among which the following are the most relevant. Spontaneous speech events comprise:
a. face-to-face multi-modal interactions;
b. inter-subjective reference to a deictic space;
c. mental programming simultaneous with vocal execution (unscripted);
d. contextually undetermined linguistic behaviour (unpredictable behaviour).
A long tradition of socio-linguistic studies (see Berruto 1987; Biber 2001; Biber et al. 1998; Gadet 1996a, b, 1997, 2000, 2003; Labov 1966) has frequently dealt with the significance of the sociological and contextual parameters in the definition of speech qualities, pointing out their variability. Many types of spontaneous speech exist, varying along the following parameters:
The C-ORAL-ROM resource
a. a variety of possible structures of the communication event (monologue, dialogue, conversation between many participants, etc.);
b. the channel of communication of the speech event;
c. the sociological context, that is, the domain of society in which the speech event takes place (family life, private life, public life);
d. programming conditions (totally unscripted vs. scripted or partially scripted performances);
e. a variety of possible language registers and genres (diaphasic variation);
f. socio-linguistic factors, such as sex, age, education, occupation of the speakers (diastratic variation);
g. the geographical origin of speakers (diatopical variation);
h. the task of the speech event;
i. the topic of the speech event.
For instance, a tale told to a child and a row between husband and wife are different types of speech events which occur in family life; these events vary strongly in language register, dialogue structure, and of course in topic. Where public contexts are concerned, a lesson given by a university professor and a discussion in a talk show may present a set of features (viz. dialogical character, unscripted or partially unscripted programming, informal style) which allow them to be considered varieties of spontaneous speech as well, even if the speaker is performing a professional task. These two types vary in register, in communication structure, in channel, and in programming and control of the oral execution. The linguistic properties of the speech events may vary in connection with such non-linguistic variations.
The setting up of spontaneous speech databases is therefore a complex task which must ensure the representation of the most significant variations exploited by the different types of spontaneous speech performances (Berruto 1987; Biber 1988; De Mauro et al. 1993; Gadet 1996a, b, 2003). Spoken resources set up in controlled environments that have been collected to satisfy the needs of speech technologies (such as telephone information, health dialogues, map tasking) constitute at present the majority of the available databases.8 Their acoustic/phonetic quality is excellent, but by definition they deal with restricted semantic domains, and in the given context the language behaviour is highly predictable. Should one wish to represent the spontaneous speech universe, the constitution criteria must ensure the widest possible variation in speech contexts, and the least control over the speech event, which is exactly the opposite of what dedicated resources do.
On the basis of this need, the C-ORAL-ROM corpus is oriented towards the collection of corpora in natural environments; this, however, necessarily results in a lower acoustic quality of the resource (Labov 1966). Moreover, since C-ORAL-ROM exploits the rich archives created in previous years of research on spoken languages, the acoustic quality and the recording conditions of the resources are, inevitably, variable.
In any case, the acoustic quality of each recording and the most relevant information on the recording conditions are always recorded in the metadata of each text. The following points outline the requirements followed in the project for the acoustic format and the recording apparatus:
i. Format: mono .wav files (Windows PCM); sampling frequency: 22,050 Hz, 16 bit.
ii. Recording and storing process for old analogue recordings: directly derived in .wav files (22,050 Hz, 16 bit) from the original analogue tapes through a standard sound card (Sound Blaster live or compatible) with a professional sound editor.
iii. Recording and storing process for new recordings:
   a. dialogues: stereo DAT or minidisk recording (44,100 Hz) with unidirectional microphones, converted into mono .wav files (Windows PCM, 22,050 Hz, 16 bit) via SPDIF port of a standard sound card (Sound Blaster live or compatible) with a professional sound editor;
   b. conversations with more than two participants: mono DAT or minidisk recording with cardioid or omni-directional microphone, converted into mono .wav files via SPDIF port of a standard sound card (Sound Blaster live or compatible) with a professional sound editor.
The sound files of the acoustic database are set on a quality scale (recording, volume, voice overlapping and noise) and are comparable with respect to it. The quality scale extends from the highest level of clarity of the voice signal to low levels of acoustic quality. The quality is gauged spectrographically (see the rules for marking the acoustic quality in 1.3).
A further problem that a multilingual corpus of spontaneous speech must face is comparability between different language resources. Comparability of large written corpora has been tested in two forms:
a. Parallel corpora (e.g. CRATER and EUROM4);
b. Corpora of the same type or of the same specialised field in several languages.
Clearly, with respect to the task of collecting multilingual spontaneous spoken (language) corpora, only the second alternative is, in principle, available, since, in the domain of speech, parallel corpora are possible only for reading and acting performances. In other words, it is impossible to achieve parallel spoken corpora without losing spontaneity. The prototype example for the second alternative in the domain of written resources is the relation between the Brown Corpus (Brown University, USA, early 1960s) and the LOB Corpus (Lancaster/Oslo/Bergen, 1970) which achieves a comparable sampling of American English and British English. In the spoken domain, comparability is quite easy to pursue with respect to resources based on the selection of a specific semantic domain (e.g. telephone information, health information, map tasking etc.), that is, with people in the same controlled situation doing the same things. However such resources are subject to elicitation parameters, within limited contexts, and therefore lack the main character of spontaneous speech.
If we assume that the representation of spontaneous speech must necessarily maximise spoken text variation, in a multilingual resource the more variability there is represented in each language resource, the more difficult it becomes to compare one language resource with the others. Therefore the comparability of the resource is a function of the application of specific variation parameters in a corpus design structure. The need to document many types of spontaneous speech in order to offer a significant representation of the spoken universe is confirmed by recent corpus linguistics studies, where all levels of language description vary when they are considered with respect to different types of spontaneous speech events. This is quite evident at the lexical level. The representation of a sufficient number of contexts, covering relevant types of speech events in the universe, is the only possible strategy to identify frequency lexicons. A high-frequency lexicon may be under-represented in specific pragmatic domains which, on the contrary, by definition, maximise the probability of occurrence of low-frequency lexical items (which are the real focal point of dedicated resources used for speech recognition). The frequency of Parts of Speech also varies according to the context of use. Nouns are known to be more frequent than verbs, but the relative frequency of nouns is lower in spoken language, and it is much lower in informal conversations with respect to formal contexts (1/1 vs. 1/3). Adjectives, on the other hand, are much more frequent in formal speech (Biber et al. 1999). More subtle lexical variations may also be noted. For instance, it is generally assumed that deverbal nouns (destruction) are characteristic of technical contexts and are much less frequent in everyday dialogues (Halliday 1989). Rouget (2000) shows that such nominalisation also occurs in spontaneous everyday dialogues, but typically in restricted syntactic positions, i.e. 
not depending on a verb. On the other hand, their syntactic position is unconstrained in technical contexts. Recent corpus-based grammatical descriptions of language have also highlighted the fact that the majority of complement clauses which depend on a verb depend on a putandi verb in spontaneous conversation, while complement clauses depend more frequently on a dicendi verb in broadcasting and media contexts (Biber et al. 1999). In corpus-based approaches, the induction of the main syntactic properties is strongly correlated to text variation parameters as well. For example, in English, both main types of dependent clauses (relative and complement clauses) vary in frequency according to socio-linguistic parameters. Generally speaking, in syntactic structures controlled by a noun, the frequency of both that-complement clauses and to-complement clauses is higher in formal language, while, in syntactic structures controlled by a verb, that-complement clauses are much more frequent in conversation (Biber 2000). Similar conclusions can be drawn with respect to relative clauses: relative constructions are much more frequent in formal speech, with their restrictive function generally being the most frequent (Biber et al. 1999). In other words, the pragmatic domain of corpora strongly influences the probability of occurrence of syntactic properties in the core area of grammar.
Spontaneous speech also varies in the type and length of utterances. Recent work on spoken Italian (Cresti 2000) claims that the mean length of utterance (MLU) in dialogic speech sessions (family conversations, country wakes, conversations among work colleagues and conversations among university students) systematically diverges from the MLU of formal texts (university lectures and radio interviews). Moreover, the number of speech act types in the previous Italian collections of spontaneous speech is much higher than that reported in semi-spontaneous corpora. Firenzuoli (2003) records over 80 speech act types distributed throughout the corpus, while in the map-task coding scheme (Anderson et al. 1991), the set of possible dialogue acts corresponds to roughly 16 possible moves (Stirling et al. 2001). It is clear from the preceding review that findings can differ drastically according to variation in parameters; C-ORAL-ROM aims to provide a sufficient representation of spontaneous speech variation.
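The comparison of utterance length across session types can be illustrated with a minimal sketch. It assumes that MLU is read as the mean number of words per utterance and that words are whitespace-separated tokens; the function name and the two toy sessions are hypothetical and are not data from the corpus.

```python
def mean_length_of_utterance(utterances):
    """Mean number of whitespace-separated words per utterance (assumed definition)."""
    if not utterances:
        raise ValueError("at least one utterance is required")
    return sum(len(u.split()) for u in utterances) / len(utterances)


# Hypothetical toy sessions, invented for illustration only.
dialogue = ["the door", "yes", "close it please"]
lecture = ["today we will look at the structure of the utterance in spoken language"]

dialogue_mlu = mean_length_of_utterance(dialogue)  # (2 + 1 + 3) / 3
lecture_mlu = mean_length_of_utterance(lecture)    # 13 / 1
```

On real data the utterance lists would come from the prosodically segmented transcripts, and a different word-counting convention would change the values.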
C-ORAL-ROM sampling strategy
C-ORAL-ROM sampling of the four Romance language resources is based on the following set of variation parameters, which constitutes the semiological and sociological structure of the spontaneous speech corpus:
a. Dialogical structure: the recorded sessions are classified into monologues, dialogues, and conversations according to the interlocutor’s active participation in the exchange;
b. Social context: recordings of interactions belonging to family and private life are distinguished from those taking place in public;
c. Channel: face-to-face interactions are distinguished from media productions and telephone recordings;
d. Domain of use: domain of social environments, activities and professions, such as law, business, research, teaching, church, etc.;
e. Register variation: recorded sessions where a formal language register is used are distinguished from those where informal uses are preferred;
f. Speaker parameters: the main socio-linguistic qualities of speakers are registered; that is, age, sex, education, occupation and geographical origin.
The C-ORAL-ROM corpora have been collected in Continental Portugal, Central Castilian Spain, Southern France, and Western Tuscany (the latter historically considered the source of the Italian language), and are intended to represent a possible standard usage that occurs in those areas.9 It must be pointed out that the language archive is not defined on the basis of the geographical origin of speakers as in dialectologically oriented research (Contini et al. 2002). In other words, systematic diatopical variation is not represented in C-ORAL-ROM: in a multilingual collection of limited size such as this, diatopical restrictions must necessarily be established for each language, and the corpus cannot be concerned with the representation of geographical varieties. Such variation of pronunciation, lexical and morpho-syntactic features in a language,
Figure 1.1 Geographical origin of speakers in the four corpora
depending on geographical origin of speakers, can be represented only within wide intra-linguistic corpora, which take into account the systematic representation of the geographical varieties of the language in question.10 C-ORAL-ROM thus represents the language actually spoken in a relevant national centre (namely Madrid, Lisbon, Aix-Marseille, and Florence) and in its neighbouring area (Cresti et al. 2002). Naturally, whatever other geographical varieties are employed in that reference centre may also occur, in varying proportions, and, in fact, instances of varieties of different geographical origin are recorded for each language. It should also be noted that for many participants in media broadcasts and other sessions recorded in public contexts, their geographical origin is not available. Figure 1.1 shows the number of speakers recorded in the four corpora and the distribution of their geographical origin.
The goal of the collection is therefore limited to ensuring the representation of the sole diaphasic and diastratic varieties of the language that is spoken in the region in question (Berruto 1987; Gadet 1996b), and it is expected to present a sufficient variation for the study of communicative acts, lexicon, morphology, syntax and, as far as the main variety recorded may be considered significant, prosody. Nonetheless, since the speaker’s characteristics (sex, age, education, occupation and geographical origin) and those of the session (communication event, social context, register, topic) are recorded as part of the metadata, the user can always link any linguistic phenomenon to the specific situation in which it occurs and be aware of its value.
Given the above parameters, the main choices adopted in C-ORAL-ROM for the representation of the speech universe in the four 300,000-word corpora are the following:
1. Separating formal speech (50%) and informal speech (50%);
2. Selecting distinct criteria for sampling the formal and informal parts of the corpus;
3. Defining the dimension of samples in terms of number of words;
4. Ensuring a sufficient representation of dialogical informal speech (which is the resource with higher added value);
5. Ensuring the representation of Formal language both in Broadcasting (Media) and in face-to-face contexts (which are labelled “Natural context” in the C-ORAL-ROM corpus design).
Comparability across the four spoken corpora is ensured by means of common sampling criteria, and by the same weight in words in each domain type in the four corpora. Table 1.1 presents the matrix for all the languages in C-ORAL-ROM.
Table 1.1 Parameters for the description of spoken language in the corpus design matrix

Language register: Informal (150,000 words)
  Social context
    Family/private: 124,500 words
    Public: 25,500 words
  Structure of the communication event
    Dialogue and Multi-dialogue: 102,000 words
    Monologue: 48,000 words

Language register: Formal
  Channel: Natural context (65,000 words)
    Typical domain of use: Political speech; Political debate; Preaching; Teaching; Professional explanation; Conference; Business; Law
  Channel: Media (60,000 words)
    Typical domain of use: Talk shows; Scientific press; Reportage; Interviews; Sport; News; Weather forecast
  Channel: Telephone (25,000 words)
    Typical domain of use: Private conversations (15,000 words); Human-machine interactions (10,000 words)

*At least 20,000 words from conversations with more than two participants in the Informal part.
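The word budgets in Table 1.1 can be checked for internal consistency with a short sketch. The figures are copied from the table and from the 50%/50% split of each 300,000-word corpus; the dictionary names and the layout are illustrative assumptions and do not correspond to any file shipped with the resource.

```python
# Word budgets copied from Table 1.1. The dictionary names and layout are
# illustrative assumptions, not part of the C-ORAL-ROM distribution.
REGISTER_TARGET = 150_000  # 50% of each 300,000-word corpus

informal_by_social_context = {"family/private": 124_500, "public": 25_500}
informal_by_structure = {"dialogue and multi-dialogue": 102_000, "monologue": 48_000}
formal_by_channel = {"natural context": 65_000, "media": 60_000, "telephone": 25_000}
telephone_by_domain = {"private conversations": 15_000, "human-machine interactions": 10_000}


def total(partition):
    """Sum the word budget of every cell in one partition."""
    return sum(partition.values())


def is_consistent():
    """Check that every partition adds up to the budget of the cell it divides."""
    return (
        total(informal_by_social_context) == REGISTER_TARGET
        and total(informal_by_structure) == REGISTER_TARGET
        and total(formal_by_channel) == REGISTER_TARGET
        and total(telephone_by_domain) == formal_by_channel["telephone"]
    )


print(is_consistent())  # prints True
```

The two informal partitions are independent views of the same 150,000 words, which is why both are compared with the same target.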
Given the design scheme which offers a representation of the speech universe, the number of samples that are chosen for each domain in the universe is the main variable for the statistical relevance of the corpus. Although the overall dimension of the C-ORAL-ROM corpora imposes a severe limit in this respect, it must be highlighted that this choice relies on two concurring needs that may have consequences on the comparability of the multilingual corpora and on their significance. The more samples are chosen, the more accurate the representation of the universe, but the amount of information in each sample must also be sufficient to represent each domain with respect to dialogue structure and syntax. In other words, while the samples must be large enough to offer a probability of occurrence of those properties that characterise each context, each corpus must also have a comparable number of samples.
The C-ORAL-ROM corpus design strategy considers that the representation of spoken language must, in a specific language context, ensure both the possible appreciation of syntactic properties (micro-syntaxe) and the sufficient representation of different dialogue structures (macro-syntaxe and informational patterning; see Blanche-Benveniste et al. 1990; Scarano 2003). Therefore the length of each sample must in principle be sufficient for both ends. The choices concerning sample length and number of samples consequently vary according to the corpus domains. Sample length is determined in terms of quantity of information (i.e. number of words) rather than in temporal units (minutes).11 Additionally, since formal texts, by definition, feature a more complex textual structure, two different strategies have been adopted for the formal and the informal categories, as follows:
i. Informal section:
   a. Short texts: at least 64 texts of approximately 1,500 words each.12
   b. Long texts: from 8 to 10 texts of approximately 4,500 words each.
ii. Formal section:
   a. Formal in natural context: 2 or 3 samples for each domain of approximately 3,000 words each.
   b. Media: 2 or 3 samples of approximately 3,000 words for each domain, while only short samples for weather forecasts and news are provided.
   c. Telephone: text length is not defined (preferably 1,500 words as upper limit, no lower limit); the man-machine interactions domain should contain 10,000 words, comprising the complete interaction.
In conclusion, given the above matrix, each corpus in the multilingual resource is not strictly comparable with the others with respect to the occurring semantic domains; however, as far as each corpus approximates the matrix, it demonstrates a comparable variation. This variation is significant with respect to the possible occurrence of spoken language structure/s at both syntactic and dialogue representation level.
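As a rough check on the informal-section sample sizes listed above, the following sketch computes how many words the minimum number of short and long texts would yield, and how many further short texts would be needed to reach the 150,000-word informal half. Since the text lengths are only approximate, the result is indicative; the function names are hypothetical.

```python
import math

SHORT_TEXT_WORDS = 1_500   # approximate length of a short informal text
LONG_TEXT_WORDS = 4_500    # approximate length of a long informal text
INFORMAL_TARGET = 150_000  # informal half of a 300,000-word corpus


def informal_words(n_short, n_long):
    """Approximate word total for a given number of short and long texts."""
    return n_short * SHORT_TEXT_WORDS + n_long * LONG_TEXT_WORDS


def extra_short_texts_needed(n_short, n_long):
    """Short texts still required to reach the informal target (0 if already met)."""
    shortfall = max(0, INFORMAL_TARGET - informal_words(n_short, n_long))
    return math.ceil(shortfall / SHORT_TEXT_WORDS)


low = informal_words(64, 8)    # 96,000 + 36,000 words
high = informal_words(64, 10)  # 96,000 + 45,000 words
```

Under these assumptions the minimum of 64 short texts does not fill the informal half by itself, which fits the wording "at least 64".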
Comparing C-ORAL-ROM with other corpora
A corpus which has a very similar design structure to the C-ORAL-ROM matrix is the Spoken Dutch Corpus which, despite its different size, has been explicitly considered a model by the C-ORAL-ROM consortium. Both C-ORAL-ROM and the Dutch Corpus address roughly the same set of variation parameters; those of the latter are outlined in Table 1.2. The aim of the C-ORAL-ROM matrix is to define the intersection between the communicative event’s structure parameters (Dialogue/Conversation vs. Monologue), the sociological context of use (Private vs. Public) and the channel (Broadcasting vs. Natural context), as the Dutch Corpus does. However, some relevant differences in the corpus design strategy must be highlighted. The use of the formal/informal partition in the sampling strategy is absent in the Dutch corpus. This distinction of register is only partially equivalent to that of scripted/unscripted programming. While a formal register of speech is frequently partially scripted, it is also linked to other relevant conditions, such as:
a. public use of speech;
b. professional use of speech, in accordance with social functions and roles of the speaker in the community;
c. the intentional task of accomplishing an oral text, which implies the treatment of a topic, an argumentation, conclusions, etc.13
In short, Formal speech may still be spontaneous, as long as it is not the execution of a previously prepared text.
Table 1.2 The Spoken Dutch Corpus design14

Dutch Corpus
  dialogue/multilogue
    private
      unscripted
        direct: conversations (‘face-to-face’); interviews
        distanced: telephone conversations; business transactions
    public
      broadcast
        more or less scripted: interviews and discussions
      non-broadcast
        unscripted: discussions, debates, meetings; lectures
  monologue
    private
      more or less scripted: descriptions of pictures
    public
      broadcast
        unscripted: spontaneous commentary
        more or less scripted: news reports, current affairs programmes; news commentary
      non-broadcast
        more or less scripted: lectures, speeches
    read aloud text
For these reasons, as the C-ORAL-ROM matrix in Table 1.1 shows, genre variation is the main criterion applied in formal contexts, where differences in general professional functions (lawyers, teachers, priests), social occasions, and topics (health, business, art, science) are relevant. In contrast, genre variation is not strictly defined as a parameter in informal contexts; instead it is variations in social domain of use (family/private, public) and dialogue structure (monologue, dialogue, conversation) which represent the parameters systematically adopted. In other words, C-ORAL-ROM adopts two different sampling strategies, one for the representation of spoken language in formal contexts and another for informal contexts. Only in formal contexts is the genre (or domain of application) of the recorded sessions strictly defined in a closed list of sub-types; this information is not provided for informal contexts. This choice has a theoretical value. While it is natural to assume the existence of a series of closed situations in which, in a certain social-historical context, the formal use of language is preferred, the same does not hold for the universe of informal speech. To feature typical contexts of use is a specific, marked trait of formal speech. For this reason, formal speech can be effectively identified by listing its most typical contexts of use. Conversely, the set of situations where informal language is used is open, and its domain cannot be represented by a list of typical contexts of use. No context is more typical than another. This feature is, in fact, far from obvious. For example, the Dutch corpus tries to define, as a comparability criterion, the contexts of its collection of monologues and dialogues, by determining the number of words for a closed set of domains of application (e.g. Business transaction, Picture description, Interview, Face-to-face conversation, Telephone).
By observing a posteriori the sampling strategies used by these two collections, both aimed at the documentation of spontaneous speech and the comparability of data from various corpora, the concurrence of two criteria appears evident. It is a fact that, by strictly defining the genres, a higher degree of data comparability can be attained. However, the downside of this practice is the a priori exclusion of significant spheres of informal speech, where genre characterisations still remain largely unexplored. This choice is significant for formal speech only. We can conclude that, by not defining explicitly the genres and domains of use of speech, C-ORAL-ROM’s corpus sampling guarantees, in theory, the possibility of occurrence in the collection of any significant genre. In other words, no genre has zero probability of occurrence.
Considering the possible sampling strategy for a spontaneous speech corpus, it may be interesting to make explicit mention of another large project that was also taken into consideration when the C-ORAL-ROM corpus design was decided. The Corpus of Spoken Israeli Hebrew (CoSIH) does not feature a corpus design scheme.15 CoSIH was designed to integrate, in a random sampling, demographic and contextual criteria for both speakers and speech events. Day-long recordings of 950 informants (statistically representing the Israeli population) have been planned over a one-year period, along
with respective sociolinguistic data. These recordings must be evaluated, and a cell from each sample must be transcribed, to set up a five-million-word corpus.16 In principle the mapping of a continuous recording into a textual resource could also be a function of a random selection of one-hour recorded cells (consisting of 5,000-word units). However this method, appropriate from a statistical point of view, does not completely bypass the definition of criteria for the selection of relevant textual units. As a matter of fact, in order to select 5,000 words of a coherent continuous text, a huge amount of fragments, silences, and minor exchanges must be negatively selected. Therefore criteria like those specifying possible corpus variations in the corpus design scheme are also needed. The first results of this project however are expected shortly and will provide new evidence for the documentation of spontaneous speech variation.
Prosodic tagging criteria
The segmentation of the speech flow into discrete speech events is one of the most relevant questions for the analysis of speech resources. In the case of written language, the nature of the linguistic units ranking above word level is not controversial. Although the unit of analysis may be chosen at different levels (i.e. argument structures, sentences or clauses or head dependent structures, see Abeillé 2003), written language can be properly parsed into discrete objects that belong to the domain of syntax. In recent linguistic tradition this question is overtly raised for spoken language and specifically for spontaneous speech (see Blanche-Benveniste 1997a; Biber et al. 1999; Cresti 2000; Miller & Weinert 1998; Quirk et al. 1985). Although the reference units for spontaneous speech are commonly referred to as speech events or utterances, the definition of such entities is far from obvious (see the discussion in Scarano 2003).
The definition of an utterance may be anchored to syntactic and/or semantic properties. One of the most common solutions is to identify it with a syntactic clause (Miller & Weinert 1998), clearly distinct from a sentence, the latter, on the other hand, being commonly assumed as the reference syntactic unit of writing. The Longman Grammar proposes a definition of C-Unit as an entity with or without a clause structure (Biber et al. 1999). Another recent definition is that proposed in the pronominal approach (Blanche-Benveniste 1997a), which identifies the nucleus of an utterance in a macro-syntactic domain based on a modalised noyau. Other solutions have been those based on a concept of predication (Voghera 1992), which in most cases has a verb as its core. However we must consider that figures show that in spontaneous speech, verbless contexts appear in around 30% of utterances (38% according to Longman Grammar).
More specifically, the statistical measurements made on the C-ORAL-ROM corpus show that verbless utterances constitute 38.1% in the Italian corpus, 24.1% in French, 37.23% in Spanish and 36.57% in Portuguese. Since roughly a third of utterances are verbless on average, all the definitions based on clause structure and verbal predication appear to be inadequate for spoken corpus analysis purposes.
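The average implied by the four percentages can be computed directly. The sketch below takes an unweighted mean over the four corpora, which is an assumption: a mean weighted by the number of utterances in each corpus could differ, and those counts are not given in this passage.

```python
from statistics import mean

# Share of verbless utterances per corpus, as quoted in the text (percent).
verbless_pct = {
    "Italian": 38.1,
    "French": 24.1,
    "Spanish": 37.23,
    "Portuguese": 36.57,
}

# Unweighted mean across the four corpora (assumption: equal weight per language).
unweighted_mean = mean(list(verbless_pct.values()))

# Language with the lowest and the highest share of verbless utterances.
lowest = min(verbless_pct, key=verbless_pct.get)
highest = max(verbless_pct, key=verbless_pct.get)
```

The unweighted mean comes to about 34%, with French as the lowest value and Italian as the highest.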
A speech event may also be identified as a dialogue act, recorded in a dialogue representation scheme (see Sinclair & Coulthard 1975, and the literature cited below); it has also been interpreted as the minimal entity allowing a modification in the discursive memory (acte énonciatif in the framework of Berrendonner & Béguelin; see Berrendonner 1990, 2003). The background assumption for the C-ORAL-ROM project is that in spontaneous language the speech flow is divided into utterances, which are defined as the linguistic counterpart of speech acts, that is, entities belonging to the action domain. Speech acts are defined in the classic work of Austin (1962).17 Given that it is generally assumed that the identification of speech acts is quite a vague exercise, and that the notion lacks a formal definition (Fava 1995; François et al. 1990; Kempson 1977), the present literature is frequently sceptical of the possibility of identifying speech acts in the reality of spoken language, and, further, of finding linguistic regularities concerning this notion. In the rest of this section, we will first show how utterances are determined in C-ORAL-ROM from an operative point of view, and subsequently discuss the relation with other concurring criteria. The question of the replicability of the coding scheme will also be addressed. The first two issues will be dealt with in these sections, while the reader can refer to the Appendix of this book for the third question.
Prosodic breaks and utterance limits
In C-ORAL-ROM, prosodic breaks are considered the most relevant cue for determining utterance boundaries. The relation between prosodic cues and the segmentation of the speech flow is a very complex argument (Simon 2004) that cannot be addressed in detail here. C-ORAL-ROM exploits some general properties of this relation to allow the segmentation of a large multilingual corpus in a clear and homogeneous manner. Tone units (or prosodic envelopes) are separated by prosodic breaks. A very general relation between tone units and information units has been recognised since the work of Halliday (Halliday 1976). From this point of view the division of speech into information units by means of prosodic boundaries can be assumed. In fact, there is clear perceptual evidence for prosodic breaks, and they are also quite easily recovered in the speech flow (see 1.2.2 below). However this correlation is not sufficient for the detection of utterances in the speech flow, given that there is no necessary one-to-one correspondence between utterances and information units. An information unit may not be an utterance, but just a part of it. In other words, an utterance can be composed of one or more information units, and can therefore be prosodically performed by more than one tone unit, that is, through a complete intonation pattern (Cresti 2000; ’t Hart et al. 1990). A correlation between prosodic breaks and the segmentation of the speech flow into utterances can however be maintained, considering that classic studies on prosody have always highlighted the fact that utterances (or sentences) end with a terminal profile (Crystal 1975; Karcevsky 1931). From this point of view those prosodic breaks
Massimo Moneglia
that conclude an utterance can in principle be discriminated from those breaks that do not. Therefore those utterances that correspond to one tone unit may be distinguished from those featuring a complex pattern of tone units. In C-ORAL-ROM the perceptual detection of terminal breaks has been adopted as a heuristic method for determining utterance boundaries in the speech flow: each string ending with a terminal break is considered an utterance. The inter-annotator agreement reached in C-ORAL-ROM in the perceptual detection of this cue is reported in the Appendix. The rough equivalence between utterance and textual string ending with a terminal break is based on the idea that the performance of language actions is necessarily correlated to prosody, which constitutes the interface between the accomplishment of illocutionary and locutionary acts. An utterance can be performed by a simple or compound intonation pattern, that is, one that is made up of only one necessary tone unit (root), or by more optional tone units, with different prosodic features (for instance, prefix unit, suffix unit, in the framework of ’t Hart et al. 1990). Each tone unit (corresponding to an information unit) ends with a break, which is an object of perception (Hirst & Di Cristo 1998). Very roughly speaking, intonation patterns the utterance into information units where the necessary root unit specifies the illocutionary force (Cresti 2000; ’t Hart et al. 1990). For example, a simple noun phrase like the door can be considered a speech act in spontaneous speech (i.e. an assertion, a question, an order, a doubt, or other possible illocutionary act), only once it is performed with an appropriate root unit. The perception of the property “terminal” in a prosodic break seems to be at least in correlation with this: we perceive the break as terminal because an act has been accomplished. 
The accomplishment of an illocutionary act is the main property that a language event must have in order to be considered an utterance. The illocutionary force determines how the “propositional” content of the utterance must be interpreted in the world (e.g. the content describes the world if the action is declarative, or it specifies how to modify the world if the action is directive). An important consequence derives from the previous theoretical considerations. From an operational point of view the utterance can be defined as the minimal linguistic unit that allows a pragmatic interpretation in the world. Therefore, insofar as the illocutionary values are conveyed by prosodic cues, each utterance must contain the prosodic cues that allow its pragmatic interpretation. In other words, prosodic cues are those that enable a competent hearer to interpret a linguistic string as a language activity. Competent speakers are extremely sensitive in their perceptual ability to detect even very slight prosodic variations, which mostly concern suprasegmental parameters, such as F0 movements, length, and intensity. More specifically, they are able to selectively perceive those variations that are voluntarily produced (’t Hart et al. 1990), as is the case when an intonation profile, conventionally codified in a language, is performed in order to express a specific illocutionary force.18 From the point of view of
The C-ORAL-ROM resource
perception, competent speakers are the only subjects that can judge, on the basis of its prosodic features, whether a spontaneous speech string does or does not have the property of allowing a pragmatic interpretation (illocutionary criterion).19 The strings of the speech flow that do not allow a pragmatic interpretation by a competent speaker, therefore, are not actual candidates for being an utterance.20 Those strings always correspond to language strings that do not end with a break perceived as “terminal”. Competent speakers, with their judgements, may not be able to assign to an utterance one specific speech act label from among the large number of possible speech acts that occur in spontaneous speech.21 For example, a competent speaker can be uncertain of whether a string like better to keep to the right or down the lane, uttered in a given context with the appropriate intonation, is to be considered “advice” or “prompting”. But every speaker will recognise whether each information unit can or cannot receive a pragmatic interpretation. For example, a competent speaker will deny this property if the previous two strings appear within the same tone unit, as in better to keep to the right down the lane, or in a patterned utterance with a topic-comment structure as down the lane / better to keep to the right.22 In summary, with regard to its prosodic properties, an utterance can be simple, that is, featuring only one tone unit, or compound, that is, patterned in various tone units. Competent speakers have the ability to detect prosodic breaks and to distinguish breaks that terminate a sequence of tone units from those which signal the continuation of the same prosodic programme. This perceptual ability is reinforced by the pragmatic correspondence between prosodic patterns ending with a terminal break and the accomplishment of illocutionary acts. The heuristic proposed in C-ORAL-ROM relies on the theoretical premises just outlined.
Perceptively relevant terminal breaks are assumed to identify the utterance limits or, less compromisingly, to provide enough evidence that what follows a terminal break necessarily belongs to a different utterance, and therefore to a different pragmatic and linguistic domain. The various kinds of prosodic breaks are conceptualised and defined as follows:
a. Prosodic break: perceptively relevant prosodic variation in the speech continuum such as to cause the parsing of the continuum into discrete prosodic units.
b. Terminal prosodic break: given a sequence of one or more prosodic units, a prosodic break is considered terminal if a competent speaker assigns to it, according to his perception, the quality of concluding the sequence.
c. Non-terminal prosodic break: given a sequence of one or more prosodic units, a prosodic break is considered non-terminal if a competent speaker assigns to it, according to his perception, the quality of being non-conclusive.
.. Background of prosodic labelling
The reliability of the perception of prosodic breaks in the speech continuum is based on their strong salience (’t Hart et al. 1990; Hirst & Di Cristo 1998). Therefore the C-ORAL-ROM prosodic tagging procedure is considered accessible to labellers even after a brief training. This assumption is in line with other experiences of prosodic break labelling. Breaks have also been annotated (together with other cues), specifically in a sub-corpus of the Dutch corpus (Buhmann et al. 2002). In this, strong and weak breaks are discriminated and marked in the speech flow, and are reported in the transcripts (with the symbols | and ||, respectively).23 The annotation of prosodic breaks (both the distinction between “weak” and “strong”, and “terminal” and “non-terminal” ones) does not deal with modelling F0 movements, nor with marking lengthening or intensity peaks, as is the case with other prosodic notations that have been widely used in the last two decades (e.g. the ToBI system). This type of prosodic annotation, which does not constitute a “transcription of prosody”, is significant as regards strategies for labelling large corpora. The percentage of consensus on cues like “types of edge tone” and “types of pitch accent” was found to be quite low (Wightman 2002), while on the contrary a high degree of agreement on labelling prosodic breaks is found. For instance, in ToBI annotation, boundary tones score an agreement between 85% and 92%, and the prosodic annotation in the Dutch corpus by non-expert labellers has also been successfully verified in terms of K statistics (Buhmann et al. 2002).24 However, the results obtained for the Dutch corpus cannot be automatically extended to C-ORAL-ROM. C-ORAL-ROM is a multilingual resource and many factors may influence the perception of prosodic cues in different languages. Moreover, although the annotation in the Dutch corpus may partially overlap that of C-ORAL-ROM, it is not co-extensive. Strong breaks are defined as “severe interruptions of the normal flow of speech”, while weak breaks are defined as “weak but still clearly audible interruptions of the speech flow”.
Now, it is very likely for all terminal breaks to be perceived as severe interruptions of the speech flow, but a remarkable number of non-terminal breaks also share this property. For instance, the prosodic break connected to prefix intonation (see ’t Hart et al. 1990 and the example in Figure 1.4 below) is a severe interruption of the speech flow, but under no circumstance may it be perceived as a terminal break. In other words, a strong break may not have the functional value of terminal breaks (that is, of marking the end of the utterance), and therefore it has lower linguistic value. Given these premises, the prosodic tagging accomplished in C-ORAL-ROM has been evaluated by an external institution in order to confirm the reliability of this annotation. This evaluation was performed by mother-tongue non-expert evaluators hired by one of the main companies in speech technology (Loquendo, Turin), and it found that the annotation of prosodic breaks, especially terminal ones, has a very high consensus; reliability is therefore high. An extensive report on this is found in the Appendix.
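The K statistics mentioned above express agreement between labellers after correcting for the agreement expected by chance. The following is a minimal sketch of Cohen's kappa for two labellers, assuming that both have assigned one of three values to the same sequence of word boundaries. The function name, the label names, and the data are invented for illustration; this is not the procedure used in the C-ORAL-ROM or Dutch corpus evaluations.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Proportion of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgements on six word boundaries (not corpus data).
annotator_1 = ["none", "non-terminal", "none", "terminal", "none", "terminal"]
annotator_2 = ["none", "non-terminal", "none", "terminal", "non-terminal", "terminal"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

In this invented sample the two labellers disagree on one boundary out of six; kappa is lower than the raw agreement rate because part of that agreement would be expected by chance.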
.. Utterance boundaries and labelling of discourse acts Traditional speech act theory has been considered insufficient for managing dialogue flow. Language activities for discourse management have been introduced, comprising devices such as turn-taking, repair, reference/information, and attention, which have been proposed for analysing concrete dialogues of everyday use. The first scheme for dialogue acts, based on the study of real conversations, was worked out by Sinclair and Coulthard (1975), and developed by Stenström (1994). In this tradition the dialogue exchange is described as consisting of dialogue transactions, dialogue turns, dialogue moves, and dialogue acts, which are defined in a coding scheme. Dialogue coding has been very useful in a range of research fields, especially in computational linguistics and language technology; consequently, some best practices have emerged.25 Some coding schemes have been developed to cover specific applications such as LINDA answer systems, as well as translation systems, such as VERBMOBIL (Alexandersson et al. 1997; Siegel 1997). Others are more general and are similar to that conceived by Sinclair and Coulthard (1975) in their analysis of classroom conversations. The main general schemes have been developed for Map Tasking (Carletta et al. 1997) and DAMSL (Discourse Representation Initiative). Given the relevance of this recent tradition for the annotation of speech corpora, the relation with the C-ORAL-ROM annotation of utterance boundaries must be explicitly considered. It is important to note that, once the relation between prosodic cues and speech act performances is recognised, the parsing of the speech flow into discrete speech events does not rely on the recognition by the labeller of a specific performed action. Instead, recognition of utterance limits is independently motivated. This is not irrelevant from the point of view of the annotation of dialogue acts. 
Although the coding scheme for dialogue acts provides a closed list of possible moves, a competent speaker may find it difficult to identify and define the performed act. The replicability of the coding scheme is, as a matter of fact, one of the main problems of the annotation of dialogue acts, even in quite restricted domains. Once the utterance limits are identified, the language string corresponding to an utterance is the linguistic entity which is suitable for receiving a certain tag. In other words, the definition of utterance limits is a matter of direct perception, while the assignment of a specific value to a dialogue act is a categorisation issue involving our knowledge of linguistic values. The annotation of dialogue acts parallels current labelling practices for the transcription of prosody in ToBI (Avesani 1996; Grice et al. 1996; Nolan & Grabe 1997): both these practices add qualitative labels to a language domain according to an annotation scheme. While the dialogue labeller selects a specific tag for discourse acts performed by the textual string in the dialogue structure, in ToBI, the labeller interprets a prosodic movement and tags it in a certain way.26
Based on these premises, it can be seen that the annotation of terminal and nonterminal breaks does not describe the prosodic movement that actually occurs in correspondence with a specific speech segment, but rather it selects the specific segment where, according to perception, a significant movement occurs. At the same time the annotation does not specify which proper speech act is performed by a sequence of words, but rather, specifies which sequence of words performs an act, for prosodic reasons. Clearly, the hypothesis underlying this scheme is that the previous annotation processes are quite different from one another. In short, while the segmentation of speech into utterances through terminal breaks is a matter of direct perception, the interpretation of intonation movements and the identification of dialogue acts mainly involves categorisation and understanding. For this reason, the three levels should be separated. The annotation of terminal and non-terminal breaks, however, specifies a common interface for both levels of annotation. Once the relevant domain for prosodic movements and speech acts is determined, this will probably allow a better interpretation of both the relevant prosodic movements and the functional, dialogical value of the speech event. The same consideration can hold for syntactic features. Utterances cannot be identified and defined on the basis of syntactic properties as clauses can, for instance, but once an utterance is identified on the basis of a terminal break, any kind of morpho-syntactic and lexical evaluation can be driven on it.27 In other words, the annotation of terminal and non-terminal breaks links the domain of prosody with the domain of speech act analysis and may be significant for improving the general state of the art on spoken text representation and analysis.
.. Utterance limits in spontaneous speech The pragmatic definition of speech events is not the only theoretical alternative that has been proposed. Besides the syntactic definition cited above, in current work on spoken language, a further alternative has often been used, namely, the practical definition of an utterance as the linguistic sequence between two silences (see TEI guidelines28 ). Such a criterion may sound preferable to the prosodic criterion adopted in C-ORAL-ROM, as it appears to be more objective, whereas the prosodic criterion may be considered arbitrary, as it is based on perception. In what follows, we will point out two empirical arguments strictly concerned with the analysis of spontaneous speech: (a) that syntactic structures appear frequently underdetermined, and that their definition is rather a function of prosodic cues; and (b) that the timing of the wave signal is not linguistically significant, as it is, at the same time, a criterion which is both too weak and too strong to determine the utterance boundaries in spontaneous speech corpora. The following is a dialogic exchange between a beautician and a client who is about to undergo a depilation of her legs. This text, taken from the Italian sub-corpus (file ifamdl15 of the multimedia corpus in the DVD), allows an empirical evaluation of the
difficulties encountered by both criteria. Let us consider the bare transcription (i.e. words only), accompanied by the essential contextual information:
*EST: o vieni dai [come on then]
%act: the beautician invites her client to begin the hair removal
*CLA: a patire [to suffer]
*EST: no ascolta qui sopra sì [no listen up here yes]
%act: the beautician gets closer to the leg to be depilated
*CLA: qui sì [here yes]
The first thing to note is that the third and the fourth turns are verbless. For this reason their syntactic structure is underdetermined by the actual syntactic data and may be compatible with many possible interpretations. The following punctuations highlight eight possible dependency structures for the first case and three dependency structures for the second:
No. Ascolta. Qui? Sopra? Sì. [No. Listen. Here? At the top? Yes.]
No. Ascolta qui sopra. Sì. [No. Listen here. Yes.]
No. Ascolta. Qui sopra, sì. [No. Listen. Up here, yes.]
No, ascolta, qui. Sopra sì. [No, listen, here. Yes, on top.]
No, ascolta. Qui sopra, sì. [No, listen. Up here, yes.]
No. Ascolta. Qui. Sopra sì. [No. Listen. Here. Yes, on top.]
No, ascolta. Qui. Sopra sì. [No, listen. Here. Yes, on top.]
No. Ascolta, qui sopra. Sì. [No. Listen, (what about) up here? Yes.]

Qui sì. [Here, ok.]
Qui, sì. [Here, yes.]
Qui. Sì. [Here. Yes.]
As the translations suggest, all the word groupings, which all correspond to distinct possible utterance boundaries, are consistent with the contextual domain. Therefore, neither syntactic nor contextual information is sufficient to determine the actual structure of the previous turns. In speech, however, these turns are in fact not ambiguous. Once the information borne by terminal and non-terminal breaks is perceptively recovered, the reference units can be determined with precision. The speech act labels in the dependent tier may help the reader to achieve the proper interpretation, as illustrated in the following:
*EST: o vieni / dai // [come on then]
%ill: invitation
*CLA: a patire // [to suffer]
%ill: ironical assertion
*EST: no // ascolta / qui sopra ? sì // [no // listen / (what about) up here ? yes //]
%ill: reassurance (1) question, introduced by a conative (2) self answer (3)
*CLA: qui ? sì // [here ? yes]
%ill: question (1) answer (2)
In other words, the index which determines the choice of possible structure for both turns under investigation is the prosodic structure, and not the reverse. This is a relevant property of the utterance that follows from the fact that each utterance accomplishes a distinct illocutionary act. This act defines the relevant domain for linguistic relations in the speech flow. Once two strings belong to two distinct utterances, then the linguistic expressions belonging to those strings do not share the same linguistic domain. They belong to two autonomous speech acts, they depend on two illocutionary forces, and they are the function of two distinct locutionary and prosodic programmes. In present approaches to multimedia spoken language archives, a “from silence to silence” criterion is the most common, probably because the automatic recognition of pauses in the speech flow is quite an easy task, given current technology. This, in fact, at the outset, seems very reasonable, because utterance boundaries frequently occur together with significant wave interruptions. Figure 1.2 illustrates the third turn from the dialogue examined, where, after the utterance no, there is an interruption of approximately 600 ms (indicated in light grey) that marks the break before the beginning of the next utterance. This is not, however, as straightforward as it seems. To illustrate a different, and common, case, we examine the fourth turn of the dialogue, shown in Figure 1.3. The speech flow of this turn corresponds to two distinct speech acts (marked in black and light grey, respectively), but actually there is no pause separating the two speech events.
Figure 1.2 Pause before a new utterance
Figure 1.3 Two utterances not separated by pause
Figure 1.4 Pause within an utterance, after a Topic unit
The distinction is still possible because prosody perception appears to be sensitive even to the 20 Hz discontinuity occurring at the start of the second utterance. A converse case may also occur: a perceptively relevant prosodic break may be accompanied by a pause even if the break does not mark the end of the utterance. Figure 1.4 illustrates a typical utterance with Topic-Comment structure, taken from the same dialogue.29 The first element has prefix intonation and is perceived as nonconclusive, while the second string is conclusive (’t Hart et al. 1990): *EST: . . . lei / prima veniva tutte le settimane // [she / used to come every week once //]
The Topic unit is separated from the Comment unit by a pause (in light grey). If we assumed the silence-to-silence criterion, the sequence would be wrongly considered a sequence of two distinct utterances. In summary, the concept of utterance as a sequence between two silences does not match the identification of an utterance determined on a prosodic basis. Utterances may occur with no need for pauses between them, and therefore in such instances, it would be a case of too strong a criterion; conversely, the occurrence of a pause is not a sufficient cue to infer the conclusion of an utterance, hence making it too weak a criterion. A quantitative measurement of the incidence of both over-extension and under-extension of the silence-to-silence criterion can be proposed on the basis of C-ORAL-ROM data. The French corpus is tagged using both temporal and perceptual criteria. Pauses of more than 200 ms were detected automatically in the speech flow and annotated in the transcripts. At the same time the corpus was also tagged with respect to terminal and non-terminal prosodic breaks, as perceived by the expert operators who transcribed and tagged the corpus. On the basis of the results of this double tagging, we recorded that approximately 63% of sequences ending with a terminal break are accompanied by a pause, while 37% of sequences ending with a terminal break do not bear a pause. Conversely, approximately 42% of breaks that were considered non-terminal are also accompanied by a pause.
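The cross-tabulation behind such percentages can be sketched in a few lines. The example below assumes that each perceived break has been recorded as a pair (break type, pause detected or not); the record layout, the function name, and the sample data are hypothetical and do not reproduce the French corpus figures. It returns the share of terminal breaks with no pause (utterance boundaries that a silence-to-silence criterion would miss) and the share of non-terminal breaks with a pause (boundaries that the same criterion would wrongly introduce).

```python
def silence_criterion_errors(breaks):
    """breaks: iterable of (kind, has_pause) pairs,
    where kind is "terminal" or "non-terminal"."""
    counts = {"terminal": [0, 0], "non-terminal": [0, 0]}  # [with pause, total]
    for kind, has_pause in breaks:
        counts[kind][1] += 1
        if has_pause:
            counts[kind][0] += 1
    t_pause, t_total = counts["terminal"]
    n_pause, n_total = counts["non-terminal"]
    # Terminal breaks without a pause: boundaries missed by the pause criterion.
    missed = (t_total - t_pause) / t_total if t_total else 0.0
    # Non-terminal breaks with a pause: boundaries wrongly introduced by it.
    spurious = n_pause / n_total if n_total else 0.0
    return missed, spurious


# Hypothetical sample: 8 terminal breaks (5 with a pause) and
# 5 non-terminal breaks (2 with a pause).
sample = (
    [("terminal", True)] * 5
    + [("terminal", False)] * 3
    + [("non-terminal", True)] * 2
    + [("non-terminal", False)] * 3
)
missed, spurious = silence_criterion_errors(sample)
print(missed, spurious)
```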
.. Prosodic labelling procedure and Alignment units
Based on the empirical results above, the identification of utterance boundaries by means of prosody appears to be significant from a linguistic point of view; however, it must also be stressed that it is in fact an easy annotation to add to spoken corpora. Once the concept of prosodic break is clear to the transcribers, they are able to listen directly to the speech flow and use direct perception to annotate the terminal and non-terminal breaks in real time, together with transcription of the segmental information. The segmentation of speech into utterances is therefore simultaneous to both the transcription and the annotation of the main prosodic boundaries. Tagging consists of the marking of prosodic breaks in the orthographic transcription of speech. In C-ORAL-ROM each word boundary is considered a possible position for a prosodic break; breaks within a word are not annotated. Each word boundary in C-ORAL-ROM transcripts thus necessarily has one of the following values:
1. no break
2. terminal break
3. non-terminal break
The labelling is based only on perceptual judgments and in principle does not require any specific linguistic knowledge, although the notion of speech act is always familiar to the expert transcribers (comprising PhDs and PhD students) who annotated the corpus.
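The three-valued labelling of word boundaries can be made explicit with a small sketch. It assumes a transcript at the basic level of tagging, in which words and the two generic symbols (/ for non-terminal and // for terminal breaks) are separated by spaces; the function name and the representation of the result are illustrative choices, not part of the C-ORAL-ROM specification.

```python
# Generic break symbols at the basic level of tagging (assumed whitespace-separated).
BREAK_VALUES = {"/": "non-terminal break", "//": "terminal break"}


def boundary_values(tagged_text):
    """Return (word, value of the boundary that follows the word) pairs."""
    tokens = tagged_text.split()
    result = []
    for i, token in enumerate(tokens):
        if token in BREAK_VALUES:
            continue  # break symbols are not words
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        result.append((token, BREAK_VALUES.get(following, "no break")))
    return result


# First turn of the dialogue discussed in the text.
values = boundary_values("o vieni / dai //")
for word, value in values:
    print(word, "->", value)
```

Every word receives exactly one value, which mirrors the requirement that each word boundary in the transcript is classified as no break, terminal break, or non-terminal break.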
The process of prosodically tagging utterance boundaries in C-ORAL-ROM involves linking the annotation of the speech signal with the relevant linguistic information; in other words, alignment. The definition of the text-to-speech interface in C-ORAL-ROM is based on the idea that the access to acoustic information in a multimedia corpus must go hand in hand with the representation of prosody. Given that in C-ORAL-ROM all the textual information is tagged simultaneously with respect to prosodic parsing and utterance limits, each textual unit corresponding to an utterance can be easily and directly aligned to its acoustic counterpart, thus ensuring a natural and meaningful text-to-sound correspondence. Such a method can be proposed as a possible standard for storing oral language in multimedia and multi-modal language resources. In Section 1.4 the technical process and methods for text-to-speech alignment will be discussed. The following outlines the standard procedure adopted in C-ORAL-ROM:
a. Tagging of prosodic breaks simultaneous with the transcription by a first labeller.
b. Revision of tagging by a different labeller in connection with the revision of transcripts.
c. Revision of tagging and specific challenge of terminal breaks by a third labeller in connection with the alignment.
This process ensures control over inter-annotator relevance of tags and a maximum accuracy in the detection of terminal breaks. The accuracy with respect to non-terminal breaks is by definition lower.
.. Conventions for prosodic tagging in the transcripts Discriminating between terminal and non-terminal breaks is mandatory in all the C-ORAL-ROM transcripts. When a prosodic break occurs, the prosodic tagging is specified with a dedicated symbol taken from the CHAT tradition, as shown in Table 1.3. Each symbol represents a type of break which is defined in the annotation schema which permits a very restricted set of alternatives. C-ORAL-ROM allows the prosodic tagging to be displayed at two hierarchical levels, with greater or lesser attention to the annotation of the types of terminal breaks and to disfluency in the speech performance (Shriberg 1994). For terminal breaks, a // tag (double slash) is inserted in the transcription each time a prosodic break is perceived as terminal. At the richer level of transcription, three more types of terminal breaks, that appear to be extremely evident perceptually, can be specified, namely, interrogative utterances, intentionally suspended utterances, and interrupted utterances. Each of these is given a dedicated symbol, as listed in Table 1.3. At less detailed levels of transcription, interrogatives and intentionally suspended utterances are marked with the generic terminal break // tag, with no added specification. For non-terminal breaks, the / symbol (single slash) is inserted in the transcription to mark the internal prosodic parsing of a textual string which ends with a terminal
Table 1.3 Summary of prosodic break types in C-ORAL-ROM30
Symbol | Type description | Break value
// | Conclusive prosodic break | Terminal
? (optional) | Conclusive prosodic break such that the utterance has an interrogative value | Terminal
... (optional) | Conclusive prosodic break such that the utterance is left intentionally suspended by the speaker | Terminal
+ | Conclusive prosodic break such that the utterance is interrupted by the listener or by the speaker himself | Terminal
/ | Non conclusive prosodic break | Non-terminal
[/] (optional) | Non conclusive prosodic break caused by a false start | Non-terminal
[//] (optional) | Non conclusive prosodic break caused by a false start (retracting) such that the linguistic material is only partially repeated | Non-terminal
[///] (optional) | Non conclusive prosodic break caused by a false start (retracting) such that the linguistic material is not repeated | Non-terminal
break. This sign is inserted in the position where a prosodic break that is not perceived as terminal is detected in the speech flow.
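The way these symbols partition a turn can be illustrated with a short sketch. It assumes that words and break symbols are separated by spaces, as in the examples quoted in this chapter, and it groups the symbols of Table 1.3 into the two break values; the function name and the output structure are hypothetical, and the sketch is not the software used to build the corpus.

```python
# Symbols from Table 1.3, grouped by break value.
TERMINAL = {"//", "?", "...", "+"}
NON_TERMINAL = {"/", "[/]", "[//]", "[///]"}


def segment_turn(text):
    """Split a tagged turn into utterances, each a list of tone units."""
    utterances, units, words = [], [], []
    for token in text.split():
        if token in TERMINAL or token in NON_TERMINAL:
            if words:  # any break closes the current tone unit
                units.append(" ".join(words))
                words = []
            if token in TERMINAL and units:  # a terminal break closes the utterance
                utterances.append({"tone_units": units, "end": token})
                units = []
        else:
            words.append(token)
    # Material not followed by a terminal break is kept, marked as unterminated.
    if words:
        units.append(" ".join(words))
    if units:
        utterances.append({"tone_units": units, "end": None})
    return utterances


# Third turn of the beautician dialogue quoted in the text.
result = segment_turn("no // ascolta / qui sopra ? sì //")
for utterance in result:
    print(utterance)
```

Applied to this turn, the sketch yields three utterances, the second of which is patterned in two tone units, in line with the three illocutionary values given for the turn in the dependent tier.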
... Fragmentation phenomena This annotation scheme embodies the generalisation that a prosodic break always occurs when a disfluency arises in the speech performance. This is true in the case of the interrupted utterances mentioned above, but it is also true in the case of false starts or retracting, which are probably the most frequent and interesting fragmentation phenomena of spontaneous speech (Blanche-Benveniste 2003). When fragmentation occurs, this is perceived as a clear break in the prosodic programme. Therefore a fragmentation never takes place without a simultaneous break in prosodic fluency. Two types of fragmentation can be noted: 1. Interruptions. An interruption (non-completion) of the utterance may be due to many reasons: a change of the linguistic programming by the speaker, an interruption caused by the listener, or by other events in the environment. Interruptions may be accompanied by word fragmentation, i.e. interruption before the end of the last word of the utterance, although the absence of fragmentation is the more frequent case. The interruption mark “+”31 is counted as a kind of terminal break, and is inserted in the transcription in the position of the interruption. The following is an example of interruptions within a dialogic turn: *MAR: è perché forse / non c’ è stata + dico / tu / che perdi il capo / che + perché / a lui / gli piacciono i soldi / Ida // [it depends perhaps / there has not been + I say / you / that lose your head / that + because / he / he likes money / Ida // ]
2. Retracting and/or restart and/or false start(s). Retracting (or false start) is the most frequent fragmentation phenomenon in spontaneous speech. The speaker hesitates while trying to find the best way to express himself and retracts his speech before choosing between two alternatives. This phenomenon is, generally, clearly distinguishable from interruptions or changes in programming, reported above, which do not feature speaker’s hesitations. Contrary to interruptions, retracting is almost always accompanied by the repetition (complete or partial) of the linguistic material, and clearly causes a loss of the informational value of the retracted material, which is abandoned by the speaker in favour of the chosen alternative. As in the case of interruption, in retracting phenomena, a change of prosodic envelope is again necessary. In other words, the retracting between two elements cannot be accomplished in the same prosodic envelope. Therefore, retracting is always accompanied by a prosodic break marked with the symbol “[/]”,32 illustrated in the following example: *MAX: mio &cug [/] mio fratello non nuota // [my &cous [/] my brother doesn’t swim //] Retracting breaks are considered a type of non-terminal breaks and are highlighted only at richer levels of transcription. The symbol is inserted in the transcription after each set of fragments, in the position where a restart begins. At poorer levels of transcription, retracting phenomena are not treated as a special kind of prosodic break caused by fragmentation, and only the generic non-terminal break sign is used after each set of fragments where a restart begins, as in the following Portuguese example: *FER: [2] / o / o touro / realmente / estava a / estava a dar cabo do cavalo // [the / the bull / actually / was to / was attacking the horse //]
3. Retracting/interruption ambiguity. In some cases it is hard to decide whether a fragmentation phenomenon fits with the definition of “restart” or “interruption”. This can be the case when an alternative to the locution in object is chosen, but no repetition is involved. In this case a supplementary sign, “[///]”, can optionally be used marking the fact that a retracting phenomenon probably occurs with neither partial nor complete repetition of the linguistic material. The ambiguous mark is counted as a non-terminal break.
. Textual format C-ORAL-ROM’s format can be defined as a relation between two independent levels: a textual format, in which each recorded session is transcribed, and a text-to-speech alignment format, which allows access to the acoustic data from the textual information. The principles which guide the transcription of spontaneous speech will be highlighted in this section, while the alignment method will be explained in Section 1.4.
Massimo Moneglia
C-ORAL-ROM’s textual format is an implementation of the CHAT architecture (MacWhinney 1994). Its structure comprises two main levels:
1. Metadata. Metadata contain all available information on the recorded session, allowing its identification and localisation within the resource. Metadata are recorded as an ordered set of header lines placed before the session transcription and are compiled following explicit rules and terminology.
2. Dialogue representation. Sets of relevant linguistic and non-linguistic information embodied in the recorded session are presented in textual form. The dialogue representation principles allow the written representation of the spoken events and of the relevant circumstances which enable their interpretation. Dialogue representation features two main levels of information:
   i. Text lines: orthographic transcription of the speech events occurring in the recorded session, divided as follows:
      a. Vertically, in dialogic turns (introduced by a speaker label);
      b. Horizontally, by prosodic parsing and utterance limit, representing terminal and non-terminal prosodic breaks of the speech continuum.
   ii. Dependent tiers: contextual information.
1.3.1 Metadata
Metadata are listed in a closed set of types, comprising:
a. the context of the recorded session;
b. its size;
c. the speakers;
d. the acoustic quality;
e. the source;
f. the person who can provide information on the session.
Each type of metadata is introduced by “@” immediately followed by a label, a colon “:” and a single space, then by the PCdata, and is terminated by “enter”. In order to allow better retrieval, PCdata refer as much as possible to a previously defined standard terminology (closed vocabulary). Table 1.4 summarises the various metadata types, and Tables 1.5 to 1.8 provide further details regarding the vocabulary and the formation rules.
The Situation field describes the set of contextual information regarding the session and how it was recorded, as detailed in Table 1.6, and is represented as an ordered set of sub-fields separated by commas.33 The Class field is an ordered set of sub-fields which mirrors the corpus design classification of each recorded session, as illustrated in Table 1.7.
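As an illustration of the header syntax just described, the following Python sketch reads the header lines that precede a transcription and splits a comma-separated field into its sub-fields. It is a minimal sketch under stated assumptions, not the project's own software: the function names are hypothetical, the `@Title` value is invented for the example, and labels are assumed to consist of letters and underscores only.

```python
import re

# "@" + label + ":" + one space + PCdata, one header per line.
# Assumption: labels contain only letters and underscores.
HEADER_RE = re.compile(r"^@(?P<label>[A-Za-z_]+): (?P<value>.*)$")


def parse_headers(lines):
    """Collect the leading "@Label: value" lines of a transcript into a dict."""
    headers = {}
    for line in lines:
        match = HEADER_RE.match(line.rstrip("\n"))
        if match is None:
            break  # first non-header line: the transcription starts here
        headers[match.group("label")] = match.group("value")
    return headers


def split_subfields(value):
    """Split an ordered, comma-separated field such as @Situation or @Class."""
    return [part.strip() for part in value.split(",")]


sample = [
    "@Title: traffic",  # invented value, for illustration only
    "@Date: 20/06/2001",
    "@Situation: gossip between friends at home during dinner, not hidden, research participant",
    "*MAX: are you sure?",
]
headers = parse_headers(sample)
print(headers["Date"])
print(split_subfields(headers["Situation"]))
```

Stopping at the first non-header line reflects the statement that headers are placed before the session transcription.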
The C-ORAL-ROM resource
Texts in the collections are labelled with respect to the acoustic quality of the sound source according to the criteria outlined in Table 1.8.34

Table 1.4 Metadata types
Label type | Description
@Title: | One or two words in the object language which help to identify the text.
@File: | Filename (note that the names of the audio and text files differ only in the extension, which is not included in the Filename).
@Participants: | Three capital letters identifying each speaker, followed by the corresponding first name, plus a sub-field with an ordered set of information about the speaker.
@Date: | Date of the recording, in the form day/month/year; e.g. 20/06/2001.
@Place: | Name of the city where the recording session took place.
@Situation: | Ordered set of information: genre, role of the participants, surroundings, main actions performed, recording conditions; e.g. gossip between friends at home during dinner, not hidden, research participant.
@Topic: | The main topic dealt with in the speech event (max. 50 characters); e.g. traffic problems.
@Source: | Name of the collection leading to a copyright holder; e.g. LABLITA_CORPUS; CORPAIX.
@Class: | Set of fields identifying the session in accordance with the C-ORAL-ROM corpus structure, separated by commas.
@Length: | Length of the transcribed audio file in minutes (’) and seconds (”); e.g. 12’ 15”.
@Words: | Number of words in the text file.
@Acoustic_quality: | Acoustic quality of the recording, labelled according to specific criteria (A, B or C).
@Transcriber: | Name of the person(s) in charge of the text, who can provide further information.
@Revisors: | Names of the revisors.
@Comments: | Transcriber’s comments on the text.
Table 1.5 Ordered set of sub-fields for each Participant
Type | Description | Vocabulary
Sex | Sex of the speaker | Closed vocabulary. One of the following conventional capital letters, according to the description in brackets: M (Male); W (Female)35; O (non-defined; e.g. artificial devices)
Age | Age of the speaker | Closed vocabulary. One of the following conventional capital letters for each range of age between brackets: A (18–25); B (25–40); C (40–50); D (over 60)
Education | Level of education according to the degree of schooling | Closed vocabulary. One number according to the degree of schooling between brackets: 1 (primary school or illiteracy); 2 (high school); 3 (graduates or university students)
Profession | Speaker’s occupation | Open vocabulary. Name of the profession (e.g. professor; secretary; student)
Role | Role in the recorded event (even if it is the same as the profession) | Open vocabulary. Name of the role (e.g. father; professor)
Geographical origin/linguistic influence | Name of the place of origin of the speaker | Open vocabulary (e.g. Ile de France; Castilla)
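The closed vocabularies of Table 1.5 lend themselves to a simple consistency check. The sketch below is only an illustration: the function name and the way the three values are passed are assumptions, and only the closed sub-fields (sex, age, education) are checked, since the other sub-fields have an open vocabulary.

```python
# Closed vocabularies copied from Table 1.5.
SEX = {"M", "W", "O"}
AGE = {"A", "B", "C", "D"}
EDUCATION = {"1", "2", "3"}


def check_participant(sex, age, education):
    """Return the names of the sub-fields whose value is outside its closed vocabulary."""
    problems = []
    if sex not in SEX:
        problems.append("sex")
    if age not in AGE:
        problems.append("age")
    if education not in EDUCATION:
        problems.append("education")
    return problems


print(check_participant("W", "B", "3"))
print(check_participant("F", "B", "4"))
```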
Table 1.6 Ordered set of sub-fields for Situation
Type | Description | Vocabulary
Genre | Information helping to define the genre of activity to which the linguistic event belongs. | Open vocabulary (e.g. gossip; chat; quarrel; discussion; narration; claim; etc.). The neutral case is “talk”. The information of the Class field (dialogue; conversation; etc.) is not repeated.
Reciprocal role | Reciprocal position of the participants. | Open vocabulary (e.g. friends, colleagues, relatives, citizens).
Ambience | Surroundings in which the recordings took place. | Open vocabulary (e.g. in a silent studio; on the street; at home; in a shop; at school; in the office; etc.).
Action | Main action performed during the speech event (if any). | Open vocabulary (e.g. while ironing; during depilation; etc.).
Recording conditions | Status of recording with respect to the observer’s paradox in spontaneous speech resources. | Closed vocabulary. Choice between the alternatives in each of the following two sets: a. hidden vs. not hidden; b. participant researcher vs. observing researcher vs. researcher not present.
Table 1.7 Ordered set of sub-fields for Class
Informal
  Type: family/private; public
  Sub-type: monologue; dialogue; conversation
Formal
  Type: formal in natural context
    Sub-type: political speech; political debate; preaching; teaching; professional explanation; conference; business; law
  Type: media
    Sub-type: news; sport; interview; weather forecast; scientific press; documentary; talk show
  Type: telephone
    Sub-type: private conversation; human-machine interaction
  Sub-sub-type: monologue; dialogue; conversation
Table 1.8 Labels for acoustic quality
Type | Properties | Label
Digital recordings | Digital recordings with DAT or minidisk apparatus and unidirectional microphones. | A
Analogue recordings | a. Analogue recordings with good microphone response; b. Low background noise; c. Low percentage of overlapped utterances; d. F0 computing possible for most of the files. | B
Analogue recordings | a. Low-quality analogue recordings with poor microphone response; b. Background noise; c. Medium percentage of overlapped utterances; d. F0 computing possible in many parts of the files. | C
1.3.2 Dialogue representation
As introduced at the beginning of Section 1.3, for a given recorded session which comprises a continuous set of acoustic data, dialogue representation includes: (1) the notion of text, that is, the representation of a session in discrete written form; and (2) the notion of dialogic turn, which defines the maximal structural components of such a discrete unit. The textual representation of both sessions and dialogic turns must be consistent with the representation of the speech events that make up the dialogue itself. The definitions outlined in Table 1.9 provide a clear distinction between the relevant entities of dialogue structure and speech event representation, allowing a possible textual representation of dialogue.

Table 1.9 Entities for dialogue representation
Concept | Definition
Session | Set of dialogic turns corresponding to one metadata set.
Text | Each recorded session as an ordered collection of transcribed dialogic turns referring to a given metadata set.
Dialogic Turn | Continuous set of speech events performed by a single speaker’s voice. The dialogic turn changes if, and only if, a speech event by another speaker occurs.
Utterance | The minimal speech event by a single speaker such that it can be pragmatically interpreted as a speech act.
Word | A speech event perceived as a phonetic unit, such that it conveys a meaning.
Turn representation. A dialogic turn by one speaker is expressed by “*”, immediately followed by the three capital letters identifying the speaker in the metadata, then by “:” and one space before the transcription of the speech event. Each dialogic turn ends with an “enter” (Table 1.10).
Table 1.10 Turn representation
Convention* | Description
^\*[A-Z]{3}:\s{1} | Dialogic turn of a given speaker
* Expressed with regular expression notation.
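The regular expression of Table 1.10 can be used directly to recognise turn lines. The following Python sketch applies it to separate the speaker label from the transcribed text; the function name is hypothetical, and the sketch makes no claim about how the C-ORAL-ROM tools themselves process turns.

```python
import re

TURN_RE = re.compile(r"^\*[A-Z]{3}:\s{1}")  # pattern taken from Table 1.10


def split_turn(line):
    """Return (speaker, text) for a dialogic turn line, or None for any other line."""
    match = TURN_RE.match(line)
    if match is None:
        return None
    speaker = line[1:4]  # the three capital letters after "*"
    return speaker, line[match.end():]


print(split_turn("*MOR: I'm going home //"))
print(split_turn("%alt: (3) basta"))
```

Lines that do not start with a well-formed speaker label, such as header lines or dependent lines, are simply reported as not being turns.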
Utterance representation. Each utterance is represented by a series of transcribed words, ending with the termination symbol “//” or with another symbol having a terminal value, such as “?”, “+” or “...” (see Table 1.3), as in the following example:
*MOR: I’m going home // Are you tired too?
*PIL: bye bye //
A dialogic turn can also be filled with non-linguistic or paralinguistic material according to the transcription convention (see Table 1.12),36 as follows:
*MAX: are you sure?
*PIL: hhh //
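The role of the terminal symbols can be illustrated with a short Python sketch that cuts the text of a dialogic turn into utterances. It rests on assumptions that go beyond the description above: symbols are taken to be separated from words by spaces, except for a question mark attached to the last word (as in the examples), the suspension sign is written as three consecutive dots, and the function name is hypothetical.

```python
# Terminal symbols named in the text; "..." is assumed to be written without spaces.
TERMINAL = {"//", "?", "+", "..."}


def split_utterances(turn_text):
    """Split the text of one dialogic turn into utterances at terminal breaks."""
    utterances, current = [], []
    for token in turn_text.split():
        current.append(token)
        # A terminal sign, or a word carrying an attached "?", closes the utterance.
        if token in TERMINAL or token.endswith("?"):
            utterances.append(" ".join(current))
            current = []
    if current:  # trailing material not closed by a terminal break
        utterances.append(" ".join(current))
    return utterances


print(split_utterances("I'm going home // Are you tired too?"))
```

Non-terminal signs such as “/” and “[/]” are left inside the utterance in which they occur.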
Word representation. Each word is transcribed as a continuous sequence of characters between two empty spaces, in accordance with the orthographic conventions of the language concerned.
1.3.3 Conventions for transcription
General criteria. The transcription of a dialogic turn expresses, horizontally, the sequence of speech events (i.e. utterances) that occur in each dialogic turn. In principle, no “enter” can occur within the sequence of utterances.37 Transcription follows the standard orthography of every language in the C-ORAL-ROM resource and is integrated with special signs devoted to handling spoken language phenomena, as detailed in Section 1.3.3.38
Overlapping. In spontaneous dialogues, the speech of one speaker is frequently overlapped by the speech of another speaker, who may insert his dialogic turn into the first speaker’s turn. Overlapping therefore determines a relation of temporal equivalence between two or more speech portions in different dialogic turns. In C-ORAL-ROM, overlapping is represented by the conventions summarised in Table 1.11 and elaborated on below. The overlapped text in both dialogic turns is placed between a pair of angled brackets, i.e. “<” and “>”. Moreover, the sign “[<]” may optionally appear at the beginning of the second turn, immediately before the overlapped text, to mark that an overlapping relation holds between the two pieces of text between brackets in the two turns, as illustrated in the following examples:
*ABA: Maria mi ha detto che [/] //
[Maria told me that / //]
*BAB: [<] <non viene> //
[she is not coming //]
In C-ORAL-ROM, overlapping is marked only when it occurs with at least two words in two different turns; the overlapping of single syllables is therefore either left unmarked or moved to word boundaries. When, because more than one dialogic turn occurs simultaneously, the speech cannot be attributed to a speaker, a fixed variable is used to mark a mixed turn (see Table 1.11), e.g.:
*XYZ: chi è va bene //39
[is who is it all right]
Spoken language noise phenomena. As mentioned in Section 1.3.2, transcription follows the standard orthography of each language concerned, and is integrated with special signs for handling spoken language phenomena, as summarised in Table 1.12.

Table 1.11 Symbols for overlapping
Symbol | Description
<> | Brackets which mark the beginning and the end of the overlapped text of a given speaker.
[<] | Symbol which specifies the overlapping relation between two bracketed textual strings belonging to two speakers.
*XYZ: | Turn of overlapped speech by non-identified speakers.
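As a small illustration of the overlapping symbols in Table 1.11, the following Python sketch extracts the text enclosed in angled brackets from a turn. The function name is hypothetical, and the sketch assumes that bracketed stretches are never nested.

```python
import re

# Text between "<" and ">"; assumption: overlapped stretches are not nested.
OVERLAP_RE = re.compile(r"<([^<>]+)>")


def overlapped_spans(turn_text):
    """Return the overlapped stretches of a turn; the optional "[<]" sign is ignored."""
    cleaned = turn_text.replace("[<]", "")
    return [span.strip() for span in OVERLAP_RE.findall(cleaned)]


print(overlapped_spans("[<] <non viene> //"))
```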
Table 1.12 Symbols for spoken language phenomena
Symbol | Description
& | Speech fragments
hhh | Paralinguistic or non-linguistic elements
xxx | Incomprehensible word
yyy | Non-transcribed word
yyyy | Non-transcribed audio signal
Fragments. All incomplete words and/or phonetic fragments are immediately preceded by the ampersand symbol “&”, as in the following example of an incomplete utterance:40
*MAX: mio &cug
[my &cous]
or in the following lengthening of the programming time:41
*MOR: credo che si chiami / &eh / Giovanni //
[I think they call him / &eh / Giovanni //]
Paralinguistic elements. Paralinguistic elements (laughing, crying, etc.) are indicated as “hhh” and are not counted as word occurrences in a frequency list. Further information on the element is given in a dependent tier (see Section 1.3.6).
Incomprehensible words. All words that are not properly understood are reported as “xxx” (and counted as word occurrences in a frequency list).
Non-transcribed words. When a word must be deleted for reasons concerning privacy or decency, it is substituted by a “yyy” variable, to be counted as a word, as in the following example:42
*MOR: il dottor yyy è un cretino // è proprio un bello yyy //
[doctor yyy is an idiot // he’s a right yyy //]
Non-transcribed audio signals. When, for whatever reason, part of the audio cannot be transcribed, a single variable “yyyy” is inserted in the transcripts, irrespective of the length of the signal. This variable may be subject to alignment, but will not be counted as a word.43
Interjections. Interjections are conventional phonetic elements with dialogic functions and low lexical or grammatical meaning. Interjections are not fragments and are transcribed in accordance with the lexicographical tradition of each Romance language. New interjections discovered in the corpus are transcribed tentatively and their presence is reported in an added glossary.
Non-standard words. In the absence of any previous orthographic tradition, non-standard regional expressions found in the corpus are transcribed tentatively, normalised according to the graphical conventions in use in each language’s tradition, and reported in an added glossary.
1.3.4 Pauses
Pauses in the speech flow are indicated with “#” and are included in the transcription only if clearly perceived as a significant interruption of speech fluency (Table 1.13). Other pauses can be reported optionally. No distinction is made with respect to the length of the pause. In the French corpus pauses are detected automatically (see Chapter 3). The “#” symbol is not a sign of prosodic parsing and never substitutes for the marks dedicated to prosodic breaks.
Table 1.13 Symbol for pauses
Symbol | Description | Definition
# | Pause in the speech flow | A perceptively relevant silence in the speech continuum, or, in any case, a silence longer than 250 ms.
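The counting rules attached to the symbols of Table 1.12 can be sketched as a small word counter. Only part of the behaviour is stated in the text: “hhh” and “yyyy” are not counted, while “xxx” and “yyy” are. Everything else in this Python sketch is an assumption made for illustration: prosodic break signs, the overlap sign and the pause sign “#” are treated as non-words, and fragments introduced by “&” are excluded unless `count_fragments` is set. The function name is hypothetical.

```python
NOT_COUNTED = {"hhh", "yyyy"}  # stated in the text: not counted as words
# Assumption: transcription signs are not words.
SIGNS = {"/", "//", "[/]", "[///]", "[<]", "+", "...", "#", "?"}


def count_words(turn_text, count_fragments=False):
    """Count word occurrences in the text of a turn ("xxx" and "yyy" count as words)."""
    count = 0
    for token in turn_text.split():
        token = token.strip("<>")  # drop overlap brackets attached to a word
        if not token or token in SIGNS or token in NOT_COUNTED:
            continue
        if token.startswith("&") and not count_fragments:
            continue  # assumption: fragments are excluded by default
        count += 1
    return count


print(count_words("il dottor yyy è un cretino // è proprio un bello yyy //"))
print(count_words("hhh //"))
```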
1.3.5 Restrictions on dialogue representation: The Intersection convention
In spontaneous spoken language, dialogic turns by different speakers frequently intersect; that is, a dialogic turn may begin before the end of the turn that immediately precedes it. In such cases, the representation of a dialogue as a vertical ordered collection of dialogic turns cannot be maintained. Because the C-ORAL-ROM format requires the sequence of turns to be represented in temporal order, an Intersection convention has been adopted.
Intersection convention. In the C-ORAL-ROM transcripts, a slash placed at the beginning of a turn, i.e. immediately after the turn label and before the transcribed text, is a convention expressing that the turn in question is only virtual. Although the transcribed language strings follow the dialogic turn of another speaker, they still belong to the preceding turn of the same speaker.44 Three major cases of intersection have been detected in C-ORAL-ROM:
1. Simple intersection of turns;
2. Intersection of turns due to interruption by the listener;
3. Intersection of turns due to overlapping.
Intersection of turns. In some cases, a complete intersection of dialogic turns may occur. In the following example, ELA, without actually interrupting the speaker, starts a brief dialogic turn, while MAX goes on with his turn:
*MAX: in linea di principio / voi dovreste /
[in principle / you had /]
*ELA: ah //
*MAX: / seguire le regole di trascrizione //
[to follow the transcription rules //]
Interruption and intersection. The speaker frequently interrupts his utterance in connection with the intervention of the listener. However, despite this interruption, he still continues the same turn, but with another utterance. For example, in the following situation, a speaker inserts himself in a dialogic turn, interrupting it, but the original speaker goes on with his turn despite the interruption:
*MAX: in linea di principio / voi dovreste +
[in principle / you had +]
*ELA: ah //
*MAX: / dovete seguire le regole di trascrizione //
[you had to follow the transcription rules //]
Overlapping and turn intersection. Overlapping, which is very frequent in spontaneous speech, is a relation between texts belonging to different turns. In most cases, overlapping also causes interruption, and if the speaker goes on after the interruption, this also causes intersection of turns. In order to capture this generalisation, overlapping is always annotated following the Intersection convention.45 In C-ORAL-ROM, when the overlapped text in the upper turn continues after the overlapping, the Intersection convention has been applied in two alternative ways:
a. by transcribing over the overlapped turn until the first terminal break, as follows:
*ABA: Maria mi ha detto che [/] più / al concerto //
*BAB: [<] <non viene> //
*ABA: / perché non si sente //
b. by interrupting the transcription of the overlapped turn when the overlapping ends, as follows:
*ABA: Maria mi ha detto che [/]
*BAB: [<] <non viene> //
*ABA: / più / al concerto // perché non si sente //
Only the first alternative lets the system assume the generalisation that each formal turn ends with a terminal break, and allows the alignment of each utterance.

Table 1.14 Intersection convention
Configuration of symbols | Operational definition
:/ | Relation that converts the turn in which it appears into a linear sequence which includes the preceding turn of the same speaker.
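The operational definition in Table 1.14 can be illustrated with a short Python sketch that joins each virtual turn to the preceding turn of the same speaker. The data layout (a list of speaker and text pairs) and the function name are assumptions made for the example; the sketch is not the procedure used by the C-ORAL-ROM software.

```python
def merge_virtual_turns(turns):
    """Join turns that start with "/" to the preceding turn of the same speaker.

    `turns` is a list of (speaker, text) pairs in transcript order.
    """
    merged = []      # [speaker, text] entries for the real turns
    last_index = {}  # speaker -> position in `merged` of that speaker's latest turn
    for speaker, text in turns:
        stripped = text.strip()
        is_virtual = stripped == "/" or stripped.startswith("/ ")
        if is_virtual and speaker in last_index:
            continuation = stripped[1:].strip()
            merged[last_index[speaker]][1] += " " + continuation
        else:
            last_index[speaker] = len(merged)
            merged.append([speaker, stripped])
    return [tuple(entry) for entry in merged]


turns = [
    ("MAX", "in linea di principio / voi dovreste /"),
    ("ELA", "ah //"),
    ("MAX", "/ seguire le regole di trascrizione //"),
]
print(merge_virtual_turns(turns))
```

In the merged result MAX's turn appears as one linear sequence, while ELA's brief turn is kept as a separate entry.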
1.3.6 Dependent lines
Information of three kinds regarding the text reported in a dialogic turn is optionally given in lines following the dialogic turn. These lines, which are not computed as linguistic information, are usually referred to as dependent lines in the CHAT tradition:
a. Alternatives proposed for the transcription of the text;
b. Comments regarding the pragmatic context and the visual modality of communication;
c. Other comments.
These options are summarised in Table 1.15. The link between the information in the dependent line and a word, or series of words, in the dialogic turn can be specified using a serial number which indicates the position in the dialogic turn of the word referred to.46 In the following example, the reported alternative concerns the third word of the utterance:
*MAX: voglio mangiare // pasta //
[I want to eat // pasta //]
%alt: (3) basta
[stop]

Table 1.15 Dependent lines types
Sign | Definition of the type of information
%act: | Actions of a participant while speaking
%sit: | Events or state of affairs occurring in the speech situation
%add: | Participant to whom the speech is addressed
%par: | Speaker’s gestures or paralinguistic aspects
%exp: | Explanations necessary for the understanding of the turn, or signs in the text (hhh)
%amb: | Description of the Setting in a media emission
%sce: | Description of the Scene in a media emission
%alt: | Alternative transcription
%com: | Transcribers’ comments
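The link between a dependent line and a word of the turn can be sketched as follows. The Python code below parses a dependent line of the form shown in the example and looks up the word at the given position. It assumes, for illustration only, that positions are counted from 1 over words alone, skipping prosodic and other transcription signs; both function names are hypothetical.

```python
import re

# "%" + three-letter tier name + ": " + optional "(n) " + content
DEPENDENT_RE = re.compile(
    r"^%(?P<tier>[a-z]{3}): (?:\((?P<pos>\d+)\) )?(?P<content>.*)$"
)
# Assumption: these signs are skipped when word positions are counted.
BREAK_SIGNS = {"/", "//", "[/]", "[///]", "[<]", "+", "...", "#"}


def parse_dependent(line):
    """Return (tier, position or None, content) for a dependent line, else None."""
    match = DEPENDENT_RE.match(line)
    if match is None:
        return None
    position = int(match.group("pos")) if match.group("pos") else None
    return match.group("tier"), position, match.group("content")


def word_at(turn_text, position):
    """Return the word at a 1-based position in the turn, ignoring break signs."""
    words = [token for token in turn_text.split() if token not in BREAK_SIGNS]
    return words[position - 1]


tier, position, content = parse_dependent("%alt: (3) basta")
print(tier, position, content)
print(word_at("voglio mangiare // pasta //", position))
```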
1.3.7 Alignment principle
The alignment of C-ORAL-ROM is based on two choices:
a. Specification of the alignment unit at utterance level (as previously defined).
b. Rough equivalence between terminal breaks and utterance limits.
Each text is aligned with respect to the perceptively relevant terminal breaks annotated in the original transcripts, in order to deliver a database of utterances.47 The alignment of transcribed texts is achieved with the assistance of the alignment tool, by inserting a ($) tag in the text after each terminal break while the audio is played (at a reduced speed). The system assigns two temporal units to each alignment unit:
a. end of the alignment unit: the temporal unit of the sound file at the instant in which the tag is inserted;
b. beginning of the alignment unit: the temporal unit which marks the end of the previous segment.
The expert operator in charge of the alignment (a postdoctoral researcher or doctoral student) always considers whether the accomplished alignment unit truly corresponds, in his perception, to a speech segment ending with a terminal break. The operator may add or delete the terminal breaks annotated in the original according to his own perception of the speech signal, thus improving the quality of the annotation.
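The way the two temporal units are assigned can be shown with a minimal Python sketch: each tag insertion time closes one alignment unit and opens the next. The function name is hypothetical, times are taken to be in seconds, and the start of the first unit at time 0 is an assumption not stated in the text.

```python
def alignment_units(tag_times, start=0.0):
    """Turn the times at which ($) tags were inserted into (begin, end) pairs.

    Assumption: the first unit begins at `start`; every later unit begins
    where the previous one ended.
    """
    units = []
    begin = start
    for end in tag_times:
        units.append((begin, end))
        begin = end
    return units


print(alignment_units([1.5, 4.0, 6.25]))
```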
1.3.8 Files, filename conventions and folder structure of the resource
1.3.8.1 Files
1. Audio files: encrypted MP3 files; sampling frequency: 22,050 Hz; 16 bits
2. Text files: encrypted .TXT files
3. Alignment files: encrypted .XML files
1.3.8.2 Filename conventions
The filenames of the C-ORAL-ROM resource bear information of three types, in the following order:
a. The represented language;
b. The text type; that is, the field and sub-field to which each text belongs in the corpus structure;
c. The serial number identifying each text in its sub-field.
The following are the conventions adopted in each C-ORAL-ROM language collection:
a. Language. Country code: F (French), I (Italian), P (Portuguese), E (Spanish)
b. Text type. This is detailed in Table 1.16.

Table 1.16 Codes for text type
Field | Sub-field
informal | fam (family-private), pub (public): mn (monologue), dl (dialogue), cv (conversation)
formal | nat (natural context): ps (political speech), pd (political debate), pr (preaching), te (teaching), pe (professional explanation), bu (business), co (conference), la (law)
formal | med (media): nw (news), mt (weather forecast), in (interview), rp (documentary), sc (scientific press), sp (sport), ts (talk show)
formal | tel (telephone): pv (private), mm (man machine)
c. Text identification number. A two-digit number is used to identify the text within the sub-field; e.g.:
efamdl01 (Spanish, Family-private, Dialogue, 01)
efamdl02 (Spanish, Family-private, Dialogue, 02)
efammn01 (Spanish, Family-private, Monologue, 01)
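The filename conventions can be illustrated with a Python sketch that decodes a name such as efamdl01. The layout assumed here (one letter for the language, three for the first text-type code, two for the second, two digits for the serial number) is inferred from the examples above rather than stated as a rule, and the function name is hypothetical. The sketch does not check whether a given pair of text-type codes is a valid combination.

```python
import re

LANGUAGES = {"f": "French", "i": "Italian", "p": "Portuguese", "e": "Spanish"}
FIRST_CODES = {
    "fam": "family-private", "pub": "public", "nat": "natural context",
    "med": "media", "tel": "telephone",
}
SECOND_CODES = {
    "mn": "monologue", "dl": "dialogue", "cv": "conversation",
    "ps": "political speech", "pd": "political debate", "pr": "preaching",
    "te": "teaching", "pe": "professional explanation", "bu": "business",
    "co": "conference", "la": "law", "nw": "news", "mt": "weather forecast",
    "in": "interview", "rp": "documentary", "sc": "scientific press",
    "sp": "sport", "ts": "talk show", "pv": "private", "mm": "man machine",
}
# Assumed layout: language (1) + first code (3) + second code (2) + number (2)
NAME_RE = re.compile(r"^([fipe])([a-z]{3})([a-z]{2})(\d{2})$")


def parse_filename(name):
    """Decode a filename (without extension) into language, text type and number."""
    match = NAME_RE.match(name.lower())
    if match is None:
        raise ValueError(f"unexpected filename: {name!r}")
    language, first, second, number = match.groups()
    if first not in FIRST_CODES or second not in SECOND_CODES:
        raise ValueError(f"unknown text type code in {name!r}")
    return LANGUAGES[language], FIRST_CODES[first], SECOND_CODES[second], int(number)


print(parse_filename("efamdl01"))
```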
1.3.8.3 Folder structure
In the multimedia corpus, the files of each language collection are delivered within the same folder structure, which directly mirrors the C-ORAL-ROM corpus design. For each session, the following are delivered in folders:
a. the encrypted MP3 file;
b. the encrypted .TXT file of the transcripts;
c. the encrypted .XML file defining the text-to-speech alignment in WinPitch Corpus format.
The encrypted .TXT files of the transcripts and the encrypted .TXT files of the PoS tagged transcripts are also delivered in two non-structured directories for each language collection.
1.4 WinPitch Corpus. A Text-to-Speech Analysis and Alignment Tool
Philippe Martin
Although putting a speech corpus together seems to be a simple task involving speech-to-text transcription and alignment, this ceases to be the case when the volume of data becomes large. Problems inherent in transcription, even for well-recorded speech data, are not trivial (Blanche-Benveniste 2002), and, even with the use of modern signal editing software, the manual transcription and alignment of just one hour of data quickly becomes cost-prohibitive. The development of adequate and user-friendly tools is thus essential for the elaboration of large speech corpora. WinPitch Corpus, which has been used in the C-ORAL-ROM project and is available on the DVD accompanying this book, is one of these tools, developed to facilitate the alignment process and to make the analysis of very large corpora an easier and more pleasant task.
1.4.1 Text-to-speech alignment
Text-to-speech alignment establishes a bi-univocal relationship between units of speech and units of text. In its simplest implementation, each unit of text (be it a syllable, word, phrase or sentence) receives a temporal index corresponding to the time position of its equivalent in the sound file. Once this process is done, an operator can select an aligned unit of text and listen to the corresponding speech segment. The acoustic analysis of the speech sound, such as melodic curve and spectrogram, can also be displayed at the same time. Conversely, selecting a speech segment will highlight the corresponding segment of text in its orthographic or phonetic transcription.
Text-to-speech alignment is frequently used in multimedia language learning software, where the user can easily listen to the sound corresponding to a specific word or sentence merely by clicking on the appropriate text segments. Other important applications are found in fundamental research in phonetics, in synthesis-by-rule developments, or in speech recognition validation tests. In prosodic research, for example, the user should be able to quickly get an analysis display of the acoustic prosodic parameters simply by highlighting a part of the text of interest.
1.4.1.1 Automatic alignment
Emerging automatic text-to-speech alignment processes present recurrent limitations:
1. Their performance depends on speakers’ voice characteristics, which cannot be too different from the models adopted in these methods.
2. The recording signal-to-noise ratio must be high enough to reduce the error rate to an acceptable level. Radio and television broadcasts generally meet this requirement, but spontaneous speech recordings made in various public environments (street, public transportation, etc.) rarely fulfil this criterion. Echo in the speech signal is another aggravating factor.
3. The overlapping of speakers’ voices, frequently found in spontaneous dialogues, constitutes an additional aggravating factor.
Furthermore, use of recordings for phonetic and general linguistic research requires an acceptable quality of the recording itself, such as a good frequency response curve and a low phase distortion. All these considerations seem to indicate that a human operator is often required to obtain a reliable text-to-speech alignment. The responsibility for all the problems mentioned above is then transferred to the operator, who, with appropriate and ergonomically well-designed tools, should perform better than by manually correcting, one by one, the errors made by an automatic system of alignment.
1.4.1.2 Alignment and transcription
Text-to-speech alignment can be executed in two modes, depending on whether the text pre-exists or not. If the text must be created, the operator proceeds by selecting segments of speech in sequences (which can be played back at reduced speed to enhance intelligibility) and typing the corresponding text perceived, either orthographically or in phonetic transcription (WinPitch Corpus allows the use of any font defined in Unicode). This is illustrated in Figure 1.5. During this process, a database is automatically updated, and will be later saved directly in Excel® or XML format.
Figure 1.5 Simultaneous transcription and alignment. The user sequentially defines segments of speech and enters the corresponding text
1.4.1.3 Computer-assisted alignment
Experimental studies have shown that improved coordination between visual spotting of words and positioning of a mouse on a computer screen can be obtained by slowing down speech playback by a suitable factor, depending on the size of the text object to spot. Larger chunks of text require less processing time, and thus allow a faster speech playback rate in the process. WinPitch Corpus-assisted text-to-speech alignment is based on this principle. The pre-existing text is displayed dynamically in a window while the corresponding speech sound is played back at a slower speed. The playback speed can be adjusted continuously on the fly by a factor ranging from 1/7 to 2. At each identification of a speech unit to segment and align (whether a syllable, word, phrase or sentence), the operator clicks with the computer mouse on the text segment perceived. The programme records the position of the cursor on the text window, which defines the end of the text segment to align, and the time of the click, remapped on the real time scale of the speech wave. This process continuously generates a database of pointers linking segments of text and segments of speech. This is shown in Figure 1.6. Various tools are provided to backtrack, fine-tune speech segment limits (with the help of a displayed spectrogram), dynamically modify limits of overlapping voices, etc.
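The remapping of a click onto the real time scale of the speech wave can be illustrated with a simple calculation: while the sound is played back at a factor f, each second of listening covers f seconds of the recording. The Python sketch below accumulates this over a playback history in which the factor changes on the fly. It only illustrates the principle and makes no claim about how WinPitch Corpus implements it; the function name and the data layout are assumptions.

```python
def remap_click(segments, click_time):
    """Map a wall-clock click time onto the time scale of the recording.

    `segments` is a list of (wall_start, factor) pairs sorted by wall_start,
    the first starting at 0: from wall_start onwards the sound is played
    back at `factor` (between 1/7 and 2) until the next segment begins.
    """
    position = 0.0
    for index, (wall_start, factor) in enumerate(segments):
        if not (1 / 7 <= factor <= 2):
            raise ValueError("playback factor outside the range 1/7 to 2")
        if index + 1 < len(segments):
            wall_end = min(segments[index + 1][0], click_time)
        else:
            wall_end = click_time
        if wall_end <= wall_start:
            break  # the click happened before this segment started
        position += (wall_end - wall_start) * factor
    return position


# Half speed for the first 4 s of listening, then normal speed; click at 6 s.
print(remap_click([(0.0, 0.5), (4.0, 1.0)], 6.0))
```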
Figure 1.6 Assisted alignment by slowing down speech playback. At each mouse click on a unit of text perceived at slower speed (top right window), bidirectional pointers are generated automatically between the corresponding speech segment (bottom right window) and a database (left window)
.. WinPitch Corpus features a. Slowing down speech. Variable-rate speech playback is the central engine of the assisted text-to-speech aligner. It uses a modified version of the PSOLA algorithm (Moulines & Charpentier 1990) and allows high quality re-synthesis of natural speech. This quality is strongly dependent on precise pitch marking, and therefore on reliable fundamental frequency analysis. Errors in pitch marking (e.g. missing markers, double markers) induce an echo effect due to the misalignment of pitch chunks added in the PSOLA algorithm. F0 estimation is obtained by the spectral comb method (Martin 1981), and the speed playback factor can vary from 1/7 to 2 (the latter meaning that speech is played back at double speed). This rate is dynamically adjustable by the user while the alignment is processed, allowing operations on very large files, and continuous speed adjustment as required by the operator. b. Automatic layer assignment. Pre-existing text can be organised, following a simple convention for naming speaker turns, so that segments are automatically assigned to their corresponding layers. The user does not have to worry about speakers’ turns while executing the aligning, as the programme will put segmented text in the appro-
Philippe Martin
Figure 1.7 Fine-tuning of speech segment limits with the help of a simultaneously displayed spectrogram (which allows precise segmentation in case of speakers’ overlapping)
priate layer assigned to each speaker. Eight layers are presently available, but future extension will provide for an unlimited number of speakers. c. Fine-tuning and speaker overlap. Once the assisted alignment session ends, the programme automatically displays the text under the corresponding speech segments, represented by their acoustic analyses (spectrogram, intensity and melodic curves, waveform). The user can then more precisely adjust the segments by dragging their limits using the mouse, with the help of visual inspection of the corresponding spectrogram or other available acoustic information. Overlapping segments can then be easily and precisely defined. This is illustrated in Figure 1.7. d. Unicode font. WinPitch Corpus is Unicode compliant, which means that any Unicode-defined font can be used for transcription, alignment and labelling. Furthermore, a right-to-left real-time display is provided for languages such as Arabic or Hebrew. e. Input formats. The programme uses most multimedia file formats, such as wav, mp3, aiff, au, wma, snd for audio, and mpeg, mpg, m1v, mp2, mpa, asf, avi, wmv for video. The user can select a time window (limited only by the available RAM size) to store the audio information from a selected time reference of the video or audio file. When loaded, audio and video channels are separated, and only the audio channel stays resident in memory to be processed and analysed as a simple wav file. Very large
The C-ORAL-ROM resource
files (limited only by the hard disk capacity) can thus be handled, transcribed, aligned and analysed.
f. Hierarchical transcriptions. The available transcription layers can be assigned by the user to different transcription entities. In the C-ORAL-ROM project, for instance, each layer is assigned to a particular speaker in the dialogue, whereas in other applications, some layers can be assigned to syntactic tagging, syllabic segments, etc. In such cases, the segmentation and alignment are hierarchical, and the hierarchy is automatically maintained in fine-tuning operations. This means that the corresponding limits of segments of various levels (phones, syllables, words, syntactic groups, etc.) are adjusted together automatically if any of the limits is moved. This ensures a proper synchronisation of all levels of the transcription. This feature can be turned off when required, for instance when layers are assigned to different overlapping speakers.
g. Highlighting and marking. Speech segments can be highlighted individually in any colour, and can also be marked for sampling and statistical analysis. Marked segments are directly sampled, with a constant time interval or a constant number of samples, and their values exported to Excel® without further user intervention.
h. Documents produced after alignment. The documents produced after alignment are defined in XML format (see Figure 1.8).
i. Multimodal alignment with WinPitch Corpus. The multimodal alignment with WinPitch Corpus allows the synchronisation of acoustic and video information (F0, waveform, time) (see Figure 1.9).48
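The hierarchical adjustment described under item f can be pictured with a small sketch. It is not WinPitch Corpus code: the `Segment` and `Layer` classes, the `move_limit` function and the tolerance value are hypothetical names introduced here only to illustrate the idea that a limit shared by several levels moves in all of them at once.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    label: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Layer:
    name: str
    segments: List[Segment] = field(default_factory=list)


def move_limit(layers, old_time, new_time, tol=1e-6):
    """Move every segment limit located at old_time to new_time, in all layers.

    Returns the number of limits that were moved. The tolerance is an
    assumption of this sketch, used to match limits stored as floats.
    """
    moved = 0
    for layer in layers:
        for seg in layer.segments:
            if abs(seg.start - old_time) <= tol:
                seg.start = new_time
                moved += 1
            if abs(seg.end - old_time) <= tol:
                seg.end = new_time
                moved += 1
    return moved


# Two hierarchical levels sharing the limit at 0.4 s (illustrative values only).
words = Layer("words", [Segment("come", 0.0, 0.4), Segment("andò", 0.4, 0.9)])
syllables = Layer("syllables", [
    Segment("co", 0.0, 0.2), Segment("me", 0.2, 0.4),
    Segment("an", 0.4, 0.6), Segment("dò", 0.6, 0.9),
])

moved = move_limit([words, syllables], old_time=0.4, new_time=0.45)
print(moved, words.segments[0].end, syllables.segments[2].start)
```

Turning the feature off, as the text allows for overlapping speakers, would correspond in this sketch to calling `move_limit` on a single layer only.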
Figure 1.8 The resulting text-to-speech alignment defined in XML format
Figure 1.9 Simultaneous display of acoustic and video information of a multimedia file
.. Basic layout
WinPitch Corpus has three main windows, as shown in Figure 1.10:
a. A command window, displaying main command dialogue boxes, accessed from the toolbar, the operation menu or the dialogue box tabs;
b. A navigation window, for selection of speech segments in the sound file;
c. An analysis window, displaying the graphic representation of the selected speech part acoustical analysis.
WinPitch Corpus operates with an XML file to define an alignment between a speech file (extension *.wav or *.mp3) and a text file (extension *.txt or *.rtf), the *.rtf file supporting Unicode text. Loading the alignment file (*.xml) will automatically load the corresponding sound and text files, whereas a new alignment session requires loading the sound file first, followed by the text to be aligned.
To load an existing file, get the Transfer dialogue and click on Load in the Align (*.xml / *.alg) section, as shown in Figure 1.11. Loading a file will overwrite any existing data on the current analysis window. A new window does not have to be created when a file is loaded.
Another, and often more convenient, way to load a file (*.wav, *.mp3, or *.xml) is by dragging its file name from the File dialogue to the analysis or navigation window, as illustrated in Figure 1.12. To drag a file, position the cursor on its name, press the mouse left button, move the cursor with the mouse keeping the left
Figure 1.10 WinPitch Corpus basic layout
Figure 1.11 Set of buttons for alignment
Figure 1.12 Dragging file names
button down, and release the button when the cursor is anywhere inside the analysis or navigation window. The File dialogue displays the hierarchy of the storage devices (diskette, hard disk, CD-ROM, etc.) and operates like Windows Explorer, but shows only directories and files that can be processed by WinPitch Corpus. Each type of file is displayed with a specific icon, as follows: for sound files (*.wav): for alignment files (*.alg or *.xml): and for WinPitch files (*.wp2): This scheme allows for quick loading by dragging, and the analysis of files belonging to the same directory, as the file already loaded in WinPitch is overwritten by the newly loaded file. Once a sound file has been loaded from memory, it can be navigated to retrieve sound sections to listen to and analyse. Sound files which are simply recorded and not saved can also be navigated and analysed (but it is of course wiser if the data are saved). Any of the following processes can be used, alone or in combination, in navigation:
1. Block selection. A “block” is defined in the navigation section as follows: Position the cursor on the starting edge of your block, press the mouse left button, move the cursor with the left button down, and release the button when the cursor is positioned on the ending edge of the block. When the mouse button is released, the analysis window is automatically updated to reflect the analysis of the time frame defined by the block. This is illustrated in Figure 1.13.
Figure 1.13 Navigation block analysis
The left and right edges of a block can be dragged to change the block position and size. To move a block edge, position the cursor near the chosen edge. The cursor will take a double arrow shape, indicating that the edge can be dragged to its new position. To do this, press the left mouse button and keep it down until the desired position has been reached. The block can also be moved without changing its size, by positioning the cursor inside the block until it takes a quadruple arrow shape. The whole block can then be dragged to its new position. The navigation window can also be panned by moving the block until it reaches either the left or right edge. To remove a block, click with the left mouse button anywhere outside it in the navigation window, or press the right mouse button.
2. Navigation window zoom and pan. Zooming in and out is possible using the left slider in the navigation window, and panning left and right is done with the top horizontal slider of the navigation window. The vertical slider on the right side adjusts the amplitude of the waveform displayed, but it has no effect on the actual wave amplitude in the sound buffer. The check box on the bottom right side is used to reverse the phase of the waveform; unlike the amplitude slider, this does change the phase of the sound wave in the sound buffer. These are all shown in Figure 1.14.
Figure 1.14 Zooming, panning, amplifying waveform, and phase checkboxing
3. Analysis window zoom. The analysis window can be enlarged virtually up to 10 times by adjusting the corresponding value in the Play dialogue. To get the Play dialogue, click on its icon on the toolbar or click on the Play dialogue tab. This mode generates a virtual enlarged analysis window, which can be panned with the Analysis horizontal top slider. This is illustrated in Figure 1.15.
4. As shown in Figure 1.16, a block can also be defined in the analysis window, in order to select a sound segment, or for various other functions (e.g. to define highlighted sections, to save an aligned segment, etc.). Each time an analysis block is defined or edited (i.e. when its edges or the entire block are moved), the corresponding sound is played back, with variable speed if the Enable slow block option is checked.
Figure 1.15 Zooming pan slider and pan level
Figure 1.16 Blocking in the analysis window and in the navigation window
. C-ORAL-ROM PoS tagging
C-ORAL-ROM also provides a textual version of the transcripts in which each form is tagged for part of speech (PoS) and lemma. It should be noted at the outset that PoS tagging of a multilingual spontaneous spoken resource is still a new task and that few antecedents can be found in the literature concerning the automatic PoS tagging of spontaneous speech.49 The development of multilingual tagsets for Romance languages, the integration of traditional tagsets with PoS for spoken language phenomena, the development of multilingual language technologies for automatic PoS tagging, and the level of encoding and the manual revision standards are, consequently, still issues which need to be explored further. Unfortunately, they could not be fully resolved within the C-ORAL-ROM project because of constraints of both time and budget; a specifically dedicated project would have been necessary to this end. Nonetheless, the consortium felt that the state of the art of automatic PoS tagging reached by the present HLT (Human Language Technology) for each Romance language would allow significant annotation of the C-ORAL-ROM transcripts with PoS with limited financial effort. This would ensure the exploitation of the resource for syntactic purposes, as well as enhance the level of awareness regarding PoS tagging of spontaneous speech. Moreover, the current practices of PoS tagging in each team were already strongly based on common EU standard tagsets established in previous projects (LINGUA-EAGLES-MULTEXT-PAROLE) and recently revised or adapted for current research. Thus, although the full integration of the four resources under a common PoS tagging scheme is not possible, the automatic PoS tagging of each resource ensures a quality that is sufficient for the exploitation of the resource. 
More specifically, C-ORAL-ROM ensures: (1) the use of a common format; (2) comparability of the adopted tagsets; (3) a minimal threshold of syntactic information sufficient for linguistic research; (4) an evaluation of the performance of the automatic PoS in spoken resources with respect to written resources; and (5) a better understanding of the specific difficulties that spontaneous speech presents for PoS tagging, from both a technological and theoretical point of view. In the remainder of this section, the tagsets used in each PoS tagged resource are compared and the C-ORAL-ROM tagging format presented. The specific choices adopted and the results obtained for each language are explained in the description of each resource.
Massimo Moneglia
.. Minimal tagset requirements
C-ORAL-ROM’s four sub-corpora are tagged using different tagsets because of the differences in both the languages and the tools adopted for natural language processing systems. Nonetheless, in order to ensure comparability within the whole corpus, a compulsory minimal threshold of information has been established in the tag codes; such a minimal tag set is defined as follows:
1. Part of Speech (PoS) tag is compulsory for each class of word and, if applicable, for Locutions.
2. Specifications on Mood and Tense features are compulsory for Verbs, or, alternatively, the Finite/Non-finite character must be specified [± person] (necessary for the detection of verbless utterances).
3. Specification of the Coordinative/Subordinative quality is compulsory for Conjunctions.
4. Specification of the Common/Proper trait is compulsory for Nouns.
The common level of morpho-syntactic tagging features a distinction, within the set, of elements which are not strictly linguistic but which are peculiar to spontaneous speech texts,50 providing two special tags to this aim (with the actual code of the tag depending on the specific tagset) for:
a. paralinguistic elements: mh; he; ehm (used for filled pauses);
b. extralinguistic elements: hhh (used for laughs and coughs).
.. Comparison between tagsets ... PoS tagsets The main PoS categories related to the tagsets used for each language are listed in Table 1.17. The first column comprises the maximal projection of the word categories present in the four tagsets; in the remaining four columns, the projection is mapped on the tagset of each language, showing the labels used. As far as the main PoS tags are concerned, as seen in Table 1.17a, the four tagsets are completely coincident with respect to the individuation of the main lexical categories (Nouns, Verbs, Adjectives, Adverbs), the functional particles (Prepositions, Conjunctions) and the Interjection class. This ensures a common baseline of annotation that makes the comparison of the high-frequency words of the corpora and the main lexical indexes extracted from the tagged resource reliable. Looking at the different C-ORAL-ROM tagsets in both parts of Tables 1.17, it can be seen that the maximum projection of categories is almost completely represented by the French tagset, which follows the EAGLES specifications on morpho-syntactic annotation of corpora.51 The different choices adopted by the other tagsets are justified by either the different language-specific tools for the automatic tagging procedure, designed with different perspectives on PoS classes and macro-categorisation, or by the differences between the languages. More specifically, the Portuguese and Spanish
tagsets include the speech-specific categories of discourse markers, used for the tagging of words with a pragmatic value at discourse level (e.g. mira, vamos, Sp.; pá, portanto, Por.). The Portuguese tagset also uses a specific label for the emphatic elements (e.g. lá, cá) that occur in the speech flow. The second part of the comparison table, Table 1.17b, shows the more complex relations among the different tagsets. The treatment of articles in Romance languages exhibits only a minor difference in categorisation. From a distributional point of view, articles in Romance languages are determiners; this is represented both in the French and in the Spanish tagsets, which have a “definite determiner” class for the articles, while the Italian and the Portuguese tagsets opted for an independent class, in accordance with grammatical tradition. This difference does involve the structure of the tagset: where articles are univocally identified as definite determiners, they constitute a sub-category; where they are identified as proper articles, they constitute a main category. However, this does not decrease the comparability among the tagsets.

Table 1.17a. Synopsis of PoS tag sets (1)
| Tagset projection | French | Italian | Portuguese | Spanish |
| --- | --- | --- | --- | --- |
| nouns | NOM | S | N | N |
| verbs | VER | V | V | V |
| adjectives | ADJ:QUA | A | ADJ | ADJ |
| adverbs | ADV | B | ADV | ADV |
| prepositions | PRE | E | PREP | PREP |
| conjunctions | CON | C | CONJ | C |
| interjections | INT | I | INT | INT |
| discourse markers | – | – | MD | MD |
| emphatic | – | – | ENF | – |

Table 1.17b. Synopsis of PoS tag sets (2)
| Tagset projection | French | Italian | Portuguese | Spanish |
| --- | --- | --- | --- | --- |
| articles | DET:DEF (definite determiner) | R (articles) | ART (articles) | DETd (definite determiner) |
| demonstrative determiners | DET:DEM | DIM | DEM | DETdem |
| demonstrative pronouns | PRO:DEM | DIM | DEM | DETdem |
| possessive determiners | DET:POS | POS | POS | DETposs |
| possessive pronouns | PRO:POS | POS | POS | DETposs |
| personal pronouns | PRO:PER | PER | PES | PPER |
| clitic | PRO:PER | PER | CL | PPER |
| rel-int-excl determiners | DET:INT | REL | REL | PR |
| rel-int-excl pronouns | PRO:RIN | REL | REL | PR |
| indefinite determiners | DET:IND | IND | IND | Q (quantifiers) |
| indefinite pronouns | PRO:IND | IND | IND | Q (quantifiers) |
| numbers (cardinals) | NUM | N | NUMc | Q (quantifiers) |
| numerals (ordinals) | ADJ:ORD | NA | NUMo | Q (quantifiers) |

The main differences between the tagsets involve pronouns, determiners and pronominal adjectives.52 This part of the lexicon has been handled with different annotation schemes in the various initiatives which dealt with PoS tagging from a cross-linguistic perspective.53 In this respect, the different languages in C-ORAL-ROM cannot be described in the same way. For example, possessives in Italian are pronominal adjectives, but they cannot be determiners, in complementary distribution with articles or demonstratives; in French, however, possessives can be both pronouns and determiners (they can also be adjectival elements).54 In this case, differences among the tagset structures reflect different linguistic behaviours among languages. In C-ORAL-ROM, the class of personal pronouns is treated in the same way in all the tagsets, with the exception of the Portuguese tagset, where clitic pronouns are separated with a specific tag. The French tagset bundles all the semantic categories of demonstratives, possessives, indefinites and relatives into two distributional macro-classes, determiners and pronouns, which are then divided into semantic sub-categories. The Italian and Portuguese tagsets do not distinguish these four types of lexical items by their functional value (pronominal or adjectival), but rather assign each of them to a single category. The Spanish tagset is mostly coincident with the Italian and Portuguese ones for demonstratives, possessives and relatives. Where it differs is in the merging of indefinites, cardinals and ordinals into a single quantifiers class. This choice underlines the fact that, from a functional perspective, indefinites and numerals are quantifiers. 
From a distributional point of view, however, while cardinal numbers and indefinites have similar characteristics, ordinals do not occur in the same position, and instead present a distribution similar to that of adjectives. This behaviour is represented in the French tagset through an adjectives macro-class, which is split into qualifiers and numerals, while numbers and indefinites are treated separately. The Italian tagset has separate classes for cardinals (Number) and ordinals (Numeral Adjective), while the Portuguese tagset has a single numerals class which is split into two sub-classes of cardinals and ordinals.
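Because the four tagsets coincide on the main lexical and functional categories, a cross-corpus comparison can first map each language-specific label onto a shared category name. The following is a minimal sketch of such a mapping, using only the seven categories and the labels given in Table 1.17a; the names `COMMON`, `MAIN_TAGS`, `TO_COMMON` and `normalise`, and the English category names, are hypothetical and are not part of the C-ORAL-ROM distribution.

```python
# Shared categories, in the row order of Table 1.17a (names chosen for this sketch).
COMMON = ["noun", "verb", "adjective", "adverb",
          "preposition", "conjunction", "interjection"]

# Language-specific labels for those categories, as listed in Table 1.17a.
MAIN_TAGS = {
    "French":     ["NOM", "VER", "ADJ:QUA", "ADV", "PRE", "CON", "INT"],
    "Italian":    ["S", "V", "A", "B", "E", "C", "I"],
    "Portuguese": ["N", "V", "ADJ", "ADV", "PREP", "CONJ", "INT"],
    "Spanish":    ["N", "V", "ADJ", "ADV", "PREP", "C", "INT"],
}

# One lookup dictionary per language: label -> shared category.
TO_COMMON = {lang: dict(zip(tags, COMMON)) for lang, tags in MAIN_TAGS.items()}


def normalise(language, tag):
    """Return the shared category for a tag, or None if it is not one of the seven."""
    return TO_COMMON[language].get(tag)


print(normalise("Italian", "S"), normalise("French", "NOM"))
```

Labels outside the common baseline (for example the discourse marker and emphatic labels, or the pronoun and determiner classes of Table 1.17b) deliberately return `None` here, since their mapping across languages is not one-to-one.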
... Morpho-syntactic features of verbs
As can be seen in Table 1.18, the different tools used for the four languages encode the morphological traits of verbs at different levels. The Italian and Spanish tagsets provide a complete description of the verbal forms, giving specifications of mood, tense, person, number and, for the inflected participle forms, gender. The French and Portuguese tagsets offer information only on mood and tense. The last feature in Table 1.18 has to do with the annotation of auxiliary forms of verbs. The French tagset is the only one that does not contain the distinction between main and non-main use of verbs; conversely, information that the verb is auxiliary is specifically indicated in the Spanish and the Portuguese tagsets. Moreover, in the Portuguese tagset the adjectival past participles receive a special label (PPA), to distinguish these uses from the ones within the compound tenses (VPP). As for the Italian corpus, it distinguishes between the uses of the verbs essere (‘to be’) and avere (‘to have’) as main and non-main ones (the latter includes auxiliaries and copulas).

Table 1.18 Synopsis of the morpho-syntactic encodings for verbs
| Feature | French | Italian | Portuguese | Spanish |
| --- | --- | --- | --- | --- |
| MOOD | indicative, subjunctive, conditional, imperative, infinitive, participle | indicative, subjunctive, conditional, imperative, infinitive, participle, gerund | indicative, subjunctive, conditional, imperative, infinitive, participle, gerundive, adjectival past participle | indicative, subjunctive, conditional, imperative, infinitive, participle, gerund |
| TENSE | present, past, imperfect, future | present, past, imperfect, future | present, past, imperfect, plusperfect, future | present, past, imperfect, future |
| PERSON | – | first, second, third | – | first, second, third |
| NUMBER | – | singular, plural | – | singular, plural |
| GENDER (only for participles) | – | masculine, feminine, common | – | masculine, feminine |
| VERB TYPE | – | main, non-main | main, auxiliary | main, auxiliary |
... Non-standard tagsets The non-standard phenomena that occur in spoken language need specific labels in the automatic tagging procedure. Table 1.19 shows a comparison between the different phenomena encoded in the four corpora. The Italian, Portuguese and Spanish corpora mostly agree in the encoding of the main phenomena regarding paralinguistic elements (mostly pause fillers, phonetic supports or word fragments) and extralinguistic ones (laughs and coughs or other non-linguistic sounds), as shown in Table 1.19a. The French resource, however, does not provide a complete account of such phenomena, and does not mark the extralinguistic elements found in the transcripts in the tagset. Table 1.19b, which presents further non-standard phenomena, shows that the Italian non-standard tagset is the broader one, encoding facts of different nature (see Section 2.3.3 for the description of such phenomena).
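Table 1.19a. Synopsis of the non-standard tagsets (1)
| Phenomenon | French | Italian | Portuguese | Spanish |
| --- | --- | --- | --- | --- |
| extralinguistic | – | XLG | EL | |
| paralinguistic: support & fillers | – | PLG | PL | <sup> |
| paralinguistic: fragments | – | PLG | FRAG | – |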
Table 19b. Synopsis of the non-standard tagsets (2)
foreign words new formations acquisition forms onomatopoeia meaningless forms euphonic particles non understandable words
French
Italian
Portuguese
Spanish
XXX:ETR – – – – XXX:EUP –
(POS) + K (POS) + Z ACQ ONO
ESTR – – – – – Pimp
– – – – – –
X
.. Tagging and frequency lists formats
The output of the morpho-syntactically tagged text represents both the speaker codes and the prosodic breaks, in order to allow context-bound grammatical studies (i.e. within utterances or tone units contexts). The text is given horizontally, with respect to the legibility of dialogic turns. The format of the tag for each lexical entry features three elements divided by backslash separators (\), as follows:
a. the first element is the word-form;
b. the second element is the lemma (in capital letters);
c. the third element is the code which includes the PoS category (in capital letters) and, optionally, the morpho-syntactic description (msd) of the word form.
The result is a pattern with the following structure:55
wordform\LEMMA\PoSmsd
e.g. *CAR: come\COME\B andò\ANDARE\Vs3ir / a\A\E casa\CASA\S vostra\VOSTRO\POS ? (Italian text)
The output of the frequency lists is encoded in a standard form, to ensure the highest comparability among the four Romance languages. The frequency lists are presented in two formats:
a. by lemmas, containing 4 columns for rank, lemma, PoS, and frequency, as illustrated in Figure 1.17; and
b. by word-forms, containing 5 columns for rank, form, lemma, PoSmsd, and frequency, as illustrated in Figure 1.18.
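The wordform\LEMMA\PoSmsd pattern lends itself to simple automatic reading. The sketch below parses the Italian example line given above; it is only an illustration, and the names `Token`, `TAG_RE` and `parse_turn` are hypothetical. It assumes that the speaker code has the form `*XXX:`, that any item without two backslashes is a prosodic break or other marker, and that the PoS category is the leading run of capital letters in the code. The last assumption holds for the Italian example but not necessarily for the other tagsets (e.g. French ADJ:QUA or Spanish DETdem), which would need their own splitting rule.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    form: str
    lemma: str
    pos: str
    msd: str


# Assumption of this sketch: PoS = leading capital letters, msd = whatever follows.
TAG_RE = re.compile(r"^([A-Z]+)(.*)$")


def parse_turn(line):
    """Split one tagged turn into its speaker code and a list of items.

    Items are Token tuples for tagged words and plain strings for
    anything else (prosodic breaks such as "/" or "?").
    """
    speaker = None
    items = []
    for chunk in line.split():
        if speaker is None and chunk.startswith("*") and chunk.endswith(":"):
            speaker = chunk[1:-1]
            continue
        parts = chunk.split("\\")
        if len(parts) == 3:
            form, lemma, code = parts
            match = TAG_RE.match(code)
            pos, msd = (match.group(1), match.group(2)) if match else (code, "")
            items.append(Token(form, lemma, pos, msd))
        else:
            items.append(chunk)
    return speaker, items


example = r"*CAR: come\COME\B andò\ANDARE\Vs3ir / a\A\E casa\CASA\S vostra\VOSTRO\POS ?"
speaker, items = parse_turn(example)
print(speaker)
for item in items:
    print(item)
```

Keeping the prosodic markers in the item list, rather than discarding them, preserves the utterance and tone unit boundaries that the format is designed to expose.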
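| rank | LEMMA | PoS | frequency |
| --- | --- | --- | --- |
| 1 | IL | R | 4213 |
| 2 | ESSERE | V | 3840 |
| ... | ... | ... | ... |
| 32 | BELLO | A | 352 |
| 33 | CASA | S | 289 |
| ... | ... | ... | ... |

Figure 1.17 Sample of lemmas frequency list

| rank | form | LEMMA | PoS(msd) | frequency |
| --- | --- | --- | --- | --- |
| 1 | il | IL | R | 2214 |
| 2 | è | ESSERE | V(s3ip) | 1587 |
| ... | ... | ... | ... | ... |
| 39 | bello | BELLO | A | 125 |
| ... | ... | ... | ... | ... |
| 46 | bella | BELLO | A | 109 |
| ... | ... | ... | ... | ... |
| 150 | era | ESSERE | V(s3ii) | 56 |
| ... | ... | ... | ... | ... |
| 589 | era | ERA | S | 22 |

Figure 1.18 Sample of forms frequency list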
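As an illustration of how the two list formats relate to the tagged text, the sketch below derives a lemma list (rank, lemma, PoS, frequency) and a form list (rank, form, lemma, PoSmsd, frequency) from a handful of tagged tokens. The token sample, the `ranked` helper and the tie-breaking rule are assumptions made for this example; the procedure actually used to produce the C-ORAL-ROM lists is not described here.

```python
from collections import Counter

# Illustrative tokens as (form, lemma, PoS, msd); not taken from the corpus counts.
tokens = [
    ("il", "IL", "R", ""),
    ("è", "ESSERE", "V", "s3ip"),
    ("bello", "BELLO", "A", ""),
    ("il", "IL", "R", ""),
    ("era", "ESSERE", "V", "s3ii"),
    ("era", "ERA", "S", ""),
    ("il", "IL", "R", ""),
]


def ranked(counter):
    """Return rows (rank, *key, frequency) sorted by descending frequency.

    Ties are broken alphabetically on the key, an arbitrary choice of this sketch.
    """
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, *key, freq) for rank, (key, freq) in enumerate(ordered, start=1)]


# Lemma list: one row per (lemma, PoS) pair.
lemma_list = ranked(Counter((lemma, pos) for _, lemma, pos, _ in tokens))

# Form list: one row per (form, lemma, PoSmsd) triple, so homographs stay distinct.
form_list = ranked(Counter((form, lemma, pos + msd) for form, lemma, pos, msd in tokens))

for row in lemma_list:
    print(row)
for row in form_list:
    print(row)
```

Counting on the full (form, lemma, PoSmsd) triple is what keeps the two entries for era apart, as in Figure 1.18, where the verb form and the noun appear on separate rows.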
. Measurements of spoken language variability in the Romance languages
One of the main characteristics of spoken language as compared to written language is the huge variability of the speech events according to individual characteristics, context of use, and semantic domain. Each Romance corpus in C-ORAL-ROM is designed to offer a representation of the main contexts of use of the spoken domain. In parallel, corpora are annotated with a set of linguistic information which is relevant for the highlighting of spoken language variability. The main parameters used for the description of spoken language domain (Dialogue structure, Sociological domain of use, Genre, Semantic domain of application, Channel) are represented in the corpus design schema, as reported in Table 1.1 above. As outlined in Section 1.5, corpora are tagged with respect to:
a. Dialogue structure: Dialogic turns;
b. Prosodic breaks: Terminal, non-terminal breaks and fragmentation events;
c. Significant discrete units: utterances in the speech flow.
Once relevant information regarding the spoken events is marked in the speech resource, corpus-based contrastive studies allow the investigation of correlations between the context in which a given spoken event occurs and its main linguistic qualities. Spoken language variability can thus be investigated by measuring the average and the variation coefficient of the linguistic units. The following are the standard variation parameters that reflect the overall structure of the texts recorded in the corpus:
a. Mid-length of utterances (in words56) (MLU);
b. Mid-length of the dialogic turn (in words) (MLTw);
c. Speed (words per second);
d. Mid-length of the tone unit (in words) (MLTone);
e. Fragmentation.
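The five measures can be read as simple ratios over the annotated units, and the variation coefficient as the standard deviation divided by the mean across texts. The sketch below works under those assumptions: the data layout, the function names, the treatment of fragmentation as the share of utterances flagged as fragmented, and the use of the population standard deviation are all choices made here for illustration, not the documented C-ORAL-ROM procedure (in particular, what counts as a word is not modelled).

```python
from statistics import mean, pstdev


def standard_measures(turns, duration_s):
    """Compute the five measures for one text.

    turns: list of turns; each turn is a list of utterances; each utterance is
    a dict with "tone_units" (a list of word lists) and "fragmented" (bool).
    duration_s: duration of the text in seconds.
    """
    utterances = [u for turn in turns for u in turn]
    tone_units = [tu for u in utterances for tu in u["tone_units"]]
    n_words = sum(len(tu) for tu in tone_units)
    return {
        "MLU": n_words / len(utterances),        # words per utterance
        "MLTw": n_words / len(turns),            # words per dialogic turn
        "Speed": n_words / duration_s,           # words per second
        "MLTone": n_words / len(tone_units),     # words per tone unit
        "Fragmentation": sum(u["fragmented"] for u in utterances) / len(utterances),
    }


def variation_coefficient(values):
    """Standard deviation over mean, in percent (population std assumed)."""
    return 100.0 * pstdev(values) / mean(values)


# Invented two-turn text, for illustration only.
text = [
    [{"tone_units": [["come", "andò"], ["a", "casa", "vostra"]], "fragmented": False}],
    [{"tone_units": [["bene"]], "fragmented": False},
     {"tone_units": [["ma", "poi"]], "fragmented": True}],
]

measures = standard_measures(text, duration_s=4.0)
cv = variation_coefficient([5.0, 7.0, 6.0])  # e.g. MLU values of three texts
print(measures)
print(round(cv, 1))
```

Averaging such per-text values within each node of the corpus design, and reporting the variation coefficient alongside, gives figures of the kind discussed in the sub-sections that follow.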
Cross-linguistic studies of standard measurements in the domain of Romance languages have two main implications: (i) tendencies that are consistent at a cross-linguistic level testify to their linguistic significance; and (ii) if the range of variation among Romance languages turns out to be well defined, then language-specific characteristics can emerge. The line diagrams reported below reproduce the mid-value of the above variation parameters in the various contexts of the corpus design structure and are also stored in the Standard Measure Menu of the Diagram section in DVD. The reader can refer to the DVD for the Variation coefficient registered by each Measure. Figures are identified below with the numbering they have in DVD. The line diagrams indicate strong correlations with the fields represented in the corpus design structure and show predictable characteristics at cross-linguistic level. In brief, the following general tendencies, which will be elaborated on in the following sub-sections, are observed:
i. MLU has a positive correlation with MLTw and exhibits highly predictable values especially in informal dialogic structures.
ii. Both MLU and MLTw have an inverse correlation with Speed.
iii. MLTone and Speed are however not predictable on the basis of the variation parameters, but instead vary according to language-specific features.
iv. Fragmentation varies mainly in accordance with speaker characteristics.
.. Mid-length of utterances (MLU)
MLU has strong cross-language correlations with respect to all the variation parameters represented in the corpus structure. In general, it is:
a. much higher in formal language;
b. higher in public contexts;
c. significantly higher in monologues;
d. lower in transmission texts than in formal natural contexts;
e. the lowest in the telephone domain, which is lower even than informal dialogues.
[Line diagram: MLU by corpus domain (tel priv, fam d/c, pub d/c, nat d/c, fam m, media, pub m, nat m), with one line each for Italian, French, Spanish and Portuguese]
Figure DVD 4.1 Mid-length of utterances in words (MLU) across domains
As can be seen in Figure DVD 4.1, MLU is highly predictable in all four languages in informal dialogic structures: Spanish, Italian and Portuguese vary around the same average (5–7) while French records a higher average (9–11). MLU is however more variable in media and monologues. MLU may also vary considerably from text to text in both formal and informal monologues (with the variation coefficient being around 40%), probably in accordance with different strategies adopted for text organisation. The variation coefficient for MLU in the four collections (shown in Figure DVD 4.1) is constantly low in telephone (<20%) and informal dialogues (20–25%), while it is high in media broadcasts (over 50%), in accordance with the broadcast format.
.. Mid-length of the dialogic turn (MLTw) MLTw shows a regular increasing trend according to corpus design, a pattern which is shared by the four languages (see Figure DVD 4.2). The variation coefficient of MLTw (see the diagram menu DVD 4.2) also shows a smooth increasing trend, with the only exception being the Italian Formal in Natural Context domain where a significantly higher value is recorded. MLTw has a positive correlation with MLU. Monologues, which have only one long turn, always have a higher MLU value, while in dialogue structures (telephone, family private, public informal and formal), MLU and MLTw co-vary in the four collections. In other words, in spontaneous speech the longer the turns of a text are, the longer each utterance is.
[Line diagram: MLTw by corpus domain (tel priv, fam d/c, pub d/c, nat d/c), with one line each for Italian, French, Spanish and Portuguese]
Figure DVD 4.2 Mid-length of dialogic turn in words (MLTw) across domains
.. Speed
Speed is a relevant predictable quality for discriminating between the four languages in the various spoken styles. As can be seen in Figure DVD 4.3, in the informal dialogic sub-corpus (telephone, family, public), each Romance language records a different speed: speed is lowest in Italian at less than 2.7 w/s, constantly over 3.5 w/s in French, and between 3 and 3.5 in Portuguese and Spanish. The variation coefficient (shown in the DVD in Diagram menu 4.3) is low in telephone conversation and informal dialogues. Conversely, the average value of formal monologues discriminates among the four languages to a lesser degree, while still presenting a low variation coefficient (<20%). Speed, which is limited by articulatory factors, therefore appears to be bound by language-specific phonetic structures. This difference is more evident in informal dialogues than in formal corpora, where speed is reduced in connection with the speech performance’s task. Speed turns out to have an inverse correlation both with MLTw and with MLU. A constantly higher value for speed in telephone conversations and informal dialogues and a lower one in formal contexts can be observed cross-linguistically. Together with the findings for MLU and MLTw in 1.6.1 and 1.6.2, this indicates that the longer the turn, the slower the flow of speech.
[Line diagram: Speed by corpus domain (tel priv, fam d/c, pub d/c, nat d/c, fam m, media, pub m, nat m), with one line each for Italian, French, Spanish and Portuguese]
Figure DVD 4.3 Speed, in words per second, across domains
.. Length of the tone unit (MLTone)
MLTone shows strong upper limits, due to breath constraints, and, in theory, may vary in accordance with the prosodic and rhythmic properties of languages and their syllabic structure. Unlike MLU, MLTone cannot be predicted along sociological and structural variation parameters. Breath constraints force a low, predictable, average value. It is constant and similar in Italian, Spanish and Portuguese (2.5–3.5 words), despite the fact that these languages have different prosodic structures (variation coefficient from 10% to 20%). The marked difference exhibited by French with respect to the other three languages (4–6 words), which can clearly be seen in Figure DVD 4.4, is of course the main phenomenon to be explained. The weight of words, in terms of the number of written words and written syllables, is probably the origin of most of this difference: in speech, the reduction of syllables with respect to their orthographic counterparts is significant and systematic. This makes it possible to produce a higher number of words within the breath unit.
[Line diagram: MLTone by corpus domain (tel priv, fam d/c, pub d/c, nat d/c, fam m, media, pub m, nat m), with one line each for Italian, French, Spanish and Portuguese]
Figure DVD 4.4 Mid-length of tone unit in words (MLTone) across domains
[Line diagram: Fragmentation Phenomena, in %, by corpus domain (tel priv, fam d/c, pub d/c, nat d/c, fam m, media, pub m, nat m), with one line each for Italian, French and Spanish]
Figure DVD 4.5 Incidence of fragmentation, in %, for three corpora, across domains
.. Fragmentation
The incidence of fragmentation phenomena (i.e. interrupted utterances and false starts) relative to the total number of utterances has been measured for the Italian, Spanish and French corpora only, and is found to be high in all collections. This testifies to its relevance as a peculiar feature of spoken language. As can be seen in Figure DVD 4.5, the incidence of fragmentation varies from 20% to 60% of utterances. Quite surprisingly, in all collections it has a lower incidence (around 20%) in informal dialogues, telephone conversations and media broadcasts, while it ranges from 40% to 60% in the formal contexts. Fragmentation turns out to be higher when the spoken performance is task-oriented, while it is relatively lower in everyday use. However, although this parameter is predictable on average, it is also strongly determined by individual factors, as is shown by the lower incidence of fragmentation in media broadcasts (where professional speakers are involved), and confirmed by a high variation coefficient throughout the corpus nodes (40–100%) (see Diagram 4.5 in the DVD).
.. Some conclusions
In spoken language, the utterance limits specify the domain of the main linguistic relations. For instance, argument structure, constituency, head dependency, and chunking relations hold among elements of the same utterance. MLU values allow Human Language Technologies to make hypotheses about the probable occurrence of the utterances and therefore to better determine the domain of actual linguistic relations. This may be significant for many purposes, such as the adequacy of speech synthesis performance, the selection of the relevant domain for speech recognition, and automatic indexing technologies.
The C-ORAL-ROM resource
The spoken language domain exhibits a strong variability with respect to MLU, but certain variation tendencies can be clearly noted. The data presented show that, in all four languages under investigation, the mid-length of the utterance varies according to the structure of the communicative event, the sociological domain of use, and the channel; this suggests that the length of the linguistic object which is the outcome of the speech act depends on non-linguistic factors. In other words, non-linguistic conditions determine the goals of the linguistic performance. Accordingly, the linguistic means may be limited to a few words or may need complex structures with many words. The range of variation turns out to be quite predictable in informal dialogues, which are the prototypical domain of application of spontaneous speech. In informal dialogues, values are cross-linguistically recorded within two ranges (5–7 words per utterance in Italian, Spanish and Portuguese, and around 10 in French), with a consistent variation between French and the other Romance languages, specifically in the informal dialogic domains. For all four languages, such values strongly diverge from the formal-monologic domains, which consistently have higher values. The variation coefficient with respect to MLU in informal contexts is also much lower, testifying to the significance of the correlation. The severe restriction of the number of words comprising one utterance in natural dialogic contexts must thus be seriously considered when dealing with speech. The tendency to have much longer utterances is cross-linguistically verified in two domains: (a) monologic texts, and (b) contexts requiring a formal use of language. However while the length of the utterance is predictable as having higher values in those domains (always over 10 words per utterance), the range of variation is also much higher. Such variation is found to be high specifically in media speech. 
No language-specific tendency can however be highlighted on the basis of the data at our disposal. In order to make a more detailed analysis of MLU in such non-prototypical domains, the corpus sampling must deal with more subtle distinctions such as genre and semantic domain. The C-ORAL-ROM corpus does not however account for a sufficient amount of samples to document those variations. MLU shows significant correlation with the length of the dialogic turn but, at least in general, not with the length of the tone unit. In all four languages, the length of the utterance and the length of the turn co-vary according to socio-structural parameters. Specifically, the more formal the context, the more the linguistic task requires a long turn, and the more each utterance turns out long and structured. This tendency may seem natural, but it must be pointed out that this is not in principle necessary. The reverse possibility could also be considered, that is, the more complex the linguistic performance, the more each piece of information might be made simple. However this theoretical alternative does not apparently apply to natural languages in their steady state. Turn alternation in a given spoken dialogue is therefore a cue for predicting utterance length. On the contrary, in all four languages, the length of the tone unit appears independent from the contexts of use, and is fairly constant within each language. The
variation in values of MLTone is, however, relevant at the cross-linguistic level. Within the Romance languages, the similarity between Italian, Spanish and Portuguese is evident, while French strongly diverges, probably due to a different syllabic weight of words. The relation between MLTone and MLU must be studied carefully. While MLTone is constant throughout the various corpus domains, MLU shows much variation. In general, the two measurements do not co-vary. The absence of a strict correlation between MLU and MLTone must be highlighted for its linguistic significance. While the latter is a function of the number of syllables and breath constraints, the former is independent of both factors. In other words, the functional values of the utterance, which vary in accordance with the tasks requested by the context of use, determine its length, which is therefore not strictly bound by the phonological structure of a language. The constant values that the tone unit length has in the various contexts of use show, on the contrary, that its length is not strictly correlated to the context of use. The parameter of speed is found to vary according to both corpus design and language-specific factors. Within the Romance family, speed turns out to be a characteristic particular to each language, and is confirmed as a genuine language-specific datum by the low variation coefficient. Differences are however mainly noted in the informal dialogic domain, where French and Spanish have clearly higher values. Cross-linguistic differences are lower in the formal contexts, where speed decreases in all languages for reasons presumably linked to the task of the speech performance. Given this general tendency, the variation in speed between informal and formal contexts is more marked for those languages with an overall higher speed. The results regarding speed are additionally interesting when compared with variation parameters such as MLTw and MLU.
Speed has an inverse correlation with both MLTw and MLU: the longer the turn and the utterance, the slower the flow of speech. The consistent frequency of fragmentation events with respect to the number of utterances in spoken language is confirmed by the available data. This is extremely relevant for speech analysis at both syntactic and prosodic levels. All fragmentation events cause the onset of a PoS sequence that, by definition, is inconsistent with the grammar. Moreover, all fragmentation events cause a prosodic rupture of speech fluency and the onset of prosodic patterns that are also inconsistent with prosodic rules. The frequency of fragmentation events determines sequences that have little or no meaning from an informational point of view. Given their percentage incidence, their recognition is therefore vital for the understanding of spontaneous speech. Measurements verify that there is a significant similarity of incidence of fragmentation at the cross-linguistic level; this suggests that fragmentation is a stable characteristic of spoken language. Despite this meaningful result, no significant correlation can be pointed out with respect to the corpus structure. Cross-linguistically, it is verified that non-professional speakers tend towards more fragmentation in task-oriented speech performance than in everyday language. However, the high variation coefficient indicates that the tendency for fragmentation is more a speaker’s characteristic whose probability of occurrence cannot reasonably be predicted on the basis of the data analysed.
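The two length measurements compared in these conclusions, MLU and MLTone, can be illustrated with a small sketch that counts graphic words between prosodic break marks in a transcribed turn. It is a hedged example, not the tool used by the project: the function name and the sample turn are invented, and it assumes that "//" closes an utterance while "/" and "[/]" close a tone unit without closing the utterance.

```python
def length_measures(turn, terminal=("//",), non_terminal=("/", "[/]")):
    """Count graphic words, utterances and tone units in one turn.

    Assumptions: tokens are whitespace-separated; a leading speaker
    mark such as "*ABA:" is skipped; material left open at the end of
    the turn counts as one more tone unit and one more utterance.
    """
    tokens = turn.split()
    if tokens and tokens[0].startswith("*") and tokens[0].endswith(":"):
        tokens = tokens[1:]  # drop the speaker mark

    words = utterances = tone_units = 0
    pending_unit = pending_utt = 0  # words since the last break of each kind
    for tok in tokens:
        if tok in terminal or tok in non_terminal:
            if pending_unit:  # any break closes the current tone unit
                tone_units += 1
                pending_unit = 0
            if tok in terminal and pending_utt:  # only "//" closes the utterance
                utterances += 1
                pending_utt = 0
        else:
            words += 1
            pending_unit += 1
            pending_utt += 1
    if pending_unit:
        tone_units += 1
    if pending_utt:
        utterances += 1

    return {
        "words": words,
        "utterances": utterances,
        "tone_units": tone_units,
        "MLU": words / utterances if utterances else 0.0,
        "MLTone": words / tone_units if tone_units else 0.0,
    }


# Invented sample turn, for illustration only.
m = length_measures("*ABA: ieri sono andato / al mercato // non c'era nessuno //")
print(m)
```

Because the two ratios share the same word count but divide it by different unit counts, MLU can change from one context to another while MLTone stays roughly constant, which is the pattern reported above.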
Notes
. The status of a spoken resource for each of the Romance languages represented in C-ORAL-ROM is presented in Chapters 2 to 5. Other non-English corpus collections must also be cited for their relevance in the state of the art. Apart from the Dutch and the Israeli corpora extensively described in this chapter, the following are at present the main references:
– DGD (Datenbank Gesprochen Deutsch) of Mannheim: 27 corpora (spoken standard, spontaneous speech, relevant German dialects, foreign speakers’ language) collected from 1955. http://dsav-wiss.ids-mannheim.de/DSAvINFO.HTM
– CALLHOME German Transcripts corpus: Consists of 100 unscripted telephone conversations between native speakers of German; the CallHome German lexicon consists of 318,807 words. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97T15
– The Kiel corpus of read and spontaneous German, which has been collected and labelled segmentally since 1990. http://www.ipds.uni-kiel.de/forschung/kielcorpus.en.html
– The Swedish Spoken Language Corpus: Comprising 1,230,663 tokens, it is an incrementally growing corpus of spoken language from different social activities (Allwood 1999); Göteborg University. http://www.ling.gu.se/projekt/old_tal/SLcorpus.html
– The Corpus of Spontaneous Japanese: This project aims at the building of a large-scale spontaneous speech corpus (approximately 7 million words, with a total speech length of 800 hours) and its exploitation to improve speech recognition technologies (see Furui et al. 2000; Maekawa et al. 2000).
For English corpora the following are at present the main available collections:
– London-Lund Corpus: 100 texts (435,000 words from conversations, telephone conversations, talks, lessons, radio) with prosodic annotations; collected by the Survey of English Usage, University College, London (published).
– The British National Corpus (BNC): 100,000,000 words (10% spoken language); available online at http://thetis.bl.uk/.
– ICE-GB – Written and Spoken English: 1 million words parsed and tagged; Department of English Language and Literature, University College, London.
– Corpus of London Teenage Language: 500,000 words within the BNC. http://www.hd.uib.no/colt/
– COBUILD Bank of English: 320 million words, with the spoken word represented by transcriptions of everyday conversation, radio broadcasts, etc. http://titania.cobuilt.collins.co.uk/boe_info.html
– University of Lancaster annotated corpora (Lancaster/IBM Treebank): Six parsed corpora, comprising 1 million words from debates at the Houses of Parliament; 250,000 words from the American Printing House for the Blind; 800,000 words from IBM manuals. http://www.comp.lancsac.uk/computing/research/ucrel/corpora.html
– CANCODE, Cambridge and Nottingham Corpus of Discourse in English: 5 million words of spontaneous spoken English; built up by CUP and the University of Nottingham as a part of the Cambridge International Corpus. http://uk.cambridge.org/elt/corpus/cancode.htm
– ANC, American National Corpus (technical director: Nancy Ide, Vassar College): The first release includes 10 million words, not balanced; the final release will comprise 100 million words of written and spoken American English. http://americannationalcorpus.org/
– TRAINS: Corpus includes 98 task-oriented dialogues, amounting to 6.5 hours of speech, approximately 5,900 speaker turns, and 55,000 transcribed words. http://www.cs.rochester.edu/research/trains/
– ATHELSAN, Corpus of Spoken, Professional American-English: Contains two main subcorpora of 1 million words each, the first consisting mainly of academic discussions, the second containing transcripts of White House press conferences. http://www.athel.com/cspatg.html
– CSAE, Santa Barbara Corpus of Spoken American English; collected by University of California; scientific director: John W. Du Bois. http://www.ldc.upenn.edu/Projects/SBCSAE/
– The CHRISTINE Project: This extends the SUSANNE Corpus to cover spontaneous, informal spoken English; sponsored by the Economic & Social Research Council (UK) and directed by Geoffrey Sampson. http://www.grsampson.net/RChristine.html
– WSC, The Wellington Corpus of Spoken New Zealand English: Comprises one million words of spoken New Zealand English collected in the years 1988 to 1994. http://helmer.aksis.uib.no/icame/wsc/
. The C-ORAL-ROM databases are anonymous. All speech segments that might have offended the user for decency reasons have been erased and substituted with a beep in the audio signal. Speakers authorised each provider to use the recorded data for all ends in the C-ORAL-ROM project, including publication and language technology applications. The authorisation models are available at http://lablita.dit.unifi.it/coralrom/authorization_model. The use of radio and TV emissions has been authorised by the broadcasting companies which also provided the raw data for the database. Acknowledgement of all companies is given in the Copyright section of the DVD.
The authorisation databases have been checked by ELDA (European Language Resource Distribution Agency) and delivered to the Commission. . Other partners in the C-ORAL-ROM Consortium: European Language Distribution Agency (ELDA), France, will distribute the resource in the industrial market sector; Instituto Cervantes (IC), Spain, is in charge of dissemination and exploitation for language acquisition purposes; Istituto Trentino di Cultura (ITC-Irst), Italy, tested the resource in current multilingual speech recognition technologies. . Files are encrypted and can only be accessed through the programmes included in the DVD. Minimal configuration required: Pentium III, 1 GHz, 252-megabytes Ram, S-blaster or compatible sound card, running under Windows 2000 or XP only. See Read-me file in the DVD for details. . See the discussion in 1.2.1. . The level of inter-annotator agreement on prosodic tag assignment has been evaluated by an external institution (LOQUENDO) and is reported in the appendix of this book. . In the French resource, terminal breaks are annotated but the alignment mainly follows pauses in the speech flow. . See the reference to those resources for each language in Chapters 2 to 5, and more in general in the ELRA catalogue. . This limit is quite severe for Italian, where local varieties may strongly diverge from the standard (De Mauro et al. 1993).
. For spoken corpora representative of diatopic variation see: for Italian, LIP, Lessico di Frequenza dell’Italiano Parlato (De Mauro et al. 1993), and CLIPS, Corpora e Lessici di Italiano Parlato e Scritto, coordinated by F. Albano Leoni (http://www.cirass.unina.it); for French, CRFP, Corpus de référence du français parlé, collected in 1999–2000 at the Université de Provence; for Portuguese, Portugues falado (Bacelar do Nascimento 2000 and 2001a); for British English, SED, Survey of English Dialects: The Spoken Corpus, recorded in England 1948–1973 (Orton et al. 1978; Upton et al. 1994) and CANCODE, the Cambridge and Nottingham Corpus of Discourse in English http://uk.cambridge.org/elt/corpus/cancode.htm; for American English, CSAE, Santa Barbara Corpus of Spoken American English http://www.ldc.upenn.edu/Projects/SBCSAE/
. Around 5% variation on the corpus design figures is always allowed.
. Up to 20% of this part may be constituted by texts of different length.
. Chapter 6 illustrates the most relevant lexical, structural and surface syntactic characteristics of an oral formal text.
. From the project web site http://lands.let.kun.nl/cgn/doc_Dutch/topics/design/design.htm#intro
. A detailed description of the design can be found in Izre’el et al. (2001: 171–197).
. A cell was defined as a recorded segment designated to include 5,000 words of coherent continuous text.
. In the original Austin definition, a speech act is the simultaneous performance of a locutionary act (“act of saying”), an illocutionary act (language action conventionally performed through the locutionary act) and a perlocutionary act (both the goals and the consequences of the act, in a broad sense). Language actions are a finite set and are defined by language conventions.
In the Searle vulgata of speech act theory (Searle 1969), which is even better known in the linguistic community, an utterance (u) is the result of the application of an illocutionary force (F) to a propositional content (p). According to Searle, F expresses a possible performative verb. Mirroring the relation between a performative verb and its embedded clause, the utterance is linearly represented as F(p) = u. The present framework, however, directly refers to Austin’s definition, which is more coherent with our assumption. The main formal indexes of illocutionary force in spontaneous speech belong to the prosodic domain, rather than to the lexicon; therefore we must consider that prosodic programmes are not applied to the language string, but simultaneously performed while the string is uttered. Crucially, the illocutionary force specifies which language action is accomplished by the utterance and therefore it constitutes a necessary feature of the utterance itself.
. According to the theoretical framework of the IPO tradition, prosody features two types of F0 movements: “programmed, voluntary F0 changes and physiologically determined, involuntary fluctuation” (’t Hart et al. 1990: 39). Only the first type of movements is a selective object of perception for competent speakers.
. The illocutionary criterion has been extensively applied to Italian spontaneous speech corpora for both adult and child language corpora (see references in Cresti 2000 and references to corpora in http://lablita.dit.unifi.it).
. The properties of “completeness” and/or “autonomy” that are frequently referred to as a necessary condition for the definition of utterance (Simon 2004) can be better understood in connection with the illocutionary criterion. A language entity is judged complete (or autonomous) if it can be pragmatically interpreted, and vice versa.
. Some experimental research in this domain has been conducted in LABLITA Lab on large Italian spoken corpora. Roughly 80 different types of illocutionary acts, pragmatically recognisable by native speakers, have been found. More specifically, those which are most commonly employed (about 30) have specific intonation profiles, dedicated to the expression of a well identified speech act (see Cresti 2000; Firenzuoli 2003).
. In other words the annotation of utterance boundaries in accordance with the illocutionary criterion is a much weaker practice than the annotation of speech act labels or discourse act labels (see also 1.2.3 below).
. The prosodic annotation of breaks in the Dutch corpus also considered in-word breaks, which are not marked in C-ORAL-ROM, and other cues (e.g. lengthening, etc.).
. In the Dutch corpus, a “substantial consistency” (Landis & Koch 1977) of the annotation for strong and weak prosodic breaks has been quantified by means of K-coefficient (Cohen 1960), recorded in the conventional range between 0.61 and 0.80 points.
. For an overview, see the result of the IST project Mate.
. For an exhaustive discussion of the question, see Wightman (2002) and Martin (2001).
. Percentage data, such as those presented in Chapter 6, on lexical, morphologic and surface syntactic characters of the Romance spoken corpora can be derived on the basis of the annotation of utterance boundaries.
. The following is the definition given in the guidelines for the transcription of speech in the TEI P4: “Each distinct utterance in a spoken text is represented by a <u> element, described as follows: a stretch of speech usually preceded and followed by silence or by a change of speaker”. See http://www.tei-c.org/P4X/TS.html#TSOV
. See Cresti (2000) for the definition of Topic-Comment relation, and Firenzuoli and Signorini (2003) for morpho-syntactic and prosodic characteristics.
.
Typical examples for each break type in the four Romance languages are provided in the DVD in the folder “Types of prosodic breaks” of the Multimedia corpus. . No distinction connected to possible causes of interruption is considered in this frame (e.g. the CHAT format marks when the interruption is caused by the listener). However, the format explicitly marks the distinction between interruption and intentional suspension, which frequently occurs at the end of utterances. Intentional suspension must be marked with a generic utterance limit “//”, or specified, as “. . . ”. . Retracting phenomena with complete or partial repetition can both be expressed by this symbol, therefore, in principle, all traditional CHAT [//] should be simplified to [/] in this system, although the use of both traditional CHAT symbols is tolerated in the C-ORAL-ROM format. . For media corpora, the situation field is filled with the name of the programme. . Sessions in which F0 analysis is not significant are labelled D and excluded from sampling. . In human-machine interaction, in correspondence with the machine’s artificial voice, the value “Woman” or the value “x” is reported. . In this case the dialogic turn is filled by a communication event instead of a speech event. The relation between the two concepts is left undefined in C-ORAL-ROM; this resource marks the main communication events in a dialogue, but is not specifically devoted to the study of such events, which must be considered in a multimodal framework.
. This may not be the case in overlapping and other intersection phenomena. See below.
. No phonetic transcription.
. In these cases the transcription may not be reliable for perceptual reasons.
. Incomplete words are never subject to rebuilding, except, of course, for systematic phonetic phenomena (e.g. elision, breaking off of the last syllable, etc.). Those phenomena may or may not be mirrored in the transcription, following the orthography of each language and the traditions in editing oral text. Such choices must be detailed in the notes to the corpus edition.
. Note that the lengthening of syllables, which is quite a common and perceptually relevant phenomenon in spoken language, is not marked in this system. However, following the philosophy of marking prosodic breaks, the system automatically assumes the generalisation that lengthening necessarily causes a prosodic break.
. When a word is not transcribed in the text, it is substituted with a beep of a similar length in the acoustic signal.
. Music or advertising fragments in media may not be transcribed, and instead substituted by a variable (that may or may not be aligned). In the case of very long music or advertising fragments in a media corpus, the fragment can be cut and the cut noted in a dependent line.
. Each slash immediately following the speaker mark is of course not counted as a prosodic break.
. When a dialogic turn continues despite the insertion of another turn that partially overlaps it, the traditional CHAT format allows the following linear representation (which has been abandoned in C-ORAL-ROM for practical reasons):
*ABA: Maria mi ha detto che [/] più / al concerto // perché non si sente //
[Maria told me that [/] no more // to the concert // because she doesn’t feel well //]
*BAB: [<] <non viene> //
[<she doesn’t come> //]
This representation is not consistent with the Intersection convention.
.
In long monologues, where the serial number of a word is high, a “: /” convention can be used for the insertion of a dependent line immediately following the commented item. . The alignment of the French resource is by pauses; see Chapter 3. . The multimodal interface is not exploited in C-ORAL-ROM. . The best-known corpora of spoken English – Switchboard Telephone Corpus at Penn Treebank website http://www.cis.upenn.edu/∼treebank/home.html; London-Lund Corpus (Svartvik 1990); Lancaster-IBM Corpus (Knowles, Williams, & Taylor 1996) – received PoS tagging following different tagsets and methods. The recent national project of Spoken Dutch Corpus (CGN, Corpus Gesproken Nederlands, see Boves & Oostdijk 2003) outlined a procedure for exploiting existing taggers (TnT tagger, Brill tagger, maximum entropy memory based taggers) and lexical resources for the annotation of corpora with a new tagset (van Eynde et al. 2000; Zavrel & Daelemans 2000). The Japanese National Project on Spontaneous Speech Corpus (see Maekawa et al. 2000) tried to develop a wider series of specific tags for annotation of spoken language. . See Maekawa et al. (2000). . See Calzolari and Monachini (1996).
. “One of the major problems presented by this category is that the Romance tradition makes use of the label Pronominal Adjective, whereas other languages (e.g. English) use Determiner. The two classes are not mappable one to the other. This is a crucial problem, involving the different behaviour of the Determiners and Pronominal Adjectives” (Calzolari & Monachini 1996).
. In the NERC scheme, the problem of the use of different tags, i.e. Determiner and Pronominal Adjective, in different traditions has been raised and the proposal made to opt for the label Determiner, without any further functional distinction. The TEI work-group on annotation (see TEI 1991) has made the choice of inserting Determiners in the Adjective category, with the value pronominal.
. See the GENELEX model which distinguishes the different cases of Pronouns, Determiner and Adjectives (which includes Pronominal Adjectives), exemplified by the following set: le nôtre (Pron), nôtre chien (Det), le chien est nôtre (Adj).
. Regarding multiwords, there are two possibilities for their tagging, with respect to the different ways for identifying locutions in each tagset: e.g.
a_priori\A_PRIORI\B
a\A_PRIORI\B1 priori\A_PRIORI\B2 (Portuguese resource)
. All measurements regarding words strictly refer to graphic words.
Chapter 2
The Italian corpus* Emanuela Cresti, Alessandro Panunzi, and Antonietta Scarano
. History of the corpus within the national framework
.. Historical overview
Before the 1990s, when the first structured corpora of spoken Italian were published, it was possible to refer to many databases collected by individual researchers for their studies on spoken Italian, which have been published in the last twenty years (Bazzanella 1994; Berretta 1994; Berruto 1990; Bianconi 1980; Cresti 1987; Giannelli 1994; Orletti 1994; Poggi Salani 1977; Pontecorvo & Duranti 1996; Sornicola 1981; Stammerjohann 1970; Voghera 1992). However, though information about these corpora is available, neither their sound counterpart nor their transcripts are published (apart from a few excerpts of single texts), and their size and running time can only be deduced from references made to them in the mentioned works (Berretta 1994; Koch & Osterreicher 1990). There are also some Ph.D. theses which include their corpora in appendices, like those of Caputo (1996), Rossi (1998) and Voghera (1990). Proper speech corpora made their first appearance in the Nineties, the first one among them being the Lessico di frequenza dell’italiano parlato (LIP, Frequency Lexicon of Spoken Italian). A historical and chronological overview of the most important spoken Italian corpora, research projects and websites dedicated to them follows:
1. The LIP corpus (see De Mauro et al. 1993) is not only the first, but also one of the most important collections of spoken Italian, and doubtless the most used in linguistic research. LIP was accomplished by Università La Sapienza of Rome, in collaboration with the IBM Italia foundation between 1990 and 1992, and was directed by T. De Mauro. Its texts, which account for 57 hours’ recording time, and over 500,000 words, were collected in four cities (Milan, Florence, Rome and Naples, each accounting for 125,000 words) in order to represent the diatopic variety of the Italian language.
These texts also represent the diaphasic and diamesic variety, because they are organised over five different types of texts: (a) face-to-face conversations; (b) telephone conversations; (c) non-free dialogical interactions; (d) monologues (lectures, sermons, etc.); and (e) radio and TV programmes. Group (c) includes texts in which the dialogical interaction does not proceed without restraint, but is guided by one of the speakers: typical examples are interviews, classroom interactions and debates. Each register is represented by samples of texts amounting approximately to 100,000 word tokens. The frequency lexicon was collected in book form as a dictionary, accompanied by two floppy disks containing the 500,000 words of which the corpus consists; this is the broad orthographic transcription of the recordings, whose criteria include the indication of pauses, of overlapping of different speakers and of vowel lengthening, but not of punctuation. The recordings, on magnetic tape, are kept by the Università La Sapienza in Rome, but are not publicly available. The only parts of the LIP corpus accessible to the public are the orthographic transcripts of the recordings.
2. Another activity which employs the LIP corpus is the Banca dati dell’italiano parlato (BADIP, Spoken Italian Databank). Part of Karl-Franzens-Universität of Graz’s Language Server, this is an initiative which intends to make data of spoken Italian available electronically, i.e. freely accessible on the internet, and to give researchers unified access to it. At present, its efforts are concentrated on an online edition of the entire LIP Corpus. BADIP also contains extra-linguistic data, PoS tags, lemmas, and word frequencies. Future activities will be dedicated to the publication of the audio part of LIP, to accommodate further data from this corpus as well as from other corpora of spoken Italian. The corpus can be accessed at http://languageserver.uni-graz.at/badip/.
3. The Archivio delle Varietà di Italiano Parlato (AVIP, Varieties of Spoken Italian Archive) project was co-funded by MURST (Ministero dell’Università e della Ricerca Scientifica e Tecnologica) in 1997 and completed in 1999. The project was coordinated by P. M.
Bertinetto of the Scuola Normale Superiore di Pisa with collaboration from the Linguistics Laboratory of the Scuola Normale Superiore di Pisa (Bertinetto 2001), CIRASS (Università degli Studi di Napoli “Federico II”), the Phonetics Laboratory of the Istituto Universitario Orientale, Naples, the Department of Electrotechnics and Electronics of Politecnico di Bari, and the Linguistics Department of Università del Piemonte Orientale, Vercelli. The corpus contains speech from different regional Italian varieties, collected in Pisa, Naples and Bari. Totalling 37,600 words, the corpus is aligned text-to-speech following dialogic turns. It includes 39 semi-spontaneous map-task dialogues, produced by young adult speakers, and 5 video-recorded dialogues, produced by children with normal and impaired hearing, for a total of about 14 hours. Only 15 adult speakers’ dialogues, plus the 5 produced by children, giving a total of about 350 minutes, are orthographically transcribed. 75 minutes of speech are phonetically segmented and labeled, with a smaller subset also prosodically labeled, and four dialogues are annotated at textual level. The entire documentation of the project is available by ftp, as well as in the CD version1 (which can be requested at [email protected]), and is freely accessible at ftp://ftp.cirass.unina.it/cirass/pub/avip/. The recordings of all AVIP texts are downloadable in WAWE format. 4. The Archivio di Parlato Italiano (API, Spoken Italian Archive) project, co-funded by MURST in 1999, began in November 1999 and was completed in November
2001. It was coordinated by F. Albano Leoni (CIRASS), with collaboration from Scuola Normale Superiore di Pisa, Università di Pisa, Università di Venezia, Università del Piemonte Orientale, Politecnico di Bari, and the Department of Neurosciences and Interhuman Communication of Università degli Studi di Napoli “Federico II”. API added a set of labelling levels to those of the AVIP corpus. Electronic tools which integrate those fine-tuned in the AVIP project have also been developed.
5. The consortium coordinated by CIRASS also worked towards the creation of Corpora Linguistici per l’Italiano Parlato e Scritto (CLIPS, Linguistic Corpora for Spoken and Written Italian). This corpus was financed by MIUR; work began in February 2000 and concluded in February 2003. The idea for the project came from F. Albano Leoni at the Federico II University of Naples, and involved the participation of other institutions: Scuola Normale Superiore di Pisa, FUB, ISCTI, Università di Lecce. CLIPS accounts for about 100 hours’ worth of recordings, and is articulated, from a diaphasic point of view, in semi-spontaneous map-task dialogues, radio and TV broadcasts (news, cultural and entertainment programmes, advertisements), telephone conversations, and read texts. CLIPS represents regional Italian varieties well, as its recordings were collected in 15 different places, chosen according to linguistic, demographic and socio-economic criteria. The corpus is dedicated to technological applications (training of recognition systems, fine-tuning of automatic segmentation tools and labelling), as well as linguistic analysis (systematic analysis and linguistic variability). It was partly orthographically transcribed (about 300,000 words, comprising 30% of the total) and aligned text-to-speech following dialogic turns. About 10% of the corpus was also labelled phonetically. The transcription and the labelling were achieved following international standards, adopting EAGLES recommendations.
The documents published by the AVIP consortium have been transferred onto CD, distributed by the consortium, and made available on the FTP site set up by the Federico II University of Naples’ CDS.

6. The Italiano Parlato (IPAR, Spoken Italian) project was conceived as a conclusion to the AVIP and CLIPS projects, and was co-funded by MURST in 2001. It was coordinated by F. Albano Leoni (CIRASS), and also involved Università degli Studi di Napoli “Federico II” (CIRASS and Department of Neurosciences and Interhuman Communication), Seconda Università di Napoli, Università di Perugia, Università di Torino, Università “La Sapienza” di Roma, Scuola Normale Superiore di Pisa, Università di Pisa, Università del Piemonte Orientale, Università di Venezia, Università di Siena, Università di Salerno, and Istituto Universitario Orientale di Napoli. Its aim was a more in-depth linguistic analysis of spoken Italian, representing the main regional Italian varieties collected in the AVIP and CLIPS corpora. The project also aimed to fine-tune other tools for the automatic analysis of speech, particularly of its prosodic component.
Emanuela Cresti, Alessandro Panunzi, and Antonietta Scarano
The next two corpora, LIR and CiT, are examples of corpora specifically collected to investigate the diamesic features of radio and television speech and their specific lexical characteristics.

7. The Lessici di frequenza dell’italiano radiofonico (LIR, Frequency Lexicons of Radiophonic Italian) project was co-funded by MURST in 1995. It was coordinated by N. Maraschio and carried out by the Italian Grammar Centre of the Accademia della Crusca; Dipartimento di italianistica dell’Università di Firenze and Scuola Normale Superiore di Pisa also took part in it. The research was made possible by the support of Giovanni Nencioni, a pioneer in Italian speech studies and at that time President of the Accademia della Crusca. LIR accounts for around 68 hours of radio speech and over 500,000 transcribed words. The recordings were made in 1995 and taken from the broadcasts of nine nationwide radio stations: the three RAI stations, Radio DJ, RTL 102.5, Rete 105, Radio Italia, Radio Radicale, and Radio Vaticana. The CD-ROM, still a prototype for the time being, contains the transcripts of the texts aligned with the recordings. The texts are accompanied by metadata on the station, speaker and communication type. The transcription is orthographic, and is supplemented by the marking of some of the most typical speech phenomena, such as self-corrections, clippings, overlaps, and hesitations. The corpus was automatically lemmatised using a version of E. Picchi’s DBT program, which also enabled the retrieval of multiwords, idiomatic expressions, and foreign words. Separate frequency lexicons for each of the radio stations are also planned.

8. Work on the Corpus di Italiano Televisivo (CiT, Televised Italian Speech) corpus started in 1998 and is still in progress. CiT is kept by the Department of Language Sciences of the University for Foreigners in Perugia, and its coordinator is S. Spina.
The corpus collects texts from television broadcasts and consists of 250,000 words, orthographically transcribed; its aim is the analysis of lexical and grammatical features of televised Italian. CiT was manually annotated, following the standards set by the Text Encoding Initiative (TEI). The labels concern the following levels: (a) structure of television programmes; (b) grammatical categories; and (c) lexicon (foreign words, technical words and multiwords). The corpus is available for consultation in demo version at http://www.sspina.it/cit/cit.htm.

Other corpora with different purposes are:

1. The Acoustic Phonetic And Spontaneous Speech Corpus (APASCI) (Angelini et al. 1994). It was collected by IRST and is distributed by ELRA.2 APASCI’s audio counterpart consists of continuous speech sentences, designed to cover a large number of phonetic contexts. The corpus consists of two portions. The first part is devoted to the training and validation of a speaker-independent speech recognition system, comprising 3,900 utterances from 188 speakers. The second part is designed for the training and validation of speaker-dependent recognition systems using 6 speakers: 3 males and 3 females. The APASCI corpus was collected in a quiet room (SNR * 40 dB) by means of a high quality close-talk microphone.

2. The Italian Fixed Network Speech SpeechDat(M) Corpus (FIXED0IT) was recorded within the SpeechDat(M) project (LRE-63314), funded by the European Commission. The corpus contains the speech of about 1,000 speakers, in approximately equal numbers of males and females, and was designed to support the creation of voice-driven teleservices. Most items are read, while some are spontaneously spoken. All speech is transcribed at the orthographic level; moreover, the age and regional background of the speakers are provided. A pronunciation dictionary is added, containing all the words which the corpus features, with a corresponding SAMPA broad-class phonemic transcription. Validation and premastering of the CD-ROMs were performed by the Speech Processing Expertise Centre (SPEX) of Leidschendam, The Netherlands.

3. The PIXI Corpus, created during the 1980s within an Anglo-Italian environment, published by Gavioli in 1990, and also available in electronic format, also deserves mention. The PIXI Corpus (Gavioli & Mansfield 1990) consists of 450 naturally occurring conversations recorded in bookshops in England and Italy, for the purpose of cross-cultural comparison of discourse structure and pragmatics. The Italian part of the corpus consists of around 30,000 words in 142 texts, some of which are very brief. The recordings were made in different villages of Central Umbria, although the actual locations of the recordings have been anonymized. They are available in electronic form from the Oxford Text Archive, and in book form from Gavioli and Mansfield (1991), together with in-depth details of the data collection, discourse contexts, analytic approach, and bibliography of related publications. However, very little metadata is given. A written request has to be sent to OTA; see http://ota.ahds.ac.uk/ for details.
No recordings are available.

4. The CHILDES ITALIA (Bortolini & Pizzuto 1997) project, carried out by CNR of Rome, the Psychology Department of Università La Sapienza in Rome, the “Stella Maris” Institute of Care and Research in Pisa and the Infantile Neuropsychiatry Department of Università di Pisa, LABLITA of Università di Firenze, and CNR of Padova, allows the creation of longitudinal collections regarding normal and impaired acquisition of Italian; part of this was included in the international CHILDES database (see Bortolini & Pizzuto 1997). The CHILDES system provides free tools for studying child language data and conversational interactions (MacWhinney 2000). The CHILDES project provides a large database of first and second language acquisition data from over 30 languages in CHAT format.
.. The LABLITA Corpus

The LABLITA Corpus, from which the Italian C-ORAL-ROM is taken, is kept by the Linguistics Laboratory of the Italian Department in Florence, which carries out research on corpus linguistics and experimental research on the intonation of spoken Italian. The collection of the corpus began in the 1970s and is continually updated, so that it can be regarded as an “open” diachronic corpus. It includes the first spontaneous Italian speech corpus, which was collected in 1965 in Florence by Stammerjohann, who gave it to the laboratory as a gift (Stammerjohann 1970; Scarano & Signorini forthcoming; Signorini & Tucci forthcoming). The LABLITA corpus is made up of three corpora: (a) the Spontaneous Adult Language corpus, comprising approximately 152 hours, 62 of which are transcribed, for a total of about 640,000 words; (b) the Early Acquisition corpus, comprising longitudinal and transversal collections of recordings of infants from 13 to 36 months of age, which account for around 72 hours and 210,554 transcribed words. As in the C-ORAL-ROM corpus, the orthographic transcripts of the LABLITA corpus recordings are in CHAT format, and are supplemented with prosodic tagging. The audio is digitally mastered in wav files (22,050 Hz, 16 bits). The LABLITA corpus is in the process of being aligned using the utterance-based system developed in C-ORAL-ROM. All of LABLITA’s texts are accompanied by complete metadata, according to the IMDI format at http://corpus1.mpi.nl/IMDI/metadata/IMDI.imdi.

The Spontaneous Adult Speech corpus, which comprises texts mostly collected in Florence in a natural environment, was conceived to fully represent the spoken diaphasic and diastratic varieties. It is entirely composed of spontaneous texts, which are classified according to the following parameters, which roughly match those of the C-ORAL-ROM corpus design: text formality or informality, social projection of the communication context (family, private, public), number of speakers involved in each communicative event (monologue, dialogue, conversation), and free or restrained modality of dialogical turn-taking. The entire collection of LABLITA corpora is not freely accessible, but can be consulted for research activities within research programs.
Before the C-ORAL-ROM Project, a sample (approximately 7 hours and 60,000 transcribed words) was published in the volume Corpus di italiano parlato (Cresti 2000). A large sampling of LABLITA texts with text/audio/acoustic parameters alignment is scheduled to be freely accessible on the web.
. Orthographic transcription

.. General criteria

In the transcription of the texts included in the C-ORAL-ROM Italian corpus, the use of initial capital letters is reserved for:
a. Personal names (Calvino; Paola);
b. Toponyms (Parigi, Prato);
c. Odonyms (via Burchiello, Porta Romana, Via Pian de’ Giullari, Ponte alla Vittoria);
d. Television-programme names (Fantastico);
e. Film titles (Piccolo grande uomo, Il secondo tragico Fantozzi);
f. Book titles (Memoriale, Vangelo, Divina Commedia);3
g. Band names (Depeche Mode; Neganeura).

Abbreviations are written entirely in block capitals, without full stops between the letters, e.g. BTP, MPS. Capitals are used in accordance with their role in traditional grammar, and with the latitude that grammar allows: e.g., in expressions which contain toponyms and odonyms preceded by a common noun, only the proper name always requires an initial capital letter, whereas the common noun can be written with either an initial capital or a small letter; as for titles of works, the use of capitals is only compulsory for the first letter (see Serianni 1988: 55).
.. Orthographic transcription of regional words

This section complements the glossary which accompanies the transcripts of the oral texts included in the C-ORAL-ROM Italian corpus. The glossary is included on the DVD, in the Italian section of the corpus Metadata Menu. It features:
a. regional or local words, which need an explanation;
b. idiolectic words for which a spelling form was chosen in the transcripts;
c. regional variety forms, characterised by phonetic phenomena within the word or by phonosyntactic phenomena in the linear sequence.

Words belonging to category (a) have been defined by other terms which are more common and understandable in the rest of Italy; we shall not discuss these words further, as no spelling choices are necessary for them. All the other types of words recorded in the glossary are addressed in this section, because their written form required a choice of spelling. A written form has been attributed to idiolectic words, and to words which are frequent and common in usage but have not undergone orthographic normalisation in standard Italian, and thus do not appear in dictionaries. These include interjections, onomatopoeia, baby-talk words, foreign words, hapax legomena, and true idiosyncratic forms; for their treatment, see Sections 2.2.4 and 2.3.3.1.

Among the words to which a written form has been assigned, there are also very common expressions which, we felt, needed a spelling more in keeping with their pronunciation, such as:
vabbè = va bene [all right]
vabbò = va buono [ironical form, meaning ‘all right’]

These include not only the result of actual lexicalisation of multiwords, but also Tuscan variations of standard Italian forms, like:
quande = quando [when]
quante = quanto [how much]
’un = non [not and other negatives]
dumila = duemila [two thousand]
vu’ = voi [you (plural)]
e’ = [third person clitic pronoun, plural (him/them) and first person clitic pronoun, singular (I)]
to’ = tuo [yours (singular)]

For those verbal forms which differ from their paradigm in standard Italian, and which have not been brought back to it, the spelling adopted reproduces their pronunciation as accurately as possible, with a phoneme/grapheme correspondence which follows the standard Italian system:4
andesti instead of andasti [you went]
avessin instead of avessero [had they]
potette instead of poté [was able to]
rompé instead of ruppe [broke]
rompiede instead of ruppe [broke]
rompò instead of ruppe [broke]
fo instead of faccio [I’m doing]
vo instead of vado [I’m going]
facci instead of faccia [do (imperative, courtesy form)]
Tuscan forms characterised by an inflectional allomorph, which makes them homophonous with other standard verbal forms, have not been brought back to their paradigm, but faithfully transcribed according to their pronunciation; therefore, in our transcription, they appear as homographs of other standard Italian verbal forms, but in the glossary they are recorded with their dual value, as illustrated in Table 2.1.

Polysemic verbal forms of the Tuscan dialect have been rendered through a graphemic cluster (-gn cluster and -ne as clitic), as in:
dagnene = dagliene, daglielo, dargliene, darglielo [give (some) to him/her/it, give it to. . . , to give (some) to. . . , to give it to. . . ]
portagnene = portaglielo, portagliene, portarglielo, etc. [take it to him/her, take (some) to. . . ]

The above forms are strongly characteristic from a diastratic point of view: they belong to a low variety of Tuscan. Orthographic solutions which are faithful to the actual pronunciation of the words follow from a general criterion: the sociolinguistic significance of the phenomenon. The spirantisation of voiceless occlusive consonants in intervocalic position, which is common to all Tuscan speakers, is only highlighted in the transcript when it is marked, and thus significant from a diastratic point of view; other phenomena (like rhotacism, for example) are highlighted because they are significant in their own right from a sociolinguistic point of view.
Table 2.1 Homographic verbal forms

| Tuscan form | Mood and person | Corresponding Italian form | Homophonous & homographic Italian form |
|---|---|---|---|
| abbi [to have] | 3rd person singular, present subjunctive | abbia | abbi: 2nd person singular, present imperative |
| dividano [to divide] | 3rd person plural, present indicative | dividono | dividano: 3rd person plural, present subjunctive |
| intendano [to mean] | 3rd person plural, present indicative | intendono | intendano: 3rd person plural, present subjunctive |
| governavi [to govern] | 2nd person plural, past indicative | governavate | governavi: 2nd person singular, past indicative |
| possano [can] | 3rd person plural, present indicative | possono | possano: 3rd person plural, present subjunctive |
| rendano [to give back] | 3rd person plural, present indicative | rendono | rendano: 3rd person plural, present subjunctive |
| tenevi [to keep] | 2nd person plural, past indicative | tenevate | tenevi: 2nd person singular, past indicative |
| devan [must] | 3rd person plural, present indicative | devono | devano: 3rd person plural, present subjunctive |
| mettan [to put] | 3rd person plural, present indicative | mettono | mettano: 3rd person plural, present subjunctive |
| vedano, vedan [to see] | 3rd person plural, present indicative | vedono | vedan: 3rd person plural, present subjunctive |
| vengan [to come] | 3rd person plural, present indicative | vengono | vengano: 3rd person plural, present subjunctive |
Depending on the degree of evidence of the spirantisation, the spirantised consonant has been omitted, or an h has been inserted, in keeping with a traditional vernacular spelling form:
poho instead of poco [a small quantity]
passai instead of passati [gone, gone by (pl.)]
staa instead of stata [been (f.)]
sentìo instead of sentito [heard]
A rendering closer to the actual pronunciation has also been adopted for the loss of intervocalic /v/. This applies not only to well-known forms which are consolidated in literary tradition, like avea, dicea, but also to forms like guardaa = guardava [was looking at]. Below is a short list; the reader can refer to the DVD for the full set.
aere = avere [to have]
aerlo = averlo [to have it]
andaa = andava [he/she/it was going]
andao = andavo [I was going]
aveo, avei, avea, aveano, avean = avevo, avevi, aveva, avevano, avevano [I had, you had, he/she/it had, they had]
credeo = credevo [I thought]
arriò = arrivò [he/she/it arrived]
dèo, dèi, dèe, dèano, dèan = devo, devi, deve, devono, devono [I must, you must, he/she/it must, they must]
dao = davo [I was giving]
doveo, dovea, dovean = dovevo, doveva, dovevano [I had to, he/she/it had to, they had to]

A phenomenon considered significant, and therefore highlighted in the transcription, is the loss of the semi-consonant /w/ in the /kw/ cluster (corresponding to the graphematic cluster qu in standard Italian spelling). In this case, we have resorted to the grapheme cluster corresponding to the velar occlusive consonant (ch), a solution also generally adopted in traditional vernacular spelling:
chello, chelli instead of quello, quelli [that, those]
chesto, chesta instead of questo, questa [this (m., f.)]
cherce instead of querce [oak trees]

As mentioned before, cases of rhotacism (l → r) are orthographically highlighted; in order to avoid cases of homography, the tonic vowel is marked with an accent:
mórto = molto [very, a lot]
scarza = scalza (verb scalzare) [undermines]
sòrdi = soldi [money]
sòrdo = soldo [penny]
rinvortata = rinvoltata [wrapped (f.)]
vorte = volte [times]
quarcuna = qualcuna [a few (f.)]
Other speech phenomena whose effects were faithfully rendered in writing are various assimilation, epenthesis, epithesis and metathesis phenomena.

Various assimilation phenomena:
a. al + consonant > geminated consonant:
   ammeno < almeno (with possible further evolution to aimmeno) [at least]
   attro, attri < altro, altri (with possible evolution to antro, antra) [other, others]
   infizzaa < infilzava [was piercing]
b. nl > ll:
   domall’ altro < doman l’altro < domani l’altro [the day after tomorrow]
c. r + l > ll:
   prendello < prenderlo [to take it]
   Another possible result is the degemination of the double consonant and the closing of the post-tonic vowel which precedes -ll:
   prendilo = prenderlo [to take it]
   leggile = leggerle [to read them (f.)]
   perdile = perderle [to lose them (f.)]
   riprendila = riprenderla [to retrieve it (f.)]
d. -cn- > nn:
   tennicamente < tecnicamente [technically]

Other phenomena:
a. epenthesis phenomena:
   anderà < andrà [he/she/it will go]
b. epithesis phenomena:
   barre < bar [bar, pub]
   chie? < chi? [who?]
c. metathesis phenomena:
   drento < dentro [inside]
   presempio < per esempio [for example]
Some phonetic peculiarities of southern Italian varieties have also been rendered graphemically, using the graphic solutions adopted for the transcription of southern dialects, for example: Palatalisation of cluster with /p/: chiù < più < plus (lat.) [more] Loss of /l/ before consonant: vota < volta [time, as in ‘one time’]
.. Diacritic marks

Two graphic symbols have been used to make word reductions evident and to avoid cases of homography: the accent and the apex.

Accents have been used to distinguish non-standard Italian forms from their homographic or homophonous standard forms:
méssi = misi [I put], different from messi (noun, f. pl.) [crops]
mésse = mise [he put], different from messe (noun, f. pl. of messa) [mass]; past participle [put (f. pl.)]

Tonic vowels resulting from the contraction of diphthongs have been transcribed with a grave accent, for example:
bòna, bòni, bòno = buona, buoni, buono [good (f. sing., m. pl., m. sing.)]
còcere = cuocere [to cook]

The contraction of diphthongs has not been marked in words where the accent does not fall on the diphthong:
bongiorno = buongiorno

An apex (apostrophe) has been used where clipping and aphaeresis occur.
Clipping phenomena

Cases of vowel apocope after a liquid or nasal consonant, which are also described in Italian grammar books, are not marked with a diacritic but simply written as pronounced. The following examples are a result of this:
buon [good]
ben [well]
voglian [they want]
vòl [wants]
devan [they must]
mettan [they are putting]
vengan [they are coming]

The various kinds of vocalic and syllabic clipping, or loss of segments at the end of a word, have instead been signalled with an apex:
allo’ < allora [then]
aspe’ < aspetta (2nd person) [wait]
avra’ < avrai [you will have]
bra’ < bravo [good boy]
capi’ < capito [understood]
du’ < due [two]
ecce’ < eccetera [etcetera]
fa’ < fai, fare [you do; to do]
gua’ < guarda [look]
le’ < lei [she]
lu’ < lui [he]
ma’ < mai; mamma [never; mommy]
mi’ < mia, mio [my; mine]
pa’ < padre [daddy]
po’ < poi [then]
sa’ < sai [you know]
se’ < sei [you are]
so’ < sono [I am]
su’ < suo, sua [his; her]
t’ < tu [you]
telefonera’ < telefonerai [you will call]
tu’ < tuo-a-e, tuoi [your; yours]
ve’ < vedi [look]
vie’ < vieni (imperative) [come]
vò’ < vuoi, vuole [you want; he wants]
vorre’ < vorrei [I would like]
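The list above is in effect a lookup table from clipped forms to one or more full forms. A minimal sketch of how such a table could be queried is given below; the names `CLIPPED_FORMS` and `expand_clipped` are illustrative assumptions and are not part of the C-ORAL-ROM tools, and only a handful of entries from the list are included.

```python
# Hypothetical lookup built from a few entries of the list above.
# Keys use the ASCII apostrophe; the typographic apex (U+2019) is normalised first.
CLIPPED_FORMS = {
    "allo'": ["allora"],
    "du'": ["due"],
    "fa'": ["fai", "fare"],      # ambiguous: two possible full forms
    "ma'": ["mai", "mamma"],     # ambiguous: two possible full forms
    "vò'": ["vuoi", "vuole"],
}


def normalise_apex(token: str) -> str:
    """Map the typographic apex to the ASCII apostrophe."""
    return token.replace("\u2019", "'")


def expand_clipped(token: str) -> list[str]:
    """Return the candidate full forms of a clipped token, or [] if unknown."""
    return CLIPPED_FORMS.get(normalise_apex(token), [])


if __name__ == "__main__":
    print(expand_clipped("fa\u2019"))  # candidates for fa’
    print(expand_clipped("casa"))      # not a clipped form
```

Returning a list rather than a single string keeps the ambiguity of forms such as fa’ and ma’ visible, so that a later step can choose among the candidates in context.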
In vernacular texts, the third person masculine subject clitic gli is sometimes used, for which the chosen transcription is always gl’, without vowels, so as to highlight its non-syllabic character. With regard to the pre-consonantal definite article il, which in Florentine results in [i] followed by the strengthening of the consonant that comes after it (as in iccane, iggatto, etc.), an apex was used, in keeping with a traditional vernacular spelling form, thus writing i’ cane instead of il cane, etc.5

The same written form of the article il was adopted for “analytical” pronunciations of articulated prepositions (that is, prepositions resulting from the fusion of a simple preposition and an article), each transcribed as “preposition + article” in order to avoid confusing them with the articulated preposition of the corresponding plural form:
a i’ for al [at the]
da i’ for dal [from the]
su i’ for sul [on the]

The use of an apex after proper articulated prepositions must be explained; in these cases, the apex marks the loss of the vowel, caused by a process similar to clipping:
a’ for ai [to the, at the (m. pl.)]
da’ for dai [from the (pl.)]
co’ for coi [with the (pl.)]
de’ for del, dei [of the (m., sing. and pl.)]
ne’ for nei [in the (m. pl.)]
su’ for sui [on the (m. pl.)]

Finally, synthetic forms of articulated prepositions, which can at times be ambiguous, have also been marked by an apex. In practice, once transcribed, these could be indistinguishable from the corresponding simple preposition. This would never happen in pronunciation, because only articulated prepositions produce the doubling of the following consonant:
di’ for del [of the, m. sing.]
ni’ for nel [in the, m. sing.]

An unusual case is represented by the demonstratives quel/quei, which behave like articulated prepositions:
que’ for quei [those, m.]
qui’ for quel [that, m.]

Apexes have again been used for prepositions which feature consonant apocope:
pe’ < per [for]
co’ < con [with]

Apocope of the final vowel and apocope of the final syllable in infinitives have been given a single common spelling. With regard to vocalic apocope, which occurs in Tuscany in front of a consonant followed by an accented vowel, the result is a progressive assimilation of the ‘r + consonant’ group (geminated consonant: andare male > andar male > andammàle). As for syllabic apocope, cases like andà’ and mètte’ occur (and the accent does not change its position). A common spelling form was adopted for these phenomena, which appear together in the same texts with considerable frequency: the apocope is signalled, whatever its nature, with an apex at the end of the word, while the stressed vowel carries no accent mark and is written as in the infinitive without apocope: anda’, mette’.
anda’ < andare [to go]
analizza’ < analizzare [to analyse]
apri’ < aprire [to open]
ave’ < avere [to have]
capi’ < capire [to understand]
compra’ < comprare [to buy]
rifa’ < rifare [to redo]
ritorna’ < ritornare [to come back]
sape’ < sapere [to know]
sede’ < sedere [to sit down]
smette’ < smettere [to give up]
spende’ < spendere [to spend]
Aphaeresis phenomena

Vocalic or syllabic aphaeresis and the loss of initial word segments are marked with an apex at the beginning of the word:
’cidenti < accidenti [damn]
’fatti < ’nfatti [in fact]
’giorno < ’ngiorno [good morning]
’gna < bisogna [we need]
’nfatti < infatti [in fact]
’nnaggia < mannaggia [damn]
’petta < aspetta [wait]
’spetta < aspetta [wait]
’st’, ’sta, ’ste < quest’, questa, queste [this; these]
’sto, ’sti < questo, questi [this; these]

Cases in which the vocalic or consonantal aphaeresis generates forms which have come into common use as variants, and therefore appear in dictionaries, have not been signalled by an apex at the beginning of the word, e.g.:
briaca, briache = ‘ubriaca’, ‘ubriache’ [drunk, f., sing. and pl.]
ché = ‘perché’ [because]
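Since the apex marks clipping at the end of a word and aphaeresis at the beginning, its position alone gives a rough first classification of a transcribed token. The sketch below illustrates this; the function name `apex_marks` is a hypothetical choice, and the check is purely orthographic, so it cannot tell a clipped form from an ordinary standard elision such as l’, which also ends in an apostrophe.

```python
# Both the ASCII apostrophe and the typographic apex (U+2019) are accepted.
APEX_CHARS = ("'", "\u2019")


def apex_marks(token: str) -> dict[str, bool]:
    """Report whether a token starts and/or ends with an apex.

    A leading apex corresponds to aphaeresis (e.g. ’sto < questo),
    a trailing apex to clipping (e.g. anda’ < andare).
    """
    return {
        "aphaeresis": len(token) > 1 and token.startswith(APEX_CHARS),
        "clipping": len(token) > 1 and token.endswith(APEX_CHARS),
    }


if __name__ == "__main__":
    for tok in ["\u2019sto", "anda\u2019", "\u2019st\u2019", "cane"]:
        print(tok, apex_marks(tok))
```

A form such as ’st’ carries both marks, which is why the function reports the two positions independently instead of returning a single label.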
.. Interjections

The following is a list of the most frequent interjections and discourse particles; greetings, wishes and curses are also included (Table 2.2).

Table 2.2 Interjections and discourse particles

| Rank | LEMMA | Occ. |
|---|---|---|
| 1 | eh | 1944 |
| 2 | mh | 893 |
| 3 | ah | 682 |
| 4 | vabbe’ | 215 |
| 5 | oh | 160 |
| 6 | mah | 151 |
| 7 | beh | 133 |
| 8 | ciao | 122 |
| 9 | boh | 49 |
| 10 | magari | 47 |
| 11 | buonasera | 41 |
| 12 | bah | 38 |
| 13 | he | 28 |
| 14 | mamma_mia | 24 |
| 15 | ohi | 24 |
| 16 | via | 22 |
| 17 | grazie | 20 |
| 18 | uh | 15 |
| 19 | han | 14 |
| 20 | eee’ | 13 |
| 21 | mamma | 11 |
| 22 | oddio | 10 |
| 23 | accidenti | 8 |
| 24 | basta | 8 |
| 25 | madonna | 7 |
| 26 | per_carita’ | 6 |
| 27 | ueh | 6 |
| 28 | ehm | 5 |
| 29 | ieh | 5 |
| 30 | perdio | 5 |
| 31 | ahi | 4 |
| 32 | bu | 4 |
| 33 | ich | 4 |
| 34 | uhm | 4 |
| 35 | arrivederci | 3 |
| 36 | buongiorno | 3 |
| 37 | chi_se_ne_frega | 3 |
| 38 | hei | 3 |
| 39 | macche’ | 3 |
| 40 | mannaggia | 3 |
| 41 | sh | 3 |
| 42 | vaffanculo | 3 |
| 43 | aho’ | 2 |
| 44 | buonanotte | 2 |
| 45 | che_palle | 2 |
| 46 | cristo | 2 |
| 47 | dagli | 2 |
| 48 | ehi | 2 |
| 49 | male | 2 |
| 50 | meno_male | 2 |
| 51 | ooh | 2 |
| 52 | ahm | 1 |
| 53 | amen | 1 |
| 54 | caspita | 1 |
| 55 | grazie_al_cielo | 1 |
| 56 | hi | 1 |
| 57 | hum | 1 |
| 58 | ia | 1 |
| 59 | in_bocca_al_lupo | 1 |
| 60 | ma_va’ | 1 |
| 61 | maramao | 1 |
| 62 | mhm | 1 |
| 63 | moh | 1 |
| 64 | ohe’ | 1 |
| 65 | ohiohi | 1 |
| 66 | perdinci | 1 |
| 67 | prego | 1 |
| 68 | toh | 1 |
| 69 | uffa | 1 |
| 70 | uhi | 1 |
| 71 | umh | 1 |
| 72 | vabbuo’ | 1 |
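A ranked frequency list like Table 2.2 can be derived from a sequence of lemmas by counting and sorting. The sketch below is a generic illustration, not the procedure used for C-ORAL-ROM; the function name `rank_lemmas` and the sample input are assumptions. Ties are broken alphabetically and ranks are assigned sequentially, which is the ordering that Table 2.2 appears to follow (for instance, mamma_mia precedes ohi, both with 24 occurrences).

```python
from collections import Counter


def rank_lemmas(lemmas: list[str]) -> list[tuple[int, str, int]]:
    """Return (rank, lemma, occurrences), most frequent first.

    Ties are ordered alphabetically and still receive distinct, sequential ranks.
    """
    counts = Counter(lemmas)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, lemma, occ) for rank, (lemma, occ) in enumerate(ordered, start=1)]


if __name__ == "__main__":
    sample = ["eh", "mh", "eh", "ah", "mh", "eh", "boh"]  # invented sample data
    for rank, lemma, occ in rank_lemmas(sample):
        print(rank, lemma, occ)
```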
. Morpho-syntactic tagging

.. Tools and strategy adopted for automatic PoS tagging and lemmatisation

The main antecedent in automatic tagging of spoken Italian is the LIP corpus (De Mauro et al. 1993). However, neither the tagger nor the tagged resource is available.6 The automatic procedure for lemmatisation and morpho-syntactic annotation of the Italian C-ORAL-ROM corpus is based on the PiTagger tool, created and developed by Eugenio Picchi at the ILC-CNR of Pisa. PiTagger was chosen for three main reasons: (a) the availability of the tool on the market; (b) the EAGLES-like tagset; and (c) PoS annotation and lemmatisation joined in a coherent system.

PiSystem is an integrated procedure for textual and lexical analysis, which consists of the following main components:
1. DBT text encoding and analysis modules;
2. a morpho-syntactic analyser (PiMorpho);
3. a Parts of Speech (PoS) tagger and lemmatiser (PiTagger).

The DBT encoding provides a first parsing of the text, known as tokenisation, and represents the pre-analysis level of the morphological analyser. The encoded text is then passed to PiMorpho, which assigns all the possible alternative Morpho-Syntactic Descriptions (MSD) to each lexical item. For the PoS disambiguation, PiTagger uses two other input resources: an electronic dictionary and a training corpus. In detail, these resources consist of:
a. the DMI, a morphological dictionary of the Italian language, developed within the ILC at CNR, Pisa; it collects 106,090 lemmas encoded with PoS specifications and inflectional tags (Zampolli & Ferrari 1979; Calzolari, Ceccotti, & Roventini 1983);
b. a Training Corpus of 50,000 words, manually tagged;
c. a statistical database extracted from the Training Corpus (BDR).

These resources are built on a coherent tagset for PoS and morpho-syntactic annotation. The disambiguation phase relies on statistical measurements (on tri-grams) extracted from the Training Corpus and stored in the BDR. The main programme of this procedure estimates the maximum likelihood pattern among the possible alternatives given by the morphological component, with a transition-probability method (Picchi 1994). In this environment, the level and the precision of the analysis depend on the information archived in the BDR. Statistics must be defined on a specific level of linguistic information, i.e. lexical patterns or PoS sequences. The BDR uses a hybrid set of specifications called Disambiguation Tags, which defines the threshold of the analysis relevant for statistical measurements, i.e.:
a. PoS codes;
b. the lemmas ESSERE and AVERE;
c. the MSD tags for non-finite moods of verbs.

This information is the same as that used by PiTagger in the disambiguation phase.
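The general idea of choosing, among the alternatives proposed by a morphological analyser, the tag sequence that is most likely under tri-gram statistics from a training corpus can be sketched as follows. This is not PiTagger's implementation, whose internals are not described here: the function names, the add-one smoothing, the exhaustive search over candidate sequences and the toy tag labels (DET, N, V, PRO) are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import product

START = "<s>"  # padding symbol for the beginning of a sentence


def train_trigrams(tagged_sentences):
    """Count tag tri-grams and their two-tag contexts in a training corpus."""
    tri, ctx, tagset = Counter(), Counter(), set()
    for tags in tagged_sentences:
        tagset.update(tags)
        seq = [START, START] + list(tags)
        for a, b, c in zip(seq, seq[1:], seq[2:]):
            tri[(a, b, c)] += 1
            ctx[(a, b)] += 1
    return tri, ctx, len(tagset)


def log_score(tags, tri, ctx, n_tags):
    """Log-probability of a tag sequence with add-one smoothed tri-grams."""
    seq = [START, START] + list(tags)
    return sum(
        math.log((tri[(a, b, c)] + 1) / (ctx[(a, b)] + n_tags))
        for a, b, c in zip(seq, seq[1:], seq[2:])
    )


def disambiguate(candidates, tri, ctx, n_tags):
    """Pick the most likely sequence among the per-token tag alternatives.

    Exhaustive search: adequate for short sentences, not for long ones.
    """
    return max(product(*candidates), key=lambda s: log_score(s, tri, ctx, n_tags))


if __name__ == "__main__":
    corpus = [["DET", "N", "V"], ["DET", "N", "V"], ["PRO", "V", "DET", "N"]]
    tri, ctx, n_tags = train_trigrams(corpus)
    # Each token comes with the alternatives a morphological analyser might offer.
    print(disambiguate([["DET", "PRO"], ["N", "V"], ["V"]], tri, ctx, n_tags))
```

For realistic sentence lengths the exhaustive search would normally be replaced by dynamic programming (a Viterbi-style search), which finds the same maximum without enumerating every combination.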
.. Tagset

The PoS tagset used is, for the most part, in agreement with the EAGLES recommendations for the morpho-syntactic annotation of the Italian language (Monachini 1996). These guidelines build on previous experience with multilingual corpus tagging (NERC project)7 and lexical encoding (MULTILEX and GENELEX models),8 and were adopted within the LRE-MULTEXT and MLAP-PAROLE projects,9 within the framework of the European standardisation for multilingual resources.10 The EAGLES tagset for Italian is quite a traditional one, given that it follows the PoS given by descriptive grammars. The tagset aims to be adequate both for morpho-syntactic tagging and for the building of lexical resources (Calzolari & Monachini 1996; Leech & Wilson 1993), in order to create a coherent basis on which to operate within various computational linguistics tasks.11 The C-ORAL-ROM tagset, entirely oriented towards corpus tagging, features some adjustments in relation to specific category spaces:
1. with regard to verbs, in order to distinguish main verbal instances from non-main ones (auxiliaries and copulas);
2. with regard to pronouns and determiners, in order to achieve a semantic-oriented classification of such lexical objects (independent of their functional value).

Table 2.3 presents the tagset used for the PoS tagging of the Italian C-ORAL-ROM, each tag with a short definition and an example. The rows in grey refer to the tagset choices which do not follow the EAGLES guidelines; these are detailed in the following paragraph. Tables 2.9 and 2.10 show the tagset extension used to deal with non-standard forms and other special items in the transcript.

Table 2.3 The Italian C-ORAL-ROM PoS tagset
... Choices in the PoS tagset
Morphological annotation of verbs. A detailed morphological analysis is provided for verbs, to enable the extraction of the data needed for grammatical studies. For example, the morpho-syntactic description of verb forms is necessary to determine whether an utterance is verbal or nominal. To this end, the feature of finiteness is crucial for distinguishing nominal uses from properly verbal uses. The former are found in utterances consisting of a single verb in a non-finite form, which has neither a subject nor an argument structure; the resulting utterance has the characteristics of a nominal one in Romance languages: e.g. volare! cantare! ‘to fly!’ ‘to sing!’

Tables 2.4 and 2.5 below show the level of annotation for verbs, divided between finite and non-finite forms. In the first row, the features considered by the tagset are given; optional features are bracketed. If a given feature can assume different values, these are listed in the rows below, each followed by its tag. The features are presented in the positional order that their tags assume in the tagged text.

Table 2.4 Verbs: Finite forms

Verb  (non-main)  Number       Person     Mood            Tense         (encl.)
V     W           singular  s  first   1  indicative   i  present    p  _E
                  plural    p  second  2  subjunctive  c  past       r
                               third   3  conditional  d  imperfect  i
                                          imperative   m  future     f

Examples: amo\AMARE\Vs1ip amalo\AMARE\Vs2mp_E

Table 2.5 Verbs: Non-finite forms

Verb  (non-main)  Gender*       Number*      Mood           Tense       (encl.)
V     W           masculine  M  singular  s  infinitive  f  present  p  _E
                  feminine   F  plural    p  participle  p  past     r
                  common     N               gerund      g

* only for participles
Examples: amarlo\AMARE\Vfp_E amata\AMARE\Vfspr
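The positional scheme of Tables 2.4 and 2.5 can be illustrated with a small decoder. This is only a sketch under stated assumptions, not the procedure used for the corpus: the function name `decode_verb_tag` and the regular expressions are hypothetical, gender is accepted in either case (the table gives M, F, N while the examples show lowercase letters), and tense is treated as optional for non-finite forms because tags such as Vf occur in later examples.

```python
import re

# Feature values copied from Tables 2.4 and 2.5.
NUMBER = {"s": "singular", "p": "plural"}
PERSON = {"1": "first", "2": "second", "3": "third"}
FINITE_MOOD = {"i": "indicative", "c": "subjunctive", "d": "conditional", "m": "imperative"}
FINITE_TENSE = {"p": "present", "r": "past", "i": "imperfect", "f": "future"}
GENDER = {"m": "masculine", "f": "feminine", "n": "common"}
NONFINITE_MOOD = {"f": "infinitive", "p": "participle", "g": "gerund"}
NONFINITE_TENSE = {"p": "present", "r": "past"}

# Assumed patterns: V, optional W, then the features in positional order.
FINITE = re.compile(r"^V(W)?([sp])([123])([icdm])([prif])(_E)?$")
# Gender and number are taken together (participles only); tense may be absent.
NONFINITE = re.compile(r"^V(W)?(?:([mfnMFN])([sp]))?([fpg])([pr])?(_E)?$")


def decode_verb_tag(tag):
    """Return a dict of features for a verb tag, or None if it is not recognised."""
    m = FINITE.match(tag)
    if m:
        w, num, pers, mood, tense, encl = m.groups()
        return {
            "finite": True,
            "main": w is None,
            "number": NUMBER[num],
            "person": PERSON[pers],
            "mood": FINITE_MOOD[mood],
            "tense": FINITE_TENSE[tense],
            "enclitic": encl is not None,
        }
    m = NONFINITE.match(tag)
    if m:
        w, gen, num, mood, tense, encl = m.groups()
        return {
            "finite": False,
            "main": w is None,
            "gender": GENDER[gen.lower()] if gen else None,
            "number": NUMBER[num] if num else None,
            "mood": NONFINITE_MOOD[mood],
            "tense": NONFINITE_TENSE[tense] if tense else None,
            "enclitic": encl is not None,
        }
    return None


for example in ("Vs1ip", "Vs2mp_E", "Vfp_E", "Vfspr"):
    print(example, decode_verb_tag(example))
```

Because the letters f and p are reused across features, the decoder relies on position and tag length to separate, for instance, the infinitive Vfp_E from the feminine participle Vfspr.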
Main and non-main verbs. The morpho-syntactic description of verbal forms is a crucial problem for automatic PoS tagging. At the current state of the art, it is not easy to achieve a good level of automatic recognition for both auxiliaries and copulas.12 Furthermore, the PiTagger procedure for Italian corpora does not distinguish auxiliaries [ESSERE (‘to be’) and AVERE (‘to have’)] and copulas [ESSERE] from the predicative uses of those verbs.13 This leads to a lack of information in the tagged resource which is significant for linguistic studies on the C-ORAL-ROM corpora (verbal vs. non-verbal utterances; percentage of nouns vs. verbs), and in general for the linguistic exploitation of the resource. The Italian C-ORAL-ROM works around this problem by encoding both auxiliaries [ESSERE and AVERE] and copulas [ESSERE] with the [non-main] feature. An automatic post-edit procedure distinguishes between the different uses of these verbs: MAIN ones (fully semantic, predicative verbs) and NON-MAIN ones (auxiliaries and copulas). The MAIN versus NON-MAIN distinction represents the first level of annotation for the verbal function (predicative versus “grammatical”) and allows the identification of predicative verbs (marked MAIN), with positive consequences for corpus-based linguistic studies. The drawback of this treatment is that the auxiliary and copula instances of ESSERE fall into the same category, with no distinction between them. The tag dedicated to NON-MAIN verbs is a capital W following V, the main tag. Table 2.6 shows the strategy adopted to tag the verbs ESSERE and AVERE in the most common contexts and uses. The MAIN/NON-MAIN value of the occurrences of ESSERE and AVERE was annotated automatically by a post-edit procedure which applies a set of contextual rules to the already tagged text.
Pronouns and pronominal adjectives. With regard to the categorical space outlined by the traditional categories of “pronouns” and “pronominal adjectives”,14 the Italian C-ORAL-ROM tagset diverges from the EAGLES standards. Specifically, no distinction is made between the pronominal and the adjectival uses of possessives, indefinites and determinatives: for each of these classes, both uses are merged into a single category. This choice follows the strategy used in the tagset for the Portuguese C-ORAL-ROM. Table 2.7 shows the relations between the Italian categories in EAGLES and C-ORAL-ROM. As Table 2.7 shows, while it is possible to map the EAGLES categorisation onto the C-ORAL-ROM one, the reverse mapping is not possible. The EAGLES tagset is structured as a sub-categorisation system within Pronouns and Determiners (macro-categories which represent the general PoS tags); in the Italian C-ORAL-ROM tagset, however, following extensive criteria, each category is treated as a PoS in its own right, merging the pronominal and the adjectival uses of the same lemmas. The EAGLES choices, which reflect a more fine-grained level of linguistic description, are motivated by the aim of reaching a level of analysis that would be
Table 2.6 Essere and avere tagging

Value: predicate    Classification: MAIN    Tag: V
Examples:
  era\ESSERE\V il due settembre [it was September the second]
  mio nonno aveva\AVERE\V solo la bicicletta [my grandfather only had a bicycle]
  c’ è\ESSERE\V un cane in mezzo alla strada [there’s a dog in the middle of the road]
  quello che hai visto è\ESSERE\V il padre di Luca [the man you saw is Luca’s father]

Value: copula    Classification: NON-MAIN    Tag: VW
Examples:
  Marco è\ESSERE\VW un ingegnere (NP) [Marco is an engineer]
  sono\ESSERE\VW troppo stanco (A) per uscire di nuovo [I’m too tired to go out again]
  il vaso è\ESSERE\VW rotto (PP) [the vase is broken]

Value: auxiliary    Classification: NON-MAIN    Tag: VW
Examples:
  mio figlio se ne è\ESSERE\VW andato di casa a diciotto anni [my son went away from home at eighteen]
  il ladro è\ESSERE\VW stato\ESSERE\VW catturato [the thief has been caught]
  il gatto ha\AVERE\VW mangiato tutto [the cat has eaten everything]
  ho\AVERE\VW corso un’ora di seguito [I have run for a whole hour]
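To show how the MAIN/NON-MAIN encoding can be exploited once the text is tagged, the following minimal sketch reads tokens in the form\LEMMA\TAG notation and counts predicative versus non-main verbs. The helper names `parse_tagged` and `verb_class` are hypothetical; the sketch assumes that only verb tags begin with V and that lemmas contain no spaces (multiword lemmas would need a different tokeniser). It does not reproduce the contextual post-edit rules, which are not detailed here.

```python
from collections import Counter


def parse_tagged(text):
    """Yield (form, lemma, tag) for each token written as form\\LEMMA\\TAG.

    Untagged words, as in the abbreviated examples of Table 2.6, are skipped.
    """
    for token in text.split():
        parts = token.split("\\")
        if len(parts) == 3:
            yield tuple(parts)


def verb_class(tag):
    """Return 'NON-MAIN' for VW..., 'MAIN' for any other V..., None otherwise."""
    if not tag.startswith("V"):
        return None
    return "NON-MAIN" if tag.startswith("VW") else "MAIN"


examples = [
    r"mio nonno aveva\AVERE\V solo la bicicletta",
    r"Marco è\ESSERE\VW un ingegnere",
    r"il ladro è\ESSERE\VW stato\ESSERE\VW catturato",
]

counts = Counter()
for sentence in examples:
    for form, lemma, tag in parse_tagged(sentence):
        counts[(lemma, verb_class(tag))] += 1

print(counts)
```

Counting only the MAIN instances gives the number of predicative verbs, which is the information needed for the verbal vs. non-verbal utterance statistics mentioned above.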
Table 2.7 Categories in EAGLES and C-ORAL-ROM

EAGLES Pronouns subcat.   EAGLES Determiners subcat.   C-ORAL-ROM merged categories
Personals                                              Personals
Reflexives
Relatives                 Relatives                    Relatives
Interrogatives            Interrogatives
Exclamatives              Exclamatives
Demonstratives            Demonstratives               Demonstratives
Indefinites               Indefinites                  Indefinites
Possessives               Possessives                  Possessives
consistent with the lexicon encoding standards.15 However, such a choice encounters problems from several points of view:
a. The traditional label of “pronominal adjectives”, widely used in the description of Romance languages, cannot be entirely mapped onto the notion of determiners, mostly used within the English and American traditions.16
b. From a theoretical point of view, Italian articles are determiners (for example, they are in complementary distribution with demonstratives). As Romance tagsets normally keep articles separate from the other PoS (in accordance with the TEI proposal), the determiner word class would be inadequate under these choices.17
c. Neither the syntactic nor the semantic values of the elements which would be tagged as determiners are uniform; moreover, the pronominal class would be a hybrid one, including both determiners in pronominal function and strictly pronominal elements.

The Italian system can be summarised as follows:
a. possessives in Italian are mostly adjectival elements (Renzi et al. 1988; Serianni 1988) but they can never be determiners (they occur with articles or other determiners, both as pronouns, as in (1), and as adjectives, as in (2)):
(1) ogni cappella c’ aveva la\IL\R sua\SUO\POS [every chapel had its own]
(2) me lo vende un\UN\R mio\MIO\POS amico [a friend of mine is selling it to me]
b. indefinites are quantifiers:
(3) tremilacinquecento caffè / non son tanti\TANTO\IND [3500 coffees aren’t many]
(4) si potrebbero risolvere / tanti\TANTO\IND problemi [many problems could be solved]
c. determinatives are true determiners (in complementary distribution with articles), mostly with deictic value:
(5) un ferro / uguale a quello\QUELLO\DET + [an iron the same as that one]
(6) era allucinante / quell’\QUELLO\DET uomo // [he was terrifying, that man]
d. personals are pronouns (they cannot appear with determiners), whose referents are unequivocally determined through deixis on the basis of the participants in the dialogue and their common knowledge;
e. relatives are mostly pronouns (also used in adjectival functions) which have an anaphoric value (with respect to an NP in the speech context) or act as interrogative markers (within a wh- question).

The definition underlying the above classification is based on lexical-semantic considerations. The criteria adopted avoid merging the above semantic categories into heterogeneous macro-classes which do not represent a valid generalisation with respect to their distributional characteristics. Moreover, this reduction in the tagset allows a lower error rate in automatic PoS tagging. For these reasons this classification has been preferred for the Italian tagset to the distributional one proposed in EAGLES.
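The claim that the EAGLES categorisation can be mapped onto the C-ORAL-ROM one, but not the reverse, can be made concrete with a many-to-one lookup. The sketch below is illustrative only: it covers just the three classes for which the text states that pronominal and adjectival uses are merged, the dictionary name and key layout are hypothetical, and the target tags POS, IND and DET are taken from examples (1) to (6).

```python
# (EAGLES macro-category, sub-category) -> C-ORAL-ROM tag.
# Assumed layout for illustration; tags as they appear in examples (1)-(6).
EAGLES_TO_CORALROM = {
    ("Pronoun", "Possessive"): "POS",
    ("Determiner", "Possessive"): "POS",
    ("Pronoun", "Indefinite"): "IND",
    ("Determiner", "Indefinite"): "IND",
    ("Pronoun", "Demonstrative"): "DET",
    ("Determiner", "Demonstrative"): "DET",
}


def invert(mapping):
    """Group the source categories that share the same target tag."""
    inverse = {}
    for source, target in mapping.items():
        inverse.setdefault(target, []).append(source)
    return inverse


inverse = invert(EAGLES_TO_CORALROM)
for tag, sources in inverse.items():
    # Each C-ORAL-ROM tag has two possible EAGLES sources, so the reverse
    # mapping cannot be decided from the tag alone.
    print(tag, "<-", sources)
```

Since every merged tag has more than one EAGLES source, recovering the EAGLES analysis would require re-examining each token in context.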
.. Extended tagset for spoken language
The spoken language transcripts of the C-ORAL-ROM corpora contain elements of different types.18 More specifically, besides the linguistic elements which belong to the dictionary, there is also a wide variety of non-standard linguistic forms in the corpora. The following cases have been distinguished:
a. foreign words (they);
b. new formations (bruttozzi, torniante);
c. onomatopoeia (fffs, zun);
d. language acquisition forms (aua, cutta).

Moreover, a wide series of non-linguistic phenomena may also be involved in the speech flow. Such phenomena are identified in the transcripts following the C-ORAL-ROM format and are encoded in the tagged resource as special elements:
a. word fragments (&pa, &costrui);
b. phonetic support elements and pause fillings (&he, &mh);
c. laughs and coughs.

Table 2.8 shows the underlying structure of the tagsets, with respect to the classification of the elements occurring in the transcripts. In accordance with this general scheme, the tagsets for non-standard linguistic elements and for non-linguistic elements, and the specifications determining these choices, are described in the following paragraphs.
Table 2.8 General structure of the Italian C-ORAL-ROM tagset

Root classification                  Secondary classification                      Elements classified
linguistic elements                  standard (PoS tagset), compositional          all parts of speech
                                     standard (PoS tagset), non-compositional      interjection (according to tradition, within PoS)
                                     non-standard (NS tagset), compositional       foreign and new forms
                                     non-standard (NS tagset), non-compositional   onomatopoeia, language acquisition forms
non-linguistic elements (NL tagset)  paralinguistic                                word fragments, phonetic supports, pause fillings
                                     extralinguistic                               coughs and laughs
Table 2.9 Non-standard (NS) tagset

Non-standard element    Tag       Example
Compositional
  Foreign forms         (PoS+)K   they\PERK
  New formations        (PoS+)Z   torniante\SZ
Non-compositional
  Acquisition           ACQ       cutta\ACQ
  Onomatopoeic          ONO       zun\ONO
... Non-standard words tagset
Although foreign words, occasional neologisms, new formations, onomatopoeia and language acquisition forms, as listed in Table 2.9, are not widely present in the corpus, their treatment is important for the quality of the tagging results. The main feature which distinguishes these forms is their syntactic value within the linguistic structures: while foreign and new forms are compositional elements (they obey the syntactic criterion), onomatopoeia and language acquisition forms are non-compositional. For example, the foreign word in example (7) and the new formation in example (8) produce complex Noun Phrases and preserve the agreement features. Conversely, the onomatopoeic element in example (9) is not compositional; i.e. it lacks both syntactic and argumental relationships with the other linguistic elements of the utterance:
(7) *ROS: le &mac [/] raccomandazioni del gruppo per la prevenzione / gruppo spread\AK / [the &mac [/] recommendations of the prevention group / spread group]
(8) *ANG: [. . . ] con quelli babbussi\SZ brutti che arrivavono . . . [with those ugly bad-guys which were coming]
(9) *MON: i’ tronco / dentro / quando t’ arrivi a segarlo / fffs\ONO / s’ allarga // [the trunk / inside / when you start to saw / fffs / it widens]
This generalisation on the non-compositional value of certain lexical elements has the consequence of establishing a macro-class encompassing onomatopoeia and acquisition forms (as non-standard elements) and interjections (standard elements).19 As a rule, non-standard forms are not considered tokens for lemmatisation. Since their occasional use does not conform to the definition of lemma, they are reported in the frequency lists as simple forms without lemma specification.
... Non-linguistic elements
Non-linguistic elements present in the transcript (see the C-ORAL-ROM format requirements in the introduction) and all words (or word chains) that the transcribers have not understood are tagged with special codes,20 illustrated in Table 2.10.

Table 2.10 Non-linguistic (NL) tagset

Non-linguistic element      Tag   Example
Paralinguistic              PLG   &he\PLG
Extralinguistic             XLG   hhh\XLG
Non-understandable words    X     xxx\X
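The general scheme of Table 2.8, together with the tags of Tables 2.9 and 2.10, can be summarised as a small classifier over tags. This is a sketch under explicit assumptions: the function name `classify_tag` is hypothetical, the interjection tag is taken to be I, and a tag is read as a foreign form or new formation whenever it ends in K or Z, which presupposes that no standard PoS tag ends in those letters.

```python
NON_LINGUISTIC = {
    "PLG": "paralinguistic",
    "XLG": "extralinguistic",
    "X": "non-understandable words",
}
NON_COMPOSITIONAL_NS = {"ACQ": "language acquisition form", "ONO": "onomatopoeia"}
COMPOSITIONAL_NS_SUFFIX = {"K": "foreign form", "Z": "new formation"}


def classify_tag(tag):
    """Return the root class, secondary class and detail for a tag (Table 2.8)."""
    if tag in NON_LINGUISTIC:
        return {"root": "non-linguistic", "secondary": NON_LINGUISTIC[tag], "pos": None}
    if tag in NON_COMPOSITIONAL_NS:
        return {"root": "non-standard", "secondary": "non-compositional",
                "detail": NON_COMPOSITIONAL_NS[tag], "pos": None}
    if tag and tag[-1] in COMPOSITIONAL_NS_SUFFIX:
        # (PoS+)K / (PoS+)Z: the PoS prefix is optional, so it may be empty.
        return {"root": "non-standard", "secondary": "compositional",
                "detail": COMPOSITIONAL_NS_SUFFIX[tag[-1]], "pos": tag[:-1] or None}
    if tag == "I":
        # Interjections: standard but non-compositional (assumed tag I).
        return {"root": "standard", "secondary": "non-compositional", "pos": tag}
    return {"root": "standard", "secondary": "compositional", "pos": tag}


for example in ("PERK", "SZ", "ACQ", "ONO", "PLG", "XLG", "X", "I", "S"):
    print(example, classify_tag(example))
```

A frequency-list builder could use the `root` value to decide whether a token is reported with a lemma (standard forms) or as a simple form without lemma specification (non-standard forms).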
.. Other choices
... Regional and dialectal forms
Spoken language, Italian in particular, is characterised by a wide presence of regional and dialectal forms.21 This problem is particularly relevant in the verb class, because of the complexity of Italian verbal inflection and the weight of dialects within the inflectional paradigms. The regional inflected forms are inserted in a pre-dictionary, already analysed and tagged with the full set of morpho-syntactic features and standard lemma information:
’apito\CAPIRE\Vmspr [understood]
’cchiappavono\ACCHIAPPARE\Vp3ii [they were catching]
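The role of the pre-dictionary can be pictured as a lookup that takes precedence over the ordinary analysis. The sketch below is an assumption about how such a lookup might be organised, not a description of PiTagger: `PRE_DICTIONARY`, `tag_form` and the `fallback` callable are hypothetical names, and only the two forms quoted above are listed.

```python
# Regional forms with their standard lemma and full tag, as quoted in the text.
PRE_DICTIONARY = {
    "’apito": ("CAPIRE", "Vmspr"),
    "’cchiappavono": ("ACCHIAPPARE", "Vp3ii"),
}


def tag_form(form, fallback):
    """Return form\\LEMMA\\TAG, preferring the pre-dictionary entry if present.

    `fallback` stands in for the regular analyser and must return (lemma, tag).
    """
    entry = PRE_DICTIONARY.get(form)
    lemma, tag = entry if entry is not None else fallback(form)
    return f"{form}\\{lemma}\\{tag}"


def toy_analyser(form):
    # Placeholder for illustration only: upper-cases the form and calls it a noun.
    return form.upper(), "S"


print(tag_form("’apito", toy_analyser))
print(tag_form("casa", toy_analyser))
```

Because every regional form in the pre-dictionary already carries a lemma and a tag, no token is left without an analysis, which is consistent with the full token coverage reported in the evaluation.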
... Multiwords
Definition. A multiword is a complex lexical unit, made up of a group of words (two or more) which have a single linguistic value. The meaning of a multiword expression does not strictly depend on the composition of the meanings of its individual words (even if in many cases it is possible to find a head constituent); moreover, the word class of a multiword can be completely independent of the PoS values of its components (e.g. un sacco di is a sequence of an article, a noun and a preposition, but it assumes the value of an indefinite quantifier). In general, multiwords assume a holistic value, creating an independent lexical entry with a specific semantic value and syntactic function. An explicit list of criteria is needed to distinguish the objects we consider as real multiwords from other expressions whose meaning is built from syntactic and semantic relationships. The criteria used to test whether a compound lexical expression is a multiword may follow different perspectives (see Voghera 1994); for example:
1. insertions within the word-chain are not possible (morpho-syntactic cohesion);
2. it is not possible to interchange the elements which compose the multiword expression (morpho-syntactic fixity);
3. in the speech flow, it is impossible to insert a prosodic break between two elements of a multiword (prosodic cohesion).
These criteria are used to identify different degrees of cohesion in compound lexical expressions, rather than to be applied at the same time to all multiwords. The presence of at least one of the properties described in the list is necessary to claim that a generic “compound expression” is a real multiword.

Verbal multiwords were excluded from the list, because of the difficulty of drawing the borderline between verbal composition (syntactic and thematic structures) and the different types of real verbal multiwords (phrasal verbs, analytic forms, locutions). The criteria used to identify other kinds of multiwords do not predict the behaviour of verbal multiwords, whose status frequently remains ambiguous. For example, the following expression can be used in Italian as a phrasal verb,22 as in example (10):
(10) sono molti anni che [corre dietro] a quel suo sogno (verbal multiword); [he’s been [running after] that dream of his for many years]
but in other contexts, such as (11), the same word pattern does not create a phrasal verb:
(11) per tutta la gara [ha corso] dietro al vincitore (verb + preposition);23 [he [has run] behind the winner for the whole time of the race]

Multiwords in Italian can belong to several word classes, not only to the verbal one. The following list shows some examples of compound lexemes categorised as adjective, adverb, conjunction, preposition and interjection:24
di bassa lega\A [lowly]
a viso aperto\B [openly]
in modo da\C [so as to]
a causa di\E [because of]
meno male\I [it’s just as well]

As for nouns, in the Italian C-ORAL-ROM a complex nominal lexeme is considered a multiword when:
1. the replacement of the compound expression with its head is not possible; e.g. acqua minerale ‘mineral water’ is replaceable with acqua ‘water’, so it is not a multiword; aria condizionata ‘air conditioning’ is not replaceable with aria ‘air’, so it is a multiword;
2. even if the replacement with the head is possible, the compound expression is a non-grammatical pattern (two nouns in linear sequence); e.g. linguaggio macchina ‘machine language’ is replaceable with linguaggio ‘language’ but the word pattern is non-grammatical, therefore it is a multiword.
A multiword may receive alternative categorisations, just as a simple word may: e.g. pagamento in_contanti\IN CONTANTI\A ‘cash payment’; pagare in_contanti\IN CONTANTI\B ‘to pay cash’.
Ambiguous clustering of multiwords. The linear sequence of words which compose a multiword expression can be read, in some cases, as a syntactic cluster, thus creating ambiguity in the tokenisation. As shown in examples (12) and (13), an expression like un sacco di cannot be tagged automatically as a single lexical item: it is a multiword in (12), but in (13) each of its elements has its own syntactic value within the constituent:
(12) per fare il purè ci vogliono UN_SACCO_DI\IND patate [it takes A LOT OF potatoes to make purée]
(13) UN\R SACCO\N DI\E patate di solito pesa due chili [A SACK OF potatoes usually weighs two kilos]
At the current state of the art, it is still impossible to obtain an automatic lemmatisation for these objects, which need a semantic interpretation in context to be disambiguated.
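The ambiguity described in (12) and (13) can be reproduced with a purely string-based matcher. The following sketch is hypothetical (the lexicon `MULTIWORDS` and the function `multiword_candidates` are illustrative names, and the lexicon holds only a few of the expressions cited in this section); it shows that a lexicon lookup returns the same candidate span in both sentences, so it can only propose candidates for later disambiguation.

```python
# Toy multiword lexicon: word sequence -> PoS tag of the whole expression.
MULTIWORDS = {
    ("un", "sacco", "di"): "IND",
    ("a", "causa", "di"): "E",
    ("meno", "male"): "I",
}


def multiword_candidates(tokens):
    """Return (start, end, tag) for each span matching a lexicon entry.

    `end` is exclusive. Matching is case-insensitive and purely linear,
    so the result is a list of candidates, not a disambiguated analysis.
    """
    lowered = [token.lower() for token in tokens]
    hits = []
    for start in range(len(lowered)):
        for entry, tag in MULTIWORDS.items():
            end = start + len(entry)
            if tuple(lowered[start:end]) == entry:
                hits.append((start, end, tag))
    return hits


quantifier = "per fare il purè ci vogliono un sacco di patate".split()
container = "un sacco di patate di solito pesa due chili".split()

print(multiword_candidates(quantifier))  # candidate span for the reading in (12)
print(multiword_candidates(container))   # same pattern, but a syntactic cluster in (13)
```

Both calls report a candidate, although only the first sentence contains the multiword; deciding between the two readings requires the semantic interpretation in context mentioned above.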
... Names
Proper names are identified in the transcriptions as the only words starting with a capital letter, which avoids any problem in identifying such items. In the tagset, these elements are characterised by the SP code, both for single names and for multiword proper nouns. Multiword names are joined with underscores in the lemmatised output. The expressions considered as proper nouns (with examples for each type) are listed below:
a. Anthroponyms (names, surnames, nicknames): Massimo\MASSIMO\SP, Fausto_Bertinotti\FAUSTO BERTINOTTI\SP
b. Toponyms: Africa\AFRICA\SP, San_Gottardo\SAN GOTTARDO\SP
c. Works of art, books, movies, TV, music bands, games: Alice_nel_Paese_delle_Meraviglie\ALICE NEL PAESE DELLE MERAVIGLIE\SP, Pearl_Jam\PEARL JAM\SP
d. Names of institutions, associations and organisations: UNESCO\UNESCO\SP, Federcalcio\FEDERCALCIO\SP
e. Names of companies, brands, products: Sony\SONY\SP, Calvè\CALVE’\SP
f. Acronyms: DC\DC\SP, IVA\IVA\SP
g. Religious events: Natale\NATALE\SP
.. Evaluation
The C-ORAL-ROM Italian resource comprises 306,638 tagged tokens.25 Since the non-standard and regional forms were inserted in a special pre-dictionary, the PiTagger system reached 100% recall over the tokens:
total number of tokens: 306,638
tagged tokens: 306,638
total recall = 100%
The evaluation of the precision of the automatic PoS-tagging procedure is based on a random sample of 1/100 of the tokens of the whole C-ORAL-ROM Italian resource, evaluated in their utterance contexts. Each token is extracted from a different utterance, also randomly selected. The size of the sample has been considered sufficient from a statistical point of view, as it ensures a 95% confidence interval lower than 1% (see the evaluation of the French PoS tagging in this volume).26 The manual revision of the tagged sample evaluates the automatic procedure with respect to different degrees of accuracy: errors in sub-categorisation, errors in the morpho-syntactic description of verbs, errors in the main tag category and in lemma assignment. The precision at these levels of annotation is shown in Table 2.11, where it can be seen that the more significant errors, that is, PoS tag and lemma errors, involve only around 10% of tokens.27 The results of the evaluation of PoS tag errors (the first level of errors in Table 2.11) are shown as a confusion matrix in Table 2.12, which lists the errors in PoS assignment with respect to each word class: in each row the errors are recorded by category, while the columns report the correct PoS to which the token belongs.28 The last column records the number of cases in which the PoS is over-extended, while the bottom row represents cases of under-extension.
Table 2.11 Precision of automatic tagging procedure

Total Sampling                             3,100
Not Decidable                                 31     1.00%
Total Evaluated                            3,069
All Correct                                2,726    88.82%
Correct PoS tag                            2,773    90.36%
1. PoS tag errors                            296     9.64%
2. Lemma errors                                2     0.07%
3. Sub-category errors                        38     1.24%
4. Morpho-syntactic description errors         7     0.23%
Total Errors                                 343    11.18%
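The percentages in Table 2.11 follow from the raw counts, and a short check makes the relations between the rows explicit. In this sketch the variable names and the helper `pct` are illustrative, and it is assumed (consistently with the figures) that the share of non-decidable tokens is computed over the total sample, while all other percentages are computed over the evaluated tokens.

```python
# Counts copied from Table 2.11.
TOTAL_SAMPLING = 3100
NOT_DECIDABLE = 31
ERRORS = {
    "PoS tag": 296,
    "Lemma": 2,
    "Sub-category": 38,
    "Morpho-syntactic description": 7,
}


def pct(part, whole):
    """Percentage rounded to two decimals, as printed in the table."""
    return round(100 * part / whole, 2)


evaluated = TOTAL_SAMPLING - NOT_DECIDABLE      # tokens actually evaluated
total_errors = sum(ERRORS.values())             # errors at any level
all_correct = evaluated - total_errors          # tokens with no error at all
correct_pos = evaluated - ERRORS["PoS tag"]     # tokens with the right PoS tag

print("not decidable:", pct(NOT_DECIDABLE, TOTAL_SAMPLING))
print("all correct:", all_correct, pct(all_correct, evaluated))
print("correct PoS tag:", correct_pos, pct(correct_pos, evaluated))
for level, count in ERRORS.items():
    print(level, "errors:", count, pct(count, evaluated))
print("total errors:", total_errors, pct(total_errors, evaluated))
```

The four error levels are mutually exclusive in this reading: their counts add up to the total number of errors, and the total plus the fully correct tokens gives the evaluated sample.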
Table 2.12 Confusion matrix
The data show that most mistakes occur in the Nouns category, which accounts for 103 of the 296 total errors.29 Considering the frequency of the category in the evaluation corpus, the projected over-extension of this word class would be around 15%. This probably depends on the statistical normalisation applied by the PiTagger system, which, very roughly, assigns to Nouns the highest probability of occurrence (see Picchi 1994). Another interesting result is that the Adverbs and Interjections categories are consistently under-extended with respect to their actual weight: over 10% under-extension for Adverbs and 8.5% under-extension for Interjections. Verbs show a lower incidence of errors, with 3.95% over-extension and 3.44% under-extension, and a roughly correct projection of the frequency of the category on the total. From the confusion matrix in Table 2.12, it is possible to obtain data on precision, recall and f-measure for each category. Table 2.13 details these measurements, which give an overall estimate of the automatic tagging procedure.30
Table 2.13 Precision, recall and f-measure for each PoS

PoS    tp   fp   fn   precision  recall   f-measure
DIM    65    0    0   100.00%    100.00%  1.0000
E     253    7   11    97.31%     95.83%  0.9656
V     559   23   20    96.05%     96.55%  0.9630
POS*   10    1    0    90.91%    100.00%  0.9524
R     181   11   14    94.27%     92.82%  0.9354
I     195    6   23    97.01%     89.45%  0.9308
B     502   25   85    95.26%     85.52%  0.9013
PER   170   27   11    86.29%     93.92%  0.8995
S     440  103   19    81.03%     95.86%  0.8782
N      40    1   11    97.56%     78.43%  0.8696
C     200   32   39    86.21%     83.68%  0.8493
IND    50   21    2    70.42%     96.15%  0.8130
A      90   27   33    76.92%     73.17%  0.7500
REL*   24   12   23    66.67%     51.06%  0.5783
NA*     2    0    5   100.00%     28.57%  0.4444
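The measures in Table 2.13 can be recomputed from the true positive (tp), false positive (fp) and false negative (fn) counts with the standard definitions of precision, recall and f-measure (harmonic mean of the two). The sketch below uses hypothetical names (`COUNTS`, `prf`) and simply re-derives the published columns from the published counts.

```python
# PoS: (tp, fp, fn), copied from Table 2.13.
COUNTS = {
    "DIM": (65, 0, 0), "E": (253, 7, 11), "V": (559, 23, 20),
    "POS": (10, 1, 0), "R": (181, 11, 14), "I": (195, 6, 23),
    "B": (502, 25, 85), "PER": (170, 27, 11), "S": (440, 103, 19),
    "N": (40, 1, 11), "C": (200, 32, 39), "IND": (50, 21, 2),
    "A": (90, 27, 33), "REL": (24, 12, 23), "NA": (2, 0, 5),
}


def prf(tp, fp, fn):
    """Return (precision, recall, f-measure) for one category."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


for pos, (tp, fp, fn) in COUNTS.items():
    precision, recall, f_measure = prf(tp, fp, fn)
    print(f"{pos:4} {100 * precision:6.2f}% {100 * recall:6.2f}% {f_measure:.4f}")

# Every PoS error is a false positive for one class and a false negative
# for another, so both columns add up to the 296 PoS tag errors.
total_fp = sum(fp for _, fp, _ in COUNTS.values())
total_fn = sum(fn for _, _, fn in COUNTS.values())
print(total_fp, total_fn)
```

The fp column corresponds to over-extension of a category and the fn column to under-extension, which is why the Nouns row (S), with 103 false positives, dominates the error count.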
.. Specific problems with the morpho-syntactic tagging of spoken language
Lemmatisation and PoS tagging need a detailed and coherent definition of the relevant context within which the disambiguation statistics operate. Resources must therefore include annotations of context boundaries. In written language the context is delimited by sentence boundaries, which are marked by punctuation; in spoken language, by contrast, the minimal relevant unit is the utterance, which must be detected and identified within the speech flow. In the following examples, two possible transcripts of the same dialogic turn are presented: the first without prosodic boundaries (followed by a second row bearing the PoS tags of each word; ambiguous words bear alternative tags separated by a slash), the second according to the Italian C-ORAL-ROM corpus.
(14) *NIC: sì   dice  che       taglia  porta  io   lì   per   lì   un’  elle  penso
           Adv  V     Conj/REL  S/V     S/V    PER  Adv  Prep  Adv  Art  S     V
(15) *NIC: sì // dice / che taglia porta ? io / lì per lì / un’ elle / penso // [yes, he/she says, which size do you take? There and then, I (say), a large, I think]
The PoS of the ambiguous elements can be decided only in connection with the context boundaries (represented by utterance boundaries). If the disambiguation process worked on the bare transcript, it would operate on an incoherent word pattern. The long PoS sequence reported in the example must be segmented at the context boundaries given by the prosodic annotation, and then disambiguated by a PoS tagger working within the relevant contexts. Without prosodic boundaries, the disambiguation of the PoS tags of the words in the dialogic turn would be arbitrary.
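The segmentation of a turn such as (15) into disambiguation contexts can be sketched as follows. This is an illustration under stated assumptions rather than the corpus tooling: the function `split_turn` is hypothetical, break symbols are assumed to be separate whitespace-delimited tokens, the terminal marks are taken to be //, ?, ... and the interruption sign +, and the speaker label is assumed to have been removed beforehand. Retracting marks ([/]) are not handled here.

```python
TERMINAL_BREAKS = {"//", "?", "...", "+"}   # assumed set of utterance-final marks
NON_TERMINAL_BREAK = "/"                    # secondary prosodic boundary


def split_turn(turn):
    """Split a turn into utterances; each utterance is a list of tone units,
    and each tone unit is a list of words."""
    utterances, units, words = [], [], []
    for token in turn.split():
        if token == NON_TERMINAL_BREAK or token in TERMINAL_BREAKS:
            if words:                       # close the current tone unit
                units.append(words)
                words = []
            if token in TERMINAL_BREAKS and units:
                utterances.append(units)    # close the current utterance
                units = []
        else:
            words.append(token)
    if words:
        units.append(words)
    if units:
        utterances.append(units)            # material after the last terminal break
    return utterances


turn = "sì // dice / che taglia porta ? io / lì per lì / un’ elle / penso //"
for utterance in split_turn(turn):
    print(utterance)
```

Each utterance returned by the function is one context for the disambiguation statistics; the inner lists keep the secondary boundaries available for a tagger able to use them.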
The prosodic annotation provided by the C-ORAL-ROM resource thus gives the automatic lemmatiser the information on the relevant linguistic unit which constitutes the context of the disambiguation process. PiTagger considers the utterance limits reported in the transcripts as the proper domain of application for the statistics. Although in previous work utterance boundaries have been defined by the automatic detection of pauses (longer than 200 ms; see Uchimoto et al. 2002), in the C-ORAL-ROM corpora such boundaries are provided by the marking of terminal prosodic breaks (//, ?, . . . ).

Besides the speech-specific feature of utterance boundaries, during the evaluation of automatic tagging the operators marked three kinds of contexts relevant to the spoken domain.
1. Under-specification of the PoS in a given context (no disambiguation judgement can be expressed). These cases (1%) have not been counted as part of the evaluation sample:
(16) che\CHE\CS?REL? mi\MI\PER piace\PIACERE\Vs3ip tanto\TANTO\B // [which I like so much]
2. Cases in which the PoS disambiguation is possible only if the evaluator is provided with information on the audio; that is, with information not computable by the system. In these cases (2.7%) the PoS assignment by the system has been evaluated. The system failed to assign the right tag in 83% of cases:
(17) e\E\CC allora\ALLORA\CC?B? / decise\DECIDERE\Vs3ir di\DI\E partire\PARTIRE\Vf // [and so (he) decided to leave]
3. Cases of uncertainty, found in connection with the secondary boundaries, in which the operator, in accordance with the tagset, can in principle assign a tag, but also points out that this categorisation is inadequate for the spoken domain. In these cases too (3.1%), the tags provided by the system have been evaluated positively or negatively, according to the tagset:
(18) dai\DARE\V? / prendiamola\PRENDERE\V // [come on / let’s take it]
The underspecification not connected to the utterance boundaries occurred in roughly 7% of the selected sample (3,100 words). A second level of evaluation provides an estimate of the incidence of these phenomena on the number of errors. As a result, it becomes possible to identify the contexts in which the system encounters problems and to select which of them are caused by specific features of spoken language. As will be detailed in the next sections, the automatic PoS tagging procedure is made complex in spontaneous speech at three main levels:
a. words adjacent to utterance boundaries;
b. fragmentation phenomena (retracting and interruptions);
c. secondary prosodic boundaries.
... Words adjacent to utterance boundaries
One-word utterances. In the set of total errors in PoS tag (296), 26% are one-word utterances where an ambiguous word occurs, as shown in examples (19) and (20). In this case, which is typical of spontaneous speech, the statistics of the disambiguation system, based on PoS order, are radically under-determined:
(19) *PRO: esatto\ESATTO\A?B? // [exactly]
(20) *ELA: cosa\COSA\S?REL? + [what]
Words in peripheral position. Apart from single words which make up an utterance, in the collected data it can be observed that around 22% of the tagged tokens with a PoS error occur in the first position of the utterance, and that roughly 13% occur in the last position. In other words, errors arise where the system has little contextual information on which to base a decision.
(21) che\CHE\CS* devo\DOVERE\Vs1ip dire\DIRE\Vfp / io\IO\PER sono\ESSERE\V nelle\IN\E_R mani\MANO\S del\DI\E_R mio\MIO\POS compagno\COMPAGNO\S // [what can I say, I’m in my partner’s hands]
(22) e\E\CC vai\ANDARE\Vs2ip sempre\SEMPRE\B diritto\DIRITTO\S* // [and keep going straight on]
These data, together with the previous results on one-word utterances, tell us that in 61% of cases the errors in PoS tagging involve a word in peripheral position. To improve the results on words adjacent to utterance boundaries, the tagging system should be able to take into account the main prosodic breaks as relevant positions for the disambiguation procedure, and should be trained on a corpus comprising such information. Given the low average utterance length which characterises spontaneous speech, the percentage of such contexts is considerable (roughly 30% of words).
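The positional breakdown above rests on a simple classification of each token by its place in the utterance. The sketch below is illustrative: `position_class` is a hypothetical helper, and the percentages are the rounded shares of PoS tag errors quoted in the text, added up to check the 61% figure.

```python
def position_class(index, length):
    """Classify a token by its position (0-based index) in an utterance
    of `length` words."""
    if length == 1:
        return "one-word utterance"
    if index == 0:
        return "first"
    if index == length - 1:
        return "last"
    return "internal"


# Rounded shares of the 296 PoS tag errors, as reported in the text.
ERROR_SHARE = {"one-word utterance": 26, "first": 22, "last": 13}
peripheral_share = sum(ERROR_SHARE.values())

words = ["e", "vai", "sempre", "diritto"]          # utterance (22)
for i, word in enumerate(words):
    print(word, position_class(i, len(words)))
print("peripheral share of errors:", peripheral_share, "%")
```

In example (22) the mis-tagged word diritto falls in the "last" class, and the wrongly tagged che of example (21) falls in the "first" class.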
... Interruptions and retracting
Interruption and retracting phenomena are a trait peculiar to spoken language. They introduce irregular and unpredictable PoS chains. An interruption involves the ending of the utterance, and it may create both an anomalous syntactic configuration (which is impossible to find in written language) and a lack of the information necessary to decide on disambiguation, as illustrated in (23):
(23) *LIA: [. . . ] per   farsi  perdonare  /  che        +
                    Prep  V      V             Conj/Rel?
     [to be forgiven / for]
The lemmatisers have neither rules nor statistics about these kinds of endings which, by definition, are not regular. The retracting phenomenon is often considered under the generic label of “disfluency phenomena” which occur in the speech flow. Retracting phenomena produce non-informative prosodic units within the utterances, with an irregular PoS sequence. Taking the utterance as the relevant context in which to apply the disambiguation rules, the retracting phenomenon (marked with “[/]” in the transcripts) generates unpredictable linear patterns, as in (24):
(24) *AAA: non  vedevo  bene  la   [/]  la   [/]  la   strada  //
           Adv  V       Adv   Art       Art       Art  Noun
     [I couldn’t see the road well]
Of the utterances within which the selected word was wrongly tagged (296), around 10% contain interruptions or retracting.31
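Utterances affected by fragmentation can be singled out directly from the transcription marks. The following minimal sketch uses a hypothetical function, `fragmentation_flags`, and assumes that the retracting sign [/] and the interruption sign + appear as separate whitespace-delimited tokens, as in examples (23) and (24).

```python
def fragmentation_flags(utterance):
    """Report retracting and interruption marks in a transcribed utterance."""
    tokens = utterance.split()
    return {
        "retracting": tokens.count("[/]"),                  # number of [/] marks
        "interrupted": bool(tokens) and tokens[-1] == "+",  # ends with +
    }


print(fragmentation_flags("per farsi perdonare / che +"))
print(fragmentation_flags("non vedevo bene la [/] la [/] la strada //"))
```

Flags of this kind would allow the error sample to be partitioned into fragmented and non-fragmented utterances, which is the comparison behind the figure of around 10% given above.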
... PoS assignment in connection with secondary prosodic boundaries
Some relevant correlations between the prosodic boundaries in the transcripts and the distribution of particular forms emerged during the lemmatisation and PoS tagging processes. In various cases, the secondary prosodic marks isolate lexical or support elements which assume peculiar values within the speech flow.

Non-standard support elements. The non-standard forms he and mh constitute the phonetic supports (of the vowel and the consonant type) used in spoken Italian. The positions in which they occur are systematically marked by prosodic boundaries, which point to the relation between the prosodic structure of an utterance and the informational values of the elements which occur in different tone units. In utterances like the following ones, these forms may be considered (with regard to PoS tagging) alternatively as interjections or as paralinguistic elements:
(25) *MIC: no / comunque / &he / vorrei vedere il film di [/] di Troisi . . . [no / however / er / I’d like to see Troisi’s film]
(26) *LIA: mh / facile // [mh / easy]
These are typical and highly frequent phenomena in spoken language: the first one, as in (25), is used to take time; the second one, as in (26), to mark the taking of a dialogic turn. Such non-standard forms have a functional value within the dialogic flow, but their PoS is uncertain and their distributional character cannot be foreseen on the basis of written language data or standard linguistic rules.
Standard forms with PoS assignment problems. Even the distribution of standard forms is not consistent with the definition of the categories that are based on syntactic relationships of the words within the clause structure. The following examples show two peculiar values that conjunctions and adverbs may assume in connection with secondary prosodic boundaries, which isolate these words at the extremities of utterances (first/last position): (27) *GNA: allora / stavamo discutendo // Conj/Adv? [so / we were discussing] (28) *EEE: passiamo ad un altro argomento / via // Adv/Int? [let’s change the subject / come on] According to the traditional categorisation, allora, as in (27), may be an adverb or a conjunction, while via, as in (28), may be an interjection or an adverb. In these cases, we are not able to disambiguate the proper PoS for these ambiguous expressions. We can only highlight that these words assume special functional values in such contexts. Moreover, the tagging system used does not take the secondary boundaries into account, and it treats the words in question within a single utterance context, as if belonging to a linear pattern. The disambiguation is estimated, by default, on the basis of the maximum likelihood of a tag sequence, but in this case this process is based on an improper word pattern (e.g. allora/stavamo and argomento/via), given that the two elements do not show a proper syntactic relation. With via in (28), we could decide, on the basis of the secondary boundaries, that the ambiguous element is an interjection at the end of the utterance. On the contrary, with allora in (27), the disambiguation process is not only underdetermined, but also arbitrary. As a matter of fact, the syntactic categories of conjunction and adverb are both unsatisfactory in defining the function of this kind of forms, which is not syntactic. As these elements have such a trait, it would be correct to treat them as non-syntactic. 
The label of “Discourse Marker”, well known in the literature (see references in Chapter 6) and mostly used to denote elements that have a prominent value within the conversational relations in the speech flow, would be useful for tagging elements of this kind as mainly informative ones, without further remarks on their syntactic value.
Allocutive emphatic. Similar problems emerge with fixed verb-forms isolated in the speech flow within non-terminal prosodic breaks. The following examples show this speech-specific phenomenon, in which both the PoS category and the lemma of the words senti and guarda (‘listen’ and ‘look’, imperative forms) are uncertain:

(29) *SER: senti\? / e poi [/] e poi chi c’ era?
     V? Conj Adv Conj Adv REL Adv V
     [listen / and then [/] and then who was there?]
Emanuela Cresti, Alessandro Panunzi, and Antonietta Scarano
(30) *SRE: ti stavamo aspettando / guarda\? //
     PER V V V?
     [we were waiting for you / look]

The words senti and guarda can never change inflection in this position, and they are more like fixed expressions with a pragmatic function (allocutive or phatic) than properly semantic instances of the verbs “SENTIRE” or “GUARDARE”.
Reported speech contexts. Other cases in which the distribution in spoken language is not consistent with the distribution found in written language arise when reported speech occurs. Here the expression which introduces the reported speech creates either a linear sequence of categories in which the expression is not separated by proper punctuation, as it would be in written language, as in (31), or an odd sequence, for example three verbs in a row, as in (32):

(31) *MAR: [. . . ] dice / allora vo a vedere //
     V Adv V Prep V
     [(he/she) says / I’ll go and see then]
(32) *NOR: [. . . ] Vincenzo / lasciala fare / dice +
     Noun V V V
     [Vincenzo / leave her alone / (he/she) says]

It can be maintained that, in connection with secondary prosodic boundaries, the statistics extracted from written training corpora can hardly be recognised and properly applied. In such positions, a word-form may have pragmatic values which match neither its typical PoS nor its lemma. Other non-standard forms with pragmatic functions can appear in similar isolated positions. The data collected in the evaluation phase of the automatic tagging show that 13.2% of the errors involve words in a single tone unit (i.e. isolated by secondary prosodic breaks), mostly standard words with uncertain classification. A tagging system able to take information on both primary and secondary prosodic boundaries as contextual input is needed to achieve an adequate treatment of these forms in their prosodic context. Training tools on corpora which include such an annotation level will be highly relevant for improving the results of automatic tagging for spoken corpora. The current state of the art of automatic taggers for Italian does not feature such an annotation level.
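As an illustration of the single-word tone units discussed above, the following minimal Python sketch splits a transcribed utterance into tone units and lists the words that stand alone in a unit. It is not part of the C-ORAL-ROM tools: the function names are hypothetical, and the sketch assumes that the speaker code has the form *XXX: and that /, // and [/] can all be treated as tone-unit boundaries for this purpose.

```python
import re

def split_tone_units(utterance: str) -> list[list[str]]:
    """Split a transcribed utterance into tone units (lists of words)."""
    # Drop the speaker code, e.g. "*MIC:" (assumed: three capital letters).
    text = re.sub(r"^\*[A-Z]{3}:\s*", "", utterance.strip())
    # Assumption: "//", "[/]" and "/" are all treated as unit boundaries.
    chunks = re.split(r"\s*(?://|\[/\]|/)\s*", text)
    units = []
    for chunk in chunks:
        # Keep only tokens containing at least one word character.
        words = [w for w in chunk.split() if re.search(r"\w", w)]
        if words:
            units.append(words)
    return units

def isolated_words(utterance: str) -> list[str]:
    """Return the words that fill a tone unit on their own."""
    return [unit[0] for unit in split_tone_units(utterance) if len(unit) == 1]

if __name__ == "__main__":
    for line in ("*GNA: allora / stavamo discutendo //",
                 "*EEE: passiamo ad un altro argomento / via //"):
        print(line, "->", isolated_words(line))
```

A tagger that received such units as context, rather than the flat word sequence, could in principle treat allora in (27) and via in (28) separately from the neighbouring words; how this would be integrated into an actual tagger is left open here.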
. Main data from lemmatisation

The high frequency lists extracted from the Italian C-ORAL-ROM corpora were compared with a previous frequency lexicon of spoken Italian (LIP) and with a written resource (Linguistic Miner).32
Table 2.14 High frequency verbs, excluding auxiliaries and modal verbs: comparison of C-ORAL-ROM and other corpora

Rank | C-ORAL-ROM | LIP | Written resource
1 | fare | fare | fare
2 | dire | dire | dire
3 | andare | andare | venire
4 | vedere | vedere | andare
5 | sapere | sapere | vedere
6 | stare | stare | stare
7 | venire | dare | sembrare
8 | capire | parlare | mettere
9 | mettere | venire | chiedere
10 | dare | capire | parlare
11 | partire | mettere | trovare
12 | sentire | sentire | prendere
13 | arrivare | pensare | diventare
14 | prendere | trovare | entrare
15 | pensare | guardare | continuare
16 | parlare | chiamare | prevedere
17 | guardare | prendere | tenere
18 | trovare | portare | decidere
19 | portare | credere | spiegare
20 | credere | arrivare | pubblicare
21 | sembrare | ricordare | pensare
22 | bisognare | chiedere | arrivare
23 | chiamare | scrivere | dare
24 | lavorare | scusare | perdere
25 | cercare | tenere | attraversare
26 | entrare | sembrare | rimanere
27 | chiedere | bisognare | capire
28 | conoscere | passare | considerare
29 | aspettare | conoscere | cercare
30 | passare | riuscire | tornare
31 | rimanere | cercare | lasciare
32 | riuscire | lasciare | sapere
33 | piacere | aspettare | ottenere
34 | ricordare | lavorare | riguardare
35 | cominciare | finire | ricordare
36 | lasciare | rimanere | raggiungere
37 | comprare | leggere | rispondere
38 | cambiare | bastare | costituire
39 | vivere | entrare | rendere
40 | mangiare | mangiare | portare
41 | tornare | mandare | apparire
42 | tenere | cominciare | cominciare
43 | leggere | partire | uscire
44 | rispondere | riguardare | seguire
45 | diventare | parere | sentire
46 | riguardare | diventare | credere
47 | selezionare | pagare | consentire
48 | usare | uscire | presentare
49 | aprire | cambiare | vivere
50 | finire | succedere | definire
51 | perdere | presentare | rappresentare
52 | capitare | servire | conoscere
53 | nascere | perdere | riuscire
54 | esprimere | tornare | realizzare
55 | seguire | piacere | utilizzare
56 | buttare | continuare | esistere
57 | versare | ringraziare | cambiare
58 | uscire | sperare | scrivere
59 | chiudere | spiegare | guardare
60 | costruire | telefonare | chiamare
61 | pagare | provare | restare
62 | parere | usare | pagare
63 | servire | morire | effettuare
64 | sperare | aprire | lavorare
65 | creare | ascoltare | intendere
66 | mandare | rendere | raccontare
67 | pigliare | rispondere | dichiarare
68 | scegliere | significare | affermare
69 | toccare | trattare | avvenire
70 | morire | interessare | risultare
71 | vendere | vivere | sostenere
72 | rendere | vendere | precisare
73 | ritornare | mancare | bisognare
74 | spiegare | seguire | partire
75 | raccontare | studiare | ritenere
76 | scusare | considerare | finire
77 | decidere | esistere | dimostrare
78 | provare | ripetere | annunciare
79 | continuare | scegliere | passare
80 | iniziare | fermare | mancare
81 | esistere | ritornare | aprire
82 | incontrare | porre | indicare
83 | presentare | nascere | aggiungere
84 | mancare | pigliare | giocare
85 | succedere | chiudere | evitare
86 | scrivere | occupare | superare
87 | discutere | decidere | scegliere
88 | camminare | aiutare | destinare
89 | fermare | raccontare | prestare
90 | interessare | tirare | fornire
91 | funzionare | iniziare | parere
92 | imparare | riconoscere | svolgere
93 | ripetere | intendere | aspettare
94 | giocare | valere | controllare
95 | proporre | proporre | riconoscere
96 | costare | organizzare | studiare
97 | durare | comprare | registrare
98 | riconoscere | togliere | convincere
99 | intendere | dimenticare | offrire
100 | produrre | dipendere | proporre
Table 2.15 High frequency adverbs: comparison of C-ORAL-ROM and other corpora

Rank | C-ORAL-ROM | LIP | Written resource
1 | non | non | non
2 | si’ | si’ | come
3 | no | no | ci
4 | ci | poi | su
5 | poi | più | dopo
6 | come | bene | prima
7 | piu’ | ecco | ancora
8 | cioe’ | poco | quando
9 | qui | qui | poi
10 | ecco | così | sempre
11 | insomma | sempre | ieri
12 | li’ | molto | dove
13 | cosi’ | proprio | molto
14 | bene | solo | oggi
15 | sempre | già | fino
16 | allora | ancora | invece
17 | proprio | adesso | mai
18 | quindi | prima | oltre
19 | po’ | invece | quasi
20 | quando | qua | forse
21 | molto | li’ | soprattutto
22 | gia’ | certo | quindi
23 | ancora | oggi | poco
24 | prima | appunto | qui
25 | perche’ | ora | pure
26 | forse | meno | comunque
27 | dove | forse | soltanto
28 | oggi | mai | allora
29 | mai | pure | almeno
30 | niente | veramente | no
31 | su | la’ | po’
32 | pratico* | tanto | ormai
33 | ora | eccetera | subito
34 | adesso | meglio | fuori
35 | invece | soltanto | adesso
36 | qua | dopo | inoltre
37 | meglio | fuori | nulla
38 | via | magari | appena
39 | subito | avanti | niente
40 | comunque | via | infine
41 | per_esempio | praticamente | ecco
42 | tanto | ormai | tuttavia
43 | senno’ | niente | dietro
44 | nulla | insieme | intanto
45 | eccetera | subito | sopra
46 | giu’ | abbastanza | insomma
47 | vero* | quasi | domani
48 | almeno | almeno | davvero
49 | soprattutto | soprattutto | addirittura
50 | mica | ieri | piuttosto
51 | abbastanza | purtroppo | neppure
52 | la’ | fa | probabile*
53 | solo | sicuramente | dentro
54 | sicuro* | tutto | diretto*
55 | meno | domani | accanto
56 | naturale* | probabilmente | finora
57 | poco | male | naturale*
58 | assoluto* | dentro | nemmeno
59 | ok | mo’ | magari
60 | davvero | troppo | certo*
61 | qualcosa | naturalmente | finale*
62 | magari | giù | attorno
63 | dopo | addirittura | attuale*
64 | domani | intanto | particolare*
65 | ormai | certamente | persino
66 | purtroppo | ovviamente | neanche
67 | ovvio | neanche | perfino
68 | troppo | assolutamente | abbastanza
69 | ieri | mica | completo*
70 | dentro | completamente | ovvio
71 | intanto | sotto | indietro
72 | chiaro* | direttamente | assoluto*
73 | più_o_meno | nemmeno | sicuro*
74 | fuori | spesso | immediato*
75 | male | chiaramente | peraltro
76 | probabile* | davanti | purtroppo
77 | stasera | stasera | vero*
78 | addirittura | altrimenti | rispettivo*
79 | rispetto_a | vicino | semplice*
80 | certo* | appena | altrettanto
81 | nemmeno | sopra | altrimenti
82 | mo | effettivamente | affatto
83 | pure | evidentemente | successivo*
84 | piuttosto | davvero | esclusivo*
85 | dietro | giustamente | evidente*
86 | avanti | piuttosto | recente*
87 | completo* | solamente | pratico*
88 | mezzo | su | esatto*
89 | quasi | presto | talvolta
90 | diretto* | eventualmente | contemporaneo*
91 | evidente* | piano | stasera
92 | fino | stamattina | facile*
93 | indietro | dietro | deciso*
94 | semplice* | esatto | definitivo*
95 | soltanto | oltre | sostanziale*
96 | insieme | semplicemente | bene
97 | sopra | perlomeno | stavolta
98 | in_effetti | nulla | qua
99 | a_posto | indietro | ufficiale*
100 | effettivo* | accanto | perfetto*

* The asterisk indicates adverbs in “-mente” whose lemma is reported without the suffix by the PiTagger tool.
Table 2.16 High frequency adjectives: comparison of C-ORAL-ROM and other corpora

Rank | C-ORAL-ROM | LIP | Written resource
1 | bello | grande | grande
2 | grande | bello | certo
3 | certo | certo | diverso
4 | vero | vero | possibile
5 | diverso | importante | vero
6 | solo | diverso | scientifico
7 | giusto | buono | economico
8 | bellino | nuovo | unico
9 | bravo | piccolo | pronto
10 | importante | ultimo | bello
11 | possibile | giusto | attuale
12 | nuovo | prossimo | fiscale
13 | esatto | solo | regionale
14 | chiaro | vario | mondiale
15 | piccolo | politico | relativo
16 | mezzo | possibile | finanziario
17 | politico | grosso | elettorale
18 | italiano | chiaro | comunale
19 | pronto | bravo | successivo
20 | unico | particolare | culturale
21 | complesso | italiano | amministrativo
22 | buono | unico | commerciale
23 | difficile | difficile | recente
24 | prossimo | generale | facile
25 | automatico | lungo | professionale
26 | grosso | vecchio | intero
27 | ferroviario | libero | speciale
28 | pubblico | caro | numeroso
29 | scorso | presente | costretto
30 | ultimo | alto | prossimo
31 | interessante | mezzo | operativo
32 | locale | scorso | aperto
33 | maggiore | economico | ulteriore
34 | particolare | uguale | monetario
35 | forte | facile | capace
36 | strano | forte | produttivo
37 | alto | pubblico | tradizionale
38 | tranquillo | strano | necessario
39 | giovane | nazionale | vario
40 | facile | solito | complessivo
41 | preciso | semplice | televisivo
42 | sociale | migliore | famoso
43 | libero | normale | definitivo
44 | lungo | rosso | disponibile
45 | mondiale | esatto | urbano
46 | piano | povero | giusto
47 | umano | sociale | agricolo
48 | famoso | attento | tedesco
49 | tedesco | bianco | antico
50 | attento | nero | significativo
51 | fondamentale | storico | ambientale
52 | inutile | aperto | pubblico
53 | amico | brutto | concluso
54 | centrale | famoso | sperimentale
55 | generale | preciso | immediato
56 | normale | attuale | tecnologico
57 | bene | sicuro | evidente
58 | fine | maggiore | letterario
59 | incredibile | contento | legislativo
60 | migliore | reale | istituzionale
61 | professionale | interessante | inutile
62 | civile | positivo | cosiddetto
63 | comunale | basso | comunitario
64 | continuo | incredibile | mosso
65 | da_solo | pronto | notevole
66 | aperto | verde | enorme
67 | brutto | minimo | pericoloso
68 | elettorale | precedente | moderno
69 | azionario | caldo | nero
70 | felice | necessario | favorevole
71 | grave | breve | strano
72 | nero | centrale | attento
73 | regionale | personale | aziendale
74 | semplice | civile | strutturale
75 | vicino | giovane | bravo
76 | attuale | comune | giuridico
77 | basso | inutile | elettrico
78 | lontano | perfetto | scarso
79 | pieno | serio | analogo
80 | cosiddetto | tranquillo | tranquillo
81 | leggero | vicino | biologico
82 | nazionale | grave | sindacale
83 | operaio | relativo | tributario
84 | personale | tecnico | ordinario
85 | a_parte | convinto | decisivo
86 | generico | medio | opportuno
87 | internazionale | culturale | eventuale
88 | militare | pieno | futuro
89 | necessario | fisico | molecolare
90 | pericoloso | logico | costituzionale
91 | positivo | romano | popolare
92 | scusato | determinato | valido
93 | usato | europeo | artistico
94 | vecchio | locale | raro
95 | comune | privato | offerto
96 | fotografico | sbagliato | probabile
97 | intero | superiore | mediterraneo
98 | sanitario | naturale | annuo
99 | scientifico | negativo | delicato
100 | sicuro | specifico | lontano
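The comparison between frequency lists in Tables 2.14 to 2.16 can be made concrete with a small sketch. The Python code below is only an illustration, not the procedure used by the authors: the function names are hypothetical, and the data are the ten highest-ranked verbs of each resource as given in Table 2.14. It counts how many items two top-N lists share and how far the rank of a shared item shifts from one list to the other.

```python
# Top-10 verbs copied from Table 2.14 (ranks 1 to 10).
CORALROM = ["fare", "dire", "andare", "vedere", "sapere",
            "stare", "venire", "capire", "mettere", "dare"]
LIP = ["fare", "dire", "andare", "vedere", "sapere",
       "stare", "dare", "parlare", "venire", "capire"]
WRITTEN = ["fare", "dire", "venire", "andare", "vedere",
           "stare", "sembrare", "mettere", "chiedere", "parlare"]

def shared_items(list_a, list_b, top_n):
    """Items of list_a's top_n that also occur in list_b's top_n."""
    top_b = set(list_b[:top_n])
    return [word for word in list_a[:top_n] if word in top_b]

def rank_shifts(list_a, list_b):
    """Rank in list_b minus rank in list_a, for items found in both."""
    rank_b = {word: i + 1 for i, word in enumerate(list_b)}
    return {word: rank_b[word] - (i + 1)
            for i, word in enumerate(list_a) if word in rank_b}

if __name__ == "__main__":
    print(len(shared_items(CORALROM, LIP, 10)))      # 9 shared verbs
    print(len(shared_items(CORALROM, WRITTEN, 10)))  # 7 shared verbs
    print(rank_shifts(CORALROM, LIP)["dare"])        # -3 (rank 10 vs rank 7)
```

On these ten ranks, the two spoken resources share more items with each other than C-ORAL-ROM shares with the written resource; whether the same holds for the full lists would need to be checked on the complete tables.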
Notes

* Emanuela Cresti is responsible for §2.1, Antonietta Scarano for §2.2, and Alessandro Panunzi for §2.3.

1. Available as a CD via the technical staff at Laboratorio di Linguistica. The materials are also accessible via anonymous ftp at ftp.cirass.unina.it and are contained in the folder cirass/pub/avip.

2. European Language Resource Association http://www.enda.fr/; corpus in http://www.icp.grenet.fr/ELRA/cata/spee_det.html#apasci.

3. For titles, citations and discourse, inverted commas were not used.

4. An ad hoc graphic solution was chosen for the spoken form of ‘you have’ in use in Calabria and Campania: he’.

5. With the only exception being the relative-interrogative pronoun icché (< il che) = che cosa / quello che ‘what, that which’, which is definitely an independent word. An apex has also been used for articles, so as to ensure a transcription as true as possible to pronunciation:
’a for la [the (f. sing.)]
e’ for il, i [the (m., sing. and pl.)]
’e for le [the (f. pl.)]
’u for il [the (m. sing.)]

6. More recent work (Delmonte 2002), on 37,000 words of the API corpus, has also been conducted, but the accompanying papers do not specify the disambiguation procedure adopted.

7. The NERC project (Monachini & Östling 1992) represents a comparison between morphosyntactic annotation schemes, on the basis of the following tagsets:
– UPenn, Gothenburg and Brown corpora for American English;
– BNC, LOB, Lancaster corpora and ENGTWOL lexicon for British English;
– ILC-DMI and EUROTRA for Italian;
– INaLF for French;
– Uit den Boogaart for Dutch.

8. For references about these lexical encoding projects, see MULTILEX Consortium (1993) and GENELEX Consortium (1993).

9. The main tasks of these projects are “the definition and the implementation of a set of tools for corpus-based research and applications, and the production of a corpus in a multilingual framework”, with respect to the written resources of European languages.

10. The Spoken Dutch Corpus Project (Corpus Gesproken Nederlands, CGN) also adopted an EAGLES-like tagset for the PoS and morpho-syntactic tagging of the resource; see Van Eynde, Zavrel, and Daelemans (2000).

11. “Moreover, morpho-syntactic annotation constitutes the basic level of linguistic description in natural language processing and is usually considered a prerequisite for further and more complex kinds of analysis. Almost all systems and applications require this level of linguistic description. In this area, therefore, standard conventions are welcomed not only by a large number of users in all sectors of LE, but also by many in the literary and humanities fields” (Calzolari & Monachini 1996).

12. As a matter of fact, the traditional taggers for Italian (see XEROX tagger, with an online demo at http://www.xrce.xerox.com/competencies/content-analysis/demos/italian; TreeTagger, see Schmid 1994) do not feature the distinction between auxiliary and predicative uses of the verbs ESSERE and AVERE. In another case (IMMORTALE, see Delmonte 1997), this distinction exists in the tool tagset, but the results are not yet available. Only recently has this feature been implemented in CONNEXOR (http://www.connexor.com).

13. In the tagset presented in Monachini (1996), the distinction provided between main verbs and auxiliary verbs has the trivial consequence of tagging ESSERE and AVERE as auxiliaries in each case, with no regard to their value in the context. In the following suggestions, the possibility of tagging the instances of these verbs with respect to the different values is given as an optional manual procedure in a post-editing phase.

14. That is, in Italian and the other Romance languages, such expressions can occur in adjectival position as well as be used as pronouns, e.g. questo ‘this’, mio ‘my/mine’, alcuni ‘some/any’.

15. “Lexical descriptions should be independent from applications and should aim at a general description of each language; corpus tags, depending on the capabilities of state-of-art tagging techniques, may under-specify lexical specifications, collapsing many distinctions and presenting broader categories” (Calzolari & Monachini 1996).

16. As pointed out in the EAGLES Guidelines, “if calling possessives Determiners in English works for their complementary distribution with respect to the article (my book, the book), this is not the case in Italian, because il libro does not have a corresponding expression *mio libro. The only exceptions in Italian regard the family lexicon, in which possessives work as real determiners (mio padre, il padre; mia madre, la madre).

17. For TEI recommendations on morpho-syntactic annotation, see TEI (1991).

18. As recently underlined by the Spontaneous Speech Corpus of Japanese editors, special tags are needed to adequately treat the speech-specific phenomena that occur in the speech flow (reported in the transcripts); see the tentative list of tags used in transcription presented in Furui, Maekawa, & Isahara (2000). As far as we can gather from the published material (Zavrel & Daelemans 2000; Van Eynde, Zavrel, & Daelemans 2000), the PoS tagging of the Spoken Dutch Corpus does not specify these phenomena in the tagset.

19. The distinction between compositional and non-compositional elements is supported by the analogous behaviour of all these elements with respect to the prosodic structure of the utterance. Non-compositional elements are always isolated by primary or secondary prosodic boundaries and are frequently the only element of the utterance. This character is strictly related to the autonomous illocutive force expressed by interjections, onomatopoeia and acquisition forms. The non-compositional trait of these expressions, whose linguistic value is independent from the syntactic relationships, seems to be positively correlated with the presence of such a pragmatic force (cf. Moneglia & Cresti 2000).

20. The weight of all these elements in the corpus amounts to 7,237 out of 306,638 tagged tokens (2.36%).

21. This is widespread in spoken Italian; see §2.2.

22. With regard to the identification of the lexical class of Italian phrasal verbs, see Simone (1997) and Venier (1996).

23. For this example it is also possible to give a phrasal verb oriented interpretation of CORRERE DIETRO, as in:
a” per tutta la gara [ha corso dietro] al vincitore.
[he [has run after] the winner for the whole time of the race]
This kind of ambiguity increases the complexity of treatment of the verbal multiwords, whose interpretation is not strictly related to the linear sequence of words within the utterances.

24. Within the list of multiwords, a single case of a compound pronoun must be pointed out, mostly used in interrogative contexts and classified as a REL element: CHE_COSA\REL [what].

25. This number is estimated taking into account the number of multiwords detected in the corpus, so it is quite different from the number of graphic words in the corpus.

26. A minor imbalance in the percentage of some PoS in the sampling may produce some noise in the statistics. More specifically, the percentage incidence of both adverbs and interjections in the sampling exceeds the incidence of those categories in the corpus by about 5%. In parallel, a lower incidence of conjunctions (3%) and nouns (2%) is also recorded.

27. Tested on a corpus of official documents of the EU Commission (500,000 tokens, reviewed by Enrica Calchini), PiTagger reached a 97% level of correctness. The same recognition rate was reached on the LABLITA literary sampling corpus (60,000 tokens).

28. For example, number 30 (second row, third column) corresponds to 30 tokens wrongly tagged as nouns (S), which should instead be tagged as adjectives (A).

29. The statistical relevance of the confusion matrix must be carefully considered, given that the sampling is significant with respect to the occurrence of forms (1/100) rather than categories. More specifically, the significance is lower with respect to less represented categories.

30. The rows in the table are sorted by the f-measure value (last column), which is an overall standard measurement to evaluate the general accuracy of automatic procedures. The PoS which feature a higher f-measure value are the ones tagged with a higher precision. For the PoS marked with an asterisk in the first column, the number of occurrences in the sampling corpus is too low to ensure an adequate evaluation.

31. Fragmentation phenomena involve around 30% of the utterances in the Italian corpus.

32. The transcription and lemmatisation of LIP aimed at the extraction of frequency lists. The automatic tagging of the whole resource was completely revised and manually corrected. The Linguistic Miner is an automated and non-balanced resource that collects a very large number of texts from various sources. At present, the resource comprises about 25 million words.
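The notes above refer to a confusion matrix and to an f-measure by which the PoS are ranked. As a reminder of how such values are usually obtained, here is a minimal Python sketch. It assumes the balanced f-measure (harmonic mean of precision and recall) and a matrix indexed as confusion[assigned tag][correct tag]; both the layout and the counts are hypothetical and are not taken from the chapter's evaluation table.

```python
def precision_recall_f(confusion, tag):
    """Precision, recall and balanced f-measure for one tag.

    confusion[assigned][correct] holds token counts (assumed layout).
    """
    true_positives = confusion[tag].get(tag, 0)
    # All tokens to which the tagger assigned this tag.
    assigned = sum(confusion[tag].values())
    # All tokens whose correct tag is this tag.
    actual = sum(row.get(tag, 0) for row in confusion.values())
    precision = true_positives / assigned if assigned else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

if __name__ == "__main__":
    # Hypothetical counts: 30 adjectives (A) were wrongly tagged as nouns (S).
    confusion = {
        "S": {"S": 900, "A": 30},
        "A": {"S": 20, "A": 450},
    }
    for tag in confusion:
        print(tag, precision_recall_f(confusion, tag))
```

With few occurrences of a category, these ratios rest on small counts, which is why the evaluation of the less represented PoS is described in the notes as less reliable.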
Chapter 3
The French corpus* Estelle Campione, Jean Véronis, and José Deulofeu
. History of the corpus within the national framework

The French part of C-ORAL-ROM was produced by the DELIC research team of Université de Provence at Aix-en-Provence, France. The DELIC team was created in 2000 by the merger of GARS (Groupe Aixois de Recherches en Syntaxe), a group established in 1976 at the Department of French Linguistics of Université de Provence and headed by Claire Blanche-Benveniste, and of the Natural Language and Speech Processing group, founded in 1993 by Jean Véronis at the same University. The team’s research follows three main tracks:

1. Spoken French corpus collection and analysis.1 The team presently owns the largest corpus of spontaneous spoken French, comprising about 2.5 million words transcribed on paper, and about 2 million words that have been computerised in a format that can be used by concordance tools. French and foreign researchers frequently use these databases, as they constitute an efficient tool for studying language and comparing registers. The corpus is organised according to various criteria, including those for speech pathology (in collaboration with J. L. Nespoulos, Laboratoire Jacques Lordat in Toulouse).
2. The morpho-syntactic study of spoken and written French based on current descriptive methods (in particular, the pronominal approach). This analysis framework has been further developed in three books (Blanche-Benveniste 1997a; Blanche-Benveniste et al. 1984, 1990) and in a large number of research papers. Phoneticians have collaborated on the study of prosodic patterns that are compatible with syntactic descriptions (Ottawa, Leuven, Aix-en-Provence). The ‘macrosyntax’ concept has also been discussed with researchers from Fribourg and Neuchâtel (A. Berrendonner and M.-J. Reichler-Béguelin).
3. Computer processing of language corpora, including encoding issues, grammatical tagging, multilingual text-alignment, information extraction for lexicography, and prosody. This area of research gave rise to two books (Ide & Véronis 1995; Véronis 2000) and about 180 papers published in journals, conference proceedings and edited books. Many groups that were part of major international projects around the world collaborated on this task (MULTEXT, TEI, EAGLES, ARCADE, etc.).

In the spirit of the C-ORAL-ROM project, the DELIC team had originally intended to use the CORPAIX Spoken French Corpus, collected since 1978, in C-ORAL-ROM. CORPAIX contains 3 types of transcribed texts, along with a cassette bearing the original recordings. The recordings were of many different types (interviews, conversations, meetings, etc.), and varied in content (personal memories, professional experiences, political discussions, travel and leisure activities, etc.) and speaker characteristics (age, education, social and geographic origin). Table 3.1 provides an estimate of the number of usable texts in the corpus.

Table 3.1 The CORPAIX spoken French corpus

Type | Number | Average size (words) | Approximate total (words)
Short | 500 | 2,000 | 1,000,000
Long | 42 | 12,000 | 500,000
Very long | 28 | 18,000 | 500,000
Total | 570 | 3,500 | 2,000,000

However, legal difficulties unforeseen at the time the project was planned had to be overcome. In recent years, there have been lawsuits and changes in legislation, making for a very unclear overall legal picture. Researchers are for the most part unprepared for this new legal landscape, and lawyers have been of little help so far. The situation, in brief, is as follows. In the past, the common practice in the collection of French databases was not to request written consent from the speakers and media. For this reason, it was subsequently not deemed wise to distribute the data from these databases unless they were totally anonymous and trivial in topic and content (e.g. brief interactions between people who were totally unidentifiable). After careful consideration, only 18 recordings from CORPAIX were found both to meet the C-ORAL-ROM sampling criteria and to be safe to include on legal grounds.

The DELIC team then recorded another corpus in 1999-2000, named the Corpus de Référence du Français Parlé (CRFP), for which it systematically collected written consent. This corpus is based on interviews that are sampled according to age, level of education and type of situation (private, professional, and public, as outlined in Table 3.2). The entire corpus comprises 134 recordings that last a total of 37 hours, making an average of a little over 15 minutes per recording, for a total of 440,000 words. The entire set of transcriptions was aligned with the corresponding audio recordings. The average segment length is 3.1 seconds (9.6 words).

Table 3.2 The Corpus de Référence du Français Parlé (CRFP)

Type | Recordings | % | Words | %
Private | 84 | 62.7% | 282,857 | 64.5%
Professional | 22 | 16.4% | 75,001 | 17.1%
Public | 28 | 20.9% | 80,601 | 18.4%
Total | 134 | 100.0% | 438,378 | 100.0%

However, the criteria for sampling and length used in CRFP were not fully compatible with those of the C-ORAL-ROM project, and ultimately only 31 texts from CRFP could be used. An additional corpus of man-machine interaction was created, namely dialogues with an automatic train-reservation service. It contained 42 texts and about 15,500 words, but, given its very specific nature, it was not felt to be appropriate for more general linguistic studies and therefore was not included either. In short, most of the data had to be recorded and transcribed specifically for the project. The final composition of the French C-ORAL-ROM texts is shown in Table 3.3.

As is clear from the brief history of the project outlined above, most of the texts are recent. Table 3.4 tabulates the number of texts in the French C-ORAL-ROM corpus, distributed by year.

The French C-ORAL-ROM corpus was aligned at pauses, using the Transcriber program (Barras et al. 1998).2 The corpus was then automatically converted to the WinPitch format, and prosodically annotated. The resulting files were carefully checked in order to make sure that they could be opened using WinPitch and that they were correctly displayed in it.
Table 3.3 Origin of the French C-ORAL-ROM data

Origin | Number of texts | %
CORPAIX | 18 | 11
CRFP | 31 | 19
New | 115 | 70
Total | 164 | 100

Table 3.4 Distribution of French C-ORAL-ROM texts by year

Year | Number of texts
1980 | 2
1989 | 1
1994 | 24
1998 | 3
1999 | 12
2000 | 20
2001 | 48
2002 | 54
Total | 164
. Orthographic transcription

.. General criteria

During the construction phase of the project, it was agreed that a number of transcription conventions would be common to all languages (speaker notation, hesitations, prosody annotation, etc.), but that each team would keep its own conventions for all other features. The French transcription conventions may therefore differ from those of the other teams in the following respects:

1. Acronyms were transcribed using the common practice in French texts: sometimes with full stops (e.g. C.N.R.S.), sometimes without (e.g. ATALA).
2. Some proper names were anonymised (with the corresponding fragment replaced by a beep in the sound file) using the following convention:
   a. P1, P2, etc. for names of persons;
   b. S1, S2, etc. for names of companies (i.e. sociétés);
   c. T1, T2, etc. for names of places;
   d. C1, C2, etc. for numbers (i.e. chiffres), such as telephone or credit card numbers.
3. Spelling which could not be decided on due to homophony was put within parentheses, e.g. il(s) voulai(en)t pas, j’en (n’)ai pas voulu.
4. Titles (works, radio broadcasts, etc.) were enclosed within quotes, e.g. “Fables de La Fontaine”.
5. Phonetic transcriptions of particular or deviant pronunciations were given when needed (on the separate %pho tier) using the SAMPA alphabet, as illustrated in Table 3.5.
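The anonymisation codes listed under point 2 follow a regular pattern (a letter P, S, T or C followed by a number), so they can be located automatically. The Python sketch below is a hypothetical helper, not a tool distributed with the corpus, and the sample sentence is invented for the illustration.

```python
import re

# One capital letter from the convention above, followed by digits.
ANON_CODE = re.compile(r"\b([PSTC])(\d+)\b")

# Category names paraphrase the convention; the labels are our own.
CATEGORIES = {"P": "person", "S": "company", "T": "place", "C": "number"}

def find_anonymised(text):
    """Return (code, category) pairs for each anonymisation code found."""
    return [(match.group(0), CATEGORIES[match.group(1)])
            for match in ANON_CODE.finditer(text)]

if __name__ == "__main__":
    sample = "je travaille chez S1 avec P2 à T1"  # invented example
    print(find_anonymised(sample))
```

Because the pattern requires at least one digit after the letter, acronyms written with full stops, such as C.N.R.S., are not matched.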
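Table 3.5 The SAMPA alphabet for French3

a. Consonants

Class | Symbol | Example | Transcription
Plosives | p | pont | po~
Plosives | b | bon | bo~
Plosives | t | temps | ta~
Plosives | d | dans | da~
Plosives | k | quand | ka~
Plosives | g | gant | ga~
Fricatives | f | femme | fam
Fricatives | v | vent | va~
Fricatives | s | sans | sa~
Fricatives | z | zone | zon
Fricatives | S | champ | Sa~
Fricatives | Z | gens | Za~
Nasals | m | mont | mo~
Nasals | n | nom | no~
Nasals | J | oignon | oJo~
Nasals | N | camping | ka~piN
Liquids | l | long | lo~
Liquids | R | rond | Ro~
Glides | w | coin | kwe~
Glides | H | juin | ZHe~
Glides | j | pierre | pjER

b. Vowels

Class | Symbol | Example | Transcription
Oral | i | si | si
Oral | e | ses | se
Oral | E | seize | sEz
Oral | a | patte | pat
Oral | A | pâte | pAt
Oral | O | comme | kOm
Oral | o | gros | gRo
Oral | u | doux | du
Oral | y | du | dy
Oral | 2 | deux | d2
Oral | 9 | neuf | n9f
Oral | @ | justement | Zyst@ma~
Nasal | e~ | vin | ve~
Nasal | a~ | vent | va~
Nasal | o~ | bon | bo~
Nasal | 9~ | brun | bR9~
Indeterminate | E/ | = e or E |
Indeterminate | A/ | = a or A |
Indeterminate | &/ | = 2 or 9 |
Indeterminate | O/ | = o or O |
Indeterminate | U~/ | = e~ or 9~ |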
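To show how the transcriptions in Table 3.5 are read, the following Python sketch segments a SAMPA string into symbols and converts it to IPA. The SAMPA symbols are those of the table; the IPA equivalents are not given in the chapter and are added here on the assumption that the standard SAMPA correspondences for French apply. The function names are hypothetical, and the indeterminate symbols (E/, A/, etc.) are left out for simplicity.

```python
# SAMPA symbols from Table 3.5; IPA values are an added assumption.
SAMPA_TO_IPA = {
    "p": "p", "b": "b", "t": "t", "d": "d", "k": "k", "g": "g",
    "f": "f", "v": "v", "s": "s", "z": "z", "S": "\u0283", "Z": "\u0292",
    "m": "m", "n": "n", "J": "\u0272", "N": "\u014b",
    "l": "l", "R": "\u0281", "w": "w", "H": "\u0265", "j": "j",
    "i": "i", "e": "e", "E": "\u025b", "a": "a", "A": "\u0251",
    "O": "\u0254", "o": "o", "u": "u", "y": "y",
    "2": "\u00f8", "9": "\u0153", "@": "\u0259",
    # Nasal vowels: base vowel plus combining tilde (U+0303).
    "e~": "\u025b\u0303", "a~": "\u0251\u0303",
    "o~": "\u0254\u0303", "9~": "\u0153\u0303",
}

def tokenize_sampa(transcription):
    """Split a SAMPA string into symbols, preferring two-character ones."""
    symbols = []
    i = 0
    while i < len(transcription):
        pair = transcription[i:i + 2]
        if pair in SAMPA_TO_IPA:          # nasal vowel such as "a~"
            symbols.append(pair)
            i += 2
        elif transcription[i] in SAMPA_TO_IPA:
            symbols.append(transcription[i])
            i += 1
        else:
            raise ValueError(f"unknown SAMPA symbol at position {i}: "
                             f"{transcription[i]!r}")
    return symbols

def to_ipa(transcription):
    """Convert a SAMPA string to IPA, symbol by symbol."""
    return "".join(SAMPA_TO_IPA[s] for s in tokenize_sampa(transcription))

if __name__ == "__main__":
    for word in ("Zyst@ma~", "kOm", "pjER"):
        print(word, tokenize_sampa(word), to_ipa(word))
```

Trying the two-character symbols first is what keeps a~ together as one nasal vowel instead of reading a and then failing on the tilde.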
.. Interjections

Table 3.6 lists the most frequent interjections and discourse particles, among which are also included greetings and wishes.

Table 3.6 High frequency interjections and discourse particles

Rank | Form | Rank | Form
1 | ben | 13 | allô
2 | bon | 14 | tiens
3 | hein | 15 | pardon
4 | mh | 16 | merci
5 | ah | 17 | hé
6 | quoi | 18 | tth
7 | hum | 19 | na
8 | eh | 20 | là
9 | oh | 21 | comment
10 | bé | 22 | attention
11 | pff | 23 | ciao
12 | bah | |
. Morpho-syntactic tagging

.. Tagset

The tagset used for French is shown in Table 3.7. The construction of a tagset for a given language raises many theoretical difficulties. These difficulties are very rarely emphasised in the computational linguistics literature, but linguists have discussed them at length.4 For centuries, defining the categories of words in a language has been a subject of disagreement. This certainly has not changed in modern linguistics. There are two main types of difficulties.

First, theoretical decisions and analyses can differ among scholars. A well-known example is whether prepositions and adverbs should be distinguished in cases such as:

(1) il l’a posé contre la table
    [he leaned it against the table]
(2) il l’a posé contre
    [he leaned it (against)]

Traditionally, the contre in (1) is considered a preposition, while in (2) it is considered an adverb. However, there are many arguments in favour of a common analysis, with the second case differing only in the fact that no complement follows the preposition. Whereas this is an obvious example, many other cases are less straightforward and lead to considerable debate among linguists. The analysis of the very common word que, for example, can fill entire chapters of books (Attal 1999) or monographs on linguistics (see Deulofeu 1999).
Table 3.7 The French C-ORAL-ROM PoS tagset

PoS | Tag | Sub-type | Examples
Adjectives | ADJ:ORD | Ordinal | premier, deuxième, troisième
Adjectives | ADJ:QUA | Qualifying | petit, grand, vrai
Adverbs | ADV | | ne, pas, oui, alors, très, pratiquement
Conjunctions | CON:COO | Coordination | et, ou, mais
Conjunctions | CON:SUB | Subordination | que, parce que, comme, quand, si
Determiners | DET:DEF | Definite | le, la, les
Determiners | DET:DEM | Demonstrative | ce, cette, ces
Determiners | DET:IND | Indefinite | un, une, tout, quelques, plusieurs
Determiners | DET:INT | Interrogative | quel
Determiners | DET:POS | Possessive | mon, ma, ton, ta
Interjections and discourse particles | INT | | ben, bon, hein, mh, ah
Nouns | NOM:COM | Common | heure, temps, travail, langue
Nouns | NOM:PRO | Proper | France, Marseille, Freud, Roosevelt
Numerals | NUM | | deux, trois, mille, cent
Prepositions | PRE | | de, à, pour, dans, sur
Pronouns | PRO:DEM | Demonstrative | ce, ça, celui, cela, ceci
Pronouns | PRO:IND | Indefinite | un, une, tout, rien, quelqu’un
Pronouns | PRO:PER | Personal | je, tu, il, elle, y, en, se
Pronouns | PRO:POS | Possessive | mien
Pronouns | PRO:RIN | Relative/interrogative | qui, que, où, quoi, dont, laquelle
Verbs | VER:CON:PRE | Conditional, present | aurait, serait, dirais
Verbs | VER:IMP:PRE | Imperative, present | attends, allez, écoutez
Verbs | VER:IND:FUT | Indicative, future | sera, aura, fera, seront, pourra, faudra
Verbs | VER:IND:IMP | Indicative, imperfect | était, avait, faisait, fallait, disait, allait
Verbs | VER:IND:PAS | Indicative, past | fut, vint
Verbs | VER:IND:PRE | Indicative, present | est, a, ai, sont, va, ont, peut, fait, sais, suis
Verbs | VER:INF | Infinitive | avoir, parler, faire
Verbs | VER:PAR:PAS | Participle, past | fait, dit, été, eu, vu, pris, pu, mis
Verbs | VER:PAR:PRE | Participle, present | étant, disant, faisant, ayant
Verbs | VER:SUB:IMP | Subjunctive, imperfect | fût, vînt
Verbs | VER:SUB:PRE | Subjunctive, present | soit, ait, puisse, fasse, aille
Uncategorisable | XXX:ETR | Foreign word | check up, Eine Sache
Uncategorisable | XXX:EUP | Euphonic particle | -t-, l’
Uncategorisable | XXX:TIT | Title | “Tapas_Café”, “With_Full_Force”, “Retour_des_Vampires”, “Fables_de_La_Fontaine”
The second type of difficulty arises from the fact that many words in a language lexicon have a complex behaviour that cannot be easily categorised. Unfortunately, these words are very often the most frequent ones, especially in spoken language. Paradoxically, they are also the words that linguists often pay less attention to. For example,
Estelle Campione, Jean Véronis, and José Deulofeu
the French word voilà has exotic properties. It has some of the features of a verb, but not all of them, e.g. it accepts clitics and negations: en voilà, ne voilà-t-il pas?. It is close to a preposition in other respects, e.g. nous nous sommes rencontrés voilà six mois. It can also function as an interjection or a discourse particle, e.g. voilà, et voilà, quoi, etc. Some linguists have invented non-traditional categories (“presentatives”, “introducers”, etc.), but it is difficult to say whether this helps or actually hinders the situation.5
As far as spoken language is concerned, “small words” such as hein, ben, bon, alors, quoi and, of course, voilà, already mentioned above, are often among the most difficult words to categorise. In addition to the traditional though hard to sustain labels like “adverb” or “interjection”, these words have also been labelled “discourse particles”, “discourse markers”, “inserts”, etc. In fact, there is no widely accepted analysis concerning these words. They also are very often homographs of adjectives, pronouns, or adverbs. This renders automatic tagging very difficult, as illustrated in examples (3) and (4):
(3) a. alors bon tu viens? (discourse particle) [then well are you coming?]
    b. le compte est bon (adjective) [the bill is right]
(4) a. il est malade quoi (discourse particle) [he is sick you know]
    b. il a mangé quoi? (interrogative pronoun) [what did he eat?]
A summary of the ongoing disagreement among French linguists and grammarians concerning word categories can be found in Wilmet (1997). The aim of the C-ORAL-ROM project, however, was not to put an end to centuries of linguistic disputes over word categories. As is the case for most computer-based corpus processing, the tagset we propose is a compromise between traditional scholarly or dictionary-like categories and what can be achieved using the latest computer technologies, informed by modern linguistics. Indeed, this practical approach is consistent with the project’s main goal, which is to provide a set of empirical data presented in a way that makes it available for further analyses regardless of the theoretical framework, and that can be used as a testing ground for previous analyses, rather than to offer ready-made analyses. The most notable features of the tagset are outlined below:
1. Two categories of adjectives are distinguished: qualifying and ordinal. Other words that are traditionally called adjectives are listed under determiners or numerals.
2. Interjections and discourse particles are grouped together, given how difficult it is to differentiate them.
3. Cardinals are grouped in a numeral category, because it seemed pointless to create a four-way ambiguity for each of them (viz. determiner as in trois hommes, adjective as in les trois hommes, noun as in le trois juillet, and pronoun as in les trois). This ambiguity would have been extremely difficult to solve automatically.
4. The fusion of prepositions + determiners (e.g. du = de + le) is simply given the tag corresponding to the determiner (e.g. du/DET:DEF), because previous experiments show that more complex tags create difficulties when processing the corpus.
5. Relative and interrogative pronouns (qui, quoi, etc.) are not distinguished, because this remains unachievable even with current automatic processing technology. French linguists disagree on whether or not to make such a distinction. Indeed, some of them prefer to group all of these words in a single category of qu- words with morphological variants according to specific syntactic contexts.
6. The traditional distinction between the conjunction que and the relative pronoun que is maintained, in spite of strong arguments in modern linguistics in favour of a complementiser (conjunction) analysis of the relative que (either in relative clauses or in cleft sentences). This is because many possible syntactic analyses in a generative framework could be based on the distinction between clauses in which a movement operation occurs, such as relative clauses, and those in which it does not. Our tagging allows automatic retrieval of both structures. This decision could be considered somewhat contradictory with the conjunction analysis of the comparative que: it has indeed been argued that the comparative que clause has properties of movement. In that case as well, the decision is a practical one: comparative structures can easily be retrieved using the quantifier that is always correlated with this type of que clause.
7. A residual XXX category is created, containing unclassifiable cases such as foreign words, titles or “euphonic particles” that have no precise linguistic function (-t-, l’ before on).
We would like to emphasise the fact that skilled linguists are well aware of the grammatical issues in French grammar. In our opinion, automatic tagging is not merely a tool that helps linguists collect approximate distributions of empirical data: it does so faster than is possible with paper and pencil, and more accurately than a simple raw-text concordance program.
.. Multiword expressions
Multiword expressions were detected in every grammatical category. For instance, in the corpus the most frequent ones were: parce_que; un_peu; est-ce-que; quand_même; en_fait; par_exemple; quelque_chose; bien_sûr; beaucoup_de; en_plus
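As an illustration of how such units can be joined into single tokens with an underscore, the sketch below applies a greedy longest-match merge against a small lexicon. It is not the procedure used by Cordial Analyzer or by the project, which the text does not describe. The function name `merge_multiwords`, the `max_len` parameter and the sample sentence are hypothetical, and the lexicon contains only expressions taken from the list above.

```python
def merge_multiwords(tokens, lexicon, max_len=4):
    """Greedily join the longest known multiword sequence at each position."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-word sequences.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = "_".join(tokens[i:i + n])
            if candidate in lexicon:
                merged.append(candidate)
                i += n
                break
        else:
            # No multiword unit starts here: keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged


# Lexicon limited to expressions listed in the text (illustrative subset).
LEXICON = {"parce_que", "un_peu", "quand_même", "en_fait", "par_exemple",
           "quelque_chose", "bien_sûr", "beaucoup_de", "en_plus"}

# Hypothetical sample utterance, not taken from the corpus.
sentence = "il est parti parce que en fait il avait un peu peur".split()
units = merge_multiwords(sentence, LEXICON)
print(units)
```

A purely lexical merge of this kind will sometimes paste together words that should stay separate, which is the type of multiword error reported in the evaluation (e.g. de_quoi in discuter de quoi).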
.. Tools and strategy adopted for automatic PoS tagging and lemmatisation
Our tagging strategy is a rather complex one. It is based on Cordial Analyzer, which at the moment is probably the best morpho-syntactic tagger for French. Cordial Analyzer was developed by the Synapse Development company6 with considerable input from our team. It uses the technology and modules developed by Synapse, which have been incorporated in Microsoft Word’s French spelling corrector. The characteristics of this tool are:
1. a very large dictionary (about 142,000 lemmas and 900,000 forms);
2. a combination of statistics and detailed linguistic rules;
3. shallow parsing capabilities, which make it possible to take into account relations other than strictly local ones;
4. a remarkable robustness with respect to errors and non-standard language.
This last feature, which comes from the sophisticated error-correction modules, is particularly important for processing spoken corpora, and explains to a large extent the very high results obtained on the French C-ORAL-ROM tagging (see Section 3.3.4). For example, the tagger is capable of detecting repetitions, such as le le euh le chien, and is not fooled by occurrences that are not ‘normal’ bigrams in the language, such as le le. Such repetitions very often lead to tagging errors for most taggers, which are usually trained on written corpora.
Cordial Analyzer has been supplemented by our own dictionary and by various pre- and post-processing modules developed by our team, which adapt it to the spoken language and correct a number of residual errors. These modules also change some of Cordial’s tagging decisions. The most noticeable example concerns discourse particles such as bon or quoi, which are not treated at all by Cordial, given its orientation towards written language. One of our post-processing modules changes the original tagging when appropriate, using linguistic rules based on the local context, as illustrated in examples (5) and (6):
(5) le chocolat est bon\ADJ [the chocolate is good] ⇒ no change
(6) et alors bon\ADV je lui ai dis [and then well I said to him] ⇒ bon\INT
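The kind of context-based retagging shown in (5) and (6) can be sketched as follows. This is only an illustration under stated assumptions: the actual rules of the post-processing module are not given in the text, so the cue used here (a particle tagged ADV that is utterance-initial or follows a linker such as alors) is hypothetical, as are the function name `retag_particles`, the `LINKERS` set, and the tags assigned to the surrounding words.

```python
PARTICLES = {"bon", "quoi"}        # forms mentioned in the text
LINKERS = {"alors", "ben", "et"}   # hypothetical left-context cue


def retag_particles(tagged):
    """Return a copy of (form, tag) pairs with likely discourse particles retagged INT."""
    result = []
    for i, (form, tag) in enumerate(tagged):
        previous = tagged[i - 1][0] if i > 0 else None
        # Hypothetical rule: only ADV readings are reconsidered, and only
        # at the start of the utterance or right after a linker.
        if form in PARTICLES and tag == "ADV" and (previous is None or previous in LINKERS):
            tag = "INT"
        result.append((form, tag))
    return result


# Example (5): bon is an adjective and must stay unchanged.
ex5 = [("le", "DET:DEF"), ("chocolat", "NOM:COM"), ("est", "VER:IND:PRE"), ("bon", "ADJ")]
# Example (6): bon was tagged ADV and should become INT.
ex6 = [("et", "CON:COO"), ("alors", "ADV"), ("bon", "ADV"), ("je", "PRO:PER"),
       ("lui", "PRO:PER"), ("ai", "VER:IND:PRE"), ("dis", "VER:PAR:PAS")]

print(retag_particles(ex5)[-1])
print(retag_particles(ex6)[2])
```

Restricting the rule to a tag that the written-language tagger is known to over-assign keeps genuine adjective uses, as in (5), out of reach of the rewrite.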
Since Cordial can flag spelling errors and/or unknown words, the tagging process enabled us to detect spelling errors that remained in the transcribed corpus. The errors were manually corrected before the final tagging. After correcting the spelling errors in the transcription, only 311 tokens remained unknown to the tagger (200 different types). All these tokens were then checked manually and added to an ad hoc dictionary used by a final tool that tagged them appropriately. Among these tokens, we found neologisms (e.g. C-plus-plussien), rare or specialised words (e.g. carbichounette), familiar abbreviations (e.g. ophtalmo), alphanumeric acronyms and codes (e.g. Z14), and foreign words (e.g. tribulum, brownies).
.. Evaluation
In the case of the unknown words described above, the system still attributed a tag such as NOM:COM (which in many cases was right: the ambiguity was mostly with XXX:ETR). In that respect, system recall was 100%. However, since an ad hoc dictionary was created, it is probably fair to exclude the 311 tokens concerned from the recall. Even this did not change the results much, since it applied to a corpus of about 300,000 tokens: recall measured that way still reached about 99.9% (1 − 311/300,000).
The precision figure obtained is more interesting. It was evaluated by drawing a 20-token sample from each of the 164 texts comprising the corpus, i.e. a sub-corpus of 3,280 tokens, representing about 1/100th of the entire corpus (where ‘token’ means either a single word or a multiword unit). Elementary statistics show that this size is sufficient for ensuring a 95% confidence interval no larger than 1%, and therefore a very precise evaluation is performed, as will be shown below. Tagging was checked and corrected manually, and the errors were categorised according to several criteria:
a. error on main category;
b. main category correct, but error on subtype;
c. error on lemma;
d. error on multiword grouping.
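The claim about the width of the confidence interval can be checked with a short computation. The sketch below uses the normal (Wald) approximation to the binomial, which is an assumption: the text only says that the intervals were computed using the Binomial law, without naming a method. The function name `wald_interval` is hypothetical; the sample design (164 texts, 20 tokens each) and the count of 58 erroneous tokens are the figures reported in this section.

```python
import math


def wald_interval(errors, n, z=1.96):
    """Normal-approximation 95% interval for precision = 1 - errors / n."""
    p = 1 - errors / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


n = 164 * 20                      # 20 tokens drawn from each of the 164 texts
precision, lo, hi = wald_interval(58, n)
print(f"n = {n}, precision = {precision:.2%}, 95% CI = [{lo:.1%}, {hi:.1%}]")
print(f"interval width = {hi - lo:.2%}")
```

Under this approximation the interval runs from roughly 97.8% to 98.7%, a width just under 1%, which is close to the interval reported for erroneous tokens in Table 3.8.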
The tagger’s performance was excellent, since only 58 tokens presented an error of one type or another, or, occasionally, two errors combined. This amounts to a 1.77% error rate, i.e. a precision of 98.23%. This is a very high figure by current standards, especially for spoken corpora. Table 3.8 lists the errors by type. The rightmost column shows a 95% confidence interval, computed using the Binomial law. This can be used to evaluate the impact of possible variations in sampling, as well as of the sample size. We can see that in all cases the confidence interval is smaller than 1%, which is more than satisfactory for this type of evaluation, given that the disagreement among linguists on the correctness of tags is probably of the same order, if not greater.
Table 3.8 Distribution of error types
Type of error          No. of errors   % error   Precision   95% CI
Category               41              1.25%     98.75%      98.3% – 99.1%
SubCategory            8               0.24%     99.76%      99.5% – 99.9%
Tag (Cat or SubCat)    49              1.49%     98.51%      98.0% – 98.9%
Lemma                  4               0.12%     99.88%      99.7% – 100.0%
Multiword              8               0.24%     99.76%      99.5% – 99.9%
Erroneous tokens       58              1.77%     98.23%      97.7% – 98.7%
The main category was correctly allocated in 98.75% of cases. It is worth noting that the tagging of verbs was 100% correct on the sample. This confirms previous studies by our team, which reported over 99% correctness on this category (e.g. Valli & Véronis 1999). On the other hand, the most difficult categories for tagging were adverbs and conjunctions. The matrix of confusion between categories is given in Table 3.9. One can see that most of the confusions occurred between (difficult) grammatical categories, for example:
a. adverb vs. conjunction (e.g. si);
b. adverb vs. preposition (e.g. avec);
c. preposition vs. interjection/particle (e.g. voilà);
d. pronouns vs. determiners (e.g. un, tout);
e. relative pronoun vs. conjunction (e.g. que).
Table 3.9 Confusion matrix between main categories Correct ADV CON DET INT NUM PRE PRO XXX ADJ NOM VER Tot. E r r o r
ADV CON 9 DET INT NUM PRE PRO XXX ADJ 1 NOM VER Total 10
3
2
5 16
7
5 1
4
5 4
3
3
1 1 7
2
1
2
3
5 7
5
1 9 1 41
In a few cases, grammatical words (e.g. pendant) were incorrectly tagged as a major category, mainly as a noun. Mistagging rarely occurred across major categories (ADJ, NOM, VER). The only errors concerned the difficult distinction, in French, between adjectives and nouns, since many words can be both, as seen in example (7):
(7) quelqu’un qu’on sent de passionné/NOM:COM ⇒ should be ADJ:QUA [somebody who feels passionately]
In a few cases, the main category was correct, but the sub-category was wrong, as illustrated in Table 3.10. Half of these cases involved the determiner des, which can be either a preposition fused with a definite article, equivalent to de + les (and therefore coded DET:DEF), or an indefinite article (DET:IND). Examples (8) and (9) show these two types respectively:
(8) il sort des grandes écoles [he comes from Grandes écoles]
(9) il mange des pommes [he eats some apples]
Table 3.10 Confusion matrix between subcategories
                Correct
Error           DET:IND   NOM:COM   NOM:PRO   VER:IMP   Total
DET:DEF         4         -         -         -         4
NOM:COM         -         -         2         -         2
NOM:PRO         -         1         -         -         1
VER:IND         -         -         -         1         1
Total           4         1         2         1         8
Three other errors involved a confusion between proper and common noun when a given form could be both, e.g. côtes/Côtes (as in Côtes de Beaune), salon/Salon (a city in Provence). As opposed to the case of the determiner, this error will be easy to solve in future versions, since the initial capital letter, in spoken corpora, is a non-ambiguous cue marking proper names. The last case involved confusion between present indicative and imperative (e.g. attends). Overall, an error in the tag occurred 49 times, either in the main category or in the sub-category, i.e. 1.49% of cases. This corresponds to a precision of 98.51% of tags, a result that is consistent with previous evaluations conducted by our team: Valli and Véronis (1999) found a 97.9% precision in their experiment. This improvement is due to better dictionaries, better treatment of multiword expressions, and the taking into account of discourse particles which were previously ignored. While most multiword expressions from the sample were processed correctly, a total of 8 cases were not. These involved either multiword units that were not recognised (e.g. d’après [quelqu’un], un petit peu, which inexplicably were lacking from the dictionary), or words which should not have been pasted together, the latter being more frequent, with a total of six cases. For instance, discuter de quoi was tagged discuter/VER:INF + de_quoi/INT instead of being tagged as three separate words.
.. Main data from lemmatisation
Before we discuss specific phenomena (e.g. grammatical categories) using the quantitative data extracted from the tagged corpus, a word of caution must be given about using C-ORAL-ROM for linguistic descriptions in general and for French ones in particular. C-ORAL-ROM does not constitute a large enough database for corpus-based linguistic analyses of low frequency phenomena. Ever since the publication of the Longman Grammar of Spoken and Written English (Biber et al. 1999), linguists have accepted that corpus-based linguistic descriptions could only be reliable if each observed genre or register comprised a corpus of at least four to ten million words, depending on what phenomenon was studied. It is obvious that our 300,000-word database is far from meeting this requirement. For instance, it would be misleading to base an analysis of French subordinate clauses of ‘consequence’ on this corpus, since fewer than ten examples of those conjunctions which are traditionally considered the main ones (de sorte que, si bien que, au point que, etc.) are available in the corpus.
However, C-ORAL-ROM can be used in pilot studies and for checking hypotheses, thanks to the combination of the following unique properties: text–sound alignment, morpho-syntactic tagging, and careful sampling. These properties offer new ways of elaborating hypotheses on major lexical issues, such as:
a. Grammaticalisation and reanalysis of linguistic items. For example, concerning ‘consequence’, a subordinating layout unknown to traditional descriptions emerged, based on grammaticalised strings, such as ça fait que, qui fait que. There were twice as many occurrences of this layout as there were occurrences of traditional expressions such as de sorte que, si bien que, au point que.
b. Lexical constraints on grammar rules. For instance, ‘long dependencies’ constructions appear limited to a narrow set of verbs in a quasi-formulaic use, e.g. vouloir, falloir, dire, etc. as in c’est là que je dis que le communisme a existé.
c. Relations between nouns/verbs and other words. By contrasting informal speech with formal speech and written language, based on factors such as the noun/verb (N/V) ratio or the descriptive weight of the more frequent words, we hypothesise that in informal styles, verbs carry the burden of descriptive semantics, setting a representational frame for the situation that is being described. Nouns act more as procedural words, used for reference-tracking purposes or as general classifiers, except in specific texts containing technical terms, as in professional speech.
Like other teams, we found it important to contrast our data with the data observed in written style. For the sake of comparison, we used the Syntsem corpus created by Jean Véronis and Benoît Habert, made up of five 1-million-word categories:7
1. Literature
2. Press
3. European parliamentary texts
4. Scientific journals
5. Human science monographs
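The noun/verb (N/V) ratio mentioned in point (c) above can be derived directly from the tags of Table 3.7. The following is a minimal sketch, not the project's own computation: the function names and the sample tag sequence are hypothetical, proper nouns (NOM:PRO) are assumed to count as nouns, and auxiliaries and modals count as verbs because they are not tagged separately from main verbs.

```python
from collections import Counter


def main_category(tag):
    """Return the main category of a tag, e.g. 'VER' for 'VER:IND:PRE'."""
    return tag.split(":")[0]


def noun_verb_ratio(tags):
    """Ratio of NOM tokens to VER tokens in a sequence of PoS tags."""
    counts = Counter(main_category(tag) for tag in tags)
    if counts["VER"] == 0:
        raise ValueError("no verb tokens in the sample")
    return counts["NOM"] / counts["VER"]


# Hypothetical tag sequence for a short informal utterance.
sample = ["PRO:PER", "VER:IND:PRE", "VER:PAR:PAS", "DET:DEF", "NOM:COM", "ADV", "VER:INF"]
ratio = noun_verb_ratio(sample)
print(f"N/V ratio: {ratio:.2f}")
```

Computed per text or per register, a ratio below 1 corresponds to the verb-heavy profile that the hypothesis associates with informal speech.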
We discuss verbs, nouns and adjectives below. For each category we present three tables which show the top 100 words in speech, the top 100 words in written discourse, and a short comparison of the most striking differences and similarities between spoken and written discourse.
As can be seen from Tables 3.11a to c, in both spoken and written discourse, the most frequent verbs were, as expected, auxiliaries and ‘modal’ verbs (it was not possible to automatically separate auxiliaries and modals from main verbs). If we focus on content verbs (i.e. neither auxiliaries nor modals), we see as expected that, in speech, the semantic fields are more connected to the concerns of everyday life, while in written language they are mostly connected to society and politics. More interestingly however, we can see that in the written corpus, verbs have more of a descriptive meaning, especially compared to nouns. In contrast with nouns, verbs describe the situation they refer to with accuracy. It is therefore the verbs that give the conceptual and descriptive import. It appears that in spoken language, verbs tend to describe the situation in terms of the main action that is performed in it, whereas in written language they tend to describe how speakers react to the situation or comments that are made about facts and actions.
Table 3.11a High frequency verbs in spoken discourse (C-ORAL-ROM)
Rank Lemma           Rank Lemma           Rank Lemma           Rank Lemma
1    être            26   partir          51   tenir           76   porter
2    avoir           27   regarder        52   acheter         77   descendre
3    faire           28   essayer         53   servir          78   obliger
4    aller           29   attendre        54   sentir          79   habiter
5    dire            30   rester          55   marcher         80   lire
6    pouvoir         31   vivre           56   devenir         81   tomber
7    voir            32   commencer       57   payer           82   mourir
8    savoir          33   comprendre      58   envoyer         83   voter
9    vouloir         34   demander        59   recevoir        84   tourner
10   falloir         35   sortir          60   finir           85   proposer
11   passer          36   revenir         61   expliquer       86   occuper
12   prendre         37   rentrer         62   manger          87   montrer
13   mettre          38   retrouver       63   perdre          88   entrer
14   venir           39   permettre       64   imaginer        89   sembler
15   parler          40   rendre          65   créer           90   rencontrer
16   arriver         41   entendre        66   continuer       91   reprendre
17   penser          42   rappeler        67   monter          92   quitter
18   trouver         43   laisser         68   arrêter         93   organiser
19   devoir          44   écouter         69   récupérer       94   oublier
20   donner          45   exister         70   garder          95   aider
21   appeler         46   jouer           71   apprendre       96   souvenir
22   croire          47   changer         72   utiliser        97   plaire
23   connaître       48   poser           73   raconter        98   décider
24   aimer           49   ouvrir          74   suivre          99   choisir
25   travailler      50   chercher        75   écrire          100  retourner
Table 3.11b High frequency verbs in written discourse (Syntsem)
Rank Lemma           Rank Lemma           Rank Lemma           Rank Lemma
1    être            26   porter          51   créer           76   écrire
2    avoir           27   devenir         52   chercher        77   offrir
3    pouvoir         28   prévoir         53   établir         78   fournir
4    faire           29   constituer      54   apparaître      79   développer
5    dire            30   montrer         55   disposer        80   recevoir
6    devoir          31   présenter       56   obtenir         81   servir
7    donner          32   parler          57   indiquer        82   exprimer
8    voir            33   comprendre      58   estimer         83   accorder
9    prendre         34   considérer      59   compter         84   mener
10   aller           35   partir          60   entrer          85   rappeler
11   mettre          36   rendre          61   paraître        86   revenir
12   savoir          37   connaître       62   produire        87   conduire
13   permettre       38   entendre        63   poser           88   effectuer
14   vouloir         39   laisser         64   appliquer       89   ajouter
15   falloir         40   croire          65   représenter     90   définir
16   venir           41   exister         66   reconnaître     91   continuer
17   trouver         42   penser          67   réaliser        92   lier
18   tenir           43   assurer         68   regarder        93   décider
19   concerner       44   suivre          69   adopter         94   apporter
20   agir            45   proposer        70   relever         95   fonder
21   passer          46   expliquer       71   attendre        96   envisager
22   répondre        47   jouer           72   ouvrir          97   engager
23   demander        48   utiliser        73   reprendre       98   tirer
24   sembler         49   appeler         74   imposer         99   perdre
25   rester          50   viser           75   arriver         100  vivre
Table 3.11c High frequency verbs: comparison between spoken and written discourse
Common      Speech      Writing
aller       falloir     devoir
avoir       savoir      donner
dire        vouloir     prendre
être
faire
pouvoir
voir
Two interesting comparisons can be made here, based on what is seen in Tables 3.12a to c. First, the semantic import of nouns is lower in oral informal styles. Nouns had more general meanings: spatio-temporal coordinates and reference points or units, categorising nouns (objective: persons/things/old people/young people; subjective: problems/favourable entities or events). It is interesting to note the particular frequency of nouns indicating time: an, heure, année and temps are among the top 10 most frequent nouns in the corpus. In formal styles (including formal speech), nouns had more specific meanings, closer to the list of verbs (note the presence of various nominalisations).
Table 3.12a High frequency nouns in spoken discourse (C-ORAL-ROM)
Rank Lemma           Rank Lemma           Rank Lemma           Rank Lemma
1    an              26   monde           51   minute          76   sens
2    heure           27   madame          52   classe          77   nombre
3    chose           28   ballon          53   fille           78   mort
4    gens            29   femme           54   service         79   jeune
5    année           30   pays            55   groupe          80   client
6    problème        31   mois            56   bonjour         81   père
7    temps           32   histoire        57   impression      82   part
8    personne        33   mot             58   école           83   objectif
9    homme           34   ville           59   pièce           84   étude
10   truc            35   dieu            60   état            85   place
11   vie             36   niveau          61   équipe          86   droit
12   façon           37   dictionnaire    62   début           87   ami
13   monsieur        38   mètre           63   oeil            88   film
14   enfant          39   matin           64   métier          89   élève
15   travail         40   cas             65   franc           90   eau
16   jour            41   point           66   fois            91   terre
17   voiture         42   idée            67   peu             92   copain
18   langue          43   compte          68   forme           93   semaine
19   moment          44   partie          69   besoin          94   frère
20   question        45   français        70   amour           95   président
21   côté            46   entreprise      71   raison          96   anglais
22   famille         47   photo           72   envie           97   pied
23   maison          48   guerre          73   départ          98   petit
24   soir            49   rue             74   société         99   position
25   coup            50   radio           75   rapport         100  voix
Table 3.12b High frequency nouns in written discourse (Syntsem)
Rank Lemma           Rank Lemma           Rank Lemma           Rank Lemma
1    commission      26   cadre           51   situation       76   mesure
2    état            27   jour            52   groupe          77   plan
3    objet           28   part            53   cours           78   vue
4    membre          29   exemple         54   matière         79   intérêt
5    droit           30   années          55   aide            80   information
6    communauté      31   problème        56   sens            81   conditions
7    homme           32   monde           57   terme           82   protection
8    fait            33   politique       58   place           83   application
9    réponse         34   directive       59   type            84   moment
10   pays            35   mesures         60   société         85   forme
11   conseil         36   accord          61   niveau          86   emploi
12   effet           37   marché          62   ensemble        87   histoire
13   nom             38   principe        63   ordre           88   manière
14   travail         39   service         64   million         89   environnement
15   cas             40   auteur          65   recherche       90   résultat
16   temps           41   nombre          66   produit         91   idée
17   projet          42   action          67   pouvoir         92   moyen
18   programme       43   lieu            68   ministre        93   raison
19   question        44   vie             69   loi             94   parlementaire
20   système         45   développement   70   production      95   milieu
21   an              46   domaine         71   autorités       96   enfant
22   point           47   partie          72   discours        97   objectif
23   article         48   compte          73   fin             98   communication
24   fois            49   mise            74   rôle            99   suite
25   rapport         50   gouvernement    75   oeuvre          100  nature
Table 3.12c High frequency nouns: comparison between spoken and written discourse
Common      Spoken      Written
homme       an          commission
            année       communauté
            chose       droit
            gens        état
            heure       fait
            personne    membre
            problème    objet
            temps       pays
            truc        réponse
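The comparison in Table 3.12c can be reproduced from the two ranked lists: it amounts to intersecting the ten most frequent lemmas of each corpus. The sketch below is a minimal illustration rather than the authors' procedure. The function name `compare_top` and the cut-off parameter `k` are hypothetical, and the input lists are the first ten ranks of Tables 3.12a and 3.12b.

```python
def compare_top(spoken, written, k=10):
    """Split the top-k lemmas of two ranked lists into common and exclusive sets."""
    top_spoken, top_written = spoken[:k], written[:k]
    # Rank order is preserved so that the output can be read against the tables.
    common = [lemma for lemma in top_spoken if lemma in top_written]
    only_spoken = [lemma for lemma in top_spoken if lemma not in top_written]
    only_written = [lemma for lemma in top_written if lemma not in top_spoken]
    return common, only_spoken, only_written


# Ranks 1-10 of Table 3.12a (spoken) and Table 3.12b (written).
spoken_nouns = ["an", "heure", "chose", "gens", "année",
                "problème", "temps", "personne", "homme", "truc"]
written_nouns = ["commission", "état", "objet", "membre", "droit",
                 "communauté", "homme", "fait", "réponse", "pays"]

common, only_spoken, only_written = compare_top(spoken_nouns, written_nouns)
print("Common: ", common)
print("Spoken: ", only_spoken)
print("Written:", only_written)
```

The same function applied to the first ten ranks of the verb or adverb tables gives the contents of the corresponding comparison tables.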
Second, there were fewer nouns than verbs (see N/V ratio reported in the Diagram Menus DVD 1.1 and DVD 1.2), especially in informal styles. The most frequent nouns act more as classifiers (gens, homme, jeunes, vieux, trucs, problème, façon) or general spatio-temporal frames of the situations being described (jour, semaine, mois, pays, ville). The first true descriptive word only came in fifteenth position (travail). This shows that cognitive import was mostly given by verbs. Nouns were only specific in particular technical areas.
In Table 3.13a, we see that, in speech, adverbs, like adjectives, were more devoted to functional meaning than to content meaning. Many of the top-ranked adverbs were modal operators: negation, assertion, question markers, modals (vraiment), quantifiers, aspectuals (toujours, encore), discourse linkers (alors, puis). Even in the domain of content words acting as modifiers, the highest ranks belonged to pro-forms and not to specific words. Time was represented by the deictics maintenant and aujourd’hui, place was represented by the pro-form là, and manner by the general-purpose bien. Only in the last part of the list was manner represented by ‘true’ adverbs of manner: vite, rapidement, directement. Other -ment adverbs had quantifier, modal or discursive functions: évidemment, notamment, simplement. This last adverb functioned much more as an act of speech (‘to put it plainly’) than as an adverb of manner (‘with simplicity’). This for the most part also held true in written discourse, as seen in Table 3.13b: there were more true adverbs of manner in the latter part of the list, whereas for both styles the earlier portion of the list, i.e. the higher frequency items, comprised functional words. The main difference seemed to be related to the degree of interactivity or discourse planning (enfin was mainly used as a ‘hesitation’ mark in oral styles) as well as pure stylistic considerations (ainsi could be considered a formal variant of alors). Là deserves special treatment, since the distinction between its use as an adverb of place and as an interactive particle is not always easy to make.
Table 3.13a High frequency adverbs in spoken discourse (C-ORAL-ROM)
Rank Lemma             Rank Lemma             Rank Lemma             Rank Lemma
1    pas               26   déjà              51   avec              76   également
2    ouais             27   après             52   finalement        77   avant
3    oui               28   peu               53   complètement      78   absolument
4    ne                29   assez             54   d’abord           79   tout_de_suite
5    là                30   peut-être         55   comment           80   non_plus
6    non               31   trop              56   effectivement     81   en_tout_cas
7    alors             32   ici               57   tout_à_fait       82   au_revoir
8    très              33   comme             58   pas_du_tout       83   longtemps
9    plus              34   par_exemple       59   tout_à_l’heure    84   loin
10   enfin             35   jamais            60   presque           85   hier
11   bien              36   justement         61   mieux             86   apparemment
12   aussi             37   tout              62   à_l’époque        87   rapidement
13   puis              38   bien_sûr          63   en_train          88   plus_tard
14   si                39   souvent           64   en_général        89   demain
15   un_peu            40   ensuite           65   à_peu_près        90   parfois
16   est-ce_que        41   aujourd’hui       66   à_l’intérieur     91   O.K.
17   en_fait           42   en_plus           67   vite              92   non_pas
18   vraiment          43   surtout           68   une_fois          93   exactement
19   toujours          44   etc.              69   notamment         94   des_fois
20   d’accord          45   d’ailleurs        70   seulement         95   simplement
21   encore            46   mal               71   vachement         96   énormément
22   c’est-à-dire      47   moins             72   forcément         97   à_côté
23   maintenant        48   plutôt            73   pratiquement      98   directement
24   beaucoup          49   évidemment        74   ensemble          99   tellement
25   même              50   là-bas            75   en_même_temps     100  depuis
Table 3.13b High frequency adverbs in written discourse (Syntsem)
Rank Lemma             Rank Lemma             Rank Lemma             Rank Lemma
1    ne                26   tant              51   autour            76   près
2    pas               27   souvent           52   maintenant        77   généralement
3    plus              28   trop              53   parfois           78   rapidement
4    bien              29   aujourd’hui       54   particulièrement  79   guère
5    ainsi             30   jamais            55   presque           80   vraiment
6    aussi             31   enfin             56   fort              81   essentiellement
7    si                32   autant            57   directement       82   autrement
8    non               33   cependant         58   etc               83   bientôt
9    même              34   après             59   en                84   au-delà
10   encore            35   beaucoup          60   longtemps         85   clairement
11   très              36   surtout           61   environ           86   parfaitement
12   alors             37   point             62   ensuite           87   pourquoi
13   moins             38   c’est-à-dire      63   désormais         88   certainement
14   tout              39   d’abord           64   quelque           89   relativement
15   là                40   toutefois         65   mal               90   effectivement
16   toujours          41   pourtant          66   certes            91   depuis
17   déjà              42   mieux             67   tard              92   finalement
18   ailleurs          43   comme             68   récemment         93   combien
19   avec              44   oui               69   davantage         94   au-dessus
20   notamment         45   peut-être         70   conformément      95   fortement
21   lors              46   plutôt            71   voire             96   partout
22   peu               47   comment           72   vite              97   puis
23   également         48   loin              73   largement         98   totalement
24   seulement         49   actuellement      74   simplement        99   ensemble
25   ici               50   assez             75   précisément       100  évidemment
Table 3.13c High frequency adverbs: comparison between spoken and written discourse
Common      Spoken      Written
ne          alors       ainsi
non         enfin       aussi
pas         là          bien
plus        ouais       encore
            oui         même
            très        si
Some findings, expected in informal styles, can be seen in Table 3.14a, such as the high rate of preposed adjectives expressing topological or inherent features of entities (e.g. petit, grand, jeune, gros), as well as the speaker’s evaluation (e.g. bon, beau, difficile, juste). However, there were some surprising figures, such as the high rate of ‘modal adjectives’ qualifying a situation and not an entity (e.g. vrai, sûr, évident). For instance, only 9 occurrences of vrai out of a total of 344 appeared in the context un vrai N. The majority of occurrences were attributive uses with the impersonal c’est, in particular when associated with a subordinate que (c’est vrai que), making us wonder whether c’est vrai que is becoming a kind of marker of modality in French, expressing the utterance’s positive value for truth.
Table 3.14a High frequency adjectives in spoken discourse (C-ORAL-ROM)
Rank  Lemma           Rank  Lemma           Rank  Lemma            Rank  Lemma
1     petit           26    certain         51    content          76    malade
2     vrai            27    sympa           52    anglais          77    faux
3     grand           28    dur             53    premier          78    culturel
4     autre           29    propre          54    bizarre          79    spécial
5     même            30    possible        55    noir             80    principal
6     bien            31    général         56    évident          81    riche
7     bon             32    clair           57    ancien           82    prochain
8     beau            33    normal          58    pratique         83    humain
9     important       34    plein           59    passé            84    véritable
10    seul            35    vieux           60    gauche           85    merveilleux
11    français        36    facile          61    énorme           86    linguistique
12    jeune           37    super           62    cher             87    froid
13    gros            38    grave           63    sexuel           88    théorique
14    bonne           39    mauvais         64    rouge            89    positif
15    dernier         40    joli            65    présent          90    moyen
16    social          41    public          66    particulier      91    magnifique
17    sûr             42    demi            67    national         92    familial
18    politique       43    blanc           68    haut             93    commun
19    nouveau         44    nombreux        69    extraordinaire   94    central
20    différent       45    libre           70    capable          95    scolaire
21    difficile       46    simple          71    fort             96    mondial
22    long            47    meilleur        72    prêt             97    gentil
23    juste           48    économique      73    amoureux         98    espagnol
24    pareil          49    italien         74    professionnel    99    dangereux
25    intéressant     50    droit           75    précis           100   chaud
Table 3.14b High frequency adjectives in written discourse (Syntsem)
Rank  Lemma            Rank  Lemma           Rank  Lemma
1     grand            35    vrai            68    nucléaire
2     européen         36    ancien          69    professionnel
3     autre            37    particulier     70    gros
4     social           38    fort            71    noir
5     nouveau          39    simple          72    privé
6     politique        40    juridique       73    meilleur
7     même             41    américain       74    collectif
8     public           42    libre           75    unique
9     communautaire    43    local           76    principal
10    scientifique     44    nombreux        77    réel
11    seul             45    spécifique      78    essentiel
12    français         46    actuel          79    administratif
13    général          47    haut            80    officiel
14    petit            48    commune         81    élevé
15    économique       49    financier       82    vert
16    bon              50    industriel      83    physique
17    relatif          51    allemand        84    fondamental
18    dernier          52    beau            85    civil
19    différent        53    véritable       86    récent
20    important        54    difficile       87    blanc
21    nationale        55    supérieur       88    précis
22    certain          56    divers          89    grave
23    possible         57    étranger        90    nombreuse
24    international    58    premier         91    mondial
25    honorable        59    agricole        92    commercial
26    nécessaire       60    plein           93    catholique
27    technique        61    vieux           94    intérieur
28    culturel         62    commun          95    total
29    propre           63    suivant         96    religieux
30    jeune            64    électrique      97    historique
31    national         65    moyen           98    démocratique
32    naturel          66    régional        99    intellectuel
33    long             67    faible          100   modern
34    humain
Table 3.14c High frequency adjectives: comparison between spoken and written discourse
Common:   autre, grand, même
Spoken:   beau, bien, bon, important, petit, seul, vrai
Written:  communautaire, européen, nouveau, politique, public, scientifique, social
A greater number of qualifying adjectives were observed in writing, as seen in Table 3.14b, particularly a large number of ‘relational’ adjectives (e.g. économique, communautaire, scientifique), whereas the modal vrai went from being second most frequent in speech, to 35th in writing. Table 3.14c illustrates the most striking differences among styles.
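The contents of Table 3.14c can be reproduced from the two frequency lists with simple set operations. The sketch below assumes that the comparison is restricted to the ten highest-ranked lemmas of each list; the text does not state this cut-off, but it is the reading that matches the table. The function and variable names are illustrative only.

```python
# Ten highest-ranked adjective lemmas, copied from Tables 3.14a and 3.14b.
# ASSUMPTION: Table 3.14c compares only these top-ten lists.
spoken_top10 = ["petit", "vrai", "grand", "autre", "même",
                "bien", "bon", "beau", "important", "seul"]
written_top10 = ["grand", "européen", "autre", "social", "nouveau",
                 "politique", "même", "public", "communautaire", "scientifique"]


def compare_top(spoken, written):
    """Split two ranked lists into common, spoken-only and written-only lemmas."""
    s, w = set(spoken), set(written)
    return {
        "common": sorted(s & w),   # lemmas present in both lists
        "spoken": sorted(s - w),   # lemmas found only in the spoken list
        "written": sorted(w - s),  # lemmas found only in the written list
    }


comparison = compare_top(spoken_top10, written_top10)
for column, lemmas in comparison.items():
    print(f"{column}: {' '.join(lemmas)}")
```

With a different cut-off (for instance the top 25) the three sets would change, so the choice of ten should be treated as part of the sketch rather than as a documented parameter of the study.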
Notes
* Estelle Campione wrote §3.1 and 3.2, Jean Véronis §3.3.1 to 3.3.4 and José Deulofeu §3.3.5.
1. Editors’ note: In this chapter, no corpus inventory is provided for French. The reader can find relevant information in Ambrose (1996), Bilger (2001), Blanche-Benveniste (2000), and Seijido and Cappeau (forthcoming).
2. This is a freeware program that can be downloaded here: http://www.etca.fr/CTA/gip/Projets/Transcriber/.
3. http://www.phon.ucl.ac.uk/home/sampa/home.htm
4. Editors’ note: The reader can find relevant discussions regarding the adequacy of grammatical categorisation for French tagsets in the literature; see the recent works of Abeillé on this topic, and references therein (Abeillé 2002; Abeillé & Blache 2000; Abeillé & Godard 2000; Abeillé & Rambow 2000; Abeillé et al. 2001).
5. Editors’ note: See for discussion Moignet (1969); Morin (1985); Tranel (1973a, b).
6. http://www.synapse-fr.com
7. Editors’ note: For the statistical criteria of representation in frequency lexicons in the French tradition, see Brunet (1986), Engwall (1984), Muller (1985), Sankoff and Cedergren (1981), Sauvageot et al. (1956).
Chapter 4
The Spanish corpus Antonio Moreno, Gillermo de la Madrid, Manuel Alcántara, Ana Gonzalez, José M. Guirao, and Raúl De la Torre
4.1 History of the corpus in the national framework
4.1.1 Historical overview
The Corpus Oral de Referencia de la Lengua Española Contemporánea (CORLEC), the first spontaneous speech corpus for Spanish, compiled in the LLI-UAM (Laboratorio de Lingüística Informática – Universidad Autónoma de Madrid) under the supervision of Francisco Marcos Marín between 1991 and 1992 (Marcos 1992), is the reference for the Spanish corpus for C-ORAL-ROM. However, only one text from the original CORLEC is included in C-ORAL-ROM. The reasons for this virtually new compilation will be addressed in this section. The historical evolution, both in methodology and format, between CORLEC and C-ORAL-ROM will also be outlined. In addition, other similar corpora in Spanish will be compared with those compiled at LLI-UAM.

The two corpora are separated by a decade, from the beginning of the 1990s to the beginning of the 21st century. This time gap allows us to present a historical overview of the progress in knowledge and of those aspects that remain to be improved. The corpora are comparable in several respects: the same language, and the same laboratory (though with different transcribers). Above all, both share the same goal: to record accurately the contemporary spoken varieties produced in spontaneous situations. The methodology has stayed the same, involving monologue and conversation recordings in real settings, without a pre-established script and with no restriction on participants' expression. Both corpora use an orthographic transcription following Spanish written conventions. Finally, both corpora are reference corpora, that is to say, they are created to be used by the scientific community. Their public nature adds to the requirement for accurate transcription, since the transcription can be checked against the original sound source.

Differences between these two corpora may also be noted, but only as far as their historical aspects are concerned.
As we will see, these differences are basically due to the budget and the technology available at the time each was compiled; improvements in the later corpus were possible as a result of experience and contact with other research groups.
The LLI-UAM is a pioneering group in the creation of spoken corpora for the Spanish language in Spain. A comprehensive account of the state of the art in the 1990s may be found in the work of Llisterri (1997).

1. CORLEC (Corpus Oral de Referencia de la Lengua Española Contemporánea), funded by IBM, was the first spoken language corpus for the Spanish language. It was compiled under the supervision of Prof. Marcos Marín. The transcription and mark-up scheme was taken from TEI (Text Encoding Initiative). This corpus was encoded and revised in SGML in 1997 by the Real Academia Española group to be included in the Corpus de Referencia del Español Actual (CREA). The corpus is freely available through the LLI-UAM FTP server (ftp://ftp.lllf.uam.es/pub/corpus/oral/) and also through the CREA retrieval service (http://www.rae.es/).

Among other spoken corpora compiled in Spain in the 1990s, we may find ALBAYZIN, the Corpus de Conversación Coloquial of the Universidad de Valencia (VALESCO) and the Corpus de Variedades Vernáculas Malagueñas of the Universidad de Málaga (VUM).

2. ALBAYZIN is a phonetic database of around 7,000 sentences from 300 speakers. It was developed at the beginning of the 1990s, and its aim was to create a database for speech recognition and for the testing of phonetic transcription systems. The main difference with respect to the corpora compiled within the LLI is that ALBAYZIN is not a corpus of spontaneous speech, but of ‘phonetically balanced’ sentences, that is to say, sentences designed to clearly distinguish phonemes.
3. VALESCO¹ (Corpus de Conversación Coloquial of the Universidad de Valencia) was developed with the aim of studying pragmatic aspects of the Spanish colloquial language. For this goal, the authors collected and transcribed a corpus of conversations. This corpus is clearly different from our corpora. Firstly, VALESCO is not designed as a reference corpus (that is to say, to be used by other research groups), but as a corpus for their own research. Furthermore, it does not include as much register variation as CORLEC or C-ORAL-ROM, where an important part is devoted to formal registers (monologues, conferences, and sermons). Finally, there is no use of computer tools for linguistic mark-up in VALESCO.
4. VUM (Corpus de Variedades Vernáculas Malagueñas of the Universidad de Málaga) is one of the first dialectal spoken corpora. Its goal is clearly sociolinguistic and phonetic, attempting to record the phonetic characteristics of Southern Spanish speech. This corpus is based on criteria similar to those adopted by CORLEC.
5. CLUVI (Corpus Lingüístico de la Universidad de Vigo), funded by the Plan Nacional de I+D+I, is a recent project being developed by the Seminario de Lingüística Informática of the Universidad de Vigo,² and is similar to that of LLI-UAM. The main difference lies in the fact that their corpus is based on 5 subcorpora, two of which are dedicated to oral language. One of them is a corpus of bilingual (Spanish-Galician) spontaneous dialogues and the other is a corpus of Galician in the media. In this sense, the LLI-UAM corpus and that of SLI-UVI are complementary.

C-ORAL-ROM is a “second generation” spoken corpus (Moreno 2002), since it incorporates innovations such as the consent forms signed by speakers who were recorded, internal and external validation of the transcription and mark-up, exceptional acoustic quality thanks to digital recording, and the use of XML mark-up language. In order to make a comparison with previous approaches, we will now review CORLEC, the antecedent of C-ORAL-ROM.
4.1.2 CORLEC features
CORLEC is a database comprising around 1,100,000 transcribed words from spoken texts recorded on analogue audio tapes. The methodology consisted in making recordings in their real contexts, usually without the knowledge or permission of participants. The transcription was made using an analogue recorder with headphones and writing directly into the word processor (WordPerfect). Digital technology was used neither in the recording nor in the later treatment of the data.³ The limitations of this first generation methodology are noticeable: acoustic quality is usually deficient, and there is no alignment between the original sound and its transcription. Nevertheless, it must be kept in mind that the main goal of this corpus was to record Spanish spoken varieties accurately for the first time.⁴

As concerns transcription criteria, the most important feature is fidelity to what participants say: deleted phonetic segments, breaks, repeated occurrences, self-corrections, invented words or words from other languages are transcribed precisely as pronounced by the speaker. For the retrieval of canonical forms, all these cases are marked up with relevant tags.

Another transcription criterion is the use of punctuation marks (inverted commas, ellipsis, full stops, etc.) in order to mark discursive situations. Inverted commas are used to highlight words and mark titles in direct discourse. Ellipsis marks are used to mark breaks, hesitations and sudden breaks. Commas and full stops are used as syntactic unit markers. As a general rule, the transcriber was required to follow spelling rules for written texts: for instance, a pause must be marked even if the speaker does not pause at the end of a sentence (Marcos Marín 1992). This decision sits uneasily with the criterion just described: on the one hand, pronunciation is transcribed accurately, but on the other, punctuation follows the rules of written syntax.
In contrast, in C-ORAL-ROM we have rejected the use of punctuation marks according to written language conventions. The information provided in the transcription is enriched with a variety of tags for phatic elements (sounds emitted by speakers that are interpreted as assertions, interrogations, etc.), noises (laughter, applause, music, etc.), and, especially, discursive interactions: turn-taking and overlapping of participants are marked.
Table 4.1 Distribution of CORLEC
Classes                        Number of words   Percentage
Administrative and political   61,200            5.6%
Advertising                    30,800            2.8%
Debates                        93,500            8.5%
Documentary                    28,600            2.6%
Educational                    58,300            5.3%
Familiar                       269,500           24.5%
Humanistic                     61,200            5.6%
Instructions (megaphony)       6,600             0.6%
Interviews                     171,200           15.6%
Journalistic
Legal                          35,200            3.2%
Ludic (competitions, etc.)     61,200            5.6%
News                           72,600            6.6%
Religious                      12,100            1.1%
Scientific                     36,600            3.3%
Sports                         58,300            5.3%
Technical                      43,100            3.9%
Total estimated                1,100,000         100.0%
The encoding system used follows TEI norms but, since the corpus was compiled when the TEI norms were not yet published, the authors made particular encoding decisions which differ from those norms. It is also important to highlight that this corpus has not passed any format validation process (for example, validation by means of a DTD).

The distribution of different kinds of text is one of the distinctive components of any corpus. CORLEC is organised by thematic criteria, as illustrated in Table 4.1. The most important texts are journalistic and familiar, comprising 38.6% and 24.5% of the corpus, respectively. However, the kind of communication event (monologue versus dialogue/conversation) and the channel (direct recording through the media, the telephone) are not representative of all possible discourse types. The most distinctive feature in the distribution of text types in C-ORAL-ROM is the distinction between formal and informal registers, which was unmarked in CORLEC. The complete transcription of CORLEC is freely available through the LLI-UAM FTP server (http://www.lllf.uam.es/).
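The figures quoted from Table 4.1 can be checked with a few lines of arithmetic. The sketch below copies the word counts from the table and recomputes the shares. Since the Journalistic row carries no figure of its own, the grouping used here (Debates, Documentary, Interviews, News and Sports) is an assumption: the text does not state it, but it is the combination of rows that yields the quoted 38.6%.

```python
# Word counts per class, copied from Table 4.1.
# The Journalistic row has no figure of its own in the table.
corlec_words = {
    "Administrative and political": 61_200,
    "Advertising": 30_800,
    "Debates": 93_500,
    "Documentary": 28_600,
    "Educational": 58_300,
    "Familiar": 269_500,
    "Humanistic": 61_200,
    "Instructions (megaphony)": 6_600,
    "Interviews": 171_200,
    "Legal": 35_200,
    "Ludic (competitions, etc.)": 61_200,
    "News": 72_600,
    "Religious": 12_100,
    "Scientific": 36_600,
    "Sports": 58_300,
    "Technical": 43_100,
}
total = sum(corlec_words.values())


def share(classes):
    """Percentage of the corpus covered by the given classes, to one decimal."""
    return round(100 * sum(corlec_words[c] for c in classes) / total, 1)


# ASSUMPTION: these five classes are the ones counted as 'journalistic'.
journalistic = ["Debates", "Documentary", "Interviews", "News", "Sports"]

print(total)                  # estimated total number of words
print(share(["Familiar"]))    # share of familiar texts
print(share(journalistic))    # share of the assumed journalistic group
```

Under that grouping the class totals add up to the estimated 1,100,000 words and the two shares match those given in the text.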
4.1.3 C-ORAL-ROM features⁵
The main difference between C-ORAL-ROM and other corpora is its multilingual nature. On the one hand, this results in an enrichment due to the experience of other groups and the exchange of ideas. On the other hand, it required commitment to a common transcription format, and to a comparable distribution, throughout the four corpora.
Obtaining a similar distribution in the different corpora is essential in order to compare the final results in the different languages. Together with the difficulty of designing a significant distribution for a spoken corpus, due to the inherent variability of spoken language, there is a difficulty in reconciling different approaches to transcription. In C-ORAL-ROM we have reached the following agreement: the two factors that decisively influence variation are the communication event and the register (but not the theme, which was the organising axis of CORLEC).

Another requirement for comparability is the use of the same mark-up scheme. The consortium agreed to use a C-ORAL-ROM format based on the Italian model (whose origin is the CHAT format). To allow full reuse of the transcription, the mark-up language is XML, ensuring easy interpretation (by means of the corresponding DTD). The LLI-UAM has developed for the project the software that converts the C-ORAL-ROM format to XML.⁶

Other innovative aspects of C-ORAL-ROM compared to CORLEC are elaborated on below:

1. Legality of texts. We have at our disposal the signed consent forms of all participants. This requirement is compulsory, because the law has changed and the lack of this permission affects royalties (in the case of conferences and media recordings) and privacy rights. This has been a new experience for all the teams, because no written permission was needed in previous corpora; scientists do not normally worry about legal questions, only about the diffusion of knowledge. We became aware of this regulation thanks to ELRA, the partner that will lead the distribution and commercialisation of the corpus.
2. Validation. Validation has come back into fashion in recent years for any linguistic resource in electronic format: final users of corpora demand reliability of the collected data. C-ORAL-ROM presents different levels of validation. The most important one is internal validation. Every text must pass through five stages: transcription, revision, prosodic annotation, revision of annotation, and text-sound alignment, with each stage performed by a different linguist. As a consequence, at least three researchers intervene in each text. Compared to CORLEC, where the same researcher both transcribed and revised texts, reliability has increased remarkably. Furthermore, software developed within the LLI for the project tests for format errors (e.g. missing tags, typing mistakes, blank spaces not allowed, etc.). By means of this testing, which is necessary for the conversion into XML format, it is possible to unify all texts in the four corpora. Finally, a group of international experts carried out an external validation in order to increase the reliability of the corpus.
3. The use of XML. We exploit the universality of this mark-up language, which guarantees the reusability of the corpus. In fact, if the aim is to create a corpus to be used as a reference by the linguistic and voice recognition communities, it is necessary to work with a format valid in all the communities concerned. XML and its extensions have become the format for linguistic technologies, since XML-encoded text can be converted to any other text format.
4. Digital quality of recordings. The other teams of the project have reused in part previous analogue recordings. In our case, however, as we did not have written permission for previous texts, where the sound quality testing was in any case discouraging, we started recording afresh using a digital recorder that affords excellent quality. This has meant much more work for the Spanish team, but now two corpora are at the disposal of LLI.
5. Linguistic information tagging. Different linguistic levels are marked up in C-ORAL-ROM. The basic one is the prosodic level, which has been completed for all texts. Additionally, a significant part of each corpus will be marked up morphosyntactically, including verb, noun and adjective lemmatisation.
6. Sound-transcription alignment. This characteristic is not provided in CORLEC because at that time the technology needed to fulfil such requirements, i.e. digitalised sound and software to perform the alignment, was not available at an affordable price for the project. For the alignment in C-ORAL-ROM, we have used a version of Winpitch, a tool specifically developed for corpus transcription. We would like to emphasise that this feature provides us with considerable empirical added value: any lack of correspondence between text and audio is revealed clearly by the aligned version.
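The conversion from a CHAT-derived transcription format to XML, together with the format checks it requires, can be sketched as follows. This is not the LLI-UAM conversion software: it is a minimal illustration that assumes CHAT-style lines (main lines of the form `*SPK: text`, dependent lines of the form `%tier: text`). The element and attribute names, the sample utterance and the content of the dependent line are all invented for the example.

```python
import xml.etree.ElementTree as ET


def chat_to_xml(lines):
    """Convert CHAT-style lines into a small XML tree, rejecting malformed lines.

    ASSUMPTION: '*SPK: text' opens a turn and '%tier: text' attaches a
    dependent line to the preceding turn. Tag names are hypothetical.
    """
    root = ET.Element("transcription")
    turn = None
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        marker, sep, content = line.partition(":")
        if not sep:
            raise ValueError(f"line {number}: missing ':' after tier marker")
        if marker.startswith("*"):
            # Main line: a new turn by the named speaker.
            turn = ET.SubElement(root, "turn", speaker=marker[1:])
            turn.text = content.strip()
        elif marker.startswith("%"):
            # Dependent line: must follow a main line.
            if turn is None:
                raise ValueError(f"line {number}: dependent line before any turn")
            dep = ET.SubElement(turn, "dependent", type=marker[1:])
            dep.text = content.strip()
        else:
            raise ValueError(f"line {number}: unknown tier marker {marker!r}")
    return root


# Hypothetical two-line sample: one turn plus one %alt dependent line.
sample = ["*ANA: voy a ca mi madre", "%alt: ca = casa"]
root = chat_to_xml(sample)
print(ET.tostring(root, encoding="unicode"))
```

Rejecting malformed lines at conversion time is what allows a single check to serve both purposes mentioned above: a text that converts cleanly is also known to be free of the format errors the converter looks for.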
4.1.4 Final remarks
This section has shown the clear evolution in the spoken corpora of the LLI, keeping in mind that they still maintain the same basis: the recording of spontaneous spoken language in real contexts. Another aspect that remains is the structure of the texts, which contain not only a transcription but also a header with rich information regarding the text. All the relevant information must be tagged by using a mark-up language that allows its identification in a clear and unequivocal way. The choice of the encoding scheme must be based on standardised criteria, since this is the only effective way to make a reference corpus for use by other researchers.

Nevertheless, there are many changes as a product of experience and technological breakthroughs, as well as of budget and legislation. Second generation corpora must provide validated quality, both in transcription/annotation and in the reliability of the sound source, which must necessarily be given in digital format. As a consequence, the alignment of text and sound must now be provided to allow the empirical verification of transcription accuracy. Insofar as a spontaneous spoken language corpus is to be used freely by the scientific community, authorisation by participants is required, so that both their privacy and copyright are protected.
4.2 Orthographic transcription
4.2.1 General criteria
As far as the transcription of the sound files is concerned, we have respected the rules set by the Real Academia Española. Here we state some of these rules, together with the decisions the group took in order to maintain coherence.

1. Acronyms and symbols. Acronyms are given in block capitals and without full stops, as in IVA, ONG, NIF; the plural suffix for acronyms is in lower case, as in ONGs. Other corpora and the internet have been useful in order to determine the most frequent orthography of specific acronyms. In the case of symbols related to science (chemistry, etc.) we have followed the conventions. The letter x was used in certain contexts to make reference to a non-specific quantity. No abbreviations were used in the transcription of the sound files.
2. New words. We included words which, although not included in the Real Academia Española dictionaries, have a high frequency of occurrence in spoken Spanish, such as porfa, finde or pafeto.
3. Numbers. Numbers were transcribed in letters, except in the cases of numbers which are part of proper nouns, as in La 2, and numbers included in mathematical formulae. Roman numerals were used when referring to popes or kings, as in Juan XXIII, as well as in names in which they are included, as in N-III.
4. Capital letters. Initial capital letters were used when transcribing proper nouns which made reference to people (including nicknames), as in Inma or el Bibi, as well as cities, countries, towns, regions, districts, squares, streets and so on, as in Segovia, Carabanchel or Madrid. The same was applied to names of institutions, entities, organisations, political parties, etc., as in the case of Comunidad de Madrid, Ministerio de Hacienda, la Politécnica. Initial capital letters were also used when naming scientific disciplines, as well as entities which are considered absolute concepts and religious concepts, such as la Sociología, Internet, la Humanidad, tu Reino y el Universo, while in the case of Señor salvador the second word, being an adjective, is not given an initial capital. Names of sports competitions were transcribed with initial capital letters for content words, as in Copa del Rey or Champions League. In the case of books and song titles, as well as all kinds of works of art, even television programmes, only the first word was given an initial capital letter, except in cases like Las Meninas (as is conventional). Both nouns and adjectives included in the names of newspapers, magazines and such were written with initial capital letters, as in El Mundo or El País. The names of stores and commercial brands such as El Corte Inglés were transcribed following the registered name.
5. Italics. No italics were used in the transcription of the sound files.
6. Foreign words. Foreign words were transcribed as in the original language. Words of foreign origin were written with the original orthography when pronounced in that language, whether they were included in the academic dictionaries or not. When adapted to Spanish, these words were transcribed following the rules set in these dictionaries.
4.2.2 Orthography for non-standard words
Non-standard productions have been labelled in C-ORAL-ROM in the %alt dependent lines. For example, the following non-standard productions have been transcribed following orthography:⁷
Standard forms   Non-standard forms
abrís            abréis
básico           básimoco
a El             al
casa             ca
claro            cao
claro            caro
cassette         casé
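One way to use such annotations for the retrieval of canonical forms is to key each standard form to the position of the token it replaces, instead of substituting forms globally. The sketch below is only an illustration under that assumption: the function name, the sample utterance and the position-to-form mapping are invented, and the actual layout of the %alt dependent lines is not reproduced here.

```python
def canonical_tokens(tokens, alternatives):
    """Return the tokens with annotated non-standard forms replaced.

    `alternatives` maps a token position (0-based) to its standard form.
    ASSUMPTION: annotations are positional; this is a hypothetical layout.
    """
    return [alternatives.get(i, token) for i, token in enumerate(tokens)]


# Hypothetical utterance in which 'cao' stands for 'claro' and 'ca' for 'casa'.
utterance = "cao que voy a ca mi madre".split()
annotations = {0: "claro", 4: "casa"}

print(canonical_tokens(utterance, annotations))
```

Positional annotation matters because several of the listed non-standard forms coincide with ordinary Spanish words: caro and al, for example, are also regular forms, so a global search-and-replace would alter tokens that were pronounced in the standard way.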
4.2.3 Interjections
A list of interjections was created during the process of transcription. These words were transcribed with exclamation marks, which are present in the transcription only in these cases. The list is as follows:
¡ah! ¡ahí va! ¡anda! ¡bah! ¡brum! ¡buah! ¡bueno! ¡cachis en la mar! ¡chun! ¡coño! ¡Dios mío de mi alma! ¡eh! ¡ey! ¡hala! ¡hombre! ¡jo! ¡jobar! ¡joder! ¡jolines! ¡leche! ¡madre! ¡madre del amor hermoso! ¡madre mía de mi vida y mi corazón! ¡madre mía! ¡mua! ¡oh! ¡ojo! ¡ole! ¡ostras! ¡ouh! ¡oy! ¡por Dios! ¡pum! ¡uh! ¡ups! ¡yeah!
4.3 Morpho-syntactic tagging
4.3.1 Tools and strategy adopted for automatic PoS tagging and lemmatisation
With respect to the linguistic annotation, the main goal is to provide a complete morphological and PoS tagging, including lemmatisation. These tasks have been performed automatically and validated by expert annotators. For the morphological analysis we use GRAMPAL (Moreno 1991; Moreno & Goñi 1995), which is based on a rich morpheme lexicon of over 40,000 lexical units, and morphological rules. This system has been successfully used in language engineering applications such as ARIES (Goñi et al. 1997) and also in linguistic description (Moreno & Goñi 2002). Originally, GRAMPAL was developed for analysing written texts. The tagging has been the most useful test for showing the ability of GRAMPAL to deal with a wide-coverage corpus of Spanish. We use this application to enhance GRAMPAL with new modules: a PoS tagger and an unknown words recogniser, both specifically developed for spoken Spanish.

GRAMPAL is theoretically based on feature unification grammars and was originally implemented in Prolog. The system is reversible: the same set of rules and the same lexicon are used for both analysis and generation of inflected wordforms. It is designed to allow only grammatical forms. In other words, the most salient feature of this model is its linguistic rigour, which avoids both over-acceptance and over-generation. The analysis provides a full set of morpho-syntactic information in terms of features: lemma, PoS, gender, number, tense, mood, etc. In order to be suitable for tagging the C-ORAL-ROM corpus, a number of developments have been introduced in GRAMPAL, reported in Moreno and Guirao (2003):
1. A new tokenisation for the spoken corpus.
2. A set of rules for derivative morphology.
Tokenisation in spoken corpora is slightly different from the same task in written corpora. Neither sentence nor paragraph boundaries make sense in spontaneous speech.
Instead, dialogue turns and prosodic tags are used for identifying utterance boundaries. For disambiguation, specific features of spoken corpora directly affect tagging: repetition and retracting produce ungrammatical sequences; fragments that are not full sentences appear very often; and there is a more relaxed word order. Finally, there are no punctuation marks. All those characteristics force the PoS tagger, which is typically trained for written texts, to adapt. Fortunately, proper name recognition is not a problem for C-ORAL-ROM, since only names are transcribed with an initial capital letter. As a consequence, analysing them is a trivial task. On the lexical side, we detected two specific features as compared to written corpora: there is a low presence of new terms (i.e. most vocabulary used by speakers in spontaneous conversations is common and basic); and there is a high frequency of
derivative prefixes and suffixes that do not change the syntactic category, because most of them are appreciative morphemes (for instance, diminutives). In order to handle the recognition of derivatives, GRAMPAL has been extended with derivation rules. The Prefix rule is: take any Prefix and any (inflected) word and form another word with the same features. This rule is effective for PoS tagging since in Spanish prefixes never change the syntactic category of the base. The rule assigns the category feature to the unknown word. 239 prefixes have been added to the GRAMPAL lexicon.

PoS disambiguation has been solved using a rule-based model, in particular, an extension of a Constraint Grammar using features in a Context-Sensitive PS. The output of the tagger is a feature structure written in XML. The formalism allows several types of context-sensitive rules. First in application are the lexical ones: those for a particular ambiguous word, as follows:
“word” → <X> / _ <Y>
“word” → <Z> / <W> _
where a given ambiguous word is assigned a category X before a word with a category Y, or the category Z after a category W. Any kind of feature can be taken into account, not only the category. For instance, we can face the problem of two ambiguous verbs belonging to different lemmas, as follows:
“word” → / _
“word” → / _
In addition to features, strings and punctuation marks can be specified in the RHS of the context-sensitive rule as follows:
“word” → / string _
“word” → / # _
where string is any token, and # is the symbol for start or end of utterance. If no lexical rule exists for a given ambiguity, then more general, syntactic rules are applied:
<N>, <V> → <N> / <V> _
where if a given word is analysed with two different tags, one as a noun (N), and the other as a verb (V), then the one with category N is chosen if it appears after a word with category V.
In short, those syntactic rules apply when there is no specific rule for the case, either because it is a new ambiguity not covered by the grammar, or because the grammar writer did not find a proper way to describe the ambiguity. This method benefits from the fact that the most frequent ambiguities for a given language become well known after training. As a consequence, many context-sensitive lexical rules can be written by hand or extracted automatically from the data. Figures illustrating tagger performance are given in Section 4.3.6.
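The order of application described here (word-specific lexical rules first, general syntactic rules only when no lexical rule exists for the word) can be sketched as follows. This is not the GRAMPAL tagger: the rule representation, the tag names and the two toy lexical rules for la are assumptions made for illustration. Only the syntactic rule (choose N over V after a V) is taken from the text.

```python
from dataclasses import dataclass
from typing import Optional

BOUNDARY = "#"  # start or end of utterance


@dataclass(frozen=True)
class Rule:
    choose: str                   # category to keep
    left: Optional[str] = None    # category (or '#') required on the previous token
    right: Optional[str] = None   # category (or '#') required on the next token


# ASSUMPTION: toy lexical rules, invented for this sketch.
LEXICAL_RULES = {
    "la": [Rule(choose="DET", right="N"), Rule(choose="PRON", right="V")],
}
# Syntactic rule from the text: for an N/V ambiguity, choose N after a V.
SYNTACTIC_RULES = {
    frozenset({"N", "V"}): [Rule(choose="N", left="V")],
}


def matches(rule, left_tags, right_tags):
    """True if the rule's context conditions hold for the neighbouring tags."""
    return ((rule.left is None or rule.left in left_tags) and
            (rule.right is None or rule.right in right_tags))


def disambiguate(words, analyses):
    """Apply lexical rules first, then syntactic ones; leave the rest ambiguous."""
    result = []
    for i, (word, tags) in enumerate(zip(words, analyses)):
        left = result[i - 1] if i > 0 else {BOUNDARY}
        right = analyses[i + 1] if i + 1 < len(words) else {BOUNDARY}
        chosen = set(tags)
        if len(tags) > 1:
            # Syntactic rules are consulted only if the word has no lexical rule.
            rules = LEXICAL_RULES.get(word) or SYNTACTIC_RULES.get(frozenset(tags), [])
            for rule in rules:
                if rule.choose in tags and matches(rule, left, right):
                    chosen = {rule.choose}
                    break
        result.append(chosen)
    return result


print(disambiguate(["quiero", "vino"], [{"V"}, {"N", "V"}]))
print(disambiguate(["la", "cuenta"], [{"DET", "PRON"}, {"N", "V"}]))
```

In the second call no rule resolves cuenta, so both analyses are kept; in the procedure described in this chapter, tokens in that state are the ones passed on to the statistical tagger.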
Finally, those words which did not undergo disambiguation are treated with TnT (Brants 2000), which assigns PoS following a statistical model obtained from a 50,000-word training corpus. It has been shown that TnT is the most precise statistical tagger (Bigert et al. 2003).
... Electronic vocabulary
The GRAMPAL lexicon is a collection of allomorphs for both stems and endings. New additions can be easily incorporated, since every possibility in Spanish inflection has been assigned to a particular class. In an experiment reported in Moreno and Guirao (2003), 8% of the words in the whole corpus were unknown to the system. They fall into the following categories:
1. Foreign words: walkman, parking.
2. Words missing from the lexicon, typically from the spoken language: caramba, hijoputa.
3. Errors in the transcription.
4. Neologisms, mostly derivatives.
Rules for handling derivative morphology have been shown in the previous paragraph. For the remaining three classes of unknown words, a simple approach is adopted:
a. Foreign words are included in a list, updated regularly.
b. Any word in the corpus but not in the lexicon is added, expanding the base resource.
c. Errors in the source texts are corrected, and then analysed by the tool.
To summarise, the tagger procedure consists of seven steps:
1. Unknown word detection: Once the tokeniser has segmented the transcription into tokens, a quick look-up for unknown words is run. The new words detected are added to the lexicon.
2. Lexical pre-processing: The programme splits portmanteau words (al, del → a el, de el) and verbs with clitics (damelo → da me lo).
3. Multiword recognition: The text is scanned for candidates for multiwords. A lexicon, compiled from printed dictionaries and corpora, is used for this task.
4. Single word recognition: Every single token is scanned for every possible analysis according to morphological rules and lexicon entries. Approximately 30% of the tokens are given more than one analysis, and some of them are assigned up to 5 different analyses.
5. Unknown word recognition: The remaining tokens that are not considered new words pass through the derivative morphology rules. If some tokens still remain unanalysed (because they were neither included in the lexicon nor recognised by the derivative rules), they are held until the stage of statistical processing, when the most probable tag, according to the surrounding context, is given.
6. Disambiguation phase 1: A feature-based Constraint Grammar resolves some of the ambiguities.
7. Disambiguation phase 2: A statistical tagger (the TnT tagger) resolves the remaining ambiguous analyses.
After such automatic tagging, human annotators can revise and correct the tagged corpus; Guirao and Moreno-Sandoval (2004) describe a tool developed for aiding human annotators in this task.
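The shape of this procedure can be illustrated with a minimal Python sketch. It is not the GRAMPAL implementation: the lookup tables, the two contextual rules and the tag scores are hypothetical stand-ins that contain only forms quoted in this chapter, and the last step replaces the TnT model with a fixed score per tag.

```python
# Hypothetical lookup tables: only forms quoted in the text are included.
PORTMANTEAUS = {"al": ["a", "el"], "del": ["de", "el"]}
CLITIC_FORMS = {"damelo": ["da", "me", "lo"]}

# Toy lexicon (assumption): token -> set of candidate tags.
LEXICON = {"la": {"ART", "P"}, "moto": {"N"}, "llamé": {"V"}}

# Assumed scores standing in for the statistical model (not TnT output).
TAG_SCORE = {"N": 0.9, "V": 0.8, "ART": 0.6, "P": 0.4}


def preprocess(tokens):
    """Step 2: split portmanteau words and verbs with clitics."""
    out = []
    for token in tokens:
        key = token.lower()
        out.extend(PORTMANTEAUS.get(key) or CLITIC_FORMS.get(key) or [token])
    return out


def candidates_for(token):
    """Step 4: every analysis in the lexicon; unknown tokens keep all tags."""
    return set(LEXICON.get(token.lower(), TAG_SCORE))


def tag(tokens):
    """Steps 4, 6 and 7: look up candidates, apply rules, then fall back."""
    tagged = []
    for i, token in enumerate(tokens):
        candidates = candidates_for(token)
        following = set()
        if i + 1 < len(tokens):
            following = LEXICON.get(tokens[i + 1].lower(), set())
        # Step 6: illustrative contextual rules (ART before N, P before V).
        if len(candidates) > 1 and following == {"N"} and "ART" in candidates:
            candidates = {"ART"}
        elif len(candidates) > 1 and following == {"V"} and "P" in candidates:
            candidates = {"P"}
        # Step 7: choose the best-scoring remaining candidate.
        tagged.append((token, max(candidates, key=TAG_SCORE.get)))
    return tagged


print(preprocess(["damelo", "del", "coche"]))
print(tag(["la", "moto"]))
print(tag(["la", "llamé"]))
```

The ordering mirrors the description above: rules are tried first, and the score-based choice is used only when more than one candidate survives.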
.. Tagset
During the process of disambiguation, C-ORAL-ROM Madrid developed a document explaining the different matters which concern the construction of the morphosyntactic tagging system for the Spanish spoken corpus. This section will introduce this system, outlining the tagset used, how the tags were defined and, finally, what decisions were taken to solve the problems derived from ambiguity and from the limitations traditional grammar has when approaching categorisation. General theoretical decisions involving the definition of tags and their morpho-syntactic features will also be addressed.
... Tagset adopted
A tag is defined as a descriptive symbol which is either manually or automatically assigned to a word or multiword inside a text (van Halteren 1999). In Spanish grammar studies, the PoS problem is still far from being solved. In present-day literature on the subject, it is widely admitted that the list of PoSs we work with is based on a strange mix of criteria: semantic criteria for nouns and verbs; local, at times, for adjectives and prepositions; and of a different nature for the adverb. With the aim of avoiding possible incoherence when assigning the different PoSs, in C-ORAL-ROM Madrid these are defined from three different points of view: semantic, morphological, and syntactic.
As for the way the different parts of speech are assigned to the different lexical units, there are two main theoretical models. In the first place, there is the functionalist model, according to which words are assigned one PoS or another depending on their syntactic behaviour inside the sentence. Second, there is the generativist model, where words are assigned a PoS at source and, as a consequence, have a concrete behaviour in the syntax of that language. Let us examine an example:
(1) el hijo del presidente se educó en un colegio privado [the son of the President was educated in a private school]
If we explain this example from a functionalist point of view, the word hijo would be assigned the PoS “noun” for the following reasons:
1. It is in a syntactic position which is typical of nouns.
2. It can be replaced by another noun.
3. Hijo has a syntactic function inside the sentence, which is also typical of nouns.
However, from a generativist point of view, the word hijo is a noun in itself and, as such, it has the syntactic behaviour just described. That is, it is not a noun because it is the subject of the sentence; rather, being a noun, it can be the subject of the sentence.
From the point of view of C-ORAL-ROM Madrid, the ‘syntactic position’ premise is not enough to justify the change of PoS in a lexical unit; this change must be based on semantic and morphological criteria as well. Above all, the semantic criterion has been favoured, and so the rest of the parameters will be described from that perspective. According to this perspective, the semantics of the PoS of a concrete word has morphological and syntactic consequences in the language being dealt with. For example, we will see how nouns are defined as a group of properties which tell one group of individuals from another. Those words which, in Spanish, designate classes of individuals, have gender and number information, occupy the central position in a phrase, allow meaning modifiers (such as articles, determiners or quantifiers) and always have a syntactic function inside the sentence (mainly: subject, direct object, etc.).
Table 4.2 shows the different PoSs used in the morpho-syntactic tagging of the corpus; afterwards, we will undertake the definition of each of these PoSs.
Table 4.2 The Spanish C-ORAL-ROM PoS tagset
PoS | Subcategory | Tag | Example
Noun | | N | mesa
Proper noun | | NP | María
Adjective | | ADJ | azul
Determiner | Article | ART |
 | Possessive | POSS | mi, tu, su
 | Demonstrative | DEM | ese, este
Quantifier | | Q | uno, dos, tres; primer, segundo; muchos, pocos
Pronoun | | P | yo, tú, él, lo_que
 | | PR | que
Verb | | V | cantar
Auxiliary | | AUX | habrá cantado
Preposition | | PREP | ante, bajo, con
Adverb | | ADV | así, aquí, allí
Conjunction | | C | y, pero, ni
Discourse marker | | MD | oye, o sea, es decir
Interjection | | INTJ | madre mía, yuju

.. The notion of “multiword”
In C-ORAL-ROM Madrid, each sound chain with a unique meaning has been considered a word unit, regardless of its orthographic representation.
In this sense, two kinds of word units can be found in the Spanish corpus: simple words and multiwords. Simple words are those graphically represented between two blank spaces. Multiwords (complex words), on the contrary, are made up of two or more graphic units. We see an example of each kind in (2) and (3) respectively:
(2) mesa [table]
(3) fin de semana [weekend]
For C-ORAL-ROM Madrid, the following qualities are required for two or more lexical units to become a multiword:
a. Absence of compositional meaning. The new compound, as a whole, can only denote one meaning. For example, the meaning of al revés is not the sum of the meanings of al and revés, whereas that of the phrase al cine in the sentence Vamos al cine is.
b. No insertion of other words inside the expression is allowed. For example, in de todas formas, when we add the article las, the result is de todas las formas. We then have an expression which denotes the different ways in which an event can happen and which is no longer the discourse marker that de todas formas was.
The different lexical units which form multiwords are tagged together, joined by underscores, as in fin_de_semana.
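The joining convention can be sketched in a few lines of Python. The multiword list below is a hypothetical stand-in for the multiword lexicon and contains only expressions quoted in this section; the longest-match strategy is an assumption made for the illustration, not a documented property of GRAMPAL.

```python
# Hypothetical multiword lexicon: only expressions quoted in the text.
MULTIWORDS = {
    ("fin", "de", "semana"),
    ("al", "revés"),
    ("de", "todas", "formas"),
}
MAX_LEN = max(len(entry) for entry in MULTIWORDS)


def join_multiwords(tokens):
    """Replace each listed multiword by a single underscore-joined unit."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest window first, down to two tokens.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            window = tuple(t.lower() for t in tokens[i:i + n])
            if window in MULTIWORDS:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multiword starts here: keep the simple word.
            out.append(tokens[i])
            i += 1
    return out


print(join_multiwords("de todas formas vamos el fin de semana".split()))
print(join_multiwords("de todas las formas".split()))
```

Because an inserted word breaks the match, de todas las formas is left as four simple words, in line with criterion (b).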
.. Ambiguous clustering
In this group we have included those words which can have two kinds of syntactic analysis, that is, those words which can form either a phrase or a multiword, depending on the context, as illustrated in the following examples:
(4) es que\ES QUE\MD mañana no puedo ir a trabajar [because tomorrow I cannot go to work]
(5) Lo que quiero decir es\SER\Vindp3s que\QUE\C no he dicho la verdad // [what I mean is that I haven’t told the truth]
(6) De eso nada\DE ESO NADA\MD guapa eso no lo haces tú ni por asomo. [by no means darling you won’t do that]
(7) de\DE\PREP eso\ÉSE\PPER3s nada\NADA\Q pero de esto que está aquí me da usted un kilo por favor. [from that nothing but from this other piece here please give me a kilo]
Information given by prosodic tagging will help in the process of disambiguation through the use of rules. In the examples above then, the rules would tag (4) and (6) as MD, that is, they would analyse them as a multiword, while in (5) and (7) the analysis would be that of a phrase, where each word has its own PoS. In those cases where rules cannot help, GRAMPAL will favour the syntactic analysis of the words.
.. Level of morpho-syntactic encoding of forms
In Table 4.3, the semantic, morpho-syntactic and syntactic features of each PoS are presented. While Table 4.3 sums up the general features of each PoS, further explanation is needed concerning the decisions taken for some specific cases.
1. Quantifiers (Q). Those words tagged as quantifiers by C-ORAL-ROM Madrid have been classified by grammatical tradition either according to their syntactic position or to their semantic content. This way, in a traditional analysis of the same word, we can obtain two different categories, as illustrated in examples (8) and (9).
(8) algunos alumnos no han venido hoy a clase [some students haven’t attended the class today]
(9) algunos no han venido hoy a clase [some haven’t attended the class today]
In (8), algunos would be classified as a determiner, while in (9) it would be a pronoun.
2. Article or quantifier? Words that in the grammatical tradition are tagged as articles have been classified as quantifiers in C-ORAL-ROM Madrid as well, as shown in (10) and (11):
(10) un niño [a/one boy]
(11) unos niños [several boys]
Choosing the article tag would mean sacrificing the enumerative interpretation in some contexts. It was therefore decided to consider these particles as quantifiers which, depending on different aspects which include their own semantic features, will denote a defined or undefined quantification.
3. Discourse markers. Discourse markers are linguistic units which, at the discourse level, “guide, according to their morpho-syntactic, semantic and pragmatic features, the inferences which take place in communication” (Portolés 1999); see, for instance, vamos and mira in (12) and (13):
(12) pero / vamos / a mí me gusta mucho // [but / come on / I like it very much //]
(13) pues mira / fue / horrible // [so look / it was / horrible //]
Table 4.3 Semantic, morpho-syntactic, and syntactic features of each PoS
PoS | Semantic features | Morphological features | Syntactic features | Example
N | Denote an element or object classification | Num +, Gen + | Subject, direct object, complement | actitudes\ACTITUD\NCfp
N (PR) | Mono-referential | Fixed inflection | Absence of determiners | Luisa\LUISA\Npi
ADJ | Denote qualities or properties | Num +, Gen + | Noun complement, predicative complement | extranjeros\EXTRANJERO\ADJmp
ART | Restrict/define the referent of a noun phrase | Num +, Gen + | Prenominal position, no syntactic function | los\EL\DETdmp
POSS | Express relation of possession or ownership | Num +, Gen + | Prenominal (1st series) and postnominal (2nd series) positions | mío\MÍO\DETposs
DEM | Express location of the referent in space and time | Num +, Gen + | Pre and postnominal positions | este\ESTE\DETdem
Q | Express number of individuals or objects | Will vary depending on the kind of entity they are quantifying | | una\UN\Q
REL | Retrieve the referent of the noun they modify inside the clause they introduce | Inherit their morphological features from the noun working as the referent | Different syntactic functions inside the clause | que\QUE\PR
P | Refer to a noun phrase | Num +, Gen +, Per + | Maximum expansion of the noun phrase | yo\YO\PPER1s
Verb | Express events | Num +, Per +, Tense-Mood + | Central element of the sentence, determine the different syntactic functions | es\SER\Vindp3s
INTJ | Express a mental state | Invariability | Not been assigned | ah\AH\INT
MD | Guide the inferences which take place in conversation | Invariability | Not been assigned | es decir\ES DECIR\MD
C | Establish logical or discourse bounds | Invariability | Relate sentences or elements in a sentence | pero\PERO\C
PREP | Establish semantic relationships associated to spatial concepts | Invariability | Establish relationships between two elements | de\DE\PREP
ADV | Set the meaning of the verb | Invariability | Do not introduce a second term | también\TAMBIÉN\ADV
4. The article lo. C-ORAL-ROM Madrid has defined a pronoun as being able to recover its referent by itself. This lo cannot accomplish this function and this has been the main reason why it has been tagged as an article, as seen in examples (14) to (16).
(14) lo azul [the blue]
(15) lo alto que eres [*the tall you are]
(16) lo graciosa que es esta chica [*the charming this girl is]
5. Pronouns and adverbs. This last group of words, traditionally classified as pronominal adverbs (Kovacci 1999), has been labelled as pronouns by C-ORAL-ROM Madrid, bearing in mind that, like other pronouns, these words are open terms whose referent is not fixed beforehand and does not remain constant, but is established every time there is a change of speaker, listener or space and time coordinates. We assume a semantic point of view in the definition of those parts of speech, understanding that they behave as entities and therefore they are a subclass of nouns. As a consequence, the following words, traditionally classified as adverbs, are classified as pronouns in the tagset: abajo; acá; actualmente; ahí; ahora; allí; alrededor; anche; antaño; antes; aquí; arriba; así; atrás; ayer; debajo; delante; después; detrás; encima; enfrente; entonces; hoy; luego; mañana; mientras
.. Evaluation
The total number of units (where single words and multiwords each count as one unit) in the hand-annotated test corpus (see note 8) is 44,144. The test corpus has been developed using a combined procedure of automatic and human tagging:
1. A fragment of approximately 50,000 words (15% of the corpus) was selected, taken from the different sections and intended to be a representative sampling of the whole. Each word in the 27 texts was tagged with all possible analyses. This means that some words (the unambiguous ones) have one tag, while others are given one tag per morpho-syntactic analysis.
2. Each file was revised by a linguist who selected the correct tag for every case, discarding the wrong ones.
3. From the revised corpus, a set of disambiguation rules was written for handling the most frequent cases.
4. A new run of the tagger, augmented with the disambiguation grammar, provided an automatically tagged corpus, with only one tag per unit.
5. The automatic and human tagged corpora were compared, and the differences were noted one by one, assuming that agreement on the same tag implied a correct analysis. While in most cases the wrong tag was assigned by the tagger, in several cases it was the linguist who provided an incorrect tag: such mistakes were probably lapses in attention caused by the repetitiveness of the task.
After assigning the proper tag in all the disagreements, a final version of the test corpus was delivered. Both the disambiguation grammar and the statistical tagger were trained against the test corpus. Finally, the rest of the corpus of over 250,000 words was tagged as described in Section 3.1.
In order to evaluate the tagger performance (including disambiguation), a new run of GRAMPAL against the test corpus was conducted. The mismatches between the GRAMPAL tagged corpus and the test corpus, serving as a gold standard for evaluation, were counted (the figures are shown in Table 4.4). The precision rate was calculated as the number of correct tags assigned by the tagger divided by the total number of tagged units in the test corpus. In other words, 42,206 tags out of 44,144 were assigned correctly. No evaluation of the precision has been performed for the rest of the corpus, but a rate similar to that obtained against the test corpus (95.61%) can be assumed. With respect to the recall, understood as the ratio between the number of units tagged by the programme and the total number of units, the figure for the whole corpus is 99.96%. Only 117 tokens were not given a tag by the programme.
There is a discrepancy between the number of words in the transcription corpus and the number of words in the tagged corpus. This can be explained by the fact that a different concept of ‘word’ has been used in transcription and in PoS tagging. In transcription, a word is simply a string between blank spaces, while a word in tagging is a lexical or grammatical unit.
That is, it can be a single word (hola) or a multiword (es decir). Since there are many multiwords in the corpus, the actual number of tagged words is less than the number of transcribed words.
With respect to the evaluation of PoS tagging, it is important to stress that only a subset of approximately 50,000 transcribed words was revised by hand, resulting in over 44,000 tagged words. The rest of the tagged corpus has not been revised by human annotators. This fact has some consequences for the list of forms and lemmas. The tagger, when two or more tags are available, always assigns the tag with the information shared between the candidates. For instance, many verb forms are ambiguous with respect to first and third person singular: (yo) cante / (él, ella) cante ‘I sing’ vs. ‘he/she sings’. The tags for each are Vp1s and Vp3s, respectively. When the context cannot solve the ambiguity (by means of the pronoun), the tag assigned is V, compatible with both. Human annotators, however, can normally resolve the ambiguity when they are revising the tagging, in which case the appropriate full tag is provided. As a result, different tags for the same word can be found in the list of lemmas and forms.
Table 4.4 Recall and precision rates of test corpus
 | Number of units | Number of tagged units | Recall | Precision
Test corpus | 44,144 | 44,144 | 100% | 95.61%
Whole corpus | 313,504 | 313,387 | 99.96% | ≈ 95.61%
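The rates reported in this section can be recomputed from the counts given in the text. The short Python sketch below follows the chapter's own definitions (precision as correct tags over tagged units in the test corpus, recall as tagged units over all units); the function name is only illustrative.

```python
def rate(numerator, denominator):
    """Return a percentage rounded to two decimal places."""
    return round(100 * numerator / denominator, 2)


# Counts taken from the text.
correct_tags_test = 42_206
units_test = 44_144
tagged_units_whole = 313_387
units_whole = 313_504

precision_test = rate(correct_tags_test, units_test)
recall_whole = rate(tagged_units_whole, units_whole)
untagged_tokens = units_whole - tagged_units_whole

print(precision_test)   # 95.61
print(recall_whole)     # 99.96
print(untagged_tokens)  # 117
```

Under these definitions, precision corresponds to what is often called tagging accuracy against the gold standard, and recall to the coverage of the tagger.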
.. Specific tagging problems with the Spanish spoken corpus
... Retracting and interruption phenomena
Phenomena such as retracting and interruption will not be tagged as the rest of elements during the process of automatic tagging, that is, they will not be tagged as PoSs. Nonetheless, in C-ORAL-ROM Madrid we have considered it important not to erase this information because it will play an important part in the process of manual disambiguation and spontaneous speech corpora training by means of disambiguation rules.
1. Retracting. The retracting phenomenon presents itself as problematic when dealing with automatic morpho-syntactic tagging. In the transcription of spoken language we have included cues to signal various retracting phenomena, and for this reason the rules for automatic disambiguation cannot work properly. Therefore, different rules should be written to avoid problems such as those which would arise in the following example:
(17) entonces la [/] la llamé [then I called her [/] her]
In this case, as the fragment lacks the proper context (la is a pronoun before a verb), the tagger will give the first la the “article” PoS tag, when it is really a pronoun. The retracting phenomenon will be labelled as in the final morphologically tagged text. In C-ORAL-ROM this phenomenon can be divided into two kinds:
a. In the first model of retracting, a repetition of the same word on both sides of the label [/] will take place, for example:
(18) la [/] la [/] la moto de mi abuelo Pepe. [the the the motorbike of my grandfather Pepe]
In these cases, after applying the contextual rules for disambiguation, GRAMPAL gives the likeliest analysis, which in the example would be the following:
(19) la\LO\PPER3s la\LO\PPER3s la\EL\DETdfs moto\MOTO\NCfs de\DE\PREP mi\MI\DETposs abuelo\ABUELO\NCms Pepe\PEPE\Npi
It can be seen how the analysis of the first two instances of the word la results in the tag P (pronoun), because the contextual rule la → ART/_N could not be applied to them as it could to the third instance of la. In order to solve these cases, C-ORAL-ROM Madrid designed a programme to tag repeated words, that is, words occurring on the left side of the label [/], with the same PoS tag as the one on the right side. For example, in (18), the word la in the noun phrase la moto is an article; therefore, all the instances of la on the left side of the retracting will be tagged as ART, as illustrated in (20):
(20) la\EL\DETdfs la\EL\DETdfs la\EL\DETdfs moto\MOTO\NCfs de\DE\PREP mi\MI\DETposs abuelo\ABUELO\NCms Pepe\PEPE\Npi
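The repair of repeated words can be sketched as follows. This is a minimal Python illustration, not the programme written for C-ORAL-ROM Madrid: the data layout (a list of form, lemma and tag triples with the label [/] kept as a bare string) and the function name are assumptions made for the example.

```python
RETRACT = "[/]"


def propagate_retracting(analyses):
    """Copy the analysis on the right of [/] onto an identical form on its left.

    `analyses` mixes (form, lemma, tag) triples with the bare label "[/]".
    """
    result = list(analyses)
    # Walk right to left so that chains such as "la [/] la [/] la" are covered.
    for i in range(len(result) - 2, 0, -1):
        if result[i] != RETRACT:
            continue
        left, right = result[i - 1], result[i + 1]
        if RETRACT in (left, right):
            continue
        if left[0].lower() == right[0].lower():
            result[i - 1] = (left[0], right[1], right[2])
    return result


# Example (19): output of the tagger before the repair.
before = [
    ("la", "LO", "PPER3s"), RETRACT,
    ("la", "LO", "PPER3s"), RETRACT,
    ("la", "EL", "DETdfs"), ("moto", "MOTO", "NCfs"),
]
for item in propagate_retracting(before):
    print(item)
```

Applied to (19), the sketch yields the analyses in (20): all three instances of la receive the lemma EL and the tag DETdfs.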
b. In the second model of retracting, there is no similarity between the words occurring on both sides of the label [///], as seen in (21):
(21) la [///] tiras de ahí / lo haces // [the [///] pull out / you do it //]
In such cases, due to the break in discourse, it is not possible to predict the PoS of the word, so the tag for such words will be the first one GRAMPAL assigns automatically, as shown in (22):
(22) la\EL\DETdfs tiras\TIRAR\Vindp2s de\DE\PREP ahí\AHÍ\P lo\LO\PPER3s haces\HACER\Vindp2s
2. Interruption. In C-ORAL-ROM, both the interruption and the change of topic phenomena are represented with the symbol +, as seen in example (23):
(23) No quiero ir a ese + ayer me dijo que yo no lo había comprado // [I don’t want to go to that + yesterday he told me that I hadn’t bought it //]
These cases will be dealt with in the same way as in the second case of retracting, i.e. maintaining the choice that the morphological tagger makes after applying GRAMPAL and the contextual disambiguation rules, as in (24).
(24) no\NO\ADV quiero\QUERER\Vindp1s ir\IR\V a\A\PREP ese\ESE\DETdem + [no\ADV I-want\V I-go\V to\PREP that\DET]
... Linguistic forms whose distribution is not consistent with the distributional characters of written language
In the Spanish corpus in C-ORAL-ROM, those linguistic occurrences which are typical of spoken language are labelled according to the following conventions:
1. Support: &ah and &eh. These elements are tagged as <sup>, as illustrated in examples (25) and (26):
(25) mañana &eh no voy [tomorrow &eh I won’t go]
(26) mañana\MAÑANA\P <sup> no\NO\ADV voy\IR\Vindp1s [tomorrow\P no\ADV I-go\V]
2. Non linguistic forms, transcribed as hhh, are labelled with the tag , as in (27) and (28): (27) mañana hhh no voy [tomorrow hhh I won’t go] (28) mañana\MAÑANA\P no\NO\ADV voy\IR\Vindp1s [tomorrow\P no\ADV I-go\V]
3. Onomatopoeia, for example, the imitation of the sound of cars, birds, etc., will be dealt with depending on whether the sound has a conventional transcription or not, as elaborated on below:
a. Those onomatopoeia which correspond in the written form with a conventional linguistic expression will be tagged as INTJ (interjection), even though in some contexts these words have a syntactic function, as in the following example, where the interjection ¡ras! plays the role of a direct object:
(29) y le hizo ¡ras! en toda la cara [and he did splash ! in his face]
(30) y\Y\C le\LO\PPER3s hizo\HACER\Vindp3s ¡ras!\RAS\INT en\EN\PREP toda\TODO\Q la\EL\DETdfs cara\CARA\NCfs [and\C to-him\PPER3s he-did\V splash\INT . . . ]
b. Those onomatopoeia not conventionalised with a linguistic expression and which were transcribed in C-ORAL-ROM Madrid as hhh have, as with the non-linguistic forms in point 2. above, the tag . 4. Interjections: The rest of the expressions expressing the speaker’s states of mind were tagged as INTJ. Those interjections not included in the Diccionario de la Real Academia Española (DRAE) were added to a list called “C-ORAL-ROM interjections”, which can be consulted in the document where the construction of the tagging is explained. An example from the corpus of interjections not included in the DRAE is shown in (31). (31) uju!\YUJU\INT mañana\MAÑANA\P no\NO\ADV voy\IR\Vindp1s [yuju\INT tomorrow\P no\ADV I-go\V ]
5. Meaningless linguistic forms: This group includes those expressions which are not alphabetically transcribed and which do not have a referent. An example is the case of a speaker humming a song, shown in (32):
(32) *ALF: tachín tachín tachín para pa pa para para pa pa [singing] These cases, which GRAMPAL leaves untagged, are labelled afterwards as (non categorial).
... Linguistic forms and non-standard forms used as discourse markers
The label Discourse Marker (MD) is a PoS which was created for the C-ORAL-ROM tagging; as such, this documentation includes its definition, the list of words belonging to the class, as well as an account of how C-ORAL-ROM Madrid solved the problems emerging from the choice of this PoS; this is summarised below. The list of discourse markers is made up of different groups of words, as elaborated on in the following.
1. Words which are always MD. In this group are words such as o sea, sin duda, por tanto, y tal, etc. These words are tagged automatically and do not present any problem of ambiguity.
2. Words with PoS ambiguity. There is a group of words which, in some syntactic contexts, behave as MD, while in other contexts they behave as conjunctions, adverbs, nouns or adjectives. We can see in examples (33) to (35) the different interpretations the word vamos can have:
(33) hoy vamos\IR\Vindp1p al cine [today we go\V to the movies]
(34) vamos\VAMOS\INT hombre eso no te lo crees ni tú [come on\INT man you don’t believe that]
(35) era muy competitivo vamos\VAMOS\MD un trepa [he was very competitive say\MD a go-getter]
In these cases, during automatic tagging, the disambiguation rules will give priority to one PoS over another. In the case of MDs, the contextual disambiguation rules use the information given by the prosodic tags. If a word is transcribed between single or double slashes, or, for example, if it is a dialogical turn in itself, given its prosodic and syntactic independence, we will probably be dealing with a MD.
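The prosodic cue just described can be expressed as a simple heuristic. The Python sketch below is only an illustration of that cue under stated assumptions: a turn is represented as a list of tokens in which the prosodic breaks / and // appear as separate items, and the function name is hypothetical. The actual disambiguation rules are not reproduced here.

```python
BREAKS = {"/", "//"}


def probably_discourse_marker(turn, index):
    """Return True if the token at `index` is prosodically independent.

    The token counts as independent when it is the only word of the turn
    or when it stands between two prosodic breaks.
    """
    words = [token for token in turn if token not in BREAKS]
    if len(words) == 1 and turn[index] not in BREAKS:
        return True
    before = index > 0 and turn[index - 1] in BREAKS
    after = index + 1 < len(turn) and turn[index + 1] in BREAKS
    return before and after


example_12 = "pero / vamos / a mí me gusta mucho //".split()
example_33 = "hoy vamos al cine".split()

print(probably_discourse_marker(example_12, 2))  # True
print(probably_discourse_marker(example_33, 1))  # False
```

A positive result would only make the MD reading the preferred candidate; the other analyses of vamos, as in (33) and (34), still depend on the remaining contextual rules.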
.. Main data from lemmatisation
In Tables 4.9 and 4.10 we compare the data obtained in C-ORAL-ROM from morphological tagging with those from other corpora: the Spanish UAM Treebank, a 21,420-word written corpus taken from newspapers (Moreno et al. 2003); the 10 million-word spoken part of the 41 million-word Longman Spoken and Written English (LSWE) Corpus, from which we have obtained the occurrence percentages for verbs, nouns, adjectives and adverbs in conversation; and finally, the frequency word lists from Juilland and Chang-Rodríguez (1964) (Spanish) and P. M. Alexejew et al. (1968) (English).

Table 4.5 High frequency verbs, excluding auxiliaries and modal verbs
Rank | Lemma | Rank | Lemma | Rank | Lemma | Rank | Lemma
1 | ser | 26 | comer | 51 | subir | 76 | utilizar
2 | decir | 27 | trabajar | 52 | perder | 77 | tirar
3 | estar | 28 | contar | 53 | esperar | 78 | considerar
4 | tener | 29 | coger | 54 | echar | 79 | jugar
5 | hacer | 30 | unir | 55 | acordar | 80 | dormir
6 | haber | 31 | valer | 56 | ganar | 81 | mantener
7 | ir | 32 | encontrar | 57 | traer | 82 | levantar
8 | ver | 33 | meter | 58 | abrir | 83 | morir
9 | dar | 34 | empezar | 59 | mandar | 84 | terminar
10 | saber | 35 | conocer | 60 | recordar | 85 | caer
11 | pasar | 36 | poder | 61 | suponer | 86 | permitir
12 | poner | 37 | mirar | 62 | quitar | 87 | mover
13 | creer | 38 | pedir | 63 | sobrar | 88 | necesitar
14 | venir | 39 | entender | 64 | acabar | 89 | salar
15 | llamar | 40 | vivir | 65 | leer | 90 | conseguir
16 | llevar | 41 | entrar | 66 | imaginar | 91 | fijar
17 | hablar | 42 | seguir | 67 | tratar | 92 | servir
18 | quedar | 43 | buscar | 68 | estudiar | 93 | aparecer
19 | querer | 44 | sacar | 69 | intentar | 94 | bajar
20 | llegar | 45 | volver | 70 | ocurrir | 95 | realizar
21 | dejar | 46 | comprar | 71 | escuchar | 96 | costar
22 | salir | 47 | pagar | 72 | sentar | 97 | referir
23 | parecer | 48 | preguntar | 73 | tocar | 98 | interesar
24 | gustar | 49 | tomar | 74 | casar | 99 | aprender
25 | pensar | 50 | cambiar | 75 | explicar | 100 | andar

Table 4.6 High frequency nouns
Rank | Lemma | Rank | Lemma | Rank | Lemma | Rank | Lemma
1 | día | 26 | padre | 51 | estudio | 76 | peseta
2 | año | 27 | hermano | 52 | país | 77 | teléfono
3 | cosa | 28 | forma | 53 | hombre | 78 | sábado
4 | tío | 29 | hijo | 54 | punto | 79 | kilómetro
5 | vez | 30 | caso | 55 | partido | 80 | derecho
6 | casa | 31 | amigo | 56 | nivel | 81 | sistema
7 | gente | 32 | madre | 57 | cuenta | 82 | piso
8 | persona | 33 | tarde | 58 | precio | 83 | cara
9 | tiempo | 34 | gobierno | 59 | mujer | 84 | información
10 | momento | 35 | grupo | 60 | producto | 85 | montón
11 | problema | 36 | señor | 61 | palabra | 86 | sol
12 | niño | 37 | sitio | 62 | programa | 87 | novio
13 | hora | 38 | mes | 63 | historia | 88 | ordenador
14 | chico | 39 | verdad | 64 | número | 89 | estado
15 | mundo | 40 | semana | 65 | falta | 90 | señora
16 | idea | 41 | ciudad | 66 | zona | 91 | compañero
17 | vida | 42 | manera | 67 | libro | 92 | domingo
18 | trabajo | 43 | clase | 68 | servicio | 93 | equipo
19 | tipo | 44 | dinero | 69 | empresa | 94 | relación
20 | parte | 45 | sentido | 70 | centro | 95 | cliente
21 | tema | 46 | situación | 71 | minuto | 96 | película
22 | uno | 47 | coche | 72 | familia | 97 | lunes
23 | noche | 48 | gracia | 73 | virus | 98 | ciencia
24 | mañana | 49 | agua | 74 | calle | 99 | lengua
25 | pueblo | 50 | medio | 75 | puerta | 100 | sociedad

Table 4.7 High frequency adverbs
Rank | Lemma | Rank | Lemma | Rank | Lemma
1 | no | 35 | tarde | 68 | máximo
2 | sí | 36 | prácticamente | 69 | especialmente
3 | ya | 37 | absolutamente | 70 | claramente
4 | también | 38 | efectivamente | 71 | últimamente
5 | bien | 39 | totalmente | 72 | básicamente
6 | claro | 40 | evidentemente | 73 | inmediatamente
7 | así | 41 | precisamente | 74 | del todo
8 | siempre | 42 | en general | 75 | recién
9 | además | 43 | probablemente | 76 | sinceramente
10 | tampoco | 44 | quizá | 77 | quizás
11 | tal | 45 | perfectamente | 78 | generalmente
12 | todavía | 46 | justo | 79 | obviamente
13 | a lo mejor | 47 | directamente | 80 | poco a poco
14 | sólo | 48 | aún | 81 | concretamente
15 | igual | 49 | de repente | 82 | principalmente
16 | casi | 50 | pronto | 83 | franco
17 | al final | 51 | seguramente | 84 | en teoría
18 | nunca | 52 | seguro | 85 | exclusivamente
19 | sobre todo | 53 | de vez en cuando | 86 | menos mal
20 | más o menos | 54 | al principio | 87 | curiosamente
21 | mal | 55 | o así | 88 | posteriormente
22 | fuera | 56 | fundamentalmente | 89 | indudablemente
23 | realmente | 57 | aparte | 90 | a la vez
24 | por lo menos | 58 | completamente | 91 | habitualmente
25 | mejor | 59 | justamente | 92 | a su vez
26 | a veces | 60 | de pie | 93 | cierto
27 | solamente | 61 | lógicamente | 94 | alrededor
28 | ahora mismo | 62 | lejos | 95 | inicialmente
29 | simplemente | 63 | anteriormente | 96 | profundamente
30 | primero | 64 | de nuevo | 97 | en coche
31 | normalmente | 65 | al revés | 98 | afortunado
32 | dentro | 66 | en serio | 99 | finalmente
33 | cerca | 67 | en absoluto | 100 | previamente
34 | exactamente | | | |

Table 4.8 High frequency adjectives
Rank | Lemma | Rank | Lemma | Rank | Lemma | Rank | Lemma
1 | mismo | 26 | buen | 51 | fundamental | 76 | correcto
2 | bueno | 27 | cierto | 52 | especial | 77 | vecino
3 | mejor | 28 | bajo | 53 | humano | 78 | pesado
4 | pequeño | 29 | general | 54 | rápido | 79 | caro
5 | grande | 30 | diferente | 55 | próximo | 80 | histórico
6 | importante | 31 | junto | 56 | barato | 81 | rico
7 | último | 32 | alto | 57 | blanco | 82 | joven
8 | solo | 33 | inglés | 58 | natural | 83 | imposible
9 | nuevo | 34 | pobre | 59 | fatal | 84 | terreno
10 | normal | 35 | peor | 60 | técnico | 85 | necesario
11 | mayor | 36 | interesante | 61 | urbano | 86 | público
12 | social | 37 | raro | 62 | mental | 87 | verde
13 | distinto | 38 | claro | 63 | mínimo | 88 | primate
14 | español | 39 | subordinado | 64 | suficiente | 89 | internacional
15 | bonito | 40 | nacional | 65 | material | 90 | delincuente
16 | propio | 41 | pasado | 66 | europeo | 91 | moderno
17 | único | 42 | fácil | 67 | guapo | 92 | frío
18 | siguiente | 43 | anterior | 68 | tonto | 93 | típico
19 | político | 44 | principal | 69 | radical | 94 | gracioso
20 | malo | 45 | extranjero | 70 | informático | 95 | serio
21 | posible | 46 | libre | 71 | contento | 96 | clásico
22 | gran | 47 | capaz | 72 | electoral | 97 | habitual
23 | mal | 48 | científico | 73 | estupendo | 98 | central
24 | fuerte | 49 | económico | 74 | viejo | 99 | castellano
25 | difícil | 50 | determinado | 75 | menor | 100 | grave

In the comparison of a spoken Spanish corpus (C-ORAL-ROM) and a written one (UAM Treebank), two conclusions can be reached:
1. There is an inverse distribution for lexical (nouns, verbs, adjectives etc.) and non-lexical (conjunctions, prepositions etc.) categories: the latter are much more frequent in speech, while lexical categories are more present in written texts.
Table 4.9 Frequency word lists: comparison between C-ORAL-ROM and other corpora
Category | Rank | C-ORAL-ROM | Spanish UAM Treebank | Juilland & Chang-Rodríguez* | Alexejew et al. (English)*
Verbs (lemmas) | 1 | ser | ser | |
 | 2 | decir | tener | |
 | 3 | estar | estar | |
 | 4 | tener | pedir | |
 | 5 | hacer | anunciar | |
 | 6 | haber | haber | |
 | 7 | ir | hacer | |
 | 8 | ver | querer | |
 | 9 | dar | dar | |
 | 10 | saber | morir | |
Nouns (lemmas) | 1 | día | año | |
 | 2 | año | millón | |
 | 3 | cosa | gobierno | |
 | 4 | tío | día | |
 | 5 | vez | país | |
 | 6 | casa | vida | |
 | 7 | gente | mundo | |
 | 8 | persona | niño | |
 | 9 | tiempo | presidente | |
 | 10 | momento | juez | |
Adjectives (lemmas) | 1 | mismo | español | |
 | 2 | bueno | grande | |
 | 3 | grande | nuevo | |
 | 4 | mejor | bueno | |
 | 5 | pequeño | último | |
 | 6 | importante | pequeño | |
 | 7 | último | político | |
 | 8 | solo | europeo | |
 | 9 | nuevo | italiano | |
 | 10 | normal | vasco | |
Most frequent words | 1 | de | el | de | the
 | 2 | el | de | el | of
 | 3 | y | en | la | to
 | 4 | a | a | y | in
 | 5 | que (conjunction) | un | a | and
 | 6 | la | y | en | a
 | 7 | no | suyo | él | for
 | 8 | en | ser | que (pronoun) | was
 | 9 | es | con | ser (lemma) | is
 | 10 | se | que (conjunction) | que (conjunction) | that
* These corpora do not contain specific information on lemmatisation. Reflected in the table are only the first ten word-forms.

Table 4.10 Distribution of lexical classes: comparison between C-ORAL-ROM and other corpora
 | C-ORAL-ROM % | Spanish UAM Treebank % | Longman (Conversation) %
Verbs | 17.12 | 10.36 | 13
Nouns | 13.39 | 31.25 | 14
Adjectives | 3.97 | 6.99 | 3
Adverbs | 6.50 | 2.88 | 6
Total | 40.98 | 51.48 | 36
2. It is also interesting to note the difference in the distribution of verbs, nouns, adjectives and adverbs in the two corpora. In speech, the presence of verbs and adverbs is much greater than in writing, while nouns and adjectives are much more frequent in written than in spoken texts. The correlation between the frequencies of, first, nouns and adjectives, and, second, verbs and adverbs seems obvious, as the members of these pairs of words are closely related through syntactic features, with adjectives and adverbs being modifiers of nouns and verbs, respectively.
As for the comparison between spoken Spanish and spoken English, the results are very similar in general terms: the distribution for nouns, adjectives and adverbs is almost equal, while in the case of verbs, the frequency is higher in Spanish; it is also noticeable how verbs are the most frequent category in Spanish, while in English this position corresponds to nouns.
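The totals in Table 4.10 and the observations drawn from it can be checked with a short Python sketch. The percentages are copied from the table; the dictionary layout and the function name are illustrative choices.

```python
# Percentages copied from Table 4.10.
SHARES = {
    "C-ORAL-ROM": {"Verbs": 17.12, "Nouns": 13.39, "Adjectives": 3.97, "Adverbs": 6.50},
    "Spanish UAM Treebank": {"Verbs": 10.36, "Nouns": 31.25, "Adjectives": 6.99, "Adverbs": 2.88},
    "Longman (Conversation)": {"Verbs": 13, "Nouns": 14, "Adjectives": 3, "Adverbs": 6},
}


def summarise(shares):
    """Return the summed share of the four classes and the largest class."""
    total = round(sum(shares.values()), 2)
    top = max(shares, key=shares.get)
    return total, top


for corpus, shares in SHARES.items():
    total, top = summarise(shares)
    print(f"{corpus}: lexical classes {total}%, most frequent class: {top}")
```

The sums reproduce the Total row (40.98, 51.48 and 36), and the largest class is verbs for C-ORAL-ROM but nouns for both the UAM Treebank and the Longman conversation data.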
Notes
The website of VALESCO is http://www.uv.es/∼valesco/.
Its website is http://www.uvigo.es/webs/sli/index.html
We have recently digitalised the analogue audio tapes. For more information, please contact [email protected]
In this sense, CORLEC is more extensive, spontaneous and natural than C-ORAL-ROM, since it is three times larger and does not have the constraint of needing to obtain the permission of all the participants.
Universidad Autónoma de Madrid acknowledges the source of the sound files in the media recordings as being kindly provided by RTVE (Radio Televisión Española), Radio Televisión Madrid, COPE (Cadena de Ondas Populares Españolas/Radio Popular) and Onda Cero Radio.
Not available in this edition.
See the complete list of non-orthographic productions in the table of correspondence in the DVD, in the Spanish section of the corpus Metadata Menu.
The files from the C-ORAL-ROM corpus annotated by hand are the following: efamcv03; efamcv06; efamcv07; efamdl03; efamdl04; efamdl05; efamdl10; efammn05; emedin01; emdmt01; emednw01; emedrp01_1; emedsc04; emedsp01; emedts01; emedts05; emedts07; emedts09; enatco03; enatpd01; enatte01; epubcv01; epubcv02; epubdl07; epubdl13; epubmn01; etelef01.
Chapter 5
The Portuguese corpus Maria Fernanda Bacelar do Nascimento, José Bettencourt Gonçalves, Rita Veloso, Sandra Antunes, Florbela Barreto, and Raquel Amaro
History of the corpus within the national framework
Historical overview
The Portuguese corpus of the C-ORAL-ROM project was compiled by the Corpus Linguistics Group at the Centro de Linguística da Universidade de Lisboa (CLUL). For its constitution, samples from existing corpora were reused and new recordings were specifically collected. In order to describe the previously collected materials reused for this project, it is important to briefly outline the history of Portuguese linguistic corpora, namely the general spoken corpora developed in Portugal. We will focus primarily on CLUL’s spoken corpus, since it is the only one among the corpora of spoken language in Portugal that can be considered a general corpus (this corpus is included in the large Corpus de Referência do Português Contemporâneo (CRPC)). It is, however, important to make reference to other large spoken corpora in Portugal that have been compiled for specific studies, namely for studies in dialectology and psycholinguistics. There are also important Portuguese speech corpora, compiled for acoustic studies of phonetic aspects of European Portuguese, speech recognition and synthesis, and intonation studies. We start our review with these other corpora.
I. The Variation Group at CLUL is working on several Atlases of Portuguese, involving the compilation of several spoken corpora:
1. Atlas Linguístico-Etnográfico de Portugal e da Galiza (ALEPG) is supervised by João Saramago and aims at the publication of a national linguistic atlas. Within the context of ALEPG, there are a few projects: Atlas Linguarum Europae (ALE); Atlas Linguistique Roman (AliR); Atlas Linguístico-Etnográfico dos Açores (ALEAç); Syntactically Annotated Corpus of Portuguese Dialects (CORDIAL-SIN), and Inflectional Variants of the Verb in Spoken Continental Portuguese (VarV). The ALEPG corpus comprises about 3,500 hours of recorded interviews following the same linguistic questionnaire, made in 212 localities (176 in continental territory, 12 at the Spanish border, 7 in the Madeira archipelago, and 17 in the Azores archipelago), and is digitised on CD. The corpus is not yet available, but more information can be found at the following webpages:
http://www.clul.ul.pt/english/sectores/projecto_alepg.html
http://www.clul.ul.pt/english/sectores/projectos.html#4
2. Atlas Linguístico do Litoral Português (ALLP) is supervised by Gabriela Vitorino and has as its main objective the study of the specialised lexicon related to fishing. To date, the corpus has about 210 hours of recorded interviews made in 40 localities on the coast (23 in continental territory, 5 in Madeira, and 12 in the Azores). The corpus is not yet available, but data concerning the Azores (fauna and flora), in a total of 156 lexical maps and corresponding geographical maps, will be published as a volume of the Atlas Linguístico-Etnográfico dos Açores (ALEAç). For more information, see the project webpage at http://www.clul.ul.pt/english/sectores/projecto_allp.html
3. The Barlavento do Algarve (BA) corpus, supervised by Luisa Segura, comprises more than 100 hours of recorded interviews in an inquiry net of 53 localities. Its objective is the study of the vowel system of a dialectal variant of the south of Portugal so as to define its geographic limits. Although the corpus is not yet available on the web, researchers can access the recorded material in person.
II. The Laboratório de Psicolinguística of the Faculdade de Letras da Universidade de Lisboa, under the supervision of Isabel Hub Faria, compiled the CHILDES corpus and several other related corpora.
4. Corpus Acquisition of European Portuguese, compiled at the Laboratory of Psycholinguistics, Faculdade de Letras da Universidade de Lisboa, under the supervision of Isabel Hub Faria and Maria João Freitas, consists of a corpus of monolingual Portuguese children’s oral productions during their first years of life. The studies conducted so far are in the areas of phonological, morphological, syntactical and lexical L1 acquisition, although the corpus may be used for analyses within other areas of linguistic research. The Laboratório de Psicolinguística of the Universidade de Lisboa manages the corpus when individual researchers or projects apply for its use. Further information may be obtained through [email protected]
5. Corpus BATORÉO ’94 was organised, transcribed and codified at the Laboratory of Psycholinguistics, Faculdade de Letras da Universidade de Lisboa, Lisbon, by Hanna Batoréo and supervised by Isabel Hub Faria, within the CHILDES Database (Acquisition of European Portuguese). Its objective is the study of acquisition of narrative discourse in European Portuguese, and it is available for research at the Laboratory of Psycholinguistics, Faculdade de Letras da Universidade de Lisboa; at [email protected]; or in a written version in Batoréo (2000).
III. The following 6 projects have been or are being compiled at INESC (Instituto de Engenharia de Sistemas e Computadores) in the Laboratório de Sistemas de Língua Falada, headed by Isabel Trancoso. Some of these projects are in partnership with CLUL and are co-supervised by Céu Viana. More information on these projects is available at http://www.l2f.inesc-id.pt
6. Corpus ALERT. The ALERT corpus was collected in the framework of the European project with the same name, with the goal of gathering material for training and evaluating several components of the ALERT media watch system for European Portuguese. The corpus has 3 main parts:
a. Speech Recognition Corpus (SRC), including 122 programmes of different types and schedules and amounting to 76 hours of audio data. The main goal of this corpus was the training of acoustic models and the adaptation of language models used in a large vocabulary speech recognition component.
b. Topic Detection Corpus (TDC), comprising close to 300 hours of recordings on a daily basis, over a period of 9 months, starting in February 2001. The main goal of this corpus was to have a broader coverage of topics and associated topic classification for training a topic indexation module.
c. Textual Corpus (TC). This corpus continues to be updated and is now reaching close to 450 million words.
7. Corpus CORAL. The CORAL corpus was collected in the framework of a national project sponsored by the Praxis XXI programme, by a consortium formed by INESC, CLUL, FLUL (Faculdade de Letras da Universidade de Lisboa), and FCSH-UNL (Faculdade de Ciências Sociais e Humanas da Universidade Nova de Lisboa). The purpose of this project is the collection of a spoken dialogue corpus, with several levels of labelling: orthographic, phonetic, phonological, syntactic and semantic. The corpus comprises 64 dialogues about the predetermined subject of maps.
8. Corpus BD-PÚBLICO. The BD-PÚBLICO database was collected by INESC in the framework of a European project (SPRACH), and a national project (Praxis XXI programme), with the collaboration of Instituto Superior Técnico (IST) and the Público newspaper. Text material for the read sentences was extracted from the Portuguese newspaper Público, consisting of 6 months of news, totalling 10 million words and 156,000 different forms.
9. Corpus SPEECHDAT. The SPEECHDAT corpus collection for European Portuguese was divided into 2 phases: the collection of 1,000 telephone calls (preparatory MLAP Project SPEECHDAT-M); and the collection of 4,000 telephone calls (Language Engineering Project SPEECHDAT-II).
The project includes databases from all official languages of the EU and some major dialectal variants. The work was done by INESC under a subcontract with Portugal Telecom.
10. Corpus BDFALA (Base de Dados Falada para o Português Europeu). The BDFALA corpus was jointly developed by INESC and CLUL in the framework of the national project sponsored by JNICT (Program Lusitânia). Its goal is the enlargement of the EUROM.1 corpus (see point 11 below), mainly for the improvement of speech synthesis systems. 6 types of corpus material were collected, comprising approximately 4,600 isolated words, 350 sentences for prosodic studies, 18 phonetically-complete paragraphs, 60 read paragraphs extracted from television debates, approximately 3,000 logatomes, and 600 phonetically rich sentences.
11. Corpus EUROM.1. The EUROM.1 corpus for European Portuguese was collected in the framework of the SAM_A (Speech Assessment Methods) European project, jointly by INESC and CLUL. Despite its main use for recognition and synthesis research, this corpus has also been used for phonetic coding research. For each of the 11 languages included in this project, 4 types of corpus material were collected: CVC material (totalling 121 different logatomes) in isolation and in context (5 carrier phrases); 100 selected numbers from 0 to 9,999; 40 short passages each containing 5 thematically connected sentences; and 50 filler sentences to compensate for the phoneme-frequency imbalance in the passages.
IV. The following spoken corpus was compiled by the Speech Group at CLUL:
12. The Corpus de Português Europeu – Variação (CPE-Var), compiled by Celeste Rodrigues, can be used for phonological and sociolinguistic studies. It comprises approximately 130 recording hours, and includes read texts and spontaneous conversations. The corpus can be queried with the consent of the coordinator. For more information on the corpus, see Rodrigues (2003).
As mentioned at the outset, the Portuguese corpora described above do not have the properties of a general corpus, as was required for the C-ORAL-ROM project, viz. a corpus containing spontaneous and formal language, with monologues, dialogues and conversations on several topics, recorded in family/private and public situations, with a large range of speakers. We now turn to CLUL’s general spoken corpora, which are the ones that have been reused in the C-ORAL-ROM project.
1. Corpus Português Fundamental. The first electronic linguistic corpus of European Portuguese was the spoken corpus of the project Português Fundamental, recorded between 1970 and 1975; the project results regarding the spoken corpus were later published (Bacelar do Nascimento et al. 1987). The Português Fundamental spoken corpus, comprising 700,000 tokens, was compiled following the methodology of the Français Fondamental (Gougenheim et al. 1964), of 312,135 tokens, and enriched by the experience of the Español Fundamental (Rivenc 1973, 2000; Rivenc & Rojo Sastre 1968), of 8,000,000 tokens. Português Fundamental essentially aimed at providing lexical data extracted from authentic texts, with information on usage frequency and distribution by speakers, for teaching Portuguese both as a foreign language and as a first language. The corpus includes 1,400 recordings (dialogues, conversations and monologues) of spontaneous language on very diverse topics, from all regions of Portugal, selected according to
population density. Speakers are between 15 and 70 years old, have different educational levels and jobs, and the situations are essentially private or family domains. The recordings were orthographically transcribed, following criteria defined according to the project objectives. For statistical purposes, each transcription contains 500 graphical words. In the 140 texts published in Bacelar do Nascimento et al. (1987), the number of words of each text is slightly larger to avoid an abrupt ending in the middle of a sentence. These 140 texts are available for download at the following webpage: http://www.clul.ul.pt/sectores/corpus_oral_pf_publicado.zip
2. Corpus CRPC and related projects on spoken language. The Corpus de Referência do Português Contemporâneo¹ (CRPC), sponsored by Fundação Calouste Gulbenkian, União Latina, Fundação Oriente and JNICT, started being compiled in 1988 and comprises around 281 million words: 278 million words of written language and 2.5 million words of spoken language. The corpus is in constant development, covering a time span from 1820 to the present day, and includes the following Portuguese varieties: European, Brazilian, African (Angola, Cape Verde, Guinea-Bissau, Mozambique and S. Tome and Principe) and Asian (Goa, Macao and East Timor). For more information, see the corpus webpage at http://www.clul.ul.pt/sectores/projectos_crpc.html
The spoken CRPC sub-corpus includes the spoken corpus of Português Fundamental (1970-75) and many other texts of spoken language, both spontaneous and formal, of all geographical Portuguese varieties, with a broad range of topics and speakers, in terms of age, educational level and socio-professional category. The CRPC is the basis for several projects (which, in turn, contribute to the enrichment of CRPC), quite a few of which were a source of texts for the C-ORAL-ROM Portuguese corpus:
3. Português Falado, Variedades Geográficas e Sociais (European Commission Program LINGUA and SOCRATES/LINGUA), coordinated by M. F. Bacelar do Nascimento. This corpus was compiled by CLUL with the following partners: University of Toulouse-le-Mirail (P. Rivenc) and University of Provence-Aix-Marseille (C. Blanche-Benveniste). The corpus, comprising approximately 9 hours of recordings and 92,000 transcribed graphical words, includes samples of all the spoken Portuguese varieties and is published in 4 CD-ROMs (sponsored by Instituto Camões) with text-to-sound alignment (cf. Bacelar do Nascimento 2000).
4. VARPORT, Análise Contrastiva de Variedades do Português, a Portuguese and Brazilian project of CLUL (M.A. Mota) and the Faculdade de Letras da Universidade Federal do Rio de Janeiro (S. Brandão), within the CAPES / ICCTI Program (cf. Brandão & Mota 2000, 2003). For this project, a comparable corpus of written and spoken language (standard and non-standard) of both Portuguese varieties was compiled following the same socio-linguistic criteria, in order to assure contrastive analysis accuracy. The webpage of the VARPORT corpus is: http://www.letras.ufrj.br/varport
5. REDIP, The International Broadcast Network of Portuguese: radio, television and press (Lusitania Programme of FCT). The project is coordinated by the Instituto de Linguística Teórica e Computacional (ILTEC), headed by M. H. Mateus, with the following partners: Centro de Linguística da Universidade de Lisboa (M. F. Bacelar do Nascimento) and Universidade Aberta (M. E. Marques). The corpus of this project comprises a database of recorded samples collected through the media, namely radio, television and written press, with the aim of describing European Portuguese as used in the media and providing information on its lexical, grammatical, semantic and stylistic properties. The project webpage is http://www.iltec.pt/projectos/redip.htm
Besides these projects based on spoken corpora, CRPC has been used as the basis for other national and international projects at CLUL. Some examples of international projects are the European Commission projects Network of European Reference Corpora (NERC), Preparatory Action for Linguistic Resources Organization for Language Engineering (LE-PAROLE), and the European Language Activity Network (ELAN); national projects include Léxico Multifuncional Computorizado do Português Contemporâneo (LMCPC) and Dicionário de Combinatórias do Português (DCP). CRPC is also a source of constant queries from users and researchers on the Portuguese language.
Reusing materials from existing databases at CLUL
In order to meet the criteria established for the compilation of the multilingual C-ORAL-ROM corpus, and to assure comparability and feasibility for contrastive study between the four sub-corpora, the following work was done on the existing materials:
a. Collection of materials
Collection took into account text compliance with the following parameters:
i. Informal speech with situation variation: Family or Private versus Public; and dialogical structure: monologue, dialogue and conversation.
ii. Formal speech with Media variation: Formal in Natural Context, Formal through Media and Telephone, and with subtypes in text genres.
b. Selection of previously collected texts
Previously collected texts were selected according to the following parameters:
i. Text length. It is important to mention that the length of the already existing transcriptions was very different from the required dimension for C-ORAL-ROM texts. For example, as previously mentioned, the transcriptions of Português Fundamental have approximately 500 graphic words, but the text length for spontaneous language in C-ORAL-ROM varies between 500 and 4,500 graphic words. It was thus mandatory to verify for each collected text whether it was possible to continue the transcription until the required length was attained.
ii. Acoustic quality. Not all previously collected materials could be integrated in the C-ORAL-ROM corpus since this project required an acoustic quality that most of the existing materials lacked. This is due to several factors. First, these materials were relatively old, some of them dating back 30 years, and were kept on tape. Second, the original objective of those recordings was essentially to provide materials for lexical, morpho-syntactic and syntactic studies. Since this did not require a high acoustic quality, the recordings were made with medium quality equipment. Additionally, since the aim was to obtain very natural texts, all the recordings were made in natural environments, at times in noisy surroundings, for example, during family meals, inside cars, in garages, and always without any concern for sound isolation.
iii. Written consent for publication. Another criterion for the selection of previously collected materials was the requirement of written consent from speakers and some institutions for the publication of their texts. A large number of texts were thus excluded due to the difficulty in contacting speakers who, for the most part, were recorded a long time ago.
c. Reformatting of existing transcriptions
Previously collected transcriptions, which followed different criteria depending on the project to which they belonged, had to be formatted in order to match the criteria established for C-ORAL-ROM. The fact that this corpus has text-to-sound automatic alignment makes it less problematic to mark the transcription in a rigorous way with certain phenomena, like phonetic variation or different listening interpretations. We give examples of some transcription differences. For instance, in the Português Fundamental project, there were no specific marks for overlaps, and the orthographic transcription included minimal punctuation symbols that usually received the same value they have in writing; however, special importance was given to their prosodic marker function, so as to transmit, even in a rudimentary way, the spoken language rhythm. Minimal punctuation was also used to solve ambiguities that could interfere with lexical analysis or endanger text coherence. In later transcriptions made for CRPC, punctuation symbols were no longer included; however, there were specific marks for overlaps and for different judgements by the transcribing team, following the criteria established by GARS (Groupe Aixois de Recherches en Syntaxe) of the Université de Provence, headed by Claire Blanche-Benveniste. The texts transcribed according to this model, mostly since 1988, were compiled not only for lexical studies, but also for different kinds of studies, especially on spoken language syntax.
To summarise, the following work was then undertaken in order to reuse materials collected for previous projects:
a. Listening to and selection of recordings according to the requirements of the C-ORAL-ROM project (constitution, topics, sound quality, collection of written consent, etc.).
b. Adaptation of several transcription criteria to the transcription model of the C-ORAL-ROM corpus.
Table 5.1 Materials used in C-ORAL-ROM that were also part of other projects of the Centro de Linguística da Universidade de Lisboa Corpus
Date Informal family/private Cv* Dl Mn
CRPC: Português Fundamental CRPC (not included in other projects) Português Falado REDIP Variation Group at CLUL VARPORT Total
1970– 1975 2
6
1994– 1996
1996
3
Number of texts Informal public Media Formal in natural context Cv Dl Mn 2
5
18
6
1
1
8
1
1
1998 1999 2002
Total
2 11
3 1
1
14 1 1 44
* Cv = conversations; Dl = dialogues; Mn = monologues
c. Transcription of larger sequences of the selected texts to obtain the required length for the C-ORAL-ROM corpus.
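The length requirement that motivates step c can be sketched in a few lines of Python. This is a minimal illustration, not the procedure actually used by the team: it assumes that a graphic word is any whitespace-delimited string, and the names graphic_word_count, words_missing and within_range are hypothetical. The bounds of 500 and 4,500 graphic words are those given above for spontaneous language in C-ORAL-ROM.

```python
# Bounds stated in the text for spontaneous-language texts in C-ORAL-ROM.
MIN_WORDS = 500
MAX_WORDS = 4500


def graphic_word_count(transcription: str) -> int:
    """Count graphic words, assumed here to be whitespace-delimited strings."""
    return len(transcription.split())


def words_missing(transcription: str) -> int:
    """Number of further words to transcribe to reach the lower bound."""
    return max(0, MIN_WORDS - graphic_word_count(transcription))


def within_range(transcription: str) -> bool:
    """True if the transcription length falls inside the stated range."""
    return MIN_WORDS <= graphic_word_count(transcription) <= MAX_WORDS


# A hypothetical 300-word transcription would need 200 more words.
sample = " ".join(["palavra"] * 300)
print(graphic_word_count(sample), words_missing(sample), within_range(sample))
```

Whether the recording actually continues beyond the transcribed passage still has to be checked by listening, which the sketch does not attempt to model.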
Reusing previously collected materials was very useful for the project, since it made it possible for the corpus to be more diversified. However, the steps required for reusing these materials did not make the constitution of the corpus any less onerous or time-consuming. Table 5.1 presents the list of materials that were previously collected for other projects at CLUL and that were formatted so as to be integrated into the C-ORAL-ROM corpus.
New materials specifically collected and/or transcribed for the C-ORAL-ROM project
New recordings were made specifically for the C-ORAL-ROM project; in addition, non-transcribed texts that were part of the spoken CRPC archives and that conformed to the requirements of this corpus were transcribed. These materials are presented in Table 5.2.
Table 5.2 Materials collected and/or transcribed specifically for C-ORAL-ROM (number of texts)
Date         Informal family/private    Informal public    Media    Formal in natural context    Phone conversations    Total
             Cv    Dl    Mn             Cv    Dl    Mn
1978–2002    10    17    21             1     7     1      15       19                           17                     108
Final remarks
To sum up, the Portuguese C-ORAL-ROM corpus comprises 144 texts, covering a 30-year time span (1970–2002), totalling approximately 30 recording hours in a digitised version of the sound wave and 317,916 graphic words of the transcribed version. Its composition is illustrated in Table 5.3. The Portuguese corpus compiled for C-ORAL-ROM is an extremely valuable resource and represents an important improvement in the history of general European Portuguese corpora, mainly due to:
a. the fact that it is part of a multilingual corpus with the same compilation and treatment criteria for its four subcorpora, allowing for the first time for comparative studies between spoken language in four Romance languages, based on authentic texts with thorough representation;
b. the diversity of its internal constitution, with regard to discourse type and context of use;
c. the excellent sound quality of the more recent recordings and the good sound quality required for the older, reused recordings;
d. the time span covered, since it includes texts collected from 1970 up until 2002;
e. the text-to-sound alignment with prosodic annotation, an important feature not only for the general needs of the user community but also for particular studies and for further applications;
f. the morpho-syntactic annotation, which includes annotation of linguistic phenomena specific to spoken language.
In spite of the fact that the C-ORAL-ROM corpus is a small corpus by modern standards, it allows for important studies for each one of the languages involved, especially
Table 5.3 Constitution of the Portuguese C-ORAL-ROM corpus Number of texts Informal family/private Cv Dl Mn
Cv
Dl
Mn
12
1
11
7
31
24
Informal public
Media
Formal in natural context
Phone conversations
Total
26
23
17
152
prosodic studies, and for contrastive studies between the four Romance languages and between spoken and written language. Furthermore, the corpus properties regarding data compilation and quality, as well as prosodic and morpho-syntactic annotation, make it a model to be followed in the compilation of spoken general corpora of larger dimensions.
Orthographic transcription
The general rules adopted in the transcription of the Portuguese C-ORAL-ROM corpus were those decided upon for the four corpora in the first project meeting in Lisbon in June 2000. According to these rules, the four teams should follow the same conventions, for instance, on the use of a common header for each text, an orthographic transcription, without punctuation, of the digitised version of the recordings, accuracy in representing what speakers say (including repeated words, incomplete or incomprehensible words, etc.), interaction marks, overlapping marks, identification of tone boundaries, etc. Apart from these common conventions, each team would keep its own traditional conventions regarding other graphic representations of the spoken texts.
Specific Portuguese conventions
General orthographic norms
As a general rule, the team transcribed the entire corpus according to official orthography. This decision was in fact taken decades earlier, in 1970, when the Portuguese team started the transcription of Português Fundamental: assuming that the orthographic representation of oral discourse is always inexact, insufficient and difficult to interpret, one should not infuse it with additional marks and arbitrarily chosen spellings. The attempt to reliably represent regional or speaker-specific mispronunciations further complicates the transcribed text, making it practically unreadable, without, however, attaining the purpose of an orthographic representation completely faithful to the spoken text (cf. Bacelar Do Nascimento 1987: 29–75). When the edition of a spoken corpus with text-to-sound alignment was later started (cf. Bacelar Do Nascimento 2000, 2001), that decision turned out to be even more pertinent. The user of the corpus listens to the sound and reads the transcription at the same time, and therefore no longer loses any specific information of oral speech, as would happen if the orthographic transcription were read separately. (Other spoken research groups have also used official orthographic transcription only, for example, GARS (Groupe Aixois de Recherches en Syntaxe) headed by Claire Blanche-Benveniste (cf. Blanche-Benveniste 2000), having chosen it to make the text more easily readable.)
Other specific conventions
a. Free variation
Variants of a word were transcribed according to how they are registered in the reference Portuguese dictionaries, if found therein. Such is the case for phonetic phenomena like aphaeresis: ainda > inda; prothesis: mostrar > amostrar; metathesis: cacatua > catatua; or alternations: louça/loiça.
b. Proper names
It is extremely difficult in spoken texts to accurately draw the distinction between proper and common names. For this reason, the team’s practice was to transcribe with an initial capital letter only anthroponyms and toponyms. These names correspond effectively to the functions more consensually assigned to proper names: designating and not signifying (cf. Segura Da Cruz 1987: 344–353). Examples of toponyms include África, Bruxelas, Casablanca, China, Lisboa, Magrebe, Pequim, and those of anthroponyms include Boris Vian, Camões, Einstein, Madonna, Gabriel Garcia Marques, Oronte.
c. Titles
Titles were placed within quotation marks, e.g. books: “crónica dos bons malandros”; movies: “pixote”; pieces of music: “requiem” (de Verdi); newspapers: “expresso”; radio programmes: “feira franca” or television broadcasts: “big brother”.
d. Foreign words
Foreign words were transcribed in their original orthography whenever they were pronounced closely to the original pronunciation, e.g. cachet, check-in, feeling, free-lancer, emmerdeur, ski, stress, voyeurs, workshop. When the foreign words were pronounced in a ‘Portuguese way’, they were transcribed according to the entries of Portuguese reference dictionaries or according to the orthography adopted in those dictionaries for similar cases. The latter include all hybrid forms like pochettezinha, shortezinho, stressado and videozinho.
e. Wrong pronunciation
When the speaker mispronounced a word and immediately corrected it, the two spellings were maintained in the transcription of the text, as follows: lugal/lugar, sau/céu. If the speaker mispronounced a word and went on in his speech without any correction, the standard spelling of the word was kept in the transcription and a note regarding the wrong pronunciation was added in the header comments (e.g. “in the text pfamdl07, the informant produces banrdarilheiro instead of bandarilheiro”).
f. Paralinguistics and onomatopoeia
Paralinguistic forms and onomatopoeia not listed in reference dictionaries were transcribed to represent, as closely as possible, the sound produced, e.g. pac-pac-pac, pffff, tanana, tatata, uuu.
g. Acronyms
Acronyms were transcribed in capital letters, without full stops, e.g. TAP and not T.A.P. However, if the acronym already had an entry in Portuguese dictionaries as a common name, it was transcribed in lower case, e.g. sida or radar.
h. Short forms
Short forms were maintained in the transcription, e.g. prof.
i. Numbers
Dates and numbers were always transcribed in full, e.g. mil novecentos e trinta e cinco, mil escudos, mil paus.
l. Letters of the alphabet
The names of letters of the alphabet were always transcribed in full, e.g. pê for P, erre for R, xis for X.
Interjections
A list of interjections, discourse particles and emphatics was created during the process of transcription; this is presented in Table 5.4.
Table 5.4 Interjections, discourse markers and emphatics
Position
1     pois
2     portanto
3     sim
4     ah
5     pronto
6     então
7     pá
8     claro
9     lá
10    ai
11    olha
12    assim
13    exacto
14    digamos
15    olhe
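Two of the specific conventions described in this section, those for acronyms and for letters of the alphabet, are mechanical enough to be expressed as a small normalisation routine. The Python sketch below is purely illustrative: the transcription itself was carried out by hand, and the function names as well as the two lookup tables, which contain only the examples quoted in this section, are assumptions.

```python
# Lookup tables restricted to the examples quoted in the text; a real
# implementation would need full dictionary-based lists (assumption).
LEXICALISED_ACRONYMS = {"sida", "radar"}
LETTER_NAMES = {"P": "pê", "R": "erre", "X": "xis"}


def normalise_acronym(token: str) -> str:
    """Drop full stops; lower-case acronyms listed as common names,
    otherwise write the acronym in capital letters."""
    bare = token.replace(".", "")
    if bare.lower() in LEXICALISED_ACRONYMS:
        return bare.lower()
    return bare.upper()


def spell_letter(letter: str) -> str:
    """Return the full name of a letter, or the letter unchanged
    when its name is not in the (deliberately partial) table."""
    return LETTER_NAMES.get(letter.upper(), letter)


print(normalise_acronym("T.A.P."))  # TAP
print(normalise_acronym("SIDA"))    # sida
print(spell_letter("x"))            # xis
```

Conventions that depend on how a word was pronounced, such as the treatment of foreign words or of mispronunciations, cannot be reduced to a lookup of this kind.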
Morpho-syntactic tagging
Tools and strategy adopted for automatic PoS tagging and lemmatisation
In order to develop an annotated corpus (which includes, in this case, PoS tagging and wordform lemmatisation) with maximal accuracy, it was decided to recover and reuse resources already available for written corpora, aiming at two different kinds of results: on the one hand, obtaining a final product in a reasonable length of time; on the other, the study and description of the intrinsic characteristics of spoken language and of specific uses regarding its automatic treatment. The main goals of the C-ORAL-ROM project include, besides the addition of the text-to-sound alignment and prosodic annotation of the entire corpus, the publication of grammatical descriptions of the spoken language. Therefore, it was necessary to consider PoS tagging and wordform lemmatisation in order to extract quantitative data (e.g. wordform and lemma frequencies, occurrences of verbs, nouns, adjectives and adverbs, and of verbless and non-verbless utterances, etc.) and qualitative information (speech marker distribution, major semantic fields, type of utterance distribution per type of register, etc.) comparable in the four languages. The tagging and lemmatisation processes are thus the means to achieve the final annotated corpus to work upon. In order to reuse existing tools, the morpho-syntactic annotation and lemmatisation of the corpus were performed in two different stages. The Portuguese team used Eric Brill’s tagger (Brill 1993),² trained on a written Portuguese corpus of 250,000 words, morpho-syntactically annotated and manually revised. The initial tagset for the morpho-syntactic annotation of written corpora, which was also maintained for the C-ORAL-ROM tagset, covers the main PoS categories (Noun, Verb, Adjective, etc.) and secondary ones (tense, conjunction type, proper noun and common noun, variable vs. invariable pronouns, auxiliary vs. main verbs, etc.), but person, gender and number categories were not included (see Table 5.5).
Table 5.5 Portuguese C-ORAL-ROM corpus tagset
Main Class | Specification | Tag | Examples
Verb | | V |
Auxiliary Verb | | VAUX |
 | Present Indicative | pi | sou, tenho, chamo
 | Past Indicative | ppi | fui, tive, quis, dormi
 | Imperfect Indicative | ii | estava, era, dizia, dormia
 | Pluperfect Indicative | mpi | fora, estivera, comera
 | Future Indicative | fi | serei, terei, direi
 | Conditional | c | seria, teria, dormiria
 | Present Subjunctive | pc | seja, tenha, chame, durma
 | Imperfect Subjunctive | ic | estivesse, dormisse
 | Future Subjunctive | fc | estiver, dormir
Maria Fernanda Bacelar do Nascimento et al.
Table 5.5 (continued)
Main Class | Specification | Tag | Examples
 | Infinitive | B | estar, ser, dormir
 | Inflected Infinitive | Bf | estares, seres, dormires
 | Gerundive | G | estando, chamando, sendo
 | Imperative | imp | come, dorme, trabalha
 | Past Participles in Compound Tenses | VPP | comido, entregado
 | Adjectival Past Participles | PPA | comido, entregue
Noun | | N |
 | Proper Noun | p | Lisboa, Amália
 | Common Noun | c | casa, trabalho
Adjective | | ADJ | feliz, rico, giro
Preposition | | PREP | a, de, com, para
Adverb | | ADV | só, felizmente, como
Conjunction | | CONJ |
 | Coordinative | c | e, mas, porque, ou
 | Subordinative | s | que, porque, quando, como
Numeral | | NUM |
 | Cardinal | c | dois, três
 | Ordinal | o | primeiro, segundo, terceiro
Clitic | | CL | se, a, o, me, lhe
Personal Pronoun | | PES | eu, tu, ele, nós
Article | | ART |
 | Indefinite | i | uns, umas
 | Definite | d | a, o, as, os
Demonstrative | | DEM |
 | Invariable | i | isso, isto
 | Variable | v | essa, aquela, dito
Indefinite | | IND |
 | Invariable | i | alguém, nada, algo
 | Variable | v | outro, todo
Possessive | | POS | meu, teu
Relative/Interrogative/Exclamative | | REL |
 | Invariable | i | que, como, quando, quem
 | Variable | v | cujo, quanto
Adverbial Locution | | LADV | em cima
Conjunctional Locution | | LCONJ | só que
Prepositional Locution | | LPREP | em cima de
Pronominal Locution | | LPRON | o que, o qual
Interjection | | INT | ah, adeus, olá
Emphatic | | ENF | lá, cá, agora
Foreign Word | | ESTR | okay, mail
Acronym | | SIGL | PSD, ACAPO
Extralinguistic | | EL | hhh
Table 5.5 (continued)
Main Class | Tag | Examples
Paralinguistic | PL | hum, hã, nanana
Fragmented word or filled pause | FRAG | &dis, &eh
Discourse Marker | MD | bom, pronto, pá, digamos
Discursive Locution | LD | digamos assim, quer dizer
Without Classification | SC |
Word impossible to transcribe | Pimp | xxx
Sequence impossible to transcribe | Simp | yyyy

Sub-Tags | Sign | Examples
Ambiguous form | : | Um\ARTi:NUMc
Contracted forms | + | da\PREP+ARTd
Hyphenated forms (excepting compounds) | - | Viu-se\Vppi-CL
Tagset

Categories and main options

a. Main category – Verb (\V)
Verbs were tagged only for Mood and Tense information. Only those verbs used in compound tenses (ter and haver) were considered in the class of auxiliary verbs (\VAUX). This means that passive, modal and aspectual verbs were tagged as main verbs. The VAUX tag is followed by Mood and Tense subcategories.

Some functional distinctions between categories were added when it seemed important for future research. This is the case, for instance, of the distinction between the past participle in compound tenses (\VPP) and the past participle in other contexts (\PPA):

(1) podia ter\VAUXB trazido\VPP as tuas coisas
    [you could have brought your things]
(2) é um videoclube\Nc privado\PPA
    [it is a private videoclub]

The difficult and sometimes impossible task of deciding between some ambiguous categories was avoided by the use of portmanteau tags. The distinction between inflected and non-inflected infinitive verb forms, for example, was handled with a portmanteau tag such as \VB:VBf.

(3) os ministros reuniram-se para debaterem\VBf o problema (inflected infinitive)
    [the ministers gathered to debate the problem]
(4) os ministros reuniram-se para debater\VB o problema (non-inflected infinitive)
    [the ministers gathered to debate the problem]
(5) o conselho reuniu-se para debater\VB:VBf o problema (inflected or non-inflected infinitive?)
    [the cabinet gathered to debate the problem]

In cases where the infinitive form is preceded by an article, it was established that: (i) if nominal inflection is allowed, it was tagged as a Common Noun (\Nc); and (ii) if verbal inflection is allowed, it was tagged as an infinitive Verb (\VB):

(6) o comer\Nc do Minho (os comeres\Nc do Minho)
    [*the ‘eat’ of Minho] [*the ‘eats’ of Minho]
(7) o amar\VB a deus (o amarem\VBf a deus)
    [*the ‘to-love’ god] [*the ‘to-love-they’ god]
b. Main category – Noun (\N)
Nouns were tagged only for the distinction between Proper (\Np) and Common (\Nc) Nouns. As previously mentioned, following the transcription tradition at CLUL, in the C-ORAL-ROM Portuguese corpus only toponyms and anthroponyms were considered Proper Nouns and transcribed with an initial capital letter. This means that the distinction between Proper and Common Nouns was made prior to the tagging procedure, by the transcribers. Since Proper Nouns are the only words starting with a capital letter (excluding acronyms), it was subsequently easy for the tagger to identify them.

In the cases where a Proper Noun consisted of more than one word, it was decided that the words should be connected by an underscore (_), constituting a single element for annotation.

(8) Península_Ibérica\Np
    [Iberian Peninsula]

Whenever a prosodic symbol intervened within the Proper Noun, each element received the Np tag:

(9) Margarida_Rebelo\Np / Pinto\Np
c. Main category – Adjective (\ADJ)
No distinction of gender, number and degree was made for Adjective forms. Although a distinction is traditionally made between Adjective Pronouns and Substantive ones (Indefinite, Demonstrative, Possessive, Numeral, Relative and Interrogative), it was decided not to distinguish these two functions. Instead, specific tags were created for these six major classes of pronouns (unifying Interrogative and Relative pronouns), regardless of their adjective or substantive use, as can be seen in Table 5.5.
In addition, participial forms were never considered Adjectives unless they have acquired a meaning of their own, e.g. engraçado ‘funny’, which, although a participial form of the verb engraçar ‘to be fond of’, can also be analysed as a true adjective. As pointed out in (a) above, the remaining participial forms received the tag PPA.
d. Main category – Preposition (\PREP)
Particular cases of contractions of Prepositions and Articles were annotated by joining the two tags with the sign ‘+’, e.g. dos\PREP+ARTd. However, the contraction of the Preposition com with a pronominal form that does not exist otherwise (comigo – com + migo) only received the tag of Personal Pronoun (\PES).

e. Main category – Adverb (\ADV)
No distinction was made regarding the different types of Adverbs.

f. Main category – Conjunction (\CONJ)
The only distinction made here concerned the Subordinative (\CONJs) and Coordinative (\CONJc) Conjunctions.

g. Main category – Numeral (\NUM)
A distinction was made between Ordinal (\NUMo) and Cardinal (\NUMc) Numerals. As mentioned above, no information is given concerning the Adjective use of Ordinal Numerals, which are always tagged as \NUMo.

h. Main category – Clitic (\CL)
This tag was used for annotating non-tonic Personal Pronouns. With se, no distinction was made between its reciprocal, reflexive, impersonal and passive uses.

(10) já o\CL tenho
     [I have it already]
(11) agora tem que se\CL enterrar
     [now it has to be buried]

In cases of enclisis and mesoclisis, where the clitic is connected to the verb by a hyphen, the CL tag was also connected to the verb tag by a hyphen.

(12) acabou-se\Vppi-CL
     [it is over]
(13) dar-lhe-ia\Vc-CL
     [to give it to him]

i. Main category – Article (\ART)
The only division made in this category concerned the distinction between Definite Articles (\ARTd) and Indefinite Articles (\ARTi).
The forms o, a, os, as were always tagged as Definite Articles, and never considered Pronouns. Similarly, the forms uns, umas were always tagged as ARTi. The difficult task of distinguishing the indefinite article from the numeral use of the forms um and uma was solved by the portmanteau tag \ARTi:NUMc.

(14) ele comeu a\ARTd maçã (definite article)
     [he ate the apple]
(15) ele comeu duas\NUMc maçãs (numeral)
     [he ate two apples]
(16) ele comeu uma\ARTi:NUMc maçã (indefinite article or numeral?)
     [he ate an/one apple]
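The sign conventions introduced so far (‘:’ for portmanteau tags, ‘+’ for contractions, ‘-’ for verb plus clitic) lend themselves to mechanical decomposition. The following is a minimal Python sketch, not part of the project's toolchain: the function names are hypothetical, and the assumption that ‘:’ separates alternatives before ‘+’ and ‘-’ separate components is an illustrative choice, not something the tagset description states.

```python
import re


def split_annotation(token):
    """Split 'form\\TAG' or 'form\\LEMMA\\TAG' into (form, lemma, tag).

    The lemma is None when the token carries only a tag.
    """
    parts = token.split("\\")
    if len(parts) == 2:
        return parts[0], None, parts[1]
    if len(parts) == 3:
        return parts[0], parts[1], parts[2]
    raise ValueError(f"unexpected annotation format: {token!r}")


def decompose_tag(tag):
    """Return a list of alternatives, each a list of component tags.

    ':' separates alternative readings (portmanteau tags);
    '+' and '-' separate the components of contracted or hyphenated forms.
    Assumption of this sketch: ':' is resolved before '+' and '-'.
    """
    return [re.split(r"[+-]", alternative) for alternative in tag.split(":")]


if __name__ == "__main__":
    for token in ["uma\\ARTi:NUMc", "dos\\PREP+ARTd", "acabou-se\\Vppi-CL"]:
        form, lemma, tag = split_annotation(token)
        print(form, decompose_tag(tag))
```

A portmanteau tag thus yields several single-component alternatives, whereas a contraction or an enclitic form yields one alternative with several components.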
j. Main category – Personal (\PES)
This tag was used for annotating tonic personal pronouns (eu, ele, nós, etc.). With the contracted forms comigo, consigo, contigo, connosco and convosco, the fact that the pronominal forms (migo, sigo, nosco, vosco) do not occur independently contributed to the decision to tag them only as PES (instead of PREP+PES).

l. Main category – Demonstrative (\DEM)
Despite the traditional distinction between the adjectival and the substantive uses of demonstratives, only their variable (\DEMv) and invariable (\DEMi) aspects were considered.

(17) é muito agradável comer aquele\DEMv peixe (pfamdl08)
     [it is very nice to eat that fish]
(18) não sei se isso\DEMi é algarvio (pfamdl08)
     [I don’t know if that is from Algarve]

m. Main category – Indefinite (\IND)
Following the same criteria adopted for the demonstrative category, the only distinction made concerned the variable (\INDv) and invariable (\INDi) indefinites.

(19) eu estou com muita\INDv pressa (pfamdl13)
     [I’m in a big hurry]
(20) era preferível não gastarem tudo\INDi no natal (pfamcv08)
     [it was better that they did not spend everything at Christmas]

n. Main category – Possessive (\POS)
This tag was used for annotating possessives independently of their adjective or substantive use.

(21) eu levo a minha\POS máquina (pfamcv08)
     [I take my camera]
o. Main category – Relative/Interrogative/Exclamative (\REL)
This tag classifies Relative, Interrogative and Exclamative Adverbs and Pronouns. Once again, the only distinction made in this class concerns their variable or invariable aspect.

(22) não sei qual\RELv era o tema (pfamcv08)
     [I don’t know which the subject was]
(23) houve uns que\RELi fizeram banda desenhada (pfamcv08)
     [there were some of them who did cartoons]

p. Main category – Locution
As far as multiword expressions are concerned, only the prepositional (\LPREP), conjunctional (\LCONJ), pronominal (\LPRON) and adverbial (\LADV) locutions were tagged as a single unit, leaving aside nominal, verbal and quantificational locutions.

(24) num instante\LADV [in a moment]
(25) logo que\LCONJ [as soon as]
(26) à beira de\LPREP [nearby]
(27) o qual\LPRON [which]

In order to optimise the treatment of locutions, the Portuguese team developed a tool that would run after the tagger and the lemmatiser to automatically identify and lemmatise a predefined list of these expressions. This tool did not remove the need for manual revision of locutions, since it could not distinguish a locution from a casual grouping of words. Nevertheless, it proved to be very useful for the manual revision.
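The idea of matching a predefined list of locutions against the token stream can be illustrated with a short Python sketch. It is not the team's tool, whose internals are not described here: the greedy longest-match strategy, the function name and the tiny locution list are all assumptions made for illustration. As in the text, a hit is only a candidate, since a casual grouping of the same words would match equally.

```python
def find_locution_candidates(words, locutions):
    """Greedy longest-match search for known multiword locutions.

    words     -- list of wordforms (already lowercased)
    locutions -- dict mapping a tuple of words to a locution tag
    Returns a list of (start, end, tag) spans; `end` is exclusive.
    Every hit still needs manual confirmation.
    """
    max_len = max((len(key) for key in locutions), default=0)
    hits = []
    i = 0
    while i < len(words):
        # Try the longest possible expression first, down to two words.
        for n in range(min(max_len, len(words) - i), 1, -1):
            key = tuple(words[i:i + n])
            if key in locutions:
                hits.append((i, i + n, locutions[key]))
                i += n
                break
        else:
            i += 1
    return hits


if __name__ == "__main__":
    # Hypothetical mini-list built from examples in the text.
    locutions = {
        ("em", "cima"): "LADV",
        ("em", "cima", "de"): "LPREP",
        ("logo", "que"): "LCONJ",
    }
    print(find_locution_candidates("está em cima de a mesa".split(), locutions))
    print(find_locution_candidates("ficou em cima".split(), locutions))
```

Trying the longest expression first is one way to handle overlapping entries such as em cima and em cima de; other resolution strategies would be equally plausible.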
q. Main category – Interjection (\INT)
This tag was used for annotating interjection forms such as ai, olá, adeus, etc.

r. Main category – Emphatic (\ENF)
This tag was used for annotating emphatic particles such as lá, cá, etc. In the cases where there were doubts regarding the annotation of these particles (whenever it was possible to assign them a deictic value), it was established that their classic category (Adverb) would be used.

(28) o que ele queria era comer alguma coisa que por lá\LÁ\ADV aparecesse
     [what he wanted was to eat something that could be there]
(29) como ele não lhe ligava lá\ENF como ela queria
     [since he didn’t care for her ENF the way she wanted]
s. Main category – Foreign Words (\ESTR)
This tag was used for annotating only those foreign words that were not registered with a grammatical category in Portuguese dictionaries.

t. Main category – Acronym (\SIGL)
This tag was used for annotating acronyms and sigla such as ADN, TAP, etc. As already mentioned in Section 5.2.1.2, forms that are already registered as common nouns in Portuguese dictionaries were transcribed and tagged as Nc (such as sida or radar).

Particular cases

It is well known that the classification of some words is contentious. This section presents the specific choices made for some problematic classifications.

a. o que
Some semanticists and syntacticians consider these two words a single operator; the arguments in favour of this analysis are very credible but nevertheless controversial. It was decided to consider it a single unit (\LPRON) in cases where it functions as an interrogative pronoun (o que\LPRON comprou a Maria ‘what did Mary buy’) or as a relative one, but only in relative structures where the antecedent is a sentence (estava a chover, o que\LPRON irritou a Maria ‘it was raining, which annoyed Mary’). Where it functions as a relative pronoun in relative structures where the antecedent is a noun or a pronoun, it was tagged word by word (comi tudo o\ARTd que\RELi tinha no prato ‘I ate everything that was on the plate’).

b. como and quando
In Portuguese, these words are traditionally considered interrogative and exclamative or circumstantial adverbs, or conjunctions in subordinate structures. However, more recently, and in line with the analysis proposed for other languages, it has been argued that these words, in most contexts, can in fact be relative connectors, introducing mostly free relatives. Nevertheless, the relative function of these words was not considered in the annotation. Whenever they are tagged as RELi, they are exclamative or interrogative elements. Como can appear tagged as an adverb in some of the relative structures mentioned above (with an overt antecedent, e.g. o modo como ele canta ‘the way ‘how’ he sings’). It was tagged as a conjunction (\CONJs) in the free relatives and in comparative, conformative and causal structures.

c. quanto
This is another word which can traditionally be assigned several tags. A simpler case than the previous one, quanto was always considered a relative, interrogative or exclamative element, tagged as RELv.
d. mesmo
In Portuguese, the item mesmo (as well as próprio, etc.) may have different classifications. This being the case, it was decided to tag it as: (i) an adverb (\ADV) in contexts where it could be substituted by another adverb (são mesmo (= realmente/totalmente) felizes ‘they are really happy’); (ii) a demonstrative (\DEMv), regardless of its substantive or adjective use, as mentioned above (ficou no mesmo\DEMv hotel que o sócio ‘he stayed in the same hotel as his partner’); or (iii) a common noun (\Nc) (ele continua o mesmo\Nc ‘he’s still the same’).

e. que
In Portuguese, this word can have several uses: (i) subordinative conjunction; (ii) coordinative conjunction; (iii) relative, interrogative and exclamative pronoun (or adverb); and (iv) preposition. Although some linguists consider this word a relative pronoun in cleft (and pseudo-cleft) structures, others consider it a conjunction. For C-ORAL-ROM annotation, the latter analysis was adopted, with que being tagged as CONJs in these contexts.

Specific tags for the Portuguese spoken corpus

Due to some of the characteristic phenomena of spoken language and the specific transcription guidelines used in the C-ORAL-ROM project, it was necessary to adapt the tagset. We implemented an automatic post-tagging process to account for the following cases:

a. extra-linguistic elements; transcription: hhh; tag: EL;
b. fragmented words or filled pauses; transcription: &(form); tag: FRAG;
c. words and sequences impossible to transcribe; transcription: xxx, yyyy; tag: Pimp, Simp;
d. paralinguistic elements, such as hum, hã and onomatopoeias; tag: PL.

In the cases described in (a) to (c), the specific transcription adopted allowed for automatic tag identification and replacement through a post-tagging process. The same process was applied in the cases described in (d), since there is a predictable finite list of symbols representing paralinguistic elements. Onomatopoeias, however, needed manual revision.
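Because cases (a) to (d) are recognisable from the transcription alone, the replacement step can be pictured as a simple lookup. The Python sketch below is only an illustration under stated assumptions: the function name is hypothetical, the paralinguistic set contains just the two forms quoted in the text rather than the team's full list, and onomatopoeias are deliberately left out because they required manual revision.

```python
# Illustrative subset only; the project's actual list is not given in the text.
PARALINGUISTIC = {"hum", "hã"}


def post_tag(form, tagger_tag):
    """Override the tagger's proposal for transcription-defined symbols.

    form       -- the transcribed wordform
    tagger_tag -- the tag proposed by the tagger, returned unchanged
                  when no transcription convention applies
    """
    if form == "hhh":            # extra-linguistic element
        return "EL"
    if form.startswith("&"):     # fragmented word or filled pause
        return "FRAG"
    if form == "xxx":            # word impossible to transcribe
        return "Pimp"
    if form == "yyyy":           # sequence impossible to transcribe
        return "Simp"
    if form in PARALINGUISTIC:   # paralinguistic element
        return "PL"
    return tagger_tag


if __name__ == "__main__":
    for form in ["hhh", "&eh", "xxx", "yyyy", "hum", "casa"]:
        print(form, post_tag(form, "Nc"))
```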
Three other categories had to be added, but they did not allow for automatic post-tagging replacement, since they correspond to forms that also belong to classic categories:

e. discourse markers, such as pá, portanto, pronto; tag: MD;
f. discursive locutions, such as sei lá, estás a ver, quer dizer, quer-se dizer; tag: LD;
g. non-classifiable forms, for words whose context does not allow an accurate classification; tag: SC.

For cases (e) and (f), forms like pronto and não sei, for instance, are automatically tagged as pronto\ADJ and não\ADV sei\Vpi, and there is no automatic post-tagging procedure that can decide whether a given sequence is a Discursive Locution or not. These cases required manual revision (and frequent listening to the sequence). After having been trained with a spoken corpus, the tool was expected to be able, based on statistical and contextual rules, to classify these kinds of expressions with a reasonable success rate, assuming that they had occurred in the training corpus.

The cases in (g) were more difficult, if not impossible, to tag or post-tag automatically, since the tagger, working on statistical rules, would always try to classify these words according to those rules. It should be noted that, as will be mentioned in the next section, the team chose to tag all the forms whenever possible, avoiding the use of the SC tag, which consequently occurs only rarely in the annotation.
Lemmatisation of the spoken corpus

The final format of the spoken corpus annotation includes, for each form, not only the PoS tag but also the corresponding lemma, in the form:

word\LEMMA\tag

In order to accomplish this task, the Léxico Multifuncional Computorizado do Português Contemporâneo (LMCPC)3 was used as the source for a lemmatisation tool. The LMCPC is a frequency lexicon of 26,443 lemmas and 140,315 wordforms, with a minimum lemma frequency of 6, extracted from a 16,210,438-word corpus of contemporary Portuguese. Each lemma and its corresponding forms (including inflected forms and compounds) are followed by morpho-syntactic and frequency information. Lemmas and wordforms are classified by main PoS category, as N (noun), V (verb), A (adjective), or other, namely F (foreign word), G (acronym/sigla), X (abbreviation). Wordforms with non-canonical orthography were also included under their rightful lemma. Regarding quantitative information, the frequencies were extracted from the PoS tagging information and, for more problematic forms, from some calculations based on manually revised data.

Although not used to its full potential, the LMCPC proved to be a very useful resource for our purposes. Its availability in plain .txt format made it extremely easy to manipulate.

The lemmatisation of the C-ORAL-ROM spoken corpus comprised two major tasks: the formatting of the LMCPC data, and the construction of a tool to extract the lemma from the lexicon. The adaptations required for the LMCPC format were due to the different PoS tagsets adopted in the two projects: in LMCPC, a main PoS category classification was used, whereas in C-ORAL-ROM, subcategory classification was adopted as well. Unfortunately, at the beginning of this task, we were not able to use the PoS information present in the LMCPC to improve the lemma selection process. Therefore, the lemmatisation procedure reduced the LMCPC data to a list of lemmas and corresponding wordforms, one per line.
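A lookup over such a reduced list can be sketched in a few lines of Python. The sketch is illustrative only: the actual layout of the reduced LMCPC file is not specified in the text, so the whitespace-separated "LEMMA form form ..." line format is an assumption, as are the function names. Joining competing lemmas with a comma and writing a hyphen for a missing lemma follow the conventions visible in the corpus examples.

```python
from collections import defaultdict


def build_index(lines):
    """Index wordforms by lemma.

    Assumed (hypothetical) line format: 'LEMMA form1 form2 ...'.
    Returns a mapping from lowercase wordform to a set of lemmas.
    """
    index = defaultdict(set)
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        lemma, forms = fields[0], fields[1:]
        for form in forms:
            index[form.lower()].add(lemma)
    return index


def lemmatise(form, index):
    """Return every lemma recorded for `form`, or '-' if none is found.

    Homographs yield several lemmas, joined by commas; choosing among
    them is left to a later step or to manual revision.
    """
    lemmas = sorted(index.get(form.lower(), ()))
    return ",".join(lemmas) if lemmas else "-"


if __name__ == "__main__":
    lexicon_lines = [
        "PROCESSO processo processos",
        "PROCESSAR processo processa processar",
    ]
    index = build_index(lexicon_lines)
    for form in ["processo", "processos", "evoluem"]:
        print(f"{form}\\{lemmatise(form, index)}")
```

Without PoS information, a homograph such as processo necessarily receives more than one lemma, which is why manual revision remained necessary at this stage.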
The lemmatisation tool ultimately developed turned out to be very simple. It consisted of a Perl script that extracted the lemma for each token of the corpus from the LMCPC data file: each form was searched for in the lexicon, and the corresponding lemma or lemmas were retrieved and placed next to the form.

In the case of multiword expressions, since the lemma is the entire set of elements, there was no correspondence between the desired result and the LMCPC data. It was thus necessary to develop a tool to automatically compose the desired lemma format from a given list of locutions. The final format of the lemmatisation of a locution is given below:

(30) o\O_QUAL\LPRON qual\O_QUAL\LPRON

Since it is possible for a wordform to be attributed several lemmas (the CORLEX corpus, for instance, has a percentage of homographic words of 34%), and due to the problems concerning locutions (such as the overlapping of words that can pertain to different kinds of locutions, e.g. em cima\LADV vs. em cima de\LPREP, and the distinction between locutions and independent word groupings), a manual revision of lemmas was strictly required, together with the revision of tags.
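The locution format in (30) can be reproduced with a very small Python function. This is a sketch of the output format only, not of the team's tool; the function name is hypothetical, and locutions that contain a contracted article are outside its scope.

```python
def annotate_locution(words, tag):
    """Give every word of a locution the same composed lemma and tag.

    The lemma is the uppercase sequence of all words joined by '_',
    as in example (30).
    """
    lemma = "_".join(word.upper() for word in words)
    return " ".join(f"{word}\\{lemma}\\{tag}" for word in words)


if __name__ == "__main__":
    print(annotate_locution(["o", "qual"], "LPRON"))
    print(annotate_locution(["em", "cima", "de"], "LPREP"))
```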
Specific lemmatisation choices

It was decided that some categories would be left without a lemma, namely Proper Nouns, Paralinguistic elements and Extralinguistic elements.

Lemmas were given in the masculine gender, as is common practice. However, some categories received both a masculine and a feminine lemma: Articles; Indefinites; Demonstratives; Possessives; Personal Pronouns; Clitics; Cardinal Numerals (os\O\ARTd, as\A\ARTd).

For some verbs, whenever their reflexive use implies a change in the semantic functions of their arguments, it was considered that the lemma included the reflexive pronoun:

(31) a Ana lembrou\LEMBRAR\Vppi a mãe da sua consulta médica
     [Ana reminded her mother of her medical appointment]
(32) a Ana lembrou-se\LEMBRAR_SE-SE\Vppi-CL de telefonar à mãe
     [Ana reminded herself to call her mother]
(33) a Ana não se\SE\CL lembrou\LEMBRAR_SE\Vppi de telefonar à mãe
     [Ana didn’t remind herself/remember to call her mother]

In Portuguese, whenever adverbs derived with the suffix -mente are coordinated, the first adverb loses the suffix and surfaces in its adjectival form. In these cases, in spite of this adjectival form, our option was to lemmatise and tag it as an adverb:

(34) pura\PURAMENTE\ADV e simplesmente\SIMPLESMENTE\ADV
     [purely and simply]
Evaluation

Considering the introduction of new categories, and despite the tagset length and type of training corpus (written), the tagger achieved a success rate of 91.5%, excluding the tagging of MD and of any kind of locution. Following this, the Portuguese team performed a manual revision of 231,540 words and decided to attempt the training of the tagger over a subset of this subcorpus (comprising 184,153 words). However, this training proved to be ineffective, since errors increased enormously (a precise error rate for this task was not established).

Given the ineffectiveness of the training with the tagged spoken subcorpus, it was decided to proceed with the annotation of the remaining 87,052 words with the tool trained on the written corpus, the same one used before, while improving the post-tagging procedures. In the final calculation of the recognition rate of the tagger and the lemmatiser together, all kinds of locutions and discourse markers (MD) were considered. This resulted in a decrease from the previous recognition rate of 91.5% (where those tags were not considered) to 88%.

Given this unexpectedly and undesirably low success rate of the tagger, it became imperative to carefully observe the most typical errors made by the tool. Considering the division between tag errors and lemma errors, it was observed that the errors regarding PoS tag occurred with a higher percentage (74.5% of the errors) than the errors regarding lemmatisation (64.8%). It is worth mentioning that 40% of the errors concerned both tag and lemma.

Taking into account, firstly, the errors regarding PoS tag, a high percentage of the errors occurred, as expected (since the tagger did not include this tag), in the annotation of discourse markers (18% of the total errors; 25% of tag errors; 2% of the corpus). The annotation of locutions also increased the error rate, since they represent 29% of the total errors (39% of tag errors; 4% of the corpus).
It is important to underline once again the fact that most of the locutions consist of discursive locutions, which are extremely frequent in oral discourse and particularly difficult to predict and, therefore, to automatically tag. Finally, it was observed that a significant percentage of errors involved subcategory tagging (i.e. tense and mood of verbs; common vs. proper nouns; subordinative vs. coordinative conjunctions), comprising 14% of total errors; 10% of tag errors; 1% of the corpus. These 3 types of errors together constitute 77.5% of the tagging errors.

Examining the errors that occurred in the lemmatisation process, we can conclude that the majority of the errors concern the lemmatisation of locutions (38% of total errors; 58% of lemmatisation errors; 5% of the corpus). A substantial percentage of errors regarding the gender of the lemma (11% of total errors; 16% of lemmatisation errors; 1% of the corpus) was also observed. These 2 types of errors together constitute 74.5% of the lemmatisation errors.

Taking into account all these facts and bearing in mind that the errors regarding tagging and lemmatisation of discourse markers and locutions were the ones which constituted a particular problem for the tool (note again that in the first training corpus, where these elements were not considered, we achieved a success rate of 91.5%), if we excluded these errors from the error rate we would have a success rate of 96.8% for the lemmatiser and of 96.7% for the tagger.

Since a large part of these elements was mistagged, a manual revision of this subcorpus was mandatory. This subcorpus was not exhaustively revised, but some manual revision was performed, with attention to the following:

a. All multiword expressions (locutions);
b. All past participles, since we distinguish the form of compound tenses from the other uses of participles (which can be ambiguous with an adjective form);
c. All forms of que, since it can have several uses (which can be predicted);
d. The most common discourse markers, since the tagger could not identify them as an MD, namely: assim, bem, bom, claro, digamos, enfim, então, exacto, olha, olhe, pá, percebe, percebes, pois, portanto, pronto, sabe, sabes, sim;
e. Non-lemmatised words, since, as will be described below, it was highly probable that most words that were not lemmatised (\-\) would have received a wrong tag;
f. All the clitic forms, since forms like o, a, os, as can be either a clitic, a definite article or a preposition (for a); also the form se can be a clitic or a conjunction;
g. The forms a(s) and o(s), for the reasons given above;
h. The form como, as it can be a verb form, an adverb, a conjunction or an interrogative;
i. The form até, as it can be an adverb or a preposition.

Where lemmatisation is concerned, for the remaining 87,052 words, we were able to improve the lemmatiser in order to combine the PoS information of the annotated corpus with the PoS information of LMCPC. This means that, for ambiguous forms, this tool was already able to select the corresponding lemma for a given tag. For instance, for an example like (35), the original tool would provide two lemmas for the word processo, which is homonymous between a noun and the first person singular of the present indicative of the verb processar; this is shown in (36).

(35) com este processo
     [with this process]
(36) com este processo\PROCESSAR,PROCESSO\Nc

However, once the form has been correctly tagged as a noun, the improved tool is also able to choose the right lemma, as seen in (37).

(37) com este processo\PROCESSO\Nc

In the final manual revision, this allowed us to check, with a high level of accuracy, both lemma and PoS tagging: if a form did not receive a lemma, it would necessarily
have been mistagged, as, for example, the word evoluem in (38), which received the tag of an adjective instead of the verb tag:

(38) e\E\CONJc também\TAMBÉM\ADV aqui\AQUI\ADV / as\A\ARTd coisas\COISA\Nc evoluem\-\ADJ //
     [and also here things evolve]

Finally, the wordforms of the lemmas SER and IR were verified: since they have homographic forms, the lemma proposed by the tagger for any of them could be wrong.

(39) o João foi\IR\Vppi ao cinema
     [John went to the cinema]
(40) o João foi\SER\Vppi bombeiro
     [John was a fireman]
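The PoS-aware lemma selection illustrated in (35) to (38) can be sketched as follows. This Python fragment is a hedged reconstruction of the principle, not the improved tool itself: the lexicon structure, the function names and the mapping from C-ORAL-ROM tags to LMCPC main classes (N, V, A, other) are assumptions made for the example.

```python
def main_class(tag):
    """Map a C-ORAL-ROM tag to an LMCPC-style main class.

    Hypothetical mapping used only in this sketch.
    """
    if tag.startswith("V") or tag == "PPA":
        return "V"
    if tag.startswith("N") and not tag.startswith("NUM"):
        return "N"
    if tag == "ADJ":
        return "A"
    return "other"


def select_lemma(form, tag, lexicon):
    """Keep only the lemmas whose main class agrees with the tag.

    lexicon -- dict: wordform -> {lemma: main class}
    Returns '-' when no lemma is compatible, which marks the token
    as a likely tagging error to be checked manually.
    """
    wanted = main_class(tag)
    candidates = lexicon.get(form, {})
    matches = sorted(lemma for lemma, cls in candidates.items() if cls == wanted)
    return ",".join(matches) if matches else "-"


if __name__ == "__main__":
    lexicon = {
        "processo": {"PROCESSO": "N", "PROCESSAR": "V"},
        "evoluem": {"EVOLUIR": "V"},
    }
    print("processo\\" + select_lemma("processo", "Nc", lexicon) + "\\Nc")
    print("evoluem\\" + select_lemma("evoluem", "ADJ", lexicon) + "\\ADJ")
```

Note that such a filter cannot separate homographs within the same class, which is consistent with the need to verify forms such as foi (SER or IR) by hand.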
Specific problems with the morpho-syntactic tagging of spoken language

Some characteristic phenomena of spoken language – such as retracting, interruption, linguistic forms whose distribution is not consistent with the distributional characteristics of written language, as well as linguistic forms and non-standard forms used as discourse markers – were believed to constitute a problem for the effectiveness of the corpus tagging. Since these are the main characteristics of a spoken corpus, it was considered important to preserve all spoken language phenomena, including repetitions and interruptions as well as the prosodic marks adopted by the team (overlapping signs, slashes, etc.); without them the corpus would be equivalent to a written one and, therefore, would pose no additional tagging problem.

As already mentioned in the previous section, the spoken corpus was tagged with Eric Brill’s tagger. Despite its having been trained on a written corpus, and contrary to our expectations, the results achieved were very satisfactory. Nevertheless, some post-tagging adaptations had to be made in order to achieve the established spoken corpus annotation. Besides the already mentioned tagset adaptations – for paralinguistic and extralinguistic elements (\PL, \EL), fragmented words (\FRAG), words and sequences impossible to transcribe (\Pimp, \Simp) and discourse markers and discursive locutions (\MD, \LD) – there were other post-tagging adaptations that had to be made in order to account for the adopted formalisation of spoken phenomena, i.e. prosodic marks. In fact, the tagger identified and tagged all the prosodic marks as well as the informant identification.

Example (41) shows a tagged text before the post-tagging adaptations. Prosodic marks, such as slashes, question marks and three dots, were tagged as punctuation (\O). Overlapping ([<], <, >) and interruption (+) marks, as well as informant identification, were tagged as null (\NN), while the alignment mark ($) was tagged as symbol (\SIMB).
(41) *MAR:/NN //O as\A\ARTd pessoas\PESSOA\Nc que\QUE\RELi estavam\ESTAR\Vii mais\MAIS\ADV interessadas\INTERESSAR\PPA // $ \-\SIMB
     *FER:/NN <\-\NN sim\SIM\ADV >\-\NN // $ \-\SIMB
     *MAR:/NN [<]\-\NN <\-\NN fora\FORA\LPREP >\-\NN maridos\-\PPA //O (pfamdl30)
     [*MAR: the people that were more interested / *FER: yes / *MAR: except husbands]

In a following stage these tags were automatically removed, achieving the established final format given in (42).

(42) *MAR: / as\A\ARTd pessoas\PESSOA\Nc que\QUE\RELi estavam\ESTAR\Vii mais\MAIS\ADV interessadas\INTERESSAR\PPA //$
     *FER: < sim\SIM\ADV > //$
     *MAR: [<] < fora\FORA\LPREP > maridos\-\PPA /4

After preserving the prosodic marks, it was interesting to observe the performance of the tagger concerning the spoken phenomena of retracting and interruption mentioned above.
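The removal step that leads from (41) to (42) amounts to dropping the annotation from every token whose tag marks it as punctuation, null or symbol. The Python sketch below illustrates this under a simplifying assumption: every token is taken to have the regular shape form\LEMMA\TAG, which the raw tagger output in (41) does not always show. The function name and the constant are hypothetical.

```python
# Tags the tagger assigned to non-lexical material, as listed in the text.
MARK_TAGS = {"O", "NN", "SIMB"}


def strip_mark_tags(tokens):
    """Remove lemma and tag from prosodic marks and similar symbols.

    tokens -- list of strings, assumed here to look like 'form\\LEMMA\\TAG'
              (or a bare form without any annotation)
    Lexical tokens are returned unchanged.
    """
    cleaned = []
    for token in tokens:
        form, sep, rest = token.partition("\\")
        # The tag is whatever follows the last backslash, if any.
        tag = rest.rsplit("\\", 1)[-1] if sep else ""
        cleaned.append(form if tag in MARK_TAGS else token)
    return cleaned


if __name__ == "__main__":
    tagged = ["<\\-\\NN", "sim\\SIM\\ADV", ">\\-\\NN", "//\\-\\O", "$\\-\\SIMB"]
    print(" ".join(strip_mark_tags(tagged)))
```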
a. Retracting phenomena
Firstly, it is important to mention that the retracting marks established as an optional second level in the C-ORAL-ROM transcription guidelines were not used, namely [/] for retracting with complete or partial repetition of a word, and [//] for retracting with no repetition involved. For both these cases, the tone unit mark ‘/’ was used.

Where morpho-syntactic tagging is concerned, it was expected that these phenomena could affect the tagger performance. In written discourse, a sequence of two identical wordforms indicates that the two forms belong to different PoS categories; in spoken discourse this need not be so. In fact, repetition of wordforms (retracting) is a very frequent phenomenon, and the repeated forms are, in this case, of the same PoS category.

(43) written: é mais fácil se\CONJs se\CL for de carro
     [it is easier if one goes by car]
     spoken: não conheço a\ARTd / a\ARTd Maria
     [I don’t know the / the Mary]

Despite some inconsistency in the tagging of this phenomenon, due to the violation of the contextual rules constructed during training, it was observed that in most cases the tagger assigned the correct tag.

(44) e o\O\ARTd / o\O\ARTd comércio de casas / (pfamdl16)
     [and the / the real estate]
(45) para servir as\A\ARTd / as\A\ARTd / &eh / empresas de betão // (pfamdl16) [to serve the / the concrete companies]

(46) ainda há / &ah / vagas / que\QUE\RELi / que\QUE\RELi nos superam (pfammn02) [there are still vacancies that / that overcome us]

However, as was already expected, the tagger also assigned wrong tags, as can be seen in (47) and (48).

(47) acabam por gostar / e / e verificam que\QUE\CONJs os &pr / que\QUE\RELi são professores que\QUE\RELi lhes deram / que\QUE\CONJs os acompanharam / e que\QUE\CONJs os fizeram crescer // (pfamdl01) [they ended up enjoying it and / and they verified that the / that there are teachers that gave them / that accompanied them and that helped them to grow]

(48) estar à frente / de um microfone ou no boneco da televisão / &ah / nos\EM+O\PREP+ARTd / nos\NOS\CL seduz (pfamn02) [being in front of a microphone or on television seduces us]

Since the optional retracting marks established in the guidelines were not adopted, and the mark ‘/’ was used both for the two types of retracting and for the delimitation of tone units, it was not possible to create a post-tagging process that would assign the same PoS tag to repeated words (i.e. words that occur on both the left and the right side of the slash). Consequently, the repeated words were all assigned the same PoS category during the total manual revision of 231,540 words.

(49) venho evocando / o\O\ARTd / o\O\ARTd / o\O\ARTd / a\A\ARTd sensação (pfammn02) [I have been evoking the-masc / the-masc / the-masc / the-fem feeling]

It is important to point out that the team revised those words which were most likely to belong to different categories (e.g. clitics, articles, etc.), thereby accounting for the case mentioned above. It is also important to draw attention to the cases where the repeated words belong to a locution. In these cases, the repeated word was always classified as a member of the locution.
(50) desde\DESDE_QUE\LCONJ / desde\DESDE_QUE\LCONJ que\DESDE_QUE\LCONJ nasceram\NASCER\Vppi (ppubmn02) [since / since they were born] (51) dentro\DENTRO_DE\LPREP da\DENTRO_DE+A\LPREP+ARTd / das\DENTRO_DE+A\LPREP+ARTd malas\MALA\Nc // (pfammn20) [inside the luggage]
(52) junto\JUNTO_A\LPREP ao\JUNTO_A+O\LPREP+ARTd mar / &ah / ao\JUNTO_A+O\PREP+ARTd último mar / &ah da Europa //$ (pfammn02) [near the sea / the last sea of Europe]
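Although the shared ‘/’ mark ruled out an automatic correction, the same observation can serve as a review aid. The following minimal sketch, which is an illustration and not part of the C-ORAL-ROM tool chain, lists identical wordforms on either side of a single ‘/’ that received different PoS tags, as in (48), so that a reviewer can inspect them. It assumes tokens of the form `form\LEMMA\TAG` separated by spaces; all function names are hypothetical.

```python
import re

# Assumption: an annotated token has exactly three backslash-separated parts.
TOKEN = re.compile(r"^(?P<form>[^\\\s]+)\\(?P<lemma>[^\\\s]+)\\(?P<tag>[^\\\s]+)$")


def parse(token):
    """Return (form, lemma, tag) for an annotated token, else None."""
    m = TOKEN.match(token)
    return (m["form"], m["lemma"], m["tag"]) if m else None


def retracting_mismatches(utterance):
    """List (form, left_tag, right_tag) for repeated forms tagged differently."""
    toks = utterance.split()
    found = []
    for left, sep, right in zip(toks, toks[1:], toks[2:]):
        if sep != "/":
            continue  # only look at pairs separated by a single tone unit mark
        a, b = parse(left), parse(right)
        if a and b and a[0].lower() == b[0].lower() and a[2] != b[2]:
            found.append((a[0], a[2], b[2]))
    return found


print(retracting_mismatches(r"&ah / nos\EM+O\PREP+ARTd / nos\NOS\CL seduz"))
```

Such a list only points to candidates: a genuine tone unit boundary between two homographs of different categories would be reported as well, which is why the decision was left to manual revision.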
b. Interruption phenomena With interruption phenomena, the tagger always assigned a tag to the words within the interrupted utterance border, despite the contextual disruption. (53) &eh / isso\ISSO\DEMi +$ e / vejo agora / que realmente / (pfamdl01) [that + and I see now that really] However, for more ambiguous words in this kind of context, it is not always possible to assign a classification, due, precisely, to the contextual disruption. In these cases, in the manual revision, the reviewers chose to replace the assigned tag with the SC one (i.e. without classification). (54) [<] < a quinta\QUINTA\SC > +$ (ppubcv01) [the fifth] or [the farm] (55) / a repressão //$ que\QUE\SC +$ (ppubmn03) [the repression // that]
c. Linguistic forms whose distribution is not consistent with the distributional characteristics of written language

In spoken language it is common to find linguistic forms that usually do not occur in written language. These specific spoken language phenomena were always transcribed in the C-ORAL-ROM corpus with specific formalisms, which made it easier for the tagger to annotate them automatically. Regarding these phenomena, and as previously mentioned in Section 5.3.1, it was necessary to implement a post-tagger automatic process to account for the following cases:

i. Extralinguistic elements. In order to account for all the sounds related to laughing, crying, coughing, etc., the transcription ‘hhh’ was used. These elements were automatically tagged as \EL.

(56) pois e Lisboa / hhh\-\EL // (pfamdl20) [yes and Lisbon]
ii. Filled pauses or fragmented words. Filled pauses can be transcribed either as ‘&eh’, ‘&ah’ or ‘&hum’, while fragmented words were transcribed as ‘&(form)’. Both cases have the mark ‘&’ preceding the form, which was the key for the tagger to recognise them and annotate them as \FRAG. (57) viajar / &eh\-\FRAG / com / ou sem a tal pessoa amada (pfamdl20) [to travel with or without that beloved person]
(58) França / porque gosto da cultura &fran\-\FRAG / adoro a cultura francesa (pfamdl20) [France because I like the culture / I love the French culture]

iii. Paralinguistic elements and onomatopoeia. Paralinguistic elements include forms like hum and hã, which are used as: (a) an interrogative element, when the informant does not understand what was said; and (b) a confirmation element, as shown in (59). Onomatopoeic forms reproduce an imitation of a sound and, in most cases in the C-ORAL-ROM corpus, do not have a conventional transcription; they were therefore also considered paralinguistic elements, as shown in (60).

(59) *PAU: < com ele a tentar negociar >$ *AMA: [<] < hum\-\PL hum\-\PL // hum\-\PL hum\-\PL // $ (pfammn11) [*PAU: with him trying to negotiate / *AMA: hum hum // hum hum]

(60) assim uma coisa mesmo / brrue\-\PL // (pfammn11) [one thing really brrue]

iv. Words and sequences of words impossible to transcribe. A spoken corpus frequently contains unintelligible words or sequences of words, which are therefore impossible to transcribe. Since the transcription adopted in these cases was ‘xxx’ for unintelligible words and ‘yyyy’ for unintelligible sequences of words (independently of the length of the signal), the post-tagger recognised them and annotated them as \Pimp and \Simp respectively.

(61) tu tens de pôr / o xxx\-\Pimp / e esses tais / fragmentos de guião (pfamcv04) [you have to put the xxx and those script fragments]

(62) ah / não estava a ver / yyyy\-\Simp / os antiaéreos (pfamcv09) [I/he wasn’t seeing yyyy the anti-aircraft]

Note that, when a word or a sequence of words is deleted for reasons concerning privacy or decency⁵ (and replaced by a beep in the sound file), it is transcribed by the single variable ‘yyyy’ and tagged as \Simp.

(63) de qualquer maneira > / tão abaixo como yyyy\-\Simp / não podemos ter chegado / de certeza / hhh // (pfamdl24) [anyway as low as yyyy we can’t have gotten for sure]

It is important to point out that none of the elements mentioned above received a lemma.
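The four cases above amount to a small set of form-based rules, which can be sketched as follows. This is a hypothetical reconstruction for illustration only, not the post-tagger actually used: the function name is invented, and the closed list of paralinguistic forms is an assumption (onomatopoeic forms such as brrue cannot be listed in advance).

```python
# Assumption: a closed list of paralinguistic forms; onomatopoeia such as
# "brrue" cannot be enumerated and would still need manual tagging.
PARALINGUISTIC = {"hum", "hã"}


def post_tag(form: str):
    """Return 'form\\-\\TAG' for special spoken forms, or None for ordinary words."""
    if form == "hhh":
        tag = "EL"        # extralinguistic element (laughing, coughing, ...)
    elif form.startswith("&"):
        tag = "FRAG"      # filled pause or fragmented word
    elif form == "xxx":
        tag = "Pimp"      # unintelligible word
    elif form == "yyyy":
        tag = "Simp"      # unintelligible or deleted sequence of words
    elif form.lower() in PARALINGUISTIC:
        tag = "PL"        # paralinguistic element
    else:
        return None       # ordinary word: left to the tagger
    # The lemma slot is filled with "-" because these elements receive no lemma.
    return f"{form}\\-\\{tag}"


for form in ["hhh", "&eh", "&fran", "xxx", "yyyy", "hum", "casa"]:
    print(form, "->", post_tag(form))
```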
d. Linguistic forms and non-standard forms used as discourse markers

Discourse markers and discursive locutions are tags created specifically for C-ORAL-ROM and denote linguistic forms frequently used by the speaker as a kind of hackneyed words or locutions. These linguistic forms constitute a problem for automatic tagging since they belong, first and foremost, to a specific PoS category. Although some discourse markers and discursive locutions could plausibly be listed, it was not possible to list them all, since these markers vary from speaker to speaker and it is not predictable which forms will be used as a discourse marker or a discursive locution. Moreover, the tagger could not automatically tag them as such, since they may also occur with their classical tag. As previously mentioned, forms such as bom and quer dizer are automatically tagged as bom\ADJ and quer\Vpi dizer\VB, and there is no automatic post-tagging procedure that can decide whether or not a given occurrence is a discourse marker or a discursive locution. Therefore, these cases always required a manual revision (in the case of the non-exhaustive revision the reviewers also accounted for the most common and predictable discourse markers). Examples (64) and (65) show the word bom occurring either as an adjective or as a discourse marker.

(64) as coisas estão no bom\BOM\ADJ caminho (pmedin04) [things are on the right track]

(65) e ela disse / bom\BOM\MD / então tens que ir aprender genética (pfammn17) [and she said / well / then you have to learn genetics]

Examples (66) and (67) show a sequence of words occurring either as a discursive locution or with their classical tag.

(66) < mas nenhuma delas > / &conse / xxx se consegue apresentar um programa //$ quer\QUER_DIZER\LD dizer\QUER_DIZER\LD // (pfamcv04) [but none of them xxx is able to present a show / I mean]

(67) não acho / &n / não quer\QUERER\Vpi dizer\DIZER\VB que o macho latino tenha de ser carismático // (pfamcv04) [I don’t think it doesn’t mean that the macho Latino have to be charismatic]
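Since the decision itself has to be made by a reviewer, the most that an automatic step can do is to collect the occurrences that deserve inspection. The sketch below illustrates this idea under stated assumptions: the candidate list is hypothetical and deliberately non-exhaustive, tokens are assumed to have the form `form\LEMMA\TAG`, and the function is not part of the procedure described in this chapter.

```python
# Hypothetical, non-exhaustive list of forms that may act as discourse
# markers (single words) or discursive locutions (word sequences).
CANDIDATES = {("bom",), ("pronto",), ("quer", "dizer")}


def review_candidates(tagged_line):
    """Return (position, forms, tags) for each candidate sequence in a line."""
    toks = tagged_line.split()
    hits = []
    for i in range(len(toks)):
        for n in (1, 2):
            window = [t.split("\\") for t in toks[i:i + n]]
            # Skip incomplete windows and anything that is not form\LEMMA\TAG.
            if len(window) < n or any(len(parts) != 3 for parts in window):
                continue
            forms = tuple(parts[0].lower() for parts in window)
            if forms in CANDIDATES:
                tags = "+".join(parts[2] for parts in window)
                hits.append((i, " ".join(forms), tags))
    return hits


print(review_candidates(r"e ela disse / bom\BOM\ADJ / então"))
print(review_candidates(r"não quer\QUERER\Vpi dizer\DIZER\VB que"))
```

Both the genuine adjective in (64) and the discourse marker in (65) would be listed; the sketch narrows the search but does not replace the manual decision.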
Some options of the Portuguese team

It is worth mentioning some options that were taken concerning the non-assignment of lemma or tag relative to some phenomena of spoken language:

a. Wordforms without lemma but with PoS tag

i. Some non-existent words that result from lapsus linguae still preserve a clear PoS function (usually they result from word blending or ‘spoonerisms’ which are then corrected by the speaker):

(68) ou o lugal\-\Nc / um lugar\LUGAR\Nc no palco da vida //$ (pfammn02) [or a place in the life stage]
(69) o &feni / o &feni / o femininismo\-\Nc / o feminismo\FEMINISMO\Nc (pfamdl22) [the feminism]

(70) tivesse pordido\-\VPP / &eh / podido\PODER\VPP / &eh / pronunciar-se (pnatpd03) [it could have pronounced]

ii. Even in cases where the speaker does not correct himself, it is possible to assign the wordform a tag:

(71) ainda que toda a carga negativa desabate\-\Vpc sobre vocês //$ (pnatpr01)⁶ [even if all the negative meaning falls over you]

iii. Speakers may also produce wordforms that do not obey normative usage and, therefore, are not registered in Portuguese dictionaries.

(72) alguma função catársica\-\ADJ⁷ (pnatla02) [some cathartic function]

However, when speakers produce wordforms that they have deliberately created, these non-registered words receive a lemma.

(73) estendeu / este princípio abandonatório\ABANDONATÓRIO\ADJ / a algumas outras regras da reforma fiscal (pnatps01) [he extended this ‘abandonative’ principle to some other rules of tax reform]

iv. There are cases where the context does not provide enough information to lemmatise forms that are ambiguous between the verbs ser and ir. In these cases, the forms receive a tag (as it is possible to determine the tense and mood of the verb), but do not receive a lemma.

(74) o Napoleão de / &tai foi\-\Vppi / teve / tentou / o grande império dele (pfamcv09) [Napoleon was/went / had / tried his big empire]

(75) os sócios / &eh / pronto / foram\-\Vppi / fizemos o quarteto // (pmedin03) [the partners / well / were/went / we made the quartet]

b. Wordforms with lemma but without PoS tag (\SC)

It may also happen that ambiguous wordforms clearly have a lemma (the wordform a always has the lemma A, whether it is an article, a preposition or a clitic, and the wordform que always has the lemma QUE, whether it is a conjunction or a relative element) but do not receive a PoS tag because the context does not allow them to be classified.

(76) e até ficou a\A\SC / com a / com a / com a ponta do sapato (pfammn02) [and he even stood ‘to/the’ / with the shoe tip]
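The two options described in (a) and (b) can be read directly off the annotation: a missing lemma is written as ‘-’ and a missing classification as the tag SC. The following sketch, whose function name and return labels are purely illustrative, shows how annotated tokens could be sorted into these groups, assuming the `form\LEMMA\TAG` layout used in the examples.

```python
def annotation_status(token: str) -> str:
    """Classify a form\\LEMMA\\TAG token by which annotations it carries."""
    form, lemma, tag = token.split("\\")
    has_lemma = lemma != "-"   # "-" marks a wordform without lemma
    has_tag = tag != "SC"      # SC marks a wordform without classification
    if has_lemma and has_tag:
        return "lemma and tag"
    if has_tag:
        return "tag only"      # cases (68) to (72), (74), (75)
    if has_lemma:
        return "lemma only"    # case (76)
    return "neither"


for tok in [r"lugal\-\Nc", r"foi\-\Vppi", r"a\A\SC", r"lugar\LUGAR\Nc"]:
    print(tok, "->", annotation_status(tok))
```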
Main data from lemmatisation

The results of the lexical and morpho-syntactic analysis of the C-ORAL-ROM corpus were compared with the results of other corpora. We performed several comparisons, choosing only corpora that were analysed according to analogous morpho-syntactic criteria (the small differences in the PoS tagging of the other corpora do not invalidate these comparisons) and that have the dimensions and internal constitution described in points (a) to (c) below:

a. Spoken corpora
i. Português Fundamental: a CLUL general and spontaneous spoken corpus of 700,000 running words, collected in informal situations (see Section 6.1.1.1).
ii. Língua Falada Brasil: a general formal and informal spoken Brazilian corpus of 963,535 words, collected by Maria Tereza Biderman (cf. Biderman forthcoming).

b. Written corpus
i. RL Corpus: a newspaper subcorpus of the CLUL RL Corpus, comprising 333,941 words. http://www.clul.ul.pt/englih/sectores/projecto_recursoslinguisticos.html

c. Written and spoken corpus
i. Corlex: a CLUL contemporary Portuguese corpus, mostly written, of 16 million words (5.3% spoken and 94.7% written; 56% newspaper, 20% literary, 20% techno-scientific, 4% various). The Multifunctional Computational Lexicon of Contemporary Portuguese is a frequency lexicon of the European Portuguese language, extracted from this corpus. http://www.clul.ul.pt/english/sectores/projecto_lmcpc.html
The 100 most frequent verbs, nouns, adverbs, adjectives: A comparison between C-ORAL-ROM and CORLEX

Tables 5.6 to 5.9 show the 100 most frequent verbs, nouns, adverbs and adjectives, while Table 5.10 shows the 15 most frequent interjections, discourse markers and emphatics (considered altogether), in the C-ORAL-ROM and CORLEX corpora. As the percentage of spoken discourse in the CORLEX corpus is small, the comparison can be regarded as one between spoken language (C-ORAL-ROM) and written language (CORLEX).
Table 5.6 High frequency verbs: comparison between C-ORAL-ROM and CORLEX

Rank  C-ORAL-ROM    CORLEX        Rank  C-ORAL-ROM    CORLEX
1     ser           ser           51    ler           tratar
2     ter           ter           52    aparecer      perder
3     estar         estar         53    perguntar     receber
4     ir            fazer         54    tratar        tomar
5     haver         ir            55    precisar      olhar
6     fazer         haver         56    tomar         contar
7     dizer         poder         57    tirar         acontecer
8     poder         dizer         58    abrir         afirmar
9     saber         dar           59    ligar         constituir
10    achar         ver           60    olhar         abrir
11    ver           dever         61    perder        ganhar
12    dar           saber         62    ganhar        achar
13    querer        querer        63    receber       mostrar
14    vir           ficar         64    esperar       esperar
15    ficar         vir           65    correr        trabalhar
16    falar         passar        66    criar         pedir
17    começar       deixar        67    funcionar     realizar
18    passar        formar        68    trazer        utilizar
19    pensar        encontrar     69    usar          servir
20    gostar        chegar        70    contar        escrever
21    dever         falar         71    escrever      representar
22    chegar        levar         72    mandar        cercar
23    pôr           começar       73    entender      ler
24    conhecer      parecer       74    tocar         tentar
25    andar         considerar    75    colocar       pagar
26    chamar        continuar     76    interessar    aparecer
27    conseguir     apresentar    77    permitir      cair
28    acabar        conhecer      78    procurar      correr
29    deixar        pensar        79    explicar      colocar
30    viver         conseguir     80    apresentar    morrer
31    parecer       partir        81    julgar        surgir
32    sair          pôr           82    resolver      lembrar
33    trabalhar     acabar        83    servir        trazer
34    entrar        chamar        84    vender        fossar
35    levar         existir       85    aceitar       prever
36    acontecer     sentir        86    arranjar      procurar
37    lembrar_se    tornar        87    considerar    atingir
38    ouvir         viver         88    acreditar     explicar
39    comprar       reservar      89    escolher      verificar
40    encontrar     referir       90    fumar         marcar
41    tentar        entrar        91    lembrar       obter
42    perceber      sair          92    valer         responder
43    pedir         voltar        93    beber         pretender
44    existir       seguir        94    marcar        incluir
45    comer         ouvir         95    utilizar      defender
46    continuar     criar         96    apanhar       garantir
47    obrigar       manter        97    mostrar       ligar
48    pagar         gostar        98    tornar        usar
49    sentir        permitir      99    estudar       aumentar
50    voltar        andar         100   meter         acompanhar
Table 5.7 High frequency nouns: comparison between C-ORAL-ROM and CORLEX

Rank  C-ORAL-ROM    CORLEX        Rank  C-ORAL-ROM    CORLEX
1     coisa         ano           51    universidade  obra
2     pessoa        vez           52    vinho         jogo
3     ano           local         53    aula          número
4     senhor        dia           54    palavra       relação
5     dia           página        55    presidente    meio
6     tempo         tempo         56    aspecto       região
7     casa          coisa         57    mês           senhor
8     exemplo       pessoa        58    semana        situação
9     problema      homem         59    deus          processo
10    vida          parte         60    teatro        gente
11    vez           cultura       61    papel         ciência
12    parte         espaço        62    deputado      mês
13    professor     vida          63    cinema        artigo
14    trabalho      sociedade     64    zona          terra
15    hora          país          65    amigo         serviço
16    gente         casa          66    família       nome
17    homem         forma         67    justiça       tipo
18    cidade        trabalho      68    momento       semana
19    questão       economia      69    curso         equipa
20    filme         política      70    nome          período
21    ideia         caso          71    sítio         nível
22    maneira       desporto      72    sistema       olho
23    tipo          público       73    centro        área
24    ponto         hora          74    conto         projecto
25    caso          direito       75    mundo         pai
26    médico        lado          76    estado        último
27    pai           mulher        77    governo       zona
28    número        índice        78    jovem         imagem
29    grupo         destaque      79    cultura       momento
30    situação      cidade        80    processo      resultado
31    mãe           jornal        81    guerra        palavra
32    direito       educação      82    gajo          sistema
33    país          presidente    83    altura        criança
34    filho         água          84    aluno         altura
35    verdade       escola        85    sociedade     modo
36    associação    grupo         86    nível         poder
37    dinheiro      secção        87    manhã         família
38    história      problema      88    projecto      questão
39    sentido       conto         89    rua           força
40    doente        ponto         90    literatura    canal
41    forma         mão           91    música        condição
42    lado          noite         92    programa      século
43    noite         mundo         93    espaço        história
44    ordem         exemplo       94    tarde         conta
45    empresa       empresa       95    doença        acordo
46    mulher        facto         96    formação      corpo
47    doutor        valor         97    qualidade     gabinete
48    criança       lugar         98    facto         acção
49    relação       filho         99    idade         edição
50    escola        fim           100   ciência       estado
Table 5.8 High frequency adverbs: comparison between C-ORAL-ROM and CORLEX

Rank  C-ORAL-ROM        CORLEX            Rank  C-ORAL-ROM        CORLEX
1     não               não               51    obviamente        entretanto
2     muito             mais              52    extremamente      atrás
3     já                já                53    perfeitamente     relativamente
4     mais              muito             54    antigamente       tarde
5     também            também            55    precisamente      igualmente
6     depois            ainda             56    nomeadamente      realmente
7     aqui              depois            57    tarde             nomeadamente
8     lá                só                58    primeiro          claro
9     assim             bem               59    amanhã            afinal
10    só                sempre            60    enfim             finalmente
11    sim               agora             61    atrás             acima
12    agora             hoje              62    dentro            actualmente
13    bem               aqui              63    antes             enfim
14    ainda             mesmo             64    praticamente      principalmente
15    sempre            apenas            65    propriamente      completamente
16    mesmo             então             66    absolutamente     melhor
17    então             ontem             67    finalmente        diante
18    aí                lá                68    melhor            cedo
19    ali               tão               69    actualmente       praticamente
20    hoje              nunca             70    provavelmente     sequer
21    até               assim             71    evidentemente     recentemente
22    cá                quase             72    efectivamente     normalmente
23    realmente         antes             73    eventualmente     provavelmente
24    nunca             além              74    longe             exactamente
25    tão               como              75    cedo              especialmente
26    exactamente       aí                76    basicamente       precisamente
27    quase             ali               77    simplesmente      geralmente
28    pois              menos             78    afinal            particularmente
29    talvez            sim               79    directamente      nem
30    claro             através           80    certamente        directamente
31    logo              dentro            81    felizmente        rapidamente
32    menos             pouco             82    principalmente    próximo
33    aliás             talvez            83    infelizmente      abaixo
34    bastante          logo              84    totalmente        naturalmente
35    pouco             quanto            85    essencialmente    debaixo
36    mal               tanto             86    fundamentalmente  essencialmente
37    completamente     mal               87    geralmente        imediatamente
38    apenas            junto             88    imediatamente     simplesmente
39    sobretudo         sobretudo         89    perto             totalmente
40    tanto             cá                90    verdadeiramente   demasiado
41    normalmente       pois              91    abaixo            depressa
42    fora              quando            92    claramente        respectivamente
43    entretanto        bastante          93    inclusivamente    perfeitamente
44    sequer            até               94    possivelmente     certamente
45    nem               longe             95    concretamente     embora
46    como              porque            96    especialmente     anteontem
47    naturalmente      perto             97    rapidamente       adiante
48    ontem             fora              98    relativamente     novamente
49    embora            aliás             99    ultimamente       frequentemente
50    imenso            amanhã            100   curiosamente      claramente
Table 5.9 High frequency adjectives: comparison between C-ORAL-ROM and CORLEX

Rank  C-ORAL-ROM      CORLEX          Rank  C-ORAL-ROM      CORLEX
1     bom             grande          51    cultural        financeiro
2     grande          político        52    forte           cheio
3     novo            novo            53    pessoal         técnico
4     português       último          54    chinês          próprio
5     diferente       internacional   55    fiscal          comum
6     pequeno         bom             56    certo           presente
7     engraçado       português       57    francês         capaz
8     melhor          maior           58    médico          verdadeiro
9     importante      pequeno         59    sozinho         respectivo
10    maior           público         60    fundamental     fácil
11    difícil         social          61    branco          livre
12    último          melhor          62    científico      certo
13    capaz           longo           63    fraco           igual
14    mau             importante      64    grave           pobre
15    possível        diferente       65    livre           diverso
16    preciso         próximo         66    político        vivo
17    interessante    nacional        67    rápido          real
18    social          antigo          68    mundial         francês
19    nacional        possível        69    municipal       enorme
20    evidente        passado         70    presente        grave
21    giro            principal       71    estrangeiro     pessoal
22    meio            económico       72    estranho        comercial
23    antigo          único           73    natural         histórico
24    bonito          alto            74    pior            notório
25    único           necessário      75    positivo        recente
26    seguinte        velho           76    total           particular
27    alto            meio            77    lindo           militar
28    velho           humano          78    óptimo          americano
29    especial        actual          79    principal       central
30    igual           forte           80    superior        total
31    normal          europeu         81    baixo           regional
32    necessário      anterior        82    central         menor
33    genético        difícil         83    médio           tal
34    geral           geral           84    próprio         interno
35    humano          seguinte        85    comercial       todo
36    internacional   médio           86    conjunto        rápido
37    claro           mau             87    militar         directo
38    americano       preciso         88    negativo        profissional
39    europeu         claro           89    simpático       fundamental
40    público         branco          90    doente          mundial
41    enorme          natural         91    final           breve
42    fácil           local           92    impossível      profundo
43    próximo         superior        93    técnico         religioso
44    anterior        cultural        94    ético           espanhol
45    cheio           aberto          95    agradável       semelhante
46    brasileiro      elevado         96    civil           privado
47    caro            simples         97    curioso         oficial
48    horrível        especial        98    extraordinário  negro
49    simples         baixo           99    feliz           físico
50    concreto        curto           100   nuclear         final
Table 5.10 High frequency interjections, discourse markers and emphatics: comparison between C-ORAL-ROM and CORLEX

Rank  C-ORAL-ROM        CORLEX
1     não é             lá
2     pois              ah
3     quer dizer        ai
4     portanto          pá
5     sim               ó
6     ah                bem
7     pronto            oh
8     então             eh
9     pá                bom
10    e não sei quê     ora
11    e então           pronto
12    e portanto        hum
13    está bem          adeus
14    claro             então
15    lá                hã
Similarities and differences in the two corpora

From Tables 5.11 to 5.15, we can note the similarities and differences between the two corpora regarding the 10 most frequent lemmas of the classes mentioned above. This comparison reveals strong similarities concerning lexical words on the one hand, and remarkable differences concerning interjections, discourse markers and emphatics on the other hand. The 10 most frequent lexical words common to both corpora comprise 8 verbs (80%), 5 nouns (50%), 6 adjectives (60%) and 7 adverbs (70%). With interjections, discourse markers and emphatics, however, as seen in Table 5.15, only 2 word-forms (20%) are common to both corpora, which is obviously due to the different types of discourse being compared. These word-forms consist of very specific units of spoken discourse, occurring mostly in spontaneous dialogues and conversations, and are mainly discourse markers, such as não é, quer dizer, portanto and pronto. In the written corpus these items occur in the direct reporting of affective interactions, more typically in literary texts and, as a result, in a more artificial manner. Clearly the written corpus does not include discourse markers which are frequent in authentic oral texts, not only in informal discourse, but also in formal discourse.

Table 5.11 High frequency verbs: comparison of top ten

Common        C-ORAL-ROM    CORLEX
dizer         achar         dar
estar         saber         ver
fazer
haver
ir
poder
ser
ter

Table 5.12 High frequency nouns: comparison of top ten

Common        C-ORAL-ROM    CORLEX
ano           casa          homem
coisa         exemplo       local
dia           problema      página
pessoa        senhor        parte
tempo         vida          vez

Table 5.13 High frequency adverbs: comparison of top ten

Common        C-ORAL-ROM    CORLEX
depois        aqui          ainda
já            assim         bem
mais          lá            sempre
muito
não
só
também

Table 5.14 High frequency adjectives: comparison of top ten

Common        C-ORAL-ROM    CORLEX
bom           diferente     internacional
grande        engraçado     político
maior         importante    público
novo          melhor        último
pequeno
português

Table 5.15 High frequency interjections, discourse markers and emphatics: comparison of top ten

Common        C-ORAL-ROM      CORLEX
ah            não é           lá
pá            pois            ai
              quer dizer      ó
              portanto        bem
              sim             oh
              pronto          eh
              então           bom
              e não sei quê   ora

These observations were maintained when we compared the 100 most frequent lemmas of lexical words, where common elements comprised a high percentage: verbs 74%; nouns 57%; adjectives 64%; adverbs 80%. An examination of the 15 most frequent interjections, discourse markers and emphatics showed 50% of them to be common.

Table 5.16 shows that the most frequent words in spoken and in written discourse have low information content. It is also interesting to note the overall similarity between the most frequent lemmas of the four corpora. As was expected, the most frequent verbs are tense, aspect, passive and modal auxiliaries, such as ter, ser, estar, ir, haver and poder, as well as those which introduce reported speech, such as dizer and achar. As far as nouns are concerned, it can be observed that the most frequent ones are those which have a very general meaning, namely coisa, pessoa and gente (the last one certainly owing to the pronominal locution a gente in which it occurs), as well as the nouns which designate time, like ano, dia and tempo. (About the near-grammatical function of une chose, see Blanche-Benveniste 1986.) Likewise, the most frequent adjectives are generic descriptive and evaluative adjectives, like bom, melhor, grande, maior and pequeno. Interestingly, português is one of the 10 most frequent adjectives in the Portuguese corpora and brasileiro is one of the 10 most frequent adjectives in the Brazilian corpus.
With respect to these two last grammatical classes, the written (or mostly written) corpus unsurprisingly includes some items related to its constitution (56% newspaper), such as página, local, internacional and público. Adverbs too show strong similarities between the spoken and the written corpora. In contrast, with the final class of interjections, discourse markers and emphatics, there is an expected dissimilarity between the two types of discourse, which, as was previously mentioned, is essentially due to the high occurrence of discourse markers in spoken discourse. It is also important to highlight the presence of the discourse marker não é (né in the Brazilian corpus) in the first position of the three spoken corpora, as well as the confirmed stability of the most frequent discourse markers in both European Portuguese spoken corpora (não é, pois, quer dizer, portanto and pá), despite their being chronologically diversified: Português Fundamental was collected in 1970–1975, and C-ORAL-ROM in 1970–2002, though the majority of this corpus is composed of texts collected between 1994 and 2002.
Table 5.16 The 10 most frequent lexical words and interjections, discourse markers, emphatics in four corpora

C-ORAL-ROM              Português Fundamental   Língua Falada – Brasil  CORLEX
(spoken)                (spoken)                (spoken)                (written)

Verbs
ser                     ser                     ir                      ser
ter                     ter                     dizer                   ter
estar                   estar                   ter                     estar
ir                      dizer                   fazer                   fazer
haver                   ir                      ser                     ir
fazer                   haver                   achar                   haver
dizer                   fazer                   estar                   poder
poder                   querer                  saber                   dizer
saber                   saber                   poder                   dar
achar                   poder                   querer                  ver

Nouns
coisa                   coisa                   casa                    ano
pessoa                  pessoa                  gente                   vez
ano                     ano                     coisa                   local
senhor                  vez                     senhora                 dia
dia                     casa                    ano                     página
tempo                   gente                   dia                     tempo
casa                    senhor                  pessoa                  coisa
exemplo                 dia                     filho                   pessoa
problema                maneira                 tempo                   homem
vida                    exemplo                 exemplo                 parte

Adjectives
bom                     grande                  bom                     grande
grande                  bom                     grande                  político
novo                    novo                    melhor                  novo
português               pequeno                 normal                  último
diferente               maior                   novo                    internacional
pequeno                 diferente               maior                   bom
engraçado               preciso                 diferente               português
melhor                  difícil                 difícil                 maior
importante              português               bonito                  pequeno
maior                   geral                   brasileiro              público

Adverbs
não                     não                     não                     não
muito                   muito                   muito                   mais
já                      lá                      mais                    já
mais                    já                      assim                   muito
também                  mais                    aí                      também
depois                  assim                   então                   ainda
aqui                    aqui                    já                      depois
lá                      depois                  lá                      só
assim                   também                  aqui                    bem
só                      pois                    agora                   sempre

Interjections, discourse markers and emphatics
não é                   não é                   né*                     lá
pois                    quer dizer              ah                      ah
quer dizer              portanto                certo                   ai
portanto                pá                      claro                   pá
sim                     ah                      eh                      ó
ah                      tal                     oh                      bem
pronto                  bem                     ai                      oh
então                   pois                    pois                    eh
pá                      ai                      pô                      bom
e não sei quê           bom                     ahahn                   ora

* This form corresponds to the contraction of the words não é.
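The top-ten overlaps reported in Tables 5.11 to 5.15 follow from a simple set intersection of the ranked lists. The sketch below recomputes two of them as an illustration; the lists are copied from Tables 5.6 and 5.10, while the variable and function names are introduced here for the example only.

```python
# Top-ten lemmas copied from Table 5.6 (verbs) and Table 5.10
# (interjections, discourse markers and emphatics).
TOP10 = {
    "verbs": (
        ["ser", "ter", "estar", "ir", "haver", "fazer", "dizer", "poder", "saber", "achar"],
        ["ser", "ter", "estar", "fazer", "ir", "haver", "poder", "dizer", "dar", "ver"],
    ),
    "interjections, discourse markers and emphatics": (
        ["não é", "pois", "quer dizer", "portanto", "sim", "ah", "pronto", "então", "pá", "e não sei quê"],
        ["lá", "ah", "ai", "pá", "ó", "bem", "oh", "eh", "bom", "ora"],
    ),
}


def overlap(spoken, written):
    """Return the common items and their share (in %) of the spoken list."""
    common = sorted(set(spoken) & set(written))
    return common, 100 * len(common) / len(spoken)


for word_class, (coral, corlex) in TOP10.items():
    common, share = overlap(coral, corlex)
    print(f"{word_class}: {len(common)} common ({share:.0f}%): {', '.join(common)}")
```

For these two classes the sketch yields 8 common verbs (80%) and 2 common interjections, discourse markers and emphatics (20%), in line with Tables 5.11 and 5.15.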
Main data from word-forms

The first 10 word-forms of the spoken corpus C-ORAL-ROM were also compared with those of the written RL newspaper corpus, the two corpora being of comparable dimensions. It was observed that, in both corpora, the first word-forms consist of functional words, except for the forms não and eh in the spoken corpus, as shown in Table 5.17. The type-token ratio of these two corpora of similar dimension (Table 5.18) clearly shows that the spoken discourse (with a ratio of 6) has a more restricted vocabulary (and correspondingly many more repetitions), while the written corpus (with a ratio of 11) has a more diversified one.

Table 5.17 10 most frequent word-forms in C-ORAL-ROM and RL Corpus

      C-ORAL-ROM          RL Corpus – newspaper
Rank  Forms   Freq.       Forms   Freq.
1     que     12,458      de      14,525
2     a       9,965       a       12,705
3     e       9,957       o       9,542
4     de      8,660       que     8,099
5     e       7,665       e       7,732
6     o       7,206       do      5,268
7     não     6,273       da      4,952
8     eh      3,639       os      3,592
9     um      3,632       em      3,524
10    para    3,324       para    3,438

Table 5.18 Type-token ratio in two corpora

          Spoken corpus (C-ORAL-ROM)    Written corpus (RL Corpus – newspaper)
Types     19,395                        37,364
Tokens    318,552                       333,941
Ratio     6                             11

Types = different forms; Tokens = running words; Ratio = (types/tokens) x 100
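The ratio defined under Table 5.18 can be checked with a few lines of code. The sketch below applies the definition, (types/tokens) x 100, to the counts given in the table; the helper that derives the ratio from a list of running words is an added illustration and assumes that the text has already been split into word-forms.

```python
def type_token_ratio(types: int, tokens: int) -> float:
    """Ratio as defined under Table 5.18: (types / tokens) x 100."""
    return types / tokens * 100


def ttr_from_words(words) -> float:
    """Same ratio computed from a sequence of running word-forms."""
    words = list(words)
    return type_token_ratio(len(set(words)), len(words))


spoken = type_token_ratio(19_395, 318_552)    # C-ORAL-ROM
written = type_token_ratio(37_364, 333_941)   # RL Corpus, newspaper
print(round(spoken), round(written))          # 6 11
```

Because the type-token ratio depends on corpus size, the comparison is meaningful here only because the two corpora have similar numbers of tokens, as noted above.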
.. Lexical density It was important to compare the proportion of lexical words (nouns, verbs, adjectives and adverbs) that occurred in the Portuguese C-ORAL-ROM corpus with that of the RL corpus – newspaper, i.e. their lexical density. Generally, written language is described as ‘dense’, contrasting with spoken language, which is described as ‘sparse’. As a matter of fact, there is usually a high variation between oral spontaneous corpora and written corpora. Such variation is evidenced by several authors, for instance, by Biber et al., who state that “conversation has by far the lowest lexical density. News has the highest density” (Biber et al. 1999:62), or by Halliday, who asserts that “it is a characteristic difference between spoken and written language (that) written language displays a much higher ratio of lexical items to total running words” (Halliday 1989:61). However as the comparison of lexical density in Table 5.19 shows, the Portuguese C-ORAL-ROM corpus and RL corpus – newspaper in fact display an identical proportion of lexical units. To account for this, it is necessary to make the following comment. The CORAL-ROM corpus is a substantially heterogeneous one, which contains an significant amount of spoken formal and spoken informal public texts, contrary to what is usually observed in studies on spoken corpora, which are composed only by spontaneous speech. That is the case, for instance, of the subcorpus of conversation used in Longman Grammar of Spoken and Written English (Biber et al. 1999), in which there is variety regarding the type of the informant (age and sex) but no variety concerning the degree of formality (and, thus, length of planning time) that are also crucial factors in the constitution of general spoken corpora. In fact, spoken and written language cannot be differentiated in general. Spoken language is not merely spontaneous and written language may reach a high degree of informality. 
For this reason, the C-ORAL-ROM corpus contains a wide range of sociolinguistic text types (see Table 5.20), which represents orality more adequately. This is an important point, since it may attenuate the usual distinctions between spoken and written corpora, which do not take into consideration the internal characteristics of each corpus. It is also important to note that although, as a rule, words are considered in isolation, many lexical forms occur in multiword units, which work as grammatical units. Such is the case, for instance, of causa, lado, maneira, medida, modo, nível, relação, tempo, vez, which frequently occur in conjunctional or prepositional locutions, such as por causa de, ao lado de, de maneira que, na medida de, de modo que, a nível de, em relação a, ao mesmo tempo que.

Table 5.19 Lexical density in Portuguese C-ORAL-ROM corpus and RL corpus

              Spoken Corpus (C-ORAL-ROM)      Written Corpus (RL-News)
              318,552 tokens                  333,941 tokens
Category      Total no. word-forms     %      Total no. word-forms     %
Nouns                       48,779    15                    79,721    24
Adjectives                  11,511     4                    19,911     6
Verbs                       57,640    18                    40,177    12
Adverbs                     27,108     9                    13,346     4
Total                      145,038    46                   153,157    46

Table 5.20 Lexical density in Portuguese C-ORAL-ROM Corpus according to nodes

Spoken Corpus (C-ORAL-ROM) 318,552 tokens

              Telephone                   Informal                    Formal
Category      Total no. word-forms   %    Total no. word-forms   %    Total no. word-forms   %
Nouns                        2,477  10                  22,508  14                  23,794  20
Adjectives                     551   2                   5,507   3                   5,453   4
Verbs                        5,135  21                  30,700  19                  21,805  18
Adverbs                      2,707  11                  15,575   9                   8,826   7
Total                       10,870  44                  74,290  45                  59,878  49
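The lexical density figures reported in Table 5.19 can be recomputed from the printed counts. The following is a minimal sketch in Python and not part of the original study: the function and variable names are illustrative assumptions, and only the counts are taken from the table. Note that the four written-corpus category counts, as printed, sum to 153,155 rather than to the printed total of 153,157; the rounded percentage is not affected.

```python
# Counts copied from Table 5.19; every name below is illustrative.
SPOKEN_TOKENS = 318_552    # Portuguese C-ORAL-ROM
WRITTEN_TOKENS = 333_941   # RL corpus, newspaper

SPOKEN = {"Nouns": 48_779, "Adjectives": 11_511, "Verbs": 57_640, "Adverbs": 27_108}
WRITTEN = {"Nouns": 79_721, "Adjectives": 19_911, "Verbs": 40_177, "Adverbs": 13_346}


def share(count: int, total_tokens: int) -> float:
    """Percentage of the corpus tokens covered by `count` word-forms."""
    return 100 * count / total_tokens


def lexical_density(counts: dict[str, int], total_tokens: int) -> float:
    """Lexical word-forms (all four categories) as a percentage of all tokens."""
    return share(sum(counts.values()), total_tokens)


if __name__ == "__main__":
    for name, counts, total in (
        ("spoken", SPOKEN, SPOKEN_TOKENS),
        ("written", WRITTEN, WRITTEN_TOKENS),
    ):
        print(f"{name}: {lexical_density(counts, total):.1f}%")
```

Rounded to whole percentages, both corpora come out at 46%, while the internal composition differs: verbs outweigh nouns in the spoken corpus, and nouns outweigh verbs in the written one.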
.. Multiword expressions
As previously mentioned, multiwords were marked as sequences of words which work as a single unit. Table 5.21 displays the most frequent locutions in the C-ORAL-ROM corpus.

Table 5.21 Most frequent multiwords

Adverbial Locutions (LADV): de facto; às vezes; se calhar; mais ou menos; no fundo; por acaso; pelo menos; cada vez mais; hoje em dia; de vez em quando; outra vez; a seguir
Prepositional Locutions (LPREP): em relação a; a partir de; em termos de; dentro de; depois de; antes de; apesar de; por causa de; ao longo de; em vez de; para além de; a nível de
Conjunctional Locutions (LCONJ): de maneira que; por isso; para que; na medida em que; desde que; de modo que; mesmo que; sempre que; a não ser que; até que; como se; por outro lado
Pronominal Locutions (LPRON): a gente; o que; cada um; o qual; si próprio; tal e qual; o que quer que seja
Discursive Locutions (LD): não é; quer dizer; e não sei quê; e portanto; está bem; não sei quê; por exemplo; e tal; e pronto; ao depois; sim senhor; eh pá
Notes
1. The term ‘reference corpus’ is used here to refer to a corpus containing samples of texts and not complete texts (cf. textual corpus).
2. http://www.cs.jhu.edu/~brill
3. The LMCPC (in English, Multifunctional Computational Lexicon of Contemporary Portuguese) is available via internet at http://clul.ul.pt/english/sectores/projecto_lmcpc.html/
4. In this example the mistagging was preserved. The manual revision subsequently performed by the reviewers corrected it.
5. That is the case of individuals or institutions mentioned in a depreciatory way.
6. Note that the word desabate does not exist in the Portuguese language. It seems to result from a blend of desabar with abater (semantically related verbs, meaning ‘to fall down’).
7. Note that the correct form is catártica.
Chapter 6
Notes on lexical strategy, structural strategies and surface clause indexes in the C-ORAL-ROM spoken corpora
Emanuela Cresti
. Premises
The scientific community is already aware of many general characteristics that the Romance languages share through their long common history of derivation from Latin, despite the idiosyncratic development of each of them (Agard 1984; Amastae et al. 1995; Bourciez 1967; Godard 2003; Harris & Vincent 1988; Klausenburger 2003; Pei 1976; Posner 1996; Reinhemeir & Tasmowski 1997; Tagliavini 1949). These include, for instance, morpho-phonetic phenomena such as the vocalic quantity/quality alternation, diphthongisation, palatalisation (of consonants) and sandhi phenomena, and morpho-syntactic phenomena such as verbal periphrasis, clitics, the evolution of inflectional morphology, the relevance of relative clauses, cleft-sentences, and the null subject parameter, which contrasts Spanish, Italian and Portuguese with French. Moreover, an EU project (EUROM4, Blanche-Benveniste et al. 1997) has demonstrated a strong similarity in terms of grammar, semantics and pragmatics among the four Romance languages (henceforth FRLs). This work provides convincing evidence that speakers of the FRLs easily share at least a passive competence, which rests on a thousand-year tradition of cultural, social and economic exchanges. For the first time, however, the inquiry is conducted on the basis of comparable spoken corpora of four Romance languages. In many respects, then, what we are called upon to explore appears to be a new domain, because corpus-based research, while at least a teenager in the linguistic field, is really a newborn in the domain of spoken multilingual corpora. Given that within the scope of this volume it is impossible to conceive and compile a ‘spoken Romance grammar’, what then can be our realistic goal? In one respect, certain specific aspects (e.g. cleft-sentences, clitics, the adjectival system, deverbal nouns, modal verbs), while very interesting, have already been studied in many works from different theoretical points of view;1 moreover, they risk remaining islands incapable of providing a general insight, which is crucial for comparative work in the field of spoken Romance languages.
Only in recent times has speech started to be investigated in a comprehensive way, because only recently has access to spoken corpora with sound/transcription alignment been available. But their availability also creates a problem, by showing that some of those reference units, which have been developed over centuries of grammatical research and which are still valid for the study of written corpora, are not however adequate for speech. We refer particularly to the communicative unit higher than the word, that is, the utterance. Nevertheless, if the origin of the question is in the sound, the solution can be found simply by fully considering the acoustic data. In fact, the utterance – generally invoked as the reference unit of speech – can be defined theoretically and identified in an operative way on the basis of its acoustic character.
.. The utterance
Twenty years of research on spoken Italian, developed in our Lab, have allowed us to establish that intonation identifies and marks, in the speech flow, the sequence of words which constitutes an utterance. The operative definition is that every expression marked by a prosodic terminal break is an utterance. This is the definition employed in the transcription, tagging and alignment of our four Romance corpora, with the prosodic terminal break identified by all native speakers by perceptual recognition.2 From our theoretical point of view, an utterance corresponds to the accomplishment of a speech act, as defined by Austin (1962). On the whole, the proper structure of spontaneous speech appears to be the alternation of dialogic turns, whose organisation is built on utterances. For this reason, our research on the four spoken Romance languages will be devoted to capturing some general aspects on the basis of the utterance as reference unit. As shown in the first chapter, the choice of the utterance as one of the most relevant reference units for corpus measures has already allowed us to display, on the one hand, fundamental features of speech and, on the other, some comparative aspects among Romance languages and their variation according to the corpus design. First of all, a number of measures can be properly obtained and appreciated only in relation to utterances: the measure in words of MLU, in words and utterances of MLTurn, in words of MLTone, and in tone units of utterances; the measure of speed (words per second) of these units; and the relevance of fragmentation phenomena (retracting, false start, self-interruption and other-interruption, overlapping, fragments). In this way, it has been possible to get relevant results such as that concerning the relation between MLU, MLTurn and speed: the longer the first two variables, the lower the third.
In parallel, a constancy of MLTone is recorded, independent of the length of upper entities such as utterance or turn; in contrast, this varies in a language-dependent way. In this regard, even if general tendencies are confirmed for the FRLs, different measures do emerge, especially for French, determined mostly by French orthographic conventions.3 Another important result that can be appreciated on the basis of the utterance regards the amount of fragmentation phenomena and their variation according to the corpus design.
Taken together, the previous measures can constitute a standard reference frame of speech for the study of acquisition, of pathologies and of second language acquisition (which until now have mostly been conducted on the basis of grammatical references), but also for any kind of technological development.
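The operative definition of the utterance lends itself to a simple computation of MLU. The following Python sketch is an illustration only, under stated assumptions: it treats `//` and `?` as the only terminal break markers and `/` as the non-terminal one, as in the transcription examples quoted in this chapter (the full transcription format has further conventions not handled here), and it counts whitespace-separated tokens as words. All function names are hypothetical.

```python
import re

# Assumption: only "//" and "?" close an utterance; "/" is a non-terminal break.
TERMINAL_BREAK = re.compile(r"//|\?")
NON_TERMINAL_BREAK = "/"


def split_utterances(turn: str) -> list[str]:
    """Split a transcribed turn into utterances at terminal prosodic breaks."""
    chunks = TERMINAL_BREAK.split(turn)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


def words(utterance: str) -> list[str]:
    """Whitespace tokens of an utterance, leaving out non-terminal break marks."""
    return [tok for tok in utterance.split() if tok != NON_TERMINAL_BREAK]


def mean_length_of_utterance(turns: list[str]) -> float:
    """MLU in words over all utterances of the given turns (needs at least one)."""
    utterances = [u for turn in turns for u in split_utterances(turn)]
    return sum(len(words(u)) for u in utterances) / len(utterances)


if __name__ == "__main__":
    turns = ["eh // vabbè //", "il gelato / no //"]
    print(split_utterances(turns[0]))
    print(mean_length_of_utterance(turns))
```

On the two sample turns, the first yields two one-word utterances and the second a single three-word utterance, so the MLU is five words over three utterances.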
.. Comparison between speech and writing
In this chapter, on the basis of measurements delivered by the corpus providers and stored in the Diagram Menu of the DVD, we will look into the qualitative aspects of the spoken Romance field; that is, its structural and, in a broad sense, syntactic characteristics, correlating these to corpus design variables. The inquiry will investigate some very general points concerning different linguistic domains and levels:
1. comparison between speech and writing;
2. comparison among Romance languages;
3. linguistic variations according to the corpus design.
Some general conclusions can be outlined. Regarding point (1), in a comparison between speech and writing, general strategies of speech regarding both lexicon and structural principles emerge. The lexicon of speech has a different organisation from that of writing: on the one hand, lexical density is low and, more specifically, the proportion of nouns plus verbs is lower than in writing; on the other hand, the proportion of verbs is higher than that of nouns, as opposed to writing. Investigating speech structure in a proper way means considering what the primary characteristics of the utterance are, namely:
a. prosodic patterning, which realises the utterance in a single prosodic unit (simple utterance) or in more than one prosodic unit (compound utterance);
b. the presence of a finite verbal form within the utterance;
c. the four combinations of the above two strategies: (1) compound verbless; (2) simple verbal; (3) simple verbless; (4) compound verbal.
Following these points, primary structural strategies of speech can be established. There is a balance between simple and compound utterances, the novelty being the relevance of simple utterances when compared with the written domain. There is also a prevalence of verbal utterances, the novelty being the large proportion of verbless utterances, nearly 37% (confirmed for spoken American English), whereas in writing they represent a minority and a rhetorical device. Examining the four structural types derived from the previous features reveals that compound verbal utterances are the most common structural type (at least 43%), and in this respect speech appears similar to writing. What is divergent is that a spoken text, unlike a written text, is always characterised by the complementary balance between compound verbal utterances and simple verbless ones (at least 27%). The other two remaining types, simple verbal and compound verbless, represent a constant ‘band’
(nearly 27%), which is less significant and whose size and reciprocal relation change in a language-dependent way. Very general morpho-syntactic characteristics of speech can also be reached through the observation of the most frequent surface clause indexes (and, but, that, not, no). It has transpired that coordination is less employed in speech than in writing, but the most relevant aspect is not the quantitative one. It is rather the different use of coordination, which in speech does not behave as the logical operator it is in writing (where coordination is allowed only between phrases of the same rank and type). In fact, copulative coordination, comprising about 30–40%, and adversative coordination, about 60%, are used in speech with the pragmatic goals of opening the dialogic turn or connecting different speech acts (initial position). In speech, any functional or syntactic role must be studied in connection with its distribution according to the prosodic tagging. In this way it has been possible to discover some unexpected characteristics in the analysis of the surface clause indexes we chose in C-ORAL-ROM. While they tend to differ systematically in correlation with their prosodic distribution, this does not necessarily happen. The relation between the prosodic position of the syntactic indexes and their linguistic values is outlined as follows:
– initial position, after a terminal prosodic break or at the opening of a dialogic turn, is typically connected with pragmatic values typical of speech;
– articulate position, after a non-terminal prosodic break, generally maintains syntactic functionalities similar to those of writing;
– linearised position, in the middle of other words, strongly varies its values with respect to the index under consideration.
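The three prosodic positions listed above can be assigned mechanically once the prosodic breaks are marked in the transcript. The Python sketch below is only an illustration of that logic, not the procedure used for the corpus: it assumes whitespace-separated tokens, `//` and `?` as terminal breaks and `/` as the non-terminal break, and the function name is hypothetical. The last sample turn in the usage part is invented for illustration; the others are examples quoted in this chapter.

```python
# Assumption: "//" and "?" mark terminal breaks, "/" a non-terminal break.
TERMINAL_BREAKS = {"//", "?"}
NON_TERMINAL_BREAK = "/"


def index_positions(turn: str, index: str) -> list[str]:
    """Prosodic position of every occurrence of `index` in one dialogic turn."""
    positions = []
    previous = None  # None stands for the opening of the turn
    for token in turn.split():
        if token.lower() == index.lower():
            if previous is None or previous in TERMINAL_BREAKS:
                positions.append("initial")
            elif previous == NON_TERMINAL_BREAK:
                positions.append("articulate")
            else:
                positions.append("linearised")
        previous = token
    return positions


if __name__ == "__main__":
    print(index_positions("il gelato / no //", "no"))
    print(index_positions("se tu non hai i soldi / rimani malato e muori //", "non"))
    # Invented turn, used only to show the initial position.
    print(index_positions("ma no // non lo so //", "non"))
```

A sketch of this kind only locates the index; deciding which pragmatic or syntactic value it carries in each position remains a matter of linguistic analysis.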
Generally speaking, initial position, which is typical for coordination and forms of negation, is linked to a pragmatic value which is quite different from the traditional written one. Subordination through che/que ‘that’ is more relevant in speech than in writing and shows the prevalence of the relative function over the completive one, even if there is a clear distribution between informal texts (with more that-complement clauses) and formal ones (with more relative clauses). In speech the positions of subordinative expressions are mostly linearised and articulate, rather than initial. Moreover, it has been discovered that, with a low percentage of occurrence (nearly 2%) which is constant across languages, the index of subordination is employed in initial position with a peculiar pragmatic value for opening the utterance, in which it does not maintain a function of subordination. Negation is confirmed to be two to three times more frequent than in writing, revealing itself to be a very peculiar trait of spoken language. Its distribution differs greatly from language to language, which is unsurprising given that its behaviour and the number of negation indexes are different in each Romance language. On the whole, very general aspects of comparison between speech and writing can be derived if utterances as well as their prosodic tagging are considered.
.. Comparison among spoken Romance languages
With regard to points (2) and (3), relating to the comparison among Romance languages and the variation according to the corpus design, the analysis was developed starting from the Italian corpus, which acts as the target language for comparison, and was then extended to the other languages. Given the corpus design of C-ORAL-ROM, which was conceived to assure comparability among FRLs, a fundamental similarity can actually be observed for lexical strategy, structural strategies and types. Only French diverges from Italian, Portuguese and Spanish (henceforth referred to as IPS) in some respects and in a very particular way. The domain which appears to be the most differentiating, however, is that of the surface clause indexes. General characteristics of the spoken lexicon are confirmed for FRLs: compared to writing, all have low lexical density and a lower proportion of nouns and verbs, with a prevalence of the latter. Only Italian shows a substantial proportion of nouns and verbs (nearly 40%) compared to Portuguese, Spanish and French (PSF, 33–36%), and a minor difference between the two lexical classes. Similar structural strategies (compound and verbal traits) are observed for IPS; only French diverges, with a 34% occurrence of simple utterances and only 24% of verbless utterances. A predominance of the compound verbal structural type, of nearly 44%, is found for IPS, but not for French, whose most common structural type is simple verbal, with nearly 40%, against an average of 18% for IPS. Because of this, the complementary relationship between compound verbal and simple verbless, valid for IPS, is inverted for French. These differences are not easily explicable, however, and cannot be addressed within the scope of this work.
The morpho-syntactic characteristics are the most differentiated among FRLs, in both quantitative and functional respects. Generally speaking, Italian appears to be the language which employs a smaller proportion of surface clause indexes than the other languages. The occurrence of the Portuguese index of subordination que, at 33%, is the highest amongst the FRLs, compared with values from 16 to 20% in Italian, French and Spanish (IFS). Portuguese and Spanish have only one index of negation (não and no respectively), compared with Italian (no, non) and French (non, ne ... pas). With respect to prosodic distributions, even if different actual proportions are noted for IPS, they do show some regular alternation for the different positions of indexes, in any case maintaining the initial position with the already mentioned pragmatic valence. In contrast, a strong idiosyncrasy of French emerges, given that it shows a general trend of linearising every syntactic index and a very low value of initial positions. To sum up, we have discussed above some very general traits of comparison among FRLs observed in the corpora.
.. Variation according to the corpus design A premise must be made for point (3) regarding variation according to the corpus design. Our examination of the quantitative data regarding lexical strategy, structural
strategies, and morpho-syntactic characteristics has allowed us to identify within our corpus design a new way of evaluating data, which were previously determined on the basis of external parameters (register, social domain of use, communicative event, semantic domain, channel). This means that regular as well as unexpected tendencies were revealed only when the data were assembled according to criteria different from those of the original corpus design. The new organisation of the corpus rests on the following logical distinctions:
a. at the highest level, a distinction is made between transmission texts and face-to-face texts;
b. within those domains, transmission texts are divided into Media and Telephone, and face-to-face texts into Informal and Formal texts;
c. the latter are organised following the communicative event trait as: Informal Dialogues, Informal Monologues, Formal Dialogues, Formal Monologues.
This new 6-node taxonomy, crossing the previous corpus design which is organised into nodes, sub-nodes, and sub-sub-nodes,4 is shared by FRLs and has allowed the appreciation of some regular variations. General features can be noted for every node. For instance, texts of Media show contrastive characteristics: while they reveal measures mostly longer than informal texts and have a clear lexical nominal strategy, which is typical of formal texts, at the same time they show in some points a lower level of structuring than formal texts (for instance, in Italian, a high proportion of the compound verbless type) or a peculiar distribution of syntactic indexes. The data at our disposal are however not sufficient to clarify whether this is indeed due to the channel or to the limited size and heterogeneity of our collection. The Telephone node, if considered in its totality, reveals strong differences, both positive and negative, with respect to all other nodes, and in some sense it represents the lowest step of speech structuring. For instance, it is the only node where the lexical verbal strategy is the strongest and, at the same time, where the simple verbless type of utterance is the most frequent and exceeds the compound verbal one. All these general characteristics are shared by FRLs. For face-to-face nodes, the reorganisation has allowed us to discover regular and interesting trends of both lexical and structural characteristics as well as morpho-syntactic ones, even if the latter are less systematically organised.
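For readers who wish to regroup their own data along these lines, the 6-node taxonomy can be written down as a small nested structure. The Python sketch below is only one possible encoding; the dictionary layout and the function name are assumptions made for illustration, while the node labels are those used in this chapter.

```python
# Hypothetical encoding of the 6-node taxonomy described in the text.
TAXONOMY = {
    "transmission": ["Media", "Telephone"],
    "face-to-face": {
        "Informal": ["Dialogues", "Monologues"],
        "Formal": ["Dialogues", "Monologues"],
    },
}


def leaf_nodes(taxonomy: dict) -> list[str]:
    """Flatten the taxonomy into the list of its terminal nodes."""
    nodes = []
    for branch in taxonomy.values():
        if isinstance(branch, dict):
            # Face-to-face texts: register crossed with communicative event.
            for register, events in branch.items():
                nodes.extend(f"{register} {event}" for event in events)
        else:
            # Transmission texts: the channel is already the terminal node.
            nodes.extend(branch)
    return nodes


if __name__ == "__main__":
    print(leaf_nodes(TAXONOMY))
```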
The lexical strategy observed is that of a regular decreasing trend of verbs from Informal Dialogues to Formal Monologues, which correlates inversely with an increase in nouns (the most regular being found in Italian). With structural strategies, a regular increase in compound utterances and verbal utterances can be observed if the distribution is crossed with the communicative event trait. Different values for each language and some idiosyncrasies can be noted, such as the most regular increase being found for Italian, and a very high maximum of more than 90% of verbal utterances for French. With structural types, the increase in the compound verbal type along this scale to a maximum in Formal Monologues correlates with the decrease in the other three structural types; even with very different values for each of the FRLs, this trend is found in all of them.
The distribution of morpho-syntactic characteristics does not seem to share such clear tendencies as determined by the corpus design, because it had to be correlated with the informational distributions and the different syntactic functions that the same index may fulfil. Only some simple and common trends among FRLs can be captured, such as the prevalence of initial positions for coordinatives, the prevalence of linearised positions for subordinatives, and an average incidence of articulate positions for all the indexes, even if with different percentages and a peculiar position for French. Only for Italian has a general frame of distribution of indexes according to prosodic tagging been developed, and its results show some interesting trends. On the one hand, a kind of scale among indexes is recorded, correlated with their higher pragmatic use, starting with no and decreasing with the other indexes (ma, e, non, che); on the other hand, a kind of complementarity between the initial position and the linearised position of an index is noted as well, while the articulate position (generally bound to the standard syntactic function) covers a more stable band of occurrences. In conclusion, in the domain of variation too, it has been possible to underline general trends and points of comparison among FRLs, revealing that the utterance is a good device for speech corpora analysis. In what follows, the most relevant quantitative data emerging from the computation will be discussed in detail. Some of the figures presented below are a selection from those found on the DVD accompanying this book. These figures are identified with reference to the DVD and follow the numbering of the DVD itself, with the first number indicating the Menu, the second the Frame, and a possible third one a Subframe. A letter, if present, identifies the language: I(talian), P(ortuguese), S(panish), F(rench). Figures not found on the DVD are referred to in the regular format of this volume.
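The numbering convention for DVD figures just described (Menu, Frame, optional Subframe, optional language letter) can be decoded mechanically. The following Python sketch is a hypothetical helper written for illustration; the regular expression encodes only the convention as stated above, and the function name and returned field names are assumptions.

```python
import re

LANGUAGES = {"I": "Italian", "P": "Portuguese", "S": "Spanish", "F": "French"}

# "Figure DVD <menu>.<frame>[.<subframe>][.<language letter>]"
DVD_FIGURE = re.compile(r"Figure DVD (\d+)\.(\d+)(?:\.(\d+))?(?:\.([IPSF]))?$")


def parse_dvd_figure(reference: str):
    """Decode a DVD figure reference; return None for any other reference."""
    match = DVD_FIGURE.match(reference.strip())
    if match is None:
        return None
    menu, frame, subframe, letter = match.groups()
    return {
        "menu": int(menu),
        "frame": int(frame),
        "subframe": int(subframe) if subframe else None,
        "language": LANGUAGES.get(letter),  # None when no letter is given
    }


if __name__ == "__main__":
    print(parse_dvd_figure("Figure DVD 2.2.1.P"))
    print(parse_dvd_figure("Figure DVD 1.1"))
    print(parse_dvd_figure("Figure 6.1"))
```

References in the regular format of the volume, such as Figure 6.1, are deliberately not matched, since they do not point to the DVD.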
. The noun vs. verb lexical strategy in speech
The wide variety of lexical strategies found in written and spoken language has been extensively dealt with in the literature (Biber 1988; Biber et al. 1999; Giordano & Voghera 2002; Halliday 1989; Laudanna, Voghera, & Gazzellini 2001; Miller & Weinert 1998). While the use of nouns seems prevalent in written language, in spoken language the use of verbs is much more frequent. The Longman Grammar (Biber et al. 1999: 65), for instance, shows that in fiction and academic prose (both written language) there are three to four nouns per lexical verb. Some research in the Romance domain also confirms the prevalence of nouns over verbs in written corpora. In conversation, Biber notes that nouns and verbs are more or less equally frequent (around 13–14%). In contrast, in Romance conversation the percentages of verbs are clearly higher than those of nouns. The proportions of verbs and nouns found in the C-ORAL-ROM corpus are illustrated in Figure DVD 1.1. A first general observation can be made: nominal forms considered together with verbal forms account for approximately 40% of total lexical forms in Italian. This value
Figure DVD 1.1 Overall distribution of nouns and verbs in the four Romance languages
Figure DVD 1.2 Distribution of nouns and verbs in the six corpus nodes in the four languages
Figure 6.1 Distribution of nouns and verbs in the six corpus nodes in Italian
appears low, compared to general values for writing. It confirms the hypothesis of low lexical density of speech and allows us to see how high the percentage of functional expressions, interjections and discourse markers is.5 Data from the other Romance languages reveal an even lower incidence of these items than in Italian, as they only appear in approximately 31.5% to 35% of cases.6 Figure DVD 1.2 allows a more detailed examination of the lexical distribution of nouns and verbs specific to the nodes of the corpus design. From the line diagram for Italian in Figure 6.1, it is easy to evaluate the regular increase in nouns from Informal Dialogues to Media (23.54%), with a very marked drop in the Telephone node, and, in a complementary way, a decrease in verbs from Informal Dialogues to Formal Monologues and an increase in Media and Telephone (22.41%). The prevalence of nouns in Media and that of verbs in Telephone is revealing of the peculiar behaviour of transmission texts compared to face-to-face ones.7
.. Lexical strategy and formality In the case of natural nodes, face-to-face interactions produce texts which can be mainly distinguished on the basis of their formality or informality, and on the basis of the structure of the communicative event as dialogue or monologue. A consequence of formality is a greater degree of textuality of the spoken exchanges, and thus a greater structuring of formal texts when compared to informal ones. Therefore, a scale which stretches from the lowest degree of complexity to the highest exists, according to a generic grouping principle: informal > less textual; formal > more textual, so that the corpus variation is articulated according to the following scale:
Informal dialogues – Informal monologues – Formal dialogues – Formal monologues

The verbal and nominal lexical strategies seem to be a direct consequence of formality, which normally determines a consistent use of nominal forms; after all, this is the rule in written language, whose main aim is the composition of texts. As a matter of fact, Figure 6.1 shows for Italian a regular lexical strategy along this complexity scale, with a decreasing proportion of verbs mirrored by an increasing proportion of nouns; the two strategies are relatively balanced in Formal Dialogues, where they reach very similar values. The verbal and nominal lexical trends of French, Portuguese and Spanish for the four nodes on this complexity scale can be appreciated in Figure 6.2a, where we see that Portuguese and French reveal a regular decrease in verbs, very similar to the Italian trend, while Spanish shows a roughly flat trend. In contrast, as seen in Figure 6.2b, while French and Spanish still show a regular increase in nouns, Portuguese does not feature so regular a trend. In conclusion, we can claim that:
a. the verbal lexical strategy of speech is confirmed to a fair degree;
b. a very general variation trend among natural nodes can be outlined, i.e. a decrease in verbs from Informal to Formal nodes and an increase in nouns from Informal to Formal ones, following a linear scale.
Nevertheless, a different weight of verbs and nouns in the lexicon, a different general ratio between the two, and some peculiar characteristics for each Romance language can also be noted.
Figure 6.2a Verbal lexical strategy in the four natural nodes for French, Portuguese and Spanish
Figure 6.2b Nominal lexical strategy in the four natural nodes for French, Portuguese and Spanish
. Informational patterning
Informational patterning is the essential feature of spoken utterances which makes them simple or compound. Simple utterances consist of a single linguistic sequence, ending with a terminal prosodic break.8 With regard to our theoretical framework, simple utterances feature a single informational unit of the ‘comment’ type (Cresti 2000; Hockett 1958) which is functional for the completion of the illocutionary act (Austin 1962). They generally correspond to a rather brief and syntactically simple linguistic sequence.

*SRE: voi no ? (ifamcv02)
[(do) you not ... ?]
*CIC: mille a voi // (ifamcv14)
[(there’s) your thousand (lire)]
*SAM: non ho capito la domanda // (inatla03)
[I haven’t understood the question]

Compound utterances consist of a number of linguistic chunks, spanning from the beginning of the utterance to a terminal prosodic break, and contain at least one non-terminal prosodic break. They are made compound by possessing at least one comment unit plus at least one supplementary informational unit. They involve an informational relation between functionally distinct expressions, which are, however, also semantically related to a certain extent, thus creating an utterance which may be fairly long and syntactically complex, e.g.:
*SAB: eh / niente / c’ aveva questo pantalone di pelle tutt’ attillato / a vita bassa / guarda // (ifamdl09)
[huh / well / she had these skin-tight leather trousers / low waist / look //]
*LUC: sabato mattina / all’ undici / eccotelo // (ifamcv22)
[saturday morning / at eleven / there he comes]
*DON: se tu non hai i soldi / rimani malato e muori // (imedrp03)
[if you haven’t got the money / you stay ill and die]

The general distributive values connected with these two traits for the four Romance languages are illustrated in Figure DVD 2.1.1. As can be seen there, a distinctive characteristic of spoken language is the significant presence of simple utterances; although compound utterances comprise the majority of utterances, this is only a very small majority, with nearly half of the corpus being simple utterances. The data are comparable among IPS, even though the simple strategy appears to be more frequent in Portuguese and Spanish than in Italian. Only the French corpus shows a markedly different result, with compound utterances comprising 38.50% compared to 61.50% of simple utterances. The high incidence of the simple strategy in French is linked to other peculiar values which characterise this language.
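The simple/compound distinction defined in this section depends only on whether the utterance contains a non-terminal prosodic break, so it can be checked directly on the transcription. The Python sketch below is an illustrative assumption rather than the procedure used to produce the corpus measurements: it expects one utterance at a time, with `/` as the non-terminal break and `//` or `?` as the closing terminal break, and its function names are hypothetical.

```python
# Assumption: "/" is the non-terminal break; "//" and "?" are terminal breaks.
TERMINAL_BREAKS = {"//", "?"}
NON_TERMINAL_BREAK = "/"


def informational_pattern(utterance: str) -> str:
    """Return 'compound' if the utterance has a non-terminal break, else 'simple'."""
    tokens = [tok for tok in utterance.split() if tok not in TERMINAL_BREAKS]
    return "compound" if NON_TERMINAL_BREAK in tokens else "simple"


def compound_share(utterances: list[str]) -> float:
    """Percentage of compound utterances in a non-empty list of utterances."""
    compound = sum(informational_pattern(u) == "compound" for u in utterances)
    return 100 * compound / len(utterances)


if __name__ == "__main__":
    sample = [
        "voi no ?",
        "mille a voi //",
        "non ho capito la domanda //",
        "sabato mattina / all’ undici / eccotelo //",
        "se tu non hai i soldi / rimani malato e muori //",
    ]
    for utterance in sample:
        print(informational_pattern(utterance), "<-", utterance)
    print(compound_share(sample))
```

The five sample utterances are the examples quoted above; three of them are simple and two compound.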
Figure DVD 2.1.1 Overall distribution of simple and compound utterances for the four languages
.. Informational patterning according to the corpus design A marked variability in the proportion of simple vs. compound utterances, however, can be noticed when this is examined in connection with the corpus design nodes, as illustrated in Figure DVD 2.2.1. The simple/compound structural strategy does not follow a linear scale from Informal nodes to Formal ones, as, for instance, happens for lexical strategy, but instead involves variations relative to the structure of the communicative event. Specifically, it varies in connection with the dialogical or monologic trait. We can still find a scalar trend along this variable, but this time based on two intercrossed groupings, according to the communicative event, as follows: Informal Dialogues & Formal Dialogues – Informal Monologues & Formal Monologues Figure 6.3 illustrates the regular trend of the Italian compound strategy when considered according to the above-mentioned cross scale. The structural strategy of articulation is therefore positively correlated with monologues by this relation: monological > more articulated. It should also be pointed out that the formality trait also determines a positive percentile variation compared to the informal nodes, of 13% in dialogues and 9% in monologues.
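The difference between the linear scale found for lexical strategy and the cross scale found for the compound strategy amounts to a different priority between the two traits that define a face-to-face node, register and communicative event. The Python sketch below makes this explicit as two sort keys. It is an illustration only: the class and function names are hypothetical, and placing the informal node before the formal one inside each event type on the cross scale is an assumption based on the positive effect of formality reported above.

```python
from typing import NamedTuple


class Node(NamedTuple):
    register: str  # "Informal" or "Formal"
    event: str     # "Dialogues" or "Monologues"

    def label(self) -> str:
        return f"{self.register} {self.event}"


# Deliberately listed in an arbitrary order; the keys below do the ordering.
NODES = [
    Node("Formal", "Monologues"),
    Node("Informal", "Monologues"),
    Node("Formal", "Dialogues"),
    Node("Informal", "Dialogues"),
]


def linear_key(node: Node) -> tuple[bool, bool]:
    """Linear scale: register first, communicative event second."""
    return (node.register == "Formal", node.event == "Monologues")


def cross_key(node: Node) -> tuple[bool, bool]:
    """Cross scale: communicative event first, register second."""
    return (node.event == "Monologues", node.register == "Formal")


if __name__ == "__main__":
    print([n.label() for n in sorted(NODES, key=linear_key)])
    print([n.label() for n in sorted(NODES, key=cross_key)])
```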
Figure DVD 2.2.1 Distribution of simple and compound utterances in the six corpus nodes in the four languages
Figure 6.3 Distribution of compound utterances in the four natural nodes according to a cross scale in Italian
Figure 6.4 Distribution of compound utterances in the four natural nodes according to a cross scale in French, Portuguese, and Spanish
The same cross scale is valid for Portuguese, as seen in Figure DVD 2.2.1.P, even if formality seems to determine the variation in a less significant way. French still shows a cross scalar trend, as seen in Figure DVD 2.2.1.F, while Spanish exhibits a peculiar trend which is not properly scalar, shown in Figure DVD 2.2.1.S.
The line chart in Figure 6.4 shows the trends of compound utterances in French, Portuguese and Spanish. Generally speaking, for Portuguese and French, the variation of compound and simple utterances along the corpus design seems to be linked both to the kind of register and to the communicative event, following a cross scalar trend. In contrast, Spanish does not follow a cross scalar trend, but instead displays different distribution values in Informal nodes with respect to Formal ones.
The verbal utterance

A verbal form inflected for person within an utterance is the head of a verbal phrase and, together with the elements it governs, syntactically generates a clause (Blanche-Benveniste 1993a; Blanche-Benveniste et al. 1984, 1990).⁹ It must be considered a basic structural core, a supposition confirmed by two interconnected considerations. First, the finite verbal form acts as a constructive core within the reference unit of spoken language, that is, at utterance level: when a verb appears in an utterance, it determines the utterance’s syntactic structure.

*TIZ: però / dove vado / il venerdì / da quella signora dove vo il venerdì / la c’ ha un bambino piccolino / di un anno e mezzo / no // (ifamdl08)
[but / where I go / on a Friday / to that lady’s where I go on a Friday / she’s got a little boy / of one and a half (years) / no //]
*LUC: quande mangiao / la m’ entraa dentro / la garza / sicché // (ifamcv22)
[when I ate / it went inside / the bandage / so //]
*MAR: a cosa si ribella ? (imedin01)
[what is he rebelling against?]

At the same time, the presence of a verbal form also acts as a structural index: if the utterance does not feature a verb but contains a lexical form of another kind (e.g. an interjection, noun, adjective or adverb; see Cresti 1998; Fernández & Ginzburg 2002; Scarano 2004), the syntactic configuration is extremely reduced, dropping down to a single expression, on its own or, at most, accompanied by its determiners, modifiers or attributes.

*LUC: il gelato / no // (ifamcv10)
[ice cream / no way //]
*NIC: perché io / mh // (ifamdl03)
[because I / mh //]
*VER: eh // vabbè // (ifamcv03)
[oh // alright //]
Figure DVD 2.1.2 Overall distribution of verbal and verbless utterances in the four languages
The incidence of verbal and verbless utterances in the four languages can be seen in Figure DVD 2.1.2.¹⁰ The general result, that nearly 37% of utterances in at least three of the languages (IPS) are verbless, is of great significance: it confirms a basic observation from corpus linguistics, first made by the Longman Grammar (Biber et al. 1999: 1071) with respect to Anglo-American speech, that 38% of conversational spoken utterances do not have a clause structure and are therefore verbless. This marks one of the most important syntactic differences between spoken and written language. While the Spanish and Portuguese corpora have 37.23% and 36.60% verbless utterances respectively, the French percentages are different, with 24.1% verbless vs. 75.9% verbal utterances. As already mentioned with regard to the simplicity strategy, French shows peculiar differences which would be interesting to investigate in greater depth.
The verbal utterance according to the corpus design

The distribution of verbal and verbless utterances according to the corpus design can be seen in Figure DVD 2.2.2. As with the simple/compound alternation, a correlation can be noted between the monologic communicative event and an increasing frequency of verbal utterances.
Figure DVD 2.2.2 Distribution of verbal and verbless utterances in the six corpus nodes in the four languages
Figure 6.5 shows the regular Italian trend of increase in verbal utterances in the natural nodes, corresponding to the cross scalar trend mentioned earlier. In this case too, the monologic trait is decisive: the verbal structuring strategy is positively influenced by the monologic structure of the communicative event, according to the relation: monologic > more verbal.¹¹ Figure 6.6 shows the trend of verbal structuring in FPS. Here, French follows a cross scale like Italian and, most significantly, its proportion of verbal utterances in monologic nodes is close to 100%. For Portuguese, the contribution of the monologic/dialogic alternation found in Italian is also in evidence. In Spanish, as for the compound strategy, the pattern is different from the other languages: although the lowest value is found in Informal Dialogues (58.60%), that of Formal Monologues (68.50%) is lower than those of the two intermediate nodes, so no regular scalar trend is followed. Different variation frames of verbal and verbless utterances along the corpus design’s nodes can therefore be noted in the FRLs, even while, with the exception of Spanish, a general cross scale pattern still applies.
Figure 6.5 Distribution of verbal utterances in the four natural nodes according to a cross scale in Italian
Figure 6.6 Distribution of verbal utterances in the four natural nodes according to a cross scale for French, Portuguese and Spanish
The ‘non-structuring strategies’ in Italian

The compound and verbal structuring strategies considered so far have a counterpart which could be described as ‘non-structuring’ strategies: simple and verbless utterances, which, as we have noticed, appear very frequently in speech. In brief, the same groupings already seen for the articulation of compound patterning in the utterance and for the verbal strategy also apply to these basic construction strategies, but with complementary values. The simple utterance strategy is positively correlated with the dialogic trait, according to the relation: dialogic > more simple. Figure 6.7 shows how, in Italian, informality, compared to the corresponding formality values, produces a positive percentage increase in simple utterances of 9.5% in dialogues and almost 12% in monologues, with the mirror-image effect on compound utterances. The structure of the communicative event has a similar degree of influence on the verbless utterance strategy: dialogic > more verbless; the dialogic trait is again decisive, as is evident in Figure 6.8. Also observable is the pattern whereby the proportion of verbless utterances increases between formal and informal domains by 9.85% in dialogues and by almost 8% in monologues. The opposite trends of verbal and verbless structuring in Italian can also be appreciated in the figure. The relevance of the verbless strategy must not be overlooked, as its significant frequency is one of the main findings concerning speech construction. For example, Formal Dialogues, which are associated with public and professional circumstances of use, record a percentage of verbless utterances of over 35%: a very high, and somewhat unexpected, value. The patterns for the other Romance languages are illustrated in Figures DVD 2.2.1 and DVD 2.2.2, presented earlier.
Figure 6.7 Distribution of compound and simple utterances in the four natural nodes according to a cross scale in Italian
Figure 6.8 Distribution of verbal and verbless utterances in the four natural nodes according to a cross scale in Italian
The structural types of utterances

The four combinations generated by informational patterning and the verbal trait are: (1) compound verbless; (2) simple verbal; (3) simple verbless; (4) compound verbal. These will be referred to as structural types, and a few examples illustrating them follow.

1. Compound verbless:
*LUC: il gelato / no // (ifamcv10)
[ice cream / no way //]
*SND: belli / i jeans // (ifamcv21)
[nice / those jeans //]
*LUC: sabato mattina / all’ undici / eccotelo // (ifamcv22)
[saturday morning / at eleven / there he comes]

2. Simple verbal:
*LUC: oggi fa freddo // (ifamcv10)
[it’s cold today //]
*AGO: t’ è piaciuto ? (ipubdl03)
[did you like it?]
*VFC: bisognava salire su un ponte // (imedrp01)
[you had to climb on a bridge //]

3. Simple verbless:
*ELA: tutto il giorno // (ifamdl08)
[all day long //]
*NIN: e allora ? (ifamdl01)
[and then what ?]
*SAM: benissimo // (inatla03)
[very well //]

4. Compound verbal:
*CLA: quando lei va via la sera / nell’ascensore ‘un ce più luce // (ifamdl16)
[when she goes away in the night / in the elevator the light is off //]
*LUC: io te l’avevo detto / che ‘un c’ eran tutti // (ifamcv10)
[I had told you about it / that not all of them were there //]
*ELA: poi non lo mangia / i’ biscotto // (ifamdl02)
[then he doesn’t eat it / that biscuit //]

The general proportions of incidence associated with these structural types for the four languages are shown in Figure DVD 2.1.3.

Figure DVD 2.1.3 Overall distribution of utterance types in the four languages

For Italian, the compound verbal type is the most frequent (45.8%): this means that nearly half the time the “compoundness” of an utterance is associated with the presence of a verb. In contrast, the simple strategy is, in most cases, accompanied by a verbless filling (26.8%) rather than a verbal one (16.1%). The compound verbal type is in fact the most common in three Romance languages, its occurrence in Portuguese (43.76%) and in Spanish (42.67%) being very close to that in Italian (45.8%). Again it is French, with a proportion of 36.54%, which shows a different strategy and surprisingly features the simple verbal type as the most common way of structuring utterances (39.36%). The percentage of simple verbless utterances is even less variable among the languages (Portuguese 27.13%, Spanish 29%, Italian 26.8%), and French, at 22.14%, is not far off either.
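The four-way typology can be sketched as a small classifier. This is only an illustration under stated assumptions: the marks `/` (non-terminal break) and `//` or `?` (terminal break) are read off the transcribed examples above, an utterance is treated as compound when it contains at least one non-terminal break, and the presence of a finite verb is supplied by hand as a boolean (in practice it would have to come from morphosyntactic annotation). The function names are hypothetical.

```python
def is_compound(utterance: str) -> bool:
    """Return True if the utterance holds at least one non-terminal break ('/').

    Assumes prosodic marks are separated from words by spaces, as in the
    transcriptions quoted above, so '//' and '?' are distinct tokens.
    """
    return "/" in utterance.split()


def structural_type(utterance: str, has_finite_verb: bool) -> str:
    """Combine informational patterning and the verbal trait into one label."""
    patterning = "compound" if is_compound(utterance) else "simple"
    verbal_trait = "verbal" if has_finite_verb else "verbless"
    return f"{patterning} {verbal_trait}"


# Utterances quoted above; the verb flags are assigned by hand.
EXAMPLES = [
    ("il gelato / no //", False),
    ("oggi fa freddo //", True),
    ("tutto il giorno //", False),
    ("poi non lo mangia / i' biscotto //", True),
]

for text, has_verb in EXAMPLES:
    print(f"{structural_type(text, has_verb):18} {text}")
```

Keeping the verb flag as an explicit argument makes the dependence on annotation visible: the prosodic marks alone decide simple vs. compound, but not verbal vs. verbless.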
General tendencies

We now consider the values connected with the four structural types in each of the six nodes of the corpus design; results for the four languages are shown in Figure DVD 2.2.5.

Figure DVD 2.2.5 Distribution of four types of utterance in the six corpus nodes for the four languages

For Italian, the distribution of the four structural types across the six nodes shows the following general quantitative traits:

a. Compound verbal is, as previously noted, the most frequently occurring type, always accounting for at least 35% of utterance types, though showing significant fluctuations among the different nodes, with maximum frequencies nearly double the minimum ones.
b. Simple verbless is second by incidence, and shows the most significant fluctuations between nodes, with maximum frequencies around three times the minimum ones. A remarkable result is that of the Telephone node, where the simple verbless structure reaches its maximum overall incidence of 39.52%.
c. Simple verbal is the third most frequent strategy at around 12%, and features remarkably constant values.
d. Compound verbless is the least frequent type, with a constant average incidence of 10% in each node.

These data point to two results. First, compound verbless and simple verbal strategies account for at least 21% of cases in each node, and are therefore a constant trait in speech. More specifically, the almost constant percentages of compound verbless (10%) and simple verbal utterances (between 12% and 17%) throughout the corpus design nodes show that these strategies are systematically used in speech, and are not strictly dependent on the variability of sociolinguistic traits or communicative events. Second, the variation in the structure of utterances is remarkably correlated with socio-structural variability where the compound verbal and simple verbless strategies are concerned, and these two appear complementary. They represent the maximum and minimum extents of possible complexity, and the wide variability of the percentages associated with them shows that it is their very complementarity which generates the textual construction character of each node.

As already mentioned, a general pattern regarding the structural types of utterances emerges by comparing the Italian data with the Portuguese and Spanish ones, seen in Figure DVD 2.2.5.
For these three languages, at least 42% of utterances are compound verbal and almost 27% are simple verbless: the ratios between the two strategies which are responsible for the real structuring of utterances and for their variation along the corpus design’s nodes are the same. In this respect, French diverges, as the compound verbal type is not the most frequent, accounting for only 36.54% of occurrences, and the simple verbless appears less frequently with 22.14%; this can be seen in Figure DVD 2.2.5.F. The two other strategies, compound verbless and simple verbal, appear in Italian to be more stable and less dependent on the corpus design, with occurrences of 11.3% and 16.1% respectively. In Portuguese, the compound verbless strategy reaches 9.42% and the simple verbal 19.70%, and, following a similar pattern, in Spanish, 8% of utterances are compound verbless and 21% are simple verbal. Clearly, a difference can be noted between the ranges of these two structural types for Italian in contrast with these two other Romance languages. In French, compound verbless (1.91%) and simple verbal utterances (39.36%) show a completely different behaviour, as the latter appears to be the most common strategy of structuring and the former seems to be almost non-existent.
On the whole, then, the patterns for compound verbless and simple verbal types seem to be more language-dependent, a phenomenon which would merit more in-depth investigation.
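Because the four structural types cross two binary traits, the overall shares of simple/compound and verbal/verbless utterances follow by simple addition. The sketch below does this arithmetic with the percentages quoted in this section; the dictionary layout and the function name are illustrative choices, and small differences from figures cited elsewhere in the chapter are to be expected from rounding.

```python
# Overall shares (%) of the four structural types, as quoted in this section.
STRUCTURAL_TYPES = {
    "Italian": {"compound verbal": 45.8, "simple verbless": 26.8,
                "simple verbal": 16.1, "compound verbless": 11.3},
    "Portuguese": {"compound verbal": 43.76, "simple verbless": 27.13,
                   "simple verbal": 19.70, "compound verbless": 9.42},
    "Spanish": {"compound verbal": 42.67, "simple verbless": 29.0,
                "simple verbal": 21.0, "compound verbless": 8.0},
    "French": {"compound verbal": 36.54, "simple verbless": 22.14,
               "simple verbal": 39.36, "compound verbless": 1.91},
}


def marginals(shares):
    """Sum the four type shares into the two binary traits."""
    return {
        "compound": round(shares["compound verbal"] + shares["compound verbless"], 2),
        "simple": round(shares["simple verbal"] + shares["simple verbless"], 2),
        "verbal": round(shares["compound verbal"] + shares["simple verbal"], 2),
        "verbless": round(shares["compound verbless"] + shares["simple verbless"], 2),
    }


for language, shares in STRUCTURAL_TYPES.items():
    print(language, marginals(shares))
```

For French, the sums give 61.5% simple and 75.9% verbal utterances, in line with the overall French figures reported earlier in the chapter.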
Structural types according to the corpus design

With structural types, a systematic distribution applies, which follows the cross-scalarity of structural strategies in an even more marked manner, as shown in Figures DVD 2.2.3 and DVD 2.2.4. As seen in Figure DVD 2.2.4, in Italian, the highest peak of occurrence of compound verbal utterances is recorded in Formal Monologues (almost 70%), and the lowest incidence is found in Informal Dialogues (36.2%). This means that the monologic trait, together with formality, is associated with a remarkable positive variation of the compound and verbal type construction strategy (36% vs. 70%). With simple verbless utterances the opposite trend occurs, as is evident in Figure DVD 2.2.3: the dialogic trait, together with informality, is associated with a positive variation of the simple and verbless type construction strategy (from 10% to 33%), although the difference is smaller than that of the compound verbal strategy. The two strategies therefore mirror each other with respect to the variation of the corpus design and with respect to dialogue structure and register.
Figure DVD 2.2.3 Distribution of simple verbal vs. simple verbless utterances in the six corpus nodes in the four languages
Figure DVD 2.2.4 Distribution of compound verbal vs. compound verbless utterances in the six corpus nodes in the four languages
As for the intermediate types of utterance structuring (simple verbal utterances and compound verbless utterances), these co-vary according to a cross-scalarity. For both types, the dialogic trait together with informality determines only a slight positive variation in values. However, the most relevant quantitative characteristic of these strategies is the constancy of their percentage weight throughout the corpus structure. On the whole, for Italian, the structural types highlight two variation trends, as can be seen in Figure 6.9: one for the compound verbal utterance type, and the other for the three remaining types (compound verbless, simple verbal, simple verbless). The former, in keeping with the opposition between the separate grouping of formal and informal nodes, increases towards Formal Monologues; the latter decrease towards Formal Monologues.

Figure 6.9 Distribution of structural types in the four natural nodes in Italian

The same cross scale is valid for Portuguese for compound verbal utterances, with a range from 40% to 60%. Spanish also shows an increasing cross scalar trend, this time with a 34%–78% range; a similar cross scale again applies to French, ranging from 28% to 73%. The four languages thus seem to share the same principle, but each has a different range of variation along the corpus design, with French exhibiting a very wide gap between the two extremes.

In Portuguese, the cross scale trend of the simple verbless strategy ranges from 31% to 14%, while the simple verbal and compound verbless types show an incidence of roughly 20% and 9% respectively. The line chart of Figure 6.10 provides a general overview of the trends in the occurrence of structural types in Portuguese.

Spanish shows a decreasing cross scale too for simple verbless, ranging from 33% to 6%. The principle is still the same, but Spanish records a lower percentage in Formal Monologues when compared to other languages. In the simple verbal and compound verbless types, Spanish shows a less regular average of occurrence than Italian and Portuguese. A general overview of the trends in the occurrence of structural types in Spanish is shown in Figure 6.11.

As for French, the distribution of simple verbless utterances again follows a decreasing cross scale, ranging from 27% to 3%, with a very low extreme value. The two remaining types, simple verbal and compound verbless, as already mentioned, have patterns which are different from the other Romance languages, as the former is the most common type and the latter the least common, but only if the transmission texts are taken into account. When only face-to-face texts are considered, a decreasing cross scale applies to these types too, ranging from 42% to almost 23%, and from almost 3.50% to 0. Figure 6.12 gives a general overview of the trends in the occurrence of structural types for French.

On the whole, then, for FRLs, an increasing cross scalar trend applies to the compound verbal type, whereas a decreasing cross scalar trend can be noticed for the simple verbal, simple verbless and compound verbless types, although the average and the range values of each type reveal strong differences across languages.
Figure 6.10 Distribution of structural types in the four natural nodes in Portuguese
Figure 6.11 Distribution of structural types in the four natural nodes in Spanish
Figure 6.12 Distribution of structural types in the four natural nodes in French
Some remarks on Italian Media and Telephone

As seen earlier, face-to-face texts show not only various degrees of structuring, but also distributive principles which are regularly connected with the traits of formality/informality and with the monologic/dialogic communicative event, on which the classification of the nodes is based. The transmission and broadcast texts, which in our corpus correspond to the Media and Telephone nodes, do not, however, seem to be clearly related to these traits.¹² Media, although formed by dialogues which have been inserted in their original broadcast format, is, in many cases, oriented towards the formation of a true text (namely, news or popular science). By contrast, the Telephone node, formed by dialogues which take place without the spatial co-presence of the participants and without the sharing of a visual space, has no textual perspective. Media texts are the result of some form of control, which may determine, in a more or less restrictive way, their time allowances, modalities, topics, and lexicon. In the Telephone node, instead, the exchange is free; in fact, because it is independent of all the multi-modal components which apply to face-to-face exchanges, it can be considered a dialogic event in its pure state. The two non-face-to-face speech nodes are therefore markedly different both from the other speech nodes and from one another.

Consequently, Media and Telephone are very different from structural, lexical and syntactic points of view. Media shows some similarities with Informal Monologues, but diverges from these in lexical and syntactic characteristics, as we will see later, where it is instead more like the formal nodes. More specifically, Media resembles the formal nodes with regard to lexical strategy, as it clearly features a mainly nominal strategy and the maximum percentage of nouns. Just as clearly, Telephone features the highest percentage of verbs in the whole corpus design, and is, for this reason, associated with informal nodes, but with much more definite values.

Although the C-ORAL-ROM sampling can by no means be considered exhaustive with regard to the variety of Media, a greater structuring of the texts nonetheless emerges in Media when compared to informal texts. The structuring indexes are similar to those of Informal Monologues, with verbal utterances accounting for 71% of occurrences, comparable with 72.3% in Informal Monologues. Compound utterances comprise 69.5% of the total, while this percentage rises only to 70% in Informal Monologues. At the same time Media features the highest percentage of compound verbless utterances, which, as we have seen, is characteristic of informal nodes, whereas for all the remaining types Media shows figures very similar to those associated with Formal nodes.

In Telephone texts, the ratio of verbal to verbless utterances is quite balanced, at around 50%; nevertheless, we must note that the incidence of verbless utterances reaches its maximum here, given that Informal Dialogues only attain 44%. Simple utterances also prevail over compound ones, again reaching their maximum occurrence of 55%, compared to 51% in Informal Dialogues. The compound verbless utterance type records its overall lowest presence (10%), while simple verbal utterances account for over 15% of utterances, second only to their value in Informal Dialogues (18.6%). As for the simple verbless type, this reaches its highest peak in the whole corpus, with 39.52%; this tendency is of course counterbalanced by the lowest overall percentage of compound verbal utterances, 35.25%. Telephone is the only node to feature a higher count of simple verbless utterances than of compound verbal ones.

In conclusion, the choice of redefining our corpus design, with the primary distinction between natural and transmission nodes and the consequent identification of six nodes, allowed us to carry out a comparison among the FRLs concerning face-to-face nodes. The similarities between the lexical (verb/noun) strategies, the structural strategies and the structural types turned out to be stronger than expected, and at the same time some very peculiar language-dependent behaviours emerged. However, we wish to underline once again that without considering a reference unit such as the utterance, defined on the basis of its prosodic character (terminal break) and its corresponding pragmatic value (speech act), none of these measurements or comparisons could have been accomplished.
Surface clause indexes

The surface clause indexes that we have considered primary (e, ma, che, no, non) are the five most frequent functional expressions in the Italian corpus, and are among the first 20 by rank in the frequency lexicon for FRLs. They also act as the main coordinative and subordinative conjunctions and negative expressions; moreover, each of these indexes features, in speech, quantitative values and functional behaviours which are different from those in written language. The Longman Grammar has clearly pointed out this difference, offering, at the same time, exhaustive data on the incidence and behaviour of the equivalent English expressions.

More specifically, it has been claimed that speech employs coordination less than written language in the structuring of its texts (see Biber et al. 1999: 81). At first glance, this assumption seems to be correct, but at the same time we can verify that the number of coordinative indexes in our corpus is high; it is therefore interesting to look into this apparent contradiction. It has been shown that subordination is reduced in quantity and degree of complexity in spoken language. On the other hand, we know that most explicit subordinations are accomplished, in Italian, through the use of che, and with equivalent expressions in the other Romance languages (que in all three). However, we still do not know exactly what its percentage relevance is.

Only recently has it been acknowledged that negative expressions are much more frequent in speech than in written language (Biber et al. 1999: 158–160). The Longman Grammar records two important facts, the first concerning the considerably higher occurrence of negation in spoken language than in written language, the second concerning the different uses of the two primary types of negation, that is, the verb negation not and the holophrastic negation no and other negative expressions.
General percentage data

The evaluation criteria for the percentage values of the surface indexes we are about to employ must first be explained. The criterion used is the same one previously employed for informational and structural characters, whose percentages were calculated as a proportion of the number of utterances. However, the ratio of functional expressions to number of utterances implies a spurious result, as it compares entities which have different ranks: words and utterances. We can nevertheless argue for our choice on the basis of Tables 6.2 and 6.1, which provide the overall percentage data on the relations between functional expressions and total utterances, and between functional expressions and number of words, respectively. As is clear from Table 6.1, the ratios of the expressions to the number of words differ very slightly from one another.¹³ Where utterances are concerned, as can be observed in Table 6.2, coordination (e + ma) occurs in 23% of utterances, the main means of subordination (che) in 20% of utterances, and a remarkable 17% of utterances feature a negative focus (no + non). The structure of spoken text is ultimately based on utterances, and the main coordinative and subordinative conjunctions, together with the most common devices of negation, can only be properly appraised in this environment. These data directly reveal the syntactic importance of such expressions in spoken language.

Table 6.1 Incidence of e, ma, che, no and non, as a percentage of total Italian lexicon

Index   Occurrences   Rank   Approximate percentage of forms
che     7,910         5th    3%
e       6,860         7th    2%
non     4,366         11th   1%
no      2,595         18th   1%
ma      2,281         20th   1%

Table 6.2 Incidence of e, ma, che, no and non, as a percentage of occurrence in total number of utterances

Index      Occurrences   Percentage of utterances
e          6,860         17%
ma         2,281         6%
e + ma     9,141         23% (coordination)
che        7,910         20% (various subordinative functions)
no         2,595         6%
non        4,366         11%
no + non   6,961         17% (various negative focalisation forms)

Figure DVD 3.1.1 Incidence of each index as a proportion of total number of utterances in the four languages

As anticipated, the frame of distribution for surface clause indexes varies among the Romance languages, and even the weight of each index and the syntactic functions it plays may be partly different. The distribution of each index with respect to the four Romance languages can be evaluated in Figure DVD 3.1.1. In Spanish a much more prevalent occurrence of coordinative expressions emerges clearly (33%, vs. 23% in Italian); at the same time the percentage of subordination is similar (16% vs. 17% in Italian) and the only main negative expression, no, has an occurrence of 19%, just 2% more than in Italian. In Portuguese, a presence of coordination similar to that of Spanish emerges (33.46%), together with a much higher level of subordination (33.56%) compared to both Italian and Spanish (by at least 16%), and a level of negation similar to Spanish (19.10%). The French data reveal a degree of coordination (28.4%) which is intermediate between that of Italian and those of Portuguese and Spanish; a difference arises with regard to subordination, which accounts for only 16.3% of cases and appears to be considerably lower than in all the other Romance languages. This can be easily explained, because relative clauses in French are accomplished not only through que or che, as in the other Romance languages, but also by a substantial use of qui. Moreover, there is a use of parce que (4.2%)¹⁴ which in many cases substitutes for certain other occurrences of que in other Romance languages. As for negation, French shows a behaviour which is quite different from that of other languages. In order to reach a more satisfactory picture of negation, it would be necessary to obtain more data, as we only have the percentages of pas (11.3%), but not the ones regarding ne or non.
In conclusion, Italian texts appear to be the least characterised by surface clause indexes; Spanish and Portuguese texts bear similar values, and French is also comparable to these in terms of coordination. Portuguese, on the other hand, shows an impressive amount of subordination through the complementiser que.
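The methodological caveat raised in this section, that dividing word occurrences by the number of utterances mixes units of different rank, can be made concrete. The sketch below computes, for a handful of utterances quoted in this chapter, both the occurrences-per-utterance ratio and the share of utterances containing an index at least once. The tokenisation is deliberately naive (whitespace only), the function names are hypothetical, and the sample is far too small to say anything about the corpus itself.

```python
from collections import Counter

INDEXES = ("e", "ma", "che", "no", "non")
PROSODIC_MARKS = {"/", "//", "?"}


def tokens(utterance):
    """Naive whitespace tokenisation that drops the prosodic marks."""
    return [t.lower() for t in utterance.split() if t not in PROSODIC_MARKS]


def index_rates(utterances, indexes=INDEXES):
    """Return {index: (occurrences per utterance, share of utterances with it)}."""
    occurrences, containing = Counter(), Counter()
    for utterance in utterances:
        counts = Counter(t for t in tokens(utterance) if t in indexes)
        occurrences.update(counts)        # every token counts
        containing.update(counts.keys())  # each utterance counts once per index
    n = len(utterances)
    return {ix: (occurrences[ix] / n, containing[ix] / n) for ix in indexes}


# Utterances quoted in this chapter (speaker labels and file codes removed).
SAMPLE = [
    "se tu non hai i soldi / rimani malato e muori //",
    "il gelato / no //",
    "io te l'avevo detto / che 'un c' eran tutti //",
    "poi non lo mangia / i' biscotto //",
    "oggi fa freddo //",
]

for index, (per_utterance, share) in index_rates(SAMPLE).items():
    print(f"{index:4} {per_utterance:.2f} {share:.2f}")
```

The two measures coincide as long as no utterance repeats an index; an utterance that contains the same index twice raises the first figure but not the second, which is the sense in which the occurrences-to-utterances ratio is only an approximation.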
.. Percentile data according to the corpus design As has been regularly done so far for every lexical and structural feature, it is possible to verify the distribution of each syntactic index according to the corpus design for the four Romance languages. Although the distribution of indexes along the six nodes of the corpus design reveals the existence of some regularities, it only partly confirms what has emerged with regard to the text structuring types and the lexical ratio of verbs to nouns.15 Different behaviour between natural nodes and broadcastings can be noted. Where the four natural nodes are concerned, the following phenomena can be established.16 a.
e: a scalar cross trend is found in Italian, Portuguese and French (from 36% to 44%), while Spanish records the highest value in Informal Monologue (nearly 64%). b. ma: a scalar linear trend in seen in Italian, with constant values associated with the coordinative ma in all the nodes (average 5–6%), except in Formal Monologues, where a positive variation (10%) is recorded instead. Different behaviours, although with a constant band never higher than 10%, are noted for the other three languages. c. che: a scalar linear trend, which reaches its peak in Formal Monologues (nearly 50%, four times the value of Informal Dialogues) is observed in Italian, as well as in Spanish (46%). In contrast, a cross scalar trend is seen for Portuguese (nearly 60%) and French (38%).17 As already noted, with negation a very different behaviour is recorded for FRLs: d. no and non: a kind of complementary distribution with a lowest value (nearly 2%) of the first in Formal Monologues and a peak of the latter in the same node (21%) is found in Italian. pas: a scalar cross trend with a peak in Formal Monologues (20%) is seen in French; for Portuguese (nao) and Spanish (no) the highest values are in Informal nodes, with a decreasing trend. We can also notice a general aggregation of the coordinative, subordinative, and negative expressions (a sign of a greater structural complexity) in Formal Monologues for FRLs, with the only exception of coordinative e for Spanish. In Italian values double and triple the average values of the other three nodes (e 36% vs. average 17%, ma 10% vs. average 5.5%, che 66% vs. average 20%, non 25% vs. average 11%). Moreover, Table 6.3, where all the Italian percentile values of the four indexes are gathered, reveals the
Lexical strategy, structural strategies and surface clause indexes
different weight of syntactic structuring in each natural node and also shows that there is a linear scalar trend in the sum of their percentile values.18

Table 6.3 Incidence (%) of the four Italian indexes in the natural nodes

Index    Inf. Dialogues    Inf. Monologues    Form. Dialogues    Form. Monologues
e        10.20             25.60              19.80              36.40
ma       5.50              5.40               6.30               10.50
che      11.50             26                 29.20              53.60
non      8                 11.20              16.70              21.30
Total    35.2              68.2               72                 121.8

But the distributional results, which had proved so significant for lexical and structural characteristics, do not seem to be as relevant for syntactic indexes, for two reasons: a. each index performs more than one syntactic function; b. different functions are very often determined by the ‘informational position’ of the index. In fact, in the course of the analysis an unexpected characteristic was discovered: although not necessarily, different syntactic functions correlate with informational positions. As we will see in the next section, the proximity of an index to a particular prosodic mark determines its syntactic role, which can take on a more or less pragmatic value.
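The totals reported in Table 6.3 and the linear scalar trend of their sum can be rechecked with a few lines of code. The following is only a minimal sketch: the values are transcribed from the table, while the names `TABLE_6_3`, `node_totals` and `is_increasing` are illustrative and not part of the C-ORAL-ROM resources.

```python
# Values transcribed from Table 6.3 (incidence %, Italian, natural nodes).
# The dictionary layout and all names below are illustrative assumptions.
TABLE_6_3 = {
    "Informal Dialogues":  {"e": 10.20, "ma": 5.50,  "che": 11.50, "non": 8.00},
    "Informal Monologues": {"e": 25.60, "ma": 5.40,  "che": 26.00, "non": 11.20},
    "Formal Dialogues":    {"e": 19.80, "ma": 6.30,  "che": 29.20, "non": 16.70},
    "Formal Monologues":   {"e": 36.40, "ma": 10.50, "che": 53.60, "non": 21.30},
}


def node_totals(table):
    """Sum the four index values of each node, rounded to one decimal."""
    return {node: round(sum(values.values()), 1) for node, values in table.items()}


def is_increasing(values):
    """True if every value is strictly larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


totals = node_totals(TABLE_6_3)
print(totals)
# Nodes are listed from Informal Dialogues to Formal Monologues,
# so an increasing sequence corresponds to the linear scalar trend.
print(is_increasing(list(totals.values())))
```

The check only concerns the sum of the four indexes; as the table itself shows, the single indexes do not all rise monotonically (e, for instance, is higher in Informal Monologues than in Formal Dialogues).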
The informational positions of surface clause indexes

To fully understand the role of syntactic indexes in utterance structuring, it must be considered that in spontaneous speech a specific syntactic value is frequently associated with a specific informational position. The only informational positions we will consider are those defined by the left context of an index:

a. Initial: at the beginning of an utterance; that is, after a terminal prosodic break or after the start of a dialogic turn;
b. Articulate: after a non-terminal prosodic break;
c. Linearised: between two words (and/or between two unmarked positions).

Firstly, we can note the remarkable differences between the distributive values of each surface index across the three informational positions for FRLs. The figures showing the incidence of each surface index according to informational position are not presented in the book; they are, however, included on the DVD, in Diagram Menu 3.2. Only some traits can be captured here: simple and common trends among FRLs include the prevalence of linearised positions for subordinatives, and that of
Emanuela Cresti
initial positions for adversative and, partly, for copulative coordination. A similar average incidence of articulate positions occurs for all the indexes, although with different percentages for IPS; only French shows peculiar values. But we need to know more about the role that informational positions play for each index. Detailed research on this subject for FRLs is lacking; only for Italian has it been possible to outline a general framework for the distribution of indexes according to prosodic tagging, accompanied by relevant examples.
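The three informational positions defined above depend only on what immediately precedes an index, so they can be approximated mechanically on a prosodically tagged transcript. The sketch below is illustrative only and is not the procedure used in the project: it assumes that `//`, `?`, `...` and `+` mark terminal breaks and that `/` and `[/]` mark non-terminal breaks (a simplification of the tagging visible in the examples of this chapter), and the function name `classify_positions` is hypothetical.

```python
import re

# Assumed, simplified inventory of prosodic break symbols.
TERMINAL = {"//", "?", "...", "+"}
NON_TERMINAL = {"/", "[/]"}
# The Italian surface indexes discussed in the chapter.
INDEXES = {"e", "ma", "che", "no", "non"}


def classify_positions(turn):
    """Return (index, position) pairs for one transcribed dialogic turn."""
    # Drop the speaker label (e.g. *GPA:) and the trailing file code.
    text = re.sub(r"^\*\w+:\s*", "", turn)
    text = re.sub(r"\(\w+\)\s*$", "", text)
    # Drop overlap markers and the angle brackets around overlapped speech.
    text = text.replace("[<]", " ").replace("<", " ").replace(">", " ")

    results = []
    previous = None  # None stands for the start of the dialogic turn
    for token in text.split():
        word = token.lower()
        if word in INDEXES:
            if previous is None or previous in TERMINAL:
                position = "initial"
            elif previous in NON_TERMINAL:
                position = "articulate"
            else:
                position = "linearised"
            results.append((word, position))
        previous = token
    return results


example = ("*GPA: io posso dire / oh / son d' accordo // "
           "e non penso che loro facciano storie // (ifamcv02)")
print(classify_positions(example))
```

A purely positional classifier of this kind cannot tell homographs apart (for example, che as a relative versus che as a complementiser), which is why the functional analysis in the following sections still requires manual inspection.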
A general frame of correlation between syntactic functions and informational positions for Italian

As anticipated in 6.8.2 and 6.9, the three informational positions preferentially select some of the functions which each syntactic index can perform. The correlation between the syntactic function of an index and a certain informational position occurs frequently, but not necessarily; what we describe here for the Italian corpus are therefore only those trends which are very well documented.
The role of e in initial, articulate and linearised position

In the Italian grammatical tradition the main role of e is that of copulative coordination (Serianni 1988: 452–453). Besides this, however, e performs many other functions at various levels of syntactic and textual structuring: explicative, conclusive, intensifying, correlative, and as an opening marker (Serianni 1988: 196, 237, 306, 450–453). In the Italian corpus e is found in articulate position in around 43% of cases, where it maintains a function similar to the one it performs in the syntactic coordination of written language. The following examples show how this coordination can involve different constituents.

*ROS: [<] <ma scusami> / quando tu vai / e porti due giornali / non ti dividi già ? (ipubcv01)
[but look / when you go / and take two newspapers / don’t you already divide yourself ?]
*GIO: un caro saluto / e buona domenica / ai telespettatori di Rai sport // (imedsp03)
[good day / and a happy Sunday / to all of you watching Rai sport]

We must point out that in the previous examples the coordination involves constituents that are both syntactically and semantically distinct; the same does not hold for linearised position, which is the least frequent (around 20% of cases). In most cases this position is associated with fixed uses (formulas, hendiadys, frozen expressions, idioms, phraseological units, etc.). It therefore plays a specific compositional role, in which the coordination involves elements which are not really independent constituents but parts of a crystallised construction, and which tend to form a set expression. In other words, the coordination takes place within lexical compositional processes.

*TIZ: tanto fino alle quattro e mezzo / lui dorme // (ifamdl08)
[anyway until half past four (four and a half) / he sleeps //]
*GCM: io gliel’ ho dati / di storia e geografia / alle mie classi // (ipubcv05)
[I gave them to them / in history and geography / to my classes //]

Initial e occurs at the beginning of an utterance or dialogic turn in 37% of cases. This position, which is typical mostly of speech, is regarded as coordinative with pragmatic value; it is recorded in all the main grammar books (see Biber 1999: 83–84; Bosque & Demonte 1999: 2644; Mateus et al. 2003: 590; Riegel, Pellat, & Rioul 2001: 523), even if it is considered unusual. The distribution that is most unusual in writing is thus one of the most frequent in speech. When it is placed at the beginning of the first utterance of a dialogic turn, e establishes a pragmatic coordination of dialogic turns, whereas when it is placed at the beginning of an utterance within a turn, it establishes a coordination of speech acts.

*LOR: e quest’ altra volta si farà il sei // (ipubdl03)
[and next time we’ll do it on the sixth //]
*GPA: io posso dire / oh / son d’ accordo // e non penso che loro facciano storie // (ifamcv02)
[I could say / oh / I agree // and I don’t think they would bother us //]
*ISP: dice / hanno fatto la programmazione // e questo si chiama programmazione da noi / no // (inatpe03)
[they say / they’ve done the schedule // and we call this schedule / you know //]

When it opens a turn, it indicates, in most cases, something unexpected or in contrast with previous statements.
The role of ma in initial, articulate and linearised position

The written function of ma is that of linking two elements in a coordinative and adversative relation. The contraposition may or may not be exclusive. It is non-exclusive when the second element introduces a contrasting, unexpected datum with respect to the first one, but still allows their co-existence (It’s late, but I’m not tired). It is exclusive when the second element denies or cancels the first one, replacing it (He wasn’t yawning because he was tired, but because he was hungry) (see Serianni 1988: 454). In the Italian corpus initial position is by far the most important, occurring in 57% of cases. The utterance-opening modality it determines is different from that of the adversative coordination of a written clause (in which case ma is typically found in the middle of a sentence).
*OND: ma l’ influenza / è una malattia pericolosa / o no ? (imedsc02)
[but the flu / is it a dangerous disease / or not ?]
*MAN: ora loro / vogliono cambiare // ma chi gliel’ ha dato / questo ? (ifamcv28)
[now they / want to change // but who gave this to them ?]

Here, the utterance-opening value has a specific argumentative intent, partially in contrast with the argumentation of the interlocutor, who is therefore prompted to answer. In other words, ma at the beginning of a turn or utterance, besides accomplishing a pragmatic coordination between turns or speech acts within a turn (as does initial e), also creates a partial contraposition. Articulate ma (around 39% of cases) is less frequent than initial ma. It is connected with a text structuring strategy and with the maintaining of an adversative syntactic coordination, which is, however, in most cases partial. Otherwise, it may be related to the adversative coordination of two clauses, one of which is negative.

*ZIA: tu sentissi / come cantava quel ragazzo // l’ era piccolino / ma c’ aveva una voce // (ifammn01)
[if you heard / how that boy sang // he was only little / but he had such a voice //]
*OTT: [<] <no / non sono> separate // son diverse / ma non separabili // (ipubcv01)
[no / they’re not separate // they’re different / but not separable //]
*PAO: cioè / io / non dico che porto rancore / ma comunque ricordo // (imedts08)
[well / I / am not saying I bear a grudge / but I do remember //]

So, this time, the adversative coordination is never pragmatically directed towards the interlocutor’s reply, unlike when ma is found in initial position. As for linearised ma, its use is very limited (around 4% of cases), and mostly connected to a negative; this means that the linearisation is only apparent, and is used either for underlining a negative focus (no ma), or in erudite forms.

*DOM: no ma è possibile // (inatbu01)
[no but it is possible]
*SCA: il Capo dello Stato / si commuove pensando / che tu muovi il tuo passo / sofferente ma fermo / coraggioso / per andare da questi nostri cittadini provati // (inatps01)
[the Head of State / is moved by the thought / that you take your step / suffering but firm / brave / to go to these tried citizens of ours]
The role of che in initial, articulate and linearised position

Che is the main subordinative conjunction in Italian. Traditional grammars point out a dual functionality: as a conjunction (to introduce finite complement clauses) and as a pronoun (to introduce relative clauses and cleft sentences) (Serianni 1988: 463–483, 267–272). According to Generative Grammar, che must always be considered a complementiser, regardless of the different structures it introduces (Cinque 1988). In order to correctly identify the various syntactic functions performed by che, besides its general use as a complementiser, we must trace its nominal or verbal head, if it exists.19 Tracing the nominal or verbal head allows the distinction between relative and complement clauses, whereas the lack of a head signals the non-subordinative functions which are typical of speech and will be mentioned later. Let us take a look at Figure 6.13, which summarises the percentile values of the most relevant syntactic functions of che.

Figure 6.13 Incidence of syntactic functions in the distribution of che

The most important result that emerges from our data on che is the high proportion of its relative function (50%) (Alisova 1965; Aureli forthcoming; Cinque 1988; Fuchs 1987; Kleiber 1987; Scarano 2002; Strudsholm 1999), over that of complementiser (almost 33%) (see, among others, Acquaviva 1988; Blanche-Benveniste 1982; Bosque & Demonte 1999; Elsness 1981; Fava 1988; Thompson & Mulac 1991; and the pages dedicated to relative structures in Quirk et al. 1985). The data also reveal the unexpected weight of non-subordinative functions (12.13% + 1.8%), among which we must point out those cases in which che assumes explicative or justificative values (1.8%), which only occur in spoken language.20 Uncertain instances are relatively few (2.6%).21

As regards informational positions, in the Italian corpus linearised position is the most frequent (51% of cases). This is the main difference with respect to the coordinative indexes, which record a lower percentage of linearised positions. Furthermore, whereas for coordinative functional expressions linearised position is associated with non-syntactic, lexical or formulaic uses, for che it must be stressed that this is the position in which it performs most of its syntactic functions.
The most common function of linearised che is that of introducing restrictive relative clauses;22 other common, if less frequent, uses are pseudo-relative clauses, cleft sentences, and different types of non-standard relative clauses. Less common uses of che include introducing various types of completive clauses.

*ANT: e sai / con le leggi che ci sono / ’un te le fanno nemmeno coprire // (ipubcv06)
[and you know / with these laws / they won’t even let you cover them]
*ANG: si siede un po’ / è lì che pensa / dice / mah / chissà cos’ è successo // (ifammn20)
[he sits down a bit / he’s thinking / well / who knows what happened //]
*ANT: [<] <no> // sei tu che vuoi parlare // (ifamdl01)
[no // it’s you who wants to talk no matter what]
*ART: loro preferiscano così // qualcuno dice che ci si mette meno // io / per conto mio / ci metto meno (ifamdl04)
[they prefer it that way // some say that it’s quicker // I / personally / find it quicker //]

Articulate position, which accounts for just under half of the cases (41%), features a wealth of syntactic functions. This position seems to share all the syntactic functions of linearised position, but is actually typically connected with appositive relative clauses. Even when it forms complement clauses, from an informational point of view the content of the clause introduced by che appears as a non-essential addition to the information contained in the utterance. It behaves as an adjunct or parenthetical insertion, and in our theoretical framework both appositive relatives and complement clauses have the informational function of “appendix” (Cresti 2000; Scarano forthcoming).
*SIL: invece mi sono portato [/] mi sono ricordato di portarmi / il mio film preferito / che è Harry ti presento Sally // (ifamcv05)
[instead I brought with me [/] I remembered to bring / my favourite film / which is “When Harry met Sally”]
*ANG: vede un uccello / che vola via // (ifammn20)
[He sees a bird / that flies away //]
*MIC: cioè / è proprio quel &mo [/] quel suo modo di essere / che fa ridere // (ifamdl01)
[well / it’s the actual way he is / that makes you laugh //]
*LUC: io te l’avevo detto / che ‘un c’ eran tutti // (ifamcv10)
[I told you / that they weren’t all there //]

Initial position, which is so frequent for the indexes previously considered, and close to the pragmatic coordination typical of speech, has in this case the lowest incidence (around 8%). It is used in questions and exclamations, and not for syntactic subordination purposes (complement or relative uses), in accordance with grammar.
However, grammars do not generally grasp the pragmatic value of che in initial position, which shows non-subordinative uses at turn opening and utterance opening, with no linguistic antecedent in the dialogue, and with a marked illocutionary value (explanatory or justificative). Therefore, more generally, we can assume the existence of a correlation between the initial position of che and the function of marking the illocutionary value of the utterance (question, exclamation, explanation).

*MAX: che anno era ? (ifamcv01)
[what year was it ?]
*ISP: io dovrò offrire solo le opportunità //
*COW: sì //
*ISP: che è una cosa (inatpe03)
[*ISP: I will only have to offer the opportunities *COW: yes // *ISP: that is a different thing //]
*VER: [<] <se ci dai una> mano / è molto gradita // che poi / la Barbara / aspetterà che Simone / torni da lavorare / lo porta fuori / io / devo rimanere in casa a aspettare tutti / e dopo faccio xxx + (ifamdl14)
[if you give us a hand / it’s very welcome // *that then / Barbara / will wait for Simone / to come home from work / takes him out / I / have to stay home to wait for everyone / and then I’ll do + ]
The role of no in initial, articulate and linearised position

No is an adverb used as a holophrase, as it acts as a clause substitute (pro-sentence, after an interrogative). It can also occur as an interjection (with functions which place it among discourse markers, Serianni 1988: 317), as the marker of a tag question (C’è una vista straordinaria, no ? ‘There’s a wonderful view, isn’t there?’, Serianni 1988: 438), or as the second element of an indirect interrogative clause (Ora si tratta di capire se accetterà o no ‘Now we must wait and see if he accepts or not’, Serianni 1988: 483). In the Italian corpus, initial position is by far the most relevant (around 57% of cases). It is hard to find a corresponding use in written language, where no is very seldom used, and in an altogether different way. Again, initial position seems associated with an utterance-opening value; however, unlike initial e and ma, it does not so much coordinate turns or speech acts within a turn as perform two different pragmatic functions: (1) expressing disagreement and refusal; (2) making contact and toning down a previous statement (the speaker’s intent of gaining goodwill). Besides occurring after a terminal prosodic break, the expression is very often followed by a non-terminal prosodic break, which isolates it and lets it be identified as a self-standing unit of information.

*ANT: no // costano troppo // è difficilissimo // (ifamdl05)
[no // they’re too expensive // it’s extremely difficult //]
*COA: no / questo / secondo me / questo è / insomma / automatico // (inatpe01)
[no / this / in my opinion / this is / well / automatic //]

making contact:
*ROS: no / io credo molto nell’ amicizia // però / effettivamente / parecchie volte / &he / non ho ricevuto quello che pensavo // (imedin01)
[no / I believe a great deal in friendship // but / actually / many times / &he / I didn’t get back what I thought //]

Articulate no is considerably less frequent than initial no (26%). Articulate no also seems connected to pragmatic functions. Moreover, as in the previous case, it is mostly found isolated between two non-terminal prosodic breaks, which identify it as an autonomous information unit. However, the fact of being placed not at the beginning but within the utterance determines different pragmatic functions: (1) solving contact problems; (2) taking time to express a previous statement in different or more precise terms (reformulation).

*LUC: il gelato / no // (ifamcv10)
[ice cream / no (way) //]
*DOM: ecco / no / giusto per capire un attimino . . . (inatbu01)
[there / no / just to understand a second . . . ]
*GIM: [<] <sì / no> / lo sto leggendo // (imedin03)
[yes / no / I’m reading it //]

As for linearised no, still fairly frequent (around 19%), it is linked to a request for or confirmation of an alternative (o no?), and to the repetition of a negative expression, very often doubled or trebled. Nevertheless, linearised position may also be associated with a syntactic role (io no, ora no, x no) and with lexical uses (se no, dire di no).

*MAA: sì / certo // no no no no // <è vero> // (ipubcv02)
[yes / of course // no no no no // it’s true //]
*MAR: è un bene o no / secondo lei ? (imedin01)
[is it a good thing or not / according to you?]
On the whole, similar to what has been noted with regard to other syntactic indexes, no does have a pragmatic function in initial position, but unlike those, it features other pragmatic functions (phatic and reformulation) in articulate position too. As for linearised position, only alternative questions, repetition, and lexical/syntactic uses are recorded.
The role of non in initial, articulate and linearised position

Non is the most common judgement adverb, and is the main way of turning an affirmative phrase into a negative one (Serianni 1988: 426). Biber (1999: 134) points out a negative coordination function.
In the Italian corpus it is evenly split between articulate and linearised position (41% each), whereas initial position, so speech-specific for the other expressions, accounts for only the remaining 18% of cases.23 When it appears in articulate position, non is used for opening an informational unit within a compound utterance. In most cases, it marks the conclusion of a first informational unit with topic function (Cresti 2000: 126–127) and positive polarity, and the beginning of a second one with comment function and negative polarity. Otherwise, it is used as a means to achieve a negative coordination between utterances.

*BER: il [/] il vero tranello del caldo / non è il caldo di per sé / che ogni anno / tocca a tutti // (imedsc01)
[the real trap of heat / is not the heat in itself / which every year everybody is faced with //]
*GIM: &he / è uno pseudonimo / non ricordo il nome esatto // (imedin03)
[&he / it’s a pen name / I can’t remember the exact name //]
*VER: quindi / nella fattispecie / il padrone / sia pubblico che privato / non è che usa + è [/] è / nella sua natura / che ricerca il profitto / di non farlo // non è che usa lo strumento in maniera alternativa // (inatps03)
[so / in this case / the owner / public or private / it’s not that he uses + it’s in his nature / that follows profit / not to do it // it’s not that he uses the instrument in an alternative way //]

When non is linearised – that is, internal – and not connected to the informational distribution, its function is similar to the one it performs in writing. The association with the adversative coordinative conjunction (ma non) can also be pointed out.
*ANT: <se lo deve sentire addosso // ma non dev’ essere> / quel personaggio // (ifamdl01)
[he must feel like it // but he mustn’t be / that character //]
*PAN: adesso ha tentato questa / parata di stocca / ma non c’ è riuscita // (imedsp02)
[now she’s tried this / parata di stocca / but she hasn’t managed it //]
*BRA: niente // e le transaminasi non si sono più mosse // l’ emoglobina è a posto // (imedrp03)
[so // and transaminases have not moved since // haemoglobin is alright //]

The remaining position, initial position, is the most typical of speech; again, we can observe that a fair percentage of utterances (18%) actually begins with a negative expression, which appears reinforced by being placed at the opening. A construction worthy of note is non è che.
*SAM: non ho capito la domanda // (inatla03)
[I haven’t understood the question //]
*PRO: non con noi // (ipubdl04)
[not with us]
Incidence of informational positions of surface indexes in Italian

Here we can summarise the preceding discussion on the occurrence of each informational position for all the indexes. Initial position, at the opening of a dialogic turn or utterance, is typically connected with pragmatic values proper to speech. Briefly, the following correlations were noticed.

1. initial no (57%): used to convey disagreement and refusal, or the opposite intention of making contact or generally toning down a previous statement;
2. initial ma (56.6%): denotes a specific argumentative intent (rare in writing, where it is mostly found in the middle of a sentence);
3. initial e (35.5%): pragmatic coordination of dialogic turns, or of linguistic actions within a turn, with the intention of pointing out something unexpected or contrasting with a previous statement (a very rare usage in writing, although recorded by grammars);
4. initial non (18.4%): reinforcement of negation;
5. initial che (8.2%): rare, but significantly linked to questions and exclamations, and to a peculiar use, typical of speech, of ‘non-subordinative’ connection, with explanatory and justificative value (1.8%).

Articulate position generally maintains syntactic functionalities similar to those of writing.

1. articulate e (43.5%): syntactic coordination functions;
2. articulate che (40.6%): various syntactic functions, often with appositive relative clauses and complement clauses with incidental value;
3. articulate non (40.6%): used for signalling the conclusion of a first informational unit with topic function and positive polarity, and the beginning of a second one with comment function and negative polarity; otherwise it is used as a means to achieve a negative coordination between utterances, or in the set expression non è che;
4. articulate ma (38.7%): adversative coordination, though in most cases only partial adversative coordination; otherwise coordination of a negative coordinate clause (ma non), or adversative coordination of a clause with a main negative clause (non . . . ma);
5. articulate no (26%): used for phatic functions, or for taking time to express a previous statement in other terms.
Linearised position, with regard to e, ma and no, is the least used, and is mainly linked with lexical or rhetorical functions; with regard to che and non, however, it is the most frequent, and performs the main syntactic functions.

1. linearised che (51.2%): performs syntactic functions, introduces various types of relative and complement clauses;
2. linearised non (40.8%): similar use to that in writing, often in connection with the adversative coordinative conjunction (ma non);
3. linearised e (20.5%): set expressions, hendiadys, phraseological units or coordination within a lexical process;
4. linearised no (19%): request for or confirmation of an alternative (o no?), repetition of a negative expression, lexical composition (se no, dire di no);
5. linearised ma (3.8%): reinforcement of a negative focus (no ma), or erudite forms.

Figure 6.14 summarises the percentile values of the incidence of the Italian surface indexes in different informational positions.
Figure 6.14 Incidence of informational positions of surface indexes in Italian
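The percentages listed in the summary above can be gathered into a single structure in order to read off the prevailing informational position of each index. The sketch below is illustrative only: the values are those quoted in the running text, while the names `INCIDENCE`, `dominant_position` and `row_sums` are hypothetical.

```python
# Percentages as quoted in the summary lists for the Italian corpus.
# The layout and all names are illustrative assumptions.
INCIDENCE = {
    "e":   {"initial": 35.5, "articulate": 43.5, "linearised": 20.5},
    "ma":  {"initial": 56.6, "articulate": 38.7, "linearised": 3.8},
    "che": {"initial": 8.2,  "articulate": 40.6, "linearised": 51.2},
    "no":  {"initial": 57.0, "articulate": 26.0, "linearised": 19.0},
    "non": {"initial": 18.4, "articulate": 40.6, "linearised": 40.8},
}


def dominant_position(incidence):
    """Return, for each index, the position with the highest percentage."""
    return {index: max(shares, key=shares.get) for index, shares in incidence.items()}


def row_sums(incidence):
    """Sum the three positions of each index as a simple sanity check."""
    return {index: round(sum(shares.values()), 1) for index, shares in incidence.items()}


print(dominant_position(INCIDENCE))
print(row_sums(INCIDENCE))
```

The row sums do not all equal exactly 100; this is presumably an effect of rounding in the quoted figures (several of which are introduced by "around"), not something the sketch can resolve.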
Some remarks on coordination, subordination and negation in the four Romance languages (FRLs)

Only at this point can we try to give at least some partial answers to the questions concerning coordination, subordination and negative focusing in spoken FRLs with which this section of the chapter started. It has been maintained that the structuring of speech does not frequently feature coordination. This is in fact not the case.24 Spoken Romance data show that both the number of coordinative indexes and their frequency with respect to the utterances are high (ranging from 23% to 33%). So we cannot claim that coordination is not relevant in speech, but we must remember that coordination indexes perform various functions in it, some of which are peculiar. More specifically, the functionality of coordinative expressions varies greatly in connection with the informational positions and their distribution among the nodes. We have already noticed that coordination is much more prevalent in Portuguese and Spanish, where it has an overall occurrence of 33%, but there is a great difference between the two languages. For instance, whereas the former is strongly characterised by e in initial position, with its pragmatic value (with proportions ranging from 48% to 34% according to variation in the corpus design), the latter exhibits over 40% incidence of initial e only in Informal Dialogues and Telephone. As for e, initial position in Italian is not the most frequent in all nodes, but only in Informal Monologues (47%). In IPS, the articulate position of e, which is associated with syntactic coordination, is always common. Adversative conjunctions (ma, mas and pero) in initial position, with their associated pragmatic functions, are consistently the most used in all the nodes for IPS.
For both types of coordinative conjunctions (e and y; ma, mas and pero), linearised position, which is generally employed with a lexical or rhetorical value, is unusual in the three languages, except for Portuguese Formal Dialogues, where there is a high percentage of linearised e (30%). A different pattern is seen in the French corpus, which on the one hand still displays a generally substantial incidence of coordination (28.4%), but on the other hand reveals that linearised position is the most common both for et (nearly 71%) and for mais (nearly 60%); in contrast, initial position, so significant for the other languages, only represents 12% of occurrences for et and 15.6% for mais. On the whole, then, coordination is not so rare; it is, in fact, rather frequently used with speech-specific functionalities. Che/que is the most significant means of explicit subordination in speech, and we have already shown the great variety of syntactic functions that it performs. We have also underlined that, in Italian, the majority (50%) of them are relative; in Portuguese too, 48% to 63% of que are relative.25 What also emerged from the Italian data is how diversified they are in connection with informational positions; for instance, restrictive relative clauses, which are probably the most common kind of spoken subordination,
are associated with a linearised position, while appositive clauses necessarily occur in articulate position instead. The incidence of linearised position of che/que, which is associated with strictly syntactic functions, is in general the highest in all the nodes for FRLs. Articulate position, which is still associated with various syntactic functions, also has a significant occurrence, but it varies greatly among the nodes and among the FRLs. Initial position of che/que, in non-subordinative uses, is on the contrary the least frequent in all the languages. Portuguese has an incidence of initial que which is, on average, higher than in Italian, and Spanish shows an even more significant value with an overall 11%. In French, que has an average occurrence of only 1.3% in initial position, while 90.6% of its occurrences are in linearised position. Regarding negation, very different uses of the two negative expressions were found in Italian, and their distribution according to informational positions is further proof of this: the initial position of no, which is associated with strictly pragmatic values, is in general the most frequent throughout the nodes; in contrast, non has its lowest incidence in initial position, while the remaining two positions, articulate and linearised, are on a level between 36% and 50%. The very broad usage of negation in speech is underlined by the higher percentage of negative indexes in Portuguese and Spanish when compared to Italian, even though this is achieved through only one main negative expression. However, during the course of the research it has been impossible to distinguish between the two different functions, holophrase and verbal negation, performed by the index. It has only been possible to verify that its use is balanced between linearised and articulate positions in Portuguese, and mainly characterised by a linearised distribution in Spanish. As already noted, in French the only negative index considered reaches 99% in linearised position.
As seen earlier, face-to-face texts not only show various degrees of structuring, but also distributive principles which are regularly connected with the traits of formality/informality and with the monologic/dialogic communicative event, on which the classification of the nodes is based. The broadcasting nodes do not seem to be clearly related to these traits, and this has consequences for the distribution of syntactic functions too. Our general conclusion is that the C-ORAL-ROM corpus, with its comparable text collection, has allowed the observation of general patterns concerning different linguistic levels and domains. With regard to the comparison between speech and writing, the validity of the utterance as the basic spoken unit is confirmed. Moreover, typical speech strategies have emerged: the low density of the lexicon and the prevalence in it of verbs over nouns, and the relevance of simple utterance structuring, in parallel with the frequency of verbless utterances. From the point of view of corpus design, some clarifications have been reached, specifically that general trends of variation can be appreciated within face-to-face nodes, following linear scales or cross scales from Informal Dialogues to Formal Monologues. As a contribution to the field of comparative Romance language study, a wide structural similarity among the languages has
been outlined, but some divergences have also been noted, mostly involving French, as might be expected; these surely warrant further investigation in the future.
Notes
Much research has already been done on specific morpho-syntactic topics, carried out on the Italian sub-corpus and accompanied by comparisons with other Romance languages; see Panunzi (forthcoming); Scarano (forthcoming b); Tucci (forthcoming).
Various quality controls of the prosodic tagging were carried out during the C-ORAL-ROM project; see the Appendix of this volume. The results of the quality controls ensure reliability of the perceptual criterion of utterance recognition.
French orthographic words and the number of their written syllables often do not correspond to the acoustic datum, producing a mismatch between the apparent number of words and syllables and their sound weight. See Chapter 1, Section 1.6.4 of this volume.
The other distinctions are probably relevant at a lower level, and become interesting only in larger collections.
With regard to the evaluation of lexical density, some problems arise with PoS tagging for many expressions which are very common in speech but hardly classifiable using traditional categories of written language. One of the most common solutions has been to add a ‘Discourse Markers’ tag for those expressions which do not fit into any traditional PoS tags. The literature on the topic is wide; however, the definition of discourse markers is still open (cf. Bazzanella 1995; Berretta 1994; Giora, Meiran, & Oref 1996; Jucker & Ziv 1998; Lenk 1998; Redeker 1991; Schourup 1999; Schiffrin 1987). The most relevant work on Italian (Bazzanella 1995) treats as discourse markers many expressions of differing lexical and morphological nature that have lost their original meaning and function. The Longman Grammar notes, on the basis of the analysis of its spoken corpus, that so-called discourse markers occur mostly at the opening of the utterance. In such position ‘Discourse Markers’ have two main roles: (a) they point to a transition in the dialogue; (b) they point to an interactive relation with the listener (e.g. well, right, now, I mean, you know, I see, okay, oh, etc.).
Different values for the proportions of nouns and verbs in FRLs may be partially due to the PoS tagging criteria of each national team.
From now on we refer to Media and Telephone texts as transmission texts as opposed to face-to-face texts.
For the concept of informational unit and its correspondence with tone unit, see Halliday (1976), Lambrecht (1994), Scarano (2003). A recent overview of the domain can be found in Simon (2004).
With regard to this topic, the ‘approche pronominale’ by Blanche-Benveniste and the team of GARS is a basic reference.
A one-to-one correspondence between a verb and an utterance frequently occurs in dialogic and informal nodes. In formal and monologic nodes, generally, every utterance features more than one verb. The whole matter needs to be studied with much more accurate means of analysis; new research has already been carried out on this topic, employing specific software. New data will appear in Cresti (forthcoming b).
Although formality once again causes a positive variation of 10% in dialogues and 8% in monologues, the quantitative variation of the verbal construction strategy is only to a lesser degree a consequence of formality in dialogic nodes. Actually, the formal trait produces a lower positive variation in the degree of structuring (10%) than in articulation (13%).
The percentile values concerning Italian Media and Telephone have been calculated from the general measurements reported on the DVD. Figures related to those percentile values can be examined on the DVD (Figures DVD 2.2.1, DVD 2.2.2, DVD 2.2.3, DVD 2.2.4, DVD 2.2.5).
The exact percentage of negative tokens (non + no) is 2.51%. It is not quite as high as in our previous corpus, whose composition was more informal, but it is still higher than in American English.
This percentile value has been provided by the French team.
The figures containing the incidence of each surface index according to the corpus design are included on the DVD as Figures DVD 3.1.2, DVD 3.1.3, DVD 3.1.4, DVD 3.1.5 and DVD 3.1.6 and are not replicated here.
The percentile values of surface clause indexes according to the corpus design of FRLs have been calculated on the basis of general measurements reported on the DVD and can be found in Figures DVD 3.2.1, DVD 3.2.2, DVD 3.2.3, DVD 3.2.4, DVD 3.2.5, and DVD 3.2.6.
The frequency of sentences introduced by other relative pronouns (for instance, in Italian il quale, cui) is quite low and ultimately not significant in spoken Romance languages.
The percentile values of all the indexes can be greater than 100%, because more than one index can occur in the same utterance, or the same index can appear more than once. It may be interesting to note that this occurs only in Formal Monologues.
This task may involve the “extension” of the left context of the expression, perhaps even involving several dialogic turns. This is the reason why the simple evaluation of the position in connection with terminal or non-terminal prosodic breaks is not sufficient.
Certain works (Alisova 1972; Cinque 1988; Fiorentino 1999; Sornicola 1981) and grammars (Renzi 1988; Serianni 1988) point out the existence of a non-subordinative che which cannot be included in the above sentence types (question, exclamation). Such uses are clearly found in the corpus.
The syntactic value of che within the relative function is, however, much more articulated; more specifically, various semantic types are present in the literature (restrictive relative clause; appositive relative clause; pseudo-relative clause; cleft sentence). Besides these, there are cases of non-standard relative clauses (Alfonzetti 2002; Alisova 1965; Aureli 2003, 2004; Benincà 1993; Berretta 1993; Berruto 1987; Cinque 1988; Fiorentino 1999; Serianni 1988; Sornicola 1981), found in speech (polyvalent che; hypercorrect and analytic). As for the complementiser function (Burzio 1986; Greenbaum 1976; Kayne 1976; Quirk et al. 1985), the following types must be considered: complementiser (objective; subjective; indirect interrogative; subordination from nominal head). With regard to the non-subordinative functions of che, there are two different cases: the first concerns the correlative function (sia . . . che), the second – still without considering it a complementiser – is associated with what grammars normally refer to as sentence types (Fava 1995) (question, exclamation).
Conversely, the most relevant syntactic restriction regarding linearised che is the impossibility of selecting the appositive relative clauses: these – as will emerge soon – need an articulation. Moreover, it seems that non-subordinative che cannot occur in this position; linearised position is not generally associated with pragmatic functions.
As a matter of fact, with non, a distribution which only takes into account the left context is not entirely satisfactory, and it would be best to consider the right context too. From a certain point of view, non assumes a meaning in relation to what follows it (verb or nominal expression), as it would do if it were always linearised.
At any rate we cannot forget that in order to properly describe speech we must take into account the structure identified by prosody (informational patterning); this corresponds neither to coordination nor to subordination, but to a pragmatic organisation of the text.
The Portuguese team provided this information; for a detailed description of the incidence of syntactic functions of che in the Italian corpus, see Aureli (forthcoming).
Appendix
Evaluation of consensus on the annotation of terminal and non-terminal prosodic breaks in the C-ORAL-ROM Corpus*
Massimo Moneglia, Marco Fabbri, Silvia Quazza, Andrea Panizza, Morena Danieli, Juan María Garrido, and Marc Swerts
Goals of the evaluation
In C-ORAL-ROM, terminal breaks are considered the most relevant cue for determining utterance boundaries. The rough equivalence between an utterance and a sequence ending with a terminal break is based on the assumption that competent speakers are extremely sensitive to intentional prosodic variation (’t Hart et al. 1990) and that the voluntary accomplishment of a speech act is always accompanied by such variations. In general a competent speaker also distinguishes breaks that terminate a sequence from those which are perceived as signalling that the sequence goes on. The former are assumed to terminate the utterance or, less conclusively, to provide enough evidence that what follows necessarily belongs to a different utterance, that is, to a different pragmatic and linguistic domain. The identification of utterance boundaries by means of prosody is very significant from the linguistic point of view, and is also an easy annotation to add to spoken corpora. This evaluation aims to test the hypothesis that prosodic breaks, especially terminal ones, have strong perceptual prominence and can be detected with a high level of inter-annotator agreement. At the same time the goal of the evaluation is also to assess the reliability of the prosodic tagging of the C-ORAL-ROM speech corpora and to test whether the coding scheme has the same level of reliability when applied to different languages. More specifically, given the multilingual nature of the resource, this hypothesis can be tested at a cross-linguistic level, verifying whether language-specific features may be correlated to a different perceptual salience in prosodic parsing.
Finally, since the C-ORAL-ROM resource comprises, for each language sub-corpus, speech material belonging to distinct typologies, along the lines of a comparable corpus design, the evaluation also aims to test whether the reliability of the annotation scheme varies according to the huge variety of speech material documented.
Massimo Moneglia et al.
Evaluation background
As far as the limits of precision are concerned, there is insufficient evidence in the existing literature for this kind of prosodic labelling and no antecedent can be found for the four Romance languages. Recent literature regarding the evaluation of inter-annotator agreement with respect to various kinds of prosodic boundaries mainly addresses ToBI annotation of boundary tones by trained labellers, on non-spontaneous speech resources (Grice et al. 1996; Pitrelli et al. 1994; Syrdal & McGory 2000). The prosodic annotation of the Dutch corpus has also recently been verified by non-expert labellers (Buhmann et al. 2002). In general, such literature testifies to a high degree of agreement on boundary tones (85–92% for ToBI annotation). In the Dutch corpus, a “substantial consistency” (Landis & Koch 1977) of the annotation for strong and weak prosodic breaks has been quantified by means of the K-coefficient (Cohen 1960), where results lie in the conventional range between 0.61 and 0.80 points. Although the prosodic labelling of the Dutch corpus is similar to that of C-ORAL-ROM in terms of both the nature of the resource (spontaneous speech) and the annotation unit (prosodic break), no specific test has been performed on the distinction between terminal and non-terminal breaks. The annotation of breaks in the Dutch corpus may partially overlap that reported in C-ORAL-ROM, but it is not comparable. There, strong breaks are defined as “severe interruptions of the normal flow of speech”, while weak breaks are defined as “weak but still clearly audible interruptions of the speech flow”. Now, it is very likely that all terminal breaks are perceived as severe interruptions of the speech flow, but a remarkable number of non-terminal breaks also share this property.
In other words, a strong break may not have the functional value of terminal breaks (that is, marking the end of the utterance) and therefore has lower linguistic relevance. Based on these premises, the evaluation of C-ORAL-ROM prosodic tagging aims to provide a figure of inter-annotator agreement on the perceptual judgments that allow the detection of terminal and non-terminal breaks in the four Romance languages; this will provide, at the inter-linguistic level, first evidence of the perceptual prominence of those breaks. Thus, the hypothesis that this kind of cue has high perceptual prominence and is easily recognised will be addressed. To this end, non-expert evaluators, with no theoretical background, have been preferred.
Experimental setting
Tagging consists of the marking of prosodic breaks in the orthographic transcription of speech. In C-ORAL-ROM, each word boundary (W) is considered a possible position for a prosodic break, while within-word breaks are not.1 Each word boundary in C-ORAL-ROM transcripts thus necessarily has one of the following values:
1. no break (O);
2. terminal break (T);
3. non-terminal break (N).

Tagging is based only on perceptual judgement and does not require any specific linguistic knowledge, although the notion of the speech act is always familiar to the expert transcribers (postdoctoral scholars and doctoral students) who annotated the corpus. The standard procedure adopted in C-ORAL-ROM has been detailed in Chapter 1. This process ensures control over inter-annotator relevance of tags and maximum accuracy in the detection of terminal breaks. The accuracy with respect to non-terminal breaks is by definition lower. It is important to note, however, that the evaluation in fact did not involve an ‘exact replicability’ of the scheme, because the task proposed to the evaluators was not exactly the same as the one carried out by the expert annotators. Specifically, non-experts only had to check the tagged transcripts, and not to transcribe and annotate them by themselves, as the C-ORAL-ROM researchers did. Obviously, while non-experts are the best candidates to test the perceptual prominence of a given cue, spoken language transcription and annotation cannot be easily executed by non-experts.2 Given the size of the resource (roughly 35 hours of speech for each of the four corpora), evaluation was conducted only on a statistically significant portion of each corpus. From each language corpus, a subset was extracted, amounting to roughly 1/30 of its utterances (about 1,300 utterances and around 1:30 hours of speech). The speech sections to be evaluated were automatically selected using a random procedure ensuring the same distribution of speech types as in the corpus. The selection procedure extracts samples with the same proportion from each node of the corpus structure (see Section 1.1), and guarantees semantic and contextual coherence of the speech sections for evaluation, by choosing continuous series of utterances, and also by providing the surrounding utterances.
For each selected speech section, the procedure outputs an XML file ensuring text-audio alignment and a text file where each tagged utterance is reported twice (i.e. with a validation copy). The following is an example of an Evaluation text file (Portuguese corpus, pfamcv03_selV.txt):

1 GRA / pouquíssimas pessoas //
1 GRA / pouquíssimas pessoas //
2 GRA e era assim //
2 GRA e era assim //
3 GRA uma senhora / foi todo o tempo a gritar //
3 GRA uma senhora / foi todo o tempo a gritar //
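To make the format of the evaluation file concrete, the following minimal Python sketch reads one line of such a file and assigns a class (O, N or T) to each word boundary. It rests on assumptions that are not stated in this passage: that "//" marks a terminal break and "/" a non-terminal break, that the first two fields are the utterance number and the speaker, and that a tag occurring before the first word is simply skipped. The function name parse_tagged_line is hypothetical and is not part of the C-ORAL-ROM tools.

```python
# Assumed tag inventory: "//" = terminal break (T), "/" = non-terminal break (N).
BREAK_CLASS = {"//": "T", "/": "N"}


def parse_tagged_line(line):
    """Return (utterance number, speaker, [(word, boundary class), ...]).

    Each word is paired with the class of the boundary that follows it:
    "O" (no break), "N" (non-terminal break) or "T" (terminal break).
    """
    utt, speaker, *tokens = line.split()
    boundaries = []
    for token in tokens:
        if token in BREAK_CLASS:
            # A tag before the first word has no preceding boundary; skip it.
            if boundaries:
                word, _ = boundaries[-1]
                boundaries[-1] = (word, BREAK_CLASS[token])
        else:
            # Until a tag is seen, the boundary after a word counts as "no break".
            boundaries.append((token, "O"))
    return int(utt), speaker, boundaries


if __name__ == "__main__":
    sample = "3 GRA uma senhora / foi todo o tempo a gritar //"
    print(parse_tagged_line(sample))
```

Parsing the original line and its validation copy in this way yields two sequences of boundary classes of equal length, which is the form needed for a position-by-position comparison.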
The task was performed on PCs, with the help of the WPC speech software, which enabled evaluators to view the annotated text to be evaluated and to listen to the corresponding aligned audio signal (preferably no more than three times, depending on the length of the utterance).
The evaluators had two days’ training in which the trainers illustrated the evaluation goals, the C-ORAL-ROM corpus structure, the meaning and format of the prosodic tagging, and the evaluation criteria and procedure. The notions of terminal and non-terminal breaks (as presented in Section 1.2) were carefully explained by discussing specific examples extracted from the corpus. Written instructions were also provided. At the end of the training, a test was performed in order to assess the competence acquired by the evaluators and to ensure consistency between them in the evaluation.3 Each evaluator, independently, examined the original annotation and considered the possible existence of a prosodic break at each word boundary. If the evaluator’s perception did not match the original tagging, he could modify the validation copy by inserting, deleting or replacing prosodic-break tags. If the evaluator did not understand part of the utterance or was not able to evaluate it, he could exclude that text portion by including it between two asterisks. More specifically, the standard evaluation procedure, as explained to the evaluators, was as follows:

a. challenge of the terminal break of the selected utterance: evaluation of the perceptual cue “terminal break” by listening to a single utterance (Evaluation in isolation);
b. challenge of the terminal break of the selected utterance: evaluation of the perceptual cue by listening to the utterance together with the following one (Generic confirmation, Specific confirmation, Terminal deletion);
c. identification of possible unmarked terminal breaks within the utterance: by listening to the utterance in isolation (Terminal missing, Non-terminal substitution);
d. evaluation of non-terminal breaks within the utterance (Non-terminal missing, Non-terminal confirmation, Non-terminal misplacement): by listening to the utterance in isolation and/or to part of it.

Each evaluator worked independently from the others and took approximately 60 hours to accomplish the task, in daily sessions of four hours. None of the eight evaluators reported any difficulties in the evaluation and all of them could accomplish their task easily.4
Selection of evaluators
Two naïve, mother tongue evaluators for each of the four languages were engaged by Loquendo. Evaluators were chosen based on their having a secondary or higher education and no specific expertise in phonetics and prosody. It must however be considered that each language in the Romance collection has geographical varieties which are characterised, in some cases, by strong linguistic differences. Therefore in the selection of possible evaluators, geographical origin as well as linguistic competence were considered, in order to avoid the possibility that a mismatch between the geographical origin of evaluators and the geographical variety to be evaluated might cause a lower reliability of the work. The choices adopted to address this problem thus had to take into account: (a) the linguistic variation represented in each C-ORAL-ROM corpus; and (b) the characteristics of the evaluation task, along the following considerations:

a. The sampling criterion for each language resource in C-ORAL-ROM did not exclude instances of any possible geographical variety (see Chapter 1). Therefore it was not possible to ensure a strict correspondence between each geographical variety used in the corpus and the geographical origin of only two evaluators. Moreover, such a requirement, reasonable in principle, was not only difficult for practical reasons, but also seemed too strong a condition with respect to the objectives of the evaluation itself.
b. In the evaluation task, the detection of a signal and its recognition in the text by mother tongue speakers is strongly supported by the transcription and should not be affected by their geographical origin. In other words, under normal circumstances, the understanding of speech should be aided and ensured by the transcription; therefore the terminal and non-terminal prosodic breaks of a given text should be easily identified by all the speakers of a language, without being significantly affected by their place of origin.

Given the above, it was felt that the evaluation of prosodic tagging of each corpus could be performed by any individual whose mother tongue was the language in question, without having to exclude any place of regional origin. However, in order to reduce any possible variation caused by different linguistic competence between the evaluators, and between evaluators and C-ORAL-ROM researchers, mother tongue evaluators whose origins were outside national borders or overseas were excluded. A broader criterion was adopted, however, for the Portuguese corpus, and a Brazilian speaker was employed as evaluator. There is no certainty whether the Portuguese spoken in Brazil can still be considered a geographical variety, even an extreme one (Bacelar 2001a); Brazilian may in fact be considered a different language.
Consequently, the level of speech recognition of continental Portuguese in the corpus may be strongly affected by the geographical origin of the evaluators. However, given that the speech recognition process would be assisted by the transcription in the evaluation task, the perception of terminal and non-terminal breaks should in principle be appreciable even by a Brazilian speaker. Given that the C-ORAL-ROM resource would also be distributed in Brazil, a very relevant market for the Romance corpora in general, and Portuguese in particular, it was deemed acceptable to test the level of consensus of Brazilian speakers with respect to the tagging level of the resource. Finally, in cases where the evaluators were uncertain about the accurate recognition of speech breaks, they could discard the affected positions. The consensus was thus computed only on the number of evaluated positions. The results of the evaluation shown in Table A.1 indicate that the possible limits of recognition do not affect the comparability of the work, since the proportion of non-evaluated positions is insignificant, never exceeding 3.2% (the value obtained by the Brazilian evaluator).

Table A.1. Non-evaluated positions

                           French        Italian       Portuguese    Spanish
Total w. boundaries        12,893        10,925        12,958        11,512
Evaluated w. boundaries
  Evaluator 1              12,776        10,900        12,933        11,474
  Evaluator 2              12,831        10,892        12,534        11,512
Not evaluated positions
  Evaluator 1              117 (0.9%)    25 (0.2%)     25 (0.2%)     38 (0.3%)
  Evaluator 2              62 (0.5%)     33 (0.3%)     424 (3.2%)    0
Measurements and statistics
The evaluation data were gathered and statistically analysed in order to measure the degree of consensus expressed by the evaluators with the original annotation. Precision and Recall indexes were discarded, because they apply to cases where a sequence of values (tags) is compared with a reference sequence, taken as the correct one. This was not the case for the C-ORAL-ROM evaluation, where neither the original annotation nor the evaluators’ choices could be taken as the correct reference. Cases of disagreement, where one or both evaluators corrected the original annotation, were compared with the total number of word boundaries and, more perspicuously, with the number of positions which are reasonable candidates for a prosodic break. The agreement between evaluators was also measured. On the basis of the assumption that an annotation scheme is good if it can be applied with a high degree of agreement between two or more independent annotators, a replicability statistic was used to compare the three annotations made by C-ORAL-ROM and by the two evaluators. Following previous work on annotation coding for discourse and dialogue,5 the K-coefficient was calculated. The evaluation data were obtained through a number of steps, described in the following paragraphs.
Evaluation data
Word boundaries (W), which are candidate positions for prosodic breaks, were classified for the purpose of the evaluation into the categories shown in Table A.2. Each position in the evaluation file is classified with a tag expressing the degree of agreement of the evaluator with the original annotation. In Table A.3, the tags are ordered according to their increasing degree of disagreement.6
Table A.2. Break types

Tag    Semantics
O      no break
N      non-terminal break
T      terminal break

Table A.3. Agreement types

Tag    Semantics                                   Degree of disagreement
0O     agreement on non-break boundary             ok
0N     agreement on non-terminal break             ok
0T     agreement on terminal break                 ok
1i     non-terminal insertion                      non critical
1d     non-terminal deletion                       non critical
2ns    non-terminal break substitution (N→T)       critical
2ts    terminal break substitution (T→N)           critical
3i     terminal insertion                          very critical
3d     terminal deletion                           very critical
Computations on the evaluation files were performed according to the corpus structure. Cumulative computations were also performed horizontally (e.g. by adding up the measurements of all the “dialogue” nodes, or all the “monologue” nodes).
First step: Binary comparison file
Starting from an evaluation file, which reports both the original tagging in the C-ORAL-ROM corpus (C) and the evaluator’s choice (E), a first parser generated a comparison file (B) where each word boundary (candidate position for a break) is represented as a record containing information allowing a comparison of the original tagging and the evaluator’s choice, namely: (1) utterance number; (2) speaker’s name; (3) position (sequential number) of the boundary; (4) character position in the original line; (5) character position in the evaluator line; (6) break found in that position in the original text; (7) class (O, T, N) of the original boundary; (8) break found in that position in the evaluator text; (9) class (O, T, N) of the evaluator boundary; and (10) one tag for each possible agreement combination between original and evaluator’s version. An example of a comparison file B is shown below:

UTT   SPEAK   POS.   CHAR-R   CHAR-E   BREAK-R   VALUE-R   BREAK-E   VALUE-E   AGREE
1     -       0*     12       12       //        T         //        T         0T
2     -       0*     13       13       //        T         //        T         0T
3     -       0      5        5        -         O         -         O         0O
3     -       1      19       19       -         O         -         O         0O
...
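The AGREE field of each record follows directly from the pair of boundary classes and the tag inventory of Table A.3. The short Python sketch below shows one way this lookup could be written; the names AGREEMENT_TAG, SEVERITY, agreement_tag, severity and compare are hypothetical and do not reproduce the parser actually used in the project.

```python
# (original class, evaluator class) -> agreement tag, following Table A.3.
AGREEMENT_TAG = {
    ("O", "O"): "0O", ("N", "N"): "0N", ("T", "T"): "0T",
    ("O", "N"): "1i",   # non-terminal insertion
    ("N", "O"): "1d",   # non-terminal deletion
    ("N", "T"): "2ns",  # non-terminal break substitution (N -> T)
    ("T", "N"): "2ts",  # terminal break substitution (T -> N)
    ("O", "T"): "3i",   # terminal insertion
    ("T", "O"): "3d",   # terminal deletion
}

# The leading digit of a tag encodes the degree of disagreement.
SEVERITY = {"0": "ok", "1": "non critical", "2": "critical", "3": "very critical"}


def agreement_tag(original, evaluator):
    """Return the Table A.3 tag for one word boundary."""
    return AGREEMENT_TAG[(original, evaluator)]


def severity(tag):
    """Return the degree of disagreement associated with a tag."""
    return SEVERITY[tag[0]]


def compare(original_classes, evaluator_classes):
    """Tag every boundary of an utterance; both sequences must align."""
    if len(original_classes) != len(evaluator_classes):
        raise ValueError("sequences must cover the same word boundaries")
    return [agreement_tag(o, e) for o, e in zip(original_classes, evaluator_classes)]


if __name__ == "__main__":
    print(compare(["T", "T", "O", "O"], ["T", "T", "O", "O"]))
```

Because every word boundary is a candidate position in both the original line and the validation copy, the two sequences have the same length and can be compared position by position.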
Second step: Ternary comparison file
Starting from the two comparison files B-E1 and B-E2 for the two evaluators E1 and E2, a new file (T) was generated, with each word boundary represented as a record. This file allows a comparison of the original tagging with respect to the choices of both evaluators, and a comparison between the choices of each evaluator with the other. The file records the following information: (1) utterance number; (2) speaker’s name; (3) position (sequential number) of the word boundary; (4) class (O, T, N) of the boundary according to original line; (5) class (O, T, N) of the boundary according to the first evaluator; (6) class (O, T, N) of the boundary according to the second evaluator; (7) agreement tag between original line and first evaluator’s line; (8) agreement tag between original line and second evaluator’s line; and (9) agreement tag between first and second evaluator’s line. An example of a ternary comparison file T is shown in the following:

UTT   SPK   POS   REF   EV1   EV2   R-E1   R-E2   E1-E2
1     PAT   0     O     O     O     0O     0O     0O
1     PAT   1     O     O     O     0O     0O     0O
1     PAT   2     O     O     O     0O     0O     0O
1     PAT   3*    T     T     T     0T     0T     0T
2     ROS   0*    ?     ?     ?     ?      ?      ?
3     MIG   0     N     N     N     0N     0N     0N
3     MIG   1     O     O     O     0O     0O     0O
3     MIG   2*    T     T     T     0T     0T     0T
Measurements
Given a ternary comparison file (T), a new file (M) was generated reporting the measurements defined below that are relevant for producing the statistics.

1. General data
   a. Number of utterances;
   b. Number of word boundaries;
   c. Number of T breaks by C-ORAL-ROM;
   d. Number of N breaks by C-ORAL-ROM.

2. For each evaluator (binary comparison)
   a. Percentage of T breaks evaluated as Terminal or Non-terminal (Generic confirmation);
   b. Percentage of T breaks evaluated as Terminal (Specific confirmation);
   c. Percentage of N breaks evaluated as Terminal breaks (Added terminal);
   d. Percentage of O-breaks evaluated as Non-terminal (Non-terminal missing);
   e. Percentage of N insertions and deletions that may be considered N misplacements (N break misplacement rate);
   f. Percentage of T insertions and deletions that may be considered T misplacements (T break misplacement rate);
   g. Percentage of boundaries actually modified (Activity rate).

3. Percentage of consensus on terminal and non-terminal breaks (ternary comparison)
   i. Both evaluators disagree on terminal tags (Strong disagreement, with respect to terminal tags):
      a. percentage of T breaks evaluated as O;
      b. percentage of T breaks evaluated as N;
      c. percentage of O evaluated as T;
      d. percentage of N evaluated as T.
   ii. Both evaluators disagree on non-terminal tags (Strong disagreement, with respect to non-terminal tags):
      a. percentage of N breaks evaluated as T;
      b. percentage of N breaks evaluated as O;
      c. percentage of O evaluated as N;
      d. percentage of T evaluated as N.
   iii. Only one evaluator disagrees (Partial consensus):
      a. percentage of partially agreed T breaks, evaluated as O or N;
      b. percentage of partially agreed T breaks: if one of the two evaluators agreed with the original notation, keeping a T break in the evaluated position, but the other deleted the break, the percentage is calculated on T positions evaluated by both evaluators;
      c. percentage of partially agreed Ws: if one of the two evaluators agreed with the original notation, but the other did not agree with either of the two remaining forms, the percentage is calculated on the total positions evaluated by both evaluators.
   iv. Both evaluators agree with the proposed tag (Total agreement):
      a. percentage of T breaks evaluated as T;
      b. percentage of N breaks evaluated as N;
      c. percentage of O boundaries evaluated as O.
   v. Percentage of totally disconfirmed Ws and percentage of Ws disconfirmed by at least one evaluator (Global disagreement).7
   vi. Percentage of globally disagreed Ws that are actually cases of strong disagreement.8
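The three-way distinction used in the ternary comparison (total agreement, partial consensus, strong disagreement) can be illustrated with a small Python sketch. It is a simplification under stated assumptions: percentages are taken over all positions evaluated by both evaluators rather than over the specific bases (T positions, N positions) listed above, positions marked "?" are treated as not evaluated, and the function names are hypothetical.

```python
from collections import Counter

CATEGORIES = ("total agreement", "partial consensus", "strong disagreement")


def consensus_category(ref, ev1, ev2):
    """Classify one word boundary from the (REF, EV1, EV2) classes."""
    if ev1 == ref and ev2 == ref:
        return "total agreement"      # both evaluators confirm the original tag
    if ev1 == ref or ev2 == ref:
        return "partial consensus"    # only one evaluator disagrees
    return "strong disagreement"      # both evaluators disagree


def consensus_percentages(records):
    """Return the percentage of each category over the evaluated positions.

    `records` is an iterable of (REF, EV1, EV2) triples; triples containing
    "?" are assumed to be non-evaluated positions and are left out.
    """
    evaluated = [r for r in records if "?" not in r]
    if not evaluated:
        raise ValueError("no evaluated positions")
    counts = Counter(consensus_category(*r) for r in evaluated)
    return {c: 100.0 * counts[c] / len(evaluated) for c in CATEGORIES}


if __name__ == "__main__":
    demo = [("T", "T", "T"), ("T", "T", "N"), ("N", "O", "O"),
            ("O", "O", "O"), ("?", "?", "?")]
    print(consensus_percentages(demo))
```

Restricting the input to the records whose REF class is T (or N) before calling consensus_percentages would bring the figures closer to the per-category percentages defined in the list.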
Kappa coefficient
The Kappa coefficient measures the agreement among annotators in their coding of data with a finite set of categories. Kappa is defined as:

\[ K = \frac{P(A) - P(E)}{1 - P(E)} \]

where:
– P(A) is the proportion of times that the annotators actually agree;
– P(E) is the probability that annotators agree by chance.

Given the hypothesis that the annotators may have different category distributions, P(E) is given by:

\[ P(E) = \sum_{i=1}^{M} \prod_{j=1}^{N} Fr(A_j, c_i) \]

where Fr(A_j, c_i) is the frequency with which annotator A_j chooses category c_i (M = 3, N = 3). For our purposes, 3 annotators are considered: C-ORAL-ROM (C), evaluator 1 (E1) and evaluator 2 (E2), and two different Kappa coefficients are defined, namely:

K1: 3 graders (C, E1, E2), 3 categories (O, T, N)
K2: 3 graders (C, E1, E2), 2 categories (T, N)

For K1, we have considered as categories the three boundary tags: no break (O), non-terminal break (N) and terminal break (T). P(A) is calculated as the ratio between the number of total agreements among the evaluators, and the number of word boundaries evaluated by both E1 and E2. For K2 (a more realistic coefficient), we have restricted the set of categories to T and N, and consequently the set of evaluated boundaries to those annotated with T or N by at least one annotator. In this case, P(A) is calculated as the ratio between the number of total agreements on T or N and the sum of N and T breaks as resulting from C, plus the number of N or T insertions by E1 and E2, minus the sum of the agreements between evaluators on the insertion of N or T.9

The general measure of agreement, like the percentage of totally agreed Ws, was compared with a baseline corresponding to the worst possible “realistic” result: if, for example, all Ns and Ts were deleted and a comparable number of Ns and Ts were inserted in different positions, we still would have a number of agreed positions. Our percentage of agreement should be significantly higher than the baseline in order to provide a positive evaluation of the original corpus annotation.
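As an illustration of the definition of K1, the following Python sketch computes the coefficient for three aligned annotations. It assumes that P(A) counts the positions on which all annotators choose the same category and that P(E) is the sum, over categories, of the product of the annotators' relative frequencies for that category; the function name kappa and its parameters are hypothetical. It does not implement the restricted denominator described for K2.

```python
from collections import Counter
from math import prod


def kappa(annotations, categories=("O", "T", "N")):
    """Agreement coefficient for several aligned annotations.

    `annotations` is a list of equally long label sequences, one per
    annotator (for example C, E1 and E2).
    """
    n = len(annotations[0])
    if n == 0 or any(len(a) != n for a in annotations):
        raise ValueError("annotations must be non-empty and of equal length")

    # P(A): proportion of positions where every annotator gives the same label.
    p_a = sum(len(set(labels)) == 1 for labels in zip(*annotations)) / n

    # P(E): chance agreement, letting each annotator keep its own distribution.
    freqs = [Counter(a) for a in annotations]
    p_e = sum(prod(f[c] / n for f in freqs) for c in categories)

    # Undefined (division by zero) when every annotator uses a single,
    # identical category throughout, since P(E) is then 1.
    return (p_a - p_e) / (1 - p_e)


if __name__ == "__main__":
    c = ["O", "O", "O", "T"]
    e1 = ["O", "O", "O", "T"]
    e2 = ["O", "O", "T", "T"]
    print(round(kappa([c, e1, e2]), 3))
```

In the small example, the three annotators agree on three of the four positions, so P(A) = 0.75 and P(E) = 0.3125, which gives a coefficient of 7/11.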
Results
All the measurements were applied to each leaf node of the corpus structure, i.e. to each evaluated dialogue/monologue file. In Table A.4, only the results on total sub-corpora are extensively reported, while the results for each sub-node are detailed only for certain specific commentaries.10 The four language sub-corpora selected for the evaluation have sizes ranging from 10,925 (Italian) to 12,958 (Portuguese) word boundaries. The number of T-breaks ranges from 969 (French) to 1,483 (Portuguese), while N-breaks range from 1,462

Table A.4a. Summary: Evaluated Boundaries and Breaks

                           French        Italian       Portuguese    Spanish
Total w. boundaries        12,893        10,925        12,958        11,512
Evaluated w. boundaries
  Evaluator 1              12,776        10,900        12,933        11,474
  Evaluator 2              12,831        10,892        12,534        11,512
Not evaluated positions
  Evaluator 1              117 (0.9%)    25 (0.2%)     25 (0.2%)     38 (0.3%)
  Evaluator 2              62 (0.5%)     33 (0.3%)     424 (3.2%)    0
French
Italian
Portuguese
Spanish
96.12% 100%
98.8% 97.12%
98.7% 99.4%
94.1% 99%
100% 100%
99.9% 100%
100% 100%
99.8% 99.7%
0.01% 0.02%
0% 0%
0.03% 0.04%
0% 0.02%
1.59% 0.46%
1.05% 2.75%
0.62% 0.2%
1% 0.7%
2.95% 5.01%
0.2% 0.16%
0.5% 0.4%
1.57% 0.12%
2.22% 1.66%
2.07% 4.17%
1.01% 0.41%
1.96% 1.42%
0.19% 0.14%
5% 0.98%
0% 0%
1.19% 0.65%
Table A.4b. Summary: Binary Comparisons
Specific Confirmation of T Evaluator 1 Evaluator 2 Generic Confirmation of T Evaluator 1 Evaluator 2 Terminal Missing Evaluator 1 Evaluator 2 Non Terminal Missing Evaluator 1 Evaluator 2 Added Terminal N→T Evaluator 1 Evaluator 2 Activity Rate Evaluator 1 Evaluator 2 Misplacement Rate Evaluator 1 Evaluator 2
Massimo Moneglia et al.
Table A.4c. Summary: Ternary Comparisons French
Italian
Portuguese
Spanish
Strong Disagreement 3d on T Strong Disagreement 2ts on T Strong Disagreement 2ns on W Strong Disagreement 3i on W Strong Disagreement 1d on N Strong Disagreement 2ns on N Strong Disagreement 2ts on W Strong Disagreement 1i on N Partial Consensus on T 0T vs. 3d or 2ts Partial Consensus on T 0T vs. 3d Partial Consensus on W
0% 0% 0.1% 0.01% 0.3% 1.04% 0% 0.12% 4.36% 0% 3.18%
0% 0.48% 0% 0% 1.87% 0% 0.08% 0.75% 2.42% 0% 3.65%
0% 0.35% 0.01% 0.03% 0% 0.03% 0.06% 0.10% 1.51% 0% 0.93%
0% 0% 0.01% 0% 0.25% 0.05% 0% 0.25% 5.16% 0% 2.56%
Total Agreement on T Total Agreement on N Total Agreement on O Global Disagreement Consensus in Disagreement
95.05% 86.56% 97.54% 3.52% 8.97%
97.14% 93.15% 95.1% 4.78% 23.50%
98.12% 98.38% 99.22% 1.57% 12.12%
94.84% 94.62% 98.28% 2.83% 9.26%
K Index (General) K Index (Realistic)
0.952 0.776
0.928 0.807
0.980 0.920
0.946 0.827
Table A.5. Marked positions in the four language corpora
Word boundaries marked with a prosodic break
French
Italian
Portuguese
Spanish
19%
35%
32%
31%
(French) to 2,604 (Portuguese). The percentage of word boundaries that received a break tag in the C-ORAL-ROM annotation in the four corpora is shown in Table A.5. The difference in the marking of the French corpus is relevant, and is reflected by a higher mid-length of tone units and by a higher mid-length of utterances in the whole corpus.11 As reported before, the option of excluding text portions from evaluation, in cases of doubt or unintelligible speech, was seldom applied, and therefore this option does not compromise the statistical significance of the evaluation.12 The comparison of the Activity Rates of the different evaluators (i.e. the percentage of evaluated word boundaries that were actually modified) shows the lowest values for Portuguese, ranging from 0.5% and 1%, and the highest for Italian, where the rate for Evaluator 2 is above 4.5%. French and Spanish evaluators show rates around 2%. In short, excluding the most extreme judgements, activity rate is comparable in the majority of cases, confirming that the task was accomplished with sufficient care by the evaluators.
A.6.1 Binary Comparison
Looking at the Binary Comparison statistics on the evaluation data in Table A.4b, it is clear that the percentage of T-breaks that were not deleted by the evaluators is at or very close to 100% (with the single exception of the Formal section of the Spanish corpus, where it is around 98%). This means that where the original annotator perceived a terminal break, the evaluators perceived a break too, at least a non-terminal one (Generic Confirmation on T). Where the original annotator did not indicate the presence of a prosodic break, the evaluators' perception of a non-terminal break was also rare. The results for Non Terminal Missing show that these disagreements occurred in only about 1% of cases; the exceptions are the second Italian evaluator, with a percentage close to 3% (sometimes higher when data are separated), and the second Portuguese evaluator, with a percentage lower than 0.3% (Generic Confirmation on 0). Non-terminal breaks were confirmed in most cases; however, the lower perceptual relevance of non-terminals is suggested by the existence of both misplacement and non-terminal deletion, recorded by all evaluators, from around 0.5% to 7%.13 The percentage of Specific Confirmations on T, where the evaluators confirmed that the break was indeed a T-break, is in most cases above 95%. On the other hand, evaluators seldom perceived T-breaks where the original annotator did not perceive any kind of break, as shown by the Terminal Missing percentages, which are close to 0%. Given the above results, where the reliability of terminal annotation is concerned, the subsequent evaluation focuses on terminal substitution, when the evaluator replaces a terminal with a non-terminal, and on terminal adding, when an evaluator replaces a non-terminal with a terminal.
Terminal substitution involved around 5% of cases. Values for terminal adding are significant only in the case of the French corpus; for the other languages, the percentages are mostly below 1%. Indeed, in some cases the French evaluators perceived a stronger break where the annotator marked a non-terminal break; the percentage of Added Terminals, i.e. N-breaks substituted with T-breaks, ranges from 1.29% to 6.51% of the original N-breaks. The low percentages of break insertions and deletions would be lower still if one takes into account the fact that in some cases a <deletion, insertion> pair may count as a single misplacement. The misplacement rate is indeed zero for T-breaks, but it can reach an average of 5% for N-breaks (with one Italian evaluator reaching 9.9% in the Formal-Natural Context section). In conclusion, although the reliability of non-terminal breaks is not the primary objective of the evaluation, it must be noted that the insertions, deletions and misplacements with respect to non-terminal breaks are, as a whole, more numerous than the corresponding operations performed on terminal breaks.
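Some of the binary measures discussed above can be sketched in a few lines of Python. The sketch covers only the measures whose denominators are stated in this section (confirmations over the original T-breaks, Added Terminals over the original N-breaks, Activity Rate over the evaluated boundaries); the function name and the toy tag sequences are invented, and misplacement detection is left out because it requires pairing deletions with nearby insertions.

```python
def binary_comparison(original, evaluator):
    """Compare one evaluator's tags with the original ones, position by position.

    Both arguments are equal-length sequences of 'O', 'N' or 'T'.
    Returns percentages; the measure names follow Table A.4b.
    """
    pairs = list(zip(original, evaluator))
    # Evaluator tags found where the original annotation has T or N.
    on_original_t = [e for o, e in pairs if o == "T"]
    on_original_n = [e for o, e in pairs if o == "N"]

    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    return {
        # T-breaks confirmed as T
        "specific_confirmation_T": pct(sum(e == "T" for e in on_original_t),
                                       len(on_original_t)),
        # T-breaks not deleted (kept as T or changed to N)
        "generic_confirmation_T": pct(sum(e in ("T", "N") for e in on_original_t),
                                      len(on_original_t)),
        # N-breaks substituted with T-breaks
        "added_terminal": pct(sum(e == "T" for e in on_original_n),
                              len(on_original_n)),
        # evaluated boundaries that were actually modified
        "activity_rate": pct(sum(o != e for o, e in pairs), len(pairs)),
    }

# Invented toy data: ten word boundaries.
original = list("OOTONOOTON")
evaluator = list("OOTOTOOTNN".replace("TNN", "NON"))  # i.e. "OOTOTOONON"
stats = binary_comparison(original, evaluator)
print(stats)
```

In the toy data the evaluator turns one N into a T and one T into an N, so two of the ten boundaries count as modified.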
A.6.2 Ternary Comparisons
As for Ternary Comparisons, which give a measure of the inter-annotator agreement and of the reliability of the C-ORAL-ROM prosodic tagging, we can see in Table A.4c that the original annotation is basically confirmed, especially for terminal breaks: the percentages of T-breaks specifically confirmed by both evaluators are above 94% for all languages (Total Agreement on T). The agreement on N-breaks is, as expected, lower, with greater differences among the corpora. The values range from 75.01% in the Dialogues of French to 98.95% in the Monologues of Portuguese. Taking as a reference the whole set of evaluated word boundaries, the most general measure of agreement is the Total Agreement Rate (percentage of boundaries on which both evaluators agree), together with the complementary Global Disagreement Rate (percentage of boundaries disconfirmed by at least one evaluator). The highest consensus is expressed on the Portuguese corpus, but the values are very close for all languages, ranging from 95% to 98.9%, as seen in Table A.6. The percentage of totally agreed word boundaries may sound too optimistic as a general measure of agreement, due to the disproportion between word boundaries and actual candidates for a break (around 30% of the total, as reported above). The total agreement is however significant when compared with a baseline that may be considered the worst possible realistic result, obtained if all Ns and Ts were deleted and a comparable number of Ns and Ts were inserted in different positions. This is shown in Table A.7, where it can be seen that the total agreement percentages are significantly higher than the baseline.14 Strong disagreement is a very marginal phenomenon.
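One plausible reading of the baseline can be sketched as follows. If a share p of the word boundaries carries a break, deleting every break and inserting the same number elsewhere leaves about 2p of the boundaries in disagreement, so the agreement rate is roughly 1 − 2p. This reading is an assumption; the exact procedure behind Table A.7 is not given here. With the rounded shares of Table A.5 it nevertheless approximates the reported baselines.

```python
# Share of word boundaries marked with a break (Table A.5, rounded values).
MARKED_SHARE = {"French": 0.19, "Italian": 0.35, "Portuguese": 0.32, "Spanish": 0.31}

# Baselines reported in Table A.7, in percent.
REPORTED_BASELINE = {"French": 62.17, "Italian": 30.84,
                     "Portuguese": 37.28, "Spanish": 37.93}

def worst_case_agreement(marked_share):
    """ASSUMED baseline: every original break is deleted and the same number
    of breaks is inserted at other boundaries, so 2 * marked_share of the
    positions disagree and the rest still agree. Result in percent."""
    return 100.0 * (1.0 - 2.0 * marked_share)

for language, share in MARKED_SHARE.items():
    estimate = worst_case_agreement(share)
    print(f"{language:<11} estimated {estimate:5.1f}%  "
          f"reported {REPORTED_BASELINE[language]:5.2f}%")
```

The estimate also makes the remark in note 14 concrete: the fewer positions are marked, as in the French corpus, the higher the baseline.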
Tables A.8a and A.8b compare Global Disagreement Rate (cases where at least one evaluator disconfirmed the original annotation) with Partial Consensus Rate (cases where only one evaluator disconfirmed), with Strong Disagreement Rate being the difference obtained. Clearly, global disagreement is mainly determined by partial consensus. Only in a minority of cases where an evaluator disagreed with the annotator did the second evaluator confirm the modification. This occurs with an incidence which ranges from 9% for French up to almost 24% for Italian.

Table A.6. Total agreement

                            French     Italian    Portuguese  Spanish
Global Disagreement Rate    3.52%      4.78%      1.07%       2.83%
Total Agreement Rate        96.48%     95.21%     98.93%      97.17%
Total                       100.00%    100.00%    100.00%     100.00%

Table A.7. Baseline

                                   French    Italian   Portuguese  Spanish
Worst Possible Result (baseline)   62.17%    30.84%    37.28%      37.93%
Total Agreement Rate               96.48%    95.21%    98.93%      97.17%

Table A.8a. Strong disagreement on disagreed positions (French & Italian)

                       French                                 Italian
                       Rate on total   Rate on globally       Rate on total   Rate on globally
                       positions       disagreed positions    positions       disagreed positions
Global Disagreement    3.52%                                  4.78%
Partial Consensus      3.18%           90.34%                 3.65%           76.36%
Strong Disagreement    0.34%           9.66%                  1.13%           23.64%

Table A.8b. Strong disagreement on disagreed positions (Portuguese & Spanish)

                       Portuguese                             Spanish
                       Rate on total   Rate on globally       Rate on total   Rate on globally
                       positions       disagreed positions    positions       disagreed positions
Global Disagreement    1.07%                                  2.83%
Partial Consensus      0.93%           86.92%                 2.56%           90.46%
Strong Disagreement    0.14%           13.08%                 0.27%           9.54%
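The arithmetic behind Tables A.8a and A.8b can be retraced with a short Python sketch: Strong Disagreement is the difference between Global Disagreement and Partial Consensus, and the rates on globally disagreed positions are the two components divided by the global rate. The function and variable names are invented for the example; the input figures are those of the tables.

```python
# (Global Disagreement, Partial Consensus) as % of all evaluated positions,
# taken from Tables A.8a and A.8b.
RATES = {
    "French": (3.52, 3.18),
    "Italian": (4.78, 3.65),
    "Portuguese": (1.07, 0.93),
    "Spanish": (2.83, 2.56),
}

def disagreement_breakdown(global_rate, partial_rate):
    """Return (strong disagreement on total positions,
               partial consensus on disagreed positions,
               strong disagreement on disagreed positions), all in percent."""
    strong_rate = global_rate - partial_rate
    partial_share = 100.0 * partial_rate / global_rate
    strong_share = 100.0 * strong_rate / global_rate
    return round(strong_rate, 2), round(partial_share, 2), round(strong_share, 2)

for language, (global_rate, partial_rate) in RATES.items():
    print(language, disagreement_breakdown(global_rate, partial_rate))
```

The last value of each triple is the incidence quoted in the text, from about 9% for French to almost 24% for Italian.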
A.6.3 K-coefficient
K-coefficients measure the reliability of the annotation scheme, that is, the probability of obtaining the same annotation by different evaluators on the same corpus. K values range from 0 to 1, where K = 1 indicates total reliability of the annotation scheme. Such an ideal result is quite unrealistic, due to the intrinsically subjective nature of corpus annotation, so researchers generally regard any K value above 0.6 as a positive result. Two K coefficients were calculated: (1) a general one, comparing the behaviour of our three subjects, the original annotator and the two evaluators, with respect to the three categories of boundaries (T, N, O); and (2) a more realistic coefficient, limiting the analysis to the two break tags T and N, in order to avoid the positive effect of the high agreement rate on no-break boundaries. K coefficients were calculated on each evaluated dialogue/monologue considered as a single experiment. As shown in Table A.9, both coefficients are well above the 0.6 threshold, confirming the reliability of the C-ORAL-ROM annotation scheme. A finer analysis shows that the agreement rate is slightly lower in Formal than in Informal speech, and in Monologues vs. Dialogues.
Table A.9. Kappa Coefficient (realistic) per sub-nodes

              French    Italian   Portuguese  Spanish
Total         0.766     0.807     0.827       0.920
Formal        0.765     0.785     0.772       0.893
Informal      0.767     0.826     0.885       0.946
Dialogues     0.790     0.839     0.880       0.921
Monologues    0.675     0.779     0.818       0.944
A.7 Discussion
The prosodic tagging scheme that has been evaluated was applied to different speech modes in four Romance languages: Italian, Spanish, French and Portuguese. The relative proportion of the different speech styles in the different languages was essentially the same, and this was also reflected in the randomly selected subset of data to be evaluated. The evaluation procedure included inviting different labellers, i.e., native speakers of the four Romance languages in question, to one particular venue. This guaranteed that each evaluator would essentially follow the same training procedure within a comparable time frame. Moreover, all evaluators were working under identical listening conditions on identical machines. Finally, the evaluation took place at an independent site in Torino, involving people who were operating independently from the original initiators of the corpus.
This external evaluation was performed on prosodically tagged speech material that had already been evaluated internally by each of the language partners themselves. Each prosodically tagged utterance was thus basically the composite result of the work of 3 different labellers: after one had transcribed an utterance and had prosodically annotated it, the resulting prosodic tags were subsequently checked by 2 other people and revised wherever necessary. This practice is comparable to the one applied in Kiel by Klaus Kohler and colleagues for establishing a similar spoken database of German.
Regarding the different evaluation metrics, Kappa statistics are still among the most widely applied methods for measuring labelling consistency in general, and the consistency of prosodic annotation in particular. Therefore, the current results can in principle be compared with those reported in the literature. Yet, as noted by Rietveld and van Hout (2002), the use of Kappa to measure replicability is not entirely uncontroversial. For this reason too, a more realistic Kappa measurement was adopted, which limits the analysis to N and T cases, while not taking into account the large majority of cases where there is no prosodic tag at all. In addition, the analyses have been expressed in terms of percentage agreement as well, which in theory can give somewhat different results, including a comparison with a baseline measure.
A.7.1 General results
Regardless of the type of measure used, it is immediately clear that high levels of agreement are found between the original annotations and those of the evaluators. If the global results of the evaluation are considered, it can be observed that (i) the percentage values, with global disagreement rates always below 5% of the evaluated material in the four languages, and (ii) the Kappa statistics results, showing values always higher than 0.6 for all languages, in both ‘general’ and ‘realistic’ conditions, both indicate that the differences between the experts who annotated the corpus during the project and the non-experts who reviewed the annotation during the evaluation cannot be considered significant in any of the languages in the project. Some conclusions can be drawn from this fact:
1. The tagging of the C-ORAL-ROM corpus seems to show a high degree of coherence with the annotation principles established during the project, according to the judgements of the subjects selected for the evaluation.
2. As far as the ‘replicability’ of the coding scheme is concerned, it seems that subjects with no special background in phonetics or linguistics can be trained in its use in a very short period, apparently with good results. This may indicate that the annotation scheme is based on ‘actual’ prosodic features that can be perceived not only by experts in phonetics and linguistics, but also by naïve annotators. It also confirms the simplicity of the coding scheme, which new annotators can learn and apply after only a short training period.
3. It is important to note, however, that the ‘exact replicability’ of the scheme was not validated but rather evaluated, i.e. non-experts only had to check the speech material and not annotate it from scratch.
4. The minor disagreements observed between the judgments of the external evaluators and the C-ORAL-ROM annotation indicate a small degree of uncertainty in the task, more evident in the case of the Italian and French corpora, with 4.78% and 3.52% respectively in global disagreement. It is important to note, however, that these disagreements are mainly partial and not strong: this means that in most cases only one of the evaluators disagreed with the original annotation: 4.36% in the case of French, 2.42% in the case of Italian. This fact seems to indicate that the disagreements are more related to the perceptual basis of the procedure, which takes into account cues that are to some extent subjective, than to actual ‘errors’ of the original annotator.
A.7.2 Specific results
Some additional comments on the main conclusions above in A.7.1 can be inferred from the figures with regard to the binary comparison.
1. The positions marked with a terminal break in C-ORAL-ROM received a generic confirmation in almost all cases. This means that where the original annotator perceived a terminal break, the evaluators perceived a break too, at least a non-terminal one (Generic Confirmation). In other words, terminal breaks are strong linguistic cues and few doubts arise about the existence of a prosodic break when it is judged terminal. The absence of misplacement with respect to a terminal position also confirms this interpretation.
2. The difference with non-terminal breaks is evident. Even if non-terminal breaks were confirmed in most cases, their lower perceptual relevance is suggested by the existence of both misplacement and non-terminal deletion, recorded by all evaluators, from around 0.5% to 7%.
3. The quality of being terminal is highly prominent and easy to recognise. This is confirmed by the fact that in only 5% of cases the terminal quality detected in a break was not confirmed, and that terminal breaks were hardly ever detected by the evaluators in unmarked position. This confirms the accuracy of the original annotation, at least with respect to the terminal prosodic breaks that were marked. The substitution of a non-terminal break with a terminal break is a phenomenon recorded with some significance only in the French collection.
Therefore data from the binary comparison show that the reliability of the terminal tagging in C-ORAL-ROM is very high for all the possible modifications that may concern terminal breaks. The ternary comparison offers an even clearer result on the reliability of terminal breaks, given that both evaluators specifically confirmed terminal breaks in over 94% of cases for all languages, and it allows some additional comments.
1. Although the reliability of non-terminal breaks is not the primary objective of the evaluation, the operations performed by the evaluators on non-terminal breaks were, as a whole, many more than the corresponding operations on terminal breaks. Non-expert evaluators showed a great sensitivity towards prosodic parsing even in the weakest positions, thus strengthening the reliability of their low reaction to original terminal breaks.
2. It may be interesting to note that in general the K coefficient was slightly lower in Formal than in Informal speech, and in Monologues vs. Dialogues. This tendency is systematic and can be expected even with more significant figures. In face-to-face dialogues the linguistic units of reference are brief strings, each one matching a speech act, and always ending with a terminal break. Therefore, the judgement that each sequence is concluded is relevant at semantic level, action level and prosodic level. This may not frequently be the case when spoken language performances feature long textual strings, as in formal monologues. In this case, although the terminal break always ensures that what follows belongs to a different linguistic domain, the string may include various linguistic domains, gathered within the same prosodically concluded structure. Therefore the judgment about the terminal or non-terminal nature of the prosodic breaks may be less certain.
In conclusion, given the high scores on agreement, it is safe to say that the prosodically annotated data of the C-ORAL-ROM corpus are very trustworthy. They are therefore a reliable basis for future research. They can be used as a resource, for instance, for contrastive analyses of prosodic structures in the four comparable language corpora, to investigate stylistic effects on prosodic structure in various speech genres, and to gain further insight into the syntactic, semantic and pragmatic factors that determine why speakers opt for specific prosodic markings in certain discourse locations.
Notes
* Massimo Moneglia designed the evaluation strategy with the collaboration of the Project Advisors Juan Maria Garrido, Marc Swerts and Morena Danieli, and wrote the descriptive part of this Appendix. Marco Fabbri conducted the corpus sampling and formatted the evaluation material; Silvia Quazza, Andrea Panizza and Morena Danieli wrote the report by Loquendo that is partially reproduced in §A.5 and A.6. The discussion in A.7 collates commentaries on the evaluation results by Swerts, Garrido and Moneglia.
1. The prosodic annotation of breaks in the Dutch corpus also considered in-word breaks.
2. See Moneglia et al. (2002) for results of an internal evaluation made with experts.
3. The prosodic tagging conducted in the C-ORAL-ROM transcripts includes six types of terminal breaks and non-terminal breaks (see Chapter 1). The evaluators were trained to recover the above indexes; however, only the break (according to the distinction 0 vs. N vs. T) is evaluated, not the choice of the tag.
4. The sampling selection procedure, the criteria for the selection of evaluators, the training material, samples of the evaluation file, and a detailed evaluation procedure are accessible on the net at http://lablita.dit.unifi.it/coralrom/loquendo/
5. See Isard and Carletta (1995). Such statistics are useful for testing not only whether the annotators agreed upon the majority of the coding, but also to what extent that agreement is significantly different from that obtained by chance.
6. Text portions not understood by the evaluators were marked and excluded from the evaluation.
7. This percentage counts those positions where both evaluators disagreed with the original annotator, plus positions where one of the evaluators agreed with C-ORAL-ROM while the other did not. The percentage is calculated on the total number of positions evaluated by both evaluators.
8. This percentage counts, within the number of positions where at least one of the evaluators disagreed with the original annotator, how many times the other evaluator agreed with the correction; that is to say, how many times there is strong disagreement with respect to the number of times there is a disagreement of any kind.
9. In other words, to calculate the denominator of this ratio, the following operations must be done: (a) add the number of N and T breaks in the original annotation to the number of positions with no breaks in the original annotation, where evaluator 1 or evaluator 2 put an N or a T break; (b) count the number of empty positions in the original annotation where both evaluators put an N or a T break, and subtract this from the value obtained for (a); otherwise positions with an insertion by both evaluators would be counted twice.
10. Free consultation of all data at http://lablita.dit.unifi.it/coralrom/loquendo/
11. See the measurements reported in Chapter 1.
12. The only remark is that the second Portuguese evaluator had both a low activity rate and a high percentage of unevaluated positions. This means that in a certain number of cases s/he may have omitted to mark a disagreement in doubtful cases. The higher rate of agreement recorded with the Portuguese corpus may depend on this. However, even in the worst case, where all breaks involved in speech marked as unintelligible by this evaluator might have been questioned, the overall results would not change in substance.
13. See detailed data in http://lablita.dit.unifi.it/coralrom/loquendo/
14. The baseline of the French corpus is higher, given that the number of marked positions is lower.
Bibliography*
Aarts, B. (1992). Small Clauses in English: The Non-verbal Types. Berlin and New York: Mouton de Gruyter.
Aarts, J. (1991). “Intuition-based and observation-based grammars”. In K. Aijmer & B. Altenberg (Eds.), 44–62.
Aarts, J. & Meijs, W. (Eds.). (1984). Corpus Linguistics: Recent Developments in the Use of Computer Corpora in English Language Research. Amsterdam: Rodopi.
Aarts, J. & Meijs, W. (Eds.). (1986). Corpus Linguistics II: New Studies in the Analysis and Exploitation of Computer Corpora. Amsterdam: Rodopi.
Aarts, J. & Meijs, W. (Eds.). (1990). Theory and Practice in Corpus Linguistics. Amsterdam: Rodopi.
Aarts, J., de Haan, P., & Oostdijk, N. (Eds.). (1993). English Language Corpora: Design, Analysis and Exploitation. Amsterdam: Rodopi.
Abeillé, A. (2002). Une Grammaire électronique du Français. Paris: CNRS Editions.
Abeillé, A. (2003). Treebanks: Building and Using Parsed Corpora. Dordrecht: Kluwer Academic.
Abeillé, A. & Blache, P. (2000). “Grammaires et analyseurs syntaxiques”. In J.-M. Pierrel (Ed.), Ingénierie des Langues (pp. 51–76). Paris: Hermès.
Abeillé, A. & Godard, D. (2000). “French word order and lexical weight”. In R. Borsley (Ed.), The Nature and Function of Syntactic Categories (pp. 325–358). New York: Academic Press.
Abeillé, A. & Rambow, O. (Eds.). (2000). Tree Adjoining Grammars: Formalism, Linguistic Analysis and Processing. Stanford: CSLI.
Abeillé, A., Clément, L., Kinyon, A., & Toussenel, F. (2001). “The TALANA annotated corpus for French: Some experimental results”. In P. Rayson, A. Wilson, T. McEnery, A. Hardie, & S. Khoja (Eds.), Proceedings of Corpus Linguistics 2001 Conference (CL 2001). Lancaster: UCREL.
Acquaviva, P. (1991). “Frasi argomentali: Completive e soggettive”. In L. Renzi & G. Salvi (Eds.), 633–674.
Agard, F. B. (1984). A Course in Romance Linguistics. Vol. 1: A Synchronic View. Washington, DC: Georgetown University Press.
Aijmer, K. (2002). English Discourse Particles: Evidence from a Corpus. Amsterdam and Philadelphia: John Benjamins.
Aijmer, K. & Altenberg, B. (Eds.). (1991). English Corpus Linguistics: Studies in Honour of Jan Svartvik. London and New York: Longman.
Aijmer, K., Altenberg, B., & Johansson, M. (Eds.). (1996). Languages in Contrast: Papers from a Symposium on Text-based Cross-linguistic Studies. Lund: Lund University Press.
Alexandersson, J. & Reithinger, N. (1997). “Learning dialogue structure from a corpus”. In G. Kokkinakis, N. Fakotakis, & E. Dermatas (Eds.), Proceedings of the 5th European Conference on Speech Communication and Technology (Eurospeech 97) (pp. 2231–2235). Rhodes, Greece.
Alexandersson, J., Buschenbeck-Wolf, B., Fujinami, T., Maier, E., Reithinger, N., Schmitz, B., & Siegel, M. (1997). Dialogue Acts in Verbmobil-2. Technical Report 204, BMBF.
Alexejew, P. M., Kalinin, W. M., & Piotrowski, R. G. (1968). Sprachstatistik. Berlin: Akademie-Verlag.
Alfonzetti, G. (2002). Le Relative Non-standard: Italiano Popolare o Italiano Parlato? Palermo: Centro di studi filologici e linguistici siciliani, Dipartimento di scienze filologiche e linguistiche, Facoltà di lettere e filosofia.
Alisova, T. (1965). “Relative limitative e relative esplicative nell’italiano popolare”. Studi di Filologia Italiana, 23, 299–333.
Allen, J. & Core, M. (1997). “DAMSL: Dialog Act Markup in Several Layers”. Ms., Rochester, NY: Department of Computer Science, University of Rochester.
Allen, J., Ferguson, G., & Stent, A. (2001). “An architecture for more realistic conversational systems”. In Proceedings of Intelligent User Interface (IUI 01). Santa Fe, USA.
Allwood, J. (1998). “Some frequency based differences between spoken and written Swedish”. In T. Haukioja (Ed.), Proceedings of the 16th Scandinavian Conference of Linguistics, Turku (pp. 18–29). Turku: Department of Linguistics, University of Turku.
Allwood, J. (1999). “The Swedish spoken language corpus at Göteborg University”. In R. Andersson, Å. Abelin, J. Allwood, & P. Lindblad (Eds.), Proceedings of 12th Swedish Phonetics Conference (Fonetik 99) (pp. 5–9). Göteborg: Department of Linguistics, Göteborg University.
Allwood, J., Traum, D., & Jokinen, K. (2000). “Cooperation, Dialogue and Ethics”. International Journal of Human-Computer Studies, 53(6). Special Issue on Collaboration, Cooperation and Conflict in Dialogue Systems, 871–914.
Alonso-Cortés, A. (1999). “Las construcciones exclamativas: La interjección y las expresiones vocativas”. In V. Demonte & I. Bosque (Eds.), 3993–4050.
Amastae, J., Goodall, G., Montalbetti, M., & Phinney, M. (Eds.). (1995). Contemporary Research in Romance Linguistics: Papers from LSRL XXII. Amsterdam and Philadelphia: John Benjamins.
Ambrose, J. (1996). Bibliographie des études sur le français parlé. Paris: Didier érudition.
Anderson, A., Bader, M., Bard, E., Boyle, E., Doherty, G., Garrod, S., Isard, S., Kowtko, J., McAllister, J., Miller, J., Sotillo, C., Thompson, H., & Weinert, R. (1991). “The HCRC map task corpus”. Language and Speech, 34, 351–366.
Angelini, B., Brugnara, F., Falavigna, D., Giuliani, D., Gretter, R., & Omologo, M. (1994). “Speaker independent continuous speech recognition using an acoustic-phonetic Italian Corpus”. In Proceedings of the 3rd International Conference on Spoken Language Processing (ICSLP 94) (pp. 1391–1394). Yokohama, Japan.
Ashby, W. (1991). “When does variation indicate linguistic change in progress?”. Journal of French Language Studies, 1, 1–19.
Atkins, S., Clear, J., & Osler, N. (1992). “Corpus design criteria”. Literary and Linguistic Computing, 7(1), 1–16.
Attal, P. (1999). Questions de Grammaire. Lille: Presses du Septentrion.
Auer, P. & Di Luzio, A. (Eds.). (1992). The Contextualization of Language. Amsterdam and Philadelphia: John Benjamins.
Auer, P., Couper-Kuhlen, E., & Müller, F. (Eds.). (1999). Language in Time: The Rhythm and Tempo of Spoken Interaction. Cambridge: Cambridge University Press.
Aureli, M. (2003). “Pressione dell’uso sulla norma: Le relative non standard nei giudizi degli utenti”. Studi Italiani di Linguistica Teorica ed Applicata, 1, 45–67.
Aureli, M. (2004). “Le relative non-standard in alcuni corpora di italiano parlato (LIR, LIP, LABLITA, AVIP)”. In F. Albano Leoni, F. Cutugno, M. Pettorino, & R. Savy (Eds.), Atti del Convegno Nazionale “Il Parlato Italiano”, CD-ROM (pp. 1–22). Napoli: M. D’Auria.
Aureli, M. (forthcoming). “La subordinazione introdotta da che nel parlato di C-ORAL-ROM: L’italiano in rapporto ad altre lingue romanze”. In I. Korzen & H. Jansen (Eds.), Lingua, Cultura e Intercultura: L’Italiano e le altre Lingue. Atti dell’VIII Convegno della Società Internazionale di Linguistica e Filologia Italiana (SILFI 2004). København: Samfundslitteratur.
Austin, J. L. (1962). How to Do Things with Words. Oxford: Oxford University Press.
Avesani, C. (1996). “ToBI: Un sistema di trascrizione dell’intonazione per l’italiano”. In Atti delle V Giornate di Studio del Gruppo di Fonetica Sperimentale (V GFS) (pp. 85–98). Trento.
Bacelar do Nascimento, M. F. (1987). “Um corpus de língua falada”. In M. F. Bacelar do Nascimento et al. (Eds.), 29–75. Lisboa: INIC/CLUL.
Bacelar do Nascimento, M. F. (org.). (2000). Português Falado, Documentos Autênticos, Gravações Audio com Transcrição Alinhada. 4 CD-ROMs. Lisboa: Centro de Linguística da Universidade de Lisboa e Instituto Camões.
Bacelar do Nascimento, M. F. (Ed.). (2001a). Português Falado: Variedades Geográficas e Sociais. Lisboa: CLUL and Instituto Camões.
Bacelar do Nascimento, M. F. (2001b). “Les études portugaises sur la langue parlée”. In A. M. H. Araújo Carreira (Ed.), Les Langues Romanes en Dialogue(s), 11, Travaux et Documents (pp. 209–221). Vincennes-Saint-Denis: Université de Paris 8.
Bacelar do Nascimento, M. F., Garcia Marques, M. L., & Segura da Cruz, M. L. (1987). Português Fundamental, Métodos e Documentos, Tomo 1: Inquérito de Frequência. Lisboa: INIC/CLUL.
Bach, S. (1997). Capitoli per una Grammatica Contrastiva di Quattro Lingue Romanze: Morfosintassi del Verbo, le Preposizioni, le Congiunzioni. (Pré)publications. Århus: Romansk Institut.
Badiou, A. (1969). Le Concept de Modèle. Paris: Maspéro.
BADIP, http://languageserver.uni-graz.at/badip/
Baker, M., Francis, G., & Tognini-Bonelli, E. (Eds.). (1993). Text and Technology: In Honour of John Sinclair. Amsterdam and Philadelphia: John Benjamins.
Bal, W., Germain, J., Klein, J., & Swiggers, P. (1991). Bibliographie Sélective de Linguistique Romane et Française. Louvain: Duculot.
Bally, C. (1950). Linguistique Générale et Linguistique Française. Berne: Francke Verlag.
Barras, C., Geoffrois, E., Wu, Z., & Liberman, M. (1998). “Transcriber: A free tool for segmenting, labeling and transcribing speech”. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC 98) (pp. 1373–1376). Paris: ELRA.
Batoréo, H. (2000). Expressão do Espaço no Português Europeu. Contributo Psicolinguístico para o Estudo da Linguagem e Cognição. Lisboa: Fundação Calouste Gulbenkian e Fundação Para a Ciência e a Tecnologia.
Bazzanella, C. (1990). “Phatic connectives as interactional cues in contemporary spoken Italian”. Journal of Pragmatics, 14, 629–647.
Bazzanella, C. (1994). Le Facce del Parlare. Firenze: La Nuova Italia.
Bazzanella, C. (1995). “I segnali discorsivi”. In L. Renzi, G. Salvi, & A. Cardinaletti (Eds.), 225–257.
Bazzanella, C. (2001a). “Segnali discorsivi e contesto”. In W. Heinrich, C. Heiss, & M. Soffritti (Eds.), Modalità e Substandard (pp. 41–64). Bologna: CLUEB. Bazzanella, C. (2001b). “I segnali discorsivi tra parlato e scritto”. In M. Dardano, A. Pelo, & A. Stefinlongo (Eds.), Scritto e Parlato: Metodi, Testi e Contesti (pp. 79–97). Roma: Aracne. Bell, A. (1984). “Language style as audience design”. Language in Society, 13, 145–204. Benincà, P. (1993). “Sintassi”. In A. Sobrero (Ed.), Introduzione all’Italiano Contemporaneo, Vol. I: Le Strutture (pp. 247–290). Bari-Roma: Laterza. Bernardini, S. & Zanettin, F. (Eds.). (2000). I Corpora nella Didattica della Traduzione – Corpus Use and Learning to Translate. Bologna: CLUEB. Bernini, G. & Ramat, P. (1992). La Frase Negativa nelle Lingue d’Europa. Bologna: Il Mulino. Bernstein, B. (1977). Langage et Classes Sociales. Paris: Editions de Minuit. Berrendonner, A. (1990). “Pour une Macro-syntaxe”. Travaux de linguistique, 21, 25–36. Berrendonner, A. (2003). “Élements pour une Macro-syntaxe”. In A. Scarano (Ed.), Macrosyntaxe et Pragmatique. L’analyse Linguistique de l’ Oral (pp. 93–109). Roma: Bulzoni. Berretta, M. (1993). “Morfologia”. In A. Sobrero (Ed.), Introduzione all’Italiano Contemporaneo, Vol. I: Le Strutture (pp. 193–245). Bari-Roma: Laterza. Berretta, M. (1994). “Il parlato italiano contemporaneo”. In L. Serianni & P. Trifone (Eds.), Storia della Lingua Italiana, Vol. II: Scritto e Parlato (pp. 239–290). Torino: Einaudi. Berruto, G. (1983). “L’italiano popolare e la semplificazione linguistica”. Vox Romanica, 42, 38– 79. Berruto, G. (1987). Sociolinguistica dell’Italiano Contemporaneo. Roma: La Nuova Italia Scientifica. Berruto, G. (1990). “Italiano regionale, commutazione di codice ed enunciati mistilingue”. In M. A. Cortelazzo & A. M. Mioni (Eds.), L’italiano Regionale (pp. 105–130). Roma: Bulzoni. Bertinetto, P. M. (Ed.). (2001). AVIP: Archivio delle Varietà di Italiano Parlato. Pisa: Scuola Normale Superiore. 
ftp://ftp.cirass.unina.it/cirass/pub/avip/ Bianconi, S. (1980). Lingua Matrigna. Bologna: Il Mulino. Biber, D. (1985). “Investigating macroscopic textual variation through multi-feature/multidimensional analyses”. Linguistics, 23, 155–178. Biber, D. (1986a). “On the investigation of spoken/written differences”. Studia Linguistica, 40, 1–21. Biber, D. (1986b). “Spoken and written textual dimensions in English: Resolving the contradictory findings”. Language, 62, 384–414. Biber, D. (1988). Variation Across Speech and Writing. Cambridge: Cambridge University Press. Biber, D. (1989). “A typology of English texts”. Linguistics, 27, 3–43. Biber, D. (1990). “Methodological issues regarding corpus-based analyses of linguistic variation”. Literary and Linguistic Computing, 5, 257–269. Biber, D. (1993). “An analytical framework for register studies”. In D. Biber & E. Finegan (Eds.), 31–56. Biber, D. (1994). “Representativeness in corpus design”. Literary and Linguistic Computing, 8, 1–15. Biber, D. (1996). “Investigating language use through corpus-based analyses of association patterns”. International Journal of Corpus Linguistics, 1, 171–197. Biber D. (2000). “Corpus based analysis of grammar: Variability in the form and use of English complement clauses”. In M. Bilger (Ed.), Corpus, Méthodologie et Applications Linguistiques (pp. 224–237). Paris: Champion. Biber, D. & Finegan, E. (1991). “On the exploitation of computerized corpora in variation studies”. In K. Aijmer & R. Altenberg (Eds.), 204–220.
Biber, D., Conrad, S., & Reppen, R. (1994). “Corpus-based approaches to language issues in applied linguistics”. Applied Linguistics, 15, 169–189. Biber, D. & Finegan, E. (Eds.). (1994). Sociolinguistic Perspectives on Register. New York: Oxford University Press. Biber, D., Conrad, S., & Reppen, R. (1998). Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press. Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). The Longman Grammar of Spoken and Written English. London and New York: Longman. Biber, D. & Conrad, S. (2001). “Register variation: A corpus approach”. In D. Schiffrin, D. Tannen, & H. Hamilton (Eds.), The Handbook of Discourse Analysis (pp. 175–196). Oxford: Blackwell. Bigert, J., Knutsson, O., & Sjöbergh, J. (2003). “Automatic evaluation of robustness and degradation in tagging and parsing”. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2003) (pp. 51–57). Borovets, Bulgaria. Bilger, M. (1983). Analyse Distributionnelle de la Coordination par ET. Thèse de 3e cycle. Aix-en-Provence: Université de Provence. Bilger, M. (1984). “Et...Quoi de neuf?”. Recherches sur le Français Parlé, 6, 81–107. Bilger, M. (1985). “Pour une nouvelle analyse des coordinations dites par ‘gapping’”. Queste: études de Langue et de Littérature Françaises, 2, 175–191. Bilger, M. (1997a). “Pour une nouvelle approche des phénomènes de coordination”. In R. Lorenzo (Ed.), Actas do XIX Congreso Internacional de Lingüística e Filoloxía Románicas, Universidade de Santiago de Compostela, 1989 (pp. 925–932). A Coruña: Fundación “Pedro Barrié de la Maza, Conde de Fenosa”. Bilger, M. (1997b). “Corpus de portugais et d’espagnol”. Revue de l’Association Française de Linguistique Appliquée, 2, 27–38. Bilger, M. (1998). “Le statut micro et macro-syntaxique de ET”. In M. Bilger, F. Gadet, & K. 
van den Eynde (Eds.), Analyse Linguistique et Approches de l’Oral: Hommage à Claire Blanche-Benveniste (pp. 91–102). Louvain and Paris: Peeters. Bilger, M. (1999a). “Coordination: Analyses syntaxiques et annotations”. Recherches sur le Français Parlé, 15, 255–272. Bilger, M. (dir.). (1999b). Revue Française de Linguistique Appliquée IV(2). Special Issue on L’Oral Spontané. Bilger, M. (Ed.). (2001). “Cahiers de l’Université de Perpignan n° 31”. Special Issue on Linguistique sur corpus: Etudes et réflexions. Bilger, M., Blasco, M., Cappeau, P., Paillaud, Sabio, F., & Savelli, M.-J. (1997). “Transcription de l’oral et interprétation: Illustration de quelques difficultés”. Recherches sur le Français Parlé, 14, 57–86. Binazzi, N. (1997). Le Parole dei Giovani Fiorentini: Variazione Linguistica e Variazione Sociale. Roma: Bulzoni. Bindi, R., Monachini, M., & Orsolini, P. (1991). “Italian Reference Corpus: General Information and Key for Consultation”. Unpublished Report. Pisa: Istituto di Linguistica Computazionale, CNR. Blanche-Benveniste, C. (1980). “Divers types de relatives en français parlé”. T.A. Informations: Revue Internationale du Traitement Automatique du Langage, 21(2), 16–25. Blanche-Benveniste, C. (1982). “Verb complements and sentence complements: Two different types of relation”. Communication and Cognition, 15, 333–361. Blanche-Benveniste, C. (1983). “Examen de la notion de subordination”. Recherches sur le Français Parlé, 4, 71–115.
Blanche-Benveniste, C. (1984). “L’image de la norme linguistique propre aux textes écrits: Les différences entre cette norme et les normes de la langue quotidienne”. In Actes de les Primeres Jornades sobre Noves Perspectives sobre la Representaciô Escrita en el Nene (pp. 103–116). Barcelona: Publicacions de l’Institut Municipal d’Educaciô de l’Adjuntament de Barcelona. Blanche-Benveniste, C. (1985a). “État des enquêtes sur les langues romanes parlées”. In Actes du XVIIème Congrès international de Linguistique et Philologie Romanes. Vol. 7: Contacts de Langue; Discours Oral (pp. 291–292). Aix-en-Provence/Marseille: Laffitte. Blanche-Benveniste, C. (1985b). “La langue du dimanche”. Reflet, 14, 42–43. Blanche-Benveniste, C. (1985c). “Las regularidades sintacticas en el discurso del francés hablado: consideraciones linguisticas y sociolinguisticas”. In V. Lamiquiz (dir.), Sociolingûistica Andaluza, 3: El Discurso Sociolinguistico (pp. 19–30). Sevilla: Servicio de publicaciones de la Universidad de Sevilla. Blanche-Benveniste, C. (1986). “‘Une chose’ dans la syntaxe verbale”. Recherches sur le Français Parlé, 7, 141–168. Blanche-Benveniste, C. (1987). “Les études sur les langues parlées viennent-elles compliquer l’établissement d’une typologie?”. Travaux du Cercle Linguistique d’Aix-en-Provence 5: Typologie des Langues, 49–57. Blanche-Benveniste, C. (1988). “Quelques caractères de l’oralité”. Boletim de Filologia, 30, 87–95. Blanche-Benveniste, C. (1989). “Les outils de l’analyse syntaxique et les données du français parlé”. Le Trèfle, 11, 3–6. Blanche-Benveniste, C. (1990). “Usages normatifs et non normatifs dans les relatives en français, en espagnol et en portugais”. In J. Bechert, G. Bernini, & Cl. Buridant (Eds.), Toward a Typology of European Languages (pp. 317–335). Berlin and New York: Mouton de Gruyter. Blanche-Benveniste, C. (1991). “La difficulté à cerner les régionalismes en syntaxe”. In G.-L. 
Salmon (Ed.), Variété et Variantes du Français des Villes: États de l’Est de la France – Alsace-Lorraine – Lyonnais – Franche-Comté – Belgique (pp. 211–220). Paris/Genève: Champion/Slatkine. Blanche-Benveniste, C. (1992). “A propos des énoncés sans verbe: Les énoncés réponses”. Recherches sur le Français Parlé, 11, 57–85. Blanche-Benveniste, C. (1993a). “Les unités: Langue écrite, langue orale”. In Proceedings of the Workshop on Orality versus Literacy: Concepts, Methods and Data (pp. 139–194). Strasbourg: European Science Foundation. Blanche-Benveniste, C. (1993b). “The construct of oral and written language”. In L. Verhoeven, R. van ’t Rood, & C. van der Laan (Eds.), Attaining Functional Literacy: A Cross-cultural Perspective. From Literacy Research to Action Plans (pp. 11–13). The Hague: Netherland’s National Commission for Unesco / Tilburg University. Blanche-Benveniste, C. (1993c). “Une description linguistique du français parlé”. Le Gré des Langues, 5, 8–28. Blanche-Benveniste, C. (1993d). “Variations syntaxiques et situations codifiées”. In A. Queffélec, D. Latin, & J. Tabi-Manga (Eds.), Inventaire des Usages de la Francophonie: Nomenclatures et Méthodologies. Actualité Scientifique. Actes du Colloque de l’Association des Universités Partiellement ou Entièrement de Langue Française (AUPELF) (pp. 373–382). Paris and London: J. Libbey. Blanche-Benveniste, C. (1996a). “Corpus et études sur la langue parlée”. In M. F. Bacelar do Nascimento, M. C. Rodrigues, & J. Bettencourt Gonçalves (Eds.), Actas do XI Encontro Nacional da Associação Portuguesa de Linguística, Vol. I: Corpora (pp. 27–38). Lisboa: Colibri. Blanche-Benveniste, C. (1996b). “De l’utilité du corpus linguistique”. Revue Française de Linguistique Appliquée I(2): Corpus, de leur Constitution à leur Exploitation, 25–42.
Blanche-Benveniste, C. (1997a). Approches de la Langue Parlée en Français. Paris: Ophrys. Blanche-Benveniste, C. (1997b). “La notion de variation syntaxique dans la langue parlée”. In F. Gadet (Ed.), 19–29. Blanche-Benveniste, C. (1997c). “Quels outils pour la compréhension multilingue?”. In M. Slodzian & J. Souillot (Eds.), Comprehension Multilingue en Europe (pp. 121–123). Paris: Centre de Recherches en Ingénierie Multilingue, INaLCO. Blanche-Benveniste, C. (1997d). “Transcription et technologie”. Recherches sur le Français Parlé, 14, 87–99. Blanche-Benveniste, C. (2000). “Corpus de français parlé”. In M. Bilger (Ed.), 15–25. Blanche-Benveniste, C. (2001). “Nouveaux apports de la grammaire contrastive des langues romanes”. In I. Uzcanga Vivar, E. Llamas Pombo, & J. M. Pérez Velasco (Eds.), Presencia y Renovación de la Lingüística Francesa (pp. 41–54). Salamanca: Ediciones Universidad de Salamanca. Blanche-Benveniste, C. (2002). “Réflexions sur les transcriptions de corpus de français parlé”. Revue Parole, 22/23/24, 91–118. Blanche-Benveniste, C. (2003). “La naissance des syntagmes dans les hésitations et répétitions du parler”. In J.-L. Aroui (Ed.), Le sens et la Mesure: De la Pragmatique à la Métrique. Hommage à Benoît de Cornulier (pp. 153–169). Paris: Champion. Blanche-Benveniste, C., Deulofeu, J., Stéfanini, J., & Eynde van den, K. (1984). Pronom et Syntaxe: L’Approche Pronominale et son Application à la Langue Française. Paris: SELAF. Blanche-Benveniste, C. & Eynde Van den, K. (1987). Analyse Morphologique et Syntaxique des Formes QUI, QUE, QUOI. Preprint. Leuven: Département Linguïstiek, K.U. Leuven. Blanche-Benveniste, C. & Jeanjean, C. (1987). Le français Parlé: Transcription et édition. Paris: Didier Érudition. Blanche-Benveniste, C. & Temple, L. (1989). “Décrire le français parlé”. Le Français dans le Monde (n◦ spécial février – mars 1989): Recherches et Applications, 26–33. 
Blanche-Benveniste, C., Bilger, M., Rouget, Ch., Van den Eynde, K., & Mertens, P. (1990). Le Français Parlé: Études Grammaticales. Paris: Éditions du C.N.R.S. Blanche-Benveniste, C., Valli, A., Mota, M. A., Simone, R., Bovino, E., & Uzcanga Vivar, I. (coords.). (1997). EuRom: Méthode d’Enseignement Simultané des Langues Romanes. Firenze: La Nuova Italia. Blanche-Benveniste, C. & Valli, A. (1997). (coords.). Le français dans le monde (n◦ spécial Janvier 1997): L’Intercompréhension: Le Cas des Langues Romanes. Blanche-Benveniste, C., Rouget, C., & Sabio, F. (Eds.). (2002). Choix de Textes de Français Parlé. Paris: Champion. Bortolini, U., Tagliavini, C., & Zampolli, A. (1972). Lessico di Frequenza della Lingua Italiana Contemporanea. Milano: Garzanti. Bortolini, U. & Pizzuto, E. (1997). (Eds.). Il Progetto CHILDES Italia: Contributi di Ricerca sulla Lingua Italiana. Pisa: Del Cerro. Bosque, I. (1980). Sobre la Negación. Madrid: Cátedra. Bosque, I. (1990). Las Categorías Gramaticales. Madrid: Síntesis. Bosque, I. & Demonte, V. (Eds.). (1999). Gramática Descriptiva de la Lengua Española. Madrid: Espasa. Bourciez, E. (1967). éleménts de Linguistique Romane. Paris: Klincksieck. Boves, L. & Oostdijk, N. (2003). “Spontaneous speech in the spoken Dutch Corpus”. Proceedings ISCA and IEEE Workshop on Spontaneous Speech Processing and Recognition (SSPR). Tokyo, Japan.
Boyer, H. (Ed.). (1997). Langue Française, 114. Special Issue on Les Mots des Jeunes: Observations et Hypothèses. Brandao, S. F. & Mota, M. A. (orgs.). (2000). Análise Contrastiva de Variedades do Português. Projecto de Pesquisa Luso-Brasileiro em curso. Rio de Janeiro: UFRJ. Brandao, S. F. & Mota, M. A. (orgs.). (2003). Análise Contrastiva de Variedades do Português: Primeiros Estudos. Rio de Janeiro: In-Fólio. Brants, T. (2000). “TnT: A statistical part-of-speech tagger”. In Proceedings of the 6th Conference on Applied Natural Language Processing (ANLP 2000) (pp. 224–231). Seattle, USA. Brill, E. (1992). “A simple rule-based part of speech tagger”. In Proceedings of the DARPA Speech and Natural Language Workshop (pp. 112–116). San Mateo: Morgan Kaufmann. Brill, E. (1993). A Corpus-based Approach to Language Learning. PhD thesis: University of Pennsylvania, Department CIS. Brill, E. (1994). “Some advances in transformation-based part of speech tagging”. In Proceedings of the 12th National Conference on Artificial Intelligence (AAAI 94) (pp. 722–727). Menlo Park: AAAI Press. Brunet, É. (dir.). (1986). Méthodes Quantitatives et Informatiques dans l’étude des Textes: En Hommage à Ch. Muller. Genève-Paris: Slatkine-Champion. Buhmann, J., Caspers, J., van Heuven, V., Hoekstra, H., Martens, J.-P., & Swerts, M. (2002). “Annotation of prominent words, prosodic boundaries and segmental lengthening by non-expert transcribers in the spoken Dutch corpus”. In M. C. Rodriguez & C. Suarez Araujo (Eds.), Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC 2002) (pp. 779–785). Paris: ELRA. Burzio, L. (1986). Italian Syntax. A Government and Binding Approach. Dordrecht: Reidel. Byron, D. K. & Heeman, P. A. (1997). “Discourse markers use in task-oriented spoken dialog”. In Proceedings of the 5th European Conference on Speech Communication and Technology (Eurospeech 97) (pp. 2223–2226). Rhodes, Greece. Calzolari, N., Ceccotti, M. 
L., & Roventini, A. (1983). Documentazione sui tre nastri contenenti il DMI. Technical Report. Pisa: ILC-CNR. Calzolari, N. & Monachini, M. (1996). Synopsis and Comparison of Morphosyntactic Phenomena Encoded in Lexicon and Corpora. EAGLES document EAG CLWG MORPHSYN/R. Camacho, J. (2000). “Structural restrictions on comitative coordination”. Linguistic Inquiry, 31, 366–375. Cappeau, P. (1997). “Données erronées: Quelles erreurs commettent les transcripteurs?”. Recherches sur le Français Parlé, 14, 117–126. Cappeau, P. (2001). “Faits de syntaxe et genres à l’oral”. Le Français dans le Monde, n◦ spécial: Oral: Variabilité et Apprentissage, 69–77. Cappeau, P. & Bilger, M. (1995). “J’ai une douleur dans la cuisse mais pas là: Analyse d’un cas de contraste”. Recherches sur le Français Parlé, 13, 33–43. Caputo, M. R. (1996). Le Domande in un Corpus di Parlato Spontaneo. Analisi Prosodica e Pragmatica. PhD thesis: Università Federico II di Napoli. Carletta, J., Isard, A., Isard, S., Kowtko, J., Doherty-Sneddon, G., & Anderson, A. (1997). “The reliability of a dialogue structure coding scheme”. Computational Linguistics, 23(1), 13–31. Carter, R. & McCarthy, M. (1995). “Grammar and the spoken language”. Applied Linguistics, 16, 141–158. Castellani, A. (1962). “Proposte ortografiche”. Studi Linguistici Italiani, 3, 188–191. Castellani, A. (1965). “Sulla formazione del tipo fonetico italiano: Fenomeni consonantici”. Studi Linguistici Italiani, 2, 24–45; 5(1), 89–96. Castellani, A. (1982). “Italiano e toscano”. Studi Linguistici Italiani, 8 [n. s. 1] (1), 105–106.
Čermák, F. (2002). “Today’s corpus linguistics: Some open questions”. International Journal of Corpus Linguistics, 7(2), 265–282. Chambers, J. (1995). Sociolinguistic Theory: Linguistic Variation and its Social Significance. Oxford: Blackwell. CHAT, http://childes.psy.cmu.edu/manuals/CHAT.pdf Cheshire, J. (1997). “Involvement in ‘standard’ and ‘non-standard’ English”. In J. Cheshire & D. Stein (Eds.), Taming the Vernacular: From Dialect to Written Standard Language (pp. 68–82). London and New York: Longman. Church, K., Gale, W., Hanks, P., & Hindle, D. (1991). “Using statistics in lexical analysis”. In U. Zernik (Ed.), Lexical Acquisition (pp. 115–165). Englewood Cliffs, NJ: Lawrence Erlbaum. Cinque, G. (1988). “La frase relativa”. In L. Renzi (Ed.), 443–503. Clear, J. (1992). “Corpus sampling”. In G. Leitner (Ed.), 21–31. CLIPS, Corpora e Lessici di Italiano Parlato e Scritto, F. Albano Leoni (coord.). http://www.cirass.unina.it Cohen, J. A. (1960). “A coefficient of agreement for nominal scales”. Educational and Psychological Measurement, 20, 37–46. Conein, B. (1992). “Hétérogéneité et hétérogéneité linguistique”. Langages, 108, 101–113. Contini, M., Lai, J.-P., Romano, A., Roullet, S., de Castro Moutinho, L., Coimbra, R. L., Pereira Bendiha, U., & Secca Ruivo, S. (2002). “Un projet d’atlas multimédia prosodique de l’espace Roman”. In B. Bel & I. Marlien (Eds.), Proceedings of the Speech Prosody 2002 Conference (pp. 227–230). Aix-en-Provence: Laboratoire Parole et Langage. Contreras, H. (1962–1963). “Una clasificación morfo-sintáctica de las lenguas románicas”. Romance Philology, 16, 261–268. Coppieters, R. (1997). “Quelques réflexions sur la question des données: Corpus et intuitions”. Recherches sur le Français Parlé, 14, 21–41. Couper-Kuhlen, E. (1996). “Intonation and clause combining in discourse: The case of because”. Pragmatics, 6, 398–426. Couper-Kuhlen, E. & Selting, M. (Eds.). (1996). Prosody in Conversation: Interactional Studies. 
Cambridge: Cambridge University Press. Coupland, N. (1988). Dialect in Use: Sociolinguistic Variation in Cardiff English. Cardiff: University of Wales Press. Coveney, A. (1997). “L’approche variationniste et la description de la grammaire du français: Le cas des interrogatives”. Langue Française, 115, 88–100. Coveney, A. (2002). Variability in Spoken French: A Sociolinguistic Study for Interrogation and Negation. Exeter: Elm Bank. Cresti, E. (1987). “L’articolazione dell’informazione nel parlato”. In AA.VV. Gli Italiani Parlati: Sondaggi sopra la Lingua d’oggi (pp. 27–90). Firenze: Accademia della Crusca. Cresti, E. (1994). “Information and intonational patterning in Italian”. In B. Ferguson, H. Gezundhajt, & P. Martin (Eds.), Accent, Intonation, et Modéles Phonologiques (pp. 99–140). Toronto: Editions Mélodie. Cresti, E. (1998). “Gli enunciati nominali”. In M. T. Navarro (Ed.), Italica Matritensia. Atti del IV Convegno Internazionale della Società Internazionale di Linguistica e Filologia Italiana (SILFI 96) (pp. 171–191). Firenze: Cesati. Cresti, E. (2000). Corpus di Italiano Parlato, Voll. I–II, CD-ROM. Firenze: Accademia della Crusca.
Cresti, E. (forthcoming a). “Caratteri e frequenza degli avverbi negativi secondo le loro funzioni informative: Dati da un campione di parlato spontaneo (LABLITA)”. In P. D’Achille (Ed.), Generi, Architetture e Forme Testuali. Atti del VII Convegno Internazionale della Società Internazionale di Linguistica e Filologia Italiana (SILFI 2002). Roma. Cresti, E. (forthcoming b). “La testualità parlata: Alcuni dati dal corpus italiano di C-ORAL-ROM nella prospettiva del parlato romanzo”. In I. Korzen & H. Jansen (Eds.), Lingua, Cultura e Intercultura: L’Italiano e le Altre Lingue. Atti dell’VIII Convegno della Società Internazionale di Linguistica e Filologia Italiana (SILFI 2004). København: Samfundslitteratur. Cresti, E. & Firenzuoli, V. (1999). “Illocution et profils intonatifs de l’italien”. Revue Française de Linguistique Appliquée, IV(2), 77–98. Cresti, E. & Firenzuoli, V. (2002). “L’articolazione informativa topic-comment e comment-appendice: Correlati intonativi”. In A. Regnicoli (Ed.), La Fonetica Acustica come Strumento di Analisi della Variazione Linguistica in Italia. Atti delle XII Giornate del Gruppo di Fonetica Sperimentale (XII GFS) (pp. 153–160). Roma: Il Calamo. Cresti, E., Moneglia, M., Bacelar, F., Sandoval, A. M., Veronis, J., Martin, P., Choucri, K., Mapelli, V., Falavigna, D., & Cid, A. (2002). “The C-ORAL-ROM project: New methods for spoken language archives in a multilingual romance corpus”. In M. C. Rodriguez & C. Suarez Araujo (Eds.), Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC 2002) (pp. 2–10). Paris: ELRA. Cresti, E. & Gramigni, P. (2004). “Per una linguistica corpus based dell’italiano parlato: Le unità di riferimento”. In F. Albano Leoni, F. Cutugno, M. Pettorino, & R. Savy (Eds.), Atti del Convegno Nazionale “Il Parlato Italiano”, CD-ROM (pp. 1–26). Napoli: M. D’Auria. Crystal, D. (1975). The English Tone of Voice. London: Edward Arnold. Crystal, D. (1980). 
“Neglected grammatical factors in conversational English”. In S. Greenbaum, G. Leech, & J. Svartvik (Eds.), Studies in English Linguistics for Randolph Quirk (pp. 153– 166). London: Longman. Cutting, D., Kupiec, J., Pedersen, J., & Sibun, P. (1992). “A practical part-of-speech tagger”. In Proceedings of the 3rd Conference on Applied Natural Language Processing (ANLP 92) (pp. 133–140). Daelemans, W., Zavrel, J., Berck, P., & Gillis, S. (1996). “MBT: A memory-based part of speech tagger-generator.” In E. Ejerhed & I. Dagan (Eds.), Proceedings of the 4th Workshop on Very Large Corpora (pp. 14–27). Somerset, NJ: Association for Computational Linguistics. D’Agostino, E. (2001). Le Forme Lessicali del Parlare: Analisi Quantitativa e Qualitativa del Parlato Italiano. Napoli: Editoriale scientifica. De Mauro, T., Mancini, F., Vedovelli, M., & Voghera, M. (1993). Lessico di Frequenza dell’Italiano Parlato. Milano: ETAS. Delais-Roussarie, E. & Durand, J. (Eds.). (2003). Corpus et Variation en Phonologie du Français. Toulouse: Presses Universitaires du Mirail. Delmonte, R. (1997). “Rappresentazioni lessicali e linguistica computazionale”. In T. De Mauro & V. Lo Cascio (Eds.), Lessico e Grammatica: Teorie Linguistiche e Applicazioni Lessicografiche. Atti del Convegno Interannuale della Società di Linguistica italiana (SLI) (pp. 431–462). Roma: Bulzoni. Delmonte, R. (2002). “L’annotazione morfosintattica del Corpus AVIP/API”. In C. Crocco, R. Savy, & F. Cutugno (Eds.), API: Archivio del Parlato italiano, DVD. Napoli: Università degli Studi “Federico II”. Deulofeu, J. (1986). “Syntaxe de que en français parlé et le problème de la subordination”. Recherches sur le Français Parlé, 8, 79–104.
Deulofeu, J. (1999). “Questions de méthode dans la description morphosyntaxique de l’élément que en français contemporain”. Recherches sur le Français Parlé, 15, 163–198. Devoto, G. & Giacomelli, G. (1972). I Dialetti delle Regioni d’Italia. Firenze: Sansoni. Druetta, R. (1996). “Dix années de recherches contrastives (1984–1994)”. Franco-Italia, 9, 11–66. Dutch Corpus, http://lands.let.kun.nl/cgn/doc_English/topics/project/pro_info.htm Eagles, http://www.ilc.cnr.it/EAGLES/home.html EAGLES (1996). Recommendations for the Morphosyntactic Annotation of Corpora. Technical report, Expert Advisory Group on Language Engineering Standards, EAGLES Document EAG – TCWG – MAC/R. http://www.ilc.cnr.it/EAGLES96/browse.html Eeg-Olofsson, M. (1991). “Probabilistic word-class tagging of a corpus of spoken English”. In M. Eeg-Olofsson (Ed.), Word-Class Tagging: Some Computational Tools (pp. 1–99). Göteborg: Department of Computational Linguistics, University of Goteborg. Eguren, L. (1999). “Pronombres y adverbios demostrativos: Las relaciones deícticas”. In I. Bosque & V. Demonte (Eds.), 929–972. Elcock, W. D. (1960). The Romance Languages. London: Faber-Faber. Elsness, J. (1981). “On the syntactic and semantic functions of that-clauses”. In S. Johansson & B. Tysdahl (Eds.), Papers from the 1st Nordic Conference for English Studies (pp. 281–303). Oslo: Department of English, Oslo University. Engwall, G. (1984). Vocabulaire du Roman Français (1962–1968): Dictionnaire des Fréquences. Stockholm: Almqvist Wiksell. Erman, B. (1986). “Some pragmatic expressions in English conversation”. In G. Tottie & I. Backlund (Eds.), English in Speech and Writing: A Symposium (pp. 131–147). Stockholm: Almqvist & Wiksell. Fava, E. (1991). “Interrogative indirette”. In L. Renzi & G. Salvi (Eds.), 675–720. Fava, E. (1995). “Tipi di atti e tipi di frase”. In L. Renzi, G. Salvi, & A. Cardinaletti (Eds.), 19–48. Fernández Leborans, Ma J. (1999). “El nombre propio”. In I. Bosque & V. Demonte (Eds.), 77– 128. 
Fernández Soriano, O. (1999). “El pronombre personal: Formas y distribuciones. Pronombres átonos y tónicos”. In I. Bosque & V. Demonte (Eds.), 1209–1274. Fernández, R. & Ginzburg, J. (2002). “Non-Sentential utterances: A corpus study”. Traitement Automatique des Langues, 43(2), 13–42. Finegan, E. & Biber, D. (2001). “Register variation and social dialect variation: Re-examining the connection”. In P. Eckert & R. Rickford (Eds.), Style and Sociolinguistic Variation (pp. 235–267). Cambridge: Cambridge University Press. Fiorentino, G. (1999). Relativa Debole. Milano: Franco Angeli. Firenzuoli, V. (2000). “Nuovi dati statistici sull’italiano parlato”. Romanische Forschungen, 13, 213–225. Firenzuoli, V. (2003). Le Forme Intonative di Valore Illocutivo dell’Italiano Parlato: Analisi Sperimentale di un Corpus di Parlato Spontaneo (LABLITA). PhD thesis: Università degli Studi di Firenze. Firenzuoli, V. & Signorini, S. (2003). “L’unità informativa di Topic: Correlati intonativi”. In G. Marotta & N. Nocchi (Eds.), La Coarticolazione. Atti delle XIII Giornate di Studio del Gruppo di Fonetica Sperimentale (XIII GFS) (pp. 177–184). Pisa: ETS. Fletcher, P. & Carman, M. (1995). “Transcription, segmentation and analysis: Corpora for the language impaired”. In Leech et al. (Eds.), 116–127. Forget, D., Hirschbühler, P., Martineau, F., & Rivero, M. L. (Eds.). (1997). Negation and Polarity: Syntax and Semantics. Selected Papers from the Colloquium “Negation: Syntax and Semantics”. Amsterdam and Philadelphia: John Benjamins.
Fox, B. A. & Thompson, S. A. (1990). “A discourse explanation of the grammar of relative clauses in English conversation”. Language, 66, 297–316. François, F. (1990). La communication inégale. Heurs et malheurs de l’interaction verbale. Neuchâtel: Delachaux et Niestlé. Fraser, B. (1990). “An approach to discourse markers”. Journal of Pragmatics, 14, 383–395. Fries, U., Tottie, G., & Schneider, P. (Eds.). (1994). Creating and Using English Language Corpora. Papers from the 14th International Conference on English Language Research on Computerized Corpora (ICAME 14). Amsterdam: Rodopi. Fuchs, C. (Ed.). (1987). Langages, 88. Special Issue on Les Types de Relative. Furui, S., Maekawa, K., & Isahara, H. (2000). “A Japanese national project on spontaneous speech corpus and processing technology”. In Proceedings of Automatic Speech Recognition (ASR 2000): Challenges for the new Millenium (pp. 244–248). Paris. Gadet, F. (1996a). “Niveaux de langue et variation intrinsèque”. Palimpsestes, 10, 17–40. Gadet, F. (1996b). “Variabilité, variation, variété”. Journal of French Language Studies, 1, 75–98. Gadet, F. (Ed.). (1997). Langue Française, 115. Special Issue on La Variation en Syntaxe. Gadet, F. (2000). “Vers une sociolinguistique des locuteurs”. Sociolinguistica, 14, 99–103. Gadet, F. (2003). La Variation Sociale en Français. Paris: Ophrys. Garside, R. (1987). “The CLAWS word tagging system”. In R. Garside, G. Leech, & G. Sampson (Eds.), The Computational Analysis of English: A Corpus-based Approach (pp. 30–41). London and New York: Longman. Garside, R. (1995). “Grammatical tagging of the spoken part of the British National Corpus: A progress report”. In G. Leech et al. (Eds.), 161–167. Garside, R., Leech, G., & McEnery, T. (1997). Corpus Annotation: Linguistic Information from Computer Text Corpora. London and New York: Longman. Gavioli, L., & Mansfield, G. (1990). The PIXI Corpora: Bookshop Encounters in English and Italian. Bologna: CLUEB. GENELEX Consortium (1993). 
Couche Morphologique. Version 3.0, Technical report. Paris: GsiErli. Giacomelli, G. & Poggi Salani, T. (1984–1985). “Parole toscane”. Quaderni dell’Atlante Lessicale Toscano, 2–3, 123–229. Giannelli, L. (Ed.). (1994). Una Teoria e un Modello per l’Analisi dell’Italiano Substandard. Padova: Unipress. Gibbon, D., Moore, R., & Winski, R. (Eds.). (1997). EAGLES: Handbook of Standard and Resource for Spoken Language Systems. Berlin and New York: Mouton de Gruyter. Giora, R., Meiran, N., & Oref, P. (1996). “Identification of written discourse topics by structure coherence and analogy strategies: General aspects and individual differences”. Journal of Pragmatics, 26, 455–474. Giordano, R. & Voghera, M. (2002). “Verb system and verb usage in spoken and written Italian”. In A. Morin & P. Sébillot (Eds.), Proceedings of 6th International Conference on the Statistical Analysis of Textual Data (JADT 2002) (pp. 289–299). IRISA/INRIA: Université de Rennes. Givón, T. (Ed.). (1979). Discourse and Syntax. New York: Academic Press. Godard, D. (dir.). (2003). Les Langues Romanes. Paris: CNRS Editions. Godfrey, J. J., Holliman, E. C., & McDaniel, J. (1992). “SWITCHBOARD: Telephone speech corpus for research and development”. In Proceedings IEEE Conference on Acoustics, Speech and Signal Processing (pp. 517–520). San Francisco, USA. Goñi, J. M., González, J. C., & Moreno, A. (1997). “ARIES: A lexical platform for engineering Spanish processing tools”. Natural Language Engineering, 3(4), 317–345.
Goodwin, C. (1991). Conversational Organization: Interaction between Speakers and Hearers. New York: Academic Press. Gougenheim, G., Michéa, R., Sauvageot, A., & Rivenc, P. (1964). L’Elaboration du Français Fondamental. Paris: Didier. Gramigni, P. (2003). “Les corpora de LABLITA: Une analyse comparative”. In A. Scarano (Ed.), Macro-syntaxe et Pragmatique: L’Analyse Linguistique de l’Oral (pp. 329–358). Roma: Bulzoni. Greenbaum, S. (1976). “Current usage and the experimenter”. American Speech, 51, 161–175. Greenbaum, S. (1992). “A new corpus of English: ICE”. In J. Svartvik (Ed.), 171–179. Greenbaum, S. (Ed.). (1996a). Comparing English Worldwide: The International Corpus of English. Oxford: Clarendon Press. Greenbaum, S. (1996b). “Introducing ICE”. In S. Greenbaum (Ed.), 3–12. Grice, M., Reyelt, M., Benzmüller, R., Mayer, J., & Batliner, A. (1996). “Consistency in transcription and labelling of German intonation with GToBI”. In H. T. Bunnell & W. Idsardi (Eds.), Proceedings of the 4th International Conference on Spoken Language Processing (ICSLP 96) (pp. 1716–1719). Philadelphia, USA. Grimshaw, A. (1990). Conflict Talk: Sociolinguistic Investigations of Arguments in Conversations. Cambridge: Cambridge University Press. Gross, G. & Piot, M. (Eds.). (1988). Langue Française 77: Syntaxe des Connecteurs. Gross, M. (1981). “Les bases empiriques de la notion de prédicat sémantique”. Langages, 63, 7–52. Guenier, N. (2001). “Le français ‘de référence’: Approche sociolinguistique”. Cahiers de l’Institut Linguistique de Louvain, 27, 9–33. Guirao, J. M. & Moreno-Sandoval, A. (2004). “A ‘toolbox’ for tagging the Spanish C-ORAL-ROM corpus”. In M. T. Lino, M. F. Xavier, F. Ferreira, R. Costa, & R. Silva (Eds.), Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004) – Workshop “Compiling and Processing Spoken Language Corpora” (pp. 28–32). Paris: ELRA. Habert, B., Nazarenko, A., & Salem, A. (1997). Les Linguistiques de Corpus. 
Paris: Armand Colin. Halliday, M. A. K. (1966). “Lexis as a linguistic level”. In C. E. Bazell, J. C. Catford, M. A. K. Halliday, & R. H. Robins (Eds.), In Memory of J. R. Firth (pp. 148–162). London and New York: Longman. Halliday, M. A. K. (1976). System and Function in Language: Selected Papers. London: Oxford University Press. Halliday, M. A. K. (1989). Spoken and Written Languages. Oxford: Oxford University Press. van Halteren, H. (1999). “Renovating a wordclass tagset: from WOTAN to WOTAN 2”. Poster presented at ACH-ALLC ’99. University of Virginia, Charlottesville, Virginia, June 9–13, 1999. Hansson, P. (1999). “Discourse markers in dialogue”. In R. Andersson, Å. Abelin, J. Allwood, & P. Lindblad (Eds.), Proceedings of the 12th Swedish Phonetics Conference (Fonetik 99) (pp. 65–68). Göteborg: Department of Linguistics, Göteborg University. Harris, M. (1978). The Evolution of French Syntax: A Comparative Approach. London: Longman. Harris, M. & Vincent, N. (Eds.). (1988). The Romance Languages. Oxford: Oxford University Press. ’t Hart, J., Collier, R., & Cohen, A. (1990). A Perceptual Study on Intonation. An Experimental Approach to Speech Melody. Cambridge: Cambridge University Press. Hasan, R. (1989). “Semantic variation and sociolinguistics”. Australian Journal of Linguistics, 9, 221–275.
Heeman, P. (1999). “POS Tags and decision trees for language modeling”. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-99) (pp. 129–137). College Park, USA. Heeman, P. & Allen, J. F. (1999). “Speech repairs, intonational phrases and discourse markers: Modeling speakers’ utterances in spoken dialog”. Computational Linguistics, 25, 527–571. Heeman, P. A., Byron, D., & Allen, J. (1998). “Identifying Discourse Markers in Spoken Dialog”. In American Association for Artificial Intelligence – Spring Symposium on Applying Machine Learning to Discourse Processing (AAAI 1998) (pp. 44–51). Stanford: Stanford University. Hidalgo Navarro, A. (1997). La Entonación Coloquial. Función Demarcativa y Unidades de Habla. València: Universitat de València, Cuadernos de Filología, anejo XXI. Hirschberg, J. & Litman, D. (1993). “Empirical studies on the disambiguation of cue phrases”. Computational Linguistics, 19, 501–530. Hirst, D. & Di Cristo, A. (Eds.). (1998). Intonation Systems: A Survey on Twenty Languages. Cambridge: Cambridge University Press. Hockett, C. F. (1958). A Course in Modern Linguistics. New York: The Macmillan Company. Hoey, M. (Ed.). (1993). Data, Description, Discourse. Papers on the English Language in Honour of John M. Sinclair. London: HarperCollins. Horne, M., Hansson, P., Bruce, G., Frid, J., & Filipsson, M. (1999). “Discourse markers and the segmentation of spontaneous speech: The case of Swedish men ‘but/and/so’”. Working Papers, 47, 123–139. Ide, N. & Véronis, J. (Eds.). (1995). The Text Encoding Initiative: Background and Context. Dordrecht: Kluwer Academic. IMDI, http://www.mpi.nl/IMDI/ Isard, A. & Carletta, J. (1995). “Replicability of transaction and action coding in the Map Task corpus”. In M. Walker & J. Moore (Eds.), Empirical Methods in Discourse Interpretation and Generation: Working Notes of the AAAI Spring Symposium Series (pp. 60–66). 
Stanford, CA: Stanford University. Izre’el, S., Hary, B., & Rahav, G. (2001). “Designing CoSIH: The corpus of spoken Israeli Hebrew”. International Journal of Corpus Linguistics, 6, 171–197. Johannessen, J. (1998). Coordination. Oxford: Oxford University Press. Johansson, S. (Ed.). (1982). Computer Corpora in English Language Research. Bergen: Norwegian Computing Centre for the Humanities. Johansson, S. (1995). “Mens Sana in Corpore Sano: On the role of corpora in linguistic research”. The European Messenger, IV(2), 19–25. Johansson, S. & Hofland, K. (1989). Frequency Analysis of English Vocabulary and Grammar, Vol. 1–2. Oxford: Clarendon Press. Johansson, S. & Hofland, K. (1994). “Towards an English-Norwegian parallel corpus”. In Fries et al. (Eds.), 25–37. Johansson, S. & Stenström, A.-B. (Eds.). (1991). English Computer Corpora: Selected Papers and Research Guide. Berlin and New York: Mouton de Gruyter. Jordan, M. P. (1998). “The power of negation in English: Text, context and relevance”. Journal of Pragmatics, 29, 705–752. Jucker, A. & Ziv, Y. (Eds.). (1998). Discourse Markers: Descriptions and Theory. Amsterdam and Philadelphia: John Benjamins. Juilland, A. & Chang-Rodríguez, E. (1964). Frequency Dictionary of Spanish Words. La Haya: Mouton & Co. Kahrel, P. & Berg van den, R. (Eds.). (1994). Typological Studies in Negation. Amsterdam and Philadelphia: John Benjamins.
Karcevsky, S. (1931). “Sur la phonologie de la phrase”. Travaux du Cercle Linguistique de Prague, IV, 188–228. Kayne, R. (1976). “Il relativo francese que”. Rivista di Grammatica Generativa, I(3), 59–111. Kempson, R. (1977). Semantic Theory. Cambridge: Cambridge University Press. Kennedy, G. (1998). An Introduction to Corpus Linguistics. London and New York: Longman. Kerbrat-Orecchioni, C. (2001). “Oui, non, si: Un trio célèbre et méconnu”. Marges Linguistiques, 2, 95–129. Klausenburger, J. (2001). Course Book in Romance Linguistics. Munich: Lincom Europa. Kleiber, G. (1987). Relatives Restrictives et Relatives Appositives: Une Opposition “Introuvable”? Tübingen: Niemeyer. Knowles, G., Williams, B., & Taylor, L. (Eds.). (1996). A Corpus of Formal British English Speech. London and New York: Longman. Koch, P. & Oesterreicher, W. (1990). Gesprochene Sprache in der Romania: Französisch, Italienisch, Spanisch. Tübingen: Gunter Narr. Koch, P. & Oesterreicher, W. (2001). “Langage oral et langage écrit”. In Lexikon der Romanistischen Linguistik, Tome 1–2, 584–627. Kovacci, O. (1999). “El adverbio”. In I. Bosque & V. Demonte (Eds.), Gramática descriptiva de la lengua española (pp. 705–786). Labov, W. (1966). The Social Stratification of English in New York City. Washington, DC: Center for Applied Linguistics. Lai, J. P., Romano, A., & Roullet, S. (1997). “Analisi dei sistemi prosodici di alcune varietà parlate in Italia: Problemi metodologici e teorici”. Bollettino dell’Atlante Linguistico Italiano, 21, 23–70. Lambrecht, K. (1994). Information Structure and Sentence Form. Cambridge: Cambridge University Press. Landis, J. & Koch, G. (1977). “The measurement of observer agreement for categorical data”. Biometrics, 33, 159–174. Laudanna, A., Voghera, M., & Gazzellini, S. (2001). “Lexical representation of nouns and verbs in written Italian”. Brain and Language, 81, 250–263. Lavacchi, L. & Nicolás, C. (2000). Dizionario Spagnolo Italiano, CD-ROM Edition. Firenze: Le Lettere. 
Laviosa, S. (Ed.). (1998). META, 43(4). Special Issue on L’approche Basée sur le Corpus / The Corpus-based Approach. Leech, G. (1991). “The state of the art in corpus linguistics”. In K. Aijmer & B. Altenberg (Eds.), 8–29. Leech, G. (1992). “Corpora and theories of linguistic performance”. In J. Svartvik (Ed.), 105–122. Leech, G. (1997). “Teaching and language corpora: A convergence”. In Wichmann et al. (Eds.), 1–23. Leech, G. & Wilson, A. (1993). Tagset Guidelines. EAGLES input document. Pisa: ILC-CNR. Leech, G., Myers, G., & Thomas, J. (Eds.). (1995). Spoken English on Computer: Transcription, Mark-up and Application. London and New York: Longman. Leitner, G. (Ed.). (1992). New Directions in English Language Corpora: Methodology, Results, Software Developments. Berlin and New York: Mouton de Gruyter. Lenk, U. (1998). Marking Discourse Coherence: Functions of Discourse Markers in Spoken English. Tübingen: Gunter Narr. Leth Andersen, H. & Skytte, G. (Eds.). (1995). La Subordination dans les Langues Romanes. Copenhague: Munksgaard.
Levow, G. (1998). “Characterizing and recognizing spoken corrections in human-computer dialogue”. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (ACL-COLING 98) (pp. 736–742). Montreal, Canada. Liberman, M. (1975). The Intonation System of English. PhD thesis: MIT Indiana University. Lickley, R. & Bard, E. (1992). “Processing disfluent speech: Recognizing disfluency before lexical access”. In J. J. Ohala, T. M. Nearey, B. L. Derwing, M. M. Hodge, & G. E. Wiebe (Eds.), Proceedings of the 2nd International Conference on Spoken Language Processing (ICSLP 92) (pp. 935–938). Alberta. Lickley, R. & Bard, E. (1996). “On not recognizing disfluencies in dialogue”. In H. T. Bunnell & W. Idsardi (Eds.), Proceedings of the 4th International Conference on Spoken Language Processing (ICSLP 96) (pp. 1876–1879). Philadelphia, USA. Lickley, R., Shillcock, R., & Bard, E. (1991). “Processing disfluent speech: How and when are disfluencies found?”. In Proceedings of the 2nd European Conference on Speech Communication and Technology (Eurospeech 91) (pp. 1499–1502). Genoa. Llisterri, J. (1997). “Transcripción, etiquetado y codificación de corpus orales”. Published in http://liceu.uab.es/~joaquim/publicacions/FDS97.html Lory-Bouchet, M. (1977). “Essai d’analyse de la structure d’un discours parlé (conversation libre à plusieurs locuteurs)”. Recherches sur le Français Parlé, 1, 149–169. MacWhinney, B. (1994). The CHILDES Project: Tools for Analyzing Talk. Hillsdale, NJ: Lawrence Erlbaum Associates. Maekawa, K., Koiso, H., Furui, S., & Isahara, H. (2000). “Spontaneous speech corpus of Japanese”. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC 2000) (pp. 947–952). Paris: ELRA. Mara, E. Corpora Inediti. University of Graz. Maraschio, N. & Stefanelli, S. (Eds.). (in preparation). LIR: Lessici di Frequenza dell’Italiano Radiofonico. 
Marcos Marín, F. (1992). “El Corpus Oral de Referencia de la Lengua Española Contemporánea”. Informe del Proyecto. Madrid. Published in ftp://ftp.lllf.uam.es/pub/corpus/oral Martin, Ph. (1978). “Questions de phonosyntaxe et de phonosémantique en français”. Linguisticae Investigationes, II, 93–126. Martin, P. (1981). “Mesure de la fréquence fondamentale par intercorrélation avec une fonction peigne”. In Actes des XIIèmes Journées d’Etude sur la Parole (JEP 81). Montréal, Canada. Martin, P. (1999). “Prosodie des langues romanes: Analyse phonétique et phonologique”. Recherches sur le Français Parlé, 15, 233–253. Martin, P. (2000). “Peigne et brosse pour Fo: Mesure de la fréquence fondamentale par alignement de spectres séquentiels”. In Actes des XXIIIèmes Journées d’Etude sur la Parole (JEP 2000) (pp. 245–248). Aussois, France. Martin, P. (2001). “ToBi: L’illusion scientifique?”. In V. Aubergé & A. Lacheret-Dujour (Eds.), Actes du Colloque Journées Prosodie 2001 (pp. 144–148). Grenoble: Presses Universitaires de Grenoble. Mast, M., Kompe, R., Harbeck, S., Kiessling, A., Niemann, H., Nöth, E., Schukat-Talamazzini, E. G., & Warnke, V. (1996). “Dialog act classification with the help of prosody”. In H. T. Bunnell & W. Idsardi (Eds.), Proceedings of the 4th International Conference on Spoken Language Processing (ICSLP 96) (pp. 1732–1735). Philadelphia. MATE, http://mate.nis.sdu.dk/about/deliverables.html Mateus, M. H. M., Brito, A. M., Duarte, I., & Hub Faria, I. (2003). Gramática da Língua Portuguesa. Lisboa: Caminho.
Mendes, A., Amaro, R., & Bacelar, F. (2003). “Reusing available resources for tagging a spoken Portuguese corpus”. In A. Branco, A. Mendes, & R. Ribeiro (Eds.), Tagging and Shallow Processing of Portuguese: Workshop Notes of TASHA. Outubro de 2003. Lisboa: Departamento de Informática, Faculdade de Ciências da Universidade de Lisboa. Merialdo, B. (1994). “Tagging English text with a probabilistic model”. Computational Linguistics, 20, 155–171. Meyer, C. F. (1996). “Coordinate structures in English”. World Englishes, 15, 29–41. Migliorini, B., Tagliavini, C., & Fiorelli, P. (1969). Dizionario d’Ortografia e di Pronunzia. Torino: ERI. Miller, J. & Weinert, R. (1998). Spontaneous Spoken Language. Oxford: Clarendon Press. Moignet, G. (1969). “Le verbe voici-voilà”. Travaux de Linguistique et de Littérature, 7, 189–201. Monachini, M. (1996). ELM-IT: EAGLES Specifications for Italian Morphosyntax. Lexicon Specification and Classification Guidelines. Eagles Document EAG-CLWG-ELM-IT/F. Pisa: ILC-CNR. http://www.ilc.cnr.it/EAGLES96/browse.html Monachini, M. & Östling, A. (1992). Towards a Minimal Standard for Morphosyntactic Corpus Annotation, NERC technical report NERC-WP8-61. Pisa: ILC-CNR. Mondada, L. (1998). “Variations sur le contexte en linguistique”. Cahiers de l’ILSL, 11: Mélanges Offerts à Morteza Mahmoudian, 243–266. Moneglia, M. (2000). “Le corpus LABLITA”. In M. Bilger (Ed.), Corpus: Méthodologie et Applications Linguistiques (pp. 49–56). Paris: Champion. Moneglia, M. (forthcoming a). “I corpora dell’italiano parlato di LABLITA: Criteri di costituzione, unità di analisi e comparabilità dei dati linguistici orali”. In E. Burr (Ed.), Tradizione e Innovazione. Atti del VII Convegno Internazionale della Società Internazionale di Linguistica e Filologia Italiana (SILFI 2000). Firenze: Cesati. Moneglia, M. (forthcoming b). “C-ORAL-ROM: Un corpus di riferimento del parlato spontaneo per l’italiano e le lingue romanze”. In I. Korzen & H. 
Jansen (Eds.), Lingua, Cultura e Intercultura: L’Italiano e le Altre Lingue. Atti dell’ VIII Convegno della Società Internazionale di Linguistica e Filologia Italiana (SILFI 2004). København: Samfundslitteratur. Moneglia, M. & Cresti, E. (1997). “Intonazione e criteri di trascrizione del parlato”. In U. Bortolini & E. Pizzuto (Eds.), 57–90. Moneglia, M. & Cresti, E. (2001). “The value of prosody in the transition to complex utterances: Data and theoretical implications from the acquisition of Italian”. In Proceedings of the 8th International Congress of International Association for the Study of Child Language (IASCL 99) (pp. 851–873). Chicago: Cascadilla Press. Moneglia, M. & Cresti, E. (2003). “Il progetto C-ORAL-ROM”. In N. Maraschio & T. Poggi Salani (Eds.), Italia Linguistica Anno Mille. Italia Linguistica Anno Duemila. Atti del XXXIV Congresso Internazionale di Studi della Società di Linguistica Italiana (SLI 2000) (pp. 709–722). Roma: Bulzoni. Moneglia, M., Scarano, A., & Spinu, M. (2002). “Validation by expert transcribers of the C-ORAL-ROM prosodic tagging criteria on Italian, Spanish, Portuguese corpora of spontaneous speech”. Published in http://lablita.dit.unifi.it/coralrom/papers/Validazione%202.1.pdf Moore, J., Foster, M. E., Lemon, O., & White, M. (2004). “Generating tailored, comparative descriptions in spoken dialogue”. To appear in V. Barr & Z. Markov (Eds.), Proceedings of the 17th International Florida Artificial Intelligence Research Symposium Conference (FLAIRS 2004). Miami Beach, USA.
Moreno, A. (1991). Un Modelo Computacional Basado en la Unificación para el Análisis y Generación de la Morfología del Español. PhD thesis: Universidad Autónoma de Madrid. Moreno, A. (2003). “Los corpus orales del LLI-UAM: Primera generación y segunda generación”. Lamusa digital, 3. Special Issue: Proceedings of the Computers Literature and Philology Conference (CLIP 2002). http://www.uclm.es/lamusa Moreno, A. & Goñi, J. M. (1995). “A morphological model and processor for Spanish implemented in Prolog”. In M. Alpuente & M. I. Sessa (Eds.), Proceedings of Joint Conference on Declarative Programming (GULP-PRODE 95) (pp. 321–331). Salerno: Palladio Editrice. Moreno, A. & Goñi, J. M. (2002). “Spanish inflectional morphology in DATR”. Journal of Logic, Language and Information, 11, 79–105. Moreno, A. & Guirao, J. M. (2003). “Tagging a spontaneous speech corpus of Spanish”. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2003) (pp. 292–296). Borovets, Bulgaria. Moreno, A., López, S., Grishman, R., & Sánchez, F. (2003). “Developing a syntactic annotation scheme and tools for a Spanish treebank”. In A. Abeillé (Ed.), 149–163. Morin, Y.-C. (1985). “On two French subjectless verbs Voici and Voilà”. Language, 61(4), 777–820. Moulines, E. & Charpentier, M. (1990). “Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones”. Speech Communication, 9, 453–467. Muller, C. (1985). Langue Française, Linguistique Quantitative, Informatique: Recueil d’Articles, 1980–1984. Genève: Slatkine. Muller, C. (Ed.). (1996). Dépendance et Intégration Syntaxique: Subordination, Coordination, Connexion. Tübingen: Max Niemeyer. MULTILEX Consortium (1993). Standards for a Multifunctional Lexicon. CAP GEMINI INNOVATION for the MULTILEX Consortium. Paris. Muzart-Fonseca Dos Santos, I. (1995). “Brouillons de parole: Visualisation et genèse du texte oral”. Recherches sur le Français parlé, 13, 177–200. 
Nakamura, J. (1993). “Statistical Methods and Large Corpora: A New Tool for Describing Text Types”. In M. Baker et al. (Eds.), 293–312. Nakatani, C. & Hirschberg, J. (1994). “A Corpus-based study of repair cues in spontaneous speech”. Journal of the Acoustical Society of America, 95(3), 1603–1616. Nespor, M. (1993). Fonologia. Bologna: Il Mulino. Nivre, J., Grönqvist, L., Gustafsson, M., Lager, T., & Sofkova, S. (1996). “Tagging spoken language using written language statistics”. In Proceedings of the 16th International Conference of Computational Linguistics (COLING 96) (pp. 1078–1081). Copenhagen: Centre for Language Technology. Nivre, J. & Grönqvist, L. (2001). “Tagging a corpus of spoken Swedish”. International Journal of Corpus Linguistics, 6(1), 47–78. Nolan, F. & Grabe, E. (1997). “Can ToBi transcribe intonational variation in British English?”. In A. Botinis, G. Kouroupetroglou, & G. Carayiannis (Eds.), Theory, Models and Applications. Proceedings ESCA Workshop on Intonation (pp. 259–262). Athens: ESCA and University of Athens. Oostdijk, N. (1986). “Coordination and gapping in corpus analysis”. In J. Aarts & W. Meijs (Eds.), 177–201. Oostdijk, N. (1988). “A corpus linguistic approach to linguistic variation”. Literary and Linguistic Computing, 3, 12–25. Oostdijk, N. & Haan, P. de (Eds.). (1993). Corpus-based Research into Language. Amsterdam: Rodopi.
Oostdijk, N., Goedertier, W., Van Eynde, F., Boves, L., Martens, J. P., Moortgat, M., & Baayen, H. (2002). “Experiences from the spoken Dutch corpus project”. In M. C. Rodriguez & C. Suarez Araujo (Eds.), Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC 2002) (pp. 340–347). Paris: ELRA. Orletti, F. (Ed.). (1994). Fra Conversazione e Discorso. Firenze: La Nuova Italia. Orton, H., Sanderson, S., & Widdowson, J. (1978). Linguistic Atlas of England. London: Croom Helm. O’Shaughnessy, D. (1992a). “Analysis of false starts in spontaneous speech”. In J. J. Ohala, T. M. Nearey, B. L. Derwing, M. M. Hodge, & G. E. Wiebe (Eds.), Proceedings of the 2nd International Conference on Spoken Language Processing (ICSLP 92) (pp. 931–934). Alberta, Canada. O’Shaughnessy, D. (1992b). “Automatic recognition of hesitations in spontaneous speech”. In Proceedings IEEE Conference on Acoustics, Speech and Signal Processing (pp. 593–596). San Francisco, USA. O’Shaughnessy, D. (1993). “Analysis and automatic recognition of false starts in spontaneous speech”. In Proceedings IEEE Conference on Acoustics, Speech and Signal Processing (pp. 724–727). Minneapolis, USA. O’Shaughnessy, D. (1994). “Correcting complex false starts in spontaneous speech”. In Proceedings IEEE Conference on Acoustics, Speech and Signal Processing (pp. 349–352). Adelaide, Australia. O’Shaughnessy, D. (1995). “Timing patterns in fluent and disfluent spontaneous speech”. In Proceedings IEEE Conference on Acoustics, Speech and Signal Processing (pp. 600–603). Detroit, USA. Oviatt, S. (1995). “Predicting spoken disfluencies during human-computer interaction”. Computer Speech and Language, 9, 19–35. Panunzi, A. (forthcoming). “Il verbo “essere” nella lingua italiana d’uso: Indagine su un corpus di parlato spontaneo (C-ORAL-ROM) e primi confronti interlinguistici”. In I. Korzen & H. Jansen (Eds.), Lingua, Cultura e Intercultura: L’Italiano e le Altre Lingue. 
Atti dell’ VIII Convegno della Società Internazionale di Linguistica e Filologia Italiana (SILFI 2004). København: Samfundslitteratur. Panunzi, A., Picchi, E., & Moneglia, M. (2004). “Using Pitagger for lemmatization and PoS tagging of a spontaneous speech corpus: C-ORAL-ROM Italian”. In M. T. Lino, M. F. Xavier, F. Ferreira, R. Costa, & R. Silva (Eds.), Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004) (pp. 563–566). Paris: ELRA. Pavón Lucero, Mª. V. (1999). “Clases de partículas: Preposición, conjunción y adverbio”. In V. Demonte & I. Bosque (Eds.), 565–656. Pei, M. (1976). The Story of Latin and the Romance Languages. New York: Hagerstown. Penn Treebank, http://www.cis.upenn.edu/~treebank/home.html Percy, C. E., Meyer, C. F., & Lancashire, I. (Eds.). (1996). Synchronic Corpus Linguistics. Papers from the 16th International Conference on English Language Research on Computerized Corpora (ICAME 16). Amsterdam: Rodopi. Picallo, C. & Rigau, G. (1999). “El posesivo y las relaciones posesivas”. In V. Demonte & I. Bosque (Eds.), 973–1024. Picchi, E. (1994). “Statistical tools for corpus analysis: A tagger and lemmatizer of Italian”. In W. Martin, W. Meijs, M. Moerland, E. ten Pas, P. van Sterkenburg, & P. Vossen (Eds.), Proceedings of EURALEX 1994 (pp. 501–510). Amsterdam, Holland. Pierrehumbert, J. (1980). The Phonology and Phonetics of English Intonation. PhD thesis: MIT Indiana University.
Piot, M. (1996a). “Problemi nella classificazione delle congiunzioni subordinanti del francese”. In E. D’Agostino (Ed.), Tra Sintassi e Semantica: Descrizioni e Metodi di Elaborazione Automatica della Lingua d’Uso (pp. 399–413). Salerno: Pubblicazioni dell’Università degli Studi di Salerno – Sezione di Studi Filologici, Letterari e Artistici. Piot, M. (1996b). “Subordination-coordination: étude de transferts et des relations entre processus”. In C. Muller (Ed.), 35–42. Pirelli, J., Beckmann, M., & Hirschberg, J. (1994). “Evaluation of prosodic transcription labelling reliability in the ToBi framework”. In Proceedings of the 3rd International Conference on Spoken Language Processing (ICSLP 94) (pp. 123–126). Yokohama, Japan. PiSystem, http://www.ilc.cnr.it/pisystem/ Poggi Salani, T. (1977). “Tra lingua e cultura”. Rivista Italiana di Dialettologia, I, 79–98. Pons Borderia, S. (1998). Conexion y Conectores: Estudio de su Relacion en el Registro Informal de la Lengua. Valencia: Universitat de Valencia. Pontecorvo, C. & Duranti, A. (1996). “Bambini e genitori in famiglia: conversazione e socializzazione”. Età Evolutiva, 55, 53–119. Portolés Lázaro, J. & Martín Zorraquino, M. A. (1999). “Los marcadores del discurso”. In V. Demonte & I. Bosque (Eds.), 4051–4214. Português Fundamental (1984). Vocabulário e Gramática, tomo 1: Vocabulário. Lisboa: INIC/CLUL. Posner, R. (1996). The Romance Languages. Cambridge: Cambridge University Press. Pusch, C. D. (2002). “A survey of spoken language corpora in Romance”. In C. D. Pusch & W. Raible (Eds.), 245–264. Pusch, C. D. & Raible, W. (Eds.). (2002). Romanistische Korpuslinguistik: Korpora und gesprochene Sprache / Romance Corpus Linguistics: Corpora and Spoken Language. Tübingen: Gunter Narr. Quirk, R. (1992). “On corpus principles and design”. In J. Svartvik (Ed.), 457–469. Quirk, R., Greenbaum, S., Leech, G., & Svartvik, J. (1985). A Comprehensive Grammar of the English Language. London and New York: Longman. Ratnaparkhi, A. 
(1996). “A maximum entropy part-of-speech tagger”. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 96) (pp. 133–141). Philadelphia: University of Pennsylvania. Redeker, G. (1991). “Linguistic markers of discourse structure”. Linguistics, 29, 1137–1179. Reinhemeir, S. & Tasmowski, L. (1997). Pratique des Langues Romanes: Espagnol, Français, Portugais, Roumain. Paris and Montréal: L’Harmattan. Reithinger, N. & Klesen, M. (1997). “Dialogue act classification using language models”. In Proceedings of the 5th European Conference on Speech Communication and Technology (Eurospeech 97) (pp. 2235–2238). Rhodes, Greece. Renouf, A. (1998). Explorations in Corpus Linguistics. Amsterdam: Rodopi. Renzi, L. (Ed.). (1988). Grande Grammatica di Consultazione, Vol. I: La Frase. I Sintagmi Nominale e Preposizionale. Bologna: Il Mulino. Renzi, L. (1994). Nuova Introduzione alla Filologia Romanza. Bologna: Il Mulino. Renzi, L. & Salvi, G. (Eds.). (1991). Grande Grammatica di Consultazione, Vol. II: I Sintagmi Verbale, Aggettivale, Avverbiale. La Subordinazione. Bologna: Il Mulino. Renzi, L., Salvi, G., & Cardinaletti, A. (Eds.). (1995). Grande Grammatica di Consultazione, Vol. III: Tipi di Frase, Deissi, Formazione delle Parole. Bologna: Il Mulino. Renzi, L. & Andreose, A. (2003). Manuale di linguistica e filologia romanza. Bologna: Il Mulino. Riegel, M., Pellat, J.-C., & Rioul, R. (2001). Grammaire Méthodique du Français. Paris: PUF.
Rietveld, T. & van Hout, R. (2002). Statistics in Language Research. Berlin and New York: Mouton de Gruyter. Rivenc, P. (1973). “A l’aube de l’ère des corpus”. Voix et Images du CREDIF, 18. Paris: Didier. Rivenc, P. (2000). Pour Aider à Apprendre à Communiquer dans une Langue étrangère. Paris: Didier Erudition. Rivenc, P. & Rojo Sastre, A. (1968). Préface de Vida y Diálogos de España (1er grado), 9–10. Paris: Didier and Philadelphia: Chilton. Rodrigues, C. (2003). Lisboa e Braga: Fonologia e Variação. Lisboa: Fundação Calouste Gulbenkian e Fundação Para a Ciência e a Tecnologia. Rossi, F. (1998). Le Parole dello Schermo: Analisi Linguistica del Parlato Riprodotto in Sei Film dal 1948 al 1957. PhD thesis: Università degli Studi di Firenze. Rossi, F. (1999a). Le Parole dello Schermo. Roma: Bulzoni. Rossi, F. (1999b). “Non lo sai che ora è? Alcune considerazioni sull’intonazione e sul valore pragmatico degli enunciati con dislocazione a destra”. Studi di Grammatica Italiana, XVIII, 145–193. Rouget, C. (2000). “Les nominalisations sont-elle réservées aux descriptions techniques?” In M. Bilger (Ed.), Corpus: Méthodologie et Applications Linguistiques (pp. 296–305). Paris: Champion. Sánchez López, C. (1999). “Los cuantificadores: Clases de cuantificadores y estructuras cuantificativas”. In V. Demonte & I. Bosque (Eds.), 1025–1128. Sanders, C. (1994). “Register and genre in French and English”. In J. Coleman & R. Crawshaw (Eds.), Discourse Variety in Contemporary French (pp. 87–105). London: AFS/CILT. Sankoff, D. & Cedergren, H. I. (Eds.). (1981). Variation Omnibus. Edmonton: Linguistic Research. Sauvageot, A., Gougenheim, G., Michéa, R., & Rivenc, P. (1956). L’Elaboration du Français élémentaire: étude sur l’établissement d’un vocabulaire et d’une grammaire de base. Paris: Didier. Scarano, A. (2002). Frasi Relative e Pseudo-relative in Italiano: Sintassi, Semantica e Articolazione dell’Informazione. Roma: Bulzoni. Scarano, A. (Ed.). (2003). Macro-syntaxe et Pragmatique. 
L’analyse Linguistique de l’Oral. Roma: Bulzoni. Scarano, A. (2004). “Enunciati nominali in un corpus di italiano parlato: Appunti per una grammatica corpus based”. In F. Albano Leoni, F. Cutugno, M. Pettorino, & R. Savy (Eds.), Atti del Convegno Nazionale “Il Parlato Italiano”, CD-ROM (pp. 1–18). Napoli: M. D’Auria. Scarano, A. (forthcoming a). “Relative appositive e aggettivi appositivi: Tra sintassi e articolazione dell’informazione”. In P. D’Achille (Ed.), Generi, Architetture e Forme testuali. Atti del VII Convegno Internazionale della Società Internazionale di Linguistica e Filologia Italiana (SILFI 2002). Roma. Scarano, A. (forthcoming b). “Aggettivi qualificativi nel parlato italiano, francese, portoghese e spagnolo (C-ORAL-ROM): Frequenze e strutture”. In I. Korzen & H. Jansen (Eds.), Lingua, Cultura e Intercultura: L’Italiano e le Altre Lingue. Atti dell’ VIII Convegno della Società Internazionale di Linguistica e Filologia Italiana (SILFI 2004). København: Samfundslitteratur. Scarano, A. & Signorini, S. (forthcoming). “Corpus linguistics and diachronic variability: A study on Italian spoken language corpora”. In C. Pusch (Ed.), Proceedings of Freiburg Workshop on Romance Corpus Linguistics II: “Corpora and Historical Linguistics. Investigating Language Change through Corpora and Databases”. Tübingen: Narr.
Schmid, H. (1994). “Probabilistic part-of-speech tagging using decision trees”. In Proceedings of International Conference on New Methods in Language Processing (pp. 44–99). Manchester: Centre for Computational Linguistics (CCL) and Univ. of Manchester, Institute of Science and Technology (UMIST). Schneider, S. (2002). “An online database version of the LIP corpus”. In C. D. Pusch & W. Raible (Eds.), Proceedings of Freiburg Workshop on Romance Corpus Linguistics I: Corpora and Spoken Language (pp. 201–208). Tübingen: Narr. Schourup, L. (1999). “Discourse markers”. Lingua, 107, 227–265. Schiffrin, D. (1987). Discourse Markers. Cambridge: Cambridge University Press. Searle, J. (1969). Speech Acts: An Essay in the Philosophy of Language. Cambridge: Cambridge University Press. Segura da Cruz, M. L. (1987). “A Norma lexicológica no tratamento do ‘corpus’ de frequência”. In M. F. Bacelar do Nascimento et al. (Eds.). Seijido, M. & Cappeau, P. (forthcoming). Bibliographie des Corpus de Français Parlé. Inventaire commandé par la DGLFLF (Délégation Générale à la Langue Française et aux Langues de France). Serianni, L. (1988). Grammatica Italiana: Suoni, Forme, Costrutti. Torino: UTET. Shaikevich, A. (2001). “Contrastive and comparable corpora: Quantitative aspects”. International Journal of Corpus Linguistics, 6(2), 229–256. Shriberg, E. (1994). Preliminaries to a Theory of Speech Disfluencies. PhD thesis: University of California at Berkeley. Shriberg, E. & Lickley, R. (1993). “Intonation of clause-internal filled pauses”. Phonetica, 50(3), 172–179. Shriberg, E., Bates, R., & Stolcke, A. (1997). “A prosody-only decision-tree model for disfluency detection”. In Proceedings of the 5th European Conference on Speech Communication and Technology (Eurospeech 97) (pp. 2383–2386). Rhodes. Signorini, S. (forthcoming). “Soggetto e topic in un corpus di parlato italiano e spagnolo (C-ORAL-ROM)”. In I. Korzen & H. 
Jansen (Eds.), Lingua, Cultura e Intercultura: L’Italiano e le Altre Lingue. Atti dell’ VIII Convegno della Società Internazionale di Linguistica e Filologia Italiana (SILFI 2004). København: Samfundslitteratur. Signorini, S. & Tucci, I. (forthcoming). “Il restauro e la costituzione del primo corpus di italiano parlato: Il Corpus Stammerjohann”. In Costituzione, Gestione e Restauro di Corpora Vocali. Atti delle XIV Giornate del Gruppo di Fonetica Sperimentale (XIV GFS). Viterbo, Italy. Silverman, K., Beckman, M., Pitrelli, J., Ostendorf, M., Wightman, C., Price, P., Pierrehumbert, J., & Hirschberg, J. (1992). “TOBI: A standard for labeling English prosody”. In J. J. Ohala, T. M. Nearey, B. L. Derwing, M. M. Hodge, & G. E. Wiebe (Eds.), Proceedings of the 2nd International Conference on Spoken Language Processing (ICSLP 92) (pp. 867–870). Rhodes, Greece. Simon, A. C. (2004). La Structuration Prosodique du Discours en Français. Une Approche Multidimensionnelle et Expérientielle. Berne: Peter Lang. Simone, R. (1997). “Esistono verbi sintagmatici in italiano?” In T. De Mauro & V. Lo Cascio (Eds.), Lessico e Grammatica: Teorie Linguistiche e Applicazioni Lessicografiche. Atti del Convegno Interannuale della Società di Linguistica Italiana (pp. 155–170). Roma: Bulzoni. Simone, R. (1997). “Langues romanes de toute l’Europe, unissez-vous!” In C. Blanche-Benveniste & A. Valli (coords.), 25–32. Sinclair, J. M. (1991). Corpus Concordance Collocation. Oxford: Oxford University Press. Sinclair, J. M. (1993). “Written discourse structure”. In Sinclair et al. (Eds.), 6–31.
Sinclair, J. M. (1996). “The empty lexicon”. International Journal of Corpus Linguistics, 1(1), 99–119.
Sinclair, J. M. (1997a). “Corpus evidence in language description”. In Wichmann et al. (Eds.), 27–39.
Sinclair, J. M. (1997b). “Corpus linguistics at the millennium”. In J. Kohn, B. Rüschoff, & D. Wolff (Eds.), New Horizons in CALL. Proceedings of European Association for Computer Assisted Language Learning (EUROCALL 96) (pp. 1–10). Szombathely: Berzsenyi Daniel College.
Sinclair, J. M. (1999a). “Large corpus research and foreign language teaching”. In R. de Beaugrande, M. Grosman, & B. Seidlhofer (Eds.), Language Policy and Language Education in Emerging Nations (pp. 79–86). Stamford, CT: Ablex.
Sinclair, J. M. (1999b). “The internalisation of dialogue”. In R. Rossini Favretti, G. Sandri, & R. Scazzieri (Eds.), Incommensurability and Translation: Essays in Honour of Thomas S. Kuhn (pp. 391–406). Cheltenham: Edward Elgar.
Sinclair, J. M. & Coulthard, R. M. (1975). Towards an Analysis of Discourse: The English Used by Teachers and Pupils. London: Oxford UP.
Sinclair, J. M., Hoey, M., & Fox, G. (Eds.). (1993). Techniques of Description: Spoken and Written Discourse. London and New York: Routledge.
Sinclair, J. & Ball, J. (1995). Text Typology (External Criteria). Electronic Document. Published on http://www.ilc.cnr.it/EAGLES96/texttyp/texttyp.html
Siu, M. & Ostendorf, M. (1996). “Modeling disfluencies in conversational speech”. In H. T. Bunnell & W. Idsardi (Eds.), Proceedings of the 4th International Conference on Spoken Language Processing (ICSLP 96) (pp. 382–391). Philadelphia, USA.
Skytte, G., Korzen, I., Polito, P., & Strudsholm, E. (Eds.). (1999). Tekststrukturering på italiensk og dansk: Resultater af en komparativ undersøgelse / Strutturazione Testuale in Danese e in Italiano: Risultati di una Indagine Comparativa. Vol. 3. + 3 CD-ROM. København: Museum Tusculanum.
Sobrero, A. (Ed.). (1993). Introduzione all’Italiano Contemporaneo. Bari-Roma: Laterza.
Sornicola, R. (1981). Sul Parlato. Bologna: Il Mulino.
Souter, C. & Atwell, E. (Eds.). (1993). Corpus-based Computational Linguistics. Amsterdam: Rodopi.
Spina, S. (2001). Fare i Conti con le Parole: Introduzione alla Linguistica dei Corpora. Perugia: Guerra.
Spina, S. (forthcoming). “Il Corpus di Italiano Televisivo (Cit): Struttura e annotazione”. In E. Burr (Ed.), Tradizione e Innovazione. Atti del VII Convegno Internazionale della Società Internazionale di Linguistica e Filologia Italiana (SILFI 2000). Firenze: Cesati.
Stammerjohann, H. (1970). “Strukturen der Rede: Beobachtungen an der Umgangssprache von Florenz”. Studi di Filologia Italiana, XXVIII, 295–397.
Stein, D. (1997). “Syntax and varieties”. In J. Cheshire & D. Stein (Eds.), Taming the Vernacular: From Dialect to Written Standard Language (pp. 35–50). London and New York: Longman.
Stenström, A.-B. (1986). “A study of pauses as demarcators in discourse and syntax”. In J. Aarts & W. Meijs (Eds.), 203–218.
Stenström, A.-B. (1990). “What is the role of discourse signals in sentence grammar?”. In J. Aarts & W. Meijs (Eds.), 213–229.
Stenström, A.-B. (1994). An Introduction to Spoken Interaction. London and New York: Longman.
Stenström, A.-B. & Svartvik, J. (1993). “Imparsable speech: Repeats and other nonfluencies in spoken English”. In N. Oostdijk & P. de Haan (Eds.), Corpus-based Research into Language (pp. 241–254). Amsterdam: Rodopi.
Stenström, A.-B., Andersen, G., & Hasund, I. K. (2002). Trends in Teenage Talk. Amsterdam and Philadelphia: John Benjamins.
Stirling, L., Fletcher, J., Mushin, I., & Wales, R. (2001). “Representational issues in annotation: Using the Australian map task corpus to relate prosody and discourse structure”. Speech Communication, 33, 113–134.
Stolcke, A., Ries, K., Coccaro, N., Bates, R., Jurafsky, D., Taylor, P., Martin, R., Meteer, M., & Van Ess-Dykema, C. (2000). “Dialogue act modeling for automatic tagging and recognition of conversational speech”. Computational Linguistics, 26, 339–371.
Strudsholm, E. (1999). Relative Situazionali in Italiano Moderno. Münster: LIT.
Stubbs, M. (1996). Text and Corpus Analysis: Computer-assisted Studies of Language and Culture. Oxford and Cambridge, MA: Blackwell.
Svartvik, J. (Ed.). (1990). The London Corpus of Spoken English: Description and Research. Lund: Lund University Press.
Svartvik, J. (Ed.). (1992). Directions in Corpus Linguistics. Proceedings of Nobel Symposium, 82. Berlin and New York: Mouton de Gruyter.
Svartvik, J. (1999). “English corpus studies: Past, present, future”. English Corpus Studies, 6, 1–16.
Svartvik, J. & Quirk, R. (Eds.). (1980). A Corpus of English Conversation. Lund: Liber/Gleerups.
Svartvik, J. & Ekedahl, O. (1995). “Verbs in public and private speaking”. In B. Aarts & C. F. Meyer (Eds.), The Verb in Contemporary English: Theory and Description (pp. 273–289). Cambridge: Cambridge University Press.
Swerts, M. & Geluykens, R. (1993). “The prosody of information units in spontaneous monologues”. Phonetica, 50, 189–196.
Syrdal, A. & McGory, J. (2000). “Inter-transcriber reliability of ToBI prosodic labeling”. In Proceedings of the 6th International Conference on Spoken Language Processing (ICSLP 2000) (pp. 235–238). Beijing, China.
Sznajder, L. (Ed.). (1990). L’Information Grammaticale, 46. Special Issue on La Coordination.
Tagliavini, C. (1949). Le Origini delle Lingue Neolatine: Corso Introduttivo di Filologia Romanza. Bologna: Patron.
Takagi, K. & Itahashi, S. (1996). “Segmentation of spoken dialogue by interjection, disfluent utterances and pauses”. In H. T. Bunnell & W. Idsardi (Eds.), Proceedings of the 4th International Conference on Spoken Language Processing (ICSLP 96) (pp. 693–697).
Tamburini, F. (2000). “Annotazione grammaticale e lemmatizzazione di corpora in italiano”. In R. Rossini Favretti (Ed.), Linguistica e Informatica: Corpora, Multimedialità, Percorsi di Apprendimento (pp. 57–73). Roma: Bulzoni.
Tannen, D. (1989). Talking Voices: Repetition, Dialogue and Imagery in Conversational Discourse. Cambridge: Cambridge University Press.
TEI (1991). List of Common Morphological Features for Inclusion in TEI Starter Set of Grammatical-annotation Tags, TEI AI 1W2 document, TEI; http://www.tei-c.org/Vault/AI/ai1w02.txt
TEI, http://www.tei-c.org/Guidelines2/
Teubert, W. (1996). “Comparable or parallel corpora?”. In J. M. Sinclair et al. (Eds.), International Journal of Lexicography, 9(3), Corpus to Corpus: A Study of Translation Equivalence (pp. 238–264).
Teubert, W. (2001). “Corpus linguistics and lexicography”. International Journal of Corpus Linguistics (Special Issue), 125–154.
Teubert, W. & Kervio-Berthou, V. (2001). “Linguistique de corpus et lexicographie”. Cahiers de Lexicologie, 77(2), 137–163.
Thomas, J. & Short, M. (1996). Using Corpora for Language Research: Studies in Honour of Geoffrey Leech. London and New York: Longman.
Thompson, S. A. & Mulac, A. (1991). “The discourse conditions for the use of the complementizer that in conversational English”. Journal of Pragmatics, 15, 237–251.
Tizzanini, G. (forthcoming). “L’articolazione dell’informazione: Dati quantitativi di un corpus di italiano parlato”. In E. Burr (Ed.), Tradizione e Innovazione. Atti del VII Convegno Internazionale della Società Internazionale di Linguistica e Filologia Italiana (SILFI 2000). Firenze: Cesati.
ToBI, http://www.ling.ohio-state.edu/~tobi/#what
Tognini Bonelli, E. (1995). “Italian corpus linguistics: Practice and theory”. TEXTUS, VIII, 391–412.
Tognini Bonelli, E. (1996a). “Translation equivalence in a corpus linguistics framework”. International Journal of Lexicography, 9(3), 197–217.
Tognini Bonelli, E. (1996b). Corpus Theory and Practice. Birmingham: TWC Monographs.
Tognini Bonelli, E. (2000). “‘Unità funzionali complete’ in inglese e in italiano: Verso un approccio corpus-driven”. In S. Bernardini & F. Zanettin (Eds.), 153–175.
Tognini Bonelli, E. (2001). Corpus Linguistics at Work. Amsterdam and Philadelphia: John Benjamins.
Tottie, G. (1981). “Negation and discourse strategy in spoken and written English”. In H. Cedergren & D. Sankoff (Eds.), Variation Omnibus (pp. 271–284). Edmonton, Alberta: Linguistic Research.
Tottie, G. (1982). “Where do negative sentences come from?” Studia Linguistica, 36, 88–105.
Tottie, G. (1983a). Much about ‘Not’ and ‘Nothing’: A Study of the Variation between Analytic and Synthetic Negation in Contemporary American English. Lund: CWK Gleerup.
Tottie, G. (1983b). “The missing link? Or, why is there twice as much negation in spoken English as in written English?” In S. Jacobson (Ed.), Papers from the Scandinavian Symposium on Syntactic Variation (pp. 67–74). Stockholm: Almqvist & Wiksell.
Tottie, G. (1988). “No-negation and not-negation in spoken and written English”. In M. Kytö, O. Ihalainen, & M. Rissanen (Eds.), Corpus Linguistics: Hard and Soft (pp. 245–265). Amsterdam: Rodopi.
Tottie, G. (1991). Negation in English Speech and Writing: A Study in Variation. San Diego: Academic Press.
Tranel, B. (1973a). “Voici and voilà”. In R. A. Jacobs (Ed.), Studies in Language (pp. 141–151). Lexington/Toronto: Xerox College Publishing.
Tranel, B. (1973b). “Chains, categories external to S, and French complement inversion”. Natural Language and Linguistic Theory, 1, 107–139.
Traum, D. (1999). “Speech acts for dialogue agents”. In M. Wooldridge & A. Rao (Eds.), Foundations and Theories of Rational Agents (pp. 169–201). Dordrecht: Kluwer.
Traum, D. (2000). “20 questions for dialogue act taxonomies”. Journal of Semantics, 17(1), 7–30.
Tucci, I. (forthcoming). “Strategie lessicali e di articolazione informativa per l’espressione della modalità nel parlato: Dati dai corpora italiano e spagnolo C-ORAL-ROM”. In I. Korzen & H. Jansen (Eds.), Lingua, Cultura e Intercultura: L’Italiano e le Altre Lingue. Atti dell’VIII Convegno della Società Internazionale di Linguistica e Filologia Italiana (SILFI 2004). København: Samfundslitteratur.
Uchimoto, K., Nobata, C., Yamada, A., Sekine, S., & Isahara, H. (2002). “Morphological analysis of the spontaneous speech corpus”. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002) (pp. 1298–1302). Taipei, China.
Uchimoto, K., Nobata, C., Yamada, A., Sekine, S., & Isahara, H. (2003). “Morphological analysis of a large spontaneous speech corpus in Japanese”. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL 2003) (pp. 479–488). Sapporo, Japan.
Upton, C., Parry, D., & Widdowson, J. D. A. (1994). Survey of English Dialects: The Dictionary and Grammar. London: Routledge.
Valli, A. & Véronis, J. (1999). “Etiquetage grammatical de corpus oraux: Problèmes et perspectives”. Revue Française de Linguistique Appliquée, IV(2), 113–133.
Van Eynde, F., Zavrel, J., & Daelemans, W. (2000). “Part of speech tagging and lemmatisation for the Spoken Dutch Corpus”. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC 2000) (pp. 1427–1433). Paris: ELRA.
Venier, F. (1996). “I verbi sintagmatici”. In P. Blumenthal, G. Rovere, & C. Schwarze (Eds.), Lexikalische Analyse Romanischer Sprachen (pp. 149–156). Tübingen: Max Niemeyer.
Véronis, J. (Ed.). (2000). Parallel Text Processing: Alignment and Use of Translation Corpora. Dordrecht: Kluwer Academic.
Voghera, M. (1990). Le Strutture dell’Italiano Parlato. PhD thesis: University of Reading.
Voghera, M. (1992). Sintassi e Intonazione nell’Italiano Parlato. Bologna: Il Mulino.
Voghera, M. (1994). “Lessemi complessi: Percorsi di lessicalizzazione a confronto”. Lingua e Stile, 29, 185–214.
Voghera, M. (1996). “Corpora dell’italiano”. Revue Française de Linguistique Appliquée, 1–2, 131–134.
Voghera, M. & Laudanna, A. (2002). “Nouns and verbs as grammatical categories in the lexicon”. Italian Journal of Linguistics, 14(1), 9–26.
Voghera, M. & Laudanna, A. (2003). “Proprietà categoriali e rappresentazione lessicale del verbo: Una prospettiva interdisciplinare”. In M. Giacomo-Marcellesi & A. Rocchetti (Eds.), Il Verbo Italiano. Studi diacronici, sincronici, contrastivi, didattici. Atti del XXXV Congresso Internazionale della Società di Linguistica Italiana (SLI 2001) (pp. 293–307). Roma: Bulzoni.
Walker, M., Passonneau, R., & Boland, J. E. (2001). “Quantitative and qualitative evaluation of DARPA communicator spoken dialogue systems”. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL 2001) (pp. 515–522).
Ward, W. (1991). “Understanding spontaneous speech: The Phoenix system”. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 91) (pp. 365–367). Toronto.
Weisser, M. (2003). “SPAACy: A semi-automated tool for annotating dialogue acts”. International Journal of Corpus Linguistics, 8(1), 63–74.
Wichmann, A., Fligelstone, S., McEnery, A., & Knowles, G. (Eds.). (1997). Teaching and Language Corpora. London and New York: Longman.
Wightman, C. H. (2002). “ToBI or not ToBI”. In B. Bel & I. Marlien (Eds.), Proceedings of the Speech Prosody 2002 Conference (pp. 25–30). Aix-en-Provence: Laboratoire Parole et Langage.
Wikberg, K. (1989). “On the role of the lexical verb in discourse”. In L. E. Breivik, A. Hille, & S. Johansson (Eds.), Essays on English Language in Honour of Bertil Sundby (pp. 375–388). Oslo: Novus Forlag.
Willems, D. (1997). “Histoire, linguistique et sources orales”. Recherches sur le Français Parlé, 14, 10–20.
Wilmet, M. (1997). Grammaire Critique du Français. Louvain-la-Neuve: Duculot and Paris: Hachette.
WinPitch (1996, 2004). http://www.winpitch.com
Young, S. & Matessa, M. (1991). “Using pragmatic and semantic knowledge to correct parsing of spoken language utterances”. In Proceedings of the 2nd European Conference on Speech Communication and Technology (Eurospeech 91) (pp. 223–227). Genoa, Italy.
Zampolli, A. & Ferrari, G. (1979). “Il dizionario di macchina dell’italiano”. In D. Gambarara, F. Lo Piparo, & G. Ruggiero (Eds.), Linguaggi e Formalizzazioni. Atti del Convegno Internazionale di Studi della Società di Linguistica Italiana (SLI) (pp. 683–707). Roma: Bulzoni.
Zanettin, F. (1998). “Bilingual comparable corpora and the training of translators”. In S. Laviosa (Ed.), 616–630.
Zanuttini, R. (1997). Negation and Clausal Structure: A Comparative Study of Romance Languages. New York and Oxford: Oxford University Press.
Zavrel, J. & Daelemans, W. (2000). “Bootstrapping a tagged corpus through combination of existing heterogeneous taggers”. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC 2000) (pp. 17–20). Paris: ELRA.
Note * The bibliographical references use the following abbreviations: coord., coords. = coordinated by; dir. = directed by; org., orgs. = organised by.
In the series Studies in Corpus Linguistics (SCL) the following titles have been published thus far or are scheduled for publication:
1 PEARSON, Jennifer: Terms in Context. 1998. xii, 246 pp.
2 PARTINGTON, Alan: Patterns and Meanings. Using corpora for English language research and teaching. 1998. x, 158 pp.
3 BOTLEY, Simon and Tony McENERY (eds.): Corpus-based and Computational Approaches to Discourse Anaphora. 2000. vi, 258 pp.
4 HUNSTON, Susan and Gill FRANCIS: Pattern Grammar. A corpus-driven approach to the lexical grammar of English. 2000. xiv, 288 pp.
5 GHADESSY, Mohsen, Alex HENRY and Robert L. ROSEBERRY (eds.): Small Corpus Studies and ELT. Theory and practice. 2001. xxiv, 420 pp.
6 TOGNINI-BONELLI, Elena: Corpus Linguistics at Work. 2001. xii, 224 pp.
7 ALTENBERG, Bengt and Sylviane GRANGER (eds.): Lexis in Contrast. Corpus-based approaches. 2002. x, 339 pp.
8 STENSTRÖM, Anna-Brita, Gisle ANDERSEN and Ingrid Kristine HASUND: Trends in Teenage Talk. Corpus compilation, analysis and findings. 2002. xii, 229 pp.
9 REPPEN, Randi, Susan M. FITZMAURICE and Douglas BIBER (eds.): Using Corpora to Explore Linguistic Variation. 2002. xii, 275 pp.
10 AIJMER, Karin: English Discourse Particles. Evidence from a corpus. 2002. xvi, 299 pp.
11 BARNBROOK, Geoff: Defining Language. A local grammar of definition sentences. 2002. xvi, 281 pp.
12 SINCLAIR, John McH. (ed.): How to Use Corpora in Language Teaching. 2004. viii, 308 pp.
13 LINDQUIST, Hans and Christian MAIR (eds.): Corpus Approaches to Grammaticalization in English. 2004. xiv, 265 pp.
14 NESSELHAUF, Nadja: Collocations in a Learner Corpus. 2005. xii, 332 pp.
15 CRESTI, Emanuela and Massimo MONEGLIA (eds.): C-ORAL-ROM. Integrated Reference Corpora for Spoken Romance Languages. 2005. xvii, 303 pp. (incl. DVD).
16 CONNOR, Ulla and Thomas A. UPTON (eds.): Discourse in the Professions. Perspectives from corpus linguistics. 2004. vi, 334 pp.
17 ASTON, Guy, Silvia BERNARDINI and Dominic STEWART (eds.): Corpora and Language Learners. 2004. vi, 312 pp.