Corpus-Based Perspectives in Linguistics
Usage-Based Linguistic Informatics (UBLI)
Volume 6
Corpus-Based Perspectives in Linguistics
Edited by Yuji Kawaguchi, Toshihiro Takagaki, Nobuo Tomimori and Yoichiro Tsuruga
Tokyo University of Foreign Studies
John Benjamins Publishing Company Amsterdam / Philadelphia
The paper used in this publication meets the minimum requirements of American National Standard for Information Sciences – Permanence of Paper for Printed Library Materials, ANSI Z39.48-1984.
Library of Congress Cataloging-in-Publication Data
Corpus-based perspectives in linguistics / edited by Yuji Kawaguchi ... [et al.].
p. cm. -- (Usage-based linguistic informatics, ISSN 1872-2091 ; v. 6)
Includes bibliographical references and index.
1. Corpora (Linguistics) 2. Linguistic analysis (Linguistics) I. Kawaguchi, Yuji, 1958-
P128.C68C66 2007
410--dc22 2007012739
ISBN 978 90 272 3318 9 (Hb; alk. paper)
© 2007 – Tokyo University of Foreign Studies
No part of this book may be reproduced in any form, by print, photoprint, microfilm, or any other means, without written permission from the publisher.
John Benjamins Publishing Co. · P.O. Box 36224 · 1020 ME Amsterdam · The Netherlands
John Benjamins North America · P.O. Box 27519 · Philadelphia PA 19118-0519 · USA
Contents
Message from the President
  Setsuho IKEHATA (President, Tokyo University of Foreign Studies)
Foundations of Center of Usage-Based Linguistic Informatics (UBLI)
  Yuji KAWAGUCHI
1. Workshop on Corpus Linguistics ―Research Domain―
Introduction
  Yuji KAWAGUCHI
Linguistic Atlases ―Objectives, Methods, Results, Prospects―
  Jean-Philippe DALBERA
From the Linguistic Atlas to the Database, and vice versa ―The Corsican Example―
  Marie-José DALBERA-STEFANAGGI
A Usage-based French Dictionary of Collocations
  Peter BLUMENTHAL
Corpus of Old French Literary Texts
  Pierre KUNSTMANN
Building a Large Corpus for Phonological Research ―The PFC Project―
  Chantal LYCHE
Collateral Languages and Digital Corpus
  Jean-Michel ELOY
Parallel and Comparable Corpora ―The State of Play―
  Tony McENERY and Zhonghua XIAO
First Language & Second Language Writing Development of Elementary Students ―Two Perspectives―
  Randi REPPEN
The Uneasy Interface ―Methodological Issues in Using Data from Traditional and Urban Dialectology in (Re-)constructing Sociolinguistic History―
  Tim POOLEY
A Corpus of French Texts with Non-standard Orthography
  Yves Charles MORIN
Resources and Tools for Old French Text Corpora
  Achim STEIN
2. Corpus Linguistics in Linguistic Informatics
Introduction
  Yuji KAWAGUCHI
Transitive Direct, Transitive Indirect and Pronominal Verb Constructions in French ―The Case of approcher―
  Yoichiro TSURUGA
Demonstratives in De Bello Gallico and Li Fet des Romains ―A Parallel Corpus Approach to Medieval Translation―
  Yuji KAWAGUCHI
Patient-Orientedness in Resultative Compound Verbs in Chinese
  Keiko MOCHIZUKI
Corpus Research in Chinese and Its Application to Chinese Language Teaching ―A Case of Localizers in Chinese―
  Takayuki MIYAKE
Rhetorical Questions with Interrogative Markers in Nanai
  Shinjiro KAZAMA
Vacillation in the Selection of Complementizers of Malay Transitive Verbs
  Isamu SHOHO
Voice in Relative Clauses in Malay ―A Comparison of Written and Spoken Language―
  Hiroki NOMOTO and Isamu SHOHO
Testing the Primacy of Aspect and Reverse Order Hypothesis in Japanese Returnees ―Towards Constructing a Corpus of Second Language Attrition Data―
  Asako YOSHITOMI
Corpus-based Analysis of Lexical Errors of Advanced Japanese Learners
  Ayano SUZUKI and Tae UMINO
Syntactic Patterns of Intrasentential Code-Switching in the Discourse of Japanese-English Bilingual Families
  Tomoko TOKITA and Yuji KAWAGUCHI
Index of Proper Nouns
Index of Subjects
Contributors
Message from the President
Setsuho IKEHATA (President, Tokyo University of Foreign Studies)
It is a great honor for our institution to host this workshop and roundtable discussion, and to be able to welcome today ten fine international scholars. We would like to thank everyone who has joined us, especially those who have found the time to travel a great distance to participate.
The 21st Century COE Program began in 2002 as a policy focus of the Japanese Ministry of Education, Culture, Sports, Science and Technology. The objective of this program is to create world-class centers for research and education in many different disciplines at Japanese universities, to raise the level of research, and to foster the growth of creative individuals. The program is aimed at turning our universities into unique, internationally competitive institutions. The Tokyo University of Foreign Studies proposed programs in two fields: the humanities, and interdisciplinary and combined fields and new disciplines. In each of these two fields, one of our programs was selected. In the humanities field, the program selected was the Center of Usage-Based Linguistic Informatics (UBLI), in which all of you are participating today. In interdisciplinary and combined fields and new disciplines, the Center for Documentation and Area-Transcultural Studies (CDATS) was chosen.
The UBLI program was originally created to combine the rapidly advancing fields of computer science, linguistics, and language education into a new discipline called “linguistic informatics”. The program is building this area of study by gathering and analyzing vast amounts of linguistic usage data to reveal how language is actually used, and by applying this new knowledge to language education.
At the same time, we are educating graduate students to be highly knowledgeable in phonetics and linguistics and skilled in computer technology. Through surveys, evaluations, and analyses of resources on the Internet, through work on developing the TUFS Language Modules, and through the analysis and processing of multi-language corpora, they gain a deep understanding of language resource information. We also train our graduate students as fieldworkers by encouraging them to do on-site language field surveys.
For our university, the UBLI program presented many new challenges that we had never faced before. As I mentioned before, one of these was in our work to bring the fields of linguistics, language education, and computer
science together in linguistic informatics. The second challenge was to gain the co-operation and participation of many professors, researchers, graduate students, and postdoctoral researchers. We were fortunate to have strong support from researchers inside and outside the institution, including many from overseas. The third challenge was to increase the international profile of our research activities. This was achieved through extensive participation in international symposia and workshops, and presentations by our graduate students and postdoctoral researchers at international conferences. The publication by the John Benjamins Publishing Company in Holland of a book series based on the program’s research is further proof of our effort in this area. The development of the TUFS Language Modules was a fourth challenge. We publish the results of our research quickly on the web, and seek the opinions of foreign language educators from outside the university. This cycle has been used to improve our modules. The fifth challenge was in our graduate education. More than ever before, we have worked hard to use research as the way to educate our graduate students.
By striving to overcome each of these challenges, the UBLI program has produced many benefits in both research and education. These achievements were possible only with the superb and devoted leadership of Professor Yuji Kawaguchi, and the endless effort of our professors, graduate students, and postdoctoral researchers. The university is extremely proud of all of them. The five-year first stage of the UBLI program will end next March. Although it may be difficult to sustain the same level of funding that has been provided by the government, the university is committed to continuing its funding and supporting further research, so that the program’s activities can be maintained and developed further.
At this roundtable discussion, we welcome your honest evaluations of the work done at the UBLI program, and invite your ideas on the future directions in research. I am positive that today’s meeting will be a productive and enjoyable experience for all of us.
Foundations of Center of Usage-Based Linguistic Informatics (UBLI)
Yuji KAWAGUCHI (Center of Excellence (COE) Program Leader)
1. Linguistic Informatics
The Center of Usage-Based Linguistic Informatics (UBLI), a 21st Century COE Program, was adopted by the Ministry of Education, Culture, Sports, Science and Technology in September 2002 as a five-year plan. The goal of the 21st Century COE Program was to raise the current level of various disciplines in Japan to a point where they could be globally competitive. Subsequently, in addition to achieving international research levels, the emphasis shifted to the development of young researchers. As the name suggests, under the 21st Century COE, Centers of Excellence are to be formed in their respective disciplines through five years of research. Following our 2004 interim assessment, this year will be the final year for the UBLI.
The aim of the UBLI is the systematic integration of computer science with linguistics and language education. The name given to this area of study is “linguistic informatics,” and over the course of five years we will ultimately establish it as a research area. Underlying the design of this COE program was an intent to improve and reform language education in Japan’s higher education, especially at universities. The original starting point was to aim for more efficient multilingual education, bringing language education to the fore through the use of computer technology and advancing educational content through the use of linguistic theory. It would appear that setting this kind of goal is important for the monolingual country that is Japan.
Furthermore, since there are very few situations in which languages other than Japanese are used on a daily basis, it seems that, for the sake of cross-cultural understanding as well, foreign language education needs to be positioned as something beyond simply a means of communication, and cross-cultural understanding needs to be promoted through multilingual education from an early stage.
“Linguistic informatics,” as understood at the UBLI, is an academic field that is distinctly and strongly colored by application and that ultimately leads to improvements and upgrades in language education. Some researchers regard linguistic informatics as a division of applied linguistics
in a broad sense of the word. If this is the case, then why not call it applied linguistics? Why confine it to the term “linguistic informatics”? I would like to begin my explanation from this point.
However, prior to this explanation, there is something which should be brought to the readers’ attention. The term “linguistic informatics” referred to here does not denote natural language processing, machine translation, or computational linguistics. Even supposing for the moment that it is associated with these fields, this would be “linguistic informatics” in the very narrowest of meanings. As I have remarked previously, the goal of the UBLI is not language processing using computer technology or linguistic analysis in itself. We are aiming to determine what should be done to linguistic theory and educational practice, with the aid of computer technology, so that they can better meet the needs of society. This point must first be stressed in order to dispel the misunderstandings related to “linguistic informatics.”
Now then, if we take an overview of the research methods used at the UBLI, we can see that we have corpus linguistics and computational linguistics research. We also have research based on discourse analysis, second language acquisition, and descriptions of language proficiency. Furthermore, we can see that educational technology practices also feature prominently in our research. This shows that “linguistic informatics” is an academic field that has been put together by combining a wide variety of methods and concepts from other disciplines. It also shows that, more often than not, the boundary between this and other disciplines is vague and difficult to draw, and that they share various theories and methodologies. It is precisely in this respect that “linguistic informatics” is confirmed to be an applied discipline.
Since the inception of applied linguistics, scientific research in foreign language education has occupied a central position. Even today, these circumstances remain unchanged. However, in recent years, we have reached a point where, particularly in English and other languages, the results of corpus linguistics, which uses computers to analyze large volumes of linguistic data, are now being applied to language education. At the same time, technology in language education which utilizes the Internet has also been developing at a rapid pace. “Linguistic informatics” is the term with which we have tried to capture this new trend in research. We were still at a stage where no name had been given to the field of research which, based on computer technology, uses corpus linguistics and computational linguistics techniques to analyze data on actual language use and attempts to reflect the results in language education. Although this is “the application of corpus analysis and language-usage analysis to educational practice,” to call this applied linguistics would probably result in the scope of applied linguistics being
restricted to an all too narrow domain. Therefore, at the UBLI, we ventured to call this field of research, which uses computers to analyze language usage and which attempts to link this analysis to more efficient and advanced educational practice, “linguistic informatics.”
2. Fields of Basic Research in Linguistic Informatics
In order to bring the abovementioned field of research to fruition, the UBLI is organized into four research groups (the Linguistic Informatics Group, the Linguistics Group, the Language Education Group, and the Computer Science Group), and each group proceeds with its research while maintaining close coordination with the others. The Linguistics Group uses computers to analyze corpora and conducts research on language use data. The Language Education Group conducts analyses of learner corpora and natural discourse based on second language acquisition theory, and it also conducts research on a descriptive model of language proficiency. The Computer Science Group carries out the design and development of computer-assisted language processing and the e-learning system. Finally, the Linguistic Informatics Group pulls together the basic research conducted by the other three groups, and attempts to make language education more sophisticated and efficient. In this section I will explain how the basic research from each of the Linguistics, Language Education, and Computer Science groups comes together.
2.1. Analysis of Linguistic Usage and Phonetic Analysis
Until now, research in theoretical linguistics has concentrated on the structures of language systems and their functions.
On the other hand, as a result of the rapid progress of computer technology, it is now possible to process language corpora on a massive scale that language researchers could not possibly have handled previously.¹ As research on vast spoken language corpora has advanced, it has become progressively evident that a vast number of linguistic variations exist within language communities that had been presumed to be homogeneous, or that had been studied with heterogeneous elements eliminated from the analysis to a certain degree. By using data on actual language usage to verify phenomena which had been problematic in language research, we have found that natural biases or tendencies can be observed for many linguistic phenomena. It seems that linguistic interest is also in the process of shifting from the language systems and functions themselves toward clarifying the diversity, and associated mechanisms, in the actual realization of those systems and functions. In recent years, the importance of research on linguistic usage has been increasing.
¹ McEnery and Wilson (2003).
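As a rough illustration of how such tendencies can be read off usage data, the following sketch compares the normalized frequency of a word form in a written sample and a spoken sample. It is only a minimal sketch: the two sample texts are invented for illustration and are not UBLI data, and the helper names `tokenize` and `per_thousand` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and return runs of letters as tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def per_thousand(word: str, tokens: list[str]) -> float:
    """Relative frequency of `word` per 1,000 tokens (0.0 for an empty sample)."""
    if not tokens:
        return 0.0
    return 1000 * Counter(tokens)[word] / len(tokens)


# Toy samples, invented for illustration only (assumption, not corpus data).
written = "The committee will consider the proposal and will publish its decision."
spoken = "well I think we will see I think it is fine you know"

written_tokens = tokenize(written)
spoken_tokens = tokenize(spoken)

# Normalizing by sample size makes samples of different lengths comparable.
for word in ("think", "will"):
    print(
        word,
        round(per_thousand(word, written_tokens), 1),
        round(per_thousand(word, spoken_tokens), 1),
    )
```

With real corpora the same normalization would be applied to samples of thousands or millions of tokens, and a difference in rates would be checked statistically before being treated as a genuine tendency.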
Previously, the UBLI published in Japanese Analyses in Sentence Structures in Corpus Linguistics, Working Papers in Linguistic Informatics 3 (September 2004); Lexicon and Grammar in Corpus Linguistics, Working Papers in Linguistic Informatics 7 (October 2005); Corpus Analysis and Linguistic Theory in Linguistic Studies, Working Papers in Linguistic Informatics 11 (July 2006); and Aspects of Corpus Linguistics: Spoken language corpora and written language corpora, Working Papers in Linguistic Informatics 12 (November 2006); and, in English, Corpus-Based Analyses on Sentence Structures, Linguistic Informatics II (April 2004). Several research papers were added to this last collection, and it was published in the spring of 2005 by the Dutch publishing company John Benjamins as Corpus-Based Approaches to Sentence Structures, the second volume in the “Usage-Based Linguistic Informatics” series. Each of these studies analyzes the various forms and syntactic structures which appear in large-scale language corpora. The results from our research on language use will be applied to the Web-based teaching materials developed by the UBLI. Possible examples include the difference in frequencies of verbs in the written language and the spoken language, frequently occurring collocations, the frequency of specific sentence structures, adjectives and their positioning, and the usage of cases. Incidentally, research on the very nature of linguistic data is also vital in corpus analysis. Such research raises the fundamental question of “What is usage in linguistics?”, and at this point language research comes face to face with linguistic variation. Recognition of the importance of usage-based linguistic analysis is reignited, and we should appreciate the significance of using computer science.²
Linguistic symbols are units in which sound and meaning are inextricably linked.
Research on the phonetic aspect of linguistic symbols has occupied an important place ever since linguistics came into being. At the UBLI as well, phonetic and phonological analysis of language use has been carried out in parallel with corpus linguistics. Cross-Linguistic Perspectives in Phonetics: Phonetic Description and Prosodic Analysis, Working Papers in Linguistic Informatics 4, was released in October 2004. Then, in December 2005, Prosody and Sentence Structures, Linguistic Informatics IV, was released; papers by overseas collaborators were later added to this title, and it was released in the spring of 2006 by John Benjamins as Prosody and Syntax Cross-Linguistic Perspectives. At the COE, in addition to research on phonetic and phonological variations of single sounds, research is also being focused on prosodic structure, earlier research outcomes of which are believed to have been poorly reflected in educational practice. Within this research, with regard to Asian languages, the conversation teaching materials developed by the UBLI were treated as phonetic corpora, and their accents and intonations were analyzed.³ These results will also ultimately be used to improve the teaching materials for phonetics.
² Yuji Kawaguchi, “Usage-Based Approach to Linguistic Variation: Evidence from French and Turkish”, Spoken Language Corpus and Linguistic Informatics, Kawaguchi et al. (2006), 247-267.
2.2. Construction and Analysis of Language Corpora
At the UBLI, we have planned the construction of multilingual corpora that meet our research objectives. In particular, in order to analyze actual language usage, it becomes increasingly important to construct natural spoken language corpora. In addition to Japanese, for which corpus construction began in the initial stages after the program was adopted, conversations have been recorded on location since 2004 for French, Spanish, Russian, Malaysian, Turkish, Canadian multilinguals, Chinese and Italian (Salentino dialect). Spoken language corpora of between 10 and 30 hours were constructed. Of these, since comparable spoken language corpora did not exist for Russian, Malaysian, or Turkish, great significance can be found in the very construction of the corpora. Linguistic analyses using these corpora are also underway.⁴ In the Language Education Group, the same spoken Japanese corpus is also being analyzed from the perspective of dialogue analysis, and in particular from that of social psychology. The outcomes have been Natural Dialogue Analysis and Conversation Training. Pursuing the Creation of an Integrated Module, Working Papers in Linguistic Informatics 6 (April 2005) and A Social Psychological Approach to Natural Conversation Analysis, Working Papers in Linguistic Informatics 13 (November 2006).
By comparing the previously developed conversation teaching materials with the spoken Japanese corpus, detailed analyses were carried out from such perspectives as discourse function and politeness. These analyses are the basic research for implementing natural conversation in conversation teaching materials.⁵
³ Prosody and Syntax Cross-Linguistic Perspectives, Yuji Kawaguchi et al. (2006), intonation analysis of Indonesian, Filipino, Turkish, and Japanese.
⁴ For instance, Selim Yılmaz, “Viewpoint and Postrheme in Spoken Turkish”, Spoken Language Corpus and Linguistic Informatics, Kawaguchi et al. (2006) John Benjamins, 269-286, and Isamu Shoho, “Nonreferential Use of Demonstrative Pronouns in Colloquial Malay”, op. cit., 287-301.
⁵ See Usami (2004) for the necessity of analyzing natural conversation.
If the abovementioned corpora are, in a broad sense of the term, language function-specific corpora (specific to grammar function and discourse function), then at the UBLI we have also constructed research objective-specific corpora. Among the more important of these are learner language corpora, which in recent years have been gaining attention in research on second language acquisition.⁶ Recently, the importance of combining native corpora and learner language corpora has been recognized. While native corpora have led to increased accuracy in teaching material descriptions, at the same time, the development of teaching materials and curricula that use learner corpora is being sought in order to bring about more effective language education. In this way, the construction and analysis of learner corpora can be said to be an important field of research for applying the results of the analysis of language corpora to educational practice, and for striving for efficient language education;⁷ and it can be thought of as one of the major areas of study in linguistic informatics. Previously, the Language Education Group announced its basic research on learner corpora for Japanese learners of English in the December 2004 publication Second Language Pedagogy, Acquisition, Evaluation, Working Papers in Linguistic Informatics 5. During 2006, the group also published its findings on learner corpora for the Japanese and English languages; see Course Materials, Evaluation, Second Language Acquisition (SLA), Working Papers in Linguistic Informatics 10 (March 2006) and SLA-Based Language Teaching and Language Assessment, Working Papers in Linguistic Informatics 14 (November 2006). Furthermore, in March 2006, the Language Education Group also published Linguistic Informatics V, Studies in Second Language Teaching and Second Language Acquisition.
Papers by overseas collaborators were added to this title, and it was released in the summer of 2006 by John Benjamins as Readings in Second Language Pedagogy and Second Language Acquisition in Japanese Context.
⁶ For instance, Part I by Sylviane Granger, Granger et al. (2002) for the necessity of learner language corpora. Part II and Part III contain arguments related to an analysis of the interlanguage and research on foreign language education using learner language corpora.
⁷ Hunston (2002) 206-212.
3. Computer Science and TUFS Language Modules
As was remarked at the outset, “linguistic informatics” is the academic field that attempts to integrate linguistic theory and educational practice on a computer science base. Following the trend of recent years toward developing language teaching materials that use networks, the UBLI has been developing “TUFS Language Modules,” Web-based language teaching materials covering 17 different languages. Not only do these
teaching materials use the latest techniques available in educational technology,⁸ but they are structured with content to which linguistic theory has been applied, and it could be argued that they are the most visible academic results of “linguistic informatics.” One of the TUFS Language Module types, the “Pronunciation Module,” was released in 2003 in 12 languages, including English, German, and French. In 2004, “Dialogue Modules” were published in all 17 languages: English, German, French, Spanish, Portuguese, Russian, Chinese, Korean, Mongolian, Indonesian, Filipino, Laotian, Cambodian, Vietnamese, Arabic, Turkish, and Japanese; and one of these was implemented in some undergraduate courses of the Faculty of Foreign Studies at TUFS. Then in the spring of 2006, “Grammar Modules” were released in 11 languages: German, French, Spanish, Russian, Chinese, Mongolian, Filipino, Cambodian, Vietnamese, Turkish, and Japanese; and “Vocabulary Modules” were released in 11 languages. Amongst these, the Web-based teaching materials developed for Mongolian, Laotian, and Cambodian were world firsts.
As was remarked at the outset, the UBLI program is a project aimed at innovation in foreign language education in higher education in Japan. Therefore, the language teaching materials, other than those for English, are envisaged mainly as teaching materials for university students learning a new foreign language for the first time.
As their name suggests, TUFS Language Modules were designed based on a “module-type notion.” Specifically, the idea is that they are divided into four types of modules (pronunciation, dialogue, grammar, and vocabulary), and while each module is independent to a degree, they come together to form a cohesive set of teaching materials. In this sense, it could probably be argued that TUFS Language Modules take a perspective which focuses on structure.
Naturally, since they are Web-based teaching materials, they can be modularized, and consequently, they can be more efficiently corrected and revised. Furthermore, by utilizing hyperlinks, they are capable of providing a sense of unity. These benefits are the reason why module-type teaching materials were adopted.
A structuralist view of language is reflected in the module structure comprising four parts: pronunciation, dialogue, grammar, and vocabulary. One linguistic universal is the “double articulation of language.” Let us suppose there is an expression which appears in a dialogue, for example the French greeting “Salut, ça va?” (Hello. How are you?). First, the expression is divided by the first articulation into the smallest linguistic signs having meaning. In this case there are three: SALUT, ÇA, and VA. Next, SALUT, for example, is further divided by the second articulation into /saly/, which is a combination of four phonemes. According to this hypothesis, both levels of articulation form parts of the linguistic structure while functioning independently from each other; or, expressed in linguistic terms, the monemes and phonemes form a linguistic structure while remaining mutually independent. Some correlations can be drawn to the module-type linguistic view. The following section briefly describes each of the pronunciation, dialogue, grammar, and vocabulary modules.
⁸ For example, in conversation teaching materials, XML technology which supports UTF-8/16 is implemented, and then using a program, the XML data and the audio and video data are synchronized through an MXSML server with JavaScript. See Lin et al. (2004). The same text also refers to the development of TUFS Language Modules in general.
TUFS Language Modules
http://www.coelang.tufs.ac.jp/modules/index.html
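The double articulation example given above (“Salut, ça va?”) can be sketched in a few lines of code. This is only an illustration under stated assumptions: monemes are approximated by orthographic words, the function names are hypothetical, and of the three phoneme entries only /saly/ for SALUT comes from the text, while the entries for ÇA and VA are supplied here for the sake of the example.

```python
# Toy lexicon for the second articulation. Only SALUT -> /saly/ is taken
# from the text; the entries for ÇA and VA are assumptions for illustration.
PHONEMES = {
    "SALUT": ["s", "a", "l", "y"],
    "ÇA": ["s", "a"],
    "VA": ["v", "a"],
}


def first_articulation(utterance: str) -> list[str]:
    """Split an utterance into monemes (approximated here by written words)."""
    cleaned = "".join(ch if ch.isalpha() else " " for ch in utterance)
    return [word.upper() for word in cleaned.split()]


def second_articulation(moneme: str) -> list[str]:
    """Split a moneme into its phonemes by looking it up in the toy lexicon."""
    return PHONEMES[moneme]


monemes = first_articulation("Salut, ça va?")
analysis = {moneme: second_articulation(moneme) for moneme in monemes}

for moneme, phonemes in analysis.items():
    print(moneme, "/" + "".join(phonemes) + "/", len(phonemes))
```

The two functions do not depend on each other, which loosely mirrors the point made in the text: the level of meaningful units and the level of phonemes each contribute to the structure while remaining independent.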
3.1. Pronunciation Modules
The Pronunciation Modules consist of a “practical course” and a “theoretical course.” Each course was developed based on a different design concept. In the “practical course,” the design of the teaching materials aims to be
as user-friendly as possible. For this reason, ordinary day-to-day vocabulary and expressions are used as examples, and a phonetic view which contrasts the language with Japanese is introduced. The course is devised so that learners pick up pronunciation through practice and training. Furthermore, three stages are envisaged in the procedures for acquiring speaking and listening skills, and the learning of pronunciation proceeds according to those stages. The first stage emphasizes the correctness of individual sounds, the second stage places emphasis on smoothness, and the third stage pursues the peculiarities of the individual language and fluency. Naturally, it would not do to have the same acquisition procedures for all 17 languages. While there are some languages in which effort is placed on acquiring segmental sounds, there are other languages for which time is spent on practicing suprasegmental features. Nevertheless, having experts estimate the acquisition procedure they believe to be ideal for each language will lead to teaching resources in which the phonetic qualities of each individual language are highlighted.
The “theoretical course” provides self-study material so that people who have already completed the “practical course,” or who already have knowledge of the language, can increase their skills. It is supposed that this course will be used as supplementary teaching material in a university course, for example, and technical terms and the IPA (International Phonetic Alphabet) are used so that the phonetic and phonological basics of the language can be learnt. It is generally said that instruction in pronunciation needs to be adjusted to the age of the learner. However, since these modules target university students, who have passed the critical period of language acquisition, the instruction in pronunciation has not been limited to practical aural comprehension alone.
It also includes the acquisition of phonetic and phonological knowledge, such as the use of minimal pairs, links with the IPA Module, the fluctuation of phonemes, neutralization, and the functional load of phonemes. Furthermore, it refers to the demarcative, contrastive, and enunciative functions of suprasegmentals such as rhythm, accent and intonation; and by linking to the Dialogue Module, learners are able to practice example sentences which are uttered in more natural spoken environments. The theoretical course has been incorporated into undergraduate coursework in French since 2005, and evaluations of both the teaching materials and the learning have been conducted.
3.2. Dialogue Modules
Whereas the Pronunciation Modules have been developed with a focus on form, the Dialogue Modules are teaching materials which, rather than being based on the form of a language, emphasize language proficiency and, more particularly, communicative competence. The
context in which dialogues are formed, discourse strategies, and so on take more of a focus-on-meaning stance. The syllabus employed for this reason is a notional and functional syllabus, which is widely used in language education for communication. Of these, the focus in the Dialogue Modules is particularly on the communicative function. In developing the Dialogue Modules, CALL teaching materials for five languages (German, French, Spanish, Portuguese, and Chinese) were surveyed. Based on the results, the material was organized into the functional classifications proposed by Wilkins (1976) and van Ek (1990), and ultimately 40 fundamental language functions, such as “greeting someone” and “thanking someone,” were selected.9
There are two courses in the Dialogue Modules currently released: “for lesson use” and “for student use.” Both have been developed using the same content, but the design of the respective teaching materials differs greatly. The “for lesson use” pages assume use in a university course or a similar setting. The screens are designed to be all-encompassing. In the classroom, the instructor analyzes the needs of the learners, and learning commences from the corresponding language functions. First, the dialogue is played, and the students are asked to imagine the context of the speech. Next, they learn by role-playing the entire dialogue. At this point it is important that students are made aware that the aim of the learning is discourse, rather than individual sentences of the target language. It is also important that the students are provided with language resources which are as real as possible. At the UBLI, basic research is proceeding to incorporate natural discourse in the Dialogue Modules.10 It could be said that the “for lesson use” pages are teaching materials for classroom practice based on a communicative approach.
The “for student use” pages have been designed so that learners can acquire the four language skills of listening, speaking, reading, and writing on their own. Four learning models were established so that learning could take place from various angles, according to the goals of the learner. Model 1 is “Listening and Speaking (Role Play)” where a student assumes one of the roles and practices the dialogue. If a learner can respond immediately and without looking at the text, then this will lead to improved speaking ability. Model 2 is “Reading and Speaking (Reading Aloud)” where students practice reading in time with the sounds that they hear. Model 3 is “Listening
9 Yuki, Abe and Lin (2005) 339-342.
10 Usami, Mayumi (ed.) (2005) Natural Dialogue Analysis and Conversation Training: Pursuing the Creation of an Integrated Module (in Japanese), Working Papers in Linguistic Informatics 6, the 21st Century COE “Usage-Based Linguistic Informatics”, Graduate School of Area and Culture Studies; see also Suzuki, Matsumoto and Usami (2005).
and Writing (Dictation)” where students practice writing down what they hear as they listen to speech. This model is effective in improving a learner’s listening comprehension. Finally, Model 4 is “Reading and Writing (Copying)” where students practice transcribing the text that they read. With the “for student use” pages, because the acquisition procedures for the target language are made clear, learners can proceed with their study while being conscious of their target language skills. In addition, for the English Dialogue Module, a Teacher’s Manual has been prepared which offers detailed commentary for use in the classroom as well as examples of task-based instruction.11 It is believed that devising similar manuals for other languages in the future will improve the learning environment of the Dialogue Modules.
3.3. Grammar Modules
In the past, it was common for grammar instruction syllabi to contrast the native language and target language, and to be based on experiential intuition. Even today, this situation has barely changed. It is extremely difficult to find a grammar syllabus which is believed to clearly enhance learning effects. Nevertheless, for English, it had already been demonstrated during the 1980s that acquiring grammar according to an acquisition order was faster than learning naturally.12 However, in order to design syllabi based on similar experimental studies for the 17 languages covered by the UBLI, much more long-term and steady research is still required. Consequently, the design of the current Grammar Module concentrates on morphological and syntactic commentary with a focus on the form of the language. Naturally, grammatical items are arranged with consideration given to difficulty and practicality. Furthermore, example sentences which serve only for explanation have been excluded, and examples which are close to actual language use are included.
In some languages, the module has been designed so that learners can be aware of misuses and correct usages, and there is even commentary on frequency of actual usage and biases in sentence structures.13 Although the course titles vary depending on the language, by establishing several study courses, such as the “Ability Development Course” and the “Standard Course,” the teaching materials take some account of the procedures for grammar acquisition. Also, by arranging the same speech on the cards, commentary and example sentences,
11 Yoshitomi, Asako (ed.) (2004) Eigo for KIDS: Eigo de Hanaso! Teacher’s Manual. A learner’s manual was released for a Japanese conversation module in December 2006.
12 Ellis (1989).
13 Biber, Conrad and Leech (2002) Longman Student Grammar of Spoken and Written English presents a single-language scale, but to realize this for 17 languages would not be easy.
the modules have also been designed so that the input to the learner is as great as possible. Although these kinds of devices are included, it could be said that the Grammar Module is by and large based on traditional syllabi.
3.4. Vocabulary Modules
The Vocabulary Modules contain between approximately 500 and 900 basic vocabulary items, although the numbers vary depending on the language. The vocabulary from Level 4 of the Japanese Language Proficiency Test forms a common basis, and vocabulary specific to each language has been added. Vocabulary search and synonym search interfaces have been built in, meaning that the module can also be used as a rudimentary dictionary of basic vocabulary. The semantic categories for the vocabulary have been based on Bunrui Goihyo, Lexical Taxonomy of Japanese, elaborated by the National Institute for Japanese Language.14
According to the theory of universal grammar, it is supposed that all languages possess the same grammatical basis, and that in language acquisition only lexical learning plays a part, so there is no need for structural learning.15 If we subscribe to this viewpoint, then lexical learning is purely the accumulation in memory of a stock of elements used in language. Leaving this radical hypothesis aside, in reality, language acquisition involves structural learning as well as lexical learning. Furthermore, in recent years, it has been determined that, rather than learning individual and isolated vocabulary, comparatively large amounts of vocabulary can be learnt in a short time by repeatedly and intentionally learning meaningful groups of words, called “chunks.” In addition to the elementary dictionary-type functions mentioned above, exercises have been set up so that learners can learn about 200 basic words. In lexical learning, it seems that rather than learning the individual words by rote, it is important to learn by supposing a systematic network between the individual words.
To this end, two types of study courses have been established in the exercises: “learning by situation,” such as overseas travel, sports, and roads; and “learning by semantic category,” such as adjectives, and things worn or carried and associated actions. This should enable words to be grouped and learnt as vocabulary networks.16 Furthermore, the Vocabulary Modules are closely linked to the Pronunciation, Dialogue, and Grammar Modules, so that
14 Bunrui Goihyo (2003) Revised Edition, The National Institute for Japanese Language.
15 See the critique in 1. Grammar, Radford (1997).
16 See for instance LNT (Lexical Network Theory) in Norvig and Lakoff (1987).
students can learn the vocabulary by checking the pronunciation of the vocabulary they have learnt, learning its grammatical characteristics, and matching the vocabulary to dialogue situations.
4. Multilingual Learning in a Ubiquitous Environment
The TUFS Language Modules assume that learners can understand Japanese. Conversely, what if the Japanese teaching materials in the TUFS Language Modules could be studied in various other languages? The TUFS Language Modules (multilingual version) attempt to achieve this. At the UBLI, we have been developing modules on a trial basis for non-native speakers of Japanese to learn Japanese. In spring 2006, Pronunciation Modules and Dialogue Modules were set up so that persons who understand English, French, Chinese, Korean, Mongolian, or Turkish could learn Japanese; see http://www.coelang.tufs.ac.jp/english/module/ for details.
4.1. From Linguistic Theory to Education
What exactly does it mean to apply linguistic theory to educational practice using computer technology? Two instances that represent this in tangible form are the “IPA (International Phonetic Alphabet) Module” and the “Cross-Linguistic Grammar Module.” During development of the Pronunciation Module, while aiming for the acquisition of speech and phonetic knowledge in the various languages, the use of IPA as a cross-linguistic phonetic notation was considered. At the same time, development of the “IPA Module” began. The IPA Module contains specialized knowledge which is essential not only in the learning of foreign languages, but also in learning phonetics. Furthermore, as a result of developing the theoretical course in the Pronunciation Module, a link with the IPA Module became possible, and phonetic and phonological theory could be applied to the educational practice of individual languages. The current IPA Module was developed based on the 1996 revised version.
An image and explanation of the vocal organs have been developed, and lists of vowels, consonants (pulmonic), and consonants (non-pulmonic) are presented. By clicking on any of the phonetic symbols, users can hear the sound of that symbol. Other phonetic symbols, diacritics, and symbols for suprasegmentals are also listed. Both the Japanese and English versions of the IPA Module have been released; see http://www.coelang.tufs.ac.jp/ipa/ (Japanese) or http://www.coelang.tufs.ac.jp/ipa/english/index.htm (English).
The “Cross-Linguistic Grammar Module”17 is a little removed from the practical interest of acquiring the four skills necessary for communication in an individual language. It deals with the study of general linguistic issues. This module is for learners to acquire expertise related to grammar research. Study courses are set up in the Cross-Linguistic Grammar Module, as in the Grammar Module. The learning objective is to take a broad cross-linguistic view of the “general grammatical characteristics” which are shared across various languages, by taking grammatical items as examples from the Grammar Modules for those languages. What is called “grammar” in the teaching materials for the various languages varies widely. By taking a cross-sectional view of a certain grammar entry across more than ten languages, in addition to pursuing a contrastive linguistic interest, we can also give consideration to what “grammar” really is, and even what human language is. Two study courses are available: the “Step by Step Course,” in which the arrangement of components in a sentence is studied in an orderly sequence; and a “Functional Course,” in which characteristic expressions of each language are compared by predicate function, conative function, presentational function, and so on.
4.2. Second Language Acquisition Research
A difference between second language acquisition by university students, who have passed the critical stage, and the acquisition of a first language by young children over a long period of time, is that, in general, the period of acquisition for university students is short. For this reason, the social context during acquisition may have a significant impact. Previously, second language acquisition by university students had been centered around instructors in lessons at university. However, as a result of Internet technology and other innovations, second language acquisition is no longer bound to the classroom.
Now it has become possible for such acquisition to be centered around the learners, and for it to occur at places other than the classroom. In response to these demands of the times, the TUFS Language Modules were registered as e-learning materials at the Tokyo University of Foreign Studies from 2005. All students at the university can now log in to their own account, and can study anywhere, at any time. They can also check their study history. With some languages, the modules are being used as teaching materials or supplementary teaching materials in an undergraduate course, but most students use them as
17 For the concept and development of the “Cross-Linguistic Grammar Module,” see Makoto Minegishi, “Developing Grammatical Modules Based on Linguistic Typology”, Spoken Language Corpus and Linguistic Informatics, Kawaguchi et al. (2006) John Benjamins, 331-348.
self-study materials within the e-learning system. University students, who have a mature character and who are well-educated, are able to take responsibility for their own learning. In other words, we should be able to regard second language acquisition for these students as autonomous learning. Within each of the TUFS Language Modules, the stages of learning are shown using sections or steps, and the path for learning guidance designed by the developers is clearly expressed. However, in network-mediated autonomous learning, there is no guarantee that learning will proceed according to that path. For example, in implementing autonomous learning in a classroom, the instructor gives consideration to the beliefs of the learners and to learning strategies, and because the instructor knows the characteristics of the individual learners, motivation should be enhanced and learning should be efficient. However, making network-based language learning autonomous is not all that easy. While there are some learners who can study autonomously, there are others who cannot. There is also a myriad of strategies, and it appears to be difficult for learners to select a strategy that suits them and then to monitor themselves. Although it may seem true that Web-based teaching materials, such as the TUFS Language Modules, provide learners with an unrestricted learning environment and enhance the possibilities of language learning, for effective autonomous learning to proceed in such an environment there needs to be a system that will assist with setting goals and managing the progress of learning, with selecting teaching materials, and with pointing learners in the right direction.
In the case of learning based on the e-learning system which is currently being run at the university, since the Language Modules are positioned as supplementary teaching materials, instructors are able to point students (learners) in the right direction, and it is possible to manage the progress of their learning. On the other hand, in the case of autonomous learning, learners need to be able to evaluate their own learning achievements, in place of classroom tests. In 2004, the Language Education Group conducted a questionnaire survey to evaluate the teaching materials in the Pronunciation Modules.18 Then, in 2005, this was extended, and a detailed evaluation of teaching materials was conducted for the Dialogue Modules. Furthermore, a large-scale questionnaire was conducted on the degree to which the descriptions of language proficiency listed in the “Common European Framework of Reference for Languages (CEFR)” apply to Japanese learners. The results and observations from the questionnaire were published in Development of Teaching Materials,
18 For further details, see Second Language Pedagogy, Acquisition, Evaluation, the 5th volume of the Working Papers in Linguistic Informatics, Chapter 2: Evaluation, 35-102.
Evaluation, Second Language Acquisition, Working Papers in Linguistic Informatics 10, published in March 2006. Eventually, the plan is to present the descriptions of language proficiency for Japanese learners in the form of a Can-Do list. Based on these descriptions, it will in the future be possible, to a certain degree, to set language proficiency levels for the 17 languages using the TUFS Language Modules. For the correlation between the TUFS Language Modules and CEFR, see SLA-Based Language Teaching and Language Assessment, Working Papers in Linguistic Informatics 14 (November 2006), 79-194.
5. Lectures, Workshops and International Conferences
As was remarked at the outset, the primary goal of the 21st Century COE Program was to select outstanding research projects in various academic fields and, by appropriating an ample budget to each project over five years, to form world-class research centers and raise the standard of Japan’s research organizations. To this end, from the initial stages, the UBLI had planned two international conferences. Also, since 2002, we have invited numerous researchers from the fields of linguistics and language education to give lectures. Copies of the lectures have been published in Linguistics, Applied Linguistics, Information Technology, Working Papers in Linguistic Informatics 2, in March 2004, and in the 9th volume of the series, Symposium, Lecture, Research Report, in February 2006.
5.1. The First International Conference on Linguistic Informatics
The outline of the first international conference was decided at the end of 2002, when the 21st Century COE was adopted. Subsequently, the First International Conference on Linguistic Informatics was held at the Tokyo University of Foreign Studies over the two days of December 13 and 14, 2003. At the first conference, research papers were presented on two key research domains.
The first key domain was research typified by computer-assisted linguistics and corpus linguistics. Needless to say, when researching linguistic structures in detail, computer-assisted corpus analysis is essential. On the premise of computer-assisted language processing, a schema is produced based on linguistic analysis, and it is important to remember that this is applied to the language processing platform.19 This can be argued to be a field of research that is built on a collaboration between linguistic theory and computer science. Furthermore, as indicated by Francisco Moreno-Fernández, when building a language corpus, we must consider what kind of language data will be represented by the corpus.20 Judging just from the reports at the first conference, there is a wide range of genres and individual characteristics in language corpora, such as medieval literature, bilingual databases, linguistic atlas data, workplace language, and natural dialogue. It can be said that the first conference presented the concept of linguistic analysis using various corpora.
As a result of the deepening of corpus analyses in recent years, it is now known that numerous linguistic variations can be seen in language communities. It could probably be argued that we can no longer understand the reality of language usage without directing our attention to linguistic variation. In general, when researching linguistic variation, two different approaches are envisaged. The differences between the two approaches lie in how register, style, gender, age, interpersonal relationships and other social contexts are perceived. First, there is the perspective in which social context is thought to be an independent variable. If this standpoint is adopted, the social variables are connected to linguistic variation, and the relationship between the two becomes a matter for quantitative analysis. Let us call this the quantitative approach. In contrast, there is the viewpoint in which social context is not regarded as an independent variable, and social context and linguistic variation always form a set. From this viewpoint, the task becomes describing the linguistic variation and its context in as much detail as possible. Let us call this the qualitative approach. For example, the research by Kanetaka Yarimizu et al. on standardization in the linguistic atlas of regional French is a typical example of the quantitative approach.21 In this analysis, the social context of the geographical expanse of standardization is believed to be a variable that is independent of standard form.
19 For instance, Christian Leclère, “The Lexicon-Grammar of French Verbs: a syntactic database”, in Yuji Kawaguchi et al. (2005) 29-45.
On the other hand, the socio-pragmatic analysis of workplace language by Janet Holmes could be said to be a qualitative approach.22 This is because, in a given workplace, learning by experience the verbal behavior and techniques appropriate for that workplace means that such behavior is integrated into the social context of the workplace, and so the social context and verbal behavior form one set. The other key domain of the first conference was the studies
20 McEnery and Wilson (2003) 77-81 and Francisco Moreno-Fernández, “Corpora of Spoken Spanish Language. The Representativeness Issue”, in Kawaguchi et al. (2005) 120-144.
21 Yarimizu Kanetaka, Yuji Kawaguchi, Masanori Ichikawa, “Multivariate Analysis in Dialectology: A Case Study of the Standardization in the Environs of Paris”, in op. cit., 99-119.
22 Janet Holmes, “Socio-pragmatic Aspects of Workplace”, in op. cit., 196-220.
surrounding the relationship between linguistic theory and second language acquisition. Mayumi Usami’s assertion that the analysis of natural dialogue is necessary for developing conversation textbooks is a prime example of this. Development of the TUFS Language Modules would also have been impossible without the foundations of linguistic theory and second language acquisition. As mentioned earlier, the Pronunciation Modules have been designed based on phonetic and phonological theory, and the Dialogue Modules are built on a notional and functional syllabus.
A large number of researchers were invited from research organizations in Japan and abroad to attend the first conference. Over the two days, the total attendance was 300, and there was a lively exchange of opinions. Several reports were also presented by postgraduate students of the university. The collection of reports from the First International Conference on Linguistic Informatics was published by John Benjamins in the spring of 2005 as the first volume in the “Usage-Based Linguistic Informatics” series.
As a result of this first conference, the concept of linguistic informatics, the construction of which is our aim, became progressively clearer. First, I would like to define what is meant by language usage that is ultimately applied to educational practice. There are four types of Language Modules: pronunciation, grammar, vocabulary and dialogue. In the Pronunciation Modules, it would appear that data on social phonetic variations should be applied.23 In the Grammar Modules and the Vocabulary Modules, it is important to be able to refer to how the grammatical rules and vocabulary are related to actual usage. For example, Biber, Conrad and Leech (2002) provide us with one model. When explaining grammatical items, they comment on how frequently each item appears in the language use of specific genres, and on what kinds of characteristic collocations there are.
However, to achieve this for 17 languages would not be an easy feat. Even for a language like French, which is taught throughout the world, there are no syllabi based on linguistic usage for the negative “...pas” or the personal pronoun “on,” which demonstrates that the adoption of such a point of view is imperative. If usage data is applied to the Dialogue Modules, then
23 In the French Pronunciation Module (theoretical course), as phonemes are introduced, reference has been made to the fluctuation of phonemes, and a basic explanation has been given of minimal phonetic variation. It may be ideal for learners to be able to learn about geographic and social variations, even though there is a possibility that the content would become inappropriate as teaching material for beginners. At present, for languages in which several norms are envisaged, such as European Portuguese and Brazilian Portuguese, two separate Pronunciation Modules are developed, each with its own pronunciation description.
that will inevitably be natural dialogue data.
The Linguistics Group at the UBLI has conducted analysis of linguistic usage based on corpora, including much basic research on grammatical phenomena. However, most of that analysis concerns corpora of written language, and spoken language has been in the minority. It has only been in recent years that research on spoken language corpora has been conducted in earnest.24 Even if we consider the shift from listening skills to reading and writing ones, both written language and spoken language are important in second language acquisition. Listening skills and reading skills are not in some kind of subordinate-superior relationship. There are instances of learners who have only undertaken training in listening and who develop reading skills naturally. There are even hypotheses which assert that both skills are interrelated language processing mechanisms. Since the early stages of the adoption of the 21st Century COE program, the UBLI has conducted field surveys, and has built spoken language corpora for French, Spanish, Italian (Salentino dialect), Russian, Malaysian, Turkish, Chinese (Taiwan), Japanese, and Canadian Multilinguals (Vancouver). Since building corpora involves persistent work that requires many long hours, such as transcription, it has really only been recently that we have been able to analyze the spoken language corpora. Research on learner corpora in Japanese and English has also been ongoing, but this too is a field of research that has only begun quite recently.25
In January and October of 2005, a symposium and a national conference were held at the Tokyo University of Foreign Studies. Round-table discussions with COE program promoters were held, and there was open debate on how linguistic theory and educational practice should be integrated.26 As a result of these two discussions, the direction for forming a
24 In the case of French, it was 1987 when transcription guidelines were proposed for the construction of spoken language corpora, and it was in 1990 and beyond that studies analyzing spoken language first began to be published. Claire Blanche-Benveniste et C. Jeanjean (1987) Le français parlé: transcription et édition, Paris: Didier Érudition and Claire Blanche-Benveniste (ed.) (1990) Le français parlé: études grammaticales, Paris: Éditions du C.N.R.S.
25 Interest in actual research into classroom discourse had been expressed previously as part of lesson analyses, etcetera, but it was not until the 1990s that learner language corpora were built as part of second language acquisition research and foreign language pedagogy, that empirical research using these corpora began, and that their importance was first recognized.
26 For details on the symposium, see “Is the Integration of Linguistic Theory and Language Education Possible?”, Symposium, Lecture, Report, Working Papers in Linguistic Informatics 9, 124-139. For details on the national conference, see Susumu Zaima, “German Language Research Methodology Based on Language Use — Language Use, Application and Evaluation —”, Spoken Language Corpus and Linguistic Informatics, Kawaguchi et al. (2006) John Benjamins, 309-329.
center for linguistic informatics became clearer. In other words, it was believed that the subject of linguistic informatics research could be defined more rigorously by further promoting corpus linguistic analysis centered around written languages, and by inviting overseas researchers conducting cutting-edge research on the analysis of spoken language and discussing with them the significance of research on spoken language corpora, discourse analysis of corpora, the construction of learner language corpora and the application of research findings to language education.
5.2. Workshops and The Second International Conference on Linguistic Informatics
On December 9, 2005, a workshop entitled “Spoken Language Corpora — its Significance and Application —” was held in conjunction with C-ORAL-ROM, a consortium researching the spoken Romance languages. In addition to a report on the results of the C-ORAL-ROM project,27 Emanuela Cresti compared language processing at the UBLI and at C-ORAL-ROM.28 From the UBLI, reports were presented on research into spoken language corpora for Malaysian, Turkish, and Japanese. On the following day, December 10, the Second International Conference on Linguistic Informatics was held. Following lectures on the state of grammatical research into spoken language, the pragmatic analysis of spoken language, and the application of spoken language corpora to education, a general discussion was held between the lecturers and the C-ORAL-ROM members.29 There was an audience in excess of 300 people over the two days, and the meeting was a success. In this way, the research subject of a center for linguistic informatics, which attempts to use computer technology to apply linguistic theory to education, was clarified even further.
In other words, the most important academic contribution of linguistic informatics is the analysis of the various linguistic variations and discourse functions that appear in actual usage, using written language and spoken language corpora on the basis of computer science and corpus linguistics. At the same time, learner language corpora would be constructed and analyzed. Then, by incorporating the findings of these analyses into educational practice, a more efficient and advanced language education could be achieved. This is the
29
For further details, see Cresti and Moneglia (2005). Emanuela Cresti, “Some Comparisons between UBLI and C-ORAL-ROM”, Spoken Language Corpus and Linguistic Informatics, Kawaguchi et al. (2006) John Benjamins, 125-152. For the report on The Second International Conference on Linguistic Informatics, see Chapter 1 of Spoken Language Corpus and Linguistic Informatics.
Usage-Based Linguistic Informatics
23
goal of linguistic informatics. In this instance, the term “educational practice” refers to the class-mediated and network-mediated learning in the TUFS Language Modules. The UBLI has received research grants over the five years from 2002. This year is the final year. We invited experts in the fields of corpus linguistics, analysis of linguistic usage, and language education from overseas and the Second Workshop on Corpus Linguistics — Research Domain — was held on September 14, 2006. And on September 15, in order to have these five years of research objectively evaluated, round-table discussions were held with the COE program promoters, to examine the formation and academic results of the Center for Usage-Based Linguistic Informatics (UBLI). By being objectively evaluated by overseas experts while acknowledging critical comments and recommendations, it is my hope that, together with the Tokyo University of Foreign Studies, the formation of the UBLI was the first step toward international recognition as a research center of global standards. Tokyo, December 24, 2006 References Biber, Douglas, Susan Conrad and Geoffrey Leech (2002) Longman Student Grammar of Spoken and Written English, Essex: Longman/Pearson Education Limited. Brugman, Claudia and George Lakoff (1988) “Cognitive topology and lexical networks”, Steven Small et al. (eds.) Lexical Ambiguity Resolution, Morgan Kaufman, 477-508. Cresti, Emanuela and Massimo Moneglia (2005) C-ORAL-ROM : integrated reference corpora for spoken Romance languages, Amsterdam/Philadelphia: John Benjamins. van Ek, Jan (1980) Threshold Level English, Oxford: Permgamon Press. Ellis, Rod (1989) “Are classroom and naturalistic acquisition the same ?: A study of the classroom acquisition of German word order rules”, Studies in Second Language Acquisition, 11, 305-328. Granger, Sylviane, Joseph Hund and Stephanie Petch-Tyson (Eds.) 
(2002) Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching, Amsterdam/Philadelphia: John Benjamins. Hunston, Susan (2002) Corpora in Applied Linguistics, Cambridge Applied Linguistics, Cambridge: Cambridge University Press. Kawaguchi, Yuji, Susumu Zaima, Toshihiro Takagaki, Kohji Shibano and Mayumi Usami (2005) Linguistic Informatics State of the Art and the Future, Usage-Based Linguistic Informatics 1, Philadelphia/Amsterdam, John Benjamins.
Kawaguchi, Yuji, Ivan Fónagy and Tsunekazu Moriguchi (2006) Prosody and Syntax Cross-Linguistic Perspectives, Usage-Based Linguistic Informatics 3, Philadelphia/Amsterdam: John Benjamins.
Kawaguchi, Yuji, Susumu Zaima and Toshihiro Takagaki (2006) Spoken Language Corpus and Linguistic Informatics, Usage-Based Linguistic Informatics 5, Philadelphia/Amsterdam: John Benjamins.
Koike, Ikuo (ed.) (2003) Oyo Gengogaku Jiten (A Dictionary of Applied Linguistics), Kenkyusha.
Koike, Ikuo (ed.) (2004) Daini Gengo Shutoku Kenkyu no Genzai (Current Status of Research into Second Language Acquisition), Taishukan Publishing.
Lin, Chun Chen, Kentaro Yuki, Kazuya Abe and Naoyuki Naganuma (2004) TUFS Tagengo e-learning system Kaiwa Kyozai Kaihatsu (Development of Conversation Teaching Materials for TUFS Multilingual e-learning System), Working Papers in Linguistic Informatics 1: TUFS Language Modules, Tokyo University of Foreign Studies, Graduate School, 21st Century COE “Center of Usage-Based Linguistic Informatics (UBLI)”, 115-121.
McEnery, Tony and Andrew Wilson (2003) Corpus Linguistics, 2nd Edition, Edinburgh Textbooks in Empirical Linguistics, Edinburgh: Edinburgh University Press.
Norvig, Peter and George Lakoff (1987) “Taking: A study in lexical network theory”, J. Aske et al. (eds.) Proceedings of the Annual Meeting of the Berkeley Linguistics Society, 13, 195-206.
Radford, Andrew (1997) Syntax: a minimalist introduction, Cambridge: Cambridge University Press.
Suzuki, Takashi, Koji Matsumoto and Mayumi Usami (2005) “An Analysis of Teaching Materials Based on New Zealand English Conversation in Natural Settings ― Implications for the Development of Conversation Teaching Materials ―”, Yuji Kawaguchi et al. (eds.), Linguistic Informatics State of the Art and the Future, Amsterdam/Philadelphia: John Benjamins, 295-315.
Takagaki, Toshihiro, Susumu Zaima, Yoichiro Tsuruga, Francisco Moreno-Fernández and Yuji Kawaguchi (2005) Corpus-Based Approaches to Sentence Structures, Usage-Based Linguistic Informatics 2, Amsterdam/Philadelphia: John Benjamins.
Usami, Mayumi (2005) “Why Do We Need to Analyze Natural Conversation Data in Developing Conversation Teaching Materials? ― Some Implications for Developing TUFS Language Modules ―”, Yuji Kawaguchi et al. (eds.), Linguistic Informatics State of the Art and the Future, Amsterdam/Philadelphia: John Benjamins, 279-294.
Wilkins, David A. (1976) Notional syllabuses, Oxford: Oxford University Press.
Yoshitomi, Asako, Tae Umino and Masashi Negishi (2006) Readings in Second Language Pedagogy and Second Language Acquisition In Japanese Context, Usage-Based Linguistic Informatics 4, Amsterdam/Philadelphia: John Benjamins.
Yuki, Kentaro, Kazuya Abe and Chun Chen Lin (2005) “Development and Assessment of TUFS Dialogue Module ― Multilingual and Functional Syllabus ―”, Linguistic Informatics State of the Art and the Future, Yuji Kawaguchi et al. (eds.), Amsterdam/Philadelphia: John Benjamins, 333-357.
Appendix

COE Program Promoters

Yuji KAWAGUCHI (French and Turkish Linguistics)
Susumu ZAIMA (German Linguistics)
Nobuo TOMIMORI (Romance Linguistics)
Toshihiro TAKAGAKI (Spanish Linguistics)
Yoichiro TSURUGA (French Linguistics)
Ikuo KAMEYAMA (Russian Literature)
Hideki NOMA (Korean Linguistics)
Kohji SHIBANO (Information Technology)
Makoto MINEGISHI (Theoretical Linguistics)
Mayumi USAMI (Social Psychology of Language)

Research Projects in 2002-2006

Linguistic Informatics: Developments of TUFS Language Modules, Linguistic Culture Portal Site, Multilingual Corpora, Teaching Materials for Advanced Liberal Arts Courses, Discourse Analysis, Publications of Linguistic Informatics and Working Papers in Linguistic Informatics

Yuji KAWAGUCHI, Makoto MINEGISHI, Kohji SHIBANO: TUFS Language Modules
Yuji KAWAGUCHI, Yoichiro TSURUGA, Toshihiro TAKAGAKI, Susumu ZAIMA: Linguistic Culture Portal Site, Multilingual Corpora
Makoto MINEGISHI, Ikuo KAMEYAMA, Yuji KAWAGUCHI: Teaching Materials for Advanced Liberal Arts Courses
Mayumi USAMI: Discourse Analysis
Yuji KAWAGUCHI, Toshihiro TAKAGAKI, Susumu ZAIMA, Yoichiro TSURUGA, Mayumi USAMI, Makoto MINEGISHI, Ikuo KAMEYAMA, Nobuo TOMIMORI, Kohji SHIBANO: Publications of Linguistic Informatics and Working Papers in Linguistic Informatics

Linguistics: Corpus-Based Analysis of Linguistic Usages, Prosody and Syntax in Cross-Linguistic Perspectives

Yoichiro TSURUGA: Verbal class in French ― Frequency analysis and construction ―, Impersonal constructions in French
Yuji KAWAGUCHI: Diachronic research on negative constructions in French, Corpus-based analysis of French conditional
Naotoshi KUROSAWA: Word order of modifier and modified constituent in Latin and Portuguese
Kiyoko SOHMIYA: Aspects of marked constructions as seen in corpora
Kazuyuki URATA: Diachronic research on the subjunctive in English
Susumu ZAIMA, Takashi NARITA: Corpus-based research on verb construction in German
Toshihiro TAKAGAKI: Construction of a Spanish corpus and the development of relevant tools to advance Spanish language research
Hidehiko NAKAZAWA: Corpus-based analysis of Russian aspect, Utilization of a corpus for research on Russian verbs
Takayuki MIYAKE: Research on the syntactic characteristics of Chinese verbs based on corpus analysis
Keiko MOCHIZUKI: Comparative study of compound verbs in Japanese and Chinese that express “causal phenomena” and “resultant phenomena” and their corresponding English sentence structures
Shinjiro KAZAMA: Descriptive study of grammar using spoken and literary corpora
Isamu SHOHO: The causes and results of marked word order in the Malaysian language
Satoko YOSHIE: Construction of a Wakhi vocabulary corpus
Shinji YAMAMOTO: Italian language in the 21st century
Futoshi KAWAMURA: Database of case-marking for Old Japanese adjectives
Yuji KAWAGUCHI, Tsunekazu MORIGUCHI, Nobuo TOMIMORI, Hiroko SAITO, Masashi FURIHATA, Yoshio SAITO: Prosody and syntax in ambiguous sentences, Prosodic analysis of speech through the TUFS Dialogue Module

Applied Linguistics: Discourse Analysis, Second Language Acquisition, Assessment of TUFS Language Modules

Mayumi USAMI: Construction and analysis of a multilingual corpus of spoken language, Basic research on methodology for natural conversation analysis, Development of a basic transcription system for Japanese, Korean, Chinese and English
Tae UMINO: Construction and analysis of Japanese learner-language corpus, Basic research aimed at the development of learner’s manual for Japanese Dialogue Module
Asako YOSHITOMI: Construction of an English learner language corpus, Revision of the English Dialogue Module teacher’s manual
Masashi NEGISHI, Hideyuki TAKASHIMA, Masanori ICHIKAWA, Koyo YAMAMORI: Development of a Language Proficiency Scale, Assessment of TUFS Language Modules

Computer Sciences: e-learning, Natural Language Processing

Hiroshi SANO: Construction of an educational material corpus for Japanese language education
Chun Chen LIN: Construction of e-learning system of TUFS Language Modules
TUFS Language Modules (Supervisors)

IPA Module, Cross-Linguistic Grammar Module

IPA: Yoshio SAITO, Hiroshi NAKAGAWA, Yukie MASUKO
Cross-Linguistic Grammar: Makoto MINEGISHI, Shinjiro KAZAMA

Language Modules: Pronunciation (P), Dialogue (D), Grammar (G) and Vocabulary (V) Modules

English: Keizo NOMURA (G), Hiroko SAITO (P), Kazuyuki URATA (G,V), Asako YOSHITOMI (D)
German: Takashi NARITA (P,D,G), Akiko MASAKI (P), Susumu ZAIMA (V)
French: Yuji KAWAGUCHI (P,D,G,V), Akira MIZUBAYASHI (D)
Spanish: Shigenobu KAWAKAMI (P,D,G), Toshihiro TAKAGAKI (G,V)
Portuguese: Naotoshi KUROSAWA (P,D,G,V), Chika TAKEDA (D)
Russian: Hidehiko NAKAZAWA (P,D,V,G)
Chinese: Kazuyuki HIRAI (P,D), Takayuki MIYAKE (G,V)
Korean: Eui-sung CHO (P), Koichi IKARASHI (D), Hideki NOMA (G,V)
Mongolian: Renzo NUKUSHINA (D), Hideyuki OKADA (G,V), Yoshio SAITO (P)
Indonesian: Masashi FURIHATA (P,D,G,V)
Filipino: Tsunekazu MORIGUCHI (P), Michiko YAMASHITA (D,G,V)
Laotian: Reiko SUZUKI (P,D,G,V)
Cambodian: Hiromi UEDA (P,D,G,V), Tomoko OKADA (P,D,G,V)
Vietnamese: Yoshio UNE (P,D,G,V), Hiroki TAHARA (P)
Arabic: Robert RATCLIFFE (P,D,G,V)
Turkish: Takahiro FUKUMORI (P), Mutsumi SUGAHARA (D,G,V)
Japanese: Yohei ARAKAWA (V), Futoshi KAWAMURA (G), Yumiko SATO (P), Tae UMINO (D,G)
1. Workshop on Corpus Linguistics ―Research Domain―
Introduction

Yuji KAWAGUCHI

The Japanese Ministry of Education, Culture, Sports, Science and Technology initiated the 21st Century COE Program in 2002. One of the objectives of this COE Program is to introduce the aspect of competition into the system of Japanese universities. In the face of this governmental decision, there was no time prior to the proposal to question whether such a principle of competition is actually applicable to a branch of science such as the humanities. Nevertheless, one thing was clear at that stage. For a small university like the Tokyo University of Foreign Studies (TUFS), which comprises only one faculty, it was practically impossible to propose an interdisciplinary program. For this reason, we planned the integration of the linguistics and language education branches, which are the biggest strengths of this university. In fact, the unique feature of TUFS is its multilinguality, which can be attributed to the 26 language sections in the university. Our proposal was titled “Center of Usage-Based Linguistic Informatics (UBLI).”

The following points must first be stressed in order to dispel any misunderstanding related to “linguistic informatics,” a term that has been in use as the name of a research domain since the 1980s. To avoid confusion, I prefer to use “Usage-Based Linguistic Informatics (UBLI)” in the following manner. The UBLI mentioned here does not refer to natural language processing or computational linguistics. Even if it were associated with these fields, it would represent “linguistic informatics” only in the narrowest sense of the term. The goal of UBLI is not related to language processing using computer technology or to linguistic analysis in itself. UBLI implies the systematic integration of computer science with linguistics and language education. UBLI was designed with the underlying intention of improving language education in Japan.
The initial aim was to realize a more efficient multilingual education system, to bring language education to the fore through the utilization of computer technology, and to elaborate advanced educational material through the utilization of linguistic theory.

Certain factors facilitated the realization of the UBLI plan. First, the small size of TUFS is advantageous in that it facilitates the provision of all the equipment required for research projects based on the Local Area Network, which had already become familiar not only to students but also to every member of the university staff. Second, there has been an enormous downsizing of information technology, especially over the last ten years. Hence even linguists can now use information technology as computer scientists could ten years ago. Finally, our colleagues from the field of information technology offered their cooperation in our COE project. The Center of UBLI was developed for these reasons.

In the field of linguistics, advances in computer science have opened an avenue of research that was previously unavailable, namely, corpus linguistics. As Susan Hunston writes:

It is no exaggeration to say that corpora, and the study of corpora, have revolutionised the study of language and of the applications of language, over the last few decades.[1]
The Center of UBLI has been interested in corpus-based linguistic analysis and corpus-based language education from the outset, since corpus-based studies are crucial for all research on the real use, or usage, of any given language. Linguistic corpora furnish a great variety of linguistic research domains and fruitful applications in language education, of which I hope our readers will be convinced as they read the present part. Thus, linguistic corpora have broadened our perspective on new research domains in linguistics and applied linguistics.

A proverb by Bernard of Chartres, a famous scholar of the school of Chartres founded in the twelfth century, states: “A dwarf on a giant’s shoulder sees farther of the two.” In other words, a human being on a giant’s shoulder can see more and farther than the giant, not because the human’s eyes are more penetrating but because he is lifted up on the giant’s shoulder. For Bernard, this giant represented the Greek and Roman cultures as well as Arabic traditions. His surprise and esteem for the giant hold true for contemporary linguistics. In this case, the giant represents linguistic corpora created by computer science. In fact, through linguistic corpora, it is also possible to analyze the variety of complex linguistic uses that have not been fully explicated in previous linguistic analyses.

The Center of UBLI has received research grants for a period of five years beginning in 2002; this five-year period will end in March 2007. It is evident that five years are insufficient for each researcher to create and establish a new research field with a firm theoretical background. A workshop entitled “Spoken Language Corpora – Their Significance and Application” was held on December 9, 2005, in collaboration with C-ORAL-ROM, a consortium researching the spoken Romance languages.

[1] Susan Hunston, Corpora in Applied Linguistics, Cambridge University Press, 2002, p.1.
The second international conference was held on the following day, December 10. Lectures were delivered on the status of grammatical research on spoken language, the pragmatic analysis of spoken language, and the applications of spoken language corpora to education. Further, a general discussion was held among the lecturers and the participants of the conference.[2] The strategy required to develop the Center of UBLI became clearer through this workshop and conference. In other words, the purpose of UBLI could be defined more stringently through the following steps: further promoting corpus linguistic analysis centered on written texts, and inviting overseas researchers conducting cutting-edge research on corpus linguistics to discuss with them the significance of research on and processing of spoken and written language corpora, the construction of learner corpora, and the applications of the research findings in language education. Thus, we invited experts in the fields of corpus linguistics, analysis of linguistic usage, and language education from overseas for the second workshop on “Corpus Linguistics – Research Domain,” which was held on September 14, 2006. The first part describes the contributions of eleven experts to this workshop on corpus linguistics.

The key term “corpus” is defined as a collection of texts stored in an electronic database. A corpus generally refers to a large machine-readable body of language comprising thousands or millions of words. The primary difference between an archive and a corpus lies in the fact that a corpus is designed, and thus structured, for a particular “representative” function. It is sometimes annotated with additional tags such as part-of-speech, metalinguistic, or prosodic features.[3] In an interesting introduction to corpus linguistics, Tony McEnery and Andrew Wilson claim:

... Corpus linguistics in contrast is a methodology (bold by Kawaguchi) rather than an aspect of language requiring explanation or description. A corpus-based approach can be taken to many aspects of linguistic enquiry. Syntax, semantics and pragmatics are just three examples of areas of linguistic enquiry that have used a corpus-based approach... Corpus Linguistics is a methodology that may be used in almost any area of linguistics, but it does not truly delimit an area of linguistics itself.[4]

[2] See Yuji Kawaguchi, Susumu Zaima and Toshihiro Takagaki (2006) Spoken Language Corpus and Linguistic Informatics, Amsterdam/Philadelphia: John Benjamins.
[3] See Paul Baker, Andrew Hardie and Tony McEnery, A Glossary of Corpus Linguistics, Edinburgh University Press, 2006.
[4] Tony McEnery and Andrew Wilson, Corpus Linguistics, 2nd Edition, Edinburgh University Press, 2001, p.2.
In fact, our workshop on corpus linguistics covered a wide range of research areas, such as dictionaries, linguistic atlases, dialects, translation, ancient texts, non-standard texts, sociolinguistics, second language acquisition, and natural language processing. Both quantitative and qualitative methods are used for corpus-based analysis. Linguistic corpora are generally utilized as a source of examples to test the applicability of a researcher’s intuition to a linguistic phenomenon, for instance, the frequency of some units, the possibility of some verbal constructions, or the existence of some variants; in this case, the investigation is regarded as having a corpus-based perspective. On the other hand, the use of linguistic corpora can be more inductive, or corpus-driven, to prove some theoretical hypotheses. We will now provide a synopsis of each corpus-based analysis in the order of its appearance in the present part.

The first research field of corpus-based analysis is that of the dictionary, in particular, the dictionary of collocations. Peter Blumenthal, in his “A Usage-based French Dictionary of Collocations,” presents a research project whose main practical aim is to compile a collocation dictionary of French nouns. This work is based on his quantitative, probabilistic notion of collocations; he defines collocations as those combinations of words that occur more frequently across a corpus than would be expected by mere chance. The tagged corpus contains two years’ issues of Le Monde and excerpts from modern French novels. Collocations showing a high degree of specificity, based on the log-likelihood calculation, are categorized by means of a syntactic grid. He presents a detailed sample article that shows collocational relationships using débat as a keyword. This sample illustrates the advantages of the type of dictionary that he aims to develop, the main advantage being an elaborate record of the degrees of specificity pertaining to the listed collocations.
Thus, it enables the user to easily distinguish highly stereotypical word combinations from other more sophisticated structures. Further, his approach also accounts for structures that are usually neglected in lexicography, such as keywords forming a part of an adverbial or prepositional phrase. The article discusses the drawbacks of his lexicographic work and also deals with the questions frequently asked by his test users.

Can linguistic atlases be considered as linguistic corpora? Jean-Philippe Dalbera gives us a positive answer in his work “Linguistic Atlases ―Objectives, Methods, Results, Prospects―.” He characterizes geolinguistic works of three different generations. The first generation set up the concept of the linguistic atlas, thereby completing the spadework with regard to its methodology (survey, transcription, management, and recording of the linguistic facts). The second generation aimed at constructing a reliable corpus of linguistic data by improving the atlas tool. Thus, for Dalbera, linguistic atlases constitute explicitly parameterized corpora. He introduces three key parameters for defining the corpora of linguistic atlases, namely, comparativism, diatopy, and lexicon. The analysis and use of these corpora are to be devised for various steps, from the production of leveled representations of geographic areas (dialectal boundaries, etc.) through to the models of certain modules of grammar (etymology being perceived as lexical reconstruction). The keyword here undoubtedly remains variation, which places dialectology centrally within linguistics, at least as far as diachronic linguistics is concerned.

In “From the Linguistic Atlas to the Database, and vice versa ―The Corsican Example―,” Marie-José Dalbera-Stefanaggi summarizes the history of the linguistic atlases of Corsica, namely, ALF Corse by Gilliéron and Edmont prior to World War I, the ALEIC by Bottiglioni from 1933 to 1952, and her own Nouvel Atlas Linguistique de la Corse (NALC). In NALC, which is entirely computerized and based on its relational database, Banque de Données Langue Corse (BDLC), linguistic atlases appear to have become real linguistic corpora. She adds that the atlas and the database prove to be dialectal tools that should be improved jointly, since they indeed enrich each other.

In “Collateral Languages and Digital Corpus,” Jean-Michel Eloy reexamines the notion of “usage” and mentions the problems involved in constructing digital corpora of regional French. His paper is closely linked with the main aspect of UBLI, that is, “usage-based,” since it deals with a text database that he is actually developing in the Picard language. His primary concern is providing the appropriate framework to understand the actual content of Picard texts. The keywords are as follows: related and near languages, collaterality, emergence, and koinèisation of the language.
To establish a definite framework, he first uses genetic schemes followed by dialectometry, lexicostatistics, and a multidimensional approach to intercomprehension. He proposes collaterality as a dynamic and sociolinguistic concept. The actual texts in the database provide an insight into all the levels of variation and lack of standardization, although certain elements of koinèisation also exist. These data must be viewed in their specificity (writing traditions, a certain voluntarism, the sociopolitical situation of the language) in order to process relevant requests in preparing the database.

Another perspective of corpus-based linguistic analysis can be explored in documents written in Old French. In fact, different types of works and tools required for the study of Old French have been available on the Internet for several years. Pierre Kunstmann, in his “Corpus of Old French Literary Texts,” points out the most important ones and provides a brief analysis of these projects, which have changed over a very short period: before 2000 and since 2000. The production of new documents involves three steps: transcription, critical editing, and encoding of texts. In the databases of ancient documents, the simplest form is the concordance. Once a text is digitized, it is easy to extract a series of lemmatized indexes. Once the lemmas are defined, the text offers lexicons and dictionaries. He claims that in this type of project, the problem of maintaining and transcoding documents is as crucial as that of ensuring document quality and reliability.

The corpus-based perspective in Chantal Lyche’s contribution is based on French phonology. In her “Building a Large Corpus for Phonological Research ―The PFC Project―,” she partially describes the PFC project (Phonologie du français contemporain: usages, variétés et structure). PFC is arguably the largest and most ambitious survey of modern French ever conceived from a phonological perspective. The PFC project, carried out with the coordination of Jacques Durand (Toulouse II), Bernard Laks (Paris X), and Chantal Lyche (Oslo), involves over 30 researchers from a variety of countries and aims at the recording, partial transcription, and analysis of over 600 speakers from the francophone world on the basis of a common protocol. Further, it aims at a broad coverage of varieties of contemporary French by selecting groups of speakers from approximately 60 different locations. Thus far, more than 400 speakers have been recorded from various places, including Belgium, Burkina Faso, Canada, Côte d’Ivoire, Louisiana, Mauritius, Reunion Island, and Switzerland in addition to continental France. After presenting in some detail the protocol that was adopted at all investigation points, she describes the different coding systems that were elaborated for the study of schwa, liaison, and prosody.
Lyche discusses at some length the methodology for coding prosody and presents a number of problems that did not arise when coding segmental features. Further, she mentions a number of works related to the PFC corpus and concludes by providing examples of a few words from the database, which now includes 212 speakers; these words have been recorded and validated.

A different type of corpus-based perspective is referred to in Tony McEnery and Zhonghua Xiao’s paper, “Parallel and Comparable Corpora ―The State of Play―.” Here, they remark that at present, with the ever-increasing international exchange and accelerated globalization, translation and contrastive studies are more popular than they have ever been. In this paper, they illustrate the value of parallel and comparable corpora in translation and contrastive studies. They define parallel corpora as those that contain collections of L1 texts and their translations, and comparable corpora as those that contain matched L1 samples from different languages. As mentioned earlier, their main concern is the potential value of parallel and comparable corpora in translation and contrastive studies. A parallel corpus is a useful starting point for contrastive research. A carefully matched bidirectional parallel corpus provides a basis for both translation and contrastive studies. However, it is not easy to develop such an ideal bidirectional corpus due to the heterogeneous pattern of translation between languages and genres.

The phonetic reconstruction of earlier stages of languages often relies on the phonetic interpretation of their graphic systems. These systems, however, tend to be conservative. Documents using non-standard orthographies may offer additional insights into the phonetic characteristics of a language at the time when they were written. A corpus of such non-standard orthographic works is much needed for the historical study of any language. In “A Corpus of French Texts with Non-standard Orthography,” Yves Charles Morin presents a catalog of such documents, including those he has examined over the last twenty years for his research on the evolution of French. His experience allows him to conclude that each document must be analyzed on its own. There does not seem to be much to be gained by cross-referencing the data between different authors.

What role will linguistic corpora play in writing sociolinguistic history? In “The Uneasy Interface ―Methodological Issues in Using Data from Traditional and Urban Dialectology in (Re-)constructing Sociolinguistic History―,” Tim Pooley seeks to address some of the questions that arise when one attempts to write the history of a vernacular language from a variationist sociolinguistic perspective by using oral data. The perspective selected is that chosen by the informants based on their sociolinguistic profile with special reference to research on northern France.
Since the analysis of the behavior and perceptions of contemporary informants will necessarily introduce a historical perspective if speakers of different age ranges are selected, any attempt to go further back in time would imply the use of data gathered by scholars working within the framework of traditional dialectology. Using the well-known acronym N-O-R-M-S, i.e., non-mobile old rural males, coined by Chambers and Trudgill to summarize the details of the informants used in this latter approach, Pooley highlights a number of methodological issues that arise when one takes into account the factors implied by each initial of the term: (1) mobility, (2) age, (3) rurality and urban development, (4) gender, and (5) style.

Language education and second language acquisition (SLA) are other important application fields of linguistic corpora. Randi Reppen, in her “First Language & Second Language Writing Development of Elementary Students ―Two Perspectives―,” provides a snapshot of the linguistic changes that occur between the third and sixth grades (ages 8–13) in the writing of both native and non-native English speakers. Both quantitative and qualitative methods are used to analyze the development in the writing style. The quantitative aspect includes over 65 linguistic features and a multidimensional analysis, while the qualitative aspect includes a detailed linguistic case study of a second-language writer over a two-year period. Thus, by analyzing a large collection of essays both quantitatively and qualitatively, the general developmental trends that occur between the third and sixth grades, and within each individual learner, can be identified. These analyses reveal similar developmental patterns for both first and second language writers. It is evident that many linguistic changes occur during elementary school.

Finally, in Achim Stein’s paper, we can see technical perspectives on the development of Old French literary text corpora. In “Resources and Tools for Old French Text Corpora,” he presents the development of the electronic version of the “Amsterdam Corpus,” a previously unpublished collection of Old French literary texts originally compiled by A. Dees. Further, the improvement in the quality of these texts (3.2 million words) as well as the tools designed for part-of-speech tagging and lemmatization of Old French texts are described in this work. It also describes how different types of resources (lexical, textual, and technical) can be profitably used for the treatment and enhancement of historical text corpora. In addition to the usual requirements that the corpora should meet (reusability and availability of data), the author focuses on the preservation of original data and the quality of the philological information distributed in the corpora.
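The probabilistic notion of collocation mentioned in the synopsis of Blumenthal's contribution, that is, word combinations occurring more often than chance would predict and ranked by a log-likelihood calculation, may be illustrated by a small sketch. It assumes the log-likelihood ratio statistic (G2) computed over a 2x2 contingency table of adjacent word pairs; the function names and the toy sentence are hypothetical, and the dictionary project itself works on a tagged corpus with a syntactic grid, which this sketch does not attempt to reproduce.

```python
import math
from collections import Counter


def log_likelihood(o11, o12, o21, o22):
    """Log-likelihood ratio (G2) for a 2x2 contingency table.

    o11: pairs (w1, w2); o12: pairs (w1, other);
    o21: pairs (other, w2); o22: pairs (other, other).
    """
    n = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    observed = ((o11, o12), (o21, o22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # an empty cell contributes nothing to the sum
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2.0 * total


def bigram_scores(tokens):
    """Score every adjacent word pair of a token list (hypothetical helper)."""
    pairs = list(zip(tokens, tokens[1:]))
    pair_counts = Counter(pairs)
    first_counts = Counter(w1 for w1, _ in pairs)
    second_counts = Counter(w2 for _, w2 in pairs)
    n = len(pairs)
    scores = {}
    for (w1, w2), o11 in pair_counts.items():
        o12 = first_counts[w1] - o11
        o21 = second_counts[w2] - o11
        o22 = n - o11 - o12 - o21
        scores[(w1, w2)] = log_likelihood(o11, o12, o21, o22)
    return scores


if __name__ == "__main__":
    # Toy data only; a real study would use a large, tagged corpus.
    toy = "le débat public a lieu le débat public continue le chat dort".split()
    ranked = sorted(bigram_scores(toy).items(), key=lambda kv: kv[1], reverse=True)
    for pair, score in ranked[:3]:
        print(pair, round(score, 2))
```

A high score only indicates that the observed frequency departs strongly from the frequency expected under independence; since the statistic is also high for pairs that occur less often than expected, a practical ranking would additionally check that the observed count exceeds the expected one.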
Linguistic Atlases: Objectives, Methods, Results, Prospects

Jean-Philippe DALBERA

Introduction

The topic I have been asked to address in this workshop is linguistic atlases and, more broadly, dialectology. Several major questions must be kept in mind when considering it. I will focus on four of them:

(I) Do linguistic atlases constitute, ipso facto, usage-based corpora?
(II) To what extent does the genesis of these atlases shed light on their characteristics?
(III) From there, is it possible to determine the possible uses of these atlases, or at least their privileged uses?
(IV) Can these uses have a theoretical impact and any consequences for language models?

1. Historical overview

Linguistic atlases were developed almost worldwide, but essentially in Europe, during the first three quarters of the 20th century. The precursor was G. Wenker, with his attempt to delimit linguistic phenomena geographically. Nevertheless, Gilliéron remains the pioneer. Assisted by Edmont, he compiled the first real linguistic atlas, devising the surveying methods, the transcription protocols and the notions of network, areas and strata, thereby deeply changing the way dialects and linguistic boundaries were perceived and developing a certain concept of what linguistic change is. It is also well known that many other attempts were made over the same period. Japan, for instance, saw the publication of the 29 maps of the Phonetic Dialect Atlas as early as 1905 and of the 37 maps of the Grammatical Dialect Atlas in 1906. Following the Atlas Linguistique de la France (ALF), other major works in the field of geolinguistics were started, such as the Sprach- und Sachatlas Italiens und der Südschweiz (AIS) by Jaberg and Jud or the Atlante Linguistico Italiano (ALI).
The latter were in turn followed by a long series of national atlases, impossible to list here, together with all the theoretical and methodological improvements naturally implied by the necessary criticism of the original work. In France, the improvements and amendments which had to be brought to the ALF led to the series of the Atlas Linguistiques de la France par région, which is about to be finished. In the same vein, regional or national linguistic atlases
are under preparation here and there, aiming to cover areas for which geographically located linguistic data are not yet available. Our account of these will be extremely brief, since M.R. Simoni has already published here a paper describing that period from the inside1. These two waves of geolinguistic enterprises are sometimes referred to as atlases “of the first and second generations”; however, they by no means represent a terminus ad quem, since third-generation atlases are already on their way. This latter label can be interpreted in two different ways. On the one hand, it covers interpretive maps, such as the Atlas Linguarum Europae and the Atlas Linguistique Roman, which are under development. We will underline their impact later on. These are characterized by no (or only very few) new surveys, as already existing material is used. This material was collated by the preceding generations and is brought together according to specific procedures in order to provide significant syntheses at variable scales in the field of linguistic areology as well as for new or partially new domains of investigation, notably semasiological or motivational ones, which often refer to the interrelation between languages in the course of their history and the successive cultures they have expressed. On the other hand, this third generation is represented by an object which, as a result of technological evolution, is no longer an atlas, or at least not a paper atlas, but rather includes or underlies an atlas: it has become a multimedia dialectal database. Indeed, technology can never be totally separated from the purely linguistic matter. The introduction of the tape recorder changed the face of surveying methods in its time.
Here, data processing, with the possibilities it offers for storage, consultation and linking, weighs significantly on the geolinguistic enterprise and compels us to rethink quite a few practices.

2. Atlases: objectives and methods

The prototype for the first generation of linguistic atlases is the Atlas Linguistique de la France by Edmont and Gilliéron. This enterprise makes use of a certain number of concepts (development of a representative network, elaboration of questionnaires, training of a surveyor, definition of a protocol for field surveys, primacy of oral data, transfer of linguistic data onto display maps, representations on outline maps, etc.). These concepts have since tended to serve as the touchstone in the domain. This does not mean, however, that the enterprise is totally free from the ideological constraints of its time, and some of the parameters adopted (consciously or not) by Gilliéron will not be kept by his successors.

1 Simoni (2004).
Initial objectives

The elaboration of the ALF was motivated by scientific debate. Its main objective was to test empirically the validity of the “phonetic laws”, that is to say the very strong hypothesis about diachronic evolution put forward by the comparatists, according to which, if in a given context of a source language a sound x systematically becomes a sound y in a daughter language, then all the x’s placed in the same conditions must behave in the same way, except for analogy or borrowing phenomena. Correlatively, this test should make it possible to delimit coherent or at least homogeneous linguistic groups that are likely to represent distinct linguistic entities born of a common source, in other words, dialects.

It is well known that this test will turn out to be negative. The phonetic laws will be described as mirages and the dialects will be erased as such. The linguistic heritage will hence cease to be conceived as a systematic and overwhelming transmission of sounds and will become a transmission of words, it being understood that the latter are random and subject to various pressures (“each word has its background”).

Revision of the objectives

The refutation of a hypothesis nevertheless remains a scientific asset. This finding will hence not interrupt the development of linguistic atlases. Only the objectives will undergo revision. From then on, atlases will be asked to provide a representation of the linguistic situation of a country at a given moment or to give an idea of the geographical diffusion or distribution of linguistic facts. The aim will be to link these linguistic facts and their possible variation with external data, such as geographical, historical, economic and religious considerations, and notably with historical or cultural breaks.
The initial diachronic matter will not be abandoned; rather the best will be taken from the existing dialectal displays by trying to understand space as a projection of time. Thanks to the first geographical experience, the structuralists will clarify the split between gradual change (referring to a common evolutional slope however qualified by a more or less strong conservatism) and abrupt change (referring to phenomena of shared innovation). The interference phenomena at the boundaries of these areas will also be studied. And assessing the possible impact of the normalized and standard language on the dialectal varieties will also become a concern thanks notably to the comparison of successive representations of the same area. In a word, there is a change of direction but the work continues.
Revision of the methods

Many things have changed too with regard to the methods used to elaborate these first atlases. Certain requirements initially considered by Gilliéron as guarantees of objectivity, such as the single surveyor (homogeneity of the ear), the single informant and the “first shot” (that is to say the first spontaneous answer given, a requirement inspired by the photographic snapshot), will be abandoned afterwards. This instantaneous picture will be replaced by a more nuanced one: the answers will be cross-checked with several informants, the second thoughts and corrections of certain informants will be taken into account, the criteria for the selection of the informants will be explicit and calculated, and the pronunciation discrepancies between informants will be given attention (the transcriptions, incidentally, varied depending on the supervisors of the atlases), etc.

As far as the methods for transcribing the facts onto the outline maps are concerned, different techniques will be implemented, some noting the forms in an extensive phonetic transcription, others preferring symbols that refer to captions clarifying the lemmas associated with them. And, obviously, using the term lemma amounts to admitting that the noted facts are not core facts but have already been subject to some sort of levelling.

An implicit formatting of the data

There is still another parameter to highlight. It might go unnoticed because it seems so obvious, but it is nevertheless very important: it concerns the nature of the linguistic items that are to be recorded on the maps. The approach is fundamentally a comparative one. The commensurable nature of the answers to be obtained is hence crucial. This is the reason why questionnaires are used and why the collection, here and there, of a certain quantity of free speech is insufficient.
However, this requirement of commensurability ends up, in practice if not by right, narrowing the dialectal collations and turning them into word comparisons. A consequence of this perspective is a quasi-exclusive focus on lexical units or on some morphological features (conjugation elements, for instance) when working on inflectional languages. And, despite the fact that its validity and relevance as a linguistic unit are strongly questioned within the theoretical frameworks of modern linguistics, and without this being a deliberately adopted stance, the word thus becomes a major item in geolinguistics. In any case, as far as the development of the concept of the linguistic atlas and the evolution of production methods are concerned, it can be said that, after the ALF, all the initial problems have been raised. The framework is laid down. A whole series of studies using the maps of this atlas flourishes. They try to understand the
spatial phenomena of partition and diffusion of linguistic facts, try to highlight the history of words, etc. Further concepts are under creation: (homonymic) collision, paronymic attraction, verbal therapeutic, etc. Almost everywhere, a second generation of linguistic atlases (preceded by a first generation or not) is on the go and the mere geolinguistic point becomes a full chapter in the linguistic textbooks. The quantity of linguistic data gathered within this second generation framework is considerable. However, the regular rhythm of everyday work henceforth somewhat stultifies the initial ambitions with regard to the theorisation of linguistic change and new models of diachronic evolution. One has the vague impression that the diatopical point of view has become an end in itself and that the geolinguistic object turns in on itself. Research consists of (but also limits itself to) a spatial description of linguistic variation. In relation to the rise of sociolinguistic preoccupations, another important element is that the sheer making of the atlases has overridden all the rest and has become during a short moment the research object itself, focusing on issues such as the surveying techniques, what kind of questionnaire, spontaneity and naturalness of the dialogue, impact of the surveyor on the data, attention given to the legibility of the maps2, taking the negative data into account3, etc. Of course, the analyses making use of the data from the atlases have not disappeared (in France, Gardette, Séguy, Tuaillon, etc. can be mentioned) but these have become rare and above all have changed scale, limiting themselves (still with regard to the French example) to the areas defined by the regional atlases and hence losing part of their generalisation. It should also be mentioned that the non-simultaneity in elaborating and publishing these regional atlases and the low rate of the global use of these maps are two linked phenomena. 
All this has led to a paradoxical situation: the period in which the most data were gathered produced very few attempts at dialectal reconstruction.

2 Bouvier-Martel (1975-86) for instance introduces the notion of delimited homogeneous areas.
3 Cf. Ravier (1965), among others.

To sum up

In a few words, the first generation of work has set up the concept of the linguistic atlas and thereby completed the spadework with regard to methodology (survey, transcription, management and recording of the linguistic facts). The second generation has aimed at constituting a reliable corpus of linguistic data by improving the atlas tool. Depending on the scientific institutes that have launched these atlases throughout the world, their objectives might have varied from one to another but, considered with hindsight, they can all be subsumed under a fairly homogeneous programme. The “National Institute for the Japanese Language”, for instance, aims at presenting the different regional dialects, at showing their spatial diffusion, at explaining the process that led to the elaboration of standard Japanese, at determining in return the influence of the standard language on the dialects and at considering the pedagogic consequences of the knowledge of these varieties. This program is representative of this kind of enterprise.

A linguistic atlas thus undeniably constitutes a corpus of usage-based linguistic facts. Of course, there would be a lot to say about this notion of usage, as can be seen from the numerous discussions about the surveying techniques mentioned above. However, this restriction also applies to a good many other corpora. An atlas is a constructed corpus. The most important parameter that defines it is comparativism. The starting point is a questionnaire, the privileged unit is the word and the preferred dimension is space (when working from a network of locales). The corpus is hence a constructed corpus directed towards words and space; it must be handled as carefully as any other corpus.

To be convinced of this, a good reference point is the privileged case of a well-delimited linguistic area that appears in several successive linguistic atlases: Corsica provides a good example. In Corsica, geolinguistics has produced three atlases during the course of the century: the ALF Corse by Gilliéron and Edmont before World War I, the ALEIC by Bottiglioni from 1933 to 1952, and the NALC, which is currently under publication. That unusual situation is developed in Dalbera-Stefanaggi’s paper in this volume.
Three sets of facts have hence been published. Three atlases, three corpora? Or four corpora if Bottiglioni’s Dizionario is taken into account? Or a single corpus consisting of all these elements? What distinguishes them? The recorded facts are not the same, but they overlap of course. It is clear to us that we are dealing with one corpus; the facts are filtered through three different viewpoints but remain contemporary Corsican facts, and the difference between atlas and dictionary appears quite slight when considered today through the database tool. One example only: the transcriptions given for the facts diverge slightly. Bottiglioni particularly criticizes the impressionistic transcription produced by the ALF Corsica surveyor, together with those proposed by the AIS surveyors. He considers this kind of transcription an illusion, as he doubts the ear’s capacity to grasp so many different phonetic nuances. Moreover, many obvious transcription mistakes in
Edmont’s notebooks can easily be pointed to. However, even if these impressionistic transcriptions could conform to reality, they would not become legitimate for all that. According to Bottiglioni, the forms recorded and transcribed by the surveyor must not represent the expression of a fleeting moment; rather, they must reflect the normal and usual expression of average speakers. Bottiglioni hence sets out to provide a normalising transcription that represents average speech. Nowadays we are less peremptory on this point, especially after the NALC. The necessity of producing a more fine-grained analysis of these data is becoming more obvious. If some of Edmont’s transcriptions are indeed irremediably wrong, others reach a level of truth that only a foreign ear is capable of capturing, insofar as there is no interference with already known elements. Quite a few transcriptions which were thought to be wrong because they did not comply with a point of view based on a certain Italian norm appear in fact to be very enlightening and to capture a certain level of reality, thanks to an impressionism free from all a priori assumptions. As far as we are concerned here, it is clear that, as soon as a corpus of linguistic facts is being dealt with, an explicit or implicit filtering is applied.

3. Using atlases

The prospects available to the third generation are now becoming more precise. Very clearly, the intention is to use the corpus. This implies two things. On the one hand, it implies the creation of appropriate conceptual surroundings. On the other hand, it implies equipping oneself with the capacity to manage considerable, and sometimes disparate, quantities of data. Obviously, these two objectives are independent.
However, in practical terms, they rapidly proved to be complementary and convergent, in such a way that the geolinguistic work which could be characterised as third-generation work comes in fact from their merger. Clearly, the appearance and development of the data processing tool accompany and stimulate the heuristic work, in the same way that the tape recorder accompanied the fine-tuning of the surveying techniques and the data transcription for the preceding generation.

The database tool

An unexpected but non-negligible impetus for these atlas programs came from the development of the data processing tools that, little by little, were put at the linguists’ disposal. Considerable improvements have been made thanks to this new medium. These include, in no particular order, easy access to the data, quick consultation, computer toolboxes
(with regard to sorting, navigation procedures in the corpus, storage of partial results, etc.), etc.

Its handling

Those who have never manipulated dialectal facts do not know the feeling of powerlessness when facing tasks such as considering an ever increasing quantity of data, sorting data by multiple criteria, re-examining thousands of items as soon as a heuristic aim is modified even slightly, or manually writing down all the partial results obtained. Using a database does not solve the linguistic problems; however, it enables the researcher to handle far heavier corpora and, hence, far more significant data. This is trivial.

The multi-dimensional aspect

We can focus on this aspect for a while. We have already mentioned the dilemmas about transcription (which instructions should be given to the transcriber?) and about the transfer of these transcriptions onto the maps (how to subsume multiple transcriptions?). What used to be an obstacle for a paper atlas is no longer one within a database framework. Nothing prevents us any longer from noting several competing transcriptions, without any hierarchy, and linking each of these transcriptions to a specific parameterisation. The same holds for the different phonetic spellings: there is no longer any need to choose only one spelling, as the database allows all of them to be stored and consulted. The false problem of core versus analysed data vanishes or, at least, is pushed aside as, henceforth, it becomes a matter for the analyst, that is, for the person who shoulders the responsibility for an explicit analysis of the dialectal facts. This is much more satisfactory.
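The idea of keeping several competing transcriptions side by side, each tied to its own parameters, can be illustrated with a minimal sketch. It is not a description of any existing dialectal database: the class names, the field names (source, notation) and the placeholder forms and labels are all assumptions made for illustration; only the atlas abbreviations come from the text above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Transcription:
    """One recorded transcription of an answer, with its own parameters."""
    form: str      # the transcribed form (placeholder strings below)
    source: str    # atlas or survey the form comes from
    notation: str  # illustrative label, e.g. "impressionistic" / "normalised"


@dataclass
class Entry:
    """All transcriptions recorded for one notion at one locality."""
    notion: str
    locality: str
    transcriptions: list[Transcription] = field(default_factory=list)

    def add(self, form: str, source: str, notation: str) -> None:
        # Competing transcriptions are stored side by side; none is ranked.
        self.transcriptions.append(Transcription(form, source, notation))

    def select(self, **criteria: str) -> list[Transcription]:
        # The choice between transcriptions is left to the analyst's query.
        return [
            t for t in self.transcriptions
            if all(getattr(t, key) == value for key, value in criteria.items())
        ]


# Hypothetical data: the forms and notation labels are invented placeholders.
entry = Entry(notion="NOTION_A", locality="POINT_01")
entry.add("form-1", source="ALF", notation="impressionistic")
entry.add("form-2", source="ALEIC", notation="normalised")
entry.add("form-3", source="NALC", notation="impressionistic")

impressionistic = entry.select(notation="impressionistic")
print([t.form for t in impressionistic])
```

In this sketch no transcription is privileged when it is stored; the selection criteria belong to the query, which is one way of leaving the analysis to the person consulting the data.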
The heuristic support

Another consequence: if the dialectal database is correctly set up, the person who consults it becomes an active researcher, free to elaborate and test his own hypotheses from the facts recorded in the database, free to produce his maps upon request, to cross the onomasiological and semasiological perspectives, and to introduce the external parameters he believes are relevant, etc. Additionally, the structure of the database makes it possible to multiply the types of access instead of imposing only certain formats for the answers.

Qualitative vs. quantitative processing

The possibility of managing considerable quantities of facts offers further opportunities, among them quantitative processing. It becomes
very tempting to experiment with statistics, for instance. This approach was initiated notably by Séguy and then renewed and developed by Goebl in Europe. But there are specialists here who are much more qualified than I am in this field, so I will not pursue this matter. Nevertheless, I totally agree with the reservations expressed by Kawaguchi in the first volume of the UBLI: “we regret, however, that in Goebl’s article, the procedure of measuring the similarity between research points needs more explicit explanations”4. Indeed, certain difficulties must not be glossed over. What is a differential feature? What is a feature? Counting items without defining exactly what is being added is insufficient. And is this possible without conducting an analysis in the first place? Can any types of similarities be legitimately added? A fact will be considered as similar or as different depending on the point of view, phonetic or phonological. And the same difficulty will appear when either lexical units or lexical types are referred to, etc. And how can typological similarities be sorted out from genetic ones? Crucial issues concerning the notion of dialect, the notions of typological grouping and genetic affiliation, etc. resurface. But these issues would take us too far afield.

Analysed maps or the so-called interpretive atlases

What can be expected from the linguistic atlases of the first and second generations? At a primary level, the issue bears on map reading. The maps that come from the first atlases display “core” facts (there would be a lot to say about this notion of “core” item, but we will leave this matter aside here). Used in isolation and with no specific aim, these maps might not mean much. However, progressively, the individual facts can be considered more globally so that regularities can be perceived: this approach, called the inductive approach, which consists in generalising little by little, is the one which motivated the first comparatists.
We are aware of its limits. Nevertheless, certain map superimpositions are eminently instructive and enlightening, and quite a few modern atlases include these interpretive maps (which represent systems or at least significant correlations of facts) together with display maps. The former hence represent completed analyses along with important results. In certain cases, the representations produced by superimposition cover approaches that linguists have long wished to undertake. This is the case for the maps that illustrate phonetic evolutions. One could do without the atlas maps and establish that the evolution of the Latin geminate sonorants in Romance languages amounts either to neutralising the geminate/simple opposition or to maintaining this distinction while losing oppositions within the simple sonorants. However, the in vivo demonstration is more visual and clearer. Nevertheless, linguists had not considered all the possible dimensions of diatopical variation, and the maps from the atlases put forward some that researchers had not thought about. We are going to examine one of these unexplored possibilities.

4 Kawaguchi (2005) p. 101.

The motivational dimension of the lexicon

The mere fact of consulting atlas maps, whether on paper or on screen, sometimes leads us to venture into new avenues. The person who patiently puts side by side several atlas maps representing geographically adjacent areas is sometimes intrigued by certain recurrences. The puzzle he is trying to piece together sometimes highlights, beyond phonetic or etymological resemblances, a constant with regard to the representation Man has of certain realities. It happens that some words, despite their obviously distinct modern phonic forms and their different etymological sources, can nevertheless apparently be brought back to a similar representation of reality. In fact, they appear as calques of one another. The notion of motivation is hence introduced here. This suggests that another dimension of continuity within diversity could be found but, this time, with regard to semantics and without considering any genetic relationships.

This is the kind of issue that the linguists from the Atlas Linguarum Europae have addressed. Their field of investigation includes Indo-European languages as well as languages belonging to other families (Uralic for instance, or Basque; among the Indo-European languages most of the families are represented: Celtic, Italic, Germanic, Hellenic, Slavic, etc.). Each of the European syntheses is compiled by a coordinator, but on the basis of reports established by national committees.
This collective work has already produced important results that lead us to rethink the developmental chronology of languages and to distinguish, beneath the coat of modern varnish, a stratigraphy of surprising depth. These results also lead us to reconsider the developmental phases of languages, in connection with work carried out by archaeologists, anthropologists, historians, myth specialists, etc.

Further ways to use the atlases: towards an in-depth revision of the etymological approach

There is no reason to remain at this stage of use. Remember what has been mentioned earlier: the initial concern of the atlas programs, whether implicit or explicit (this is not relevant here), was to shed light on phonetic and lexical evolution. The maps are essentially lexical ones and the putative major isoglosses refer to evolutions that bear on historical phonetics. Moreover, the most spectacular advance with regard to linguistic
theory lies within the etymological domain and in the adage “each word has its background”, which totally contradicts the assumptions of the phonetic laws. Yet, for those who have the patience to examine the etymological reconstructions proposed by Gilliéron and his successors, it is obvious that the “alleged phonetic laws” are constantly and rigorously referred to. The only restriction is that they do not represent the exclusive cause of linguistic evolution and that many other internal or external factors interfere in the development of a language.

It is well known that etymological science has progressively followed two divergent paths: the classic concept, notably that of the Ancients (Greek etumos, the “real” meaning of the words), which aimed at establishing the first meaning of the words, and the modern path, which rather advocates a history of the words and widely considers external data such as contextual, geographical, historical, social, economic or cultural data, etc. However, the theory of the linguistic sign has also changed with the introduction of the notion of the motivation of the sign as opposed to its Saussurian arbitrariness. Ipso facto, this dimension gave a major value to the atlas maps, since (onomasiological) maps and (semasiological) dictionaries complement each other. From there, a threefold pathway opens onto etymology, lexical change and the origins of language.

4. Contribution to a model of linguistic change

All this gives rise to a theorisation of the approach: as far as methodology is concerned, the identification of the motive becomes a keystone of the comparative reconstruction of the signified. The phonic form of the etymon is reconstructed thanks to the comparison of the observable languages. In the same way, the motive is reconstructed thanks to the comparison of its different expressions. This is a crucial point: all the reconstructions are based on comparative reasoning.
We know how to compare corresponding phonic sequences in French, Italian, Occitan and Catalan, and we are capable of reconstructing an abstract and common proto-form. But this process is less well mastered when it comes to determining what has to be compared when working on concepts and how to reconstruct a common invariant form. Concepts appear fleeting and volatile. Meaning seems to disintegrate as soon as it is touched upon. Hence the suspicion it arouses in the minds of those who undertake etymological analyses and the minimalist role it is granted. In order to break with this situation, which is unfavourable for etymology (that is to say, an approach that is almost all signifier and almost no signified), we argue5 that four propositions should be taken into account:

(1) The signified is accessible only if it is expressed; the signified of a sign can hence only be another sign.
(2) Identifying the motive which grounds the creation of a sign is possible only if the former displays recurrent representations.
(3) We assume that homogeneity with regard to a motive (or at least some kind of recurrence indicative of a certain unity of representation) within a homogeneous linguistic area is a licit postulate.
(4) Even if a sign is no longer transparent, its signified can still be elucidated under the hypothesis that the latter is “approximately equivalent” to the motive that led to the creation of a series of signs which refer to the same notion within a defined area.

Proposition (1) provides a solution to the problem of the insubstantial and volatile nature of the signified: the latter has now become tangible, thanks to the bundle of signs that appear as its close equivalents. Proposition (2) defines the regularity of the mechanism: only persistent and recurrent representations are to be taken into account. Proposition (3) transforms the constraint mentioned in (2) into a heuristic tool, suggesting that the observed regularity could be generalised. Proposition (4) perfects the heuristic approach, inverting variable (the unknown element) and parameter.

These propositions entail that the motive identified in a transparent series can hypothetically be transposed to a series of opaque and areally adjacent signs. The latter hence no longer appear as unanalysable and thus semantically uninterpretable phonic sequences; rather, they appear as signs with a defined signified, for which the modalities of expression are sought in the phonic form. The perspective is hence inverted.

5 Cf. Dalbera (2006).
Instead of a semasiological approach starting from the expression and attempting to understand the semantic construction, an onomasiological approach is adopted starting from the assumed signified and the aim is to determine if, in the phonic sequence which is supposed to express this concept, the construction mode can be identified: thanks to which root, which procedures (derivation or composition?), which figures of speech (metaphor, metonymy, etc.) has the sign been coined? Now this is precisely what distinguishes the atlases and the dictionaries: the former have an onomasiological point of view whereas the latter (with no difference between usage and etymological dictionaries) have a semasiological one. A dictionary lists the different possible meanings for a given word whereas an atlas map displays the possible expressions for a given notion. Moreover, the relevance to pointing out the motive – first step to the reconstruction of the fundamental semantics – can in this way be evaluated thanks to its heuristic power. If the enlightened semantics is fundamental,
then it must be capable of accounting for the etymology of words that belong to other languages or dialects which were not considered when the hypothesis was first constructed. This technique might even have to be applied cyclically and, in fact, one might have to alternate semasiological and onomasiological investigations. This methodology of lexical reconstruction might also serve as a model of lexical diffusion, insofar as the motivational analyses carried out so far seem to show that lexical renewal (whether in space or in time) is carried sometimes by the signifier (in terms of opaque alterations), sometimes by the signified (in terms of close equivalents).
This entails several theoretical and methodological consequences. Firstly, variation becomes crucial: the motive, or better, the fundamental semantics, must be postulated on the basis of recurrent series and can be revealed by one or another more explicit and less altered variant. The corollary is a set of very strong constraints on motivational analysis, which must distinguish what is anecdotal from what is recurrent and must not merely accept superficial resemblances. But these are the same guarantees that are required for phonetic reconstruction. This also brings dialectology and one of its preferred tools to the foreground of the heuristic scene. This tool is the atlas map, as it is the only one to support the study of both diachronic and diatopic variation. Comparing the standard languages would often prove insufficient to buttress a motivational analysis. At a certain level, the principle of recurrence can only be respected by dialectology.
Conclusion
We can hence conclude that linguistic atlases constitute explicitly parameterised corpora whose keywords are comparativism, diatopy and lexicon; the last of these is flexible, incidentally, insofar as further phonetic, morphological and syntactic atlases are being compiled.
The analysis and use of these corpora are to be conceived in several steps, from the production of levelled representations of geographic areas (dialectal boundaries, etc.) through to models of certain modules of Grammar (etymology being understood as lexical reconstruction). And the keyword to all this undoubtedly remains variation, which places dialectology at the centre of linguistics, at least as far as diachronic linguistics is concerned.
References
Dalbera J.Ph. (2002) Le corpus entre données, analyse et théorie, Corpus 1: 89-104.
Dalbera J.Ph. (2006) Des dialectes au langage. Une archéologie du sens, Paris, Champion.
Dalbera-Stefanaggi M.J. (forthcoming) From linguistic atlases to database and vice-versa: the Corsican example, UBLI 6, Amsterdam-Philadelphia, John Benjamins Publishing Company.
Dauzat A. (1949) Les principes du Nouvel Atlas Linguistique de la France, Archivum Linguisticum, 1: 44-51.
Dauzat A. (1955) La méthode des Nouveaux Atlas Linguistiques de la France, Orbis, 1: 22-31.
Goebl H. (1993) Dialectometry: A short overview of principles and practice of quantitative classification of linguistic atlas data, in R. Köhler and B.B. Rieger (eds) Contributions to Quantitative Linguistics, Kluwer Academic Publishers: 277-315.
Lauwers P., Simoni M.R., Swiggers P. (2002) Géographie linguistique et biologie du langage : autour de Jules Gilliéron, Leuven-Paris-Dudley, MA, Peeters.
Simoni M.R. (2004) Les Atlas Linguistiques de la France par régions (1939-1970), Flambeau 30, Revue annuelle de la section française, Université des Langues Etrangères de Tokyo: 1-22.
Yarimizu K., Kawaguchi Y., Ichikawa M. (2005) Multivariate analysis in dialectology. A case of the standardization in the Environs de Paris, in Linguistic Informatics. State of the art and the future, UBLI 1, Amsterdam-Philadelphia, John Benjamins Publishing Company: 99-119.
Appendix. List of the European linguistic atlases for the Romance area (already published or in preparation)
AIS: JABERG, K., JUD, J. [1928-40] Sprach- und Sachatlas Italiens und der Südschweiz, Zofingen.
ALAL: POTTE, J. Cl. [1975-92] Atlas Linguistique et ethnographique de l’Auvergne et du Limousin, Paris, CNRS.
ALB: TAVERDET, G. [1975-80] Atlas Linguistique et ethnographique de la Bourgogne, Paris, CNRS.
ALCat: GRIERA, A. [1923-39, 1962-64] Atlas Lingüístic de Catalunya, Barcelona.
ALCe: DUBUISSON, P. [1971-82] Atlas Linguistique et ethnographique du Centre, Paris, CNRS.
ALCL: ALVAR, M. [1999] Atlas lingüístico y etnográfico de Castilla y León, Madrid.
ALDC: VENY J., PONS GRIERA L. [2001] Atles Lingüístic del Domini Català, Barcelona.
ALEA: ALVAR, M. [1961] Atlas lingüístico y etnográfico de Andalucía, Granada.
ALEAç: BARROS-FERREIRA M., SARAMAGO J., SEGURA L., VITORINO G. [2001] Atlas Linguístico Etnográfico dos Açores, I, Lisboa.
ALEANR: ALVAR, M. [1979-83] Atlas lingüístico y etnográfico de Aragón, Navarra y Rioja, Madrid.
ALECant: ALVAR, M. [1995] Atlas lingüístico y etnográfico de Cantabria, Madrid.
ALEPO: CANOBBIO S., TELMON T. [in progress] Atlante Linguistico Etnografico del Piemonte Occidentale, Torino.
ALF: GILLIERON, J., EDMONT, E. [1902-10] Atlas Linguistique de la France, Paris, Champion.
ALFC: DONDAINE, C. [1972-84] Atlas Linguistique et ethnographique de la Franche-Comté, Paris, CNRS.
ALG: SEGUY, J. [1954-73] Atlas Linguistique et ethnographique de la Gascogne, Paris, CNRS.
ALGa: GARCÍA C., SANTAMARINA A. [1990-2003] Atlas lingüístico galego, Santiago de Compostela.
ALI: Atlante Linguistico Italiano, Roma, Istituto Poligrafico e Zecca dello Stato.
ALIFO: SIMONI-AUREMBOU M.R. [1973-78] Atlas Linguistique et ethnographique de l’Ile de France et de l’Orléanais, Paris, CNRS.
ALJA: MARTIN, J.B., TUAILLON, G. [1971-78] Atlas Linguistique et ethnographique du Jura et des Alpes du nord, Paris, CNRS.
ALL: GARDETTE, P. [1950-76] Atlas Linguistique et ethnographique du Lyonnais, Paris, CNRS.
ALLOc: RAVIER, X. [1978-94] Atlas Linguistique et ethnographique du Languedoc Occidental, Paris, CNRS.
ALLOr: BOISGONTIER, J. [1981-84] Atlas Linguistique et ethnographique du Languedoc oriental, Paris, CNRS.
ALLR: LANHER, J., LITAIZE, A., RICHARD, J. [1979-88] Atlas Linguistique et ethnographique de la Lorraine Romane, Paris, CNRS.
ALMC: NAUTON, P. [1957-63] Atlas Linguistique et ethnographique du Massif Central, Paris, CNRS.
ALN: BRASSEUR, P. [1980-84] Atlas Linguistique et ethnographique Normand, Paris, CNRS.
ALO: MASSIGNON, G., HORIOT, B. [1971-83] Atlas Linguistique et ethnographique de l’Ouest, Paris, CNRS.
ALP: BOUVIER, J. Cl., MARTEL, Cl. [1975-86] Atlas Linguistique et ethnographique de Provence, Paris, CNRS.
ALPo: GUITER, H. [1966] Atlas Linguistique et ethnographique des Pyrénées Orientales, Paris, CNRS.
ALR: PĂTRUŢ, I. [1956] Atlasul Linguistic Romin, Bucuresti, Editura Academiei Republicii Populare Romîne.
NALC: DALBERA-STEFANAGGI, M.J. [1995-99] Nouvel Atlas Linguistique de la Corse, Paris, CNRS Editions.
THESOC: DALBERA J. Ph. (ed.) [1998-2006] Thesaurus Occitan, Base de données dialectales multimédia, UMR 6039 «Bases, Corpus et Langage», Nice, CNRS-UNSA.
From the Linguistic Atlas to the Database, and vice versa
The Corsican Example
Marie-José DALBERA-STEFANAGGI
In Corsica, geolinguistics gave rise to three enterprises in the course of the 20th century: the ALF Corse [1] by Gilliéron and Edmont before World War I, the ALEIC [2] by Bottiglioni from 1933 to 1942, and the NALC [3], currently being published. The latter hinges upon a Linguistic Database also currently under preparation, i.e. the Banque de Données Langue Corse (BDLC, CNRS/Corsica University [4]). This particular structuring leads to a re-evaluation of the respective merits of the linguistic atlas and the language dictionary. Indeed, choosing between these two approaches is no longer the main issue; the aim is rather to make the two objectives complementary and mutually enriching.
The Atlas Linguistique de la France: Corse
For the Atlas Linguistique de la France directed by J. Gilliéron, E. Edmont surveyed 44 places in Corsica in 1911 and 1912, spending five days in each place. The questionnaire that had been used to survey mainland France was considerably enriched and made more specific for this study. It then included 3,000 questions that Edmont, who was not a native speaker of Corsican, had to ask in Italian, a language that he learned for this purpose before going to Corsica. The ALF Corse was published in Paris in 1914. However, publication was interrupted at the fourth issue (a total of 799 maps) and the rest of the information, i.e. the survey notebooks, was stored at the National Library. The authors explained on various occasions that what guided their work was their dream of total “objectivity”, their concern to collect raw, pure and pristine data, representations of linguistic facts comparable to snapshots. We will not go back over the politico-scientific debate [5] that followed the publication of the ALF Corse, nor over the criticism levelled at its
[1] Gilliéron & Edmont 1914-1915.
[2] Bottiglioni 1933-1942.
[3] Dalbera-Stefanaggi 1995-1999.
[4] Available on the web site of the University of Corsica.
[5] With regard to this matter, the issue concerning the notation of the nasal vowels is particularly enlightening. Cf. Dalbera-Stefanaggi 2001: 77-89.
authors. Nowadays, a more fine-grained analysis of the documents should probably be produced, and several levels should be distinguished. Indeed, even if quite a few obvious transcription errors can easily be pointed out, other transcriptions reach a level of truth that only a foreign ear is capable of capturing, insofar as there is no interference from already-known elements. Quite a few transcriptions that were thought to be wrong because they did not comply with a point of view based on a certain Italian norm in fact appear to be very representative of a certain level of reality, thanks to an impressionism free of any a priori.
The Atlante Linguistico Etnografico Italiano della Corsica
Compiled by G. Bottiglioni, this atlas can be described as the opposite of the ALF Corse on most points. It offers an extremely rich, fine-grained and relevant representation of the dialectal geography of the island. As a result, Corsica is an exception among the French regions, and this is why it was thought for a while that it would be useless to “cover” it again. If we try to leave aside the context, the ideological background of particularly troubled times, as well as the issue Corsica represents, it can be noted that Bottiglioni takes the exact opposite view to Gilliéron and Edmont on every point. Additionally, he distances himself sharply from the AIS authors [6] while presenting his work as their areological complement. Enriched with additional and more detailed information, the ALEIC is based on several years of surveying and reflection. Published from 1933 to 1942, the 2,000 ALEIC maps are accompanied by two volumes: a dictionary that covers the whole corpus gathered during the surveys, and an introduction that sets out, among other things, the methodology applied together with the theory of the norm and of dialectal variation underlying the whole work.
Among other matters, the issue of how facts should be represented is very clearly addressed. Bottiglioni particularly criticizes the impressionistic transcription adopted by the AIS surveyors and, a fortiori, the one adopted in the ALF Corse. Indeed, Bottiglioni considers this kind of transcription an illusion, since he doubts the ear’s capacity to capture all phonetic nuances. Moreover, even if these impressionistic transcriptions could conform to reality, they would not thereby become legitimate. According to Bottiglioni, the forms recorded and transcribed by the surveyor must not represent the expression of a fleeting moment; rather, they must reflect the normal and usual expression of average speakers. Bottiglioni hence sets out to provide a normalising transcription that displays a representation of the
[6] Jaberg & Jud 1928-1940.
average speech.
The NALC and the BDLC
In the 1970s, the debate revolved around whether yet another linguistic atlas of Corsica was justified. We know that, following Albert Dauzat’s call in 1939 for a new “coverage” of France that would renew, complete and refine Gilliéron and Edmont’s work, the Atlas Linguistiques de la France par régions had been set up by the Centre National de la Recherche Scientifique [7]. Should Corsica be part of the project, given that it was already present in two previous atlases? The answer turned out to be positive, and a project for a Nouvel Atlas Linguistique de la Corse was included in the national program. As early as 1974, a first questionnaire was devised and several surveys were carried out. In 1986, an order placed by the Territorial Authority of Corsica for a Linguistic Database gave a new dimension to this project: the Nouvel Atlas Linguistique de la Corse henceforth hinged upon the Banque de Données Langue Corse and became one of its by-products, namely the cartographic use of its data. Today, twenty years after these first steps, the results are largely positive, even if far from complete: in the progress of the work, in the “paper” publications completed or in preparation, and in the implementation and accessibility of the BDLC data and analyses.
Collecting data: questionnaires, surveyors and network
The data have been collected by means of three different types of questionnaire, which can be described as follows.
The NALC questionnaire proper consists of approximately 3,500 questions.
In accordance with the spirit of the Atlas Linguistiques de la France par régions collection and more generally of the regional or national atlases currently under preparation (and in order to be able to keep track of the latter without hiding the specificities of the former), the questions are divided into the following themes: agro-pastoral life, nature, home, family, human body, village (or town), and beliefs. This questionnaire is not an a priori questionnaire: it relies on the analysis of many recordings (over one thousand hours of recordings are currently at our disposal) made over long periods of time by means of semi-directive surveys and free conversations about particular technical or cultural domains. Hence, through a series of successive modifications, we have obtained what might better be called a “responsaire” than a questionnaire, i.e. the list of the signifieds, expressed in French by convention, of the lexical units gathered. This responsaire
[7] Cf. for instance Séguy 1973.
represents in advance the list of the expected maps. This methodology has often enabled us to collect terms that were unexpected because they were unknown to us. Furthermore, the texts considered “interesting” are transcribed in full in a phonemic orthography and, once integrated into the database, are accessible in different ways (keywords, locations, etc.). They thus constitute a corpus of oral data that can be collated with the field recordings when the sound quality is good enough. This corpus has an undeniable ethno-linguistic value and, moreover, it supplies keys that sometimes lead to etymological and semantic interpretations, thanks to a definition proposed spontaneously by an informant [8].
The test-questionnaire is collected within a network that is expected to become quasi-exhaustive. It comprises several dozen items belonging to ordinary vocabulary. Its interest is not lexical but phonological, synchronically as well as diachronically. It relies on preliminary analyses in this domain and illustrates the developments considered relevant with examples that can be applied more generally because of their low lexical specificity. A subset of the general questionnaire, it enables us to assign the speech variety under study to a typological zone very quickly and to identify, from a reduced but necessarily representative sample, the main transcription problems that will occur for the whole corpus. The “fine-grained” phonetic transcription used for this part of the questionnaires is not extended to the transcription of the open questionnaire; however, it is used as a kind of “filter” for the latter. This part of the NALC is included in the first volume, published in 1995.
The specific questionnaires are devised on principles similar to those of the NALC general questionnaire. The questionnaire about the coasts, for instance, originates from a study of the marine environment in Corsica carried out by local specialists in about fifteen places along the coast [9].
The same holds for the questionnaires about flora, fauna, etc. One particular aspect of this type of questionnaire is that it deals with the link between a folk taxonomy and a scientific one, which requires a thorough knowledge of the subject matter. This is why we have made a point of calling on specialists. The surveys have also been organised in various ways, their only shared characteristic being compliance with the rules of dialectological surveying. Thus, the test-questionnaire elicits questions and answers (translations), whereas the ethno-lexical surveys resemble more or less directed interviews in Corsican about specific themes, with one or several informants at the same time. A tape recorder was used every time, as well as a video
[8] Cf. Document.
[9] This enterprise has given rise to a specific publication: Dalbera-Stefanaggi 1999.
camera during the most recent period. The informants had to meet two conditions: their “first” mother tongue had to be Corsican, and they had to use this language daily. Rather than a fixed age criterion, we were led to establish a “relative” one, which combined age with social, geographic and cultural characteristics. Earlier work carried out in Corsica within various frameworks [10] guided us on this point. The Nouvel Atlas Linguistique de la Corse has never had dedicated surveyors. For quite a long time we carried out most of the surveys ourselves; more recently, advanced students, some of whom have become our colleagues, have started to carry out surveys under our supervision. These students usually come from the region they work on, which gives them an approach “from the inside”. Nevertheless, in order to preserve the homogeneity of the data, we remain responsible for the analysis and the transcription of the tapes.
Like the survey questionnaire, the NALC network operates at several levels. During a first period (up to 2002), a network of 59 places represented our reference grid, with one or two points in each district. However, it was the former geographical division into pievi, rather than the more recent division into districts, that was finally favoured. Point 59, for instance, is in Italy, on the Maddalena Island, between Corsica and Sardinia. Point 58, Bonifacio, represents the Genoese dialect. A specific coastal network, used to collect the questionnaire linked with the sea, comprises fourteen locations along the coast. These have been selected with the above-mentioned coastal atlas in mind. A more fine-grained grid is organised around these 59 main points. This second network is essentially used to collect the phonetic test-questionnaire.
This methodology was motivated by the concern not to neglect any possibility of understanding the organisation of the linguistic space, especially the breaks or transitions between the areas already identified. Hence, we were led to add further points to our initial network in the areas of higher population density (in the North-East) or in areas where a linguistic transition was still poorly understood (e.g. the Taravu-Fiumorbu area). Moreover, this network was designed to be extendable: additional points were added to the initial 115-point grid. To allow this, we set up a two-level numbering system. The core points were numbered from 01 to 59, from West to East and from North to South. Then, “satellites” were attached to these core points; their numbers consist of the core point’s number followed, on the right, by a third digit printed in a smaller font. Hence, points 061, 062 are satellites of
[10] On these aspects, cf. Dalbera-Stefanaggi 1991.
point 06, points 261, 262, 263 belong to point 26, etc. In theory, this numbering allowed us to multiply the initial network of 59 core points by 9. However, since 2002 we have been led to conceive of our network as potentially exhaustive, owing to the unexpected development of our work, i.e. the abundance of documents gathered and the growing number of points recorded. Since then, we have used the “outline map” provided by the INSEE [11], an exhaustive map of the 350 Corsican towns, each with a fixed number assigned according to alphabetical order. This new network is currently being applied retroactively to the first two volumes of the Atlas, which is now being republished. It thus becomes the stable base for any new cartographic work. Obviously, it is the computerised base of the Atlas that made this evolution possible, since it gives the system great flexibility. A correspondence table, produced automatically, enables us to re-read earlier documents.
Using the data: transcription and cartography work
Let us first recall that the NALC is entirely computerised and based on a relational database. This has certain consequences and allows us to carry out a number of projects. First, we have been led to devise a specific phonetic font that uses the IPA symbols. Second, the cartographic work is entirely automatic once the forms have been keyed in, location by location, in different files. Indeed, in the BDLC, a phonetic form, a form in phonemic orthography, a lemmatised form, a morphological form and an etymological form are recorded for each lexical unit, and each of these aspects can be represented by a symbol.
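The two-level numbering of survey points and the automatically produced correspondence table can be illustrated with a short sketch. It is only an illustration under stated assumptions, not the BDLC implementation: the names `SurveyPoint`, `parse_point`, `number_alphabetically` and `correspondence_table` are hypothetical, the alphabetical numbering uses a plain string sort, and the three-locality list is a toy example whose numbers are not the real INSEE numbers.

```python
from dataclasses import dataclass

CORE_MIN, CORE_MAX = 1, 59  # core points are numbered 01..59 in the text


@dataclass(frozen=True)
class SurveyPoint:
    core: int       # number of the core point
    satellite: int  # 0 for the core point itself, 1..9 for a satellite

    @property
    def is_core(self) -> bool:
        return self.satellite == 0


def parse_point(code: str) -> SurveyPoint:
    """Parse '06' (core point) or '061' (first satellite of point 06)."""
    if not (code.isascii() and code.isdigit() and len(code) in (2, 3)):
        raise ValueError(f"unrecognised point code: {code!r}")
    core = int(code[:2])
    if not CORE_MIN <= core <= CORE_MAX:
        raise ValueError(f"core point out of range: {code!r}")
    if len(code) == 2:
        return SurveyPoint(core, 0)
    satellite = int(code[2])
    if satellite == 0:
        raise ValueError(f"satellite digit must be 1-9: {code!r}")
    return SurveyPoint(core, satellite)


def number_alphabetically(names):
    """Assign 1..n in sorted order (plain string sort: an assumption)."""
    return {name: rank for rank, name in enumerate(sorted(names), start=1)}


def correspondence_table(legacy_points, numbering):
    """Map each legacy point code to the new number of its locality."""
    return {code: numbering[name] for code, name in legacy_points.items()}


# Toy data: three localities only, so these are NOT real INSEE numbers.
numbering = number_alphabetically(["Bonifacio", "Ajaccio", "Corte"])
table = correspondence_table({"58": "Bonifacio"}, numbering)
```

With one optional third digit running from 1 to 9, each core point can carry up to nine satellites, which is one way of reading the statement that the numbering multiplies the 59-point network by 9. The correspondence table is what allows a document indexed under the old code (here point 58, Bonifacio) to be found again under the new numbering.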
This multiple use of the data, which enables us to go beyond the old debate about transcription (impressionistic or standardised), has led us to set up a multi-dimensional cartography, depending on the type of data to be represented, the objectives, and the different levels of analysis mentioned above.
How to map facts
This task rests on an impressionistic transcription. Although obviously also “constructed”, such a transcription nevertheless has its own level of relevance and must be taken for what it is: on a given day, in a given place, a given person uttered a given form that we transcribed in a given way (since variability also inevitably comes from the transcriber). And if these forms are displayed on maps, it is because they have been judged to be of interest. Hence, the disappearance of a vowel, the excessive closure of an [e] yielding an [i], the
[11] French National Institute of Statistics and Economic Studies.
absence of voicing of a consonant for which we were expecting lenition, the velar realisation of an /r/ (it should be noted that these examples are already instances of an implicit norm or of a phonological level of analysis) instantiate this possibility, but they are also signs that enable us to reconstruct and interpret: these elements generally carry diachronic information within their sociolinguistic arrangement.
Standardised cartography
It works on both the synchronic and the diachronic level. “Levelling” procedures had to be applied to the phonetic forms so that only commensurable items are symbolised and structured. In particular, it proves necessary to take into account the phonological status of the transcribed elements at each point. Indeed, mapmaking nowadays has to integrate synchronic structural data. As far as diachronic cartography is concerned, we have likewise tried to link only stable developments. Hence, for instance, we considered it relevant to map the variability (i.e. a wide range of variation) of the consonants that come from the Latin occlusives [12] or, with regard to morphology, the way feminine plural endings are formed in -e or -i.
How to map correlated analyses
This type of cartography operates at a deeper level and has proved to have heuristic value. It allows us to cross and compare facts whose analysis leads us to uncover possible links. Once these links have been displayed geographically, it becomes possible to sketch out zones and hence put forward a hypothesis, since space is a projection of time. It works, again automatically, on pairs of maps that have been specially selected and already analysed. This type of cartography thus enables us to define areas of spatial, synchronic or diachronic organisation instead of areas of facts.
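The idea of drawing one map per level of analysis and then crossing pairs of maps can be sketched as follows. This is a minimal illustration under stated assumptions, not the NALC software: the record layout `Entry`, the functions `build_map` and `cross_maps`, and every value in the toy data are invented for the example and are not taken from the atlas.

```python
from collections import Counter
from dataclasses import dataclass

# The five forms said to be recorded for each lexical unit.
LAYERS = ("phonetic", "orthographic", "lemma", "morphological", "etymon")


@dataclass(frozen=True)
class Entry:
    point: str          # survey point code
    notion: str         # the signified, given in French by convention
    phonetic: str
    orthographic: str
    lemma: str
    morphological: str
    etymon: str


def build_map(entries, notion, layer):
    """Return {point: form} for one notion at one level of analysis."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    return {e.point: getattr(e, layer) for e in entries if e.notion == notion}


def cross_maps(map_a, map_b):
    """Count how often each pair of values co-occurs at shared points."""
    shared = map_a.keys() & map_b.keys()
    return Counter((map_a[p], map_b[p]) for p in shared)


# Invented placeholder records (not NALC data).
entries = [
    Entry("01", "N1", "p-a", "o-a", "L1", "m-a", "E1"),
    Entry("02", "N1", "p-b", "o-b", "L1", "m-b", "E1"),
    Entry("03", "N1", "p-c", "o-c", "L2", "m-c", "E2"),
]
lemma_map = build_map(entries, "N1", "lemma")

# Two invented feature maps, loosely modelled on the examples in the text.
plural_map = {"01": "-e", "02": "-e", "03": "-i"}
stops_map = {"01": "voiced", "02": "voiced", "03": "voiceless", "04": "voiced"}
links = cross_maps(plural_map, stops_map)
```

A pair of values that recurs across many points in `links` is only a candidate correlation. In the terms of the chapter, it is the analyst who then decides whether the zones it outlines support a hypothesis about synchronic or diachronic organisation.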
Prospects
This very long work of preparation and elaboration, unavoidable in an undertaking of this kind as regards both the collection and analysis of the linguistic facts and their computerisation, nevertheless proves to be “profitable”. The first NALC volume, which comprises phonetic and phonologically oriented maps (the test-questionnaire, the 113-point network) and analysed maps, was published in 1995. The second volume, dedicated to maritime vocabulary, was published in 1999. A third volume, dedicated to the vocabulary of wild fauna and flora, is currently under preparation.
[12] NALC 1, maps 236-241, 246, 251.
The first two volumes being out of print, a re-edition of these two volumes and a first edition of the third are currently in preparation with the Editions du CTHS (Paris) and the Editions Alain Piazzola (Ajaccio), which have bought the rights from CNRS Editions. These three volumes should hence be published simultaneously in 2007-2008. Further volumes should follow. Concomitantly, collections of ethno-linguistic documents, also drawn from the BDLC, are currently being published under the title Detti è usi di paesi. Of course, this is a long-term and difficult enterprise and the work progresses step by step. However, with hindsight, we can assert that, far from being opposed as two contradictory or at least disjoint entities might be expected to be, the Atlas and the Database prove to be dialectological tools that should be improved jointly since, indeed, they enrich one another.
References
Bottiglioni, G. 1933-42. Atlante Linguistico Etnografico Italiano della Corsica. Pisa.
Dalbera-Stefanaggi, M.J. 1991. Unité et diversité des parlers corses. Alessandria: Edizioni dell’Orso.
Dalbera-Stefanaggi, M.J. 1995 & 1999. Nouvel Atlas Linguistique de la Corse (vol. I and II). Paris: CNRS Editions.
Dalbera-Stefanaggi, M.J. 2001. Essais de linguistique corse. Ajaccio: A. Piazzola.
Gilliéron, J. & E. Edmont. 1914-15. Atlas Linguistique de la France: Corse. Paris: Champion.
Jaberg, K. & J. Jud. 1928-1940. Sprach- und Sachatlas Italiens und der Südschweiz. Zofingen.
Séguy, J. 1973. “Les atlas linguistiques de la France par régions”. Langue française 18. 65-91.
Document
Example of an unexpected enrichment of the corpus due to the collection and analysis of semi-directed productions (ethno-texts). The following example is an extract from Purcelli è maghjali, Detti è usi di paesi, Ajaccio, Alain Piazzola, 2006. During a semi-directed conversation about how to prepare charcuterie, it is explained to the surveyor how to prepare intestines in order to use them as casing for the meat.
The answer unveils a term that had not been foreseen by the questionnaire.
Urdì, urdà, urdillà
Within our corpus, the verb appears in several forms. For instance, in San Gavinu: par urdalli, si staccanu da par eddi “to ‘order’ they separate from one another”. Suffixed in various ways, the word has a very precise technical meaning: to unwind the ball of intestines and place them on a shuttle in such a way that they can easily be separated into layers. This technical meaning is a metaphor of another technical meaning, that of the ordering process by which a weaver places the threads on a loom: “to set up the warp”. We would prefer to speak of polysemy rather than of a figurative meaning, since the “ordering” of the intestines seems to be a precise technical activity. In certain areas (e.g. Alta Rocca), our informant added further details to the description of this technique: while one woman unwinds the intestines, another, with her knife, very regularly cuts away the strip formed by a kind of festoon of fat that runs all along the intestines: « U ntriddu, c’est la graisse qui maintient tous les boyaux è quandì si orda, si chjama ordà; sò i donni chì facini st affari; pianu pianu, pianu pianu, una tira u budeddu è quill’ altra faci cù u culteddu » (“[…] and when we order (we call that ‘order’, the women do that) very slowly, very slowly, one pulls the intestine and the other one does with the knife…”). We note that the feminine character of this work is emphasised here, as is usual whenever spinning and weaving are evoked, recalling the attributes of the Parcae… The Latin ordior conveys the basic meaning “to order threads on a loom, to place the warp” and derived meanings such as “to start, to set up…”, subsequently found in the Romance languages. It is the basic meaning that is dominant here, probably due to a popular interpretation.
A Usage-based French Dictionary of Collocations
Peter BLUMENTHAL
1. Aims
The primary practical purpose of the Franco-German research project presented in this paper is to compile a collocations dictionary of French nouns. The project, which has been going on for three years, is the result of a collaboration between the ATILF [1] (with a research group headed by Pascale Bernard) and the department of Romance languages at the University of Cologne (with a team led by Peter Blumenthal). Drawing on quantitative, corpus-based methods, we aim to provide a detailed account of the restricted range of syntactic and semantic contexts in which at least 2000 nouns stereotypically appear. Our approach can thus be considered as roughly following recent models of English lexicography (Benson, Benson and Ilson 1997; cf. also Kjellmer 1994 and Deuter 2002). However, it must be mentioned that French dictionaries founded on a similar theoretical basis have already been developed, and some of them are freely accessible on the internet (cf. Beauchesne 2001; Binon, Verlinde, Van Dyck and Bertels 2000; González Rodríguez 2004; Grobelak 1990; Ilgenfritz, Stephan-Gabinel and Schneider 1989; Mel’čuk et al. 1984-1999; Zinglé & Brobeck-Zinglé 2003). Needless to say, even “traditional” dictionaries of French (i.e., those not particularly focussing on collocations) provide more or less elaborate descriptions of the typical combinatorial properties of lexemes. This is especially the case with the most popular and successful single-volume dictionary of French, the Petit Robert (Rey-Debove & Rey 2006). Moreover, its electronic version offers the considerable advantage of containing menu items (in particular exemples/expressions and recherche avancée – texte intégral) that allow users to search for typical word combinations or to test their intuitions about possible word combinations.
Of course, for obvious reasons, the fifteen-volume Trésor de la langue française (Imbs 1971-1994) provides a much greater amount of information than the Petit Robert. Its column “SYNT” (syntagme(s) rencontré(s) dans la documentation, i.e. phrases attested in the documentation) orders typical word combinations by semantic and syntactic
Analyse et Traitement Informatique de la Langue Française, a joint research center of the CNRS (Centre National de Recherche Scientifique) and the Université Nancy 2.
68
Peter BLUMENTHAL
criteria. In the face of the competitive situation just sketched, if we want to bring to market yet another dictionary, we have to tackle two crucial questions: on which theory should we draw, and what should the implementation considerations be so as to render our dictionary even more attractive than its competitors? To be sure, attractiveness does not exist in absolute terms; it rather depends on specific user needs. The readership we have in mind is the (admittedly rather restricted) group of people interested in obtaining detailed information on a wide range of syntactically differentiated word combinations and on their respective degree of frequency across different genres. In particular, the dictionary shall be of great worth for instructional purposes (for example for teachers of French as a first or foreign language) and for those who need to capture subtle nuances of expression in demanding texts. The dictionary is not only reserved for native speakers of French; it should also meet the needs of non-natives with a very good command of French. In the following, we list some of the characteristics which we hope will contribute to the competitiveness of our dictionary: — a usage-based approach — a corpus based on two different text genres — elaborate and reliable statistical methods for determining collocates — utilisation of a differentiated syntactic grid for classifying collocates — precise indication of the degree of specificity of collocates — quotation of sample sentences illustrating particularly interesting word combinations — separate treatment of purely literary word combinations. Before presenting these points in more detail, we must briefly address a question which is fundamental in this context: which notion of collocation should the dictionary be based on? As is generally known, different schools of linguistics have interpreted the term in different ways. 
Roughly speaking, it is possible to distinguish between qualitative and quantitative definitions (cf. Hausmann & Blumenthal 2006). According to the proponents of a purely qualitative interpretation, the cooccurrence of a keyword and its collocates results from unpredictable lexical restrictions that characterise each constituent of a collocation2. The collocate tends to shed a particular light on the overall characterisation of the subject, thus adding some kind of figurative surplus-value to the corresponding non-collocational description of a situation. Consider the following oft-cited example: someone described as a célibataire endurci (i.e., a “confirmed bachelor”) is presented as outperforming other men in his class in terms of prototypicality; in other words, he corresponds to the absolute stereotype of a bachelor in all respects. In its conventional usage, this expression has a purely denotative and classifying function. However, when the literal meaning of the adjective “endurci” is contextually licensed, the collocation may receive a more figurative interpretation (according to the Petit Robert: “Qui est devenu dur, insensible”). One notes that some collocations harbour latent associations and connotations which often fade away in normal usage.
2 Examples: émettre un son, but pousser un hurlement; infliger une déception, but inspirer une crainte (cf. Steinlin 2003:3).
It must be emphasised that the proponents of a quantitative interpretation, among whom we count ourselves, do not reject the idea that collocations arise from selectional restrictions, nor do they deny the existence of contextually enriched idiomatic meanings. What they are wary of, rather, is the fact that these theoretical ideas can hardly be formalised for the practical exploitation of text corpora. For this reason, their definition of the notion of collocation relies on probabilistic, statistical methods. More exactly, they define collocations as those combinations of words which occur more frequently across a corpus than would be predicted by mere chance, with the expectation that this definition will also cover those word pairs that qualify as collocations under a qualitative analysis. In the following, we would like to show the precise methodological consequences which our decision for a quantitative approach implies and how they are reflected in our practical lexicographic work.

2. Corpora and research methods
To meet the needs of our target audience, whose main interest is assumed to lie in the combinatorial possibilities of formal written genres of contemporary French, our corpus was chosen to cover both newspaper language and literary texts. Literary language had to be included in order to counterbalance the thematic bias exhibited by daily newspapers, which, for all their linguistic wealth, tend to focus on certain topics (e.g., economy and politics) while neglecting others (e.g., the domains of feelings and everyday life). Our newspaper of choice was selected on the basis of a corpus-based linguistic comparison of regional (L’Est Républicain, Ouest-France, Sud-Ouest) and national (Le Figaro, Libération, Le Monde) daily newspapers. Lexical and collocational differences among national newspapers turned out to be much smaller than those between the regional press on the one hand and the national press on the other. The richer and more differentiated use of
vocabulary, which was particularly apparent in the areas of economy, finance, culture and science, clearly argued for the selection of a national daily, with Le Monde and Le Figaro both coming into question on purely linguistic criteria. The factors which eventually turned the balance in favour of Le Monde were merely practical ones: the number of copies per issue in the years 1999 and 2000 (about 52 million words) and the prestige that the newspaper enjoys among our target audience3. The literary part of our corpus was provided by courtesy of ATILF. It consists of the data from Frantext (1950-2000), a database containing 224 novel excerpts amounting to a total of about 16 million words. The next step was to categorise and lemmatise our corpus via Treetagger (cf. Stein & Schmid 1995) as a prerequisite for the application of log likelihood, a probabilistic calculation that enables us to determine the degree of specificity of word combinations4. Word combinations lying above the specificity threshold of 10.8 are considered to be collocations. The following table provides an illustration of the collocates of crainte thus determined:

Collocates of the keyword crainte:

rank  collocate       syntactic category  position  log likelihood score  cooccurrences
1     par             PREP                L         881                   318
2     exprimer        V                   L         578                   80
3     apaiser         V                   L         487                   45
4     voir            V                   R         464                   101
5     inflationniste  ADJ                 R         435                   41
6     raviver         V                   L         394                   34
7     sans            PREP                L         365                   99
8     dissiper        V                   L         347                   31
9     représailles    N                   R         331                   31
10    exprimer        V                   R         320                   52
11    susciter        V                   L         289                   43
12    bogue           N                   R         260                   26

Collocates of CRAINTE in Le Monde 1999/2000 (span: up to 5 words to the left [= L] and to the right [= R] of the key)

3 This was also the newspaper chosen by the authors of the programme Voisins de Le Monde, which allows internet users to compile their own dictionary of collocations.
4 Cf. Manning & Schütze 2000:151-153.
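The text names the statistic and the threshold but does not spell out the calculation. The following is a minimal sketch, assuming the widely used 2x2 contingency-table form of the log likelihood ratio; the project's actual implementation is not documented here, and the function names and the counts passed to `log_likelihood` are purely illustrative. The threshold of 10.8 is close to the chi-square critical value for p = 0.001 with one degree of freedom (about 10.83), although the text does not state that this is its motivation. The second half of the block uses the rows of the table above to show that ordering by score and ordering by raw co-occurrence count need not agree.

```python
import math

# Threshold quoted in the text; scores above it count as collocations.
SPECIFICITY_THRESHOLD = 10.8


def log_likelihood(both, key_only, collocate_only, neither):
    """Log likelihood ratio (G2) for a 2x2 table of observed counts.

    Assumed cell layout (not specified in the text): contexts containing
    keyword and collocate, the keyword only, the collocate only, neither.
    """
    observed = ((both, key_only), (collocate_only, neither))
    row_sums = [sum(row) for row in observed]
    col_sums = [observed[0][j] + observed[1][j] for j in range(2)]
    total = sum(row_sums)
    score = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # an empty cell contributes nothing to the sum
                expected = row_sums[i] * col_sums[j] / total
                score += o * math.log(o / expected)
    return 2.0 * score


def is_collocation(score, threshold=SPECIFICITY_THRESHOLD):
    """Apply the specificity threshold to a log likelihood score."""
    return score > threshold


# Hypothetical counts, NOT taken from the corpus:
independent = log_likelihood(10, 90, 90, 810)  # observed == expected
associated = log_likelihood(50, 50, 50, 850)   # strong association

# Rows of the crainte table: (collocate, position, score, cooccurrences).
CRAINTE = [
    ("par", "L", 881, 318), ("exprimer", "L", 578, 80),
    ("apaiser", "L", 487, 45), ("voir", "R", 464, 101),
    ("inflationniste", "R", 435, 41), ("raviver", "L", 394, 34),
    ("sans", "L", 365, 99), ("dissiper", "L", 347, 31),
    ("représailles", "R", 331, 31), ("exprimer", "R", 320, 52),
    ("susciter", "L", 289, 43), ("bogue", "R", 260, 26),
]

# Ranking by specificity versus ranking by absolute frequency.
by_score = sorted(CRAINTE, key=lambda row: row[2], reverse=True)
by_count = sorted(CRAINTE, key=lambda row: row[3], reverse=True)
```

Under the frequency ordering, sans (99 co-occurrences) precedes exprimer to the left of the keyword (80), whereas the score ordering reverses the two.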
As can be seen in the table5, the combination of the preposition par with the keyword crainte (par crainte, par la crainte) not only exhibits the highest absolute number of co-occurrences (318x), it also represents the most specific combination attested (log likelihood score 881). Thus, unlike its synonym peur, crainte typically functions as part of an adverbial phrase (section 7.2 of the grid in section 3). With regard to semantics, it is striking that prepositional phrases denoting the absence of crainte (sans + crainte) are much more frequent and specific than their positive counterparts: avec + crainte has a log likelihood score of 7.3 and thus ranks very low (713th place), below the threshold of specificity. Note that a ranking in terms of absolute frequencies would not yield the same result as our ranking, which is based on the criterion of specificity. Thus, for instance, although sans and crainte cooccur more frequently (99x) than exprimer and crainte (80x) in absolute terms, the former combination is less specific than the latter, since the word sans is much more frequent in our corpus than the word exprimer. As the reader will have noticed, some collocations result from language-internal structures (in particular those containing highly specific collocates such as par, exprimer and apaiser), whereas others merely reflect issues that were of current (and transient) interest to newspaper readers in 1999 and 2000. A case in point is the collocate bogue, a word frequently cited in connection with the “year 2000 problem”, which occurs in the following citation from Le Monde (4.2.1999): “près de la moitié des informaticiens britanniques refusent, par crainte du bogue, de prendre l’avion ce jour-là”. For the purposes of our dictionary, we excluded collocates of this kind, which obviously cannot be claimed to represent stable linguistic structures.
5 De is not taken into account, since the programme turned out to produce unreliable results as to the syntactic category involved (Preposition? Article?).

3. The syntactic grid
As our listing of collocates of crainte suggests, collocational relationships can hold between all kinds of syntactic categories, for example between an attributive adjective and a noun, between the components of a verb phrase or between an adverbial and the rest of the clause. To categorise these kinds of relationships, we devised a syntactic grid which comprises 19 individual syntactic patterns grouped into 7 broader sections (1.1-7.2). Collocations are inserted into this grid according to the syntactic categories of their constituents.
1.1 ADJECTIVE + KEY
1.2 KEY + ADJECTIVE
1.3 KEY + NOUN
2.1 KEY + de + subjective genitive NOUN
2.2 KEY + PREPOSITION + object NOUN6
2.3 KEY + PREPOSITION + INFINITIVE
2.4 KEY + COMPLEMENT CLAUSE
2.5 KEY + PREPOSITION + adverbial NOUN7
2.6 KEY + RELATIVE CLAUSE
3.1 quantifying NOUN + KEY
3.2 NOUN + PREPOSITION + KEY
3.3 NOUN + KEY
4. ADJECTIVE/ADVERB + PREPOSITION + KEY
5.1 KEYsubject + VERB
5.2 KEY + ATTRIBUTE
6.1 VERB + KEYobject
6.2 copular VERB + KEY
7.1 adverbial KEY8
7.2 PREPOSITION + adverbial KEY

6 Noun in object function.
7 Noun which is part of an adverbial prepositional phrase.
8 The keyword is part of an adverbial prepositional phrase.

4. An example: débat
To put together a dictionary entry, we map the totality of collocates of a given keyword onto the syntactic grid just presented. Without going into the details, note that the window span (i.e., the number of words to the left and the right of the keyword) that we used for searching collocates ranged from 1 to 5 words (compare, for example, pattern 1.1 in the grid, where only a single word to the left of the keyword is taken into account, with pattern 6.1, where up to five words are processed). A sample dictionary entry follows; it is examined in detail in section 5.
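Before turning to the entry, the pattern-dependent window spans can be illustrated with a small sketch. It assumes a corpus that is already available as a flat list of lemmas and simply counts what occurs within a given number of tokens to the left and right of each occurrence of the keyword. The function name, its parameters and the lemmatised toy inputs are hypothetical; the sketch ignores sentence boundaries and part-of-speech filtering, which a real extraction would need.

```python
from collections import Counter


def window_collocates(lemmas, key, left=5, right=5):
    """Count lemmas found within `left` tokens before and `right` tokens
    after each occurrence of `key`.

    Returns a Counter keyed by (lemma, position), where position is "L"
    or "R" as in the crainte table. Sentence boundaries are ignored,
    which is a simplification.
    """
    counts = Counter()
    for i, lemma in enumerate(lemmas):
        if lemma != key:
            continue
        for neighbour in lemmas[max(0, i - left):i]:
            counts[(neighbour, "L")] += 1
        for neighbour in lemmas[i + 1:i + 1 + right]:
            counts[(neighbour, "R")] += 1
    return counts


# Toy input: a hand-lemmatised fragment of the Le Monde citation
# "refusent, par crainte du bogue" (the lemmatisation is an assumption).
wide = window_collocates(
    ["refuser", "par", "crainte", "de", "le", "bogue"], "crainte"
)

# A span of one word to the left only, as described for pattern 1.1.
narrow = window_collocates(["un", "vif", "débat"], "débat", left=1, right=0)
```

Keeping the left and right spans as separate parameters makes it possible to use a narrow window for adjacent patterns such as 1.1 and a wider one for patterns such as 6.1, where the verb may stand several words away from its object.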
débat, n.m. (Le Monde 1999/2000: 13340, Frantext: 197) 1.1 ADJ + DÉBAT: actuel (14) | âpre (142): Six points de vue relancent cet âpre débat. | autre (55): Deux autres débats partagent la majorité. | dernier (12) | difficile (18) | docte (20) | épineux (30) | éternel (87): On n’ouvrira pas ici l’éternel débat: “Qui juge les juges?”. | éventuel (14) | faux (256): Il parut à Gerfaut que c’était un faux débat. (Manchette 1976) | grand (802): Le grand débat sur la défense des salariés a lieu dans toute l’Europe. | houleux (43) | important (34) | inépuisable (14) | intense (75): Ce point a fait l’objet d’un intense débat au sein du conseil. | interminable (77): La direction et les syndicats se sont lancés dans un interminable débat sur la nature de la grille d’intégration. | lancinant (14) | languissant (22) | large (236): Nous souhaitons un large débat public sur le nucléaire. | libre (29) | long (156): Les sages engagent un long débat sur les options finales. | même (12) | moindre (18) | multiple (28) | nombreux (202): Ce texte a suscité de nombreux débats. | nouveau (27) | passionnant (30) | plein (138): Il interviendra en plein débat sur le projet de loi sur l’audiovisuel. | prochain (13) | récent (63): Ce problème a été occulté par les récents débats législatifs. | réel (54): La campagne n’a pas permis de réel débat sur l’Europe. | sempiternel (53): C’est une application du sempiternel débat sur les rapports entre l’évolution des mœurs et celle des lois. | seul (24) | vaste (282): Ces extraits ne sont que le début d’un vaste débat. | véritable (234): Il plaide pour un véritable débat démocratique sur les choix énergétiques. | vieux (106): On retrouve ici la trame du vieux débat entre finance et économie. | vif (513): Les nouveaux programmes d’histoire de terminale suscitent un vif débat. | virulent (19) | vrai (793): Le vrai débat sera d’ordre économique. 
1.2 DÉBAT + ADJ: académique (43) | acharné (16) | actuel (313): Ils ne veulent pas rester en marge du débat actuel. | animé (< 448): Il sera le lieu d’un débat animé. | approfondi (71): Son projet de loi doit faire l’objet d’un débat approfondi. | argumenté (30) | artificiel (13) | budgétaire (446): Nous souhaitons un débat budgétaire ouvert. | byzantin (15) | communautaire (20) | confus (18) | consacré (80): Ils participeront à des débats consacrés à des thèmes divers. | constitutionnel (16) | contradictoire (923): Le juge a organisé un débat contradictoire. | critique (15) | crucial (14) | décisif (17) | démocratique (972): Cela participe du débat démocratique. | déontologique (15) | difficile (14) | économique (81): Cette loi constitue un enjeu majeur du débat économique. | éducatif (39) | électoral (90): Les défenseurs des sans-papiers s’invitent dans le débat électoral. | empoisonné (11) | enflammé (43) | engagé (36) | esthétique (20) | éternel (13) | éthique (187): Il provoqua un débat éthique sur le clonage dans l’espèce humaine. | européen (105): Le débat européen n’est pas son affaire. | existentiel (23) | feutré (16) | fiscal (193): Ce texte recadre le débat fiscal. | fondamental (29) | historiographique (43) | historique (14) | houleux (446): Deux mois de débats houleux ont été nécessaires. | hypertrophié (22) | hypocrite (24) | idéologique (308): Il n’a pas su précisément poser les termes du débat idéologique. | important (11) | improvisé (16) | inédit (21) | ininterrompu (22) | institutionnel (146): Le débat institutionnel a tourné court. | intellectuel (163): Nous ne saurions nous ériger en arbitres du débat intellectuel. | intense (19) | interdit (34) | intéressant (21) | intérieur
(54): Nous assistons à leurs débats intérieurs et nous partageons avec eux l’angoisse face à la mort. | interminable (16) | interministériel (14) | international (25) | interne (> 593): Il souhaite l’organisation d’un débat interne. | intitulé (88): Un débat intitulé “Artiste, dites-vous?” est organisé par la jeune peinture. | inutile (13) | judiciaire (73): Le débat judiciaire a été tenu d’une main de maître. | juridique (121): Derrière ce débat juridique formel, l’enjeu est considérable. | lancinant (14) | larvé (16) | légitime (53) | majeur (24) | mené (30) | moral (12) | national (481): Le ministre réclame un grand débat national. | ouvert (> 115): Il faut qu’un débat ouvert ait lieu sur ce sujet. | parlementaire (1619): Le débat parlementaire aura lieu en octobre. | passionné (251): Il a déclenché des débats passionnés. | pédagogique (45) | permanent (45) | philosophique (154): Ce débat philosophique porte sur le concept d’humanité. | pointilleux (18) | politicien (28) | politique (2163): Leurs voix permettraient d’ouvrir le débat politique. | préalable (126): Le parti va jouer un rôle décisif dans le débat préalable sur les institutions. | préparatoire (19) | présidentiel (22) | prévu (> 30) | profond (15) | public (> 3318): Il courait de réunions d’étudiants en débats publics. | quotidien (25) | récent (45) | récurrent (233): Cette controverse illustre le débat récurrent sur la détention provisoire. | relatif (11) | retransmis (42) | scientifique (104): Le débat scientifique sur l’innocuité de ces gènes n’est pas clos. | sémantique (46) | serein (127): L’association souhaite un débat serein sur l’école. | sérieux (11) | serré (26) | social (24) | sociologique (11) | statutaire (22) | stérile (21) | stratégique (126): Le débat stratégique est en passe de changer, radicalement. | technique (11) | télévisé (627): Les candidats ont participé à leur premier débat télévisé. 
| thématique (14) | théologique (68): La question du sexe des îles, comme celle du sexe des anges, mériterait probablement un long débat théologique. | théorique (110): Ce débat théorique a été enterré. | transparent (29) | tronqué (16) | tumultueux (15) | venimeux (20) | vif (40) | virulent (27)
1.3 DÉBAT + NOM: citoyen (72): Le débat citoyen commence seulement.
2.1 DÉBAT + de + NOMsubjectif: assemblée: Les débats de l’assemblée générale seront diffusés en direct. | congrès: La fiscalité sera au cœur des débats du congrès. | parlementaire: La chaîne retransmet en direct les débats des parlementaires. | expert: La discussion en séance publique s’annonçait comme un débat d’experts.
2.2 DÉBAT + PRÉP + NOMobjectif: de idée: Ces thèses peuvent contribuer au débat des idées. | censure: Il se prépare au débat de censure à l’Assemblée nationale. | politique: Cela ne veut pas dire qu’il n’y ait pas de vrais débats de politique monétaire. | investiture: Début du débat d’investiture du premier ministre.
2.3 DÉBAT + PRÉP + INF: à venir: Il veut se placer au mieux pour les débats à venir.
2.5 DÉBAT + PRÉP + NOMcirconstanciel: sur avenir: Il veut reprendre le long débat sur l’avenir du Musée en termes d’espace. | thème: Il organise un débat sur le thème: “Quel avenir pour la littérature?”. | parité | immigration | réforme | question | projet: L’association tente de prolonger le débat sur le projet d’autoroute. | partage: L’euro contribue à raviver le débat sur le partage des revenus. | statut | rôle | entre : La perspective du retour au plein emploi avive le débat entre les partenaires sociaux. ◊ Les
débats entre partisans et adversaires de ces projets se cantonnent essentiellement aux milieux directement impliqués. | expert | avec représentant: Les radios ont multiplié les débats avec les représentants des candidats. | dans : Il a dû défendre sa position lors de débats houleux dans tout le pays. | domaine: Les débats dans ce domaine ne relèvent pas seulement de la conscience individuelle. | sans vote: Il a passé des heures au Sénat pour un débat sans vote sur l’aménagement du territoire. | tabou: Le gouvernement fera connaître ses choix après une large concertation et un débat sans tabou. | devant cour: Trois questions pour la dernière semaine de débats devant la Cour de justice. | tribunal: L’interprétation des faits fait l’objet de débats devant le tribunal. | de société: On demande un vrai débat de société. | opinion | campagne | fond: Il n’a eu de cesse de dissuader tout débat de fond. | arrière-garde: On peut comprendre que ça les agace, mais c’est un débat d’arrière-garde. | nature: La Commission démissionnaire donne le ton dans ce débat de nature politique. |
: Le juge X, qui présidait les débats de mardi, s’est montré parfaitement clair. | à : Un débat au parlement aurait une fonction symbolique forte. | huis clos: Il avait pris l’initiative d’organiser un débat à huis clos. | niveau: L’évolution institutionnelle fera d’abord l’objet d’un débat au niveau local. | à la sauvette : Cela ne peut pas être un débat à la sauvette. | en lecture: Tout dépendra de l’issue du débat en deuxième lecture. | séance publique | cours: Les débats en cours portent sur la constitution de la future société. | concernant : Le débat concernant l’expertise souligne ce que nous disons depuis le début. | parmi : Les causes de cet événement font encore l’objet d’âpres débats parmi les scientifiques. | chez : Le point qui a fait débat chez les luthériens est plus subtil. | sous égide: Il souhaite un débat sous l’égide d’un institut privé. | à propos de : Nous avons eu le même type de débat à propos de la loi sur le dopage. | pour : Ce ne sera pas le sujet du débat pour le budget 2000.
3.2 NOM + PRÉP + DÉBAT: absence (< 73): La vie intellectuelle est sinistrée, marquée par l’absence de débat. | animateur (16) | âpreté (12) | clarté (< 16) | clôture (27) | complexité (< 17) | conclusion (< 31) | contribution à (97): Un tribunal correctionnel [ne peut] apporter sa contribution à quelque débat national. | cycle (< 14) | cœur (862): Le problème du financement est au cœur du débat. | déroulement (21) | élément (15) | émergence (11) | émission (< 25) | enjeu (< 103): Quel est le véritable enjeu du débat? | enlisement (11) | espace (< 42) | essentiel (< 33) | fond (< 75): Le fond du débat est lié au concept de la modernité. | intégralité (< 18) | lieu (< 203): La fondation était d’abord un lieu de débat et de régulation. | maîtrise (11) | meneur (33) | mûrissement (20) | organisation (34) | ouverture (< 397): Il réclame également l’ouverture d’un débat national.
| participant à (27) | partie (< 7) | polarisation (34) | politisation (14) | portée (21) | position (< 17) | poursuite (12) | préparation (< 21) | publicité (118): Le Code de procédure pénale réaffirme le principe de la publicité des débats. | recentrage (15) | relance (< 100): Tous plaident pour la relance du débat sur la décentralisation. | réouverture (48) | report (29) | retranscription (18) | retransmission (12) | sérénité (< 99): Cela nuirait à la sérénité des débats. | sujet (< 86): Il devint vite l’unique sujet de débat de la soirée électorale. | tenue (15) | thème (< 37) | ton (18) | tournant dans (13) | type (12)
5.1 DÉBATsujet + VERBE: aboutir (< 17) | accompagner (< 21) | agiter (309): Le débat n’agite que les antichambres ministérielles. | s’amorcer (< 30) | s’amplifier (55) | apparaître (< 19): Si le débat apparaît clairement aujourd’hui, c’est que le phénomène n’a plus rien de marginal. | avoir lieu (< 217): Le débat parlementaire aura lieu en octobre. | commencer (< 95): Le débat commence à 16 heures. | se concentrer sur (< 64) | concerner (< 80): Le débat concerne aussi le sort de l’usine. | conduire à (< 21) | continuer (< 23) | se cristalliser sur (33): Le débat s’est cristallisé sur les questions de l’organisation du statut des entraîneurs d’athlétisme. | déboucher sur (34) | dépasser (< 62): Ce débat dépasse la communauté scientifique et doit être démocratique. | se déplacer (< 31) | se dérouler (< 113): Ce débat se déroule dans un contexte nouveau. | diviser (< 39) | dominer (< 32) | durer (60): Le débat dura un peu plus d’une heure. | s’embourber (11) | s’engager (< 309): Un nouveau débat s’engage sur la méthode à mettre en œuvre. | s’enliser (32) | s’ensuivre (14) | être dominé (< 32) | être escamoté (< 89): Le débat de fond est escamoté. | être faussé (22) | être occulté (< 35) | être relancé (< 128): Le débat sur l’autonomie est relancé. | être tranché (< 208): Le débat est loin d’être tranché. | évoluer (< 21) | faire rage (< 22) | se focaliser (< 136): Rapidement, le débat se focalise sur les métiers respectifs. | s’inscrire (< 16) | s’instaurer (< 77): Le débat s’instaurera peut-être. | s’intensifier (23) | intervenir (< 19) | se limiter à (< 15) | mériter (< 14) | se nouer (17) | opposer (< 196): Le débat a clairement opposé la droite et la gauche. | permettre (< 20) | se polariser (11) | porter sur (> 516): Le débat a porté sur l’économie. | se poursuivre (< 40) | précéder (< 114): Le débat qui a précédé le vote a été calamiteux. | se profiler (< 20) | rebondir (52): Le débat a rebondi sur la parité. 
| ressurgir (16) | se résumer (< 37) | resurgir (< 40) | réunir (< 16) | risquer (< 73): Le débat risque d’être passionné. | suivre (< 44) | surgir (< 30) | se tenir (< 31) | tourner court (< 82): Le débat institutionnel a tourné court. | tourner autour de (< 82): Le débat tourne autour de deux thèmes majeurs. | transcender (13) | traverser (22)
5.2 DÉBAT + ATTRIBUT: être (< 426): Le débat fut relativement bref. | rester (< 44): Le débat reste ouvert. | rester confiné à (< 28)
6.1 VERBE + DÉBATobjet: aborder (< 61) | accepter (< 15) | alimenter (377): Deux films alimenteront les débats. | amorcer (41) | amplifier (33) | animer (< 199): Les chefs de l’opposition animent le débat. | apporter à (> 28) | approfondir (25) | arbitrer (< 64): Chine, Inde et Brésil devraient arbitrer les débats. | assister à (< 108): Les spectateurs ont pu assister à un débat historique. | attiser (24) | aviver (26) | avoir (< 141): Il faut avoir un débat sur ce que l’on veut manger, lire, voir, bref consommer. | boycotter (17) | brouiller (12) | centrer (16) | clarifier (67) | clore (328): Il vient un moment où il faut savoir clore un débat. | concerner (< 12) | conclure (< 11) | confisquer (11) | contribuer à (< 79): Elles peuvent contribuer au débat des idées. | cristalliser (< 26): Il cristallise le débat entre abolitionnistes et partisans de la peine capitale. | déclencher (41) | dépasser (21) | dépassionner (149): Le député souhaite “dépassionner le débat sur la chasse”. | déplacer (13) | désamorcer (12) | se désintéresser de (14) | dominer (< 122): Une impression de malaise a dominé les débats. | donner lieu à (< 72): Ces pistes de réflexion donneront lieu
à un débat. | échapper à (19) | éclaircir (19) | éclairer (88) | élargir (128): Le ministre de l’intérieur élargit le débat. | élever (11) | éluder (30) | s’emparer de (< 24) | empêcher (< 42) | empoisonner (23) | enfermer (< 24) | s’engager dans (< 265): Il a renoncé à s’engager dans un tel débat. | engager (< 265): Il faut engager maintenant le débat sur le contenu. | enrichir (< 46) | entrer dans (< 47) | envenimer (75): Le succès des écologistes a envenimé le débat. | épuiser (19) | escamoter (< 29) | esquiver (< 126): Il ne pourra pas longtemps esquiver le débat. | être concerné (< 12) | être intéressé (< 13) | être phagocyté (< 24) | être suivi de (< 212): Chaque séance est suivie d’un débat. | éviter (< 148): Cela permettait d’éviter les vrais débats de fond. | exclure de (< 27) | faire (< 280): Le niveau de ressources des bénéficiaires fait débat. | faire l’objet de (< 280): Le texte a fait l’objet d’un débat au bureau national. | fausser (< 16) | focaliser sur (< 73): L’affaire focalise depuis trois ans les débats sur justice et politique. | focaliser (< 73): Ce dossier a focalisé, trois années durant, les débats entre autorité judiciaire et pouvoir politique. | il y a (< 141): Il n’y avait aucun débat. | illustrer (< 16) | s’immiscer dans (39) | influencer (11) | instaurer (19) | s’intéresser à (< 13) | interférer dans (< 25) | interférer avec (< 25) | intervenir dans (< 224): Un ange est intervenu dans le débat, en direct d’Ajaccio. | introduire dans (< 42) | introduire (< 42): Il s’étonne que personne n’ait introduit le débat. | juger (< 19): Plusieurs producteurs jugent fondé le débat lancé par le collectif. | lancer (287): Elle lance le débat sur une mauvaise piste. | maîtriser (15) | mener (62) | mériter (< 76): Cette “première” méritait bien un débat. | monopoliser (15) | nourrir (326): Il veut nourrir un débat de fond. | occulter (< 24) | organiser (487): La Bibliothèque publique d’information organise un débat. 
| orienter (< 30) | ouvrir (> 697): Etes-vous le mieux placé pour ouvrir ce débat ? | participer à (413): Il doit participer au débat public. | permettre (< 50): La campagne n’a toutefois jusqu’à présent pas permis de réel débat sur l’Europe. | perturber (< 26) | peser sur (< 131): L’amendement a continué à peser sur les débats. | placer (14) | politiser (< 36) | polluer (< 27) | porter (11) | poser (< 34) | poursuivre (< 28) | préjuger de (15) | prendre part à (< 26) | préparer (< 33) | présider (29) | privilégier (< 15) | profiter de (26) | prolonger (< 43) | proposer (< 95): Il propose alors un débat interne. | provoquer (< 256): Cette question a aussi provoqué un débat. | ramener à (< 21): Ces critiques, parfois justifiées mais aussi parfois simplistes, ramènent au débat sur le système lui-même. | ramener (< 21): L’heure est venue de ramener le débat à ses justes proportions. | ranimer (72): A quoi sert-il de ranimer un tel débat? | raviver (188): Les dégâts sociaux ont ravivé le débat. | recentrer (65): Il est utile de recentrer le débat sur ses véritables enjeux. | réclamer (< 33) | refuser (< 29) | réhabiliter (35) | relancer (3094): La publication du rapport relance le débat. | renvoyer à (< 15): Sans doute est-on plus d’une fois gêné par le caractère allusif de certains développements, qui renvoient à des débats connus des spécialistes mais d’eux seuls. | renvoyer (< 15): Le gouvernement français renvoie le débat à l’automne. | replacer (37) | reporter (38) | ressortir de (18) | résumer (< 59): On ne saurait mieux résumer le débat entre multinationales et Etats nationaux. | resurgir (< 19) | retransmettre (49) | réveiller (< 18) | revenir sur (< 32) | rouvrir (388): Le Sénat rouvre le débat sur la liberté de la presse. | sortir de (< 30) | souhaiter (< 252): Il
souhaiterait un débat à la loyale. | soulever (< 70): Longtemps, cette disposition n’a soulevé aucun débat. | soumettre à (15) | sous-tendre (14) | stimuler (< 22) | suivre (< 212): J’avais regardé la télévision pour suivre les débats. | susciter (789): Ce texte a suscité de nombreux débats. | trancher (347): Ce n’est pas la Commission démissionnaire qui tranchera le débat. | verrouiller (15) | verser à (20) | voir (< 19)
FRANTEXT: interrompre (13)
6.2 VERBEattributif + DÉBAT: être (< 30): C’est un vieux débat. | être en (< 30): Le calendrier est toujours en débat.
7.2 PRÉP + DÉBATcirconstanciel: lors de: Les connaissances établies lors de ce débat ont contribué à la discussion. | au cours de: La vraie question n’a pas été abordée au cours du débat. | après: Après cinq heures de débat, des applaudissements éclatent. ◊ Les décisions ne devraient être prises qu’après un très vaste débat associant toutes les parties concernées. | à l’issue de: A l’issue de ce débat, l’aréopage voterait-il? | à l’occasion de: Les amendements, à l’occasion des débats budgétaires, ont été rejetés par le gouvernement. | sans: Le magazine enchaîne les sujets à vive allure, sans débat ni réel éclairage. | au terme de: Au terme d’un débat agité, les sénateurs ont modifié le projet de loi. | à la veille de: A la veille du débat, les forces étaient égales. | avant: L’examen intervient plus d’un mois avant le débat. | pendant: Pendant le débat, la purification ethnique continue. | tout au long de: Ils nous ont répété tout au long du débat que c’était techniquement impossible. | dans le cadre de: Il fera étudier les propositions dans le cadre d’un débat démocratique. | à l’ouverture de: A l’ouverture des débats, il avait demandé la remise en liberté de son client. | derrière: Derrière ce débat budgétaire, une difficulté de fond se profile. | en marge de: Elle est restée trop longtemps en marge du débat public. | quant à: Quant au débat sur la productivité, il n’a même pas été évoqué.
| durant: Il n’y a pas eu de décision de prise durant le débat. | à la suite de: Il prit sa décision à la suite d’un débat de conscience. | en ouverture de: En ouverture de débat, le président expose les enjeux. | dès l’ouverture de: Il relancera la question dès l’ouverture des débats. | à la fin de: A la fin des débats, nous verrons où nous en sommes. | malgré: Malgré les débats houleux, il n’avait pas révisé ses prévisions. | par-delà: Mais, par-delà ce débat légitime, tous doivent constater les faits. | en prélude à: En prélude au débat, ils proposent un film. | à la reprise de: A la reprise des débats, l’opposition devait s’abstenir sur cet article. | en conclusion de: Il s’est seulement exprimé en conclusion du débat. | à l’approche de: A l’approche du débat sur le projet de loi de finances, ils comptaient relancer l’idée de la taxe.
5. Comments on the entry

The purpose of this section is to give an account of the wealth of information provided by the dictionary entry presented and to highlight the advantages of our approach. Our entry was compiled on the basis of 13,340 occurrences of débat in Le Monde and 197 occurrences in Frantext. On the whole, the entry contains 444 collocates, which are, however, not equally distributed across the grid patterns. Thus, most collocates (123) fall under
category 6.1, which contains collocations involving débat as the direct, indirect or prepositional object of a verb. The second most frequent pattern is 1.2, which includes 105 postposed attributive adjectives. Other patterns, by contrast, are rarely (e.g., 5.2) or never (2.4, 7.1) attested.

Note that the processes of analysis for each section differ fundamentally depending on the specific collocate patterns to be determined. Thus, although as a rule we proceed by calculating the log likelihood ratio, this procedure may be technically or linguistically impossible with regard to certain patterns. For example, considering section 2., we face the problem that at the present stage of development, collocation extraction programmes cannot determine whether the prepositional phrase following a keyword represents a subjective genitive (2.1), an objective genitive (2.2) or an adverbial modification (2.5). In such cases, we have to proceed manually, counting and ranking the collocates in terms of absolute frequencies in the corpus. This is also the case with the patterns subsumed under section 7.

To give a more detailed illustration of the make-up of individual dictionary sections, let’s have a closer look at section 1.1 of the débat entry (preposed attributive adjective). The 36 collocates attested are arranged in alphabetical order to facilitate searches. Each collocate is followed by its log likelihood ratio in brackets. As can be read off from these ratios, grand (802) is the most specific preposed adjective, followed by vrai (793) and vif (513).

As evidenced by the frequency figures reported above, the word débat frequently appears in the media but is rather rare in literary language. Therefore, the fact that all adjectival collocates attested in the literary corpus are also encountered in the newspaper corpus is no surprise. The collocate interrompre in section 6.1 is the only exception to this tendency.
Although the combination interrompre + débat does occur in newspaper language, it does not reach the threshold of specificity. As shown in 6.1, collocates restricted to literary usage are listed at the end of the relevant section.

An interesting question that arises in this context is the choice of collocates to be illustrated by means of original examples (quotations from the corpus, slightly shortened if necessary). In our opinion, as a rule, candidates for such illustration are those collocates which are most specific, since they permit the non-native audience to gain an overview of the stylistically least marked and thus most neutral usages. In addition, following our intuition, we attempt to illustrate those usages that may pose any kind of problem, be it to native or to non-native speakers of French. This procedure could be further refined by interviewing different groups of potential users.

Also in need of explanation are the patterns in section 2. (a keyword followed by a postmodifier). As mentioned above, the relevant collocates can
not be determined by means of probabilistic methods. Nonetheless, it is sometimes necessary to record both function words (prepositions) and content words (the nouns governed by these very prepositions). Section 2.5 shows our solution to this problem. We assume that the reader is mainly interested in finding out which prepositions commonly follow the keyword. These prepositions are ranked in terms of absolute frequency and graphically highlighted by a rectangle for a quick and clear overview. As can be seen in 2.5, most prepositional phrases following débat are introduced by sur, which, in turn, is most frequently followed by the nouns avenir and thème. In the example sentences, both the preposition and the following noun are printed in bold.

An important question that remains to be addressed is why, in our entry, log likelihood scores are frequently preceded by the ‘less than’ symbol (<) (cf., for instance, dépasser in 5.1). This symbol indicates that the actual degree of specificity lies below the degree of specificity computed, a situation that arises when syntactically inappropriate collocate uses are automatically included in the statistical data analysis. For example, in the case of dépasser, most occurrences correspond to the relevant syntactic pattern, which is exemplified by the quotation “Ce débat dépasse la communauté scientifique”. Unfortunately, however, inappropriate sentences like le débat est dépassé also found their way into our data. Conversely, the opposite and slightly less frequent situation, where the number of collocates is greater than that automatically computed, is indicated by the symbol ‘greater than’ (>).

In the present paper, we have provided detailed information both on the probabilistic computation of the cooccurrence of words and on how to make these data available to dictionary users.
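As a rough illustration of the probabilistic computation referred to above, the following Python sketch derives a log likelihood ratio for a keyword and a collocate from a 2 × 2 contingency table, in the formulation commonly attributed to Dunning (see also Manning & Schütze 2000). It is only an assumption that this corresponds to the variant used for the dictionary, which this section does not spell out; the function name, the parameter names and all counts are invented for the example.

```python
import math


def log_likelihood_ratio(pair_count, keyword_count, collocate_count, corpus_size):
    """Return G2 for a keyword/collocate pair from a 2x2 contingency table.

    pair_count      -- co-occurrences of keyword and collocate
    keyword_count   -- all occurrences of the keyword
    collocate_count -- all occurrences of the collocate
    corpus_size     -- total number of co-occurrence slots considered
    """
    observed = [
        [pair_count, keyword_count - pair_count],
        [collocate_count - pair_count,
         corpus_size - keyword_count - collocate_count + pair_count],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in this cell if keyword and collocate were independent.
            expected = row_totals[i] * col_totals[j] / corpus_size
            if observed[i][j] > 0:  # a zero cell contributes nothing
                g2 += observed[i][j] * math.log(observed[i][j] / expected)
    return 2.0 * g2


# Invented counts, for illustration only (not figures from the dictionary).
KEYWORD_COUNT = 1_000
CORPUS_SIZE = 100_000
candidates = {
    "collocate_a": (50, 200),  # (pair_count, collocate_count)
    "collocate_b": (3, 200),
}
scores = {
    name: log_likelihood_ratio(pair, KEYWORD_COUNT, total, CORPUS_SIZE)
    for name, (pair, total) in candidates.items()
}
# Most specific collocate first.
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because the score depends entirely on the counts supplied, any syntactically inappropriate match that is counted as a co-occurrence inflates `pair_count` and hence the ratio, which is the situation the ‘less than’ symbol is meant to flag.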
Even if the representation cannot claim to adequately reflect linguistic usage at large, it should nevertheless reveal how common certain word combinations are in newspaper language and novels, thus allowing our target audience to distinguish more or less stereotypical collocations (e.g., vrai débat) from less frequent, more elaborate and sophisticated ones. Furthermore, the specificity values can also uncover the context with which a keyword is generally associated; these values consequently represent structures of the mental lexicon (cf. Visual Thesaurus).

Without doubt, representing quantitative information is one of the advantages of the design of our dictionary of collocations. Other advantages reside in the fact that detailed and systematic information on syntactic patterns omitted in comparable dictionaries is provided here. This mainly applies to those keywords that are subject to adverbial modification (section
7), but it is also the case with section 2.⁹ In spite of their relevance for the cooccurrence behaviour of nouns, both syntactic domains have so far been neglected.

The vast number of expressive options that arise from the combination of single, stereotyped collocations becomes apparent when the following facts are considered. Imagine a sentence whose subject débat is accompanied by a postposed adjective (section 1.2), a postmodifying prepositional phrase (2.5) and a transitive or intransitive verb (section 5.1), as in the following made-up example: Le débat parlementaire sur l’immigration continuera. Sections 1.2 and 5.1 contain 105 and 58 collocates respectively. With respect to section 2.5, confining our attention to prepositions and prepositional phrases, we have 15 options. Assuming that all collocates in all three sections were semantically compatible with each other, we would obtain a total of 105 × 15 × 58 = 91,350 lexically distinct, but still relatively simple, sentences. This figure suggests that, as a rule, a sentence based on a combination of individual collocations should not be regarded as forming a collocation as a whole; rather, it results from the specific communicative intention of the speaker, who essentially has a wide choice of wording. In the light of these facts, it would be all the more interesting to find out which combinations could be considered stereotypical themselves. The discussion above leads on to the complex issue of long-distance collocations and, more generally, to the domain of cliché research¹⁰, but such a discussion is beyond the scope of this paper.

6. Difficulties, problems and questions

As the reader will have guessed, the main problem of our method of compiling is that it is highly time-consuming and laborious.
Although we have automated as many processes as possible (the extraction of collocates from the corpus, their alphabetical arrangement within the sections, the layout of the entries¹¹), it takes at least three working days to compile and carry out a quality-check for a single entry like débat. Under these conditions, our small research teams advance so slowly that we sometimes wonder whether it wouldn’t be better to confine ourselves to a reduced and simplified model.

Lexicographic questions and problems at issue mainly concern specificity and frequency values, the distinction between different uses of a given keyword and the possibility of classifying the collocates of individual sections on the basis of semantic criteria (with the arrangement of synonyms in a group as a possible outcome). To mention just one frequent, specificity-related question: when presented with a sample entry, some people suggest that the log-likelihood scores provide information that is much too specific and thus confusing. What is more, our log likelihood scores will without doubt have to be taken with a pinch of salt: they obviously only hold for our corpus. Other newspapers or other years’ issues of the same newspaper would have yielded different results. Therefore, it is pertinent to ask whether specificity ratios should be dispensed with altogether and replaced by a few broad specificity or frequency classes. Controversial discussions of this kind have been going on for some time, with strong arguments on both sides. The project leaders, who are still hesitant over the decision to be taken, would be very grateful for suggestions from readers speaking from their own practical lexicographic experience…

⁹ In Blumenthal (2005) we discuss these problems in a comparison of a selection of our own results with corresponding references from the Petit Robert.
¹⁰ Cf. Siepmann 2006:106-109.
¹¹ We are grateful to Sascha Diwersy for his support.

Bibliography

Beauchesne, J. 2001. Dictionnaire des cooccurrences. Montréal: Guérin.
Benson, M., E. Benson and R. Ilson. 1997. The BBI Dictionary of English Word Combinations. Amsterdam & Philadelphia: Benjamins.
Binon, J., S. Verlinde, J. Van Dyck and A. Bertels. 2000. Dictionnaire d’apprentissage du français des affaires. Paris: Didier [DAFA, cf. www.projetdafa.net].
Blumenthal, P. 2005. “Le Dictionnaire des collocations: un simple dictionnaire d’exemples?”. L’exemple lexicographique dans les dictionnaires français contemporains, Heinz (ed). 2005. 265-282.
Deuter, M. (ed). 2002. Oxford Collocations Dictionary for Students of English. Oxford et al.: OUP.
González Rodríguez, T. 2004. Dictionnaire des collocations [cf. www.tonitraduction.net].
Grobelak, L. 1990. Dictionnaire collocationnel du français général. Varsovie: Państwowe Wydawnictwo Naukowe.
Hausmann, F.J. & P. Blumenthal. 2006. “Présentation: collocations, corpus, dictionnaires”. Langue française 150.3-13.
Ilgenfritz, P., N. Stephan-Gabinel and G. Schneider. 1989. Langenscheidts Kontextwörterbuch Französisch-Deutsch. Berlin & München: Langenscheidt.
Imbs, P. (ed.). 1971-1974. Trésor de la langue française. Paris: CNRS & Gallimard.
Kjellmer, G. 1994. A Dictionary of English Collocations. 3 vol. Oxford: Clarendon.
Manning, C.D. & H. Schütze. 2000. Foundations of Statistical Natural Language Processing. Cambridge MA: MIT.
Mel’čuk, I. et al. 1984/1988/1992/1999. Dictionnaire explicatif et combinatoire du français contemporain: recherches lexico-sémantiques. Vol. I-IV. Montréal: PUM [cf. http://olst.ling.umontreal.ca/dicouebe/].
Rey-Debove, J. & A. Rey (eds). 2006. Le Nouveau Petit Robert. Texte remanié et amplifié. Paris: Le Robert.
Stein, A. & H. Schmid. 1995. “Étiquetage morphologique de textes français avec un arbre de décisions”. Traitement automatique des langues 36:1-2.23-35.
Steinlin, J. 2003. Générer des collocations. Mémoire de DEA sous la dir. de S. Kahane. Paris: Université de Paris VII [cf. www.olst.umontreal.ca/textdownloadfr.html].
Visual Thesaurus [cf. www.visualthesaurus.com].
Voisins de Le Monde (Les) [cf. http://w3.univ-tlse2.fr/erss/voisinsdelemonde/].
Zinglé, H. & M.-L. Brobeck-Zinglé. 2003. Dictionnaire combinatoire du français. Expressions, locutions et constructions. Paris: La maison du dictionnaire.
Corpus of Old French Literary Texts

Pierre KUNSTMANN

The importance of electronic text corpora for the study of ancient forms of a language has often been stressed; not being able to take advantage of a native speaker’s competence, one makes do with the large textual collections that almost instantly vouch for the use of a form, a word or a construction. For the study of Old French, different types of works and tools have been available on the Web for several years, with free access (sometimes requiring a password). The aim of this paper is to point out the most important ones and to provide a brief analysis of these projects. I have taken the liberty of re-using the main sections of an article published, six years ago already, in the Revue de Linguistique Romane (“Ancien et moyen français sur le Web : textes et bases de données”¹), to better show the changes that have occurred in this field in a short period of time: before and after, before 2000 and since 2000. The panorama is indeed changing rapidly: new sites appear, others are becoming richer, some become ossified or disappear. I will deal mainly with old texts that can be read in full and databases that can be searched.

1. Full texts

In view of the accumulation of Old French texts or text images on the Web, it is very important to know where the texts come from, how they are presented and how reliable they are — reliability, as is well known, is a fundamental value in philology. A first — logical — distinction is necessary: one must distinguish between what is more or less equivalent to reproducing a previously existing document and what really represents the production of a new document.

1.1. Reproductions of documents

In this category, let us mention first:
— texts formatted as “images”; that is the case with the great majority of medieval texts offered by Gallica² (“library of original documentary corpora”) on the Bibliothèque Nationale de France Web site.
— transcriptions formatted as texts of older editions that are already part of the public domain (such as Érec et Énide, by the great German philologist Foerster on the site of the Centre d’Études des Textes Médiévaux de l’Université de Haute-Bretagne)³
— lastly texts offered by the big electronic libraries, which must be considered separately. While those offered by the Bibliothèque Nationale de France have clear references, others do not (as is the case with the Bibliotheca Augustana) and some have deficient and misleading ones: the ABU (Association des Bibliophiles Universels), for example, — such abuse! — pirates the texts transcribed in our Laboratoire de Français Ancien (LFA) without any hesitation and even creates a new function: that of “copyist”, referring to the person who pirated the text⁴. Back to the Middle Ages, one would think…

Another form of reproduction consists in focusing not on the text, but on the manuscript that preserved it. The manuscript folios are captured as images and placed on the Web. The best example is the Charrette project⁵ at Princeton University; seven out of eight manuscripts of the Chrétien de Troyes romance Le Chevalier de la Charrette have been reproduced this way. One of these manuscripts reappears on the LFA site in Ottawa: this involves the folios of MS Garrett 125 (kept, by the way, in the Princeton library) that contain Le Chevalier au Lion⁶.

¹ Revue de Linguistique Romane, 64, janvier-juin 2000, p. 17-42.
² http://gallica.bnf.fr/classique/ouverture.htm
³ http://www.uhb.fr/alc/medieval/arthur.htm
⁴ http://abu.cnam.fr/BIB/auteurs/troyesc.html http://www.uottawa.ca/academic/arts/lfa/activites/textes/chevalier-au-lion/P/P1-720.html
⁵ http://www.princeton.edu/~lancelot/index2.html
⁶ http://www.uottawa.ca/academic/arts/lfa/activites/textes/chevalier-au-lion/R/images/Y-g40r.jpg

1.2. Production of new documents

Three types may be distinguished: transcription, critical edition and encoded texts.

1.2.1. Transcriptions

The major project undertaken by K. Meyer (Copenhagen) on Le Chevalier au Lion is a good example of the first type⁷. It is a near diplomatic transcription of all the manuscripts that transmit the romance. The texts are presented by groups of lines; each line is preceded by the manuscript code and the line number, then comes the line itself followed by the folio number. The scribe’s abbreviations have been expanded and the added letters have been underlined. The same type of work has been done by Y. Lepage for Le Couronnement de Louis⁸. On the other hand, the transcriptions of the same texts done in the LFA for the Dossier électronique du Chevalier au Lion can be considered semi-diplomatic insofar as punctuation is used, which already constitutes a first critical interpretation of the document. Let us also mention here the edition of Le Roman de Renart based on MSS C and M by a Japanese team of Hiroshima University (N. Fukumoto, N. Harano and S. Suzuki)⁹.

1.2.2. Critical editions

A good example is the edition of La Vengeance Raguidel¹⁰ by M. Plouzeau on the LFA site.

1.2.3. Encoded texts

The most important corpus of encoded texts is that of the Free University of Amsterdam: the Corpus des textes littéraires established under the supervision of A. Dees and P. van Reenen. This corpus has been transformed, enriched and re-encoded by A. Stein (Stuttgart), P. Kunstmann (Ottawa) and M.-D. Glessgen (Zurich), working as a team. This new version, under the title Nouveau Corpus d’Amsterdam, was presented last February to an audience of specialists (the Lauterbad workshop, in which Y. Kawaguchi took part); it is accessible on the Web, with a password, on the Stuttgart¹¹ site and will soon be available on the ATILF site (Analyse et Traitement Informatique de la Langue Française) in Nancy. It is also available on CD-Rom. In fact, such a corpus is already a database.

⁷ http://www.uottawa.ca/academic/arts/lfa/activites/textes/kmeyer/kpres.html
⁸ http://www.uottawa.ca/academic/arts/lfa/activites/textes/Couronnement/coltexte.htm
⁹ http://home.hiroshima-u.ac.jp/france/RRenart.html#edition
¹⁰ http://www.uottawa.ca/academic/arts/lfa/activites/textes/Vengeance/present.htm
¹¹ http://www.uni-stuttgart.de/lingrom/stein/corpus/

2. Databases

2.1. Textual bases

The simplest form is the concordance, either static or dynamic. The first is well known and was of great use in the 80s and 90s. The range was large: from raw concordance to analytic concordance with a detailed listing of forms and functions; the most striking examples of this last type were the Concordance analytique de La Mort le Roi Artu, which I created a long time ago at the LFA (1982) and M.-L. Ollier’s Lexique et Concordance de Chrétien de Troyes (1986). But for the static concordance, the size was a big handicap: 2000 pages for La Mort le Roi Artu and 29 microfiches for Chrétien de Troyes, which are now very difficult to read. This problem
disappeared with the dynamic concordance: with an electronic text and a concordance program, the user can build a concordance, partial or complete, and read the requested occurrences in a one-line context (KWIC) or on a whole page. In fact, the dynamic concordance gave rise to another type as a natural next step: the interactive database, which for Old French is well represented on the Web.

There are actually three bases in this field, which I will mention in ascending order of complexity and analytic sophistication. First, the Textes de Français Ancien base, created by LFA in collaboration with ARTFL (American Research on the Treasury of the French Language) at the University of Chicago¹². The second base, older than the first but more recent on the Web, is the Base de Français Médiéval, set up under the direction of C. Marchello-Nizia (ÉNS, Lyon)¹³. The base is made up of about 3 million word occurrences; it can be searched with the program Weblex. The texts of these two bases are in the process of being integrated into a general Old French base (a continuation of FRANTEXT for Old French) at the ATILF laboratory in Nancy. It is ATILF that houses the most important base for Middle French texts: the Base textuelle du Moyen Français, a version of FRANTEXT, comprising 220 unabridged texts (some 6 million occurrences). Those are, for the most part, texts that this laboratory gathered over the years to compile the Dictionnaire du Moyen Français. Let us also mention here the “Major Collaborative Research Initiative” project Modéliser le changement : les voies du français directed by F. Martineau in Ottawa¹⁴. It includes a section devoted to Old French texts that will be not only morphologically tagged, but also syntactically parsed.

2.2. Lexical indexes

Once a text is digitized, it is easy to extract lists of lexical and grammatical words as well as proper names in order to capture the vocabulary of the work in its entirety.
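A dynamic keyword-in-context (KWIC) concordance and a word index that groups attested forms under headwords (lemmas) can both be sketched in a few lines of Python. This is a minimal illustration and not the software used by any of the projects mentioned: the function names, the toy line (invented, not a quotation from any edition) and the hand-written lemma table are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def kwic(tokens, keyword, width=3):
    """Return one keyword-in-context line per occurrence of keyword."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


def lemmatized_index(tokens, lemma_of):
    """Group attested forms, with their token positions, under lemmas.

    Forms missing from lemma_of are filed under themselves.
    """
    index = defaultdict(lambda: defaultdict(list))
    for position, token in enumerate(tokens):
        index[lemma_of.get(token, token)][token].append(position)
    return {lemma: dict(forms) for lemma, forms in index.items()}


# Toy input: an invented line and a hypothetical lemma table.
TEXT = "Li rois vint, et li chevaliers vint."
LEMMA_OF = {"rois": "roi", "vint": "venir", "chevaliers": "chevalier"}

tokens = tokenize(TEXT)
concordance = kwic(tokens, "vint", width=2)
index = lemmatized_index(tokens, LEMMA_OF)
```

Storing token positions in the index is what makes it possible to move from a headword back to the passages in which its forms occur; a real lemmatizer would of course replace the hand-written table.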
The LFA site offers a series of lemmatized indexes (the forms that occur are grouped together under headwords or “lemmas”). These lemmatized indexes, added to the ongoing data capture of the Base de graphies verbales¹⁵ from Old French to Renaissance (revision and electronic capture of the collection of inflected forms established long ago by R. Martin; between 16 000 and 20 000 handwritten sheets of analysed verbal forms), the Tobler-Lommatzsch list of lemmas (37 047 lemmas, without the corresponding written forms, but often with a series of variant spellings, about 15 000 of them) and the automatic generation of lemmas, using TWIC, for the written forms of the Nouveau Corpus d’Amsterdam, will thus add up to a large database; this base, thanks to a scientific collaboration agreement between the different partners, will be merged with the ATILF base to form a huge Old and Middle French database. This will also make it possible to refine the lemmatizer developed in Nancy and offer a tool for indexing and exploring any text of these periods that will be submitted to this procedure.

¹² http://www.lib.uchicago.edu/efts/ARTFL/projects/TLA/pwrest/LFA.search.html
¹³ http://bfm.ens-lsh.fr/
¹⁴ http://www.voies.uottawa.ca/index.html
¹⁵ http://atilf.atilf.fr/gsouvay/bgv/

2.3. Lexicons and dictionaries

When the listed words are given a definition, one enters the domain of lexicons and dictionaries. I will present two dictionary projects here, works in progress on the Web: the first one is a language (or rather dialect) dictionary, the second one an author dictionary. Both are corpus dictionaries and pursue the same goal: to allow navigation between the dictionary entries and the texts of the corpus on which the dictionary is based.

The first is the Anglo-Norman Dictionary (AND)¹⁶; there are two versions: a printed one (published by the Modern Humanities Research Association), and an electronic one on the Anglo-Norman On-Line Hub. The version on the Web offers fascinating short and long-term possibilities; it is extremely easy to consult. Essentially, one navigates between the words contained in the quotations of the dictionary articles on the one hand and the corresponding passages in the text corpus on the other (for the moment, only ten texts are digitized and currently available). It should be noted that this text corpus can also be searched with a particularly well articulated and powerful concordance program. Last but not least, this XML dictionary with TEI encoding is meant to be accessible to other comparable and compatible tools; it is already linked to the Complément bibliographique of the Dictionnaire Étymologique d’Ancien Français (Heidelberg) and it could easily be linked to the second dictionary that I will discuss next.

¹⁶ http://www.anglo-norman.net/

The second dictionary is the Dictionnaire Électronique de Chrétien de Troyes (DÉCT), which I plan to complete in collaboration with two ATILF colleagues (G. Souvay and H. Gerner) and with A. Stein (Stuttgart). The corpus is made up of five romances (Érec, Cligès, Lancelot, Yvain, Perceval) by Chrétien de Troyes, a classical author of the French (and even European) Middle Ages, who created the character of Lancelot and the myth of the Grail, for instance. In the first phase, we are working with only one manuscript, which is actually the best and the only one to contain the entire text of the five romances. Writing the entries is made easier by following the model presented by the DMF (Dictionnaire du Moyen Français), directed by R. Martin at ATILF. Like the AND, the DÉCT is composed of three parts: the dictionary proper, the text corpus and a concordance program, but the much reduced corpus size, the lemmatization of all the words (lexical and grammatical) and their grammatical tagging allow a full and systematic navigation between dictionary and corpus and make the word in context search (concordance) particularly efficient. A first version of DÉCT (letter “A” and textual base) should be launched officially before the end of the year. The dictionary, which will come with a conceptual classification of the author’s vocabulary, will give the experts a brand new means of working, thinking and researching. As a lexical base that can be queried in multiple ways, it will be freely accessible on the Web to anybody who is interested, mere amateurs as well as scholars. The DÉCT aims to reach out directly to the public and facilitate access to a body of work considered fundamental to Western culture. There will probably be an English version and the intention is to add a learning method for Old French.

In the conclusion of my article six years ago, I stressed certain inherent risks in this type of project. The problem of maintaining and transcoding these documents is crucial, as is the question of document quality and reliability. These considerations led some of us to meet on October 15th 2004 in Nancy to set up the Consortium pour les Corpus de Français Médiéval (CCFM), with scholars representing seven different institutions and five countries. The next CCFM meeting will take place next month in Lyon at the École Normale Supérieure. Thus, there is clear progress and the future now looks almost certain!
Building a Large Corpus for Phonological Research — The PFC Project1 — Chantal LYCHE
In a classical statement Jespersen (1924) stressed the need to study language in its living context: “I am firmly convinced that many of the shortcomings of current grammatical theory are due to the fact that grammar has been chiefly studied in connection with ancient languages known only through the medium of writing, and that a correct apprehension of the essential nature of language can only be obtained when the study is based in the first place on direct observation of living speech and only secondarily on written and printed documents. In more than one sense a modern grammarian should be a novarum rerum studiosus.”

Such a recommendation crucially applies to phonology, concerned as it is with the sounds of language. How distressing then, to be confronted with analyses based on spurious data, as occurs for French, where the nature of the data put forward to support a number of theoretical claims causes great concern to the conscientious phonologist. Kaisse (1985: 63), for example, raises a similar issue (“it is an unfortunate fact that much of the literature on liaison is prefaced with a paragraph disagreeing with the basic data on which some previous analysis was based”), and Morin (1987) describes various areas of flagrant disagreement, concluding (Morin 1987: 815): “It is important to realize that sound theories can only be based on sound data”.

Liaison is a case in point. A number of comprehensive data collections have been undertaken and provide the analyst with a solid empirical base. In the wake of a long structuralist tradition, Ågren (1973), Encrevé (1988) and De Jong (1994), who analyse a radio corpus, the speech of politicians, and the Orléans corpus respectively, contributed largely to a meticulous description and a better understanding of the phenomenon. It is therefore all the more unfortunate that some widely disputed examples still find their way into the data phonologists propose to account for.
The analysed data usually stem from normative texts as exemplified in Fouché (1954) which is repeatedly cited, or elements of native speaker’s intuition as in Dell (1973), or so-called Standard French, an evanescent notion subject to a variety of interpretations (Morin 2000).

In this paper, we will present an endeavour to counter this situation via the elaboration of a large reference oral corpus for French, the Phonology of Contemporary French project (PFC: “Phonologie du Français Contemporain”), established under the leadership of Professor Jacques Durand (Université de Toulouse-Le Mirail), Professor Bernard Laks (Université de Paris X, Nanterre) and Professor Chantal Lyche (Universitetet i Oslo).² After a general presentation of the project and the common protocol, we will turn to the annotation system adopted for schwa and liaison and will describe the coding system devised for one aspect of prosody, together with the methodology it builds upon. The last section will be devoted to a brief account of the manifold research activities stemming from PFC.

¹ PFC: Phonologie du Français Contemporain: usages, variétés et structure. I would like to thank Jacques Durand and Bernard Laks for their helpful comments when writing this presentation.

1. PFC-presentation

The PFC project is a federative project which gathers some fifty researchers from a variety of countries and aims at the recording, partial transcription and analysis of over 600 speakers from the francophone world on the basis of a common protocol including two reading tasks and two conversations. As already mentioned, although no socio-linguistic endeavour of this size has been carried out for French, PFC was not elaborated in a vacuum, but rests upon a long tradition of extensive surveys. The second half of the twentieth century witnessed the development of a number of larger corpora,³ whose exploitation for phonological research was however not always feasible, and/or which did not allow for geographical comparison.
For example, the corpus gathered by Encrevé (1988) focuses on a specific social group, the political leaders of France, thus restricting its use, while the Corpus d’Orléans, not subject to a sufficiently strict protocol, does not permit the identification of the phonological characteristics of this particular variety. We opted within PFC for a different approach, and chose to formulate explicit phonological objectives while privileging geographical diversity. The phonological perspective adopted calls for the elaboration of the phonemic inventory of the speaker (and thereafter of the location), and 2
3
The PFC project benefits from support from l’Institut de Linguistique Française, the DGLFLF (Délegation Générale à la Langue Française et aux Langues de France), the Center of Advanced Studies in Theoretical Linguistics (University of Tromsø). It is equally supported by the CNRS TCAN VARCOM project and the ANR PFC-COR project. PFC would not be viable either without the dedicated efforts of all the researchers involved. See Laks (2003) for an historic overview and an evaluation of different corpora.
Building a Large Corpus
the collection of robust data on two central phenomena in French phonology, schwa and liaison. Striving to achieve geographical diversity leads us to aim at the francophone world rather than France alone.4 This decision had an impact on some later methodological choices, as we will see below. To date, outside of ‘hexagonal’ France, the following regions have been covered, or are under investigation: Canada (four locations), Belgium (three locations), Switzerland (two locations), Guadeloupe, Reunion Island, Mauritius, Ivory Coast, Burkina Faso, Senegal, Mali, Beirut, New Caledonia and Louisiana. The investigations are usually carried out locally under the supervision of a local PFC director. This procedure ensures that indigenous particularities will be accounted for, and furthermore, it guarantees the reliability of the recordings. Within continental France, we attempted to cover the main dialectal areas without neglecting the distinction between town and rural districts. Forty-two locations have been selected and can be viewed on the interactive map available on the PFC website: http://www.projet-pfc.net/. The PFC methodology is in no way novel; indeed, it builds on traditional sociolinguistic surveys and owes much to the classical work of Labov, as well as to the investigation techniques used by the Milroys. For each selection of speakers, the protocol requires two reading tasks (a word-list and a passage) as well as a semi-directed and an informal conversation. A minimum of ten speakers are interviewed per investigation point and, although we propose to include about the same number of men and women in about three age categories (20-40, 40-70, 70+), it has proved unrealistic, with a project of this size and with such a limited number of speakers per location, to aim at true social diversity.
The investigators, following the network principle (Milroy 1980), usually interview people they have close contacts with (friends and/or family), and it has often seemed profitable to study members of several generations within the same family in order to investigate age-grading (Durand 2006, Durand and Eychenne 2004). Because of the reading tasks, a certain amount of literacy is required of the speakers who, in addition, must be fully integrated in their environment, ideally having lived in the same town or district all their lives, and preferably having done all their schooling there. The strict application of our protocol to every investigation point strengthens the project and gives it its full coherence.5

4 Positive fallout sprang from our decision to incorporate French speaking territories outside of continental France, as we established an invaluable international network with researchers concerned with phonology but also sociolinguistic phenomena, syntax, etc., thus shedding some new light on the data.
5 The detailed protocol is available from the PFC website; see also Durand and Lyche (2003), Durand, Laks, Lyche (2003a, b), Durand, Laks, Lyche (2005), Lyche (2005), Durand (2006).
Chantal LYCHE
1.1. Two reading tasks

The reading tasks were integrated within the protocol for two reasons: (i) they guarantee the full comparability of the results and (ii) they became necessary once the phonological goals had been clearly formulated. No oral corpus, regardless of its size, can claim to include all the data the phonologist is looking for.6 Certain oppositions can only be ‘provoked’, as they will most likely never occur in an everyday conversation of reasonable length. The opposition between jeune (‘young’) and jeûne (‘fast’) for example is a case in point. If the adjective can indeed appear or be provoked easily, the use of the substantive is highly restricted unless the recording is performed in a hospital! We then assume that if certain oppositions are missing from the reading tasks, they will certainly be equally missing from the two conversations, and therefore we put the full burden on the two reading tasks to provide us with the necessary data, even if that particular data cannot form the sole basis for a phonological analysis. Our word-list includes 84 items followed by 5 minimal pairs. The 10 items making up the minimal pairs first appear at random positions within the list before being repeated as pairs, and we thus collect two pronunciations for each of these words.

Table 1. PFC word-list

1. roc  2. rat  3. jeune  4. mal  5. ras  6. fou à lier  7. des jeunets  8. intact
9. nous prendrions  10. fêtard  11. nièce  12. pâte  13. piquet  14. épée  15. compagnie  16. fête
17. islamique  18. agneau  19. pêcheur  20. médecin  21. paume  22. infect  23. dégeler  24. bêtement
25. épier  26. millionnaire  27. brun  28. scier  29. fêter  30. mouette  31. déjeuner  32. ex-femme
33. liège  34. baignoire  35. pécheur  36. socialisme  37. relier  38. aspect  39. niais  40. épais
41. des genêts  42. blond  43. creux  44. reliure  45. piqué  46. malle  47. gnôle  48. bouleverser
49. million  50. explosion  51. influence  52. mâle  53. ex-mari  54. pomme  55. étrier  56. chemise
57. brin  58. lierre  59. blanc  60. petit  61. jeûne  62. rhinocéros  63. miette  64. slip
65. compagne  66. peuple  67. rauque  68. cinquième  69. nier  70. extraordinaire  71. meurtre  72. vous prendriez
73. botté  74. patte  75. étriller  76. faites  77. feutre  78. quatrième  79. muette  80. piquais
81. trouer  82. piquer  83. creuse  84. beauté  85. patte  86. pâte  87. épais  88. épée
89. jeune  90. jeûne  91. beauté  92. botté  93. brun  94. brin

6 This is equally true of a corpus built for syntactic research, as Séguy (1973: 84) observes: “[…] il faudrait donc obtenir, à chaque point d’enquête, que le magnétophone fût mis en enregistrement durant les repas, et cela pendant des jours et des semaines, car les faits syntaxiques intéressants n’apparaissent dans le discours que suivant les caprices du hasard, de sorte qu’il faudrait constituer et dépouiller un corpus énorme.”
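The composition of the word-list (84 items, with the last ten recurring as five minimal pairs) can be verified mechanically. The following Python sketch is offered as an illustration only and is not part of the PFC tools: it stores the 94 items of Table 1 and pairs each of the last ten items with its earlier position in the list. The names WORD_LIST and repeated_items are hypothetical.

```python
# Illustrative sketch (not PFC software): check that the ten items read as
# minimal pairs (positions 85-94) each occur earlier in the word-list.
WORD_LIST = [
    "roc", "rat", "jeune", "mal", "ras", "fou à lier", "des jeunets", "intact",
    "nous prendrions", "fêtard", "nièce", "pâte", "piquet", "épée", "compagnie", "fête",
    "islamique", "agneau", "pêcheur", "médecin", "paume", "infect", "dégeler", "bêtement",
    "épier", "millionnaire", "brun", "scier", "fêter", "mouette", "déjeuner", "ex-femme",
    "liège", "baignoire", "pécheur", "socialisme", "relier", "aspect", "niais", "épais",
    "des genêts", "blond", "creux", "reliure", "piqué", "malle", "gnôle", "bouleverser",
    "million", "explosion", "influence", "mâle", "ex-mari", "pomme", "étrier", "chemise",
    "brin", "lierre", "blanc", "petit", "jeûne", "rhinocéros", "miette", "slip",
    "compagne", "peuple", "rauque", "cinquième", "nier", "extraordinaire", "meurtre",
    "vous prendriez", "botté", "patte", "étriller", "faites", "feutre", "quatrième",
    "muette", "piquais", "trouer", "piquer", "creuse", "beauté",
    # Minimal pairs (items 85-94)
    "patte", "pâte", "épais", "épée", "jeune", "jeûne", "beauté", "botté", "brun", "brin",
]


def repeated_items(words, n_tail=10):
    """Map each of the last n_tail words to (first position, second position).

    Positions are 1-based, as in Table 1. A ValueError is raised if a word
    of the tail does not occur earlier in the list.
    """
    head, tail = words[:-n_tail], words[-n_tail:]
    pairs = {}
    for offset, word in enumerate(tail):
        first = head.index(word) + 1       # position of the earlier occurrence
        second = len(head) + offset + 1    # position within the minimal pairs
        pairs[word] = (first, second)
    return pairs


if __name__ == "__main__":
    for word, (first, second) in repeated_items(WORD_LIST).items():
        print(f"{word}: items {first} and {second}")
```

Such a check makes it easy to locate, for each informant, the two recordings of the same word that are to be compared.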
The list, which includes a large number of items already present in earlier surveys (e.g. Walter 1982), aims at elaborating the phonemic system of the informant. If it focuses particularly on vocalic oppositions word-finally, that is in stressable position assuming we are dealing with a variety where stress is group-final, it also allows investigations on a number of mid-vowels in unstressed position, and it should allow us to specify the consonant inventory of the informant. The list however must be supplemented in all regions outside of France and even within the borders of continental France, in areas where certain peculiarities emerge. Our list has a few shortcomings: it fails to provide information on phonemic length for example, so crucial in Belgium, Switzerland or in Normandy; it fails equally for the tense-lax opposition which characterizes Canada; and it fails all the same for the absence/presence of /r/ in codas in African countries, or in varieties in close contact with Creoles (Bordal 2006). In all these cases, another list is devised (e.g. Bordal 2006) and is read by the informants in addition to the list in Table 1. As the reading tasks should not induce stress in the speakers, we ask the investigators to reduce the number of items of this new list to a minimum. Additional words might be useful as well when the researcher is concerned with more specific phenomena like vowel harmony, consonant assimilation, etc., which are not subject in our list to a systematic investigation. For this task, all informants are asked to read aloud the number preceding the item, thus providing phonological information on the realization of numbers and distracting somewhat the speaker’s attention. We hope that the speaker will not concentrate on producing what he feels to be
the ‘correct’ pronunciation until he comes to the minimal pairs. Obviously the repetition of ten words from the list as minimal pairs presents a linguistic interest as we obtain two pronunciations for the same word, but in addition we hope to arouse the speaker’s linguistic awareness. The subject’s reaction to the minimal pairs will then constitute a rich source of valuable sociological observations. The second reading task required by the protocol is a short text in the form of a local newspaper article, built with simple vocabulary and elementary syntax in order to be accessible to most informants. The text, although impregnated with ‘French culture’, is easily understandable for speakers located outside continental France and has so far been well accepted by all our informants.

Table 2.
Texte PFC (© Projet PFC) Le Premier Ministre ira-t-il à Beaulieu? Le village de Beaulieu est en grand émoi. Le Premier Ministre a en effet décidé de faire étape dans cette commune au cours de sa tournée de la région en fin d’année. Jusqu’ici les seuls titres de gloire de Beaulieu étaient son vin blanc sec, ses chemises en soie, un champion local de course à pied (Louis Garret), quatrième aux jeux olympiques de Berlin en 1936, et plus récemment, son usine de pâtes italiennes. Qu’est-ce qui a donc valu à Beaulieu ce grand honneur? Le hasard, tout bêtement, car le Premier Ministre, lassé des circuits habituels qui tournaient toujours autour des mêmes villes, veut découvrir ce qu’il appelle “la campagne profonde”. Le maire de Beaulieu — Marc Blanc — est en revanche très inquiet. La cote du Premier Ministre ne cesse de baisser depuis les élections. Comment, en plus, éviter les manifestations qui ont eu tendance à se multiplier lors des visites officielles ? La côte escarpée du Mont Saint-Pierre qui mène au village connaît des barrages chaque fois que les opposants de tous les bords manifestent leur colère. D’un autre côté, à chaque voyage du Premier Ministre, le gouvernement prend contact avec la préfecture la plus proche et s’assure que tout est fait pour le protéger. Or, un gros détachement de police, comme on en a vu à Jonquière, et des vérifications d’identité risquent de provoquer une explosion. Un jeune membre de l’opposition aurait déclaré: “Dans le coin, on est jaloux de notre liberté. S’il faut montrer patte blanche pour circuler, nous ne répondons pas de la réaction des gens du pays. Nous avons le soutien du village entier.” De plus, quelques articles parus dans La Dépêche du Centre, L’Express, Ouest Liberté et Le Nouvel Observateur indiqueraient que des activistes des communes voisines préparent une journée chaude au Premier Ministre. Quelques fanatiques auraient même entamé un jeûne prolongé dans l’église de Saint Martinville. 
Le sympathique maire de Beaulieu ne sait plus à quel saint se vouer. Il a le sentiment de se trouver dans une impasse stupide. Il s’est, en désespoir de cause, décidé à écrire au Premier Ministre pour vérifier si son village était vraiment une étape nécessaire dans la
tournée prévue. Beaulieu préfère être inconnue et tranquille plutôt que de se trouver au centre d’une bataille politique dont, par la télévision, seraient témoins des millions d’électeurs.
The text focuses, like the word-list, on vowels in stressable positions, and some of the items present in the word-list are repeated here: jeune, jeûne, cote, côte, etc. The repetition of the same items allows us to obtain valuable sociological information on speakers’ attitude toward the norm. Certain speakers, for example, keep these words apart while reading the word-list, but merge the pronunciations when reading the text. The text permits in addition observations on consonants in general, word-final obstruent+liquid cluster simplification, consonant assibilation, so-called ‘aspirated h’, etc. It further includes a wide array of schwa positions with a number of potential insertion sites (Marc Blanc, Ouest Liberté) together with 35 potential liaison sites. The text, just like the word-list, does not claim to cover the wide array of potential variation observed in specific geographic areas. Therefore, the local investigators have devised a number of supplementary paragraphs to test specific phenomena. It is our contention that on the basis of the word-list and the text, a first version of the phonemic inventory of the informant can be established, which will then be compared with his/her performance in the conversations before it can be finalised. The two reading tasks prove crucial in our protocol, but can they be conducted with equal ease throughout the Francophone world? What about locations where French is spoken alongside one or several other languages (as is often the case in African countries) or where French is mostly spoken by older speakers who have an exclusively oral competence, as in Louisiana for instance? The PFC protocol has been conducted in three African countries (Ivory Coast, Burkina Faso and Senegal) where no adaptation proved necessary. The speakers complied readily with the two reading exercises, which did not raise particular difficulties.
The situation differs sharply in a district where the speakers do not read French as in Louisiana. Let us recall that French is still used in Louisiana either as a Creole or as a variety of French often referred to as Cajun French, but in all cases nearly exclusively as an oral language. We were compelled for our investigation to eliminate the reading tasks and chose to introduce translation exercises for the word-list (Klingler 2006, Lyche 2006). The speakers are usually bilingual English-French and a large number of the words included in the list are part of their everyday vocabulary even if French is threatened in Louisiana, and at best, shows a state of advanced attrition (Dubois 2003). Since the text cannot be used either, we studied systematically the elements tested in the text and proposed to construct a number of sentences to be translated. These
sentences should cover the same phenomena under scrutiny in the text: phonemic inventory, schwa, liaison, etc. The translation tasks will be supplemented by small tales, well-known throughout Acadiana, that the informants are asked to relate. Even if the stories will necessarily diverge from speaker to speaker, we can expect that a number of lexical items will recur, together with a number of syntactic constructions. Such a procedure should provide us with a minimum of comparability. So far, Louisiana remains the only area which requires a radical rethinking of the protocol.

1.2. Two conversations

Data based on spontaneous speech prove imperative to corroborate preliminary observations made on the basis of the reading tasks. The protocol aims at two registers through two types of conversations, semi-directive and free. We recommend that the semi-directive conversation be carried out by an investigator who is not too close to the speaker, but who has been introduced to him/her through a friend (Milroy 1980), while the free conversation takes place either between an investigator and an informant who have close contacts, or between two informants who belong to the same social network. The second solution reduces the transcription work considerably, since one transcription will provide the data for two speakers. We do not impose any specific discussion themes for the free conversation, while the semi-directive conversation is more constrained. We ask the investigator to put to the speaker a large number of questions concerning his/her youth, school education, studies, work environment, etc. This information will then constitute the basis for the brief sociological portrait which is established for each informant and which is then recorded in the database. Both conversations are entered into the database as well, their length varying from twenty to thirty minutes, five to ten of which are transcribed.

1.3. Transcription

Once the recordings are gathered, they are processed in Praat. Using this well-known piece of software, we describe the data on four tiers (three of which are compulsory for all speakers) and then proceed to a theoretical interpretation of our corpus. The first tier is an orthographic transcription which is close to the standard spelling of French and aligned with the signal. On the second tier, we code the orthographic transcription for schwa (mute ‘e’), on the third tier we do the same for liaison, and on the fourth for prosody. A number of software applications have been developed which allow for a mechanical extraction of our codings and the statistical exploration of the results. For the transcription of all data, we favoured standard French
orthography after considering a number of alternatives which all proved unsatisfactory. A few members of the project proposed at first to adopt a phonemic transcription, even a phonetic one. Either system is highly time-consuming and would require intensive training of the transcribers. In addition, the purpose of the transcription is to provide easier access to the corpus; it should in no way constitute an analysis. How can we engage in a phonemic transcription while the phonemic system has not yet been devised and when the purpose of the recordings is to determine this very system? A phonetic transcription is equally unsuited, as phonetic phenomena are usually gradient and a fine phonetic transcription multiplying the use of diacritics is unthinkable for a corpus of the size of PFC. We resort however to a phonemic transcription when coding prosody as we will see below, but when we do so, the transcription complements the standard orthography; it does not replace it. The transcription tier is established as a base which is duplicated for the subsequent codings. The principles of the transcription are kept simple: follow the standard orthography even when syllables and/or segments are reduced, and do not introduce lexemes which are not present but which would be required from a prescriptive perspective. A typical example is the negation ne, which is frequently absent in conversations and which is not reintroduced in the transcription. A further problem arises when dealing with certain varieties of French outside of Europe where, in most cases, there does not exist any written tradition, or where the language is punctuated by local or foreign expressions borrowed from other native languages. The most extreme case is no doubt Louisiana French, which survived as an oral language for which the standard French orthography is often inappropriate. The local literature is traditionally oral, but a number of popular tales, short stories, poems, etc.
have been published in the second part of the twentieth century. Several directions are taken in order to endow this variety of French with its full identity. A few writers follow the tradition common for Creole languages and write a kind of ‘phonetic’ French (Guidry 1982), which we exemplify with sentence (1), while a version closer to the standard would read as in (2):

(1) quand j’assaye d’yeux dôner ein conseil (‘when I try to give them a piece of advice’)
(2) quand j’essaie d’eux donner un conseil

A number of local linguists have tackled this thorny graphic problem for the last twenty years and a consensus toward a standard orthography is now emerging (Brown 1993). Among other factors, let us point out that the orthography in (1) uses conventions which stigmatise it in the eyes of a hexagonal French reader, while the sound file does not reveal a major gap from what can be heard in several regions of France. For varieties of French which show no written tradition, the transcription should be kept as close as possible to standard French, and
more specifically, particular care should be taken to keep morphological indications. Thus, when the first person singular of aller (‘to go’) is pronounced [ʒva], it will not be transcribed je vais (the standard French orthography), nor j’va, nor je va, but je vas. The form vais is nonexistent in Louisiana French; it does not correspond to any local variety and therefore should be excluded. The transcription j’va or j’vas is not suitable either, because of the absence of the graphic e. The schwa in je is not systematically elided in Louisiana French, it functions more or less like a standard French schwa, and there is no reason to eliminate the segment from the orthography. Finally, je va does not fare any better, as the morphological marker has been erased, and its absence projects onto the reader a negative image of this particular variety of French. When writing Louisiana French or similar varieties, one should strive to achieve a certain balance between maintaining what is known to be the etymology and the observed regularisations which have taken place. In most varieties of French however, the problems are not so acute; they usually stem from the occurrence of lexemes nonexistent in standard French. For most such varieties of French, the lexicon has been thoroughly studied and local dictionaries are most often available. The transcriber will then adopt the orthography given in those lists and create new words based on the same principles whenever necessary. The flexibility of Praat allows the creation of a special tier for writing comments or translations when needed. The following examples are taken from Boutin (2006):

1: On mange sauce gombo avec sauce, feuille de euh, (clic) djoumgblé
   ‘We eat sauce gombo with sauce, leaf of (clic) djoumgblé’
2: La sauce de chez nous on appelle ça kplé
   ‘The local sauce, we call it kplé’
In the first sentence, djoumgblé is listed in Lafage (2003, 2004), and the transcription is therefore straightforward. In the second sentence, kplé is not listed in the dictionaries, but the adopted orthography maps the pronunciation. We are dealing here with a local term with a complex onset consisting of two plosives with simultaneous constriction, and it is standard procedure to write this particular cluster kp when voiceless and gb when voiced. As can be concluded from the above discussion, even varieties of French which at times appear distant from the standard can be handled quite adequately by standard orthography. A certain consensus is emerging concerning the basic requirements large oral corpora must fulfil in order to be exploited by the scientific community: (1) the transcription is aligned to the signal, (2) the basic transcription is orthographic, (3) the corpus is defined by a set of
standardized metadata.7 If any transcription is the product of an interpretation, the orthographic transcription reduces the interpretation factor to a minimum, although it remains an interpretation liable to fluctuate from person to person, just as the same transcriber cannot be expected to always be consistent. There is therefore a need to develop automatic transcription systems which might not be fully reliable, but which present the considerable advantage of being systematic and consistent in their errors. Although PFC benefits from the experience of a team of researchers specializing in automatic recognition, the available systems are far from being robust enough for spontaneous speech.8 Adda-Decker (2006) presents several alternatives and analyses a variety of errors observed when transcribing more natural speech. The error rate is considerably reduced for certain speakers in a reading context (5%), but in telephone conversations the error rate can exceed 30%.

2. Devising annotation systems

Our corpus fulfils the requirements set above for large oral corpora, as an orthographic transcription is produced for the word-list, the text, and an average of ten minutes of each type of conversation. Our protocol however goes one step further: we devised several sets of coding systems with the objective of assisting the phonologist in sorting out the data. Since Praat allows the duplication of tiers, a minimum of two tiers are then added, one for schwa and one for liaison. In both instances the transcription tier is duplicated and the different codes are typed within the graphic words. We will now turn to schwa and liaison codings and will conclude the section with prosody coding and its methodology.

2.1. Coding schwa

The second tier is devoted to schwa, which is coded throughout the text and for three minutes of each type of conversation.
The system adopted consists of a numerical annotation of the standard orthography and provides information on the presence/absence of the segment together with information on the context. In order not to multiply the numbers and to reduce the percentage of errors, we settled for four digits: (1) presence/absence of schwa, (2) position within the word, (3) the context to the left of the syllable containing schwa and (4) the context to the right of schwa. By schwa, we understand any graphic e which is not part of a digraph (as in eu)

7 These principles were clearly expressed at the ESF SCSS Exploratory Workshop Corpora in Phonological Research, convened by Gjert Kristoffersen, Marc van Oostendorp, Jacques Durand and Chantal Lyche, Amsterdam 15-17 June 2006.
8 Part of this work is conducted within a CNRS TCAN project (VARCOM) directed by Noël Nguyen and an ANR project (PFC-COR) under the direction of Bernard Laks.
or a trigraph (as in eau). In addition, we code for schwa any pronounced final consonant, as a schwa readily appears in such a context within a number of dialects. This final schwa can be a prepausal schwa as described among others by Hansen (2003), or it can be part of discourse planning, creating a short sonorous pause (Candea 2000). Thus in a sentence like dans le tiroir de ma table (‘in the drawer of my table’) the three e’s (in le, de and table) will be coded together with the final r of tiroir. Moreover, as we systematically code the final consonant pronounced, we retrieve information on cluster simplification or even on consonant deletion. In our example, we would code tablexxxx (see note 9) if the whole cluster is pronounced ([tabl]); on the other hand, if the liquid is dropped ([tab]), the coding would follow the plosive and would be tabxxxxle. An automatic search on the value of the third digit will bring forth the needed information. The third digit in the coding gives the left context: a vowel, as in VCə (1), a consonant, as in CCə (2), an intonational break (3), an uncertain schwa (4) and a cluster simplification (5). Searching for all ‘5’s in third position gives all the instances of simplification in the corpus, and the location of the ‘5’ renders explicit the precise location of the simplification. The use of ‘5’ in third position has been extended to code consonant elision in varieties of French which often delete the phoneme /r/, as observed in Africa and in regions where French coexists with a French Creole (Boutin 2006, Bordal 2006). If the infinitive partir (‘to leave’) is pronounced [pati], the orthographic transcription will be coded paxx5xtixx5xr on the schwa tier. Similarly, when a word-initial schwa or a schwa in a monosyllable is realized while the preceding consonant is dropped, our coding clearly indicates the situation: je vois pas (‘I don’t see’) pronounced [əvwapa] will be coded jexx5x vois pas.
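As an illustration of the automatic searches described above, the following Python sketch extracts four-digit codes from a line of the schwa tier and lists those whose third digit is ‘5’. It is a minimal sketch and not the PFC extraction software: the function names are hypothetical, only the digit values stated in this section and in the note on table0423 are decoded, and the full code in tab0453le is invented for the example. It also assumes that the orthographic transcription itself contains no four-digit sequences.

```python
import re

# Only the values explicitly given in the text and in the note on table0423
# are listed; any other digit is reported as UNKNOWN.
UNKNOWN = "value not specified in this section"
PRESENCE = {"0": "schwa not pronounced"}
POSITION = {"4": "word-final position"}
LEFT_CONTEXT = {
    "1": "vowel (VCə)",
    "2": "consonant (CCə)",
    "3": "intonational break",
    "4": "uncertain schwa",
    "5": "cluster simplification or consonant elision",
}
RIGHT_CONTEXT = {"3": "pause"}

CODE = re.compile(r"\d{4}")


def decode(code):
    """Spell out the four digits of a schwa code as far as they are known."""
    if not CODE.fullmatch(code):
        raise ValueError(f"not a four-digit code: {code!r}")
    return {
        "presence": PRESENCE.get(code[0], UNKNOWN),
        "position": POSITION.get(code[1], UNKNOWN),
        "left_context": LEFT_CONTEXT.get(code[2], UNKNOWN),
        "right_context": RIGHT_CONTEXT.get(code[3], UNKNOWN),
    }


def find_codes(tier_text):
    """Return (coded word, offset of the code within the word, code) triples."""
    hits = []
    for token in tier_text.split():
        for match in CODE.finditer(token):
            hits.append((token, match.start(), match.group()))
    return hits


def simplifications(tier_text):
    """Keep only the codes whose third digit is '5'."""
    return [hit for hit in find_codes(tier_text) if hit[2][2] == "5"]


if __name__ == "__main__":
    print(find_codes("dans le tiroir de ma table0423"))
    print(decode("0423"))
    # 'tab0453le' carries an invented code, used here only to show the search.
    print(simplifications("tab0453le table0423"))
```

Because the offset of the code within the graphic word is returned, the position of a simplification (after tab rather than after table) remains recoverable from the search result.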
Thus, the quest for final epenthetic schwas led to a coding system which proved suitable for coding /r/ elision, a frequent phenomenon in many varieties of French. The coding system adopted allows a quick search of the data; it does not in any way begin to analyse it. We propose to sort out the different contexts, but we do not impose a framework of any kind. We maintain a similar methodology for coding liaison contexts, which is done on a third tier within Praat.

2.2. Coding liaison

We code the entire text for liaison, and 5 minutes for each type of

9 The symbol ‘x’ is used for any code. Here the full expression would read: table0423. The schwa is not pronounced (0), it is in word-final position (4), it is preceded by two pronounced consonants (2) and it is followed by a pause (3).
conversation. By liaison we understand the French external sandhi phenomenon by which a word-final consonant is only realized when the next word is vowel-initial. Thus in petit bus [pətibys] (‘small bus’) no consonant will appear between word 1 and word 2, but in petit avion [pətitavjõ] (‘small plane’), a [t] is realized. Liaison is said to be more frequent in formal registers and is regarded as a highly variable phenomenon. In addition to establishing an empirical base for phonological analyses, the purpose of the coding is to obtain an up-to-date classification of the contexts: what types of liaisons are categorical, what types of liaisons are variable, what types of liaisons are erratic? We therefore propose to code all potential liaison sites with the exception of two, described in all classical works as non-appearing regardless of the register (Delattre 1951). We exclude the liaison after et (‘and’) and the liaison after a singular substantive (un banc éclairé ‘a lighted bench’).10 On the other hand, a liaison which does not have any graphic support is noted systematically. We do observe in our corpus a number of such liaisons, either a [z] introduced as a plural marker (vingt euros [vɛ̃zøʁo] ‘twenty euros’), or a [t] indicating a third person singular (il va à Paris [ilvatapaʁi] ‘he goes to Paris’). The system is alphanumerical, providing information on the length of word 1, the presence/absence of liaison, and, if a liaison is realized, whether it is linked or not to word 2, or whether it is “epenthetic”. Finally, in the case of a realized liaison, we transcribe the liaison consonant in the SAMPA system ([n, z, t] being by far the most frequent).

2.3. Coding prosody

The last tier that we will consider here concerns prosody, the study of which can be characterized as a latecomer in the project. Our first objective when building the protocol was directed toward a segmental analysis of French (phonemic system, schwa and liaison) as mentioned above.
The phonology of a language cannot however be reduced to a purely segmental module; it also embraces suprasegmental elements, and the two modules, commonly intertwined, interact with the syntax as well. After carefully defining the two annotation systems for schwa and liaison, we focused on coding prosody and set out to define two main goals within PFC: (1) a team of researchers is concerned with the prosodic characteristics of geographically defined varieties of French (Simon 2004, Post et al. 2006), (2) special attention is given to the link between schwa and liaison on the one hand and prosody on the other (Lacheret-Dujour and Lyche 2006, to

10 We could of course have maintained all potential liaison contexts, but we hoped to reduce somewhat the task of the coder and, more importantly, to reduce the number of potential errors, as it could be expected that totally strange contexts would escape the attention of the coder.
appear, Lacheret-Dujour, Lyche and Morel 2004, 2005). I will now turn to (2) and present a brief account of the methodology adopted, paying special attention to the relation between schwa and prosody. As a general framework for our endeavour, we proposed to follow the principles agreed upon when coding schwa and liaison: the system should be alphanumerical, theory independent, accessible to a naïve coder and it should produce a preliminary sorting out of the data. It is far from obvious however, that prosodic information will comply with these requirements. For example, can a naïve coder code prosody as easily as he/she will code liaison words? This question can only receive a negative answer regardless of the type of coding we envisage. Clearly, even a minimalist program will necessitate a few hours of training, while a finer grained approach demands specialists endowed with a certain level of phonetic sophistication supplemented by strong instrumentation. In addition, any type of coding presupposes clear objectives, and a number of strong hypotheses need to be formulated when broaching the examination of the relation between segmental and prosodic elements. Delimiting the domain of prosody however proves highly complex (Lacheret-Dujour and Beaugendre 1999) in the midst of a multitude of approaches and often contradictory definitions. We chose a modular approach (Lacheret-Dujour and Lyche to appear) although the different modules interact closely. On the phonological side, we distinguish between lexical and postlexical prosody, the first one characterized by stress and the second one by intonation while the phonetic side provides the substance, the different prosodic parameters anchored in the signal: f0, length, intensity, spectral characteristics. The question of stress in French can be a vexing one and some phonologists go as far as to reject the notion of stress in French (Garde 1968). 
It seems however that a theoretical consensus can be reached if we propose that French is a language where stress is assigned at the group level and not at the word level, fulfilling a demarcative function but not a contrastive one since stress, for example, does not intervene in the morphology of the language. As a rule, primary stress falls on the last syllable of a group, although it can fall on the first syllable of a longer group. In addition, we observe a secondary stress whose function is pragmatic as well as rhythmic. These principles are general enough to be uncontroversial and can constitute the foundation required before we envisage other hypotheses (Lacheret-Dujour and Lyche 2006, to appear). Our first hypothesis is purely phonological. We do not propose here to review the vast literature on schwa, the numerous debates around the process itself (for example, insertion vs. deletion) and the factors conditioning its presence, but to consider a few elements where prosody can lead to a better understanding of the phenomenon. For example, if the vowel either before or
after an absent schwa is systematically distinct from the same vowel in a different context, we can produce arguments in favour of deletion, assuming that the deleted schwa leaves traces of its underlying presence. We then need for example to measure the vowel quality and length of /ɔ/ in colmater ([kɔlmate] ‘fill in’), collerette ([kɔlʁɛt] ‘little collar’), col rond ([kɔlʁõ] ‘round collar’), etc. Our first interest will then concern the nature of the underlying structure of schwa. Our second objective aims at defining the rhythmic factors conditioning the presence of schwa. It has long been established (Léon 1966) that schwa is more likely to be absent in a longer group than in a shorter one: portefeuille (‘wallet’) needs a medial schwa while in porte-bouteille (‘bottle holder’) the schwa in porte is not realized. On the other hand, in a perspective where the alternation of strong / weak syllables is optimal, schwa seems stronger away from stress than closer to stress: doucement (‘slowly’) and all similar manner adverbs are always pronounced in northern French without a schwa, leading Dell (1973) to submit them to a rule of compulsory schwa deletion. Following a number of observations made in the literature (Casali 1997, Steriade 2001) it seems to us that the initial position is naturally endowed with more strength than a medial syllable. This is reflected in initial schwas, which are known to be more stable, and the schwa in demain (‘tomorrow’) is not unusual. The coding system will then be required to produce information on the length of the groups, the stressed syllables and distance to stress. Finally, we propose to analyse the pragmatic role of schwa within the discourse. Consistent with the hypothesis that an initial position reinforces the likelihood of finding a schwa, we observe that schwa is more likely to be maintained at the beginning of a new turn (j’ai bien regardé [ʁgaʁde] ‘I looked attentively’ vs.
regarde ça [ʁəgaʁd] ‘look at this’), just as it is more easily absent when included within a topic which does not carry new information. To the question Paul arrive demain tout seul? (‘Does Paul arrive tomorrow alone?’), one can answer Non, il vient demain avec son frère (‘no, he comes tomorrow with his brother’) where demain is more likely to be uttered without a schwa. It follows that the coding system must be devised in such a way that this information will be easily retrieved. The requirements emerging from the three sets of hypotheses presented above determine the nature of the coding system we finally adopted (Lacheret-Dujour and Lyche 2006). The hypotheses however came about as the result of constant toing and froing between initial hypotheses, a preliminary coding system and the data. The confrontation with the data was an absolute necessity in order to finalize our annotation system, just as looking closely at the data must precede the choice of the portions to be coded. We mentioned above that our protocol requires that the entire text,
three/five minutes of each conversation be coded for schwa and liaison respectively. The selection of the specific portions to be coded is left to the coder with a basic recommendation to choose passages where the speaker expresses him/herself freely and naturally. We know that within three/five minutes a number of interesting contexts will emerge. The same procedure however is not applicable to prosody, where a rigorous selection of the passages to be coded is imperative, not only in order to code relevant data but also because coding prosody is a much more time-consuming task than coding segmental phenomena. Coding schwa and liaison could be performed directly on the orthographic transcription, without being perturbed by the fact that the unit coded is a graphic word. Prosody however is directly linked to the syllable, the stress-bearing unit, which excludes the graphic word from being the coded element. The prosodic tier does not include a reduplication of the transcription, but each coded portion of the aligned text is segmented into syllables which are then transcribed into the SAMPA transcription system. As our domain of investigation here concerns schwa exclusively, we duplicate the schwa codings after the relevant syllables. In addition we signal the beginning and the end of the word with the symbol *. The table below exemplifies the coding of the first sentence of the text.

Table 3. Coding prosody: the first sentence of the PFC text.
We observe in Table 3 that the prosodic tier shows the segmentation into syllables, the transcription in SAMPA, the delimitation of lexemes, the notation of pauses (indicated by the symbol ‘#’) and how the schwa coding is duplicated in order to facilitate the automatic treatment of the data. For the prosody coding itself, we distinguish four fields: 1 = presence or absence of a perceived prominence, 2 = presence or absence of a long vowel, 3 = presence of a pause and its nature, 4 = first syllable or not in a new turn. The devised annotation system can be applied by naïve coders who are given a few hours of training. Our experience shows that the coder performs with relative ease although he/she often retains a certain degree of insecurity about the detection of prominences. In general, the accurate identification of prominences remains a vexing question. Poiré (2006) describes an experiment conducted within PFC where 7 coders (all experienced phonologists) were asked to mark prominent syllables within a PFC recording, and he concludes that inter-judge agreement is totally lacking.¹¹ Morel, Lacheret-Dujour, Lyche and Poiré (2006) carry this work further and show that, for the judges who fare best, prominence detection is proportionate to F0 variation but not to length. While coding prosody on the prosodic tier, our coders are asked to rely on their perception of stress and when hesitant, they can visualize the F0 curve provided by Praat. In spite of all our efforts for rigorous annotation of the data, we can merely aspire to rudiments of answers from our minimalist coding system. We propose therefore that an advanced coding system relying on specific software should be introduced (Lacheret-Dujour and Lyche 2006). It seems for example that a psycho-acoustic model of perception (cf. the prosogram developed by Piet Mertens, e.g. Mertens 2004) will give more robust results than what can be achieved when coding manually.
This software stylizes automatically F0 variations indicating exclusively psycho-acoustic variations triggering perceptive responses, thus providing clean and quantified data. The work currently undertaken will not be vacuous in spite of the numerous obstacles encountered. We consider that coding manually will inevitably lead to various observations, regularities, which will enable us to finalize a more elaborate system based on automatic procedures. The alternating motion between the data, the hypotheses and the coding system proves once more inevitable.
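The four-field prosodic coding described above can be pictured as a small record per syllable. The following Python sketch is only an illustration: the source states what each field records but not the values used, so the digit layout, the pause values, and the names `SyllableCode`, `pack` and `turn_initial_prominences` are assumptions introduced here, and the two example syllables are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyllableCode:
    """One coded syllable on the prosodic tier (hypothetical layout)."""
    sampa: str           # SAMPA transcription of the syllable
    prominent: bool      # field 1: perceived prominence present or absent
    long_vowel: bool     # field 2: long vowel present or absent
    pause: int           # field 3: 0 = no pause; 1-9 = pause types (assumed values)
    turn_initial: bool   # field 4: first syllable of a new turn or not

    def __post_init__(self):
        # Keep field 3 to a single digit so the packed code stays four characters.
        if not 0 <= self.pause <= 9:
            raise ValueError("pause must be a single digit")

    def pack(self) -> str:
        """Return the four fields as a four-character string."""
        return (f"{int(self.prominent)}{int(self.long_vowel)}"
                f"{self.pause}{int(self.turn_initial)}")


def turn_initial_prominences(syllables):
    """Select syllables that are both prominent and turn-initial."""
    return [s for s in syllables if s.prominent and s.turn_initial]


# Invented example: two syllables, not taken from the PFC text.
coded = [
    SyllableCode("R@", prominent=True, long_vowel=False, pause=0, turn_initial=True),
    SyllableCode("gaRd", prominent=False, long_vowel=False, pause=1, turn_initial=False),
]
print([s.pack() for s in coded])
```

Packing the fields into a fixed-width string keeps the coding alphanumerical and lets plain text searches perform a first sorting of the data, which is the stated aim of the coding principles.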
11 See Martin (2006) on a similar topic.
3. Exploiting the corpus

We alluded in the introduction to the lack of a solid empirical base for a number of phonological analyses of liaison and we will now briefly illustrate the impact that the PFC findings are liable to have on our understanding of the liaison phenomenon. A study conducted on 5 investigation points (Brécey/Normandy, Treize-Vents/Vendée, Biarritz/Basque country, Douzens/Languedoc and Nyon/Switzerland) clearly supports the claim that liaison cannot be handled as a uniform phenomenon (Lyche, Durand, Laks 2006). The analysed data supports Morin and Kaye (1982) in that we cannot endorse an exclusively phonological treatment of liaison and that the morphological role played by [z] and [t] respectively remains central. We observe further that in Douzens, a variety of French where liaison was relatively more frequent, the higher level of frequency concerned [z] as a plural marker and [t] as a verb marker, excluding for instance potential liaison in [z] for verbs (je voulais/ aller ‘I wanted to go’). The strong morphological character of the plural liaison is enhanced by the fact that former instances of categorical liaison in [z] tend to become variable when the consonant does not fulfil its function as a plural marker (Laks 2005). We thus noted instances of monosyllabic prepositions (dans, chez) without liaison although liaison in such monosyllabic grammatical words has long been considered categorical. Liaison with prenominal adjectives, at the source of numerous theoretical debates within the literature, occurs rarely and is limited to a few frequent adjectives, thus supporting a construction analysis as in Bybee (2001). Our data shows in addition that variable liaison is scarce and that the graphical presence of a consonant can induce its pronunciation when reading a text (Laks 2005).
Thanks to our coding system and to the large amount of gathered data, we will soon be able to establish new categories of systematic, variable and non-occurring liaisons. A number of projects crystallize around PFC and analyze the stored data in different ways. For example, a CNRS TCAN project, VARCOM (Traitement de la variation phonologique dans la communication orale), centers around phonetic and phonological variation in French, focusing on how this variation linked to regional varieties is perceived in normal conversation on the one hand, and in a man-machine relation on the other. A major locus of variation in French concerns mid-vowels and their distribution, most areas in Southern France maintaining a reduced vocalic system with open-mid vowels in closed syllables and close-mid vowels in open syllables. Under the leadership of Noël Nguyen (Aix en Provence), formant charts have been semi-automatically extracted for all speakers in about ten PFC investigation points. The word-list and to a lesser extent the text can be submitted to the automatic procedure devised within the project,
since the transcription is strictly aligned to the signal. Nguyen and Espesser (2004) for example, extract 46 words from the PFC list (mid-vowels, /A/, nasal vowels) and detail the procedures used for the automatic extraction of the formants together with the methods applied to minimize a certain amount of variation due to the speakers’ physiological differences. We thus obtain the average phonemic system of each speaker and can thereafter establish the main characteristics of the local variety of French. Within the VARCOM project, Martine Adda-Decker (LIMSI-CNRS) supervises research on automatic speech recognition. The PFC corpus with extensive amounts of spontaneous speech presents a challenge for the available systems. For example, for individual phonemes, frequency of occurrence and acoustic realizations will vary from one register to the other (Adda-Decker 2006). Pronunciation dictionaries, against which each realization of a word is tested, must then be modified accordingly. Speech recognition systems can prove valuable for linguistic research when pronunciation dictionaries include all the actual variants (including geographical variants) of the same word. It is then possible to test the frequency of realisation of a particular variant. The success of the procedure would lead to the elimination of manual annotation systems such as the schwa and liaison codings. A test was performed on the liaison realization within the PFC text for 4 investigation points and checked against manual verification. Liaisons were adequately recognized by the system and the margin of error was kept low. The system, as expected, fails however when confronted with the two types of conversations. The PFC base is being used for the perceptive identification of regional characteristics.
In this experiment six regions have been chosen (Normandy, Vendée, Basque country, Switzerland, Languedoc, Provence) and naïve listeners have been asked, from a sampling of spontaneous speech and a passage of the text, to identify the geographical origin of the speakers (Woehrling and Boula de Mareüil 2006). The percentage of correct answers reaches over 42% with older speakers better identified than young. We observe further that listeners tend to blend the three southern regions in their evaluation while Switzerland is best identified. The different areas of research sketched above will be pursued within the ANR project PFC-COR under the leadership of Bernard Laks (Paris X) and new projects are being conceived aiming at a full prosodic exploitation of the base, at sociolinguistic studies in areas where French is not the sole L1 language and special attention is given to the pedagogical fallouts of the obtained results. The complete transcriptions of 212 speakers have been validated and are now recorded in the base with a total of 110 095 schwa codings and 30 306 liaison codings. We expect to reach 600 speakers within
a few months.

4. Conclusion

The survey presented here was intended as a bird’s eye view of the project and alluded all too briefly to the current exploitations of the database. The database itself has been the object of extensive modifications integrating cutting-edge technology. An interactive map allows direct access to the different investigation points with all the information they encompass: general linguistic description of the location, detailed information about the informants, etc. An SQL search engine produces dynamic statistics, cross-cutting for example sociolinguistic and phonological information. Codings are easily extracted with the form in context together with its corresponding audio file. Automatic procedures are being implemented to search for possible errors (instructions not followed while transcribing, coding errors, etc). The finalization and the full operation of a clean database remains a priority for the coming year. Our ambition is to establish our data as one of the reference corpora of spoken French. No existing oral corpus of French systematizes inter- and intra-speaker variation in a similar way and we contend that our method gives our endeavour its full strength. Convinced that robust data will sharpen the formulation of theoretical questions, we strive to build a rich base providing not only the necessary material for better descriptions of phonetic and phonological phenomena, but also intriguing data which will trigger theoretical debates between various approaches.

References

Adda-Decker, M. 2006. De la reconnaissance automatique de la parole à l’analyse linguistique de corpus oraux. Actes des XXVIes Journées d’Études sur la Parole, Dinard, June 2006, 389-400.
Ågren, J. 1973. Enquête sur quelques liaisons facultatives dans le français de conversation radiophonique. Uppsala : Acta Universitatis Upsaliensis.
Bordal, G. 2006. Traces de la créolisation dans un français régional: le cas du /r/ à l’Ile de la Réunion.
Mémoire de Master, Universitetet i Oslo.
Boutin, B. A. 2006. Le corpus PFC Abidjan : de l’établissement de l’échantillon aux questions de transcriptions et codages. Conference Phonologie du Français contemporain : données et enjeux théoriques, Paris 3-4 February 2006.
Brown, R. 1993. The social consequences of writing Louisiana French. Language in Society 22, 67-101.
Bybee, J. 2001. Phonology and Language Use. Cambridge: Cambridge University Press.
Candea, M. 2000. Contribution à l’étude des pauses silencieuses et des
phénomènes dits d’hésitation en français oral spontané. Thèse de doctorat, Université de Paris III.
Casali, R. 1997. Resolving hiatus. ROA 215-0997.
De Jong, D. 1994. La sociophonologie de la liaison orléanaise. In C. Lyche (ed.) French Generative Phonology : Retrospective and Perspectives, Salford: AFLS/ESRI.
Delattre, P. 1951. Principes de phonétique française à l’usage des étudiants anglo-américains. Middlebury College.
Dell, F. 1973 [1985]. Les règles et les sons. Paris : Hermann.
Dubois, S. 2003. Pratiques orales en Louisiane. La Tribune Internationale des Langues Vivantes 33, 89-95.
Durand, J. 2006. Mapping French Pronunciation. The PFC project. In J.-P. Montreuil & C. Nishida (eds), New Perspectives on Romance Linguistics. Vol. 2: Phonetics, Phonology and Dialectology. Selected Papers from the 35th Linguistic Symposium on Romance Languages (LSRL), Austin, Texas, February 2005, Amsterdam: John Benjamins.
Durand, J. and J. Eychenne 2004. Le schwa en français: pourquoi des corpus? Corpus 3, 311-356.
Durand, J., B. Laks and C. Lyche 2003a. Linguistique et variation. In E. Delais & J. Durand (eds.) Corpus et variation en phonologie du français : méthodes et analyses. Toulouse: Presses Universitaires du Mirail, 11-88.
Durand, J., B. Laks and C. Lyche 2003b. Le projet ‘Phonologie du français contemporain’ (PFC). La Tribune Internationale des Langues Vivantes 33, 3-9.
Durand, J., B. Laks and C. Lyche 2005. Un corpus numérisé pour la phonologie du français. In G. Williams (ed.) La linguistique de corpus. Rennes : Presses Universitaires de Rennes, 205-217.
Durand, J. and C. Lyche 2003. Le projet ‘Phonologie du Français Contemporain’ (PFC) et sa méthodologie. In E. Delais & J. Durand (eds.) Corpus et variation en phonologie du français : méthodes et analyses. Toulouse: Presses Universitaires du Mirail, 212-276.
Encrevé, P. 1988. La liaison avec et sans enchaînement. Paris: Seuil.
Fouché, P. 1954. Traité de prononciation française. Paris: Klincksieck.
Garde, P. 1968.
L’accent. Paris: PUF.
Guidry, R. 1982. C’est p’us pareil. Lafayette : Center for Louisiana Studies.
Hansen, A. B. 1997. Le nouveau [ə] prépausal dans le français parlé à Paris. In J. Pérot (ed.) Polyphonie pour Ivan Fonagy. Paris: L’Harmattan, 173-198.
Jespersen, O. 1924. The Philosophy of Grammar. London: George Allen & Unwin [repr. 1951].
Kaisse, H. 1985. Connected Speech. The Interaction of Syntax and Phonology. Orlando: Academic Press.
Klingler, T. 2006. PFC en terrain louisianais: défis et adaptation du protocole. Conference Phonologie du Français contemporain : données et enjeux théoriques, Paris 3-4 February 2006.
Lacheret-Dujour, A. and F. Beaugendre 1999. La prosodie du français. Paris : Editions du CNRS.
Lacheret-Dujour, A. and C. Lyche 2005 to appear. Transcription prosodique normalisée au sein du projet PFC: l’état d’un chantier. To appear in G. Williams (ed.) Proceedings of Les quatrièmes journées de la Linguistique de Corpus, Lorient 15-17 September 2005.
Lacheret-Dujour, A. and C. Lyche 2006. Le rôle des facteurs prosodiques dans l’analyse du schwa et de la liaison. In A.-C. Simon (ed.) Bulletin PFC 6, ERSS, 27-49.
Lacheret-Dujour, A., C. Lyche and M. Morel 2004. Pour une transcription prosodique normalisée au sein du projet PFC (Phonologie du français contemporain) : champ d’action et perspectives. In Isabelle Marlien and Bernard Bel (eds.) Proceedings of the 25th Journées d’Etude sur la Parole (JEP’04), Fez, Morocco, 19-22 April 2004.
Lacheret-Dujour, A., C. Lyche and M. Morel 2005. Phonological analysis of schwa and liaison within the PFC project: how determinant are the prosodic factors? In INTERSPEECH-2005, 1437-1440, http://www.iscaspeech.org/archive/interspeech_2005.
Lafage, S. 2003, 2004. Le lexique français de Côte d’Ivoire, appropriation et créativité. Tomes 1 et 2. Le français en Afrique. Revue du Réseau des Observatoires du Français Contemporain en Afrique 16 et 17, Paris : Didier-Erudition.
Laks, B. 2003. Les grandes enquêtes phonologiques en France. La Tribune Internationale des Langues Vivantes 33, 10-17.
Laks, B. 2005. La liaison et l’illusion. Langages 158, 101-126.
Léon, P. 1966. Apparition et maintien du « e » caduc. La Linguistique 2, 111-121.
Lyche, C. 2005.
Pour un renouvellement des données phonologiques : le corpus PFC (Phonologie du français contemporain). LIDIL (Linguistique appliquée et didactique des langues) 31, 119-138.
Lyche, C. 2006. Corpus en domaine francophone : la Louisiane. Ecole Thématique CNRS, Linguistique de corpus oraux, Nantes 19-24 June 2006.
Lyche, C., J. Durand and B. Laks 2005. French liaison and data. 3rd International Conference on Language Variation in Europe, Amsterdam Meertens Instituut, June 23–25, 2005.
Martin, Ph. 2006. La transcription des proéminences accentuelles : mission impossible ? In A.-C. Simon (ed.) Bulletin PFC 6, ERSS, 81-87.
Mertens, P. 2004. Le prosogramme: une transcription semi-automatique de la prosodie. Cahiers de l’Institut de Linguistique de Louvain, 30 (1-3), 7-25.
Milroy, L. 1980. Language and Social Networks. Oxford: Blackwell.
Morel, M., A. Lacheret-Dujour, C. Lyche and F. Poiré 2006. Vous avez dit proéminence? Proceedings from the 26th Journées d’Etude sur la Parole (JEP’06), Dinard 12-15 June 2006, 183-186.
Morin, Y.-C. 1987. French data and phonological theory. Linguistics 25: 815-843.
Morin, Y.-C. 2000. Le français de référence et les normes de prononciation. In M. Francard et alii. (eds.) Actes du colloque de Louvain-la-Neuve 3-5 novembre 1999, Cahiers de l’Institut de Linguistique de Louvain, 26 : 91-135.
Morin, Y.-C. and J. Kaye. 1982. The syntactic bases for French liaison. Journal of Linguistics 18, 291-330.
Nguyen, N. and R. Espesser 2004. Méthodes et outils pour l’analyse acoustique des systèmes vocaliques. Version 1.0. In J. Eychenne and G. Mallet (eds.) Bulletin PFC 3, ERSS, 77-85.
Poiré, F. 2006. La perception des proéminences et le codage prosodique. In A.-C. Simon (ed.) Bulletin PFC 6, ERSS, 69-79.
Post, B., E. Delais-Roussarie and A.-C. Simon 2006. IVTS, un système de transcription pour la variation prosodique. In A.-C. Simon (ed.) Bulletin PFC 6, ERSS, 51-68.
Séguy, J. 1973. Les Atlas linguistiques de la France par région. Langue Française 18, 65-90.
Simon, A.-C. 2004. Analyse de la variation prosodique du français dans les données conversationnelles : propositions théoriques et méthodologiques. In J. Eychenne and G. Mallet (eds.) Bulletin PFC 3, ERSS, 99-113.
Steriade, D. 2001. The Phonology of Perceptibility Effects: the P-map and its consequences for constraint organization. Ms. MIT.
Walter, H. 1982. Enquête phonologique et variétés régionales du français. Paris : PUF.
Woehrling, C. and P. Boula de Mareüil 2006.
Identification d’accents régionaux en français: perception et catégorisation. In A.-C. Simon (ed.) Bulletin PFC 6, ERSS, 89-102.
Collateral Languages and Digital Corpus

Jean-Michel ELOY

Introduction

This paper deals with one of the most salient aspects of the TUFS project, the fact that it is corpus-based. The term “usage-based” in its title “UBLI” raises many questions, which justify a sociolinguistic approach. Our research team, named LESCLaP, whose main keywords are “language contacts”, “language politics” and “sociolinguistics”, has been working for a few years on a very rich problem, that of linguistic proximity. That reflection deals firstly with the linguistic situation of the region where we work, which includes French and the regional language, Picard. This concern with proximity sheds some light – or some trouble – on the Picartext project.¹ The Picartext project, on which LESCLaP works, is a text database in Picard. We intend here to shed some light on Picartext, considering that it should be interesting also in relation to the UBLI corpora. We shall first define a general framework, using the notion of language proximity, firstly from a diachronic point of view, then from a synchronic one. But more precisely, we have to handle correctly what exactly the literary corpus of Picartext is, and situate it, that is to say, characterize it on different levels. We’ll specify about Picard that it is a “near” and “collateral” language, a non-standardised language, which is enriched by an important literature – which obliges us to describe precisely the value of literature in contemporary linguistic processes.

1. Near / collateral languages

Two recent seminars have dealt with language proximity, one in Amiens in 2001 and one in Limerick in 2005 (Eloy 2004, Eloy & OhIfearnain 2006). The theoretical assessments from these seminars aim firstly to know more precisely what proximity means, secondly to draw a working plan, which will have to be multidimensional and complex. We have had to revisit comparatism, before formulating synchronic terms.
1 http://www.u-picardie.fr/LESCLaP
1.1. Relevance of the genealogic viewpoint

Nineteenth century comparatism, as is known, was reconstruction oriented, from which was drawn a genealogical representation, the famous Stammbaum by Schleicher. We should like to point out that this genealogic metaphor results in three indissociable readings. First, it supposes that languages are definitely closed sets. Then, of course, it points to a history, in terms of “genetic, lineage, mother- and daughter-languages, etymological sources”… but also, what is interesting for us, it allows us to predict and it organises in synchrony some degrees of similarity or proximity – related languages, sister- or cousin-languages, language “families”, groups and subgroups… Another illustration or formulation of the same idea can be found in the fact that the “laws” or regularities of evolution itself (in the sense of the neo-grammarians), although they are diachronic, may work as “mapping laws” between related languages. This possibility will of course be limited by our knowledge of historical linguistics: not everybody can perceive the “identity” of Engl. father and Latin pater, or a fortiori of centum and hund(red), or perceive Engl. heart in French cordial (Latin cordis) or in French cardiaque (Greek kardia). In other words, the feeling of proximity, insofar as it is based on historical relations of forms, only follows from it in very varied ways, according to an epilinguistic factor – metalinguistic knowledge or capacities of observation. And that feeling of proximity has a role in language building, as we’ll see later. The genealogical representation of comparatists, which has not been made wholly invalid, has established that language divergence is the major way of new languages appearing – call it birth or building. Any search for “proto-languages”, any historical perspective on a language group, is based on that process.
Languages of all groups, Indo-European (and a fortiori subgroups such as Slavic, Germanic, etc.), Bantu, Semitic, etc. may be supposed to have known, before being institutionalised as national or ethnic, and sometimes standardised, languages, as we see them nowadays, innumerable steps of dialectal divergence, of partial and evolutive individuation, in other words steps of weak distinction and short distance. Let us take the example of Romance languages, in the perspective of Z. Muljacic (2004). This author retraces the history of Romance languages as a succession of phases of divergence and convergence. He evaluates the varieties born from Latin as not far from one thousand around the 9th century, becoming about 150 in the 13th century, 50 in the 16th century, and about 25
at the end of the 18th century. Then their number increases after the 19th c., reaches about 40 around 1970, and goes on climbing today. Of course, for ancient periods, these are stabilised and named varieties, but not institutionalised languages – that meaning of language does not exist, for vulgar languages, before the 16th century. Such an abstract of course raises many difficulties, because language events are always so complex that we hardly dare sum up realities that are always heterogeneous. Muljacic, taking up again Kloss’ concepts of Abstand and Ausbau (“distance” and “elaboration”), also characterises situations which are halfway, in the sense that distinctions are only beginning to exist. Without those halfway states, it would be unthinkable how we could come from Latin to that set named Romance languages. It is more than a century since wave theory (Wellentheorie) by Johannes Schmidt (1872) was produced to complement the genealogical representation with “horizontal” considerations, particularly with the notions of expansion and diffusion of innovations – not far from what we now call interferences and loans. The value of this theory is that it gives an important place to neighbourhood, and therefore to openness and permeability of languages. As interferences are important in the evolution of languages, it is well known that this idea, which was marginal a century ago, now holds a central theoretical place in what has been called “contact linguistics” (Nelde 1983). Our diachronic detour was not useless. In the face of our Picard corpus, we have to keep in mind, not only the close relatedness with near varieties, but mostly the idea that any language plays its future in such close relations. We think it important that today’s observer, even if he works in a synchronic perspective, should be conscious that this very process of “glottogenesis” is working through our “near languages”.
In other words, there is always some Ausbau, whether you call it intervention, voluntarism or simply norm. “A language plays its future” means that the speakers are undertaking some processes of building (cf. Ausbau in H. Kloss 1967, 1987) which modify the language, and make its history.

1.2. How to handle synchronically the nearness of linguistic systems

A synchronic approach gives results that differ from those of a diachronic one, but are equally important. But first, how can we make that notion of nearness more precise? On a qualitative level, our main difficulty is that of choosing, because there exist many approaches: contrastive grammars, typologies, using various parameters, different from one language to another. Many works
describe sharp differences between near linguistic systems. But even if the notion of proximity is only a metaphor for “proportion of common features”, that metaphor implies that we need a certain quantification of the so-called distance. And as we need to have means of comparing distances, it supposes we have at our disposal a system of measurement and a universal, context-free measure.

Dialectometry

Dialectometrists count differences between adjacent points of an atlas: phonological, lexical, morphological, either by category, or all together. Then they establish connections between that quantity and geographic distribution. Thanks to maps, they can show various geographic sound structures: geolinguistic areas, that is areas characterised by linguistic features. In other words, dialectometry is a tool for discovering the way languages and language differences occupy and structure geographic space. Because of that heuristic function, in a tool such as VDM by Edgar Haimerl (see H. Goebl 2004), you may edit different maps, according to which data and which algorithms you have decided to take into account. That method has really obtained results, but it is also interesting in that it raises many problems – about data elaboration (weighting of linguistic features), interpretation of maps, connections of them with social facts (linguistic consciousness and identity, causalities, etc). One presupposition of dialectometry is that all points are not identical: therefore any koinè, any standardised variety in a certain sense disappears a priori, because koinè or even the standard, in a certain sense, are variation free. In the case of a finished koinè, the map should be monochromatic. Or better, it should show only obvious contrasts between large linguistic areas, between standard languages, which would be of very weak interest.
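The counting step on which dialectometry rests can be illustrated with a minimal Python sketch. It is not the algorithm used by VDM or by any published study: the function names `difference_share` and `distance_table`, the optional `weights` argument (one simple way to weight features), and the three atlas points with their feature values are all invented for the example.

```python
from itertools import combinations


def difference_share(a, b, weights=None):
    """Weighted share of commonly attested features on which two points differ."""
    common = sorted(set(a) & set(b))
    if not common:
        raise ValueError("the two points share no features")
    weights = weights or {}
    total = sum(weights.get(f, 1.0) for f in common)
    if total == 0:
        raise ValueError("all compared features have zero weight")
    differing = sum(weights.get(f, 1.0) for f in common if a[f] != b[f])
    return differing / total


def distance_table(points, weights=None):
    """Pairwise distances between all atlas points, keyed by sorted name pair."""
    return {
        (p, q): difference_share(points[p], points[q], weights)
        for p, q in combinations(sorted(points), 2)
    }


# Invented data: three atlas points described by four features each.
points = {
    "A": {"f1": "x", "f2": "y", "f3": "z", "f4": "w"},
    "B": {"f1": "x", "f2": "y", "f3": "q", "f4": "w"},
    "C": {"f1": "k", "f2": "m", "f3": "q", "f4": "w"},
}
table = distance_table(points)
for pair, distance in table.items():
    print(pair, distance)
```

A table of this kind is the input that mapping or clustering procedures would then work on; changing the weights or the feature selection changes the resulting picture, which is exactly the interpretive problem raised above.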
The identity processes, from the self-allocation of individuals to social naming itself, presuppose non-variant linguistic objects (that is, objects considered as non-variant). Speakers apply simplifications, regularisations and incisive categorisations to linguistic realities – which we will find in literary forms in Picartext. The structures made up by dialectometry – areas in contrast, dendrograms or trees – can be set against speaker categorisations, which are explored by “subjective dialectology”. The method used by Yuji Kawaguchi (2007) in Limerick is an interesting methodological improvement, because of the distinctions made among variation types (semasiological vs onomasiological, lexical vs phonological) and the use of cluster analysis to measure distances between varieties. The result is also a geolinguistic structure, which as well as
dialectometry asks questions about dialectal limits or areas. And he has shown that measuring distances between near varieties is possible.

Lexicostatistics

Lexicostatistics is a method which consists in counting the lexical forms common to two varieties. It has proved useful where genetic classifications are difficult for lack of historical data. It was used, for instance, in African linguistic domains, particularly Bantu (Bastin et al. 1983), where it provides a means of identifying near varieties thanks to their shared vocabulary – whether by “lineage” or by contact – and of making assumptions about history and genealogy. It has also been used to connect linguistic data and population genetics data, and to find some significant correspondences (Poloni et al. 1997). This method is open to many objections, owing to its purely lexical base and to the fact that it rests on a very small number of items (often 200, sometimes only 100). Many linguists consider that the lexicon is not the essential part of a linguistic system. The lexicons of our “near languages”, in fact, show relatively few differences, and are therefore difficult to handle within the framework of lexicostatistics (at least as it now stands). Also, we do not know very well the role of each linguistic level in the overall result called possibility of intercomprehension, which in any case is not predictable on the basis of the lexicon alone. Moreover, that method does not include any element of linguistic consciousness, and therefore of ethnolinguistic identity, important though it is. In brief, the validity of lexicostatistics is established in genetic questions, mostly in relation to other types of data, but is not proved in cases of near languages. Perhaps that tool could be taken up again and adapted to these realities.
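The counting at the heart of lexicostatistics can be sketched as follows. The word lists, the variety names and the function `shared_vocabulary_rate` are hypothetical, and identity of written form stands in here for the cognacy judgments a linguist would actually make; a real study would use a list of 100 or 200 meanings, as mentioned above.

```python
def shared_vocabulary_rate(list_a, list_b):
    """Proportion of meanings, among those attested in both lists,
    for which the two varieties use the same lexical form."""
    meanings = sorted(set(list_a) & set(list_b))
    if not meanings:
        raise ValueError("the two lists share no meanings")
    same = sum(1 for m in meanings if list_a[m] == list_b[m])
    return same / len(meanings)

# Invented five-item lists (meaning -> form); real lists have 100 or 200 items.
VARIETY_X = {"water": "iau", "dog": "kien", "hand": "main",
             "sun": "solel", "two": "deus"}
VARIETY_Y = {"water": "eau", "dog": "chien", "hand": "main",
             "sun": "soleil", "two": "deus"}

if __name__ == "__main__":
    rate = shared_vocabulary_rate(VARIETY_X, VARIETY_Y)
    print(f"shared forms: {rate:.0%}")
```

With near languages most items on such a list would be shared, so the rate stays close to its maximum and discriminates poorly, which is one way of reading the objection raised above.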
Study of intercomprehension

Charlotte Gooskens (2006) seeks to describe the role of attitudes and phonetic distance in inter-Scandinavian intercomprehension (Norwegian, Danish, Swedish). She questions 17 groups of 40 persons from the 3 countries. First they take a test of intelligibility, which shows very asymmetrical relations: Danes understand Norwegian and Swedish very little; Norwegians understand Swedish very well and Danish much less (but better than Danes understand Norwegian); Swedes understand Danish very little and Norwegian very well. She adds a declarative enquiry on attitudes, with the questions: “What do you think of the Danish/Norwegian/Swedish language? Ugly or beautiful?”, “Would you like to live or study in Denmark/Norway/Sweden?”,
and asks them about their contact with each of the neighbouring languages (TV, newspapers, meetings or visits). All the results are also very asymmetrical. Lastly, she studies the phonetic distances. The degree of similarity between word forms was assessed by means of the so-called Levenshtein distance, calculated automatically from the phonetic transcriptions of the aligned cognate words: distance is like a cost, according to the number of symbols you have to add, delete or replace to get from one transcription to the other. To find the predictors of intelligibility, the author looks for correlations between these variables by the method of multiple linear regression. The result is that, among those different factors, phonetic proximity is the best predictor, whereas attitudes are very weak indicators (and very polysemic).

Near languages: a huge question

The work generated at the Amiens colloquium (Eloy 2004) shows how large and rich the domain thus opened is. Many cases have been brought up for common consideration (see Annex for the list of cases brought to the colloquia of 2001 and 2005). In fact, most of the papers did not intend to “measure” distance. Most of them seemed to adopt another scheme: given language proximity, what do people do with it? So their main interests were in sociolinguistic aspects.

1.3. Collateral languages

We have proposed, for a certain type of language situation, the concept of collateral languages. “We propose to call by this name varieties which are near one another – objectively and subjectively – at the linguistic, sociolinguistic and historical or glottopolitical level, and those varieties, which tend to contrast with one another, are historically linked because of the modalities of their development” (Eloy 2004: 21). The aim was to stress language building processes, and the anthropological nature of these processes.
We could not make do with the notion of “neighbour language” alone, which is not precise and clear enough – it suggests geographical neighbourhood too strongly. We have to stress that for us this concept of collateral languages applies to closely related languages, and not, for instance, to French and Breton, for which the notion of roof-language (Dachsprache) is sufficient. We think first of the cases where the developments were linked on the basis of genetic proximity. This common history clearly characterises the relation between a standard language and the cognate so-called “dialectal”
languages in diglossic relations. That link is often a history of resemblance-repulsion. Symmetrically, the dominant language is structured in particular to distinguish itself from some forms of inferior reputation, and the lower varieties are structured to acquire an identity of their own in relation to the dominant language. All this, if the varieties are linguistically close, gives a type of “near languages” with common features: think for instance of the relations between Italian dialetti, Ibero-Romance languages, Langues d’oïl, and their respective standard languages. In other words, it seemed interesting to us to place diglossia back within the historical process of “recognition-birth” (Marcellesi 1986) or language building. The concept of collateral languages may also be applied to some varieties which are not at present in a hierarchical relation, but whose history – in large part a shared one – is the history of a process of divergence. Many of them are not – are no longer – in a diglossic relation, owing to geopolitical evolution: either each of them has found a way of flourishing in an independent country (e.g. Scandinavian, Slavic languages), or they were cut off from one another by frontiers (e.g. Magyar-Csango, Westvlaamsch-Nederlands). The concept of collateral language gives linguistic proximity a dynamic and sociopolitical perspective.

Language builder

As will have been understood from the terms we have just used, for us a language is not only a linguistic system. By “language”2 we mean a complex social reality, which of course includes linguistic data, but also practical conceptions – i.e. practices and ways of understanding and planning them –, theories and categories, naïve or not, made from them and making them, and the various forms of institutionalisation which stabilise them according to the social forces. We have also used the terms koinè and (sub)standard, and we think it is useful to reflect on how we understand them.
At first sight, they differ mainly in that a koinè is spoken and a standard is written, or in that a koinè is not the product of voluntarism, unlike a standard. But in fact, we see some unified usages being adopted in small areas, around an author or an association, which may be called either sub-koinès and regiolects, if we consider them as habits, or substandards, if we consider the institutional role of the associations.

2. In fact, we mean: “By the French word ‘langue’”: translation into English is another problem…
Then these substandards circulate, influence each other, and are probably the very step towards koinè making; the Picartext database should permit verification and illustration of this process, for instance by tracing some words through places and dates, from text to text. Whether on a local or on a regional scale, the spread of such substandards seems to be a matter of imitating models: all we find there, once again, is the way literature created the modern “great” languages. At the same time we find attitudes aiming to stress a contrast between Picard and French: even where the speaker spontaneously varies between two forms – e.g. “un peu” and “un molet” (a little) –, in a formal context he will use “un molet” because it is different from French.

1.4. A multidimensional approach

We may consider that intercomprehension is proximity in practice, and not limited to a strictly linguistic level. These two questions are coextensive. Thus Gooskens’s multidimensional study of intercomprehension calls for an extension with regard to proximity. Which factors contribute towards proximity, understood as proximity as felt by speakers? The necessary multidimensional approach may be outlined like this:

Table 1.
Outline for a multidimensional approach to intercomprehension (intercomprehension is considered as the practical aspect of proximity)

A- on the level of linguistic data (oral vernacular / written standard)
1- degrees of phonetic-phonological differences
2- degrees of morpho-syntactical differences
3- degrees of lexical differences
4- possibilities for conversion rules
5- weighted index (synthesis of different levels) (linguistic distance)
6- degrees of standardisation of current use (intrinsic variability)
B- common-historical frames (language ideologies)
7- tolerance to variation, interferences, mixing (epilinguistic rigidity-flexibility)
8- school training in communicating in near languages
9- importance of other languages in media or everyday life (linguistic autarky-hospitality)
10- degrees of linguistic consciousness (sensitiveness-passivity)
11- A-language prestige for B-language speakers and vice-versa (linguistic respect-disrespect)
C- individual-situational frames
12- cognitive dispositions to understanding and expressing (linguistic facility-difficulty)
13- dispositions due to a level of linguistic education (degrees of linguistic culture)
14- frequency of microcontexts favouring (or not) an effort to achieve intercomprehension
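Rows 1 to 5 of Table 1 can be illustrated with a small Python sketch. For row 1 it uses the Levenshtein distance reported in Gooskens's study, normalised here by the length of the longer transcription; rows 2 and 3 are entered as already computed values; row 5 is a weighted mean. The transcriptions, the level values, the weights, the normalisation and the names `levenshtein`, `normalised_distance` and `weighted_index` are all assumptions made for illustration, not values or procedures taken from the studies cited.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence a into sequence b (unit costs)."""
    previous = list(range(len(b) + 1))
    for i, sym_a in enumerate(a, start=1):
        current = [i]
        for j, sym_b in enumerate(b, start=1):
            cost = 0 if sym_a == sym_b else 1
            current.append(min(
                previous[j] + 1,         # delete sym_a
                current[j - 1] + 1,      # insert sym_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]

def normalised_distance(a, b):
    """Levenshtein distance divided by the longer length (0.0 to 1.0)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

def weighted_index(levels, weights):
    """Weighted mean of per-level distances (row 5 of Table 1)."""
    total = sum(weights[name] for name in levels)
    return sum(levels[name] * weights[name] for name in levels) / total

if __name__ == "__main__":
    # Invented pairs of transcriptions of cognate words in two varieties.
    pairs = [("kat", "sa"), ("main", "main"), ("solel", "solej")]
    phonetic = sum(normalised_distance(x, y) for x, y in pairs) / len(pairs)
    # Hypothetical values for rows 2 and 3, and hypothetical weights.
    levels = {"phonetic": phonetic, "morphosyntactic": 0.10, "lexical": 0.05}
    weights = {"phonetic": 2.0, "morphosyntactic": 1.0, "lexical": 1.0}
    print(f"phonetic distance: {phonetic:.3f}")
    print(f"weighted index:    {weighted_index(levels, weights):.3f}")
```

The weights are arbitrary: a different choice can change how two pairs of varieties rank against each other, so such an index is only as meaningful as the weighting behind it.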
Need we say it? Even if we admit the possibility of assigning a value to every variable, it would be a huge programme of work. But what is still more difficult is that the phenomenon is complex, in the sense of Edgar Morin (1990): the way different factors get organised and balanced is not decided once and for all. It is exactly the opposite of a mechanism producing calculable results. But let us repeat that languages are not things! They are complex social and historical practices: languages are built by societies which, through them, build themselves. It is this process, fundamentally, that interests us.

2. Picartext project

The Picard language was the starting point of our work on proximity and collateral languages. French and Picard are typical collateral languages. Thus we shall find in the literary corpus of Picartext traces of proximity and traces of the work of contrasting Picard with French – one could even say that this work is one of the reasons for writing, as “a defence and illustration” of the language. Beyond the Picard case, other collateral language databases should be compared. We will list again the elements which shed light on the Picartext database, and see which objectives they imply.

2.1. Genetic, historical and dialectological knowledge

Our data are clearly situated in a genetic architecture: among Romance languages, Picard is one of the “Langues d’oïl”, like Norman, Walloon, Poitevin, etc., born from the medieval “Langue d’oïl”. An important remark must be made concerning the word “French”: one must have not only a historical but also a non-naïve conception of what a standard language is. In fact, the “French language” was built within the “Langue d’oïl”, first as a koinè, then as an institutionalised standard. It would not be correct to call the whole group of the “Langues d’oïl” by the name of the standard, because one may say, for instance, that Norman belongs to the “Langues d’oïl” but not to French. (v.
Eloy & Simoni-Aurembou 1998). The emergence of the Picard language is a historical process, both ancient and recent, as a very brief summary will show. The medieval “Françoise” language, vernacular and not standardised at first, was put into written form in diverse graphical varieties (or “scripta”). There is, in the 13th century, among others, a Picard scripta, which is a legitimate and even rather prestigious written form. One may also say that the medieval “Françoise” language is rich in its diversity, or tolerant of regional “colours” – but regional features never amount to more than 25%.
In the 16th and 17th centuries, the French language, aiming to become equal or superior to Latin, was made very civilized, soon excluding any regional feature, just like many other features of popular culture. In that period we find the first “patois” texts, where excluded linguistic forms are openly displayed in provocative ways, in anonymous, burlesque or anti-authority writings. That very language will go on being written more and more, mostly after the beginning of Romanticism in the 19th century, as a literary language resting on the people’s language. Since then, to different degrees according to the moment, and rather strongly in the second half of the 20th century, it has supported a sense of identity, and claims are made for its recognition, i.e. for its being made official, at least by symbolic decisions. It seems nowadays that, from a certain point of view, an abundance of literary products and of positive feelings makes up for the narrow place occupied by the language in social life. One can read this history as reflected in ancient and contemporary texts: the number of texts, their literary genres and ambitions, the linguistic imagination underpinning them, etc. will all have to be plotted on curves over the years, in order to understand what is happening today. Thus the data in the Picartext database are chronologically situated. We also have maps at our disposal, thanks to dialectological knowledge acquired over more than a century. And in fact it is important that we map our data, for several reasons: there is no standard whose expansion it would be sufficient to study, and, on the contrary, the local dimension is very important for speakers. Claiming the language is all the more legitimate since it is rooted in a geographic place. So our data are systematically linked to geography, generally by the birthplace or place of residence of authors.

2.2.
Variability and standardisation

As apparent variation is caused indistinctly by linguistic and graphical variability and by various attitudes, the Picard text is complex to handle. Quantifying linguistic variability in the Picard area – as anywhere – is difficult: it may be felt as weak or strong, though unity can easily be demonstrated from the maps of the Atlas (Carton F., Lebegue M., 1995). The textual database provides useful tools for the exploration of linguistic and graphical variation in a literary corpus. The fact that the texts are all situated in place and time will give access to external linguistic coherences – external in the sense that they are linked to non-linguistic data. As against variation, we must be more precise concerning standardisation. Because of its socio-political status, the Picard language has not
benefitted from the same work of standardisation as a national language. This is not for lack of a literary corpus: a rather substantial literature has been built up since the Middle Ages, particularly in the last two centuries. The causes are socio-political: the building of the French nation and national language, the social characteristics of politicians, the lack of ethnic structures, of regional structures and policies, etc. Work is currently being done, by many authors and several associations, towards a graphical standardisation. That is why there are today some substandards, or sketches of standards. There are two main systems, named after the people who devised them, Vasseur and Feller-Carton. In fact, the disagreements between them are few, though obvious. There are also, in past as well as in present-day texts, some inconsistent or even cacographic spellings. But at the level of the language itself, nobody nowadays thinks of standardisation: standard French even seems to be a counter-model, given that it denies geolinguistic variation. Is there a link between standardisation and status? One can easily see that an official status implies a demand for standardisation. But vitality is not necessarily linked to status, and still less to standardisation. The models of “ethnolinguistic vitality” (see for instance Landry & Allard 1994) associate those features, but without any causality schemes. Besides, a weak language may be easier to standardise, because it is easier to define which practices must be reformed. Lastly, it is not impossible that people love their “low” language precisely because they feel it to be “free” – see the classical characteristics of the low varieties in diglossic models, or the notion of “langue paritaire” (“parity language”) of Le Du and Le Berre (1996). Koinèisation will be one of the most interesting subjects to study in Picartext.

2.3.
A sociolinguistic approach

A textual database is defined by the typology of the documents it contains. We have chosen to store literary texts, at least as a first step, thereby following great predecessors like Frantext or Beltext, considering the central importance of literary writing in language building (Ausbau) (Eloy 1997). But is there a Picard literary language? Does the written language of the texts really differ from taped surveys? It seems obvious, because written enunciation has possibilities and constraints different from those of oral enunciation. But are we dealing with an “oral literature” that is merely transcribed? This is a line of research which will imply some comparisons with oral corpora.
It is quite a mystery why someone begins to write. But it is doubly mysterious why someone writes in a minoritised language like Picard. It is never a matter of course to write in the low variety of the regional diglossia, as it could be in a monolingual situation, or one considered as such. Since the 17th century, when people have written in Picard it has been because they set themselves in opposition to standard French – or to something linked with standard French: it is a marked choice, whatever specific qualities they find in that language – many people speak of its density and colourfulness. The publisher, whether an individual or an association, is in most cases also an activist. That choice has the advantage of escaping the censors of the French language – whether schools, publishers or the bourgeoisie – but does not refuse the judgments of readers who are close enough to read Picard. Thus, our linguistic data are in part voluntarist, at least in the sense that choosing the marked language implies linguistic consciousness. The traces of this specificity may be found in the content, in the ideas of the texts. But we also hypothesise that, through variation, some markers of work on the language appear, which are revealing of the authors’ epilinguistic representations. Both the textual material and the social use of Picartext are socially rather original. From these two points of view, enriching the database will imply a link between scholars and the people – the “social group”? – concerned with Picard literary practice. On the one hand, to build the database, we call on Picard writers and associations to donate as many already digitised texts as possible. Of course, all the old texts, printed on paper, are being scanned – though for that too we intend to call upon the goodwill of people and associations interested in Picard. On the other hand, the funds we obtain are linked to the public availability of linguistic data.
So we will soon provide online access to the Picartext database. We also expect that the database will give rise to many questions and debates, and that new information will arrive, which will interest our sociolinguists.

Conclusion

It is no longer necessary to argue for the value of a textual database. Let us nevertheless summarise what this new linguistic resource offers. Picartext, beyond our own research, will be at the disposal of anyone thinking of improving Picard literary expression, particularly from the point of view of standardisation. More generally, it will be at the disposal of writers and of scholars inside or outside the university. It seems to us that correctly interpreting linguistic data implies a need for the following information:
– situating the language on a genetic/typological level
– dialectological mapping
– a history of the elaboration (Ausbau) of the language (“collateral languages”), including what is happening at present, i.e. socio-political linguistics
– a literary history, in the internal as well as external sense
– an internal (systemic) history of the language, including ongoing evolutions
– a history of graphical features, including analysis of current trends
– situating our very own work in the social field.

Listing these prerequisites is enough to see that most of them are at the same time objectives for research: in other words, each of them will be enriched by the possibilities of Picartext. Our first objective, as we said before, is to “bring variation under control”, i.e. to account for it in an adequate linguistic description. Thus the database seemed to us to be a necessary tool for exploring variation. Conversely, the limits of variation, the non-variant data, will allow us to bring out pre-standardised forms, to show ongoing koinèisation. Moreover, as was the case for Frantext and the Trésor de la Langue Française3, textual data will make it possible to compile new Picard dictionaries. A better knowledge of koinèisation may be considered as a step towards possible standardisation, in other words as an element of grammatisation (Auroux 1994). We have set, as the framework of our research, the theme of the proximity of near and collateral languages, which concerns a great number of languages. The difficulties of the notion of proximity and the vastness of the research programme it implies do not invalidate but rather confirm its validity and fruitfulness. We have tried to show, through the case of the Picard language, how that framework gives definition to the text corpus and lets us understand what we are up against. That is how we thought it useful to understand the term “usage-based” in the UBLI project.
We probably still have much to learn from the Picard case, and above all from its comparison with many other languages. A “near” and “collateral” language alongside French, a non-standardised and emerging language: these characteristics, which define our data, also define our directions of research. It seemed to us that this sociolinguistic reflection could contribute to the great work of TUFS on so many, and sociolinguistically so diverse, languages.
3. With all due respect…
References AUROUX Sylvain, 1994, « Introduction. Le processus de grammatisation et ses enjeux »., in AUROUX S. (éd.): Histoire des idées linguistiques. Tome 2: Le développement de la grammaire occidentale, Bruxelles : Mardaga, pp. 11-64 BASTIN Y., A. COUPEZ & B. DE HALLEUX (1983), Classification lexicostatique des langues bantoues (214 relevés), Bulletin des Séances, 27, 2, Bruxelles : Académie Royale des Sciences d’Outre-Mer, pp. 173-199. CARTON F., LEBEGUE M., 1989, 1997, Atlas linguistique et ethnologique picard I et II, Paris : Editions du Cnrs, 2 vol. ELOY J.-M. (éd.), Des langues collatérales. Problèmes linguistiques, sociolinguistiques et glottopolitiques de la proximité linguistique, Actes du Colloque international d’Amiens (29-11-2001), Paris : L’Harmattan, 2 vol., ISBN 2-7475-5827-4 et 2-7475-5828-2 ELOY J.-M. et Ó HIFEARNAIN Tadhg (éd.), 2007, Langues proches et collatérales, Actes du Colloque international de Limerick (18-06-2005), Paris : L’Harmattan, sous presse ELOY Jean-Michel et SIMONI-AUREMBOU Marie-Rose, 1998, « Variations et variétés en domaine d’oïl », Revue française de linguistique appliquée, vol. III, fasc. I, pp. 7-22, ISSN 1386-1204 ELOY Jean-Michel, 1997, La constitution du picard : une approche de la notion de langue., Louvain, Peeters (Bibl. des Cahiers de l’Inst. de Linguistique), 259 p., ISBN 90-6831-905-1 GOEBL Hans, 2004, “Bref aperçu sur les problèmes et méthodes de la dialectométrie (avec application à l’ALF)”, in J.-M. ELOY (éd.), Des langues collatérales. Problèmes linguistiques, sociolinguistiques et glottopolitiques de la proximité linguistique, Paris : L’Harmattan, pp. 39-60, ISBN 2-7475-5827-4 (I) GOOSKENS Charlotte, 2007, “Contact, attitude and phonetic distance as predictors of inter-Scandinavian communication », in ELOY J.-M. 
et Ó HIFEARNAIN Tadhg (éd.), Langues proches et collatérales, Actes du Colloque international de Limerick (18-06-2005), Paris : L’Harmattan, sous presse KAWAGUCHI Yuji, 2007, “Is it possible to measure the distance between near languages ? - A case study of French dialects”, in ELOY J.-M. et Ó HIFEARNAIN Tadhg (éd.), Langues proches et collatérales, Actes du Colloque international de Limerick (18-06-2005), Paris : L’Harmattan, sous presse KLOSS Heinz, 1967, “Abstand languages and Ausbau languages”, Anthropological Linguistics vol. 9, n°7, pp. 29-41
KLOSS Heinz, 1987, “Abstandsprache und Ausbausprache”, in STEGER Hugo, WIEGAND Herbert Ernst (éd.) : Sociolinguistics. Soziolinguistik. An International Handbook of the Science of Language and Society. Vol. I, Berlin : de Gruyter, pp. 302-308, ISBN 3-11-009694-3 LANDRY Rodrigue, ALLARD Réal (éd.), 1994, Ethnolinguistic Vitality, IJSL 108 LE DU J. et LE BERRE Y., 1996, « Parité et disparité : sphère publique et sphère privée de la parole », La Bretagne Linguistique vol. 10, Actes du colloque “Badume – Standard – Norme, Le double jeu de la langue” tenu à Brest du 2 au 4 juin 1994, pp. 7-25 MARCELLESI Jean-Baptiste, 1986, « Actualité du processus de naissance de langues en domaine roman »., Cahiers de linguistique sociale n° 9 (E. et T. BULOT éd.), pp. 21-29 MORIN Edgar, 1990, Introduction à la pensée complexe., Paris, ESF Editeur, 158 p. MULJACIC Žarko, 2004, « La dynamique des langues romanes », in J.-M. ELOY (éd.), Des langues collatérales. Problèmes linguistiques, sociolinguistiques et glottopolitiques de la proximité linguistique, Paris : L’Harmattan, pp. 299-314, ISBN 2-7475-5828-2 NELDE Peter H. (éd.), 1983, Theorie, Methoden und Modelle der Kontaktlinguistik., Bonn, Dümmler, 450 p. POLONI E. S. et al. (1997), “Human genetic affinities for Y-chromosome p49a, f/TaqI haplotypes show strong correspondence with linguistics”, American Journal of Human Genetics, 61, pp. 1015-1035.
Annex

Cases of linguistic proximity mentioned in the 2001 and 2005 colloquia

Romance domain
Oïl languages (Picard, Walloon, Champenois, Norman) / standard French
Canadian French (Ontarien, Québécois, Joual) / standard French, Chiac
Oïl and Italian medieval dialects / Levantine 13th Century French
Catalan parlances / standard Catalan, Italian dialetti (standard Piemontese, Piemontese dialects, Genovese) / Toscano
Asturiano / Castillano / Galician
Franco-provençal dialect / standard French
Créole(s) / standard French
Greece Arumanian / Rumanian standard

Celtic domain
(néo)Breton / Breton dialects or “badumes”
Irish / Scottish / Manx Gaelic

Germanic domain
Norwegian / Swedish / Danish
German / Nederlands
German dialects / standard German
Scots / standard English, Irish Parley, Irish English
Flemish (Westvlaamsch) / standard Nederlands

Other domains
Turkish languages
Ukrainian / standard Russian
Serbian / Croatian standard
Magyar / Moldavian Csango / related languages
Finnish varieties (Voro-seto, Carelian, Votian, Mordvinian) / standard Finnish / standard Estonian
litteral Arabic / dialectal, Maltese Arabic, maghribi / punic
Cyprus and Pontiki Greek dialects / standard Greek
Greece Arvanitika / Albanian
Parallel and Comparable Corpora
– The State of Play –
Tony McENERY and Zhonghua XIAO

Since the 1980s, corpus linguistics has developed at an accelerated speed. While the construction and exploitation of English language corpora still dominate corpus linguistics research, corpora of other languages, particularly typologically related European languages such as French, German and Portuguese and Asian languages such as Chinese, Korean and Japanese, have also become available and have notably added to the diversity of corpus-based language studies.1 In addition to monolingual corpora, parallel and comparable corpora have been a key focus of non-English corpus linguistics, largely because corpora of these two types are important resources for translation and contrastive studies. As Aijmer & Altenberg (1996: 12) observe, parallel and comparable corpora ‘offer specific uses and possibilities’ for contrastive and translation studies:

• they give new insights into the languages compared – insights that are not likely to be noticed in studies of monolingual corpora;
• they can be used for a range of comparative purposes and increase our knowledge of language-specific, typological and cultural differences, as well as of universal features;
• they illuminate differences between source texts and translations, and between native and non-native texts;
• they can be used for a number of practical applications, e.g. in lexicography, language teaching and translation.

In this chapter, we will explore the potential value of such multilingual corpora. Before we explore the value of these corpora, however, it is necessary to clarify some terminological issues.
1. Lists of available corpus resources involving different languages, both monolingual and multilingual, can be found at the websites of Evaluations and Language Resource Distribution Agency (ELDA, http://www.eida.fr/cata/tabtxt1.html), TELRI Research Archive of Computational Tools and Resources (TRACTOR, http://tractor.bham.ac.uk/tractor/catalogue.html), Oxford Text Archive (OTA, http://ota.ahds.ac.uk) and Linguistic Data Consortium (LDC, http://www.ldc.upenn.edu/Catelog/byType.jsp).
1. Multilingual Corpora: Terminological Issues

When we refer to a corpus involving more than one language as a multilingual corpus, the term multilingual is used in a broad sense. A multilingual corpus, in a narrower sense, must involve at least three languages, while those involving only two languages are conventionally referred to as bilingual corpora. In this chapter, we are using multilingual and bilingual interchangeably. Given that corpora involving more than one language are a relatively new phenomenon, with most research hailing from the early 1990s (e.g. the English-Norwegian Parallel Corpus (ENPC), see Johansson & Hofland, 1994),2 it is unsurprising to discover that there is some confusion surrounding the terminology used in relation to these corpora. Generally, there are three types of corpora involving more than one language:

• Type A: Source texts plus translations, e.g. Canadian Hansard (cf. Brown, Lai & Mercer, 1991), CRATER (cf. McEnery & Oakes, 1995).
• Type B: Monolingual subcorpora designed using the same sampling frame, e.g. The Aarhus corpus of contract law (cf. Faber & Lauridsen, 1991).
• Type C: A combination of A and B, e.g. the ENPC (cf. Johansson & Hofland, 1994), EMILLE.3

Different terms have been used to describe these types of corpora. For Aijmer & Altenberg (1996) and Granger (1996: 38), type A is a translation corpus whereas type B is a parallel corpus; for McEnery & Wilson (1996: 57), Baker (1993: 248, 1995, 1999) and Hunston (2002: 15), type A is a parallel corpus whereas type B is a comparable corpus; and for Johansson & Hofland (1994) and Johansson (1998: 4) the term parallel corpus applies to both types A and B. Barlow (1995, 2000: 110) certainly interpreted a parallel corpus as type A when he developed the ParaConc corpus tool. It is clear that some confusion centres around the term parallel.
When we define different types of corpora, we can use different criteria, for example, the number of languages involved, and the content or the form of the corpus. But when a criterion is decided upon, the same criterion must be used consistently. For example, we can say a corpus is monolingual, bilingual or multilingual if we take the number of languages involved as the criterion for definition. We can also say a corpus is a translation (L2) or a non-translation (L1) corpus if the criterion of corpus content is used. But if

2 It is interesting to note, however, that an earlier corpus-based contrastive study, namely Filipovic (1969), dates back to the 1960s.
3 An introduction to the EMILLE project can be found at the following URL: http://www.emille.lancs.ac.uk.
Parallel and Comparable Corpora
we choose to define corpus types by the criterion of corpus form, we must use it consistently. Then we can say a corpus is parallel if the corpus contains source texts and translations in parallel, or it is a comparable corpus if its subcorpora are comparable as they apply the same sampling frame. It is illogical, however, to refer to corpora of type A as translation corpora by the criterion of content while referring to corpora of type B as comparable corpora by the criterion of form. Consequently, in this paper, we will follow McEnery et al.’s and Baker’s terminology in referring to type A as parallel corpora and type B as comparable corpora. As type C is a mixture of the two, corpora of this type should be referred to as comparable corpora in a strict sense.

A parallel corpus can be defined as a corpus that contains source texts and their translations. Parallel corpora can be bilingual or multilingual. They can be uni-directional (e.g. from English into Chinese or from Chinese into English alone), bi-directional (e.g. containing both English source texts with their Chinese translations as well as Chinese source texts with their English translations), or multi-directional (e.g. the same piece of writing with English, French and German versions). In this sense, texts which are produced simultaneously in different languages (e.g. EU and UN regulations) also belong to the category of parallel corpora (cf. Hunston, 2002: 15).

In contrast, a comparable corpus can be defined as a corpus containing components that are collected using the same sampling frame and similar balance and representativeness (cf. McEnery, 2003: 450), e.g. the same proportions of the texts of the same genres in the same domains in a range of different languages in the same sampling period. However, the subcorpora of a comparable corpus are not translations of each other. Rather, their comparability lies in their shared sampling frame and similar balance.
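The idea of a shared sampling frame can be made concrete with a small check on genre proportions. The following is only a minimal sketch: the function names, the 5% tolerance and the sample figures are assumptions introduced for illustration, and it compares genre proportions alone, leaving out the domain and sampling-period criteria mentioned above.

```python
from collections import Counter

def genre_proportions(texts):
    """Map each genre label to its share of the subcorpus word count."""
    words_per_genre = Counter()
    for genre, n_words in texts:
        words_per_genre[genre] += n_words
    total = sum(words_per_genre.values())
    return {genre: n / total for genre, n in words_per_genre.items()}

def same_sampling_frame(sub_a, sub_b, tolerance=0.05):
    """True if both subcorpora cover the same genres in similar proportions.

    `tolerance` is an arbitrary illustrative threshold, not a standard value.
    """
    prop_a = genre_proportions(sub_a)
    prop_b = genre_proportions(sub_b)
    if set(prop_a) != set(prop_b):
        return False
    return all(abs(prop_a[g] - prop_b[g]) <= tolerance for g in prop_a)

# Hypothetical subcorpora: one (genre, word count) pair per sampled text.
english = [("news", 40_000), ("fiction", 30_000), ("academic", 30_000)]
chinese = [("news", 41_000), ("fiction", 29_000), ("academic", 30_000)]
skewed = [("news", 80_000), ("fiction", 10_000), ("academic", 10_000)]

print(same_sampling_frame(english, chinese))
print(same_sampling_frame(english, skewed))
```

A real comparability check would also need to weigh how genres are defined in each language, which a label match of this kind cannot capture.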
By our definition, corpora containing components of varieties of the same language (e.g. International Corpus of English (ICE)) are not comparable corpora as suggested in the literature (e.g. Hunston, 2002: 15), because all corpora, as a source for linguistic research, have ‘always been pre-eminently suited for comparative studies’ (Aarts, 1998), either intra-lingual or inter-lingual. Brown, LOB, Frown and FLOB are typically designed for comparing language varieties synchronically and diachronically. The British National Corpus (BNC), while designed for representing modern British English, is also a useful basis for various intra-lingual studies (e.g. spoken vs. written, monologue vs. dialogue, and variations caused by socio-economic parameters). Nevertheless, these corpora are generally not referred to as comparable corpora. While parallel and comparable corpora are supposed to be used for different purposes (i.e. translation and contrastive studies respectively, see
section 2), the two are also designed with different focuses. For a comparable corpus, the sampling frame is essential. The components representing the languages involved must match with each other in terms of proportion, genre, domain and sampling period. For a parallel corpus, the sampling frame is irrelevant, because all of the corpus components are exact translations of each other. Once the source texts are selected using a certain sampling frame, it does not apply twice to translations. However, this does not mean that the construction of parallel corpora is easier. For a parallel corpus to be useful, an essential step is to align the source texts and their translations, i.e. to produce a link between the two, at the sentence or word level. Yet the automatic alignment of parallel corpora is not a trivial task for some language pairs (cf. Piao, 2000, 2002). Depending on the specific research question, a specialised (i.e. containing texts of a particular type, e.g. computer manuals) or a general (i.e. balanced, containing as many text types as possible) corpus should be used. Parallel and comparable corpora can be of either type. For terminology extraction, specialised parallel and comparable corpora are clearly of use while for the contrast of general linguistic features such as tense and aspect, balanced corpora are supposed to be more representative of any given language in general. Existing parallel corpora appear to suggest that corpora of this type tend to be specialised (e.g. contract law and genetic engineering). This is quite natural, considering the availability of translated texts by genre (in machine-readable form) in different languages (cf. Johansson & Hofland, 1994: 27; Mauranen, 2002: 166; Aston, 1999), and indeed, as will be seen later in our discussion, specialised parallel corpora can be especially useful in domain-specific translation research. 
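The alignment step described above, in which each source sentence is linked to its translation, can be pictured as a list of sentence pairs that is then searched. This is a minimal sketch which assumes the alignment has already been produced by some other means; the class name, the function name and the English-French sentences are hypothetical, and plain substring matching stands in for proper tokenised search.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AlignedPair:
    """One sentence-level link between a source text and its translation."""
    source: str
    target: str

def bilingual_concordance(pairs, expression):
    """Return every aligned pair whose source sentence contains `expression`.

    Matching is case-insensitive substring search, a deliberate simplification.
    """
    needle = expression.lower()
    return [pair for pair in pairs if needle in pair.source.lower()]

# Hypothetical, already aligned English-French sample (computer-manual style).
corpus = [
    AlignedPair("Store the device in a dry place.",
                "Conservez l'appareil dans un endroit sec."),
    AlignedPair("Switch off the device before cleaning.",
                "Éteignez l'appareil avant de le nettoyer."),
    AlignedPair("Read the manual carefully.",
                "Lisez attentivement le manuel."),
]

for hit in bilingual_concordance(corpus, "device"):
    print(hit.source, "|", hit.target)
```

Producing the links automatically is the hard part for some language pairs, as noted above; the sketch only shows what the links make possible once they exist.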
While most of the existing comparable corpora are also specialised, it is relatively easier to find comparable text types in different languages. Therefore, in relation to parallel corpora, it is more likely for comparable corpora to be designed as general balanced corpora. For instance, as the Korean National Corpus (Park, 2001) and the Chinese National Corpus (Zhou & Yu, 1997) have adopted a sampling frame quite similar to that of the BNC, these corpora can form a balanced comparable corpus that makes contrastive studies of these three languages possible.

Parallel and comparable corpora are used primarily for translation and contrastive studies. The two types of corpora have their own advantages and disadvantages, and thus serve different purposes. While the source and translated texts in a parallel corpus are useful for exploring ‘how the same

4 Readers are advised to refer to Halverson (1998) for an argument for the need for representative parallel corpora.
content is expressed in two languages’ (Aijmer & Altenberg, 1996: 13),5 they alone serve as a poor basis for cross-linguistic contrasts, because translations (i.e. L2 texts) cannot avoid the effect of translationese (cf. Hartmann, 1985; Baker, 1993: 243-5; Teubert, 1996: 247; Gellerstam, 1996; Laviosa, 1997: 315; McEnery & Wilson, 2001: 71-2; McEnery & Xiao, 2002). In contrast, while the components of a comparable corpus overcome translationese by populating the same sampling frame with L1 texts from different languages, they are less useful for the study of how a message is conveyed from one language to another. Also the development of application software for machine-aided and machine translation, while it may be based on comparable data, has clearly benefited from having access to parallel data, for example to bootstrap example-based machine translation systems (see section 2). Nonetheless, comparable corpora are a useful resource for contrastive studies and translation studies when used in combination with parallel corpora. Note, however, that comparable corpora can be a poor basis for contrastive studies if the sampling frames for the comparable corpora are not fully comparable.

In the section that follows, we will illustrate, through examples, the value of corpora, particularly parallel and comparable corpora, to translation and contrastive studies.

2. Corpus-based Translation and Contrastive Studies

As Laviosa (1998a) observes, ‘the corpus-based approach is evolving, through theoretical elaboration and empirical realisation, into a coherent, composite and rich paradigm that addresses a variety of issues pertaining to theory, description, and the practice of translation.’ Corpus-based translation studies fall into two broad areas: theoretical and practical (Hunston, 2002: 123).
In theoretical terms, corpora are used mainly to study the translation process by exploring how an idea in one language is conveyed in another language and by comparing the linguistic features and their frequencies in translated L2 texts and comparable L1 texts. In the practical approach, corpora provide a workbench for training translators and a basis for developing applications like machine translation (MT) and computer-assisted translation (CAT) systems. In this section, we will discuss how corpora have been used in each of these areas.

Parallel corpora are a good basis for studying how an idea in one

5 This view has been challenged recently, however, notably by Mauranen (2002: 167), who argues that interpreting translation as ‘the decoding and re-encoding of fixed contents, which presumably, exist outside languages’ is ‘hardly an adequate view of either language or translation.’ However, if we interpret the relationship between contents and languages as that between meanings (the carried) and forms (the carrier), this view is quite natural.
language is conveyed in another language.6 Xiao & McEnery (2002a), for example, use an English-Chinese parallel corpus containing 100,170 English words and 192,088 Chinese characters to explore how temporal and aspectual meanings in English are expressed in Chinese. In that study, the authors found that while both English and Chinese have a progressive aspect, the progressive has different scopes of meaning in the two languages. In English, while the progressive canonically (93.5%) signals the ongoing nature of a situation (e.g. John is singing, Comrie, 1976: 32), it has a ‘number of other specific uses that do not seem to fit under the general definition of progressiveness’ (Comrie, 1976: 37). These ‘specific uses’ include its use to indicate contingent habitual or iterative situations (e.g. I’m taking dancing lessons this winter, Leech, 1971: 27), to indicate anticipated happenings in the future (e.g. We’re visiting Aunt Rose tomorrow, ibid: 29) and some idiomatic uses to add special emotive effect (e.g. I’m continually forgetting people’s names, ibid) (cf. Leech, 1971: 27-29). In Chinese, however, the progressive marked by zai only corresponds to the first category above, namely, to mark the ongoing nature of dynamic situations. As such, only about 58% of situations referred to by the progressive in the English source data take the progressive or the durative aspect, either marked overtly or covertly, in Chinese translations. The authors also found that the interaction between situation aspect (i.e. the inherent aspectual features of a situation, e.g. whether the situation has a natural final endpoint) and viewpoint aspect (e.g. perfective vs. imperfective) influences a translator’s choice of viewpoint aspect.
Situations with a natural final endpoint (around 65%) and situations incompatible with progressiveness (92.5% of individual-level states and 75.9% of achievements) are more likely to undergo viewpoint aspect shift and be presented perfectively in Chinese translations.7 In contrast, situations without a natural final endpoint are normally translated with the progressive marked by zai or the durative aspect marked by -zhe.

Note, however, that the direction of translation in a parallel corpus is important in studies of this kind. The corpus used in Xiao & McEnery (2002a), for example, is not suitable for studying how aspect markers in Chinese are translated into English. For that purpose, a Chinese-English parallel corpus (i.e. L1 Chinese plus L2 English) is required.

Another problem that arises with the use of a one-to-one parallel corpus (i.e. containing only one version of translation in the target language) is that

6 However, the quality of translation is an important factor which should be taken into serious consideration during corpus construction.
7 Situations, telic, individual-level states and achievements are commonly used terms in aspect theory. Readers can refer to Xiao & McEnery (2002b) for a more elaborate account of situation aspect.
the translation only represents ‘one individual’s introspection, albeit contextually and cotextually informed’ (Malmkjær, 1998). One possible way to overcome this problem, as suggested in Malmkjær, is to include as many versions of a translation of the same source text as possible. While this solution is certainly of benefit to translation studies, it makes the task of building parallel corpora much more difficult. It also reduces the range of data one may include in a parallel corpus, as many translated texts are translated once only. It is typically texts such as literary works where multiple translations of the same work are available. These works tend to be non-contemporary and the different versions of translation are usually spaced decades apart, thus making the comparison of these versions less meaningful.

The distinctive features of translated language can be identified by comparing the translations with comparable L1 texts, thus throwing new light on the translation process and helping to identify translation norms. Laviosa (1998b), for example, in her study of L1 and L2 English narrative prose, finds that translated L2 language has four core patterns of lexical use: a relatively lower proportion of lexical words over function words, a relatively higher proportion of high-frequency words over low-frequency words, relatively greater repetition of the most frequent words, and less variety in the words that are most frequently used. Other studies show that translated language is characterised, beyond the lexical level, by normalization, simplification (Baker, 1993, 1998), explicitation (i.e. increased cohesion, Øverås, 1998) and sanitisation (i.e. reduced connotational meanings, Kenny, 1998).
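Two of the lexical measures behind patterns of the kind Laviosa reports can be sketched in a few lines. The code below is an illustration only: the function-word list is a tiny hypothetical sample, the tokeniser is deliberately naive, and the type-token ratio is used here as a rough stand-in for lexical variety rather than as the measure used in the cited study.

```python
# Tiny illustrative function-word list (an assumption for this sketch);
# a real study would rely on a fuller, principled list or a POS tagger.
FUNCTION_WORDS = {
    "the", "a", "an", "of", "in", "on", "to", "and", "or", "but",
    "is", "are", "was", "were", "it", "that", "this", "with", "for",
}

PUNCTUATION = ".,;:!?\"'()"

def tokenize(text):
    """Lower-case whitespace tokeniser that strips surrounding punctuation."""
    stripped = (raw.strip(PUNCTUATION).lower() for raw in text.split())
    return [token for token in stripped if token]

def lexical_density(tokens):
    """Share of tokens that are lexical (i.e. not function) words."""
    lexical = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(lexical) / len(tokens)

def type_token_ratio(tokens):
    """Distinct word forms divided by running words."""
    return len(set(tokens)) / len(tokens)

sample = "The cat sat on the mat and the dog sat on the rug."
tokens = tokenize(sample)
print(round(lexical_density(tokens), 3), round(type_token_ratio(tokens), 3))
```

Comparing such figures between translated and non-translated texts is only meaningful when the two samples are of similar size and genre, since the type-token ratio in particular is sensitive to text length.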
As these features are regular and typical of translated English, further research based upon these findings may not only uncover the translation norms or what Frawley (1984) calls the ‘third code’ of translation, it will also help translators and trainee translators to become aware of these problems. McEnery & Xiao (2002), on the basis of a specialised English-Chinese parallel corpus of healthcare, found that the ratio of overt/covert marking of aspectual meanings was exceptionally low in Chinese translations. As Chinese is recognised as an aspect language (cf. Xiao & McEnery, forthcoming), the authors hypothesised that the low frequency of aspect markers was atypical of the target L1 language and was attributable to the translated nature of the data in this case. To test this hypothesis, they constructed a comparable L1 Chinese corpus using the same sampling frame and compared the frequencies of two well-established perfective aspect markers in the two datasets, namely, the translated Chinese and L1 Chinese. The experiment showed that in the translated Chinese, the two aspect markers occurred 27.32 times per 10,000 words whereas they occurred 62.33 times per 10,000 words in the comparable L1 Chinese data. A cross-tabulation
between the word numbers and actual frequency counts showed a log-likelihood ratio of 49.1 for 2 degrees of freedom, which is statistically significant at the level p<0.001. Therefore, the authors’ null hypothesis that the difference in frequencies of aspect markers in the two datasets existed by chance was rejected and they were able to claim that translated Chinese is indeed different from L1 Chinese in terms of aspect marking. The above studies show that translated language is translationese. The effect of source language on the translations is strong enough to make the L2 data perceptibly different from the target L1 Chinese. As such, a uni-directional parallel corpus is a poor basis for cross-linguistic contrast. This problem, however, can be alleviated by a bi-directional parallel corpus (e.g. Maia, 1998; Ebeling, 1998), because the effect of translationese is averaged out to some extent. In this sense, a well-matched bi-directional parallel corpus can become the bridge that brings translation and contrastive studies together. To achieve this aim, however, the same sampling frame must apply to the selection of source data in both languages. Any mismatch of proportion, genre, or domain, for example, may invalidate the findings derived from such a corpus. While we know that translated language is distinct from the target L1 language, it has been claimed recently that parallel corpora represent a sound basis for contrastive studies. 
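The log-likelihood comparison of aspect-marker frequencies reported above for McEnery & Xiao (2002) can be illustrated with a short worked sketch. The source gives the normalised frequencies and the resulting statistic (49.1 at 2 degrees of freedom) but not the corpus sizes, so the counts below are invented purely for illustration and are not the study's data. The table layout (two corpora by two markers plus all other words) is also an assumption about how 2 degrees of freedom could arise.

```python
import math

def per_10k(count, corpus_size):
    """Normalised frequency: occurrences per 10,000 words."""
    return count / corpus_size * 10_000

def log_likelihood(table):
    """Log-likelihood ratio (G2) for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / total
                g2 += observed * math.log(observed / expected)
    return 2 * g2

def degrees_of_freedom(table):
    """(rows - 1) * (columns - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)

def p_value_df2(g2):
    """Chi-square tail probability; this closed form holds only for 2 df."""
    return math.exp(-g2 / 2)

# Invented counts. Rows: translated Chinese, L1 Chinese.
# Columns: aspect marker 1, aspect marker 2, all other words.
hypothetical = [
    [30, 25, 19_945],
    [70, 55, 19_875],
]

g2 = log_likelihood(hypothetical)
print(per_10k(30 + 25, 20_000), per_10k(70 + 55, 20_000))
print(g2, degrees_of_freedom(hypothetical), p_value_df2(g2) < 0.001)
```

For 2 degrees of freedom the critical value at p<0.001 is about 13.82, so a statistic of 49.1 as reported lies well beyond it.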
James (1980: 178), for example, argues that ‘translation equivalence is the best available basis of comparison’ while Santos (1996: i) claims that ‘studies based on real translations are the only sound method for contrastive analysis.’ Mauranen (2002: 166) also argues, though not as strongly as James and Santos, that translated language, in spite of its special features, ‘is part of natural language in use, and should be treated accordingly’, because languages ‘influence each other in many ways other than through translation’ (ibid: 165). While we agree with Mauranen that ‘translations deserve to be investigated in their own right’, as is done in Laviosa (1998b) and McEnery & Xiao (2002), we hold a different view of the value of parallel corpora for contrastive studies. It is true that languages in contact can influence each other, but this influence is different from the influence of a source language on translations in respect to immediacy and scope. Basically, the influence of languages in contact is generally gradual (or evolutionary) and less systematic than the influence of a source language on the translated language. As such, translated language is at best an unrepresentative special variant of the target language. If this special variant is confused with the target L1 language and serves alone as the basis for contrastive studies, the results are clearly misleading to teachers and students of second languages, because contrastive studies are ‘typically geared towards second language teaching and learning’ (Teich, 2002: 188). Using
parallel corpora alone, for example, McEnery & Xiao (2002) would have come to the misleading conclusion that aspect markers occurred only infrequently in Chinese. As Chinese as an aspect language relies heavily on aspect to encode temporal information, which is different from English which encodes both tense and aspect, this false conclusion would inevitably have an adverse effect on Chinese learners of English. Parallel corpora can serve as a useful starting point for cross-linguistic contrasts because findings based on parallel corpora invite ‘further research with monolingual corpora in both languages’ (Mauranen, 2002: 182). In this sense, parallel corpora are ‘indispensable’ to contrastive studies (ibid). Based on the preliminary findings in McEnery & Xiao (2002) and Xiao & McEnery (2002a), we have initiated an ESRC-funded project on contrasting tense and aspect in English and Chinese on the basis of two one-million-word L1 corpora of the two languages. With reference to practical translation studies, as corpora can be used to raise linguistic and cultural awareness in general (cf. Hunston, 2002: 123; Bernardini, 1997), they provide a useful and effective reference tool and a workbench for translators and trainees. In this respect even a monolingual corpus is helpful. Bowker (1998), for example, found that corpus-aided translations were of a higher quality with respect to subject field understanding, correct term choice and idiomatic expressions than those undertaken using conventional resources. Bernardini (1997) also suggests that traditional translation teaching should be complemented with LCC (large corpora concordancing) so that trainees develop ‘awareness’, ‘reflectiveness’ and ‘resourcefulness’, the skills that ‘distinguish a translator from those unskilled amateurs.’ In comparison to monolingual corpora, comparable corpora are more useful for translation studies. 
Zanettin (1998) demonstrates that small comparable corpora can be used to devise a ‘translator training workshop’ designed to improve students’ understanding of the source texts and their ability to produce translations in the target language more fluently. In this respect, specialised comparable corpora are particularly helpful for highly domain-specific translation tasks, because when translating texts of this type, as Friedbichler & Friedbichler (1997) observe, ‘the translator is dealing with a language which is often just as disparate from his/her native language as any foreign tongue.’ Several studies show that translators with access to a comparable corpus with which to check translation problems ‘are able to enhance their productivity and tend to make fewer mistakes’ (ibid) when translating into their native language. When translation is from a mother tongue into a foreign language, ‘the need for corpus tools grows exponentially and goes far beyond checking grey spots in L1 language
competence against the evidence of a large corpus’ (ibid). For example, Gavioli & Zanettin (1997) demonstrate how a very specialised corpus of texts on the subject of hepatitis helps to confirm translation hypotheses and suggest possible solutions to problems related to domain-specific translation.

While monolingual and comparable corpora are of use to translation, it is difficult to generate ‘possible hypotheses as to translations’ with such data (Aston, 1999). Furthermore, verifying concordances is both time-consuming and error-prone, which entails a loss of productivity. Parallel corpora, in contrast, provide ‘[g]reater certainty as to the equivalence of particular expressions’, and in combination with suitable tools (e.g. ParaConc), they enable users to ‘locate all the occurrences of any expression along with the corresponding sentences in the other language’ (ibid). As such, parallel corpora can help translators and trainees to achieve improved precision with respect to terminology and phraseology and have been strongly recommended for these reasons (e.g. Williams, 1996).

A special use of a parallel corpus with one source text and many translations is that it can offer a systematic translation strategy for linguistic structures which have no direct equivalents in the target language. Buyse (1997), for example, presents a case study of the Spanish translation of the French clitics en and y, where the author illustrates how a solution is offered by a quantitative analysis of the phonetic, prosodic, morphological, semantic and discursive features of these structures in a representative parallel corpus, combined with the quantitative analysis of these structures in a comparable corpus of L1 target language.

Another issue related to translator training is translation evaluation.
Bowker (2001) shows that an evaluation corpus, which is composed of a parallel corpus and comparable corpora of source and target languages, can help translator trainers to evaluate student translations and provide more objective feedback.

Finally, in addition to providing assistance to human translators, parallel corpora constitute a unique resource for the development of machine translation (MT) systems. Starting in the 1990s, the established methodologies, notably the linguistic rule-based approach to machine translation, have been challenged and enriched by an approach based on parallel corpora (cf. Hutchins, 2003: 511; Somers, 2003: 513). The new approaches, such as example-based MT (EBMT) and statistical MT, are based on parallel corpora. With EBMT, for example, a new input is matched against the database of already translated texts to extract suitable examples which are then combined to generate the correct translation (see Somers: ibid). As well as automatic MT systems, parallel corpora have also been used to develop computer-assisted translation (CAT) tools for human translators, such as translation memories (TM), bilingual concordances and translator-oriented word processors (cf. Somers, 2003; Wu, 2002).

3. Conclusion

In this chapter, we first clarified the confusion surrounding the terminology related to multilingual corpora. It was argued that consistent criteria should be applied in defining types of corpora. For us this means that parallel corpora refer to those that contain collections of L1 texts and their translations while comparable corpora refer to those that contain matched L1 samples from different languages. The main concern of this chapter was the potential value of parallel and comparable corpora to translation and contrastive studies.8 We maintain that while parallel corpora are well-suited to research and teaching in translation studies, they provide a poor basis for cross-linguistic contrasts if used as the sole source of data. They should most often be used in conjunction with L1 target and source corpora. These L1 target and source corpora may or may not be comparable. Parallel corpora are undoubtedly a useful starting point for contrastive research, however, and may lead to further research in contrastive studies based upon comparable corpora. In contrast, comparable corpora used alone are less useful for translation studies. Nonetheless, they certainly serve as a reliable basis for contrastive studies. It appears then that a carefully matched bi-directional parallel corpus provides a sound basis for both translation and contrastive studies. Yet the ideal bi-directional parallel corpus will often not be easy, or even possible, to build because of the heterogeneous pattern of translation between languages and genres. So we must accept that, for practical reasons alone, we will often be working with corpora that, while they are useful, are not ideal for either translation or contrastive studies. In this chapter, we also discussed the pros and cons of the use of different types of corpora in translation and contrastive studies and evaluated proposals for possible solutions to related problems.
It is our belief that as the number of parallel and comparable corpora grows, the corpus-based paradigm will soon enter the mainstream of translation and contrastive studies.

Acknowledgements

This work is supported in part by a grant (Reference No. RES-000220135) from the Economic and Social Research Council (ESRC).

8 Apart from translation and contrastive studies, Botley, McEnery & Wilson (2000) give a fine account of other potential uses of parallel and comparable corpora.
References

Aarts, J. (1998) Introduction. In S. Johansson & S. Oksefjell (eds.) Corpora and cross-linguistic research (pp. ix-xiv). Amsterdam: Rodopi.
Aston, G. (1999) Corpus use and learning to translate. Textus 12, 289-314. Available online at http://www.sslmit.uniho.it/guy/textus.htm.
Baker, M. (1993) Corpus linguistics and translation studies: implications and applications. In M. Baker, G. Francis & E. Tognini-Bonelli (eds.) Text and technology: in honour of John Sinclair (pp. 233-52). Amsterdam: Benjamins.
Baker, M. (1995) Corpora in translation studies: an overview and some suggestions for future research. Target 7, 223-243.
Baker, M. (1999) The role of corpora in investigating the linguistic behaviour of professional translators. International Journal of Corpus Linguistics 4, 281-98.
Barlow, M. (1995) A guide to ParaConc. Houston: Athelstan.
Barlow, M. (2000) Parallel texts and language teaching. In S. Botley, A. McEnery & A. Wilson (eds.) Multilingual corpora in teaching and research (pp. 106-15). Amsterdam: Rodopi.
Bernardini, S. (1997) A ‘trainee’ translator’s perspective on corpora. Paper presented at Corpus use and learning to translate held at Bertinoro, Nov. 1997. Available online at URL http://www.sslmit.unibo.it/introduz.htm.
Botley, S., McEnery, A. and Wilson, A. (2000) Multilingual corpora in teaching and research. Amsterdam: Rodopi.
Bowker, L. (1998) Using specialised native-language corpora as a translation resource: a pilot study. Meta 43(4). Available online at the following URL: http://www.erudit.org/meta/1998/v43/n4/index.html.
Bowker, L. (2001) Towards a methodology for a corpus-based approach to translation evaluation. Meta 46(2), 345-64.
Brown, P., Lai, J. and Mercer, R. (1991) Aligning sentences in parallel corpora. In 29th Annual Meeting of the Association for Computational Linguistics (pp. 169-176). Berkeley, CA.
Buyse, K. (1997) The study of multi- and unilingual corpora as a tool for the development of translation studies: a case study. Paper presented at Corpus use and learning to translate held at Bertinoro, Nov. 1997.
Comrie, B. (1976) Aspect. Cambridge: Cambridge University Press.
Ebeling, J. (1998) Contrastive linguistics, translation, and parallel corpora. Meta 43(4).
Faber, D. and Lauridsen, K. (1991) The compilation of a Danish-English-French corpus in contract law. In S. Johansson & A-B. Stenström (eds.) English computer corpora. Selected papers and research guide (pp. 235-43). Berlin: Mouton de Gruyter.
Filipovic, R. (1969) The choice of the corpus for the contrastive analysis of Serbo-Croatian and English. In The Yugoslav Serbo-Croatian-English contrastive Project B Studies 1 (pp. 37-46). Institute of Linguistics, University of Zagreb.
Frawley, W. (1984) Prolegomenon to a theory of translation. In W. Frawley (ed.) Translation: Literary, linguistic and philosophical perspectives (pp. 159-75). London & Toronto: Associated University Press.
Friedbichler, I. and Friedbichler, M. (1997) The potential of domain-specific target-language corpora for the translator’s workbench. Paper presented at Corpus use and learning to translate held at Bertinoro, Nov. 1997.
Gavioli, L. and Zanettin, F. (1997) Comparable corpora and translation: a pedagogic perspective. Paper presented at Corpus use and learning to translate held at Bertinoro, Nov. 1997.
Gellerstam, M. (1996) Translations as a source for cross-linguistic studies. In K. Aijmer, B. Altenberg and M. Johansson (eds.) Language in contrast: papers from a symposium on text-based cross-linguistic studies, Lund, March 1994 (pp. 53-62). Lund: Lund University Press.
Granger, S. (1996) From CA to CIA and back: an integrated approach to computerised bilingual and learner corpora. In K. Aijmer, B. Altenberg and M. Johansson (eds.) Language in contrast: papers from a symposium on text-based cross-linguistic studies, Lund, March 1994 (pp. 38-51). Lund: Lund University Press.
Halverson, S. (1998) Translation studies and representative corpora: establishing links between translation corpora, theoretical/descriptive categories and a conception of the object of study. Meta 43(4).
Hartmann, R. (1985) Contrastive textology. Language and Communication 5, 25-37.
Hunston, S. (2002) Corpora in applied linguistics. Cambridge: Cambridge University Press.
Hutchins, J. (2003) Machine translation: general overview. In R. Mitkov (ed.) Oxford handbook of computational linguistics (pp. 501-11). Oxford: Oxford University Press.
James, C. (1980) Contrastive analysis. London: Longman.
Johansson, S. (1998) On the role of corpora in cross-linguistic research. In S. Johansson & S. Oksefjell (eds.) Corpora and cross-linguistic research: Theory, method, and case studies (pp. 3-25). Amsterdam: Rodopi.
Johansson, S., Ebeling, G. and Hofland, K. (1996) Coding and aligning the English-Norwegian parallel corpus. In K. Aijmer, B. Altenberg and M. Johansson (eds.) Language in contrast: papers from a symposium on text-based cross-linguistic studies, Lund, March 1994 (pp. 87-112). Lund: Lund University Press.
Johansson, S. and Hofland, K. (1994) Towards an English-Norwegian parallel corpus. In U. Fries, G. Tottie and P. Schneider (eds.) Creating and using English language corpora (pp. 25-37). Amsterdam: Rodopi.
Johansson, S. and Oksefjell, S. (1998) Corpora and cross-linguistic research: Theory, method, and case studies. Amsterdam: Rodopi.
Kenny, D. (1998) Creatures of habit? What translators usually do with words? Meta 43(4).
Laviosa, S. (1997) How comparable can ‘comparable corpora’ be? Target 9, 289-319.
Laviosa, S. (1998a) The corpus-based approach: a new paradigm in translation studies. Meta 43(4).
Laviosa, S. (1998b) Core patterns of lexical use in a comparable corpus of English narrative prose. Meta 43(4).
Leech, G. (1971) Meaning and the English verb. London: Longman.
Maia, B. (1998) Word order and the first person singular in Portuguese and English. Meta 43(4).
Malmkjær, K. (1998) Love thy neighbour: will parallel corpora endear linguists to translators? Meta 43(4).
Mauranen, A. (2002) Will ‘translationese’ ruin a contrastive study? Languages in Contrast 2(2), 161-86.
McEnery, A. (2003) Corpus linguistics. In R. Mitkov (ed.) Oxford handbook of computational linguistics (pp. 448-63). Oxford: Oxford University Press.
McEnery, A. and Oakes, M. (1995) Sentence and word alignment in the CRATER project: methods and assessment. In S. Warwick-Armstrong (ed.) Proceedings of the Association for Computational Linguistics Workshop SIG-DAT Workshop. Dublin.
McEnery, A. and Wilson, A. (1996) Corpus linguistics (1st edition). Edinburgh: Edinburgh University Press.
McEnery, A. and Wilson, A. (2001) Corpus linguistics (2nd edition). Edinburgh: Edinburgh University Press.
McEnery, A. and Xiao, Z. (2002) Domains, text types, aspect marking and English-Chinese translation. Journal of Languages in Contrast 2(2), 211-31.
Øverås, S. (1998) In search of the third code: An investigation of norms in literary translation. Meta 43(4).
Park, B. (2001). Introducing Korean National Corpus.
Talk presented at Corpus Research Group, Lancaster University, Nov. 19th, 2001. See also http://www.sejong.or.kr/english/index.html Pearson, J. (1998) Terms in context. Amsterdam: Benjamins. Piao, S. (2000) Sentence and word alignment between Chinese and English.
Parallel and Comparable Corpora
145
PhD thesis. Lancaster university. Piao, S. (2002) Word alignment in English-Chinese parallel corpora. Literary and Linguistic Computing 17(2), 207-30. Santos, D. (1996). Tense and aspect in English and Portuguese: a contrastive semantical study. PhD thesis. Universidade Tecnica de Lisboa. Somers, H. (2003) Machine translation: latest developments. In R. Mitkov (ed.) Oxford handbook of computational linguistics (pp. 512-28). Oxford: Oxford University Press. Teich, E. (2002) System-oriented and text-oriented comparative linguistic research: cross-linguistic variation in translation. Languages in Contrast 2(2), 187-210. Teubert, W. (1996) Comparable or parallel corpora? International Journal of Lexicography 9(3), 238-64. Williams, A. (1996) A translator’s reference needs: dictionaries or parallel texts. Target 8(2), 277-99. Wu, D. (2002) Conception and application of computer-assisted translation. Paper presented at International Symposium on Contrastive and Translation Studies between Chinese and English. Shanghai. August 2002. Xiao, Z. and McEnery, A. (2002a) A corpus-based approach to tense and aspect in English-Chinese translation. Paper presented at International Symposium on Contrastive and Translation Studies between Chinese and English. Shanghai 2002. Xiao, Z. and McEnery, A. (2002b) Situation aspect as a universal aspect: implications for artificial languages. Journal of Universal Language 3(2), 139-77. Xiao, Z. and McEnery, A. (2004) Aspect in Mandarin Chinese. Amsterdam: John Benjamins. Zanettin, F. (1998) Bilingual comparable corpora and the training of translators. Meta 43(4). Zhou, Q. and Yu, S. (1997) Annotating the Contemporary Chinese Corpus. International Journal of Corpus Linguistics 2(2), 239-58.
146
Tony McENERY and Zhonghua XIAO
First Language & Second Language Writing Development of Elementary Students
―Two Perspectives―

Randi REPPEN

Much of what is known today about the writing development of elementary students is based on work done in the 1960s and 1970s by Hunt (1965) and Loban (1963, 1976). These two researchers systematically collected student writing from across a range of grades and described some of the linguistic characteristics found in the students’ writing. Since then, most studies of student writing have followed one of three approaches: exploring one type of writing across several grades (e.g., Crowhurst 1983, 1987, 1990); selecting a few linguistic features across a range of writing tasks (e.g., Poynton 1986); or analyzing one student’s writing in detail (e.g., Baghban 1984; Bissex 1980). These types of studies provide valuable insights into certain aspects of student language development, but they do not include a detailed investigation of a large number of linguistic features across a range of grades while systematically controlling for topic. In addition, most studies of elementary student language do not explore the linguistic development of both native and non-native English speaking students writing in English.

This chapter explores the writing development of third through sixth grade students through two different approaches: a quantitative, corpus-based linguistic comparison of first and second language English students; and a detailed qualitative linguistic case study of one English as a second language (ESL) student over a two-year period. By using both quantitative and qualitative approaches, it is possible to determine whether there are patterns of linguistic development that can be seen both across a large collection of student writing and across a set of essays from one student over a two-year period. The goal of this chapter is to provide a snapshot of the linguistic changes that take place between third and sixth grade in the writing of both native and non-native English speakers.

The texts used in this chapter are a subset from a larger six-year project (Reppen 2001a, 2001b) that involved students in grades three (age 8 or 9) through six (age 11 or 12) in twenty-six classrooms from across Arizona in the southwestern United States. The students in this study range in age from 8 to 13 years old and speak English, Spanish, or Navajo as a first language. The students in participating classrooms wrote in-class essays each month on topics designed to elicit different types of writing (e.g., description, narrative, taking a position). The topics, all similar to writing tasks that elementary students regularly encounter, are provided in the appendix. Each month, all of the classes wrote on the same topic, which allows accurate comparisons to be made across grades and language groups while controlling for the effect of topic. The type of writing being elicited was also cycled across the school year so that, for example, the narrative topics did not all occur at the beginning or end of the year.

This collection of student writing provides a valuable resource for investigating many important questions about first language (L1) and second language (L2) English student writing development. In the first part of this chapter the corpus of student writing is used to describe the linguistic changes that occur in the writing of third and sixth grade students from two different first language groups: English and Navajo. The linguistic development of these two L1 groups is described using Multi-Dimensional analysis (Biber 1988; Conrad and Biber 2001; Reppen 2001a). In the second portion of the chapter, rather than looking at a large number of essays and describing the linguistic changes of two groups of students across two grades, the focus is a case study of one L1 Spanish student writing in fourth and fifth grade. These two approaches provide both a macro and a micro perspective on the writing development of first and second language elementary students.

1. A corpus-based look at writing development

Technological advances over the last fifteen years have opened avenues of research that previously were not available. One such avenue is corpus linguistics. The first portion of this chapter, the quantitative portion, uses corpus linguistic methods to create a snapshot of the linguistic features in this corpus. Corpus linguistics depends on large collections of natural texts (here, the students’ essays) and on computers to help with the analysis of those texts. By using computers, it is possible to keep track of many more linguistic features, and with much greater accuracy, than was previously possible in studies conducted by hand. For example, the study described in this chapter identifies and counts over seventy linguistic features in over 900 essays. Without the use of computers, this task would be impossible. Imagine trying to accurately count the nouns, prepositions, relative clauses, and complement clauses on two pages of this book, let alone seventy features across over 900 texts. For further information about corpus linguistics see Biber, Conrad & Reppen (1998); Meyer (2003); or Reppen & Simpson (2002).
1.1. A description of the corpus and analysis

To describe the linguistic changes that occur between third and sixth grade, a baseline is necessary. A baseline provides a description of the linguistic characteristics against which other groups can be compared. Since by fifth grade students have moved from ‘learning how to read and write’ to ‘using reading and writing to learn’, fifth grade was selected as the baseline for anchoring comparisons. The corpus described above was used to create a ‘Model of fifth grade student language’ using Multi-Dimensional analysis (Biber 1988; Reppen 2001a).

Multi-Dimensional analysis uses factor analysis to identify groups of linguistic features that co-occur. The co-occurring features reflect functions or production features in the texts. The factors are then interpreted by looking at the linguistic features and at how texts are distributed across the factors. Once the factors have been interpreted, they are given descriptive labels and are called Dimensions. Other texts can then be compared across the dimensions that have been identified, thus providing a principled means for comparing groups of texts.

The first step in the process involved compiling a corpus, or collection of texts (both spoken and written), that was representative of the ‘world’ of fifth grade language. Table 1 shows the corpus that was used in this study to represent fifth grade student language. The corpus is composed of two types of texts: those produced by fifth grade students (in-class essays and spoken texts) and those produced for fifth grade students (e.g., textbooks, fiction). All of the texts included in the corpus were produced by native English speakers. The fiction texts were selected from books that commonly appear on fifth grade reading lists and are frequently read by fifth graders. Although the corpus of fifth grade student language is small in comparison to many corpora now used in studies of adult language, it is, to my knowledge, the most representative corpus of fifth grade student language to date.

Table 1. Corpus of fifth grade student language

Type of text                     Number of texts   Approximate number of words
Texts published for students:
  Basal readers                   23                5,000
  Science textbooks               25                5,000
  Social studies                  24                5,000
  Fiction                         53               16,000
Texts written by students:
  Student writing                179               14,000
Spoken texts:
  Interaction                     21                9,000
  Monologue                       14                8,000
Totals                           339               62,000

Basal readers: Holt, HBJ, Ginn, Scott Foresman, Houghton Mifflin
Science textbooks: Holt, Heath, HBJ, Scott Foresman, Silver Burdett
Social studies textbooks: American Book Co., Holt, Houghton Mifflin, Macmillan (1978 & 1982)
Children’s fiction: Cornell Corpus (Hays 1988) from CHILDES
Student writing: Pinedale and Sundance fifth grade students (only texts > 60 words)
Spoken texts: Carterette & Jones 1974; Hicks 1990 (from CHILDES)
Because the goal of this portion of the chapter is a quantitative linguistic description, all the texts in the corpus were tagged for over 70 linguistic features using a tagging program developed by Biber (1988). This program assigns each word in a text to a grammatical category (Biber 1988; Biber, Conrad & Reppen 1998). An example of a tagged text is provided below in Figure 1 with the tags glossed in parentheses after the tag codes. The words in parentheses do not appear in the original tagged file, but have been added to make the tags more transparent for the reader.

I        ^pp1a+pp1+++       (first person pronoun)
would    ^md+prd+++         (modal)
be       ^vb+be+vrb++       (verb, BE as main verb)
Steve    ^np++++            (proper noun)
Martin   ^np++++            (proper noun)
because  ^cs+cos+++         (clausal subordinator, causal)
he       ^pp3a+pp3+++       (third person pronoun)
's       ^vbz+bez+vrb++0    (contracted verb, main verb BE)
funny    ^jj+pred+++        (predicative adjective)
.        ^.+clp+++          (final punctuation; marks the end of a T-unit)
and      ^cc+cls+++         (coordinating conjunction)
I        ^pp1a+pp1+++       (first person pronoun)
think    ^vb+vprv+tht0++    (private verb with omitted THAT complement clause)
he       ^pp3a+pp3+++       (third person pronoun)
is       ^vbz+bez+vrb++     (verb, BE as main verb)
a        ^at++++            (article)
very     ^ql+amp+++         (amplifier)
nice     ^jj+atrb+++        (attributive adjective)
man      ^nn++++            (noun)

Figure 1. Example of a tagged text
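To make the tag format in Figure 1 more concrete, the following is a minimal Python sketch of how such a file could be read and its tags counted. It is not Biber's tagging program or Fixtag. The one-token-per-line layout, the function name `parse_tagged`, and the choice to treat the first `+`-separated field as the main category are assumptions made for illustration only.

```python
from collections import Counter

# Tagged lines copied from Figure 1. One token per line is an assumed layout.
TAGGED = """\
I ^pp1a+pp1+++
would ^md+prd+++
be ^vb+be+vrb++
Steve ^np++++
Martin ^np++++
because ^cs+cos+++
he ^pp3a+pp3+++
's ^vbz+bez+vrb++0
funny ^jj+pred+++
. ^.+clp+++
and ^cc+cls+++
I ^pp1a+pp1+++
think ^vb+vprv+tht0++
he ^pp3a+pp3+++
is ^vbz+bez+vrb++
a ^at++++
very ^ql+amp+++
nice ^jj+atrb+++
man ^nn++++
"""


def parse_tagged(text):
    """Return (word, fields) pairs; fields are the non-empty '+'-separated tag parts."""
    tokens = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        word, tag = line.split(None, 1)
        fields = [f for f in tag.strip().lstrip("^").split("+") if f]
        tokens.append((word, fields))
    return tokens


tokens = parse_tagged(TAGGED)
# Count tokens by the first tag field (assumed here to be the main category).
main_tag_counts = Counter(fields[0] for _, fields in tokens)
print(main_tag_counts.most_common())
```

Counts of this kind, taken over every text in a corpus, are the raw material for the feature frequencies discussed in the rest of this section.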
After all the texts were tagged, they were edited using an interactive editor called ‘Fixtag’. This program works much like a spell checker, calling to the screen words that are problematic (e.g., words that require distinguishing between some types of relative and complement clauses) or words that have ambiguous tags (marked by the tagging program with a tag and ?? so that they will be called to the Fixtag screen). Since Biber’s tagging program was originally designed for and used with texts produced by adults, not texts produced by children, it was essential to establish the accuracy of the program on texts produced by children. The accuracy for the features checked was well over 95% (Reppen 1994).

1.2. Multi-Dimensional analysis

After the texts have been tagged and fixtagged, a factor analysis is used to identify linguistic features that have strong co-occurrence patterns. The groups of linguistic features that were identified through the factor analysis on the corpus of fifth grade student language are shown in Table 2. It is important to understand that over 70 linguistic features were tagged in all of the texts and that a factor analysis was then performed to identify which features co-occurred. The features that were identified as having a statistically significant co-occurrence ‘load’ onto the factors. In this analysis, five factors were identified. The five factors or ‘groupings’ of linguistic features have a complementary distribution pattern. That is, a text with a high number of features from the positive end of a factor will have a low number of features from the negative end.

Table 2.
Summary of the linguistic features for each factor in the model of student language. (Features in parentheses are not used to compute dimension scores.) Dimension 1 Dimension 2 nouns once occurring words word length type/token ratio nominalizations past tense passives public verbs attributive adj. (perfect aspect) prepositions (verb complements) (type/token ratio) (once occurring words) present tense verbs infinitives initial ANDs other initial words time adverbials third person pronouns (adverbials) (IT pronouns)
Dimension 3 causal subordination THAT deletion private verbs DO as pro verb
(first person pronouns) (participials)
contractions (present tense) (first person pronouns) (necessity modals) (general emphatics) (time adverbials)
Dimension 4 prediction modals BE as main verb
Dimension 5 second person pronouns conditional subordination possibility modals
no negative features (first person pronouns)

Dimension 1: Edited informational discourse versus On-line informational discourse
Dimension 2: Lexically elaborated narrative versus Non-narrative
Dimension 3: Involved personal discourse versus Non-personal uninvolved discourse
Dimension 4: Projected scenario
Dimension 5: Other-directed idea justification/explanation
After the five factors were identified, they were assigned descriptive labels through a process of looking at the linguistic features that co-occurred and the distribution of the texts in that factor. Once labeled, the factors are referred to as Dimensions. These groupings of linguistic features reflect different functions being accomplished by the texts (e.g., interactiveness, narration, etc.). This type of analysis characterizes texts by co-occurrences of linguistic features along different dimensions, hence the name Multi-Dimensional Analysis (Biber 1988; Biber, Conrad & Reppen 1998). Using the linguistic features, Dimension scores can be computed for each text and then mean scores can be computed for groups of texts. These scores are then used to plot the texts along the dimensions.

Of the five dimensions that were identified through the Multi-Dimensional analysis, Dimension 1, ‘Edited informational vs. on-line informational discourse’, typifies the oral-literate continuum. Therefore, this dimension will be used as an anchor to describe the linguistic development of L1 and L2 English third and sixth grade students.

1.3. A detailed look at Dimension 1

The linguistic features associated with the positive end of Dimension 1, such as nouns, long words, nominalizations, and passives (see Table 2), are features that reflect careful production circumstances and an informational focus. Texts that characterize the upper end of Dimension 1 are social studies and science textbooks (see Figure 2 below). Figure 2 shows the distribution of texts along Dimension 1. Social studies textbooks have a mean score of 10 for Dimension 1, thus typifying the upper end of that dimension.

[Figure 2: vertical scale for Dimension 1 running from ‘Edited informational discourse’ at the top (10) to ‘On-line informational discourse’ at the bottom (-7), with a break in the scale between 8 and 4. Text groups as plotted, from the top of the scale down: social studies textbooks; science textbooks; basal readers; favorite game essay (6th grade); fiction and TV opinion essay (6th grade); TV opinion essay (3rd grade) and picture series essay (6th grade); favorite game essay (3rd grade); interaction; picture series essay (3rd grade); monologue.]

Figure 2. Distribution of L1 English texts along Dimension 1
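The step from feature counts to the mean scores plotted in Figure 2 can be sketched in Python as follows. The chapter does not spell out the arithmetic, so this sketch assumes the procedure usually associated with Biber (1988): feature rates per 1,000 words are standardized across the corpus, and the standardized values of the negative features are subtracted from those of the positive features. The feature rates, the text labels and the function names are all hypothetical.

```python
from statistics import mean, stdev

# Hypothetical feature rates per 1,000 words; the numbers are invented for illustration.
RATES = {
    "textbook": {"nouns": 280, "passives": 14, "initial_ands": 0, "third_person_pronouns": 10},
    "essay": {"nouns": 200, "passives": 3, "initial_ands": 40, "third_person_pronouns": 60},
    "monologue": {"nouns": 150, "passives": 0, "initial_ands": 90, "third_person_pronouns": 95},
}

# A few of the Dimension 1 features named in the text (not the full feature set).
POSITIVE = ["nouns", "passives"]
NEGATIVE = ["initial_ands", "third_person_pronouns"]


def standardize(values):
    """Convert raw rates to z-scores (mean 0, standard deviation 1)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def dimension_scores(rates, positive, negative):
    """Sum of positive-feature z-scores minus sum of negative-feature z-scores, per text."""
    ids = list(rates)
    z = {
        feature: dict(zip(ids, standardize([rates[i][feature] for i in ids])))
        for feature in positive + negative
    }
    return {
        i: sum(z[f][i] for f in positive) - sum(z[f][i] for f in negative)
        for i in ids
    }


def group_mean(scores, ids):
    """Mean dimension score for a group of texts (e.g., all essays of one grade)."""
    return mean(scores[i] for i in ids)


scores = dimension_scores(RATES, POSITIVE, NEGATIVE)
for text_id, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{text_id:10s} {score:6.2f}")
```

With these invented rates the textbook lands at the positive end, the monologue at the negative end, and the essay slightly below zero, which mirrors the ordering in Figure 2 but says nothing about the actual values.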
Text sample 1, from a social studies textbook, is representative of the upper end of Dimension 1, ‘Edited informational’. The text excerpt below illustrates the use of long words, the frequent use of nouns (bolded) and attributive adjectives (underlined), and passive constructions (underlined and italicized), all characteristics typical of texts with high Dimension 1 scores. These features work together to create a carefully edited text that packages a large amount of information.

(1) Social studies textbook (fifth grade)
Slavery in the Americas grew out of a need for cheap labor. The large farms or plantations that had been founded in the Americas needed many workers. At first the Spanish tried to force American Indians into slavery. When this did not work they brought slaves from Africa. By 1600, there were 40,000 Black slaves in Spanish America. In 1619 the first Africans were brought to English settlements in America.
The linguistic features that characterize the other end of Dimension 1 (e.g., initial ands, discourse particles; see Table 2) reflect on-line production circumstances. Conversation is an example of an on-line production situation. On-line language is produced under time constraints: ideas are often strung together through the use of and or then, and fillers such as “uh” or “um” are used to fill time while the speaker thinks about what to say next. In Biber’s 1988 model and also in studies of conversation (e.g., Tannen 1984, 1989), these on-line features are usually associated with texts that have an interpersonal focus. In the student texts, however, even though the features reflect on-line production, the texts reflect an informational rather than an interpersonal or interactional focus, hence the label ‘On-line Informational Discourse’. As seen in Figure 2, spoken interactions and monologues characterize this end of Dimension 1.

Text sample 2 below is typical of the negative end of Dimension 1, with an absence of the positive features of Dimension 1 (e.g., long words, attributive adjectives) and a high occurrence of negative features, particularly initial ands and then (underlined) and third person pronouns (bolded). The text sample also reflects the informational rather than interpersonal focus of the lower end of Dimension 1. The speaker is focusing on conveying information about the episode rather than on presentation of self.

(2) Spoken monologue (fifth grade)
He’s on the bus. And the balloon is following behind. And he walks. And he gets off the bus. And he catches the balloon. Then he uhm he walks around. And he goes with his mother.
When comparing Text samples 1 and 2, each typical of one end of Dimension 1, note that in the spoken monologue (Text sample 2) there are very few long words, no attributive adjectives, and nouns (i.e., bus, balloon, and mother) only occur five times. In contrast, the social studies text (Text sample 1) has no instances of sentence initial and or then and only one third person pronoun (they).

One of the interesting findings for Dimension 1 is that the texts written by students have an ‘on-line’ characterization; that is, student essays have a slightly negative score on Dimension 1 (see Figure 2) and exhibit many characteristics of on-line production, though not as many as the spoken texts do. The essays are written in much the same way that the students speak. Text sample 3, from a student essay, demonstrates the on-line nature of student writing. Sentence initial and has been underlined and the third person pronouns have been bolded to help with the comparison of texts.

(3) Student essay (fifth grade; L1 English)
One day some kids went to a circus. And a clown was catching plates. And he did not drop one. And then the kids put up the plates where they belonged. And then the kid thought he could do it. but he was not as good as the clown who did it. These were his mom’s plates. And he tried. He got it going. And then he dropped them. And they all broke. His mom came in. She was very mad.
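The two features highlighted in a sample like this one, sentence initial and and third person pronouns, are simple enough to count mechanically. The Python sketch below does so for the opening sentences of the student essay. It works on plain text rather than tagged text, the pronoun list is an assumption (it leaves out it, which Table 2 lists separately as ‘IT pronouns’), and the function names are hypothetical.

```python
import re

# Opening sentences of the student essay in Text sample 3.
EXCERPT = (
    "One day some kids went to a circus. And a clown was catching plates. "
    "And he did not drop one. And then the kids put up the plates where they belonged."
)

# Assumed list of third person pronoun forms; 'it' is deliberately excluded.
THIRD_PERSON = {
    "he", "she", "they", "him", "her", "them",
    "his", "their", "himself", "herself", "themselves",
}


def split_sentences(text):
    """Split on whitespace that follows sentence-final punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def count_initial(text, word="and"):
    """Number of sentences whose first word is `word` (case-insensitive)."""
    return sum(1 for s in split_sentences(text) if s.split()[0].lower() == word)


def count_third_person(text):
    """Number of tokens that are third person pronouns in the assumed list."""
    return sum(1 for w in re.findall(r"[A-Za-z]+", text) if w.lower() in THIRD_PERSON)


print("sentences:", len(split_sentences(EXCERPT)))
print("initial 'and':", count_initial(EXCERPT))
print("third person pronouns:", count_third_person(EXCERPT))
```

In this four-sentence excerpt, three sentences begin with And and there are two third person pronouns (he, they).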
The student essay has a very high number of initial ands and third person pronouns. Even though this is a written text, it has many characteristics of on-line production. In contrast, the social studies text (Text sample 1) has no on-line features, and ideas are well integrated, reflecting the careful production circumstances of an edited informational text. Even a quick reading of the two texts reinforces the ‘literate’ vs. ‘spoken’ feel of these two texts.

2. Writing development of Navajo and English students in grades three and six

This section investigates the linguistic changes that take place between third and sixth grade across two different L1s: English and Navajo. In order to explore the linguistic development that takes place between grades 3 and 6, a subset of essays from the larger corpus was selected. The essays used in this part of the study are from L1 Navajo and L1 English third and sixth graders writing in English. Table 3 shows the number of essays and the total number of words used in this portion of the study. The number of essays collected is about the same across the grades and language groups; however, the number of words produced is not the same and reveals an interesting difference. In third grade the L1 Navajo students write about half as much as their L1 English counterparts; by sixth grade, however, the trend is reversed, with the L1 Navajo students writing almost twice as much as the L1 English sixth graders!
Table 3. Total number of essays and words by grade and L1

Grade   First language   Total # of essays   Total # of words
3rd     English          137                 10,622
6th     English          154                 18,144
3rd     Navajo           125                  5,828
6th     Navajo           147                 23,089
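Because the four groups contributed slightly different numbers of essays, the totals in Table 3 are easier to compare as average words per essay. The short Python sketch below does that division; the dictionary layout and function name are illustrative assumptions, and the figures are those in Table 3.

```python
# (grade, first language) -> (total essays, total words), taken from Table 3.
TABLE_3 = {
    ("3rd", "English"): (137, 10622),
    ("6th", "English"): (154, 18144),
    ("3rd", "Navajo"): (125, 5828),
    ("6th", "Navajo"): (147, 23089),
}


def mean_words_per_essay(table):
    """Average essay length in words for each grade and language group."""
    return {group: words / essays for group, (essays, words) in table.items()}


averages = mean_words_per_essay(TABLE_3)
for (grade, language), value in averages.items():
    print(f"{grade} {language:8s} {value:6.1f} words per essay")
```

On the Table 3 totals this gives roughly 77.5 and 117.8 words per essay for the L1 English third and sixth graders, and 46.6 and 157.1 for the L1 Navajo third and sixth graders.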
Often in child language studies average word length and type/token ratios are reported. Table 4 gives this information for the corpus described in Table 3. These numbers show the same trends across the two language groups without much difference across the grades. In both cases from third to sixth grade the amount of variability in the type/token ratio is reduced (i.e., the standard deviation is less).

Table 4. Average word length and type/token ratio (standard deviation in parentheses)

Grade   First language   Average word length   Type/token ratio
3rd     English          3.10 (0.30)           64.58 (12.27)
6th     English          3.90 (0.25)           76.39 (8.02)
3rd     Navajo           3.00 (0.42)           53.87 (16.46)
6th     Navajo           3.87 (0.25)           74.38 (8.04)
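For readers unfamiliar with the two measures in Table 4, the following Python sketch shows one simple way to compute them for a single text. The tokenization rule and the function names are assumptions, and the chapter does not say how the figures in Table 4 were computed (for example, whether the type/token ratio was taken over a fixed number of words), so this illustrates the measures themselves rather than reproducing the study's procedure. The type/token ratio is expressed as a percentage, which matches the scale of the values in Table 4.

```python
import re

# Two sentences from the third grade essay quoted later in this section.
TEXT = "They like to watch TV all the time. They like to watch too much TV a lot."


def tokenize(text):
    """Lowercase word tokens; letters and straight apostrophes only (an assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())


def average_word_length(tokens):
    """Mean number of characters per token."""
    return sum(len(t) for t in tokens) / len(tokens)


def type_token_ratio(tokens):
    """Distinct word forms (types) as a percentage of running words (tokens)."""
    return 100 * len(set(tokens)) / len(tokens)


tokens = tokenize(TEXT)
print("tokens:", len(tokens), "types:", len(set(tokens)))
print("average word length:", round(average_word_length(tokens), 2))
print("type/token ratio:", round(type_token_ratio(tokens), 2))
```

Repetition lowers the ratio: the 17 running words here contain only 12 distinct forms because they, like, to, watch and TV each occur twice.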
The student texts were tagged and plotted along Dimension 1 by grade and language group using the procedures described in the previous section. Plotting the mean scores of the student texts provides a principled means for comparing across grades and languages to see what linguistic changes take place from third to sixth grade. Of course this cannot capture every change that occurs; it is only one lens through which we can begin to investigate the many changes that occur as students, both first and second language learners, gain control over their written production.

In analyzing the texts, some interesting effects for topic began to arise. It seems that as early as third grade, students are aware of register or genre differences and control language well enough to use their linguistic resources in different ways according to their goals. Of course, this awareness continues to develop and become more sophisticated as students mature, but it is interesting to note that register awareness and some linguistic control are already in place by third grade.

For the purpose of this chapter three different writing tasks (i.e., narration, explanation, and taking a position) were selected to compare across grades and language groups. These essays represent three different types of writing that students are frequently asked to perform. In order to see the changes that take place between third and sixth grade, the mean scores for Dimension 1 for each group of essays for the two language groups were plotted along Dimension 1. The L1 English essays have been plotted in Figure 3 and the L1 Navajo essays are plotted in Figure 4.

[Figure 3: two vertical Dimension 1 scales, one for 3rd grade and one for 6th grade, each running from 2 down to -4. Task labels as plotted, from the top of each scale down: 3rd grade: TV, Game, Picture; 6th grade: Game, TV, Picture.]

Figure 3. Dimension 1 scores of L1 English for three writing tasks

Topics and tasks:
TV: Should kids watch a lot of TV: Why or why not? Task: taking a stand/position
Game: Choose your favorite game or sport. Describe how to play your game or sport to someone who has not played it. Task: explanation
Picture: Picture series. Write about what is happening in the pictures. Write so that someone who has not seen the pictures will know what happened. Task: description

[Figure 4: two vertical Dimension 1 scales, one for 3rd grade and one for 6th grade, each running from 2 down to -6. Task labels as plotted, from the top of each scale down: 3rd grade: Game (between -2 and -3), TV (between -5 and -6); 6th grade: TV, Game, Picture.]

Figure 4. Dimension 1 scores of L1 Navajo for three writing tasks

Topics and tasks:
TV: Should kids watch a lot of TV: Why or why not? Task: taking a stand/position
Game: Choose your favorite game or sport. Describe how to play your game or sport to someone who has not played it. Task: explanation
Picture: Picture series. Write about what is happening in the pictures. Write so that someone who has not seen the pictures will know what happened. Task: description
Figures 3 and 4 show a considerable shift toward the upper end of Dimension 1 for both language groups. It is also interesting to note that for both language groups the ‘Picture’ task elicited the ‘least literate-like’ writing. This is important to consider since writing tests often use picture series as prompts for elementary students. As noted above, the L1 Navajo students make significant upward ‘progress’ along Dimension 1, and for two of the writing tasks, the TV and Picture series tasks, the overall changes toward the upper end of Dimension 1 are much greater than those seen in the L1 English group. Any essay with fewer than 60 words was omitted from this analysis because the linguistic feature counts are not stable. Therefore, the third grade L1 Navajo students’ ‘picture series’ essays were not plotted, since so few essays had more than 60 words, and the Picture essay does not appear on Dimension 1 for this group.

Although the essays written by the sixth grade L1 Navajo students are moving toward the upper end of Dimension 1, the essays are still quite neutral, hovering right around 0 and not showing strong features of either end of the continuum. Two groups of essays written by the sixth grade L1 English students are around 2, showing a trend toward more literate use of language. It is also interesting to note that all of the sixth grade L1 Navajo essays are grouped much more closely than the L1 English essays.

Below are some examples of the students’ essays, one from third grade and two from sixth. There are two types of changes that take place between third and sixth grade. First, the content and organizational changes between the third and sixth grade essays are readily apparent in text samples 3, 4 and 5. At a grammatical level there are also major changes that take place from third to sixth grade as reflected in these essays. Text #3 has one subordinate clause (Yes because they have a TV) and three to clauses (They like to watch TV all the time. They like to watch too much TV a lot. Kids just like to watch TV). The complement clause (Do you think kids watch too much TV?) is copied from the prompt. This is an effective strategy, and it shows an awareness of the need for some type of introduction, unlike several of the students who began essays in response to this prompt with simply ‘yes’ or ‘no’ and no other introduction. However, it is not the same as if the student had produced a complement clause in the introduction without copying it from the prompt.

(3) Student #1823 (third grade; L1 Navajo)
Do you think kids watch too much TV? Yes because they have a TV. They like to watch TV all the time. They like to watch too much TV a lot. Kids just like to watch TV. (37 words)
In addition to being longer, the two sixth grade essays, one by an L1 Navajo student and one by an L1 English student, have a variety of complex structures. In the two examples below the conditional clauses have been italicized, to-clauses have been underlined and italicized, the complement clauses have been bolded, and the subordinate clauses have been underlined.

(4) Student #2296 (sixth grade; L1 Navajo)
If you watch TV too much you can get your brain crazy. And you will be doing drugs and wearing short clothes. When you go to school you won’t have the brain to think after watching all those scary movies. But at the same time it can be good because you can learn a lot like news, 911, and good movies that have no fighting, killing, and sex. Watching the good movies is good. But don’t watch too much of violence. (81 words)

(5) Student #4016 (sixth grade; L1 English)
Do you know that the average kid spends at least 5 hours in front of the television each day. I don’t think kids should watch a lot of TV because it can cause bad eye sight. And it can get as bad as losing the chance of having a baby because it can damage or effect the ovaries in your body. If watching television many hours a day is a habit, then you don’t take the time to do outdoor activities and other things besides TV. (86 words)
Text sample 6 below shows the tremendous variability that can occur within the same grade level. This third grade student is well above the norm for her grade. In addition to using the ‘traditional or conventional’ story opening and closing, this student has good organizational skills and controls a range of complex grammatical structures.

(6) Student #328 (third grade; L1 English)
The Broken heart
Once upon a time about a week ago. I my little old self and my mean old brother we went to the circus. We saw a clown juggling plates. When we went home from the circus my brother had a good idea. He remembered the clown juggling plates. He saw his moms plates. He wasn’t thinking. He grabbed his moms favorite set of dishes and started juggling the plates. He didn’t catch the plates. And they fell down and broke. The maid heard the noise of the dishes breaking. She said when your mother gets home she going to look at this mess. So when the boys mom came home she saw the mess. When she saw the mess she could hear the dishes breaking. Every time she heard the dishes breaking the more heart broken she was. The boys said sorry. Then they saw those where the best and favorite dishes mom had. They said sorry, sorry, sorry, sorry, sorry. Please forgive us Please. OK said their mom and they lived happily ever after. (175 words)
2.1. Summary of quantitative writing development
As both first and second language students move from third to sixth
grade their writing shows some similar trends. For both groups there is an increase in the amount of writing produced in the same amount of time. Both groups also begin to package information in a way that uses more complex linguistic structures such as complement clauses and other types of subordination. By sixth grade, both groups still use many linguistic features associated with on-line or more oral production, such as frequent causal subordination. It seems that both groups are moving toward more literate and decontextualized use of language, but both have a way to go before they are using language that can be considered carefully produced, edited language that reflects control of the linguistic features associated with information packaging and literate language. This analysis shows that tremendous changes take place in linguistic development for both groups, but particularly for the L1 Navajo students, although by sixth grade the L1 Navajo students are still not using the same linguistic resources as their L1 English peers.

3. A qualitative look at writing development: A case study
After taking a predominantly quantitative look at some of the linguistic changes that occur from third to sixth grade by looking at entire classes and using individual student essays to serve as examples of general trends, the final part of this chapter shifts to a different methodology. Through a case study, using a qualitative approach, we will explore the changes that take place in one L1 Spanish student’s writing as he progresses from fourth to fifth grade. Prior to arriving in the US, Carlos (not his real name) was a good student in his home country of Mexico. Carlos had been in the US for three months before beginning school. He had not received any English instruction in Mexico, and in the three months before starting school his only English input came from going to the store and playing with a few English-speaking children.
His daily life was still predominantly in Spanish since he lived with his family in a trailer park whose residents were all Spanish speakers. Carlos had already completed fourth grade in Mexico, but it was decided that fourth grade was the best placement, given the limited ESL resources at the school, where most of the students were L1 English. Additionally, most of the students in fourth grade were Carlos’ age peers, since the school had a ‘pre-first’ grade and almost all students were 7 years old as they entered first grade. Five of the sixteen essays written by Carlos were selected to show the development of his English proficiency across the two-year period from fourth through fifth grade. All of the essays were written in class and were not edited by anyone else. The time allowed for writing each of the essays
was about half an hour. The essay below, Text #7 (Grade 4, October), shows a strong Spanish and oral language influence (e.g., the spelling and the use of ‘y’ for ‘and’). A gloss for Text #7 is provided in square brackets below the original text. This essay provides a baseline for Carlos’ English writing.

(7) Grade 4 October
I thig yes y not.Vat not to much becas den thi not goin to lorn an schol my mom tels my to don’t si to much tive an a agri gus her. (33 words)
[I think yes and no. But not too much because then they not going to learn in school my mom tells me to don’t see too much TV and I agree with her.]
By looking at the next two ‘rule essays’ (Texts #8 and 9), both written in January but a year apart, we see that by fifth grade both spelling and punctuation have become more standardized. The essay written in fourth grade is quite linguistically sophisticated: it has several types of dependent clauses (e.g., an adverbial clause: wen we wet home the snw is tom in to water; a conditional clause: so if I was the tacher I let the kist play in the snow), but the non-standard orthography and lack of punctuation can get in the way for a non-accommodating reader. By fifth grade Carlos’ essay is much more ‘reader-friendly’. It is interesting to note that in Text #8, the fourth grade essay, Carlos makes use of several oral language features such as adverbial clauses and also the oral language use of go as a quotative (i.e., byt wen I git to shcloo the tacher go you ar not goin to play on sonw). However, in the fifth grade essay, Text #9, written one year later, Carlos is relying on dependent clauses that are more commonly associated with written language, including two complement clauses (i.e., I think the rule going to bed at 9:30 is not a good rule; I think the rule cleaning your shoes befor goin in the house) and he does not use any adverbial clauses.

(8) Grade 4 January
The rule playing in snow. Soom times we de dont let us play in the snow wen we wet home the snw is tom in to water so if I was the tacher I let the kist play in the snow. Soom times wen you git up I’m happy byt wen I git to shcloo the tacher go you ar not goin to play on sonw most of the kist ar not happy. (73 words)

(9) Grade 5 January
I think the rule going to bed at 9:30 is not a good rule. Breaking the rule is not fair. If you break the rule you get into big trouble you have to go to bed at 8:00 I think the rule cleaning your shoes befor goin in the house If you break the rule you get into big trouble you have to clean the carpet in Satuday. (68 words)
In Texts #10 and 11, Carlos is responding to a series of pictures. The first essay was written in May, at the end of Carlos’ first year in school. He is able to produce much more language than at the beginning of the school year. In the October essay (Text #7) Carlos produced 33 words in response to the prompt, while in Text #10, at the end of the school year, he wrote 96 words. This reflects Carlos’ increased ability in writing in English, since in the same amount of time he is able to produce almost three times as much text. He is also moving closer to target-like spelling and punctuation, thus making his writing more reader-friendly.

(10) Grade 4 May
Pictures
Picture one. In picture one ther is a clown. His jugeling plaets. He has long pans, ther pink and his hat is yellow. A lots of pepeol a kids a at the circos. Picture two. In picture two two littel kids are in ther house. Ther in the ckichen. Ther giting the mothers plaets. Picture tree. In picture tree one of the littel kids stared to jugel her mother plaets Picture four. In picture four the kid that was jugeling drop the plaets and his mom came real mad because he drop her real new plaets. (96 words)

(11) Grade 5 November
The Brokend Window!
One day three little kids where playing out side the post office. The kids name were Welly, Pat, and Joey. They where playing football. One of the kids said “Look at this.” He catcyed the ball whith his foot. The other kid said “I can do that too.” “Let’s see,” said Pat. “OK” said Welly. He thew the ball up in the air, but he hit the post office window. He broke it. The old men came out of the post office. He chase the kids to the corner, the kids’ mom was there. She said, “What’s goin on.” The old men said, “Your kids broke my window. “Kids your in trouble.” Said his mom. “Yes” said the kids, “You have to pay of your allowance,” his mom said. 2 months past. They paid the old man and they never played in front of the post office. (147 words)
{Drawing of a brick wall with a broken window}
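The length comparisons made in this case study can be checked with a few lines of arithmetic. The sketch below is only an illustration: it uses the word counts reported with Texts #7 to #11, while the dictionary labels and the helper name `length_ratio` are hypothetical conveniences, not part of the original study.

```python
# Word counts as reported with each of Carlos's sample essays in this chapter.
# The labels are illustrative; only the numbers come from the text.
essay_words = {
    "Text 7 (Grade 4, October)": 33,
    "Text 8 (Grade 4, January)": 73,
    "Text 9 (Grade 5, January)": 68,
    "Text 10 (Grade 4, May)": 96,
    "Text 11 (Grade 5, November)": 147,
}


def length_ratio(later: str, earlier: str) -> float:
    """Return how many times longer the `later` essay is than the `earlier` one."""
    return essay_words[later] / essay_words[earlier]


baseline = "Text 7 (Grade 4, October)"
for label in essay_words:
    # Express every essay as a multiple of the October baseline essay.
    print(f"{label}: {essay_words[label]} words, "
          f"{length_ratio(label, baseline):.2f} x baseline")
```

Any pair of essays can be compared in the same way by passing their labels to `length_ratio`; the ratio of Text #10 to Text #7 corresponds to the "almost three times" figure given above.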
In comparing the two texts in response to a picture prompt (Texts 10 and 11), we see that by November of fifth grade, Carlos is already showing greater control over punctuation, spelling, and essay structure. In the fourth grade essay, he relies on the picture frames to provide structure for the essay, while in the fifth grade essay, Carlos simply writes a story based on the scenes. Also, in the fourth grade essay Carlos uses a formula for producing
text. He writes the frame number and then begins each scene description with ‘In picture X…’. This is an effective, but novice writing strategy. However, about six months later, Carlos does not need to rely on this strategy for generating texts in response to a series of pictures. Notice, though, that in Text #11, the fifth grade essay written in November, there are fewer complex linguistic structures. There are no dependent clauses, even though the text (147 words) is about twice as long as the January essay written in grade four (Text #8, 73 words), which has four dependent clauses. This reinforces the observation that picture tasks, since they tend to elicit narrative writing, do not always provide the best indicators of a student’s ability to write complex structures and package information. In this two-year period Carlos shows many of the same linguistic changes that were reflected in the essays in the quantitative portion of the chapter. Carlos’ writing shows a trend toward the use of more complex linguistic structures and much greater control over basic writing conventions.

4. Conclusions
By looking at student writing through both quantitative and qualitative approaches, similar trends emerge. There are many linguistic changes that take place as students move from third to sixth grade. At one level this is expected and not surprising. However, it is important to be able to articulate the areas where changes occur, to identify the patterns that emerge, and to determine whether students from different first languages exhibit different patterns of development. By looking both quantitatively and qualitatively, using a large collection of essays to see general developmental trends that occur between third and sixth grades and looking at an individual learner across a two-year period, it is evident that many linguistic changes occur during elementary school.
It is also evident that L1 and L2 English students show a similar developmental pattern, just beginning from different starting points. By having begun with the quantitative analysis, it is possible to be more confident about the trends shown in the case study since they reflect the same trends seen in the analysis of a large number of essays. It is worth noting that there is still a long way to go before these sixth grade students are ready to handle the linguistic demands of academic writing needed to be successful in college.

References
Baghban, M. 1984. Our daughter learns to read and write: A case study from birth to three. Newark, DE: International Reading Association.
Biber, D. 1988. Variation across speech and writing. Cambridge: Cambridge University Press.
Biber, D., S. Conrad, & R. Reppen. 1998. Corpus linguistics: Exploring language structure and use. Cambridge: Cambridge University Press.
Bissex, P. 1980. Gyns at work. Cambridge, MA: Harvard University Press.
Carterette, E. & M. Jones. 1974. Informal speech: Alphabetic and phonemic texts with analyses and tables. Berkeley, CA: University of California Press.
Crowhurst, M. 1983. Persuasive writing at grades 5, 7, and 11: A cognitive-developmental perspective. (Report No. CS-207719) Paper presented at the Annual Meeting of the American Educational Research Association, Montreal, Canada. (ERIC Document Reproduction Service No. ED 230 977).
Crowhurst, M. 1987. Cohesion in argument and narration at three grade levels. Research in the teaching of English, 21, 185-201.
Crowhurst, M. 1990. The development of persuasive/argumentative writing. In R. Beach & S. Hynds (Eds.), Developing discourse practices in adolescence and adulthood (pp. 200-223). Norwood, NJ: Ablex.
Hayes, D. 1988. Speaking and writing: Distinct patterns of word choice. Journal of Memory and Language, 27, 572-585.
Hicks, D. 1990. Kinds of texts: Narrative genre skills among two communities. In A. McCabe & C. Peterson (Eds.), Developing narrative structure (pp. 55-87). Hillsdale, NJ: Lawrence Erlbaum.
Hunt, K. 1965. Grammatical structures written at three grade levels. Research Report No. 3. Urbana, IL: NCTE.
Loban, W. 1963. The language of elementary school students. Research Report No. 1. Urbana, IL: NCTE.
Loban, W. 1976. Language development: Kindergarten through grade twelve. Research Report No. 18. Urbana, IL: NCTE.
MacWhinney, B. 1991. The CHILDES project. Hillsdale, NJ: Lawrence Erlbaum.
Meyer, C. 2003. English Corpus Linguistics: An introduction. Cambridge: Cambridge University Press.
Reppen, R. 1994. Variation in Elementary Student Language: A multidimensional Perspective. Unpublished Ph.D. Dissertation, Northern Arizona University, Flagstaff, Arizona, USA.
Reppen, R. 2001a. Register variation in student and adult speech and writing. In S. Conrad & D. Biber (Eds.), Variation in English: Multi-dimensional studies. 187-199. London: Longman.
Reppen, R. 2001b. Elementary student writing development: Corpus-based perspectives. In R. Simpson & J. Swales (Eds.), Corpus linguistics in North America: Selections from the 1999 Symposium. 211-225. Ann Arbor: University of Michigan Press.
Reppen, R. & R. Simpson. 2002. Corpus linguistics. In N. Schmitt (Ed.), An introduction to applied linguistics. 92-111. London: Arnold.
Tannen, D. 1984. Conversational Style: Analyzing talk among friends. Norwood, NJ: Ablex.
Tannen, D. 1989. Talking voices. New York: Cambridge University Press.
Appendix. List of writing prompts and brainstorming activities

SEPTEMBER
You can be any famous person: Describe who you would be.
BRAINSTORM: Talk about famous people (remember they can be historical, scientific, popular etc). List some on the board if you desire. What would your life be like if you were _________?

OCTOBER
Should kids watch a lot of TV: Why or why not?
BRAINSTORM: Discuss how much TV students watch. Talk about different types of shows (cartoons, talk shows, news, movies) What are some advantages and disadvantages of watching a lot of TV?

NOVEMBER
What will you do for Thanksgiving?
BRAINSTORM: Discuss different Thanksgiving plans and different traditions.

DECEMBER
Choose your favorite game or sport. Describe how to play your game or sport to someone who has not played it.
BRAINSTORM: List games or sports on the board. Discuss how to describe a game/sport that someone has not played.

JANUARY
Explain what the ideal school would be like. How would this ideal school be similar or different to the school you are in?
BRAINSTORM: List what you like about your school. What would you include in your ideal school? What else would you put in your ideal school?
FEBRUARY
You have traveled back 200 years in a time machine. What did you see? Tell a story about what happened?
BRAINSTORM: Discuss what a time machine is. Where can you go in a time machine? List some places that students would like to go if they had a time machine. Relate to Social Studies and what it was like 200 years ago. Emphasize this is 200 years BACK: not “Back to the Future”.

MARCH
Describe your best friend.
BRAINSTORM: What is a good friend? How do you decide who is your friend? Discuss ways to describe friends (e.g., physical characteristics, inner qualities).

APRIL
What is your favorite subject and why? What is your least favorite subject and why? Be sure to say why you like or do not like the subjects.
BRAINSTORM: No pre-activity. Please remind students that there are no right or wrong answers.

MAY
Picture series. Ask the students to write about what is happening in the pictures. Write so that someone who has not seen the pictures will know what happened.
BRAINSTORM: Discuss how to write so that people who do not see something can understand what happened. Relate it to a news report. If possible read a news article to the class. Discuss what the students understood? What did they “see”?
The Uneasy Interface
Methodological Issues in Using Data from Traditional and Urban Dialectology in (Re-)constructing Sociolinguistic History
Tim POOLEY

Introduction
In this study I have chosen to focus on the role of corpora in writing sociolinguistic history, taking as a model work on French and in particular the city of Lille (Pooley, 2004). In contrast to Tony Lodge’s (2004) sociolinguistic history of Parisian French, which begins at the beginning, or more precisely as early as extant written sources will permit, and works forward to a chosen cut-off date (i.e. the mid-20th century), I elected, for reasons which will become increasingly clear in subsequent sections, to start with contemporary oral sources and to work backwards as far as possible from the present. Even if one goes no further back than oral data will allow, any account of the historical development, particularly of vernacular varieties, forces investigators to take into consideration different types of data, leaving no real possibility of presenting the narrative as a seamless continuum. The join, as with different types of flooring material, can be neatly done, but it will always be visible.

In work on European French, the interface, i.e. how the data of traditional dialectology can best be related to those from urban dialectology, has only recently been considered, e.g. in the Phonologie du Français Contemporain (e.g. Durand, Laks and Lyche, 2003) project and by certain individual researchers like myself (e.g. Pooley, 2004). This comes in stark contrast, for instance, to studies of English-speaking societies, where influential textbooks, such as the first edition of Chambers and Trudgill (1980) and Francis (1983) have included both approaches under the heading of dialectology for more than two decades.
In studies of the French-speaking areas of Europe, however, the linguistic varieties focussed on by the disciplines of dialectology, dialect geography or geolinguistics, as some would have it, were, if they were mentioned together at all, as is the case for instance in Lefebvre’s (1991) study of Lille, often made to appear so far apart as to be almost completely unconnected with those investigated in sociolinguistic work (otherwise known as the quantitative
paradigm or Labovian approach). This is partly a side effect of the different methodologies and goals of the two sub-disciplines and partly due to the at least superficially radical changes in public discourse since the turn of the century regarding these varieties which were traditionally considered dialects but now have been recognised as languages and thus part of the national cultural heritage (‘langues de France’). Another differentiating factor not to be underestimated is that traditional dialect studies were resolutely system-oriented towards what Carton calls Type 4 varieties (Table 1) or local patois (micro-dialects) maximally differentiated from standard or supra-local French (General French or français commun). While only a small part of the system, or rather sub-systems, was ever portrayed, mainly those bits which were markedly different from the national standard variety, variation was considered almost exclusively in spatial and linguistic (mostly lexical) terms. Table 1.
Typology of varieties spoken in the Nord-Picardie region. Adapted from Carton (1981, 1987); Carton and Lebègue (1989)

                              Type 1           Type 2            Type 3                     Type 4
Descriptors of variety        General French   Regional French   Dialectal (local) French   Patois
                              langue           mixed variety     mixed variety              micro-dialect
                              not a dialecte   French            patois                     patois
Quantity and markedness       none             some              many                       all
of dialect features                            least marked      some marked                most marked
Geographical spread           max              wide              limited                    local
The Labovian approach, on the other hand, takes account of the community of speakers who live in a particular location, and sets out to analyse the various social dialects (inter-speaker variation) and stylistic ranges used by all speakers (intra-speaker variation). The resulting data are likely to cover a range of intermediate usages corresponding to Type 2 and Type 3 varieties, bearing in mind that in most cases the use of certain phonological or perhaps grammatical features only would be studied. Capturing such usage requires a more representative selection of speakers than those chosen for traditional studies, famously characterised as N-O-R-M-S, or non-mobile old rural males by Chambers and Trudgill as long ago as 1980. This acronym serves as a useful mnemonic in relation to a number of issues which arise with respect to the interface of the two approaches for researchers seeking to write the history of a vernacular
variety. Each of the following sections will discuss some of the difficulties for corpus building which arise from the mobility of informants (Section 1), considerations of age differences (Section 2), the selection of urban and rural fieldwork sites (Section 3), differences in the speech of men and women (Section 4) and finally stylistic variation (Section 5). Most, but by no means all, of my examples will be taken from studies of northern France, and the city of Lille in particular (Figure 1).
Figure 1. Lille-Métropole
1. Mobility
Traditional dialectology placed explicit emphasis on non-mobility: subjects who had lived most or all of their lives in a particular locality were deliberately sought out. Even in urban dialectology, obvious outsiders or ‘oddballs’ (Chambers, 2003: 94) were often deliberately avoided. In the dialectological tradition, the quest was for ‘pure’ dialect, i.e. the variety as unaffected as possible by dialect contact, mixing, levelling and koinéisation, since these processes, even if these terms were not used for the phenomena concerned, were seen as threats to, or adulterations of, localised varieties, which, if they could not be preserved, at least could be noted down for posterity. Informants, perceived as typical, were selected purely on their ability to ‘produce’ dialect, with no serious account taken of how this variety was socialised either within the individual’s repertoire or in the community as a whole at the time of the fieldwork. This is not to say that scholars like Gilliéron, Brun and Séguy were unaware of variation, but it is by no means unfair to claim that their model left them no way to account for it.

In contrast, Labovians were, from the earliest studies in the 1960s, conscious of the need to consider the social dimension. The basis of the social-class classifications used by Labov (1972: 213) in New York, i.e. profession, income and level of education, to which Trudgill (1974: 40-1) added father’s occupation, type of housing and area of residence, seems perfectly applicable in most developed countries, taking due account of new emerging hierarchies in a post-industrial world. From these parameters a four-way division (Labov: upper/middle and lower middle classes, upper and lower working classes) or five-point scale (Trudgill) can be constructed, which is both sociolinguistically plausible and practically manageable. Some studies on French-speaking Europe have used comparable models, e.g.
Singy (1996) in Switzerland and Bauvois (2003) on Belgium. Studies on France have tended to use simplified versions, e.g. Armstrong and Boughton (1998) or Armstrong and Unsworth (1999), with a two-way division between working and middle class. Studies on Lille have used education as the yardstick of social differentiation, as illustrated by Lefebvre’s (1991) study in Tables 2-4. Table 2 lists the 5 levels of education; Table 3 gives the major cross-categorisations used in the study and Table 4 an idealised version with 4 speakers per cell, which corresponds to the generally accepted norms of good practice in the field (e.g. Foulkes and Docherty, 1999).
Table 2. Social stratification according to level of education (Lefebvre, 1991)

Category                                         Observations
GP (groupe primaire)                             No qualifications, left school by age 13.
GCEP (certificat d’études primaires)             Having primary school certificate and left school between 11 and 14.
GCAP (certificat d’aptitude professionnelle)     Having a basic trade qualification obtained around age 16.
GBAC                                             Having studied for the Baccalauréat (A Level equivalent).
GSUP                                             Higher education.
Table 3. Social distribution of speakers in Lefebvre (1991) (103 speakers)

        M    F    1902-1918   1918-1938   1938-1953   1953-1960
GP      4    10   3           7           0           4
GCEP    14   2    4           9           3           0
GCAP    4    5    3           1           5           0
GBAC    5    10   5           2           5           3
GSUP    13   30   6           10          15          12
Table 4. Idealised social distribution of speakers for sociolinguistic survey of Lille based on Lefebvre (1991) (160 speakers)

        M    F    1902-1918   1918-1938   1938-1953   1953-1960
GP      16   16   8           8           8           8
GCEP    16   16   8           8           8           8
GCAP    16   16   8           8           8           8
GBAC    16   16   8           8           8           8
GSUP    16   16   8           8           8           8
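The arithmetic behind the idealised design in Table 4 can be sketched as follows. This is a minimal illustration, not Lefebvre's procedure: the variable and function names are hypothetical, the marginal counts are those printed in Table 3, and the only rule applied is the four-speakers-per-cell convention mentioned above.

```python
from itertools import product

# Stratification categories taken from Tables 2-4.
LEVELS = ["GP", "GCEP", "GCAP", "GBAC", "GSUP"]
SEXES = ["M", "F"]
COHORTS = ["1902-1918", "1918-1938", "1938-1953", "1953-1960"]
SPEAKERS_PER_CELL = 4  # convention cited in the text

# Every combination of education level, sex and birth cohort is one cell.
cells = list(product(LEVELS, SEXES, COHORTS))
ideal_total = len(cells) * SPEAKERS_PER_CELL


def ideal_marginal(category: str) -> int:
    """Speakers per education level for one sex or one cohort in the ideal design."""
    if category in SEXES:
        return len(COHORTS) * SPEAKERS_PER_CELL
    if category in COHORTS:
        return len(SEXES) * SPEAKERS_PER_CELL
    raise ValueError(f"unknown category: {category}")


# Marginal counts per education level as printed in Table 3.
columns = SEXES + COHORTS
TABLE_3 = {
    "GP": dict(zip(columns, [4, 10, 3, 7, 0, 4])),
    "GCEP": dict(zip(columns, [14, 2, 4, 9, 3, 0])),
    "GCAP": dict(zip(columns, [4, 5, 3, 1, 5, 0])),
    "GBAC": dict(zip(columns, [5, 10, 5, 2, 5, 3])),
    "GSUP": dict(zip(columns, [13, 30, 6, 10, 15, 12])),
}

# Marginals where the actual sample falls below the idealised design.
shortfalls = {
    (level, category): ideal_marginal(category) - count
    for level, row in TABLE_3.items()
    for category, count in row.items()
    if count < ideal_marginal(category)
}

print(f"{len(cells)} cells, {ideal_total} speakers in the ideal design")
for (level, category), missing in shortfalls.items():
    print(f"{level} / {category}: {missing} below the ideal marginal")
```

Since Table 3 reports only marginal totals, the comparison is made at the level of marginals rather than individual cells; a full cell-by-cell check would need the cross-tabulated data.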
The French Ministry of Education’s predictor of academic success according to parents’ occupation (Table 5) combines both dimensions and is particularly appropriate for school students too young to fulfil on their own account professional and educational criteria formulated for adults (Pooley, 2004). As a social-class classifier, it may not be universally acceptable, since it accords higher status to professions such as teaching, which have a high potential linguistic-marketplace index, than to factors such as income and housing. In Bauvois’ (2003) study, security of employment is seen as particularly statusful. A stratificational presentation, like that of Lefebvre, allows for the study of social mobility. As more and more young people obtain the Baccalauréat (62% nationally in 2004 compared to 15% in the 1960s), finding informants who left school after completing only their primary education becomes difficult, if not impossible. Indeed, since the school-leaving age was raised to
16 (which affects speakers born from 1953 onwards), informants who had only completed primary education have become thinner on the ground. Lefebvre (1991: 7) herself claimed that her sample was within the norms of representativeness of sociological work, which required a sample of between 1/1,000 and 1/10,000 of the population (Ghiglione and Matalon, 1978). More recently, Neuman (1997: 222) has suggested that a much larger sample (about 1,500) would be needed to achieve statistical representativeness for a population of 150,000 or more. The convention of requiring four speakers per cell would require a larger sample (Table 4), but many studies on France do not respect this criterion. The key issue is reconciling representativeness with manageability. According to David Sankoff (1980), to go beyond 150 subjects brings vastly increased data-handling problems without necessarily yielding increased analytical returns. In reality, therefore, virtually all sociolinguistic work is based on judgement samples, even where statistically representative samples had previously been taken or at least planned, as in Labov’s study of the Lower East Side of New York (Labov, 1966) or Shuy, Wolfram and Riley’s work (1968) in Detroit. Such samples are workable because, as Labov (1966: 180-1) claimed, linguistic behaviour within a community tends to be more homogeneous than other forms of social behaviour, such as dietary preferences or voting intentions, since it operates at a less conscious level and is thus less open to manipulation (Milroy and Gordon, 2003: 28).

Table 5.
Favourability to success at school and parents’ profession

1) highly favourable: Professions libérales (professionals); cadres (fonction publique et entreprises privées) (managerial, public or private sectors); professeurs (secondary teaching); instituteurs (primary teaching); professions de l’information, des arts et du spectacle (IT, arts, entertainment); cadres administratifs et commerciaux (administrative and commercial executives); ingénieurs (engineers); chefs d’entreprise (10+ salariés) (heads of companies with 10 or more employees)

2) favourable: Professions intermédiaires de la santé, du travail social (health care and social work), de la fonction publique et du commerce (middle managers in public service and commerce); clergé (clergy), techniciens (technicians), contremaîtres; agents de maîtrise (supervisory); retraités cadres et professions intermédiaires (retired middle management)

3) average favourability: Agriculteurs-exploitants (farmers); artisans (tradesmen); commerçants (shop owners); employés civils; agents de service de la fonction publique (lower-ranking public service); policiers (police); militaires (military); employés administratifs d’entreprise; employés du commerce (clerical); personnels de service direct aux particuliers (service personnel dealing directly with the public); chefs d’entreprise retraités (retired heads of companies).

4) unfavourable: Ouvriers qualifiés et non qualifiés (semi- and unskilled workers); ouvriers agricoles (agricultural labourers); retraités employés et ouvriers (retired manual); chômeurs n’ayant jamais travaillé (unemployed having never worked); personnes sans activité professionnelle (people not in paid work e.g. stay-home parents).

Source: http.www.education.gouv.fr/ival
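The sampling norm attributed above to Ghiglione and Matalon (1978), a sample of between 1/1,000 and 1/10,000 of the population, can be turned into simple bounds. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the population figure of one million is an arbitrary round number chosen for the example, not a census figure for Lille.

```python
import math
from typing import Tuple


def sample_size_bounds(population: int) -> Tuple[int, int]:
    """Smallest and largest sample sizes lying between 1/10,000 and 1/1,000 of a population."""
    smallest = math.ceil(population / 10_000)
    largest = math.floor(population / 1_000)
    return smallest, largest


def population_bounds(sample_size: int) -> Tuple[int, int]:
    """Range of population sizes for which a given sample satisfies the same norm."""
    return sample_size * 1_000, sample_size * 10_000


# Hypothetical population of one million, used purely for illustration.
print(sample_size_bounds(1_000_000))

# Populations for which a 103-speaker sample (the size given for Table 3) meets the norm.
print(population_bounds(103))
```

On this reading, a sample of roughly one hundred speakers satisfies the norm for any population between about one hundred thousand and one million, which is well below the figure of about 1,500 speakers attributed to Neuman (1997).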
Even if one seeks out life-long residents of a particular location, however, various apparent anomalies may yet occur. Chambers lists three types of oddballs: 1) outsiders; 2) aspirers and 3) interlopers. Both outsiders and aspirers have the outward characteristics of insider members of the community but have attitudes that set them apart. For instance, some members of small village communities, despite their typical characteristics in broad-brush socio-economic terms, prefer to opt out of, i.e. not participate fully in, community life. And this may be reflected in linguistic usage as proved to be the case in the use of a locally marked variant in the Austrian village of Grossdorf (Lippi-Green, 1989), by a long-term resident who chose to opt out of local networks and spoke more like a commuter to the nearest city. Career aspirations, even if unfulfilled, give a mental mobility that has been demonstrated to affect linguistic behaviour, as in the index of social ambition devised by Douglas-Cowie (1978) in a study of the northern Irish village of Articlave, near Coleraine, which correlates with more frequent use of standard variants of the variables selected for analysis. Taking the opposite perspective, Armstrong and Unsworth (1999) in a study of school students in the Aude region of southern France devised an Index of Regional Attachment, which proved to correlate significantly with gender, with females being mentally more mobile than males. This formalises a notion that seems highly plausible from the history of the decline of regional languages in France, where the economic underpinning of the local community was, at least in most documented examples, a male-dominated profession, which, in some cases such as mining and fishing, caused men and women to live in considerable degrees of segregation. 
With many women finding themselves in a relatively subordinate position within a low-status community, it is not surprising that their greater mental mobility, as well as greater responsibility for the educational success and thus the potential upward mobility of children, caused them to adopt French at the expense of a local language more readily than men (Pooley, 2003).
Chambers (2003: 107) gives the example of an interloper, an informant whose use of Canadian raising is atypical of Torontonians of his generation. The person concerned, one Mr J, turns out to be American, even though he had lived in Toronto since the age of 11. Payne’s (1980) work on Philadelphia describes a case where so-called interlopers are more easily exposed: only speakers with local residence from infancy, both of whose parents were also locally born, were observed to acquire the local variety of English in all its subtlety.
Perhaps the most frequent way of signalling mobility is the difference between area of birth and area of residence. As long ago as 1945, Martinet (1945) distinguished between respondents who had resided in the same département
of France and those whose family had moved home in their formative years, differentiating quite significantly between southerners who had moved within the traditional Oc area of France and those who had crossed between Oïl (northern) and Oc (southern) areas. Lefebvre (1991: 24) considers mobility in terms of the difference between the commune of current residence and that of birth, but only to argue that her data are representative, i.e. that they display approximately the same proportions as the latest available census data at the time (1982) as regards the regional origins of the population of Lille and mobility of residence within Lille-Métropole (p. 17). She also mentions social mobility, which is defined by the difference in status between the informant’s profession and that of her/his father (pp. 172-3), but does not exploit the notion in correlative terms.
Another potentially important facet of mobility is shown in commuting patterns. Those whose daily occupation does not force them to leave their local area certainly appear to be more locally based, although leisure activities may well have the same effect. For the school students interviewed in a questionnaire study in Lille (Pooley, 2004), social class and area of residence correlate with higher degrees of mobility. These greater degrees of mobility, in turn, correlated with significantly lower scores in a ‘Picard Skills Test’ (which tested knowledge of the ancestral language), giving some indicative confirmation of the long-standing premiss of dialectological studies that stability of residence is conducive to the preservation of traditional varieties. Mobility, by contrast, in all its manifestations tends to cause levelling, or the spread of a supra-local variety.
Finally, another kind of outsider is the migrant from a different ethnolinguistic background.
Studies of the effect of large-scale immigration in France, whether contemporary, as in the case of the Maghrebians in present-day Lille (Pooley, 2000; 2004) or a Parisian banlieue (Armstrong and Jamin, 2002), or historical, as in the case of the Flemish in Lille (Pooley, 2006), appear to show that, despite considerable numbers, the effect of the immigrant presence may be summarised as assimilation to native norms with levelling effects of greater or lesser magnitude, i.e. incomers adopt local vernaculars but contribute to the attenuation of some of the most marked or localised forms. Horvath’s (1985) study of Sydney shows that migrants can influence local norms, and living in London, I frequently have occasion to observe informally signs that white teenagers and young adults are adopting what might be called ethnic-minority norms. Fox’s (2004) study of Tower Hamlets (East End of London) describes a recently formed dialect of English in areas where the Bangladeshi community is numerically highly dominant.
2. Age – apparent and real time
One of the great innovations of variationist sociolinguistics was the insight that a synchronic study covering speakers of different ages can throw light on possible changes in progress, contrary to the traditional view that language change was gradual and imperceptible, except well after the event. This is not to deny that the relationship between variation and change is complex, since variation may or may not be indicative of change. In a dialect-divergent situation like that which pertained in Lille in the early decades of the 20th century, one would clearly expect older people to use more patois (local dialect) features and to a greater degree than their younger counterparts. For a fair number of such features, the year 1938 proves to be both a convenient and meaningful cut-off date. In my 1983 corpus (Pooley, 1996), some features, e.g. the non-palatalisation of l as in travail [tʁaval] ‘work’, were used only by speakers born in 1938 or earlier. Other Picard variants like word-final consonant devoicing were used significantly less by speakers born after 1938. This date immediately precedes the outbreak of the Second World War, during which Lille was occupied by German forces from 1940 to 1944 and significant numbers of its citizens fled the major towns. While the populations moved back to pre-war levels after the mid-1940s, suggesting that the majority of people returned home, it was impossible simply to reconstitute communities just as they had been in the late 1930s. Another date which appears significant in France is 1953, since people born in that year had to attend school until the age of 16. In a recent article (Pooley, 2006), I took 1965 as a cut-off date, since that was around the year of birth of the youngest speakers in both my 1983 corpus and Lefebvre’s corpus recorded a few years earlier.
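The logic of such a cut-off comparison can be sketched in a few lines of code. The following Python fragment is only a minimal illustration and not the procedure used in the studies cited: the speaker records, the token counts and the names Speaker and cohort_rates are invented for the example, and only the 1938 cut-off is taken from the discussion above. It pools, for each birth cohort, the share of tokens realised with a given dialectal variant.

```python
from dataclasses import dataclass

CUTOFF_YEAR = 1938  # cut-off taken from the text; everything else is hypothetical


@dataclass
class Speaker:
    birth_year: int
    variant_tokens: int  # tokens realised with the dialectal variant
    total_tokens: int    # all tokens where the variant could have occurred


def cohort_rates(speakers, cutoff=CUTOFF_YEAR):
    """Pool variant rates for speakers born up to the cut-off and after it."""
    pooled = {"older": [0, 0], "younger": [0, 0]}
    for s in speakers:
        key = "older" if s.birth_year <= cutoff else "younger"
        pooled[key][0] += s.variant_tokens
        pooled[key][1] += s.total_tokens
    # A cohort with no tokens gets None rather than a division error.
    return {k: (v[0] / v[1] if v[1] else None) for k, v in pooled.items()}


# Invented sample data, for illustration only.
sample = [
    Speaker(1920, 30, 100),
    Speaker(1935, 20, 100),
    Speaker(1950, 5, 100),
    Speaker(1965, 0, 100),
]
rates = cohort_rates(sample)
print(rates)
```

A markedly lower rate in the younger cohort is compatible with a change in progress, but, as noted above, variation may or may not be indicative of change, so such a comparison is a starting point rather than a demonstration.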
Speakers born around 1965 are the first children of the baby boomers, sometimes known as Generation X, who in many parts of France are the first generation whose speech is virtually bereft of regiolectal features.
It could be that certain speech forms correspond to certain life stages and are repeated in each generation, e.g. adolescents use certain forms but abandon them later in adult life. Armstrong and Jamin’s (2002) study of the northern Paris suburb of La Courneuve suggests a possible example. In this study the teens and twenties (15-25 age group) use affrication, i.e. /t/ realised as [tʃ] and /d/ as [dʒ] before /i y w/, as in tu dis [tʃy di] ‘you say’, toi [tʃwɑ] ‘you’ and je veux dire que [ʒvødʒiːkʃəː] ‘I mean that’, significantly more than adults (30-50). The authors suggest that the younger age group represents those who are immersed in street culture, whereas members of the older group are more likely to be married and in employment and thus settled into mainstream culture and values. What needs to be clarified is how such forms, which appear to be part of traditional Parisian working-class
vernacular have survived or emerged in some of the so-called grands ensembles like La Courneuve, which were largely built in the post-World War Two era, whereas in other banlieues, e.g. Villejuif (Laks, 1978), they do not appear to be significant. Age-grading of dialectal variants seems, however, intrinsically unlikely in a situation of dialect shift, where the vernacular has generally moved significantly closer to supra-local norms. Not surprisingly, it is difficult to find convincing examples of age-grading in Lille.
The basic assumption of apparent-time studies is that speakers’ speech remains stable throughout their (adult) lives. In a dialect-divergent situation, on the other hand, many older speakers are acutely conscious that they use dialect less than in the past, partly because they have few interlocutors who can reciprocate in the speech variety concerned, and possibly because perceptions of the variety concerned may have changed. ‘Ça faisait bourgeois de ne pas parler patois’ (‘Not speaking patois came across as bourgeois’) said a man born around the turn of the 19th and 20th centuries of the period of his youth and young adulthood, although this would not have applied (so strongly) when I recorded him in the 1980s (cf. Pooley, 2004: 308). From our own life histories, we may be conscious of changes in our way of speaking or possibly that of our own families.
Some studies point to a U-shaped or curvilinear distribution indicative of age-grading (Trudgill, 1988), whereby the oldest and youngest members of a community behave similarly, most notably in the relatively heavy use of vernacular variants, whereas those in between, working-age adults, tend to behave differently, using more standard forms. The general explanation of the standardising pressures of the linguistic marketplace remains, by and large, solid in cases where most features of the vernacular are stable over the whole of the age range.
By contrast, in a dialect-divergent situation, where dialect shift is known to be ongoing, age-grading seems highly unlikely.
A number of scholars have been working on sociolinguistic variation in a particular area for long enough now to return to the field-site after a period of several years, as was the case for Trudgill (1983) and for myself (analysed in Pooley, 1996), when periods of 15 and 12 years respectively had elapsed, giving a real-time dimension to the data. In both cases the new corpora investigated the speech of adolescents who were hardly out of nappies at the time when the previous recordings were made. Cedergren’s studies (1973, 1987) of ch-lenition in Panamanian Spanish, also based on two pieces of fieldwork carried out with a 15-year interval and covering on each occasion the whole social and age spectrum, indicated slight differences in behaviour among the 1920s generation when they were in their forties compared to the period when they were approaching 60. Mees and Collins (1999) is a genuine panel study comparing the speech
of the same young people from Cardiff at three points in time: firstly in 1976 aged 10; secondly in 1981 aged 15; and thirdly in 1990 aged 24. With regard to the glottalisation of t, e.g. [wɔʔə] water, certain female informants adopted it strongly in adulthood, possibly indicating that it is now perceived to be a middle-class variant in Great Britain, although historically this was not the case, and arguably is not yet so in its region of origin, namely the south-east of England.
It is a fundamental principle that synchronic age-stratified patterns should be set against a reference point in real time (Labov, 1972: 275). Commonly, researchers look to dialect atlases to provide some kind of baseline for comparison, as did Labov in his Martha’s Vineyard study of 1963. The expectation as regards possible ongoing changes would be typified by the patterns which emerged in Bailey et al.’s (1991) study of Texas speech. Innovative forms were found to be infrequent in the Linguistic Atlas of the Gulf States, while recessive forms were generally more frequent. North American dialect atlases, however, tend to record forms much closer to current everyday speech than European ones, at least as far as English and French are concerned. Historically, North American dialects are of relatively recent formation and generally the result of new dialect formations, which would tend to reduce their local markedness compared to the localised forms that some speakers from the British Isles would have known but only exported in attenuated form. In other words, there would be no equivalent to Type 4 varieties. The aim of the two generations of French dialect atlases was to record for posterity forms of such Type 4 varieties already known to be under threat of obsolescence.
3. Differentiation of Space
To be sure, dialectologists were aware of sociolinguistic variation but idealised it out of their data.
Gilliéron remarked upon the increasing degrees of francisation in the speech forms of hamlets, villages and small towns (bourgades). Arguably the most conservative variety of Picard was that of Gondecourt described by Edouard Cochet (1933), who remarked on the difference between the local patois and the more urbanised speech forms of the nearby town of Seclin (Figure 1). In the larger and then fast-developing industrial town of Roubaix, Viez (1910) remarked upon the differences between the ‘pure’ patois of the older speakers born in the mid-19th century, when Roubaix still retained something of a semi-rural character, and the more urbanised variety spoken by younger speakers. Gilliéron and Edmont (1902-10) were able to select informants with ages ranging from 12 to 85 (born no later than the late 1880s, but going as far back as the second decade of the 19th century) of either sex and
coming from various walks of life (teachers, farmers, farm labourers, blacksmiths, secretaries, hairdressers, roofers etc.) and expect them to be able to produce authentic local dialect. By the mid-20th century, however, finding such speakers proved considerably more difficult, as Loriot (1967) testifies in the description of fieldwork carried out in Normandy in the 1940s. Carton and Lebègue (1989, 1998) concentrate on speakers who are 60 years of age and over and, by and large, able to speak a Type 3 variety. The investigators sought to reconstruct the traditional variety from the output of several such speakers in a given locality, which was chosen as a reference point not because of its position on a pre-prepared grid, as was the case for Gilliéron and Edmont, but because of the availability of informants able to produce linguistic material (equivalent to Type 3 varieties) from which traditional Type 4 forms could be extrapolated.
While their observations made them aware of other forms of variation, early dialect geographers focussed on the dialect as a structural system, a linguistic object, and on variation purely in spatial terms. The place of the variety portrayed in the repertoire of informants or in the local community is not described, save through a few informal general remarks. Carton (1972) is based on recordings of ‘typical’ speakers whose usage ranges from Type 2 to Type 4 varieties. A continuous stretch of speech of as little as 90 seconds is usually enough for intraspeaker variation to manifest itself, although dialect studies which set out to do this, such as Eloy (1988), are very much in the minority.
Gilliéron’s remarks (Notice Gilliéron and Edmont, 1902-10) suggest a concentric-circle model of urban space around town centres (including relatively minor bourgs), with people in the central areas using the most converged forms, i.e.
closest to supra-local French, and speakers in the localities (small villages, hamlets) furthest removed from such centres using the broadest forms. This corresponds well with the Central Place System proposed by Hohenberg and Lees (1985) for pre-industrial societies and the Centre-Periphery model of Reynaud (1981) for industrialised ones, of which Paris in relation to France is an archetypal example, with its long-standing and largely still prevailing dominance in all aspects of national life. Historically the centrality of Lille in northern France was by no means so clear-cut, but its regional dominance has been strengthened by the process of metropolisation set in train in the late 1960s, whereby long established and historically sometimes bitter local rivals, such as Roubaix and Tourcoing, are becoming increasingly integrated into Lille-Métropole (Figure 1). By the end of the 20th century, these 19th-century new towns had become decaying old towns and so-called periurban areas developed in the surrounding and interstitial greener communes as shown schematically in Figure 2. This is
very different from the classic pattern, where greater degrees of population movement in towns lead to greater degrees of levelling and mixing of dialects. If urban demographic growth is largely a result of long-term in-migration from the rural hinterland, then interdialectal contact will be slow and gradual. In the 19th century, however, the major towns of Lille-Métropole experienced exponential growth fuelled principally by immigration from Belgium, in particular the Flemish-speaking areas. The 20th century saw large-scale immigration from Poland, North Africa, southern Europe and West Africa, i.e. groups who have no stake in the use of traditional local forms, favouring dialect levelling and koinéisation.
Figure 2. The curve of periurbanisation in Lille (adapted from Damette, 1997)
It is nonetheless arguable (Pooley, 2004) that certain areas falling within the bounds of the arrondissement of Lille are still rural, particularly those to the west and south-west of the city. The major criteria would be a population under 2,000 and agriculture as the major economic activity, e.g. Hantay and Sailly-lez-Lannoy. Agriculture was not the only economic basis enabling autochthonous minority-language communities to keep going. In the Nord–Pas-de-Calais, fishing, mining and textiles also provided such community support, but the increasing dominance of the service sector has, unlike in certain neighbouring countries such as Spain, contributed to the desocialisation of regional varieties. Moreover, the increasing popularity of the so-called ‘composite’ life
style (Pedersen, 1994), which combines the benefits of life in town and country, further complicates traditional dichotomous territorial divisions of urban and rural areas and completely undermines the supposed traditional concentric-circle model of urban development. The notion of N-O-R-M-S will also lose its potency as the agricultural sector employs fewer and fewer people in percentage terms in relation to the working population as a whole. It also has to be borne in mind that many present-day farmers are married to spouses who work in towns. Increasingly their profession requires specialised knowledge and, at the very least, reasonable business skills. Such developments have contributed to the emergence of periurban areas, which tend to have linguistic profiles like new towns, i.e. particularly favourable to levelled varieties and inhospitable to the transmission and fostering of traditional local varieties and variants, even where such localities figured in dialect atlases a few generations ago.
4. Gender
The classic sociolinguistic gender pattern in variationist sociolinguistics for western societies may be summarised in two principles (Labov, 1990). Firstly, in situations of stable variation, men use vernacular variants more than women and, secondly, in situations of change, women tend to take the lead in adopting the incoming form, irrespective of whether the change goes in the direction of the standard variety or not. In relatively stable communities, where men were the main breadwinners through activities such as agriculture, fishing or mining, vernacular varieties of French and/or ancestral languages were maintained in some instances until well into the 20th century. In such communities, women had a relatively subordinate role and, from the late 19th century onwards, were increasingly charged, as already observed, with the socialisation of children in (‘good’) French in the hope of intergenerational upward mobility.
I know of no entirely convincing example in French-speaking Europe where women can be demonstrated to have taken the lead in the progression of a new non-standard (or supra-regional) variant, although examples with some plausibility may be cited. In Pooley (2001), I argued that within the sociolinguistic literature three types of ‘feminine’ variants, i.e. where the sociolinguistic gender pattern is reversed, could be found in western societies:
1) relic variants
2) supra-local variants
3) network-related variants
Relic variants are simply those which survive longest in the mouths of women, partly because of their greater longevity, and partly because in
some cases, women are less mobile than men (e.g. Hadjadj, 1981), who have had to contend with military service or leave their local area for paid work. A minority of dialectologists have favoured women as informants for such reasons, most notably Pop (1950).
Changes in the direction of a supra-local variety may not appear in some cases to be in the direction of the standard. In England, the glottal realisation of intervocalic t may be a case in point (Milroy, Milroy, Hartley and Walshaw, 1994; Mees and Collins, 1999), although, as already pointed out, the status of the variant appears to be changing. In French-speaking Europe, however, the supra-local norm is what I have called Oïl French (Pooley, 2006), a regionally neutral statistical standard that, although differentiated from Reference French on certain points, is the de facto colloquial norm for the majority of French speakers, and for Generation Xers (those born since 1965) a strong, if not overwhelming, majority.
The third category of variants deviating from the dominant sociolinguistic pattern consists of those correlating with network factors, e.g. the greater use of certain vernacular features by young women in the Clonard, a Catholic area of Belfast studied by Milroy (1987). In Pooley (2001) I suggested that in varieties of French spoken in France, the only plausible examples pointed to a temporary and delicately balanced state of affairs. In the textile towns of northern France, like Roubaix, my findings show (Pooley, 1994) that older women born prior to World War Two used certain vernacular variants more than their male contemporaries. The most striking example is the devoicing of word-final consonants, e.g. sage [saʃ] as opposed to [saʒ] ‘wise’ or ‘well-behaved’. This variant, however, while much less used by subsequent generations, also reverted to the more orthodox sociolinguistic gender pattern.
Although historically also a dialectal form, correlations with other variables suggest that in the early 20th century, word-final consonant devoicing (WFCD) was perceived as a French rather than a dialect feature, a state of affairs which was completely reversed by the end of the 20th century. Cécile Bauvois (2001) found that in the Belgian town of Mons under 100 miles away, WFCD was more widely used, for instance by speakers from middle-ranking social groups in reading styles, as opposed to working-class groups in spontaneous speech in the studies on Lille. Among primary-school teachers, women used more WFCD than men. Bauvois suggests that in this highly feminised profession, the minority of men who take it up appear to want to conform more overtly to the prestige norm, whereas women, who for the time being find themselves in a well established majority, feel less pressure to conform to the normed variety. If this were correct, then the reversal of the sociolinguistic pattern could, contrary to my suggestion put forward on the basis of French varieties,
become stable, although there is good reason to believe that this would require the presence of factors capable of buttressing resistance to macro-level social pressures favouring convergence to supra-local norms.
5. Stylistic variation
Traditional dialectology is resolutely system-oriented and takes little or no account of stylistic variation, being interested only in Type 4 varieties, which may be arrived at by comparison of several exemplars of Type 3 varieties (Carton and Lebègue, 1989). Carton (1972) is based on recordings of older speakers who used varieties locatable at various points on the typology. There is no attempt to show how these varieties were socialised or what place they have in the speakers’ repertoire.
In variationist sociolinguistics it has always been seen as crucial either to record selected informants in different situations, or to obtain informants’ most spontaneous speech forms, assumed to be the most vernacular, as in Pooley (1996 – 1983 corpus). At that time I was interested only in spontaneous speech, particularly what Carton (1981, 1987) calls Dialectal French or intended patois (‘patois d’intention’) (Table 1). Any elicitation, whether by minimal-pair tests, word-list styles, reading of a continuous passage or, probably, formal interview, would not have been conducive to such an aim but would have produced more standard forms of French and certainly not (Picard) dialect. Bearing in mind also that for most speakers Picard is an unwritten variety, it would have been inappropriate to use written material to elicit dialectal variants which some speakers would not use spontaneously, and which, for those that did, would have been unlikely to be triggered by a written stimulus. This raises the question as to whether dialectal forms are the most spontaneous usages for speakers who have them in their repertoire.
Do they constitute a different sub-register from informal styles in French or are they more or less part of a continuum of regiolects ranging from usages indistinguishable from the supra-local variety to broad(-ish) patois? To speak of spontaneous usage is to assume that the Observer’s Paradox or Hawthorne effect can be overcome. Recording the speakers in different situations would provide some kind of check on this. The overcoming of the Observer’s Paradox may, at least to a considerable degree, be achieved by recording speakers in groups so that normal social constraints would apply and prove stronger, as Nordberg (1980: 7) argues, than the microphone which is obviously likely to inhibit spontaneity. Group recordings also raise the dilemma of whether to opt for participant observation or allow the subjects to talk among themselves, as did Armstrong (1993). For various practical reasons, I have always chosen participant observation: firstly, while gathering my first corpus in 1983,
because I was either known to the participants or introduced by a trusted intermediary; and subsequently, for the school-based corpora recorded from 1995 onwards, because I considered that it would have been a breach of trust, as well as a risk to expensive recording equipment, to leave school-student informants unsupervised. In Milroy’s (1987) study, group conversations proved more than sufficient, and attempts to elicit word-list styles and, even more so, reading-passage styles proved to be of little worth, partly because of the lack of variation elicited and partly because of the low level of reading skills of some informants. The decision to use students from special-needs sections known as SEGPA (Section d’Education Générale et Professionnelle Adaptée) confronted me with the problem of poor readers, particularly with regard to the reading of continuous passages, although word lists posed no problem and produced interesting degrees of variation. Nurturing the hope of achieving some degree of comparability, I used, firstly, the passages selected by Lefebvre (1991) and, secondly, that of the PFC protocol. Both yielded poor results, with many subjects stumbling over so many words that to speak of continuous speech was a complete misnomer. In more recent fieldwork, however, the reading passage was replaced by a set of sentences, obviating the major difficulties of reading a continuous passage yet providing some examples of connected reading aloud from all but the very poorest readers.
Concluding remarks
The forms shown in dialectological work are difficult to evaluate socially. It is by no means blatantly inaccurate to re-affirm what widely published works of reference have been saying for some time, i.e. that the traditional local dialects tend(ed) to be spoken by the least well educated and the least mobile and now increasingly by older members of the community.
What dialect geography signally fails to shed light upon, however, is the socialisation and indeed the desocialisation of these varieties, now increasingly referred to as languages in the French context (e.g. Eloy, this volume). That is of course not at all how they were perceived when they still enjoyed vitality. Regrettably, it is now impossible to reconstruct the linguistic repertoires of the dialect-atlas and monograph informants in all their variability, as is acknowledged in Lodge’s (2004) sociolinguistic history of Paris. To be sure, the forms noted must have some social reality but the degree of prominence is difficult to assess among those, usually born before World War One, who provided the data. Nor is it possible to assume progressive intergenerational differences occurring at a relatively uniform rate, since the 20th century was by no means a period of gradual change but was punctuated, and not merely in its first part, by cataclysmic events, which
means that there are intergenerational hiatuses, in particular, but not exclusively, between speakers born before the Great War and those born in the interwar years and subsequent generations.
The methodologies established by dialect geographers are nonetheless still with us to a degree. Recent studies (I include PFC in this) still rely on small numbers of typical speakers, selected largely on an age-related basis, rather than a statistically more reliable judgement sample covering a wider social range of speakers, although some investigators have taken a mere sub-sample from a larger project based on sound sociolinguistic practice. The study of European varieties of French is not burdened with an overabundance of data, and some of these are not necessarily of the quality desirable in the light of accepted standards of good practice elsewhere, e.g. the criterion of selecting four speakers per cell for each social category chosen for analysis. Nonetheless, in some areas including northern France, the material available, if examined with judicious cross-referencing, taking due account of different types of data, makes it possible to build up a reasonably plausible (and indeed reliable) picture of the sociolinguistic changes that have occurred over the last century and indeed to project back still further.
References
Armstrong, N. (1993) A study of phonological variation in French secondary school pupils. Unpublished PhD Thesis, University of Newcastle-upon-Tyne.
Armstrong, N. and Boughton, Z. (1998) Identification and evaluation responses to a French accent. Some results and issues of methodology. Paroles 5 (6): 27-60.
Armstrong, N. and Jamin, M. (2002) Le français des banlieues: Uniformity and discontinuity in French of the Hexagone. In: K. Salhi (ed.) French in and out of France. Language policies, intercultural antagonisms and dialogue. Oxford: Peter Lang, 107-136.
Armstrong, N. and Unsworth, S. (1999) Sociolinguistic variation in southern French schwa. Linguistics 37/1: 127-156.
Bailey, G., Wikle, T., Tillery, J. and Sand, L. (1991) The apparent time construct. Language Variation and Change 3, 241-264.
Bauvois, C. (2001) L’assourdissement des sonores finales en français: une distribution sexolectale atypique. In: N. Armstrong, K. Beeching and C. Bauvois (eds.) La langue française au féminin. Paris: L’Harmattan, 21-36.
Bauvois, C. (2003) Ni d’Eve ni d’Adam. Paris: L’Harmattan.
Carton, F. (1972) Recherches sur l’accentuation des parlers populaires dans la région de Lille. Lille: Service de Reproduction des Thèses, Université de Lille III.
Carton, F. (1987) Les accents régionaux. In: G. Vermes and J. Boutet (eds.) France, pays multilingue. Paris: L’Harmattan, 1987, Vol 2, 29-49.
Carton, F. and Lebègue, M. (1989, 1998) Atlas linguistique et ethnographique picard, 2 Vols. Paris: Editions du CNRS.
Cedergren, H. (1973) The interplay of social and linguistic factors in Panama. Unpublished doctoral dissertation, Cornell University.
Cedergren, H. (1987) The spread of language change: verifying inferences in linguistic diffusion. In: P. H. Lowenberg (ed.) Language spread and language policy: Issues, implications and case studies, Georgetown University Round Table on Languages and Linguistics 1987. Washington DC: Georgetown University Press, 45-60.
Chambers, J. (2003) 2nd ed. Sociolinguistic Theory. Oxford: Blackwell.
Chambers, J. and Trudgill, P. (1980) Dialectology. Cambridge: Cambridge University Press.
Cochet, E. (1933) Le patois de Gondecourt (Nord): grammaire, lexique. Paris: Droz.
Damette, F. (1997) La région du Nord–Pas-de-Calais. Villes et système urbain. Report for l’Agence de Développement et d’Urbanisme de Lille-Métropole.
Douglas-Cowie, E. (1978) Linguistic code-switching in a Northern Irish village: Social interaction and social ambition. In: P. Trudgill (ed.) Sociolinguistic patterns in British English. London: Edward Arnold, 37-51.
Durand, J., Laks, B. and Lyche, C. (2003) La prononciation du français dans sa variation. Perros-Guirec: La Tribune Internationale des Langues Vivantes, 33 (May 2003).
Eloy, J.-M. (1988) La famille comme refuge dialectal: un refuge revisité. Plurilinguismes 1, 102-116.
Eloy, J.-M., Blot, D., Carcassone, M. and Landrecies, J. (2003) Français, picard, immigrations. Une enquête épilinguistique. Paris: L’Harmattan.
Fox, S. (2004) New dialect formation in the English of the East End of London. Unpublished PhD dissertation: University of Essex.
Foulkes, P. and Docherty, G. (1999) (eds.) Urban voices. London: Arnold.
Francis, W. N. (1983) Dialectology. London: Longman.
Ghiglione, R. and Matalon, B. (1978) Les enquêtes sociologiques. Paris: Armand Colin.
Gilliéron, J. and Edmont, E. (1902-1910) Atlas linguistique de la France. Paris: Champion.
Hadjadj, D. (1981) Etude sociolinguistique des rapports entre patois et français dans deux communautés rurales du centre de la France en 1975. International Journal of the Sociology of Language 29, 71-98.
Hohenberg, P. and Lees, L. 2nd ed. (1995) The making of urban Europe: 1000-1994. Cambridge, Massachusetts: Harvard University Press.
Horvath, B. (1985) Variation in Australian English. Cambridge: Cambridge University Press.
Labov, W. (1966) The social stratification of English in New York City. Washington, DC: Center for Applied Linguistics.
Labov, W. (1972) Sociolinguistic patterns. Philadelphia: University of Pennsylvania Press.
Labov, W. (1990) The intersection of sex and social class in the course of linguistic change. Language Variation and Change 2, 205-254.
Laks, B. (1978) Contribution empirique à l’analyse socio-différentielle de la chute des /r/ dans les groupes consonantiques finals. Langue française 34, 109-125.
Lefebvre, A. (1991) Le français de la région lilloise. Paris: Publications de la Sorbonne.
Lippi-Green, R. (1989) Social network integration and language change in progress in an alpine rural village. Language in Society 18: 213-234.
Lodge, R.A. (2004) A sociolinguistic history of Parisian French. Cambridge: Cambridge University Press.
Martinet, A. (1945) La prononciation du français contemporain. Paris: Droz.
Mees, I. and Collins, B. (1999) Cardiff: a real-time study of glottalisation. In: P. Foulkes and G. Docherty (eds.) Urban voices. London: Arnold, 185-202.
Milroy, J. (1992) Linguistic variation and change. Oxford: Blackwell.
Milroy, L. (1987) 2nd ed. Language and social networks. Oxford: Blackwell.
Milroy, L. and Gordon, M. (2003) Sociolinguistics. Methods and Interpretation. Oxford: Blackwell.
Milroy, J., Milroy, L., Hartley, S. and Walshaw, D. (1994) Glottal stops and Tyneside glottalisation: competing patterns of variation and change in British English. Language Variation and Change 6, 327-357.
Neuman, W.L. (1997) 3rd ed. Social research methods. Qualitative and quantitative approaches. Boston: Allyn and Bacon.
Nordberg, B.
(1980) Sociolinguistic fieldwork experences of the Unit for Advanced Studies in Modern Swedish. FUMS Report No. 90. Uppsala: FUMS. Payne, A. (1980) Factors controlling the acquisition of Philadelphia dialect by out-of-state children. In: W. Labov (ed.) Locating language in time and space. New York: Academic Press, 143-178. Pedersen, I.-L. (1994) Linguistic variation and composite life modes. In: B. Nordberg (ed.) The sociolinguistics of urbanization: the case of the
The Uneasy Interface
189
Nordic countries. Berlin: Walter de Gruyter, 87-115. Pooley, T. (1994) Word-final consonant devoicing in a variety of working-class French – a case of dialect contact? Journal of French Language Studies 4, 215-233. Pooley, T. (1996) Chtimi: the urban vernaculars of northern France. Clevedor: Multilingual Matters. Pooley, T. (2000) The use of regional French by Blancs and Beurs: questions of identity and integration in Lille. Interfaces 5, 51-69. Pooley, T. (2001) Les variantes sociolinguistiques féminines: essai de synthèse. In: N. Armstrong, K. Beeching and C. Bauvois (eds.) La langue française au féminin. Paris: L’Harmattan, 53-73. Pooley, T. (2003) La différenciation hommes-femmes dans la pratique des langues régionales de France. Langage et société 106, 9-31. Pooley, T. (2004) Language, dialect and identity in Lille. 2 Vols. Lewiston, NY: Edwin Mellen Press. Pooley, T. (2006) On the geographical spread of Oïl French in France. Journal of French Language Studies. 16.3, 357-390. Pop, S. (1950) La dialectologie. Aperçu historique et méthode d’enquêtes linguistiques. 2 Vols. Louvain: Centre international de dialectologie générale. Reynaud, A. (1981) Société, espace et justice, inégalités régionales et justice socio-spatiale. Paris: Presses Universitaires de France. Sankoff, G. (1980) A quantitative paradigm for the study of communicative competence. In: G. Sankoff (ed.) The social life of language. Philadelphia: University of Pennsylvania Press, 295-310. Shuy, R., Wolfram, W and Riley, W. (1968) Field techniques in an urban language study. Washington, DC: Center for Applied Linguistics. Singy, P. (1996) L’image du français en Suisse romande. Paris: L’Harmattan. Trudgill, P. (1974) The social differentiation of British English. Cambridge: Cambridge University Press. Trudgill, P. (1983) On dialect. Oxford: Blackwell. Trudgill, P. (1988) Norwich revisted: recent changes in an English urban dialect. English World-Wide 9 (1), 33-49. Viez, H. 
(1910) Le parler populaire (patois) de Roubaix. Marseille: Lafitte Reprints, [1978].
190
Tim POOLEY
A Corpus of French Texts with Non-standard Orthography

Yves Charles MORIN

The phonetic reconstruction of earlier stages of a language essentially relies on four techniques, unequally applicable to various languages: (1) comparative reconstruction, (2) internal reconstruction, (3) phonetic interpretation of early graphic systems, and (4) phonetic interpretation of poetic conventions, such as rhyme and meter, over the course of the language’s history. These techniques all have limitations, and the best results are obtained when they can be combined and used to complement one another, as is often possible with French.

Graphic systems may sometimes be relatively phonetic, i.e. may give more or less reliable indications of the pronunciation of a language, usually for relatively formal registers. It is rarely the case, however, that such phonetic correspondence remains stable over time. Graphic systems are notoriously conservative and tend to be retained after sound changes have disrupted the phonographic correspondences that might have existed at earlier stages.

Non-standard orthographies, whether intentionally devised as substitutes for current conventional spelling systems or elaborated by semi-literate scriptors after they acquired some rudiments of conventional orthography, tend to follow some kind of phonographic principle (though not necessarily one based on prior phonemic awareness, as is sometimes claimed). Early documents using such orthographies thus offer a glimpse of the sound patterns of a language at the time when they were written, and have constituted some of the most important evidence used in scholarly works on the history of French.

The present survey is meant to be a reasoned, albeit limited, catalog of such documents, including those that have been entered in the computer databases¹ that I have developed over the last twenty years as a tool for my work on the reconstruction of the pronunciation of French at different stages of its development.
One essential criterion for inclusion in a database was the richness of the expected information on pronunciation, together with a relatively large volume of documents devised by the same grammarian or semi-literate scriptor, so as to allow for some reliability in the results.

The documents included in the databases can be divided into six categories ranging from transcriptions of French in non-Latin graphic systems (section 1) and documents written by semi-literate individuals (section 2), to texts for which the authors or printers deliberately chose to use a reformed spelling, either fully articulated (section 5) or succinctly sketched (section 6). Less radical are documents that used a conventional orthography enriched with additional diacritics for easier reading (section 3) and documents that adapted the conventional spelling to note non-standard pronunciations (section 4). In the last section, we give a short list of texts written in relatively standard orthographies, which could nonetheless cast some light on the pronunciation (section 7), none of which, however, has been included in the databases.

¹ The documents signaled with the sign + under the heading db in the tables below have been included in the databases; those with the sign ° are currently being entered.

1. Texts with non-Latin conventional graphic systems

One occasionally finds documents that are written with a conventional graphic system other than Latin. They are usually too short to allow for in-depth analyses. The only exceptions are documents with French words written in Hebrew characters: glossaries, dictionaries, treatises and, rarely, poems (cf. Blondheim 1927). An inventory of known glossaries can be found in Banitt (1972:xiv–xv) [G1 to G6], three of which have been edited, and a fourth only partially, together with various fragments [F1 to F9] and dictionaries [D1 and D2]. These glossaries written with Hebrew characters must not be confused with modern editions of French glosses in Rashi’s commentaries (Darmesteter 1909, Darmesteter and Blondheim 1937) that try to recover the original forms of the Talmudic commentator from often quite corrupted copies found in later manuscripts.
Figure 1. Glossaire de Bâle, first six lines of fol. 27r° (from Banitt 1972)
Glossaries contain series of Hebrew expressions, either from the Bible or from teachers’ commentaries, followed by their French equivalents, or le’azim, as appears in the excerpt of the Glossaire de Bâle given in Figure 1. In this excerpt, one can read several glosses in Latin characters later added by an early analyst; in particular, chant son is a Latin transliteration of the la’az immediately below, which explains a Hebrew term found earlier on the same line. The Traité des fièvres, on the other hand, is basically a French text entirely written with Hebrew characters, with only occasional Hebrew and Latin comments and prescriptions.

Great hopes were entertained by the first scholars who examined these texts, in particular Darmesteter, that an alternate script would exhibit less graphic conservatism and add significantly to our knowledge of the phonetic characteristics of Old French. It soon became apparent, however, that the use of the Hebrew script for French and for Latin belonged to a long tradition, which could also be quite conservative (cf. Banitt 1972:58). Different Hebrew characters could thus be used for Latin and French in the same medieval texts for sounds that medieval speakers certainly did not distinguish, such as for Latin ‹s› and sin/shin for French ‹s› (besides sade for samekh [s] < [ ]), probably because they belonged to independent scriptural traditions that developed at different times (cf. Banitt 1972:61, Kiwitt 2001:33).

| Conventional name | date | Editions and characteristics | db |
|---|---|---|---|
| Glossaire de Bâle (G1) | 13th c. | edition: Banitt 1972. dialect: Champenois (prob.). | |
| Glossaire de Paris (G4) | 13th c. | edition: Lambert and Brandin 1905; the Hebrew characters of the French le’azim have not been edited, they are simply transliterated according to the authors’ own interpretation of their phonetic values, barring the possibility of a systematic reconstruction of the original Hebrew script. dialect: Lorrain (prob.). | |
| Glossaire de Leipzig (G2) | end 13th c. | edition: Banitt 1995–2005. dialect: Anglo-Norman, Norman. | |
| Glossaire de Parme (G5) | 14th c. | edition (partial): Siskin 1981. | |
| Traité des fièvres | early 13th c. | editions: Katzenellenbogen 1933, Kiwitt 2001; in Kiwitt’s edition, the Hebrew characters are transliterated according to a protocol which preserves most of the information found in the original script. dialect: Bourguignon (prob.). | |
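As a purely illustrative sketch (not a description of the author’s databases, whose design is not given here), the rows of a table such as the one above could be held as simple records, with the db mark of footnote 1 as an optional field. The names `CatalogEntry`, `DbStatus` and `included` are hypothetical, and the db marks of the two sample rows are deliberately left unset because they are not reproduced here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class DbStatus(Enum):
    """Marks used under the 'db' heading (see footnote 1)."""
    INCLUDED = "+"      # already entered in the databases
    IN_PROGRESS = "°"   # currently being entered


@dataclass(frozen=True)
class CatalogEntry:
    """One table row; all field names are assumptions for illustration."""
    name: str
    date: str
    editions: Tuple[str, ...]
    dialect: Optional[str] = None
    db_status: Optional[DbStatus] = None  # None = no mark recorded


def included(entries: List[CatalogEntry]) -> List[CatalogEntry]:
    """Return only the entries marked '+' (already in the databases)."""
    return [e for e in entries if e.db_status is DbStatus.INCLUDED]


# Two rows taken from the table; db marks deliberately left unset.
entries = [
    CatalogEntry("Glossaire de Bâle (G1)", "13th c.",
                 ("Banitt 1972",), "Champenois (prob.)"),
    CatalogEntry("Glossaire de Parme (G5)", "14th c.",
                 ("Siskin 1981 (partial)",)),
]
```

Keeping the mark optional avoids asserting a database status that the table does not actually record for a given row.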
References

Banitt, Menahem. 1972. Le glossaire de Bâle, 2 vol. Jérusalem: Académie nationale des sciences et des lettres d’Israël.
Banitt, Menahem. 1995–2005. Le glossaire de Leipzig, 4 vol. Jérusalem: Académie nationale des sciences et des lettres d’Israël.
Blondheim, David Simon. 1927. Poèmes judéo-français du Moyen Age. Paris: Champion.
Blondheim, David Simon. 1937. Les gloses françaises dans les commentaires talmudiques de Raschi, vol. 2: Études lexicographiques. Baltimore: The Johns Hopkins Press / London: Oxford University Press / Paris: «Les belles lettres».
Darmesteter, Arsène. 1909. Les gloses françaises de Raschi dans la Bible, accompagnées de notes par Louis Brandin et précédées d’une introduction par Julien Weill. Paris: Durlacher.
Darmesteter, Arsène and D. S. Blondheim. 1929. Les gloses françaises dans les commentaires talmudiques de Raschi, vol. 1. Paris: Champion.
Katzenellenbogen, Lucie. 1933. Eine altfranzösische Abhandlung über Fieber. Würzburg: Konrad Triltsch.
Kiwitt, Marc. 2001. Der altfranzösische Fiebertraktat Fevres: Teiledition und sprachwissenschaftliche Untersuchung. [Würzburger medizinhistorische Forschungen 75]. Würzburg: Königshausen und Neumann.
Lambert, Mayer and Louis Brandin. 1905. Glossaire hébreu-français du XIIIe siècle. Recueil de mots hébreux bibliques avec traduction française. Paris: Leroux.
Siskin, Harley Jay. 1981. A partial edition of a fourteenth century biblical glossary: Ms Parma 2780, Ph.D. thesis. Ithaca: Cornell University.

2. Texts with deviant orthographies

There is a large body of research on deviant orthographies, usually of semi-literate scriptors from various social backgrounds (cf. Ernst 1999). A survey of the relevant documents is well beyond the scope of this paper. Many of them can be found in some of the raw materials used by historians of private life: private letters, intimate and travel diaries, “livres de raison” (family diaries/registers), memoirs (cf.
Beaurepaire and Taurisson 2003), which more often than not, however, have been adapted to modern spelling conventions and need to be re-edited to be useful as linguistic documents. The very concept of deviant orthography may be ill suited for texts written during or before the fourteenth century, as their specific spelling may sometimes simply reflect a scriptural tradition of partial phonetic adjustment to a regional norm. One can probably find a complete continuum between the spelling innovations found in the Annales de Laval by Le Doyen (Figure 2) and those in Haynin’s Mémoires (Figure 3).

Ceſt recitez en ce petit liuret
Le temps futur, ſil ſera chault ou froit
Et aduiendra ou ſoit vil temps ou cher;
Soit guerre ou paix ſans mauluais temps ſerchez.

Figure 2. Annales de Laval by Le Doyen, f° 1r° (cf. Godberg–Beauluère, p. 6)
lesquelles ne peulte touttes estre venues a ma connoisanse, pluseurs du Roiame se conplaindoite et doleoite. Et tant que de fet a la remonstranse et requeste du duc de Berry, seul frerre du roi et le plus prochain de la couronne, et de pluseurs autres prinches et segneurs de Roiame, estant desplaisant de che quil veoite le Roiame ensi gouverné et eux mime pareillement, desirant di pourveir et mettre remede de tout leur pooir come tenus, iestoite. Et afin qu’ensi en avenist, il proumerte l’eun a l’autre d’estre aidant et confortant et d’eus trouver et coumuniquier ensanble …

Figure 3. Haynin’s Mémoires, f° 9v° (cf. Brouwers, vol. 1, p. 6)
Le Doyen was a royal notary in Laval, well accustomed to writing. He abided by the orthographical system of his time, with its regional phonetic interpretation, e.g., ‹oi› for an [e] or [ε] sound in some words, such as froit ‘froid’. His only significant departure from the dominant normative patterns is the use of ‹z› in word-final position for ‹r›, as in recitez or ſerchez in the excerpt given above (which Godberg–Beauluère correct to ‹r›). This feature is found in many manuscripts of the same period and probably reflects a phonetic development [r] > [ð] ~ Ø.

Haynin, on the other hand, was a member of the gentry, whose education was certainly more concerned with warfare than with writing skills. His spelling reflects the Picard scripta of his time, e.g. ‹ch› for regional [ ] or [ ] in prinches and che and ‹ei› for [i] in pourveir ‘pourvoir’. One quite original feature, however, is the generalization of mute ‹e› in word-final position in 3sg endings, as in peulte (normally written peut, sometimes peult), and in 3pl endings, as in conplaindoite (for regional complaindoient) or proumerte (for pro(u)mirent), which signals the retention of word-final [t], a normal feature of Picard, still found in modern dialects with the 3pl ending.

Lodge (2004:143, 166–167) lists, together with large excerpts, some classic examples of deviant orthographies relevant to the history of Parisian French. A larger corpus, including other regions, has been undertaken and is described by Ernst and Wolf (2002). Another large corpus, covering varieties of French spoken between the seventeenth and nineteenth centuries in North America and Northwestern France, is under construction and is described in Martineau (to appear) and Bénéteau and Martineau (2006).
| Conventional name | date | Editions and characteristics | db |
|---|---|---|---|
| Mémoires de Haynin | 1465–1477 | edition: Brouwers 1905. dialect: Hainaut Picard. | |
| Annales de Laval de Le Doyen | 1480–1537 | edition: Godberg–Beauluère 1859 (poor edition, currently being revised after ms. BnF f. fr. 11512). dialect: Mayenne dialect. | |
| Mémoires de Ménétra | 1768–1802 | edition: Roche 1982; adapted to modern spelling. dialect: Parisian French. | |
| Œuvres poétiques de Ménétra | ?–1802 | edition: none (found in the same ms. as the Mémoires). dialect: Parisian French. | |
| Journal de Lepailleur | 1839–1845 | edition: Aubin (1982); adapted to modern spelling. dialect: Québec French. | |
References

Beaurepaire, Pierre-Yves and Dominique Taurisson (eds.). 2003. Les ego-documents à l’heure de l’électronique. Nouvelles approches des espaces et des réseaux relationnels. Montpellier: Presses universitaires de Montpellier.
Bénéteau, Marcel and France Martineau. 2006. “Le « Journaille » de Barthe: incursion dans le français du Détroit sous le régime anglais”. La passion des lettres – Études de littérature médiévale et québécoise en hommage à Yvan Lepage, Pierre Berthiaume and Christian Vandendorpe (eds.), 165–178. Ottawa: Éditions David.
Ernst, Gerhard. 1999. “Zwischen Alphabetisierung und « français populaire écrit ». Zur Graphie privater französischer Texte des 17. und 18. Jahrhunderts”. Sociolinguistica 13.91–111.
Ernst, Gerhard and B. Wolf. 2002. Journal de Chavatte (CD-rom). Tübingen: Niemeyer.
Haynin, Jean de. 1905. Mémoires de Jean, sire de Haynin et de Louvignies, 1465–1477, D. D. Brouwers (ed.). Liège: Denis Cormaux (Société des bibliophiles liégeois).
Le Doyen, Guillaume. 1859. Annales et chronicques du pais de Laval et parties circonvoisines, depuis l’an de Nostre Seigneur Jhesu-Crist 1480 jusqu’a l’année 1537, avec un préambule retrospectif du temps anticque, jadis composée par feu maistre Guillaume le Doyen, en son vivant notaire Roïal au Comté de Laval, publiées pour la 1re fois par M. H. Godberg, avec notes et esclaircissements de M. Louis de la Beauluère, correspondant du ministère de l’instruction publicque pour les travaux historicques. Laval: Honoré Godberg.
Lepailleur, François-Maurice. 1996. Journal d’un patriote exilé en Australie 1839–1845, Georges Aubin (ed.). Sillery: Septentrion.
Lodge, R. Anthony. 2004. A sociolinguistic history of Parisian French. Cambridge/New York: Cambridge University Press.
Martineau, France. to appear. “Perspectives sur le changement linguistique: aux sources du français canadien”. Canadian Journal of Linguistics/La revue canadienne de linguistique 50.
Martineau, France. to appear. “Le français de la région du Détroit, un français de la frontière?” Cahiers Charlevoix.
Martineau, France. to appear. “Variation in Canadian French usage from the 18th to the 19th century”. Multilingua.
Ménétra, Jacques-Louis. 1982. Journal de ma vie: Jacques-Louis Ménétra, compagnon vitrier au XVIIIe siècle, Daniel Roche (ed.). Paris: Montalba.

3. Annotated conventional orthography

Many grammarians felt the need to remedy the phonetic opacity of the conventional orthography by adding some diacritical marks. Sylvius (1531) even championed a two-tiered orthography for French: one tier for the etymological representation of words and a superposed tier for their pronunciation, e.g. ‹g› for [ ], ‹ › for [ ], ‹ › for [z] in ‹li ons› for (nous) lisons, ‹ › for [s] in ‹ › for poisser, etc., including a diacritic to mark mute letters ‹'s›, ‹'t›, etc. Accented vowel-letters such as ‹é, è, ê› and ‹ç› with a cedilla are the only annotations still found in the standard orthography that can be traced back to Sylvius’ proposal.

The use of diacritics to mark mute letters also found its way into some manuals for the learning of French as a second language. Sainliens, alias Holliband (1566–1580), and Milleran (1692) made regular use of such diacritics, which can now be most usefully applied to the study of liaison and elision (cf. Crevier 1993, 1994).
For instance, the following excerpt from Sainliens shows that word-final ‹t› and ‹z› were silent before a consonant (où eſ t le, preſ tez moy) and articulated before a vowel (quant à, eſ tiez en vn) and at the pause (aiſément, que):
quant à voſ tre haquenée, elle va lẻs ambles auſſi aiſément, que ſi vous eſ tiez en vn bateau: preſ tez moy voſ tre e ſcharpe de Taffetas, à cauſe de la poul dre
et du ſoleil : courage, ie voy la ville Logerons nous aux faul x 2bourgs, ou en la ville ? où eſ t le mei lleur logis?
Figure 4. Sainliens, Littleton (1566 f° A iiij r°)
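The distribution described above (silent before a consonant, articulated before a vowel or at a pause) can be stated as a small rule. The following is a minimal sketch of that rule only, not a tool used in the study: the function name `final_consonant_status` and the vowel inventory are assumptions, and the input is assumed to be lightly normalized (long ‹ſ› written as s, ‹vn› written as un) so that the first letter of the next word shows whether it begins with a vowel.

```python
import re

# Assumed vowel letters for the normalized spellings used below.
VOWELS = set("aeiouyàâéèêëîïôùû")


def final_consonant_status(phrase, finals=("t", "z")):
    """List (word, status) for each word ending in one of `finals`.

    status is 'articulated' before a vowel or at a pause (punctuation or
    end of phrase) and 'silent' before a consonant.
    """
    # Split into words and punctuation marks; punctuation counts as a pause.
    tokens = re.findall(r"[^\W\d_]+|[,.;:?!]", phrase.lower())
    results = []
    for i, token in enumerate(tokens):
        if not token.isalpha() or not token.endswith(finals):
            continue
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        if following is None or not following.isalpha():
            status = "articulated"   # at the pause
        elif following[0] in VOWELS:
            status = "articulated"   # before a vowel
        else:
            status = "silent"        # before a consonant
        results.append((token, status))
    return results


for sample in ["où est le", "prestez moy", "quant à",
               "estiez en un", "aisément, que"]:
    print(sample, final_consonant_status(sample))
```

Comparing such predictions with the mute-letter diacritics actually printed in a manual would show where the printed usage departs from the simple rule.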
Some early printers may also have put this system to more general use. Estiene Caveiller, in his edition of Burrier (1542), used “un système orthographique complet et tout à fait nouveau” (Beaulieux 1927:43). The facsimile of the printer’s notice given by Catach (1968:276) shows that mute letters are marked with a subscript ‹–›, e.g. chaſ cun, endroic t, cog noiſ tre, quel s; a subscript dot indicates the consonant value of ‹i› and ‹u/v›, e.g. ịuſque, v.ous, oeuụre; ‹ › notes palatal [ ], e.g. fuei et s; ‹ › has the same function as ‹ç›, e.g. c.onſerụa ion. An internal dot may also appear inside the body of the following letters ‹c. e. h. m. n.›, whose distinctive function is not altogether obvious.

| Authors | date | Characteristics | db |
|---|---|---|---|
| Burrier, Antoine | 1542 | Mute letters are marked with a subscript ‹–›, values of ‹i, v/u, e› distinguished with subscript or superscript dots, ‹ › for palatal [ ], ‹c, e, h, m, n› marked with internal dots. | |
| Sainliens, Claude | 1566 | Mute letters are marked with a subscript ‹–› or superscript ‹–›. | |
| Milleran, René | 1692 | Mute letters are in italics (or in roman in italicized words). | |
References

Burrier, Antoine. 1542. Les Loix, statutz et ordonnances roiaulx, faictes par les feus roys de France, puis le règne de monseigneur sainct Lois jusques au règne du roy François premier du nom à présent régnant. Les ordonnances, statutz et édicts faicts par le roy François jusques en l’an mil cinq cents quarante deus. Le tout reveu, corrigé et vérifié aus originauls... en obseruant l’orthographe des apostrophes et collision des voieles descendants des deux langues greque et latine. Paris: Estiene Caveiller (printer) for Poncet le Preux and Galiot du Pré (booksellers).
Dubois, Jacques (Sylvius). 1998. Introduction à la langue française suivie d’une grammaire, translation and notes by Colette Demaizière of Sylvius (1531). Paris: Champion.
Crevier, Isabelle. 1993. “René Milleran, grammairien et réformateur de l’orthographe au XVIIe siècle”. Travaux de linguistique et de philologie 31.347–365.
Crevier, Isabelle. 1994. La liaison à la fin du XVIIe siècle dans La Nouvelle gram̃aire françoise de René Milleran, de Saumur. Ph.D. thesis. Montréal: Université de Montréal.
Milleran, René. 1692. La nouvelle gram̃aire françoise. Marseille: Henri Brebion.
Sainliens, Claude [Claudius Holliband]. 1566. The French Littelton. A most easie, perfect, and absolute vvay to learne the frenche tongue: newly set forth by Claudius Holliband. London: Thomas.
Sainliens, Claude [Claudius Hollyband]. 1573. The French school-master. London: Abraham Veale.
Sainliens, Claude [Claudii a Sancto Vinculo]. 1580. De pronuntiatione linguae gallicae. London: Thomas.
Sylvius, Jacobus. 1531. Linguam Gallicam Isagωge, una cum eiusdem Grammatica Latino-gallica, ex Hebraeis, Graecis et Latinis authoribus. Paris: Robert Estienne.

4. Phonetic use of conventional orthography

Conventional orthography may also be adapted to render pronunciations that do not conform to the prevalent norm, or at least to what the author considered it to be. Such “phonetic” transcriptions of what could be considered genuine speech are relatively rare. One may mention Jean Héroard’s Journal (Ernst 1985, Foisil 1989), in which the author, a physician, recorded verbatim the conversations between the future King of France and his entourage, adapting the orthography to reflect the pronunciation of the young king (at least until he considered it to be no longer divergent from the adult norm). One also occasionally finds short passages in literary works where the authors tried to give an indication of some actual pronunciation (cf. Saint-Gérand, 2004). More often than not, however, this technique is used with definite parodic intentions, and the stereotypical features associated with the variety of speech portrayed are often grossly multiplied (cf. Lodge 2004:136, 137, 154–158, 173–175, 210–214, 224–225 for Parisian French; some of the documents he presents are available from http://www.ahds.ac.uk/collections/ as Paris speech in the past lll-2423-1 [as of November 30, 2006]). Divergent pronunciations are usually signaled through modifications to the current orthographic usage, in ways that are sometimes difficult to interpret.
Lamant deſpourueu de ſon eſperit eſcriuant a ſamie, uoulant parler le courtiſan.
MA Dame ie vourayme tan
May ne le dite pa pourtan
Les muſaille on derozeille:
Celuy que fet les gran merueille
Nou doin bien to couché enſemble,

Figure 5. Le beau fils de Paris (1549:270)
The omission of word-final letters in Le beau fils de Paris, as in tan (= tant) or couché (= coucher), not only indicates that they were not sounded in the speech portrayed, but is also a telltale sign that final [-t] and [-r] were retained in the “polite” pronunciation. The presence of final graphic ‹-s› in les, however, does not indicate that an [-s] or [-z] was pronounced before gran in les gran merueille, as this spelling conforms to the regular spelling convention allowing ‹-s› to be mute in such a context. It is not clear, on the other hand, why the author (or the printer) used Nou (= Nous) in Nou doin, as the regular orthographic Nous would have signaled the same pronunciation. (The retention of ‹-s› after ‹e› in les, as in les gran merueille, may actually have been prompted by the desire to keep a mark of the plural.)

| Conventional name | date | Editions and characteristics | db |
|---|---|---|---|
| Le beau fils de Paris | 1549 | editions: J. de Tournes 1549:270–276, Guiffrey 1881:670–680. dialect: Parisian popular French (?). | |
| Dialogue de trois vignerons du Maine de Jean Sousnor | 1624 | edition: s.l., BnF 8 - LB36 - 2251. dialect: Mayenne dialect. | |
| Mazarinades | 1649–1651 | editions: Rosset 1911, Deloffre 1961. dialect: Parisian rural French. | |
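The reasoning applied to Le beau fils de Paris, which compares an attested spelling with its conventional counterpart to find omitted word-final letters, can be sketched as a small helper. This is an illustration under stated assumptions rather than the procedure used for the databases: the function names are hypothetical, accents are stripped before comparison so that couché can be matched with coucher, and the pairs are the three given in the text. Whether an omission reflects a real difference in pronunciation (as with tan) or not (as with Nou) remains a matter of interpretation.

```python
import unicodedata


def strip_accents(word):
    """Remove combining accents so that 'couché' compares as 'couche'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def dropped_final_letters(attested, conventional):
    """Return the final letters of `conventional` missing from `attested`.

    Returns None when the attested form is not a shortened version of the
    conventional one (identical forms, or forms that differ elsewhere).
    """
    a = strip_accents(attested.lower())
    c = strip_accents(conventional.lower())
    if c.startswith(a) and len(c) > len(a):
        return c[len(a):]
    return None


# Pairs quoted in the discussion of Le beau fils de Paris.
for attested, conventional in [("tan", "tant"), ("couché", "coucher"),
                               ("Nou", "Nous"), ("les", "les")]:
    print(attested, conventional, dropped_final_letters(attested, conventional))
```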
References

Ernst, Gerhard. 1985. Gesprochenes Französisch zu Beginn des 17. Jahrhunderts. Direkte Rede in Jean Héroards “Histoire particulière de Louis XIII”. Tübingen: Niemeyer.
Deloffre, Frédéric (ed.). 1961. Agréables conférences de deux paysans de Saint-Ouen et de Montmorency sur les affaires du temps (1649–1651). Paris: «Les Belles Lettres».
Foisil, Madeleine (ed.). 1989. Journal de Jean Héroard, médecin de Louis XIII. Paris: Fayard.
Marot, Clément. 1549. Les Œuvres de Clément Marot, de Cahors, vallet de chambre du Roy. Lyon: Jean de Tournes.
Marot, Clément. 1881. Œuvres de Clément Marot de Cahors en Quercy, Valet de chambre du Roy, augmentée d’un grand nombre de ses compositions par ci-devant non imprimées, vol. 3, Georges Guiffrey (ed.). Paris: Imprimerie A. Quantin.
Lodge, R. Anthony. 2004. A sociolinguistic history of Parisian French. Cambridge/New York: Cambridge University Press.
Rosset, Théodore (ed.). 1911. Dix conférences en patois de la banlieue parisienne. Paris: Colin.
Sousnor, Jean, Sieur de la Nichilière (alias Jean Rousson). 1624. Dialogue de trois vignerons du pays du Maine sur les miseres de ce temps. s. l.
Saint-Gérand, Jacques-Philippe. 2004. “Échos d’une oralité problématique: représentations et reconstitutions en question”. Langue du XIXe siècle. http://www.chass.utoronto.ca/epc/langueXIX/oralite/ [as of November 30, 2006].

5. Spelling reformers

The renewed interest of the Renaissance in the pronunciation of Latin and its relation to spelling soon extended to the vernacular. Meigret and Peletier were the forerunners, in the middle of the sixteenth century, of a never-ending debate on a never fully completed reform of French orthography. In often-passionate discussions, the protagonists gave us invaluable information on the pronunciation of French, and a key to understanding the phonetic regularities embodied in the various reformed spellings put forward.

The concrete proposals made by spelling reformers are useful only insofar as a large enough body of texts survives that embodies a specific reform with relatively infrequent inconsistencies; reformers indeed often blame the carelessness of printers for many of the blemishes in their work. It must also be emphasized that, contrary to optimistic views, spelling reformers are not necessarily aware of the phonemic distinctions in their own usage of French and, if they are, do not necessarily put forward a new orthography that allows such distinctions to be represented. They are usually convinced that their own usage, which they embody in their spelling, is representative of some well-established norm, which accounts for the specific regional characteristics one often finds in their texts.

5.1. Teachers

One may distinguish two groups of reformers. The first comprises schoolteachers, tutors, or religious instructors, whose long experience teaching children or ordinary persons how to read convinced them that the reformation of spelling was a social, if not moral, necessity.
Rambaud’s spelling system (1578) was something of an abugida, in which the majority of the symbols do not represent simple indivisible sounds, but complete syllables. There were no obvious resemblances between the shapes of these symbols and those of the Latin alphabet (cf. Figure 6). Rambaud’s contribution to the debate on French orthography was probably nil.

& nonchalance, & le grand tort que nous leur faiſons: car les faiſons fail fir, & puis apres les tormentons: nous faiſons les fautes, & les poures

Figure 6. Rambaud script (pp. 62–63)
Le Gaygnard’s influence was not any greater. His book contains long lists of words, but no connected text. It is difficult to form a precise idea of how he intended specific words to be spelled, not only because he often applied his reform only to those parts of words that were relevant to the topic under discussion at that moment, but mostly because the printer did extremely poor work, sometimes making it difficult even to guess which words the author actually intended to give as examples (cf. Morin, to appear).
On le fait, en mettant la main drète au front, puis a lestomac, ensuite a l’épaule gauche, et de la a la drète en disant : In nomine Patris et Filii, et Spiritus Sancti.

Figure 7. Vaudelin script (1715:137)
Although Vaudelin’s (cf. Figure 7) and Féline’s scripts departed less radically from the conventional Latin script, they did not leave any trace either in the later adjustments made to French orthography. Linguists, however, have often used them as reliable transcriptions of earlier stages of the pronunciation of Standard Parisian French (cf. Cohen’s 1946 and Martinet’s 1946 analyses of Vaudelin’s work). Thurot (1881: LXXXVII) hailed Féline’s work as “le guide le plus sûr que je connaisse pour la prononciation de notre temps” (the most reliable guide I know for the current pronunciation [of French]).

| Conventional name | date | Editions and characteristics | db |
|---|---|---|---|
| Rambaud | 1578 | edition: J. de Tournes, Lyon. observations: syllabic abugida not abiding by the orthographic principle, not readable without complete retraining, also used for Latin and Provençal. dialect: Marseilles regional French. | |
| Le Gaygnard | 1609 | edition: J. Berjon, Paris. observations: regular alphabet without new characters. dialect: Poitou regional French. | |
| Vaudelin, Gile | 1713–1715 | editions: Vve J. Cot and J.-B. Lamesle, Paris: 1713; J.-B. Lamesle, Paris: 1715. observations: very large number of new characters, reduced e not represented, not readable without specific training. dialect: no obvious regional features. | |
| Féline, Adrien | 1848–1854 | edition: Firmin-Didot, Paris: 1851 (dictionary). editions: Guillaumin, Paris: 1848; Firmin-Didot, Paris: 1852, 1854. observations: true “phonetic alphabet”; ‹ε, › for [œ, ø], ‹û› for [u] and ‹u› for [y], ‹ , , , › for nasal vowels, ‹h› for [ ], ‹ › for [ ] and ‹ › for [ ]. dialect: minimal Parisian system. | |
References

Cohen, Marcel. 1946. Le français en 1700 d’après le témoignage de Gile Vaudelin. Paris: Honoré Champion.
Féline, Adrien. 1848. Mémoire sur la réforme de l’alphabet, à l’exemple de celle des poids et mesure. Paris: Guillaumin.
Féline, Adrien. 1851. Dictionnaire de la prononciation de la langue française indiquée au moyen de caractères phonétiques précédé d’un mémoire sur la réforme de l’alphabet. Paris: Firmin-Didot.
Féline, Adrien. 1852. Ekritur Fonetik – Tablô dε lêktur. Paris: Firmin-Didot.
Féline, Adrien. 1854. Méthode pour apprendre à lire par le système phonétique. Paris: Firmin-Didot.
Hermans, Huguette. 1985. La « déclaration des abus » d’Honorat Rambaud comme témoin du système phonologique du moyen français. Doctoral dissertation. Louvain: Katholieke Universiteit Leuven.
Le Gay[g]nard, Pierre. 1609. L’Aprenmolire françois, pour aprendre les jeunes enfans et les estrangers à lire en peu de temps les mots des escritures françoizes, nouvellement inventé et mis en lumière, avec la vraye ortographe françoize. Paris: J. Berjon.
Martinet, André. 1946 [1947]. Note sur la phonologie du français vers 1700. Bulletin de la Société de Linguistique 43.13–23. Reprinted 1969, in Le français sans fard, pp. 155–167, Paris: PUF.
Morin, Yves Charles. 2004. The implantation of French in Marseilles during the sixteenth century. The French Language and Questions of Identity, Colloquium held at Fitzwilliam College, Cambridge, 7–8 July 2004.
Morin, Yves Charles. to appear (ms 2006). “Le Gaygnard (1609): L’ancienne orthographe, la nouvelle pédagogie et la réforme orthographique”. Normes et pratiques orthographiques, Alain Desrochers, France Martineau and Yves Charles Morin (eds.). Ottawa: Éditions David.
Rambaud, Honorat. 1578. La déclaration des abus que lon commet en escrivant. Lyon: Iean de Tournes.
Thurot, Charles. 1881–1883. De la prononciation française depuis le commencement du XVIe siècle, d’après le témoignage des grammairiens, 3 vol. Paris: Imprimerie Nationale.
Vaudelin, Gile. 1713. Nouvelle manière d’écrire comme on parle en France. Paris: Veuve Jean Cot and Jean-Baptiste Lamesle.
Vaudelin, Gile. 1715. Instructions cretiennes mises en ortographe naturelle pour faciliter au peuple la lecture de la sience du salut. Paris: Jean-Baptiste Lamesle.

5.2. Humanists

Humanists were less concerned with social issues (although there are notable exceptions, e.g., Lesclache) than with internal consistency with an orthographic principle requiring spelling to be a faithful “mirror” of the pronunciation of the language, as they found argued in the works of some earlier Latin grammarians. The pronunciation on which they based their orthography strongly reflected that of their native regional variety, as contemporary witnesses did not fail to point out.
Pasquier thus commented, in a letter to Ramus, on the regional features he noted in Meigret’s, Peletier’s, Baïf’s and Ramus’ works:

Ceux qui mettent la main à la plume prennent leur origine de divers païs de la France, et est mal-aisé qu’en nostre prononciation il ne demeure tousjours en nous je ne sçay quoy du ramage de nostre païs.
‘Those who put their hand to the pen originate from various parts of France and it is difficult for one not to retain in his pronunciation some ineffable part of his birthplace’s ways of sounding’. (quoted by Firmin-Didot 1868:195, Pasquier 1723, tome 2, page 65)
As a rule, humanists’ reformed spellings retained many of the familiar aspects of traditional spelling and mostly used diacritics to further distinguish between the different phonetic values of ambiguous letters, in particular accented vowels and cedilla ‹ç›, which eventually found their way into the modern orthography. Meigret’s and Peletier’s works make up the two largest corpora of homogeneous documents. Their usefulness is nonetheless tempered by the typographical canons of that time, which limited to one the number of accents a word could receive and which did not see fit to add diacritic marks to forms that would not otherwise be ambiguous.

With the exception of Peletier’s work, documents written in a specific reformed spelling were produced within a relatively short period of time. Notwithstanding the restrictions mentioned earlier, Peletier’s work is extremely rich and presents a variegated sample of texts bearing on issues as diverse as grammar, mathematics, poetic rhetoric and actual poetical creations, written over a period of thirty years during which he adjusted his transcriptions to the changing mores in the recognized norms of pronunciation.

Reformer, date / Editions and characteristics / db

Meigret, 1548–1551
  editions: Chrestien, Paris: 1548, 1550a, 1550b, 1550c, 1551.
  observations: few new characters; sandhis indicated between host and clitics.
  dialect: Lyon regional French.

Peletier du Mans, 1550–1581
  editions: Marnef, Poitiers: 1550; J. de Tournes, Lyon: 1554a, 1555a, 1555b, 1555c; R. Coulombel, Paris: 1581; Monferran: 1996; Arnaud et al.: 2005.
  edition: J. de Tournes, Lyon: 1554b.
  observations: new characters.
  dialect: Maine regional French.

Ramus, 1562
  edition: A. Wechel, Paris: 1562.
  observations: few new characters; sandhis indicated between host and clitics.

Ramus, 1572
  editions: A. Wechel, Paris: 1572; Du Val, Paris: 1587.
  observations: relatively more new characters than previously; sandhis (except regular elision) are no longer indicated; the characters and the inventory of vowels are different from those of the previous edition.
  dialect: Picard regional French.

Poisson, Robert, 1609
  edition: J. Planchon, Paris: 1609.
  observations: some new characters, circumflex above consonants.
  dialect: Norman regional French.

Simon, Estienne, 1609
  edition: J. Gesselin, Paris: 1609.
  observations: regular alphabet without new characters, double letters for long vowels and [s]. Very limited corpus (4 pages).
  dialect: unidentified regional French.

Dobert, Antoine, 1650
  edition: F. de Masso, Lyon: 1650.
  observations: regular alphabet without new characters, ‹k› for [k].
  dialect: unidentified.

Lesclache, Louis de, 1668
  edition: L. Rondet, Paris: 1668.
  observations: regular alphabet without new characters.
  dialect: Auvergne regional French.

Lartigaut, Antoine, 1669
  edition: L. Ravenau and J. D’Ouri, Paris: 1669.
  observations: regular alphabet without new characters.
  dialect: Picard regional French.

Mercure de Trévoux, 1708
  edition: appeared in a periodical.
  observations: regular alphabet without new characters, radical simplification and neutralization of vowel distinctions.
  dialect: unidentified.

Marle, C.-L., 1827–1829
  edition: appeared in periodicals, and Dupont, Paris: 1829.
  observations: regular alphabet without new characters.
  dialect: unidentified.
References

Dobert, le père Antoine. 1650. Récréations literales et mysterieuses, où sont curieusement estalez les principes et l’importance de la nouvelle orthographe, avec un acheminement à la connoissance de la poësie, et des anagrammes. Lyon: F. de Masso.
Lartigaut, Antoine. 1669. Les progrès de la véritable ortografe ou l’ortografe francêze fondée sur ses principes. Paris: Laurant Ravenau and Jan D’Ouri.
Le Gay[g]nard, Pierre. 1609. L’Aprenmolire françois, pour aprendre les jeunes enfans et les estrangers à lire en peu de temps les mots des escritures françoizes, nouvellement inventé et mis en lumière, avec la vraye ortographe françoize. Paris: J. Berjon.
Lesclache, Louis de. 1668. Les véritables règles de l’ortografe francèze ou l’art d’aprandre an peu de tams à écrire côrectement. Paris: Laurent Rondet.
Marle, C.-L. 1827–1829. “Orthographe – plan de réforme: Appel aux Français, Réforme orthographique”. Journal de la langue française, didactique et littéraire (4 vol.).
Marle, C.-L. 1828. Ortografe raizonable mize en pratique, par Voltaire, Montesquieu, Beaumarchais, Wailly, Richelet, et tous les filozofes qui ont réfléchi sur notre langue écrite. [Signed: Marle aîné.] (Extrait du Journal grammatical). Paris: Marle.
Marle, C.-L. 1829. Manuel de la diagraphie, découverte qui simplifie l’étude de la langue, par M. Marle aîné,… Paris: impr. de P. Dupont.
Meigret, Louis. 1548. Le menteur ou l’incrédule, de Lucien de Samosate, traduit de gręc ęn françoęs, auęq vne ecritture qadrant à la prolaçion françoęse. Paris: Chrestian [sic] Wechel.
Meigret, Louis. 1550a. Le trętté de la grammęre françoęze. Paris: Chrestien Wechel.
Meigret, Louis. 1550b. Defęnses de Louís Meigręt touchant son orthographíe, contre lęs çęnsures ę calõnies de Glaumalis du Vezelet, ę de sęs adherans. Paris: Chrestien Wechel.
Meigret, Louis. 1550c. La reponse de Louís Meigręt a l’apolojíe de Iáqes Pelletier. Paris: Chrestien Wechel.
Meigret, Louis. 1551. Reponse de Louís Meigręt a la dezesperée repliqe de Glaomalis de Vezelet, transformé ęn Gyllaome dęs Aotels. Paris: Chrestien Wechel.
Mercure de Trévoux. 1708. Projet d’un Esei de granmére francése de laqele on ôte toutes lés letres inutiles, é où l’on ficse la prononsiasion de celes qi sont néceséres… – Remarques sur ce projet, en forme de lettre – Réponse de l’Auteur du projet à cette lettre. Mercure de Trévoux, November and December.
Morin, Yves Charles. 2004. “Peletier du Mans et les normes de prononciation de la durée vocalique au XVIe siècle”. Les normes du dire au XVIe siècle, Jean-Claude Arnould and Gérard Milhe Poutingon (eds), 421–434. Paris: Champion.
Pasquier, Estienne. 1956. Choix de lettres sur la littérature, la langue et la traduction, Dorothy Thickett (ed.). Geneva: Droz.
Peletier du Mans, Jacques. 1550. Dialogue/ de/ l’ortografe/ et prononciacion françoęse/. Poitiers: Marnef.
Peletier du Mans, Jacques. 1554a. L’Algèbre/. Lyon: Jean de Tournes.
Peletier du Mans, Jacques. 1554b. L’Arithmétique/. Lyon: Jean de Tournes.
Peletier du Mans, Jacques. 1555a. L’Art Poëtique/. Lyon: Jean de Tournes and Guil. Gazean.
Peletier du Mans, Jacques. 1555b. Dialogue/ de/ l’ortografe/ et prononciacion françoęse/, 2nd ed. Lyon: Jean de Tournes.
Peletier du Mans, Jacques. 1555c. L’amour des amours – Vers lirique/ s. Lyon: Jean de Tournes.
Peletier du Mans, Jacques. 1581. Euvre/ s Poetique/ s, intituléz Louange/ s avęq quelque/ s autre/ s Ecriz du mę՛me/ Auteur, ancore/ s non publiéz. Paris: Robert Coulombel.
Peletier du Mans, Jacques. 1996. L’amour des amours, Jean-Charles Monferran (ed.). Paris: Société des textes français modernes.
Peletier du Mans, Jacques. 2005. Euvres poetiques intitulez louanges aveq quelques autres ecriz, Sophie Arnaud, Stephen Bamforth and Jan Miernowski (eds.). Œuvres complètes, vol. 10, Isabelle Pantin (dir.). Paris: Honoré Champion.
Poisson, Robert. 1609. Alfabet nouveau de la vrée et pure ortografe fransoize et modele sur iselui en forme de dixionere. Paris: Jaqes Planchon.
Rambaud, Honorat. 1578. La déclaration des abus que lon commet en escrivant. Lyon: Iean de Tournes.
Ramus, Pierre La Ramée, dit Petrus. 1562. Gramere. Paris: André Wechel.
Ramus, Pierre La Ramée, dit Petrus. 1572. Grammaire de P. de La Ramée, lecteur du Roy en luniversité de Paris. Paris: André Wechel.
Ramus, Pierre La Ramée, dit Petrus. 1587. Grammaire de P. de La Ramée, lecteur du Roy en l’Université de Paris, reveue et enrichie en plusieurs endroits. Paris: D. Du Val.
Simon, Étienne. 1609. La vraye et ancienne orthographe françoise restaurée: tellement que desormais l’on apprandra parfetement à lire & à escrire & encor avec tant de facilité & breveté, que ce sera en moins de mois, que l’on ne faisoit d’années. Paris: J. Gesselin.

6. Texts with reformed spellings

A large number of other works have been published for which the printer, either on his own or at the author’s request, used non-traditional spellings for which little or no justification is presented.
One of the first systematic studies of such innovative spellings is that of Firmin-Didot (1868), who used labels such as “personal orthography”, “intelligent personal orthography”, “independent orthography”, “containing instructive details” to describe them. This was later supplemented by Beaulieux (1927) and by recent systematic studies of sixteenth- and seventeenth-century spelling systems (Catach 1968 and Biedermann-Pasques 1992). On the basis of these analyses
and of an earlier and quite succinct “survey” by Le Gaygnard (1609), I have selected a tentative list of potentially informative documents. They are listed in this section when the printer or the author used what may be considered a reformed spelling, whether original or adopted from someone else; they appear in the next section when the spelling, without being as innovative, appears to embody distributional regularities that may reveal some of the phonetic properties of the variety of French they use. The criteria used for this selection do not bear on the quality of the spelling systems, as is normally examined in works on orthography, but rather on their prospective usefulness for the discovery of new information on pronunciation; thus we include Domergue’s work in spite of Firmin-Didot’s severe critique (1868:306). Conversely, some editions that are often singled out for the originality of their spelling reforms have been omitted, such as Fouquelin (1555) or Tahureau (1555, 1565). The editions that I have consulted (Fouquelin 1557, Tahureau 1555, 1567), for instance, do not appear to warrant the large investment required to enter such documents into a computer database, as they would probably not add to our knowledge of the evolution of French pronunciation. (This conclusion, however, does not necessarily apply to the editions of Fouquelin 1555 and Tahureau 1565, which I have not yet consulted.)

Printers, as already mentioned, found it very difficult to follow the authors’ instructions, and more often than not departed from their plans. When the principles governing the new orthography are clearly presented, it is possible to identify as mistakes forms that blatantly contravene them. With texts written in a reformed orthography with little indication of its nature, however, printers’ mistakes are more difficult to notice, in particular in relatively short texts that do not allow for statistical control of the results.
The list below also contains the work of Jean-Antoine de Baïf written in his reformed spelling, most of which was not printed during his lifetime and survives only in one autograph manuscript. The Pléiade poet announced a forthcoming grammar in which he would set out the characteristics of his script; this grammar, however, was either never published or has since been lost. A comparison of his manuscript with his Ẻtrẻnes de poẻzie fransoęze an vęrs mezurẻs (1574), the only one of his printed texts using the reformed script, shows that printing imposed heavy constraints on the liberty of spelling reformers. In his manuscript, Baïf made rich use of diacritics above vowel symbols: breve vs. macron for poetic meter, grave vs. circumflex accent for phonological length, and acute accents for ‘adverbs’ as well as for some still unidentified usage (cf. Morin 1999b). The use of accents, however, is quite limited in the printed work; in particular, his new characters never appear in print combined with a circumflex accent.
Author, date / Editions and characteristics / db

Taillemont, Claude de, 1556
  editions: J. Temporal, Lyon, facsimile in Pérouse 1989.
  observations: one new character and various diacritics; cf. Beaulieux 1927:54.
  dialect: Lyon regional French.

Baïf, Jean-Antoine de, 1569–??
  editions: D. du Val 1574, Paris; Groth 1888; Bird 1964.
  observations: new characters (the 1574 printed material used the same fonts as Ramus 1572 with different phonetic values).
  dialect: no obvious regional features.

Joubert, Laurent, 1578–1579
  editions: S. Millanges, Bordeaux: 1578; N. Chesneau, Paris: 1579.
  observations: cf. Firmin-Didot 1868:203.
  dialect: (the author originates from Valence in Dauphiné).

Boessieres, Jean de, 1580
  edition: T. Ancelin, Lyon.
  observations: cf. Le Gaygnard (1609:168), Beaulieux 1927:55.
  dialect: (the author originates from Montferrand in Auvergne).

Bomier, Jan, 1596
  edition: J. Portau, Niort.
  observations: cf. Le Gaygnard 1609:168.
  dialect: unknown.

Moinet, Simon, 1663
  edition: S. Moinet, Amsterdam.
  observations: cf. Firmin-Didot 1868:230.
  dialect: (the author identifies himself as “Parisian”).

Domergue, Urbain, 1794–1806
  editions: F. Barret, Paris: 1794; Librairie économique, Paris: 1805, 1806.
  observations: cf. Firmin-Didot 1868:306, Millet 1933.
  dialect: (the author originates from Aubagne in Provence).
References

Baïf, Jean-Antoine de. 1569. Psautier [A], BnF ms. fr. 19140, f° 123–184.
Baïf, Jean-Antoine de. 1573. Psautier [B], BnF ms. fr. 19140, f° 1–121.
Baïf, Jean-Antoine de. 1574. Ẻtrẻnes de poẻzie fransoęze an vęrs mezurẻs. Paris: Denys du Val.
Baïf, Jean-Antoine de. 1570–87(?). Chansonnettes, BnF ms. fr. 19140, f° 311–374.
Baïf, Jean-Antoine de. 1888. Psaultier: metrische Bearbeitung der Psalmen, Ernst Joh. Groth (ed.). [Sammlung französischer Neudrucke herausgegeben von Karl Vollmöller, n° 9]. Heilbronn: Verlag von Gebr. Henninger.
Baïf, Jean-Antoine de. 1964. Chansonnettes – Édition critique, G. C. Bird (ed.). Vancouver: University of British Columbia Press.
Baïf, Jean-Antoine de. 1966. Chansonnettes en vers mesurés, Barbara Anne Terry (ed.). Birmingham, Alabama: Birmingham Printing.
Beauchatel, Christophle de. 1579. “Annotacions sur l’orthographie”. Traité du ris, Laurent Joubert 1579:390–407.
Beaulieux, Charles. 1927. Histoire de l’orthographe française, vol. 2. Les accents et autres signes auxiliaires. Paris: Champion.
Biedermann-Pasques, Liselotte. 1992. Les grands courants orthographiques au XVIIe siècle et la formation de l’orthographe moderne – Impacts matériels, interférences phoniques, théories et pratiques (1606–1736). Tübingen: Niemeyer.
Boessieres, Jean de. 1580. L’Arioste francoes,... avec les argumans et allégories sur châcun chant. (Suivi de: Epitre et advertissemant aux Francoés, by J. Bouchet.) Lyon: T. Ancelin.
Bomier, Jan. 1596. Les Aforismes d’Hipocrate expliquez en vers françois. Niort: J. Portau.
Catach, Nina. 1968. L’orthographe française à l’époque de la Renaissance (auteurs, imprimeurs, ateliers d’imprimerie). Geneva: Droz.
Domergue, Urbain. 1794 [an V]. La prononciation françoise, déterminée par des signes invariables, avec application à divers morceaux, en prose et en vers, contenant tout ce qu’il faut pour lire avec correction et avec goût; suivie de notions orthographiques et de la nomenclature des mots à difficulté. Paris: F. Barret.
Domergue, Urbain. 1805. Manuel des étrangers amateurs de la langue françoise, ouvrage... contenant tout ce qui a rapport aux genres et à la prononciation, et dans lequel l’auteur a prosodié, avec des caractères dont il est l’inventeur, la traduction qu’il a faite en vers françois de cent cinquante distiques latins, des dix églogues de Virgile, de deux odes d’Horace, et quelques morceaux en prose de sa composition. Paris: Librairie économique.
Domergue, Urbain. 1806. La prononciation françoise, où l’auteur a prosodié, avec des caractères dont il est l’inventeur, sa traduction en vers des dix églogues de Virgile et quelques autres morceaux de sa composition; augmentée d’un tableau des désinences françoises, pour faciliter l’étude des genres: manuel indispensable pour les étranger, amateurs de cette langue, infiniment utile aux François eux-mêmes, 2e édition. Paris: Librairie économique.
Firmin-Didot, Ambroise. 1868. Observations sur l’orthographe, ou, Ortografie française; suivies d’une Histoire de la réforme orthographique depuis le XVe siècle jusqu’à nos jours, 2nd rev. ed. Paris: Ambroise Firmin-Didot.
Fouquelin, Antoine, alias Foclin. 1555. La rhétorique françoise. Paris: Wechel.
Fouquelin, Antoine. 1557. La rhétorique francoise – Nouvellement reueüe et augmentée. Paris: Wechel.
Joubert, Laurent. 1578. Erreurs populaires au fait de la médecine et régime de santé corrigés par M. Laur. Joubert. Bourdeaus: S. Millanges.
Joubert, Laurent. 1579. Traité du ris: contenant son essance, ses causes, et mervelheus effais, curieusemant recerchés, raisonnés & observés; Un dialogue sur la cacographie fransaise: avec des annotacions sur l’orthographie. Paris: chez Nicolas Chesneau.
Millet, abbé Adrien. 1933. Les Grammairiens et la phonétique ou l’enseignement des sons du français depuis le XVIe siècle jusqu’à nos jours. Paris: Monnier.
Moinet, Simon. 1663. La Rome ridicule du sieur de Saint-Amant, travestië à la nouvêle ortografe, pur invantiön de Simon Moinêt, Parisiïn. Amsterdam: Simon Moinêt.
Morin, Yves Charles. 1999a. “L’hexamètre «héroïque» de Jean Antoine de Baïf”. Métrique du Moyen âge et de la Renaissance, Dominique Billy (ed.), pp. 163–184. Paris/Montréal: L’Harmattan.
Morin, Yves Charles. 1999b. “La graphie de Jean-Antoine de Baïf: au service du mètre!” L’écriture du français à la Renaissance – Orthographe, ponctuation, systèmes scripturaires. Nouvelle Revue du Seizième siècle 17/1.85–106.
Morin, Yves Charles. 2000. “La prononciation et la prosodie du français au XVIe siècle selon le témoignage de Jean-Antoine de Baïf”. Où en est la phonologie du français? Bernard Laks (ed.). Langue française 126.9–28.
Tahureau, Jacques. 1555. Oraison de Jacques Tahureau au Roy de la grandeur de son règne, & de l’excellance de la langue françoyse; plus Quelques vers du mesme autheur dediez à Madame Marguerite. Paris: veufve Maurice de La Porte.
Tahureau, Jacques. 1565. Dialogues non moins profitables que facetieus ou les vices d’un châcun sont repris fort aprement, pour nous animer davantage à les fuir et suivre la vertu. Paris: Gabriel Buon.
Tahureau, Jacques. 1568. Dialogues non moins profitables que facetieus ou les vices d’un chacun [sic] sont repris fort aprement, pour nous animer davantage à les fuir et suivre la vertu. Paris: Gabriel Buon.
Tahureau, Jacques. 1981. Les dialogues non moins profitables que facétieux, Max Gauna (ed.). Geneva: Librairie Droz.
Taillemont, Claude de. 1556. La Tricarite, plus qelqes chants an faveur de pluzieurs damoêzelles. Lyon: J. Temporal.
Taillemont, Claude de. 1989. La Tricarite, Gabriel A. Pérouse (ed.). Geneva: Droz.

7. Texts with personal spellings

The systematic study of distributional orthographical regularities may eventually throw some unexpected light on some aspects of pronunciation in any text, provided it is sufficiently long, even one written in the most traditional, conservative orthography. The odds, however, are against it. Firmin-Didot’s comments suggest that the three following authors may be more promising than most others. A forthcoming computer edition in Epistemon of Matthieu’s work (http://www.cesr.univ-tours.fr/Epistemon/cornucopie/Cornuc.asp) will soon allow one to put the conjecture to the test.

Author, date / Editions and characteristics / db

Matthieu, Abel, 1559–1572
  editions: R. Breton, Paris: 1559, 1560; I. de Bordeaux, Paris: 1572.
  observations: cf. Firmin-Didot 1868:191.
  dialect: no obvious regional features, the author originates from Chartres.

Godard, Jean, 1620
  edition: N. Jullieron, Lyon.
  observations: cf. Firmin-Didot 1868:213.
  dialect: no obvious regional features, the author originates from Paris.

Frémont d’Ablancourt, Nicolas, 1664
  edition: T. Jolly, Paris.
  observations: cf. Firmin-Didot 1868:257.
  dialect: no obvious regional features, the author originates from Paris.
References

Firmin-Didot, Ambroise. 1868. Observations sur l’orthographe, ou, Ortografie française; suivies d’une Histoire de la réforme orthographique depuis le XVe siècle jusqu’à nos jours, 2nd rev. ed. Paris: Ambroise Firmin-Didot.
Frémont d’Ablancourt, Nicolas. 1664. Dialogue des lettres de l’alphabet, où l’usage et la grammaire parlent, published at the end of Traduction de Lucien by Perrot d’Ablancourt, pp. 468–493. Paris: T. Jolly.
Godard, Jean. 1620. La langue françoise. Lyon: Nicolas Jullieron.
Matthieu, Abel, sieur des Moystardières. 1559. Deuis de la langue françoyse, à Jehanne d’Albret, royne de Navarre. Paris: Richard Breton.
Matthieu, Abel, sieur des Moystardières. 1560. Second Deuis et principal propos de la langue françoyse, à la royne de Navarre. Paris: Richard Breton.
Matthieu, Abel, sieur des Moystardières. 1572. Deuis de la langue françoise, fort exquis et singulier. Auesques vn autre Deuis & propos touchant la police & les estatz, où il est contenu (oultre les sentences & histoires) vn brief extraict du grec de Dion, surnommé Bouche-dor: de la Comparaison entre la royauté & la tyrannie. Paris: Iean de Bordeaux/Veufve Richard Breton.

8. Concluding remarks

The documents marked with the sign “+” in the tables above have been entered in different databases at different periods, with largely different protocols. Each was built to solve a specific problem. Some of them have been completely lemmatized; some are little more than a raw concordance. Little effort has yet been made to use a common approach, and it appears unlikely that a uniform protocol could even be profitably considered for printed and written documents to be included in a common database. Their authors referred to a common norm for the pronunciation of French that eluded them. Each had largely different conceptions of the organization of sounds in language and of the ways a spelling system had to incorporate sound distinctions. On the other hand, they shared a common vocabulary to describe the sounds of the language, one that often masked their divergences and could cover widely different realities. The experience gathered over the last twenty years has shown that each document must be analyzed on its own. There does not appear to be much to be gained by cross-referencing the data between different authors.
It appeared most profitable to examine each spelling system individually and relate it to the forms that French, the high language of the privileged classes, had taken in the different regions where it eventually became the official language.

Acknowledgments

Most of the databases presented here have been created over a period of twenty years and benefited from several grants from the Government of Québec (FCAC, FCAR) and the Social Sciences and Humanities Research Council of Canada (SSHRC). I would like to acknowledge with thanks the help of Michèle Bonin, Carole Bouchard, Catherine Courchesne, Isabelle Crevier, Jocelyne Cyr, Sophie Daoust, Danielle Fréchette, Jocelyn Guilbault,
Martine Ouellet, Julie Richard and Sandra Thibault. These databases are now being maintained and expanded within the framework of the MCRI project Modeling Change: The Paths of French, directed by F. Martineau, also funded in part by the SSHRC. I would like to thank Stéphanie Brazeau and Jaïme Dubé for their contribution. The databases for the texts with deviant orthographies (section 2) were created for a project also funded in part by the SSHRC, Évolution et variation dans le français du Québec du XVIIe au XIXe siècles, under the direction of France Martineau, with the collaboration of Alain Desrochers. We would like to acknowledge with thanks the participation of Marianne Chevrier, Jennifer Dionne, Maxime Glandon, Amélie Hamel, Philippe Leblond, Ashley Lemieux and Mylène Pérault in this project.
Resources and Tools for Old French Text Corpora
Achim STEIN

1. Introduction

This contribution presents resources and tools which have been assembled or developed for the “Amsterdam Corpus”, a previously unpublished collection of Old French literary texts originally compiled by Anthonij Dees (Vrije Universiteit Amsterdam). The project aimed at enhancing the quality of these texts and at making one of the largest machine-readable resources for Old French (3.2 million words) available to the scientific community.[1] It will be shown how different kinds of resources (lexical, textual and technical) can be profitably used for the treatment and enhancement of historical text corpora. In addition to the usual requirements corpora should meet (reusability, availability of the data), two other criteria will be central: the preservation of the original data, and the quality of the philological information, the latter being of particular importance for historical corpora.

[1] In collaboration with Pierre Kunstmann (University of Ottawa) and Martin-Dietrich Gleßgen (University of Zurich), partly funded by the Conseil de recherches en sciences humaines du Canada (CRSH), the University of Ottawa (Faculty of Arts) and the Alexander von Humboldt-Stiftung.

2. Resources for Old French

2.1. Lexical resources

1. The electronic version of the Tobler/Lommatzsch Altfranzösisches Wörterbuch (Blumenthal/Stein 2002) contains a list of the 37,000 lemmas of the dictionary as well as 15,000 cross-references, which are mostly spelling variants of the lemmas.[2]

[2] By courtesy of the publishing house, Franz Steiner Verlag, Stuttgart, these lemma lists and other materials are publicly available: http://www.uni-stuttgart.de/lingrom/stein/tl/

2. The base of verb forms: In the 1960s Robert Martin extracted the verb forms of the most important historical dictionaries and manuals for Old and Middle French up to the 16th century (Tobler/Lommatzsch 1925ss, Godefroy 1880, Huguet 1925ss) as well as of some critical editions. Each record contains the verb form, morphological information and the lemma. In a joint project between the ATILF (the former Institut National de la Langue Française, Nancy) and the Laboratoire de Français Ancien (LFA, Ottawa), these records were transformed into a database and published on the LFA web site. The verb database contains about 56,600 forms and has added 37,400 new forms to our lexicon.

3. Following R. Martin’s example, graphical forms were also extracted from all the articles of the Godefroy dictionary at the LFA. These forms are the lemmas, the variants cited after the lemmas, and the inflected forms which occur in the examples. These 115,000 forms are annotated with part-of-speech tags (however without any further morphological information) and the lemma.

4. A further resource, which has been compiled manually, is the inventory of grammatical morphemes. They were particularly problematic since the Amsterdam Corpus does not differentiate certain categories (probably because they were not of interest for Dees’ work, see Morin, forthcoming): for example, the tag “600” marks ambiguous graphical forms which can be adverbs, conjunctions or pronouns (ce, ne, que, qui, ou and their variants), and the form mais is always marked as a conjunction irrespective of its potential adverbial sense. Such cases are problematic in two respects: first, because they will have to be disambiguated manually in a revised version of the corpus, and second, because they provide the wrong distributional input for the training of the TreeTagger. For the time being we decided not to correct the manual markup of the Amsterdam Corpus but to focus on lemmatisation: about 4,000 grammatical morphemes were extracted from the Amsterdam Corpus, revised and associated with 134 Tobler/Lommatzsch lemmas. In the final markup these lemmas are marked with “_S” (see table 1). The assignment of forms to categories follows, if possible, the CATTEX conventions established for the markup of the Base de français médiéval (Heiden/Prévost 2005), although the abbreviations are not the same.
The 134 lemmas correspond to the part of speech tags listed in (1), where the number of forms and an example are given for each category.
(1) ADJ:poss (438, e.g. mien), CON:coord (23, e.g. car), DET:def (57, e.g. li), DET:demo (177, e.g. cel), DET:ind (646, e.g. alcun), DET:ndf (79, e.g. un), DET:poss (401, e.g. nostre), PRE (448, e.g. auec), PREDET:a (47, e.g. al), PREDET:de (41, e.g. del), PREDET:en (56, e.g. el), PRO:clit (23, e.g. en), PRO:demo (341, e.g. cela), PRO:ind (430, e.g. alcun), PRO:invar (247, e.g. quoi), PRO:pers (310, e.g. el), PRO:poss (238, e.g. mien).
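The listing in (1) can be read mechanically, which makes it easy to check the per-category counts against the "about 4.000" grammatical morphemes mentioned above. The following is only a sketch: the regular expression and the function name `parse_categories` are assumptions made for this illustration and are not part of the project's tools.

```python
import re

# Listing (1) as printed: tag (number of forms, example).
LISTING = (
    "ADJ:poss (438, e.g. mien), CON:coord (23, e.g. car), DET:def (57, e.g. li), "
    "DET:demo (177, e.g. cel), DET:ind (646, e.g. alcun), DET:ndf (79, e.g. un), "
    "DET:poss (401, e.g. nostre), PRE (448, e.g. auec), PREDET:a (47, e.g. al), "
    "PREDET:de (41, e.g. del), PREDET:en (56, e.g. el), PRO:clit (23, e.g. en), "
    "PRO:demo (341, e.g. cela), PRO:ind (430, e.g. alcun), PRO:invar (247, e.g. quoi), "
    "PRO:pers (310, e.g. el), PRO:poss (238, e.g. mien)"
)

def parse_categories(text):
    """Return {tag: (number_of_forms, example)} for a listing like (1)."""
    pattern = re.compile(r"([A-Z]+(?::[a-z]+)?) \((\d+), e\.g\. ([^)]+)\)")
    return {tag: (int(n), example) for tag, n, example in pattern.findall(text)}

categories = parse_categories(LISTING)
# Sum of the per-category form counts, for comparison with the figures in the text.
total_forms = sum(n for n, _ in categories.values())
print(len(categories), "categories,", total_forms, "forms")
```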
5. Manual lemmatisation: P. Kunstmann added missing lemmas for about 4.500 graphical forms for which the other resources had not provided lemma information and which had a token frequency above 3 (verbs, adjectives, adverbs) or 10 (proper nouns, nouns, adverbs).
Resources and Tools
2.2. Text resources
1. The Amsterdam Corpus of Literary Texts was compiled at the beginning of the 1980s by a group of scholars directed by Anthonij Dees and resulted in the Atlas des formes linguistiques des textes littéraires de l’ancien français (Dees 1987). The files of the original version of the Amsterdam Corpus were provided by Pieter van Reenen, a member of Dees’ project, in 1999. They contained about 200 different texts, some of them in several manuscripts, adding up to a total of 299 text samples with 3.184.834 words (tokens). These forms had been manually annotated by Dees’ team with a set of 225 numeric tags encoding part of speech and other morphological categories (e.g. “566” for verb, future tense, 3rd person, plural). Figure 1 displays some lines from Le roman de Renart.
c6/ren2: or_311 me_412 *covient_513 tel_026 chose_006 dire_592
c6/ren2: dont_341 je_411 vos_451 *puisse_521 faire_592 rire_592
c6/ren2: car_331 je_411 *sai_511 bien_311 ce_341 *est_513 la_105 pure_025
c6/ren2: que_600 de_301 sarmon_002 n_319 *avez_515 vos_451 cure_006
c6/ren2: ne_600 de_301 corsaint_002 oir_592 la_106 vie_006
c6/ren2: de_301 ce_341 ne_319 vos_451 *prant_513 nule_185 envie_005
c6/ren2: mais_331 de_301 tel_026 chose_006 qui_600 vos_451 *plaise_523
c6/ren2: or_311 *gart_523 chascuns_281 que_600 il_431 se_600 *taisse_523
c6/ren2: que_600 de_301 2bien_002 dire_592 *sui_511 en_319 voie_006
c6/ren2: et_331 toz_281 garniz_581 se_600 diex_001 me_412 *voie_523
Figure 1. Format of the original files of the Amsterdam Corpus
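The line format shown in figure 1 can be split into its parts with a few string operations. The sketch below is an illustration only, not the project's conversion tool: the names `Token` and `parse_line` are invented here, and the leading asterisk is simply recorded as a flag, since the text does not state what it encodes.

```python
from typing import NamedTuple

class Token(NamedTuple):
    form: str       # graphical form
    dees_tag: str   # three-digit numeric code from Dees' tagset
    starred: bool   # True if the form is preceded by '*' in the file

def parse_line(line):
    """Split 'c6/ren2: or_311 me_412 ...' into (text_id, [Token, ...])."""
    text_id, _, rest = line.partition(": ")
    tokens = []
    for item in rest.split():
        # The tag follows the last underscore of each item.
        form, _, tag = item.rpartition("_")
        tokens.append(Token(form.lstrip("*"), tag, form.startswith("*")))
    return text_id, tokens

text_id, tokens = parse_line(
    "c6/ren2: or_311 me_412 *covient_513 tel_026 chose_006 dire_592"
)
for token in tokens:
    print(text_id, token.form, token.dees_tag, token.starred)
```

Keeping the numeric code as a string preserves leading zeros such as “006”, which would be lost in an integer conversion.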
About a third of the texts are electronic versions of existing editions (e.g. the Miracles de Notre Dame de Chartres by Jean le Marchant, edited by P. Kunstmann, Chartres/Ottawa, 1973), but usually Dees preferred the transcriptions of manuscripts selected especially for this corpus. As figure 1 shows, the texts in the original Amsterdam Corpus were neither lemmatised nor punctuated (although the lemmas were probably available at some stage of the original project, see Dees 1987:xviii and the remarks in Morin, forthcoming). Therefore, lemmatisation and introduction of punctuation (at least for prose texts based on editions) were the main issues of the re-edition project. The Amsterdam Corpus also served as a lexical resource, providing an inventory of almost 134.000 Old French inflected forms with part of speech information, and as a base for training the TreeTagger (see section 3).
2. A number of Old French texts were published by Pierre Kunstmann on the web site of the Laboratoire de Français Ancien, University of Ottawa. Indices of two texts were exploited for our project: Le conte du Graal by
Chrétien de Troyes (edited by P. Kunstmann) and Le couronnement de Louis (edited by Y. Lepage). For each lemma (P.K. normally used the Tobler-Lommatzsch lemma), the index gives its part of speech and an ordered set of all inflected forms, each set consisting of the number of occurrences, the lemma and the list of references.
2.3. Merging the resources
Table 1 summarises the lexical and textual resources mentioned above, and lists the number of graphical forms, the code for the lemma source (used in the annotation of the lemmatised corpus), and the information provided:
Table 1. Lexical resources
Resource                 code   graphical forms   information
Tobler/Lommatzsch        T      58727             lemma, variants
verb forms (Martin)      M      71265             pos, lemma
Godefroy                 G      115498            pos, lemma
grammatical morphemes    S      3999              pos, lemma
manual lemmatisation     Z      4456              pos, lemma
indexes of LFA texts     I      41377             pos, lemma
chartes de l’Aube        C      4260              pos, lemma
Amsterdam Corpus         A      133894            pos
The resources were merged into a “lexicon of Old French forms” by converting the morphological information into a standardised format required by the TreeTagger, with a reduced set of 50 tags. The tags distinguish the part of speech and some minor categories like subtypes of adjectives (e.g. numerals) and pronouns (e.g. indefinite, interrogative). Although most of our resources provide more information (e.g. person, gender, number, verb tense), these categories are missing in the indices and in the Godefroy database. Since the definition of a complete final tagset is not an issue at this stage of the project, we reduced the tagset to the minimal information shared by all the resources. For philological reasons, it would of course be desirable to extend all the tags to the most explicit one at a later stage, for example to the tagset of the Amsterdam Corpus. After elimination of duplicate entries, the merged lexicon has 235.000 graphical forms. The operation of merging does not affect the lemma information provided by the resources: if the lemmas differ for a given graphical form, they are listed as alternatives in the lexicon (and, later, in the corpus, cf. the attribute “src” in fig. 3). The upper-case letters listed in table 1 are appended to the lemma by an underscore to indicate the resource which provided the information, as shown in table 2.
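The merging step described above can be pictured as follows. This is a minimal sketch under stated assumptions, not the project's actual programme: the function names and the nested-dictionary layout are invented for the illustration, the part of speech tags are assumed to be already reduced to the shared tagset, and the source codes attached to the toy entries are made up.

```python
from collections import defaultdict

def merge_resources(resources):
    """Merge {code: [(form, pos, lemma), ...]} into one lexicon.

    The result maps form -> pos -> {lemma: source codes}. Duplicate entries
    collapse into one, and differing lemmas for the same form and tag are
    kept side by side as alternatives.
    """
    lexicon = defaultdict(lambda: defaultdict(dict))
    for code, entries in resources.items():
        for form, pos, lemma in entries:
            sources = lexicon[form][pos].get(lemma, "")
            if code not in sources:
                lexicon[form][pos][lemma] = sources + code
    return lexicon

def format_lemmas(lemmas):
    """Render alternatives as 'lemma_CODES|lemma_CODES'."""
    return "|".join(f"{lemma}_{codes}" for lemma, codes in lemmas.items())

# Toy input; which resource supplies which entry is invented for illustration.
resources = {
    "I": [("abite", "VER", "abiter")],
    "T": [("abite", "VER", "abiter"), ("doit", "VER", "devoir")],
    "M": [("doit", "VER", "dire"), ("doit", "VER", "devoir")],
}
lexicon = merge_resources(resources)
print(format_lemmas(lexicon["abite"]["VER"]))
print(format_lemmas(lexicon["doit"]["VER"]))
```

Because the source codes travel with each lemma, the provenance of every lemma remains visible after merging, which is the property the text emphasises.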
A programme applying rules for orthographical variation provided a lemma for the 127.000 unlemmatised forms in the lexicon (these are the forms from the Amsterdam Corpus without a matching lemmatised graphical form in any of our resources). At present, about 50 rules deal with the most frequent endings of unlemmatised forms by simply stripping off a predefined string and optionally adding another string. If the result matches an existing lemma with the same part of speech tag, the lemma is adopted for the unlemmatised form, but marked with an asterisk as being “constructed” (although tests have shown that the result is very reliable). In the following example (2), the correspondence rule “u[mn]-o[mn]” has been applied to assign the lemma abondance to the form abundance:
(2) abundance NOM *{abondance;u[mn]-o[mn]}_A|abondance_T
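The rule mechanism can be sketched in a few lines. Only the rule name “u[mn]-o[mn]” comes from the text; its reading as a regular-expression substitution, the second rule “z-t”, and all function and variable names are assumptions made for this illustration.

```python
import re

# (rule name, pattern to strip, string to add). The first rule is named in the
# text; reading it as "u before m/n becomes o" is an assumption. The second
# rule is hypothetical and only illustrates an ending replacement.
RULES = [
    ("u[mn]-o[mn]", re.compile(r"u([mn])"), r"o\1"),
    ("z-t", re.compile(r"z$"), "t"),
]

def construct_lemma(form, pos, known_lemmas):
    """Return (lemma, rule_name) if a rule maps the form onto an existing
    lemma with the same part of speech tag, otherwise None."""
    for name, pattern, replacement in RULES:
        candidate = pattern.sub(replacement, form)
        if candidate != form and (candidate, pos) in known_lemmas:
            return candidate, name
    return None

# Toy set of (lemma, pos) pairs standing in for the lemmatised resources.
known = {("abondance", "NOM"), ("abitant", "NOM")}
print(construct_lemma("abundance", "NOM", known))
print(construct_lemma("abitanz", "NOM", known))
print(construct_lemma("abundance", "VER", known))
```

Requiring the same part of speech tag on both sides is what keeps such simple string rules from over-generating: a candidate that exists only under another tag is rejected.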
The final step is the comparison of each lemma with the list of the Tobler-Lommatzsch lemmas: each lemma that appears in the Tobler-Lommatzsch is marked with a plus sign, e.g. “+abiter_I”.3
Table 2. The merged lexicon of Old French forms
form           pos tag1   lemma1            pos tag2   lemma2
abitablement   NIL        habitablement_G
abitacion      NOM        +abitacion_T
abitacle       NOM        +abitacle_T
abitance       NIL        habitance_G       NOM        +abitance_T
abitant        NOM        +abitant_T        VER        *+abiter_IT
abitanz        NOM        *+abitant_T
abitanze       NIL        habitance_G
abitast        VER        <nolem>
abitateur      NIL        habitateur_G
abitation      NOM        *+abitacion_T
abitations     NOM        *+abitacion_T
abitator       NOM        +abitator_T
abite          VER        +abiter_I
abitee         NIL        habiter_G         VER        *+abiter_IT
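The lemma fields of the merged lexicon pack several pieces of information into one string: the asterisk for constructed lemmas, the plus sign for lemmas found in the Tobler-Lommatzsch list, an optional rule in braces, and the source codes after the underscore. A reader of the lexicon might unpack them as in the sketch below. The regular expression is inferred from the printed examples only and is an assumption, as are the function name and the dictionary keys.

```python
import re

# One alternative, e.g. "*+abiter_IT" or "*{abondance;u[mn]-o[mn]}_A".
ENTRY = re.compile(
    r"(?P<constructed>\*)?"                       # '*': lemma constructed by a rule
    r"(?P<in_tl>\+)?"                             # '+': found in the Tobler-Lommatzsch list
    r"(?:\{(?P<ruled>[^;}]+);(?P<rule>[^}]+)\}"   # {lemma;rule}, as in example (2)
    r"|(?P<plain>[^_|{}]+))"                      # or a plain lemma
    r"_(?P<sources>[A-Z]+)$"                      # source codes from table 1
)

def parse_annotation(annotation):
    """Split a lemma field into its alternatives and flags."""
    if annotation == "<nolem>":
        return []
    parsed = []
    for alternative in annotation.split("|"):
        m = ENTRY.match(alternative)
        if m is None:
            raise ValueError(f"unrecognised lemma annotation: {alternative!r}")
        parsed.append({
            "lemma": m["ruled"] or m["plain"],
            "constructed": m["constructed"] is not None,
            "in_tl": m["in_tl"] is not None,
            "rule": m["rule"],
            "sources": list(m["sources"]),
        })
    return parsed

print(parse_annotation("*+abiter_IT"))
print(parse_annotation("*{abondance;u[mn]-o[mn]}_A|abondance_T"))
```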
3 The Tobler/Lommatzsch Altfranzösisches Wörterbuch is still a reference for many scholars, at least as long as the Dictionnaire étymologique de l’ancien français (DEAF) is not completed.
Table 2 shows some sample entries from the resulting lexicon. The first column contains the graphical form, the following pairs of columns contain the part of speech tag and the lemma. Ambiguous forms (like abitant in the example) have more than one tag-lemma combination. Some forms like
abitablement have no part of speech tag (hence “NIL”), others like abitast have no lemma. For abitance two different lemmas are provided (from the Godefroy and the Tobler/Lommatzsch respectively), and for abitanz, abitation(s), and abitee, the “morphological” rules suggested a matching lemma (*), which could also be verified in the Tobler/Lommatzsch lemma list (+).
3. Part of speech tagging and lemmatisation for Old French
3.1. Methods
The TreeTagger is a probabilistic part of speech tagger which uses decision trees. It was developed by H. Schmid (Institut für Maschinelle Sprachverarbeitung, IMS, University of Stuttgart) and has been applied to several modern languages. Before discussing specific problems concerning its use for Old French, some general features of the TreeTagger will be presented. Unlike other probabilistic tagging methods, which have difficulties in estimating small probabilities accurately from limited amounts of training data, the TreeTagger avoids the sparse data problem by using a binary decision tree which determines the appropriate size of the context used to estimate the transition probabilities. Possible contexts are not only trigrams, bigrams and unigrams, but also other kinds of contexts (e.g. tag-1=ADJ and tag-2≠ADJ and tag-2≠DET, for technical details see Schmid 1994). When a word is looked up, the lexicon of the TreeTagger is searched first. If the word is found there, the corresponding tag probability vector is returned. If not, the TreeTagger tries to guess the right tag from the last letters of the word (suffix probabilities). So far TreeTagger modules (parameter files) have been developed for English, German, Modern French (Stein/Schmid 1995) and Italian.4 The TreeTagger consists of two separate programmes for training and tagging. The input for the training consists of several files: the lexicon, as described above, some minor files containing the tags for open classes (i.e. the categories for which a suffix decision tree is built in order to guess the category of unknown words), and the training text.
4 The TreeTagger and the parameter files are freely available for Linux, Solaris, and Mac OS-X (see WWW address given at the end of this contribution).
3.2. Results achieved in the Amsterdam Corpus
For the training of the TreeTagger, the Amsterdam Corpus was split into a larger training part (about 2.7 million words) and a smaller evaluation part (0.5 million words). The output of the training is a single binary parameter file containing the lexicon and the decision tree data. This parameter file and the tagger software are sufficient to annotate new texts; the original lexicon files are not needed. Table 3 shows some lines of pos-disambiguated output5, one word per line, with the part of speech tag in the second and the lemma(s) in the third column. The TreeTagger selects the tag with the highest probability and inserts the corresponding lemma (special TreeTagger options allow for displaying all the possibilities with their probabilities, for manual disambiguation). All lemma forms corresponding to the selected part of speech tag are shown, graphical variants (e.g. doner, donner) as well as different solutions for a given form (devoir and dire for doit; vie and voie for vie).
Table 3. TreeTagger output
form         part of speech tag      lemma
la           DET:def:obj:femi:sg     le
bone         ADJ:obj:femi:sg         bon
vie          NOM:obj:femi:sg         vie1|voie
de           PRE                     de
l            DET:def:obj:femi:sg     le
anciainne    ADJ:obj:femi:sg         UNKNOWN
chevalerie   NOM:obj:femi:sg         chevalerie
et           CON:coord               et
comant       ADV                     coment
lon          PRO:invar               on
doit         VER:pres:3:sg           devoir|dire
sant         NOM:obj:masc:sg         sentir
doner        VER:infi                doner|donner
as           PREDET:a:obj:femi:pl    a+le
janz         NOM:obj:femi:pl         gent1
a            PRE                     a
pie          NOM:obj:masc:sg         pie
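The two-stage lookup described in section 3.1 (lexicon first, suffix probabilities as a fallback) can be illustrated with a deliberately simplified sketch. It is not the TreeTagger's implementation: a flat longest-suffix table stands in for the suffix decision tree, the function names are invented, and all probabilities are made-up toy values.

```python
def lookup(word, lexicon, suffix_probs, max_suffix=4):
    """Return a {tag: probability} dictionary for a word.

    The lexicon is consulted first; for unknown words the longest known
    suffix (up to max_suffix letters) supplies the guess.
    """
    if word in lexicon:
        return lexicon[word]
    for n in range(min(max_suffix, len(word)), 0, -1):
        guess = suffix_probs.get(word[-n:])
        if guess is not None:
            return guess
    return {}

def best_tag(probs):
    """Pick the most probable tag, or None if nothing is known."""
    return max(probs, key=probs.get) if probs else None

# Toy data: the numbers are invented for illustration.
lexicon = {"doit": {"VER": 1.0}}
suffix_probs = {
    "erie": {"NOM": 0.9, "ADJ": 0.1},
    "er": {"VER": 0.8, "NOM": 0.2},
}
for word in ("doit", "chevalerie", "doner", "lon"):
    print(word, best_tag(lookup(word, lexicon, suffix_probs)))
```

In the real tagger the lexical probabilities are combined with the contextual probabilities from the decision tree; the sketch shows the lexical side only.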
In view of the fact that the resources are incomplete, the lemmatisation results are encouraging. For evaluation, the Amsterdam Corpus was split into two parts: 2.7 million words for training and 500.000 words for evaluation. In the evaluation corpus, 97.8% of the tokens and 76.6% of the types have been lemmatised. Table 4 details this global result according to parts of speech.
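The gap between the token figure and the type figure follows from how the two are counted: frequent forms are likely to be covered by the lexicon and weigh heavily in the token count, whereas every rare form counts once as a type. A minimal sketch of the two measures, with an invented toy sample and hypothetical names:

```python
from collections import Counter

def coverage(tokens, lemma_of):
    """Return (type coverage, token coverage) of a lemma dictionary."""
    counts = Counter(tokens)
    types_hit = sum(1 for form in counts if form in lemma_of)
    tokens_hit = sum(n for form, n in counts.items() if form in lemma_of)
    return types_hit / len(counts), tokens_hit / sum(counts.values())

# Toy sample: 'anciainne' is the only form without a lemma.
sample = ["de", "la", "de", "vie", "anciainne", "de", "la"]
lemma_of = {"de": "de", "la": "le", "vie": "vie"}
type_cov, token_cov = coverage(sample, lemma_of)
print(f"types: {type_cov:.1%}, tokens: {token_cov:.1%}")
```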
5 Taken from Li abrejance de l’ordre de chevalerie, as included in the Amsterdam Corpus.
Table 4. Lemmatisation
pos      types   lemmatised   tokens   lemmatised
ADJ      2774    76.75%       24795    96.83%
ADV      1285    78.75%       36923    98.89%
CON      136     100.00%      26497    100.00%
DET      264     100.00%      49427    100.00%
INT      28      57.14%       251      75.70%
NOM      9483    72.92%       80050    95.08%
NPR      2185    30.07%       8800     70.52%
PON      11      100.00%      14212    100.00%
PRE      177     100.00%      42914    100.00%
PREDET   29      100.00%      6838     100.00%
PRO      438     100.00%      79774    100.00%
PROCON   18      100.00%      21832    100.00%
VER      7282    82.95%       07652    97.02%
total    24110   76.60%       399965   97.80%
The precision of the pos tag assignment is 92.7%, the most frequent error being the confusion of the clitic pronoun en with the preposition en (see table 5). Some other errors reveal shortcomings in the manual annotation of the Amsterdam Corpus.
Table 5. The most frequent pos tagging errors
Errors   Form   manually assigned   TreeTagger
2011     en     PRO:clit            PRE
976      ne     PRO:clit            PROCON
722      a      VER                 PRE
634      ne     PROCON              PRO:clit
351      i      PRO:pers            PRO:clit
331      en     PRE                 PRO:clit
328      a      PRE                 VER
310      de     NOM                 PRE
188      c      PRO:invar           PROCON
184      n      PROCON              PRO:clit
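A table of this kind can be derived by comparing the manual tag and the automatic tag token by token. The sketch below shows one way of doing so; the function name and the toy rows are invented, and “precision” is taken here simply as the share of tokens on which the tagger agrees with the manual annotation.

```python
from collections import Counter

def tagging_errors(rows):
    """rows: (form, manual_tag, tagger_tag) triples.

    Returns the agreement rate and the error types sorted by frequency.
    """
    rows = list(rows)
    errors = Counter((form, manual, auto) for form, manual, auto in rows if manual != auto)
    correct = sum(1 for _, manual, auto in rows if manual == auto)
    return correct / len(rows), errors.most_common()

# Toy evaluation data, invented for illustration.
rows = [
    ("en", "PRO:clit", "PRE"),
    ("en", "PRO:clit", "PRE"),
    ("a", "VER", "PRE"),
    ("de", "PRE", "PRE"),
]
precision, ranked = tagging_errors(rows)
print(f"precision: {precision:.1%}")
for (form, manual, auto), n in ranked:
    print(n, form, manual, auto)
```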
4. Text format
4.1. Building the complete corpus
Figure 2 summarises the processing of the resources mentioned in the previous paragraphs. The guiding idea was to preserve the original information provided by A. Dees’ project, and to add new information in a transparent way. Therefore, new versions of the corpus (e.g. after improvement of lexical or bibliographical resources) are always generated from scratch, i.e. taking the original text files as a starting point, and proceeding as described here. The merging of the lexicon resources was presented in section 2. In the following steps (frame “Tagging” in fig. 2), the merged lexicon and the training corpus are required by the training module of the TreeTagger to build the parameters, a binary file containing the Old French lexicon and the lexical and contextual probabilities (see section 3 for technical details).
Figure 2. Building of corpus and tools
The process of re-formatting the original text files is represented in the frame “XML Mark-up” (fig. 2). For each of the texts, the corresponding record of the bibliography is read in and the relevant information is included in the XML element <subcorpus> (see following section). Each text is split into sentences or verses and words, and the morphological information already present in the original files is stored in the attribute “deespos”. The graphical forms, temporarily separated from the XML annotation, are now processed by the TreeTagger, using the parameter file, and producing the tagged and lemmatised corpus in which the XML mark-up is then re-introduced. Finally, the lexicon is used again in order to perform some final corrections (e.g. resolve differences between the manually attributed part of speech and the automatic annotation).
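The re-introduction of the mark-up around the tagged forms can be sketched with Python's standard XML library. This is an illustration under stated assumptions, not the project's formatting tool: the function name, the two-entry translation table (its pairs are read off figure 3) and the tagger output passed in are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of the table translating Dees' numeric codes into
# readable tags; the two pairs are taken from figure 3.
DEES_TO_POS = {"104": "DET:def:obj:masc:pl", "004": "NOM:obj:masc:pl"}

def build_sentence(line_no, dees_tokens, tagger_output):
    """Combine original tokens and tagger output into an <s> element.

    dees_tokens:   [(form, deespos), ...] from the original files
    tagger_output: [(taggerpos, lemma), ...] in the same order
    """
    s = ET.Element("s", {"line": str(line_no)})
    for (form, deespos), (taggerpos, lemma) in zip(dees_tokens, tagger_output):
        word = ET.SubElement(s, "word", {
            "pos": DEES_TO_POS.get(deespos, "UNKNOWN"),
            "deespos": deespos,       # original manual annotation, kept as is
            "taggerpos": taggerpos,   # category assigned automatically
            "lemma": lemma,
        })
        word.text = form
    return s

sentence = build_sentence(
    1,
    [("les", "104"), ("talens", "004")],
    [("DET:def:obj:masc:pl", "le"), ("NOM:obj:masc:pl", "talent")],
)
xml_text = ET.tostring(sentence, encoding="unicode")
print(xml_text)
```

Storing the manual and the automatic category in separate attributes keeps the original annotation intact, in line with the principle of adding information transparently.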
4.2. The XML format
The XML annotation of this first version of the Nouveau Corpus d’Amsterdam (2006) is rather basic and was meant to meet the requirements of two query tools, Xaira (Oxford University Computing Service) and Twic (ILR, Stuttgart University).
<subcorpus id="abe" deaf="JMeunAbB"
  titreDees="J. de Meun, Traduction de la première epître de P. Abélard, v. 1-821"
  editionDees="éd. Ch. Charrier, Paris 1934"
  manuscritDees="Paris, Bibl. Nat., fr. 920"
  regionDees="REGION PARISIENNE" codeRegional="54" coefficientRegional="84"
  vers="non" ponctuation="non" mots="18183" passage="v. 1-821/1588"
  commentairePhilologique="éd. C. CHARRIER, ms. BN fr. 920"
  qualite="ms3" commentaireForme="nil" auteur="JEAN DE MEUN"
  dateComposition="1280" dateManuscrit="1395"
  lieuComposition="frc." lieuManuscrit="Paris"
  genre="nil" traditionTextuelle="nil" analyses="nil">
<s line="1">
<word pos="NOM:suj:masc:pl" deespos="003" lemma="essemple" src="+I">essamples
<word pos="VER:ppre:pl" deespos="586" lemma="UNKNOWN">attaignens
<word pos="PROCON" deespos="600" lemma="en1+le" src="S">ou
<word pos="VER:ppre:pl" deespos="586" lemma="UNKNOWN">appaissans
<word pos="ADV" deespos="311" lemma="sovent" src="*IT">souvent
<word pos="DET:def:obj:masc:pl" deespos="104" lemma="le" src="S">les
<word pos="NOM:obj:masc:pl" deespos="004" lemma="talent" src="*+IT">talens
<word pos="PREDET:de:obj:masc:pl" deespos="114" lemma="de+le" src="S">des
[. . .]
Figure 3. The XML annotation of the “Nouveau Corpus d’Amsterdam”
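Because the bibliographical information sits in attributes of <subcorpus>, a query can be restricted to regions or periods by matching regular expressions against those attributes. The sketch below uses Python's standard library rather than Xaira or Twic. It assumes well-formed input with closing tags and an enclosing <corpus> element, neither of which is shown in figure 3; the second subcorpus and the function names are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# Minimal well-formed sample; the subcorpus "xyz" is invented.
SAMPLE = """<corpus>
<subcorpus id="abe" regionDees="REGION PARISIENNE" dateComposition="1280">
<s line="1"><word pos="ADV" deespos="311" lemma="sovent" src="*IT">souvent</word></s>
</subcorpus>
<subcorpus id="xyz" regionDees="NORMANDIE" dateComposition="1190">
<s line="1"><word pos="ADV" deespos="311" lemma="sovent" src="*IT">sovent</word></s>
</subcorpus>
</corpus>"""

def select(root, **patterns):
    """Yield the subcorpora whose attributes fully match every pattern."""
    for sub in root.iter("subcorpus"):
        if all(re.fullmatch(p, sub.get(attr, "")) for attr, p in patterns.items()):
            yield sub

def find_lemma(root, lemma, **patterns):
    """Return (date, text id, form) for each occurrence, sorted by date."""
    hits = []
    for sub in select(root, **patterns):
        for word in sub.iter("word"):
            if word.get("lemma") == lemma:
                hits.append((sub.get("dateComposition"), sub.get("id"), word.text))
    return sorted(hits)

root = ET.fromstring(SAMPLE)
print(find_lemma(root, "sovent"))
print(find_lemma(root, "sovent", regionDees=r"REGION PAR.*", dateComposition=r"12\d\d"))
```

Projecting the date and the text identifier into each hit is what allows the occurrences to be sorted by date or region afterwards.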
At text level, the XML element <subcorpus> delimits each of the 299 texts in the corpus and contains the attribute-value pairs for the corresponding bibliographical record. Although not all attributes have values yet (e.g. genre=“nil”), considerable progress has been made in 2006 in the bibliographical documentation, and figure 3 shows the XML format of the upcoming second version with fairly consistent information about dates (date of composition: 1280, date of manuscript: 1395) and quality (of transcription or of the critical edition: qualite=“ms3” indicates that the transcription of the manuscript is of rather poor quality) of the texts. Gleßgen and Gouvert (forthcoming) give a detailed account of the philological aspects of this part of the project. It is needless to say that this work is essential for the quality of a medieval corpus, since it provides information without which corpus-based induction would be impossible. The diatopic classification of the texts was A. Dees’ original concern and is expressed by the attributes “regionDees” (e.g.
“région parisienne”), “codeRegionalDees” (e.g. “54”, which is synonymous with “région parisienne”), and “coefficientRegionalDees” (e.g. “84” on a closeness-to-region scale with a theoretical maximum of 100). Parallel to Dees’ calculated regional classification, the partners in Zurich (M.-D. Gleßgen) will indicate the traditional localisation of the texts (“lieuComposition”) and manuscripts (“lieuManuscrit”). Finally, the attribute “deaf” provides an interface to the extensive bibliography of the Dictionnaire étymologique de l’ancien français (DEAF), also available on-line.6
With respect to query techniques, the attribution of this information to the element <subcorpus> has been preferred to other forms of representation (e.g. a separate header in TEI style) because it allows the definition of a subcorpus using lists of values or regular expressions, to limit the query to specific regions or periods. Since the values can also be projected into the search results, occurrences can easily be sorted by date, region etc.
At word level, the element <word> encloses the graphical form and has the following attributes:
• the value of “pos” is the translation of the numerical part of speech code in “deespos”;
• the value of “deespos” is the numerical part of speech code present in the original files (see fig. 1);
• the value of “taggerpos” is the category automatically assigned by the TreeTagger (see section 3). For the sake of clarity, fig. 3 does not display the part of speech information contributed by the TreeTagger: it is useful for the analysis of discrepancies between manual and automatic annotation;
• the value of “lemma” is the lemma (or concurring lemmas in ambiguous cases), assigned by the TreeTagger, as specified in the lexicon (see section 2);
• the value of “src” indicates the source(s) of the lemma information (see section 2).
6 Note that the names of attributes may change in future editions of the corpus. For up-to-date information please refer to the documentation provided with the corpus.
5. Conclusion
The example of the Amsterdam Corpus has shown how lexical and textual resources of various origin can be combined and suitably used to create linguistic tools for the automatic treatment of Old French and to improve the quality of the corpus in several respects:
• Availability: the corpus and tools (TreeTagger parameters, formatting and conversion tools) are available for research purposes free of charge;
• Fidelity: the corpus respects the original data (Dees’ corpus). New versions are dynamically created from the original files;
• Reusability: the corpus uses a standardised, system-independent format (XML);
• Quality: the edition includes as much philological information as possible or provides links to existing bibliographies.
Bibliography
Blumenthal, Peter & Stein, Achim (eds.) (2002): Electronic edition of the Altfranzösisches Wörterbuch von Tobler, Lommatzsch etc. Stuttgart: Franz Steiner Verlag.
Dees, Anthonij (1987): Atlas des formes linguistiques des textes littéraires de l’ancien français. Tübingen: Niemeyer.
Gleßgen, M.-D. & Gouvert, Xavier (forthcoming): “La base textuelle du Nouveau Corpus d’Amsterdam: objectifs, résultats, perspectives” - in: Kunstmann/Stein (forthcoming).
Godefroy, Frédéric (1880): Dictionnaire de l’ancienne langue française et tous ses dialectes. Paris.
Huguet, Edmond (1925ss): Dictionnaire de la langue française du seizième siècle. Paris: Champion.
Kunstmann, Pierre et al. (eds.): Ancien et moyen français sur le Web: Enjeux méthodologiques et analyse du discours. Ottawa: Les Éditions David.
Kunstmann, Pierre & Stein, Achim (eds.) (forthcoming): Le Nouveau Corpus d’Amsterdam. Actes de l’atelier de Lauterbad, 23-26 février 2006. Stuttgart: Steiner.
Morin, Yves-Charles (forthcoming): “Histoire du corpus d’Amsterdam: le traitement des données dialectales” - in: Kunstmann/Stein (forthcoming).
Nouveau Corpus d’Amsterdam (2006): Corpus informatique de textes littéraires d’ancien français (ca 1150-1350), établi par Anthonij Dees (Amsterdam 1987), remanié par Achim Stein, Pierre Kunstmann et Martin-D. Gleßgen. Stuttgart: Institut für Linguistik/Romanistik.
Prévost, Sophie & Heiden, Serge (2005): “Étiquetage d’un corpus hétérogène de français médiéval: enjeux et modalités” - in: Kabatek, Johannes & Pusch, Claus & Raible, Wolfgang (eds.): Romance Corpus Linguistics II: Corpora and Diachronic Linguistics. Tübingen: Narr.
Schmid, Helmut (1994): Probabilistic Part-of-Speech Tagging using Decision Trees - in: Sima’an, K. & Bod, R. & Krauwer, S. & Scha, R. (eds.): Proceedings of the International Conference on New Methods in Language Processing (NeMLaP’94), Manchester September 1994. Manchester: UMIST.
Schmid, Helmut (2000): YAP: Parsing and Disambiguation With Feature-Based Grammars. PhD thesis, Universität Stuttgart.
Stein, Achim & Schmid, Helmut (1995): Etiquetage morphologique de textes français avec un arbre de décisions. Traitement automatique des langues, vol. 36, no. 1-2: Traitements probabilistes et corpus, 23-35.
Stein, Achim (2003): Étiquetage morphologique et lemmatisation de textes d’ancien français - in: Kunstmann, Pierre et al. (eds.): Ancien et moyen français sur le Web: Enjeux méthodologiques et analyse du discours. Ottawa: Les Éditions David, 273-284.
Tobler, Adolf & Lommatzsch, Erhard (1925ss): Altfranzösisches Wörterbuch. Berlin etc.: Weidmann.
WWW addresses
Dictionnaire étymologique de l’ancien français: http://www.deaf-page.de
Nouveau Corpus d’Amsterdam: http://www.uni-stuttgart.de/lingrom/stein/corpus
Laboratoire de Français Ancien (LFA, Ottawa): http://www.uottawa.ca/academic/arts/lfa/
Tobler/Lommatzsch Altfranzösisches Wörterbuch, version informatisée: http://www.uni-stuttgart.de/lingrom/stein/tl/
TreeTagger: Old French Parameter Files: http://www.uni-stuttgart.de/lingrom/stein/forschung/resource.html
2. Corpus Linguistics in Linguistic Informatics
Introduction
Yuji KAWAGUCHI
As mentioned earlier, the Center of UBLI has been interested in corpus-based linguistic analysis and corpus-based language education right from the outset, since corpus-based perspectives are crucial to all research related to the real use of any given language. However, most of these analyses concern written language; few have dealt with spoken language.1 It is only in recent years that research on spoken language corpora has been taken seriously. UBLI has conducted field surveys since 2004 and built spoken language corpora for French, Spanish, Italian (Salentino dialect), Russian, Malaysian, Turkish, Japanese, and Canadian bilinguals.2 Since building corpora for spoken language involves persistent, long hours of work such as transcription, analyzing spoken language corpora has only recently become possible. Although there has been some progress in research on learner corpora in Japanese and English, this is at a fairly nascent stage of development. These steady fundamental studies have given us a fruitful but challenging direction toward the analysis of spoken language corpora and their applications with regard to our teaching material. The construction of spoken language corpora and their linguistic analysis are crucial not only for the linguistic study of the various uses of these languages but also for foreign-language teaching and learning; this is because these corpora would provide authentic usages in various communicative aspects of verbal interactions.
1 Takagaki, Toshihiro, Susumu Zaima, Yoichiro Tsuruga, Francisco Moreno-Fernández and Yuji Kawaguchi (2005) Corpus-Based Approaches to Sentence Structures, Usage-Based Linguistic Informatics 2, Amsterdam/Philadelphia: John Benjamins.
2 cf. http://www.coelang.tufs.ac.jp/english/multilingual_corpus.html
This section comprises ten different articles on both written and spoken corpora by the members and young research assistants of UBLI; each article will be briefly outlined in the following paragraphs.
Yoichiro Tsuruga, in his “Transitive Direct, Transitive Indirect and Pronominal Verb Constructions in French—The Case of approcher—,” claims that it is not always easy to semantically distinguish between the three constructions of approcher with two arguments. His corpus analysis first reveals that contrary to the descriptions of French dictionaries, N0-approcher-N1 is the most frequent and has a rich semantic variety. This construction is
oriented toward an “abstract displacement” and is specifically used for “seizing or attaining progressively.” Second, N0-se-approcher-de-N1 is used for a “concrete displacement” as well as for an abstract one. Third, N0-approcher-de-N1 is clearly less frequent than the preceding two and is oriented toward an abstract movement. Finally, N0-se-V is oriented toward a concrete movement with N0.hum and N0-V toward an abstract use with N0.temps.
Yuji Kawaguchi, in his “Demonstratives in De Bello Gallico and Li Fet des Romains—A Parallel Corpus Approach to Medieval Translation—,” analyzes Li Fet des Romains, the Old French translation of Caesar’s Gallic War. Fundamental difficulties arise in the parallel corpus analysis of medieval translation due to the relatively free nature of medieval translation texts. Although the corpus is rather modest in size, demonstratives can be considered as linguistic categories that are related to the coherence of a text and can remain relatively intact even though translators modify their originals.
In her “Patient-Orientedness in Resultative Compound Verbs in Chinese,” Keiko Mochizuki examines “Patient-orientedness” in Chinese by studying a database of V1V2 resultative compound verb examples in Chinese. She provides the following three pieces of evidence supporting “Patient-Orientedness”: (1) the suppression of an Agent-Subject of V1, (2) the suppression of an internal argument of V1, and (3) the suppression of the agency of V1 in the case in which V2 predicates an external argument.
Takayuki Miyake, in his “Corpus Research in Chinese and Its Application to Chinese Language Teaching—A Case of Localizers in Chinese—,” applies usage-based data to Chinese language teaching. As a case study, he examines the usage of localizers in the Chinese corpus and suggests that the “~边” type should be introduced at an elementary level, mainly because this occurs with the highest frequency, and its usage, particularly in colloquial language, is frequent.
The results of this corpus survey provide us with some new perspectives on the lexical selection in teaching materials.
Shinjiro Kazama, in his “Rhetorical Questions with Interrogative Markers in Nanai,” collects sentences that include interrogative markers from his database comprising twelve texts. From these, he then selects tokens of rhetorical questions and sorts them on the basis of their usage and morphology. Finally, he refers to some other languages such as Japanese. In this article, he shows that in Nanai rhetorical questions, indefiniteness and total negation may be encoded by interrogative markers and expressions. In suggesting the involvement of some other formal strategies in the rhetorical questions of Nanai and their possible correlation with other languages, he
emphasizes the importance of this kind of study in UBLI in light of its cross-linguistic and inductive approach.
Isamu Shoho, in his “Vacillation in the Selection Complementizers of Malay Transitive Verbs,” proposes the hypothesis that the selection of the complementizers supaya and untuk reflects the structural differences between transitive verbs that take the complementizer untuk and those that take the complementizer supaya; untuk forms a nexus relation with the post-verb argument, whereas supaya functions as a conjunction forming a noun clause. He finds that of the seven transitive verbs that are analyzed here, five show an irregularity with regard to the selection of complementizers. Considering that the other behaviors conform with those of transitive verbs in the same class, he contends that the irregularities shown by the five transitive verbs result from an erroneous analogy with another class of transitive verbs. Further, Shoho explains why such an erroneous analogy occurs.
Hiroki Nomoto and Isamu Shoho, in their “Voice in Relative Clauses in Malay—A Comparison of Written and Spoken Language—,” maintain that based on their analysis of corpora, Written Malay and Colloquial Malay are the two distinct varieties of the classical diglossia and as such have different voice systems. In particular, Written and Colloquial Malay differ in two ways. First, Colloquial Malay lacks the bare passive, which is used in Written Malay. Second, the morphological active and passive are used more frequently than the bare active and passive in Written Malay, whereas the bare active is more frequently used in Colloquial Malay. They also discuss two theoretical issues, i.e., diglossia and the typological characterization of the voice in Malay.
Asako Yoshitomi, in her “Testing the Primacy of Aspect and Reverse Order Hypothesis in Japanese Returnees—Towards Constructing a Corpus of Second Language Attrition Data—,” introduces a project to construct corpora that consist of learner language oral data obtained from Japanese learners of English and presents a preliminary study using one of these corpora, namely, the corpus of second language attrition data of returnee children. The process of English attrition is examined in two returnees from the perspectives of the Primacy of Aspect Hypothesis and the Reverse Order Hypothesis. The results indicate that the verb tense-aspect system regresses according to the predictions made by the two hypotheses; this implies that language learning and language attrition are universal processes that work in reverse directions and that the use of corpora in SLA research should be expanded to studies in second/foreign language attrition and relearning processes. Ayano Suzuki and Tae Umino, in their “Corpus-based Analysis of Lexical Errors of Advanced Japanese Learners,” analyze the lexical errors of advanced Japanese learners by using a preliminary version of the “Japanese
Composition Database of Advanced Learners” that was developed in the UBLI project. The analysis reveals that failure to understand collocation is an obstacle to the use of natural Japanese expressions by advanced Japanese learners. Therefore, based on this analysis, they have developed a trial textbook to teach collocation. Tomoko Tokita and Yuji Kawaguchi, in their “Syntactic Patterns of Intrasentential Code-Switching in the Discourse of Japanese-English Bilingual Families,” analyze Japanese-English bilingualism; they focus on the intrasentential code-switching patterns that emerged from their 3.5-hour-long corpus of two families living in Vancouver. It appears that for intrasentential code-switching to be acceptable, the spoken sentences should necessarily satisfy the matrix language syntax rules.
Transitive Direct, Transitive Indirect and Pronominal Verb Constructions in French: The Case of approcher

Yoichiro TSURUGA

Introduction

The problem of using direct and indirect constructions constitutes the core of French syntax. In the direct construction N0-V-N1 (N0: “subject”, V: “verb”, N1: “object”, cf. Gross 1975), the positions of the nominal constituents are relevant, because there is no relational indicator of their syntactic relations. On the other hand, in the indirect construction N0-V-prép-N1 (prép: “preposition”), owing to the presence of a relational indicator (a preposition), the positions in question evidently become less relevant (cf. Martinet 1960, 2005). The presence or absence of a relational indicator typically renders the relational «signifiés» transmitted by the two objects, direct and indirect, distinct in most cases. However, there are some cases in which their difference is not clear enough even for native French speakers. The distinction between the following constructions, for example, is not necessarily clear out of context: traiter-N1 as opposed to traiter-de-N1 (“to treat”), toucher-N1 as opposed to toucher-à-N1 (“to touch”), and approcher-N1 as opposed to approcher-de-N1 (“to approach”). In the case of approcher, we must also add a pronominal verb construction, se-approcher-de-N1. In On approche le rivage (“We are coming close to the bank”), On approche de Léa (“We come near Léa”), and On s’approche du balcon (“We come close to the balcony”), N1 expresses someone or something that is in the direction toward which one moves. In this, we can recognize a certain identity of the syntactic relational «signifié».
Here it is noteworthy that in the case of the verb approcher, contrary to most French verbal constructions, it is the preposition de (“of, from”) that indicates the point of destination, whereas in other verbal constructions, this preposition refers to the point of source from which one begins a movement: compare aller à Paris (“to go to Paris”) with venir de Paris (“to come from Paris”), and envoyer des livres à Paris (“to send books to Paris”) with envoyer des livres de Paris (“to send books from Paris”). What is unusual and interesting is that in se-approcher-de-N1 (“to come near N1”), the preposition de functions exactly as in se-écarter-de-N1 (“to become more
distant from N1”) which expresses precisely the opposite movement. We must recognize that N0-approcher-N1, N0-approcher-de-N1 and N0-se-approcher-de-N1 indicate a movement of N0, which consequently means that N0 becomes progressively nearer to N1 («proche» (“near”) is morphologically related to «approcher»). In this manner, the movement is conceived from N1’s side. With this fact in mind, we propose in this article to examine the uses of approcher and the possible differences between the three constructions in question, which in appearance seem to express the same movement. However, we must state clearly that our analysis is based on only one corpus. We must content ourselves with examining an essential part of the dynamic synchrony of approcher (Martinet 1979).

1. Descriptions from French dictionaries

When one is interested in clarifying the uses of a verb, it is indispensable to refer to some of the representative dictionaries. In addition to consulting three language dictionaries, we also analysed the Grand dictionnaire encyclopédique Larousse. The examples cited from these dictionaries are simplified below, and those belonging to technical domains or idioms are not provided, except where necessary.

1.1. Grand Larousse de la langue française (1971)

N0-V-N1-de-N2 : Le valet approche le foin de la charette. (“The servant brings the hay near the cart.”) (=Mettre plus près de quelque chose. Synonyme: rapprocher (“to put closer to something. Synonym: rapprocher”))
N0-V-N1 : Vous approchez ce rivage. (“You are getting near that bank.”) (=Classique. Venir près de quelque chose (“Classic. to come near something”)) Ils approchent la noblesse. (“They access the nobility.”) (=Avoir accès auprès de quelqu’un (“to have access to someone”))
N0-V-de-N1 : Il approche de moi. (“He comes near me.”) (=Vieux. Venir auprès de quelqu’un (“Old. to come close to someone”)) Il approche de la cinquantaine. (“He is getting on for fifty.”) (=Figuré. Être près d’atteindre un moment (“Figurative. to be about to attain”)) Ce mérite n’approche pas du vôtre. (“This merit is not comparable to yours.”) (=Figuré. Être comparable à quelque chose, Synonyme: égaler (“Figurative. to be comparable to something, Synonym: égaler”))
N0-V : Le soir approchait. (“Night was drawing on.”) (=Être imminent (“to be imminent”)) Voyageur, approche et respire enfin cette odeur. (“Traveler, come nearer and breathe in this smell at last.”) (=Venir tout près (“to come quite near”))
N0-se-V-de-N1 : Je m’approchais du balcon. (“I was getting near the balcony.”) (=S’avancer pour venir auprès de quelque chose (“to advance to come close to something”)) L’intrigue s’approche de la réalité. (“The plot resembles reality.”) (=Avoir de la ressemblance avec (“to have a resemblance to”))
N0-se-V : La nuit s’approche. (“Night is drawing on.”) (=Être imminent, devenir proche dans le temps (“to be imminent, become close in time”))
The description of the transitive use above is not sufficiently rich. In addition to the construction with three arguments (N0, N1, de-N2), there are those that mean “coming near someone or something” and “accessing someone”. The latter use can be considered as a figurative extension of the former concrete use. With regard to the indirect construction N0-V-de-N1 above, it is essentially a matter of figurative and abstract uses (cf. the corpus analysis below). We must note that among all the uses with two arguments N0 and N1 above, the concrete use of N0-se-V-de-N1 is the only contemporary one that means a “concrete displacement” without any figurative nuance. The injunctive use of N0-V does not have the second argument (nor the first, the subject, which is omitted). Further, we note that N0-V and N0-se-V are very similar in meaning (“temporal imminence”) and that N0-V-de-N1 and N0-se-V-de-N1 are also similar in that they both express “resemblance”.

1.2. Grand Robert de la langue française (2001)

N0-V-N1-de-N2 : On approche un fauteuil de la table. (“They bring an armchair near the table.”) (=Mettre une chose près de quelqu’un ou de quelque chose (“to put one thing near someone or something”))
N0-V-N1 : Ne m’approchez pas. (“Don’t come near me.”) (=Venir près de quelqu’un (“to come near someone”)) Ils approchent les rois et les grands. (“They access the kings and the greats.”) (=Avoir libre accès auprès de (“to have free access to”))
N0-V-de-N1 : Il approchait de nous. (“He was coming near us.”) (=Venir près de quelqu’un ou de quelque chose (“to come near someone or something”)) Le navire approche du rivage. (“The ship is getting near the bank.”) On approche de la mort. (“They are coming near death.”) (=Figuré. Regarder en face (“Figurative. to look face to face”)) On approche du but. (“They are attaining the goal.”) (=Être sur le point d’atteindre (“to be on the point of attaining”)) On approche de la cinquantaine. (“They are getting on for fifty.”) On approche de la perfection. (“They are getting near perfection.”) (=Abstrait (“Abstract”)) Nos cathédrales approchent de la beauté du Panthéon. (“Our cathedrals come near the beauty of the Panthéon.”) Les moeurs qui approchent des nôtres nous touchent. (“The customs that resemble ours affect us.”) (=Vieux, Classique (“Old, Classic”))
(N0-V) : Approchez, venez ici. (“Come near, come here.”) (=Emploi sans de-N1 ou intransitif (“Use without de-N1 or intransitive”)) La rumeur approche. (“The clamor is getting near.”) Ce n’est pas tout à fait ce que vous dites, mais vous approchez. (“That is not exactly what you say, but you are getting near.”) (=Emploi sans de-N1. Être sur le point d’atteindre (“Use without de-N1. to be on the point of attaining”)) Le printemps approche. (“Spring is coming.”) (=Emploi sans de-N1 ou intransitif (“Use without de-N1 or intransitive”))
N0-se-V-de-N1 : Elle s’approche du feu. (“She comes near the fire.”) (=Aller se mettre auprès de (“to go to get close to”)) Flaubert s’approche de la perfection. (“Flaubert comes near perfection.”)
N0-se-V : Approchez-vous, Néron. (“Come near, Néron.”) (=Emploi sans de-N1 (“Use without de-N1”)) Le départ s’approche. (“The departure is coming near.”) (=Être imminent (“To be imminent”))
As in the preceding case, the description of the direct construction is not sufficiently detailed, whereas the indirect construction is analyzed using a variety of examples. The pronominal verb construction is also not sufficiently detailed; however, we notice that it has concrete and abstract uses. We can note that the intransitive construction is not treated independently. It is considered as an intransitive use or an absolute use of the transitive indirect construction.

1.3. Grand dictionnaire encyclopédique Larousse (1982)

N0-V-N1-de-N2 : On approche le bureau de la fenêtre. (“They bring the desk near the window.”) (=Mettre près de quelqu’un ou de quelque chose (“to put near someone or something”))
N0-V-N1 : Ne l’approchez pas. (“Don’t come near him.”) (=Se placer près de (“to take one’s place near”)) Il approche les plus hautes personnalités du pays. (“He accesses the highest personalities of the country.”) (=Avoir accès auprès (“to have access to”)) Il approche la trentaine. (“He is getting on for thirty.”) (=Être sur le point d’atteindre (“to be on the point of attaining”)) Il approche la solution du problème. (“He is approaching the solution of the problem.”) (=Être sur le point d’atteindre (“to be on the point of attaining”)) Il approche un marché. (“He goes to market.”) (=Prendre des contacts pour obtenir un contrat (“to get in touch to obtain a contract”))
Être-Vé : Il a une idée approchée de l’événement. (“He has an approximate idea of the event.”) (=Être approximatif (“to be approximate”))
N0-V-de-N1, N0-V, N0-se-V-de-N1, N0-se-V : Le bateau (approche+s’approche) de la côte (+ : “or”). (“The ship is getting near the coast.”) (=Venir se placer près de quelque chose ou de quelqu’un, être sur le point d’arriver (“to come to take place near something or someone, to be on the point of arriving”)) Le médecin s’approche du lit de la malade. (“The doctor comes near the bed of the female patient.”) On (approche+s’approche) de la vérité. (“We come near the truth.”) (=Abstrait. Atteindre un certain objectif, un certain état (“Abstract. to attain a certain objective, a certain state”)) Les bruits de pas (approchent+s’approchent) de la porte. (“The sounds of steps are getting near the door.”) (=Être de plus en plus nettement perceptible, en parlant d’un son (“to be more and more clearly perceptible, concerning a sound”)) La victoire s’approche. (“The victory is coming near.”) (=Langage soutenu. Être imminent en parlant d’une date, d’un événement (“Formal register. to be imminent concerning a date, an event”)) L’examen approche. (“The examination is getting near.”) Roman qui (approche+s’approche) de la réalité. (“Novel that is close to reality.”) (=Ressembler (“to resemble”))
It is noteworthy that the above description presents a clear perspective on the actual state of the constructions, despite the limited space of an encyclopedia. The description is characteristic particularly in the following two points: first, the transitive direct construction, especially with two arguments, is very detailed (cf. the corpus analysis below), and second, the similarity between the transitive indirect and pronominal verb constructions stands out in sharp relief. The mention of the use of the past participle should also be noted.

1.4. Trésor de la langue française (1971)

Among the four analysed dictionaries, it is the Trésor de la langue française that is the most specialized in contemporary French (although we can find examples dating from the end of the 18th century). The Grand dictionnaire encyclopédique does not seem to consider the examples from the 17th and 18th centuries. The first two dictionaries presented above, the Grand Larousse and the Grand Robert, describe the French language since the 17th century. Needless to say, this difference appears in the descriptions. It must be mentioned that in the Trésor’s description, the division is first made between “I. Obsolescent, literary or idiomatic uses” and “II. Living uses”. Further, as living uses, the Trésor’s description provides only pronominal verb and intransitive constructions: N0-se-V-de-N1, N0-se-V and N0-V. This is a definite position, the validity of which will be verified in the following sections.

I. Obsolescent, literary or idiomatic uses

N0-V-N1-de-N2 : Elle approchait son tabouret près de la statue. (“She was bringing her stool near the statue”) (=Placer près de. Synonyme intensif plus usité: rapprocher (“to put near. More used intensive synonym: rapprocher”)) On approchait son visage si près que... (“We were bringing our face so close...”)
N0-V-N1 : On approche la terre. (“We are getting near the land”) (=Maritime. Faire route vers une terre à vue, exemple de 1797 (“Maritime. to take one’s route to a land in sight, example of 1797”)) Les sculpteurs, après avoir dégrossi une figure en marbre, l’approchent à la pointe ou au ciseau. (“The sculptors, after giving a rough figure in marble, approach it with a point or a chisel”) (=Sculpture. Amener un ouvrage à fin (“Sculpture. to bring a work to completion”)) On approche la célébrité nouvelle. (“They access the new celebrity.”) (=Avoir accès auprès de (“to have access to”))
N0-V-de-N1 : On approche de leurs ministres. (“They come near their ministers.”) (=Vieilli. Venir près de quelque chose ou de quelqu’un. Synonyme plus courant: s’approcher de (“Obsolescent. to come near something or someone. More current synonym: s’approcher de”)) J’approchais de cette ville. (“I was getting near that town.”) (=Par extension, courant. Être sur le point d’arriver en un lieu (“in a wider sense, current. to be on the point of arriving at a place”)) Nous approchions du jour des morts. (“We were coming near All Souls’ Day.”) (=Par extension, courant. Être sur le point d’atteindre un moment de temps (“in a wider sense, current. to be on the point of attaining a temporal moment”)) Il approchait de la trentaine. (“He was getting on for thirty.”) J’approchai d’un état héroïque. (“I got near a heroic state.”) (=Figuré. Littéraire. Être près d’atteindre (“Figurative. Literary. to be about to attain”)) On approche de la perfection. (“They are getting near perfection.”) (=Figuré. Courant. Être près d’atteindre (“Figurative. Current. to be about to attain”)) Les didelphes approchent beaucoup de l’homme. (“The didelphic animals resemble human beings a lot.”) (=Vieilli. Ressembler à (“Obsolescent. to resemble”)) Rien n’approchait des magnificences accumulées dans la forteresse. (“Nothing equaled the accumulated magnificences of the fortress.”) (=Égaler (“to equal”))

II. Living uses

N0-se-V-de-N1 : Il s’approcha des gens. (“He came near the folk.”) (=Venir près de quelqu’un ou de quelque chose (“to come near someone or something”)) Il s’approche du lit. (“He comes near the bed.”)
N0-se-V : Trois hommes s’approchaient. (“Three men were coming near.”) (=N0-V) La nuit s’approchait. (“Night was drawing on.”) (=Être imminent. N0-V, plus courant dans ce sens (“to be imminent. N0-V, more current in this meaning”))
N0-V : Deux sergents approchaient. (“Two sergeants were coming near.”) (=Venir plus près de (“to come closer to”)) Ces pas lourds approchent. (“The sounds of those heavy steps are getting near.”) (=Par métonymie (“by metonymy”)) Le bien heureux jour approchait. (“The very happy day was coming.”) (=Être imminent (“to be imminent”)) On (fait+laisse+voit) approcher Léa. (“They (make+let+see) Léa come near.”) (=Plus usuel que «On fait s’approcher Léa», etc. (“more usual than «On fait s’approcher Léa», etc.”))
The transitive direct uses with two arguments are not sufficiently varied, and they are not considered to be very living ones by themselves. (Recall that more technical examples are not provided above: On approche la bête (“They rejoin the dog that got ahead.”), Ce taureau a approché plusieurs vaches (“This bull covered several cows”), etc.) It is noteworthy that neither the examples of concrete displacement (approcher la terre is from the 18th century) nor those of temporal imminence are given as direct constructions (cf. the analysis below). On the other hand, the description of the transitive indirect construction is very rich, despite these uses being classified as not living, which is fairly similar to the description offered by the Grand Robert in section 1.2 above.

1.5. Recapitulation

First and foremost, we can generally comment on the fact that the description of N0-V-N1 is neither sufficiently detailed nor varied. On this point, the rich descriptive variety provided by the Grand dictionnaire encyclopédique Larousse is particularly noteworthy. Second, we can note that the description of the pronominal verb construction is considerably simple in almost all the dictionaries. Once again, on this point, we must note the description offered by the Encyclopédie Larousse, which regroups the transitive indirect and pronominal verb constructions and thereby succeeds in showing the variety in the latter. Further, we must state that the N1.abs (abs: “abstract”) of N0-se-V-de-N1 is neglected only in the Trésor. Let us observe that the N1.abs can also provide the «signifié», which is close to that of «être en train d’atteindre l’objet» (“to be attaining the object”). Let us add that the possibility of N1.temps in the pronominal verb construction is not suggested in any of the four given dictionaries (temps: “temporal moment, period”). This possibility or impossibility must be verified in our corpus below.
Third, let us recall that in the Grand Robert and the Trésor, the variety of uses of N0-V-de-N1 is brought into sharp relief. Even with regard to this point, the encyclopedic dictionary succeeds in presenting this variety by reunifying transitive indirect and pronominal verb constructions. Fourth, concerning the constructions N0-V and N0-se-V, without N1, we can say that the Grand Robert and the Trésor are the most detailed by providing the following variety: N0.hum-V (hum: “human”), N0.non-hum-V
(non-hum: “non-human”), N0.temps-V, on the one hand, and N0.hum-se-V, N0.temps-se-V, on the other; in contrast, the Encyclopédie is the simplest, showing only N0.temps-V and N0.temps-se-V. With regard to this point, it will be interesting to observe the frequency of these constructions in our corpus.

2. Description of a corpus: the newspaper Le Monde 1994

We must admit that we limit ourselves to the description of the examples surveyed in one year, 1994, of the newspaper Le Monde. The frequencies of the constructions are as follows: of the total of 754 occurrences, N0-V-N1: 293, N0-se-V-de-N1: 190, N0-se-V: 52, N0-V-de-N1: 74, and N0-V: 104; the remainder consists of infrequent constructions of other types, to which we shall refer below.

2.1. N0-V-N1 (293 occurrences)

Contrary to what is apparently suggested by the descriptions of the dictionaries presented above, the transitive direct construction is by far the most frequent and the most varied. The examples provided by the three language dictionaries are as follows: (a) N0.hum-V-N1.non-hum-de-N2.loc (loc: “concrete place”), (b) N0.hum-V-N1.loc, and (c) N0.hum-V-N1.hum. The construction (a), which has three arguments, is mentioned in all kinds of French dictionaries, but we prefer to set it aside here because it occurs only twice in our corpus. With regard to the second construction (b), we must note that its N1.loc can also be realized by N1.hum. In Ne m’approchez pas (“Don’t come near me”), me ordinarily means a simple place (which is also the case with ce chien in N’approchez pas ce chien (“Don’t get near that dog”) or with le quai in Le bateau approche le quai (“The ship is getting near the quay”)). In contrast, On approche la célébrité means “accessing”. Thus the distinction between the two constructions can be rather nuanced.
What is essential here is rather to understand that it is impossible to cover all the varied transitive direct examples of our corpus with only the constructions (b) and (c). Therefore, drawing our inspiration from a short survey of our corpus, we would like to propose the following four groups with different «signifiés» of construction and the possible combinations of the arguments. However, we must state clearly at this point that the nominal characterization is only approximate and that the characteristics (“human”, “time”, etc.) are not necessarily exclusive of each other.

1. For concrete displacement, we propose N0(hum+anim+véhic)-V-N1(hum+anim+non-hum+loc) (anim: “animate”, véhic: “vehicle”). Example: Le bateau approche le quai. (“The ship is getting near the quay.”) In this frame, any N1 can be interpreted as a concrete place.
2. For the meaning of “accessing, frequenting”, we propose, in principle, N0.hum-V-N1.hum. Example: On approche la célébrité. (“They access the celebrity.”) The semantic varieties expand particularly with the intervention of abstract elements. We can thus add the following combinations.

3. For the abstract displacement that can also be interpreted as “being comparable to”, we propose N0(hum+non-hum)-V-N1(non-hum+temps+chiffré+état+abs) (chiffré: “degree, state the quantification of which is given by a figure”, état: “state, condition”). Example: On approche la quarantaine (“We are getting on for forty”), Leur revenu approche le nôtre (“Their income is comparable to ours”). (It is possible to reverse the characterizations of N0 and N1, which would give N0(non-hum+temps+chiffré+état+abs)-V-N1(hum+non-hum), which is not, however, suggested by any of the above-mentioned dictionaries. Example: La mort approchait Maurice. (“Death was coming near Maurice.”)) In this frame of construction, N1 can be considered as an abstract place. One of the important characteristics of this frame of construction, particularly compared with the following one, is that it is a question of an abstract and involuntary movement.

4. Finally, for the meaning of “attempting to seize, grasp, attain progressively the subject, problem, object, goal”, we propose N0(hum+non-hum)-V-N1(hum+non-hum+abs). Example: Il approche dans cette œuvre les clochards (“He approaches in this work the vagabonds”), Ce film approche la sauvagerie consciente (“This film approaches the conscious brutality.”). In this frame, N1 is considered as an object or a goal to be attained. Thus, contrary to number 3 above, it is a question of a voluntary act.

It is not always easy to draw the distinction between the third and fourth groups above, because “voluntary” is not always clearly distinguished from “involuntary” and, consequently, neither is “movement” from “act”.
These difficulties are fairly natural, since there is only one formal frame of construction. Despite this fact, we believe that this classification can serve to “approach” the rich semantic variety provided by the one formal frame. Let us now examine the combinations that actually appeared in our corpus. The constructions that have a human subject are: Nhum-V-Nhum: 68 occurrences, Nhum-V-Ntemps: 10, Nhum-V-Nloc: 15, Nhum-V-Nnon-hum: 56, Nhum-V-Nabs: 25, Nhum-V-Nchiffré: 2, Nhum-V-Nanim: 7. Those having a non-human subject are: Nnon-hum-V-Nchiffré: 76, Nnon-hum-V-Ntemps: 3, Nnon-hum-V-Nnon-hum: 20, Nnon-hum-V-Nabs: 5, Nnon-hum-V-Nhum: 4, Ntemps-V-Nchiffré: 1, Ntemps-V-Nhum: 1. We shall now reclassify these constructions according to the four groups proposed above.
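The figures reported so far can be cross-checked with a short tally. The following Python sketch is only an illustration: the variable and function names are our own assumptions, and only the counts themselves are taken from section 2 and from the list of combinations above.

```python
# Illustrative tally of the counts reported in the text (Le Monde, 1994).
# All identifiers are hypothetical; only the figures come from the article.

TOTAL_OCCURRENCES = 754

# Frequencies of the five main constructions (section 2).
construction_counts = {
    "N0-V-N1": 293,
    "N0-se-V-de-N1": 190,
    "N0-se-V": 52,
    "N0-V-de-N1": 74,
    "N0-V": 104,
}

# Attested combinations for N0-V-N1, keyed by (type of N0, type of N1).
direct_combinations = {
    ("hum", "hum"): 68,
    ("hum", "temps"): 10,
    ("hum", "loc"): 15,
    ("hum", "non-hum"): 56,
    ("hum", "abs"): 25,
    ("hum", "chiffré"): 2,
    ("hum", "anim"): 7,
    ("non-hum", "chiffré"): 76,
    ("non-hum", "temps"): 3,
    ("non-hum", "non-hum"): 20,
    ("non-hum", "abs"): 5,
    ("non-hum", "hum"): 4,
    ("temps", "chiffré"): 1,
    ("temps", "hum"): 1,
}


def share(count, total):
    """Return count as a percentage of total, rounded to one decimal."""
    return round(100 * count / total, 1)


# Occurrences covered by the five listed constructions, and the remainder.
listed = sum(construction_counts.values())
other_types = TOTAL_OCCURRENCES - listed

# Totals for the direct construction, overall and with a human subject.
direct_total = sum(direct_combinations.values())
human_subject = sum(
    n for (n0, _), n in direct_combinations.items() if n0 == "hum"
)

print("listed constructions:", listed, "| other types:", other_types)
print("N0-V-N1 total:", direct_total, "| human subject:", human_subject)
for name, count in construction_counts.items():
    print(name, share(count, TOTAL_OCCURRENCES), "%")
```

Under these figures, the five listed constructions account for 713 of the 754 occurrences, and the fourteen combinations sum to the 293 occurrences reported for N0-V-N1, of which 183 have a human subject.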
2.1.1. Concrete displacement: 30 occurrences

2.1.1.1. Nhum-V-Nloc (9 occurrences)
(1) L’île est encore féodale [...], et il y a des territoires, [...], qu’on ne peut approcher que par la mer. (“there are territories that we can approach only by sea”)

This is a typical example of “concrete displacement”. However, it is not always the case that a concrete place N1 suggests a concrete displacement. « Nloc » merely indicates “a nominal element that can easily be interpreted as a concrete place.”

2.1.1.2. Nhum-V-Nnon-hum (6 occurrences)
(2) [...] alors que la couleur perçue ne repose sur aucun support matériel et qu’en l’approchant on découvre qu’il n’y en a pas, mais le vide [...]. (“getting near the perceived color, we discover”)

2.1.1.3. Nhum-V-Nhum (8 occurrences)
(3) Le fait qu’un spectateur armé ait pu approcher de si près le prince a semé le doute sur le dispositif de sécurité. (“an armed bystander could come so close to the prince”)

The construction Nhum-V-Nhum often carries the meaning of “accessing”, which is, however, not the case above.

2.1.1.4. Nhum-V-Nanim (7 occurrences)
(4) [...] avec l’espoir de se faire engager dans un cirque. Ils obtiennent le droit d’approcher les fauves et montent un extraordinaire numéro. (“They obtain the right to get near the big game”)

Let us note that in our corpus, we come across N1.anim only when it is a question of a concrete displacement. In general, we must recognize that any N1.conc (conc: “concrete”), whether it is Nnon-hum, Nhum, or Nanim, can be conceived as a concrete place. As shown above, the direct construction can mean a concrete displacement; however, it must be noted that in comparison with the other uses that we shall see below, this use (30 occurrences among 293) is not frequent (cf. the frequencies of “concrete displacement” in other constructions below).

2.1.2. “To access”: 57 occurrences

The construction N0.hum-V-N1.hum often carries the meaning in question, but this is not always the case; moreover, this construction is not even indispensable to this meaning, at least in appearance.
2.1.2.1. Nhum-V-Nhum (50 occurrences)
(5) [...] les personnes qui ont approché Artaud ont pu être frappées, […]. (“the people who accessed Artaud”)

2.1.2.2. Nhum-V-Nnon-hum (3 occurrences)
(6) Nous n’avons pas approché Air France à propos de Sabena […]. (“We did not access Air France”)

2.1.2.3. Nnon-hum-V-Nhum (4 occurrences)
(7) Une société de bâtiment et de travaux publics […], l’aurait alors approché en lui conseillant […]. (“A building or civil engineering and construction company would then have accessed him”)

The constructions Nhum-V-Nnon-hum and Nnon-hum-V-Nhum should be considered as extensions of the construction Nhum-V-Nhum, which is typical of the meaning in question. The relation between the construction Nhum-V-Nhum and the meaning of “accessing” is thus very close.

2.1.3. Abstract displacement: 134 occurrences

2.1.3.1. Nnon-hum-V-Nchiffré (76 occurrences)
(8) Le Parti communiste, dont le score approchait 18 % [...]. (“the score was getting near 18 %”)

2.1.3.2. Nnon-hum-V-Ntemps (3 occurrences)
(9) Après deux années de reprise, le cycle conjoncturel américain approche une période de maturation [...]. (“the American conjunctural cycle is getting near a maturation period”)

2.1.3.3. Ntemps-V-Nchiffré (1 occurrence)
(10) Devenus des “opérateurs”, ces ouvriers, dont la moyenne d’âge approche quarante-cinq ans, […]. (“these workers, whose average age is getting on for forty-five years”)

2.1.3.4. Ntemps-V-Nhum (1 occurrence)
(11) Cziffra était un être fort, que la mort avait plusieurs fois approché, [...]. (“a strong human being that death had got near several times”)

2.1.3.5. Nnon-hum-V-Nabs (3 occurrences)
(12) [...] la dissertation est une torture dont nous avons suffisamment souffert en bas âge, qui est supposée approcher la vérité par le raisonnement […]. (“a torture that is supposed to get near the truth”)

2.1.3.6. Nnon-hum-V-Nnon-hum (17 occurrences)
(13) [...] dans des pays qui pourtant ont un revenu par tête approchant le nôtre que [...]. (“an income per person comparable to ours”)

2.1.3.7. Nhum-V-Nchiffré (2 occurrences)
(14) Grand champion de ce début d’année, Madame Doubtfire [...] approche les 850 000 entrées en cinq semaines. (“Madame Doubtfire gets near the 850 000 entries”)
2.1.3.8. Nhum-V-Ntemps (9 occurrences)
(15) La moitié des magistrats ont été recrutés après 1981, observe un président de cour d’assises, qui approche la cinquantaine. (“a president of Assize Court, who is getting on for fifty”)

2.1.3.9. Nhum-V-Nabs (7 occurrences)
(16) Nommez-moi une guitariste qui ait approché le talent de Jimi Hendrix ou de Jeff Beck. (“a woman guitarist who got close to the talent of Jimi Hendrix or Jeff Beck”)

2.1.3.10. Nhum-V-Nnon-hum (12 occurrences)
(17) [...] sur 110 m haies, le Britannique Colin Jackson, a approché son record du monde avec un chrono de 12 sec 94 ; [...]. (“the British, Colin Jackson, has got close to his world record”)

2.1.3.11. Nhum-V-Nloc (3 occurrences)
(18) Vegard Ulvang n’est pas en mesure de répondre aux aspirations de ses supporters: blessé à une cuisse, il n’a pas approché le podium des 30 km […]. (“he did not get close to the 30 km rostrum”)

An “abstract displacement” can be realized if one of N0 and N1 is not concrete. Moreover, this is not even a necessary condition. Recall that the “abstract displacement” is conceived above as “involuntary” and that it describes, rather, a situation in which N0 and N1 come closer together, which does not imply that non-concrete displacement is always limited to what is involuntary. A displacement that is neither concrete nor involuntary is classified in the following group as a voluntary act of “attempting to seize, grasp, attain progressively”.

2.1.4. “To attempt to seize, attain progressively”: 72 occurrences

2.1.4.1. Nhum-V-Nnon-hum (35 occurrences)
(19) Tout mon travail consiste à tenter de décrire ce à quoi ressemble le coeur maori. C’est ainsi que j’approche la question. (“I approach the question”)

2.1.4.2. Nhum-V-Nabs (18 occurrences)
(20) Avec, en fin de lecture, le sentiment d’avoir approché le mystère de la mise en scène, autour duquel trop de livres tournent […]. (“the feeling of having seized the mystery of production”)

2.1.4.3. Nhum-V-Nhum (10 occurrences)
(21) Quant à Dino Risi, il approche dans Barboni (1946) les clochards qu’il retrouvera, sur le mode grotesque, dans les Monstres. (“he approaches the vagabonds in Barboni (1946)”)

2.1.4.4. Nhum-V-Nloc (3 occurrences)
(22) Mais on ne saurait la (=la ville) comprendre en l’approchant seulement par la souffrance, […]. (“we could not understand it by attempting to seize it only in its suffering”)
2.1.4.5. Nhum-V-Ntemps (1 occurrence)
(23) […] un projet qui permettrait d’approcher de façon dynamique notre époque à travers ceux qui la font avancer, [...]. (“to attempt to seize our era in a dynamic way”)

2.1.4.6. Nnon-hum-V-Nnon-hum (3 occurrences)
(24) Aucune mesure relevant des pouvoirs publics ne peut approcher à elle seule un tel objectif. (“No action belonging to the public powers can attain such an objective only by itself”)

2.1.4.7. Nnon-hum-V-Nabs (2 occurrences)
(25) [...] qu’un seul film ait approché depuis la sauvagerie consciente et la brutalité raisonnée de celui-ci. (“only one film approached since then the conscious savagery and brutality of the latter”)

Recall that in principle, all sorts of nominal elements can be conceived as objects to seize or attain. The meaning of “attempting to seize, attain progressively” seems to require that N0 be Nhum, because it is a question of a strong “intention” on the part of N0. In appearance, however, this is not an indispensable condition (see 2.1.4.6 Nnon-hum-V-Nnon-hum and 2.1.4.7 Nnon-hum-V-Nabs). First, however, these cases are not frequent (only 5 occurrences among 72). Moreover, if we closely observe the examples in question, we see that the Nnon-hum of 2.1.4.6 and 2.1.4.7 is related to a human act or creation. In this sense, we must state that it is essentially a question of extension by personification. It is impossible to deny the fundamental relationship between “voluntary” and “human”.

Above, we have examined the four groups of the transitive direct construction. We must recognize therein the rich semantic variety that this construction permits. Moreover, let us emphasize that the meaning of “attempting to seize, attain progressively”, which corresponds precisely to the transitive construction, has real importance in our corpus in terms of both frequency and semantic variety, contrary to its neglect by the representative dictionaries of the French language. This fact becomes particularly important when we compare the direct construction with the others that follow.

2.2. N0-se-V-de-N1 (190 occurrences)

In principle, we can also apply the grouping of 2.1 to the examples of pronominal verb constructions. However, a short survey of the examples suggests that we adopt a simpler grouping, as follows:

1. For the concrete displacement, we propose N0(hum+anim+véhic+non-hum)-se-V-de-N1(loc+non-hum+anim). Example: On s’approche du feu. (“We get close to the fire.”) In this frame of construction, N1 is interpreted as a concrete place.
Transitive Direct, Transitive Indirect
2. For the abstract displacement, we propose N0(hum+non-hum+abs+temps)-se-V-de-N1(chiffré+temps+abs+loc+non-hum+hum). Example: On s’approche de la perfection. (“We are getting near perfection.”)
Let us state clearly that, contrary to the transitive direct construction, the question here is not necessarily of an abstract involuntary movement. It can well be a matter of an act or a movement that is abstract, whether voluntary or involuntary. Distinguishing only these two groups risks being too general and may raise problems, as we shall see below. The interpretations of “accessing” or “attempting to seize, attain progressively” do not seem to be excluded. We believe, however, that the difference between the four groups on the one hand, and the two groups on the other, can reflect a fundamental difference between transitive direct and pronominal verb constructions. It is the form of the pronominal verb construction, which is more complex (approcher-N becoming se-approcher) and necessarily has recourse to the preposition de, that appears to impose additional restrictions on the semantic extension of the construction.
The real combinations we find in our corpus are as follows. The sentences that have a human subject are: Nhum-se-V-de-Nloc: 45 occurrences, Nhum-se-V-de-Nhum: 17, Nhum-se-V-de-Nnon-hum: 53, Nhum-se-V-de-Nabs: 6, Nhum-se-V-de-Nchiffré: 4, and Nhum-se-V-de-Ntemps: 5. The sentences having a non-human subject are: Nanim-se-V-de-Nnon-hum: 1 occurrence, Nvéhic-se-V-de-Nloc: 6, Nvéhic-se-V-de-Nnon-hum: 2, Nnon-hum-se-V-de-Nloc: 9, Nnon-hum-se-V-de-Nhum: 3, Nnon-hum-se-V-de-Nnon-hum: 21, Nnon-hum-se-V-de-Nabs: 3, Nabs-se-V-de-Nabs: 2, Nnon-hum-se-V-de-Nchiffré: 11, Nnon-hum-se-V-de-Ntemps: 1, and Ntemps-se-V-de-Nhum: 1. Let us reclassify these combinations according to the two groups proposed above.
2.2.1. Concrete displacement: 87 occurrences
2.2.1.1. Nhum-se-V-de-Nloc (35 occurrences)
(1) [...] Emily grimpait [...], s’approchait de la fenêtre et [...]. (“Emily came near the window”)
2.2.1.2. Nhum-se-V-de-Nhum (16 occurrences)
(2) L’air soucieux, un jeune homme s’approche de l’officier [...]. (“a young man comes near the officer”)
2.2.1.3. Nhum-se-V-de-Nnon-hum (15 occurrences)
(3) En s’approchant de l’assiette pour humer l’exquis fumet, on perçoit sa propre image se former dans l’assiette, devenue miroir! (“Coming close to the plate..., we perceive”)
2.2.1.4. Nanim-se-V-de-Nnon-hum (1 occurrence)
(4) [...] d’un chat s’approchant d’une tasse de crème. (“a cat coming close to a cup of cream”)
2.2.1.5. Nvéhic-se-V-de-Nloc (6 occurrences)
(5) Une fois identifié, c’est un vaisseau interstellaire qui s’approche à grande vitesse de la Terre: [...]. (“an interstellar vessel that is getting close to the earth at a high speed”)
2.2.1.6. Nvéhic-se-V-de-Nnon-hum (2 occurrences)
(6) Quand le jour s’est levé, […], quatre grues de 100 à 150 tonnes s’étaient approchées de la dalle de béton brisée. (“four cranes of 100 to 150 tons had gotten close to the broken concrete slab”)
2.2.1.7. Nnon-hum-se-V-de-Nloc (9 occurrences)
(7) C’est au moment où le cortège des pro-Aristide s’est approché du quartier général des FRAPH, rue de l’Enterrement, [...]. (“the pro-Aristides’ procession has gotten near the headquarters of the FRAPH”)
2.2.1.8. Nnon-hum-se-V-de-Nhum (3 occurrences)
(8) Plus la caméra s’approchait d’elle, plus tout son corps se recroquevillait, [...]. (“The nearer the camera got to her”)
In general, there is no problem in identifying a “concrete displacement”. There are 87 occurrences of “concrete displacement” as opposed to the 103 of “abstract displacement”. Unlike the transitive direct construction, the pronominal verb construction is thus also widely used for a “concrete displacement”.
2.2.2. Abstract displacement: 103 occurrences
2.2.2.1. Nhum-se-V-de-Nloc (10 occurrences)
(9)
[...] le principe de base de toute sage politique en France: on ne doit s’approcher de l’école, des écoles, qu’avec des prudences […]. (“we must approach the school, the schools only with prudences”)
In (9), above, «l’école» (“the school”), if conceived out of context, can be understood as a concrete place. However, in the context, the question is of a certain object to which we must attend. As we have already noted above, any nominal element can be conceived as an object to seize or attain. In this sense, (9) is very close to the transitive direct construction expressing an object that is to be seized. Comparing On doit approcher l’école and On doit s’approcher de l’école, we notice that a semantic difference, if there is one, should arise from the difference between these constructions. The object’s directness is formally expressed in the transitive construction, whereas in the pronominal verb construction, the question is only of the object of our attention. We merely pay attention to this object, because it is treated indirectly with the help of the preposition de and also because the verb s’approcher itself is no longer transitive and is, as it were, “intransitivized”. However, this is just a discussion based on the formal difference of construction. There is possibly no substantial semantic difference in itself. It is important, however, to note that there are only a few cases such as this one among the examples of pronominal verb constructions, whereas there are many for the transitive direct construction. In addition, recall the proportion of the meaning in question for the two constructions: proportionally speaking, the pronominal verb construction fairly often expresses a “concrete displacement”, but the transitive direct construction suggests in most cases an “abstract displacement” (including the meaning of “accessing someone”). It appears that, ultimately, it is this difference of proportion that renders the meaning of “attempting to seize, attain progressively” better adapted to the direct construction than to the pronominal verb construction.
2.2.2.2. Nhum-se-V-de-Nabs (6 occurrences)
(10) [...] les tabous, comme celui de “réécrire” Shakespeare. “Nous voulions nous approcher du mythe qu’est Richard, explique Pierre Pradinas. (“We wished to approach the myth which Richard is”)
In (10), we encounter the same problem. Here, «voulions» expresses a strong intention to act. The meaning is exactly that of “We wished to attempt to grasp the myth which Richard is”. This example could possibly suggest a division of the group of “abstract displacement” between “simple abstract displacement” and “attempting to seize, grasp, attain progressively”.
2.2.2.3. Nhum-se-V-de-Nchiffré (4 occurrences)
(11) Ce garçon […] a attendu son tour pour s’approcher des premiers rangs. (“That boy waited his turn to get near the first ranks”)
2.2.2.4. Nhum-se-V-de-Ntemps (5 occurrences)
(12) [...], ils s’approchent de la phase finale [...]. (“they are approaching the final phase”)
2.2.2.5. Nhum-se-V-de-Nnon-hum (38 occurrences)
(13) Il s’agit d’une oeuvre d’art, au fond. Seul un écrivain comme Proust s’est approché de ce Temps introuvable, par métaphore, [...]. (“Only a writer like Proust got close to this undiscoverable Time”)
(13), above, also permits the interpretation of “seizing, attaining progressively”.
2.2.2.6. Nhum-se-V-de-Nhum (1 occurrence)
(14) “Plus on s’approche du consommateur final, moins la situation est réjouissante [...]. (“The closer we get to the final consumer”)
2.2.2.7. Nnon-hum-se-V-de-Nnon-hum (21 occurrences)
(15) J’ai choisi des guitaristes dont le jeu s’approchait de celui de James. (“the performance was comparable with that of James”)
2.2.2.8. Nnon-hum-se-V-de-Nchiffré (11 occurrences)
(16) “[...] la lutte contre l’inflation, qui s’approche aujourd’hui des 3%”. (“the inflation, which is nowadays getting close to 3%”)
2.2.2.9. Nnon-hum-se-V-de-Ntemps (1 occurrence)
(17) La ville des princes […], s’approche du cinquième anniversaire de la chute du mur, le 9 novembre, avec un solide optimisme. (“The town is approaching the fifth anniversary of the fall of the wall”)
2.2.2.10. Ntemps-se-V-de-Nhum (1 occurrence)
(18) […] était-ce la Mort qui s’approchait de nous? (“was it Death that was coming close to us?”)
2.2.2.11. Nnon-hum-se-V-de-Nabs (3 occurrences)
(19) [...]: le cinéma comme fabrication, comme construction, pour s’approcher de la réalité (matérielle, intellectuelle, affective). (“the cinema as production, as construction, to get close to reality”)
2.2.2.12. Nabs-se-V-de-Nabs (2 occurrences)
(20) Les “impuretés”, [...], devaient être noyées dans la masse, et l’équilibre réel s’approcher de l’équilibre théorique “optimal”. (“the real balance should get close to the theoretical ‘optimal’ one”)
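The subtype counts listed in 2.2.1 and 2.2.2 can be cross-checked mechanically against the totals stated for N0-se-V-de-N1 (87, 103 and 190 occurrences). The following is a minimal sketch in Python, offered only as an illustration and not as the procedure used in this study. The figures are transcribed from the sections above; the dictionary layout and the variable names are assumptions introduced here.

```python
from collections import Counter

# Counts transcribed from 2.2.1 (concrete) and 2.2.2 (abstract).
# Keys are (class of N0, class of N1); this layout is an assumption
# made for illustration only.
concrete = {
    ("Nhum", "Nloc"): 35,
    ("Nhum", "Nhum"): 16,
    ("Nhum", "Nnon-hum"): 15,
    ("Nanim", "Nnon-hum"): 1,
    ("Nvéhic", "Nloc"): 6,
    ("Nvéhic", "Nnon-hum"): 2,
    ("Nnon-hum", "Nloc"): 9,
    ("Nnon-hum", "Nhum"): 3,
}
abstract = {
    ("Nhum", "Nloc"): 10,
    ("Nhum", "Nabs"): 6,
    ("Nhum", "Nchiffré"): 4,
    ("Nhum", "Ntemps"): 5,
    ("Nhum", "Nnon-hum"): 38,
    ("Nhum", "Nhum"): 1,
    ("Nnon-hum", "Nnon-hum"): 21,
    ("Nnon-hum", "Nchiffré"): 11,
    ("Nnon-hum", "Ntemps"): 1,
    ("Ntemps", "Nhum"): 1,
    ("Nnon-hum", "Nabs"): 3,
    ("Nabs", "Nabs"): 2,
}

# Merge the two groups to recover the per-combination figures
# given before the reclassification.
combined = Counter(concrete)
combined.update(abstract)

print("concrete:", sum(concrete.values()))
print("abstract:", sum(abstract.values()))
print("total:", sum(combined.values()))
print("Nhum-se-V-de-Nloc:", combined[("Nhum", "Nloc")])
```

If the transcription is faithful, the two group sums and the merged figures (for instance Nhum-se-V-de-Nloc, counted once as concrete and once as abstract) should agree with the numbers quoted in the text.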
In the previous sections, we have raised some cases that may lead to the interpretation of “seizing, attaining progressively”. There appears to be no possibility of this interpretation for the constructions with N1(chiffré+temps). However, it is merely a tendency that we notice there, and it is necessary to acknowledge that, in principle, any nominal element, if placed in a suitable context, can be conceived of as an object or a problem that we have to deal with. Let us note that the pronominal verb construction can also express a meaning close to that of “seizing, attaining”, although we do not very often find examples suggesting such a meaning. Recall that only the Grand dictionnaire encyclopédique Larousse indicates the possible interpretation in question for the pronominal verb construction.
2.3. N0-se-V (52 occurrences)
This construction is simple because it contains only one argument, namely N0. In most cases, we can consider it as a realization of 2.2. N0-se-V-de-N1, the indirect object (de-N1) of which is absent. As done earlier, we can first classify the examples into two groups, “concrete displacement” and “abstract displacement”, and then classify them according to the nominal elements appearing in N0, which in our corpus are: Nhum, Nnon-hum, Nvéhic and Ntemps.
2.3.1. Concrete displacement: 44 occurrences
2.3.1.1. Nhum-se-V (36 occurrences)
(1) Le sommelier s’approche: Qu’est-ce que vous prendrez? (“The wine waiter comes near”)
2.3.1.2. Nnon-hum-se-V (5 occurrences)
(2) [...] la caméra s’est même approchée pour [...]. (“the cine-camera has even come nearer”)
2.3.1.3. Nvéhic-se-V (3 occurrences)
(3) L’enfant joue devant chez lui quand une voiture s’approche. (“a car comes near”)
2.3.2. Abstract displacement: 8 occurrences
2.3.2.1. Nhum-se-V (1 occurrence)
(4) [...]: la confrontation de deux géants qui se suspectèrent, s’approchèrent, s’évitèrent, se mécomprirent, [...]. (“two giants who suspected each other, approached each other”)
2.3.2.2. Nnon-hum-se-V (1 occurrence)
(5) Le Traité de Maastricht avait prévu une stratégie du resserrement: les monnaies devaient s’approcher le plus possible. (“the currencies had to approach each other as closely as possible”)
2.3.2.3. Ntemps-se-V (6 occurrences)
(6) La maladie les interdit et s’approche la rupture de Joseph d’avec le Delteil qu’il n’aime plus. (“the rupture comes near between Joseph and le Delteil”)
In (4) and (5), the question is of a pronominal verb construction known as “reciprocal”, which appears to be extremely rare for the verb approcher. In (5), for example, it is, as it were, a matter of “currency A must approach currency B and reciprocally”, which would have a close relation with an “abstract displacement” of a transitive direct construction. In general, we can state that in the construction N0-se-V, the “concrete displacement” is dominant (44 occurrences among 52), whereas in the examples of “abstract displacement”, Ntemps-se-V is to be noted (6 occurrences among 8). Note that in this sense, Ntemps-se-V is close to Ntemps-V, which is by far the most frequent in N0-V of “abstract displacement” (89 occurrences among 91). We shall note this below and recall again that it was brought into sharp relief by the Grand dictionnaire encyclopédique Larousse.
2.4. N0-V-de-N1 (74 occurrences)
This is the third complex construction after N0-V-N1 and N0-se-V-de-N1. We propose to classify the examples first according to the nominal elements that appear in N0 and N1, and thereafter according to the two groups of “concrete displacement” and “abstract displacement”.
The constructions with N0.hum are: Nhum-V-de-Nloc: 11 occurrences, Nhum-V-de-Ntemps: 21, Nhum-V-de-Nnon-hum: 6, Nhum-V-de-Nabs: 6. The constructions with N0.non-hum are: Nnon-hum-V-de-Nloc: 4 occurrences, Nvéhic-V-de-Nloc: 3, Nvéhic-V-de-Nnon-hum: 1, Nnon-hum-V-de-Nchiffré: 9, Nnon-hum-V-de-Ntemps: 4, Ntemps-V-de-Nchiffré: 1, Nnon-hum-V-de-Nnon-hum: 8.
2.4.1. Concrete displacement: 19 occurrences
2.4.1.1. Nhum-V-de-Nloc (10 occurrences)
(1) Il a toutefois noté que les Serbes approchaient maintenant d’un bastion fortement défendu de l’armée gouvernementale bosniaque, […]. (“the Serbs were now coming near a bastion”)
2.4.1.2. Nhum-V-de-Nnon-hum (1 occurrence)
(2) [...] approchant du berceau d’un nouveau-né [...] , le premier ministre incarnait [...]. (“coming near the cradle of a newborn, the prime minister incarnated”)
2.4.1.3. Nnon-hum-V-de-Nloc (4 occurrences)
(3) […] chaque fois que sa crinière blonde approchait du but bulgare. (“its blond mane came close to the Bulgarian goal”)
2.4.1.4. Nvéhic-V-de-Nloc (3 occurrences)
(4) UN Airbus roumain bat de l’aile en approchant d’Orly; [...]. (“An Airbus ... coming near Orly”)
2.4.1.5. Nvéhic-V-de-Nnon-hum (1 occurrence)
(5) [...], deux remorqueurs italiens ont pu approcher de l’épave [...]. (“two Italian tugs could come near the wreck”)
2.4.2. Abstract displacement: 55 occurrences
2.4.2.1. Nhum-V-de-Ntemps (21 occurrences)
(6) “Au fur et à mesure qu’on approchait de la date des élections, la tension montait”, […]. (“they were coming close to the election day”)
2.4.2.2. Nhum-V-de-Nabs (6 occurrences)
(7) [...] Philippe Séguin approche de la vérité quand il oppose l’objectif de “pleine activité” au mirage entretenu du retour au “plein emploi”. (“Philippe Séguin comes close to the truth”)
2.4.2.3. Nhum-V-de-Nnon-hum (5 occurrences)
(8) [...] elle lui donna l’impulsion de travailler afin d’approcher du but unique que son éducation avait déjà défini. (“she gave him the stimulus to work in order to get close to the unique goal”)
(8) can have a meaning close to that of “attaining the goal”. Regarding the frequency, let us observe that the construction N0-V-de-N1 is clearly oriented toward the abstract meaning (“abstract displacement”: 55 occurrences as opposed to “concrete displacement”: 19 occurrences), a difference that we can confirm by comparison with the construction N0-se-V-de-N1. From this point of view, we can say that N0-V-de-N1 can be better adapted to the meaning of “seizing, attaining progressively” than can N0-se-V-de-N1. However, it is probable that this difference is neither easily realized nor perceived. In any case, the example in question is very rare for the transitive indirect construction.
2.4.2.4. Nhum-V-de-Nloc (1 occurrence)
(9) [...] l’ampleur de la demande qui s’exprime [...] en direction de l’école est la preuve [...] qu’on ne doit en approcher qu’à pas comptés. (“We must attempt to approach the school only prudently”)
2.4.2.5. Nnon-hum-V-de-Nchiffré (9 occurrences)
(10) L’écart entre les taux français et allemands, [...], s’est un peu réduit, approchant de 0,5 point. (“The distance...coming close to 0.5 point”)
2.4.2.6. Nnon-hum-V-de-Ntemps (4 occurrences)
(11) Le procès [...] approchait mardi 15 février de sa fin, […]. (“The trial was getting close to its end”)
2.4.2.7. Ntemps-V-de-Nchiffré (1 occurrence)
(12) [...] un total de 294 000 en trois semaines; et encore 27 000 pour les Vestiges du jour qui approche des 300 000 en cinquième semaine. (“the day which gets near the 300 000”)
2.4.2.8. Nnon-hum-V-de-Nnon-hum (8 occurrences)
(13) [...] que l’économie approchait de la pleine utilisation de ses capacités de production […]. (“the economy was getting near the full utilization”)
In general, the construction N0-V-de-N1 is almost as rich as N0-V-N1 and N0-se-V-de-N1, although it is considerably less frequent.
Here, the “concrete displacement” is less frequent than the “abstract” one. From this point of view, let us compare the three constructions.

              N0-V-N1            N0-se-V-de-N1      N0-V-de-N1
“concrete”    30 occurrences     87 occurrences     19 occurrences
“abstract”    263 occurrences    103 occurrences    55 occurrences
It is noteworthy that in the case of the pronominal verb construction, the “concrete” is almost as frequent as the “abstract”, whereas in the case of the other two constructions, the tendency is clearly oriented toward the “abstract”. Let us note that for the construction N0-V-de-N1, Nhum-V-de-Nloc is remarkable in the constructions of “concrete displacement” (10 out of 19 occurrences) and that Nhum-V-de-Ntemps is very frequent among the constructions of “abstract displacement” (21 out of 55 occurrences).
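The comparison above can be restated as proportions, which makes the contrast between the three constructions easier to read. The following Python sketch is only an illustration under the assumption that the occurrence counts in the table are taken as given; the names `counts` and `concrete_share` are introduced here and do not come from the study.

```python
# Occurrence counts from the comparison table of the three
# two-argument constructions.
counts = {
    "N0-V-N1": {"concrete": 30, "abstract": 263},
    "N0-se-V-de-N1": {"concrete": 87, "abstract": 103},
    "N0-V-de-N1": {"concrete": 19, "abstract": 55},
}


def concrete_share(c):
    """Percentage of 'concrete displacement' among all classified uses."""
    total = c["concrete"] + c["abstract"]
    return 100 * c["concrete"] / total


# Round to one decimal place for display.
shares = {name: round(concrete_share(c), 1) for name, c in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}% concrete, {round(100 - pct, 1)}% abstract")
```

On these figures, the concrete share is roughly 10% for N0-V-N1, 46% for N0-se-V-de-N1 and 26% for N0-V-de-N1, which is the asymmetry described in the text.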
2.5. N0-V (104 occurrences)
This construction is as simple as that of 2.3. N0-se-V. Here, the simplicity is further accentuated by N0.temps, which occurs very frequently (89 out of 104 occurrences). We shall proceed here as in the preceding sections. The nominal elements appearing in N0 are: Nhum, Nnon-hum, Nvéhic and Ntemps.
2.5.1. Concrete displacement: 13 occurrences
2.5.1.1. Nhum-V (10 occurrences)
(1) Puis nous avons couru [...]. Les Allemands approchaient. (“The Germans were coming near”)
2.5.1.2. Nnon-hum-V (2 occurrences)
(2) [...] tous les ports espagnols savent qu’approche une escadre ennemie et se préparent à l’attaquer. (“an enemy squadron is coming near”)
2.5.1.3. Nvéhic-V (1 occurrence)
(3) [...] quand la pelleteuse approche et porte de son bec […]. (“the excavator comes near”)
2.5.2. Abstract displacement: 91 occurrences
2.5.2.1. Nnon-hum-V (2 occurrences)
(4) [...] “Le fascisme approche à grands pas”; [...]. (“fascism is coming near”)
2.5.2.2. Ntemps-V (89 occurrences)
(5) Alors qu’approchait l’heure de la retraite, [...]. (“the time of retirement was coming near”)
Regarding the “concrete displacement”, we can note Nhum-V (10 out of 13 occurrences), which resembles the case of Nhum-se-V (36 out of 44 occurrences of N0-se-V “concrete”). Further, with regard to the “abstract displacement”, Ntemps-se-V is also to be noticed (6 out of 8 occurrences of N0-se-V “abstract”), which is comparable to Ntemps-V (89 out of 91 occurrences of N0-V “abstract”). The frequency of Ntemps-V almost leads us to state that N0-V is designed to convey an “abstract temporal displacement”.
3. Conclusion
In addition to the constructions analysed above, the following occurrences were observed in our corpus. N0-V-N1(-de-N2): 2 occurrences, N1-être-Vé(-par-N0): 25 occurrences, N1(,)Vé(-par-N0): 10 occurrences, and N1-se-laisser-V(-par-N0): 4 occurrences. These constructions cannot be treated here. However, we can simply note the presence of the passive construction, which indirectly indicates the importance of the transitive direct construction.
We can summarize the essential points as follows.
First, for the construction N0-V-N1 (293 occurrences), we must note its highest frequency and also its rich semantic variety. Further, its tendency is clearly oriented toward an “abstract” use, which can suggest that the “concrete displacement” (which should be considered as the most fundamental for the verb approcher) is assumed by other constructions. Moreover, it is important to note that the abstract use of a “voluntary act” of the direct construction is very much alive.
Second, for the construction N0-se-V-de-N1 (190 occurrences), we must state that the “concrete” use is almost as important as the “abstract” one. We can further state that when there are two arguments N0 and N1, it is the pronominal verb construction that mainly assumes the meaning of “concrete displacement”. It is also important to recall the frequency and the rich variety of its use of “abstract displacement”. Further, let us notice that there are even cases in which the interpretation of “seizing, attaining progressively” seems possible.
Third, with regard to the construction N0-V-de-N1 (74 occurrences), we must state that it, too, has a rich semantic variety, although it is clearly less frequent than the preceding two constructions. This construction is oriented toward an “abstract” use, as in the case of N0-V-N1. In addition, among the uses of “abstract displacement”, there are even those that suggest the interpretation of “seizing, attaining progressively”, although they are an exception. Among these “abstract” uses, it is Nhum-V-de-Ntemps that is to be particularly noted.
Finally, with regard to the constructions N0-V (104 occurrences) and N0-se-V (52 occurrences), we must observe their differences of tendency. The former is strongly oriented toward an “abstract displacement”, and it is Ntemps-V that has by far the highest frequency. Moreover, it is Nhum-V that dominates N0-V “concrete”. The latter is clearly oriented toward a “concrete displacement”, and it is Nhum-se-V that is by far dominant. The two constructions resemble each other in that for the “concrete displacement”, N0.hum is to be noted, while for the “abstract displacement”, N0.temps is to be noted.
Compared with the descriptions provided by dictionaries, our description shows the importance and the richness of the transitive direct construction. Moreover, the transitive use with three arguments is quasi-absent in our corpus. Surveying the quantitative proportions of the constructions can offer some valuable information on the real and actual state of the functioning and structure of a language. It is finally this aspect of the problems that must be clarified by an analysis of the corpus. In fact, this is merely one of the aspects of linguistic analysis that we have been adhering to since the very beginning in the field of structural and functional linguistics (cf. Harris 1951, 1991, Martinet 1960, 1979, 1985).
4. Linguistic Informatics
At the end of this article, we would like to add some comments on what can be conceived between corpus analysis, information sciences and language acquisition. In other words, under which form can a “Linguistic Informatics” that profitably joins together linguistics, informatics and language acquisition-teaching be conceived? Ordinarily, we believe that language teaching is an application of linguistics and that informatics offers an effective tool for linguistics and language teaching. Contrary to this common and unilateral perception, we can conceive of a reciprocal relationship between linguistics and informatics.
Informatics can assist us in regrouping and classifying a vast quantity of corpus data, the classification of which would require considerable time and effort without its involvement. At present, it is not difficult to survey even a million examples of one semantic unit in a very restricted amount of time. The development of search capabilities in software also makes it possible to verify in a very large corpus, for example, whether or not a complex sequence of classes and subclasses of semantic units constitutes a grammatical and realizable sequence in a given language. Of course, informatics has not yet clarified everything, and we are far from it, but there are many developments that linguistics can expect from informatics in the future. Inversely, linguistics should be able to suggest many essential elements to informatics, because, ultimately, it is only linguistics that attends to all the details of the structures and functionings of human languages. For example, in the domain of automatic processing of texts and machine translation, linguistics could offer many indispensable elements to informatics.
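The kind of difficulty at stake can be made concrete with the verb studied in this article. The following Python sketch is a deliberately naive illustration, not a tool used in this study: it detects every form beginning with the stem approch- and then guesses, from the preceding word alone, whether the form is nominal. The determiner list, the function name `classify` and the last two test sentences are hypothetical and were introduced here for illustration; only the first sentence is taken from the examples above.

```python
import re

# Hypothetical and incomplete list of determiners used as a crude cue
# for a nominal reading (an assumption made for this sketch).
DETERMINERS = {"l", "la", "une", "cette", "son", "notre", "des", "ces"}


def classify(text):
    """Label each form in 'approch-' using only the preceding word."""
    # Split on non-word characters, so that "s’approche" yields
    # the two tokens "s" and "approche".
    tokens = re.findall(r"\w+", text.lower())
    results = []
    for i, token in enumerate(tokens):
        if token.startswith("approch"):
            previous = tokens[i - 1] if i > 0 else ""
            if previous in DETERMINERS:
                label = "possibly nominal"
            else:
                label = "possibly verbal"
            results.append((token, label))
    return results


sentences = [
    "Le sommelier s’approche.",            # verb (example from 2.3.1.1)
    "L’approche de l’hiver.",              # noun, hypothetical sentence
    "Une nouvelle approche du problème.",  # noun, hypothetical sentence
]
for sentence in sentences:
    print(sentence, classify(sentence))
```

The third sentence is labelled “possibly verbal” although approche is a noun there, because an adjective stands between the determiner and the noun. A surface cue of this kind is therefore insufficient, and separating verbal from non-verbal forms requires linguistic description of the sort discussed here.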
For instance, can computers exclude non-verbal forms from all the mechanically detected “conjugation forms” of a given verb such as approcher? In fact, there remain many unclarified elements in linguistics and, consequently, in the mechanical processing of informatics. Thus, it is clear that there are many subjects to which linguistics and informatics should attend in collaboration. If the task of informatics is to treat information, it cannot neglect linguistic analyses, because the most important information that is transmitted in communication between human beings is realized in the form of language. The collaboration carried out under the direction of Maurice Gross at the LADL (Laboratoire d’Automatique Documentaire et Linguistique) and IGM (Institut Gaspard Monge) in France is rather exceptional (cf. Gross 1975, Boons, Guillet, Leclère 1976, Guillet, Leclère 1992, Silberztein 1993, Paumier 2006). Unfortunately, we cannot say that this sort of collaboration is now being intensified. However, despite these facts, it is evident that the reciprocal collaboration between linguists and specialists in informatics is indispensable to both these domains.
What could be the relationship between linguistics and language teaching? Linguistics is applied to language teaching. However, it is not only a matter of a theoretical aspect but also of a material one, which can be profoundly different from what it was a quarter of a century ago. Regarding the uses of a verb, for example, we can now provide a real image of its functionings by analyzing a very large corpus. We can precisely state the frequencies of the constructions that are accepted by the verb in question, and if we so desire, it is even possible to indicate the frequency of the combination between a given nominal object N1 and the verb in question. Furthermore, it is possible to specify the frequencies of all the constructions of all the verbs of a given dictionary. Therefore, we can show a real image of the detailed functionings of the sentential constituents. This kind of quantified information, if presented well, can offer a great advantage to language teaching.
What linguistics brings to language teaching is more than evident; however, in return, what can linguistics receive from language teaching? This point has not been addressed clearly thus far. It is said that we learn and acquire a language in a linguistic community naturally and without any effort. However, is this actually true, and if so, to what extent? The importance of school education is essentially recognized and its significance is felt increasingly. And in one sense, we can say that language teaching (including, of course, that of one’s mother tongue) is at the base of all teaching.
Setting aside foreign language acquisition, one’s mother tongue (French, for instance) would have a very different form from what it actually is, if no institutional factors (teaching, publication, television and radio broadcasting, transportation system, etc.) intervened in its acquisition (it would be very interesting, for example, to compare in detail French and a language that is fairly isolated from any so-called developed society). Seen from this perspective, language teaching can offer much actual and valuable information, albeit of a specific kind, on linguistic communication, and particularly on the different aspects of language acquisition, regardless of whether it is of one’s mother tongue or of a foreign language. In concrete terms, what can language teaching offer linguistic research? From the point of view of corpus linguistics, we can expect to establish different corpora of language acquisition and thus compare the analyses of ordinary corpora with those of corpora of the acquisition process (spoken or written) established on the spot of language teaching and acquisition (which is called «grammaire des fautes» (“grammar of errors”), cf. Frei 1929). It is clear that the relationship between linguistics and language acquisition-teaching should be reciprocal.
With regard to the relationship between language teaching and informatics, we should, as linguists, abstain from providing comments. Besides technical elements, however, this relation must also be reciprocal in the sense that the central aspect of their works is the constant transmission and exchange of language information.
The fields of Linguistic Informatics or Informatic Linguistics, as conceived above, are far from being sufficiently established in our research; however, we can confidently state that the process has surely and solidly begun.
Bibliography
Boons, Jean-Paul, Alain Guillet, Christian Leclère. 1976. La Structure des phrases simples en français: constructions intransitives, Genève, Droz.
Busse, Winfried, Jean-Pierre Dubost. 1977. Französisches Verblexikon, Die Konstruktion der Verben im Französischen, Stuttgart, Klett-Cotta.
Dubois, Jean, Françoise Dubois-Charlier. 1997. Les Verbes français, Paris, Larousse.
Frei, Henri. 1929. La Grammaire des fautes. Introduction à la linguistique fonctionnelle, Paris, Geuthner et Genève, Kündig.
Gross, Maurice. 1975. Méthodes en syntaxe, Paris, Hermann.
Guillet, Alain, Christian Leclère. 1992. La Structure des phrases simples en français: constructions transitives locatives, Genève, Droz.
Harris, Zellig S. 1951. Structural Linguistics, Chicago, The Univ. of Chicago Press.
Harris, Zellig S. 1991. A Theory of Language and Information: A Mathematical Approach, Oxford, Clarendon Press.
Martinet, André. 1960, 2005. Éléments de linguistique générale, Paris, A. Colin.
Martinet, André. 1979. Grammaire fonctionnelle du français, Paris, Didier.
Martinet, André. 1985. Syntaxe générale, Paris, A. Colin.
Paumier, Sébastien. 2006. Unitex 1.2. Manuel d’utilisation, Marne-la-Vallée, IGM, Univ. de Marne-la-Vallée.
Silberztein, Max. 1993. Dictionnaires électroniques et analyse de textes. Le système Intex, Paris, Masson.
Tsuruga, Yoichiro. 2005. “A Correspondence between N0-V-N1-de-N2 and N0-V-N2-Loc-N1 in French —The Case of planter—”, in Takagaki, T. et al. (ed.) Corpus-Based Approaches to Sentence Structures, Usage-Based Linguistic Informatics 2, Tokyo Univ. of Foreign Studies, Amsterdam, J. Benjamins, pp. 213-232.
Grand dictionnaire encyclopédique Larousse, tome I, Paris, Larousse, 1982.
Grand Larousse de la langue française, tome I, Paris, Larousse, 1971.
Grand Robert de la langue française, tome I, Paris, SNL Le Robert, 2001.
INDEX du DELAS. v8 et du Lexique-Grammaire des verbes, Paris, LADL, Univ. Paris VII, 1997.
Trésor de la langue française. Dictionnaire de la langue du XIXe et du XXe siècles, tome I, Paris, Klincksieck, 1971.
Demonstratives in De Bello Gallico and Li Fet des Romains
— A Parallel Corpus Approach to Medieval Translation —
Yuji KAWAGUCHI
1. Introduction
It is well known that the genre of “romans antiques” represented by Le Roman de Troie, Enéas, and Le Roman de Brut flourished in the royal court of Henry II. Past events in these ancient romances denote not only ancient stories to be told but also a living history to be realized in the future.1 For medieval writers, ancient history was not a mere compilation of fixed knowledge, but a stock of human wisdom. With these written texts, this historical knowledge would be inherited throughout different eras. Chrétien de Troyes expresses himself with regard to the succession of human wisdom in the following passage.
(1)
Ce nos ont nostre livre apris / Qu’an Grece ot de chevalerie Le premier los et de clergie. / Puis vint chevalerie a Rome Et de la clergie la some, / Qui or est an France venue. (Cligés, ms.BN 794, lines 27–35)
In this famous preamble of Cligés, Chrétien de Troyes declared that Greece had the first honor of chivalry and the liberal arts. Later, Rome fell heir to chivalry and the summa of the liberal arts, which finally reached France. To Chrétien de Troyes, France was a direct successor of the ancient wisdom of Greece. In the preface to his romance titled Erec et Enide, he proudly proclaimed that as long as Christianity persists, people would forever remember the romance he would create. (2)
Des or comancerai l’estoire / Qui toz jorz mes iert an mimoire Tant con durra crestïantez; / De ce s’est Crestïens vantez. (Erec et Enide, ms.BN 794, lines 23–26)
The medieval art of translation was also in line with such intellectual efforts to accede to ancient history as human wisdom. In the present article, we will analyze the medieval translation of Caesar’s Gallic War. The examination of the original Latin text and its Old French translation is based on a parallel corpus of the two texts. The objective of this article is to shed light on different problematic aspects of medieval historical texts and to illustrate a possible model of a parallel corpus approach to medieval translation.
1 Baumgartner (1995) p.220.
2. Quest for a new historical description: Li Fet des Romains
It would not be an overstatement to assert that the translation of Caesar’s Gallic War, Li Fet des Romains (hereafter, Li Fet), was a kind of antithesis to the abovementioned courtly romances, because the creation of the text was connected with the feud between the king of France and the Plantagenet dynasty. The relationship between the two was deteriorating after the Third Crusade, particularly owing to the accession of the new king of France, Philip Augustus (1180–1223). Philip Augustus was keen to control the territory of the king of England. In 1202, a dispute arose between King John of England and the Count of Poitou, one of John’s vassals. To resolve the clash, Philip summoned both of them to his feudal court. However, John refused to appear in court. In response, Philip declared that John should be deprived of all of his feudal lands in France. Finally, the victory of Philip Augustus at the battle of Bouvines in 1214 enhanced his absolute feudal authority and the prestige of the French monarchy in Western Europe. Li Fet was thus written under conditions of political expansion carried out by the French monarchy during the reign of Philip. Although the project of royal chronicles also began in this political atmosphere, the kings of France and England adopted different attitudes toward the compilation of royal history.
The Plantagenet dynasty protected the ancient romances, and, insisting on the unity of the Anglo-Norman and new Anjou dynasties, its writers strove to record the formation of their dynasty. For instance, Chroniques des ducs de Normandie, written by Benoît de Sainte-Maure circa 1172/1176, was a classic. On the other hand, the king of France did not have much interest in the compilation of royal history in Old French. Apparently, it was the increasing opposition to the Plantagenet dynasty that changed his negative attitude toward royal chronicles in Old French. Thus the translation of Gesta Philippi Augusti and its revised version Histoire de Philippe-Auguste by Jean de Prunai2 fall under the collection of unusual royal documents concerning King Philip. Chronique de Pseudo-Turpin, compiled circa 1202, and Chronique des rois de France, the compilation of which began around 1210, were the earliest examples. This royal history project culminated in Grandes chroniques de France written in 1274 by the monks of the Abbey of Saint-Denis. Let us examine the kind of ideology that propelled this new project of compiling royal history in the vernacular. The preface of Chronique de Pseudo-Turpin is significant and frequently quoted in this regard.
(3) Voil commencer l’estoire si cum li bons enpereires Karlemaine en ala en Espagnie par la terre conquere sore les Sarrazins. Maintes genz si en ont oï conter et chanter mes n’est si mençonge non ço qu’il en dient e chantent cil chanteor ne cil jogleor. Nus contes rimés n’est verais. Tot est mençongie ço qu’il en dient car il n’en sievent rienz fors quant par oïr dire. (Chronique de Pseudo-Turpin, ms.BN fr.124, fol.1)
2 For more details, see Spiegel (1993) pp.269–270.
The above preface proclaimed a new historical recognition: Nus contes rimés... par oïr dire. “No rhymed tale is true. All that they tell is fiction because they know it only by hearsay.” Historical descriptions should not be based on hearsay, but should be the truth. Verse is not a convenient tool to achieve this purpose, and prose is more often than not a preferable means to depict historical truth. In reality, as far as the extant documents are concerned, medieval writers rarely used prose before 1200. Only four documents in prose are attested before the thirteenth century: Psalter de Montebourg in approximately 1100, Sermons of Maurice Sully in 1160–1196, Description de Jerusalem circa 1150, and Quatre livres des rois around 1170. The emergence of prose was a historical event of the thirteenth century. We can suppose that the transition in the style of consuming literature was closely related to the increase of prose in this period. Oral literature transformed itself into written texts in accordance with a gradual rise of literacy among laymen in the Middle Ages. The oral tradition that medieval intellectuals and nobility had been accustomed to was receding into the background, and instead of listening to literature, the thirteenth century generalized the practice of reading it. This ideological trend contributed to the creation of a new literary genre that focused on the construction of virtual reality through written words. It was “a new linguistic model of truth, one based on written, not spoken language.”3 The same ideology bore fruit in the royal chronicles in Old French.4
3 Spiegel (1993) p.68.
4 Apart from the royal chronicles and Li Fet des Romains, for instance, the description of Constantinople by Robert de Clari, a war correspondent as well as participant of the Fourth Crusade, is worth mentioning. He was also keen to depict the circumstances of the city as faithfully as possible, even if his virtual reality did not effectively represent the reality itself.
As we have seen, the transitional period of the Old French literary tradition witnessed a decline of verse, the rise of prose with the increasing number of written texts, and the ideology of describing truth rather than fiction. Two historical compilations come from the first quarter of the thirteenth century, Histoire ancienne jusqu’à César (hereafter, Histoire ancienne) circa 1208–13 and Li Fet circa 1213–14.5 These works are distinguished from other royal histories of France because on the one hand, they consider the Roman Empire as a political model and on the other hand, they consider Julius Caesar as a protagonist.6 However, in these works, Julius Caesar is not an ideal, but a model. They regard Caesar as an assassinated hero as well as a difficult man with an unstable and complicated personality, and they depict him as one of the figures who disclose the fragility of Roman civilization and its authority. Although Histoire ancienne and Li Fet are contemporary texts that share some similarities, they contrast sharply with each other in terms of their historical recognition. The former is replete with quotations from ancient romances and reveals some obvious shades of the literary taste of the Plantagenet dynasty. In fact, after a voluminous commentary on the Creation, the author of Histoire ancienne inserts quotations from Le Roman de Thèbes and Le Roman de Troie and excerpts several passages from Publius Vergilius Maro (70–19 BC) and Maurus Servius Honoratus (4th–5th century AD). In this respect, the author shares the prejudices of his predecessors and other contemporary historians, according to whom the Frankish Kingdom originated from the ancient polity of Troy. The manuscript of Chantilly no. 869 of Chronique des rois de France is based on the same origin. Surprisingly, it was not until the publication of Recherches de la France by Etienne Pasquier in 1599 that this error was finally corrected.
5 The dating is based on the analysis by Sneyders de Vogel of fol. 99b in Manuscript 1391 (fol. 120c du ms. Vatican, Reg. 893). The translator mentions the Anglo-Norman coalition with Otto IV against Philip. The rumor of this coalition had spread throughout France from the second half of 1213 to July 1214, see Flutre (1974) p.8.
6 On the medieval vogue of Caesar, see Leeker (1986).
It is likely that Li Fet was written in this period in order to explicate the royal authority of King Philip II.7 In the prologue, like Chrétien de Troyes, the author of Li Fet insists on the importance of learning human wisdom from ancient texts and of deeply ensconcing it in his living epoch through the actualization of past events.8
(4) Mes cil qui plus sivent raison et droiture que delit charnel, qui font les proesces ou qui les recordent et metent en escrit, cil font a loer ; car ou recort des œvres anciennes aprent l’an que l’en doit fere et que l’en doit lessier. (Li Fet, fol. 1c, lines 16–20)
7 With regard to the character of Philip in Li Fet, see especially the prehistory of the compilation of this text (cf. note 5).
8 “Die Funktion der Antike war dabei primär, als didaktisches Quellenmaterial zur moralischen Besserung der Leser zu dienen, wie etwa aus der Charakterisierung der Werke mit antiken Stoffen als “sage et de sens aprendant” bei Jean Bodel oder auch aus dem Prolog des Übersetzers zu den Fet des Romains hervorgeht”, see Leeker (1986) p.3.
The pursuit of truth in historical description appears to be closely related to this process of verbally actualizing past events and interpreting them as they occurred in the actual situation. Thus, ou recort des œvres anciennes aprent l’an que l’en doit fere et que l’en doit lessier, that is, “in the testimony of ancient texts, one can learn what to do and what to neglect.” In this respect, Li Fet may be placed in the genealogy of medieval French literature that seeks a new historical description.
3. Parallel corpus approach to translatio medievalis
In the prologue to Li Fet, the author states his intention to portray the lives of twelve emperors, from Caesar to Domitian.9
(5) Et commencerons nostre conte principalement a Juille Cesar, et le terminerons a Domicien, qui fu li douziemes empereres, (…)
Owing to unknown reasons, however, the book ends with the first volume, and its main part represents an Old French translation of Caesar’s De Bello Gallico (hereafter, De Bello). The popularity of this incomplete work is incontestable. Not only do we have 47 manuscripts from the golden age or the later period of the Middle Ages, but there are also Italian, Portuguese, and Catalan versions.10 Li Fet is an epochal history in two respects. First, the author intends to portray the lives of twelve emperors in chronological order. Second, he attempts to describe the history of the Roman Empire itself. In translating De Bello into Old French, the author of Li Fet examines the transition period from the Roman Republic to the Roman Empire with Caesar as the protagonist. He is keen to observe historical objectivity, but as is evident from the first person plural forms commencerons and terminerons in (5), the viewpoints are not limited to his, but extend to those of the readers. At best, Li Fet is a virtual history written from the perspective of medieval people. In fact, copyists of the manuscripts of group α of Li Fet attributed De Bello not to Julius Caesar but to Julius Celsus Constantinus, who was accepted as the author of De Bello by medieval intellectuals: Ici commence Juliens conment Cesar conquist France, that is “Here begins Julius Celsus Constantinus how Caesar conquered France.” In Li Fet, the setting of De Bello is transposed from ancient Gaul to thirteenth-century France. The opening passage of De Bello, Gallia est omnis diuisa in partes tres, (...) (BG I.1.) corresponds to the following sentence in Li Fet: France estoit molt granz au tens Juilles Cesar. (FR I.)11 The present tense in the Latin original, est diuisa, is translated into a past event by the imperfect estoit. The translator appears to have presented an illustration of the essential content and not a literal translation of the original. Beer supposes that “the translator is using Roman Gaul as a historical justification for Philip Augustus’ dream of a larger, unified France” (cf. Beer (1976) p.74).
9 Probably from the birth of Caesar in 100 BC to the death of Domitian in 96 AD. The most extensive analyses of Li Fet are Flutre and Sneyders de Vogel (1977) and Jeanette Beer (1976).
10 Li Fet remained unknown in Spain, and the passages in Grant crónica de Espanya in the fourteenth century were quoted from the Catalan version, see Leeker (1986) p.69.
At present, the principles of translatio medievalis cannot be discerned from the extant documents. Although it was written three centuries later, the third principle of Estienne Dolet may provide a clue to the mentality of the medieval translators. Dolet warns translators against being so faithful to the original as to translate it word for word. Literal translation is due to a poor and defective mind: en traduisant il ne se fault pas asseruir iusques à la, que lon rende mot pour mot.
Et si aulcun le faict, cela luy procede de pauureté, & deffault d’esprit.12 He goes on to state that a translator should be free of this defect and should work “without being hampered by word order, respecting whole sentences, so that the author’s intention will be properly expressed, observing attentively the property of both languages”: sans auoir esgard à l’ordre des mots il s’arrestera aux sentences, & faira en sorte, que l’intẽtion de l’autheur sera exprimée, gardant curieusement la proprieté de l’une, & l’aultre langue.13
To make a minute comparison of the texts of Li Fet and De Bello, we selected 54 chapters of the first book of De Bello and 3 corresponding chapters of Li Fet and constructed the parallel corpora of these two texts using a scripting language.14 To handle certain diacritics and special characters, our parallel corpus program produces an HTML file as output. The program juxtaposes two different files on a line-by-line basis, in this case, FILE 1 (Li Fet) and FILE 2 (De Bello). Nevertheless, the relatively liberal attitude adopted by the medieval translators toward the original creates difficulties for a parallel corpus analysis of medieval texts. A line of De Bello is not always translated in the corresponding line of Li Fet; as a result, the segmentation of De Bello into lines does not coincide with that of Li Fet.
11 From the word France, we can easily interpret the strong royal authority of Philip and his nationalistic stance. BG represents De Bello and FR represents Li Fet. “I.1.” represents section 1 of chapter I. The Latin text edition is by Constans (1990).
12 Nevertheless, the recommendation of a word-to-word or sentence-to-sentence translation by Jean d’Antioche is well known as one of the styles of translatio medievalis; see especially the introductory comment of Beer (1976) pp.1–2.
13 Estienne Dolet (1540) La manière de bien traduire d’une langue en aultre.
14 These corpora are based on the editions of Flutre et al. (1977) and Constans (1990).
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Two Parallel Corpora: fet_afr.txt and fet_lat.txt
fet_afr.txt Chapter 1 || fet_lat.txt Chapter 1-1
FILE 1: France estoit molt granz au tens Juilles Cesar. Ele estoit devisee en .iij. parties.
FILE 2: Gallia est omnis diuisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.
(...)
FILE 1: Ces .iij. manieres de François n’estoient pas d’un langage ne d’une maniere de vivre.
FILE 2: <strong>Hi</strong> omnes lingua, institutis, legibus inter se differunt. Gallos ab Aquitanis Garunna flumen, a Belgis Matrona et Sequana diuidit.
The result of parallel corpus search: keyword = Hi (FILE 1 = Li Fet des Romains, FILE 2 = De Bello Gallico)
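The line-by-line juxtaposition and keyword search illustrated above can be sketched in a few lines of Python. This is only an illustrative sketch and not the program used for this study: it assumes that both texts have already been segmented so that line i of one file answers to line i of the other, and the names juxtapose, search, and to_html are hypothetical.

```python
import html
import re
from itertools import zip_longest

def juxtapose(file1_lines, file2_lines):
    """Pair two texts line by line; a missing line is paired with ''."""
    return list(zip_longest(file1_lines, file2_lines, fillvalue=""))

def search(pairs, keyword):
    """Keep the pairs whose FILE 2 line contains the keyword as a whole word."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b")
    return [(f1, f2) for f1, f2 in pairs if pattern.search(f2)]

def to_html(pairs, keyword):
    """Render pairs as UTF-8 HTML with the keyword wrapped in <strong>."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b")
    marker = "<strong>" + html.escape(keyword) + "</strong>"
    out = ['<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">']
    for f1, f2 in pairs:
        # Escape the text between matches, then re-insert the marked keyword.
        marked = marker.join(html.escape(part) for part in pattern.split(f2))
        out.append("<p>FILE 1: " + html.escape(f1) + "</p>")
        out.append("<p>FILE 2: " + marked + "</p>")
    return "\n".join(out)

# Two line pairs taken from the sample output above.
AFR = [
    "France estoit molt granz au tens Juilles Cesar. Ele estoit devisee en .iij. parties.",
    "Ces .iij. manieres de François n’estoient pas d’un langage ne d’une maniere de vivre.",
]
LAT = [
    "Gallia est omnis diuisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, "
    "tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.",
    "Hi omnes lingua, institutis, legibus inter se differunt. Gallos ab Aquitanis Garunna "
    "flumen, a Belgis Matrona et Sequana diuidit.",
]

hits = search(juxtapose(AFR, LAT), "Hi")
page = to_html(hits, "Hi")
print(page)
```

Because the pairing is purely positional, any sentence that the translator splits, merges, or displaces will be set against the wrong counterpart unless the input files are re-segmented by hand beforehand.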
The size of our parallel corpus is still rather modest; De Bello contains 8,280 Latin words and Li Fet, 13,227 Old French words.15 The difference in word count reflects additional descriptions in Li Fet that are missing in De Bello. The differences, however, extend beyond these simple aspects. Let us examine the first lines of De Bello and Li Fet.
(6) Gallia est omnis diuisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur. (BG I.1.)
France estoit molt granz au tens Juilles Cesar. Ele estoit devisee en .iij. parties. Li François qui manoient en une des parties estoient apelé Belgue; cil de la seconde partie Poitevin ou Aquitain, tout est un ; cil de la tierce Celte. (FR I.)
15 The parallel corpus of Li Fet corresponds to pages 79–115 of Flutre et al. (1977).
The first sentence of De Bello corresponds to the first three sentences in Li Fet. As we have seen, the first sentence of Li Fet, France estoit ... Cesar, is purely the translator’s invention. He addresses his own nation as “Li François.” The nations of Aquitani and Celtae are named Poitevin ou Aquitain and Celte, respectively. The second sentence appears to be more complex.
(7) Hi omnes lingua, institutis, legibus inter se differunt. Gallos ab Aquitanis Garunna flumen, a Belgis Matrona et Sequana diuidit. (BG I.1.)
Ces .iij. manieres de François n’estoient pas d’un langage ne d’une maniere de vivre. (FR I.)
Only the first sentence of the original, Hi omnes ... se differunt, that is, “All of these nations differ from one another in language, institutions, and laws,” was translated into Old French as follows: Ces .iij. manieres de François n’estoient pas d’un langage ne d’une maniere de vivre. This translates as “These three kinds of François did not share a single language or a single way of living.” The translator does not mention the natural boundaries of these three nations given in the sentence Gallos ab Aquitanis (...) Sequana diuidit. Rather, he mentions them in the next sentence but three: Garonne cort entre les Poitevins et cels François qui lors estoient apelé Celte. Marne et Saine les dessoivrent des Belgues, car ces .ij. iaues corent entre Celtes et Belgues. (FR I.) That is, “The river Garonne runs between the Poitevins and the François who were called Celtes at that time. The rivers Marne and Seine separate them from the Belgues, for these two rivers run between Celtes and Belgues.” There remains no doubt that the author of Li Fet had no intention of producing a literal translation of De Bello. Well aware of the difficulties in the translation of the long-winded Latin of Julius Caesar, he simplified the original descriptions, shortened the lengthy sentences by dividing them, added numerous explanatory comments, and often changed the order of the original sentences. What should the term “parallel corpus” signify in the case of such free translation or adaptation, which was widespread in the Middle Ages? What will be the basis for a comparison of these parallel corpora? The construction of parallel corpora for medieval texts is a laborious task. Let us consider the following passage.
(8) Horum omnium fortissimi sunt Belgae, propterea quod a cultu atque humanitate prouinciae longissime absunt, minimeque ad eos mercatores saepe commeant (…) (BG I.1.)
Belgue estoient li plus fort a cel tens: genz sanz solas et sanz conpaignie, par ce que loingtaing estoient, ne marcheant ne gent d’autres terres ne reperoient gaires entr’els, (…) (FR I.)
The translator explicates that the Belgues were the strongest a cel tens “at that time.”16 However, curiously enough, he does not translate the following passage: a cultu atque humanitate prouinciae, which means “from the culture and the civilization of the Province.”17 Here prouincia denotes the part of Transalpine Gaul that had been occupied by Rome, namely Gallia Narbonensis.18 Our translator does not consider Gaul as belonging to Rome, but to France. Omitting this obsolete explanation, he proposes a simple interpretation of cause and effect: genz sanz solas et sanz conpaignie, par ce que loingtaing estoient, that is, “(They were) people without comfort and company, for they were living far away.” In reality, he was unable to mention either the “Province” that had disappeared since the destruction of the Western Roman Empire or “the culture and the civilization (of Transalpine Gaul)” because of a cultural exchange that had occurred after the foundation of the Carolingian Empire. It was probably Northern France that became more civilized in the thirteenth century. The above passage therefore resulted from the translator’s compromise between two contradictory intentions, namely, describing ancient events and depicting the true history of his time. This is the reason why he eliminated a cultu atque humanitate prouinciae and attributed the Belgae’s lack of comfort and company to the fact that they were living in the backlands. The adaptation and reinterpretation of the original may demonstrate the historical recognition of medieval people as well as the efforts to describe objectively the alterity of the Roman Empire.19
16 According to the translator, Helvetii (that is, Helveçois) were part of Belgae (that is, Belgues). Entre ces Belgues que l’en clamoit Helveçois, ot un home riche et noble; Orgetorix fu apelez. “Among these Belgues that one called Helveçois, there was a rich and noble man. His name was Orgetorix.” (FR II.1), see Leeker (1986) p.74.
17 The English translation is by Edwards (cf. Edwards (1979) p.3).
18 The Roman province of Gallia Narbonensis was formed circa 121 BC.
19 See some detailed comments in Croizy-Naquet (1999), pp.131–141.
In the same manner, he reinterpreted the original place-names and nations and placed them in medieval contexts so that his readers might gain a better understanding of the situation. Consequently, the original Aquitani was paraphrased as Poitevin ou Aquitain (see example (6) and Leeker (1986) p.74). It is significant to note that the designation of the Germani as Sesnes also reflects the medieval reality of the Duchy of Saxony.20 The same orientation holds true for the ancient tribes of Gaul. Dumnorigi Haeduo, fratri Diuiciaci (BG I.3) was interpreted as in the following passage: a Domnorix, le frere Diviciacus le seignor d’Ostum (FR II.2), that is, “to Dumnorix, brother of Diviciacus, lord of Autun.” Judging that the ancient Celtic nation of the Haedui was not familiar to his medieval readers, the translator adapted it to his contemporary setting and named it the “nation of Autun.” Thereafter, in the text, he refers to the Haedui as cels d’Ostun (FR II.13, II.15, III.6, III.7, III.14), cil d’Ostun (FR II.17, III.11), Cil d’Ostun (FR III.7), cil d’Ostum (FR III.2), Cil d’Ostum (FR II.15), cels d’Ostum (FR II.10, II.13, II.15, II.16, II.18, III.2, III.4, III.6, III.7, III.13, III.14, III.15), Cil del païs d’Ostum (FR II.10), or Cil dou païs d’Ostum (FR II.27). He also refers to Dumnorix as Domnorix d’Ostum (FR II.8) and Diviciacus as Diviciacus d’Ostum (FR III.2, III.12).21 The author of Li Fet added the place-name Besançon to the original per Alpes (BG I.10) as follows: par les Alpes devers Besançon (FR II.9), which translates as “through the Alps towards Besançon.” He must have assumed that the name Besançon would create a bridge between ancient Gaul and medieval France under Philip Augustus. In the following passage, the term Borgoigne or “Burgundy” may give his readers the impression of being involved in a living history. The Latin original tells us only about the proximity of Geneva to the territory of the Helvetii: Extremum oppidum Allobrogum est proximumque Heluetiorum finibus Genua (BG I.6.). Li Fet, on the other hand, explicates that Geneva is situated at the extreme end of Burgundy: Genvre estoit la derrienne citez de Borgoigne (...) (FR II.5.). Common nouns are sometimes replaced by terms that are familiar to medieval readers.
For instance, Orgetorix, the richest and noblest man of the Helvetii, regni cupiditate inductus (BG I.2) “was possessed by a desire for the kingship” and entered into a secret agreement with young nobles. Il fist une conjuroison de noble jovente par covoitise d’avoir regne et seignorie (FR II.1), that is, “he made a secret agreement with young nobles out of a desire to have the throne and sovereignty.” Here, the concept of throne is translated into a collocation regne et seignorie, where the first term regne comes directly from the Latin etymon regnum and the second term seignorie or “lordship” is rooted in medieval feudalism. It is evident that the translator attends both to the original text and to the contemporary reader.
20 “Dazu gehört etwa die in den Fet des Romains ständig wiederkehrende Übersetzung des lateinischen “Germani”/“Germania” mit “Sesne” bzw. “Sessoigne”; ein Blick in einen historischen Atlas zeigt nämlich, daß das Herzogtum Sachsen noch im 12. Jahrhundert bis beinahe an den Rhein reichte, (...)”, Leeker (1986) p.75.
21 The following are other occurrences of the term Ostum or Ostun: antor Ostum (FR II.14), autre d’Ostun (FR III.8), citeains d’Ostum (FR III.2), citez d’Ostum (FR III.2), message d’Ostum (FR III.8), païs d’Ostum (FR III.19), par devers Ostum (FR II.10), vers Ostum (FR II.10), and la voerie d’Ostun (FR II.22). On the alteration of place-names, see Leeker (1986) p.76.
4. Is a parallel linguistic analysis possible in the case of medieval texts?
As we have seen in the previous sections, the more or less free or adaptive nature of medieval translation narrows the possibility of a parallel corpus analysis. What are the viewpoints from which we can compare these parallel corpora? Previous studies have focused mainly on textual plots, intertextual relationships and dramatis personae, and place-names and specific terms. In this pilot analysis, we wish to explore the possibility of a linguistic analysis of parallel corpora with special consideration to Latin demonstrative pronouns. The importance of the linguistic and pragmatic functions of pronouns and demonstratives has been recognized since the articles of Roman Jakobson and Emile Benveniste.22 Both categories cannot be identified with a real referent if they are not conceived in a given situation or context. Although the pronouns I and you are often correlated with the demonstratives this and that, a significant difference lies in the fact that demonstratives are incomplete or opaque in themselves and need to be saturated by a deictic pointing gesture or a concrete referent. Indeed, the primary function of demonstratives is to draw the attention of interlocutors toward referents to enable identification through the deictic gesture or the situation of enunciation.23 For our parallel corpus approach to medieval translation, demonstratives are interesting objects of analysis.
This is because they are closely related, on the one hand, to the coherence of text or plot that is expected to be maintained even in a free translation, and on the other, to salient elements that both the original author and the translator wish to highlight to catch the readers’ attention. The following table shows occurrences of Latin demonstrative pronouns in our corpus.24 These pronouns have two different syntactic uses: pronominal and adnominal. Adnominal uses are relatively frequent in the case of hic and occur rarely in the case of is and ipse.25 There is no example of iste in the present corpus. We will begin with the explanation for is and ipse.
22 Jakobson (1971) and Benveniste (1966).
23 Kleiber (1987) p.19.
24 The term Demonstrativpronomen is due to Kühner and Stegmann (1962) p.617. Guy Serbat explains these categories under the rubric of “Démonstratifs, anaphoriques, articles” (cf. Serbat (1980) p.93).
25 There is another similar pronoun idem in Latin, but we will not consider it in this article.
Latin demonstrative pronouns in our corpus
         occurrences   adnominal use   pronominal use
is       225           21 (9.3%)       204 (90.7%)
ipse     48            3 (6.3%)        45 (93.7%)
hic      81            29 (35.8%)      52 (64.2%)
iste     0             0               0
ille     18            0               18 (100%)
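The percentages in the table follow directly from the raw counts. As a minimal sketch, the following Python lines recompute them; the counts are copied from the table, while the names COUNTS and shares are merely illustrative.

```python
# Counts copied from the table: (occurrences, adnominal use, pronominal use).
COUNTS = {
    "is":   (225, 21, 204),
    "ipse": (48, 3, 45),
    "hic":  (81, 29, 52),
    "iste": (0, 0, 0),
    "ille": (18, 0, 18),
}

def shares(total, adnominal, pronominal):
    """Return the adnominal and pronominal percentages of a pronoun's total."""
    if adnominal + pronominal != total:
        raise ValueError("the two uses must add up to the total")
    if total == 0:
        return (0.0, 0.0)  # no occurrences, so no meaningful share
    return (100 * adnominal / total, 100 * pronominal / total)

for pronoun, row in COUNTS.items():
    adn, pro = shares(*row)
    print(f"{pronoun:5} adnominal {adn:6.2f}%  pronominal {pro:6.2f}%")
```

For ipse the exact shares are 6.25% and 93.75%, which the table gives as 6.3% and 93.7%.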
4.1. Latin is and ipse
In Latin morphology, pronouns are not strictly distinguished from demonstratives (cf. note 24). The semantic duality of Latin demonstrative pronouns is evident. The following explanation by Guy Serbat summarizes the essence of the inherent problems of Latin.
Adjectival is is expressed, according to the context, either by a French definite article or by a weak demonstrative (without particles –ci or –là):
EAM partem Oceani quae est ad Hispaniam «LA partie de l’Océan qui est du côté de l’Espagne»
EIVS disputationis sententias memoriae mandaui (Cic.) «j’ai gardé en mémoire les termes de CETTE discussion»
Pronominal is corresponds to our personal pronoun of the third person:
Venit mihi obuiam tuus puer; IS mihi reddit... (Cic.) «Ton esclave vient à ma rencontre; IL me remet...» (...)
On the contrary, what is surprising is the fact that Latin dispenses very often with a pronoun or an article in the context where we must introduce it. (Serbat 1980, p.96)
Apart from examples without any overt demonstrative pronoun is, which are outside the scope of the present consideration, there are three different usages of is: (1) determiner, (2) weak demonstrative,26 and (3) pronominal. The distinction between a determiner and a weak demonstrative is not always clear. Our criterion for classification is both morphological and semantic. First, on a reading of the relevant passage of De Bello, each is is classified either as a determiner or as a demonstrative from a semantic viewpoint. Second, we check and confirm the translation of the same passage in Li Fet. By way of illustration, we provide two instances of is that are sorted as determiners.
(9) Planities erat magna et in ea tumulus terrenus satis grandis. (...) Legionem Caesar quam equis uexerat, passibus ducentis ab eo tumulo constituit. (BG I.43)
La champaigne estoit large entre les .ij. oz. En mileu ot un petit tretre ou il assemblerent a parlement. Cesar mist cels de la disme legion, qu’il avoit a chevals montez, a .cc. pas en sus dou tretre. (FR III.14)
26 “Das Pronomen is, ea, id ist das schwächste unter allen Demonstrativen, indem es zwischen den Personalpronomen und den eigentlichen Demonstrativen steht”, see Kühner and Stegmann (1962) p.617.
In (9), eo tumulo “the barrow” is anaphoric because it has already been mentioned in the previous context, namely, et in ea tumulus terrenus “and in it (that is, a large plain) a barrow of earth.” It seems that the demonstrative pronoun is does not have a clear demonstrative function here, but rather an anaphoric one. In other words, the barrow in question is identified solely by reference to the preceding context in ea tumulus terrenus, which induces the author of Li Fet to translate it as dou tretre with a definite article.
(10) Cum ex captiuis quaereret Caesar quam ob rem Ariouistus proelio non decertaret, hanc reperiebat causam, quod apud Germanos ea consuetudo esset ut matres familiae eorum sortibus et uaticinationibus declararent utrum proelium committi ex usu esset necne; ... (BG I.50)
Li prison distrent que costume estoit entre les Sesnes que les matrones gitoient sort por enquerre la boenne hore de combatre, (FR III.21)
In (10), ea consuetudo “the/a custom” is not anaphoric but cataphoric, since the content of the custom is explained after the conjunction ut: ut matres familiae... necne. The author of Li Fet did not use any determiner in (10). In this manner, the difficulty in classification notwithstanding, among the 225 occurrences of is, there are 21 examples of the adnominal is, the function of which can be identified with a deictic pointing or a contextual reference. More precisely, there are 4 cases of pure deictic, i.e., spatiotemporal use, 11 of anaphoric use, and 6 of discourse-deictic use. Of the four instances of pure deictic use, the three instances in (11)–(13) are translated into the Old French demonstratives ce, cel, and cele.
(11) Eo die quo consuerat interuallo hostes sequitur (...) (BG I.22)
Ce jor meïsmes suivi ses anemis a itel antreval (...) (FR II.21)
(12) Ex eo die dies continuos quinque Caesar pro castris suas copias produxit (...) (BG I.48)
Des ce jor en avant ne fina onques Cesar de sa gent ordener chascun jor et (...) (FR III.19)
(13) Ex eo proelio circiter milia hominum CXXX superfuerunt eaque tota nocte continenter ierunt: nullam partem noctis itinere intermisso (...) (BG I.26)
De cel estor eschaperent .cxxx. miliers que d’omes, que de fames, que d’anfanz. Cil ne finerent onques d’errer tant come cele nuiz lor dura, et l’endemein et au tierz jor, (...) (FR II.25)
There remains no doubt that in (11)–(13), the translator contextualizes eo die “that day” and ea nocte “that night” with reference to Caesar’s actions. The protagonist’s presence at the time of the speech act is perfectly relevant for evoking such temporal conceptualization. In (14), such a demonstrative meaning appears to be weak. Ad id tempus, which means “until that time,” corresponds to a temporal adverb, jusqu’a ore. (14) si pace uti uelint, iniquum esse de stipendio recusare, quod sua uoluntate ad id tempus pependerint. (BG I.44) se il volent pes avoir, folie est de mon treü retenir, que il m’ont paié em pes jusqu’a ore. (FR III.15)
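The tally reported for the pure deictic use of is (three of four instances rendered by an Old French demonstrative) can be checked mechanically against the aligned pairs in (11)–(14). The following Python sketch is only an illustration and not the procedure used in this study: the set DEMONSTRATIVES contains just the forms cited in this section, the helper name has_demonstrative is hypothetical, and the Old French passages are abridged from the examples above.

```python
import re

# Old French demonstrative forms cited in this section (an assumption:
# this is not an exhaustive inventory of Old French demonstratives).
DEMONSTRATIVES = {"ce", "cel", "cele", "ces", "ceste", "cil"}

# Latin phrase containing adnominal "is" and the matching Li Fet passage,
# abridged from examples (11)-(14).
PAIRS = [
    ("Eo die", "Ce jor meïsmes suivi ses anemis a itel antreval"),
    ("Ex eo die", "Des ce jor en avant ne fina onques Cesar"),
    ("Ex eo proelio", "De cel estor eschaperent .cxxx. miliers"),
    ("ad id tempus", "que il m’ont paié em pes jusqu’a ore"),
]


def has_demonstrative(old_french: str) -> bool:
    """Return True if any whole word is one of the listed demonstratives."""
    words = re.findall(r"[^\W\d_]+", old_french.lower())
    return any(word in DEMONSTRATIVES for word in words)


translated = [latin for latin, french in PAIRS if has_demonstrative(french)]
print(f"{len(translated)} of {len(PAIRS)} rendered with a demonstrative")
```

Matching on whole words matters here: a plain substring search would wrongly find ces inside Cesar.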
Of the 11 examples of the anaphoric use of is, there are 3 typical cases in (15a)–(17a), wherein the author of Li Fet uses the demonstratives cel, cele, or ceste.
(15a) In eo itinere persuadet Castico, Catamantaloedis filio, Sequano, (...) (BG I.3)
En cele voie porchaça Orgetorix plus son domage que son preu. (FR II.2)
(15b) Postero die castra ex eo loco mouent. (BG I.15)
L’endemein coillirent lor tentes, si se partirent de lor leu. (FR II.14)
(16a) eam partem minime firmam hostium esse (...) (BG I.52)
plus foible de cele part (FR III.23)
(16b) Praeterea se neque sine exercitu in eas partes Galliae uenire audere quas Caesar possideret, (BG I.34)
Ansorquetot, ge n’oseroie aler sanz grant ostz la ou Cesar a seignorie et pooir, (FR III.5)
(17a) Is pagus appellabatur Tigurinus: (BG I.12)
Cele qarte partie… (FR II.11)27
(17b) Extremum oppidum Allobrogum est proximumque Heluetiorum finibus Genua. Ex eo oppido pons ad Heluetios pertinet. (BG I.6)
Genvre estoit la derrienne citez de Borgoigne et si pres des Helveçois que uns ponz de la vile apartenoit a els, (FR II.5)
27 The translator first explains Tuit li Helveçois estoient en qatre parties devisé, that is, “All the Helvetii were divided into four parts,” and identifies Is pagus with “this fourth part” or Cele qarte partie.
The difference becomes clear when we compare (15a) with (15b), (16a) with (16b), and (17a) with (17b). The translator does not make use of Old French demonstratives in (15b), (16b), and (17b). The semantics of the pronoun is fluctuates between anaphoric and deictic. In (15b), (16b), and (17b), anaphoric nuances become predominant since there are some factors that would prevent the adnominal is from assuming deictic interpretations. The subject of (15b), namely, Helvetii, is evident from the previous contexts: Ita Heluetios a maioribus ... dato discessit (BG I.14). In this example, ex eo loco, which translates as “from that place,” is not the element that draws the readers’ attention; rather, it adverbially supports the more important element, castra mouent, which means “(they) move (their) camp.” This might be the reason why ex eo loco was translated into de lor leu in Li Fet. This eo is more pronominal than anaphoric in (15b). The subordinate construction eas partes... quas Caesar possideret in (16b) yields the relative clause la ou... in Old French. Ex eo oppido is coreferential with oppidum in (17b). The same anaphoric relations are maintained in Old French between two synonymous nouns, citez and vile. Since such anaphora cannot be admitted into the foremost position in a series of descriptions, the author translated Is pagus into Cele qarte partie in (17a). Three additional cases of the adnominal is translated into an Old French demonstrative are provided in (18)–(20).
(18) Ex eo proelio (BG I.26)
De cel estor (FR II.25)
(19) in ea fuga periit; (BG I.53)
et fu ocise en cele fuie (FR III.24)
(20) propter eam adfinitatem, (BG I.18)
por cele afinité (FR II.17)
All six examples of the discourse-deictic is occur in the collocational expressions eam rem, eaque res, or eas res.28 Four of the six uses of is are translated into Old French demonstratives (see (21)–(24)).
(21) quare sibi eam rem cogitandam (BG I.33)
a entreprandre ceste besoigne, (FR III.4)
(22) eaque res conloquium ut diremisset, (BG I.46)
por ce estoit li parlemenz departiz, (FR III.17)
(23) eas res iactari nolebat, (BG I.18)
il ne voloit pas que tuit seüssent ce conseill (FR II.17)
(24) Ad eas res conficiendas (BG I.3)
porveoir tot cel afere (FR II.2)
28 For the definition of collocation, see Hausmann and Blumenthal (2006).
It can be said that, depending on the enunciative situation, the discourse-deictic noun res is often reinterpreted as a more concrete referent such as afere, besoigne, or conseill, which contributes to the saturation of the opacity of discourse-deictic reference. Since ipse derives from is intensified by the particle -pse, it has an emphatic nuance of “itself” or “in question.” Interestingly, our corpus contains no example wherein the adnominal ipse is translated into a demonstrative.
(25) Quae quidem res Caesari non minorem quam ipsa uictoria uoluptatem attulit, (BG I.53)
Cesar, quant il le trova, il n’an fu pas meins liez que de sa victoire, (FR III.24)
(26) Ipse autem Ariouistus tantos sibi spiritus, tantam adrogantiam sumpserat, (BG I.33)
Ariovistus restoit montez en si grant orguill que ne fesoit pas a sosfrir, (FR III.4)
(27) quam ipse imperator meritus uidebatur; (BG I.40)
et en fu plus honorez que Lucius Silla, qui conseles estoit, (FR III.11)
4.2. Latin hic
Hic, iste, and ille are Latin demonstratives. In the first book of the Gallic War, no instance of the demonstrative iste is attested. What is the main difference between the demonstrative pronouns hic and is? Traditional Latin grammar teaches that hic draws attention to a present thing, while is simply implies that a thing has already been mentioned or will be described in the subsequent lines.29 In fact, as seen in 4.1, is has a weak demonstrative function, and thus the difference between hic and is depends on the degree of deictic power; hic particularly highlights the author’s concern. Among the Latin demonstratives, Ernout and Thomas regard hic as being more successful than ille in drawing the attention of the readers.30 The adnominal use of ille is absent in our corpus. The pure deictic or spatiotemporal use of the adnominal hic is not very frequent, which sharply contrasts with the relatively frequent use of the pure deictic is. Four examples are attested. Only example (28) is translated into the demonstrative ces.
(28) his omnibus diebus (BG I.48)
toz ces .v. jors (FR III.19)
(29) Hic locus aequo fere (BG I.43)
En lieu ot un petit tretre (FR III.14)
29 “Hic unterscheidet sich von is dadurch, daß es immer auf einen Gegenstand als einen gegenwärtigen hinweist, während is bloß andeutet, daß ein Gegenstand schon erwähnt sei oder im folgenden erst beschrieben werde (is, qui), ohne ihn als einen gegenwärtigen darzustellen” (see Kühner and Stegmann (1962) p.621).
30 Ernout and Thomas (1972) p.188.
Owing to unknown reasons, the translator eliminated the sentences under consideration in the other two examples. (30) Primam et secundam aciem in armis esse, tertiam castra munire iussit. Hic locus ab hoste circiter passus sexcentos, uti dictum est, aberat. (BG I.49) Les .ij. batailles fist estre totes armees et la tierce entendre as tentes drecier. (FR III.20) (31) Se prius in Galliam uenisse quam populum romanum. Numquam ante hoc tempus exercitum populi romani Galliae prouinciae finibus egressum. Quid sibi uellet? (BG I.44) Je ving ançois en France que li Romain; que quierent il en ma possession ? La province est moie; (FR III.15)
Among the examples of adnominal hic, there is a collocational expression that Caesar preferred to use. This collocational expression is in the ablative absolute construction as in the case of hac oratione habita or hac oratione adducti. Its pragmatic function is conjunctive and not referential, i.e., linking successive discourses, such that this hic is often translated into ensi or einsi, which mean “like this, thus,” or a itant, which means “then.” (32) Hac oratione habita mirum in modum conuersae sunt omnium mentes (BG I.41) Quant Cesar ot ensi parlé, a merveille se changerent li cuer de tote la chevalerie, (FR III.12) (33) Hac oratione habita concilium dimisit. (BG I.33) A itant se departi li conciles; (FR III.4) (34) Hac oratione ab Diuiciaco habita omnes qui aderant magno fletu auxilium a Caesare petere coeperunt. (BG I.32) Quant Diviciacus ot einsi parlé, tuit li autre baron de France crierent merci em plorant a Cesar... (FR III.3)
Two examples are omitted in the translation: Hac oratione adducti (BG I.3), hac oratione ... designari (BG I.18). Another collocational expression is his rebus. As observed in the case of the expression eas res in 4.1., his rebus was paraphrased in (35).
(35) His rebus fiebat ut et minus late uagarentur et minus facile finitimis bellum inferre possent; (BG I.2)
Ces .iij. clostures ne lessoient pas les Helveçois estendre a lor volenté por bataillier a estranges voisines genz; (FR II.1)
In (35), the concrete referents of Ces .iij. clostures, that is, “these three partitions,” are made clear by the preceding lines: they represent, first, the wide and deep river Rhenus; second, the very high mountains of Jura; and third, Lake Leman and the river Rhodanus. Four other examples of his rebus were omitted in Li Fet: His rebus adducti (BG I.3), His rebus (BG I.18), His omnibus rebus (BG I.19), and His rebus cognitis (BG I.33). The exophoric hic is generally translated into various demonstratives such as cele, ces, ceste, and cil. This confirms that the function of hic is to draw the attention of the readers to a referent.
(36) Huic legioni Caesar et indulserat praecipue et propter uirtutem confidebat maxime. (BG I.40)
A cele legion, ce dist Juliens, avoit il fete honor sovent et doné dons, si se fioit plus en sa vertu que en nule des autres. (FR III.11)
(37) cum his quinque legionibus ire contendit. (BG I.10)
si s’adreça vers France ot tot ces .v. legions par les Alpes devers Besançon. (FR II.9)
(38) Hoc proelio trans Rhenum nuntiato Suebi, (BG I.54)
Quant ceste bataille fu nonciee outre le Rin, (FR III.25)
(39) Hic pagus unus cum domo exisset patrum nostrorum memoria L. Cassium consulem interfecerat et eius exercitum sub iugum miserat. (BG I.12)
Ce furent cil meïsmes qui avoient ocis Lucius Cassius, le consele romain, et s’ostz desconfite et prise. (FR II.11)
However, here again, there is an ablative absolute construction, hoc proelio facto, that Caesar often used as a simple conjunction. Hoc proelio facto (BG I.13) is paraphrased by the translator into Apres la desconfiture de ces Tygurins (FR II.12). Beer explains: “The criterion determining whether the demonstrative is to be explained in full is always that of clarity.”31 Even though Beer’s solution is plausible, it seems more important for us to explore exactly what clarity meant to medieval writers. In the following examples, the translator omitted his regionibus and did not use the demonstrative for hoc toto proelio. Was this again due to the need for clarity?
(40) Qui nisi decedat atque exercitum deducat ex his regionibus, sese illum non pro amico, sed hoste habiturum. (BG I.44)
Et se tu ne l’en meinnes errant, je ne te tendrai des ore en avant por ami, mes por annemi ; (FR III.15)
(41) Nam hoc toto proelio, cum ab hora septima ad uesperum pugnatum sit, auersum hostem uidere nemo potuit. (BG I.26)
La bataille ot duré de la septiesme hore dou jor jusq’au vespre, si que li uns ne veoit l’autre por la nuille et por la poudre. (FR II.25)
31 Beer (1976) p.57.
The discourse-deictic hic is paraphrased in several ways. The arrogant and bumptious Ariovistus responded with defiance to the mission that Caesar assigned to him. The translation ces choses in (42) refers to the direct speech of Ariovistus in the following lines: Se ge avoie mestier de Cesar, ge iroie a lui; ensement, se il a mestier de moi, viengne a moi. (...) Et mout me merveill que Cesar et li pueple de Rome s’ont a entremetre de la moie France, que je ai vaincue et conquise par bataille. The discourse-deictic function of ces is saturated in reference to Ariovistus’ speech. The translation of Hac spe into por ce can be considered approximate from the semantic viewpoint, but it is more important to remark that the previous line (Il m’est mestiers... mon empirement.) constitutes the endophoric referent of ce. The other hic is expressed by itel or tel (see (44)–(45)).
(42) His responsis ad Caesarem relatis (...) (BG I.35)
Li message renoncerent ces choses a Cesar. (FR III.6)
(43) Amicitiam populi romani sibi ornamento et praesidio, non detrimento esse oportere, idque se hac spe petisse. (BG I.44)
Il m’est mestiers que l’amistiez des Romains me torge a preu de m’onor et de mon acroissement, non pas de mon empirement. Et por ce la requis ge, (FR III.15)
(44) Genus hoc erat pugnae (...) (BG I.48)
En itel maniere de pongneïz estoient li Sesne aüsé: (FR III.19)
(45) Populi romani hanc esse consuetudinem, (BG I.43)
et li pueples de Rome a tel costume... (FR III.14)
Conclusion
We draw the following conclusions from this attempt at a parallel corpus analysis. Li Fet des Romains was one of the medieval translations of Latin classics as well as one of the early historical works wherein writers attempted to establish a new linguistic model of truth based on written prose. Fundamental difficulties in the parallel corpus analysis of medieval translation arise from the relatively free and adaptive nature of medieval translation texts, which renders the basis for comparison ambiguous and opaque. Therefore, previous studies of medieval translation have compared aspects such as textual plots, intertextuality, and personal names and place names. Although the corpus is rather modest in size, the present parallel corpus analysis may offer a way past this difficulty. Demonstratives are closely related to textual coherence, which medieval translators had to maintain even in free and adaptive translation. Like the original authors, translators would capture readers’ attention by means of demonstratives. In other words, demonstratives belong to linguistic categories that can remain relatively intact even though translators modify their originals. In our corpus, the opposition of the adnominals is and hic is apparent particularly in pure spatiotemporal use: is is generally translated into Old French demonstratives (see (11)–(14)). In discourse-deictic use, there are numerous collocational expressions for both is and hic: eam rem, eas res, hac oratione, his rebus, etc. In this case, we can observe a similar reinterpretation or paraphrase of res or oratio: ceste besoigne for eam rem, cel afere for eas res, ensi or a itant for hac oratione habita, ces choses for his responsis, por ce for hac spe, etc. The translator, on the other hand, has a tendency to omit his rebus in the translation. The exophoric hic is generally translated into demonstratives in Old French; by contrast, the author of Li Fet translates only half the exophoric is into demonstratives. This is probably due to the anaphoric rather than deictic nature of is.

References
Baumgartner E. 1995. “Roman antiques, histoires anciennes et transmission du savoir aux XIIe et XIIIe siècles”, in Medieval Antiquity, A. Welkenhuysen et alii (eds.), Leuven University Press, 219-235.
Beer J. M.A. 1976. A Medieval Caesar, Droz, Genève.
Benveniste E. 1966. “La nature des pronoms”, in Problèmes de linguistique générale, 1, Gallimard, 251-257.
Buridant C. 1983. “Translatio medievalis. Théorie et pratique de la traduction médiévale”, Travaux de Linguistique et de Littérature 21, 81-136.
Buridant C. 2000. Grammaire nouvelle de l’ancien français, SEDES.
Chomarat J. 1993. “Dolet traducteur des Lettres familières de Cicéron”, in Etudes sur Etienne Dolet, Le Théâtre au XVIe siècle, Le Forez, Le Lyonnais et l’Histoire du livre, Gabriel-André Pérouse (ed.), Droz, Genève, 91-102.
Constans L.-A. 1990. César Guerre des Gaules, tome I Livres I-IV, Les Belles Lettres, Paris.
Croizy-Naquet C. 1999. Écrire l’histoire romaine au début du XIIIe siècle, Honoré Champion, Paris.
Diessel H. 2006. “Demonstratives”, in Encyclopedia of Language & Linguistics, Second Edition, Keith Brown (ed.), Elsevier, Amsterdam; Tokyo, 430-435.
Edwards H.J. 1979. Caesar The Gallic War, Loeb Classical Library, Harvard University Press, Cambridge.
Ernout A. and Thomas F. 1972. Syntaxe Latine, 2e Edition, 5e tirage revu et corrigé, Klincksieck, Paris.
Flutre L.-F. 1974. Les Manuscrits des Faits des Romains, 1932, Slatkine Reprints, Genève.
Flutre L.-F. and K. Sneyders de Vogel 1977. Li Fet des Romains, 2 vols., Paris-Groningue, E. Droz-J.-B. Wolters, 1938, Slatkine Reprints.
Hausmann F.J. and P. Blumenthal 2006. “Présentation: collocations, corpus, dictionnaires”, in Langue française 150, 3-13.
Jakobson R. 1971. “Shifters, verbal categories, and the Russian verb”, in Roman Jakobson Selected Writings II Word and Language, Mouton, 130-147.
Kawaguchi Y. 2004. “Medieval translation of Caesar’s Gallic Wars: Li Fet des Romains” (in Japanese), Human Arts and Sciences 7, University of Human Arts and Sciences, 27-45.
Kleiber G. 1987. “L’opposition cist / cil en ancien français ou comment analyser les démonstratifs?”, Revue de linguistique romane 51, 5-35.
Kühner R. and Carl Stegmann 1962. Ausführliche Grammatik der lateinischen Sprache, Satzlehre Erster Teil, Wissenschaftliche Buchgesellschaft, Darmstadt.
Leeker J. 1986. Die Darstellung Cäsars in den romanischen Literaturen des Mittelalters, Frankfurt am Main, Analecta romanica 50.
Marchello-Nizia C. 1995. L’Évolution du français, Ordre des mots, démonstratifs, accent tonique, Armand Colin.
Schmid-Chazan M. 1980. “Les traductions de la “Guerre des Gaules” et le sentiment national au Moyen Age (1)”, Annales de Bretagne et des pays de l’ouest, 87, Université d’Angers, 387-407.
Serbat G. 1980. Les structures du latin, Editions A. & J. Picard, Paris.
Soutet O. 1992. Etudes d’ancien et de moyen français, P.U.F.
Spiegel G. M. 1993. Romancing the Past. The Rise of Vernacular Prose Historiography in Thirteenth-Century France, University of California Press, Oxford, England.
Wunderli P. 1993. “Le rôle des démonstratifs dans «La Vie de Saint Léger». Deixis et anaphore dans les plus anciens textes français”, in Le passage à l’écrit des langues romanes, M. Selig, B. Frank, J. Hartmann (eds.), Tübingen, 157-179.
Patient-Orientedness in Resultative Compound Verbs in Chinese
Keiko MOCHIZUKI

1. Introduction
The aim of this paper is to examine “Patient1-Orientedness” in Chinese by studying a database of resultative compound verb examples in Chinese. Li and Thompson (1976) classify Chinese as a topic-prominent language and offer the following “pseudo-passive” examples as one of the characteristics of the topic-prominent language type.
(1) Zhei-jian xinwen guangbo le.
    this-CL2 news broadcast PFV3
    This news (topic), it has been broadcast. (Li and Thompson 1976:480)
(2) Nei-ben shu yijing chuban le.
    that-CL book already publish PFV
    That book (topic), it has already been published. (Li and Thompson 1976:480)
Both guangbo (broadcast) and chuban (publish) have the following argument structure:
(3) [Agent, Theme4]
This is a typical argument structure of transitive verbs, and neither guangbo (broadcast) nor chuban (publish) has an inchoative intransitive function. Nevertheless, in both (1) and (2), the agent is not realized as an argument. Instead, the theme is realized as xinwen (news) and shu (book), respectively. Whether these are the topic or the subject is a controversial issue; however, both (1) and (2) are unmarked sentences in Chinese. If they are changed to bei passive forms, they will bear an adverse meaning, as can be seen below:
(4) Zhei-jian xinwen bei guangbo le.
    this-CL news BEI5 broadcast PFV
    This news (topic), it has been broadcast (and we had an adverse result by this).
(5) Nei-ben shu yijing bei chuban le.
    that-CL book already BEI publish PFV
    That book (topic), it has already been published (and we had an adverse result by this).
1 “Patient” implies the entity that undergoes a change of state by an external force.
2 Classifier
3 Perfective
4 “Theme” is a term of the semantic role, and it refers to the entity that either moves or undergoes a change of state or simply exists. In the case of (3), “theme” is actually equivalent to “Patient”; however, I will adopt “theme” for the argument structure.
5 A passive marker that constitutes “Patient + bei (+Agent) + Verb,” which essentially expresses an adverse meaning.
Here, we find a mismatch in the voice system in Chinese; the passive form is marked although the agent of the transitive verb is suppressed. Li and Thompson (1976:467) note that the passive construction is common among subject-prominent languages; on the other hand, among topic-prominent languages, passivization does not occur at all (e.g., Lahu, Lisu), appears as a marginal construction that is rarely used in speech (e.g., Mandarin), or carries a special meaning (e.g., the “adversity’’ passive in Japanese). According to their explanation, the reason for the relative insignificance of the passive among topic-prominent languages is that it is the topic, not the subject, that plays a more significant role in sentence construction; any noun phrase can be the topic of a sentence without registering anything on the verb. In fact, any noun phrase can be the topic. In this paper, however, I would like to claim that the “Patient’’ tends to undergo “foregrounding’’ and be located in the sentence-initial position in Chinese. Tai (1984) claims that Chinese is a “Patient-Oriented’’ language, while English is an “Agent-Oriented’’ language. He claims that English has four Vendler (1967) categories: Activities, Accomplishments, Achievements, and States. On the other hand, Chinese only has Activities, States, and Result6. Tai (1984:291) notes that “to kill’’ in English is an accomplishment verb, which necessarily implies the death of the recipient of the action; thus, (6) is ungrammatical: (6) * I killed John but he didn’t die.
On the other hand, the corresponding verb “sha” (to kill) in Chinese does not necessarily imply the death of the recipient of the action, as (7) shows:
(7) Zhangsan sha le Lisi liangci, Lisi dou mei si.
    Zhangsan attempted to kill PFV Lisi twice Lisi all MEI7 die
    Zhangsan performed the action of attempting to kill Lisi, but Lisi did not die.
In (7), “sha” (to kill) functions as an activity verb that does not imply the death of the recipient, and therefore, the result “si” (to die) can be “cancelled.” On the other hand, the ungrammaticality of (8) shows that the resultative compound verb “sha-si” (to kill-to die) guarantees the attainment of the goal, namely, the death of the recipient of the action.
(8) * Zhangsan sha-si le Lisi liangci, Lisi dou mei si.
      Zhangsan kill-die PFV Lisi twice Lisi all MEI die
    * Zhangsan killed Lisi twice, but Lisi did not die.
6 Tai (1984:294) claims that resultative compound verbs and resultative simple verbs (e.g., “si” (to die)) belong to one single category, namely, “Results.”
7 MEI is a negative marker that negates an event with aspectual features.
The contrast between (6) and (7) suggests that while English has the category of accomplishment verbs, which have both the action (i.e., [+durative]) and result (i.e., [+telic]) aspects, there is no such category in Chinese because action verbs (e.g., “sha” (to kill)) do not have the result aspect. Tai (1984:295) argues that the reason why Chinese action verbs do not have an implicational structure (i.e., action with result aspect) can be attributed to the characteristics of Chinese as a Patient-Oriented language type. He notes the following:
“As an agent-oriented language, English looks at the ending point of an event from the viewpoint of an agent and thus allows action verbs to have implicational structures. By contrast, as a patient-oriented language, Chinese looks at the ending point of an event from the viewpoint of an affected patient and therefore its action verbs do not exhibit implicational structures. Instead, it allows the action part of a resultative verb compound to be presupposed and the result part to be asserted. This contrast can be visualized in
English: Agent (action) →
Chinese: ← Patient (result)” (Tai 1984:295)
Mochizuki (2004:199–208) supports Tai’s claim through two pieces of evidence. The first piece of evidence is the fact that Chinese has a rich system of “decausativization,”8 which suppresses the agent in argument realization and foregrounds the Patient as a subject, whereas English rarely allows “decausativization.” Compare the following examples:
(9) a. * Cherry trees planted in the park.
    b. Yinghua shu zhong zai gongyuan-li.
       cherry tree plant LOC9 park-inside
       Cherry trees have been planted in the park.
    c. Kooen-ni sakura-no ki-ga uw-atteiru.
       park-LOC cherry-GEN10 tree-NOM11 plant-intransitive suffix-PFT12
8 The term “decausativization” was first proposed by Kageyama (1996:Chapter 4). He argues that Japanese uses the suffix -ar- to derive an inchoative intransitive verb from a causative transitive verb; e.g., u-e-ru (vt plant) vs. uw-ar-u (vi be planted) and tum-e-ru (vt pack) vs. tum-ar-u (vi become packed).
9 Locative marker
10 Genitive marker
11 Nominative marker
12 Perfect
In (9b), the transitive verb “zhong (plant)” functions as an intransitive verb, that is, it is “decausativized” like uw-ar-u (vi be planted) in Japanese when “zhong (plant)” is embedded in an “existential sentence.”13 The existential sentence is one of the strategies that foreground the patient “cherry trees” of the action “plant” and background and suppress the agent of the action “plant.” The ungrammaticality of (9a) suggests that English has no means of decausativization since the agent cannot be backgrounded. Another strategy of decausativization in Chinese is attaching postverbal resultative complements. Let us compare the following examples of “bake”:
(10) a. * A cake baked.
     b. Dangao kao-hao le.
        cake bake-completed PFT
     c. Keeki-ga yak-e-ta.
        cake-NOM bake-intransitive suffix-PAST
(10b) displays that a transitive verb “kao” (bake) functions as an intransitive verb when it is followed by the postverbal resultative complement “-hao” (attained an ideal state, completed), which foregrounds the patient “cake” of the action “bake” and backgrounds an agent of the action. (10) displays the same contrast between English, Chinese, and Japanese as that displayed by (9): the ungrammaticality of (10a) suggests, again, that English has no way of decausativization since an agent cannot be backgrounded. Mochizuki (2004:202–203) notes the second piece of evidence in terms of a word-formation in the causative-inchoative alternation in Chinese; an inchoative intransitive form is morphologically more basic than a corresponding causative transitive form. (11) a.
(11)    inchoative intransitive form    causative transitive form
     a. po: break (Vi)                  da-po (beat-break): break (Vt)
     b. duan: break (Vi)                qie-duan (cut-break): cut (Vt)
     c. huai: ruin, damage              nong-huai (act on something-ruin, damage): break (Vt)
The three pairs of inchoative intransitive and causative transitive forms in (11) show that the inchoative intransitive forms are simple predicates (intransitive verbs or adjectives) and the causative transitive forms are formed by compounding, that is, by adding action verbs in order to constitute resultative compound verbs. This morphological phenomenon suggests that, in Chinese, foregrounding the “theme” as a subject of an inchoative intransitive verb is a more basic expression. The aim of this paper is to examine whether this analysis can be supported by the database of 1,673 resultative compound verb sentences collected from A Dictionary of Chinese Verb-Resultative Complement Phrases (1987).
13 This term is according to Li and Thompson (1981:Chapter 17).

2. Procedure
The source, A Dictionary of Chinese Verb-Resultative Complement Phrases (1987; henceforth, DCVC), includes 5,000 sentences in which 322 verb-resultative complement phrases are used. These 322 verb-resultative complement phrases are selected from broad fields: newspapers, magazines, novels, scripts, screenplays, and even ordinary conversations. To begin with, we excluded 18 resultative complements that are considered to be grammaticalized complements: “-cheng” (become), “-chu” (indicating outward movement), “-dao” (indicating a goal or result), “-de” (be finished; be ready), “-diao” (become detached from; come off; fall), “-gei” (indicating “to~”; “for~”; “with~”; or “by~”), “-guo” (across; past; through; over), “-jian” (see), “-jin” (into; in), “-shang” (to; at; a higher or better position), “-wan” (finish), “-xia” (indicating an action moving or directed downwards), “-zai” (indicating time, place, or condition), “-zhao” (achieve a goal), “-zhu” (holding on to~; steady; firm), “-qi” (up; upwards), and “-zou” (away). Next, we collected 1,673 examples of 304 resultative complements after excluding the 18 resultative complements listed above. We classified these 1,673 examples into two categories: the type in which a resultative complement predicates an internal argument, that is, a patient or theme, and the type in which a resultative complement predicates an external argument, that is, an agent, an experiencer, or a cause that triggers a change of state.
Henceforth, the first predicate of the Verb-Resultative Complement compound will be referred to as V1 and the second predicate as V2.

3. The type in which V2 predicates an internal argument
This type is the unmarked resultative compound verb type and constitutes 82.79% of the data (1,385 examples among 1,673). This category can be further classified into the Agent/Cause-Subject type (52.78%, 731/1,385) and the Patient-Subject type (47.22%, 654/1,385). In the following discussion, I would like to show that the Patient-Subject type is frequently used.
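The proportions just cited follow directly from the raw counts. As a small arithmetic check (a sketch only; the constant and function names are illustrative and do not come from DCVC or from the original database), the shares can be recomputed as follows:

```python
# Counts reported in Sections 2 and 3 of this paper.
TOTAL_EXAMPLES = 1673        # resultative compound verb sentences
INTERNAL_ARGUMENT = 1385     # V2 predicates a patient or theme
AGENT_CAUSE_SUBJECT = 731
PATIENT_SUBJECT = 654


def share(part: int, whole: int) -> float:
    """Percentage of `whole` made up by `part`, rounded to two decimals."""
    return round(100 * part / whole, 2)


# The two subject types must exhaust the internal-argument category.
assert AGENT_CAUSE_SUBJECT + PATIENT_SUBJECT == INTERNAL_ARGUMENT

print(share(INTERNAL_ARGUMENT, TOTAL_EXAMPLES))       # 82.79
print(share(AGENT_CAUSE_SUBJECT, INTERNAL_ARGUMENT))  # 52.78
print(share(PATIENT_SUBJECT, INTERNAL_ARGUMENT))      # 47.22
```

The two subject types differ by fewer than six percentage points, which is the sense in which the Patient-Subject type can be called frequent.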
3.1. Agent/Cause-Subject type
First, let us examine the Agent/Cause-Subject type. This type has two patterns in terms of the argument structure of V1V2, as shown in (12) below:
(12) Agent/Cause-Subject type (731/1,385, 52.78%)
  a. [Agent/Cause, Theme] (702/1,385, 50.69%)
     Ta kan-wan xin yihou, jiu ba xin si-sui le.
     he read-finish letter after soon BA letter tear-into pieces PFT
     He tore up a letter after reading it. (DCVC:311)
  b. [Event (Agent, (Theme)), ] (29/1,385, 2.09%)
     Wo yinwei pao-man le yibu, suoyi mei ganshang qiche.
     I because run-late PFT one step therefore MEI in time for bus
     Because I did not run fast enough, I just missed the bus. (DCVC:241)
(12a) is a typical causative transitive structure, which can be represented through the following event structure, namely, the “Lexical Conceptual Structure” (henceforth, LCS):
(13) [ x ACT (ON y) ] CAUSE [ BECOME [ y BE AT – z ]]
     x = he; ACT = tear; y = letter; z = pieces
(12b) is a marked structure wherein V2 has “event” as a semantic role and embeds the argument structure of V1 [Agent, Theme] in “event.” (12b) is considered as having the following LCS:
(14) [ [EVENT x ACT ] BE AT – z ]
     x = I; ACT = run; z = late
(15) presents the case in which an inanimate “cause” is the subject; this pattern is also a typical causative transitive structure in Chinese, as (13) is.
(15) da-po (hit-break)
     Bingbao ba14 boli chuang dou da-po le.
     hailstones BA glass window all hit-break PFV
     The hailstones broke the window. (DCVC:259)
(15) has the following LCS, which is similar to that in (13):
(16) x CAUSE [ BECOME [ y BE AT – z ]]
     x = the hailstones; y = the window; z = break
Mochizuki (2004: 202–203) notes that the resultative compound verbs with the argument structure [Agent/Cause, Theme] and with LCS (13) or (16) are basically transitive verbs. This is also suggested by the inchoative intransitive/causative transitive pairs listed in (11).
14 Ba is a marker for the affected object located in the preverbal position.
3.2. Patient-Subject type

Among the 1,385 examples, we found 654 examples of the Patient-Subject type. Let us now consider the following examples.

(17) Patient-Subject type (654/1,385, 47.22%)
  a. [Theme, ] (510/1,385, 36.82%)
     Zhe ding maozi dai-zang le, gai xixi le.
     this CL hat wear-dirty PFV have to wash PFT
     The hat has been worn for a long time and has become dirty. You should wash it. (DCVC:372)
  b. BEI passives (144/1,385, 10.40%)
     Tushuguan-li de jiben huabao henkuai jiu bei xuesheng fan-jiu le.
     library-in DE some magazine quickly then BEI student turn-old PFT
     Students quickly dirtied the pictorial magazines in the library. (DCVC:202)
(17a) is considered as having the following LCS:

(18) [ x ACT FOR LONG TIME ]15 CAUSE [ BECOME [ y BE AT – z ]]
     x ACT = wear this hat, y = this hat, z = dirty
     (the causing subevent [ x ACT FOR LONG TIME ] is the shaded portion)
(18) is similar to (13) and (16) in that it also displays a causal relation. The only difference concerns whether or not x is realized as an argument. In (18), x, that is, the agent of wearing the hat, is suppressed and is therefore not realized as an argument. Instead, the patient “this hat” is foregrounded and realized as the subject, while the shaded causal action is backgrounded. It should be noted that resultative compound verbs with an LCS such as (18) are essentially transitive; “dai-zang” (wear-dirty) is therefore essentially transitive, as indicated in (19):

(19) Ni ba zhe ding maozi dai-zang le, gai xixi le.
     You BA this CL hat wear-dirty PFV have to wash PFT
     You dirtied this hat, you have to wash it. (DCVC:372)
Let us return to the passive case (17b). (17b) is supposed to have the following LCS:

(20) [ x ACT ON y ] CAUSE [ BECOME [ y BE AT – z ]]
     x = students, ACT = turn pages, y = magazine, z = dirty
In (20), the causal event “many students turned the pages of the magazine” is backgrounded and the patient, “magazine,” is foregrounded; therefore, the patient, “magazine,” is selected as a subject of the resultative compound verb “fan-jiu” (turn pages-old). Let us see some more patient-subject examples.
15 The shaded portion in the LCS indicates that it is “backgrounded.”
(21)
  a. Zhe zuo fangzi gai-ai le, he zhouwei de gao loudasha bu xietiao.
     This CL house build-low PFV, with surrounding DE tall buildings NEG balanced
     This house is built too low; it looks out of balance with the surrounding tall buildings. (DCVC:1)
  b. Mantou kao-jiao le jiu bu neng chi le.
     Bun bake-burnt PFT then NEG can eat PFT
     You can’t eat the bun if you burn it. (DCVC:191)
  c. Zhe jian wuzi ban-kong le, zhi shengxia yige da yigui hai mei ban-zou.
     This CL room move-empty PFT, only left one big wardrobe still NEG move-away
     This house is almost empty, there is only a big wardrobe left to move. (DCVC:215)
All the resultative compound verbs in (17a) and (21) undergo the same compounding processes, shown in (22):

(22) V1 [Agent, Theme1] + V2 [Theme2, ]16
  a. compounding → V1V2 [Agent, Theme1-Theme2] (causative transitive)
  b. decausativization → V1V2 [Theme1-Theme2, ] (inchoative intransitive)
The fact that examples of decausativization constitute 36.82% of the 1,385 examples is significant, since this kind of decausativization cannot occur in English17. This can also be observed in the corresponding English translations: “the hat has been worn” (passive), “this house is built too low” (passive), “you burn it” (causative transitive), and “this house is almost empty” (a plain state, not an inchoative). I would like to propose that this 36.82% share offers a piece of supporting evidence for “Patient-Orientedness” in Chinese.

Another piece of supporting evidence for “Patient-Orientedness” is the fact that among the 1,673 examples, we found 73 examples displaying the following compounding pattern:

(23) V1 [Agent, Theme1] + V2 [Theme2, ]
  a. compounding → V1V2 [Agent, Theme2] (causative transitive, 33/73 examples)
  b. decausativization → V1V2 [Theme2] (inchoative intransitive, 40/73 examples)
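The two compounding patterns in (22) and (23) can be made concrete with a small sketch. This is only an illustrative model, not part of the original analysis: role lists are represented as Python tuples, and the function names `compound` and `decausativize` and the flag `identify_themes` are hypothetical labels introduced here.

```python
from typing import Tuple

# A role list such as [Agent, Theme1] is modelled as a tuple of role labels.
Roles = Tuple[str, ...]


def compound(v1: Roles, v2: Roles, identify_themes: bool = True) -> Roles:
    """Combine V1 [Agent, Theme1] with V2 [Theme2, ].

    identify_themes=True  -> pattern (22a): [Agent, Theme1-Theme2]
    identify_themes=False -> pattern (23a): [Agent, Theme2]
                             (V1's internal argument is suppressed)
    """
    agent, theme1 = v1
    (theme2,) = v2
    internal = f"{theme1}-{theme2}" if identify_themes else theme2
    return (agent, internal)


def decausativize(v1v2: Roles) -> Roles:
    """Patterns (22b)/(23b): drop the Agent so that the theme becomes the subject."""
    _agent, internal = v1v2
    return (internal,)


V1 = ("Agent", "Theme1")
V2 = ("Theme2",)

print(compound(V1, V2))                                        # pattern (22a)
print(decausativize(compound(V1, V2)))                         # pattern (22b)
print(compound(V1, V2, identify_themes=False))                 # pattern (23a)
print(decausativize(compound(V1, V2, identify_themes=False)))  # pattern (23b)
```

In this toy model there is no operation that removes V2's argument, which mirrors the asymmetry reported for the data: Theme2 survives in every output.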
This type suppresses the internal argument of V1 and preserves the internal argument of V2. Let us see some more examples.

(24)
  a. Xi yifu qian xian ba xiuzi wan qilai, buran ba xiuzi dou xi-shi le.
     wash clothes before first BA sleeves roll up otherwise BA sleeves all wash-wet PFT
     Roll up your sleeves first before washing clothes, otherwise you will get them wet. (DCVC:299)

16 “Theme1-Theme2” implies that the internal arguments of V1 and V2 are identified.
17 Kageyama (1996: Chapter 4).
  b. Wo de yanjing jiushi zai ruo guang xia kan shu kan-huai de.
     my DE eyes just be LOC weak light under read books read-damaged modal particle
     My eyes have been damaged just because I read books under a weak light. (DCVC:176)
  c. Ta he-zui le, tangxia jiu shuizhao le.
     He drink-drunk PFT lie down soon fall asleep PFT
     He got drunk and fell asleep immediately upon lying down. (DCVC:403)
In (24a), “xi-shi” (wash-wet) is a causative transitive verb and has the argument structure of (23a). Significantly, the argument structure in (23a) does not inherit “Theme1” from V1; rather, it inherits “Theme2” from V2. This implies that “Theme2” from V2 is more prominent because the result is foregrounded. This fact supports Tai’s analysis (Tai 1984:295) that in Chinese the action part of a resultative compound verb is presupposed and the result part is asserted.

(24b) is an example of the decausativization of the type seen in (23b). In (24b), “my eyes” is a patient-subject, and “kan-huai” (read-damaged) focuses on the change of state of my eyes. The internal argument of “kan” (read) is suppressed, at least in the resultative compound verb “kan-huai,” although “shu” (book) appears in the preceding context as an object of “kan” (read). In (24c), the internal argument of “he” (drink) is suppressed, although it is predictable from V2 “zui” (drunk).

It should be noted that in our data, we could not find any case in which the internal argument of V2 is suppressed. This asymmetry between V1 and V2 in terms of the inheritance of arguments strongly supports the claim that the patient in the result should be realized as an argument, and it also supports Patient-Orientedness in Chinese.

4. The type in which V2 predicates an external argument

Among the 1,673 examples, we found 288 examples (17.21%) in which V2 predicates an external argument (Agent, Experiencer). Compared with the type in which V2 predicates an internal argument (82.79%, 1,385 of 1,673 examples), there is a striking difference in frequency. This contrast suggests that the causal relation represented in (13) is unmarked in the resultative compound verbs in Chinese, as it is in the resultatives in English18. (13) is repeated below:

(13) [ x ACT (ON y) ] CAUSE [ BECOME [ y BE AT – z ]]
18 Levin and Rappaport Hovav (1995:34) propose “the direct object condition,” which requires a resultative secondary predicate to predicate a direct object in a resultative sentence.
However, the resultative compound verbs in Chinese allow other types of LCS and argument realizations. In the following discussion, I will examine marked but possible types. The type in which V2 predicates an external argument can be subcategorized into four types in terms of argument realization, listed in (25) in order of their frequency in our data:

(25)
  a. V1 [Agent, ] + V2 [Experiencer, ] → V1V2 [Experiencer, ] (137/1,673, 8.18%)
     e.g. tiao-fan (dance-get tired of), wan-gaoxing (play-happy), pao-ke (run-thirsty), zou-lei (walk-tired)
  b. V1 [Agent, Theme] + V2 [Experiencer, ] → V1V2 [Experiencer, Theme] (115/1,673, 6.87%)
     e.g. chi-guan (eat-get used to), chi-ni (eat-get sick of), kan-gou (watch-enough)
  c. V1 [Theme1, ] + V2 [Theme2, ] → V1V2 [Theme1-Theme2, ] (24/1,673, 1.43%)
     e.g. dong-bing (be chilled-get sick), hua-dao (slip-fall), e-xing (hungry-awake)
  d. V1 [Agent, Theme] + V2 [Experiencer, Theme] → V1V2 [Experiencer, Theme] (12/1,673, 0.72%)
     e.g. ting-dong (listen to-understand), lian-hui (train-master), da-ying (play-win), ti-shu (kick-lose)
Note that in (25a,b,d), the resultative compound verbs inherit the experiencer from V2, not the Agent from V1. The inheritance of the experiencer can be examined by testing whether or not the volitional adverb “pinmingdi” (as hard as one can) can co-occur. The result reveals that none of the examples in (25a,b,c,d) can co-occur with the volitional adverb “pinmingdi.” See the following contrast:

(26)
  a. Ta pinmingdi tiao.
     He as hard as one can dance
     He dances as hard as he can.
  b. *Ta pinmingdi tiao-fan le.
     He as hard as one can dance-get tired of PFT
The contrast in (26) shows that the semantic prime in “tiao-fan” (dance-get tired of) indicates a psychological change of state, “fan” (get tired of), which takes the experiencer as its subject. The same contrast can be observed both in the type in (25b), “chi-guan” (eat-get used to), and in that in (25d), “ting-dong” (listen to-understand); neither of them can co-occur with the volitional adverb “pinmingdi.” The fact that the resultative compound verbs inherit the experiencer from V2 offers the third piece of evidence for the Patient-Orientedness analysis, since both the action and the agent in V1 are backgrounded, while a change of state in V2 is foregrounded; therefore, the resultative compound verbs inherit the experiencer from V2. In this sense, the experiencer is considered to be a patient who experiences a psychological change of state.

Let us now turn to (25c), in which both V1 and V2 have a theme and the resultative compound verbs inherit the theme as a subject. In this case, both the causal event and the result event are unaccusative events. In what follows, I will present examples for each of the four categories listed in (25) and examine their LCS. Let us first examine the “tiao-fan” (dance-get tired of) type (V1 [Agent, ] + V2 [Experiencer, ] → V1V2 [Experiencer, ]).

(27)
  a. Ta wan-gaoxing le, lian fan dou wangji chi le.
     He play-happy PFT even meal all forget eat PFT
     He was so enthusiastic about playing that he even forgot to eat. (DCVC:144)
  b. Tian re, pao le yitian de lu zhen youdianr pao-ke le.
     climate hot run PFV one day DE road really a little run-thirsty PFT
     I had been running about for the whole day in such hot weather, so I am really thirsty. (DCVC:215)
  c. Women dou zou-lei le, zuo-xialai xie yihuir ba.
     we all walk-tired PFT, sit-down rest for a while modal particle
     We all got tired after a long walk, let’s rest for a while. (DCVC:226)
The LCS of this type can be represented as (28):

(28) [ x ACT ] CAUSE [ BECOME [ x BE AT-z ]]
     x = he, ACT = play, z = happy
     x = I, ACT = run for whole day, z = thirsty
     x = we, ACT = walk a lot, z = tired
Second, let us now look at examples of the “chi-guan” (eat-get used to) type (V1 [Agent, Theme] + V2 [Experiencer, ] → V1V2 [Experiencer, Theme]).

(29)
  a. Ta chi-guan le lajiao, dundun fan-li dou dei fang.
     He eat-get used to PFT chilli every meal-in all must put
     He is used to eating chilli and has to have it for every meal. (DCVC:152)
  b. Tianshi wo chi-ni le, xiang chi dian xian de.
     Sweets I eat-be sick of PFT want eat a little salty DE
     I am rather sick of eating sweet things so I want to eat salty things. (DCVC:248)
  c. Zhexie jiemu dajia kan-gou le.
     These program all watch-enough PFT
     These programs are watched enough. (DCVC:150)
The LCS of this type can be represented as (30) below:

(30) [ x ACT MANY TIMES ] CAUSE [ BECOME [ x BE AT-z ]]
     x = he, ACT = eat chilli, z = get used to chilli
     x = I, ACT = eat sweets, z = get sick of sweets
     x = all, ACT = watch these programs, z = get tired of these programs
Third, let us look at examples of the “dong-bing” (be chilled-get sick) type (V1 [Theme1, ] + V2 [Theme2, ] → V1V2 [Theme1-Theme2, ]).

(31)
  a. Ruguo zuotian ni chumen shi chuan-shang dayi, jiu buhui dong-bing le.
     If yesterday you go out when wear-on coat then would not be chilled-get sick PFT
     If you had worn a coat before you went out yesterday, you wouldn’t have fallen sick. (DCVC:17)
  b. Ta buxiaoxin cai zai xigua pi shang hua-dao le.
     He carelessly tread LOC watermelon peel on slip-fall PFT
     He carelessly trod on the peel of a watermelon and then tumbled. (DCVC:76)
  c. Wanshang wo mei chi wanfan, shui dao banye e-xing le.
     Night I NEG eat supper sleep until midnight hungry-awake PFT
     I did not eat supper last night, so I felt very hungry and woke up at midnight. (DCVC:348)
The LCS of this type can be represented as (32) below:

(32) [ BECOME [ y BE AT-z1 ]] CAUSE [ BECOME [ y BE AT-z2 ]]
     y = you, z1 = be chilled, z2 = get sick
     y = he, z1 = slip, z2 = fall
     y = I, z1 = hungry, z2 = awake
This type is marked in that there is no action in the causal event, but it still preserves the cause-result relation. The last type, the “ting-dong” (listen to-understand) type, is also marked because there is no cause-result relation; V1 and V2 only reflect a temporal sequence. This type has the following argument structure and LCS:

(33)
  a. V1 [Agent, Theme] + V2 [Experiencer, Theme] → V1V2 [Experiencer, Theme]
  b. [ x ACT ] RESULTS IN [ BECOME [ x BE AT-z ]]
In (33b), “RESULTS IN” links the two events, since they are not in a causal relation but are listed according to a temporal sequence. Let us now look at some examples:

(34)
  a. Wo chongfu le haojibian, ta dou meiyou fanying, wo xiang ta dagai shi meiyou ting-dong.
     I again PFV several times he all NEG response I think he probably copula NEG listen to-understand
     Although I have repeated myself several times, he probably didn’t understand me as he showed no response. (DCVC:121)
  b. Lian-hui le Taiqiquan haishi youyong de.
     train-master PFT Taiqi (Chinese shadow boxing) still useful modal particle
     It is useful to train oneself and master Taiqi. (DCVC:184)
  c. Zhe chang qiu tamen da-ying le.
     This CL game they play-win PFT
     They played this game and, as a result, they won. (DCVC:353)
  d. Women bu neng ba zhe chang qiu ti-shu le!
     We NEG can BA this CL game kick-lose modal particle
     We cannot lose this soccer game! (DCVC:301)
To summarize, we observed the existence of “Patient-Orientedness” in cases wherein V2 predicates an external argument: the suppression of the agent from the action verb V1 and the inheritance of the experiencer from V2.

5. Conclusion

In this paper, three pieces of supporting evidence for “Patient-Orientedness” are offered. The first piece of evidence is the fact that among the 1,385 examples, 47.22% display the patient-subject type and 36.82% display “decausativization” of the causative transitive compound verbs, thus lexically suppressing the Agent-Subject of V1. The second piece of evidence is the suppression of an internal argument of V1 (e.g. “xi-shi” (wash-wet)). The third piece of evidence is the suppression of the agency of V1 in the case in which V2 predicates an external argument (e.g. “tiao-fan” (dance-get tired of)). The most significant fact is that we failed to find any case in which the argument of V2 is suppressed.

Finally, I would like to summarize the statistical result for the 1,673 resultative compound verb examples collected from A Dictionary of Chinese Verb-Resultative Complement Phrases (1987) as follows.

(35)
A. The type in which V2 predicates an internal argument                       82.79% (1,385/1,673)
   1) Agent/Cause-Subject type                                                43.69% (731/1,673)
      a. [Agent/Cause, Theme]                                                 41.96% (702/1,673)
      b. [Event (Agent, (Theme)), ]                                            1.73% (29/1,673)
   2) Patient-Subject type                                                    39.09% (654/1,673)
      a. [Theme, ]                                                            30.48% (510/1,673)
      b. BEI passives                                                          8.60% (144/1,673)
B. The type in which V2 predicates an external argument                       17.21% (288/1,673)
   1) V1 [Agent, ] + V2 [Experiencer, ] → V1V2 [Experiencer, ]                 8.18% (137/1,673)
   2) V1 [Agent, Theme] + V2 [Experiencer, ] → V1V2 [Experiencer, Theme]       6.87% (115/1,673)
   3) V1 [Theme1, ] + V2 [Theme2, ] → V1V2 [Theme1-Theme2, ]                   1.43% (24/1,673)
   4) V1 [Agent, Theme] + V2 [Experiencer, Theme] → V1V2 [Experiencer, Theme]  0.72% (12/1,673)
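The shares in summary (35) follow directly from the raw counts, and a reader who wishes to check them can do so with a few lines of code. The following is a minimal sketch, not part of the original study: the dictionary keys are shorthand labels introduced here, and the helper `share` rounds to two decimals, so a recomputed value may differ from a printed one in the last digit depending on whether figures were rounded or truncated.

```python
# Counts copied from summary (35); the key names are shorthand introduced here.
TOTAL = 1673

INTERNAL = {              # A. V2 predicates an internal argument
    "agent_cause_theme": 702,
    "event_subject": 29,
    "theme_subject": 510,
    "bei_passive": 144,
}
EXTERNAL = {              # B. V2 predicates an external argument
    "agent+experiencer": 137,
    "agent_theme+experiencer": 115,
    "theme+theme": 24,
    "agent_theme+experiencer_theme": 12,
}


def share(count: int, total: int = TOTAL) -> float:
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


internal_total = sum(INTERNAL.values())
external_total = sum(EXTERNAL.values())

print("internal:", internal_total, share(internal_total))
print("external:", external_total, share(external_total))
for label, count in {**INTERNAL, **EXTERNAL}.items():
    print(f"{label:32s} {count:5d} {share(count):6.2f}%")
```

Calling `share(654, 1385)` gives the Patient-Subject share within the internal-argument type rather than within the whole data set.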
Bibliography

Kageyama, Taro. 1996. Doushi-Yimiron (Verbal Semantics). Tokyo: Kuroshio Publishers.
Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity: At the Syntax-Lexical Semantics Interface. Massachusetts: The MIT Press.
Li, Charles N. and Sandra A. Thompson. 1976. Subject and Topic: A New Typology of Language. In Li, Charles N. (ed.) Subject and Topic. 457-489. New York: Academic Press.
Li, Charles N. and Sandra A. Thompson. 1981. Mandarin Chinese: A Functional Reference Grammar. California: University of California Press.
Mochizuki, Keiko. 2004. Causative and Inchoative Alternation: Comparative Studies on Verbs in Chinese and Japanese. Ph.D. Thesis, National Tsing Hua University, Taiwan.
Tai, James H-Y. 1984. Verbs and Times in Chinese: Vendler’s Four Categories. In Lexical Semantics. 289-296. Chicago Linguistic Society.
Vendler, Zeno. 1967. Linguistics in Philosophy. Cornell University Press.
Wang, Yannong, Qun Jiao and Yong Pang (eds.) 1987. A Dictionary of Chinese Verb-Resultative Complement Phrases. Beijing: The Beijing Language Institute Press.
Corpus Research in Chinese and Its Application to Chinese Language Teaching: A Case of Localizers in Chinese

Takayuki MIYAKE

1. Introduction

In recent years, corpus linguistics of the Chinese language has gained popularity in China. Corpus data have potential applications in several areas, such as information processing and academic research; one of the primary application areas is language teaching. The purpose of this study is first to briefly introduce two types of Chinese corpora (large-scale text databases and annotated corpora) and then to examine the validity of applying corpus data to Chinese language teaching by using the former type, namely, text databases. As a case study, we investigate the usage of localizers in Chinese corpora, and we propose an improved method of presenting localizers in Chinese-language teaching.

2. Two types of Chinese corpora

There are two types of corpora in Chinese: large-scale text databases and annotated corpora. Before using a Chinese corpus, it is necessary to decide which type of corpus is required, because a distinct property of corpus data is that the features of one corpus are completely different from those of another. Representative examples of the two types of Chinese corpora1 are introduced below.

2.1. Large-scale text databases

The largest representative written Chinese corpus is The Modern Chinese Language Corpus (MCLC), which is being compiled at the Research Institute of Language Application in Beijing. The aim was to include 70 million contemporary Chinese characters in this corpus, and by 2002, 20 million Chinese characters had already been entered into the text database. This text file has been compiled from written Chinese data dating from after 1919, particularly from after 1977. The corpus covers a broad range of fields: humanities, 59.4 percent; natural science, 17.24 percent; newspaper materials, 13.79 percent; and miscellaneous, 9.36 percent. Thus, this text file can be regarded as an example of a heterogeneous corpus. In addition, we can cite some other instances of large-scale text databases: a 20-million-character corpus of Chinese compiled by the Beijing University of Aeronautics and Astronautics in 1983, a 5.27-million-character corpus of modern Chinese literature compiled by Wuhan University in 1979, etc.

1 For further details about the trends of corpus linguistics in China, see Wang, J. (2001) and Feng, Z. (2002).

2.2. Annotated corpora

The Institute of Computational Linguistics (ICL) of Beijing University and the Fujitsu Research Institute of Japan have begun to segment and annotate a 27-million-character corpus of newspaper material from the People’s Daily of 1998. At present, this is the largest tagged corpus of modern Chinese; it contains linguistic information about word segmentation and parts of speech. An annotated corpus can make a significant contribution to linguistic study and language information processing. However, although annotated corpora are very useful for linguistic study, they are still limited in number, and assembling the data for a large annotated corpus is not a simple task. Raw text files, on the other hand, are comparatively easy to collect and can be used freely.2 This paper is a case study using large-scale text databases of the Chinese language.

3. A case study: Localizers in Chinese

3.1. The range of localizers in Chinese

Localizers are a subclass of nouns in Chinese. They play a significant role in Chinese grammar because some sentence types require a sentence component that expresses location rather than a noun that denotes an object. For example, the subject of the “有” sentence structure, like “里边儿” in (1), and the object of the verb “在,” like “学校外边儿” in (2), must be localizers.

(1) 里边儿有什么东西?
    “What is inside?”
(2) 邮局在学校外边儿。
    “The post office is outside the school.”
Further, the position after the preposition “在” also requires a component that expresses location; thus, while (3) is grammatically correct, (4) is not acceptable, since the noun “桌子” denotes the object “desk” and not a location.

(3) 他把书放在桌子上了。
    “He put the book on the desk.”
(4) *他把书放在桌子了。
    “*He put the book the desk.”

2 For further introduction on annotating the Chinese corpus, see Zhou, Q. and Yu, S. (1997).
I limit the discussion in this study to the compound localizers listed in Zhu (1982:44), as presented in Table 1.

Table 1.
                      “~边” type    “~面” type    “~头” type
above/over            上边          上面          上头
down/under/below      下边          下面          下头
front/forward         前边          前面          前头
behind/at the back    后边          后面          后头
in/inside             里边          里面          里头
outside               外边          外面          外头
left                  左边          左面
right                 右边          右面
east                  东边          东面          东头
south                 南边          南面          南头
west                  西边          西面          西头
north                 北边          北面          北头
side                  旁边
Zhu (1982:44) does not list “旁边,” while Liu et al. (2001:55) as well as most Chinese textbooks do; therefore, in this paper, we include it in our list. We are not concerned with another type that Zhu (1982) does not list, namely “以~” (like “以上” “以前”) and “之~” (like “之后” “之下”), for the following reasons:
(i) Many members of this type have already undergone grammaticalization and changed their meanings.
(ii) Some members should be considered phrases rather than words, and they are not recorded as words in Xiandai Hanyu Cidian3.
(iii) Almost all Chinese textbooks for second language education do not record this type of localizer at all.

The range listed in Table 1 is incidentally almost identical to that dealt with by most Chinese textbooks for second language education in Japan.

3 Division of Dictionary Compilation, The Institute of Linguistics, The Chinese Academy of Social Sciences, 2005. The Contemporary Chinese Dictionary 5th edition. Beijing: Commercial Press.

As Zhu (1982) points out, simple localizers are bound and cannot be used freely; in contrast, compound localizers are free with regard to usage, but they come in three types that appear to be used equally. Although a large body of research has been carried out on the difference in usage between simple and compound localizers, little is known about the differences among these three types of compound localizers. The question that we should ask here concerns the range of compound localizers that should be taught in second language teaching. Is it really necessary that all types of compound localizers be treated equally in elementary level classes?

3.2. The investigation of the Chinese textbooks

Let us begin our analysis by examining Chinese textbooks for second language education published in Japan. Localizers are among the most important points in elementary Chinese grammar; therefore, many textbooks deal with this grammatical aspect at the elementary level. However, there are some differences with regard to the range of localizers presented in the textbooks. The samples were collected from 70 Chinese textbooks for second language education published in Japan. The distribution of the range of localizers is shown in Table 2.

Table 2. The range of localizers
                                                                      Number of textbooks
The textbooks that present all 3 types: “~边”, “~面” and “~头”        22
The textbooks that present 2 types: “~边” and “~面”                   17
The textbooks that only present 1 type: “~边”                         30
The textbooks that only present 1 type: “~面”                          1
This table shows that most of the types of localizers presented in the textbooks are “~边.” The textbooks that only present the “~边” type amounted to 30; further, although some textbooks presented 2 or 3 types of localizers, almost all of them presented the “~边” type, with the exception of one textbook that presented the “~面” type. These data are of great value since they reveal that although there are 35 compound localizers in Table 1, they are not treated equally in the textbooks. Some members are so common that all textbooks include them, while some that are not so common are not included in the textbooks. Table 3 presents the distribution of the compound localizers in the 70 textbooks.
Table 3.
上边 69    上面 40    上头 21
下边 69    下面 40    下头 21
前边 69    前面 40    前头 21
后边 69    后面 40    后头 21
里边 65    里面 38    里头 20
外边 63    外面 38    外头 20
左边 65    左面 33
右边 65    右面 33
东边 48    东面 29    东头 6
南边 48    南面 29    南头 6
西边 48    西面 29    西头 6
北边 48    北面 29    北头 6
旁边 55
The “~边” type occurs most frequently, and the “~面” and “~头” types appear in fewer textbooks. Although the selection of localizers in each textbook is not based on scientific observation, it may be presumed that the data presented in Table 3 reflect the real usage of localizers as judged by the introspection of each author. An examination of corpus data is required to confirm whether the treatment adopted by the majority of textbooks is supported by evidence.

4. The investigation of the corpus

4.1. The corpus treated in this paper

The following approach was employed in the observation of the corpora.

4.1.1. Data collection

In this paper, we chose three different types of corpora to demonstrate the differences in the levels of speech:

(Corpus 1) Contemporary Beijing Colloquial Corpus (当代北京口语语料) (口)4 by Beijing Language and Culture University, 1993

The first corpus that we chose is a corpus of colloquial Chinese. This corpus contains the utterances of 374 people in the city of Beijing. The speakers were selected from Beijing, and they belonged to various occupations, ages, and ethnic groups. The investigators met these informants and recorded their talks on six topics. The only common prerequisite was that the informants should have been born in Beijing, and that their parents should also have been from Beijing. This corpus faithfully reflects natural colloquial Chinese, especially that of the Beijing dialect.

4 The characters within parentheses are abbreviated marks that will indicate the sources of the illustrative sentences used later.

(Corpus 2) Online novels by authors around Beijing

We chose this corpus as an example of the language used in literary works. We constructed it from Websites that allow Chinese novels to be downloaded. We selected novelists from Beijing whose novels are written in the Chinese that is used around Beijing. We can broadly consider the Chinese used in these novels as a written form of the Beijing dialect. The authors and novels selected in this paper are as follows.

刘恒: 《贫嘴张大民的幸福生活》
王朔: 《动物凶猛》(动), 《我是你爸爸》(爸), 《你不是一个俗人》, 《痴人》, 《浮出海面》(浮), 《给我顶住》, 《过把瘾就死》, 《一半是火焰一半是海水》, 《无人喝采》, 《空中小姐》(空), 《刘慧芳》, 《看上去很美》, 《懵然无知》, 《千万别把我当人》, 《人莫予毒》, 《谁比谁傻多少》, 《玩的就是心跳》, 《枉然不供》, 《顽主》, 《我是“狼”》, 《橡皮人》(橡), 《修改后发表》, 《许爷》, 《一点正经没有》(一), 《永失我爱》
王小波: 《2015》, 《未来世界》(未), 《黄金时代》, 《白银时代》, 《我的阴阳两界》
These novels were mainly downloaded from the Website “亦凡公益图书馆” (http://www.shuku.net/) and partly from the Website “中国青少年新世纪读书网” (http://www.cnread.net/index.htm), and they were preserved as plain text files.

(Corpus 3) Beijing Daily and Beijing Evening News CD-ROM, 2000

This is the corpus of the language of news stories. “Beijing Daily” and “Beijing Evening News” are newspapers published in Beijing. By using a CD-ROM, we were able to use the digital data of all the articles in these papers. We consider the language used in the media as being representative of the typical formal style in written language. In this case, only the articles that appeared in Beijing Daily in January 2000 were used, so that the three corpora would contain an equal amount of data.

These three types of corpora are all text files containing approximately 1.7 million characters; they have the following three dimensions in common:
(i) Time: Present-day Chinese
(ii) Area: Beijing5
(iii) Quantity: Approximately 1.7 million characters

5 We intentionally chose the colloquial corpus of the Beijing dialect. With regard to the reason behind using the corpus of the Beijing dialect, see Zhu (1987). Consequently we also chose the Chinese used around the Beijing area for the other two language styles in order to integrate the usage areas of the three corpora.

Further, the only difference among these three corpora is with regard to their respective language style: colloquial Chinese, the Chinese used in literary works, and the Chinese used in news stories. As Tao (1999) and Li (2003) point out, the language style exerts tremendous influence on grammatical phenomena, and therefore, it is crucial to distinguish different language styles in studying grammar.

4.1.2. Data processing

As mentioned above, all three corpora used by us are raw text files containing approximately 1.7 million Chinese characters. To investigate these text files, we used the text editor “EmEditor Professional v6,” which offers full Unicode support. Using this text editor, we processed simplified Chinese characters in a Japanese Windows XP environment.6

4.2. The investigation of the corpus

Let us now discuss the results of the corpus research. In the tables, “colloquial,” “novels,” and “newspapers” denote Corpora 1, 2, and 3, respectively.

4.2.1. “上~” (up/above/over) type and “下~” (down/under/below) type

Table 4.
          colloquial    novels    newspapers    Total number
上边      127           14        4             145
上面      34            173       44            251
上头      123           5         2             130
下边      38            13        1             52
下面      12            132       19            163
下头      3             0         0             3
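Counts of the kind shown in Table 4 were obtained here with a text editor, but the same tally can be approximated with a short script. The following is only a sketch under stated assumptions, not the procedure actually used in the study: it counts plain substring matches, so a form followed by the suffix “儿” (as in “上边儿”) is counted under its base form, and strings that merely contain a localizer as part of a longer word are counted too. The function name `tabulate` and the toy texts are hypothetical; real use would read the three corpus files instead.

```python
from typing import Dict

# The six forms investigated in Table 4.
LOCALIZERS = ["上边", "上面", "上头", "下边", "下面", "下头"]


def tabulate(corpora: Dict[str, str]) -> Dict[str, Dict[str, int]]:
    """Count each localizer in each corpus and add a row total.

    corpora maps a corpus name (e.g. "colloquial") to its raw text.
    Matching is by plain substring, so "上边儿" counts as "上边".
    """
    table: Dict[str, Dict[str, int]] = {}
    for form in LOCALIZERS:
        row = {name: text.count(form) for name, text in corpora.items()}
        row["total"] = sum(row.values())
        table[form] = row
    return table


# Toy input built from fragments of examples (5) to (8); in practice each
# value would be the full text of a corpus file, e.g. read with
# open(path, encoding="utf-8").read().
toy_corpora = {
    "colloquial": "脚掌上边儿还趴着一个小狮子。山顶儿大石头上头,休息了一会儿。",
    "novels": "上面有一些密密麻麻的名字。下面广场有两个妇女在吵架。",
}

for form, row in tabulate(toy_corpora).items():
    print(form, row)
```

Because substring matching over-counts in some contexts, figures produced this way would need manual checking before being compared with the published tables.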
(5) 也没准儿你再看看,再仔细发现,可能它那脚上吧,脚掌上边儿还趴着一个小狮子。(口)
    “Maybe if you look again, if you look carefully again, there may be one more small lion lying at the sole of the foot of the lion.”
(6) 他边吃饭还在边看一份报纸,上面有一些密密麻麻的名字,可能是某个委员会或主席团的名单。(空)
    “He was reading a newspaper over dinner. The newspaper was full of people’s names, which may be a name list of some committee or chairman.”
(7) 到山顶儿以后,我说咱们看看,到那儿去以后,觉得就是应该看看是哇,可是当时呢就是上气儿不接下气儿这样儿,坐的大石头上,山顶儿大石头上头,休息了一会儿,才看看下边儿的景色。(口)
    “After we reached the top of the mountain, I said that we should look at the beautiful sight. But at that time, we were panting from the climb, so we sat on the big rock and took rest. After a short rest, we looked downward at the sight.”
(8) 下面广场有两个妇女在吵架,旁边围了一圈稀稀落落的人,有战士和小女孩。(动)
    “Two women were quarreling in the square below. There were few people around; a soldier and a little girl were also there.”

6 Downloaded from http://www.emeditor.com/jp/index.htm
In the “上~” (up/above/over) type, “上边” and “上头” are used frequently in colloquial language, and “上面” is most frequently used in novels. In the “下~” (down/under/below) type, “下边” is most often used in colloquial language and “下面” is most frequently used in novels.

4.2.2. The “前~” (front) and “后~” (back) types

Table 5.
          colloquial    novels    newspapers    Total number
前边      80            31        3             114
前面      9             113       25            147
前头      19            4         10            33
后边      73            23        3             99
后面      7             226       36            269
后头      42            5         2             49
(9) 她买的时候儿呢,就是队伍前边儿呢是,比较开始比较慢,后来呢就稍微快一点儿了。(口)
“When she bought it, the front part of the line of people was moving relatively slowly. And after that, the line began to move a little faster.”
(10) 我走在前面,老邱和燕生跟在后面。(橡)
“I walked in front, and Lao Qiu and Yansheng followed behind.”
(11) 前头一个人儿表演,后头一个人说。(口)
“A man was giving a performance in front, and another man was talking behind.”
(12) 它正好儿是在苏联大使馆后边儿。(口)
“It was at the back of the Soviet embassy.”
The cases of the “前~” and “后~” types are fairly similar to those of the “上~” and “下~” types. “前边”, “后边”, “前头,” and “后头” are most frequently used in colloquial language, whereas “前面” and “后面” are most often used in novels.
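The counts in Tables 4 and 5 are plain form frequencies, so a tally of this kind could in principle be reproduced with a few lines of code. The following is only a minimal sketch, not the procedure actually used (the text reports that the files were searched in a text editor). It assumes simple substring matching, so that erhua forms such as 上边儿 are counted under 上边 and localizer senses are not separated from other senses; the sample string is a hypothetical mini-corpus assembled from examples (10) and (12).

```python
from collections import Counter

# Forms tallied in Tables 4 and 5: each stem combined with each suffix.
STEMS = ["上", "下", "前", "后"]
SUFFIXES = ["边", "面", "头"]
FORMS = [stem + suffix for stem in STEMS for suffix in SUFFIXES]


def count_forms(text: str, forms=FORMS) -> Counter:
    """Count non-overlapping occurrences of each form in raw text.

    Plain string matching only: 上边儿 is counted under 上边, and no
    attempt is made to tell localizer senses apart from other senses.
    """
    return Counter({form: text.count(form) for form in forms})


# Hypothetical mini-corpus built from examples (10) and (12).
sample = "我走在前面,老邱和燕生跟在后面。它正好儿是在苏联大使馆后边儿。"
counts = count_forms(sample)

for form, n in counts.items():
    if n:
        print(form, n)
```

Running the same function over each of the three corpora separately would give one column of a table per corpus; the row totals are then simple sums.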
Corpus Research in Chinese
4.2.3. The “里~” (in/inside) and “外~” (outside) types

Table 6.
        colloquial   novels   newspapers   Total number
里边    515          56       10           581
里面    69           176      49           294
里头    887          7        1            895
外边    184          51       5            240
外面    21           168      19           208
外头    155          7        0            162
(13) 我们四个女生睡在一间屋子里,两个男生睡在,睡在另一间就旁边儿的教室里边儿。(口)
“All four of us were schoolgirls and we slept in one room. Two schoolboys slept in the next classroom.”
(14) 我坐在一旁笑眯眯地听,伸手拿茶几上的烟盒,发现里面空了。(浮)
“I sat nearby and listened with a smile. I stretched my hand and took the cigarette box on the table and found that there were no cigarettes inside.”
(15) 啊,因此就在我的脑子里头,没有一个失业或就业的问题。(口)
“Yes, so in my mind the problem of unemployment and employment does not exist.”
(16) 我们就骑车到外边儿去。(口)
“We got on bicycles and went out.”
(17) 外面天已经黑了,果然有些凉意。(爸)
“It was already dark outside, and it became cooler as we expected.”
(18) 他们自个儿上外头旅游,玩儿了一趟回来了。(口)
“They went out on a trip, and came back after having a good time once.”
Note that the frequency of the “里~” type is very high. “里边,” “外边,” “里头,” and “外头” are most frequently used in colloquial language. On the other hand, “里面” and “外面” are mainly used in novels.

4.2.4. The “左~” (left) and “右~” (right) types

Table 7.
        colloquial   novels   newspapers   Total number
左边    3            16       2            21
左面    0            1        1            2
右边    2            4        3            9
右面    0            5        1            6
(19) 那字念 ‘矜’,告诉你——左边一 ‘矛’ 右边一 ‘今’。(一)
“That character is ‘矜’, the left is ‘矛’, and the right is ‘今’.”
(20) 他走到山路上,左面是山林,故而相当黑;右面是山谷,故而比较明亮。(未)
“He walked on the mountain road. On the left-hand side of the road was the forest, so it was pretty dark. On the right-hand side was the valley, so it was very sunny.”
We should remember that “*左头” and “*右头” do not exist in Chinese. They do not appear in our corpus at all. Unexpectedly, the frequencies of the “左~” and “右~” types are not high. The most frequently used member, “左边,” occurred only 21 times.

4.2.5. The “东~” (east), “南~” (south), “西~” (west), and “北~” (north) types

Table 8.
        colloquial   novels   newspapers   Total number
东边    21           4        7            32
东面    0            1        2            3
南边    35           15       1            51
南面    2            3        2            7
西边    27           9        4            40
西面    0            1        0            1
北边    37           6        4            47
北面    0            1        1            2
Although Zhu (1982) and some textbooks include “东头,” “南头,” “西头,” and “北头” among the compound localizers, strictly speaking we do not consider these four members to be general localizers. Liu et al. (2001:55) point out the following:
(21) “东头儿”、“南头儿”、“西头儿”、“北头儿” 中的 “头儿” 与后缀 “头” 不同,要重读,而且要儿化,意思是 “顶端、末梢”。
The “头儿” in “东头儿,” “南头儿,” “西头儿,” and “北头儿” is different from the suffix “头.” It must be stressed and must take the retroflex ending “r.” It means “the peak, the end.”
Consequently, whereas “东, 南, 西, 北” + “~边” and “东, 南, 西, 北” + “~面” are listed in Xiandai Hanyu Cidian as localizers, “东头,” “南头,” “西头,” and “北头” are not listed at all. This fact reveals that Xiandai Hanyu Cidian does not consider “东头,” “南头,” “西头,” and “北头” as localizers. Therefore, we excluded these four members from our analysis.7
(22) 东边儿呢,是北京火车站,西边儿呢,是货场。(口)
“On the east was Beijing Station, and on the west there was a freight shed.”
(23) 这个,南边儿有一个操场,它是属于这个区里边儿的。(口)
“There is a sports ground to the south. It belongs to this ward.”
(24) 洛阳北边儿要建一个飞机场,嗯,然后呢,要准备是大开发嘛。(口)
“An airport will be built in the north of Luoyang city, and after that, they are planning a large-scale development.”
The members that express directions are not used so often either. The most frequently used ones are “东边,” “南边,” “西边,” and “北边,” that is, they all belong to the “~边” type.

4.2.6. “旁边” (side)

Table 9.
        colloquial   novels   newspapers   Total number
旁边    80           177      25           282
(25) 旁边几桌吃饭的男女纷纷转过头来紧张地盯着我们。(动)
“The people who were having lunch at the nearby tables turned around one after another and stared at us tensely.”
“旁边” is the only member of this group; “*旁面” and “*旁头” do not exist in Chinese. “旁边” is often used both in colloquial language and in novels.

5. Data analysis
5.1. The distribution of “~边,” “~面,” and “~头”
From the above data, we know the total number of occurrences in our corpus of each member of the three types of localizers, “~边,” “~面,” and “~头.” Table 10 shows the distribution (the highest frequency in each group is highlighted). Based on the information in Table 10, can we conclude that the member with the highest frequency in each group is the one that should be taught as the Chinese localizer, as in (26)?
(26) Localizers in Chinese
上面,下面,前面,后面,里头,外边,左边,右边,东边,南边,西边,北边,旁边
Our answer is “no.” Here, we show the reason why we should not present the localizers as in (26) at the elementary level.

7 Zou, S. and Tian, Q. (2001:161) assert that “东头,” “南头,” “西头,” and “北头” should be counted as localizers, but the example they present is “村子东头,” which means “the end.”
Table 10. (* = highest frequency in the group)
“~边”          “~面”          “~头”
上边 145       上面 251*      上头 130
下边 52        下面 163*      下头 3
前边 114       前面 147*      前头 33
后边 99        后面 269*      后头 49
里边 581       里面 294       里头 895*
外边 240*      外面 208       外头 162
左边 21*       左面 2
右边 9*        右面 6
东边 32*       东面 3
南边 51*       南面 7
西边 40*       西面 1
北边 47*       北面 2
旁边 282*
5.2. The reason behind choosing the “~边” type
It is true that the members listed in (26) are the ones used most frequently in the corpus, but we do not consider them the most suitable set of Chinese localizers to present at the elementary level. In this list, three different types of localizers (the “~边” type, the “~面” type, and the “~头” type) are intermingled, and having to learn such a mixed list may place an unnecessary burden on learners of Chinese. Which type of localizer, then, should we choose: the “~边” type, the “~面” type, or the “~头” type? In this paper, we suggest that the “~边” type should be presented as the Chinese localizers at the elementary level of Chinese language teaching, as in (27).
(27) Localizers in Chinese
上边,下边,前边,后边,里边,外边,左边,右边,东边,南边,西边,北边,旁边
Although this proposal may appear to be oversimplified, we have strong reasons for choosing the “~边” type.

[i] The highest frequency among all the three types
Table 11 presents the total numbers of the three types of localizers in the three language styles. These data reveal that the “~边” type is the most frequently used in the corpus; it appears 1,431 times, whereas the “~面” type appears 1,353 times and the “~头” type appears 1,272 times. Thus, the frequency of the “~边” type is the highest of the three types of localizers, although the margin is not large. This information is also presented in Figure 1.
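The three totals cited in [i] can be re-derived from the per-member figures in Table 10. The sketch below is illustrative only; the dictionary layout and the function name are assumptions introduced here. One detail is worth noting: the figure of 1,431 for the “~边” type is reproduced when “旁边” (282 occurrences) is left out of the sum, whereas including it gives 1,713.

```python
# Total frequencies copied from Table 10, keyed by stem.
# Each tuple is (~边, ~面, ~头); None marks a form with no entry in the table.
TABLE_10 = {
    "上": (145, 251, 130),
    "下": (52, 163, 3),
    "前": (114, 147, 33),
    "后": (99, 269, 49),
    "里": (581, 294, 895),
    "外": (240, 208, 162),
    "左": (21, 2, None),
    "右": (9, 6, None),
    "东": (32, 3, None),
    "南": (51, 7, None),
    "西": (40, 1, None),
    "北": (47, 2, None),
    "旁": (282, None, None),
}
TYPES = ("边", "面", "头")


def type_totals(table, skip=()):
    """Sum the frequencies of each suffix type, optionally skipping stems."""
    totals = dict.fromkeys(TYPES, 0)
    for stem, row in table.items():
        if stem in skip:
            continue
        for suffix, n in zip(TYPES, row):
            totals[suffix] += n or 0  # treat missing forms as zero
    return totals


print(type_totals(TABLE_10, skip=("旁",)))  # totals without 旁边
print(type_totals(TABLE_10))               # totals with 旁边 included
```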
Table 11.
                colloquial   novels   newspaper   Total number
“~边”          1,222        419      72          1,431
“~面”          154          1,000    199         1,353
“~头”          1,229        28       15          1,272
Total number    2,525        1,270    261         4,056
Figure 1. [Bar chart: frequency of the “~边,” “~面,” and “~头” types (horizontal axis), broken down by colloquial, novels, and newspaper; vertical axis from 0 to 1,800.]
[ii] The highest frequency in colloquial language
Let us now look at each member’s distribution in three different language styles.

Figure 2. [Bar chart: frequency in colloquial, novels, and newspaper (horizontal axis), broken down by the “~边,” “~面,” and “~头” types; vertical axis from 0 to 3,000.]
A glance at Table 11 and Figure 2 will reveal that in colloquial language, “~边” is used most frequently. In contrast, “~面” is used mostly in novels, and its frequency in colloquial language and newspapers is fairly low. “~头” is almost exclusively used in colloquial language.8 Further, if we lay emphasis on teaching colloquial language, the “~边” type must be the best choice.

[iii] The disunion of the members
With regard to the high frequency in colloquial language, one might point out that not only the “~边” type but also the “~头” type is used with equal frequency in colloquial language. However, we do not choose the “~头” type as the one that should be presented at the elementary level of language teaching. The distribution of the “~头” type is heavily skewed: its frequency in written language is almost zero, whereas the “~边” type is used in both colloquial and written language. Therefore, if we teach the “~边” type, students can manage both kinds of language. More importantly, Table 10 clearly shows that the “~头” type lacks many members that the other two types have. “*左头,” “*右头,” and “*旁头” do not exist in Chinese at all, and we do not consider “东头,” “南头,” “西头,” and “北头” as localizers, as mentioned above. Thus, if we taught the “~头” type as the localizers, students would not learn how to express left and right or the four compass directions. On the other hand, the “~边” type has the most members; “旁边” is a member that has no counterpart in the other two types. Therefore, we can teach most general localizers if we present the “~边” type.

In addition to these reasons, there is another factor that might influence the frequency of localizers, which we should take into account. In our analysis thus far, we have deliberately not taken the other meanings of each member into account. In fact, some localizers have several dictionary senses that do not express direction, and this might have some influence on their frequency in the corpus.
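The distributional claims in [ii] and [iii] can be made concrete by converting the per-style cells of Table 11 into percentages for each type. The following sketch is illustrative only; the dictionary layout and function name are assumptions. It uses the per-style cells as printed, so the shares for the “~边” type are computed over the sum of its three cells.

```python
# Per-style frequencies copied from the cells of Table 11.
TABLE_11 = {
    "边": {"colloquial": 1222, "novels": 419, "newspaper": 72},
    "面": {"colloquial": 154, "novels": 1000, "newspaper": 199},
    "头": {"colloquial": 1229, "novels": 28, "newspaper": 15},
}


def style_shares(row):
    """Return each style's share of a type's occurrences, in percent."""
    total = sum(row.values())
    return {style: round(100 * n / total, 1) for style, n in row.items()}


shares = {suffix: style_shares(row) for suffix, row in TABLE_11.items()}

for suffix, row in shares.items():
    print(suffix, row)
```

On these figures, roughly 97% of the “~头” tokens and about 71% of the “~边” tokens come from the colloquial corpus, while about 74% of the “~面” tokens come from novels.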
Consider, for example, the entry for “下面” in Xiandai Hanyu Cidian. “下面” has three senses:
①位置较低的地方。(below; under; underneath)
②词序靠后的部分。(next; following)
③指下级。(lower level; subordinate)
In contrast, “下头” has only two senses:
①位置较低的地方。(below; under; underneath)
②指下级。(lower level; subordinate)
Thus, “下面” has an additional meaning that “下头” does not have, namely, “next; following.” (The entry for “下边” simply reads “下面,” so we can consider the senses of “下边” to be identical to those of “下面.”)

8 This conclusion largely agrees with the observations of Wu, Z. (Lü, S.) 1965.
Thus, if one wishes to express the second meaning of “下面” (“next; following”), one would naturally use “下面” but not “下头,” which results in an increase in the frequency of “下面.” The issue of meaning is thus not irrelevant to our corpus data.

5.3. The other frequency factor
Thus far, we have mainly examined the differences among the three types (the “~边” type, the “~面” type, and the “~头” type) and have treated the groups “上~,” “下~,” “前~,” etc. as though they were on an equal footing. Although the members of each group are equivalent in their usage, there are differences in frequency between the groups. Table 12 shows the frequency of each group in our corpus.

Table 12.
          colloquial   novels   newspaper   Total number
“上~”    284          192      50          526
“下~”    53           145      20          218
“前~”    108          148      38          294
“后~”    122          254      41          417
“里~”    1,471        239      60          1,770
“外~”    360          226      24          610
“左~”    3            17       3           23
“右~”    2            9        4           15
“东~”    21           5        9           35
“南~”    37           18       3           58
“西~”    27           10       4           41
“北~”    37           7        5           49
Here, we do not intend to examine further the differences among the language styles. What is of great significance is that there is a large difference in frequency in the real usage of localizers. Figure 3 shows this difference clearly.

Figure 3. [Bar chart: total frequency of each group, “上~,” “下~,” “前~,” “后~,” “里~,” “外~,” “左~,” “右~,” “东~,” “南~,” “西~,” and “北~”; vertical axis from 0 to 2,000.]
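The contrast between frequent and rare groups in Table 12 can be summarized with a short calculation. The sketch below is illustrative only (the variable names are assumptions); it ranks the groups by their total frequency and computes the combined share of the six rarest groups.

```python
# "Total number" column of Table 12, keyed by stem.
TABLE_12_TOTALS = {
    "上": 526, "下": 218, "前": 294, "后": 417, "里": 1770, "外": 610,
    "左": 23, "右": 15, "东": 35, "南": 58, "西": 41, "北": 49,
}

# Rank the groups from most to least frequent.
ranked = sorted(TABLE_12_TOTALS.items(), key=lambda item: item[1], reverse=True)

# Combined share of the left/right and compass-direction groups.
grand_total = sum(TABLE_12_TOTALS.values())
rare_groups = ("左", "右", "东", "南", "西", "北")
rare_share = round(
    100 * sum(TABLE_12_TOTALS[g] for g in rare_groups) / grand_total, 1
)

for stem, total in ranked:
    print(stem, total)
print("share of rare groups (%):", rare_share)
```

On these figures the grand total equals the 4,056 of Table 11, and the six rare groups together account for only about 5% of all occurrences.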
The most frequently used type is “里~”; “外~” and “上~” follow it. Here, we would like to focus on the less frequent groups. We would like to emphasize, as Xie (2001:75) also points out, that “左~,” “右~,” “东~,” “南~,” “西~,” and “北~” are used very rarely in real language. Therefore, in order to reduce the memorization burden on students, another alternative might be to first present “上~,” “下~,” “前~,” “后~,” “里~,” and “外~” as Chinese localizers at the elementary level, and to treat “左~,” “右~,” “东~,” “南~,” “西~,” and “北~” as individual vocabulary items.

6. Conclusion
Our analysis of the Chinese corpus has revealed that there are differences in frequency among the three types of localizers in Chinese. The aim of this paper was to apply these usage-based data to Chinese language teaching, especially at the elementary level. We conclude that the “~边” type should be presented at the elementary level, mainly because its overall frequency is the highest and, in particular, because it is frequent in colloquial language. The results of this corpus survey afford some new perspectives on the selection of vocabulary in Chinese language teaching. A foreseeable extension of this research would be to extend the range of corpus research to other function words in Chinese and apply the data to Chinese language teaching.

References
Feng, Z. 2002. “Evolution and Present Situation of Corpus Research in China”. Journal of Chinese Language and Computing Vol.12, No.1. 43-62.
Li, Q. 2003. “Building of Grammar System for Teaching Chinese as a Second Language Based on Style of Writing”. Chinese Language Learning No.3. 49-55.
Liu, Y., W. Pan, and H. Gu. 2001. Shiyong Xiandai Hanyu Yufa (Zengdingben). Beijing: Commercial Press.
Lu, W. 1999. “The application of Databank in Teaching Chinese to Foreign Learners”. Journal of Xiamen University (Arts & Social Sciences) No.4. 112-115.
Tao, H. 1999. “Discourse taxonomies and their grammatico-theoretical implications”. Contemporary Linguistics Vol.1, No.3. 15-24.
Wang, J. 2001. “Recent Progress in Corpus Linguistics in China”. International Journal of Corpus Linguistics Vol.6, No.2. 281-304.
Wang, R. 2003. “A Survey of the Use of Statistical Methods in TCSL Research”. Chinese Language Learning No.3. 60-64.
Wu, Z. (Lü, S.) 1965. “A Survey of the Use of Localizers”. Studies in Chinese Grammar (Beijing: Commercial Press.), Lü, S. 1984. 291-300.
Xie, H. 2001. “Additional Remarks on the Monosyllabic and Disyllabic Locative Synonyms”. Language Teaching and Linguistic Studies No.2. 71-76.
Zhao, W. 2001. “A Brief Discussion about the Scope and Characteristics of the Modern Chinese Localizers”. Journal of Jiangsu Institute of Education (Social Science) Vol.17, No.5. 79-83.
Zhou, Q. & Yu, S. 1997. “Annotating the Contemporary Chinese Corpus”. International Journal of Corpus Linguistics Vol.2, No.2. 239-258.
Zhu, D. 1982. Yufa Jiangyi. Beijing: Commercial Press.
Zhu, D. 1987. “The subject of modern Chinese grammatical studies”. Chinese Language No.5. 321-329.
Zou, S. and Q. Tian. 2001. “On Some Problems of Localizers in modern Chinese”. Problems in Linguistics. IssueⅠ (Jilin Renmin Chubanshe) Dai, Z. and J. Lu. 2001. 155-170.
Rhetorical Questions with Interrogative Markers in Nanai
Shinjiro KAZAMA

0. Introduction
Nanai is a Tungusic language spoken in the Russian Far East. Linguistics has dealt little with rhetorical questions, on the assumption that they belong to the realm of rhetoric. In Nanai, however, rhetorical questions are a matter of grammar, for the following reasons:
• They appear frequently (in fact, in the current study, they account for about 20% of the tokens of interrogative markers once the dummy usage is excluded).
• They require some unique verb endings.
• They are related to such phenomena as indefiniteness and total negation, which sheds light on our analysis of the function of interrogative markers in this language.
• They enable us to note how interrogative markers function in different languages such as Japanese, which in turn would provide a basis for discussions of geographic distributions and influences of the forms in question as well as for typological studies.
The present study employs Kazama (2001) as its data source, in which twelve texts (330 pages in total) are available. I collected the sentences that include interrogative markers from this text corpus, then selected the tokens of rhetorical questions from these sentences and sorted them on the basis of their usage and morphology. Note that throughout this study, vowel harmony and its allomorphy will be dealt with in such a manner that the open vowel allomorph represents the morpheme. Further, examples that are irrelevant to our discussion will be omitted.

1. Interrogative markers as employed in rhetorical questions
We obtain the following seven types of interrogative markers (e.g., interrogative pronouns, interrogative adverbs, and so on) that are used in rhetorical questions: xai “what,” xooni “how,” xaosi “to where,” xaido “to/at where,” xamacaa “what kind of,” ui “who,” and xaali “when.” The following interrogative markers are not attested in rhetorical expressions (the number in brackets after each item indicates its frequency in the entire texts): xado “how many” (30), xaimi “why” (12), xaioji “for what” (9), xaandami “in order to do what” (4), and xajaji “from where” (3). It is likely that these interrogative markers would also be found in rhetorical questions in a more extensive corpus.

The following table lists the number of tokens of each type of usage that the seven interrogative markers exhibit. Types of usage other than rhetorical questions will be examined in the following section. The tokens of rhetorical questions accounted for 12.2% of all tokens of interrogative markers (113 out of 925 tokens), while the figure rises to 20% if we exclude the dummy usage, which will be discussed later. Since it may be difficult to distinguish between rhetorical questions and non-rhetorical or straightforward questions, only those that are clearly rhetorical questions were counted as the former type; in other words, only questions that satisfy the following two criteria were counted: a rhetorical question must exhibit an obvious contradiction to what is inferred from the context, and it must, in some manner, assume a tone of cross-examination. For attested examples of rhetorical questions, refer to Section 3. Here, inflected forms of xai “what” such as xai-wa “what (ACC)” and xai-ji “what (INST),” and verb forms whose root is xai- (i.e., xai- “do what”), are collectively counted as examples of xai. Although xai-mi “why,” xai-o-ji “for what,” and xai-do “to/at where” could also be treated in this way, I did not count these forms as examples of xai, given their semantic distinctness and notable frequencies.

Table 1.
The occurrence of interrogative markers and the ratio of the rhetorical question in each interrogative type

                             xai     xooni   xaosi   xaido      xamacaa        ui      xaali
meaning                      what    how     where   to where   what kind of   who     when
Dummy usage                  340     0       0       19         0              0       0
Question                     75      60      38      21         12             8       5
Total affirmation            21      9       8       1          7              2       5
Total negation               53      11      8       11         17             22      5
Indefinite usage             15      5       1       3          6              0       0
Co-relational construction   0       7       0       0          0              0       0
Emphasis of adjectives       23      0       0       0          1              0       0
Emotional expression         0       3       0       0          0              0       0
Rhetorical question          53      30      8       5          10             6       1
Percentage of R.Q.           9.1%    24%     12.6%   8.3%       18.9%          15.8%   6.3%
TOTAL                        580     125     63      60         53             38      16
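The “Percentage of R.Q.” row of Table 1 is simply the number of rhetorical questions divided by the total number of tokens for each marker. The sketch below recomputes these ratios from the table as printed; the data layout and the helper name `pct` are assumptions introduced for illustration, and half-up rounding is used so that 6.25% is shown as 6.3%. As printed, the column totals sum to 935 rather than the 925 cited in the text, so the overall figures this sketch yields (12.1%, and 19.6% without the dummy usage) differ slightly from the 12.2% and 20% given there, and xaosi comes out at 12.7% against the printed 12.6%.

```python
from decimal import Decimal, ROUND_HALF_UP

# Figures copied from Table 1: (rhetorical questions, dummy usage, total tokens).
TABLE_1 = {
    "xai": (53, 340, 580),
    "xooni": (30, 0, 125),
    "xaosi": (8, 0, 63),
    "xaido": (5, 19, 60),
    "xamacaa": (10, 0, 53),
    "ui": (6, 0, 38),
    "xaali": (1, 0, 16),
}


def pct(part: int, whole: int) -> Decimal:
    """Percentage rounded half-up to one decimal place."""
    value = Decimal(100 * part) / Decimal(whole)
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Share of rhetorical questions per marker.
rq_share = {marker: pct(rq, total) for marker, (rq, _, total) in TABLE_1.items()}

# Overall figures, with and without the dummy usage.
rq_all = sum(rq for rq, _, _ in TABLE_1.values())
dummy_all = sum(dummy for _, dummy, _ in TABLE_1.values())
tokens_all = sum(total for _, _, total in TABLE_1.values())

print(rq_share)
print(pct(rq_all, tokens_all), pct(rq_all, tokens_all - dummy_all))
```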
2. Types of usage other than rhetorical questions
In the following sections, I will offer some remarks on each type of usage as listed in Table 1 in order to help readers have a clear idea of how the rhetorical question is related to these types of usages. Note that I will concentrate mainly on xai “what” and its range of usage.

2.1. Dummy usage
Dummy usage involves the use of question words as fillers in moments of hesitation on the part of the speaker. In some cases, the speaker would restate his words; in others, he would not. This usage corresponds to the Japanese filler “are,” as in are motte kite kure, hora ano hito ni moratta okasi “Bring that, you know what, the cake he gave.” In the present study, this usage was observed only for xai “what” and xaido “to where.” The xai in this particular usage carries number/case/person affixes and is used with no obvious restrictions. Further, it may be used as a verb stem, which was counted as the dummy usage.
(1) n-i-du--ni=l, xai-ci, sun-ci--ni n-x-ni abaa,
go-PARTIC-EP-3SG=PART what-DIR edge-DIR-EP-3SG go-PAST-3SG nothing
(He) went to, how can I say, to the edge, but (she)’s gone. (Kazama 2001: 49)
(2) tui ta-raa=la xai-xa-ni, sia-xan xai-xan xoji-xan
thus do-ANT=PART what-PAST-3SG eat-PAST what-PAST finish-PAST
After that, (she) did what, ate, and did what, finished. (Kazama 2001: 28)
There is a subtype of dummy usage, which slightly differs from that observed thus far: when the speaker realizes that his utterance was incorrect or inappropriate, he either utters xai~ (with a high-pitched tone and a prolongation) followed by what he intended to say or adds some comments on where he misspoke in the previous utterance. I classified this usage as a type of dummy usage. Notably, 340 out of the 580 attested tokens of xai were of the dummy usage; in other words, only 240 tokens constituted a more substantial usage as an interrogative marker. To my knowledge, the other Tungus languages have distinct forms used only for this purpose, the exceptions being Nanai and Ulcha, which employ xai “what” for this purpose. (Uilta: anuni; Udihe, Orochi: ai; Ewen: u; Negidal: un. These forms are not related to the word for “what” in each language.)

2.2. Question usage
Question usage requires little explanation, since asking questions is what interrogative markers are for. However, it is noteworthy that this usage is not at all a major type of usage of interrogative markers: with the exclusion of dummy usages, we obtain the following percentages of the question usage for each interrogative marker: xai 31.3%; xooni 47.2%; xaosi 60.3%; xaido 53.8%; xamacaa 22.6%; ui 21.1%; xaali 31.3%. Overall, question usage accounts for only 23.4% of the tokens. Furthermore, as will be presented below, some examples of question usage indicate a kind of annoyance or cross-examination, which leads us to believe that they are not merely seeking information and are not genuine questions. If we excluded these doubtful examples from this usage, we would obtain a much lower percentage. However, this study counted these examples as question usage.
(3) “mii, n-uci-si muckn moo-wa, xooni ta-ori.”
I.NOM go-COND-2SG alone log-ACC how do-IMPERS
Should you leave me, how could I do with logs? (Kazama 2001: 102)
(4) “amaa isinda-xa-ni isi-i=j’’=m. “xaido bi-i-ni,” un-dii,
father arrive-PAST-3SG arrive-PRES=PART what-DAT be-PRES-3SG say-PRES
ami-si. ami-ko-ji tui bi-uri-ni bu,
father-2SG father-PROP-INS so be-IMPERS-3SG we.NOM
(The child) says, “Dad is here; he’s arriving.” (His mother) says, “Where is he?” “Your father. If he had been here, would it be that our life is like this?” (Kazama 2001: 54)
2.3. Total affirmation and total negation
Total affirmation is exemplified by an example such as xai=daa xm bii. “There is everything,” while total negation is exemplified by an example such as xai=daa abaa. “There is nothing.” Thus, total affirmation involves an interrogative marker followed by a particle =daa (or =daada), often together with xm “all” optionally. Total negation also requires the particle =daa, followed by the negative marker. The attested examples from the texts are provided below (in (6), the negative form of the verb is -(r)asi(n), which is in bold).
(5) ota-wa, aron, xai=daa xm tsic-uri jaka-sal toobo-raa
shoes-ACC jambeau what=PART all put on-IMPERS thing-PL take up-ANT
After taking up shoes, jambeau, whatever to wear, (Kazama 2001: 70)
(6) undii, “mii=l xai-wa=daa saa-rasim-bi” =m=da,
say I.NOM=PART what-ACC=PART know-NEG.PRES-1SG=PART=PART
(He) says, “I don’t know anything.” (Kazama 2001: 55)
As we noted in Table 1, ui “who” and xamacaa “what kind of” are used for total negation more frequently than for questions. Although it is also observed in Japanese, this grammatical strategy for expressing total affirmation and total negation, i.e., an interrogative marker followed by a particle, is obviously not universal (cf. English, where these concepts are encoded without interrogative markers: there is nothing). Notably, total negation and total affirmation are semantically related to rhetorical questions, in that mii xai saara. “How do I know?”, for example, has a similar semantic effect to mii xai=daa saa-rasim-bi. “I do not know anything.” Rhetorical questions involving an interrogative marker can therefore be said to form a semantic continuum with total negation, which would indicate that the former developed from the latter. This is still speculative and requires further cross-linguistic research. At this stage, I suggest the hypothesis that if a language extensively employs interrogative markers for rhetorical questions, it is also likely to employ interrogative markers for total negation.

2.4. Indefinite usage
In Kazama (2003: 290–1), I characterized indefinite usage as identified in Tungusic languages as follows:

In Tungus languages, indefiniteness is expressed by an interrogative marker followed by an interrogative sentence marker, as is the case even in Japanese.
[Nanai]
dr oja-la-ni xai=noo bi-i-ni.
desk top-LOC-3SG what=PARTIC be-PRES-3SG
There is something on the desk.

ui=nuu uik-w dukt-i-ni.
who=PARTIC door-ACC knock-PRES-3SG
Someone is knocking on the door.

sia-xa-si=noo.
eat-PAST-2SG=PART
Have you eaten?
In Korean, indefiniteness is expressed directly using an interrogative marker. According to Hashimoto (1981: 24–67), the Chinese language and the languages of its neighbors to the south do not formally distinguish between questions and indefinite expressions.

The present study, however, revealed that the above description requires elaboration. It is indeed the case that there are examples in which an indefinite expression is encoded by =noo, which may appear sentence-finally for Yes-No questions, as the following example demonstrates:
(7) “ca-wa=tanii xai-wa=noo japa-i-do-ji=tanii,
that-ACC=PART what-ACC=PART take-PRES-DAT-REF.SG=PART
l-l-ni ts toala-xa-ni,”
skirt-LOC-3SG ONOMAT get hold up-PAST-3SG
“When (she) was reaching for that, something, her clothes got held up with something.” (Kazama 2001: 77)
However, there are other examples that do not involve this construction. Before proceeding to examine such examples, it is necessary to clarify the following point. We need to distinguish between definiteness and referentiality. For example, in Japanese, one comes across such examples as hako no naka ni nani ka aru “There is something in the box” and nani ka jomu mono wa naika “Is there anything to read?” Both these examples involve an interrogative marker (nani: “what”) carrying a particle ka such that these two appear to be the same indefinite expression. However, the above two propositions are encoded with distinct formal devices in Russian, as in “V jashshike chto-to est’.” and “Net li chego-nibud’ pochitat’?”, where chto-to (something which exists, i.e., referential) and chto-nibud’ (something whose actual existence is irrelevant, i.e., non-referential) are formally distinguished. Returning to Nanai, we observe the following examples:
(8) “sii ini pulsi-mi xai=daa doolji-aci-si=noo,”
you.NOM today go around-SIM what=PART hear-NEG.PAST-2SG=PART
“Did you hear anything while you were walking around?” (Kazama 2001: 160)
(9) “xai-wa=daa ta-mi xooni bi-uri,
what-ACC=PART do-SIM how be-IMPERS
mii n-x-si xamia-la-ni,
I.NOM go-PAST-2SG after-LOC-3SG
“How could I live and what would I do, if you leave me?” (Kazama 2001: 101)
Note here that Nanai employs =daa for non-referential objects, which, though this is speculative at this stage, does not appear to imply “something specific” or “in some specific way” at all. This suggests that Nanai, like Russian, is also sensitive to referentiality. Another fact that should be noted is that =daa is a form used for total affirmation. Thus, in Nanai, non-referentiality forms a continuum with total affirmation, unlike in Japanese, where such a continuum is not observed.
2.5. Co-relational construction
A co-relational construction is exemplified by an expression such as ui joboasi osini ui siarasi. “If you won’t work you shan’t eat (lit. who does not work, who does not eat).” Tsumagari (1996: 182) refers to this construction as a “quasi-relative pronominal construction by way of interrogative repetition.” In the corpus, I found only seven attestations of this type of construction, all of which involved xooni “how,” as in “As (someone) did (something) in some way, so did (someone else) in the same way (lit. As (someone) did in what way, so did (someone else) in what way).” Examples of the following type, involving xooni, appear to be the most typical co-relational constructions in Nanai:
(10) xooni, aa-ni taktoola-xa-ni
how brother-3SG step-PAST-3SG
xooni n-lu-xn ti pujin=ul,
how go-INC-PAST that heroine=PART
As the elder brother left his footprints, so the heroine went along in the same way (lit. In what way the elder brother left the footprints, in what way the heroine went along.) (Kazama 2001: 279)
2.6. Emphasis of adjectives
Just like English and Japanese, Nanai may employ interrogative markers for the emphasis of adjectives. In this usage, xai=daa (“what” plus a particle) is exclusively employed, and it follows the adjective that is emphasized, unlike in Japanese, in which this order is reversed. The only exception to these generalizations is (12) below, where xamacaa is used and is followed by the adjective:
(11) “rd xai=daa,” un-dii-ni ik-ni,
strange what=PART say-PRES-3SG sister-3SG
“mii xoonaa u-k-j,
I.NOM how say-PAST-1SG
“How strange it is,” says the elder sister, “How did I say?” (Kazama 2001: 32)
(12) ic-i-ni=l xamacaa oo-ko urun tuli-ni=m
see-PRES-3SG=PART what logic-PROP people garden-3SG=PART
ic-x-ni.
see-PAST-3SG
A brief look at their garden shows how neat these people are. (Kazama 2001: 36)
3. Verb ending forms in rhetorical questions
This chapter sets out to examine rhetorical questions as encoded by interrogative markers. As has been noted earlier, the fact that the verb ending form in rhetorical questions is distinct from the forms that appear in non-rhetorical expressions is particularly interesting. In what follows, I will first provide a brief overview of the various types of verb ending forms that are observed; I will then examine each form in detail. The following table summarizes the types of verb ending forms that appear in rhetorical questions (particular attention is to be paid to the numbers in bold). Each form will be discussed in detail in the following sections.

Table 2. The verb ending forms found in the rhetorical questions
[Column headings of the table: Present indicative final form 3SG; Present indicative final form 1SG; Impersonal participle; Participle (Affirmative); Participle (Negative); Participle (Future); Nominal and adjectival predicate; Negation of existence; Obligative stem-formative form; Subjunctive mood; Purposive converb + ta (to do); TOTAL. The first column appears to be Present indicative final form 3SG and the last is TOTAL; the left-to-right order of the remaining headings is not recoverable from the extracted text, so the columns are numbered here.]

          (1)   (2)   (3)   (4)   (5)   (6)   (7)   (8)   (9)   (10)   (11)   TOTAL
xai       15    0     15    6     6     0     2     1     7     0      1      53
xooni     0     1     0     15    0     0     3     1     0     10     0      30
xaosi     0     0     1     0     0     0     6     0     1     0      0      8
xaido     0     0     0     2     1     0     1     0     1     0      0      5
xamacaa   0     0     0     1     0     3     4     2     0     0      0      10
ui        0     0     0     1     3     0     2     0     0     0      0      6
xaali     0     0     0     0     0     0     1     0     0     0      0      1
TOTAL     15    1     16    25    10    3     19    4     9     10     1      113
3.1. Third person present indicative final form
In Nanai, many of the verbs that serve as sentence-final predicates are participles, which exhibit both nominal and adnominal features, carrying case suffixes by themselves and modifying nouns. On the other hand, final forms, which may only serve as sentence-final predicates, are not frequently observed in Nanai. Avrorin (1961: 65) notes that, according to some statistics, participles accounted for 70% of all tokens, while converbs and final forms accounted for 21% and 9%, respectively. Here, the final form includes imperatives, and if these are excluded, the rest, i.e., the indicative final form, should amount to much less than 9%. Malchukov (2000: 450–1) summarizes Avrorin’s description and provides the following hierarchy regarding verb forms:

Markedness Hierarchy in verbal tense forms
Person Hierarchy: 1 > 2 > 3
Number Hierarchy: Sg > Pl
Tense Hierarchy: Present > Past
< higher in frequency          > higher in emphasis
This hierarchy indicates that in each category (person, number, and tense), the verb forms on the left-hand side are higher in frequency than those on the right-hand side, and that the latter carry more emphatic meaning than the former. In my experience, the majority of present indicative forms involved the first and second persons as used in conversations between two interlocutors, encoding directly experienced activities or events. With regard to tense, on the other hand, the past tense form appears to be observed more frequently than the present tense form, although this claim requires further examination. Discussions on different final forms and their evaluations in terms of the hierarchy will be left open for the present. Of specific relevance here is the fact that the third person present indicative final form, which is rarely found otherwise, is observed frequently in rhetorical questions. The following set of examples illustrates this point. (Note that the internal structure of the verb form in question involves -(r)a(n) if the stem ends with a vowel, or -(d)a(n) if the stem ends with a consonant, each of which is followed by a person suffix: saa- “know”: 1sg. saa-ram-bi, 2sg. saa-ra-ci, 3sg. saa-ra, 1pl. saa-ra-po, 2pl. saa-ra-so, 3pl. saa-ra-l.)

(13) “sii kol kol bi-i-si, mii, xai saa-ra,”
     you.NOM ONOMAT be-PRES-2SG I.NOM what know-IND.PRES
     “How could I know if you keep silent?” (Kazama 2001: 127)

(14) “nai-ni=la jia anaa xai baa-ra” =m=da.
     man-3SG=PART partner without what find-IND.PRES=PART=PART
     “How could a woman give birth to a baby alone?” (Kazama 2001: 66)
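The formation rule given in the note above can be sketched as a small function that reproduces the cited saa- paradigm. This is a rough illustration only: the function name, the vowel inventory, and the treatment of the final n of -(r)a(n)/-(d)a(n) as surfacing only as m before 1sg -bi are assumptions read off the paradigm, and vowel harmony is not modeled, so forms for other stems may not match attested ones.

```python
# Person suffixes as they appear in the cited saa- "know" paradigm.
PERSON_SUFFIXES = {
    "1sg": "bi", "2sg": "ci", "3sg": "",
    "1pl": "po", "2pl": "so", "3pl": "l",
}
VOWELS = set("aeiou")  # assumption: simplified vowel inventory


def present_indicative(stem, person):
    """Build a hyphen-segmented present indicative final form.

    Assumptions: the marker is -ra after a vowel-final stem and -da after a
    consonant-final stem; the optional nasal appears only before 1sg -bi,
    where it is written m. Vowel harmony is NOT modeled.
    """
    marker = "ra" if stem[-1] in VOWELS else "da"
    if person == "1sg":
        marker += "m"
    suffix = PERSON_SUFFIXES[person]
    # Drop the empty 3sg suffix so that no trailing hyphen is produced.
    return "-".join(part for part in (stem, marker, suffix) if part)


paradigm = {person: present_indicative("saa", person) for person in PERSON_SUFFIXES}
print(paradigm)
```

The 3sg form produced for baa- "find" under the same assumptions corresponds to the verb form in example (14).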
The fact that a final form may encode a rhetorical question has already been
noted by Malchukov (2000: 452): Semantically, the affirmative forms [i.e., final form in the present study, the present author] differ from the corresponding indicative [i.e., participle in the present study, the present author] in that the former involve more commitment to the truth of the proposition on the part of the speaker. Thus, verbal forms, in contrast to participials, are not combinable with hypothetical markers, such as the modal particle bid’ere ‘maybe’ (L. Z. Zaksor, p.c.). Moreover, verbal affirmative forms, when used in interrogative sentences, as in (10b) or (24), have the force of a rhetorical question, asserting rather than questioning the truth of the proposition.
Although this statement by Malchukov (2000) is of significance, he did not elaborate further: questions such as whether or not a rhetorical question involving interrogative markers must also involve the final form of a verb, and why the final form may be employed in rhetorical questions, are left open for discussion. It is necessary, therefore, to examine the functions of final forms as used in rhetorical questions. I will now offer some remarks on this problem. First, as a preliminary, we must examine the attested functions of final forms, on the basis of which their possible function in rhetorical questions can be considered. Here, I will focus on the third person present final form. Following Avrorin (1961), Malchukov (2000: 451) refers to the final form as an “affirmative mood” or the “validational,” assuming that it developed from the directional evidential form that is, according to him, preserved in Udihe.

a. Mi un-di-i.
   I.NOM say-PRES.PART-1SG
   “I say.”
b. Mi un-dem-bi.
   I.NOM say-PRES-1SG
   “I (do) say.”
On the other hand, Malchukov notes that the third person present final form is restricted to emphatic contexts. The following examples are from Avrorin (1961: 106–9). Note that the transcription is adjusted to the convention in the present study, and the examples involving rhetorical questions are excluded. In addition, the gloss and emphasis are by the present author.

xai rd-w-ni baa-ra!
what interesting-ACC-3SG find-IND.PRES
What an interesting thing he has found!

tumci-mi ta-ruu. xulu nu-r!
take care-SIM do-IMP squirrel go away-IND.PRES
Be careful, or the squirrel will run!

ada-asi-si osini,
believe-NEG-2SG if
i ixon-du bi-i urun=d saa-ra-l=tanii.
this village-DAT be-PRES people=PART know-IND.PRES-PL=PART
Whether you believe or not, the people in this village seem to know that.

bicix-w ba-ra=ma.
letter-ACC get-IND.PRES=PART
(He) got the letter.
It is pointed out that the final form as used in contexts other than rhetorical questions is likely to involve a sudden realization on the part of the speaker, and hence brand-new information, as in the first two examples above. A similar usage is found in the latter two examples, although these carry additional =tanii and =ma. I assume that these two examples also involve new information. =tanii is often used for encoding the topic of a sentence and for contrastive focus, like wa in Japanese, while =ma mainly functions to validate the future form of predicative verbs.

3.2. Obligative stem-formative form

As one of its verb derivational affixes (stem-formative suffixes), Nanai has the obligative stem-formative affix -ila “have to; should; be supposed to.” As shown in Table 2, this form often appears in rhetorical questions, although it only co-occurs with xai “what.” In Kazama (2001), we obtain 25 tokens of -ila, out of which 16 appear in rhetorical questions.

(15) aala anaa nai xai nai ulsi-il-i.
     hand without person what person like-OBL-PRES
     How would people like a person who has no hands? (Kazama 2001: 105)

(16) nimaan pikt-ni=k xai oida-mi ur-il-i.
     folktale child-3SG=PART what take time-SIM grow-OBL-PRES
     That’s a child in a folktale, why would it take a long time to grow up? (Kazama 2001: 255)
The last example is the kind of fixed expression that is typical of folk narratives.
3.3. Impersonal participle

An impersonal participle is a participle that does not carry a person suffix, and encodes such expressions as “should,” “can,” “be done (by),” “be supposed to,” and so on. The allomorph -ori is for the stem that ends with a vowel, while the allomorph -bori is for the stem that ends with a consonant. As is evident from Table 2, this verb form is the most frequently observed form in rhetorical questions, occupying approximately a quarter of all the tokens. It is also worth noting that this form co-occurs with a full range of interrogative markers, with the exception of xaosi “where” and xaali “when.” It co-occurs most frequently with xooni “in what way.” Briefly examining Japanese, we find that the interrogative nande “in what way, how” may be used in rhetorical questions, which Kokugo gakkai ed. (1980: 714) refers to as the “colloquial typological form,” as in nande ...na mono ka “How can... (emphasis mine).” Thus, Nanai and Japanese have a similar construction, in that xooni in Nanai frequently occurs with an impersonal participle. Given below is an example of xooni co-occurring with an impersonal participle.

(17) “mii simbi waa-mi mut-sim-bi,
     I.NOM you.NOM kill-SIM can-NEG.PRES-1SG
     aa-bi xooni waa-ori.”
     brother-REF.SG how kill-IMPERS
     “I could never kill you, how could I kill my own elder brother?” (Kazama 2001: 158)
Following xooni, we will examine xai “what.” In this particular construction, xai does not encode “what,” but instead encodes “why.” That is, xai in this construction is semantically equivalent to xooni. In Japanese, nande “why” and nani ga “what” may be interchangeable in certain contexts, as in nande/nani ga battiri umaku nanka iku mon ka “how can it be successful at all?”

(18) “nai xm xm bi-i-w-ni, xai um-buri,”
     people ONOMAT be-PRES-ACC-3SG what say-IMPERS
     “People keep silent, how could they utter a word?” (Kazama 2001: 251)
The following examples involve interrogative markers other than xooni and xai.

(19) “xai-do bi-puu-ji bi-uri-ni,”
     what-DAT be-PURP-REF.SG be-IMPERS-3SG
     “Where could we find our place to live?” (Kazama 2001: 111)

(20) “xamacaa nai-ja-ni tui sirkci-uri-ni” =m.
     what kind of person-ACC-3SG thus persecute-IMPERS-3SG=PART
     “What person would be beaten in such a harsh way?” (Kazama 2001: 252)

(21) si ui-ji l-uri. ti pikt-ni=d xm waa-o-xan.
     now who-INS be afraid-IMPERS that child-3SG=PART all kill-REPET-PAST
     Now who would be scary (for us)? (We) killed that child, too, all of them. (Kazama 2001: 204)
3.4. Nominal and adjectival predicates

A nominal predicate is illustrated in the following example:

(22) “saman=ola xai saman, jn=ul xai jn,”
     shaman=PART what shaman priestess=PART what priestess
     “You say she is a shaman, but what would make us call her a shaman? You say she is a priestess, but what would make us call her as such?” (Kazama 2001: 138)

The following example is of adjectival predicates:

(23) “ca-wa baa-ori xai maa,”
     that-ACC get-IMPERS what hard
     “What difficulty would be involved with getting that?” (Kazama 2001: 161)

An expression similar to the adjectival predicate in (23) is xai tj-ku, where -ko “with” is attached to the adjectival predicate. I have heard this form being used several times in daily conversations, and I suspect that it has been idiomatized. Onenko (1980: 415) also notes this, describing xai tj as “what meaning does it have? No meaning.” Here, it should be noted that tj means “truth.”

(24) “xai tj-ku,” un-dii, “kt nii tolki-i-ni.
     what truth-PROP say-PRES woman person dream-PRES-3SG
     “What would be worth believing?” he said, “That’s just what a woman saw in her dream.” (Kazama 2001: 130)

The following example also involves -ko “with,” and it is an idiomatic expression.

(25) “mii mn lci-u-ji, kk-u-ji
     I.NOM oneself slave-ALIEN-REF.SG female slave-ALIEN-REF.SG
     waa-xam-bi=la, ui-du daalji-ko,”
     kill-PAST-1SG=PART who-DAT relation-PROP
     “Even if I kill my slave, a female slave, who would care?” (Kazama 2001: 327)
3.5.1. Participle (affirmative)

As pointed out earlier, the participle is the most frequently observed verb form. It is, therefore, fairly predictable and natural that it appears in rhetorical questions.

3.5.2. Participle (negative)

A negative form of the predicate is equivalent to total affirmation, as illustrated below (note that -asi is a negative verb (present participle)).

(26) “nai-ja ic-mri xooni otoli-asi-so” =m=d.
     person-ACC see-SIM.PL how recognize-NEG-2PL=PART=PART
     “How could you fail to recognize this person?” (Kazama 2001: 264)
3.5.3. Participle (future)

Kokugo gakkai ed. (1980: 714) describes constructions such as dare ga...sinaidaroo “who will not do” and doosite...surudaroo “why will (someone) do” (emphasis mine) as the “colloquial typological form.” In Japanese, therefore, future expression is also likely to occur in rhetorical questions. In Nanai, we obtain an example such as the one provided below (the future participle is -jaa(n)):

(27) “aja~ l-uri jaka-la, xai n-jm-bi,”
     all right be afraid-IMPERS thing-LOC what go-FUT-1SG
     “That’s all right, who would visit such a terrifying thing?” (Kazama 2001: 28)
3.6. Purposive converb

As shown below, what we refer to as a purposive converb is more accurately a construction [V-purposive converb -(po)o-person suffix ta-], where ta- is the verb root “do; say.” Thus, this entire construction encodes the expression “be going to do; suggest doing.” This construction accounts for 10 tokens in the corpus, all involving the interrogative xooni “in what way”:

(28) “xooni n-u-ri ta-i-so, bu-dii-su=rd-ni=oani.”
     how go-PURP-REF.PL do-PRES-2PL die-PRES-2PL=PART=PART=PART
     “How would you go there? You are sure to die.” (Kazama 2001: 117)
3.7. Subjunctive mood

The following example is the only attested example of the subjunctive mood (-mca-person suffix):

(29) tui un-dii-du--ni, nu-xn osini
     thus say-PRES-DAT-EP-3SG leave-PAST if
     xai pata-rii-do-a-ni tui o-mca=m=da.
     what frighten-PRES-DAT-EP-3SG thus become-SUBJ=PART=PART
     Would it have been like this if you had left as people told you to do so? (Kazama 2001: 33)
4. Conclusions

This study has shown that in Nanai, rhetorical questions, indefiniteness, and total negation may be encoded by interrogative markers and interrogative expressions, suggesting that these concepts are semantically related to one another. Kamei, Koono, and Chino eds. (1996: 280) describe the term “interrogative” as follows (translation mine):

An interrogative sentence is used to clarify what is uncertain to the speaker; however, this form may sometimes be used for other purposes as well. (snip) A surface interrogative form may contribute to emphasizing the utterance of the speaker. The so-called rhetorical question, which involves a surface negative interrogative, is not essentially a negative sentence, but rather an emphatic and rhetorical interrogative by which the speaker leads the hearer to negate the interrogative sentence. (snip) In interrogative expressions, that which is unclear to the speaker is interrogated, thus the interrogated is that which is indefinite. It is this indefiniteness involved in interrogative expressions that allows many languages to employ interrogative markers as indefinite markers. In Japanese, for example, such indefinite expressions as nanimokamo and nani kara nani made involve an interrogative nani, which is used here as an indefinite marker; moreover, expressions such as nani ka and dare ka, which involve an interrogative marker followed by a particle, are exclusively used for indefinite markers. These are obviously transferred from interrogative markers. (snip) Interrogative markers become accessible to indefinite expressions when their semantic content of interrogation is bleached, such that their semantic content of indefiniteness, as a result, becomes conspicuous. When this occurs, it is more likely that interrogative markers carry some other element, just as nani ka and dare ka in Japanese. In Latin, for example, quidam “someone” is analyzed as quis “who” and a suffix dam.
Thus, it is frequently observed in different languages that rhetorical questions, indefiniteness, and total negation (the last of which, according to Kamei, Koono, and Chino eds. (1996), is treated as indefiniteness in a broader sense) are encoded by interrogative markers and/or interrogative expressions as a whole. On the other hand, I have shown that in Nanai, a rhetorical question requires unique verb forms on the part of the sentence-final predicates, such as the third person present final form, the obligative form, and the impersonal participle. A rhetorical question in Nanai may involve other formal strategies than what has been discussed in this study, including those which do not require interrogative markers but involve particles such as =tanii and =kaa. It is
apparent that the clarification of these expressions has consequences for uncovering the functions of final forms and the semantic concept of evidentiality in Nanai, and is of interest in considering the correlations between them and Kakarimusubi constructions in Japanese. This research topic is still an untapped frontier that needs to be explored in future research. The present study is based on a corpus collected through fieldwork. I believe that this kind of study plays an important role in the Linguistic Informatics of the 21st century COE in TUFS, because of its cross-linguistic and inductive approach.

Abbreviations

ACC: accusative case; ALIEN: alienability; ANT: anterior converb; COND: conditional converb; DAT: dative case; DIR: directive case; EP: epenthetic vowel; IMP: imperative; IMPERS: impersonal participle; INC: inchoative aspect; INDIC: indicative mood; INS: instrumental case; LOC: locative case; NEG: negative; NOM: nominative case; OBL: obligative; ONOMAT: onomatopoeia; PART: particle; PARTIC: participle; PAST: past; PL: plural; PRES: present participle; PROP: constant/permanent proprietive; PURP: purposive converb; REF: reflexive; REPET: repetitive-reversive aspect; SG: singular; SIM: simultaneous converb; SUB: subjunctive mood; 1: 1st person; 2: 2nd person; 3: 3rd person; -: suffix boundary; =: particle boundary.
References

Avrorin, V. A. 1961. Grammatika nanajskogo jazyka, t. II. Moskva/Leningrad: AN SSSR.
Hashimoto, M. 1981. Gendai Hakugengaku [Modern Linguistics -Front Lines of Current Linguistic Studies-]. Tokyo: Taisyuukan-syoten.
Ikegami, J. 2001. Tsunguusugo Kenkyuu [Researches on the Tungus Language]. Tokyo: Kyuukosyoin.
Ikegami, J. 2002. Zootei Uilta Kootoo Bungei Genbunshuu [Uilta Oral Literature: A Collection of Texts (Publications on Tungus Languages and Cultures 16)]. ELPR (Endangered languages of the North Pacific Rim) Publications series A2-013. Suita: Osaka Gakuin University.
Kamei T., R. Koono & E. Chino eds. 1996. Gengogaku Daijiten [Sanseido Encyclopaedia of Linguistics] Vol. 6. Tokyo: Sanseido.
Kazama, S. 2001. Naanai no Minwa to Densetsu 6 [Nanay Folk Tales and Legends 6 (Publications on Tungus Languages and Cultures 15)]. ELPR Publications series A2-005. Suita: Osaka Gakuin University.
Kazama, S. 2003. “Arutaishogengo no 3 guruupu (Tyuruku, Mongoru, Tsunguusu), oyobi Tyoosengo, Nihongo no Bunpoo wa Hontoo ni Nite Irunoka [Do the Three Groups of the “Altaic” Languages (Turkic, Mongolic, and Tungusic), as well as Korean, really Resemble Japanese Grammatical Structure?: An Attempt at a Contrastive Grammar Analysis]”. Perspectives on the Origins of the Japanese Language, Nichibunken Japanese Studies Series, No. 31. Vovin, A. & T. Osada eds. Kyoto: International Research Center for Japanese Studies. 249-340.
Kokugo gakkai ed. 1980. Kokugogaku Daijiten [Encyclopaedia of Japanese Linguistics]. Tokyo: Tookyoodoo-syuppan.
Malchukov, A. N. 2000. “Perfect, evidentiality and related categories in Tungusic languages”. Evidentials: Turkic, Iranian and neighbouring languages. Johanson, L. and Utas, Bo. eds. Berlin; New York: Mouton de Gruyter.
Onenko, S. N. 1980. Nanajsko-Russkij slovar’. Moskva: Izd. Russkij jazyk.
Tsumagari, T. 1996. “Tyuugoku, Roshia no Tsunguususyogo [The Tungusic Languages in China and Russia]”. Gengo Kenkyuu [Journal of the Linguistic Society of Japan] No.110. Kyoto: The Linguistic Society of Japan. 177-90.
Vacillation in the Selection of Complementizers of Malay Transitive Verbs

Isamu SHOHO

1. Introduction

In Shoho (1999), the author conducted research on characterizing transitive verbs that take complements headed by agar, supaya, untuk, and the zero complementizer. It was contended that the difference between untuk and supaya is related not only to subcategorization at the level of the lexicon but also reflects a structural difference between the complementizers. This difference plays an important role in differentiating sentences that look alike superficially, but that in fact must be distinguished from one another. The difference proposed by the author can deal fairly well with the problem of complementizer selection using the target transitive verbs. However, as the author himself admitted, there were some unexplainable irregularities found with regard to the selection of complementizers by the target verbs. In other words, some cases of complementizer selection cannot be explained on the basis of the hypothesis proposed by the author. The author merely stated that these irregularities arise from false analogies with a different sentence pattern. A more consistent and thorough solution to this problem is therefore still needed. From a broader perspective and a different angle, this paper aims to review the problem of complementizer selection using some transitive verbs; this is expected to solve the problems that remained from the preceding research. In this research, the author used data retrieved from the headlines and national news from Berita Harian Online News; the data was collected from September 2005 through March 2006. To supplement this corpus, the author used a compilation of news from Berita Harian Online News collected over the last two months, i.e., from August through September 2006.
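A first pass over a news corpus of this kind can be sketched as a simple co-occurrence count of target verbs and the complementizers that follow them. This is only an illustrative sketch and not the procedure actually used for this paper: the function name, the token window, and the treatment of "no overt complementizer within the window" as a zero complementizer are all assumptions, and the sample sentences are taken from the numbered examples of this paper rather than from the corpus itself.

```python
import re
from collections import Counter, defaultdict

# Overt complementizers discussed in this paper.
COMPLEMENTIZERS = ("supaya", "agar", "untuk", "bahawa")


def count_complementizers(sentences, verbs, window=8):
    """Count, per target verb form, the first overt complementizer that
    occurs within `window` tokens after the verb. If none is found, the
    token is counted as 'zero' (a crude heuristic, stated as an assumption)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = re.findall(r"\w+", sentence.lower())
        for i, token in enumerate(tokens):
            if token in verbs:
                following = tokens[i + 1 : i + 1 + window]
                comp = next((t for t in following if t in COMPLEMENTIZERS), "zero")
                counts[token][comp] += 1
    return counts


# Sample sentences drawn from the paper's own numbered examples.
sample = [
    "Mereka mencadangkan supaya Tuan Pengerusi membubarkan persidangan itu.",
    "Dia mengajak saya untuk menziarahi datuknya di kampung.",
    "Dia mengajak saya menziarahi datuknya di kampung.",
]
result = count_complementizers(sample, {"mencadangkan", "mengajak"})
for verb, counter in result.items():
    print(verb, dict(counter))
```

Such counts only show which complementizer follows which verb form; deciding whether a given occurrence is canonical or the product of a false analogy still requires the structural analysis developed in the rest of the paper.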
These data revealed that the situation appears more confusing than it did in the year 1999 when the author conducted research on complementizer selection. The author has found many cases where the same transitive verb sometimes takes the supaya complementizer, whereas at other times it takes the untuk complementizer, even when the situation is clearly the same. Using a relatively large amount of data, the author has succeeded in summing up the canonical characteristics of each group of transitive verbs. In this research, the author does not adopt the stance of deeming all the data collected as
reflecting the right linguistic instinct. The author has to admit that some irregularities are based on erroneous analogies or an incorrect analysis of the sentence pattern. This paper is a slightly audacious attempt to show that some irregularities that stray from the canonical characteristics should be disregarded, and at the same time to explain why the selection of the complementizer in those cases is incorrect. With this, it is hoped that the linguistic instinct that distinguishes between the use of supaya and untuk regains its own glow. In Shoho (1999), the author maintains that the following sentences show superficial similarity, i.e., all of them have the structure of subject+verb+complement. This superficial similarity among them suggests the same sentence structure.

(1) Dia mengajak saya menziarahi datuknya di kampung. (He invited me to visit his grandpa in the country.)
(2) Mereka mencadangkan Tuan Pengerusi membubarkan persidangan itu. (They proposed that the Chairman should dissolve the proceeding.)
(3) Doktor forensek mengesahkan gigi palsunya tersangkut di kerongkongnya. (The forensic doctor confirmed false teeth stuck in his throat.)
However, if we extract the initial element in the complement sentence and place it in the sentence initial position, there appears to be a difference in the selection of complementizers. This is shown by the following examples.

(4) Saya diajaknya untuk menziarahi datuknya di kampung. (I was invited by him to visit his grandpa in the country.)
(5) Tuan pengerusi dicadangkan oleh mereka supaya membubarkan persidangan itu. (A motion was proposed to the Chairman to dissolve the proceeding.)
(6) Gigi palsunya disahkan oleh doktor forensek (φ) tersangkut di kerongkongnya. (False teeth have been confirmed to have stuck in his throat.)
Concerning sentence (1), the elements that can be inserted in the postverbal position are restricted by the verb so that the features of the elements to be inserted are compatible with those of the preceding verb. In more concrete terms, the elements to be inserted in the postverbal position must have a [+human] feature because only human beings can be invited to do something. From this fact, it can safely be stated that the postverbal position in this construction must be a place that is directly governed by the verb, i.e., the object of the main sentence. The sequence that follows the object is the complement sentence of the object. Before considering the structure of sentence (2), it is necessary to point out that mencadangkan requires an object after itself, as can be shown by the following pair.

(7) Mereka mencadangkan pembubaran persidangan itu. (They proposed a dissolution of the proceeding.)
(8) *Mereka mencadangkan. (They proposed.)
Based on the fact that an object is obligatorily filled in after mencadangkan, we can safely say that in the following sentence, the supaya complement sentence as a whole functions as an object.

(9) Mereka mencadangkan supaya Tuan Pengerusi membubarkan persidangan itu. (They proposed that the Chairman should dissolve the proceeding.)
As is shown in the following sentence, before the supaya clause, mencadangkan has a prepositional phrase headed with kepada (to). (10) Mereka mencadangkan kepada Tuan Pengerusi supaya membubarkan persidangan itu. (They proposed to the Chairman that he should dissolve the proceeding.)
One hypothesis is that the structure of sentence (2) can be analyzed as in (11):

(11) Mereka mencadangkan (kepada) Tuan Pengerusi [cp φ membubarkan persidangan itu]. (They proposed to the Chairman that he should dissolve the proceeding.)
Another possibility of analyzing (2) is as follows: (12) Mereka mencadangkan [cp φ Tuan Pengerusi membubarkan persidangan itu]. (They proposed that the Chairman should dissolve the proceeding.)
It is empirically known that the left-most element in the embedded sentence can only be moved if the subject of the main sentence is filled with an EXE element, i.e., an expletive empty element. As can be attested by the following sentences, the subject of dicadangkan is supposed to be filled with EXE.

(13) EXE dicadangkan supaya Tuan Pengerusi membubarkan persidangan itu. (It was proposed that the Chairman should dissolve the proceeding.)

The clause led by supaya in (13) can be passivized as in sentence (14).

(14) EXE dicadangkan supaya persidangan itu dibubarkan oleh Tuan Pengerusi. (It was proposed that the proceeding should be dissolved by the Chairman.)

In (14), persidangan itu is extracted from the supaya complement sentence and replaces EXE, which generates (15).

(15) Persidangan itu dicadangkan supaya dibubarkan oleh Tuan Pengerusi. (*The proceeding is proposed to be dissolved by the Chairman.)
In the corpus, we cannot find sentences like (16) where supaya has been replaced with untuk. (16) *Persidangan itu dicadangkan untuk dibubarkan oleh Tuan Pengerusi. (*The proceeding is proposed to be dissolved by the Chairman.)
On the contrary, in the corpus, we come across sentences like (17), which the author deems as inappropriate. (17) Tuan pengerusi dicadangkan untuk membubarkan persidangan itu. (The Chairman is proposed to dissolve the proceeding.)
The generation of sentence (17) is based on an erroneous analogy with the mengajak construction; in this analogy, the postverbal argument is subject to a selectional restriction, i.e., the NP filled in the postverbal position must have +human features. From what we have seen, it can be concluded that sentence (5) is derived from sentence (18).

(18) Mereka mencadangkan (kepada) Tuan Pengerusi supaya membubarkan persidangan itu.
We cannot consider (19) to be the source sentence of (5), because if it were, it could not be explained why only (16) is ungrammatical.

(19) Dicadangkan (kepada) Tuan Pengerusi supaya membubarkan persidangan itu. (It was proposed to the Chairman that he should dissolve the proceeding.)
In contrast with sentence (16), which is ungrammatical, sentence (20) is acceptable. (20) Tuan Pengerusi dicadangkan supaya membubarkan persidangan itu. (To the Chairman was proposed that he should dissolve the proceeding.)
The derivational process of (20) is as shown in (21), where Tuan Pengerusi functions as a topic with the zero operator being moved from the postverbal prepositional phrase to the Spec of CP, which is a sister of TOP. (21) [CP [ [TP (kepada) Tuan Pengerusi] [CP [Spec [IP [Spec [VP [V dicadangkan oleh mereka ] [Prep ti] [DP supaya membubarkan persidangan itu]]]] (To the Chairman was proposed that he should dissolve the proceeding.)
Now, we proceed to the structural analysis of sentence (3). From the viewpoint of the selectional restriction on the postverbal element, (3) shows contrast with (1). The following pair shows the difference between them. (22) Dia mengajak penjaga pintu mencuri keris sakti sultan itu. (He spurred the janitor to steal the holy sword of the sultan.) (23) *Dia mengajak keris sakti sultan itu dicuri penjaga pintu. (*He spurred the holy sword of the sultan to be stolen by the janitor.) (24) Polis mengesahkan penjaga pintu mencuri keris sakti sultan itu. (The police have confirmed that the janitor stole the holy sword of the sultan.) (25) Polis mengesahkan keris sakti sultan itu dicuri penjaga pintu. (The police have confirmed that the holy sword was stolen by the janitor.)
The verb mengajak requires an element with +human features after itself; this selectional restriction excludes (23) as untenable. By contrast, mengesahkan has no such restriction on the postverbal element; this is attested by (24) and (25). Unlike mengajak, mengesahkan does not require a +human-featured element after itself. From the fact that there is no difference in cognitive meaning between (24) and (25), it can be said that mengesahkan takes a sentential object. Therefore, the postverbal element in (3) (gigi palsunya) is an embedded sentence subject. On this point, there lies a difference between (1) on the one hand, and (2) and (3) on the other. In sentence (1), the postverbal element is a part of the main sentence. In (2) and (3), the postverbal element is not a part of the main sentence; it is part of the embedded sentence. This difference has something to do with the cliticizability of the postverbal element. The following sentences show that the element outside the main clause naturally resists being cliticized.

(26) Dia mengajaknya mencuri keris sakti sultan itu. (He invited him to steal the sacred sword of the sultan.)
(27) *Mereka mencadangkannya mencuri keris sakti sultan itu. (*They proposed to him to steal the sacred sword of the sultan.)
(28) *Polis mengesahkannya mencuri keris sakti sultan itu. (*The police have confirmed him to have stolen the sword.)
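The diagnostic reasoning so far can be summarized as a small data structure: each verb is recorded with the clause to which its postverbal element belongs and with the acceptability of the clitic -nya reported in (26) to (28), and the stated generalization is then checked against those judgments. This is a minimal sketch for illustration; the class, field, and function names are hypothetical and not part of the original analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerbProfile:
    verb: str
    postverbal_in: str   # "main clause" or "embedded clause", per the analysis of (1)-(3)
    clitic_nya_ok: bool  # acceptability judgment reported in (26)-(28)


PROFILES = [
    VerbProfile("mengajak", "main clause", True),
    VerbProfile("mencadangkan", "embedded clause", False),
    VerbProfile("mengesahkan", "embedded clause", False),
]


def predicts_clitic(profile):
    """Generalization in the text: an element outside the main clause
    resists cliticization, so only main-clause elements cliticize."""
    return profile.postverbal_in == "main clause"


# Verbs whose reported judgment disagrees with the generalization.
mismatches = [p.verb for p in PROFILES if predicts_clitic(p) != p.clitic_nya_ok]
print(mismatches)
```

An empty list of mismatches simply restates that, for these three verbs, the cliticization judgments line up with the main-clause versus embedded-clause distinction.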
In the following sentence, the sentence initial element (keris sakti sultan itu) is extracted from the subject position of the embedded sentence. (29) Keris sakti sultan itu disahkan dicuri penjaga pintu itu.
Only the verbs that have EXE filled in the subject position allow an element to be extracted from the embedded subject position. Disahkan is one such verb. The source sentence of (29) is (30). (30) EXE disahkan keris sakti sultan itu dicuri penjaga pintu itu. (It was confirmed that the sacred sword of the sultan was stolen.)
This group of verbs includes dimaklumkan, difahamkan, dianggap, diharapkan, etc. From what we have observed, it is now clear that constructions that are superficially the same must be differentiated from each other. In all these sentences, we find the appearance of the zero form complementizer. In the following sentences, we will show where the zero form complementizer appears in each sentence. (31) Dia mengajak saya [cp φ menziarahi datuknya di kampung]. (32) Mereka mencadangkan [cp φ Tuan Pengerusi membubarkan persidangan itu]. (33) Doktor forensek mengesahkan [cp φ gigi palsunya tersangkut di kerongkongnya].
In all the above examples, the zero complementizers (φ) can be replaced with overt complementizers. The zero complementizers in sentences (31), (32), and (33) can be replaced with untuk, supaya, and bahawa, respectively. We have observed that in Malay, there are four complementizers that appear after transitive verbs: bahawa, supaya, untuk, and zero. To these, we can add another complementizer, agar. In almost all situations, agar and supaya are interchangeable.

2. Structural differences reflected by untuk and supaya

A comparison of the following sentences makes it easy to grasp what the structural difference between untuk and supaya reflects.

(34) Dia mengajak saya untuk menziarahi datuknya di kampong.
(35) *Dia mengajak saya supaya menziarahi datuknya di kampong.
(36) Dia mencadangkan saya supaya menziarahi datuknya di kampong.
(37) *Dia mencadangkan saya untuk menziarahi datuknya di kampong.
In (34), the object (saya) and untuk complement sentences constitute a unit, which has a nexus relation. The lack of one component renders the sentence meaningless as is attested by the ungrammaticality of (38). (38) *Dia mengajak untuk menziarahi datuknya di kampong.
By contrast, in (36), saya in the oblique case and the supaya complement are independent and do not constitute a unit. Therefore, the lack of the oblique-cased noun phrase does not impair its grammaticality. If we delete saya, the sentence is still interpretable. Consider the following example.

(39) Dia mencadangkan supaya menziarahi datuknya di kampong. (He proposed visiting his grandpa in the country.)
In (39), the supaya complement sentence functions as the sentence object. The untuk complement sentence in (34) can be compared to the to-infinitive in the English sentence “He told me to do it immediately.” In that sentence too, the direct object me and the to-complement sentence constitute a unit. On the other hand, supaya in (36) can be compared to the to-infinitive in the English sentence “He promised me to do it immediately.” As I have stated before, in such a sentence as (34), the object saya and the untuk complement constitute an inseparable unit. With this fact in mind, the gap after the verb in (40) is interpreted as an empty place resulting from deletion, and not as a hole filled with nothing from the outset. In addition, the unfilled hole is interpreted as referring to unspecified people in general.
(40) Kami akan bantu untuk mengatasi kesusahan ini. (We will help tide over this plight.)
As opposed to a supaya complement sentence, an untuk complement sentence does not function as an object. This fact is corroborated by the following sentence, where an untuk complement sentence follows an intransitive verb. Needless to say, intransitive verbs do not require an object after themselves. Therefore, the untuk complement sentence in this case does not constitute an object sentence. (41) “Walaupun Chong Wei kalah dalam empat perlawanan akhir, permainannya sentiasa meningkat dan dia perlu lebih berusaha untuk mencapai kedudukan yang sama dengan Lin Dan (Although Chong Wei lost in the final four games, he usually plays better trying to go along with Lin Dan.)
Other than berusaha (contrive), predicates such as berjaya (succeed), sedia (ready), and mampu (afford) can be used before untuk complement sentences.
3. Sentence patterns based on selectional restrictions of untuk and supaya

In Chapter 1, we differentiated four sentence patterns based on three criteria, i.e., whether untuk can be inserted, whether supaya can be inserted, and whether kepada can be inserted. Below, we will provide exponents of each sentence pattern. Some examples with overt complementizers are paired with those with zero complementizers.

A. S+VT+[cp supaya ・・・]/[cp bahawa ・・・]
(42) Mereka mencadangkan [cp φ Tuan Pengerusi membubarkan persidangan itu].
(43) Mereka mencadangkan [cp supaya Tuan Pengerusi membubarkan persidangan itu].
(44) Doktor forensek mengesahkan [cp φ gigi palsunya tersangkut di kerongkongnya].
(45) Doktor forensek mengesahkan [cp bahawa gigi palsunya tersangkut di kerongkongnya].

B. S+VT+O+[cp untuk ・・・]
(46) Dia mengajak saya [cp φ menziarahi datuknya di kampung].
(47) Dia mengajak saya [cp untuk menziarahi datuknya di kampong].
In this pattern, NP after the VT is in object case.

C. S+VT+O+[cp supaya ・・・]
(48) Mereka mencadangkan Tuan Pengerusi [cp supaya membubarkan persidangan itu].
In this pattern, NP after VT is in object case, which is originally an oblique case, i.e., kepada NP.

D. S+VT+kepada O+[cp supaya ・・・]
(49) Mereka mencadangkan kepada Tuan Pengerusi supaya membubarkan persidangan itu.

One further possibility, the following pattern, is rejected.
(50) *Mereka mencadangkan kepada Tuan Pengerusi [φ membubarkan persidangan itu].
The ungrammatical status of (50) reveals the existence of the restriction that if NP after VT is in oblique case, C after the oblique case must be filled with an overt complementizer. In the sentence pattern of the C class, NP after VT is originally an NP in oblique case. This NP can be headed with the preposition kepada, which turns the sentence into the D pattern. However, in the following sentence, NP after VT cannot appear headed with kepada.
(51) *Beliau mengarahkan kepada Pergerakan Pemuda dan Puteri Umno supaya mempergiatkan usaha menarik ahli baru.
As is shown by sentence (49), in many cases of the VT+NP+supaya construction, kepada can be inserted before NP. Based on the hypothesis that a supaya complement sentence functions as a noun clause and does not
constitute a unit with NP before itself, the status of NP before supaya in sentence (51) is puzzling. Unlike the case of mencadangkan, kepada cannot be inserted before Pergerakan Pemuda dan Puteri Umno. A supaya complement sentence with NP before itself does not constitute a unit. From this, it can be said that NP (in this case, Pergerakan Pemuda dan Puteri Umno) is not in objective case. The reasonable solution about the status of NP before supaya is to posit the NP inside the supaya complement sentence, and the residing place is Spec of CP. Spec of CP is Ā position that is exempt from case assignment. (52) Selain itu, beliau mengarahkan Pergerakan Pemuda dan Puteri Umno supaya mempergiatkan usaha menarik ahli baru kerana sasaran utama parti ialah golongan muda. (Other than this, he directed Umno’s Youth Movement to invite new members to join the party because the party’s main target is the young.) (Berita Harian Online 11/9/2006)
The underlined part of (52) is schematized as (53). (53) Beliau mengarahkan [CP Pergerakan Pemuda dan Puteri Umno [c’ supaya [IP mempergiatkan usaha menarik ahli baru ]]]
The sentence pattern with the configuration shown in (53) is classified as E. E. S+VT+[CP NP [c’ supaya・・・]] (54) Dia mengarahkan Tuan Pengerusi supaya membubarkan persidangan itu.
The verbs that form E pattern constructions include menyeru. An example of this is (55).
(55) Justeru, mengambil perintah Allah yang menyeru hambanya supaya menimba ilmu menerusi surah Al-Alaq, siri Ilmuwan Islam kali ini menampilkan tokoh ilmuwan yang menimba pengetahuan daripada al-Quran dan diterjemah dalam hasil kerja mereka. (Following Allah’s order to absorb knowledge through the chapter Al-Alaq, the series of Islamic scholars this time will deal with scholars who get knowledge from al-Quran and express it in their works.) (Berita Harian Online 8/9/2006)
4. Characteristics of each pattern

In Chapter 3, we contended that the constructions formed by transitive verbs with the four complementizers (agar, supaya, untuk, and the zero form φ) form five patterns. In this chapter, the characteristics that accompany each of the five patterns we have observed will be revealed. In Chapter 5, cases that show a wavering in judgment about the selection of complementizers will be dealt with. The main aim of this paper is to clarify what causes vacillation in the choice of complementizers. When attention is restricted to only a few of the characteristics accompanying a pattern, that pattern can misleadingly appear to share qualities with another pattern; this constitutes one of the reasons that there is confusion
about the choice of complementizers. In considering this kind of problem, we should observe the syntactic behaviors that each pattern shows from as broad a perspective as possible. Adopting this broader perspective enables us to clearly see the differences as well as similarities among the patterns. At the same time, we should decide which characteristics are relatively important in classifying a particular verb under an appropriate heading. What we observe in this chapter will prepare a basis for judging the heading under which a certain transitive verb should be classified. We will see what syntactic behaviors a certain verb indicates when the diagnostic criteria mentioned later are applied to it. We will find that some of the characteristics are correlated with each other. The corpus we used for selecting characteristics specific to each pattern is a compilation of headline and national news over six months from Berita Harian Online News. To supplement these data, we also retrieved news from Berita Harian Online News for the last two months, i.e., August and September 2006. The diagnostic criteria, which show certain syntactic behaviors when applied to verbs, are as follows:
I. Whether there is a selectional restriction imposed on the postverb argument
II. Whether a passive interpretation can be made with the sequence of the postverb argument and what follows it
III. Whether kepada can be inserted in front of the postverb argument
IV. Whether untuk can be inserted after the postverb argument
V. Whether supaya can be inserted after the postverb argument
VI. Whether supaya can be inserted after the verb
In the remainder of this chapter, the syntactic behaviors shown by each pattern when the criteria are applied to its verbs will be revealed, in the order in which the six criteria are listed above.
In the following tables, asterisks attached to some transitive verbs signify that the asterisked verbs show vacillation; in the corpus, the verbs take both supaya and untuk.

I. Whether there is selectional restriction imposed on the postverb argument
mengajak         +
*mempelawa       +
*meminta         -
*memujuk         +
mencadangkan     -
*mengarahkan     -
memerintahkan    -
*menggesa        -
memaksa          -
menyuruh         +
merayu           +

II. Whether a passive interpretation can be made with the postverb argument as a subject and what follows as a predicate
mengajak         -
*mempelawa       -
*meminta         +
*memujuk         -
mencadangkan     +
*mengarahkan     +
memerintahkan    +
*menggesa        +
memaksa          +
menyuruh         -
merayu           -

N.B. In case the postverb argument is followed by an overt complementizer (in more concrete terms, agar or supaya), what is relevant is the discontinuous sequence of the postverb argument as a subject and what follows the overt complementizer as a predicate.

III. Whether kepada can be inserted in front of the postverb argument
mengajak         -
*mempelawa       -
*meminta         -
*memujuk         -
mencadangkan     +
*mengarahkan     -
memerintahkan    -
*menggesa        -
memaksa          -
menyuruh         -
merayu           +

IV. Whether untuk can be inserted after the postverb argument
mengajak         +
*mempelawa       -
*meminta         +
*memujuk         +
mencadangkan     -
*mengarahkan     +
memerintahkan    +
*menggesa        +
memaksa          +
menyuruh         +
merayu           -

V. Whether supaya can be inserted after the postverb argument
mengajak         -
*mempelawa       -
*meminta         +
*memujuk         +
mencadangkan     +
*mengarahkan     +
memerintahkan    +
*menggesa        +
memaksa          +
menyuruh         -
merayu           +

VI. Whether supaya can be inserted after the verb
mengajak         -
*mempelawa       -
*meminta         +
*memujuk         -
mencadangkan     +
*mengarahkan     +
memerintahkan    +
*menggesa        +
memaksa          -
menyuruh         -
merayu           +
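The six tables above can also be read as a single feature matrix. The following is a minimal Python sketch, not part of the original study, that transcribes the tabulated values and lists the criteria on which a given verb departs from mengajak. The names CRITERIA, TABLE, VACILLATING, and differing_criteria are illustrative assumptions.

```python
# Values transcribed from tables I-VI above; each string gives the
# "+" / "-" judgments for criteria I, II, III, IV, V, VI in that order.
CRITERIA = ["I", "II", "III", "IV", "V", "VI"]

TABLE = {
    "mengajak":      "+--+--",
    "mempelawa":     "+-----",
    "meminta":       "-+-+++",
    "memujuk":       "+--++-",
    "mencadangkan":  "-++-++",
    "mengarahkan":   "-+-+++",
    "memerintahkan": "-+-+++",
    "menggesa":      "-+-+++",
    "memaksa":       "-+-++-",
    "menyuruh":      "+--+--",
    "merayu":        "+-+-++",
}

# Verbs marked with an asterisk in the tables (attested with both
# supaya and untuk in the corpus).
VACILLATING = {"mempelawa", "meminta", "memujuk", "mengarahkan", "menggesa"}


def differing_criteria(verb, reference="mengajak"):
    """Return the criteria on which `verb` and `reference` disagree."""
    return [
        name
        for name, a, b in zip(CRITERIA, TABLE[verb], TABLE[reference])
        if a != b
    ]


if __name__ == "__main__":
    for verb in sorted(VACILLATING):
        print(verb, differing_criteria(verb))
```

Under this transcription, meminta and menggesa differ from mengajak on criteria I, II, V, and VI, whereas memujuk differs only on criterion V.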
5. Seeking a solution for explaining characteristics deviating from the canonical ones

Observing the tabulated data in the preceding chapter reveals that some verbs show characteristics that deviate from those of canonical verbs. In some cases, a certain transitive verb has some irreconcilable characteristics; in others, a certain transitive verb shows qualities that are common to one pattern but, at the same time, has characteristics in common with yet another pattern. In the previous research mentioned above, i.e., Shoho (1999), the author pointed out some irregularities that should not have appeared and that, in particular, caused wavering in judgment on the selection of complementizers. However, we did not proceed further to find a solution for explaining these irregularities. This chapter aims at solving the irregularities that do not conform with the canonical pattern, including those that remained unsolved or untouched by the previous research. Using news compiled over a period of six months from Berita Harian Online News as a corpus for this research, we discovered many new facts that were veiled or unclear in the previous research. For example, we have become aware of a new pattern of
arrangement of words in complement sentences. In the complement sentence after meminta, we find an inverted passive form, while the usual passive forms are observed more frequently. (56) “Justeru, GMGBM mengemukakan memorandum kepada Menteri Pelajaran, baru-baru ini, antara lain meminta dilakukan penyelarasan bagi menjaga kepentingan pengurusan di sekolah rendah,” katanya. (That’s why GMGBM has recently presented a memorandum to the Education Minister; among other things, it asks him to coordinate for the interests of elementary school management.) (Berita Harian Online 31/8/2006).
In the six tables in the previous chapter, the asterisked item indicates that the transitive verb shows irregularities in that the verbs take both untuk and supaya complementizers. In the remainder of this chapter, we will deal with these irregularities to seek a solution that can explain the cause of confusion in selecting complementizers. Firstly, we will consider the irregularities shown by mempelawa. In both the corpora we use, there seems to be no case of supaya appearing after mempelawa. However, in the following sentence we can see that the supaya complementizer appears after the passive form of dipelawa. How can we explain the use of supaya after the passive form of mempelawa? (57) Peter Chin berkata, semua produk dan teknologi baru itu sudah siap dan bersedia untuk dikomersialkan. Orang ramai terutama usahawan yang berminat dengan produk dan teknologi itu dipelawa supaya datang sendiri ke MPOB dan membuat perbincangan dengan penyelidik terbabit. (Peter Chin said that all the products and technology for it are now ready to be commercialized. The public, among other businessmen, who are interested this product and the technology are kindly invited to come to MPOB by themselves for discussions with the researcher.) (Berita Harian Online 5/8/2006)
Considering the paradigms shown by mengajak, which is in the same group as mempelawa, the two verbs differ only on this point. In addition, the use of supaya after diajak (the passive form of mengajak) is not accepted, as is shown in sentence (58).
(58) *Mereka diajak supaya datang sendiri ke MPOB. (They were invited to come to MPOB by themselves.)
We cannot find any explanation for allowing dipelawa to be followed by supaya. It is only natural that both verbs should agree in terms of all the points. After considering all these points, we can safely say that this use of supaya after dipelawa is not appropriate. This form is among the things that should be disregarded.
Secondly, we will consider the irregularities regarding meminta. Meminta requires supaya after the postverb argument as is shown in sentence (59).
(59) Pemimpin Buruh, Kim Beazley berkata, meminta pendatang supaya mereka akan menghormati perbezaan agama dan pandangan politik dan juga wanita boleh membantu pemimpin Islam menangani kumpulan pelampau. – AFP (Labor leader Kim Beazley said that requiring visitors to profess they will respect the country’s religion, politics, and women will help the Islamic leaders cope with extremists.) (Berita Harian Online 12/9/2006)
However, in the following sentence, we find the use of untuk after the passive form of diminta. (60) Mereka akan diminta untuk membawa suami untuk diperiksa jenis darah mereka. Jika suami. (They will be required to carry their husband to be examined for his blood type.) (Berita Harian Online 15/1/2006)
The tables in the preceding chapter indicate that meminta is different from mengajak with regard to four points concerning (1) selectional restriction on the postverb argument, (2) the possibility of a passive interpretation with the postverb argument as a subject and what follows as a predicate, (3) the possibility of inserting supaya after the postverb argument, and (4) the possibility of inserting supaya after the verb. From this fact, it can be concluded that the postverb argument in (59) does not constitute a unit in a nexus relation with a supaya complement sentence. The structure of (59) is as shown in the E pattern, with pendatang located in the Spec of the supaya complement sentence. This leads us to contend that the appropriate complementizer after the postverb argument is supaya. The existence of untuk after the postverb argument can be explained by an erroneous analogy with pattern A; the mengajak construction is one example of this. In this case, we should avoid using untuk in sentence (60).
Thirdly, we will consider the question of complementizer selection after memujuk. We have observed that all the paradigms with the exception of V conform with those of mengajak; this implies that the postverb argument constitutes a nexus relation with the following complement sentence. This leads us to conclude that the use of the untuk complementizer is appropriate as in sentence (62), while the use of the supaya complementizer must be avoided.
(61) “Oleh itu, kita akan berusaha memujuk mereka secara perlahan-lahan supaya dapat membayar tunggakan hutang masing-masing kepada Mara,” katanya. (“Accordingly, we will resort to peaceful means to persuade them to pay their overdue debt to MARA,” he said.) (Berita Harian Online 3/9/2006)
(62) HASIL kejayaannya memujuk seorang lelaki Malaysia yang disyaki terbabit dalam kes pembunuhan seorang wanita Vietnam di Kent, England, untuk menyerah diri, Asisten Superintendan Parusuraman Subramaniam menerima penghargaan Ketua Polis Negara. (He has successfully persuaded a Malaysian man suspected to be involved in the case of murder of a Vietnamese woman in Kent, England, to turn himself in. For this heroic deed, Assistant Superintendent Parusuraman Subramaniam has been commended by Chief of Police of the state.) (Berita Harian Online 5/8/2006)
Fourthly, we will deal with the complementizer selection of mengarahkan. The case of mengarahkan is in sharp contrast with that of memujuk. As opposed to memujuk, the paradigms of mengarahkan do not conform with those of mengajak with regard to the six diagnostic criteria; this leads us to conclude that the selection of the untuk complementizer is not appropriate as in sentence (63). (63) “Selepas dikategorikan sebagai projek terbengkalai, KPKT sama ada akan melantik pemaju baru atau mengarahkan pemaju yang sama untuk meneruskan projek,” katanya. (After classification under the setback project, KPKT will appoint a new land promoter or direct the same land promoter to continue the project.) (Berita Harian Online 29/8/2006)
Lastly, we will consider the problem of complementizer selection of menggesa. The same can be said about the case of menggesa as that of meminta: the paradigms of menggesa do not conform with those of mengajak with regard to four diagnostic criteria. From this, it can be concluded that the postverb argument in (64) does not constitute a unit in a nexus relation with a supaya complement sentence. The structure of (64) is as shown in the E pattern, with pengusaha ladang sawit located in the Spec of the supaya complement sentence. This leads us to contend that the appropriate complementizer after the postverb argument is supaya. As in the case of meminta, an erroneous analogy with pattern A produces an inappropriate use of untuk as shown in sentence (65).
(64) Kerajaan menggesa pengusaha ladang sawit supaya meningkatkan pelaburan bagi meningkatkan produktiviti ekoran unjuran peningkatan ke atas permintaan minyak sayuran sebanyak 169 juta tan menjelang 2020. (The government urged oil palm planters to boost investment for the attainment of greater productivity because the demand for vegetable oil is predicted to increase to as much as 169,000,000 tons by 2020.) (Berita Harian Online 21/9/2006)
(65) Beliau menggesa masyarakat Islam Amerika untuk mencabar imej salah terhadap Islam yang digambarkan oleh media dan ahli politik supaya dasar luar yang lebih seimbang dapat dicapai. – AFP (He urged the Islamic society in
America to defy the mistaken image of Islam produced by the mass media and politicians for the attainment of more balanced foreign policies.) (Berita Harian Online 3/9/2006)
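The notion of vacillation used in this chapter, a verb attested with both supaya (or agar) and untuk, can be approximated by a simple corpus scan. The sketch below is an illustration only and is not the procedure used for this study: it records the first overt complementizer found after each verb form in a sentence, which is a rough heuristic. The sample sentences are shortened versions of (47), (64), and (65), and all function names are hypothetical.

```python
import re
from collections import defaultdict

# Overt complementizers discussed in the text.
COMPLEMENTIZERS = {"supaya", "agar", "untuk", "bahawa"}


def first_complementizer_after(verb, sentence):
    """Return the first overt complementizer that follows `verb`, or None."""
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", sentence.lower())
    if verb not in tokens:
        return None
    for token in tokens[tokens.index(verb) + 1:]:
        if token in COMPLEMENTIZERS:
            return token
    return None


def attested_complementizers(verbs, sentences):
    """Map each verb to the set of complementizers found after it."""
    found = defaultdict(set)
    for sentence in sentences:
        for verb in verbs:
            comp = first_complementizer_after(verb, sentence)
            if comp is not None:
                found[verb].add(comp)
    return dict(found)


def vacillating(found):
    """Verbs attested with untuk and also with supaya or agar."""
    return sorted(
        verb
        for verb, comps in found.items()
        if "untuk" in comps and comps & {"supaya", "agar"}
    )


# Shortened versions of examples (47), (64), and (65).
SAMPLE = [
    "Dia mengajak saya untuk menziarahi datuknya di kampong.",
    "Kerajaan menggesa pengusaha ladang sawit supaya meningkatkan pelaburan.",
    "Beliau menggesa masyarakat Islam Amerika untuk mencabar imej salah terhadap Islam.",
]

if __name__ == "__main__":
    found = attested_complementizers(["mengajak", "menggesa"], SAMPLE)
    print(vacillating(found))
```

A scan of this kind only produces candidates. It does not recognise passive forms such as diminta or dipelawa, and it cannot tell a complement clause from a later purpose clause, so each hit would still need to be checked by hand against the six diagnostic criteria.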
Bibliography

Nik Safiah Karim. 1978. Bahasa Malaysia Syntax. Kuala Lumpur: Dewan Bahasa dan Pustaka.
Postal, Paul M. 1974. On Raising. Cambridge, Mass./London: M.I.T. Press.
Rosenbaum, Peter S. 1972. The Grammar of English Predicate Complement Constructions. Cambridge, Mass.: M.I.T. Press.
Sanat Md. Nasir. 1987. Ayat Komplemen Bahasa Malaysia. Kuala Lumpur: Dewan Bahasa dan Pustaka.
Shoho, Isamu. 1999. “Penggolongan Ayat Komplemen Frasa Kata Kerja” in Journal of the Institute Language Research. Tokyo: Tokyo University of Foreign Studies.
Voice in Relative Clauses in Malay
— A Comparison of Written and Spoken Language* —

Hiroki NOMOTO and Isamu SHOHO

1. Introduction

Voice has been one of the favourite topics of study for many Malay/Indonesian linguists1. This is because the voice system of Malay, like those of many other Austronesian languages, is more than a simple bipolar opposition between the active and the passive. For this and another reason, which we will discuss later in section 5.3, some typologists have also been intrigued by Malay voice. In short, Malay voice is peculiar enough to warrant many people’s attention. The aim of this paper is to show that it is peculiar indeed, and that, at the same time, it is in fact not peculiar at all. These two statements are obviously contradictory. However, they are both true. The clue to arriving at such a conclusion lies in a clear distinction between the written and the spoken language. We refer to the two varieties of Malay as ‘Written Malay’ and ‘Colloquial Malay’ respectively. In some languages, the written and spoken varieties do not exhibit significant differences; this is the case, for instance, with modern Japanese and modern European languages like English. On the other hand, there are languages whose written and spoken varieties differ to such an extent that they can even be regarded as two distinct languages. This situation is known as diglossia (Ferguson 1959). Speakers of the first type of languages may

* The following abbreviations are used in this paper. COMP: complementiser; INT: interjection; NP: noun phrase; PERF: perfect; PROG: progressive; t: trace.
1 Indonesian, the national language of Indonesia, is linguistically (but not politically) a dialect of Malay. There are a number of differences between Indonesian and the dialect of Malay discussed here, i.e. Standard (but not standardised!) Malay of Malaysia. However, the conspicuous differences are mostly phonological (e.g. phonological rules, prosody, etc.) or lexical (e.g. lexical items, lexical meanings, etc.). As far as sentential syntax and semantics are concerned, the two dialects are almost identical. Therefore, when discussing the sentence-level grammar of Malaysian Malay, one can and needs to consult previous studies on Indonesian as well, always bearing in mind, of course, the possibility of subtle differences. As the present paper pertains to the core part of sentential grammar, we will assume that what has been reported on Indonesian is basically true of Malaysian Malay, at least for the written language.
quite understandably find it difficult to imagine what diglossia is like unless they have an immediate experience of it. This phenomenon is typically found in non-Western contexts. Among the well-known examples are Arabic and Tamil. Japanese belonged to this type before the Genbun-itchi movement or the harmonisation of written and spoken language in the late nineteenth century (cf. Twine 1978 among others). One of the authors of this paper has claimed that Written Malay and Colloquial Malay are the two distinct varieties in diglossia (Nomoto & Tsuji 2006). If this is indeed the case, it is not surprising that the two varieties have different voice systems. It is in this sense that both the contradictory statements mentioned earlier can be simultaneously true. We will demonstrate that the voice system of Written Malay is indeed peculiar, whereas that of Colloquial Malay is commonplace. Here, it should be noted that the hypothesis that Written Malay and Colloquial Malay are two distinct varieties of diglossia has, to our knowledge, scarcely been proved in a scientific manner2. This paper provides a piece of evidence in support of this hypothesis. We do not presume that the diglossia hypothesis is correct. Rather, we will first purely examine the data from the written and spoken corpora separately and then compare the results. If the results of the examination into the two types of corpora exhibit any significantly striking differences, it would count as evidence in support of the diglossia hypothesis. This is the course of argumentation that we intend to pursue. Before proceeding to the main discussion, it might be necessary to briefly comment on the two varieties of Malay. The opposition of Written Malay and Colloquial Malay is based on the degree of formality, with the former being more formal than the latter. The names ‘Written’ and ‘Colloquial’ merely indicate the types of communication in which they are primarily used. 
Alternatively, they can also be referred to as ‘Formal/High Malay’ and ‘Informal/Low Malay’ respectively. Needless to say, this division is an idealisation. It is not always easily identifiable which variety a particular instance of actual language use belongs to since the two varieties are often mixed in varying proportions. In other words, the reality is that language use falls on a continuum between the two varieties. Nevertheless, we attempt to understand it as a result of the interplay between the two idealised discrete systems, which is a typical scientific approach known as reductionism. The remainder of this paper is organised as follows. Section 2 provides

2 Noriah (2006) describes the linguistic situation in Malaysia as polyglossia, a kind of extended diglossia. However, her discussion centres only on the social facts (e.g. legal stipulations regarding Malay and English and the religious role played by Arabic) and no evidence is presented that is based on the language itself.
a brief introduction to the voice system of Malay. Section 3 explains the methodology of the study. The results are presented and analysed in section 4. Finally, section 5 concludes the paper. Some parts of what follows overlap Nomoto (forthcoming), which is the preliminary study to this paper.

2. Malay voice system

According to the previous studies, which are mostly based on the written language, voices in Malay can be classified into the following four categories: morphological active, morphological passive, bare active, and bare passive3.
(1)
a. Morphological active
   Dia sudah mem-baca buku itu.
   she PERF MEN-read book that
   ‘She has already read the book.’
b. Morphological passive
   Buku itu sudah di-baca (oleh)-nya.
   book that PERF DI-read (by)-her
   ‘The book was already read by her.’
c. Bare active
   Dia sudah baca buku itu.
   she PERF read book that
   ‘She has already read the book.’
d. Bare passive
   Buku itu sudah dia baca.
   book that PERF she read
   ‘She has already read the book./The book, she has already read.’4
The morphological active and passive are characterised by the prefixes meN- and di- respectively. The bare active and passive differ in terms of word order, specifically that of the agent and the auxiliary/negation/adverb. The agent precedes the auxiliary/negation/adverb in the bare active form, while no element can intervene between the agent and the verb stem in the bare passive form. The position of the theme argument is not relevant in distinguishing the passives from the actives since it can also appear after the verb. Thus, (1b) and (1d) allow the following variants respectively:

3 These terminologies are from Voskuil (2000). A variety of terminologies are used in the literature to refer to one or more of the four voices discussed here. For example, the morphological passive and the bare passive are often referred to as passive 1 (P1) and passive 2 (P2) respectively. See Nomoto (forthcoming, section 2) for details and a discussion of them.
4 Note that this Malay construction is not an instance of topicalisation but syntactically passive. See Chung (1978).
dibacanya buku itu and sudah dia baca buku itu.

3. Methodology

This section explains the methodology of the study. Section 3.1 describes the corpora that we used, and section 3.2 describes our examination of these corpora.

3.1. Corpora

This study uses two different types of corpora that were both built by us. One is based on written materials and the other on spoken conversations. Recall that the division between Written Malay and Colloquial Malay is an idealisation and that actual language use typically contains both varieties. Therefore, although we refer to these corpora as the corpus of Written Malay and that of Colloquial Malay, this is purely to facilitate exposition and does not imply that the corpus consists exclusively of Written Malay or Colloquial Malay. The corpus of Written Malay consists of 16 short stories (cerpen) that appeared in the monthly magazine Dewan Masyarakat between January 2005 and May 2006. The total word count of the corpus is 23,605. The list of the 16 short stories is provided in the Appendix. Since short stories usually contain dialogues, we separated them from the remaining narrative parts while conducting the examination described below. We did this because we expected that the dialogues would exhibit a pattern identical to that in Colloquial Malay. However, contrary to our expectation, the result exhibited a pattern similar to the narrative instead. In other words, the dialogues in the short stories do not entirely reflect the real spoken conversations. While they do contain some expressions that are characteristic of spoken conversations, they do not conform to their grammatical patterns. Therefore, in what follows, we will not distinguish between the dialogues and the narrative. The corpus of Colloquial Malay consists of 32 sessions of casual dialogues by 20 students from Universiti Kebangsaan Malaysia5. The total recording time is approximately 30.5 hours, and 22 dialogues have been transcribed. The total word count is 172,855.
There are four major categories of dialogues based on two parameters: (i) the degree to which the subject of the conversation is controlled (free or partly controlled) and (ii) the manner in which the conversation takes place (face-to-face or over the telephone). In

5 We built this corpus as a part of a research project at the Tokyo University of Foreign Studies (21st Century Centre of Excellence Programme: Usage-Based Linguistic Informatics). The official name of this corpus is Multilingual Corpora (Malay). Visit the following website for more information about the corpus: http://www.coelang.tufs.ac.jp/multilingual_corpus/ms/ (accessed 18/10/2006).
free conversations, the participants were simply provided with a general topic. On the other hand, in partly controlled conversations, the participants were provided with a general topic and several particulars that had to be mentioned during the conversation. The topics were mostly non-technical routine issues with a special focus on Malaysian and Japanese culture and society. These corpora are neither large nor representative. Largeness and representativeness are generally considered as two criteria of good corpora (Sinclair 2005). The small size of our corpora is admittedly a disadvantage, but not one that we believe is fatal. With regard to representativeness, we, as reductionists, intend that our corpora not be representative in the usual sense. Being representative would in fact be detrimental to our purpose here, because what we wish to investigate is not the broad picture of language use but the extreme cases in the written and spoken domains. We want two corpora that are maximally biased, one towards the Written Malay extreme and the other towards the Colloquial Malay extreme of the continuum. For example, a recording of an entire day’s television broadcast does not make a good corpus for our purposes because it is too varied, ranging from material at the Colloquial Malay extreme (e.g. reality shows) to that at the Written Malay extreme (e.g. news). Although the problem of what best represents the extremes is difficult to gauge, short stories and casual conversations are probably not beside the mark6.

3.2. Design of examination

This section explains how the examination of the above-mentioned corpora was conducted. We scoured the corpus for relative clauses with the complementiser yang as well as a gap in which either the external or internal argument (i.e. underlying subject and object respectively) of a transitive verb is relativised.
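The first retrieval step, locating every occurrence of the complementiser yang in raw text, can be sketched in a few lines of Python. This is a hypothetical illustration under our own assumptions and not the tool used for this study: the function name yang_contexts and the window size are invented for the example, and the sample string is a sentence from the Colloquial Malay corpus.

```python
import re


def yang_contexts(text, width=5):
    """Return (left, right) token windows around every token 'yang'."""
    tokens = re.findall(r"[\w-]+", text)
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == "yang":
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, right))
    return hits


if __name__ == "__main__":
    sample = "Cakap Kelantan? Apa yang nak cakap?"
    for left, right in yang_contexts(sample):
        print(" ".join(left), "[yang]", " ".join(right))
```

A search of this kind returns candidates only. Deciding whether a hit contains a gap for the argument of a transitive verb, and which voice the clause is in, remains a manual step.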
For ease of exposition, in what follows, we refer to the relativisation of the external and internal arguments as ‘subject relativisation’ and ‘object relativisation’ respectively. Relative clauses in which the non-relativised argument is null are excluded. Such cases occur mainly in Colloquial Malay and are not numerous. The example below is from the
6 One might wonder if newspaper articles can also serve our purpose. In fact, we also examined newspaper articles because we had built a corpus of newspaper articles earlier. However, as we examined the data, it became evident that they were not suitable for our purpose because almost all the sentences had a third person agent, reflecting the objective style of description that is typical of journalism. This is problematic because bare passives are so rare with third person agents that some grammarians regard bare passive sentences with third person agents as ungrammatical.
Hiroki NOMOTO and Isamu SHOHO
corpus of Colloquial Malay. The null argument is indicated by pro.
(2) Cakap Kelantan? Apa yang [pro nak cakap t]?7
    speak Kelantan what COMP want speak
    ‘Speak Kelantanese? What shall I say?’
However, we do not apply this exclusion to morphological passive sentences, because their agents are left unexpressed far more frequently than those of sentences in the other voices.
We examine relative clauses but not main clauses for two reasons. First, it has been believed, mistakenly, that only the surface subject is accessible to relativisation in Malay (Yeoh 1979: chapter 4; Comrie 1981: 150); on this view, NPs in other grammatical relations must somehow be changed to the subject in order to be accessible to relativisation8. The voice phenomenon is most easily observable in relative clauses because such changes of grammatical relation are basically achieved by selecting a different voice. The second reason is a constraint imposed by the early stage of development of Malay corpus linguistics. Retrieving relative clauses as specified above from raw texts is feasible, whereas distinguishing between the different types of clauses is not. The former can be achieved with the simple search function of a text editor. The latter would require a parser. However, although according to Knowles & Zuraidah (2006) there appears to be an automated part-of-speech tagger for Malay, it is not available to us at this time9, and there is no parser yet.
According to previous descriptions of Written Malay, both subject and object relativisation allow more than one option, depending on the selection of voice. There are two options for subject relativisation: morphological active (3a) and bare active (3b).
(3)
a. orang yang [t sudah mem-baca buku itu]
   person COMP PERF MEN-read book that
   ‘the person who has already read the book’
b. orang yang [t sudah baca buku itu]
   person COMP PERF read book that
   ‘the person who has already read the book’
7 This sentence cannot be bare passive (i.e. nak pro cakap) since it is believed that the agent of the bare passive is obligatory and overt.
8 As we will see shortly, the direct object of the bare active is relativisable (Chung 1976; Musgrave 2001; Cole & Hermon 2005; Nomoto 2006a). Hassal (2005) reports that the direct object of the morphological active is also relativisable in some cases.
9 The tagger could have automated the process of picking out the yang relative clauses with transitive verbs, which we did manually. Such automation would be of great help, for more than 2,000 out of approximately 2,500 instances of yang relative clauses in the corpus of Colloquial Malay do not contain transitive verbs (cf. Nomoto, forthcoming).
For object relativisation, there are three options: morphological passive (4a), bare active (4b) and bare passive (4c). (4)
a. buku yang [t’ di-baca t (oleh)-nya]
   book COMP DI-read (by)-her
   ‘the book which was read by her’
b. buku yang [dia sudah baca t]
   book COMP she PERF read
   ‘the book which she has already read’
c. buku yang [t’ sudah dia baca t]
   book COMP PERF she read
   ‘the book which she has already read’
The object of a morphological active sentence cannot be relativised, because the prefix meN- blocks NP movement (cf. Saddy 1991). (5)
*buku yang [dia mem-baca t]
 book COMP she MEN-read
 ‘the book which she read’
Given these options, what needs to be investigated is (i) whether or not all these options are available in both Written Malay and Colloquial Malay, (ii) the frequency of each available voice and (iii) whether there are any significant differences between Written Malay and Colloquial Malay. Hence, our primary task is to classify every instance into one of the above categories. In order to do so, however, a fourth category is required, namely ‘indeterminate’. Recall that the bare active and the bare passive are distinguished by the relative position of the agent and the auxiliary/negation/adverb. This means that if the latter element is absent from a clause, the clause is indeterminate between the bare active and the bare passive. Thus, buku yang dia baca ‘the book which he read’ can be analysed as either bare active (6a) or bare passive (6b). (6)
a. buku yang [dia baca t]
   book COMP he read
b. buku yang [t’ dia baca t]
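The retrieval and classification steps described in this section can be sketched in code. The following Python fragment is a minimal illustration only, not the procedure actually used (the clauses were picked out and classified manually). The function names, the whole-word search pattern and the use of word positions supplied by an annotator are all assumptions made for the sake of the example.

```python
import re
from typing import List, Optional

# Hypothetical search pattern: "yang" as a whole word, in any letter case.
YANG = re.compile(r"\byang\b", re.IGNORECASE)


def find_yang_lines(lines: List[str]) -> List[str]:
    """Return the lines that contain the complementiser 'yang'."""
    return [line for line in lines if YANG.search(line)]


def classify_object_relative(has_di_prefix: bool,
                             agent_pos: Optional[int],
                             marker_pos: Optional[int]) -> str:
    """Classify one object relative clause.

    has_di_prefix: the verb carries the passive prefix di- (annotator's judgement).
    agent_pos:     word position of the overt agent in the clause, or None.
    marker_pos:    word position of the auxiliary/negation/adverb, or None.
    """
    if has_di_prefix:
        # Morphological passives are kept even when the agent is not expressed.
        return "morphological passive"
    if agent_pos is None:
        # Bare clauses with a null agent are excluded from the examination.
        raise ValueError("bare clause with a null agent is excluded")
    if marker_pos is None:
        # Without an auxiliary/negation/adverb the two bare voices look alike.
        return "indeterminate"
    # Agent before the marker: bare active; marker before the agent: bare passive.
    return "bare active" if agent_pos < marker_pos else "bare passive"
```

Applied to the examples above, (4a) comes out as morphological passive, (4b) as bare active, (4c) as bare passive and (6) as indeterminate. Deciding whether a verb carries di- is left to the annotator here because a plain string test on raw text would also match words that merely begin with the letters di.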
4. Results and analyses
This section presents and analyses the results of the corpus examination described in the previous section. Subject relativisation is dealt with in section 4.1 and object relativisation in section 4.2; the latter is more complicated and calls for more discussion.
4.1. Subject relativisation
Table 1 shows the result for subject relativisation.
Table 1. Voice choice in subject relativisation
                   Morphological active   Bare active    Total
Written Malay      107 (97.3%)            3 (2.7%)       110
Colloquial Malay   63 (34.1%)             122 (65.9%)    185
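The percentages in Table 1 can be recomputed from the raw counts. The short Python sketch below does so; the variable and function names are ours and purely illustrative.

```python
# Raw counts copied from Table 1.
TABLE_1 = {
    "Written Malay": {"morphological active": 107, "bare active": 3},
    "Colloquial Malay": {"morphological active": 63, "bare active": 122},
}


def shares(counts):
    """Return each voice's share of the row total, in per cent (one decimal)."""
    total = sum(counts.values())
    return {voice: round(100 * n / total, 1) for voice, n in counts.items()}


for variety, counts in TABLE_1.items():
    print(variety, shares(counts), "total:", sum(counts.values()))
```

For Colloquial Malay, the same counts give a ratio of bare actives to morphological actives of 122 to 63, that is, a little under 2:1.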
Below are some examples from our corpora.
(7) Morphological active
a. Written Malay
   Zahwa didera ombak rindu yang [mem-(p)ukul pantai hati-nya].
   Zahwa be.whipped wave yearn COMP MEN-hit coast heart-her
   ‘Zahwa was whipped by the waves of yearning that hit the coast of her heart.’
b. Colloquial Malay
   Biasanya siapa yang [me-lakukan samseng] ni?
   usually who COMP MEN-do gangster this
   ‘Who are the people who usually become gangsters?’
(8) Bare active
a. Written Malay
   Begitu juga kehendak lelaki ini yang [sentiasa mahu aku
   so too wish man this COMP always want me
   menyerahkan segala jiwa dan raga-ku kepada-nya].
   yield all spirit and flesh-my to-him
   ‘So was this man’s wish, who always wanted me to yield all my spirit and flesh to him.’
b. Colloquial Malay
   Tapi tak semua yang [ambil dadah] tu nak hilangkan tekanan.
   but not all COMP take drug that want lose pressure
   ‘But not all of those who take drugs want to relieve their stress.’
In Written Malay, morphological active clauses constitute an overwhelming majority. Although bare actives are few in number, they exist for a qualitative reason: certain verbs seldom or never take the prefix meN- (e.g. makan in the sense of ‘to eat’). The extremely small number of bare active clauses is due to the rarity of such verbs. In contrast, Colloquial Malay contains more bare actives than morphological actives; the ratio of bare actives to morphological actives is approximately 2:1. Some authors describe this feature of Colloquial Malay as the omission of the prefix meN- (e.g. Onozawa 1996: 226). However, such a description relies on a widespread assumption, false and in our opinion unhealthy, that the colloquial variety is merely a simplified version of the
written one. If one is to study Colloquial Malay in its own right, it is more accurate to state that the bare active is the normal selection of voice in Colloquial Malay and that the prefix meN- can be added to bring about additional effects such as formality. In brief, the two voices are available in both Written Malay and Colloquial Malay; however, they differ strikingly in frequency.
4.2. Object relativisation
Let us now turn to object relativisation. The result is shown in Table 2, followed by some examples from our corpora.
Table 2. Voice choice in object relativisation
                   Morphological passive   Bare active   Bare passive   Indeterminate   Total
Written Malay      90 (66.2%)              4 (2.9%)      21 (15.4%)     21 (15.4%)      136
Colloquial Malay   38 (13.8%)              23 (8.4%)     11 (4.0%)      203 (73.8%)     275
(9) Morphological passive
a. Written Malay
   Ketewasan yang [di-rasai oleh suami-ku] itu dapat juga aku rasai.
   defeat COMP DI-feel by husband-my that can too I feel
   ‘I could also feel the sense of defeat that my husband had felt.’
b. Colloquial Malay
   Rupanya tau, jauh-jauh dia merantau-rantau, rupanya sebelah
   apparently know far she go.abroad apparently side
   rumah abang aku juga yang [di-ambil-nya].
   house elder.brother my too COMP DI-take-her
   ‘It seems, you know, she went to study in faraway countries, but the man chosen by her in the end was my elder brother’s neighbour.’
(10) Bare active
a. Written Malay
   … orang lain yang [kami tidak pernah kenal]
   people other COMP we not PERF know
   ‘… other people whom we haven’t seen.’
b. Colloquial Malay
   La la, tu yang [aku duk fikir] tu, eiyy.
   INT INT that COMP I PROG think that INT
   ‘Oh, come on. That’s what I was thinking.’
(11) Bare passive
a. Written Malay
   … ini adalah wasiat yang [terpaksa aku tunaikan].
   this be will COMP have.to I fulfil
   ‘… this is a will which I have to execute.’
b. Colloquial Malay
   Kau tau kan apa yang [akan polis lakukan pada samseng ni].
   you know not what COMP will police do to gangster this
   ‘You must know what action the police will take against the gangsters.’
(12) Indeterminate
a. Written Malay
   Azlan yang [aku temui] sudah berbeza dari segi fizikal-nya.
   Azlan COMP I meet PERF different from aspect physique-his
   ‘The Azlan I met was now physically different from what he had been.’
b. Colloquial Malay
   Kari yang [nenek aku buat] tu lain tau.
   curry COMP grandmother my make that other know
   ‘You know, the curry my grandmother makes is different.’
In Written Malay, there are plenty of morphological passive clauses. Bare actives are far fewer in number, but they exist for the same qualitative reason as that mentioned in the previous subsection. The number of bare passives is large enough to establish their existence. In Colloquial Malay, in contrast, the vast majority are the ‘bare’ types with no voice morpheme, namely the bare active, the bare passive and the indeterminate. The number of bare actives can be considered robust, so it is certain that bare actives exist; indeed, their number in Colloquial Malay is considerably greater than that in Written Malay. The bare passive, however, is problematic. It does not occur frequently enough to establish its existence as a part of the voice system of Colloquial Malay. Nor, unlike the bare active in Written Malay, is there any qualitative reason to justify its existence. Therefore, we would rather regard the 11 instances (4%) as the result of code-mixing with Written Malay and claim that Colloquial Malay does not have the bare passive. As we have noted earlier, no corpus of Colloquial Malay consists purely of Colloquial Malay; code-mixing is inevitable. Four pieces of evidence support this claim. First, based on our observation, a major part of the dialogues in the corpus is, at some level, fairly close to the basilect, in which code-mixing with Written Malay occurs only occasionally. Second, an examination of how many people produced bare actives and bare passives reveals that bare actives were produced by as many as 16 people out of 20, whereas bare passives were produced by only 5. If the bare passive shared the same status as the bare active in the voice
system of Colloquial Malay, it would have been produced more frequently and by more people. Third, all five speakers who produced bare passives also produced bare actives; none produced only bare passives. The fourth piece of evidence comes from a similar study by Cole et al. (2006), on which the preliminary study for the present one was modelled. They studied Jakarta Indonesian, the colloquial variety of Indonesian usually spoken by the population of Jakarta in the course of their daily lives (Wouk 1989). They conducted a corpus examination similar to the one in this study and concluded that Jakarta Indonesian proper did not have the category of voice that this study refers to as the bare passive. Table 3 summarises their results. The corpus named CHILD consists of utterances by children; A-C, of utterances by adults talking to children; and A-A1 and A-A2, of utterances by adults talking to adults. All of them are corpora of spoken language.
Table 3. Voice choice in object relativisation in Jakarta Indonesian (Cole et al. 2006)
Corpus (speaker)   Morphological passive   Bare active   Bare passive   Indeterminate   Total
CHILD (children)   56 (62.2%)              6 (6.7%)      2 (2.2%)       26 (28.9%)      90
A-C (adults)       65 (68.4%)              5 (5.3%)      2 (2.1%)       23 (24.2%)      95
A-A1 (adults)      28 (31.1%)              7 (7.8%)      16 (17.8%)     39 (43.3%)      90
A-A2 (adults)      51 (29.3%)              12 (6.9%)     17 (9.8%)      94 (54.0%)      174
Observe the different patterns between the corpora involving children (i.e. CHILD and A-C) on the one hand and those involving only adults (i.e. A-A1 and A-A2) on the other, especially with regard to the bare passive. The proportions of bare passives in the former are considerably smaller than those in the latter. From this clear disparity, Cole et al. (2006) conclude as follows: adults’ utterances are at the mesolectal level, and therefore code-mixing occurs with the acrolect, i.e. Standard Indonesian; the utterances involving children, on the other hand, represent the basilect, and hence no code-mixing occurs; if one considers the basilect to be the pure form of Jakarta Indonesian, it can be concluded that there is no bare passive in Jakarta Indonesian. Returning to our data, the result for Colloquial Malay parallels the Jakarta Indonesian corpora that involve children rather than those that involve only adults. That is to say, the bare passive is the least frequent category, and the second least frequent category, the bare active, is used more than twice as frequently as the former. Hence, the same line of argument as that employed by Cole et al. (2006) should also apply to Colloquial Malay.
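This comparison can be made explicit with a few lines of Python. The sketch below takes the counts from Tables 2 and 3 and lists, for each corpus, the two least frequent categories and the ratio of bare actives to bare passives. The names used are illustrative assumptions and are not part of either study.

```python
# Counts copied from Table 2 (Colloquial Malay) and Table 3 (Jakarta Indonesian).
COUNTS = {
    "Colloquial Malay": {"morphological passive": 38, "bare active": 23,
                         "bare passive": 11, "indeterminate": 203},
    "CHILD": {"morphological passive": 56, "bare active": 6,
              "bare passive": 2, "indeterminate": 26},
    "A-C": {"morphological passive": 65, "bare active": 5,
            "bare passive": 2, "indeterminate": 23},
    "A-A1": {"morphological passive": 28, "bare active": 7,
             "bare passive": 16, "indeterminate": 39},
    "A-A2": {"morphological passive": 51, "bare active": 12,
             "bare passive": 17, "indeterminate": 94},
}


def two_least_frequent(row):
    """Return the two categories with the lowest counts, rarest first."""
    return sorted(row, key=row.get)[:2]


def active_to_passive(row):
    """Ratio of bare actives to bare passives in one corpus."""
    return row["bare active"] / row["bare passive"]


for name, row in COUNTS.items():
    print(f"{name:17} {two_least_frequent(row)} {active_to_passive(row):.2f}")
```

In Colloquial Malay, CHILD and A-C the bare passive is the rarest category and the ratio exceeds 2, whereas in A-A1 and A-A2 the bare active is the rarest category and the ratio falls below 1.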
These four points corroborate the claim that the 11 instances of the bare passive are in fact mixed-in Written Malay. Therefore, the bare passive cannot be included in the voice system of Colloquial Malay and, as a consequence, the instances assigned to the category ‘indeterminate’ turn out to be instances of the bare active, with the exception of some mixed-in Written Malay expressions.
As a tangential remark, the Jakarta Indonesian corpora involving only adults parallel Written Malay as far as the relation between the bare active and the bare passive is concerned: in both, the bare passive is used more frequently than the bare active. However, the parallelism between Malay (Written and Colloquial) and Jakarta Indonesian (involving only adults and involving children) ceases to hold once the status of the morphological passive is taken into consideration. The morphological passive is the most frequently used category in Written Malay and in Jakarta Indonesian involving children, but not in Jakarta Indonesian involving only adults. In this last case, it is the indeterminate (i.e. either the bare active or the bare passive) that is used most frequently, which resembles Colloquial Malay rather than Written Malay. We find it rather surprising that in Jakarta Indonesian the morphological passive is used more often when children are involved than when only adults are involved. As we have noted thus far, in Malay, the morphological voices (morphological active and passive) are associated with the formal variety, i.e. Written Malay, while the bare voices (bare active, bare passive and the indeterminate) are associated with the informal variety, i.e. Colloquial Malay. In Jakarta Indonesian, however, the association is reversed, provided that the situation is more informal when children are involved than when only adults are involved.
5. Conclusions
In this section, we first summarise the findings thus far (section 5.1) and then discuss two theoretical issues, namely diglossia (section 5.2) and the typological anomaly claimed for the Malay voice system (section 5.3).
5.1. Summary
According to previous studies on Written Malay, there are four voice categories in Malay: morphological active, morphological passive, bare active and bare passive. In this study, we used corpora of Written Malay and Colloquial Malay to examine which of these voices is chosen in the relative clauses of the two varieties. Both subject and object relativisation allow more than one choice of voice. Given this, three questions were raised: (i) whether or not all the options are available in both Written
Malay and Colloquial Malay, (ii) the frequency of each available voice and (iii) whether there are any significant differences between Written Malay and Colloquial Malay. With regard to questions (i) and (ii), the answers can be summarised as in (13)-(14), where ‘x > y’ indicates that the category x is greater in frequency than the category y.
(13) Written Malay
a. Subject relativisation: Morphological active > Bare active
b. Object relativisation: Morphological passive > Bare passive > Bare active
(14) Colloquial Malay
a. Subject relativisation: Bare active > Morphological active
b. Object relativisation: Bare active > Morphological passive
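The orderings in (13)-(14) can be derived mechanically from the counts in Tables 1 and 2. The Python sketch below is illustrative only. It assumes, following the analysis in section 4.2, that the Colloquial Malay bare passives are set aside as code-mixing and that the indeterminate instances of Colloquial Malay are counted as bare actives (a slight overcount, since a few of them may be mixed-in Written Malay). The indeterminate instances of Written Malay are left out, as in (13b). The function and variable names are hypothetical.

```python
def ranking(counts):
    """Order the categories from most to least frequent, joined by ' > '."""
    return " > ".join(sorted(counts, key=counts.get, reverse=True))


# Written Malay (Tables 1 and 2; indeterminate instances omitted).
WRITTEN_SUBJECT = {"Morphological active": 107, "Bare active": 3}
WRITTEN_OBJECT = {"Morphological passive": 90, "Bare active": 4, "Bare passive": 21}

# Colloquial Malay (Tables 1 and 2); 203 indeterminate instances are added
# to the 23 clear bare actives, and the 11 bare passives are excluded.
COLLOQUIAL_SUBJECT = {"Morphological active": 63, "Bare active": 122}
COLLOQUIAL_OBJECT = {"Morphological passive": 38, "Bare active": 23 + 203}

for label, counts in [("(13a)", WRITTEN_SUBJECT), ("(13b)", WRITTEN_OBJECT),
                      ("(14a)", COLLOQUIAL_SUBJECT), ("(14b)", COLLOQUIAL_OBJECT)]:
    print(label, ranking(counts))
```

Note that (14b) does not depend on the reassignment of the indeterminate instances in any delicate way: the bare actives outnumber the morphological passives only once the indeterminate instances are included, which is why the status of that category matters for the summary.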
The answer to question (iii) follows from the above. There are indeed significant differences between Written Malay and Colloquial Malay. The voice system of the latter differs from that of the former in two respects, namely the number of voices that exist and their frequency. With regard to the number of voices, Written Malay has four whereas Colloquial Malay has three: we claimed that Colloquial Malay lacks the bare passive. With regard to frequency, in Written Malay the morphological voices (i.e. morphological active and passive) are used more frequently than the bare voices (i.e. bare active and passive), whereas the reverse is true in Colloquial Malay.
5.2. Diglossia
The stark differences noted above between Written Malay and Colloquial Malay count as grammatical evidence for the hypothesis that Written Malay and Colloquial Malay are two distinct varieties in a diglossic relationship. The term ‘diglossia’ is used here in the sense of Ferguson (1959), which is often referred to as classical diglossia as opposed to extended diglossia (Fishman 1967; Platt 1977 inter alia). The former concerns two (or more) varieties of one language, while the latter goes beyond this restriction and encompasses varieties of genetically unrelated languages as well. In a diglossic community, ‘two varieties of a language exist side by side throughout the community, with each having a definite role to play’ (Ferguson 1959: 325). One of the two varieties is called the H(igh) variety and the other the L(ow) variety. Ferguson characterises diglossia by means of the following nine criteria: (a) function, (b) prestige, (c) literary heritage, (d) acquisition, (e) standardisation, (f) stability, (g) grammar, (h) lexicon and (i) phonology. Although not all of these nine criteria need to be satisfied since
diglossia is a gradient, variable phenomenon (Schiffman 1997), it appears that they are all fulfilled in the case of Written Malay and Colloquial Malay as the H and L varieties respectively.
(a) Written and Colloquial Malay are used for different purposes.
(b) Speakers regard Written Malay as superior to Colloquial Malay. Some native speakers of Malay claim, in fluent (Colloquial) Malay, that their Malay is not good. We attribute such behaviour partly to the low prestige they attach to their daily speech, i.e. Colloquial Malay. (Another reason for making this claim, one that many people mention, is to justify their use of English rather than Malay.)
(c) Written Malay has been the language of serious literature, although there may be some exceptions among modern literary works.
(d) Children learn to speak Colloquial Malay first, as their native tongue. Written Malay is learned later through formal education.
(e) It is Written Malay that has been standardised, has an established orthography and has attracted the attention of many local linguists (Nomoto 2006b).
(f) It is fairly probable that diglossia has lasted for centuries, although we are unsure whether this is provable at all. Literacy had long been the monopoly of the upper class, and only in the last century did the common people begin to record and write what they spoke (Nomoto 2006b).
(g) The present study deals with this point. Written Malay has an additional voice category that Colloquial Malay lacks, namely the bare passive. This is consistent with Ferguson’s characterisation that ‘H has grammatical categories not present in L’ (Ferguson 1959: 333).
(h) Some words are used only in Written Malay and not in Colloquial Malay, and vice versa. Nomoto (2006b) studied the multipurpose preposition kat, which is used only in Colloquial Malay. Nomoto and Tsuji (2006) investigated Information and Communication Technology (ICT) terms in the written and spoken languages of Malay, Japanese and German, and found that the rate of the written-spoken difference is the greatest in Malay; this, we claimed, has to do with the fact that Malay has not undergone any serious attempts to resolve diglossia, unlike the other two languages.
(i) At present, we have no concrete data to illustrate the phonological differences between Written Malay and Colloquial Malay. However, prosody is obviously different between the two varieties. Moreover, at the Ninth International Symposium on Malay/Indonesian Linguistics (ISMIL) in 2005, David Gil suggested that Colloquial Malay had an additional middle vowel phoneme that Written Malay did not. We have yet to confirm the veracity of this claim ourselves.
Given all these facts, it appears highly reasonable to believe that the diglossia hypothesis is correct. Noriah’s (2006) remark that ‘a diglossia phenomenon comparable to Ferguson’s (1959) classical diglossia does not exist at all in Malaysia since Malaysia is a bilingual or multilingual
country10’ needs to be reconsidered. Although it is true that the model of classical diglossia does not apply to the linguistic situation of Malaysia as a whole, it does apply to a component of it.
5.3. Markedness: On the typological anomaly of the Malay voice
At the very beginning of this paper, we stated that the voice system of Malay is peculiar enough to attract the attention of some typologists for two reasons. The first, already mentioned, is that the Malay voice system is not a simple bipolar opposition between the active and the passive. The preceding discussion demonstrated this for both Written Malay and Colloquial Malay. The second reason pertains to markedness, and it is in this respect that Written Malay and Colloquial Malay present different pictures. In discussing markedness, we adopt two of the four relevant criteria proposed in Comrie (1988), namely formal complexity and (raw) frequency11. Markedness in terms of the two criteria is summarised in Table 4. Formal complexity is independent of the written-spoken difference. The morphological active and passive are, of course, literally marked by the voice morphemes meN- and di- respectively12. The bare active and passive, on the other hand, have no such voice markers and hence are unmarked. Frequency, by contrast, differs between Written Malay and Colloquial Malay. In Written Malay, the morphological active and passive are used more frequently than the bare active and passive (cf. (13)). Therefore, the morphological voices are unmarked and the bare voices are marked. Conversely, in Colloquial Malay, it is the bare type that is unmarked and the morphological type that is marked (cf. (14)).
10 Her original sentence in Malay is as follows: ‘fenomena diglosia yang menyamai diglosia klasik Ferguson (1959) memang tidak wujud di Malaysia kerana Malaysia merupakan negara dwibahasa atau pelbagai bahasa’ (Noriah 2006: 117).
11 The other two criteria are the degree of productivity and discourse distribution.
12 Comrie (1988) takes this up as a notable characteristic of Indonesian, a dialect of Malay (cf. footnote 1). In languages like English and Japanese, voice-marking occurs only in the passive. He goes further and sets the bare passive against the morphological active in order to illustrate the lesser formal markedness of the passive. This, however, is an inappropriate way of presenting the data if one takes the bare active into account as well: the formal complexity of the bare ‘active’ and the bare ‘passive’ is the same. It is not clear from the article whether or not he regards the bare active as a part of the Indonesian voice system. At any rate, this argument is relevant only to Written Malay and not to Colloquial Malay, which lacks the category of bare passive.
Table 4. Markedness in terms of formal complexity and frequency
Variety            Voice                          Formal complexity   Frequency
Written Malay      Morphological active/passive   marked              unmarked
Written Malay      Bare active/passive            unmarked            marked
Colloquial Malay   Morphological active/passive   marked              marked
Colloquial Malay   Bare active                    unmarked            unmarked
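The Frequency column of Table 4 can be checked against the counts in Tables 1 and 2. In the Python sketch below, the pooling of subject and object relativisation, the treatment of indeterminate instances as bare, and all names are assumptions made for illustration.

```python
# Formal complexity: morphological voices carry meN- or di-, bare voices do not.
FORMALLY_MARKED = {"morphological": True, "bare": False}

# Pooled counts from Tables 1 and 2 (subject + object relativisation).
# "bare" includes the indeterminate instances; for Colloquial Malay it also
# includes the 11 bare passives, which does not affect the outcome.
TOTALS = {
    "Written Malay": {"morphological": 107 + 90, "bare": 3 + 4 + 21 + 21},
    "Colloquial Malay": {"morphological": 63 + 38, "bare": 122 + 23 + 11 + 203},
}


def frequency_marked(totals):
    """Treat the less frequent voice type as marked in terms of frequency."""
    least = min(totals, key=totals.get)
    return {voice_type: voice_type == least for voice_type in totals}


def relation(variety):
    """'aligned' if the two criteria agree for the variety, otherwise 'crossed'."""
    by_frequency = frequency_marked(TOTALS[variety])
    return "aligned" if by_frequency == FORMALLY_MARKED else "crossed"


for variety in TOTALS:
    print(variety, TOTALS[variety], relation(variety))
```

On these pooled counts the two criteria point in opposite directions for Written Malay and in the same direction for Colloquial Malay, which is the pattern recorded in Table 4.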
Typologically speaking, the voice system of Written Malay is unusual and indeed noteworthy, as has been pointed out in the literature. Formal complexity and frequency are in a crossed relation: the morphologically marked voices are unmarked in terms of frequency, and the morphologically unmarked voices are marked in terms of frequency. The voice system of Colloquial Malay, however, is in fact commonplace. The morphologically unmarked voice is also unmarked in terms of frequency, and the morphologically marked voices are also marked in terms of frequency. While this may not be of great interest to some typologists, it is an important fact nonetheless, because this is how people use language for the most part of their lives.
From the discussion above, it is evident that the distinction between Written Malay and Colloquial Malay is very important. The distinction has not been clearly drawn in the literature, which has often caused disagreements among researchers with regard to the grammaticality/acceptability of data. Drawing the distinction is necessary because the two varieties exhibit significant differences with respect to some phenomena. This is exactly the case with the voice system. Diglossia, originally a sociolinguistic theme, has fundamental ramifications for every branch of linguistics, including corpus linguistics.
References
Chung, S. 1976. “An object-creating rule in bahasa Indonesia.” Linguistic Inquiry 7. 41-87.
Chung, S. 1978. “On the subject of two passives in Indonesian.” In C. Li (ed.) Subject and Topic. 57-98. New York: Academic Press.
Cole, P. and G. Hermon. 2005. “Subject and non-subject relativization in Indonesian.” Journal of East Asian Linguistics 14. 59-81.
Cole, P., G. Hermon and Y. Tjung. 2006. “Is there pasif semu in Jakarta Indonesian?” Oceanic Linguistics 45. 64-90.
Comrie, B. 1981. Language Universals and Linguistic Typology. Oxford: Blackwell.
Comrie, B. 1988. “Passive and Voice.” In M. Shibatani (ed.) Passive and Voice. 9-23. Amsterdam: John Benjamins.
Ferguson, C.A. 1959. “Diglossia.” Word 15. 325-340.
Fishman, J. 1967. “Bilingualism with and without diglossia, diglossia with and without bilingualism.” Journal of Social Issues 23. 29-38.
Hassal, T. 2005. “Taboo object relative clauses in Indonesian.” In P. Sidwell (ed.) SEALS XV: Papers from the 15th Meeting of the Southeast Asian Linguistics Society (Pacific Linguistics, E-1). 1-18. Canberra: Research School of Pacific and Asian Studies, Australian National University.
Knowles, G.O. and Zuraidah M.D. 2006. Word Class in Malay: A Corpus-Based Approach. Kuala Lumpur: Dewan Bahasa dan Pustaka.
Musgrave, S. 2001. Non-subject Arguments in Indonesian. Ph.D. dissertation, University of Melbourne.
Nomoto, H. 2006a. A Study on Complex Existential Sentences in Malay. MA thesis, Tokyo University of Foreign Studies.
Nomoto, H. 2006b. “The multi-purpose preposition kat in Colloquial Malay.” In Y. Tsuruga et al. (eds.) Gengojouhougaku Kenkyuuhoukoku 11. 69-94. Tokyo University of Foreign Studies.
Nomoto, H. forthcoming. “Voice in Colloquial Malay relatives.” In Y. Tsuruga et al. (eds.) Gengojouhougaku Kenkyuuhoukoku 12. 95-114. Tokyo University of Foreign Studies.
Nomoto, H. and T. Tsuji. 2006. “Adaptability of a language to globalisation: the influence of written-spoken differences.” Paper presented at the 1st World Congress on the Power of Language: Theory, Practice, and Performance.
Noriah M. 2006. “Diglosia dan dasar bahasa dalam komuniti pelbagai bahasa [Diglossia and language policy in a multilingual community].” Jurnal Persatuan Linguistik 7. 109-120.
Onozawa, J. 1996. Kiso Mareeshiago [Basic Malay]. Tokyo: Daigakushorin.
Platt, J. 1977. “A model for polyglossia and multilingualism (with special reference to Singapore and Malaysia).” Language in Society 6. 361-78.
Saddy, D. 1991. “WH scope mechanisms in bahasa Indonesia.” In L.L.S. Cheng and H. Demirdash (eds.) MIT Working Papers in Linguistics 15. 183-218.
Schiffman, H.F. 1997. “Diglossia as a sociolinguistic situation.” In F. Coulmas (ed.) The Handbook of Sociolinguistics. 205-216. Oxford: Blackwell.
Sinclair, J. 2005. “Corpus and text: basic principles.” In M. Wynne (ed.) Developing Linguistic Corpora: A Guide to Good Practice. 1-16. Oxford: Oxbow Books. Available online from http://ahds.ac.uk/linguisticcorpora/ (accessed 18/10/2006).
Twine, N. 1978. “The Genbunitchi movement: its origin, development, and conclusion.” Monumenta Nipponica 33. 333-356.
Voskuil, J.E. 2000. “Indonesian voice and A-bar movement.” In I. Paul et al. (eds.) Formal Issues in Austronesian Linguistics. 195-213. Dordrecht: Kluwer.
Wouk, F. 1989. The Impact of Discourse on Grammar: Verb Morphology in Spoken Jakarta Indonesian. Ph.D. dissertation, UCLA.
Yeoh, C. 1979. Interaction of Rules in Bahasa Malaysia. Ph.D. dissertation, University of Illinois at Urbana-Champaign.
Appendix. Sources of the corpus of Written Malay
The list below is arranged by month of publication. As all the stories are from the same monthly magazine, Dewan Masyarakat, the name of the magazine is omitted.
1. Ibnu Ahmad Al-Kurauwi. Kerudung Kefiyyeh Baba. January 2005, 59-61.
2. Peter Augustine Goh. Saudara, Seorang Tua, dan Kisah Ini. March 2005, 60-62.
3. Razali Endun. Ayam Den Lopeh! April 2005, 59-61.
4. Nik Azman NM. Bukan Tamu Asing. May 2005, 59-61.
5. Haji Chacho Haji Bulah. Rumahku. June 2005, 59-61.
6. Siti Zarina Md. Israry. Hanky-Panky. July 2005, 59-61.
7. Saifullizan Yahaya. Mala Ayu. August 2005, 59-61.
8. MNL. Tiada Datuk di Sini! September 2005, 59-61.
9. Mohd. Kasim Mahmud. Gelek Si Meera. October 2005, 59-61.
10. Muhd. Mansur Abdullah. Sepi Hatiku Ini. November 2005, 59-61.
11. Osman Ayob. Angin Perubahan. December 2005, 59-61.
12. Ibnu Ahmad Al-Kurauwi. Tiga Skrip, Satu Skrin, Kereta Api. January 2006, 59-62.
13. Nazel Hashim Mohamad. Prahara Tsunami. February 2006, 59-61.
14. Sarimah Othman. Apabila Tiang Seri Bergegar. March 2006, 59-61.
15. Amir Hamzan Mohd. Wazir. Paranoid. April 2006, 59-61.
16. Sharif Putera. Cinta Petir. May 2006, 59-61.
Testing the Primacy of Aspect and Reverse Order Hypothesis in Japanese Returnees: Towards Constructing a Corpus of Second Language Attrition Data
Asako YOSHITOMI

1. Overview
The purpose of the present paper is twofold. One aim is to introduce the outline of a project that is currently being undertaken at Tokyo University of Foreign Studies (TUFS) by the Usage-Based Linguistic Informatics (UBLI), Second Language Acquisition (SLA) English Research Group, to construct corpora that consist of learner language data of Japanese learners of English. The other is to present a preliminary study that makes use of one of these corpora, namely the corpus of second language attrition data of Japanese returnee children, to illustrate that the use of corpora in SLA research should be expanded to studies in second/foreign language attrition and re-learning processes. In particular, the paper examines two Japanese returnee children’s process of English attrition from the perspective of what has been called the Primacy of Aspect Hypothesis and the Reverse Order Hypothesis in the literature. Results from the preliminary study indicate that the verb tense-aspect system regresses according to the predictions made by the two hypotheses. This finding implies that language learning and language attrition are universal processes working in reverse directions, and that the construction of a second language attrition corpus is a promising line of inquiry for future studies in SLA as well as in second/foreign language education.

2. Introduction
In 2005, the SLA English Research Group at TUFS started to construct three types of learner language corpora composed of oral data from Japanese learners of English as a second/foreign language (ESL/EFL). Based on a survey conducted in 2004 (Ueda & Nishikawa), the Group decided to focus on collecting the following language data, which were judged to be relatively uncommon in the SLA literature:
I. Interview style dialog between high school students learning English as a foreign language in Japan and a native speaker of English;
II. Interview style dialog between returnee children from English-speaking countries and a native speaker of English;
III. Conversation style ‘trialog’ among university students whose first language (L1) is Japanese, second language (L2) is English, and who are currently learning a third language (L3) at university.
Corpus I consists of longitudinal oral data collected from high school students who only have the experience of studying EFL in Japan. Additional cross-sectional data are to be collected this year from a larger number of Japanese high school students in interview sessions with a native English speaker.
Corpus II also consists of longitudinal oral data, collected from returnee children in Yoshitomi (1999), as well as additional data currently being collected. Returnees are English learners who have spent a number of years during their childhood in an English-speaking country, and who have therefore acquired English quite naturally as their second language (ESL), but who undergo a process of forgetting English after returning to Japan. In most cases, returnee children go to English maintenance classes once a week to retain their English skills. However, the sudden decrease in their opportunity to use English in their everyday life in Japan often leads to a gradual loss of their English skills. Once these children enter junior high school, they receive formal English education at school, usually in a typical EFL environment. The research group is following returnee children who are in the process of losing and then re-learning English. This is supplemented by data collected by a questionnaire about the returnees’ motivation and attitude towards maintaining and re-learning English, their self-confidence in using English, their language choice in daily communication, and their reflection on how they adapted themselves to the educational culture in Japan when they returned.
Corpus III consists of data collected from conversations among non-native speakers of English in groups of three. The majority of existing corpora consist of monologs or dialogs; hence, it was considered significant to collect conversations among more than two people. Retrospective data are collected after each conversation session to enable protocol analyses of the learners’ use of communication strategies and their awareness of cross-linguistic effects that result from learning two or more languages at the same time. In addition, narrative data are being collected from all three types of English learners described above, using the well-known Frog story (Frog, Where Are You?) by Mercer Mayer (1969). How all these learner language
data can be applied to SLA research will be briefly discussed at the end of this paper in section 7. In the study that follows, I analyze two returnee children’s longitudinal interview data, using part of corpus II. In section 3, I first review the theoretical background of the two hypotheses that are tested, before stating the research questions and hypotheses in section 4. In sections 5 and 6, I present the methodology, results, discussion, and conclusion.

3. Theoretical background
3.1. Primacy of Aspect in SLA
In SLA research, so-called morpheme studies that examined the order of acquisition of L2 grammatical morphemes saw their heyday in the 1970s. These studies were stimulated by similar research in L1 acquisition and resulted in the discovery of a common sequence in the learning of grammatical morphemes, or a ‘natural order of acquisition’ among L2 learners of different L1 backgrounds. Later, studies emerged that investigated the developmental stages of individual morphemes or grammatical subsystems rather than the relative order of acquisition among different types of morphemes. Whereas the former type of morpheme studies were mostly form oriented, the latter type of studies on developmental stages started to adopt a form-function analysis of learner language data in which linguistic forms are examined in terms of the language functions they are intended to carry. In the late 1970s, L1 researchers found that children’s initial stages of development in verbal morphology are largely influenced by the inherent or lexical aspect of verbs (Bardovi-Harlig 2000).
Similar lines of studies followed in SLA research from the 1980s, which led to the formulation of the Primacy of Aspect Hypothesis, or simply Aspect Hypothesis, which states that “First and second language learners will initially be influenced by the inherent semantic aspect of verbs or predicates in the acquisition of tense and aspect markers associated with or affixed to these verbs” (Andersen & Shirai 1994: 133). In order to understand the hypothesis, it is necessary to distinguish two linguistic concepts: grammatical aspect and lexical aspect. Grammatical aspect, also known as viewpoint aspect, refers to the different ways in which the speaker intends to view a situation. If you compare two sentences, (1) “I ate an apple” and (2) “I was eating an apple,” the event described is identical. So is the linguistic expression (“eat an apple”) that is used to refer to the event. However, the two sentences differ in grammatical aspect. Whereas sentence (1) adopts perfective grammatical aspect and views the situation externally in its entirety with an endpoint,
sentence (2) uses imperfective grammatical aspect that views the situation internally as an interval during an ongoing event without a clear endpoint (Smith 1991). In English, the progressive form is typically used to depict an imperfective viewpoint and is defined essentially by continuousness (Comrie 1976). Shirai and Andersen (1995) describe the prototypical progressive as having the features [-telic] (i.e., not completed) and [+durative]. In English, progressive and non-progressive aspect can be marked with all tenses (Bardovi-Harlig 2000). Lexical aspect or inherent aspect, on the other hand, refers to the semantic properties of the linguistic forms, especially verbs or predicates, used to describe a situation. Although there are a number of different lexical aspectual categories employed in the SLA literature, here I introduce the four-way classification proposed by Vendler (1967). Vendler distinguishes four aspectual categories, which are states, activities, accomplishments, and achievements. According to Bardovi-Harlig (2000), states persist over time without change, are not interruptable, and include examples such as know, need, want, like, and be as in be happy. Activities involve duration and have no specific endpoint as in play, walk, read, and rain. Accomplishments also involve duration but have an endpoint. For example, write a letter, bake a cake, explain, and prepare require a span of time to accomplish and can be finished or completed at a certain point. Achievements capture the beginning or the end of an action, and can be reduced to a point such as in start a game, arrive, notice, break, and shoot. Andersen (1991) distinguishes the four categories using three semantic features: punctual, telic, and dynamic. States are [-punctual] [-telic] [-dynamic]; activities [-punctual] [-telic] [+dynamic]; accomplishments [-punctual] [+telic] [+dynamic]; and achievements [+punctual] [+telic] [+dynamic]. 
The framework proposed by Andersen based on Vendler’s categories is widely used in SLA research today and is therefore adopted in the current study. The Aspect Hypothesis predicts that in English, (1) learners initially associate and use the imperfective aspect marker –ing with prototypical [-telic] [+dynamic] (i.e., activity) verbs regardless of the required grammatical forms; (2) perfective past –ed and perfect tense –en are first restricted to [+punctual] [+telic] (i.e., achievement) verbs; (3) the simple present tense marker –s is first predominantly used with state verbs; and (4) in later stages, the distributional bias of each morpheme towards its prototypical verb type gradually relaxes as the learners become able to mark verbs with grammatical tense and aspect that do not coincide with the lexical aspect of the verbs, and so to freely express their viewpoint of events or situations (Housen 2002).
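The feature bundles and the early form-aspect pairings described above can be written down as a small lookup, which may help when coding verbs consistently. The following is only an illustrative sketch: the names `Features`, `CATEGORIES`, `PROTOTYPE`, `classify`, and `is_prototypical` are hypothetical and do not come from any of the cited studies.

```python
from typing import NamedTuple, Optional


class Features(NamedTuple):
    punctual: bool
    telic: bool
    dynamic: bool


# Feature values as attributed to Andersen (1991) in the text.
CATEGORIES = {
    Features(punctual=False, telic=False, dynamic=False): "state",
    Features(punctual=False, telic=False, dynamic=True): "activity",
    Features(punctual=False, telic=True, dynamic=True): "accomplishment",
    Features(punctual=True, telic=True, dynamic=True): "achievement",
}

# Early pairings predicted by the Aspect Hypothesis (marker labels are ours).
PROTOTYPE = {
    "-ing": "activity",
    "-ed": "achievement",
    "-en": "achievement",
    "-s": "state",
}


def classify(punctual: bool, telic: bool, dynamic: bool) -> Optional[str]:
    """Return the aspectual category, or None if the bundle is not one of the four."""
    return CATEGORIES.get(Features(punctual, telic, dynamic))


def is_prototypical(marker: str, category: str) -> bool:
    """True if the marker-category pairing is one the hypothesis expects early on."""
    return PROTOTYPE.get(marker) == category


print(classify(punctual=False, telic=True, dynamic=True))
print(is_prototypical("-ing", "activity"))
```

Because the hypothesis describes relative tendencies, a pairing for which `is_prototypical` returns False is not an error in a learner's speech; it is simply a non-prototypical use.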
According to the findings of function-form analyses, in the earliest stage of acquisition, before tense-aspect morphology is used and distributional biases between tense markers and the lexical aspect of verbs emerge, learners adopt various pragmatic means to express temporality (Bardovi-Harlig 2000). Temporal reference can be established in four ways: by relying on the interlocutor in scaffolded discourse, by the use of context, by contrasting events, and by following chronological order in explaining events (ibid.). It is also worth mentioning that very advanced learners of L2 and native speakers of that language also exhibit distributional biases between verb forms and lexical aspect, but that they are more able to freely combine grammatical markers and verbs or predicates that do not match in terms of the inherent aspect they denote. As Andersen and Shirai (1996) rightly point out, the distributional biases claimed in the Aspect Hypothesis should be seen as relative tendencies, not absolute restrictions.

3.2. Reverse Order Hypothesis in language attrition
Language attrition refers to “the loss of any language or any portion of language by an individual or a speech community” (Lambert & Freed, 1982: 1). Four types of language attrition are generally recognized: L1 loss, L2/foreign language loss, death of an entire language, and language deterioration in neurologically impaired patients or the elderly. The focus of the present paper is on L2 attrition by Japanese returnees who have spent a number of years in an ESL environment. The process of forgetting a language is often believed to be the undoing of the learning process. The notion has been interpreted to refer to two related but different characteristics of language loss: the Inverse Relation Hypothesis and the Reverse Order Hypothesis (Yoshitomi 1992).
The Inverse Relation Hypothesis postulates that there is an inverse relationship between proficiency level prior to the onset of attrition and the rate and/or the amount of loss. In other words, what is learned best is least forgotten, and those who have learned better, or become more proficient, are less vulnerable to loss. The hypothesis has been supported by several studies (Godsall-Myers, 1981; Bahrick 1984; Moorcroft & Gardner 1987). The Reverse Order Hypothesis, which comes from the concept of “regression” in aphasia (Jacobson 1962), states that attrition is the mirror image of acquisition, that is, the last thing learned is the first to be forgotten. The hypothesis may refer to three different linguistic levels (DeBot & Weltens 1991): (1) within skills (i.e., within phonology, morphology, syntax, lexicon, etc.); (2) within languages, namely, in acquisition, perception precedes production, and spoken language precedes written language,
whereas in language loss, the sequence is reversed; and (3) between languages, that is, with respect to the order of acquisition and loss of languages in multilinguals. The focus of the present study is the attrition phenomenon at the within-skills, or intra-skills, level, especially in verb morphology. Support for the other two types of attrition in reverse order has also been reported (Yoshitomi 1992). In general, therefore, it has been found that learners do tend to lose what they learned last, and maintain what they learned at earlier stages of acquisition. However, to my knowledge, the majority of language attrition studies testing the Reverse Order Hypothesis have tended to examine the formal characteristics of linguistic expressions used by the learners without enough attention paid to how those forms are used to encode language functions. Hence, more functional analyses should be conducted, especially in studies of language attrition. Functional analyses are of two kinds: form-function analysis, in which a specific form is selected and the meanings that the form realizes are analyzed, and function-form analysis, in which a certain language function is selected and the linguistic forms that perform that function are identified (Ellis & Barkhuizen 2005). In this study, a form-function analysis of the verb forms used by returnee children is conducted to test the Reverse Order Hypothesis at the intra-skills level. The degree of attrition is generally a function of the length of L2 disuse (often referred to as the ‘incubation period’). However, the Inverse Relation Hypothesis states that beyond a certain critical threshold of language proficiency, L2 skills become less vulnerable to attrition. This is believed to lead to the phenomenon called the ‘initial plateau’, that is, the initial stage of the incubation period during which language skills are relatively unsusceptible to attrition.
It has also been documented that certain linguistic elements can survive loss regardless of non-use during a relatively long incubation period (Bahrick 1984). In Yoshitomi (1992), a cognitive-psychological model of language acquisition and attrition based on neurobiological findings was proposed. The model essentially claims that both language acquisition and attrition are consequences of neural plasticity. Neural plasticity allows input to alter the configuration of existing knowledge networks in memory storage. New information is compared with prior knowledge and stored in matched patterns. It is first stored in working memory via modality-specific processing systems, then in intermediate memory where information is integrated and associated with other information, and finally in long-term memory, or permastore. The transition of information to long-term storage involves consolidation which gradually strengthens certain connections and
eliminates or weakens others. Linguistic knowledge represented in the connections that are eliminated becomes lost. Since memory in long-term storage has gone through consolidation, the connectivity is stronger and, thus, less vulnerable to attrition. Vulnerability to attrition is greatest with respect to recently acquired, unconsolidated knowledge. Information which survives competition and reorganization becomes the basis for the processing of new information. Hence the model predicts that form-function mappings also undergo the same process of weakening connections and that such regression sets in after a certain length of incubation period instead of immediately after the start of L2 disuse (or markedly infrequent use compared to the acquisition period spent in the ESL environment).

4. Research questions and hypotheses
How does the returnees’ tense-aspect system attrite? The cognitive-psychological model of language acquisition and attrition predicts that in L2 attrition, the tense-aspect system of the learners will regress in the reverse order of what is predicted by the Aspect Hypothesis in SLA. In other words, assuming that the returnees have acquired basic verb morphology by the end of their stay in the U.S.:
(1) at the initial stages of incubation period, they are able to mark verbs relatively freely regardless of lexical aspect to express their viewpoints;
(2) gradually, their verb morphology regresses towards a more biased distribution, where grammatical tense and aspect markers tend to co-occur with verbs or predicates whose lexical aspect matches that of the markers;
(3) in later stages of attrition, where verb morphology is severely reduced, returnees increasingly rely on pragmatic/contextual means to express tense and aspect.
5. Method
5.1. Subjects
The subjects are two Japanese returnee girls, Yuko and Hiro (pseudonyms). At the time of initial data collection, Yuko was 9:7 years old and had returned to Japan three weeks before. She had spent 3:8 years in the U.S. Hiro, on the other hand, was 9:6 years old at the time of initial data collection, and had already spent 13 months in Japan. She had lived in the U.S. for 5:5 years. Yuko and Hiro lived in large metropolitan areas in the U.S. and went to local American elementary schools. They also attended supplementary Japanese schools on Saturdays. Since returning to Japan, they have been living in Tokyo. Yuko attends a private elementary school, while Hiro attends a public elementary school. They both go to English maintenance classes on Saturdays. In addition, Yuko has an English class for returnees at her school once a week
for one hour. Yuko and Hiro have no brothers or sisters. Their fathers work at large Japanese corporations which have branches in the U.S. Both their parents have graduated from university in Japan.

5.2. Data collection
Yuko and Hiro each had two interview-style conversations with a young female American English speaker, with an eight-month interval between the first and second interviews. The first set of data was collected when Yuko had an incubation period of 3 weeks, whereas Hiro had an incubation period of 13 months. The second set of data was collected approximately 8 months later, when Yuko’s incubation period was 9 months, and Hiro’s incubation period was 21 months (1:9 years). The native English-speaking interviewer asked questions about their experience at school, family trips, and other everyday events both in the U.S. and in Japan. The interviewer was instructed to interact naturally with the girls, while endeavoring to ask questions about the past, present, and future to elicit various verb tenses. Otherwise, the interview was quite free and both girls seemed to enjoy the opportunity to interact with a native speaker. They often volunteered to provide details about their experience without being asked, and talked about interesting episodes in their lives that revealed their outgoing and active personalities. Each conversation session lasted for approximately one hour.

5.3. Data Analysis
All speech data collected were transcribed and analyzed using the Codes for the Human Analysis of Transcripts (CHAT) transcription and coding format and the Computerized Language Analysis (CLAN) package programs developed by MacWhinney (1991). The frequency of vocabulary types and tokens used during the conversations was calculated.
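To make the notion of vocabulary types and tokens concrete, the counts can be sketched as follows. This is a simplified illustration, not the CLAN procedure used in the study: the function name `type_token_counts`, the lowercasing, and the regular-expression tokenizer are all assumptions made for the example, and the sample utterances are invented.

```python
import re
from collections import Counter


def type_token_counts(utterances):
    """Return (types, tokens): distinct word forms and running words."""
    tokens = []
    for utterance in utterances:
        # Assumed tokenization: lowercase alphabetic words, keeping
        # word-internal apostrophes (e.g. "don't") as part of the word.
        tokens.extend(re.findall(r"[a-z]+(?:'[a-z]+)?", utterance.lower()))
    frequencies = Counter(tokens)
    return len(frequencies), len(tokens)


# Invented sample utterances, not taken from the corpus.
sample = ["I went to school.", "I like my school a lot!"]
types, tokens = type_token_counts(sample)
print(f"{types}/{tokens}")  # prints "8/10"
```

Note that this sketch counts inflected forms such as "go" and "went" as separate types; whether to lemmatize first is a design decision that affects the type count.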
Also, the use of verb morphology (copula be, auxiliary be, simple verbs in their base form, the third person singular –s marker, past regular and irregular forms of simple verbs, and modal auxiliaries) was examined by calculating the percentage of contexts in which verb morphology was supplied in obligatory contexts (%SOC) and the percentage of verb morphology used in a target-like manner (%TLU). %SOC is the percentage of correct verb morphology supplied in linguistic contexts in which the verb morphology is required. %TLU refers to the percentage of correct forms used among the total verb morphology. Furthermore, the use of complex syntactic structures was examined, and the percentage of error-free clauses was obtained. Finally, finite verbs were coded for their forms (base vs. third person singular present vs. past vs. past
participle) and lexical aspect (state vs. activity vs. accomplishment vs. achievement) and the frequencies of finite verb types and tokens were tallied to analyze the relationship between tense-aspect marking and inherent verb aspect. The lexical aspect of the verbs or predicates was categorized employing the diagnostic tests used in Shirai (1991):
Step 1: State or nonstate
Does it have a habitual interpretation in simple present?
If no ➪ State (e.g., I love you)
If yes ➪ Nonstate (e.g., I eat bread) ➪ Go to step 2
Step 2: Activity or nonactivity
Does “X is V-ing” entail “X has V-ed” without an iterative/habitual meaning? In other words, if you stop in the middle of V-ing, have you done the act of V?
If yes ➪ Activity (e.g., run)
If no ➪ Nonactivity (e.g., run a mile) ➪ Go to step 3
Step 3: Accomplishment or achievement
[If test (a) does not work, apply test (b) and possibly (c).]
(a) If “X V-ed in Y time (e.g., 10 minutes),” then “X was V-ing during that time.”
If yes ➪ Accomplishment (e.g., He painted a picture)
If no ➪ Achievement (e.g., He noticed a picture)
(b) Is there ambiguity with almost?
If yes ➪ Accomplishment (e.g., He almost painted a picture has two readings: he almost started to paint a picture/he almost finished painting a picture.)
If no ➪ Achievement (e.g., He almost noticed a picture has only one reading.)
(c) “X will VP in Y time (e.g., 10 minutes)” = “X will VP after Y time.”
If no ➪ Accomplishment (e.g., He will paint a picture in an hour is different from He will paint a picture after an hour, because the former can mean that he will spend an hour painting a picture, but the latter does not.)
If yes ➪ Achievement (e.g., He will start singing in two minutes can only have one reading, which is the same as He will start singing after two minutes, with no other reading possible.)
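The diagnostic steps amount to a short decision procedure, sketched below. The judgments themselves must still be made by a human coder; the function only records the order in which they are applied. The function and parameter names are hypothetical, only test (a) of step 3 is modeled, and the sketch assumes that an activity such as run passes the entailment test in step 2 (someone who stops in the middle of running has run).

```python
def classify_lexical_aspect(habitual_in_simple_present,
                            progressive_entails_perfect=None,
                            in_time_implies_progressive=None):
    """Apply the three diagnostic steps to a coder's yes/no judgments.

    habitual_in_simple_present: step 1 judgment.
    progressive_entails_perfect: step 2 judgment ("X is V-ing" entails "X has V-ed").
    in_time_implies_progressive: step 3, test (a) judgment.
    """
    # Step 1: no habitual reading in the simple present means a state.
    if not habitual_in_simple_present:
        return "state"
    # Step 2: the entailment holds for activities.
    if progressive_entails_perfect is None:
        raise ValueError("step 2 judgment is required for nonstates")
    if progressive_entails_perfect:
        return "activity"
    # Step 3, test (a): separates accomplishments from achievements.
    if in_time_implies_progressive is None:
        raise ValueError("step 3 judgment is required for nonactivities")
    return "accomplishment" if in_time_implies_progressive else "achievement"


# Judgments for the example predicates given in the diagnostic tests.
print(classify_lexical_aspect(False))               # love
print(classify_lexical_aspect(True, True))          # run
print(classify_lexical_aspect(True, False, True))   # paint a picture
print(classify_lexical_aspect(True, False, False))  # notice a picture
```

Tests (b) and (c) are left out on purpose: they are fallbacks for cases where test (a) gives no clear judgment, and encoding them would require a three-valued answer rather than a simple yes or no.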
5.4. Results and discussion
5.4.1. Vocabulary types/tokens and use of verb morphology
The vocabulary types/tokens used by Yuko were 785/5,267 at session 1 and 500/2,624 at session 2, while those used by Hiro were 484/2,582 at session 1 and 448/3,158 at session 2. Considering the fact that both girls participated actively in the conversations and talked for about one hour in each session about similar topics with the same interlocutor, it is very likely
that the marked decrease in the vocabulary types used by the girls implies attrition in lexical knowledge. In Yuko’s case, there is a 50 percent reduction in the number of tokens used between sessions 1 and 2. It seems that Yuko’s vocabulary has shrunk during the first 8 to 9 months of incubation. As for Hiro, the increase in the number of vocabulary tokens indicates that she is actually speaking more in session 2 than in session 1. Nevertheless, the number of word types used during the interaction decreases. Thus, Hiro’s vocabulary can also be regarded as regressing. Although the two girls cannot be compared directly, combining their data suggests that gradual attrition takes place as the incubation period increases, with vocabulary types and tokens regressing from Yuko’s session 1 data (at an incubation period of 3 weeks): 785/5,267, to her session 2 data (at an incubation period of 9 months): 500/2,624, to Hiro’s session 1 data (at an incubation period of 13 months): 484/2,582, and to her session 2 data (at an incubation period of 21 months): 448/3,158.
Verb morphology used by Yuko and Hiro is summarized in Tables 1 and 2, respectively. Verb morphology that was examined but bore little or no data, such as auxiliary have + past participles, was excluded from the tables. Dashes indicate no data. Percentages in parentheses indicate that there were fewer than three occurrences of the form in the data. Overall, the numbers indicate that the verb morphology of the two returnees does not show marked regression. In some cases there are even indications of slight improvement, as observed in the slight increase in %SOC and/or %TLU values of some uses of the copula in Yuko’s case and in the use of can/could in Hiro’s case. However, these improvements are very slight, suggesting that they may merely be natural fluctuations in performance that could occur in any collection of oral data.
In comparison, possible signs of loss can be observed in Yuko’s use of third person singular –s and irregular verb past forms as well as in Hiro’s use of the copula was and irregular past verb forms. In general, the %SOC and %TLU values are lower in Hiro’s performance compared to Yuko’s in both sessions. The difference is conspicuous in the use of third person singular –s and in regular and irregular verb past forms in that Hiro hardly ever uses the third person singular correctly. Although her use of regular past forms improves in %SOC from session 1 to session 2, the %TLU goes down, which suggests that although she is supplying the correct past form more frequently in session 2, the use of past forms when she does use them is not quite accurate.
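The way %SOC can rise while %TLU falls is easier to see with the two measures written out. The sketch below follows the definitions given in section 5.3 (correct suppliance over obligatory contexts for %SOC, correct uses over all uses of the form for %TLU); the function names and all counts are hypothetical and are not taken from the returnees' data.

```python
def percent_soc(correct_in_obligatory, obligatory_contexts):
    """Percentage of obligatory contexts in which the correct form was supplied."""
    if obligatory_contexts == 0:
        return None  # no obligatory context: reported as no data
    return 100.0 * correct_in_obligatory / obligatory_contexts


def percent_tlu(correct_uses, total_uses):
    """Percentage of all uses of the form that were correct."""
    if total_uses == 0:
        return None  # the form was never used: reported as no data
    return 100.0 * correct_uses / total_uses


# Hypothetical counts for one verb form in two sessions.
# Session A: 10 obligatory contexts, form used 4 times, all 4 correct.
# Session B: 10 obligatory contexts, form used 12 times, 6 correct.
print(percent_soc(4, 10), percent_tlu(4, 4))    # 40.0 100.0
print(percent_soc(6, 10), percent_tlu(6, 12))   # 60.0 50.0
```

In the hypothetical session B the speaker supplies the form in more of the contexts that require it, so %SOC goes up, but also uses it more often where it is wrong, so %TLU goes down. Returning None for an empty denominator corresponds to the cells marked as having no data in the tables.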
Table 1. Verb morphology used by Yuko

                          Session 1          Session 2
Form                      %SOC     %TLU      %SOC     %TLU
Copula (am)               100      100       (100)    (100)
Copula (is)               97       90        100      88
Copula (are)              58       100       67       100
Copula (was)              94       97        100      94
Copula (were)             86       100       (0)      (0)
Auxiliary (am)            100      100       (100)    (100)
Auxiliary (is)            (100)    (100)     (100)    (100)
Auxiliary (are)           (100)    (100)     (50)     (100)
Auxiliary (was)           100      100       (100)    (100)
Modal (can)               100      100       100      80
Modal (could)             100      100       (0)      (0)
Modal (will)              (100)    100       (100)    75
Verb stem                 100      93        100      96
3rd pers. sing. (-s)      100      100       85       100
Past (regular)            95       97        100      88
Past (irregular)          94       100       82       100

Table 2. Verb morphology used by Hiro

                          Session 1          Session 2
Form                      %SOC     %TLU      %SOC     %TLU
Copula (am)               (0)      -         78       100
Copula (is)               98       98        99       93
Copula (are)              67       (100)     69       92
Copula (was)              100      100       89       89
Copula (were)             (0)      (100)     (100)    -
Auxiliary (am)            -        100       -        -
Auxiliary (is)            -        -         (100)    (50)
Auxiliary (are)           (100)    (100)     67       100
Auxiliary (was)           -        -         (100)    (100)
Modal (can)               100      67        97       100
Modal (could)             33       (0)       100      78
Modal (will)              -        -         90       90
Verb stem                 100      77        96       87
3rd pers. sing. (-s)      -        -         8        (100)
Past (regular)            33       80        63       56
Past (irregular)          89       100       73       87
5.4.2. Use of complex syntactic structures and error-free clauses
The use of complex syntactic structures by Yuko and Hiro is summarized in Table 3. It is quite obvious from the table that the success
rate of using complex clauses gradually decreases as the incubation period increases. Overall, Yuko, with a shorter incubation, performs more accurately than Hiro. In Hiro’s case, we see that she actually attempts to use more complex clauses in session 2 than in session 1, which is an indication of her willingness to communicate actively during the interaction. However, despite her attempt to use more complex clauses, her success rate goes down in session 2. Table 4 shows the percentage of error-free clauses among total clauses uttered. Here again the overall picture is the same. The percentage of error-free clauses decreases as the incubation period increases. Hiro utters many clauses in session 1, but the accuracy of those clauses is lower than that of Yuko’s in session 2.

Table 3. Percentage of attempted and successful complex clauses

Returnee                    Yuko               Hiro
Session                     1        2         1        2
%Attempted/Total            7.6      2.7       4.8      12.1
%Successful/Attempted       88.1     66.7      37.5     21.7

Table 4. Use of error-free clauses

Returnee                    Yuko               Hiro
Session                     1        2         1        2
Error-free clauses          478      460       617      201
Total number of clauses     556      563       832      492
%Error-free clauses         86       81.7      74.2     40.9
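The percentages in Table 4 follow directly from the two count rows, and the same arithmetic can be applied to any further sessions added to the corpus. The sketch below recomputes them from the counts reported in Table 4; the names `TABLE_4` and `percent_error_free` are hypothetical, and rounding to one decimal place is an assumption about how the reported figures were obtained.

```python
# (error-free clauses, total clauses) per returnee and session, from Table 4.
TABLE_4 = {
    ("Yuko", 1): (478, 556),
    ("Yuko", 2): (460, 563),
    ("Hiro", 1): (617, 832),
    ("Hiro", 2): (201, 492),
}


def percent_error_free(error_free, total):
    """Percentage of clauses containing no error, to one decimal place."""
    return round(100.0 * error_free / total, 1)


for (returnee, session), (error_free, total) in TABLE_4.items():
    print(returnee, session, percent_error_free(error_free, total))
```

Ordering the four sessions by incubation period (3 weeks, 9 months, 13 months, 21 months) gives a steadily falling percentage, which is the pattern described in the text.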
5.4.3. Tense-aspect marking
Tables 5 and 6 illustrate the distribution of verb form and lexical aspect used by Yuko and Hiro, respectively. Verb forms that co-occur the most with lexical aspect are indicated in bold. The bare numbers indicate the type and token counts of forms that appeared in the returnees’ speech. The numbers in parentheses show the percentage of that count among the subtotal counts of the form. The verb counts in these tables only include those used in finite clauses. Both verb types and tokens regress in both girls’ speech, but especially in Yuko’s performance. Yuko used a total of 151 types/441 tokens of finite verbs in session 1, whereas she used only 77 types/171 tokens in session 2, despite speaking approximately for the same length of time with the same interlocutor. Hiro, in comparison, used a total of 96 types/217 tokens of finite verbs in session 1, which only slightly regressed in token counts to 99
types/194 tokens in session 2. Nevertheless, we can see that Yuko uses more types and tokens of finite verbs in session 2 (at an incubation period of 9 months) than Hiro in session 1 (at an incubation period of 13 months).

Table 5. Distribution of verb form and lexical aspect used by Yuko

                     Session 1                     Session 2
Form   V-aspect      token (%)      type (%)       token (%)      type (%)
BASE   State          90 (54.5)     12 (22.6)       67 (63.8)      8 (32.0)
       Act.           31 (18.8)     14 (26.4)       13 (12.9)      7 (28.0)
       Acc.           14 (8.5)      11 (20.8)       13 (12.9)      4 (16.0)
       Ach.           30 (18.2)     16 (30.2)       12 (11.4)      6 (24.0)
       subtotal      165 (37.4)     53 (35.1)      105 (61.4)     25 (32.5)
PRES   State          11 (37.9)      7 (63.6)        6 (60.0)      3 (50.0)
       Act.            6 (20.7)      1 (9.1)         2 (20.0)      1 (16.7)
       Acc.            3 (10.3)      2 (18.2)        1 (10.0)      1 (16.7)
       Ach.            9 (31.0)      1 (9.1)         1 (10.0)      1 (16.7)
       subtotal       29 (6.6)      11 (7.3)        10 (5.8)       6 (7.8)
PROG   State           4 (7.2)       4 (25.0)        1 (5.0)       1 (6.3)
       Act.           38 (73.1)      4 (25.0)       11 (55.0)      7 (43.8)
       Acc.            7 (12.7)      2 (12.5)        2 (10.0)      2 (12.5)
       Ach.            6 (10.9)      6 (37.5)        6 (30.0)      6 (37.5)
       subtotal       55 (12.5)     16 (10.6)       20 (11.7)     16 (20.8)
PAST   State          35 (20.1)      7 (14.5)        4 (12.1)      4 (14.8)
       Act.           45 (25.9)     13 (27.7)       12 (36.4)     10 (37.0)
       Acc.           42 (24.1)     13 (27.7)        4 (12.1)      4 (14.8)
       Ach.           52 (29.9)     14 (29.8)       13 (39.4)      9 (33.3)
       subtotal      174 (39.5)     47 (31.1)       33 (19.3)     27 (35.1)
PERF   State          12 (66.7)      1 (4.2)         2 (66.7)      2 (66.7)
       Act.            1 (5.6)      16 (66.7)        0 (0)         0 (0)
       Acc.            4 (22.2)      4 (16.7)        1 (33.3)      1 (33.3)
       Ach.            1 (5.6)       3 (12.5)        0 (0)         0 (0)
       subtotal       18 (4.1)      24 (15.9)        3 (1.8)       3 (3.9)
TOTAL                441 (100)     151 (100)       171 (100)      77 (100)
Table 6. Distribution of verb form and lexical aspect used by Hiro

                     Session 1                     Session 2
Form   V-aspect      token (%)      type (%)       token (%)      type (%)
BASE   State          30 (21.7)      6 (10.5)       41 (30.4)      6 (14.6)
       Act.           46 (33.3)     19 (33.3)       41 (30.4)     12 (29.3)
       Acc.           28 (20.3)     16 (28.1)       33 (24.4)     14 (34.1)
       Ach.           34 (24.6)     16 (28.1)       20 (14.8)      9 (22.0)
       subtotal      138 (63.6)     57 (59.3)      135 (69.6)     41 (51.9)
PRES   State           2 (33.3)      2 (33.3)        0 (0)         0 (0)
       Act.            3 (50.0)      3 (50.0)        2 (66.7)      1 (50.0)
       Acc.            0 (0)         0 (0)           0 (0)         0 (0)
       Ach.            1 (16.7)      1 (16.7)        1 (33.3)      1 (50.0)
       subtotal        6 (2.8)       6 (2.8)         3 (1.5)       2 (2.5)
PROG   State           0 (0)         0 (0)           0 (0)         0 (0)
       Act.           16 (100)      12 (100)         5 (55.6)      5 (55.6)
       Acc.            0 (0)         0 (0)           2 (22.2)      2 (22.2)
       Ach.            0 (0)         0 (0)           2 (22.2)      2 (22.2)
       subtotal       16 (7.4)      12 (7.4)         9 (4.6)       9 (11.4)
PAST   State           8 (14.2)      4 (20.0)        5 (12.8)      3 (15.0)
       Act.            9 (16.1)      5 (25.0)       10 (25.6)      4 (20.0)
       Acc.           20 (35.7)      5 (25.0)       15 (38.5)      6 (30.0)
       Ach.           19 (33.9)      6 (30.0)        9 (23.1)      7 (35.0)
       subtotal       56 (25.8)     20 (20.8)       39 (20.1)     20 (25.3)
PERF   State           0 (0)         0 (0)           1 (12.5)      1 (14.3)
       Act.            0 (0)         0 (0)           3 (37.5)      3 (42.9)
       Acc.            1 (100)       1 (100)         1 (12.5)      1 (14.3)
       Ach.            0 (0)         0 (0)           3 (37.5)      2 (28.6)
       subtotal        1 (0.5)       1 (0.5)         8 (4.1)       7 (8.9)
TOTAL                217 (100)      96 (100)       194 (100)      79 (100)
Looking at the distribution of verb forms according to lexical aspect, we can see that overall, the Aspect Hypothesis is supported. In Yuko’s data, state verbs co-occur most with third person singular -s (PRES in the tables), activity verbs most with progressive forms (PROG), and achievement verbs most with past forms (PAST), as predicted by the hypothesis. We can also see that as the incubation period increases, the bias becomes stronger, with a higher percentage of state verbs co-occurring with PRES and a higher percentage of achievement verbs co-occurring with PAST. As for PROG, co-occurrence with activity verbs does not increase in terms of the
Testing the Primacy of Aspect
percentage of tokens but does increase in terms of the percentage of types. The result thus seems to generally support the Regression Hypothesis.

Hiro’s data is less clear-cut. There is a predominant co-occurrence between activity verbs and PROG, as predicted by the Aspect Hypothesis. As for PAST, both accomplishment and achievement verbs tend to co-occur with it. Since both types of verbs share the feature [+telic], it is understandable that they are likely to be marked with the PAST marker, which prototypically has the meaning of an endpoint. In Hiro’s data, state verbs do not predominantly co-occur with PRES. Rather, activity verbs tend to be marked. However, the frequency of PRES itself is very low, indicating that Hiro is simply not marking verbs with third person singular -s frequently enough to show a biased distribution of verb forms with particular lexical aspect. This result is not surprising, since we have already seen in section 5.4.1 that there is considerable regression in Hiro’s use of third person singular -s.

In both returnees’ data, there were few occurrences of PERFECT, as has already been shown in the verb morphology form analysis in section 5.4.1. The girls may not have fully acquired the perfect form at the end of the acquisition phase, and/or the discourse content of their interaction with the interviewer may have affected its frequency. Nothing definite can be said based on such a small amount of data.

Taken together, it seems reasonable to say that during the initial stages of attrition, the distributional bias between verb form and lexical aspect becomes increasingly skewed, in that verb markers increase in their probability of co-occurring with verbs that have similar aspectual features. After a longer incubation period, grammatical marking begins to deteriorate to such an extent that the distributional bias itself cannot be seen as clearly as in earlier stages of attrition.
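The percentages in Tables 5 and 6 give, for each verb form, the share of tokens (or types) contributed by each lexical aspect class. As a minimal sketch of that calculation, the following Python snippet recomputes the token shares for PRES and PAST in Yuko’s data from the counts in Table 5. The function name `aspect_shares` and the dictionary layout are illustrative assumptions and are not part of the original study.

```python
# Token counts for Yuko copied from Table 5, keyed by verb form and session.
# The nested-dictionary layout is an assumption made for this illustration.
YUKO_TOKENS = {
    "PRES": {
        1: {"State": 11, "Act.": 6, "Acc.": 3, "Ach.": 9},
        2: {"State": 6, "Act.": 2, "Acc.": 1, "Ach.": 1},
    },
    "PAST": {
        1: {"State": 35, "Act.": 45, "Acc.": 42, "Ach.": 52},
        2: {"State": 4, "Act.": 12, "Acc.": 4, "Ach.": 13},
    },
}


def aspect_shares(counts):
    """Return each aspect class's percentage share of one verb form's total."""
    total = sum(counts.values())
    if total == 0:
        # A form that never occurs has no meaningful distribution.
        return {aspect: 0.0 for aspect in counts}
    return {aspect: 100 * n / total for aspect, n in counts.items()}


# Compare the class predicted by the Aspect Hypothesis across sessions.
for form, predicted in (("PRES", "State"), ("PAST", "Ach.")):
    for session in (1, 2):
        share = aspect_shares(YUKO_TOKENS[form][session])[predicted]
        print(f"{form} session {session}: {predicted} = {share:.1f}%")
```

Run on these counts, the sketch gives 37.9% and 60.0% for state verbs with PRES, and 29.9% and 39.4% for achievement verbs with PAST, which matches the strengthening bias described above.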
A closer look at Hiro’s speech actually reveals that in both sessions, but particularly in session 2, she relies heavily on contextual or pragmatic devices to refer to past events. For example, she would say something like, “When I live in America, I read [BASE] books a lot” to mean “When I lived in America, I read [PAST]/used to read books a lot.” Since the interviewer knows that Hiro spent a number of years in America and then came back to Japan, she is able to understand with no difficulty that Hiro is talking about the past. This result is indeed what the Regression Hypothesis predicts.

6. Summary and conclusion
In sum, oral attrition data collected from the two female returnees indicate that in terms of a form analysis of vocabulary types/tokens, verb morphology, and of the accuracy in the use of simple and complex clauses,
there is clear evidence of attrition. As for verb morphology, not only do the forms regress, but the functions attributed to those forms also undergo attrition. In terms of the verb forms, third person singular -s, which is typically considered to be learned in the late stages of English acquisition, shows the most considerable regression. In terms of functions, the distributional bias between verb forms and lexical aspect recedes in the opposite direction of what the Aspect Hypothesis predicts; that is, the returnees become less able to express their viewpoint of a situation by imposing a certain grammatical aspect on a verb or predicate that does not coincide in lexical aspect with the grammatical marker. This regression is followed by further attrition in which the returnee starts to largely omit grammatical marking itself and rely on non-syntactic measures, such as context and pragmatic meaning, to express tense and aspect. This result, in turn, provides support for the Reverse Order Hypothesis.

7. Future research
7.1. The significance and limitations of the current study
The significance of this study is that it was able to lend support to the Aspect Hypothesis and the Reverse Order Hypothesis, which claim to depict universal characteristics in the process of language learning and loss, respectively. Evidence shown here that the two hypotheses can work together to predict linguistic performance in language attriters should encourage further lines of inquiry that effectively combine the achievements in the fields of SLA and language attrition. Such attempts will undoubtedly yield interesting insights into the study of the language development process in general.

There are, however, obvious shortcomings to this study. Quantitatively, a more careful examination of how forms in general, not just the verb expressions, are used to realize language functions should be carried out.
Furthermore, the study is limited to analyzing oral data from two returnees interviewed on only two occasions each. Similar investigations would benefit from using a larger set of corpora. At this moment, I am not aware of any large-scale corpus that consists of language attrition and/or re-learning data. Hence, the construction of such corpora is greatly needed.

7.2. Potential future research projects
As I have mentioned in section 1, the data analyzed in the present study was part of a corpus being constructed by the SLA English Research Group at TUFS. The construction and analysis of L2 attrition corpora will not only contribute to the study of language attrition, but should provide insight into the universal processes involved in language learning, loss, and
re-learning. The present study has demonstrated that with the utilization of such corpora, it is possible to test interesting hypotheses that are combined outcomes of inquiry in the fields of SLA and language attrition.

Currently, the SLA English Research Group has started conducting pilot studies on the following topics using the three types of corpora introduced at the beginning of this paper:
- analyzing the use and development of backchannelling devices in native-speaker vs. non-native speaker discourse;
- surveying communication strategies used in non-native speaker discourse based on a protocol analysis of retrospective data;
- investigating the relationship between sociopsychological factors and L2 maintenance or re-learning processes in returnees;
- exploring the possibility of using corpora of oral data to assess L2 fluency by measuring the use of smallwords in speech (a research topic inspired by a study by Hasselgren (2002)); and
- inquiring how formulaic sequences are used and acquired by L2 learners in free speech as well as in narrative discourse, and comparing the difference between ESL data and EFL data in terms of the naturalness in the choice of expressions such as idioms and phrasal verbs.

At the same time, the Group is continuously working on constructing language learner corpora of Japanese learners/attriters of ESL and EFL. Although we are still in the early stages of this work, the value of such studies is already beyond doubt.

Acknowledgement
I thank Dr. Alison Stewart at Tokyo University of Foreign Studies for commenting on an earlier version of this paper.

References
Andersen, R.W. 1991. “Developmental Sequences: The emergence of aspect marking in second language acquisition”. Crosscurrents in Second Language Acquisition and Linguistic Theories, Huebner & Ferguson (eds.) 1991. 305-324. Amsterdam: John Benjamins.
Andersen, R.W. and Y. Shirai. 1994. “Discourse Motivations for Some Cognitive Acquisition Principles”. Studies in Second Language Acquisition 16. 133-156.
Andersen, R.W. and Y. Shirai. 1996. “The Primacy of Aspect in First and Second Language Acquisition: The pidgin-creole connection”. Handbook of Second Language Acquisition, Ritchie & Bhatia (eds.) 1996. 527-570. London: Academic Press.
Bahrick, H. 1984. “Fifty Years of Second Language Attrition: Implications for programmatic research”. Modern Language Journal 68. 105-111.
Bardovi-Harlig, K. 2000. Tense and Aspect in SLA. Malden, MA: Blackwell.
Comrie, B. 1976. Aspect. Cambridge, UK: Cambridge University Press.
DeBot, K. and B. Weltens. 1991. “Recapitulation, Regression, and Language Loss”. First Language Attrition, Seliger & Vago (eds.) 1991. 31-52. Cambridge, UK: Cambridge University Press.
Ellis, R. and A. Barkhuizen. 2005. Analysing Learner Language. Oxford, UK: Oxford University Press.
Godsall-Myers, J. 1981. The Attrition of Language Skills in German Classroom Bilinguals: A case study. Dissertation Abstracts International, 43, 57A.
Hasselgren, A. 2002. “Learner Corpora and Language Testing: Smallwords as markers of learner fluency”. Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching, Granger, Hung & Petch-Tyson (eds.) 2002. 143-173. [Language Learning & Language Teaching 6]. Amsterdam: John Benjamins.
Housen, A. 2002. “A Corpus-Based Study of the L2-Acquisition of English Verb System”. Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching, Granger, Hung & Petch-Tyson (eds.) 2002. 77-116. [Language Learning & Language Teaching 6]. Amsterdam: John Benjamins.
Jakobson, R. 1962. Selected Writings, 1. Phonological studies. The Hague: Mouton.
Kuhberg, H. 1992. “Longitudinal L2-Attrition Versus L2-Acquisition, in Three Turkish Children: Empirical findings”. Second Language Research 8(2). 138-154.
Lambert, R.D. and B.F. Freed (eds.) 1982. The Loss of Language Skills. Rowley, MA: Newbury House.
MacWhinney, B. 1991. The CHILDES Project: Tools for analyzing talk. Hove and London: Lawrence Erlbaum.
Moorcroft, R. and R.C. Gardner. 1987. “Linguistic Factors in Second Language Loss”. Language Learning 37(3). 327-340.
Shirai, Y. 1991. Primacy of Aspect in Language Acquisition: Simplified input and prototype. Unpublished doctoral dissertation, University of California, Los Angeles.
Shirai, Y. and R.W. Andersen. 1995. “The Acquisition of Tense-Aspect Morphology”. Language 71. 743-762.
Ueda, M. and M. Nishikawa. 2004. 植田恵、西川惠. 『日本人英語学習者の学習者言語コーパス基礎調査』言語情報学研究報告 5. 吉冨、根岸、海野(編). 110-115. 東京外国語大学 21 世紀 COE「言語運用を基盤とする言語情報学拠点」[“A Survey of Learner Language Corpora of Japanese Learners of English”. Working Papers in Linguistic Informatics 5. Yoshitomi, Negishi, and Umino (eds.) 110-115. Tokyo University of Foreign Studies, 21st Century COE: Usage-Based Linguistic Informatics.]
Vendler, Z. 1967. “Verbs and Times”. Linguistics in Philosophy. Vendler (ed.) 97-121. Ithaca, NY: Cornell University Press.
Weltens, B. 1987. “The Attrition of Foreign Language Skills: A literature review”. Applied Linguistics 8(1). 22-38.
Yoshida, K., K. Arai, T. Fujita, T. Hattori, K. Nagano, A. Okamura, M. Tanaka, K. Yanaura, and A. Yoshitomi. 1989. 吉田研作他 『帰国子女の外国語保持に関する一調査』『帰国子女の外国語保持に関する調査研究報告書』 第 1 巻. 12-28. 海外帰国子女教育振興財団. [“On the Retention of a Foreign Language by Returnees”. A Survey of the Foreign Language Retention of Returnees, Vol. 1. 12-28. Tokyo: Kaigai Shijo Kyooiku Shinkoo Zaidan.]
Yoshitomi, A. 1992. “Towards a Model of Language Attrition: Neurobiological and psychological contributions”. Issues in Applied Linguistics 3(2). 293-318.
Yoshitomi, A. 1999. “On the Loss of English as a Second Language by Japanese Returnee Children”. Second Language Attrition in Japanese Contexts, Hansen (ed.) 1999. 80-111. New York: Oxford University Press.
Corpus-based Analysis of Lexical Errors of Advanced Japanese Learners
Ayano SUZUKI and Tae UMINO

0. Introduction
In language education settings, instructors observe their students and constantly employ measures to address the difficulties that their students encounter. One such measure is employed with regard to teaching materials. The development of teaching materials from corpus-based analysis is a large-scale, systematic extension of this daily process. By analyzing a learner corpus as a direct representation of a large number of students and using this knowledge to create teaching materials, we believe that it is possible to develop materials that will be highly valuable for learners.

The authors have been involved in the compilation of the ‘Japanese Composition Database of Advanced Learners’, which is a collection of essays by international students in the Japanese program at the Tokyo University of Foreign Studies. The work of approximately 80 international students, totaling 326 pieces, was digitized and assembled (see section 1). Suzuki (2006) used a preliminary version of this corpus to analyse the lexical errors of advanced Japanese learners. The analysis showed that failure to understand collocation was an obstacle to the use of natural Japanese expressions by these learners. However, present materials for teaching the use of collocation are not sufficient. In this light, to teach collocation, we have developed a trial textbook, ‘Learn them together! Verbs and nouns’. This textbook is still at a preliminary stage and not in a unified form. In this paper, we will describe this textbook as an effort to develop teaching materials based on a learner language corpus and explore the possibilities of developing teaching materials based on corpus-based analysis.

1. Developing the ‘Japanese Composition Database of Advanced Learners’
This section will describe and review the features of the learner language corpus.
According to Richards, Platt and Weber (1985), learner language is the type of language produced by second- and foreign-language learners who are in the process of learning a language. The learner language displays unique characteristics that are not exhibited by a native speaker of that language. A learner language corpus is a collection of such
learner language data texts. Furukawa (2006) describes research using a Japanese learner language database. According to this research, compared with ordinary language corpora, it is difficult to compile sufficient natural language use data for learner languages, particularly for languages with a small number of learners. For this reason, databases are often created by using learner language data acquired from educational settings. In addition, learner language corpora have mainly been used for education or for research on lexical error analysis and second language acquisition. However, very few studies have used these corpora for developing teaching materials.

This research employed the ‘Japanese Composition Database of Advanced Learners’ (the 21st Century COE Program ‘Usage-Based Linguistic Informatics’) compiled by the authors. The database was derived from a test database consisting of the work of approximately 150 international students in the Japanese program at the Tokyo University of Foreign Studies. The public version of the database includes 326 essays by 80 students who gave permission for their essays to be published. They had all passed Level 1 of the Japanese Language Proficiency Test prior to seeking admission to the university. The source data was taken from essays submitted to a Japanese writing course for international students during a five-year period. The instructors for this course during this time were Tae Umino and Futoshi Kawamura. For this course, the instructors collected and analysed examples of errors as a part of composition lessons at the beginning of the course, and worked toward creating teaching materials (Umino & Kawamura 2004). In order to systematize this work, efforts began to convert the accumulated documents into a corpus.
The essays were written on different topics depending on the educational objectives and were based on five themes: ‘introduction of a friend’, ‘self-introduction’, ‘position paper’ (based on one of five pre-determined topics), ‘description’ (a description of an advertisement of the student’s choice with comments on its effect), and ‘report’ (the student reads and comments on a paper entitled ‘Language and Identity’ while citing references). The uncorrected versions of these essays are contained in the database. Essays on ‘introduction of a friend’, ‘self-introduction’, and ‘position paper’ were handwritten during the lessons in the classroom and were required to be 800 characters in length. The other essays, on ‘description’ and ‘report’, were required to be 2000 characters long and were assigned as homework during summer and winter vacations. Essays on these two themes included both handwritten and typed work. The first languages of these learners were mainly Chinese (Mandarin), Korean, Chinese and Korean, Chinese (Shanghainese), and Mongolian, with
a small number with Cantonese, Taiwanese, Lao, Thai, and Arabic. Overall, the majority of learners had a Chinese character-based language as their first language. All of these learners had passed Level 1 of the Japanese Language Proficiency Test prior to seeking admission to the university, and could be considered as being at approximately the same level.

A sample of the actual essay data is shown in Appendix A, largely unchanged from the original, digitized and laid out as a Word document. Erroneous characters that could not be typed or recognized are given in the endnotes. All of the essay data have been assigned an ‘essay number’. Each essay also shows the topic, whether it was handwritten or typed, and the learner’s individual number and first language.

In addition to the essay data, an essay index was created. The index can be used to find the ‘first language symbol’, ‘typed essays’, and ‘missing essays’. It is also possible to search and display essays based on the following four criteria: first language, theme, learner, and handwritten/typed.

This corpus provides language data from advanced learners of Japanese who have a similar level of proficiency. It also provides a certain amount of learner language data controlled for first language or essay topic. In this manner, the corpus is useful for understanding certain features of advanced learner language. We believe that analyses of these features will have a significant effect on instruction and on teaching materials that better meet the needs of learners at this level. The following sections present an analysis of lexical errors using this corpus and possible directions for developing teaching materials.

2. Analysis using the ‘Japanese Composition Database of Advanced Learners’
As stated earlier, the ‘Japanese Composition Database of Advanced Learners’ can be useful for understanding tendencies of the Japanese of advanced learners.
Suzuki (2006) used this corpus to analyse lexical errors made by advanced learners. The findings of this study are outlined below.

2.1. Summary of research: Suzuki (2006)
The objective of this research was to categorize and analyse lexical errors found in the written work of advanced learners of Japanese, and to reveal the causes of the errors that result in ‘unnaturalness’. These advanced learners were able to use Japanese without too much trouble in everyday life and were at a level where they were able to express complex and abstract ideas. However, their Japanese may have been wrong or somehow ‘unnatural’, and their lexical errors were considered to be one of the common causes of ‘not mistaken but unnatural’ Japanese. In this research,
‘error’ has been defined in keeping with the definition proposed by Ellis (1994): ‘Deviation from the norms of the target language’. The analysis also covered elements judged to be unnatural or inappropriate.

This analysis used a preliminary version of the ‘Japanese Composition Database of Advanced Learners’, which included 567 pieces written by approximately 150 individuals. Three hundred and thirty essays on the three themes of ‘introduction of a friend’, ‘self-introduction’, and ‘position paper’ were the subject of this analysis. Only three of the five themes available were selected because these essays were all handwritten in the classroom.

Errors were investigated based on the ‘sentence grammar’ and ‘discourse grammar’ standards from Nagatomo and Sakota (1987) and the ‘error identification process’ (fig. 1). ‘Sentence grammar’ refers to the system of grammatical rules for sentences taken as the basic unit. ‘Discourse grammar’, on the other hand, refers to the system of linguistic rules in the context of discourse, going beyond the sentence as the basic unit. Using these two standards, it is possible to clarify the norms of the target language. Further, this covers both grammatical errors and discourse errors, the latter being cases in which the learner’s intention is unclear despite there being no grammatical error.

IN
 ↓
Is the sentence grammar correct?
    NO → Flagged as an error in sentence grammar and recorded together with the corrected sentence.
    YES ↓
Is it permissible in terms of discourse grammar?
    NO → Flagged as an error in discourse grammar and recorded together with the corrected discourse.
    YES ↓
OUT

Figure 1. Error identification process (Nagatomo and Sakota 1987: 145)
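The decision flow in Figure 1 can be sketched as a small function. This is only an illustration: in the study the two judgments are made by human raters, so they are passed in here as booleans, and the names `Judgment` and `classify_sentence` are hypothetical. The sketch also assumes that a sentence flagged for sentence grammar is not checked further, which the figure does not state explicitly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Judgment:
    """A rater's verdicts on one learner sentence (hypothetical record)."""
    sentence_grammar_ok: bool
    discourse_grammar_ok: bool
    correction: Optional[str] = None  # corrected sentence or discourse, if any


def classify_sentence(judgment: Judgment) -> str:
    """Follow the two questions of Figure 1 in order."""
    if not judgment.sentence_grammar_ok:
        # First question fails: record as a sentence-grammar error.
        return "sentence-grammar error"
    if not judgment.discourse_grammar_ok:
        # Grammatical, but not permissible in the discourse context.
        return "discourse-grammar error"
    # Both questions answered YES: the sentence leaves the process unflagged.
    return "no error"


# A grammatical sentence that reads unnaturally in context.
example = Judgment(sentence_grammar_ok=True, discourse_grammar_ok=False)
print(classify_sentence(example))  # discourse-grammar error
```

Keeping the correction alongside each verdict mirrors the figure’s instruction that every flagged item is recorded together with its corrected form.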
This ‘error identification process’ helped in extracting and identifying the errors. The errors were then categorized in the manner suggested by Hosokawa (1990). The final categories were as follows:
1. Coinage (231 errors)
  1.1 Combination of kanji compounds (72 errors)
      e.g.) * gogaku-sensei → gogaku no sensei [a teacher of languages]
            * benkyoo-seikatsu → benkyoo-bakarino seikatsu [a life dominated by study]
  1.2 Usage of first language or other learner language, including coinages (93 errors)
      e.g.) * shizenteki na kootoo-hyoogen → shizen-na kootoo-hyoogen [natural Japanese expressions]
            * honyaku ni naritai → honyaku-ka ni naritai [I want to become a translator.]
  1.3 Overgeneralization (66 errors)
      e.g.) * ningentte, tenkei to hi-tenkei ga aru → tenkeiteki na hito to soudenai hito ga iru [There are people that are typical, and those that aren’t.]
2. Errors relating to similar expressions (265 errors)
  2.1 Inappropriate use of kanji compounds (145 errors)
      e.g.) * tyooju-suru → nagaiki-suru [to have a long life]
  2.2 Use of synonymy (94 errors)
      e.g.) narau and manabu; shiru and wakaru
  2.3 Use of expressions of similar form or pronunciation (26 errors)
      e.g.) mata and mada; amari and amarini
3. Collocation errors (106 errors)
  3.1 Verb and noun collocation (87 errors)
      e.g.) * hitori-gurashi o yaru → hitori-gurashi o suru [living alone]
  3.2 Adjective and noun collocation (19 errors)
      e.g.) * erai hoohu → rippana hoohu [an admirable aspiration]
4. Redundancy (23 errors)
      e.g.) * kaisya no OL → OL [a woman office worker]
            * saisyo atta toki no daiichi-insyoo → daiichi-insyoo [first impression]
5. Errors using idioms (5 errors)
      e.g.) * kokoro o oni ni shite yume o suteta → namida o nonde yume o suteta [I gave up the dream though it was very regrettable.]
6. Other errors in meaning or use (67 errors)
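The category counts above can be tallied with a few lines of Python to see how much of the data the three largest categories cover. This is a minimal sketch: the dictionary keys are shortened labels chosen for the illustration, and `share_of_total` is a hypothetical helper. Note that the six counts listed above sum to 697, while section 2.2 reports a total of 702 errors; the sketch uses the reported total as the denominator.

```python
# Error counts per top-level category, copied from the list above.
ERROR_COUNTS = {
    "Coinage": 231,
    "Similar expressions": 265,
    "Collocation": 106,
    "Redundancy": 23,
    "Idioms": 5,
    "Other": 67,
}

# Total number of errors as reported in the running text of section 2.2.
REPORTED_TOTAL = 702


def share_of_total(categories, counts=ERROR_COUNTS, total=REPORTED_TOTAL):
    """Percentage of all reported errors that fall in the given categories."""
    return 100 * sum(counts[name] for name in categories) / total


top_three = ("Coinage", "Similar expressions", "Collocation")
print(sum(ERROR_COUNTS[name] for name in top_three))  # 602
print(f"{share_of_total(top_three):.1f}%")
```

The result falls between 85% and 86%, in line with the figure of 85% given in the text.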
2.2. Tendencies in lexical errors of advanced learners
Among the categories listed in 2.1, the most common errors were ‘coinage’ (231), errors relating to ‘similar expressions’ (265), and ‘collocation errors’ (106), which together accounted for 85% of the total of 702 errors. This result indicates the tendency in the lexical errors of advanced learners. The following section provides a more detailed analysis of the collocation errors. For
analyses of the other categories, please refer to ‘Working Papers in Linguistic Informatics 10’.

2.2.1. Collocation errors
Benson et al. (1986) defined collocation as a ‘recurrent, semi-fixed combination’. According to this definition, there are two types of collocations: grammatical collocation and lexical collocation. A grammatical collocation ‘is a phrase consisting of a dominant word (noun, adjective, verb) and a preposition or grammatical structure such as an infinitive or a clause’. Lexical collocations, ‘in contrast to grammatical collocations, normally do not contain prepositions, infinitives, or clauses’. In Suzuki (2006), the focus was on lexical errors, and therefore the collocation errors considered were limited to lexical collocation.

The collocation errors extracted were divided into verb-noun collocations and adjective-noun collocations. In terms of frequency, errors in verb-noun collocations were observed more often than errors in adjective-noun collocations. The previous work by Suzuki (2002) contained only one example of an adjective-noun collocation error: *ninki ga ookatta → ninki ga takakatta [It had a high popularity.]. The current analysis identified many others, such as *nooryoku o hukameru → nooryoku o takameru [increasing abilities] and *erai hoohu → rippana hoohu [an admirable aspiration].

In verb-noun collocation errors, cases relating to the verb ‘suru’ were frequently observed. For example, *hitori-gurashi o yaru was used erroneously instead of hitori-gurashi o suru [living alone]; here suru was confused with yaru. As another example, *sai-syuppatsu o hajimarimashita was used to mean sai-syuppatsu o suru [to restart]: the verb ‘suru’ has been replaced with another verb.
Other examples include *shiken ni sanka-suru → shiken o ukeru [to take a test] and *sutoresu o kaisan-suru → sutoresu o kaisyoo-suru [to relieve stress]. Among verb-noun collocation errors, those involving suru were especially prominent. Many errors resulted from the inability to distinguish between the usage of synonyms such as -o suru and -o yaru, suru and naru, and tsutomeru and tsuku. These have been categorized as collocation errors rather than errors involving similar expressions because their main characteristic is a failure to differentiate which of the two verbs is used with a given noun. However, the similarity in meaning is also part of the cause of the errors. Many of these synonyms are very basic, and hence it is unlikely that a misunderstanding of the meaning of the words causes the errors.
Therefore, while examining the collocation errors, it is necessary to consider the relationships between synonymous verbs and adjectives as groups or pairs, rather than only relationships between nouns and verbs or nouns and adjectives. In groups (or pairs) of synonyms, there are words that form a given collocation and those that do not. When introducing words, it is necessary to teach their meanings together with the collocations they can form. In particular, for suru and yaru, suru and naru, and other cases where suru is confused with other verbs, it is necessary to clearly indicate the verb that is used with a noun. Alternatively, collocation from the perspective of synonyms could be taught together at one stage of instruction. For example, for suru and yaru, the nouns collocated with suru and those collocated with yaru could each be grouped. On their first appearance, these are presented without order, but it is necessary to organize them at some point.

Other frequently observed collocation errors included those related to rates or probability and those related to ability. Collocations related to probability appeared only in the essays on the theme of ‘description’. However, these errors almost always appeared when probability was dealt with, e.g. *hanzairitsu ga hueru → hanzairitsu ga agaru [the crime rate increased]. Words expressing rates or probability (words with ‘-ritsu’) collocate with agaru [increase] and sagaru [decrease], whereas hueru and heru collocate with nouns that express number (such as amount or frequency). Learners who used hueru and heru in *hanzairitsu ga hueru were probably confusing the expression with the collocation for number, as in hanzai ga hueru [the number of crimes increased].

I. Verb-Noun Collocation
(1) suru and yaru
    (Erroneous) → (Correct)
    *arubaito o yaru → arubaito o suru [to have a part-time job]
    *hitori-gurashi o yaru → hitori-gurashi o suru [to live alone]
    *nihongo o suru → nihongo o yaru [to learn (do) Japanese]
(2) suru and naru
    (Erroneous) → (Correct)
    *mutyuu suru → mutyuu ni naru [to be engrossed]
    *jootatsu ni naru → jootatsu suru/joozu ni naru [to improve or become good at something]
    *hanpirei ni naru → hanpirei suru [to be inversely proportionate]
(3) suru and other verbs
    (Erroneous) → (Correct)
    *saisyuppatsu o hajimaru → saisyuppatsu o suru [to restart]
    *hanashi o kawasu → hanashi o suru/kotoba o kawasu [to converse]
    *satsujin o suru → satsujin o okasu/hito o korosu [to commit murder / to kill someone]
(4) Collocation related to work
    (Erroneous) → (Correct)
    *shigoto ni tsutomeru → shigoto o suru [to work]
    *kyoosyoku o suru → kyoosyoku ni tsuku [to take up a job in the field of education]
(5) Collocation relating to probability/rates
    (Erroneous) → (Correct)
    *hanzairitsu ga hueru → hanzairitsu ga agaru [the crime rate increased]
    *hanzairitsu ga heru → hanzairitsu ga sagaru [the crime rate decreased]
(6) Others
    (Erroneous) → (Correct)
    *shiken ni sanka-suru → shiken o ukeru [to take a test/to sit an examination]
    *sutoresu o kaisan-suru → sutoresu o kaisyoo-suru [to relieve stress]
    *haru no kaori ga kanjitekuru youna kyoositsu de → haru no kaori ga tadayottekuru youna kyoositsu de [in the classroom where the smell of spring is in the air]

II. Adjective-Noun Collocation
    (Erroneous) → (Correct)
    *nooryoku o hukameru / masu → nooryoku o takameru [to raise one’s ability]
    *nooryoku ga heta → nooryoku ga hikui [to have no potential]
    *erai hoohu → rippana hoohu [an admirable aspiration]
    *tairyoku ga yowai → tairyoku ga nai [to lack physical strength]
    *kookishin ga ooi → kookishin ga tsuyoi [to have a strong curiosity]
    *tsunagari ga yowai → tsunagari ga usui [to have weak connections]

Figure 2. Collocation errors
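The noun-verb pairings in Figure 2 can be pictured as a small lookup table. The following Python sketch is purely illustrative and is not a tool described by the authors: the table holds only a handful of entries copied from the ‘Correct’ column of Figure 2, particles such as o, ga, and ni are ignored, and the names `COLLOCATIONS` and `check_collocation` are hypothetical.

```python
# Nouns mapped to verbs they collocate with, taken from the "Correct"
# column of Figure 2. A toy table for illustration, not a full dictionary.
COLLOCATIONS = {
    "arubaito": {"suru"},
    "hitori-gurashi": {"suru"},
    "saisyuppatsu": {"suru"},
    "shiken": {"ukeru"},
    "sutoresu": {"kaisyoo-suru"},
    "hanzairitsu": {"agaru", "sagaru"},
}


def check_collocation(noun, verb):
    """Return (verdict, accepted verbs) for a noun-verb pair.

    The verdict is True or False for nouns in the table, and None when
    the noun is not covered, since nothing can be said about it.
    """
    accepted = COLLOCATIONS.get(noun)
    if accepted is None:
        return None, []
    return verb in accepted, sorted(accepted)


print(check_collocation("hitori-gurashi", "yaru"))  # (False, ['suru'])
print(check_collocation("hanzairitsu", "agaru"))    # (True, ['agaru', 'sagaru'])
```

Listing the accepted verbs next to the verdict reflects the point made above that the correct verb for a noun should be indicated explicitly rather than left to the learner to infer from meaning.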
3. Toward developing teaching materials on collocation for advanced learners
3.1. The need for teaching materials on collocation
As shown above, the ‘Japanese Composition Database of Advanced Learners’ contains a large number of the following three types of errors: coinages, errors related to similar expressions, and collocation errors. Among these three types of errors, coinage and confusion between similar expressions have been investigated to a certain extent by previous research (for example, Sato & Rou (1993)). However, apart from Suzuki (1999, 2002), collocation errors have not been well documented. In addition, Akimoto (1993) and Taniguchi (2001) have also argued for the necessity of instruction on collocation. Akimoto (1993) used tests to survey the compound word ability of intermediate learners. Taniguchi (2001) extracted and analysed collocation from existing beginner textbooks and
found that standards for treating collocations in Japanese language education were lacking. This was the impetus for focusing on collocation and for developing teaching materials for collocation based on learner language analysis.

To date, not many teaching materials and resources have been developed to teach collocation in Japanese. Some examples are: (1) Pea de oboeru iroirona kotoba: sho, tyuukyuu gakusyuusya no tameno rengo no seiri [Remembering different words in pairs: Organizing compound words for beginners and intermediate learners] by Akimoto and Aruga (1996), and (2) Nihongo o migakoo: Meishi, dooshi kara manabu rengo rensyuu-tyoo [Polishing your Japanese: A practice book for learning compound words from verbs and nouns] by Kanda, Sato and Yamada (2002).

The former is aimed at beginner and intermediate learners. Basic collocations were extracted from five standard Japanese textbooks and divided into 33 lessons by theme. A unique feature of this textbook is that it has an index for both verbs and nouns at the end. This is a very effective textbook for learning collocation. Its contents and the substance and quantity of the drills are appropriate and of superior quality. However, the collocations contained in this textbook are elementary and not suitable for advanced learners.

The latter (Nihongo o migakoo: Meishi, dooshi kara manabu rengo rensyuu-tyoo) is targeted toward intermediate and advanced learners. It focuses on collocations found in newspaper editorials and columns. As a result, the collocations addressed in each lesson lack unity. Further, many of the collocations shown are close to idioms, that is, expressions consisting of a fixed, exclusive set of words expressing a certain meaning as a whole.

These teaching materials are useful for learning collocations related to certain topics or themes. However, neither of them is completely sufficient when considered against the actual state of the learner language.
As mentioned in section 2.2.1, two observations can be made from the error analysis of the learner language corpus. The first is that collocation errors are more frequent with certain verbs. The second is that learning difficulties arise not only in understanding the relationship between nouns and verbs but also in differentiating the usage of verbs with similar meanings (for example, suru and yaru, or tsutomeru and tsuku). These issues were taken into account in proposing a new textbook as a trial version. In the following sections, we outline this trial textbook in order to discuss the potential of using a learner language corpus for material development.
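The two observations above could be checked with a simple tally over error-annotated data. The following is a minimal sketch in Python, assuming each collocation error has already been annotated by hand with the verb the learner wrote and the verb that was expected. The record type `CollocationError`, the function names, and the four sample records are hypothetical illustrations, not data from the corpus.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class CollocationError(NamedTuple):
    """One annotated noun-verb collocation error (hypothetical record layout)."""
    noun: str         # noun in the learner's sentence
    used_verb: str    # verb the learner wrote
    target_verb: str  # verb the annotator judged appropriate


def errors_per_verb(errors: Iterable[CollocationError]) -> Counter:
    """Count how often each verb was used in an erroneous collocation."""
    return Counter(e.used_verb for e in errors)


def confused_pairs(errors: Iterable[CollocationError]) -> Counter:
    """Count unordered verb pairs, so that A-for-B and B-for-A add up."""
    return Counter(
        frozenset((e.used_verb, e.target_verb))
        for e in errors
        if e.used_verb != e.target_verb
    )


# Invented sample records, for illustration only ("N1" is a placeholder noun).
SAMPLE = [
    CollocationError("shigoto", "tsutomeru", "tsuku"),
    CollocationError("kyooshoku", "tsutomeru", "tsuku"),
    CollocationError("kaisha", "tsuku", "tsutomeru"),
    CollocationError("N1", "yaru", "suru"),
]

by_verb = errors_per_verb(SAMPLE)
pairs = confused_pairs(SAMPLE)
top_pair, top_count = pairs.most_common(1)[0]
print(by_verb.most_common())
print(sorted(top_pair), top_count)
```

Counting unordered pairs is a design choice made here so that confusion in either direction contributes to the same verb group, which matches the idea of organizing lessons around pairs of similar verbs.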
Ayano SUZUKI and Tae UMINO
3.2. Development and structure of the trial textbook: Issyoni oboeyo! Dooshi to Meishi [Learn them together! Verbs and Nouns]
3.2.1. Development of the trial textbook
As discussed in 3.1, the learner language corpus could be a useful resource for developing teaching material which meets learners’ levels. However, how is the analysis actually used? For this project, the following procedure was followed. First, the collocations requiring the most instruction were extracted on the basis of the learner language corpus analysis. Next, the most easily confused verb pairs and groups were selected from the extracted collocations; this selection was also based on the corpus analysis. The selected verbs and the noun groups they collocate with were then extracted. This step was carried out using the error analysis and by referring to the Japanese language collocation dictionary, Nihongo Hyoogen Katsuyoo Jiten. Through this process, the collocations to be taught were determined. Explanations and practice problems were created while paying attention to the different usages of similar verbs and adjectives. Practice problems were either newly created or taken from sentences found in the error analysis, because this allows the text to better address the actual needs of learners. A detailed description of the trial textbook created through this process is given below.
3.2.2. Objective and structure
The trial textbook considered in this paper, Issyoni oboeyo! Dooshi to Meishi [Learn them together! Verbs and Nouns], is aimed at advanced learners. Figure 3 gives an outline of the textbook. The actual samples are in Appendix B. The textbook has two main characteristics. The first is the focus on collocation resulting from the outcome of the learner language corpus analysis. The second is the construction of the syllabus based on synonymous verb groups.
As stated before, existing teaching materials show a tendency to organize collocations by theme or topic of the texts used. In contrast, this textbook pays attention to differentiating the usages of similar verbs to select and organize collocations. In the same way, practice problems are targeted at differentiating usages of similar verbs. Moreover, the textbook includes writing activities in order to help students to apply what they learn, in addition to remembering collocations. An actual syllabus might contain suru and yaru, suru and naru, manabu and narau, agaru/sagaru and hueru/heru. In this section, we will introduce the first lesson from the trial textbook, tsutomeru and tsuku.
The contents of each lesson are organized in the following manner. First, a text of approximately 250 characters is shown; this text includes the collocations to be taught in that lesson. In the sample, the underlined sections are the collocations for the first lesson. The texts could be diary entries, conversations, or short speeches with foreign students acting as the main characters. The text used in this lesson is a self-introduction by a student. Next, the lesson shows the collocation pairs and groups to be taught. In this lesson, the pair is tsutomeru and tsuku. Other possibilities include groups such as agaru/sagaru and hueru/heru. The noun groups presented were selected from the collocation dictionary Nihongo Hyoogen Katsuyoo Jiten. The groups are presented with case particles alongside explanations of differences in usage. These two sections serve as an introduction and make the learner aware of the collocations to be taught in the lesson. These collocations are then practiced in a drill section. The drills include newly written text as well as text derived from the error analysis. The end of the lesson contains two sections: ‘column’ and ‘practice writing’. The column is a 500-character piece related to the content of the main text that introduces an aspect of Japan. The level of difficulty of this piece is slightly higher than that of the main text. The column is followed by a ‘Q&A’ section, which precedes the practice writing section. In the ‘practice writing’ section, learners are asked to write their own thoughts on the topic of the ‘column’. Through these activities, we believe that learners will be able to learn how collocations are actually used. The structure and content of the lessons are summarized below.
Target: Advanced learners of Japanese / For self-study (classroom use is also possible)
Selection of collocations:
  1) A list of collocations is selected based on the results of the error analysis
  2) A pair or group of easily confused synonyms is introduced in each lesson
  3) Paired nouns are selected using the Nihongo Hyoogen Katsuyoo Jiten
Contents:
  (1) suru, yaru
  (2) suru, naru
  (3) suru, other verbs
  (4) tsutomeru, tsuku (collocations related to ‘work’)
  (5) agaru/sagaru, hueru/heru
  (6) manabu, narau
  (7) Other collocations
Structure of lessons: Main text → Collocation list and simple explanations → Drills → Column → ‘Practice writing’
Notes: For vocabulary beyond the range of Level 1 of the Japanese Language Proficiency Test, furigana and simple explanations will be included.
Figure 3. Outline of the trial textbook Issyo ni oboeyo! Dooshi to Meishi
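The lesson structure outlined in Figure 3 could be held in a small data structure if the materials were prepared electronically. The sketch below is only one possible representation: the class `Lesson`, its field names, and the method `verb_for` are assumptions made for illustration, and the noun lists are a subset of those printed in Appendix B for the first lesson.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Section order as given in Figure 3.
SECTION_ORDER = ("main text", "collocation list", "drills", "column", "practice writing")


@dataclass
class Lesson:
    """One lesson of the trial textbook (hypothetical representation)."""
    title: str
    main_text: str
    collocations: Dict[str, List[str]]  # verb -> nouns that pair with it
    drills: List[str] = field(default_factory=list)
    column: str = ""
    writing_prompts: List[str] = field(default_factory=list)

    def verb_for(self, noun: str) -> Optional[str]:
        """Return the verb listed with this noun, or None if the noun is not listed."""
        for verb, nouns in self.collocations.items():
            if noun in nouns:
                return verb
        return None


# Subset of the noun lists shown in Appendix B (both verbs take the particle に).
lesson1 = Lesson(
    title="勤める / 就く",
    main_text="(self-introduction text, omitted here)",
    collocations={
        "勤める": ["企業", "会社", "銀行"],
        "就く": ["教職", "仕事", "定職"],
    },
)

print(lesson1.verb_for("銀行"))
print(lesson1.verb_for("教職"))
```

A lookup of this kind checks only which verb is listed for a noun; the drills in Appendix B additionally require the learner to conjugate the verb, which this sketch does not attempt.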
3.2.3. Use and possible applications
The target group for this textbook is advanced learners engaged in self-study. As stated above, the main text and collocation list are aimed at making the learner aware of the collocations to be taught in that lesson. At this stage the learner only needs to understand the content of the main text, rather than simply memorize the collocation list. These collocations are practised in the drill section. The drills comprise ‘fill in the blanks’ type exercises using collocations from the collocation list. It is desirable at this point for the learner to practice the drills several times and remember the collocations. The ‘column’ and ‘practice writing’ sections at the end are an expansion of the rest of the lesson. The ‘column’ is meant not only for the learner to read but also to provide an opportunity for reflection. By writing his/her thoughts, the learner can learn to use the collocations presented in the drill. This textbook could be used in the same way for advanced learners in the classroom. Further, this text could be used by students aiming to pass Level 1 of the Japanese Language Proficiency Test, or by students of similar ability, since its vocabulary has been selected from that level. For this purpose, the instructor will need to introduce and explain new vocabulary. In the classroom, students might be directed to read and discuss the ‘column’ section before beginning to write. Communicating thoughts through speech is an effective way of preparing students to write. In anticipation of this use, the ‘practice writing’ section contains questions such as ‘What is it like in your own country?’ This might be very interesting in a class with students from various backgrounds.
3.3. Remaining questions and issues for the future
So far, two issues have appeared through the discussion on the trial textbook. In this section, we will discuss questions and issues that became apparent through the creation of the textbook.
First, it is apparent that the current analysis is insufficient. For the textbook to be considered as a whole, the data from the error analysis is insufficient for determining the kind of collocations that should be selected or the verbs that should be covered in the same lesson. More essay data needs to be analysed or a different type of data needs to be gathered and analysed in order to address this issue. In addition, it will be necessary to examine more literature in order to understand the other synonyms that have been identified as potentially difficult for selecting the correct collocation. Second, it will be necessary to analyse the meanings of the synonyms themselves in order to actually write each lesson. For example, the error analysis showed that suru and yaru were easily confused synonyms. When
creating teaching materials on these two verbs, it soon becomes apparent that there exists more than one possible relationship between verbs and nouns. In addition to the meaning of the verb, the context and composition of the text also contribute to the relationship between verbs and nouns. Therefore, it will be necessary to consider how far the analysis should go and the extent to which it should be reflected in the teaching material.
4. Conclusion
In this paper, we have discussed the analysis of the learner language corpus and the development of teaching materials based on that analysis. By using the learner language corpus analysis, it is possible to create teaching materials that better meet the actual needs of learners. However, there are several points requiring further consideration. The first relates to the quantity of data. As mentioned above, the analysis of just one corpus does not produce sufficient content for the production of teaching materials. This issue can be addressed by using other corpuses. The second relates to the quality of data. The data used in the corpus is close to what Ellis (1994) refers to as ‘natural language use data’. The types of linguistic structures that appear in this data cannot be free from the influence of the essay topic or the instructions provided in class. Further, it is impossible to acquire comprehensive data for different structures. These limitations can be overcome by combining the data with other types such as ‘elicited language use data’, as suggested by Ellis (1994). Furthermore, in the process of developing teaching materials, it is clear that work from wider areas of research must be considered. This extends beyond the analyses of the structure and meaning of the content to be included in the teaching material. By using the results obtained from both cross-sectional and longitudinal research, it may be possible to address some of the limitations of the corpus.
With these points in mind, future research will focus on systematizing the development process for second language teaching materials using learner language corpuses.
References
Akimoto, M. 1993. Goi kyooiku ni okeru rengo shidoo no igi ni tsuite [On the Significance of Collocations in Vocabulary and Language Teaching]. The Proceedings of the 4th Conference on Second Language Research in Japan, pp. 29-51.
Akimoto, M. & Aruga, C. 1996. Pea de oboeru iroirona kotoba: syo, tyuukyuu gakusyuusya no tameno rengo no seiri [Remembering different words in pairs: Organizing compound words for beginners and intermediate learners]. Tokyo: Musashino-syoin.
Benson, M., Benson, E. & Ilson, R. 1986. The BBI combinatory dictionary of English: a guide to word combinations. Amsterdam: J. Benjamins.
Ellis, R. 1994. The Study of Second Language Acquisition. Oxford: Oxford University Press.
Furukawa, A. 2006. Nihongo no gakusyuusya gengo koopasu ni kansuru kiso tyoosa [Basic research on Japanese learner-language corpora]. Working papers in Linguistic Informatics 10: Course materials, Evaluation, Second Language Acquisition (SLA). The Graduate School of Area and Culture Studies, Tokyo University of Foreign Studies, pp. 243-252.
Himeno, M. 2004. Nihongo Hyoogen Katsuyoo Jiten [Dictionary of Collocations in Japanese]. Tokyo: Kenkyuusya.
Hosokawa, H. 1990. Furansujin no nihongo sakubun ni okeru goyoo to sono syurui [Errors and their types in the Japanese writing of the French learners]. Studies in humanities by the College of Liberal Arts, Kanazawa University 27(2), pp. 119-160.
Inagaki, S. 1976. Gaikokujin-gakusei no ‘kaku’ koto ni yoru hyoogenryoku -sakubun no naka no goyoo-rei kara- [Examples of errors found in the expressive compositions by overseas students]. Annual Reports 1, International Christian University, pp. 23-38.
Japan Foundation 2002. Japanese Language Proficiency Test: Test Content Specification (Revised Edition). Tokyo: Bonjinsya.
Kanda, Y., Sato, Y. & Yamada, A. 2002. Nihongo o migakoo: Meishi, dooshi kara manabu rengo rensyuu-tyoo [Polishing your Japanese: A practice book for learning compound words from verbs and nouns]. Tokyo: Kokin-shoin.
Matsumoto, A. 2005. TUFS Gengo Module Nihongo Kaiwa Module ni okeru nihon-jijoo no kaisetsu deeta fairu [Notes and data for the development of ‘Culture and Life’ of ‘TUFS Japanese Dialogue Modules’]. Tokyo University of Foreign Studies, the Graduate study
Minna no Kyoozai saito. http://momiji.jpf.go.jp/kyozai/index.php
Nagatomo, K. & Sakota, K. 1987. Goyoo bunseki no kiso kenkyuu (1) [Basic study of error analysis (1)]. Annals of educational research 33, pp. 144-149. Higashihiroshima: Chuugoku Shikoku kyooiku gakkai.
Richards, J., Platt, J. & Weber, H. 1985. Longman dictionary of applied linguistics. Longman.
Sato, S. & Lu FengJun 1993. Dairen gaikokugo gakuin nihongo-gakubu gakusei no nihongo sakubun ni mirareru goyoo [Some Examples of Mistakes in Japanese Composition by the Students of Dalian Institute of Foreign Languages]. Faculty of Literature Hokusei review 30, pp. 107-124.
Suzuki, T. 1999. Imiteki na goyoo ni mirareru omona keikoo: kansyuuteki ni teityaku shita hyoogen oyobi ruiji no hyoogen ni kakawaru ayamari [Error Analysis of Japanese Language Learners’ Written Compositions from a Semantic Aspect: Focusing on Conventional and Synonymous Expressions]. Creation of a ‘Database of Japanese Compositions’ written by Learners of Japanese 1996-1998. Grant-in-Aid for Scientific Research (Project number 08558020). Head investigator: OHSO Mieko (Nagoya University, Graduate School of Languages and Cultures, Professor).
Suzuki, T. 2002. 2000nendo tyuukyuu sakubun ni mirareru goi, imi ni kakawaru goyoo: syotyuukyuu reberu ni okeru goi, imi kyooiku no jujitsu o mezashite [Error Analysis of Intermediate Level Written Compositions: Lexical and Semantic Education from Elementary through Intermediate Level Japanese]. Bulletin of Japanese Language Center for International Students 28, pp. 27-42.
Suzuki, A. 2006. Jookyuu nihongo gakusyuusya no sakubun ni mirareru goi no goyoo [Analysis of lexical errors of advanced Japanese learner]. Working papers in Linguistic Informatics 10: Course materials, Evaluation, Second Language Acquisition (SLA). The Graduate School of Area and Culture Studies, Tokyo University of Foreign Studies, pp. 221-242.
Taniguchi, S. 2001. Nihongo kyooiku ni okeru korokeesyon no atsukai [Treatment of collocations in Japanese education]. Annals of educational research 47, pp. 381-386. Higashihiroshima: Chuugoku Shikoku kyooiku gakkai.
Umino, T. & Kawamura, F. (Eds.) 2004. Bunsyoo-hyogen waakubukku [Workbook of Writing in Japanese] (unpublished material).
Appendix A. A sample of the ‘Japanese Composition Database of Advanced Learners’
A009-ChK
新人紹介
▲▲▲さんは 3 年前、韓国からの留学生で私は色んな面でとても個性があって素晴しい方だなと思っております。
韓国で短大を卒業なさって、韓国のクレジットカード会社で OL として働いたそうですが、どうしても自分の大きなゆめを##1 するために日本へ留学なさったそうです。
▲さんは、日本への留学の目的はまず自分が好きな文学がやりたいことと、そしてまた視野を広げながら世界中の人たちとコンタクトを取りながら友達もいっぱい作りたいという抱負を持っていらっしゃるそうです。日本へいらしゃってる##2 までは日本#3 学校、そして他の大学に在籍しながらも、外交員になるゆめを##4 するために##5 東京外国語大学の日本課程へ入学され、ゆめの第一歩を##6 できて本人は充実感たっぷりながらも「とりあえずよかったと思います。でもまたこれからなんですよ。」とおっしゃっていました。
短いながらもインタビューをしているうちに私は、自分にとっても勉強になった気がします。自分の抱負を抱えて一歩一歩前に進んでいる彼女の姿はとてもすばらしいと思っております。
これから国際的舞台で活#7 する留学生ののみなさんも彼女みたいにゆめをもっていらっしゃると思いますが。これからも 4 年間一緒に勉強するわけですからお互いに勉強しながら頑張って欲しいです。
最後ながら、私は▲さんが自分の外交員になるゆめを##8 ようお祈りします。
Notes
1 「実現」だと思われるが、崩れている。 [Presumably 「実現」, but the characters are illegible.]
2 「初日」か。字が崩れている。 [Possibly 「初日」; the characters are illegible.]
3 文脈から「語」とわかるが、崩れている。 [Identifiable as 「語」 from the context, but illegible.]
4 1 に同じ。 [Same as 1.]
5 文脈などから「現在」とわかるが、崩れている。 [Identifiable as 「現在」 from the context and other cues, but illegible.]
6 1 に同じ。 [Same as 1.]
7 「躍」が崩れている。中国簡体字? [「躍」 is illegible. A simplified Chinese character?]
8 1 に同じ。 [Same as 1.]
Appendix B. Issyoni oboeyo! Dooshi to Meishi
第 1 課 初めまして! <自己紹介> ―「勤める」? 「就く」?―
初めまして、中国から来た王と申します。去年の 4 月に日本に来て、1 年間日本語学校で勉強しました。中国では大学のコンピューター学科を卒業して、日本企業に勤めていました。そこでたくさんの日本人と知り合って、日本と日本語に興味を持つようになりました。日本に来たときは、日本の進んだコンピューター技術を学ぶつもりでしたが、いつの間にかコンピューターより日本語に夢中になっていました。大学で一生懸命勉強して、将来は通訳の仕事に就きたいと思います。どうぞよろしくお願いいたします。
©Minna no Kyoozai saito
・日本企業に勤める
・通訳の仕事に就く
「勤める」「就く」とペアになる名詞いろいろ
☆佐藤さんはコンピューターの会社に勤めています。
「勤める」:仕事をする場所を表わす名詞と一緒に使います。
[企業、会社、銀行、役所、官庁、図書館、学校、郵便局、新聞社、研究所、消防署、工場、スーパー、業務課、人事部、窓口 など] に 勤める
☆将来は教職に就きたいです。
「就く」:地位や役職などを表わす名詞と一緒に使います。
[管理職、役職、教職、監督の座、政権の座、職業、兵役、仕事、定職 など] に 就く
ドリル
次のカッコ内に、「勤める」か「就く」を、前後と合うように形を変えていれなさい。
(1) A:金さんの将来の夢は?
    B:通訳の仕事に( )たいと思って。
    A:そうなんだ、じゃあがんばって勉強しないとね。
    B:はい。大学院にも進みたいです。
(2) 王さんは中国で日本企業に( )ていたそうです。
(3) 父は新聞社に( )ています。
(4) 私が( )ている工場では、全部で 100 人の人が働いています。
(5) 私は大学に( )ています。といっても大学教授ではなく、事務職です。
(6) A:私、今銀行の窓口に( )ているの。
    B:あれ、大学のとき「教職に( )たい」って言ってなかった?
    A:そうなんだけど、教員採用試験に落ちたからあきらめたの。
(7) 男性が義務として兵役に( )なければならない国は多いですが、日本にはそういう制度はありません。
(8) サッカー日本代表の監督の座には誰が( )のか、注目が集まっています。
(9) 彼は若くして管理職に( )。
(10) 最近、定職に( )ない若者が増えてきた。
(11) A:田中さん、就職はどうするの?
     B:うん、鈴木先生の紹介で、医学研究所に( )ことになったんだ。
     A:そうなんだ、よかったじゃない。
     B:大学での研究を生かして、薬の開発をしてみたいな。
(12) 佐藤さんは大手化粧品メーカーの人事部に( )ています。
(13) 4 月から会社の重要なポストに( )ことになった。
(14) 選挙の結果、××党にかわって△△党が新しく政権の座に( )ことになった。
(15) 市の図書館に司書として( )ています。
(16) A:4 月から父も私と同じ大学に通うの。
     B:えっ、どうして?
     A:20 年間( )会社を辞めて、大学院で勉強するんだって。
     B:へぇ、すごいなぁ。
コラム:日本の大学生の「就活(しゅうかつ)」
日本の大学生の多くは、3 年生の 2 月~4 年生の 5 月ごろに渡って卒業後の仕事先を探します。これを「就職活動」と呼び、略して「就活」といいます。「就活」の時期になると、茶色だった髪を黒く染めなおし、黒やグレーのスーツを着た学生を、大学の中やオフィス街などでよく見かけます。
最近ではインターネットを使った「就活」が主流になってきています。会社の情報をホームページや就職活動サイトを通じて集めることができるだけでなく、入社試験の応募までできてしまうところもあります。さらに、面接官が好感を持つスーツの選び方や髪型、女性の場合は化粧の仕方についても、インターネットで知ることができます。
選考は、会社説明会→筆記試験→面接→内定通知、という流れが一般的で、面接は 2~3 回、応募者の多い人気企業の場合は 5~6 回ほど行うところもあります。「就活」のためにはしばしば授業を休まなければならないこともあり、教師もそれを認めています。
たいていの人は 20~30 社、多い人で 100 社近くに履歴書を出しますが、最終的に採用通知をもらうのは 2~3 社、というのが現状です。
Q&A
①「就活」とは何ですか。いつ行いますか。
② 何を使った「就活」が主流になっていますか。それを使って、どんなことができますか。
書いてみよう
・ 日本語を勉強して、将来どんなことに生かしたいですか。また将来はどんな仕事をしたいですか。
・ あなたの国では仕事をどのようにして決めますか。日本の「就活」のようなものはありますか。
Syntactic Patterns of Intrasentential Code-Switching in the Discourse of Japanese-English Bilingual Families
Tomoko TOKITA and Yuji KAWAGUCHI
1. Introduction
Bilingualism, which is widespread in the modern world, has attracted a considerable amount of attention over the past decades. Code-switching, as a common language contact phenomenon, has been one of the central topics of linguistic discussion, and research on it has been well documented from various perspectives, in various places, involving various subjects. Gumperz (1982:59) defines code-switching as “the juxtaposition within the same speech exchange of passages of speech belonging to two different grammatical systems or subsystems.” Code-switching appears in two patterns.
(1) There’s children iru   yo.   (Nishimura 1997)
                     exist TAG
    (There’s children.)
(2) A: So I’ll come back at about two o’clock.
    B: Parfait, c’est bien.   (Heller 1988)
       Perfect  it’s  good
    (Perfect, it’s good.)
The first pattern is referred to as ‘intrasentential code-switching’, wherein the switch from one language to another occurs within one sentence itself. As can be seen in example (1) from the corpus of Japanese-Canadians, both English and Japanese are used in one sentence. The second pattern is called ‘intersentential code-switching’; in this pattern, the switch occurs between two sentences. For example, in conversation (2), which is from the corpus of an Anglophone and a Francophone in Montreal, one sentence is in English while the other is in French. This paper discusses code-switching as it is observed in the corpus of the conversations between Japanese-English bilinguals in metropolitan Vancouver, Canada. Multilingualism has developed here due to an increasing influx of immigrants. Japanese — which has been used in Vancouver for more than a century — is one of the minority languages. In 2001, according to Statistics Canada (2002), almost all the Japanese in
Vancouver could speak at least English and Japanese, implying that they engaged in a complex network of language interactions and that they practiced multilingualism. In order to fully understand their language practices, as a first step, this paper focuses on intrasentential code-switching, by considering the structural possibilities of and constraints in Japanese-English code-switching.
2. Conceptual Framework
In this section, we present the two theoretical hypotheses that form the conceptual framework of this paper. The first is the Equivalence Constraint, and the second is the Matrix Language Frame Model. They will be followed by a discussion on code-switching and borrowing. Poplack (1980) proposed the Equivalence Constraint, which is based on the word order difference between two languages. According to this model, code-switching will occur at any syntactic boundary “where juxtaposition of L1 and L2 elements does not violate a syntactic rule of either language” (p. 586).
(3) I    | told him | that | so that | he   | would bring it | fast.
(4) (Yo) | le dije  | eso  | pa’que  | (el) | la trajera     | ligero.
(5) I      told him   that   pa’que           la trajera       ligero.
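As a rough illustration of the constraint quoted above, the sketch below treats each sentence as a flat sequence of constituents and permits a switch at a boundary only when both languages have placed the same constituents before it, so that continuing in the other language neither repeats nor drops a constituent. This flat comparison is a simplifying assumption made here, not Poplack's formal statement; the constituent labels and the function names `switch_points` and `switch_at` are hypothetical, and each label is assumed to occur once per sentence.

```python
from typing import List


def switch_points(order_a: List[str], order_b: List[str]) -> List[int]:
    """Return boundaries k where both orders have the same constituents before k."""
    if sorted(order_a) != sorted(order_b):
        raise ValueError("both sentences must contain the same constituents")
    return [
        k for k in range(1, len(order_a))
        if set(order_a[:k]) == set(order_b[:k])
    ]


def switch_at(words_a: List[str], words_b: List[str], k: int) -> str:
    """Use language A up to boundary k and language B from there on."""
    return " ".join(words_a[:k] + words_b[k:])


# Sentences (3) and (4), segmented as in the example; labels are hypothetical.
LABELS = ["SUBJ", "V+IO", "DO", "CONJ", "SUBJ2", "V+DO", "ADV"]
ENGLISH = ["I", "told him", "that", "so that", "he", "would bring it", "fast"]
SPANISH = ["(Yo)", "le dije", "eso", "pa'que", "(el)", "la trajera", "ligero"]

# The constituents come in the same order in both languages.
points = switch_points(LABELS, LABELS)
mixed = switch_at(ENGLISH, SPANISH, 3)

# Illustration not taken from the source: adjective and noun in opposite orders.
np_points = switch_points(["DET", "ADJ", "N"], ["DET", "N", "ADJ"])

print(points)
print(mixed)
print(np_points)
```

Switching at the third boundary yields a string like (5), except that the optional pronoun shown in parentheses in (4) is kept. In the second illustration only the boundary after the determiner survives, because the two orders disagree about what comes next.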
Sentence (3) is in English; sentence (4), Spanish. The vertical lines between the two represent the syntactic boundary where code-switching could occur. Sentence (5) is an example of a code-switched sentence that can be derived from sentences (3) and (4), following the equivalence constraint.
Myers-Scotton (1993) proposed the Matrix Language Frame Model. This model distinguishes the matrix language which builds the frame of a sentence, from the embedded language which is inserted in the matrix language frame. There are two patterns of intrasentential code-switching.
(6) Anakula plate mbili… (Swahili–English)
    (He eats two plates…)
(7) Ni-ka-maliza all the clothing. (Swahili–English)
    (And I’ve finished [washing] all the clothing.)
In the first pattern, the embedded language constituents are inserted in the frame of the matrix language. For example in (6), Swahili is the matrix language and the English word “plate” is the embedded language constituent. In the second pattern, embedded language islands, that is, elements of the embedded language, are inserted in the frame set by the matrix language. In the case of (7), the matrix language is Swahili, and the English element
“all the clothing” is the embedded language island. It is necessary to discuss the concept of borrowing, which at times is central to the study of intrasentential code-switching (e.g., Muysken 1995; Pfaff 1979; Poplack & Sankoff 1984). This is because borrowing and code-switching are related, given that elements from more than one language are drawn upon in on-line speech production. For example, Poplack & Meechan (1995:200) define borrowing as “the adaptation of lexical material to the morphological and syntactic patterns of the recipient language,” thus distinguishing it from code-switching. They admit, however, that the distinction between them is not always clear-cut. Moreover, some researchers do not consider them to be distinct entities. Therefore, there is no consensus either on their definitions or on the distinction between them.
3. The Japanese Language in Vancouver
In this section, we describe the language situation in Vancouver, with special reference to Japanese. Vancouver is characterized by the development of multilingualism; for nearly 40% of its population, the mother tongue is a language other than English and French, Canada’s official languages. In 2001 (Statistics Canada 2002), 1.3% of the total population of Vancouver were Japanese speakers. Japanese was the mother tongue of 14,400 people or 0.7% of the population, and it was the second language for 9,875 people or 0.5% of the population. It should be noted that, in addition to this, a number of Japanese students and workers stay in Vancouver for certain periods of time, as do tourists or visitors. Among the 14,400 people whose mother tongue is Japanese, approximately half (7,515: 52.2%) use predominantly Japanese at home and the other half (6,380: 44.3%), English (Statistics Canada 2002). However, as the census inquired about “the language most spoken at home,” the data obtained was not reflective of multilingual practices at home.
Given that more than 90% of the Japanese in Vancouver are married to members belonging to other ethnic groups and that almost all can speak both Japanese and English (Statistics Canada 2002), several languages may be used in conversations in the family. According to Dagenais & Berron (2001), who examined a few immigrant families in Vancouver, the dynamic nature of language practices was observed even in families wherein the parents shared the same heritage language.
4. Methodology
For Japanese speakers in Vancouver, it is not unusual to speak Japanese along with English. Many studies have demonstrated that code-switching occurs in a rule-governed manner. Focusing on intrasentential code-switching,
this study considers the structural possibilities and constraints found in Japanese-English code-switching. We analyzed the natural conversations of two families living in Vancouver. Family conversations that took place at home were deliberately chosen as data because the home is one of the important sites in which the heritage language is used. After acquiring their consent, we asked the mother in each family to record the family’s conversations using an IC recorder. Subsequently, the transcription, coding, and analysis of the data were carried out. Even though the data comprised natural conversations and the researchers were not present at the time of data collection, we have to admit that all the subjects knew that they were being recorded. However, this does not interfere with our purpose, since our aim is to syntactically analyze the colloquial exchange between them. The recording details are presented in Table 1.
Table 1. Recorded information
                      Family 1                Family 2
Recorded Date         2006/2/20–2006/3/28     2006/2/19–2006/3/1
Recorded Time         95 m 49 s (8 times)     121 m 51 s (11 times)
Recorded Place        Home                    Home
Recorded Situation    Meal time, etc.         Meal time, etc.
The two families are identified as Family 1 and Family 2. They were recruited from the Japanese Heritage Language School that the children attended once a week. Their attributes are presented in Table 2.
Table 2. Subjects
                                       Family 1                            Family 2
Parents    Place of Origin   Father    Canada                              Canada
                             Mother    Japan                               Japan
           Age               Father    Late thirties                       Late thirties
                             Mother    Late thirties                       Early forties
Children   Place of Origin             Canada                              Canada
           Sex and Age                 10-year-old boy, 8-year-old boy     8-year-old boy
Family 1: The father was raised in British Columbia, Canada. He has frequently visited Japan with his family but has never lived there. The mother moved to Canada about fifteen years previously and has relatives in Japan. Their two boys were born in Canada, and they visit Japan more than once a year.
Family 2: The father was raised in Alberta, Canada. He worked in Japan for four years while in his early twenties, and after that he returned to British Columbia. The mother moved to Canada about ten years previously and has relatives in Japan. Their son was born in Canada, and he has visited Japan three times.
5. Structures of Family Language Practices
On the basis of the recorded data, we calculated the frequency of each language spoken as well as the occurrence of code-switching for each member. This was done in order to demonstrate how each family member practiced bilingualism at home, using his/her linguistic repertoire. We took each turn as a unit and divided the turns into four categories. The first two categories comprised the turns in which Japanese and English were spoken, respectively. The third category comprised the turns of code-switching, wherein Japanese and English were spoken in combination. The fourth category comprised the turns wherein it was difficult to judge the language; this category included words such as proper names, places, and light nods. This study did not distinguish between code-switching and borrowing. Since we lack an appropriate method of distinguishing between them, attempting to do so might have narrowed the scope of our analysis. Only ‘established loanwords’ (typically showing full linguistic integration, native-language synonym displacement, and widespread diffusion even among recipient-language monolinguals, Poplack & Meechan 1995:200) were excluded from our analysis. In Japanese, ‘computer’ and ‘salad’ are examples of established loanwords, and in English, some of these are ‘sushi’ and ‘karaoke’.
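The four-way categorization of turns described above can be sketched as follows. This is a minimal illustration, assuming each word in a turn has already been tagged by hand as Japanese ("ja"), English ("en"), or undeterminable (None). The tag names, the function names, and the sample turns are assumptions introduced here, apart from the mixed turn, which is taken from example (8) of this paper; the handling of established loanwords is left to the manual tagging step.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, Optional[str]]  # (word, "ja" | "en" | None)
CATEGORIES = ("Japanese", "English", "Mixed", "Unmarked")


def classify_turn(turn: List[Token]) -> str:
    """Assign one turn to one of the four categories."""
    langs = {lang for _, lang in turn if lang is not None}
    if not langs:
        return "Unmarked"  # e.g. a proper name or a nod
    if langs == {"ja"}:
        return "Japanese"
    if langs == {"en"}:
        return "English"
    if langs == {"ja", "en"}:
        return "Mixed"
    raise ValueError(f"unexpected language tags: {sorted(langs)}")


def shares(turns: List[List[Token]]) -> Dict[str, float]:
    """Percentage of turns in each category, rounded to one decimal place."""
    counts = Counter(classify_turn(t) for t in turns)
    total = sum(counts.values())
    if total == 0:
        return {cat: 0.0 for cat in CATEGORIES}
    return {cat: round(100 * counts[cat] / total, 1) for cat in CATEGORIES}


TURNS = [
    [("Dame", "ja"), ("nano", "ja"), ("sono", "ja"), ("water", "en"), ("jya", "ja")],
    [("Sugoku", "ja"), ("omoshiroi", "ja")],   # invented Japanese-only turn
    [("I'm", "en"), ("hungry", "en")],         # invented English-only turn
    [("Vancouver", None)],                     # place name, language not judged
]

print([classify_turn(t) for t in TURNS])
print(shares(TURNS))
```

Applied per speaker, a tally of this kind would give the proportions that a chart of each family's language practices displays.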
Family 1: As illustrated in Graph 1, Japanese and English were used by all the members except for the father, 1A, who was not observed to speak Japanese. He is considered to be an English unilingual and is excluded from our study.
Graph 1. Family 1’s language practices [bar chart: turns per speaker (1A, 1B, 1C, 1D) in four categories (Japanese, English, Mixed, Unmarked); horizontal axis from 0 to 800]
Many utterances by the mother, 1B, were in Japanese, and for more than 90% of the mixed utterances, the matrix language was Japanese. Hence, Japanese appears to be her dominant language in family conversations. With regard to the older son, 1C, most of the utterances were in English, and for about 80% of his mixed utterances, the matrix language was English. This implies that his dominant language is English. Similarly, the majority of the utterances of the younger son, 1D, were in English, and for approximately 50% of the mixed utterances, the matrix language was English. This implies that his dominant language is English. Family 2: In Family 2, both Japanese and English were used by each member. Many of the utterances by the father, 2A, were in English. He made less than ten mixed utterances, and although the matrix language for these was either English or Japanese, his dominant language appears to be English. In the case of the mother, 2B, many utterances were in Japanese, and for almost all her mixed utterances, the matrix language was Japanese. Hence, Japanese is dominant in her language practices at home. The utterances of the son, 2C, were partly in Japanese and partly in English, and for 70% of his mixed utterances, the matrix language was Japanese. In his case, the data was insufficient to conclusively determine his dominant language at home.
Graph 2. Family 2’s language practices [bar chart: turns per speaker (2A, 2B, 2C) in four categories (Japanese, English, Mixed, Unmarked); horizontal axis from 0 to 700]
The above two graphs show how language practices vary between members of the same family. The members’ linguistic backgrounds seem to influence these practices. The graphs also demonstrate how language usage, particularly that of children, varies between families. The children in Family 1 tended to use more English: this may be because the two children spoke in English with each other and their father does not understand Japanese. On the other hand, in Family 2, the child does not have any brother or sister, and his father understands Japanese: these facts may lead him to use English and Japanese with similar frequency. However, it should be noted that these graphs present only a partial picture of the family conversations. The pattern might change depending on the situation and topic; it will also change over time.
6. Intrasentential Code-switching
In this section, we present the intrasentential code-switching that was observed in our subjects. First, the nature of these switches is illustrated. Then, some patterns are analyzed from the syntactic and the discourse perspectives.
6.1. Linguistic Properties of Switched Segments
After having found that switching occurred among our bilingual subjects, though not frequently, we directed our attention to the nature of the intrasentential switches. Four main categories were extracted from our corpus, in addition to some other categories described as ‘others’; these are shown in Table 3.
Table 3. Code-switching items by syntactic categories and language

Category              English          Japanese         Total number   Percentage of
                      (in Japanese)    (in English)     of CS          total CS
Noun/Compound Noun    210              15               225            61.5%
Adjective             46               1                47             12.8%
Verb                  41               0                41             11.2%
Discourse Marker      18               15               33             9%
Others                18               2                20             5.5%
Totals                333              33               366            100%
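
As a quick arithmetic check on Table 3, the row totals and percentages can be recomputed from the raw counts. The following Python sketch is illustrative only: the variable names are assumptions introduced here, and only the counts come from the table.

```python
# Counts copied from Table 3 as
# (English items in Japanese sentences, Japanese items in English sentences).
# The dictionary layout and names are illustrative assumptions.
counts = {
    "Noun/Compound Noun": (210, 15),
    "Adjective": (46, 1),
    "Verb": (41, 0),
    "Discourse Marker": (18, 15),
    "Others": (18, 2),
}

# Row totals: all switched items per syntactic category.
row_totals = {cat: en + ja for cat, (en, ja) in counts.items()}

# Column totals and the grand total of code-switched items.
english_total = sum(en for en, _ in counts.values())
japanese_total = sum(ja for _, ja in counts.values())
grand_total = english_total + japanese_total

# Share of each category in the total number of switches, in percent.
shares = {cat: 100 * n / grand_total for cat, n in row_totals.items()}

for cat, n in row_totals.items():
    print(f"{cat:<20} {n:>4} {shares[cat]:5.1f}%")
print(f"{'Totals':<20} {grand_total:>4}")
```

Rounded to one decimal place, the recomputed shares agree with the percentages reported in the table.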
Table 3 reveals that nouns and compound nouns are the most frequently switched category; this confirms the findings of other studies (Poplack 1980; Nishimura 1997). Moreover, English items have a greater tendency to be inserted into sentences having Japanese as the matrix language; that is, English functions as the embedded language. In the case of adjectives and verbs, Japanese items are rarely inserted into a sentence having English as the matrix language. Discourse markers, however, appear to be switched readily in sentences of either matrix language. The highest frequency was observed in the switching of nouns/compound nouns, which appeared in various syntactic structures. In the following sections, we will mainly analyze this category.

6.2. Syntactic Patterns

6.2.1. Noun Phrases

This section deals with the possibilities for switching within noun phrases. In both Japanese and English noun phrases, a determiner, a demonstrative, or a qualifying adjective precedes the noun; that is, Japanese and English noun phrases follow a parallel word order. However, Japanese does not have articles. In our corpus, switching within noun phrases was found in sentences of both matrix languages. When the matrix language was Japanese, two patterns of switched noun phrases were observed. The first pattern involves the combination of ‘a Japanese determiner/adjective and an English noun’, as in the sentences below.

(8) 1B: Dame nano,  sono  water  jya?
        don’t like  that         TAG
        (You don’t like that water?)

(9) 2B: Sugoku  omoshiroi    movie  o     mita     no.
        very    interesting         acc.  watched  TAG
        (I watched a very interesting movie.)
In sentence (8), the noun phrase ‘sono water’ comprises a Japanese demonstrative (sono) and an English noun (water). In sentence (9), the noun phrase ‘omoshiroi movie’ comprises a Japanese adjective (omoshiroi) and an English noun (movie). The second pattern combines ‘an English adjective and a Japanese noun’, for example, sentence (10).

(10) 2B: usui   light na  iro    de,   kono,  hontoni  mieru youna     no     de,   kaite  ii.
         light            color  with  this   really   be looked like  thing  with  draw   can
         (With a light color, with this which seems real, you could draw.)
In sentence (10), the noun phrase ‘light na iro’ comprises an English adjective (light) and a Japanese noun (iro). It should be noted that ‘na’ is placed between the English adjective and the Japanese noun. Japanese has two types of qualifying adjectives that are equivalent to English adjectives: one is an ‘adjective’, which directly modifies the noun following it; the other is an ‘adjectival noun’, which requires the ‘na’ suffix to modify the noun following it. In our corpus, all the switched English adjectives were suffixed by ‘na’; this is consistent with the findings of Azuma (2001). However, unlike the case in the previous pattern, a noun phrase consisting of ‘an English determiner and a Japanese noun’ was not observed here. In both the patterns of sentences which had Japanese as the matrix language, we observed that the noun phrases were not governed by the English articles. This implies that these noun phrases are inserted following the rules of the matrix language’s syntax; in this case, Japanese.

When the matrix language was English, only the pattern of ‘an English determiner and a Japanese noun’ was observed, as in example (11).

(11) 1D: Do I put a ten     right here?
                    period
         (Do I put a period right here?)
In sentence (11), the noun phrase ‘a ten’ contains an English article (a) and a Japanese noun (ten). These examples demonstrate that it is possible to switch within noun phrases. However, there are certain noun phrases within which code-switching does not occur. For example, a switch does not occur between a quantifier and a noun, as follows.

(12) 1B: two words  ni    shichattan  datte,  ground  to   hog  to.
                    Dat.  made        hear            and       TAG
         (He split them into two words, ground and hog.)

(13) 1D: The mou ikko   was…, what’s the last word again?
             other one
         (The other one was…, what is the last word again?)
(12) is a sentence which has Japanese as the matrix language; the English
noun phrase ‘two words’ is inserted without a switch. In (13), which is a sentence having English as the matrix language, the Japanese noun phrase ‘mou ikko’ is inserted. Table 4 gives the frequencies of the switching patterns within, and of, the noun phrases in our corpus.

Table 4. Code-switching patterns (noun phrase)

Matrix Language   Pattern                                  Total
Japanese          Det/Adj (Japanese) + N (English)             5
                  Det/Adj (English) + N (Japanese)            12
                  Quantifier (Japanese) + N (English)          0
                  Quantifier (English) + N (Japanese)          0
                  Quantifier (English) + N (English)          16
English           Det/Adj (Japanese) + N (English)             0
                  Det/Adj (English) + N (Japanese)             2
                  Quantifier (Japanese) + N (English)          0
                  Quantifier (English) + N (Japanese)          0
                  Quantifier (Japanese) + N (Japanese)         3
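
The pattern labels in Table 4 can be thought of as the result of tagging each modifier and noun for language and category. The Python sketch below shows one way such labels might be derived; the `Token` type, the tag values, and the tagging of examples (8) to (12) are hypothetical illustrations, not the annotation scheme actually used for the corpus.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    """A hypothetical annotated word: surface form, language, category."""
    form: str
    lang: str  # "Japanese" or "English"
    cat: str   # "Det/Adj", "Quantifier", or "N"


def np_pattern(modifier: Token, noun: Token) -> str:
    """Return a Table 4 style label for a modifier + noun pair."""
    return f"{modifier.cat} ({modifier.lang}) + N ({noun.lang})"


def is_switched(modifier: Token, noun: Token) -> bool:
    """A switch occurs inside the phrase when the two languages differ."""
    return modifier.lang != noun.lang


# Modifier/noun pairs taken from examples (8)-(12); the tags are assumptions.
pairs = {
    "(8)": (Token("sono", "Japanese", "Det/Adj"), Token("water", "English", "N")),
    "(9)": (Token("omoshiroi", "Japanese", "Det/Adj"), Token("movie", "English", "N")),
    "(10)": (Token("light", "English", "Det/Adj"), Token("iro", "Japanese", "N")),
    "(11)": (Token("a", "English", "Det/Adj"), Token("ten", "Japanese", "N")),
    "(12)": (Token("two", "English", "Quantifier"), Token("words", "English", "N")),
}

# Tally the patterns over the five example phrases.
tally = Counter(np_pattern(m, n) for m, n in pairs.values())
for label, count in tally.items():
    print(count, label)

# Phrases inserted whole, with no switch between modifier and noun.
print([ex for ex, (m, n) in pairs.items() if not is_switched(m, n)])
```

Under these assumed tags, only the quantifier phrase in (12) shows no internal switch, which matches the observation that the quantifier and noun are inserted as a unit.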
Japanese-English code-switching appears to be permissible in both matrix languages. However, the occurrences of switching within the noun phrase and those of the switched noun phrase comprising the quantifier are more frequently observed in the sentences having Japanese as the matrix language than in those having English as the matrix language. Our findings that Japanese-English code-switching occurs within the noun phrases are similar to the results in Bentahila & Davies (1983), which analyzed the syntax of Arabic-French code-switching, as well as in Pfaff (1979), which analyzed the syntax of Spanish-English code-switching. However, in Japanese-English, a switch does not appear to occur in the noun phrases that have the ‘quantifier and a noun’ structure. In this way, our subjects practiced switches within noun phrases without breaking the syntactic structure of the matrix language in which they conversed.

6.2.2. Prepositional/Postpositional Phrases

In this section, the switching within prepositional/postpositional phrases is examined. A prepositional phrase in English corresponds to a postpositional phrase in Japanese: in English, the word order is ‘a preposition and a noun phrase’, while in Japanese, it is ‘a noun phrase and a postposition’. Table 5 shows the code-switching patterns within prepositional/postpositional phrases that could theoretically occur in sentences having either Japanese or English as the matrix language.

Table 5. Code-switching patterns (prepositional/postpositional phrase)

Matrix Language   Pattern                                    Total
Japanese          NP (English) + Postposition (Japanese)        28
                  Preposition (English) + NP (Japanese)          0
                  Preposition (English) + NP (English)           1
English           Preposition (English) + NP (Japanese)          0
                  NP (English) + Postposition (Japanese)         0
                  NP (Japanese) + Postposition (Japanese)        0
In our corpus, the switch occurred only in the sentences which had Japanese as the matrix language. Moreover, not all structurally possible switches were equally probable in the three types of configurations: English noun phrases were frequently inserted with a Japanese postposition. In only one example, the entire English prepositional phrase was embedded in a sentence which had Japanese as the matrix language. English prepositions were never switched by themselves, which was consistent with the findings of Azuma (2001). These patterns were similar to those observed in Pfaff (1979), which studied English-Spanish code-switching. The sentences below are examples from our corpus.

(14) 1D: zenbu,        same day  ni  shita.
         all of them,            on  did
         (I did all of them on the same day.)

(15) 2B: Yeah, from school  jyanakute  iku  yo.
                            Neg.       go   TAG
         (You will go, not from school.)
Sentence (14) is an example of a frequent occurrence: a switch between an English noun phrase (same day) and a Japanese postposition (ni). It should be mentioned that not all the switched English noun phrases in our corpus were governed by articles. The use of Japanese postpositions in the sentences having Japanese as the matrix language, as well as the fact that the articles were not marked within the English noun phrase, implies that English noun phrases, instead of Japanese ones, are embedded and that this type of switch does not break the Japanese syntactic structure. Sentence (15) is a singular example, wherein an entire English prepositional phrase (from school) is embedded in a sentence which has Japanese as the matrix language. In this way, switching occurs within a postpositional phrase in the sentences having Japanese as the matrix language, and the language of the
postposition and the matrix language are identical in most cases: a switched noun phrase is only embedded in the matrix language. This switching pattern follows the Japanese rules of syntax.

6.2.3. Verb Phrases

This section examines the possibility of switching between a verb and an object noun phrase. While English has a ‘V+O’ order, Japanese has an ‘O+V’ order. Moreover, in Japanese the particle ‘o’ is placed as an accusative marker after an object noun phrase. However, in colloquial Japanese, the particle ‘o’ is sometimes dropped when case marking is contextually predictable (Masuoka & Takubo 1992). Table 6 gives the frequencies of the code-switching patterns that occurred within the verb phrases in our corpus.

Table 6. Code-switching patterns (verb phrase)

Matrix Language   Pattern                               Total
Japanese          NP (English) + V (Japanese)              25
                  NP (English) + “o” + V (Japanese)         8
English           V (English) + NP (Japanese)               7
                  V (English) + NP (Japanese) + “o”         0
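
The proportion of switched object noun phrases that carry the accusative marker follows directly from the counts in Table 6. The short Python sketch below works this out; the dictionary layout and the function name are assumptions made for illustration.

```python
# Counts copied from Table 6; the layout and names are illustrative assumptions.
table6 = {
    "Japanese": {"with_o": 8, "without_o": 25},
    "English": {"with_o": 0, "without_o": 7},
}


def o_marking_rate(cell: dict) -> float:
    """Share of switched object noun phrases accompanied by the marker 'o'."""
    total = cell["with_o"] + cell["without_o"]
    return cell["with_o"] / total if total else 0.0


for matrix, cell in table6.items():
    total = cell["with_o"] + cell["without_o"]
    rate = o_marking_rate(cell)
    print(f"{matrix}: {cell['with_o']}/{total} object NPs marked with 'o' ({rate:.1%})")
```

In the Japanese-matrix sentences, the marker is thus present in roughly a quarter of the 33 switched object noun phrases and absent in the rest.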
In the sentences which had Japanese as the matrix language, there were 33 examples of switching between an English object noun phrase and a Japanese verb. In 8 of these, the English noun phrase was accompanied by the Japanese accusative marker ‘o’; in the remaining 25, it was not. This result is similar to the tendency observed in colloquial Japanese. In the sentences which had English as the matrix language, 7 examples of a switched Japanese noun phrase were observed; none of these Japanese noun phrases was accompanied by the accusative marker ‘o’. The following are examples of sentences which have Japanese as the matrix language.

(16) 2B: kou        short sleeve  o     kite,  dou  kana?
         like this                acc.  wear,  how  TAG
         (Wear the short sleeve shirts like this, what do you think?)

(17) 2B: kou        rainbow color lettering  suru  jyanai.
         like this                           do    TAG
         ((We) do rainbow color lettering like this, don’t we?)
Sentence (16) is an example in which a switched English noun phrase is accompanied by the Japanese accusative marker ‘o’, whereas in sentence (17), the English noun phrase is not. The English object noun phrases in our corpus are not marked by articles, which is in accordance with the Japanese syntactic structure. Sentence (18) is an example of a Japanese noun phrase being embedded in a sentence having English as the matrix language.

(18) 1C: I’ll have gohan.
                   rice
         (I’ll have rice.)
The fact that a Japanese noun phrase is not accompanied by the Japanese accusative marker ‘o’ in a sentence having English as the matrix language means that our subjects follow English syntactic rules and embed Japanese noun phrases, instead of English ones. However, Nishimura (1997) reports a Japanese object noun phrase marked with ‘o’ in an English sentence, as in (19).

(19) We never know anna  koto   o.    (Nishimura 1997)
                   such  thing  acc.
     (We never know such a thing.)
It is unclear whether this pattern was actually observed in Nishimura’s corpus (we were unable to ascertain whether this sentence was uttered by one of her subjects or whether Nishimura constructed it for the paper) and, if it was, how frequently it occurred. Alternatively, the difference between the findings might lie in the backgrounds of the subjects: Nishimura’s subjects were second-generation Japanese people living in Toronto, Ontario, who were born in the 1920s or 1930s. Further investigation of this aspect is necessary. Thus, although the surface structure of a verb phrase is not equivalent between Japanese and English, the switch occurs while retaining the matrix language’s syntactic rules. Therefore, the Equivalence Constraint proposed by Poplack (1980) does not hold true with respect to our Japanese-English corpus.

6.2.4. Dislocation of the Direct Object Noun Phrases

In sentences which have English as the matrix language, switching occurs when the object noun phrase undergoes either a left dislocation or a right dislocation.

(20) 1D: shake, I’m going to eat it last.
         (Salmon, I’m going to eat it last.)

(21) 1C: There’s Japanese, I’ll write it in the bun, nihongo.
         (There’s Japanese, I’ll write it in the sentence, Japanese.)
In sentence (20), the object ‘it’ was dislocated on the left-hand side, switching into the Japanese ‘shake’. In sentence (21), the object ‘it’ was dislocated on the right-hand side, switching into the Japanese ‘nihongo’. Thus, the switches — both left dislocation and right dislocation — occurred
without breaking the syntactic structure of English.

6.3. Topic Phrases

This section examines the possibility of a topic phrase being switched. According to Kuno (1973:28), a topic phrase is “a noun phrase that indicates the person or things that have already appeared in the conversation.” In Japanese, topic phrases are typically marked by the particle ‘wa’. Other topic markers include ‘nara’, ‘tte’, and ‘ttara’, and the topic marker may also be dropped (Masuoka & Takubo 1992). English, however, does not have topic markers. In sentences having Japanese as the matrix language, we analyze switches within a topic phrase and switched English topic phrases. In sentences having English as the matrix language, we analyze switched Japanese topic phrases.

Table 7. Code-switching patterns (topic phrase)

Matrix Language   Total
Japanese             40
English               2
Table 7 shows the frequency of switched topic phrases. There were 40 instances of switched topic phrases in the sentences which had Japanese as the matrix language, while only 2 such instances were observed in the sentences which had English as the matrix language. The switching of topic phrases appears to be possible in both matrix languages; however, it occurs more frequently in the sentences which have Japanese as the matrix language. The following were observed in the sentences which had Japanese as the matrix language.

(22) 1B: Jya,  fourth  wa     dare?
         so,           topic  who
         (so, who was the fourth?)

(23) 2B: Belly allergy  tte    aru    yone.
                        topic  exist  TAG
         (There is the expression “belly allergy”, right?)

(24) 1B: Nani,  brown stuff  tte.
         what                topic
         (What is it, the brown stuff?)

(25) 1B: Sugoi  jan,  J and T.
         great  TAG
         (That’s great, J and T.)
In sentence (22), the topic phrase is at the head of the sentence. The switched English noun (fourth) is embedded, accompanied by the Japanese topic marker ‘wa’. In sentence (23), the topic phrase is at the head of the sentence, and the switched English compound noun (belly allergy) is embedded with the Japanese topic marker ‘tte’. According to Masuoka & Takubo (1992), the topic marker ‘tte’ is mainly used in conversations. Sentence (24) is an example in which the topic phrase is placed at the end of the sentence. The switched English phrase (brown stuff) is embedded along with the topic marker ‘tte’. Sentence (25) is an example of a topic phrase that is not accompanied by a topic marker. Note that in these examples, articles are not placed before the English nouns, in accordance with the rules of Japanese syntax. Below is an example of a switched topic phrase in a sentence having English as the matrix language; however, such instances were not frequently observed in our corpus.

(26) 1D: The mou ikko   was…, what’s the last word again?
             other one
         (The other one was… what is the last word again?)
Therefore, it is possible to switch topic phrases without breaking the syntactic rules of the matrix language. Many examples of such a switch were observed in sentences whose matrix language is Japanese, wherein topic phrases normally comprise a switched English noun phrase and a Japanese topic marker. This implies that the English noun phrases were embedded instead of the Japanese ones.

7. Conclusion

Thus far, we have analyzed Japanese-English bilingualism, focusing on the intrasentential code-switching patterns in the conversations of two families. Code-switching occurs in sentences of either matrix language, Japanese or English, without breaking the syntactic rules of the matrix language, even though the surface structures of Japanese and English are sometimes not equivalent. The Equivalence Constraint proposed by Poplack (1980) does not hold true, as other studies (e.g., Azuma 1993) have demonstrated. The Matrix Language Frame Model of Myers-Scotton (1993) seems better able to explain Japanese-English intrasentential code-switching, although certain rules also govern it, such as the nonoccurrence of a switch between a quantifier and a noun. We frequently observed the insertion of English items into Japanese sentences. Since our subjects reside in Vancouver, where English is the dominant language, certain items may be referred to in English, even in a Japanese sentence.
It is necessary to conduct further investigations. These should include studies of a larger corpus obtained from other bilingual subjects, since the 3.5-hour corpus is inadequate for a thorough understanding of this phenomenon. For intrasentential code-switching, research should be conducted on the switched verbs, adjectives, and discourse markers that were observed in this study, in addition to the categories we analyzed. We also need to examine intersentential code-switching using the Conversation Analysis approach, which has been popular for some years. These analyses would lead to a fuller understanding of the language practices of Japanese-English bilingual families in Vancouver, of which code-switching is a part.

Acknowledgements

We wish to express our sincere gratitude to the families who participated in this study.

References

Azuma, S. 1993. “Word order vs. word class: portmanteau sentences in Bilinguals”. Japanese/Korean Linguistics, 2. Stanford Linguistic Society. 1993. 193-204.
Azuma, S. 2001. “Functional categories and codeswitching in Japanese/English”. R. Jacobson (ed.) Codeswitching worldwide II. Berlin: Mouton de Gruyter. 2001. 91-103.
Bentahila, A. & Davies, E. 1983. “The syntax of Arabic-French code-switching”. Lingua 59. 313-348.
Dagenais, D., & Berron, C. 2001. “Promoting multilingualism through French immersion and language maintenance in three immigrant families”. Language, culture and curriculum 14, 2. 142-155.
Gumperz, J. J. 1982. Discourse strategies. New York: Cambridge University Press.
Heller, M. 1988. “Strategic ambiguity: Code-switching in the management of conflict”. M. Heller (ed.) Code-switching: Anthropological and sociolinguistic perspectives. Berlin: Mouton de Gruyter. 1988. 77-96.
Kuno, S. 1973. Nihon bunpo kenkyu [Studies on Japanese grammar]. Tokyo: Taishukan shoten.
Masuoka, T., & Takubo, Y. 1992. Kiso nihongo bunpou: kaitei ban [Basic Japanese grammar: revised edition]. Tokyo: Kuroshio Shuppan.
Muysken, P. 1995. “Code-switching and grammatical theory”. L. Milroy & P. Muysken (eds.) One speaker, two languages: Cross-disciplinary perspectives on code-switching. New York: Cambridge University Press. 1995. 177-198.
Myers-Scotton, C. 1993. Duelling languages. Oxford: Oxford University Press.
Nishimura, M. 1997. Japanese-English code-switching: syntax and pragmatics. New York: Peter Lang.
Pfaff, C. 1979. “Constraints on language mixing: Intrasentential code-switching and borrowing in Spanish/English”. Language 55. 291-316.
Poplack, S. 1980. “Sometimes I’ll start a sentence in Spanish y termino en español: toward a typology of code-switching”. Linguistics 18. 581-618.
Poplack, S. & Meechan, M. 1995. “Patterns of language mixture: nominal structure in Wolof-French and Fongbe-French bilingual discourse”. L. Milroy & P. Muysken (eds.) One speaker, two languages: Cross-disciplinary perspectives on code-switching. New York: Cambridge University Press. 1995. 199-232.
Poplack, S. & Sankoff, D. 1984. “Borrowing: the synchrony of integration”. Linguistics 22. 99-136.
Statistics Canada. 2002. 2001 Census of Canada. Ottawa: Statistics Canada. http://www.statcan.ca/
Index of Proper Nouns Albanian 130 ALEIC 55, 56 American Research on the Treasury of the French Language (ARTFL) 88 Amsterdam Corpus 217 Analyse et Traitement Informatique de la Langue Française (ATILF) 67, 70, 87-90, 217 Anglo-Norman Dictionary (AND) 89, 90 Arabic 130 Arumanian 130 Arvanitika 130 Association des Bibliophiles Universels (ABU) 86 Asturiano 130 Atlante Linguistico Italiano (ALI) 39 Atlas Italian und der Südschweiss (AIS) 39, 44, 56 Atlas Linguistique de la France (ALF) 39, 41, 42, 55, 56 Aubagne in Provence 210 Auvergne 206 Bâle 192, 193 Bantu 116, 119, 130 Base de Français Médiéval 88 Base textuelle du Moyen Français 88 BDLC 55, 57, 60, 62 Beijing Daily 306 Beijing Evening News 306 Beijing Language and Culture University 305 Beijing University of Aeronautics and Astronautics 302 Belgian 183 Belgium 181 Beltext 125
Berita Harian Online 337, 344, 345, 347-351 Bibliotheca Augustana 86 Breton 120, 130 British Isles 179 British National Corpus (BNC) 133 Castillano 130 Catalan 130 Celtic 130 Chartres 213 Chinese 133, 134, 136-139, 323 Chinese National Corpus 134 CNRS 55, 62 Codes for the Human Analysis of Transcript (CHAT) 378 Colloquial Malay 354 Computerized Language Analysis (CLAN) 378 Consortium pour les Corpus de Français Médiéval 90 Contemporary Beijing Colloquial Corpus 305 Corsica 44, 55-59 Corsican 55, 58-60 Créole 130 Croatian 130 Csango 121, 130 Danish 119, 130 De Bello Gallico 269 Detroit 174 Dictionnaire du Moyen Français (DMF) 88, 90 Dictionnaire Électronique de Chrétien de Troyes (DÉCT) 89, 90 Dictionnaire Étymologique d’Ancien Français 89
EmEditor Professional 307 English 175, 176 English-Norwegian Parallel Corpus (ENPC) 132 Estonian 130 Europe/European 39, 169, 179, 181, 182, 186 Ewen 321 Finnish 130 Flemish 130, 181 France 39, 57, 171, 174-177, 180, 186, 199 Franco-provençal 130 Frantext 70, 73, 78, 88, 125, 127 French 56, 57, 115, 116, 120, 122-127, 130, 169, 175, 179, 182, 184 French Ministry of Education 173 Frog, Where Are You? 372 Fujitsu Research Institute 302 Galician 130 Gallic War 266, 280 Gallica 85 German 130, 177 Germanic 116, 130 Great Britain 179 Greek 116, 130 Indo-European 116 Institut Gaspard Monge (IGM) 260 Institute of Computational Linguistics of Beijing University 302 International Corpus of English (ICE) 133 Irish 130 Italian 55, 56, 121, 130 Italy 59 Jakarta Indonesian 363 Japan 39 Japanese 323 Japanese Composition Database of
Advanced Learners 391, 392, 394, 398 Korean National corpus 134 Laboratoire d’Automatique Documentaire et Linguistique (LADL) 260 Laboratoire de Français Ancien (LFA) 86-88 Latin 116, 117, 124 Le Monde 245 Leipzig 193 Li Fet des Romains 266, 283 Lille 169, 171, 173, 176-178, 180, 181, 183 London 176 Lyon 205, 210 Maghrebians 176 Magyar 121, 130 Maine 205 Malay 341, 350 Maltese 130 Manx Gailic 130 Marseilles 203 Mayenne 196, 200 Montferrand in Auvergne 210 Multilingual Corpora (Malay) 356 NALC 44, 45, 55, 57-59, 61 Nanai 319 Navajo 148, 155, 157-161 Nederlands 121, 130 Negidal 321 New York 172, 174 Norman 123, 130 North America 179, 195 Northwestern France 195 Norwegian 119, 130 Nouveau Corpus d’Amsterdam 87, 89 Orochi 321 Paris 55, 177, 180, 193, 213 Parley 130
Parme 193 People’s Daily 302 Phonologie du Français Contemporain (PFC) 169, 186 Poitevin 123 Poitou 203 Princeton University 86 Québec 196 Research Institute of Language Application in Beijing 301 Romance languages 116, 117, 121, 123, 130 Rumanian 130 Russia 319 Russian 130 Sardinia 59 Scandinavian 121 Scots 130 Scottish 130 Second World War/World War Two 177, 178 Semitic 116 Serbian 130
Slavic 116, 121 Spanish 148 Swedish 119, 130 Textes de Français Ancien 88 Tokyo University of Foreign Studies (TUFS) 371, 386 Toronto/Torontarian 175 TreeTagger 70 Tungusic languages 319 Turkish 130 Udihe 321 Uilta 321 Ukrainian 130 Ulcha 321 Usage-Based Linguistic Informatics (UBLI) 371 Valence in Dauphiné 210 Vancouver 411-414, 425, 426 Walloon 123, 130 Westvlaamasch 121, 130 Written Malay 354 Wuhan University 302 Xiandai Hanyu Cidian 303, 310, 314
Names AIJMER, K. 132, 135 AKIMOTO, M. 399 ALLARD, R. 125 ALTENBERG, B. 132, 135 ANDERSEN, R.W. 373-375 ARMSTRONG, N. 172, 175-177, 184 ARUGA, C. 399 AUBIN, G. 197 AUGUSTUS, P. 266, 270, 274 AUROUX, S. 127 BAGHBAN, M. 147 BAHRICK, H. 376
BAKER, M. 132 BANITT, M. 192-194 BARDOVI-HARLIG, K. 373-375 BARKHUIZEN, A. 376 BASTIN, Y. 119 BAUVOIS, C. 172, 173, 183 BEAULIEUX, C. 198, 208, 211 BEAULUÈRE, L. 195, 196 BEER, J. 270, 282 BÉNÉTEAU, M. 195, 196 BENVENISTE, E. 275 BERNARD, P. 67
BIBER, D. 148-152, 154 BIEDERMANN-PASQUES, L. 208, 211 BIRD, G. C. 211 BISSEX, P. 147 BLONDHEIM, D.S. 192, 194 BOONS, J.P. 261 BOTTIGLIONI, G. 44, 45, 55, 56 BOUVIER, J. 43 BRANDIN, L. 193, 194 BROUWERS, D.D. 195, 196 CAESAR, J. 266, 268-270, 272, 278, 281-283 CARTON, F. 124, 125, 180, 184 CATACH, N. 198, 208, 211 CHAMBERS, J. K. 169, 172, 175 CHRÉTIEN de TROYES 265, 268 COHEN, M. 202, 203 COLLINS, B. 178, 183 COMRIE, B. 136, 367, 374 CONRAD, S. 148, 150, 152 CREVIER, I. 197, 198 CROWHURST, M. 147 DALBERA, J.P. 49 DALBERA-STEFANAGGI, M.J. 44, 55, 58, 59 DARMESTETER, A. 192-194 DEBOT, K. 375 DEES, A. 217 DELOFFRE, F. 200 DEMAIZIÈRE, C. 198 DOCHERTY, D. 172 DURAND, J. 169 EDMONT, E. 39, 40, 44, 45, 55-57 ELLIS, R. 376, 394, 403 ELOY, J.M. 115, 120, 123, 125, 180, 185 EPISTEMON 213 ERNST, G. 194-196, 199, 200, 211 FELLER, J. 125
FENG, Z. 301 FERGUSON, C.A. 365, 366 FIRMIN-DIDOT, A. 204, 208, 209, 212, 213 FOISIL, M. 199, 200 FOULKES, P. 172 FOX, S. 176 FRANCIS, W.M. 169 FREED, B.F. 375 FREI, H. 262 FUKUMOTO, N. 87 GARDETTE, P. 43 GARDNER, R. C. 375 GAUNA, M. 213 GERNER, H. 89 GILLIÉRON, J. 39, 40, 42, 44, 49, 55-57 GLESSGEN, M.D. 87, 217 GODBERG, H. 195, 196 GODSALL-MYERS, J. 375 GOEBL, H. 47, 118 GOOSKENS, C. 119, 122 GRANGER, S. 132 GROSS, M. 237, 260 GUIFFREY, G. 200 GUILLET, A. 261 HAIMERL, E. 118 HARANO, N. 87 HARRIS, Z.S. 260 HASSELGREN, A. 387 HERMANS, H. 204 HOFLAND, K. 132, 134 HOHENBERG, B. 180 HORVATH, B. 176 HOUSEN, A. 374 HUNSTON, S. 132, 133, 135 HUNT, K. 147 JABERG, K. 39, 56 JAKOBSON, R. 275, 375
JOHANSSON, S. 132, 134 JUD, J. 39, 56 KANDA, Y. 399 KATZENELLENBOGEN, L. 193, 194 KAWAGUCHI, Y. 47, 87, 118 KIWITT, M. 193, 194 KLEIBER, G. 275 KLOSS, H. 117 KUNSTMANN, P. 87, 217 LABOV, W. 172, 179, 182 LAKS, B. 169, 178, 179 LAMBERT, M. 193, 194 LAMBERT,R.D. 375 LANDRY, R. 125 LE BERRE, Y. 125 LE DU, J. 125 LEBEGUE, M. 124 LECLÈRE, C. 261 LEECH, G. 136 LEEKER, J. 268, 269, 273, 274 LEES, L. 180 LEFEBVRE, A. 169, 172-174, 176, 177 LEPAGE, Y. 87 LI, Q. 307 LIU, Y. 303, 310 LOBAN, W. 147 LODGE, R. A. 169, 185, 195, 197, 199, 200 LOMMATZSCH, E. 217 LYCHE, C. 169 MACWHINNEY, B. 378 MARCELLESI, J.B. 121 MARCHELLO-NIZIA, C. 88 MARTEL, P. 43 MARTIN, R. 88, 217 MARTINEAU, F. 88, 195-197 MARTINET, A. 202, 204, 237, 238, 260 MAYER, M. 372 McENERY, A. 132, 133, 135-139
MEES, A. 178, 183 MEYER, C. 148 MEYER, K. 86 MILLET, A. 212 MILROY, J. 183 MILROY, L. 183, 185 MOORCROFT, R. 375 MORIN, E. 123 MORIN, Y.C. 202, 204, 207, 209, 212 MULJACIC, Z. 116, 117 MYERS-SCOTTON, C. 412, 425 NAGATOMO, K. 394 NELDE, P.H. 117 NISHIKAWA, M. 371 NORDBERG, B. 184 OHIFEARNAIN, T. 115 OLLIER, M. L. 87 PAUMIER, S. 261 PAYNE, A. 175 PEDERSEN, I.L. 182 PIAZZOLA, A. 62 PLOUZEAU, M. 87 POLONI, E.S. 119 POOLEY, T. 169, 173, 175-178, 181-184 POP, S. 183 POPLACK, S. 412, 413, 415, 418, 423, 425 RAVIER, C.F. 43 REENEN, P. van 219 REPPEN, R. 147-152 REYNAUD, A. 180 RILEY, W. 174 ROCHE, D. 197 ROSSET, T. 200 SAINT-GÉRAND, J.P. 199, 201 SAKOTA, K. 394 SANKOFF, D. 174 SATO, Y. 399
SCHLEICHER, A. 116 SCHMIDT, J. 117 SÉGUY, J. 43, 47, 57 SERBAT, G. 276 SHIRAI, Y. 373-375, 379 SHUY, R. 174 SILBERZTEIN, M. 261 SIMONI-AUREMBOU, M.R. 40, 123 SIMPSON, R. 148 SINGY, P. 172 SISKIN, H.J. 193, 194 SOUVAY, G. 89 STEIN, A. 87, 89 SUZUKI, S. 87 TANNEN, D. 154 TAO, H. 307 TAURISSON, D. 194, 196 TERRY, B.A. 211 THUROT, C. 202, 204 TIAN, Q. 311
TOBLER, A. 217 TRUDGILL, P. 169, 172, 178 TUAILLON, G. 43 UEDA, M. 371 VASSEUR, G. 125 VENDLER, Z. 374 WANG, J. 301 WELTENS, B. 375 WENKER, G. 39 WILSON, A. 132, 135 WOLF, B. 195, 196 WOLFRAM, W. 174 XIAO, Z. 135-139 YAMADA, A. 399 YOSHITOMI, A. 371, 372, 375, 376 YU, S. 302 ZHOU, Q. 302 ZHU, D. 303, 304, 306, 310 ZOU, S. 311
Index of Subjects Abstand 117 abstract displacement 246, 248, 249, 251-259 accomplishment (verb) 374, 379, 385 accusative marker 422, 423 achievement (verb) 374, 379, 384, 385 activity (verb) 374, 379, 384, 385 adnominal 275, 277, 279-281, 284 advanced learners of Japanese 393, 401 agar 337, 341, 344, 346 age 171, 176-179 agent-oriented 288, 289 agriculture 181, 182 anaphoric 277-279, 284 ancestral language 176, 182 argument 239, 242, 245, 254, 259 argument realization 289, 296 article 418, 419, 421, 423, 425 aspect 373-375, 377, 379, 386 Aspect Hypothesis 373-375, 377, 384-386 association 69, 74 atlas 118, 124 attitude 119, 120, 122, 124 Ausbau 117, 125 background 290, 293, 296 bare active 355 bare passive 355 behawa 341, 343 bi-directional parallel corpus 138, 141 bilingual corpus 132 bilingualism 411, 415, 425 build 116, 117, 120, 121, 123, 125 cartographic 57 cartography 60, 61 causative-inchoative alternation 290
cedilla 197, 205 change (linguistic) 177, 182, 186 code-mixing 362, 363 coding system 92 cognitive-psychological model 376, 377 collateral 115, 120, 121, 123, 127 collocation 67-69, 71, 79-82, 391, 395, 396, 398-401 colloquial norm 183 communication strategy 372, 387 community 172, 177, 180-182 comparatism/comparatist 115, 116 composite life style 181 compound localizer 303, 304, 310 computer-assisted translation (CAT) 135, 140 concordance 87 concrete displacement 239, 244, 245, 247, 249-259 consciousness 118, 119, 122, 126 contact 115, 117, 119, 120 continuous speech 185 continuum 184 contrastive study 135, 138, 141 convergence to supra-local norm 184 conversations 94 cooccurrence 68, 70, 71, 81 corpus of language acquisition 261 corpus/corpora 56, 58, 62, 169, 171, 178, 184, 334, 371-373, 386 corpus–based language studies 131 corpus-based translation 135 critical edition 86 critical threshold 376 cross linguistic and inductive approach 334
cross-linguistic researches 323 database 55, 57, 58, 60, 62, 98, 191, 192, 209, 214 decausativization 289, 290, 294, 295, 299 deictic 275, 277, 279, 280, 284 demonstrative 275-277, 279, 280, 282, 284 demonstrative pronoun 275-277, 280 desocialisation 181, 185 developmental stages 373 diachronic/diachronical 40, 41, 43, 51 diacritic/diacritical 192, 197, 205, 209 dialect atlas 179, 182, 185 dialect geography 169, 178, 180, 185, 186 dialect of English 176 dialect/dialectal 41-43, 46, 51, 172, 176, 180, 181, 183, 184 dialectology/dialectological 39, 51, 118, 123, 124, 127, 169, 176, 179, 182, 183, 185 dialectometry/dialectometrist 118, 119 diatopy/diatopical 43, 48, 51 dictionary 89 diglossia/diglossic 121, 125, 126, 353, 354, 365, 366, 368 discourse-deictic 277, 279, 283, 284 distributional bias 375, 377, 385, 386 divergence 116, 121, 177, 178 durative 374 dynamic 374 dynamic concordance 88 education 172, 173 electronic libraries 86 electronic texts corpora 85 elementary student writing 147 embedded language 412, 413, 418 encoded texts 86, 87
English as a foreign language (EFL) 371, 372, 387 English as a second language (ESL) 371, 372, 375, 377, 387 English language development 147 epilinguistic 116, 126 ethnic-minority norms 176 ethnolinguistic background 176 etymology/etymological 48-51 existential sentence 290 expletive empty element (EXE) 339 fieldwork 334 first language (L1) 372, 373, 375 foreground 288, 290, 293, 296 form-function analysis 373, 376 French 191, 193, 195, 199, 201, 202, 205, 206 French dialects 193, 195, 199, 202, 205, 206 frequency 245, 247, 250, 256, 259, 261 function 373, 376 functional linguistics 260 function-form analysis 375, 376 gender 171, 175, 179, 182, 183 genealogy/genealogical 116, 117, 119 Generation X 177, 183 glossary/glossaire 192-194 glottalisation 179, 183 grammaire des fautes (grammar of errors) 262 grammatical morpheme 373 grammatisation 127 graphic system 191, 192 graphical 123-125, 127 Hawthorne effect (Observer’s Paradox) 184 Hebrew characters 192, 193 historical 116, 119-123 history 116, 117, 119-121, 124, 127
human subject 246, 251 immigration 176, 181 imperfective 374 income 172, 173 incubation period 376-378, 380, 383, 384 initial plateau 376 intercomprehension 119, 122 interpretive map 40, 47 intransitive construction 240, 242 Inverse Relationship Hypothesis 375, 376 involuntary movement 246, 251 Japanese returnees 371 koinè/koinèisation 118, 121-123, 125, 127, 172, 181 la’az/le’azim 193 Labovian approach 170 language 185 language practices 412, 413, 416, 417, 426 learner language 371, 373, 391 learner language corpus 391, 400, 403 lemmatization 222 leveling (dialect) 172, 176, 181, 182 Levenshtein distance 120 lexical aspect 373-375, 377, 382-386 lexical change 49 Lexical Conceptual Structure 292 lexical errors 391, 393, 395 lexical indexes 88 lexical renewal 51 lexicostatistics 119 linguistic atlas 39-44, 47, 51, 55 linguistic change 39 linguistic informatics 260 literary 115, 118, 123-127 literature 115, 122, 125 local language/norm/vernacular 175, 176, 179
locality 172, 182 localizer 301, 302, 304, 305, 310-312, 314-316 log likelihood 70, 71, 79, 80, 82 machine translation (MT) 135, 140 markedness 367 matrix language 412, 416, 418-425 medieval translation 266, 275, 283 memaksa 345-347 memerintahkan 345-347 meminta 345-350 memory 376, 377 mempelawa 345-348 memujuk 345-347, 349, 350 mencadangkan 338-347 mengajak 338, 340-343, 345-350 mengarahkan 343-347, 350 menggesa 345-347, 350 menyuruh 345-347 merayu 345-347 metalinguistic 116 meter 191 minimal-pairs 184 mobility 171, 173, 182, 183, 185 morphological active 355 morphological passive 355 motivation/motivational 40, 48, 49, 51 multi-dimensional analysis 148, 149, 151, 152 multilingual corpus 131, 132 multilingualism 411-413 mute letters 197, 198 narrative 372, 387 natural order of acquisition 373 near/nearness 115, 117-122, 127 network 58-61 network-related variants 182 neural plasticity 376 nexus relation 342, 349, 350
non-human subject 246, 251 norm (linguistic/prestige) 183 N-O-R-M-S 182 Observer’s Paradox 184 Old French 85, 88-90, 193, 217 onomasiology/onomasiological 46, 49-51 oral corpus 92 orthography/orthographic/orthographical 191, 192, 194, 195, 197, 199-205, 208, 209, 213 parallel and comparable corpus 131, 133, 134, 141 parallel corpus 133, 135, 136, 139, 266, 270, 272, 275, 283 part of speech tagging 222 patient-oriented 288, 295 patient-orientedness 287, 289, 294, 296, 299 patois (local dialect) 177 perfective 373, 374 periurban 180-182 permastore 376 phonetic distance 119, 120 phonographic correspondences 191 phonographic principle 191 Picard 115, 117, 123-127, 130, 179, 184, 195, 205, 206 pragmatic 275, 281 pre-industrial society 180 Primacy of Aspect Hypothesis 371, 373 progressive 374, 384 pronominal verb construction 237, 240, 242, 244, 250-255, 257, 259 prosody 92 protocol 92, 193, 214 protocol analysis 372, 387 prototypical 374, 385 proximity 115, 116, 118, 120-123, 127,
130 punctual 374 questionnaire 55, 57-59, 61, 62 reading tasks 94 real time 178, 179 reconstruction 43, 49-51, 191 reform 192, 201, 202, 205, 208, 209 regiolect(al) features 177, 184 regional 175, 181 Regression Hypothesis 375, 385 re-learning 371, 387 representativeness 357 resultative compound verb 287-289, 291-297 resultatives 295 retrospective data 372, 387 reversal of the sociolinguistic pattern 183 Reverse Order Hypothesis 371, 375, 376, 386 rhymes 191 rural 179, 181 schwa 97 script 193, 202, 209 scripta 195 second language (L2) 371-373, 375-377, 386, 387 second language acquisition (SLA) 371, 373, 374, 377, 386, 387 second/foreign language attrition/loss 371, 375-377 semasiology/semasiological 40, 46, 49-51 semi-literate 191, 192, 194 simple localizer 304 social class 172, 173, 176, 177, 179, 182, 183 socialization 182, 185 sociolinguistic 115, 120, 125, 127, 173,
174, 178, 179, 182, 183, 185, 186 sparse data problem 222 Spec of CP 340, 344 specificity 68, 70, 71, 79-82 speech (forms) 177, 178 spelling 191, 192, 194-196, 200, 201, 204, 205, 208, 209, 213, 214 spontaneous speech 183, 184 Stammbaum 116 standard 175, 178, 182-184 standardise/standardisation 115, 116, 118, 122-127 static concordance 87 structural linguistics 260 student language development 148 stylistic variation 171, 184 subject-prominent 288 supaya 337-350 supra-local variety 176, 178, 182-184 teaching materials 398, 403 telic 374, 385 tense 373-375, 377-379 tense-aspect system 377, 382 textile (industry) 181, 183 third language (L3) 372 tiers 98 Time Apparent time 177 TOP 340 topic marker 424, 425 topic-prominent 287, 288 tradition(al) forms/varieties 172, 180-
182, 184 transcription 86, 87, 98 transitive direct construction 242, 245, 250-253, 255, 258, 259 transitive indirect construction 240, 242, 244, 257 translation study 139, 141 TreeTagger 222 type 172, 180, 184 typological studies 319 untuk 337-339, 341-346, 348-350 urban 169, 172, 179-182 variants/variety/features 169, 178, 182, 183 variation 172, 177, 180, 182, 184 variety 178, 183, 184 variety speech 178 verb morphology 373, 376, 378, 380, 381, 385, 386 verb tense-aspect system 371 vernacular 169, 178, 182-184 voice 355 voluntarism/voluntarist 117, 121, 126 voluntary act 246, 249, 259 word list 184, 185 word order 412, 418, 420, 422 word-final consonant devoicing (WFCD) 177, 183 Xaira 226 XML 225, 226 zero complementizer 337, 341, 343
Contributors
Yuji KAWAGUCHI
Faculty of Foreign Studies, Tokyo University of Foreign Studies
Jean-Philippe DALBERA
École Pratique des Hautes Études, Faculty of Humanities, Arts and Social Sciences, University of Nice
Marie-José DALBERA-STEFANAGGI
Faculty of Humanities, University of Corsica
Peter BLUMENTHAL
Faculty of Philosophy, University of Cologne
Pierre KUNSTMANN
Faculty of Arts, University of Ottawa
Chantal LYCHE
Faculty of Humanities, University of Oslo
Jean-Michel ELOY
Faculty of Arts, University of Amiens
Tony McENERY
Faculty of Arts and Social Sciences, Lancaster University
Zhonghua XIAO
Faculty of Arts and Social Sciences, Lancaster University
Randi REPPEN
TESL & Applied Linguistics Faculty, Northern Arizona University
Tim POOLEY
Department of Humanities, Arts and Languages, London Metropolitan University
Yves Charles MORIN
Faculty of Arts and Sciences, University of Montreal
Achim STEIN
Faculty of Humanities, University of Stuttgart
Yoichiro TSURUGA
Faculty of Foreign Studies, Tokyo University of Foreign Studies
Keiko MOCHIZUKI
Faculty of Foreign Studies, Tokyo University of Foreign Studies
Takayuki MIYAKE
Faculty of Foreign Studies, Tokyo University of Foreign Studies
Shinjiro KAZAMA
Faculty of Foreign Studies, Tokyo University of Foreign Studies
Isamu SHOHO
Faculty of Foreign Studies, Tokyo University of Foreign Studies
Hiroki NOMOTO
Ph.D. candidate, Graduate School of Area and Culture Studies, Tokyo University of Foreign Studies
Asako YOSHITOMI
Faculty of Foreign Studies, Tokyo University of Foreign Studies
Ayano SUZUKI
MA, Graduate School of Area and Culture Studies, Tokyo University of Foreign Studies
Tae UMINO
Faculty of Foreign Studies, Tokyo University of Foreign Studies
Tomoko TOKITA
Ph.D. candidate, Graduate School of Area and Culture Studies, Tokyo University of Foreign Studies
In the series Usage-Based Linguistic Informatics the following titles have been published thus far or are scheduled for publication:
6 Kawaguchi, Yuji, Toshihiro Takagaki, Nobuo Tomimori and Yoichiro Tsuruga (eds.): Corpus-Based Perspectives in Linguistics. 2007. vi, 442 pp.
5 Kawaguchi, Yuji, Susumu Zaima and Toshihiro Takagaki (eds.): Spoken Language Corpus and Linguistic Informatics. 2006. vi, 434 pp.
4 Yoshitomi, Asako, Tae Umino and Masashi Negishi (eds.): Readings in Second Language Pedagogy and Second Language Acquisition. In Japanese Context. 2006. vi, 274 pp.
3 Kawaguchi, Yuji, Ivan Fónagy and Tsunekazu Moriguchi (eds.): Prosody and Syntax. Cross-linguistic perspectives. 2006. vi, 384 pp.
2 Takagaki, Toshihiro, Susumu Zaima, Yoichiro Tsuruga, Francisco Moreno Fernández and Yuji Kawaguchi (eds.): Corpus-Based Approaches to Sentence Structures. 2005. vi, 317 pp.
1 Kawaguchi, Yuji, Susumu Zaima, Toshihiro Takagaki, Kohji Shibano and Mayumi Usami (eds.): Linguistic Informatics – State of the Art and the Future. The first international conference on Linguistic Informatics. 2005. viii, 363 pp.